2.2 Positional encoding

Incorporating position-related data into vector embeddings.


Diagram 2.2.0: The Transformer, Vaswani et al. (2017)

As a Transformer processes the aforementioned tokens in parallel, a method is needed to embed the positional information of the tokens into their vector embeddings. The original Vaswani et al. (2017) paper proposes using sine and cosine functions to represent positions, with sine used for the even dimensions of the encoding and cosine for the odd dimensions.


Diagram 2.2.1: Generating the positional encoding for a token, and adding it to the vector embedding of the token.

Theory

Specifically, positional encodings are generated via the following functions:

P(k, 2i) = \sin\left(\frac{k}{n^{2i/d}}\right)
P(k, 2i+1) = \cos\left(\frac{k}{n^{2i/d}}\right)

k: the position of the object within the input sequence
d: the dimension of the positional encoding, set to the same value as the dimension of the vector embeddings to be used
n: a constant called the scalar, set to 10,000 in Vaswani et al. (2017)
i: the index of the (sine, cosine) pair within the positional encoding vector output for a token, whereby each pair of adjacent even and odd dimensions shares the same i, such that 0 \le i < \frac{d}{2}
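As a concrete reference, the formulas above can be written as a short function. The following is a minimal sketch, assuming NumPy; the function name is chosen here for illustration:

```python
import numpy as np

def positional_encoding(k, d, n=10000):
    """Return the d-dimensional sinusoidal encoding of position k.

    Even dimensions (2i) hold sin(k / n^(2i/d)) and odd dimensions (2i+1)
    hold cos(k / n^(2i/d)), following the formulas above.
    """
    i = np.arange(d // 2)
    angle = k / n ** (2 * i / d)
    encoding = np.empty(d)
    encoding[0::2] = np.sin(angle)   # even dimensions: sine
    encoding[1::2] = np.cos(angle)   # odd dimensions: cosine
    return encoding

# With d = 16 and n = 100, position k = 1 reproduces the row for "the"
# in the worked example further below.
print(np.round(positional_encoding(k=1, d=16, n=100), 2))
```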

Positions are encoded via sinusoidal functions because:

  • the functions allow the encoding for a position offset by a fixed amount from k to be derived as a linear transformation of the encoding for position k (a numerical check of this is sketched after this list)
  • the functions of alternating wavelengths map relatively closely to binary numbers (see Diagrams 2.2.2 and 2.2.3), therefore allowing uniqueness (reductionistically, binary numbers are the simplest unique representation)
  • the functions are defined for any position, allowing them to be adapted to any input sequence length
  • the functions keep their output within a bounded range, [-1, 1], which avoids introducing issues with gradient descent during optimisation via the loss function
  • the functions can be quickly computed, in a deterministic way, allowing efficiency and dependability
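The first of these properties can be checked numerically. The sketch below uses arbitrary illustrative values of k and the offset; each (sine, cosine) pair of the encoding for position k is rotated by an angle that depends only on the offset, recovering the encoding for position k + offset:

```python
import numpy as np

# Numerical check: the encoding of position k + offset is a fixed 2x2 rotation
# of the encoding of position k, one rotation per (sine, cosine) pair, with an
# angle that depends only on the offset and the pair's frequency, never on k.
d, n = 16, 10000
omega = 1.0 / n ** (2 * np.arange(d // 2) / d)    # one frequency per pair

def encode_pairs(k):
    return np.stack([np.sin(k * omega), np.cos(k * omega)], axis=1)  # (d/2, 2)

k, offset = 37, 5                                  # arbitrary example values
shifted = np.empty((d // 2, 2))
for idx, phi in enumerate(offset * omega):
    rotation = np.array([[ np.cos(phi), np.sin(phi)],
                         [-np.sin(phi), np.cos(phi)]])
    shifted[idx] = rotation @ encode_pairs(k)[idx]

print(np.allclose(shifted, encode_pairs(k + offset)))   # True
```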

Application of theory

So, for example, generating 16-dimensional positional encodings for the input sequence “classify the book”, with n set to 100, may look like the following:

i                    0      0      1      1      2      2      3      3      4      4      5      5      6      6      7      7
classify (k = 0)     0      1      0      1      0      1      0      1      0      1      0      1      0      1      0      1
the      (k = 1)     0.84   0.54   0.53   0.85   0.31   0.95   0.18   0.98   0.10   1.00   0.06   1.00   0.03   1.00   0.02   1.00
book     (k = 2)     0.91  -0.42   0.90   0.43   0.59   0.81   0.35   0.94   0.20   0.98   0.11   0.99   0.06   1.00   0.04   1.00
vector embedding of “book” (illustrative values):

 0.02   0.15   0.31   0.08  -0.26   0.21  -0.58   1.12  -0.38  -0.91   0.52   0.87  -0.17   0.73  -0.38   0.18

+

positional encoding of “book” (k = 2):

 0.91  -0.42   0.90   0.43   0.59   0.81   0.35   0.94   0.20   0.98   0.11   0.99   0.06   1.00   0.04   1.00

=

 0.93  -0.27   1.21   0.51   0.33   1.02  -0.23   2.06  -0.18   0.07   0.63   1.86  -0.11   1.73  -0.34   1.18

This final result, above, is then input into the encoder of the Transformer.
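The table and the addition above can be reproduced in a few lines. This is a sketch assuming NumPy; the 16-dimensional vector used for “book” is the illustrative embedding from this example, not a real learned embedding:

```python
import numpy as np

# Reproduce the worked example above: d = 16, n = 100.
def positional_encoding(k, d=16, n=100):
    i = np.arange(d // 2)
    angle = k / n ** (2 * i / d)
    out = np.empty(d)
    out[0::2], out[1::2] = np.sin(angle), np.cos(angle)
    return out

for k, token in enumerate(["classify", "the", "book"]):
    print(f"{token:8}  k = {k}:", np.round(positional_encoding(k), 2))

# Add the illustrative vector embedding of "book" (example values only, not a
# learned embedding) to its positional encoding to obtain the encoder input.
book_embedding = np.array([0.02, 0.15, 0.31, 0.08, -0.26, 0.21, -0.58, 1.12,
                           -0.38, -0.91, 0.52, 0.87, -0.17, 0.73, -0.38, 0.18])
print(np.round(book_embedding + positional_encoding(2), 2))
```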

For example, to deduce the position of the token “book” in the above example, knowing that the value 0.43 sits at the cosine dimension with i = 1:

k = n^{2i/d} \cos^{-1}\big(P(k, 2i+1)\big) = 100^{2 \cdot 1/16} \cos^{-1}(0.43) \approx 2
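The same inversion can be checked numerically; a minimal sketch, reading the value 0.43 from the table above:

```python
import numpy as np

# Recover the position of "book" from the cosine value at dimension 3
# (i = 1, an odd/cosine dimension). The arccos inversion is valid here
# because k / n^(2i/d) lies within [0, pi].
n, d, i = 100, 16, 1
value = 0.43                       # P(k, 2i+1) read from the table above
k = n ** (2 * i / d) * np.arccos(value)
print(k)                           # ~2.0, i.e. "book" is at position k = 2
```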

However, some modern LLMs (including Llama 4[1]) do not use absolute positional encoding (i.e. an encoding that indicates the exact position within the sequence) as above, and instead encode only relative positions, for example via rotary position embeddings[2]. Regardless, the notion of acknowledging position remains.
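As a rough illustration of the relative-position idea behind [2] (a minimal sketch of rotary position embedding, not the exact implementation used by Llama 4 or RoFormer), each pair of query/key dimensions is rotated by a position-dependent angle, so the dot product between a query and a key depends only on their relative offset:

```python
import numpy as np

def rope_rotate(x, k, n=10000):
    """Rotate each (even, odd) pair of dimensions of x by a k-dependent angle.

    A minimal sketch of the rotary position embedding idea from [2]: rather
    than adding a positional vector, queries and keys are rotated so that
    their dot product depends only on the offset between their positions.
    """
    d = x.shape[-1]
    theta = 1.0 / n ** (2 * np.arange(d // 2) / d)   # same frequency schedule
    cos, sin = np.cos(k * theta), np.sin(k * theta)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out

# Illustrative vectors: the dot product for positions (5, 9) matches the one
# for positions (0, 4), because only the relative offset of 4 matters.
rng = np.random.default_rng(0)
query, key = rng.normal(size=16), rng.normal(size=16)
print(np.dot(rope_rotate(query, 5), rope_rotate(key, 9)))
print(np.dot(rope_rotate(query, 0), rope_rotate(key, 4)))
```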


150 in binary, graphically depicted
Diagram 2.2.2: sourced from HuggingFace, where the x-axis shows the decimal number being represented, from 0 to 150, and each band of the y-axis represents one bit of the number’s 8-bit binary representation. The orange dot indicates that the 1s and 0s on the left-hand side relate to the number displayed below them; 150 = 8’b10010110.

150 as a positional encoding, graphically depicted
Diagram 2.2.3: sourced from HuggingFace, depicts the positional encoding where k = 150, n = 10000, d = 8, with each pair of rows representing a different value of i, starting from i = 0 at the top.
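The values behind these two diagrams can be reproduced approximately as follows (a sketch; the HuggingFace figures may lay the rows out slightly differently):

```python
import numpy as np

# The d = 8 sinusoidal encoding of position k = 150 with n = 10000
# (Diagram 2.2.3), alongside the 8-bit binary pattern of 150 (Diagram 2.2.2).
k, n, d = 150, 10000, 8
i = np.arange(d // 2)
angle = k / n ** (2 * i / d)
encoding = np.stack([np.sin(angle), np.cos(angle)], axis=1).reshape(-1)
print(np.round(encoding, 2))
print(format(k, "08b"))     # 10010110
```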

Practice questions

1. Consider that you are debugging a Vaswani-style Transformer, such as for research or commercial purposes. Given the following conditions of the Transformer under test:

  • tokens are encoded into 5120-dimensional vectors

  • the positional encodings are sinusoidal and absolute, with the scalar constant n set to 10,000

  • the value b544 is at dimension position 544 within a token’s positional encoding

What position k in the input sequence (i.e. the original sentence input into the Transformer) was b544 generated for?

Where b544 = 0.488814

Answer

k = 7

In the tradition of print-based Mathematics, the workings have been omitted, to encourage you to reverse engineer any incorrect answer.

References

[1] Meta blog: Introducing Llama 4
[2] RoFormer: Enhanced Transformer with Rotary Position Embedding