2.1 Input embeddings

Converting natural language to numerical representations.

Diagram 2.1.0: The Transformer, Vaswani et al. (2017)

The inputs to the encoder are first split into tokens. A token may consist of a whole word, or a portion of a word.
Each token is then mapped to a multidimensional vector (a set of numbers in a format ready for matrix operations). An example of such a representation is a word embedding. A word embedding represents tokens in a dense way, such that similar tokens show high cosine similarity (the dot product of two vectors after each has been scaled to unit length) when compared. This dense representation also allows models to cover a large vocabulary of different tokens compactly.

Diagram 2.1.1: A depiction of a sequence being converted to tokens, and then one token being converted to an 8-dimensional vector.
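
To make this step concrete, here is a minimal Python sketch, assuming a whitespace tokeniser and a randomly initialised 8-dimensional embedding table (both are stand-ins: real systems use learned subword tokenisers and trained embedding tables):

```python
# A minimal sketch of the token -> vector step. The whitespace tokeniser and the
# random 8-dimensional vectors are placeholders for a real subword tokeniser and
# a learned embedding table.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
embedding_table = {word: rng.normal(size=8) for word in vocab}  # one vector per token

def tokenise(text):
    # stand-in for a real subword tokeniser (e.g. BPE)
    return text.lower().split()

def cosine_similarity(a, b):
    # dot product of the two vectors after each is normalised to unit length
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

tokens = tokenise("The cat sat on the mat")
vectors = [embedding_table[t] for t in tokens]  # the sequence as 8-dimensional vectors
print(cosine_similarity(embedding_table["cat"], embedding_table["mat"]))
```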

Word embeddings may be developed via the following approaches (a minimal sketch of the first appears after the list):

  • neural networks and large quantities of unlabeled training data, by optimising a loss function based on tokens that are expected to be close together[1]
  • global co-occurrence statistics over a corpus[2], sometimes combined with subword information
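
As a rough illustration of the first approach, the sketch below trains toy 8-dimensional embeddings with a skip-gram-style objective in PyTorch. The corpus, window size, and hyperparameters are invented, and a full softmax over the tiny vocabulary stands in for the negative-sampling trick used at scale:

```python
# A toy skip-gram-style trainer: centre-word embeddings are optimised so that
# true (centre, context) pairs score highly against the rest of the vocabulary.
import torch
import torch.nn as nn

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# (centre, context) index pairs within a window of one word either side
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1)
         if 0 <= j < len(corpus)]

dim = 8
centre_emb = nn.Embedding(len(vocab), dim)    # the embeddings being learned
context_emb = nn.Embedding(len(vocab), dim)   # separate "output" embeddings
opt = torch.optim.Adam(
    list(centre_emb.parameters()) + list(context_emb.parameters()), lr=0.05)

centres = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])

for step in range(200):
    # score every vocabulary word against each centre word, then push the
    # score of the true context word up via cross-entropy
    scores = centre_emb(centres) @ context_emb.weight.T   # (num_pairs, vocab_size)
    loss = nn.functional.cross_entropy(scores, contexts)
    opt.zero_grad()
    loss.backward()
    opt.step()
```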

Modern word embeddings are typically of very high dimension; for example, Meta’s LLM Llama-4 uses 5120 dimensions for its larger models[3]. Once the dictionary of vector embeddings has been finalised, mapping a token to its embedding is a one-to-one lookup.
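
In practice this lookup is a single row read from a (vocab_size × d_model) matrix. The sketch below reuses the 5120-dimensional figure cited above, while the vocabulary size and token ids are purely illustrative:

```python
# A sketch of the one-to-one lookup: each token id selects one row of a fixed
# (vocab_size, d_model) embedding matrix. The vocabulary size and token ids
# below are illustrative, not taken from any real model.
import torch
import torch.nn as nn

vocab_size, d_model = 1_000, 5120
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[17, 891, 3]])   # hypothetical ids for a 3-token sequence
vectors = embedding(token_ids)             # shape: (1, 3, 5120)
print(vectors.shape)
```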

A theoretical example of a vector embedding could be a 6-dimensional vector, where each dimension measures how strongly the token/word belongs to one of the following groups (a toy version is sketched in code after the list):

  • noun
  • verb
  • adjective
  • adverb
  • preposition
  • connective
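
A toy version of this interpretable embedding might look like the following; the membership values are invented purely for illustration, and learned embeddings do not come with labelled dimensions like these:

```python
# Hypothetical 6-dimensional embeddings where each dimension is a labelled
# part-of-speech "membership" score, as described above.
dimensions = ["noun", "verb", "adjective", "adverb", "preposition", "connective"]

toy_embeddings = {
    "run":     [0.3, 0.9, 0.0, 0.0, 0.0, 0.0],   # mostly a verb, sometimes a noun
    "quickly": [0.0, 0.0, 0.1, 0.9, 0.0, 0.0],
    "under":   [0.0, 0.0, 0.0, 0.0, 0.9, 0.1],
}

for word, vector in toy_embeddings.items():
    strongest = dimensions[vector.index(max(vector))]
    print(f"{word}: strongest membership = {strongest}")
```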

Realistically, if the embeddings are generated via a neural network, each dimension could represent the degree of membership in some grouping of tokens that share a connection, but that connection may never have been formally defined by humans. Furthermore, multiple dimensions corresponding to these groupings could interrelate.

Representing data as vectors is a technique that applies to a wide variety of contexts, and so is a large domain in its own right even outside of LLMs. For example, vector embeddings may be utilised to find similarities between documents, for classification or information retrieval purposes.
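
As a sketch of the document-similarity use case, one simple (if crude) scheme is to represent each document as the average of its word vectors and rank documents by cosine similarity to a query vector. The vocabulary and random vectors below are placeholders for a real embedding table:

```python
# Rank toy "documents" against a query by averaging word vectors and comparing
# with cosine similarity. Random vectors stand in for trained embeddings.
import numpy as np

rng = np.random.default_rng(1)
table = {w: rng.normal(size=8) for w in
         ["cats", "dogs", "pets", "stocks", "markets", "prices"]}

def doc_vector(words):
    # a document vector is the mean of its (known) word vectors
    return np.mean([table[w] for w in words if w in table], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = {"pets": ["cats", "dogs", "pets"], "finance": ["stocks", "markets", "prices"]}
query = doc_vector(["dogs", "pets"])
ranking = sorted(docs, key=lambda name: cosine(query, doc_vector(docs[name])), reverse=True)
print(ranking)   # the document sharing words with the query should rank first
```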

Practice questions

1. Suppose you are classifying a large selection of random articles from Wikipedia, and assigning each one a set of values within an n-dimensional vector. What are the most important attributes to capture in each of the n dimensions? Note that this is an open-ended question.

Answer
It may be a good idea to first find out what classifications of Wikipedia articles already exist. From there, it would be valuable to consider which function or algorithm will be used to measure the similarity between two vectors, and to structure the data accordingly.

References

[1] Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space.
[2] Pennington, J., Socher, R. & Manning, C. (2014). GloVe: Global Vectors for Word Representation.
[3] Meta. Llama-4 model documentation.