2.7 Linear
Converting contextual abstractions to vocabulary.
Diagram 2.7.0: The Transformer, Vaswani et al. (2017)
The decoder will output a set of numerical vectors (one per token, covering both the input sequence and the tokens generated so far), each of a prespecified, computationally efficient dimension. The linear layer will take one of these vectors as input and output a new vector, the size of which is the model's total known vocabulary (the Llama-4 default being 202,048,[1] larger than a recent paperback dictionary at around 120,000 words).
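To make the shapes concrete, here is a minimal NumPy sketch of that projection. The sizes and weights are illustrative assumptions chosen so the snippet runs quickly, not Llama-4's actual dimensions:

```python
import numpy as np

# Toy sizes for illustration; for Llama-4 the vocabulary would be 202,048
# and the decoder output dimension far larger.
d_model = 8        # assumed decoder output dimension
vocab_size = 16    # stand-in for the model's total known vocabulary

decoder_output = np.random.randn(d_model)    # one such vector per token position
W = np.random.randn(vocab_size, d_model)     # the linear layer's trainable weights

logits = W @ decoder_output                  # one score per vocabulary token
print(logits.shape)                          # (16,)
```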
Diagram 2.7.1: a vector (x1) of dimension 4, representing a processed token, being run through an SLP (single-layer perceptron) of 5 neurons, corresponding to a model with a total known vocabulary of 5 words. Note that each neuron is fed a different set of trainable weights.
| Output variable | Variable value | Word |
|---|---|---|
| l1 | 0.5 | Sunny |
| l2 | 1 | Cloudy |
| l3 | 2 | Rainy |
| l4 | 0.5 | Misty |
| l5 | 6 | Snowy |
The above table relates to Diagram 2.7.1. For example, suppose the input sequence was “today’s weather?”, and the LLM had learnt to make weather-related predictions, for instance from chat conversations and past weather data. Vector x1 could have been generated by the Transformer from the token “weather?”, with the linear layer then assigning scores as to which upcoming tokens are most likely. Once the Transformer has been trained, there is an injective (one-to-one) mapping between each output variable li and a token that the model is able to generate.
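The sketch below hard-codes the five-word vocabulary and the scores from the table to show that mapping in action; the step of picking the highest-scoring position is included purely for illustration:

```python
# The five-word vocabulary and scores from the table above; the index of
# each score always refers to the same word (the one-to-one mapping).
vocab = ["Sunny", "Cloudy", "Rainy", "Misty", "Snowy"]
logits = [0.5, 1.0, 2.0, 0.5, 6.0]

# Picking the highest-scoring position identifies the most likely next token
best = max(range(len(logits)), key=lambda i: logits[i])
print(vocab[best])  # Snowy
```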
The name linear is due to this stage consisting of the fully-connected linear layer of an SLP (no activation function). An SLP can be considered a neural network with no hidden layers and no activation functions. As with all weights, these weights are trained when the Transformer is trained in its entirety. The vectors output by the SLP will each consist of a set of floats, one per token in the vocabulary.
Due to the simplicity of an SLP, the linear layer can alternatively be thought of as a single matrix of trainable weights; again well suited to SIMD processors (GPUs).
Diagram 2.7.2: an equivalent representation of the SLP in Diagram 2.7.1.
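The sketch below (with illustrative numbers) makes that equivalence concrete: five neurons, each with its own weight vector and no activation function, produce the same output as a single 5×4 weight matrix applied to x1.

```python
import numpy as np

# Five "neurons", each with its own 4-dimensional weight vector and no
# activation function (Diagram 2.7.1), compared with a single 5x4 weight
# matrix (Diagram 2.7.2). All numbers are illustrative.
x1 = np.array([0.3, -1.2, 0.7, 0.05])                     # processed token vector
neuron_weights = [np.random.randn(4) for _ in range(5)]   # one weight set per neuron

slp_out = np.array([w @ x1 for w in neuron_weights])      # neuron-by-neuron view
W = np.stack(neuron_weights)                              # same weights as one matrix
matrix_out = W @ x1

print(np.allclose(slp_out, matrix_out))  # True
```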