2.7 Linear
Converting contextual abstractions to vocabulary.
Diagram 2.7.0: The Transformer, Vaswani et al. (2017)
The decoder will output a set of numerical vectors (one per token, covering both the input sequence and the tokens generated so far), each of a prespecified, computationally efficient dimension. The linear layer will take one of these vectors as input and output a new vector, the size of which is the model's total known vocabulary (the Llama-4 default being 202,048,[1] larger than a recent paperback dictionary at around 120,000 words).
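To make the shapes concrete, here is a minimal NumPy sketch of that projection. The sizes and weights are illustrative assumptions chosen so the snippet runs quickly, not Llama-4's actual dimensions:

```python
import numpy as np

# Toy sizes for illustration; for Llama-4 the vocabulary would be 202,048
# and the decoder output dimension far larger.
d_model = 8        # assumed decoder output dimension
vocab_size = 16    # stand-in for the model's total known vocabulary

decoder_output = np.random.randn(d_model)    # one such vector per token position
W = np.random.randn(vocab_size, d_model)     # the linear layer's trainable weights

logits = W @ decoder_output                  # one score per vocabulary token
print(logits.shape)                          # (16,)
```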
Diagram 2.7.1: a vector (x1) of dimension 4, representing a processed token, being run through an SLP (single-layer perceptron) of 5 neurons, corresponding to a model with a total known vocabulary of 5 words. Note that each neuron is fed a different set of trainable weights.
| Output variable | Variable value | Word |
|---|---|---|
| l1 | 0.5 | Sunny |
| l2 | 1 | Cloudy |
| l3 | 2 | Rainy |
| l4 | 0.5 | Misty |
| l5 | 6 | Snowy |
The above table relates to Diagram 2.7.1. For example, suppose the input sequence was “today’s weather?”, and the LLM had learnt to make weather-related predictions, for instance from chat conversations and past weather data. Vector x1 could have been generated by the Transformer from the token “weather?”, with the linear layer then assigning scores as to which upcoming tokens are most likely. Once the Transformer has been trained, there is an injective (one-to-one) mapping between each output variable li and a token that the model is able to generate.
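The sketch below hard-codes the five-word vocabulary and the scores from the table to show that mapping in action; the step of picking the highest-scoring position is included purely for illustration:

```python
# The five-word vocabulary and scores from the table above; the index of
# each score always refers to the same word (the one-to-one mapping).
vocab = ["Sunny", "Cloudy", "Rainy", "Misty", "Snowy"]
logits = [0.5, 1.0, 2.0, 0.5, 6.0]

# Picking the highest-scoring position identifies the most likely next token
best = max(range(len(logits)), key=lambda i: logits[i])
print(vocab[best])  # Snowy
```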
The name linear is due to this stage consisting of the fully-connected linear layer of an SLP (no activation function). An SLP can be considered a neural network with no hidden layers and no activation functions. As with all weights, these weights are trained when the Transformer is trained in its entirety. The vectors output by the SLP will each consist of a set of floats, one per token in the vocabulary.
Due to the simplicity of an SLP, the linear layer can alternatively be thought of as a single matrix of trainable weights; again well suited to SIMD processors (GPUs).
Diagram 2.7.2: an equivalent representation of the SLP in Diagram 2.7.1.
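The sketch below (with illustrative numbers) makes that equivalence concrete: five neurons, each with its own weight vector and no activation function, produce the same output as a single 5×4 weight matrix applied to x1.

```python
import numpy as np

# Five "neurons", each with its own 4-dimensional weight vector and no
# activation function (Diagram 2.7.1), compared with a single 5x4 weight
# matrix (Diagram 2.7.2). All numbers are illustrative.
x1 = np.array([0.3, -1.2, 0.7, 0.05])                     # processed token vector
neuron_weights = [np.random.randn(4) for _ in range(5)]   # one weight set per neuron

slp_out = np.array([w @ x1 for w in neuron_weights])      # neuron-by-neuron view
W = np.stack(neuron_weights)                              # same weights as one matrix
matrix_out = W @ x1

print(np.allclose(slp_out, matrix_out))  # True
```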