2.8 Softmax

Selecting predictions via probability.

Diagram 2.8.0: The Transformer, Vaswani et al. (2017)

A softmax function is a standard mathematical function used in many contexts: it takes a set of numbers and converts them into a probability distribution.

In the final stage of the Transformer, the raw predictions (the logits output by the linear layer) are passed through a softmax function to assign each token in the vocabulary a probability score, with the probabilities across all tokens summing to 1.

$$P(l_i) = \frac{e^{l_i}}{\sum_{j=1}^{k} e^{l_j}}$$

where $l_i$ is one output of the linear layer, and $k$ is the total number of outputs of the linear layer (the size of the model's full known vocabulary).
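As an illustrative sketch (the function below is a direct translation of the formula, not code from the source), softmax can be implemented in a few lines of Python. It subtracts the maximum logit before exponentiating, a standard trick that avoids overflow without changing the result:

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    # Subtracting the maximum logit keeps exp() from overflowing;
    # the shift cancels between the numerator and denominator.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```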

| Probability variable | Logit value | Probability | Word   |
|----------------------|-------------|-------------|--------|
| $p_1$                | 0.5         | 0.004       | Sunny  |
| $p_2$                | 1           | 0.007       | Cloudy |
| $p_3$                | 2           | 0.018       | Rainy  |
| $p_4$                | 0.5         | 0.004       | Misty  |
| $p_5$                | 6           | 0.968       | Snowy  |
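Applying the sketch above to the table's logits reproduces its probability column (values rounded to three decimal places):

```python
logits = [0.5, 1, 2, 0.5, 6]   # Sunny, Cloudy, Rainy, Misty, Snowy
probs = softmax(logits)
print([round(p, 3) for p in probs])
# [0.004, 0.007, 0.018, 0.004, 0.968] -- the full values sum to 1
```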

The probability column gives the probability that the associated word is the next token generated by the LLM.

User interfaces built on top of model inference may expose a temperature control; adjusting it determines whether the model's less likely or more likely predictions tend to be selected.[1] An LLM may also run with a fixed temperature and some element of randomness, so that it consistently selects probable tokens, but not always the same ones, giving an impression of creativity.
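Temperature is commonly applied by dividing each logit by a scalar $T$ before the softmax: $T > 1$ flattens the distribution so less likely tokens are sampled more often, while $T < 1$ sharpens it towards the most likely token. A minimal sketch, reusing the softmax function above (the sampling approach shown is one common choice, not the only one):

```python
import random

def sample_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then sample one token index
    from the resulting softmax distribution."""
    scaled = [l / temperature for l in logits]
    probs = softmax(scaled)
    # random.choices draws one index, weighted by the probabilities
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# With temperature=2 on the table's logits, "Snowy" drops from ~0.97
# to ~0.74 probability, so the other words appear noticeably more often.
```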

References

[1] Large Language Models: A Deep Dive, section 8.6.4.