2.8 Softmax

Selecting predictions via probability.

Diagram 2.8.0: The Transformer, Vaswani et al. (2017)

A softmax function is a standard mathematical function used in many contexts: it takes a set of numbers and converts them into a probability distribution.

In the final stage of the Transformer, the raw predictions (the logits output by the linear layer) are passed through a softmax function to assign each token in the vocabulary a probability score, with the probabilities across all tokens summing to 1.

$$P(l_i) = \frac{e^{l_i}}{\sum_{j=1}^{k} e^{l_j}}$$

where $l_i$ is one output of the linear layer, and $k$ is the total number of outputs of the linear layer (the size of the model's full known vocabulary).
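As an illustrative sketch (the function below is a direct translation of the formula, not code from the source), softmax can be implemented in a few lines of Python. It subtracts the maximum logit before exponentiating, a standard trick that avoids overflow without changing the result:

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    # Subtracting the maximum logit keeps exp() from overflowing;
    # the shift cancels between the numerator and denominator.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```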

| Probability variable | Logit value | Probability | Word   |
|----------------------|-------------|-------------|--------|
| $p_1$                | 0.5         | 0.004       | Sunny  |
| $p_2$                | 1           | 0.007       | Cloudy |
| $p_3$                | 2           | 0.018       | Rainy  |
| $p_4$                | 0.5         | 0.004       | Misty  |
| $p_5$                | 6           | 0.968       | Snowy  |
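Applying the sketch above to the table's logits reproduces its probability column (values rounded to three decimal places):

```python
logits = [0.5, 1, 2, 0.5, 6]   # Sunny, Cloudy, Rainy, Misty, Snowy
probs = softmax(logits)
print([round(p, 3) for p in probs])
# [0.004, 0.007, 0.018, 0.004, 0.968] -- the full values sum to 1
```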

The probability column gives the probability that the associated word is the next token generated by the LLM.

User interfaces built on top of model inference may expose a temperature control; adjusting it determines whether the model's less likely or more likely predictions tend to be selected.[1] An LLM may also run with a fixed temperature and some element of randomness, so that it consistently selects probable tokens, but not always the same ones, giving an impression of creativity.
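Temperature is commonly applied by dividing each logit by a scalar $T$ before the softmax: $T > 1$ flattens the distribution so less likely tokens are sampled more often, while $T < 1$ sharpens it towards the most likely token. A minimal sketch, reusing the softmax function above (the sampling approach shown is one common choice, not the only one):

```python
import random

def sample_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then sample one token index
    from the resulting softmax distribution."""
    scaled = [l / temperature for l in logits]
    probs = softmax(scaled)
    # random.choices draws one index, weighted by the probabilities
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# With temperature=2 on the table's logits, "Snowy" drops from ~0.97
# to ~0.74 probability, so the other words appear noticeably more often.
```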

References

[1] Large Language Models: A Deep Dive, section 8.6.4.