2.8 Softmax
Selecting predictions via probability.
Diagram 2.8.0: The Transformer, Vaswani et al. (2017)
A softmax function is a standard mathematical function that can be applied in many contexts. It takes a set of numbers and converts them into a probability distribution.
In the last stage of the Transformer, the unprocessed predictions (i.e. the raw scores, or logits, output by the linear layer) are passed through a softmax function to produce a probability score for every token in the vocabulary, with the total probability across all predictions adding up to 1.
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Where $z_i$ is one output of the linear layer, and $K$ is the total quantity of outputs of the linear layer (the size of the full known vocabulary of the model).
| Variable value (logit) | Probability | Word |
|---|---|---|
| 0.5 | 0.004 | Sunny |
| 1 | 0.007 | Cloudy |
| 2 | 0.018 | Rainy |
| 0.5 | 0.004 | Misty |
| 6 | 0.968 | Snowy |
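As a worked check of the formula above, a short Python sketch (using NumPy; the logits and words are the ones listed in the table) reproduces the probabilities shown:

```python
import numpy as np

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    # Subtracting the maximum logit before exponentiating keeps the
    # calculation numerically stable without changing the result.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Logits and words from the table above.
words = ["Sunny", "Cloudy", "Rainy", "Misty", "Snowy"]
logits = np.array([0.5, 1.0, 2.0, 0.5, 6.0])

for word, p in zip(words, softmax(logits)):
    print(f"{word}: {p:.3f}")
# Sunny: 0.004, Cloudy: 0.007, Rainy: 0.018, Misty: 0.004, Snowy: 0.968
```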
The probability column gives the probability that the associated token is the next token generated by the LLM.
User interfaces built on top of model utilisation/inference may add a temperature gauge, and adjustments to this temperature gauge by the user determine whether the model's less likely or more likely predictions are selected.[1] It is also possible that an LLM has a fixed temperature, with some element of randomness, so that the model consistently selects probable tokens, but not always the same tokens, to give the impression of creativity.
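Implementations vary, but a common approach (assumed here rather than stated above) is to divide the logits by the temperature before applying the softmax, then sample the next token from the resulting distribution. A minimal sketch, reusing the logits and words from the table:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

def sample_next_token(logits, temperature=1.0):
    """Assumed scheme: divide logits by temperature, softmax, then sample."""
    probs = softmax(np.asarray(logits) / temperature)
    return rng.choice(len(probs), p=probs)

words = ["Sunny", "Cloudy", "Rainy", "Misty", "Snowy"]  # from the table above
logits = [0.5, 1.0, 2.0, 0.5, 6.0]

for t in (0.5, 1.0, 2.0):
    picks = [words[sample_next_token(logits, t)] for _ in range(5)]
    print(f"temperature={t}: {picks}")
# Lower temperatures concentrate choices on "Snowy"; higher temperatures
# spread probability to the less likely words, so the picks vary more.
```

Dividing by a temperature below 1 sharpens the distribution towards the most probable token, while a temperature above 1 flattens it, which is why the gauge trades consistency for apparent creativity.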