2. The Transformer

Diagram 2.0: The Transformer, Vaswani et al. (2017)

This chapter walks through the layers of the Transformer from the bottom up. It starts with the lowest layer shown in Diagram 2.0 above, explains how each layer works in turn, and closes with an overview of the whole architecture.

Glossary

Large Language Model - a model typically built around a heavily scaled-up Transformer

Transformer - a model architecture defined by a particular arrangement of functions and parameters, designed for text processing (both generation and comprehension)

Model - the architecture of a trainable system, comprising functions and trainable parameters; the word is not used here to mean an architecture packaged with trained parameter values (that is described as a checkpoint)

Weight - a parameter within a model, in which a numerical value is stored and adjusted during training

Training - the initial stage of a model's lifetime, in which the model's parameters are assigned numerical values and then adjusted as data is run through the model, so that the model's output moves closer to a target output (a minimal sketch follows the Backpropagation entry below)

Loss function - a means of measuring the difference between the output of a model and the target output

Backpropagation - computing, via the chain rule, how a small change to each parameter's value affects the output of the loss function, working backwards from the loss through the model
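
To make the Training, Loss function, and Backpropagation entries concrete, here is a minimal sketch in plain Python: a one-weight model y = w * x trained by gradient descent. The data, learning rate, and epoch count are illustrative assumptions, not taken from this chapter.

```python
# Minimal sketch: train a one-weight model y = w * x by gradient descent.
# The data, learning rate, and epoch count are illustrative assumptions.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs; target is 2 * input
w = 0.5    # the single trainable weight, starting at an arbitrary value
lr = 0.05  # learning rate

for epoch in range(20):
    loss = 0.0
    grad = 0.0
    for x, target in data:
        y = w * x                      # forward pass: the model's output
        loss += (y - target) ** 2      # loss function: squared error against the target
        grad += 2 * (y - target) * x   # backpropagation: d(loss)/dw via the chain rule
    w -= lr * grad                     # training step: nudge the weight against the gradient

print(w)  # approaches 2.0, the value that maps each input onto its target
```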

Layer - a part of a Transformer, e.g. the multi-head attention block; alternatively, a set of neurons within a neural network that all implement the same function

SLP - Single-Layer Perceptron, which can be represented as a very limited neural network, or simply as a matrix multiplication
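
As a sketch of the SLP entry, the "network" below is nothing more than one matrix multiplication plus a bias; the dimensions (4 inputs, 3 outputs) are illustrative assumptions.

```python
import numpy as np

# Sketch: a single-layer perceptron is a matrix multiplication plus a bias.
# The dimensions (4 inputs, 3 outputs) are illustrative assumptions.

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # trainable weights: 4 inputs -> 3 outputs
b = np.zeros(3)               # trainable bias

x = np.array([1.0, 0.5, -0.2, 0.7])  # one input vector
y = x @ W + b                        # the entire "network" is one matrix multiply
print(y.shape)                       # (3,)
```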

SIMD processor - Single Instruction, Multiple Data: a processor that performs the same operation, such as addition or multiplication, on all the data it holds at once
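
A minimal illustration of the SIMD idea, using NumPy as a stand-in: a single "+" operation is dispatched across a whole array of data at once, rather than element by element. The array contents are arbitrary.

```python
import numpy as np

# Sketch of the SIMD idea: one instruction applied to many data elements at once.
# NumPy dispatches the single "+" across the whole array instead of looping.
a = np.arange(8, dtype=np.float32)
b = np.ones(8, dtype=np.float32)
print(a + b)  # one addition operation, eight results
```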

GPU - a common type of SIMD processor, originally designed for graphics processing but equally useful for matrix operations in general

Tile - an m × n matrix of data, the smallest quantity of data that a GPU operates on at once
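
To illustrate the Tile entry, here is a minimal NumPy sketch of tiled matrix multiplication: a large multiply decomposed into small tiles processed one at a time, mirroring how GPU hardware schedules matrix work. The 64 × 64 matrix size and 16 × 16 tile size are illustrative assumptions.

```python
import numpy as np

# Sketch: tiled matrix multiplication. A large matmul is decomposed into
# small m x n tiles, the unit of work a GPU schedules onto its hardware.
# The matrix size (64 x 64) and tile size (16 x 16) are illustrative assumptions.

N, T = 64, 16
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C = np.zeros((N, N))

for i in range(0, N, T):           # tile rows of the output
    for j in range(0, N, T):       # tile columns of the output
        for k in range(0, N, T):   # walk the shared dimension tile by tile
            # one tile-sized multiply-accumulate
            C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]

assert np.allclose(C, A @ B)       # tiling does not change the result
```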