2.6 Feed-forward neural network

Making deductions from patterns learned during training and applied to input during utilisation.

[Image: the Transformer architecture, with the feed-forward block highlighted]

Diagram 2.6.0: The Transformer, Vaswani et al. (2017)

The basics of a feed-forward neural network were covered in Chapter 1, Machine Learning basics. The major differences from the examples in Chapter 1 are the quantities of neurons and layers, and the fact that the number of neurons in the input layer must match the dimensionality of the vector being input.

Architecture within the Transformer

[Diagram: a feed-forward neural network of design 8-16-8]

*Diagram 2.6.1: an example of a feed-forward neural network, in which the hidden layer has twice as many neurons as the input and output layers. Note that a modern LLM may have many hidden layers, and that multiple feed-forward neural networks are stacked together when multiple Transformers are stacked together. Diagram generated via an LLM, with minimal edits; an SVG is made up of a set of word-like instructions.*

Specifically, in the context of the Transformer, the feed-forward network scales the dimensionality of the input upwards as it passes through the hidden layer, and then back down to the original dimensionality. The original Transformer scaled from 512 to 2048,[1] i.e. 512 neurons in the input and output layers and 2048 neurons in the hidden layer, whilst Llama-4, as of 2025, scales from 5120 to 8192.[2]
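
As a rough sketch of this up-and-down shape, the following assumes PyTorch and the original 512/2048 sizes from [1]; the ReLU activation matches the original paper, whilst the class and variable names are purely illustrative.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward network: d_model -> d_ff -> d_model."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # scale dimensionality upwards
        self.down = nn.Linear(d_ff, d_model)  # and back down to the original size
        self.activation = nn.ReLU()           # the original Transformer used ReLU

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, sequence_length, d_model); every position in the
        # sequence passes through the same two linear layers independently.
        return self.down(self.activation(self.up(x)))

ffn = FeedForward()
tokens = torch.randn(1, 10, 512)   # a batch of 10 token vectors of dimension 512
print(ffn(tokens).shape)           # torch.Size([1, 10, 512])
```

Note how the input layer's width (512 here) must match the dimensionality of the incoming vectors, as discussed above, and how the output returns to that same width so the result can flow into the next block.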

Scaling dimensionality upwards can make it easier to separate and group data into appropriate patterns, which is vital when training the model to generate patterns, and then again during utilisation when matching against the pre-generated, intricate patterns. Imagine transitioning from 2-dimensional coordinates to 3-dimensional coordinates to identify the shape of a river. This is essentially the purpose of the feed-forward neural network in the context of an LLM: finding the patterns in a text corpus during training, and then applying those patterns to an input sequence during utilisation.
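
To make the idea concrete, here is a minimal sketch (assuming NumPy, with a hand-picked separating plane chosen purely for illustration) of how lifting points from 2 dimensions to 3 can make two intermingled classes separable:

```python
import numpy as np

# XOR-style points: the two classes cannot be separated by a straight line
# in 2 dimensions.
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])

# Scale dimensionality upwards: append x * y as a third coordinate.
lifted = np.column_stack([points, points[:, 0] * points[:, 1]])

# In 3 dimensions the plane x + y - 2z = 0.5 now separates the classes.
scores = lifted @ np.array([1.0, 1.0, -2.0])
predictions = (scores > 0.5).astype(int)
print(predictions)                    # [0 1 1 0]
print((predictions == labels).all())  # True
```

A learned feed-forward layer does something analogous, except the projection upwards and the separating boundaries are discovered from the training corpus rather than chosen by hand.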

References

[1] Vaswani et al. (2017), Attention Is All You Need, Section 3.3.
[2] Llama-4 documentation, HuggingFace.