Temporal inductive bias is very important when we deal with sequences. Recurrent architectures encode such a bias directly in their structure, but transformers, which are pure attention mechanisms, are permutation equivariant.
This limits expressivity: modelling something like a query that returns the first element of a sequence becomes challenging.
To take full advantage of the transformer architecture, we need another way to inject this inductive bias.
If we cannot change the architecture to include an inductive bias, we can do so by changing the data!
How can we include the temporal relationship in the data itself, as some form of data augmentation?
We will use a positional embedding that we either concatenate with, directly add to, or inject in some other way into each input in the sequence.
Concatenate or add?
The positional embedding vector would be as large as the original embedding, so concatenation would produce vectors twice as long. Working in such high-dimensional spaces is undesirable (higher computational cost, more risk of overfitting), so we settle on adding.
This might look counterintuitive at first, but it actually makes perfect sense.
We are already working in very high dimensional spaces.
One of the counterintuitive properties of high-dimensional spaces is that any two random vectors will be nearly orthogonal (they occupy different subspaces), meaning the resulting vector (after adding) can be written as a linear combination of those two nearly orthogonal vectors, and the neural network can learn to separate the embeddings again.
What if the content features and positional frequencies overlap, for example if the data itself is frequency-based?
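A quick numerical check of this heuristic (a standalone sketch, not tied to any particular model): the cosine similarity of two random vectors concentrates around zero as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (4, 64, 512, 4096):
    a, b = rng.standard_normal(d), rng.standard_normal(d)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    print(f"d={d:5d}  cosine similarity = {cos:+.3f}")
# The similarity shrinks roughly like 1/sqrt(d): random high-dimensional
# vectors are nearly orthogonal, so added embeddings remain separable.
```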
Sinusoidal Embeddings:
The initial approach described in [Vaswani et al. 2017] was to use frequencies. If we let $t$ be the position in the sequence we want to encode, we can differentiate it from other positions by using a complex expression:

$$p_t^{(k)} = e^{i \omega_k t}, \quad \omega_k = \frac{1}{b^{2k/d}}$$

Where:
- Usually $b = 10000$, but it can be any number.
- $k$ is the dimension index in the $d$-dimensional embedding.
- $\omega_k$ can be learnable, but in practice it has not proven beneficial.

In practice we don't want to use complex numbers, so we use the vector expression:

$$p_t = \big[\sin(\omega_0 t),\ \cos(\omega_0 t),\ \sin(\omega_1 t),\ \cos(\omega_1 t),\ \ldots,\ \sin(\omega_{d/2-1} t),\ \cos(\omega_{d/2-1} t)\big]$$
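A minimal NumPy sketch of this vector form, assuming the usual interleaved sin/cos layout and base $b = 10000$ (the function name is ours):

```python
import numpy as np

def sinusoidal_embedding(t: int, d: int, base: float = 10000.0) -> np.ndarray:
    """Return the d-dimensional sinusoidal positional embedding for position t."""
    k = np.arange(d // 2)                 # frequency index k = 0 .. d/2 - 1
    omega = 1.0 / base ** (2 * k / d)     # omega_k = 1 / base^(2k/d)
    p = np.empty(d)
    p[0::2] = np.sin(omega * t)           # even dimensions: sin(omega_k * t)
    p[1::2] = np.cos(omega * t)           # odd dimensions:  cos(omega_k * t)
    return p

# Example: embeddings stay bounded in [-1, 1] no matter how large t is.
print(sinusoidal_embedding(3, 8))
print(np.abs(sinusoidal_embedding(10_000_000, 8)).max())  # still <= 1
```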
This is a very elegant way of doing positional embedding, for several reasons:
Periodicity:
Sines and cosines are periodic: they repeat whenever their argument increases by $2\pi$. Since each dimension pair uses a different frequency $\omega_k$, the full embedding never repeats; the frequencies will not all line up unless we reach some astronomically large position, which does not happen in practice. This leads to two subtle properties:
Because of periodicity, sines and cosines are limited to the interval $[-1, 1]$, no matter how big the position is. This way we will not corrupt the input information with large numbers, and we confine the activations to a compact space (avoiding exploding gradients).
Fine and Coarse time, hands of a clock:
The core idea of attention is to compute similarity scores between a query $Q$ and a set of keys $K$ to determine how much attention (weight) to pay to the corresponding values $V$. In self-attention within a Transformer, $Q$, $K$, and $V$ are derived from the same input sequence.
The input to a Transformer layer (after the initial embedding lookup) for a token at position $t$ is the sum of its content embedding (e.g., word embedding $x_t$) and its positional encoding $p_t$:

$$h_t = x_t + p_t, \quad h_t, x_t, p_t \in \mathbb{R}^{d}$$

Where $d$ is the model's embedding dimension.
$Q$, $K$, and $V$, on the other hand, are computed via linear transformations using learned weight matrices $W_Q, W_K \in \mathbb{R}^{d_k \times d}$ (for $Q$ and $K$) and $W_V \in \mathbb{R}^{d_v \times d}$ (for $V$):

$$Q_t = W_Q h_t = W_Q (x_t + p_t), \qquad K_s = W_K h_s = W_K (x_s + p_s)$$

Note: Often $d_k = d / n_{\text{heads}}$, where $n_{\text{heads}}$ is the number of attention heads, but here we'll use $d_k$ for generality.
Here, $Q_t$ represents the query from position $t$, and $K_s$ represents the key from position $s$.
We can now calculate the unnormalized attention score between query position $t$ and key position $s$ as the scaled dot product:

$$\text{score}(t, s) = \frac{Q_t \cdot K_s}{\sqrt{d_k}}$$
Let's expand the dot product $Q_t \cdot K_s$:

$$Q_t \cdot K_s = \big(W_Q (x_t + p_t)\big) \cdot \big(W_K (x_s + p_s)\big)$$

Using the distributive property of the dot product:

$$Q_t \cdot K_s = \underbrace{(W_Q x_t) \cdot (W_K x_s)}_{\text{Term 1}} + \underbrace{(W_Q x_t) \cdot (W_K p_s)}_{\text{Term 2}} + \underbrace{(W_Q p_t) \cdot (W_K x_s)}_{\text{Term 3}} + \underbrace{(W_Q p_t) \cdot (W_K p_s)}_{\text{Term 4}}$$
Let's now analyze each term:
- Term 1 ($(W_Q x_t) \cdot (W_K x_s)$): This term captures the similarity based purely on the content embeddings ($x_t$, $x_s$) after being projected by $W_Q$ and $W_K$. This is the standard content-based attention found in many models. The matrices $W_Q$ and $W_K$ will be learned to appropriately capture a specific pattern related to the content of the different tokens.
- Term 4 ($(W_Q p_t) \cdot (W_K p_s)$): This term captures the similarity based purely on the positional embeddings ($p_t$, $p_s$) after projection. This is where the positional information directly influences the attention score based on the transformed positional vectors. Similarly, the matrices $W_Q$ and $W_K$ will be learned simultaneously to appropriately capture a specific pattern related to the positions of the tokens; here more nuance arises from how the positional embedding is designed.
- Terms 2 & 3 ($(W_Q x_t) \cdot (W_K p_s)$ and $(W_Q p_t) \cdot (W_K x_s)$): These are cross-terms representing the interaction between content and position. For example, Term 2 might allow the query (based on content $x_t$) to seek keys based on their position ($p_s$). Term 3 might allow a query focused on position ($p_t$) to seek keys based on their content ($x_s$). This enables queries like "What word (content) is at position $s$ relative to my position $t$?" or "Where (position) is the word 'transformer' (content) relative to my position $t$?". These can be negligible if the position and content are still nearly orthogonal after the learned projections.
As mentioned above, the learned projection matrices $W_Q$ and $W_K$ determine:
- How much emphasis is placed on the original content ($x$) versus position ($p$) information.
- How the different dimensions within $p$ (corresponding to different frequencies) are weighted and combined.
- How the content and position information are potentially integrated or separated in the final $Q$ and $K$ vectors. The network learns these matrices during training to optimize the attention mechanism for the specific task (e.g., machine translation, text generation).
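To make the four-term decomposition above concrete, here is a small numerical sketch (reusing `sinusoidal_embedding` from the earlier sketch; the random content embeddings and projection matrices are purely illustrative stand-ins for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 64, 16
W_Q = rng.standard_normal((d_k, d)) / np.sqrt(d)   # stand-in for a learned query projection
W_K = rng.standard_normal((d_k, d)) / np.sqrt(d)   # stand-in for a learned key projection

x_t, x_s = rng.standard_normal(d), rng.standard_normal(d)            # content embeddings
p_t, p_s = sinusoidal_embedding(5, d), sinusoidal_embedding(9, d)    # positions t=5, s=9

Q_t = W_Q @ (x_t + p_t)
K_s = W_K @ (x_s + p_s)

terms = [
    (W_Q @ x_t) @ (W_K @ x_s),   # Term 1: content-content
    (W_Q @ x_t) @ (W_K @ p_s),   # Term 2: content-position
    (W_Q @ p_t) @ (W_K @ x_s),   # Term 3: position-content
    (W_Q @ p_t) @ (W_K @ p_s),   # Term 4: position-position
]
print(np.isclose(Q_t @ K_s, sum(terms)))  # True: the score is exactly the sum of the four terms
```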
The sinusoidal positional encoding vector $p_t$ contains components oscillating at different frequencies $\omega_k$.
- High frequencies (small $k$): $\omega_k$ is large. $\sin(\omega_k t)$ and $\cos(\omega_k t)$ change rapidly with $t$. These components encode fine-grained, local positional information. A small change in $t$ leads to a significant change in these components.
- Low frequencies (large $k$): $\omega_k$ is small. $\sin(\omega_k t)$ and $\cos(\omega_k t)$ change slowly with $t$. These components encode coarse-grained, global positional information. These values remain similar even for moderately large differences in $t$ (see the numerical sketch below).
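A small sketch of this contrast: how much the sine component of the highest and lowest frequency changes when the position moves by a given offset.

```python
import numpy as np

d, base = 64, 10000.0
omega = 1.0 / base ** (2 * np.arange(d // 2) / d)   # omega_0 is the highest frequency

t = 100
for delta in (1, 4, 16, 64):
    # Change in the sin component when moving from position t to t + delta.
    high = abs(np.sin(omega[0] * (t + delta)) - np.sin(omega[0] * t))
    low = abs(np.sin(omega[-1] * (t + delta)) - np.sin(omega[-1] * t))
    print(f"offset {delta:3d}:  high-freq change {high:.3f}   low-freq change {low:.5f}")
# High-frequency components change a lot even for small offsets,
# while low-frequency components are nearly unchanged.
```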
The network learns to exploit this multi-frequency representation via the learned weight matrices $W_Q$ and $W_K$.
Recall $Q_t = W_Q h_t = W_Q (x_t + p_t)$.
Each of the $d_k$ dimensions of the output vector $Q_t$ is a linear combination (a weighted sum) of all $d$ dimensions of the input vector $h_t$.
Let $Q_t[i]$ be the $i$-th dimension of the transformed vector, and $W_Q[i, j]$ be the weight in the $i$-th row and $j$-th column of $W_Q$:

$$Q_t[i] = \sum_{j=1}^{d} W_Q[i, j]\, h_t[j]$$
Suppose the network needs to attend to very nearby positions (fine-grained focus). To achieve this, the dot product $Q_t \cdot K_s$ should be large only when $s \approx t$. This requires $Q_t$ and $K_s$ to be sensitive to small changes in position.
The network can achieve this by learning weights in $W_Q$ and $W_K$ such that certain dimensions of $Q_t$ and $K_s$ (say, dimension $i$) heavily weight the high-frequency components (small $k$) of the original $p_t$ and $p_s$. That is, the values $W_Q[i, j]$ and $W_K[i, j]$ would be large for indices $j$ corresponding to high frequencies, and small for indices corresponding to low frequencies.
If dimension $i$ in both $Q_t$ and $K_s$ primarily reflects high-frequency information, their product $Q_t[i]\, K_s[i]$ will contribute significantly to the dot product only when $t$ is very close to $s$.
Conversely, suppose the network needs to attend over longer ranges (coarse-grained focus). It can learn weights in $W_Q$ and $W_K$ that cause other dimensions of $Q_t$ and $K_s$ (say, dimension $i'$) to heavily weight the low-frequency components (large $k$) of the original $p_t$ and $p_s$.
If dimension $i'$ in both $Q_t$ and $K_s$ primarily reflects low-frequency information, their product will contribute significantly to the dot product even when $t$ and $s$ are far apart, as low-frequency components change slowly.
This isn't explicitly programmed. During training, the model adjusts $W_Q$ and $W_K$ (and all other weights) via gradient descent to minimize the loss function.
If attending locally improves performance on the task, the gradients will push $W_Q$ and $W_K$ to develop projections that emphasize high-frequency components.
If attending globally or to specific relative offsets (enabled by lower frequencies) improves performance, the weights will evolve accordingly.
Crucially, Transformers use multi-head attention. Each head has its own set of $W_Q$, $W_K$, $W_V$ matrices.
This allows different heads to simultaneously learn different types of positional focus. One head might learn to focus locally (emphasizing high frequencies), while another learns to focus more globally or on specific relative offsets (emphasizing lower frequencies or specific combinations). The outputs of all heads are then combined, giving the model flexibility.
It's also worth noting that stacking layers on top of each other adds further to this expressive capacity, allowing the model to query both content and position.
Relative Positioning:
The relative positioning property stems from the mathematical structure of the raw sinusoidal vectors themselves.
The dot product of the raw vectors $p_t$ and $p_s$ for positions $t$ and $s$ is:

$$p_t \cdot p_s = \sum_{k=0}^{d/2 - 1} \Big[\sin(\omega_k t)\sin(\omega_k s) + \cos(\omega_k t)\cos(\omega_k s)\Big]$$

Where $\omega_k = 1 / b^{2k/d}$. Using the identity $\cos(\alpha - \beta) = \cos\alpha \cos\beta + \sin\alpha \sin\beta$:

$$p_t \cdot p_s = \sum_{k=0}^{d/2 - 1} \cos\big(\omega_k (t - s)\big)$$

This demonstrates that the dot product of the original, untransformed vectors depends only on the relative displacement $t - s$, not on the absolute positions $t$ or $s$. This is a powerful property suggesting that similarity between raw $p_t$'s inherently measures relative distance.
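A quick numerical check of this identity (a standalone sketch; `p` is just a local helper building the raw sinusoidal vector):

```python
import numpy as np

d, base = 64, 10000.0
omega = 1.0 / base ** (2 * np.arange(d // 2) / d)

def p(t):
    # Raw sinusoidal vector for position t: [sin(w_0 t), cos(w_0 t), sin(w_1 t), ...]
    v = np.empty(d)
    v[0::2], v[1::2] = np.sin(omega * t), np.cos(omega * t)
    return v

# Pairs of positions sharing the same displacement t - s = 7:
for t, s in [(10, 3), (100, 93), (1000, 993)]:
    print(f"t={t:4d}, s={s:4d}:  p_t . p_s = {p(t) @ p(s):.6f}")
# All three values are identical: the dot product depends only on t - s.
```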
This elegant property applies before the linear transformations $W_Q$ and $W_K$. In the actual attention mechanism, the relevant term involving only position is Term 4 from the previous section: $(W_Q p_t) \cdot (W_K p_s)$.
Let $\tilde{p}_t = W_Q p_t$ and $\tilde{p}_s = W_K p_s$ be the transformed positional components of the query and key. Term 4 is $\tilde{p}_t \cdot \tilde{p}_s = p_t^\top W_Q^\top W_K\, p_s$.
Is Term 4 a function of $t - s$ only?
Generally, no. The matrices $W_Q$ and $W_K$ mix the dimensions of the original $p$ vectors.
Unless $W_Q$ and $W_K$ possess very specific structures (e.g., being orthogonal matrices that preserve the pairwise sinusoidal structure, or being related such that $W_Q^\top W_K$ is diagonal with specific values), the simple relationship is lost after transformation. $(W_Q p_t) \cdot (W_K p_s)$ will depend on $t$ and $s$ in a more complex way, influenced by the specific values learned in $W_Q$ and $W_K$.
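Continuing the sketch above with random matrices standing in for learned $W_Q$ and $W_K$, the pure dependence on $t - s$ disappears after projection:

```python
# Continuing the previous sketch: project the same p_t, p_s with random matrices
# (stand-ins for learned W_Q, W_K) and the pure dependence on t - s is lost.
rng = np.random.default_rng(0)
d_k = 16
W_Q = rng.standard_normal((d_k, d)) / np.sqrt(d)
W_K = rng.standard_normal((d_k, d)) / np.sqrt(d)

for t, s in [(10, 3), (100, 93), (1000, 993)]:   # same displacement t - s = 7
    print(f"t={t:4d}, s={s:4d}:  (W_Q p_t) . (W_K p_s) = {(W_Q @ p(t)) @ (W_K @ p(s)):+.4f}")
# The three values now differ: Term 4 depends on the absolute positions, not just t - s.
```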
So, if the exact property doesn't directly carry over, why is it important?
The sinusoidal vectors encode relative position information in their structure. The relationship between $p_t$ and $p_s$ is inherently tied to $t - s$ via rotations in the 2D subspaces defined by each $(\sin(\omega_k \cdot), \cos(\omega_k \cdot))$ pair.
Because the relative position information is already present in a structured way in the inputs $p_t$, it is likely much easier for the network to learn transformations that make the attention score (specifically Term 4, but potentially influencing other terms too) sensitive to relative positions. The network doesn't need to invent the concept of relative positioning from scratch; it only needs to learn how to extract and utilize the information already provided by the $p_t$'s.
The network might learn $W_Q$ and $W_K$ such that $(W_Q p_t) \cdot (W_K p_s)$ approximates a function of $t - s$, even if it's not the exact cosine sum.
In practice, however, networks with sinusoidal embeddings do not extrapolate well to longer sequences, which suggests that they don't actually learn relative positioning.
A learnable positional embedding:
In this method we simply add a learnable bias term to the attention score:

$$\text{score}(t, s) = \frac{Q_t \cdot K_s}{\sqrt{d_k}} + b_{t, s}$$
We hope that the bias term learns to encode the relative position between the query and the key.
It might look like this approach is completely different from the one we described above, but the two are related if you recall the heuristic by which we can separate added embeddings in high-dimensional spaces.
This learnable positional embedding is usually suboptimal compared to the previous approach, since the network needs to learn relative positioning and those fine/coarse aspects of time from scratch. We can also see that the projection matrices cannot directly query position.
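A minimal sketch of one common reading of this idea, assuming a learned scalar bias per relative offset added to the attention logits (a T5-style relative bias; all names and shapes here are illustrative, not from the original text):

```python
import numpy as np

# Hypothetical sketch: a learnable bias b[t - s] added to each attention logit.
rng = np.random.default_rng(0)
seq_len, d_k = 6, 16
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))

max_offset = seq_len - 1
bias = rng.standard_normal(2 * max_offset + 1) * 0.02    # "learnable"; indexed by t - s

logits = Q @ K.T / np.sqrt(d_k)                                       # content-only scores
offsets = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]   # t - s for every pair
logits = logits + bias[offsets + max_offset]                          # position enters only via the bias
print(logits.shape)  # (6, 6)
```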
RoPE: Rotary Positional Embedding
Recalling the analysis of sinusoidal positional embeddings, we identified some potential drawbacks or areas for improvement:
- Information Mixing: The positional encoding $p_t$ is added directly to the content embedding $x_t$. This combined vector $h_t = x_t + p_t$ propagates through the entire Transformer block, including the residual connections and the Feed-Forward Network (FFN). This forces the FFN and potentially other components to handle mixed content and position information, which might impose unnecessary constraints or complexities. The primary need for explicit position information seems to be within the attention mechanism itself, to compute position-aware similarity scores.
- Potential Dilution: Positional information is injected only once at the input layer. In deep Transformers, this initial signal might become "diluted" or less distinct as it passes through numerous transformations across many layers.
- Loss of Pure Relative Property Post-Projection: As rigorously shown before, while the raw sinusoidal vectors $p_t$ and $p_s$ have a dot product that depends only on the relative distance $t - s$, this property is generally lost when calculating the attention score component $(W_Q p_t) \cdot (W_K p_s)$. The learned projections $W_Q$ and $W_K$ mix the dimensions, making the resulting position-position interaction dependent on the absolute positions $t$ and $s$ in a complex way learned during training, rather than purely on $t - s$. While the network can learn relative dependencies, it doesn't get the benefit of this inherent structure directly in the transformed space.
RoPE addresses these points with a fundamentally different approach.
Instead of adding positional information to the input embedding, RoPE modifies the query ($Q_t$) and key ($K_s$) vectors within the attention mechanism, after they have been projected by $W_Q$ and $W_K$, using position-dependent rotations.
The goal is to incorporate the positional information $t$ and $s$ into the query and key respectively, such that their dot product inherently depends on the relative position $t - s$.
RoPE achieves this by applying a rotation matrix that depends on the position ($t$ or $s$) and a set of non-learnable frequencies $\Theta = \{\theta_1, \ldots, \theta_{d_k/2}\}$.
Mechanism:
First, we compute the standard query and key vectors from the input token representations $x_t$ and $x_s$ (note: here $x$ represents the output of the previous layer or the initial embedding, without any added $p$):

$$Q_t = W_Q x_t, \qquad K_s = W_K x_s$$

We then define a $d_k \times d_k$ block-diagonal rotation matrix $R_{\Theta, t}^{d_k}$. The vector dimensions are conceptually paired up: $(1, 2), (3, 4), \ldots, (d_k - 1, d_k)$. For each pair $i$, a 2D rotation corresponding to position $t$ and frequency $\theta_i$ is applied:

$$R_{\Theta, t}^{d_k} = \begin{pmatrix} \cos t\theta_1 & -\sin t\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \sin t\theta_1 & \cos t\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos t\theta_2 & -\sin t\theta_2 & \cdots & 0 & 0 \\ 0 & 0 & \sin t\theta_2 & \cos t\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos t\theta_{d_k/2} & -\sin t\theta_{d_k/2} \\ 0 & 0 & 0 & 0 & \cdots & \sin t\theta_{d_k/2} & \cos t\theta_{d_k/2} \end{pmatrix}$$

The frequencies are typically set similarly to the sinusoidal embeddings:

$$\theta_i = b^{-2(i-1)/d_k}, \quad i \in \{1, 2, \ldots, d_k/2\}$$

Here the base $b$ is $10000$, but it's a hyperparameter, and bigger bases allow for dealing with increasingly longer sequences.
This matrix fundamentally contains the same information as the vector defined in the sinusoidal version (and its equivalent complex form); the important distinction is that the previous vector is suitable for addition to the input, while this matrix is suitable for multiplication.
Now we modify the query and key vectors using their respective position matrices:

$$Q_t' = R_{\Theta, t}^{d_k} Q_t \qquad K_s' = R_{\Theta, s}^{d_k} K_s$$

We can now compute the dot product using the rotated query and key vectors:

$$Q_t' \cdot K_s' = \big(R_{\Theta, t}^{d_k} Q_t\big)^\top \big(R_{\Theta, s}^{d_k} K_s\big) = Q_t^\top \big(R_{\Theta, t}^{d_k}\big)^\top R_{\Theta, s}^{d_k} K_s$$

Rotation matrices are orthogonal, meaning $R^\top R = I$ and, importantly, $\big(R_{\Theta, t}^{d_k}\big)^\top = R_{\Theta, -t}^{d_k}$ (rotating by $-t\theta_i$). Also, consecutive rotations add angles: $R_{\Theta, -t}^{d_k} R_{\Theta, s}^{d_k} = R_{\Theta, s - t}^{d_k}$.
Using these properties in the dot product:

$$Q_t' \cdot K_s' = Q_t^\top R_{\Theta, s - t}^{d_k} K_s$$
Substituting back the original projections $W_Q$ and $W_K$:

$$Q_t' \cdot K_s' = (W_Q x_t)^\top R_{\Theta, s - t}^{d_k} (W_K x_s)$$

We can see how this method seamlessly integrates relative positioning without it getting distorted by the projection matrices: the final score explicitly depends on a transformation matrix $R_{\Theta, s - t}^{d_k}$ that is solely a function of the relative position $s - t$. Unlike the additive sinusoidal case, where the relative property was lost after projection, RoPE maintains it through the projection process by applying the positional encoding multiplicatively afterwards.
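A minimal NumPy sketch of this mechanism (our own helper `rope_rotate`, using the adjacent-pair convention from the matrix above), verifying that the rotated dot product depends only on the offset:

```python
import numpy as np

def rope_rotate(v: np.ndarray, t: int, base: float = 10000.0) -> np.ndarray:
    """Apply the RoPE rotation R_{Theta, t} to v (each adjacent pair rotated by t * theta_i)."""
    d_k = v.shape[0]
    theta = base ** (-2 * np.arange(d_k // 2) / d_k)   # theta_i, one per 2D pair
    cos, sin = np.cos(t * theta), np.sin(t * theta)
    v1, v2 = v[0::2], v[1::2]                          # first and second element of each pair
    out = np.empty_like(v)
    out[0::2] = v1 * cos - v2 * sin
    out[1::2] = v1 * sin + v2 * cos
    return out

rng = np.random.default_rng(0)
d_k = 16
Q, K = rng.standard_normal(d_k), rng.standard_normal(d_k)   # already projected by W_Q, W_K

# The rotated dot product depends only on the relative offset t - s:
for t, s in [(10, 3), (100, 93), (1000, 993)]:
    print(f"t={t:4d}, s={s:4d}:  Q'_t . K'_s = {rope_rotate(Q, t) @ rope_rotate(K, s):+.6f}")
# All three values are identical (same offset t - s = 7).
```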
Additionally, positional information is injected only within the attention calculation. The original input vectors (outputs from the previous layer norm or FFN) do not contain explicit positional encoding, meaning the residual connections and FFNs operate purely on content-based representations from the attention output. And since RoPE is applied independently within the attention mechanism of each Transformer layer, the positional signal is freshly incorporated at every layer, preventing dilution.
At the same time we have preserved, similarly to sinusoidal embeddings, the use of different frequencies ($\theta_i$), allowing position to be encoded across different scales. The network learns via $W_Q$ and $W_K$ how to utilize these different frequency components within the rotated space in a relative way.
Moreover, the rotation matrices $R_{\Theta, t}^{d_k}$ are orthogonal. Orthogonal transformations preserve vector norms (lengths), which can contribute to training stability by preventing activations from exploding or vanishing due to the positional encoding step itself.
Finally, because RoPE directly encodes relative positions in a theoretically sound way within the attention score, models using RoPE often exhibit better generalization and extrapolation to sequence lengths longer than those seen during training, compared to methods where relative positioning is only learned implicitly or lost post-projection, like sinusoidal embeddings. In the latter case, it's harder for the network to decipher relative positioning, so it tends to memorize absolute embeddings, which degrades performance on longer sequences. With RoPE, since relative positioning is inherently encoded, the network learns to work with relative positions directly.