Rotary Position Embedding (RoPE) is an effective position-encoding technique first introduced by Su et al. [1] and later popularized in open-source models such as GPT-J, GPT-NeoX, PaLM, LLaMA, etc. We covered the mathematics and the implementation details of RoPE in this blog post about two years ago. RoPE on its own, however, is limited by its pretrained context size, so here we summarize a line of research that extends the context length of RoPE-based models, allowing a pretrained language model to be adapted to the increasingly demanding tasks being given to LLMs.
Conventions
Given a sequence of tokens $w_1, w_2, \ldots, w_L$, denote their embedding vectors by $\mathbf{x}_1, \ldots, \mathbf{x}_L \in \mathbb{R}^{|D|}$, where $|D|$ is the dimension of the hidden states. The attention layers turn the hidden states at positions $m$ and $n$ into query and key vectors $\mathbf{q}_m = f_q(\mathbf{x}_m, m)$ and $\mathbf{k}_n = f_k(\mathbf{x}_n, n)$, and the attention scores are computed from the dot products $\mathbf{q}_m^\top \mathbf{k}_n$. Throughout this post, $L$ denotes the pretrained context length, $L' > L$ the extended context length, and $s = L'/L$ the scale factor.
Rotary Position Embedding
The idea of the Rotary Position Embedding (RoPE) is very simple: the attention scores should only depend on the relative distance $m - n$ between two tokens. RoPE achieves this by grouping the hidden dimensions of the query and key vectors into pairs and rotating each pair by an angle proportional to the position. In complex notation, the $d$-th pair of $\mathbf{q}_m$ is multiplied by $e^{i m \theta_d}$, where the frequencies are $\theta_d = b^{-2d/|D|}$ with base $b = 10000$. The dot product $\mathbf{q}_m^\top \mathbf{k}_n$ then only involves the angles $(m - n)\theta_d$, i.e. the relative distance.
A few methods we introduce below enhance RoPE in the following common format: we modify the function $f_{q,k}$ into $f'_{q,k}(\mathbf{x}_m, m, \theta_d) = f_{q,k}(\mathbf{x}_m, g(m), h(\theta_d))$ for some functions $g$ (acting on the positions) and $h$ (acting on the frequencies). Each method below amounts to a particular choice of $g$ and $h$.
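For concreteness, here is a minimal NumPy sketch of plain RoPE (interleaved-pair convention; the helper names `rope_frequencies` and `apply_rope` are ours and are reused in the snippets below):

```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Per-pair rotation frequencies theta_d = base^(-2d/dim) for d = 0..dim/2 - 1."""
    return base ** (-np.arange(0, dim, 2) / dim)

def apply_rope(x: np.ndarray, positions: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Rotate consecutive channel pairs of x by the angle position * theta_d.

    x: (seq_len, dim) query or key vectors; positions: (seq_len,) token positions.
    """
    angles = positions[:, None] * theta[None, :]        # (seq_len, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                     # even / odd channels form the pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```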
Position Interpolation
During pretraining, the data is cut into chunks of sequences with an equal number of tokens $L$ (the pretrained context length), so the model never sees a position index beyond $L$ and its performance degrades sharply when evaluated past that length. Position Interpolation (PI), proposed concurrently in [2] and [3], extends the context window to $L' > L$ by rescaling the positions instead of extrapolating them: with scale factor $s = L'/L$, it sets $g(m) = m/s$ and leaves the frequencies untouched ($h(\theta_d) = \theta_d$).

We simply "squeeze" the new sequence inside the original context window, and it takes orders of magnitude less finetuning to let the model get used to the new position embedding. PI still has several limits:
- It normally requires finetuning on a corpus on the order of billions of tokens before the model adapts to the extended window.
- After finetuning on longer sequences, the perplexity slightly increases for short sequences compared with the original pretrained model.
- The way it modifies the RoPE formula only rescales the positions via $g(m)$ and does not take advantage of applying better frequencies via $h(\theta_d)$.
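As a sketch (building on the `apply_rope` helper above, with our own variable names), PI only changes the positions that are fed into RoPE:

```python
def apply_rope_pi(x, positions, theta, scale: float):
    """Position Interpolation: squeeze positions by the scale factor s = L'/L,
    so that positions in [0, L') map back into the pretrained range [0, L)."""
    return apply_rope(x, positions / scale, theta)

# Example: a model pretrained with L = 2048 extended to L' = 8192 uses scale = 4.0.
```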
"NTK-aware" Interpolation
Looking at RoPE purely from an information-encoding perspective, it was shown in [4] using Neural Tangent Kernel (NTK) theory that deep neural networks have trouble learning high-frequency information when the input dimension is low, unless the corresponding embeddings contain high-frequency components. In our case, the one-dimensional input (the token positions) is expanded by RoPE into a higher-dimensional, complex-valued embedding whose components span a wide range of frequencies. The uniform scaling by PI reduces all of these frequencies by the same factor $s$, including the highest ones that the model relies on to tell nearby tokens apart.
To take advantage of this observation, the "NTK-aware" interpolation was first proposed in public as a Reddit post. The modification is as follows: instead of scaling the frequencies of every dimension of RoPE by the same factor $1/s$, we spread the interpolation pressure unevenly across dimensions by changing the base $b$. More precisely, recall that $\theta_d = b^{-2d/|D|}$ with $b = 10000$; the "NTK-aware" interpolation replaces the base with $b' = b \cdot s^{|D|/(|D|-2)}$. With this choice, the lowest frequency is scaled by exactly $1/s$ (matching PI) while the highest frequency $\theta_0 = 1$ is left unchanged, so the high-frequency (local) information is preserved.
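In code, the "NTK-aware" change only touches the base (a sketch using the helpers above):

```python
def ntk_aware_base(base: float, dim: int, scale: float) -> float:
    """'NTK-aware' interpolation: stretch the base so that the lowest-frequency
    dimension is interpolated by ~1/scale while the highest stays untouched."""
    return base * scale ** (dim / (dim - 2))

# Positions are NOT rescaled here; only the frequencies change.
theta_ntk = rope_frequencies(dim=128, base=ntk_aware_base(10000.0, dim=128, scale=4.0))
```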
Digression: Wavelength
A commonly overlooked aspect in rotary embedding is the relationship between the “wavelengths” and the sequence length. Let us start by putting down the definition of wavelength in our context.
Recall that in the definition of RoPE, each pair of hidden dimensions of the query and key vectors is multiplied by trigonometric functions of $m\theta_d$. For a fixed dimension $d$, we define the wavelength as $\lambda_d = 2\pi/\theta_d$: the number of tokens needed for that dimension to complete one full rotation ($2\pi$).
Wavelength is a notion directly comparable with the context length: a dimension whose wavelength is longer than the pretrained context length $L$ never completes a full rotation during pretraining, so the model only ever sees a fraction of its period, whereas a dimension with a short wavelength cycles through all of its rotation angles many times.
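A quick numerical check (using the `rope_frequencies` helper above) shows how wide the spread of wavelengths is for a typical head dimension:

```python
def wavelengths(theta: np.ndarray) -> np.ndarray:
    """Tokens needed for each RoPE dimension to complete a full 2*pi rotation."""
    return 2 * np.pi / theta

lam = wavelengths(rope_frequencies(dim=128))
print(lam.min(), lam.max())  # roughly 6 tokens for the fastest dimension,
                             # tens of thousands of tokens for the slowest one
```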
"NTK-by-parts" Interpolation
The performance comparison between PI and “NTK-aware” interpolation is mixed:
- When directly modifying the RoPE formula without finetuning, "NTK-aware" interpolation shows better (lower) perplexity than PI on longer sequences.
- The "NTK-aware" interpolation performs worse than PI after finetuning on longer context data.
A fix addressing this issue of the "NTK-aware" interpolation was first posted in public as a GitHub pull request.
We hypothesize that interpolating the high-frequency dimensions (those with short wavelengths) has a detrimental effect on the model's ability to understand small, local relationships between neighboring embeddings. To smoothly blend the original frequencies with the interpolated ones, we compare the wavelength $\lambda_d$ of each dimension with the original context length $L$: dimensions whose wavelength is much shorter than $L$ keep their original frequency (no interpolation), dimensions whose wavelength is equal to or longer than $L$ are fully interpolated as in PI, and the dimensions in between are blended linearly by a ramp function.
The following chart compares the wavelengths of RoPE, PI and "NTK-by-parts" in the case where the pretrained context length is 2048 and the scale factor is 16.

We can summarize it as follows: taking NTK theory into account, we interpolate the wavelengths from RoPE to PI over the different hidden dimensions. Write $r(d) = L/\lambda_d$ for the number of full rotations dimension $d$ completes within the original context, choose two thresholds $\alpha < \beta$, and define the ramp $\gamma(r) = 0$ for $r < \alpha$, $\gamma(r) = 1$ for $r > \beta$, and $\gamma(r) = (r - \alpha)/(\beta - \alpha)$ in between. The "NTK-by-parts" frequencies are then $h(\theta_d) = \big(1 - \gamma(r(d))\big)\,\theta_d/s + \gamma(r(d))\,\theta_d$, with the positions left unchanged. For the Llama family we found $\alpha = 1$ and $\beta = 32$ to work well.
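Here is a sketch of the frequency blending described above (reusing `rope_frequencies`; the defaults $\alpha = 1$, $\beta = 32$ are the Llama values mentioned in the text):

```python
def ntk_by_parts_frequencies(dim, scale, base=10000.0, orig_ctx=2048,
                             alpha=1.0, beta=32.0):
    """Blend per dimension: keep high frequencies untouched, fully interpolate
    low frequencies (wavelength >= original context), and ramp in between."""
    theta = rope_frequencies(dim, base)
    r = orig_ctx * theta / (2 * np.pi)                  # rotations per context window
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)
    # gamma = 1 -> pure RoPE (extrapolation); gamma = 0 -> pure PI (interpolation)
    return gamma * theta + (1.0 - gamma) * theta / scale
```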
YaRN
In all the interpolation methods, we also observe that introducing a suitable temperature $t$ on the attention logits (before the softmax) improves the perplexity over the extended context window.
More precisely, we slightly adjust the attention weights from $\mathrm{softmax}\!\left(\frac{\mathbf{q}_m^\top \mathbf{k}_n}{\sqrt{|D|}}\right)$ to $\mathrm{softmax}\!\left(\frac{\mathbf{q}_m^\top \mathbf{k}_n}{t\sqrt{|D|}}\right)$. Conveniently, this can be implemented at zero extra cost by scaling both the query and key vectors by $\sqrt{1/t}$ while applying RoPE, leaving the attention code itself untouched.
Here we conducted a small experiment comparing different values of the temperature $t$ against the resulting perplexity over the extended context window.

When determining the best $t$, we found that all models following the Llama architecture are fit well by the same formula, $\sqrt{1/t} = 0.1\ln(s) + 1$, where $s$ is the scale factor.
Overall, our YaRN method refers to a combination of this temperature-scaling technique and the “NTK-by-parts” interpolation.
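As a sketch, the temperature enters the implementation as a single multiplier on the rotated queries and keys (the Llama-family fit from above; other architectures may need their own fit):

```python
import math

def yarn_mscale(scale: float) -> float:
    """sqrt(1/t) for the Llama family: multiply the RoPE'd queries and keys by this,
    which is equivalent to dividing the attention logits by t."""
    return 0.1 * math.log(scale) + 1.0

q_mult = yarn_mscale(16.0)   # ~1.277 for a 16x context extension
```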
Some notes on how you can use YaRN for your own model
The YaRN parameters for Llama 2 may not work out of the box for different model classes. YaRN is a combination of "NTK-by-parts" interpolation and temperature scaling on the attention weights, so throughout the implementation there are a few parameters one can tune:
- the scale factor $s = L'/L$, i.e. how far you want to extend the context window;
- the ramp thresholds $\alpha$ and $\beta$ of the "NTK-by-parts" interpolation ($\alpha = 1$, $\beta = 32$ worked well for the Llama family);
- the attention temperature $t$ (the fit $\sqrt{1/t} = 0.1\ln(s) + 1$ was obtained on Llama-architecture models; other model classes may need a different fit).
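Putting the pieces together, here is a minimal sketch of a YaRN-style setup exposing exactly these knobs (the function names are ours, not from any particular library):

```python
def yarn_rope(dim, scale, base=10000.0, orig_ctx=2048, alpha=1.0, beta=32.0):
    """Return (theta, mscale): 'NTK-by-parts' frequencies plus the attention
    temperature multiplier. Apply RoPE with `theta` at unscaled integer
    positions, then multiply the rotated queries and keys by `mscale`."""
    theta = ntk_by_parts_frequencies(dim, scale, base, orig_ctx, alpha, beta)
    mscale = yarn_mscale(scale)
    return theta, mscale

theta, mscale = yarn_rope(dim=128, scale=16.0)   # e.g. 2048 -> 32768 tokens
```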
Dynamic Scaling
In a lot of use cases, the sequence length varies constantly, from a single token up to the maximal context size, for example when a chat model generates its answer token by token: most sequences stay well within the pretrained window and only occasionally exceed it. Using a fixed scale factor $s$ in this setting has a drawback: sequences shorter than $L$ are interpolated even though no extension is needed, which slightly degrades performance at those lengths.
A solution to this problem was first proposed in a Reddit post, which suggests dynamically adjusting the scale factor at inference time based on the current sequence length $l'$: set $s = \max(1, l'/L)$, so that no interpolation happens until the sequence actually outgrows the pretrained window, and the scale then grows gracefully with the sequence. Combined with the "NTK-aware" interpolation, this is commonly referred to as "Dynamic NTK" interpolation.
We would like to note that the "Dynamic NTK" interpolation works exceptionally well on models pretrained only on the original context length $L$, without any finetuning at all.
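A sketch of the dynamic rule (combined here with the YaRN helpers above; the same idea applies to plain "NTK-aware" scaling):

```python
def dynamic_scale(current_len: int, orig_ctx: int = 2048) -> float:
    """No interpolation until the sequence outgrows the pretrained window;
    afterwards the scale factor grows with the current sequence length."""
    return max(1.0, current_len / orig_ctx)

# Recompute the frequencies whenever the sequence grows past orig_ctx, e.g.:
theta, mscale = yarn_rope(dim=128, scale=dynamic_scale(current_len=5000))
```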
Experiments and final words
One of the experiments we ran was to compare PI, "NTK-aware" interpolation and YaRN using sliding-window perplexity over long documents; the results are shown in the chart below.

For the finetunes of Mistral-7b, we have the following chart.

We direct interested readers to our arXiv preprint for more details and experiment results.
We would also like to point out that there are other recent works on context length extension, such as Rectified Rotary Position Embeddings (ReRoPE) and Positional Skip-wisE (PoSE) training, though they are different lines of work and are out of scope for this blog post.
Citation Information
@misc{peng2023yarn,
title={YaRN: Efficient Context Window Extension of Large Language Models},
author={Bowen Peng and Jeffrey Quesnelle and Honglu Fan and Enrico Shippole},
year={2023},
eprint={2309.00071},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
References
[1] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding, 2022. arXiv: 2104.09864.
[2] kaiokendev. Things I'm learning while training SuperHOT, 2023. URL: https://kaiokendev.github.io/til#extending-context-to-8k.
[3] S. Chen, S. Wong, L. Chen, and Y. Tian. Extending context window of large language models via positional interpolation, 2023. arXiv: 2306.15595.
[4] M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.