A head-to-head comparison of Rotary Position Embedding (RoPE) and GPT-style learned absolute position embeddings. Both 1.3B-parameter models were trained for 100k steps on the Pile using Mesh Transformer JAX. There is no strong overall trend, but hopefully someone will find these results useful regardless.
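For readers unfamiliar with the rotary approach, here is a minimal NumPy sketch of the idea (illustrative only, not the Mesh Transformer JAX implementation): pairs of query/key dimensions are rotated by position-dependent angles, so dot products between rotated queries and keys depend only on the relative offset between positions.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, dim).

    Each dimension pair (2i, 2i+1) at position pos is rotated by the
    angle pos / base**(2i/dim). Because rotations compose, the dot
    product <RoPE(q, m), RoPE(k, n)> depends only on the offset m - n.
    Illustrative sketch, not the exact implementation used for these runs.
    """
    seq_len, dim = x.shape
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = np.outer(np.arange(seq_len), inv_freq)        # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

In attention, this is applied to queries and keys (but not values) before the score computation; the learned-embedding baseline instead adds a trained per-position vector to the token embeddings.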
| Task | Metric | Learned | Rotary |
|---|---|---|---|
| lambada | ppl | 7.940 ± 0.208 | 7.156 ± 0.208 |
| | acc | 0.556 ± 0.007 | 0.567 ± 0.007 |
| piqa | acc | 0.700 ± 0.011 | 0.714 ± 0.011 |
| | acc_norm | 0.693 ± 0.011 | 0.709 ± 0.011 |
| hellaswag | acc | 0.376 ± 0.005 | 0.389 ± 0.005 |
| | acc_norm | 0.472 ± 0.005 | 0.488 ± 0.005 |
| winogrande | acc | 0.540 ± 0.014 | 0.571 ± 0.014 |
| mathqa | acc | 0.231 ± 0.008 | 0.230 ± 0.008 |
| | acc_norm | 0.234 ± 0.008 | 0.227 ± 0.008 |
| pubmedqa | acc | 0.599 ± 0.015 | 0.583 ± 0.015 |
| boolq | acc | 0.575 ± 0.009 | 0.614 ± 0.009 |
| anli_r3 | acc | 0.344 ± 0.014 | 0.351 ± 0.014 |
| openbookqa | acc | 0.198 ± 0.018 | 0.206 ± 0.018 |
| | acc_norm | 0.316 ± 0.021 | 0.330 ± 0.021 |
| triviaqa | acc | 0.041 ± 0.002 | 0.026 ± 0.002 |
| arc_challenge | acc | 0.235 ± 0.012 | 0.230 ± 0.012 |
| | acc_norm | 0.260 ± 0.013 | 0.272 ± 0.013 |
| arc_easy | acc | 0.564 ± 0.010 | 0.568 ± 0.010 |
| | acc_norm | 0.505 ± 0.010 | 0.486 ± 0.010 |
| cb | acc | 0.375 ± 0.065 | 0.357 ± 0.065 |
| cola | mcc | 0.042 ± 0.034 | 0.022 ± 0.034 |
| copa | acc | 0.730 ± 0.044 | 0.730 ± 0.044 |
| ethics_cm | acc | 0.491 ± 0.008 | 0.480 ± 0.008 |
| ethics_deontology | acc | 0.497 ± 0.008 | 0.497 ± 0.008 |
| ethics_justice | acc | 0.501 ± 0.010 | 0.501 ± 0.010 |
| ethics_utilitarianism | acc | 0.497 ± 0.007 | 0.493 ± 0.007 |
| ethics_virtue | acc | 0.200 ± 0.006 | 0.200 ± 0.006 |
| headqa | acc | 0.227 ± 0.008 | 0.224 ± 0.008 |
| | acc_norm | 0.270 ± 0.008 | 0.271 ± 0.008 |
| logiqa | acc | 0.221 ± 0.016 | 0.215 ± 0.016 |
| | acc_norm | 0.293 ± 0.018 | 0.283 ± 0.018 |
| mnli | acc | 0.344 ± 0.005 | 0.344 ± 0.005 |
| mnli_mismatched | acc | 0.345 ± 0.005 | 0.349 ± 0.005 |
| mrpc | acc | 0.684 ± 0.023 | 0.684 ± 0.023 |
| | f1 | 0.812 ± 0.017 | 0.812 ± 0.017 |
| qa4mre_2011 | acc | 0.392 ± 0.045 | 0.358 ± 0.045 |
| | acc_norm | 0.450 ± 0.045 | 0.433 ± 0.045 |
| qa4mre_2012 | acc | 0.287 ± 0.036 | 0.312 ± 0.036 |
| | acc_norm | 0.394 ± 0.039 | 0.400 ± 0.039 |
| qa4mre_2013 | acc | 0.335 ± 0.028 | 0.335 ± 0.028 |
| | acc_norm | 0.352 ± 0.028 | 0.349 ± 0.028 |
| qnli | acc | 0.498 ± 0.007 | 0.517 ± 0.007 |
| qqp | acc | 0.370 ± 0.002 | 0.368 ± 0.002 |
| | f1 | 0.538 ± 0.003 | 0.538 ± 0.003 |
| race | acc | 0.345 ± 0.015 | 0.343 ± 0.015 |
| record | f1 | 0.805 ± 0.004 | 0.813 ± 0.004 |
| | em | 0.797 ± 0.004 | 0.805 ± 0.004 |
| rte | acc | 0.538 ± 0.030 | 0.523 ± 0.030 |
| sciq | acc | 0.867 ± 0.011 | 0.865 ± 0.011 |
| | acc_norm | 0.796 ± 0.013 | 0.771 ± 0.013 |
| sst | acc | 0.572 ± 0.017 | 0.519 ± 0.017 |
| webqs | acc | 0.021 ± 0.003 | 0.006 ± 0.003 |
| wic | acc | 0.500 ± 0.020 | 0.498 ± 0.020 |
| wnli | acc | 0.437 ± 0.059 | 0.549 ± 0.059 |
| wsc | acc | 0.365 ± 0.047 | 0.365 ± 0.047 |
| wsc273 | acc | 0.722 ± 0.027 | 0.736 ± 0.027 |
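As a rough way to read the table, one can check which gaps exceed the combined error of the two columns. This sketch treats each ± value as the standard error of its estimate and assumes the two runs are independent (both are assumptions, not stated above); the `beyond_stderr` helper and the row selection are illustrative, not part of the original evaluation.

```python
import math

def beyond_stderr(learned, rotary, se):
    """True if |rotary - learned| exceeds sqrt(2) * se, the standard
    error of the difference under an independent-runs assumption."""
    return abs(rotary - learned) > math.sqrt(2) * se

# A few (task, metric, learned, rotary, stderr) rows copied from the table above.
rows = [
    ("lambada",    "acc", 0.556, 0.567, 0.007),
    ("winogrande", "acc", 0.540, 0.571, 0.014),
    ("boolq",      "acc", 0.575, 0.614, 0.009),
    ("triviaqa",   "acc", 0.041, 0.026, 0.002),
    ("sciq",       "acc", 0.867, 0.865, 0.011),
]

for task, metric, learned, rotary, se in rows:
    diff = rotary - learned
    label = "beyond" if beyond_stderr(learned, rotary, se) else "within"
    print(f"{task:10s} {metric}: diff={diff:+.3f} ({label} combined stderr)")
```

By this criterion the rotary gains on lambada, winogrande, and boolq (and the rotary losses on triviaqa, webqs, and sst) clear the combined error, while most other rows do not, which is consistent with the "no strong overall trend" reading.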