Multiple Choice Normalization in LM Evaluation

Let $x_{0:m}$ be the prompt, and $x_{m:n_i}$ be the $i$th possible continuation with a token length of $n_i - m$. There are several ways to use a language model to rank multiple possible continuations of a prompt. Since the language model only gives (log) probabilities for the next token given the context (i.e., $\log \mathbb P(x_j|x_{0:j})$), it is ambiguous how to score arbitrary multi-token continuations. The following are several possible ways to resolve this problem:

Unnormalized: The score of continuation $i$ is $\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j})$. Intuitively, this is the log probability that a generation sampled from the prompt begins with the continuation in question. While this is the simplest method, problems arise when continuations differ significantly in length: longer continuations tend to have lower log probabilities, which biases the ranking towards shorter continuations. This approach is used by eval harness in all multiple choice tasks and presented as acc.
Token-length normalized: The score of continuation $i$ is $\frac{1}{n_i - m} \sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j})$. This approach attempts to normalize for length by computing the average log probability per token. However, it is not tokenization agnostic: two models with different tokenizers that assign the same log likelihood to every input string will, in general, produce different token-length normalized scores. This approach is used by GPT-3 in most tasks. Eval harness does not report this score because it violates the design principle that all tasks should be tokenization independent.
Byte-length normalized: The score of continuation $i$ is $\frac{\sum_{j=m}^{n_i - 1} \log \mathbb P(x_j|x_{0:j})}{\sum_{j=m}^{n_i - 1} L_{x_j}}$, where $L_{x_j}$ is the number of bytes represented by the token $x_j$. This approach attempts to normalize for length by computing the average log probability per byte. Because the denominator depends only on the continuation string and not on how it is split into tokens, the score is tokenization agnostic. This approach is also used by eval harness in all multiple choice tasks and presented as acc_norm.
Unconditional likelihood normalized: The score of continuation $i$ is $\sum_{j=m}^{n_i - 1} \left[ \log \mathbb P(x_j|x_{0:j}) - \log \mathbb P(x_j|x_{m:j}) \right]$, where the second term is the log probability of the same token when the continuation is scored without the prompt. Intuitively, this approach measures how much the prompt increases the model's probability of outputting each continuation, relative to the probability of the model producing that continuation unconditionally. This approach is used by GPT-3 in select tasks (ARC, OpenBookQA, and RACE), though no justification for why only these tasks in particular use this method is provided other than that this improves performance.
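
As a rough illustration of the four scoring rules above, the following sketch computes each score from per-token log probabilities that are assumed to be already available. The `Continuation` container, the `scores` and `pick` helpers, and all numbers are hypothetical and made up for this example; they are not taken from eval harness or GPT-3. The unconditional term is assumed to be the log probability of each continuation token when the continuation is scored without the prompt.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Continuation:
    # log P(x_j | x_{0:j}) for each continuation token (prompt + earlier tokens as context)
    cond_logprobs: List[float]
    # log probability of each token when the continuation is scored without the prompt
    uncond_logprobs: List[float]
    # number of bytes represented by each continuation token
    token_bytes: List[int]


def scores(c: Continuation) -> Dict[str, float]:
    """Compute the four scores for a single continuation."""
    total = sum(c.cond_logprobs)
    return {
        "unnormalized": total,
        "token_length": total / len(c.cond_logprobs),
        "byte_length": total / sum(c.token_bytes),
        "unconditional": total - sum(c.uncond_logprobs),
    }


def pick(continuations: List[Continuation], method: str) -> int:
    """Index of the highest-scoring continuation under the given method."""
    return max(range(len(continuations)), key=lambda i: scores(continuations[i])[method])


# Made-up numbers: a short 3-byte answer and a longer 12-byte answer.
short = Continuation([-2.0], [-2.5], [3])
long_ = Continuation([-1.0, -1.0, -1.0], [-0.5, -0.5, -0.5], [4, 4, 4])
choices = [short, long_]

for method in ("unnormalized", "token_length", "byte_length", "unconditional"):
    print(method, pick(choices, method))

# Same 8-byte string and same total log likelihood, split into 2 vs. 4 tokens.
coarse = Continuation([-2.0, -2.0], [0.0, 0.0], [4, 4])
fine = Continuation([-1.0, -1.0, -1.0, -1.0], [0.0, 0.0, 0.0, 0.0], [2, 2, 2, 2])
print(scores(coarse)["token_length"], scores(fine)["token_length"])
print(scores(coarse)["byte_length"], scores(fine)["byte_length"])
```

With these made-up values, the unnormalized score prefers the short answer while both length-normalized scores prefer the long one. The `coarse` and `fine` pair shows the tokenization dependence: the token-length normalized scores differ even though the string and its total log likelihood are the same, whereas the byte-length normalized scores agree.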

The unnormalized, token-length normalized, and byte-length normalized metrics can be computed without additional LM calls. The unconditional likelihood normalized metric requires an additional LM call to obtain the unconditional likelihood.
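
The difference in LM calls can be sketched with a toy scorer that counts how often it is invoked. Everything here is hypothetical: `CountingToyLM`, its `loglikelihood(context, continuation)` method, and the made-up per-byte scores stand in for a real model and are not any library's API. The sketch also assumes that the token byte lengths sum to the UTF-8 byte length of the continuation string, and it omits the token-length normalized score because the toy scorer has no tokenizer.

```python
from typing import Dict, List


class CountingToyLM:
    """Toy stand-in for a language model that counts scoring calls."""

    def __init__(self) -> None:
        self.calls = 0

    def loglikelihood(self, context: str, continuation: str) -> float:
        """Return a made-up summed log probability of `continuation` given `context`."""
        self.calls += 1
        # Made-up scores: -1.0 per byte without context, -0.5 per byte with context.
        per_byte = -0.5 if context else -1.0
        return per_byte * len(continuation.encode("utf-8"))


def score_choices(
    lm: CountingToyLM,
    prompt: str,
    continuations: List[str],
    with_unconditional: bool = False,
) -> List[Dict[str, float]]:
    results = []
    for cont in continuations:
        # One call per continuation covers the unnormalized and byte-length scores.
        cond = lm.loglikelihood(prompt, cont)
        entry = {
            "unnormalized": cond,
            "byte_length": cond / len(cont.encode("utf-8")),
        }
        if with_unconditional:
            # Extra call: score the same continuation without the prompt.
            entry["unconditional"] = cond - lm.loglikelihood("", cont)
        results.append(entry)
    return results


lm_basic = CountingToyLM()
basic = score_choices(lm_basic, "Q: 2+2? A:", [" 4", " five"])
print(lm_basic.calls)

lm_full = CountingToyLM()
full = score_choices(lm_full, "Q: 2+2? A:", [" 4", " five"], with_unconditional=True)
print(lm_full.calls)
```

In this sketch the unconditional variant doubles the number of scoring calls, since each continuation is scored once with the prompt and once without it.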