OpenAI hasn't officially said anything about the sizes of their API models, which naturally raises the question of just how big they are. Thankfully, we can use the eval harness to evaluate the API models on a bunch of tasks and compare the results to the figures in the GPT-3 paper. Since there are going to be minor differences in task implementation, and OpenAI is probably fine-tuning their API models all the time, the numbers don't line up exactly, but they should give a pretty good idea of the ballpark.
Model | LAMBADA ppl ↓ | LAMBADA acc ↑ | Winogrande ↑ | Hellaswag ↑ | PIQA ↑ |
---|---|---|---|---|---|
GPT-3-125M | 18.6 | 42.7% | 52.0% | 33.7% | 64.6% |
GPT-3-350M | 9.09 | 54.3% | 52.1% | 43.6% | 70.2% |
Ada | 9.95 | 51.6% | 52.9% | 43.4% | 70.5% |
GPT-3-760M | 6.53 | 60.4% | 57.4% | 51.0% | 72.9% |
GPT-3-1.3B | 5.44 | 63.6% | 58.7% | 54.7% | 75.1% |
Babbage | 5.58 | 62.4% | 59.0% | 54.5% | 75.5% |
GPT-3-2.7B | 4.60 | 67.1% | 62.3% | 62.8% | 75.6% |
GPT-3-6.7B | 4.00 | 70.3% | 64.5% | 67.4% | 78.0% |
Curie | 4.00 | 68.5% | 65.6% | 68.5% | 77.9% |
GPT-3-13B | 3.56 | 72.5% | 67.9% | 70.9% | 78.5% |
GPT-3-175B | 3.00 | 76.2% | 70.2% | 78.9% | 81.0% |
Davinci | 2.97 | 74.8% | 70.2% | 78.1% | 80.4% |
All GPT-3 figures are from the GPT-3 paper; all API figures are computed using the eval harness.
Ada, Babbage, Curie, and Davinci line up closely with the 350M, 1.3B, 6.7B, and 175B models respectively. This isn't ironclad evidence that the API models are those sizes, but it's pretty suggestive.
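To make "line up closely" a bit more concrete, here is a minimal sketch that finds, for each API model, the GPT-3 paper model with the smallest mean absolute gap across the four accuracy columns of the table. The distance metric and the helper names (`mean_abs_gap`, `closest_paper_model`) are my own choices for illustration, not part of the eval harness, and LAMBADA perplexity is left out because it lives on a different scale than the accuracies.

```python
# Accuracy figures (%) copied from the table above, in column order:
# (LAMBADA acc, Winogrande, Hellaswag, PIQA)
GPT3_PAPER = {
    "GPT-3-125M": (42.7, 52.0, 33.7, 64.6),  # smallest model; 125M parameters in the GPT-3 paper
    "GPT-3-350M": (54.3, 52.1, 43.6, 70.2),
    "GPT-3-760M": (60.4, 57.4, 51.0, 72.9),
    "GPT-3-1.3B": (63.6, 58.7, 54.7, 75.1),
    "GPT-3-2.7B": (67.1, 62.3, 62.8, 75.6),
    "GPT-3-6.7B": (70.3, 64.5, 67.4, 78.0),
    "GPT-3-13B": (72.5, 67.9, 70.9, 78.5),
    "GPT-3-175B": (76.2, 70.2, 78.9, 81.0),
}

API_MODELS = {
    "Ada": (51.6, 52.9, 43.4, 70.5),
    "Babbage": (62.4, 59.0, 54.5, 75.5),
    "Curie": (68.5, 65.6, 68.5, 77.9),
    "Davinci": (74.8, 70.2, 78.1, 80.4),
}


def mean_abs_gap(a, b):
    """Mean absolute difference, in percentage points, between two score tuples."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def closest_paper_model(api_scores):
    """Name of the GPT-3 paper model whose accuracies are nearest to api_scores."""
    return min(GPT3_PAPER, key=lambda name: mean_abs_gap(api_scores, GPT3_PAPER[name]))


# Nearest paper model for every API model.
matches = {api: closest_paper_model(scores) for api, scores in API_MODELS.items()}

for api, paper in matches.items():
    gap = mean_abs_gap(API_MODELS[api], GPT3_PAPER[paper])
    print(f"{api:8s} -> {paper:11s} (mean gap {gap:.2f} points)")
```

Under this (admittedly crude) metric the nearest neighbours are the same pairing as above: Ada with 350M, Babbage with 1.3B, Curie with 6.7B, and Davinci with 175B. A nearest-neighbour match like this only says which paper model each API model most resembles on these four tasks; it can't rule out a differently sized model that has been trained or fine-tuned differently.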