The GPT-3 paper didn't explore fine-tuning on downstream tasks, so I decided to tune Neo 2.7B for 1.1k iterations on all the tasks in eval harness that have a train set (all at once, since tuning one model per task would have taken ages). I was quite surprised that the tuned model didn't completely destroy untuned 2.7B across the board; from eyeballing the results it looks more like a tossup. Interestingly, the tuned model does beat 2.7B by a wide margin on anli, which is especially notable given that anli is one of the tasks the models in the GPT-3 paper struggled on. Lambada and pubmedqa are also included in these tables even though they don't have train sets (at least in the eval harness implementation, which uses the OA version of lambada), because I wanted to look at tasks outside the tuning mix to check for catastrophic forgetting or the like. Sure enough, the lambada and pubmedqa scores are significantly worse for the tuned model.
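For anyone who wants to replicate this kind of comparison, here's a minimal sketch of the setup, assuming the HF transformers Trainer and datasets libraries. This is an illustration, not the exact script I ran: the prompt formatting, `load_task` helper, batch size, and output paths are all stand-ins (only the 1.1k steps matches the text above).

```python
# Minimal sketch of multitask tuning, assuming HF transformers + datasets.
# Prompt formatting, helper names, and hyperparameters (other than the
# 1100 steps) are illustrative stand-ins, not the exact settings used.
from datasets import concatenate_datasets, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")

def load_task(name, split, fmt):
    # Render each example as plain text (ideally the same way the eval
    # harness prompts it) and drop the task-specific columns.
    ds = load_dataset(name, split=split)
    return ds.map(lambda ex: {"text": fmt(ex)}, remove_columns=ds.column_names)

# Mix the train splits of every harness task that has one into a single
# dataset, so one model is tuned on everything at once.
train = concatenate_datasets([
    load_task("anli", "train_r1", lambda ex:
              f"{ex['premise']}\nQuestion: {ex['hypothesis']} True, False, or Neither?"),
    # ... one entry per eval harness task with a train set ...
]).shuffle(seed=42)

def tokenize(ex):
    return tokenizer(ex["text"], truncation=True, max_length=2048)

train = train.map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="neo-2.7B-tuned",
                           max_steps=1100,
                           per_device_train_batch_size=4),
    train_dataset=train,
    # mlm=False gives ordinary causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Both checkpoints can then be scored with the eval harness, along the
# lines of (harness CLI of that era):
#   python main.py --model gpt2 --model_args pretrained=neo-2.7B-tuned \
#       --tasks anli_r1,anli_r2,anli_r3,lambada,pubmedqa --num_fewshot 0
# and again with --num_fewshot 1 for the one-shot table.
```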
Zero shot
| Task | Metric | 2.7B | Tuned |
|---|---|---|---|
| anli_r1 | acc | 0.332 ± 0.015 | 0.418 ± 0.015 |
| anli_r2 | acc | 0.342 ± 0.015 | 0.375 ± 0.015 |
| anli_r3 | acc | 0.352 ± 0.014 | 0.392 ± 0.014 |
| arc_challenge | acc | 0.275 ± 0.013 | 0.286 ± 0.013 |
| | acc_norm | 0.301 ± 0.013 | 0.312 ± 0.013 |
| arc_easy | acc | 0.611 ± 0.010 | 0.560 ± 0.010 |
| | acc_norm | 0.539 ± 0.010 | 0.558 ± 0.010 |
| boolq | acc | 0.630 ± 0.008 | 0.605 ± 0.008 |
| cb | acc | 0.304 ± 0.062 | 0.411 ± 0.062 |
| copa | acc | 0.800 ± 0.040 | 0.730 ± 0.040 |
| ethics_cm | acc | 0.510 ± 0.008 | 0.561 ± 0.008 |
| ethics_deontology | acc | 0.497 ± 0.008 | 0.658 ± 0.008 |
| ethics_justice | acc | 0.501 ± 0.010 | 0.589 ± 0.010 |
| ethics_utilitarianism | acc | 0.497 ± 0.007 | 0.498 ± 0.007 |
| ethics_virtue | acc | 0.251 ± 0.006 | 0.800 ± 0.006 |
| headqa | acc | 0.235 ± 0.008 | 0.233 ± 0.008 |
| | acc_norm | 0.272 ± 0.008 | 0.265 ± 0.008 |
| hellaswag | acc | 0.427 ± 0.005 | 0.400 ± 0.005 |
| | acc_norm | 0.558 ± 0.005 | 0.517 ± 0.005 |
| hendrycksTest-abstract_algebra | acc | 0.230 ± 0.042 | 0.340 ± 0.042 |
| | acc_norm | 0.200 ± 0.040 | 0.350 ± 0.040 |
| hendrycksTest-anatomy | acc | 0.252 ± 0.037 | 0.267 ± 0.037 |
| | acc_norm | 0.222 ± 0.036 | 0.252 ± 0.036 |
| hendrycksTest-astronomy | acc | 0.250 ± 0.035 | 0.309 ± 0.035 |
| | acc_norm | 0.362 ± 0.039 | 0.309 ± 0.039 |
| hendrycksTest-business_ethics | acc | 0.360 ± 0.048 | 0.340 ± 0.048 |
| | acc_norm | 0.280 ± 0.045 | 0.310 ± 0.045 |
| hendrycksTest-clinical_knowledge | acc | 0.291 ± 0.028 | 0.370 ± 0.028 |
| | acc_norm | 0.287 ± 0.028 | 0.374 ± 0.028 |
| hendrycksTest-college_biology | acc | 0.250 ± 0.036 | 0.250 ± 0.036 |
| | acc_norm | 0.222 ± 0.035 | 0.271 ± 0.035 |
| hendrycksTest-college_chemistry | acc | 0.230 ± 0.042 | 0.350 ± 0.042 |
| | acc_norm | 0.250 ± 0.044 | 0.350 ± 0.044 |
| hendrycksTest-college_computer_science | acc | 0.280 ± 0.045 | 0.430 ± 0.045 |
| | acc_norm | 0.270 ± 0.045 | 0.390 ± 0.045 |
| hendrycksTest-college_mathematics | acc | 0.200 ± 0.040 | 0.370 ± 0.040 |
| | acc_norm | 0.300 ± 0.046 | 0.350 ± 0.046 |
| hendrycksTest-college_medicine | acc | 0.254 ± 0.033 | 0.312 ± 0.033 |
| | acc_norm | 0.260 ± 0.033 | 0.306 ± 0.033 |
| hendrycksTest-college_physics | acc | 0.225 ± 0.042 | 0.275 ± 0.042 |
| | acc_norm | 0.245 ± 0.043 | 0.284 ± 0.043 |
| hendrycksTest-computer_security | acc | 0.270 ± 0.045 | 0.290 ± 0.045 |
| | acc_norm | 0.330 ± 0.047 | 0.290 ± 0.047 |
| hendrycksTest-conceptual_physics | acc | 0.247 ± 0.028 | 0.315 ± 0.028 |
| | acc_norm | 0.187 ± 0.026 | 0.319 ± 0.026 |
| hendrycksTest-econometrics | acc | 0.193 ± 0.037 | 0.272 ± 0.037 |
| | acc_norm | 0.228 ± 0.039 | 0.281 ± 0.039 |
| hendrycksTest-electrical_engineering | acc | 0.331 ± 0.039 | 0.386 ± 0.039 |
| | acc_norm | 0.338 ± 0.039 | 0.386 ± 0.039 |
| hendrycksTest-elementary_mathematics | acc | 0.230 ± 0.022 | 0.280 ± 0.022 |
| | acc_norm | 0.270 ± 0.023 | 0.278 ± 0.023 |
| hendrycksTest-formal_logic | acc | 0.333 ± 0.042 | 0.310 ± 0.042 |
| | acc_norm | 0.302 ± 0.041 | 0.278 ± 0.041 |
| hendrycksTest-global_facts | acc | 0.240 ± 0.043 | 0.250 ± 0.043 |
| | acc_norm | 0.240 ± 0.043 | 0.260 ± 0.043 |
| hendrycksTest-high_school_biology | acc | 0.219 ± 0.024 | 0.335 ± 0.024 |
| | acc_norm | 0.284 ± 0.026 | 0.329 ± 0.026 |
| hendrycksTest-high_school_chemistry | acc | 0.167 ± 0.026 | 0.207 ± 0.026 |
| | acc_norm | 0.256 ± 0.031 | 0.212 ± 0.031 |
| hendrycksTest-high_school_computer_science | acc | 0.220 ± 0.042 | 0.290 ± 0.042 |
| | acc_norm | 0.280 ± 0.045 | 0.280 ± 0.045 |
| hendrycksTest-high_school_european_history | acc | 0.267 ± 0.035 | 0.358 ± 0.035 |
| | acc_norm | 0.285 ± 0.035 | 0.358 ± 0.035 |
| hendrycksTest-high_school_geography | acc | 0.227 ± 0.030 | 0.359 ± 0.030 |
| | acc_norm | 0.298 ± 0.033 | 0.333 ± 0.033 |
| hendrycksTest-high_school_government_and_politics | acc | 0.207 ± 0.029 | 0.301 ± 0.029 |
| | acc_norm | 0.259 ± 0.032 | 0.311 ± 0.032 |
| hendrycksTest-high_school_macroeconomics | acc | 0.262 ± 0.022 | 0.267 ± 0.022 |
| | acc_norm | 0.267 ± 0.022 | 0.262 ± 0.022 |
| hendrycksTest-high_school_mathematics | acc | 0.174 ± 0.023 | 0.248 ± 0.023 |
| | acc_norm | 0.244 ± 0.026 | 0.270 ± 0.026 |
| hendrycksTest-high_school_microeconomics | acc | 0.256 ± 0.028 | 0.265 ± 0.028 |
| | acc_norm | 0.328 ± 0.030 | 0.277 ± 0.030 |
| hendrycksTest-high_school_physics | acc | 0.225 ± 0.034 | 0.212 ± 0.034 |
| | acc_norm | 0.219 ± 0.034 | 0.225 ± 0.034 |
| hendrycksTest-high_school_psychology | acc | 0.253 ± 0.019 | 0.338 ± 0.019 |
| | acc_norm | 0.261 ± 0.019 | 0.330 ± 0.019 |
| hendrycksTest-high_school_statistics | acc | 0.264 ± 0.030 | 0.278 ± 0.030 |
| | acc_norm | 0.338 ± 0.032 | 0.273 ± 0.032 |
| hendrycksTest-high_school_us_history | acc | 0.235 ± 0.030 | 0.230 ± 0.030 |
| | acc_norm | 0.270 ± 0.031 | 0.235 ± 0.031 |
| hendrycksTest-high_school_world_history | acc | 0.270 ± 0.029 | 0.388 ± 0.029 |
| | acc_norm | 0.300 ± 0.030 | 0.392 ± 0.030 |
| hendrycksTest-human_aging | acc | 0.296 ± 0.031 | 0.318 ± 0.031 |
| | acc_norm | 0.238 ± 0.029 | 0.314 ± 0.029 |
| hendrycksTest-human_sexuality | acc | 0.336 ± 0.041 | 0.290 ± 0.041 |
| | acc_norm | 0.290 ± 0.040 | 0.290 ± 0.040 |
| hendrycksTest-international_law | acc | 0.248 ± 0.039 | 0.322 ± 0.039 |
| | acc_norm | 0.496 ± 0.046 | 0.347 ± 0.046 |
| hendrycksTest-jurisprudence | acc | 0.250 ± 0.042 | 0.269 ± 0.042 |
| | acc_norm | 0.426 ± 0.048 | 0.296 ± 0.048 |
| hendrycksTest-logical_fallacies | acc | 0.209 ± 0.032 | 0.258 ± 0.032 |
| | acc_norm | 0.288 ± 0.036 | 0.264 ± 0.036 |
| hendrycksTest-machine_learning | acc | 0.295 ± 0.043 | 0.250 ± 0.043 |
| | acc_norm | 0.259 ± 0.042 | 0.259 ± 0.042 |
| hendrycksTest-management | acc | 0.184 ± 0.038 | 0.311 ± 0.038 |
| | acc_norm | 0.282 ± 0.045 | 0.330 ± 0.045 |
| hendrycksTest-marketing | acc | 0.316 ± 0.030 | 0.432 ± 0.030 |
| | acc_norm | 0.338 ± 0.031 | 0.440 ± 0.031 |
| hendrycksTest-medical_genetics | acc | 0.300 ± 0.046 | 0.240 ± 0.046 |
| | acc_norm | 0.370 ± 0.049 | 0.270 ± 0.049 |
| hendrycksTest-miscellaneous | acc | 0.281 ± 0.016 | 0.323 ± 0.016 |
| | acc_norm | 0.271 ± 0.016 | 0.328 ± 0.016 |
| hendrycksTest-moral_disputes | acc | 0.286 ± 0.024 | 0.350 ± 0.024 |
| | acc_norm | 0.355 ± 0.026 | 0.364 ± 0.026 |
| hendrycksTest-moral_scenarios | acc | 0.234 ± 0.014 | 0.264 ± 0.014 |
| | acc_norm | 0.273 ± 0.015 | 0.269 ± 0.015 |
| hendrycksTest-nutrition | acc | 0.275 ± 0.026 | 0.307 ± 0.026 |
| | acc_norm | 0.359 ± 0.027 | 0.333 ± 0.027 |
| hendrycksTest-philosophy | acc | 0.270 ± 0.025 | 0.305 ± 0.025 |
| | acc_norm | 0.315 ± 0.026 | 0.322 ± 0.026 |
| hendrycksTest-prehistory | acc | 0.256 ± 0.024 | 0.361 ± 0.024 |
| | acc_norm | 0.216 ± 0.023 | 0.364 ± 0.023 |
| hendrycksTest-professional_accounting | acc | 0.248 ± 0.026 | 0.230 ± 0.026 |
| | acc_norm | 0.259 ± 0.026 | 0.220 ± 0.026 |
| hendrycksTest-professional_law | acc | 0.267 ± 0.011 | 0.275 ± 0.011 |
| | acc_norm | 0.300 ± 0.012 | 0.284 ± 0.012 |
| hendrycksTest-professional_medicine | acc | 0.246 ± 0.026 | 0.290 ± 0.026 |
| | acc_norm | 0.232 ± 0.026 | 0.298 ± 0.026 |
| hendrycksTest-professional_psychology | acc | 0.258 ± 0.018 | 0.299 ± 0.018 |
| | acc_norm | 0.253 ± 0.018 | 0.315 ± 0.018 |
| hendrycksTest-public_relations | acc | 0.300 ± 0.044 | 0.364 ± 0.044 |
| | acc_norm | 0.164 ± 0.035 | 0.373 ± 0.035 |
| hendrycksTest-security_studies | acc | 0.339 ± 0.030 | 0.343 ± 0.030 |
| | acc_norm | 0.286 ± 0.029 | 0.286 ± 0.029 |
| hendrycksTest-sociology | acc | 0.269 ± 0.031 | 0.403 ± 0.031 |
| | acc_norm | 0.264 ± 0.031 | 0.423 ± 0.031 |
| hendrycksTest-us_foreign_policy | acc | 0.330 ± 0.047 | 0.390 ± 0.047 |
| | acc_norm | 0.350 ± 0.048 | 0.390 ± 0.048 |
| hendrycksTest-virology | acc | 0.313 ± 0.036 | 0.325 ± 0.036 |
| | acc_norm | 0.331 ± 0.037 | 0.343 ± 0.037 |
| hendrycksTest-world_religions | acc | 0.304 ± 0.035 | 0.316 ± 0.035 |
| | acc_norm | 0.386 ± 0.037 | 0.339 ± 0.037 |
| logiqa | acc | 0.201 ± 0.016 | 0.280 ± 0.016 |
| | acc_norm | 0.281 ± 0.018 | 0.283 ± 0.018 |
| mathqa | acc | 0.247 ± 0.008 | 0.248 ± 0.008 |
| | acc_norm | 0.246 ± 0.008 | 0.239 ± 0.008 |
| mnli | acc | 0.339 ± 0.005 | 0.729 ± 0.005 |
| mnli_mismatched | acc | 0.338 ± 0.005 | 0.742 ± 0.005 |
| mrpc | acc | 0.684 ± 0.023 | 0.701 ± 0.023 |
| | f1 | 0.812 ± 0.016 | 0.820 ± 0.016 |
| multirc | acc | 0.016 ± 0.004 | 0.004 ± 0.004 |
| openbookqa | acc | 0.234 ± 0.019 | 0.248 ± 0.019 |
| | acc_norm | 0.332 ± 0.021 | 0.318 ± 0.021 |
| piqa | acc | 0.721 ± 0.010 | 0.713 ± 0.010 |
| | acc_norm | 0.729 ± 0.010 | 0.708 ± 0.010 |
| qnli | acc | 0.509 ± 0.007 | 0.761 ± 0.007 |
| qqp | acc | 0.368 ± 0.002 | 0.843 ± 0.002 |
| | f1 | 0.538 ± 0.003 | 0.789 ± 0.003 |
| race | acc | 0.353 ± 0.015 | 0.362 ± 0.015 |
| record | f1 | 0.845 ± 0.004 | 0.779 ± 0.004 |
| | em | 0.838 ± 0.004 | 0.770 ± 0.004 |
| rte | acc | 0.520 ± 0.030 | 0.729 ± 0.030 |
| sciq | acc | 0.893 ± 0.010 | 0.919 ± 0.010 |
| | acc_norm | 0.828 ± 0.012 | 0.913 ± 0.012 |
| sst | acc | 0.789 ± 0.014 | 0.862 ± 0.014 |
| webqs | acc | 0.016 ± 0.003 | 0.071 ± 0.003 |
| wic | acc | 0.500 ± 0.020 | 0.517 ± 0.020 |
| winogrande | acc | 0.575 ± 0.014 | 0.570 ± 0.014 |
| wnli | acc | 0.310 ± 0.055 | 0.563 ± 0.055 |
| wsc | acc | 0.365 ± 0.047 | 0.365 ± 0.047 |
| lambada | ppl | 5.626 ± 0.139 | 27.796 ± 0.139 |
| | acc | 0.622 ± 0.007 | 0.387 ± 0.007 |
| pubmedqa | acc | 0.565 ± 0.016 | 0.496 ± 0.016 |
| coqa | f1 | 0.604 ± 0.018 | 0.598 ± 0.018 |
| | em | 0.479 ± 0.020 | 0.480 ± 0.020 |
| drop | em | 0.026 ± 0.002 | 0.001 ± 0.002 |
| | f1 | 0.083 ± 0.002 | 0.033 ± 0.002 |
| math_algebra | acc | 0.008 ± 0.003 | 0.025 ± 0.003 |
| math_geometry | acc | 0.002 ± 0.002 | 0.021 ± 0.002 |
| math_intermediate_algebra | acc | 0.004 ± 0.002 | 0.025 ± 0.002 |
| math_num_theory | acc | 0.019 ± 0.006 | 0.046 ± 0.006 |
| math_prealgebra | acc | 0.001 ± 0.001 | 0.039 ± 0.001 |
| math_precalc | acc | 0.005 ± 0.003 | 0.016 ± 0.003 |
One shot
| Task | Metric | 2.7B | Tuned |
|---|---|---|---|
| anli_r1 | acc | 0.331 ± 0.015 | 0.443 ± 0.015 |
| anli_r2 | acc | 0.307 ± 0.015 | 0.373 ± 0.015 |
| anli_r3 | acc | 0.343 ± 0.014 | 0.423 ± 0.014 |
| arc_challenge | acc | 0.302 ± 0.013 | 0.292 ± 0.013 |
| | acc_norm | 0.323 ± 0.014 | 0.323 ± 0.014 |
| arc_easy | acc | 0.634 ± 0.010 | 0.567 ± 0.010 |
| | acc_norm | 0.622 ± 0.010 | 0.562 ± 0.010 |
| boolq | acc | 0.536 ± 0.009 | 0.620 ± 0.009 |
| cb | acc | 0.429 ± 0.067 | 0.411 ± 0.067 |
| cola | mcc | 0.001 ± 0.031 | 0.022 ± 0.031 |
| copa | acc | 0.770 ± 0.042 | 0.780 ± 0.042 |
| ethics_cm | acc | 0.508 ± 0.008 | 0.625 ± 0.008 |
| ethics_deontology | acc | 0.511 ± 0.008 | 0.683 ± 0.008 |
| ethics_justice | acc | 0.515 ± 0.010 | 0.604 ± 0.010 |
| ethics_utilitarianism | acc | 0.490 ± 0.007 | 0.536 ± 0.007 |
| ethics_virtue | acc | 0.726 ± 0.006 | 0.805 ± 0.006 |
| headqa | acc | 0.230 ± 0.008 | 0.228 ± 0.008 |
| | acc_norm | 0.270 ± 0.008 | 0.275 ± 0.008 |
| hellaswag | acc | 0.428 ± 0.005 | 0.386 ± 0.005 |
| | acc_norm | 0.557 ± 0.005 | 0.494 ± 0.005 |
| hendrycksTest-abstract_algebra | acc | 0.220 ± 0.042 | 0.270 ± 0.042 |
| | acc_norm | 0.290 ± 0.046 | 0.260 ± 0.046 |
| hendrycksTest-anatomy | acc | 0.289 ± 0.039 | 0.304 ± 0.039 |
| | acc_norm | 0.230 ± 0.036 | 0.289 ± 0.036 |
| hendrycksTest-astronomy | acc | 0.204 ± 0.033 | 0.322 ± 0.033 |
| | acc_norm | 0.303 ± 0.037 | 0.322 ± 0.037 |
| hendrycksTest-business_ethics | acc | 0.290 ± 0.046 | 0.320 ± 0.046 |
| | acc_norm | 0.280 ± 0.045 | 0.280 ± 0.045 |
| hendrycksTest-clinical_knowledge | acc | 0.287 ± 0.028 | 0.351 ± 0.028 |
| | acc_norm | 0.328 ± 0.029 | 0.358 ± 0.029 |
| hendrycksTest-college_biology | acc | 0.215 ± 0.034 | 0.271 ± 0.034 |
| | acc_norm | 0.194 ± 0.033 | 0.271 ± 0.033 |
| hendrycksTest-college_chemistry | acc | 0.300 ± 0.046 | 0.330 ± 0.046 |
| | acc_norm | 0.340 ± 0.048 | 0.320 ± 0.048 |
| hendrycksTest-college_computer_science | acc | 0.330 ± 0.047 | 0.390 ± 0.047 |
| | acc_norm | 0.310 ± 0.046 | 0.360 ± 0.046 |
| hendrycksTest-college_mathematics | acc | 0.200 ± 0.040 | 0.280 ± 0.040 |
| | acc_norm | 0.220 ± 0.042 | 0.270 ± 0.042 |
| hendrycksTest-college_medicine | acc | 0.254 ± 0.033 | 0.295 ± 0.033 |
| | acc_norm | 0.260 ± 0.033 | 0.283 ± 0.033 |
| hendrycksTest-college_physics | acc | 0.304 ± 0.046 | 0.284 ± 0.046 |
| | acc_norm | 0.333 ± 0.047 | 0.304 ± 0.047 |
| hendrycksTest-computer_security | acc | 0.320 ± 0.047 | 0.270 ± 0.047 |
| | acc_norm | 0.320 ± 0.047 | 0.290 ± 0.047 |
| hendrycksTest-conceptual_physics | acc | 0.268 ± 0.029 | 0.349 ± 0.029 |
| | acc_norm | 0.255 ± 0.029 | 0.345 ± 0.029 |
| hendrycksTest-econometrics | acc | 0.298 ± 0.043 | 0.272 ± 0.043 |
| | acc_norm | 0.298 ± 0.043 | 0.263 ± 0.043 |
| hendrycksTest-electrical_engineering | acc | 0.338 ± 0.039 | 0.324 ± 0.039 |
| | acc_norm | 0.290 ± 0.038 | 0.303 ± 0.038 |
| hendrycksTest-elementary_mathematics | acc | 0.262 ± 0.023 | 0.275 ± 0.023 |
| | acc_norm | 0.294 ± 0.023 | 0.275 ± 0.023 |
| hendrycksTest-formal_logic | acc | 0.310 ± 0.041 | 0.310 ± 0.041 |
| | acc_norm | 0.294 ± 0.041 | 0.270 ± 0.041 |
| hendrycksTest-global_facts | acc | 0.200 ± 0.040 | 0.290 ± 0.040 |
| | acc_norm | 0.210 ± 0.041 | 0.290 ± 0.041 |
| hendrycksTest-high_school_biology | acc | 0.265 ± 0.025 | 0.342 ± 0.025 |
| | acc_norm | 0.287 ± 0.026 | 0.342 ± 0.026 |
| hendrycksTest-high_school_chemistry | acc | 0.251 ± 0.031 | 0.232 ± 0.031 |
| | acc_norm | 0.291 ± 0.032 | 0.227 ± 0.032 |
| hendrycksTest-high_school_computer_science | acc | 0.260 ± 0.044 | 0.280 ± 0.044 |
| | acc_norm | 0.300 ± 0.046 | 0.260 ± 0.046 |
| hendrycksTest-high_school_european_history | acc | 0.267 ± 0.035 | 0.309 ± 0.035 |
| | acc_norm | 0.315 ± 0.036 | 0.321 ± 0.036 |
| hendrycksTest-high_school_geography | acc | 0.227 ± 0.030 | 0.348 ± 0.030 |
| | acc_norm | 0.278 ± 0.032 | 0.354 ± 0.032 |
| hendrycksTest-high_school_government_and_politics | acc | 0.290 ± 0.033 | 0.332 ± 0.033 |
| | acc_norm | 0.290 ± 0.033 | 0.321 ± 0.033 |
| hendrycksTest-high_school_macroeconomics | acc | 0.279 ± 0.023 | 0.305 ± 0.023 |
| | acc_norm | 0.267 ± 0.022 | 0.285 ± 0.022 |
| hendrycksTest-high_school_mathematics | acc | 0.252 ± 0.026 | 0.278 ± 0.026 |
| | acc_norm | 0.296 ± 0.028 | 0.304 ± 0.028 |
| hendrycksTest-high_school_microeconomics | acc | 0.265 ± 0.029 | 0.256 ± 0.029 |
| | acc_norm | 0.324 ± 0.030 | 0.273 ± 0.030 |
| hendrycksTest-high_school_physics | acc | 0.205 ± 0.033 | 0.205 ± 0.033 |
| | acc_norm | 0.232 ± 0.034 | 0.212 ± 0.034 |
| hendrycksTest-high_school_psychology | acc | 0.251 ± 0.019 | 0.328 ± 0.019 |
| | acc_norm | 0.270 ± 0.019 | 0.325 ± 0.019 |
| hendrycksTest-high_school_statistics | acc | 0.319 ± 0.032 | 0.241 ± 0.032 |
| | acc_norm | 0.319 ± 0.032 | 0.245 ± 0.032 |
| hendrycksTest-high_school_us_history | acc | 0.265 ± 0.031 | 0.221 ± 0.031 |
| | acc_norm | 0.260 ± 0.031 | 0.230 ± 0.031 |
| hendrycksTest-high_school_world_history | acc | 0.283 ± 0.029 | 0.371 ± 0.029 |
| | acc_norm | 0.266 ± 0.029 | 0.380 ± 0.029 |
| hendrycksTest-human_aging | acc | 0.296 ± 0.031 | 0.296 ± 0.031 |
| | acc_norm | 0.274 ± 0.030 | 0.291 ± 0.030 |
| hendrycksTest-human_sexuality | acc | 0.351 ± 0.042 | 0.290 ± 0.042 |
| | acc_norm | 0.282 ± 0.039 | 0.290 ± 0.039 |
| hendrycksTest-international_law | acc | 0.248 ± 0.039 | 0.322 ± 0.039 |
| | acc_norm | 0.347 ± 0.043 | 0.331 ± 0.043 |
| hendrycksTest-jurisprudence | acc | 0.269 ± 0.043 | 0.296 ± 0.043 |
| | acc_norm | 0.370 ± 0.047 | 0.296 ± 0.047 |
| hendrycksTest-logical_fallacies | acc | 0.202 ± 0.032 | 0.276 ± 0.032 |
| | acc_norm | 0.270 ± 0.035 | 0.258 ± 0.035 |
| hendrycksTest-machine_learning | acc | 0.295 ± 0.043 | 0.250 ± 0.043 |
| | acc_norm | 0.330 ± 0.045 | 0.223 ± 0.045 |
| hendrycksTest-management | acc | 0.282 ± 0.045 | 0.320 ± 0.045 |
| | acc_norm | 0.272 ± 0.044 | 0.350 ± 0.044 |
| hendrycksTest-marketing | acc | 0.303 ± 0.030 | 0.415 ± 0.030 |
| | acc_norm | 0.329 ± 0.031 | 0.423 ± 0.031 |
| hendrycksTest-medical_genetics | acc | 0.330 ± 0.047 | 0.300 ± 0.047 |
| | acc_norm | 0.420 ± 0.050 | 0.300 ± 0.050 |
| hendrycksTest-miscellaneous | acc | 0.319 ± 0.017 | 0.318 ± 0.017 |
| | acc_norm | 0.319 ± 0.017 | 0.313 ± 0.017 |
| hendrycksTest-moral_disputes | acc | 0.298 ± 0.025 | 0.341 ± 0.025 |
| | acc_norm | 0.318 ± 0.025 | 0.344 ± 0.025 |
| hendrycksTest-moral_scenarios | acc | 0.267 ± 0.015 | 0.240 ± 0.015 |
| | acc_norm | 0.265 ± 0.015 | 0.238 ± 0.015 |
| hendrycksTest-nutrition | acc | 0.278 ± 0.026 | 0.330 ± 0.026 |
| | acc_norm | 0.337 ± 0.027 | 0.350 ± 0.027 |
| hendrycksTest-philosophy | acc | 0.251 ± 0.025 | 0.315 ± 0.025 |
| | acc_norm | 0.293 ± 0.026 | 0.325 ± 0.026 |
| hendrycksTest-prehistory | acc | 0.244 ± 0.024 | 0.352 ± 0.024 |
| | acc_norm | 0.250 ± 0.024 | 0.361 ± 0.024 |
| hendrycksTest-professional_accounting | acc | 0.287 ± 0.027 | 0.213 ± 0.027 |
| | acc_norm | 0.248 ± 0.026 | 0.216 ± 0.026 |
| hendrycksTest-professional_law | acc | 0.273 ± 0.011 | 0.267 ± 0.011 |
| | acc_norm | 0.269 ± 0.011 | 0.269 ± 0.011 |
| hendrycksTest-professional_medicine | acc | 0.301 ± 0.028 | 0.301 ± 0.028 |
| | acc_norm | 0.268 ± 0.027 | 0.327 ± 0.027 |
| hendrycksTest-professional_psychology | acc | 0.279 ± 0.018 | 0.304 ± 0.018 |
| | acc_norm | 0.284 ± 0.018 | 0.310 ± 0.018 |
| hendrycksTest-public_relations | acc | 0.327 ± 0.045 | 0.345 ± 0.045 |
| | acc_norm | 0.309 ± 0.044 | 0.336 ± 0.044 |
| hendrycksTest-security_studies | acc | 0.265 ± 0.028 | 0.331 ± 0.028 |
| | acc_norm | 0.208 ± 0.026 | 0.290 ± 0.026 |
| hendrycksTest-sociology | acc | 0.269 ± 0.031 | 0.393 ± 0.031 |
| | acc_norm | 0.249 ± 0.031 | 0.383 ± 0.031 |
| hendrycksTest-us_foreign_policy | acc | 0.290 ± 0.046 | 0.320 ± 0.046 |
| | acc_norm | 0.320 ± 0.047 | 0.320 ± 0.047 |
| hendrycksTest-virology | acc | 0.289 ± 0.035 | 0.349 ± 0.035 |
| | acc_norm | 0.265 ± 0.034 | 0.355 ± 0.034 |
| hendrycksTest-world_religions | acc | 0.374 ± 0.037 | 0.345 ± 0.037 |
| | acc_norm | 0.409 ± 0.038 | 0.351 ± 0.038 |
| logiqa | acc | 0.255 ± 0.017 | 0.273 ± 0.017 |
| | acc_norm | 0.272 ± 0.017 | 0.280 ± 0.017 |
| mathqa | acc | 0.256 ± 0.008 | 0.253 ± 0.008 |
| | acc_norm | 0.258 ± 0.008 | 0.240 ± 0.008 |
| mnli | acc | 0.338 ± 0.005 | 0.801 ± 0.005 |
| mnli_mismatched | acc | 0.362 ± 0.005 | 0.811 ± 0.005 |
| mrpc | acc | 0.571 ± 0.025 | 0.750 ± 0.025 |
| | f1 | 0.689 ± 0.022 | 0.841 ± 0.022 |
| multirc | acc | 0.047 ± 0.007 | 0.012 ± 0.007 |
| openbookqa | acc | 0.222 ± 0.019 | 0.268 ± 0.019 |
| | acc_norm | 0.346 ± 0.021 | 0.344 ± 0.021 |
| piqa | acc | 0.726 ± 0.010 | 0.714 ± 0.010 |
| | acc_norm | 0.736 ± 0.010 | 0.718 ± 0.010 |
| qnli | acc | 0.504 ± 0.007 | 0.788 ± 0.007 |
| qqp | acc | 0.534 ± 0.002 | 0.847 ± 0.002 |
| | f1 | 0.372 ± 0.004 | 0.793 ± 0.004 |
| race | acc | 0.352 ± 0.015 | 0.355 ± 0.015 |
| record | f1 | 0.843 ± 0.004 | 0.778 ± 0.004 |
| | em | 0.835 ± 0.004 | 0.771 ± 0.004 |
| rte | acc | 0.491 ± 0.030 | 0.747 ± 0.030 |
| sciq | acc | 0.930 ± 0.008 | 0.939 ± 0.008 |
| | acc_norm | 0.938 ± 0.008 | 0.935 ± 0.008 |
| sst | acc | 0.492 ± 0.017 | 0.916 ± 0.017 |
| webqs | acc | 0.054 ± 0.005 | 0.095 ± 0.005 |
| wic | acc | 0.472 ± 0.020 | 0.539 ± 0.020 |
| winogrande | acc | 0.582 ± 0.014 | 0.571 ± 0.014 |
| wnli | acc | 0.380 ± 0.058 | 0.549 ± 0.058 |
| wsc | acc | 0.365 ± 0.047 | 0.365 ± 0.047 |
| lambada | ppl | 6.423 ± 0.162 | 20.150 ± 0.162 |
| | acc | 0.576 ± 0.007 | 0.394 ± 0.007 |
| pubmedqa | acc | 0.529 ± 0.016 | 0.479 ± 0.016 |
| coqa | f1 | 0.606 ± 0.018 | 0.581 ± 0.018 |
| | em | 0.484 ± 0.020 | 0.472 ± 0.020 |
| drop | em | 0.001 ± 0.000 | 0.001 ± 0.000 |
| | f1 | 0.039 ± 0.001 | 0.031 ± 0.001 |
| math_algebra | acc | 0.016 ± 0.004 | 0.024 ± 0.004 |
| math_counting_and_prob | acc | 0.023 ± 0.007 | 0.030 ± 0.007 |
| math_geometry | acc | 0.006 ± 0.004 | 0.021 ± 0.004 |
| math_intermediate_algebra | acc | 0.020 ± 0.005 | 0.029 ± 0.005 |
| math_num_theory | acc | 0.037 ± 0.008 | 0.039 ± 0.008 |
| math_prealgebra | acc | 0.023 ± 0.005 | 0.041 ± 0.005 |
| math_precalc | acc | 0.015 ± 0.005 | 0.022 ± 0.005 |
The model can be downloaded here, though I don't recommend using it for anything.