Finetuning Models on Downstream Tasks

by Leo Gao

The GPT-3 paper didn’t explore fine-tuning on downstream tasks, so I decided to tune Neo 2.7B for 1.1k iterations on all the tasks in eval harness that have a train set (all at once, because tuning one model per task would have taken ages). I was quite surprised that the tuned model didn’t completely destroy untuned 2.7B on every task; from eyeballing the tables, it looks more like a tossup. Interestingly, the tuned model beats 2.7B by quite a lot on anli, which is especially notable given that this is one of the tasks the models in the GPT-3 paper struggled on. I also included lambada and pubmedqa in these tables even though they have no training sets (at least in the eval harness implementation, which uses the OA version of lambada), because I wanted to look at effects on tasks not in the tuning mix, to potentially observe some catastrophic forgetting or similar. Sure enough, lambada and pubmedqa scores are significantly worse on the tuned model.
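Rather than eyeballing, one way to check whether a tuned-vs-untuned gap clears the noise floor is to compare it against the combined standard error of the two scores. A minimal sketch (assuming the reported ± values are standard errors and are approximately normal and independent):

```python
import math

def significant_diff(base, base_se, tuned, tuned_se, z=1.96):
    """True if |tuned - base| exceeds z combined standard errors."""
    combined_se = math.sqrt(base_se ** 2 + tuned_se ** 2)
    return abs(tuned - base) > z * combined_se

# Zero-shot anli_r1 numbers from the tables below: a clear win for tuned.
print(significant_diff(0.332, 0.015, 0.418, 0.015))  # True
# Zero-shot mathqa: well within noise.
print(significant_diff(0.247, 0.008, 0.248, 0.008))  # False
```

This is only a rough filter; with this many tasks, some differences will clear a 1.96σ bar by chance, so it flags candidates rather than settling them.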

Zero-shot

Task Metric 2.7B (± stderr) Tuned (± stderr)
anli_r1 acc 0.332 ± 0.015 0.418 ± 0.015
anli_r2 acc 0.342 ± 0.015 0.375 ± 0.015
anli_r3 acc 0.352 ± 0.014 0.392 ± 0.014
arc_challenge acc 0.275 ± 0.013 0.286 ± 0.013
acc_norm 0.301 ± 0.013 0.312 ± 0.013
arc_easy acc 0.611 ± 0.010 0.560 ± 0.010
acc_norm 0.539 ± 0.010 0.558 ± 0.010
boolq acc 0.630 ± 0.008 0.605 ± 0.008
cb acc 0.304 ± 0.062 0.411 ± 0.062
copa acc 0.800 ± 0.040 0.730 ± 0.040
ethics_cm acc 0.510 ± 0.008 0.561 ± 0.008
ethics_deontology acc 0.497 ± 0.008 0.658 ± 0.008
ethics_justice acc 0.501 ± 0.010 0.589 ± 0.010
ethics_utilitarianism acc 0.497 ± 0.007 0.498 ± 0.007
ethics_virtue acc 0.251 ± 0.006 0.800 ± 0.006
headqa acc 0.235 ± 0.008 0.233 ± 0.008
acc_norm 0.272 ± 0.008 0.265 ± 0.008
hellaswag acc 0.427 ± 0.005 0.400 ± 0.005
acc_norm 0.558 ± 0.005 0.517 ± 0.005
hendrycksTest-abstract_algebra acc 0.230 ± 0.042 0.340 ± 0.042
acc_norm 0.200 ± 0.040 0.350 ± 0.040
hendrycksTest-anatomy acc 0.252 ± 0.037 0.267 ± 0.037
acc_norm 0.222 ± 0.036 0.252 ± 0.036
hendrycksTest-astronomy acc 0.250 ± 0.035 0.309 ± 0.035
acc_norm 0.362 ± 0.039 0.309 ± 0.039
hendrycksTest-business_ethics acc 0.360 ± 0.048 0.340 ± 0.048
acc_norm 0.280 ± 0.045 0.310 ± 0.045
hendrycksTest-clinical_knowledge acc 0.291 ± 0.028 0.370 ± 0.028
acc_norm 0.287 ± 0.028 0.374 ± 0.028
hendrycksTest-college_biology acc 0.250 ± 0.036 0.250 ± 0.036
acc_norm 0.222 ± 0.035 0.271 ± 0.035
hendrycksTest-college_chemistry acc 0.230 ± 0.042 0.350 ± 0.042
acc_norm 0.250 ± 0.044 0.350 ± 0.044
hendrycksTest-college_computer_science acc 0.280 ± 0.045 0.430 ± 0.045
acc_norm 0.270 ± 0.045 0.390 ± 0.045
hendrycksTest-college_mathematics acc 0.200 ± 0.040 0.370 ± 0.040
acc_norm 0.300 ± 0.046 0.350 ± 0.046
hendrycksTest-college_medicine acc 0.254 ± 0.033 0.312 ± 0.033
acc_norm 0.260 ± 0.033 0.306 ± 0.033
hendrycksTest-college_physics acc 0.225 ± 0.042 0.275 ± 0.042
acc_norm 0.245 ± 0.043 0.284 ± 0.043
hendrycksTest-computer_security acc 0.270 ± 0.045 0.290 ± 0.045
acc_norm 0.330 ± 0.047 0.290 ± 0.047
hendrycksTest-conceptual_physics acc 0.247 ± 0.028 0.315 ± 0.028
acc_norm 0.187 ± 0.026 0.319 ± 0.026
hendrycksTest-econometrics acc 0.193 ± 0.037 0.272 ± 0.037
acc_norm 0.228 ± 0.039 0.281 ± 0.039
hendrycksTest-electrical_engineering acc 0.331 ± 0.039 0.386 ± 0.039
acc_norm 0.338 ± 0.039 0.386 ± 0.039
hendrycksTest-elementary_mathematics acc 0.230 ± 0.022 0.280 ± 0.022
acc_norm 0.270 ± 0.023 0.278 ± 0.023
hendrycksTest-formal_logic acc 0.333 ± 0.042 0.310 ± 0.042
acc_norm 0.302 ± 0.041 0.278 ± 0.041
hendrycksTest-global_facts acc 0.240 ± 0.043 0.250 ± 0.043
acc_norm 0.240 ± 0.043 0.260 ± 0.043
hendrycksTest-high_school_biology acc 0.219 ± 0.024 0.335 ± 0.024
acc_norm 0.284 ± 0.026 0.329 ± 0.026
hendrycksTest-high_school_chemistry acc 0.167 ± 0.026 0.207 ± 0.026
acc_norm 0.256 ± 0.031 0.212 ± 0.031
hendrycksTest-high_school_computer_science acc 0.220 ± 0.042 0.290 ± 0.042
acc_norm 0.280 ± 0.045 0.280 ± 0.045
hendrycksTest-high_school_european_history acc 0.267 ± 0.035 0.358 ± 0.035
acc_norm 0.285 ± 0.035 0.358 ± 0.035
hendrycksTest-high_school_geography acc 0.227 ± 0.030 0.359 ± 0.030
acc_norm 0.298 ± 0.033 0.333 ± 0.033
hendrycksTest-high_school_government_and_politics acc 0.207 ± 0.029 0.301 ± 0.029
acc_norm 0.259 ± 0.032 0.311 ± 0.032
hendrycksTest-high_school_macroeconomics acc 0.262 ± 0.022 0.267 ± 0.022
acc_norm 0.267 ± 0.022 0.262 ± 0.022
hendrycksTest-high_school_mathematics acc 0.174 ± 0.023 0.248 ± 0.023
acc_norm 0.244 ± 0.026 0.270 ± 0.026
hendrycksTest-high_school_microeconomics acc 0.256 ± 0.028 0.265 ± 0.028
acc_norm 0.328 ± 0.030 0.277 ± 0.030
hendrycksTest-high_school_physics acc 0.225 ± 0.034 0.212 ± 0.034
acc_norm 0.219 ± 0.034 0.225 ± 0.034
hendrycksTest-high_school_psychology acc 0.253 ± 0.019 0.338 ± 0.019
acc_norm 0.261 ± 0.019 0.330 ± 0.019
hendrycksTest-high_school_statistics acc 0.264 ± 0.030 0.278 ± 0.030
acc_norm 0.338 ± 0.032 0.273 ± 0.032
hendrycksTest-high_school_us_history acc 0.235 ± 0.030 0.230 ± 0.030
acc_norm 0.270 ± 0.031 0.235 ± 0.031
hendrycksTest-high_school_world_history acc 0.270 ± 0.029 0.388 ± 0.029
acc_norm 0.300 ± 0.030 0.392 ± 0.030
hendrycksTest-human_aging acc 0.296 ± 0.031 0.318 ± 0.031
acc_norm 0.238 ± 0.029 0.314 ± 0.029
hendrycksTest-human_sexuality acc 0.336 ± 0.041 0.290 ± 0.041
acc_norm 0.290 ± 0.040 0.290 ± 0.040
hendrycksTest-international_law acc 0.248 ± 0.039 0.322 ± 0.039
acc_norm 0.496 ± 0.046 0.347 ± 0.046
hendrycksTest-jurisprudence acc 0.250 ± 0.042 0.269 ± 0.042
acc_norm 0.426 ± 0.048 0.296 ± 0.048
hendrycksTest-logical_fallacies acc 0.209 ± 0.032 0.258 ± 0.032
acc_norm 0.288 ± 0.036 0.264 ± 0.036
hendrycksTest-machine_learning acc 0.295 ± 0.043 0.250 ± 0.043
acc_norm 0.259 ± 0.042 0.259 ± 0.042
hendrycksTest-management acc 0.184 ± 0.038 0.311 ± 0.038
acc_norm 0.282 ± 0.045 0.330 ± 0.045
hendrycksTest-marketing acc 0.316 ± 0.030 0.432 ± 0.030
acc_norm 0.338 ± 0.031 0.440 ± 0.031
hendrycksTest-medical_genetics acc 0.300 ± 0.046 0.240 ± 0.046
acc_norm 0.370 ± 0.049 0.270 ± 0.049
hendrycksTest-miscellaneous acc 0.281 ± 0.016 0.323 ± 0.016
acc_norm 0.271 ± 0.016 0.328 ± 0.016
hendrycksTest-moral_disputes acc 0.286 ± 0.024 0.350 ± 0.024
acc_norm 0.355 ± 0.026 0.364 ± 0.026
hendrycksTest-moral_scenarios acc 0.234 ± 0.014 0.264 ± 0.014
acc_norm 0.273 ± 0.015 0.269 ± 0.015
hendrycksTest-nutrition acc 0.275 ± 0.026 0.307 ± 0.026
acc_norm 0.359 ± 0.027 0.333 ± 0.027
hendrycksTest-philosophy acc 0.270 ± 0.025 0.305 ± 0.025
acc_norm 0.315 ± 0.026 0.322 ± 0.026
hendrycksTest-prehistory acc 0.256 ± 0.024 0.361 ± 0.024
acc_norm 0.216 ± 0.023 0.364 ± 0.023
hendrycksTest-professional_accounting acc 0.248 ± 0.026 0.230 ± 0.026
acc_norm 0.259 ± 0.026 0.220 ± 0.026
hendrycksTest-professional_law acc 0.267 ± 0.011 0.275 ± 0.011
acc_norm 0.300 ± 0.012 0.284 ± 0.012
hendrycksTest-professional_medicine acc 0.246 ± 0.026 0.290 ± 0.026
acc_norm 0.232 ± 0.026 0.298 ± 0.026
hendrycksTest-professional_psychology acc 0.258 ± 0.018 0.299 ± 0.018
acc_norm 0.253 ± 0.018 0.315 ± 0.018
hendrycksTest-public_relations acc 0.300 ± 0.044 0.364 ± 0.044
acc_norm 0.164 ± 0.035 0.373 ± 0.035
hendrycksTest-security_studies acc 0.339 ± 0.030 0.343 ± 0.030
acc_norm 0.286 ± 0.029 0.286 ± 0.029
hendrycksTest-sociology acc 0.269 ± 0.031 0.403 ± 0.031
acc_norm 0.264 ± 0.031 0.423 ± 0.031
hendrycksTest-us_foreign_policy acc 0.330 ± 0.047 0.390 ± 0.047
acc_norm 0.350 ± 0.048 0.390 ± 0.048
hendrycksTest-virology acc 0.313 ± 0.036 0.325 ± 0.036
acc_norm 0.331 ± 0.037 0.343 ± 0.037
hendrycksTest-world_religions acc 0.304 ± 0.035 0.316 ± 0.035
acc_norm 0.386 ± 0.037 0.339 ± 0.037
logiqa acc 0.201 ± 0.016 0.280 ± 0.016
acc_norm 0.281 ± 0.018 0.283 ± 0.018
mathqa acc 0.247 ± 0.008 0.248 ± 0.008
acc_norm 0.246 ± 0.008 0.239 ± 0.008
mnli acc 0.339 ± 0.005 0.729 ± 0.005
mnli_mismatched acc 0.338 ± 0.005 0.742 ± 0.005
mrpc acc 0.684 ± 0.023 0.701 ± 0.023
f1 0.812 ± 0.016 0.820 ± 0.016
multirc acc 0.016 ± 0.004 0.004 ± 0.004
openbookqa acc 0.234 ± 0.019 0.248 ± 0.019
acc_norm 0.332 ± 0.021 0.318 ± 0.021
piqa acc 0.721 ± 0.010 0.713 ± 0.010
acc_norm 0.729 ± 0.010 0.708 ± 0.010
qnli acc 0.509 ± 0.007 0.761 ± 0.007
qqp acc 0.368 ± 0.002 0.843 ± 0.002
f1 0.538 ± 0.003 0.789 ± 0.003
race acc 0.353 ± 0.015 0.362 ± 0.015
record f1 0.845 ± 0.004 0.779 ± 0.004
em 0.838 ± 0.004 0.770 ± 0.004
rte acc 0.520 ± 0.030 0.729 ± 0.030
sciq acc 0.893 ± 0.010 0.919 ± 0.010
acc_norm 0.828 ± 0.012 0.913 ± 0.012
sst acc 0.789 ± 0.014 0.862 ± 0.014
webqs acc 0.016 ± 0.003 0.071 ± 0.003
wic acc 0.500 ± 0.020 0.517 ± 0.020
winogrande acc 0.575 ± 0.014 0.570 ± 0.014
wnli acc 0.310 ± 0.055 0.563 ± 0.055
wsc acc 0.365 ± 0.047 0.365 ± 0.047
lambada ppl 5.626 ± 0.139 27.796 ± 0.139
acc 0.622 ± 0.007 0.387 ± 0.007
pubmedqa acc 0.565 ± 0.016 0.496 ± 0.016
coqa f1 0.604 ± 0.018 0.598 ± 0.018
em 0.479 ± 0.020 0.480 ± 0.020
drop em 0.026 ± 0.002 0.001 ± 0.002
f1 0.083 ± 0.002 0.033 ± 0.002
math_algebra acc 0.008 ± 0.003 0.025 ± 0.003
math_geometry acc 0.002 ± 0.002 0.021 ± 0.002
math_intermediate_algebra acc 0.004 ± 0.002 0.025 ± 0.002
math_num_theory acc 0.019 ± 0.006 0.046 ± 0.006
math_prealgebra acc 0.001 ± 0.001 0.039 ± 0.001
math_precalc acc 0.005 ± 0.003 0.016 ± 0.003

One-shot

Task Metric 2.7B (± stderr) Tuned (± stderr)
anli_r1 acc 0.331 ± 0.015 0.443 ± 0.015
anli_r2 acc 0.307 ± 0.015 0.373 ± 0.015
anli_r3 acc 0.343 ± 0.014 0.423 ± 0.014
arc_challenge acc 0.302 ± 0.013 0.292 ± 0.013
acc_norm 0.323 ± 0.014 0.323 ± 0.014
arc_easy acc 0.634 ± 0.010 0.567 ± 0.010
acc_norm 0.622 ± 0.010 0.562 ± 0.010
boolq acc 0.536 ± 0.009 0.620 ± 0.009
cb acc 0.429 ± 0.067 0.411 ± 0.067
cola mcc 0.001 ± 0.031 0.022 ± 0.031
copa acc 0.770 ± 0.042 0.780 ± 0.042
ethics_cm acc 0.508 ± 0.008 0.625 ± 0.008
ethics_deontology acc 0.511 ± 0.008 0.683 ± 0.008
ethics_justice acc 0.515 ± 0.010 0.604 ± 0.010
ethics_utilitarianism acc 0.490 ± 0.007 0.536 ± 0.007
ethics_virtue acc 0.726 ± 0.006 0.805 ± 0.006
headqa acc 0.230 ± 0.008 0.228 ± 0.008
acc_norm 0.270 ± 0.008 0.275 ± 0.008
hellaswag acc 0.428 ± 0.005 0.386 ± 0.005
acc_norm 0.557 ± 0.005 0.494 ± 0.005
hendrycksTest-abstract_algebra acc 0.220 ± 0.042 0.270 ± 0.042
acc_norm 0.290 ± 0.046 0.260 ± 0.046
hendrycksTest-anatomy acc 0.289 ± 0.039 0.304 ± 0.039
acc_norm 0.230 ± 0.036 0.289 ± 0.036
hendrycksTest-astronomy acc 0.204 ± 0.033 0.322 ± 0.033
acc_norm 0.303 ± 0.037 0.322 ± 0.037
hendrycksTest-business_ethics acc 0.290 ± 0.046 0.320 ± 0.046
acc_norm 0.280 ± 0.045 0.280 ± 0.045
hendrycksTest-clinical_knowledge acc 0.287 ± 0.028 0.351 ± 0.028
acc_norm 0.328 ± 0.029 0.358 ± 0.029
hendrycksTest-college_biology acc 0.215 ± 0.034 0.271 ± 0.034
acc_norm 0.194 ± 0.033 0.271 ± 0.033
hendrycksTest-college_chemistry acc 0.300 ± 0.046 0.330 ± 0.046
acc_norm 0.340 ± 0.048 0.320 ± 0.048
hendrycksTest-college_computer_science acc 0.330 ± 0.047 0.390 ± 0.047
acc_norm 0.310 ± 0.046 0.360 ± 0.046
hendrycksTest-college_mathematics acc 0.200 ± 0.040 0.280 ± 0.040
acc_norm 0.220 ± 0.042 0.270 ± 0.042
hendrycksTest-college_medicine acc 0.254 ± 0.033 0.295 ± 0.033
acc_norm 0.260 ± 0.033 0.283 ± 0.033
hendrycksTest-college_physics acc 0.304 ± 0.046 0.284 ± 0.046
acc_norm 0.333 ± 0.047 0.304 ± 0.047
hendrycksTest-computer_security acc 0.320 ± 0.047 0.270 ± 0.047
acc_norm 0.320 ± 0.047 0.290 ± 0.047
hendrycksTest-conceptual_physics acc 0.268 ± 0.029 0.349 ± 0.029
acc_norm 0.255 ± 0.029 0.345 ± 0.029
hendrycksTest-econometrics acc 0.298 ± 0.043 0.272 ± 0.043
acc_norm 0.298 ± 0.043 0.263 ± 0.043
hendrycksTest-electrical_engineering acc 0.338 ± 0.039 0.324 ± 0.039
acc_norm 0.290 ± 0.038 0.303 ± 0.038
hendrycksTest-elementary_mathematics acc 0.262 ± 0.023 0.275 ± 0.023
acc_norm 0.294 ± 0.023 0.275 ± 0.023
hendrycksTest-formal_logic acc 0.310 ± 0.041 0.310 ± 0.041
acc_norm 0.294 ± 0.041 0.270 ± 0.041
hendrycksTest-global_facts acc 0.200 ± 0.040 0.290 ± 0.040
acc_norm 0.210 ± 0.041 0.290 ± 0.041
hendrycksTest-high_school_biology acc 0.265 ± 0.025 0.342 ± 0.025
acc_norm 0.287 ± 0.026 0.342 ± 0.026
hendrycksTest-high_school_chemistry acc 0.251 ± 0.031 0.232 ± 0.031
acc_norm 0.291 ± 0.032 0.227 ± 0.032
hendrycksTest-high_school_computer_science acc 0.260 ± 0.044 0.280 ± 0.044
acc_norm 0.300 ± 0.046 0.260 ± 0.046
hendrycksTest-high_school_european_history acc 0.267 ± 0.035 0.309 ± 0.035
acc_norm 0.315 ± 0.036 0.321 ± 0.036
hendrycksTest-high_school_geography acc 0.227 ± 0.030 0.348 ± 0.030
acc_norm 0.278 ± 0.032 0.354 ± 0.032
hendrycksTest-high_school_government_and_politics acc 0.290 ± 0.033 0.332 ± 0.033
acc_norm 0.290 ± 0.033 0.321 ± 0.033
hendrycksTest-high_school_macroeconomics acc 0.279 ± 0.023 0.305 ± 0.023
acc_norm 0.267 ± 0.022 0.285 ± 0.022
hendrycksTest-high_school_mathematics acc 0.252 ± 0.026 0.278 ± 0.026
acc_norm 0.296 ± 0.028 0.304 ± 0.028
hendrycksTest-high_school_microeconomics acc 0.265 ± 0.029 0.256 ± 0.029
acc_norm 0.324 ± 0.030 0.273 ± 0.030
hendrycksTest-high_school_physics acc 0.205 ± 0.033 0.205 ± 0.033
acc_norm 0.232 ± 0.034 0.212 ± 0.034
hendrycksTest-high_school_psychology acc 0.251 ± 0.019 0.328 ± 0.019
acc_norm 0.270 ± 0.019 0.325 ± 0.019
hendrycksTest-high_school_statistics acc 0.319 ± 0.032 0.241 ± 0.032
acc_norm 0.319 ± 0.032 0.245 ± 0.032
hendrycksTest-high_school_us_history acc 0.265 ± 0.031 0.221 ± 0.031
acc_norm 0.260 ± 0.031 0.230 ± 0.031
hendrycksTest-high_school_world_history acc 0.283 ± 0.029 0.371 ± 0.029
acc_norm 0.266 ± 0.029 0.380 ± 0.029
hendrycksTest-human_aging acc 0.296 ± 0.031 0.296 ± 0.031
acc_norm 0.274 ± 0.030 0.291 ± 0.030
hendrycksTest-human_sexuality acc 0.351 ± 0.042 0.290 ± 0.042
acc_norm 0.282 ± 0.039 0.290 ± 0.039
hendrycksTest-international_law acc 0.248 ± 0.039 0.322 ± 0.039
acc_norm 0.347 ± 0.043 0.331 ± 0.043
hendrycksTest-jurisprudence acc 0.269 ± 0.043 0.296 ± 0.043
acc_norm 0.370 ± 0.047 0.296 ± 0.047
hendrycksTest-logical_fallacies acc 0.202 ± 0.032 0.276 ± 0.032
acc_norm 0.270 ± 0.035 0.258 ± 0.035
hendrycksTest-machine_learning acc 0.295 ± 0.043 0.250 ± 0.043
acc_norm 0.330 ± 0.045 0.223 ± 0.045
hendrycksTest-management acc 0.282 ± 0.045 0.320 ± 0.045
acc_norm 0.272 ± 0.044 0.350 ± 0.044
hendrycksTest-marketing acc 0.303 ± 0.030 0.415 ± 0.030
acc_norm 0.329 ± 0.031 0.423 ± 0.031
hendrycksTest-medical_genetics acc 0.330 ± 0.047 0.300 ± 0.047
acc_norm 0.420 ± 0.050 0.300 ± 0.050
hendrycksTest-miscellaneous acc 0.319 ± 0.017 0.318 ± 0.017
acc_norm 0.319 ± 0.017 0.313 ± 0.017
hendrycksTest-moral_disputes acc 0.298 ± 0.025 0.341 ± 0.025
acc_norm 0.318 ± 0.025 0.344 ± 0.025
hendrycksTest-moral_scenarios acc 0.267 ± 0.015 0.240 ± 0.015
acc_norm 0.265 ± 0.015 0.238 ± 0.015
hendrycksTest-nutrition acc 0.278 ± 0.026 0.330 ± 0.026
acc_norm 0.337 ± 0.027 0.350 ± 0.027
hendrycksTest-philosophy acc 0.251 ± 0.025 0.315 ± 0.025
acc_norm 0.293 ± 0.026 0.325 ± 0.026
hendrycksTest-prehistory acc 0.244 ± 0.024 0.352 ± 0.024
acc_norm 0.250 ± 0.024 0.361 ± 0.024
hendrycksTest-professional_accounting acc 0.287 ± 0.027 0.213 ± 0.027
acc_norm 0.248 ± 0.026 0.216 ± 0.026
hendrycksTest-professional_law acc 0.273 ± 0.011 0.267 ± 0.011
acc_norm 0.269 ± 0.011 0.269 ± 0.011
hendrycksTest-professional_medicine acc 0.301 ± 0.028 0.301 ± 0.028
acc_norm 0.268 ± 0.027 0.327 ± 0.027
hendrycksTest-professional_psychology acc 0.279 ± 0.018 0.304 ± 0.018
acc_norm 0.284 ± 0.018 0.310 ± 0.018
hendrycksTest-public_relations acc 0.327 ± 0.045 0.345 ± 0.045
acc_norm 0.309 ± 0.044 0.336 ± 0.044
hendrycksTest-security_studies acc 0.265 ± 0.028 0.331 ± 0.028
acc_norm 0.208 ± 0.026 0.290 ± 0.026
hendrycksTest-sociology acc 0.269 ± 0.031 0.393 ± 0.031
acc_norm 0.249 ± 0.031 0.383 ± 0.031
hendrycksTest-us_foreign_policy acc 0.290 ± 0.046 0.320 ± 0.046
acc_norm 0.320 ± 0.047 0.320 ± 0.047
hendrycksTest-virology acc 0.289 ± 0.035 0.349 ± 0.035
acc_norm 0.265 ± 0.034 0.355 ± 0.034
hendrycksTest-world_religions acc 0.374 ± 0.037 0.345 ± 0.037
acc_norm 0.409 ± 0.038 0.351 ± 0.038
logiqa acc 0.255 ± 0.017 0.273 ± 0.017
acc_norm 0.272 ± 0.017 0.280 ± 0.017
mathqa acc 0.256 ± 0.008 0.253 ± 0.008
acc_norm 0.258 ± 0.008 0.240 ± 0.008
mnli acc 0.338 ± 0.005 0.801 ± 0.005
mnli_mismatched acc 0.362 ± 0.005 0.811 ± 0.005
mrpc acc 0.571 ± 0.025 0.750 ± 0.025
f1 0.689 ± 0.022 0.841 ± 0.022
multirc acc 0.047 ± 0.007 0.012 ± 0.007
openbookqa acc 0.222 ± 0.019 0.268 ± 0.019
acc_norm 0.346 ± 0.021 0.344 ± 0.021
piqa acc 0.726 ± 0.010 0.714 ± 0.010
acc_norm 0.736 ± 0.010 0.718 ± 0.010
qnli acc 0.504 ± 0.007 0.788 ± 0.007
qqp acc 0.534 ± 0.002 0.847 ± 0.002
f1 0.372 ± 0.004 0.793 ± 0.004
race acc 0.352 ± 0.015 0.355 ± 0.015
record f1 0.843 ± 0.004 0.778 ± 0.004
em 0.835 ± 0.004 0.771 ± 0.004
rte acc 0.491 ± 0.030 0.747 ± 0.030
sciq acc 0.930 ± 0.008 0.939 ± 0.008
acc_norm 0.938 ± 0.008 0.935 ± 0.008
sst acc 0.492 ± 0.017 0.916 ± 0.017
webqs acc 0.054 ± 0.005 0.095 ± 0.005
wic acc 0.472 ± 0.020 0.539 ± 0.020
winogrande acc 0.582 ± 0.014 0.571 ± 0.014
wnli acc 0.380 ± 0.058 0.549 ± 0.058
wsc acc 0.365 ± 0.047 0.365 ± 0.047
lambada ppl 6.423 ± 0.162 20.150 ± 0.162
acc 0.576 ± 0.007 0.394 ± 0.007
pubmedqa acc 0.529 ± 0.016 0.479 ± 0.016
coqa f1 0.606 ± 0.018 0.581 ± 0.018
em 0.484 ± 0.020 0.472 ± 0.020
drop em 0.001 ± 0.000 0.001 ± 0.000
f1 0.039 ± 0.001 0.031 ± 0.001
math_algebra acc 0.016 ± 0.004 0.024 ± 0.004
math_counting_and_prob acc 0.023 ± 0.007 0.030 ± 0.007
math_geometry acc 0.006 ± 0.004 0.021 ± 0.004
math_intermediate_algebra acc 0.020 ± 0.005 0.029 ± 0.005
math_num_theory acc 0.037 ± 0.008 0.039 ± 0.008
math_prealgebra acc 0.023 ± 0.005 0.041 ± 0.005
math_precalc acc 0.015 ± 0.005 0.022 ± 0.005
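To turn the "tossup" impression into a number, one can tally how often the tuned model wins, loses, or ties on the point estimates. A small sketch, assuming rows formatted like the tables above ("task metric base ± se tuned ± se"):

```python
def tally(rows):
    """Count tuned wins/losses/ties from plain-text table rows.

    Each row ends with: base ± base_se tuned ± tuned_se, so the point
    estimates are the 6th- and 3rd-from-last whitespace-separated fields.
    """
    wins = losses = ties = 0
    for row in rows:
        parts = row.split()
        base, tuned = float(parts[-6]), float(parts[-3])
        if tuned > base:
            wins += 1
        elif tuned < base:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties

sample = [
    "anli_r1 acc 0.332 ± 0.015 0.418 ± 0.015",
    "hellaswag acc 0.427 ± 0.005 0.400 ± 0.005",
    "wsc acc 0.365 ± 0.047 0.365 ± 0.047",
]
print(tally(sample))  # (1, 1, 1)
```

Note that this treats every metric the same way, which is wrong for lambada’s perplexity (lower is better), so that row should be excluded or inverted before tallying.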

The model can be downloaded here, though I don’t recommend using it for anything.