The GPT-3 paper didn't explore fine-tuning on downstream tasks, so I decided to tune Neo 2.7B for 1.1k iterations on all the tasks in eval harness that have a train set (all at once, since tuning one model per task would have taken ages). I was quite surprised that the tuned model didn't completely destroy untuned 2.7B across the board; from eyeballing the results it looks more like a tossup. Interestingly, the tuned model does beat 2.7B by a wide margin on anli, which is especially notable given that anli is one of the tasks the models in the GPT-3 paper struggled on. Lambada and pubmedqa are also included in these tables even though they don't have train sets (at least in the eval harness implementation, which uses the OA version of lambada), because I wanted to look at tasks outside the tuning mix to check for catastrophic forgetting or the like. Sure enough, the lambada and pubmedqa scores are significantly worse for the tuned model.
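For anyone who wants to replicate this kind of comparison, here's a minimal sketch of the setup, assuming the HF transformers Trainer and datasets libraries. This is an illustration, not the exact script I ran: the prompt formatting, `load_task` helper, batch size, and output paths are all stand-ins (only the 1.1k steps matches the text above).

```python
# Minimal sketch of multitask tuning, assuming HF transformers + datasets.
# Prompt formatting, helper names, and hyperparameters (other than the
# 1100 steps) are illustrative stand-ins, not the exact settings used.
from datasets import concatenate_datasets, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")

def load_task(name, split, fmt):
    # Render each example as plain text (ideally the same way the eval
    # harness prompts it) and drop the task-specific columns.
    ds = load_dataset(name, split=split)
    return ds.map(lambda ex: {"text": fmt(ex)}, remove_columns=ds.column_names)

# Mix the train splits of every harness task that has one into a single
# dataset, so one model is tuned on everything at once.
train = concatenate_datasets([
    load_task("anli", "train_r1", lambda ex:
              f"{ex['premise']}\nQuestion: {ex['hypothesis']} True, False, or Neither?"),
    # ... one entry per eval harness task with a train set ...
]).shuffle(seed=42)

def tokenize(ex):
    return tokenizer(ex["text"], truncation=True, max_length=2048)

train = train.map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="neo-2.7B-tuned",
                           max_steps=1100,
                           per_device_train_batch_size=4),
    train_dataset=train,
    # mlm=False gives ordinary causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Both checkpoints can then be scored with the eval harness, along the
# lines of (harness CLI of that era):
#   python main.py --model gpt2 --model_args pretrained=neo-2.7B-tuned \
#       --tasks anli_r1,anli_r2,anli_r3,lambada,pubmedqa --num_fewshot 0
# and again with --num_fewshot 1 for the one-shot table.
```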
Zero shot
| Task | Metric | 2.7B | Tuned |
|---|---|---|---|
| anli_r1 | acc | 0.332 ± 0.015 | 0.418 ± 0.015 |
| anli_r2 | acc | 0.342 ± 0.015 | 0.375 ± 0.015 |
| anli_r3 | acc | 0.352 ± 0.014 | 0.392 ± 0.014 |
| arc_challenge | acc | 0.275 ± 0.013 | 0.286 ± 0.013 |
| | acc_norm | 0.301 ± 0.013 | 0.312 ± 0.013 |
| arc_easy | acc | 0.611 ± 0.010 | 0.560 ± 0.010 |
| | acc_norm | 0.539 ± 0.010 | 0.558 ± 0.010 |
| boolq | acc | 0.630 ± 0.008 | 0.605 ± 0.008 |
| cb | acc | 0.304 ± 0.062 | 0.411 ± 0.062 |
| copa | acc | 0.800 ± 0.040 | 0.730 ± 0.040 |
| ethics_cm | acc | 0.510 ± 0.008 | 0.561 ± 0.008 |
| ethics_deontology | acc | 0.497 ± 0.008 | 0.658 ± 0.008 |
| ethics_justice | acc | 0.501 ± 0.010 | 0.589 ± 0.010 |
| ethics_utilitarianism | acc | 0.497 ± 0.007 | 0.498 ± 0.007 |
| ethics_virtue | acc | 0.251 ± 0.006 | 0.800 ± 0.006 |
| headqa | acc | 0.235 ± 0.008 | 0.233 ± 0.008 |
| | acc_norm | 0.272 ± 0.008 | 0.265 ± 0.008 |
| hellaswag | acc | 0.427 ± 0.005 | 0.400 ± 0.005 |
| | acc_norm | 0.558 ± 0.005 | 0.517 ± 0.005 |
| hendrycksTest-abstract_algebra | acc | 0.230 ± 0.042 | 0.340 ± 0.042 |
| | acc_norm | 0.200 ± 0.040 | 0.350 ± 0.040 |
| hendrycksTest-anatomy | acc | 0.252 ± 0.037 | 0.267 ± 0.037 |
| | acc_norm | 0.222 ± 0.036 | 0.252 ± 0.036 |
| hendrycksTest-astronomy | acc | 0.250 ± 0.035 | 0.309 ± 0.035 |
| | acc_norm | 0.362 ± 0.039 | 0.309 ± 0.039 |
| hendrycksTest-business_ethics | acc | 0.360 ± 0.048 | 0.340 ± 0.048 |
| | acc_norm | 0.280 ± 0.045 | 0.310 ± 0.045 |
| hendrycksTest-clinical_knowledge | acc | 0.291 ± 0.028 | 0.370 ± 0.028 |
| | acc_norm | 0.287 ± 0.028 | 0.374 ± 0.028 |
| hendrycksTest-college_biology | acc | 0.250 ± 0.036 | 0.250 ± 0.036 |
| | acc_norm | 0.222 ± 0.035 | 0.271 ± 0.035 |
| hendrycksTest-college_chemistry | acc | 0.230 ± 0.042 | 0.350 ± 0.042 |
| | acc_norm | 0.250 ± 0.044 | 0.350 ± 0.044 |
| hendrycksTest-college_computer_science | acc | 0.280 ± 0.045 | 0.430 ± 0.045 |
| | acc_norm | 0.270 ± 0.045 | 0.390 ± 0.045 |
| hendrycksTest-college_mathematics | acc | 0.200 ± 0.040 | 0.370 ± 0.040 |
| | acc_norm | 0.300 ± 0.046 | 0.350 ± 0.046 |
| hendrycksTest-college_medicine | acc | 0.254 ± 0.033 | 0.312 ± 0.033 |
| | acc_norm | 0.260 ± 0.033 | 0.306 ± 0.033 |
| hendrycksTest-college_physics | acc | 0.225 ± 0.042 | 0.275 ± 0.042 |
| | acc_norm | 0.245 ± 0.043 | 0.284 ± 0.043 |
| hendrycksTest-computer_security | acc | 0.270 ± 0.045 | 0.290 ± 0.045 |
| | acc_norm | 0.330 ± 0.047 | 0.290 ± 0.047 |
| hendrycksTest-conceptual_physics | acc | 0.247 ± 0.028 | 0.315 ± 0.028 |
| | acc_norm | 0.187 ± 0.026 | 0.319 ± 0.026 |
| hendrycksTest-econometrics | acc | 0.193 ± 0.037 | 0.272 ± 0.037 |
| | acc_norm | 0.228 ± 0.039 | 0.281 ± 0.039 |
| hendrycksTest-electrical_engineering | acc | 0.331 ± 0.039 | 0.386 ± 0.039 |
| | acc_norm | 0.338 ± 0.039 | 0.386 ± 0.039 |
| hendrycksTest-elementary_mathematics | acc | 0.230 ± 0.022 | 0.280 ± 0.022 |
| | acc_norm | 0.270 ± 0.023 | 0.278 ± 0.023 |
| hendrycksTest-formal_logic | acc | 0.333 ± 0.042 | 0.310 ± 0.042 |
| | acc_norm | 0.302 ± 0.041 | 0.278 ± 0.041 |
| hendrycksTest-global_facts | acc | 0.240 ± 0.043 | 0.250 ± 0.043 |
| | acc_norm | 0.240 ± 0.043 | 0.260 ± 0.043 |
| hendrycksTest-high_school_biology | acc | 0.219 ± 0.024 | 0.335 ± 0.024 |
| | acc_norm | 0.284 ± 0.026 | 0.329 ± 0.026 |
| hendrycksTest-high_school_chemistry | acc | 0.167 ± 0.026 | 0.207 ± 0.026 |
| | acc_norm | 0.256 ± 0.031 | 0.212 ± 0.031 |
| hendrycksTest-high_school_computer_science | acc | 0.220 ± 0.042 | 0.290 ± 0.042 |
| | acc_norm | 0.280 ± 0.045 | 0.280 ± 0.045 |
| hendrycksTest-high_school_european_history | acc | 0.267 ± 0.035 | 0.358 ± 0.035 |
| | acc_norm | 0.285 ± 0.035 | 0.358 ± 0.035 |
| hendrycksTest-high_school_geography | acc | 0.227 ± 0.030 | 0.359 ± 0.030 |
| | acc_norm | 0.298 ± 0.033 | 0.333 ± 0.033 |
| hendrycksTest-high_school_government_and_politics | acc | 0.207 ± 0.029 | 0.301 ± 0.029 |
| | acc_norm | 0.259 ± 0.032 | 0.311 ± 0.032 |
| hendrycksTest-high_school_macroeconomics | acc | 0.262 ± 0.022 | 0.267 ± 0.022 |
| | acc_norm | 0.267 ± 0.022 | 0.262 ± 0.022 |
| hendrycksTest-high_school_mathematics | acc | 0.174 ± 0.023 | 0.248 ± 0.023 |
| | acc_norm | 0.244 ± 0.026 | 0.270 ± 0.026 |
| hendrycksTest-high_school_microeconomics | acc | 0.256 ± 0.028 | 0.265 ± 0.028 |
| | acc_norm | 0.328 ± 0.030 | 0.277 ± 0.030 |
| hendrycksTest-high_school_physics | acc | 0.225 ± 0.034 | 0.212 ± 0.034 |
| | acc_norm | 0.219 ± 0.034 | 0.225 ± 0.034 |
| hendrycksTest-high_school_psychology | acc | 0.253 ± 0.019 | 0.338 ± 0.019 |
| | acc_norm | 0.261 ± 0.019 | 0.330 ± 0.019 |
| hendrycksTest-high_school_statistics | acc | 0.264 ± 0.030 | 0.278 ± 0.030 |
| | acc_norm | 0.338 ± 0.032 | 0.273 ± 0.032 |
| hendrycksTest-high_school_us_history | acc | 0.235 ± 0.030 | 0.230 ± 0.030 |
| | acc_norm | 0.270 ± 0.031 | 0.235 ± 0.031 |
| hendrycksTest-high_school_world_history | acc | 0.270 ± 0.029 | 0.388 ± 0.029 |
| | acc_norm | 0.300 ± 0.030 | 0.392 ± 0.030 |
| hendrycksTest-human_aging | acc | 0.296 ± 0.031 | 0.318 ± 0.031 |
| | acc_norm | 0.238 ± 0.029 | 0.314 ± 0.029 |
| hendrycksTest-human_sexuality | acc | 0.336 ± 0.041 | 0.290 ± 0.041 |
| | acc_norm | 0.290 ± 0.040 | 0.290 ± 0.040 |
| hendrycksTest-international_law | acc | 0.248 ± 0.039 | 0.322 ± 0.039 |
| | acc_norm | 0.496 ± 0.046 | 0.347 ± 0.046 |
| hendrycksTest-jurisprudence | acc | 0.250 ± 0.042 | 0.269 ± 0.042 |
| | acc_norm | 0.426 ± 0.048 | 0.296 ± 0.048 |
| hendrycksTest-logical_fallacies | acc | 0.209 ± 0.032 | 0.258 ± 0.032 |
| | acc_norm | 0.288 ± 0.036 | 0.264 ± 0.036 |
| hendrycksTest-machine_learning | acc | 0.295 ± 0.043 | 0.250 ± 0.043 |
| | acc_norm | 0.259 ± 0.042 | 0.259 ± 0.042 |
| hendrycksTest-management | acc | 0.184 ± 0.038 | 0.311 ± 0.038 |
| | acc_norm | 0.282 ± 0.045 | 0.330 ± 0.045 |
| hendrycksTest-marketing | acc | 0.316 ± 0.030 | 0.432 ± 0.030 |
| | acc_norm | 0.338 ± 0.031 | 0.440 ± 0.031 |
| hendrycksTest-medical_genetics | acc | 0.300 ± 0.046 | 0.240 ± 0.046 |
| | acc_norm | 0.370 ± 0.049 | 0.270 ± 0.049 |
| hendrycksTest-miscellaneous | acc | 0.281 ± 0.016 | 0.323 ± 0.016 |
| | acc_norm | 0.271 ± 0.016 | 0.328 ± 0.016 |
| hendrycksTest-moral_disputes | acc | 0.286 ± 0.024 | 0.350 ± 0.024 |
| | acc_norm | 0.355 ± 0.026 | 0.364 ± 0.026 |
| hendrycksTest-moral_scenarios | acc | 0.234 ± 0.014 | 0.264 ± 0.014 |
| | acc_norm | 0.273 ± 0.015 | 0.269 ± 0.015 |
| hendrycksTest-nutrition | acc | 0.275 ± 0.026 | 0.307 ± 0.026 |
| | acc_norm | 0.359 ± 0.027 | 0.333 ± 0.027 |
| hendrycksTest-philosophy | acc | 0.270 ± 0.025 | 0.305 ± 0.025 |
| | acc_norm | 0.315 ± 0.026 | 0.322 ± 0.026 |
| hendrycksTest-prehistory | acc | 0.256 ± 0.024 | 0.361 ± 0.024 |
| | acc_norm | 0.216 ± 0.023 | 0.364 ± 0.023 |
| hendrycksTest-professional_accounting | acc | 0.248 ± 0.026 | 0.230 ± 0.026 |
| | acc_norm | 0.259 ± 0.026 | 0.220 ± 0.026 |
| hendrycksTest-professional_law | acc | 0.267 ± 0.011 | 0.275 ± 0.011 |
| | acc_norm | 0.300 ± 0.012 | 0.284 ± 0.012 |
| hendrycksTest-professional_medicine | acc | 0.246 ± 0.026 | 0.290 ± 0.026 |
| | acc_norm | 0.232 ± 0.026 | 0.298 ± 0.026 |
| hendrycksTest-professional_psychology | acc | 0.258 ± 0.018 | 0.299 ± 0.018 |
| | acc_norm | 0.253 ± 0.018 | 0.315 ± 0.018 |
| hendrycksTest-public_relations | acc | 0.300 ± 0.044 | 0.364 ± 0.044 |
| | acc_norm | 0.164 ± 0.035 | 0.373 ± 0.035 |
| hendrycksTest-security_studies | acc | 0.339 ± 0.030 | 0.343 ± 0.030 |
| | acc_norm | 0.286 ± 0.029 | 0.286 ± 0.029 |
| hendrycksTest-sociology | acc | 0.269 ± 0.031 | 0.403 ± 0.031 |
| | acc_norm | 0.264 ± 0.031 | 0.423 ± 0.031 |
| hendrycksTest-us_foreign_policy | acc | 0.330 ± 0.047 | 0.390 ± 0.047 |
| | acc_norm | 0.350 ± 0.048 | 0.390 ± 0.048 |
| hendrycksTest-virology | acc | 0.313 ± 0.036 | 0.325 ± 0.036 |
| | acc_norm | 0.331 ± 0.037 | 0.343 ± 0.037 |
| hendrycksTest-world_religions | acc | 0.304 ± 0.035 | 0.316 ± 0.035 |
| | acc_norm | 0.386 ± 0.037 | 0.339 ± 0.037 |
| logiqa | acc | 0.201 ± 0.016 | 0.280 ± 0.016 |
| | acc_norm | 0.281 ± 0.018 | 0.283 ± 0.018 |
| mathqa | acc | 0.247 ± 0.008 | 0.248 ± 0.008 |
| | acc_norm | 0.246 ± 0.008 | 0.239 ± 0.008 |
| mnli | acc | 0.339 ± 0.005 | 0.729 ± 0.005 |
| mnli_mismatched | acc | 0.338 ± 0.005 | 0.742 ± 0.005 |
| mrpc | acc | 0.684 ± 0.023 | 0.701 ± 0.023 |
| | f1 | 0.812 ± 0.016 | 0.820 ± 0.016 |
| multirc | acc | 0.016 ± 0.004 | 0.004 ± 0.004 |
| openbookqa | acc | 0.234 ± 0.019 | 0.248 ± 0.019 |
| | acc_norm | 0.332 ± 0.021 | 0.318 ± 0.021 |
| piqa | acc | 0.721 ± 0.010 | 0.713 ± 0.010 |
| | acc_norm | 0.729 ± 0.010 | 0.708 ± 0.010 |
| qnli | acc | 0.509 ± 0.007 | 0.761 ± 0.007 |
| qqp | acc | 0.368 ± 0.002 | 0.843 ± 0.002 |
| | f1 | 0.538 ± 0.003 | 0.789 ± 0.003 |
| race | acc | 0.353 ± 0.015 | 0.362 ± 0.015 |
| record | f1 | 0.845 ± 0.004 | 0.779 ± 0.004 |
| | em | 0.838 ± 0.004 | 0.770 ± 0.004 |
| rte | acc | 0.520 ± 0.030 | 0.729 ± 0.030 |
| sciq | acc | 0.893 ± 0.010 | 0.919 ± 0.010 |
| | acc_norm | 0.828 ± 0.012 | 0.913 ± 0.012 |
| sst | acc | 0.789 ± 0.014 | 0.862 ± 0.014 |
| webqs | acc | 0.016 ± 0.003 | 0.071 ± 0.003 |
| wic | acc | 0.500 ± 0.020 | 0.517 ± 0.020 |
| winogrande | acc | 0.575 ± 0.014 | 0.570 ± 0.014 |
| wnli | acc | 0.310 ± 0.055 | 0.563 ± 0.055 |
| wsc | acc | 0.365 ± 0.047 | 0.365 ± 0.047 |
| lambada | ppl | 5.626 ± 0.139 | 27.796 ± 0.139 |
| | acc | 0.622 ± 0.007 | 0.387 ± 0.007 |
| pubmedqa | acc | 0.565 ± 0.016 | 0.496 ± 0.016 |
| coqa | f1 | 0.604 ± 0.018 | 0.598 ± 0.018 |
| | em | 0.479 ± 0.020 | 0.480 ± 0.020 |
| drop | em | 0.026 ± 0.002 | 0.001 ± 0.002 |
| | f1 | 0.083 ± 0.002 | 0.033 ± 0.002 |
| math_algebra | acc | 0.008 ± 0.003 | 0.025 ± 0.003 |
| math_geometry | acc | 0.002 ± 0.002 | 0.021 ± 0.002 |
| math_intermediate_algebra | acc | 0.004 ± 0.002 | 0.025 ± 0.002 |
| math_num_theory | acc | 0.019 ± 0.006 | 0.046 ± 0.006 |
| math_prealgebra | acc | 0.001 ± 0.001 | 0.039 ± 0.001 |
| math_precalc | acc | 0.005 ± 0.003 | 0.016 ± 0.003 |
One shot
| Task | Metric | 2.7B | Tuned |
|---|---|---|---|
| anli_r1 | acc | 0.331 ± 0.015 | 0.443 ± 0.015 |
| anli_r2 | acc | 0.307 ± 0.015 | 0.373 ± 0.015 |
| anli_r3 | acc | 0.343 ± 0.014 | 0.423 ± 0.014 |
| arc_challenge | acc | 0.302 ± 0.013 | 0.292 ± 0.013 |
| | acc_norm | 0.323 ± 0.014 | 0.323 ± 0.014 |
| arc_easy | acc | 0.634 ± 0.010 | 0.567 ± 0.010 |
| | acc_norm | 0.622 ± 0.010 | 0.562 ± 0.010 |
| boolq | acc | 0.536 ± 0.009 | 0.620 ± 0.009 |
| cb | acc | 0.429 ± 0.067 | 0.411 ± 0.067 |
| cola | mcc | 0.001 ± 0.031 | 0.022 ± 0.031 |
| copa | acc | 0.770 ± 0.042 | 0.780 ± 0.042 |
| ethics_cm | acc | 0.508 ± 0.008 | 0.625 ± 0.008 |
| ethics_deontology | acc | 0.511 ± 0.008 | 0.683 ± 0.008 |
| ethics_justice | acc | 0.515 ± 0.010 | 0.604 ± 0.010 |
| ethics_utilitarianism | acc | 0.490 ± 0.007 | 0.536 ± 0.007 |
| ethics_virtue | acc | 0.726 ± 0.006 | 0.805 ± 0.006 |
| headqa | acc | 0.230 ± 0.008 | 0.228 ± 0.008 |
| | acc_norm | 0.270 ± 0.008 | 0.275 ± 0.008 |
| hellaswag | acc | 0.428 ± 0.005 | 0.386 ± 0.005 |
| | acc_norm | 0.557 ± 0.005 | 0.494 ± 0.005 |
| hendrycksTest-abstract_algebra | acc | 0.220 ± 0.042 | 0.270 ± 0.042 |
| | acc_norm | 0.290 ± 0.046 | 0.260 ± 0.046 |
| hendrycksTest-anatomy | acc | 0.289 ± 0.039 | 0.304 ± 0.039 |
| | acc_norm | 0.230 ± 0.036 | 0.289 ± 0.036 |
| hendrycksTest-astronomy | acc | 0.204 ± 0.033 | 0.322 ± 0.033 |
| | acc_norm | 0.303 ± 0.037 | 0.322 ± 0.037 |
| hendrycksTest-business_ethics | acc | 0.290 ± 0.046 | 0.320 ± 0.046 |
| | acc_norm | 0.280 ± 0.045 | 0.280 ± 0.045 |
| hendrycksTest-clinical_knowledge | acc | 0.287 ± 0.028 | 0.351 ± 0.028 |
| | acc_norm | 0.328 ± 0.029 | 0.358 ± 0.029 |
| hendrycksTest-college_biology | acc | 0.215 ± 0.034 | 0.271 ± 0.034 |
| | acc_norm | 0.194 ± 0.033 | 0.271 ± 0.033 |
| hendrycksTest-college_chemistry | acc | 0.300 ± 0.046 | 0.330 ± 0.046 |
| | acc_norm | 0.340 ± 0.048 | 0.320 ± 0.048 |
| hendrycksTest-college_computer_science | acc | 0.330 ± 0.047 | 0.390 ± 0.047 |
| | acc_norm | 0.310 ± 0.046 | 0.360 ± 0.046 |
| hendrycksTest-college_mathematics | acc | 0.200 ± 0.040 | 0.280 ± 0.040 |
| | acc_norm | 0.220 ± 0.042 | 0.270 ± 0.042 |
| hendrycksTest-college_medicine | acc | 0.254 ± 0.033 | 0.295 ± 0.033 |
| | acc_norm | 0.260 ± 0.033 | 0.283 ± 0.033 |
| hendrycksTest-college_physics | acc | 0.304 ± 0.046 | 0.284 ± 0.046 |
| | acc_norm | 0.333 ± 0.047 | 0.304 ± 0.047 |
| hendrycksTest-computer_security | acc | 0.320 ± 0.047 | 0.270 ± 0.047 |
| | acc_norm | 0.320 ± 0.047 | 0.290 ± 0.047 |
| hendrycksTest-conceptual_physics | acc | 0.268 ± 0.029 | 0.349 ± 0.029 |
| | acc_norm | 0.255 ± 0.029 | 0.345 ± 0.029 |
| hendrycksTest-econometrics | acc | 0.298 ± 0.043 | 0.272 ± 0.043 |
| | acc_norm | 0.298 ± 0.043 | 0.263 ± 0.043 |
| hendrycksTest-electrical_engineering | acc | 0.338 ± 0.039 | 0.324 ± 0.039 |
| | acc_norm | 0.290 ± 0.038 | 0.303 ± 0.038 |
| hendrycksTest-elementary_mathematics | acc | 0.262 ± 0.023 | 0.275 ± 0.023 |
| | acc_norm | 0.294 ± 0.023 | 0.275 ± 0.023 |
| hendrycksTest-formal_logic | acc | 0.310 ± 0.041 | 0.310 ± 0.041 |
| | acc_norm | 0.294 ± 0.041 | 0.270 ± 0.041 |
| hendrycksTest-global_facts | acc | 0.200 ± 0.040 | 0.290 ± 0.040 |
| | acc_norm | 0.210 ± 0.041 | 0.290 ± 0.041 |
| hendrycksTest-high_school_biology | acc | 0.265 ± 0.025 | 0.342 ± 0.025 |
| | acc_norm | 0.287 ± 0.026 | 0.342 ± 0.026 |
| hendrycksTest-high_school_chemistry | acc | 0.251 ± 0.031 | 0.232 ± 0.031 |
| | acc_norm | 0.291 ± 0.032 | 0.227 ± 0.032 |
| hendrycksTest-high_school_computer_science | acc | 0.260 ± 0.044 | 0.280 ± 0.044 |
| | acc_norm | 0.300 ± 0.046 | 0.260 ± 0.046 |
| hendrycksTest-high_school_european_history | acc | 0.267 ± 0.035 | 0.309 ± 0.035 |
| | acc_norm | 0.315 ± 0.036 | 0.321 ± 0.036 |
| hendrycksTest-high_school_geography | acc | 0.227 ± 0.030 | 0.348 ± 0.030 |
| | acc_norm | 0.278 ± 0.032 | 0.354 ± 0.032 |
| hendrycksTest-high_school_government_and_politics | acc | 0.290 ± 0.033 | 0.332 ± 0.033 |
| | acc_norm | 0.290 ± 0.033 | 0.321 ± 0.033 |
| hendrycksTest-high_school_macroeconomics | acc | 0.279 ± 0.023 | 0.305 ± 0.023 |
| | acc_norm | 0.267 ± 0.022 | 0.285 ± 0.022 |
| hendrycksTest-high_school_mathematics | acc | 0.252 ± 0.026 | 0.278 ± 0.026 |
| | acc_norm | 0.296 ± 0.028 | 0.304 ± 0.028 |
| hendrycksTest-high_school_microeconomics | acc | 0.265 ± 0.029 | 0.256 ± 0.029 |
| | acc_norm | 0.324 ± 0.030 | 0.273 ± 0.030 |
| hendrycksTest-high_school_physics | acc | 0.205 ± 0.033 | 0.205 ± 0.033 |
| | acc_norm | 0.232 ± 0.034 | 0.212 ± 0.034 |
| hendrycksTest-high_school_psychology | acc | 0.251 ± 0.019 | 0.328 ± 0.019 |
| | acc_norm | 0.270 ± 0.019 | 0.325 ± 0.019 |
| hendrycksTest-high_school_statistics | acc | 0.319 ± 0.032 | 0.241 ± 0.032 |
| | acc_norm | 0.319 ± 0.032 | 0.245 ± 0.032 |
| hendrycksTest-high_school_us_history | acc | 0.265 ± 0.031 | 0.221 ± 0.031 |
| | acc_norm | 0.260 ± 0.031 | 0.230 ± 0.031 |
| hendrycksTest-high_school_world_history | acc | 0.283 ± 0.029 | 0.371 ± 0.029 |
| | acc_norm | 0.266 ± 0.029 | 0.380 ± 0.029 |
| hendrycksTest-human_aging | acc | 0.296 ± 0.031 | 0.296 ± 0.031 |
| | acc_norm | 0.274 ± 0.030 | 0.291 ± 0.030 |
| hendrycksTest-human_sexuality | acc | 0.351 ± 0.042 | 0.290 ± 0.042 |
| | acc_norm | 0.282 ± 0.039 | 0.290 ± 0.039 |
| hendrycksTest-international_law | acc | 0.248 ± 0.039 | 0.322 ± 0.039 |
| | acc_norm | 0.347 ± 0.043 | 0.331 ± 0.043 |
| hendrycksTest-jurisprudence | acc | 0.269 ± 0.043 | 0.296 ± 0.043 |
| | acc_norm | 0.370 ± 0.047 | 0.296 ± 0.047 |
| hendrycksTest-logical_fallacies | acc | 0.202 ± 0.032 | 0.276 ± 0.032 |
| | acc_norm | 0.270 ± 0.035 | 0.258 ± 0.035 |
| hendrycksTest-machine_learning | acc | 0.295 ± 0.043 | 0.250 ± 0.043 |
| | acc_norm | 0.330 ± 0.045 | 0.223 ± 0.045 |
| hendrycksTest-management | acc | 0.282 ± 0.045 | 0.320 ± 0.045 |
| | acc_norm | 0.272 ± 0.044 | 0.350 ± 0.044 |
| hendrycksTest-marketing | acc | 0.303 ± 0.030 | 0.415 ± 0.030 |
| | acc_norm | 0.329 ± 0.031 | 0.423 ± 0.031 |
| hendrycksTest-medical_genetics | acc | 0.330 ± 0.047 | 0.300 ± 0.047 |
| | acc_norm | 0.420 ± 0.050 | 0.300 ± 0.050 |
| hendrycksTest-miscellaneous | acc | 0.319 ± 0.017 | 0.318 ± 0.017 |
| | acc_norm | 0.319 ± 0.017 | 0.313 ± 0.017 |
| hendrycksTest-moral_disputes | acc | 0.298 ± 0.025 | 0.341 ± 0.025 |
| | acc_norm | 0.318 ± 0.025 | 0.344 ± 0.025 |
| hendrycksTest-moral_scenarios | acc | 0.267 ± 0.015 | 0.240 ± 0.015 |
| | acc_norm | 0.265 ± 0.015 | 0.238 ± 0.015 |
| hendrycksTest-nutrition | acc | 0.278 ± 0.026 | 0.330 ± 0.026 |
| | acc_norm | 0.337 ± 0.027 | 0.350 ± 0.027 |
| hendrycksTest-philosophy | acc | 0.251 ± 0.025 | 0.315 ± 0.025 |
| | acc_norm | 0.293 ± 0.026 | 0.325 ± 0.026 |
| hendrycksTest-prehistory | acc | 0.244 ± 0.024 | 0.352 ± 0.024 |
| | acc_norm | 0.250 ± 0.024 | 0.361 ± 0.024 |
| hendrycksTest-professional_accounting | acc | 0.287 ± 0.027 | 0.213 ± 0.027 |
| | acc_norm | 0.248 ± 0.026 | 0.216 ± 0.026 |
| hendrycksTest-professional_law | acc | 0.273 ± 0.011 | 0.267 ± 0.011 |
| | acc_norm | 0.269 ± 0.011 | 0.269 ± 0.011 |
| hendrycksTest-professional_medicine | acc | 0.301 ± 0.028 | 0.301 ± 0.028 |
| | acc_norm | 0.268 ± 0.027 | 0.327 ± 0.027 |
| hendrycksTest-professional_psychology | acc | 0.279 ± 0.018 | 0.304 ± 0.018 |
| | acc_norm | 0.284 ± 0.018 | 0.310 ± 0.018 |
| hendrycksTest-public_relations | acc | 0.327 ± 0.045 | 0.345 ± 0.045 |
| | acc_norm | 0.309 ± 0.044 | 0.336 ± 0.044 |
| hendrycksTest-security_studies | acc | 0.265 ± 0.028 | 0.331 ± 0.028 |
| | acc_norm | 0.208 ± 0.026 | 0.290 ± 0.026 |
| hendrycksTest-sociology | acc | 0.269 ± 0.031 | 0.393 ± 0.031 |
| | acc_norm | 0.249 ± 0.031 | 0.383 ± 0.031 |
| hendrycksTest-us_foreign_policy | acc | 0.290 ± 0.046 | 0.320 ± 0.046 |
| | acc_norm | 0.320 ± 0.047 | 0.320 ± 0.047 |
| hendrycksTest-virology | acc | 0.289 ± 0.035 | 0.349 ± 0.035 |
| | acc_norm | 0.265 ± 0.034 | 0.355 ± 0.034 |
| hendrycksTest-world_religions | acc | 0.374 ± 0.037 | 0.345 ± 0.037 |
| | acc_norm | 0.409 ± 0.038 | 0.351 ± 0.038 |
| logiqa | acc | 0.255 ± 0.017 | 0.273 ± 0.017 |
| | acc_norm | 0.272 ± 0.017 | 0.280 ± 0.017 |
| mathqa | acc | 0.256 ± 0.008 | 0.253 ± 0.008 |
| | acc_norm | 0.258 ± 0.008 | 0.240 ± 0.008 |
| mnli | acc | 0.338 ± 0.005 | 0.801 ± 0.005 |
| mnli_mismatched | acc | 0.362 ± 0.005 | 0.811 ± 0.005 |
| mrpc | acc | 0.571 ± 0.025 | 0.750 ± 0.025 |
| | f1 | 0.689 ± 0.022 | 0.841 ± 0.022 |
| multirc | acc | 0.047 ± 0.007 | 0.012 ± 0.007 |
| openbookqa | acc | 0.222 ± 0.019 | 0.268 ± 0.019 |
| | acc_norm | 0.346 ± 0.021 | 0.344 ± 0.021 |
| piqa | acc | 0.726 ± 0.010 | 0.714 ± 0.010 |
| | acc_norm | 0.736 ± 0.010 | 0.718 ± 0.010 |
| qnli | acc | 0.504 ± 0.007 | 0.788 ± 0.007 |
| qqp | acc | 0.534 ± 0.002 | 0.847 ± 0.002 |
| | f1 | 0.372 ± 0.004 | 0.793 ± 0.004 |
| race | acc | 0.352 ± 0.015 | 0.355 ± 0.015 |
| record | f1 | 0.843 ± 0.004 | 0.778 ± 0.004 |
| | em | 0.835 ± 0.004 | 0.771 ± 0.004 |
| rte | acc | 0.491 ± 0.030 | 0.747 ± 0.030 |
| sciq | acc | 0.930 ± 0.008 | 0.939 ± 0.008 |
| | acc_norm | 0.938 ± 0.008 | 0.935 ± 0.008 |
| sst | acc | 0.492 ± 0.017 | 0.916 ± 0.017 |
| webqs | acc | 0.054 ± 0.005 | 0.095 ± 0.005 |
| wic | acc | 0.472 ± 0.020 | 0.539 ± 0.020 |
| winogrande | acc | 0.582 ± 0.014 | 0.571 ± 0.014 |
| wnli | acc | 0.380 ± 0.058 | 0.549 ± 0.058 |
| wsc | acc | 0.365 ± 0.047 | 0.365 ± 0.047 |
| lambada | ppl | 6.423 ± 0.162 | 20.150 ± 0.162 |
| | acc | 0.576 ± 0.007 | 0.394 ± 0.007 |
| pubmedqa | acc | 0.529 ± 0.016 | 0.479 ± 0.016 |
| coqa | f1 | 0.606 ± 0.018 | 0.581 ± 0.018 |
| | em | 0.484 ± 0.020 | 0.472 ± 0.020 |
| drop | em | 0.001 ± 0.000 | 0.001 ± 0.000 |
| | f1 | 0.039 ± 0.001 | 0.031 ± 0.001 |
| math_algebra | acc | 0.016 ± 0.004 | 0.024 ± 0.004 |
| math_counting_and_prob | acc | 0.023 ± 0.007 | 0.030 ± 0.007 |
| math_geometry | acc | 0.006 ± 0.004 | 0.021 ± 0.004 |
| math_intermediate_algebra | acc | 0.020 ± 0.005 | 0.029 ± 0.005 |
| math_num_theory | acc | 0.037 ± 0.008 | 0.039 ± 0.008 |
| math_prealgebra | acc | 0.023 ± 0.005 | 0.041 ± 0.005 |
| math_precalc | acc | 0.015 ± 0.005 | 0.022 ± 0.005 |
The model can be downloaded here, though I don't recommend using it for anything.