This was an ablation of activation functions on GPT-like models of ~100M parameters that I ran ages ago. Each model was trained for 10k iterations, which isn't very long. My original goal was to show that the choice of activation function doesn't matter that much, but to do that properly I'd need many more runs per activation to estimate variance and show that the differences aren't statistically significant, and I don't plan on running a more exhaustive version of this experiment any time soon. So I'm just dumping these results here in case anyone has any use for them. All the activation definitions are here.
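
For reference, here's a rough sketch of what a few of the more standard activations in the table look like, written as plain PyTorch functions. These are just the usual published definitions; the exact implementations used in these runs (including the repo-specific ones like spike2, seagull, and cosid) are the ones in the linked code.

```python
import torch
import torch.nn.functional as F

def softsign(x):   return x / (1 + x.abs())            # x / (1 + |x|)
def swish(x):      return x * torch.sigmoid(x)          # a.k.a. SiLU
def mish(x):       return x * torch.tanh(F.softplus(x))
def tanhexp(x):    return x * torch.tanh(torch.exp(x))
def lisht(x):      return x * torch.tanh(x)
def snake(x):      return x + torch.sin(x) ** 2         # Snake with a = 1
def tanhshrink(x): return x - torch.tanh(x)
# ReLU, ELU, SELU, GELU, softplus, sigmoid, tanh, and (log)softmax are the
# standard torch.nn.functional versions.
```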

| Name | Pile Validation BPB | LAMBADA Acc (%) | LAMBADA Perplexity |
|---|---|---|---|
| softsign | 1.1485 | 34.3 | 81.32 |
| ReLU | 1.1482 | 34.3 | 82.01 |
| spike2 | 1.1480 | 34.4 | 83.13 |
| selu | 1.1485 | 34.5 | 83.32 |
| elish | 1.1492 | 33.9 | 84.04 |
| tanhexp | 1.1474 | 33.7 | 84.06 |
| sigmoid | 1.1484 | 33.9 | 85.20 |
| tanhshrink | 1.1483 | 33.9 | 85.42 |
| maxtanh | 1.1479 | 33.7 | 85.53 |
| roottanh | 1.1485 | 33.4 | 86.00 |
| softplusmone | 1.1488 | 34.1 | 86.21 |
| logsoftmax | 1.1492 | 34.2 | 86.29 |
| ELU | 1.1496 | 33.8 | 86.37 |
| Swish | 1.1482 | 33.7 | 86.42 |
| softmax | 1.1491 | 33.2 | 86.74 |
| square_relax | 1.1484 | 33.5 | 86.92 |
| lisht | 1.1500 | 33.8 | 87.17 |
| GELU | 1.1453 | 34.0 | 87.84 |
| abs | 1.1489 | 33.5 | 87.96 |
| tanh | 1.1481 | 33.2 | 89.28 |
| Mish | 1.1482 | 33.6 | 89.84 |
| triangle_relax | 1.1502 | 33.7 | 89.91 |
| seagull | 1.1487 | 33.3 | 90.08 |
| maxsig | 1.1480 | 33.3 | 90.23 |
| softplus | 1.1460 | 33.1 | 90.74 |
| minsin | 1.1498 | 33.3 | 91.18 |
| snake | 1.1484 | 33.1 | 91.93 |
| cosid | 1.1490 | 33.3 | 92.99 |
| spike | 1.1498 | 33.3 | 93.78 |
| bipolarsigmoid | 1.1513 | 32.8 | 96.73 |
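
For context on the BPB column: bits-per-byte is just the model's mean cross-entropy loss rescaled so it doesn't depend on the tokenizer. A minimal sketch of the usual conversion, assuming you have the mean loss in nats per token plus the token and byte counts of the eval text (the numbers in the comment are purely illustrative, not the actual values from these runs):

```python
import math

def bits_per_byte(mean_loss_nats_per_token: float,
                  n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy loss (nats/token) to bits per byte.

    Total nats over the eval set = mean_loss * n_tokens; divide by the number
    of UTF-8 bytes of the same text and convert nats -> bits with 1/ln(2).
    """
    return mean_loss_nats_per_token * n_tokens / (n_bytes * math.log(2))

# Illustrative only: ~2.75 nats/token at ~0.29 tokens/byte works out to ~1.15 BPB.
print(bits_per_byte(2.75, n_tokens=29_000_000, n_bytes=100_000_000))
```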