This was an ablation of activation functions on GPT-like models of ~100M parameters that I ran ages ago. Each model was trained for 10k iterations, which isn't very long. My original goal was to show that the choice of activation function doesn't matter that much, but to do that properly I'd need many more runs per activation to estimate variance and show that the differences aren't statistically significant, and I don't plan on running a more exhaustive version of this experiment any time soon. So I'm just dumping these results here in case anyone has any use for them. All the activation definitions are here.
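
For reference, here's a rough sketch of what a few of the more standard activations in the table look like, written as plain PyTorch functions. These are just the usual published definitions; the exact implementations used in these runs (including the repo-specific ones like spike2, seagull, and cosid) are the ones in the linked code.

```python
import torch
import torch.nn.functional as F

def softsign(x):   return x / (1 + x.abs())            # x / (1 + |x|)
def swish(x):      return x * torch.sigmoid(x)          # a.k.a. SiLU
def mish(x):       return x * torch.tanh(F.softplus(x))
def tanhexp(x):    return x * torch.tanh(torch.exp(x))
def lisht(x):      return x * torch.tanh(x)
def snake(x):      return x + torch.sin(x) ** 2         # Snake with a = 1
def tanhshrink(x): return x - torch.tanh(x)
# ReLU, ELU, SELU, GELU, softplus, sigmoid, tanh, and (log)softmax are the
# standard torch.nn.functional versions.
```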

| Name | Pile Validation BPB | LAMBADA Acc (%) | LAMBADA Perplexity |
|---|---|---|---|
| softsign | 1.1485 | 34.3 | 81.32 |
| ReLU | 1.1482 | 34.3 | 82.01 |
| spike2 | 1.1480 | 34.4 | 83.13 |
| selu | 1.1485 | 34.5 | 83.32 |
| elish | 1.1492 | 33.9 | 84.04 |
| tanhexp | 1.1474 | 33.7 | 84.06 |
| sigmoid | 1.1484 | 33.9 | 85.20 |
| tanhshrink | 1.1483 | 33.9 | 85.42 |
| maxtanh | 1.1479 | 33.7 | 85.53 |
| roottanh | 1.1485 | 33.4 | 86.00 |
| softplusmone | 1.1488 | 34.1 | 86.21 |
| logsoftmax | 1.1492 | 34.2 | 86.29 |
| ELU | 1.1496 | 33.8 | 86.37 |
| Swish | 1.1482 | 33.7 | 86.42 |
| softmax | 1.1491 | 33.2 | 86.74 |
| square_relax | 1.1484 | 33.5 | 86.92 |
| lisht | 1.1500 | 33.8 | 87.17 |
| GELU | 1.1453 | 34.0 | 87.84 |
| abs | 1.1489 | 33.5 | 87.96 |
| tanh | 1.1481 | 33.2 | 89.28 |
| Mish | 1.1482 | 33.6 | 89.84 |
| triangle_relax | 1.1502 | 33.7 | 89.91 |
| seagull | 1.1487 | 33.3 | 90.08 |
| maxsig | 1.1480 | 33.3 | 90.23 |
| softplus | 1.1460 | 33.1 | 90.74 |
| minsin | 1.1498 | 33.3 | 91.18 |
| snake | 1.1484 | 33.1 | 91.93 |
| cosid | 1.1490 | 33.3 | 92.99 |
| spike | 1.1498 | 33.3 | 93.78 |
| bipolarsigmoid | 1.1513 | 32.8 | 96.73 |
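
For context on the BPB column: bits-per-byte is just the model's mean cross-entropy loss rescaled so it doesn't depend on the tokenizer. A minimal sketch of the usual conversion, assuming you have the mean loss in nats per token plus the token and byte counts of the eval text (the numbers in the comment are purely illustrative, not the actual values from these runs):

```python
import math

def bits_per_byte(mean_loss_nats_per_token: float,
                  n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy loss (nats/token) to bits per byte.

    Total nats over the eval set = mean_loss * n_tokens; divide by the number
    of UTF-8 bytes of the same text and convert nats -> bits with 1/ln(2).
    """
    return mean_loss_nats_per_token * n_tokens / (n_bytes * math.log(2))

# Illustrative only: ~2.75 nats/token at ~0.29 tokens/byte works out to ~1.15 BPB.
print(bits_per_byte(2.75, n_tokens=29_000_000, n_bytes=100_000_000))
```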