This is an ablation of activation functions on GPT-like models of ~100M params that I ran ages ago. Each model was trained for only 10k iterations, which isn't very long. My original goal was to show that the choice of activation function doesn't matter that much, but to do that properly I'd need many more runs per activation to estimate variance and show that the differences aren't statistically significant, and I don't plan on running a more exhaustive version of this experiment any time soon. So I'm just dumping these results here in case anyone has any use for them. All the activation definitions are here; a few of the standard ones are also sketched below the table.
Activation | Pile validation BPB | LAMBADA acc (%) | LAMBADA ppl |
---|---|---|---|
softsign | 1.1485 | 34.3 | 81.32 |
ReLU | 1.1482 | 34.3 | 82.01 |
spike2 | 1.1480 | 34.4 | 83.13 |
selu | 1.1485 | 34.5 | 83.32 |
elish | 1.1492 | 33.9 | 84.04 |
tanhexp | 1.1474 | 33.7 | 84.06 |
sigmoid | 1.1484 | 33.9 | 85.20 |
tanhshrink | 1.1483 | 33.9 | 85.42 |
maxtanh | 1.1479 | 33.7 | 85.53 |
roottanh | 1.1485 | 33.4 | 86.00 |
softplusmone | 1.1488 | 34.1 | 86.21 |
logsoftmax | 1.1492 | 34.2 | 86.29 |
ELU | 1.1496 | 33.8 | 86.37 |
Swish | 1.1482 | 33.7 | 86.42 |
softmax | 1.1491 | 33.2 | 86.74 |
square_relax | 1.1484 | 33.5 | 86.92 |
lisht | 1.1500 | 33.8 | 87.17 |
GELU | 1.1453 | 34.0 | 87.84 |
abs | 1.1489 | 33.5 | 87.96 |
tanh | 1.1481 | 33.2 | 89.28 |
Mish | 1.1482 | 33.6 | 89.84 |
triangle_relax | 1.1502 | 33.7 | 89.91 |
seagull | 1.1487 | 33.3 | 90.08 |
maxsig | 1.1480 | 33.3 | 90.23 |
softplus | 1.1460 | 33.1 | 90.74 |
minsin | 1.1498 | 33.3 | 91.18 |
snake | 1.1484 | 33.1 | 91.93 |
cosid | 1.1490 | 33.3 | 92.99 |
spike | 1.1498 | 33.3 | 93.78 |
bipolarsigmoid | 1.1513 | 32.8 | 96.73 |
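For reference, here's a rough sketch of how some of the standard activations in the table are typically defined in PyTorch. This is just for orientation, not the code used in the runs (that's what the link above is for), and the more exotic names (spike2, elish, seagull, cosid, etc.) are left out since their forms aren't standard.

```python
# Hedged sketch: common textbook definitions of some activations from the table.
# The exact implementations used in the ablation are the linked ones.
import torch
import torch.nn.functional as F

def softsign(x):    return x / (1 + x.abs())
def swish(x):       return x * torch.sigmoid(x)           # a.k.a. SiLU
def mish(x):        return x * torch.tanh(F.softplus(x))
def lisht(x):       return x * torch.tanh(x)
def tanhexp(x):     return x * torch.tanh(torch.exp(x))
def tanhshrink(x):  return x - torch.tanh(x)
def snake(x):       return x + torch.sin(x) ** 2          # frequency a = 1

# relu, gelu, elu, selu, softplus, sigmoid, tanh, and abs are the usual
# torch / torch.nn.functional versions.
```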