I'll cover the scaling laws observed in language models, the Chinchilla paper, and proposed explanations for the parameter-limited and data-limited regimes that can occur. Much of the talk will be about presenting empirical observations and discussing them. Depending on audience interest, we can go deeper into further observations or cover proposed theoretical models of these scaling phenomena. I'll aim to keep things informal and aimed at non-experts, and I encourage any/all questions!
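(For the non-experts: by "scaling laws" I mean empirical power-law fits of how the loss falls as model size and data grow. As a rough sketch, using the parameterization from the Chinchilla paper linked below, with N the parameter count, D the number of training tokens, E an irreducible loss term, and A, B, α, β fitted constants:

L(N, D) ≈ E + A / N^α + B / D^β

Minimizing this under a fixed compute budget, with the usual approximation C ≈ 6ND FLOPs, gives the "compute-optimal" split between parameters and data; the fitted exponents come out roughly equal, which is why Chinchilla recommends growing N and D in proportion, on the order of 20 tokens per parameter.)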
Some references if you are interested in learning more:
"Early" scaling laws papers:
https://arxiv.org/abs/1712.00409
https://arxiv.org/abs/2001.08361
The Chinchilla paper "Training Compute-Optimal Large Language Models":
https://arxiv.org/abs/2203.15556
Yasaman and colleagues' nice paper on explaining neural scaling laws:
https://arxiv.org/abs/2102.06701
Equations for scaling laws of kernel regression:
Spigler, Geiger, and Wyart's nice paper:
https://arxiv.org/abs/1905.10843
Blake and Abdul's work:
https://arxiv.org/abs/2002.02561
and a follow-up:
https://arxiv.org/abs/2006.13198
The limiting effects of finite width:
In addition to Yasaman's paper, there's our Onset of Variance paper with Blake:
https://arxiv.org/abs/2212.12147
Relatedly, a random feature analysis by Bruno and collaborators has many of the key insights necessary to derive the limits of finite width:
https://arxiv.org/abs/2102.08127
and a further one on the role played by ensembling:
https://arxiv.org/abs/2201.13383
A related model is also explored by Maloney et al.:
https://arxiv.org/abs/2210.16859
Also, some recent work by our group on the consistency of training dynamics across widths, as well as further ways in which finite width has limiting effects:
https://arxiv.org/abs/2305.18411