Schedule Dec 08, 2023
Tutorial on Neural Scaling Laws
Alex Atanasov, Harvard

I'll cover the scaling laws observed in language models, the Chinchilla paper, and proposed explanations for the parameter-limited and data-limited regimes that can occur. Much of the talk will focus on presenting empirical observations and discussing them. Depending on audience interest, we can go deeper into further observations or cover proposed theoretical models of these scaling phenomena. I'll aim to keep things informal, aimed at non-experts, and I encourage any/all questions!
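
As a rough sketch of the kind of empirical law involved, the Chinchilla paper (linked below) fits the loss with a sum of power laws in model size and data:

    L(N, D) \approx E + A / N^{\alpha} + B / D^{\beta}

Here N is the parameter count, D the number of training tokens, E an irreducible loss, and A, B, \alpha, \beta are constants fit to data. The A / N^{\alpha} term dominates in the parameter-limited regime and the B / D^{\beta} term in the data-limited regime.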

Some references if you are interested in learning more. "Early" scaling laws papers:

https://arxiv.org/abs/1712.00409

https://arxiv.org/abs/2001.08361

The Chinchilla paper, "Training Compute-Optimal Large Language Models":

https://arxiv.org/abs/2203.15556
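
As an illustration only, here is a minimal sketch of how one might fit that parametric form to measured losses with SciPy, using synthetic data generated from assumed coefficients (nothing here is taken from the paper's actual data or fitting code):

import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric form: L(N, D) = E + A / N**alpha + B / D**beta
def chinchilla_loss(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, loss) points generated from assumed coefficients, purely for illustration.
rng = np.random.default_rng(0)
N = np.repeat([7e7, 3e8, 1e9, 7e9], 2)        # parameter counts
D = np.tile([4e9, 4e10], 4)                   # training tokens
L = chinchilla_loss((N, D), 1.7, 400.0, 410.0, 0.34, 0.28)
L = L + rng.normal(scale=0.01, size=L.shape)  # add a little measurement noise

# Fit the five constants; a reasonable starting point helps the optimizer.
p0 = [2.0, 300.0, 300.0, 0.3, 0.3]
params, _ = curve_fit(chinchilla_loss, (N, D), L, p0=p0, maxfev=20000)
E, A, B, alpha, beta = params
print(f"E={E:.2f}, A={A:.0f}, B={B:.0f}, alpha={alpha:.2f}, beta={beta:.2f}")

The paper's own fitting procedure is more careful than this plain least-squares fit; the snippet is only meant to show the functional form in action.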

Yasaman and colleagues' nice paper on explaining neural scaling laws: 

https://arxiv.org/abs/2102.06701

Equations for scaling laws of kernel regression:

Spigler, Geiger, and Wyart's nice paper:

https://arxiv.org/abs/1905.10843

Blake and Abdul's work:

https://arxiv.org/abs/2002.02561

and its follow-up:

https://arxiv.org/abs/2006.13198
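
Schematically (my paraphrase of the common setup in the three kernel regression references above, not an exact statement from any one of them): for a kernel whose eigenvalue spectrum decays as a power law, and a target function compatible with it, the test error of kernel regression decays as a power law in the number of training samples P,

    E_g(P) \propto P^{-\beta}

where the exponent \beta is set by the decay rate of the kernel spectrum and by how the target's power is distributed across the kernel's eigenmodes; the papers above compute \beta explicitly from these spectral quantities.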

The limiting effects of finite width:

In addition to Yasaman's paper, there's our Onset of Variance paper with Blake:

https://arxiv.org/abs/2212.12147

Relatedly, a random feature analysis by Bruno and collaborators contains many of the key ingredients needed to derive these finite-width limitations:

https://arxiv.org/abs/2102.08127

and a further one on the role played by ensembling:

https://arxiv.org/abs/2201.13383
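
A very rough summary of the finite-width picture in the references above (my schematic paraphrase, not a formula taken from any single paper): in the lazy/kernel regime, the leading effect of finite width N is an initialization-variance contribution to the test error on top of the infinite-width learning curve,

    E_{test}(P, N) \approx E_{\infty}(P) + c(P) / N

where E_{\infty}(P) is the infinite-width (kernel) learning curve and the 1/N variance term is what ensembling over independently initialized networks suppresses. When this term dominates, widening the model (or ensembling) helps more than adding data.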

A related model is also explored by Maloney et al.:

https://arxiv.org/abs/2210.16859

Also, some recent work by our group on the consistency of training dynamics across widths, as well as further ways in which finite width has a limiting effect:

https://arxiv.org/abs/2305.18411

