Written by Ethan Smith

The performance of current-day neural networks comes from three things: the scale of compute, the amount of data, and the design of the architecture (and training method) itself.

In recent years, gains have mostly been explained by the first two, which have yielded attractive scaling laws that can give us a sense of performance ahead of time.
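As a rough reference (this formula is not from the post itself), these laws typically take a power-law form in the style of Kaplan et al. (2020), with the symbols below following that paper:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
$$

where $L$ is test loss, $N$ is parameter count, and $N_c$, $\alpha_N$ are empirically fitted constants; analogous laws hold for dataset size and compute.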

In this post, though, I want to talk a bit about the last point. The tricks of modern architectures (and generative methods) embody a set of desirable properties that yield favorable performance while staying within the constraints of hardware and training dynamics.



Accompanying BabyHypernetworks repo:

https://github.com/ethansmith2000/BabyHypernetworks/


Also, a recent talk I gave covering a lot of the topics in this post:

https://youtu.be/U5vG4GRbRGU?si=91hjKjoQa9G96_k2

Intro


The two biggest game changers in ML over the past 8 years or so have perhaps been the advent of the Transformer architecture and generative diffusion models.

Transformers offered perks for language modeling that LSTMs, RNNs, and other architectures simply could not deliver, and diffusion, amid the half-satisfactory results of VAEs and GANs, came to take the throne for generation in continuous spaces like pixels and audio.

The models of today aren't things we just happened upon; rather, I think there is an underlying set of desirable traits that earlier works touched on but could never quite make practical on hardware or stable to train.