Written by Ethan Smith

The repo:

https://github.com/ethansmith2000/AutoLoRADiscovery

Intro and Motivation


Something that has often interested me in ML is figuring out ways we can constrain our search space of parameters.

When we train neural networks, we typically have close to uninformative priors over our model weights, opting for normal/uniform initializations. There are some heuristics about what ranges of values tend to be friendly, but there typically aren't any data-specific choices here. Once we observe results, the Bayesian framework suggests we should update our priors.
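As a concrete illustration (a minimal sketch assuming PyTorch; this is not from the repo): the default initialization of a linear layer depends only on the layer's shape, not on anything we know about the data or about previously trained models.

```python
import torch.nn as nn

# PyTorch's default nn.Linear init draws weights from a uniform range set by
# the layer's fan-in alone -- an essentially uninformative, data-agnostic prior.
layer = nn.Linear(512, 512)
print(layer.weight.min().item(), layer.weight.max().item())
```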

Thus, even though the solutions we discover when training large models on the same datasets tend to exhibit common traits (e.g. steadily increasing norms, the magnitudes of residual connections) that we could use to constrain where we look or to form better-informed initializations, we don't make much use of them.

Granted, this isn't necessarily easy to exploit. It might sound nice, but it's not always practical: the distribution of discovered solutions may be many sparse peaks across the landscape, which may not give us a much better option than to aim for an initialization that covers everything, lottery ticket hypothesis style.

One place I think this could be practical, though, is with LoRAs (or other few-parameter adapter methods).

LoRA is in itself an act of constraining our search space, and that constraint is part of why these parameter-efficient training methods work at all.
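To make that constraint concrete, here is a minimal sketch of a LoRA-style linear layer (a hypothetical illustration, not the repo's implementation): the pretrained weight is frozen and we only learn a low-rank update BA, shrinking the trainable search space from d_out x d_in parameters down to r x (d_in + d_out).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style layer: frozen base weight plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained layer
        d_out, d_in = base.weight.shape
        # Low-rank factors: only r*(d_in + d_out) trainable parameters
        # instead of d_in*d_out.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init, so the update starts at 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

The rank r is the knob: every LoRA trained on the same base model lives in the same small, shared parameterization, which is exactly what makes it tempting to study the distribution of discovered solutions.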

But as an experiment, we can likely go further.