Written by Ethan Smith
<aside> 📌 Repo link
https://github.com/ethansmith2000/LoraNorm
</aside>
To note, this is not a LoRA in the typical sense: we’re not learning any low-rank approximation of the weights. I kept the name anyway because only a very small number of weights are trained, and because “LoRA” has colloquially become near-synonymous with efficient finetuning methods.
I noticed that none of the LoRA schemes for SD train the norm layers of the network, which feels like a very low-cost addition to training. I can’t recall the name of the paper, but prior to the whole LoRA explosion there were papers that finetuned exclusively the norm layers of a network with some pretty beneficial effects.
While I think it generally makes more sense to train the norm parameters directly, there is some utility in being able to add adapters to models, merge them, and so forth, so I think it makes for a convenient setup.
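As a rough sketch of what the “train the norms directly” option could look like in PyTorch (the helper name `get_norm_params` and the hyperparameters here are illustrative, not taken from the repo):

```python
import torch
import torch.nn as nn

def get_norm_params(model: nn.Module):
    """Freeze the whole model, then unfreeze only the affine
    parameters (weight/bias) of its normalization layers."""
    model.requires_grad_(False)
    norm_params = []
    for module in model.modules():
        if isinstance(module, (nn.LayerNorm, nn.GroupNorm)):
            for p in module.parameters():
                p.requires_grad_(True)
                norm_params.append(p)
    return norm_params

# Example usage with an SD UNet loaded via diffusers (assumed installed):
# from diffusers import UNet2DConditionModel
# unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
# params = get_norm_params(unet)
# optimizer = torch.optim.AdamW(params, lr=1e-4)
```

Only the norm layers’ weights and biases end up in the optimizer, so the trainable parameter count stays tiny compared to full finetuning or even a typical LoRA.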
A quick recap of how layernorms work