Written by Ethan Smith
To start off, the title is a slight misnomer. Canonically, diffusion refers to the forward process where we destroy data into noise through a random process, and then we learn the reverse process. That doesn't happen here.
But I kept the name because I think most people think of the generative portion when they hear the word diffusion, and this approach takes some inspiration from it.
The aspect of diffusion we are trying to recover here is the ability to synthesize images by developing low frequency components first and moving on to higher frequencies.
<aside> 📌 **In short, the philosophy here is that the diffusion model's generative process in the data/pixel domain is similar to sequence modeling in the frequency domain.
Kinda analogous to the common wisdom that convolution in the time domain is multiplication in the frequency domain.
The idea is that diffusion ends up building out frequencies from low to high because Gaussian noise acts like a low-pass filter (discussed later). We can turn that on its head and explicitly model these frequencies low to high in an autoregressive manner.**
</aside>
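To make that low-pass intuition concrete before it comes up again later, here's a quick NumPy sanity check. The synthetic 1/f-amplitude signal and the SNR > 1 cutoff are illustrative choices of mine, not anything from the method: white Gaussian noise has a flat power spectrum, while natural-image-like signals concentrate power at low frequencies, so as the noise level grows, the highest frequencies are the first to drop below the noise floor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a natural image signal: random phases with a 1/f amplitude
# spectrum (natural images roughly follow a 1/f^2 *power* spectrum).
n = 256
freqs = np.fft.rfftfreq(n)
amps = np.where(freqs > 0, 1.0 / np.maximum(freqs, 1e-9), 0.0)
spectrum = amps * np.exp(1j * rng.uniform(0, 2 * np.pi, freqs.shape))
signal = np.fft.irfft(spectrum, n=n)
signal /= signal.std()

for noise_std in (0.1, 0.5, 2.0):
    noise = noise_std * rng.standard_normal(n)
    # White noise has a flat spectrum, so per-frequency SNR decays with f.
    snr = np.abs(np.fft.rfft(signal)) ** 2 / np.abs(np.fft.rfft(noise)) ** 2
    cutoff = freqs[snr > 1.0].max() if (snr > 1.0).any() else 0.0
    print(f"noise_std={noise_std}: SNR > 1 holds up to f ~ {cutoff:.3f}")
```

As the noise level increases, the frequency below which the signal still dominates shrinks toward DC, which is exactly the "low frequencies survive longest" behavior the callout describes.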
What I want to see here is if we can create an autoregressive transformer model that can build an image by sequentially generating its frequency components.
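The post doesn't pin down a particular transform or ordering, so as one concrete (hypothetical) choice, here's how you might flatten an image into such a sequence using a 2D DCT with coefficients sorted by radial frequency. The function names are mine:

```python
import numpy as np
from scipy.fft import dctn, idctn

def image_to_freq_sequence(img: np.ndarray):
    """Flatten a 2D image into a 1D sequence of DCT coefficients,
    ordered from lowest to highest radial frequency."""
    coeffs = dctn(img, norm="ortho")
    h, w = coeffs.shape
    # Radial frequency of each coefficient (distance from the DC corner).
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    order = np.argsort((yy**2 + xx**2).ravel(), kind="stable")
    return coeffs.ravel()[order], order

def freq_sequence_to_image(seq, order, shape):
    coeffs = np.empty(np.prod(shape))
    coeffs[order] = seq
    return idctn(coeffs.reshape(shape), norm="ortho")

# Round trip on random data. Truncating the sequence before inverting
# gives a blurry image, which is what a partial low-to-high generation
# would look like mid-way through sampling.
img = np.random.rand(32, 32)
seq, order = image_to_freq_sequence(img)
recon = freq_sequence_to_image(seq, order, img.shape)
assert np.allclose(img, recon)
```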
The upside is that we can now get a loss from the entire generation process simultaneously. Normally when training diffusion models, you sample a random timestep, get a prediction, and take the loss for that timestep alone.
Because causal transformers train on every sequence position in parallel, we can compute the loss for the entire trajectory at once.
Granted, a single frequency coefficient is not really the same as a prediction of the entire image/noise.
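To make the contrast concrete, here's a minimal PyTorch sketch under my own assumptions (the model sizes and the scalar-regression head are illustrative, not the actual architecture): one teacher-forced forward pass yields a prediction and a loss at every position in the frequency sequence, where diffusion training would give you the loss for a single sampled timestep.

```python
import torch
import torch.nn as nn

class FreqAR(nn.Module):
    """Causal transformer that predicts the next frequency coefficient."""
    def __init__(self, seq_len: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Linear(1, d_model)         # scalar coefficient -> vector
        self.pos = nn.Embedding(seq_len, d_model)  # which frequency slot this is
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 1)

    def forward(self, coeffs):
        # coeffs: (batch, seq), ordered low -> high frequency
        b, t = coeffs.shape
        x = self.embed(coeffs.unsqueeze(-1)) + self.pos(torch.arange(t, device=coeffs.device))
        causal = torch.triu(torch.full((t, t), float("-inf"), device=coeffs.device), diagonal=1)
        return self.head(self.encoder(x, mask=causal)).squeeze(-1)

model = FreqAR(seq_len=1024)
seqs = torch.randn(8, 1024)                       # stand-in batch of coefficient sequences
pred = model(seqs[:, :-1])                        # teacher forcing over all positions at once
loss = nn.functional.mse_loss(pred, seqs[:, 1:])  # loss covers the whole trajectory
loss.backward()
```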
Secondly, I think there's an advantage over autoregressive image-generation methods like DALL-E 1, Parti, CogView, and some others.
To me it always felt really strange that you would just generate all the pixels or latent tokens of an image raster-scan style, i.e. starting from the top-left corner and going across each row.