Written by Ethan Smith

Post I am responding to: https://www.reddit.com/r/StableDiffusion/comments/1ag5h5s/the_vae_used_for_stable_diffusion_1x2x_and_other/

Intro

I’ll start off by saying that their observations are correct, but the results are portrayed in a way that does not really convey the magnitude of the effect. I also think the single example shown could lead to some faulty conclusions.

I originally wrote a thread rebutting this, but I didn’t feel it was up to par, so this post aims to do the counterargument better justice.

“CompVis fucked up the KL divergence loss on the KL-F8 VAE that is used by SD1.x, SD2.x, SVD, DALL-E 3, and probably other models.”

It’s true that the VAE used for Stable Diffusion deviates substantially from typical VAEs, but the authors state that this choice is intentional: it prioritizes exact reconstructions over an interpretable latent space. However, a small KL term remains, which still nudges the VAE toward a latent space where neighboring points exhibit some similarity.
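
To make that concrete, here is a minimal sketch of the loss in question (illustrative names, not the actual CompVis training code): the KL term is kept but heavily down-weighted, so reconstruction dominates. If I recall the latent-diffusion autoencoder configs correctly, this weight is set around 1e-6.

```python
import torch

def kl_f8_style_loss(recon, target, mu, logvar, kl_weight=1e-6):
    # Reconstruction term in pixel space (the real model also adds
    # perceptual/LPIPS and adversarial terms on top of this).
    recon_loss = torch.abs(recon - target).mean()

    # KL divergence between the encoder posterior N(mu, sigma^2)
    # and the standard normal prior N(0, I).
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).mean()

    # With kl_weight this small, the prior barely constrains the
    # latents; it only nudges them toward N(0, I).
    return recon_loss + kl_weight * kl
```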

But it doesn’t have the kind of semantic-travel properties you see in the VAEs most people are probably familiar with, like what’s portrayed here: https://medium.com/mlearning-ai/latent-spaces-part-2-a-simple-guide-to-variational-autoencoders-9369b9abd6f…
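
By “semantic travel” I mean something like the sketch below: decode points along a straight line between two encoded latents. `encoder` and `decoder` are placeholders for whichever VAE you are probing; with a strongly KL-regularized VAE the intermediate frames morph semantically, while with the KL-F8 VAE they look much closer to a pixel-space crossfade.

```python
import torch

@torch.no_grad()
def latent_walk(encoder, decoder, img_a, img_b, steps=8):
    # Assumes encoder returns a latent tensor (e.g. the posterior mean).
    z_a, z_b = encoder(img_a), encoder(img_b)
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b  # straight-line path in latent space
        frames.append(decoder(z))
    return frames
```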

“As a result, the latent space created by it has a massive KL divergence and is smuggling global information about the image through a few pixels.”

A massive KL divergence from the standard normal distribution, if that’s what was meant. And yes, the observations show that modifying latent values corresponding to a single 8x8 pixel patch can influence other patches as well.
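
If you want to reproduce that observation yourself, here is a rough sketch using the diffusers AutoencoderKL (the checkpoint, latent position, and perturbation scale are my own choices for illustration): encode an image, perturb a single spatial latent position, decode, and look at where the output changed.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def poke_latent(image, y=16, x=16, scale=5.0):
    # image: (1, 3, H, W) in [-1, 1]; latents: (1, 4, H/8, W/8).
    z = vae.encode(image).latent_dist.mean
    base = vae.decode(z).sample

    z_mod = z.clone()
    z_mod[:, :, y, x] += scale  # perturb one spatial latent position
    mod = vae.decode(z_mod).sample

    # Mean absolute change per pixel. If the VAE were purely local,
    # this would be nonzero only in the 8x8 patch at
    # [8*y : 8*y + 8, 8*x : 8*x + 8].
    return (mod - base).abs().mean(dim=1)
```

In practice the difference map lights up well outside that patch, which is exactly the nonlocal influence the post describes.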