from Hacker News

The VAE Used for Stable Diffusion Is Flawed

by prashp on 2/1/24, 12:25 PM with 66 comments

  • by joefourier on 2/1/24, 2:00 PM

    I’ve done a lot of experiments with latent diffusion and have also discovered a few flaws in the SD VAE’s training and architecture, which have had hardly any attention brought to them. This is concerning, as the VAE is a crucial component when it comes to image quality and is responsible for many of the artefacts associated with AI-generated imagery, and no amount of training the diffusion model will fix them.

    A few I’ve seen are:

    - The goal should be for the latent outputs to closely resemble Gaussian-distributed values between -1 and 1 with a variance of 1, but the outputs are unbounded (you could easily clamp them or apply tanh to force them to be between -1 and 1), and the KL loss weight is too low, which is why the latents are scaled by a magic number to more closely fit the -1 to 1 range before being ingested by the diffusion model.

    - To decrease the computational load of the diffusion model, what matters is reducing the spatial dimensions of the input - having a low number of channels is largely irrelevant. The SD VAE turns each 8x8x3 block into a 1x1x4 block when it could be turning it into a 1x1x8 (or even higher) block and preserving much more detail at essentially zero computational cost, since the first operation the diffusion model does is apply a convolution that greatly increases the number of channels anyway (see the first sketch after this list).

    - The discriminator is based on a tiny PatchGAN, which is an ancient model by modern standards. You can get much better results by applying some of the GAN research of the last few years, or of course by using a diffusion decoder that is then distilled with either consistency or adversarial distillation.

    - KL divergence in general is not even the optimal way to achieve the goals of a latent diffusion model’s VAE, which are to decrease the spatial dimensions of the input images and to have a latent space that’s robust to noise and local perturbations. I’ve had better results with a vanilla AE, clamping the outputs, adding a variance loss term and applying various perturbations to the latents before they are ingested by the decoder (see the second sketch after this list).
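    A quick illustration of the channel-count point above: the first convolution of the diffusion model is so cheap that doubling the latent channels barely changes its cost. This is a generic PyTorch sketch, not SD's actual code; the 320 output channels and the function name are assumptions for illustration.

      # Parameter count of a hypothetical U-Net input convolution for 4- vs 8-channel latents.
      import torch.nn as nn

      def first_conv_params(latent_channels: int, model_channels: int = 320) -> int:
          conv = nn.Conv2d(latent_channels, model_channels, kernel_size=3, padding=1)
          return sum(p.numel() for p in conv.parameters())

      print(first_conv_params(4))  # 4-channel latent, as in the current SD VAE
      print(first_conv_params(8))  # 8-channel latent: a negligible increase relative to the full U-Net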
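    And a minimal sketch of the KL-free alternative described in the last bullet: a plain autoencoder with tanh-bounded latents, a variance penalty, and noise added to the latents before decoding. The encoder, decoder, loss weights and noise level here are all assumptions, not the commenter's actual setup.

      # Plain-AE training loss: bound the latents, push their variance toward 1, and
      # perturb them before decoding so the decoder becomes robust to local noise.
      import torch
      import torch.nn.functional as F

      def ae_loss(encoder, decoder, images, noise_std=0.1, var_weight=0.01):
          z = torch.tanh(encoder(images))                 # keep latents in [-1, 1] without a KL term
          var_loss = (z.var() - 1.0).abs()                # encourage unit variance across the batch
          z_noisy = z + noise_std * torch.randn_like(z)   # perturb latents before the decoder sees them
          recon = decoder(z_noisy)
          return F.mse_loss(recon, images) + var_weight * var_loss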

  • by mmastrac on 2/1/24, 3:26 PM

    There seems to be a convincing debunking thread on Twitter, but I definitely don't have the chops to evaluate either claim:

    https://twitter.com/Ethan_smith_20/status/175306260429219874...

  • by dawnofdusk on 2/1/24, 2:33 PM

    This is one of the cool things about various neural network architectures that I've found in my own work: you can make a lot of dumb mistakes in coding certain aspects, but because the model has so many degrees of freedom, it can actually "learn away" your mistakes.

  • by prashp on 2/1/24, 2:01 PM

    Emad (StabilityAI founder) posted on the reddit thread:

    "Nice post, you'd be surprised at the number of errors like this that pop up and persist.

    This is one reason we have multiple teams working on stuff..

    But you still get them"

  • by msp26 on 2/1/24, 2:32 PM

    > and I would also like to thank the Glaze Team, because I accidentally discovered this while analyzing latent images perturbed by Nightshade and wouldn't have found it without them, because I guess nobody else ever had a reason to inspect the log variance of the latent distributions created by the VAE

    That's just hilarious

  • by wokwokwok on 2/1/24, 1:39 PM

    > It's a spot where the VAE is trying to smuggle global information about the image through latent space. This is exactly the problem that KL-divergence loss is supposed to prevent.

    Is that what KL divergence does?

    I thought it was supposed to (when combined with reconstruction loss) “smooth” the latent space out so that you could interpolate over it.

    Doesn’t increasing the weight of the KL term just result in random output in the latent, e.g. what you get if you opt purely for KL divergence?

    I honestly have no idea at all what the OP has found or what it means, but it doesn't seem that surprising that modifying the latent results in global changes in the output.

    Is manually editing latents a thing?

    Surely you would interpolate from another latent…? And if the result is chaos, you don't have well-clustered latents? (Which is what happens from too much KL, not too little, right?)

    I'd feel a lot more 'across' this if the OP had demonstrated it on a trivial MNIST VAE, showing the issue, the result, and quantitatively what fixing it does.

    > What are the implications?

    > Somewhat subtle, but significant.

    Mm. I have to say I don't really get it.
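    For reference, the KL term in a standard VAE is computed in closed form from the encoder's predicted mean and log-variance; weighting it more heavily pulls every latent toward an uninformative standard normal, while weighting it too lightly lets the encoder pack arbitrary extra information into the latents. A generic sketch (assuming an SD-style latent of shape batch x channels x height x width), not SD's training code:

      import torch

      def kl_to_standard_normal(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
          # KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dims, averaged over the batch
          return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp())).sum(dim=(1, 2, 3)).mean()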

  • by treprinum on 2/1/24, 2:37 PM

    The author might be right, though what I've noticed with DL models is that the theory often leads to underwhelming results after training, and "bugs" in models sometimes lead to much better real-world performance, pointing to some disconnect between theory and what gradient-based optimization can achieve. One could also see this in deep reinforcement learning, where in theory the model should converge (being Markovian, via the Banach fixed-point theorem), but in practice the monstrous neural networks that estimate rewards can override this and change the character of the convergence.

  • by nness on 2/1/24, 1:29 PM

    Could someone ELI5? What is the impact of this issue?

  • by urbandw311er on 2/1/24, 3:00 PM

    Related (coincidentally) — Google also posted research on a much more efficient approach to image generation:

    https://news.ycombinator.com/item?id=39210458

  • by brcmthrowaway on 2/1/24, 7:52 PM

    These people talk like high school kids with too much time on their hands, to be honest. Have we created a whole new class of practitioners hacking away with no theory background?

  • by stealthcat on 2/1/24, 2:53 PM

    Imagine all those millions of $$$ in GPU cloud credits spent on training, only to overlook this bug

  • by Robin_Message on 2/1/24, 1:33 PM

    If the latent space is meant to be highly spatially correlated, could you simply apply random rotations of rows and columns† to the latent space throughout the process? That way, there wouldn't be specific areas data could be smuggled through.

    † As in, move a random vertical stripe of the image from the right to the left, and a random horizontal portion from the top to the bottom. Or, if that introduces unacceptable edge effects, simply slice the space into 4 randomly sized pieces (although that might encourage smuggling in all of the corners at once).
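    A rough sketch of that idea: circularly shift ("roll") the latent grid by a random offset along each spatial axis, so no fixed location can reliably carry global information. Assumes a latent of shape (batch, channels, height, width); purely illustrative, not something SD actually does.

      import torch

      def random_roll(latent: torch.Tensor) -> torch.Tensor:
          # Wrap a random vertical stripe from right to left and a random horizontal one from top to bottom.
          dh = int(torch.randint(latent.shape[-2], (1,)))
          dw = int(torch.randint(latent.shape[-1], (1,)))
          return torch.roll(latent, shifts=(dh, dw), dims=(-2, -1))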

  • by ants_everywhere on 2/1/24, 1:44 PM

    I'm curious, was this well-known by experts already? How surprising is this?

    I enjoyed the write up.

  • by bornfreddy on 2/1/24, 4:22 PM

    Off-topic: is anyone aware of good tutorials / how-tos / books / videos on the NN developments of the last few years? (Attention, SD, LLMs, ...)

  • by itsTyrion on 2/1/24, 6:36 PM

    so many kWh wasted
  • by firechickenbird on 2/1/24, 4:32 PM

    Backprop doesn’t give a shit