by prashp on 2/1/24, 12:25 PM with 66 comments
by joefourier on 2/1/24, 2:00 PM
A few I’ve seen are:
- The goal should be for the latent outputs to closely resemble Gaussian-distributed values between -1 and 1 with a variance of 1, but the outputs are unbounded (you could easily clamp or apply tanh to force them into that range), and the KL loss weight is too low, which is why the latents are scaled by a magic number to fit the -1 to 1 range more closely before being ingested by the diffusion model (see the scaling sketch after this list).
- To decrease the computational load of the diffusion model, what you want to reduce is the spatial dimensions of the input; having a low number of channels is largely irrelevant. The SD VAE turns each 8x8x3 block into a 1x1x4 block when it could turn it into a 1x1x8 (or even higher) block and preserve much more detail at essentially zero extra computational cost, since the first operation the diffusion model does is apply a convolution that greatly increases the number of channels.
- The discriminator is based on a tiny PatchGAN, which is an ancient model by modern standards. You can get much better results by applying some of the GAN research from the last few years, or of course by using a diffusion decoder which is then distilled with either consistency or adversarial distillation.
- KL divergence in general is not even the optimal way to achieve the goals of a latent diffusion model’s VAE, which are to decrease the spatial dimensions of the input images and to have a latent space that’s robust to noise and local perturbations. I’ve had better results with a vanilla AE, clamping the outputs, adding a variance loss term, and applying various perturbations to the latents before they are ingested by the decoder (see the AE sketch below).
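For context on the "magic number" scaling mentioned above, here is a minimal sketch assuming a diffusers-style AutoencoderKL interface; the helper function names are placeholders, but 0.18215 is the latent scaling factor actually used by Stable Diffusion 1.x, precisely because the raw VAE latents don't have unit variance:

```python
# Hypothetical helpers, assuming a diffusers-style AutoencoderKL (`vae`).
SD_SCALING_FACTOR = 0.18215  # scaling factor used by Stable Diffusion 1.x

def encode_for_diffusion(vae, images):
    latents = vae.encode(images).latent_dist.sample()  # unbounded latents
    return latents * SD_SCALING_FACTOR                 # bring std closer to 1

def decode_from_diffusion(vae, latents):
    return vae.decode(latents / SD_SCALING_FACTOR).sample
```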
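And a rough sketch of the vanilla-AE alternative from the last bullet: bounded latents, a variance penalty standing in for the KL prior, and noise injected into the latents before decoding. The encoder/decoder modules and the loss weights here are placeholders, not the actual setup used:

```python
import torch
import torch.nn.functional as F

def ae_training_step(encoder, decoder, images, noise_std=0.1, var_weight=0.1):
    # Encode and bound the latents instead of relying on a KL term.
    z = torch.tanh(encoder(images))  # latents constrained to [-1, 1]

    # Encourage per-channel variance close to 1 (stand-in for the Gaussian prior).
    var_loss = ((z.flatten(2).var(dim=-1) - 1.0) ** 2).mean()

    # Perturb latents before decoding so the latent space stays robust to the
    # kind of noise the diffusion model will later introduce.
    z_noisy = z + noise_std * torch.randn_like(z)

    recon = decoder(z_noisy)
    recon_loss = F.mse_loss(recon, images)
    return recon_loss + var_weight * var_loss
```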
by mmastrac on 2/1/24, 3:26 PM
https://twitter.com/Ethan_smith_20/status/175306260429219874...
by dawnofdusk on 2/1/24, 2:33 PM
by prashp on 2/1/24, 2:01 PM
"Nice post, you'd be surprised at the number of errors like this that pop up and persist.
This is one reason we have multiple teams working on stuff..
But you still get them"
by msp26 on 2/1/24, 2:32 PM
That's just hilarious
by wokwokwok on 2/1/24, 1:39 PM
Is that what KL divergence does?
I thought it was supposed to (when combined with reconstruction loss) “smooth” the latent space out so that you could interpolate over it.
Doesn’t increasing the weight of the KL term just result in random output in the latent, i.e. what you get if you optimize purely for KL divergence?
I honestly have no idea at all what the OP has found or what it means, but it doesn’t seem that surprising that modifying the latent results in global changes in the output.
Is manually editing latents a thing?
Surely you would interpolate from another latent…? And if the result is chaos, you don’t have well-clustered latents? (Which is what happens with too much KL, not too little, right?)
I'd feel a lot more 'across' this if the OP had demonstrated it on a trivial MNIST VAE, showing the issue, the result, and quantitatively what fixing it does.
> What are the implications?
> Somewhat subtle, but significant.
Mm. I have to say I don't really get it.
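For reference, the knob being debated above is the weight on the KL term in the VAE objective. A minimal sketch (the beta parameter and the plain MSE reconstruction term are simplifications; SD's actual VAE also uses LPIPS and an adversarial loss):

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar, beta=1.0):
    # Reconstruction term (plain per-pixel MSE here for simplicity).
    recon_loss = F.mse_loss(recon, target)

    # KL divergence between the encoder's N(mu, sigma^2) and the N(0, 1) prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # beta near 0: behaves like a plain autoencoder (latents can drift from the prior);
    # beta large: latents hug the prior and reconstructions blur.
    return recon_loss + beta * kl
```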
by treprinum on 2/1/24, 2:37 PM
by nness on 2/1/24, 1:29 PM
by mzs on 2/1/24, 3:36 PM
by urbandw311er on 2/1/24, 3:00 PM
by brcmthrowaway on 2/1/24, 7:52 PM
by stealthcat on 2/1/24, 2:53 PM
by Robin_Message on 2/1/24, 1:33 PM
† As in, move a random vertical stripe of the image from the right to the left, and a random horizontal stripe from the top to the bottom. Or, if that introduces unacceptable edge effects, simply slice the space into 4 randomly sized pieces (although that might encourage smuggling in all of the corners at once).
by ants_everywhere on 2/1/24, 1:44 PM
I enjoyed the write-up.
by bornfreddy on 2/1/24, 4:22 PM
by itsTyrion on 2/1/24, 6:36 PM
by firechickenbird on 2/1/24, 4:32 PM