from Hacker News

Efficient high-resolution image synthesis with linear diffusion transformer

by Vt71fcAqt7 on 10/16/24, 2:56 PM with 44 comments

  • by cube2222 on 10/16/24, 5:58 PM

    This looks like quite a huge breakthrough, unless I'm missing something?

    ~25x faster than Flux-dev, while offering comparable quality on benchmarks. And visually the examples (surely cherry-picked, but still) look great!

    Especially since with GenAI, the best way to get good results (imo) is to generate a large number of images and pick the best, as in the best-of-N sketch below. Performance like this will make that much easier/faster/cheaper.

    Code is unfortunately "(Coming soon)" for now. Can't wait to play with it!
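
    A minimal best-of-N sketch with diffusers, using SDXL as a stand-in since the Sana code isn't out yet (the model id, prompt, and N here are placeholders, not anything from the paper):

      import torch
      from diffusers import DiffusionPipeline

      # Stand-in checkpoint; swap in the Sana weights once they're released.
      pipe = DiffusionPipeline.from_pretrained(
          "stabilityai/stable-diffusion-xl-base-1.0",
          torch_dtype=torch.float16,
      ).to("cuda")

      prompt = "a lighthouse at dusk, oil painting"
      N = 8  # generate N candidates, then pick the best by eye (or a scorer)

      for seed in range(N):
          generator = torch.Generator("cuda").manual_seed(seed)
          image = pipe(prompt, generator=generator).images[0]
          image.save(f"candidate_{seed}.png")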

  • by ttul on 10/17/24, 5:05 AM

    There really are some “free lunches” in generative models. Really impressive work by this group. Ultimately, their model may not be the winner, because so much of what makes a good image gen model is the images and captioning that go into it, and the fine-tuning for aesthetic quality — something Midjourney and Flux both excel at. But the architecture here certainly will get into the hands of the people who can make the next great model.

    Looking forward to it. This space just keeps getting more interesting.

  • by lpasselin on 10/16/24, 7:02 PM

    This comes from the same group as the EfficientViT model. A few months ago, their EfficientViT model was the only modern, small ViT-style model I could find with raw PyTorch code available. No dependencies on the shitty frameworks and libraries that other ViTs use.

  • by echelon on 10/16/24, 4:25 PM

    Image models are going to be widely available. They'll probably be a dime a dozen soon. It's great that an increasing number of models are going open, because these are the ecosystems that will grow.

    3D models (sculpts, texture, retopo, etc.) are following a similar trend and trajectory.

    Open video models are lagging behind by several years. While CogVideo and Pyramid are promising, video models are petabyte scale and so much more costly to build and train.

    I'm hoping video becomes free and cheap, but it's looking like we might be waiting a while.

    Major kudos to all of the teams building and training open source models!

  • by wiradikusuma on 10/17/24, 4:54 AM

    In my opinion, what's missing from this "image GenAI" tech is the ability to generate subsequent images consistently.

    That would be useful for e.g. book illustration, comic strips, and icon sets. Otherwise, people will think you picked those images from all over the internet rather than from one source/theme.

  • by cpldcpu on 10/16/24, 10:59 PM

    >We introduce a new Autoencoder (AE) that aggressively increases the scaling factor to 32. Compared with AE-F8, our AE-F32 outputs 16× fewer latent tokens,

    Basically, they compress/decompress the images more aggressively, so the diffusion model operates on far fewer latent tokens and needs less computation during generation. But on the flip side, this should mean less variability.

    Isn't this more of a design trade-off than an optimization?
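
    For concreteness, here's the token arithmetic behind the quoted 16× figure (a minimal sketch, assuming one token per spatial position in a square latent grid):

      # Latent token counts for a 1024x1024 image at different AE scaling factors.
      def latent_tokens(resolution: int, scale: int) -> int:
          side = resolution // scale  # latent grid side length
          return side * side          # one token per latent position

      f8 = latent_tokens(1024, 8)    # AE-F8:  128 * 128 = 16384 tokens
      f32 = latent_tokens(1024, 32)  # AE-F32:  32 *  32 =  1024 tokens
      print(f8, f32, f8 // f32)      # 16384 1024 16 -> the quoted 16x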

  • by cynicalpeace on 10/17/24, 1:30 PM

    None of this means much to me unless I can actually use it. Sorta like how Sora has been totally overshadowed by Kling, Runway, Minimax.

    You have to release your model in some fashion for it to be impressive.

  • by amelius on 10/17/24, 12:35 PM

    Does this finally solve the class of "6 fingers/hand" problems?

  • by smusamashah on 10/16/24, 3:11 PM

    > (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 resolution image.
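
    A minimal sketch of how one might check that sub-second claim locally, assuming a diffusers-style pipeline (the model id is a placeholder; the Sana weights weren't released at the time of this thread):

      import time
      import torch
      from diffusers import DiffusionPipeline

      # Placeholder checkpoint; substitute the Sana weights once available.
      pipe = DiffusionPipeline.from_pretrained(
          "stabilityai/stable-diffusion-xl-base-1.0",
          torch_dtype=torch.float16,
      ).to("cuda")

      prompt = "a photo of an astronaut riding a horse"
      pipe(prompt, height=1024, width=1024)   # warmup run (loads kernels, caches)

      torch.cuda.synchronize()
      start = time.perf_counter()
      pipe(prompt, height=1024, width=1024)
      torch.cuda.synchronize()
      print(f"one 1024x1024 image in {time.perf_counter() - start:.2f}s")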