from Hacker News

PixArt-α: A New Open-Source Text-to-Image Model Challenging SDXL and Dalle·3

by liuxiaopai on 11/13/23, 7:45 PM with 27 comments

  • by Animats on 11/14/23, 12:50 AM

    This has problems usually not seen with current systems. It's produced human characters with one thick leg and one thin leg. Three legs of different sizes. Three arms.

    It can do humans in passive poses, but ask for an action shot and it botches it badly. It needs more training data on how bodies move. Maybe load it up with stills from dance, martial arts, and sports.

  • by GaggiX on 11/13/23, 11:43 PM

    The most interesting aspect of this model is that it is very training efficient: https://pixart-alpha.github.io/

    It also uses the same idea as Dalle 3: training the model on synthetic captions.

  • by ShamelessC on 11/14/23, 1:03 AM

    Why name it PixArt when it covers a broader range of media than simply pixel art? Super confusing.

  • by krasin on 11/14/23, 12:00 AM

    The source code is licensed under AGPL-3.0. Perfect for these kinds of models: https://github.com/PixArt-alpha/PixArt-alpha

  • by gigel82 on 11/14/23, 12:09 AM

    From their GitHub:

    >This integration allows running the pipeline with a batch size of 4 under 11 GBs of GPU VRAM. GPU VRAM consumption under 10 GB will soon be supported, too. Stay tuned.

  • by ilaksh on 11/13/23, 9:36 PM

    Seems to have pretty good understanding and performance.

  • by camdenlock on 11/14/23, 12:14 AM

    This appears to be work sponsored by Huawei.

  • by andromeduck on 11/14/23, 4:42 AM

    Thought this was going to be a new optical sensor series :(

  • by philmitchell47 on 11/14/23, 11:48 AM

    I think it's somewhat disingenuous to claim such improvements in training efficiency when they rely on:

    - Existing models for data pseudo-labelling

    - ImageNet pretraining

    - A frozen text encoder

    - A frozen image encoder