from Hacker News

StoryDiffusion: Long-range image and video generation

by doodlesdev on 5/1/24, 12:07 AM with 65 comments

  • by schoen on 5/1/24, 1:52 AM

    I looked very closely at the videos for a while and managed to find some minor continuity errors (like different numbers of buttons on people's button-down shirts at different times, or different sizes or styles of earrings, or arguably different interpretations of which finger is which in an intermittently-obscured hand). I also think that the cycling woman's shorts appear to cover more of her left leg than her right leg, although that's not physically impossible, and the bear seemingly has a differently-sized canine tooth at different times.

    But I guess it took me multiple minutes to find these problems, watching each video clip many times, rather than having any of them jump out at me. So, it's not like literally full consistent object persistence, but at a casual viewing it was very persuasive.

    Maybe people who shoot or edit video frequently would notice some of these problems more quickly, because they're more attuned to looking for continuity problems?

  • by samspenc on 5/1/24, 1:11 AM

    Normally I don't mind spelling errors - and there are plenty in the examples - but my question is: did the system really produce "lunch" when the prompt was "they have launch at restraunt" (verbatim from the sample)? I can imagine it getting "restaurant" right, but I would have expected it to produce something like a rocket-launch image rather than figuring out the author meant lunch.
  • by hbbio on 5/1/24, 1:35 AM

    GitHub link is not public yet?

    https://github.com/HVision-NKU/StoryDiffusion

  • by smusamashah on 5/1/24, 8:41 AM

    This is unbelievably good. Seems better than Sora even in terms of natural look and motion in videos.

    The video of two girls talking seems so natural. There are some artifacts, but the movement is fluid and the clothes and other things around them are not continuously changing.

    I hope it does become open source, though I suspect it won't because it's coming from ByteDance.

  • by forgingahead on 5/1/24, 2:26 AM

    GitHub link is broken, and I honestly find it frustrating that the only working link to code is the theme source and credits. Is it really that important to give the static page theme that much real estate instead of an actual code release for the project?
  • by speedgoose on 5/1/24, 9:47 AM

    Is there a video of Will Smith eating spaghetti with this model?
  • by LeoPanthera on 5/1/24, 1:07 AM

    The rate of progress of generative AI is honestly quite scary.
  • by keikobadthebad on 5/1/24, 1:31 AM

    It'll be good if the girl and the giant squirrel are ever seen in the same park at the same time.
  • by MisterTea on 5/1/24, 1:12 PM

    One day we won't have 3D engines or GPUs but AI chips that generate the scenes without calculating a single triangle or loading a single texture. We'll just stream in a scene; IP asset seeds will provide the characters, plot, and story. But even those can be generated in real time. Video games, movies, anything will be on demand. No one will act. No one will draw. We will just sit and ask for more. Strange times.
  • by topspin on 5/1/24, 2:17 AM

    Love how under "Multiple Characters Generation" the white guy is "A Man," whereas the other is "An Asian Man." Reminds me of Daryl Gates and the "normal people" quote, thence patrol cars being called "black and normals."
  • by pmontra on 5/1/24, 5:38 AM

    The Moon in the sky seen from the surface of the Moon is wrong? Poetic? Funny? Recursive? A demonstration that these models don't understand anything? Add to the list.
  • by brotherdusk on 5/1/24, 1:36 AM

    Sorry, I can't access the repo, and the PDF link doesn't have an href attribute. Is that by design?
  • by zhoudaquan21 on 5/3/24, 3:37 AM

    Hi guys, thanks for your interest. The paper and the code are now released: https://github.com/HVision-NKU/StoryDiffusion. Currently, only the comics-related code is public. We are waiting for the company's assessment before releasing the video-related code.
  • by gbickford on 5/1/24, 2:41 AM

    It's always disappointing when people publish things to GitHub without the intention of collaborating or sharing.
  • by spywaregorilla on 5/1/24, 2:27 PM

    How is this conceptually different from tracking an embedding for a single character or training a LoRA on it?
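
    For concreteness, the "train a LoRA on it" baseline here would look roughly like this with the Hugging Face diffusers/peft stack (the model name, rank, and character_dataloader are placeholder assumptions, not anything from the StoryDiffusion paper):

        # Fit a small low-rank adapter on a handful of images of one character,
        # so later generations keep that character consistent.
        import torch
        from diffusers import StableDiffusionPipeline, DDPMScheduler
        from peft import LoraConfig, get_peft_model

        pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
        unet = pipe.unet
        noise_sched = DDPMScheduler.from_config(pipe.scheduler.config)

        # Inject rank-8 adapters into the attention projections only.
        lora_cfg = LoraConfig(r=8, lora_alpha=8,
                              target_modules=["to_q", "to_k", "to_v", "to_out.0"])
        unet = get_peft_model(unet, lora_cfg)
        opt = torch.optim.AdamW(unet.parameters(), lr=1e-4)

        # character_dataloader is a placeholder yielding (VAE latents, text embeddings)
        # for a few reference images of the single character.
        for latents, text_emb in character_dataloader:
            noise = torch.randn_like(latents)
            t = torch.randint(0, noise_sched.config.num_train_timesteps, (latents.shape[0],))
            noisy = noise_sched.add_noise(latents, noise, t)
            pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
            loss = torch.nn.functional.mse_loss(pred, noise)
            loss.backward(); opt.step(); opt.zero_grad()

    The question, then, is whether StoryDiffusion achieves its consistency without any such per-character optimization pass.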
  • by jerpint on 5/1/24, 10:30 AM

    The videos look incredible, but a lot of the captions are riddled with grammar/syntax mistakes that seem odd for a model of that quality to make.
  • by gtoast on 5/1/24, 7:43 PM

    It's really challenging to think of the positive, constructive uses for this technology without thinking of the myriad life- and society-affecting ones. Just interpersonally, the use of this technology is heavily weighted towards destruction and deception. I don't know where this ends or where the researchers who release this technology think it will go, but I can't imagine it's going anywhere good for all of us.
  • by 29athrowaway on 5/1/24, 2:22 AM

    Time for Microsoft Chat 2.0 it seems.
  • by nephanth on 5/1/24, 6:02 PM

    Um, the GitHub link is a 404, and the paper link points to the webpage itself (the paper is not on arXiv). Probably they put the website up too fast?
  • by freefruit on 5/1/24, 1:18 AM

    So is Amazon flooded with hyper niche e-books yet?
  • by peteradio on 5/1/24, 2:10 AM

    There is a video of two girls. One girl seems to be sticking out her tongue and then blowing a kiss, but the tongue reappears mid-kiss. Very arousing stuff, I'll say. Keep up the good work, Microsoft or Google or whoever made it.