from Hacker News

Fuyu-8B: A multimodal architecture for AI agents

by averylamp on 10/18/23, 4:46 PM with 57 comments

  • by tasdfqwer0897 on 10/18/23, 5:04 PM

    Hey, I work at Adept and helped make this! Happy to answer questions. The thing I think is especially neat/notable is how simple you can make the model architecture while still getting good performance. I expect we'll continue to see bits of these models get deleted in the next few years.

    Note that you can get the model weights on HuggingFace here: https://huggingface.co/adept/fuyu-8b
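
    If you want to try it quickly, something along these lines works with a recent transformers release (roughly following the model card; the image path and generation settings are placeholders, and exact argument names may vary by version):

      import torch
      from PIL import Image
      from transformers import FuyuProcessor, FuyuForCausalLM

      # Load the released checkpoint from the Hub (device_map needs accelerate installed).
      processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
      model = FuyuForCausalLM.from_pretrained(
          "adept/fuyu-8b", device_map="cuda:0", torch_dtype=torch.bfloat16
      )

      # Caption a local image (placeholder filename).
      prompt = "Generate a coco-style caption.\n"
      image = Image.open("your_image.png")
      inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")

      out = model.generate(**inputs, max_new_tokens=32)
      print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                                   skip_special_tokens=True)[0])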

  • by fpgaminer on 10/18/23, 9:49 PM

    The architecture is quite compelling. I would not have expected it to work as well as it does. Glancing at the benchmarks it's basically on par with other VLMs in its class, despite having no separate image encoder.

    Is there an associated paper? Or, more specifically, details on the training dataset? It must have been a mix of text and VLM tasks, otherwise one or the other capability would have rotted during training. But I wonder whether they trained strictly on VLM corpora, or also used plain image-text pair datasets like the ones CLIP was trained on. It would be interesting if it were only the former.

    Also makes me wonder if it could be trained on something like CommonCrawl where all the images are retained and interspersed correctly throughout the text. This model could theoretically train just fine off that, and it would unlock a whole new dataset effectively.

    And has there been an inspection of what the model is outputting for predicted image "tokens"? Is it correctly predicting projected image patches to any degree of accuracy? And could it therefore also generate images inline with text if an additional de-projection layer were trained?
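
    What I have in mind is something like the following (purely hypothetical; nothing like this ships with the released checkpoint, and the sizes are guesses on my part):

      import torch
      import torch.nn as nn

      # Hypothetical "de-projection" head: the inverse of the patch-embedding
      # projection. It would need its own training objective (e.g. pixel
      # reconstruction) on top of the frozen or finetuned decoder.
      hidden_size = 4096                        # assumed decoder width
      patch_size = 30                           # assumed patch edge length, in pixels
      patch_dim = patch_size * patch_size * 3   # flattened RGB patch

      deproject = nn.Linear(hidden_size, patch_dim)

      # Hidden states at positions where the model is predicting image content,
      # mapped back to pixel space and reshaped into patches.
      predicted_states = torch.randn(1, 16, hidden_size)        # (batch, patches, hidden)
      patches = deproject(predicted_states)                     # (batch, patches, patch_dim)
      patches = patches.view(1, 16, 3, patch_size, patch_size)  # (batch, patches, C, H, W)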

  • by abrichr on 10/19/23, 12:27 AM

    Thank you to the amazing team at Adept.ai for making this available!

    For anyone interested in contributing to a fully open source alternative, join us at https://github.com/OpenAdaptAI/OpenAdapt

    Lots of interesting work to be done, including integrating with Fuyu-8B!

  • by thatcherc on 10/18/23, 5:57 PM

    Really cool that the image patches are converted to tokens with just a linear projection instead of a big embedding model! I wonder if that trick will prove viable for other multimodal media like audio.
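
    Roughly the trick as I understand it (a toy sketch, not the actual Fuyu code; the sizes are guesses):

      import torch
      import torch.nn as nn

      # Each image patch is flattened and passed through a single linear layer,
      # so it lands in the same embedding space as the text tokens; the decoder
      # then sees one interleaved sequence and needs no separate image encoder.
      patch_size, hidden_size = 30, 4096        # assumed values
      patch_dim = patch_size * patch_size * 3   # flattened RGB patch

      patch_proj = nn.Linear(patch_dim, hidden_size)   # the only image-specific weights

      image_patches = torch.randn(1, 16, patch_dim)    # (batch, num_patches, pixels)
      text_embeds = torch.randn(1, 12, hidden_size)    # (batch, num_text_tokens, hidden)

      patch_embeds = patch_proj(image_patches)                  # (1, 16, hidden_size)
      sequence = torch.cat([patch_embeds, text_embeds], dim=1)  # fed to a plain decoder
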
  • by mark_l_watson on 10/19/23, 1:52 AM

    This looks so cool, and from reading the Hugging Face model card it should be easy enough to run. I do almost all of my work with text, NLP, IR, etc., and I have wanted to try multi-modal models. I just bookmarked the model card page.

    I am also getting even more excited by the explosion of work on open models. I still haven’t adjusted to how good mistral-7B is, and it runs on my Mac without breaking a sweat.

  • by yeldarb on 10/18/23, 11:31 PM

    This looks epic. Definitely going to explore adding it to Autodistill[1] this weekend. Any chance you'll be publicly releasing the internal OCR finetune?

    [1] https://github.com/autodistill/autodistill

  • by devinprater on 10/18/23, 9:46 PM

    Awesome! I can't wait to see how we can make local models for, say, describing images offline, or even getting a few screenshots of, say, a video game and describing what's going on.

  • by stavros on 10/19/23, 12:04 AM

    This looks great! Is there any software that supports these? Llama.cpp, Ollama, LM Studio, etc. are really convenient, but I don't think they have image support yet?

  • by paulkon on 10/18/23, 11:32 PM

    Can this be used to click around in the browser with text prompts? Maybe after some fine-tuning on screen recordings of specific workflows in browsers.

  • by WanderPanda on 10/19/23, 1:03 AM

    Why don't these benchmarks judge the likelihood of the example answer? Just taking the MAP predictions seems like a waste of information.
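
    Scoring the reference answer directly is cheap to do. For a plain causal LM with transformers it is roughly this (just a sketch; the names are illustrative, and the multimodal processors add image inputs, but the idea is the same):

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      def answer_logprob(model, tokenizer, prompt, answer):
          """Total log-probability the model assigns to `answer`, given `prompt`."""
          prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
          answer_ids = tokenizer(answer, add_special_tokens=False,
                                 return_tensors="pt").input_ids
          input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

          with torch.no_grad():
              logits = model(input_ids).logits               # (1, seq_len, vocab)

          # Logits at position i predict token i + 1, so shift by one.
          log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
          start = prompt_ids.shape[1] - 1                    # first answer-predicting position
          token_logps = log_probs[start:].gather(
              -1, answer_ids[0].unsqueeze(-1)).squeeze(-1)
          return token_logps.sum().item()

    Comparing that score across candidate answers (or length-normalizing it) uses the whole predictive distribution instead of only the argmax decode.
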
  • by thefcpk on 10/18/23, 10:37 PM

    One thing that puzzles me is the lack of multilingual models... it is a bit sad to see everything through the English language.

  • by StephenAshmore on 10/19/23, 1:48 AM

    Fascinating! I love seeing more multimodal ML. Thanks for sharing!

  • by og_kalu on 10/18/23, 7:38 PM

    Oh wow. This seems to be the best released VLM. The chart/UI understanding displayed in particular is superb.

  • by lxe on 10/18/23, 10:24 PM

    Comparable with LLaVA-13B in benchmarks! Great work!

  • by ronsor on 10/18/23, 7:43 PM

    Before someone else does, I'm going to point out that CC-BY-NC is technically not an open source license.