from Hacker News

Open Flamingo – open framework to train multimodal LLMs

by mpaepper on 3/28/23, 8:47 PM with 25 comments

by ftxbro on 3/28/23, 9:00 PM
In the demo, I uploaded the Obama prank photo (http://karpathy.github.io/2012/10/22/state-of-computer-visio...) and asked "Why is this picture funny?" It responded: "Question: Why is this picture funny? Answer: President Obama is taller than the average person."
by yeldarb on 3/28/23, 9:54 PM
I always like to try these zero-shot models on things outside of the "normal" COCO classes. Here are some chess board queries:
Counting: https://imgur.com/KTuQ1Bv
Parse the chess board: https://imgur.com/2zYFK1P
(Result): https://imgur.com/Ei4MAl7
Few-Shot Object Detection (Pascal VOC): https://imgur.com/gZkDMn8
Few-Shot Object Detection (simplified): https://imgur.com/Hk8QGMd
Not quite there yet. I've been more impressed with the other new zero-shot multimodal models like Grounding DINO and Azure Dense Captioning. Really looking forward to putting multimodal GPT-4 through its paces as well.
by vagabund on 3/28/23, 9:23 PM
Even at this scale, the model answers questions fairly impressively, but I created an image with some distinct shapes in different positions and it didn't go well [0]. I suspect that however they're encoding the image, it doesn't capture positional information, which, to my mind, limits a lot of use cases.
[0] https://i.postimg.cc/GtrGs8mw/Screenshot-2023-03-28-at-5-19-...
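To make the positional-information concern concrete, here is a toy sketch. It is not OpenFlamingo's actual image encoder, and the names `mean_pool`, `image_a`, and `image_b` are made up for illustration. It only shows the general failure mode: if per-patch features are averaged into one vector, two images with the same shapes in different places become indistinguishable.

```python
# Toy illustration only: NOT how OpenFlamingo encodes images.
# Each "image" is a list of patch feature vectors, in reading order.

def mean_pool(patches):
    """Average equal-length patch feature vectors into a single vector."""
    n = len(patches)
    return [sum(column) / n for column in zip(*patches)]

# Hypothetical 2-dimensional patch features.
CIRCLE = [1.0, 0.0]
SQUARE = [0.0, 1.0]
BLANK = [0.0, 0.0]

# Same shapes, different positions.
image_a = [CIRCLE, SQUARE, BLANK, BLANK]
image_b = [BLANK, BLANK, SQUARE, CIRCLE]

# The ordered patch sequences differ, so position is recoverable from them.
print(image_a != image_b)

# After pooling, the two images map to the same vector: position is gone.
print(mean_pool(image_a) == mean_pool(image_b))
```

Whether something like this is what actually goes wrong in the demo is a guess. The sketch only shows why an order-insensitive summary of patch features cannot answer "where is the circle?" questions.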
by mpaepper on 3/28/23, 8:59 PM
This is awesome work, and they also provide their 9B OpenFlamingo model, which is based on LLaMA:
https://huggingface.co/openflamingo/OpenFlamingo-9B
by dfrankle on 3/28/23, 11:22 PM
What are the key features of Open Flamingo, and how does it compare to other frameworks for training multimodal LLMs?
by juxtaposicion on 3/29/23, 1:10 AM
What’re the techniques that’ll get this to run on a single GPU?
by duxup on 3/28/23, 8:58 PM
That title is pretty impressive/big on mobile!