from Hacker News

Mini-Gemini: Mining the Potential of Multi-Modality Vision Language Models

by milliondreams on 3/31/24, 10:38 PM with 7 comments

  • by simonw on 4/1/24, 12:28 AM

    Mini-Gemini is a bit of a confusing name.

    Reminds me of how DALL·E Mini came out three years ago and eventually had to rename itself to Craiyon https://github.com/borisdayma/dalle-mini

  • by milliondreams on 3/31/24, 10:39 PM

  • by ilaksh on 4/1/24, 1:05 AM

    Is this based on LLaVA 1.6? Not to be too lazy, but maybe someone could link to a comparison with that, if there is one?

  • by mountainriver on 4/1/24, 2:35 PM

    Excited to see how this does on OpenCompass!

  • by milliondreams on 3/31/24, 10:38 PM

    The paper introduces Mini-Gemini, a framework aimed at enhancing Vision Language Models (VLMs) to close the performance gap with advanced models like GPT-4 and Gemini. It focuses on improving visual token resolution, creating high-quality datasets for better image comprehension, and expanding VLMs' operational scope. Mini-Gemini supports a range of large language models and has shown superior performance on zero-shot benchmarks. The code and models are publicly available.

  • by PontifexMinimus on 4/1/24, 1:21 PM

    WTF is a "Multi-modality Vision Language Model"? Does it mean:

    - a program where you give it a text description, and it outputs a picture

    - a program where you give it a picture, and it outputs a text description

    - both of the above

    - something else

    ?