from Hacker News

A Replacement for BERT

by cubie on 12/19/24, 4:53 PM with 75 comments

  • by jph00 on 12/19/24, 6:40 PM

    Hi gang, Jeremy from Answer.AI here. Nice to see this on HN! :) We're very excited about this model release -- it feels like it could be the basis of all kinds of interesting new startups and projects.

    In fact, the stuff mentioned in the blog post is only the tip of the iceberg. There are a lot of opportunities to fine-tune the model in all kinds of ways, which I expect will go far beyond what we've managed to achieve in our limited exploration so far.

    Anyhoo, if anyone has any questions, feel free to ask!

  • by janalsncm on 12/19/24, 9:23 PM

    > encoder-only models add up to over a billion downloads per month, nearly three times more than decoder-only models

    This is partially because people using decoders aren’t using huggingface at all (they would use an API call) but also because encoders are the unsung heroes of most serious ML applications.

    If you want to do any ranking, recommendation, RAG, etc., you will probably need an encoder. And typically that meant something in the BERT/RoBERTa/ALBERT family. So this is huge.
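
    To make that concrete, the usual encoder-for-retrieval pattern looks roughly like this (a minimal sketch using the sentence-transformers package; the checkpoint name is just an illustrative stand-in, not ModernBERT itself):

      # Sketch: embed a query and candidate documents, then rank by cosine similarity.
      from sentence_transformers import SentenceTransformer, util

      model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # any BERT-family encoder
      docs = ["ModernBERT is an encoder-only model.", "Decoders generate text token by token."]
      query = "Which models are used for retrieval?"

      doc_emb = model.encode(docs, convert_to_tensor=True)
      query_emb = model.encode(query, convert_to_tensor=True)
      scores = util.cos_sim(query_emb, doc_emb)  # one similarity score per document
      print(scores)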

  • by shahjaidev on 12/22/24, 8:06 AM

    The community would benefit a lot from a multilingual ModernBERT. Pretraining on a multilingual corpus is crucial for a ranking/retrieval model to be deployed in many industry settings. Simply extending the vocab and fine-tuning the en checkpoint won't quite work. Any plans to release a multilingual checkpoint?
  • by deepsquirrelnet on 12/20/24, 12:01 AM

    I read your paper this morning, and am just thrilled with the work. Love the added local attention layers. I’ve experimented with them for years (lucidrains repo), and was always surprised they didn’t go further. Inference speeds are awesome on this model. Scrapping NSP, awesome. Increased masking, awesome. RoPE and longer context, again, bravo. There are so many great incremental improvements learned over the years, and you guys made so many good decisions here.

    I’d love to distill a “ModernTinyBERT”, but it seems a bit more complex with the interleaved layers.

  • by jbellis on 12/19/24, 5:49 PM

    Looks great, thanks for training this!

      - Can I fine-tune it with SentenceTransformers? (rough sketch of what I mean below)
      - I see ColBERT in the benchmarks; is there an answerai-colbert-small-v2 coming soon?
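
    For the first question, roughly the shape of what I mean (a sketch only; it assumes SentenceTransformers can wrap the checkpoint directly, and the model id, data, and loss below are placeholders):

      # Sketch: wrap an encoder checkpoint as a SentenceTransformer and fine-tune on sentence pairs.
      from torch.utils.data import DataLoader
      from sentence_transformers import SentenceTransformer, InputExample, losses

      model = SentenceTransformer("answerdotai/ModernBERT-base")  # assumed checkpoint id
      train_examples = [
          InputExample(texts=["a query", "a relevant passage"], label=1.0),
          InputExample(texts=["a query", "an unrelated passage"], label=0.0),
      ]
      train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
      train_loss = losses.CosineSimilarityLoss(model)
      model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
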
  • by mark_l_watson on 12/20/24, 12:44 AM

    I saw this early this morning. About four or five years ago I used BERT models for summarization, etc. BERT seemed like a miracle to me back then.

    I am going to wait until Ollama has this in their library, even though consuming it directly from HF is straightforward.

    The speedup is impressive, but then so are the massive speed improvements for LLMs recently.

    Apple has supported BERT models in their SDKs for Apple developers for years; it will be interesting to see how quickly they update to this newer tech.

  • by wenc on 12/19/24, 10:50 PM

    Can I ask where BERT models are used in production these days?

    I was given to understand that they are a better alternative to LLM-type models for specific tasks like topic classification because they are trained to discriminate rather than to generate (plus they are bidirectional so they can “understand” context better through lookahead). But LLMs are pretty strong, so I wonder if the difference is negligible?
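
    For reference, the discriminative setup I mean is usually just an encoder with a classification head, roughly like this (a sketch using the Hugging Face transformers pipeline; the checkpoint name is only an example):

      # Sketch: encoder-based text classification with a fine-tuned classification head.
      from transformers import pipeline

      classifier = pipeline("text-classification",
                            model="distilbert-base-uncased-finetuned-sst-2-english")  # example checkpoint
      print(classifier("This release looks genuinely useful."))
      # -> e.g. [{'label': 'POSITIVE', 'score': 0.99...}]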

  • by dmezzetti on 12/19/24, 9:32 PM

    Great news here. It will take some time for this to trickle downstream, but expect to see better vector embedding models, entity extraction, and more.
  • by pantsforbirds on 12/19/24, 6:16 PM

    Awesome news and something I really want to check out for work. Has anyone seen any RAG evals for ModernBERT yet?
  • by readthenotes1 on 12/19/24, 8:01 PM

    I guess the next release is going to be postmodern bert.
  • by carschno on 12/19/24, 7:26 PM

    The model card says English only, is that correct? Are there any plans to publish a multilingual model or monolingual ones for other languages?
  • by 303bookworm on 12/24/24, 8:04 AM

    Really excited to see this! Two questions:

      1. Did you try using RTD (ELECTRA-like pretraining), or did you skip that for compatibility reasons?
      2. Why not incorporate Jamba-like alternating Mamba2 layers?
  • by Labo333 on 12/20/24, 10:16 AM

    Sad that it is English only, not multilingual.
  • by GaggiX on 12/20/24, 2:51 AM

    It would be really cool to have a model like this but multilingual, it would really help with things like moderation.
  • by neodypsis on 12/20/24, 3:50 AM

    How does it compare to Jina V3 [0], which also has 8192 context length?

    0. https://arxiv.org/abs/2409.10173

  • by vietvu on 12/20/24, 2:28 AM

    So that's what Jeremy Howard was teasing about. Nice one.
  • by crimsoneer on 12/19/24, 9:41 PM

    Answer.ai team are DELIVERING today. Well done Jeremy and team!
  • by zelias on 12/19/24, 5:54 PM

    missed opportunity to call it ERNIE
  • by Arcuru on 12/19/24, 6:05 PM

    I'm not sure I am understanding where exactly this slots in, but isn't this an embedding model? Shouldn't they be comparing it to a service like Voyage AI?

    - https://docs.voyageai.com/docs/embeddings