from Hacker News

Mamba Explained

by andreyk on 3/30/24, 4:04 PM with 44 comments

  • by andy_xor_andrew on 3/30/24, 6:04 PM

    > But Transformers have one core problem. In a transformer, every token can look back at every previous token when making predictions.

    Lately I've been wondering... is this a problem, or a strength?

    It might be a fallacy to compare how LLMs "think" with how humans think. But humor me for a second. When you are speaking, each time you emit a word, you are not attending to every previous word in your sentence (as a transformer does); rather, you have a state in your mind that represents the grammar and concepts, and that state is continuously updated as you speak (more similar to SSMs; a toy sketch of this contrast follows at the end of this comment).

    Similarly, when you read a book, every time you read a word, you are not attending to every previous word in the book. Your model of "the book" is rather a fuzzy/approximate state that is updated with new information every time a new word appears. Right? (I'm sorry, I know this is very handwavy and pseudoscientific, but bear with me.)

    Ok, so if (big if) you feel like the above is true, then to match human-type language modelling, SSMs seem more human-like than transformers.

    BUT... then aren't transformers strictly better in terms of accuracy? A transformer never "forgets" information, as long as it is within the context window, because it revisits that information every time it emits a new token.

    So let's say we can remove the "quadratic attention" problem of transformers with SSMs. That's a nice training/inference performance boost. But... look at where we got with "naive" attention: GPT-4, Claude 3. It's not like we're hitting a wall with quadratic attention. It's absurdly more expensive than SSMs, but GPUs certainly aren't getting slower. If all AI work stopped now and only hardware improved, it wouldn't be long until GPT-4 could run on local hardware, right, assuming Moore's law holds?

    /end rant. Not really sure what my point was; I'm not against SSMs (they're cool), but rather I'm wondering if the SOTA will ever be an SSM when attention is so damn good.
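
    A minimal numpy sketch of the contrast described above (purely a toy illustration, not Mamba's actual selective state space update; the sizes and the A and B matrices are made-up values):

      import numpy as np

      rng = np.random.default_rng(0)
      d = 8                                    # toy embedding / state size
      tokens = rng.standard_normal((16, d))    # 16 toy token embeddings

      # Attention-style: each new token looks back at ALL previous tokens,
      # so per-token work grows with context length (quadratic overall).
      def attend(history, query):
          scores = history @ query / np.sqrt(d)      # one score per past token
          weights = np.exp(scores - scores.max())
          weights /= weights.sum()
          return weights @ history                   # weighted sum of the past

      # SSM-style: one fixed-size state is updated as each token arrives,
      # so per-token work is constant; old tokens survive only via the state.
      def ssm_step(state, x, A, B):
          return A @ state + B @ x

      A = 0.9 * np.eye(d)    # made-up decay dynamics, purely illustrative
      B = np.eye(d)
      state = np.zeros(d)

      for t in range(1, len(tokens)):
          attn_out = attend(tokens[:t], tokens[t])   # O(t) work at step t
          state = ssm_step(state, tokens[t], A, B)   # O(1) work at step t

    The attention path can always re-read any earlier token exactly, while the SSM path only keeps whatever its fixed-size state has retained.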

  • by jongjong on 3/31/24, 2:49 AM

    I find it difficult to understand certain math and science papers/articles due to ambiguous use of language.

    For example, "all previous tokens can be passed to the current token." That seems like a poorly constructed sentence. A token is not a function, and it's not an algorithm either... How can you pass tokens to a token? This kind of ambiguous language in academic papers makes them hard to read... Maybe the phrase 'every token has an association with every other previously encountered token' would be better? Or: every token is used to compute the token vector for each token... I don't know; all I can do is guess the meaning of the word 'passed' (one concrete reading is sketched below this comment). They want us to infer and fill in the gaps with our own assumptions. It assumes that we are primed to think in a certain highly constrained way...

    For some reason a lot of academia around AI is littered with such imprecise language. They choose to use niche concepts and repurposed wording that their own small community invented rather than using words and ideas that are more widely understood but would convey the same information.

    Rational people who aren't directly involved in those fields, and who generally resist jumping to conclusions, will struggle to understand what is meant, because a lot of those words and ideas have different interpretations in their own fields.

    I studied machine learning at university, wrote ANNs from scratch and trained them, and even I find the language and concepts around LLMs too ambiguous. I'd rather just ask ChatGPT.

    One thing that bothers me is that the community has moved away from relating concepts to neurons, interconnections, input layers, hidden layers and output layers. Instead, they jump straight into vectors and matrices... pretending that there is only one way to map those calculations onto neurons and weights. But in fact, this abstraction has many possible interpretations: you could have fully connected layers or partially connected layers... Maybe you need a transformer only in front of the input layer, or between every layer... So many possibilities.

    The entire article means little if considered in isolation, outside the context of how various popular frameworks and tools are currently configured.
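
    For what it's worth, one plausible reading of "passed" can be pinned down with a toy single-head causal self-attention sketch in numpy (the shapes and projection matrices here are illustrative, not the article's notation):

      import numpy as np

      rng = np.random.default_rng(0)
      n, d = 6, 4
      X = rng.standard_normal((n, d))    # one row per token embedding

      # "Passed" read as: the output for token i is a weighted sum over
      # (projections of) tokens 0..i, i.e. every earlier token contributes.
      Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
      Q, K, V = X @ Wq, X @ Wk, X @ Wv

      scores = Q @ K.T / np.sqrt(d)
      mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # hide future tokens
      scores[mask] = -np.inf
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)

      out = weights @ V    # row i mixes information from every token j <= i

    Under that reading, "all previous tokens can be passed to the current token" just means every earlier token's vector is included in the weighted sum that produces the current token's representation.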

  • by etbebl on 3/30/24, 7:56 PM

    Anyone else keep seeing articles about Mamba and thinking they're about Python/Conda? It's annoying when the new cool thing picks the same name as something else you like that deserves attention.

  • by password4321 on 3/30/24, 6:11 PM

    Links to more about Mamba (selective state space models) on HN yesterday:

    https://news.ycombinator.com/item?id=39853958#39855430

  • by xz18r on 3/30/24, 5:57 PM

    I just have to say it: that image shows gunpla, i.e. Mobile Suit Gundam, not Transformers!

  • by ein0p on 4/1/24, 1:50 AM

    This is more human-like, and people will complain that it doesn't have photographic memory. That is, it's not superhuman in that regard. But there are many tasks where superhuman recall is not required. We know this because those tasks are currently performed by humans.

  • by sp332 on 3/30/24, 8:16 PM

    So in an effective Mamba query, the question goes at the end, after the input data? I thought the question should go at the beginning, so the model can decide which information in the data is relevant.

  • by programjames on 3/30/24, 7:17 PM

    This is the best explanation I have seen for Mamba.