by milliondreams on 4/7/24, 1:42 PM with 83 comments
by whimsicalism on 4/7/24, 4:29 PM
Specifically, I think at some point we are going to move to recursive routing, i.e. pass back through a set of experts again. In the future, 'chain-of-thought' will happen internal to the model, recursively.
by nl on 4/8/24, 5:00 AM
The idea that models shouldn't have to spend the same amount of compute on every token has been around for a while. This is the first compelling mechanism I've seen for doing it.
> Equipped with these new methods, we can sample autoregressively by choosing to route tokens to or around a block based on the router’s output, which does not depend on any information from future tokens. We provide empirical evidence that this is a relatively easy auxiliary task that quickly achieves 99% accuracy.
Does anyone else find this a bit surprising?
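It may be less surprising when you notice that top-k membership is largely a thresholding decision: if router scores are roughly identically distributed across sequences, the per-sequence k-th-largest score hovers around a fixed quantile, so a per-token predictor only has to learn something close to a global threshold. A toy numpy sketch of this intuition (my own illustration with made-up sizes, not the paper's auxiliary classifier, which is a small learned predictor on hidden states):

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, seq_len, k = 200, 64, 16   # hypothetical sizes, not the paper's

# Per-token router scores for many sequences.
scores = rng.normal(size=(n_seq, seq_len))

# Ground truth: a token is "routed" if it is in its sequence's top-k.
# This depends on the whole sequence, so it is not causal.
kth = np.sort(scores, axis=1)[:, -k]   # k-th largest score per sequence
in_topk = scores >= kth[:, None]

# Causal stand-in for the auxiliary predictor: one global threshold
# estimated from training data, applied to each token independently.
threshold = np.quantile(scores, 1 - k / seq_len)
predicted = scores >= threshold

accuracy = (predicted == in_topk).mean()
print(f"per-token accuracy: {accuracy:.3f}")
```

Even this crude global threshold recovers top-k membership for the vast majority of tokens; a learned predictor conditioned on the token's hidden state has strictly more information to work with.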
by panqueca on 4/7/24, 11:14 PM
Imagine you have a smart assistant that can understand and process the words you say to it. Usually, this assistant pays equal attention to every word you say, no matter how important or unimportant each word is to the overall meaning of your message.
Now, imagine that we found a way to teach the assistant to be smarter about how it uses its "brain power." Instead of giving equal attention to every word, the assistant learns to focus more on the words that are most important for understanding what you mean. It can even adjust this focus on the fly, paying more attention to different words depending on the context of your message.
To make sure the assistant doesn't get overwhelmed, we also set a limit on how much total "brain power" it can use at any given time. It's like giving the assistant a budget and saying, "You can only spend your brain power on a certain number of words at a time." The assistant then has to decide which words are most important to focus on.
Even with this limit, the assistant is still flexible in how it uses its brain power. It might spend more on certain words and less on others, depending on what you're saying. This means that while we always know the total amount of brain power the assistant is using, it can adapt to different situations and prioritize what's most important.
When we teach the assistant using this method, it not only learns to focus its attention intelligently but also does so very efficiently. It can understand you just as well as an assistant that pays equal attention to every word, but it uses less brain power overall. This makes the assistant much faster at responding to you and processing new information.
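The "budget" described above is the capacity-limited top-k routing at the heart of the paper: each block processes only its top-scoring tokens, while the rest skip it through the residual connection. A minimal numpy sketch of one such block (my own simplification with hypothetical names and sizes; the real version uses learned transformer blocks and routers):

```python
import numpy as np

rng = np.random.default_rng(0)

def mod_block(x, w_router, block_fn, capacity):
    """Route only the top-`capacity` tokens through `block_fn`;
    the rest skip the block via the residual connection."""
    scores = x @ w_router                  # one scalar score per token
    top = np.argsort(scores)[-capacity:]   # indices of the top-scoring tokens
    out = x.copy()                         # skipped tokens pass through unchanged
    # Scale the block's output by the router score, so the routing
    # decision stays differentiable during training.
    out[top] = x[top] + scores[top, None] * block_fn(x[top])
    return out

d_model, seq_len, capacity = 8, 16, 4      # illustrative sizes
w_router = rng.normal(size=d_model)
w_block = rng.normal(size=(d_model, d_model)) * 0.1
x = rng.normal(size=(seq_len, d_model))

y = mod_block(x, w_router, lambda h: h @ w_block, capacity)
```

The key property is that the compute per block is fixed (`capacity` tokens) regardless of which tokens are chosen, which is what makes the total "brain power" predictable while leaving the allocation adaptive.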
by mattmcdonagh on 4/7/24, 6:41 PM
https://lifeinthesingularity.com/p/googles-breakthroughs-in-...
by rughouse on 4/7/24, 3:45 PM
by macrolime on 4/8/24, 8:10 AM
Is this how they get a context window of 10 million tokens? Or are they referring to even longer context windows in the future?
by nikvaes on 4/8/24, 3:37 PM
by edude03 on 4/8/24, 1:28 PM
I got into deep learning around when ReLU and dropout were hot, and on my consumer 1080 I was able to change one or two lines of code and test the improvements in a few hours. Now, I guess I'll need to wait a few weeks for Mistral et al. to try it out.
by yair99dd on 4/8/24, 7:11 AM
Highly recommended; here is his take on the mixture-of-depths paper discussed: https://www.youtube.com/watch?v=Teru_qIdB8Y
by maxrumpf on 4/7/24, 11:17 PM
by kromem on 4/8/24, 12:42 AM
Neat!
by modeless on 4/8/24, 2:33 AM
by barrenko on 4/7/24, 6:30 PM