by milliondreams on 4/7/24, 1:42 PM with 83 comments
by whimsicalism on 4/7/24, 4:29 PM
Specifically, I think at some point we are going to move to recursive routing, i.e. pass back through a set of experts again. In the future, 'chain-of-thought' will happen internal to the model, recursively.
by nl on 4/8/24, 5:00 AM
The idea that models shouldn't have to spend the same amount of compute on every token has been around for a while. This is the first compelling mechanism I've seen for doing it.
> Equipped with these new methods, we can sample autoregressively by choosing to route tokens to or around a block based on the router’s output, which does not depend on any information from future tokens. We provide empirical evidence that this is a relatively easy auxiliary task that quickly achieves 99% accuracy.
Does anyone else find this a bit surprising?
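It may be less surprising when you notice that top-k membership is largely a thresholding decision: if router scores are roughly identically distributed across sequences, the per-sequence k-th-largest score hovers around a fixed quantile, so a per-token predictor only has to learn something close to a global threshold. A toy numpy sketch of this intuition (my own illustration with made-up sizes, not the paper's auxiliary classifier, which is a small learned predictor on hidden states):

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, seq_len, k = 200, 64, 16   # hypothetical sizes, not the paper's

# Per-token router scores for many sequences.
scores = rng.normal(size=(n_seq, seq_len))

# Ground truth: a token is "routed" if it is in its sequence's top-k.
# This depends on the whole sequence, so it is not causal.
kth = np.sort(scores, axis=1)[:, -k]   # k-th largest score per sequence
in_topk = scores >= kth[:, None]

# Causal stand-in for the auxiliary predictor: one global threshold
# estimated from training data, applied to each token independently.
threshold = np.quantile(scores, 1 - k / seq_len)
predicted = scores >= threshold

accuracy = (predicted == in_topk).mean()
print(f"per-token accuracy: {accuracy:.3f}")
```

Even this crude global threshold recovers top-k membership for the vast majority of tokens; a learned predictor conditioned on the token's hidden state has strictly more information to work with.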
by panqueca on 4/7/24, 11:14 PM
Imagine you have a smart assistant that can understand and process the words you say to it. Usually, this assistant pays equal attention to every word you say, no matter how important or unimportant each word is to the overall meaning of your message.
Now, imagine that we found a way to teach the assistant to be smarter about how it uses its "brain power." Instead of giving equal attention to every word, the assistant learns to focus more on the words that are most important for understanding what you mean. It can even adjust this focus on the fly, paying more attention to different words depending on the context of your message.
To make sure the assistant doesn't get overwhelmed, we also set a limit on how much total "brain power" it can use at any given time. It's like giving the assistant a budget and saying, "You can only spend your brain power on a certain number of words at a time." The assistant then has to decide which words are most important to focus on.
Even with this limit, the assistant is still flexible in how it uses its brain power. It might spend more on certain words and less on others, depending on what you're saying. This means that while we always know the total amount of brain power the assistant is using, it can adapt to different situations and prioritize what's most important.
When we teach the assistant using this method, it not only learns to focus its attention intelligently but also does so very efficiently. It can understand you just as well as an assistant that pays equal attention to every word, but it uses less brain power overall. This makes the assistant much faster at responding to you and processing new information.
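The "budget" described above is the capacity-limited top-k routing at the heart of the paper: each block processes only its top-scoring tokens, while the rest skip it through the residual connection. A minimal numpy sketch of one such block (my own simplification with hypothetical names and sizes; the real version uses learned transformer blocks and routers):

```python
import numpy as np

rng = np.random.default_rng(0)

def mod_block(x, w_router, block_fn, capacity):
    """Route only the top-`capacity` tokens through `block_fn`;
    the rest skip the block via the residual connection."""
    scores = x @ w_router                  # one scalar score per token
    top = np.argsort(scores)[-capacity:]   # indices of the top-scoring tokens
    out = x.copy()                         # skipped tokens pass through unchanged
    # Scale the block's output by the router score, so the routing
    # decision stays differentiable during training.
    out[top] = x[top] + scores[top, None] * block_fn(x[top])
    return out

d_model, seq_len, capacity = 8, 16, 4      # illustrative sizes
w_router = rng.normal(size=d_model)
w_block = rng.normal(size=(d_model, d_model)) * 0.1
x = rng.normal(size=(seq_len, d_model))

y = mod_block(x, w_router, lambda h: h @ w_block, capacity)
```

The key property is that the compute per block is fixed (`capacity` tokens) regardless of which tokens are chosen, which is what makes the total "brain power" predictable while leaving the allocation adaptive.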
by mattmcdonagh on 4/7/24, 6:41 PM
https://lifeinthesingularity.com/p/googles-breakthroughs-in-...
by rughouse on 4/7/24, 3:45 PM
by macrolime on 4/8/24, 8:10 AM
Is this how they get a context window of 10 million tokens? Or are they referring to even longer context windows in the future?
by nikvaes on 4/8/24, 3:37 PM
by edude03 on 4/8/24, 1:28 PM
I got into deep learning around when ReLU and dropout were hot, and on my consumer 1080 I was able to change one or two lines of code and test the improvements in a few hours. Now, I guess I'll need to wait a few weeks for Mistral et al. to try it out.
by yair99dd on 4/8/24, 7:11 AM
Highly recommended; here is his take on the mixture-of-depths paper discussed: https://www.youtube.com/watch?v=Teru_qIdB8Y
by maxrumpf on 4/7/24, 11:17 PM
by kromem on 4/8/24, 12:42 AM
Neat!
by modeless on 4/8/24, 2:33 AM
by barrenko on 4/7/24, 6:30 PM