by mottiden on 7/6/23, 12:28 PM with 68 comments
by gamegoblin on 7/6/23, 2:06 PM
When you abandon O(N^2) attention, you are forced to start adding heuristics to choose what to correlate. Any time you see one of those giant context window LLMs, you need to be asking what heuristics they added, what is getting correlated, and what is not getting correlated.
This paper chooses an exponential heuristic where tokens further in the past get exponentially less attention. This heuristic is fine for certain tasks like responding in a chat room, where the most recent tokens are the most important, but bad for tasks where tokens are roughly equally important throughout the text, such as a dense academic paper or a reference manual.
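To make the shape of that heuristic concrete, here is a toy sketch (my own illustration, not the paper's code) of a sparse attention pattern in which a query attends densely to its most recent tokens and samples older tokens at exponentially growing strides, so the attention budget spent per token falls off roughly exponentially with distance:

    def exponential_attention_indices(query_pos, base_window=64, max_levels=8):
        """For a query at `query_pos`, attend densely to the last `base_window`
        tokens, then cover older tokens at exponentially growing strides, so
        positions further in the past get exponentially sparser coverage."""
        indices = set()
        # Dense local window over the most recent tokens.
        start = max(0, query_pos - base_window)
        indices.update(range(start, query_pos + 1))
        # Each earlier band is covered with double the previous stride.
        stride, end = 2, start
        for _ in range(max_levels):
            begin = max(0, end - base_window * stride)
            indices.update(range(begin, end, stride))
            end, stride = begin, stride * 2
            if end == 0:
                break
        return sorted(indices)

    # A query a million tokens in still attends to only a few hundred positions.
    print(len(exponential_attention_indices(1_000_000)))

The point is just that the set of attended positions grows far more slowly than the context length, and the choice of which positions to drop is exactly the hard-coded heuristic.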
The bitter lesson [1] is going to come for all of these eventually: at some point we'll figure out how to machine-learn the heuristic rather than hard-code it. Recurrent neural networks (RNNs) do this implicitly, but we don't yet know how to effectively train RNNs on ultra-deep sequences.
Another possibility is learning a heuristic for non-recurrent LLMs via reinforcement learning, such as in [2], which is basically a reinforcement-learned "auto-researcher" trained in a style reminiscent of AlphaGo.
[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
by bratao on 7/6/23, 2:02 PM
The model might end up attending to essentially random words in the context, and there are definitely cases where that is unfortunate, because the words it selects can be irrelevant to the prediction.
The graph on the first page, in my opinion, seems like a needless flex.
by cs702 on 7/6/23, 2:07 PM
I'm going to take a closer look.
by londons_explore on 7/6/23, 2:13 PM
I suspect GitHub data has a lot of copy-pasted code. I.e., a good chunk of what you are asking the model to do is to go back X million tokens and copy a chunk verbatim.
Sure, the model might also be looking back at some code X million tokens ago and using that to improve its guess of the next token (oh look, the definition of the API I am using is back here, that'll help me get this right!).
But the perplexity number alone doesn't differentiate between those cases, and considering how much code copying/templating happens in software, I suspect that verbatim copying affects the perplexity a lot more than smart use of the context window does.
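A crude way to see that confound (my own toy numbers, nothing to do with the paper's evaluation): even a dumb bigram model assigns noticeably lower perplexity to text that is just the same snippet pasted several times, so repetition alone moves the metric without any clever use of long-range context.

    import math
    from collections import Counter, defaultdict

    def bigram_perplexity(tokens):
        """Crude add-one-smoothed bigram perplexity, fit and scored in-sample."""
        counts = defaultdict(Counter)
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
        vocab = len(set(tokens))
        log_prob = 0.0
        for a, b in zip(tokens, tokens[1:]):
            log_prob += math.log((counts[a][b] + 1) / (sum(counts[a].values()) + vocab))
        return math.exp(-log_prob / (len(tokens) - 1))

    unique = "def add a b return a plus b def mul x y return x times y".split()
    copied = unique * 3  # the same snippet pasted three times, as in templated repos

    print(bigram_perplexity(unique))  # higher: every bigram occurs once
    print(bigram_perplexity(copied))  # lower: repetition alone makes tokens predictable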
I wonder if these models work well on other kinds of data?
by kytazo on 7/6/23, 4:44 PM
Does this imply similar increases in context in practice?