from Hacker News

Scaling Transformers to 1B Tokens

by mottiden on 7/6/23, 12:28 PM with 68 comments

  • by gamegoblin on 7/6/23, 2:06 PM

    The benefit of "traditional" O(N^2) transformer attention is that you correlate every token with every other token. So, in the limit, your network won't "miss" much.

    When you abandon O(N^2) attention, you are forced to start adding heuristics to choose what to correlate. Any time you see one of those giant context window LLMs, you need to be asking what heuristics they added, what is getting correlated, and what is not getting correlated.

    This paper chooses an exponential heuristic where tokens further in the past get exponentially less attention (rough sketch at the end of this comment). This heuristic is fine for certain tasks like responding in a chat room, where the most recent tokens are the most important, but bad for tasks where tokens are roughly equally important throughout the text, such as a dense academic paper or a reference manual.

    The bitter lesson [1] is eventually going to come for all of these: we'll figure out how to machine-learn the heuristic rather than hard-code it. Recurrent neural networks (RNNs) do this implicitly, but we don't yet know how to train RNNs effectively on ultra-deep sequences.

    Another possibility is learning a heuristic for non-recurrent LLMs via reinforcement learning, as in [2], which is basically a reinforcement-learned "auto-researcher" trained in a style reminiscent of AlphaGo.

    [1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

    [2] https://arxiv.org/pdf/2109.00527.pdf
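
    For concreteness, a toy sketch (mine, not the paper's exact dilation scheme) of what an exponentially decaying attention pattern can look like: keys near the query are kept densely, and the sampling stride doubles as you move further back, so attention density falls off exponentially with distance.

      def exponential_sparse_keys(pos, base=64):
          """Keys a query at position `pos` attends to: the most recent
          `base` positions at stride 1, the block before that at stride 2,
          then 4, 8, ... (illustrative only)."""
          keys, start, stride = [], pos, 1
          while start > 0:
              lo = max(0, start - base * stride)
              keys.extend(range(lo, start, stride))
              start, stride = lo, stride * 2
          return sorted(keys)

      # A query at position 1,000,000 attends to only ~900 keys this way,
      # instead of all 1,000,000 under full O(N^2) attention.
      print(len(exponential_sparse_keys(1_000_000)))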

  • by bratao on 7/6/23, 2:02 PM

    I need to read the article carefully, but sparse attention is an interesting technique that has been used before (as in BigBird), though it has often proved to perform (way) worse than full attention. The sliding-window component that performs full attention is indeed useful (much like the Blockwise Parallel Transformer), but the sparse patterns don't intuitively resonate with me.

    The model might select random words in the context, which could easily be unfortunate if the words it ends up selecting are irrelevant (a toy sketch of that kind of pattern is below).

    The graph on the first page, in my opinion, seems like a needless flex.
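
    For reference, a toy version of the kind of sparsity pattern BigBird uses (not necessarily what this paper does): a local sliding window, a few randomly chosen keys per query, and a handful of global tokens.

      import numpy as np

      def bigbird_style_mask(seq_len, window=2, n_random=2, n_global=1, seed=0):
          """Boolean mask: True means query row q may attend to key column k."""
          rng = np.random.default_rng(seed)
          mask = np.zeros((seq_len, seq_len), dtype=bool)
          for q in range(seq_len):
              mask[q, max(0, q - window):q + window + 1] = True   # sliding window
              mask[q, rng.choice(seq_len, size=n_random)] = True  # random keys
          mask[:, :n_global] = True   # global tokens attended to by everyone
          mask[:n_global, :] = True   # ...and attending to everyone
          return mask

      print(bigbird_style_mask(10).astype(int))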

  • by cs702 on 7/6/23, 2:07 PM

    Well, this looks promising. The key idea is to collect a different set of tokens, with a different level of sparsity for each head, apply regular (dense) self-attention over all heads, weighted by pairwise distance, and scatter-add the output residuals back to their corresponding locations in the original sequence (rough sketch below). It seems to work really well, judging by the perplexity scores shown in the paper -- though we don't yet know if those perplexity scores will translate into good performance on real-world tasks.

    I'm going to take a closer look.
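
    A rough sketch of that gather -> dense-attend -> scatter flow for a single head (my reading, not the paper's code; the pairwise-distance weighting and causal mask are omitted):

      import numpy as np

      def sparse_head_attention(x, stride, wq, wk, wv):
          """One head: gather every `stride`-th token, run ordinary dense
          attention over that subset, scatter the outputs back in place."""
          idx = np.arange(0, len(x), stride)           # gather a sparser subset
          sub = x[idx]
          q, k, v = sub @ wq, sub @ wk, sub @ wv
          scores = q @ k.T / np.sqrt(sub.shape[-1])
          att = np.exp(scores - scores.max(-1, keepdims=True))
          att /= att.sum(-1, keepdims=True)            # row-wise softmax
          out = np.zeros_like(x)
          out[idx] = att @ v                           # scatter outputs back
          return out

      # Different heads use different strides/offsets; their outputs are summed
      # into the residual stream at the original token positions.
      d = 16
      x = np.random.randn(128, d)
      wq, wk, wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
      y = sum(sparse_head_attention(x, s, wq, wk, wv) for s in (1, 2, 4, 8))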

  • by londons_explore on 7/6/23, 2:13 PM

    They use perplexity on GitHub data to demonstrate the effectiveness of their model.

    I suspect GitHub data has a lot of copy-pasted code, i.e. a good chunk of what you are asking the model to do is to go back X million tokens and copy a chunk verbatim.

    Sure, the model might also be looking back at some code X million tokens earlier and using it to improve its guess of the next token (oh look, the definition of the API I'm using is back here, that'll help me get this right!).

    But the perplexity number alone doesn't differentiate those cases - and considering how much code copying/templating happens in software, I suspect the copying affects the perplexity a lot more than smart use of the context window does (toy illustration below).

    I wonder if these models work well on other kinds of data?
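
    To put a toy number on that: perplexity is just exp(mean negative log-likelihood), so stretches of verbatim-copyable tokens that the model predicts near-certainly drag it down a lot, regardless of how smart the rest of the predictions are. (The probabilities below are made up, purely for illustration.)

      import numpy as np

      def perplexity(token_probs):
          """exp of the mean negative log-likelihood over the tokens."""
          return float(np.exp(-np.mean(np.log(token_probs))))

      novel  = [0.2] * 100                # genuinely hard next-token guesses
      copied = [0.2] * 50 + [0.99] * 50   # half the tokens are verbatim copies
      print(perplexity(novel))   # 5.0
      print(perplexity(copied))  # ~2.25 -- copying alone more than halved it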

  • by jumpCastle on 7/6/23, 7:06 PM

    Title with a 10-digit number, a meaningless first-page figure, and no experiments related to the main claim. Did a rogue author post it without permission again?

  • by euclaise on 7/6/23, 3:18 PM

    Important note: They only ran experiments up to 32k sequence length.

  • by Imnimo on 7/6/23, 2:58 PM

    Without any experiment showing that language modeling performance actually continues to improve past 32k tokens using this scheme, how are we supposed to tell whether this is actually viable?

  • by spuz on 7/6/23, 2:35 PM

    What does the "number of tokens" characteristic of an LLM mean exactly? How does 1B compare with GPT-3.5 or GPT-4?

  • by daemonk on 7/6/23, 3:56 PM

  • by WanderPanda on 7/6/23, 3:48 PM

    How stupendous to not put the first figure on a log scale...

  • by kytazo on 7/6/23, 4:44 PM

    Is it meaningful to assume that the sequence length is directly correlated with the context window?

    Does this imply similar increases in context in practice?

  • by climatologist on 7/6/23, 10:36 PM

    Does anyone know if 1B tokens is enough to solve sudoku puzzles?