from Hacker News

Non-determinism in GPT-4 is caused by Sparse MoE

by 152334H on 8/4/23, 9:37 PM with 181 comments

  • by jiggawatts on 8/5/23, 12:37 AM

    Floating point inaccuracies are generally deterministic - running the same calculations twice ought to yield the same results, down to the bit.

    You only get divergent results if there is some other source of state or entropy: not zeroing buffers correctly, race conditions, not setting rounding mode flags consistently, etc…

    From the quality of the code I’ve seen being cobbled together in the AI/ML ecosystem, I would assume all three of those issues are going on, and maybe more.
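
    A minimal sketch of the ordering point: every individual addition below is bit-exact and repeatable, but float addition is not associative, so any run-to-run variation in summation order (race conditions, atomics, changing reduction layouts) changes the final bits. The explicit shuffle here is just a stand-in for that kind of variation.

        # Each float addition is deterministic, but addition is not associative,
        # so a different summation order yields a (slightly) different result.
        import random

        vals = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

        sum_forward = 0.0
        for v in vals:
            sum_forward += v

        shuffled = vals[:]
        random.shuffle(shuffled)      # stand-in for a nondeterministic reduction order
        sum_shuffled = 0.0
        for v in shuffled:
            sum_shuffled += v

        print(sum_forward == sum_shuffled)        # usually False
        print(abs(sum_forward - sum_shuffled))    # tiny, but nonzero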

  • by gojomo on 8/4/23, 11:23 PM

    Not sure I understand the excerpt from the referenced paper.

    Is it saying that part of its more-efficient inferencing relies on mixing tokens from completely-separate inputs – e.g., from other users? And then, depending on what other inputs chance into the same grouping, the relative assignment-to-'experts' varies, and thus the eventual completions?

    If so, I'd see that as not just introducing non-determinism, but also potentially making the quality of your responses dependent on how many concurrent requests are fighting for the same expert allocations.

    (For example, maybe the parts of the system best at translating/interpreting Hindi give worse results during peak usage hours-of-the-day in India, when the most concurrent inputs are competing for that same competence.)

    Perhaps also, this is another possible explanation for perceived quality-degradation over time. When certain tests were reliably succeeding earlier, there was less congestion for the relevant 'experts'. Now, with more concurrent use, those same tests aren't as reliably winning as much effort from the relevant 'experts'.

    This may also suggest a bit of a quagmire: on whatever domains some sub-experts seem impressively good, initially, even more proportionate use will be attracted. But such new congestion means all the copycat use no longer gets the same expert allocations – and thus the initially-impressive performance degrades.

    (And if the effect is strong, & known-but-undisclosed-by-OpenAI, does it amount to a bait-and-switch? Attract users with unrepresentative excellence on an initially-uncongested Mixture-of-Experts system, but then offer them the lower-quality results from a more-congested system.)
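
    A toy sketch of the batching effect being described, with made-up router weights and a hypothetical route helper rather than anything from the paper: each token goes to its top-scoring expert, each expert has a fixed buffer capacity per batch, and overflow tokens are dropped. The same token can therefore be served or not depending on what else happens to share its batch.

        import numpy as np

        def route(tokens, gate, num_experts=4, capacity=2):
            """Top-1 routing with a fixed per-expert buffer capacity."""
            scores = tokens @ gate                   # [n_tokens, num_experts] router logits
            top = scores.argmax(axis=1)              # top-1 expert per token
            load = [0] * num_experts
            assignment = []
            for i, e in enumerate(top):
                if load[e] < capacity:
                    load[e] += 1
                    assignment.append((i, int(e)))   # token i handled by expert e
                else:
                    assignment.append((i, None))     # buffer full: token overflows / is dropped
            return assignment

        rng = np.random.default_rng(0)
        gate = rng.normal(size=(8, 4))               # toy router weights
        my_token = rng.normal(size=(1, 8))           # the token whose fate we care about

        quiet_batch = np.vstack([my_token, rng.normal(size=(3, 8))])
        busy_batch = np.vstack([rng.normal(size=(7, 8)), my_token])   # same token, more competition

        print(route(quiet_batch, gate)[0])    # likely routed to its preferred expert
        print(route(busy_batch, gate)[-1])    # may come back as None if that expert is already full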

  • by alpark3 on 8/4/23, 10:37 PM

    _If_ 3.5 is an MoE model, doesn't that give a lot of hope to open source movements? Once a good open source MoE model comes out, maybe even some variation of the decoder models already available (I don't know whether MoE models have to be trained from scratch), that implies a lot more can be done with a lot less.
  • by osmarks on 8/4/23, 9:44 PM

    I feel like this introduces the potential for weird and hard-to-implement side channel attacks, if the sequences in a batch can affect the routing of others.
  • by refulgentis on 8/4/23, 10:07 PM

    This is _excellent_ work. I've been adamantly against MoE for a set of reasons, and this is the first compelling evidence I've seen that hasn't been on Substack or a bare repeating of rumor.

    I had absolutely no idea GPT-4 was nondeterministic, and I use it about 2 hours a day. I can see why a cursory look wasn't cutting it: the outputs "feel" the same in your memory, with a lot of similar vocab usage, but they are formatted entirely differently and have sort of a synonym-phrase thing going where some of the key words are the same.

  • by pazimzadeh on 8/4/23, 10:12 PM

    Mixture of Experts
  • by crazypython on 8/4/23, 11:36 PM

    The GPT-3.0 "davinci-instruct-beta" models have been returning non-deterministic logprobs since as early as early 2021. The MoE explanation is speculation; CUDA itself often has nondeterminism bugs.

    text-davinci-001 and text-davinci-002 were trained through FeedME and SFT, while text-davinci-003 was trained with RLHF; the models themselves have more variance at high temperature.

  • by throwawayadvsec on 8/4/23, 11:13 PM

    "these tokens often compete against each other for available spots in expert buffers. " So is this also why ChatGPT is often just writing placeholders in place of functions when I ask him for some long code?
  • by afro88 on 8/5/23, 1:33 AM

    > these tokens often compete against each other for available spots in expert buffers.

    Hold up, does this mean that under heavy load the results change? Does this explain why it sometimes feels like the output quality changes?

  • by hyperthesis on 8/5/23, 1:18 AM

    MoE: Mixture of Experts
  • by cainxinth on 8/5/23, 1:58 AM

    I asked GPT to explain this:

    >In the MoE approach, different "experts" or portions of the model are selected for different parts of the input data. The selection of which experts to use can be influenced by several factors, including the specific content of the input data, the order in which data is processed in a batch, and possibly even minor variations in the internal state of the model.

    >This "expert selection" process introduces a level of stochasticity, or randomness, into the model's operation. For example, if you process the same input data twice in slightly different contexts (e.g., as part of different batches), you might end up consulting slightly different sets of experts, leading to slightly different outputs.

  • by cratermoon on 8/5/23, 5:53 AM

    > It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0

    Interestingly, in another discussion there was a claim that setting the temperature to 0.0 made gpt-4 deterministic: https://news.ycombinator.com/item?id=36503146

  • by icelancer on 8/5/23, 12:52 AM

    How interesting. I was just discussing this last night with our analysts after I experimentally noticed that temp=0.0 (and all penalties/top_p set accordingly) still showed non-deterministic behavior. Wasn't sure why this was, and now this article comes along.

    The explanation makes quite a bit of sense.
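
    A rough sketch of the kind of repeat-the-same-request check being described, written against the pre-1.0 openai Python client that was current at the time; the prompt, model name, and run count are arbitrary, and OPENAI_API_KEY is assumed to be set in the environment.

        import openai   # pre-1.0 client; reads OPENAI_API_KEY from the environment

        PROMPT = "List three prime numbers greater than 100."

        def sample_once():
            resp = openai.ChatCompletion.create(
                model="gpt-4",
                temperature=0.0,
                top_p=1.0,
                messages=[{"role": "user", "content": PROMPT}],
            )
            return resp["choices"][0]["message"]["content"]

        completions = {sample_once() for _ in range(10)}
        # A fully deterministic decoder would yield exactly one distinct completion;
        # in practice more than one tends to show up.
        print(len(completions), "distinct completions out of 10 runs")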

  • by rgoldste on 8/5/23, 3:41 PM

    This is a plausible hypothesis. I’m curious whether OpenAI has considered this already and examined it. I feel like an average senior eng could eval this in under two focused days, but maybe OpenAI has less unit-testing than I expect.
  • by DeathArrow on 8/5/23, 6:50 AM

    Well, a colleague of mine managed to build a non-deterministic GET REST API endpoint. :D
  • by albystein on 8/4/23, 11:32 PM

    this hypothesis makes a lot of sense. if indeed gpt-4 is a sparse MoE (which i believe it is), then OpenAI must have tested and proven their initial idea of a large-capacity MoE LLM by first training/building a smaller one. this smaller test model might be gpt-3.5-turbo.
  • by f1shy on 8/5/23, 6:18 AM

    I see in the comments there seems to be a huge misunderstanding between two uses of “non-deterministic”: 1) in normal English: cannot be determined beforehand (results may vary); 2) in the theory of computation: loosely, “parallel computation” (unknown path to the solution).
  • by rvcdbn on 8/5/23, 12:40 AM

    I wonder if there’s a side channel attack in there waiting to happen..
  • by pmarreck on 8/5/23, 2:37 PM

    Determinism should always be an option in any system.
  • by heroku on 8/5/23, 10:40 AM

    can somebody make some quantum AI, that's super deterministic.
  • by dudus on 8/4/23, 10:46 PM

    Off topic

    > 3 months later, reading a paper while on board a boring flight home, I have my answer.

    I've noticed people from Hacker News routinely read scientific papers. This is a habit I envy but don't share.

    Any tips or sites for someone interested in picking up more science papers to read?