by 152334H on 8/4/23, 9:37 PM with 181 comments
by jiggawatts on 8/5/23, 12:37 AM
You only get divergent results if there is some other source of state or entropy: not zeroing buffers correctly, race conditions, not setting rounding mode flags consistently, etc…
From the quality of the code I've seen being cobbled together in the AI/ML ecosystem, I would assume all three of those issues are going on, and maybe more.
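For illustration, here's a minimal Python sketch (a toy, not anything from an actual serving stack) of why a nondeterministic accumulation order alone is enough to change a result, since floating-point addition isn't associative:

    import random

    # Floating-point addition is not associative, so any nondeterministic
    # accumulation order (e.g. racy atomic adds in a GPU kernel) can change
    # the final value slightly.
    values = [1e8, 1.0, -1e8, 1e-4] * 1000

    def reduce_in_order(xs):
        total = 0.0
        for x in xs:
            total += x
        return total

    shuffled = list(values)
    random.shuffle(shuffled)          # stand-in for a nondeterministic execution order
    print(reduce_in_order(values))    # one order
    print(reduce_in_order(shuffled))  # another order, slightly different sum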
by gojomo on 8/4/23, 11:23 PM
Is it saying that part of its more-efficient inferencing relies on mixing tokens from completely-separate inputs – eg, from other users? And then, depending on what other inputs chance into the same grouping, the relative assignment-to-'experts' varies, and thus the eventual completions?
If so, I'd see that as not just introducing non-determinism, but also potentially making the quality of your responses dependent on how-many-concurrent-requests are fighting for the same expert-allocations.
(For example, maybe the parts of the system best at translating/interpreting Hindi give worse results during peak usage hours-of-the-day in India, when the most concurrent inputs are competing for that same competence.)
Perhaps also, this is another possible explanation for perceived quality-degradation over time. When certain tests were reliably succeeding earlier, there was less congestion for the relevant 'experts'. Now, with more concurrent use, those same tests aren't as reliably winning as much effort from the relevant 'experts'.
This may also suggest a bit of a quagmire: on whatever domains some sub-experts initially seem impressively good, disproportionately more use will be attracted. But that new congestion means all the copycat use no longer gets the same expert allocations – and thus the initially-impressive performance degrades.
(And if the effect is strong, & known-but-undisclosed-by-OpenAI, does it amount to a bait-and-switch? Attract users with unrepresentative excellence on an initially-uncongested Mixture-of-Experts system, but then offer them the lower-quality results from a more-congested system.)
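GPT-4's actual routing isn't public, but a toy sketch of capacity-limited expert routing in the style of published MoE systems (with made-up expert names and a made-up per-expert capacity) shows how a token's assignment can depend on what else happens to share its batch:

    # Toy sketch of capacity-limited expert routing, not OpenAI's code.
    # Each expert accepts at most CAPACITY tokens per batch; overflow tokens
    # fall back to their next choice, so a token's treatment depends on
    # which other requests happen to share its batch.
    CAPACITY = 2

    def route(batch):
        # batch: list of (token_id, [preferred_expert, fallback_expert, ...])
        load, assignment = {}, {}
        for token_id, prefs in batch:
            for expert in prefs:
                if load.get(expert, 0) < CAPACITY:
                    load[expert] = load.get(expert, 0) + 1
                    assignment[token_id] = expert
                    break
            else:
                assignment[token_id] = None  # every listed expert was full
        return assignment

    mine = ("my_hindi_token", ["expert_hindi", "expert_generic"])
    quiet_batch = [mine]
    busy_batch = [("other_1", ["expert_hindi", "expert_generic"]),
                  ("other_2", ["expert_hindi", "expert_generic"]),
                  mine]
    print(route(quiet_batch)["my_hindi_token"])  # expert_hindi
    print(route(busy_batch)["my_hindi_token"])   # expert_generic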
by alpark3 on 8/4/23, 10:37 PM
by osmarks on 8/4/23, 9:44 PM
by refulgentis on 8/4/23, 10:07 PM
I had absolutely no idea GPT-4 was nondeterministic, and I use it about 2 hours a day. I can see why a cursory look wasn't cutting it: the outputs "feel" the same in your memory, with a lot of similar vocab usage, but they're formatted entirely differently and have a sort of synonym-phrase thing going on where some of the key words are the same.
by pazimzadeh on 8/4/23, 10:12 PM
by crazypython on 8/4/23, 11:36 PM
text-davinci-001 and text-davinci-002 were trained through FeedME and SFT, while text-davinci-003 was trained with RLHF; the models themselves have more variance at high temperature.
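As a rough illustration of the temperature point (a generic sampling sketch, not the hosted models' decoder): temperature divides the logits before the softmax, so higher values flatten the distribution and repeated samples disagree more often:

    import math, random

    # Generic temperature sampling: scale logits, softmax, then sample.
    logits = {"cat": 4.0, "dog": 2.0, "fish": 0.5}

    def sample(logits, temperature):
        scaled = {tok: l / temperature for tok, l in logits.items()}
        z = sum(math.exp(v) for v in scaled.values())
        probs = {tok: math.exp(v) / z for tok, v in scaled.items()}
        return random.choices(list(probs), weights=list(probs.values()))[0]

    for temp in (0.2, 1.5):
        print(temp, [sample(logits, temp) for _ in range(10)])
        # low temperature: almost always "cat"; high temperature: a mix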
by throwawayadvsec on 8/4/23, 11:13 PM
by afro88 on 8/5/23, 1:33 AM
Hold up, does this mean that under heavy load the results change? Does this explain why it sometimes feels like the output quality changes?
by hyperthesis on 8/5/23, 1:18 AM
by cainxinth on 8/5/23, 1:58 AM
>In the MoE approach, different "experts" or portions of the model are selected for different parts of the input data. The selection of which experts to use can be influenced by several factors, including the specific content of the input data, the order in which data is processed in a batch, and possibly even minor variations in the internal state of the model.
>This "expert selection" process introduces a level of stochasticity, or randomness, into the model's operation. For example, if you process the same input data twice in slightly different contexts (e.g., as part of different batches), you might end up consulting slightly different sets of experts, leading to slightly different outputs.
by cratermoon on 8/5/23, 5:53 AM
Interestingly, on another discussion there was a claim that setting the temperature to 0.0 made gpt-4 deterministic: https://news.ycombinator.com/item?id=36503146
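For reference, temperature 0 is conventionally treated as greedy decoding, which does remove the sampler's randomness; a minimal sketch, assuming fixed logits:

    # At temperature ~0, decoding reduces to taking the argmax token, so the
    # sampling step is deterministic for a fixed set of logits.
    logits = {"the": 3.2, "a": 2.9, "an": 0.4}

    def greedy(logits):
        return max(logits, key=logits.get)

    print(all(greedy(logits) == "the" for _ in range(1000)))  # True
    # This only fixes the sampling step; if the logits themselves shift with
    # batch composition, outputs can still differ even at temperature 0.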
by icelancer on 8/5/23, 12:52 AM
The explanation makes quite a bit of sense.
by rgoldste on 8/5/23, 3:41 PM
by DeathArrow on 8/5/23, 6:50 AM
by albystein on 8/4/23, 11:32 PM
by f1shy on 8/5/23, 6:18 AM
by rvcdbn on 8/5/23, 12:40 AM
by pmarreck on 8/5/23, 2:37 PM
by heroku on 8/5/23, 10:40 AM
by dudus on 8/4/23, 10:46 PM
> 3 months later, reading a paper while on board a boring flight home, I have my answer.
I've noticed that people on Hacker News routinely read scientific papers. This is a habit I envy but don't share.
Any tips or sites for someone interested in picking up more science papers to read?