by wujerry2000 on 1/19/25, 11:27 PM with 199 comments
by agnosticmantis on 1/20/25, 12:01 AM
Ha ha ha. Even written agreements are routinely violated as long as the potential upside > downside, and all you have is a verbal agreement? And you didn't disclose this?
At the time o3 was released I wrote “this is so impressive that it brings out the pessimist in me”[0], thinking perhaps they were routing API calls to human workers.
Now we see that in reality I should've been more cynical, as they had access to the benchmark data but verbally agreed (wink wink) not to train on it.
[0: https://news.ycombinator.com/threads?id=agnosticmantis#42476... ]
by lolinder on 1/19/25, 11:59 PM
> We acknowledge that OpenAI does have access to a large fraction of FrontierMath problems and solutions, with the exception of an unseen-by-OpenAI hold-out set that enables us to independently verify model capabilities. However, we have a verbal agreement that these materials will not be used in model training.
Ouch. A verbal agreement. As the saying goes, those aren't worth the paper they're written on, and that's doubly true when you're dealing with someone with a reputation like Altman's.
And aside from the obvious flaw in it being a verbal agreement, there are many ways in which OpenAI could technically comply with this agreement while still gaining a massive unfair advantage on the benchmarks to the point of rendering them meaningless. For just one example, knowing the benchmark questions can help you select training data that is tailored to excelling at the benchmarks without technically including the actual question in the training data.
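To make that concrete, here's a minimal, purely hypothetical sketch of that kind of data selection, using TF-IDF similarity as a stand-in for whatever retrieval machinery a lab would really use; every string and number in it is made up, and nothing here is claimed to be what OpenAI actually did:

    # Hypothetical sketch: rank candidate training documents by similarity
    # to the (secret) benchmark questions and keep only the closest matches.
    # The questions never enter the corpus, but the corpus is shaped by them.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    benchmark_questions = [   # placeholder stand-ins for the held-back problems
        "Determine the rank of the given elliptic curve over Q.",
        "Count the simple modules of the described finite group.",
    ]
    candidate_docs = [        # a tiny stand-in for a huge pool of ordinary training text
        "Lecture notes on ranks of elliptic curves and descent arguments.",
        "A sourdough bread recipe with a long cold fermentation.",
        "Survey of the modular representation theory of finite groups.",
    ]

    vec = TfidfVectorizer().fit(benchmark_questions + candidate_docs)
    sims = cosine_similarity(vec.transform(candidate_docs),
                             vec.transform(benchmark_questions))
    scores = sims.max(axis=1)  # each doc's best match against any benchmark question

    # Train on the documents that most resemble the benchmark, questions excluded.
    keep = [doc for _, doc in sorted(zip(scores, candidate_docs), reverse=True)[:2]]
    print(keep)

The resulting corpus contains none of the benchmark questions, yet its composition is entirely a function of them, so a promise not to train on the materials is technically kept while the benchmark is still compromised.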
by jsheard on 1/19/25, 11:45 PM
by diggan on 1/19/25, 11:55 PM
Not sure if "integrity of the benchmarks" should even be something that you negotiate over. What's the value of a benchmark if the results can't be trusted because of undisclosed relationships and data sharing? Why would they be restricted from disclosing things you would normally disclose, and how does that not raise all sorts of warning flags when it's proposed?
by bogtog on 1/20/25, 1:33 AM
For instance, suppose they conduct an experiment and find that changing some hyper-parameter yields a 2% boost. That could just be noise, a genuine small improvement, or a mix of a genuine boost and some fortunate noise. An effect may be small enough that researchers would need to rely on their gut to interpret it, and researchers may jump on noise while believing they have discovered true optimizations. Enough of these kinds of nudges, and some serious benchmark gains can materialize.
(Hopefully my comment isn't entirely misguided, I don't know how they actually do testing or how often they probe their test set)
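For a sense of scale, here is a toy simulation of that noise (all numbers invented, nothing to do with OpenAI's actual test procedure): on a benchmark of a few hundred problems, re-scoring the same model can easily produce a swing that looks like a 2% improvement.

    # Toy illustration: how big is pure evaluation noise on a small benchmark?
    import numpy as np

    rng = np.random.default_rng(0)
    n_problems = 300       # assumed benchmark size, roughly FrontierMath scale
    true_accuracy = 0.25   # the model's "real" pass rate (made up)

    # Re-score the same model on the same benchmark 10,000 times with fresh randomness.
    scores = rng.binomial(n_problems, true_accuracy, size=10_000) / n_problems

    print(f"std of measured accuracy:      {scores.std():.3f}")   # about 0.025
    print(f"P(apparent gain of 2+ points): {(scores >= true_accuracy + 0.02).mean():.2f}")

Pick the best of a handful of such runs while tuning, and you can manufacture a benchmark gain out of nothing.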
by zarzavat on 1/20/25, 1:38 AM
Whereas other AI companies now have the opportunity to be first to get a significant result on FrontierMath.
by ripped_britches on 1/20/25, 4:53 AM
I know they have lost trust and credibility, especially on HN. But this is a company with a giant revenue opportunity to sell products that work.
What works for enterprise is very different from “does it beat this benchmark”.
No matter how nefarious you think sama is, everything points to “build intelligence as rapidly as possible” rather than “spin our wheels messing with benchmarks”.
In fact, even if they did fully lie and game the benchmark - do you even care? As an OpenAI customer, all I care about is that the product works.
I code with o1 for hours every day, so I am very excited for o3 to be released via API. And if they trained on private datasets, I honestly don’t care. I just want to get a better coding partner until I’m irrelevant.
Final thought - why are these contractors owed a right to know where funding came from? I would definitely be proud to know I contributed to the advancement of the field of AI if I was included in this group.
by lionkor on 1/20/25, 1:25 AM
So, with this in mind, let me repeat: unless you know that the question AND/OR answer are not in the training set (or anything adjacent to it), do not claim that the AI or similar black box is smart.
by MattDaEskimo on 1/20/25, 1:23 AM
This maneuver by their CEO will destroy the reputations of both FrontierMath and Epoch AI.
by benterix on 1/20/25, 10:08 AM
Man, this is huge.
by wujerry2000 on 1/20/25, 12:49 AM
(1) Companies will probably increasingly invest in building their own evals for their own use cases, because it's becoming clear that public (and allegedly private) benchmarks have incentives misaligned with the labs sponsoring or cheating on them.
(2) Those evals will probably be proprietary "IP", guarded as closely as the code or research itself.
(3) Conversely, public benchmarks are exhausted, and SOMEONE has to invest in funding more frontier benchmarks. So this is probably going to continue.
by gunalx on 1/20/25, 9:19 AM
I would even go so far as to say this invalidates not only FrontierMath but also anything Epoch AI has touched or will touch.
Any academic misjudgement like this massive conflict of interest and cheating makes you untrustworthy in an academic context.
by BrenBarn on 1/22/25, 10:32 PM
by Imnimo on 1/20/25, 12:26 AM
What's much more concerning to me than the integrity of the benchmark number is the general pattern of behavior here from OpenAI and Epoch. We shouldn't accept a benchmark whose creation was secretly funded (secret even from the people doing the creating!). I also don't see how we can trust in the integrity of Epoch AI going forward. This is basically their only meaningful output, and this is how they handled it?
by j_timberlake on 1/20/25, 6:52 PM
by padolsey on 1/20/25, 3:43 AM
by matt_daemon on 1/20/25, 12:36 AM
by WasimBhai on 1/20/25, 12:04 AM
by nioj on 1/19/25, 11:58 PM
by refulgentis on 1/20/25, 12:10 AM
Last time this confused a bunch of people who didn't understand what test vs. train data meant, and it resulted in a particular luminary complaining on Twitter, to much guffawing, about how troubling the situation was.
Literally every comment currently, modulo [1], assumes this and then goes several steps further, and a majority are wildly misusing terms that have precise meanings, which explains at least part of their confusion.
[1] modulo the one saying this is irrelevant because we'll know if it's bad when it comes out, which, to be fair, even when evaluated rationally, doesn't help us narrowly with our suspicion that the FrontierMath benchmark results are all invalid because the model trained on (most of) the solutions
by croemer on 1/25/25, 1:54 PM
by mrg3_2013 on 1/20/25, 6:32 AM
by atleastoptimal on 1/20/25, 4:55 AM
HN loves to speculate that OpenAI is some big scam whose seeming ascendance is based on deceptive marketing hype, but o1, to anyone who has tried it seriously, is undoubtedly very much within the ballpark of what OpenAI claims it is able to do. If everything they are doing really is just overfitting and gaming the tests, that discrepancy will eventually catch up to them, and people will stop using the APIs and ChatGPT.
by karmasimida on 1/20/25, 4:58 AM
There are ways you could game the benchmark without adding it to the training set. By repeatedly evaluating on the dataset itself, it regresses into a validation set rather than a test set, even in a black-box setting: you can simply evaluate 100 checkpoints, pick the one that performs best, and rinse and repeat.
I still believe o3 is the real deal, BUT this gimmick kind of sours my appetite a bit toward those who run the company.
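To put a rough number on that checkpoint-picking effect (everything here is invented for illustration): even if 100 checkpoints are all equally good, reporting the one that scores best on a ~300-problem test set inflates the headline number by several points.

    # Select-the-best-checkpoint effect: the max over many noisy evaluations is
    # biased upward even when every checkpoint has identical true skill.
    import numpy as np

    rng = np.random.default_rng(1)
    n_problems, true_acc, n_checkpoints = 300, 0.25, 100   # all numbers made up

    # 10,000 repetitions of: evaluate 100 equally good checkpoints, report the best.
    runs = rng.binomial(n_problems, true_acc, size=(10_000, n_checkpoints)) / n_problems
    best = runs.max(axis=1)

    print(f"true accuracy:          {true_acc:.3f}")
    print(f"mean 'best checkpoint': {best.mean():.3f}")   # roughly 0.31, a free ~6-point bump

That is exactly the test-set-becomes-validation-set regression described above, with no benchmark data in the training set at all.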
by nottorp on 1/20/25, 10:15 AM
Just like toothpaste manufacturers fund dentists' associations, etc.
by ForHackernews on 1/20/25, 10:07 AM
Why does it have a customer service popover chat assistant?
by zrc108071849 on 1/20/25, 3:34 AM
by suchintan on 1/20/25, 2:06 AM
We tried doing that here at Skyvern (eval.skyvern.com)
by maeil on 1/20/25, 9:33 AM
by floppiplopp on 1/20/25, 2:06 PM
by moi2388 on 1/20/25, 7:09 AM
What about testing the model before releasing it?
by treksis on 1/19/25, 11:46 PM
by numba888 on 1/19/25, 11:55 PM
by m3kw9 on 1/20/25, 12:03 AM
by katamari-damacy on 1/20/25, 9:04 AM
which should really be “we now know how to improve associative reasoning but we still need to cheat when it comes to math because the bottom line is that the models can only capture logic associatively, not synthesize deductively, which is what’s needed for math beyond recipe-based reasoning”