by meetpateltech on 5/16/25, 3:02 PM with 450 comments
by johnjwang on 5/16/25, 4:27 PM
We’ve long used local agents like Cursor and Claude Code, so we didn’t expect too much. But Codex shines in a few areas:
Parallel task execution: you can batch dozens of small edits (refactors, tests, boilerplate) and run them concurrently without context juggling, something that's really hard to do in Cursor, Cline, etc. (see the sketch at the end of this comment).
It kind of feels like a junior engineer on steroids: you just point it at a file or function, specify the change, and it scaffolds out most of a PR. You still need to do a lot of work to get it production-ready, but it's as if you now have an infinite number of junior engineers at your disposal, all working on different things.
Model quality is good, but it's hard to say it's that much better than other models. In side-by-side tests with Cursor + Gemini 2.5 Pro, naming, style, and logic are relatively indistinguishable, so quality meets our bar but doesn't yet exceed it.
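Back to the parallelism point: here's a rough sketch of what that fan-out pattern looks like. Note that submit_task and the task list are placeholders I made up for illustration, not Codex's actual interface:

    import asyncio

    # Hypothetical stand-in for handing one task to a coding agent;
    # Codex's real interface is its web UI, not this function.
    async def submit_task(prompt: str) -> str:
        await asyncio.sleep(1)  # pretend the agent is working
        return f"PR draft for: {prompt}"

    async def main() -> None:
        tasks = [
            "rename the helpers in utils.py to snake_case",
            "add unit tests for the date parser",
            "generate boilerplate for the new endpoint",
        ]
        # Fan out every edit at once instead of juggling them serially.
        results = await asyncio.gather(*(submit_task(t) for t in tasks))
        for result in results:
            print(result)

    asyncio.run(main())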
by nadis on 5/16/25, 5:52 PM
Preview video from OpenAI: https://www.youtube.com/watch?v=hhdpnbfH6NU&t=878s
As I think about what "AI-native" development, or just the future of building software, looks like, it's interesting to me that, right now, developers are still just reading code and tests rather than looking at simulations.
While a new(ish) concept for software development, simulations could provide a wider range of outcomes and, especially for the front end, are far easier to evaluate than code and tests alone. I'm biased because this is something I've been exploring, but it really hit me over the head looking at the Codex launch materials.
by ofirpress on 5/16/25, 5:51 PM
by blixt on 5/16/25, 5:03 PM
There must be room for a Modal/Cloudflare/etc infrastructure company that focuses only on providing full-fledged computer environments specifically for AI with forking/snapshotting (pause/resume), screen access, human-in-the-loop support, and so forth, and it would be very lucrative. We have browser-use, etc, but they don't (yet) capture the whole flow.
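To sketch what I mean, here's a purely imagined surface for those capabilities; every name below is invented for illustration, and no such provider API exists as written:

    from dataclasses import dataclass

    @dataclass
    class Snapshot:
        snapshot_id: str

    # Imagined interface covering the capabilities above: fork/snapshot,
    # pause/resume, screen access for GUI agents, human-in-the-loop.
    class AgentEnvironment:
        def snapshot(self) -> Snapshot:
            """Freeze full machine state so a run can resume later."""
            raise NotImplementedError

        def fork(self, snap: Snapshot) -> "AgentEnvironment":
            """Branch a new environment from a snapshot to try a variant."""
            raise NotImplementedError

        def pause(self) -> None:
            """Suspend execution, e.g. while waiting for human approval."""
            raise NotImplementedError

        def screenshot(self) -> bytes:
            """Capture the screen for review or a GUI-driving agent."""
            raise NotImplementedError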
by ionwake on 5/16/25, 8:09 PM
Is this still rolling out? I don't need the Team plan, do I?
I have been using OpenAI products for years now and I am keen to try this one, but I have no idea what I am doing wrong.
by solresol on 5/17/25, 6:38 AM
Here's my workflow that keeps failing:
- It writes some code. It looks good at first glance.
- I push it to GitHub.
- Automated tests on GitHub show that there's a problem.
- I go back to Codex and ask it to fix it.
- It does stuff. It looks good again.
Now what do I do? If I ask it to push again to GitHub, it will often create a pull request that doesn't include the changes from the first pull request; it doesn't stack on top of the previous pull request, it stacks on top of main.
When asked to write something that called out to gpt-4.1-mini, it used openai.ChatCompletion.create (!?!!?)
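(For reference, openai.ChatCompletion.create was removed in v1.0 of the openai Python package; the current equivalent looks like this:)

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # The v1+ replacement for the deprecated openai.ChatCompletion.create
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)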
I just found myself using Claude to fix Codex's mistakes.
by alvis on 5/16/25, 4:00 PM
by ZeroCool2u on 5/17/25, 4:41 AM
What does that mean? Surely this should have a bit more elaboration. If you're just excluding a double-digit number of tasks in the benchmark as uncompleted, that should be reflected in the scores.
by asdev on 5/16/25, 6:40 PM
by fullstackchris on 5/16/25, 10:29 PM
by bionhoward on 5/16/25, 5:39 PM
What about using it for AI / developing models that compete with our new overlords?
Seems like using this is just asking to get rug-pulled for competing with them when they release something that competes with your thing. Am I just an old who's crowing about nothing? Is it OK for them to tell us we own outputs that we can't use to compete with them?
by kleiba on 5/16/25, 3:59 PM
by haffi112 on 5/16/25, 3:13 PM
by CSMastermind on 5/17/25, 1:17 AM
by yanis_t on 5/16/25, 4:07 PM
When I'm using aider, after it makes a commit I immediately run git reset HEAD^ and then git diff (actually I use the GitHub Desktop client to see the diff) to evaluate exactly what it did and decide whether I like it. Then I usually make some adjustments, and only after that do I commit and push.
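Scripted, that loop is just a couple of git calls. A minimal Python sketch (the git helper is mine; adapt to taste):

    import subprocess

    def git(*args: str) -> str:
        """Run a git command and return its stdout."""
        return subprocess.run(
            ["git", *args], check=True, capture_output=True, text=True
        ).stdout

    # Undo aider's commit but keep its changes in the working tree...
    git("reset", "HEAD^")
    # ...then inspect exactly what changed before re-committing.
    print(git("diff"))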
by simianwords on 5/16/25, 3:54 PM
by swisniewski on 5/16/25, 10:53 PM
They seem to be injected fine in the "environment setup" but don't seem to be injected when running tasks against the environment. This consistently repros even if I delete and re-create the environment, and archive and resubmit the task.
by SketchySeaBeast on 5/16/25, 7:44 PM
by bearjaws on 5/17/25, 12:56 PM
I don't understand why OAI puts their alpha-release products under a $200/month plan instead of just charging for tokens.
by orliesaurus on 5/16/25, 4:18 PM
by tptacek on 5/16/25, 3:30 PM
by sudohalt on 5/16/25, 5:36 PM
by simianwords on 5/16/25, 4:07 PM
by hintymad on 5/16/25, 8:18 PM
This seems to imply that software engineering as a profession has been quite mature and saturated for a while, to the point that a model can predict most of the output. Yes, yes, I know there are thousands of advanced algorithms and amazing systems in production. It's just that the market does not need millions of engineers with such advanced skills.
Unless we get yet another new domain like cloud or the internet, I'm afraid the core value of software engineers, trailblazing for new business scenarios, will continue diminishing and being marginalized by AI. As a result, we'll see far less demand for our jobs, and many of us will either take lower pay or be out of work for extended periods.
by bmcahren on 5/17/25, 1:01 AM
by asadm on 5/16/25, 3:52 PM
I made one for GitHub Actions, but it's not as real-time and is two years old now: https://github.com/asadm/chota
by colesantiago on 5/16/25, 3:17 PM
This should be possible today and surely Linus would also see this in the future.
by zrg on 5/17/25, 1:27 PM
by tough on 5/16/25, 4:35 PM
(I'm trying something.)
What would be an impressive program that an agent should be able to one-shot?
by btbuildem on 5/16/25, 3:57 PM
I can't say I am a big fan of neutering these paradigm-shifting tools according to one culture's code of ethics / way of doing business / etc.
One man's revolutionary is another's enemy combatant and all that. What if we need top-notch malware to take down the robot dogs lobbing mortars at our madmaxian compound?!
by alvis on 5/16/25, 3:36 PM
Feels like Codex is for product managers to fix bugs without touching any developer resources. If so, that's insanely surprising!
by scudsworth on 5/16/25, 3:59 PM
by adamTensor on 5/16/25, 4:08 PM
by theappsecguy on 5/16/25, 9:44 PM
by prhn on 5/16/25, 3:32 PM
I'm very interested.
In my experience ChatGPT and Gemini are absolutely terrible at these types of things. They are constantly wrong. I know I'm not saying anything new, but I'm waiting to personally experience an LLM that does something useful with any of the code I give it.
These tools aren't useless. They're great as search engines and for pointing me in the right direction. They write dumb bash scripts that save me time here and there. That's it.
And it's hilarious to me how these people present these tools. The tool generates a bunch of code, and then you spend all your time auditing and fixing what is expected to be wrong.
That's not the type of code I'm putting in my company's code base, and I could probably write the damn code more correctly in less time than it takes to review for expected errors.
What am I missing?
by skovati on 5/16/25, 4:01 PM
I imagine many engineers are like me in that they got into programming because they liked tinkering, hacking, and implementation details, all of which are likely to be abstracted away in this new era of prompting.
by ilaksh on 5/16/25, 3:52 PM
For example, in the last month or so, I added a job queue plugin. The ability to run multiple tasks that they demoed today is quite similar. The issue my users ran into is that, without Enterprise plans, complex tasks hit rate limits when they try to run concurrently.
So I am adding the ability to have multiple queues, each possibly using a different model and/or provider, to get around rate limits.
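Roughly, the idea looks like this (queue names, model labels, and the worker loop are illustrative, not my plugin's actual code):

    import asyncio

    # Illustrative only: one queue per provider/model so a rate limit
    # on one provider doesn't stall every other task.
    QUEUES: dict[str, asyncio.Queue] = {
        "openai:gpt-4.1": asyncio.Queue(),
        "anthropic:claude-3.7": asyncio.Queue(),
    }

    async def worker(name: str, queue: asyncio.Queue) -> None:
        while True:
            task = await queue.get()
            print(f"[{name}] running: {task}")  # call the provider here
            queue.task_done()

    async def main() -> None:
        for name, queue in QUEUES.items():
            asyncio.create_task(worker(name, queue))
        QUEUES["openai:gpt-4.1"].put_nowait("refactor module A")
        QUEUES["anthropic:claude-3.7"].put_nowait("write tests for module B")
        for queue in QUEUES.values():
            await queue.join()

    asyncio.run(main())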
By the way, my system has features that are somewhat similar not only to the tool they are showing but also to things like Manus. It is quite rough around the edges, though, because I am doing 100% of it myself.
But it is MIT-licensed, and it would be great if any developer on the planet wanted to contribute anything.
by RhysabOweyn on 5/16/25, 7:33 PM
On the other hand, if your job was writing code at certain companies whose profits were based on shoving ads in front of people, then I would agree that no one will care whether it is written by a machine or not. The days of those jobs paying >$200k a year are numbered.
by ianbutler on 5/16/25, 4:14 PM
There are so many of these "vibe coding" tools, and there has to be real engineering rigor at some point. I saw them demo "find the bug," but the bugs they found were pretty superficial, and that's something we've seen in our internal benchmark from both Devin and Cursor: a lot of noise and false positives, or superficial fixes.
by tough on 5/16/25, 3:59 PM
sigh
by energy123 on 5/16/25, 3:35 PM
by DGAP on 5/16/25, 6:20 PM