by simonpure on 5/15/25, 2:28 AM with 259 comments
by Benjammer on 5/15/25, 2:53 AM
by Sharlin on 5/15/25, 3:01 AM
This, of course, has certain implications as to the wisdom of the idea of “replacing human programmers”, given that one of the hard parts of the trade is trying to turn vague and often confused ideas into precise specifications by interacting with the stakeholders.
by tmountain on 5/15/25, 11:45 AM
by airylizard on 5/15/25, 4:43 AM
+30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.
Free, open framework; check the repo and try it yourself:
https://github.com/AutomationOptimization/tsce_demo
I tested this another 300 times with gpt-4.1 to remove those obtrusive "em-dashes" everyone hates. I tested a single-pass baseline vs TSCE, with the exact same instructions and prompt: "Remove the em-dashes from my linkedin post. . .".
Out of the 300 tests, baseline failed to remove the em-dashes 149/300 times. TSCE failed to remove the em-dashes 18/300 times.
It works; all the data, as well as the entire script used for testing, is in the repo.
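Not the repo's code, but a minimal sketch of that pass/fail harness, assuming the openai Python client; the TSCE side would swap run_baseline for the framework's own two-pass call, and the post text here is a stand-in.

    from openai import OpenAI

    client = OpenAI()
    PROMPT = "Remove the em-dashes from my linkedin post: <post text here>"  # stand-in text

    def run_baseline(prompt: str) -> str:
        # Single-pass call: one prompt, one completion, no second step.
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def failed(output: str) -> bool:
        # Pass/fail check: does the output still contain an em-dash (U+2014)?
        return "\u2014" in output

    failures = sum(failed(run_baseline(PROMPT)) for _ in range(300))
    print(f"baseline failures: {failures}/300")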
by zacksiri on 5/15/25, 3:08 AM
It dynamically swaps portions of the context in and out. The system is also not based on explicit definitions; it relies on LLMs 'filling the gaps'. It helps the LLM break problems down into small tasks, which then eventually aggregate into the full task.
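A loose sketch of what that swapping could look like; the keyword-overlap scoring and the three-chunk budget are stand-ins (the comment doesn't describe the actual selection mechanism, and a real system would likely use embeddings or another retrieval step).

    # Naive keyword-overlap scoring; a real system would likely use embeddings.
    def score(chunk: str, task: str) -> int:
        return len(set(chunk.lower().split()) & set(task.lower().split()))

    def build_context(chunks: list[str], task: str, budget: int = 3) -> list[dict]:
        # Keep only the highest-scoring chunks for this subtask; everything else
        # stays swapped out of the prompt until a later subtask needs it.
        relevant = sorted(chunks, key=lambda c: score(c, task), reverse=True)[:budget]
        return [
            {"role": "system", "content": "\n\n".join(relevant)},
            {"role": "user", "content": task},
        ]

    subtasks = ["parse the invoice header", "sum the line items", "emit a JSON report"]
    knowledge = ["Invoice headers contain ...", "Line items look like ...", "The JSON report schema is ..."]
    for sub in subtasks:
        messages = build_context(knowledge, sub)
        # each subtask gets its own trimmed message list; the results are then
        # aggregated back into the full task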
by jumploops on 5/15/25, 6:15 AM
You can edit responses, sure, but then a bunch of other context is lost.
My flow is basically:
1. plan
2. build
3. branch (into some feature/esoteric dependency issue)
4. goto #2
Prompt pruning/branching should be a first-class tool for any LLM usage.
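A minimal sketch of that branch-and-return flow in plain Python, assuming an OpenAI-style message list; the helper and example strings are illustrative, not any particular tool's API.

    from copy import deepcopy

    main_thread = [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Here is the plan: ..."},
        {"role": "assistant", "content": "Understood, starting with step 1."},
    ]

    def branch(history: list, new_task: str) -> list:
        fork = deepcopy(history)  # the plan and build context come along
        fork.append({"role": "user", "content": new_task})
        return fork

    # Chase the esoteric dependency issue in a fork, then keep building on
    # main_thread as if the side quest never happened.
    dep_thread = branch(main_thread, "Why does this lockfile conflict happen?")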
by podgorniy on 5/15/25, 8:13 AM
I've built a Telegram bot, http://t.me/experai_bot, as a universal UI to LLMs (with somewhat reduced functionality), exactly around the idea that "a non-reply message means a new conversation". Wanna keep context? Keep replying to the bot's replies. Non-power users struggle with this idea.
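Not the bot's actual code, but the gist of that reply-threading rule; the Telegram framework plumbing is omitted and the message-id bookkeeping is an assumption.

    conversations: dict[int, list] = {}  # bot message id -> message history

    def handle_incoming(text: str, reply_to_id: int | None, new_bot_msg_id: int) -> list:
        if reply_to_id in conversations:
            history = conversations.pop(reply_to_id)  # user replied: continue that thread
        else:
            history = []  # any non-reply message starts a fresh conversation
        history.append({"role": "user", "content": text})
        # ...call the LLM with `history` here and append its answer...
        conversations[new_bot_msg_id] = history  # context lives on the bot's reply
        return history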
--
Also, I observed that OpenAI models performed worse when replying to the same questions (for example, the list of options in a reply got shorter) even with the smallest system message. That was the case with 3.5 and 4o; I don't know how modern ones behave. That made me decide not to include any system messages by default. Still, I give the option to add them if you need to. You can even toggle them to mix and match.
by permo-w on 5/15/25, 3:10 AM
by t-kalinowski on 5/15/25, 11:29 AM
by SamPatt on 5/15/25, 6:29 AM
Through experience you develop a knack for how to steer the models and when to start a new conversation. The system or initial prompt is important, but nothing will save you if you naively keep a conversation going too long.
by ranyume on 5/15/25, 3:15 AM
Stuff like this:
1. Do: Best practice for X model is to include at most 10k lines of code + task + CONVENTIONS.md + architecture guidance. Only queue tasks for components that are fairly decoupled from the rest of the codebase (e.g. small modules).
2. Don't: Start a project without a clearly defined architecture in this format. Don't ask for tasks that require X amount of reading hops to understand the logic.
I find it frustrating that companies release their benchmaxxing without helping developers actually use their models. It's even more ironic that some people think of these AIs as employees. Employees can work with their boss on the best way to achieve things! With LLMs you don't even know how to communicate with them, and as a result their output is unreliable.
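As a rough illustration of the budget in point 1 above, a sketch in Python; the file names (beyond CONVENTIONS.md) and the simple line-count cutoff are assumptions, not any vendor's guidance.

    from pathlib import Path

    LINE_BUDGET = 10_000

    def build_prompt(task: str, module_files: list[str]) -> str:
        # Always include the conventions and architecture guidance first.
        parts = [Path("CONVENTIONS.md").read_text(), Path("ARCHITECTURE.md").read_text()]
        used = sum(p.count("\n") for p in parts)
        for f in module_files:  # pre-filtered to fairly decoupled modules
            code = Path(f).read_text()
            lines = code.count("\n")
            if used + lines > LINE_BUDGET:
                break  # stop before blowing the ~10k-line budget
            parts.append(f"# {f}\n{code}")
            used += lines
        return "\n\n".join(parts + [f"Task: {task}"])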
by dr_dshiv on 5/15/25, 5:10 AM
[1] http://ui.adsabs.harvard.edu/abs/2023arXiv230313988H/abstrac...
by badmonster on 5/15/25, 7:27 AM
Is it due to the model's training distribution (mostly single-shot completions), the way context windows are encoded, or an architectural bottleneck?
Feels like there's no dynamic internal state that evolves over the conversation — only a repeated re-parsing of static history. Has anyone seen work on integrating memory/state mechanisms that allow belief revision within a session, not just regurgitation of past tokens?
by jsemrau on 5/15/25, 5:53 AM
My conclusion was that context needs to be managed well for LLMs to maintain accuracy in their replies. It also helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process.
This also raises a discussion of general-use vs. workflow agent implementations, since in the former it is much more difficult to generalize all the components needed to structure effective ReAct patterns.
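A loose sketch of that plan-then-execute guardrail, assuming the openai Python client; the model name, prompts, and step-splitting are placeholders rather than the author's agent.

    from openai import OpenAI

    client = OpenAI()

    def ask(messages: list[dict]) -> str:
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        return resp.choices[0].message.content

    task = "Summarise last quarter's support tickets by root cause."

    # 1. Planning pass: an explicit, numbered plan acts as the guardrail.
    plan = ask([{"role": "user", "content": f"Write a numbered step-by-step plan for: {task}"}])

    # 2. Execution passes: each call carries the plan plus the current step only,
    #    instead of the whole accumulated conversation.
    results = [
        ask([
            {"role": "system", "content": f"Overall plan:\n{plan}"},
            {"role": "user", "content": f"Carry out this step only: {step}"},
        ])
        for step in plan.splitlines() if step.strip()
    ]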
by aleksituk on 5/15/25, 12:21 PM
We've been working on a lot of data processing and generation tasks, primarily through an API, but sometimes I end up testing data creation in a chat window. I first chat through the requirements for the data analysis/processing, and once I'm done I'd like the whole conversation summarised into basically a one-prompt process so that I can re-use it (because I can't really process new inputs via the chat).
Even when you do manage to get it down to a single prompt, you can use it in a chat and ask the chat to just keep producing new data (imagine a blog post in a certain style, where the base content is given as input and I'm making around 20 of them). Producing these in the chat has notable benefits: if something is wrong with the blog post the chat suggests, you can immediately edit it. The trouble is that the context window becomes so big that the chat starts to forget what the original instruction was, and eventually you have to just create a new chat.
One way to solve this is a chat with selective memory, where you keep the task in memory but have the chat forget (or not include) all the generated data so the context stays clean, only bringing it back into the context if the user refers to it.
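One possible shape for that selective memory, sketched in Python; the "post #N" reference convention and the pinned system prompt are assumptions for illustration.

    import re

    pinned_instruction = "Write a blog post in style X from the base content the user provides."
    outputs: list[str] = []  # everything generated so far, kept out of the prompt

    def build_messages(user_msg: str) -> list[dict]:
        messages = [{"role": "system", "content": pinned_instruction}]
        m = re.search(r"post #(\d+)", user_msg)  # e.g. "tweak post #7"
        if m and int(m.group(1)) <= len(outputs):
            # Re-inject only the one output the user is referring to.
            messages.append({"role": "assistant", "content": outputs[int(m.group(1)) - 1]})
        messages.append({"role": "user", "content": user_msg})
        return messages  # stays small no matter how many posts have been generated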
Has anyone else done data processing types of tasks in chats and had issues like this? Are there some other tools to use or tricks to do in chats?
by Zobat on 5/15/25, 2:58 PM
by sattard on 5/16/25, 12:16 PM
by veunes on 5/15/25, 6:30 AM
by dontreact on 5/15/25, 2:58 AM
by debuggerson on 5/15/25, 9:51 AM
by sky2224 on 5/15/25, 8:34 AM
by overflow897 on 5/15/25, 10:54 AM
I guess chain of thought should in theory do that, but variations in prompt and context might behave differently?
by RandyOrion on 5/18/25, 2:14 AM
by guardiang on 5/15/25, 6:40 AM
by Workaccount2 on 5/15/25, 2:12 PM
by coderatlarge on 5/15/25, 3:42 AM
by giordanol on 5/15/25, 12:45 PM
by WhitneyLand on 5/15/25, 12:27 PM
One of the biggest developments in language models over the last year has been test-time reasoning (aka inference scaling or “thinking”). Most vendors tested offer such a model. It’s plausible it could make a huge difference here, and they did not bother to test it or even mention it?
Things like CoT and planning can really affect this, and those are just a couple of things that happen automatically in more advanced models.
Seems like it wouldn't have been hard to add this to the experiment; failing that, they could've called it out in a “Limitations” or “Future Work” section, or at least with a single sentence like “We did not test chain-of-thought prompting, which may mitigate some of these issues”.
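For concreteness, a sketch of what such a comparison might look like, assuming the openai Python client; the model name, history, and prompts are placeholders, not anything from the paper.

    from openai import OpenAI

    client = OpenAI()
    history = [
        {"role": "user", "content": "Help me plan a data migration."},
        {"role": "assistant", "content": "Sure, what's the source system?"},
        {"role": "user", "content": "Actually, first estimate the downtime."},
    ]

    def reply(system: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "system", "content": system}] + history,
        )
        return resp.choices[0].message.content

    plain = reply("You are a helpful assistant.")
    cot = reply("Before answering, think through the whole conversation step by step.")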
by tsunamifury on 5/15/25, 4:12 AM
by alganet on 5/15/25, 2:56 AM
I have experienced that in person many, many times. Jumps in context that seem easy for one person to follow, but very hard for others.
So, assuming the paper is legit (arxiv, you never know...), it's more like something that could be improved than a difference from human beings.