by simonpure on 5/15/25, 2:28 AM with 259 comments
by Benjammer on 5/15/25, 2:53 AM
by Sharlin on 5/15/25, 3:01 AM
This, of course, has certain implications as to the wisdom of the idea of “replacing human programmers”, given that one of the hard parts of the trade is trying to turn vague and often confused ideas into precise specifications by interacting with the stakeholders.
by tmountain on 5/15/25, 11:45 AM
by airylizard on 5/15/25, 4:43 AM
+30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.
Free, open framework; check the repo and try it yourself:
https://github.com/AutomationOptimization/tsce_demo
I tested this another 300 times with gpt-4.1 to remove those obtrusive "em-dashes" everyone hates. I tested a single-pass baseline vs TSCE, with the exact same instructions and prompt: "Remove the em-dashes from my linkedin post. . .".
Out of the 300 tests, baseline failed to remove the em-dashes 149/300 times. TSCE failed to remove the em-dashes 18/300 times.
It works; all the data, as well as the entire script used for testing, is in the repo.
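Not the repo's code, but a minimal sketch of that pass/fail harness, assuming the openai Python client; the TSCE side would swap run_baseline for the framework's own two-pass call, and the post text here is a stand-in.

    from openai import OpenAI

    client = OpenAI()
    PROMPT = "Remove the em-dashes from my linkedin post: <post text here>"  # stand-in text

    def run_baseline(prompt: str) -> str:
        # Single-pass call: one prompt, one completion, no second step.
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def failed(output: str) -> bool:
        # Pass/fail check: does the output still contain an em-dash (U+2014)?
        return "\u2014" in output

    failures = sum(failed(run_baseline(PROMPT)) for _ in range(300))
    print(f"baseline failures: {failures}/300")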
by zacksiri on 5/15/25, 3:08 AM
It dynamically swaps portions of the context in and out. The system is also not based on explicit definitions; it relies on LLMs 'filling the gaps'. It helps the LLM break problems down into small tasks, which then eventually aggregate into the full task.
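A loose sketch of what that swapping could look like; the keyword-overlap scoring and the three-chunk budget are stand-ins (the comment doesn't describe the actual selection mechanism, and a real system would likely use embeddings or another retrieval step).

    # Naive keyword-overlap scoring; a real system would likely use embeddings.
    def score(chunk: str, task: str) -> int:
        return len(set(chunk.lower().split()) & set(task.lower().split()))

    def build_context(chunks: list[str], task: str, budget: int = 3) -> list[dict]:
        # Keep only the highest-scoring chunks for this subtask; everything else
        # stays swapped out of the prompt until a later subtask needs it.
        relevant = sorted(chunks, key=lambda c: score(c, task), reverse=True)[:budget]
        return [
            {"role": "system", "content": "\n\n".join(relevant)},
            {"role": "user", "content": task},
        ]

    subtasks = ["parse the invoice header", "sum the line items", "emit a JSON report"]
    knowledge = ["Invoice headers contain ...", "Line items look like ...", "The JSON report schema is ..."]
    for sub in subtasks:
        messages = build_context(knowledge, sub)
        # each subtask gets its own trimmed message list; the results are then
        # aggregated back into the full task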
by jumploops on 5/15/25, 6:15 AM
You can edit responses, sure, but then a bunch of other context is lost.
My flow is basically:
1. plan
2. build
3. branch (into some feature/esoteric dependency issue)
4. goto #2
Prompt pruning/branching should be a first-class tool for any LLM usage.
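A minimal sketch of that branch-and-return flow in plain Python, assuming an OpenAI-style message list; the helper and example strings are illustrative, not any particular tool's API.

    from copy import deepcopy

    main_thread = [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Here is the plan: ..."},
        {"role": "assistant", "content": "Understood, starting with step 1."},
    ]

    def branch(history: list, new_task: str) -> list:
        fork = deepcopy(history)  # the plan and build context come along
        fork.append({"role": "user", "content": new_task})
        return fork

    # Chase the esoteric dependency issue in a fork, then keep building on
    # main_thread as if the side quest never happened.
    dep_thread = branch(main_thread, "Why does this lockfile conflict happen?")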
by podgorniy on 5/15/25, 8:13 AM
I've built a Telegram bot, http://t.me/experai_bot, as a universal UI to LLMs (with somewhat reduced functionality), exactly around the idea that "a non-reply message means a new conversation". Wanna keep context? Keep replying to the bot's replies. Non-power users struggle with this idea.
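Not the bot's actual code, but the gist of that reply-threading rule; the Telegram framework plumbing is omitted and the message-id bookkeeping is an assumption.

    conversations: dict[int, list] = {}  # bot message id -> message history

    def handle_incoming(text: str, reply_to_id: int | None, new_bot_msg_id: int) -> list:
        if reply_to_id in conversations:
            history = conversations.pop(reply_to_id)  # user replied: continue that thread
        else:
            history = []  # any non-reply message starts a fresh conversation
        history.append({"role": "user", "content": text})
        # ...call the LLM with `history` here and append its answer...
        conversations[new_bot_msg_id] = history  # context lives on the bot's reply
        return history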
--
Also, I observed that OpenAI models performed worse when replying to the same questions (for example, the list of options in a reply got shorter) even with the smallest system message. That was the case with 3.5 and 4o; I don't know how modern ones behave. That made me decide not to include any system messages by default. Still, I give the option to add them if you need to. You can even toggle them to mix and match.
by permo-w on 5/15/25, 3:10 AM
by t-kalinowski on 5/15/25, 11:29 AM
by SamPatt on 5/15/25, 6:29 AM
Through experience you develop a knack for how to steer the models and when to start a new conversation. The system or initial prompt is important, but nothing will save you if you naively keep a conversation going too long.
by ranyume on 5/15/25, 3:15 AM
Stuff like this:
1. Do: Best practice for X model is to include at most 10k lines of code + task + CONVENTIONS.md + architecture guidance. Only queue tasks for components that are fairly decoupled from the rest of the codebase (e.g. small modules).
2. Don't: Start a project without a clearly defined architecture in this format. Don't ask for tasks that require X amount of reading hops to understand the logic.
I find it frustrating that companies release their benchmaxxing without helping developers actually use their models. It's even more ironic that some people think of these AIs as employees. Employees can work with their boss on the best way to achieve things! With LLMs you don't even know how to communicate with them, and as a result their output is unreliable.
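As a rough illustration of the budget in point 1 above, a sketch in Python; the file names (beyond CONVENTIONS.md) and the simple line-count cutoff are assumptions, not any vendor's guidance.

    from pathlib import Path

    LINE_BUDGET = 10_000

    def build_prompt(task: str, module_files: list[str]) -> str:
        # Always include the conventions and architecture guidance first.
        parts = [Path("CONVENTIONS.md").read_text(), Path("ARCHITECTURE.md").read_text()]
        used = sum(p.count("\n") for p in parts)
        for f in module_files:  # pre-filtered to fairly decoupled modules
            code = Path(f).read_text()
            lines = code.count("\n")
            if used + lines > LINE_BUDGET:
                break  # stop before blowing the ~10k-line budget
            parts.append(f"# {f}\n{code}")
            used += lines
        return "\n\n".join(parts + [f"Task: {task}"])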
by dr_dshiv on 5/15/25, 5:10 AM
[1] http://ui.adsabs.harvard.edu/abs/2023arXiv230313988H/abstrac...
by badmonster on 5/15/25, 7:27 AM
Is it due to the model's training distribution (mostly single-shot completions), the way context windows are encoded, or an architectural bottleneck?
Feels like there's no dynamic internal state that evolves over the conversation — only a repeated re-parsing of static history. Has anyone seen work on integrating memory/state mechanisms that allow belief revision within a session, not just regurgitation of past tokens?
by jsemrau on 5/15/25, 5:53 AM
My conclusion was that context needs to be managed well for LLMs to maintain accuracy in their replies. It also helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process.
This also raises a discussion of general-use vs. workflow agent implementations, since in the former it is much more difficult to generalize all the components needed to structure effective ReAct patterns.
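A loose sketch of that plan-then-execute guardrail, assuming the openai Python client; the model name, prompts, and step-splitting are placeholders rather than the author's agent.

    from openai import OpenAI

    client = OpenAI()

    def ask(messages: list[dict]) -> str:
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        return resp.choices[0].message.content

    task = "Summarise last quarter's support tickets by root cause."

    # 1. Planning pass: an explicit, numbered plan acts as the guardrail.
    plan = ask([{"role": "user", "content": f"Write a numbered step-by-step plan for: {task}"}])

    # 2. Execution passes: each call carries the plan plus the current step only,
    #    instead of the whole accumulated conversation.
    results = [
        ask([
            {"role": "system", "content": f"Overall plan:\n{plan}"},
            {"role": "user", "content": f"Carry out this step only: {step}"},
        ])
        for step in plan.splitlines() if step.strip()
    ]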
by aleksituk on 5/15/25, 12:21 PM
We've been working on a lot of data processing and generation tasks, primarily through an API, but sometimes I end up testing data creation in a chat window. I first chat through the requirements for the data analysis/processing, and once I'm done I'd like the whole conversation summarised into basically a one-prompt process so that I can re-use it (because I can't really process new inputs via the chat).
Even when you do manage to get it down to a single prompt, you can use it in a chat and ask the chat to just keep producing new data (imagine a blog post in a certain style, where the base content is given as input and I'm making around 20 of them). Producing these in the chat has notable benefits: if something is wrong with the blog post the chat suggests, you can immediately edit it. The trouble is that the context window becomes so big that the chat starts to forget what the original instruction was, and eventually you have to just create a new chat.
One way to solve this is a chat with selective memory, where you keep the task in memory but have the chat forget (or not include) all the generated data so the context stays clean, only bringing it back into the context if the user refers to it.
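One possible shape for that selective memory, sketched in Python; the "post #N" reference convention and the pinned system prompt are assumptions for illustration.

    import re

    pinned_instruction = "Write a blog post in style X from the base content the user provides."
    outputs: list[str] = []  # everything generated so far, kept out of the prompt

    def build_messages(user_msg: str) -> list[dict]:
        messages = [{"role": "system", "content": pinned_instruction}]
        m = re.search(r"post #(\d+)", user_msg)  # e.g. "tweak post #7"
        if m and int(m.group(1)) <= len(outputs):
            # Re-inject only the one output the user is referring to.
            messages.append({"role": "assistant", "content": outputs[int(m.group(1)) - 1]})
        messages.append({"role": "user", "content": user_msg})
        return messages  # stays small no matter how many posts have been generated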
Has anyone else done data processing types of tasks in chats and had issues like this? Are there some other tools to use or tricks to do in chats?
by Zobat on 5/15/25, 2:58 PM
by sattard on 5/16/25, 12:16 PM
by veunes on 5/15/25, 6:30 AM
by dontreact on 5/15/25, 2:58 AM
by debuggerson on 5/15/25, 9:51 AM
by sky2224 on 5/15/25, 8:34 AM
by overflow897 on 5/15/25, 10:54 AM
I guess chain of thought should in theory do that, but variations in prompt and context might behave differently?
by RandyOrion on 5/18/25, 2:14 AM
by guardiang on 5/15/25, 6:40 AM
by Workaccount2 on 5/15/25, 2:12 PM
by coderatlarge on 5/15/25, 3:42 AM
by giordanol on 5/15/25, 12:45 PM
by WhitneyLand on 5/15/25, 12:27 PM
One of the biggest developments in language models over the last year has been test-time reasoning (aka inference scaling or “thinking”). Most vendors tested offer such a model. It’s plausible it could make a huge difference here, and they did not bother to test it or even mention it?
Things like CoT and planning can really affect this, and those are just a couple of things that happen automatically in more advanced models.
Seems like it wouldn't have been hard to add this to the experiment; failing that, they could've called it out in a “Limitations” or “Future Work” section, or at least with a single sentence like “We did not test chain-of-thought prompting, which may mitigate some of these issues”.
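For concreteness, a sketch of what such a comparison might look like, assuming the openai Python client; the model name, history, and prompts are placeholders, not anything from the paper.

    from openai import OpenAI

    client = OpenAI()
    history = [
        {"role": "user", "content": "Help me plan a data migration."},
        {"role": "assistant", "content": "Sure, what's the source system?"},
        {"role": "user", "content": "Actually, first estimate the downtime."},
    ]

    def reply(system: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "system", "content": system}] + history,
        )
        return resp.choices[0].message.content

    plain = reply("You are a helpful assistant.")
    cot = reply("Before answering, think through the whole conversation step by step.")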
by tsunamifury on 5/15/25, 4:12 AM
by alganet on 5/15/25, 2:56 AM
I have experienced that in person many, many times. Jumps in context that seem easy for one person to follow, but very hard for others.
So, assuming the paper is legit (arxiv, you never know...), it's more like something that could be improved than a difference from human beings.