by 7d7n on 5/29/24, 4:14 AM with 87 comments
by wokwokwok on 5/29/24, 8:35 AM
1) you’re sampling a distribution; if you only sample once, your sample is not representative of the distribution.
For evaluating prompts and running in production, your hallucination rate is inversely proportional to the number of times you sample.
Sampling many times and voting is a highly effective (but slow) strategy.
There is almost zero value in evaluating a prompt by only running it once.
2) Sequences are generated in order.
Asking an LLM to make a decision and justify its decision in that order is literally meaningless.
Once the “decision” tokens are generated, the justification does not influence them. It’s not like they happen “all at once”; there is a specific sequence to generating output, and later output cannot magically influence the output which has already been generated.
This is true across separate, sequential outputs from an LLM (obviously), but it is also true inside a single output: the tokens within one response are still generated one after another.
If you’re generating structured output (e.g. JSON, XML), where the fields look unordered, and your output is something like {decision: …, reason: …}, the reason field literally does nothing for the decision.
…but, it is valuable to “show the working out” when, as above, you then evaluate multiple solutions to a single request and pick the best one(s).
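To make the voting point concrete, a minimal sketch (call_llm is a stand-in for whatever client you actually use, not a real API; exact-match voting assumes short, canonical answers):

```
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.8) -> str:
    """Stand-in for your actual LLM client call."""
    raise NotImplementedError

def sample_and_vote(prompt: str, n: int = 5) -> str:
    """Sample the same prompt n times and return the most common answer.

    For free-form text you'd vote on an extracted field instead of the
    raw string.
    """
    answers = [call_llm(prompt).strip().lower() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```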
by DubiousPusher on 5/29/24, 5:33 AM
The LLM is like another user. And it can surprise you just like a user can. All the things you've done over the years to sanitize user input apply to LLM responses.
There is power beyond the conversational aspects of LLMs. Always ask, do you need to pass the actual text back to your user or can you leverage the LLM and constrain what you return?
LLMs are the best tool we've ever had for understanding user intent. They obsolete the hierarchies of decision trees and spaghetti logic we've written for years to classify user input into discrete tasks (realizing this and throwing away so much code has been the joy of the last year of my work).
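A rough illustration of what I mean by constraining what you return (the intent names and the call_llm helper are made up):

```
ALLOWED_INTENTS = {"check_balance", "reset_password", "talk_to_human"}

def call_llm(prompt: str) -> str:
    """Stand-in for your actual LLM client call."""
    raise NotImplementedError

def classify_intent(user_message: str) -> str:
    """Map free-form user input to one of a fixed set of intents.

    The model's raw text never reaches the user: we only act on a
    validated label, same discipline as sanitizing user input.
    """
    prompt = (
        "Classify the user's message into exactly one of: "
        + ", ".join(sorted(ALLOWED_INTENTS))
        + f"\nMessage: {user_message}\nAnswer with the label only."
    )
    label = call_llm(prompt).strip().lower()
    return label if label in ALLOWED_INTENTS else "talk_to_human"
```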
Being concise is key and these things suck at it.
If you leave a user alone with the LLM, some users will break it. No matter what you do.
by mloncode on 5/29/24, 5:09 AM
(Note: this is only Part 1 of 3 of a series that has already been written and the other 2 parts will be released shortly)
by __loam on 5/29/24, 6:47 AM
If I'm understanding this correctly, the standard way to get structured output seems to be to retry the query until the stochastic language model produces the expected output. RAG also seems like a hilariously thin wrapper over traditional search systems, and it still might hallucinate in that tiny distance between the search result and the user. Like we're talking about writing sentences and coaching what amounts to an autocomplete system to magically give us something we want. How is this industry getting hundreds of billions of dollars in investment?
Also the error rate is about 5-10% according to this article. That's pretty bad!
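For reference, the retry dance amounts to roughly this (a sketch; call_llm is a placeholder and the "schema check" is just plain Python validation):

```
import json

def call_llm(prompt: str) -> str:
    """Stand-in for your actual LLM client call."""
    raise NotImplementedError

def get_structured_output(prompt: str, required_keys: set[str], max_retries: int = 3) -> dict:
    """Re-ask until the model returns parseable JSON with the expected keys."""
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not even valid JSON; try again
        if isinstance(data, dict) and required_keys.issubset(data):
            return data
    raise ValueError("model never produced valid structured output")
```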
by elicksaur on 5/29/24, 1:32 PM
by Havoc on 5/29/24, 8:02 AM
by l5870uoo9y on 5/29/24, 6:07 AM
Testing lots of different SQL schema formats on https://www.sqlai.ai/, I found that the simpler the better. CSV (table name, table column, data type) outperformed both a JSON-formatted schema and a raw SQL schema dump, and it consumed fewer tokens.
If you need the database schema in a consistent format (e.g. CSV), just have the LLM extract and convert whatever the user provides into CSV. It shines at this.
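To illustrate, the CSV representation I mean is just something like this (helper and column names are illustrative):

```
def schema_to_csv(tables: dict[str, list[tuple[str, str]]]) -> str:
    """Flatten a schema into 'table,column,data type' lines for the prompt."""
    lines = ["table_name,column_name,data_type"]
    for table, columns in tables.items():
        for column, dtype in columns:
            lines.append(f"{table},{column},{dtype}")
    return "\n".join(lines)

# schema_to_csv({"users": [("id", "integer"), ("email", "text")]}) ->
# table_name,column_name,data_type
# users,id,integer
# users,email,text
```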
by surfingdino on 5/29/24, 5:49 AM
I am curious to know whether the authors tried building LLM applications in languages other than English, and what they learned while doing so.
An excellent post reminding me of the best O'Reilly articles from the past. Looking forward to parts 2 and 3.
by CuriouslyC on 5/29/24, 12:15 PM
by 7thpower on 5/29/24, 11:06 AM
by hubraumhugo on 5/29/24, 5:32 AM
One controversial point that has led to discussions in my team is this:
> A common anti-pattern/code smell in software is the “God Object,” where we have a single class or function that does everything. The same applies to prompts too.
In theory, a monolithic agent/prompt with infinite context size, a large toolset, and perfect attention would be ideal.
Multi-agent systems will always be less effective and more error-prone than monolithic systems on a given problem, because each agent has less context on the overall problem. Individual agents work best when they have entirely different functionalities.
I wrote down my thoughts about agent architectures here: https://www.kadoa.com/blog/ai-agents-hype-vs-reality
by anon373839 on 5/29/24, 8:47 AM
by hugobowne on 5/29/24, 5:13 AM
should be fun!
by lagrange77 on 5/29/24, 9:59 AM
by msp26 on 5/29/24, 9:28 AM
Some notes from my own experience on LLMs for NLP problems:
1) The output schema is usually more impactful than the text part of a prompt.
a) Field order matters a lot. At inference, the earlier tokens generated influence the next tokens.
b) Just have the CoT (chain of thought) as a field in the schema too.
c) PotentialField and ActualField pairs allow the LLM to create some broad options and then select the best ones; this somewhat mitigates the fact that it can't backtrack (see the sketch after these notes). If you have human evaluation in your process, it also makes it easier for reviewers to correct mistakes.
`'PotentialThemes': ['Surreal Worlds', 'Alternate History', 'Post-Apocalyptic'], 'FinalThemes': ['Surreal Worlds']`
d) Most well-defined problems should be possible zero-shot on a frontier model. Before rushing off to add examples, really check that you're solving the correct problem in the most sensible way.
2) Defining the schema as TypeScript types is flexible, reliable, and takes up minimal tokens. The output JSON structure is pretty much always correct (as long as it fits in the context window); the only issue is that the language model can pick values outside the schema, but that's easy to validate in post.
3) "Evaluating LLMs can be a minefield." yeah it's a pain in the ass.
4) Adding too many examples increases the token cost per item a lot. I've found that it's possible to process several items in one prompt; despite seeming silly and inefficient, it works reliably and cheaply.
5) Example selection is not trivial and can cause very subtle errors.
6) Structuring your inputs with XML is very good. Even if you're trying to get JSON output, XML input seems to work better. (Haven't extensively tested this because eval is hard).
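To make 1b, 1c and 2 concrete, roughly this (field names, the schema text, and call_llm are illustrative, not a real API):

```
import json

# Schema written as TypeScript types, embedded in the prompt as plain text.
# Field order matters: the CoT and Potential* fields come first so their
# tokens are generated before the final answer.
SCHEMA = """
type Analysis = {
  Reasoning: string;          // CoT lives in the output itself
  PotentialThemes: string[];  // broad options first...
  FinalThemes: string[];      // ...then the narrowed-down pick
};
"""

ALLOWED_THEMES = {"Surreal Worlds", "Alternate History", "Post-Apocalyptic"}

def call_llm(prompt: str) -> str:
    """Stand-in for your actual LLM client call."""
    raise NotImplementedError

def analyse(text: str) -> dict:
    prompt = (
        "Return a JSON object matching this TypeScript type:\n"
        + SCHEMA
        + f"\nText to analyse:\n{text}"
    )
    data = json.loads(call_llm(prompt))
    # The model can pick values outside the schema, so validate in post.
    data["FinalThemes"] = [t for t in data.get("FinalThemes", []) if t in ALLOWED_THEMES]
    return data
```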
by goldemerald on 5/29/24, 5:44 AM
by mark_l_watson on 5/29/24, 2:33 PM
by beepbooptheory on 5/29/24, 9:32 PM
Is it the thing itself, or is it the thing that enables us?