by 7d7n on 5/29/24, 4:14 AM with 87 comments
by wokwokwok on 5/29/24, 8:35 AM
1) you’re sampling a distribution; if you only sample once, your sample is not representative of the distribution.
For evaluating prompts and running in production, your hallucination rate is inversely proportional to the number of times you sample.
Sampling many times and voting is a highly effective (but slow) strategy.
There is almost zero value in evaluating a prompt by only running it once.
2) Sequences are generated in order.
Asking an LLM to make a decision and justify its decision in that order is literally meaningless.
Once the “decision” tokens are generated, the justification does not influence them. It’s not like they happen “all at once”; there is a specific sequence to generating output, and later output cannot magically influence the output which has already been generated.
This is true across separate, sequential outputs from an LLM (obviously), but it is also true inside a single output: the tokens within one response are still generated one after another.
If you’re generating structured output (e.g. JSON, XML), where the fields look unordered, and your output is something like {decision: …, reason: …}, the reason field literally does nothing for the decision.
…but, it is valuable to “show the working out” when, as above, you then evaluate multiple solutions to a single request and pick the best one(s).
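To make the voting point concrete, a minimal sketch (call_llm is a stand-in for whatever client you actually use, not a real API; exact-match voting assumes short, canonical answers):

```
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.8) -> str:
    """Stand-in for your actual LLM client call."""
    raise NotImplementedError

def sample_and_vote(prompt: str, n: int = 5) -> str:
    """Sample the same prompt n times and return the most common answer.

    For free-form text you'd vote on an extracted field instead of the
    raw string.
    """
    answers = [call_llm(prompt).strip().lower() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```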
by DubiousPusher on 5/29/24, 5:33 AM
The LLM is like another user. And it can surprise you just like a user can. All the things you've done over the years to sanitize user input apply to LLM responses.
There is power beyond the conversational aspects of LLMs. Always ask, do you need to pass the actual text back to your user or can you leverage the LLM and constrain what you return?
LLMs are the best tool we've ever had for understanding user intent. They obsolete the hierarchies of decision trees and spaghetti logic we've written for years to classify user input into discrete tasks (realizing this and throwing away so much code has been the joy of the last year of my work).
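A rough illustration of what I mean by constraining what you return (the intent names and the call_llm helper are made up):

```
ALLOWED_INTENTS = {"check_balance", "reset_password", "talk_to_human"}

def call_llm(prompt: str) -> str:
    """Stand-in for your actual LLM client call."""
    raise NotImplementedError

def classify_intent(user_message: str) -> str:
    """Map free-form user input to one of a fixed set of intents.

    The model's raw text never reaches the user: we only act on a
    validated label, same discipline as sanitizing user input.
    """
    prompt = (
        "Classify the user's message into exactly one of: "
        + ", ".join(sorted(ALLOWED_INTENTS))
        + f"\nMessage: {user_message}\nAnswer with the label only."
    )
    label = call_llm(prompt).strip().lower()
    return label if label in ALLOWED_INTENTS else "talk_to_human"
```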
Being concise is key and these things suck at it.
If you leave a user alone with the LLM, some users will break it. No matter what you do.
by mloncode on 5/29/24, 5:09 AM
(Note: this is only Part 1 of 3 of a series that has already been written and the other 2 parts will be released shortly)
by __loam on 5/29/24, 6:47 AM
If I'm understanding this correctly, the standard way to get structured output seems to be to retry the query until the stochastic language model produces the expected output. RAG also seems like a hilariously thin wrapper over traditional search systems, and it still might hallucinate in that tiny distance between the search result and the user. Like we're talking about writing sentences and coaching what amounts to an autocomplete system to magically give us something we want. How is this industry getting hundreds of billions of dollars in investment?
Also the error rate is about 5-10% according to this article. That's pretty bad!
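For reference, the retry dance amounts to roughly this (a sketch; call_llm is a placeholder and the "schema check" is just plain Python validation):

```
import json

def call_llm(prompt: str) -> str:
    """Stand-in for your actual LLM client call."""
    raise NotImplementedError

def get_structured_output(prompt: str, required_keys: set[str], max_retries: int = 3) -> dict:
    """Re-ask until the model returns parseable JSON with the expected keys."""
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not even valid JSON; try again
        if isinstance(data, dict) and required_keys.issubset(data):
            return data
    raise ValueError("model never produced valid structured output")
```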
by elicksaur on 5/29/24, 1:32 PM
by Havoc on 5/29/24, 8:02 AM
by l5870uoo9y on 5/29/24, 6:07 AM
Testing lots of different SQL schema formats on https://www.sqlai.ai/, I found that the simpler the better. CSV (table name, table column, data type) outperformed both a JSON-formatted schema and a raw SQL schema dump, and it consumed fewer tokens.
If you need the database schema in a consistent format (e.g. CSV), just have the LLM extract and convert whatever the user provides into CSV. It shines at this.
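To illustrate, the CSV representation I mean is just something like this (helper and column names are illustrative):

```
def schema_to_csv(tables: dict[str, list[tuple[str, str]]]) -> str:
    """Flatten a schema into 'table,column,data type' lines for the prompt."""
    lines = ["table_name,column_name,data_type"]
    for table, columns in tables.items():
        for column, dtype in columns:
            lines.append(f"{table},{column},{dtype}")
    return "\n".join(lines)

# schema_to_csv({"users": [("id", "integer"), ("email", "text")]}) ->
# table_name,column_name,data_type
# users,id,integer
# users,email,text
```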
by surfingdino on 5/29/24, 5:49 AM
I am curious to know whether the authors tried building LLM applications in languages other than English, and what they learned while doing so.
An excellent post reminding me of the best O'Reilly articles from the past. Looking forward to parts 2 and 3.
by CuriouslyC on 5/29/24, 12:15 PM
by 7thpower on 5/29/24, 11:06 AM
by hubraumhugo on 5/29/24, 5:32 AM
One controversial point that has led to discussions in my team is this:
> A common anti-pattern/code smell in software is the “God Object,” where we have a single class or function that does everything. The same applies to prompts too.
In theory, a monolithic agent/prompt with infinite context size, a large toolset, and perfect attention would be ideal.
Multi-agent systems will always be less effective and more error-prone than monolithic systems on a given problem, because each agent has less context on the overall problem. Individual agents work best when they have entirely different functionalities.
I wrote down my thoughts about agent architectures here: https://www.kadoa.com/blog/ai-agents-hype-vs-reality
by anon373839 on 5/29/24, 8:47 AM
by hugobowne on 5/29/24, 5:13 AM
should be fun!
by lagrange77 on 5/29/24, 9:59 AM
by msp26 on 5/29/24, 9:28 AM
Some notes from my own experience on LLMs for NLP problems:
1) The output schema is usually more impactful than the text part of a prompt.
a) Field order matters a lot. At inference, the earlier tokens generated influence the next tokens.
b) Just have the CoT (chain of thought) as a field in the schema too.
c) PotentialField and ActualField pairs allow the LLM to create some broad options and then select the best ones; this somewhat mitigates the fact that it can't backtrack (see the sketch after these notes). If you have human evaluation in your process, it also makes it easier for reviewers to correct mistakes.
`'PotentialThemes': ['Surreal Worlds', 'Alternate History', 'Post-Apocalyptic'], 'FinalThemes': ['Surreal Worlds']`
d) Most well-defined problems should be possible zero-shot on a frontier model. Before rushing off to add examples, really check that you're solving the correct problem in the most sensible way.
2) Defining the schema as TypeScript types is flexible, reliable, and takes up minimal tokens. The output JSON structure is pretty much always correct (as long as it fits in the context window); the only issue is that the language model can pick values outside the schema, but that's easy to validate in post.
3) "Evaluating LLMs can be a minefield." yeah it's a pain in the ass.
4) Adding too many examples increases the token cost per item a lot. I've found that it's possible to process several items in one prompt; despite seeming silly and inefficient, it works reliably and cheaply.
5) Example selection is not trivial and can cause very subtle errors.
6) Structuring your inputs with XML is very good. Even if you're trying to get JSON output, XML input seems to work better. (Haven't extensively tested this because eval is hard).
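To make 1b, 1c and 2 concrete, roughly this (field names, the schema text, and call_llm are illustrative, not a real API):

```
import json

# Schema written as TypeScript types, embedded in the prompt as plain text.
# Field order matters: the CoT and Potential* fields come first so their
# tokens are generated before the final answer.
SCHEMA = """
type Analysis = {
  Reasoning: string;          // CoT lives in the output itself
  PotentialThemes: string[];  // broad options first...
  FinalThemes: string[];      // ...then the narrowed-down pick
};
"""

ALLOWED_THEMES = {"Surreal Worlds", "Alternate History", "Post-Apocalyptic"}

def call_llm(prompt: str) -> str:
    """Stand-in for your actual LLM client call."""
    raise NotImplementedError

def analyse(text: str) -> dict:
    prompt = (
        "Return a JSON object matching this TypeScript type:\n"
        + SCHEMA
        + f"\nText to analyse:\n{text}"
    )
    data = json.loads(call_llm(prompt))
    # The model can pick values outside the schema, so validate in post.
    data["FinalThemes"] = [t for t in data.get("FinalThemes", []) if t in ALLOWED_THEMES]
    return data
```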
by goldemerald on 5/29/24, 5:44 AM
by mark_l_watson on 5/29/24, 2:33 PM
by beepbooptheory on 5/29/24, 9:32 PM
Is it the thing itself, or is it the thing that enables us?