by loganfrederick on 9/20/24, 1:57 AM with 72 comments
by underlines on 9/20/24, 10:13 AM
- Hybrid Retrieval (semantic + vector) and then LLM-based reranking made no significant change using synthetic eval-questions
- HyDE decreased answer quality and retrieval quality severely when measured with RAGAS using synthetic eval-questions
(we still have to do a RAGAS eval using expert and real user questions)
So yes, hybrid retrieval is always good - that's no news to anyone building production-ready or enterprise RAG solutions. But one method doesn't always win. We found the semantic search of Azure AI Search to be sufficient as a second method, next to vector similarity. Others might find BM25 great, or a fine-tuned query post-processing SLM. Depends on the use case. Test, test, test.
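For readers who haven't built one of these pipelines, here is a minimal sketch of hybrid retrieval, merging a lexical ranking and a vector-similarity ranking with reciprocal rank fusion. The bm25_search / vector_search callables and the k=60 constant are placeholders, not anything specific to Azure AI Search.

    # Minimal hybrid-retrieval sketch: merge a lexical and a vector ranking
    # with reciprocal rank fusion (RRF). Both search callables are assumed to
    # return document ids ordered best-first.
    from collections import defaultdict

    def reciprocal_rank_fusion(rankings, k=60):
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking):
                scores[doc_id] += 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)

    def hybrid_search(query, bm25_search, vector_search, top_k=10):
        lexical = bm25_search(query, top_k=50)     # e.g. BM25 or a keyword/semantic search API
        semantic = vector_search(query, top_k=50)  # embedding similarity search
        return reciprocal_rank_fusion([lexical, semantic])[:top_k]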
Next things we're going to try:
- RAPTOR
- SelfRAG
- Agentic RAG
- Query Refinement (expansion and sub-queries)
- GraphRAG
Learning so far:
- Always use a baseline and an experiment to try to refute your null hypothesis using measures like RAGAS or others.
- Use three types of evaluation questions/answers: 1. Expert written q&a, 2. Real user questions (from logs), 3. Synthetic q&a generated from your source documents
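A minimal sketch of that baseline-vs-experiment loop, using plain retrieval recall as a stand-in for the full RAGAS metrics; the retriever callables and the three eval sets are placeholders.

    # Sketch: compare a baseline retriever against an experiment on the three
    # question types listed above. recall@k stands in for RAGAS here.
    def recall_at_k(retrieve, eval_set, k=20):
        hits = 0
        for item in eval_set:  # item: {"question": str, "relevant_ids": iterable of chunk ids}
            retrieved = set(retrieve(item["question"], top_k=k))
            if retrieved & set(item["relevant_ids"]):
                hits += 1
        return hits / len(eval_set)

    def compare(retrieve_baseline, retrieve_experiment, eval_sets, k=20):
        # eval_sets: {"expert": [...], "user_logs": [...], "synthetic": [...]}
        for name, questions in eval_sets.items():
            base = recall_at_k(retrieve_baseline, questions, k)
            exp = recall_at_k(retrieve_experiment, questions, k)
            print(f"{name}: baseline={base:.3f} experiment={exp:.3f}")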
by simonw on 9/20/24, 6:34 AM
That's priced at around 1/10th of what the prompts would normally cost if they weren't cached, which means that tricks like this (running every single chunk against a full copy of the original document) become feasible where previously they wouldn't have made financial sense.
I bet there are all sorts of other neat tricks like this which are opened up by caching cost savings.
My notes on contextual retrieval: https://simonwillison.net/2024/Sep/20/introducing-contextual... and prompt caching: https://simonwillison.net/2024/Aug/14/prompt-caching-with-cl...
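For concreteness, a rough sketch of the trick being described, loosely in the spirit of Anthropic's cookbook: the full document goes in a cached content block and each chunk gets a short situating context generated against it. The model name, prompt wording, and caching details here are assumptions (older SDK versions also required a prompt-caching beta header), so treat this as a sketch rather than the cookbook code.

    # Sketch of contextual retrieval with prompt caching (names/prompts are assumptions).
    # The whole document is marked cache_control=ephemeral, so repeated calls for
    # each chunk reuse the cached document prefix at a fraction of the input cost.
    import anthropic

    client = anthropic.Anthropic()

    def situate_chunk(document: str, chunk: str) -> str:
        response = client.messages.create(
            model="claude-3-haiku-20240307",  # assumed; any caching-capable model works
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"<document>\n{document}\n</document>",
                     "cache_control": {"type": "ephemeral"}},  # cached across chunks
                    {"type": "text",
                     "text": f"Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
                             "Write a short context that situates this chunk within the "
                             "document, to improve search retrieval. Answer with only the context."},
                ],
            }],
        )
        return response.content[0].text

    # The generated context is then prepended to the chunk before embedding / BM25 indexing.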
by postalcoder on 9/20/24, 7:56 AM
I'm not sure what Anthropic is introducing here. I looked at the cookbook code and it's just showing the process of producing said context, but there's no actual change to their API regarding "contextual retrieval".
The one change is prompt caching, introduced a month back, which allows you to very cheaply add better context to individual chunks by providing the entire (long) document as context. Caching is an awesome feature to expose to developers and I don't want to take anything away from that.
However, other than that, the only thing I see introduced is just a cookbook on how to do a particular rag workflow.
As an aside, Cohere may be my favorite API to work with (no affiliation). Their RAG API is a delight, and unlike anything else provided by other providers. I highly recommend it.
by valstu on 9/20/24, 7:44 AM
We do something similar by including the document headings in each chunk when indexing. So a chunk that used to be just:

    The usual dose for adults is one or two 200mg tablets or
    capsules 3 times a day.

It is now something like:

    # Fever
    ## Treatment
    ---
    The usual dose for adults is one or two 200mg tablets or
    capsules 3 times a day.

This seems to work pretty well, and doesn't require any LLMs when indexing documents. (Edited formatting)
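A minimal sketch of doing this header prepending automatically for markdown sources (the chunking granularity and the "---" separator are just guesses matching the example above):

    # Sketch: prepend the current heading path to each markdown chunk, so the
    # dosage text above becomes "# Fever\n## Treatment\n---\nThe usual dose ...".
    def chunk_markdown_with_headings(text: str):
        headings = {}          # heading level -> latest heading line at that level
        chunks, body = [], []

        def flush():
            if body:
                path = [headings[level] for level in sorted(headings)]
                chunks.append("\n".join(path + ["---", "\n".join(body)]))
                body.clear()

        for line in text.splitlines():
            if line.startswith("#"):
                flush()
                level = len(line) - len(line.lstrip("#"))
                headings[level] = line.strip()
                for deeper in [l for l in headings if l > level]:
                    del headings[deeper]   # a new section invalidates deeper headings
            elif line.strip():
                body.append(line)
        flush()
        return chunks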
by skeptrune on 9/20/24, 6:24 AM
Vector embeddings have bag-of-words compression properties and can over-index on the first newline-separated text block, to the extent that certain indices in the resulting vector end up much closer to 0 than they otherwise would. With quantization, they can eventually become 0 and cause you to lose out on lots of precision with the dense vectors. IDF search overcomes this to some extent, but not enough.
You can "semantically boost" embeddings such that they move closer to your document's title, summary, abstract, etc. and get the recall benefits of this "context" prepend without polluting the underlying vector. Implementation-wise it's a weighted sum. During the augmentation step where you put things in the context window, you can always inject the summary chunk when the doc matches as well. Much cleaner solution imo.
Description of "semantic boost" in the Trieve API[1]:
> semantic_boost: Semantic boost is useful for moving the embedding vector of the chunk in the direction of the distance phrase. I.e. you can push a chunk with a chunk_html of "iphone" 25% closer to the term "flagship" by using the distance phrase "flagship" and a distance factor of 0.25. Conceptually it's drawing a line (euclidean/L2 distance) between the vector for the innerText of the chunk_html and distance_phrase, then moving the vector of the chunk_html distance_factor * L2Distance closer to or away from the distance_phrase point along the line between the two points.
[1]: https://docs.trieve.ai/api-reference/chunk/create-or-upsert-...
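A minimal numpy sketch of that weighted-sum idea, moving a chunk vector part of the way along the straight line toward a boost phrase's vector; embed() is a placeholder for whatever embedding model is in use:

    import numpy as np

    def semantic_boost(chunk_vec: np.ndarray, phrase_vec: np.ndarray, factor: float) -> np.ndarray:
        """Move chunk_vec `factor` of the way toward phrase_vec along the
        L2 line between them; a negative factor pushes it away instead."""
        return chunk_vec + factor * (phrase_vec - chunk_vec)

    # e.g. push the "iphone" chunk 25% closer to "flagship":
    # boosted = semantic_boost(embed("iphone"), embed("flagship"), 0.25)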
by _bramses on 9/20/24, 8:26 AM
Another way to look at it: comments. Imagine every comment under this post is a pointer back to the original post. Some will be close in distance, and others will be farther, due to the perception of the authors of the comments themselves. But if you assign each comment a “parent_id”, your access to the post multiplies.
You can see an example of this technique here [1]. I don’t attempt to mind-read what the end user will query for; I simply let them tell me, and then index that as a pointer. There are only a finite number of options for representing a given object, but some representations are very, very, very far from the semantic meaning of the core object.
[1] - https://x.com/yourcommonbase/status/1833262865194557505
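A rough sketch of that pointer indexing, assuming a generic vector store with metadata (the store API, embed(), and field names here are all stand-ins, not a specific product):

    # Sketch: index each user-provided phrasing as its own entry that points back
    # to the original document via parent_id, then resolve pointers at query time.
    def add_pointer(store, parent_id: str, user_phrasing: str):
        store.upsert(
            id=f"{parent_id}:{abs(hash(user_phrasing))}",
            vector=embed(user_phrasing),                # embed() is a placeholder
            metadata={"parent_id": parent_id, "text": user_phrasing},
        )

    def search(store, query: str, top_k: int = 5):
        hits = store.query(vector=embed(query), top_k=top_k)
        # resolve each hit back to its parent document before returning
        return [store.get(hit.metadata["parent_id"]) for hit in hits]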
by layoric on 9/21/24, 2:37 AM
by ValentinA23 on 9/20/24, 3:25 PM
An example: let's suppose you're using an LLM to play a multi-user dungeon. In the past your character has behaved badly with taxis, so the game has created a rule that whenever you try to enter a taxi you're kicked out: "we know who you are, we refuse to have you as a client until you formally apologize to the taxi company director". Upon apologizing, the rule is removed. Note that the director of the taxi company could be another player, and could be the one who issued the rule in the first place, to be enforced by his NPC fleet of taxis.
I'm wondering how well this could scale (with respect of number of active rules) and to which extent traditional RAG could be applied. It seems deciding whether a rule applies or not is a problem that is more abstract and difficult than deciding whether a chunk of knowledge is relevant or not.
In particular, the main problem I have identified that makes it more difficult is the following dependency loop, which doesn't appear with knowledge retrieval: you need to retrieve a rule before you can identify whether it applies. Does anyone know how this problem could be solved?
by msp26 on 9/20/24, 12:12 PM
I would prefer that Anthropic just release their tokeniser so we don't have to make guesses.
by paxys on 9/20/24, 2:58 PM
by skybrian on 9/20/24, 4:26 AM
by will-burner on 9/20/24, 4:13 PM
Does anyone know if the datasets they used for the evaluation are publicly available or if they give more information on the datasets than what's in appendix II?
There are standard publicly available datasets for this type of evaluation, like MTEB (https://github.com/embeddings-benchmark/mteb). I wonder how this technique does on the MTEB dataset.
by davedx on 9/20/24, 3:31 PM
I wonder how it would work if you generated the contexts yourself algorithmically. Depending on how well-structured your docs are, this could be quite trivial (e.g. for an HTML doc, insert the title > h1 > h2 > chunk).
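A hedged sketch of that idea for HTML, using BeautifulSoup to build a title > h1 > h2 breadcrumb for each paragraph-level chunk (treating each <p> as a chunk is a simplification, and find_previous just grabs the nearest earlier heading):

    # Sketch: prepend a "title > h1 > h2" breadcrumb to each chunk of an HTML doc.
    from bs4 import BeautifulSoup

    def breadcrumb_for(element, soup) -> str:
        title = soup.title.get_text(strip=True) if soup.title else ""
        h1 = element.find_previous("h1")
        h2 = element.find_previous("h2")
        parts = [title,
                 h1.get_text(strip=True) if h1 else "",
                 h2.get_text(strip=True) if h2 else ""]
        return " > ".join(p for p in parts if p)

    def contextual_chunks(html: str):
        soup = BeautifulSoup(html, "html.parser")
        for p in soup.find_all("p"):
            yield f"{breadcrumb_for(p, soup)} > {p.get_text(strip=True)}"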
by mark_l_watson on 9/20/24, 12:07 PM
This example is well written and documented, easy to understand. Well done.
by thelastparadise on 9/20/24, 11:09 AM
What exactly is a "failure rate" and how is it computed?
by vendiddy on 9/20/24, 9:24 AM
by justanotheratom on 9/20/24, 4:23 PM
"Chunk boundaries: Consider how you split your documents into chunks. The choice of chunk size, chunk boundary, and chunk overlap can affect retrieval performance1."
by regularfry on 9/20/24, 11:10 AM
by timwaagh on 9/20/24, 7:48 AM