from Hacker News

Can we RAG the whole web?

by jeanloolz on 4/29/24, 12:43 PM with 21 comments

by manca on 4/30/24, 6:36 PM
This is exactly what https://www.perplexity.ai/ is trying to do. Maybe not "RAGing" the entire internet, but certainly mapping natural-language queries onto their own (probably) vector database, which contains the "source of truth" from the internet.
How they build that database, and which models they use for text tokenization, embedding generation and ranking at "internet" scale, is the secret sauce that enabled them to raise more than $165M to date.
For sure this is where internet search will be in a couple of years, and that's why Google got really concerned when the original ChatGPT was released. That said, don't assume Google is not already working on something similar. In fact, the main theme of their Google Cloud Next conference was LLMs and RAG.
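To make the "query to vector database" mapping above concrete, here is a minimal sketch. It assumes a toy bag-of-words count as the embedding and an in-memory list as the "database"; the names embed, cosine and rank are hypothetical, and nothing here reflects how Perplexity or any real engine actually does it.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


# Made-up documents standing in for an indexed corpus.
docs = [
    "sqlite is an embedded database engine",
    "retrieval augmented generation grounds answers in sources",
    "bread recipes with sourdough starter",
]
top = rank("how does retrieval augmented generation work", docs, k=1)
```

A real system would presumably swap embed for a learned model and the list for an approximate nearest-neighbor index, but the ranking step has the same shape.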
by aleksiy123 on 4/30/24, 6:00 PM
Is connecting a search engine to an LLM not technically a RAG for the whole web?
by mehulashah on 4/30/24, 6:11 PM
Cool idea. This is a decentralized RAG approach, useful for individual sites, e.g. those built on WordPress. How do you find the site that you want to "RAG" on, though? Individual domains can be vast, e.g. Google itself.
by troupo on 4/30/24, 5:58 PM
Well, there's nothing new under the sun. Whatever cooperation model you may have come up with, it has been invented again, and again, and again.
Before you invent a new protocol, look at Semantic Web (RDF et al), and Google Microformats, and...
by rthnbgrredf on 4/30/24, 5:59 PM
I think we need a search engine that has an API. Doesn't Kagi have an API?
by simonw on 4/30/24, 5:57 PM
FIYDRI^: The core idea discussed in this post is less about RAG and more about sharing web content in packages that are easier for crawlers to access - including an experiment that uses downloadable SQLite databases for that.
^ For If You Didn't Read It
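As a rough illustration of the "downloadable SQLite database" idea summarized above, here is a sketch of what a crawler might do once it has such a file. The pages table, its columns, and the find_pages helper are all assumptions made for this example, not the schema used in the post; an in-memory database stands in for the downloaded file.

```python
import sqlite3

# Hypothetical schema: one row per page the site chooses to publish.
conn = sqlite3.connect(":memory:")  # a real crawler would open the downloaded file
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("https://example.com/a", "About", "We write about search and retrieval."),
        ("https://example.com/b", "Recipes", "Sourdough bread, step by step."),
    ],
)


def find_pages(db: sqlite3.Connection, term: str) -> list[str]:
    """Return URLs of pages whose body contains the term (simple substring match)."""
    rows = db.execute(
        "SELECT url FROM pages WHERE body LIKE ? ORDER BY url", (f"%{term}%",)
    )
    return [url for (url,) in rows]


hits = find_pages(conn, "retrieval")
```

The appeal is that the crawler reads structured rows with one query instead of fetching and parsing every HTML page.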
by transitivebs on 4/30/24, 6:04 PM
this is exa's mission: https://exa.ai
by leblancfg on 4/30/24, 6:32 PM
I've been using Kagi's "Quick answer" more and more these days, which I guess is a form of "index the whole web" RAG.
Here's their blog article for it: https://help.kagi.com/kagi/ai/quick-answer.html
You have to fire up your bullshit detector when looking at the results, but I find it saves a good 3-4 clicks on average.
by bagels on 4/30/24, 6:12 PM
"RAG, or Retrieval-Augmented Generation, is a method where a language model such as ChatGPT first searches for useful information in a large database and then uses this information to improve its responses."
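The two steps in that definition (search first, then use what was found) can be sketched roughly as follows. This assumes a word-overlap scorer as the "search" and stops at building the prompt; retrieve, build_prompt and the corpus are hypothetical, and the call to an actual language model is left out.

```python
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Score each passage by how many query words it contains; return the top k ids."""
    words = set(query.lower().split())
    scores = {
        doc_id: len(words & set(text.lower().split()))
        for doc_id, text in corpus.items()
    }
    ranked = sorted(scores, key=lambda d: scores[d], reverse=True)
    return [d for d in ranked if scores[d] > 0][:k]


def build_prompt(query: str, corpus: dict[str, str], k: int = 2) -> str:
    """Prepend the retrieved passages to the question before it reaches the model."""
    context = "\n".join(f"[{d}] {corpus[d]}" for d in retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."


# Made-up passages standing in for the "large database".
corpus = {
    "doc1": "a quick answer feature summarizes search results",
    "doc2": "sqlite databases can be downloaded as single files",
}
prompt = build_prompt("what is the quick answer feature", corpus)
```

The resulting prompt string is what would be sent to the model, so the model answers from the retrieved text rather than from its training data alone.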
by mooktakim on 4/29/24, 9:53 PM
Aren't the LLMs already trained on the whole web? No need to RAG, in theory.