from Hacker News

Ask HN: Why isn’t there an open source Google with pick your algorithm?

by yogi123 on 4/25/22, 10:54 PM with 7 comments

by estebarb on 4/26/22, 12:09 AM
There is a lot of public literature on how to build a search engine: it is not that secret. You must crawl millions of websites, then process them, index, deduplicate, remove spam... And then you must create a frontend client that queries the indexes and returns the most relevant web pages very fast. It is "just" that.
You can build a toy search engine easily... in fact, it is a popular project in Information Retrieval courses in universities around the world. But scaling that toy to something really web scale requires vast amounts of compute resources, money, and time to debug.
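A minimal sketch of the kind of toy engine meant here, assuming an in-memory inverted index with a simple TF-IDF score. The names `ToyIndex` and `tokenize` are made up for illustration; crawling, deduplication and spam removal are left out entirely.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """In-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        # Record how often each term occurs in this document.
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def search(self, query, k=10):
        # Sum tf * idf over the query terms for every matching document.
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return sorted(scores, key=scores.get, reverse=True)[:k]


index = ToyIndex()
index.add("a", "open source search engine")
index.add("b", "search engine crawler crawler")
index.add("c", "cooking recipes")
print(index.search("crawler"))  # ['b']
```

Everything here fits in one process and one dictionary, which is exactly the part that stops being true at web scale.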
Also, swapping an "algorithm" is not easy: it requires changing the indexes (postings files vs fast nearest-neighbour queries for embeddings? in memory? on disk for long tail queries?), compute infrastructure (single node? MapReduce? Graph processing like Pregel? something Deep Learning? are we building a knowledge graph?), and which languages it will support (not all languages have the same resources).
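To make the index point concrete, here is a rough sketch of the two index shapes mentioned: a postings index answered by set intersection, and a vector index answered by nearest-neighbour search. `PostingsIndex`, `VectorIndex` and `toy_embed` are hypothetical names, and `toy_embed` is only a hashed character-trigram stand-in for a learned embedding, not a real model.

```python
import math
import zlib
from collections import defaultdict


class PostingsIndex:
    """Term -> set of doc ids; a query is an intersection of postings."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        sets = [self.postings.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()


def toy_embed(text, dims=64):
    """Assumed stand-in for an embedding: hashed character trigrams."""
    vec = [0.0] * dims
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode()) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class VectorIndex:
    """Doc id -> unit vector; a query is a brute-force cosine ranking."""

    def __init__(self):
        self.vectors = {}

    def add(self, doc_id, text):
        self.vectors[doc_id] = toy_embed(text)

    def search(self, query, k=3):
        q = toy_embed(query)

        def cosine(doc_id):
            return sum(a * b for a, b in zip(q, self.vectors[doc_id]))

        return sorted(self.vectors, key=cosine, reverse=True)[:k]
```

The two classes store different things and return different things (an unordered set of exact matches vs a ranked list of approximate ones), so an algorithm written for one cannot simply be pointed at the other.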
But, there are open source components that could be leveraged to build a search engine: Apache Nutch + Apache Hadoop + ElasticSearch + TensorFlow + ...
by geoah on 4/25/22, 11:11 PM
Absolute stab in the dark:
You’ll need to store a good chunk of the web in order to allow for retraining/reindexing when algorithms get added or updated. That’s expensive as disk space is not cheap, and bandwidth is even less cheap.
You then need to be constantly processing all that content through multiple algorithms, and storing their resulting indexes in relatively fast storage so it can be retrieved. That’s a lot of processing and even more storage.
Even if this all works, your algorithms need to be performant in order to be usable. That means time and expertise.
Finally you need to figure out who actually cares enough to pay for this thing. Who pays for my crappy algorithm that is just wasting cpu and disk that no one is using?
by rektide on 4/25/22, 11:46 PM
The idea of there being decided algorithms is a lark, is fodder for pop culture. The truth is far more complex yet culture is too stupid and slow and incompetent to understand. See also twitter/the-algorithm.
But also these fools sell their wares as magic fodder capable of performing great spells on humans & have no real information & culture has been woefully misled by this idea that all this superb algorithm has been distilled out & has such great vast & lofty powers.
The hype has sold itself & the counterhype like this askhn is ballardian hyperreal nonsense. We all know & discuss a thing which in fact has not the remotest facts of existence.
by gregjor on 4/26/22, 12:21 AM
I suspect that if we could see what Google spends on hosting and other infrastructure we'd have a reasonable answer. Open source is not the same thing as "free to operate."
Algorithms that choose results and assign them relevance/priority have to work on the indexed data, so there's more to it than just swapping algorithms.
by readonthegoapp on 4/28/22, 6:13 AM
i'd love to know about google competitors.
it's just that any new one i try is either already owned by google, is using someone else's results, is doing stupid stuff like planting trees meant to distract us from real solutions to global warming, etc.
i thought some smart people would get together, get some funding, spin up a search engine in a couple of days using the cloud, and see if there was something there.
wonder what keeps that from happening.
privacy maybe?
by yogi123 on 4/25/22, 10:56 PM
For example, with a marketplace of search algorithms for different use cases that people can submit and which could be rated or ranked like browser extensions.
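A rough sketch of what such a marketplace could look like, assuming every submitted algorithm is just a function that reorders a shared list of candidate documents. `Marketplace`, `Listing` and the two example rankers are invented for illustration; the sketch takes the shared candidate set for granted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumed contract: a ranker takes a query and candidates, returns them reordered.
Ranker = Callable[[str, List[str]], List[str]]


@dataclass
class Listing:
    ranker: Ranker
    ratings: List[int] = field(default_factory=list)

    @property
    def average_rating(self) -> float:
        return sum(self.ratings) / len(self.ratings) if self.ratings else 0.0


class Marketplace:
    """Registry of user-submitted rankers with star ratings."""

    def __init__(self):
        self.listings: Dict[str, Listing] = {}

    def submit(self, name: str, ranker: Ranker) -> None:
        self.listings[name] = Listing(ranker)

    def rate(self, name: str, stars: int) -> None:
        self.listings[name].ratings.append(stars)

    def top(self) -> List[str]:
        # Best-rated rankers first, like a sorted extension store.
        return sorted(
            self.listings,
            key=lambda n: self.listings[n].average_rating,
            reverse=True,
        )

    def search(self, name: str, query: str, candidates: List[str]) -> List[str]:
        return self.listings[name].ranker(query, candidates)


def by_term_overlap(query, candidates):
    """Rank by how many query terms each candidate contains."""
    terms = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda d: len(terms & set(d.lower().split())),
        reverse=True,
    )


def shortest_first(query, candidates):
    """Deliberately naive ranker: shortest document wins."""
    return sorted(candidates, key=len)
```

Usage would be along the lines of `market.submit("overlap", by_term_overlap)` followed by `market.search("overlap", query, docs)`. Note that this only swaps the final ordering step; it says nothing about who crawls, stores and indexes the candidates.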
by Vladimof on 4/25/22, 11:07 PM
It would end up being the same as today... make your own algo would be nice though (they could terminate processes that aren't efficient enough)