by yogi123 on 4/25/22, 10:54 PM with 7 comments
by estebarb on 4/26/22, 12:09 AM
You can build a toy search engine easily... in fact, it is a popular project in Information Retrieval courses at universities around the world. But scaling that toy to something truly web scale requires vast amounts of compute resources, money, and time to debug.
Also, swapping an "algorithm" is not easy: it requires changing the indexes (postings files vs fast nearest-neighbour queries for embeddings? in memory? on disk for long-tail queries?), the compute infrastructure (single node? MapReduce? graph processing like Pregel? something deep-learning based? are we building a knowledge graph?), and which languages it will support (not all languages have the same resources).
But, there are open source components that could be leveraged to build a search engine: Apache Nutch + Apache Hadoop + ElasticSearch + TensorFlow + ...
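To give an idea of what the "toy" version looks like, here is a minimal sketch of an in-memory inverted index (a postings map) with a simple TF-IDF style score. The function names, the sample documents and the exact weighting formula are illustrative assumptions, not anything a production engine or the projects named above necessarily use.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build postings: term -> {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Score documents by summing tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Assumed smoothing: rarer terms get a larger weight.
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by doc id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical three-document "web".
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick dog",
}
index = build_index(docs)
results = search(index, len(docs), "quick dog")
```

Everything here lives in one Python dict on one machine. The hard part described above is that none of these choices (postings layout, scoring, tokenization per language) survive unchanged at web scale.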
by geoah on 4/25/22, 11:11 PM
You’ll need to store a good chunk of the web in order to allow for retraining/reindexing when algorithms get added or updated. That’s expensive as disk space is not cheap, and bandwidth is even less cheap.
You then need to be constantly processing all that content through multiple algorithms, and storing their resulting indexes in relatively fast storage so they can be retrieved. That’s a lot of processing and even more storage.
Even if this all works, your algorithms need to be performant in order to be usable. That means time and expertise.
Finally you need to figure out who actually cares enough to pay for this thing. Who pays for my crappy algorithm that is just wasting cpu and disk that no one is using?
by rektide on 4/25/22, 11:46 PM
But also these fools sell their wares as magic fodder capable of performing great spells on humans & have no real information & culture has been woefully misled by this idea that all this superb algorithm has been distilled out & has such great vast & lofty powers.
The hype has sold itself & the counterhype like this Ask HN is Ballardian hyperreal nonsense. We all know & discuss a thing which in fact has not the remotest facts of existence.
by gregjor on 4/26/22, 12:21 AM
Algorithms that choose results and assign them relevance/priority have to work on the indexed data, so there's more to it than just swapping algorithms.
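A small sketch of that dependency, under made-up assumptions (two hypothetical documents, two hypothetical rankers): the second ranker only works because the indexing step also recorded document lengths. If the index had not stored them, "swapping in" that ranker would mean reindexing everything first.

```python
from collections import Counter

# Hypothetical documents, purely for illustration.
docs = {
    "a": "web search",
    "b": "web web page about many other unrelated things here",
}

# Indexing step: postings with term frequencies, plus per-document length.
postings = {}
doc_len = {}
for doc_id, text in docs.items():
    terms = text.split()
    doc_len[doc_id] = len(terms)
    for term, tf in Counter(terms).items():
        postings.setdefault(term, {})[doc_id] = tf


def rank_by_tf(term):
    """Ranker 1: raw term frequency; needs only the postings."""
    hits = postings.get(term, {})
    return sorted(hits, key=lambda d: (-hits[d], d))


def rank_by_normalized_tf(term):
    """Ranker 2: frequency divided by document length; needs doc_len too."""
    hits = postings.get(term, {})
    return sorted(hits, key=lambda d: (-hits[d] / doc_len[d], d))


by_tf = rank_by_tf("web")
by_norm = rank_by_normalized_tf("web")
```

The two rankers disagree on the order of the same two documents, and only one of them can run on a postings-only index.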
by readonthegoapp on 4/28/22, 6:13 AM
it's just that any new one i try is either already owned by google, is using someone else's results, is doing stupid stuff like planting trees meant to distract us from real solutions to global warming, etc.
i thought some smart people would get together, get some funding, spin up a search engine in a couple of days using the cloud, and see if there was something there.
wonder what keeps that from happening.
privacy maybe?
by yogi123 on 4/25/22, 10:56 PM
by Vladimof on 4/25/22, 11:07 PM