from Hacker News

Ask HN: Building a Better Google

by agencies on 7/9/21, 12:45 AM with 4 comments

If you had the 4 billion pages Google had on its index in 2004 in elasticsearch today, what would it take to replicate 2004-era web search quality using that static index?
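
For concreteness, the zero-effort baseline over that index would just be BM25 matching, something like this (the "pages" index and its "title"/"body" fields are assumed, not anything standard; elasticsearch-py 8.x style):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def baseline_search(query, size=10):
        # Plain multi_match; Elasticsearch scores with BM25 by default,
        # so this is the "no ranking cleverness" starting point.
        resp = es.search(
            index="pages",
            query={"multi_match": {"query": query,
                                   "fields": ["title^2", "body"]}},
            size=size,
        )
        return [hit["_source"] for hit in resp["hits"]["hits"]]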

What relevant research or projects are trying to make these sorts of algorithms and data accessible as a future commodity people can build on top of?

  • by immortal3 on 7/9/21, 11:26 AM

    I have been thinking about this a lot lately. There is a lot more scope for improving current search results.

    I believe a simple two-stage system might be enough to produce a decent one. Stage 1: query expansion and inverted-index retrieval. Stage 2: re-ranking based on a combination of heuristics (PageRank + word embeddings + query analysis).
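
    A rough sketch of that shape (all helpers and weights below are hypothetical placeholders, not a tested system):

        # Stage 1: widen the query, then cheap high-recall retrieval.
        # Stage 2: re-rank the small candidate set with blended heuristics.
        # expand_query, inverted_index_lookup, embedding_similarity and
        # query_term_overlap are hypothetical helpers; weights are invented.
        def search(query):
            terms = expand_query(query)            # synonyms / embedding neighbors
            candidates = inverted_index_lookup(terms, limit=1000)

            def score(doc):
                return (0.5 * doc.pagerank
                        + 0.3 * embedding_similarity(query, doc.text)
                        + 0.2 * query_term_overlap(terms, doc.text))

            return sorted(candidates, key=score, reverse=True)[:10]

    The point is just the shape: a wide, cheap first pass, then a narrow, more expensive scoring pass.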

    Would you like to talk more about this? My email address is in my profile.

  • by mikewarot on 7/9/21, 2:06 AM

    I'd use word2vec to embed all the words into a vector space to try to get more inclusive search results.
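
    Something like this with gensim and pretrained vectors (the file path is a placeholder):

        # Expand a query term with its word2vec nearest neighbors.
        from gensim.models import KeyedVectors

        kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

        def expand(term, topn=5):
            if term not in kv:
                return [term]
            return [term] + [w for w, _ in kv.most_similar(term, topn=topn)]

        # expand("html") might pull in neighbors like "xhtml" or "css",
        # widening recall beyond exact token matches.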

    I'd also try to separate out multiple meanings of phrases if possible... for example "hypertext markup" could mean the HTML language, or it could mean actually marking up hypertext (annotation). I'd let the user have some way to disambiguate the meanings.
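
    One cheap way to do the disambiguation: have the user supply a context word and bias the query vector toward it (same vectors as above; the 0.7/0.3 mix is invented):

        import numpy as np

        def disambiguate(phrase_terms, sense_hint):
            # Average the phrase vectors, then nudge toward the user's hint,
            # e.g. sense_hint="language" vs. sense_hint="annotation".
            vec = np.mean([kv[t] for t in phrase_terms if t in kv], axis=0)
            if sense_hint in kv:
                vec = 0.7 * vec + 0.3 * kv[sense_hint]
            return kv.similar_by_vector(vec, topn=10)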

  • by ipaddr on 7/9/21, 6:25 AM

    In 2004, Google cared about:

    - title and description meta tags
    - PageRank (which was just starting)
    - single-word domains exactly matching search terms
    - anchor text

    The quality of the search algorithm wasn't better; the sites indexed were better.
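
    Those few signals are simple enough to sketch as a toy scorer (the doc fields and weights are invented for illustration):

        def tokenize(text):
            return set(text.lower().split())

        def overlap(query_terms, field_text):
            # Fraction of query terms appearing in the field.
            terms = tokenize(field_text)
            return sum(t in terms for t in query_terms) / max(len(query_terms), 1)

        def score_2004(query, doc):
            q = tokenize(query)
            s = 3.0 * overlap(q, doc.title)
            s += 1.0 * overlap(q, doc.meta_description)
            s += 2.0 * overlap(q, doc.anchor_text)     # concatenated inbound link text
            s += 2.0 if doc.domain_name in q else 0.0  # exact-match domain
            s += 1.5 * doc.pagerank
            return s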

  • by blackcats on 7/10/21, 10:16 AM

    First, I would ignore bad backlinks, as they’re not a good indicator of quality. The better your page, the more likely a competitor will start creating thousands of bad links.
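
    A crude version of that filter, with made-up thresholds and hypothetical fields:

        # Drop backlinks whose source looks like a link farm before
        # counting anything toward a page's score.
        def credible_backlinks(backlinks):
            def looks_like_link_farm(page):
                return page.outlink_count > 500 or page.distinct_domains_linked > 200
            return [b for b in backlinks if not looks_like_link_farm(b.source)]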