from Hacker News

Ask HN: Storing millions and billions of URLs?

by gerenuk on 5/4/18, 8:16 PM with 10 comments

Hello Everyone!

I'm currently using Elasticsearch to store metadata and other raw data, but only at a very small scale: around 500,000 domains.

I have been tasked with scaling it to 20-40 million domains and storing their internal/external links, while building a page rank/domain authority score for each domain we add to our database.
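For context, the kind of score meant here could be computed by power iteration over a domain-level link graph. This is only a minimal sketch: the `pagerank` function, the damping factor of 0.85, the iteration count, and the toy graph are illustrative assumptions, and it holds everything in memory, which a graph of 100M-1B links would probably not allow without a batch or graph-processing system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a small in-memory graph.

    links maps each domain to the set of domains it links to.
    The damping factor and iteration count are assumed defaults.
    """
    # Collect every domain that appears as a source or a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every domain starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        dangling = 0.0
        for node in nodes:
            targets = links.get(node, ())
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Domains with no outgoing links spread their rank evenly.
                dangling += damping * rank[node]
        for node in nodes:
            new_rank[node] += dangling / n
        rank = new_rank
    return rank


# Toy graph with made-up domains: a.com and c.com link to b.com, b.com links back to a.com.
toy_links = {
    "a.com": {"b.com"},
    "b.com": {"a.com"},
    "c.com": {"b.com"},
}
scores = pagerank(toy_links)
```

In this toy graph b.com ends up with the highest score because it has the most inbound links, and c.com the lowest because nothing links to it.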

What do you guys suggest/recommend for storing this data at a very large scale? Since each web page's internal/external links will be stored, the database will grow to over 100M-1B links.

Any kind of feedback/suggestion would be appreciated.

Thanks.

  • by nik736 on 5/5/18, 5:21 PM

    I don't think that any proper database technology will have issues with that amount of data. It all depends on how you use it.
  • by sharemywin on 5/4/18, 8:29 PM

  • by girishso on 5/5/18, 8:07 PM

    I personally have used CouchDB to store tens of millions of documents. If you can find a way to get the data you want using CouchDB views, the number of documents simply doesn’t matter (maybe just the disk usage grows with additional documents/views), and performance is excellent.
  • by drizzle87 on 5/7/18, 2:34 PM

    Elasticsearch should easily be able to handle your scaling needs. Why do you think that it would not? What are your concerns?
  • by jjirsa on 5/6/18, 6:52 AM

    The answer will depend primarily on how you expect to query it.

    Cassandra can do many orders of magnitude more than 1B, but would limit you in your query patterns.

  • by mr__y on 5/5/18, 4:50 PM

    Have you considered sharding the data across multiple independent ES instances? Each of them would then handle an amount of data that does not cause problems.
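
    One way to route documents in a setup like this is to hash the domain name and pick an instance from the result. The sketch below is only an illustration under assumptions: the `SHARD_HOSTS` URLs and both function names are hypothetical, and simple modulo routing means changing the number of instances would remap most domains.

```python
import hashlib

# Hypothetical addresses of independent Elasticsearch instances.
SHARD_HOSTS = [
    "http://es-shard-0:9200",
    "http://es-shard-1:9200",
    "http://es-shard-2:9200",
    "http://es-shard-3:9200",
]


def shard_for_domain(domain: str, num_shards: int) -> int:
    """Map a domain to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), which is
    randomized per process for strings and so is not stable.
    """
    normalized = domain.strip().lower()
    digest = hashlib.sha1(normalized.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def host_for_domain(domain: str, hosts=SHARD_HOSTS) -> str:
    """Return the instance that should hold all data for this domain."""
    return hosts[shard_for_domain(domain, len(hosts))]
```

    Keying on the domain keeps all of a domain's pages and links on one instance, so per-domain queries touch a single node, while queries across domains have to fan out to every instance.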
  • by cimmanom on 5/5/18, 1:05 AM

    We've found Elasticsearch to be quite performant with hundreds of millions of documents. What are your concerns with scaling it?
  • by dchuk on 5/5/18, 12:18 AM

    Building an ahrefs/moz/majestic competitor?