from Hacker News

Indexing a billion pages

by daoudc on 12/23/23, 1:51 PM with 35 comments

  • by xnx on 12/23/23, 5:08 PM

    How does the homepage of https://mwmbl.org/ not have a single sentence explaining what it is or even an "About" link?

    From GitHub: "Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on useability and speed."

  • by jetrink on 12/23/23, 7:58 PM

    > We’ve indexed over 100 million pages

    > [W]e’re crawling up to a million pages a day, as you can see on our stats page.

    > Given that Mwmbl is still relatively unknown, it seems plausible that we can reach our target of crawling three billion pages a day, to refresh the entire index in one month.

    I think this is supposed to read "it seems plausible that we can reach our target of crawling three million pages a day."
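
    A rough check of that correction, assuming the goal is to re-crawl the roughly 100 million pages indexed so far once every 30 days. The function name and the 30-day month are illustrative assumptions, not something from the post:

```python
def pages_per_day(index_size: int, refresh_days: int) -> int:
    """Daily crawl rate needed to re-crawl `index_size` pages every `refresh_days` days."""
    # Integer ceiling division, so the rate is never rounded below what is needed.
    return -(-index_size // refresh_days)


# About 100 million pages (the figure quoted above), refreshed monthly.
current_index = pages_per_day(100_000_000, 30)

# The billion-page index from the title, refreshed monthly.
billion_index = pages_per_day(1_000_000_000, 30)

print(f"{current_index:,} pages/day for 100 million pages")
print(f"{billion_index:,} pages/day for one billion pages")
```

    Under those assumptions the current index needs a bit over three million pages a day, which fits the "three million" reading; a billion-page index would need about ten times that.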

  • by bdcravens on 12/23/23, 7:28 PM

    Most impressive part:

    > Our estimated annual budget is $752.36 and we have spent $174.49.

  • by mdaniel on 12/23/23, 6:06 PM

    I thought I recalled seeing this before because of its Welsh name. As is often the case, some of the earlier submissions link to their domain and some to the GitHub repo; the ones with over 100 comments are:

    https://news.ycombinator.com/item?id=37561155

    https://news.ycombinator.com/item?id=29690877

  • by marginalia_nu on 12/23/23, 7:17 PM

    I'll race you there ;-)

  • by Alifatisk on 12/24/23, 2:14 PM

    I remember reading about a project whose sole purpose is to provide a large index of the open web for free; anyone could download it. I forgot the name of the project.

    Why can’t mwmbl download their index?

    Also, is mwmbl planning on providing their crawled index for free? Like, can I also download it later?

    If that is the case, I’d happily download their FF extension.

  • by mdaniel on 12/23/23, 6:01 PM

    > The biggest expense was purchasing a PyCharm professional license at $116.58

    I mean, it's awesome that they value good tooling enough to spend on it, but https://www.jetbrains.com/community/opensource/ almost certainly means they qualify for a complimentary license.

  • by Alifatisk on 12/25/23, 6:53 PM

    How do I identify my hash among the users in the stats at https://mwmbl.org/stats?

  • by Alifatisk on 12/24/23, 5:17 PM

    What's the consequence of installing the crawler in FF? Can the ISP / Cloudflare / any other party start blacklisting you?

  • by hcfman on 12/23/23, 5:02 PM

    Quite curious. What indexing and retrieval software is this using? I couldn’t find a reference to it.

    Does it index phrases?

  • by jmclnx on 12/23/23, 8:19 PM

    Very interesting, and it was quick for me. Nice work!

  • by foreigner on 12/24/23, 10:55 AM

    Do they use Common Crawl?

  • by urbandw311er on 12/25/23, 12:41 AM

    I think the saddest part of this is that, owing to the total enshittification of the web due to SEO, at least 50% of what they index will be absolute garbage.