from Hacker News

What gets to the front page of Hacker News? A data project

by itunpredictable on 6/29/23, 3:36 PM with 9 comments

  • by dredmorbius on 6/29/23, 5:48 PM

    So, oddly enough, I've also been looking at HN front-page characteristics, based on the same corpus (the "past" page links). And that whole section on caveats over what that archive represents is something I could have written... The front page, both in its dynamic and archived forms, is strongly subject to many influences in complex ways.

    A couple of tips:

    - It's possible to crawl the page using wget, given a reasonable delay. The full collection from 2007 to present (I'd done my first crawl in late May of this year) took a couple of days. Updates to that happen in seconds.

    - I break down data by date, story position (e.g., rank 1--30), submitted site (if present), points (votes), comments, and submitter, as well as title.

    - I'm working on classifying titles. The original question prompting my analysis was what US states get the most love from HN (NY, CA, WA*, TX, and CO are the top 5). I've since expanded that to US and globally significant cities, and have been doing some tuple-based n-gram analysis, though that gets pretty hairy.
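    As a rough illustration of the first two tips, here is a minimal Python sketch of a delayed crawl and a per-story record. It is not the wget-based crawl described above. The `/front?day=YYYY-MM-DD` URL pattern, the `Story` field names, and the 30-second default delay are all assumptions chosen for illustration.

```python
import time
import urllib.request
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterator, Optional

# Assumption: archived front pages are served at /front?day=YYYY-MM-DD.
BASE_URL = "https://news.ycombinator.com/front?day={day}"


@dataclass
class Story:
    """One front-page entry; field names are hypothetical."""
    day: date
    rank: int              # story position, e.g. 1-30
    site: Optional[str]    # None when no site was submitted
    points: int
    comments: int
    submitter: str
    title: str


def day_urls(start: date, end: date) -> Iterator[str]:
    """Yield one archive URL per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield BASE_URL.format(day=current.isoformat())
        current += timedelta(days=1)


def crawl(start: date, end: date, delay_seconds: float = 30.0) -> Iterator[bytes]:
    """Fetch each day's page, sleeping between requests (delay is a placeholder)."""
    for url in day_urls(start, end):
        with urllib.request.urlopen(url) as response:
            yield response.read()
        time.sleep(delay_seconds)
```

    Parsing each fetched page into `Story` records is left out, since it depends on the page markup. Keeping the URL generation separate from the fetching makes an incremental update a matter of passing a short date range.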

    For 2022 (most recent complete year), the top 40 submitted front-page sites are:

      2022:  Distinct sites:  6446
    
      Site                            Stories     Points (   mean  )  Comments (   mean  )
      ------------------------------  -------     ------ ----------   -------- ----------
      n/a                                 432     167275 (  386.32 )    125304 (  289.39 )
      youtube.com                         105      27243 (  257.01 )     12489 (  117.82 )
      nature.com                           80      17694 (  218.44 )     11716 (  144.64 )
      wikipedia.org                        68      12258 (  177.65 )      5855 (   84.86 )
      nytimes.com                          67      21190 (  311.62 )     21765 (  320.07 )
      arstechnica.com                      63      18319 (  286.23 )     12057 (  188.39 )
      ieee.org                             53       9432 (  174.67 )      5933 (  109.87 )
      reuters.com                          53      28360 (  525.19 )     29033 (  537.65 )
      theguardian.com                      49      12228 (  244.56 )      8677 (  173.54 )
      quantamagazine.org                   48      11293 (  230.47 )      5519 (  112.63 )
      science.org                          47      12485 (  260.10 )      7655 (  159.48 )
      economist.com                        46      12504 (  266.04 )     17324 (  368.60 )
      bloomberg.com                        43      20037 (  455.39 )     20630 (  468.86 )
      lwn.net                              43      10566 (  240.14 )      5912 (  134.36 )
      theverge.com                         43      16313 (  370.75 )     14335 (  325.80 )
      arxiv.org                            39       7415 (  185.38 )      3559 (   88.97 )
      washingtonpost.com                   39      15778 (  394.45 )     18117 (  452.93 )
      bbc.com                              37      11600 (  305.26 )      8696 (  228.84 )
      newyorker.com                        37       7577 (  199.39 )      6549 (  172.34 )
      wsj.com                              36      10920 (  295.14 )     11646 (  314.76 )
      wired.com                            35       9104 (  252.89 )      6738 (  187.17 )
      archive.org                          32       8011 (  242.76 )      4626 (  140.18 )
      gist.github.com                      32      10287 (  311.73 )      5456 (  165.33 )
      reddit.com                           30      12579 (  405.77 )      8457 (  272.81 )
      theregister.com                      29       8288 (  276.27 )      4586 (  152.87 )
      apple.com                            28      13245 (  456.72 )     12917 (  445.41 )
      github.blog                          26       8398 (  311.04 )      4242 (  157.11 )
      cnbc.com                             23       8568 (  357.00 )     10356 (  431.50 )
      phys.org                             23       4918 (  204.92 )      2380 (   99.17 )
      theatlantic.com                      23       7518 (  313.25 )     10643 (  443.46 )
      axios.com                            22       8903 (  387.09 )      8616 (  374.61 )
      news.mit.edu                         22       6181 (  268.74 )      2887 (  125.52 )
      smithsonianmag.com                   22       4964 (  215.83 )      2988 (  129.91 )
      stanford.edu                         22       8461 (  367.87 )      4720 (  205.22 )
      krebsonsecurity.com                  21       6299 (  286.32 )      3331 (  151.41 )
      microsoft.com                        21       7809 (  354.95 )      4392 (  199.64 )
      atlasobscura.com                     20       2789 (  132.81 )      1637 (   77.95 )
      cnn.com                              19       4704 (  235.20 )      4252 (  212.60 )
      righto.com                           19       2568 (  128.40 )       795 (   39.75 )
      simonwillison.net                    17       4878 (  271.00 )      1553 (   86.28 )
    
    TechCrunch, BTW, lands at #41:

      techcrunch.com                       17       8681 (  482.28 )      8224 (  456.89 )
    
    (The "mean" values are the arithmetic mean of points (votes) and comments by domain.)
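    A per-site summary of this shape could be computed along the following lines. This is a sketch under assumptions, not the script that produced the table: the `(site, points, comments)` row layout and the function name are hypothetical. The sort order (most stories first, ties alphabetical) matches what the table appears to use. The sketch divides totals by the story count, whereas the printed means appear to correspond to dividing by one more than the listed count (e.g. 8681 / 18 ≈ 482.28 for techcrunch.com), so it will not reproduce those figures exactly.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical input row: (site, points, comments) for one front-page story.
Row = Tuple[str, int, int]
# Output row: (site, stories, points, mean points, comments, mean comments).
Summary = Tuple[str, int, int, float, int, float]


def summarize_by_site(rows: Iterable[Row]) -> List[Summary]:
    """Aggregate story count, total points, and total comments per site."""
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0, 0])
    for site, points, comments in rows:
        entry = totals[site]
        entry[0] += 1
        entry[1] += points
        entry[2] += comments
    summary = [
        (site, n, pts, pts / n, com, com / n)
        for site, (n, pts, com) in totals.items()
    ]
    # Most stories first; ties broken alphabetically by site.
    summary.sort(key=lambda row: (-row[1], row[0]))
    return summary
```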

    For 2023, there've only been 10 TechCrunch items (through 21-6-2023), well below trend:

      Ubuntu 22.04 LTS servers and phased apt updates
      Twitterrific has been discontinued
      DuckDB – An in-process SQL OLAP database management system
      Shane Pitman, leader of the warez group Razor 1911: life after prison (2005)
      Nearly 40% of software engineers will only work remotely
      Htmx 1.9.0 has been released
      Geometry Central: library of data structures, algorithms for geometry processing
      Google Authenticator now supports Google Account synchronization
      I Wrote an Activitypub Server in OCaml: Lessons Learnt, Weekends Lost
      In New Paradox, Black Holes Appear to Evade Heat Death
    
    
    I'll note that breaking stories down by site will tend to obscure categories, as frequently-submitted sites (say, NY Times) will crowd out many individual blogs. I could probably do some manual classification based on sites, including, say, all categories of Twitter (currently broken out by user/account), and might look into that.
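    One way such a manual classification might look is a hand-maintained mapping from site to category, sketched below. The category names and the mapping itself are invented for illustration, as is the assumption that per-account entries look like `twitter.com/someuser`.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Hypothetical hand-made mapping; categories are illustrative only.
CATEGORY_BY_SITE: Dict[str, str] = {
    "nytimes.com": "general news",
    "reuters.com": "general news",
    "nature.com": "science",
    "arxiv.org": "science",
    "twitter.com": "social media",
}


def categorize(site: Optional[str]) -> str:
    """Map a submitted site to a category, folding per-account entries together."""
    if site is None:
        return "n/a"
    # Assumption: per-account entries look like "twitter.com/someuser".
    host = site.split("/", 1)[0].lower()
    return CATEGORY_BY_SITE.get(host, "uncategorized")


def stories_per_category(sites: Iterable[Optional[str]]) -> Counter:
    """Count stories per category, one input item per story."""
    return Counter(categorize(site) for site in sites)
```

    Anything not in the mapping lands in "uncategorized", which gives a running measure of how much of the long tail of individual blogs is still unclassified.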

    One of the most surprising facts to jump out to me is how much nytimes.com has fallen since 2019. It had previously been in the top-4 submitted sites pretty consistently, and was the single top site for 2014--2019, but fell to 7th in 2020 and 9th in 2021, recovering to 5th in 2022.

    I've also paired my own analysis with a 2022 study published by Whaly.io based on the HN API and all content submitted: <https://whaly.io/posts/hacker-news-2021-retrospective>

    I've been somewhat live-blogging my analysis on the Fediverse under the #HackerNewsAnalytics hashtag:

    <https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics>

    That includes a number of findings (and testing/debugging notes), including:

    - mentions of Reddit by year,

    - mentions of the FP-500 companies (top-10: Apple, Microsoft, Amazon, Intel, Tesla, Netflix, IBM, Adobe, Oracle, and AT&T, though Google under various terms (Google, Alphabet, YouTube, Android) nearly doubles the top-ranked Apple, and no, adding in iPhone, iPad, MacBook, etc., doesn't help),

    - trends in votes and comments by story position (interesting IMO),

    - overall submission success rate (a hair under 3%),

    - mentions of the FP Top 100 Global Thinkers in titles (reprising an old study of mine of numerous online sites),

    - a look at the Leaders characteristics,

    - what HN cares about being down,

    - and, well, ... things: <https://toot.cat/@dredmorbius/110454128168815763>

    ________________________________

    Notes:

    * "Washington" can of course designate both a city and a state, amongst other things, and it turns out that the string is dominated by references to the Washington Post, much as "New York" is by the New York Times. But the list gives the naive ranking. Adding in "Silicon Valley" and "San Francisco" puts California well on top.
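    The effect described in this note can be sketched as a naive term count over titles, with an optional step that removes publication names first. The term lists and function name below are illustrative assumptions, not the lists used for the ranking above.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Sequence

# Illustrative term lists; a real analysis would need far more complete ones.
STATE_TERMS: Dict[str, Sequence[str]] = {
    "California": ("California", "Silicon Valley", "San Francisco"),
    "New York": ("New York",),
    "Washington": ("Washington",),
}
# Publication names that contain a place name without being about the place.
PUBLICATIONS: Sequence[str] = ("Washington Post", "New York Times")


def count_mentions(titles: Iterable[str], strip_publications: bool = False) -> Counter:
    """Count titles mentioning each state; each title counts once per state."""
    counts: Counter = Counter()
    for title in titles:
        text = title
        if strip_publications:
            for name in PUBLICATIONS:
                text = text.replace(name, " ")
        for state, terms in STATE_TERMS.items():
            # Word-boundary match so a term is not found inside a longer word.
            if any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in terms):
                counts[state] += 1
    return counts
```

    With `strip_publications=False` this gives the naive ranking; setting it to `True` shows how much of a state's count came from newspaper names.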

    Edits: Some in situ updates as I think of things. Sorry!