by itunpredictable on 6/29/23, 3:36 PM with 9 comments
by dredmorbius on 6/29/23, 5:48 PM
A couple of tips:
- It's possible to crawl the page using wget, given a reasonable delay. The full collection from 2007 to present (I'd done my first crawl in late May of this year) took a couple of days. Updates to that happen in seconds.
- I break down data by date, story position (e.g., rank 1--30), submitted site (if present), points (votes), comments, and submitter, as well as title.
- I'm working on classifying titles. The original question prompting my analysis was what US states get the most love from HN (NY, CA, WA*, TX, and CO are the top 5). I've since expanded that to US and globally-significant cities, and have been doing some tuple-based n-gram analysis, though that gets pretty hairy.
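For anyone wanting to try the crawl, here's a minimal Python sketch of the "fetch with a reasonable delay" idea. It is not my actual wget setup: the `front?day=YYYY-MM-DD` URL pattern, the 30-second default delay, the User-Agent string, and all function names are assumptions for illustration.

```python
import time
from datetime import date, timedelta
from typing import Callable, Dict, Iterator
from urllib.request import Request, urlopen

# Assumed URL pattern for a given day's front page (not stated in the comment).
BASE_URL = "https://news.ycombinator.com/front?day={day}"


def front_page_urls(start: date, end: date) -> Iterator[str]:
    """Yield one URL per calendar day from start to end, inclusive."""
    day = start
    while day <= end:
        yield BASE_URL.format(day=day.isoformat())
        day += timedelta(days=1)


def fetch(url: str) -> str:
    """Download one page and return its body as text."""
    req = Request(url, headers={"User-Agent": "hn-archive-sketch/0.1"})
    with urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(
    start: date,
    end: date,
    delay_seconds: float = 30.0,  # assumed value; pick something polite
    fetch_page: Callable[[str], str] = fetch,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each day's page in order, pausing between requests."""
    pages: Dict[str, str] = {}
    for i, url in enumerate(front_page_urls(start, end)):
        if i:  # no pause before the very first request
            sleep(delay_seconds)
        pages[url] = fetch_page(url)
    return pages
```

Passing `fetch_page` and `sleep` in as parameters lets you dry-run the loop without touching the network. An incremental update is just a call to `crawl` with `start` set to the day after the last one already saved.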
For 2022 (most recent complete year), the top 40 submitted front-page sites are:
2022: Distinct sites: 6446
Site Stories Points ( mean ) Comments ( mean )
------------------------------ ------- ------ ---------- -------- ----------
n/a 432 167275 ( 386.32 ) 125304 ( 289.39 )
youtube.com 105 27243 ( 257.01 ) 12489 ( 117.82 )
nature.com 80 17694 ( 218.44 ) 11716 ( 144.64 )
wikipedia.org 68 12258 ( 177.65 ) 5855 ( 84.86 )
nytimes.com 67 21190 ( 311.62 ) 21765 ( 320.07 )
arstechnica.com 63 18319 ( 286.23 ) 12057 ( 188.39 )
ieee.org 53 9432 ( 174.67 ) 5933 ( 109.87 )
reuters.com 53 28360 ( 525.19 ) 29033 ( 537.65 )
theguardian.com 49 12228 ( 244.56 ) 8677 ( 173.54 )
quantamagazine.org 48 11293 ( 230.47 ) 5519 ( 112.63 )
science.org 47 12485 ( 260.10 ) 7655 ( 159.48 )
economist.com 46 12504 ( 266.04 ) 17324 ( 368.60 )
bloomberg.com 43 20037 ( 455.39 ) 20630 ( 468.86 )
lwn.net 43 10566 ( 240.14 ) 5912 ( 134.36 )
theverge.com 43 16313 ( 370.75 ) 14335 ( 325.80 )
arxiv.org 39 7415 ( 185.38 ) 3559 ( 88.97 )
washingtonpost.com 39 15778 ( 394.45 ) 18117 ( 452.93 )
bbc.com 37 11600 ( 305.26 ) 8696 ( 228.84 )
newyorker.com 37 7577 ( 199.39 ) 6549 ( 172.34 )
wsj.com 36 10920 ( 295.14 ) 11646 ( 314.76 )
wired.com 35 9104 ( 252.89 ) 6738 ( 187.17 )
archive.org 32 8011 ( 242.76 ) 4626 ( 140.18 )
gist.github.com 32 10287 ( 311.73 ) 5456 ( 165.33 )
reddit.com 30 12579 ( 405.77 ) 8457 ( 272.81 )
theregister.com 29 8288 ( 276.27 ) 4586 ( 152.87 )
apple.com 28 13245 ( 456.72 ) 12917 ( 445.41 )
github.blog 26 8398 ( 311.04 ) 4242 ( 157.11 )
cnbc.com 23 8568 ( 357.00 ) 10356 ( 431.50 )
phys.org 23 4918 ( 204.92 ) 2380 ( 99.17 )
theatlantic.com 23 7518 ( 313.25 ) 10643 ( 443.46 )
axios.com 22 8903 ( 387.09 ) 8616 ( 374.61 )
news.mit.edu 22 6181 ( 268.74 ) 2887 ( 125.52 )
smithsonianmag.com 22 4964 ( 215.83 ) 2988 ( 129.91 )
stanford.edu 22 8461 ( 367.87 ) 4720 ( 205.22 )
krebsonsecurity.com 21 6299 ( 286.32 ) 3331 ( 151.41 )
microsoft.com 21 7809 ( 354.95 ) 4392 ( 199.64 )
atlasobscura.com 20 2789 ( 132.81 ) 1637 ( 77.95 )
cnn.com 19 4704 ( 235.20 ) 4252 ( 212.60 )
righto.com 19 2568 ( 128.40 ) 795 ( 39.75 )
simonwillison.net 17 4878 ( 271.00 ) 1553 ( 86.28 )
TechCrunch, BTW, lands at #41:
techcrunch.com 17 8681 ( 482.28 ) 8224 ( 456.89 )
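A per-site rollup like the table above can be sketched in a few lines of Python. This is an illustration, not the script that produced the table, and it isn't guaranteed to reproduce those exact figures. The `Story` and `SiteStats` types and the `summarize` function are hypothetical names, and the ordering (story count descending, then site name) is an assumption based on how the rows appear.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Story:
    site: str       # submitted domain, or "n/a" when there is no external link
    points: int     # votes
    comments: int


@dataclass
class SiteStats:
    stories: int = 0
    points: int = 0
    comments: int = 0

    @property
    def mean_points(self) -> float:
        return self.points / self.stories if self.stories else 0.0

    @property
    def mean_comments(self) -> float:
        return self.comments / self.stories if self.stories else 0.0


def summarize(stories: Iterable[Story]) -> List[Tuple[str, SiteStats]]:
    """Total stories, points, and comments per site, most-submitted first."""
    stats: Dict[str, SiteStats] = defaultdict(SiteStats)
    for story in stories:
        entry = stats[story.site]
        entry.stories += 1
        entry.points += story.points
        entry.comments += story.comments
    # Sort by story count (descending), breaking ties by site name.
    return sorted(stats.items(), key=lambda kv: (-kv[1].stories, kv[0]))


def format_row(site: str, s: SiteStats) -> str:
    """Render one row in roughly the layout used in the table."""
    return (f"{site:<30} {s.stories:>7} {s.points:>6} ( {s.mean_points:>7.2f} ) "
            f"{s.comments:>8} ( {s.mean_comments:>7.2f} )")
```

Calling `summarize(stories)[:40]` gives the top 40 sites, and `len(summarize(stories))` gives the distinct-site count.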
(The "mean" values are the arithmetic mean of points (votes) and comments by domain.)
For 2023, there've only been 10 TechCrunch items (through 21-6-2023), well below trend:
Ubuntu 22.04 LTS servers and phased apt updates
Twitterrific has been discontinued
DuckDB – An in-process SQL OLAP database management system
Shane Pitman, leader of the warez group Razor 1911: life after prison (2005)
Nearly 40% of software engineers will only work remotely
Htmx 1.9.0 has been released
Geometry Central: library of data structures, algorithms for geometry processing
Google Authenticator now supports Google Account synchronization
I Wrote an Activitypub Server in OCaml: Lessons Learnt, Weekends Lost
In New Paradox, Black Holes Appear to Evade Heat Death
I'll note that breaking stories down by site will tend to obscure categories, as frequently-submitted sites (say, NY Times) will crowd out many individual blogs. I could probably do some manual classification based on sites, including, say, all categories of Twitter (currently broken out by user/account), and might look into that.
One of the most surprising facts to jump out at me is how much nytimes.com has fallen since 2019. It had previously been in the top 4 submitted sites pretty consistently, and the single top site for 2014--2019, but fell to 7th in 2020 and 9th in 2021, recovering to 5th in 2022.
I've also paired my own analysis with a 2022 study published by Whaly.io based on the HN API and all content submitted: <https://whaly.io/posts/hacker-news-2021-retrospective>
I've been somewhat live-blogging my analysis on the Fediverse under the #HackerNewsAnalytics hashtag:
<https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics>
That includes a number of findings (and testing/debugging notes), including: mentions of Reddit by year, mentions of the FP-500 companies (top-10: Apple, Microsoft, Amazon, Intel, Tesla, Netflix, IBM, Adobe, Oracle, and AT&T, though Google under various terms (Google, Alphabet, YouTube, Android) nearly doubles the top-ranked Apple, and no, adding in iPhone, iPad, MacBook, etc., doesn't help), trends in votes and comments by story position (interesting IMO), overall submission success rate (a hair under 3%), mentions of the FP Top 100 Global Thinkers in titles (reprising an old study of mine of numerous online sites), a look at the Leaders characteristics, what HN cares about being down, and, well, ... things: <https://toot.cat/@dredmorbius/110454128168815763>
________________________________
Notes:
* "Washington" can of course designate both a city and a state, amongst other things, and it turns out that the string is dominated by references to the Washington Post, much as "New York" is by the New York Times. But the list gives the naive ranking. Adding in "Silicon Valley" and "San Francisco" put California well on top.
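The naive ranking, and the "Washington Post" problem, can be illustrated with a short Python sketch. It is not my actual classification code; `count_mentions` and its `exclude` parameter are hypothetical, and matching is a plain case-sensitive whole-word search.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional, Sequence


def count_mentions(
    titles: Iterable[str],
    terms: Sequence[str],
    exclude: Optional[Dict[str, Sequence[str]]] = None,
) -> Counter:
    """Count titles mentioning each term at least once.

    `exclude` maps a term to phrases that should NOT count as a mention,
    e.g. {"Washington": ["Washington Post"]}.
    """
    exclude = exclude or {}
    counts: Counter = Counter()
    for title in titles:
        for term in terms:
            text = title
            # Blank out the excluded phrases before searching for the term.
            for phrase in exclude.get(term, ()):
                text = text.replace(phrase, " ")
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                counts[term] += 1
    return counts
```

Calling it without `exclude` gives the naive count. Passing something like `{"Washington": ["Washington Post"], "New York": ["New York Times"]}` strips out the newspaper references first. It still can't tell Washington state from Washington, DC.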
Edits: Some in situ updates as I think of things. Sorry!