from Hacker News

View Counting at Reddit

by strzalek on 5/27/17, 6:13 PM with 112 comments

  • by haburka on 5/27/17, 8:09 PM

    I love the article on hyperloglog! It is really quite good to read even if you're not interested in algorithms. I always liked number theory and I think that it's very interesting that you can guess how many uniques there are by counting how long your longest run of zeroes in a hash is.

    I suppose this could be broken by injecting a unique visitor id that hashes to something with an absurd number of zeroes? That's assuming that the user has control over their user id and that I'm understanding the algorithm correctly.
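
    To make the "longest run of zeroes" idea concrete, here is a minimal toy sketch of a HyperLogLog-style counter. It is not Redis's implementation: the class name, the choice of SHA-1 as the hash, and the default of 2**12 registers are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Toy HyperLogLog: 2**p registers, each keeping the longest zero run seen."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)             # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # rank = position of the first 1-bit in `rest` (leading zeros + 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLogSketch()
    for i in range(20000):
        hll.add(f"user-{i}")
    print(round(hll.estimate()))  # close to 20000, not exact
```

    In this sketch a single crafted id only raises one of the 2**p registers, and the estimate combines all registers through a harmonic mean, so one outlier should shift the total far less than it would with a single global "longest run" counter.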

  • by nyar on 5/28/17, 1:04 AM

    "We want to better communicate the scale of Reddit to our users."

    If that's true, why did they hide vote numbers on comments and posts? It used to say "xxx upvotes xxx downvotes"; now it just shows a single number and hides the breakdown.

  • by mxmxm on 5/28/17, 2:09 PM

    Counting views/impressions in combination with Apache Kafka sounds like the ideal use case for a stream processor like Apache Flink. It supports very large state which can be managed off-heap. This should enable you to count the exact number of unique views in real time with exactly-once semantics. Here is a blog post on large-scale counting with more details. It also includes a comparison with other streaming technologies like Samza and Spark: https://data-artisans.com/blog/counting-in-streams-a-hierarc...

    Also check out this blog post by a Twitter engineer on counting ad impressions: https://data-artisans.com/blog/extending-the-yahoo-streaming...
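
    As a rough illustration of what "exact" unique counting means here, the sketch below keeps one set of viewers per post. It is plain Python rather than Flink's API; the function name and the (post_id, user_id) event shape are assumptions, and the dictionary only stands in for the per-key state a stream processor would manage.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def exact_unique_views(events: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct viewers per post from (post_id, user_id) events."""
    seen: Dict[str, Set[str]] = defaultdict(set)  # stand-in for per-key state
    for post_id, user_id in events:
        seen[post_id].add(user_id)  # a repeated (post, user) pair changes nothing
    return {post_id: len(users) for post_id, users in seen.items()}


if __name__ == "__main__":
    sample = [("p1", "a"), ("p1", "a"), ("p1", "b"), ("p2", "a")]
    print(exact_unique_views(sample))  # {'p1': 2, 'p2': 1}
```

    The state here grows with the number of distinct viewers per post, which is the memory cost an approximate counter such as HyperLogLog trades away.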

  • by noamhacker on 5/27/17, 9:00 PM

    How do you test a system like this for accuracy? Is this done by simulating millions of unique requests?

  • by alzaeem on 5/27/17, 10:16 PM

    So how do they determine whether a user has viewed a post already? I would think that unique counting is accomplished using the hyperloglog counter, but the article says that this decision is made by the Nazar system, which doesn't use the hyperloglog counter in Redis.

  • by stoicking on 5/27/17, 8:39 PM

    Given how much simpler it is to count total views than unique user views, why is it more valuable to count unique user views?

  • by tudorconstantin on 5/27/17, 8:23 PM

    Wouldn't it have been easier to simply increment a counter for each visit and then set a short-lived cookie in the browser for that post? And put the spam detection system before the counter increment.

  • by tsukaisute on 5/27/17, 8:02 PM

    A weird thing I have been seeing on Reddit is comment upvotes being off by one periodically on page refreshes. Reload, you get 3. Reload again, you get 4. Again, you get 3. Seems like a replication issue?

  • by theomega on 5/28/17, 11:28 AM

    Very interesting article, thanks for publishing.

    I have two related questions: 1. I assume the process which reads from Cassandra and puts it back to Redis is parallelized, if not distributed. How do you ensure correctness? Implementing 2PC seems like extreme overhead. Or do you lock in Redis? 2. What database is used to actually store the view counts? Cassandra's counters are afaik not very reliable...

  • by ronalbarbaren on 5/28/17, 1:25 PM

    Thanks Reddit guys. I hope a YouTube engineer will post a similar article. Still curious how YouTube counts.

  • by hellbanner on 5/27/17, 8:03 PM

    Slightly OT, but I wish reddit would use traditional forum-style replies to push threads up, instead of the positive feedback loop of votes, where opinions that agree with the majority get upvotes, which bring views, which bring proportionally more upvotes.

  • by federicoponzi on 5/28/17, 5:31 AM

    Probably noob question, but:

    >> Nazar will then alter the event, adding a Boolean flag indicating whether or not it should be counted, before sending the event back to Kafka.

    Why don't they just discard it instead of putting the event back into Kafka?
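
    For reference, the quoted step (flag the event rather than drop it) can be sketched as below. This is not Nazar's code: the `should_count` field name, the `annotate` function, and the validity rule in the usage example are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, object]


def annotate(events: Iterable[Event], is_valid: Callable[[Event], bool]) -> Iterator[Event]:
    """Yield every event, tagged with a hypothetical `should_count` flag."""
    for event in events:
        # Copy the event and add the flag; nothing is removed from the stream.
        yield {**event, "should_count": is_valid(event)}


if __name__ == "__main__":
    stream = [{"post": "p1", "user": "alice"}, {"post": "p1", "user": "bot"}]
    tagged = list(annotate(stream, lambda e: e["user"] != "bot"))
    counted = [e for e in tagged if e["should_count"]]
    print(len(tagged), len(counted))  # 2 1
```

    With this shape, a counting consumer filters on the flag while the rejected events stay available to any other consumer of the stream. Whether that is Reddit's actual reason is not stated in this thread.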

  • by golergka on 5/28/17, 7:18 AM

    A beautiful example of how a feature that seems so easy to an end user can be complex at scale.

  • by fiatjaf on 5/27/17, 7:58 PM

    At https://trackingco.de/ we store events in Redis and compile them daily into a reduced string format, storing these in CouchDB.

  • by ugh123 on 5/27/17, 9:05 PM

    Forgive my ignorance, but isn't this what Google Analytics is for?

  • by qrbLPHiKpiux on 5/27/17, 8:16 PM

    Not applied to /r/the_donald however.