from Hacker News

How Rust Lets Us Monitor 30k API calls/min

by cfabianski on 6/16/20, 3:51 PM with 76 comments

  • by meritt on 6/16/20, 5:20 PM

    Sorry, I must be missing something in this blog post because the requirements here sound incredibly minimal. You just needed an HTTP service (sitting behind an Envoy proxy) to process a mere 500 requests/second (up to 1MB payload) and pipe them to Kinesis? How much data preparation is happening in Rust? It sounds like all the permission/rate-limiting/etc happens between Envoy/Redis before it ever reaches Rust?

    I know this comes across as snarky, but it really worries me that contemporary engineers think this is a feat worthy of a blog post. For example, take this book from 2003 [1] talking about Apache + mod_perl. Page 325 [2] shows a benchmark: "As you can see, the server was able to respond on average to 856 requests per second... and 10 milliseconds to process each request".

    And just to show this isn't a NodeJS vs Rust thing, check out these webframework benchmarks using various JS frameworks [3]. The worst performer on there still does >500 rps while the best does 500,000.

    It's 2020, the bar needs to be much higher.

    [1] https://www.amazon.com/Practical-mod_perl-Stas-Bekman/dp/059...

    [2] https://books.google.com/books?id=i3Ww_7a2Ff4C&pg=PT356&lpg=...

    [3] https://www.techempower.com/benchmarks/#section=data-r19&hw=...

  • by akoutmos on 6/16/20, 5:54 PM

    Great article and thanks for sharing! There are a couple of things that stand out to me as possible architecture smells (hopefully this comes across as positive, constructive criticism :)).

    As someone who has been developing on the BEAM for a long time now, it sticks out like a sore thumb any time I see Elixir/Erlang paired with Redis. Not that there is anything wrong with Redis, but most of the time you can save yourself the additional Ops dependency and application network hop by bringing that state into your application (BEAM languages excel at writing stateful applications).

    In the article you write that you were using Redis for rate-limit checks. You could very easily have bundled that validation into the Elixir application and had, for example, a single GenServer running per customer that performs the rate-limiting validation (I actually wrote a blog post on this using the leaky-bucket and token-bucket algorithms: https://akoutmos.com/post/rate-limiting-with-genservers/). Paired with hot code deployments, you would not lose rate-limit values across application deployments.

    I would be curious to see how much more mileage you could have gotten with that given that the Node application would not have to make network calls to the Elixir service and Redis.

    Just wanted to share that little tidbit as it is something that I see quite often with people new to the BEAM :). Thanks again for sharing!

  • by didroe on 6/16/20, 5:03 PM

    I'm one of the engineers who worked on this. It was the first production Rust code I've written, so it was a really fun project.

  • by foxknox on 6/16/20, 5:44 PM

    500 requests a second.

  • by cybervasi on 6/16/20, 6:50 PM

    GC of 500 requests/s could not possibly have caused a performance issue. Most likely the problem was due to JS code holding on to the 1 MB requests for the duration of the asynchronous Kinesis request, or a bug in the Kinesis JS library itself. With a timeout of 2 minutes, you may end up with up to 30k/min × 2 min × 1 MB = 60 GB of RAM used. GC would appear to run hot during this time, but only because it has to scrape together more memory somewhere while up to 60 GB is in use.

  • by eggsnbacon1 on 6/16/20, 5:17 PM

    They didn't mention Java as a possible solution, even though its GCs are far better than anything else out there. I have nothing against Rust, but if I were at a startup I would save my innovation points for where they're mandatory.

  • by DevKoala on 6/16/20, 6:39 PM

    There are a couple of things I see in this post that I wouldn't do at all, and I maintain a couple of services with orders of magnitude higher QPS. I feel that replacing Node.js with any compiled language would have had the same positive effect.

  • by newobj on 6/16/20, 6:40 PM

    500 qps. I think the more interesting story here is which languages/frameworks COULDN'T do this, rather than which one could.

  • by trimbo on 6/16/20, 5:59 PM

    > After some more research, we appeared to be another victim of a memory leak in the AWS Javascript SDK.

    Did you try using the Kinesis REST API directly: https://docs.aws.amazon.com/kinesis/latest/APIReference/API_...

  • by qrczeno on 6/16/20, 3:59 PM

    That was a real issue we were struggling to solve. Feels like Rust was the right tool for the right job.

  • by hobbescotch on 6/16/20, 5:37 PM

    Having never dealt with issues relating to garbage collection before, how do you go about diagnosing GC issues in a language where that’s all handled for you?

  • by zerubeus on 6/16/20, 10:50 PM

    Feels like an HN post being upvoted just because it contains Rust in the title (after reading the article)...