by samber on 9/30/24, 7:35 AM with 40 comments
by Fripplebubby on 9/30/24, 1:48 PM
> I find this is often an artifact of the DE roles not being equipped with the necessary knowledge of more generic SWE tools, and general SWEs not being equipped with knowledge of data-specific tools and workflows.
> Speaking of, especially in smaller companies, equipping all engineers with the technical tooling and knowledge to work on all parts of the platform (including data) is a big advantage, since it allows people not usually on your team to help on projects as needed. Standardized tooling is a part of that equation.
I have found this to be so true. SWE vs. DE is one division where this applies, and I think it also applies to SWE vs. SRE (if you have those in your company), data scientists, "analysts", and so on: basically, anyone in a technical role should ideally know what kinds of problems other teams work on and what tooling they use to address them, so that you can cross-pollinate.
by 1a527dd5 on 9/30/24, 10:12 AM
Our data team currently has something similar, and its costs are astronomical.
On the other hand, our internal platform metrics are fired at BigQuery [1], and then we use scheduled queries that run daily (looking at the previous 24 hours) to aggregate and export to Parquet; see the sketch after this comment. And it's cheap as chips. From there it's just a flat file stored on GCS that can be pulled for analysis.
Do you have more thoughts on Preset/Superset? We looked at both (slightly leaning towards cloud-hosted, as we want to move away from on-prem), but ended up going with Metabase.
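A minimal sketch of the daily aggregate-and-export job described in the comment above, run through the BigQuery Python client rather than a scheduled query so it fits in one file. The project, dataset, table, columns, and GCS path are hypothetical placeholders, not details from the comment:

    # Roll up the previous 24 hours of raw events and export the result
    # as Parquet to GCS. Requires google-cloud-bigquery and
    # application-default credentials.
    from google.cloud import bigquery

    EXPORT_SQL = """
    EXPORT DATA OPTIONS (
      uri = 'gs://my-analytics-exports/daily/*.parquet',
      format = 'PARQUET',
      overwrite = true
    ) AS
    SELECT
      TIMESTAMP_TRUNC(event_time, HOUR) AS event_hour,
      event_name,
      COUNT(*) AS event_count
    FROM `my-project.analytics.platform_events`
    WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
    GROUP BY event_hour, event_name
    """

    def run_daily_export() -> None:
        client = bigquery.Client()
        client.query(EXPORT_SQL).result()  # blocks until the export job finishes

    if __name__ == "__main__":
        run_daily_export()

In practice the same SQL would live in a BigQuery scheduled query, as the comment describes; the script form is just for illustration.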
by zurfer on 9/30/24, 8:48 AM
But I think it's great that analytics and data transformation are distributed, so developers are also somewhat responsible for correct analytical numbers.
In most companies there is a strong split between building the product and maintaining analytics for the product, which leads to all sorts of inefficiencies and errors.
by tonymet on 9/30/24, 6:10 PM
Most events can be aggregated over time with a statistic (count, avg, max, etc.). Even discrete events can be aggregated with a 5-minute latency, which should reduce their event volume by 90% (see the sketch after this comment). Every layer in that diagram is CPU wasted on encode/decode, and that costs money.
The paragraph on integrity violation queries was helpful -- it would be good to understand more of the query and latency requirements.
The article is a great technical overview, but it would also be helpful to discuss whether this system is a viable business investment. Sure, they are making high margins, but why burn good cash on something like this?
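A rough sketch of the 5-minute pre-aggregation idea from the comment above, in plain Python. The event shape (unix timestamp, name, numeric value) is an assumption for illustration, not anything taken from the article:

    # Collapse raw events into 5-minute buckets with count/avg/max so that
    # only the aggregates, not every individual event, travel downstream.
    from collections import defaultdict

    BUCKET_SECONDS = 300  # 5-minute windows

    def aggregate(events):
        """events: iterable of (unix_ts_seconds, event_name, value) tuples."""
        buckets = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
        for ts, name, value in events:
            window_start = int(ts) // BUCKET_SECONDS * BUCKET_SECONDS
            agg = buckets[(window_start, name)]
            agg["count"] += 1
            agg["sum"] += value
            agg["max"] = max(agg["max"], value)
        return [
            {"window_start": w, "event": name, "count": a["count"],
             "avg": a["sum"] / a["count"], "max": a["max"]}
            for (w, name), a in buckets.items()
        ]

    # Three page_load events in the same window collapse into a single row.
    print(aggregate([(1000, "page_load", 120.0), (1050, "page_load", 80.0), (1150, "page_load", 100.0)]))

In a real pipeline this reduction would happen in the collector or a stream processor, but the shape of it is the same.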
by LoganDark on 9/30/24, 8:44 AM
Don't worry, your sensitive data isn't handled by our platform; we ship it to a third party instead. This is for your protection!
(I have no idea if ClickHouse is actually a third party, it sounds like one though?)