by samber on 9/30/24, 7:35 AM with 40 comments
by Fripplebubby on 9/30/24, 1:48 PM
> I find this is often an artifact of the DE roles not being equipped with the necessary knowledge of more generic SWE tools, and general SWEs not being equipped with knowledge of data-specific tools and workflows.
> Speaking of, especially in smaller companies, equipping all engineers with the technical tooling and knowledge to work on all parts of the platform (including data) is a big advantage, since it allows people not usually on your team to help on projects as needed. Standardized tooling is a part of that equation.
I have found this to be so true. SWE vs. DE is one division where this applies, and I think it also applies to SWE vs. SRE (if you have those in your company), data scientists, "analysts", and so on: basically, anyone in a technical role should ideally know what kinds of problems other teams work on and what tooling they use to address them, so that you can cross-pollinate.
by 1a527dd5 on 9/30/24, 10:12 AM
Our data team currently has something similar, and its costs are astronomical.
On the other hand, our internal platform metrics are fired at BigQuery [1], and then we use scheduled queries that run daily (looking at the previous 24 hours) to aggregate and export to Parquet; see the sketch after this comment. And it's cheap as chips. From there it's just a flat file stored on GCS that can be pulled for analysis.
Do you have more thoughts on Preset/Superset? We looked at both (slightly leaning towards cloud-hosted, as we want to move away from on-prem), but ended up going with Metabase.
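A minimal sketch of the daily aggregate-and-export job described in the comment above, run through the BigQuery Python client rather than a scheduled query so it fits in one file. The project, dataset, table, columns, and GCS path are hypothetical placeholders, not details from the comment:

    # Roll up the previous 24 hours of raw events and export the result
    # as Parquet to GCS. Requires google-cloud-bigquery and
    # application-default credentials.
    from google.cloud import bigquery

    EXPORT_SQL = """
    EXPORT DATA OPTIONS (
      uri = 'gs://my-analytics-exports/daily/*.parquet',
      format = 'PARQUET',
      overwrite = true
    ) AS
    SELECT
      TIMESTAMP_TRUNC(event_time, HOUR) AS event_hour,
      event_name,
      COUNT(*) AS event_count
    FROM `my-project.analytics.platform_events`
    WHERE event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
    GROUP BY event_hour, event_name
    """

    def run_daily_export() -> None:
        client = bigquery.Client()
        client.query(EXPORT_SQL).result()  # blocks until the export job finishes

    if __name__ == "__main__":
        run_daily_export()

In practice the same SQL would live in a BigQuery scheduled query, as the comment describes; the script form is just for illustration.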
by zurfer on 9/30/24, 8:48 AM
But I think it's great that analytics and data transformation are distributed, so developers are also somewhat responsible for correct analytical numbers.
In most companies there is a strong split between building the product and maintaining analytics for the product, which leads to all sorts of inefficiencies and errors.
by tonymet on 9/30/24, 6:10 PM
Most events can be aggregated over time with a statistic (count, avg, max, etc.). Even discrete events can be aggregated with a 5-minute latency, which should reduce their event volume by 90% (see the sketch after this comment). Every layer in that diagram is CPU wasted on encode/decode, and that costs money.
The paragraph on integrity violation queries was helpful -- it would be good to understand more of the query and latency requirements.
The article is a great technical overview, but it would also be helpful to discuss whether this system is a viable business investment. Sure, they are making high margins, but why burn good cash on something like this?
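A rough sketch of the 5-minute pre-aggregation idea from the comment above, in plain Python. The event shape (unix timestamp, name, numeric value) is an assumption for illustration, not anything taken from the article:

    # Collapse raw events into 5-minute buckets with count/avg/max so that
    # only the aggregates, not every individual event, travel downstream.
    from collections import defaultdict

    BUCKET_SECONDS = 300  # 5-minute windows

    def aggregate(events):
        """events: iterable of (unix_ts_seconds, event_name, value) tuples."""
        buckets = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
        for ts, name, value in events:
            window_start = int(ts) // BUCKET_SECONDS * BUCKET_SECONDS
            agg = buckets[(window_start, name)]
            agg["count"] += 1
            agg["sum"] += value
            agg["max"] = max(agg["max"], value)
        return [
            {"window_start": w, "event": name, "count": a["count"],
             "avg": a["sum"] / a["count"], "max": a["max"]}
            for (w, name), a in buckets.items()
        ]

    # Three page_load events in the same window collapse into a single row.
    print(aggregate([(1000, "page_load", 120.0), (1050, "page_load", 80.0), (1150, "page_load", 100.0)]))

In a real pipeline this reduction would happen in the collector or a stream processor, but the shape of it is the same.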
by LoganDark on 9/30/24, 8:44 AM
Don't worry, your sensitive data isn't handled by our platform; we ship it to a third party instead. This is for your protection!
(I have no idea if ClickHouse is actually a third party, it sounds like one though?)