by itunpredictable on 4/3/24, 6:39 PM with 14 comments
by vlovich123 on 4/3/24, 7:11 PM
Telemetry is captured up front because there is no way to go back and collect it after the fact. If you could solve that time-travel problem, people would capture far less telemetry. I think the key is anomaly detection that captures only the rare events, because 90% of telemetry is garbage happy-path telemetry that doesn’t really give you any extra insight. But doing that anomaly detection cheaply and correctly is extremely hard.
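A minimal sketch of that idea, sometimes called tail-based or anomaly-triggered capture: buffer recent events locally and only ship them when something looks wrong. The z-score detector, the window sizes, and ship_to_backend() are assumptions for illustration, not any real library's API.

    import collections
    import statistics

    def ship_to_backend(events):
        # placeholder for the real telemetry exporter (assumed)
        print(f"shipping {len(events)} events")

    class AnomalyTriggeredBuffer:
        # Keep recent happy-path events in memory; flush them to the
        # backend only when an error or a latency outlier shows up,
        # so the rare event arrives with its surrounding context.
        def __init__(self, capacity=1000, z_threshold=3.0):
            self.events = collections.deque(maxlen=capacity)
            self.latencies = collections.deque(maxlen=capacity)
            self.z_threshold = z_threshold

        def record(self, event, latency_ms, is_error=False):
            self.events.append(event)
            if is_error or self._is_outlier(latency_ms):
                ship_to_backend(list(self.events))
                self.events.clear()
            self.latencies.append(latency_ms)

        def _is_outlier(self, latency_ms):
            if len(self.latencies) < 30:  # too little history to judge
                return False
            mean = statistics.fmean(self.latencies)
            stdev = statistics.stdev(self.latencies)
            return stdev > 0 and (latency_ms - mean) / stdev > self.z_threshold

The catch is the detector itself: a static z-score misses regime changes and multimodal latencies, which is exactly where "cheaply and correctly" gets hard.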
by nishantmodak on 4/3/24, 9:41 PM
>Engineers have to pre-define and send all telemetry data they might need – since it’s so difficult to make changes after the fact – regardless of the percentage chance of the actual need.
YES. Let them send all the data. The best place to solve for it is at Ingestion.
There are typically 5 stages to this process.
Instrumentation -> Ingestion -> Storage -> Query (Dashboard) -> Query (Alerting)
Instrumentation is the wrong place to solve this.
Ingestion - Build pipelines that process this data and provide tools like streaming aggregation and cardinality controls, so you can act on anomalous patterns (a sketch of one such control follows below). This at least makes working with observability data dynamic, instead of always having to go back and change instrumentation.
Storage - Provide tiered storage with independent read paths: blaze (2 hours), hot (1 month), cold (13 months).
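For concreteness, a minimal sketch of an ingestion-stage cardinality cap; the class name, the per-metric budget of 1000 series, and the overflow label are assumptions, not any particular vendor's API:

    from collections import defaultdict

    class CardinalityLimiter:
        # Ingestion-side guard: cap the number of distinct label sets
        # per metric, folding anything past the budget into a single
        # overflow series instead of letting cardinality explode.
        def __init__(self, max_series_per_metric=1000):
            self.max_series = max_series_per_metric
            self.seen = defaultdict(set)  # metric name -> known label sets

        def process(self, metric, labels):
            key = tuple(sorted(labels.items()))
            known = self.seen[metric]
            if key in known or len(known) < self.max_series:
                known.add(key)
                return metric, labels            # pass through unchanged
            return metric, {"overflow": "true"}  # aggregate the long tail

Because this runs at ingestion, the budget can be changed on the fly without touching instrumentation, which is the point above.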
This, in my opinion, has solved the bulk of the cost & re-work challenges associated with telemetry data.
I believe Observability is the Big Data of today, without the Big Data tools! (Disclosure: I work at Last9.io and we have taken a similar approach to solving these challenges.)
by throwaway4good on 4/3/24, 7:25 PM
Just say no to the logging industrial complex.
by komuW on 4/4/24, 6:44 AM
1. https://www.komu.engineer/blogs/09/log-without-losing-contex...
by nathants on 4/3/24, 8:40 PM
stuff logs into s3. learn to mine them in parallel using lambda or ec2 spot. grow tech or teams as needed for scale. never egress data and never persist data outside of the cheapest s3 tiers. expire data on some sane schedule.
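a minimal local sketch of that mining pattern, using threads instead of lambda or ec2 spot; the bucket name, the date-partitioned key layout, and the grep_* function names are assumptions:

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")
    BUCKET = "my-log-bucket"  # assumed bucket name

    def grep_object(key, needle):
        # scan a single log object for matching lines (needle is bytes,
        # since iter_lines yields raw bytes)
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
        return [line for line in body.iter_lines() if needle in line]

    def grep_prefix(prefix, needle, workers=32):
        # list everything under the prefix, then scan in parallel
        keys = []
        for page in s3.get_paginator("list_objects_v2").paginate(
                Bucket=BUCKET, Prefix=prefix):
            keys += [obj["Key"] for obj in page.get("Contents", [])]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for lines in pool.map(lambda k: grep_object(k, needle), keys):
                yield from lines

    # e.g. errors for one day, assuming date-partitioned keys:
    for line in grep_prefix("logs/2024/04/03/", b"ERROR"):
        print(line)

the expiry part maps to an s3 lifecycle rule on the bucket rather than application code.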
data processing is fun, interesting, and valuable. it is core to understanding your systems.
if you can’t do this well, there is probably a lot more you can’t do well either. in that case, life is going to be very expensive.
it’s ok to not do this well yet! spend some portion of your week doing this and you will improve quickly.
by dboreham on 4/3/24, 7:10 PM