by robgering on 6/14/24, 12:36 PM with 174 comments
by codereflection on 6/14/24, 6:47 PM
One of the promises of OTEL is that it allows organizations to replace vendor-specific agents with OTEL collectors, which gives them flexibility in choosing the end observability platform. When used with an observability pipeline (such as EdgeDelta or Cribl), you can re-process collected telemetry data and send it to another platform, like Splunk, if needed. Consequently, switching from one observability platform to another becomes a bit less of a headache. Ironically, even Splunk recognizes this and has put substantial support behind the OTEL standard.
OTEL is far from perfect, and maybe some of these goals are a bit lofty, but I can say that many large organizations are adopting OTEL for these reasons.
by doctorpangloss on 6/14/24, 3:47 PM
But I do have to “pip uninstall sentry-sdk” in my Dockerfile because it clashes with something I didn’t author. And anyway, because it is completely open source, the flaws in OpenTelemetry for my particular use case took an hour to surmount, and vitally, I didn’t have to pay the brain damage cost most developers hate: relationships with yet another vendor.
That said I appreciate all the innovation in this space, from both Sentry and OpenTelemetry. The metrics will become the standard, and that’s great.
The problem with Not OpenTelemetry: eventually everyone is going to learn how to use Kubernetes, and the USP of many startup offerings will vanish. OpenTelemetry and its feature scope creep make perfect sense for people who know Kubernetes. Then it makes sense why you have a wire protocol, why abstraction for vendors is redundant or meaningless toil, and why PostHog and others stop supporting Kubernetes: it competes with their paid offering.
by ankitnayan on 6/15/24, 6:51 AM
Open standards also open up a lot of use cases and startups. SigNoz, TraceTest, TraceLoop, and Signadot are all very interesting projects that OpenTelemetry enabled.
Most of the problem seems to be that Sentry is not able to provide its Sentry-like features by adopting OTel. Getting involved at the design phase could have helped shape the project so that it considered your use cases. The maintainers have never been opposed to such contributions AFAIK.
Limiting OTel to just tracing would not be sufficient today, as teams want a single platform for all observability rather than different tools for different signals.
I have seen hundreds of companies switch to OpenTelemetry and save costs by being able to choose the best vendor supporting their use cases.
Lack of docs, the learning curve, etc. are just temporary things that can happen with any big project and should be fixed. Also, OTel maintainers and teams have always been seeking help in improving docs, showcasing use cases, etc. If everyone cares enough about the bigger picture, the community and existing vendors should get more involved in improving things rather than just complaining.
by no_circuit on 6/14/24, 4:53 PM
Of course implementing a spec from the provider point of view can be difficult. And also take a look at all the names of the OTEL community and notice that Sentry is not there: https://github.com/open-telemetry/community/blob/86941073816.... This really isn't news. I'd guess that a Sentry customer should just be able to use the OTEL API and could just configure a proprietary Sentry exporter, for all their compute nodes, if Sentry has some superior way of collecting and managing telemetry.
IMO most library authors do not have to worry about annotation naming or anything like that mentioned in the post. Just use the OTEL API for logs, or use a logging API where there is an OTEL exporter, and whoever is integrating your code will take care of annotating spans. Propagating span IDs is the job of "RPC" libraries, not general code authors. Your URL fetch library should know how to propagate the span ID, provided that it also uses the OTEL API.
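As a rough illustration of what such an "RPC" library does on the wire, here is a minimal sketch of writing and reading a W3C Trace Context `traceparent` header using only the standard library. The helper names `inject` and `extract` are hypothetical and this is not the OTel API; a real library would delegate this work to the OTel propagators.

```python
import re
import secrets

# W3C Trace Context, version "00":
# 32-hex trace ID, 16-hex parent span ID, 2-hex trace flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Random 8-byte span ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Hypothetical helper: write the current span's identity into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Hypothetical helper: return (trace_id, parent_span_id, sampled) or None."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid according to the spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)
```

The calling code never touches these helpers: the HTTP client calls something like `inject` before sending, the server framework calls something like `extract` on receipt, and application code in between only sees "the current span".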
It is the same as using something like Docker containers on a serverless platform. You really don't need to know that your code is actually being deployed in Kubernetes. Using the common Docker interface is what matters.
by serverlessmom on 6/14/24, 5:59 PM
Context propagation and distributed tracing are cool OTel features! But they are not the only thing OTel should be doing. OpenTelemetry instrumentation libraries can do a lot on their own; a friend of mine made massive savings in compute efficiency with the NodeJS OTel library: https://www.checklyhq.com/blog/coralogix-and-opentelemetry-o...
by wdb on 6/14/24, 4:10 PM
I quite like the idea of only needing to change one small piece of the code to switch otel exporters instead of swapping out a vendor trace SDK.
My main gripe with OpenTelemetry is that I don't fully understand what the exact difference is between (trace) events and log records.
by AndreasBackx on 6/14/24, 8:15 PM
It is hard to explain how convenient `tracing` is in Rust and why I sorely miss it elsewhere. The simple part of adding context to logs can be solved in a myriad of ways, yet all boil down to a similar "span-like" approach. I'm very interested in helping bring what `tracing` offers to other programming communities.
It very likely is worth having some people from the space involved, possibly from the tracing crate itself.
by wvh on 6/14/24, 5:26 PM
It's no longer about "hey, we'll include this little library or protocol instead of rolling our own, so we can hope to be compatible with a bunch of other industry-standard software." It's a large stack with an ever-evolving spec. You have to develop your applications and infrastructure around it. It's very seductive to roll your own simpler solution.
I appreciate it's not easy to build industry-wide consensus across vendors, platforms and programming languages. But be careful with projects that fail to capture developer mindshare.
by fractalwrench on 6/14/24, 8:08 PM
From this perspective it doesn't matter if the OTel SDK comes bundled with a bunch of unnecessary code or version conflicts as is suggested in the article. The whole point is to regain control over telemetry & avoid paying $$$ to an ambivalent vendor.
FWIW, I don't think the OTel implementation for mobile is perfect - a lot of the code was originally written with backend JVM apps in mind & that can cause friction. However, I'm fairly optimistic those pain points will get fixed as more folks converge on this standard.
Disclaimer: I work at a Sentry competitor
by markl42 on 6/14/24, 4:59 PM
There are no causal relationships between sibling spans. I think in theory "span links" solve this, but afaict this is not a widely used feature in SDKs or UI viewers.
(I wrote about this here https://github.com/open-telemetry/opentelemetry-specificatio...)
by tnolet on 6/14/24, 4:41 PM
I could not for the life of me get the Python integration to send traces to a collector. Same URL, same setup, same API key as for Node.js and Go.
Turns out the Python SDK expects a URL-encoded header, e.g. “Bearer%20somekey”, whereas all other SDKs just accept a string with a whitespace.
The whole split between HTTP, protobuf over HTTP and gRPC is also massively confusing.
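A minimal sketch of producing that encoded form with the standard library, assuming the headers are passed through the `OTEL_EXPORTER_OTLP_HEADERS` environment variable as a comma-separated `key=value` list (the comment does not say how the exporter was configured, so that part is an assumption). `encode_otlp_headers` and `decode_otlp_headers` are hypothetical helpers, not SDK functions.

```python
import os
from urllib.parse import quote, unquote


def encode_otlp_headers(headers: dict) -> str:
    """Build a comma-separated key=value list with percent-encoded values."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in headers.items())


def decode_otlp_headers(raw: str) -> dict:
    """Inverse of the above: roughly what a URL-decoding reader would do."""
    pairs = (item.split("=", 1) for item in raw.split(",") if item)
    return {key.strip(): unquote(value.strip()) for key, value in pairs}


value = encode_otlp_headers({"Authorization": "Bearer somekey"})
# Assumption: the exporter reads its headers from this environment variable.
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = value
```

Encoding the value up front sidesteps the question of which SDKs decode it, as long as the ones that do not decode are given the raw string instead.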
by NeutralForest on 6/14/24, 2:37 PM
by BiteCode_dev on 6/14/24, 2:55 PM
Every time I tried to use OT I was reading the doc and whispering "but, why? I only need...".
by spullara on 6/14/24, 8:03 PM
by drewbug01 on 6/14/24, 3:51 PM
But this ain’t it. In the opening paragraphs the author dismisses the hardest parts of the problem (presumably because they are human problems, which engineers tend to ignore), and betrays a complete lack of interest in understanding why things ended up this way. It also seems they’ve completely misunderstood the API/SDK split in its entirety - because they argue for having such a split. It’s there - that’s exactly what exists!
And it goes on and on. I think it’s fair to critique OpenTelemetry; it can be really confusing. The blog post is evidence of that, certainly. But really it just reads like someone who got frustrated that they didn’t understand how something worked - and so instead of figuring it out, they’ve decided that it’s just hot garbage. I wish I could say this was unusual amongst engineers, but it isn’t.
by shaqbert on 6/14/24, 4:01 PM
Otelbin [0] has helped me quite a bit in configuring and making sense of it, and getting stuff done.
by epgui on 6/14/24, 4:59 PM
by grenbys on 6/16/24, 7:24 PM
by hobofan on 6/14/24, 4:46 PM
OP (rightfully) complains that there is a mismatch between what they (can) advertise ("We support OTEL") and what they are actually providing to the user. I have the same pain point from the consumer side, where I have to trial multiple tools and services to figure out which of them actually supports the OTEL feature set I care about.
I feel like this could be solved by introducing better branding that has a clearly defined scope of features inside the project (like e.g. "OTEL Tracing") which can serve as a direct signifier to customers about what feature set can be expected.
by antonyt on 6/14/24, 3:06 PM
by dboreham on 6/14/24, 3:30 PM
That said, I think this rot comes from the commercial side of the sector -- if you're a successful startup with one product (e.g. graphing counters), then your investors are going to start beating you up about why you don't expand into other adjacent product areas (e.g. tracing). Repeat previous sentence reversed. And so you get Grafana, New Relic, et al. OpenTelemetry is just mirroring that arrangement.
by edenfed on 6/15/24, 5:03 AM
by prymitive on 6/14/24, 4:37 PM
by PeterZaitsev on 6/14/24, 7:06 PM
by ris on 6/14/24, 9:09 PM
2. I honestly think the main reason otel appears so complex is that the existing resources that attempt to explain the various concepts around it do a poor job and are very hand-wavy. You know the main thing that made otel "click" for me? Reading the protobuf specs. Literally nothing else explained succinctly the relationships between the different types of structure and what the possibilities with each were.
by esafak on 6/14/24, 10:07 PM
> Logs are just events - which is exactly what a span is, btw - and metrics are just abstractions out of those event properties. That is, you want to know the response time of an API endpoint? You don't rewind 20 years and increment a counter, you instead aggregate the duration of the relevant span segment. Somehow though, Logs and Metrics are still front and center.
Is anyone replacing logs and metrics with traces?
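To make the quoted idea concrete, here is a minimal sketch of deriving a response-time metric by aggregating span durations rather than incrementing a counter. The `Span` record and its field names are hypothetical stand-ins, not the OTel data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Span:
    name: str        # e.g. the API endpoint
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def response_times(spans):
    """Group span durations by name and summarise them as count / mean / max."""
    by_name = defaultdict(list)
    for s in spans:
        by_name[s.name].append(s.duration_ms)
    return {
        name: {"count": len(d), "mean_ms": mean(d), "max_ms": max(d)}
        for name, d in by_name.items()
    }
```

The trade-off is that every span has to be kept (or sampled carefully) for the aggregate to be accurate, which is part of why pre-aggregated metrics have not gone away.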
by dtjohnnymonkey on 6/15/24, 3:37 PM
Isn’t this exactly what the SpanExporter API is for? This is in the Go SDK; I suppose it may not be available in other SDKs.
I have used this API to convert OTel spans into log messages as we currently don’t have a distributed tracing vendor.
by dan-allen on 6/14/24, 10:50 PM
I don’t follow closely enough to comment on possible causes.
What I do know is that the surface area of code and infrastructure that telemetry touches means adopting something unfinished is a big leap of faith.
by cogman10 on 6/14/24, 3:34 PM
I suspect OP is seeing this directly when talking about the kludginess of the JavaScript API.
by zellyn on 6/14/24, 6:53 PM
The simple API they describe is basically there in OTel. The API is larger, because it also does quite a few other things (personally, I think (W3C) Baggage is important too), but as a library author I should need only the client APIs to write to.
When implementing, you're free to plug in Providers that use OpenTelemetry-provided plumbing, but you can equally well plug in Providers from DataDog or Sentry or whatever.
Unless I'm missing something, any further complaints could be solved by making sure the Client APIs (almost) never have backward-incompatible changes, and are versioned separately.
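The split described here can be sketched in a few lines: library code depends only on a tiny client API that defaults to a no-op, and the application decides which implementation to install. All names below (`NoopTracer`, `RecordingTracer`, `set_tracer`, `get_tracer`, `fetch_user`) are hypothetical and only mirror the shape of the idea, not the real OTel interfaces.

```python
from contextlib import contextmanager


class NoopTracer:
    """Default when the application installs nothing: spans do no work."""

    @contextmanager
    def start_span(self, name):
        yield


class RecordingTracer:
    """Stand-in for an SDK or vendor implementation; it just records span names."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def start_span(self, name):
        try:
            yield
        finally:
            self.finished.append(name)


_tracer = NoopTracer()


def set_tracer(tracer):
    """Called once by the application, never by libraries."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


# --- library code: depends only on the API functions above ---
def fetch_user(user_id):
    with get_tracer().start_span("fetch_user"):
        return {"id": user_id}
```

Swapping vendors then means changing the single `set_tracer(...)` call in the application, which is also why keeping the client API small and backward compatible matters so much.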
by EdSchouten on 6/14/24, 5:42 PM
I've always wondered, what's the point of the trace ID? What even is a trace?
- It could be a single database query that's invoked on a distributed database, giving you information about everything that went on inside the cluster processing that query.
- Or it could be all database calls made by a single page request on a web server.
- Or it could be a collection of page requests made by a single user as part of a shopping checkout process. Each page request could make many outgoing database calls.
Which of these three you should choose merely depends on what you want to visualize at a given point in time. My hope is that at some point we get a standard for tracing that does away with the notion of trace IDs. Just treat everything going on in the universe as a graph of inter-connected events.
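A small sketch of that "graph of events" view, assuming events are connected by plain cause-to-effect edges: each of the three readings of "a trace" above is then just the set of events reachable from a different starting node. The event names and the helpers `build_graph` and `view_from` are made up for illustration.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """edges: iterable of (cause_id, effect_id) pairs between events."""
    children = defaultdict(list)
    for cause, effect in edges:
        children[cause].append(effect)
    return children


def view_from(children, root):
    """Return every event reachable from `root`, in breadth-first order."""
    seen, order, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


graph = build_graph([
    ("checkout", "page1"), ("checkout", "page2"),  # user flow -> page requests
    ("page1", "q1"), ("page1", "q2"),              # page request -> database calls
    ("page2", "q3"),
    ("q1", "shard_a"),                             # query -> work inside the cluster
])

single_query = view_from(graph, "q1")
page_request = view_from(graph, "page1")
whole_checkout = view_from(graph, "checkout")
```

No trace ID is needed to answer any of the three questions; the choice of root does the job at query time.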
by noname120 on 6/14/24, 5:26 PM
by jiveturkey on 6/14/24, 11:31 PM
buried the lede!
by syngrog66 on 6/14/24, 3:42 PM