from Hacker News

Migrating to OpenTelemetry

by kkoppenhaver on 11/16/23, 5:29 PM with 75 comments

  • by CSMastermind on 11/16/23, 8:03 PM

    > The data collected from these streams is sent to several vendors including Datadog (for application logs and metrics), Honeycomb (for traces), and Google Cloud Logging (for infrastructure logs).

    It sounds like they were in a place that a lot of companies are in, where they don't have a single pane of glass for observability. One of the main benefits, if not the main one, I've gotten out of Datadog is having everything in Datadog, so that it's all connected and I can easily jump from a trace to logs, for instance.

    One of the terrible mistakes I see companies make with this tooling is fragmenting like this. Everyone has their own personal preference for tools, and ultimately the collective experience is significantly worse than the sum of its parts.

  • by tapoxi on 11/16/23, 6:10 PM

    I made this switch very recently. For our Java apps it was as simple as loading the otel agent in place of the Datadog SDK, basically "-javaagent:/opt/otel/opentelemetry-javaagent.jar" in our args.

    The collector (which processes and ships metrics) can be installed in K8S through Helm or an operator, and we just added a variable to our charts so the agent can be pointed at the collector. The collector speaks OTLP which is the fancy combined metrics/traces/logs protocol the OTEL SDKs/agents use, but it also speaks Prometheus, Zipkin, etc to give you an easy migration path. We currently ship to Datadog as well as an internal service, with the end goal being migrating off of Datadog gradually.

  • by MajimasEyepatch on 11/16/23, 6:07 PM

    It's interesting that you're using both Honeycomb and Datadog. With everything migrated to OTel, would there be advantages to consolidating on just Honeycomb (or Datadog)? Have you found they're useful for different things, or is there enough overlap that you could use just one or the other?

  • by Jedd on 11/17/23, 12:33 AM

    The killer feature of OpenTelemetry for us is brokering (with ETL).

    Partly this lets us easily re-route & duplicate telemetry, partly it means changes to backend products in the future won't be a big disruption.

    For metrics we're a mostly telegraf->prometheus->grafana mimir shop - telegraf because it's rock solid and feature-rich, prometheus because there's no real competition in that tier, and mimir because of scale & self-host options.

    Our scale problem means most online pricing calculators generate overflow errors.

    Our non-security log destination preference is Loki - for similar reasons to Mimir - though a SIEM it definitely is not.

    Tracing to a vendor, but looking to bring that back to grafana Tempo. Product maturity is a long way off commercial APM offerings, but it feels like the feature-set is about 70% there and converging rapidly. Off-the-shelf tracing products have an appealingly low cost of entry, which only briefly defers lock-in & pricing shocks.

  • by nevon on 11/16/23, 9:04 PM

    I would love to save a few hundred thousand a year by running Otel collector over Datadog agents, just on the cost-per-host alone. Unfortunately that would also mean giving up Datadog APM and NPM, as far as I can tell, which have been really valuable. Going back to just metrics and traces would feel like quite the step backwards and be a hard sell.

  • by nullify88 on 11/17/23, 7:17 AM

    One thing that's slightly off-putting about OpenTelemetry is how resource attributes don't get included as prometheus labels for metrics; instead they are on an info metric, which requires a join to enrich the metric you are interested in.

    Luckily the prometheus exporters have a switch to enable this behaviour, but there's talk of removing this functionality because it breaks the spec. If you were to use the OpenTelemetry protocol into something like Mimir, you don't have the option of enabling that behaviour unless you use prometheus remote write.

    Our developers aren't fans of that.

    https://opentelemetry.io/docs/specs/otel/compatibility/prome...
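
    For readers unfamiliar with the join being described: in PromQL it is usually written along the lines of `some_metric * on (job, instance) group_left (k8s_namespace_name) target_info`. The snippet below is only a rough Python sketch of what that join does to the label sets, not code from the article or from any exporter. The function name, the sample data and the `k8s_namespace_name` label are hypothetical; it assumes the info metric is `target_info` keyed by `job` and `instance`, as the linked compatibility spec describes.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]


def join_target_info(
    samples: Iterable[Sample],
    target_info: Iterable[Labels],
    copy_labels: Sequence[str],
    on: Sequence[str] = ("job", "instance"),
) -> List[Sample]:
    """Copy selected resource labels from the info metric onto each sample.

    Hypothetical helper that mimics `metric * on(...) group_left(...) target_info`.
    """
    # Index the info series by the identifying labels (job, instance by default).
    index = {}
    for info in target_info:
        index[tuple(info.get(name, "") for name in on)] = info

    enriched: List[Sample] = []
    for labels, value in samples:
        info = index.get(tuple(labels.get(name, "") for name in on))
        if info is None:
            # As with PromQL vector matching, series without a match are dropped.
            continue
        merged = dict(labels)
        for name in copy_labels:
            if name in info:
                merged[name] = info[name]
        # target_info always has the value 1, so multiplying leaves the value unchanged.
        enriched.append((merged, value))
    return enriched
```

    Calling `join_target_info(samples, infos, ["k8s_namespace_name"])` returns the samples with the namespace label attached, which is the extra step every dashboard query needs when resource attributes are not promoted to labels.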

  • by roskilli on 11/16/23, 7:43 PM

    > Moreover, we encountered some rough edges in the metrics-related functionality of the Go SDK referenced above. Ultimately, we had to write a conversion layer on top of the OTel metrics API that allowed for simple, Prometheus-like counters, gauges, and histograms.

    Have encountered this a lot from teams attempting to use the metrics SDK.

    Are you open to commenting on specifics here, and also on what kind of shim you had to put in front of the SDK? It would be great to keep gathering feedback so that we as a community have a good idea of what remains before it's possible to use the SDK for real world production use cases in anger. Just wiring up the setup in your app used to be fairly painful, but that has gotten somewhat better over the last 12-24 months. I'd also love to hear what is currently causing compatibility issues w/ the metric types themselves when using the SDK, such that a shim is required, and what the shim is doing to achieve compatibility.
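
    The article does not publish its conversion layer, so the following is only a guess at the general shape of such a shim, written in Python for brevity rather than Go. It assumes one of the gaps is that gauges are exposed only as asynchronous (callback-based) instruments, so a Prometheus-style `set()` has to remember the latest value until collection time. Every name here (`Gauge`, `Counter`, `observe`, the `add` hook) is hypothetical and stands in for whatever the real SDK provides.

```python
import threading
from typing import Callable, Dict, List, Tuple

LabelKey = Tuple[Tuple[str, str], ...]


def _key(labels: Dict[str, str]) -> LabelKey:
    # Sort so that the same labels in a different order map to the same series.
    return tuple(sorted(labels.items()))


class Gauge:
    """Prometheus-like gauge: call set() at any time, read back at collection."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._lock = threading.Lock()
        self._values: Dict[LabelKey, float] = {}

    def set(self, value: float, **labels: str) -> None:
        with self._lock:
            self._values[_key(labels)] = value

    def observe(self) -> List[Tuple[Dict[str, str], float]]:
        """Body of the callback an asynchronous instrument would invoke."""
        with self._lock:
            return [(dict(key), value) for key, value in self._values.items()]


class Counter:
    """Prometheus-like counter that forwards increments to a backend hook."""

    def __init__(self, name: str, add: Callable[[float, Dict[str, str]], None]) -> None:
        self.name = name
        self._add = add  # hypothetical stand-in for the backend counter's add call

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self._add(amount, labels)
```

    With this shape, application code keeps calling `queue_depth.set(3, queue="a")` or `requests.inc(route="/")`, and only the thin layer underneath needs to know how the metrics backend wants to receive the data.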

  • by caust1c on 11/16/23, 6:03 PM

    Curious about the code implemented for logs! Hopefully that's something that can be shared at some point. Also curious if it integrates with `log/slog` :-)

    Congrats too! As I understand it from stories I've heard from others, migrating to OTel is no easy undertaking.

  • by throwaway084t95 on 11/16/23, 10:25 PM

    What is the "first principles" argument that observability decomposes into logs, metrics, and tracing? I see this dogma accepted everywhere, but I'm inquisitive about it.

  • by tsamba on 11/16/23, 7:42 PM

    Interesting read. What did you find easier about using GCP's log tooling for your internal system logs, rather than the OTel collector?

  • by shoelessone on 11/16/23, 10:31 PM

    I really really want to use OTel for a small project but have always had a really tough time finding a path that is cheap or free for a personal project.

    In theory you can send telemetry data with OTel to CloudWatch, but I've struggled to connect the dots with the front end application (e.g. React/Next.js).

  • by jon-wood on 11/17/23, 10:11 AM

    At the risk of being downvoted (probably justly) for having a moan, can we please have a moratorium on every blog post needing to have a generally irrelevant picture attached to it? On opening this page I can see 28 words that are actually relevant because almost the entire view is consumed by a huge picture of a graph and the padding around it.

    This is endemic now. It doesn't matter what someone is writing about, there'll be some pointless stock photo taking up half the page. There'll probably be some more throughout the page. Stop it please.

  • by k__ on 11/16/23, 6:19 PM

    I had the impression that logs and metrics are a pre-observability thing.