from Hacker News

Ask HN: Tool for dashboards / alerting on operational health metrics?

by collectedparts on 12/19/21, 4:55 AM with 5 comments

I'm looking for the right tool to use to track "operational" health, which in my context is mostly the size of various queues (e.g., how many pending withdrawals there are, split by type of withdrawal).

Maybe some other business KPIs, but the emphasis is on building "graphs that the operational team should be looking at on ~hourly basis to prioritize fixes and spot systemic regressions."

Ideally it would integrate with PagerDuty, in that it'd be easy to express "page me if X is above Y."
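
To make the "page me if X is above Y" rule concrete, here is a minimal sketch of what such a check might look like if done by hand. The function name, the `ops-cron` source label, the dedup key format, and the fixed `critical` severity are all assumptions for illustration; the field names follow the PagerDuty Events API v2 event format.

```python
from typing import Optional


def build_trigger_event(
    routing_key: str,
    metric: str,
    value: float,
    threshold: float,
    source: str = "ops-cron",  # hypothetical label for the reporting job
) -> Optional[dict]:
    """Return an Events API v2 "trigger" event if value is above threshold.

    Returns None when the metric is at or below the threshold, meaning
    there is nothing to page about.
    """
    if value <= threshold:
        return None
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # A stable dedup key lets repeated checks update one incident
        # instead of opening a new one on every run.
        "dedup_key": f"{metric}-above-threshold",
        "payload": {
            "summary": f"{metric} is {value}, above threshold {threshold}",
            "source": source,
            "severity": "critical",
        },
    }
```

The resulting dict would be sent as JSON in a POST to https://events.pagerduty.com/v2/enqueue, with the routing key taken from the integration on the PagerDuty service. A hosted tool would normally handle this step itself.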

My preference is to have my own cron job that captures the metrics from our queues/databases and pushes them to the tool via a ~REST API, but I could be convinced of a different method.
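
As a rough sketch of the cron-job approach described above, assuming a generic JSON-over-HTTP ingestion endpoint: the metric name `withdrawals.pending`, the payload shape, the `url` and `api_key` parameters, and the bearer-token header are all hypothetical and would need to match whatever tool is chosen.

```python
import json
import time
import urllib.request
from typing import Dict, List, Optional


def build_metric_points(
    queue_sizes: Dict[str, int], now: Optional[float] = None
) -> List[dict]:
    """Turn {withdrawal_type: pending_count} into one data point per type."""
    timestamp = int(now if now is not None else time.time())
    return [
        {
            "metric": "withdrawals.pending",  # hypothetical metric name
            "tags": {"type": withdrawal_type},
            "value": size,
            "timestamp": timestamp,
        }
        for withdrawal_type, size in sorted(queue_sizes.items())
    ]


def push_metrics(points: List[dict], url: str, api_key: str) -> int:
    """POST the points as JSON to a hypothetical ingestion endpoint.

    Returns the HTTP status code; urlopen raises on 4xx/5xx responses.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps({"points": points}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

The cron job would query the queues or databases for the counts, pass them to `build_metric_points`, and call `push_metrics` on each run.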

So far the only thing I can think of is Datadog, specifically https://www.datadoghq.com/solutions/real-time-business-intelligence/

Anything else I should be looking at? Or any other tips?

  • by XCSme on 12/19/21, 10:07 PM

    There's an indie-hacker I follow building this: https://chartbrew.com/

    From what I understand, it's more like an easier-to-use Grafana, where you can build charts/graphs from different data sources.

  • by yuppie_scum on 12/19/21, 6:07 AM

    SaaS-wise there's Datadog, Splunk, New Relic (expensive), and others.

    Self-hosted-wise: Prometheus/Grafana, ELK stack, old-school Nagios maybe.

    AWS also has managed Grafana/Prometheus for a fairly reasonable price. I would recommend this if you don't have a lot of time to mess around updating your Grafana stack every few months.

  • by AishwaryaVenkat on 12/20/21, 7:41 AM

    Try Atatus, which is affordable and also effective.

  • by Jugurtha on 12/19/21, 7:26 PM

    Prometheus and Grafana. That's what we use.