from Hacker News

Ask HN: What features do you need for uptime monitoring of 50 endpoints?

by strobe on 2/17/20, 8:21 PM with 2 comments

If you use an uptime monitoring service with more than 50 monitors, which service characteristics are critical when choosing a provider?

by sethammons on 2/18/20, 12:17 PM
The number of endpoints is not something I think about. One service I run has two endpoints, another has several hundred, another reads from a queue.
There are standard metrics we monitor, and they go beyond a heartbeat or health check endpoint that reports a status for each dependency. We monitor success and error counts, counts of response codes, cache hit/miss ratio, latency, time spent on networked resources, time spent doing complex computation, load balance between regions, queue depths, and then specific metadata like user ID, payload details like the types of parameters used, size of requests, method of authentication, user agent, etc.
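To make a few of the metrics above concrete, here is a minimal in-memory sketch that records response-code counts, latency, and cache hit/miss ratio per request. The `RequestMetrics` class, its method names, and the endpoint paths are hypothetical, and treating 5xx responses as "errors" is an assumption; a real setup would export these values to a metrics backend rather than hold them in memory.

```python
from collections import Counter, defaultdict


class RequestMetrics:
    """Hypothetical in-memory recorder, for illustration only."""

    def __init__(self):
        self.status_counts = Counter()         # (endpoint, status_code) -> count
        self.latencies_ms = defaultdict(list)  # endpoint -> observed latencies
        self.cache = Counter()                 # "hit" / "miss" -> count

    def record(self, endpoint, status_code, latency_ms, cache_hit):
        """Record one handled request."""
        self.status_counts[(endpoint, status_code)] += 1
        self.latencies_ms[endpoint].append(latency_ms)
        self.cache["hit" if cache_hit else "miss"] += 1

    def error_count(self, endpoint):
        """Count responses with a 5xx status (assumed definition of an error)."""
        return sum(
            count
            for (ep, code), count in self.status_counts.items()
            if ep == endpoint and code >= 500
        )

    def cache_hit_ratio(self):
        """Fraction of recorded requests served from cache."""
        total = self.cache["hit"] + self.cache["miss"]
        return self.cache["hit"] / total if total else 0.0


metrics = RequestMetrics()
metrics.record("/send", 200, 12.5, cache_hit=True)
metrics.record("/send", 503, 480.0, cache_hit=False)
metrics.record("/stats", 200, 30.0, cache_hit=True)
```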
The key here is that the number of endpoints is not interesting. We just use a label and filter on that. What is interesting is how well the metrics scale: requests per second, higher-cardinality labels, data aggregation over time, retention time, the ability to set alerts, trend analysis that can alert if this Tuesday morning's graphs are odd compared to other Tuesday mornings, support for math functions like derivatives, sums, and percentiles, etc.
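As a rough illustration of filtering on a label, computing a percentile, and flagging a Tuesday morning that looks odd next to earlier ones, consider the sketch below. All function names, the sample layout, and the three-standard-deviation threshold are assumptions made for the example; hosted monitoring products typically provide their own query functions for this.

```python
import math
import statistics


def filter_by_label(samples, **labels):
    """Keep samples whose labels match every given key/value pair."""
    return [
        s for s in samples
        if all(s["labels"].get(k) == v for k, v in labels.items())
    ]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def is_odd_vs_history(current, history, threshold=3.0):
    """True if `current` is more than `threshold` standard deviations
    from the mean of `history` (needs at least two historical values)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold


# Hypothetical samples: one label set per request, latency in milliseconds.
samples = [
    {"labels": {"endpoint": "/send"}, "latency_ms": 10},
    {"labels": {"endpoint": "/send"}, "latency_ms": 40},
    {"labels": {"endpoint": "/stats"}, "latency_ms": 900},
]
send_latencies = [s["latency_ms"] for s in filter_by_label(samples, endpoint="/send")]
send_p95 = percentile(send_latencies, 95)

# Hypothetical p95 latencies from four previous Tuesday mornings.
previous_tuesdays = [120, 125, 118, 122]
normal_morning = is_odd_vs_history(123, previous_tuesdays)
odd_morning = is_odd_vs_history(180, previous_tuesdays)
```

The endpoint is just another label here, which is why adding more endpoints does not change the shape of the query.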
If you are purely asking what matters for an uptime service, the minimum is that it alerts me if a heartbeat fails for too long. But I would only use such a service on a hobby project. If an endpoint is in production, I want all the metrics I mentioned earlier as a minimum.
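The bare-minimum behavior described above, alerting only when a heartbeat has failed for too long, could be sketched as follows. `HeartbeatMonitor`, `grace_seconds`, and the timestamps are hypothetical names and values chosen for illustration, and time is passed in explicitly so the logic is easy to follow.

```python
class HeartbeatMonitor:
    """Hypothetical monitor: alerts once no successful heartbeat
    has been seen for longer than `grace_seconds`."""

    def __init__(self, grace_seconds, started_at):
        self.grace_seconds = grace_seconds
        self.last_ok = started_at  # time of the last successful heartbeat

    def record(self, ok, at):
        """Record the result of one heartbeat check at time `at` (seconds)."""
        if ok:
            self.last_ok = at

    def should_alert(self, now):
        """True when the last success is older than the grace period."""
        return now - self.last_ok > self.grace_seconds


monitor = HeartbeatMonitor(grace_seconds=120, started_at=0)
monitor.record(ok=True, at=30)
monitor.record(ok=False, at=60)
monitor.record(ok=False, at=90)

brief_blip = monitor.should_alert(now=100)   # 70 s since last success
long_outage = monitor.should_alert(now=200)  # 170 s since last success
```

The grace period is what keeps a single failed check from paging anyone; only a sustained failure crosses it.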