by strobe on 2/17/20, 8:21 PM with 2 comments
by sethammons on 2/18/20, 12:17 PM
There are standard metrics we monitor, and they go well beyond a heartbeat or a health check endpoint with a status for each dependency. We monitor success and error counts, counts of response codes, cache hit/miss ratio, latency, time spent on networked resources, time spent doing complex computation, load balance between regions, queue depths, and then specific metadata like user id, payload details like types of parameters used, size of requests, method of authentication, user agent, etc.
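To make a few of those metrics concrete, here is a rough sketch of recording them per request. It is an in-memory toy, not any particular monitoring library: the `Metrics` class, the metric names, and the choice to count only 5xx responses as errors are all assumptions made for illustration.

```python
from collections import Counter


class Metrics:
    """Tiny in-memory metric store (hypothetical, for illustration only)."""

    def __init__(self):
        # Each counter key is (metric_name, endpoint_label, value_label).
        self.counters = Counter()
        # Raw latency samples per endpoint, in milliseconds.
        self.latencies = {}

    def record_request(self, endpoint, status, latency_ms, cache_hit):
        # Assumption: only 5xx responses count as errors.
        outcome = "error" if status >= 500 else "success"
        self.counters[("requests", endpoint, str(status))] += 1
        self.counters[("outcome", endpoint, outcome)] += 1
        self.counters[("cache", endpoint, "hit" if cache_hit else "miss")] += 1
        self.latencies.setdefault(endpoint, []).append(latency_ms)

    def cache_hit_ratio(self, endpoint):
        hits = self.counters[("cache", endpoint, "hit")]
        misses = self.counters[("cache", endpoint, "miss")]
        total = hits + misses
        return hits / total if total else 0.0


m = Metrics()
m.record_request("/users", 200, 12.5, cache_hit=True)
m.record_request("/users", 200, 40.0, cache_hit=False)
m.record_request("/users", 503, 900.0, cache_hit=False)
m.record_request("/orders", 200, 20.0, cache_hit=True)
```

The endpoint is just one more label on each counter, so adding an endpoint adds no new metric names. A real system would ship these to a time series store rather than keep them in process memory.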
The key here is that the number of endpoints is not interesting. We just use a label and filter on that. What is interesting is how the metrics scale: requests per second, higher-cardinality labels, data aggregation over time, retention time, the ability to set alerts, trend analysis that can alert if this Tuesday morning's graphs are odd compared to other Tuesday mornings, support for math functions like derivatives, sums, and percentiles, etc.
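The Tuesday-morning comparison could be sketched roughly like this. It is one simple way to do it (a z-score against the same time slot in earlier weeks), not a description of how any specific product implements trend alerts. The function name, the threshold of 3 standard deviations, and the sample numbers are all made up for the example.

```python
from statistics import mean, stdev


def is_odd_compared_to_history(current, history, threshold=3.0):
    """Return True if `current` is more than `threshold` sample standard
    deviations away from the mean of `history` (same slot, earlier weeks)."""
    if len(history) < 2:
        # Not enough past weeks to estimate spread; do not alert.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical requests-per-minute at 9am on the previous five Tuesdays.
previous_tuesdays = [1020, 980, 1005, 995, 1010]

normal_morning = is_odd_compared_to_history(1000, previous_tuesdays)
quiet_morning = is_odd_compared_to_history(400, previous_tuesdays)
```

With these made-up numbers, a reading of 1000 sits well within the usual spread, while 400 is far outside it and would trigger the alert.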
If you are purely looking for what is important for an uptime service, the minimum is that it alerts me if a heartbeat fails for too long. But I would only use such a service on a hobby project. If an endpoint is in production, I want all the metrics I mentioned earlier as a minimum.
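For the bare-minimum uptime case, "fails for too long" might look something like the following sketch. The function name, the `(timestamp, ok)` input shape, and the 120-second limit are assumptions for illustration; a real service would also need scheduling, retries, and notification delivery.

```python
def should_alert(checks, max_failure_seconds):
    """Decide whether the current run of heartbeat failures has lasted too long.

    `checks` is a chronological list of (timestamp_seconds, ok) tuples.
    Returns True when the most recent unbroken run of failures spans at
    least `max_failure_seconds`.
    """
    first_failure = None
    for ts, ok in checks:
        if ok:
            # Any success resets the failure streak.
            first_failure = None
        elif first_failure is None:
            # Start of a new failure streak.
            first_failure = ts
    if first_failure is None:
        return False
    last_ts = checks[-1][0]
    return last_ts - first_failure >= max_failure_seconds


# Hypothetical heartbeat results, one check per minute.
down_for_a_while = [(0, True), (60, False), (120, False), (180, False)]
brief_blip = [(0, False), (60, True), (120, False)]

alert_sustained = should_alert(down_for_a_while, max_failure_seconds=120)
alert_blip = should_alert(brief_blip, max_failure_seconds=120)
```

Requiring a sustained run of failures rather than a single missed heartbeat keeps one dropped check from paging anyone.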