from Hacker News

Ask HN: What is the error alerting stack at your startup?

by aiunboxed on 10/7/23, 7:23 AM with 8 comments

If you work at a company with an engineering team of fewer than 30 people, what alerting stack are you using?

We tend to miss a lot of critical alerts that come in, simply because the alerting is not set up properly.

by slap_shot on 10/10/23, 5:26 PM
I'm surprised how often I speak to technical teams that do not utilize PagerDuty (or an equivalent alternative). As PagerDuty integrates with nearly any external system, it separates the collection of telemetry from the incident response lifecycle, i.e. what is wrong? who should be or is looking into this? what did we learn from this? how often is this happening?
Personally, I find notifications in Slack to be an anti-pattern: a lot of teams expect someone to just "pick up" the incident based on their availability or expertise, and _maybe_ the resolution is documented. Assigning direct responsibility by component and on-call schedule, and appending the RCA to the incident, reduces the time-to-resolution and the overall toil of the process.
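To make "direct responsibility by component and on-call schedule" concrete, here is a minimal sketch of the lookup involved. It is not PagerDuty's API; the `OWNERS` and `SCHEDULES` tables, the team names, and the engineer names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


# Hypothetical data: which team owns each component, and who is on call when.
OWNERS = {"checkout-api": "payments", "search": "discovery"}
SCHEDULES = {
    "payments": [
        Shift("alice",
              datetime(2023, 10, 9, tzinfo=timezone.utc),
              datetime(2023, 10, 16, tzinfo=timezone.utc)),
        Shift("bob",
              datetime(2023, 10, 16, tzinfo=timezone.utc),
              datetime(2023, 10, 23, tzinfo=timezone.utc)),
    ],
    "discovery": [
        Shift("carol",
              datetime(2023, 10, 9, tzinfo=timezone.utc),
              datetime(2023, 10, 23, tzinfo=timezone.utc)),
    ],
}


def assignee_for(component: str, at: datetime) -> str:
    """Return the engineer on call for the team that owns `component` at time `at`."""
    team = OWNERS.get(component)
    if team is None:
        raise LookupError(f"no owning team for component {component!r}")
    for shift in SCHEDULES.get(team, []):
        if shift.start <= at < shift.end:
            return shift.engineer
    raise LookupError(f"nobody on call for team {team!r} at {at.isoformat()}")
```

The lookup raises instead of returning a default, so an alert for an unowned component or an uncovered time slot surfaces as a gap in the tables rather than going to nobody.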
by nip on 10/7/23, 2:55 PM
Custom built monitoring on top of CloudWatch logs: we subscribe to the log groups and parse the logs.
Errors are reported in dedicated Slack channels.
The “MVP” was built in 1 week after we were faced with an outrageous bill from an observability vendor and decided to give a shot at implementing it ourselves.
In total I’d say that we invested about 2 additional weeks of work to get to where we are today.
It has worked extremely well for us and has needed little maintenance (granted we pay AWS to not have to do that maintenance)
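For anyone curious what "subscribe to the log groups and parse the logs" can look like, here is a rough sketch, not our actual code. A CloudWatch Logs subscription delivers each batch to a Lambda as base64-encoded, gzip-compressed JSON under `event["awslogs"]["data"]`. The `ERROR_MARKERS` tuple and the alert text format are assumptions made for illustration.

```python
import base64
import gzip
import json

# Assumption: a log line counts as an error if it contains one of these markers.
ERROR_MARKERS = ("ERROR", "CRITICAL", "Traceback")


def decode_subscription_event(event: dict) -> dict:
    """Decode the payload a CloudWatch Logs subscription delivers to Lambda."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def error_alerts(event: dict) -> list:
    """Return one alert string per log line that looks like an error."""
    payload = decode_subscription_event(event)
    # Control messages only check that the destination is reachable; skip them.
    if payload.get("messageType") != "DATA_MESSAGE":
        return []
    alerts = []
    for log_event in payload["logEvents"]:
        message = log_event["message"]
        if any(marker in message for marker in ERROR_MARKERS):
            alerts.append(f"[{payload['logGroup']}] {message.strip()}")
    return alerts
```

A Lambda handler would call `error_alerts(event)` and post each string to the channel that matches the log group.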
by mtmail on 10/7/23, 11:14 AM
StatusCake has a feature to call me. It's a horrible artificial voice "your website $name is down" but I'm fine with anybody shouting at me at 3am. The phone number is from the United States and I don't need to add it to my phone book because that's the only US phone calling me. (For people inside the US you might think it's another robocall)
by guybedo on 10/13/23, 2:55 AM
I keep it simple with an uptime monitoring service that monitors all the elements of my stack and runs tests every minute:
- regular http monitoring for websites
- run test queries on my sql & mongo databases
- check that rabbitmq queues are not overflowing
- check that Docker containers are up
If something goes wrong, I get email & Telegram alerts.
fwiw i'm using https://uptimefunk.com
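The general shape of such a setup, independent of any particular service, might be sketched like this. None of this is uptimefunk's API. `queue_depth_check`, `run_checks`, and the check names are hypothetical, and the real probes (HTTP request, SQL query, queue depth lookup) would be plugged in as plain callables.

```python
from typing import Callable, Dict, List, Optional

# A check returns None when healthy, or a short description of the problem.
Check = Callable[[], Optional[str]]


def queue_depth_check(get_depth: Callable[[], int], limit: int) -> Check:
    """Build a check that fails when a queue holds more than `limit` messages."""
    def check() -> Optional[str]:
        depth = get_depth()
        if depth > limit:
            return f"queue depth {depth} exceeds limit {limit}"
        return None
    return check


def run_checks(checks: Dict[str, Check]) -> List[str]:
    """Run every check once and return one failure message per unhealthy check."""
    failures = []
    for name, check in checks.items():
        try:
            problem = check()
        except Exception as exc:  # a check that crashes counts as a failure
            problem = f"check raised {exc!r}"
        if problem is not None:
            failures.append(f"{name}: {problem}")
    return failures
```

A scheduler would call `run_checks` once a minute and send any returned messages to email and Telegram.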
by rozenmd on 10/7/23, 10:55 AM
Uptime monitoring + cron job monitoring via OnlineOrNot (dogfooding my own product), with alerts going to PagerDuty (set up to email -> SMS -> call me if I don't acknowledge), and a "public" alert in a Slack channel.
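The email -> SMS -> call escalation boils down to a small timing rule, sketched here. The 5 and 10 minute delays in `ESCALATION_STEPS` are made-up values, not my actual PagerDuty settings, and `channels_due` is a hypothetical helper.

```python
from typing import List, Optional

# Hypothetical policy: (channel, seconds after the alert fires before it is used).
ESCALATION_STEPS = [("email", 0), ("sms", 300), ("call", 600)]


def channels_due(seconds_since_alert: float,
                 acknowledged_after: Optional[float] = None) -> List[str]:
    """Return the channels that should have fired so far.

    `acknowledged_after` is how many seconds after the alert it was acknowledged,
    or None if nobody has acknowledged it yet.
    """
    fired = []
    for channel, delay in ESCALATION_STEPS:
        if delay > seconds_since_alert:
            break  # this step is not due yet
        if acknowledged_after is not None and acknowledged_after <= delay:
            break  # acknowledged before this step, so stop escalating
        fired.append(channel)
    return fired
```

For example, `channels_due(400)` gives `["email", "sms"]`, while `channels_due(900, acknowledged_after=120)` gives only `["email"]` because the acknowledgement arrived before the SMS step.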
by girishso on 10/7/23, 10:45 AM
Nothing fancy: alerts are posted in a Slack channel.
by Cicero22 on 10/7/23, 2:18 PM
We have someone check Grafana a few times a day and alert us if there's an issue. Not great, but it works.
by 0xebo on 10/7/23, 12:05 PM
webhooks to slack
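A minimal sketch of what that can look like with only the standard library, assuming a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names are hypothetical, and the webhook URL is whatever Slack generated for your channel.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build the POST request for a Slack incoming webhook without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url: str, text: str, timeout: float = 5.0) -> int:
    """Send the alert and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the request builder separate from the sender makes the payload easy to test without touching the network.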