by frugal10 on 9/21/24, 1:56 PM with 56 comments
The challenge I’m currently facing is ensuring that our on-call engineers have sufficient time to focus on system improvements, particularly enhancing operational experience (Opex). Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.
I am looking for a framework that will allow me to:
- Clearly define on-call priorities, balancing immediate production needs with Opex improvements.
- Manage long-term fixes related to past on-call issues without overwhelming current on-call engineers.
- Create a structured approach that ensures ongoing focus on improving operational experience over time.
by cbanek on 9/26/24, 1:49 AM
If you don't have enough time to run the system and you also have to do new feature work, one has to give way to the other, or you have to hire additional people (but this rarely solves the problem; if anything, it tends to make things worse for a while until the new person gets their bearings).
One approach that is very simple but not easy: the on-call engineer does no feature work and only works on on-call issues, investigating and fixing them, for the period they are on-call, and if nothing is on fire, they spend the time improving the system. This helps with things like comp time ("worked all night on the issue, now I have to show up all day tomorrow too???") and lets people actually fix issues rather than just restart services. It also gives the on-call person the agency to help fix the problems, rather than just deal with them.
by gobins on 9/26/24, 1:47 AM
1. The roster is set weekly. You need at least 4-5 engineers so that each person is rostered no more than once per month; any more often than that and your engineers will burn out. (A minimal roster sketch follows this list.)
2. There is always a primary and a secondary. The secondary gets called when the primary cannot be reached.
3. You are expected to triage the issues that come up during your on-call roster, but not to work on long-term fixes. That is something to bring to the team discussion and allocate; no one wants to do too much maintenance work.
4. Your top priorities should be the issues that come up repeatedly and burn your productivity. This could take up to a year. Once things settle down, your engineers should be free enough to work on things they are interested in.
5. For any cross-team collaboration that takes more than a day, the manager should be the point of contact so that your engineers don't get shoulder-tapped and pulled away from what they are working on.
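A minimal sketch of what such a weekly roster could look like in code, purely illustrative; the engineer names, team size, and start date are placeholder assumptions, not anything from this thread:

```python
from datetime import date, timedelta

# Hypothetical team of five, so each engineer is primary roughly once a month.
ENGINEERS = ["alice", "bob", "carol", "dave", "erin"]

def weekly_roster(start: date, weeks: int):
    """Yield (week_start, primary, secondary) as a simple round-robin.

    The secondary is the next engineer in the rotation and is only
    paged when the primary cannot be reached (point 2 above).
    """
    for week in range(weeks):
        primary = ENGINEERS[week % len(ENGINEERS)]
        secondary = ENGINEERS[(week + 1) % len(ENGINEERS)]
        yield start + timedelta(weeks=week), primary, secondary

for week_start, primary, secondary in weekly_roster(date(2024, 9, 30), weeks=10):
    print(f"{week_start}: primary={primary}, secondary={secondary}")
```

With five engineers this keeps each person at roughly one primary week per month, matching point 1.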
Hope this helps.
by seniortaco on 9/26/24, 2:29 AM
Driving down on-call load is all about working smarter, not necessarily harder. 30% of the issues likely need to be fixed by another team. Identify these ASAP and hand them off so that team can work in parallel while your team focuses on the issues you "own".
Set up a weekly rotation for issue triage and mitigation. The on-call engineer should respond to issues, prioritize them by severity, mitigate impact, and create and track tickets to fix the root cause. These go into an operational backlog. This is one full-time headcount on your team (but rotated).
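One way that triage-and-track loop could be made concrete, as a rough sketch; the fields, severity levels, and ticket ID below are illustrative assumptions, not an actual tracker schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Severity(Enum):
    SEV1 = 1   # customer-facing outage, mitigate immediately
    SEV2 = 2   # degraded service
    SEV3 = 3   # annoyance / alert noise

@dataclass
class OncallIssue:
    title: str
    severity: Severity
    owning_team: str                         # hand off early if it isn't yours
    mitigated: bool = False
    root_cause_ticket: Optional[str] = None  # lands in the operational backlog

# The weekly triage loop: respond, prioritize by severity, mitigate,
# then file a root-cause ticket so the real fix is tracked in the backlog.
issue = OncallIssue("API latency spike", Severity.SEV2, owning_team="platform")
issue.mitigated = True
issue.root_cause_ticket = "OPS-1234"         # hypothetical tracker ID
```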
To address the operational backlog, you need to build role expectations with your entire team. It helps if leadership is involved. Everyone needs to understand that in terms of career progression and performance evaluation, operational excellence is one of several role requirements. With these expectations clearly set, review progress with your directs in recurring 1-1s to ensure they are picking up and addressing operational excellence work, driving down the backlog.
by ipnon on 9/26/24, 4:08 AM
Management is incentivized to minimize time spent on alert because it is now cheaper to fix the root-cause issues than to have engineers play firefighter on weekends. Long-term, which is always the only relevant timeline, this saves money by reducing engineer burnout and churn.
Engineers are also incentivized to self-organize. Those who have more free time or are seeking more compensation can volunteer for more on-call. Those who have more strict obligations outside of work thus can spend less time on alert, or ideally none at all. In this scenario, even if the root cause is never addressed, usually the local "hero" quickly becomes so inundated with money and vacation time that everyone is happy anyway.
It doesn't completely eliminate the need for on-call or the headaches that alerts inevitably induce but it helps align seemingly opposing parties in a constructive manner. Thanks to Will Larson for suggesting this solution in his book "An Elegant Puzzle."
by tthflssy on 9/23/24, 9:02 AM
First, clarify the charter of your team: what should be in your team's ownership? Do you have to do everything you are doing today? Can you say no to production feature development for some time? Who do you need to convince: your team, your manager, or the whole company?
Figure out how to measure and assign value to opex improvements, e.g. having only 1-2 on-call issues per week instead of 4-5 is a saving in engineering time and is measurable in reliability (SLA/SLO, as mentioned in another comment). Then you will understand how much time those fixes are worth and which opex ideas are worth pursuing.
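As a back-of-the-envelope illustration of assigning that value: assuming roughly 3 engineer-hours per on-call issue and a $100/hour loaded cost (both made-up numbers), going from 4-5 issues a week to 1-2 looks like this:

```python
# Back-of-the-envelope value of an opex fix (all inputs are assumptions).
issues_before, issues_after = 4.5, 1.5   # on-call issues per week
hours_per_issue = 3                      # assumed triage + mitigation + follow-up
hourly_cost = 100                        # assumed loaded engineering cost, $/hour

weekly_hours_saved = (issues_before - issues_after) * hours_per_issue
yearly_savings = weekly_hours_saved * hourly_cost * 52
print(f"~{weekly_hours_saved:.0f} engineer-hours/week, ~${yearly_savings:,.0f}/year")
# That figure bounds how much time a given opex fix is worth spending on.
```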
Improving the efficiency of your team: are they making the right decisions and taking the right initiatives / tickets?
Argue for headcount and you will have more bandwidth after some time. Or split two people off to work only on opex improvements, and give those initiatives administrative priority (if the rest of the team can handle on-call).
by matt_s on 9/22/24, 5:01 PM
The team needs to collectively work project work _and_ opex work coming from on-call. On-call should be a rotation through the team. Runbooks should be created for how to handle common scenarios and iterated on to keep them up to date.
Project work and opex work are related: if a separate team handles on-call apart from project work, there isn't a sense of ownership of the product, since it's like throwing things over a wall for another team to clean up the mess.
by windows2020 on 9/26/24, 2:22 AM
2) Automate application monitoring by alerting at thresholds. Tweak alerts until they're correct, and resolve items that trigger false positives. (A sketch of this kind of threshold logic follows the list.)
3) If issues are coming from a system someone who is still there designed, they should handle those calls.
4) You mention long-term fixes for on-call issues. First focus on short-term fixes.
5) Set a new expectation that on-call issues are unexpected exceptions. If they occur, the root cause should be resolved. But see point 4.
6) On-call issues become so rare that there's simply an ordered list of people to call in the event of an issue. The team informally ensures someone is always available, and if something happens, everyone else who's available is happy to jump on a call to help understand what's going on and, if conditions permit, permanently resolve it the next business day.
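For point 2, a rough sketch of the threshold-plus-debounce idea that keeps single noisy samples from paging anyone; the 5% error-rate threshold and 5-sample window are assumptions to be tuned, not recommendations:

```python
from collections import deque

ERROR_RATE_THRESHOLD = 0.05   # assumption: page when >5% of requests fail
WINDOW = 5                    # consecutive samples that must breach before paging

recent = deque(maxlen=WINDOW)

def should_page(error_rate: float) -> bool:
    """Page only when the threshold is breached for the whole window,
    so a single noisy sample never wakes anyone up."""
    recent.append(error_rate > ERROR_RATE_THRESHOLD)
    return len(recent) == WINDOW and all(recent)

# Tuning loop: every false positive should end in either a threshold/window
# tweak here or a fix for whatever actually made the metric spike.
```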
by __s on 9/21/24, 3:48 PM
At Microsoft I headed Incident Count Reduction on my team where opex could be top priority & rotating on call would have a common thread between shifts through me (ie, I would know which issues were related or not, what fixes were in the pipe, etc)
I'm guessing the above isn't an option for you, but you can try to drive an understanding that while someone is on call there is no expectation for them to work on anything else. That means subtracting on-call headcount during project planning.
by AdieuToLogic on 9/26/24, 2:47 AM
Being on-call and also responsible for asynchronous alert response is its own, distinct, job. Especially when considering:
> Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.
The framework you seek could be:
- hire and train enough support personnel to perform requisite monitoring
- take your development engineers out of the on-call rotation
- treat operations concerns the same as production features, prioritizing accordingly
The last point is key. Any system change, be it functional enhancements, operations related, or otherwise, can be approached with the same vigor and professionalism.
It is just a matter of commitment.
by cjcenizal on 9/26/24, 2:11 AM
At the very least it's a fun read!
[0] https://www.amazon.com/Phoenix-Project-DevOps-Helping-Busine...
by shoo on 9/23/24, 8:58 AM
are you / your team empowered to push back and decline being responsible for certain services that haven't cleared some minimum bar of stability? e.g. "if you want to put it into prod right away, we won't block you from deploying it, but you'll be carrying the pager for it"
by sholladay on 9/26/24, 4:43 AM
As for the schedule, I would recommend each engineer have a 3-night shift and then a break for a couple of weeks. Ideally, they will self-assign to certain slots. Early in the week/month might be better/worse for different people.
I strongly suggest that engineers not work on ops engineering or past on-call issues while they themselves are on-call, otherwise there is a very strong incentive for them to reduce alerts, raise thresholds, and generally make the system more opaque. All such work should be done between on-call shifts, or better yet, by engineers who are never on-call.
One way that on-call engineers can contribute when there is no current incident ongoing is to write documentation. Work on runbooks. What to do when certain types of errors occur. What to do for disaster recovery.
by maerF0x0 on 9/26/24, 2:40 AM
4-5 PagerDuty pages is either 1) bad software or 2) mistuned alerts.
4-5 cross-team requests plus customer-service escalations, with <= 1 page per week, is not that bad, and can likely be handled by 1-week rotations, with a cooperative team covering 3-4 two-hour "breaks" where the person can work out, be with their kids/spouse, or forest bathe. That would be a decent target.
For me the best experience across >15 years was at a company that did 2-week sprints. For one week you'd be primary, for one week secondary, and then for four weeks you'd be off rotation. The primary spent 100% of their time as the interrupt handler: fixing bugs, cross-team requests, customer escalations, and pages; if they ran out of work they focused on tuning alerts or improving stability even further. So you lose one member of your team permanently to KTLO. IMO you gain more than you lose by letting the other 5-7ish engineers be fully focused on feature work.
> Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.
Have a backbone, tell someone above you "no".
by jmmv on 9/26/24, 2:42 AM
* Clearly delineate what is on-call work and how many people pay attention to it, and protect the rest of the team from that work. Otherwise, it's too easy for the team at large to fall prey to on-call toil: the time goes unaccounted for, everybody ends up distracted by recurrent issues, siloing increases, and stress builds up. I wrote about this at length here: https://jmmv.dev/2023/08/costs-exposed-on-call-ticket-handli...
* Set up a fair on-call schedule that minimizes the chance of people having to ask for swaps later on, while ensuring that everybody is on-call roughly the same amount of time. Having to ask for swaps is stressful, particularly for new / junior folks. E.g. PagerDuty will let you create a round-robin rotation but lacks these "smarter" abilities. I wrote about how this could work here: https://jmmv.dev/2022/01/oncall-scheduling.html
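A minimal sketch of what "smarter than round-robin" could mean in practice: for each week, pick the available engineer who has been on-call the least so far. The names, conflicts, and greedy rule are assumptions for illustration, not the algorithm from the linked post:

```python
# Greedy "fair" scheduler: unavailability is gathered up front so swaps are
# rarely needed later, and load stays balanced across the team.
unavailable = {        # hypothetical PTO / conflicts, keyed by week number
    "alice": {3},
    "bob": set(),
    "carol": {1, 2},
    "dave": set(),
}

def schedule(weeks: int) -> list:
    counts = {name: 0 for name in unavailable}
    plan = []
    for week in range(weeks):
        candidates = [n for n in counts if week not in unavailable[n]]
        pick = min(candidates, key=lambda n: (counts[n], n))
        counts[pick] += 1
        plan.append(pick)
    return plan

print(schedule(8))   # everyone ends up on-call roughly the same number of weeks
```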
by rozenmd on 9/26/24, 6:48 AM
- You build it, you run it
If your team wrote the code, your team ensures the code keeps running.
- Continuously improve your on-call experience
Your on-call staff shouldn't be on feature work during their shift. Their job is to improve the on-call experience while not responding to alerts.
- Good processes make a good on-call experience
In short, keep and maintain runbooks/standard operating procedures.
- Have a primary on-call, and a secondary on-call
If your team is big enough, having a secondary on-call (essentially, someone responding to alerts only during business hours) can help train up newbies, and improve the on-call experience even faster.
- Handover between your on-call engineers
A regular mid-week meeting to pass the baton to the next team member ensures ongoing investigations continue and that nothing falls through the cracks.
- Pay your staff
On-call is additional work, pay your staff for it (in some jurisdictions, you are legally required to).
More: https://onlineornot.com/incident-management/on-call/improvin...
by coderintherye on 9/26/24, 1:45 AM
For your level: your team and org size is large enough that you should be able to commit someone half- or full-time to focusing on Opex improvements as their sole or primary responsibility. Ask your team; there's likely someone who would actually enjoy focusing on that. If not, advocate for headcount for it.
Edit: Also ensure you have created playbooks for on-call engineers to follow, along with a documentation culture that records the resolutions to the most common issues, so that as those issues arise again they can be easily dealt with by following the playbook.
Note: This is unpopular advice here because most people here don't want to spend their lives bug-fixing, but in reality it's a method that works when you have the right person who wants to do it.
by jaygreco on 9/26/24, 4:18 AM
by Joel_Mckay on 9/26/24, 2:05 AM
1. Call center support desk with documented support issues, and most recent successful resolutions.
2. Junior level technology folks dispatched for basic troubleshooting, documented repair procedures, and testing upper support level solutions
3. Specialists that understand the core systems, process tier 2 bug reports, and feed back repairs/features into the chain
4. Bipedal lab critters involved in research projects... if you are very quiet, you may see them scurry behind the rack servers, back into the shadows.
Managers tend to fail when asking talent to triple/quadruple wield roles at a firm.
No App is going to fix how inexperienced coordinators burn out staff. =3
by kyrra on 9/26/24, 3:09 AM
As well as these two from the management section: https://sre.google/sre-book/dealing-with-interrupts/ and https://sre.google/sre-book/operational-overload/
by dpifke on 9/26/24, 3:30 AM
by jlund-molfese on 9/26/24, 2:04 AM
That being said, some advice:
> Clearly define on-call priorities
Sit down with your team, and, if necessary, one or two stakeholders. Create a document and start listing priorities and SLAs during a meeting. The goal isn't actually the doc itself, but when you go through this exercise and solicit feedback, people should raise areas where they disagree and point out things you haven't thought of. The ordering is up to what matters to your team, but most people will tie things to revenue in some way. You can't work on everything, and the groups that complain most loudly aren't necessarily the ones who deserve the most support.
> balancing immediate production needs with Opex improvements
Well, first, are your 'immediate production needs' really immediate? If your entire product is unusable that might be the case, but certain issues, while qualifying as production support, don't need to be prioritized immediately, and can be deferred until enough of them exist at the same time to be worked on together. Otherwise you can start by committing to certain roadmap items and then do as much production support as you have time for. Or vice-versa. A lot of this depends on the stage of your company; more mature companies will naturally prioritize support over a sprint to viability.
> Manage long-term fixes related to past on-call issues without overwhelming current on-call engineers. Create a structured approach that ensures ongoing focus on improving operational experience over time.
Whenever a support task or on-call issue is completed, you should keep track of it by assigning labels or simply listing it in some tracking software. To start off, you might have really broad categories like "customer-facing" and "internal-facing" or something like that. If you find that you're spending 90% of your support time on a particular service or process, that's a good sign that investment in that area could be valuable. Over time, especially as you get a better handle on support, you should make the categories more granular so you can focus more specifically. But not so granular that only one issue per month falls into them or anything like that.
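A tiny sketch of the kind of roll-up that makes the "90% of support time" signal visible; the labels and hour counts are invented:

```python
from collections import Counter

# Completed support/on-call tasks, labeled when they were closed (invented data).
closed_tasks = [
    {"label": "billing-service", "hours": 6},
    {"label": "billing-service", "hours": 4},
    {"label": "internal-tools",  "hours": 1},
    {"label": "billing-service", "hours": 5},
]

hours_by_label = Counter()
for task in closed_tasks:
    hours_by_label[task["label"]] += task["hours"]

total = sum(hours_by_label.values())
for label, hours in hours_by_label.most_common():
    print(f"{label}: {hours}h ({hours / total:.0%} of support time)")
# A label eating most of the time is a strong signal for where investment pays off.
```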
by brudgers on 9/21/24, 2:14 PM
by ojbyrne on 9/26/24, 4:05 AM
by matrix87 on 9/26/24, 2:37 AM
The way my company does it, on-call rotates around the team. The designated on-call person isn't expected to work on anything else.
by Fire-Dragon-DoL on 9/26/24, 6:58 AM
by nick3443 on 9/26/24, 4:07 AM
by theideaofcoffee on 9/26/24, 2:42 AM
Someone gets called in the middle of the night? Let them take the morning to recover, no questions asked; better yet, the entire day if it was a particularly hairy issue. This is the time where your mettle as a manager is really tested against your higher-ups. If your people are putting in unscheduled time, you had better be ready to cough up something in return.
Figure out what's commonly coming up and root cause those issues so they can finally be put to bed (and your on-call can go back to bed, hah).
Everyone that touches a system gets put on call for that same system. That creates an incentive to make it resilient so they don't have to be roused and so there's less us-vs-them and throwing issues over the wall.
Beyond that, if someone is on call, that's all they should be doing. No deep feature work, they really should be focusing on alerts, what's causing them, how to minimize, triaging and then retro-ing so they're always being pared down.
Lean on your alerting system to tell you the big things: when, why, how often, all that. The idea is you should understand exactly what is happening and why, you can't do much to fix anything if you don't know the why.
Look at your documentation. Can someone that is perhaps less than familiar with a given system easily start to debug things, or do they need to learn the entire thing before they can start fixing? Make sure your documentation is up to date, write runbooks for common issues (better yet, do some sort of automation work to fix those, computers are good at logic like that!), give enough context that being bleary eyed at 3:30am isn't that much of a hindrance. Minimize the chances of having to call in a system's expert to help debug. Everyone should be contributing there (see my fourth line above).
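As a deliberately small illustration of the "automate the common fixes" aside: a watchdog that encodes one runbook step instead of paging a human for it. The health endpoint and service name are hypothetical, and a real version would need the usual safeguards (rate limiting, alerting when the automated fix doesn't work):

```python
import subprocess
import urllib.request

HEALTH_URL = "http://localhost:8080/healthz"   # hypothetical health endpoint
SERVICE = "example-api"                        # hypothetical systemd unit name

def healthy() -> bool:
    """Return True if the service's health endpoint answers 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

# Encode the runbook step ("check health, restart if down") instead of asking
# a bleary-eyed human at 3:30am to type the same commands by hand.
if not healthy():
    subprocess.run(["systemctl", "restart", SERVICE], check=False)
```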
Make sure you are keeping an eye on workload too. You may need to think about increasing the number of people on your team if actual feature work isn't getting done because you're busy fighting fires.
by aaomidi on 9/26/24, 4:37 AM
This is extremely important imo. It sets a positive culture and makes people want to do oncall rather than hate and dread it.
by ivanstojic on 9/26/24, 4:20 AM
Longer reply:
I have on-call experience for major services (DynamoDB front door, CosmosDB storage, OCI LoadBalancer). Seen a lot of different philosophies. My take:
1. on-call should document their work step by step in tickets and make changes to operational docs as they go: a ticket that just has "manual intervention, resolved" after 3 hours is useless; documenting what's happening is actually your main job; if needed, work to analyze/resolve acute issues can be farmed out
2. on-call is the bus driver, shouldn't be tasked with handling long term fixes (or any other tasks beyond being on-call)
3. handover between on-calls is very important, prevents accidentally dropping the ball on resolving longer time horizon issues; handover meetings
Probably the most controversial one: a separate rotation (with a longer window, e.g. 2 weeks) should handle tasks that are RCA-related or drive fixes to prevent recurrence.
Managers should not be first tier on any pager rotation; if you wouldn't approve pull requests, you shouldn't be on the rotation (other than as a second-tier escalation). The reverse should also hold: if you have the privilege to bless PRs, you should take your turn in the hot seat.
by mise_en_place on 9/26/24, 1:37 AM
by parasense on 9/26/24, 3:15 AM
https://wiki.en.it-processmaps.com/index.php/Problem_Managem...
Your on-call folks need a way to be free of the broader problem analysis and focus on putting out the fires. The folks in problem management will take the steps to prevent problems from ever manifesting.
Once upon a time I was into Problem Management, and one issue that kept coming up was server OS patching, where the Linux systems crashed upon reboot after a new kernel had been applied, etc. The customers were blaming us, we were blaming the customers, and round and round it went. Anyhow, the new procedure was something like this: any time there was routine maintenance that would result in the machine rebooting (e.g. kernel updates), the whole system had to be brought down and back up first, a pre-reboot, to prove it was viable for upgrades. Lo and behold, machines belonging to a certain customer had a tendency not to recover after this pre-reboot. That would stop the upgrade window in its tracks, and I would be given a ticket for the next day to investigate why the machine was unreliable. Hint: a typical problem was Oracle admins playing god with /etc/fstab, among many other shenanigans. We eventually got that company to a place where the tier-2 on-call folks could have a nice life outside of work.
But I digress...
> Opex ...
Usually that term means "Operational Expenditure", as opposed to "Capex", or Capital Expenditure. It's your terminology, so it's fine, but I'd NOT say those kinds of things to anybody publicly. You might get strange looks.
I'd say let one or two of the on-call folks be given a block of a few hours each week to think of ways to kill recurring issues. Let them take turns, and give them concrete incentives to achieve results, something like a $200 bonus per resolved problem. That leads us into the next issue, which is monitoring and logging of the issues: if you hired consultants to come in tomorrow and you don't even have stats, there's nothing anybody could do.
Good luck
by uaas on 9/22/24, 6:29 PM
by crdrost on 9/26/24, 6:58 AM
Tip 1: Everyone has opinions about on-call. Try a bunch, see what works.
Frameworks for this stuff are usually either sprint-themed, or they're SLO-flavored. Both of those are popular because they fit into goalsetting frameworks. You can say "okay this sprint what's our ticket closure rate" or you can say "okay how are we doing with our SLOs." This also helps to scope oncall: are you just restoring service, are you identifying underlying causes, are you fixing them? But those frameworks don't directly organize. Still, it's worth learning these two points from them:
Tip 2: You want to be able to phrase something positive to leadership even if the pagers didn't ring for a little bit. That's what these both address.
Tip 3: There is more overhead if you don't just root-cause and fix the problems that you see. However if you do root-cause-and-fix, then you may find that sprint planning for the oncall is "you have no other duties, you are oncall, if you get anything else done that's a nice-to-have."
Now, turning to organization... you are lucky in that you have a specific category of thing you want to improve: opex. You are unlucky that your oncall engineers are being pulled into either carryover issues or features.
I would recommend an idea that I've called "Hot Potato Agile" for this sort of circumstance. It is somewhat untested but should give a good basic starting spot. The basic setup is,
• Sprint is say 2 weeks, and intended oncall is 1 week secondary, then 1 week primary. That means a sprint contains 3 oncall engineers: Alice is current primary, Bob is current secondary and next primary, Carol is next secondary.
• At sprint planning everybody else has some individual priorities or whatever, Alice and Carol budget for half their output and Bob assumes all his time will be taken by as-yet-unknown tasks.
• But, those 3 must decide on an opex improvement (or tech debt, really any cleanup task) that could be completed by ~1 person in ~1 sprint. This task is the “hot potato.” Ideally the three of them would come up with a ticket with like a hastily scribbled checklist of 20ish subtasks that might each look like it takes an hour or so.
Now, stealing from Goldratt: at any overwhelmed workplace there is a rough priority hierarchy, where everything is either Hot, Red Hot, or Drop Everything and DO IT NOW. Oncall takes on DIN and some RH, specifically the Red Hots that are embarrassing if we're not working on them over the rest. The hot potato is clearly a task from H; it doesn't have the same urgency as other tasks, yet we are treating it with that urgency. In programming terms it is a sentinel value, a null byte. This is to leverage some more of those lean manufacturing principles: create slack in the system, etc.
• The primary oncall has the responsibility of emergency response including triage and the authority to delegate their high-priority tasks to anyone else on the team as their highest priority. The hot potato makes this process less destructive by giving (a) a designated ready pair of hands at any time, and (b) a backup who is able to more gently wind down from whatever else they are doing before they have to join the fire brigade.
• The person with the hot potato works on its subtasks in a way that is unlike most other work you're used to. First, they have to know who their backup is (volunteer/volunteer); second, they have to know how stressed out the fire brigade is; communicating these things takes some intentional effort. They have to make it easy for their backup to pick up where they left off on the hot potato, so ideally the backup is reviewing all of their code. Lots of small commits, they are intentionally interruptable at any time. This is why we took something from maintenance/cleanup and elevated it to sprint goal, was so that people aren't super attached to it, it isn't actually as urgent as we're making it seem.
Hope that helps as a framework for organizing the work. The big hint is that the goals need to be owned by the team, not by the individuals on the team.