by cdev on 9/22/20, 2:52 AM with 130 comments
by crossroadsguy on 9/22/20, 4:22 AM
Production support alone is not that much of a problem. What the author skipped (conveniently, or just forgot to mention?) is that the real problem is the "on call" phenomenon.
In the "typical" on-call, when you are on call you are magically on call 24x7. Yes, during your sleeping hours as well, as if those matter less and the company can avoid spending money on dedicated support for those hours and make you suffer instead (yes, it's just that; there's no other name for it like "satisfaction", "learning", "growing" or any of those buzzwords).
You want engineers to do production support? Well, let them do it during normal office hours and only a few times a month. Or heck, let them do it for weeks but let them punch in and punch out at normal office hours. Let them choose to do only one half of the day and have someone else who is willing do the other half.
There's no excuse for burning out engineers (esp. unsuspecting youngsters) by pushing them into ungodly hours of work ruining their health among other things while trying to constantly tell them - "do you even realise what a service to humanity you are doing!".
It's just exploitation.
by MattGaiser on 9/22/20, 3:48 AM
They had to quit to get out of support.
by woutr_be on 9/22/20, 4:11 AM
However, production support teams don’t have a real understanding of our application and how it’s built. So most of the time you have engineers on a call with production support, telling them how to debug the problem and come up with relevant logs.
It’s incredibly infuriating and time-consuming, and I absolutely hate doing it this way.
90% of the time you also get incredibly vague bug reports with irrelevant logs and a description of what they think the problem is. Most of the time you need to spend another day finding the correct logs and somehow debugging it. Most teams log every single request with all parameters and payloads so that they can just replicate the problem locally instead of relying on production support.
We’ve long advocated for either having dedicated support or putting engineers on some sort of support schedule.
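For anyone wondering what "log every request with all parameters and payloads" can look like in practice, here is a minimal sketch, not what any particular team runs. The function names (`log_request`, `load_request`), the logger name, and the record fields are all made up for illustration; the only idea taken from the comment is that each request is written out in a form that can be read back and replayed locally.

```python
import json
import logging

# Hypothetical logger name; any logging setup would do.
logger = logging.getLogger("request_log")

def log_request(method: str, path: str, params: dict, payload: dict) -> str:
    """Serialize one request as a single JSON line and log it.

    The field names are illustrative assumptions, not a standard format.
    """
    record = json.dumps(
        {"method": method, "path": path, "params": params, "payload": payload},
        sort_keys=True,
    )
    logger.info(record)
    return record

def load_request(record: str) -> dict:
    """Parse a logged line back into a dict so it can be replayed locally."""
    return json.loads(record)
```

A developer could then take a line out of the production log, pass it to `load_request`, and fire the same method, path, parameters and payload at a local instance.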
by sdevonoes on 9/22/20, 10:50 AM
My code will crash sooner or later. I already know that. I don't write 100% bug-free code. But I cannot accept giving 100% of my time one week per month or so to a company in exchange for money. I just don't understand why people can't understand that I can be a professional for only 8 hours per day, and no more.
by ocdtrekkie on 9/22/20, 4:11 AM
I would argue all developers should be required to do some support work.
by seanwilson on 9/22/20, 5:48 AM
I'm not saying you wouldn't learn from working on production, but whether it's worth the stress is another question. In terms of software development, it's hard to think of a worse feeling than when you do a production deploy, you hit refresh on the website or whatever it is, and it shows a fatal error, then there's a mad scramble to roll back the change and figure out quickly what went wrong before the consequences grow too great. Most of the time bosses + coworkers aren't that understanding about it either and get into finger-pointing.
by efitz on 9/22/20, 5:22 AM
by AkshatM on 9/22/20, 4:17 AM
Production support is customer support: responding to chat messages or communications from users.
An on-call rotation, on the other hand, involves responding to production incidents and mounting a proper incident response.
The Google SRE workbook has a great chapter on the subject: https://landing.google.com/sre/workbook/chapters/on-call/
by brailsafe on 9/22/20, 11:33 AM
by greesil on 9/22/20, 4:03 AM
by aprinsen on 9/22/20, 4:10 AM
I have always had mixed feelings about "on call". I dread my turn on the rotation because the imminent threat of a prod issue has a psychological impact on my entire week, even off hours, and usually for a day or two after.
If everybody on the team feels that way, maybe it can act as a forcing function for product quality. I've seen this work on teams that already cultivate a strong sense of ownership.
On the flip side, it really stresses me out, and I sometimes resent that I'm not getting paid overtime for 24hr on call days. Maybe that's just baked into an engineer's salary these days, though...
by wisecoder on 9/22/20, 12:47 PM
by jake_morrison on 9/22/20, 11:15 AM
A good structure is to have first line support be relatively generic ops people. They can handle problems related to infrastructure, e.g. hardware failures, network problems, or issues that can be handled by adding resources. The deployment process should be consistent enough across applications that they can e.g. roll back to a previous release.
This covers the majority of production problems. After that, it's time to bring in someone who understands the details of how the application works. If the dev team is geographically distributed, then someone is available during working hours. Otherwise, we have to get someone out of bed.
If the dev team has done their job right, this should be a rare occasion. Making the dev team fully responsible for the reliability of the application means that they are motivated to make it reliable. Otherwise there is a tendency to have an underclass of ops people who get abused.
A fundamental mindset here is taking responsibility for the user experience, including reliability. If this is not owned by the product development team, then who?
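The tiered structure described above can be sketched as a small routing rule. This is only an illustration under stated assumptions: the category labels, the `Incident` fields, the 9-to-17 working window, and the `route` function are all hypothetical, chosen to mirror the comment (ops first, a developer during working hours if one is available, otherwise someone gets paged).

```python
from dataclasses import dataclass

# Hypothetical labels for problems that generic first-line ops can handle:
# infrastructure failures, resource shortages, and rolling back a release.
OPS_CATEGORIES = {"hardware", "network", "capacity", "bad_release"}

@dataclass
class Incident:
    category: str        # illustrative label, e.g. "network" or "application_bug"
    dev_local_hour: int  # current hour (0-23) for the best-placed developer

def route(incident: Incident, work_start: int = 9, work_end: int = 17) -> str:
    """Decide who handles an incident under the tiered structure."""
    if incident.category in OPS_CATEGORIES:
        # Infrastructure problems and rollbacks stay with first-line ops.
        return "first-line ops"
    if work_start <= incident.dev_local_hour < work_end:
        # A developer somewhere is within working hours.
        return "developer (working hours)"
    # Nobody is at work, so someone has to get out of bed.
    return "developer (paged out of hours)"
```

For example, `route(Incident("network", 3))` stays with ops, while `route(Incident("application_bug", 3))` falls through to the out-of-hours page, which is the case the comment says should be rare.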
by g051051 on 9/22/20, 12:14 PM
by hermitcrab on 9/22/20, 10:59 AM
by jp0d on 9/22/20, 12:26 PM
by bcbrown on 9/22/20, 4:38 AM
> I no longer work at Gojek
by hyko on 9/22/20, 5:19 AM
by comeonseriously on 9/22/20, 1:46 PM