by gr2020 on 7/12/19, 5:13 PM with 108 comments
by vjagrawal1984 on 7/12/19, 9:47 PM
Is it because they are over the curve and don't make "any" changes to their system, whereas other companies like ours are still maturing?
by ssalazars on 7/12/19, 6:02 PM
There's a 20-minute gap between investigation and "rollback". Why did they roll back if the service was back to normal? How can they decide on and document the change within 20 minutes? Are they using CMs to document changes in production? Were there enough engineers involved in the decision? Clearly, not all variables were considered.
To me, this demonstrates poor Operational Excellence values. Your first goal is to mitigate the problem. Then, you need to analyze, understand, and document the root cause. Rolling back was a poor decision, imo.
by laCour on 7/12/19, 5:45 PM
How did they not catch this? It's super surprising to me that they wouldn't have monitors for this.
by zby on 7/13/19, 10:02 AM
by gr2020 on 7/12/19, 5:18 PM
by segmondy on 7/13/19, 12:19 AM
by chance_state on 7/12/19, 5:17 PM
by mual on 7/22/19, 4:14 AM
by jacquesm on 7/12/19, 5:52 PM
by luminati on 7/12/19, 10:29 PM
by debt on 7/12/19, 5:37 PM
Damn, what a mess. Sounds like y'all are rolling out way too many changes too quickly with little to no time for integration testing.
It's a somewhat amateur move to assume you can just arbitrarily roll back without consequence, without testing etc.
One solution I don't see mentioned: don't upgrade to minor versions ever. And create a dependency matrix so that if you do roll back, you also roll back all the other things that depend on the thing you're rolling back.
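A rough sketch of what that dependency matrix lookup could look like, just to make the idea concrete. Everything here is made up for illustration: the component names, the `DEPENDS_ON` map, and the `rollback_set` helper are hypothetical and not taken from the incident writeup. It inverts a "depends on" map and walks it breadth-first to collect everything that would need to roll back together.

```python
from collections import deque

# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "api": ["auth-lib", "db-driver"],
    "worker": ["db-driver"],
    "dashboard": ["api"],
    "auth-lib": [],
    "db-driver": [],
}


def rollback_set(target, depends_on):
    """Return target plus everything that depends on it, directly or transitively."""
    # Invert the map: component -> components that depend on it.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk over the inverted map, starting at the target.
    seen = {target}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Which components have to roll back together with the driver?
    print(sorted(rollback_set("db-driver", DEPENDS_ON)))
```

In a real setup the map would presumably come from lockfiles or deploy manifests rather than a hand-written dict, and it would need to track versions too, not just names.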