by robinson-wall on 6/20/19, 11:44 AM with 94 comments
by gregdoesit on 6/20/19, 6:12 PM
As someone who currently works in the payments space and relies on gateways, I have been through several similar outages: we detected a gateway issue causing an outage, notified the gateway, who ack'd it... and then we waited. More than once, like Monzo, we built a workaround on our end before the gateway provider could even mitigate the outage.
Hats off to the Monzo team, who clearly have a solid on-call and incident mitigation strategy in place. They detected the outage in 4 minutes, then built a workaround as best they could and deployed it within 2 hours, while it took the gateway provider 9 hours just to mitigate the change that caused the issue in the first place. Granted, the issue seemed complex, but that is still slow.
Unfortunately, in cases like this, the best one can do is make sure there is a clear SLA in place with the third party, with a contract stating financial liability if the third party fails to meet it. Monzo will not tell us much about this part, but I suspect the gateway will have to pay a hefty fee to Monzo, as its availability dropped to under 99% for the month (see the rough calculation below), which should trigger payments or fee reductions from the third party under a well-written contract. It is good to see they are pushing the third party to do a proper post-mortem and take preventive actions, as well as holding them accountable.
Nice work!
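For context on the under-99% figure, here is a rough back-of-the-envelope check, assuming roughly 9 hours of degraded service in a 30-day month (figures taken from the comment above, not from Monzo's or the gateway's published numbers):

    #include <stdio.h>

    int main(void) {
        /* Assumed figures from the comment above: ~9 hours of outage in a 30-day month. */
        const double hours_in_month = 30.0 * 24.0;  /* 720 hours */
        const double outage_hours = 9.0;

        double availability = 1.0 - outage_hours / hours_in_month;
        printf("Monthly availability: %.2f%%\n", availability * 100.0);
        /* Prints 98.75% -- just below a 99% monthly SLA target. */
        return 0;
    }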
by mwexler on 6/20/19, 4:56 PM
For more about what makes a good apology, see https://withoutbullshit.com/?s=apology&submit=Search by Josh Bernoff, a former Forrester editor and a very direct writer.
by ziddoap on 6/20/19, 2:38 PM
A+ job on handling the unfortunate situation, Monzo.
We can only hope more companies follow this great example.
by robinson-wall on 6/20/19, 11:47 AM
I'll hang around here to answer any more technical questions if anyone's interested.
by playpause on 6/20/19, 3:03 PM
by yingw787 on 6/20/19, 3:03 PM
1. Was this post-mortem part of an official process, or more of an individual initiative? I saw it published on the blog, but it might be helpful to keep this kind of information separate from marketing material, on a dedicated status site (e.g. https://status.cloud.google.com/summary).
2. I'm not sure how payment processors work, but would it make sense, from a cost/benefit perspective, for Monzo to integrate with multiple payment processors?
3. Any plans to expand to the U.S. anytime soon, or recommend any banks that follow Monzo's best practices? ;-)
by PhantomGremlin on 6/20/19, 2:28 PM
> The bug was in a computer program the Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
> But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
So apparently a dangling reference.
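For readers less familiar with the failure mode: below is a minimal C sketch of a dangling reference (use-after-free). In the gateway's case the memory was reclaimed by a garbage collector rather than an explicit free(), and the struct and field names here are made up for illustration, but the effect is the same: the code keeps a reference to memory that has already been released, and reading through it yields garbage.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical payment message with a date field, for illustration only. */
    struct payment_msg {
        char date[9];  /* e.g. "20190619" plus the NUL terminator */
    };

    int main(void) {
        struct payment_msg *msg = malloc(sizeof *msg);
        if (msg == NULL)
            return 1;
        strcpy(msg->date, "20190619");

        /* The memory is released (in the gateway's case, reclaimed by the GC)... */
        free(msg);

        /* ...but the code still holds and dereferences a pointer to it. This is
           undefined behaviour: the date field may come back corrupted or empty,
           so the payment message can no longer be translated correctly. */
        printf("date field after release: %s\n", msg->date);
        return 0;
    }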
by edraferi on 6/20/19, 5:55 PM
by retube on 6/20/19, 3:13 PM
by ablation on 6/20/19, 1:54 PM
by GordonS on 6/20/19, 9:58 PM
I'm seriously impressed they were able to deploy mitigations to production twice in the same few hours, especially given they are a bank (and a small one, at that), and the consequences of fucking up are enormous.
It's been said here many times already, but I'll join those saying "well done" for handling this so well, and for the extraordinary level of transparency!
by spiderfarmer on 6/20/19, 3:28 PM
by sandGorgon on 6/20/19, 4:15 PM
I'm wondering what you use to call these external processing APIs. I assume these are blocking calls.
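The post doesn't say what Monzo uses here, but for anyone curious what a plain blocking call to an external gateway looks like, here is a minimal sketch using libcurl with a hard timeout; the endpoint URL and the 3-second timeout are invented for illustration:

    #include <stdio.h>
    #include <curl/curl.h>

    int main(void) {
        curl_global_init(CURL_GLOBAL_DEFAULT);

        CURL *curl = curl_easy_init();
        if (curl == NULL)
            return 1;

        /* Hypothetical gateway endpoint -- not a real Monzo or gateway URL. */
        curl_easy_setopt(curl, CURLOPT_URL, "https://gateway.example.com/authorise");

        /* curl_easy_perform() blocks until the gateway responds, the request
           fails, or this timeout fires. */
        curl_easy_setopt(curl, CURLOPT_TIMEOUT_MS, 3000L);

        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            fprintf(stderr, "gateway call failed: %s\n", curl_easy_strerror(res));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }

With a blocking call like this, the timeout is what bounds how long a single payment request can hang when the gateway misbehaves.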
by baby on 6/20/19, 3:40 PM
by kjlfhg8 on 6/20/19, 2:32 PM
by peteretep on 6/20/19, 2:25 PM
What now? Their datacentre was ... rewriting (presumably) encrypted packets?
by osrec on 6/20/19, 3:15 PM
by nvr219 on 6/20/19, 3:54 PM