by robinson-wall on 6/20/19, 11:44 AM with 94 comments
by gregdoesit on 6/20/19, 6:12 PM
As someone who currently works in the payments space and relies on gateways, I have been through several similar outages: we detected a gateway issue causing an outage, notified the gateway, who ack'd it... and then we waited. More than once, like Monzo, we built a workaround on our end before the gateway provider could even mitigate the outage.
Hats off to the Monzo team, who clearly have a solid on-call and incident mitigation strategy in place. They detected the outage in 4 minutes, then built a workaround as best they could and deployed it within 2 hours, while it took the gateway provider 9 hours just to mitigate the change that caused the issue in the first place. Granted, the issue seemed complex, but that is still slow.
Unfortunately, in cases like this, the best one can do is make sure there is a clear SLA in place with the third party, with a contract stating financial liability if the third party fails to meet it. Monzo will not tell us much about this part, but I suspect the gateway will have to pay a hefty fee to Monzo, as its availability dropped to under 99% for the month (see the rough calculation below), which should trigger payments or fee reductions from the third party under a well-written contract. It is good to see they are pushing the third party to do a proper post-mortem and take preventive actions, as well as holding them accountable.
Nice work!
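For context on the under-99% figure, here is a rough back-of-the-envelope check, assuming roughly 9 hours of degraded service in a 30-day month (figures taken from the comment above, not from Monzo's or the gateway's published numbers):

    #include <stdio.h>

    int main(void) {
        /* Assumed figures from the comment above: ~9 hours of outage in a 30-day month. */
        const double hours_in_month = 30.0 * 24.0;  /* 720 hours */
        const double outage_hours = 9.0;

        double availability = 1.0 - outage_hours / hours_in_month;
        printf("Monthly availability: %.2f%%\n", availability * 100.0);
        /* Prints 98.75% -- just below a 99% monthly SLA target. */
        return 0;
    }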
by mwexler on 6/20/19, 4:56 PM
For more about what makes a good apology, see https://withoutbullshit.com/?s=apology&submit=Search by Josh Bernoff, a former Forrester editor and a very direct writer.
by ziddoap on 6/20/19, 2:38 PM
A+ job on handling the unfortunate situation, Monzo.
We can only hope more companies follow this great example.
by robinson-wall on 6/20/19, 11:47 AM
I'll hang around here to answer any more technical questions if anyone's interested.
by playpause on 6/20/19, 3:03 PM
by yingw787 on 6/20/19, 3:03 PM
1. Was this post-mortem part of an official process, or more of an individual initiative? I saw it published on the blog, but it might be helpful to keep this kind of information separate from marketing material, on a dedicated status site (e.g. https://status.cloud.google.com/summary).
2. I'm not sure how payment processors work, but would it make sense, from a cost/benefit perspective, for Monzo to integrate with multiple payment processors?
3. Any plans to expand to the U.S. anytime soon, or recommend any banks that follow Monzo's best practices? ;-)
by PhantomGremlin on 6/20/19, 2:28 PM
> The bug was in a computer program the Gateway uses to translate payment messages between two formats. When the program was operating under load, the system tried to clear memory it believed to be unused (a process known as garbage collection).
> But because it was using an unsafe method to access memory, the code ended up reading memory that had already been cleared away, causing it not to know how to translate the date field in payment messages.
So apparently a dangling reference.
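For readers less familiar with the failure mode: below is a minimal C sketch of a dangling reference (use-after-free). In the gateway's case the memory was reclaimed by a garbage collector rather than an explicit free(), and the struct and field names here are made up for illustration, but the effect is the same: the code keeps a reference to memory that has already been released, and reading through it yields garbage.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical payment message with a date field, for illustration only. */
    struct payment_msg {
        char date[9];  /* e.g. "20190619" plus the NUL terminator */
    };

    int main(void) {
        struct payment_msg *msg = malloc(sizeof *msg);
        if (msg == NULL)
            return 1;
        strcpy(msg->date, "20190619");

        /* The memory is released (in the gateway's case, reclaimed by the GC)... */
        free(msg);

        /* ...but the code still holds and dereferences a pointer to it. This is
           undefined behaviour: the date field may come back corrupted or empty,
           so the payment message can no longer be translated correctly. */
        printf("date field after release: %s\n", msg->date);
        return 0;
    }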
by edraferi on 6/20/19, 5:55 PM
by retube on 6/20/19, 3:13 PM
by ablation on 6/20/19, 1:54 PM
by GordonS on 6/20/19, 9:58 PM
I'm seriously impressed they were able to deploy mitigations to production twice in the same few hours, especially given they are a bank (and a small one, at that), and the consequences of fucking up are enormous.
It's been said here many times already, but I'll join those saying "well done" for handling this so well, and for the extraordinary level of transparency!
by spiderfarmer on 6/20/19, 3:28 PM
by sandGorgon on 6/20/19, 4:15 PM
I'm wondering what you use to call these external processing APIs. I assume these are blocking calls.
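The post doesn't say what Monzo uses here, but for anyone curious what a plain blocking call to an external gateway looks like, here is a minimal sketch using libcurl with a hard timeout; the endpoint URL and the 3-second timeout are invented for illustration:

    #include <stdio.h>
    #include <curl/curl.h>

    int main(void) {
        curl_global_init(CURL_GLOBAL_DEFAULT);

        CURL *curl = curl_easy_init();
        if (curl == NULL)
            return 1;

        /* Hypothetical gateway endpoint -- not a real Monzo or gateway URL. */
        curl_easy_setopt(curl, CURLOPT_URL, "https://gateway.example.com/authorise");

        /* curl_easy_perform() blocks until the gateway responds, the request
           fails, or this timeout fires. */
        curl_easy_setopt(curl, CURLOPT_TIMEOUT_MS, 3000L);

        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            fprintf(stderr, "gateway call failed: %s\n", curl_easy_strerror(res));

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }

With a blocking call like this, the timeout is what bounds how long a single payment request can hang when the gateway misbehaves.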
by baby on 6/20/19, 3:40 PM
by kjlfhg8 on 6/20/19, 2:32 PM
by peteretep on 6/20/19, 2:25 PM
What now? Their datacentre was ... rewriting (presumably) encrypted packets?
by osrec on 6/20/19, 3:15 PM
by nvr219 on 6/20/19, 3:54 PM