from Hacker News

How I hunt down and fix errors in production

by The_Amp_Walrus on 5/3/22, 1:52 AM with 25 comments

  • by aaronbwebber on 5/4/22, 4:45 AM

    An important step that is missing here is evaluating whether your fix is going to cause other, potentially worse problems. I suspect that in this case it's fairly unlikely that increasing the maximum POST body size to 60 MB is going to cause problems - eyeballing that Sendgrid chart, it looks like we are not dealing with very high throughput here. But it's not hard to imagine a situation where tripling the max POST body size would result in a large increase in server memory usage, which could result in things like OOM kills, which could result in a lot of people not getting their reply emails or whatever.

    So don't just rush a fix out. Think about what the effects of a configuration change like this might be, and whether you are just making more problems for yourself down the line trying to fix something quickly.
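
    As a rough illustration of that kind of check (the host alias here is hypothetical), one might watch memory headroom and the largest resident processes on the affected boxes before and after rolling the change out:

      # snapshot free memory and the fattest processes by RSS on one app host
      ssh app-server-1 'free -m; ps -eo rss,comm --sort=-rss | head -n 10'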

  • by mtippett on 5/4/22, 5:55 AM

    I agree with most of what is suggested in the article.

    However, a big part that is missing is the reality that there is a set of hypotheses in play at any point in time. A lot of debugging is the cycle of:

    1. Think about the system and gather any available data - you can't boil the ocean.

    2. Consider a set of hypotheses for the possible cause (even if it is only a partial cause).

    3. Seek any method to either refute or confirm the possible cause, which gives more data.

    Wash, rinse, repeat. Each cycle will likely get closer to the problem.

    Each cycle is also likely to turn up other tech debt that needs to be addressed.

    Rarely is there a single hypothesis that is right the first time, although an experienced person will prune out a lot of poor ideas automatically, and likely subconsciously.

    Observability goes a long way to getting the data needed to confirm or refute.
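
    As a concrete flavor of step 3, in the article's scenario (log path assumed to be the nginx default): if the working hypothesis is "requests are being rejected for exceeding the body-size limit", the error log can confirm or refute it directly:

      grep -c 'client intended to send too large body' /var/log/nginx/error.log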

  • by notaspecialist on 5/4/22, 8:13 AM

    When a user comes over and says "this isn't happening," I write a test, and sure enough, the test fails. I fix the case, re-run all the tests, push to UAT, and ask the user to verify it works in the UAT system. It's pushed into production after hours.

    Prior to TDD, I would spend hours stepping through code, setting variables to replicate the scenario, and scratching my head, and would usually fix it after a week or so. Then I would get a bug report of something else weird happening, and the process would repeat.
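
    As a sketch of that loop (the test path, test name, and "uat" git remote are made up for illustration):

      # first reproduce the report as a failing test
      pytest tests/test_email_receive.py::test_large_reply_is_accepted
      # ...apply the fix, then run everything before it goes anywhere
      pytest && git push uat main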

  • by chaps on 5/4/22, 4:28 AM

    Here's how I do it:

      # for each host in hosts.txt: find files modified in the last 20 minutes, count "error" lines in each, prefix with the hostname, then sort by count
      xargs -a hosts.txt -P128 -I'{}' bash -c "ssh '{}' 'find / -type f -mmin -20 | xargs -P128 -Ifilename grep -cHia error filename 2>/dev/null' | sed 's/^/{}:/'" | sort -t':' -nrk3
  • by rmbyrro on 5/4/22, 11:27 AM

    Given that there's an issue receiving emails, an /email/receive/ endpoint, and nginx log files, I would have promptly searched those logs for "[error] * /email/receive/".
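
    A minimal sketch of that search (assuming the default nginx error-log path; rotated or gzipped logs would need zgrep or a wider glob):

      grep '\[error\].*email/receive' /var/log/nginx/error.log*
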
  • by ricardobayes on 5/4/22, 7:47 AM

    Lately, the only technical question we ask when hiring is to debug an issue. Experience in this is really difficult to fake, unlike memorized leetcode problems, etc.
  • by ge96 on 5/4/22, 5:47 AM

    random thoughts about this subject

    - sucks when your bug completely blows your project up (a type error producing a blank page)

    - I'm tempted to track every click/event and log it for reproducibility

    - sucks when your product fails not because of a bug but because people don't know how to use it (a training issue, I guess), e.g. permissions not accepted: why isn't it working?

  • by invalidname on 5/4/22, 5:16 AM

    I'm very much in favor of this but his observability stack is seriously lacking. With Developer Observability tools this is much easier and more powerful: https://www.youtube.com/watch?v=k0DPO5jlZtU