from Hacker News

Linux Crisis Tools

by samber on 3/24/24, 12:51 AM with 124 comments

  • by FridgeSeal on 3/24/24, 1:57 AM

    This is a handy list.

    > 4:07pm The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration…

    Cloud definitely has downsides, and isn’t a fit for all scenarios, but in my experience it’s great for situations like this. Instead of messing around trying to repair the machine, simply kill it, or take it out of the pool. Get a new one. The new machine and app likely come up clean. Incident resolved. Dig into the broken machine off the hot path.
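
    A sketch of that playbook with the AWS CLI (the instance ID and ASG name are placeholders, and the exact flags are worth verifying against the docs):

```shell
# Pull the sick instance out of the load-balanced pool but keep it
# alive for later analysis; without decrementing desired capacity,
# the auto scaling group launches a clean replacement.
aws autoscaling enter-standby \
  --instance-ids i-0abc123 \
  --auto-scaling-group-name web-asg \
  --no-should-decrement-desired-capacity
```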

  • by devsda on 3/24/24, 4:08 AM

    Not all servers are containerized, but a significant number are and they present their own challenges.

    Unfortunately, many such tools in Docker images will be flagged by automated security scanning tools in the "unnecessary tools that can aid an attacker in observing and modifying system behavior" category. Some of those (like having gdb) are valid concerns, but many are not.

    To avoid that we have some of these tools in a separate volume as (preferably) static binaries or compile & install them with the mount path as the install prefix (for config files & libs). If there's need to debug, we ask operations to mount the volume temporarily as read-only.
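
    A minimal sketch of that pattern (volume name, paths, and image names are illustrative, not from my actual setup): populate a volume with a static toolbox once, then mount it read-only only when debugging is needed.

```shell
# One-time: copy a static busybox into a named volume and expand its
# applets there (the symlinks point at /opt/debug/busybox).
docker volume create debug-tools
docker run --rm -v debug-tools:/opt/debug busybox \
  sh -c 'cp /bin/busybox /opt/debug/ && /opt/debug/busybox --install -s /opt/debug'

# During an incident: mount the toolbox read-only into the container
# under investigation (myapp:latest is a placeholder image).
docker run --rm -it -v debug-tools:/opt/debug:ro myapp:latest \
  sh -c 'PATH=/opt/debug:$PATH exec sh'
```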

    Another challenge is if there's a debug tool that requires enabling a certain kernel feature, there are often questions/concerns about how that affects other containers running on the same host.

  • by infofarmer on 3/24/24, 9:09 AM

    Somewhat related: /rescue/* on every FreeBSD system since 5.2 (2004) — a single statically linked ~17MB binary combining ~150 critical tools, hardlinked under their usual names

    https://man.freebsd.org/cgi/man.cgi?rescue https://github.com/freebsd/freebsd-src/blob/main/rescue/resc...
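
    On a FreeBSD box the single-binary layout is easy to see for yourself: every name under /rescue resolves to the same statically linked file.

```shell
# Same inode number for each name, i.e. hard links to one binary:
ls -li /rescue/sh /rescue/ls /rescue/tar

# Confirm there are no shared-library dependencies:
file /rescue/sh

# Usable even when /bin, /lib, or /usr are damaged:
/rescue/sh -c '/rescue/ls /'
```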

  • by sargun on 3/24/24, 5:12 AM

    When I was at Netflix, Brendan and his team made sure that we had a fair set of debugging tools installed everywhere (bpftrace, bcc, a working perf).

    These were a lifesaver multiple times.

  • by mmh0000 on 3/24/24, 2:09 AM

    I was surprised that `strace` wasn't on that list. That's usually one of my first go-to tools. It's so great, especially when programs return useless or wrong error messages.
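
    For example (the PID and program name are placeholders):

```shell
# Attach to a running process, follow children (-f), add timestamps
# (-tt), and decode file descriptors into paths/sockets (-yy):
strace -f -tt -yy -p 1234

# Or narrow the trace to the syscalls that usually explain a
# misleading error message:
strace -f -e trace=openat,connect,execve myprog --some-flag
```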
  • by donio on 3/24/24, 4:31 AM

    I always cover such tools when I interview people for SRE-type positions. Not so much about which specific commands the candidate can recall (although it always impresses when somebody teaches me about a new tool) but what's possible, what sort of tools are available and how you use them: that you can capture and analyze network traffic, syscalls, execution profiles and examine OS and hardware state.
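
    One concrete command per category listed above (PIDs are placeholders):

```shell
# Network traffic: capture 1000 packets for offline analysis.
tcpdump -i any -c 1000 -w /tmp/incident.pcap

# Syscalls: summarize which calls a process is making and how often.
strace -c -p 1234

# Execution profile: sample stacks for 30 seconds, then inspect.
perf record -g -p 1234 -- sleep 30
perf report
```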
  • by reilly3000 on 3/24/24, 2:07 AM

    In such a crisis if installing tools is impossible, you can run many utils via Docker, such as:

    Build a container with a one-liner:

    docker build -t tcpdump - <<'EOF'
    FROM ubuntu
    RUN apt-get update && apt-get install -y tcpdump
    CMD ["tcpdump", "-i", "eth0"]
    EOF

    Run attached to the host network:

    docker run -dP --net=host moremagic/docker-netstat

    Run system tools attached to read host processes:

    for sysstat_tool in iostat sar vmstat mpstat pidstat; do
        alias "sysstat-${sysstat_tool}=docker run --rm -it -v /proc:/proc --privileged --net host --pid host ghcr.io/krishjainx/sysstat-docker:main /usr/bin/${sysstat_tool}"
    done
    unset -v sysstat_tool

    Sure, yum install is preferred, but so long as docker is available this is a viable alternative if you can manage the extra mapping needed. It probably wouldn’t work with a rootless/podman setup.

  • by rr808 on 3/24/24, 3:20 AM

    You guys get root access? I have to raise a ticket for a sysadmin to do anything.

  • by kureikain on 3/24/24, 8:05 AM

    I don't see nmap, netstat, and nc being mentioned. They have saved me so many times as well.

  • by zer00eyz on 3/24/24, 3:40 AM

    The only thing I would add is nmap.

    Network connectivity issues aren't always apparent in some apps.
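
    A few quick connectivity checks (host and ports are placeholders):

```shell
# Can we reach the port at all? (-z: connect only, -v: report result)
nc -vz db.internal 5432

# Which ports on the host actually answer?
nmap -p 1-1024 db.internal

# What is listening locally? (ss is the modern netstat replacement)
ss -tlnp
```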

  • by kunley on 3/24/24, 1:57 AM

    Brendan Gregg, as always, with a down-to-earth approach. Love the war room example.

  • by SamuelAdams on 3/24/24, 2:01 AM

    Would these tools still be useful in a cloud environment, such as EC2?

    Most dev teams I work with are actively reducing their actual managed servers and replacing them with either Lambda or Docker images running in K8s. I wonder if these tools are still useful for containers and serverless?

  • by js4ever on 3/24/24, 10:52 AM

    Let's add NCDU to the list, it's super useful to find what is taking all the disk space
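
    For example:

```shell
# Scan a single filesystem; -x stops ncdu from descending into
# other mounts such as /proc or network shares.
ncdu -x /

# Where ncdu isn't installed, plain du gets close:
du -xh / 2>/dev/null | sort -rh | head -20
```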
  • by pstuart on 3/24/24, 1:55 AM

    Sounds like it's time to create a crisis-essential package group a la build-essential.
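
    No such package group exists yet; a hedged sketch of what it might pull in on Debian/Ubuntu (package names vary by release):

```shell
# Hypothetical "crisis-essential" set, installed ahead of time
# rather than mid-incident:
sudo apt-get install -y \
  procps sysstat iproute2 tcpdump strace ltrace \
  linux-tools-common "linux-tools-$(uname -r)" \
  bpfcc-tools bpftrace trace-cmd ethtool
```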
  • by pjmlp on 3/24/24, 7:43 AM

    The list is great, but only for classical server workloads.

    Usually not even a shell is available in modern Kubernetes deployments that take a security first approach, with chiseled containers.

    And by creating a debugging image, not only is the execution environment being changed, but deploying it might also require disabling the security policies that run image scans.
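
    Ephemeral debug containers are the usual escape hatch for shell-less images, when cluster policy allows them (pod, container, and image names below are placeholders):

```shell
# Attach a throwaway toolbox container to a running pod, sharing
# the target container's process namespace:
kubectl debug -it mypod --image=nicolaka/netshoot --target=app -- sh

# Or debug the node itself from a privileged helper pod:
kubectl debug node/worker-1 -it --image=ubuntu
```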

  • by randomgiy3142 on 3/24/24, 2:41 AM

    I use zfsbootmenu with hrmpf (https://github.com/leahneukirchen/hrmpf). You can see the list of packages here (https://github.com/leahneukirchen/hrmpf/blob/master/hrmpf.pa...). I usually build images based off this so the tools are all there; otherwise you’ll need to ssh into zfsbootmenu and load the 2 GB separate distro.

    This is for a home server, though if I had a startup I’d probably set up a “cloud setup” and throw a bunch of servers somewhere. A lot of the time, for internal projects and even non-production client research, having your own cluster is a lot cheaper and easier than paying for a cloud provider. It also gets around cases where you can’t run k8s and need bare metal.

    I’ve advised some clients on this setup, with contingencies in case of catastrophic failure and, more importantly, testing those contingencies. But this is more so you don’t have developers doing nothing, not to prevent overnight outages. It’s a lot cheaper than cloud solutions for non-critical projects, and while larger companies will look at the numbers closely if something happens and devs can’t work for an hour, the advantage of a startup is that devs will find a way to be productive locally, or you can simply have them take the afternoon off (neither has happened).

    I imagine the problems described happen on big-iron hardware clusters that are extremely expensive, where spare capacity isn’t possible. I might be wrong, but especially with (sigh) AI setups, with extremely expensive $30k GPUs and crazy bandwidth between planes you buy from IBM at crazy prices (a hardware vendor on the line so quickly was a hint), you’re way past the commodity-server cloud model. I have no idea what could go wrong with equipment where nearly every piece of hardware is close to custom-built, but I’m glad I don’t have to deal with that. Debugging on that kind of hardware, which only a few huge pharma or research companies use, has to come down to really strange things.

  • by sirwitti on 3/24/24, 7:18 AM

    Related to that, I recently learned about safe-rm which lets you configure files and directories that can't be deleted.
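
    A sketch of how that looks (config path per safe-rm's documentation, worth verifying; the protected directory is a placeholder):

```shell
# List a directory in the system-wide config...
echo "/var/lib/postgresql" | sudo tee -a /etc/safe-rm.conf

# ...and with safe-rm installed ahead of rm in PATH, this now
# refuses instead of deleting:
rm -rf /var/lib/postgresql
```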

    This probably would have prevented a stressful incident 3 weeks ago.

  • by anthk on 3/24/24, 11:05 AM

    tmux, statically linked (musl) busybox with everything, lsof, ltrace/strace and a few more. Under OpenBSD this is not an issue as you have systat and friends in base.
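
    For example, a single static busybox can be used directly or expanded into a standalone toolbox (paths are illustrative):

```shell
# Hundreds of applets, one binary, no shared libraries to break:
busybox ls /proc
busybox netstat -tln

# Expand it into a rescue directory of symlinked tool names:
mkdir -p /opt/rescue
cp /bin/busybox /opt/rescue/
/opt/rescue/busybox --install -s /opt/rescue
PATH=/opt/rescue:$PATH ps aux
```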
  • by logifail on 3/24/24, 6:55 AM

    Doesn't one increase a system's attack surface area/privilege escalation risk by pre-installing tools such as these?

  • by prydt on 3/24/24, 2:42 AM

    Love the list and the eBPF tools look super helpful.
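
    For anyone curious, a taste of the eBPF side (the tool names are the Ubuntu -bpfcc variants; other distros drop the suffix):

```shell
# bpftrace one-liner: who is opening which files right now?
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s -> %s\n", comm, str(args->filename)); }'

# bcc tools: trace new processes and new TCP connections.
execsnoop-bpfcc
tcpconnect-bpfcc
```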
  • by michaelhoffman on 3/24/24, 2:42 PM

    When would you need to use rdmsr and wrmsr in a crisis?
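
    One plausible crisis use is checking for thermal throttling (a sketch; the MSR number is Intel-specific and must be verified against the vendor manual):

```shell
# MSR access needs root and the msr kernel module:
modprobe msr

# IA32_THERM_STATUS (0x19c) on CPU 0; the throttling bits live here:
rdmsr -p 0 0x19c
```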
  • by josephcsible on 3/24/24, 3:16 AM

    > and...permission errors. What!? I'm root, this makes no sense.

    This is one of the reasons why I fight back as hard as I can against any "security" measures that restrict what root can do.

  • by ur-whale on 3/24/24, 8:03 AM

    Can't imagine handling a Linux crisis without ssh

    [EDIT]: typo

  • by SuperHeavy256 on 3/24/24, 2:23 AM

    So basically busybox?