from Hacker News

Ask HN: Why don't operating systems test RAM for corruptions?

by ekns on 9/30/20, 6:37 PM with 2 comments

See recent comments on ECC RAM like https://news.ycombinator.com/item?id=24589597

I recently had a RAM problem that took me a while to diagnose correctly because of a lack of compatible tooling on UEFI (memtest86 on Ubuntu doesn't work with this setup out of the box).

I was wondering: why don't any of the popular operating systems run cheap "RAM sanity checks" every so often, to detect failing RAM before users lose ridiculous amounts of time to random crashes and corruption?

This could be similar to how ZFS does disk scrubbing: the OS could opportunistically or systematically mark certain RAM pages as unavailable while they are being tested, and every week or month do an hour or so of comprehensive testing with something low-level like memtest86 (e.g. during update windows or other downtime on desktop machines).
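
As a rough illustration of the "scrub a region while it is set aside" idea, here is a minimal user-space sketch in Python. It is not how memtest86 or any kernel implements testing: the names `PATTERNS`, `scrub_buffer`, and `scrub_once` are hypothetical, and a user-space process cannot choose which physical pages back its buffer, so this only shows the write/read-back/compare loop, not real page isolation.

```python
# Hypothetical sketch: write known bit patterns into a buffer, read them
# back, and report any byte that does not match. All names are illustrative.

# Classic patterns: all zeros, all ones, and two alternating-bit patterns.
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)


def scrub_buffer(buf, patterns=PATTERNS):
    """Fill buf with each pattern in turn and verify it.

    Returns a list of (offset, expected, actual) tuples, one per mismatch.
    """
    errors = []
    size = len(buf)
    for pattern in patterns:
        expected = bytes([pattern]) * size
        buf[:] = expected            # write the pattern over the whole buffer
        if buf != expected:          # fast whole-buffer comparison first
            # Only walk the buffer byte by byte when something differs.
            for offset, value in enumerate(buf):
                if value != pattern:
                    errors.append((offset, pattern, value))
    return errors


def scrub_once(size_bytes=1 << 20):
    """Allocate a fresh buffer (assumption: 1 MiB by default) and test it."""
    return scrub_buffer(bytearray(size_bytes))


if __name__ == "__main__":
    problems = scrub_once()
    print("mismatches found:", len(problems))
```

An OS-level version would need to pick physical pages itself, keep them out of the allocator during the test, and bypass CPU caches, none of which this sketch attempts.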

If I had my way, all machines would have ECC RAM, but with the status quo there still seems to be so much low-hanging fruit. I don't understand why we always have to find out that the RAM has gone bad post mortem, after an unknown amount of mayhem.

Could these ideas be implemented on Linux, or is there some reason why this sort of approach is infeasible?

A similar question applies to pre-emptively diagnosing other possible hardware problems. I don't really see that happening anywhere, though I recall reading that Solaris was aiming to do something like that and to recover automatically from various problems.

  • by johndoe0815 on 9/30/20, 7:24 PM

    We built "RAMpage", an online RAM tester for Linux, about 10 years ago - https://core.ac.uk/download/pdf/194025446.pdf

    This was a prototype developed by a student. It worked fairly well (tested using real defective RAM modules from our AMD cluster system), but it is certainly not yet suitable for everyday use. See https://github.com/schirmeier/rampage for a snapshot of our code.

  • by ekns on 10/1/20, 7:12 AM

    FWIW, I just found out there's a tool called memtester [0] that could fit in here. But then I wonder why operating systems don't use it. IMO they should integrate these things so that testing is done systematically for most users unless explicitly disabled.

    [0] https://shipilev.net/jvm/test-your-memory/