from Hacker News

Can Applications Recover from Fsync Failures?

by simonz05 on 8/10/22, 6:10 PM with 46 comments

by formerly_proven on 8/13/22, 10:02 AM
IIRC Linux itself has only been reporting asynchronous writeback errors via fsync for a few short years, meaning before that basically any database that wasn't using O_DIRECT would miss I/O errors under memory pressure (or from out-of-process writebacks in general, e.g. root invoking sync). I looked into this stuff before postgres's fsyncgate, before "how are I/O errors actually handled in Linux, anyhow?" got attention, and walked away with the notion that anything other than O_DIRECT is best-effort-probably-works-most-of-the-time on a good day, and O_DIRECT's semantics are basically an unknowable opaque mixture of what drivers and hardware do and expect. There were some papers looking at error handling within Linux file systems at the time and they found a large number of issues in pretty much all of them. As far as I know, all efforts in the area of durable I/O are still focused on the notion of synchronizing I/O (fsync/fdatasync and equivalent), while many databases don't actually care about that too much and would rather want barriers instead. The kicker is of course that hardware (when honest) actually uses barriers and not block synchronization, and the databases that are journaling filesystems of course also use barriers and not synchronization to implement journaling. It struck me as a distinctly classic API-to-real-world mismatch.
by eis on 8/13/22, 11:52 AM
After decades of issues with the storage layer, and with even some of the most popular programs written by top-notch developers having bugs due to the problematic nature of the APIs and filesystems involved, I wish a completely new storage API would emerge. Something that exposes an asynchronous API (with a synchronous one built upon it) with ACID semantics. Filesystems are nothing more than specialized databases, but they don't expose the necessary interface to use them as such.
We need an API that is dead simple and hard to misuse with clearly defined semantics and guarantees but lets seasoned developers still exploit the hardware to its fullest with additional work. Hope dies last I guess :)
by CGamesPlay on 8/13/22, 10:00 AM
> all three file systems mark pages clean after fsync fails, rendering techniques such as application-level retry ineffective. However, the content in said clean pages varies depending on the file system; ext4 and XFS contain the latest copy in memory while Btrfs reverts to the previous consistent state. Failure reporting is varied across file systems; for example, ext4 data mode does not report an fsync failure immediately in some cases, instead (oddly) failing the subsequent call. Failed updates to some structures (e.g., journal blocks) during fsync reliably lead to file-system unavailability. And finally, other potentially useful behaviors are missing; for example, none of the file systems alert the user to run a file-system checker after the failure.
Surely there are some motivations behind these behaviors, and it's not a bug that was implemented in all 3 filesystems, right?
by chrsig on 8/13/22, 2:20 PM
On macOS, most likely not[0].
from the macOS fsync manpage:
> fsync() causes all modified data and attributes of fildes to be moved to a permanent storage device. This normally results in all in-core modified copies of buffers for the associated file to be written to a disk.
> Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.
> Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.
> This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.
> For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
[0] https://twitter.com/marcan42/status/1494213855387734019
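For illustration, here is a minimal Python sketch of what the manpage describes: ask for F_FULLFSYNC where the platform offers it, otherwise use plain fsync. The helper name `full_sync` is hypothetical, and the fallback to fsync when F_FULLFSYNC is rejected is an assumption about what a caller might want, not something the manpage prescribes.

```python
import fcntl
import os
import tempfile


def full_sync(fd: int) -> str:
    """Flush fd as durably as the platform allows; return the call that was used.

    Hypothetical helper for illustration only.
    """
    # fcntl.F_FULLFSYNC is only defined on macOS; elsewhere getattr yields None.
    full = getattr(fcntl, "F_FULLFSYNC", None)
    if full is not None:
        try:
            fcntl.fcntl(fd, full)
            return "F_FULLFSYNC"
        except OSError:
            # Assumption: if the file system rejects F_FULLFSYNC,
            # a plain fsync is still better than nothing.
            pass
    os.fsync(fd)  # raises OSError if the kernel reports a writeback failure
    return "fsync"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"commit record\n")
        f.flush()  # move Python's userspace buffer into the kernel first
        print("synced with", full_sync(f.fileno()))
```

On Linux this always takes the fsync branch; the stronger path is only exercised on macOS.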
by xyzzy_plugh on 8/13/22, 3:37 PM
I said this elsewhere, but in isolation there will always be failure scenarios where recovery is impossible. There are plenty of verification strategies to detect failures, and combined with redundancy, you can reduce the probability of application failure in the face of fsync failures or other similar failures. But you can never eliminate failures. If your storage gives up the ghost, it's game over.
Distributed systems are the closest we've gotten to resilient, durable storage. Redundancy, external verification, quorum. Sometimes the distributed system lives in a single box on your desk.
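A rough single-box sketch of the "redundancy, external verification, quorum" idea, in Python. Everything here is illustrative: the function names, the record layout (a SHA-256 digest followed by the payload), and the replica file names are assumptions, and a real system would also have to handle partial writes, replica repair, and versioning.

```python
import hashlib
import os
import tempfile
from collections import Counter

DIGEST_LEN = 32  # size of a SHA-256 digest in bytes


def write_replicas(paths, payload: bytes) -> None:
    """Write the payload, prefixed by its SHA-256 digest, to every replica."""
    record = hashlib.sha256(payload).digest() + payload
    for path in paths:
        with open(path, "wb") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # an OSError here makes this replica suspect


def read_verified(paths, quorum: int) -> bytes:
    """Return the payload that verifies on at least `quorum` replicas."""
    votes = Counter()
    for path in paths:
        try:
            with open(path, "rb") as f:
                record = f.read()
        except OSError:
            continue  # a missing or unreadable replica casts no vote
        digest, payload = record[:DIGEST_LEN], record[DIGEST_LEN:]
        if len(record) >= DIGEST_LEN and hashlib.sha256(payload).digest() == digest:
            votes[payload] += 1
    for payload, count in votes.most_common(1):
        if count >= quorum:
            return payload
    raise IOError("no quorum of verified replicas")


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    replicas = [os.path.join(directory, f"replica{i}.bin") for i in range(3)]
    write_replicas(replicas, b"balance=100")
    # Simulate silent corruption of one replica.
    with open(replicas[0], "wb") as f:
        f.write(b"garbage")
    print(read_verified(replicas, quorum=2))
```

The checksum detects a bad copy, and the quorum decides which surviving copy to trust. If every replica sits on the same failing device, none of this helps, which is the point made above.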
by simonz05 on 8/10/22, 6:10 PM
The paper analyzes how file systems and PostgreSQL, LMDB, LevelDB, SQLite, and Redis react to fsync failures. It shows that although applications use many failure-handling strategies, none are sufficient: fsync failures can cause catastrophic outcomes such as data loss and corruption.
by iforgotpassword on 8/13/22, 10:05 AM
> Our findings show that although applications use many failure-handling strategies, none are sufficient: fsync failures can cause catastrophic outcomes such as data loss and corruption.
That makes it seem like an immediate abort might be the best action in most cases? Handling it wrong and then chugging along might amplify any corruption that has happened.
It obviously depends on the application and use case, but I'd like to think projects like pgsql put a lot of effort into getting this right after fsyncgate. I read quite a bit about it after that incident, but ultimately decided I'm too stupid to get it right, and have gone the "log error and bail out" route ever since.
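A minimal Python sketch of the "log error and bail out" route, for illustration. The names `append_durably`, `FsyncFailed`, and `run_or_bail` are hypothetical, and the injectable `fsync` parameter exists only so the failure path can be exercised without a faulty disk.

```python
import logging
import os
import sys
import tempfile


class FsyncFailed(Exception):
    """Durability of earlier writes can no longer be assumed."""


def append_durably(path: str, data: bytes, fsync=os.fsync) -> None:
    """Append data and fsync it; raise FsyncFailed instead of retrying."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            view = view[os.write(fd, view):]
        try:
            fsync(fd)
        except OSError as exc:
            # Deliberately no retry: per the paper quoted in this thread, the
            # dirty pages are marked clean after a failed fsync, so a second
            # fsync can succeed without the data ever reaching the disk.
            raise FsyncFailed(f"fsync failed for {path}: {exc}") from exc
    finally:
        os.close(fd)


def run_or_bail(path: str, data: bytes) -> None:
    """Log the failure and exit rather than continue on uncertain state."""
    try:
        append_durably(path, data)
    except FsyncFailed:
        logging.exception("giving up, on-disk state is unknown")
        sys.exit(1)


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
    run_or_bail(log_path, b"record 1\n")
    print("appended to", log_path)
```

What happens after the process exits (restart and replay a log, restore from a replica, page an operator) depends on the application and is outside this sketch.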
by hyc_symas on 8/14/22, 8:16 PM
The description of LMDB's behavior and subsequent analysis are flat wrong. https://twitter.com/hyc_symas/status/1558909442737012736
To assume that any newbie has hit upon a potential failure condition that we didn't already anticipate and account for in LMDB is frankly laughable.