by vordoo on 1/27/20, 1:57 PM with 237 comments
by InTheArena on 1/27/20, 4:04 PM
I went down the BTRFS path, despite its dodgy reputation, when Netgear announced their little embedded NASes, and switched my server over to it. The experience was solid enough that I bought a high-end Synology and have had zero problems with it.
by derefr on 1/27/20, 3:58 PM
Or, to put that another way: what are AWS and GCP using in their SANs (EBS; GCE PD) that allows them to take on-demand incremental snapshots of SAN volumes, and then ship those snapshots away from the origin node into safer out-of-cluster replicated storage (e.g. object storage)? Is it proprietary, or is it just several FOSS technologies glued together?
My naive guess would be that the cloud hosts are either using ZFS volumes, or LVM LVs (which do have incremental snapshot capability, if the disk is created in a thin pool) under iSCSI. (Or they’re relying on whatever point-solution VMware et al sold them.)
If you control the filesystem layer (i.e. you don’t need to be filesystem-agnostic), would Btrfs snapshots be better for this same use-case?
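Neither of these is confirmed to be what AWS or GCP actually run, but as a sketch of the two mechanisms mentioned above (VG, pool and path names are made up for illustration): with LVM, incremental snapshot capability requires the LV to live in a thin pool,
% lvcreate -L 100G -T vg0/tpool                  # create a thin pool
% lvcreate -V 50G -T vg0/tpool -n vol1           # thin volume inside the pool
% lvcreate -s -n vol1_snap1 vg0/vol1             # thin snapshot, shares blocks with the origin
while with Btrfs, "btrfs send -p" emits an incremental stream against a parent snapshot that could then be shipped off to object storage:
% btrfs subvolume snapshot -r /data /data/.snap1
% btrfs subvolume snapshot -r /data /data/.snap2   # some time later
% btrfs send -p /data/.snap1 /data/.snap2 | gzip > /backup/data-incr.btrfs.gz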
by gravypod on 1/27/20, 6:43 PM
by tezzer on 1/27/20, 6:20 PM
by pojntfx on 1/27/20, 3:22 PM
by kiney on 1/27/20, 7:35 PM
[/dev/mapper/h4_crypt].write_io_errs 0
[/dev/mapper/h4_crypt].read_io_errs 0
[/dev/mapper/h4_crypt].flush_io_errs 0
[/dev/mapper/h4_crypt].corruption_errs 0
[/dev/mapper/h4_crypt].generation_errs 0
[/dev/mapper/h2_crypt].write_io_errs 0
[/dev/mapper/h2_crypt].read_io_errs 30
[/dev/mapper/h2_crypt].flush_io_errs 0
[/dev/mapper/h2_crypt].corruption_errs 0
[/dev/mapper/h2_crypt].generation_errs 0
[/dev/mapper/h1_crypt].write_io_errs 0
[/dev/mapper/h1_crypt].read_io_errs 0
[/dev/mapper/h1_crypt].flush_io_errs 0
[/dev/mapper/h1_crypt].corruption_errs 0
[/dev/mapper/h1_crypt].generation_errs 0
[/dev/mapper/h3_crypt].write_io_errs 0
[/dev/mapper/h3_crypt].read_io_errs 0
[/dev/mapper/h3_crypt].flush_io_errs 0
[/dev/mapper/h3_crypt].corruption_errs 0
[/dev/mapper/h3_crypt].generation_errs 0
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].write_io_errs 0
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].read_io_errs 16
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].flush_io_errs 0
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].corruption_errs 20619
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].generation_errs 0
edit: formatting
by epx on 1/27/20, 3:25 PM
by alyandon on 1/27/20, 3:38 PM
by Shalle135 on 1/27/20, 3:52 PM
It all depends on the application, but in the majority of cases the I/O performance of btrfs is worse than the alternatives.
Red Hat, for example, chose to deprecate btrfs for unknown reasons, while SUSE made it its default. Its future seems uncertain, which may cause a lot of headaches in major environments if it is adopted there.
by zielmicha on 1/27/20, 4:50 PM
ext4 - 33s, ZFS - 50s, btrfs - 74s
(the test was run on a Vultr.com 2GB virtual machine; the backing disk was allocated using "fallocate --length 10G" on an ext4 filesystem; the results are very consistent)
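The comment doesn't spell out the full setup, but a plausible reading is a file-backed loop device formatted with each filesystem under test, roughly (paths are illustrative):
% fallocate --length 10G /srv/backing.img
% losetup --find --show /srv/backing.img      # prints e.g. /dev/loop0
% mkfs.btrfs /dev/loop0                       # or mkfs.ext4 / zpool create, per run
% mount /dev/loop0 /mnt/test                  # then run and time the workload against /mnt/test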
by lousken on 1/27/20, 3:25 PM
by pQd on 1/28/20, 7:52 AM
since 2017 i'm also using BTRFS to host mysql replication slaves. every 15 min, 1 h and 12 h, crash-consistent snapshots of the running database files are taken and kept for a couple of days. there's consensus that - due to its COW nature - BTRFS is not well suited for hosting vms, databases or any other type of files that change frequently. performance is significantly worse compared to EXT4 - this can lead to slave lag, but slave lag can be mitigated by using NVMe drives and relaxing the durability settings of MySQL's innodb engine. i've used those snapshots a few times each year - it has worked fine so far.
snapshots should never be the main backup strategy; independently of them there's a full database backup done daily from the masters using mysqldump. snapshots are useful whenever you need to very quickly access the state of the production data from a few minutes or hours ago - for instance after fat-fingering some live data.
during those years i've seen kernel crashes most likely due to BTRFS but i did not lose data as long as the underlying drives were healthy.
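A minimal sketch of the snapshot rotation described above, assuming the MySQL datadir sits on its own btrfs subvolume (paths and the timestamp format are illustrative, not the commenter's actual setup):
# from cron, every 15 min / 1 h / 12 h: read-only, crash-consistent snapshot of the running datadir
% btrfs subvolume snapshot -r /srv/mysql /srv/.snapshots/mysql-$(date +%Y%m%d-%H%M)
# expiry is just explicit deletion of snapshots older than a couple of days
% btrfs subvolume delete /srv/.snapshots/mysql-20200101-0000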
by izacus on 1/27/20, 4:29 PM
They're still using their own RAID layer though.
by cmurf on 1/28/20, 7:26 AM
Async discards coming in 5.6. https://lore.kernel.org/linux-btrfs/cover.1580142284.git.dst...
by abotsis on 1/27/20, 4:30 PM
In addition to the slew of other features Btrfs is missing (send/recv, dedup, etc), ZFS allows you to dedicate something like an Intel Optane (or other similar high-write-endurance, low-latency SSD) to act as stable storage for sync writes, and a different device (typically MLC or TLC flash) to extend the read cache.
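On the ZFS side this is a one-liner per device; a sketch with illustrative pool and device names:
% zpool add tank log /dev/nvme0n1     # Optane-class device as SLOG, absorbs synchronous writes
% zpool add tank cache /dev/sdf       # MLC/TLC SSD as L2ARC, extends the read cache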
by geophertz on 1/27/20, 4:12 PM
The ability to add and remove disks on a desktop machine is very tempting.
by mdip on 1/27/20, 3:26 PM
I'll be the first to say that it isn't a silver bullet for everything. But then, what filesystem really is? Filesystems are such a critical part of a running OS that we expect perfection for every use case; filesystem bugs or quirks[1] result in data loss which is usually Really Bad(tm).
That said, for the last two years, I've been running Linux on a Thinkpad with a Windows 10 VM in KVM/qemu -- both are running all the time. When I first configured my Windows 10 VM, performance was brutal; there were times when writes would stall the mouse cursor and the issue was directly related to `btrfs`. I didn't ditch the file-system, I switched to a raw volume for my VM and adjusted some settings that affected how `btrfs` interacted with it. I discovered similar things happened when running a `balance` on the filesystem and after a bit of research, found that changing the IO scheduler to one more commonly used on spindle HDDs made everything more stable.
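The exact settings aren't named above, but the tweaks commonly suggested for VM images on btrfs look roughly like this (illustrative paths, not necessarily what was done here): disable copy-on-write for the image directory and pick an I/O scheduler typically used for spinning disks:
% chattr +C /var/lib/libvirt/images                    # nodatacow; only affects files created afterwards
% echo mq-deadline > /sys/block/sda/queue/scheduler    # runtime scheduler change for the backing device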
So why use something that requires so much grief to get it working? Because those settings changes are a minor inconvenience compared with what I'd otherwise have to deal with for a bigger problem I frequently encountered: OS recovery. An out-of-the-box openSUSE Tumbleweed installation uses `btrfs` on root. Every time software is added/modified, or `yast` (the user-friendly administrative tool) is run, a snapshot is taken automatically. When I or my OS screws something up, I have a boot menu that lets me "go back" to prior to the modification. It Just Works(tm). In the last two years, I've had around 4-5 cases where my OS was wrecked by keeping things up to date or tweaking configuration. In the past, I'd be re-installing. Now, I reboot after applying updates and if things are messed up, I reboot again, restore from a read-only snapshot and I'm back. I have no use for RAID or much else[2], which is one of the oft-repeated "issues" people identify with `btrfs`.
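On openSUSE the rollback flow described here is driven by snapper on top of the btrfs snapshots; roughly (the snapshot number is illustrative):
% snapper list                 # pre/post snapshots recorded around zypper/yast runs
% snapper rollback 42          # make a writable copy of snapshot 42 the new default root
% reboot                       # boot into the restored state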
It fits for my use-case, along with many of the other use-cases I encounter frequently. It's not perfect, but neither is any filesystem. I won't even argue that other people with the same use case will come to the same conclusion. But as far as I'm concerned, damn it works well.
[0] I want to say that an installation of openSUSE ended up causing me to switch to `btrfs`, but I can't remember for sure -- that's all I run, personally, and it is a default for a new installation's root drive.
[1] Bug: a specific feature (i.e. RAID) just doesn't work. Quirk: the filesystem has multiple concepts of "free space" that don't necessarily line up with what running applications understand.
[2] My servers all have LSI or other hardware RAID controllers and present the array as a single disk to the OS; I'm not relying on my filesystem to manage that. My laptop has a single SSD.
by nickik on 1/27/20, 3:39 PM
I remember using it after I had heard it was 'stable', and it ate my data not long after (I wasn't using crazy features or anything). I certainly will not use it again. A FS should be stable from the beginning: a stable core that you can then build features around, rather than a system with lots of features that promises to be stable in a couple of years (and then wasn't, years after being in the kernel already).
Using ZFS for me has been nothing but joy in comparison. Growing the ZFS pool for me has been no issue at all, I never saw a reason why I would want to reconfigure my pool. I went from 4TB to 16TB+ so far in multiple iterations.
Overall, not having ZFS in Linux is a huge failure of the Linux world. I think it's much more NIMBY than a license issue.
by curt15 on 1/27/20, 3:23 PM
by c0ffe on 1/27/20, 6:10 PM
I started it just for testing, and it has been running for close to two years now with no problems so far.
by shmerl on 1/28/20, 12:05 AM
by e40 on 1/27/20, 3:16 PM
by cyphar on 1/28/20, 12:31 PM
> If you want to grow the pool, you basically have two recommended options: add a new identical vdev, or replace both devices in the existing vdev with higher capacity devices.
You can add vdevs to a pool which are different types or have different parities. It's not really recommended because it means that you're making it harder to know how many failures your pool can survive, but it's definitely something you can do -- and it's just as easy as adding any other vdev to your pool:
% zpool add <pool> <vdev> <devices...>
This has always been possible with ZFS, as far as I'm aware.
> So let’s say you had no writes for a month and continual reads. Those two new disks would go 100% unused. Only when you started writing data would they start to see utilization
This part is accurate...
> and only for the newly written files.
... but this part is not. Modifying an existing file will almost certainly result in data being copied to the newer vdev -- because ZFS will send more writes to drives that are less utilised (and if most of the data is on the older vdevs, then most reads are to the older vdevs, and thus the newer vdevs get more writes).
> It’s likely that for the life of that pool, you’d always have a heavier load on your oldest vdevs. Not the end of the world, but it definitely kills some performance advantages of striping data.
This is also half-true -- it's definitely not ideal that ZFS doesn't have a defrag feature, but the above-mentioned characteristic means that eventually your pool will not be so unbalanced.
> Want to break a pool into smaller pools? Can’t do it. So let’s say you built your 2x8 + 2x8 pool. Then a few years from now 40 TB disks are available and you want to go back to a simple two disk mirror. There’s no way to shrink to just 2x40.
This is now possible. ZoL 0.8 and later support top-level mirror vdev removal.
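For reference, the removal itself is a single command once you're on a new enough version (pool and vdev names are illustrative):
% zpool remove tank mirror-1    # evacuates the vdev's data to the rest of the pool, then removes it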
> Got a 4-disk raidz2 pool and want to add a disk? Can’t do it.
It is true that this is not possible at the moment, but in the interest of fairness I'd like to mention that it is currently being worked on[1].
> For most fundamental changes, the answer is simple: start over. To be fair, that’s not always a terrible idea, but it does require some maintenance down time.
This is true, but I believe that the author makes it sound much harder than it actually is (it does have some maintenance downtime, but because you can snapshot the filesystem the downtime can be as little as a minute):
# Assuming you've already created the new pool $new_pool.
% zfs snapshot -r $old_pool/ROOT@base_snapshot
% zfs send $old_pool/ROOT@base_snapshot | zfs recv $new_pool/ROOT
# The base copy is done -- no downtime. Now we take some downtime by stopping all use of the pool.
% take_offline $old_pool # or do whatever it takes for your particular system
% zfs mount -o ro $old_pool/ROOT # optional
% zfs snapshot -r $old_pool/ROOT@last_snapshot
% zfs send -i @base_snapshot $old_pool/ROOT@last_snapshot | zfs recv $new_pool/ROOT
# Finally, get rid of the old pool and add our new pool.
% zpool export $old_pool
% zpool import $new_pool $old_pool
% zfs mount -a # probably optional
[1]: https://www.youtube.com/watch?v=Njt82e_3qVo
by lazylizard on 1/27/20, 4:47 PM
Raidz2+spares, compression, snapshots and send/receive are very useful. And ZIL and cache devices are easier to set up than lvmcache.
by zozbot234 on 1/27/20, 3:27 PM
by gitgudnubs on 1/27/20, 8:58 PM
It supports heterogeneous drives, safe rebalancing (create a third copy, THEN delete the old copy), fault domains (3-way mirror, but no two copies can be on the same disk/enclosure/server/whatever), erasure coding, hierarchical storage based on disk type (e.g., use NVMe for the log, SSD for the cache), and clustering (Paxos, probably). Then you toss ReFS on top, and you're done.
The only compelling reasons to buy Windows Server are to run third-party software or a Storage Spaces/ReFS file share.