by vordoo on 1/27/20, 1:57 PM with 237 comments
by InTheArena on 1/27/20, 4:04 PM
I went down the BTRFS path, despite its dodgy reputation, when Netgear announced their little embedded NASes, and switched my server over to it. The experience was solid enough that I bought a high-end Synology and have had zero problems with it.
by derefr on 1/27/20, 3:58 PM
Or, to put that another way: what are AWS and GCP using in their SANs (EBS; GCE PD) that allows them to take on-demand incremental snapshots of SAN volumes, and then ship those snapshots away from the origin node into safer out-of-cluster replicated storage (e.g. object storage)? Is it proprietary, or is it just several FOSS technologies glued together?
My naive guess would be that the cloud hosts are either using ZFS volumes, or LVM LVs (which do have incremental snapshot capability, if the disk is created in a thin pool) under iSCSI. (Or they’re relying on whatever point-solution VMware et al sold them.)
If you control the filesystem layer (i.e. you don’t need to be filesystem-agnostic), would Btrfs snapshots be better for this same use-case?
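Neither of these is confirmed to be what AWS or GCP actually run, but as a sketch of the two mechanisms mentioned above (VG, pool and path names are made up for illustration): with LVM, incremental snapshot capability requires the LV to live in a thin pool,
% lvcreate -L 100G -T vg0/tpool                  # create a thin pool
% lvcreate -V 50G -T vg0/tpool -n vol1           # thin volume inside the pool
% lvcreate -s -n vol1_snap1 vg0/vol1             # thin snapshot, shares blocks with the origin
while with Btrfs, "btrfs send -p" emits an incremental stream against a parent snapshot that could then be shipped off to object storage:
% btrfs subvolume snapshot -r /data /data/.snap1
% btrfs subvolume snapshot -r /data /data/.snap2   # some time later
% btrfs send -p /data/.snap1 /data/.snap2 | gzip > /backup/data-incr.btrfs.gz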
by gravypod on 1/27/20, 6:43 PM
by tezzer on 1/27/20, 6:20 PM
by pojntfx on 1/27/20, 3:22 PM
by kiney on 1/27/20, 7:35 PM
[/dev/mapper/h4_crypt].write_io_errs 0
[/dev/mapper/h4_crypt].read_io_errs 0
[/dev/mapper/h4_crypt].flush_io_errs 0
[/dev/mapper/h4_crypt].corruption_errs 0
[/dev/mapper/h4_crypt].generation_errs 0
[/dev/mapper/h2_crypt].write_io_errs 0
[/dev/mapper/h2_crypt].read_io_errs 30
[/dev/mapper/h2_crypt].flush_io_errs 0
[/dev/mapper/h2_crypt].corruption_errs 0
[/dev/mapper/h2_crypt].generation_errs 0
[/dev/mapper/h1_crypt].write_io_errs 0
[/dev/mapper/h1_crypt].read_io_errs 0
[/dev/mapper/h1_crypt].flush_io_errs 0
[/dev/mapper/h1_crypt].corruption_errs 0
[/dev/mapper/h1_crypt].generation_errs 0
[/dev/mapper/h3_crypt].write_io_errs 0
[/dev/mapper/h3_crypt].read_io_errs 0
[/dev/mapper/h3_crypt].flush_io_errs 0
[/dev/mapper/h3_crypt].corruption_errs 0
[/dev/mapper/h3_crypt].generation_errs 0
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].write_io_errs 0
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].read_io_errs 16
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].flush_io_errs 0
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].corruption_errs 20619
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].generation_errs 0
edit: formatting
by epx on 1/27/20, 3:25 PM
by alyandon on 1/27/20, 3:38 PM
by Shalle135 on 1/27/20, 3:52 PM
It all depends on the application, but in the majority of cases the I/O performance of btrfs is worse than the alternatives.
Red Hat, for example, chose to deprecate btrfs for unknown reasons, while SUSE made it its default. Its future seems uncertain, which may cause a lot of headaches in major environments if it is adopted there.
by zielmicha on 1/27/20, 4:50 PM
ext4 - 33s, ZFS - 50s, btrfs - 74s
(the test was run on a Vultr.com 2GB virtual machine; the backing disk was allocated using "fallocate --length 10G" on an ext4 filesystem; the results are very consistent)
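The comment doesn't spell out the full setup, but a plausible reading is a file-backed loop device formatted with each filesystem under test, roughly (paths are illustrative):
% fallocate --length 10G /srv/backing.img
% losetup --find --show /srv/backing.img      # prints e.g. /dev/loop0
% mkfs.btrfs /dev/loop0                       # or mkfs.ext4 / zpool create, per run
% mount /dev/loop0 /mnt/test                  # then run and time the workload against /mnt/test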
by lousken on 1/27/20, 3:25 PM
by pQd on 1/28/20, 7:52 AM
since 2017 i'm also using BTRFS to host mysql replication slaves. every 15 min, 1 h and 12 h, crash-consistent snapshots of the running database files are taken and kept for a couple of days. there's consensus that - due to its COW nature - BTRFS is not well suited for hosting vms, databases or any other type of files that change frequently. performance is significantly worse compared to EXT4 - this can lead to slave lag, but slave lag can be mitigated by using NVMe drives and relaxing the durability settings of MySQL's innodb engine. i've used those snapshots a few times each year - it has worked fine so far.
snapshots should never be the main backup strategy; independently of them there's a full database backup done daily from the masters using mysqldump. snapshots are useful whenever you need to very quickly access the state of the production data from a few minutes or hours ago - for instance after fat-fingering some live data.
during those years i've seen kernel crashes most likely due to BTRFS but i did not lose data as long as the underlying drives were healthy.
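A minimal sketch of the snapshot rotation described above, assuming the MySQL datadir sits on its own btrfs subvolume (paths and the timestamp format are illustrative, not the commenter's actual setup):
# from cron, every 15 min / 1 h / 12 h: read-only, crash-consistent snapshot of the running datadir
% btrfs subvolume snapshot -r /srv/mysql /srv/.snapshots/mysql-$(date +%Y%m%d-%H%M)
# expiry is just explicit deletion of snapshots older than a couple of days
% btrfs subvolume delete /srv/.snapshots/mysql-20200101-0000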
by izacus on 1/27/20, 4:29 PM
They're still using their own RAID layer though.
by cmurf on 1/28/20, 7:26 AM
Async discards coming in 5.6. https://lore.kernel.org/linux-btrfs/cover.1580142284.git.dst...
by abotsis on 1/27/20, 4:30 PM
In addition to the slew of other features Btrfs is missing (send/recv, dedup, etc), ZFS allows you to dedicate something like an Intel Optane (or other similar high-write-endurance, low-latency SSD) to act as stable storage for sync writes, and a different device (typically MLC or TLC flash) to extend the read cache.
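On the ZFS side this is a one-liner per device; a sketch with illustrative pool and device names:
% zpool add tank log /dev/nvme0n1     # Optane-class device as SLOG, absorbs synchronous writes
% zpool add tank cache /dev/sdf       # MLC/TLC SSD as L2ARC, extends the read cache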
by geophertz on 1/27/20, 4:12 PM
The ability to add and remove disks on a desktop machine is very tempting.
by mdip on 1/27/20, 3:26 PM
I'll be the first to say that it isn't a silver bullet for everything. But then, what filesystem really is? Filesystems are such a critical part of a running OS that we expect perfection for every use case; filesystem bugs or quirks[1] result in data loss which is usually Really Bad(tm).
That said, for the last two years, I've been running Linux on a Thinkpad with a Windows 10 VM in KVM/qemu -- both are running all the time. When I first configured my Windows 10 VM, performance was brutal; there were times when writes would stall the mouse cursor and the issue was directly related to `btrfs`. I didn't ditch the file-system, I switched to a raw volume for my VM and adjusted some settings that affected how `btrfs` interacted with it. I discovered similar things happened when running a `balance` on the filesystem and after a bit of research, found that changing the IO scheduler to one more commonly used on spindle HDDs made everything more stable.
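The exact settings aren't named above, but the tweaks commonly suggested for VM images on btrfs look roughly like this (illustrative paths, not necessarily what was done here): disable copy-on-write for the image directory and pick an I/O scheduler typically used for spinning disks:
% chattr +C /var/lib/libvirt/images                    # nodatacow; only affects files created afterwards
% echo mq-deadline > /sys/block/sda/queue/scheduler    # runtime scheduler change for the backing device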
So why use something that requires so much grief to get it working? Because those settings changes are a minor inconvenience compared with what I'd otherwise have to deal with for a bigger problem I frequently encountered: OS recovery. An out-of-the-box openSUSE Tumbleweed installation uses `btrfs` on root. Every time software is added/modified, or `yast` (the user-friendly administrative tool) is run, a snapshot is taken automatically. When I or my OS screws something up, I have a boot menu that lets me "go back" to prior to the modification. It Just Works(tm). In the last two years, I've had around 4-5 cases where my OS was wrecked by keeping things up to date or tweaking configuration. In the past, I'd be re-installing. Now, I reboot after applying updates and if things are messed up, I reboot again, restore from a read-only snapshot and I'm back. I have no use for RAID or much else[2], which is one of the oft-repeated "issues" people identify with `btrfs`.
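On openSUSE the rollback flow described here is driven by snapper on top of the btrfs snapshots; roughly (the snapshot number is illustrative):
% snapper list                 # pre/post snapshots recorded around zypper/yast runs
% snapper rollback 42          # make a writable copy of snapshot 42 the new default root
% reboot                       # boot into the restored state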
It fits for my use-case, along with many of the other use-cases I encounter frequently. It's not perfect, but neither is any filesystem. I won't even argue that other people with the same use case will come to the same conclusion. But as far as I'm concerned, damn it works well.
[0] I want to say that an installation of openSUSE ended up causing me to switch to `btrfs`, but I can't remember for sure -- that's all I run, personally, and it is a default for a new installation's root drive.
[1] Bug: a specific feature (i.e. RAID) just doesn't work. Quirk: the filesystem has multiple concepts of "free space" that don't necessarily line up with what running applications understand.
[2] My servers all have LSI or other hardware RAID controllers and present the array as a single disk to the OS; I'm not relying on my filesystem to manage that. My laptop has a single SSD.
by nickik on 1/27/20, 3:39 PM
I remember using it after I had heard it was 'stable', and it ate my data not long after (I wasn't using crazy features or anything). I certainly will not use it again. A FS should be stable from the beginning: a stable core that you can then build features around, rather than a system with lots of features that promises to be stable in a couple of years (and then wasn't, years after being in the kernel already).
Using ZFS for me has been nothing but joy in comparison. Growing the ZFS pool for me has been no issue at all, I never saw a reason why I would want to reconfigure my pool. I went from 4TB to 16TB+ so far in multiple iterations.
Overall, not having ZFS in Linux is a huge failure of the Linux world. I think it's much more NIMBY than a license issue.
by curt15 on 1/27/20, 3:23 PM
by c0ffe on 1/27/20, 6:10 PM
I started it just for testing, and it has been running for close to two years now with no problems so far.
by shmerl on 1/28/20, 12:05 AM
by e40 on 1/27/20, 3:16 PM
by cyphar on 1/28/20, 12:31 PM
> If you want to grow the pool, you basically have two recommended options: add a new identical vdev, or replace both devices in the existing vdev with higher capacity devices.
You can add vdevs to a pool which are different types or have different parities. It's not really recommended because it means that you're making it harder to know how many failures your pool can survive, but it's definitely something you can do -- and it's just as easy as adding any other vdev to your pool:
% zpool add <pool> <vdev> <devices...>
This has always been possible with ZFS, as far as I'm aware.
> So let’s say you had no writes for a month and continual reads. Those two new disks would go 100% unused. Only when you started writing data would they start to see utilization
This part is accurate...
> and only for the newly written files.
... but this part is not. Modifying an existing file will almost certainly result in data being copied to the newer vdev -- because ZFS will send more writes to drives that are less utilised (and if most of the data is on the older vdevs, then most reads are to the older vdevs, and thus the newer vdevs get more writes).
> It’s likely that for the life of that pool, you’d always have a heavier load on your oldest vdevs. Not the end of the world, but it definitely kills some performance advantages of striping data.
This is also half-true -- it's definitely not ideal that ZFS doesn't have a defrag feature, but the above-mentioned characteristic means that eventually your pool will not be so unbalanced.
> Want to break a pool into smaller pools? Can’t do it. So let’s say you built your 2x8 + 2x8 pool. Then a few years from now 40 TB disks are available and you want to go back to a simple two disk mirror. There’s no way to shrink to just 2x40.
This is now possible. ZoL 0.8 and later support top-level mirror vdev removal.
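For reference, the removal itself is a single command once you're on a new enough version (pool and vdev names are illustrative):
% zpool remove tank mirror-1    # evacuates the vdev's data to the rest of the pool, then removes it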
> Got a 4-disk raidz2 pool and want to add a disk? Can’t do it.
It is true that this is not possible at the moment, but in the interest of fairness I'd like to mention that it is currently being worked on[1].
> For most fundamental changes, the answer is simple: start over. To be fair, that’s not always a terrible idea, but it does require some maintenance down time.
This is true, but I believe that the author makes it sound much harder than it actually is (it does have some maintenance downtime, but because you can snapshot the filesystem the downtime can be as little as a minute):
# Assuming you've already created the new pool $new_pool.
% zfs snapshot -r $old_pool/ROOT@base_snapshot
% zfs send $old_pool/ROOT@base_snapshot | zfs recv $new_pool/ROOT
# The base copy is done -- no downtime. Now we take some downtime by stopping all use of the pool.
% take_offline $old_pool # or do whatever it takes for your particular system
% zfs mount -o ro $old_pool/ROOT # optional
% zfs snapshot -r $old_pool/ROOT@last_snapshot
% zfs send -i @base_snapshot $old_pool/ROOT@last_snapshot | zfs recv $new_pool/ROOT
# Finally, get rid of the old pool and add our new pool.
% zpool export $old_pool
% zpool import $new_pool $old_pool
% zfs mount -a # probably optional
[1]: https://www.youtube.com/watch?v=Njt82e_3qVo
by lazylizard on 1/27/20, 4:47 PM
Raidz2+spares, compression, snapshots and send/receive are very useful. And ZIL and cache devices are easier to set up than lvmcache.
by zozbot234 on 1/27/20, 3:27 PM
by gitgudnubs on 1/27/20, 8:58 PM
It supports heterogeneous drives, safe rebalancing (create a third copy, THEN delete the old copy), fault domains (3-way mirror, but no two copies can be on the same disk/enclosure/server/whatever), erasure coding, hierarchical storage based on disk type (e.g., use NVMe for the log, SSD for the cache), and clustering (Paxos, probably). Then you toss ReFS on top, and you're done.
The only compelling reasons to buy Windows Server are to run third-party software or a Storage Spaces/ReFS file share.