from Hacker News

ZFS for Dummies

by giis on 9/5/23, 3:07 AM with 164 comments

  • by istjohn on 9/5/23, 7:23 AM

    I'm getting started with ZFS just now. The learning curve is steeper than I expected. I would love to have a dumbed down wrapper that made the common case dead-simple. For example:

    - Use sane defaults for pool creation. ashift=12, lz4 compression, xattr=sa, acltype=posixacl, and atime=off. Don't even ask me.

    - Make encryption just on or off instead of offering five or six options

    - Generate the encryption key for me, set up the systemd service to decrypt the pool at start up, and prompt me to back up the key somewhere

    - `zfs list` should show if a dataset is mounted or not, if it is encrypted or not, and if the encryption key is loaded or not

    - No recursive datasets and use {pool}:{dataset} instead of {pool}/{dataset} to maintain a clear distinction between pools and datasets.

    - Don't make me name pools or snapshots. Assign pools the name {hostname}-[A-Z]. Name snapshots {pool name}_{datetime created} and give them numerical shortcuts so I never have to type that all out

    - Don't make me type disk IDs when creating pools. Store metadata on the disk so ZFS doesn't get confused if I set up a pool with `/dev/sda` and `/dev/sdb` references and then shuffle around the drives

    - Always use `pv` to show progress

    - Automatically set up weekly scrubs

    - Automatically set up hourly/daily/weekly/monthly snapshots and snapshot pruning

    - If I send to a disk without a pool, ask for confirmation and then create a new single disk pool for me with the same settings as on the sending pool

    - collapse `zpool` and `zfs` into a single command

    - Automatically use `--raw` when sending encrypted datasets, default to `--replicate` when sending, and use `-I` whenever possible when sending

    - Provide an obvious way to mount and navigate a snapshot dataset instead of hiding the snapshot filesystem in a hidden directory
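    For reference, the pool-creation defaults from the first wish-list item can already be spelled out by hand today. A rough sketch, assuming two hypothetical disks and a pool named `tank`:

```shell
# Sketch only: a mirrored pool with the suggested defaults
# (disk IDs below are placeholders for your /dev/disk/by-id entries)
zpool create \
    -o ashift=12 \
    -O compression=lz4 \
    -O xattr=sa \
    -O acltype=posixacl \
    -O atime=off \
    tank mirror /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B
```

    Using /dev/disk/by-id paths rather than /dev/sdX also covers the "shuffle the drives around" concern, since those names are stable across reboots.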

  • by vermaden on 9/5/23, 7:55 AM

    Other useful things about ZFS:

    - get to know the difference between zpool-attach(8) and zpool-replace(8).

    - this one will tell you where your space is used:

        # zfs list -t all -o space
        NAME                      AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
        (...)
    
    - ZFS Boot Environments is the best feature to protect your OS before major changes/upgrades

    --- this may be useful for a start: https://is.gd/BECTL

    - this command will tell you all history about ZFS pool config and its changes:

        # zpool history poolname
        History for 'poolname':
        2023-06-20.14:03:08 zpool create poolname ada0p1
        2023-06-20.14:03:08 zpool set autotrim=on poolname
        2023-06-20.14:03:08 zfs set atime=off poolname
        2023-06-20.14:03:08 zfs set compression=zstd poolname
        2023-06-20.14:03:08 zfs set recordsize=1m poolname
        (...)
    
    - the guide omits one important detail:


      --- you can create 3-way mirror - requires 3 disks and 2 may fail - still no data lost
    
      --- you can create 4-way mirror - requires 4 disks and 3 may fail - still no data lost
    
      --- you can create N-way mirror - requires N disks and N-1 may fail - still no data lost
    
      (useful when data is most important and you do not have that many slots/disks)
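    As a concrete sketch of the N-way mirror point (device names hypothetical):

```shell
# 3-way mirror: 3 disks, any 2 may fail with no data loss
zpool create poolname mirror da0 da1 da2

# an existing mirror can be widened later by attaching another disk,
# turning the 3-way mirror into a 4-way mirror after resilvering
zpool attach poolname da0 da3
```
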
  • by customizable on 9/5/23, 7:35 AM

    We have been running a large multi-TB PostgreSQL database on ZFS for years now. ZFS makes it super easy to do backups, create test environments from past snapshots, and saves a lot of disk space thanks to built-in compression. In case anyone is interested, you can read our experience at https://lackofimagination.org/2022/04/our-experience-with-po...
  • by qwertox on 9/5/23, 5:50 AM

    FreeBSD's Handbook on ZFS [0] and Aaron Toponce's articles [1] were what helped me the most when getting started with ZFS

    [0] https://docs.freebsd.org/en/books/handbook/zfs/

    [1] https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux...

  • by philsnow on 9/5/23, 5:42 AM

    One of the diagrams under the bit about snapshotting has a typo reading "snapthot" and I immediately thought it was talking about instagram.

    (I realize now after writing it that maybe snapchat should have occurred to me first, but I have never used it)

  • by tomxor on 9/5/23, 5:15 AM

    I recently rebuilt a load of infrastructure (mainly LAMP servers) and decided to back them all with ZFS on Linux for the benefit of efficient backup replication and encryption.

    I've been using ZFS in combination with rsync for backups for a long time, so I was fairly comfortable with it... and it all worked out, but it was a way bigger time sink than I expected - because I wanted to do it right - and there is a lot of misleading advice on the web, particularly when it comes to running databases and replication.

    For databases (you really should at minimum do basic tuning like block-size alignment), by far the best resource I found for MariaDB/InnoDB is from the Let's Encrypt people [0]. They give reasons for everything and cite multiple sources, which is gold. If you search around the web elsewhere you will find endless contradictory advice, anecdotes and myths accompanied by incomplete and baseless theories. Ultimately you should also test this stuff and understand everything you tune (it's OK to decide not to tune something).
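    For the curious, the basic block-size alignment amounts to something like the following (dataset names hypothetical; see the guide linked below for the full reasoning). It assumes the default 16k InnoDB page size:

```shell
# InnoDB data pages are 16k by default, so match the dataset recordsize
zfs create -o recordsize=16k tank/db/data

# sequential redo/binary logs stream well at a larger recordsize
zfs create -o recordsize=128k tank/db/logs
```
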

    For replication, I can only recommend the man pages... yeah, really! ZFS gives you solid replication tools, but they are too agnostic; they are like git plumbing. They don't assume you're going to be doing it over SSH (even though that's almost always how it's used), so you have to plug it together yourself, and this feels scary at first, especially because you probably want it automated, which means considering edge cases... which is why everyone runs to something like syncoid.

    But there's something horrible I discovered with replication scripts like syncoid: they don't use ZFS send's --replicate mode! They try to reimplement it in Perl, for "greater flexibility", but incompletely. This is maddening when you test this stuff for the first time and find that all of the encryption roots break on a fresh restore, and that not all dataset properties are automatically synced. ZFS takes care of all of this if you simply use the built-in recursive --replicate option.

    It's not that hard to script manually once you commit to it. Just keep it simple: don't add a bunch of unnecessary crap into the pipeline like syncoid does (it actually slows things down if you test), just use pv to monitor progress and it will fly.

    I might publish my replication scripts at some point because I feel like there are no good functional reference scripts for this stuff that deal with the basics without going nuts and reinventing replication badly like so many others.

    [0] https://github.com/letsencrypt/openzfs-nvme-databases
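    A minimal version of the pipeline described above, assuming a local pool `tank`, a remote host `backuphost`, and a remote pool `backup` (all names hypothetical):

```shell
# raw (encrypted datasets sent as-is), recursive, incremental send,
# with pv in the middle to show progress
zfs send --raw --replicate -I tank@2023-09-01 tank@2023-09-05 \
  | pv \
  | ssh backuphost zfs receive -d -F backup
```

    The --replicate flag is what preserves encryption roots, dataset properties, and the child-dataset hierarchy on the receiving side.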

  • by guerby on 9/5/23, 5:22 AM

    I started to use ZFS (on Linux) a few years ago and it went smoothly.

    My only surprise was the volblocksize default, which is pretty bad for most RAIDZ configurations: you need to increase it to avoid losing 50% of your raw disk space...

    Articles touching on this topic:

    https://jro.io/nas/#overhead

    https://openzfs.github.io/openzfs-docs/Basic%20Concepts/RAID...

    https://www.delphix.com/blog/zfs-raidz-stripe-width-or-how-i...

    And you end up on one of the ZFS "spreadsheets" out there:

    ZFS overhead calc.xlsx https://docs.google.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT6...

    RAID-Z parity cost https://docs.google.com/spreadsheets/d/1pdu_X2tR4ztF6_HLtJ-D...
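    For zvols on RAIDZ this means setting volblocksize explicitly at creation time, since it cannot be changed afterwards. A sketch with hypothetical names:

```shell
# the default volblocksize (8k or 16k depending on OpenZFS version) can waste
# a lot of raw space to parity and padding on RAIDZ; a larger value such as
# 64k keeps the overhead reasonable for many workloads
zfs create -V 100G -o volblocksize=64k tank/vm-disk
```

    The right value depends on the RAIDZ width and workload, which is exactly what the spreadsheets above help you calculate.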

  • by tweetle_beetle on 9/5/23, 8:48 AM

    I might not remember the details correctly, but when I was younger and stupider I read a lot from fervent fans about how great ZFS and one of the open source NAS OSs (FreeNAS?) were. I bought a very low-spec second-hand HP MicroServer on eBay and jumped straight in without really knowing what I was doing. I asked a few questions on the community forum, but the vast majority of answers were "Have you read the documentation?!" and "Do you have enough RAM?!".

    The documentation in question was a PowerPoint presentation with hard-to-read styling, somewhat evangelical language, and lots of assumptions about prior knowledge, and it was not regularly updated. It was vague about how much RAM was required, mainly just focusing on having as much as possible. Needless to say, I ignored all the red flags about the technology, the hype and my own knowledge, and lost a load of data. Lots of lessons learnt.

  • by unethical_ban on 9/5/23, 3:39 PM

    Some additional points for posterity, in case it isn't driven home here:

    - All redundancy in ZFS lives at the vdev layer. Zpools are created from one or more vdevs, and no matter what, if you lose any single vdev in a zpool, the whole zpool is permanently destroyed.

    - Historically, RAIDZ (parity RAID) vdevs cannot be expanded by adding disks. The only way to grow a RAIDZ is to replace each disk in the array, one at a time, with a larger disk (and hope no disks fail during the rebuild). So in my very amateur opinion, I would only consider a RAIDZ if it is something like a RAIDZ2 or RAIDZ3 with a large number of disks. For n <= 6, and if the budget can stand it, I would do several mirrored vdevs. (Again, as an amateur I am less familiar with the read/write performance characteristics of the various RAID levels, so do more research for prod.)
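    A sketch of the "several mirrored vdevs" layout (device names hypothetical):

```shell
# a pool striped across two 2-way mirror vdevs; each mirror survives one
# disk failure, but losing both disks of either mirror destroys the pool
zpool create tank mirror da0 da1 mirror da2 da3

# unlike a RAIDZ, this pool grows by simply adding another mirror vdev
zpool add tank mirror da4 da5
```
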

  • by mastax on 9/5/23, 9:07 AM

    I've run into a ZFS problem I don't understand. I have a zpool where zpool status prints out a list of detected errors, never in files or `<metadata>` but in snapshots (and hex numbers that I assume are deleted snapshots). If I delete the listed errored snapshots and run zpool scrub twice the errors disappear and the scrub finds no errors. Zpool status never listed any errors for any of the devices.

    So there aren't any errors in files. There aren't any errors in devices. There aren't any errors detected in scrub(?). And yet at runtime I get a dozen new "errors" showing up in zpool status per day. How?

  • by totetsu on 9/5/23, 4:25 AM

    Nice. My gotchas from using ZFS on my personal laptop with Ubuntu:

    - If you connect your drive to another system and mount your zpool there (to copy files, for example), it sets a pool-membership value on the filesystem, and when you put the drive back in your own system it won't boot unless you set that value back. Which involved a chroot.

    - The default settings I had took a snapshot every time I apt installed something. Because those snapshots included my home drive, when I deleted big files afterwards I didn't get any free space back until I figured out what was going on and arbitrarily deleted some old snapshots.

    - You can't just make a swap file and use it.
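    For what it's worth, the first gotcha can usually be avoided by cleanly exporting the pool before moving the drive. A sketch, assuming a pool named `rpool`:

```shell
# on the temporary system, before unplugging the drive
zpool export rpool

# back on the original system, if the pool was NOT cleanly exported,
# a forced import clears the "pool was in use by another system" state
zpool import -f rpool
```
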

  • by idatum on 9/5/23, 5:14 AM

    I need three stores to feel I'm keeping years of digital family photos safe:

    1) A live (local) FreeBSD ZFS server running for backups and snapshots, with two pairs of mirrored physical drives.

    2) A USB device that takes two mirrored drives to recv ZFS snapshots from #1; I store that vdev backup in a safe place.

    3) Backups of entire datasets to cloud storage from off-prem using rclone.

    It's #3 where I need to do some more research/work. I need to spend some time sending snapshots/diffs to cloud blob storage and make sure I can restore. Yes, I know there is rsync.net.

    Any experiences to share?
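    One approach for #3 (an untested sketch, names hypothetical) is to stream serialized snapshots straight into object storage with rclone's rcat:

```shell
# full raw send of a snapshot, streamed to a cloud remote as a single object
zfs send --raw tank/photos@2023-09-01 \
  | rclone rcat remote:backups/photos-2023-09-01.zfs

# restore path: stream the object back into zfs receive
rclone cat remote:backups/photos-2023-09-01.zfs | zfs receive tank/photos-restore
```

    The restore leg is the part worth rehearsing before trusting it, since a send stream is only useful if zfs receive accepts it end to end.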

  • by asicsp on 9/5/23, 4:48 AM

  • by crawsome on 9/5/23, 1:19 PM

    "Also read up on the zpool add command."

    Haha, the only part of maintenance that I need to look up every time I do it is replacing a faulty hard drive.

    Even this guide skips that.
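    Since the guide skips it, here is the rough shape of a drive replacement (device names hypothetical; check zpool-replace(8) for your platform's details):

```shell
zpool offline tank da2     # take the failing disk out of service
# physically swap the disk, then resilver onto the replacement
# (one-argument form: new disk sits at the same device path)
zpool replace tank da2
zpool status tank          # watch resilver progress and the final state
```
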

  • by znpy on 9/5/23, 9:00 AM

    Is there an equivalent "btrfs for dummies"?
  • by dontupvoteme on 9/5/23, 4:05 AM

    As nice as the technology is as long as there's the potential of a Damoclean license issue I'll always feel hesitant around ZFS.

    (Hey looks like it's a sore spot!)