by pandalicious on 4/20/18, 2:03 PM with 135 comments
by moltensyntax on 4/20/18, 7:00 PM
To me, most of the claims are arguable.
To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.
To say padding is "useless"... I don't understand why padding and byte-alignment are given so much vitriol. Look at how much padding the tar format has. And tar is a good example of how "useless padding" was used to extend the format to support larger files. So this supposed "flaw" has been in tar for decades, with no disastrous effects at all.
The xz decision was not made "blindly". There was thought behind the decision.
And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is more a social problem than a technical one.
I'll leave it at this for now, but there's more I could write.
by comex on 4/20/18, 11:54 PM
Like some other compressed formats, an lzip file is just a series of compressed blocks concatenated together, each block starting with a magic number and containing a certain amount of compressed data. There’s no overall file header, nor any marker that a particular block is the last one. This structure has the advantage that you can simply concatenate two lzip files, and the result is a valid lzip file that decompresses to the concatenation of what the inputs decompress to.
Thus, when the decompressor has finished reading a block and sees there’s more input data left in the file, there are two possibilities for what that data could contain. It could be another lzip block corresponding to additional compressed data. Or it could be any other random binary data, if the user is taking advantage of the “trailing data” feature, in which case the rest of the file should be silently ignored.
How do you tell the difference? Simply enough, by checking if the data starts with the 4-byte lzip magic number. If the magic number itself is corrupted in any way? Then the entire rest of the file is treated as “trailing data” and ignored. I hope the user notices their data is missing before they delete the compressed original…
It might be possible to identify an lzip block that has its magic number corrupted, e.g. by checking whether the trailing CRC is valid. However, at least at the time I discovered this, lzip’s decompressor made no attempt to do so. It’s possible the behavior has improved in later releases; I haven’t checked.
But at least at the time this article was written: pot, meet kettle.
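For anyone who wants to see the failure mode comex describes first-hand, here is a rough sketch (file names are made up, it assumes GNU coreutils and lzip on the PATH, and, as noted above, newer lzip releases may have tightened this behavior up):
# Build a two-member lzip file by simple concatenation (a documented, valid operation).
echo "first" > a.txt; lzip -k a.txt
echo "second" > b.txt; lzip -k b.txt
cat a.txt.lz b.txt.lz > both.lz
lzip -dc both.lz                  # prints "first" then "second"
# Corrupt the second member's magic bytes ("LZIP"), which begin right after the first member.
off=$(stat -c %s a.txt.lz)
printf 'X' | dd of=both.lz bs=1 seek="$off" conv=notrunc 2>/dev/null
lzip -dc both.lz                  # on the versions described above: prints only "first"; the rest is silently treated as trailing data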
by tedunangst on 4/20/18, 4:38 PM
To add to that, if you need parity to recover from errors, you need to calculate how much based on your storage medium's durability and projected life span. It's not the file format's concern. The xz CRC should be irrelevant.
by arundelo on 4/20/18, 3:54 PM
https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-co...
by carussell on 4/20/18, 4:44 PM
Previously discussed here on HN back then:
https://news.ycombinator.com/item?id=12768425
The author has made some minor revisions since then. Here are the main changes to the page since it was first discussed here:
http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...
And here's the full page history:
http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...
by cpburns2009 on 4/20/18, 3:35 PM
by jwilliams on 4/20/18, 5:33 PM
xz can be amazing. It can also bite you.
I've had payloads that compress to a ratio of 0.16 with gzip and to 0.016 with xz. Hurray! Then I've had payloads where xz compression is on par, or worse. However, with "best or extreme" compression, xz can peg your CPU for much longer: gzip and bzip2 take minutes while xz -9 takes hours at 100% CPU.
As annoying as that is, getting an order of magnitude better compression in many circumstances is hard to give up.
My compromise is "xz -1". It usually delivers pretty good results, in reasonable time, with manageable CPU/Memory usage.
FYI, the datasets are largely text-ish, usually in 250MB-1GB chunks. So we're talking JSON data, webpages, and the like.
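A pipeline using that xz -1 compromise might look something like this (the directory and file names are made up; -T0 uses all available cores, needs a reasonably recent xz, and can make the output slightly larger because it compresses in independent blocks):
# Low preset, multi-threaded: far faster than -9 and usually close enough.
tar cf - ./chunks | xz -1 -T0 > chunks.tar.xz
# Sanity-check the result without extracting it.
xz -l chunks.tar.xz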
by freedomben on 4/20/18, 8:10 PM
by eesmith on 4/20/18, 4:49 PM
When I last looked into this issue, it seemed that erasure codes, as used by Parchive/par/par2, were the way to go. (As others have mentioned here.) I haven't tried them out as I haven't needed that level of robustness.
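For anyone curious, basic par2cmdline usage looks roughly like this (the archive name and the 10% redundancy figure are just examples):
# Create recovery files worth about 10% of the archive's size.
par2 create -r10 archive.tar.xz
# Later: check the archive against the recovery data, and repair it if damaged.
par2 verify archive.tar.xz.par2
par2 repair archive.tar.xz.par2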
by davidw on 4/20/18, 4:35 PM
by pmoriarty on 4/20/18, 5:34 PM
When I burn data (including xz archives) onto DVD for archival storage, I use dvdisaster[2] for the same purpose.
I've tested both by damaging archives and scratching DVDs, and these tools work great for recovery. The amount of redundancy (with a tradeoff for space) is also tuneable for both.
[1] - https://github.com/Parchive/par2cmdline
[2] - http://dvdisaster.net/
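The damage test described above can be scripted along these lines (the offset and size are arbitrary, it assumes recovery files were already created as in the earlier par2 example, and you should only ever do this to a copy):
# Simulate corruption by overwriting 64 bytes somewhere inside the archive.
dd if=/dev/urandom of=archive.tar.xz bs=1 seek=100000 count=64 conv=notrunc
xz -t archive.tar.xz                 # xz detects the damage but cannot repair it
par2 repair archive.tar.xz.par2      # par2 rebuilds the damaged blocks from the recovery data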
by doubledad222 on 4/20/18, 3:50 PM
by ryao on 4/20/18, 7:14 PM
This article is likely more relevant to tape archives than anything most people use today.
by nurettin on 4/20/18, 3:25 PM
by londons_explore on 4/20/18, 5:37 PM
The author seems to think the xz container file format should do that.
When you remove this requirement, nearly all his arguments become moot.
by leni536 on 4/20/18, 5:56 PM
I can understand the concerns about versioning and fragmented extension implementations though.
by LinuxBender on 4/20/18, 6:59 PM
renice 19 -p $$ > /dev/null 2>&1
then ... use tar + xz to save extra metadata about the file(s), even if it is only one file.
tar cf - ~/test_files/* | xz -9ec -T0 > ./test.tar.xz
If that (or the extra options in tar for xattrs) is not enough, then create a checksum manifest, always sorted:
sha256sum ~/test_files/* | sort -n > ~/test_files/.sha256
Then use the above command to compress it all into a .tar file that now contains your checksum manifest.
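Later on, the manifest can be checked with something like this (the path matches the example above; any mismatch is reported as FAILED):
# Re-verify the files against the manifest created above.
sha256sum -c ~/test_files/.sha256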
by AndyKelley on 4/20/18, 7:31 PM
34M zig-linux-x86_64-0.2.0.cc35f085.tar.gz
33M zig-linux-x86_64-0.2.0.cc35f085.tar.zst
30M zig-linux-x86_64-0.2.0.cc35f085.tar.bz2
24M zig-linux-x86_64-0.2.0.cc35f085.tar.lz
23M zig-linux-x86_64-0.2.0.cc35f085.tar.xz
With maximum compression (the -9 switch), lzip wins but takes longer than xz:
23725264 zig-linux-x86_64-0.2.0.cc35f085.tar.xz 63.05 seconds
23627771 zig-linux-x86_64-0.2.0.cc35f085.tar.lz 83.42 seconds
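For anyone who wants to reproduce a comparison like this on their own tarball, the commands are roughly the following (file name taken from the listing above; -k keeps the input so both tools can run on the same file):
time xz -9 -k zig-linux-x86_64-0.2.0.cc35f085.tar
time lzip -9 -k zig-linux-x86_64-0.2.0.cc35f085.tar
ls -l zig-linux-x86_64-0.2.0.cc35f085.tar.xz zig-linux-x86_64-0.2.0.cc35f085.tar.lz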
by qwerty456127 on 4/20/18, 5:42 PM
by orbitur on 4/20/18, 6:50 PM
by ebullientocelot on 4/20/18, 6:44 PM
by Annatar on 4/21/18, 10:10 AM
by vortico on 4/20/18, 7:55 PM
What is the probability of a complete HD failure in a year?
by loeg on 4/21/18, 4:12 AM
by sirsuki on 4/20/18, 8:06 PM
tar c foo | gzip > foo.tar.gz
or tar c foo | bzip2 > foo.tar.bz2
Been using these for over 20 years now. Why is it so important to change things, especially when, as this article points out, it's for the worse?!
by nailer on 4/20/18, 5:11 PM
document.body.style['max-width'] = '550px'; document.body.style.margin = '0 auto'
by Lionsion on 4/20/18, 3:36 PM
by microcolonel on 4/21/18, 4:02 AM
If your storage fails, maybe you'll have a problem, but you'd have a problem anyway.
Sometimes I feel like genuine technical concerns are buried by the authors being jerks and blowing things way out of proportion. I, for one, tend to lose interest when I hear hyperbolic mudslinging.
by kazinator on 4/20/18, 4:17 PM
Wow ... that is inexcusably idiotic. Whoever designed that shouldn't be programming. Out of professional disdain, I pledge never to use this garbage.