by pandalicious on 4/20/18, 2:03 PM with 135 comments
by moltensyntax on 4/20/18, 7:00 PM
To me, most of the claims are arguable.
To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.
To say padding is "useless"... I don't understand why padding and byte-alignment are given so much vitriol. Look at how much padding the tar format has. And tar is a good example of how "useless padding" was used to extend the format to support larger files. So this supposed "flaw" has been in tar for decades, with no disastrous effects at all.
The xz decision was not made "blindly". There was thought behind the decision.
And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is more a social problem than a technical one.
I'll leave it at this for now, but there's more I could write.
by comex on 4/20/18, 11:54 PM
Like some other compressed formats, an lzip file is just a series of compressed blocks concatenated together, each block starting with a magic number and containing a certain amount of compressed data. There’s no overall file header, nor any marker that a particular block is the last one. This structure has the advantage that you can simply concatenate two lzip files, and the result is a valid lzip file that decompresses to the concatenation of what the inputs decompress to.
Thus, when the decompressor has finished reading a block and sees there’s more input data left in the file, there are two possibilities for what that data could contain. It could be another lzip block corresponding to additional compressed data. Or it could be any other random binary data, if the user is taking advantage of the “trailing data” feature, in which case the rest of the file should be silently ignored.
How do you tell the difference? Simply enough, by checking if the data starts with the 4-byte lzip magic number. If the magic number itself is corrupted in any way? Then the entire rest of the file is treated as “trailing data” and ignored. I hope the user notices their data is missing before they delete the compressed original…
It might be possible to identify an lzip block that has its magic number corrupted, e.g. by checking whether the trailing CRC is valid. However, at least at the time I discovered this, lzip’s decompressor made no attempt to do so. It’s possible the behavior has improved in later releases; I haven’t checked.
But at least at the time this article was written: pot, meet kettle.
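For anyone who wants to see the failure mode comex describes first-hand, here is a rough sketch (file names are made up, it assumes GNU coreutils and lzip on the PATH, and, as noted above, newer lzip releases may have tightened this behavior up):
# Build a two-member lzip file by simple concatenation (a documented, valid operation).
echo "first" > a.txt; lzip -k a.txt
echo "second" > b.txt; lzip -k b.txt
cat a.txt.lz b.txt.lz > both.lz
lzip -dc both.lz                  # prints "first" then "second"
# Corrupt the second member's magic bytes ("LZIP"), which begin right after the first member.
off=$(stat -c %s a.txt.lz)
printf 'X' | dd of=both.lz bs=1 seek="$off" conv=notrunc 2>/dev/null
lzip -dc both.lz                  # on the versions described above: prints only "first"; the rest is silently treated as trailing data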
by tedunangst on 4/20/18, 4:38 PM
To add to that, if you need parity to recover from errors, you need to calculate how much based on your storage medium's durability and projected life span. It's not the file format's concern. The xz CRC should be irrelevant.
by arundelo on 4/20/18, 3:54 PM
https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-co...
by carussell on 4/20/18, 4:44 PM
Previously discussed here on HN back then:
https://news.ycombinator.com/item?id=12768425
The author has made some minor revisions since then. Here are the main changes to the page since it was first discussed here:
http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...
And here's the full page history:
http://web.cvs.savannah.nongnu.org/viewvc/lzip/lzip/xz_inade...
by cpburns2009 on 4/20/18, 3:35 PM
by jwilliams on 4/20/18, 5:33 PM
xz can be amazing. It can also bite you.
I've had payloads that compress to a ratio of 0.16 with gzip and to 0.016 with xz. Hurray! Then I've had payloads where xz compression is on par, or worse. However, with "best or extreme" compression, xz can peg your CPU for much longer: gzip and bzip2 take minutes while xz -9 takes hours at 100% CPU.
As annoying as that is, getting an order of magnitude better compression in many circumstances is hard to give up.
My compromise is "xz -1". It usually delivers pretty good results, in reasonable time, with manageable CPU/Memory usage.
FYI, the datasets are largely text-ish, usually in 250MB-1GB chunks. So we're talking JSON data, webpages, and the like.
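A pipeline using that xz -1 compromise might look something like this (the directory and file names are made up; -T0 uses all available cores, needs a reasonably recent xz, and can make the output slightly larger because it compresses in independent blocks):
# Low preset, multi-threaded: far faster than -9 and usually close enough.
tar cf - ./chunks | xz -1 -T0 > chunks.tar.xz
# Sanity-check the result without extracting it.
xz -l chunks.tar.xz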
by freedomben on 4/20/18, 8:10 PM
by eesmith on 4/20/18, 4:49 PM
When I last looked into this issue, it seemed that erasure codes, as used by Parchive/par/par2, were the way to go. (As others have mentioned here.) I haven't tried them out as I haven't needed that level of robustness.
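For anyone curious, basic par2cmdline usage looks roughly like this (the archive name and the 10% redundancy figure are just examples):
# Create recovery files worth about 10% of the archive's size.
par2 create -r10 archive.tar.xz
# Later: check the archive against the recovery data, and repair it if damaged.
par2 verify archive.tar.xz.par2
par2 repair archive.tar.xz.par2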
by davidw on 4/20/18, 4:35 PM
by pmoriarty on 4/20/18, 5:34 PM
When I burn data (including xz archives) onto DVD for archival storage, I use dvdisaster[2] for the same purpose.
I've tested both by damaging archives and scratching DVDs, and these tools work great for recovery. The amount of redundancy (with a tradeoff for space) is also tuneable for both.
[1] - https://github.com/Parchive/par2cmdline
[2] - http://dvdisaster.net/
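The damage test described above can be scripted along these lines (the offset and size are arbitrary, it assumes recovery files were already created as in the earlier par2 example, and you should only ever do this to a copy):
# Simulate corruption by overwriting 64 bytes somewhere inside the archive.
dd if=/dev/urandom of=archive.tar.xz bs=1 seek=100000 count=64 conv=notrunc
xz -t archive.tar.xz                 # xz detects the damage but cannot repair it
par2 repair archive.tar.xz.par2      # par2 rebuilds the damaged blocks from the recovery data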
by doubledad222 on 4/20/18, 3:50 PM
by ryao on 4/20/18, 7:14 PM
This article is likely more relevant to tape archives than anything most people use today.
by nurettin on 4/20/18, 3:25 PM
by londons_explore on 4/20/18, 5:37 PM
The author seems to think the xz container file format should do that.
When you remove this requirement, nearly all his arguments become moot.
by leni536 on 4/20/18, 5:56 PM
I can understand the concerns about versioning and fragmented extension implementations though.
by LinuxBender on 4/20/18, 6:59 PM
renice 19 -p $$ > /dev/null 2>&1
then ... use tar + xz to save extra metadata about the file(s), even if it is only one file.
tar cf - ~/test_files/* | xz -9ec -T0 > ./test.tar.xz
If that (or the extra options in tar for xattrs) is not enough, then create a checksum manifest, always sorted:
sha256sum ~/test_files/* | sort -n > ~/test_files/.sha256
Then use the above command to compress it all into a .tar file that now contains your checksum manifest.
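Later on, the manifest can be checked with something like this (the path matches the example above; any mismatch is reported as FAILED):
# Re-verify the files against the manifest created above.
sha256sum -c ~/test_files/.sha256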
by AndyKelley on 4/20/18, 7:31 PM
34M zig-linux-x86_64-0.2.0.cc35f085.tar.gz
33M zig-linux-x86_64-0.2.0.cc35f085.tar.zst
30M zig-linux-x86_64-0.2.0.cc35f085.tar.bz2
24M zig-linux-x86_64-0.2.0.cc35f085.tar.lz
23M zig-linux-x86_64-0.2.0.cc35f085.tar.xz
With maximum compression (the -9 switch), lzip wins but takes longer than xz:
23725264 zig-linux-x86_64-0.2.0.cc35f085.tar.xz 63.05 seconds
23627771 zig-linux-x86_64-0.2.0.cc35f085.tar.lz 83.42 seconds
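For anyone who wants to reproduce a comparison like this on their own tarball, the commands are roughly the following (file name taken from the listing above; -k keeps the input so both tools can run on the same file):
time xz -9 -k zig-linux-x86_64-0.2.0.cc35f085.tar
time lzip -9 -k zig-linux-x86_64-0.2.0.cc35f085.tar
ls -l zig-linux-x86_64-0.2.0.cc35f085.tar.xz zig-linux-x86_64-0.2.0.cc35f085.tar.lz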
by qwerty456127 on 4/20/18, 5:42 PM
by orbitur on 4/20/18, 6:50 PM
by ebullientocelot on 4/20/18, 6:44 PM
by Annatar on 4/21/18, 10:10 AM
by vortico on 4/20/18, 7:55 PM
What is the probability of a complete HD failure in a year?
by loeg on 4/21/18, 4:12 AM
by sirsuki on 4/20/18, 8:06 PM
tar c foo | gzip > foo.tar.gz
or tar c foo | bzip2 > foo.tar.bz2
Been using these for over 20 years now. Why is it so important to change things, especially when, as this article points out, it's for the worse?!
by nailer on 4/20/18, 5:11 PM
document.body.style['max-width'] = '550px'; document.body.style.margin = '0 auto'
by Lionsion on 4/20/18, 3:36 PM
by microcolonel on 4/21/18, 4:02 AM
If your storage fails, maybe you'll have a problem, but you'd have a problem anyway.
Sometimes I feel like genuine technical concerns are buried by the authors being jerks and blowing things way out of proportion. I, for one, tend to lose interest when I hear hyperbolic mudslinging.
by kazinator on 4/20/18, 4:17 PM
Wow ... that is inexcusably idiotic. Whoever designed that shouldn't be programming. Out of professional disdain, I pledge never to use this garbage.