from Hacker News

Git archive generation meets Hyrum's law

by JamesCoyne on 2/2/23, 6:47 PM with 76 comments

  • by pcj-github on 2/2/23, 10:05 PM

    If it can't be made stable, `git archive` should deliberately add random content (under a feature flag, to be removed after a year or two) so as to make the generated checksum completely unreliable and force users to adopt different workflows.
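
    A minimal sketch of what that could look like, assuming Python's gzip module and using a randomized gzip-header mtime as the "random content" (the payload here is a made-up stand-in):

    ```python
    import gzip
    import hashlib
    import random

    tar_bytes = b"stand-in for the stable tar payload"

    def archive(data: bytes) -> bytes:
        # A random mtime changes four bytes in the gzip header, so every
        # download gets a different compressed checksum...
        return gzip.compress(data, mtime=random.randrange(2**31))

    a, b = archive(tar_bytes), archive(tar_bytes)
    print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False
    # ...while the decompressed content, and hence its checksum, stays stable:
    print(gzip.decompress(a) == gzip.decompress(b))  # True
    ```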
  • by skywal_l on 2/2/23, 9:34 PM

    Everybody who has had to maintain an API knows this.

    1. You can't just rely on documentation ("we never said we would guarantee this or that") to push back on your users' claims that you introduced a breaking change. If you care more about your documentation than your users, they will turn their backs on you.

    2. However if you start guaranteeing too much stability, innovation and change will become too costly or even impossible. In this instance, if the git team has to guarantee their hashes (which seems impossible anyway because it depends on the external gzip program) then they can never improve on compression.

    Tough situation to be in.

  • by mjw1007 on 2/3/23, 8:06 AM

    I don't think this is really an example of Hyrum's law. Hyrum's law claims that even if you carefully document your contract, someone will rely on the observable behaviour rather than the documentation anyway.

    But this is an example of a much weaker proposition: if you don't document your contract, then people will guess what the contract is and some of them will guess wrong.

    (In fact in this case it seems it's more like "if you don't document your contract and your support staff sometimes say the behaviour is A, people will rely on the behaviour being A".)

  • by vlovich123 on 2/3/23, 5:06 AM

    I wonder if transfer-encoding the archive might be a better strategy. The client benefits from a stable format (tar) provided it's generated in a stable order, which is generally easier for the server to guarantee. The network transfer occurs transparently compressed (the Transfer-Encoding header, in HTTP parlance).

    Checksums still work and protect against malicious tarballs, which are generally riskier to unpack than plain stream compression/decompression. The server and client both get smaller file transfers, and compression improvements can evolve transparently by negotiating the transfer encoding. The server can still cache the encoded form to avoid compressing the same file repeatedly.

    Seems like a win-win solution that doesn't require a drastic redesign of package managers everywhere, and everyone walks away keeping the properties of the system they value.
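
    A rough client-side sketch of this, assuming the Python requests library and a hypothetical archive URL. The point is that the checksum is pinned to the stable tar bytes, not to whatever compressor the transfer happened to negotiate:

    ```python
    import hashlib

    import requests

    url = "https://example.com/repo/archive/v1.0.0.tar"  # hypothetical URL

    # Ask for a compressed transfer; requests transparently decodes the
    # Content-Encoding, so resp.content is the plain tar regardless of
    # which compressor the server chose.
    resp = requests.get(url, headers={"Accept-Encoding": "gzip"})
    tar_sha256 = hashlib.sha256(resp.content).hexdigest()
    print(tar_sha256)  # compare against a hash pinned on the uncompressed tar
    ```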

  • by jancsika on 2/3/23, 2:42 AM

    > Hyrum's law

    Didn't Google beat Hyrum's law by using their weight to force middleboxes to accept some variation in some datum of an HTTP header or something?

    Edit: hint: something about rotating a value for some number of decades. Either forcing the hand of middleboxes or CAs; I can't remember which. In either case, it seemed like a real pain in the ass to keep the API-observability concrete from hardening. :)

  • by cratermoon on 2/3/23, 2:18 AM

    From the post: "it may well become necessary for anybody who wants consistent results to decompress archive files before checking checksums."

    I'm certain there's some exploit waiting to subvert the decompression algorithm and substitute malicious content in place of the actual archive files.
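
    One mitigation sketch for the most obvious risk there (a decompression bomb), assuming Python's gzip module; the 1 GiB cap and chunk size are arbitrary illustrative choices:

    ```python
    import gzip
    import hashlib

    MAX_OUTPUT = 1 << 30  # arbitrary cap: refuse archives inflating past 1 GiB
    CHUNK = 1 << 16

    def uncompressed_sha256(path: str) -> str:
        # Stream the decompression so a malicious archive can't balloon
        # in memory, and abort once the output exceeds the cap.
        h = hashlib.sha256()
        total = 0
        with gzip.open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                total += len(chunk)
                if total > MAX_OUTPUT:
                    raise ValueError("archive inflates past the size limit")
                h.update(chunk)
        return h.hexdigest()
    ```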

  • by DelightOne on 2/3/23, 12:00 AM

    Can't Github just keep the old archives as they are for already-existing releases and use the new format only for new releases? Over time, old releases phase out and the advantage of the new format takes full effect. You could even use a time-based cut-off date if you somehow want to get it in sync.
  • by AJRF on 2/3/23, 2:51 AM

    HANG ON!

    I think this just made me realise an issue I was having with Swift Package Manager a few months back. We have a bunch of ObjC frameworks in our app that we don't want people to update anymore so we can rewrite them, so we threw them all into a big umbrella project. But for some reason we couldn't get the binary target URL from Github Enterprise to work on our self-hosted Enterprise instance: the checksum would be different every time, even though it worked perfectly on Github Cloud.

    Is there anyone from Github here? Can you confirm that this is the cause of the issue for GH Enterprise?

  • by travisgriggs on 2/2/23, 9:22 PM

    Had to follow the links to figure out what Hyrum's Law was (I like laws). The best link on that page is the obligatory xkcd at the very bottom. Reshared here:

    https://xkcd.com/1172/

  • by meling on 2/3/23, 4:57 AM

    Couldn’t they include two checksums; one for compressed, and if that fails, decompress and check the uncompressed content?
  • by avgcorrection on 2/2/23, 10:39 PM

    Mob engineering: you don’t have to read the documentation if a million other people also do not.
  • by syntheticnature on 2/2/23, 8:36 PM

    2018 Gentoo-dev called, wants to let you know this is old news: https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg...
  • by jmclnx on 2/2/23, 8:27 PM

    > more easily support compression across operating systems

    I cannot help but wonder if this change was forced upon github by Microsoft because gzip is GPLv3; maybe this other version is a clean-room clone. We all know corporations hate GPLv3, including the large corporation I work for.

    https://www.gnu.org/software/gzip/