from Hacker News

CocoaPods downloads max out five GitHub server CPUs

by jergason on 3/8/16, 3:11 PM with 308 comments

  • by onli on 3/8/16, 3:32 PM

    Note how perfect that response from mhagger is. A clear, honest-sounding assurance of what Github wants to deliver. A perfectly comprehensible description of what the problem is and where it is coming from. And then a suggestion for a fix that the project can actually work on, plus a mention of changes to git itself that Github is trying to make that would help. It not only shows great work going on behind the scenes (and if that is untrue, it at least gives me that impression, which is what counts), but also explains it in a great way.
  • by Gratsby on 3/8/16, 3:52 PM

    From CocoaPods.org:

    > CocoaPods is a dependency manager for Swift and Objective-C Cocoa projects. It has over ten thousand libraries and can help you scale your projects elegantly.

    The developer response:

    > [As CocoaPods developers] Scaling and operating this repo is actually quite simple for us as CocoaPods developers whom do not want to take on the burden of having to maintain a cloud service around the clock (users in all time zones) or, frankly, at all. Trying to have a few devs do this, possibly in their spare-time, is a sure way to burn them out. And then there’s also the funding aspect to such a service.

    --

    So they want to be the go-to scaling solution, but they don't want to have to spend any time thinking about how to scale anything. It should just happen. Other people have free scalable services, they should just hand over their resources.

    Thank goodness Github thought about these kinds of cases from the beginning and instituted automatic rate limiting. Having an entire end user base use git to sync up a 16K+ directory tree is not a good idea in the first place. The developers should have long since been thinking about a more efficient solution.

  • by pjc50 on 3/8/16, 3:55 PM

    This reply: https://github.com/CocoaPods/CocoaPods/issues/4989#issuecomm...

    "Not having to develop a system that somehow syncs required data at all means we get to spend more time on the work that matters more to us, in this case. (i.e. funding of dev hours)"

    In other words, using github as a free unlimited CDN lets them be as inefficient as they like. Such as having 16k entries in a directory ( https://github.com/CocoaPods/Specs/tree/master/Specs ) which every user downloads.

    Package management and sync seems to suffer really badly from NIH. Dpkg is over 20 years old and yum is over a decade old. What's up with this particular wheel that people keep reinventing it seemingly without improvement?

  • by indygreg2 on 3/8/16, 4:50 PM

    I help run Mozilla's version control infrastructure and the problems described by the GitHub engineer have been known to me for years. Concerns over scaling Git servers are one of the reasons I am extremely reluctant to see Mozilla support a high volume Git server to support Firefox development.

    Fortunately for us, Firefox is canonically hosted in Mercurial. So, I implemented support in Mercurial for transparently cloning from server-advertised pre-generated static files. For hg.mozilla.org, we're serving >1TB/day from a CDN. Our server CPU load has fallen off a cliff, allowing us to scale hg.mozilla.org cheaply. Additionally, consumers around the globe now clone faster and more reliably since they are using a global CDN instead of hitting servers on the USA west coast!

    If you have Mercurial 3.7 installed, `hg clone https://hg.mozilla.org/mozilla-central` will automatically clone from a CDN and our servers will incur maybe 5s of CPU time to service that clone. Before, they were taking minutes of CPU time to repackage server data in an optimal format for the client (very similar to the repack operation that Git servers perform).

    More technical details and instructions on deploying this are documented in Mercurial itself: https://selenic.com/repo/hg/file/9974b8236cac/hgext/clonebun.... You can see a list of Mozilla's advertised bundles at https://hg.cdn.mozilla.net/ and what a manifest looks like on the server at https://hg.mozilla.org/mozilla-central?cmd=clonebundles.

    A number of months ago I saw talk on the Git mailing list about implementing a similar feature (which would likely save GitHub in this scenario). But I don't believe it has manifested into patches. Hopefully GitHub (or any large Git hosting provider) realizes the benefits of this feature and implements it.

  • by jdcarter on 3/8/16, 3:36 PM

    Wow, really impressive response from GitHub. The right amount of technical detail coupled with balanced tone--halfway between "we support you" and "you make us crazy."

    One correction to the post title: it's not maxing five nodes, but five CPUs.

  • by web007 on 3/8/16, 4:27 PM

    I keep coming back to point #4 - who ever thought that 16k objects in a single directory would be a good idea? Ever since FAT that's been a bad idea, and while modern FSes will handle it without completely melting down it's still going to cause long access operations on anything to do with it.

    Even Finder or `ls` will have trouble with that, and anything with * is almost certainly going to fail. Is the use-case for this something that refers to each library directly, such that nobody ever lists or searches all 16k entries?

  • by mikeash on 3/8/16, 6:39 PM

    The criticism against CocoaPods here seems awfully harsh.

    Think about it from their perspective. GitHub advertises a free service, and encourages using it. Partly it's free because it's a loss leader for their paid offerings, and partly it's free because free usage is effectively advertising GitHub. CocoaPods builds their project on this free service, and everything is fine for years.

    Then one day things start failing mysteriously. It looks like GitHub is down, except GitHub isn't reporting any problems, and other repositories aren't affected.

    After lots of headscratching, GitHub gets in touch and says: you're using a ton of resources, we're rate limiting you, you're using git wrong, and you shouldn't even be using git.

    That's going to be a bit of a shock! Everything seemed fine, then suddenly it turns out you've been a major problem for a while, but nobody bothered to tell you. And now you're in hair-on-fire mode because it's reached the point where the rate-limiting is making things fail, and nobody told you about any of these problems before they reached a crisis point.

    It strikes me as extremely unreasonable to expect a group to avoid abusing a free service when nobody tells them that it's abuse, and as far as they know they're using it in a way that's accepted and encouraged. If somebody is doing something you don't like and you want them to stop, you have to tell them, or nothing will happen!

    I'm not blaming GitHub here either. I'm sure they didn't make this a surprise on purpose, and they have a ton of other stuff going on. This looks like one of those things where nobody's really to blame, it's just an unfortunate thing that happened.

    (And just to be clear, I don't have much of a dog in this fight on either side. My only real exposure to CocoaPods is having people occasionally bug me to tag my open source repositories to make them easier to incorporate into CocoaPods. I use GitHub for various things like I imagine most of us do, but am not particularly attached to them.)

  • by wpeterson on 3/8/16, 4:09 PM

    It's totally reasonable to host your code on github and to build a package manager that loads the content of a package from its github repo.

    What seems insane is to use a single github repo as the universal directory of packages and their versions driving your package manager.

    There's a reason rubygems has their own servers and web services to support this use case for the central library registry, even if the source for gems are all individually projects hosted on github.

  • by riscy on 3/8/16, 4:35 PM

    > Scaling and operating this repo is actually quite simple for us as CocoaPods developers whom do not want to take on the burden of having to maintain a cloud service around the clock (users in all time zones) or, frankly, at all.

    The CocoaPods developers seem to be missing the entire point of git: it's a _distributed_ revision control system.

    Set up a post-receive hook on Github to notify another server, one set up with a basic installation of git, to pull from Github so as to mirror the master repo. Then, have your client program randomly choose one of these servers to pull from at the start of an operation. Simple load balancer to solve this problem.
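
    The client-side half of this idea can be sketched in a few lines. This is only an illustration, not how CocoaPods works: the two `example.org` mirror URLs and the `fetch_order` helper are made up for the example, and the mirrors are assumed to be kept in sync by some hook that is not shown.

```python
import random

# Hypothetical mirror list: only the GitHub URL is real, the example.org
# hosts are placeholders for servers that would mirror the master repo.
MIRRORS = [
    "https://github.com/CocoaPods/Specs.git",
    "https://mirror-a.example.org/Specs.git",
    "https://mirror-b.example.org/Specs.git",
]


def fetch_order(mirrors, rng=random):
    """Return the mirrors in random order.

    The first entry is the server to pull from; the remaining entries
    can serve as fallbacks if that server is unreachable.
    """
    order = list(mirrors)  # copy so the caller's list is not mutated
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    for url in fetch_order(MIRRORS):
        print(url)
```

    Shuffling the whole list rather than picking a single entry spreads load the same way while also giving the client an ordered set of fallbacks.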

  • by spoiler on 3/8/16, 4:32 PM

    I find it amusing how GitHub's contact[1] form has (probably a recent addition):

    > GitHub Support is unable to help with issues specific to CocoaPods/CocoaPods.

    ---

    [1]: https://github.com/contact

  • by rmoriz on 3/8/16, 5:15 PM

    CocoaPods (and Homebrew) mainly exist because of a lack of tooling in the typical Apple ecosystem. So I would blame Apple for not supporting the community with money or tooling. Letting GitHub with its limited amount of funding pay the bill isn't a nice move. Apple dev relations should throw some money at GitHub so they can provide some dedicated resources or offer to pay the cost of other solutions (like a 3rd party CDN/AWS/Google Cloud/…).
  • by zymhan on 3/8/16, 3:50 PM

    I've always found Github's business model interesting. What if a massive open-source organization (e.g. Fedora, Apache) decided to use it for all of their development, integrating it with continuous builds and all the associated pulls. Of course this isn't likely to happen for a number of reasons, but there are large open source projects that could put a significant load on their infrastructure if they chose to use Github as their main code versioning system.
  • by iBotPeaches on 3/8/16, 4:06 PM

    This bug report is a great step in the direction for GitHub. As of this comment there are 3 different GitHub staff members responding and providing feedback to the CocoaPods team. From the previous "Dear GitHub" messages and responses, this seems like perfect community involvement.
  • by paradite on 3/8/16, 7:13 PM

    I have been seeing this trend of GitHub getting "abused" for purposes other than hosting source code.

    - My school uses GitHub to host and track our software engineering project (which still can be argued as OSS).

    - People using GitHub issue system as a forum.

    - Friends uploading pdfs to GitHub.

    - Recently people posted on HN about using GitHub to generate a status page.

    I think this is a really bad trend and people should stop doing that.

  • by fpgaminer on 3/8/16, 8:12 PM

    GitTorrent: http://blog.printf.net/articles/2015/05/29/announcing-gittor...

    Imagine a world where GitTorrent is fully developed, includes support for issue tracking, and has a nice GUI client that makes the experience on-par with browsing github.com.

    I mention this not as an "Everybody bail out of GitHub and run to GitTorrent!!!" sort of statement, because I believe GitHub's response here was excellent and confidence inspiring. But it's an unnatural relationship for community supported, open source projects to host themselves on commercial platforms such as GitHub. GitHub primarily hosts them to promote its business. That's not necessarily a bad thing, but it results in impedance mismatches like the one demonstrated here.

    That isn't to say that a mature GitTorrent would replace GitHub. Rather, I envision GitHub becoming a supernode in the network, an identity provider, and a paid seed offering, all alongside their existing private repo business.

    Honestly, once I scrape a few projects off my plate, I'm inclined to dive into GitTorrent, see where it's at in development, and see if I can start contributing code. It just seems like such a cool and useful idea.

  • by pavlov on 3/8/16, 3:56 PM

    I've never really understood CocoaPods. Dragging a framework into Xcode was never much trouble, and the number of 3rd party libraries in an OS X / iOS project ought to be fairly small, so the gains are trivial.

    The potential downsides seem much more annoying. Do you really want to have your dependencies on an overloaded central server somewhere?

  • by jrochkind1 on 3/8/16, 4:04 PM

    What an unusually reasonable discussion. good on everyone.
  • by sdegutis on 3/8/16, 3:37 PM

    I love how this was like the perfect storm of things that could go wrong, and how it seems like mhagger is just amazed more than anything else.
  • by ak217 on 3/8/16, 4:36 PM

    I love GitHub's response, but I would urge the project more strongly to use modern CDN solutions. CDNs are dirt cheap and incredibly powerful nowadays, for the data sizes that we're talking about here.
  • by tjdetwiler on 3/8/16, 6:21 PM

    Rust's cargo does something similar, however it looks like they were much more conscious of git-scalability (ex: limiting the directories in a single level, only appending lines to files to make diffs small).

    https://github.com/rust-lang/crates.io-index
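
    For readers who have not looked at that repository, the directory limiting works roughly as below. This is a sketch of the layout as Cargo's registry documentation describes it, written from memory rather than copied from Cargo's source; `index_path` is a name invented for the example.

```python
def index_path(crate_name: str) -> str:
    """Return the index file path for a crate, relative to the index root.

    Short names get their own top-level buckets; longer names are spread
    over two directory levels taken from their first four characters, so
    no single directory has to hold every package.
    """
    name = crate_name.lower()
    if not name:
        raise ValueError("crate name must not be empty")
    if len(name) == 1:
        return f"1/{name}"
    if len(name) == 2:
        return f"2/{name}"
    if len(name) == 3:
        return f"3/{name[0]}/{name}"
    return f"{name[:2]}/{name[2:4]}/{name}"


if __name__ == "__main__":
    for crate in ("syn", "serde", "libc"):
        print(crate, "->", index_path(crate))
```

    Each of those files holds one line per published version, which is where the "only appending lines" property comes from.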

  • by iamleppert on 3/8/16, 7:02 PM

    Amazing to me that people create inefficient systems like this and then complain when they are rate limited.
  • by maaku on 3/8/16, 9:53 PM

    Using Github as your CDN is a dick move. Kudos to GH for not banning the project out-right, but CocoaPods should seriously reconsider what they are doing.
  • by xemdetia on 3/8/16, 4:44 PM

    As a current maintenance developer/systems guy I can definitely feel the tempered annoyance from mhagger here. It's definitely nice to be reminded that it's not only your own set of recurring issues that people have to deal with.
  • by noahlt on 3/8/16, 5:59 PM

    Go's package manager, `go get`, also downloads from GitHub. I don't know the details of how `go get` and CocoaPods work, but I would be interested in learning why one is unscalable and the other seems to work.
  • by SuperKlaus on 3/8/16, 4:21 PM

    In fact, they are maxing out five CPUs - not five nodes, big difference.
  • by fokinsean on 3/8/16, 4:23 PM

    I found the solution humorous. Ironically shallow clones are causing the problems, so fetch the max :)

    $ git fetch --depth=2147483647

  • by kodablah on 3/8/16, 5:28 PM

    Has any consideration been given to Bintray[1] as an alternative store for this stuff?

    1 - https://bintray.com/

  • by rcthompson on 3/9/16, 1:36 AM

    Reading the issue, it seems that one of the problems is a single directory with lots and lots of files in it, which is something of a pathological case for Git. Now, this could be "fixed" by splitting the files in that directory into subdirectories, but the one giant directory will still exist in all the past commits. So would this actually fix anything, or just keep it from getting worse?
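
    To make the question concrete, the kind of split being discussed could look like the sketch below. It is purely illustrative: the hash-prefix scheme, the depth of three levels, and the `sharded_path` name are assumptions, not something proposed in the issue, and the sketch says nothing about the cost of the old commits.

```python
import hashlib


def sharded_path(name: str, depth: int = 3) -> str:
    """Map a pod name to a path under nested single-character directories.

    The directory names are the first `depth` hex characters of a SHA-256
    digest of the pod name, so with depth=3 the entries are spread over
    16**3 = 4096 leaf directories instead of one.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    prefix = "/".join(digest[:depth])
    return f"Specs/{prefix}/{name}"


if __name__ == "__main__":
    print(sharded_path("AFNetworking"))
```

    With roughly 16k entries and 4096 leaf directories, each directory would hold about four entries on average, which keeps the tree objects that change on every commit small.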
  • by kmm on 3/9/16, 2:22 AM

    Funny thing is that the repo is only 7 MB gzipped (or 4 with lzma). Not that surprising, since it's just metadata of course. They say they have about 1 million fetches/clones per week, so that would make about 16 TB per month. I'm not sure how much bandwidth costs, but wouldn't some sympathetic CDN host that for free, since they're OSS?
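
    The 16 TB figure appears to come from the 4 MB (lzma) size. A quick check of the arithmetic, assuming every fetch transfers the whole compressed repo (an upper bound, since incremental fetches move less), four weeks per month, and decimal units; `monthly_transfer_tb` is just a helper name for this sketch:

```python
MB = 10**6   # decimal megabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes


def monthly_transfer_tb(repo_mb, fetches_per_week, weeks_per_month=4):
    """Estimate monthly transfer in TB if each fetch sends the full repo."""
    total_bytes = repo_mb * MB * fetches_per_week * weeks_per_month
    return total_bytes / TB


if __name__ == "__main__":
    print(monthly_transfer_tb(4, 1_000_000))  # lzma-compressed size
    print(monthly_transfer_tb(7, 1_000_000))  # gzip-compressed size
```

    Under these assumptions the lzma size gives 16 TB per month and the gzip size gives 28 TB per month.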
  • by soheil on 3/8/16, 6:44 PM

    I was up until 2am last night trying to publish my Pod [1] and Github kept timing out.

    I had no idea it was just CocoaPods repo because my other repos were working fine. I accepted defeat, went to bed and everything was working great in the morning.

    [1] https://github.com/soheil/SwiftCSS

  • by sly010 on 3/8/16, 4:18 PM

    I would be interested to know what are the other top GitHub repositories. Afaik the Nix package manager uses a similar model (using a GitHub repo as a database), however they periodically release snapshots and the default configuration uses those instead of git.
  • by zoul on 3/8/16, 3:37 PM

    Another reason I consider https://github.com/Carthage/Carthage a better solution of the dependency management problem.
  • by Negative1 on 3/8/16, 4:10 PM

    When he says approaches similar to 'other packaging systems', which ones is he referring to? I can see why this is a bad approach but am unfamiliar with what would be considered a better practice (outside of just hosting a .tar on CloudFront).
  • by superuser2 on 3/8/16, 3:57 PM

    Just last night, all my pod installs were timing out after ~30ish minutes. That explains it.
  • by joeblau on 3/8/16, 5:40 PM

    I just installed Cocoapods last night and tried to clone down the repo. It took about 5 minutes and I thought to myself "Is my 150MB/s connection slow?" This definitely clears up what was going on.
  • by debacle on 3/8/16, 6:13 PM

    Why aren't the packages distributed? Composer is incredibly distributed and likely doesn't cause nearly the same headaches for GitHub.

    Seems like a poor design decision on the CocoaPods side.

  • by voltagex_ on 3/9/16, 3:03 AM

    It's difficult to run an open source project on a budget of $0. You're always relying on the goodwill of others.
  • by LoneWolf on 3/9/16, 5:51 PM

    While I do not have much knowledge on the subject, why not use rsync?
  • by nimish on 3/8/16, 4:42 PM

    hopefully we can now move to using real artifact repositories.
  • by rdancer on 3/9/16, 2:53 PM

    tl;dr: "Using GitHub as your [free-of-charge] CDN is not ideal, for anybody involved."
  • by speps on 3/8/16, 5:24 PM

    Why is everyone talking about CocoaPods when the title is CoacoaPods anyway? :)
  • by whitehat2k9 on 3/8/16, 7:44 PM

    Only the Apple development community would think it's OK to have 16,000 subdirectories in one place and abuse GitHub as a free CDN instead of putting some actual effort in and developing their own repository infrastructure - you know, like almost every other package manager in existence.
  • by Const-me on 3/9/16, 1:35 AM

    I don’t think GitHub acts wisely here.

    Short term sure, they’re doing the right thing, implementing a nice way to manage the free rider problem without hurting them too much.

    But long term it’s different.

    Financially, one average programmer = $80k/year, one average cloud server = $4k/year. And, GitHub has hundreds of millions of venture capital. More than enough to provision a few more servers, even if they will be installing new servers just for those pods.

    The way they act now will lead someone to develop a decentralized git+torrent hybrid. When that happens, sure, those pods will no longer consume GitHub’s precious resources. However, for the rest of the GitHub users, there will be no reason to stay on GitHub either.