from Hacker News

Apple downloads ~45 TB of models per day from our S3 bucket

by julien_c on 9/16/19, 6:49 PM with 225 comments

  • by cs702 on 9/17/19, 12:00 AM

    "Almost everyone" working on NLP uses one of huggingface's pretrained models at one point or another, sooner or later: https://github.com/huggingface/pytorch-transformers

    It's so damn convenient, and so nicely done.

    And they keep doing neat things like this one: https://github.com/huggingface/swift-coreml-transformers

    Kudos to Julien Chaumond et al for their work!

  • by nurettin on 9/17/19, 5:03 AM

    This is probably Apple's continuous integration tests, lazily written to download the whole thing every time someone merges a commit.
  • by CobrastanJorji on 9/17/19, 12:31 AM

    If you host large, publicly available data in a cloud blob service, but you don't have a budget for it, one option is to use the "Requester Pays" feature that Amazon and Google provide. This makes the data available to anyone to download, but they need to pay the download cost themselves.

    The tradeoff is that your data becomes significantly more irritating to access: it's no longer just a matter of plugging a URL into a program, and everyone who wants your dataset needs to set up a billing account with Amazon or Google.
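
For illustration, here is a minimal sketch of what Requester Pays looks like on S3 from Python, assuming the boto3 client is installed and AWS credentials are configured. The helper names and the bucket and key shown in the usage note are hypothetical; `put_bucket_request_payment` and the `RequestPayer` argument to `get_object` are the boto3 calls involved.

```python
def requester_pays_config():
    """Payload that switches a bucket to Requester Pays."""
    return {"Payer": "Requester"}


def requester_pays_get_kwargs(bucket, key):
    """Arguments a downloader passes to acknowledge that it will be billed."""
    return {"Bucket": bucket, "Key": key, "RequestPayer": "requester"}


def enable_requester_pays(bucket):
    """Owner side: turn the feature on (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    s3 = boto3.client("s3")
    s3.put_bucket_request_payment(
        Bucket=bucket,
        RequestPaymentConfiguration=requester_pays_config(),
    )


def download(bucket, key, dest):
    """Requester side: the transfer is charged to the caller's AWS account."""
    import boto3

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_get_kwargs(bucket, key))
    with open(dest, "wb") as fh:
        for chunk in response["Body"].iter_chunks():
            fh.write(chunk)
```

A downloader would call something like `download("example-models-bucket", "bert/model.bin", "model.bin")` (hypothetical names). This is the irritation mentioned above: an anonymous HTTP GET on a plain URL no longer works, so every consumer needs credentials and a client that sends the extra parameter.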

  • by dharmon on 9/16/19, 8:54 PM

    I don't have high hopes for his business prospects if this is how he handles one of the richest companies in the world clearly having a high need for something his company offers.

    Maybe spend less time on Twitter and more on your business model?

  • by emeraldd on 9/16/19, 11:48 PM

    This looks kind of interesting:

    https://github.com/huggingface/pytorch-pretrained-BigGAN/blo...

    When you look further down you find:

    https://github.com/huggingface/pytorch-pretrained-BigGAN/blo...

    And that's just a quick search for s3 in the repo. It would not surprise me in the least to discover a `from_pretrained` that points at one of the s3 resources being pulled. There's probably other stuff like that as well in the code that could be causing equally nasty heartache .. especially if non-persistent containers are involved....

    (This is a WAG aka Wild A Guess)

    EDIT: Dug a little more and found:

    https://github.com/search?q=org%3Ahuggingface+s3&type=Code

    Unless I'm mistaken here, there's a crap ton of code that could be downloading models at runtime ... Which seems significantly less than ideal ...

  • by btown on 9/17/19, 12:01 AM

    A brief reminder: Whenever you publish code or documentation that might be used/scraped by the outside world, ALWAYS use a domain you own. If you're on Cloudflare you can instantly (and for free) create Page Rules to use Cloudflare as a CDN, redirect to another CDN, or black-hole or reroute traffic anywhere you want.
  • by lacker on 9/16/19, 7:54 PM

    Well, you could contact them and make a very-likely-to-succeed case that they should pay you some money, or you could complain about it on Twitter.
  • by paxys on 9/17/19, 12:26 AM

    That's about $4000/month in bandwidth costs, assuming retail pricing.

    FYI he is bragging, not complaining. There are a dozen ways to reduce or eliminate this problem.
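
As a rough check on the figure above, here is a back-of-the-envelope sketch. It assumes a flat retail egress rate of $0.09 per GB and decimal terabytes; both are assumptions, and real pricing is tiered, with lower rates at high volume, so the 30-day number is an upper bound rather than an actual bill.

```python
GB_PER_TB = 1000            # decimal terabytes (assumption)
ASSUMED_USD_PER_GB = 0.09   # assumed flat retail egress rate, not a quoted price


def transfer_cost_usd(terabytes, usd_per_gb=ASSUMED_USD_PER_GB):
    """Cost of transferring `terabytes` out at a flat per-GB rate."""
    return terabytes * GB_PER_TB * usd_per_gb


daily = transfer_cost_usd(45)          # the ~45 TB/day from the headline
monthly = transfer_cost_usd(45 * 30)   # the same volume sustained for 30 days

print(f"per day:     ${daily:,.0f}")
print(f"per 30 days: ${monthly:,.0f}")
```

Under these assumptions, 45 TB works out to roughly $4,000 for a single day of transfer, so the monthly total would be considerably higher if the headline volume is sustained.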

  • by ebg13 on 9/16/19, 7:14 PM

    If you don't want someone else to do something that costs you money, you're going to have a bad time if you don't prevent them from doing it.
  • by alphagrep12345 on 9/16/19, 10:04 PM

    What does hugging face do? Do they implement models from papers and make them available for free?
  • by rhacker on 9/17/19, 12:54 AM

    I'm guessing someone at Apple internally distributed a Dockerfile that pulls that down.
  • by fitzroy on 9/16/19, 11:46 PM

    In a few weeks he can just point Apple's IP range to a shared iCloud folder.
  • by StreamBright on 9/17/19, 7:03 AM

    Requester Pays is the feature they are looking for.

    https://docs.aws.amazon.com/AmazonS3/latest/dev/configure-re...

  • by hi41 on 9/16/19, 7:42 PM

    I read the Twitter post but did not understand what is happening. Can someone please explain?
  • by cpach on 9/17/19, 5:52 AM

    Isn’t this a use case where BitTorrent would shine?
  • by peterwwillis on 9/17/19, 8:16 AM

    If your CI/CD is re-downloading and re-building everything on every single run, you are not only being wasteful, you're actually more likely to have an outage due to not storing dependency artifacts needed for deploy. Use a local artifact store to be more resilient to failures of servers you don't control (and also save everyone money and time).
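
A minimal sketch of the local artifact store idea, using only the Python standard library. The function name `cached_fetch`, the cache directory, and the URL in the usage note are hypothetical; a real CI setup would more likely use its build system's cache or an artifact server, so treat this as an illustration of the principle rather than a recommended tool.

```python
import hashlib
import os
import tempfile
import urllib.request


def cached_fetch(url, cache_dir, downloader=urllib.request.urlretrieve):
    """Return a local path for `url`, downloading only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Key the cache entry on the URL so repeated runs map to the same file.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        return path  # cache hit: no network traffic at all

    # Download to a temporary file first so that an interrupted transfer
    # never leaves a truncated artifact in the cache.
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    try:
        downloader(url, tmp)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
    return path
```

A build step would then call something like `cached_fetch("https://example.invalid/model.bin", "/var/cache/ci-artifacts")` (hypothetical values) instead of fetching the file directly. The first run pays for the download; later runs read from disk, and keep working if the upstream server is unavailable.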
  • by soared on 9/16/19, 7:25 PM

    Charge, them, money?
  • by jijji on 9/17/19, 1:04 AM

    Hosting terabytes of data on an S3 bucket where people would download 45TB per month ($0.023/GB == $1000+/month) sounds like a really expensive way to distribute your data to people...
  • by mrfusion on 9/16/19, 11:44 PM

    What’s the backstory on this? (Is it something I should already know?)
  • by yalogin on 9/17/19, 4:31 AM

    Isn't it likely that someone wrote a script for testing some regression and it keeps running in a loop? I can almost bet that will be the case.
  • by tnolet on 9/17/19, 8:48 AM

    Is this what they call product market fit?
  • by codesternews on 9/17/19, 5:51 AM

    Looks like an open source company. What's their business model? Does anyone know how they earn money?
  • by idlewords on 9/17/19, 3:20 AM

    This is what success looks like if you charge money for a good or service.
  • by ChuckMcM on 9/17/19, 11:57 AM

    And now the Twitter post is gone? I'm guessing the west coast woke up and someone at Apple said "Wait, you could infer some proprietary information with that information ..."
  • by cuillevel3 on 9/16/19, 11:08 PM

    Are those full downloads or just HEAD or range requests from some CI?
  • by dlasek on 9/17/19, 2:57 AM

    They're the ones that made Amazon get those Data Trucks lol
  • by z3t4 on 9/17/19, 7:21 AM

    Apple are probably doing "continuous integration" where all assets are re-downloaded from the Internet in each iteration. Tip: put your stuff on GitHub :P
  • by ajay-d on 9/16/19, 11:41 PM

    Aren’t all the authors of that paper from Apple?
  • by ecnahc515 on 9/16/19, 11:20 PM

    Why can't they use CloudFront?
  • by half-kh-hacker on 9/17/19, 11:04 AM

    It's surprising that nobody here's mentioned Wasabi, since they have free egress.
  • by master_yoda_1 on 9/17/19, 3:48 AM

    So these jokers at Apple publish a paper by using code from huggingface.
  • by dymk on 9/17/19, 5:17 AM

    Need to distribute large static content? Looks like a good job for a torrent.
  • by bryan_w on 9/16/19, 7:19 PM

    Have you considered Cloudflare?
  • by kelnos on 9/17/19, 12:24 AM

    If a company the size of Apple finds this that useful, perhaps you should consider charging for your service, rather than just complaining on Twitter about the free usage you appear to have willingly given away?

    Or perhaps you have reached out to them, but are for some reason still complaining on Twitter to drum up PR or something?

    Regardless, this posting is ridiculously context-free to the point of being click-baity. (But hey, good job, I clicked on it anyway.)