from Hacker News

Calculating the cost of a Google DeepMind paper

by 152334H on 7/30/24, 10:26 AM with 150 comments

  • by rgmerk on 7/30/24, 1:43 PM

    Worth pointing out here that in other scientific domains, papers routinely require hundreds of thousands of dollars, sometimes millions of dollars, of resources to produce.

    My wife works on high-throughput drug screens. They routinely use over $100,000 of consumables in a single screen, not counting the cost of the screening “libraries”, the cost of using some of the ~$10mil of equipment in the lab for several weeks, the cost of the staff in the lab itself, and the cost of the time of the scientists who request the screens and then take the results and turn them into papers.

  • by BartjeD on 7/30/24, 11:03 AM

    If this ran on Google's own cloud, it amounts to internal bookkeeping. The only cost is then the electricity and the capacity used, not consumer pricing. So it's negligible.

    It is rather unfortunate that this sort of paper is hard to reproduce.

    That is a BIG downside, because it makes the result unreliable. They invested effort and money in getting an unreliable result. But perhaps other research will corroborate it. Or it may give them an edge in their business, for a while.

    They chose to publish. So they are interested in seeing it reproduced or improved upon.

  • by godelski on 7/30/24, 7:15 PM

    Worth mentioning that the "GPU Poor" problem isn't that those without much GPU compute can't contribute, but rather that those with massive amounts of GPU can run many more experiments and thereby set a standard, or shift the Overton window. The big danger here is that you'll start expecting a higher level of "thoroughness" from everyone else. You may not demand this exact level, but seeing it regularly makes what was sufficient before feel far from sufficient now, and that raised lower bound has a cost.

    I mention this because a lot of universities and small labs are being edged out of the research space, but we still want their contributions. It is easy to always ask for more experiments, but the problem is, as this blog shows, those experiments can sometimes cost millions of dollars. This also isn't to say that small labs and academics can't publish, but rather that 1) we want them to be able to publish __without__ the support of large corporations, to preserve the independence of research[0], and 2) we don't want these smaller entities to have to spin a roulette wheel in an effort to get published.

    Instead, when reviewing, be cautious about what you ask for. You can __always__ ask for more experiments, datasets, "novelty", and so on. Ask instead whether what's presented is sufficient to push the field forward in any way, and when you do request those things, be specific about why what's in the paper doesn't answer what's needed and what experiment would answer it (a sentence or two would suffice).

    If not, then we'll see the death of the GPU poor, and that will be the death of a lot of innovation, because the truth is that not even big companies will allocate large compute for lower-level research (do you think state space models (Mamba) started with multimillion-dollar compute? Transformers?). We gotta start somewhere, and every paper can be torn to shreds / is easy to critique. But you can be highly critical of a paper and that paper can still push knowledge forward.

    [0] Lots of papers these days are indistinguishable from ads, and many are effectively products. I've even had work rejected because it was evaluated as a product rather than on the merits of its research. Though admittedly this can be difficult to distinguish when evaluation is purely empirical.

    [1] I once got desk rejected for "prior submission." Two months later they overturned it, realizing it was in fact just an arXiv preprint, only for it to be desk rejected again a month later for "not citing relevant materials," with no further explanation.

  • by pama on 7/30/24, 12:13 PM

    3 USD/hour for an H100 is much more expensive than a reasonable amortized full-ownership cost, unless one assumes the GPU is useless within 18 months, which I find a bit dramatic. The MFU can be above 40%, and certainly well above the 35% in the estimate, even for small models with plain PyTorch and trivial tuning [1].

    I didn't read the linked paper carefully, but I seriously doubt the Google team used vocab embedding layers with the 2·D·V parameters stated in the link, because that would be suboptimal: it means not tying the weights of the token embedding layer in the decoder architecture (and even if they did double the params in these layers, it would not lead to 6·D·V compute, because the embedding input is indexed).

    To me these assumptions suggested a somewhat careless attitude towards the cost estimation, so I stopped reading the rest of the analysis carefully. My best guess is that the author is off by a large factor in the upward direction, and a true replication with H100/H200 could be about 3x less expensive.

    [1] If the total cost estimate were relatively low, say less than $10k, then of course the lowest rental price and a random training codebase might make sense in order to reduce administrative costs; once the cost is in the ballpark of millions of USD, it feels careless not to optimize it further. H100s occasionally show up in fire sales or on eBay, which could reduce the cost even more, but the author already mentions 2 USD/GPU/hour for bulk rental compute, which is better than the 3 USD/GPU/hour estimate used in the write-up.
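
    To see how sensitive the headline number is to these choices, here is a minimal back-of-the-envelope sketch (not the blog's actual methodology). The parameter count, token count, peak throughput, MFU, and hourly price below are illustrative assumptions; swap in the paper's real values.

        # Back-of-the-envelope training-cost estimate (illustrative numbers only).
        # Uses the standard approximation: training FLOPs ~= 6 * params * tokens.

        def training_cost_usd(params, tokens, peak_flops_per_gpu, mfu, usd_per_gpu_hour):
            """Estimate the rental cost of one training run under the stated assumptions."""
            total_flops = 6 * params * tokens                  # forward + backward pass
            effective_flops_per_s = peak_flops_per_gpu * mfu   # throughput actually achieved
            gpu_hours = total_flops / effective_flops_per_s / 3600
            return gpu_hours * usd_per_gpu_hour, gpu_hours

        # Hypothetical 1B-parameter model trained on 20B tokens on H100s
        # (peak BF16 throughput taken as ~1e15 FLOP/s; check the spec sheet for real runs).
        for mfu, price in [(0.35, 3.0), (0.45, 2.0)]:
            cost, hours = training_cost_usd(1e9, 20e9, 1e15, mfu, price)
            print(f"MFU={mfu:.0%}, ${price}/GPU-hour -> {hours:,.0f} GPU-hours, ~${cost:,.0f}")

    Multiplied across all the runs in a sweep, the gap between the pessimistic and optimistic assumptions alone accounts for a large part of the ~3x difference discussed above.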

  • by brg on 7/30/24, 4:57 PM

    I found this exercise interesting, and as arcade79 pointed out, it is the cost of replication, not the cost to Google. Humorously, I wonder what the cost of replicating the Higgs boson verification or gravitational wave detection would be.

  • by jeffbee on 7/30/24, 2:59 PM

    I think if you wanted to consider a really big expense, you'd look at AlphaStar.

  • by hiddencost on 7/30/24, 8:13 PM

    It's likely the cost of the researchers was about $1m/head; with 11 names, that puts the staffing costs on par with the compute costs.

    (A good rule of thumb is that an employee costs about twice their total compensation.)
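
    As a quick illustration of that rule of thumb (the numbers below are rough assumptions consistent with the comment, not reported figures):

        # Rough staffing-cost sketch using the rule of thumb above:
        # fully loaded cost ~= 2x total compensation.
        total_comp_per_head = 500_000                       # assumed average total comp, USD/year
        fully_loaded_per_head = 2 * total_comp_per_head     # ~= the $1m/head figure
        authors = 11
        print(f"~${authors * fully_loaded_per_head:,} per year of staffing")  # ~$11,000,000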

  • by faitswulff on 7/30/24, 5:54 PM

    I wonder how many tons of CO2 that amounts to. Google Gemini estimated 125,000 tons of carbon emissions, but I don’t have the know-how to double check it.
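
    One crude way to double-check it is GPU-hours × power draw × datacenter overhead × grid carbon intensity. All the inputs below are placeholder assumptions, not figures from the blog or from Google:

        # Crude CO2 back-of-the-envelope; every input here is an assumption to replace.
        gpu_hours = 4_000_000      # hypothetical total H100-hours for a replication
        watts_per_gpu = 700        # rough H100 board power; ignores host CPU/RAM/network
        pue = 1.2                  # assumed datacenter overhead factor
        kg_co2_per_kwh = 0.4       # grid carbon intensity; varies a lot by region

        kwh = gpu_hours * watts_per_gpu / 1000 * pue
        tonnes_co2 = kwh * kg_co2_per_kwh / 1000
        print(f"{kwh:,.0f} kWh -> ~{tonnes_co2:,.0f} tonnes CO2")

    The answer swings by orders of magnitude with the assumed GPU-hours and grid mix, so any single-number estimate is hard to trust without seeing its inputs.
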
  • by arcade79 on 7/30/24, 11:16 AM

    A lot of misunderstandings among the commenters here.

    From the link: "the total compute cost it would take to replicate the paper"

    It's not Google's cost. Google's cost is of course entirely different. It's the cost for the author if he were to rent the resources to replicate the paper.

    For Google, all of it is running at a "best effort" resource tier, grabbing available resources when they're not requested by higher-priority jobs. It's effectively free resources (except for electricity consumption). If any "more important" job with a higher priority comes in and asks for the resources, the paper writers' jobs will just be preempted.

  • by floor_ on 7/30/24, 12:04 PM

    Content aside, this is hands down my favorite blog format.

  • by hnthr_w_y on 7/30/24, 10:38 AM

    That's not very much by business standards; it's a lot when it comes to paying our salaries.

  • by sigmoid10 on 7/30/24, 11:08 AM

    This calculation is pretty pointless and the title is flat-out wrong. It also gets lost in finer details while totally missing the bigger picture. After all, the original paper was written by people either working for Google or at Google. So you can safely assume they used Google resources. That means they wouldn't have used H100s, but Google TPUs. Since they design and own these TPUs, you can also safely assume that they don't pay whatever they charge end users for them. At the scale of Google, this basically amounts to the cost of housing/electricity, and even that could be a tax write-off. You also can't directly assume that the on-paper performance of something like an H100 will be the actual utilization you can achieve, so basing any estimate in terms of $/GPU-hour will be off by default.

    That means Google paid way less than this amount, and if you wanted to reproduce the paper yourself, you would potentially pay a lot more, depending on how many engineers you have on your team to squeeze every bit of performance per hour out of your cluster.

  • by dont_forget_me on 7/30/24, 2:45 PM

    All that compute power just to invade privacy and show people more ads. Can this get any more depressing?