from Hacker News

GShard: Scaling giant models with conditional computation and automatic sharding

by MrUssek on 7/2/20, 12:58 AM with 35 comments

  • by cs702 on 7/2/20, 1:36 AM

    "Quién es más macho?"

    In a very short time, transformers have gone from under 1B, to 1.5B, to 3B, to 5B, to 175B, and now 600B parameters. 1T is only, what, like 67% more parameters, and therefore likely to be achieved in the short term. In fact, the authors of this paper tried 1T but ran into numerical issues that they will surely address soon. Not long after someone crosses 1T, expect 10T to become the next target. And why not? The best-funded AI research groups are in a friendly competition to build the biggest, baddest, meanest m-f-ing models the world has ever seen.
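
    A quick sanity check on that 67% figure, using the sizes listed above (a rough sketch; the 1,000B entry is just the hypothetical 1T next step):

        sizes_b = [1.5, 3, 5, 175, 600, 1000]   # sizes in billions, as listed above; 1000 = hypothetical 1T
        for prev, nxt in zip(sizes_b, sizes_b[1:]):
            print(f"{prev:>6}B -> {nxt:>6}B : +{(nxt - prev) / prev:.0%}")
        # 600B -> 1000B comes out to about +67%, a small step compared to the earlier jumps.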

    Scores continue to increase with diminishing returns, which is all fine and nice, but more importantly it seems we should expect to see machine-generated text getting much better from a qualitative standpoint -- that is, becoming less and less distinguishable from a lot of human output. That has been the trend so far.

    We live in interesting times.

  • by dig6x on 7/2/20, 1:28 AM

    "...600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art."

    It does appear that in the initial, resource-intensive stages of tech like NLP, big tech is primed to pave the way. We saw this happen across cloud, AI more generally, storage, etc. Big tech then begins focusing on making the tech accessible to industry value chains (Azure, AWS, Amazon's AI services, etc.). But as the industry matures there's more room for specialized startups/companies to enter the space and capture lucrative niches - that's exactly what Snowflake did for cloud.

    Definitely see this kind of scale as a step toward a more robust, mature industry if anything. Better it move forward than not.

  • by mensetmanusman on 7/2/20, 1:38 AM

    Awe-inspiring to think of the number of transistors working in concert to translate human language to English...

  • by modeless on 7/2/20, 5:57 AM

    The most important advancements in machine learning for the next 10 years at least will be in hardware, and the software to take advantage of said hardware. You could even say that was already true starting with AlexNet, but it's even more obvious now with these enormous models.

    We've barely scratched the surface of what's possible. Even if Moore's Law were dead (though it seems TSMC may keep it alive for a bit longer), there are huge gains to be had when co-designing models and hardware. Stuff like https://www.cerebras.net/ is the direction I expect things to go.

  • by Der_Einzige on 7/2/20, 1:10 AM

    Yet another paper with results that basically look like this: https://d3b8hk1o42ev08.cloudfront.net/wp-content/uploads/201...

    Still impressive, don't get me wrong, but I am starting to believe that NLP will be increasingly dominated by the big players, since they are the only ones who can train a 1 TRILLION parameter model (they show that in the paper). I can't even do inference with a 36-layer, 2048-neurons-per-layer network on my RTX 2080 Ti. Sad....
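
    A rough back-of-the-envelope on that point, assuming Transformer-style layers with d_model = 2048 and roughly 12·d_model² weights per layer (these numbers are illustrative, not a measurement of any particular model):

        # Rough memory estimate for a 36-layer, d_model=2048 Transformer (illustrative only).
        d_model, layers = 2048, 36
        params_per_layer = 12 * d_model ** 2        # ~attention + feed-forward weights per layer
        total_params = layers * params_per_layer    # ~1.8e9 parameters
        for name, bytes_per_param in [("fp32", 4), ("fp16", 2)]:
            gib = total_params * bytes_per_param / 2**30
            print(f"{name}: ~{gib:.1f} GiB of weights alone")
        # fp32: ~6.8 GiB, fp16: ~3.4 GiB -- before activations and framework overhead,
        # which is why an 11 GiB RTX 2080 Ti gets tight quickly.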

  • by teruakohatu on 7/2/20, 3:19 AM

    The human brain has ~100+ trillion synapses [1] (estimates seem to range from 100 to 1,000 trillion).

    A 1-trillion-parameter model should not be far off; that is about the same as the number of synapses in a house mouse.

    That would put us around 1% of the way to human-brain complexity (well, probably not really, but it is fun to think about).

    [1] https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n...
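
    A quick check of that ratio, using the round numbers above (the synapse estimates vary widely, as noted):

        # Rough ratio of a hypothetical 1T-parameter model to human-brain synapse counts.
        model_params = 1e12
        synapses_low, synapses_high = 100e12, 1000e12   # rough range cited above
        print(f"vs low estimate:  {model_params / synapses_low:.0%}")   # ~1%
        print(f"vs high estimate: {model_params / synapses_high:.1%}")  # ~0.1%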

  • by justicezyx on 7/2/20, 2:19 AM

    Note that this is a systems paper, not an ML/DL/NLP paper. In that light, it's kind of OK to push the parameter count to such a large number.