from Hacker News

Google T5 scores 88.9 on SuperGLUE Benchmark, approaching human baseline

by alexwg on 10/24/19, 11:38 PM with 235 comments

  • by hn_throwaway_99 on 10/25/19, 12:40 AM

    I didn't know anything about SuperGLUE before (turns out it's a benchmark for language understanding tasks), so I clicked around their site where they show different examples of the tasks.

    One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)

    One of their examples, though, didn't make any sense to me:

    1. The pilot managed to land the airplane safely

    2. The enemy landed several of our aircrafts

    It says that the word "land" does NOT mean the same thing in those sentences. I am a native English speaker, and I honestly don't understand what they think the second sentence means. Shot them down? If so, I have never heard "landed" used in that context, and it appears neither has Merriam-Webster. Also, the plural of aircraft is just "aircraft", without the s.
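
    For anyone curious about the data format, here is a rough sketch of what a WiC example looks like, plus a trivial baseline. The field names approximate the SuperGLUE JSONL layout from memory; they are illustrative, not copied from the dataset.

      # Rough sketch of a WiC-style example plus a trivial baseline.
      # Field names approximate the SuperGLUE JSONL layout (illustrative only).
      examples = [
          {
              "word": "land",
              "sentence1": "The pilot managed to land the airplane safely.",
              "sentence2": "The enemy landed several of our aircrafts.",  # sic
              "label": False,  # the dataset says: different senses
          },
      ]

      def always_same_sense(example):
          """Degenerate baseline: always predict 'same sense'."""
          return True

      correct = sum(always_same_sense(ex) == ex["label"] for ex in examples)
      print(f"accuracy: {correct / len(examples):.2f}")  # 0.00 on this example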

  • by 6gvONxR4sf7o on 10/25/19, 2:34 AM

    One thing to always point out in these cases is that the human baseline isn't "how well people do at this task," like it's often hyped to be. It's "how well a person doing this quickly and repetitively does, on average." The 'quickly and repetitively' part is important because we all make more boneheaded errors in this scenario. The 'on average' part is important because the errors the algo makes aren't just fewer than people's, they're different. The algos often still get certain things wrong that humans almost never would.

    This is really really super great, let's be clear. It's just not up to the hype "omg super human" usually gets.

  • by pmoriarty on 10/25/19, 12:40 AM

    There was an article[1] posted to HN recently about these benchmarks, and it was pretty skeptical.

    Regarding SuperGLUE specifically, it asked:

    "Indeed, Bowman and his collaborators recently introduced a test called SuperGLUE that's specifically designed to be hard for BERT-based systems. So far, no neural network can beat human performance on it. But even if (or when) it happens, does it mean that machines can really understand language any better than before? Or does just it mean that science has gotten better at teaching machines to the test?"

    [1] - https://www.quantamagazine.org/machines-beat-humans-on-a-rea...

  • by RcouF1uZ4gsC on 10/25/19, 2:23 AM

    I think classifying this as human level is misleading.

    Look at the sub-scores on the page. One score that looks very different from humans is AX-b.

    The SuperGLUE paper provides more context on AX-b:

    https://arxiv.org/pdf/1905.00537.pdf

    AX-b "is the broad-coverage diagnostic task, scored using Matthews’ correlation (MCC). "

    This is how the paper describes the test:

    " Analyzing Linguistic and World Knowledge in Models GLUE includes an expert-constructed, diagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and world knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with a three-way entailment relation (entailment, neutral, or contradiction) and tagged with labels that indicate the phenomena that characterize the relationship between the two sentences. Submissions to the GLUE leaderboard are required to include predictions from the submission’s MultiNLI classifier on the diagnostic dataset, and analyses of the results were shown alongside the main leaderboard. Since this broad-coverage diagnostic task has proved difficult for top models, we retain it in SuperGLUE. However, since MultiNLI is not part of SuperGLUE, we collapse contradiction and neutral into a single not_entailment label, and request that submissions include predictions on the resulting set from the model used for the RTE task. We collect non-expert annotations to estimate human performance, following the same procedure we use for the main benchmark tasks (Section 5.2). We estimate an accuracy of 88% and a Matthew’s correlation coefficient (MCC, the two-class variant of the R3 metric used in GLUE) of 0.77. "

    If you look at the scores, humans are estimated to score 0.77. Google T5 scores -0.4 on the test.

    How did T5 get such a high score if it scored so abysmally on the AX-b test?

    The AX scores are not included in the total score.

    From the paper: "The Avg column is the overall benchmark score on non-AX∗ tasks."

    If the AX scores were included, the gap between humans and machines would be bigger than the current score indicates.
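
    For anyone unfamiliar with the metric: MCC is a correlation computed from the binary confusion matrix, where 0 is chance level and negative values are worse than chance. A minimal sketch below (scikit-learn is my choice here, not the benchmark's own scoring code):

      import math
      from sklearn.metrics import matthews_corrcoef

      def mcc(tp, tn, fp, fn):
          """Matthews correlation coefficient from confusion-matrix counts."""
          denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
          return (tp * tn - fp * fn) / denom if denom else 0.0

      # Toy labels/predictions just to exercise the metric.
      y_true = [1, 1, 1, 0, 0, 0, 1, 0]
      y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
      print(matthews_corrcoef(y_true, y_pred))  # library version -> 0.5
      print(mcc(tp=3, tn=3, fp=1, fn=1))        # same counts by hand -> 0.5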

  • by throwaway_bad on 10/25/19, 12:42 AM

    Possibly dumb question: How do you ensure there's no data leakage when benchmarking transfer learning techniques? Is that even a problem anymore when the whole point is to learn "common sense" knowledge?

    For example, their “Colossal Clean Crawled Corpus” (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume are also scraped from the web.
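
    The crude kind of check I have in mind is something like the sketch below, which flags benchmark test examples whose word n-grams also occur in the pretraining corpus. This is purely my illustration, not what the paper does.

      # Crude leakage check: flag test examples sharing any word n-gram
      # with the pretraining corpus. Purely illustrative.
      def ngrams(text, n=8):
          words = text.lower().split()
          return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

      def leaked(test_examples, corpus_ngrams, n=8):
          return [ex for ex in test_examples if ngrams(ex, n) & corpus_ngrams]

      corpus_pages = ["example scraped web page text would go here ..."]  # placeholder
      test_examples = ["example benchmark question text would go here ..."]  # placeholder

      corpus_ngrams = set()
      for page in corpus_pages:          # iterate over the scraped corpus
          corpus_ngrams |= ngrams(page)

      suspicious = leaked(test_examples, corpus_ngrams)
      print(f"{len(suspicious)} possibly leaked test examples")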

  • by Al-Khwarizmi on 10/25/19, 10:20 AM

    This surprised me a bit, on the creation of the corpus they use for training:

    "We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”."

    I don't understand this decision. This list contains words that can be used in a perfectly objective sense, like "anus", "bastard", "erotic", "eunuch", "fecal", etc.

    I can understand that they want to avoid websites full of expletives and with no useful content, but outright excluding any website with even one occurrence of such words sounds too radical. If we asked this model a text comprehension question about a legitimized bastard who inherited the throne, or about fecal transplants, I suppose it would easily fail. A strange way of limiting such a powerful model.
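
    As I read it, the rule is literally page-level: a single hit from the blocklist and the whole page is dropped. A rough reconstruction of that rule (my sketch, not the authors' code):

      # Rough reconstruction of the filtering rule as described in the paper:
      # drop the entire page if ANY blocklisted word appears anywhere on it.
      BAD_WORDS = {"anus", "bastard", "erotic", "eunuch", "fecal"}  # tiny excerpt

      def keep_page(page_text):
          words = set(page_text.lower().split())
          return not (words & BAD_WORDS)

      # A perfectly legitimate medical page gets dropped too:
      print(keep_page("Fecal transplants are an emerging treatment option."))  # False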

  • by nopinsight on 10/25/19, 4:46 AM

    As someone working in the field, I applaud this excellent accomplishment, but I agree with the authors that we shouldn't get too excited yet (see their quote below, after the four reasons). Here are some reasons:

    1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here: http://www.ericswallace.com/triggers

    2) T5 was trained with ~750GB of text, or ~150 billion words, which is > 100 times the number of words native English speakers acquire by the age of 20 (see the back-of-the-envelope arithmetic after the quote below).

    3) Most or all of the tests are multiple-choice. Learning complex correlations from sufficient data should help solve most of them. This is useful but human-level understanding is more than correlations.

    4) The performance on the datasets that require commonsense knowledge, COPA and WSC, is the weakest relative to humans (who score 100.0 on both).

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, p.32 https://arxiv.org/pdf/1910.10683.pdf

    "Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting."

    I’d like to emphasize that the work and the paper are excellent. Still, we are quite far from human-level language understanding.
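
    Back-of-the-envelope arithmetic for point 2, using only the numbers above (the ~5 bytes per word is simply the ratio implied by 750 GB and 150 billion words):

      corpus_bytes = 750e9              # ~750 GB of pretraining text
      bytes_per_word = 5                # implied average, whitespace included
      corpus_words = corpus_bytes / bytes_per_word
      print(f"{corpus_words:.1e} corpus words")            # ~1.5e+11, i.e. ~150 billion

      # ">100x human exposure" implies a native speaker sees fewer than:
      print(f"{corpus_words / 100:.1e} words by age 20")   # ~1.5e+09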

    ---

    We may need more advanced tests to probe the actual language understanding ability of AI systems. Here are some ideas:

    * Test for conceptual understanding in a non-multiple-choice format. Example: Write a summary for a New Yorker article, rather than standard news pieces (which tend to follow repeated patterns).

    * Commonsense test with longer chains of inference than those needed for solving Winograd Schema and set in non-standard situations (e.g. fantasy world). This should greatly reduce the chance that an approach can simply detect correlations from huge datasets.

    * Understanding novel, creative metaphors like those used in some essays by professional writers or some of the Economist's title articles.

  • by martincmartin on 10/25/19, 12:37 AM

    "The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems."

    "We take into account the lessons learnt from original GLUE benchmark and present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard."

  • by YeGoblynQueenne on 10/25/19, 3:31 AM

    Assuming that the baseline human score was set according to the performance of adult humans, then according to these results T5 has a language understanding ability at least as accurate as a human child.

    In fact it's not just T5 that should be able to understand language as well as a human child, but also BERT++, BERT-mtl and RoBERTa, each of which has a score of 70 or more. There really shouldn't be anything else on the planet that has 70% of human language understanding, other than humans.

    So if the benchmarks mean what they think they mean, there are currently fully-fledged strongly artificially intelligent systems. That must mean that, in a very short time we should see strong evidence of having created human-like intelligence.

    Because make no mistake: language understanding is not like image recognition, say, or speech processing. Understanding anything is an AI-complete task, to use a colloquial term.

    Let's wait and see then. It shouldn't take more than five or six years to figure out what all this means.

  • by enisberk on 10/25/19, 3:29 AM

    I attended one of Sam Bowman's talks (1). The talk was about "Task-Independent Language Understanding", and he also discussed GLUE and SuperGLUE; he mentioned that some models are surpassing the average person in experiments. They did some experiments to understand BERT's performance (2) (similar to the article 'NLP's Clever Hans Moment'), but they found a different answer to the question of "what BERT really knows," so he was skeptical of all such conclusions. Check these out if you are interested.

    (1)[https://www.nyu.edu/projects/bowman/TILU-talk-19-09.pdf]

    (2)[https://arxiv.org/abs/1905.06316]

  • by ilaksh on 10/25/19, 4:28 AM

    The AIs in the benchmark are all trained exclusively on text, correct?

    My assumption has always been that to get human-level understanding, the AI systems need to be trained on things like visual data in addition to text. This is because there is a fair amount of information that is not encoded at all in text, or at least is not described in enough detail.

    I mean, humans can't learn to understand language properly without using their other senses. You need something visual or auditory to associate with the words, which are really supposed to represent full systems that are complex and detailed.

    I think the gap would be much more obvious if there were questions that involved things like spatial reasoning, or that combined image recognition with spatial reasoning and comprehension.

  • by alexwg on 10/25/19, 12:31 AM

  • by ArtWomb on 10/25/19, 1:43 PM

    "Attention is all you need", indeed. Of course, our instinct tells us there is more to language inference than word proximity. And so results approaching or exceeding expert-level human baseline raise more questions than providing cause for popping champagne corks.

    Question Answering is also advancing rapidly with insights from transformers and denoising auto-encoders, but it is still far from the human baseline. The ease with which these models can answer a sample question such as "Who was the first human in space?" demonstrates both their efficacy and their limitations: in a large pre-training corpus, almost every document that contains the name "Yuri Gagarin" will, in its near vicinity, describe him in relation to the pioneering accomplishment for which he became a cultural icon.

    And for even more general scenarios, such as "What might you find on a Mayan monument?", it becomes imperative that an agent explain its reasoning in natural language as well, so that its errors can be identified and corrected.

    Language may be considered relatively low-dimensional, and sentence prediction across quotidian tasks manageable for current state-of-the-art architectures. But looking at how difficult it is to predict the next N frames of video given a short input demonstrates the intractability of the problem in higher-dimensional spaces.

    Neural Models for Speech and Language: Successes, Challenges, and the Relationship to Computational Models of the Brain - Michael Collins

    https://www.youtube.com/watch?v=HVnFKmPaU8c

  • by skybrian on 10/25/19, 3:25 AM

    They came up with the SuperGLUE benchmark because they found that the GLUE benchmark was flawed and too easy to game. There were correlations in the dataset that made it possible to get questions right without real understanding, and so the results didn't generalize.

    Could the same thing happen again with the better benchmark due to more subtle correlations? These things are tough to judge, so I'd say wait and see if it turns out to be a real result.
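
    The kind of shortcut people found in GLUE-era NLI data can be shown with a "hypothesis-only" baseline: a classifier that never sees the premise can still beat chance if annotation artifacts (e.g. negation words in contradictions) exist. A toy sketch with made-up data, purely illustrative:

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      # The model is shown ONLY the hypothesis, never the premise.
      hypotheses = [
          "The man is not sleeping.", "Nobody is outside.",
          "A woman is playing an instrument.", "The children are eating lunch.",
      ]
      labels = ["contradiction", "contradiction", "entailment", "entailment"]

      clf = make_pipeline(CountVectorizer(), LogisticRegression())
      clf.fit(hypotheses, labels)
      print(clf.predict(["The dog is not barking."]))  # likely 'contradiction'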

  • by lettergram on 10/25/19, 1:19 AM

    Although those are some great results, I wish I could try it out locally...

    https://github.com/google-research/text-to-text-transfer-tra...

    It drives me nuts that most of these papers / publications don't have code where I can just run:

    > python evaluate_model.py

    Still exciting, just annoying that I'd have to set up Google Cloud to try this out.
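
    For what it's worth, something close to that one-liner is possible if you go through the Hugging Face transformers port of T5 rather than the official TPU-oriented repo; the sketch below assumes that port and a small checkpoint, and it is not the paper's own evaluation pipeline:

      # Minimal local sketch via the Hugging Face `transformers` port of T5,
      # not the official TPU/Google Cloud code path from the repo above.
      from transformers import T5ForConditionalGeneration, T5Tokenizer

      tokenizer = T5Tokenizer.from_pretrained("t5-small")  # small checkpoint, not 11B
      model = T5ForConditionalGeneration.from_pretrained("t5-small")

      # T5 casts every task as text-to-text; tasks are selected with a prefix.
      inputs = tokenizer("cola sentence: The books is on the table.",
                         return_tensors="pt")
      outputs = model.generate(**inputs, max_length=10)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))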

  • by femto113 on 10/25/19, 3:41 AM

    My experience with image classification benchmarks was that they approached human levels only because the scoring counts how much they get “right” and doesn’t penalize completely whack answers as much as it should (like getting full credit for being pretty sure a picture of a dog was either a dog or an alligator). I suspect there’s something similar going on in these language benchmarks.
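
    Concretely, the dog-or-alligator effect is what top-5 style scoring does: full credit as long as the true label shows up anywhere in the five guesses, however absurd the other four are. A toy sketch:

      # Top-5 scoring: full credit if the true label appears anywhere in the
      # model's five guesses, no matter how implausible the rest of the list is.
      def top5_correct(true_label, ranked_guesses):
          return true_label in ranked_guesses[:5]

      guesses = ["alligator", "dog", "submarine", "toaster", "comet"]
      print(top5_correct("dog", guesses))  # True: full credit despite the noise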

  • by riku_iki on 10/25/19, 1:47 AM

    > T5-11B (11 billion parameters)

    So, is this the largest language model so far?

  • by pauljurczak on 10/25/19, 6:54 AM

    Use of the term Natural Language Understanding in the context of this benchmark is preposterous. No understanding takes place there. Please stick to the term NLP (Natural Language Processing) for the next couple of decades. Thank you.

  • by nightnight on 10/25/19, 8:53 AM

    This clearly demonstrates once again that Google is miles ahead of the competition in AI. I mean, they just have the best data.

    If you want an everyday example of Google's AI skills: switch your phone's keyboard to GBoard (especially all iOS users), and you will see a night-and-day difference from any other keyboard, especially the stock one. When using multiple languages at the same time, the leap over other keyboards gets even bigger.

    GBoard is my phone's killer app, and if Google dropped it for iOS I'd leave for Android the same day.

  • by rrival on 10/25/19, 1:31 AM

    Where do I take the SuperGLUE test?

  • by woodgrainz on 10/25/19, 2:52 AM

    Several of the systems on this leaderboard use the BERT model, a clever approach devised by Google for natural language processing. A nice layman's guide to BERT:

    https://towardsdatascience.com/bert-explained-state-of-the-a...

  • by vagab0nd on 10/25/19, 12:46 PM

    This is cool. Since they released an 11B pre-trained model, can we finally reproduce "unicorn-level" text generation now?

  • by vonseel on 10/25/19, 1:53 AM

    I wonder what I would score on this test. Are these things correlated with standardized test scores at all for humans?

  • by LukeB42 on 10/25/19, 6:43 AM