by alexwg on 10/24/19, 11:38 PM with 235 comments
by hn_throwaway_99 on 10/25/19, 12:40 AM
One "word in context" task is to look at 2 different sentences that have a common word and decide if that word means the same thing in both sentences or different things (more details here: https://pilehvar.github.io/wic/)
One of their examples, though, didn't make any sense to me:
1. The pilot managed to land the airplane safely
2. The enemy landed several of our aircrafts
It says that the word "land" does NOT mean the same thing in those sentences. I am a native English speaker, and I honestly don't understand what they are thinking the second sentence means. Shot them down? If so, I have never heard "landed" used in that context, and it appears neither has Merriam-Webster. Also, the plural of aircraft is just "aircraft", without the s.
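For readers unfamiliar with the task, a WiC item boils down to a target word, two sentences, and a binary same-sense/different-sense label. A minimal sketch of that structure in Python (field names are illustrative, not the dataset's exact schema):

    # Hypothetical representation of one word-in-context (WiC) item.
    wic_item = {
        "word": "land",
        "sentence1": "The pilot managed to land the airplane safely.",
        "sentence2": "The enemy landed several of our aircrafts.",
        "same_sense": False,  # the dataset's gold label for this pair
    }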
by 6gvONxR4sf7o on 10/25/19, 2:34 AM
This is really, really great, let's be clear. It's just not up to the hype that "omg, superhuman" claims usually get.
by pmoriarty on 10/25/19, 12:40 AM
Regarding SuperGLUE specifically, the article [1] asked:
"Indeed, Bowman and his collaborators recently introduced a test called SuperGLUE that's specifically designed to be hard for BERT-based systems. So far, no neural network can beat human performance on it. But even if (or when) it happens, does it mean that machines can really understand language any better than before? Or does just it mean that science has gotten better at teaching machines to the test?"
[1] - https://www.quantamagazine.org/machines-beat-humans-on-a-rea...
by RcouF1uZ4gsC on 10/25/19, 2:23 AM
Look at the sub-scores on the page. One score that looks very different from humans is AX-b.
The SuperGLUE paper provides more context about AX-b:
https://arxiv.org/pdf/1905.00537.pdf
AX-b "is the broad-coverage diagnostic task, scored using Matthews’ correlation (MCC). "
This is how the paper describes this test:
"Analyzing Linguistic and World Knowledge in Models: GLUE includes an expert-constructed, diagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and world knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with a three-way entailment relation (entailment, neutral, or contradiction) and tagged with labels that indicate the phenomena that characterize the relationship between the two sentences. Submissions to the GLUE leaderboard are required to include predictions from the submission’s MultiNLI classifier on the diagnostic dataset, and analyses of the results were shown alongside the main leaderboard. Since this broad-coverage diagnostic task has proved difficult for top models, we retain it in SuperGLUE. However, since MultiNLI is not part of SuperGLUE, we collapse contradiction and neutral into a single not_entailment label, and request that submissions include predictions on the resulting set from the model used for the RTE task. We collect non-expert annotations to estimate human performance, following the same procedure we use for the main benchmark tasks (Section 5.2). We estimate an accuracy of 88% and a Matthew’s correlation coefficient (MCC, the two-class variant of the R3 metric used in GLUE) of 0.77."
If you look at the scores, humans are estimated to score 0.77. Google T5 scores -0.4 on the test.
How did T5 get such a high overall score if it scored so abysmally on the AX-b test?
The AX scores are not included in the total score.
From the paper: "The Avg column is the overall benchmark score on non-AX∗ tasks."
If the AX scores were included, the gap between humans and machines would be bigger than the current score indicates.
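For context on how bad a negative number is here: MCC ranges from -1 to 1, with 0 being chance level, so -0.4 is well below random guessing. A quick sketch of the standard two-class MCC computation (my own illustration, not code from the paper):

    import math

    def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
        """Two-class Matthews correlation coefficient from confusion-matrix counts."""
        num = tp * tn - fp * fn
        den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return num / den if den else 0.0

    # On a balanced binary set, ~88% accuracy corresponds to roughly the 0.77
    # MCC estimated for humans above.
    print(matthews_corrcoef(tp=44, fp=6, tn=44, fn=6))  # ~0.76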
by throwaway_bad on 10/25/19, 12:42 AM
For example, their “Colossal Clean Crawled Corpus” (C4), a dataset consisting of hundreds of gigabytes of clean English text scraped from the web, might contain much of the same information as the benchmark datasets, which I presume are also scraped from the web.
by Al-Khwarizmi on 10/25/19, 10:20 AM
"We removed any page that contained any word on the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”."
I don't understand this decision. This list contains words that can be used in a perfectly objective sense, like "anus", "bastard", "erotic", "eunuch", "fecal", etc.
I can understand that they want to avoid websites full of expletives and with no useful content, but outright excluding any website with even one occurrence of such a word sounds too radical. If we ask this model a text comprehension question about a legitimized bastard who inherited the throne, or about fecal transplants, I suppose it would easily fail. A strange way of limiting such a powerful model.
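To make the objection concrete, page-level blocklist filtering amounts to something like the sketch below (my own illustration, not the C4 authors' actual code; the real pipeline does other cleaning as well):

    # Tiny excerpt of the blocklist, for illustration only.
    BAD_WORDS = {"anus", "bastard", "erotic", "eunuch", "fecal"}

    def keep_page(text: str) -> bool:
        """Drop the entire page if any blocklisted word appears even once."""
        tokens = set(text.lower().split())
        return not (tokens & BAD_WORDS)

    # A perfectly clinical sentence still causes the whole page to be discarded:
    print(keep_page("Fecal transplants are used to treat C. difficile infections."))  # False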
by nopinsight on 10/25/19, 4:46 AM
1) Most likely, the model is still susceptible to adversarial triggers as demonstrated on other systems here: http://www.ericswallace.com/triggers
2) T5 was trained on ~750GB of text, or ~150 billion words, which is more than 100 times the number of words native English speakers encounter by the age of 20 (see the rough arithmetic after the quote below).
3) Most or all of the tests are multiple-choice. Learning complex correlations from sufficient data should help solve most of them. This is useful, but human-level understanding is more than correlations.
4) The performance on the datasets that require commonsense knowledge, COPA and WSC, is the weakest relative to humans (who score 100.0 on both).
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, p.32 https://arxiv.org/pdf/1910.10683.pdf
"Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting."
I’d like to emphasize that the work and the paper are excellent. Still, we are quite far from human-level language understanding.
---
We may need more advanced tests to probe the actual language understanding ability of AI systems. Here are some ideas:
* Test for conceptual understanding in a non-multiple-choice format. Example: write a summary of a New Yorker article, rather than of standard news pieces (which tend to follow repeated patterns).
* A commonsense test with longer chains of inference than those needed to solve Winograd schemas, set in non-standard situations (e.g. a fantasy world). This should greatly reduce the chance that an approach can succeed simply by detecting correlations in huge datasets.
* Understanding novel, creative metaphors like those used in some essays by professional writers or in some of the Economist's cover articles.
by martincmartin on 10/25/19, 12:37 AM
"We take into account the lessons learnt from original GLUE benchmark and present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard."
by YeGoblynQueenne on 10/25/19, 3:31 AM
In fact it's not just T5 that should be able to understand language as well as a human child, but also BERT++, BERT-mtl and RoBERTa, each of which has a score of 70 or more. There really shouldn't be anything else on the planet that has 70% of human language understanding, other than humans.
So if the benchmarks mean what they think they mean, there are currently fully-fledged, strongly artificially intelligent systems. That must mean that, in a very short time, we should see strong evidence of having created human-like intelligence.
Because make no mistake: language understanding is not like image recognition, say, or speech processing. Understanding anything is an AI-complete task, to use a colloquial term.
Let's wait and see then. It shouldn't take more than five or six years to figure out what all this means.
by enisberk on 10/25/19, 3:29 AM
(1) https://www.nyu.edu/projects/bowman/TILU-talk-19-09.pdf
by ilaksh on 10/25/19, 4:28 AM
My assumption has always been that to get human-level understanding, the AI systems need to be trained on things like visual data in addition to text. This is because there is a fair amount of information that is not encoded at all in text, or at least is not described in enough detail.
I mean, humans can't learn to understand language properly without using their other senses. You need something visual or auditory to associate with the words, which are really supposed to represent full systems that are complex and detailed.
I think the gap would be much more obvious if there were questions that involved things like spatial reasoning, or that combined image recognition with spatial reasoning and comprehension.
by ArtWomb on 10/25/19, 1:43 PM
Question answering is also advancing rapidly with insights from transformers and denoising auto-encoders, but it is still far from the human baseline. The ease with which these models can answer a sample question such as "Who was the first human in space?" demonstrates both their efficacy and their limitations. Because they are pre-trained on a large corpus of text, and almost every document containing the name "Yuri Gagarin" will, in its near vicinity, describe him in relation to the pioneering accomplishment for which he became a cultural icon, the answer can be retrieved by association rather than understanding.
And for even more general scenarios, such as "What might you find on a Mayan monument?", it becomes imperative that an agent also explain its reasoning in natural language, so that errors can be traced back and corrected.
Language may be considered relatively low-dimensional, and sentence prediction across quotidian tasks is manageable for current state-of-the-art architectures. But looking at how difficult it is to predict the next N frames of video given a short input example demonstrates the intractability of the problem in higher-dimensional spaces.
Neural Models for Speech and Language: Successes, Challenges, and the Relationship to Computational Models of the Brain - Michael Collins
by skybrian on 10/25/19, 3:25 AM
Could the same thing happen again with the better benchmark due to more subtle correlations? These things are tough to judge, so I'd say wait and see if it turns out to be a real result.
by lettergram on 10/25/19, 1:19 AM
https://github.com/google-research/text-to-text-transfer-tra...
It drives me nuts that most of these papers / publications don't have code where I can just run:
> python evaluate_model.py
Still exciting, just annoying that I'd have to set up Google Cloud to try this out.
by riku_iki on 10/25/19, 1:47 AM
So, this is the largest language model so far?
by nightnight on 10/25/19, 8:53 AM
If you want an everyday example of Google's AI skills: switch your phone's keyboard to Gboard, especially all iOS users, and you will see a night-and-day difference compared to any other keyboard, especially the stock one. When using multiple languages at the same time, the gap over other keyboards gets even bigger.
Gboard is my phone's killer app, and if Google dropped it for iOS I'd leave for Android the same day.
by woodgrainz on 10/25/19, 2:52 AM
https://towardsdatascience.com/bert-explained-state-of-the-a...