by jacquesm on 5/13/21, 7:39 PM
The side-by-side display makes it pretty easy to distinguish one from the other: just compare them, and the one that makes the least sense is the nonsense. Like that I score 10/11. But when looking at just the left-side one, the problem is
much harder, and I'm happy to get better than even. Bits that don't help: I'm not a native English writer, and I've seen too many real-life papers with writing so crappy that quite a few of these look plausible, especially when they are about fields I know very little of.
Presumably if you're a native English speaker with broader interests, the difficulty goes down a bit.
I like this project very much and would like to see some overall scores. It also might not hurt to allow for a verified result link, to tell bragging apart from actual results (not that anybody on HN would ever brag about their score ;) ).
Overall: I'm not worried that generated papers will swamp the publications any day soon, but for spam/click farms this must be a godsend, and it will surely cause trouble for search engines trying to tell real content from generated content.
by jvanderbot on 5/13/21, 6:03 PM
Quite easy when you know one is fake. Flagging fake articles in a review queue, by abstract only, when anywhere from none to all of them may be fake...
Now that's a challenge.
Also, if you train GPT on the whole corpus of Nature / Science / whatever articles up to, say, 2005, could you feed it leading text about discoveries after 2005 and see if it hypothesizes the justification for those discoveries in the same way that the authors did?
by userbinator on 5/14/21, 2:06 AM
Some of the fake ones are hilarious:
The chicken genome (the genome of a chicken that is the subject of much chicken-related activity) is now compared to its chicken chicken-to-pecking age: from a genome sequence of chicken egg, only approximately 70% of the chicken genome sequences match the chicken egg genome, which suggests that the chicken may have beenancreatic.
(Related: https://www.youtube.com/watch?v=yL_-1d9OSdk )
by ronsor on 5/13/21, 6:33 PM
Even hard mode isn't that hard, because GPT-2 tends to ramble on while saying nothing substantive. If I can't figure out what a paper is supposed to be talking about, it's fake.
4/4 on hard. Never read a Nature paper before.
by drenvuk on 5/13/21, 7:32 PM
The hard version usually requires me to understand why a number, measurement, chemical, or other substance doesn't make sense in the context of what each paragraph is describing. This means I can't just skim it to spot the fake; I need to figure out that what it's saying is wrong.
That's close enough for this to be a success if the purpose was to persuade or fool laymen.
by quantum_mcts on 5/13/21, 9:30 PM
I've always wanted a similar thing but for philosophy texts. Notably Hegel: I'd love to see a philosopher trying to figure out which pile of gibberish is generated and which is the work of the father of modern dialectics.
by carbocation on 5/13/21, 5:58 PM
Easy mode is cake.
Hard mode is good enough that I'd like to see some sort of distance metric to the nearest real story, to be sure the model isn't accidentally copying truth.
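Something like this rough sketch is what I have in mind; it assumes scikit-learn, and the two "real" abstracts are just toy stand-ins for the actual training corpus, so this is an illustration of the check, not the site's method:

    # Flag generated abstracts that are suspiciously close to a real one.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    real_abstracts = [  # stand-ins for the actual training corpus
        "Crystal structure of the nicotinic acetylcholine receptor...",
        "Sequence and comparative analysis of the chicken genome...",
    ]
    generated = "A new dinosaur-like dinosaur from the Late Cretaceous..."

    vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    corpus_matrix = vec.fit_transform(real_abstracts)
    query = vec.transform([generated])

    sims = cosine_similarity(query, corpus_matrix)[0]
    nearest = sims.argmax()
    print(f"nearest real abstract: #{nearest}, similarity {sims[nearest]:.2f}")

A high maximum similarity would suggest the model is parroting its training data rather than composing.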
by beforeolives on 5/13/21, 6:41 PM
Cool demo.
With these GPT models, I don't get the appeal of creating fake text that at best can pass as real to someone who doesn't understand the topic and context. What's the use case? Generating more believable spam for social media? Anything else? Because there's no real knowledge representation or information extraction going on here.
by codeflo on 5/13/21, 6:08 PM
5/5 on hard mode, but it's tough sometimes; I don't actually know much about biology. But if you've played around with GPT before, you get better at spotting the subtle logical errors it tends to make. I wonder whether the ability to identify machine-generated text will become a useful skill at some point.
by ta988 on 5/13/21, 6:54 PM
I'm sure GPT-2 abstracts would fly through many conferences' screening processes. I've seen talks and posters that were utter nonsense, but everybody was too polite to say anything to the person or their advisors.
I've reviewed articles that were completely made up, and the other reviewer didn't even detect that. Nor did the editor.
I've contacted editors about utterly wrong papers and criticized the articles on PubPeer, and the articles are still published... because it would harm their reputation. That's one of the maddening aspects of academic publishing.
by CSDude on 5/13/21, 9:18 PM
How does one train GPT-2 on their own content and produce nice results at arbitrary lengths? I found a few libraries but I couldn't use them well; I get lost very quickly. I just want to train it on our internal Confluence and have fun with it.
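For what it's worth, here's a minimal sketch of the shape such a script usually takes with Hugging Face transformers; confluence_export.txt is a hypothetical plain-text dump of your pages, and the hyperparameters are placeholders:

    # Fine-tune GPT-2 on your own text, then sample from it.
    from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    dataset = TextDataset(tokenizer=tokenizer,
                          file_path="confluence_export.txt",  # hypothetical
                          block_size=512)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="gpt2-confluence",
                               num_train_epochs=3,
                               per_device_train_batch_size=2),
        data_collator=collator,
        train_dataset=dataset,
    )
    trainer.train()

    # "Arbitrary" length has a catch: GPT-2's window is 1024 tokens, so
    # for longer output you generate in chunks, re-feeding the tail of
    # the previous output as the new prompt.
    ids = tokenizer.encode("Our deployment process", return_tensors="pt")
    out = model.generate(ids, max_length=300, do_sample=True, top_p=0.9)
    print(tokenizer.decode(out[0]))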
by supermatt on 5/13/21, 6:28 PM
Even on hard, if you understand the terminology, the fake ones are mostly gibberish.
by hazeii on 5/13/21, 5:51 PM
On easy, 7 correct and 0 wrong was enough to tell me that yes, I can (I have been reading Nature for years though).
by make3 on 5/13/21, 11:11 PM
Even with GPT-3, where this "game" would be much harder, this would be kind of a weak demo, because the human reader doesn't understand what the text means in either case (most of the time), which takes away much if not all of the interesting part: whether the generated text makes any sense.
We have known for a while that language models can generate superficially good-looking text. The real question is whether they can get to actually understand what is being said. Since the human judges don't understand either, the exercise is sadly moot.
by anon_tor_12345 on 5/13/21, 5:51 PM
STEM people love to bring up the Sokal affair. The same STEM people don't realize that many journals and conferences in STEM have been tricked by things like this (more specifically by precursors using HMMs etc.).
https://en.wikipedia.org/wiki/List_of_scholarly_publishing_s...
edit: I don't understand why I'm getting downvoted. Is my comment not relevant to a post about the plausibility of abstracts generated by ML models?
by dnautics on 5/14/21, 5:56 AM
10/10 on easy and 10/10 on hard. The hard selections seem hard mostly because they are short enough that you don't see GPT-2 go off the rails into something completely nonsensical.
Only one was convincing enough to be truly challenging. I got it right because the proposed mechanism was fishy, 1) I had domain expertise, and 2) the date of the paper made no sense relative to when that sort of discovery would be made (2009 is too early).
by jackcviers3 on 5/14/21, 1:16 AM
6 and 2 on hard mode. The failure of the model to connect ideas in long paragraphs (or to make a succinct claim) is what gives it away. It introduces far too many terms with far too little repetition and far too much specificity in such a short span.
Suggested tweak - train it against papers written by people with an Erdos number < 3 (or Feynman contributors, etc.), so that the topics and fake topics are more closely related in style and content. Maybe even feed it some of their professional letters as well. That would produce some very hard to decipher fakes.
Another great corpus for complex writing is public law books. Have it compare real laws from the training set with fake laws. I bet it would be very difficult to figure out the fake laws.
Training one of these on an entire corpus of one author (Roger Ebert, Justice Ginsburg, Joyce, anyone with a large enough body of work), and having people spot the fake paragraphs from the real ones would be very, very difficult. An entire text, however, would likely be discernible.
It is getting really, really close to being able to fool any layman, though. Impressive work!
by dougb5 on 5/13/21, 6:48 PM
The generated abstracts may be gibberish but I wonder how often they contain little bits of brilliance, or make novel connections between ideas expressed in the training set. If we got a panel of domain experts to evaluate the snippets on this basis, their labels could be used to fine-tune the model in the direction of novel discovery. (This is almost certainly not a novel idea!)
by th0ma5 on 5/13/21, 5:50 PM
Now here is an application where the GPT stuff can really shine: trying to convince people who aren't domain experts that something is speaking from authority, even if the reader doesn't intend to get anything meaningful from the material either way.
by anigbrowl on 5/13/21, 11:19 PM
18/20 on hard mode. Shorter ones are more difficult; longer ones tend to have dangling clauses or circular claims. I suspect GPT-3 could produce convincing complete abstracts. But this was good enough that I don't feel bad about the two I missed.
by karagenit on 5/13/21, 6:48 PM
Seems like the model likes to repeat words in the title, particularly when hyphens are involved (I guess it considers them different words?), e.g. "new dinosaur-like dinosaur" and "male-pattern traits in male rats" are a couple I saw.
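That hunch is easy to poke at with GPT-2's own BPE tokenizer (a quick sketch, assuming the transformers library):

    # Show how GPT-2's BPE splits a hyphenated title: the hyphen breaks
    # the word into separate pieces, so "dinosaur-like" and "dinosaur"
    # don't look like the same unit to the model.
    from transformers import GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    print(tok.tokenize("new dinosaur-like dinosaur"))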
by varispeed on 5/13/21, 7:06 PM
If we feed an AI all the knowledge about the physics of the world, will it ever be capable of giving answers without actually performing scientific research, inferring them just from the laws that define the world?
by finin on 5/13/21, 9:32 PM
We've done recent work on using a transformer to generate fake cyber threat intelligence (CTI) and found that a set of cybersecurity experts could not reliably distinguish the fake CTI examples from real ones.
Priyanka Ranade, Aritran Piplai, Sudip Mittal, Anupam Joshi, and Tim Finin, Generating Fake Cyber Threat Intelligence Using Transformer-Based Models, Int. Joint Conf. on Neural Networks, IEEE, 2021. https://ebiq.org/p/969
by f6v on 5/13/21, 6:02 PM
The sad thing is that it often takes equal mental effort to read GPT articles and the real ones. It's as if people are trying to make their papers as incomprehensible as possible.
by jpindar on 5/13/21, 11:42 PM
Pretty easy, even in hard mode, and not due to any knowledge of the subject matter. I'm 15 - 0 so far.
I kept seeing certain types of grammatical error, such as constructs like "... and foo, despite foo, so..." or "with foo, but not foo...", where foo is the exact same word or phrase appearing twice in a sentence.
I also kept seeing sentences with two clauses that should have agreed in number or tense but did not.
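That repeated-foo tell is mechanical enough that you could grep for it. A crude sketch (the pattern and connective list are just my guesses from the examples above, and it will false-positive plenty):

    # Flag sentences where the same word/short phrase recurs around a
    # connective, like "... and foo, despite foo ..." or "foo, but not foo".
    import re

    PATTERN = re.compile(
        r"\b(\w+(?:\s+\w+)?)\b"        # a word or two-word phrase
        r"[^.?!]{0,40}?"               # a short gap within the sentence
        r"\b(?:despite|but not)\b"     # a connective from the examples above
        r"[^.?!]*?"
        r"\b\1\b",                     # ...then the same phrase again
        re.IGNORECASE,
    )

    def looks_generated(sentence: str) -> bool:
        return bool(PATTERN.search(sentence))

    print(looks_generated("The cells divided, despite the cells being starved."))
    print(looks_generated("A perfectly normal sentence about receptors."))

Number/tense agreement between clauses would need an actual parser, though.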
by writeslowly on 5/13/21, 8:57 PM
I found it relatively easy to spot the fakes, but the titles on some of them were pretty good and made me wish they were real. Like reading science journals from a whimsical fantasy universe.
Some of my favorites: "A new onset of primeval black magic in magic-ring crystals"
"The genetic network for moderate religiosity in one thousand bespectacled twins"
"Thermal vestige of the '70s and '00s disco ball trend"
by ivirshup on 5/14/21, 2:55 AM
I initially hadn't realized these were meant to be abstracts (as the site doesn't say this). Knowing this makes hard mode much easier.
I'd been having trouble with ones which had a reasonable logical flow, but didn't communicate a complete idea.
Of course, pretty small N so YMMV
by cblconfederate on 5/13/21, 5:57 PM
It seems some of the fake ones could easily have been real (e.g. the one about the 3D structure of the bound ACh receptor). I guess the brevity of the text helps it make sense and makes it indistinguishable.
by NorwegianDude on 5/13/21, 7:08 PM
Not hard at all. The fake ones don't make any sense from an English perspective. It looks like someone just picked the next word on a SwiftKey keyboard or something: "this word fits here... right?"
by hutzlibu on 5/13/21, 5:56 PM
Nice advanced logic riddles.
by blt on 5/13/21, 8:08 PM
I wish there was a version of this for computer science. We don't have a broad flagship journal like Nature, so maybe it would need to be trained on a collection of IEEE and ACM venues.
by gentleman11 on 5/13/21, 11:16 PM
The major tell of these systems is that they can't write something coherent over a span larger than a few paragraphs, and short abstracts sidestep that, so this is less impressive than it would have been 5 years ago. Still, well done.
by make3 on 5/13/21, 11:06 PM
I'm surprised by how broken GPT-2's English is. A lot of sentences are just broken.
I would be curious to try again with GPT-3.
by riquito on 5/13/21, 7:59 PM
With discretion, PHP/Bootstrap/jQuery still do their job for the presentation layer.
by generalizations on 5/13/21, 5:55 PM
I'm curious if the trained model is available. It would be very fun to play with.
by arthurofcharn on 5/13/21, 7:26 PM
Could we feed gpt-2 Turbo Encabulator? I want more Turbo Encabulator.
by et2o on 5/14/21, 1:23 AM
I just got 10/10. This is not particularly difficult yet.
by f430 on 5/13/21, 7:10 PM
Progressively got tougher. I'm scared of the implications in like 20 years.