from Hacker News

Redesigning the Scientific Paper

by jsomers on 4/5/18, 12:44 PM with 107 comments

by jfaucett on 4/5/18, 7:28 PM
> These programs tend to be both so sloppily written and so central to the results that it’s contributed to a replication crisis, or put another way, a failure of the paper to perform its most basic task: to report what you’ve actually discovered, clearly enough that someone else can discover it for themselves.
This is the crux of the of the problem IMHO - at least for the fields I study (AI/ML). Replicating the results in papers I read, is way harder than it needs be, i.e. for these fields it should just be fire up a jupyter notebook and download the actual dataset they used (much harder than it seems to actually get your hands on). Very few papers actually contain links to all of this in a final polished manner so that it's #1 understandable and #2 repeatable.
Honestly, I'd much rather have your actual code and data that you used to get your results than read through the research paper if I had to choose (assuming the paper is not pure theory) - but instead there is a disproportionate focus on paper quality over "project quality" at least IMHO.
I don't really know what the solution is since apparently most academics have been perfectly fine with the status quo. I feel like we could build a much better system if we redefined our goals, since I don't think the current system is optimal for disseminating knowledge or finding and fixing mistakes in research or even generally working in a fast iterative process.
by JorgeGT on 4/5/18, 8:31 PM
> How to integrate billions of base pairs of genomic data, and 10 times that amount of proteomic data, and historical patient data, and the results of pharmacological screens into a coherent account of how somebody got sick and what to do to make them better? How to make actionable an endless stream of new temperature and precipitation data, and oceanographic and volcanic and seismic data? How to build, and make sense of, a neuron-by-neuron map of a thinking brain? Equipping scientists with computational notebooks, or some evolved form of them, might bring their minds to a level with problems now out of reach.
The article seems to conflate the praxis of science with the archival of it. Scientists do all of the above on gigantic clusters, not on an IPython/Mathematica notebook. The purpose of publishing papers, on the other hand, is adding to the archival of knowledge, and they can be easily rendered in a laptop with LaTeX.
And they are excellent at archival, by the way. You can see papers from the 19th century still being cited. On the other hand I have had issues running a Mathematica notebook from a few releases back -- and I seriously doubt one will be able to read any of my Mathematica notebooks 150 years from now. The same with the nifty web-based redesign of the Nature paper that is mentioned: I bet the original Nature article will be readable 150 years from now, whereas I doubt the web version will last 20.
by bcoughlan on 4/5/18, 8:14 PM
I used to work as a software developer for a research institute. I wanted to open source our research code and tools, and the department head was in favour of it because it would raise the profile of the research unit.
There were two forces working against us. First many of the grants came from governments, and a stipulation was that we would devote some resources to helping startups commercialise the output of the research. Some felt that open sourcing would remove the need for the startups to work directly with them to integrate the algorithms, and that this would hurt future grant applications by making the research look ineffective.
The main opposition though came from PhD and Postdoc students. Most didn't want anything related to their work open sourced. They believed that it would make it easy for others to pick up from where they were and render their next paper unpublishable by beating them to the punch.
Sadly I think there was some truth to both claims. Papers are the currency of academics, and all metrics for grants and careers hinge off it. It hinders cooperation and fosters a cynical environment of trying to game the metrics to secure a future in academics.
I don't know how else you should measure academics performance, but until those incentives change the journal paper in its current form is going nowhere.
by mettamage on 4/5/18, 10:34 PM
Regarding the article itself: Brett Victor is amazing and so is Strogatz. They are both my heroes actually.
But I do think there is a difference between scientific professionals communicating amongst each other and scientific communication to the public. And if mathematicians understood Strogatz his paper at the time when it was published, and there were enough mathematicians to disseminate the knowledge, then should you require that you create algorithms as animations?
Part of the reason why mathematicians and computer scientists (as researchers) conceive of new algorithms in the first place is because a lot of them are very strong in visualizing algorithms and 'being their own computer'.
Though, if a scientist wants to appeal to a broader group of scientists, then I'd recommend her or him to use every educational tool possible. For example, they could create an interactive blogpost a la parable of the polygons[1] and link that in their paper.
On an unrelated note, it is such a pity ncase isn't mentioned at all in this article!
Also related is explorabl.es, not everything is science communication in an interactive way but a lot of it is[2].
[1] http://ncase.me/polygons/ [2] http://explorabl.es/
by jostmey on 4/5/18, 7:12 PM
We need GitHub for science. But that's not enough. It needs to be combined with a mechanism for peer-review and publishing that funding agencies will find acceptable--that's the key.
by yiyus on 4/5/18, 9:16 PM
I do not agree that the scientific paper needs to be replaced. It should be complemented with the help of new tools, that is a very good thing, but I still want the article.
I work everyday with papers from decades ago, and I hope people will work with my papers in the future. How can I guarantee that researches of 2050 will be able to run my Jupyter notebooks?
Moreover, it is not uncommon to not be able to publish source code. I can write about models and algorithms, but I am not allowed to publish the code I write for some projects.
by aplorbust on 4/5/18, 9:28 PM
"... the skill most in demand among physicists, biologists, chemists, geologists, even anthropologists and research psychologists, is facility with programming languages and "data science" packages."
If I wanted to prove to someone this statement was true, what would be the most effective way to do that?
Is author basing this conclusion on job postings somewhere?
Has he interviewed anyone working in these fields?
Has he worked in a lab or for a company doing R&D?
How does he know?
What evidence (cf. media hype) could I cite in order to convince someone he is right?
When I look at the other articles he has written, they seem focused on popularised notions about computers, but I do not see any articles about the academic disciplines he mentions.
by cowpig on 4/5/18, 7:20 PM
GitXiv very much worth taking a look at if you're into this kind of thing: http://www.gitxiv.com/page/about
edit: as is Chris Olah's Distill project: https://distill.pub/
by Myrmornis on 4/6/18, 3:03 AM
How well does it work to version control Mathematica notebooks in git? For example, is it possible to get meaningful textual diffs when comparing two versions of a mathematica notebook, and can git compress them enough to keep repo size down?
With iPython this is also an issue -- tracking code in JSON is much less clean than tracking code in text files.
It's interesting that Mathematica and iPython both left code-as-plain-text behind as a storage format. I wonder if it would have been possible to come up with a hybrid solution, i.e. retain plain-text code files but with a serialized data structure (JSON-like, or binary) as the glue.
by hprotagonist on 4/5/18, 7:12 PM
as a practical matter, papers will remain relevant as long as they are the metric by which grant applications and tenure decisions are made.
as a philosophical matter, for computation heavy fields, i would love to see literate programming tools become de rigeur in the peer-reviewed distribution of results. In some fields (AI) this basically happens already — the blog post with code snippets and a link to arxiv at the end is a pretty common thing now.
by kemiller on 4/5/18, 8:07 PM
As is usual for Stephen Wolfram, he has a point, and then blunts it by trying to own the whole thing. Edit: to expand, part of the answer to his question, why don't more people do this, is that it requires his expensive proprietary software. Scientific papers are (nominally at least) a commons.
by vanderZwan on 4/6/18, 2:28 AM
Aside from my other comment, I think that any discussion about the scientific paper and the way knowledge is communicated is incomplete without a mention of Nick Sousanis' Unflattening. It is a thesis for a Doctor of Education degree about this very topic, that practises what it preaches by being written as a comic book.
http://www.hup.harvard.edu/catalog.php?isbn=9780674744431
by sophacles on 4/5/18, 8:44 PM
This is a really interesting article. The use of jupyter as a publication mechanism is a really neat idea! I think this path will be fruitful, and I am all for it. I do think however that some low-hanging fruit should be addressed in parallel - stuff that makes looking through the existing work a total pain:
* Date of publication and dates of research should be required in every paper. It's really difficult to trace out the research path if you start from google or random papers you find in various archive searches. Yes that info can be present but often its in metadata where the PDF is linked rather than the PDF itself. Even worse is the "pubname, vol, issue" info rather than a year that you get... now I have to track down when the publication started publishing, how they mark off volumes and so on. I just want to know when the thing was published.
* Software versions used - if you are telling me about kernel modules or plugins/interfaces to existing software, I need to know the version to make my stuff work. Again - eventually it can be tracked down, but running a 'git bisect' on some source tree to find out when the code listings will compile is not OK.
* actual permalinks to data, code, and other supplimental information. Some 3rd party escrow service is not a terrible idea even. I hate trying to track down something from a paper only to find the link is dead and the info is no longer available or has moved a several hour google journey away.
by ivotron on 4/5/18, 9:56 PM
shameless plug for the Popper Convention and CLI tool http://github.com/systemslab/popper . Our goal is to make writing papers as close as possible to writing software (DevOpsify academia) but in a domain-agnostic way.
by mettamage on 4/5/18, 10:29 PM
I love science, but I have a lot of issues with it lately. I'm going to express some of them since they're related to this topic.
The basic function of a scientific paper is understanding and reproducibility (inspired by jfaucett his comment).
I wonder, is reproducibility necessary? Is it even possible when things get really complex? Isn't consensus enough? I feel in the field of psychology (and most social sciences) that is what happens. I suppose consensus can be easily gamed by publication bias and a whole slew of other things. So I suppose as jfaucett puts it, a "discover for yourself" type of thing should still be there. I wonder how qualitative research could be saved and if you could call it science. In Dutch it is all called "wetenschap" and "weten" means to know.
But how should we go about design then? HCI papers use a a lot of design of design that is never justified. The paper is like: we build a system, it improved our user metrics. But is there any intuition or theory written down as to why they designed something a certain way? Not really.
I suppose one strong way to get reproducibility is by getting all the inputs needed. In a psychology study this means getting a dataset. Correlations are fuzzy but if I get the same answers out of the same dataset, then the claims must be true for that particular dataset.
Regarding design and qualitative studies, maybe, film everything? The general themes that everybody would agree upon watching everything would be the reproducible part of it?
Ok, I'll stop. The whole idea of that a paper needs to satisfy the criterion of reproducibility confuses me when I look at what science is nowadays.
by laderach on 4/5/18, 9:19 PM
Sometimes you don't need to redo everything from scratch to change things.
There are a number of problems in scientific publishing. Two big ones are:
1) Distribution hurdles and paywalls imposed by rent seeking journals - who knows how much this has prevented innovation and scientific advancement in the last 20 years
2) Easily replicating experiments / easily verifying accuracy and significance of results - this is related to for instance making data used in research more easily accessible and making it easier to spot p-value hacking
Fixing these might not require a completely new format for papers. Or it could. I can envision solutions both ways.
I really like what the folks from Fermat's Library have been doing. They have been developing tools that are actually useful at the present time and push us in the right direction. I use their arXiv chrome extension https://fermatslibrary.com/librarian all the time for extracting references and bibtex. At the same time they are playing with entirely new concepts - they just posted a neat article on medium about a new unit for academic publishing https://medium.com/@fermatslibrary/a-new-unit-of-academic-pu...
by omot on 4/6/18, 1:27 AM
> His secret weapon was his embrace of the computer at a time when most serious scientists thought computational work was beneath them.
They still think this.
by vanderZwan on 4/6/18, 2:24 AM
Interesting timing: for the last two years I have worked for a research group headed by Sten Linnarsson at the Karolinska Institute[0]. I was specifically hired to build a data browser for a new file format for storing the ever-growing datasets[1][2][3]. The viewer is an SPA specialised in exploring the data on the fly, doing as much as possible client side while minimising the amount of data being transferred, and staying as data-agnostic as possible.
Linnarsson's group just pre-published a paper cataloguing all cell types in the mouse brain, classifying them based on gene expression[4][5]. The whole reason that I was hired was as an "experiment" to see if there was a way to make the enormous amount of data behind it more accessible for quick explorations than raw dumps of data. The viewer uses a lot of recent (as well as slightly-less-recent-but-underused) browser technologies.
Instead of downloading the full data set (which is typically around 28k genes by N cells, where N is in the tens to hundreds of thousands), only the general metadata plus requested genes are downloaded in the form of compressed JSON arrays containing raw numbers or strings. The viewer converts them to Typed Arrays (yes, even with string arrays) and then renders nearly everything on the fly client-side. This also makes it possible to interactively tweak view settings[6]. Because the viewer makes almost no assupmtions of what the data represents, we recently re-used the scatterplot view to display individual cells in a tissue section[7].
Furthermore, this data is stored off-line through IndexedDB, so repeat viewings of the same dataset or specific genes within it does not require re-downloading the (meta)data. This minimises data transfer even further, and makes the whole thing a lot snappier (not to mention cheaper to host, which may matter if you're a small research group). The only reason it isn't completely offline-first is that using service workers is giving me weird interactions with react-router. Being the lone developer I have to prioritise other, more pressing bugs.
In the end however, the viewer is merely a complement to the full catalogue, which is set up with a DocuWiki[8]. No flashy bells and whistles there, but it works. For example, one can look up specific marker genes. it just uses a plugin to create a sortable table, which is established, stable technology that pretty much comes with the DocuWiki[9][10]. The taxonomy tree is a simple static SVG above it. Since the expression data is known client-side to generate the table dynamically, we only need a tiny bit of JavaScript to turn that into an expression heatmap underneath the taxonomy tree. Simple and very effective, and it probably even works in IE8, if not further back. Meanwhile, I got myself into an incredibly complicated mess writing a scatterplotter with low-level sprite rendering and blitting and hand-crafted memoisation to minimise redraws[11].
Personally, I think there isn't enough praise for the pragmatic DocuWiki approach. My contract ends next week. I intend to keep contributing to the viewer, working out the (way too many) rough edges and small bugs that remain, but it won't be full-time. I hope someone will be able to maintain and develop this further. I think the DocuWiki has a better chance of still being on-line and working ten years from now.
[0] http://linnarssonlab.org/
[1] http://loompy.org/
[2] https://github.com/linnarsson-lab/loom-viewer
[3] http://loom.linnarssonlab.org/
[4] https://twitter.com/slinnarsson/status/981919808726892545
[5] https://www.biorxiv.org/content/early/2018/04/05/294918
[6] https://imgur.com/f6GpMZ1
[7] http://loom.linnarssonlab.org/dataset/cells/osmFISH/osmFISH_..., https://i.imgur.com/a7Mjyuu.png
[8] http://mousebrain.org/doku.php?id=start
[9] http://mousebrain.org/doku.php?id=genes:aw551984
[10] http://mousebrain.org/doku.php?id=genes:actb
[11] https://github.com/linnarsson-lab/loom-viewer/blob/master/cl...
by tensor on 4/5/18, 8:00 PM
This title is horrible hyperbole. Science is more than just machine learning. Hell, even if we just constrain ourselves to "computer science" probably half of it is just math, for which the scientific paper is definitely not "obsolete" nor even deficient in any way.
But outside of computer science you need laboratories to replicate experiments. Scientific papers are perfectly fine vehicles to record the necessary information to replicate experiments in this setting. Historically appendices are used for the extended details. And yes, replication is hard, but it's part of science.
by awll on 4/5/18, 7:22 PM
This reads like an ad for Mathematica
by sappapp on 4/5/18, 7:16 PM
Average clickbait