by jsomers on 4/5/18, 12:44 PM with 107 comments
by jfaucett on 4/5/18, 7:28 PM
This is the crux of the problem IMHO - at least for the fields I study (AI/ML). Replicating the results in papers I read is way harder than it needs to be. For these fields it should just be: fire up a Jupyter notebook and download the actual dataset they used (which is much harder to get your hands on than it sounds). Very few papers actually contain links to all of this in a final, polished form so that it's #1 understandable and #2 repeatable.
Honestly, if I had to choose, I'd much rather have the actual code and data you used to get your results than read through the research paper (assuming the paper is not pure theory) - but instead there is a disproportionate focus on paper quality over "project quality", at least IMHO.
I don't really know what the solution is since apparently most academics have been perfectly fine with the status quo. I feel like we could build a much better system if we redefined our goals, since I don't think the current system is optimal for disseminating knowledge or finding and fixing mistakes in research or even generally working in a fast iterative process.
by JorgeGT on 4/5/18, 8:31 PM
The article seems to conflate the praxis of science with the archival of it. Scientists do all of the above on gigantic clusters, not in an IPython/Mathematica notebook. The purpose of publishing papers, on the other hand, is adding to the archive of knowledge, and they can easily be rendered on a laptop with LaTeX.
And they are excellent at archival, by the way. You can see papers from the 19th century still being cited. On the other hand I have had issues running a Mathematica notebook from a few releases back -- and I seriously doubt one will be able to read any of my Mathematica notebooks 150 years from now. The same with the nifty web-based redesign of the Nature paper that is mentioned: I bet the original Nature article will be readable 150 years from now, whereas I doubt the web version will last 20.
by bcoughlan on 4/5/18, 8:14 PM
There were two forces working against us. First, many of the grants came from governments, and a stipulation was that we would devote some resources to helping startups commercialise the output of the research. Some felt that open sourcing would remove the need for the startups to work directly with them to integrate the algorithms, and that this would hurt future grant applications by making the research look ineffective.
The main opposition, though, came from PhD students and postdocs. Most didn't want anything related to their work open sourced. They believed it would make it easy for others to pick up from where they were and render their next paper unpublishable by beating them to the punch.
Sadly, I think there was some truth to both claims. Papers are the currency of academia, and all metrics for grants and careers hinge on them. This hinders cooperation and fosters a cynical environment of trying to game the metrics to secure a future in academia.
I don't know how else you should measure academics' performance, but until those incentives change, the journal paper in its current form is going nowhere.
by mettamage on 4/5/18, 10:34 PM
But I do think there is a difference between scientific professionals communicating with each other and scientific communication to the public. And if mathematicians understood Strogatz's paper at the time it was published, and there were enough mathematicians to disseminate the knowledge, then should you require that algorithms be presented as animations?
Part of the reason mathematicians and computer scientists (as researchers) conceive of new algorithms in the first place is that a lot of them are very good at visualizing algorithms and 'being their own computer'.
Though, if a scientist wants to appeal to a broader group of scientists, then I'd recommend they use every educational tool possible. For example, they could create an interactive blog post à la Parable of the Polygons[1] and link that in their paper.
On an unrelated note, it is such a pity ncase isn't mentioned at all in this article!
Also related is explorabl.es; not everything there is science communication in an interactive way, but a lot of it is[2].
by jostmey on 4/5/18, 7:12 PM
by yiyus on 4/5/18, 9:16 PM
I work every day with papers from decades ago, and I hope people will work with my papers in the future. How can I guarantee that researchers in 2050 will be able to run my Jupyter notebooks?
Moreover, it is not uncommon to be unable to publish source code. I can write about models and algorithms, but I am not allowed to publish the code I write for some projects.
by aplorbust on 4/5/18, 9:28 PM
If I wanted to prove to someone this statement was true, what would be the most effective way to do that?
Is the author basing this conclusion on job postings somewhere?
Has he interviewed anyone working in these fields?
Has he worked in a lab or for a company doing R&D?
How does he know?
What evidence (cf. media hype) could I cite in order to convince someone he is right?
When I look at the other articles he has written, they seem focused on popularised notions about computers, but I do not see any articles about the academic disciplines he mentions.
by cowpig on 4/5/18, 7:20 PM
edit: as is Chris Olah's Distill project: https://distill.pub/
by Myrmornis on 4/6/18, 3:03 AM
With IPython this is also an issue -- tracking code in JSON is much less clean than tracking code in text files.
It's interesting that Mathematica and IPython both left code-as-plain-text behind as a storage format. I wonder if it would have been possible to come up with a hybrid solution, i.e. retain plain-text code files but with a serialized data structure (JSON-like, or binary) as the glue.
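Roughly the kind of hybrid I mean, as a back-of-the-envelope sketch (file names invented; it assumes the nbformat-4 .ipynb layout, i.e. a cells[] array where each cell has cell_type and source): keep the code itself in a plain text file that diffs and greps cleanly, and push outputs and metadata into a separate serialized "glue" file.

    // Split an .ipynb into a plain-text code file plus a JSON "glue" file.
    import { readFileSync, writeFileSync } from "fs";

    const nb = JSON.parse(readFileSync("analysis.ipynb", "utf8"));

    // The code, as plain text: one block per code cell, marker comments between.
    const code = nb.cells
      .filter((c: any) => c.cell_type === "code")
      .map((c: any, i: number) =>
        `# %% cell ${i}\n` +
        (Array.isArray(c.source) ? c.source.join("") : c.source))
      .join("\n\n");
    writeFileSync("analysis.code.py", code);

    // The "glue": everything that is not source code (outputs, metadata).
    const glue = {
      notebookMetadata: nb.metadata,
      cells: nb.cells.map((c: any) => ({
        cell_type: c.cell_type,
        metadata: c.metadata,
        outputs: c.outputs ?? null,
        execution_count: c.execution_count ?? null,
      })),
    };
    writeFileSync("analysis.glue.json", JSON.stringify(glue, null, 2));

You'd need the reverse transformation too, of course, but the point is that the part humans edit and review stays in plain text.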
by hprotagonist on 4/5/18, 7:12 PM
As a philosophical matter, for computation-heavy fields, I would love to see literate programming tools become de rigueur in the peer-reviewed distribution of results. In some fields (AI) this basically happens already — the blog post with code snippets and a link to arXiv at the end is a pretty common thing now.
by kemiller on 4/5/18, 8:07 PM
by vanderZwan on 4/6/18, 2:28 AM
by sophacles on 4/5/18, 8:44 PM
* Date of publication and dates of research should be required in every paper. It's really difficult to trace out the research path if you start from Google or random papers you find in various archive searches. Yes, that info can be present, but often it's in the metadata on the page where the PDF is linked rather than in the PDF itself. Even worse is getting "pubname, vol, issue" info rather than a year... now I have to track down when the publication started publishing, how they mark off volumes, and so on. I just want to know when the thing was published.
* Software versions used - if you are telling me about kernel modules or plugins/interfaces to existing software, I need to know the version to make my stuff work. Again - eventually it can be tracked down, but running a 'git bisect' on some source tree to find out when the code listings will compile is not OK.
* Actual permalinks to data, code, and other supplemental information. Some 3rd-party escrow service is not a terrible idea, even. I hate trying to track down something from a paper only to find the link is dead and the info is no longer available or has moved several hours of googling away. (A rough sketch of what machine-readable front matter covering all of this could look like follows below.)
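Something like this, say, where every field name is invented for illustration and not any existing standard:

    // Hypothetical "reproducibility manifest" shipped alongside the PDF.
    interface PaperManifest {
      title: string;
      datePublished: string;                        // ISO 8601, not "vol. 12, issue 3"
      researchConducted: { from: string; to: string };
      software: { name: string; version: string }[];    // exact versions used
      artifacts: {
        kind: "data" | "code" | "supplement";
        url: string;                                // a permalink, ideally escrowed
        sha256: string;                             // so a moved copy can still be verified
      }[];
    }

    const example: PaperManifest = {
      title: "Some kernel-module paper",
      datePublished: "2018-04-05",
      researchConducted: { from: "2017-01-01", to: "2017-12-31" },
      software: [{ name: "linux", version: "4.14.2" }],
      artifacts: [
        { kind: "data", url: "https://example.org/dataset.tar.gz", sha256: "<hash of the exact file>" },
      ],
    };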
by ivotron on 4/5/18, 9:56 PM
by mettamage on 4/5/18, 10:29 PM
The basic function of a scientific paper is understanding and reproducibility (inspired by jfaucett's comment).
I wonder, is reproducibility necessary? Is it even possible when things get really complex? Isn't consensus enough? I feel that in psychology (and most of the social sciences) that is what happens. I suppose consensus can easily be gamed by publication bias and a whole slew of other things. So I suppose, as jfaucett puts it, a "discover for yourself" element should still be there. I wonder how qualitative research could be saved and whether you could call it science. In Dutch it is all called "wetenschap", and "weten" means "to know".
But how should we go about design then? HCI papers involve a lot of design decisions that are never justified. The paper is like: we built a system, and it improved our user metrics. But is there any intuition or theory written down as to why they designed something a certain way? Not really.
I suppose one strong way to get reproducibility is to get hold of all the inputs that were used. In a psychology study this means getting the dataset. Correlations are fuzzy, but if I get the same answers out of the same dataset, then the claims at least hold for that particular dataset.
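Even a low-tech step helps here: publish a checksum of the exact dataset file the analysis ran on, so "the same dataset" is something a reader can verify rather than assume. A sketch (file name made up, using Node's built-in crypto):

    // Print a SHA-256 fingerprint of the dataset that produced the reported numbers.
    import { createHash } from "crypto";
    import { readFileSync } from "fs";

    const digest = createHash("sha256")
      .update(readFileSync("study-data.csv"))
      .digest("hex");
    console.log(`study-data.csv sha256: ${digest}`);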
Regarding design and qualitative studies: maybe film everything? The general themes that everybody would agree on after watching it all would be the reproducible part of it.
Ok, I'll stop. The whole idea that a paper needs to satisfy the criterion of reproducibility confuses me when I look at what science is nowadays.
by laderach on 4/5/18, 9:19 PM
There are a number of problems in scientific publishing. Two big ones are:
1) Distribution hurdles and paywalls imposed by rent-seeking journals - who knows how much this has held back innovation and scientific advancement in the last 20 years.
2) Easily replicating experiments and easily verifying the accuracy and significance of results - this is related to, for instance, making the data used in research more accessible and making it easier to spot p-value hacking.
Fixing these might not require a completely new format for papers. Or it could. I can envision solutions both ways.
I really like what the folks from Fermat's Library have been doing. They have been developing tools that are actually useful right now and push us in the right direction. I use their arXiv Chrome extension https://fermatslibrary.com/librarian all the time for extracting references and BibTeX. At the same time they are playing with entirely new concepts - they just posted a neat article on Medium about a new unit of academic publishing https://medium.com/@fermatslibrary/a-new-unit-of-academic-pu...
by omot on 4/6/18, 1:27 AM
They still think this.
by vanderZwan on 4/6/18, 2:24 AM
Linnarsson's group just pre-published a paper cataloguing all cell types in the mouse brain, classifying them based on gene expression[4][5]. The whole reason I was hired was as an "experiment" to see if there was a way to make the enormous amount of data behind it more accessible for quick exploration than raw data dumps. The viewer uses a lot of recent (as well as slightly-less-recent-but-underused) browser technologies.
Instead of downloading the full data set (which is typically around 28k genes by N cells, where N is in the tens to hundreds of thousands), only the general metadata plus the requested genes are downloaded, in the form of compressed JSON arrays containing raw numbers or strings. The viewer converts them to Typed Arrays (yes, even the string arrays) and then renders nearly everything on the fly client-side. This also makes it possible to interactively tweak view settings[6]. Because the viewer makes almost no assumptions about what the data represents, we recently re-used the scatterplot view to display individual cells in a tissue section[7].
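The per-gene fetch is roughly this shape (the endpoint path here is made up, not the viewer's actual API):

    // Fetch one gene's expression values as a (gzipped) JSON array of raw
    // numbers and keep them as a typed array rather than boxed JS numbers.
    async function fetchGene(dataset: string, gene: string): Promise<Float32Array> {
      const res = await fetch(`/api/${dataset}/genes/${gene}`);  // hypothetical route
      const values: number[] = await res.json();                 // browser un-gzips for us
      return Float32Array.from(values);
    }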
Furthermore, this data is stored offline through IndexedDB, so repeat viewings of the same dataset or of specific genes within it do not require re-downloading the (meta)data. This minimises data transfer even further, and makes the whole thing a lot snappier (not to mention cheaper to host, which may matter if you're a small research group). The only reason it isn't completely offline-first is that using service workers is giving me weird interactions with react-router. Being the lone developer, I have to prioritise other, more pressing bugs.
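The caching idea, stripped to a sketch (store and key names invented; the real code has more error handling and versioning):

    // Check IndexedDB first; only hit the network on a miss, then store the result.
    function openGeneCache(): Promise<IDBDatabase> {
      return new Promise((resolve, reject) => {
        const req = indexedDB.open("gene-cache", 1);
        req.onupgradeneeded = () => req.result.createObjectStore("genes");
        req.onsuccess = () => resolve(req.result);
        req.onerror = () => reject(req.error);
      });
    }

    async function cachedGene(dataset: string, gene: string): Promise<Float32Array> {
      const db = await openGeneCache();
      const key = `${dataset}/${gene}`;
      const hit = await new Promise<Float32Array | undefined>((resolve) => {
        const req = db.transaction("genes").objectStore("genes").get(key);
        req.onsuccess = () => resolve(req.result);
        req.onerror = () => resolve(undefined);
      });
      if (hit) return hit;                                        // cached: no re-download
      const res = await fetch(`/api/${dataset}/genes/${gene}`);   // hypothetical route
      const values = Float32Array.from(await res.json());
      db.transaction("genes", "readwrite").objectStore("genes").put(values, key);
      return values;
    }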
In the end, however, the viewer is merely a complement to the full catalogue, which is set up with a DokuWiki[8]. No flashy bells and whistles there, but it works. For example, one can look up specific marker genes. It just uses a plugin to create a sortable table, which is established, stable technology that pretty much comes with the DokuWiki[9][10]. The taxonomy tree is a simple static SVG above it. Since the expression data is known client-side to generate the table dynamically, we only need a tiny bit of JavaScript to turn that into an expression heatmap underneath the taxonomy tree. Simple and very effective, and it probably even works in IE8, if not further back. Meanwhile, I got myself into an incredibly complicated mess writing a scatterplotter with low-level sprite rendering and blitting and hand-crafted memoisation to minimise redraws[11].
Personally, I think there isn't enough praise for the pragmatic DokuWiki approach. My contract ends next week. I intend to keep contributing to the viewer, working out the (way too many) rough edges and small bugs that remain, but it won't be full-time. I hope someone will be able to maintain and develop this further. I think the DokuWiki has a better chance of still being online and working ten years from now.
[2] https://github.com/linnarsson-lab/loom-viewer
[3] http://loom.linnarssonlab.org/
[4] https://twitter.com/slinnarsson/status/981919808726892545
[5] https://www.biorxiv.org/content/early/2018/04/05/294918
[7] http://loom.linnarssonlab.org/dataset/cells/osmFISH/osmFISH_..., https://i.imgur.com/a7Mjyuu.png
[8] http://mousebrain.org/doku.php?id=start
[9] http://mousebrain.org/doku.php?id=genes:aw551984
[10] http://mousebrain.org/doku.php?id=genes:actb
[11] https://github.com/linnarsson-lab/loom-viewer/blob/master/cl...
by tensor on 4/5/18, 8:00 PM
But outside of computer science you need laboratories to replicate experiments. Scientific papers are perfectly fine vehicles for recording the information necessary to replicate experiments in that setting. Historically, appendices have been used for the extended details. And yes, replication is hard, but it's part of science.
by awll on 4/5/18, 7:22 PM
by sappapp on 4/5/18, 7:16 PM