Researchers — Share Your Data!

One of the most popular shows in the early years of television, hosted by Art Linkletter, included a segment called “Kids say the darndest things.” Linkletter would have conversations with young children who could be counted on to say things that adults found entertaining. I’ve experienced my own version of this in recent years that could be described as “Researchers say the darndest things.” My conversations with the authors of data visualization research studies have often featured shocking statements that would be amusing if they weren’t so potentially harmful.

The most recent example occurred in email correspondence with the lead author of a study titled “Evaluating the Impact of Binning 2D Scalar Fields.” I’m currently working on a newsletter article about binned versus continuous color scales in data visualization, so this paper interested me. After reading the paper, however, I had a few questions, so I contacted the author. One of my requests was, “I would like to see the full data set that you collected during the experiment.” Here’s the response that I received from the paper’s author: “In psychology, we do not share data sets but the full analyses are available in the supplementary materials.” You can imagine my shock and dismay. Researchers say the darndest things!

Withholding the data that was collected in a research study—the data on which the published findings and claims were based—subverts the essential nature and goals of science. Published research studies should be accompanied by the data sets on which their findings were based—always. The data should be made readily available to anyone who is interested, just as “supplemental materials” are often made available.

Only good can result from sharing our research data. If we share our data, our results can be confirmed. If we share our data, errors in our work can be identified and corrected. If we share our data, science can progress.

Empirical research is based on data. We make observations, usually in the form of measurements, which serve as the data sets on which our findings are based. Only by reviewing our data can the validity of empirical research be confirmed or denied by the research community. Only by sharing our data can questions about our findings be pursued by those who are interested. Refusing to share our data is the antithesis of science.

The author’s claim that “In psychology, we do not share data sets” is false. Psychology researchers do not have a “Do not share your data” policy. I’m astounded that the author thought that I’d buy this absurd claim. What is true, however, is that, even though there is no policy against sharing research data, it usually isn’t shared. On many occasions this is not a deliberate act of concealment, but mere laziness. The data files that researchers use are often messy, and they don’t want the bother of structuring and labeling those files in a manner that would make them useful if shared. On more than one occasion I have requested data files only to be told that it would take too much time to put them into a form that could be shared. This response always makes me wonder if the messiness of those files might have caused the researchers themselves to make errors during their analysis of the data. When I told a respected psychology researcher friend of mine about the “In psychology, we don’t share our data” response that I received from the study’s author, he told me, “In my experience, extreme protectiveness about data tends to correlate with work that is not stellar in quality.” I suspect that this is true.

If you can’t make your research data available, either on some public medium (e.g., accessible as a download from a web page) or upon request, you’d better have a really good excuse. You could try the old standby “My dog ate it,” but it probably won’t work any better than it did when you were in elementary school. If your excuse is, “After doing my analysis and writing my paper, I somehow misplaced the data,” the powers that be (e.g., your university or the publication that made your study public) should respond by saying, “Do it over.”

If I could set the standards for research, I would require that the data be examined during the peer review process. It isn’t necessary that every reviewer examine the data, but at least one who is qualified to detect errors should. Among other potential problems, calculations performed on the data should be checked, and it should be determined whether statistics have been properly used. Checking the data should be fundamental to the peer review process. If this were done, some of the shoddy research that wastes our time each year with false claims would remain unpublished. I realize that this would complicate the process. Well, guess what: good research takes time and effort. Doing it well is hard work.

If you want to keep your data private, then do the world a favor and keep your research private as well. It isn’t valid research unless your findings are subject to review, and your findings cannot be fully reviewed without the data.

Take care,

9 Comments on “Researchers — Share Your Data!”


By Steven. November 13th, 2017 at 1:37 pm

You’re so right. Given the fact that so many published studies cannot be replicated, not sharing the data leaves me very uncomfortable.

By Dale Lehman. November 13th, 2017 at 2:05 pm

Steve – you’ve hit on my biggest (and growing) complaint about research, particularly academic research. I’ve had the opportunity to try to replicate someone’s results a few times – only on one occasion did it replicate (and this refers to just being able to reproduce their results). Most of the time, I can’t even come close because they won’t release the data. I’ve been told it’s publicly available – which sometimes it is – but then they manipulate it in unknown and unreported ways, so I can’t get the same results they did. More frequently it is withheld for proprietary reasons – makes me think about documents stamped with “protected by national security.” Once it was 7-year-old cellphone usage data about undergraduates at some college – proprietary valuable data indeed!

Anyway, I have one addendum to your post (really an elaboration). In my field, economics, much research is conducted on behalf of commercial entities. Often it is published in academic journals, but it is also used to lobby and persuade public policymakers. In such cases, when data is not made publicly available, I’d like to see the decision makers (regulators, legislators) state that the research will be given its proper attention (meaning NONE). Since it is so easy to make mistakes, unintentional or otherwise, there is no excuse for allowing research to be considered when it is not possible to replicate the results. As a peer reviewer, I have always told editors that it is impossible to determine whether a finding is valid without having the data. It forces us to use credentials (what university did the authors get their degrees from, where do they work, how many articles have they published, etc.) as screens for their validity – a sure way to subvert the scientific process.

The other recommendation I’d make is that we need to start rewarding the collection and dissemination of data – more than the analysis of it. The former is expensive and valuable if people find the data of interest. Analysis is cheap (not “good” analysis, but producing publishable work), but that is where the rewards lie. As long as we reward people for publishing articles and not collecting good data, careers will advance by withholding that data.

By Andrew. November 13th, 2017 at 3:21 pm

“In psychology, we do not share data sets…”

The part that still baffles me: Why not in psychology?

If it were just “we do not share data sets,” well, then they’re just bad at research and you’ve already covered that.

But to start with “In psychology” seems to imply that the study’s author might recognize the practice, and yet somehow the field of psychology is exempt?

By Charles Perin. November 16th, 2017 at 11:23 am

Hi Stephen,
On the one hand, you are 100% right in asking researchers to share their experimental data. I also agree that it should be mandatory, and analyses should be replicable.
On the other hand, I do not think that everything is as dark as you write it. I have witnessed (and participated in) a dramatic increase in making experimental data available, both during the review process as additional material and upon acceptance on dedicated websites, in the visualization community (and the psychology community started doing that before us).
So yes, things could be better. But also yes, things are getting better. It is important to note that although the case you encountered is not an isolated one, researchers who think this way are certainly not in the majority.

Cheers,

Charles

By Stephen Few. November 16th, 2017 at 11:43 am

Hi Charles,

I’m aware of the fact that some responsible information visualization researchers are willing to share their data and that there is also an effort by some to promote this practice. My limited experience, however, suggests that things are not as rosy as you believe. It appears that relatively few researchers automatically make their data available upon publication (e.g., as part of the supplemental materials). Most of the researchers who are willing to share their data only do so upon request, and often only grudgingly. I haven’t actually surveyed all of the information visualization research papers that have been published in recent years to determine the percentage of them that made their data available, however, so it’s possible that your greater optimism is valid. I’m very interested in hearing about the actual experiences that are driving your optimism.

I suspect that this will only improve to a significant degree if the organizations that accept and publish papers make the provision of data mandatory. As far as I know, none of these organizations do this currently. Are you aware of viable efforts to make this happen?

By Dale Lehman. November 16th, 2017 at 4:05 pm

The recent controversy at the New England Journal of Medicine (NEJM) is worth examining. While it concerns the sharing of clinical trial data, it bears many of the same traits as data sharing more generally. The NEJM published a series of editorials, eventually referring to their need to protect researchers against “research parasites” (their term, not mine). This attracted a lot of pushback. Under pressure, they sponsored the SPRINT challenge (which I competed in) to provide an experiment with the sharing of such data. They had a conference last May to announce the winners (I did not win) and discuss what was learned from the challenge. The panels of patients from the trials were shocked to hear that there was any issue about sharing. The clinical researchers were much more reticent. The panel of funders (e.g., the Gates Foundation, the Wellcome Trust, etc.) expressed their desire to have clinical trial data shared as widely as possible; these organizations also published a letter in this week’s NEJM very clearly stating that they expect data to be shared and that the SPRINT challenge was inadequate.

There is much to read about that episode (and it continues). It provides some hope, but is also cautionary about just how hard this change will be. The pressures to resist widely sharing data will not disappear easily. Being the skeptic I am, I’d judge the glass to be about 25% full at this point.

By Stephen Few. November 17th, 2017 at 4:04 pm

Today, I heard from one of the other authors of the research paper “Evaluating the Impact of Binning 2D Scalar Fields.” She wrote to let me know that the data is now available for the study. Given the fact that the author who previously said that “In psychology, we do not share data sets” is her student, I asked for her opinion of this response. She kindly responded with the following:

“I am very aware of the importance of making data accessible to others and of course in psychology we do this. Many of our journals actually request the data now when publishing, and even if not, I believe that sharing data is an important step in addressing issues of replicability and I very much support the efforts such as the open science framework. Lace knows this as well, so I’m sorry if she said otherwise originally.”

I appreciate her thoughtful and candid response. It gives us a reason to hope that this problem is being addressed to some degree, at least by some.

By Xan Gregg. November 17th, 2017 at 4:33 pm

Hi Dale, I was at your great presentation on this subject and the SPRINT challenge at the JMP conference last month, and it may be worth clarifying the “shock” of the patients in your previous comment. I believe you said in your talk that the patients felt that they had taken on the risk of participating in the trial and expected that the results would be used for the maximum benefit of science as a whole.

Separately, I thought, “there should be a journal for publishing data,” and there is! It’s called “Data” and is published by MDPI, Multidisciplinary Digital Publishing Institute. http://www.mdpi.com/journal/data

By Stephen Few. December 1st, 2017 at 10:20 am

Here’s an update regarding the willingness of the researchers responsible for the “Evaluating the Impact of Binning 2D Scalar Fields” study to share their data. When I examined the data file that they eventually provided, I found that it was useless: the data fields had cryptic labels that are entirely meaningless to anyone but them. Data is useless if you don’t know what it represents.

On November 20th, I sent an email to the researchers requesting a description of the data fields. As of today, 10 days later, they still have not bothered to reply to my request, not even with an excuse. The hope that I allowed myself to feel when they provided a data file has now been dashed. This is bad science. The peer review process should reject all studies whose authors refuse to share their data.
