Sunday, March 9, 2014

Open Data

This is a pretty good argument for why there is resistance to completely open data:
When people don’t want to release their data, they don’t care about the data itself. They care about the papers that could result from these data. I don’t care if people have numbers that I collect. What I care about is the notion that these numbers are scientifically useful, and that I wish to get scientific credit for the usefulness of these numbers. Once the data are public, there is scant credit for that work.

It takes plenty of time and effort to generate data. In my case, lots of sweat, and occasionally some venom and blood, is required to generate data. I also spend several weeks per year away from my family, which any parent should relate with. Many of the students who work with me also have made tremendous personal investments into the work as well. Generating data in my lab often comes at great personal expense. Right now, if we publicly archived data that were used in the creation of a new paper, we would not get appropriate credit in a currency of value in the academic marketplace.
I think the key to this argument is that most of the effort in some fields lies in the collection of the data, but all of the credit is based on papers. So you would end up, rather quickly, with a form of tragedy of the commons where the people who create the data end up with little credit . . . meaning we would end up with less data.

Are there alternatives to this paradigm? Of course. The US census is an excellent example of an alternative model, where the data collection and cleaning is done by a government department on behalf of all sorts of researchers. Splitting data collection and data analysis in this way is certainly a viable model.

But pretending that resistance to open data is a simple case of people being reluctant to share their information is really an unfair portrayal. In my own career I have had lots of access to other people's data, and they have been extremely generous so long as I offer to give proper credit. So I don't think the open data movement is all wrong, but it does suggest that there is a difficult conversation to be had to make this work well.


  1. I'm sorry, but this particular argument makes no sense to me. When did you last read an academic paper that did not attribute the source of its data? In fact, there are lots of famous data sets in all sorts of disciplines that get cited all the time and whose creators get lots of credit. They're usually the sorts of data sets that would be hard to recreate (longitudinal studies, emergency rescue) or pointless to duplicate (corpora). They're also almost always funded by the government (i.e. taxpayers). I don't buy the blood and sweat argument in the least. That's just describing a job. A job with plenty of other rewards like flexibility, travel and prestige (if not great remuneration).

    What this seems to be about is preventing others from publishing analyses of the data before the original collectors. For that, the solution would be something like a limited-time escrow (1-5 years). There are lots of examples (in archaeology, for instance) where people sat on unreplicable data sets for decades and then never published them anyway.

    The other issue is one of privacy, which applies to a lot of social science data. Lots of interview transcripts or questionnaire responses that would be really great to have in public contain personally identifiable information that cannot be removed by pure anonymization. I'm not sure what the solution would be for this.

    So there are issues with open data. But lack of credit is not one of them. The paper actually makes a more nuanced point and I don't have any problems with its 4 recommendations:

    1. Facilitate more flexible embargoes on archived data.
    2. Encourage communication between data generators and re-users.
    3. Disclose data re-use ethics.
    4. Encourage increased recognition of publicly archived data.

    But the data should always be made available for peer review.

  2. This is the sort of policy I am thinking about:

    I should also point out that NIH cohort studies have public access datasets:

    And all of the issues appear to be with people who aren't happy waiting through the embargo period.