Monday, May 27, 2013

The Project Gutenberg Project

Joseph and I have been going back and forth on the best way to get the most out of the tidal wave of open data of which we are only seeing the beginning.

Joseph tends to be more skeptical on this subject. He almost has to be. He's approaching this both as an epidemiologist (where privacy and ethical issues create huge headaches) and an academician (where open data can create tremendous perverse incentives to rush out mediocre work in order to beat out other researchers looking at the same data). The promise of open data is very much field specific.

I tend to be more optimistic about the subject. I'm the data miner of the blog, and finding myself living in an age when anyone with a refurbished desktop computer, copies of R and Python, and a decent internet connection can do real, interesting research is tremendously exciting.

There is at least one area, though, where I am possibly more skeptical than Joseph, and that's in the chances of these huge data initiatives self-organizing into anything near an optimal configuration...

I started to write something here about market forces in research and incentives and non-rival goods, but then the phone rang, and by the time I got off I realized that would be a lot of work (at least it would if I did enough research to make sure I wasn't making an idiot of myself). Chances are, that discussion would just be a long-winded way of saying that if we want to effectively coordinate all these researchers, so that information flows where it needs to flow, data is fully explored, and we can keep track of what's going on, we need to think this thing through.

Which brings me to the Project Gutenberg Project. Project Gutenberg has, of course, a huge and growing database. It's set up to be researcher-friendly and the system readily lends itself to automated approaches. The possibilities for text-mining are endless and a tremendous number of interesting research questions can be addressed with nothing more than a reasonably up-to-date computer and some free software (I previously posted a couple of examples here).
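To give a sense of how low the barrier to entry is, here's a minimal Python sketch of the simplest kind of text-mining pass, a word-frequency count. In practice the text would be a plain-text file downloaded from gutenberg.org; the short stand-in string below is purely illustrative:

```python
from collections import Counter
import re

def word_frequencies(text, n=5):
    """Return the n most common words in a text (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

# A Project Gutenberg text would normally go here; using a stand-in:
sample = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness")
print(word_frequencies(sample, 3))
# [('it', 4), ('was', 4), ('the', 4)]
```

From there it's a short step to comparing frequency profiles across authors or decades, which is exactly the sort of question the database readily supports.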

This would seem to be an ideal test case for setting up procedures and sites for dealing with large, open databases. Here are a few possibilities:

A place to submit and comment on proposed hypotheses;

A place to report preliminary findings;

A place to report negative findings;

A place to report confirmation of previous findings;

A database connecting approaches, hypotheses and data points;

Multiple ranking systems;

A way of identifying under-explored parts of the data.

Obviously this is a first pass and I'm just throwing out some ideas. Some might be impractical. Others, as Joseph would point out, will not be applicable to other data sets. And I have a nagging feeling that I've left something obvious out.

But that, of course, is the nature of a blog post.
