[Update: I've got some more thoughts on Gutenberg-based research in my latest post.]
I'm planning on writing some posts on the potential of open data and the concerns about it (possibly even getting Joseph to join in), so I thought I'd dust off a somewhat relevant idea I had a few years back. If anyone wants to see if they can get something publishable out of this, feel free. In the meantime, I plan on getting some mileage out of it as an example.
A few years ago, I wrote some code for text mining. It was
really basic, standard stuff -- using naive Bayesian classifiers and
n-grams (normally techniques for assigning authorship) -- but it worked well and was fun to
play around with. I used various books from Project Gutenberg as test
data and selected authors with styles and backgrounds ranging from
close (Dickens and Trollope) to out there (Thorstein Veblen) with a
translation of Verne as someone neutral. The two Victorians also had
the advantage of having written lots of books over many years.
The idea was to approach this less as a classification problem and more
as a question of distance between points in a literary space. Here the "likelihood score"
was more a measure of similarity. As you would expect, Great
Expectations was more similar to Nicholas Nickleby than to Barchester
Towers, more similar to Barchester Towers than to a translated Master
of the World and more similar to Master of the World than to Theory of
the Leisure Class. It also worked as expected when you compared works
of the same author written at different points in his career: Great Expectations (1860 to 1861) was more similar
to Our Mutual Friend (1864 to 1865) than to Nicholas Nickleby (1838 to
1839).
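A minimal sketch of this kind of similarity score might look like the following. It is an illustration under assumptions, not the original code: the use of character trigrams, add-one smoothing, and the function names (char_ngrams, train, similarity) are all choices made for the example. The "model" for a book is just its n-gram counts, and the score for a second book is its average log-probability per n-gram under that model.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def train(text, n=3):
    """Count the n-grams in a reference text; the counts serve as that book's model."""
    return Counter(char_ngrams(text, n))


def similarity(model, text, vocab_size, n=3, alpha=1.0):
    """Average log-probability per n-gram of `text` under `model`.

    Add-alpha smoothing (an assumption of this sketch) keeps unseen n-grams
    from producing log(0). Scores are negative; closer to zero means more similar.
    """
    grams = char_ngrams(text, n)
    if not grams:
        raise ValueError("text is shorter than n characters")
    denom = sum(model.values()) + alpha * vocab_size
    log_likelihood = sum(math.log((model[g] + alpha) / denom) for g in grams)
    return log_likelihood / len(grams)


if __name__ == "__main__":
    # Toy strings stand in for full Project Gutenberg texts.
    reference = "the quick brown fox jumps over the lazy dog " * 5
    candidates = {
        "near": "the lazy dog jumps over the quick brown fox",
        "far": "zzzz qqqq xxxx vvvv kkkk jjjj",
    }
    all_grams = set(char_ngrams(reference))
    for sample in candidates.values():
        all_grams.update(char_ngrams(sample))
    model = train(reference)
    for name, sample in candidates.items():
        print(name, round(similarity(model, sample, len(all_grams)), 3))
```

Dividing by the number of n-grams keeps books of different lengths comparable. Note that the score is not symmetric (scoring book A against a model of book B is not the same as the reverse), so it is a similarity measure rather than a true distance.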
Obviously this was a tiny trial run, but it did suggest that there's something out there, as did a recent literature search, which turned up at least one
related paper from 2011 ("Predicting the Date of Authorship of
Historical Texts" by A. Tausz) that used naive Bayes classifiers (NBCs) to determine absolute rather than relative dates. Even with Tausz's paper (which is very interesting, by the way), there should still be room for research into intra-author questions and, more importantly, into lots of other questions using data from Project Gutenberg.
And on top of that you can apparently find interesting stuff to read at the site as well.