Wednesday, April 20, 2016

A missing data parable that links to secondary data use

This is Joseph.

Thomas Lumley points out a terrifying example of mean imputation gone wrong.

It turns out that the source of this problem was an algorithm from a private corporation that mapped IP addresses at the zip code level, coupled with the risks of secondary data use by people who don't understand what the data was collected to do:

Earlier this week, I reached Thomas Mather, a co-founder of MaxMind, via email. I told him Joyce Taylor’s story, and how I’d discovered MaxMind’s involvement in the IP mapping part of it. I asked him if he knew anything about the default coordinates that were placing unidentified IP addresses on the Taylor’s property.

Mather told me that “the default location in Kansas was chosen over ten years ago when the company was started.”
 
He continued: “At that time, we picked a latitude and longitude that was in the center of the country, and it didn’t occur to us that people would use the database to attempt to locate people down to a household level. We have always advertised the database as determining the location down to a city or zip code level. To my knowledge, we have never claimed that our database could be used to locate a household.”

This is a common problem with big data projects that repurpose a data source for something other than what it was intended for.  And this re-use had some pretty severe consequences: the people living at the "missing data address" were visited by a lot of officials, all relying on the location that the database reported for an IP address.

The new plan, to put the defaults in the middle of a lake, should result in less pain for the residents of the unfortunately located property.  I am curious if we will soon see divers at that lake looking for fraudsters, though. 

But the secondary lesson of this story (above and beyond mean imputation rarely being an optimal missing data strategy) is just how risky it can be to use data for an unintended purpose without understanding how that data arises and what its limitations actually are.
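
To make the mean imputation point concrete, here is a minimal Python sketch. The coordinates are made up and names such as `lookups` and `flagged` are hypothetical; this is not MaxMind's actual algorithm, just an illustration of what happens when every unlocated record is given the "center" of the located ones.

```python
from collections import Counter
from statistics import mean

# Hypothetical (latitude, longitude) lookups; None marks an IP address
# that could not be located. All values are invented for illustration.
lookups = [(40.7, -74.0), None, (34.1, -118.2), None, None, (41.9, -87.6), None]

known = [p for p in lookups if p is not None]

# Mean imputation: every unknown record gets the center of the known points.
center = (
    round(mean(lat for lat, _ in known), 1),
    round(mean(lon for _, lon in known), 1),
)
imputed = [p if p is not None else center for p in lookups]

# The imputed point is now the single most common "location" in the data,
# even though nobody was ever observed there.
most_common_point, count = Counter(imputed).most_common(1)[0]

# One alternative (an assumption, not the vendor's fix): keep the missing
# value and carry an explicit flag, so a secondary user can tell a real
# location from a placeholder.
flagged = [{"coords": p, "located": p is not None} for p in lookups]
unlocated = sum(1 for row in flagged if not row["located"])

print(most_common_point, count, unlocated)
```

In this toy data the imputed center ends up with more records than any real location, which is the farmhouse problem in miniature. Keeping the flag costs one extra field and leaves the decision about what to do with unlocated records to whoever uses the data next.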
