This is an excellent article on machine learning. In particular I liked this:
As an extreme illustration, an algorithm designed to predict a rare condition found in only 1% of the population can be extremely accurate by labeling all individuals as not having the condition. This tool is 99% accurate, but completely useless. Yet, it may “outperform” other algorithms if accuracy is considered in isolation.

The point is discussed even better by Frances Wooley and Thomas Lumley, using the example of Facebook's classification of sexual orientation. This isn't to say that machine learning isn't useful, but that the proper penalty functions or sampling approaches need to be developed (Thomas has a great discussion of this). A simple measure of accuracy is going to fail in all sorts of cases where simple but useless rules do extremely well (most people do not have pancreatic cancer, so I can be exceedingly accurate by guessing that any one person does not have it). It isn't that the problem is intractable, but that it isn't simply a case of running a technique on whatever data happens to be lying around. Like most worthwhile data science problems, doing the work well is what is hard.
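A minimal sketch of that "99% accurate but useless" point, using made-up numbers (1% prevalence, a rule that labels everyone negative) and scikit-learn's standard metrics: accuracy looks excellent, while sensitivity shows the rule finds no cases at all and balanced accuracy is no better than chance.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, balanced_accuracy_score

rng = np.random.default_rng(0)
n = 100_000
y_true = rng.binomial(1, 0.01, size=n)   # 1 = has the rare condition (~1% prevalence)
y_pred = np.zeros(n, dtype=int)          # the "useless" rule: label everyone negative

print(f"accuracy:          {accuracy_score(y_true, y_pred):.3f}")           # ~0.99
print(f"sensitivity:       {recall_score(y_true, y_pred):.3f}")             # 0.00 - detects nobody
print(f"balanced accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")  # 0.50 - chance level
```

This is why class-sensitive penalties or rebalanced sampling matter: any metric that weights the 99% majority class this heavily will happily reward a rule that never identifies a single case.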
The article also highlights the limitations of the data generating process. Does not having contact with the health system mean one is healthy? In the CPRD, it seems that the people without recorded blood pressures (which are routinely collected during visits to a physician) were both the healthiest and the least healthy. Lack of contact with the medical system is complicated, but these participants often form the reference group.
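A toy simulation of that point, with an entirely made-up mechanism: if both the healthiest people (who rarely see a doctor) and some of the sickest (who have fallen out of routine care) lack a recorded blood pressure, then the "no measurement" group is a mixture of the two extremes rather than a clean healthy reference group.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
health = rng.normal(size=n)              # latent health: higher = healthier

# Assumed (not from the article) recording mechanism: blood pressure is most
# likely to be recorded for people in the middle of the health distribution.
p_recorded = np.where(health > 1.0, 0.2,
              np.where(health < -1.5, 0.3, 0.8))
recorded = rng.random(n) < p_recorded

# Outcome risk rises as latent health declines (again, an arbitrary toy model).
risk = 1 / (1 + np.exp(2 * health))
event = rng.random(n) < risk

print(f"event rate, BP recorded:    {event[recorded].mean():.3f}")
print(f"event rate, no BP recorded: {event[~recorded].mean():.3f}")
print(f"health spread (SD), no BP:  {health[~recorded].std():.2f} "
      f"vs recorded: {health[recorded].std():.2f}")
```

Under these assumptions the unmeasured group has a wider spread of underlying health than the measured group, so treating "no contact" as a benchmark for "healthy" builds the missingness mechanism straight into the comparison.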
Anyway, go, read, and enjoy.