Wednesday, September 1, 2010

Rubin on Educational Testing

From a Daily Kos post about the use of Value-Added Assessment methodologies:

In 2004, Donald Rubin opined

We do not think that their analyses are estimating causal quantities, except under extreme and unrealistic assumptions.


Now I am not familiar with the actual research, but I am likely to take Donald Rubin seriously. Not only is he one of the founders of causal inference, multiple imputation, and propensity scores, but he has a long history of tackling extremely difficult epidemiological problems. For a humbling experience (for those of us in biomedicine) his CV is here.

I dislike appeals to authority, in general, but claims that researchers skeptical about the value of these testing methods are misinformed seem to be poorly grounded. I don’t want to say Rubin is right about everything but I do think we should take his concerns seriously.

[as a side note, he was also the PhD supervisor of Andrew Gelman, whose blog is worth following]

Econned

It is the start of the school year and time to read something non-epidemiological or statistical. So, being me, I decided to read Yves Smith's new book Econned. I'll let you know what I think but reading the introduction this morning suggests that the book is off to a strong start. The best quote so far:

Theories that fly in the face of reality often need to excise inconvenient phenomena, and mainstream economics is no exception.


This quote reminds me of Karl Popper's thinking: one often learns more from what does not fit a theory than from what does (i.e. falsification). This principle is hard to follow in very complex fields (like economics and epidemiology) where you are guaranteed to have at least some mismatches and disconfirming evidence for everything. But it is good to cultivate a sense of humility about our models!

Tuesday, August 31, 2010

Measurement part 2

From Matt Yglesias:

On a recent evaluation, her principal, Oliver Ramirez, checked off all the appropriate boxes, Tan said — then noted that she had been late to pick up her students from recess three times.

“I threw it away because I got upset,” Tan said. “Why don’t you focus on my teaching?! Why don’t you focus on where my students are?”


Matt argues that proponents of teacher effectiveness are misunderstanding their critics:

The idea has gotten out there that proponents of measuring and rewarding high-quality teaching are somehow engaged in “teacher-bashing.” I think that’s one part bad faith on the part of our antagonists, one part misunderstanding on the part of people who don’t follow the issue closely, and at least one part our own fault for focusing too much on the negative.


But I think his own example shows why skepticism persists. It's easy to measure the wrong things, incentivize the wrong behavior and do a fair amount of damage to a system. What I would really like to see is an argument for incremental change and experimentation rather than radical reform driven by standardized tests. Or, if we must use some sort of standardized test approach, I’d like to have some better evidence that these tests are designed to measure teacher effectiveness and do not omit important elements. For example, I think clear and interesting writing is hard to do (as readers of this blog may notice when they try to follow my words) and it is very hard to score objectively. Multiple choice questions on word definitions are much easier to score but may not measure the most important skills we want to teach.

Certainly something to ponder.

Are you measuring the right thing?

Statistics is a fantastic tool and capable of creating enormous advances. However, it remains the slave of the data that you have. The worst case scenario is when the "objective metric" is actually measuring the wrong thing. Consider customer service in Modern America:

Modern businesses do best at improving their performance when they can use scalable technologies that increase efficiency and drive down cost. But customer service isn’t scalable in the same way; it tends to require lots of time and one-on-one attention. Even when businesses try to improve service, they often fail. They carefully monitor call centers to see how long calls last, how long workers are sitting at their desks, and so on. But none of this has much to do with actually helping customers, so companies end up thinking that their efforts are adding up to a much better job than they really do. In a recent survey of more than three hundred big companies a few years ago, eighty per cent described themselves as delivering “superior” service, but consumers put that figure at just eight per cent.


Here, the core issue seems to be that measuring efficiency at delivering customer service is not the same thing as having good outcomes. Having done statistics for a call center, I can assure you that they are obsessive about everything that can be measured. But satisfaction is a hard thing to measure and it most assuredly matters.

This analogy is why I am concerned with the use of standardized tests for measuring educational achievement. It is possible that they are capturing only part of a complex process and that the result of focusing on them could be fairly poor. After all, companies have tried to deliver exceptional customer service via call center for a couple of decades now and the results do not appear to be a uniform consensus that customer service is a delightful experience.

It is not that these processes can't be evaluated. It's just that the success of education or customer service may depend on things that are hard to measure. If we only measure those features that are easy to measure we may end up wondering why education is in decline despite a steady improvement in the key metrics we use to evaluate it.

Joel Klein's Record

From Mark Gimein (via Felix Salmon), here's a well-timed story from New York Magazine:
New York City public-school kids may be dreading the end of summer, but schools chancellor Joel Klein is the one who’ll really be tested when classes begin again. Last spring, Klein was bragging about the extraordinary upswing in scores during his tenure: a 31-point rise in the percentage of students who passed state reading tests, a 41-point increase in math. That was before state authorities admitted that they’d been progressively more lenient in scoring the tests, and decided to grade more strictly.

The new stringency resulted in the elimination of most of the miraculous gains of the Bloomberg years, and an administration that had lived by the numbers is getting clobbered by them. Klein told parents that the state “now holds students to a considerably higher bar.” This would make sense only if the state hadn’t previously been lowering that bar.

Last year, NYU professor and Klein antagonist Diane Ravitch said exactly that in a Times op-ed, an assertion that Klein claimed was “without evidence.” But the fact that New York students’ scores on the National Assessment of Educational Progress had moved only marginally, even as state scores skyrocketed, was manifest then and is inescapable now.
As discussed here, we've seen Klein omitting relevant statistics before.

Monday, August 30, 2010

Sentences to ponder

We always talk about a model being "useful" but the concept is hard to quantify.


-- Andrew Gelman

This really does match my experience. We talk about the idea that "all models are wrong but some models are useful" all of the time in Epidemiology. But it's rather tricky to actually define this quantity of "useful" in a rigorous way.

Is Ray Fisman one of the best and the brightest?

Based on some of the feedback to my past few posts on Ray Fisman's recent Slate article, there's a point I should probably make explicit: given all available evidence including reliable first-hand accounts, Ray Fisman is an accomplished researcher and a good guy. Furthermore, I am working under the assumption that, like most people in the reform movement, Dr. Fisman is motivated by a deep concern about the state of education and a genuine desire to improve it. (I have also found this a safe assumption when dealing with the vast majority of the teachers Dr. Fisman would fire.)

I have singled out Dr. Fisman not because "Clean Out Your Desk" was exceptionally bad but because it was exceptionally representative. If this were an anomaly written by someone who was stupid or incompetent or had a grudge against teachers, it wouldn't be worth anyone's time to discuss, let alone exhaustively rebut it. This is something more disturbing.

David Warsh has drawn a relevant analogy between the reform movement and the run-up to Vietnam:
Remember the recipe for a policy disaster? Start with a handful of policy intellectuals confronting a stubborn problem, in love with a Big Idea. Fold in a bunch of ambitious Ivy League kids who don’t speak the local language. Churn up enthusiasm for the program in the gullible national press – and get ready for a decade of really bad news. Take a look at David Halberstam’s Vietnam classic The Best and the Brightest, if you need to refresh your memory. Or just think back on the run-up to the war in Iraq.
The education reform movement is filled with smart, well-intentioned people like Ray Fisman. Under the circumstances, that doesn't provide much comfort.

Zero Tolerance

A timely post from Matt Yglesias:

The commonplace scenario in the United States when people decide to “get tough” and implement a policy of “zero tolerance” for infractions of the rules is to in practice tolerate the majority of infractions by not catching perpetrators and then hit a minority of violators with extremely harsh sanctions. For years now, Mark Kleiman has been pushing the reverse approach—make sanctions relative mild, but make them swift and nearly certain.


The results were compelling:

Now the results are in: drunk–driving fatalities fell from twice the national average, 70, in 2006 to just 34 in 2008, the most recent year for which data are available


It is a key element of public health policy to try to find ways to handle behaviors that involve both a health issue (like addiction to alcohol) and a negative externality (like hitting people with cars). It is really interesting to see research being done on which approaches are actually the most effective. This type of research is important and has some pretty interesting ramifications for improving public health in a wide range of circumstances.

Hazards of sweeping generalizations

Commenter Nat makes a good point about sensitivity versus specificity:

The other reason for having 100% sensitive tests at the cost of specificity is because of the clinical tradeoffs that occur because you have done that test.

If, for example, the treatment subsequent to a test is vitamin supplementation which should have next to zero complications then 100% sensitivity is the face of nasty complications caused by non-treatment makes quite a lot of sense.


In some of the areas where I work, like pain, we are denied these elegant trade-offs. However, I also do work in coagulation and there are good examples of this type of trade-off there. For example, despite the limited evidence of clinical utility, it can make sense for people with Homocysteine and MTHFR mutations to take B vitamins. Similarly, a few false positives have very limited impact on the patients involved, as the risk of taking a B vitamin supplement (in the first world where economic hardship is unlikely) is small.

So this is a good reminder that there are no sweeping generalizations in epidemiology.

Saturday, August 28, 2010

Megan's List

Blogger Megan McArdle has a list of things where her best guesses turned out to be incorrect in the face of evidence. We all have these cases and it is good to look back and see where foresight failed us. There is nothing like data to help us revise our internal prediction algorithms.

This point, though, was chilling:

I believed that over reasonably long time-frames, modest investments in equities would allow you to retire in comfort.


I am mildly interested (as a hobbyist) in personal investment. But I think the implications of this are far broader than one sentence really captures. It makes the whole idea of shifting Social Security (as an insurance program against poverty in old age) into personal accounts somehow less appealing. It also says very interesting things about the rate of retirement for those of us who started our careers late (due to mid-career changes). The more I think about this issue, the more it has very profound implications for the way our work force will evolve.

But, sadly, I also think that this point is the best reading of the current stagnation of equities. Even if the long run is better, it is hard to have the level of confidence that I once did.

Friday, August 27, 2010

Trade-offs between Type I and Type II error

I was reading this blog post by Andrew Gelman on a test that is 100% accurate for Alzheimer's disease. Following up on the initial post, it appears that the test has only 64% specificity.
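To see why specificity matters even when sensitivity is perfect, here is a minimal sketch of the standard positive predictive value calculation. The 100% sensitivity and 64% specificity are the figures mentioned above; the prevalence values are assumptions picked purely for illustration and are not taken from the study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive test results that are true cases (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Prevalence values below are hypothetical, chosen only to show the pattern.
for prevalence in (0.01, 0.10, 0.50):
    ppv = positive_predictive_value(1.0, 0.64, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}")
```

The rarer the condition in the tested population, the more the false positives from imperfect specificity swamp the true positives.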

But the feature of this discussion that I find the most interesting is that the choice between sensitivity and specificity is a judgment that people seem to be very poor at. Consider pain medication. If you want to make sure that everyone in serious pain gets appropriate pain control, then some fraudsters will get illicit narcotics. Alternatively, if you make the screen so tight that nobody is able to obtain narcotics via "fake pain", then some real cases will be undertreated.

We see the same thing with releasing people from prison. Even if former prisoners only committed crimes at the rate of the general population, at least some crimes could be prevented by a tougher release policy. Of course, this line of reasoning leads to absurd conclusions -- we could completely eliminate adult crime by jailing everyone for life on their eighteenth birthday. We see the same thing in the Ray Fisman argument for getting rid of 80% of teachers during probation – it is so important not to make a mistake and keep an inferior teacher that we should fail to hire many good teachers just to make sure we have no sub-standard ones.

But people don't seem to like to make these trade-offs. In the case of the test for Alzheimer's disease, the authors could have been a lot more specific if they were willing to give up sensitivity. But, for some reason, people seem to prefer to end up at one extreme of a scale rather than the middle (where the value of the test is maximized).
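The trade-off itself is easy to sketch with simulated data. The example below is only an illustration under assumptions of my own: the test produces a continuous score, cases and non-cases are drawn from two overlapping normal distributions (means 1 and 0, standard deviation 1), and a result counts as positive when the score is at or above a cutoff. None of these numbers come from the Alzheimer's study.

```python
import random


def sensitivity_specificity(cases, controls, threshold):
    """Treat a score as 'positive' when it is at or above the threshold."""
    sens = sum(s >= threshold for s in cases) / len(cases)
    spec = sum(s < threshold for s in controls) / len(controls)
    return sens, spec


# Hypothetical score distributions; the seed only makes the run repeatable.
rng = random.Random(0)
cases = [rng.gauss(1.0, 1.0) for _ in range(2000)]
controls = [rng.gauss(0.0, 1.0) for _ in range(2000)]

for threshold in (-3.0, -1.0, 0.0, 0.5, 1.0, 2.0, 4.0):
    sens, spec = sensitivity_specificity(cases, controls, threshold)
    print(f"cutoff {threshold:+.1f}: sensitivity {sens:.2f}, "
          f"specificity {spec:.2f}, sum {sens + spec:.2f}")
```

A very low cutoff catches every case but flags nearly everyone, and a very high cutoff does the reverse. The sum of the two (one rough way to score a cutoff, though not the only one) tends to peak somewhere in between.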

It's a phenomenon that I wish I understood better.

Thursday, August 26, 2010

Badly needed break from Fisman

With two brilliant clips from the Daily Show


The Daily Show With Jon Stewart: Extremist Makeover - Homeland Edition


The Daily Show With Jon Stewart: The Parent Company Trap

US Mobility

In a nice article, the mobility myth, it is pointed out that only 2.7% of Americans change states each year. My experience is different from that, but that is likely because I am in Academia. Still, I am definitely aware of the risks that changing states brings (as you can never know in advance if a state will work out or not).

On the research side, however, this fact is good news for database research as the "lost to migration" rate is low enough to make it unlikely that we will get serious bias in state level medical claims data. That is really useful to know when evaluating Medicaid studies.

Genetic Epidemiology

John Cook has a post on predicting height using genes. He quotes:

A 2009 study came up with a technique for predicting the height of a person based on looking at the 54 genes found to be correlated with height in 5,748 people — and discovered the results were one-tenth as accurate as the 125–year-old technique of averaging the heights of both parents and adjusting for sex.


I suspect that this issue is the central one facing genetic epidemiology. While it is possible that the approach of averaging the height of the parents includes some environmental information, it is a pretty strong comment on the predictive power of genes if that is the actual answer.

More likely, I think, is the idea that complex and important characteristics are due to many, many genes (all of which have a modest influence). This makes sense from a selection point of view (characteristics like height need to be stable) but makes the project of prediction using genes extraordinarily complicated. I don't know if there is a simple answer or not but it definitely provides some challenges for the paradigm of the classic epidemiological study.

Wednesday, August 25, 2010

Mystery (Education Question)

One thing that amazes me in the education debate is that people of all political stripes seem to agree that education is in a crisis. Consider Jonathan Chait (who I think is clearly a liberal), who seems to agree that teacher firings make sense. Yet, as Mark notes, America leads in elementary education.

So why are the two so often conflated?

It could be the "big lie" where a falsehood is said so often that the other side starts to believe it. But people are usually more sophisticated than that.

Another possibility is that we have lost perspective on the alternatives. We worry about reluctance to fire teachers but forget that private alternatives are not inexpensive. From Marginal Revolution:

A New York City charter school set to open in 2009 in Washington Heights will test one of the most fundamental questions in education: Whether significantly higher pay for teachers is the key to improving schools.

The school, which will run from fifth to eighth grades, is promising to pay teachers $125,000, plus a potential bonus based on schoolwide performance. That is nearly twice as much as the average New York City public school teacher earns, roughly two and a half times the national average teacher salary and higher than the base salary of all but the most senior teachers in the most generous districts nationwide.


However, this still doesn't explain the odd consensus of left and right as it seems improbable that many people are fighting for a serious increase in education costs.

Most likely, I suspect, the cause is the current American focus on short term results. When we do annual ratings of employees, we do not consider issues of long term dedication -- we know people are simply going to move on anyway. Consider this report:

Among jobs started by workers when they were ages 38 to 42, 31 percent ended in less than a year, and 65 percent ended in fewer than 5 years.


Is it possible that, with 65% of middle aged workers holding a job for less than 5 years, we have simply lost the sense of how to build long term loyalty and dedication?