New York City public-school kids may be dreading the end of summer, but schools chancellor Joel Klein is the one who’ll really be tested when classes begin again. Last spring, Klein was bragging about the extraordinary upswing in scores during his tenure: a 31-point rise in the percentage of students who passed state reading tests, a 41-point increase in math. That was before state authorities admitted that they’d been progressively more lenient in scoring the tests, and decided to grade more strictly. As discussed here, we've seen Klein omitting relevant statistics before.
The new stringency resulted in the elimination of most of the miraculous gains of the Bloomberg years, and an administration that had lived by the numbers is getting clobbered by them. Klein told parents that the state “now holds students to a considerably higher bar.” This would make sense only if the state hadn’t previously been lowering that bar.
Last year, NYU professor and Klein antagonist Diane Ravitch said exactly that in a Times op-ed, an assertion that Klein claimed was “without evidence.” But the fact that New York students’ scores on the National Assessment of Educational Progress had moved only marginally, even as state scores skyrocketed, was manifest then and is inescapable now.
Comments, observations and thoughts from two bloggers on applied statistics, higher education and epidemiology. Joseph is an associate professor. Mark is a professional statistician and former math teacher.
Tuesday, August 31, 2010
Joel Klein's Record
Monday, August 30, 2010
Sentences to ponder
We always talk about a model being "useful" but the concept is hard to quantify.
-- Andrew Gelman
This really does match my experience. We talk about the idea that "all models are wrong but some models are useful" all of the time in Epidemiology. But it's rather tricky to actually define this quantity of "useful" in a rigorous way.
Is Ray Fisman one of the best and the brightest?
I have singled out Dr. Fisman not because "Clean Out Your Desk" was exceptionally bad but because it was exceptionally representative. If this were an anomaly written by someone who was stupid or incompetent or had a grudge against teachers, it wouldn't be worth anyone's time to discuss, let alone exhaustively rebut it. This is something more disturbing.
David Warsh has drawn a relevant analogy between the reform movement and the run-up to Vietnam:
Remember the recipe for a policy disaster? Start with a handful of policy intellectuals confronting a stubborn problem, in love with a Big Idea. Fold in a bunch of ambitious Ivy League kids who don’t speak the local language. Churn up enthusiasm for the program in the gullible national press – and get ready for a decade of really bad news. Take a look at David Halberstam’s Vietnam classic The Best and the Brightest, if you need to refresh your memory. Or just think back on the run-up to the war in Iraq.
The education reform movement is filled with smart, well-intentioned people like Ray Fisman. Under the circumstances, that doesn't provide much comfort.
Zero Tolerance
The commonplace scenario in the United States when people decide to “get tough” and implement a policy of “zero tolerance” for infractions of the rules is to in practice tolerate the majority of infractions by not catching perpetrators and then hit a minority of violators with extremely harsh sanctions. For years now, Mark Kleiman has been pushing the reverse approach—make sanctions relative mild, but make them swift and nearly certain.
The results were compelling:
Now the results are in: drunk–driving fatalities fell from twice the national average, 70, in 2006 to just 34 in 2008, the most recent year for which data are available
It is a key element of public health policy to try to find ways to handle behaviors that involve both a health issue (like addiction to alcohol) and a negative externality (like hitting people with cars). It is really interesting to see research being done on which approaches are actually the most effective. This type of research is important stuff and has some pretty interesting ramifications for improving public health in a wide range of circumstances.
Hazards of sweeping generalizations
The other reason for having 100% sensitive tests at the cost of specificity is because of the clinical tradeoffs that occur because you have done that test.
If, for example, the treatment subsequent to a test is vitamin supplementation, which should have next to zero complications, then 100% sensitivity in the face of nasty complications caused by non-treatment makes quite a lot of sense.
In some of the areas that I work, like pain, we are denied these elegant trade-offs. However, I also do work in coagulation and there are good examples of this type of trade-off there. For example, despite the limited evidence of clinical utility, it can make sense for people with Homocysteine and MTHFR mutations to take b-vitamins. Similarly, a few false positives have very limited impact on the patients involved as the risk of taking a b-vitamin supplement (in the first world where economic hardship is unlikely) is small.
So this is a good reminder that there are no sweeping generalizations in epidemiology.
Saturday, August 28, 2010
Megan's List
This point, though, was chilling:
I believed that over reasonably long time-frames, modest investments in equities would allow you to retire in comfort.
I am mildly interested (as a hobbyist) in personal investment. But I think the implications of this are far broader than one sentence really captures. It makes the whole idea of shifting Social Security (as an insurance program against poverty in old age) into personal accounts somehow less appealing. It also says very interesting things about the rate of retirement for those of us who started our careers late (due to mid-career changes). The more I think about this issue, the more it has very profound implications for the way our work force will evolve.
But, sadly, I also think that this point is the best reading of the current stagnation of equities; even if the long run is better, it is hard to have the level of confidence that I once did.
Friday, August 27, 2010
Trade-offs between Type I and Type II error
But the feature of this discussion that I find the most interesting is that the decision to choose between sensitivity and specificity is a judgment that people seem to be very poor at. Consider pain medication. If you want to make sure that everyone in serious pain gets appropriate pain control, then some fraudsters will get illicit narcotics. Alternatively, if you make the screen so tight that nobody is able to obtain narcotics via "fake pain", then some real cases will be undertreated.
We see the same thing with releasing people from prison. Even if former prisoners only committed crimes at the rate of the general population, at least some crimes could be prevented by a tougher release policy. Of course, this line of reasoning leads to absurd conclusions -- we could completely eliminate adult crime by jailing everyone for life on their eighteenth birthday. We see the same thing in the Ray Fisman argument for getting rid of 80% of teachers during probation – it is so important not to make a mistake and keep an inferior teacher that we should fail to hire many good teachers just to make sure we have no sub-standard ones.
But people don't seem to like to make these trade-offs. In the case of the test for Alzheimer's disease, the authors could have been a lot more specific if they were willing to give up sensitivity. But, for some reason, people seem to prefer to end up at one extreme of a scale rather than the middle (where the value of the test is maximized).
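To make the trade-off concrete, here is a minimal sketch in Python. The scores are invented for illustration (they are not from the Alzheimer's study or any other data set), and using Youden's J (sensitivity + specificity - 1) as the measure of a test's "value" is my assumption; it weights the two kinds of error equally, which will not be appropriate when one error is much costlier than the other.

```python
def sensitivity_specificity(scores_sick, scores_healthy, threshold):
    """Call a score >= threshold 'positive'; return (sensitivity, specificity)."""
    true_pos = sum(s >= threshold for s in scores_sick)
    true_neg = sum(s < threshold for s in scores_healthy)
    return true_pos / len(scores_sick), true_neg / len(scores_healthy)


# Hypothetical test scores, made up for this example only.
sick = [3, 5, 6, 7, 8, 9]
healthy = [1, 2, 3, 4, 5, 7]

results = {}
for threshold in (3, 5, 8):
    sens, spec = sensitivity_specificity(sick, healthy, threshold)
    youden_j = sens + spec - 1  # one possible summary; assumes equal error costs
    results[threshold] = (sens, spec, youden_j)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, "
          f"specificity={spec:.2f}, J={youden_j:.2f}")
```

With these made-up numbers, the lowest cutoff catches every true case but flags most healthy people, the highest cutoff does the reverse, and the middle cutoff scores best on J. Change the relative cost of the two errors and the preferred cutoff moves with it.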
It's a phenomenon that I wish I understood better.
Thursday, August 26, 2010
Badly needed break from Fisman
[Video clip: The Daily Show With Jon Stewart, "Extremist Makeover - Homeland Edition" (www.thedailyshow.com)]
[Video clip: The Daily Show With Jon Stewart, "The Parent Company Trap" (www.thedailyshow.com)]
US Mobility
On the research side, however, this fact is good news for database research, as the "lost to migration" rate is low enough to make it unlikely that we will get serious bias in state-level medical claims data. That is really useful to know when evaluating Medicaid studies.
Genetic Epidemiology
A 2009 study came up with a technique for predicting the height of a person based on looking at the 54 genes found to be correlated with height in 5,748 people — and discovered the results were one-tenth as accurate as the 125–year-old technique of averaging the heights of both parents and adjusting for sex.
I suspect that this issue is the central one facing genetic epidemiology. While it is possible that the approach of averaging the height of the parents includes some environmental information, it is a pretty strong comment on the predictive power of genes if that is the actual answer.
More likely, I think, is the idea that complex and important characteristics are due to many, many genes (all of which have a modest influence). This makes sense from a selection point of view (characteristics like height need to be stable) but makes the project of prediction using genes extraordinarily complicated. I don't know if there is a simple answer or not, but it definitely provides some challenges for the paradigm of the classic epidemiological study.
Wednesday, August 25, 2010
Mystery (Education Question)
So why are the two so often conflated?
It could be the "big lie" where a falsehood is said so often that the other side starts to believe it. But people are usually more sophisticated than that.
Another possibility is that we have lost perspective on the alternatives. We worry about reluctance to fire teachers but forget that private alternatives are not inexpensive. From Marginal Revolution:
A New York City charter school set to open in 2009 in Washington Heights will test one of the most fundamental questions in education: Whether significantly higher pay for teachers is the key to improving schools.
The school, which will run from fifth to eighth grades, is promising to pay teachers $125,000, plus a potential bonus based on schoolwide performance. That is nearly twice as much as the average New York City public school teacher earns, roughly two and a half times the national average teacher salary and higher than the base salary of all but the most senior teachers in the most generous districts nationwide.
However, this still doesn't explain the odd consensus of left and right as it seems improbable that many people are fighting for a serious increase in education costs.
Most likely, I suspect, is the current American focus on short-term results. When we do annual ratings of employees, we do not consider issues of long-term dedication -- we know people are simply going to move on anyway. Consider this report:
Among jobs started by workers when they were ages 38 to 42, 31 percent ended in less than a year, and 65 percent ended in fewer than 5 years.
Is it possible that, with 65% of middle-aged workers holding a job for less than 5 years, we have simply lost the sense of how to build long-term loyalty and dedication?
The Old Shell Game -- Why you have to keep your eye on Ray Fisman (and no, we're not quite through with the second paragraph)
New York City Schools Chancellor Joel Klein often quotes the commission before discussing how U.S. schools have fared since it issued its report. Despite nearly doubling per capita spending on education over the past few decades, American 15-year olds fared dismally in standardized math tests given in 2000, placing 18th out of 27 member countries in the Organization for Economic Co-operation and Development. Six years later, the U.S. had slipped to 25th out of 30. If we've been fighting against mediocrity in education since 1983, it's been a losing battle.*
The OECD tests are the Book of Revelation of the education reform movement, the great ominous portent to be invoked in the presence of critics and non-believers. Putting aside questions of the validity and utility of this test (perhaps for another post if my stamina holds out), we would certainly like to be in the top ten rather than the bottom.
But before we concede this one, let's pull out our well-thumbed copy of Huff and take one more look. Whenever one side in a complex debate keeps pulling out one particular statistic, you should always take a moment and check for cherry-picking.
Is Fisman distorting the data by being overly selective when picking statistics to bolster his case? Yes, and he's doing it in an egregious way.
Take a look at the Trends in International Mathematics and Science Study. Here's a passage from the executive summary from the National Center for Education Statistics:
In 2007, the average mathematics scores of both U.S. fourth-graders (529) and eighth-graders (508) were higher than the TIMSS scale average (500 at both grades). The average U.S. fourth-grade mathematics score was higher than those of students in 23 of the 35 other countries, lower than those in 8 countries (all located in Asia or Europe), and not measurably different from those in the remaining 4 countries. At eighth grade, the average U.S. mathematics score was higher than those of students in 37 of the 47 other countries, lower than those in 5 countries (all of them located in Asia), and not measurably different from those in the other 5 countries.
We could spend some time in the statistical weeds and talk about the methodology of TIMSS vs. OECD's PISA. TIMSS is the better established and arguably better credentialed, but both are serious efforts mounted by major international organizations and it would be difficult to justify leaving either out of the discussion.
If Fisman had limited his focus to the education of high school students and simply ignored the data involving earlier grades, we would have ordinary misdemeanor-level cherry-picking. Not the most ethical of behavior, but the sort of thing most of us do from time to time.
But Fisman does something far more dishonest; he quietly shifts the subject to teachers in general and often to elementary teachers in particular (take a good look at the study that's at the center of Fisman's article).
This means that, when you strip away the obfuscation, you get the following argument.
1. The best metrics for tracking American education are international rankings on math tests;
2. The best way of improving America's education system is to fire massive numbers of teachers, including those in areas where we are doing well on international rankings on math tests.
The bad news here is that we have a long way to go to make it through Fisman's article and it doesn't get much better, but the good news is that we're through with the second paragraph.
* Fought, for the most part, with Klein and Fisman's battle plan, but we've already covered that.
More on Avandia
But it's pretty clear that if the number needed to harm is 50 and the number needed to treat to prevent a serious outcome is 1000, then the medication is likely not favorable on the cost-benefit analysis.
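The arithmetic behind that comparison is simple enough to sketch. The two inputs below are the illustrative figures from the sentence above, not estimates for any particular drug, and the function name and the cohort size of 1,000 are my own choices for the example.

```python
def events_per_cohort(number_needed: float, cohort_size: int = 1000) -> float:
    """Expected number of events in a treated cohort, given an NNT or NNH."""
    return cohort_size / number_needed


nnh = 50    # number needed to harm (illustrative figure from the text)
nnt = 1000  # number needed to treat (illustrative figure from the text)

harmed = events_per_cohort(nnh)   # patients expected to be harmed per 1,000 treated
helped = events_per_cohort(nnt)   # serious outcomes expected to be prevented per 1,000 treated
harm_to_benefit = harmed / helped

print(f"Per 1,000 treated: {harmed:.0f} harmed, {helped:.0f} helped "
      f"(ratio {harm_to_benefit:.0f} to 1)")
```

This treats every harm and every prevented outcome as equally serious, which is a simplification; a real risk-benefit analysis would also weigh how severe each kind of event is.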
There are cases where the risk-benefit calculation is a subtle problem and it is always tricky to withdraw a drug that showed actual benefits in the original clinical trials. But it is looking increasingly like Avandia may carry more risks than benefits making it an exception to the rule.
Monday, August 23, 2010
Starting from the beginning -- Ray Fisman's sins of omission
The centerpiece of Fisman's recent Slate article, "Clean Out Your Desk," is a deeply flawed analysis proposing that four out of five probationary teachers should be fired, but the problems with the article aren't limited to that one piece of research; they permeate the article, starting with the very first two paragraphs:
In 1983, a presidential commission issued the landmark report "A Nation at Risk: The Imperative for Educational Reform." The report warned that despite an increase in spending, the public education system was at risk of failure. "If an unfriendly foreign power had attempted to impose on America the mediocre educational performance that exists today," the report declared, "we might well have viewed it as an act of war."
New York City Schools Chancellor Joel Klein often quotes the commission before discussing how U.S. schools have fared since it issued its report. Despite nearly doubling per capita spending on education over the past few decades, American 15-year olds fared dismally in standardized math tests given in 2000, placing 18th out of 27 member countries in the Organization for Economic Co-operation and Development. Six years later, the U.S. had slipped to 25th out of 30. If we've been fighting against mediocrity in education since 1983, it's been a losing battle.
Notice that strange gap of more than a quarter century? Other than mentioning increased spending per capita* and citing a couple of ominous sounding statistics, Fisman doesn't say a word about what happened since. There is no mention of what the response was to "A Nation at Risk." You could easily come away with the impression that there was no response, that educators had simply gone on with business as usual.
This is a common rhetorical trick in the educational reform movement: to point out various facts suggesting a dangerous decline over the past two or three decades, then quickly change the subject (sometimes citing "A Nation at Risk" to add a note of the Cassandra Syndrome). If only we had done something then, we wouldn't be on the precipice now.
The primary flaw in this narrative is that there was a response to the report, it was swift and sweeping, and mostly it consisted of the types of change reformers like Fisman, Klein, and Ben Wildavsky continue to push for to this day: importing techniques and philosophies from the private sector; encouraging privatization and entrepreneurs; basing the evaluation of schools on objective metrics (particularly standardized tests of student performance).
By the late Eighties, when I went into teaching, it was difficult to find a school without a mission statement. Staff development by then consisted almost entirely of the kind of training/motivation seminars that I would encounter a few years later working for Fortune 500 companies. Business jargon was all the rage. My first encounter with the school of education was when the dean gave us a talk on how Tom Peters and In Search of Excellence were going to revolutionize education.
For a while, I taught in a high school where the principal was known to say that he didn't like to base teacher renewal or promotion decisions on standardized test scores. He was careful not to say he wouldn't. That would have been a blatant lie. We knew he would make our lives miserable if we didn't teach to the test. He knew we knew it. But he maintained at least a veneer of plausible deniability.
One particularly spineless history teacher spent about a month doing nothing but drilling facts that were likely to be on the test. No discussions. No writing assignments. No additional reading. No attempt to put the material into any kind of meaningful context. But his scores were good.
By the early Nineties, within less than a decade of the "Nation at Risk" report, states were starting to pass charter school laws. Pushes for merit pay and weakening tenure intensified. Faith in business as a source of answers for schools continued.
Education reform has proceeded in more or less a straight line for more than a quarter century without much that can be held up as a clear success. This isn't necessarily a damning criticism. Reformers like Fisman and Klein could honestly argue that the current state of education is mixed, not as good as it could be but not as bad as it would have been if steps had not been taken, or they could argue we have a classic case of half measures, that these reforms would solve our problems if fully implemented but in their watered down form they can do no good.
Both of these arguments are honest and defensible.
What they can't honestly do is imply that we are where we are because we didn't listen to them.
*Why per capita and not per student?
Thursday, August 19, 2010
Ray Fisman and the Tierney Ratio
As you might expect, Tierney Ratios vary greatly from author to author. The sorely-missed Olivia Judson maintained a TR of virtually zero while writing for the New York Times, whereas John Tierney, a science writer with no appreciable background in or aptitude for science, routinely had observed TRs in excess of one or two. (It is possible that Judson was kept in the Op-Ed rather than Science section out of concern that she would unfairly lower the latter's average.)
The value of the Tierney Ratio is somewhat limited by its serious data censoring problem (analogous to this well-known example). Faced with articles and essays of sufficiently low quality, researchers are almost always forced to leave significant mistakes, distortions and fallacies unaddressed.
Which brings us back to Ray Fisman's recent column in Slate, which reaches an almost Hellmanesque level of inaccuracy. Getting a true TR on something like this is an extraordinarily tedious job so the readers who aren't into hardcore education wonkery might want to skip the next few posts. You'll know it's safe to come back when we start posting Daily Show clips again.