Tuesday, March 15, 2011

The Annotated "Evaluating New York Teachers, Perhaps the Numbers Do Lie"

As promised, here are some comments (in brackets) on Michael Winerip's NYT article on the city's teacher evaluation process.
Last year, when Ms. Isaacson was on maternity leave, she came in one full day a week for the entire school year for no pay and taught a peer leadership class.


[One thing that Winerip fails to emphasize (though I suspect he is aware of it) is how common stories like this are. Education journalists often portray ordinary excellence as something exceptional. This is partly due to journalistic laziness -- it's easier to describe something as exceptional than to find something that actually is exceptional -- and partly due to the appeal of standard narratives, in this case the Madonna/whore portrayal of teachers (I would have used a non-gender-specific analogy, but I couldn't come up with one that fit as well).]

The Lab School has selective admissions, and Ms. Isaacson’s students have excelled. Her first year teaching, 65 of 66 scored proficient on the state language arts test, meaning they got 3’s or 4’s; only one scored below grade level with a 2. More than two dozen students from her first two years teaching have gone on to Stuyvesant High School or Bronx High School of Science, the city’s most competitive high schools.


[Everything in this article inclines me to believe that Ms. Isaacson is a good teacher but we need to note that this is a fairly easy gig compared to other urban schools, particularly for someone with her background. Students at places like the Lab School tend to be more respectful and attentive toward academically successful people like Ms. Isaacson. In many schools, this can actually make students initially distrustful.]

You would think the Department of Education would want to replicate Ms. Isaacson — who has degrees from the University of Pennsylvania and Columbia — and sprinkle Ms. Isaacsons all over town. Instead, the department’s accountability experts have developed a complex formula to calculate how much academic progress a teacher’s students make in a year — the teacher’s value-added score — and that formula indicates that Ms. Isaacson is one of the city’s worst teachers.

According to the formula, Ms. Isaacson ranks in the 7th percentile among her teaching peers — meaning 93 percent are better.

[One of the fallacies that follow from this Madonna/whore narrative is the idea that, since you have such a clearly bimodal distribution, any metric that's correlated with teaching quality should be able to winnow the good from the bad. In reality, you have a normal distribution with noisy data and a metric that doesn't correlate all that well. The result, unsurprisingly, is a large number of teachers apparently misclassified. What is surprising is that more people didn't foresee this fairly obvious outcome.]
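[You can see how bad the misclassification gets with a quick simulation -- my own illustration, not anything resembling the city's model. Assume teacher quality is normally distributed, assume the metric correlates with quality at 0.5 (a made-up but not ungenerous figure for this kind of measure), and flag the bottom decile by the metric, the way a layoff rule might:

```python
import random

random.seed(0)
rho = 0.5        # assumed correlation between true quality and the metric
n = 100_000      # simulated teachers

true_quality = [random.gauss(0, 1) for _ in range(n)]
# observed metric = correlated signal + independent noise,
# constructed so that corr(metric, quality) is approximately rho
metric = [rho * q + (1 - rho**2) ** 0.5 * random.gauss(0, 1)
          for q in true_quality]

# flag the bottom 10% by the metric, as a layoff rule might
cutoff = sorted(metric)[n // 10]
flagged = [i for i in range(n) if metric[i] <= cutoff]

# how many flagged teachers are actually above the median in true quality?
misclassified = sum(1 for i in flagged if true_quality[i] > 0)
print(f"{misclassified / len(flagged):.0%} of flagged teachers "
      f"are above-median in true quality")
```

Under these assumptions, somewhere around a fifth of the teachers flagged as "worst" are actually better than the median teacher. That is the boring, predictable arithmetic of a noisy metric.]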

This may seem disconnected from reality, but it has real ramifications. Because of her 7th percentile, Ms. Isaacson was told in February that it was virtually certain that she would not be getting tenure this year. “My principal said that given the opportunity, she would advocate for me,” Ms. Isaacson said. “But she said don’t get your hopes up, with a 7th percentile, there wasn’t much she could do.”

That’s not the only problem Ms. Isaacson’s 7th percentile has caused. If the mayor and governor have their way, and layoffs are no longer based on seniority but instead are based on the city’s formulas that scientifically identify good teachers, Ms. Isaacson is pretty sure she’d be cooked.

[Well, as long as it's scientific.]

She may leave anyway. She is 33 and had a successful career in advertising and finance before taking the teaching job, at half the pay.

[This isn't unusual. I doubled my salary when I went from teaching to a corporate job. Plus I worked fewer hours and they gave us free candy, coffee and the occasional golfing trip.]

The calculation for Ms. Isaacson’s 3.69 predicted score is even more daunting. It is based on 32 variables — including whether a student was “retained in grade before pretest year” and whether a student is “new to city in pretest or post-test year.”

Those 32 variables are plugged into a statistical model that looks like one of those equations that in “Good Will Hunting” only Matt Damon was capable of solving.

The process appears transparent, but it is clear as mud, even for smart lay people like teachers, principals and — I hesitate to say this — journalists.

[There are two things about this that trouble me: the first is that Winerip doesn't seem to understand fairly simple linear regression; the second is that he doesn't seem to realize that the formula given here is actually far too simple to do the job.]
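[For readers who want the flavor of it, here is a toy sketch of how a value-added model of this general type works. This is my own made-up example with one covariate standing in for the article's 32 -- emphatically not the department's actual formula:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy setup: 50 teachers, 20 students each; every teacher has a true effect
teachers, per = 50, 20
true_effect = rng.normal(0.0, 0.15, teachers)    # what value-added tries to recover
pretest = rng.normal(3.0, 0.5, (teachers, per))  # prior-year score for each student
noise = rng.normal(0.0, 0.5, (teachers, per))    # everything the model can't see
posttest = 0.8 * pretest + true_effect[:, None] + noise

# one citywide regression predicts posttest from pretest
# (a single covariate standing in for the article's 32)
X = np.column_stack([np.ones(teachers * per), pretest.ravel()])
beta, *_ = np.linalg.lstsq(X, posttest.ravel(), rcond=None)
residual = (posttest.ravel() - X @ beta).reshape(teachers, per)

# a teacher's value-added score is her students' average residual:
# how far above or below prediction her class landed
value_added = residual.mean(axis=1)
corr = np.corrcoef(true_effect, value_added)[0, 1]
print(f"correlation between true quality and value-added score: {corr:.2f}")
```

Even in this best-case toy world -- where the model is correctly specified and the teacher effect really exists -- the recovered scores track true quality only imperfectly, because twenty students' worth of noise doesn't average away.]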

Ms. Isaacson may have two Ivy League degrees, but she is lost. “I find this impossible to understand,” she said.

In plain English, Ms. Isaacson’s best guess about what the department is trying to tell her is: Even though 65 of her 66 students scored proficient on the state test, more of her 3s should have been 4s.

But that is only a guess.

[At the risk of being harsh, grading on a curve should not be that difficult a concept.]

Moreover, as the city indicates on the data reports, there is a large margin of error. So Ms. Isaacson’s 7th percentile could actually be as low as zero or as high as the 52nd percentile — a score that could have earned her tenure.
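[That percentile range is easy to reproduce with a back-of-the-envelope calculation. Assume teacher scores are roughly normal across the city and that the measurement error is comparable in size to the spread between teachers -- the standard error of 0.8 below is my assumption, not a number from the city:

```python
from statistics import NormalDist

N = NormalDist()  # distribution of teacher scores, assumed standard normal

point_percentile = 0.07            # Ms. Isaacson's reported rank
z = N.inv_cdf(point_percentile)    # her point estimate on the score scale
se = 0.8                           # assumed measurement error, similar in
                                   # size to the spread of teacher scores

lo = N.cdf(z - 1.96 * se)          # 95% interval, back on the percentile scale
hi = N.cdf(z + 1.96 * se)
print(f"7th percentile, but plausibly anywhere from {lo:.0%} to {hi:.0%}")
```

With error bars that wide, the interval runs from essentially the bottom of the distribution up past the middle -- which is exactly the zero-to-52nd-percentile range the article describes.]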

[Once again, many people saw this coming. Joel Klein and company chose to push forward with the plan, even in the face of results like these. Klein has built a career largely on calls for greater accountability and has done very well for himself in no small part because he hasn't been held accountable for his own record.]

I've left quite a bit out so you should definitely read the whole thing. It's an interesting story but if anything here surprises you, you haven't been paying attention.

1 comment:

  1. I think that this illustrates another issue we will have with evaluation: different people benefit from different schemes. When you do well on one scheme (raw performance), you may do poorly on another (value-added). Even worse, there may be edge effects where a very good class makes it impossible to score well (because the scores are capped), compared with a weak class.

    So it is not that surprising that simple metrics like seniority and advanced degrees (which are reasonably objective) have ended up being used.