Thursday, August 1, 2013

Value added testing without a gold standard outcome

From a Megan Pledger comment on StatChat comes this paper (pdf) on value-added testing models for evaluating teachers. The paper raises the following concerns:

In the real world of schools, data is frequently missing or corrupt. What if students are missing past test data? What if past data was recorded incorrectly (not rare in schools)? What if students transferred into the school from outside the system?


The modern classroom is more variable than people imagine. What if students are team-taught? How do you apportion credit or blame among various teachers? Do teachers in one class (say mathematics) affect the learning in another (say science)?


Every mathematical model in sociology has to make rules, and they sometimes seem arbitrary. For example, what if students move into a class during the year? (Rule: Include them if they are in class for 150 or more days.) What if we only have a couple years of test data, or possibly more than five years? (Rule: The range three to five years is fixed for all models.) What’s the rationale for these kinds of rules?


Class sizes differ in modern schools, and the nature of the model means there will be more variability for small classes. (Think of a class of one student.) Adjusting for this will necessarily drive teacher effects for small classes toward the mean. How does one adjust sensibly?


While the basic idea underlying value-added models is the same, there are in fact many models. Do different models applied to the same data sets produce the same results? Are value-added models “robust”?


Since models are applied to longitudinal data sequentially, it is essential to ask whether the results are consistent year to year. Are the computed teacher effects comparable over successive years for individual teachers? Are value-added models “consistent”?
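To make the small-class concern above concrete, here is a minimal simulation sketch. It is mine, not anything from the paper: the class sizes, noise levels, and the precision-weighted shrinkage formula are all assumptions. The point is simply that raw class-mean estimates are much noisier for small classes, and the usual fix of weighting toward the grand mean is exactly what pulls small-class teachers toward average.

```python
# Toy illustration (not from the paper): noisy raw teacher-effect estimates
# from small classes, and precision-weighted shrinkage toward the mean.
# All numbers below are invented for the sake of the sketch.
import numpy as np

rng = np.random.default_rng(0)

n_teachers = 200
true_sd = 2.0        # assumed spread of true teacher effects (score points)
student_sd = 10.0    # assumed within-class student noise

true_effect = rng.normal(0.0, true_sd, n_teachers)
class_size = rng.integers(3, 35, n_teachers)   # classes from tiny to large

# Raw estimate: class mean gain, noisier when the class is small
raw_estimate = true_effect + rng.normal(0.0, student_sd / np.sqrt(class_size))

# Shrinkage weight: how much of the raw estimate to keep.
# Small classes -> low weight -> estimate pulled toward the grand mean (0 here).
weight = true_sd**2 / (true_sd**2 + student_sd**2 / class_size)
shrunk_estimate = weight * raw_estimate

small = class_size <= 8
print("RMSE of raw estimates, small classes:   ",
      np.sqrt(np.mean((raw_estimate[small] - true_effect[small])**2)).round(2))
print("RMSE of shrunk estimates, small classes:",
      np.sqrt(np.mean((shrunk_estimate[small] - true_effect[small])**2)).round(2))
```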

A lot of these concerns have been independently voiced by Mark P. What is especially troubling, though, is the idea that we could iterate through these assumptions until the school rankings match some prior belief. Checking model output against strong prior knowledge can be legitimate under some circumstances: Thomas Lumley gives an example of a model that clearly scrambled the rankings of sports teams (this isn't my area of expertise, so I apologize for not recognizing the teams or the sport involved), and that example shows how difficult these models are even when everyone is acting in good faith. But in Dr. Lumley's example there is a universally agreed outcome being predicted (does the team win games?). In education we lack such a clean outcome, and that is where it gets tricky: in a sense we are modeling a latent variable (student outcomes).
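On the robustness and consistency questions, one way to see the issue is to fit two deliberately simple "value-added" specifications to the same simulated data and ask whether they even agree on how teachers rank. The sketch below is mine, not any of the models discussed in the paper, and every number in it is invented; the point is that without an agreed outcome (the analogue of "does this team win games") we can observe rank disagreement, but not which ranking is right.

```python
# Toy sketch (assumptions, not the paper's models): two simple value-added
# specifications fit to the same simulated data, compared by how they rank
# teachers. Disagreement is observable; correctness is not, absent a
# gold-standard outcome.
import numpy as np

rng = np.random.default_rng(1)

n_teachers, class_size = 100, 25
teacher_effect = rng.normal(0, 2, n_teachers)

classes = []
for t in range(n_teachers):
    prior = rng.normal(50, 10, class_size)                  # prior-year scores
    current = 5 + 0.8 * prior + teacher_effect[t] + rng.normal(0, 8, class_size)
    classes.append((prior, current))

# "Model" 1: simple gain score -- teacher effect = mean(current - prior)
gain_effect = np.array([np.mean(cur - pri) for pri, cur in classes])

# "Model" 2: covariate adjustment -- pooled regression of current on prior,
# teacher effect = mean residual for that teacher's class
all_prior = np.concatenate([pri for pri, _ in classes])
all_current = np.concatenate([cur for _, cur in classes])
slope, intercept = np.polyfit(all_prior, all_current, 1)
resid_effect = np.array([np.mean(cur - (intercept + slope * pri))
                         for pri, cur in classes])

# Spearman-style rank correlation: Pearson correlation of the ranks
def ranks(x):
    return np.argsort(np.argsort(x))

rho = np.corrcoef(ranks(gain_effect), ranks(resid_effect))[0, 1]
print("rank correlation between the two specifications:", round(rho, 2))
```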

All of this suggests that we should be cautious about these models. It might also be an appropriate time to put serious effort into ascertaining student outcomes directly, so that these statistical models become easier to calibrate. Making the test score itself the outcome seems clever, but it merely hides the problem rather than solving it unless we are confident that the score is a very good measure of the outcomes we actually care about.
