19 June 2010

Teacher Effectiveness Evaluations are Crap

Filed under: Culture, Psychology — ktismatics @ 5:12 pm

Last week the state of Colorado passed legislation, introduced by the Democrats, that makes tenure for primary and secondary school teachers contingent on annual performance evaluations. The evaluations are two-fold: subjective assessments performed by school principals, and student improvement on standardized achievement tests. The public rationale is straightforward: crappy teachers, instead of being rewarded with long-term job security and annual pay raises, ought to be let go and replaced by good teachers. Of course there’s also a union-busting strategy in play here; however, maybe the traditional labor contracts, which reward years on the job over excellence, are taking a toll on student learning. Are there valid ways of distinguishing good teachers from bad ones? And does better teaching result in better student learning?

So here’s a study from one of the most influential teacher-evaluation gurus: Daniel Goldhaber, an economist at the U. of Washington. It’s called “Can Teacher Quality be Effectively Assessed? National Board Certification as a Signal of Effective Teaching,” which can be downloaded near the bottom of this link. Goldhaber and Anthony looked at data on North Carolina 3rd-5th graders and their teachers across 3 school years. Included in the data set were students’ scores on statewide standardized achievement tests which are administered to every student every year. These repeated measures made it possible to evaluate the rate of individual students’ improvement from year to year, as well as the average per-student improvement for each teacher.

Of course when you measure anything you find differences: statistically, some teachers appear to be a lot more effective than others. But could these measured differences in student outcomes be attributed to differences in teacher characteristics?

Many schools push and reward their already-certified teachers to obtain national certification through the National Board for Professional Teaching Standards. To qualify, teachers have to put in something like 200 extra hours of training in teaching effectiveness, submit samples of their students’ work to a national evaluation board, and undergo on-site evaluation by trained observers. Only about half of the teachers who seek the NBPTS certification pass the evaluation. Are NBPTS-certified teachers more effective than those who aren’t so certified?

In a word, no. The study found that NBPTS-certified teachers achieved statistically significantly (p<.01) better student test results than other teachers, but this difference was minuscule (0.1 standard deviation, for you statistics nerds). The study included data from thousands of individuals, and with huge data sets like this even trivial differences show up as statistically significant. Paradoxically, based on student test data, teachers were more effective before than after receiving their national certification.
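For the statistics nerds, here is a minimal sketch of why sample size matters so much. It assumes two equal-sized groups and a normal (z-test) approximation, and the group sizes are made up for illustration, not taken from the study. The function name `two_sided_p_value` is my own.

```python
import math

def two_sided_p_value(d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference d
    between two equal-sized groups, using a normal (z-test) approximation."""
    # Standard error of the difference, in SD units, is sqrt(2 / n).
    z = abs(d) * math.sqrt(n_per_group / 2)
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)) for the standard normal.
    return math.erfc(z / math.sqrt(2))

# Same 0.1 SD difference, hypothetical group sizes.
for n in (100, 1000, 5000):
    print(n, two_sided_p_value(0.1, n))
```

The difference stays fixed at 0.1 standard deviation throughout; only the p-value shrinks as the groups get bigger.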

Goldhaber and Anthony did a lot more slicing and dicing of the data looking for more robust differences between teachers. As far as I can discern, they didn’t find very much. I suspect that the dreaded multiple-comparisons problem kicked in (the one the Bonferroni correction exists to guard against): if you conduct a whole bunch of statistical analyses on the same data at the usual .05 threshold, roughly 5% of the analyses where nothing is really going on will still generate statistically significant results merely because of random noise in the data.
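A rough sketch of the arithmetic, assuming the tests are independent (analyses run on the same data usually aren't, so treat these numbers as a ballpark) and assuming a .05 threshold. The function names and the count of 20 analyses are hypothetical.

```python
def familywise_error_rate(alpha, k):
    """Chance of at least one false positive across k independent tests,
    each run at significance level alpha, when no real effect exists."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(alpha, k):
    """Per-test threshold that keeps the overall false-positive rate
    at or below alpha across k tests (the Bonferroni correction)."""
    return alpha / k

k = 20  # hypothetical number of analyses
print(familywise_error_rate(0.05, k))  # about 0.64
print(bonferroni_threshold(0.05, k))   # about 0.0025
```

So with 20 looks at the data there is roughly a two-in-three chance that at least one "significant" result is pure noise, unless each test is held to a much stricter threshold.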

But then finally we get to the Policy Implications section of the paper. Here’s how Goldhaber and Anthony summarize their findings:

“[T]his is the first large-scale study that appears to confirm the NBPTS assessment process is effectively identifying those teachers who contribute to relatively larger student learning gains. This finding is important both because it provides some indication of a positive return on the investment in NBPTS, and on a more fundamental level, it demonstrates that it is actually possible to identify teacher effectiveness through NBPTS-type assessments.”

Say what? I read the report, I looked at the data tables, and that’s not the implication I arrived at. The researchers then acknowledge that the NBPTS certification isn’t cheap: $2,300 for the assessment plus an average $4,200 annual pay increase for those who pass the evaluation. They conclude that it would cost about $7,300 per pupil to raise standardized test scores by 1 standard deviation — a result which, based on their own analyses, is almost certainly unattainable.

*   *   *

Okay, so let’s say that it’s not easy to explain why some teachers are effective while others aren’t. Is it at least possible to distinguish effective from ineffective teachers based on their students’ standardized results? Here’s a second study addressing that question: Goldhaber and Hansen, “Assessing the Potential of Using Value-Added Estimates of Teacher Job Performances for Making Tenure Decisions” (2009), downloadable at the top of this link.

Again using data from NC primary-school students and their teachers, the researchers report that, using 3 consecutive years of their students’ test results, half of the teachers’ outcomes were significantly different from average. But as we saw in the prior study, statistical significance doesn’t imply magnitude of difference. And here the effect size is even punier than the national certification effect. Goldhaber and Hansen estimate that, if the lowest-performing 25% of the teachers were fired, overall test results for the students would go up an average of 0.03 standard deviation.

To give you some idea of how trivial that result is, we turn to Jacob Cohen, for whom the statistic we’re talking about is named: “Cohen’s d.” Cohen put the lower threshold for a “small” effect size at a Cohen’s d of 0.2. And what did Goldhaber and Hansen come up with? A Cohen’s d of 0.03. This isn’t even big enough to be small; in practical terms it’s indistinguishable from nothing.
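For anyone who wants to see the statistic itself, here is a minimal sketch of Cohen’s d using the pooled standard deviation, plus Cohen’s conventional labels (0.2 small, 0.5 medium, 0.8 large). The function names and the toy scores are mine, purely for illustration; none of it comes from the Goldhaber and Hansen data.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

def describe_effect(d):
    """Label an effect size using Cohen's conventional thresholds."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

print(cohens_d([2, 4, 6], [1, 3, 5]))  # toy scores: d = 0.5
print(describe_effect(0.03))           # "below small"
```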

So now I skip ahead to the Policy Implications section. Will Goldhaber once again vastly inflate the potential impact of his trivial findings? He begins by asserting that teacher evaluations based on 3 years’ worth of student test performances “serve as better indicators of teacher quality than observable teacher attributes.” That sounds impressive until we remember from the prior study that observable teacher attributes were crap indicators in their own right. But what about these puny Cohen’s d numbers he estimated as the policy impact of firing the lowest-performing quarter of the teachers? Says Goldhaber:

“While these may appear to be quite small, new evidence (Hanushek, 2009) suggests that even these small impacts on the quality of the teacher workforce can have profound impacts on aggregate country growth rates.”

WTF? What is the nature of this purported “new evidence”? What are “aggregate country growth rates,” and what have they to do with teacher effectiveness and student test results? Alas, this is the last sentence of the Policy Implications section. Still, he offers this summary recommendation at the very end of the report:

“[T]he results presented here indicate that teacher effect estimates are far superior to observable teacher variables as predictors of student achievement, suggesting that these estimates are a reasonable metric to use as a factor in making substantive personnel decisions.”

Again, the numbers don’t justify the enthusiasm.

*   *   *

I conclude, based on these two studies, that if Goldhaber’s work represents the state of the art in evaluating teacher effectiveness, then the new Colorado law is ill-conceived in the extreme. The costs associated with putting teachers through fancy re-credentialing procedures and with firing and replacing presumably under-performing teachers can’t possibly result in meaningful improvements in student learning outcomes. On the other hand, using these poorly-validated means of axing teachers can save the state money, especially if doing so provides a quantitative rationale for dumping relatively high-paid tenured teachers and either replacing them with low-paid new teachers or not replacing them at all.

The fact that it’s so difficult to find meaningful distinctions between good and bad teachers would concern me if I were a teacher. The situation is similar to that of psychotherapists and counselors, where level of training and years of experience have virtually no measurable impact on client outcomes. At least it’s been demonstrated that, for people suffering from psychological symptoms/disorders, any therapy is substantially better than none at all. Can the same be said for teaching? It’s been demonstrated that home-schooled kids do just as well or better on standardized tests compared with traditionally-schooled kids. Still, home schooling isn’t teacherless: the parent functions as a private tutor even if s/he doesn’t carry the recognized teaching certificate. It’s also been shown that increasing class size, even doubling it, has little to no effect on learning outcomes. As far as I can tell, the question of how best to enhance student learning remains wide open.



  1. […] and as you’ll see from its apologies it has resisted completion and posting before. The new post at Ktismatics on the crappiness of teacher effectiveness evaluations is shaking it loose, for what it’s […]


    Pingback by Won’t someone think of the children? « Dead Voles — 20 June 2010 @ 5:14 pm

  2. K,

    I posted on this and started a petition to boycott the Times.



    Thanks for your post.

    Tex Shelters


    Comment by texshelters — 6 September 2010 @ 9:16 pm
