Plain Talk with Douglas Harris

Wednesday, June 8, 2011

Douglas Harris

Douglas Harris—CALDER research collaborator, associate professor of educational policy studies at the University of Wisconsin-Madison, and author of Value-Added Measures in Education: What Every Educator Needs to Know—answers our questions about value-added measures.

1. Let’s start by laying out what value-added means and clear up some misconceptions. What are value-added measures and what can they tell us (and what can’t they tell us)? What are the strengths and limitations?

Value-added refers to how much people contribute to the output of an organization. In education, it usually means how much teachers and schools contribute to the student learning measured by standardized tests.

There are a lot of advantages to this way of thinking. The “cardinal rule” of accountability, as I call it, is that you hold people accountable for what they can control. But, right now, federal law evaluates schools based essentially on the percentage of students who are proficient, which is a very poor measure of what schools contribute. The bottom line is that students begin school at very different starting points, so when we evaluate schools based on where students end the school year on tests, we’re mainly capturing things that happened before school started—factors clearly outside the current school’s control. Value-added measures try to fix that by taking into account where students are when they come in the door, using their prior achievement scores. In other words, value-added focuses more on student growth than just end-of-year snapshots.
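To make the contrast between snapshots and growth concrete, here is a minimal simulation sketch. It is not a model used in the interview or in any real accountability system. Every number in it (50 schools, 100 students each, the spread of intake levels and of school contributions) is a hypothetical assumption. It adjusts end-of-year scores for prior scores with a single pooled regression line, which is only one of many possible value-added specifications.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def compare_measures(n_schools=50, n_students=100, seed=0):
    """Simulate schools and return (snapshot_corr, value_added_corr).

    Each value is the correlation between a school-level measure and the
    school's true (simulated) contribution. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    school, prior, post, true_effect = [], [], [], []
    for s in range(n_schools):
        intake = rng.gauss(0.0, 1.0)   # where the school's students start
        effect = rng.gauss(0.0, 0.2)   # what the school itself contributes
        true_effect.append(effect)
        for _ in range(n_students):
            p = intake + rng.gauss(0.0, 1.0)
            school.append(s)
            prior.append(p)
            post.append(p + effect + rng.gauss(0.0, 0.5))

    # Least-squares line predicting end-of-year score from prior score.
    n = len(prior)
    mx, my = sum(prior) / n, sum(post) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(prior, post))
             / sum((x - mx) ** 2 for x in prior))
    intercept = my - slope * mx

    # Snapshot = mean end-of-year score; value-added = mean residual.
    snapshot = [0.0] * n_schools
    value_added = [0.0] * n_schools
    for s, x, y in zip(school, prior, post):
        snapshot[s] += y / n_students
        value_added[s] += (y - (intercept + slope * x)) / n_students

    return pearson(snapshot, true_effect), pearson(value_added, true_effect)


snapshot_corr, value_added_corr = compare_measures()
print(f"snapshot vs. true contribution:    {snapshot_corr:.2f}")
print(f"value-added vs. true contribution: {value_added_corr:.2f}")
```

In this toy setup the end-of-year snapshot mostly reflects where students started, while the adjusted measure tracks the simulated school contribution much more closely. Real data are messier than this, which is where the limitations discussed next come in.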

Unfortunately, it’s difficult to put the idea into practice. For example, value-added measures are no better than the tests themselves, which often capture only low-level achievement and exclude a lot that we want students to know and do. Also, the results can vary depending on what specific achievement test you use and what statistical assumptions you make. We’re making good progress on identifying the best assumptions and models, but there’s no agreement on the best approach at this point.

2. One of the criticisms of value-added measures is that the scores vary over time—what should we make of this instability?

This is one of the biggest problems. With value-added, we are less likely to unfairly punish teachers and schools serving disadvantaged students, but that comes at a cost of imprecision or random error. The error in any measure, including test scores, adds up fast when you start to look at growth. This creates the instability—value-added measures bounce around from year to year.
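The point that error adds up in growth measures can be checked with a short sketch. It assumes, as a simplification, that the measurement errors on the prior-year and end-of-year tests are independent and equally large. In that case the error variance of the difference is the sum of the two error variances, so the error in a gain score is about 1.41 times (the square root of 2) the error in a single score. The function name and the error size of 1.0 are illustrative choices, not values from any real test.

```python
import math
import random


def gain_error_sd(test_error_sd, n=100_000, seed=1):
    """Estimate the standard deviation of the error in a gain score.

    Assumes independent, equally sized measurement errors on the two tests.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n):
        prior_error = rng.gauss(0.0, test_error_sd)
        post_error = rng.gauss(0.0, test_error_sd)
        errors.append(post_error - prior_error)  # error carried by the gain
    mean = sum(errors) / n
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / (n - 1))


print(f"error SD of one test score: {1.0:.2f}")
print(f"error SD of a gain score:   {gain_error_sd(1.0):.2f}")
print(f"theoretical value, sqrt(2): {math.sqrt(2):.2f}")
```

At the same time, the quantity being measured (one year of growth) varies less across students than achievement levels do, so the larger error is set against a smaller signal.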

I think almost everyone studying value-added measures is concerned about this, though some view it as a bigger problem than others. In my view, this is a big problem. Instability means that at any given time, a large percentage of teachers will be “misclassified” or put in the wrong performance category. If educators are going to respond to the measures, they have to trust them, and we’re definitely not there yet. Misclassification will also likely be a basis for lawsuits. I lay out some of the basic legal questions in my book—not surprisingly, the laws vary by state. The legal status isn’t going to be resolved anytime soon.
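As a rough illustration of misclassification, the sketch below sorts simulated teachers into five performance categories twice: once by their true effect and once by a noisy estimate of it. The noise level (1.2 times the spread of true effects) and the use of quintiles are hypothetical assumptions chosen for the example, not estimates of how unstable real value-added scores are.

```python
import random


def quintile_labels(values):
    """Label each value 0-4 by its rank-based quintile (0 = lowest fifth)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // len(values)
    return labels


def misclassified_share(noise_sd, true_sd=1.0, n_teachers=5000, seed=2):
    """Share of simulated teachers whose estimated quintile is wrong."""
    rng = random.Random(seed)
    true = [rng.gauss(0.0, true_sd) for _ in range(n_teachers)]
    estimate = [t + rng.gauss(0.0, noise_sd) for t in true]
    wrong = sum(a != b for a, b in
                zip(quintile_labels(true), quintile_labels(estimate)))
    return wrong / n_teachers


for noise in (0.0, 0.3, 1.2):
    share = misclassified_share(noise)
    print(f"noise SD {noise:.1f}: {share:.0%} of teachers in the wrong category")
```

With no noise, nobody is misclassified. As the noise grows relative to the true differences among teachers, the share placed in the wrong category rises quickly.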

Some prior studies, including a recent Brookings report by several CALDER affiliates, show that instability in teaching is similar to, say, instability in baseball players’ batting averages. Even with the large number of at-bats players have each year, their averages still bounce around. But it’s not clear to me that baseball is really the right comparison or that any comparison like this is going to help us figure out what to do with teacher value-added.

3. Jay Mathews of the Washington Post said in a blog post about your book that we really should be evaluating whole schools, not just individual teachers. What’s your assessment? I understand that you’re studying the value-added of school principals—is this one way to evaluate a whole school’s effectiveness?

I think we should definitely be using value-added to evaluate schools. While we still sacrifice some precision and stability, it’s very hard to justify continually labeling schools as failures just because they happen to serve disadvantaged students. And the differences between value-added and No Child Left Behind (NCLB) measures are large. Many schools “in need of improvement” under NCLB actually have above-average value-added.

Now, you might think that the imprecision problem would go away at the school level because there are so many students per school, but that’s not the case. We do have greater precision for each measure, but the less recognized problem is that the differences in actual school performance are also smaller, so the challenge of distinguishing among schools with statistical confidence may be just as great with schools as with teachers. Small differences in performance are harder to judge than large differences.
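One way to see why more students per school does not settle the matter is to compare the spread of true performance with the sampling error of each estimate. The sketch below uses a simple signal-to-noise ratio, often called reliability: true variance divided by true variance plus error variance. The specific numbers (25 students per teacher, 225 per school, and the spreads of 0.15 and 0.05) are hypothetical and were picked so that the two cases come out equal. They are not empirical estimates.

```python
import math


def reliability(true_sd, student_noise_sd, n_students):
    """Share of the variance in estimates that reflects true differences."""
    standard_error = student_noise_sd / math.sqrt(n_students)
    return true_sd ** 2 / (true_sd ** 2 + standard_error ** 2)


# Hypothetical teacher: few students, but large true differences.
teacher = reliability(true_sd=0.15, student_noise_sd=0.75, n_students=25)

# Hypothetical school: many students, but small true differences.
school = reliability(true_sd=0.05, student_noise_sd=0.75, n_students=225)

print(f"teacher-level reliability: {teacher:.2f}")
print(f"school-level reliability:  {school:.2f}")
```

Under these assumptions both ratios are 0.50. The school estimate is three times more precise, but the true differences among schools are also three times smaller, so telling schools apart is no easier than telling teachers apart.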

In theory, if we’re going to have school value-added and teacher value-added, then we should also have principal value-added. There are different ways to do that. One is just to attribute school value-added to the principal. The problem is that school value-added is determined substantially by the teachers and, when principals first start, they have no control over who is in the classroom. There is a big debate about school turnaround right now, but we know for sure that there are no quick fixes. Principals have to recruit, develop, and retain effective teachers (among many other things) and that takes time, which creates a challenge for estimating principal value-added.

Also, some schools are in much better positions to attract and retain effective teachers. If you’re a principal in a well-off suburb with a teacher education program nearby, you have 100 good applicants for a math position. If you’re in a rural area far away from teacher education programs, you might not get any. There are ways to try to account for those discrepancies in labor supply in the value-added estimates, but it’s not easy and we don’t know how well it works. Another approach, then, is to look at how school value-added changes over time and attribute the changes in value-added to the principal, but then the performance measure depends heavily on who happened to be the principal previously.

4. The Los Angeles Times published the individual value-added scores for some 6,000 LA teachers. You’ve said in your blog that you think it was a bad idea—that judging teachers by this measure alone could “wreak havoc on schools and undermine teaching and learning.” Why? And what is the right way to use these data?

I think value-added measures have a lot to offer, but it depends on how you use them, and I’m strongly opposed to making these measures public—naming and shaming teachers. I’ve talked to a lot of teachers about this and I’d say this is right up there with merit pay and tenure decisions in terms of the stakes involved.

In my book, I talk about another basic rule: the stakes attached to any measure should be proportional to the quality of the measure. On a scale of 1 to 10—10 being perfect accuracy—I’d say the quality of value-added is about a 3. Given the instability and the other problems, we know that we’ll be wrong more often than we’re right when we try to use these measures to put teachers in performance categories.

If the quality is about 3, then that means the stakes should also be at about a 3. Now, that’s pretty vague, I realize. But I think, given how little evidence we have on how to use teacher value-added measures well, a broad guideline like this tells you pretty clearly that using teacher value-added estimates by themselves to make high-stakes decisions—including putting the measures on public web sites—is a bad idea. Most of the pilot programs out there, including the Race to the Top states and districts, recognize this and aren’t going that far. Almost all of them are combining value-added with some type of classroom observation, which is a more promising approach.

When you have something promising, you should try pilot programs and little experiments—and really evaluate them—to see what works best. That’s generally what’s happening, though I worry that the evaluations are not up to the task.

5. What’s the next step? What should we take away from value-added and what do we still need to learn?

The bottom line, and this really isn’t much in dispute, is that we do a really bad job of evaluating and developing teachers. The most effective teachers seem so much more effective than the least effective teachers, yet we treat them all the same way. I think we can do better than that and I think value-added will probably be one useful tool, among many, for that. At the very least, it should help us identify effective practices and programs.

Perhaps the biggest value of value-added is that it’s generated serious thinking about what good teaching looks like. When I give talks and workshops, I hear this over and over. People are critical of value-added, and they usually have legitimate concerns, but they usually end the conversation by saying, “you know, we never should have allowed things to get this bad with teacher evaluation.” That’s a key point—we have to compare teacher value-added with the alternatives.

So, even if we don’t use teacher value-added at all in the end, the whole exercise will have been worthwhile, just to get people to define and measure what good practice looks like and, hopefully, respond to those measures through instruction. That’s not to say that it doesn’t matter how we evaluate teachers; what gets measured gets done. But a measure of classroom practice has to be a big part of the puzzle. There are problems with those measures, too, because it’s hard to identify practices that make sense across teachers, classroom situations, and students, but the goal here is to improve practice, so we have to focus on and measure practice.