CALDER Conversations: Test-Based Measures of Teacher Effectiveness

Thursday, September 13, 2012

The use of student test-based measures of teacher effectiveness in personnel decisions, such as tenure, is controversial. It is a major bone of contention in the current Chicago teacher strike. The conversation here focuses on the uses, value, and limitations of these measures, often called value-added measures. CALDER working papers that have contributed to this area of inquiry are listed below the conversations.

Eric Hanushek
Stanford University

Dan Goldhaber
University of Washington

Cory Koedel
University of Missouri

Tim Sass
Georgia State University

1) Student test-based measures of teacher effectiveness are often called “value-added” measures. Can you explain why?

KOEDEL: The concept of teacher value-added is rooted in the economics literature on production. The idea is that students come to teachers not as blank slates, but as the product of years of prior inputs from home and school. A student’s interactions with his or her teachers “add value” to the larger product (total student achievement). The models attempt to parse out teacher contributions, while acknowledging that much of the variation in student achievement can be explained by other factors.

An appealing feature of the value-added approach (and growth-modeling more generally) is that it “levels the playing field” in comparisons between teachers that teach different types of students. For example, it is well-understood that students who live in high-poverty areas score much lower on standardized tests than their low-poverty counterparts. Unlike evaluations that depend on test-score levels, growth-based evaluations account for the different starting points for different types of students by explicitly controlling for prior student achievement (as well as other student and/or school characteristics).
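To make the "controlling for prior achievement" idea concrete, here is a minimal sketch on simulated data. It is not any state's or researcher's actual model: the single prior-score control, the effect sizes, and the function name `estimate_value_added` are all assumptions chosen for illustration. It regresses current scores on prior scores plus one indicator per teacher and reads the teacher coefficients as value-added.

```python
import numpy as np

def estimate_value_added(prior, current, teacher_ids):
    """Least-squares fit of current score on prior score plus one
    indicator per teacher. Returns teacher labels and their estimated
    effects, centered so the average teacher is at zero."""
    teachers = np.unique(teacher_ids)
    # One 0/1 column per teacher; together they absorb the intercept.
    indicators = (teacher_ids[:, None] == teachers[None, :]).astype(float)
    X = np.column_stack([prior, indicators])
    coef, *_ = np.linalg.lstsq(X, current, rcond=None)
    effects = coef[1:]
    return teachers, effects - effects.mean()

# Simulated data (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
n_teachers, class_size = 50, 25
true_effect = rng.normal(0.0, 0.2, n_teachers)
teacher_ids = np.repeat(np.arange(n_teachers), class_size)

# Classrooms differ in their students' average starting point,
# which is what makes raw score levels an unfair comparison.
class_start = rng.normal(0.0, 0.5, n_teachers)
prior = class_start[teacher_ids] + rng.normal(0.0, 1.0, teacher_ids.size)
current = (0.7 * prior + true_effect[teacher_ids]
           + rng.normal(0.0, 0.5, teacher_ids.size))

teachers, va = estimate_value_added(prior, current, teacher_ids)
raw_means = np.array([current[teacher_ids == t].mean() for t in teachers])

print("corr(value-added, true effect):", np.corrcoef(va, true_effect)[0, 1])
print("corr(raw class mean, true effect):", np.corrcoef(raw_means, true_effect)[0, 1])
```

Comparing the two printed correlations shows how much of a raw classroom average reflects where students started rather than what the teacher contributed. Real models typically add further student and school controls.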

HANUSHEK: In reality, there are also a variety of different ways to estimate value-added models, and the specific approach can influence the statistical estimates of the contribution of individual teachers. Part of this gets wrapped up in the linkage of value-added estimation to teacher evaluations – a hot topic around state legislatures. When used to evaluate individuals, particular care is needed to ensure that the value-added of teachers is separated from the characteristics of the kids in the class, from unusual circumstances in one year, and the like. These issues are important for policy uses and are also the subject of considerable current research.

GOLDHABER: I don’t have much to add to Cory and Rick’s summary of the definition of value-added. I do want to point out, though, that while we focus a great deal of attention on whether value-added measures are in fact good estimates of teachers’ contributions to student learning, the issue of isolating what teachers are contributing is really much more general: it applies to any means of evaluating them. Take classroom observations, for instance. When an observer assesses what’s going on in a teacher’s classroom, they likely interpret what they are seeing as being primarily about teacher practices, but it is entirely possible that some component of observation ratings is capturing the nature of the students being taught. Classrooms with lots of students who started the year scoring below grade level may well be more disruptive than classrooms whose students arrived fully ready to learn the topics taught in a particular grade, and this could influence observational ratings of teachers. My general point is that we need to worry about whether any means of evaluating teachers is really capturing what teachers are doing in the classroom rather than reflecting the kind of students they are responsible for educating.

SASS: I think Cory explained the concept of “value added” well. As Rick and Dan point out, one of the key elements in the value-added approach is controlling for other factors that influence achievement so that we are measuring the teacher’s contribution to student learning and not the “value” being added by parents, school leaders or physical facilities. As states begin to implement value-added measures they face choices among a variety of value-added models that account for these other factors in different ways. While there is not one right way to calculate value-added, the decision about which student and school characteristics will be taken into account can have important implications for the value-added assessments of teachers.

2) How large are the differences in teacher effectiveness suggested by value-added estimates?

HANUSHEK: It is common today for policy makers to say that teachers are the most important part of schools, but what does that mean? Much of the research on value-added tends to produce hard-to-understand statistical results that give little sense of how different effective and ineffective teachers really are. Not many people have a feel for what, for example, “20 percent of a standard deviation of student achievement growth” might really mean. There have been a couple of attempts to translate such statistical opaqueness into things that have more intuitive meaning. Two translations go pretty far at that. First, “a good teacher gets one and a half years of learning growth out of her students while a bad teacher gets only a half year.” Second, “having a good teacher as opposed to an average teacher for 3-5 years can close the achievement gap between economically disadvantaged kids and more advantaged kids.” Are these reasonable summaries of the research? In a word, yes. In both cases if we take a good teacher as somebody at the 84th percentile (that is, in opaque terms, one standard deviation above the mean in terms of teacher effectiveness), we see the somewhat stunning power of our most effective teachers to improve student outcomes.

Yes, there are some uncertainties in just how different teachers are, but the existing research makes a considerable effort to deal with the biggest challenges to the validity of these estimates. Both of these summaries also relate directly to a common counterargument -- doesn’t the family really determine student achievement? The answer: families are important, but effective teachers can overcome the typical gaps.
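The link between "one standard deviation above the mean" and "the 84th percentile" can be checked with a short calculation. The sketch below assumes that teacher effectiveness and student scores are normally distributed, which is a simplification introduced here and not a claim from the conversation; the helper names are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def percentile_for_sd(z):
    """Percentile rank of someone z standard deviations from the mean."""
    return 100 * standard_normal.cdf(z)

def sd_for_percentile(p):
    """Distance from the mean, in standard deviations, at percentile p."""
    return standard_normal.inv_cdf(p / 100)

# A teacher one standard deviation above the mean.
print(round(percentile_for_sd(1.0), 1))   # 84.1

# A gain of 0.2 student-level standard deviations moves an
# otherwise median student to roughly this percentile.
print(round(percentile_for_sd(0.2), 1))   # 57.9

# A teacher at the 5th percentile, in standard deviations.
print(round(sd_for_percentile(5), 2))     # -1.64
```

Under the normality assumption, "20 percent of a standard deviation" corresponds to a median student moving up about eight percentile points in a single year.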

GOLDHABER: I think there are two other comparisons that are helpful for understanding the import of teacher effectiveness. We know that teachers tend to become more effective with experience early on in their careers, but the impact of value-added differences amongst teachers looks like it swamps the typical gain in teacher productivity. For example, Rick mentioned that the difference between an average teacher and one at the 84th percentile of the performance distribution is on the order of magnitude of 20 percent of a standard deviation of student achievement. This 20 percent of a standard deviation is roughly 4 to 5 times the difference in performance we tend to find for novice teachers as compared to teachers with five or more years of experience. Class size also provides a nice benchmark. The difference between an average and 84th percentile teacher has been shown by a couple of studies to be roughly equivalent to reducing class sizes by 10–15 students. In other words, the statistical results reflect both the rhetoric about the importance of teacher quality and the anecdotal impressions that good teachers really matter.

SASS: Another way to gauge the importance of highly effective teachers is their impact on long-run outcomes for students. A recent study indicates that students who are assigned to high value-added teachers are more likely to attend college, attend higher-ranked colleges and earn more as adults. They are also less likely to have children while in their teens. For example, replacing a teacher whose value-added is in the bottom five percent of all teachers with a teacher who has the average value-added score would increase the present value of a student’s lifetime income by more than $250,000. Combining this with the evidence cited by Rick and Dan, it is pretty clear that good teachers can make a huge impact on children’s lives and that kids tend to do much better with high value-added teachers than with those who rank low on the value-added scale.

KOEDEL: If you don’t believe that differences in teacher value-added matter after reading the responses from Rick, Dan and Tim, I’m not going to be able to convince you. Because differences in teaching effectiveness are so important, the stakes for making good personnel decisions are high. The New Teacher Project recently put out a startling, although not entirely surprising, report (The Irreplaceables) illustrating how schools do not appear to be responding to what we know about how important it is to retain effective teachers and get ineffective teachers out of the classroom. The evidence is mounting that we need to get smarter about this.

3) How good are these measures in distinguishing among high and low performing teachers and predicting the future performance of teachers?

GOLDHABER: Well, the answer as to how good a job value-added does in distinguishing effective and ineffective teachers is definitely in the eye of the beholder. There is little doubt that value-added measures are statistically noisy, so they will vary somewhat from year to year or classroom to classroom, particularly if they are based on small samples. This is why it is typical that less than half of teachers can be distinguished from the mean teacher at conventional levels of statistical significance. But it is not at all clear to me that a "conventional" level of significance (the 95 percent confidence level) is the right standard to use for policy purposes. I say this because it is pretty clear that teachers' value-added is a better predictor of student achievement than other measures of teachers -- licensure, degree and experience levels -- that are typically now used for high-stakes purposes such as employment eligibility, pay, and layoff determination. Studies often find, for instance, that knowing that a teacher has a master's degree is no better than a coin flip in making a judgment about that teacher's impact on student test scores. Thus, if I had to make a bet on a teacher's future performance, judged by their impact on student tests, I'd rather know something about their value-added than any other credential that is currently used.

SASS: As Dan said, value-added estimates are noisy measures of teacher performance. Consequently there is a good deal of variability in a teacher’s value-added from year to year. For example, if you rank teachers based on their value-added estimate in each year, between 10 and 15 percent of the teachers that are ranked in the bottom 20 percent of all teachers one year will end up being ranked among the top 20 percent the next year; about an equal proportion of those ranked in the top 20 percent will fall to the bottom 20 percent the following year. While the true performance of teachers can fluctuate over time, most of this volatility in value-added is due to random fluctuations or “noise” in student test scores. Whether this is “good enough” really depends on the application and the degree to which policymakers are willing to accept some mistakes (e.g. terminating an early-career teacher who might have been good) for the sake of culling out many truly ineffective young teachers.

Although value-added estimates do fluctuate from year to year, the past value-added of a teacher is by far the best predictor of a teacher’s future impact on student achievement. Teacher credentials do little to explain future effects on student achievement. Evaluations by principals or trained observers are not strong predictors of future teacher performance and they provide only modest improvements in predicting future teacher impacts on student achievement compared to using past value-added alone. Thus, while one must be mindful of the shortcomings of value-added measures, there currently are not good alternatives for assessing the current and future performance of teachers.

KOEDEL: Dan and Tim both make the important point that when you compare value-added to the available alternatives, there is nothing that comes close to being as predictive of future teacher performance. I’ll also mention that year-to-year correlations in teacher value-added are not lower than year-to-year correlations of productivity measures in other professions. In baseball, for example, the year-to-year correlations for hitters’ batting averages and pitchers’ earned run averages are between 0.30 and 0.40; these numbers are similar to or even slightly lower than what researchers typically find for the year-to-year correlation in teacher value-added (from standard models).
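The relationship between a year-to-year correlation and how often teachers jump between the bottom and top fifths of the ranking can be explored with a small simulation. This is a sketch under strong assumptions made here for illustration: estimates in the two years are drawn from a bivariate normal distribution with the given correlation, and the function name `bottom_to_top_share` is hypothetical.

```python
import numpy as np

def bottom_to_top_share(correlation, n_teachers=200_000, seed=0):
    """Share of teachers ranked in the bottom 20 percent in year 1 who
    rank in the top 20 percent in year 2, when the two years' estimates
    are bivariate normal with the given correlation."""
    rng = np.random.default_rng(seed)
    year1 = rng.standard_normal(n_teachers)
    independent_part = rng.standard_normal(n_teachers)
    # Construct year 2 so that corr(year1, year2) equals `correlation`.
    year2 = correlation * year1 + np.sqrt(1 - correlation**2) * independent_part
    bottom_year1 = year1 <= np.quantile(year1, 0.20)
    top_year2 = year2 >= np.quantile(year2, 0.80)
    return top_year2[bottom_year1].mean()

# Correlations in the range discussed above, plus a no-signal baseline.
for r in (0.0, 0.30, 0.40):
    print(f"correlation {r:.2f}: share moving bottom to top = "
          f"{bottom_to_top_share(r):.3f}")
```

With zero correlation the share is about 20 percent by construction, so any smaller figure reflects real persistence in the rankings. The simulated shares can be compared with the 10 to 15 percent range Tim cites.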

HANUSHEK: The intensity of discussion of value-added measures stems directly from their use in personnel decisions – and big rewards or dismissals that involve such measures invariably rely on multiple observations. The DC IMPACT system, for example, combines any value-added measures with observations by trained evaluators and looks for two years of consistent good or bad performance. By combining test-based information with other observations and relying on consistent indications of performance, any errors are significantly reduced – almost surely below the errors in baseball contracts that Cory suggests are possible.

4) What do you see as the biggest limitation of these measures?

SASS: There are a number of criticisms of value-added measures: they are imprecise, they focus narrowly on student test scores, and they can only be calculated for a small proportion of teachers. The coverage issue can be dealt with by expanding the range of tested grades and subjects. Likewise, the focus on test scores could be muted by incorporating other metrics when evaluating teachers. Most problematic in my view is the inherent imprecision or “noise” in value-added estimates of teacher performance. Because the student test scores used to calculate value-added can bounce around due to factors unrelated to teacher performance, value-added measures are inherently imprecise. This is particularly troublesome for teachers with few students. The problem diminishes when you base value-added calculations on multiple years of data (and hence on more student test scores). This suggests that lengthening the period before teachers are granted long-term job security or “tenure” makes sense. However, lengthening the time period for evaluation doesn’t work so well if you want to use value-added measures in a “pay for performance” system. If a teacher who was struggling in the classroom works hard to improve their performance, but they are assessed on their average value-added over multiple years, they may not get much of a payoff for their effort.
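The gain from pooling several years of data can be illustrated with the Spearman-Brown formula from measurement theory, which is brought in here as an illustration and is not something the panelists cite. It assumes a teacher's true effectiveness is constant across years and that the noise in each year is independent with equal variance, so the single-year reliability can be read off the year-to-year correlation. The 0.35 input is an illustrative value in the range mentioned in the discussion of question (3).

```python
def multi_year_reliability(single_year_reliability, n_years):
    """Spearman-Brown formula: reliability of the average of n_years
    equally noisy, independent yearly estimates of a fixed true effect."""
    r = single_year_reliability
    return n_years * r / (1 + (n_years - 1) * r)

# Illustrative single-year reliability of 0.35 (an assumption).
for years in (1, 2, 3, 5):
    print(years, round(multi_year_reliability(0.35, years), 2))
# 1 0.35
# 2 0.52
# 3 0.62
# 5 0.73
```

The same averaging that raises reliability also dilutes any one year's improvement, which is the pay-for-performance tension Tim describes: in a three-year average, a single year's gain carries only one third of the weight.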

KOEDEL: I know it is a common complaint that value-added is too narrowly focused on test scores. However, generating higher test scores is an important part of what schools do. So, while I would love to have access to other useful metrics by which teachers can be evaluated, I also think it is OK to ask how well teachers’ students perform on standardized tests. I am not as optimistic as Tim about resolving the “coverage” issue. There are many teachers for whom testing their students does not make sense, and adding more testing is costly in a number of ways. The question of how to incorporate value-added measures into teacher evaluation systems, while at the same time addressing the issue that not all teachers will have these measures, is something that I know state and local education agencies across the country are struggling with. On Tim’s last point about statistical imprecision – like anyone else I do worry about the year-to-year fluctuation in teacher value-added. But per my response to question (3) above, I am not sure that it is reasonable to expect these estimates to be much more precise (conditional on us doing the best we can to limit the influence of statistical noise, like using multiple years of data as Tim suggests). Still, the perfect should not be the enemy of the good here. If more-rigorous performance evaluations are implemented for teachers, based on any metric (or combination of metrics), mistakes will be made. For me, the question is whether the new system will, on net, benefit students despite the mistakes. Holding out for a system that will never make a mistake is not a realistic option.

HANUSHEK: Cory and I clearly agree that expecting a perfect system is crazy. Nowhere is the personnel system perfect – not in sports, not in private industry, not in hedge fund management. On the other hand, the systems in sports, private industry, and hedge funds have in general proved to be very productive. It is common in almost all discussions of value-added measures to go through the litany of potential problems, invariably with the implicit perspective that we have to eliminate these problems before proceeding. But, to me, the better perspective is a comparison to the current system where virtually nothing about student outcomes goes into evaluations and personnel systems.

Don’t get me wrong. I think we should deal with the problems of value added systems when we introduce them into personnel decisions. And, we should continue work at refining whatever measures we use. However, the position that we should have no errors before proceeding implicitly puts all of the weight on “fairness to the adults” and none of the weight on “fairness to the students.”

GOLDHABER: I totally agree with the discussion above: there are limitations to value-added measures and, at the same time, we should not expect a perfect system. No one wants to hear that “mistakes will be made,” but we have to think about mistakes both from the teacher perspective and the student perspective. An evaluation system that is unlikely to differentiate teachers, today’s system in most school districts, offers little downside risk for most teachers in the sense that they will not be unfairly judged to be poor performers. At the same time, exemplary teachers will not be identified as such (if all teachers are judged to be top performing, then the measure really hasn’t identified anyone). The problem for students is that we fail to address teacher performance issues, meaning we are making the mistake of allowing ineffective teaching to go unaddressed. Value-added is an imperfect means of assessing teacher effectiveness. It has limitations, but it also clearly contains important information about teachers, and it would seem negligent not to use this information.
