CALDER Conversations: Test-Based Measures of Teacher Effectiveness

1) Student test-based measures of teacher effectiveness are often called “value-added” measures. Can you explain why?
KOEDEL: The concept of teacher value-added is rooted in the economics literature on production. The idea is that students come to teachers not as blank slates, but as the product of years of prior inputs from home and school. A student’s interactions with his or her teachers “add value” to the larger product (total student achievement). The models attempt to parse out teacher contributions, while acknowledging that much of the variation in student achievement can be explained by other factors. An appealing feature of the value-added approach (and growth modeling more generally) is that it “levels the playing field” in comparisons between teachers who teach different types of students. For example, it is well understood that students who live in high-poverty areas score much lower on standardized tests than their low-poverty counterparts. Unlike evaluations that depend on test-score levels, growth-based evaluations account for the different starting points of different types of students by explicitly controlling for prior student achievement (as well as other student and/or school characteristics).
HANUSHEK: In reality, there are also a variety of different ways to estimate value-added models, and the specific approach can influence the statistical estimates of the contribution of individual teachers. Part of this gets wrapped up in the linkage of value-added estimation to teacher evaluations, a hot topic around state legislatures. When the estimates are used to evaluate individuals, particular care is needed to ensure that the value-added of teachers is separated from the characteristics of the kids in the class, from unusual circumstances in one year, and the like. These issues are important for policy uses and are also the subject of considerable current research.
GOLDHABER: I don’t have much to add to Cory and Rick’s summary of the definition of value-added. I do want to point out that, while we focus a great deal of attention on whether value-added measures are in fact good estimates of teachers’ contributions to student learning, the issue of isolating what teachers contribute is much more general and applies to any means of evaluating them. Take classroom observations, for instance. When an observer assesses what’s going on in a teacher’s classroom, they likely interpret what they are seeing as being primarily about teacher practices, but it is entirely possible that some component of observation ratings is capturing the nature of the students being taught. Classrooms with lots of students who started the year scoring below grade level may well be more disruptive than those whose students entered fully ready to learn the topics taught in a particular grade, and this could influence observational ratings of teachers. My general point is that we need to worry about whether any means of evaluating teachers is really capturing what teachers are doing in the classroom rather than reflecting the kind of students they are responsible for educating.
SASS: I think Cory explained the concept of “value added” well. As Rick and Dan point out, one of the key elements in the value-added approach is controlling for other factors that influence achievement so that we are measuring the teacher’s contribution to student learning and not the “value” being added by parents, school leaders or physical facilities. As states begin to implement value-added measures, they face choices among a variety of value-added models that account for these other factors in different ways. While there is not one right way to calculate value-added, the decision about which student and school characteristics will be taken into account can have important implications for the value-added assessments of teachers.
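As a rough illustration of the idea described above (controlling for prior achievement and other characteristics so that what remains can be attributed to the teacher), one common textbook form of a value-added regression is sketched below. The notation is an assumption introduced here and does not come from the conversation; as the panelists stress, there is no single right specification, and states choose among several.

```latex
A_{it} = \lambda A_{i,t-1} + X_{it}\beta + \tau_{j(i,t)} + \varepsilon_{it}
```

Here $A_{it}$ is the test score of student $i$ in year $t$, $A_{i,t-1}$ is the same student's prior-year score, $X_{it}$ collects whatever student and school characteristics the model includes, $\tau_{j(i,t)}$ is the effect of the teacher $j$ assigned to student $i$ in year $t$ (the "value added"), and $\varepsilon_{it}$ is the remaining noise. Which variables enter $X_{it}$ is exactly the kind of modeling choice that can change the resulting teacher estimates.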
2) How large are the differences in teacher effectiveness suggested by value-added estimates?
HANUSHEK: It is common today for policy makers to say that teachers are the most important part of schools, but what does that mean? Much of the research on value-added tends to produce hard-to-understand statistical results that give little sense of how different effective and ineffective teachers really are. Not many people have a feel for what, for example, “20 percent of a standard deviation of student achievement growth” might really mean. There have been a couple of attempts to translate such statistical opaqueness into things that have more intuitive meaning. Two translations go pretty far toward that. First, “a good teacher gets one and a half years of learning growth out of her students while a bad teacher gets only a half year.” Second, “having a good teacher as opposed to an average teacher for 3 to 5 years can close the achievement gap between economically disadvantaged kids and more advantaged kids.” Are these reasonable summaries of the research? In a word, yes. In both cases, if we take a good teacher as somebody at the 84th percentile (that is, in opaque terms, one standard deviation above the mean in terms of teacher effectiveness), we see the somewhat stunning power of our most effective teachers to improve student outcomes.
GOLDHABER: I think there are two other comparisons that are helpful for understanding the import of teacher effectiveness. We know that teachers tend to become more effective with experience early on in their careers, but the impact of value-added differences among teachers looks like it swamps the typical gain in teacher productivity. For example, Rick mentioned that the difference between an average teacher and one at the 84th percentile of the performance distribution is on the order of 20 percent of a standard deviation of student achievement.
This 20 percent of a standard deviation is roughly 4 to 5 times the difference in performance we tend to find for novice teachers as compared to teachers with five or more years of experience. Class size also provides a nice benchmark. The difference between an average and an 84th percentile teacher has been shown by a couple of studies to be roughly equivalent to reducing class sizes by 10 to 15 students. In other words, the statistical results reflect both the rhetoric about the importance of teacher quality and the anecdotal impressions that good teachers really matter.
SASS: Another way to gauge the importance of highly effective teachers is their impact on long-run outcomes for students. A recent study indicates that students who are assigned to high value-added teachers are more likely to attend college, attend higher-ranked colleges and earn more as adults. They are also less likely to have children while in their teens. For example, replacing a teacher whose value-added is in the bottom five percent of all teachers with a teacher who has the average value-added score would increase the present value of a student’s lifetime income by more than $250,000. Combining this with the evidence cited by Rick and Dan, it is pretty clear that good teachers can make a huge impact on children’s lives and that kids tend to do much better with high value-added teachers than with those who rank low on the value-added scale.
KOEDEL: If you don’t believe that differences in teacher value-added matter after reading the responses from Rick, Dan and Tim, I’m not going to be able to convince you. The fact that differences in teaching effectiveness are so important raises the stakes for making good personnel decisions.
The New Teacher Project recently put out a startling, although not entirely surprising, report illustrating how schools do not appear to be responding to what we know about how important it is to retain effective teachers and get ineffective teachers out of the classroom (The Irreplaceables). The evidence is mounting that we need to get smarter about this.
3) How good are these measures at distinguishing between high- and low-performing teachers and at predicting the future performance of teachers?
GOLDHABER: Well, the answer as to how good a job value-added does in distinguishing effective and ineffective teachers is definitely in the eye of the beholder. There is little doubt that value-added measures are statistically noisy, so they will vary somewhat from year to year or classroom to classroom, particularly if they are based on small samples. This is why fewer than half of teachers can typically be distinguished from the mean teacher at conventional levels of statistical significance. But it is not at all clear to me that a “conventional” level of significance (the 95 percent confidence level) is the right standard to use for policy purposes. I say this because it is pretty clear that teachers’ value-added is a better predictor of student achievement than the other measures of teachers (licensure, degree and experience levels) that are typically now used for high-stakes purposes such as employment eligibility, pay, and layoff determination. Studies often find, for instance, that knowing that a teacher has a master’s degree is no better than a coin flip in making a judgment about that teacher’s impact on student test scores. Thus, if I had to make a bet on a teacher’s future performance, judged by their impact on student tests, I’d rather know something about their value-added than any other credential that is currently used.
SASS: As Dan said, value-added estimates are noisy measures of teacher performance.
Consequently, there is a good deal of variability in a teacher’s value-added from year to year. For example, if you rank teachers based on their value-added estimate in each year, between 10 and 15 percent of the teachers who are ranked in the bottom 20 percent of all teachers one year will end up being ranked among the top 20 percent the next year; about an equal proportion of those ranked in the top 20 percent will fall to the bottom 20 percent the following year. While the true performance of teachers can fluctuate over time, most of this volatility in value-added is due to random fluctuations or “noise” in student test scores. Whether this is “good enough” really depends on the application and the degree to which policymakers are willing to accept some mistakes (e.g., terminating an early-career teacher who might have been good) for the sake of culling out many truly ineffective young teachers. Although value-added estimates do fluctuate from year to year, the past value-added of a teacher is by far the best predictor of a teacher’s future impact on student achievement. Teacher credentials do little to explain future effects on student achievement. Evaluations by principals or trained observers are not strong predictors of future teacher performance, and they provide only modest improvements in predicting future teacher impacts on student achievement compared to using past value-added alone. Thus, while one must be mindful of the shortcomings of value-added measures, there currently are not good alternatives for assessing the current and future performance of teachers.
KOEDEL: Dan and Tim both make the important point that when you compare value-added to the available alternatives, there is nothing that comes close to being as predictive of future teacher performance. I’ll also mention that year-to-year correlations in teacher value-added are not lower than year-to-year correlations of productivity measures in other professions.
In baseball, for example, the year-to-year correlations for hitters’ batting averages and pitchers’ earned run averages are between 0.30 and 0.40; these numbers are similar to or even slightly lower than what researchers typically find for the year-to-year correlation in teacher value-added (from standard models).
HANUSHEK: The intensity of discussion of value-added measures stems directly from their use in personnel decisions, and big rewards or dismissals that involve such measures invariably rely on multiple observations. The DC IMPACT system, for example, combines any value-added measures with observations by trained evaluators and looks for two years of consistent good or bad performance. By combining test-based information with other observations and relying on consistent indications of performance, any errors are significantly reduced, almost surely below the errors in baseball contracts that Cory suggests are possible.
4) What do you see as the biggest limitation of these measures?
SASS: There are a number of criticisms of value-added measures: they are imprecise, they focus narrowly on student test scores, and they can only be calculated for a small proportion of teachers. The coverage issue can be dealt with by expanding the range of tested grades and subjects. Likewise, the focus on test scores could be muted by incorporating other metrics when evaluating teachers. Most problematic in my view is the inherent imprecision or “noise” in value-added estimates of teacher performance. Because the student test scores used to calculate value-added can bounce around due to factors unrelated to teacher performance, value-added measures are inherently imprecise. This is particularly troublesome for teachers with few students. The problem diminishes when you base value-added calculations on multiple years of data (and hence on more student test scores). This suggests that lengthening the period before teachers are granted long-term job security or “tenure” makes sense.
However, lengthening the time period for evaluation doesn’t work so well if you want to use value-added measures in a “pay for performance” system. If a teacher who was struggling in the classroom works hard to improve their performance, but they are assessed on their average value-added over multiple years, they may not get much of a payoff for their effort.
KOEDEL: I know it is a common complaint that value-added is too narrowly focused on test scores. However, generating higher test scores is an important part of what schools do. So, while I would love to have access to other useful metrics by which teachers can be evaluated, I also think it is OK to ask how well teachers’ students perform on standardized tests. I am not as optimistic as Tim about resolving the “coverage” issue. There are many teachers for whom testing their students does not make sense, and adding more testing is costly in a number of ways. The question of how to incorporate value-added measures into teacher evaluation systems, while at the same time addressing the issue that not all teachers will have these measures, is something that I know state and local education agencies across the country are struggling with. On Tim’s last point about statistical imprecision: like anyone else, I do worry about the year-to-year fluctuation in teacher value-added. But per my response to question (3) above, I am not sure that it is reasonable to expect these estimates to be much more precise (conditional on us doing the best we can to limit the influence of statistical noise, like using multiple years of data as Tim suggests). Still, the perfect should not be the enemy of the good here. If more rigorous performance evaluations are implemented for teachers, based on any metric (or combination of metrics), mistakes will be made. For me, the question is whether the new system will, on net, benefit students despite the mistakes. Holding out for a system that will never make a mistake is not a realistic option.
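The point about noise and multiple years of data can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model from the research discussed: each hypothetical teacher has a fixed true effect, each year's estimate adds independent noise, and the noise level (`sd_noise=1.3` against `sd_true=1.0`) is chosen only so that the expected year-to-year correlation, 1 / (1 + 1.3²) ≈ 0.37, falls in the 0.30 to 0.40 range mentioned in the conversation. All function and parameter names are invented for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_teachers=5000, n_years=6, sd_true=1.0, sd_noise=1.3, seed=0):
    """Return (true_effects, estimates) for hypothetical teachers.

    Assumptions: true effects are constant over time and the yearly
    noise is independent across years; both are simplifications.
    """
    rng = random.Random(seed)
    true_effects = [rng.gauss(0.0, sd_true) for _ in range(n_teachers)]
    estimates = [
        [effect + rng.gauss(0.0, sd_noise) for _ in range(n_years)]
        for effect in true_effects
    ]
    return true_effects, estimates


def average_first_years(estimates, k):
    """Average each teacher's first k yearly estimates."""
    return [sum(row[:k]) / k for row in estimates]


if __name__ == "__main__":
    true_effects, estimates = simulate()
    year1 = [row[0] for row in estimates]
    year2 = [row[1] for row in estimates]
    print("year-to-year correlation:", round(pearson(year1, year2), 2))
    for k in (1, 2, 3, 5):
        r = pearson(average_first_years(estimates, k), true_effects)
        print(f"{k}-year average vs. true effect:", round(r, 2))
```

In this setup, averaging more years tracks the true effect more closely, which is the statistical case for longer evaluation windows. The sketch deliberately holds true effects constant, so it cannot show the cost Tim raises: a multi-year average is slow to reflect a teacher who genuinely improves.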
HANUSHEK: Cory and I clearly agree that expecting a perfect system is crazy. Nowhere is the personnel system perfect: not in sports, not in private industry, not in hedge fund management. On the other hand, the systems in sports, private industry, and hedge funds have in general proved to be very productive. It is common in almost all discussions of value-added measures to go through the litany of potential problems, invariably with the implicit perspective that we have to eliminate these problems before proceeding. But, to me, the better perspective is a comparison to the current system, where virtually nothing about student outcomes goes into evaluations and personnel systems. Don’t get me wrong. I think we should deal with the problems of value-added systems when we introduce them into personnel decisions. And we should continue working to refine whatever measures we use. However, the position that we should have no errors before proceeding implicitly puts all of the weight on “fairness to the adults” and none of the weight on “fairness to the students.”
GOLDHABER: I totally agree with the discussion above: there are limitations to value-added measures and, at the same time, we should not expect a perfect system. No one wants to hear that “mistakes will be made,” but we have to think about mistakes both from the teacher perspective and the student perspective. An evaluation system that is unlikely to differentiate teachers (today’s system in most school districts) offers little downside risk for most teachers in the sense that they will not be unfairly judged to be poor performers. At the same time, exemplary teachers will not be identified as such (if all teachers are judged to be top performing, then the measure really hasn’t identified anyone). The problem for students is that we fail to address teacher performance issues, meaning we are making the mistake of allowing ineffective teaching to go unaddressed.
Value-added is an imperfect means of assessing teacher effectiveness. It has limitations, but it also clearly contains important information about teachers, and it would seem negligent not to use this information.