CALDER Conversations: Evaluating Teacher Training Programs

Thursday, July 19, 2012

Presumably, one way to increase teacher effectiveness, the most important factor affecting student achievement, is through improved pre-service teacher training. More than 1,500 programs in the country currently provide pre-service training for teachers. We know little, however, about the effectiveness of these teacher training programs (TTPs) and, more importantly, there is little consensus on the best way to measure TTP effectiveness. CALDER has produced four working papers examining TTPs. The authors of these papers discuss below both the conceptual and the technical challenges of examining the effectiveness of TTPs. See the working paper listed below each author for the full CALDER working papers on TTPs.

Dan Goldhaber
University of Washington
CALDER Working Paper 65

Cory Koedel
University of Missouri
CALDER Working Paper 79

Susanna Loeb
Stanford University
CALDER Working Paper 20

Tim Sass
Georgia State University
CALDER Working Paper 63

1)  How much do TTPs vary in terms of their effectiveness?  

GOLDHABER:  Well, I want to begin my answer with a clarification about what we mean by the effectiveness of TTPs. I suspect most people think of a TTP "effect" as the impact of the training prospective teachers receive while in college (or an alternative program). But in practice, the TTP effect is a combination of the selection of teacher candidates by TTPs and the training they receive. As researchers, we may try to disentangle what is a selection effect (which might be related to the academic preparation of the individuals who attend different training programs) from what is a training effect, but in practice this is very difficult to do. What we are able to do through research is suggest whether teachers who get their credentials from different programs appear to be differentially effective (as measured by their value-added).

With that said, let me try to succinctly answer your question. In Washington State, and elsewhere, it looks like there is relatively little variation in teacher training program effects. Few teacher program indicators show up as being significantly different from teachers receiving their credential from out of state. And, as we often find in education research, there is far more variation within, than between, programs in estimated teacher effectiveness. Having said that, there are cases where the differences between programs look to be educationally meaningful. For example, we find, all else equal, that the difference between teachers credentialed from the average program and the one judged to be most effective is about as large as the regression-difference between students who are and are not receiving free or reduced price lunch, and the effects are larger than the typical productivity gains associated with early career teaching experience. 

KOEDEL: In our evaluation of Missouri TTPs, where we focus on traditional programs, the answer is very little, if at all. Like in Washington, it appears that the overwhelming majority of the variance in teaching effectiveness comes from within-program differences between teachers. In our paper, we caution administrators against overvaluing TTP rankings based on value-added. Because there is so much variation within programs, lower-ranked programs produce many teachers that are more effective than the average teacher from higher-ranked programs. A qualification to our findings is that we evaluate traditional programs. Evaluations that consider a more heterogeneous group, including alternative-certification programs, may be more likely to find differences across TTPs.

SASS:  Our findings in Florida mirror what Dan and Cory have found in Washington State and Missouri. If we judge traditional preparation programs based on the average value-added of their graduates, most are not significantly different from the statewide average. Of 33 programs we looked at, only 2 or 3 at the top and bottom consistently stood out from the rest. For these high- and low-performing programs, the difference in value-added is meaningful, but not huge, something on the order of twice the difference in effectiveness between a rookie teacher and one with 3-5 years of experience. Ongoing work on alternative certification programs suggests some substantial differences between these programs and traditional TTPs.  In one alternative certification program the graduates appear to be outperforming graduates of traditional TTPs, whereas the opposite is true for a different type of alternative certification route.  Clearly there is significant variation among alternative-certification programs.

LOEB:  Unlike the studies described above, studies in New York City have found differences across teacher preparation programs that can be large - even larger than the differences between the most successful alternative route program and the average of more traditional programs.  Moreover, within a large alternative route pathway that uses multiple university settings for their training, we found meaningful differences across programs.  The evidence also suggests that programs with certain characteristics, such as greater oversight of student teaching experiences, provide higher value-added teachers.  We certainly can't rule out from the studies that we have done in New York City or from the studies described above in other states that there are substantial differences across programs.  Some of these differences will not be statistically significant at traditional levels used for academic research, but that does not mean that we can't learn from the average differences across programs that can help in the improvement of both the selection and development of teachers prior to when they enter the classroom.
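The "within versus between" comparison that several panelists mention can be made concrete with a small variance decomposition. What follows is only an illustrative sketch: the program names and value-added scores are invented for the example, and the function name `variance_decomposition` is hypothetical. It is not the estimation approach used in any of the CALDER working papers, which rely on regression-based value-added models.

```python
from statistics import mean

def variance_decomposition(scores_by_program):
    """Split the total (population) variance of teacher value-added
    into a between-program part and a within-program part."""
    all_scores = [x for scores in scores_by_program.values() for x in scores]
    n = len(all_scores)
    grand_mean = mean(all_scores)
    # Between: how far each program's average sits from the overall average,
    # weighted by the number of teachers in the program.
    between = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in scores_by_program.values()
    ) / n
    # Within: how far each teacher sits from her own program's average.
    within = sum(
        (x - mean(scores)) ** 2
        for scores in scores_by_program.values()
        for x in scores
    ) / n
    return between, within

# Hypothetical value-added scores (in student test-score standard deviations).
scores = {
    "Program A": [-0.30, -0.10, 0.05, 0.20, 0.35],
    "Program B": [-0.25, -0.05, 0.00, 0.15, 0.25],
    "Program C": [-0.35, -0.15, 0.00, 0.10, 0.30],
}

between, within = variance_decomposition(scores)
share_between = between / (between + within)

# Teachers from the lowest-average program (C) who beat the
# average teacher from the highest-average program (A).
n_above = sum(x > mean(scores["Program A"]) for x in scores["Program C"])

print(f"Share of variance between programs: {share_between:.1%}")
print(f"Program C teachers above Program A's average: {n_above} of 5")
```

With these made-up numbers, well under 5 percent of the variance lies between programs, and two of Program C's five teachers outperform Program A's average teacher, which is the pattern behind the caution against over-reading program rankings.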

2)  Should "selection" (who is admitted to a TTP) be considered part of the program?  If not, how might we rule out "selection effects" as the explanation for TTP effects?

KOEDEL: It depends on who is asking the question. If a district administrator or state official is asking, then the answer is yes, absolutely. These individuals should not be concerned with why teachers from one program are better than teachers from another – their objective should simply be to put the best teachers in front of students. Some programs might produce effective teachers simply by selecting them on the front end, and this shouldn't be a problem. In fact, given what we know about the observable skills of entering education majors relative to other majors, initial selection is a potentially important area for improvement.

Researchers, on the other hand, should be interested in both selection and non-selection effects. On the selection side, if some programs recruit stronger entrants than others we should be identifying these differences and asking why. As for non-selection effects – which we might call TTP value added – learning about these effects would be useful for improving program efficacy. For example, holding selection fixed, if we can identify some TTPs as providing more value-added than others, we can replicate the more effective programs and scale back the less effective programs.

SASS:  That's right.  If you want to know where to hire the best teachers, there is no need to distinguish between selecting good students and training them well.  However, if you want to understand how best to train prospective teachers by comparing TTPs, we need to account for the ability of candidates entering different programs.  Controlling for candidate selection is not easy, however.  We still don't have a good handle on what personal attributes make for a good teacher.  The relative success of Teach-for-America teachers in math and science suggests that the sorts of traits that land students in top-tier universities, like intellectual ability and motivation, matter for teachers in those subjects.  However, there is little direct evidence linking pre-college measures of ability to eventual performance as a teacher.  Part of the problem is a lack of data; we only have relatively crude measures like SAT/ACT scores or high-school grade point averages.  Even those measures are only available in a handful of states.

LOEB:  So much for controversy.  I agree.  If I were hiring teachers, I would want the best teachers I could get regardless of whether they are great because they had abilities prior to entering teacher preparation or because of the experiences that they had during their preparation.  However, if I were designing a preparation program I would want to know the best training that I could provide to the individuals that I attract and select into my program.  In order to figure out what preparation is most useful (and what preparation is most useful for what types of teachers), I would need to separate the effects of preparation from the selection into the program.  In addition, if I were designing a whole system of teacher preparation, instead of a single program, I would want to know how I can best use the resources available to attract and prepare teachers so that I have the best teachers in classrooms with students.  This goal again requires separating selection from preparation in understanding the differences in the effectiveness of teachers coming from different preparation programs.

GOLDHABER: Yup, I basically agree with everything that has been said. The only thing that I'll stress is that for TTP accountability (i.e. program accreditation) we would want to consider selection as part of the TTP effect. State credentialing bodies are charged with making sure that training institutions produce teachers that meet minimum quality standards. If they do this by beating the bushes to get good teacher candidates into their program, then great. If instead they admit candidates that are lacking skills, but do an excellent job training them, that's terrific too. The bottom line is that we want training institutions to wrestle with how they can most cost-effectively produce teachers that meet the quality standard (note that it is highly debatable whether such a standard actually exists today).

Now in thinking about what is socially optimal, we clearly want to distinguish between selection and training effects. If, for instance, we were to learn that differences between TTPs are solely driven by selection (i.e. purely a sheepskin effect), we might think there are better ways to assess who would make for a good teacher than requiring that they graduate from an approved training program. If, on the other hand, differences are driven more by training, then there is a lot that might be learned about what might improve the quality of the existing teacher workforce.

3)  How well a teacher performs is likely a function of not only the skills and abilities the teacher brings with her, but also a function of school-level factors that might impede or promote her effectiveness. What are the ways we can take this school-level variation into account in an effort to assess TTP effects fairly?

SASS:  There are two ways you can go about this.  One is to statistically control for things we can observe about a school.  This could include factors like the experience of the principal, the proportion of kids at the school receiving free/reduced-price lunch, the proportion of English language learners at the school and so forth.  With this approach we would essentially be comparing the performance of teachers from different TTPs who teach in schools that appear similar based on observable characteristics of the schools.  The downside of this method is that we may not be taking into account important school characteristics that do not appear in administrative records, such as the degree of parental involvement.  The second method would be to compare the performance of graduates from different TTPs who are teaching in the same school (known as the "school fixed effects" method in statistical jargon).  The problem with this approach is that many schools will not have recent graduates from multiple TTPs teaching in their school at the same time.  Also, if schools tend to hire a given quality of teacher, you could end up comparing the best graduates from a mediocre TTP to the worst graduates from a top-notch TTP; the TTPs would end up ranked as equivalents, when in fact they are not.

LOEB:  That description sums it up well.  The difficulty of separating school quality from the effectiveness of individual teachers is not specific to understanding the strengths of preparation programs.  Some schools will support teachers better than other schools do.  Within schools, some grade levels or subject areas will have better supports for their teachers than others.  If one program systematically sorts its teachers into teaching positions with poor support structures or more difficult environments for teaching, they will look worse than they are unless we are able to take those differences into account.  Some of the differences we can adjust for in creating value-added measures and some we will miss.  We should take the imprecision and potential bias into account when considering how to use these measures.  However, if we are interested in assessing the effectiveness of different programs then we will need some information in order to do so.  Value-added measures provide one source of this information, with the benefit of directly measuring student outcomes that we care about.  There are other sources of information such as aspiring teachers' evaluation of their own program or outside assessors' observational assessments of the program.  These other sources of information likely have bias and imprecision problems also, but they too can be informative if used together and with healthy understanding of their strengths and weaknesses.

GOLDHABER: Unfortunately this is an area where I think we hit the empirical wall. As is noted above, we can estimate models that account for all time-invariant school-level effects, but these models may not be generalizable due to the way teachers from certain programs are sorted into particular schools. The hope would be that TTP findings would be pretty robust to model specification, indicating that sorting is not an issue we need to worry about too much. But, in the analysis we did in Washington State, we find that despite the fact that there are multiple TTP graduates in most schools, the school fixed effects models do seem to influence the estimates of at least some TTP effects. Unfortunately, I don't think this problem will be easily resolved as it's hard to imagine that we will ever see an experiment where prospective teachers from particular training programs are randomly assigned to schools!

KOEDEL: The previous responses sum up the main issues and where we're at in terms of research. It is worth mentioning that controlling for observable school characteristics in these models may be sufficient. One finding from our study is that when we compare models that do not include school fixed effects, but differ in terms of whether they control for observable school characteristics, we get very similar TTP estimates. If one believes that sorting on observables is informative about sorting on unobservables, this finding is consistent with the hypothesis that school-level factors do not introduce substantial bias into the TTP VAM measures. That said, in accordance with the above commentary, I do not know how a direct test could be performed. This is a very challenging problem given the type of data available for these analyses.
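To illustrate the difference between a pooled comparison and the "school fixed effects" comparison discussed in this section, here is a minimal sketch for two programs. Everything in it is an assumption made for illustration: the schools, the program labels "A" and "B", the value-added numbers, and the function names `pooled_gap` and `within_school_gap`. The within-school estimate is computed by subtracting each school's average from both the program indicator and the outcome, which is one standard way to obtain a fixed effects estimate; the panelists' actual models include many more controls.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (school, program, teacher value-added).
# Program B graduates are concentrated in the more difficult School 2.
teachers = [
    ("School 1", "A", 0.30), ("School 1", "A", 0.20), ("School 1", "B", 0.30),
    ("School 2", "A", -0.10), ("School 2", "B", -0.10), ("School 2", "B", -0.20),
    ("School 3", "A", 0.10), ("School 3", "A", 0.00),
]

def pooled_gap(rows):
    """Raw difference in average value-added, Program B minus Program A,
    ignoring which schools the teachers work in."""
    a = [va for _, prog, va in rows if prog == "A"]
    b = [va for _, prog, va in rows if prog == "B"]
    return mean(b) - mean(a)

def within_school_gap(rows):
    """Program B minus Program A, using only within-school variation.
    Both the Program B indicator and the outcome are demeaned by school."""
    by_school = defaultdict(list)
    for school, prog, va in rows:
        by_school[school].append((1.0 if prog == "B" else 0.0, va))
    num = den = 0.0
    for members in by_school.values():
        d_bar = mean(d for d, _ in members)
        y_bar = mean(y for _, y in members)
        for d, y in members:
            # A school staffed by a single program has d == d_bar for every
            # teacher, so it contributes nothing to the estimate.
            num += (d - d_bar) * (y - y_bar)
            den += (d - d_bar) ** 2
    return num / den

print(f"Pooled gap (B - A):        {pooled_gap(teachers):+.3f}")
print(f"Within-school gap (B - A): {within_school_gap(teachers):+.3f}")
```

In this invented example the pooled gap is about -0.10, while the within-school gap is approximately zero: Program B looks worse only because its graduates teach in the harder school. School 3, which employs graduates of a single program, drops out of the within-school comparison entirely, which is the data limitation raised above.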

4)  What research should be done next to increase policymakers' understanding of the extent to which, and the ways in which, TTPs contribute to teacher effectiveness?

LOEB:  Identifying programs that produce effective teachers could be useful for those hiring teachers and licensing programs, but, as noted above, knowing which programs produce better teachers is not the same as knowing which programs provide experiences that improve their students' teaching abilities or which programs are particularly good at recruitment.  In order to improve both recruitment and preparation, we need to know what works.  There is so much selection in the process, and so many factors of selection and preparation vary at the same time, that basic regression approaches can only go part way in identifying effective practice.  Not all areas of education policy are ripe for randomized controlled trials, but teacher preparation is one that is.  Randomly assigning potentially beneficial preparation experiences could give us far more information on the importance of those experiences than we can get from extant data.  If I were choosing, I'd try random assignment of prospective teachers to student teaching experiences that differed in types of schools, cooperating teachers and length; and I would randomly assign specific instruction in teaching techniques developed to serve English learners.  Others would have other preferences.

GOLDHABER:   I think there is a tremendous amount of important research to be done about teacher training, at least for all those who believe that pre-service training has the potential to improve an individual's ability to teach. Susanna points out one way, random assignment, we might learn more about training effects, but we also could learn a great deal if we had better systematic information about the features of teacher training. Other than a very small academic literature (most of which has been authored by Susanna and colleagues), we know very little about the associations between, for instance, student teaching experiences, the quality of instruction at teacher training programs, etc. and the effectiveness of teachers once they're in the workforce. While I agree that experiments would help us to separate training and selection effects, I think we could learn a lot more now about the potential for different training to impact effectiveness, if (an important caveat) information about what happens while individuals are enrolled in teacher training was more readily available.

KOEDEL: I agree that our understanding of the internal workings of these programs is too limited. A question that I feel should be high on the priority list is this: how are student-teaching mentors selected, monitored and rated? I also believe that a better understanding of the selection process that determines who enters TTPs would be valuable. For example, do some programs systematically recruit/admit higher-skilled individuals relative to others? If so, I would like to know which programs are the most successful along this dimension, and why they are successful.

SASS:  I agree that the whole area of pre-service teacher training has been understudied, and the payoff to understanding both the selection of teachers into programs and the type of training they receive while in a TTP is potentially huge. I like Susanna's idea of setting up some trials where potential teachers are randomly assigned to different training programs. However, I think this approach needs to be coupled with some bold experiments in the way teachers are trained.  My impression is that, at least on the surface, most traditional TTPs look pretty similar. To really understand what works and what doesn't, we need TTPs to try some radically different models and have them rigorously evaluated.  For example, one could cut back on education theory courses and greatly increase the amount of time spent on student teaching experiences.  For middle and high school teachers, one could require more subject matter coursework in the relevant department, such as math courses from the mathematics department.  Perhaps the whole education major could be shortened.  I don't think that any of us knows what an ideal training program would look like (I know I don't), but until we try some substantial changes from the status quo, we won't know how we can better select and prepare teachers.