City's value-added initiative an early entrant in an evolving landscape

New York City schools erupted in controversy last week when the school district released its “value-added” teacher scores to the public after a yearlong battle with the local teachers union. The city cautioned that the scores had large margins of error, and many education leaders around the country believe that publishing teachers’ names alongside their ratings is a bad idea.

Still, a growing number of states are now using evaluation systems based on students’ standardized test scores in decisions about teacher tenure, dismissal, and compensation. So how does the city’s formula stack up against the methods used elsewhere?

The Hechinger Report has spent the past 14 months reporting on teacher-effectiveness reforms around the country and has examined value-added models in several states. New York City’s formula, which was designed by researchers at the University of Wisconsin-Madison, has elements that make it more accurate than other models in some respects, but it also has elements that experts say might increase errors — a major concern for teachers whose job security is tied to their value-added ratings.

“There’s a lot of debate about what the best model is,” said Douglas Harris, an expert on value-added modeling at the University of Wisconsin-Madison who was not involved in the design of New York’s statistical formula. The city used the formula from 2007 to 2010 before discontinuing it, in part because New York State announced plans to incorporate a different formula into its teacher evaluation system.

Value-added models use complex mathematics to predict how well a student can be expected to perform on an end-of-the-year test based on several characteristics, such as the student’s attendance and past performance on tests. Teachers whose students take standardized math and English tests (usually fewer than half of the teachers in a district) are held accountable for getting their students to meet those predictions. If a teacher’s students, on average, fall short of their predicted test scores, the teacher is generally labeled ineffective, whereas if they do as well as or better than anticipated, the teacher is deemed effective or highly effective.
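In rough terms, and only as an illustration rather than a description of New York City’s actual formula, the basic calculation can be sketched as a regression: predict each student’s end-of-year score from prior achievement and other characteristics, then average the gaps between actual and predicted scores for each teacher’s students. The column names and the simple linear model below are assumptions made for the sketch.

```python
# A minimal sketch of the value-added idea -- NOT New York City's actual
# formula. Column names and the simple linear model are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

def simple_value_added(students: pd.DataFrame) -> pd.Series:
    """Return an average residual ("value-added" score) per teacher.

    Expects one row per student with columns:
      prior_score, attendance_rate, actual_score, teacher_id
    """
    X = students[["prior_score", "attendance_rate"]]
    y = students["actual_score"]

    # Step 1: estimate how each student would be expected to score,
    # given prior performance and attendance.
    model = LinearRegression().fit(X, y)
    students = students.assign(predicted=model.predict(X))

    # Step 2: the gap between actual and predicted scores is the
    # student-level residual; a teacher's rating is the average
    # residual across that teacher's students.
    students["residual"] = students["actual_score"] - students["predicted"]
    return students.groupby("teacher_id")["residual"].mean()
```

Real models add many refinements to this skeleton, but most of the debates described below come down to two choices it makes explicit: which student characteristics go into the prediction, and how many years of data feed the average.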

A number of states and districts across the country already tie student performance on standardized tests to teacher evaluations; others have plans to do so. Many education reformers, including those in the Obama administration, commend the practice. States were awarded points in the federal Race to the Top grant competition for creating policies that tie student academic growth to teacher evaluations.

In Florida, by 2014, all districts must use value-added ratings for at least half of a teacher’s total evaluation score. Ohio districts will start doing so in 2013. This year in Tennessee, student test-score data will count for 35 percent of each teacher’s evaluation. Value-added ratings make up 20 to 25 percent of New York’s new teacher evaluation framework. And politicians in Nebraska and Louisiana are pushing for these measures to be included in new teacher-evaluation systems.

The new evaluations, which will generally count test scores as one of several measures alongside classroom observations, are increasingly being used in decisions about compensation, retention and tenure.

Advocacy groups like The New Teacher Project, now known as TNTP, and the National Council on Teacher Quality have cheered the inclusion of value-added scores in teacher-evaluation systems. In the past, most teachers were rated based on infrequent, “drive-by” principal observations that resulted in satisfactory ratings for up to 99 percent of teachers. But skeptics, including teachers unions and researchers, say that value-added models have reliability problems.

Depending on which variables are included in a value-added model, the ratings for teachers can vary dramatically, critics say. As an example, researchers at the University of Colorado examined the formula that an economist hired by the Los Angeles Times created to rate teachers there (the economist’s work was funded in part by the Hechinger Institute on Education and the Media). The University of Colorado researchers found that more than a third of L.A. Unified teachers would have had different scores if a slightly different formula had been used.

A 2010 study by Mathematica Policy Research found that the error rate for value-added scores based on three years of data was 25 percent. In other words, a three-year model would rate one out of every four teachers incorrectly. The error rate jumped to 35 percent with only one year of data. The report cautioned against using value-added models for personnel decisions, a position that other experts have echoed.

In New York City, some of the teachers whose scores were published last week received ratings based on multiple years of data, according to a 23-page technical report describing the city’s statistical formula. But other New York City teachers — a spokesperson for the city education department was unable to say exactly how many — were rated based on only one year of data.

Washington, D.C., also uses just one year of student test scores in its statistical model. But the system that Bill Sanders, a researcher known as the “grandfather” of value-added measurement, designed for Tennessee uses five years of data in creating a score for each teacher. To ensure that elementary teachers aren’t judged based on just one or two years of test-score data, the Tennessee model takes into account a student’s performance in later years, Sanders says. For example, third-grade teachers are rated based in part on how their students do in subsequent grades.
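The effect of using more years of data can be illustrated with a small extension of the earlier sketch: if each teacher’s student-level residuals are pooled across several cohorts before averaging, any single unlucky class carries less weight. The data layout below is assumed for illustration; it is not the TVAAS formula Sanders built for Tennessee.

```python
# Illustration of pooling several cohorts of residuals per teacher -- an
# assumed data layout, not the Tennessee (TVAAS) formula. Expects columns
# 'teacher_id', 'year', and 'residual' (actual minus predicted score).
import pandas as pd

def multi_year_rating(residuals: pd.DataFrame, years: int = 5) -> pd.Series:
    """Average each teacher's residuals over the most recent cohorts."""
    cutoff = residuals["year"].max() - (years - 1)
    recent = residuals[residuals["year"] >= cutoff]
    return recent.groupby("teacher_id")["residual"].mean()
```

A five-cohort average draws on many more students than a single-year score, which is consistent with the drop in error rates that studies like Mathematica’s report as more years of data are added.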

“When any one student takes a math test, on any one day, there is a huge uncertainty around that score,” Sanders told The Hechinger Report in an interview last year. “It could be the kid got lucky this year, and guessed two or three right questions. Or the kid this morning could not have been feeling well. Consequently that score on any one day is not necessarily a good reflection of a kid’s attainment level.”

Another question that educators and researchers have debated is whether the statistical models should account for student characteristics that are linked to achievement — for example, poverty, English ability and special education status. In places like Florida and Washington, D.C., value-added models have accounted for such factors, in part because of the limitations of using fewer years of test-score data.

New York City’s model does as well. Variables include race, gender, socio-economic status, and even whole-class characteristics like the size of the class and how many students are new to the city.
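To see what the adjustment debate amounts to in practice, the prediction step in the earlier sketch can be run with or without demographic and classroom variables; the variable names below are illustrative assumptions, not the city’s actual specification.

```python
# Sketch of adding student and classroom covariates to the prediction step --
# an illustration of the adjustment debate, not any district's formula.
import pandas as pd
from sklearn.linear_model import LinearRegression

def predicted_scores(students: pd.DataFrame,
                     adjust_for_demographics: bool) -> pd.Series:
    """Predict end-of-year scores with or without demographic controls."""
    base_cols = ["prior_score", "attendance_rate"]
    demo_cols = ["low_income", "english_learner", "special_ed", "class_size"]
    cols = base_cols + (demo_cols if adjust_for_demographics else [])

    model = LinearRegression().fit(students[cols], students["actual_score"])
    return pd.Series(model.predict(students[cols]), index=students.index)
```

With the extra variables included, a teacher is effectively compared against teachers of demographically similar students; without them, all teachers are measured against the same prediction line. That trade-off is the heart of the disagreement described below.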

Many researchers argue that adjusting for student demographic characteristics is unnecessary because the growth scores are calculated by comparing students against themselves. Sanders and others say that including student characteristics could bias the scores by making it easier for teachers of disadvantaged students to be rated more highly.

A black student, for example, might be expected to do worse than a white student in such a model, an assumption that Sanders says lowers expectations for the black student, along with the teacher who has that student in class.

In New York, highly rated teachers are evenly spread across both low-performing and high-performing schools, which experts say is partly a result of the formula’s adjustments for student demographics. Teachers with demographically similar students, whether low-income, minority, or special-needs, are ranked relative to one another, not against the entire teaching force.

Other researchers argue, however, that factors like student poverty should be taken into account because concentrated poverty is linked to lower student performance, which suggests that a student’s peers may affect how that student does in school and on tests. In other words, a teacher with a large number of disadvantaged students in class may have a harder time earning a high rating than a teacher with fewer such students.

In an attempt to settle the question, Mathematica, the research group, is currently examining the effects of whole-class characteristics on teacher value-added ratings in a study of 30 districts across the country.

Although it gets much less attention, one of the biggest problems with value-added modeling, according to many experts, is that the ratings cover only a fraction of teachers — those whose students take standardized tests in math and English, typically in grades three through eight. As new teacher-evaluation systems go into effect in more districts and states in the next two years, many, including New York City, will be grappling with how to rate everyone else.

Rhode Island is using teacher-created goals on classroom work and tests. Colorado is planning to use off-the-shelf assessments and school-generated methods to gauge how teachers in subjects like physical education and music are performing. In Tennessee, teachers without value-added ratings are graded in part on how the teachers who do receive ratings in their school perform. And Florida is creating more tests, one for every subject and grade level down through kindergarten.

Harris calls Florida “an example of what not to do.” Given the problems with value-added modeling, no matter which formula is used, he suggests that the best use of the ratings may not be in decisions about hiring, firing and tenure. Instead, he says, low-rated teachers could be given more training or more frequent principal observations rather than pink slips.

This story was produced by The Hechinger Report, a nonprofit, nonpartisan education news outlet affiliated with Teachers College, Columbia University.