What Counts as a Big Effect? (I)

I woke up yesterday morning to read Norm Scott’s post on Education Notes Online about a new study of the effects of charter schools on achievement in New York City. The study, by economists Caroline Hoxby and Sonali Murarka, finds a charter school effect of .09 standard deviations per year of treatment in math and .04 standard deviations per year in reading. I haven’t read the study closely yet, but I was struck by Norm’s headline: “Study Shows NO Improvement in NYC Charters Over Public Schools.” The effects that Hoxby and Murarka report are statistically significant, which means that we can reject the claim that they are zero. But are they big? That’s a surprisingly complicated question. I’m going to argue that the answer hinges on “compared to what?”

The standard deviation is a basic measure of how spread out a given attribute, such as a test score, is in a population. When scores are widely spread out away from the average, the standard deviation is large; when the scores are narrowly bunched around the average, the standard deviation is small. Many distributions, whether in nature or by design, take on the shape of a bell curve. The family of such distributions is called the normal distributions, and they have some properties that are really useful for making sense of a given effect.

The figure below shows a standard normal distribution, with a mean of zero and a standard deviation of one. A standard normal distribution is symmetric, with 50% of the cases above the mean and 50% below the mean. About 34% of the cases are between the mean and one standard deviation above the mean, and a similar fraction is between the mean and one standard deviation below the mean. An additional 13% or so on each end are between one and two standard deviations away from the mean, and about 2.5% on each end are more than two standard deviations away from the mean.
What this means is that we can use the standard deviation as a way of thinking about the distance between two groups, expressed as each group’s average percentile in the population. For example, the average difference between Blacks and whites on many standardized tests is about one standard deviation. This means that, regardless of the scale of the test, if the scores take on the shape of a bell curve, and if the typical white student is scoring at the 50th percentile, then the typical Black student is scoring 34 percentiles below that, at around the 16th percentile. That seems like a very large difference, and we can see the impact of differences of this magnitude in the underrepresentation of Blacks in settings where access is based on standardized test performance.
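
For readers who want to check these fractions themselves, here is a minimal sketch in Python. It assumes scores follow a standard normal distribution exactly, which real test scores only approximate, and the helper name `share_between` is my own invention for illustration. The exact values differ slightly from the rounded figures in the text.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_between(lo, hi):
    """Fraction of cases falling between lo and hi standard deviations."""
    return std_normal.cdf(hi) - std_normal.cdf(lo)

within_one_above = share_between(0.0, 1.0)      # roughly a third of cases
one_to_two_above = share_between(1.0, 2.0)      # the next band out
beyond_two_above = 1.0 - std_normal.cdf(2.0)    # the upper tail

# Percentile rank of someone one standard deviation below the mean.
percentile_one_below = 100 * std_normal.cdf(-1.0)

print(f"Mean to +1 SD:  {within_one_above:.1%}")
print(f"+1 SD to +2 SD: {one_to_two_above:.1%}")
print(f"Beyond +2 SD:   {beyond_two_above:.1%}")
print(f"Percentile at -1 SD: {percentile_one_below:.1f}")
```

By symmetry, the same shares apply below the mean.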

So it’s easy to see that a one standard deviation difference is a big difference. What about differences of the magnitude reported by Hoxby and Murarka? Are these large or small? The figure below helps us judge. This figure shows the percentile differences associated with particular differences between two groups: .05 standard deviations, .10 standard deviations, .20 standard deviations, and .50 standard deviations. What’s a little tricky is that the percentile differences depend on where in the distribution we start; they’ll be largest if we start at the middle, and smaller if we start further away from the middle.
In the figure, .05 standard deviations is represented as the distance between the blue column and the red column. If we start away from the middle, at one standard deviation below the mean, a .05 standard deviation difference equals the difference between the 15.9th percentile and the 17.1st percentile, a 1.2 percentile shift. If we start at the middle, a .05 standard deviation difference is the difference between the 50th percentile and the 48th percentile, a shift of about two percentiles. A .10 standard deviation difference (the distance from the blue column to the green one), again starting at one standard deviation below the mean, is the difference between the 15.9th percentile and the 18.4th percentile, a 2.5 percentile shift. Starting at the middle, a .10 standard deviation difference corresponds to the difference between the 50th percentile and the 46th percentile, a shift of about four percentiles.

Larger effect sizes correspond to larger percentile differences. A .20 standard deviation difference, represented in the figure as the distance from the blue column to the yellow column, ranges from 5.3 percentiles to 7.9 percentiles, depending on the starting point. A .50 standard deviation difference, which is half of the performance difference between Black and white children and youth on many standardized tests, is 15 to 19 percentiles.
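
The percentile shifts quoted above can be reproduced with a short calculation. This is a sketch under the same assumption of a perfectly normal distribution; the function name `percentile_shift` and the two starting points (one standard deviation below the mean, and the mean itself) are my choices for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def percentile_shift(start_z, effect_sd):
    """Percentile points gained by moving effect_sd upward from start_z."""
    before = STD_NORMAL.cdf(start_z)
    after = STD_NORMAL.cdf(start_z + effect_sd)
    return 100 * (after - before)

for effect in (0.05, 0.10, 0.20, 0.50):
    from_below = percentile_shift(-1.0, effect)   # start at -1 SD
    from_middle = percentile_shift(0.0, effect)   # start at the mean
    print(f"{effect:.2f} SD: {from_below:.1f} points from -1 SD, "
          f"{from_middle:.1f} points from the mean")
```

Because the bell curve is tallest at the mean, the same effect size always buys the most percentile points when the starting point is the middle of the distribution.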

Subjectively, it seems to me that differences smaller than .10 standard deviations are pretty small. Moving a group of students up 2.5 to four percentiles in a year may be a challenging accomplishment, but it’s not a big move. On the other hand, moving a group up three or four percentiles a year for several years in a row seems like a bigger deal. Even so, five years in a row of an effect of .10 standard deviations would move a typical Black student from the 16th percentile of the population distribution to around the 30th percentile, still well behind the performance of a typical white student.
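
The five-year arithmetic can be sketched as follows. It assumes, purely for illustration, that the annual effect is a constant .10 standard deviations, that the effects simply add up, and that the student starts exactly one standard deviation below the mean; the variable names are mine.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

start_z = -1.0        # one standard deviation below the mean
annual_effect = 0.10  # assumed constant effect per year, in SD units
years = 5

# Under simple accumulation, the yearly effects add up.
end_z = start_z + annual_effect * years

start_percentile = 100 * STD_NORMAL.cdf(start_z)
end_percentile = 100 * STD_NORMAL.cdf(end_z)

print(f"Start: {start_percentile:.1f}th percentile")
print(f"After {years} years: {end_percentile:.1f}th percentile")
```

Under these assumptions the student ends up half a standard deviation below the mean, a little above the 30th percentile.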

And we need to be especially cautious about claims that are made about the cumulation of effects over time. It’s tempting to extrapolate from the effects that are observed in a particular year to what we would see if those effects accumulated over several years. But in most social and educational interventions, effects “fade out” over time, reducing in intensity as time goes on. If the largest effect is observed in the first year of an intervention, it can be substantially misleading to assume that similar effects will be seen in subsequent years.
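
To make the point about fade-out concrete, here is a purely hypothetical sketch. The `retention` value of 0.5, meaning each year's effect is half the size of the previous year's, is an assumption I made up for illustration; it does not come from the Hoxby and Murarka study or from any other estimate.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

start_z = -1.0            # one standard deviation below the mean
first_year_effect = 0.10  # effect observed in year one, in SD units
years = 5
retention = 0.5           # HYPOTHETICAL: each year's effect is half the last

# Naive extrapolation: the first-year effect repeats every year.
constant_total = first_year_effect * years

# Fade-out: yearly effects shrink geometrically from the first year on.
faded_total = sum(first_year_effect * retention ** t for t in range(years))

constant_percentile = 100 * STD_NORMAL.cdf(start_z + constant_total)
faded_percentile = 100 * STD_NORMAL.cdf(start_z + faded_total)

print(f"Constant effects: {constant_total:.3f} SD total, "
      f"{constant_percentile:.1f}th percentile")
print(f"Fading effects:   {faded_total:.3f} SD total, "
      f"{faded_percentile:.1f}th percentile")
```

Whatever the true rate of fade-out, the cumulative gain under shrinking yearly effects falls short of the naive extrapolation.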

I’ll have more to say about what counts as a big effect next week.
