Irrelevant or Robust? Evaluating Impact without Randomization
Last week I wrote about the invisible monitoring and evaluation decisions made all the time by social enterprises that determine the social impact printed in the glossy annual reports. In that post, I emphasized the importance of developing standards for evaluation design, implementation, and reporting so that managers and funders have a firm grasp on the likely impact of their work, and the likelihood that the estimated impacts are accurate. To date, much of the attention paid to study design has focused on the question of randomization, yet RCTs are simply not within reach for most social enterprises. In this post, I address the potential for rigorous non-randomized studies, focusing on the evaluation choices that dictate whether a study’s results are irrelevant or robust.
William Shadish and colleagues have written a few papers on when and why quasi-experimental and observational studies succeed or fail at producing causal estimates comparable to the randomized gold standard.
I’d recommend the following: “Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within‐study comparisons” and “Can Nonrandomized Experiments Yield Accurate Answers?” (HT: Andrew Gelman)
While previous research (e.g., LaLonde’s terrific 1986 study) appeared to put the nail in the coffin of non-randomized studies, Shadish writes, “It is as though the studies’ main purpose was to test the adequacy of whatever nonexperimental statistical practice for selection bias adjustment seemed current in job training at the time. This is quite different from trying to test best possible quasi-experimental design and analysis practice, as we have done here.”
Shadish elaborates on the difference between the standard and best practice, and provides some basic lessons for the researcher constructing a comparison group for their non-randomized study:
- Don’t use a national dataset to create a comparison group when you’re interested in different “population, settings, and measurement.”
- Don’t let the sample sizes of your comparison conditions be small.
- Don’t use “off-the-shelf” characteristics (covariates), e.g., demographic information, that don’t correlate with your dependent variable (e.g., income); instead, prospectively collect comparison data to find characteristics specific to your research question and treatment.
The first and last points are directly linked, and Shadish continues to critique current econometric practice:
The failure of “off-the-shelf ” covariates indicts much current causal practice in the social sciences, where researchers finish up doing propensity score and OLS analyses of what are poor quasi-experimental designs and impoverished sets of covariates.
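To make the critique concrete, here is a hypothetical simulation (not from Shadish's papers; all variable names and numbers are invented) showing why adjusting for an irrelevant demographic covariate leaves selection bias intact, while adjusting for a covariate that drives both selection and the outcome removes most of it:

```python
# Illustrative sketch: "off-the-shelf" covariates vs. outcome-relevant ones.
# Everything here is made up for illustration.
import numpy as np

rng = np.random.default_rng(2)
n, true_effect = 5000, 2.0

motivation = rng.normal(0, 1, n)   # drives BOTH selection and outcome
age = rng.normal(40, 10, n)        # "off-the-shelf" covariate, irrelevant here
treated = (motivation + rng.normal(0, 1, n) > 0).astype(float)
income = 30 + 4 * motivation + true_effect * treated + rng.normal(0, 2, n)

def ols_effect(covariate):
    # OLS of income on an intercept, treatment, and one covariate;
    # return the estimated treatment coefficient.
    X = np.column_stack([np.ones(n), treated, covariate])
    coef, *_ = np.linalg.lstsq(X, income, rcond=None)
    return coef[1]

print(round(ols_effect(age), 1))         # remains badly biased upward
print(round(ols_effect(motivation), 1))  # close to the true effect of 2.0
```

The point is not that regression adjustment fails in general, but that it can only remove bias carried by covariates that actually predict the outcome and the selection process.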
But there is hope!
Shadish finds that the results are much more promising when a quasi-experiment or observational study follows best practices. For the sake of brevity, I’ll highlight Shadish’s main questions and conclusions.
(1) Do experiments and RD [regression discontinuity, i.e. non-randomized] studies produce comparable effect sizes?
Yes — “… reasonably well-executed experiments and RD studies will often produce similar causal estimates.”
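A sharp RD design can be sketched in a few lines. The following is a hypothetical example (the cutoff, effect size, and bandwidth are all invented), assuming treatment is assigned entirely by whether a "running" variable, such as an eligibility score, crosses a cutoff:

```python
# Minimal sharp regression-discontinuity sketch with simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, cutoff, true_effect = 2000, 50.0, 5.0

score = rng.uniform(0, 100, n)             # running variable
treated = (score >= cutoff).astype(float)  # sharp assignment rule
outcome = 10 + 0.2 * score + true_effect * treated + rng.normal(0, 2, n)

# Local linear regression within a bandwidth around the cutoff:
# outcome ~ 1 + treated + (score - cutoff) + treated * (score - cutoff)
bandwidth = 15.0
mask = np.abs(score - cutoff) <= bandwidth
x = score[mask] - cutoff
t = treated[mask]
X = np.column_stack([np.ones(mask.sum()), t, x, t * x])
coef, *_ = np.linalg.lstsq(X, outcome[mask], rcond=None)
rd_estimate = coef[1]  # estimated jump in the regression line at the cutoff
print(round(rd_estimate, 1))
```

Because assignment near the cutoff is as good as random, the jump at the cutoff recovers a credible causal estimate without randomizing anyone.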
(2) Are estimates similar when an experiment is compared to an observational study whose sampling design uses intact group matching to minimize any initial differences between the intervention and comparison populations on pretest means and/or slopes?
Yes — “… studies predicated on careful intact group matching on pretest measures of outcome will considerably reduce selection bias.”
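The intuition behind matching on a pretest measure of the outcome can be sketched as follows. This is a made-up example, not Shadish's data: treated units self-select toward higher pretest scores, so a naive comparison is biased, while nearest-neighbor matching on the pretest largely removes the bias:

```python
# Sketch: matching comparison units to treated units on a pretest measure.
import numpy as np

rng = np.random.default_rng(1)
true_effect = 3.0

# Treated units self-select toward higher pretest scores (selection bias).
pre_t = rng.normal(55, 8, 300)
pre_c = rng.normal(45, 10, 1500)

# Posttest depends strongly on the pretest, plus the treatment effect.
post_t = 0.9 * pre_t + true_effect + rng.normal(0, 3, 300)
post_c = 0.9 * pre_c + rng.normal(0, 3, 1500)

naive = post_t.mean() - post_c.mean()  # biased: groups differ at pretest

# Nearest-neighbor match (with replacement) on the pretest score.
idx = np.abs(pre_t[:, None] - pre_c[None, :]).argmin(axis=1)
matched = post_t.mean() - post_c[idx].mean()

print(round(naive, 1), round(matched, 1))
```

The pretest works well as a matching variable precisely because it is a strong predictor of the posttest, echoing the covariate-selection lesson above.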
(3) When the treatment and comparison populations obviously differ — the dominant question in within-study comparisons — are statistical adjustments for those differences effective?
Yes — “… understanding, validating, and measuring the selection process will substantially reduce the bias associated with populations that are demonstrably nonequivalent at pretest.”
This is but a very brief introduction to the topic, and I’ll let the more wonky dig into the articles themselves for the details. To conclude: some research questions will entirely frustrate non-randomized attempts to produce reasonable causal estimates. For others, a non-randomized study that meets certain basic standards (of covariate selection, for example) will be quite effective, and perhaps wiser, given cost, ethical concerns, and the like. The emphasis should be on clarifying when and where different studies make the most sense, and on communicating whether a study’s design follows “best practice” (i.e., often comparable to an RCT) or “standard” practice (i.e., more often than not failing to produce comparable estimates).