Balancing bias and variance in the design of behavioral studies: The importance of careful measurement in randomized experiments

Andrew Gelman

The Centre for Central Banking Studies recently hosted their annual Chief Economists Workshop, whose theme was “What can policymakers learn from other disciplines?”.  In this guest post, one of the keynote speakers at the event, Andrew Gelman, professor of statistics and political science at Columbia University, points out some of the pitfalls of randomized experiments with control groups.

When studying the effects of interventions on individual behavior, the experimental research template is typically:  Gather a bunch of people who are willing to participate in an experiment, randomly divide them into two groups, assign one treatment to group A and the other to group B, then measure the outcomes.  If you want to increase precision, do a pre-test measurement on everyone and use that as a control variable in your regression.  But in this post I argue for an alternative approach: study individual subjects using repeated measures of performance, with each one serving as their own control.
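The standard template can be sketched as a small simulation (all the numbers here are invented for illustration, not taken from any real study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical number of subjects

# Each person has an underlying baseline; the pre-test measures it with noise
baseline = rng.normal(50, 10, size=n)
pretest = baseline + rng.normal(0, 5, size=n)

# Randomly divide subjects into treatment (1) and control (0) groups
treat = rng.permutation(np.repeat([1, 0], n // 2))

# Outcome: baseline, plus a true treatment effect, plus measurement noise
effect = 2.0
outcome = baseline + effect * treat + rng.normal(0, 5, size=n)

# Estimate the treatment effect by least-squares regression,
# using the pre-test as a control variable to increase precision
X = np.column_stack([np.ones(n), treat, pretest])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(f"estimated treatment effect: {coef[1]:.2f}")
```

The pre-test soaks up much of the person-to-person variation, shrinking the standard error of the estimated effect relative to a raw difference in group means.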

As long as your design is not constrained by ethics, cost, realism, or a high drop-out rate, the standard randomized experiment approach gives you clean identification.  And, by ramping up your sample size N, you can get all the precision you might need to estimate treatment effects and test hypotheses.  Hence, this sort of experiment is standard in psychology research and has been increasingly popular in political science and economics with lab and field experiments.

However, the clean simplicity of such designs has led researchers to neglect important issues of measurement, as Matthew Normand points out in a recent paper, “Less Is More: Psychologists Can Learn More by Studying Fewer People,” which begins:

Psychology has been embroiled in a professional crisis as of late. . .   one problem has received little or no attention: the reliance on between-subjects research designs. The reliance on group comparisons is arguably the most fundamental problem at hand . . .

But there is an alternative.  Single-case designs involve the intensive study of individual subjects using repeated measures of performance, with each subject exposed to the independent variable(s) and each subject serving as their own control. . . 

Normand talks about “single-case designs,” which we also call “within-subject designs.”  (Here we’re using experimental jargon in which the people participating in a study are called “subjects.”)  Whatever terminology is being used, I agree with Normand.  This is something Eric Loken and I have talked about a lot, that many of the horrible Psychological Science-style papers we’ve discussed use between-subject designs to study within-subject phenomena.
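To see why this matters for precision, here is a toy simulation (again with made-up numbers): when variation between people dwarfs variation within a person, a within-subject comparison is far more precise than a between-subject comparison using the same total number of measurements.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n = 2000, 50
effect = 1.0
between_sd, within_sd = 10.0, 1.0  # people differ a lot; repeat measures don't

between_est, within_est = [], []
for _ in range(n_sims):
    # Between-subject: 2n people, one measurement each
    person = rng.normal(0, between_sd, size=2 * n)
    noise = rng.normal(0, within_sd, size=2 * n)
    y_treat = person[:n] + effect + noise[:n]
    y_ctrl = person[n:] + noise[n:]
    between_est.append(y_treat.mean() - y_ctrl.mean())

    # Within-subject: n people measured twice, each their own control
    person2 = rng.normal(0, between_sd, size=n)
    y1 = person2 + effect + rng.normal(0, within_sd, size=n)
    y0 = person2 + rng.normal(0, within_sd, size=n)
    within_est.append((y1 - y0).mean())

print("sd of between-subject estimate:", np.std(between_est))
print("sd of within-subject estimate: ", np.std(within_est))
```

With these (assumed) variance components, the within-subject estimate is roughly ten times more precise, because differencing each person against themselves removes the between-person variation entirely.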

An example was a study of ovulation and clothing, which posited hormonally-correlated sartorial changes within each woman during the month, but estimated this using a purely between-person design, with only a single observation for each woman in their survey.

Why use between-subject designs for studying within-subject phenomena?  I see a bunch of reasons.  In no particular order:

  1.  The between-subject design is easier, both for the experimenter and for any participant in the study.  You just perform one measurement per person.  No need to ask people a question twice, or follow them up, or ask them to keep a diary.
  2.  Analysis is simpler for the between-subject design.  No need to worry about longitudinal data analysis or within-subject correlation or anything like that.
  3.  Concerns about poisoning the well.  Ask the same question twice and you might be concerned that people are remembering their earlier responses.  This can be an issue, and it’s worth testing for such possibilities and doing your measurements in a way that limits these concerns.  But it should not be the deciding factor.  Better a within-subject study with some measurement issues than a between-subject study that’s basically pure noise.
  4.  The confirmation fallacy.  Lots of researchers think that if they’ve rejected a null hypothesis at a 5% level with some data, that they’ve proved the truth of their preferred alternative hypothesis.  Statistically significant, so case closed, is the thinking.  Then all concerns about measurements get swept aside:  After all, who cares if the measurements are noisy, if you got significance?  Such reasoning is wrong wrong wrong, but lots of people don’t understand this.

One motivation for between-subject design is an admirable desire to reduce bias.  But we shouldn’t let the apparent purity of randomized experiments distract us from the importance of careful measurement.  Real-world experiments are imperfect–they do have issues with ethics, cost, realism, and high drop-out, and the strategy of doing an experiment and then grabbing statistically-significant comparisons can leave a researcher with nothing but a pile of noisy, unreplicable findings.
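A quick simulation (a sketch, with invented numbers) makes this concrete: when measurements are noisy, the estimates that clear the significance bar are not a vindication of the hypothesis; they are a selected sample that grossly exaggerates any true effect.

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect, se = 0.2, 1.0  # a noisy study: the standard error dwarfs the effect
n_sims = 100_000

# Sampling distribution of the estimate across many hypothetical studies
estimates = rng.normal(true_effect, se, size=n_sims)

# Keep only the "statistically significant" results (|z| > 1.96)
significant = np.abs(estimates) > 1.96 * se

print("mean estimate, all studies:        ", estimates.mean())
print("mean |estimate|, significant only: ", np.abs(estimates[significant]).mean())
```

The estimates that survive the significance filter are several times larger than the true effect, and some even have the wrong sign; that is the pile of noisy, unreplicable findings.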

Measurement is central to economics–it’s the link between theory and empirics–and it remains important, whether studies are experimental, observational, or some combination of the two.

Andrew Gelman is a professor of statistics and political science at Columbia University, New York.

If you want to get in touch, please email us. You are also welcome to leave a comment below. Comments are moderated and will not appear until they have been approved.

Bank Underground is a blog for Bank of England staff to share views that challenge – or support – prevailing policy orthodoxies. The views expressed here are those of the authors, and are not necessarily those of the Bank of England, or its policy committees.


Filed under Guest Post, New Methodologies

One response to “Balancing bias and variance in the design of behavioral studies: The importance of careful measurement in randomized experiments”

  1. Andrew Garrett

    Of course cross-over and n-of-1 designs all use randomisation. The point seems to be more about the importance of measuring within-subject change, through repeated measures (including baseline measures) and/or multiple treatment assignments within subject. Where possible, that makes good sense. Senn (2001) in particular makes this case in terms of investigating whether genes have an impact on outcome – something which cannot be properly evaluated without crossover designs with repeat treatments to tease out subject-by-treatment interaction. That paper is in Drug Information Journal 2001; 35:1479-1494. Worth a read, since the points go beyond life science research – it is great to read that this Gelman post was a result of a Banking initiative to learn from other disciplines. Something we should do more of across all sectors.