Why Even the Best Charity Evaluations Contain Uncertainty
GiveWell's charity evaluations are the best. What can't they foresee?
One of the most appealing parts of GiveWell's charity evaluation work to me has been their cost-effectiveness model, because in some sense it tells you exactly what you're getting for your money. Under their default moral framework:
The Against Malaria Foundation saves one life for every $4,450 donated.
Malaria Consortium saves one life for every $3,373 donated.
Both are roughly 13-16x as effective as handing people cash through GiveDirectly.
Handing people cash is itself more effective than the vast majority of nonprofit programs.
At the same time, GiveWell states very clearly that these numbers are extremely rough, and therefore their decisions "rely heavily on other factors, such as an organization's track record, when we are comparing organizations with cost-effectiveness estimates that are not very different."
(This doesn't, of course, mean the cost-effectiveness estimates are useless -- we're stuck with some level of uncertainty or ambiguity, but this is much, much better than knowing nothing at all!)
When I first learned about effective altruism a few years ago, this was kind of confusing to me -- most of GiveWell's estimates are based on randomized controlled trials, the gold standard for empirical evidence, so shouldn't they tell us exactly the effect of an intervention? Now that I've spent more time listening to development economists argue with each other, I think I have a better idea of what to worry about, so I want to go over a few of the main sources of uncertainty that can exist even if you have the perfect experiment, namely:
The effects of a treatment can differ in different places
Moral Weights
Issues not captured by experiments
(Once again, to be clear: this isn't a criticism of GiveWell. They are very open about this, and understand its consequences far better than I do. I am writing this because I find uncertainty much easier to interpret when I have a basic understanding of where it might be coming from.)
The effects of a treatment can differ in different places
Suppose you were worried about the effects of winter, so you decided to send communities free snowplows. You tested this program in a couple of cities in Wisconsin and Michigan, found it to be highly successful, and are ready to roll it out in Los Angeles and San Diego!
Social scientists would say your tests lacked external validity: you didn't pay attention to the particular set of circumstances that made them successful. In this case, it would be pretty easy to fix the problem by measuring the amount of snow a given community received, so you could be sure to send plows only to communities that had a use for them. But what if the "particular set of circumstances" includes factors you couldn't (or at least didn't) observe?
This isn’t purely theoretical. Let’s look at one of my favorite health interventions: malaria nets. The main empirical facts supporting AMF's "free nets" distribution model are:
Malaria nets, if used, significantly reduce deadly cases of malaria.
Even if malaria nets are sold at a deep discount, too few people are willing to buy them to adequately protect a community against malaria.
People use malaria nets enough (not 100%, and not forever, but enough) to significantly reduce malaria rates, even if the nets were given to them for free.
(See, for example, here for a discussion of points 2 and 3)
Each of these has been established experimentally in particular places, most famously through experiments in Kenya and Uganda. But what if mosquitos in Malawi are somehow different from those in Kenya? In that case, would bednets still be effective? Do we need a different insecticide treatment or smaller holes to block the bugs?
More to the point, Africa is an enormous continent. I wouldn’t trust a study performed in Vermont to help me predict the attitudes of people in Alabama towards, say, wearing a mask, and I should be equally wary of predicting bednet attitudes in Nigeria from their counterparts in Uganda or Kenya.
To be clear, there are ways to mitigate this. I’d be happier to trust studies from Mississippi to make (rough) predictions about masks in Alabama, and I’d be especially happy if lots of studies in a diverse collection of states found similar results. Development economists (from what I've seen so far) think carefully about external validity, and GiveWell programs tend to be supported by an extraordinarily strong evidence base.
So if I have strong evidence that (to make up numbers) bednets save a life for every $4,350 spent in Kenya, for every $3,994 in Uganda, and $3,802 in Malawi, I would (assuming I found sufficiently similar communities) be reasonably justified in inferring they would save a life for every $4,000 or so spent in Tanzania, and I would certainly be justified in saying this intervention was more effective than one that saved a life for every $400,000 spent. But I wouldn't have strong grounds on which to compare this intervention to one that saves a life for every $3,900 spent, or even one that saves a life for every $5,000 spent.
Indeed, I would probably choose to donate to a charity where I was very sure $5,000 would save a life over one where a study showed $3,000 could save a life but I had deep concerns about external validity. Figuring out how to make these tradeoffs is important -- from what I gather, it's a big part of what GiveWell does -- and it isn't well-captured in the "here's how much it costs to save a life!" headline numbers.
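One way to see why a well-replicated $5,000-per-life estimate can beat a shakier $3,000-per-life one is a quick simulation. Everything below is invented for illustration (the costs, and especially the uncertainty levels -- nothing here comes from GiveWell's actual models): treat each charity's "true" cost per life as uncertain, give the single-study charity a much wider spread to stand in for external-validity worries, and see how often it actually comes out cheaper.

```python
import random

random.seed(0)

def simulate(median_cost, spread, n=100_000):
    # Draw plausible "true" cost-per-life values; `spread` is the
    # standard deviation of the log of the cost, a crude stand-in
    # for how much we trust the estimate to transfer to a new setting.
    return [median_cost * random.lognormvariate(0, spread) for _ in range(n)]

# Made-up numbers: charity A has a well-replicated $5,000/life estimate
# (tight spread); charity B has a single-study $3,000/life estimate with
# serious external-validity concerns (wide spread).
a = simulate(5_000, 0.10)
b = simulate(3_000, 0.60)

# How often is B's true cost per life actually lower than A's?
b_cheaper = sum(y < x for x, y in zip(a, b)) / len(a)
print(f"Chance B is actually cheaper per life: {b_cheaper:.0%}")
```

Even with these made-up spreads, B only "wins" in roughly four out of five simulated worlds; in the rest, the headline number pointed the wrong way. Which spread to assign is exactly the judgment call that doesn't show up in the headline figure.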
Moral Weights
GiveDirectly gives people money, which they can spend on food, a business, a better roof, or any number of other things.
The Against Malaria Foundation helps people, especially children, avoid dying of malaria.
Which should I donate to?
It’s not only difficult to compute how much more effective AMF is than GiveDirectly, it’s difficult to define what we even mean by the question. How many repaired roofs is the life of a child worth? How does a month of hunger compare to a month with severe malaria?
These are questions without clear answers, but which we all answer, implicitly or explicitly, when we decide where to donate. One traditional approach uses QALYs (Quality Adjusted Life Years) or DALYs (Disability Adjusted Life Years) or some variant of them. These, based on a series of surveys, attempt to quantify how people value different health and life outcomes, where 1 QALY is (essentially) the value of a healthy adult living for an extra year.
There are some pretty obvious critiques of this approach, starting with the fact that different people value different things. A broken leg would be a huge problem for me in the fourth-floor apartment I can only reach via stairs, but a broken arm would be easier for me than for a concert pianist or a working mother. This means that QALYs are trying to measure some sort of average value, which is at best an approximation of the values of the people the intervention is serving.
It's also not super clear that a survey is a good way to figure out what people value. If I ask whether you'd rather have a broken arm for a year or a broken leg for eight days, you could probably give me a gut feeling, but it's unlikely that this would be a careful, accurate reflection of some underlying truth. I'm not sure I can accurately predict the effect a broken arm would have on me, let alone something like malaria or "dying ten years early."
Lastly, and importantly, at least traditionally a lot of this sort of research is done by rich Westerners surveying other rich Westerners. If the extent to which Westerners value some sort of health or economic outcome differs from their counterparts affected by a particular intervention, this approach has a good chance of leading to an outcome rich US donors want, over and above the wishes of the people we ought to be listening to.
GiveWell recently sponsored some research to reassign moral weights based on the values of 2,000 people living in extreme poverty in Kenya and Ghana. This did differ somewhat from their previous, Western-centric weights: "Among other findings, they suggest that survey respondents have higher values for saving lives (relative to reducing poverty) and higher values for averting deaths of children under 5 years old (relative to averting deaths of individuals over 5 years old) than [GiveWell] had previously been using in [their] decision-making."
In practice, it turns out that while moral weights differ, they differ by relatively small amounts compared to typical differences one sees between charities. GiveWell found that changing their moral weights to fit the above study did not affect which charities they ranked as "Top Charities", and I've observed a similar robustness while playing with the weights in their spreadsheets in the past.
So we see the same issue as above: if my charity is 1.04x as effective as yours when measured by QALYs, I don't have a very strong claim to being "more effective" -- perhaps yours wins measured by DALYs, or (more importantly) when measured by the values of the people whose lives our interventions affect. But if my charity is 104x as effective as yours, this is likely to hold up whatever set of moral weights a person could choose.
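The robustness point above can be made concrete with a toy sensitivity check. All of the numbers below are invented (the weights, the outcome counts, and the charities themselves): perturb each moral weight by up to 50% in either direction and count how often the baseline ranking of two charities flips. A near-tie flips often; a ~90x blowout never does.

```python
import random

random.seed(0)

# Hypothetical baseline moral weights (value units per outcome) and
# made-up outcome counts per $1,000 donated for three toy charities.
weights = {"life_saved": 100.0, "poverty_year_averted": 2.0}
close_a = {"life_saved": 0.30, "poverty_year_averted": 1.0}   # value 32
close_b = {"life_saved": 0.28, "poverty_year_averted": 2.5}   # value 33
far_c   = {"life_saved": 30.0, "poverty_year_averted": 1.0}   # value 3,002

def value(charity, w):
    return sum(w[k] * charity[k] for k in w)

def flip_rate(x, y, trials=10_000):
    # How often does scaling each weight by a random factor in
    # [0.5, 1.5] reverse the baseline ranking of x and y?
    baseline = value(x, weights) > value(y, weights)
    flips = 0
    for _ in range(trials):
        w = {k: v * random.uniform(0.5, 1.5) for k, v in weights.items()}
        flips += (value(x, w) > value(y, w)) != baseline
    return flips / trials

print(f"Near-tie (~1.03x gap): flips in {flip_rate(close_a, close_b):.0%} of perturbations")
print(f"Blowout  (~90x gap):   flips in {flip_rate(far_c, close_b):.0%} of perturbations")
```

The close pair flips in a meaningful fraction of perturbed worlds, while no plausible reweighting rescues the 90x comparison -- which is the sense in which GiveWell's top-charity rankings can be robust to moral weights even though the weights themselves are deeply uncertain.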
Issues not captured by experiments
The last concern is that experiments capture only very specific outcomes: we can tell how many children would have died but didn't, and that is probably the most important thing to know!
But there are plenty of ways a nonprofit affects its community beyond preventing disease or helping people obtain roofs. There are plenty of stories of well-meaning people founding charities to give away free clothing, undermining local clothing markets and destroying a local economy in the name of altruism, just as there are stories of nonprofits abusing the people they claim to serve, or (wittingly or unwittingly) creating deeply damaging power dynamics, especially along race or class lines.
(Listen to The Missionary for both a particularly egregious example, and a more general look at how a particular sort of missionary culture can encourage this.)
Beyond obvious (and unfortunately commonplace) misconduct, there are harder-to-quantify operational questions. Is AMF's supply chain likely to break down? What if a war breaks out somewhere a distribution is set to occur? Can net distribution proceed during a deadly global pandemic?
Part of judging these questions is just looking at nitty-gritty economic details. But a big part of them is personal: do I trust the leadership of this organization to make good decisions when life inevitably throws curveballs? Do they have a track record of transparency and of overcoming obstacles?
I trust GiveWell's opinion on these questions, in part because of GiveWell's own radical transparency and track record. And I think we should think of this as expert opinion interpreting and informing the cost-benefit data, not purely a feature of the data itself. And we should be okay with that!
Certainly there are other sources of uncertainty, because there always are! But hopefully this helps a bit to contextualize why the "just find the best number" approach to charity evaluation I held five years ago is oversimplified. Cost-benefit analyses remain incredibly useful, as long as we have a firm grasp on how to think about uncertainty.