This is the long-delayed second installment in a series about the complexities of using data in Democratic politics. You can read the introductory post here.
We’re going to be discussing the idea of the McNamara fallacy (or quantitative fallacy) today. To recap, the core idea behind the fallacy is that focusing only on aspects of your desired objective that can be measured quantitatively can be counterproductive to that objective.
While this was actually coined with respect to Robert McNamara’s emphasis on the relatively new science of operations research in his management of Ford, these days it’s most associated with his running of the Vietnam War, and specifically with the use of the sui generis “body counts” metric (i.e. number of Viet Cong soldiers killed) as the primary way of determining the progress of the war — with evidently catastrophic results.
In fact, one of the reasons this took me so long to put together is that the story is a bit more complicated (as it inevitably always is). I’m not a historian and so my ability to find and evaluate the “state-of-the-field” is definitely lacking, but it seems like the following complications are true:
- Using body counts as a metric for evaluating military success preceded Vietnam
- There was no shortage of other metrics that the army was attempting to gather
- The limitations of the metrics being gathered were well known, at least among some officers
That being said, the added nuance actually provides an even more instructive, if more complicated, demonstration of the fallacy.
So today we’ll start with a longer section going into (my understanding of) the use of extremely flawed quantitative metrics in Vietnam. Essentially all of the background here comes from the following sources:
- The Best and the Brightest by David Halberstam
- Body Counts and “Success” in the Vietnam and Korean Wars by Scott Gartner and Marissa Myers (behind a paywall)
- The Problem of Metrics: Assessing Progress and Effectiveness in the Vietnam War by Gregory Daddis (also behind a paywall)
- The Tyranny of Metrics by Jerry Muller
I urge readers to consider, as they go through that section, connections to our modern political moment, especially for those who work in data. I close this piece out with what I hope are some plausible thought experiments showing how these forces can serve as cautionary tales for the data scientists in the field (and the strategists consuming their insights)1.
Let’s recall from the previous post Daniel Yankelovich’s distillation of the McNamara fallacy, so we can trace its tenets through the history below:
“…the first step is to measure whatever can be easily measured. The second step is to disregard that which can’t easily be measured or given a quantitative value. The third step is to presume that what can’t be measured easily really isn’t important. The fourth step is to say that what can’t be easily measured really doesn’t exist.”
Metrics and the Vietnam War
In the standard retelling, the most infamous aspect of Robert McNamara’s running of the Vietnam War was an obsession with “body counts” — the number of enemy combatants killed. Unlike, for example, World War II, where military progress could be clearly indicated on maps by territorial control, the Vietnam War was a guerilla war, and the concept of taking territory wasn’t meaningful: villages that were “cleared” would later become sites of future conflict as VC forces would melt in and out of the villages and populace. So the underlying challenge is formidable: how can you actually tell how well a war fought against a guerilla army is going?
As Secretary of Defense, McNamara was responsible for winning the war. And, in keeping with his ethos about the importance of measurement — the technocratic approach that he brought from Ford — he was desperate to find something measurable that could indicate progress. As argued by Gartner and Myers, he reached back to what had evidently been a successful measure in the latter half of the Korean War: body counts.
Towards the end of the Korean War, a nasty feud developed between General MacArthur and President Truman, which amounted to a feud between alternative ways to define success. MacArthur wanted to fight a “traditional” war (with its traditional object of territorial capture) that would entail the US taking over all of Korea and potentially invading China too; Truman was worried that that would drastically increase the Chinese commitment to the war and wanted to fight a strategy of attrition, wearing down the enemy2. This led to the genesis of the body count metric in the Korean War: the goal of the war became to inflict maximal casualties on the Chinese soldiers supporting North Korean forces in order to drain China’s political resolve.
Interestingly, the paper doesn’t explore in more detail why this was more or less appropriate for one conflict versus the other, so this is just my pure speculation: the key difference seems to me that the North Korean soldiers were being supported by Chinese forces, whose commitment to the cause was not infinite. Attriting the Chinese forces while not threatening their home front would of course sap Chinese political will, taking them out of the conflict. However, the Viet Cong were fighting on the home front, for what they considered their own people — and it was the US that was fighting far from its home front for an ally that they didn’t always consider reliable.
What was clear though was that the “real” objective for winning a guerilla war hinged on some concept of “political will” for which “number of casualties inflicted” served as a proxy. And as such, it was not unreasonable for certain conflicts, but the mechanism whereby increased brutality could actually increase political will seems to have been severely underconsidered. That is the McNamara fallacy in a nutshell.
Measure whatever can be easily measured
But while body counts emerged as the most prominent metric, Daddis argues persuasively that it was relatively well understood that body counts alone were not comprehensive and that indeed fighting a counterinsurgency required a different playbook3.
The solution that MACV (the US command in Vietnam) came up with was an order that will be familiar to many of the data analysts among us: just measure everything that could be measured. “With an incomplete understanding of counter-insurgencies and vague strategic objectives, MACV embraced the advice of … McNamara that everything that was measurable should in fact be measured.” (Daddis p. 75). Under the command of General Harkins (more on him later), the infamous MACV Directive 88 was published, which was essentially a “laundry list” of over 100 different things to measure (”… rate of VC defections, strength of combat units, ratio of enemy to friendly killed in action, percentage of VC crops destroyed, number of civil guard units formally trained, and average number of days spent on offensive operations …”) (Daddis pp 84-5) with no accompanying analysis on the connection of these to the actual objective or any sense of how to assess the importance of one metric over the other.
Unsurprisingly, amid this “blizzard” of utterly uncontextualized information, any number of narratives could be spun. Whether you were optimistic or pessimistic about the war, you could find some subset of metrics trending in the direction that supported your cause. And of course, the collectors of these metrics — from the foot soldiers counting dead bodies after an operation to General Harkins himself collating them — were not impartial bystanders, and their control over the metric quiver was a powerful political tool.
Politicization of metrics
David Halberstam reserves particularly venomous scorn for General Paul Harkins — in The Best and the Brightest, “General Paul Donal Harkins, fifty-seven, was a man of compelling mediocrity. He had mastered one thing, which was how to play the Army game” (Halberstam p. 212) — who was in charge of the military command after the massive US troop build-up in 1962. And journalist Neil Sheehan memorably described him as an “American General with a swagger stick and cigarette holder…who would not deign to soil his suntans and street shoes in a rice paddy to find out what was going on [but] was prattling about having trapped the Viet Cong”. And in the above-cited work by Daddis discussing the blizzard of directionless metrics, Harkins shoulders a great deal of the blame for constructing such a system.
For an “optimist” who was thoroughly unprepared for the creative thinking needed for an unorthodox theatre and who was explicitly attuned to the political maneuvering needed to play “the game”, the “measure everything” strategy became obvious, especially measuring those numbers that were going up. And, in retrospect, equally obvious were the political incentives that this directive imposed on the rest of the staff, and on their allied South Vietnamese officers. And here, essentially every post-mortem of the war is generally in agreement: the fabrications and exaggerations of all of these metrics, whether intentional or accidental, earnest or malevolent, were rife.
For aspiring commanders in the field, especially those of the more politically ambitious bent, the path to ascending the ladder was clear — you were judged on your ability to move those metrics in the right direction, by whatever means necessary. As Daddis notes, “With few other indicators allowing commanders to stand out among their peers, the body count served as a visible yardstick for performance. One senior staff officer believed all tactical commanders were ‘judged on how many enemy they kill and how many operations they launch and how successful they are’”. This ethos even crossed national divides, as the South Vietnamese commanders began to respond to the same incentives. Halberstam recounts a briefing delivered to Harkins by a South Vietnamese commander rife with exaggerated and uncorroborated statistics — but delivered with such a crisp Americanized affect that Harkins was immediately taken.
Perhaps, you might argue, it’s unfair to blame McNamara (as we are implicitly doing by naming the entire fallacy after him) for the fabrication of data that essentially dominated at every level. But, in The Tyranny of Metrics, Muller argues that the structure he created — an emphasis on metrics for metrics’ sake, divorced from an understanding of overarching strategy, and discounting the qualitative expertise of officers on the ground — inevitably incentivized exactly those actions.
A quick diversion to examine the officers who were resistant to the “basket of metrics” approach is also instructive. One such character was General Edward Lansdale, who is described by Daddis as having the most “progressive” understanding of counterinsurgency, and as a pioneer of the “hearts and minds approach” to winning over the people. In 1962, he wrote what ended up being a portentous memo that laid out his own view of the war as primarily a contest of political will: “highlighting the importance of earning the people’s friendship, Lansdale’s report focused on ‘such things as willing care for the wounded and injured civilians, sharing rice with the hungry, repairing destroyed public structures’, all done to ‘start linking up the villages spiritually as well as mentally … and physically … with the provincial and national centers’.” (Daddis p. 82). He called these “X-factors” and his sample questionnaire for soldiers to ask villagers tried to assess exactly these factors. But the qualitative nature of the questionnaire foundered in the wake of McNamara’s obsession with quantifiability. It was in fact in a conversation with Lansdale that we get the infamous image of McNamara sarcastically writing, and then erasing, the “feelings of the rural Vietnamese people” on a whiteboard of important factors, because they were unmeasurable.
In fact, there seems to be a vein of literature (which I am not qualified to evaluate) that imagines a brighter alternative universe if only Lansdale had been in charge, most exemplified by Max Boot’s biography of him, with the unsubtle title The Road Not Taken: Edward Lansdale and the American Tragedy in Vietnam (I have not read the book but the review here is helpful); but Daddis appears to hold similar views.4
Connections to Political Data Today
The analogies to today’s politics aren’t exactly veiled, but it’s worth fleshing out some examples. My goal is to use these particular thought experiments as sign posts that we can continue to evolve and complicate as we continue this series. Today we’ll introduce some vignettes and connect them to the McNamara narrative above.
Long-term damage by proxy measures
A reasonable proxy for a candidate’s popularity (the ground truth of which can only be ascertained at election day) is grassroots fundraising, and so you might judge how well a candidate is doing based on their fundraising prowess. But what if, in order to achieve a better fundraising haul, the candidate has bombarded potential voters with scaremongering and misleading fundraising messages, to the point where they have demolished their party brand by election day?
This is obviously an extremely topical issue the Democratic Party is dealing with right now — a deep investigation into the money spent on various fundraising firms by Adam Bonica went viral among the operative class, and Democratic fundraising platforms have been feeling the pressure to address so-called “ScamPACs”.
But even if we discount the explicitly scammy and predatory behavior of some of these organizations and assume the best of intentions among the grassroots fundraising operations, there remains a tension that needs to be resolved. Having worked in the space myself, I know that the urgency, the explicit calls to action, the repeated messaging and the cross-pollination of prospects all, from the standpoint of just bringing in more donations, work. And grassroots enthusiasm really is a worthwhile indicator of enthusiasm for a candidate.
But at what cost? How many times can you see that “your immediate $15 donation is the only thing standing between democracy and the abyss” (as Bonica writes) before you become numb to the emergency? If grassroots fundraising and the Democratic Party’s reputation are in tension now, then we are facing a McNamara problem, because fundraising prowess is extraordinarily easy to continuously measure, but brand reputation is distinctly not.
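To make the tension concrete, here’s a deliberately simplified toy model — every number and functional form in it is made up for illustration — of how optimizing the easy-to-measure proxy can diverge from the hard-to-measure objective: donations rise with email volume, while an unmeasured reputation score quietly erodes.

```python
# Toy model (all numbers hypothetical): each additional "urgent" email per
# week raises short-term donations, but erodes an unmeasured brand-reputation
# score that also matters by election day.

def donations(emails_per_week: int) -> float:
    """Observable proxy: more emails bring more money (diminishing returns)."""
    return 1000 * emails_per_week ** 0.5

def brand_reputation(emails_per_week: int) -> float:
    """Unmeasured ground truth: erodes with message fatigue."""
    return max(0.0, 1.0 - 0.08 * emails_per_week)

# Optimizing the proxy alone says: send as many emails as possible.
best_by_proxy = max(range(1, 21), key=donations)              # → 20

# Weighting donations by the unmeasured factor picks far fewer.
best_by_truth = max(range(1, 21),
                    key=lambda e: donations(e) * brand_reputation(e))  # → 4

print(best_by_proxy, best_by_truth)
```

The specific optimum is an artifact of the made-up coefficients, of course; the point is only that a monotonically improving proxy and the underlying objective can disagree, and a dashboard showing just the first column will never tell you.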
Overfitting on new measures
Perhaps surprisingly, there’s no shortage of data now in the political world. All of the following are now essentially considered table stakes for a modern political campaign (although the last one, social media monitoring, is relatively new-fangled).
- Voter files enriched with all manner of rich consumer and third party data in addition to vote history and party registration
- Batteries of polling and survey data that, for good and defensible reasons, have significantly different approaches to weighting responses and can report on ever more high resolution slices of the electorate
- Cookies, pixels, and digital fingerprints, while old news at this point, are ubiquitous in the space; campaigns are eager for any and all information about who their communications are reaching, and who is interacting
- Randomized controlled trials and focus groups to craft messages, at greater scale than ever before because of the ease of recruiting online panels5
- Social media and ecosystem monitoring and listening, especially in light of the “Podcast Election” of 2024, has become the Holy Grail for campaign operatives as they seek to learn what voters are actually watching and listening to6
It’s easy to construct any number of quantitative measures and watch their dynamics over the course of the cycle. At least from the point of view of a “political enthusiast”, the entire experience of election coverage is building narratives from a basket of metrics as they develop. Seeing that Loudoun County has moved X points in Y direction means one thing, but then, in tension with that, maybe the exit polls in Miami-Dade are seeing Z fewer of this or that demographic. And we definitely saw people7 selecting the metrics that pointed in the direction they wanted, and finding reasons to discount all else.
The only cost of the chronically online spectator classes digging through precinct data for their hopium kick is the future public health bill for their inevitable hypertension8. But the strategists running these campaigns are also being confronted with this deluge of information, and much more besides. And, if we consider that the rules of politics have changed since, say, 2008, then we risk finding ourselves in a similar boat to Paul Harkins and MACV in the 1960s: an abundance of things to measure but no overarching strategic objective to fit those measures into.
Sacrificing efficacy for statistical validity
There’s huge demand for testing the efficacy of strategies, whether for improving turnout or persuasion or fundraising, and of course, this impulse is eminently reasonable. Running ads and fundraising campaigns is extremely expensive, and making sure money is well spent is one of the main tasks of a well run campaign. By collecting data on effectiveness — grassroots donations received, focus group and survey results, and viewer metrics — campaigns are well-placed to conduct experiments on what actually works. So far so good.
From a statistical standpoint, the gold standard for determining causal relationships is the randomized controlled trial, so many organizations have tried to incorporate RCTs into their testing and decision-making process. The basic structure is:
- You have a population of interest and hypothesis about how some factor affects an outcome that you care about (People on your email list are more likely to donate to an email when it contains a picture of your candidate).
- You take a representative sample of that population and randomly split it into a “control” group and a “treatment” group; the only thing you do differently between the groups is applying the treatment to the treatment group (You send 500 people the standard fundraising email you were planning to send, and for another group of 500, you add a photograph of your candidate).
- You measure the effect in the control and treatment groups, and conduct a test for the statistical significance of any differences that you find. If the treatment group performed significantly better, then you have gold-standard evidence that your proposed treatment has a real and causal impact.
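For concreteness, here’s a minimal sketch of what the significance test in the last step might look like. The donor counts are hypothetical, and a real analysis would likely reach for a statistics library rather than this hand-rolled two-proportion z-test, but the structure is the same:

```python
# A two-proportion z-test comparing donation rates between the standard
# email (control) and the email with a candidate photo (treatment).
from math import sqrt, erf

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: the two rates are equal."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical results: 40/500 control donors vs. 65/500 treatment donors.
z, p = two_proportion_z_test(40, 500, 65, 500)
print(round(z, 2), round(p, 4))  # p < 0.05: the photo "worked"
```

With these (made-up) counts the difference clears the conventional 0.05 threshold; with real data you would also want to think about effect size, multiple comparisons, and whether the treatment generalizes beyond this one email.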
Doing these causal studies is extremely valuable, especially when they’re done as meta-analyses, where results from multiple individual studies are pooled together to increase the signal. Good examples of organizations that conduct these large scale studies are the Analyst Institute and Tech for Campaigns, and the movement has surely benefited from their findings.
However, the structure of an RCT requires, by definition, the ability to create a discrete treatment and control group9: where the only difference is the application of a treatment that can be turned on and off. But you might imagine that the highest impact changes you could make to strategy aren’t so neatly categorizable. Even in the very basic toy example presented above, it seems very plausible that the effectiveness of adding a photo of the candidate is mediated by the actual content of the email itself. Perhaps if the email contains a personal story about the candidate, a photo is more effective, but if the content is attacking an opponent, the photo actually reduces effectiveness.
More drastically, one consistent nugget of received political wisdom is that candidates are most effective when they’re being “authentic”. How can you create a controlled experiment where the only difference between a treatment group and a control group is a notion of “authenticity”10?
The McNamara risk is how your organization reacts to this difficulty. One option is to continue to do your RCTs to test what can be tested in such a way, but rely on your experienced political operatives’ judgment on the matters that aren’t so easily tested; in other words, rely on the political instincts of those with political experience. The other option is to insist that every strategic decision needs to go through whatever the closest approximation of an RCT is that could evaluate it. My contention is that the second of these gets uncomfortably close to the fallacy: “disregard that which can’t be easily measured quantitatively”.
Phase transitions
The political scientist Timothy Shenk, in his book Realigners, goes through American political history to find, as the title suggests, moments of political realignment, where the coalitions of voters that supported a party underwent a relatively sudden and massive shift. An archetypal example of this shift is the New Deal coalition, where, among other things, Black voters began to shift decisively to the Democrats and away from “the Party of Lincoln”. On either side of such a moment, the indicators of a successful campaign might look drastically different: before the New Deal, getting a quarter of the Black vote could probably indicate a landslide victory for the Democratic candidate; after it, a catastrophic defeat. In today’s politics, the shift of college educated voters to the Democratic Party has scrambled some traditional bellwethers: Loudoun County in Virginia had consistently voted Republican from 1964 through 2004, and even in 2012, Obama eked out a 4.5 point margin there. In 2024, Harris’ 16 point margin, as returns were starting to come in, was taken as an early warning sign because Biden had won the county by 25 points in 2020.
Taking some liberties with terminology, these realignments bring to mind the physical process of phase transitions, most commonly used to describe matter changing from, e.g., liquid to solid, but more generally useful for describing any process whereby the properties of a system change drastically and usually discontinuously. Two major aspects of phase transitions interest us here:
- The “laws” that hold when describing a system in one phase cease to hold after a phase transition (a gas’ volume increases with temperature, whereas a liquid might have constant volume)
- When a system is near the boundary between phases, it sits at a precipice; small perturbations can put it over the precipice, leading to drastic effects.
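Both bullets can be caricatured with a toy piecewise function — purely illustrative, not a physical model, with a made-up critical point — where the system obeys one “law” below the boundary and a different law above it, so the same small perturbation is negligible in the interior of a phase but drastic at the boundary:

```python
# A stylized "phase transition": the property of interest follows one law
# below a hypothetical critical point and a different law above it.

CRITICAL = 100.0  # made-up boundary between the two phases

def system_property(x: float) -> float:
    """Constant in the first phase; jumps and then grows in the second."""
    if x < CRITICAL:
        return 1.0                       # law of phase 1: flat
    return 5.0 + 0.1 * (x - CRITICAL)    # law of phase 2: offset + linear

# Deep inside a phase, a small perturbation barely matters...
print(system_property(50.0) - system_property(49.9))   # → 0.0
# ...but the same perturbation across the boundary flips the regime.
print(system_property(100.0) - system_property(99.9))  # → 4.0
```

Swap in “share of the Black vote” or “margin in Loudoun County” for `x` and the analogy to realignments is clear: the yardsticks calibrated in one phase simply stop meaning the same thing in the other.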
Phase transitions are a nonlinearity that we’ll spend a lot of time on in a future post, but for now imagine a candidate emerges whose political instincts lead them to an unorthodox playbook. Their campaign doesn’t move the normal metrics in the normal ways that we associate with successful campaigns in the past. Are we equipped to support or even recognize such a campaign or will it be thrown off course?11
What’s Next?
So far, I’ve been dealing in the realm of hypotheticals and thought experiments. Hopefully I’ve persuaded you, though, that there are dangerous and potentially catastrophic traps that can result from an overemphasis on imperfect quantitative metrics.
But I’m a mathematician by training. I want to explore the actual statistical mechanisms which lead to these missteps: what must be true about the electorate or the polling methodology or even people’s persuadability for a phase transition to happen or for a proxy measure to have a potentially backlashing effect?
That’s what we’re going to explore going forward: ideally both with some mathematical models but also with real data, if I can get my hands on it, that demonstrate the difficulties of avoiding these traps.
Footnotes
- One thing I’ve been thinking about, and which I would love to hear from others if they have thoughts, since it’s not exactly my area of expertise, is how to pick up empirical examples of these signals ↩︎
- You can argue, as the authors do, that the strategy of attrition dates back even earlier, to Ulysses S. Grant and the Civil War; the main difference though was the Civil War still hinged on territorial advances ↩︎
- The French had discovered this as well when they were kicked out the decade before in the First Indochina War, but it appears that Americans weren’t particularly eager to learn from French mistakes ↩︎
- For what it’s worth, Halberstam is much more ambiguous about Lansdale, assigning him much of the blame for the United States’ continued support for Ngo Dinh Diem, the President of South Vietnam widely despised for haplessness and autocracy (although there is some revisionist history that disagrees), but even he grants that Lansdale’s emphasis on sociological factors was prescient but immediately ignored (see Ch. 9 of The Best and the Brightest) ↩︎
- Now, with the advent of AI tools, the scale at which these messages can be generated is also unprecedented ↩︎
- At Netroots Nation, a major conference for the progressive political tech space, social media monitoring, especially of the right-wing podcast ecosystem, seemed to be the objective of at least a plurality of organizations ↩︎
- Me; I am people ↩︎
- Unless of course, it is this spectator class that campaigns respond to in their strategies, an investigation for another time ↩︎
- Alternately, discrete treatment groups if you don’t have a control ↩︎
- You might imagine having your writers create copy they think is authentic and copy they think is not, and use those as the treatment and control. This is problematic because, since “inauthentic” has the connotation of “bad”, the writers will likely create worse copy than they otherwise would have. Another option is to retroactively classify copy that was already sent; but here you’re much less likely to be able to guarantee that the treatment and the control group were treated exactly the same; and being able to identify a message as authentic in retrospect might not be helpful in telling you how to actually write such a thing ↩︎
- It’s worth noting how difficult of a challenge this is — history is littered with examples of unorthodox candidates who were either not talented or did not emerge at the right moment, e.g. Barry Goldwater or Pat Buchanan. Only in retrospect does Donald Trump look prescient. ↩︎