Condition Critical

AI mental health diagnostics risk oversimplifying the complex social dynamics of any medical concern

In September 2021, the Wall Street Journal reported on Apple’s plans to integrate diagnostic technologies into its phones. The idea is to use their various sensors and features — from high-resolution cameras to gyroscopes to usage analytics — to assess the possible presence of various medical conditions. The report mainly focused on the technology’s capability to potentially detect depression and “cognitive decline,” but it also mentioned research “that aims to create an algorithm to detect childhood autism” using the phone’s camera “to observe how young children focus.”

If you haven’t been following the breathless announcements and pitches in medical AI, such aspirations might seem like an out-there moonshot, a “what if” cooked up by ambitious computer scientists at a tier-one university. The authority of medicine, after all, is in part built around the idea that doctors are uniquely trained and sensitized to make diagnostic decisions. But efforts to develop algorithmic systems to diagnose conditions like autism are actually pretty widespread, to the point that my home institution (the University of Washington) not only works on them but is also developing datasets to help others do so, with financial support from the National Institute of Mental Health.

There are several different approaches to AI psychiatric diagnosis, but at their core, they aim to capture and analyze subjects’ bodily movements, recording how someone’s attention slides (or doesn’t), how their eyes move, their postures, their reaction to stimuli. This information is then put into a particular data format, plugged into a machine-learning system, and presto! Automated psychiatric diagnoses. In some respects, this matches existing diagnostic practices, which often rely on questionnaires, checklists, and long standardized assessments from which a point count is extracted, determining whether one falls within a diagnostic category. (Think Buzzfeed quizzes but life-altering.) The developers believe that automated diagnoses are a means of overcoming the partiality and inefficiency of doctor-directed diagnostics. Instead of depending on a medical professional — expensive, fallible, sometimes biased, and hardly evenly or sufficiently distributed around the world — you can rely on an algorithm, one that promises to be cheap, scientific, and objective with a capital O.
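The point-count logic described above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the item names, the point values, and the cutoff are invented and correspond to no real assessment instrument.

```python
from typing import Mapping

# Hypothetical checklist: item names and point values are invented
# for illustration and are not taken from any real instrument.
ITEM_POINTS: Mapping[str, int] = {
    "item_a": 2,
    "item_b": 2,
    "item_c": 1,
    "item_d": 1,
}
CUTOFF = 4  # assumed threshold for falling "within the category"


def point_count(answers: Mapping[str, bool]) -> int:
    """Sum the points of every checklist item marked as present."""
    return sum(
        points
        for item, points in ITEM_POINTS.items()
        if answers.get(item, False)
    )


def within_category(answers: Mapping[str, bool], cutoff: int = CUTOFF) -> bool:
    """Reduce the whole assessment to a single yes/no at the cutoff."""
    return point_count(answers) >= cutoff


at_cutoff = within_category({"item_a": True, "item_b": True})     # 4 points
below_cutoff = within_category({"item_a": True, "item_c": True})  # 3 points
```

One point separates the two sets of answers, yet they land on opposite sides of the category. However the score is produced, by a clinician with a checklist or by a model, the final step has this shape.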

At this point, one can readily imagine the criticisms: Sure, AI diagnosis promises to be those things, but it’s never going to be. It’ll be expensive, it’ll be biased, it will lack resilience. (After all, Apple spent years shipping laptops that could be disabled by crumbs.) These are all reasonable reactions, and part of what has become the standard playbook for the emerging field of algorithmic ethics, which (conventionally) tends to focus on fairness (whether the outcomes are biased depending on, say, the subject’s gender) and availability. Such a focus is attractive to developers in part because it implies that any issue with an AI system lies in its execution and not in the ideas behind it, and can ultimately be solved through technical means.

This mentality has proven highly popular with software companies and — as I discussed a few months ago — with government regulators, to boot. Indeed, the developers of diagnostic algorithms themselves are aware of fairness concerns and are working on questions of dataset representativeness already. So a fairness-oriented critique coming from outside the development community only repeats what they already believe, that AI’s problems are typically a matter of outcomes, not means.

Treating problems with algorithms as technical rather than political aligns with a particular idealization of medicine, in which it too is apolitical. From this view, medicine is a clear-cut process of providing for the health of individuals, strictly a matter of outcomes. But if the past few years have shown us anything, it is that medicine is anything but apolitical. It is not as though there is a clear “right thing to do” and everyone automatically tries their best to do it. Medicine shapes our very bodies and minds; its availability and quality are quite literally matters of life and death. As a form of cultural authority, medicine affects public policy and discussion around how society is arranged, and is in turn affected by them. Its provision is both a civic and a business matter; the field is constantly negotiating the different values implied by these, with ramifications for how care is conceived and rationed. Medicine becomes subject to broader forces of neoliberalization, and to the insistence on efficiency, cost-effectiveness, and speed.

It’s no surprise, then, that AI is of high interest to policymakers and medical practitioners alike. It promises to reaffirm the authority of medicine while implementing a particular (neoliberal) politics of medicine’s delivery. If doctors’ diagnostics are being criticized for being unreliable, fallible, and slow, have them done by algorithm. You get to promise a new degree of objectivity and certainty, and you get to fire a load of expensive medical practitioners to boot.

It’s important to point out the technical limitations of medical AI in how it’s currently being executed and their consequences. But as urgent as that is, it is also critical to consider more fundamental questions of how medical AI is embedded in society. It’s not merely a question of the insufficiency of AI in practice but the insufficiency of its promise. Rather than merely highlighting how a system fails to perform as expected, we must consider what it is being asked to perform. Rather than tacitly accept how efficiency is framed as neutral and diagnostics as automatable, we must ask how these assumptions are built into a system’s design. The stakes are not only in what happens when AI doesn’t work, but what happens when it does.

AI-driven autism diagnostics illustrates well how even if an algorithm works, it can simultaneously fail. In particular, I’m thinking of how developers understand the practices in which they intend to intervene and the consequences that has down the road.

From a computational perspective, it’s tempting to see existing systems (in this case, the process of coming up with a medical diagnosis) as a neat, linear series of tasks or actions. This makes interventions a matter of plug-and-play: Swap out a doctor, swap in an algorithm. What this misses is that medical practices are not linear, modular processes at all but instead embedded in a patchwork assemblage of fragile and multiconnected events. Swapping out a doctor who makes diagnostic assessments for an algorithm can go really badly if, for example, doctors use information from their diagnostic assessment to recommend treatment: a recommendation the algorithm cannot make and the doctor would no longer have the information to make. More central to the argument here, though, is the fact that this patchwork, compromise approach doesn’t just appear in diagnostic practice but in the very definitions of diagnoses, and a failure to consider that can lead to “unforeseen consequences” (as Joel Nielsen might put it).

Let’s back up a bit and talk about what I mean by “definitions of diagnoses” and how those definitions appear in algorithmic systems. Algorithmic diagnosis uses a range of techniques and data, but common to all of them is the need for a ground truth, a baseline against which to compare the algorithm’s evaluations to determine its accuracy. If you’re diagnosing something like lung cancer, you get a big database of doctors’ evaluations of, say, X-ray images and train the algorithmic system until it produces the same analysis (or as close to it as you can get). The idea is that the existing judgment of medical practitioners can be formalized, refined, and automated. In this case, the ground truth is not ambiguous: the presence or absence of malignant tumors. A tumor either grows (malignant) or it doesn’t (not malignant).
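To make the role of ground truth concrete, here is a minimal sketch of how a system’s “accuracy” is typically scored: as the rate of agreement with reference labels. The function name and the toy data are assumptions made for illustration, not any particular developer’s method.

```python
from typing import Sequence


def agreement_rate(predictions: Sequence[bool], ground_truth: Sequence[bool]) -> float:
    """Fraction of cases where the system's output matches the reference label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be the same length")
    if not ground_truth:
        raise ValueError("need at least one labeled case")
    matches = sum(p == t for p, t in zip(predictions, ground_truth))
    return matches / len(ground_truth)


# Invented toy data: True means "malignant" according to each source.
doctor_labels = [True, False, False, True, False]
system_output = [True, False, True, True, False]

# The system disagrees with the doctors on the third case only.
rate = agreement_rate(system_output, doctor_labels)
```

The number that comes out measures agreement with whatever was chosen as the baseline, and nothing more. Swap in a different set of reference labels and the same predictions earn a different score.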

But this isn’t the case with all medical diagnoses, or even most of them. With autism, the ground truth that is usually chosen, directly or indirectly, is the formal definition of autism found in the Diagnostic and Statistical Manual of Mental Disorders, or DSM-5, the current standard manual for psychiatry, at least in the U.S. Its standardized definitions would seem useful as a baseline, but these definitions change, as the fact that the DSM is in its fifth iteration makes plain. And they change not only in response to new scientific discoveries or “objective” determinations but also in response to broader dynamics of power and pragmatic medical (rather than scientific) needs.

In some respects, this is inevitable. Inevitable and necessary. Diagnostic standards don’t exist to be pure representations of scientific knowledge; they exist so people can do things with them: so that doctors can use them, on a day-to-day basis; so that patients can understand themselves; so that epidemiologists can understand the shape, as it were, of humanity. Precisely because all these things are done with diagnoses, and precisely because so many actors have an interest in them, they are inevitably going to be the site of political action, as well as putatively objective discovery. The diagnosis definition is where the needs and demands of different stakeholders with different, sometimes contradictory, goals can be negotiated.

Autism provides a clear example of this, as sociologist Gil Eyal and colleagues have adroitly documented. Prior to the 1980s, there was a catchall diagnostic category called “childhood schizophrenia” that comprised not just what we’d now consider schizophrenia but also many features of what we would now call childhood autism. Thanks to Lauretta Bender, who developed diagnostic tests in the mid-20th century to assess children’s mental development, pretty much any non-normative psychiatric state in a child was seen as revealing a hidden, underlying schizophrenia. And if you were diagnosed with childhood schizophrenia, you often ended up institutionalized.

In the 1980s, there was a successful push, as Eyal and his co-authors document, to separate the two out. This wasn’t motivated by theory developments or research; it was motivated by racism. Childhood schizophrenia, as with adult schizophrenia, was heavily racialized; the diagnosis was disproportionately deployed against Black people, and it acquired the cultural associations of being a “Black disease.” Middle-class white parents pressured medical bodies to carve what is now called “autism” out of childhood schizophrenia because they were scared of their kids being associated, even conceptually, with Black people. The changes made to the DSM are where that pressure played out, where that political action was made efficacious.

The point of this example is not that the process of defining psychiatric conditions may be “biased” under certain circumstances. Rather it is to point out that these processes necessarily contain politics: There is always a vast array of stakeholders invested in these definitions that extends far beyond the theorists and researchers trying to find the purest definition of what is being measured and the doctors trying to apply those measures.

Diagnostics always has politics, and it always has stakes; this is the case whether it’s being undertaken by a doctor or by an algorithm. What changes when an algorithm does it, apart from the scale at which it operates, is, in part, the diagnostic method: the mechanisms of perception and computing that shape what an AI system can do and correspondingly shape what diagnostic criteria count once it is widely deployed.

Let’s suppose that a system succeeds in making diagnoses of a particular condition and we therefore deploy it everywhere a doctor is (and everywhere a doctor isn’t, but should have been). This might look like success, both for the idea of algorithmic medicine and the idea of medicine, full stop. But precisely because the algorithmic system is now part of diagnostic practice, its dependencies (and developers) become parties that factor into and are involved in future changes to the diagnosis itself. Suddenly we have AI not only measuring but shaping the determination of what is being measured. The map begins to redraw the territory.

So for example, if new research were to reveal that new modalities of information — auditory, say, or conversational — are vital to understanding autism, it would make theoretical sense to put that in the DSM criteria. But from a practical standpoint, it would be a waste: If the established systems for diagnosing can’t accommodate this information because we’ve built diagnostic practice around algorithms optimized for vision, investing in research into it becomes an order of magnitude harder to justify. Altering the DSM to take it into account would be even more difficult.

This lock on diagnostics is not the outcome that developers are aiming for: They don’t want control over what counts as a condition. But this outcome arises not because of any failure of fairness or accuracy or intention but simply because intervening in medical practice means intervening: It means making yourself party to a whole network of pragmatic considerations not captured in formal diagnostic criteria and then playing a role in reshaping those considerations and needs. To a certain degree, unintended consequences are guaranteed with every action we take; as Louise Amoore has written in considering algorithmic accountability, this applies to AI too.

We will always, then, be surprised by some of what happens. But we don’t need to be surprised that there are surprises. We don’t know precisely how the integration of medical AI will play out, how the politics will unfold. Approaches to diagnosis — and AI in general — need development processes that remain alive to the always unfolding politics and contingencies. They should be premised on thick and rich engagement with the people subject and party to the infrastructures already in place, not imposed from without. If developers spent 15 seconds with doctors, with medical sociologists, or hell, for once in their lives, with autistic people ourselves, the fact that there will be surprises would not catch them unawares. They would be more prepared for the unexpected and in better community when it arrives.

Os Keyes is a PhD student at the University of Washington and an inaugural Ada Lovelace Fellow who studies gender, data, technology, and control.