Too Human

The more “lifelike” simulations are, the less effective they risk becoming

As I was being wheeled out of the post-anesthesia care unit at Mount Sinai, three incisions across my abdomen freshly glued shut and fentanyl warm in my veins, a nurse bustled over. “Your pictures, dear,” she said, tucking a colorful strip of photos into my sweatshirt like a garnish.

Leafing through my recovery papers the next day to search for bathing instructions, I found the photos — a sheet of high-resolution shots of my insides. I was certain there’d been a mistake. I’d taught peer sex ed for six years. To prepare, I’d assembled plastic models of the pelvic region and memorized models showing how the organs shifted during pregnancy. But looking at the photos of my actual insides, I couldn’t tell my ovaries from my large intestine. Somewhere in the jump from plastic models to real-world practice, my proficiency had vanished.

I’m not a medical student, and the models I’d studied, with their pop-in-place pastel organs, could have been made in the 1970s. Today, medical simulators are far more sophisticated: High-fidelity manikins blink and breathe and can be programmed to mimic everything from septic shock to ballistic trauma. Isolated body parts like arms and throats and groins have internal pressure sensors that assess how students intubate or examine them. Still, the gap between what I thought I understood and what my insides really looked like troubled me, and seemed to go beyond the limitations of my training and equipment. It had more to do with the feeling of certainty I’d had, having gained competence with a model — a self-contained system that claimed to be an exact analogue of the real thing.

Medical simulations, like simulations in other fields such as aviation and automotive engineering, have often been judged by their perceived fidelity to the real thing: whether a model can capture the frill of tertiary bronchi or reproduce the snag of sliding a scalpel through flesh. Medical models are designed to feel realistic, even when they don’t look it: The simulated pelvis that students learn pelvic exams on doesn’t visually resemble a human — it is a roughly breadbox-sized object that sits on an exam table — but students can feel the same textural differences between the ovaries and the uterus that they’d notice on a patient.

Naturally, models that resemble the real thing feel more accurate; presumably, the more “lifelike” or intuitive a model becomes, the better it will be at teaching medical students. At the heart of this presumption is the faith that engineers and doctors are capable of correctly reproducing a universalized version of human anatomy, while identifying the right and wrong ways to interact with the human body. Yet according to anatomy professors, medical sociologists, organizational psychologists and feminist technoscientists, higher-fidelity physical models haven’t closed the gap between models and real-world practice. In fact, the reverse is arguably true — the more lifelike models become, and the greater the sense of mastery they provide, the more they hide the deficiencies and biases they contain. In other words, when physical fidelity is held up as the highest goal for simulators, the simulators may become less — not more — effective.

Today, training by simulator is almost universal in U.S. medical and nursing schools. Students regularly use simulation training to practice responding to septic shock and heart attacks; to insert catheters and IV lines; and to take breast biopsies and conduct laparoscopic surgeries. In the ’90s, students might have practiced placing IV lines on each other, and practiced more intimate techniques like the pelvic exam on professional patients — people who were trained to follow specific scripts — before moving to actual patients. Now, practicing on screens and manikins is the prerequisite for hands-on training, and in some cases — like placing IV lines — has largely replaced it.

In many ways simulations are less troubling than their alternatives. They let students practice clinical skills without risking harm to patients, and study anatomy without dissecting cadavers, which are expensive to store, stressful to look at, notorious emitters of formaldehyde — a carcinogenic chemical that stimulates hunger — and much harder to procure. In 2017, the Las Vegas School of Medicine became the first allopathic American university to teach anatomy labs with virtual 3D anatomy instead of cadaver dissection; half of all Canadian medical schools don’t have students dissect cadavers at all. Instead, they study computer-simulated bodies and look at pre-dissected limbs and body parts that highlight certain anatomical features.

Today, a simulated torso with organs looks more like a “real” torso than ever. However, there is limited research on how the skills developed in simulation carry over into practice, and how the training affects patient outcomes. Peter Dieckmann, an organizational psychologist who is one of the few experts to study what makes medical simulations work, points out that there are no rigorous, longitudinal studies on how simulation training affects students’ biases, empathy, or treatment of patients, and that there are few national standards for how simulations are designed, which features are included, and how they are taught. The little research that has been done over the past 20 years on the impacts of interacting with cadavers versus simulators indicates that the jump to real-world practice may create more than just an internal sense of confusion or surprise. According to one study, students who dissect cadavers identify 16 percent more anatomical features and test 11 percent higher in explanatory knowledge than students who use only simulators in their anatomy labs.

By definition, simulation training involves substituting one medium for another — silicone for skin, screens for bodies — with the belief that the skills will transfer smoothly from the practice setting to performance. But we know that the design of built objects, including simulators, isn’t neutral or objective: It reveals a set of beliefs and values. The fact that men’s public restrooms are built without changing tables says something about the role men are assumed to have in raising children. The design of the women’s power suit, based on menswear, says something about what we expect authority to look like. Simulators are no different. A human decides which features to include and which experiences to recreate; this in turn determines how skills are developed, and which biases are reinforced.

Computer-generated models of the body are often designed to be as average as possible: average size, average weight, average anatomy, average organs. But as Todd Rose, director of Harvard’s “Mind, Brain, and Education” program, argues in his book, The End of Average, the average version of something is often very different from any existing individual.

In the 1940s, Rose writes, the United States Air Force decided to redesign their airplane cockpits to reduce the number of crashes that had been plaguing the force after switching to more powerful planes. The air force took 140 measurements of over 4,000 pilots, planning to build a cockpit that better fit the average man. One scientist, Lieutenant Gilbert Daniels, doubted that this method would have much of an impact on the crash rate; to prove it, he calculated the average of 10 measurements he thought most important for cockpit design, generously defining average to be between the 35th and 65th percentiles. In the end, none of the young men were within average range in all 10 of the dimensions — fewer than four percent had average measurements in even three of the dimensions. In other words, none of them fit the “average” profile abstracted from their own bodies.
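The arithmetic behind Daniels’s result can be sketched in a few lines. The sketch below assumes, purely for illustration, that the 10 dimensions are statistically independent and that “average” means falling in the band between the 35th and 65th percentiles, so 30 percent of pilots qualify on any one dimension. Real body measurements are correlated, so the true figures differ; the function names and the round figure of 4,000 pilots are assumptions, not data from Rose’s book.

```python
# Illustrative only: treats each body dimension as independent,
# which real measurements are not.
AVERAGE_BAND = 0.65 - 0.35  # share of pilots "average" on one dimension


def share_average_on_all(dimensions: int, band: float = AVERAGE_BAND) -> float:
    """Share of people inside the band on every one of `dimensions` measurements."""
    return band ** dimensions


def expected_count(pilots: int, dimensions: int) -> float:
    """Expected number of pilots who are average on all `dimensions`."""
    return pilots * share_average_on_all(dimensions)


if __name__ == "__main__":
    pilots = 4000  # hypothetical round number; the text says "over 4,000"
    for d in (1, 3, 10):
        share = share_average_on_all(d)
        print(f"{d:>2} dimensions: share {share:.6f}, "
              f"expected pilots {expected_count(pilots, d):.3f}")
```

Under this simplified assumption, the share of pilots who are average on three dimensions is 0.3 cubed, or about 2.7 percent, and the expected number who are average on all 10 is well below one pilot in 4,000. That is roughly in line with the pattern Daniels reported, though his figures came from real measurements rather than from a model like this one.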

Rose gives a second example: In 1945, a Cleveland newspaper announced a contest in which the women with the most average body measurements would win war bonds and stamps. Contenders would be compared to “Norma,” a statue based on measurements of 15,000 young and mostly white women. The judges were shocked to find that of the 3,864 contestants, no one was average in all nine dimensions. “The Norma Look-Alike contest demonstrated that average-size women did not exist,” Rose writes — but at the time, this was not a common takeaway. “Most doctors and scientists of the era did not interpret the contest results as evidence that Norma was a misguided ideal. Just the opposite: many concluded that American women, on the whole, were unhealthy and out of shape.” The lure of the non-existent average body was so strong that the regular variation shown by healthy women was seen as a sign of illness.

“The notion that you can learn anatomy from a textbook and get any appreciation of that diversity is just completely naive,” Callum Ross, a professor of organismal biology and anatomy, told me. Medical textbooks show the kidney, for instance, in the same location with a single artery leading to each. But healthy people often have two or three arteries leading to each kidney, and kidneys can be on the same side of the body, or form a U-shaped horseshoe. “This is all normal variation in the sense that you wouldn’t even know — the person physiologically is working fine.” But when models show “average” organs, Ross said, healthy variations like these are erased.

Historically, when medicine has considered healthy variances — differences in skull shape, skin color, or sexuality — its response has often been to pathologize. In 1796, when German physician Franz Joseph Gall proposed that bumps on the human head could be read to reveal a person’s personality and moral bearing — a field that would become known as phrenology — the pro-slavery physician Charles Caldwell used it to argue that African Americans belonged to a different, tamer race and required masters. Abolitionists came to the opposite conclusion: African Americans should be protected because they were constitutionally submissive and easy to exploit. Almost no one argued that assessing someone’s character through the shape of their head might be a flawed science. These examples aren’t limited to the 18th century: When the medical community approached sexual variability in the 1950s, homosexuality became a disorder listed in the Diagnostic and Statistical Manual of Mental Disorders, where it remained for decades.

The practice of extracting averages is a normative enterprise. Humans decide who to include in their samples, which bodies are relevant and which are not — what type of person constitutes the ideal user of a technology, or recipient of care. Variability, however natural, creates a “problem” to be excised or pathologized; and this can have disastrous effects for people whose bodies fall outside the “norm.”

In her newest book, Invisible Women: Exposing Data Bias in a World Designed for Men, Caroline Criado Perez describes the deadly result of companies spending decades simulating car crashes using only male crash-test dummies. In crashes of equal intensity, women are almost 50 percent more likely to be seriously injured than men — even when height, weight, and seat belt use are factored in. And while men are more likely to be in a crash, women are still 17 percent more likely to die in crashes of equivalent intensity.

“The reason this has been allowed to happen is very simple,” Criado Perez writes. “Cars have been designed using car crash-test dummies based on the ‘average’ male.” As a result, everything in cars, from the scoop of the seat to the choice of materials, was selected to cushion male dummies. Simply reaching the gas pedal and looking over the dash requires many women to perch upright at the edge of their seat, putting them in a more dangerous driving position than men usually find themselves in. And because women typically have smaller neck muscles and weigh less than men, the firmness of car seats — designed to cushion a heavier male body — actively increases a woman’s odds of getting whiplash. Researchers have shown that designing less firm seats would increase a woman’s odds of surviving in accidents, but the federal government didn’t begin requiring companies to test female dummies as well until 2011. Even these female manikins are often scaled-down male dummies. They don’t reflect the differences in female bone density, vertebrae spacing or neck strength, all features that affect survival in car crashes.

The same deadly selection bias is evident throughout the medical field. Since government-funded medical research became common in the 1940s, medical researchers have disproportionately used white men in their studies. When studying things like heart disease, which affects both sexes, researchers assumed women would function like tinier men, even though women have different heart attack symptoms, causes and outcomes — the differences are marked enough that C. Noel Bairey Merz, director of the Women’s Heart Center at the Cedars-Sinai Heart Institute, believes men’s and women’s heart disease should go by different names. Because medical personnel are more likely to be trained to recognize men’s symptoms, they are less likely to recognize heart attacks in women, and women who have heart attacks remain more likely to die from them.

Today medical practitioners know that different ethnicities have distinct genetic variations that determine how they respond to treatments and how their disease progresses; as a result, medical interventions from asthma inhalers to cancer treatments have been found to be less effective for Pacific Islanders, African Americans and Native Americans than for the white people studied. Prostate cancer, which is twice as deadly for African American men as for white men, illustrates this treatment gap. In 2016, Edward Schaeffer, Chairman of Urology at Northwestern University, found that African American men often develop prostate cancer that is biologically distinct from the cancer white men develop. It has different biomarkers, progresses differently, and responds differently to treatment. Consequently, African American men have historically been treated for a variation of cancer they don’t have.

This cycle, and its consequences, resembles algorithmic bias in machine learning. When the Google Photos app attempted to use machine learning to label objects in users’ photos in 2015, it labeled photos of Black people as “gorillas,” in part because it had been trained using disproportionately white photos. The data journalist Julia Angwin and her team have shown that risk assessment algorithms, which are used in some states to determine if a defendant should be offered parole, overstate the risk of Black defendants committing a future crime while understating the risk of white people reoffending. Black defendants are treated more harshly than white defendants because the machines learn from biased data that emphasizes Black criminality and white innocence.

As machine learning and other simulations are increasingly relied on for diagnostic and decision-making processes — and as the outer appearance of models becomes more realistic — there is less reason for practitioners to wonder whether the internal workings of the simulation training are based on biased data. If anything, the realism of simulations makes practitioners less likely to wonder whether their training contains a sampling bias that is making them less effective.

At the University of Chicago, where cadavers are still used in anatomy labs, medical students dissect individuals, not “average” models. By the time they finish the lab, first-year students will have actively dissected six cadavers and watched classmates and proctors dissect another 20. Students who return as teaching assistants see 52 cadavers, and those who assist again in their fourth year see 78. As a group, the students at the University of Chicago are often young and able-bodied — “Most of them are used to viewing bodies as the ideal body through the lens of advertising,” Callum Ross told me. “It’s young, it’s slim, it’s attractive. And most people aren’t like that. In the anatomy lab that’s one of the first things they see.” The cadavers skew older. “They have gray pubic hair — kids don’t think about that.”

Cadavers regularly have amputations, scars from mastectomies, or medical devices still attached to them. During dissection, students see the results of procedures they may one day recommend or conduct, like the bands of fibrous tissue that often cause organs and tissue to stick together after abdominal surgery. When they get to the lower pelvis, they often see how metastasized cancer can wrap itself around the spine. It’s not a huge leap to imagine the difficulty the person must have had walking. This empathetic wondering and imagining doesn’t happen when looking at a 3D amalgamated model of a generic body.

Looking at cadavers — looking at death — and spending hours grappling with it gives med students time to reflect. That’s important. Even if models are redesigned to build in variance, it’s unclear whether they could create the opportunity to empathize. Ericka Johnson, a professor in technology and social change at Linköping University, began studying the impact of pelvic simulation training in the early 2000s. Simulators are designed to monitor the pressure students apply internally, but interviews with professional patients who teach pelvic exams found that the criteria that mattered most to these patients went far beyond that. They were concerned with the student’s body posture and body language, how they made eye contact, how they asked questions and elicited information, and the temperatures of the metal speculum and lubricant. As Johnson writes about the simulator’s design, “the patient’s experience of a medical practice is not merely silenced or made invisible, it is never even considered.” By erasing the possibility of developing these skills, the simulator’s design designates them as unimportant.

Anyone who has received a pelvic exam knows that technical proficiency is not enough. Erasing a patient’s experience during a pelvic exam isn’t just a matter of respect or sensitivity — it has a measurable effect on patients’ health. Anxiety around the pelvic exam sometimes leads patients to put off or avoid the exam altogether. And people who don’t like their provider’s bedside manner are less likely to accurately share their medical history, follow their doctor’s recommendations, or keep appointments. Simulation that completely erases the provider’s ability to empathize with a patient likely has a measurable impact on patient outcomes.

“Internalizing that ineffable sense that variation is normal and natural is just fundamental for medicine,” Callum Ross told me. The axes of diversity students see extend beyond age, gender, and ethnicity. It’s not just anatomy that varies; patients — people — also have unique cultural experiences and religious beliefs, and carry their own complicated histories that inform how they interact with the medical system. Recognizing this saves lives: Recent studies have shown that when doctors view patients as individuals on these axes as well, patients have measurably better health outcomes managing everything from asthma to diabetes to hypertension.

When Gilbert Daniels realized that a standard cockpit designed for the average body would imperil most pilots, he recommended adjustable cockpits so pilots could adapt equipment to their longer-than-average legs or shorter-than-average arms. Airplane manufacturers balked at first; the leading companies said that incorporating variation would be too difficult, exorbitantly expensive, and take too long. The air force insisted, and the adjustable equipment turned out to be fairly easily installed. The crash rate plummeted.

Daniels’s math ended the idea of a single, “real” pilot’s body, but discussions about medical simulators still focus on the idea of physical fidelity to a “real” patient body. As Johnson writes, new simulators often include instructions for future improvements like more realistic skin tone and texture — and when a simulator enters the educational field, professors ask whether it mimics a “real” patient body. Working toward this type of fidelity has advantages — students can see how skin color changes from shock, and feel how much pressure is needed to give a proper chest compression — but it also has pitfalls. Physical fidelity doesn’t require simulators to incorporate diversity, variance or patient experience into their design; and the push toward fidelity can hide the need for these other elements.

The Norma contest reminds us that models designed without variance end up representing no one, while the example of the car crash dummies proves that simulators that ignore diversity can have deadly outcomes for underrepresented groups. Medicine’s history of pathologizing variance gives us no reason to assume these features would inevitably be incorporated into models, and the automobile industry’s initial resistance to designing more varied models reminds us that companies hesitate to develop these features unless consumers clamor for them. What’s most needed next isn’t a simulator that better recreates the texture and hue of an organ, but one that is designed to better incorporate the messy specificities of life.

Ella Jacobson is a freelance journalist and writer. An Alaskan transplant to Brooklyn, she writes about class, literature, and the wilderness. To read her most recent work, follow her on Twitter.