Real Talk — Real Life

In Disney’s 1989 animation The Little Mermaid, Ariel has her voice removed by the evil sea-witch Ursula. It leaves her body as a little ball of light, taking on a weird echo as it passes from the mermaid to a seashell around the witch’s throat. Later, Ursula appears on the beach before Prince Eric and sings with Ariel’s stolen voice, which snakes through the air and enters his eyes like a vapor, bewitching him: an analog vocal deepfake.

Like the famously dark 1837 fairytale on which it was based, The Little Mermaid harkens back to the ancient figure of the siren, one of many figures within folklore who use the voice to lure unwitting victims to their deaths (another example is the Skinwalker within Navajo culture, said to imitate the voice of a loved one). Because the voice is widely thought of as a source of intense intimacy, it is an effective vehicle of deception. In his story The King Listens, Italo Calvino defines the voice as evidence that “there is a living person, throat, chest, feelings, who sends into the air this voice, different from all other voices.” If the voice emerges from the body, then once emitted “into the air” it also exists somewhat independently of its referent (Slavoj Zizek calls this the “spectral autonomy” of the voice). It is simultaneously internal and external, implying the existence of an original body but not confirming it.

The voice and sound, then, have a deeply ambiguous and unstable relationship to the body. As art historian and communications scholar Jonathan Sterne points out in his book The Audible Past, this ambiguity has long been eschewed in favor of a more straightforward conception of the voice. For Sterne, the now widespread idea that modernity constituted a shift towards the visual (“ocularcentrism”) was accompanied by a new conception of vision as a “fallen sense” which distances us from the world by turning it into an object for analysis. By contrast, sound, hearing and voice have been conceptualized as manifesting a kind of spiritually pure interiority. Drawing on Derrida, Sterne traces this sound-image binary back to the “spirit/letter” distinction in Christianity, in which the “spirit” of religious law (the felt, live truth of it) was set apart from the “letter” of the law (the dead, inert interpretation of it). Sound, Sterne asserts, is implicitly associated with the former, and image with the latter.

Another way to understand this opposition is as the difference between objective and subjective reality. Images have a longstanding (and long-contested) association with empirical, objective reality (surface truth), whereas sound has a presumed relationship to a more emotional, intimate reality (spiritual truth). This binary persists in the way voice and image are organized by technology today. Apps that revolve around the sharing of the image online are bound up with narratives of artifice, vanity, and spectacle, while the growing market for audio media plays on expectations of intimacy, interiority, and authenticity; all of which are heightened by the individualized experience of headphone listening.

Within this persistent framework, the voice seems to hold a privileged relationship to both “the individual” (as opposed to the universal) and “the human” (as opposed to the machinic). Because of this, it has an important role in media that rely on parasocial attachment in particular: podcasts, YouTube broadcasts, celebrity apology statements, or even the notoriously short-lived “invite-only” app Clubhouse. The meaning of truth shifts from something that can be objectively measured to something that must be felt.

Meanwhile, in more public-facing media — such as TikTok, where content is made to circulate widely, accruing meaning as it passes from user to user, rather than beamed in directly to many separate, individualized viewers or listeners — the voice is used with increasing wariness. If the voice is still seen as the seat of individual identity, then such a withholding perhaps constitutes a small attempt to retain something of the “pure self,” or the soul — a core “uncorrupted” by technological mediation. We should not be so sure, however, that such a core exists.

While cross-referencing my memories of the stolen voice scenes from The Little Mermaid with clips from YouTube, I was targeted by an advertisement for a “free online shopping assistant”: a young, fresh-faced girl was sitting behind a desk, looking directly into the camera. Oh, hello, she whispered into two ASMR microphones. You must be here for your appointment. Her voice was warm, close and gentle. It was strangely effective. I felt compelled to listen.

The creeping of ASMR out of the realm of subgenre and into mainstream marketing signals a wider shift in attention to how voice works in a “visceral” way on listeners. This is particularly the case within personality-driven media, where the distinct voice of a broadcaster is figured as the key to their individual, corporeal self. The rise of ASMR in the mid-2010s also coincided with the rise of other forms of media: the “podcast boom,” for example, which is often said to have begun with the release of the true-crime podcast Serial in 2014. Those who have written about this have drawn attention to the more privatized nature of podcast consumption when compared to radio. When we listen to podcasts, our sense of connection to the hosts is heightened by the fact that we can often quite literally hear the saliva moving around in the speakers’ mouths (Camilla Cannon, writing for Real Life, calls this the “biological intimacy of audio consumption”). The podcast industry is largely dominated by non-fiction, and yet its appeal doesn’t lie in the offer of empirical truth. Instead, podcasts seem to offer something that an article or a book can’t: authenticity and presence, a kind of truth that is at once spiritual and bodily.

The voice is also creeping back into forms of direct interpersonal communication, such as the asynchronous voice note, which has become a popular feature on iMessage, WhatsApp, and Facebook Messenger. The appeal of the voice note is often described in terms of convenience (it is easier to tell a long story than to type it out), but there’s an obvious qualitative difference that can’t be explained away by efficiency alone. In an article for GQ on the use of voice notes in online dating, one young person describes their discomfort upon being ambushed by an online match with an unsolicited voice note (not unrelated to the unsolicited dick pic, though obviously a different kind of forced intimacy): “it was like he wanted to swap slices of our souls.”

This is a figure of speech, but it gestures towards a conception that, as Sterne argues, has the actual weight of Christian doctrine behind it. The original Hans Christian Andersen version of The Little Mermaid is, in fact, a fable about the salvation of the soul, in which the mermaid ultimately becomes pure voice. Paradoxically, the idea of the voice as the seat of the soul is tied up with its embodied nature: so conceived, the voice is the sound of the soul as it passes through its fleshy trappings. These ideas run through the way voice is exploited, and withheld, in digital contexts today.

In May this year, the Canadian voice actor Bev Standing sued ByteDance, TikTok’s parent company, for using her voice without permission in its popular “text-to-speech” function. Like those celebrities who found themselves the first victims of video deepfakes, Standing was forced into the realization that her livelihood, by definition, involved the generation of massive amounts of publicly available “data” (she suspected that the imitation could in fact be traced back to a series of recordings she made for the Chinese-government-owned Institute of Acoustics in 2018). Though used to hearing herself in TV and radio commercials, Standing was nonetheless unprepared to find that her electronic counterpart was among the most listened to voices in the world, pronouncing words and phrases that she herself had never before uttered.

Like the voice note, text-to-speech has its origins in accessibility design. Initially introduced to make the platform more accessible to Blind and Low-vision audiences, as well as those who are unable or prefer not to use their voice, it has since been adopted as a core function of the platform. As Cat Zhang wrote in an article for Real Life, TikTok’s primary language is that of the face (and body), which are used in a limited vocabulary of exaggerated, sanitized movements that are repeated and personalized as different users perform their own take on a meme (one recent example is the extremely popular “chopping dance” format). TikTok creators often mime to an existing audio track from TikTok’s library, creating their own spin on the meme through overlaid written captions and text-to-speech narration.

In many instances, text-to-speech isn’t used to make videos more accessible: often, as in this video, the narration introduces a scene where the punchline is purely visual. Nor does it offer the cozy effect of an actual human voice: its intonation is distinctly robotic. What it does, intentionally or unintentionally, is highlight in a surreal way the widespread absence of actual, diegetic human voices relative to the amount of content that is posted. In a large proportion of the content on TikTok, the creator forgoes using their own voice altogether. This fact separates the app from its forerunners such as Vine and Periscope, apps that favored content that closely resembled the home video format, in which the voice appears sometimes on purpose, sometimes accidentally as background noise. Compared to the older generation of YouTube and even Vine broadcasters whose voices were a crucial part of their performance, TikTok’s current “Top” influencer, Loren Gray, uses her own voice in roughly one in every six videos (by my count). This is all the more telling given that she is not explicitly a dancer, unlike analogous TikTok stars Addison Rae and Charli and Dixie D’Amelio: in most of her videos, she simply lip-syncs. TikTok influencer Bella Poarch’s “real voice” is even rarer, surfacing only in the occasional interview or apology video.

On TikTok, the use of the voice is rarely accidental. TikTok videos are less organic and more crafted than Vine and Periscope content. They are less centered on individual expression and more focused on the ability of all or part of a clip to travel, multiply, snowball into a meme. At its best, TikTok can feel gleefully chaotic and irreverent, almost carnivalesque. It is particularly conducive to absurdist, physical, and slapstick humor: for example, short, earnest soundbites from film and TV are paired with bodily performances that are deliberately at odds with the original tone of the audio. In other clips, creators mime the words to a song that — overlaid on the video — expresses the subtext of a more specific narrative spelled out in captions, a format that can be disorienting and hard to follow at first. In this sense, much TikTok content fits within the wider history of mime, lip-syncing and even puppetry. If voice is the vehicle of dramatic performance, then comedic performance revolves around withholding the voice, distorting it, or borrowing that of another.

On TikTok, as on Instagram Reels, when the original sound is left on the video, it is automatically split from the visuals, making it easy to download the track and add it to your own clip, with an attribution bearing the creator’s handle. Because of the ease with which audio can be memeified on TikTok, there is a general understanding, perhaps more than on other platforms, that once content is posted, the creator no longer has any control over where it travels and what it becomes. This feeling is heightened by the arbitrariness of its algorithm which, as Zhang writes, “could fling your content onto any stranger’s feed, instead of restricting it to those who follow you.” If you use your own voice on TikTok, you have to be prepared to suddenly hear your words emerging from the mouth of a stranger, just as Bev Standing did, along with the many actors in popular TV shows and films who find their lines, particularly the more earnest ones, suddenly circulating rapidly on TikTok as comedic content, or a replicable dance.

The reticence around using one’s real voice on TikTok, then, points at a wider sense that the voice is able to travel on the internet in increasingly unpredictable ways. This is particularly unnerving because of the lingering associations between the voice and the most fundamental aspects of who we believe ourselves to be. For philosopher Adriana Cavarero, voice “implicates a correspondence with the fleshy cavity that alludes to the deep body, the most bodily part of the body” (Roland Barthes would call this quality of corporeal expression “the grain of the voice”; while in Derrida’s analysis, it is breath and flesh that make voice able to express the union of “ideality and living presence”).

If the voice belongs to the “deep body,” or the “deep self,” then the image, as Sterne writes, is often seen as belonging to the surface body, or the surface self. TikTok is not a particularly sincere domain. It was never home to the “bedroom confessional” like YouTube was. The scarcity of voice reflects this emphasis: while the visual presentation of self (the “surface body”) is actively commodified and instrumentalized, used as a puppet for a meme, the “deep body” is protected and withheld from a sphere that would quickly reduce it to ridicule.

The reticence around the use of voice on TikTok also speaks to the rapidly changing landscape of artificially intelligent voice synthesis, and the increasing viability of actual vocal deepfakes. Until fairly recently, it seemed that the world of voice technologies in popular consumer settings hadn’t changed all that much since the days of Microsoft Sam, with most electronic voices still being collaged together from distinct phonemes. But this is quickly changing. In 2018, Google showcased its “Duplex” natural language processing (NLP) technology, posting a video that has since amassed over two million views on YouTube in which two realistic voices book a hair appointment and a restaurant reservation. The robots’ speech approximates that of a human, complete with shifting cadence, “ums” and “ahs.” In each case, the person on the other end of the phone remains unaware that they are speaking to a machine. Another company, called Modulate, is currently working to create realistic and customizable “voice skins” for those hoping to change or mask their voice in gaming audio-chats and other online spaces.

The slowness of technological developments in deepfake voice exists in a chicken-and-egg relationship with the profound aversion that such technologies appear to arouse: It is not entirely clear whether we fear fake voices because they are not yet as widespread as fake images, or whether they are not widespread because we fear them. The Bev Standing court case highlights the validity of these fears. One of the most common points of opposition to AI voice simulators is the very real threat they pose to the livelihoods of voice actors, with most AI voice synthesis companies advertising this as one of their primary services. Another tangible, widespread concern is the use of voice clones for illegal purposes: In 2019, a British energy company lost €200K when an employee received a call from a voice he understood to be that of his boss, asking him to transfer the money to a Hungarian bank account as a matter of urgency. Both these points of opposition also signal an underlying, more intangible discomfort which involves the idea of the voice as the seat of the self.

Voice cloning, which refers to the process of training an AI on a specific person’s voice so that it can be replicated to vocalize any phrase typed into a keyboard, has the potential to give more communicative options to those who are unable or who prefer not to use their voice. When Val Kilmer lost his natural speaking voice after undergoing a tracheotomy as part of his treatment for throat cancer, the actor, along with the team behind his new autobiographical documentary (entitled Val), approached the AI voice company Sonantic to generate a synthetic narration. “We all have the capacity to be creative,” says Kilmer, in a video posted on Sonantic’s YouTube channel this August. “Now that I can express myself again, I can bring these dreams to you, and show you this part of myself once more; a part that was never truly gone, just hiding away.” Kilmer’s words echo the old Christian idea that, with the voice silenced, the soul is too.

Sonantic posted another clip on their YouTube channel in which a pair of artificially-voiced lovers argue with each other (“Maybe I should just leave!” shouts one, the emphasis only slightly off. “Go ahead, I’m not gonna stop you!” responds its partner). In some ways, the unease that this recording evokes is similar to the unease surrounding the emergence of the first video deepfakes some years ago; in other ways, it has a more explicitly direct relationship to the aspects of the self we like to consider too intangible (too “deep”) for technology to reach — the minute emotional inflections of the voice by which a “soul-bearing” human might be distinguishable from a “soulless” machine.

Since its origins in the camera obscura or “magic lantern,” the recorded image has been bound up with narratives of illusion and façade. For this reason, deepfake images of dead people, such as the recent “Deep Nostalgia” phenomenon in which historical figures were brought back to life in animated clips, are creepy, but perhaps not quite as creepy as a resurrected voice, which more closely implies a resurrected self. A month before the Val Kilmer audio clip was posted, it emerged that Roadrunner, a new documentary about the late Anthony Bourdain, features several lines of AI-generated narration in Bourdain’s voice, generated after his death. The decision to include these tiny, fake snippets of audio, totaling around 45 seconds, is puzzling; the documentarists could have easily used a different strategy to communicate the words (Bourdain’s own), including in their original form as text. Deploying the technology so flippantly would normally suggest that they didn’t anticipate a backlash, but the fact that the filmmakers did not disclose the use of the AI to viewers suggests otherwise. In an article for the New Yorker on the Bourdain deepfake voice, Helen Rosner suggests that the revelation was particularly upsetting to those who felt a “parasocial intimacy” with Bourdain, and cites MIT Technology Review editor Karen Hao, who explains that the reaction to Bourdain’s voice is a “visceral” one.

If the extreme viscerality of sound is the bodily presence of another, internally felt, then the aversion to the Bourdain deepfake voice signals something fundamental about what voice is understood to be within traditions influenced by Christian doctrine: namely, something that cannot exist outside of the particular meeting of body and soul. But the concept of “viscerality” as it pertains to voice, though often invoked, is a slippery one. It is just as often scientized (see, for example, the literature around ASMR) as it is mystified. In both cases, the meaning of voice is objectified. For Jonathan Sterne, cultural narratives around voice and image often rely on a set of presumptions about their supposedly stable nature as biological, psychological and physical facts. In one of the first research papers published on ASMR in 2014, Joceline Andersen challenged the idea that the phenomenon of ASMR is a “direct” line between stimuli and brain, highlighting the way it relies extensively on associations between sound, care and domesticity. The sensory response that was being reported, Andersen argued, was in fact largely socially determined.

Technologies that play on the “intimacy” of the voice reinforce the idea that it constitutes a more direct, “authentic” mode of communication. But the voice already involves many layers of mediation: It is the site at which our selves enter into negotiation with our surroundings. Our voices sometimes betray our emotions and sometimes disguise them. They change throughout our lifetime, vary between different contexts, and can work with or against the many other ways we communicate, with our bodies, faces and choice of words. The “purity” of the voice invoked by platforms aiming for a sense of intimacy is an ideal that doesn’t correspond to the reality of how the majority of people use their voices in the world.

In an interview from 2019, the Turner Prize-winning artist Lawrence Abu Hamdan, a self-described “private ear” who has used audio forensics in projects ranging from uncovering atrocities in a Syrian prison to validating refugees’ claims for asylum in the EU, describes a work entitled The Recovered Manifesto of Wissam [inaudible]. Speaking about the juridical concept of Taqiyya in Islam, which he translates as “to speak at the readiness of the other to listen,” Abu Hamdan says:

“It’s trying to say that the voice, of all things, should be the last register of truth because we constantly reinvent ourselves when we speak. I’m speaking to you in a very distinct way than I would speak to the person who just checked me in at the boarding counter [at the airport].”

Abu Hamdan’s work uses Taqiyya to think through the role of the voice in the age of surveillance capitalism. If our voices are being listened to at all times, he argues, then artificiality can have emancipatory potential, and be an important tool for self-preservation. Abu Hamdan evokes a conception of voice as malleable, a site where specific relations are forged and negotiated. The voice is not a direct portal to our innermost soul; it is something that we use and manipulate both consciously and unconsciously.

Our instinctive wariness around AI voice synthesis and other technologies that bring the voice more effortlessly into digital spaces is not unjustified. Withholding the voice, as many users on TikTok do, whether consciously or unconsciously, can, maybe, be conceptualized as a minor act of resistance, frustrating the development of technologies that would use massive amounts of harvested voice data for profit, or which would be used to train AI to “guess” our emotions based on vocal tonality.

But critiques of technologies that mediate the self tend to fall short when they invoke the self as a pure, stable and monolithic entity. In this sense, the supposed purity of the “voice” is no different. The voice is not some last remnant of an unmediated human “self.” It is something much more complex than that: a living, unstable source of expression which has never been siloed from the world around us.