Is the clip stupid or terrifying? I can’t decide. To be honest, it’s a bit of both.
“I just think I would love to get Ratatouille’d,” a familiar-sounding voice begins.
“Ratatouille’d?” asks another recognizable voice.
“Like, have a little guy up there,” the first voice replies. “You know, making me cook delicious meals.”
It sounds like Joe Rogan and Ben Shapiro, two of podcasting’s biggest, most recognizable voices, bantering over the potential real-world execution of the Pixar movie’s premise. A circular argument ensues. What constitutes “getting Ratatouille’d” in the first place? Do the rat’s powers extend beyond the kitchen?
A friend recently sent me the audio of this mind-numbing exchange. I let out a belly laugh, then promptly texted it to several other people—including a guy who once sheepishly told me that he regularly listens to The Joe Rogan Experience.
“Is this real?” he texted back.
They’re AI voices, I told him.
“Whoa. That’s insane,” he said. “Politics is going to get wild.”
I haven’t stopped thinking about how right he is. The voices in that clip, while not perfect replicas of their subjects, are deeply convincing in an uncanny-valley sort of way. “Rogan” has real-world Joe Rogan’s familiar inflection, his half-stoned curiosity. “Shapiro,” for his part, is there with rapid-fire responses and his trademark scoff.
Last week, I reached out to Zach Silberberg, who created the clip using an online tool from the Silicon Valley start-up ElevenLabs. “Eleven brings the most compelling, rich and lifelike voices to creators and publishers seeking the ultimate tools for storytelling,” the firm’s website boasts. The word storytelling is doing a lot of work in that sentence. When does storytelling cross over into disinformation or propaganda?
I asked Silberberg if we could sit down in person to talk about the implications of his viral joke. Though he didn’t engineer the product, he seemed to have already mastered it in a way few others had. Would bad actors soon follow his lead? Did he care? Was it his responsibility to care?
Silberberg is in his late 20s and works in television in New York City. On the morning of our meeting, he shuffled into a TriBeCa coffee shop in a tattered sweater with an upside-down Bart Simpson stitched on the front. He told me how he had been busy making other—in his words—“stupid” clips. In one, an AI version of President Joe Biden informs his fellow Americans that, after watching the 2011 Cameron Crowe flop We Bought a Zoo, he, Biden, also bought a zoo. In another, AI Biden says that the reason he has yet to visit the site of the East Palestine, Ohio, train derailment is that he got lost on the island from Lost. While neither piece of audio features Biden stuttering or word-switching, as he often does when speaking in public, both clips have the distinct Biden cadence, those familiar rises and falls. The scripts, too, have an unmistakable Biden folksiness to them.
“The reason I think these are funny is because you know they’re fake,” Silberberg told me. He said the Rogan-Shapiro conversation took him roughly an hour and a half to produce—it was meant to be a joke, not some well-crafted attempt at tricking people. When I informed him that my Rogan-listening friend initially thought the Ratatouille clip was authentic, Silberberg freaked out: “No! God, no!” he said with a cringe. “That, to me, is fucked up.” He shook his head. “I’m trying to not fall into that, because I’m making it so outlandish,” he said. “I don’t ever want to create a thing that could be mistaken for real.” As with so much involving AI these past few months, it seemed to already be too late.
What if, instead of a sitting president talking about how he regrets buying a zoo, a voice that sounded enough like Biden’s was “caught on tape” saying something much more nefarious? Any number of Big Lie talking points would instantly drive a news cycle. Imagine a convincing AI voice talking about ballot harvesting or hacked voting machines; conspiracy-minded voters would feel validated, while others might simply be confused. And what if the accused public figure—Biden, or anyone, for that matter—couldn’t immediately prove that a viral, potentially career-ending clip was fake?
One of the major political scandals of the past quarter century involved a sketchy recording of a disembodied voice. “When you’re a star, they let you do it,” future President Donald Trump proclaimed. (You know the rest.) That clip was real. Trump, being Trump, survived the scandal, and went on to the White House.
But, given the arsenal of public-facing AI tools flooding the internet—including the voice generator that Silberberg and other shitposters have been playing around with—how easy would it be for a bad actor to create a piece of Access Hollywood–style audio in the run-up to the next election? And what if said clip were created with a TV writer’s touch? Five years ago, Jordan Peele went viral with an AI video of former President Barack Obama saying “Killmonger was right,” “Ben Carson is in the sunken place,” and “President Trump is a total and complete dipshit.” The voice was close, but not that close. And because it was a video, the strange mouth movements were a dead giveaway that the clip was fake. AI audio clips are potentially much more menacing because the audience has fewer context clues to work with. “It doesn’t take a lot, which is the scary thing,” Silberberg said.
He discovered that the AI seems to produce more convincing work when processing just a few words of dialogue at a time. The Rogan-Shapiro clip was successful because of the “Who’s on first?” back-and-forth aspect of it. He downloaded existing audio samples from each podcast host’s massive online archive—three from Shapiro, two from Rogan—uploaded them to ElevenLabs’ website, then input his own script. This is the point where most amateurs will likely fail in their trolling. For a clip to land, even as clear satire, the subject’s diction has to be both believable and familiar. You need to nail the Biden-isms. The shorter the sentences, the less time the listener has to question the validity of the voice. Plus, Silberberg learned, the more you type, the more likely the AI voices are to stumble over punctuation or lapse into awkward vocal flourishes. Sticking to quick snippets also makes it easier to retry individual lines of the script to perfect a specific inflection, rather than trudging through a whole paragraph of dialogue. But this is just where we are today, 21 months before the next federal elections. It’s going to get better, and scarier, very fast.
If it seems like AI is everywhere all at once right now, swallowing both our attention and the internet, that’s because it is. While I was transcribing my interview with Silberberg in a Google Doc, Google’s own AI began suggesting upcoming words in our conversation as I typed. Many of the fill-ins were close, but not entirely accurate; I ignored them. On Monday, Mark Zuckerberg said he was creating “a new top-level product group at Meta focused on generative AI to turbocharge our work in this area.” This news came just weeks after Kevin Roose, of The New York Times, published a widely read story about how he had provoked Microsoft’s Bing AI tool into making a range of unsettling, emotionally charged statements. A couple of weeks before that, the DJ David Guetta revealed that he had used an AI version of Eminem’s voice in a live performance, delivering lyrics that the real-life Eminem had never rapped. Elsewhere last month, the editor of the science-fiction magazine Clarkesworld said he had stopped accepting submissions because too many of them appeared to be AI-generated texts.
This past Sunday, Sam Altman, the CEO of OpenAI, the company behind ChatGPT, cryptically tweeted, “A new version of Moore’s Law that could start soon: the amount of intelligence in the universe doubles every 18 months.” Altman is 37 years old, meaning he’s of the generation that remembers daily life without a computer. Silberberg’s generation, the one after Altman’s, does not, and that cohort is already embracing AI faster than the rest of us.
Like a lot of people, I first encountered a “naturalistic” AI voice when watching Roadrunner, the otherwise excellent 2021 Anthony Bourdain documentary. News of the filmmakers’ curious decision to include a brief, fake voice-over from the late Bourdain dominated the media coverage of the movie and, for some viewers, made the film distracting to watch. (You may have found yourself always listening for “the moment.”) The filmmakers had so much material to work with, including hours of actual Bourdain narration. What did faking a brief moment really accomplish? And why didn’t they disclose it to viewers?
“My opinion is that, blanket statement, the use of AI technology is pretty bleak,” Silberberg said. “The way that it is headed is scary. And it is already replacing artists, and is already creating really fucked-up, gross scenarios.”
A brief survey of scenarios that already exist: an AI version of Emma Watson reading Mein Kampf, an AI Bill Gates “revealing” that the coronavirus vaccine causes AIDS, an AI Biden attacking transgender individuals. Reporters at The Verge created their own AI Biden to announce the invasion of Russia and validate one of the most toxic conspiracy theories of our time.
The problem, essentially, is that far too many people find the cruel, nihilistic examples just as funny as Silberberg’s absurd, low-stakes mastery of the form. He told me that as the Ratatouille clip began to go viral, he muted his own tweet, so he still doesn’t know just how far and wide it has gone. A bot notified him that Twitter’s owner, Elon Musk, “liked” the video. Shapiro, for his part, posted “LMFAO” and a laughing-crying emoji over another Twitter account’s carbon copy of Silberberg’s clip. As he and I talked about the implications of his work that morning, he seemed to grow more and more concerned.
“I’m already in weird ethical waters, because I’m using people’s voices without their consent. But they’re public figures, political figures, or public commentators,” he said. “These are questions that I’m grappling with—these are things that I haven’t fully thought through all the way to the end, where I’m like, ‘Oh yeah, maybe I should not even have done this. Maybe I shouldn’t have even touched these tools, because it’s reinforcing the idea that they’re useful.’ Or maybe someone saw the Ratatouille video and was like, ‘Oh, I can do this? Let me do this.’ And I’ve exposed a bunch of right-wing Rogan fans to the idea that they can deepfake a public figure. And that to me is scary. That’s not my goal. My goal is to make people chuckle. My goal is to make people have a little giggle.”
Neither the White House nor ElevenLabs responded to my request for comment on the potential effects of these clips on American politics. Several weeks ago, after the first round of trolls used Eleven’s technology for what the company described as “malicious purposes,” Eleven responded with a lengthy tweet thread outlining steps it was taking to curb abuse. Although most of it was boilerplate, one notable change was restricting the creation of new voice clones to paid users only, on the theory that a person who has supplied a credit-card number is less likely to troll.
Near the end of our conversation, Silberberg took a stab at optimism. “As these tools progress, countermeasures will also progress to be able to detect these tools. ChatGPT started gaining popularity, and within days someone had written a thing that could detect whether something was ChatGPT,” he said. But then he thought more about the future: “I think as soon as you’re trying to trick someone, you’re trying to take someone’s job, you’re trying to reinforce a political agenda—you know, you can satirize something, but the instant you’re trying to convince someone it’s real, it chills me. It shakes me to my very core.”
On its website, Eleven still proudly advertises its “uncanny quality,” bragging that its model “is built to grasp the logic and emotions behind words.” Soon, the unsettling uncanny-valley element may be replaced by something indistinguishable from human intonation. And then even the funny stuff, like Silberberg’s work, may stop making us laugh.