By Ryan Fauber
Disclaimer: this was not written by ChatGPT. However, the same can’t be said for an increasing portion of the content we come across online. In the space of 6 months we’ve gone from laughing at how obvious AI-generated text was to marveling at its ability to write Shakespearean poetry. How long until the same is true for audio?
Historically, the most widespread examples of generative voice have been 1) used in short-form outputs like GPS directions and 2) applied to business contexts rather than creative ones. But it won’t be long before you listen to a podcast or watch a movie and ask not only “Was this written by AI?” but also “Was this spoken by AI?”
The pace of change can be hard to comprehend. In April 2023 alone we saw two viral examples of creative generative voice in the wild: DeepZen narrated a book using the voice of a long-deceased narrator, and a fake single used AI-generated Drake and Weeknd vocals. We think these examples barely scratch the surface of what’s possible with complex, contextual generative voice.
In a previous post, we compared foundational models to the Pensieve from Harry Potter, and we view generative voice tools as equally magical. Imagine listening to emails in the voice of the sender or having real-time, unique conversations with video game characters, like a Hogwarts student listening to Howler letters or talking to a portrait.
Below, we take a closer look at why we believe we’re at an inflection point in generative voice applications and what opportunities excite us the most:
Why is now the right time to think about generative voice in creative media?
Until recently, AI-generated voices in GPS, Siri or Alexa, or on phone calls with robot customer support sounded, well, robotic. Rule-based systems that concatenated sounds produced clunky, emotionless outputs.
However, advances in natural language processing (NLP), computing power, and the audio datasets needed to train AI on a diverse range of voices have led to rapid improvement in AI-generated voice technology in recent years. Generative voices have evolved from the monotonous tones of Doctor Who’s Daleks to more closely resemble the deeply human expressions of Scarlett Johansson’s OS character in Her.
We assess use cases across two spectrums: creativity and length. In this context, we define creativity as the variety and depth of emotions, pace, volume, and timbre of the voice acting required. For example, while educational videos are certainly long-form, the range required in narration is far narrower than the range required of a voice actor portraying a main character in an animated film.
What have we learned from our research and network conversations?
- We believe consumers are increasingly excited about non-native creative media. Netflix’s most popular show ever is a Korean drama, Squid Game, drawing 1.65B viewing hours in 28 days – and the majority of viewers chose to watch a dubbed version of the show! But creators still struggle to access new geographic markets due to the cost and time of producing dubbed versions – it can take months of studio and post-production time.
- Short-form creative media is no different, and the production bottlenecks that constrict big-budget TV shows are even more limiting for small-scale content creators. However, there is clearly demand for audio-rich, cross-border content. A recent MrBeast video received roughly one-fifth of its total views at time of writing from its Spanish-dubbed version (57M out of 259M).
- Audiobook consumption is growing rapidly, but a strikingly low percentage of books are translated from their original language – only 3% of books published in the US are in translation. A publishing industry expert we spoke with said that many authors preferred to clone their voices with AI.
- Finally, generative voice opens up entirely new features in content that were impossible to realize with earlier technology: from post-production dialogue edits to inserting localized ads into a podcast using a host’s voice to interactive characters in video games that respond to the player in real time.
What are the current shortcomings of generative voice?
While there are many promising tools on the market today, there are several hurdles that we think are limiting their full adoption in the creative media industry:
- Outputs can still come across as robotic in certain contexts and refining them can require significant manual and technical input. You cannot yet give AI voices direction the same way a director might work with an actor.
- There are concerns about copyright and attribution. Voice actors are understandably reluctant to offer their voice for cloning if a proper compensation framework is not developed. Further, with the notable exception of Grimes (who owns her own catalogue), musicians and studios have been extremely cautious in their approach to AI in creative production.
None of these hurdles are insurmountable, and we believe many companies are working thoughtfully with creators and industry executives to ensure that generative voice augments the creative process rather than disintermediates it.
What’s the opportunity?
We think some of the most compelling long-form creative applications of generative voice include:
- [Augment] Reshape how users interact with content: allow users to listen to and converse with routine content, such as email and news summaries. Generate dialogue in video games based on user inputs.
- [Automate] Audiobooks, ads, learning & development: clone an author or actor’s voice to automate the tedious process of narrating an audiobook, voicing education content, or producing ads.
- [Augment] Creative media localization: dub media into new languages, using the original voice actor’s work as a guide for the AI-generated voice.
- [Augment] Pre-production script prototyping and post-production tweaks in film, TV, and video games: give creative teams more flexibility to try out dialogue before hiring a voice actor, or to fix lines in post-production.
What does it take to win here?
- A platform advantage (making it magical): The end markets for voice acting are large by themselves, but we think the most exciting opportunity is in serving all of them via a universal platform (for example, a high-performance API and a robust creator studio). A successful generative voice tool should be able to work across multiple mediums the same way a voice actor would.
- Talent moat: Today, professional actors play a critical role in productions. We expect that a successful company will develop a proprietary marketplace of cloned voices that can be deployed in a variety of applications while appropriately compensating the original actor.
- Data moat (feeding the beast): The most powerful AI tools in any industry can win by learning from the user. Generative voice is no different: we expect a platform that continuously improves its outputs via feedback loops to outpace non-adaptive solutions.
- Powerful translation: Dubbing represents a major opportunity for generative voice. But translations are complex even in standard use cases, let alone creative contexts. The most famous example may be the monumental task of translating Lewis Carroll’s Jabberwocky. Squid Game itself was not immune to translation controversy. We believe getting the translation piece right will be nonnegotiable for a generative voice platform looking to scale as a viable dubbing alternative.
We are excited for the future of generative voice in creative media, just as much as we are excited about the use cases we are already seeing in short-form contexts. Generative voice will expand content across many new geographies and put new tools in the hands of creators in order to elevate creativity by augmenting and inspiring artists. We’d love to connect if you are exploring the power of AI in creative media – please reach out! We promise you’ll reach a human at the other end of the line 🙂
Let us know if you are…
- Forming your own perspective on the generative voice space
- Building the next generation of generative voice tools
- Thinking more broadly about the applications of AI in creative media