Sara Weisweaver, PhD; Rhyan Johnson; Sarah Fairweather; Alison Ma; Jordan Hoskins; Michael Petrochuk
WellSaid Labs Research
Abstract. Despite recent progress in generative speech modeling, generating high-quality, diverse samples from complex datasets remains an elusive goal. This work introduces HINTS, a novel generative model architecture that combines state-of-the-art neural text-to-speech (TTS) with contextual annotations. We learn a separate mapping network that accepts any manner of supervised annotation for controlling the generator, enabling scale-specific modulation and interpolation operations such as loudness and tempo adjustments. This setup ensures that our annotations are consistent, interpretable, and context-aware. Audio samples are available below. A beta model built on the HINTS architecture is available on wellsaidlabs.com.
In recent years, generative models have ushered in a paradigm shift in content production. Despite their transformative capabilities, ensuring these models adhere to specific creative preferences remains challenging. The prevailing method for controlling generative models is natural language description (i.e., prompting). However, many artistic preferences are nuanced and difficult to describe in words.
The method introduced in StyleGAN and refined in StyleGAN2 and related models offers an alternative approach: these architectures decouple latent spaces, enabling precise manipulations that range from high-level attributes to finer details. These controls are not only precise and interpolable but also interpretable and context-aware.
Today, we announce a breakthrough in generative modeling for speech synthesis: HINTS (Highly Intuitive Naturally Tailored Speech).
Our flagship text-to-speech model learns a separate mapping network that maps from contextual annotations (cues) to a latent space 𝒲 that controls the generator. This allows for generating high-quality and diverse performances of the same script and speaker through a consistent, interpretable, and context-aware mechanism.
Initially, we studied loudness and tempo cues, addressing their historical challenges using this framework. Where loudness controls traditionally vary decibel outputs, our loudness cue allows for a range of performances that vary in timbre, which is important for natural prosody. Similarly, our tempo cue does not modify pitch, addressing the complex inverse relationship between frequency and time. Both cue options, when applied individually or nested, allow for an expansive range of realistically synthesized expressive and performative audio.
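For readers who want a concrete picture of this design, the following is a minimal sketch of a cue-to-latent mapping network in PyTorch. The class, layer sizes, and cue encoding are illustrative placeholders only, not the production HINTS implementation.

```python
import torch
import torch.nn as nn

class CueMappingNetwork(nn.Module):
    """Simplified illustration: map per-token cue annotations (loudness, tempo)
    to style vectors in a latent space W that modulates the TTS generator.
    Layer sizes and cue encoding are placeholders, not the HINTS model."""

    def __init__(self, num_cue_types: int = 2, w_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_cue_types, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, w_dim),
        )

    def forward(self, cues: torch.Tensor) -> torch.Tensor:
        # cues: (batch, seq_len, num_cue_types), e.g. one loudness and one
        # tempo value per input token, zero where no annotation applies.
        return self.mlp(cues)  # (batch, seq_len, w_dim) per-token style vectors


# Usage sketch: a loudness cue on tokens 3-7 with a tempo cue nested on tokens 5-7.
cues = torch.zeros(1, 20, 2)
cues[0, 3:8, 0] = 6.0   # loudness annotation value
cues[0, 5:8, 1] = 0.8   # nested tempo annotation value
w = CueMappingNetwork()(cues)  # latents that condition the generator
```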
This general framework supports many types of cues. We will be releasing more soon. We include audio samples below.
Please email our CTO, Michael Petrochuk (michael [at] wellsaidlabs.com), with any questions.
In a novel approach, we use annotations alone to successfully guide the model to a comprehensive range of points in the solution space for a single target speaker, using the same sample script.
Sample set 1
We use annotations to craft three distinct listener-friendly versions of the same script. Sample 1D includes
a tempo annotation, a loudness annotation, and a tempo annotation nested inside a loudness annotation.
Speaker: Ben D.
Style: Narration
Source speaker language & location: English, South Africa
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 1A | No annotations | *(audio sample)* |
| 1B | Call to action slowed | *(audio sample)* |
| 1C | Activity name made louder and slower | *(audio sample)* |
| 1D | Focus on eliciting user response | *(audio sample)* |
Sample set 2
We show how a large area of the solution space can be represented by applying maximum and minimum
value annotations. Sample 2F shows how cues can be used to emphasize and slow down the key
technical information in this passage.
Speaker: Terra G.
Style: Narration
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 2A | No annotations | *(audio sample)* |
| 2B | Louder | *(audio sample)* |
| 2C | Quieter | *(audio sample)* |
| 2D | Slower | *(audio sample)* |
| 2E | Faster | *(audio sample)* |
| 2F | Key information emphasized | *(audio sample)* |
Sample set 3
Our catalog of avatars is responsive to cues. Cues can be nested, even at maximum
levels.
Speaker: Alan T.
Style: Narration
Source speaker language & location: English, United Kingdom
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 3A | No annotations | *(audio sample)* |
| 3B | Faster | *(audio sample)* |
| 3C | Slower | *(audio sample)* |
| 3D | Quieter | *(audio sample)* |
| 3E | Louder | *(audio sample)* |
| 3F | Louder and Faster | *(audio sample)* |
| 3G | Louder and Slower | *(audio sample)* |
| 3H | Slower and Quieter | *(audio sample)* |
| 3I | Quieter and Faster | *(audio sample)* |
Sample set 4
Annotation combinations, particularly on texts whose delivery depends on actor nuance, produce audio clips with diverse emotional tonalities.
Speaker: Jordan T.
Style: Narration
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 4A | No annotations | *(audio sample)* |
| 4B | Casual, off-hand effect | *(audio sample)* |
| 4C | Measured, emotive effect | *(audio sample)* |
The model responds intuitively to a variety of inputs: various cue and text lengths, various annotation combinations, and various nesting patterns.
Sample set 5
Annotations allow users to direct the AI to realize their artistic vision. Avatars respond to cues in a manner consistent with their own individual styles.
Speaker: Paige L.
Style: Narration
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 5A | Editor's choice: entire clip slowed, with loudness increased on key phrases | *(audio sample)* |
Speaker: Paul B.
Style: Promo
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 5B | Paced (pauses lengthened) and directed for short ad to appear on social media | *(audio sample)* |
Speaker: Ramona J.
Style: Promo
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 5C | Paced (pauses lengthened) and directed for short ad to appear on social media | *(audio sample)* |
Sample set 6
Annotations can be applied to very long passages with no degradation. An annotation applied to the penultimate paragraph produces the expected change, and the final paragraph is delivered in the default, non-annotated style with no degradation.
Speaker: Lulu G.
Style: Narration
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 6A | 4239 characters generated in a single take with no annotations and no degradation | *(audio sample)* |
| 6B | 4239 characters generated in a single take with the second-to-last paragraph annotated; the model shows no degradation in the annotated portion and resumes normal loudness and tempo for the final paragraph | *(audio sample)* |
| 6C | 4239 characters generated in a single take with the entire passage annotated and no degradation | *(audio sample)* |
The model can generate audio across a range of annotation values, allowing for precise control. In the following sample sets, we illustrate the model's capacity to increase or decrease specific audio attributes incrementally, in a scaled manner. We show the unannotated control sentence alongside incremental increases or decreases in loudness and tempo. Our examples reflect adjustments users would actually want to make.
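To make this scaling concrete, here is a sketch of sweeping a single loudness annotation across a range of values and rendering each step. The `synthesize` callable and the annotation dictionary format are hypothetical placeholders for whatever interface exposes the cues, not the actual WellSaid Labs API.

```python
from typing import Callable

def sweep_loudness(synthesize: Callable, script: str, span: tuple[int, int],
                   values=(0, 2, 4, 6, 8, 10)):
    """Render the same script at incrementally louder settings on one span.

    `span` is the (start, end) character range the loudness cue covers;
    a value of 0 is treated as the unannotated control. The `synthesize`
    callable and annotation format are hypothetical, for illustration only."""
    clips = {}
    for value in values:
        annotations = [] if value == 0 else [
            {"type": "loudness", "start": span[0], "end": span[1], "value": value}
        ]
        clips[value] = synthesize(script, annotations=annotations)
    return clips
```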
Sample set 7
Dialogue from this passage in Kafka's Metamorphosis is made gradually louder.
Speaker: Garry J.
Style: Narration
Source speaker language & location: English, Canada
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 7A | No annotations | *(audio sample)* |
| 7B | Dialogue louder at 2 | *(audio sample)* |
| 7C | Dialogue louder at 4 | *(audio sample)* |
| 7D | Dialogue louder at 6 | *(audio sample)* |
| 7E | Dialogue louder at 8 | *(audio sample)* |
| 7F | Dialogue full user-facing loudness at 10 | *(audio sample)* |
Sample set 8
The middle sentence of this invented customer dialogue is gradually quietened.
Speaker: Zach E.
Style: Promo
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 8A | No annotations | *(audio sample)* |
| 8B | Middle sentence quieter at -2 | *(audio sample)* |
| 8C | Middle sentence quieter at -4 | *(audio sample)* |
| 8D | Middle sentence quieter at -8 | *(audio sample)* |
| 8E | Middle sentence quieter at -12 | *(audio sample)* |
| 8F | Middle sentence max user-facing quiet at -20 | *(audio sample)* |
Sample set 9
A content warning is delivered at an incrementally increased pace.
Speaker: Sofia H.
Style: Conversational
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 9A | No annotations | *(audio sample)* |
| 9B | Faster at 1.3 | *(audio sample)* |
| 9C | Faster at 1.6 | *(audio sample)* |
| 9D | Faster at 1.9 | *(audio sample)* |
| 9E | Faster at 2.2 | *(audio sample)* |
| 9F | Fastest user-facing pace at 2.5 | *(audio sample)* |
Sample set 10
In this definition of Boyle's Law, provided by Wikipedia, the key defining phrase is delivered at an incrementally decreased pace. The respelling nested inside the slowed passage responds as expected, with no pronunciation degradation.
Speaker: Michael V.
Style: Narration
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 10A | No speed annotations, one respelling cue | *(audio sample)* |
| 10B | Definition slower at 0.9 | *(audio sample)* |
| 10C | Definition slower at 0.8 | *(audio sample)* |
| 10D | Definition slower at 0.7 | *(audio sample)* |
| 10E | Definition slower at 0.6 | *(audio sample)* |
| 10F | Definition slowest user-facing pace at 0.5 | *(audio sample)* |
Cues can be effectively applied to spaces and punctuation marks to customize pausing and spacing.
Sample set 11
Periods, commas, ellipses, and colons are slowed to create a moment of pause while preserving
each text's prosody.
Speaker: Cameron S.
Style: Narration
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 11A | No annotations | *(audio sample)* |
| 11B | Pause lengthened on a comma | *(audio sample)* |
Speaker: Ali P.
Style: Narration
Source speaker language & location: English, Australia
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 11C | No annotations | *(audio sample)* |
| 11D | Pause lengthened on a period | *(audio sample)* |
Speaker: Joe F.
Style: Promo
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 11E | No annotations | *(audio sample)* |
| 11F | Pauses lengthened on three periods | *(audio sample)* |
| 11G | Pauses slightly lengthened on three periods; final phrase slowed and quieted for dramatic effect | *(audio sample)* |
Speaker: Lulu G.
Style: Narration
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 11H | No annotations | *(audio sample)* |
| 11I | Pause lengthened on a colon | *(audio sample)* |
Within cued performances, the model can push a target speaker’s performance range beyond what is present in the source speaker’s training data.
Sample set 12
In the following audio samples, we include the maximum- and minimum-value segments of the original gold dataset for loudness (LUFS) and tempo (characters per second, CPS). These are presented alongside the synthetic voice outputs' maximum and minimum performances for loudness and tempo; a measurement sketch follows the tables below.
Speaker: Lee M.
Style: Narration
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 12A | Source recording (gold): loudest voiced speech segment | *(audio sample)* |
| 12B | Synthetic voice at loudness 16 | *(audio sample)* |
| 12C | Source recording (gold): slowest voiced speech segment | *(audio sample)* |
| 12D | Synthetic voice at pace 0.3 | *(audio sample)* |
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 12E | Source recording (gold): fastest voiced speech segment | *(audio sample)* |
| 12F | Synthetic voice at tempo 2.7 | *(audio sample)* |
| 12G | Source recording (gold): quietest voiced speech segment | *(audio sample)* |
| 12H | Synthetic voice at loudness -50 | *(audio sample)* |
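For reference, the loudness and tempo figures used in these comparisons can be approximated with off-the-shelf tooling. The snippet below is a measurement sketch only, assuming the `pyloudnorm` package for integrated loudness (LUFS) and a simple characters-per-second calculation over the voiced span; it is not our internal measurement pipeline.

```python
import soundfile as sf
import pyloudnorm as pyln

def measure_clip(wav_path: str, text: str) -> tuple[float, float]:
    """Approximate a clip's loudness (LUFS) and tempo (characters per second).

    Assumes the clip contains only the voiced span corresponding to `text`."""
    audio, rate = sf.read(wav_path)
    lufs = pyln.Meter(rate).integrated_loudness(audio)  # ITU-R BS.1770 loudness
    cps = len(text) / (len(audio) / rate)                # characters per second
    return lufs, cps
```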
The model learns annotations in context, such that pronunciation and prosody are impacted by cues. This demonstrates the model's ability to generalize well. With subsequent training, this capability may become an emergent ability.
Sample set 13
Extreme tempo and loudness annotations, when nested, can prompt a dramatic performance that
impacts syllabic stress. Extreme slow annotations can prompt the model to spell the annotated word.
Speaker: Damian P.
Style: Promo
Source speaker language & location: English, Canada
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 13A | Extreme slow + loud cues impact word-level prosody | *(audio sample)* |
Speaker: Fiona H.
Style: Narration
Source speaker language & location: English, United Kingdom
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 13B | Extreme slow cue prompts word spelling | *(audio sample)* |
Speaker: Se’Von M.
Style: Narration
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 13C | No annotations | *(audio sample)* |
| 13D | Extreme slow cue prompts word spelling | *(audio sample)* |
| 13E | No annotations | *(audio sample)* |
| 13F | Extreme slow cue prompts word spelling | *(audio sample)* |
Speaker: Genevieve M.
Style: Promo
Source speaker language & location: English, United States
| Descriptor | Text Input | Audio Output |
|---|---|---|
| 13G | Fast cue within user range prompts fast word delivery | *(audio sample)* |
| 13H | Extreme slow cue prompts word spelling | *(audio sample)* |
We have shown that the annotation mapping network generalizes well and is context sensitive, supporting diverse input variability and an expansive annotation range. In near-future releases, this framework can be quickly expanded with additional annotations such as pitch, brightness, fullness, range, and breath control.
We have already prototyped a pitch annotation:
Speaker: Charlie Z.
Source speaker language & location: English, Canada
Style: Narration
Script: A new art exhibit is drawing crowds at the city’s museum.
| Descriptor | Audio Type | Pitch Annotation Value | Audio Output |
|---|---|---|---|
| A | Griffin-Lim | -200 Hz | *(audio sample)* |
| B | Griffin-Lim | +300 Hz | *(audio sample)* |
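For context on the Griffin-Lim audio type above: Griffin-Lim reconstructs a waveform from a magnitude spectrogram by iteratively estimating phase, which makes it a convenient way to audition a prototype before a neural vocoder is trained. The snippet below is an illustrative round trip with librosa; the file names and spectrogram parameters are placeholders, not the settings used for these samples.

```python
import numpy as np
import librosa
import soundfile as sf

# Round-trip a clip through a magnitude spectrogram and Griffin-Lim phase
# reconstruction (illustrative parameters; not the HINTS pipeline).
y, sr = librosa.load("prototype.wav", sr=None)
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))
reconstructed = librosa.griffinlim(magnitude, n_iter=60, n_fft=1024, hop_length=256)
sf.write("prototype_griffinlim.wav", reconstructed, sr)
```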
With additional training, more data, and subsequent model improvements, we are excited for the creative applications of this approach.
Our commitment to the principles of Responsible AI informed the practices and approaches we used for this work. Specifically, we leaned on the principles of accountability, transparency, and privacy & security.
The tenets of accountability and transparency are reflected in our requirement that we have the explicit, informed consent of any individual who records voice datasets for WellSaid Labs. End users can only access target speakers built from datasets recorded by voice talent who have provided consent for end-user access. Additionally, annotations do not provide a means for end users to guide the model into a different target speaker’s solution space. We limit the collection of data from other providers to those that are open source (e.g., LibriSpeech), and voices we create using that open source data are not available to our users.
The principle of privacy & security drives us to design our systems so that we can protect the privacy of our users and participants who provide us with their voice datasets, and reduce opportunities for data or voice avatars to be misused. Our Trust & Safety team ensures that all users undergo identity verification when creating an account, and content created on our platform is subject to robust content moderation, limiting the creation and release of content that does not align with our Terms of Service.
We listened to 1700+ audio samples to evaluate the occurrence of common issues in text-to-speech sequence-to-sequence attention models.
| Metric | Interpretation | Goal | WSL Baseline Model | WSL New Model |
|---|---|---|---|---|
| Initialism Pronunciation | % correct | 100% | 86.27% | 94.12% |
| Question Intonation (Rising) | % correct | 100% | 30.83% | 38.33% |
| 100 Words Pronunciation | % correct | 100% | 94.13% | 94.58% |
| Slurring | % occurring | 0% | 0.00% | 0.67% |
| Word Cutoff | % occurring | 0% | 0.00% | 0.67% |
| Word Skip | % occurring | 0% | 0.67% | 0.67% |
| Speaker-Swapping | % occurring | 0% | 0% | 0% |
| Loudness Inconsistency Across Clip | % occurring | 0% | ~0% | ~0% |
| Latency | seconds | decreasing | 0.5 s | 0.5 s |
| Target Speakers | total # | increasing | 126 | 106 |
| Dataset | # hours | increasing | 560 hours | 530 hours |
Note. During development, we noted that extreme annotation values force the model’s latent state representing one speaker into that of another. The net effect is that the model produces audio using a speaker other than the target speaker. Prior to releasing WSL New to users, we implemented per-speaker cue value limitations to specific ranges within which speaker swapping does not occur.
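As an illustration of how such per-speaker limits could be enforced before synthesis, consider the sketch below; the speaker names and ranges are invented for illustration and do not reflect production values.

```python
# Sketch: clamp requested cue values to per-speaker safe ranges so that
# extreme annotations cannot push the latent state toward another speaker.
# Speaker names and ranges below are invented for illustration.
SAFE_RANGES = {
    "speaker_a": {"loudness": (-20.0, 10.0), "tempo": (0.5, 2.5)},
    "speaker_b": {"loudness": (-18.0, 8.0), "tempo": (0.6, 2.2)},
}

def clamp_cue(speaker: str, cue_type: str, value: float) -> float:
    """Limit a requested cue value to the speaker's validated safe range."""
    low, high = SAFE_RANGES[speaker][cue_type]
    return min(max(value, low), high)
```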
Contributors. Michael Petrochuk; Sara Weisweaver; Rhyan Johnson; Jordan Hoskins; Sarah Fairweather; Courtney Mathy; Alecia Murray; Alison Ma; Daniel “Dandie” Swain, Jr.; Jon Delgado; Jessica Petrochuk
A special thank you to the voice talent who make our avatars possible, especially those featured in this paper: Alan T., Ali P., Ben D., Cameron S., Damian P., Fiona H., Garry J., Genevieve M., Joe F., Jordan T., Jude D., Lee M., Lulu G., Michael V., Paige L., Paul B., Ramona J., Se’Von M., Sofia H., Terra G., and Zach E.