PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models
Robin Netzorg
Ajil Jalal
Luna McNulty
Gopala Anumanchipalli
[Paper]

Abstract

Perceptual modification of voice is an elusive goal. While non-experts can modify an image or sentence perceptually with available tools, it is not clear how to similarly modify speech along perceptual axes. Voice conversion does make it possible to convert one voice to another, but these modifications are handled by black box models, and the specifics of what perceptual qualities to modify and how to modify them are unclear. Towards allowing greater perceptual control over voice, we introduce PerMod, a conditional latent diffusion model that takes in an input voice and a perceptual qualities vector, and produces a voice with the matching perceptual qualities. Unlike prior work, PerMod generates a new voice corresponding to perceptual modifications. Evaluating perceptual quality vectors with RMSE from both human and predicted labels, we demonstrate that PerMod produces voices with the desired perceptual qualities for typical voices, but performs poorly on atypical voices.
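As a rough illustration of the RMSE evaluation mentioned above, the sketch below compares a target perceptual qualities vector with the labels assigned to a generated voice. It is not the authors' evaluation code: the ordering of the qualities, the five-dimensional vector, the 0-100 scale, and all numeric values are assumptions made for the example.

```python
import math

# The five CAPE-V qualities listed on this page. Their order within the
# vector, and the vector's dimensionality, are assumptions for illustration.
QUALITIES = ("strain", "loudness", "roughness", "breathiness", "pitch")


def rmse(target, rated):
    """Root-mean-square error between two equal-length perceptual vectors."""
    if len(target) != len(rated):
        raise ValueError("vectors must have the same length")
    if not target:
        raise ValueError("vectors must be non-empty")
    squared_errors = [(t - r) ** 2 for t, r in zip(target, rated)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values on an assumed 0-100 scale.
target = [10.0, 20.0, 30.0, 40.0, 50.0]  # qualities requested from the model
rated = [13.0, 16.0, 30.0, 40.0, 50.0]   # labels given to the generated voice

print(f"RMSE over {len(QUALITIES)} qualities: {rmse(target, rated):.3f}")
```

The same function applies whether the `rated` vector comes from human listeners or from a label predictor; a lower value means the generated voice is closer to the requested perceptual qualities.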

Perceptual Qualities

Thought of as the acoustic "coloring" of an individual's voice, perceptual qualities are one way that clinicians and voice experts conceptualize the subjective perception of a voice. Here, we provide examples of the CAPE-V perceptual qualities we use in our work. For a description of the gendered perceptual qualities of resonance and weight, we recommend listening to the voice examples in the Voice Resource Repository by SumianVoice.

Perceptual Quality | Description | Audio Example
Strain | Perception of excessive vocal effort (hyperfunction). | [audio]
Loudness | Deviation in perceived loudness typical for that speaker's age and gender. | [audio]
Roughness | Perceived irregularity in the voicing source. | [audio]
Breathiness | Audible air escape in the voice. | [audio]
Pitch | Deviation in pitch values typical for that speaker's age and gender. | [audio]

Demo

Below we provide examples of voice modification (VM) performed by PerMod on test samples from the Voice Cloning Toolkit (VCTK) and the Perceptual Voice Qualities Database (PVQD). For comparison, we also provide voice conversion (VC) examples generated by the DSVAE.

Typical-to-Typical (T2T)

Input Speech Target Speech DSVAE (VC) PerMod-Pretrained (VM) PerMod-Finetuned (VM)

Typical-to-Atypical (T2A)

Input Speech Target Speech DSVAE (VC) PerMod-Pretrained (VM) PerMod-Finetuned (VM)

Atypical-to-Typical (A2T)

Input Speech Target Speech DSVAE (VC) PerMod-Pretrained (VM) PerMod-Finetuned (VM)

Atypical-to-Atypical (A2A)

Input Speech Target Speech DSVAE (VC) PerMod-Pretrained (VM) PerMod-Finetuned (VM)


[Bibtex]


Acknowledgements

This work was supported by the UC Noyce Initiative, Society of Hellman Fellows, NSF, NIH/NIDCD and the Schwab Innovation Fund.