RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding

University of California, Berkeley
System Overview Diagram

Abstract

Streaming voice conversion has emerged as a pivotal technology in applications ranging from assistive communication to entertainment. In this paper, we present RT-VC, a zero-shot, real-time voice conversion system that delivers ultra-low latency and high-quality performance. Our approach leverages an articulatory feature space to naturally disentangle content and speaker characteristics, facilitating more robust and interpretable voice transformations. Additionally, the integration of differentiable digital signal processing (DDSP) enables efficient vocoding directly from articulatory features, significantly reducing conversion latency. Experimental evaluations demonstrate that RT-VC maintains synthesis quality comparable to the current state-of-the-art (SOTA) method while achieving a CPU latency of 61.4 ms, a 13.3% reduction relative to that method.
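As a quick sanity check on the latency figures, the short sketch below works out the baseline latency implied by the two numbers quoted in the abstract (61.4 ms and a 13.3% reduction). It assumes the reduction is measured relative to the baseline latency; the function name `implied_baseline_ms` is purely illustrative and not part of RT-VC.

```python
def implied_baseline_ms(latency_ms: float, relative_reduction: float) -> float:
    """Baseline latency implied by a latency and its relative reduction.

    Assumes: latency = baseline * (1 - relative_reduction).
    """
    return latency_ms / (1.0 - relative_reduction)


rtvc_latency_ms = 61.4      # CPU latency quoted in the abstract
relative_reduction = 0.133  # 13.3% reduction quoted in the abstract

baseline_ms = implied_baseline_ms(rtvc_latency_ms, relative_reduction)
saved_ms = baseline_ms - rtvc_latency_ms

print(f"implied baseline latency: {baseline_ms:.1f} ms")
print(f"absolute saving:          {saved_ms:.1f} ms")
```

Under that assumption the implied baseline is roughly 70.8 ms, so the saving is a little over 9 ms per conversion step.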

Conversion Quality

Zero-shot voice conversion results. StreamVC samples copied from here.

Source (Unseen) Target (Unseen) StreamVC RT-VC (Our Model)

Noise Robustness

Voice conversion results at different input signal-to-noise ratios (SNRs), with white noise added to the source.
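For readers who want to reproduce similar test conditions, the following is a minimal sketch of how white noise can be mixed into a clean signal at a chosen SNR. It is not the authors' evaluation code: the helper names (`add_white_noise`, `measured_snr_db`), the Gaussian noise model, and the sine-tone stand-in for speech are all assumptions made for illustration.

```python
import math
import random


def signal_power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in x) / len(x)


def add_white_noise(clean, snr_db, seed=0):
    """Return `clean` plus Gaussian white noise scaled to the target SNR (dB)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # SNR_dB = 10 * log10(P_signal / P_noise)
    # =>  P_noise = P_signal / 10 ** (SNR_dB / 10)
    target_noise_power = signal_power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / signal_power(noise))
    return [s + gain * n for s, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB, recovering the noise as noisy - clean."""
    noise = [y - s for s, y in zip(clean, noisy)]
    return 10.0 * math.log10(signal_power(clean) / signal_power(noise))


# Hypothetical stand-in for a speech clip: 1 s of a 220 Hz tone at 16 kHz.
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate)
         for t in range(sample_rate)]

# The SNR levels used in the samples on this page.
for snr_db in (40, 30, 20, 10, 5):
    noisy = add_white_noise(clean, snr_db)
    print(f"target {snr_db:>2} dB, measured {measured_snr_db(clean, noisy):.2f} dB")
```

Because the noise is rescaled to an exact power, the measured SNR matches the target up to floating-point error. With real recordings, the same scaling applies once the clip is loaded as a sequence of floating-point samples.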

40 dB SNR

Source (Unseen) Target (Unseen) RT-VC (Our Model)

30 dB SNR

Source (Unseen) Target (Unseen) RT-VC (Our Model)

20 dB SNR

Source (Unseen) Target (Unseen) RT-VC (Our Model)

10 dB SNR

Source (Unseen) Target (Unseen) RT-VC (Our Model)

5 dB SNR

Source (Unseen) Target (Unseen) RT-VC (Our Model)

(Bonus) Multilingual Conversion

Chinese

Source (Unseen) Target (Unseen) RT-VC (Our Model)

French

Source (Unseen) Target (Unseen) RT-VC (Our Model)

German

Source (Unseen) Target (Unseen) RT-VC (Our Model)

Spanish

Source (Unseen) Target (Unseen) RT-VC (Our Model)

(Bonus) Singing Voice Conversion

Source (Unseen) Target (Unseen) RT-VC (Our Model)