StyleStream: Real-Time Zero-Shot Voice Style Conversion

University of California, Berkeley
[Figure: System overview diagram]

Abstract

Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion. A central challenge is disentangling linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second.

Video Demo

Real-time demo running on-device on a laptop with an RTX 4060 GPU, showing zero-shot streaming voice style conversion (VSC) as well as online target style swapping.
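The data flow behind streaming conversion with online style swapping can be sketched as a chunked loop: each incoming audio chunk passes through a Destylizer stage and then a Stylizer stage conditioned on whichever reference is currently selected. The sketch below is illustrative only and is not the StyleStream implementation. The names `Reference`, `destylize`, `stylize`, `StreamingConverter`, and `chunked`, as well as the chunk size, are hypothetical, and both model stages are identity placeholders standing in for the real networks.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

Chunk = List[float]  # one block of audio samples


@dataclass
class Reference:
    """Hypothetical handle for a target-style reference utterance."""
    name: str


def destylize(chunk: Chunk) -> Chunk:
    # Placeholder for the Destylizer: a real model would remove style
    # attributes and keep linguistic content. Here it is an identity copy.
    return list(chunk)


def stylize(content: Chunk, ref: Reference) -> Chunk:
    # Placeholder for the Stylizer (a DiT in the paper): a real model would
    # reintroduce the style of `ref`. Here it is an identity copy.
    return list(content)


class StreamingConverter:
    """Applies the two stages chunk by chunk with a swappable reference."""

    def __init__(self, ref: Reference) -> None:
        self.ref = ref

    def set_reference(self, ref: Reference) -> None:
        # Takes effect from the next processed chunk onward.
        self.ref = ref

    def process(self, chunk: Chunk) -> Tuple[str, Chunk]:
        ref = self.ref  # style in effect for this chunk
        return ref.name, stylize(destylize(chunk), ref)


def chunked(samples: Chunk, size: int) -> Iterator[Chunk]:
    # Split a sample buffer into consecutive chunks of at most `size` samples.
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


# Usage: swap the target style between the first and second chunk.
converter = StreamingConverter(Reference("british"))
audio = [0.0] * 10  # stand-in for microphone input
styles: List[str] = []
output: Chunk = []
for i, chunk in enumerate(chunked(audio, 4)):
    if i == 1:
        converter.set_reference(Reference("sad"))
    name, converted = converter.process(chunk)
    styles.append(name)
    output.extend(converted)
# styles == ["british", "sad", "sad"]; len(output) == len(audio)
```

Keeping the reference as mutable state on the converter, read once per chunk, is one simple way to let a style swap apply at a chunk boundary without interrupting the stream.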

Columns: Source Audio | Target Style | Target Audio | FaCodec | CosyVoice 2.0 | SeedVC v2 | Vevo | Vevo 1.5 | StyleStream (streaming) | StyleStream (offline)
Rows by target style (accent): Arabic, Indian, Mandarin, British, US
Rows by target style (emotion): Sad, Calm, Fear, Angry, Happy

Cross-Lingual Accent Conversion

StyleStream can induce foreign accents in English speech by conditioning on target samples in other languages. Below are examples where English source utterances are converted to have accents from various languages (Arabic, French, Hindi, Italian, Russian, Spanish).

Columns: English Source | Target Language | Foreign Language Target | Converted (English with Accent)
Rows by target language: Arabic, French, Hindi, Italian, Russian, Spanish