Real-time demo running on-device on a laptop with an RTX 4060 GPU, showing zero-shot streaming voice style conversion (VSC) and online target style swapping.
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second.
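As a rough illustration of the two-stage, non-autoregressive design described above, the following Python sketch processes audio chunk by chunk and swaps the target style mid-stream. Everything in it is hypothetical: the names `destylize`, `stylize`, `StyleReference`, and `StreamingConverter`, the chunk size, and the toy arithmetic inside the placeholders are assumptions made for illustration and are not StyleStream's models or API.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass
class StyleReference:
    """Hypothetical stand-in for a style representation of reference speech."""
    name: str
    gain: float  # toy "style" parameter, used only to make the effect visible


def destylize(chunk: Sequence[float]) -> List[float]:
    """Placeholder Destylizer: a real one would strip style, keep content.

    Here it only normalises the peak amplitude so the sketch stays runnable.
    """
    peak = max((abs(x) for x in chunk), default=0.0)
    return [x / peak for x in chunk] if peak > 0 else list(chunk)


def stylize(content: Sequence[float], style: StyleReference) -> List[float]:
    """Placeholder Stylizer: a real one would re-synthesise the target style.

    The output for a chunk depends only on (content, style), not on earlier
    outputs, which is what makes the pipeline non-autoregressive.
    """
    return [x * style.gain for x in content]


def split_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(samples), chunk_size):
        yield list(samples[start:start + chunk_size])


class StreamingConverter:
    """Chunk-wise converter whose target style can be replaced at any time."""

    def __init__(self, style: StyleReference) -> None:
        self.style = style

    def set_style(self, style: StyleReference) -> None:
        # The new style takes effect from the next processed chunk.
        self.style = style

    def process(self, chunk: Sequence[float]) -> List[float]:
        return stylize(destylize(chunk), self.style)


if __name__ == "__main__":
    converter = StreamingConverter(StyleReference("speaker_a", gain=2.0))
    stream = [0.5, -1.0, 0.25, 2.0, -4.0]
    for i, chunk in enumerate(split_chunks(stream, chunk_size=2)):
        if i == 1:
            # Online target style swapping between chunks.
            converter.set_style(StyleReference("speaker_b", gain=0.5))
        print(converter.process(chunk))
```

In a streaming setup of this shape, the chunk length bounds the achievable latency from below, since a chunk cannot be converted before it has been fully received. The chunk size used here is arbitrary and is not taken from the source.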
| Source Audio | Target Style | Target Audio | FaCodec | CosyVoice 2.0 | SeedVC v2 | Vevo | Vevo 1.5 | StyleStream (streaming) | StyleStream (offline) |
|---|---|---|---|---|---|---|---|---|---|
| | Arabic | | | | | | | | |
| | Arabic | | | | | | | | |
| | Indian | | | | | | | | |
| | Indian | | | | | | | | |
| | Mandarin | | | | | | | | |
| | Mandarin | | | | | | | | |
| | British | | | | | | | | |
| | British | | | | | | | | |
| | US | | | | | | | | |
| | US | | | | | | | | |
| | Sad | | | | | | | | |
| | Sad | | | | | | | | |
| | Calm | | | | | | | | |
| | Calm | | | | | | | | |
| | Fear | | | | | | | | |
| | Fear | | | | | | | | |
| | Angry | | | | | | | | |
| | Angry | | | | | | | | |
| | Happy | | | | | | | | |
| | Happy | | | | | | | | |
StyleStream can induce foreign accents in English speech by conditioning on target samples in other languages. Below are examples where English source utterances are converted to have accents from various languages (Arabic, French, Hindi, Italian, Russian, Spanish).
| English Source | Target Language | Foreign Language Target | Converted (English with Accent) |
|---|---|---|---|
| | Arabic | | |
| | French | | |
| | Hindi | | |
| | Italian | | |
| | Russian | | |
| | Spanish | | |