StyleStream: Real-Time Zero-Shot Voice Style Conversion

Abstract

Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited, and real-time voice style conversion has not been addressed. We propose StyleStream, the first streamable zero-shot voice style conversion system that achieves state-of-the-art performance. StyleStream consists of two components: a Destylizer, which removes style attributes while preserving linguistic content, and a Stylizer, a diffusion transformer (DiT) that reintroduces target style conditioned on reference speech. Robust content-style disentanglement is enforced through text supervision and a highly constrained information bottleneck. This design enables a fully non-autoregressive architecture, achieving real-time voice style conversion with an end-to-end latency of 1 second.
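The two-stage data flow described above (Destylizer, then Stylizer conditioned on reference speech, applied chunk by chunk) can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the StyleStream implementation: the names `StreamingConverter`, `chunked`, `toy_destylizer`, and `toy_stylizer` are hypothetical, audio is represented as plain lists of floats, the chunk size is arbitrary, and the toy functions stand in for the real networks only so that the control flow runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence

Audio = List[float]    # mono samples; a hypothetical stand-in for real audio buffers
Content = List[float]  # hypothetical style-free content features


@dataclass
class StreamingConverter:
    """Two-stage converter: strip style from a chunk, then re-apply the reference style."""
    destylize: Callable[[Audio], Content]       # stage 1: removes style, keeps content
    stylize: Callable[[Content, Audio], Audio]  # stage 2: adds style from reference speech
    reference: Audio                            # current target-style reference

    def set_reference(self, reference: Sequence[float]) -> None:
        # Online target swapping: chunks converted after this call use the new reference.
        self.reference = list(reference)

    def convert_chunk(self, chunk: Audio) -> Audio:
        content = self.destylize(chunk)
        return self.stylize(content, self.reference)


def chunked(samples: Sequence[float], chunk_size: int) -> Iterator[Audio]:
    """Yield consecutive fixed-size chunks; the last one may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield list(samples[start:start + chunk_size])


# Toy stand-ins (assumptions): they do NOT model the real Destylizer or Stylizer.
def toy_destylizer(chunk: Audio) -> Content:
    return list(chunk)  # pretend the content features are the samples themselves


def toy_stylizer(content: Content, reference: Audio) -> Audio:
    # Pretend "style" is just a gain derived from the reference's mean magnitude.
    gain = sum(abs(x) for x in reference) / len(reference) if reference else 1.0
    return [gain * x for x in content]


if __name__ == "__main__":
    converter = StreamingConverter(toy_destylizer, toy_stylizer, reference=[0.5, -0.5])
    source = [1.0] * 10
    output: Audio = []
    for index, chunk in enumerate(chunked(source, chunk_size=4)):
        if index == 1:
            converter.set_reference([2.0, -2.0])  # swap the target style mid-stream
        output.extend(converter.convert_chunk(chunk))
    print(output)
```

Because each chunk is converted without reference to previously generated output, the loop mirrors the non-autoregressive design named in the abstract, and swapping the reference between chunks corresponds to the online target style swapping shown in the demo. A real system would also need to handle context and smoothing across chunk boundaries, which this sketch omits.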

Video Demo

Real-time demo running on-device on a laptop with an RTX 4060 GPU, showing zero-shot streaming voice style conversion (VSC) and online swapping of the target style.

[Audio comparison table. Columns: Source Audio, Target Style, Target Audio, FaCodec, CosyVoice 2.0, SeedVC v2, Vevo, Vevo 1.5, StyleStream (streaming), StyleStream (offline). Rows: two samples each for the accent targets Arabic, Indian, Mandarin, British, and US, and for the emotion targets Sad, Calm, Fear, Angry, and Happy.]

Cross-Lingual Accent Conversion

StyleStream can induce foreign accents in English speech by conditioning on target samples spoken in other languages. In the examples below, English source utterances are converted to carry Arabic, French, Hindi, Italian, Russian, and Spanish accents.

[Audio comparison table. Columns: English Source, Target Language, Foreign Language Target, Converted (English with Accent). Rows: one sample each for Arabic, French, Hindi, Italian, Russian, and Spanish.]