Alibaba Cloud’s Qwen group has open-sourced Qwen3-TTS, a household of multilingual text-to-speech fashions that focus on three core duties in a single stack, voice clone, voice design, and top quality speech era.

Mannequin household and capabilities
Qwen3-TTS makes use of a 12Hz speech tokenizer and a couple of language mannequin sizes, 0.6B and 1.7B, packaged into 3 principal duties. The open launch exposes 5 fashions, Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base for voice cloning and generic TTS, Qwen3-TTS-12Hz-0.6B-CustomVoice and Qwen3-TTS-12Hz-1.7B-CustomVoice for promptable preset audio system, and Qwen3-TTS-12Hz-1.7B-VoiceDesign without cost kind voice creation from pure language descriptions, together with the Qwen3-TTS-Tokenizer-12Hz codec.
All fashions assist 10 languages, Chinese language, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. CustomVoice variants ship with 9 curated timbres, akin to Vivian, a brilliant younger Chinese language feminine voice, Ryan, a dynamic English male voice, and Ono_Anna, a playful Japanese feminine voice, every with a brief description that encodes timbre and talking model.
The VoiceDesign mannequin maps textual content directions on to new voices, for instance ‘communicate in a nervous teenage male voice with rising intonation’ and may then be mixed with the Base mannequin by first producing a brief reference clip and reusing it by way of create_voice_clone_prompt.


Structure, tokenizer, and streaming path
Qwen3-TTS is a twin monitor language mannequin, one monitor predicts discrete acoustic tokens from textual content, the opposite handles alignment and management indicators. The system is skilled on greater than 5 million hours of multilingual speech in 3 pre coaching levels that transfer from basic mapping, to top quality information, to lengthy context assist as much as 32,768 tokens.
A key element is the Qwen3-TTS-Tokenizer-12Hz codec. It operates at 12.5 frames per second, about 80 ms per token, and makes use of 16 quantizers with a 2048 entry codebook. On LibriSpeech take a look at clear it reaches PESQ wideband 3.21, STOI 0.96, and UTMOS 4.16, outperforming SpeechTokenizer, XCodec, Mimi, FireredTTS 2 and different current semantic tokenizers, whereas utilizing an analogous or decrease body charge.
The tokenizer is applied as a pure left context streaming decoder, so it will possibly emit waveforms as quickly as sufficient tokens can be found. With 4 tokens per packet, every streaming packet carries 320 ms of audio. The non-DiT decoder and BigVGAN free design reduces decode price and simplifies batching.
On the language mannequin facet, the analysis group studies finish to finish streaming measurements on a single vLLM backend with torch.compile and CUDA Graph optimizations. For Qwen3-TTS-12Hz-0.6B-Base and Qwen3-TTS-12Hz-1.7B-Base at concurrency 1, the primary packet latency is round 97 ms and 101 ms, with actual time elements of 0.288 and 0.313 respectively. Even at concurrency 6, first packet latency stays round 299 ms and 333 ms.


Alignment and management
Publish coaching makes use of a staged alignment pipeline. First, Direct Choice Optimization aligns generated speech with human preferences on multilingual information. Then GSPO with rule primarily based rewards improves stability and prosody. A closing speaker high quality tuning stage on the Base mannequin yields goal speaker variants whereas preserving the core capabilities of the overall mannequin.
Instruction following is applied in a ChatML model format, the place textual content directions about model, emotion or tempo are prepended to the enter. This similar interface powers VoiceDesign, CustomVoice model prompts, and high quality grained edits for cloned audio system.
Benchmarks, zero shot cloning, and multilingual speech
On the Seed-TTS take a look at set, Qwen3-TTS is evaluated as a zero-shot voice cloning system. The Qwen3-TTS-12Hz-1.7B-Base mannequin reaches a Phrase Error Fee of 0.77 on test-zh and 1.24 on test-en. The analysis group highlights the 1.24 WER on test-en as state-of-the-art among the many in contrast methods, whereas the Chinese language WER is near, however not decrease than, the most effective CosyVoice 3 rating.


On a multilingual TTS take a look at set masking 10 languages, Qwen3-TTS achieves the bottom WER in 6 languages, Chinese language, English, Italian, French, Korean, and Russian, and aggressive efficiency on the remaining 4 languages, whereas additionally acquiring the best speaker similarity in all 10 languages in comparison with MiniMax-Speech and ElevenLabs Multilingual v2.
Cross-lingual evaluations present that Qwen3-TTS-12Hz-1.7B-Base reduces combined error charge for a number of language pairs, akin to zh-to-ko, the place the error drops from 14.4 for CosyVoice3 to 4.82, a couple of 66 % relative discount.
On InstructTTSEval, the Qwen3TTS-12Hz-1.7B-VD VoiceDesign mannequin units new state-of-the-art scores amongst open supply fashions on Description-Speech Consistency and Response Precision in each Chinese language and English, and is aggressive with business methods like Hume and Gemini on a number of metrics.
Key Takeaways
- Full open supply multilingual TTS stack: Qwen3-TTS is an Apache 2.0 licensed suite that covers 3 duties in a single stack, top quality TTS, 3 second voice cloning, and instruction primarily based voice design throughout 10 languages utilizing the 12Hz tokenizer household.
- Environment friendly discrete codec and actual time streaming: The Qwen3-TTS-Tokenizer-12Hz makes use of 16 codebooks at 12.5 frames per second, reaches sturdy PESQ, STOI and UTMOS scores, and helps packetized streaming with about 320 ms of audio per packet and sub 120 ms first packet latency for the 0.6B and 1.7B fashions within the reported setup.
- Activity particular mannequin variants: The discharge presents Base fashions for cloning and generic TTS, CustomVoice fashions with 9 predefined audio system and elegance prompts, and a VoiceDesign mannequin that generates new voices straight from pure language descriptions which may then be reused by the Base mannequin.
- Sturdy alignment and multilingual high quality: A multi stage alignment pipeline with DPO, GSPO and speaker high quality tuning offers Qwen3-TTS low phrase error charges and excessive speaker similarity, with lowest WER in 6 of 10 languages and the most effective speaker similarity in all 10 languages among the many evaluated methods, and state-of-the-art zero shot English cloning on Seed TTS.
Take a look at the Mannequin Weights, Repo and Playground. Additionally, be at liberty to comply with us on Twitter and don’t overlook to affix our 100k+ ML SubReddit and Subscribe to our E-newsletter. Wait! are you on telegram? now you may be a part of us on telegram as properly.
