Chroma 1.0 is a real time speech to speech dialogue model that takes audio as input and returns audio as output while preserving the speaker identity across multi turn conversations. It is presented as the first open source end to end spoken dialogue system that combines low latency interaction with high fidelity personalized voice cloning from only a few seconds of reference audio.
The model operates directly on discrete speech representations rather than on text transcripts. It targets the same use cases as commercial real time agents, but with a compact 4B parameter dialogue core and a design that treats speaker similarity as a primary objective, not as an auxiliary feature. Chroma achieves a reported 10.96% relative improvement in speaker similarity over a human baseline and reaches a Real Time Factor (RTF) of 0.43, so it can generate speech more than 2 times faster than playback.

From cascaded ASR ➡️ LLM ➡️ TTS to end to end S2S
Most production assistants still use a three stage pipeline: automatic speech recognition to convert audio to text, a large language model for reasoning, and text to speech synthesis. This structure is flexible but it introduces latency and loses paralinguistic information such as timbre, emotion, speaking rate and prosody once the system collapses audio to text. In real time dialogue this loss of acoustic detail directly hurts speaker fidelity and naturalness.
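To make the text bottleneck concrete, here is a minimal sketch of such a cascade in Python, with hypothetical stage functions standing in for real components. Everything downstream of `run_asr` sees only the transcript, so any cue not captured in text is unrecoverable.

```python
# Hypothetical three stage cascade; run_asr, run_llm and run_tts are
# illustrative stand-ins, not a real API.

def run_asr(audio: bytes) -> str:
    """Collapse audio to a transcript; timbre, emotion, speaking rate
    and prosody are discarded at this point."""
    ...

def run_llm(text: str) -> str:
    """Text-only reasoning over the transcript."""
    ...

def run_tts(text: str) -> bytes:
    """Synthesize speech from text alone; the lost acoustic cues
    cannot be restored."""
    ...

def cascaded_agent(audio_in: bytes) -> bytes:
    return run_tts(run_llm(run_asr(audio_in)))
```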
Chroma follows the newer class of speech to speech systems that map between sequences of codec tokens. A speech tokenizer and neural codec produce quantized acoustic codes. A language model then reasons and responds over a sequence that interleaves text tokens and audio codes, without an explicit intermediate transcript. This keeps the model conditioned on prosody and speaker identity across the whole processing chain.
Architecture: Reasoner + speech generation stack
Chroma 1.0 has two main subsystems. The Chroma Reasoner handles multimodal understanding and text generation. The speech stack, consisting of the Chroma Backbone, Chroma Decoder and Chroma Codec Decoder, converts that semantic output into personalized response audio.
The Chroma Reasoner is built on the Thinker module from the Qwen-omni series and uses the Qwen2-Audio encoding pipeline. It processes text and audio inputs with shared front ends, fuses them with cross modal attention, and aligns them over time using Time-aligned Multimodal Rotary Position Embedding (TM-RoPE). The output is a sequence of hidden states that carry both linguistic content and acoustic cues, for example rhythm and emphasis.


The Chroma Backbone is a 1B parameter LLaMA style model based on Llama3. It is conditioned on the target voice using CSM-1B, which encodes a short reference audio clip and its transcript into embedding prompts that are prepended to the sequence. During inference, token embeddings and hidden states from the Reasoner are fed in as unified context, so the Backbone always sees the semantic state of the dialogue while it generates acoustic codes.
To support streaming, the system uses a fixed 1 to 2 interleaving schedule. For every text token from the Reasoner, the Backbone produces 2 audio code tokens. This allows the model to start emitting speech as soon as text generation begins and avoids waiting for full sentences. This interleaving is the main mechanism behind the low Time to First Token.
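A minimal sketch of this schedule, assuming hypothetical `step` interfaces for the Reasoner and Backbone, looks like this: audio codes are yielded as soon as the first text token exists, so playback can begin immediately.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker

def stream_dialogue(reasoner, backbone, state):
    """Yield audio code tokens as they are produced, never waiting
    for a full sentence."""
    while True:
        text_tok, hidden, state = reasoner.step(state)  # one text token per step
        if text_tok == EOS:
            return
        for _ in range(2):  # fixed 1 to 2 schedule: two audio codes per text token
            yield backbone.step(text_tok, hidden)
```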
The Chroma Decoder is a lightweight LLaMA variant with about 100M parameters. The Backbone predicts only the first Residual Vector Quantization (RVQ) codebook per frame, which is a coarse representation. The Decoder then takes the Backbone hidden state and the first code and autoregressively predicts the remaining RVQ levels within the same frame. This factorization keeps long context temporal structure in the Backbone and restricts the Decoder to frame local refinement, which reduces compute and improves detailed prosody and articulation.
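As a rough illustration, assuming hypothetical prediction methods on both modules, the coarse to fine split per frame could look like the following sketch.

```python
NUM_CODEBOOKS = 8  # the codec uses 8 RVQ codebooks

def generate_frame(backbone, decoder, context):
    # Backbone: long range structure, predicts only RVQ level 0
    coarse_code, hidden = backbone.predict_first_code(context)
    codes = [coarse_code]
    # Decoder: frame local refinement, fills in levels 1..7 autoregressively
    for level in range(1, NUM_CODEBOOKS):
        codes.append(decoder.predict_level(hidden, codes, level))
    return codes  # one complete frame of 8 RVQ codes
```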
The Chroma Codec Decoder concatenates the coarse and refined codes and maps them to waveform samples. It follows the decoder design of the Mimi vocoder and uses a causal convolutional neural network so that each output sample depends only on past context, which is required for streaming. The system uses 8 codebooks, which cuts the number of autoregressive refinement steps for the Decoder while preserving enough detail for voice cloning.
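The causality constraint is easy to see in code. Below is a minimal PyTorch sketch of a causal 1D convolution of the kind streaming decoders use, not Chroma's actual layer: all padding goes on the left, so each output sample depends only on current and past inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution with left-only padding: the output at time t never
    sees inputs later than t, which permits streaming synthesis."""

    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time); output keeps the same length
        return self.conv(F.pad(x, (self.left_pad, 0)))
```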
Training setup and synthetic speech to speech (S2S) data
High quality speech dialogue data with strong reasoning signals is scarce. Chroma therefore uses a synthetic speech to speech (S2S) pipeline. A Reasoner-like LLM first produces textual answers to user questions. A Text to Speech (TTS) system then synthesizes target speech for these answers that matches the timbre of the reference audio. These synthetic pairs train the Backbone and Decoder to perform acoustic modeling and voice cloning. The Reasoner stays frozen and acts as a provider of text embeddings and multimodal hidden states.
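A sketch of that data pipeline, with hypothetical `llm` and `tts` interfaces, might look like this.

```python
# Hypothetical helpers: llm.generate and tts.clone stand in for the
# actual models used to synthesize training pairs.

def build_s2s_pair(question_audio, question_text, reference_audio, llm, tts):
    answer_text = llm.generate(question_text)                   # textual answer
    answer_audio = tts.clone(answer_text, ref=reference_audio)  # timbre-matched speech
    return {
        "input": (question_audio, question_text),
        "target": (answer_text, answer_audio),  # supervises Backbone and Decoder
    }
```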
Voice cloning quality and comparison with existing systems
Objective evaluation uses the SEED-TTS-EVAL protocol on English CommonVoice speakers. Chroma operates at a 24 kHz sampling rate and achieves a Speaker Similarity score of 0.81. The human baseline is 0.73. CosyVoice-3 reaches 0.72 and most other TTS baselines lie below the human reference. The research team reports this as a 10.96% relative improvement over the human baseline, which indicates that the model captures fine paralinguistic details more consistently than human recordings on this metric.
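The relative improvement figure follows directly from the two scores:

```python
chroma, human = 0.81, 0.73
print(f"{(chroma - human) / human:.2%}")  # -> 10.96%
```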


Subjective evaluation compares Chroma with the ElevenLabs eleven_multilingual_v2 model. In naturalness CMOS, listeners prefer ElevenLabs 57.2% of the time versus 24.4% for Chroma, with 18.3% ties. In speaker similarity CMOS, the scores are very close, 42.4% for ElevenLabs and 40.6% for Chroma, with 17.0% ties. A follow up test asking which audio sounds more natural between ElevenLabs and the original recordings yields a 92.0% preference for ElevenLabs versus 8.0% for the ground truth, which shows that perceived naturalness and speaker fidelity are not aligned.
Latency and real time behavior
Latency is measured with one concurrent stream. For a 38.80 second response, the total generation time is 16.58 seconds, which gives a Real Time Factor (RTF) of 0.43. The Reasoner contributes 119.12 ms to Time to First Token (TTFT), the Backbone 8.48 ms, and the Decoder 19.27 ms per frame on average. The Codec Decoder works on groups of 4 frames, so TTFT does not apply to that component. The overall Time to First Token is 146.87 ms, well under one second and suitable for interactive dialogue.
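Both headline numbers can be reproduced from the reported components:

```python
generation_s, response_s = 16.58, 38.80
print(f"RTF  = {generation_s / response_s:.2f}")  # -> 0.43, faster than playback
ttft_ms = 119.12 + 8.48 + 19.27  # Reasoner + Backbone + Decoder contributions
print(f"TTFT = {ttft_ms:.2f} ms")  # -> 146.87 ms
```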


Spoken dialogue and reasoning benchmarks
Chroma is evaluated on the basic track of URO Bench. It uses only 4B parameters yet achieves an overall task accomplishment score of 57.44%. GLM-4 Voice, a 9B parameter model, leads with 69.09%. Chroma ranks second overall and outperforms several 7B and 0.5B omni baselines on many dimensions. It reaches 71.14% on Storal, 51.69% on TruthfulQA and 22.74% on GSM8K. For oral conversation metrics it attains the highest scores on MLC at 60.26% and on CommonVoice at 62.07%.


Critically, Chroma is the only model in this comparison that supports personalized voice cloning. All other systems handle spoken dialogue and reasoning only. This means Chroma offers competitive cognitive capability while also performing high fidelity voice personalization in real time.
Key Takeaways
- End to end real time speech to speech: Chroma 1.0 is a 4B parameter spoken dialogue model that maps speech to speech directly using codec tokens. It avoids explicit ASR and TTS stages and preserves prosody and speaker identity through the whole pipeline.
- Reasoner plus speech stack architecture: The system combines a Qwen based Chroma Reasoner with a 1B LLaMA style Backbone, a 100M Chroma Decoder and a Mimi based Codec Decoder. It uses RVQ codebooks and an interleaved 1 to 2 text to audio token schedule to support streaming and a low Time to First Token.
- Strong personalized voice cloning: On SEED-TTS-EVAL with CommonVoice speakers, Chroma reaches a Speaker Similarity score of 0.81 at 24 kHz, reported as a 10.96 percent relative improvement over the human baseline of 0.73, and outperforms CosyVoice-3 and other TTS baselines.
- Sub second latency and faster than real time generation: Single stream inference on an H200 GPU yields an overall Time to First Token of about 147 ms. For a 38.80 second response the model generates audio in 16.58 seconds, resulting in a Real Time Factor of 0.43, which is more than 2 times faster than playback.
- Competitive dialogue and reasoning with cloning as a unique feature: On the URO Bench basic track, Chroma attains 57.44 percent overall task accomplishment and competitive scores on Storal, TruthfulQA, GSM8K, MLC and CommonVoice.
Check out the Paper, Model Weights, Project and Playground.
