Saturday, February 7, 2026

VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning


Video-conditioned sound and speech generation, which encompasses the video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, is conventionally treated as two separate problems, with limited work on unifying them within a single framework. Recent attempts to unify V2S and VisualTTS struggle to handle distinct condition types (e.g., heterogeneous video and transcript conditions) and require complex training stages. Unifying these two tasks remains an open problem. To bridge this gap, we present VSSFlow, which seamlessly integrates both V2S and VisualTTS tasks into a unified flow-matching framework. VSSFlow uses a novel condition aggregation mechanism to handle distinct input signals. We find that cross-attention and self-attention layers exhibit different inductive biases when introducing conditions. VSSFlow therefore leverages these inductive biases to handle different representations effectively: cross-attention for ambiguous video conditions and self-attention for more deterministic speech transcripts. Furthermore, contrary to the prevailing belief that joint training on the two tasks requires complex training strategies and may degrade performance, we find that VSSFlow benefits from end-to-end joint learning of sound and speech generation without any extra design of training stages. Detailed analysis attributes this to the learned general audio prior shared between the tasks, which accelerates convergence, enhances conditional generation, and stabilizes the classifier-free guidance process. Extensive experiments demonstrate that VSSFlow surpasses state-of-the-art domain-specific baselines on both V2S and VisualTTS benchmarks, underscoring the critical potential of unified generative models.
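
To make the condition aggregation idea more concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes that transcript tokens are injected through self-attention by concatenating them with the audio latents along the sequence axis, and that video features are injected through a separate cross-attention step; the abstract does not specify these details. The function names (`attention`, `aggregate_conditions`, `cfg_velocity`) are hypothetical, learned projection matrices are omitted, and the last function shows only the standard classifier-free guidance combination applied to a predicted flow-matching velocity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors.

    Learned query/key/value projections are left out for brevity
    (an illustrative simplification, not part of the source).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def aggregate_conditions(audio, video, transcript):
    """Hypothetical condition aggregation for one block.

    Assumption: transcript tokens join the audio latents in one sequence,
    so self-attention mixes them; video features enter via cross-attention.
    """
    # Self-attention path: audio latents and transcript tokens share a sequence.
    seq = audio + transcript
    mixed = attention(seq, seq, seq)
    audio_hidden = mixed[:len(audio)]  # keep only the audio positions

    # Cross-attention path: audio positions query the video features.
    video_context = attention(audio_hidden, video, video)

    # Residual sum of the two paths.
    return [
        [h + c for h, c in zip(h_row, c_row)]
        for h_row, c_row in zip(audio_hidden, video_context)
    ]


def cfg_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance applied to a velocity prediction."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]
```

With a guidance `scale` of 1.0 the combination reduces to the conditional prediction, and with 0.0 it reduces to the unconditional one. Larger values push the sample further toward the condition, which is the step the abstract reports as being stabilized by joint training.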
