Thursday, January 15, 2026

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer


Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, and an auxiliary diffusion decoder subsequently translates the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
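To make the hybrid-tokenizer idea concrete, here is a minimal sketch of how a shared vision encoder could feed two lightweight adapters: one producing continuous embeddings for understanding, the other producing discrete token ids (via a codebook lookup) as targets for generation. This is an illustrative sketch, not the authors' implementation; the module names, dimensions, codebook size, and nearest-neighbor quantization step are all assumptions.

```python
import torch
import torch.nn as nn


class HybridImageTokenizer(nn.Module):
    """Illustrative hybrid tokenizer: one shared encoder, two adapters."""

    def __init__(self, vision_encoder: nn.Module, dim: int = 1024,
                 llm_dim: int = 2048, codebook_size: int = 16384):
        super().__init__()
        self.encoder = vision_encoder                     # shared vision encoder
        self.cont_adapter = nn.Linear(dim, llm_dim)       # continuous branch (understanding)
        self.disc_adapter = nn.Linear(dim, dim)           # discrete branch (generation)
        self.codebook = nn.Embedding(codebook_size, dim)  # discrete token vocabulary

    def understand(self, image: torch.Tensor) -> torch.Tensor:
        """Continuous embeddings fed to the LLM for image-to-text understanding."""
        feats = self.encoder(image)                       # (B, N, dim) patch features
        return self.cont_adapter(feats)                   # (B, N, llm_dim)

    def generate_targets(self, image: torch.Tensor) -> torch.Tensor:
        """Discrete token ids used as LLM prediction targets for text-to-image generation."""
        feats = self.disc_adapter(self.encoder(image))    # (B, N, dim)
        flat = feats.reshape(-1, feats.size(-1))          # (B*N, dim)
        # nearest-codebook-entry quantization (assumed; details differ in practice)
        dists = torch.cdist(flat, self.codebook.weight)   # (B*N, codebook_size)
        return dists.argmin(dim=-1).view(feats.shape[:-1])  # (B, N) token ids
```

Because both branches start from the same encoder features, the continuous embeddings and the discrete tokens live in one shared semantic space, which is the property the paper credits for the small understanding/generation task conflict.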
