Wednesday, February 4, 2026

SO-Bench: A Structural Output Analysis of Multimodal LLMs


Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of the visual structured-output capabilities of MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-source and frontier proprietary models reveal persistent gaps in producing accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model's structured-output capability. We plan to make the benchmark available to the community.

Diagram of the SO-Bench data generation pipeline showing schema generation, user intent generation, response generation, and CLIP-based embedding search with human expert checks at each stage.
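To make "schema-compliant output" concrete, below is a minimal sketch of how an evaluation like this could check a model's response against a per-example JSON Schema. The schema, field names, and helper function here are illustrative assumptions, not taken from SO-Bench; the check itself uses the standard `jsonschema` library.

```python
# Minimal sketch: checking whether a model's raw text output is valid JSON
# and conforms to a given JSON Schema. Schema and outputs are hypothetical,
# not drawn from the SO-Bench benchmark.
import json
from jsonschema import Draft202012Validator

# Hypothetical schema for a chart-extraction task.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "series": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "label": {"type": "string"},
                    "value": {"type": "number"},
                },
                "required": ["label", "value"],
            },
        },
    },
    "required": ["title", "series"],
}

def is_schema_compliant(raw_output: str) -> bool:
    """Return True if the raw model text parses as JSON and satisfies the schema."""
    try:
        instance = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return Draft202012Validator(schema).is_valid(instance)

# A compliant output versus one that omits a required field.
print(is_schema_compliant('{"title": "Revenue", "series": [{"label": "Q1", "value": 3.2}]}'))  # True
print(is_schema_compliant('{"series": [{"label": "Q1"}]}'))  # False
```

A check of this kind only measures schema compliance; accuracy of the extracted values against the image would have to be scored separately.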
