SO-Bench: A Structural Output Evaluation of Multimodal LLMs

December 6, 2025


Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains (UI screens, natural images, documents, and charts), SO-Bench is constructed from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-source and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model's structured output capability. We plan to make the benchmark available to the community.
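To make "schema-compliant output" concrete, here is a minimal sketch of how a model's raw text response might be checked against a JSON schema. It is not SO-Bench's evaluation code: the `CHART_SCHEMA` example, the function names, and the small subset of JSON Schema keywords handled (`type`, `required`, `properties`, `items`, `enum`) are all assumptions made for illustration.

```python
import json

# Maps JSON Schema type names to Python types (illustrative subset only).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
}

# Hypothetical schema for a chart-extraction task (not taken from SO-Bench).
CHART_SCHEMA = {
    "type": "object",
    "required": ["title", "series"],
    "properties": {
        "title": {"type": "string"},
        "series": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["label", "value"],
                "properties": {
                    "label": {"type": "string"},
                    "value": {"type": "number"},
                },
            },
        },
    },
}


def violations(value, schema, path="$"):
    """Return a list of schema violations; an empty list means compliant."""
    errors = []
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and expected in ("number", "integer")
        if not isinstance(value, _TYPES[expected]) or bool_as_number:
            return [f"{path}: expected {expected}"]
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: value not in enum")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(violations(value[key], sub, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(violations(item, schema["items"], f"{path}[{i}]"))
    return errors


def check_output(raw_text, schema):
    """Parse a model's raw text output, then check it against the schema."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        return ["$: output is not valid JSON"]
    return violations(parsed, schema)
```

A check like this only covers structural compliance. Whether the extracted values are actually correct for the image is a separate question, which is why the abstract distinguishes accurate outputs from schema-compliant ones.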

Figure 1: Left: Overview of the multi-stage data generation pipeline for SO-Bench, including schema generation, user intent generation, and response generation stages. At each stage, proprietary frontier models such as GPT-5 and Gemini-2.5-Pro act as generators with carefully designed prompts. Human domain experts review data from each stage before it progresses to the next. Prior to schema generation, input images and JSON schemas are embedded using a CLIP model for embedding search. Right: Benchmarking results among several open-source models and proprietary frontier models.
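The embedding search mentioned in the caption can be pictured as a nearest-neighbor lookup over vectors. The sketch below is an assumption about the general technique, not the authors' pipeline: the three-dimensional vectors stand in for real CLIP embeddings, the schema names are invented, and the search direction (image query against schema candidates) is a guess.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k=2):
    """Return the names of the k candidates most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy stand-ins for CLIP embeddings (hypothetical values and names).
image_vec = [1.0, 0.0, 0.0]
schema_vecs = {
    "chart_schema": [0.9, 0.1, 0.0],
    "ui_schema": [0.0, 1.0, 0.0],
    "doc_schema": [0.5, 0.5, 0.0],
}

print(top_k(image_vec, schema_vecs, k=2))
```

In a setup like this, the retrieved candidates would be a starting point for the generation and human review stages rather than a final pairing.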
