Wednesday, February 4, 2026

StepFun AI Introduces Step-DeepResearch: A Cost-Efficient Deep Research Agent Model Built Around Atomic Capabilities


StepFun has released Step-DeepResearch, a 32B-parameter end-to-end deep research agent that aims to turn web search into actual research workflows with long-horizon reasoning, tool use and structured reporting. The model is built on Qwen2.5-32B-Base and is trained to act as a single agent that plans, explores sources, verifies evidence and writes reports with citations, while keeping inference cost low.

From Search to Deep Research

Most current web agents are tuned for multi-hop question-answering benchmarks. They try to match ground-truth answers to short questions. That is closer to targeted retrieval than to real research. Deep research tasks are different. They involve latent intent recognition, long-horizon decision making, multi-turn tool use, structured reasoning and cross-source verification under uncertainty.

Step-DeepResearch reframes this as sequential decision making over a compact set of atomic capabilities. The research team defines four atomic capabilities: planning and task decomposition, deep information seeking, reflection and verification, and professional report generation. Instead of orchestrating many external agents, the system internalizes this loop into a single model that decides the next action at each step.
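A minimal sketch of what such a single-agent loop over atomic capabilities can look like. The capability names follow the paper, but the control flow, the `model` interface and the stopping criterion are illustrative assumptions, not StepFun's implementation.

```python
from enum import Enum, auto

class AtomicCapability(Enum):
    PLAN = auto()       # planning and task decomposition
    SEEK = auto()       # deep information seeking (search, browse)
    REFLECT = auto()    # reflection and verification of evidence
    REPORT = auto()     # professional report generation

def research_loop(model, task, max_steps=50):
    """One agent picks the next atomic capability at every step (hypothetical API)."""
    state = {"task": task, "plan": None, "evidence": [], "report": None}
    for _ in range(max_steps):
        # the model itself, not an external orchestrator, chooses the next action
        action = model.choose_capability(state)
        if action is AtomicCapability.PLAN:
            state["plan"] = model.plan(state)
        elif action is AtomicCapability.SEEK:
            state["evidence"].append(model.seek(state))
        elif action is AtomicCapability.REFLECT:
            state = model.verify_and_maybe_replan(state)
        elif action is AtomicCapability.REPORT:
            state["report"] = model.write_report(state)
            break  # the agent decides when the report is final
    return state["report"]
```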

Data Synthesis Around Atomic Capabilities

To teach these atomic capabilities, the research team builds separate data pipelines for each skill. For planning, they start from high-quality technical reports, survey papers and financial analysis documents. They reverse-engineer realistic research plans and task trees from titles, abstracts and structure, then generate trajectories that follow these plans. This exposes the model to long-horizon project structures, not only short question templates.

For deep information seeking, they construct graph-based queries over knowledge graphs such as Wikidata5m and CN-DBpedia. They sample subgraphs, expand them using search, and synthesize questions that require multi-hop reasoning across entities and documents. A separate pipeline uses a Wiki-style link index to drive cross-document retrieval and aggregation of evidence. Easy questions that a strong model can already solve with a simple ReAct-style strategy are filtered out, so training focuses on hard search problems.
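A minimal sketch of such a difficulty filter. The `react_solve` baseline is a hypothetical callable supplied by the caller, and the attempt count, matching rule and solve-rate threshold are illustrative values, not numbers reported by StepFun.

```python
def answers_match(pred: str, gold: str) -> bool:
    # simple normalized exact match; a real pipeline might use fuzzy or LLM-based grading
    return pred.strip().lower() == gold.strip().lower()

def is_hard_enough(question, gold_answer, react_solve, n_attempts=4, max_solve_rate=0.25):
    """Keep only questions that a strong ReAct-style baseline mostly fails on."""
    solved = 0
    for _ in range(n_attempts):
        prediction = react_solve(question)           # plain search-then-answer baseline
        if answers_match(prediction, gold_answer):
            solved += 1
    return solved / n_attempts <= max_solve_rate

# usage: keep only the hard multi-hop questions for training
# hard_set = [q for q in synthesized_questions if is_hard_enough(q.text, q.answer, react_solve)]
```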

Reflection and verification data is generated by self-correction loops and multi-agent teacher traces. Teacher agents extract claims, plan checks, verify facts, replan if inconsistencies appear and only then write reports. The resulting trajectories are cleaned and used as supervision for a single student agent. Report generation is trained in two phases: mid-training for domain style and depth using query-report pairs, then supervised fine-tuning with strict formatting and plan-consistency constraints.
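A compact sketch of the kind of self-correction loop a teacher agent might run before its trace is distilled into the student. The helper callables (`extract_claims`, `verify_claim`, `rewrite`) are hypothetical placeholders passed in by the caller, not APIs described in the paper.

```python
def teacher_verification_trace(draft, extract_claims, verify_claim, rewrite, max_rounds=3):
    """Run extract -> check -> replan rounds and record them as a supervision trace."""
    trace = []
    for _ in range(max_rounds):
        claims = extract_claims(draft)                    # factual statements in the draft
        results = [(c, verify_claim(c)) for c in claims]  # e.g. re-search each claim
        trace.append({"draft": draft, "checks": results})
        if all(ok for _, ok in results):
            break                                         # consistent, ready to report
        failed = [c for c, ok in results if not ok]
        draft = rewrite(draft, failed)                    # replan around inconsistencies
    trace.append({"final_report": draft})
    return trace  # cleaned traces like this become supervision for the single student agent
```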

Progressive Training on Qwen2.5-32B-Base

The training pipeline has three stages: agentic mid-training, supervised fine-tuning and reinforcement learning. In mid-training stage 1, the team injects atomic capabilities without tools, using context lengths of up to 32k tokens. The data covers active reading, synthetic reasoning traces, summarization and reflection. The research team shows steady gains on SimpleQA, TriviaQA and FRAMES as training scales up to about 150B tokens, with the largest gains on FRAMES, which stresses structured reasoning.

In stage 2, the context extends to 128k tokens and explicit tool calls are introduced. The model learns tasks such as URL-based question answering, deep web search, long-document summarization and long-dialogue reasoning. This stage aligns the model with real research scenarios where search, browsing and analysis must be mixed in a single trajectory.

During supervised fine-tuning, the four atomic capabilities are composed into full deep-search and deep-research traces. Data cleaning keeps trajectories that are correct and short in terms of steps and tool calls. The pipeline injects controlled tool errors followed by corrections to improve robustness, and enforces citation formats so that reports stay grounded in the retrieved sources.
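A minimal sketch of what controlled error injection into an SFT trajectory could look like. The error messages, probability and trajectory format below are assumptions for illustration, not details from the paper.

```python
import random

# hypothetical error observations; a real pipeline would use errors its tool stack actually emits
TOOL_ERRORS = {
    "web_search": "Error: request timed out",
    "open_url":   "Error: 404 Not Found",
    "shell":      "Error: command exited with status 1",
}

def inject_tool_errors(trajectory, p=0.1, seed=None):
    """Randomly replace a tool observation with an error, then append the corrective retry.

    Each step is assumed to be a dict like {"tool": "web_search", "args": {...}, "observation": "..."}.
    """
    rng = random.Random(seed)
    augmented = []
    for step in trajectory:
        if step.get("tool") in TOOL_ERRORS and rng.random() < p:
            # first show the failure ...
            augmented.append({**step, "observation": TOOL_ERRORS[step["tool"]]})
            # ... then the retry with the original successful result, so the model learns to recover
            augmented.append(step)
        else:
            augmented.append(step)
    return augmented
```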

Reinforcement learning then optimizes the agent in a real tool environment. The research team builds tasks and checklists by reverse synthesis, and trains a checklist-style Rubrics Judge to score reports along fine-grained dimensions. The reward design converts ternary rubric labels into asymmetric binary rewards that capture both positive targets and violations. The policy is trained with PPO and a learned critic, using generalized advantage estimation with near-zero discounting so that long trajectories are not effectively truncated.
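One way to read the reward design is as a per-criterion mapping from ternary rubric labels (satisfied, not applicable, violated) to credits whose magnitudes differ for positive targets and violations. The label names, weights and aggregation below are illustrative assumptions, since the paper's exact values are not given here.

```python
def rubric_reward(labels, pos_reward=1.0, violation_penalty=-2.0):
    """Convert ternary rubric labels into an asymmetric scalar reward.

    labels: per-criterion judgements from the Rubrics Judge, each one of
            {"satisfied", "not_applicable", "violated"}.
    Violations are weighted more heavily than successes, so a report cannot
    hide a hard violation behind many easy wins.
    """
    score = 0.0
    for label in labels:
        if label == "satisfied":
            score += pos_reward
        elif label == "violated":
            score += violation_penalty  # asymmetric: violations cost more than successes gain
        # "not_applicable" contributes nothing
    applicable = sum(1 for l in labels if l != "not_applicable")
    return score / max(applicable, 1)  # normalize across reports with different checklist sizes

# example: three satisfied criteria and one violation still yield a modest positive reward
print(rubric_reward(["satisfied", "satisfied", "violated", "satisfied", "not_applicable"]))
# (3 * 1.0 - 2.0) / 4 = 0.25
```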

Single-Agent ReAct Architecture and Search Stack

At inference time, Step-DeepResearch runs as a single ReAct-style agent that alternates thinking, tool calls and observations until it decides to output a report. The tool set includes batch web search, a todo manager, shell commands and file operations. Execution runs in a sandbox with terminal persistence via tmux. A perception-oriented browser reduces redundant page captures by using perceptual hash distance. Tools for document parsing, audio transcription and image analysis support multimodal inputs.
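A minimal sketch of how perceptual-hash distance can be used to skip near-duplicate page captures, here with the `Pillow` and `imagehash` libraries and an illustrative distance threshold; the actual browser implementation in Step-DeepResearch is not described at this level of detail.

```python
from PIL import Image
import imagehash  # pip install pillow imagehash

class CaptureDeduper:
    """Skip screenshots that are perceptually near-identical to ones already seen."""

    def __init__(self, max_distance=5):
        self.max_distance = max_distance  # Hamming-distance threshold, illustrative value
        self.seen_hashes = []

    def is_redundant(self, screenshot_path):
        h = imagehash.phash(Image.open(screenshot_path))  # 64-bit perceptual hash
        if any(h - prev <= self.max_distance for prev in self.seen_hashes):
            return True   # near-duplicate capture, do not feed it to the agent again
        self.seen_hashes.append(h)
        return False
```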

Data acquisition uses two related resources. The StepFun team states that its Search API is grounded in more than 20M high-quality papers and 600 premium indices. The research team then describes a curated authority-indexing strategy that isolates more than 600 trusted domains, including government, academic and institutional sites. Retrieval operates at the paragraph level and uses authority-aware ranking, so that high-trust domains are preferred when relevance is comparable.
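A sketch of what authority-aware re-ranking at the paragraph level can look like, assuming a relevance score per passage and a hand-maintained set of trusted domains; the example domains and the additive weighting scheme are assumptions, not StepFun's published formula.

```python
from urllib.parse import urlparse

# illustrative allowlist; the real system curates 600+ trusted domains
TRUSTED_DOMAINS = {"nature.com", "arxiv.org", "europa.eu", "nih.gov"}

def authority_rank(passages, authority_bonus=0.1):
    """Re-rank retrieved passages so trusted domains win ties on relevance.

    passages: list of dicts like {"text": ..., "url": ..., "relevance": float in [0, 1]}
    """
    def score(p):
        domain = urlparse(p["url"]).netloc.removeprefix("www.")
        bonus = authority_bonus if domain in TRUSTED_DOMAINS else 0.0
        # a small additive bonus changes the order only when relevance is comparable
        return p["relevance"] + bonus

    return sorted(passages, key=score, reverse=True)
```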

The file tools support patch-based editing, so the agent can update only the changed sections of a report. A summary-aware storage scheme writes full tool outputs to local files and injects only compact summaries into the context. This acts as external memory and avoids context overflow for long projects.
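A minimal sketch of summary-aware storage, assuming a caller-supplied `summarize` callable (for example an LLM call) and a local working directory; the file naming and summary length are illustrative.

```python
from pathlib import Path

def store_tool_output(step_id, tool_name, output, summarize,
                      workdir="./memory", max_summary_chars=500):
    """Write the full tool output to disk and return only a compact summary for the context."""
    path = Path(workdir) / f"step_{step_id:04d}_{tool_name}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(output, encoding="utf-8")         # full output stays on disk as external memory
    summary = summarize(output)[:max_summary_chars]   # only the gist enters the model's context
    return f"[{tool_name} output saved to {path}]\n{summary}"
```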

Evaluation, Cost and Access

To measure deep research behavior, the team introduces ADR-Bench, a Chinese benchmark with 110 open-ended tasks across 9 domains. 70 tasks cover general domains such as education, science and engineering, and social life, evaluated by expert side-by-side comparison. 40 tasks in finance and law are scored with explicit rubrics that follow atomicity and verifiability constraints.

On Scale AI Research Rubrics, Step-DeepResearch reaches 61.42% rubric compliance, which is comparable to OpenAI Deep Research and Gemini Deep Research, and clearly ahead of several open and proprietary baselines. On ADR-Bench, expert-based Elo rankings show that the 32B model outperforms larger open models such as MiniMax-M2, GLM-4.6 and DeepSeek-V3.2, and is competitive with systems like Kimi-Researcher and MiniMax-Agent-Pro.

Key Takeaways

  • Single-agent, atomic-capability design: Step-DeepResearch is a 32B-parameter single agent built on Qwen2.5-32B-Base. It internalizes four atomic capabilities, planning, deep information seeking, reflection and verification, and professional report generation, instead of relying on many external agents.
  • Targeted data synthesis for each skill: The research team builds separate data pipelines for planning, deep information seeking, reflection and report writing, using reverse-engineered plans from real reports, graph-based queries over Wikidata5m and CN-DBpedia, multi-agent teacher traces and strict report-formatting data.
  • Three-stage training with long context and RL: Training uses mid-training, supervised fine-tuning and reinforcement learning. Mid-training scales up to 150B tokens at 32k and then 128k context, SFT composes full deep research trajectories, and PPO-based RL with a Rubrics Judge optimizes reports against fine-grained checklists.
  • ReAct architecture with curated search and external memory: At inference time the model runs a ReAct loop that calls tools for batch web search, todo, shell and file operations, uses a Search API grounded in more than 20M papers and 600 premium indices together with 600+ trusted domains, and relies on patch editing and summary-aware storage as external memory.
  • Competitive quality at lower cost: On Scale AI Research Rubrics the model reaches 61.42% rubric compliance and is competitive with OpenAI Deep Research and Gemini Deep Research, and on ADR-Bench it achieves a 67.1% win-or-tie rate against strong baselines.

Check out the Paper and Repo.

