GLM-4.7-Flash is a new member of the GLM 4.7 family and targets developers who need strong coding and reasoning performance in a model that is practical to run locally. Zhipu AI (Z.ai) describes GLM-4.7-Flash as a 30B-A3B MoE model and positions it as the strongest model in the 30B class, designed for lightweight deployment where performance and efficiency both matter.
Model class and position within the GLM 4.7 family
GLM-4.7-Flash is a text generation model with 31B parameters, BF16 and F32 tensor types, and the architecture tag glm4_moe_lite. It supports English and Chinese and is configured for conversational use. GLM-4.7-Flash sits in the GLM-4.7 collection alongside the larger GLM-4.7 and GLM-4.7-FP8 models.
Z.ai positions GLM-4.7-Flash as a free-tier, lightweight deployment option relative to the full GLM-4.7 model, while still targeting coding, reasoning, and general text generation tasks. This makes it interesting for developers who cannot deploy a 358B-class model but still want a modern MoE design and strong benchmark results.
Architecture and context length
In a Mixture of Experts architecture of this type, the model stores more parameters than it activates for each token. That allows specialization across experts while keeping the effective compute per token closer to that of a smaller dense model.
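To make this concrete, here is a minimal top-k routing layer in PyTorch. It is a toy sketch of the general MoE pattern, not GLM's actual implementation; the layer sizes, expert count, and routing details are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Toy MoE layer: all experts are stored in memory, but only top_k
    of them run for each token, so compute per token stays small."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # choose top_k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TopKMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

With 8 experts but top_k=2, the layer holds 8 experts' worth of weights while each token only pays the compute cost of 2, which is the same tradeoff behind the 30B total / ~3B active split in a 30B-A3B model.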
GLM-4.7-Flash supports a context length of 128k tokens and achieves strong performance on coding benchmarks among models of comparable scale. This context size is suitable for large codebases, multi-file repositories, and long technical documents, where many current models would need aggressive chunking.
GLM-4.7-Flash uses a standard causal language modeling interface and a chat template, which allows integration into existing LLM stacks with minimal changes.
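As a minimal sketch, prompting the model through Transformers could look like the following. The repo id zai-org/GLM-4.7-Flash is an assumption; substitute the actual id from the GLM-4.7 collection on Hugging Face.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.7-Flash"  # assumed repo id, check the GLM-4.7 collection
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# The chat template turns role-tagged messages into the model's prompt format.
messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```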
Benchmark performance in the 30B class
The Z.ai team compares GLM-4.7-Flash with Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. GLM-4.7-Flash leads or is competitive across a mix of math, reasoning, long-horizon, and coding agent benchmarks.

This comparison shows why GLM-4.7-Flash is one of the strongest models in the 30B class, at least among the models included here. The important point is that GLM-4.7-Flash is not only a compact deployment of GLM but also a high-performing model on established coding and agent benchmarks.
Evaluation parameters and thinking mode
For most tasks, the default settings are temperature 1.0, top-p 0.95, and max new tokens 131072. This defines a relatively open sampling regime with a large generation budget.
For Terminal Bench and SWE-bench Verified, the configuration uses temperature 0.7, top-p 1.0, and max new tokens 16384. For τ²-Bench, the configuration uses temperature 0 and max new tokens 16384. These stricter settings reduce randomness for tasks that need stable tool use and multi-step interaction.
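Expressed as vLLM sampling configurations, these published settings would look roughly like this (a sketch, assuming a vLLM-based evaluation harness):

```python
from vllm import SamplingParams

# Default open sampling regime for most tasks.
default_params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=131072)

# Stricter settings for Terminal Bench and SWE-bench Verified.
swe_terminal_params = SamplingParams(temperature=0.7, top_p=1.0, max_tokens=16384)

# τ²-Bench: temperature 0 means greedy, fully deterministic decoding.
tau2_params = SamplingParams(temperature=0.0, max_tokens=16384)
```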
The Z.ai team also recommends turning on Preserved Thinking mode for multi-turn agentic tasks such as τ²-Bench and Terminal Bench 2. This mode preserves internal reasoning traces across turns, which is useful when you build agents that need long chains of function calls and corrections.
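At the message level, preserving thinking across turns roughly means replaying the model's reasoning in the conversation history instead of stripping it. The sketch below illustrates the idea in plain Python; the reasoning_content field name is a hypothetical placeholder, not a documented GLM API.

```python
# Hypothetical sketch: keep reasoning in the replayed history so later
# turns can build on earlier plans instead of re-deriving them.
history = []

def add_assistant_turn(history, answer, reasoning):
    history.append({
        "role": "assistant",
        "content": answer,
        "reasoning_content": reasoning,  # retained across turns (field name assumed)
    })

history.append({"role": "user", "content": "List files, then delete the largest log."})
add_assistant_turn(history, "Calling ls -lS ...",
                   reasoning="Need file sizes before choosing a deletion target.")
history.append({"role": "tool", "content": "app.log 2.1G  access.log 300M"})
# The next request sends the full history, reasoning included.
```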
How GLM-4.7-Flash fits developer workflows
GLM-4.7-Flash combines several properties that are relevant for agentic, coding-focused applications:
- A 30B-A3B MoE architecture with 31B parameters and a 128k token context length.
- Strong benchmark results on AIME 25, GPQA, SWE-bench Verified, τ²-Bench, and BrowseComp compared with the other models in Z.ai's comparison.
- Documented evaluation parameters and a Preserved Thinking mode for multi-turn agent tasks.
- First-class support for vLLM, SGLang, and Transformers-based inference, with ready-to-use commands (see the serving sketch after this list).
- A growing set of finetunes and quantizations, including MLX conversions, in the Hugging Face ecosystem.
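As a hedged end-to-end sketch, a local deployment could serve the model through vLLM's OpenAI-compatible server and query it with the standard openai client. The repo id and served model name are assumptions, not confirmed identifiers.

```python
# Serve first (shell), e.g.:
#   vllm serve zai-org/GLM-4.7-Flash      # repo id assumed
# Then query the local OpenAI-compatible endpoint:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",  # must match the served model name
    messages=[{"role": "user", "content": "Refactor this loop into a list comprehension: ..."}],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI API, existing agent frameworks and coding assistants can point at the local server without code changes beyond the base URL.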