GLM-5 by Zhipu AI (Z.ai) is one of the largest open-weight models of 2026, focused on agentic engineering and long-horizon tasks. The release on February 11–12, 2026, marked an important step in the development of autonomous AI systems.
Spoiler: 744B MoE (40B active), 200K context, strong results in coding and agent benchmarks, but with compromises in speed and multimodality.
⚡ In Brief
- ✅ GLM-5: 744B MoE, 40B active, 200K context, pre-training 28.5T tokens, DeepSeek Sparse Attention (DSA).
- ✅ Strengths: agentic/coding (SWE-bench Verified 77.8%, Vending Bench 2 $4,432), reasoning with tool-use.
- ✅ Limitations: lower inference speed, weaker native multimodality, high self-hosting requirements.
- 🎯 You will get: a detailed technical breakdown of the architecture, benchmarks, capabilities, and real-world use cases.
- 👇 Below — tables, examples, and official links
🎯 What GLM-5 Is and Which Family It Belongs To
GLM-5 is the flagship model of Zhipu AI (Z.ai), released on February 11–12, 2026. It belongs to the GLM (General Language Model) family, developed since 2019 by Tsinghua University's KEG Lab together with Zhipu AI.
GLM-5 is a decoder-only Transformer model with a Mixture-of-Experts (MoE) architecture, focused on high-complexity text tasks: reasoning, coding, agent systems, and long-horizon planning. The model is distributed under the MIT license as open-weight (weights are available for download and modification).
GLM-5 represents an evolutionary step for the GLM family towards scaling parameters and specializing in autonomous systems, moving from short-term code generation to long-term agentic engineering.
Development of the GLM family:
- 2022 — GLM-130B (one of the first large Chinese open models)
- 2023–2024 — GLM-4 series (transition to MoE architecture)
- 2025 — GLM-4.5/4.7 (355B total, 32B active)
- 2026 — GLM-5 (744B total parameters, ~40B active per token)
GLM-5 training was conducted exclusively on the Huawei Ascend hardware stack using the MindSpore framework — a key aspect of independence from the NVIDIA ecosystem after Zhipu AI was added to the US Entity List (2025). The release took place immediately after the Chinese New Year (February 11–12, 2026), with open weights on Hugging Face (zai-org/GLM-5) and ModelScope (ZhipuAI/GLM-5), including an FP8-quantized version for inference.
Official GLM-5 announcement (z.ai/blog) | Repository on Hugging Face | ModelScope
Release Context and Significance in 2026
GLM-5 was released at a time when Chinese companies are actively closing the gap with Western frontier models in the open-weight segment. Scaling to 744B total parameters (with ~40B active per token) and integrating DeepSeek Sparse Attention (DSA) allow for high inference efficiency while maintaining a large context. The main focus is on tasks requiring autonomy: self-correction, multi-step tool chaining, generation of final artifacts instead of simple text.
The MIT license provides complete freedom of use: fine-tuning, self-hosting, commercial deployment without restrictions. This makes GLM-5 attractive for developers and organizations that require control over the model and data, unlike closed APIs like Claude or GPT.
Conclusion: GLM-5 is a logical continuation of the GLM family in terms of scaling and specialization in agent systems, representing one of the largest open projects of the Chinese AI frontier as of 2026.
GLM-5 Model Architecture
GLM-5 is a decoder-only Transformer with a Mixture-of-Experts (MoE) architecture: 744B total parameters, ~40B active per token (top-8 out of 256 experts, sparsity ~5.9%), DeepSeek Sparse Attention (DSA) for efficient long-context processing, RoPE positional encoding, SwiGLU activations, and post-LN normalization.
Pre-training on 28.5T tokens, post-training using the asynchronous RL framework Slime for fine-tuning agentic and reasoning capabilities.
The combination of MoE with DSA allows scaling parameters to 744B while maintaining acceptable inference efficiency, and Slime provides scalable post-training without significant synchronization losses.
The basic architecture of GLM-5 is based on a decoder-only Transformer with the following key elements:
- MoE layer: 256 experts, top-8 activation per token (effectively ~40B parameters per inference step, sparsity ≈5.9%).
- Attention mechanism: DeepSeek Sparse Attention (DSA), which cuts attention cost from quadratic in sequence length to roughly linear by restricting each query to a selected subset of key tokens.
- Positional encoding: Rotary Position Embeddings (RoPE).
- Post-training: asynchronous RL framework Slime, which increases throughput and improves long-horizon agent behaviors.
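The gating step described above can be sketched in a few lines. This is a toy illustration of top-k expert routing, not GLM-5's published router (the exact gating function and score distribution are assumptions): each token scores all 256 experts, the top 8 are kept, and their scores are renormalized into combination weights.

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 256   # total routed experts, per the spec above
TOP_K = 8           # experts activated per token

def route_token(logits):
    """Toy top-k gating: keep the k highest-scoring experts and
    softmax-normalize their scores into combination weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-TOP_K:]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    return top, weights

# Toy router scores for one token (in a real MoE these come from a learned gate):
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
experts, weights = route_token(logits)
print(len(experts), round(sum(weights), 6))  # 8 1.0
```

Only the 8 selected experts run their feed-forward pass for that token, which is why ~40B of the 744B parameters are active per step.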
Official technical description of GLM-5 | ArXiv preprint | GitHub Slime RL
Context Window
GLM-5's context window is 200,000 tokens for input data, with a confirmed value of up to 202,752 tokens in HLE w/Tools tests. The maximum generation length is 131,072 tokens.
A large context window allows GLM-5 to work effectively with long sequences without a proportional increase in computational costs.
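DSA's exact key-selection mechanism is not reproduced here; the toy sketch below shows only the underlying idea of top-k sparse attention, in which each query mixes values from a fixed number of selected keys instead of the whole sequence (real DSA uses a lightweight indexer to pick keys cheaply, which this sketch does not model).

```python
import math
import random

random.seed(1)

def topk_sparse_attention(q, K, V, k=4):
    """Toy sparse attention: each query attends only to its k
    highest-scoring keys, so softmax and value mixing cost grows
    with k rather than with the full sequence length."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in K]
    keep = sorted(range(len(K)), key=lambda i: scores[i])[-k:]   # selected keys
    m = max(scores[i] for i in keep)
    w = [math.exp(scores[i] - m) for i in keep]
    s = sum(w)
    w = [x / s for x in w]
    # Weighted sum over the k selected value vectors only:
    return [sum(wi * V[i][j] for wi, i in zip(w, keep)) for j in range(d)]

seq_len, d = 256, 16
q = [random.gauss(0, 1) for _ in range(d)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]

out = topk_sparse_attention(q, K, V, k=4)
print(len(out))  # 16
```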
Reasoning and Benchmarking Position
GLM-5 demonstrates high results: SWE-bench Verified 77.8%, Terminal-Bench 2.0 60.7%, Vending Bench 2 $4,432. This is one of the best indicators among open-weight models of 2026.
| Benchmark | GLM-5 | Claude Opus 4.5 | GPT-5.2 | Task Type |
|---|---|---|---|---|
| SWE-bench Verified | 77.8% | 80.9% | 80.0% | GitHub issues |
| Terminal-Bench 2.0 | 60.7% | 59.3% | 54.0% | CLI commands |
| HLE w/Tools | 50.4% | 43.4% | 45.5% | Exam with tools |
Conclusion: GLM-5 leads among open models in engineering and agentic scenarios, virtually matching closed flagships of 2026.
Multimodality
Brief answer: GLM-5 is primarily a text model, with an enhanced ability to generate structured documents (.docx, .pdf, .xlsx) from text input. Native processing of images, audio, or video is not built into the model itself; such tasks are handled by separate GLM-family models (e.g., GLM-Image, GLM-4.6V, or GLM-Vision) integrated via API or tool-calling, which is not seamless.
GLM-5's main focus is text reasoning, coding, and agent systems, not universal multimodality.
GLM-5 relies on tools and external models for multimodal scenarios, rather than built-in native processing of multiple data types, as in Gemini 2.0 or GPT-5.2.
Technical specifications of GLM-5 multimodality (as of 2026):
- Document generation: built-in native ability to create structured files (.docx, .pdf, .xlsx) directly from text descriptions or data. The model generates not only text content but also layouts, tables, charts, and formatting (e.g., a sponsorship proposal with sections, tables, a color palette, and image placeholders). This is implemented through a specialized post-training stage and tool-use in Agent mode.
- Native vision/audio/video: absent in the base GLM-5. Processing images, audio, or video requires the use of separate models from the family:
- GLM-Image / GLM-Vision — for image understanding and generation
- GLM-4.6V / GLM-Audio — for audio/multimedia
Integration occurs via tool-calling (e.g., calling an external model to analyze an image, then using the result in GLM-5), which adds latency and complicates the pipeline.
- Multimodal input in API: limited to text + files (PDF, DOCX, XLSX, images as input files for description). GLM-5 can analyze uploaded documents or images through built-in tools, but does not perform deep cross-modal reasoning (e.g., "describe a video and write code based on it") without additional steps.
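The tool-calling integration described above amounts to a two-step pipeline: a vision model produces a text description, which is then folded into an otherwise text-only GLM-5 request. The sketch below is illustrative only; the helper name and message layout are hypothetical, not an official Z.ai API.

```python
# Hypothetical pipeline sketch (not an official Z.ai API): the output of a
# separate vision model is injected as text context for a GLM-5 request.

def build_pipeline_messages(image_caption: str, user_task: str) -> list:
    """Fold an external vision model's text output into a text-only request."""
    return [
        # Caption produced by a separate call to e.g. GLM-Vision:
        {"role": "system",
         "content": "Image context (from an external vision model): " + image_caption},
        {"role": "user", "content": user_task},
    ]

caption = "A bar chart of monthly revenue, peaking in December."
messages = build_pipeline_messages(caption, "Write code that reproduces this chart.")
print(len(messages))  # 2
```

Each extra hop like this adds one model call's worth of latency and tokens, which is exactly the pipeline-fragmentation cost discussed below.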
Official GLM-5 announcement (capabilities section) | Z.ai Multimodal Capabilities Documentation
Comparison with other models and practical implications
Compared to Gemini 2.0 / GPT-5.2 (which have a native unified multimodal backbone), GLM-5 falls short in tasks requiring simultaneous processing of multiple modalities (e.g., video analysis + code generation, or image-to-text reasoning with high accuracy). In benchmarks like MMMU (Multimodal Massive Multitask Understanding), GLM-5 (or GLM-Vision) shows lower results (~70–75% vs 84–88% for leaders).
At the same time, its strong suit is document-heavy scenarios: generating full-fledged reports, presentations, financial models, or PRDs from raw text data without external tools. This makes the model effective in enterprise automation of office processes, where the primary input is text or structured documents.
Limitation: the absence of a single multimodal encoder/decoder leads to pipeline fragmentation (GLM-5 + separate vision model), which increases latency, token consumption, and integration complexity.
Section Conclusion: GLM-5's multimodality is primarily limited to text and document generation, relying on separate family models for vision/audio/video. This makes it suitable for document-centric and text-heavy tasks, but less versatile compared to models with native cross-modal architecture.
Tool-calling Capabilities
GLM-5 supports a full set of OpenAI-compatible tool-calling mechanisms: `tools` and `tool_choice` parameters (auto / required / none / specific function), thinking mode (interleaved and preserved), tool streaming (`tool_stream=true`), structured output (`response_format`), multi-tool calls, and chaining. This allows the model to autonomously plan, invoke tools, analyze results, and iterate until a task is completed.
The model is optimized for complex agentic workflows with self-correction and long-horizon execution, as evidenced by high results in benchmarks such as Vending Bench 2 ($4,432) and BrowseComp (75.9%).
Tool-calling in GLM-5 is implemented not as an additional feature, but as an integral part of post-training, ensuring high accuracy in tool selection and call sequencing.
Technical details of the API implementation (api.z.ai, endpoint /v4/chat/completions, as of 2026):
- tools: an array of functions in JSON Schema format (name, description, parameters). The model returns `tool_calls` in the response with arguments for execution.
- tool_choice:
- "auto" — the model decides whether to invoke a tool
- "required" — mandatory tool invocation
- "none" — prohibition of invocation
- specific function — forced invocation of a particular tool
- thinking mode:
- interleaved — reasoning steps interleaved with tool calls within a single turn
- preserved — preservation of reasoning across multiple turns (especially useful in Agent mode and coding endpoint)
- parameter: `thinking: {"type": "enabled"}` or "disabled"
- tool_stream=true: streaming of tool parameters in real-time (useful for UI and quick display of agent progress).
- structured output: `response_format: { "type": "json_schema", "json_schema": {...} }` — forced JSON output according to schema.
- multi-tool chaining: the model can invoke multiple tools sequentially in a single response or iteratively through multi-turn (plan → invocation → result analysis → next invocation).
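The parameters above combine into a single request body. The sketch below assumes the OpenAI-compatible format described in this section; the `get_weather` tool is a made-up example, and the GLM-specific fields (`thinking`, `tool_stream`) are taken from the list above.

```python
import json

# A hypothetical tool definition in JSON Schema format:
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# OpenAI-compatible request body for /v4/chat/completions:
request_body = {
    "model": "glm-5",
    "messages": [{"role": "user", "content": "What's the weather in Beijing?"}],
    "tools": [get_weather],
    "tool_choice": "auto",              # let the model decide whether to call
    "thinking": {"type": "enabled"},    # interleaved reasoning
    "tool_stream": True,                # stream tool arguments as they decode
}

print(json.dumps(request_body)[:40])
```

If the model decides to call the tool, the response carries a `tool_calls` entry with JSON arguments; the caller executes the function and sends the result back as a `tool` message for the next turn.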
Official Function Calling Documentation | Thinking Mode and preserved reasoning | GLM-5 Announcement with Agentic Workflow Examples
Efficiency and Use Cases
GLM-5 demonstrates high accuracy in tool selection and action sequencing thanks to a specialized RL stage (Slime framework). This is evident in:
- Benchmarks: τ²-Bench 89.7% (tool invocation accuracy), Tool-Decathlon 39.2% (multi-tool).
- Long-horizon tasks: Vending Bench 2 — full business simulation cycle with multi-step planning and self-correction.
- Real-world scenarios: large data analysis → tool invocation for calculation → report generation → result verification → correction.
Limitations: under high load (peak hours), tool-calling may experience delays due to throttling. In complex multi-tool scenarios, an excessive number of calls (over-calling) is sometimes observed, requiring careful prompt and tool description tuning.
Conclusion: Tool-calling in GLM-5 is one of the model's key strengths, providing a reliable foundation for autonomous agents with multi-step planning, self-correction, and effective tool utilization, which distinguishes it among open-weight solutions of 2026.
API Cost
Official API price on api.z.ai for GLM-5: $1 per 1 million input tokens and $3.2 per 1 million output tokens. Cached input: $0.2 per 1 million tokens (storage is temporarily free). The specialized GLM-5-Code costs $1.2 input / $5 output.
This is significantly lower than Claude Opus 4.5/4.6 (~$5–$10 input / $25–$37.5 output) and GPT-5.2 (~$1.75–$5 input / $14–$25 output), but GLM-5 consumes tokens faster due to thinking mode and larger scale (2–3× compared to GLM-4.7).
The low base price makes GLM-5 economically viable for production agents and long sessions, especially considering context caching and open-weight self-hosting capabilities.
Pricing Details (as of February 2026, official docs.z.ai):
| Model | Input (per 1 million tokens) | Cached Input | Cached Storage | Output (per 1 million tokens) | Note |
|---|---|---|---|---|---|
| GLM-5 | $1.00 | $0.20 | Temporarily free | $3.20 | Main model |
| GLM-5-Code | $1.20 | $0.30 | Temporarily free | $5.00 | Optimized for coding/agent |
Official pricing page (docs.z.ai)
Factors affecting real cost
Actual consumption depends on the usage mode:
- Thinking mode (interleaved/preserved) increases token count by 20–50% due to internal reasoning, which raises costs for complex tasks.
- Context caching reduces costs for repeated prefixes (up to $0.2/million), critical for long agent sessions or RAG.
- GLM Coding Plan ($10–$50+/month depending on tier): provides higher quotas, but GLM-5 consumes 2–3× more quota compared to GLM-4.7, making lower plans less cost-effective for intensive use.
- Self-hosting: The MIT license allows local deployment (vLLM/SGLang), eliminating API costs, but requires significant resources (1.5 TB BF16 weights, 8+ H200 GPUs).
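Using the GLM-5 prices from the table above, a quick estimate of a long agent session's cost (the token counts are illustrative, not measured):

```python
# GLM-5 prices from the table above, USD per 1M tokens:
PRICES = {"input": 1.00, "cached_input": 0.20, "output": 3.20}

def session_cost(input_tok, cached_tok, output_tok):
    """Total USD cost for a session with the given token counts."""
    return (input_tok * PRICES["input"]
            + cached_tok * PRICES["cached_input"]
            + output_tok * PRICES["output"]) / 1_000_000

# Example long agent session: 2M fresh input, 8M cache-hit prefix, 1M output.
cost = session_cost(2_000_000, 8_000_000, 1_000_000)
print(f"${cost:.2f}")  # $6.80
```

Note how the cached prefix dominates the token count but not the bill: without caching, the same 10M input tokens would cost $10 instead of $3.60.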
Comparison with competitors (approx. 2026):
- Claude Opus 4.5/4.6: $5–$10 input / $25–$37.5 output (3–10× more expensive than GLM-5).
- GPT-5.2: ~$1.75–$5 input / $14–$25 output (3–8× more expensive).
GLM-5 wins in scenarios with a large volume of tokens (long-horizon agents, RAG), where the price difference becomes significant.
Section Conclusion: The GLM-5 API cost is one of the lowest among frontier-level models, making it attractive for large-scale production applications with agentic and long-context scenarios, especially considering caching and the open-weight self-hosting alternative.
Best Use Cases
GLM-5 is most effective in tasks requiring a high level of autonomy, long-term planning, tool utilization, and large context processing: agentic coding, code generation and refactoring, autonomous agents with self-correction, enterprise RAG on long documents, structured document generation from raw data, and automation of complex workflows.
The model shows advantages where multi-step reasoning, tool-chaining, and the generation of final artifacts are critical, rather than just a text response.
GLM-5 specializes in transitioning from short-term code generation to full task execution with autonomous planning and result verification, making it suitable for production agents and engineering pipelines.
Key scenarios where GLM-5 demonstrates the highest effectiveness (based on benchmarks and architectural features):
- Agentic coding and software engineering: full-cycle code generation (full-stack: frontend + backend + deploy scripts), refactoring of legacy code, bug-fixing, architectural planning. Strong results on SWE-bench Verified (77.8%) and SWE-bench Multilingual (73.3%) allow processing real GitHub issues on large repositories (200K context).
- Autonomous agents with long-horizon planning: multi-turn agents with self-correction, tool-chaining, and iterative execution (Vending Bench 2 — $4,432 balance in a year-long business simulation; BrowseComp 75.9% — web navigation with context management). Suitable for autonomous pipelines like Devin-like agents or enterprise automation.
- Document generation and automation: creation of full-fledged reports, PRDs, financial models, lesson plans, sponsorship proposals from raw data → ready-made files (.docx, .pdf, .xlsx) with layouts, tables, charts. This is one of the model's strongest aspects in Agent mode.
- Enterprise RAG and long-context reasoning: analysis of large documents, codebases, logs, legal texts (200K+ context with DSA for quality stability). Suitable for corporate search engines, compliance document analysis, technical documentation.
- Tool-heavy and multi-step tasks: scenarios involving multiple tool calls, result verification, and correction (Terminal-Bench 2.0 56–61%, HLE w/Tools 50.4%).
Scenarios where GLM-5 is less effective
The model is less optimal in tasks with the following requirements:
- Ultra-low latency interactive chatbots (thinking mode adds latency, speed ~17–19 tokens/s).
- Heavy native multimodality (vision/audio/video reasoning — requires separate models, integration is not seamless).
- High-load real-time systems with thousands of parallel requests (throttling and limited concurrency at peaks).
- Creative or highly ambiguous tasks requiring high situational awareness (Claude Opus 4.5 often excels in nuanced prompts and UI/mockup generation).
Overall, GLM-5 performs best in scenarios where the priority is autonomy, long-term planning, generation of final artifacts, and token cost-efficiency, rather than maximum speed or universal multimodality.
Section Conclusion: GLM-5 is most effective in tasks requiring high autonomy, multi-step planning, tool-use, and large context processing — agentic coding, enterprise RAG, document generation, and long-horizon agents. In scenarios with strict requirements for latency, creativity, or native multimodality, other models have an advantage.
❓ Frequently Asked Questions (FAQ)
When was GLM-5 released?
The official release of GLM-5 took place on February 11–12, 2026 (immediately after the Chinese New Year). The announcement and opening of weights on Hugging Face and ModelScope occurred during the same period.
What is the GLM-5 license?
The model is distributed under the MIT license. This means full open-weight status: downloading weights, fine-tuning, self-hosting, commercial use, modification, and distribution are permitted without restrictions, provided copyright and license statements are retained.
Does GLM-5 support vision and other multimodal capabilities?
GLM-5 is primarily a text-based model without native support for image, audio, or video processing. For vision tasks (recognition, description, image analysis), separate models from the GLM family are used (e.g., GLM-Image or GLM-Vision). Integration is possible via tool-calling or Z.ai API, but it is not seamless and adds extra steps to the pipeline. For full-fledged multimodal scenarios (image-to-text reasoning, video understanding), GLM-5 falls short compared to models with a unified multimodal backbone (Gemini 2.0, GPT-5.2).
What is the maximum context length in GLM-5?
Officially, 200,000 tokens are stated for the input context, with a confirmed value of up to 202,752 tokens in HLE w/Tools tests. The maximum generation length is 131,072 tokens (128K–131K depending on configuration). DeepSeek Sparse Attention (DSA) ensures quality stability across the full window.
Can GLM-5 be run locally (self-hosting)?
Yes, thanks to the MIT license and open weights, the model is available for local deployment. Frameworks such as vLLM, SGLang, KTransformers, and Ascend NPU are supported. Requirements: ~1.5 TB of memory in BF16 (minimum 8× H200/H20 GPUs with high-bandwidth interconnect). FP8 quantization significantly reduces requirements, but it still demands enterprise-level hardware. For smaller teams, using the API or OpenRouter is recommended.
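The ~1.5 TB figure follows directly from the parameter count. The sketch below shows the arithmetic for weights only (KV cache and activations add more on top, so it is a lower bound):

```python
# Rough self-hosting memory estimate for the figures quoted above.
TOTAL_PARAMS = 744e9  # total parameters; in an MoE, ALL experts must be resident

def weight_memory_tb(params, bytes_per_param):
    """Memory for the weights alone, in terabytes."""
    return params * bytes_per_param / 1e12

bf16 = weight_memory_tb(TOTAL_PARAMS, 2)  # BF16: 2 bytes per parameter
fp8 = weight_memory_tb(TOTAL_PARAMS, 1)   # FP8: 1 byte per parameter
print(f"BF16 ~ {bf16:.2f} TB, FP8 ~ {fp8:.2f} TB")
```

Even though only ~40B parameters are active per token, every expert must sit in GPU memory, which is why inference needs a multi-GPU node despite the modest active-parameter count.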
✅ Conclusions
- 🔹 GLM-5 is a large-scale open-weight model (744B MoE, ~40B active per token) with an emphasis on agentic engineering, coding, and long-horizon tasks, distinguishing it among the GLM family and open-weight solutions of 2026.
- 🔹 Key technical advantages: a context window of 200K+ tokens with DeepSeek Sparse Attention, strong benchmark results (SWE-bench Verified 77.8%, Terminal-Bench 2.0 56.2–60.7%, HLE w/Tools 50.4%, Vending Bench 2 $4,432), low API cost ($1 / $3.2 per million tokens), and a full MIT license for fine-tuning and self-hosting.
- 🔹 Limitations include lower inference speed (~17–19 tokens/s in thinking mode), lack of native multimodality (vision/audio/video processing via separate family models), high resource requirements for local deployment (~1.5 TB BF16 weights, minimum 8× H200/H20 GPUs), and operational service limitations (throttling, concurrency at peaks).
- 🔹 The model demonstrates competitiveness in tasks involving autonomous planning, multi-step tool-use, self-correction, and generation of final artifacts, approaching the level of Claude Opus 4.5 / GPT-5.2 in specialized agentic and coding evaluations.
Main takeaway: In 2026, GLM-5 is one of the most significant open solutions for developers and companies requiring autonomous agents, large context processing, long-term planning, and economical inference with the ability to fully control the model and data.
Full overview of the Z.ai (Zhipu AI) 2026 platform
A detailed analysis of the Z.ai platform, including a comparison of Chat and Agent modes, API architecture, service limitations, GLM-5 positioning, and usage recommendations, is available in the article:
Z.ai (Zhipu AI) 2026: Platform Architecture, Chat vs Agent Modes, and GLM-5 Capabilities