The build.nvidia.com catalog contains over 100 models. This is both its strength and its problem: if you're new to the platform, the choice is paralyzing. DeepSeek or Kimi? Nemotron or Llama? GLM-5 or Qwen3.5?
This article is a practical technical breakdown of which model to run for which specific task.
This article continues "NVIDIA NIM: How Free Inference is Changing AI System Architecture", where I analyzed why inference is becoming a commodity layer, how NVIDIA Build differs from OpenRouter and Groq, and what that means architecturally for AI agents. Here the focus is narrower: specific models, specific code.
Connecting to NVIDIA NIM: First Steps
Before comparing models, a basic setup. It takes less than 5 minutes.
Step 1. Register for the NVIDIA Developer Program (free, only email required).
Step 2. Generate an API key on build.nvidia.com. The key will have the prefix nvapi-.
Step 3. Install dependencies:
pip install openai
export NVIDIA_API_KEY="nvapi-your-key"
Basic Python Client:
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["NVIDIA_API_KEY"],  # exported in Step 3; avoids hardcoding the key
    base_url="https://integrate.api.nvidia.com/v1",
)
response = client.chat.completions.create(
model="meta/llama-4-scout-17b-16e-instruct", # only change this line
messages=[
{"role": "system", "content": "You are a technical assistant."},
{"role": "user", "content": "Write a Python function for binary search."}
],
temperature=0.1,
max_tokens=1024
)
print(response.choices[0].message.content)
curl Equivalent:
curl https://integrate.api.nvidia.com/v1/chat/completions \
-H "Authorization: Bearer $NVIDIA_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-4-scout-17b-16e-instruct",
"messages": [
{"role": "user", "content": "Write a Python function for binary search."}
],
"max_tokens": 1024
}'
Key detail: the base URL is the same for all models. To switch from Llama to DeepSeek, you only need to change one line — the value of the model parameter. This is why choosing the right model is not code refactoring, but configuration.
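Because only the `model` value changes, the model ID can live outside the code entirely. A minimal sketch, assuming a hypothetical `NIM_MODEL` environment variable and a helper name of my own choosing (neither is defined by the platform):

```python
import os

DEFAULT_MODEL = "meta/llama-4-scout-17b-16e-instruct"


def build_request(prompt: str, env=None) -> dict:
    """Build chat-completion keyword arguments; the model ID comes from configuration."""
    env = os.environ if env is None else env
    return {
        # NIM_MODEL is a hypothetical variable name chosen for this sketch.
        "model": env.get("NIM_MODEL", DEFAULT_MODEL),
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
        "max_tokens": 1024,
    }


# Usage with an OpenAI-compatible client pointed at the NIM base URL:
# response = client.chat.completions.create(**build_request("Hello"))
```

Switching models then means changing an environment variable rather than editing source.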
Model Landscape: Who's Who in the NIM Catalog
As of May 2026, over 100 models are available in the build.nvidia.com catalog. Let's break down the main families:
| Family | Developer | Key Models in NIM | License | Strength |
|---|---|---|---|---|
| DeepSeek V4 | DeepSeek AI (China) | V4-Flash, V4-Pro, V4 (671B) | MIT | General quality, coding, cost-efficiency |
| Kimi K2 | Moonshot AI (China) | K2.5, K2.6, K2-Thinking | Kimi License | Agentic coding, long context |
| Nemotron 3 | NVIDIA | Nano Omni (30B), Super (120B), Ultra (500B) | NVIDIA Open Model License | NVIDIA hardware throughput, agentic tasks |
| Qwen 3.5 | Alibaba (China) | Qwen3-8B, Qwen3-32B, Qwen3.5-235B-MoE | Apache 2.0 | Coding, multilingual (especially CJK) |
| GLM-5 / GLM-4 | Zhipu AI (China) | GLM-4.7, GLM-5, GLM-5.1 | MIT | Agentic workflows, function calling |
| Llama 4 | Meta (USA) | Scout 17B, Maverick 70B, Llama 3.3 70B | Llama Community License | General use, tool use |
| MiniMax M2 | MiniMax (China) | M2.5, M2.7 (230B MoE) | MiniMax License | Reasoning, multimodality |
| Gemma 4 | Google | Gemma 4 31B, Gemma 2B / 7B | Gemma License | Light tasks, summarization |
| Specialized | Various | NemoClaw, Llama Guard, NV-Embed | Various | Safety, embedding, guardrails |
One observation stands out: in 2026, most of the top open-weight models come from Chinese labs. According to BenchLM.ai, the top positions in the open-weight rankings are held by DeepSeek V4 Pro (87 points), Kimi K2.6 (84), GLM-5.1 (83), and Qwen3.5 397B (79). Against this field, Meta's Llama no longer dominates the top of the table.
Benchmarks: Real Performance Numbers
The figures below are compiled from Artificial Analysis, BenchLM.ai, and LearnAIForge (May 2026).
Overall Intelligence Index Ranking (Artificial Analysis v4.0)
| Model | Intelligence Index | Type | Available in NIM |
|---|---|---|---|
| Kimi K2.6 | 54 | Open-weight | ✓ |
| MiMo-V2.5-Pro | 54 | Open-weight | ✓ |
| DeepSeek V4 Pro (Reasoning) | 52 | Open-weight | ✓ |
| GLM-5.1 (Reasoning) | ~50 | Open-weight (MIT) | ✓ |
| Nemotron 3 Super 120B | 61 (BenchLM) | Open-weight | ✓ |
| Nemotron 3 Ultra 500B | 65 (BenchLM) | Open-weight | ✓ |
SWE-Bench Verified (autonomous fixing of real GitHub issues)
| Model | SWE-Bench Score | Note |
|---|---|---|
| Nemotron 3 Super | 60.47% | +18.5 pp over GPT-OSS-120B; 7.5x higher throughput than Qwen3.5-122B |
| Qwen3.5-122B | ~66% | Higher score, but lower throughput |
| DeepSeek V4 Pro | 89/100 (coding harness) | Requires a special harness for maximum results |
| Kimi K2.6 | 87/100 (coding harness) | 3.6x cheaper than Claude Opus on the same tasks |
| GPT-OSS-120B (reference) | ~42% | Reference for comparison |
RULER Long-Context (accuracy on 1M tokens)
| Model | RULER @ 1M ctx | Maximum Context |
|---|---|---|
| Nemotron 3 Super | 91.75% | 1M tokens |
| Nemotron 3 Ultra | Largest among open-weight | 10M tokens |
| DeepSeek V4 Flash | High | 1M tokens |
| GPT-OSS-120B | 22.30% | Degrades sharply on large contexts |
Task 1 — Coding and Agentic Coding
Coding is the most competitive category among NIM models in 2026. I will break down three difficulty levels.
Level 1: Simple tasks (function, algorithm, bug fix)
Recommendation: deepseek-ai/deepseek-v4-flash
DeepSeek V4 Flash is a 284B MoE model that activates only a portion of its parameters per token. This gives an unusual speed-to-quality ratio: in practice the model runs much lighter than its total size suggests.
According to practical developer tests, V4 Flash handles about 80% of typical coding tasks with a quality that previously required significantly more expensive and heavier models.
# Simple coding tasks — DeepSeek V4 Flash
response = client.chat.completions.create(
model="deepseek-ai/deepseek-v4-flash",
messages=[
{
"role": "system",
"content": "You are an expert Python developer. Return only code, no explanations."
},
{
"role": "user",
"content": "Write a binary search function with type hints and docstring."
}
],
temperature=0.0, # always 0 for code
max_tokens=512
)
Level 2: Complex tasks (multi-file editing, refactoring)
Recommendation: moonshotai/kimi-k2.6 or deepseek-ai/deepseek-v4-pro
According to a comparative coding agent benchmark, Kimi K2.6 scores 87/100 and costs 3.6 times less than Claude Opus on similar tasks. DeepSeek V4 Pro scores 89/100 but requires a specific harness to unlock its full potential.
# Complex agentic coding — Kimi K2.6
response = client.chat.completions.create(
model="moonshotai/kimi-k2.6",
messages=[
{
"role": "system",
"content": (
"You are a senior software engineer. "
"When editing code, show ONLY the changed parts with clear markers. "
"Always verify your changes are consistent across all files."
)
},
{
"role": "user",
"content": "Refactor this FastAPI app to use async SQLAlchemy:\n\n[code here]"
}
],
temperature=0.1,
max_tokens=4096
)
Level 3: Autonomous GitHub issue fixing (SWE-Bench class)
My recommendation: nvidia/nemotron-3-super-120b
For tasks like "autonomously fix this bug in the repository," Nemotron 3 Super scores 60.47% on SWE-Bench Verified and delivers 7.5x higher throughput than Qwen3.5-122B at a somewhat lower score (60.47% versus roughly 66%). For me, throughput is the critical factor in scenarios where agents process multiple tasks in parallel.
# Autonomous coding agent — Nemotron 3 Super
# IMPORTANT: for thinking models, add the correct parameters
response = client.chat.completions.create(
model="nvidia/nemotron-3-super-120b",
messages=[
{
"role": "system",
"content": (
"You are an autonomous software engineer. "
"Given a GitHub issue description and relevant code, "
"produce a complete patch. Think step by step before coding."
)
},
{
"role": "user",
"content": "Issue: #1234 — Memory leak in connection pool\n\n[repository code]"
}
],
temperature=0.15,
max_tokens=8192
)
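The throughput argument can be made concrete with rough arithmetic. The sketch below assumes that relative throughput translates directly into attempts per unit of time and that each attempt succeeds at the benchmark rate. Both are simplifications, and the function name is illustrative.

```python
def relative_fixes_per_unit_time(success_rate: float, relative_throughput: float) -> float:
    """Expected resolved issues per unit of time, relative to a baseline of 1 attempt."""
    return success_rate * relative_throughput


# Figures quoted in this article: SWE-Bench Verified score and relative throughput.
nemotron_super = relative_fixes_per_unit_time(0.6047, 7.5)
qwen_122b = relative_fixes_per_unit_time(0.66, 1.0)
ratio = nemotron_super / qwen_122b

print(f"Nemotron 3 Super: {nemotron_super:.2f}")
print(f"Qwen3.5-122B:     {qwen_122b:.2f}")
print(f"Ratio:            {ratio:.1f}x")
```

Under these assumptions the lower-scoring model still resolves several times more issues per unit of time, which is why throughput can outweigh a few points of benchmark score for parallel agents.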
Coding Summary Table
| Scenario | Recommended Model | Why |
|---|---|---|
| Simple functions, snippets | deepseek-ai/deepseek-v4-flash | Speed + quality + credit savings |
| Multi-file refactoring | moonshotai/kimi-k2.6 | Long context + sub-agent parallelism |
| Autonomous coding agent | nvidia/nemotron-3-super-120b | 60.47% SWE-Bench Verified + highest throughput |
| Maximum accuracy (not real-time) | deepseek-ai/deepseek-v4-pro | 89/100 on coding harness |
Task 2 — Complex Reasoning and Mathematics
Reasoning involves tasks where the model must "think" before responding: mathematics, logic, GPQA Diamond, Humanity's Last Exam.
I've noticed that for most tasks, the best all-around choice remains deepseek-ai/deepseek-v4-pro with reasoning mode enabled.
For scientific tasks: qwen/qwen3.5-397b — Humanity's Last Exam score of 25.30% compared to 18.26% for Nemotron Super.
# Reasoning task with explicit chain-of-thought
response = client.chat.completions.create(
model="deepseek-ai/deepseek-v4-pro",
messages=[
{
"role": "system",
"content": (
"You are a mathematical reasoning engine. "
"Always show your full chain of thought before the final answer. "
"Format: ...\n\nFinal answer: ..."
)
},
{
"role": "user",
"content": (
"A train leaves city A at 60 km/h. Another leaves city B at 80 km/h "
"toward A. The cities are 420 km apart. When and where do they meet?"
)
}
],
temperature=0.0,
max_tokens=2048
)
Important for thinking models: DeepSeek V4-Pro and Kimi K2-Thinking are "thinking models" that generate an internal chain-of-thought before responding. Set temperature=0.0 or a very low value for them; otherwise the reasoning becomes unstable.
NIM Reasoning Model Comparison
| Model | GPQA Diamond | Humanity's Last Exam | Optimal for |
|---|---|---|---|
| DeepSeek V4 Pro (Reasoning) | High | ~20% | Coding + logic reasoning |
| Qwen3.5-397B (Reasoning) | High | 25.30% | Scientific tasks, mathematics |
| Nemotron 3 Super | Moderate | 18.26% | Agentic throughput, not frontier science |
| MiniMax M2.7 | High | Competes with DeepSeek-R1 | Pure chain-of-thought reasoning |
Task 3 — RAG and Long Context
In my practical experiments with RAG (Retrieval-Augmented Generation), where a model receives a large external context such as documents, knowledge bases, or code repositories, choosing the model by context volume and retrieval task type gave more stable results than using one general-purpose model for everything.
My recommendation: deepseek-ai/deepseek-v4-flash for most RAG scenarios and nvidia/nemotron-3-ultra-500b for extreme long-context tasks.
Key selection parameters I consider when working with RAG:
- Context Size: DeepSeek V4 Flash — up to 1M tokens; Nemotron Ultra — up to 10M tokens, which is currently one of the largest figures among open-weight models
- Quality on Long Context: Nemotron 3 Super shows 91.75% on RULER @ 1M ctx versus 22.30% for GPT-OSS-120B, which significantly impacts the stability of responses in long documents
- Retrieval Speed: DeepSeek Flash versions usually offer a better latency/quality balance for standard RAG pipelines, especially with a large number of parallel requests
# RAG pipeline: DeepSeek V4 Flash with documents
def query_rag(user_question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(retrieved_chunks)
    response = client.chat.completions.create(
        model="deepseek-ai/deepseek-v4-flash",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer questions using ONLY the provided context. "
                    "If the answer is not in the context, say so explicitly. "
                    "Always cite the relevant passage."
                )
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_question}"
            }
        ],
        temperature=0.0,
        max_tokens=1024
    )
    return response.choices[0].message.content

# For extremely long documents: Nemotron Ultra (10M ctx)
def query_giant_document(document: str, question: str) -> str:
    response = client.chat.completions.create(
        model="nvidia/nemotron-3-ultra-500b",
        messages=[
            {
                "role": "system",
                "content": "Analyze the entire document carefully before answering."
            },
            {
                "role": "user",
                "content": f"Document:\n{document}\n\nQuestion: {question}"
            }
        ],
        temperature=0.0,
        max_tokens=2048
    )
    return response.choices[0].message.content
RAG Selection Table
| Context Size | Recommended Model | Reason |
|---|---|---|
| Up to 128K tokens | deepseek-ai/deepseek-v4-flash | Speed + accuracy + savings |
| 128K to 1M tokens | moonshotai/kimi-k2.5 or deepseek-ai/deepseek-v4 | Optimized for long context RAG |
| Over 1M tokens | nvidia/nemotron-3-ultra-500b | 10M ctx (the 91.75% RULER @ 1M result is for Nemotron 3 Super) |
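The context-size thresholds in this table can be turned into a small routing function. This is a sketch under stated assumptions: the helper names are mine, and `estimate_tokens` uses a rough "about 4 characters per token" rule of thumb that should be replaced with the model's real tokenizer before relying on it.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate, assuming about 4 characters per token."""
    return max(1, len(text) // 4)


def pick_rag_model(context_tokens: int) -> str:
    """Map a context size to the model recommended in the table above."""
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    if context_tokens <= 128_000:
        return "deepseek-ai/deepseek-v4-flash"
    if context_tokens <= 1_000_000:
        return "moonshotai/kimi-k2.5"
    if context_tokens <= 10_000_000:
        return "nvidia/nemotron-3-ultra-500b"
    raise ValueError("context exceeds the largest window listed in the table")


document = "word " * 200_000  # 1,000,000 characters of filler text
print(pick_rag_model(estimate_tokens(document)))
```

Raising on oversized input is a deliberate choice here: silently truncating a document usually hides the problem until answers start going wrong.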
Task 4 — Multi-agent Orchestration
For multi-agent systems, the key parameter is not just the quality of a single request, but throughput during parallel sessions and the reliability of tool calling.
Nemotron 3 Super uses a hybrid Mamba-Transformer MoE architecture. Compared to dense models, this provides 5x higher inference throughput on NVIDIA hardware for concurrent agent sessions — due to MoE activating only a portion of parameters per token.
For speculative decoding: Nemotron 3 Super achieves an average acceptance length of 3.45 tokens per verification step compared to 2.70 for DeepSeek-R1 — resulting in up to 3x acceleration without a separate draft model.
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    api_key="nvapi-your-key",
    base_url="https://integrate.api.nvidia.com/v1"
)

# Parallel execution of specialized agents
async def run_agent(role: str, model: str, task: str) -> dict:
    response = await async_client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": f"You are a specialist in {role}."},
                  {"role": "user", "content": task}],
        temperature=0.1,
        max_tokens=1024
    )
    return {
        "role": role,
        "model": model,
        "result": response.choices[0].message.content
    }

async def multi_agent_pipeline(user_task: str) -> dict:
    # Define specialized agents: each gets an optimal model
    agents = [
        ("planning", "nvidia/nemotron-3-super-120b", f"Plan this task: {user_task}"),
        ("coding", "moonshotai/kimi-k2.6", f"Write code for: {user_task}"),
        ("retrieval", "deepseek-ai/deepseek-v4-flash", f"Find relevant info about: {user_task}"),
        ("summarizer", "google/gemma-4-31b-it", f"Summarize results for: {user_task}"),
    ]
    # Run in parallel to save time
    results = await asyncio.gather(*[
        run_agent(role, model, task)
        for role, model, task in agents
    ])
    return {r["role"]: r["result"] for r in results}

# Run
results = asyncio.run(multi_agent_pipeline(
    "Analyze our Q3 sales data and generate a board presentation"
))
Note the architectural decision: each agent in the system receives the model best suited to its role. For summarization I use the cheaper Gemma instead of Nemotron, because the quality difference is minimal for simple summaries while the cost and latency difference is significant.
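When many agents run in parallel, it helps to cap how many requests are in flight at once. The sketch below shows one way to do that with `asyncio.Semaphore`. The `run_bounded` helper and the `fake_agent` stub are illustrative names, and the stub stands in for a real API call so the example runs offline.

```python
import asyncio


async def run_bounded(jobs, worker, max_concurrency: int = 4) -> list:
    """Run worker(job) for every job, with at most max_concurrency running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(job):
        async with semaphore:
            return await worker(job)

    # return_exceptions=True returns failures as values instead of raising the
    # first one, so the results of the successful agents are not lost.
    return await asyncio.gather(*(guarded(job) for job in jobs), return_exceptions=True)


async def fake_agent(job: tuple[str, str]) -> str:
    role, model = job
    await asyncio.sleep(0.01)  # stands in for the network call
    return f"{role} handled by {model}"


jobs = [
    ("planning", "nvidia/nemotron-3-super-120b"),
    ("coding", "moonshotai/kimi-k2.6"),
    ("retrieval", "deepseek-ai/deepseek-v4-flash"),
]
print(asyncio.run(run_bounded(jobs, fake_agent, max_concurrency=2)))
```

Results come back in the same order as `jobs`, so they can be zipped with the roles afterwards.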
Task 5 — Multilingual Tasks
If your product serves an audience in multiple languages, the choice of model significantly impacts quality.
For CJK (Chinese, Japanese, Korean): qwen/qwen3-32b or zhipuai/glm-5.1 — Qwen and GLM have native Chinese support that significantly surpasses models from Meta or NVIDIA.
For Slavic languages and general multilingual: deepseek-ai/deepseek-v4-flash — shows good results on most European languages.
For multimodal multilingual (text + image + audio): nvidia/nemotron-3-nano-omni — a 30B MoE model released on April 28, 2026, supports text, image, video, and audio through a unified architecture.
# Multilingual pipeline with automatic model selection
LANGUAGE_MODEL_MAP = {
"zh": "zhipuai/glm-4.7", # Chinese — GLM is best
"ja": "qwen/qwen3-32b", # Japanese — Qwen is stronger
"ko": "qwen/qwen3-32b", # Korean
"uk": "deepseek-ai/deepseek-v4", # Ukrainian
"ru": "deepseek-ai/deepseek-v4", # Russian
"en": "deepseek-ai/deepseek-v4-flash", # English — flash is sufficient
"default": "deepseek-ai/deepseek-v4"
}
def multilingual_query(text: str, lang: str) -> str:
    model = LANGUAGE_MODEL_MAP.get(lang, LANGUAGE_MODEL_MAP["default"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": text}],
        temperature=0.3,
        max_tokens=1024
    )
    return response.choices[0].message.content
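A lookup keyed on two-letter codes silently falls through to the default when it receives a full locale tag such as `zh-CN`. A small normalization step avoids that. This is a sketch with illustrative helper names and a shortened copy of the map so it runs on its own:

```python
MODEL_MAP = {
    "zh": "zhipuai/glm-4.7",
    "ja": "qwen/qwen3-32b",
    "default": "deepseek-ai/deepseek-v4",
}


def normalize_lang(tag: str) -> str:
    """Reduce a locale tag such as 'zh-CN' or 'uk_UA' to its primary subtag."""
    primary = tag.strip().replace("_", "-").split("-")[0]
    return primary.lower()


def pick_language_model(tag: str, model_map: dict[str, str] = MODEL_MAP) -> str:
    """Look up the model for a locale tag, falling back to the default entry."""
    return model_map.get(normalize_lang(tag), model_map["default"])


print(pick_language_model("zh-CN"))
print(pick_language_model("fr"))
```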
Task 6 — Structured Output and Function Calling
In my practice, I often find that reliable structured output (JSON according to a schema) and function calling are critical components for production agentic systems. Not all models handle this equally well, especially when dealing with complex schemas or nested tools.
I discussed the mechanics of tool use, JSON Schema, and its connection to RAG in more detail here: tool use vs function calling and RAG.
Models with confirmed function calling support in NIM: Llama 3.1 70B/405B, Nemotron-3-Super, GLM-4.7, GLM-5.1, Kimi K2.5, Mixtral 8x22B, Qwen 2.5 72B — all support the standard OpenAI tool use format.
import json
# Function calling — GLM-5.1 or Nemotron Super
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": {"type": "string", "description": "City name"},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["city"]
}
}
}
]
response = client.chat.completions.create(
model="zhipuai/glm-5.1", # Or: nvidia/nemotron-3-super-120b
messages=[
{"role": "user", "content": "What's the weather in Kyiv and Berlin?"}
],
tools=tools,
tool_choice="auto",
temperature=0.0
)
# Check tool call
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        print(f"Tool: {tool_call.function.name}, Args: {args}")
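Printing the arguments is only half of the loop: the tool has to be executed and its result sent back as a `tool` message. The sketch below assumes the tool calls have already been converted to plain dicts (for example with `model_dump()` in recent versions of the `openai` SDK). The registry, the stub `get_weather`, and the helper name are illustrative.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stub implementation; a real one would call a weather service."""
    return {"city": city, "unit": unit, "temperature": None}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls: list[dict], registry: dict) -> list[dict]:
    """Execute each requested tool and build the `tool` messages to send back."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        try:
            args = json.loads(call["function"]["arguments"])
            result = registry[name](**args)
        except (KeyError, TypeError, json.JSONDecodeError) as exc:
            # Report the failure to the model instead of crashing the agent loop.
            result = {"error": f"{type(exc).__name__}: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return messages


example_calls = [
    {"id": "call_1", "function": {"name": "get_weather", "arguments": '{"city": "Kyiv"}'}},
    {"id": "call_2", "function": {"name": "get_weather", "arguments": '{"city": "Berlin"}'}},
]
for msg in run_tool_calls(example_calls, TOOL_REGISTRY):
    print(msg)
```

In a real conversation the assistant message that contains the tool calls is appended to `messages` first, followed by these `tool` messages, and then the model is called again.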
Structured JSON output without function calling:
# Force JSON output via system prompt
response = client.chat.completions.create(
model="zhipuai/glm-5.1",
messages=[
{
"role": "system",
"content": (
"You MUST respond ONLY with valid JSON matching this schema exactly. "
"No explanations, no markdown, no code blocks.\n"
"Schema: {\"sentiment\": \"positive|negative|neutral\", "
"\"score\": 0.0-1.0, \"keywords\": [\"string\"]}"
)
},
{
"role": "user",
"content": "Analyze sentiment: 'The product exceeded all my expectations!'"
}
],
temperature=0.0,
max_tokens=256
)
# ALWAYS validate the output
try:
    result = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    # Fallback logic for broken JSON
    result = None  # keep the name defined for the code that follows
    print("Model returned invalid JSON, retrying with stricter prompt...")
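Parsing alone does not guarantee that the reply matches the schema. A minimal validation sketch for the sentiment schema used in the prompt might look like the following. The function name is illustrative, and the fence-stripping step is a defensive assumption about how models sometimes wrap JSON.

```python
import json
from typing import Optional

FENCE = "`" * 3  # Markdown code fence marker
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_sentiment(raw: str) -> Optional[dict]:
    """Return the validated dict, or None if the reply does not match the schema."""
    text = raw.strip()
    # Some models wrap JSON in a Markdown fence despite instructions; unwrap it.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    sentiment = data.get("sentiment")
    score = data.get("score")
    keywords = data.get("keywords")
    valid = (
        isinstance(sentiment, str)
        and sentiment in ALLOWED_SENTIMENTS
        and isinstance(score, (int, float))
        and not isinstance(score, bool)
        and 0.0 <= score <= 1.0
        and isinstance(keywords, list)
        and all(isinstance(k, str) for k in keywords)
    )
    return data if valid else None


reply = '{"sentiment": "positive", "score": 0.97, "keywords": ["exceeded", "expectations"]}'
print(parse_sentiment(reply))
```

A `None` result is the signal to retry, ideally with the validation failure included in the follow-up prompt.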
Practice: How to Switch Models Without Refactoring
The most effective approach is a centralized configuration object that separates model selection from business logic:
from dataclasses import dataclass
from typing import Optional

from openai import OpenAI


@dataclass
class ModelConfig:
    model_id: str
    temperature: float = 0.1
    max_tokens: int = 1024
    supports_tools: bool = True
    context_window: int = 128_000


# Centralized configuration: change here, not in the code
MODELS = {
    "coding_simple": ModelConfig("deepseek-ai/deepseek-v4-flash", temperature=0.0),
    "coding_complex": ModelConfig("moonshotai/kimi-k2.6", temperature=0.1, max_tokens=4096),
    "coding_agent": ModelConfig("nvidia/nemotron-3-super-120b", temperature=0.15, max_tokens=8192),
    "reasoning": ModelConfig("deepseek-ai/deepseek-v4-pro", temperature=0.0, max_tokens=4096),
    "rag_standard": ModelConfig("deepseek-ai/deepseek-v4-flash", temperature=0.0),
    "rag_longcontext": ModelConfig("nvidia/nemotron-3-ultra-500b", temperature=0.0, context_window=10_000_000),
    "multilingual": ModelConfig("deepseek-ai/deepseek-v4", temperature=0.3),
    "structured": ModelConfig("zhipuai/glm-5.1", temperature=0.0),
    "summarizer": ModelConfig("google/gemma-4-31b-it", temperature=0.3, max_tokens=512),
}


class NIMClient:
    def __init__(self, api_key: str):
        self.client = OpenAI(
            api_key=api_key,
            base_url="https://integrate.api.nvidia.com/v1"
        )

    def query(
        self,
        task: str,  # descriptive label only; it is not sent to the API
        messages: list[dict],
        config_key: str = "coding_simple",
        tools: Optional[list] = None
    ) -> Optional[str]:
        cfg = MODELS[config_key]
        kwargs = {
            "model": cfg.model_id,
            "messages": messages,
            "temperature": cfg.temperature,
            "max_tokens": cfg.max_tokens,
        }
        if tools and cfg.supports_tools:
            kwargs["tools"] = tools
            kwargs["tool_choice"] = "auto"
        response = self.client.chat.completions.create(**kwargs)
        # content is None when the model answers with tool calls instead of text
        return response.choices[0].message.content


# Usage: business logic doesn't know about specific models
nim = NIMClient(api_key="nvapi-your-key")

# To change the model, just change config_key
result = nim.query(
    task="coding",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
    config_key="coding_simple"  # change to "coding_complex" for more complex tasks
)
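One payoff of keeping model IDs in configuration is that a fallback order becomes data as well. The sketch below is an illustration rather than part of the setup above: `query_with_fallback` is a hypothetical helper, and `call` is any function that takes a model ID and returns text, so the example runs without network access.

```python
from typing import Callable, Sequence


def query_with_fallback(call: Callable[[str], str], model_ids: Sequence[str]) -> tuple[str, str]:
    """Try each model in order; return (model_id, answer) from the first that succeeds."""
    errors = []
    for model_id in model_ids:
        try:
            return model_id, call(model_id)
        except Exception as exc:  # deliberately broad in this sketch
            errors.append(f"{model_id}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_call(model_id: str) -> str:
    """Stand-in for a real API call; the first model is simulated as unavailable."""
    if model_id == "moonshotai/kimi-k2.6":
        raise TimeoutError("simulated outage")
    return f"answer from {model_id}"


print(query_with_fallback(fake_call, ["moonshotai/kimi-k2.6", "deepseek-ai/deepseek-v4-flash"]))
```

In production the `except` clause would be narrowed to the SDK's timeout and availability errors so that programming mistakes still surface.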
Decision Tree: Which Model to Choose
What is the task?
│
├── CODING
│ ├── Simple snippet / function
│ │ └── deepseek-ai/deepseek-v4-flash ✓ fast, cheap
│ ├── Multi-file / refactoring
│ │ └── moonshotai/kimi-k2.6 ✓ long context
│ └── Autonomous agent (SWE-Bench class)
│ └── nvidia/nemotron-3-super-120b ✓ highest throughput
│
├── REASONING / MATH
│ ├── Logic, coding reasoning
│ │ └── deepseek-ai/deepseek-v4-pro ✓ CoT reasoning
│ └── Scientific tasks, HLE benchmark
│ └── qwen/qwen3.5-397b ✓ 25.30% HLE
│
├── RAG / LONG CONTEXT
│ ├── Up to 128K tokens
│ │ └── deepseek-ai/deepseek-v4-flash ✓ fast retrieval
│ ├── Up to 1M tokens
│ │ └── moonshotai/kimi-k2.5 ✓ optimized for long-ctx
│ └── Over 1M tokens
│ └── nvidia/nemotron-3-ultra-500b ✓ 10M ctx
│
├── MULTI-AGENT ORCHESTRATOR
│ └── nvidia/nemotron-3-super-120b ✓ 5x throughput, MoE eff.
│
├── MULTILINGUAL
│ ├── CJK (zh/ja/ko)
│ │ └── qwen/qwen3-32b or zhipuai/glm-4.7
│ └── European languages
│ └── deepseek-ai/deepseek-v4
│
├── STRUCTURED OUTPUT / FUNCTION CALLING
│ └── zhipuai/glm-5.1 or nvidia/nemotron-3-super-120b
│
└── SUMMARIZATION (budget-sensitive)
└── google/gemma-4-31b-it ✓ cheapest for simple tasks
Gotchas: Model-Specific Behaviors
What's not in the documentation but critical in production:
1. Thinking models require special handling
Kimi K2-Thinking and DeepSeek V4-Pro in reasoning mode are "thinking models." They generate an internal chain-of-thought before responding. According to practical developer experience, switching from a thinking model to a regular one without adjusting parameters can lead to API errors.
# For thinking models — DO NOT pass reasoning_budget with regular models
# For non-thinking models — set NIM_ENABLE_THINKING=false if there's a conflict
# Check before querying
THINKING_MODELS = {
"moonshotai/kimi-k2-thinking",
"deepseek-ai/deepseek-v4-pro", # in reasoning mode
}
def safe_query(model: str, messages: list, **kwargs) -> str:
    if model in THINKING_MODELS:
        kwargs.setdefault("temperature", 0.0)
        # DO NOT add stream=True for thinking models without extra handling
    return client.chat.completions.create(
        model=model, messages=messages, **kwargs
    ).choices[0].message.content
2. GLM/Qwen require specific flags for reasoning tags
# GLM and Qwen 3.5 require --reasoning-format none
# if you don't want <think> tags in the response
# This is handled via the system prompt in the API:
system_no_thinking = (
    "Respond directly without showing your reasoning process. "
    "Do not use <think> tags."
)
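A system prompt is a request, not a guarantee, so it can be worth stripping reasoning blocks on the client side as well. The sketch below assumes the reasoning is wrapped in `<think>...</think>`; the tag name is an assumption to verify against actual model output, and the helper name is illustrative.

```python
import re

# Assumes reasoning is wrapped in <think>...</think>; adjust for other tag names.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL | re.IGNORECASE)


def strip_reasoning(text: str) -> str:
    """Remove complete reasoning blocks and surrounding whitespace from a reply."""
    return _THINK_BLOCK.sub("", text).strip()


raw_reply = "<think>The user wants a sum.\n2 + 2 = 4</think>\n\nThe answer is 4."
print(strip_reasoning(raw_reply))
```

An unterminated block (for example when the reply is cut off by `max_tokens`) is left untouched by this pattern, which makes truncated responses easy to spot.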
3. Llama 4 uses Pythonic tool format
Llama 4 Scout has a different tool call format (Pythonic syntax) than the standard OpenAI format. If your parser expects a standard JSON tool call, it might break when switching to Llama 4.
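If the parser has to cope with this style, Python's `ast` module can read the calls without executing anything. The sketch below assumes the model emits a Python-style list such as `[get_weather(city="Kyiv")]`. The exact output format should be checked against the model card, and the function name is my own.

```python
import ast


def parse_pythonic_tool_calls(text: str) -> list[dict]:
    """Parse '[fn(a=1), other(b="x")]' into [{'name': ..., 'arguments': {...}}, ...]."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a list of calls")
    calls = []
    for node in tree.body.elts:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            raise ValueError("expected simple function calls")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are supported")
        # literal_eval accepts literals only, so nothing from the model is executed.
        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls


raw = '[get_weather(city="Kyiv", unit="celsius"), get_weather(city="Berlin")]'
for call in parse_pythonic_tool_calls(raw):
    print(call)
```

The result has the same name-plus-arguments shape as a JSON tool call, so the rest of the pipeline does not need to know which format the model produced.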
4. Rate limit of 40 RPM and agentic workflows
With multi-step agents, a single "logical request" can generate 5–10 API calls. 40 RPM = maximum ~4–8 actual user tasks per minute for a single agent.
import time
import functools

def rate_limited(max_per_minute: int = 35):  # leave a buffer from 40
    min_interval = 60.0 / max_per_minute
    last_called = [0.0]

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait = min_interval - elapsed
            if wait > 0:
                time.sleep(wait)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limited(max_per_minute=35)
def api_call(messages, model):
    return client.chat.completions.create(
        model=model, messages=messages, max_tokens=1024
    )
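The arithmetic behind the "4 to 8 tasks per minute" estimate can be written out directly. This is a trivial sketch with an illustrative function name, using the 40 RPM limit and the 5 to 10 calls per task quoted in this section:

```python
def max_tasks_per_minute(rpm_limit: int, calls_per_task: int) -> int:
    """Whole user tasks that fit into one minute of request budget."""
    if calls_per_task <= 0:
        raise ValueError("calls_per_task must be positive")
    return rpm_limit // calls_per_task


for calls in (5, 10):
    print(f"{calls} calls per task: "
          f"{max_tasks_per_minute(40, calls)} tasks/min at 40 RPM, "
          f"{max_tasks_per_minute(35, calls)} tasks/min with a 35 RPM buffer")
```

With the 35 RPM safety buffer the budget drops to 3 to 7 tasks per minute, which is worth knowing before sizing an agent fleet.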
5. Summary table of model-specific behaviors
| Model / Family | Feature | Action to Take |
|---|---|---|
| Kimi K2-Thinking, DeepSeek V4-Pro | Thinking model with internal CoT | temperature=0.0, do not mix with non-thinking configs |
| GLM-5, Qwen 3.5 | Defaults to outputting <think> tags | Add "Do not show thinking" in system prompt |
| Llama 4 Scout/Maverick | Pythonic tool call format | Separate parser or use Llama 3.3 70B for tool use |
| Nemotron 3 Super/Ultra | MoE: low throughput on small batches | Optimal with parallel requests, not single-shot |
| Gemma 4 | Requires build b8665+ for local execution | When deploying locally, check the version |
| All models (free tier) | 40 RPM, 1000 credits on signup | Rate limiting + exponential backoff for 429 errors |
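The exponential backoff mentioned in the last row can be sketched as follows. `RateLimited` is a stand-in exception defined here so the example runs offline; with the `openai` SDK the equivalent, to my knowledge, is `openai.RateLimitError`. The helper name and default delays are assumptions.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for the SDK's HTTP 429 error (hypothetical name for this sketch)."""


def with_backoff(func, *, retries=5, base_delay=1.0, max_delay=30.0,
                 retry_on=(RateLimited,), sleep=time.sleep):
    """Call func(); on a rate-limit error wait with exponential backoff and retry."""
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the computed delay.
            sleep(random.uniform(0, delay))


# Usage with a real client (not executed here):
# response = with_backoff(lambda: api_call(messages, model))
```

The jitter matters with parallel agents: without it, every agent that hit the limit at the same moment would retry at the same moment too.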
Conclusion
My own summary: choosing the right model in NVIDIA NIM is not about finding the "best" model, but about correctly decomposing tasks and assigning a specialized model to each role in the system.
Three key principles I use in practice:
- I don't use a heavy model where a light one suffices. Gemma 4 for summarization instead of Nemotron Ultra is not a compromise, but an architecturally sound decision.
- I separate model selection from business logic. A centralized ModelConfig allows me to change the model without refactoring the system's core code.
- I account for model-specific behaviors from the start. Thinking models, different tool calling formats, and tokenizer differences inevitably surface in production, so it is better to build them into the architecture in advance.
Sources