24 May 2026

NVIDIA NIM: Which Model to Choose for Which Task – Technical Review 2026

Updated: 24 June 2026

Dmitro Petrov

A Tech Lead who builds AI/ML systems for production — and writes about how they actually work.

The build.nvidia.com catalog contains over 100 models. This is both its strength and its problem: if you're new to the platform, the choice is paralyzing. DeepSeek or Kimi? Nemotron or Llama? GLM-5 or Qwen3.5?

This article is a practical technical breakdown of which model to run for which specific task.

Read the previous material? In the article "NVIDIA NIM: How Free Inference is Changing AI System Architecture", I analyzed why inference is becoming a commodity layer, how NVIDIA Build differs from OpenRouter and Groq, and what architectural implications this has for AI agents. This material is a logical continuation: specific models, specific code.

Connecting to NVIDIA NIM: First Steps
Model Landscape: Who's Who in the NIM Catalog
Benchmarks: Real Performance Numbers
Task 1 — Coding and Agentic Coding
Task 2 — Complex Reasoning and Mathematics
Task 3 — RAG and Long Context
Task 4 — Multi-Agent Orchestration
Task 5 — Multilingual Tasks
Task 6 — Structured Output and Function Calling
Practice: How to Switch Models Without Refactoring
Decision Tree: Which Model to Choose
Gotchas: Model-Specific Behaviors

Connecting to NVIDIA NIM: First Steps

Before comparing models, a basic setup. It takes less than 5 minutes.

Step 1. Register for the NVIDIA Developer Program (free, only email required).

Step 2. Generate an API key on build.nvidia.com. The key will have the prefix nvapi-.

Step 3. Install dependencies:

pip install openai
export NVIDIA_API_KEY="nvapi-your-key"

Basic Python Client:

import os

from openai import OpenAI

# Read the key exported in Step 3 instead of hardcoding it in source
client = OpenAI(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1"
)

response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",  # only change this line
    messages=[
        {"role": "system", "content": "You are a technical assistant."},
        {"role": "user", "content": "Write a Python function for binary search."}
    ],
    temperature=0.1,
    max_tokens=1024
)

print(response.choices[0].message.content)

curl Equivalent:

curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-4-scout-17b-16e-instruct",
    "messages": [
      {"role": "user", "content": "Write a Python function for binary search."}
    ],
    "max_tokens": 1024
  }'

Key detail: the base URL is the same for all models. To switch from Llama to DeepSeek, you only need to change one line — the value of the model parameter. This is why choosing the right model is not code refactoring, but configuration.
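
Since switching models is configuration, the model ID can live outside the code entirely. Below is a minimal sketch of that idea. The environment variable name NIM_MODEL and the helper resolve_model are hypothetical (NVIDIA does not define them); the fallback is the Llama 4 Scout ID used above.

```python
import os

# Hypothetical variable name chosen for this sketch; NVIDIA does not define it.
MODEL_ENV_VAR = "NIM_MODEL"
DEFAULT_MODEL = "meta/llama-4-scout-17b-16e-instruct"


def resolve_model(env=None) -> str:
    """Return the model ID from the environment, or the default if unset or blank."""
    env = os.environ if env is None else env
    value = env.get(MODEL_ENV_VAR, "").strip()
    return value or DEFAULT_MODEL
```

With this in place, the call becomes model=resolve_model(), and a deployment can switch models by exporting a different value without touching the code.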

Model Landscape: Who's Who in the NIM Catalog

As of May 2026, over 100 models are available in the build.nvidia.com catalog. Let's break down the main families:

Family	Developer	Key Models in NIM	License	Strength
DeepSeek V4	DeepSeek AI (China)	V4-Flash, V4-Pro, V4 (671B)	MIT	General quality, coding, cost-efficiency
Kimi K2	Moonshot AI (China)	K2.5, K2.6, K2-Thinking	Kimi License	Agentic coding, long context
Nemotron 3	NVIDIA	Nano Omni (30B), Super (120B), Ultra (500B)	NVIDIA Open Model License	NVIDIA hardware throughput, agentic tasks
Qwen 3.5	Alibaba (China)	Qwen3-8B, Qwen3-32B, Qwen3.5-235B-MoE	Apache 2.0	Coding, multilingual (especially CJK)
GLM-5 / GLM-4	Zhipu AI (China)	GLM-4.7, GLM-5, GLM-5.1	MIT	Agentic workflows, function calling
Llama 4	Meta (USA)	Scout (17B active, 16 experts), Maverick (17B active, 128 experts), Llama 3.3 70B	Llama Community License	General use, tool use
MiniMax M2	MiniMax (China)	M2.5, M2.7 (230B MoE)	MiniMax License	Reasoning, multimodality
Gemma 4	Google	Gemma 4 31B, Gemma 2B / 7B	Gemma License	Light tasks, summarization
Specialized	Various	NemoClaw, Llama Guard, NV-Embed	Various	Safety, embedding, guardrails

One observation worth noting: in 2026, most of the top open-weight models come from Chinese labs. According to BenchLM.ai, the top positions in the open-weight rankings are held by DeepSeek V4 Pro (87 points), Kimi K2.6 (84), GLM-5.1 (83), and Qwen3.5 397B (79). Meta's Llama no longer dominates the top of the table.

Benchmarks: Real Performance Numbers

The figures below are compiled from Artificial Analysis, BenchLM.ai, and LearnAIForge (May 2026).

Overall Intelligence Index Ranking (Artificial Analysis v4.0; the two Nemotron rows show BenchLM scores, which use a different scale)

Model	Intelligence Index	Type	Available in NIM
Kimi K2.6	54	Open-weight	✓
MiMo-V2.5-Pro	54	Open-weight	✓
DeepSeek V4 Pro (Reasoning)	52	Open-weight	✓
GLM-5.1 (Reasoning)	~50	Open-weight (MIT)	✓
Nemotron 3 Super 120B	61 (BenchLM)	Open-weight	✓
Nemotron 3 Ultra 500B	65 (BenchLM)	Open-weight	✓

SWE-Bench Verified (autonomous fixing of real GitHub issues)

Model	SWE-Bench Score	Note
Nemotron 3 Super	60.47%	+18.5 pp over GPT-OSS-120B; 7.5x higher throughput than Qwen3.5-122B
Qwen3.5-122B	~66%	Higher score, but lower throughput
DeepSeek V4 Pro	89/100 (coding harness)	Requires a special harness for maximum results
Kimi K2.6	87/100 (coding harness)	3.6x cheaper than Claude Opus on the same tasks
GPT-OSS-120B (reference)	~42%	Reference for comparison

RULER Long-Context (accuracy on 1M tokens)

Model	RULER @ 1M ctx	Maximum Context
Nemotron 3 Super	91.75%	1M tokens
Nemotron 3 Ultra	Largest among open-weight	10M tokens
DeepSeek V4 Flash	High	1M tokens
GPT-OSS-120B	22.30%	Degrades sharply on large contexts

Task 1 — Coding and Agentic Coding

Coding is the most competitive category among NIM models in 2026. I will break down three difficulty levels.

Level 1: Simple tasks (function, algorithm, bug fix)

Recommendation: deepseek-ai/deepseek-v4-flash

DeepSeek V4 Flash is a 284B MoE model that activates only a portion of its parameters per token. In practice this gives an unusual speed-to-quality ratio: the model runs much lighter than its total size suggests.

According to practical developer tests, V4 Flash handles about 80% of typical coding tasks with a quality that previously required significantly more expensive and heavier models.

# Simple coding tasks — DeepSeek V4 Flash
response = client.chat.completions.create(
    model="deepseek-ai/deepseek-v4-flash",
    messages=[
        {
            "role": "system",
            "content": "You are an expert Python developer. Return only code, no explanations."
        },
        {
            "role": "user",
            "content": "Write a binary search function with type hints and docstring."
        }
    ],
    temperature=0.0,  # always 0 for code
    max_tokens=512
)

Level 2: Complex tasks (multi-file editing, refactoring)

Recommendation: moonshotai/kimi-k2.6 or deepseek-ai/deepseek-v4-pro

According to a comparative coding agent benchmark, Kimi K2.6 scores 87/100 and costs 3.6 times less than Claude Opus on similar tasks. DeepSeek V4 Pro scores 89/100 but requires a specific harness to unlock its full potential.

# Complex agentic coding — Kimi K2.6
response = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a senior software engineer. "
                "When editing code, show ONLY the changed parts with clear markers. "
                "Always verify your changes are consistent across all files."
            )
        },
        {
            "role": "user",
            "content": "Refactor this FastAPI app to use async SQLAlchemy:\n\n[code here]"
        }
    ],
    temperature=0.1,
    max_tokens=4096
)

Level 3: Autonomous GitHub issue fixing (SWE-Bench class)

My recommendation: nvidia/nemotron-3-super-120b

For tasks like "autonomously fix this bug in the repository," Nemotron 3 Super shows 60.47% on SWE-Bench Verified and provides 7.5x higher throughput compared to Qwen3.5-122B with comparable quality — which is a critical factor for me in scenarios where agents process multiple tasks in parallel.

# Autonomous coding agent — Nemotron 3 Super
# IMPORTANT: for thinking models, add the correct parameters
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b",
    messages=[
        {
            "role": "system",
            "content": (
                "You are an autonomous software engineer. "
                "Given a GitHub issue description and relevant code, "
                "produce a complete patch. Think step by step before coding."
            )
        },
        {
            "role": "user",
            "content": "Issue: #1234 — Memory leak in connection pool\n\n[repository code]"
        }
    ],
    temperature=0.15,
    max_tokens=8192
)

Coding Summary Table

Scenario	Recommended Model	Why
Simple functions, snippets	`deepseek-ai/deepseek-v4-flash`	Speed + quality + credit savings
Multi-file refactoring	`moonshotai/kimi-k2.6`	Long context + sub-agent parallelism
Autonomous coding agent	`nvidia/nemotron-3-super-120b`	60.47% on SWE-Bench Verified + highest throughput
Maximum accuracy (not real-time)	`deepseek-ai/deepseek-v4-pro`	89/100 on coding harness

Task 2 — Complex Reasoning and Mathematics

Reasoning involves tasks where the model must "think" before responding: mathematics, logic, GPQA Diamond, Humanity's Last Exam.

I've noticed that for most tasks, the best all-around choice remains deepseek-ai/deepseek-v4-pro with reasoning mode enabled.

For scientific tasks: qwen/qwen3.5-397b — Humanity's Last Exam score of 25.30% compared to 18.26% for Nemotron Super.

# Reasoning task with explicit chain-of-thought
response = client.chat.completions.create(
    model="deepseek-ai/deepseek-v4-pro",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a mathematical reasoning engine. "
                "Always show your full chain of thought before the final answer. "
                "Format: ...\n\nFinal answer: ..."
            )
        },
        {
            "role": "user",
            "content": (
                "A train leaves city A at 60 km/h. Another leaves city B at 80 km/h "
                "toward A. The cities are 420 km apart. When and where do they meet?"
            )
        }
    ],
    temperature=0.0,
    max_tokens=2048
)

Important for thinking models: DeepSeek V4-Pro and Kimi K2-Thinking are "thinking models": they run an internal chain-of-thought before responding. For these models, set temperature=0.0 or a very low value; otherwise the reasoning becomes unstable.

NIM Reasoning Model Comparison

Model	GPQA Diamond	Humanity's Last Exam	Optimal for
DeepSeek V4 Pro (Reasoning)	High	~20%	Coding + logic reasoning
Qwen3.5-397B (Reasoning)	High	25.30%	Scientific tasks, mathematics
Nemotron 3 Super	Moderate	18.26%	Agentic throughput, not frontier science
MiniMax M2.7	High	Competes with DeepSeek-R1	Pure chain-of-thought reasoning

Task 3 — RAG and Long Context

In my practical experiments with RAG (Retrieval-Augmented Generation), where a model receives a large external context such as documents, knowledge bases, or code repositories, choosing the model by context volume and retrieval task type gave more stable results than using one general-purpose model for everything.

My recommendation: deepseek-ai/deepseek-v4-flash for most RAG scenarios and nvidia/nemotron-3-ultra-500b for extreme long-context tasks.

Key selection parameters I consider when working with RAG:

Context Size: DeepSeek V4 Flash — up to 1M tokens; Nemotron Ultra — up to 10M tokens, which is currently one of the largest figures among open-weight models
Quality on Long Context: Nemotron 3 Super shows 91.75% on RULER @ 1M ctx versus 22.30% for GPT-OSS-120B, which significantly impacts the stability of responses in long documents
Retrieval Speed: DeepSeek Flash versions usually offer a better latency/quality balance for standard RAG pipelines, especially with a large number of parallel requests

# RAG pipeline — DeepSeek V4 Flash with documents
def query_rag(user_question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(retrieved_chunks)

    response = client.chat.completions.create(
        model="deepseek-ai/deepseek-v4-flash",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer questions using ONLY the provided context. "
                    "If the answer is not in the context, say so explicitly. "
                    "Always cite the relevant passage."
                )
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_question}"
            }
        ],
        temperature=0.0,
        max_tokens=1024
    )
    return response.choices[0].message.content

# For extremely long documents — Nemotron Ultra (10M ctx)
def query_giant_document(document: str, question: str) -> str:
    response = client.chat.completions.create(
        model="nvidia/nemotron-3-ultra-500b",
        messages=[
            {
                "role": "system",
                "content": "Analyze the entire document carefully before answering."
            },
            {
                "role": "user",
                "content": f"Document:\n{document}\n\nQuestion: {question}"
            }
        ],
        temperature=0.0,
        max_tokens=2048
    )
    return response.choices[0].message.content

RAG Selection Table

Context Size	Recommended Model	Reason
Up to 128K tokens	`deepseek-ai/deepseek-v4-flash`	Speed + accuracy + savings
128K — 1M tokens	`moonshotai/kimi-k2.5` or `deepseek-ai/deepseek-v4`	Optimized for long context RAG
Over 1M tokens	`nvidia/nemotron-3-ultra-500b`	10M ctx, largest among open-weight models (the 91.75% RULER score is Nemotron 3 Super's, at 1M)
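
The selection table can be turned into a small routing function. This is a sketch under stated assumptions: the thresholds and model IDs come from the table, while the names pick_rag_model and estimate_tokens are hypothetical, and the "4 characters per token" estimate is a rough rule of thumb rather than a real tokenizer.

```python
# Thresholds and model IDs mirror the RAG selection table above.
RAG_TIERS = [
    (128_000, "deepseek-ai/deepseek-v4-flash"),
    (1_000_000, "moonshotai/kimi-k2.5"),
]
RAG_FALLBACK = "nvidia/nemotron-3-ultra-500b"


def estimate_tokens(text: str) -> int:
    """Very rough estimate (assumption): about 4 characters per token."""
    return len(text) // 4 + 1


def pick_rag_model(context_tokens: int) -> str:
    """Return the first model whose tier can hold the given context size."""
    for limit, model in RAG_TIERS:
        if context_tokens <= limit:
            return model
    return RAG_FALLBACK
```

In a real pipeline it is worth leaving headroom below each threshold for the system prompt and the generated answer, and replacing the estimate with the tokenizer of the chosen model.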

Task 4 — Multi-agent Orchestration

For multi-agent systems, the key parameter is not just the quality of a single request, but throughput during parallel sessions and the reliability of tool calling.

Nemotron 3 Super uses a hybrid Mamba-Transformer MoE architecture. Compared to dense models, this provides 5x higher inference throughput on NVIDIA hardware for concurrent agent sessions — due to MoE activating only a portion of parameters per token.

For speculative decoding: Nemotron 3 Super achieves an average acceptance length of 3.45 tokens per verification step compared to 2.70 for DeepSeek-R1 — resulting in up to 3x acceleration without a separate draft model.

import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI(
    api_key="nvapi-your-key",
    base_url="https://integrate.api.nvidia.com/v1"
)

# Parallel execution of specialized agents
async def run_agent(role: str, model: str, task: str) -> dict:
    response = await async_client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": f"You are a specialist in {role}."},
                  {"role": "user", "content": task}],
        temperature=0.1,
        max_tokens=1024
    )
    return {
        "role": role,
        "model": model,
        "result": response.choices[0].message.content
    }

async def multi_agent_pipeline(user_task: str) -> dict:
    # Define specialized agents: each gets the model best suited to its role
    agents = [
        ("planning",    "nvidia/nemotron-3-super-120b",     f"Plan this task: {user_task}"),
        ("coding",      "moonshotai/kimi-k2.6",             f"Write code for: {user_task}"),
        ("retrieval",   "deepseek-ai/deepseek-v4-flash",    f"Find relevant info about: {user_task}"),
    ]

    # Independent agents run in parallel to save time
    results = await asyncio.gather(*[
        run_agent(role, model, task)
        for role, model, task in agents
    ])
    outputs = {r["role"]: r["result"] for r in results}

    # The summarizer depends on the other agents' output, so it runs last
    combined = "\n\n".join(f"[{role}]\n{text}" for role, text in outputs.items())
    summary = await run_agent(
        "summarizer",
        "google/gemma-4-31b-it",
        f"Summarize results for: {user_task}\n\n{combined}",
    )
    outputs["summarizer"] = summary["result"]
    return outputs

# Run
results = asyncio.run(multi_agent_pipeline(
    "Analyze our Q3 sales data and generate a board presentation"
))

Note the architectural decision: each agent in the system receives the model best suited to its role. For summarization, for example, I use the cheaper Gemma instead of Nemotron, because the quality difference is minimal for simple summarization while the cost and latency difference is significant.
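
Parallel agents can exhaust a per-minute request quota quickly, so it can help to cap how many requests are in flight at once. The sketch below uses asyncio.Semaphore for that. The helper name gather_limited and the default limit of 4 are assumptions for illustration, not values from NVIDIA.

```python
import asyncio


async def gather_limited(coro_factories, limit: int = 4):
    """Run zero-argument coroutine factories with at most `limit` in flight.

    Results come back in the same order as the factories.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(factory):
        # Each task waits here until one of the `limit` slots is free
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in coro_factories))
```

Factories are used instead of ready-made coroutines so that no request starts before it holds a slot. With the run_agent function above, a call could look like await gather_limited([lambda a=a: run_agent(*a) for a in agents], limit=4).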

Task 5 — Multilingual Tasks

If your product serves an audience in multiple languages, the choice of model significantly impacts quality.

For CJK (Chinese, Japanese, Korean): qwen/qwen3-32b or zhipuai/glm-5.1 — Qwen and GLM have native Chinese support that significantly surpasses models from Meta or NVIDIA.

For Slavic languages and general multilingual: deepseek-ai/deepseek-v4-flash — shows good results on most European languages.

For multimodal multilingual (text + image + audio): nvidia/nemotron-3-nano-omni — a 30B MoE model released on April 28, 2026, supports text, image, video, and audio through a unified architecture.

# Multilingual pipeline with automatic model selection
LANGUAGE_MODEL_MAP = {
    "zh": "zhipuai/glm-4.7",          # Chinese — GLM is best
    "ja": "qwen/qwen3-32b",            # Japanese — Qwen is stronger
    "ko": "qwen/qwen3-32b",            # Korean
    "uk": "deepseek-ai/deepseek-v4",   # Ukrainian
    "ru": "deepseek-ai/deepseek-v4",   # Russian
    "en": "deepseek-ai/deepseek-v4-flash",  # English — flash is sufficient
    "default": "deepseek-ai/deepseek-v4"
}

def multilingual_query(text: str, lang: str) -> str:
    model = LANGUAGE_MODEL_MAP.get(lang, LANGUAGE_MODEL_MAP["default"])

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": text}],
        temperature=0.3,
        max_tokens=1024
    )
    return response.choices[0].message.content
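
The map above needs a language code as input. When the caller does not know the language, a crude guess from Unicode script ranges is sometimes enough for routing. This is a heuristic sketch only: guess_lang is a hypothetical helper, it cannot tell Latin-script languages apart, and a dedicated language-identification library will be more accurate.

```python
def guess_lang(text: str) -> str:
    """Guess a language code from the Unicode ranges present in `text`.

    Heuristic limits: Chinese text that contains kana is reported as
    Japanese, and Cyrillic without Ukrainian-specific letters is
    reported as Russian.
    """
    has_han = has_cyrillic = False
    for ch in text:
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3:      # Hangul syllables
            return "ko"
        if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana
            return "ja"
        if 0x4E00 <= cp <= 0x9FFF:      # CJK unified ideographs
            has_han = True
        elif 0x0400 <= cp <= 0x04FF:    # Cyrillic
            has_cyrillic = True
    if has_han:
        return "zh"
    if has_cyrillic:
        ukrainian_only = set("іїєґІЇЄҐ")
        return "uk" if ukrainian_only & set(text) else "ru"
    # Latin and everything else falls through to the map's "default" entry
    return "default"
```

Usage: multilingual_query(text, guess_lang(text)). English text lands on the "default" entry here, so pass "en" explicitly when the language is known.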

Task 6 — Structured Output and Function Calling

In my experience, reliable structured output (JSON matching a schema) and function calling are critical for production agentic systems. Not all models handle them equally well, especially with complex schemas or nested tools.

I discussed the mechanics of tool use, JSON Schema, and its connection to RAG in more detail here: tool use vs function calling and RAG.

Models with confirmed function calling support in NIM: Llama 3.1 70B/405B, Nemotron-3-Super, GLM-4.7, GLM-5.1, Kimi K2.5, Mixtral 8x22B, Qwen 2.5 72B — all support the standard OpenAI tool use format.

import json

# Function calling — GLM-5.1 or Nemotron Super
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="zhipuai/glm-5.1",   # Or: nvidia/nemotron-3-super-120b
    messages=[
        {"role": "user", "content": "What's the weather in Kyiv and Berlin?"}
    ],
    tools=tools,
    tool_choice="auto",
    temperature=0.0
)

# Check tool call
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        print(f"Tool: {tool_call.function.name}, Args: {args}")
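
Printing the tool call is only half of the loop: the tool has to run, and its result goes back to the model as a message with role "tool" and the matching tool_call_id. A minimal sketch of that step follows. The get_weather stub, the TOOL_REGISTRY dict, and run_tool_calls are hypothetical names for illustration; the message shape follows the OpenAI tool-use format mentioned above.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stub implementation; a real version would call a weather service."""
    return {"city": city, "unit": unit, "temperature": None}


# Hypothetical registry mapping tool names to local Python callables
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY) -> list[dict]:
    """Execute each tool call and build the `tool` messages for the follow-up request."""
    messages = []
    for call in tool_calls:
        try:
            args = json.loads(call.function.arguments)
            result = registry[call.function.name](**args)
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Report the failure to the model instead of crashing the agent loop
            result = {"error": f"{type(exc).__name__}: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages
```

To complete the round trip, append the assistant message that contained the tool calls to the conversation, then the messages returned here, and call chat.completions.create again so the model can produce the final answer.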

Structured JSON output without function calling:

# Force JSON output via system prompt
response = client.chat.completions.create(
    model="zhipuai/glm-5.1",
    messages=[
        {
            "role": "system",
            "content": (
                "You MUST respond ONLY with valid JSON matching this schema exactly. "
                "No explanations, no markdown, no code blocks.\n"
                "Schema: {\"sentiment\": \"positive|negative|neutral\", "
                "\"score\": 0.0-1.0, \"keywords\": [\"string\"]}"
            )
        },
        {
            "role": "user",
            "content": "Analyze sentiment: 'The product exceeded all my expectations!'"
        }
    ],
    temperature=0.0,
    max_tokens=256
)

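# ALWAYS validate the output
try:
    result = json.loads(response.choices[0].message.content or "")
except json.JSONDecodeError:
    # Fallback for broken JSON: mark the result as missing so the caller
    # can retry with a stricter prompt instead of hitting an undefined name
    result = None
    print("Model returned invalid JSON; retry with a stricter prompt.")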
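
Parsing alone does not guarantee that the JSON matches the schema from the prompt. The sketch below checks the three fields of that sentiment schema by hand. The function name parse_sentiment is hypothetical, and stripping a markdown fence is an assumption about a common failure mode rather than documented behavior of any specific model.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
FENCE = "`" * 3  # markdown code fence that some models add despite instructions


def parse_sentiment(raw: str):
    """Return the parsed dict if it matches the prompt's schema, else None."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the surrounding backticks and an optional "json" language tag
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    sentiment = data.get("sentiment")
    score = data.get("score")
    keywords = data.get("keywords")
    valid = (
        isinstance(sentiment, str)
        and sentiment in ALLOWED_SENTIMENTS
        and isinstance(score, (int, float))
        and not isinstance(score, bool)
        and 0.0 <= score <= 1.0
        and isinstance(keywords, list)
        and all(isinstance(k, str) for k in keywords)
    )
    return data if valid else None
```

A None result is the signal to retry. For larger schemas, a validation library is likely a better fit than hand-written checks.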

Practice: How to Switch Models Without Refactoring

The most effective approach is a centralized configuration object that separates model selection from business logic:

from dataclasses import dataclass
from openai import OpenAI
from typing import Optional

@dataclass
class ModelConfig:
    model_id: str
    temperature: float = 0.1
    max_tokens: int = 1024
    supports_tools: bool = True
    context_window: int = 128_000

# Centralized configuration — change here, not in the code
MODELS = {
    "coding_simple":    ModelConfig("deepseek-ai/deepseek-v4-flash",    temperature=0.0),
    "coding_complex":   ModelConfig("moonshotai/kimi-k2.6",             temperature=0.1, max_tokens=4096),
    "coding_agent":     ModelConfig("nvidia/nemotron-3-super-120b",     temperature=0.15, max_tokens=8192),
    "reasoning":        ModelConfig("deepseek-ai/deepseek-v4-pro",      temperature=0.0, max_tokens=4096),
    "rag_standard":     ModelConfig("deepseek-ai/deepseek-v4-flash",    temperature=0.0),
    "rag_longcontext":  ModelConfig("nvidia/nemotron-3-ultra-500b",     temperature=0.0, context_window=10_000_000),
    "multilingual":     ModelConfig("deepseek-ai/deepseek-v4",          temperature=0.3),
    "structured":       ModelConfig("zhipuai/glm-5.1",                  temperature=0.0),
    "summarizer":       ModelConfig("google/gemma-4-31b-it",            temperature=0.3, max_tokens=512),
}

class NIMClient:
    def __init__(self, api_key: str):
        self.client = OpenAI(
            api_key=api_key,
            base_url="https://integrate.api.nvidia.com/v1"
        )

    def query(
        self,
        task: str,
        messages: list[dict],
        config_key: str = "coding_simple",
        tools: Optional[list] = None
    ) -> str:
        # `task` is a free-form label (handy for logging); routing uses config_key
        cfg = MODELS[config_key]

        kwargs = {
            "model": cfg.model_id,
            "messages": messages,
            "temperature": cfg.temperature,
            "max_tokens": cfg.max_tokens,
        }
        # Only attach tools when the selected model supports them
        if tools and cfg.supports_tools:
            kwargs["tools"] = tools
            kwargs["tool_choice"] = "auto"

        response = self.client.chat.completions.create(**kwargs)
        # content is None when the model replies with tool calls only
        return response.choices[0].message.content or ""

# Usage — business logic doesn't know about specific models
nim = NIMClient(api_key="nvapi-your-key")

# To change the model, just change config_key
result = nim.query(
    task="coding",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
    config_key="coding_simple"  # change to "coding_complex" for more complex tasks
)
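
Taking the idea one step further, the overrides themselves can come from a JSON document, so a model swap needs no code change at all. This is a sketch, not part of the setup above: apply_overrides is a hypothetical helper, and ModelConfig is repeated here in trimmed form only to keep the example self-contained.

```python
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    # Trimmed copy of the ModelConfig above, for a self-contained example
    model_id: str
    temperature: float = 0.1
    max_tokens: int = 1024


def apply_overrides(models: dict, overrides_json: str) -> dict:
    """Return a copy of `models` with per-key field overrides applied.

    `overrides_json` maps a config key to the fields to change, e.g.
    {"summarizer": {"max_tokens": 256}}.
    Unknown config keys raise KeyError; unknown fields raise TypeError.
    """
    overrides = json.loads(overrides_json)
    updated = dict(models)
    for key, fields in overrides.items():
        # dataclasses.replace builds a new instance, leaving the original intact
        updated[key] = replace(models[key], **fields)
    return updated
```

Failing loudly on unknown keys and fields is a deliberate choice here: a typo in a config file should stop the deployment rather than silently fall back to a default model.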

Decision Tree: Which Model to Choose

What is the task?
│
├── CODING
│   ├── Simple snippet / function
│   │   └── deepseek-ai/deepseek-v4-flash  ✓ fast, cheap
│   ├── Multi-file / refactoring
│   │   └── moonshotai/kimi-k2.6           ✓ long context
│   └── Autonomous agent (SWE-Bench class)
│       └── nvidia/nemotron-3-super-120b   ✓ highest throughput
│
├── REASONING / MATH
│   ├── Logic, coding reasoning
│   │   └── deepseek-ai/deepseek-v4-pro    ✓ CoT reasoning
│   └── Scientific tasks, HLE benchmark
│       └── qwen/qwen3.5-397b              ✓ 25.30% HLE
│
├── RAG / LONG CONTEXT
│   ├── Up to 128K tokens
│   │   └── deepseek-ai/deepseek-v4-flash  ✓ fast retrieval
│   ├── Up to 1M tokens
│   │   └── moonshotai/kimi-k2.5           ✓ optimized for long-ctx
│   └── Over 1M tokens
│       └── nvidia/nemotron-3-ultra-500b   ✓ 10M ctx
│
├── MULTI-AGENT ORCHESTRATOR
│   └── nvidia/nemotron-3-super-120b       ✓ 5x throughput, MoE eff.
│
├── MULTILINGUAL
│   ├── CJK (zh/ja/ko)
│   │   └── qwen/qwen3-32b or zhipuai/glm-4.7
│   └── European languages
│       └── deepseek-ai/deepseek-v4
│
├── STRUCTURED OUTPUT / FUNCTION CALLING
│   └── zhipuai/glm-5.1 or nvidia/nemotron-3-super-120b
│
└── SUMMARIZATION (budget-sensitive)
    └── google/gemma-4-31b-it              ✓ cheapest for simple tasks

Gotchas: Model-Specific Behaviors

What the documentation doesn't cover but matters in production:

1. Thinking models require special handling

Kimi K2-Thinking and DeepSeek V4-Pro in reasoning mode are "thinking models." They generate an internal chain-of-thought before responding. According to practical developer experience, switching from a thinking model to a regular one without adjusting parameters can lead to API errors.

# For thinking models — DO NOT pass reasoning_budget with regular models
# For non-thinking models — set NIM_ENABLE_THINKING=false if there's a conflict

# Check before querying
THINKING_MODELS = {
    "moonshotai/kimi-k2-thinking",
    "deepseek-ai/deepseek-v4-pro",  # in reasoning mode
}

def safe_query(model: str, messages: list, **kwargs) -> str:
    if model in THINKING_MODELS:
        kwargs.setdefault("temperature", 0.0)
        # DO NOT add stream=True for thinking models without extra handling
    return client.chat.completions.create(
        model=model, messages=messages, **kwargs
    ).choices[0].message.content

2. GLM/Qwen require specific flags for reasoning tags

# GLM and Qwen 3.5 require --reasoning-format none
# if you don't want <think> tags in the response
# This is handled via the system prompt in the API:
system_no_thinking = (
    "Respond directly without showing your reasoning process. "
    "Do not use <think> tags."
)
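
A prompt instruction is not always obeyed, so a defensive post-processing step can help. The sketch below removes reasoning blocks from the response text. It assumes the reasoning is wrapped in <think>...</think> tags; the tag name and the helper strip_reasoning are assumptions, so check what your model actually emits.

```python
import re

# Matches a complete <think>...</think> block, including multi-line content
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
# Matches an opening tag with no closing tag, through the end of the text
_OPEN_THINK = re.compile(r"<think>.*\Z", re.DOTALL | re.IGNORECASE)


def strip_reasoning(text: str) -> str:
    """Remove <think> blocks and return the remaining answer text."""
    cleaned = _THINK_BLOCK.sub("", text)
    # An unterminated block (e.g. output cut off by max_tokens) is dropped too
    cleaned = _OPEN_THINK.sub("", cleaned)
    return cleaned.strip()
```

If the result is an empty string, the model likely spent its whole token budget on reasoning, which is a hint to raise max_tokens.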

3. Llama 4 uses Pythonic tool format

Llama 4 Scout has a different tool call format (Pythonic syntax) than the standard OpenAI format. If your parser expects a standard JSON tool call, it might break when switching to Llama 4.
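
If the raw Pythonic text reaches your code, it can be converted into the familiar name-plus-arguments shape with the standard ast module, without evaluating anything. This is a sketch under an assumption: that the output looks like [get_weather(city="Kyiv")], a list of calls with keyword arguments. The exact shape NIM returns for Llama 4 is not confirmed here, and parse_pythonic_tool_calls is a hypothetical helper.

```python
import ast


def parse_pythonic_tool_calls(text: str) -> list[dict]:
    """Parse '[f(a=1), g(b="x")]' into [{'name': 'f', 'arguments': {...}}, ...].

    Only keyword arguments with literal values are accepted; anything else
    raises ValueError, so arbitrary code is never executed.
    """
    try:
        tree = ast.parse(text.strip(), mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not a Pythonic tool call: {exc}") from exc

    body = tree.body
    # Accept both a list of calls and a single bare call
    nodes = body.elts if isinstance(body, ast.List) else [body]
    calls = []
    for node in nodes:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            raise ValueError("expected a plain function call")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are supported")
        # literal_eval rejects names, calls, and other non-literal values
        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls
```

The returned dicts can then be fed into the same dispatch code that handles JSON tool calls, which keeps the model-specific difference in one place.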

4. Rate limit of 40 RPM and agentic workflows

With multi-step agents, a single "logical request" can generate 5–10 API calls. At 40 RPM, that leaves roughly 4–8 actual user tasks per minute for a single agent (40 ÷ 10 to 40 ÷ 5).

import time
import functools

def rate_limited(max_per_minute: int = 35):  # leave a buffer from 40
    min_interval = 60.0 / max_per_minute
    last_called = [0.0]

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            wait = min_interval - elapsed
            if wait > 0:
                time.sleep(wait)
            result = func(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limited(max_per_minute=35)
def api_call(messages, model):
    return client.chat.completions.create(
        model=model, messages=messages, max_tokens=1024
    )

5. Summary table of model-specific behaviors

Model / Family	Feature	Action to Take
Kimi K2-Thinking, DeepSeek V4-Pro	Thinking model — internal CoT	temperature=0.0, do not mix with non-thinking configs
GLM-5, Qwen 3.5	Defaults to outputting <think> tags	Add "Do not show thinking" in system prompt
Llama 4 Scout/Maverick	Pythonic tool call format	Separate parser or use Llama 3.3 70B for tool use
Nemotron 3 Super/Ultra	MoE — low throughput on small batches	Optimal with parallel requests, not single-shot
Gemma 4	Requires build b8665+ for local execution	When deploying locally, check the version
All models (free tier)	40 RPM, 1000 credits on signup	Rate limiting + exponential backoff for 429 errors
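
The last row mentions exponential backoff for 429 errors. A minimal sketch of that pattern follows, under stated assumptions: call_with_backoff is a hypothetical helper, the retry count and delays are illustrative, and an error is treated as a rate limit when it carries status_code == 429, which is how the openai client's status errors expose the HTTP code.

```python
import random
import time


def call_with_backoff(func, max_retries: int = 5, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call `func()` and retry with exponential backoff on HTTP 429 errors.

    Any other exception, or a 429 after the last retry, is re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            is_rate_limit = getattr(exc, "status_code", None) == 429
            if not is_rate_limit or attempt == max_retries:
                raise
            # 1s, 2s, 4s, ... plus up to 25% random jitter to avoid retry bursts
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
```

Usage: call_with_backoff(lambda: client.chat.completions.create(model=model, messages=messages)). The sleep parameter exists so the delays can be replaced in tests.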

Conclusion

My own formulation: choosing the right model in NVIDIA NIM is not about finding the "best" model, but about correctly decomposing tasks and assigning a specialized model to each role in the system.

Three key principles I use in practice:

I don't use a heavy model where a light one suffices. Gemma 4 for summarization instead of Nemotron Ultra is not a compromise, but an architecturally sound decision.
I separate model selection from business logic. A centralized ModelConfig allows me to change the model without refactoring the system's core code.
I account for model-specific behaviors from the start. Thinking models, different tool calling formats, tokenizer differences are things that inevitably manifest in production, and it's better to incorporate them into the architecture in advance.