How does an LLM decide whether to call a tool?

The model analyzes the user's query, the system prompt, and the descriptions of the available tools through an internal Chain-of-Thought. If the intent of the query semantically matches a tool's description and the model does not consider itself sufficiently knowledgeable, it generates a tool call. With tool_choice: auto the decision rests entirely with the model. The quality of the description directly affects this decision.

Why doesn't the model call a tool even when it's needed?

Three main reasons: (1) the tool description is vague or does not cover the query scenario; (2) the model considers itself sufficiently knowledgeable and answers from its own knowledge, even if that knowledge is outdated; (3) tool_choice: auto combined with a context that makes the model very confident. Solutions: improve the description, add trigger scenarios, or use tool_choice: required for critical queries.

What is a confidence hallucination in the context of tool use?

The model answers confidently and coherently without calling a tool because its internal parametric knowledge covers the query, but that knowledge is outdated. For example, the model knows contract terms from its training data, but the current version has since changed. Without a search, the answer will be grammatically correct but factually wrong. This is especially dangerous for contracts, prices, and regulations.

How do parallel tool calls affect model behavior?

Modern models can generate several tool calls in a single turn for independent requests. The WildToolBench study (2026) shows that models are prone to self-conditioning: if they have recently used parallel calls, they keep using them even when it is not appropriate. Always return a tool_result for every tool_use_id; otherwise the API returns an error.

How to write a description so the model always calls the tool?

An effective description contains: (1) a clear statement of what the tool does and when; (2) trigger scenarios, an explicit list of query types where the tool is needed; (3) negative examples of when NOT to call it; (4) a freshness criterion: 'use this tool when current/up-to-date information is needed'. The description should read like a contract, not like a function name.

AI_TOOLS 09 April 2026 18 min read 899 view

How LLMs Decide When to Call a Tool: tool_choice, CoT and Hallucination

Updated: 24 June 2026


Dmitro Petrov

A Tech Lead who builds AI/ML systems for production — and writes about how they actually work.


The developer configured tool use, tested with sample requests — everything works. In production, the model suddenly responds without calling a tool, confidently and coherently, but with data from a year ago. No errors in the logs. Just an incorrect answer. Spoiler: the model didn't "break" — it made a rational decision not to search because it considered itself sufficiently knowledgeable. And this is the most dangerous failure mode.

⚡ In short

✅ The tool-call decision is made via internal CoT: the model weighs the intent of the query against the tool description
✅ Description is not documentation, it's a prompt: a poorly written description = the model won't call the tool
✅ The most dangerous failure mode: the model is confident in its answer from its own knowledge, but it's outdated
✅ tool_choice: auto ≠ guarantee of search: the model can decide to respond without retrieval even when it's needed
✅ self-conditioning: if the model has recently made parallel calls — it tends to repeat them
🎯 You will get: specific templates for writing descriptions and strategies for controlling the model's decision
👇 Below are mechanics, code examples, and practical patterns

📚 Article Contents

📌 Three tool_choice modes: auto, required, none
📌 How the tool description affects the model's decision
📌 Chain-of-Thought inside: how the model analyzes context
📌 Where the decision breaks: confidence → lack of search → hallucination
📌 Parallel calls: when the model calls multiple tools simultaneously
💼 Practice: how to write descriptions so the model calls the tool when needed
❓ Frequently Asked Questions (FAQ)
✅ Conclusions

Three tool_choice modes: auto, required, none — and when to use each

The tool_choice parameter is not just an API setting. It's an architectural decision about who controls the flow: the model or your code. The choice of mode determines where problems will arise and where to look for them.

auto: the model as judge

The default mode when tools are provided. The model decides independently: respond with text from its own knowledge or call a tool. This is the most flexible mode — and the most unpredictable.

# auto — the model decides itself
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "auto"},   # can be omitted — it's the default
    messages=[{"role": "user", "content": query}]
)

# Check what the model decided
if response.stop_reason == "tool_use":
    # the model called a tool
    tool_block = next(b for b in response.content if b.type == "tool_use")
    print(f"Tool: {tool_block.name}, Args: {tool_block.input}")
elif response.stop_reason == "end_turn":
    # the model responded without searching — this could be normal or a problem
    text = next(b for b in response.content if b.type == "text")
    print(f"Direct answer (no tool): {text.text[:100]}")

Logging stop_reason is mandatory practice in production. It's the only way to understand if the model searched for information or responded from its own knowledge.
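
A minimal sketch of such logging, assuming the response exposes stop_reason and a list of content blocks with a type field, as in the example above. The describe_decision helper and the logger name are illustrative, and SimpleNamespace stands in for a real API response so the snippet runs without a network call.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("tool_use")  # logger name is an assumption


def describe_decision(response):
    """Build and log a small record of what the model decided on this turn."""
    tool_names = [b.name for b in response.content if b.type == "tool_use"]
    record = {
        "stop_reason": response.stop_reason,
        "used_tool": bool(tool_names),
        "tool_names": tool_names,
    }
    logger.info("tool decision: %s", record)
    return record


# Stand-in for a response where the model answered directly
fake_response = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="The plan costs ...")],
)
print(describe_decision(fake_response))
```

Storing the record next to the original query makes it possible to later filter for direct answers on topics that should have triggered a search.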

required / any: forced call

The model is obligated to call at least one tool regardless of the query. OpenAI calls this required, Anthropic calls it any.

# Anthropic: forced call of any tool
tool_choice={"type": "any"}

# Anthropic: forced call of a specific tool
tool_choice={"type": "tool", "name": "search_documents"}

# OpenAI-compatible syntax
tool_choice="required"

When justified: structured output where the answer must always go through a tool, mandatory logging of every request to an external system, a deterministic pipeline where retrieval is a mandatory step.

When harmful: conversational mode where the user might say "hello" or "thank you". The model will still form a tool call, and your code will spend tokens and time on an unnecessary retrieval.

Critical limitation of Anthropic (as of 2025): any and forced call of a specific tool are incompatible with extended thinking. When thinking is enabled, only auto and none are available. Attempting to use any with thinking returns an HTTP 400. Current status — docs.anthropic.com.
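
One way to surface this conflict before the request leaves your code is a small local guard. This is a sketch under the limitation described above; check_tool_choice is a hypothetical helper, not part of any SDK, and it only inspects the tool_choice dictionary you are about to send.

```python
def check_tool_choice(tool_choice, thinking_enabled):
    """Return tool_choice unchanged, or raise if it conflicts with extended thinking."""
    forced = tool_choice.get("type") in {"any", "tool"}
    if thinking_enabled and forced:
        raise ValueError(
            f"tool_choice {tool_choice['type']!r} cannot be combined with "
            "extended thinking; use 'auto' or 'none', or disable thinking"
        )
    return tool_choice


# auto passes the check even with thinking enabled
print(check_tool_choice({"type": "auto"}, thinking_enabled=True))
```

Failing locally with a readable message is usually easier to debug than an HTTP 400 coming back from the API.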

none: disabling calls

The model does not call any tools, it only generates text. Useful for the final step after receiving all results — when you need to synthesize an answer from the already collected context without additional requests.

# Final synthesis response after data collection
final_response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    tools=tools,                      # tools are still passed (API requirement)
    tool_choice={"type": "none"},     # but calling them is forbidden
    messages=[
        *conversation_history,
        {"role": "user", "content": "Now generate the final report based on the collected data."}
    ]
)

Important nuance: if there are tool_result blocks in messages, tools must still be passed in the request — otherwise the API will return a validation error.
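
A small check along these lines can decide whether the tools parameter has to be included. It is a sketch that assumes the history is stored as plain dictionaries; history_has_tool_blocks is an illustrative name.

```python
def history_has_tool_blocks(messages):
    """True if any message contains a tool_use or tool_result block."""
    for message in messages:
        content = message["content"]
        if not isinstance(content, list):
            continue  # plain string content cannot hold tool blocks
        for block in content:
            if isinstance(block, dict) and block.get("type") in ("tool_use", "tool_result"):
                return True
    return False


history = [
    {"role": "user", "content": "Compare contracts A and B"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_01A", "name": "search_knowledge_base",
         "input": {"query": "contract A"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01A", "content": "..."},
    ]},
]
print(history_has_tool_blocks(history))
```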

⚠️ Pitfall #1: illusion of control through tool_choice

tool_choice: auto does not mean the model will search when needed. It means the model will search when it deems it necessary. If your task requires up-to-date data on every request, auto is insufficient. Either use required, or design the description so that the model always considers search necessary for your type of queries.

How the tool description affects the model's decision: bad description = model won't call tool

Description is not documentation for the developer. It's part of the prompt that the model sees. It's through it that the model "understands" what the tool does and when it's worth calling.

APXML (2025) describes it this way: the model compares the intent of the query with the descriptions of available tools, like a person comparing a task with the tools available in a toolbox. If a hammer is labeled "object for physical impact" — a person might not pick it up for hammering a nail.

OpenAI official documentation recommends: write clear and detailed function names, parameter descriptions, and instructions. Clearly describe the purpose of the function and each parameter, and use a system prompt to explain when (and when not) to use each function.

Comparison: bad vs good description — a practical case

One of the clients had a problem: the model followed instructions poorly and called search inconsistently. Some requests about prices and contract terms went through without retrieval: the model answered confidently from its own knowledge, but that knowledge was outdated. When we investigated, the cause turned out to be a minimal tool description. After we rewrote the description with trigger scenarios, the model's behavior stabilized.

# ❌ WAS: minimal description — model often didn't call tool
{
    "name": "search_docs",
    "description": "Searches documents",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"}
        },
        "required": ["query"]
    }
}

# ✅ BECAME: detailed description with triggers — model understands when and why
{
    "name": "search_knowledge_base",
    "description": """Searches for up-to-date information in the corporate knowledge base.

    USE this tool when:
    - The query concerns contract terms, prices, regulations, or internal procedures
    - Up-to-date information is needed that might have changed since your training
    - Specifics of a particular client, product, or project are requested

    DO NOT use for:
    - General questions about technology or generally known facts
    - Mathematical calculations
    - Formatting or editing text""",

    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query. Formulate as a question or keywords."
            },
            "top_k": {
                "type": "integer",
                "description": "Number of snippets to return. Defaults to 5. Increase to 10 for complex comparative queries."
            }
        },
        "required": ["query"]
    }
}

The key difference: a good description contains trigger scenarios — an explicit list of situations where the tool is needed. The model uses this list as a criterion in its internal decision.

Important if you use Ollama with a local model: weaker models (7B–13B) may follow even detailed descriptions inconsistently; this is a limitation of the model, not the description. In our production stack, we use AskYourDocs via OpenRouter with deepseek/deepseek-chat as the default model (${SPRING_AI_OPENAI_CHAT_MODEL:deepseek/deepseek-chat}), and at this level we see no issues with the model following descriptions.

For Ollama in dev/staging environments, we recommend additionally duplicating trigger instructions in the system prompt; this increases stability on weaker models. More details on configuring Ollama in a production setup: RAG with Ollama: How to Teach AI to Respond Based on Your Documents.

Impact of description on choosing between multiple tools

When there are multiple tools with similar functionality in the system, the quality of the description becomes even more critical. APXML emphasizes: vague or overlapping descriptions cause the model to either choose the wrong tool, or choose none at all.

# Problem: two similar tools with unclear descriptions
tools = [
    {"name": "search_contracts", "description": "Searches contracts"},
    {"name": "search_documents", "description": "Searches documents"}
]
# The model doesn't know the difference — the choice is unpredictable

# Solution: clear delineation of responsibilities
tools = [
    {
        "name": "search_contracts",
        "description": "Searches ONLY signed legal contracts and their addendums. "
                       "Use for queries about terms, deadlines, party obligations."
    },
    {
        "name": "search_documents",
        "description": "Searches internal regulations, instructions, technical documentation, and reports. "
                       "Does NOT contain legal contracts."
    }
]
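
A rough way to spot overlapping descriptions before the model does is to compare their vocabularies. The sketch below uses Jaccard similarity of word sets, which is a crude lexical proxy and not how the model itself compares tools; description_overlap and word_set are illustrative helpers.

```python
import re
from itertools import combinations


def word_set(text):
    """Lowercased set of alphabetic words in a description."""
    return set(re.findall(r"[a-z]+", text.lower()))


def description_overlap(tools):
    """Return (name_a, name_b, jaccard) for every pair of tools, highest overlap first."""
    pairs = []
    for a, b in combinations(tools, 2):
        wa, wb = word_set(a["description"]), word_set(b["description"])
        union = wa | wb
        score = len(wa & wb) / len(union) if union else 0.0
        pairs.append((a["name"], b["name"], round(score, 2)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


vague_tools = [
    {"name": "search_contracts", "description": "Searches contracts"},
    {"name": "search_documents", "description": "Searches documents"},
]
print(description_overlap(vague_tools))
```

Pairs at the top of the list are candidates for a manual review: either delineate the descriptions more clearly or merge the tools.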

⚠️ Pitfall #2: more tools = more selection errors

Laurent Kubaski (2025) notes: most providers publish the maximum number of tools technically supported by the model, but don't mention that in practice, as the number of tools increases, the probability of incorrect selection also increases. After 10-15 tools, the quality of selection noticeably decreases. After 50+ — Tool RAG is needed (details in TU-6).

Chain-of-Thought inside: how the model analyzes context and makes decisions

The decision to call a tool is not the result of simple pattern matching. Internally, a process similar to Chain-of-Thought reasoning occurs, where the model sequentially weighs several factors before returning a response.

Raina (2025) describes this through the lens of neural layers. Input embedding layers convert the query and tool descriptions into numerical vectors; how text becomes a vector and why semantically similar queries land in the same region of the space is covered in Embeddings in Simple Terms: How AI Understands Meaning, Not Just Words. Middle layers perform abstract reasoning and the tool selection logic; how vector search finds the most relevant tool or snippet is covered in Vector Search for Beginners: How RAG Finds the Necessary Information. Output layers generate the final decision: text or a tool call.

Internal decision process (simplified model)

# What happens inside the model with tool_choice: auto (conceptually):

# 1. Analyze query intent
intent = analyze_query(user_message)
# → "query about early termination contract terms"

# 2. Assess own knowledge
confidence_in_own_knowledge = estimate_confidence(intent)
# → "has general knowledge about contracts, but not about this specific client's contract"

# 3. Compare with available tools
tool_relevance = match_intent_to_tools(intent, tool_descriptions)
# → search_knowledge_base: high relevance (trigger scenario: "contract terms")

# 4. Decision
if tool_relevance > threshold and confidence_in_own_knowledge < threshold:
    return tool_call(name="search_knowledge_base", args={...})
else:
    return text_response(...)

Key point: the model doesn't just check if a relevant tool exists. It also assesses if its own knowledge is sufficient. If the model considers itself knowledgeable — it might choose a text response even if a relevant tool is available.

How the model learns to decide when to search

The ability for correct decision-making is the result of fine-tuning on synthetic examples. Simplicity is SOTA (2025) describes the approach: providers generate thousands of examples of query → CoT reasoning trace → tool call, where the model learns not just to call a function, but to justify why. An example of a reasoning trace looks like this:

# Internal CoT reasoning trace (how it's formed during training):
"""
Query: "What are the terms for early termination of our contract?"

Analysis: The query concerns a specific contract ("our") —
this means specific information is needed that is not in my general knowledge.
The available tool search_knowledge_base describes searching the corporate base
with the trigger scenario "contract terms". This is an exact match.
Confidence in own knowledge: low (specificity of a particular contract).
Decision: call search_knowledge_base.
"""
→ tool_call("search_knowledge_base", {"query": "terms for early termination of contract"})

This is why reasoning-enabled model variants (Claude with extended thinking, OpenAI's o-series) show better results in complex tool use scenarios — WildToolBench (2026) confirms: reasoning-enabled models consistently outperform non-reasoning variants in tasks requiring correct orchestration of sequential tool calls.

Impact of system prompt on decision

The system prompt is a powerful lever for controlling the model's decision. If it explicitly states when to search, the model is far more likely to follow that instruction even with tool_choice: auto.
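
A minimal sketch of such an instruction, passed through the system parameter of the Messages API. The prompt wording and the build_request helper are illustrative assumptions, not a production prompt; the tool name search_knowledge_base is the one used elsewhere in this article.

```python
SYSTEM_PROMPT = (
    "You are an assistant for a corporate knowledge base.\n"
    "Before answering any question about prices, contract terms, regulations, "
    "or internal procedures, call search_knowledge_base, even if you believe "
    "you already know the answer.\n"
    "Answer greetings and general-knowledge questions directly, without searching."
)


def build_request(query, tools, model="claude-opus-4-6"):
    """Assemble keyword arguments for client.messages.create()."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": SYSTEM_PROMPT,          # explicit rule on when to search
        "tools": tools,
        "tool_choice": {"type": "auto"},  # the model still makes the final call
        "messages": [{"role": "user", "content": query}],
    }


request = build_request("What is the price of the Enterprise plan?", tools=[])
print(request["system"])
```

With a real client the request would be sent as client.messages.create(**request).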

The combination of a high-quality description + an explicit instruction in the system prompt significantly increases the reliability of the model's decisions compared to each approach separately.

Where the decision breaks: the model is confident, but the answer is outdated → doesn't search → hallucinates

This is the most dangerous failure mode in systems with tool use. Not a code error, not an empty result, but a confident, coherent, grammatically correct answer that is outdated or incorrect.

The mechanics of "silent" hallucination

# Scenario: the model has parametric knowledge about the product,
# but prices changed 3 months ago

user: "What is the price of the Enterprise plan?"

# Internal process:
# - the model sees the tool search_pricing
# - it assesses its own knowledge: "I know about the Enterprise plan, it was $500/month"
# - confidence: high → it decides to answer without searching

assistant: "The Enterprise plan costs $500 per month and includes..."
# stop_reason: "end_turn"  ← no tool_use

# Real price: $650/month after a price increase 3 months ago
# Result: the client received an incorrect price, nothing in the logs signals a problem

OpenAI (2025) explains why this happens: standard training rewards confident answers, not the acknowledgment of uncertainty. The model is trained to answer, not to refrain. Therefore, even frontier models tend to answer confidently where they should ask for fresh data.

A Survey of Hallucinations in LLMs (2026) highlights a specific type: temporal misalignment — the model generates an answer that was correct at the time of training but is outdated at the time of the query. This is particularly critical for prices, regulations, contract terms, and personnel changes.

Types of situations where this is dangerous

Data Type	Frequency of Changes	Risk	Recommendation
Prices, tariffs	High	🔴 Critical	tool_choice: required or explicit instruction in the prompt
Contract terms	Medium	🔴 Critical	tool_choice: required for all contract-related queries
Internal regulations	Medium	🟠 High	Trigger scenarios in description + system prompt
Personnel data	High	🟠 High	Search is mandatory, never answer from memory
Technical documentation	Low	🟡 Medium	auto + quality description is usually sufficient
General knowledge	Very low	🟢 Low	Answering without search is acceptable
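
The table can be encoded as a lookup so that the policy lives in one place. This is a sketch: the category keys are hypothetical labels, and how a query gets assigned to a category (keywords, a classifier, routing by endpoint) is outside its scope.

```python
# Category labels are assumptions; values follow the recommendations in the table
TOOL_CHOICE_BY_DATA_TYPE = {
    "prices": {"type": "any"},           # critical: always search
    "contracts": {"type": "any"},        # critical: always search
    "personnel": {"type": "any"},        # never answer from memory
    "regulations": {"type": "auto"},     # rely on description + system prompt
    "technical_docs": {"type": "auto"},
    "general": {"type": "auto"},
}


def tool_choice_for(data_type):
    """Pick a tool_choice value for a query category."""
    # Unknown categories fall back to a forced call: an extra search costs
    # less than a confident answer from stale knowledge.
    return TOOL_CHOICE_BY_DATA_TYPE.get(data_type, {"type": "any"})


print(tool_choice_for("prices"))
print(tool_choice_for("general"))
```

Defaulting unknown categories to a forced call is a design choice of this sketch; a chat-heavy product may prefer auto as the fallback.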

How to detect the problem before it becomes an incident

import logging

import anthropic

logger = logging.getLogger(__name__)

def safe_query(client, query, tools, require_search_keywords=None):
    """
    Query with control over whether the model used search.
    require_search_keywords: a list of words that require mandatory search
    """
    request = dict(
        model="claude-opus-4-6",
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": query}],
    )
    response = client.messages.create(**request)

    used_tool = response.stop_reason == "tool_use"

    # Check if the query required a search
    if require_search_keywords:
        needs_search = any(kw in query.lower() for kw in require_search_keywords)
        if needs_search and not used_tool:
            # Log a suspicious direct answer
            logger.warning("Query requires search but model answered directly: %s", query[:100])
            # Retry and force a tool call (Anthropic: "any" = at least one tool)
            return client.messages.create(**request, tool_choice={"type": "any"})

    return response

# Usage
response = safe_query(
    client, query, tools,
    require_search_keywords=["price", "cost", "contract", "terms", "regulation"]
)

⚠️ Pitfall #3: confident tone = suspicion, not trust

The more confidently the model answers a query where up-to-date information is expected — the more reason there is to check if it searched at all. An uncertain answer with "I'm not sure" is often more reliable than a confident answer without searching. In production: always log the stop_reason. An answer with end_turn without a prior tool_use for a "sensitive" query is grounds for an audit.

Parallel tool calls: when the model calls multiple tools simultaneously

Modern models support parallel tool calls — multiple calls in a single turn for independent requests. This is a powerful capability that requires proper handling.

When the model generates parallel calls

The model decides to make parallel calls when:

The query explicitly compares multiple entities ("compare contracts A and B")
Data from multiple independent sources is needed simultaneously
Sub-queries are independent of each other and can be executed in parallel

# Query: "Compare the terms of contracts with clients Alpha and Beta"
# The model returns two parallel calls in one response:

{
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01A",
      "name": "search_knowledge_base",
      "input": {"query": "contract terms client Alpha", "top_k": 5}
    },
    {
      "type": "tool_use",
      "id": "toolu_01B",
      "name": "search_knowledge_base",
      "input": {"query": "contract terms client Beta", "top_k": 5}
    }
  ],
  "stop_reason": "tool_use"
}

Correct handling of parallel calls

import anthropic
import json
from concurrent.futures import ThreadPoolExecutor

def handle_parallel_tool_calls(response, tools_map):
    """
    Handles parallel tool calls and returns results for all of them.
    tools_map: dict {tool_name: callable}
    """
    tool_blocks = [b for b in response.content if b.type == "tool_use"]

    # Execute all calls — can be done in parallel if independent
    def execute_tool(block):
        fn = tools_map.get(block.name)
        if not fn:
            # Unknown tool: still return a well-formed tool_result for this id
            return {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": f"Tool {block.name} not found",
                "is_error": True
            }
        try:
            result = fn(**block.input)
            return {
                "type": "tool_result",
                "tool_use_id": block.id,   # ← critical: id from the tool_use block
                "content": json.dumps(result, ensure_ascii=False)
            }
        except Exception as e:
            return {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(e),
                "is_error": True            # ← explicitly mark as an error
            }

    with ThreadPoolExecutor() as executor:
        results = list(executor.map(execute_tool, tool_blocks))

    return results

# Second request with results of ALL parallel calls
tool_results = handle_parallel_tool_calls(response, {
    "search_knowledge_base": search_knowledge_base
})

follow_up = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    tools=tools,
    messages=[
        {"role": "user", "content": original_query},
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": tool_results}  # ← all results together
    ]
)

Self-conditioning: a hidden problem of long sessions

WildToolBench (2026) revealed an important effect: models exhibit self-conditioning — if the model has recently used parallel calls in the current session, it tends to repeat them even when it's not appropriate. The reason: the large context of previous messages "overwhelms" the model's attention to the current query.

Practical implication: in long agent sessions, the model's behavior can drift. If you see the model making unnecessary parallel calls in later session messages — it's not a coincidence, but accumulated context bias. Solution: periodically summarize and compress conversation history, or break long agent sessions into independent ones.
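
A sketch of the compression step, assuming the history is a list of plain dictionaries and that summarize is a callable you supply (for example, a separate cheap model call). compact_history is an illustrative helper; it only cuts at a plain-text user message so a tool_use block is never separated from its tool_result.

```python
def compact_history(messages, keep_last, summarize):
    """Fold older turns into a summary and keep roughly the last keep_last messages."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    if len(messages) <= keep_last:
        return list(messages)
    cut = len(messages) - keep_last
    # Walk back to a plain-text user message: a safe boundary between turns
    while cut > 0 and not (
        messages[cut]["role"] == "user" and isinstance(messages[cut]["content"], str)
    ):
        cut -= 1
    if cut == 0:
        return list(messages)  # no safe cut point, leave the history untouched
    summary = summarize(messages[:cut])
    first = {
        "role": "user",
        "content": f"[Summary of earlier conversation]\n{summary}\n\n{messages[cut]['content']}",
    }
    return [first] + list(messages[cut + 1:])


history = [
    {"role": "user", "content": "a"},
    {"role": "assistant", "content": "b"},
    {"role": "user", "content": "c"},
    {"role": "assistant", "content": "d"},
    {"role": "user", "content": "e"},
]
compacted = compact_history(history, keep_last=2, summarize=lambda ms: f"{len(ms)} earlier messages")
print(len(compacted))
```

The summary is merged into the first kept user message rather than added as a separate turn, so user and assistant roles keep alternating.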

⚠️ Pitfall #4: unclosed parallel calls

If the model's response contains two tool call blocks but you return a tool_result for only one, the next API request will return a validation error. Each tool_use_id must have a corresponding tool_result. If one of the calls fails, pass is_error: true, but do not skip the response for that id.
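
A pre-flight check can catch this before the request is sent. The sketch assumes both the assistant content and the results are plain dictionaries shaped like the JSON shown earlier; missing_tool_results is an illustrative helper.

```python
def missing_tool_results(assistant_content, tool_results):
    """Return tool_use ids from the assistant turn that have no matching tool_result."""
    expected = [b["id"] for b in assistant_content if b.get("type") == "tool_use"]
    answered = {
        r.get("tool_use_id") for r in tool_results if r.get("type") == "tool_result"
    }
    return [tool_id for tool_id in expected if tool_id not in answered]


assistant_content = [
    {"type": "tool_use", "id": "toolu_01A", "name": "search_knowledge_base", "input": {}},
    {"type": "tool_use", "id": "toolu_01B", "name": "search_knowledge_base", "input": {}},
]
tool_results = [
    {"type": "tool_result", "tool_use_id": "toolu_01A", "content": "..."},
]
print(missing_tool_results(assistant_content, tool_results))
```

An empty list means every call has been answered; anything else should block the follow-up request or be filled with an is_error result.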

Practice: how to write descriptions so the model calls tools when needed

Statsig (2025) puts it clearly: tool documentation should read like a contract — a purpose statement, a few specific examples, and argument types that leave no room for guesswork.

Effective Description Template

TEMPLATE:
"""[What the tool does — one sentence, specifically]

USE WHEN:
- [triggering scenario 1]
- [triggering scenario 2]
- [relevance criterion: "when fresh/current information is needed"]

DO NOT use for:
- [anti-use-case 1]
- [anti-use-case 2]

Examples of queries that should trigger this tool:
- "[specific example 1]"
- "[specific example 2]"
"""

Real-world example for AskYourDocs

tools = [
    {
        "name": "search_knowledge_base",
        "description": """Searches for current information in the company's corporate knowledge base.
The base contains: contracts, addendums, specifications, price lists, regulations, internal instructions.

USE WHEN:
- Query about prices, tariffs, costs of any services or products
- Query about contract terms, obligations, deadlines, penalties
- Query about internal procedures, regulations, instructions
- Specifics of a particular client, project, or product are needed
- Query contains words: "our", "your", "current", "actual", "latest"

DO NOT use for:
- General questions about technologies (what is PDF, how does API work)
- Mathematical calculations
- Queries about generally known facts independent of the company

Examples of queries that SHOULD trigger this tool:
- "What is the price for the Enterprise plan?"
- "What are the terms for early termination?"
- "Who is responsible for onboarding new clients?"
- "What is the deadline for the contract with Alfa Corp?"
""",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Natural language search query. "
                                   "Include context: client names, products, document types."
                },
                "top_k": {
                    "type": "integer",
                    "description": "Number of fragments (1-10). Default: 5. "
                                   "Use 8-10 for comparative queries."
                }
            },
            "required": ["query"]
        }
    }
]

Testing description quality

Writing a good description is not enough — you need to verify that the model actually calls the tool for the right queries and doesn't call it for unnecessary ones.

def test_tool_description(client, tools, test_cases):
    """
    Tests whether the model calls the tool for the correct queries.

    test_cases: list of dictionaries:
    [
        {"query": "What is the price?", "should_call_tool": True},
        {"query": "What is RAG?", "should_call_tool": False},
    ]
    """
    results = []
    for case in test_cases:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=256,
            tools=tools,
            messages=[{"role": "user", "content": case["query"]}]
        )
        called_tool = response.stop_reason == "tool_use"
        passed = called_tool == case["should_call_tool"]
        results.append({
            "query": case["query"],
            "expected": case["should_call_tool"],
            "actual": called_tool,
            "passed": passed
        })
        if not passed:
            print(f"❌ FAIL: '{case['query']}' — "
                  f"expected tool={'yes' if case['should_call_tool'] else 'no'}, "
                  f"got {'yes' if called_tool else 'no'}")

    passed_count = sum(1 for r in results if r["passed"])
    print(f"\n{passed_count}/{len(results)} test cases passed")
    return results

# Run after each description change
test_tool_description(client, tools, [
    {"query": "What is the price for the Enterprise plan?", "should_call_tool": True},
    {"query": "What are the terms for contract termination?", "should_call_tool": True},
    {"query": "Who handles onboarding?", "should_call_tool": True},
    {"query": "What is a vector database?", "should_call_tool": False},
    {"query": "Hello, how are you?", "should_call_tool": False},
    {"query": "What is 15% of 1000?", "should_call_tool": False},
])
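
The two kinds of failure are not equally costly: a missed call risks a confident answer from stale knowledge, while an unnecessary call only wastes tokens and time. A small sketch that splits the results accordingly, assuming the result dictionaries produced by test_tool_description above; summarize_failures is an illustrative helper.

```python
def summarize_failures(results):
    """Split failed test cases into missed calls and unnecessary calls."""
    missed = [r["query"] for r in results if r["expected"] and not r["actual"]]
    unnecessary = [r["query"] for r in results if not r["expected"] and r["actual"]]
    return {"missed_calls": missed, "unnecessary_calls": unnecessary}


sample_results = [
    {"query": "What is the price?", "expected": True, "actual": False, "passed": False},
    {"query": "Hello, how are you?", "expected": False, "actual": True, "passed": False},
    {"query": "What is RAG?", "expected": False, "actual": False, "passed": True},
]
print(summarize_failures(sample_results))
```

Treating any missed call as a blocking failure and unnecessary calls as a warning is one reasonable policy for sensitive domains.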

Checklist for tool description

☑ The first sentence answers "what does this tool do?" specifically and without generalities
☑ There is an explicit list of triggering scenarios (USE WHEN)
☑ There is an explicit list of anti-use-cases (DO NOT use for)
☑ There is a relevance criterion ("when fresh/current information is needed")
☑ If there are multiple tools, the descriptions do not overlap and are clearly delineated
☑ The description has been tested on a set of queries (both positive and negative cases)
☑ The system prompt reinforces the description with explicit instructions on when to search
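
Some checklist items can be verified mechanically. The sketch below covers only the ones that reduce to string checks; the lint_description helper and the 15-word minimum are arbitrary assumptions, and overlap between tools or actual model behavior still needs the tests above.

```python
def lint_description(tool, min_words=15):
    """Return a list of checklist problems found in a tool's description."""
    description = tool.get("description", "")
    upper = description.upper()
    problems = []
    if len(description.split()) < min_words:
        problems.append(f"description is shorter than {min_words} words")
    if "USE WHEN" not in upper and "USE THIS TOOL WHEN" not in upper:
        problems.append("no explicit trigger scenarios (USE WHEN)")
    if "DO NOT USE" not in upper:
        problems.append("no anti-use-cases (DO NOT use for)")
    return problems


minimal_tool = {"name": "search_docs", "description": "Searches documents"}
for problem in lint_description(minimal_tool):
    print(problem)
```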

❓ Frequently Asked Questions

How does an LLM decide whether to call a tool?

Through internal CoT: the model weighs the intent of the query against the tool's description and assesses whether its own knowledge is sufficient. If the description contains a triggering scenario that matches the query — and the model doesn't consider itself sufficiently knowledgeable — it generates a tool call.

Why doesn't the model call the tool even when it's needed?

Three reasons: (1) unclear or missing triggering scenario in the description; (2) the model considers itself sufficiently knowledgeable from its own parametric knowledge; (3) tool_choice: auto by its nature does not guarantee a call. Solution: improve the description, add explicit instructions in the system prompt, or use required for critical query types.

What is a hallucination from confidence?

The model responds without calling a tool — confidently and coherently — because its parametric knowledge covers the query, but this knowledge is outdated. Especially dangerous for prices, contract terms, regulations. Diagnosis: stop_reason == "end_turn" without a preceding tool_use on a "sensitive" query.

How to correctly handle parallel tool calls?

Execute the function for each tool_use block and pass tool_result for each tool_use_id in the next request. Omitting even one ID will result in an API validation error. If a call fails, pass is_error: true, but do not omit the response.

✅ Conclusions

tool_choice: auto is not a guarantee of search. It's the model's prerogative to decide. Log stop_reason.
Description is part of the prompt, not documentation. Triggering scenarios and anti-use-cases are critical.
The most dangerous failure mode is a confident answer drawn from outdated parametric knowledge, without a search.
Reasoning-enabled models are better at complex tool use scenarios — WildToolBench (2026) confirms this.
Parallel calls require returning a result for each tool_use_id.
Self-conditioning in long sessions can distort the model's decisions — watch out for context drift.
Test descriptions on a set of positive and negative examples after each change.

Next step: how the model evaluates the quality of results returned by a tool — and why even a correct call can lead to an incorrect answer — in Grounding and trust in sources.

Sources

WildToolBench — Benchmarking LLM Tool-Use in the Wild (2026) · Simplicity is SOTA — How LLMs are trained for function calling (2025) · OpenAI — Why language models hallucinate (2025) · Raina — Inside the Black Box: LLM Neural Layers and Tool Calling (2025) · APXML — Agent Tool Selection Logic · OpenAI Docs — Function Calling Best Practices · Kubaski — Tool Calling Best Practices (2025) · Statsig — Tool calling optimization (2025) · Survey — Large Language Models Hallucination (2026) · Anthropic Docs — How to implement tool use
