A developer configures tool use, tests it with sample requests, and everything works. In production, the model suddenly answers without calling a tool: confidently, coherently, and with data that is a year old. The logs show no errors, just a wrong answer. Spoiler: the model didn't "break". It made a rational decision not to search because it judged its own knowledge sufficient. That is the most dangerous failure mode.
⚡ In short
- ✅ The decision to call a tool comes from internal CoT-style reasoning: the model weighs the intent of the query against the tool description
- ✅ A description is not documentation, it's a prompt: with a poorly written description, the model won't call the tool
- ✅ The most dangerous failure mode: the model is confident in an answer drawn from its own knowledge, but that knowledge is outdated
- ✅ tool_choice: auto ≠ guarantee of search: the model can decide to respond without retrieval even when retrieval is needed
- ✅ Self-conditioning: if the model has recently made parallel calls, it tends to repeat them
- 🎯 You will get: specific templates for writing descriptions and strategies for controlling the model's decision
- 👇 Below are mechanics, code examples, and practical patterns
📚 Article Contents
- 📌 Three tool_choice modes: auto, required, none
- 📌 How the tool description affects the model's decision
- 📌 Chain-of-Thought inside: how the model analyzes context
- 📌 Where the decision breaks: confidence → lack of search → hallucination
- 📌 Parallel calls: when the model calls multiple tools simultaneously
- 💼 Practice: how to write descriptions so the model calls the tool when needed
- ❓ Frequently Asked Questions (FAQ)
- ✅ Conclusions
Three tool_choice modes: auto, required, none — and when to use each
The tool_choice parameter is not just an API setting.
It's an architectural decision about who controls the flow: the model or your code.
The choice of mode determines where problems will arise and where to look for them.
auto: the model as judge
The default mode when tools are provided. The model decides independently: respond with text from its own knowledge or call a tool. This is the most flexible mode — and the most unpredictable.
# auto: the model decides for itself
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# `tools` and `query` are assumed to be defined elsewhere
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "auto"},  # can be omitted, it's the default
    messages=[{"role": "user", "content": query}],
)

# Check what the model decided
if response.stop_reason == "tool_use":
    # the model called a tool
    tool_block = next(b for b in response.content if b.type == "tool_use")
    print(f"Tool: {tool_block.name}, Args: {tool_block.input}")
elif response.stop_reason == "end_turn":
    # the model responded without searching; this could be normal or a problem
    text = next(b for b in response.content if b.type == "text")
    print(f"Direct answer (no tool): {text.text[:100]}")
Logging stop_reason should be standard practice in production.
It's the most direct way to tell whether the model searched for information or answered from its own knowledge.
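One way to make this logging systematic is a small helper that reduces each response to a log-friendly record. The sketch below is a minimal illustration, not part of any SDK: the function name `summarize_decision` and the logger name are assumptions, and a `SimpleNamespace` stands in for a real API response so the example runs offline.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("tool_decisions")  # hypothetical logger name


def summarize_decision(response) -> dict:
    """Return a compact record of whether the model called a tool."""
    tool_names = [b.name for b in response.content if b.type == "tool_use"]
    record = {
        "stop_reason": response.stop_reason,
        "used_tool": response.stop_reason == "tool_use",
        "tool_names": tool_names,
    }
    logger.info("tool decision: %s", record)
    return record


# Stand-in for an API response (assumption: same attribute names as the
# response object used in the example above).
fake_response = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="placeholder answer")],
)
print(summarize_decision(fake_response))
```

Aggregating these records per query type shows how often the model skips search, which is the number you need before deciding whether auto is good enough.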
required / any: forced call
The model is obligated to call at least one tool regardless of the query.
OpenAI calls this required, Anthropic calls it any.
# Anthropic: forced call of any tool
tool_choice={"type": "any"}
# Anthropic: forced call of a specific tool
tool_choice={"type": "tool", "name": "search_documents"}
# OpenAI-compatible syntax
tool_choice="required"
When justified: structured output where the answer must always go through a tool, mandatory logging of every request to an external system, or a deterministic pipeline where retrieval is a mandatory step.
When harmful: conversational mode where the user might say "hello" or "thank you". The model will still produce a tool call, and your code will spend tokens and time on unnecessary retrieval.
Critical Anthropic limitation (as of 2025): any and the forced call of a specific tool are incompatible with extended thinking. When thinking is enabled, only auto and none are available. Attempting to use any with thinking returns an HTTP 400 error. For the current status, see docs.anthropic.com.
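If the same code path serves requests both with and without extended thinking, the restriction can be encoded once instead of being rediscovered through 400 errors. The helper below is a sketch under that assumption; `pick_tool_choice` and its parameters are hypothetical names, not part of the Anthropic SDK.

```python
def pick_tool_choice(must_search: bool, thinking_enabled: bool) -> dict:
    """Choose a tool_choice value that respects the extended-thinking restriction."""
    if must_search and not thinking_enabled:
        # Forcing a call is only possible when thinking is off.
        return {"type": "any"}
    # With thinking enabled, forcing is unavailable, so the description and
    # system prompt have to steer the model toward searching.
    return {"type": "auto"}


print(pick_tool_choice(must_search=True, thinking_enabled=False))
print(pick_tool_choice(must_search=True, thinking_enabled=True))
```

The second case is the one to watch: the request succeeds, but the guarantee of a search is gone, so it is worth logging when this fallback happens.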
none: disabling calls
The model does not call any tools; it only generates text. This is useful for the final step after all results are in, when you need to synthesize an answer from the already collected context without additional requests.
# Final synthesis response after data collection
final_response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=2048,
    tools=tools,  # tools are still passed (API requirement)
    tool_choice={"type": "none"},  # but calling them is forbidden
    messages=[
        *conversation_history,
        {"role": "user", "content": "Now generate the final report based on the collected data."},
    ],
)
Important nuance: if there are tool_result blocks in messages,
tools must still be passed in the request — otherwise the API will return a validation error.
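A defensive way to handle this nuance is to inspect the history before building the final request. The sketch below assumes messages are plain dicts in the Messages API shape, where `content` is either a string or a list of blocks; the helper names and the sample `tool_use_id` are hypothetical.

```python
def has_tool_results(messages: list) -> bool:
    """Return True if any message contains a tool_result content block."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") == "tool_result":
                    return True
    return False


def synthesis_kwargs(messages: list, tools: list) -> dict:
    """Build request arguments for the final, tool-free synthesis step."""
    kwargs = {"messages": messages}
    if has_tool_results(messages):
        # The history references tools, so they must still be declared.
        kwargs["tools"] = tools
        kwargs["tool_choice"] = {"type": "none"}
    return kwargs


history = [
    {"role": "user", "content": "What are the contract terms?"},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_example", "content": "..."},
    ]},
]
print(sorted(synthesis_kwargs(history, tools=[{"name": "search_docs"}])))
```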
⚠️ Pitfall #1: illusion of control through tool_choice
tool_choice: auto does not mean the model will search when needed. It means the model will search when it deems it necessary. If your task requires up-to-date data on every request, auto is insufficient. Either use required (any in Anthropic's API), or write the description so that the model always considers search necessary for your type of query.
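A middle ground between auto and a blanket forced call is a guard in your own code: try auto first, and retry with a forced call only when the model skipped search on a query that looks time-sensitive. The sketch below is one possible shape for such a guard, not an established pattern from any SDK. The keyword list, the function name, and the `send` callable are all assumptions; `send(query, tool_choice)` is expected to perform the API request and return an object with a `stop_reason` attribute.

```python
from types import SimpleNamespace

# Hypothetical keyword list; tune it to your own domain.
FRESH_DATA_HINTS = ("price", "contract", "deadline", "regulation")


def answer_with_search_guard(query, send, hints=FRESH_DATA_HINTS):
    """Try auto first; retry with a forced tool call if search was skipped."""
    response = send(query, {"type": "auto"})
    looks_time_sensitive = any(hint in query.lower() for hint in hints)
    if looks_time_sensitive and response.stop_reason != "tool_use":
        # The model answered from memory on a query that needs fresh data.
        response = send(query, {"type": "any"})
    return response


# Offline stand-in for the API: answers directly unless a call is forced.
def fake_send(query, tool_choice):
    forced = tool_choice["type"] == "any"
    return SimpleNamespace(stop_reason="tool_use" if forced else "end_turn")


print(answer_with_search_guard("What is the price of plan X?", fake_send).stop_reason)
```

The retry costs one extra request on the failure path only, and the forced retry is subject to the extended thinking restriction described earlier.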
How the tool description affects the model's decision: bad description = model won't call tool
The description is not documentation for the developer. It's part of the prompt that the model sees. Through it, the model "understands" what the tool does and when it's worth calling.
APXML (2025) describes it this way: the model compares the intent of the query with the descriptions of available tools, like a person comparing a task with the tools available in a toolbox. If a hammer is labeled "object for physical impact" — a person might not pick it up for hammering a nail.
OpenAI's official documentation recommends writing clear and detailed function names, parameter descriptions, and instructions. Clearly describe the purpose of the function and each parameter, and use a system prompt to explain when (and when not) to use each function.
Comparison: bad vs good description — a practical case
One client had a problem: the model followed instructions poorly and called search inconsistently. Some requests about prices and contract terms went through without retrieval: the model answered confidently from its own knowledge, which was outdated. When we investigated, the cause turned out to be a minimal tool description. After we rewrote the description with trigger scenarios, the model's behavior stabilized.
# ❌ WAS: minimal description, model often didn't call tool
{
    "name": "search_docs",
    "description": "Searches documents",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"}
        },
        "required": ["query"]
    }
}
# ✅ BECAME: detailed description with triggers, model understands when and why
{
    "name": "search_knowledge_base",
    "description": """Searches for up-to-date information in the corporate knowledge base.
USE this tool when:
- The query concerns contract terms, prices, regulations, or internal procedures
- Up-to-date information is needed that might have changed since your training
- Specifics of a particular client, product, or project are requested
DO NOT use for:
- General questions about technology or generally known facts
- Mathematical calculations
- Formatting or editing text""",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query. Formulate as a question or keywords."
            },
            "top_k": {
                "type": "integer",
                "description": "Number of snippets to return. Defaults to 5. Increase to 10 for complex comparative queries."
            }
        },
        "required": ["query"]
    }
}
The key difference: a good description contains trigger scenarios — an explicit list of situations where the tool is needed. The model uses this list as a criterion in its internal decision.
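If you maintain several tools, it can help to generate descriptions from structured trigger lists so that every tool follows the same template. The function below is a minimal sketch of that idea; `build_description` is a hypothetical helper, and the "USE this tool when" / "DO NOT use for" wording simply mirrors the example above.

```python
def build_description(summary: str, use_when: list, do_not_use_for: list) -> str:
    """Assemble a tool description with explicit trigger scenarios."""
    lines = [summary, "USE this tool when:"]
    lines += [f"- {item}" for item in use_when]
    lines.append("DO NOT use for:")
    lines += [f"- {item}" for item in do_not_use_for]
    return "\n".join(lines)


description = build_description(
    "Searches for up-to-date information in the corporate knowledge base.",
    use_when=[
        "The query concerns contract terms, prices, regulations, or internal procedures",
        "Specifics of a particular client, product, or project are requested",
    ],
    do_not_use_for=["Mathematical calculations", "Formatting or editing text"],
)
print(description)
```

Keeping the triggers as data also makes them easy to review with domain experts, who usually know better than developers which questions must never be answered from memory.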
Important if you use Ollama with a local model: weaker models (7B–13B) may follow even detailed descriptions inconsistently. This is a limitation of the model, not of the description. In our production stack, we use AskYourDocs via OpenRouter with deepseek/deepseek-chat as the default model (${SPRING_AI_OPENAI_CHAT_MODEL:deepseek/deepseek-chat}), and at this level descriptions are followed without issues.
For Ollama in dev/staging environments, we recommend additionally duplicating the trigger instructions in the system prompt, which increases stability on weaker models. More details on configuring Ollama in a production setup: RAG with Ollama: How to Teach AI to Respond Based on Your Documents.
Impact of description on choosing between multiple tools
When there are multiple tools with similar functionality in the system, the quality of the description becomes even more critical. APXML emphasizes: vague or overlapping descriptions cause the model to either choose the wrong tool, or choose none at all.
# Problem: two similar tools with unclear descriptions
tools = [
    {"name": "search_contracts", "description": "Searches contracts"},
    {"name": "search_documents", "description": "Searches documents"}
]
# The model doesn't know the difference, so the choice is unpredictable

# Solution: clear delineation of responsibilities
tools = [
    {
        "name": "search_contracts",
        "description": "Searches ONLY signed legal contracts and their addendums. "
                       "Use for queries about terms, deadlines, party obligations."
    },
    {
        "name": "search_documents",
        "description": "Searches internal regulations, instructions, technical documentation, and reports. "
                       "Does NOT contain legal contracts."
    }
]
⚠️ Pitfall #2: more tools = more selection errors
Laurent Kubaski (2025) notes that most providers publish the maximum number of tools technically supported by the model, but don't mention that in practice the probability of incorrect selection grows with the number of tools. After 10-15 tools, the quality of selection noticeably decreases. Beyond 50, Tool RAG is needed (details in TU-6).
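Both pitfalls, vague overlapping descriptions and an oversized tool list, can be caught with a crude lint step before deployment. The sketch below is only a heuristic: the thresholds (15 tools, 8 words, 0.3 word overlap) are arbitrary assumptions chosen for illustration, and word overlap is a rough proxy for the semantic overlap that actually confuses the model.

```python
import re
from itertools import combinations


def _words(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def lint_tools(tools, max_tools=15, min_words=8, max_overlap=0.3):
    """Return warnings about tool lists that are likely to cause misselection."""
    warnings = []
    if len(tools) > max_tools:
        warnings.append(f"{len(tools)} tools defined (more than {max_tools})")
    for tool in tools:
        if len(tool["description"].split()) < min_words:
            warnings.append(f"{tool['name']}: description is very short")
    for a, b in combinations(tools, 2):
        wa, wb = _words(a["description"]), _words(b["description"])
        union = wa | wb
        overlap = len(wa & wb) / len(union) if union else 0.0  # Jaccard index
        if overlap >= max_overlap:
            warnings.append(f"{a['name']} / {b['name']}: descriptions overlap")
    return warnings


vague_tools = [
    {"name": "search_contracts", "description": "Searches contracts"},
    {"name": "search_documents", "description": "Searches documents"},
]
for warning in lint_tools(vague_tools):
    print(warning)
```

Run against the two vague descriptions from the example above, this flags both as too short and the pair as overlapping; the rewritten descriptions pass.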
Chain-of-Thought inside: how the model analyzes context and makes decisions
The decision to call a tool is not the result of simple pattern matching. Internally, a process similar to Chain-of-Thought reasoning occurs, in which the model sequentially weighs several factors before returning a response.
Raina (2025) describes this through the lens of neural layers. Input embedding layers convert the query and tool descriptions into numerical vectors; for how text becomes a vector and why semantically similar queries land in the same region of the space, see Embeddings in Simple Terms: How AI Understands Meaning, Not Just Words. Middle layers perform abstract reasoning and the tool selection logic; for how vector search finds the most relevant tool or snippet, see Vector Search for Beginners: How RAG Finds the Necessary Information. Output layers generate the final decision: text or a tool call.
Internal decision process (simplified model)
# What happens inside the model with tool_choice: auto
# (conceptual pseudocode: the helper functions and thresholds are illustrative, not a real API)
def decide(user_message, tool_descriptions):
    # 1. Analyze query intent
    intent = analyze_query(user_message)
    # → "query about early termination contract terms"

    # 2. Assess own knowledge
    confidence_in_own_knowledge = estimate_confidence(intent)
    # → "has general knowledge about contracts, but not about this specific client's contract"

    # 3. Compare with available tools
    tool_relevance = match_intent_to_tools(intent, tool_descriptions)
    # → search_knowledge_base: high relevance (trigger scenario: "contract terms")

    # 4. Decision: call the tool only if it is relevant AND own knowledge looks insufficient
    if tool_relevance > relevance_threshold and confidence_in_own_knowledge < confidence_threshold:
        return tool_call(name="search_knowledge_base", args={...})
    else:
        return text_response(...)
Key point: the model doesn't just check if a relevant tool exists. It also assesses if its own knowledge is sufficient. If the model considers itself knowledgeable — it might choose a text response even if a relevant tool is available.
How the model learns to decide when to search
The ability to make this decision correctly comes from fine-tuning on synthetic examples. Simplicity is SOTA (2025) describes the approach: providers generate thousands of examples of the form query → CoT reasoning trace → tool call, from which the model learns not just to call a function but to justify why. An example reasoning trace looks like this:
# Internal CoT reasoning trace (how it's formed during training):
"""
Query: "What are the terms for early termination of our contract?"
Analysis: The query concerns a specific contract ("our") —
this means specific information is needed that is not in my general knowledge.
The available tool search_knowledge_base describes searching the corporate base
with the trigger scenario "contract terms". This is an exact match.
Confidence in own knowledge: low (specificity of a particular contract).
Decision: call search_knowledge_base.
"""
→ tool_call("search_knowledge_base", {"query": "terms for early termination of contract"})
This is why reasoning-enabled model variants (Claude with extended thinking, OpenAI's o-series) show better results in complex tool use scenarios. WildToolBench (2026) confirms this: reasoning-enabled models consistently outperform non-reasoning variants in tasks that require correct orchestration of sequential tool calls.
Impact of system prompt on decision
The system prompt is a powerful lever for controlling the model's decision. If it explicitly states when to search, the model will tend to follow this instruction even with tool_choice: auto.
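As an illustration, here is a minimal sketch of such a system prompt passed through the `system` parameter of the Messages API. The prompt wording and the `build_request` helper are assumptions written for this example, not a tested production prompt; the tool name matches the `search_knowledge_base` example from earlier in the article.

```python
# Hypothetical system prompt that mirrors the tool's trigger scenarios.
SYSTEM_PROMPT = (
    "You answer questions about company contracts, prices, and internal procedures.\n"
    "Before answering any question about contract terms, prices, regulations, or "
    "client-specific details, ALWAYS call search_knowledge_base first.\n"
    "Answer directly, without searching, only for greetings, general knowledge, "
    "math, or text formatting requests."
)


def build_request(query: str, tools: list, model: str = "claude-opus-4-6") -> dict:
    """Assemble keyword arguments for client.messages.create()."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": SYSTEM_PROMPT,
        "tools": tools,
        "tool_choice": {"type": "auto"},  # the model still decides
        "messages": [{"role": "user", "content": query}],
    }


request = build_request("What are the early termination terms?", tools=[])
print(request["system"])
# Usage with a real client: client.messages.create(**request)
```

The instruction repeats the same triggers as the tool description on purpose, so the model sees a consistent rule in both places.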
A high-quality description combined with an explicit instruction in the system prompt makes the model's decisions more reliable than either approach on its own.