One user request. One URL. Eleven consecutive calls. As I watched the logs, the token counter kept growing — and I realized I had just built the most expensive loop in my project.
First Test and Unexpected Result
I added WebPageTool to SearchAgent and immediately ran a test — sent a simple message with a link to the chat. The tool worked: the page loaded, the text was extracted, the response was relevant.
Then I opened the logs: eleven calls for a single user request. The model received the same result every time and kept calling the tool again. This was not a logic error in my code; the model simply didn't stop.
I am developing a platform for communication with AI characters. In this project, SearchAgent can read webpages, search for news, and check currency exchange rates. WebPageTool is the newest tool in this chain. And this first test immediately raised a specific question: what exactly makes the model repeat a call, and how do I stop it?
To answer it, I had to understand what is actually "heavy" for LLMs — and why a local model behaves differently than a cloud one.
What is a "heavy operation" in LLMs and why it matters
Before talking about a specific bug, it's worth understanding the basic mechanics.
Each interaction with an LLM consists of two parts: input (everything we pass to the model) and output (what the model generates in response). Both parts are measured in tokens — and it's tokens that determine both the cost and the response time.
But there's an important asymmetry: input is processed in parallel — the model reads the entire context simultaneously, which is relatively fast and cheap. Output is generated sequentially — token by token, and this is where the delay occurs. Cloud providers usually charge 3–5 times more for output than for input.
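To make the asymmetry concrete, here is a minimal sketch of the cost arithmetic. The class name and both prices are hypothetical (3 and 15 micro-dollars per token, a 5x ratio at the top of the range mentioned above); real provider pricing differs and changes over time.

```java
// Sketch only: prices are illustrative assumptions, not any provider's real rates.
final class RequestCost {
    // Micro-dollars per token, i.e. $3 and $15 per million tokens (hypothetical).
    static final long INPUT_PRICE = 3;
    static final long OUTPUT_PRICE = 15;

    // Total cost of one LLM call in micro-dollars.
    static long microDollars(long inputTokens, long outputTokens) {
        return inputTokens * INPUT_PRICE + outputTokens * OUTPUT_PRICE;
    }

    public static void main(String[] args) {
        // 3000 input tokens and 400 output tokens.
        System.out.println(microDollars(3000, 400)); // prints 15000
    }
}
```

With these assumed prices, 400 output tokens cost 6000 micro-dollars, two thirds of what 3000 input tokens cost, which is why long generated answers weigh more than their token count suggests.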
Here's a general overview of load by operation type:
By input (incoming tokens)
| Operation | Why it's heavy |
|---|---|
| RAG with large chunks | Each found document is added to the context |
| PDF / document analysis | The entire document text goes into the prompt |
| Long chat history without summarization | 100+ messages accumulate |
| Few-shot examples in system prompt | A large number of examples take up space |
| Multi-agent with context transfer | Each agent receives the entire previous context |
By number of LLM calls
| Pattern | Number of calls |
|---|---|
| Chain of Thought with self-checking | 3–5 per request |
| ReAct agent (think→act→observe) | 5–20 per request |
| Tree of Thoughts | Exponentially |
| Self-consistency (multiple answers → voting) | N parallel calls |
| Tool loop without limits | ∞ (exactly what I saw in the logs) |
By output (outgoing tokens)
| Operation | Why it's heavy |
|---|---|
| Generating code for an entire file | 1000–3000 output tokens |
| Structured JSON with many fields | The model generates every character |
| Chain-of-thought reasoning | The model "thinks aloud" before answering |
| Translating long text | Input ≈ Output in size |
Understanding this picture is not an academic exercise. It's direct budget savings and UX improvement.
Why reading a web page costs as much as 10 dialogues
When I was designing WebPageTool, everything seemed simple: download the page, trim it to a reasonable size, pass it to the model.
But let's look at the real numbers for one request with page reading.
An important clarification regarding characters and tokens: for Latin letters, the ratio is approximately 4 characters = 1 token, for Cyrillic letters – 2–3 characters = 1 token. This means Ukrainian or Russian text costs more than English for the same number of characters.
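These ratios can be turned into a rough estimator. The sketch below is an approximation under stated assumptions: the class name is hypothetical, 2.5 is simply the midpoint of the 2–3 range for Cyrillic, and every non-Cyrillic character is counted at the Latin ratio. A real tokenizer will give different numbers.

```java
// Rough token estimate from character counts; rules of thumb, not a tokenizer.
final class TokenEstimate {
    static final double LATIN_CHARS_PER_TOKEN = 4.0;
    static final double CYRILLIC_CHARS_PER_TOKEN = 2.5; // assumed midpoint of 2-3

    static int estimate(String text) {
        int cyrillic = 0;
        int other = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.CYRILLIC) {
                cyrillic++;
            } else {
                other++;
            }
            i += Character.charCount(cp);
        }
        return (int) Math.ceil(cyrillic / CYRILLIC_CHARS_PER_TOKEN
                + other / LATIN_CHARS_PER_TOKEN);
    }

    public static void main(String[] args) {
        System.out.println(estimate("a".repeat(4000)));      // prints 1000
        System.out.println(estimate("\u044F".repeat(4000))); // prints 1600
    }
}
```

The same 4000 characters come out at about 1000 tokens in Latin script and about 1600 in Cyrillic under these assumptions.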
| What is passed to the model | Approximate tokens | Note |
|---|---|---|
| System prompt of the character | 200–400 | Always |
| Descriptions of 9 tools (tool schemas) | 500–800 | Only SearchAgent. When routing to defaultStream: 0 |
| Last 4 context messages | 200–400 | Only SearchAgent. defaultStream passes the full context (up to 20 messages) |
| User query | 20–50 | Always |
| Page text (4000 Cyrillic characters) | 1500–2000 | Only when calling WebPageTool |
| Total – SearchAgent + WebPageTool | ~2500–3700 | The most resource-intensive scenario |
| Total – defaultStream (regular chat) | ~700–1500 | Thanks to embedding routing, most requests go here |
| Model response (output) | 200–500 | Always |
For comparison, a regular chat message without tools takes 1200–2500 tokens including context. WebPageTool is almost twice as heavy.
Now imagine that the model calls this tool eleven times in a row. Instead of ~3000 tokens per request – potentially 30,000+. And all this for one user message.
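The arithmetic behind that estimate can be sketched as follows. The class and method names are hypothetical. The first method is the flat lower bound used above. The second rests on an assumption about the framework: that every tool result stays in the message history, so each later round re-sends all earlier page texts. The base and page sizes passed to it are illustrative.

```java
// Sketch of input-token cost for a repeated tool call; all figures are illustrative.
final class LoopCost {
    // Lower bound: every call costs the same input as a single request.
    static long flat(long tokensPerRequest, int calls) {
        return tokensPerRequest * calls;
    }

    // Assumption: tool results accumulate in the history,
    // so round k carries the base context plus k page results.
    static long accumulating(long baseTokens, long pageTokens, int calls) {
        long total = 0;
        for (int k = 1; k <= calls; k++) {
            total += baseTokens + k * pageTokens;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(flat(3000, 11));               // prints 33000
        System.out.println(accumulating(1000, 2000, 11)); // prints 143000
    }
}
```

If results do accumulate, the loop grows quadratically rather than linearly, which is why the flat figure should be read as a floor.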
That's why I decided to get to the bottom of the problem.
How I built WebPageTool
The idea of the tool is simple: the user sends a link, the agent reads the page and summarizes the content.
For downloading and parsing HTML, I chose Jsoup – a reliable library without unnecessary dependencies. After downloading the page, you need to remove everything unnecessary: navigation, footers, banners, cookie pop-ups, ad blocks. What remains is the semantic content – article, main, .content.
Two parameters matter most, and the first one directly affects token usage:
MAX_CHARS = 4000 – how many characters of cleaned text are passed to the model. For Cyrillic, this is approximately 1500–2000 tokens.
TIMEOUT_MS = 10000 – if the site doesn't respond within 10 seconds, Jsoup throws an exception; the tool catches it and returns a clear message, so the stream doesn't hang.
I also added URL validation and a list of blocked domains – YouTube, Instagram, TikTok – where Jsoup will only get an empty shell without real content, because these sites are rendered via JavaScript.
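The validation and trimming steps can be sketched with the standard library alone. This is not the actual WebPageTool code: the class name, the exact host list, and the subdomain matching rule are assumptions made for illustration, and the Jsoup download itself is left out.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Sketch of pre-download checks; host list and matching rule are assumptions.
final class UrlGuard {
    static final int MAX_CHARS = 4000;
    static final Set<String> BLOCKED = Set.of("youtube.com", "instagram.com", "tiktok.com");

    // Accepts only http/https URLs whose host is not a blocked domain or its subdomain.
    static boolean isAllowed(String url) {
        if (url == null) return false;
        try {
            URI uri = new URI(url.trim());
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) return false;
            if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) return false;
            String h = host.toLowerCase(Locale.ROOT);
            for (String blocked : BLOCKED) {
                if (h.equals(blocked) || h.endsWith("." + blocked)) return false;
            }
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Cuts cleaned page text to MAX_CHARS (may split a surrogate pair at the boundary).
    static String truncate(String text) {
        return text.length() <= MAX_CHARS ? text : text.substring(0, MAX_CHARS);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("https://webscraft.org/"));            // prints true
        System.out.println(isAllowed("https://www.youtube.com/watch?v=x")); // prints false
    }
}
```

Rejecting a URL before the download costs nothing, while a blocked site that slips through still spends the full tool-schema and context tokens on an empty page.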
The tool itself worked correctly from the first launch. The page was loaded, the text was extracted, the response was relevant. The problem came from where I didn't expect it.
Tool loop – when the model went in circles
After the first successful test, I typed into the chat: "https://webscraft.org/ what is this site?"
In the logs, I saw what I described at the beginning – eleven consecutive calls to WebPageTool with the same URL. The model received the correct result each time and... called the tool again.
I tried several approaches, and each taught me something important.
First attempt: ThreadLocal
The logic seemed obvious: store a "called" flag in ThreadLocal, and return a placeholder on a repeated call. ThreadLocal stores values separately for each thread.
But in streaming mode, Spring AI executes tool calls in different threads from the boundedElastic pool. Each new thread received a fresh CALLED = false and passed the check. ThreadLocal is not suitable for a reactive environment with a thread pool.
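The failure is easy to reproduce without Spring. In this sketch (hypothetical class, plain executors standing in for the boundedElastic pool) every call runs on a fresh thread, so every call sees the initial flag value and passes the check.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Demonstrates why a ThreadLocal flag cannot limit calls that hop between threads.
final class ThreadLocalFlagDemo {
    private static final ThreadLocal<Boolean> CALLED = ThreadLocal.withInitial(() -> false);

    // Returns true if the flag was not yet set on the current thread.
    static boolean tryCall() {
        if (CALLED.get()) return false;
        CALLED.set(true);
        return true;
    }

    // Runs tryCall() once on each of `threads` new threads and counts how many passed.
    static int allowedAcrossThreads(int threads) {
        Callable<Boolean> task = ThreadLocalFlagDemo::tryCall;
        int allowed = 0;
        for (int i = 0; i < threads; i++) {
            ExecutorService worker = Executors.newSingleThreadExecutor();
            try {
                if (worker.submit(task).get()) allowed++;
            } catch (Exception e) {
                throw new IllegalStateException(e);
            } finally {
                worker.shutdown();
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        System.out.println(allowedAcrossThreads(11)); // prints 11: no call was blocked
    }
}
```

On a single thread the second `tryCall()` would return false, which is exactly why the approach looks correct in a non-reactive test and fails in streaming mode.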
Second attempt: AtomicInteger
AtomicInteger is a thread-safe counter; the getAndIncrement() operation is atomic. It seemed like a solution. But if WebPageTool remained a Spring component (@Component), it would be a singleton – shared by all users. The first real call would block the tool for everyone forever.
Final solution: per-request object
Instead of fighting state in a singleton, I removed @Component and started creating a new instance of WebPageTool for each request directly in SearchAgent:
WebPageTool webPageTool = new WebPageTool();
Each user request gets its own instance with a clean counter. AtomicInteger is still relevant here – if the model calls the tool from multiple threads simultaneously, getAndIncrement() ensures that only the first call goes through.
This is an elegant solution: no need for inter-request synchronization or complex state management.
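A minimal sketch of the guard, assuming a simplified stand-in for WebPageTool: the class name, the placeholder text, and the stubbed download are all hypothetical, and the real tool would fetch and clean the page where the stub returns a string.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for WebPageTool with the download stubbed out.
final class GuardedPageTool {
    private final AtomicInteger calls = new AtomicInteger();

    String readPage(String url) {
        // getAndIncrement() returns the previous value atomically,
        // so only one caller per instance ever sees 0.
        if (calls.getAndIncrement() > 0) {
            return "Page already read in this request. Answer using the previous result.";
        }
        return "content of " + url; // placeholder for the real download and cleanup
    }

    int callCount() {
        return calls.get();
    }

    public static void main(String[] args) {
        GuardedPageTool tool = new GuardedPageTool(); // one instance per user request
        System.out.println(tool.readPage("https://webscraft.org/")); // page content
        System.out.println(tool.readPage("https://webscraft.org/")); // placeholder
    }
}
```

If a single instance were shared as a singleton, the counter would never reset and every later user would get the placeholder. Creating the object per request is what gives the counter the right lifetime.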
Local vs. Cloud Models — Why Behavior Differs
When I switched from a local model (LM Studio) to a cloud-based one via OpenRouter, the tool loop disappeared on its own. Without any code changes.
Why is that? This question is deeper than it seems.
Training on Tool Use
GPT-4o, Claude Sonnet, and other cloud models have undergone specialized training in tool usage. OpenAI and Anthropic have invested significant resources in RLHF (Reinforcement Learning from Human Feedback) — a process where human evaluators ranked thousands of examples of correct tool usage. The model learned a clear pattern: call → result → final answer. STOP.
Local open-source models — Qwen, Llama, Mistral — have significantly fewer such specialized examples in their training data. They can call tools, but they don't always know when to stop.
Personally, I use meta-llama-3.1-8b-instruct via LM Studio — it responds quickly and supports tool calls out of the box. For local development and architecture testing, it's an excellent choice that I recommend as a starting point.
Quantization and Degradation of Complex Reasoning
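Most local models run in a 4-bit quantized format, which is what lets them fit on consumer hardware. Quantization reduces the precision of model weights: instead of 16-bit floating-point numbers, they are stored as 4-bit integers. The lost precision tends to show up first in multi-step reasoning, and deciding that a tool result is already sufficient is that kind of step.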
Another factor is the number of available tools. Research on the BFCL benchmark showed that when a local model is provided with 46 tools simultaneously, it starts to get confused and chooses the wrong tool or calls it repeatedly. In my SearchAgent, there are 9 tools. For a cloud model, this is normal; for a local one, it's already stressful.
Instruction Placement in Context
Cloud models are better at "keeping instructions from the system prompt in mind" even in long conversations. During streaming generation, by the time a local model receives a tool result, it might have already "forgotten" that the context began with an instruction such as MAXIMUM 1 TIME.
That's why I added an explicit warning block directly in the system prompt for requests with URLs — in capital letters, with a clear imperative. This is unnecessary for a cloud model. For a local one, it's essential.
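One way to attach such a block only to requests that contain a link is sketched below. The class name, the URL pattern, and the warning wording are hypothetical; the text above only says that the block is capitalized and imperative.

```java
import java.util.regex.Pattern;

// Sketch: append a warning to the system prompt when the user message contains a URL.
final class PromptBuilder {
    private static final Pattern URL =
            Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE);

    // Hypothetical wording of the warning block.
    static final String URL_WARNING =
            "\n\nIMPORTANT: CALL THE WEB PAGE TOOL MAXIMUM 1 TIME. "
            + "AFTER YOU RECEIVE ITS RESULT, ANSWER THE USER AND DO NOT CALL IT AGAIN.";

    static String build(String basePrompt, String userMessage) {
        return URL.matcher(userMessage).find() ? basePrompt + URL_WARNING : basePrompt;
    }

    public static void main(String[] args) {
        System.out.println(build("You are a helpful character.",
                "https://webscraft.org/ what is this site?"));
    }
}
```

Keeping the warning out of ordinary messages avoids spending those tokens on every request.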
Here's a practical comparison of behavior:
| Characteristic | Local (Qwen/Llama 4-bit) | Cloud (GPT-4o, Claude) |
|---|---|---|
| Tool Use Training | Limited | Specialized, RLHF |
| Instruction Following Accuracy | Medium | High |
| Behavior After Tool Result | May repeat call | Stops, forms response |
| Number of Tools in Context | Better ≤5 | Stable up to 20+ |
| Impact of Quantization on Reasoning | Noticeable | None (full precision) |
| Cost | Free (local) | Per token |
This difference is not a flaw of local models. It's simply a different trade-off: privacy and zero cost in exchange for less predictable behavior in complex scenarios. Knowing this, you can design your system accordingly.
Rules I've Learned from This Case
After all this, I've formulated a few rules for myself that I now apply when developing any AI agent.
Measure tokens before, not after. Before adding a new tool or increasing MAX_CHARS, calculate how many tokens it will add to a typical request.
Stateful tools are always per-request. If a tool has state, it should not be a Spring singleton. Create a new instance for each request.
For local models, the system prompt is more important than the @Tool description. Explicit instructions directly in the system prompt, tied to a specific request, work more reliably.
Routing is the first line of token saving. Proper routing that filters out regular chat from SearchAgent saves ~500–800 tokens per message.
Limit the number of tools for local models. With a large number of tools, a local model starts to get confused. Keep only the most necessary ones.
Loop protection is at the object level, not the prompt level. A prompt saying "DO NOT CALL TWICE" is a recommendation. An AtomicInteger in a per-request object is a guarantee at the code level.
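The routing rule can be illustrated with a small sketch. The article only says that routing is embedding-based, so everything here is an assumption: the class name, the use of cosine similarity against a single "search intent" vector, and the threshold. The embedding vectors would come from an embedding model that is not shown.

```java
// Sketch of embedding-based routing; the similarity rule and threshold are assumptions.
final class EmbeddingRouter {
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Sends the request to SearchAgent only when it is close enough to the search intent.
    static String route(double[] query, double[] searchIntent, double threshold) {
        return cosine(query, searchIntent) >= threshold ? "SearchAgent" : "defaultStream";
    }

    public static void main(String[] args) {
        double[] searchIntent = {1.0, 0.0};
        System.out.println(route(new double[] {1.0, 0.0}, searchIntent, 0.8)); // prints SearchAgent
        System.out.println(route(new double[] {0.0, 1.0}, searchIntent, 0.8)); // prints defaultStream
    }
}
```

Every request that lands in defaultStream skips the tool schemas entirely, which is where the per-message saving comes from.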
This case clearly showed me: developing AI agents is not just about choosing a model or writing a prompt. It's about understanding how the model processes context, how much each operation costs, and why the same architecture behaves differently depending on the model under the hood. If you're interested in how to manage an agent's context, I recommend reading about sliding window, summarization, and compression. For choosing search tools, there's a separate analysis in the article Search API for AI Agents: What Developers Choose and Where They Make Mistakes.
Local development is a great way to fine-tune architecture without incurring costs. But remember: what looks like a bug in the code might turn out to be a feature of a specific model.
This is part of a series of articles on LLMs and practical AI development.