AI_TOOLS 30 May 2026 9 min read 491 view

Why My AI Agent Called the Same Page 11 Times in a Row

Updated: 24 June 2026


Vadim Kharovyuk

CEO & Founder of WebsCraft. 8 years in web development, focused on bringing AI into real products.



One user request. One URL. Eleven consecutive calls. As I watched the logs, the token counter kept growing — and I realized I had just built the most expensive loop in my project.

First Test and Unexpected Result

I added WebPageTool to SearchAgent and immediately ran a test — sent a simple message with a link to the chat. The tool worked: the page loaded, the text was extracted, the response was relevant.

But I noticed something interesting in the logs.

WebPageTool: url='https://webscraft.org/'  ← call 1
WebPageTool: url='https://webscraft.org/'  ← call 2
WebPageTool: url='https://webscraft.org/'  ← call 3
WebPageTool: url='https://webscraft.org/'  ← call 4
...
WebPageTool: url='https://webscraft.org/'  ← call 11

Eleven calls for a single user request. The model received the same result every time and continued to call the tool again. Not due to a logic error — it just didn't stop.

I am developing a platform for communication with AI characters. In this project, SearchAgent can read webpages, search for news, and check currency exchange rates. WebPageTool is a new tool in this chain. And this first test immediately raised a specific question: what exactly makes the model repeat a call, and how do I stop it?

To answer it, I had to understand what is actually "heavy" for LLMs — and why a local model behaves differently than a cloud one.

What is a "heavy operation" in LLMs and why it matters

Before talking about a specific bug, it's worth understanding the basic mechanics.

Each interaction with an LLM consists of two parts: input (everything we pass to the model) and output (what the model generates in response). Both parts are measured in tokens — and it's tokens that determine both the cost and the response time.

But there's an important asymmetry: input is processed in parallel — the model reads the entire context simultaneously, which is relatively fast and cheap. Output is generated sequentially — token by token, and this is where the delay occurs. Cloud providers usually charge 3–5 times more for output than for input.
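
To make the asymmetry concrete, here is a minimal sketch of the cost arithmetic. The class name and both prices are hypothetical placeholders (output is priced at 4x input, inside the 3–5x range mentioned above); they are not taken from any provider's price list.

```java
/** Rough per-request cost estimate. Prices are hypothetical placeholders. */
final class RequestCost {
    // Assumed prices in USD per one million tokens (illustrative only).
    static final double INPUT_PRICE_PER_MILLION = 1.00;
    static final double OUTPUT_PRICE_PER_MILLION = 4.00; // 4x input, within the 3-5x range

    /** Input and output tokens are billed separately, each at its own rate. */
    static double costUsd(long inputTokens, long outputTokens) {
        return inputTokens / 1_000_000.0 * INPUT_PRICE_PER_MILLION
             + outputTokens / 1_000_000.0 * OUTPUT_PRICE_PER_MILLION;
    }

    public static void main(String[] args) {
        // Same token count, different price depending on which side it lands on.
        System.out.printf("1000 input tokens:  $%.6f%n", costUsd(1_000, 0));
        System.out.printf("1000 output tokens: $%.6f%n", costUsd(0, 1_000));
    }
}
```

With these assumed prices, a thousand generated tokens cost four times as much as a thousand tokens of context, which is why long outputs deserve attention even when the prompt is small.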

Here's a general overview of load by operation type:

By input (incoming tokens)

Operation	Why it's heavy
RAG with large chunks	Each found document is added to the context
PDF / document analysis	The entire document text goes into the prompt
Long chat history without summarization	100+ messages accumulate
Few-shot examples in system prompt	A large number of examples take up space
Multi-agent with context transfer	Each agent receives the entire previous context

By number of LLM calls

Pattern	Number of calls
Chain of Thought with self-checking	3–5 per request
ReAct agent (think→act→observe)	5–20 per request
Tree of Thoughts	Exponentially
Self-consistency (multiple answers → voting)	N parallel calls
Tool loop without limits	∞ (exactly what I saw in the logs)

By output (outgoing tokens)

Operation	Why it's heavy
Generating code for an entire file	1000–3000 output tokens
Structured JSON with many fields	The model generates every character
Chain-of-thought reasoning	The model "thinks aloud" before answering
Translating long text	Input ≈ Output in size

Understanding this picture is not an academic exercise. It's direct budget savings and UX improvement.

Why reading a web page costs as much as 10 dialogues

When I was designing WebPageTool, everything seemed simple: download the page, trim it to a reasonable size, pass it to the model.

But let's look at the real numbers for one request with page reading.

An important clarification regarding characters and tokens: for English (Latin-script) text, the ratio is approximately 4 characters per token; for Cyrillic text, it is closer to 2–3 characters per token. This means Ukrainian or Russian text costs more than English for the same number of characters.
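
The rule of thumb above can be turned into a rough estimator. This is a sketch under stated assumptions: the class name is hypothetical, 2.5 is simply the midpoint of the 2–3 range, and real tokenizers will give different numbers for any particular text.

```java
/** Back-of-the-envelope token estimate from character counts (not a real tokenizer). */
final class TokenEstimator {
    static final double LATIN_CHARS_PER_TOKEN = 4.0;    // rule of thumb from the text
    static final double CYRILLIC_CHARS_PER_TOKEN = 2.5; // assumed midpoint of the 2-3 range

    static int estimateTokens(String text) {
        // Count Cyrillic code points separately; everything else uses the Latin ratio.
        long cyrillic = text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.CYRILLIC)
                .count();
        long other = text.codePoints().count() - cyrillic;
        return (int) Math.ceil(cyrillic / CYRILLIC_CHARS_PER_TOKEN + other / LATIN_CHARS_PER_TOKEN);
    }
}
```

Calling `TokenEstimator.estimateTokens(pageText)` before sending a page to the model gives a quick sanity check on how much a given character limit will cost in each language.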

What is passed to the model	Approximate tokens	Note
System prompt of the character	200–400	Always
Descriptions of 9 tools (tool schemas)	500–800	Only SearchAgent. When routing to defaultStream – 0
Last 4 context messages	200–400	Only SearchAgent. defaultStream passes the full context (up to 20 messages)
User query	20–50	Always
Page text (4000 Cyrillic characters)	1500–2000	Only when calling WebPageTool
Total – SearchAgent + WebPageTool	~2500–3700	The most resource-intensive scenario
Total – defaultStream (regular chat)	~700–1500	Thanks to embedding routing, most requests go here
Model response (output)	200–500	Always

For comparison, a regular chat message routed to defaultStream takes roughly 700–1500 input tokens including context (see the table above). A WebPageTool request is two to three times as heavy.

Now imagine that the model calls this tool eleven times in a row. Instead of ~3000 tokens per request – potentially 30,000+. And all this for one user message.
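
A rough sketch of that arithmetic, with hypothetical names and round numbers. The first method is the flat estimate used above. The second assumes that every earlier tool result stays in the conversation and is resent on each round, which is how chat-style tool calling commonly works but is an assumption about this particular setup.

```java
/** Two rough estimates of input tokens burned by a repeated tool call. */
final class ToolLoopCost {
    /** Flat estimate: every call costs one full request (base context + page text). */
    static long flatInputTokens(long baseTokens, long pageTokens, int toolCalls) {
        return toolCalls * (baseTokens + pageTokens);
    }

    /**
     * Accumulating estimate: assumes earlier tool results remain in the
     * conversation, so round k carries k copies of the page text.
     */
    static long accumulatingInputTokens(long baseTokens, long pageTokens, int toolCalls) {
        long total = 0;
        for (int k = 1; k <= toolCalls; k++) {
            total += baseTokens + k * pageTokens;
        }
        return total;
    }
}
```

With 1000 base tokens, 2000 page tokens and 11 calls, the flat estimate gives 33,000 tokens. If the results do accumulate, the same loop costs 143,000, so "30,000+" is the optimistic end of the range.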

That's why I decided to get to the bottom of the problem.

How I built WebPageTool

The idea of the tool is simple: the user sends a link, the agent reads the page and summarizes the content.

For downloading and parsing HTML, I chose Jsoup – a reliable library without unnecessary dependencies. After downloading the page, you need to remove everything unnecessary: navigation, footers, banners, cookie pop-ups, ad blocks. What remains is the semantic content – article, main, .content.

Two parameters that have a direct impact on tokens:

MAX_CHARS = 4000 – how many characters of text are passed to the model after cleaning. For Cyrillic, this is approximately 1500–2000 tokens.
TIMEOUT_MS = 10_000 – if the site doesn't respond within 10 seconds, Jsoup throws an exception; the tool catches it and returns a clear message, so the stream doesn't hang.
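
The character limit can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and breaking on the last space is an assumption of this sketch, not necessarily how WebPageTool trims text. The timeout is not covered here.

```java
/** Trims cleaned page text to a character budget before it goes to the model. */
final class PageTextLimiter {
    static final int MAX_CHARS = 4000;

    /** Cuts text to at most maxChars, preferring the last space before the limit. */
    static String truncate(String text, int maxChars) {
        if (text.length() <= maxChars) {
            return text;
        }
        String head = text.substring(0, maxChars);
        int lastSpace = head.lastIndexOf(' ');
        // Fall back to a hard cut when there is no space to break on.
        return lastSpace > 0 ? head.substring(0, lastSpace) : head;
    }
}
```

Cutting on a word boundary costs a few characters of budget but avoids handing the model a half-word at the end of the page text.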

I also added URL validation and a list of blocked domains – YouTube, Instagram, TikTok – where Jsoup will only get an empty shell without real content, because these sites are rendered via JavaScript.
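
A minimal sketch of such a check, using only the standard library. The class name, the exact host list, and the decision to block subdomains as well are assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

/** Rejects malformed URLs, non-HTTP schemes and hosts known to need JavaScript rendering. */
final class UrlGuard {
    // Hosts named in the text; also blocking their subdomains is an assumption of this sketch.
    static final Set<String> BLOCKED_HOSTS = Set.of("youtube.com", "instagram.com", "tiktok.com");

    static boolean isAllowed(String url) {
        if (url == null) {
            return false;
        }
        final URI uri;
        try {
            uri = new URI(url.trim());
        } catch (URISyntaxException e) {
            return false; // not a parseable URL
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) {
            return false;
        }
        if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
            return false;
        }
        String normalized = host.toLowerCase(Locale.ROOT);
        for (String blocked : BLOCKED_HOSTS) {
            if (normalized.equals(blocked) || normalized.endsWith("." + blocked)) {
                return false;
            }
        }
        return true;
    }
}
```

Running this check before any download means a blocked or malformed link costs no network call and no page tokens.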

The tool itself worked correctly from the first launch. The page was loaded, the text was extracted, the response was relevant. The problem came from where I didn't expect it.

Tool loop – when the model went in circles

After the first successful test, I typed into the chat: "https://webscraft.org/ what is this site?"

In the logs, I saw what I described at the beginning – eleven consecutive calls to WebPageTool with the same URL. The model received the correct result each time and... called the tool again.

I tried several approaches, and each taught me something important.

First attempt: ThreadLocal

The logic seemed obvious: store a "called" flag in ThreadLocal, and return a placeholder on a repeated call. ThreadLocal stores values separately for each thread.

But in streaming mode, Spring AI executes tool calls in different threads from the boundedElastic pool. Each new thread received a fresh CALLED = false and passed the check. ThreadLocal is not suitable for a reactive environment with a thread pool.
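
The failure is easy to reproduce without Spring AI. In this sketch (all names hypothetical), a fresh single-thread executor stands in for a worker picked from a pool: every new thread sees the initial value of the flag, so the guard never triggers.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalFlagDemo {
    // One flag per thread: a value set in one thread is invisible in another.
    static final ThreadLocal<Boolean> CALLED = ThreadLocal.withInitial(() -> false);

    /** Returns true if the call is allowed, i.e. the current thread has not seen a call yet. */
    static boolean tryCall() {
        if (CALLED.get()) {
            return false;
        }
        CALLED.set(true);
        return true;
    }

    /** Runs tryCall() on a brand-new thread, standing in for a different pooled worker. */
    static boolean tryCallOnNewThread() {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task = ThreadLocalFlagDemo::tryCall;
            return worker.submit(task).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
    }
}
```

Two consecutive `tryCallOnNewThread()` calls both return true, while two `tryCall()` calls on the same thread return true and then false. The guard only works when every call happens to land on the same thread.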

Second attempt: AtomicInteger

AtomicInteger is a thread-safe counter; the getAndIncrement() operation is atomic. It seemed like a solution. But if WebPageTool remained a Spring component (@Component), it would be a singleton – shared by all users. The first real call would block the tool for everyone forever.
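
A small sketch of the pitfall, with hypothetical names and a hypothetical placeholder string. One `SharedTool` object plays the role of the singleton bean: its counter survives across requests, so the second user is blocked by the first user's call.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SingletonGuardDemo {
    /** Stand-in for a tool registered as a singleton: one counter for everyone. */
    static final class SharedTool {
        private final AtomicInteger calls = new AtomicInteger();

        String read(String url) {
            // The counter is never reset, so only the very first call ever passes.
            if (calls.getAndIncrement() > 0) {
                return "ALREADY_CALLED";
            }
            return "content of " + url;
        }
    }
}
```

If user A triggers `read(...)` first, user B's completely unrelated request gets `ALREADY_CALLED`, and so does every request after it.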

Final solution: per-request object

Instead of fighting state in a singleton, I removed @Component and started creating a new instance of WebPageTool for each request directly in SearchAgent:

WebPageTool webPageTool = new WebPageTool();

Each user request gets its own instance with a clean counter. AtomicInteger is still relevant here – if the model calls the tool from multiple threads simultaneously, getAndIncrement() ensures that only the first call goes through.
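
A minimal sketch of the per-request guard, assuming a simplified stand-in for WebPageTool. The class name, the placeholder message, and the `fetch` stub are hypothetical; the real tool downloads and cleans the page at that point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for WebPageTool: one instance per user request. */
final class PerRequestPageTool {
    static final String ALREADY_READ = "Page already read. Answer using the earlier result.";

    private final AtomicInteger calls = new AtomicInteger();

    String read(String url) {
        // getAndIncrement() is atomic: exactly one caller observes 0.
        if (calls.getAndIncrement() > 0) {
            return ALREADY_READ;
        }
        return fetch(url);
    }

    // Placeholder for the real download and cleaning step.
    private String fetch(String url) {
        return "content of " + url;
    }

    /** Simulates the model firing several calls at one request's tool from a thread pool. */
    static long countRealReads(int attempts) {
        PerRequestPageTool tool = new PerRequestPageTool(); // fresh instance, fresh counter
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (int i = 0; i < attempts; i++) {
                tasks.add(() -> tool.read("https://example.com/"));
            }
            long real = 0;
            for (Future<String> result : pool.invokeAll(tasks)) {
                if (!ALREADY_READ.equals(result.get())) {
                    real++;
                }
            }
            return real;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

`countRealReads(11)` returns 1 no matter how the eleven calls are spread across threads, and the next request starts from zero because it gets a new instance.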

This is an elegant solution: no need for inter-request synchronization or complex state management.

Local vs. Cloud Models — Why Behavior Differs

When I switched from a local model (LM Studio) to a cloud-based one via OpenRouter, the tool loop disappeared on its own. Without any code changes.

Why is that? This question is deeper than it seems.

Training on Tool Use

GPT-4o, Claude Sonnet, and other cloud models have undergone specialized training in tool usage. OpenAI and Anthropic have invested significant resources in post-training, including RLHF (Reinforcement Learning from Human Feedback), a process in which human evaluators rank model outputs, tool-use behavior included. The model learned a clear pattern: call → result → final answer. STOP.

Local open-source models — Qwen, Llama, Mistral — have significantly fewer such specialized examples in their training data. They can call tools, but they don't always know when to stop.

Personally, I use meta-llama-3.1-8b-instruct via LM Studio — it responds quickly and supports tool calls out of the box. For local development and architecture testing, it's an excellent choice that I recommend as a starting point.

Quantization and Degradation of Complex Reasoning

Most local models run in a 4-bit quantized format — this is necessary to run on consumer hardware. Quantization reduces the precision of model weights: instead of 16-bit floating-point numbers, they are stored as 4-bit integers.

Research shows that aggressive 4-bit quantization can lead to an 11–32% degradation in accuracy on complex reasoning tasks. And following multi-step instructions is precisely such a task. The model "forgets" that it has already made a call and repeats it.

Another factor is the number of available tools. Research on the BFCL benchmark showed that when a local model is provided with 46 tools simultaneously, it starts to get confused and chooses the wrong tool or calls it repeatedly. In my SearchAgent, there are 9 tools. For a cloud model, this is normal; for a local one, it's already stressful.

Instruction Placement in Context

Cloud models are better at keeping system-prompt instructions "in mind" even in long conversations. A local model, by the time it receives a tool result, may have already "forgotten" that the context began with an instruction like MAXIMUM 1 TIME.

That's why I added an explicit warning block directly in the system prompt for requests with URLs — in capital letters, with a clear imperative. This is unnecessary for a cloud model. For a local one, it's essential.
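
A minimal sketch of how such a block could be attached only to requests that contain a link. The class name, the regular expression, and the wording of the warning are hypothetical; the text above only states that the block is uppercase and imperative.

```java
import java.util.regex.Pattern;

/** Appends an explicit tool-call limit to the system prompt when the user message has a URL. */
final class UrlPromptGuard {
    private static final Pattern URL = Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE);

    // Hypothetical wording, used here only for illustration.
    static final String WARNING =
            "\n\nIMPORTANT: CALL WebPageTool MAXIMUM 1 TIME PER REQUEST. "
          + "AFTER YOU RECEIVE THE RESULT, ANSWER THE USER. DO NOT CALL THE TOOL AGAIN.";

    static String buildSystemPrompt(String basePrompt, String userMessage) {
        // Requests without a link keep the shorter prompt and pay no extra tokens.
        return URL.matcher(userMessage).find() ? basePrompt + WARNING : basePrompt;
    }
}
```

Tying the warning to the request keeps it close to the moment it matters and avoids spending tokens on it in ordinary chat.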

Here's a practical comparison of behavior:

Characteristic	Local (Qwen/Llama 4-bit)	Cloud (GPT-4o, Claude)
Tool Use Training	Limited	Specialized, RLHF
Instruction Following Accuracy	Medium	High
Behavior After Tool Result	May repeat call	Stops, forms response
Number of Tools in Context	Better ≤5	Stable up to 20+
Impact of Quantization on Reasoning	Noticeable	Usually none (served at or near full precision)
Cost	Free (local)	Per token

This difference is not a flaw of local models. It's simply a different trade-off: privacy and zero cost in exchange for less predictable behavior in complex scenarios. Knowing this, you can design your system accordingly.

Rules I've Learned from This Case

After all this, I've formulated a few rules for myself that I now apply when developing any AI agent.

Measure tokens before, not after. Before adding a new tool or increasing MAX_CHARS, calculate how many tokens it will add to a typical request.
Stateful tools are always per-request. If a tool has state, it should not be a Spring singleton. Create a new instance for each request.
For local models, the system prompt is more important than the @Tool description. Explicit instructions directly in the system prompt, tied to a specific request, work more reliably.
Routing is the first line of token saving. Proper routing that keeps regular chat out of SearchAgent saves ~500–800 tokens per message, because the tool schemas are never sent.
Limit the number of tools for local models. With a large number of tools, a local model starts to get confused. Keep only the most necessary ones.
Loop protection is at the object level, not the prompt level. A prompt saying "DO NOT CALL TWICE" is a recommendation. An AtomicInteger in a per-request object is a guarantee at the code level.

This case clearly showed me: developing AI agents is not just about choosing a model or writing a prompt. It's about understanding how the model processes context, how much each operation costs, and why the same architecture behaves differently depending on the model under the hood. If you're interested in how to manage an agent's context, I recommend reading about sliding window, summarization, and compression. For choosing search tools, there's a separate analysis in the article Search API for AI Agents: What Developers Choose and Where They Make Mistakes.

Local development is a great way to fine-tune architecture without incurring costs. But remember: what looks like a bug in the code might turn out to be a feature of a specific model.

This is part of a series of articles on LLMs and practical AI development.
