Grounding in AI Agents: Handle Empty, Irrelevant and Failed Tool Results


Imagine: your AI agent received the request "what is the price of the Enterprise plan?". It called a tool. The tool responded. The agent formulated an answer — confidently, coherently, with a specific figure. The client received the answer and left satisfied.

The problem is that the tool returned an empty result — the document was not found. And instead of saying "I didn't find it" — the agent invented a price from its own training data. Confidently. Without any warning.

This is not a bug. It's a lack of grounding — and that's what this article is about.

This article is part of a series on AI agents with Spring Boot. If you haven't yet read how a model decides when to call a tool — start with How LLM Decides When to Call a Tool.

What is grounding and why it's missing from tutorials

Most tutorials on AI agents look like this:

1. Get user request
2. Call tool
3. Pass result to model
4. Get answer
5. ???
6. Profit

Step 5 is grounding. And tutorials skip it because it's "uninteresting." There's no flashy demo. No wow moment. Just a boring quality check before answering.

Grounding is the process where the agent checks if the result returned by the tool actually answers the request, if it's of sufficient quality to build an answer on, and if it can refer to a specific source rather than inventing an answer.
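
The three checks in that definition can be written as a small gate that runs before the model is allowed to answer. This is a minimal sketch, assuming Java 16+ for records. The names GroundingGate, Evidence and Verdict and the 0.75 cut-off are illustrative assumptions, and the search score is used only as a rough proxy for "answers the request".

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical pre-answer gate; all names and the threshold are illustrative. */
final class GroundingGate {

    /** One retrieved item: its source label, its text, and the search score. */
    record Evidence(String source, String content, double score) {}

    enum Verdict { GROUNDED, NO_EVIDENCE, LOW_QUALITY, NO_SOURCE }

    // Assumed cut-off; tune it against your own data.
    static final double MIN_SCORE = 0.75;

    static Verdict check(List<Evidence> evidence) {
        if (evidence == null || evidence.isEmpty()) {
            return Verdict.NO_EVIDENCE;          // nothing to build an answer on
        }
        Evidence best = evidence.stream()
            .max(Comparator.comparingDouble(Evidence::score))
            .orElseThrow();                      // safe: the list is not empty
        if (best.score() < MIN_SCORE) {
            return Verdict.LOW_QUALITY;          // closest match is not close enough
        }
        if (best.source() == null || best.source().isBlank()) {
            return Verdict.NO_SOURCE;            // cannot cite where the answer came from
        }
        return Verdict.GROUNDED;
    }
}
```

Only a GROUNDED verdict would let the result through as a basis for the answer. Every other verdict maps to an explicit "not found" style message.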

Why it's critical — an analogy from real life

Imagine you hired an assistant and asked: "Find me the terms of our contract with client Alpha".

The assistant went to the archive. Returned after a minute. And told you the contract terms — confidently and in detail.

But in reality, they found nothing in the archive. They just remembered a similar contract with another client and decided it was "roughly the same."

You signed a new agreement based on these terms. And only a week later did you discover that the terms were completely different.

This is exactly how an AI agent without grounding behaves. And this is why it's not an academic concept — it's a matter of trust in your product.

Real case from AskYourDocs: in the early stages of development, the agent sometimes answered requests about specific documents even when the document was not uploaded to the knowledge base. The model "remembered" similar content from its training data and presented it as a search result for the client's documents. After implementing grounding checks, this problem disappeared.

Three scenarios of bad tool results — with real examples

Not all bad results are the same. Let's break down three fundamentally different scenarios — each requires a separate processing strategy. And each of them can lead to a different type of error if not handled explicitly.

Scenario 1: Empty result

The tool executed successfully but found nothing. It seems like the simplest case. In reality, it's the most frequent cause of hallucinations.

Why it's dangerous: the model sees an empty response and doesn't know what to do with it. Instead of saying "not found" — it fills the gap with its own training knowledge. Confidently. Without any warning.

// Tool returned an empty list
{
  "results": [],
  "total": 0
}

// What the model does WITHOUT grounding:
// User: "What is the price of the Enterprise plan?"
// Agent: "According to corporate standards,
//         the price of the Enterprise plan is $500/month..."
// (invented from training data — real price is $850)

// What the model does WITH grounding:
// Agent: "I could not find information about prices in the knowledge base.
//         I recommend contacting the sales department."

How to handle in Java:

@Tool(description = "Searches for information in the knowledge base")
public String searchKnowledgeBase(String query) {
    List<SearchResult> results = vectorStore.search(query);

    // Explicitly handle empty result
    if (results == null || results.isEmpty()) {
        log.warn("Empty result for query: '{}'", query);
        // Do not return an empty string — return an explicit instruction
        return String.format("""
            RESULT: nothing found for query "%s"
            INSTRUCTION FOR MODEL: do not answer from your own knowledge.
            Inform the user that the information is missing from the knowledge base.
            """, query);
    }

    return formatResults(results);
}

Real example from Agent Chat: The Wikipedia tool sometimes returns an empty result for highly specialized queries — for example, "vibe coding productivity statistics 2025". Without grounding, the agent immediately started providing statistics it invented itself — and it entered the dialogue as "real fact". After adding an explicit instruction to the empty result — the agent correctly switches to Tavily or admits it didn't find anything.

Scenario 2: Irrelevant result

The tool found something — but not what was needed. This is the most dangerous scenario of the three. Because the model sees "there is a result" and confidently builds an answer on it — without checking if it's really what the user asked for.

Why it's dangerous: semantic search always returns something closest to the query — even if the actual document is not in the database. A score of 0.61 means "the best of what's available" — but it doesn't mean "answers the question".

// Query: "terms of contract termination with client Alpha"
// Tool returned the closest document:
{
  "results": [
    {
      "title": "General contract termination terms",
      "content": "Contract termination is possible with 30 days notice...",
      "score": 0.61  // low relevance — general template, not Alpha's contract
    }
  ]
}

// WITHOUT grounding, the model will answer:
// "The contract with client Alpha can be terminated with 30 days notice"
// (actually, their contract requires 60 days notice and includes penalty clauses)

// WITH grounding — we check the score before passing it to the model

How to handle in Java — checking the relevance threshold:

@Tool(description = "Searches for information in the knowledge base")
public String searchKnowledgeBase(String query) {
    List<SearchResult> results = vectorStore.search(query);

    if (results == null || results.isEmpty()) {
        return buildNotFoundMessage(query);
    }

    SearchResult best = results.get(0);
    double score = best.getScore();

    if (score >= 0.75) {
        // High relevance — pass confidently
        return String.format("""
            RESULT (relevance: high):
            %s
            SOURCE: %s
            """, best.getContent(), best.getDocumentTitle());

    } else if (score >= 0.55) {
        // Medium relevance — warn the model
        return String.format("""
            RESULT (relevance: medium, score: %.2f):
            %s
            ATTENTION FOR MODEL: the result may not accurately answer the query.
            Inform the user that the answer may be inaccurate.
            SOURCE: %s
            """, score, best.getContent(), best.getDocumentTitle());

    } else {
        // Low relevance — better to admit not found
        // SLF4J only understands plain {} placeholders, so format the score first
        log.warn("Low relevance score {} for query: '{}'",
            String.format("%.2f", score), query);
        return buildNotFoundMessage(query);
    }
}

private String buildNotFoundMessage(String query) {
    return String.format("""
        RESULT: information not found for query "%s"
        INSTRUCTION: do not answer from your own knowledge.
        Inform that the information is missing or suggest clarifying the query.
        """, query);
}

Real example from AskYourDocs: a client asked about the terms of a specific contract. The document was not uploaded to the base — but vector search found a general contract template with a score of 0.63. Without grounding, the agent answered with the template's terms as if they were the terms of the specific client. After adding a threshold of 0.75 — the agent correctly responds that the specific contract was not found.

Scenario 3: Execution error

The tool crashed with an exception. External API unavailable. Timeout. Rate limit exhausted. This is a technical error — and it's the easiest to handle if done correctly.

Why it's dangerous without handling: the model receives an error message and does one of two things — either ignores it and answers from its own knowledge, or relays the technical details of the error directly to the user. Both options are bad.

// Tool returned an error
{
  "error": "Connection timeout after 5000ms to https://api.tavily.com",
  "results": null
}

// WITHOUT grounding — option 1: the model ignores the error
// "According to recent studies, vibe coding increases productivity by 40%..."
// (invented because it didn't know what to do with the error)

// WITHOUT grounding — option 2: the model relays the error
// "A connection error occurred to https://api.tavily.com after 5000ms timeout..."
// (technical details that users shouldn't see)

How to handle in Java — try-catch with an explicit grounding instruction:

@Tool(description = """
    Searches for current information on the internet via Tavily.
    Use for fresh news and statistics.
    """)
public String searchWeb(String query) {
    log.info("Tavily search: '{}'", query);

    if (apiKey == null || apiKey.isBlank()) {
        // API key not configured — explicit grounding instruction
        return """
            CONFIGURATION ERROR: search is unavailable.
            INSTRUCTION: inform the user that online search
            is currently not configured. Do not answer from your own knowledge.
            """;
    }

    try {
        TavilyResponse response = callTavilyApi(query);

        if (response == null || response.results().isEmpty()) {
            return buildNotFoundMessage(query);
        }

        return formatResults(response.results());

    } catch (ResourceAccessException e) {
        // Timeout or service unavailability
        log.warn("Tavily timeout for '{}': {}", query, e.getMessage());
        return """
            TEMPORARY ERROR: search service is unavailable.
            INSTRUCTION: inform the user that the search is temporarily
            unavailable and suggest trying again later.
            DO NOT answer from your own knowledge instead of searching.
            """;

    } catch (HttpClientErrorException e) {
        // Rate limit or authorization error
        log.error("Tavily API error for '{}': {} {}",
            query, e.getStatusCode(), e.getMessage());
        return """
            API ERROR: request limit exceeded.
            INSTRUCTION: inform the user that the search
            is temporarily limited. Try to answer
            only if you have accurate knowledge on the topic.
            """;

    } catch (Exception e) {
        log.error("Unexpected Tavily error for '{}': {}", query, e.getMessage());
        return """
            UNKNOWN SEARCH ERROR.
            INSTRUCTION: inform the user that a
            technical problem has occurred. Do not answer from your own knowledge.
            """;
    }
}
Note the pattern: each catch block returns a different grounding instruction depending on the error type. Rate limit — you can try to answer if you have accurate knowledge. Timeout — better not to answer at all. This is a subtle but important difference that affects the quality of the agent's answers.

| Scenario | What the tool returns | Risk without grounding | Handling strategy | Priority |
| --- | --- | --- | --- | --- |
| Empty result | results: [] | Hallucination from training data | Explicit instruction in content | 🔴 Critical |
| Irrelevant result | results with score < 0.55 | Answering the wrong question | Threshold check + warning | 🔴 Critical |
| Execution error | exception / timeout | Technical details in the answer | try-catch with separate instructions | 🟠 High |

Pitfall: do not use an empty string as a return value for an error. The model interprets an empty string differently depending on the context — sometimes as "not found," sometimes as "can answer independently." Always return an explicit textual instruction about what happened and what the model should do next.
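
One way to enforce this rule in a single place is a guard that every tool passes its output through before returning. This is a minimal sketch. ToolResultGuard and its method name are hypothetical, and the message wording simply mirrors the examples above.

```java
/** Hypothetical helper that keeps blank output from ever reaching the model. */
final class ToolResultGuard {

    private ToolResultGuard() {}

    /** Returns raw unchanged when it has content; otherwise an explicit not-found message. */
    static String orExplicitNotFound(String raw, String query) {
        if (raw != null && !raw.isBlank()) {
            return raw;
        }
        // Blank or null output is replaced by a status plus an instruction
        return String.format(
            "RESULT: nothing found for query \"%s\"\n"
            + "INSTRUCTION FOR MODEL: do not answer from your own knowledge.\n"
            + "Inform the user that the information is missing from the knowledge base.",
            query);
    }
}
```

A tool method would end with something like return ToolResultGuard.orExplicitNotFound(formatted, query); so that an accidental empty string cannot slip through.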

What the model does with a bad result

When the model receives a bad tool result — it doesn't stop and ask "what to do next." It continues to generate an answer. Always. Automatically. And it does so in one of three ways — each with a different level of risk and a different way of diagnosing.

Option A: Ignores and answers from memory

The most common and most dangerous option. The model sees an empty or irrelevant result — and "decides" it's better to answer from its own training knowledge than to admit ignorance. The answer sounds confident and coherent. No warning.

Why this is the most dangerous option: the error is impossible to detect from the answer. There's no "I'm not sure." No "maybe." Just a confident answer that could be completely fabricated.

// Situation: client asks about price, document not found
// Tool returned: results: []

// Internal model process (simplified):
// "Search result is empty...
//  But I was trained on thousands of SaaS pricing pages.
//  I know that Enterprise plans usually cost $300-800/month.
//  I'll answer based on general knowledge. It will sound convincing."

// What the user receives:
// "The Enterprise plan includes an unlimited number of users
//  and costs $500 per month with annual payment."
// stop_reason: "end_turn" — no tool_use in logs after empty result

// Real price: $850/month after a price increase 2 months ago

How to detect in logs: this is the only option where diagnostics require active effort. The answer looks normal — the problem is only visible if you log the entire tool calling cycle.

// What to look for in logs to detect Option A:
// 1. Tool call exists
// 2. Tool result exists but is empty or with low score
// 3. stop_reason == "end_turn" — model answered without a subsequent tool call
// 4. The answer contains specific numbers or facts not present in the tool result

// Add to AgentConversationRunner:
log.info("Tool result length: {} chars, stop_reason: {}",
    toolResult.length(),
    response.getStopReason());

// Constant-first equals avoids a NullPointerException when stop_reason is missing
if (toolResult.isBlank() && "end_turn".equals(response.getStopReason())) {
    log.warn("GROUNDING RISK: empty tool result but model answered directly. " +
             "Query: '{}'", userQuery);
}

Example from Agent Chat: in one of the test dialogues, the agent provided statistics "according to Stanford research" — but the Wikipedia tool returned an empty result for this query. The model simply "remembered" similar statistics from its training data and presented it as a real fact. In the logs, it looked like a normal successful request.
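
The fourth signal in the checklist above (specific numbers in the answer that are absent from the tool result) can be approximated mechanically. This is a rough sketch using only the standard library. UngroundedNumberDetector is a hypothetical name, and the regex is a heuristic that only looks at digits, so it will miss facts expressed in words.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical heuristic: flags numbers the answer contains but the tool result does not. */
final class UngroundedNumberDetector {

    // Matches integers and decimals such as 500, 0.5 or 1,200
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:[.,]\\d+)*");

    private UngroundedNumberDetector() {}

    /** Extracts all numbers from the text, with thousands separators removed. */
    static Set<String> numbersIn(String text) {
        Set<String> found = new LinkedHashSet<>();
        if (text == null) {
            return found;
        }
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            found.add(m.group().replace(",", ""));
        }
        return found;
    }

    /** Numbers that appear in the answer but nowhere in the tool result. */
    static Set<String> ungroundedNumbers(String answer, String toolResult) {
        Set<String> ungrounded = numbersIn(answer);
        ungrounded.removeAll(numbersIn(toolResult));
        return ungrounded;
    }
}
```

A non-empty set is a reason to log a warning and review the dialogue. It is not proof of a hallucination, because the model may legitimately compute or reformat a figure.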

Option B: Builds an answer on an irrelevant result

The model sees that something was found — and accepts it as a sufficient basis. The semantic similarity between the query and the result "deceives" the model. It doesn't check if the document actually answers the question — it just sees "there is text → can answer".

Why this happens: the model is trained to answer based on context. If there is text about contracts in the context — it answers about contracts. It doesn't distinguish between "general template" and "specific client contract" if both contain similar words.

// Query: "what are the penalty clauses in the contract with client Alpha?"
// Tool found the closest document with a score of 0.63:
{
  "title": "Typical Service Agreement v2.1",
  "content": "In case of violation of contract terms — penalty of 0.1% per day...",
  "score": 0.63
}

// Internal model process:
// "There is a result about contracts and penalties. I'll answer based on it."

// What the user receives:
// "According to the contract, penalty clauses are 0.1% per day..."

// Actual terms of the contract with Alpha:
// penalty of 0.5% per day + right to terminate after 5 days of delay
// (they signed non-standard terms 6 months ago)

How to detect in logs: this option is visible through the score metric — but only if you log it.

// Logging the score for each tool result
@Tool(description = "Searches the knowledge base")
public String search(String query) {
    List<SearchResult> results = vectorStore.search(query);

    if (!results.isEmpty()) {
        double score = results.get(0).getScore();
        // SLF4J only understands plain {} placeholders, so format the score first
        String formattedScore = String.format("%.3f", score);
        log.info("Search score for '{}': {} ({})",
            query, formattedScore, results.get(0).getTitle());

        // Alert if the score is suspiciously low but a result exists
        if (score < 0.65) {
            log.warn("LOW RELEVANCE SCORE {} for query: '{}'. " +
                     "Document: '{}'", formattedScore, query, results.get(0).getTitle());
        }
    }

    // ... result handling
}

Option C: Admits the problem (rare and non-deterministic)

Frontier models — Claude Sonnet, GPT-4o, Gemini Pro — sometimes recognize on their own that the result doesn't answer the query and honestly say so. This is the best behavior. But there's a critical problem: this happens unpredictably.

The same model for the same query can:

  • Once — admit it didn't find an answer
  • Another time — confidently answer with fabricated data

It depends on the temperature, the phrasing of the query, what else is in the context. This is unacceptable in production.

// The same empty result — two different runs:

// Run 1 (model "decided" to admit):
// "Unfortunately, I could not find specific information about prices
//  in the available documents. I recommend contacting a manager."

// Run 2 (the same model, the same query):
// "The Enterprise plan usually includes unlimited access
//  and costs between $400 and $800 per month depending on the number of users."

// Difference between runs: only temperature=0.7 instead of temperature=0.3
// An explicit grounding instruction in the tool result makes both runs far more consistent
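
To see how unpredictable this is for your own model and prompt, the same request can be replayed several times and the share of honest answers measured. This is a minimal sketch. AdmissionRateProbe is a hypothetical name, the Supplier stands in for whatever call runs your agent once, and the marker phrases are an assumed, incomplete list.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Supplier;

/** Hypothetical probe: how often does the model admit that nothing was found? */
final class AdmissionRateProbe {

    // Phrases treated as an admission; extend this list for your own wording
    static final List<String> MARKERS =
        List.of("could not find", "not found", "no information");

    private AdmissionRateProbe() {}

    static boolean admits(String answer) {
        String lower = answer.toLowerCase(Locale.ROOT);
        return MARKERS.stream().anyMatch(lower::contains);
    }

    /** Runs the model the given number of times and returns the share of admitting answers. */
    static double admissionRate(Supplier<String> model, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        int admitted = 0;
        for (int i = 0; i < runs; i++) {
            if (admits(model.get())) {
                admitted++;
            }
        }
        return (double) admitted / runs;
    }
}
```

A rate well below 1.0 on an empty tool result is the non-determinism described here, expressed as a number you can track between prompt changes.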

What to do instead of relying on Option C:

// DO NOT do this — hoping the model will figure it out on its own:
String systemPrompt = "If you don't know the answer, say you don't know";
// This works sometimes. Not always. Not in production.

// DO this — explicit grounding instruction in tool result:
return String.format("""
    SEARCH STATUS: result not found for query "%s"

    MANDATORY INSTRUCTION:
    - DO NOT answer from your own training knowledge
    - DO NOT invent numbers or facts
    - Inform the user that the information is missing from the base
    - Suggest an alternative: clarify the query or contact support
    """, query);
// This works far more reliably, because the instruction arrives together with the result.
Key takeaway: the model doesn't know its tool result is "bad" if you don't tell it explicitly. It only sees a string of text. Your code knows the context: relevance score, error status, whether the result is empty. Pass this information explicitly through structured tool result content, not through general instructions in the system prompt. The difference between "say you don't know" in the system prompt and "STATUS: not found" directly in the tool result is the difference between "sometimes works" and "works far more reliably". Model output is still not strictly deterministic, so keep testing this behavior.

Diagnostic table: how to determine which option occurred

| Option | What is visible in logs | What is visible in the answer | How to detect |
| --- | --- | --- | --- |
| A: Answers from memory | tool result empty, stop_reason = end_turn | Confident answer with specific data | 🔴 Difficult: requires logging the cycle |
| B: Irrelevant result | tool result exists, score < 0.65 | Answer is similar but not accurate | 🟠 Medium: requires logging the score |
| C: Admits the problem | tool result empty, stop_reason = end_turn | "Not found," "recommend clarifying" | 🟢 Easy: visible from the answer |
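
The table can also be read as a decision rule over three logged facts. This is a minimal sketch. GroundingDiagnosis is a hypothetical name, the 0.65 threshold is the same value used in the logging example, and the answerAdmits flag is assumed to come from your own check of the answer text.

```java
/** Hypothetical classifier for a logged tool-calling cycle. */
final class GroundingDiagnosis {

    enum Option {
        A_ANSWERED_FROM_MEMORY,
        B_BUILT_ON_IRRELEVANT_RESULT,
        C_ADMITTED_PROBLEM,
        NO_ISSUE_DETECTED
    }

    // Same value as the low-score warning threshold in the logging example
    static final double LOW_SCORE = 0.65;

    private GroundingDiagnosis() {}

    static Option classify(boolean toolResultEmpty, double bestScore, boolean answerAdmits) {
        if (answerAdmits) {
            return Option.C_ADMITTED_PROBLEM;          // visible from the answer itself
        }
        if (toolResultEmpty) {
            return Option.A_ANSWERED_FROM_MEMORY;      // answered although nothing was found
        }
        if (bestScore < LOW_SCORE) {
            return Option.B_BUILT_ON_IRRELEVANT_RESULT; // answered on a weak match
        }
        return Option.NO_ISSUE_DETECTED;
    }
}
```

Counting these outcomes per day gives a simple grounding health metric. A growing share of A or B is the signal to revisit thresholds and tool result wording.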

is_error: true vs empty content — the difference that matters

Anthropic's Messages API supports an is_error field on tool results, and whether your framework exposes it depends on the provider and version. Many developers ignore it, passing either an empty string or an error message without the flag. This is one of the most common reasons why an agent behaves unpredictably when errors occur.

How the model technically reads a tool result

To understand the difference, you need to know how a tool result gets into the model's context. It's not just a text string. It's a structured block with metadata:

// What the LLM sees inside the context (simplified, Anthropic Messages API shape):
{
  "type": "tool_result",
  "tool_use_id": "toolu_01ABC",
  "content": "...",      // result text
  "is_error": false      // or true
}

// The model is trained to distinguish these two states:
// is_error: false → "this is a normal result, use it for the answer"
// is_error: true  → "something went wrong, do not build the answer on this"

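This is why is_error: true is not just semantics. It is a structured signal that the tool call failed, and models trained for tool use generally treat such a result as something not to build an answer on. The flag alone does not forbid answering from the model's own knowledge, so keep the explicit instruction in the content as well.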

Empty content — the model decides for itself

If you return an empty string without is_error — the model receives an ambiguous signal. It sees "the tool responded, but returned nothing" and decides how to interpret it on its own. And this decision is unpredictable.

// ❌ Don't do this — the model decides for itself
ToolResponseMessage emptyResult = new ToolResponseMessage(
    toolCallId,
    ""  // empty string without is_error
);

// The model might interpret this as:
// - "nothing found — I'll say I don't know" (correct, but rare)
// - "the result is still loading — I'll wait" (incorrect)
// - "I can answer from my own knowledge" (dangerous — the most common option)
// - "the tool broke — I'll complain technically" (bad UX)

// The same situation can yield different results across different runs

is_error: true — an explicit deterministic instruction

// ✅ This is the correct way — an explicit signal for the model
ToolResponseMessage errorResult = new ToolResponseMessage(
    toolCallId,
    "Document not found in the knowledge base for the query: Alpha contract terms",
    true  // is_error — the model knows what to do
);

// The model receives a clear signal:
// - this is an explicit error, not a missing result
// - do not build an answer based on this
// - inform the user about the problem
// - do not try to "fill in" the answer from its own knowledge

Full GroundedToolResultBuilder — four states

In practice, a tool result has four different states — each requires separate handling:

@Service
@Slf4j
public class GroundedToolResultBuilder {

    /**
     * Successful result with high relevance (score >= 0.75)
     * is_error: false — the model can build an answer
     */
    public ToolResponseMessage success(String toolCallId,
                                       String content,
                                       String documentTitle,
                                       String sourceRef) {
        String groundedContent = String.format("""
            STATUS: RESULT FOUND
            
            CONTENT:
            %s
            
            SOURCE: %s
            DOCUMENT: %s
            INSTRUCTION: use this result to answer.
            Be sure to cite the source in the format "According to [document]..."
            """, content, sourceRef, documentTitle);

        log.info("Tool result: SUCCESS, doc='{}'", documentTitle);
        return new ToolResponseMessage(toolCallId, groundedContent);
        // is_error defaults to false
    }

    /**
     * Result found but relevance is medium (score 0.55-0.75)
     * is_error: false — but with a warning for the model
     */
    public ToolResponseMessage lowRelevance(String toolCallId,
                                             String content,
                                             String documentTitle,
                                             double score) {
        String groundedContent = String.format("""
            STATUS: RESULT WITH LOW RELEVANCE (score: %.2f)
            
            CONTENT:
            %s
            
            DOCUMENT: %s
            WARNING: the result may not accurately answer the user's query.
            Use with caution. Inform the user that the answer
            may be inaccurate and recommend clarifying the query.
            """, score, content, documentTitle);

        // SLF4J only understands plain {} placeholders, so format the score first
        log.warn("Tool result: LOW_RELEVANCE, score={}, doc='{}'",
            String.format("%.2f", score), documentTitle);
        return new ToolResponseMessage(toolCallId, groundedContent);
    }

    /**
     * Nothing found in the knowledge base
     * is_error: true — critical that the model does not answer from memory
     */
    public ToolResponseMessage notFound(String toolCallId, String query) {
        String message = String.format("""
            STATUS: RESULT NOT FOUND
            
            Query: "%s"
            
            MANDATORY INSTRUCTION:
            - DO NOT answer from your own training knowledge
            - DO NOT invent facts or figures
            - Inform the user that the information is not available in the knowledge base
            - Suggest clarifying the query or contacting support
            """, query);

        log.warn("Tool result: NOT_FOUND, query='{}'", query);
        return new ToolResponseMessage(toolCallId, message, true); // is_error: true
    }

    /**
     * Technical error — API unavailable, timeout, etc.
     * is_error: true — the model should report a technical problem
     */
    public ToolResponseMessage technicalError(String toolCallId,
                                               String errorMessage) {
        String message = String.format("""
            STATUS: TECHNICAL SEARCH ERROR
            
            Details (for logs only, do not show to user): %s
            
            INSTRUCTION:
            - Inform the user that the search is temporarily unavailable
            - DO NOT pass technical error details to the user
            - DO NOT answer from your own knowledge instead of searching
            - Suggest trying again later
            """, errorMessage);

        log.error("Tool result: TECHNICAL_ERROR — {}", errorMessage);
        return new ToolResponseMessage(toolCallId, message, true); // is_error: true
    }
}

Comparison table: when to use which state

| State | is_error | When to use | What the model does |
| --- | --- | --- | --- |
| success() | false | score >= 0.75, document found | Builds an answer, cites the source |
| lowRelevance() | false | score 0.55–0.75, something found | Answers with a caveat |
| notFound() | true | results empty or score < 0.55 | Informs that nothing was found |
| technicalError() | true | Exception, timeout, API error | Informs about a technical problem |
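
The selection logic in this table can be kept in one small function so that every tool applies the same thresholds. This is a minimal sketch. ToolResultStateSelector and its State enum are hypothetical names, and the thresholds are the 0.75 and 0.55 values from the table.

```java
import java.util.OptionalDouble;

/** Hypothetical selector that maps a search outcome to one of the four states. */
final class ToolResultStateSelector {

    enum State { SUCCESS, LOW_RELEVANCE, NOT_FOUND, TECHNICAL_ERROR }

    static final double HIGH = 0.75;
    static final double MEDIUM = 0.55;

    private ToolResultStateSelector() {}

    /**
     * @param bestScore empty when the search returned nothing
     * @param error     non-null when the tool call itself failed
     */
    static State select(OptionalDouble bestScore, Throwable error) {
        if (error != null) {
            return State.TECHNICAL_ERROR;
        }
        if (bestScore.isEmpty() || bestScore.getAsDouble() < MEDIUM) {
            return State.NOT_FOUND;
        }
        if (bestScore.getAsDouble() < HIGH) {
            return State.LOW_RELEVANCE;
        }
        return State.SUCCESS;
    }

    /** Mirrors the is_error column of the table. */
    static boolean isError(State state) {
        return state == State.NOT_FOUND || state == State.TECHNICAL_ERROR;
    }
}
```

The returned state then decides which builder method to call, which keeps threshold changes in a single place.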

How to check that is_error actually works

Add a simple test to ensure the model truly behaves differently depending on is_error:

@SpringBootTest
class IsErrorBehaviorTest {

    @Autowired
    private ChatModel chatModel;

    @Test
    void modelShouldNotHallucinateWhenIsErrorTrue() {
        // Simulate an empty result with is_error: true
        List<Message> messages = List.of(
            new SystemMessage("Answer only based on search results."),
            new UserMessage("What is the price for the Enterprise plan?"),
            new AssistantMessage(""),  // placeholder
            new ToolResponseMessage(
                "test-id",
                "STATUS: RESULT NOT FOUND\n" +
                "INSTRUCTION: do not answer from your own knowledge.",
                true  // is_error: true
            )
        );

        String response = chatModel.call(new Prompt(messages))
            .getResult().getOutput().getText();

        // The model should NOT invent a price
        assertThat(response)
            .doesNotContain("$")
            .doesNotContain("500")
            .doesNotContain("month")
            .containsAnyOf("not found", "unavailable", "could not find");
    }

    @Test
    void modelShouldAnswerWhenIsErrorFalse() {
        // Simulate a successful result
        List<Message> messages = List.of(
            new SystemMessage("Answer only based on search results."),
            new UserMessage("What is the price for the Enterprise plan?"),
            new AssistantMessage(""),
            new ToolResponseMessage(
                "test-id",
                "STATUS: RESULT FOUND\n" +
                "CONTENT: Enterprise plan — $850/month with annual payment.\n" +
                "SOURCE: Price List v2.3",
                false  // is_error: false
            )
        );

        String response = chatModel.call(new Prompt(messages))
            .getResult().getOutput().getText();

        // The model SHOULD use the figure from the result
        assertThat(response)
            .contains("850")
            .containsAnyOf("According to", "price list", "document");
    }
}
Practical advice for Spring AI: in Spring AI 2.0.x, the ToolResponseMessage(id, content, isError) constructor may differ depending on the version — check the signature in your M3/M5. If the three-parameter constructor is unavailable — use the builder or pass is_error through the content with an explicit INSTRUCTION for the model. Explicit text in the content ("DO NOT answer from your own knowledge") works almost as reliably as the flag — and is a fallback option for any Spring AI version.

Confidence scoring — how to ask the model to assess quality

Confidence scoring is a technique where we ask the model to explicitly assess how well the found result answers the query before building the final response. This is an additional step — but it closes a blind spot that vector score doesn't cover.

Why it's needed — and why vector score isn't enough

Your code knows technical metrics — vector search score, number of documents found. But it doesn't know *semantic* relevance.

Here's a specific example where vector score lied:

// Query: "penalties for late payment in the contract with Beta Corp"
// Vector search returned:
{
  "title": "Contract with Beta Corp: main terms",
  "score": 0.82,  // high relevance!
  "content": "The parties agree on the following terms of cooperation: performance deadlines, payment procedure, liability of parties..."
}

// The document is correct — but the section on penalties is elsewhere
// Vector score is 0.82 — because the document is indeed about this contract
// But there is NO ANSWER to the question about penalties here

// Confidence scoring will reveal this:
// { "confidence": "LOW", "reason": "document found but penalties not described", 
//   "can_answer": false }

The model understands semantics better than a numerical score — ask it to check before answering.

When to use confidence scoring

| Query type | Use? | Reason |
| --- | --- | --- |
| Prices, tariffs | ✅ Yes | Error = financial consequences |
| Contract terms | ✅ Yes | Error = legal consequences |
| Specific dates, deadlines | ✅ Yes | Accuracy is critical |
| General FAQ questions | ⚠️ Optional | Vector score is sufficient |
| Simple topic search | ❌ No | Adds latency without significant benefit |
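
This table can be encoded as a small policy so the extra model call is only made where it pays off. This is a minimal sketch, assuming Java 14+ for switch expressions. ConfidenceScoringPolicy, the QueryType values and the latencyBudgetAllows flag are hypothetical, and how a query gets its type is left to your own classifier.

```java
/** Hypothetical policy: when is the extra confidence-scoring call worth it? */
final class ConfidenceScoringPolicy {

    enum QueryType { PRICING, CONTRACT_TERMS, DATES_AND_DEADLINES, GENERAL_FAQ, TOPIC_SEARCH }

    private ConfidenceScoringPolicy() {}

    /**
     * @param latencyBudgetAllows whether the caller can afford one more model call
     */
    static boolean shouldScore(QueryType type, boolean latencyBudgetAllows) {
        return switch (type) {
            // Errors here have financial or legal consequences: always check
            case PRICING, CONTRACT_TERMS, DATES_AND_DEADLINES -> true;
            // Optional: vector score is usually enough, check only if there is time
            case GENERAL_FAQ -> latencyBudgetAllows;
            // Extra latency without significant benefit
            case TOPIC_SEARCH -> false;
        };
    }
}
```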

Implementation in Spring AI

@Service
@RequiredArgsConstructor
@Slf4j
public class ConfidenceScoringService {

    private final ChatModel chatModel;

    private static final String CONFIDENCE_PROMPT = """
        Assess how well the found result answers the user's query.
        
        USER QUERY: %s
        
        FOUND RESULT:
        %s
        
        Respond ONLY in JSON format without explanations, without markdown, without ```json:
        {
          "confidence": "HIGH" | "MEDIUM" | "LOW" | "NOT_RELEVANT",
          "reason": "one sentence why",
          "can_answer": true | false
        }
        
        Assessment criteria:
        - HIGH: the result directly and fully answers the query — answer confidently
        - MEDIUM: the result partially answers — answer with a caveat
        - LOW: the result is related to the topic but doesn't answer specifically — better to admit
        - NOT_RELEVANT: the result is not related to the query — do not answer
        """;

    public ConfidenceResult evaluate(String userQuery, String toolResult) {
        
        // Do not assess if the result is explicitly empty
        if (toolResult == null || toolResult.isBlank()) {
            return ConfidenceResult.notFound();
        }

        String prompt = String.format(CONFIDENCE_PROMPT, userQuery, toolResult);

        // Declared outside try so the catch blocks can log the same raw response
        // instead of calling the model a second time
        String response = null;

        try {
            response = chatModel.call(prompt).trim();

            // Remove possible markdown backticks that some models add
            // despite the instruction (deepseek, llama are prone to this)
            String cleanJson = response
                .replaceAll("```json", "")
                .replaceAll("```", "")
                .trim();

            ObjectMapper mapper = new ObjectMapper();
            JsonNode json = mapper.readTree(cleanJson);

            // path() returns a MissingNode instead of null, so an absent field
            // ends up in the IllegalArgumentException branch, not in a NullPointerException
            ConfidenceLevel level = ConfidenceLevel.valueOf(
                json.path("confidence").asText());
            String reason = json.path("reason").asText();
            boolean canAnswer = json.path("can_answer").asBoolean();

            log.info("Confidence evaluated: level={}, canAnswer={}, reason='{}'",
                level, canAnswer, reason);

            return ConfidenceResult.builder()
                .level(level)
                .reason(reason)
                .canAnswer(canAnswer)
                .build();

        } catch (JsonProcessingException e) {
            // Model returned non-JSON: log the raw response and use a safe default
            log.warn("Failed to parse confidence JSON response. " +
                     "Raw response: '{}'. Defaulting to LOW.",
                     response);
            return ConfidenceResult.safe(); // LOW + canAnswer: false

        } catch (IllegalArgumentException e) {
            // Unknown or missing ConfidenceLevel in response
            log.warn("Unknown confidence level in response: '{}'. Defaulting to LOW.",
                     response);
            return ConfidenceResult.safe();
        }
    }
}

@Value
@Builder
public class ConfidenceResult {
    ConfidenceLevel level;
    String reason;
    boolean canAnswer;

    // Factory methods for typical states
    public static ConfidenceResult notFound() {
        return ConfidenceResult.builder()
            .level(ConfidenceLevel.NOT_RELEVANT)
            .reason("Result is empty")
            .canAnswer(false)
            .build();
    }

    public static ConfidenceResult safe() {
        return ConfidenceResult.builder()
            .level(ConfidenceLevel.LOW)
            .reason("Failed to assess relevance")
            .canAnswer(false)
            .build();
    }
}

public enum ConfidenceLevel {
    HIGH, MEDIUM, LOW, NOT_RELEVANT
}
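
Stripping backticks handles fenced output, but some models also wrap the JSON in a sentence or two. One possible hardening step, sketched below under the assumption that the response contains a single JSON object, is to cut out everything between the first opening brace and the last closing brace before handing the text to the parser. JsonExtractor is a hypothetical helper, not part of Spring AI or Jackson.

```java
final class JsonExtractor {

    private JsonExtractor() {}

    /**
     * Returns the substring from the first '{' to the last '}' of the raw
     * model response, or null when no such pair exists.
     * The result is not validated here; the JSON parser still has to do that.
     */
    static String extractJsonObject(String raw) {
        if (raw == null) {
            return null;
        }
        int start = raw.indexOf('{');
        int end = raw.lastIndexOf('}');
        if (start < 0 || end <= start) {
            return null;
        }
        return raw.substring(start, end + 1);
    }
}
```

A null return can be treated the same way as a parse failure: fall back to the safe LOW result.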

What a real confidence scoring result looks like

Here are three real examples from AskYourDocs — what confidence scoring returns for different queries:

// Example 1: HIGH confidence — the answer is directly in the document
Query: "what is the price for the Basic plan?"
Search result: "Basic plan — $49/month, up to 5 users..."
Confidence: {
  "confidence": "HIGH",
  "reason": "the document contains a direct answer to the question about the price",
  "can_answer": true
}
→ The agent answers confidently with a quote

// Example 2: LOW confidence — the document is there but the answer is not
Query: "penalties for late payment"
Search result: "Service Agreement. Section 3: Payment Procedure..."
Confidence: {
  "confidence": "LOW", 
  "reason": "the document is about payment but penalties are not described in this section",
  "can_answer": false
}
→ The agent informs that the exact information was not found

// Example 3: NOT_RELEVANT — the search found the wrong document
Query: "contract terms with Gamma client"
Search result: "General Service Terms v1.0..."
Confidence: {
  "confidence": "NOT_RELEVANT",
  "reason": "a general template was found, not the specific client's contract",
  "can_answer": false
}
→ The agent informs that Gamma's contract was not found in the database

How to use in an agent pipeline

@Service
@RequiredArgsConstructor
@Slf4j
public class GroundedAgentService {

    private final ConfidenceScoringService confidenceScoring;
    private final GroundedToolResultBuilder resultBuilder;
    private final ChatModel chatModel;

    public String answerWithGrounding(String userQuery,
                                       String toolResult,
                                       String toolCallId,
                                       String documentTitle) {

        // 1. Assess relevance via LLM
        ConfidenceResult confidence = confidenceScoring
            .evaluate(userQuery, toolResult);

        log.info("Grounding decision for '{}': {} (canAnswer={})",
            userQuery, confidence.getLevel(), confidence.isCanAnswer());

        // 2. Choose strategy based on confidence
        ToolResponseMessage groundedResult = switch (confidence.getLevel()) {
            case HIGH ->
                resultBuilder.success(toolCallId, toolResult, documentTitle, "knowledge_base");
            case MEDIUM ->
                resultBuilder.lowRelevance(toolCallId, toolResult, documentTitle, 0.65);
            case LOW, NOT_RELEVANT ->
                resultBuilder.notFound(toolCallId, userQuery);
        };

        // 3. Pass to the model with grounding instructions
        List<Message> messages = List.of(
            new SystemMessage("""
                Answer ONLY based on the provided search results.
                If the result is marked as 'not found' — inform about it.
                Always cite the source of information.
                """),
            new UserMessage(userQuery),
            new AssistantMessage(""),
            groundedResult
        );

        return chatModel.call(new Prompt(messages))
            .getResult()
            .getOutput()
            .getText();
    }
}

Cost and latency of confidence scoring: this is an additional LLM request of roughly 150-300 input tokens and about 50 output tokens. With deepseek-chat via OpenRouter, that is about $0.0001-0.0003 per request, with 200-500 ms of added latency depending on the model and provider. For critical queries (prices, contracts) the cost is justified. For simple informational queries where an error has no serious consequences, a vector score >= 0.75 is sufficient, and confidence scoring can be skipped entirely.

Re-query pattern — when to try again

If the first search didn't yield a good result — is it worth trying again with a different query? The answer is yes, but only in specific situations and with a strict attempt limit.

Re-query is not "search for the same thing again." It's rephrasing the query based on what the first search didn't find. The difference is fundamental.

Re-query cycle flow

User query
      ↓
  Search (attempt 1)
      ↓
  Result found?
  ├── NO → Rephrase query → Search (attempt 2)
  │                                    ↓
  │                               Result found?
  │                               ├── NO → notFound()
  │                               └── YES → Confidence scoring
  │                                              ↓
  │                                    HIGH/MEDIUM → found()
  │                                    LOW → notFound()  
  │                                    NOT_RELEVANT → notFound()
  └── YES → Confidence scoring
                  ↓
        HIGH/MEDIUM → found()
        LOW → Rephrase → Search (attempt 2)
        NOT_RELEVANT → notFound() (no re-query — document is missing)

When re-query is justified

  • The first query is too specific — "penalties contract Alpha Corp clause 5.2" → better "liability terms Alpha Corp"
  • The query contains technical jargon — "SLA uptime guarantee" → "service availability guarantee"
  • Confidence is LOW but not NOT_RELEVANT — the document is related to the topic but not the right section. There's a chance to find the needed section with a different query

When re-query won't help

  • Confidence is NOT_RELEVANT — the search found a document on a completely different topic. The document being searched for simply doesn't exist in the database. Rephrasing won't help
  • The tool returned a technical error — the problem is with the infrastructure, not the query. Re-query will only increase the number of erroneous requests
  • MAX_ATTEMPTS variants have already been tried — stop and admit that nothing was found
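
The two lists above, together with the flow diagram, boil down to one decision per search attempt. The sketch below expresses it as a pure function so it can be unit-tested without a model or a database. NextStep and ReQueryDecision are hypothetical names, and ConfidenceLevel is redeclared here only so the snippet compiles on its own.

```java
import java.util.Objects;

// Same values as the article's ConfidenceLevel, redeclared for a standalone sketch
enum ConfidenceLevel { HIGH, MEDIUM, LOW, NOT_RELEVANT }

// Hypothetical outcome of a single search attempt
enum NextStep { ANSWER, REQUERY, NOT_FOUND, TECHNICAL_ERROR }

final class ReQueryDecision {

    private ReQueryDecision() {}

    /**
     * Decides what to do after one search attempt.
     * level may be null when the result is empty or the tool failed.
     */
    static NextStep decide(ConfidenceLevel level, boolean emptyResult,
                           boolean toolError, int attempt, int maxAttempts) {
        if (toolError) {
            // Infrastructure problem: rephrasing the query cannot fix it
            return NextStep.TECHNICAL_ERROR;
        }
        boolean attemptsLeft = attempt < maxAttempts;
        if (emptyResult) {
            return attemptsLeft ? NextStep.REQUERY : NextStep.NOT_FOUND;
        }
        Objects.requireNonNull(level, "level is required for a non-empty result");
        return switch (level) {
            case HIGH, MEDIUM -> NextStep.ANSWER;
            case NOT_RELEVANT -> NextStep.NOT_FOUND; // document is missing, stop now
            case LOW -> attemptsLeft ? NextStep.REQUERY : NextStep.NOT_FOUND;
        };
    }
}
```

For example, decide(ConfidenceLevel.LOW, false, false, 1, 2) asks for a re-query, while the same call with attempt 2 gives up with NOT_FOUND.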

Real examples of rephrasing

Here's how the LLM rephrases queries in practice, taken from real AskYourDocs logs:

// Example 1: too specific query
Original:  "clause 8.3.2 of the Beta Corp cloud storage service agreement"
Attempt 2:  "Beta Corp data storage terms"
Result: found the required section with score 0.81 ✅

// Example 2: technical jargon
Original:  "RTO and RPO SLA enterprise tier"  
Attempt 2:  "enterprise plan service outage recovery guarantees"
Result: found the SLA document with score 0.79 ✅

// Example 3: document is missing — re-query doesn't help
Original:  "contract with Gamma LLC client"
Confidence: NOT_RELEVANT (found a contract with a different client)
Decision:   immediately notFound() without re-query — 
           no point in rephrasing if the document doesn't exist ✅

// Example 4: re-query also didn't find anything
Original:  "discounts for educational institutions"
Attempt 2:  "preferential terms for universities and schools"
Result: both empty → notFound()
           (pricing policy for edu was not uploaded to the database) ✅

Implementation with attempt limit

@Service
@RequiredArgsConstructor
@Slf4j
public class ReQueryService {

    private static final int MAX_ATTEMPTS = 2;

    private final KnowledgeBaseSearchTool searchTool;
    private final ConfidenceScoringService confidenceScoring;
    private final ChatModel chatModel;

    public SearchResult searchWithRequery(String originalQuery) {

        String currentQuery = originalQuery;

        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            log.info("Search attempt {}/{}: '{}'",
                attempt, MAX_ATTEMPTS, currentQuery);

            String result = searchTool.search(currentQuery);

            // Empty result — re-query immediately if attempts remain
            if (result == null || result.isBlank()) {
                if (attempt < MAX_ATTEMPTS) {
                    currentQuery = reformulateQuery(originalQuery, attempt);
                    continue;
                }
                log.warn("All {} attempts exhausted, query: '{}'", 
                    MAX_ATTEMPTS, originalQuery);
                return SearchResult.notFound(originalQuery);
            }

            // Assess semantic relevance
            ConfidenceResult confidence = confidenceScoring
                .evaluate(originalQuery, result);

            log.info("Attempt {} confidence: {} — {}", 
                attempt, confidence.getLevel(), confidence.getReason());

            switch (confidence.getLevel()) {
                case HIGH, MEDIUM -> {
                    // Found — return the result
                    return SearchResult.found(result, confidence, currentQuery, attempt);
                }
                case NOT_RELEVANT -> {
                    // Document is missing — re-query won't help
                    log.info("NOT_RELEVANT result, skipping re-query for: '{}'", 
                        originalQuery);
                    return SearchResult.notFound(originalQuery);
                }
                case LOW -> {
                    // There's a chance to find better — let's try again
                    if (attempt < MAX_ATTEMPTS) {
                        currentQuery = reformulateQuery(originalQuery, attempt);
                    }
                }
            }
        }

        log.warn("Re-query exhausted for: '{}'", originalQuery);
        return SearchResult.notFound(originalQuery);
    }

    private String reformulateQuery(String originalQuery, int attempt) {
        String prompt = String.format("""
            The search query did not find a relevant result: "%s"
            
            Rephrase the query — make it simpler, broader, without jargon.
            This is attempt %d of rephrasing.
            
            Respond ONLY with the new query without explanations or quotes.
            """, originalQuery, attempt);

        String reformulated = chatModel.call(prompt).trim()
            .replaceAll("\"", ""); // remove quotes if the model added them
            
        log.info("Reformulated: '{}' → '{}'", originalQuery, reformulated);
        return reformulated;
    }
}

SearchResult — result class

@Value
@Builder
public class SearchResult {
    boolean found;
    String content;
    ConfidenceResult confidence;
    String usedQuery;      // which specific query found the result
    int attemptsUsed;      // how many attempts were needed

    public static SearchResult found(String content, 
                                      ConfidenceResult confidence,
                                      String usedQuery,
                                      int attempts) {
        return SearchResult.builder()
            .found(true)
            .content(content)
            .confidence(confidence)
            .usedQuery(usedQuery)
            .attemptsUsed(attempts)
            .build();
    }

    public static SearchResult notFound(String originalQuery) {
        return SearchResult.builder()
            .found(false)
            .content("")
            .usedQuery(originalQuery)
            .attemptsUsed(0)
            .build();
    }
}

Attempt limit and cost, from my experience: I settled on MAX_ATTEMPTS = 2 after testing. With 3, the third attempt yielded a better result in less than 5% of cases. In most cases, if nothing was found in two attempts, the document simply doesn't exist in the database, and no rephrasing will help. Each extra attempt involves two additional LLM requests: reformulation plus confidence scoring. With deepseek-chat via OpenRouter, this is ~$0.0002-0.0005 per full re-query cycle, which adds up at 1000 requests per day. Two is enough. More is wasted expense.
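
To put those figures in context, the daily cost is simply requests × share of requests that trigger a re-query × cost per cycle. The sketch below does that arithmetic; ReQueryCost and its parameter names are hypothetical, and the share of requests that need a re-query is something you would measure in your own logs. At the upper bound, where every one of 1000 daily requests triggers a cycle, the per-cycle prices quoted above work out to roughly $0.20 to $0.50 per day.

```java
final class ReQueryCost {

    private ReQueryCost() {}

    /**
     * Estimated daily spend on re-query cycles in USD.
     *
     * @param requestsPerDay   total agent requests per day
     * @param reQueryShare     fraction of requests that trigger a re-query (0.0 to 1.0)
     * @param costPerCycleUsd  cost of one reformulate + confidence scoring cycle
     */
    static double dailyCostUsd(int requestsPerDay, double reQueryShare,
                               double costPerCycleUsd) {
        if (requestsPerDay < 0 || costPerCycleUsd < 0
                || reQueryShare < 0.0 || reQueryShare > 1.0) {
            throw new IllegalArgumentException("arguments out of range");
        }
        return requestsPerDay * reQueryShare * costPerCycleUsd;
    }

    public static void main(String[] args) {
        // Upper bound: every request needs a re-query
        System.out.println("low estimate:  " + dailyCostUsd(1000, 1.0, 0.0002));
        System.out.println("high estimate: " + dailyCostUsd(1000, 1.0, 0.0005));
    }
}
```

The same function makes it easy to see how a third attempt would scale the bill for a gain that showed up in fewer than 5% of cases.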

Citation and traceability — the agent must know where the answer comes from

Citation is not just "adding a link". It's an architectural principle: the agent must know where each part of the answer comes from — and be able to show it. Without citation, you have no way to verify whether the agent answered based on documents or made it up.

Why citation is important — two reasons

Reason 1: Business and trust. Imagine a law firm that uses AskYourDocs. A client asks about the terms of a contract. The agent answers. The lawyer then asks: "On which page of which document is this written?"

Without citation — it's impossible to answer. And the client stops trusting the system. With citation — the agent immediately shows: "Article 5.2 of Contract No. 123 dated 15.03.2025".

Reason 2: Debugging and monitoring. When the agent gives a wrong answer — how do you understand why? Without citation, you see only "the agent answered incorrectly". With citation, you see specifically: "the agent answered based on document v1.2 which was replaced by v2.0 three months ago". This is the difference between "something broke" and "here's exactly where".

// Without citation — you only see the result:
User:  "What is the penalty for late payment?"
Agent: "The penalty is 0.1% per day of the outstanding amount."
// Correct? Incorrect? Where did this figure come from? Unknown.

// With citation — you see the full chain:
User:  "What is the penalty for late payment?"
Agent: "According to the Service Agreement v1.2 (section 7.3,
        indexed 12.01.2025), the penalty is 0.1% per day."
// It's immediately clear: v1.2 — but the current one is v2.0 where the penalty is 0.5%
// Problem found in seconds

CitedSearchResult structure

@Value
@Builder
public class CitedSearchResult {
    String content;              // text of the found fragment
    String documentTitle;        // document title
    String documentId;           // ID in the database
    String documentVersion;      // document version if any
    String pageOrSection;        // page or section
    String sourceUrl;            // link if any
    double relevanceScore;       // vector search score
    LocalDateTime indexedAt;     // when the document was indexed in the database
    LocalDateTime documentDate;  // date of the document itself (may differ)
}

How to store citation metadata in PostgreSQL

Citation only works if metadata is stored along with vector embeddings. Here's how it looks in the schema:

-- Table of documents with metadata for citation
CREATE TABLE knowledge_documents (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title       VARCHAR(500) NOT NULL,
    version     VARCHAR(50),
    section     VARCHAR(200),
    source_url  VARCHAR(1000),
    document_date   TIMESTAMP,
    indexed_at      TIMESTAMP DEFAULT NOW(),
    is_active   BOOLEAN DEFAULT TRUE  -- important for outdated documents
);

-- Table of chunks with a link to the document
CREATE TABLE document_chunks (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID REFERENCES knowledge_documents(id),
    content     TEXT NOT NULL,
    embedding   vector(1536),          -- pgvector
    chunk_index INTEGER,               -- chunk number in the document
    page_number INTEGER                -- page number if any
);

-- Index for vector search
CREATE INDEX ON document_chunks
USING ivfflat (embedding vector_cosine_ops);

// Repository for searching with metadata
@Repository
public interface DocumentChunkRepository extends JpaRepository<DocumentChunk, UUID> {

    @Query(value = """
        SELECT dc.*, kd.title, kd.version, kd.section,
               kd.source_url, kd.document_date, kd.indexed_at,
               1 - (dc.embedding <=> :embedding) as relevance_score
        FROM document_chunks dc
        JOIN knowledge_documents kd ON dc.document_id = kd.id
        WHERE kd.is_active = TRUE
        ORDER BY dc.embedding <=> :embedding
        LIMIT :limit
        """, nativeQuery = true)
    List<ChunkWithMetadata> findWithCitation(
        @Param("embedding") float[] embedding,
        @Param("limit") int limit
    );
}
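
Depending on the JDBC driver and Hibernate version, binding a float[] directly to a pgvector parameter may not work out of the box. One common workaround, shown here as a sketch and not as the article's actual setup, is to pass the embedding in pgvector's text form "[v1,v2,...]" as a String parameter and cast it in SQL with CAST(:embedding AS vector). PgVectorLiteral is a hypothetical helper name.

```java
import java.util.StringJoiner;

final class PgVectorLiteral {

    private PgVectorLiteral() {}

    /**
     * Converts an embedding into pgvector's text representation, e.g. "[0.1,-2.5,1.0]".
     * Non-finite values are rejected because pgvector does not accept NaN or Infinity.
     */
    static String toVectorLiteral(float[] embedding) {
        if (embedding == null || embedding.length == 0) {
            throw new IllegalArgumentException("embedding must not be empty");
        }
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : embedding) {
            if (!Float.isFinite(value)) {
                throw new IllegalArgumentException("embedding contains a non-finite value");
            }
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }
}
```

With this approach the repository method would take a String instead of float[], and the dimension of the literal has to match the vector(1536) column.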

Tool with full citation

@Tool(description = """
    Searches for information in the corporate knowledge base.
    Returns the result with a precise link to the source —
    document title, section, and validity date.
    """)
public String searchWithCitation(String query) {

    // Generate embedding for the query
    float[] queryEmbedding = embeddingModel.embed(query);

    List<ChunkWithMetadata> results = chunkRepository
        .findWithCitation(queryEmbedding, 5);

    if (results.isEmpty()) {
        return """
            RESULT: nothing found
            SOURCE: none
            INSTRUCTION: inform that the information was not found in the knowledge base
            """;
    }

    ChunkWithMetadata best = results.get(0);

    // Check index freshness (based on indexed_at, not the document's own date)
    boolean isRecent = best.getIndexedAt()
        .isAfter(LocalDateTime.now().minusMonths(6));

    String freshnessWarning = isRecent ? "" :
        "\n⚠️ WARNING: the document was indexed more than 6 months ago — " +
        "recommend the user to confirm its relevance.";

    return String.format("""
        RESULT:
        %s

        SOURCE (be sure to indicate in the answer):
        - Document: %s%s
        - Section: %s
        - Page: %s
        - Link: %s
        - Document validity: %s
        - Indexed: %s
        - Relevance: %.0f%%
        %s

        INSTRUCTION: when answering, be sure to indicate the source in the format
        "According to [document name], [section/page]..."
        """,
        best.getContent(),
        best.getTitle(),
        best.getVersion() != null ? " (" + best.getVersion() + ")" : "",
        best.getSection() != null ? best.getSection() : "not specified",
        best.getPageNumber() != null ? best.getPageNumber().toString() : "not specified",
        best.getSourceUrl() != null ? best.getSourceUrl() : "internal document",
        best.getDocumentDate() != null
            ? best.getDocumentDate().toLocalDate().toString()
            : "not specified",
        best.getIndexedAt().toLocalDate(),
        best.getRelevanceScore() * 100,
        freshnessWarning
    );
}

How an answer with citation looks — three levels of detail

// Level 1 — without citation (don't do this):
"The contract can be terminated with 30 days' notice."

// Level 2 — basic citation:
"According to Contract No. 123 (section 5.2),
termination is possible with 30 calendar days' written notice."

// Level 3 — full citation with date (for critical requests):
"According to the Service Agreement No. 123 v2.1 (section 5.2,
page 8), termination is possible with 30 calendar days' written
notice. [Document valid as of 15.03.2025,
indexed on 16.03.2025]"
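
If the citation prefix is assembled in code instead of being left entirely to the model, levels 2 and 3 can be produced from the same metadata. The sketch below is one way to do it; CitationFormatter and its method names are hypothetical, and it assumes title, section and both dates are always present while version and page are optional.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class CitationFormatter {

    // Same day.month.year style as the examples above
    private static final DateTimeFormatter DATE =
        DateTimeFormatter.ofPattern("dd.MM.yyyy", Locale.ROOT);

    private CitationFormatter() {}

    /** Level 2: document and section only. */
    static String basic(String title, String section) {
        return "According to " + title + " (section " + section + ")";
    }

    /** Level 3: adds optional version and page plus both dates. */
    static String full(String title, String version, String section, Integer page,
                       LocalDate documentDate, LocalDate indexedAt) {
        StringBuilder sb = new StringBuilder("According to ").append(title);
        if (version != null) {
            sb.append(' ').append(version);
        }
        sb.append(" (section ").append(section);
        if (page != null) {
            sb.append(", page ").append(page);
        }
        sb.append(") [Document valid as of ").append(DATE.format(documentDate))
          .append(", indexed on ").append(DATE.format(indexedAt)).append(']');
        return sb.toString();
    }
}
```

Building the prefix deterministically means the document name and dates in the answer come from stored metadata, not from the model's paraphrase.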

Citation for Agent Chat — Wikipedia and Tavily

In Agent Chat, agents use external sources — Wikipedia, Tavily, NewsAPI. Citation is especially important here because the reader wants to verify the fact independently.

// Bad — the agent says "according to research":
"According to research, GitHub Copilot increases productivity by 55%."
// Which research? When? Where to check?

// Good — the agent cites a specific source:
"According to a GitHub study (2023, published on github.blog),
developers with Copilot complete tasks 55% faster."
// The reader can go and check

// How to add citation in WikipediaSearchTool
@Tool(description = "Searches for facts on Wikipedia")
public String searchWikipedia(String query) {
    WikiSearchResponse response = callWikipediaApi(query);

    if (response.getResults().isEmpty()) {
        return "Wikipedia: nothing found for query: " + query;
    }

    WikiResult result = response.getResults().get(0);

    // Return the result with explicit citation
    return String.format("""
        WIKIPEDIA RESULT:
        %s

        SOURCE: Wikipedia, article "%s"
        LINK: https://uk.wikipedia.org/wiki/%s
        INSTRUCTION: when answering, indicate the source as
        "According to Wikipedia (article '%s')..."
        """,
        result.getSnippet(),
        result.getTitle(),
        result.getTitle().replace(" ", "_"),
        result.getTitle()
    );
}

Pitfall: outdated documents. Citation shows when the document was indexed, not when it was actually last updated. A document indexed a month ago may contain information from two years ago. Store document_date in the schema separately from indexed_at and show both. If document_date is more than 6 months old, have the tool result tell the model to warn the user about possible obsolescence.
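
A freshness check based on the document's own date could look like the sketch below. FreshnessCheck is a hypothetical helper; it takes "today" as a parameter so the rule can be tested with fixed dates, and it treats a missing document date as possibly outdated, which is an assumption you may want to relax.

```java
import java.time.LocalDate;

final class FreshnessCheck {

    // Threshold taken from the 6-month rule described above
    static final int MAX_AGE_MONTHS = 6;

    private FreshnessCheck() {}

    /** A missing document date counts as possibly outdated, since freshness cannot be confirmed. */
    static boolean isPossiblyOutdated(LocalDate documentDate, LocalDate today) {
        if (documentDate == null) {
            return true;
        }
        return documentDate.isBefore(today.minusMonths(MAX_AGE_MONTHS));
    }

    /** Returns a warning line for the tool result, or an empty string for fresh documents. */
    static String warningFor(LocalDate documentDate, LocalDate indexedAt, LocalDate today) {
        if (!isPossiblyOutdated(documentDate, today)) {
            return "";
        }
        String dated = documentDate == null ? "unknown" : documentDate.toString();
        return "WARNING: document dated " + dated + " (indexed " + indexedAt
            + "); ask the user to confirm it is still current.";
    }
}
```

The returned line can be appended to the tool result in the same place where the tool above adds its freshness warning.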

Java + Spring AI implementation — full pipeline

Let's combine all the concepts into a single grounding pipeline that can be used in a real project.

Component structure

src/main/java/com/example/
├── grounding/
│   ├── GroundingPipeline.java          // main component
│   ├── ConfidenceScoringService.java   // relevance scoring
│   ├── ReQueryService.java             // re-querying
│   ├── CitationBuilder.java            // citation formation
│   └── GroundedToolResultBuilder.java  // tool result formation
├── model/
│   ├── ConfidenceResult.java
│   ├── SearchResult.java
│   └── CitedSearchResult.java

GroundingPipeline — main component

@Service
@RequiredArgsConstructor
@Slf4j
public class GroundingPipeline {

    private final ReQueryService reQueryService;
    private final GroundedToolResultBuilder resultBuilder;
    private final ChatModel chatModel;

    private static final double HIGH_CONFIDENCE_THRESHOLD = 0.75;
    private static final double LOW_CONFIDENCE_THRESHOLD = 0.50;

    /**
     * Main method — processes the request with a full grounding cycle
     */
    public String processWithGrounding(String userQuery, String toolCallId) {

        // 1. Search with possible re-query
        SearchResult searchResult = reQueryService
            .searchWithRequery(userQuery);

        log.info("Search result for '{}': found={}, confidence={}",
            userQuery,
            searchResult.isFound(),
            searchResult.getConfidence());

        // 2. Form the grounded tool result
        ToolResponseMessage toolResponse = buildGroundedResponse(
            toolCallId, userQuery, searchResult);

        // 3. Pass to the model with system instructions
        List<Message> messages = buildMessagesWithGrounding(
            userQuery, toolResponse);

        // 4. Get the final answer
        return chatModel.call(new Prompt(messages))
            .getResult()
            .getOutput()
            .getText();
    }

    private ToolResponseMessage buildGroundedResponse(
            String toolCallId,
            String query,
            SearchResult result) {

        if (!result.isFound()) {
            return resultBuilder.notFound(toolCallId, query);
        }

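        // Note: this assumes ConfidenceResult also carries a numeric score
        // (getScore()) and SearchResult a getSourceReference(); the versions
        // of those classes shown earlier would need these members added
        double score = result.getConfidence().getScore();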

        if (score >= HIGH_CONFIDENCE_THRESHOLD) {
            return resultBuilder.success(
                toolCallId,
                result.getContent(),
                result.getSourceReference()
            );
        } else if (score >= LOW_CONFIDENCE_THRESHOLD) {
            return resultBuilder.lowRelevance(
                toolCallId,
                result.getContent(),
                score
            );
        } else {
            return resultBuilder.notFound(toolCallId, query);
        }
    }

    private List<Message> buildMessagesWithGrounding(
            String userQuery,
            ToolResponseMessage toolResponse) {

        return List.of(
            new SystemMessage("""
                You are a corporate assistant that answers based on company documents.

                GROUNDING RULES:
                1. Answer ONLY based on search results
                2. If the result is marked as "not found" — say you didn't find it
                3. Always indicate the source in the format: "According to [document]..."
                4. If confidence is LOW — warn: "The information found may be inaccurate"
                5. NEVER make up information that is not in the search results
                """),
            new UserMessage(userQuery),
            toolResponse
        );
    }
}

Integration into AgentConversationRunner (Agent Chat)

// In AgentConversationRunner.ask() — add grounding
private String ask(String systemPrompt, List history,
                   String lastMessage, AgentSender currentSender) {

    List<Message> messages = new ArrayList<>();
    messages.add(new SystemMessage(systemPrompt));

    // ... history mapping as before ...

    messages.add(new UserMessage(lastMessage));

    ToolCallback[] tools = ToolCallbacks.from(
        wikipediaSearchTool,
        tavilySearchTool,
        alphaVantageTool,
        arxivSearchTool,
        newsApiSearchTool
    );

    try {
        ChatResponse response = agentChatModel.call(
            new Prompt(messages,
                ToolCallingChatOptions.builder()
                    .toolCallbacks(tools)
                    .build()));

        String result = response.getResult().getOutput().getText();

        // Remove <think> blocks if any (for qwen3)
        return removeThinkingBlock(result);

    } catch (IllegalStateException e) {
        // Grounding fallback: tool call failed — respond without tools
        log.warn("Tool call failed for agent {}: {}", currentSender, e.getMessage());
        return agentChatModel.call(new Prompt(messages))
            .getResult().getOutput().getText();
    }
}

private String removeThinkingBlock(String text) {
    if (text == null) return "";
    // Remove <think>...</think> blocks that qwen3 adds
    return text.replaceAll("(?s)<think>.*?</think>", "").trim();
}

Grounding pipeline test

@SpringBootTest
class GroundingPipelineTest {

    @Autowired
    private GroundingPipeline pipeline;

    @Test
    void shouldReturnNotFoundWhenDocumentMissing() {
        // Request for a document that doesn't exist
        String result = pipeline.processWithGrounding(
            "terms of contract with client XYZ-99999",
            "test-tool-call-id"
        );

        // The answer should acknowledge it wasn't found — not make it up
        assertThat(result)
            .containsAnyOf("didn't find", "not found", "information is missing")
            .doesNotContain("$")          // didn't make up a price
            .doesNotContain("30 days");   // didn't make up terms
    }

    @Test
    void shouldIncludeCitationInResponse() {
        String result = pipeline.processWithGrounding(
            "what is the price of the basic plan",
            "test-tool-call-id"
        );

        // The answer should contain a link to the source
        assertThat(result)
            .containsAnyOf("According to", "document", "section");
    }
}

Conclusions

Grounding is the difference between an agent you can trust and an agent that looks smart but can make anything up. Most tutorials show how to call a tool. Few show what to do after the tool has responded, and that is where a large share of production agent problems hide.

It is especially critical for systems where answers have real consequences — legal documents, prices, contract terms, medical data. There, one wrong answer costs more than all the time spent on implementing grounding.

Five grounding rules — and how to check if they work

1. Empty result ≠ "can answer from memory"

Pass is_error: true and an explicit text instruction in the tool result content. How to check: remove all documents from the database — the agent should respond "not found" to any query, not make it up.

2. Vector score is not enough

A score of 0.85 means "closest document" — not "answer to the question". Confidence scoring provides semantic verification that the score does not. How to check: make a query about a specific client that is not in the database — the agent should not answer with the terms of another client.

3. Re-query with a strict limit

Maximum 2 attempts. NOT_RELEVANT — stop immediately without re-query. How to check: in the logs after each search, there should be attempt X/2 and the reason for stopping.

4. Citation is mandatory

The agent must know where each answer comes from and be able to show it. How to check: ask the agent "where is this information from?" — it should name a specific document and section, not "from the knowledge base".

5. Explicit instructions in tool result — not just in the system prompt

"Do not answer from your own knowledge" in the system prompt — is a recommendation. The same phrase in the tool result content — is an instruction that the model sees right before answering. How to check: test with is_error: true — the answer should not contain any specific numbers or facts.

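The "how to check" steps for rules 1 and 5 can be partly automated. The sketch below is a crude heuristic, not a guarantee of grounding: it accepts a not-found answer only if it contains one of the admission phrases used in the test above and no digits at all. NotFoundAnswerCheck is a hypothetical name, and the phrase list is an assumption you would adapt to your agent's wording.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class NotFoundAnswerCheck {

    // Any digit counts as a "specific figure" (prices, percentages, day counts)
    private static final Pattern FIGURE = Pattern.compile("\\d");

    // Phrases that signal the agent admitted it found nothing
    private static final List<String> ADMISSIONS =
        List.of("didn't find", "not found", "information is missing");

    private NotFoundAnswerCheck() {}

    /** True when the answer admits nothing was found and states no figures. */
    static boolean looksLikeHonestNotFound(String answer) {
        if (answer == null || answer.isBlank()) {
            return false;
        }
        String lower = answer.toLowerCase(Locale.ROOT);
        boolean admits = ADMISSIONS.stream().anyMatch(lower::contains);
        return admits && !FIGURE.matcher(answer).find();
    }
}
```

A check like this is cheap enough to run against a batch of logged answers after emptying the test database, which is the scenario rule 1 describes.
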
Checklist before deploying an agent to production

// Grounding checklist — check each item before deploying

□ Tool returns an explicit instruction for an empty result
  (not an empty string, but "RESULT NOT FOUND + INSTRUCTION")

□ Tool uses is_error: true for notFound and technicalError

□ Score check with thresholds is in place:
  score >= 0.75 → success()
  score 0.55-0.75 → lowRelevance()
  score < 0.55 → notFound()

□ Confidence scoring is enabled for critical query types
  (prices, contracts, deadlines)

□ Re-query has MAX_ATTEMPTS = 2 and stops on NOT_RELEVANT

□ Tool result contains citation: document name, section, date

□ stop_reason is logged for each agent response

□ Test in place: empty database → agent responds "not found"
  (does not make up an answer)

□ Test in place: is_error: true → response without specific numbers and facts
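
The score thresholds from the checklist can live in one small function so every tool applies the same cut-offs. This is a minimal sketch with hypothetical names (ScoreDecision, ScoreThresholds). The checklist leaves the boundaries ambiguous, so the sketch treats 0.75 and 0.55 as inclusive lower bounds, which is an assumption.

```java
// Hypothetical outcome names matching the builder methods in the checklist
enum ScoreDecision { SUCCESS, LOW_RELEVANCE, NOT_FOUND }

final class ScoreThresholds {

    static final double HIGH = 0.75;
    static final double LOW = 0.55;

    private ScoreThresholds() {}

    /** Maps a vector similarity score to the grounding strategy from the checklist. */
    static ScoreDecision classify(double score) {
        if (Double.isNaN(score)) {
            // A broken score is treated like a missing result
            return ScoreDecision.NOT_FOUND;
        }
        if (score >= HIGH) {
            return ScoreDecision.SUCCESS;
        }
        if (score >= LOW) {
            return ScoreDecision.LOW_RELEVANCE;
        }
        return ScoreDecision.NOT_FOUND;
    }
}
```

For critical query types this decision is only the first filter; confidence scoring still runs on whatever passes it.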

Three common mistakes that destroy grounding

// ❌ Mistake 1: empty string instead of an explicit instruction
return "";
// The model interprets it differently — sometimes it answers from memory

// ✅ Correct:
return "RESULT NOT FOUND. DO NOT answer from your own knowledge.";


// ❌ Mistake 2: relying on the system prompt for grounding
new SystemMessage("If you don't know, say you don't know");
// Works sometimes. Non-deterministic. Not for production.

// ✅ Correct:
// Explicit instruction in each tool result where needed


// ❌ Mistake 3: re-query without a limit
while (!found) {
    currentQuery = reformulate(currentQuery);
    result = search(currentQuery);
}
// Infinite loop if the document is not in the database

// ✅ Correct:
for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) { ... }

Next step in the series: grounding solves the problem of result quality. But there is another problem: when an agent has 10-15+ tools, passing them all in each request becomes inefficient and the quality of tool selection drops. How to build a dynamic tool registry and connect only the necessary tools to a specific query is covered in Tool RAG: what to do when an agent has too many tools.

Read also in the AI agents series:

Tool Use vs Function Calling — basic mechanics before building grounding.

How LLM decides when to search — how the model makes decisions before getting results.

Agent Chat — a live example where grounding is critical for Wikipedia and Tavily tools.

Sources: Spring AI Documentation, Anthropic Tool Use Docs, WildToolBench (2026)