On April 17, I took the fresh Claude Opus 4.7 and ran it through my RAG system AskYourDocs on a test set of ~400 public legal documents (sample contracts, regulations, open-source templates). I compared it with Llama 3.3 70B, which most of my clients currently use. The result was unexpected: Opus does two things that are real game-changers for RAG — it says "I don't know" instead of making things up, and it literally obeys the system prompt. But it costs 10-15 times more. Below, I'll tell you exactly what I saw, show you the prompt that works, and give an honest answer about who needs this at all.
🎯 Who actually makes sense to use Opus 4.7 for RAG
In short: buy it only if the cost of a wrong answer is high. For an internal FAQ it's overkill; for a legal assistant it's justified; for an e-commerce support bot it's too much.
I tested Opus 4.7 the day after its release. So I won't write "I've gathered a huge amount of statistics" — but I ran enough real queries to understand the main things.
Here's who Opus 4.7 genuinely helps:
Law firms — where you need to clearly say "this clause is not in the contract" instead of inventing an answer.
Finance and accounting — when a client asks about a specific line in a regulation, and you can't afford to make a mistake.
Medical centers — where answers are formed based on protocols, and a hallucination can cost health.
Businesses that plan to expose an AI chatbot to the public and are worried about prompt injection.
And here's who I'd recommend just using Llama 3.3 70B or even something simpler for:
Internal assistants for a team of 10-20 people.
FAQ bots for 30-50 typical questions.
Projects where the client is price-sensitive and can forgive one or two inaccurate answers.
Anything with strict GDPR requirements — Opus is available only via API.
Next, I'll go into specifics — what exactly the model showed in tests, how much it costs, and how I integrated it into my service. If you haven't read the review of the model itself yet — it covers technical specifications, benchmarks, and comparisons with GPT-5.4 and Gemini.
⚙️ Why choosing a model isn't the whole RAG
The LLM is just the final layer. If your chunking is bad or your embeddings are weak, even Opus 4.7 won't save you. Set up retrieval first, then think about the model.
I look at many other people's RAG systems — and almost every one has the same mistake. The developer connects LangChain with default settings, throws in GPT-4 or Claude, and asks: "Why is the bot hallucinating?" The answer: because the models are fed garbage, so they invent things.
RAG has three critical layers, and LLM is the third, not the first:
How you split documents into chunks (chunking). If you split a table in the middle, the model won't see the context. If you cut a legal clause between two chunks, you'll get an answer without exceptions. I wrote in detail about this in the article on 7 chunking strategies — it even includes research numbers.
Which embedding model you use for search. This is where most mistakes happen — the developer takes text-embedding-ada-002 because "it's OpenAI" and doesn't know that for the same price, there are models with twice the quality. I discussed choosing embedding providers here.
The LLM itself, which generates the answer from the context. This is where Opus 4.7 or an alternative comes in.
If the first two layers are broken, Opus 4.7 won't fix it. Instead, it will confidently give you a wrong answer based on wrong chunks. Therefore, before thinking about choosing between Opus, Llama, and Gemini, check if your retrieval actually finds the correct fragments. If you've never encountered vector search, start with this beginner's article.
📄 What 1 million tokens of context gives you in practice
Yes, the context is gigantic. No, it doesn't mean you can throw all documents in there and skip chunking. Large context is not a replacement for retrieval, but support for complex queries.
Opus 4.7 has a context window of 1 million tokens. That's about 750 pages of text. It sounds tempting: you can just stuff your entire knowledge base into the prompt and not bother with vector search, right?
No, it doesn't work that way. Here's why I still use RAG:
Cost. 1M input tokens cost $5. If you have 1000 queries a day and pass the entire base in each one, that's $5000 a day just for input. Prompt caching cuts this by up to 90%, but it's still expensive for most clients.
Lost in the middle. Even frontier models are worse at finding information in the middle of long contexts; the Stanford "lost in the middle" work showed recall degrading by up to 45%. If you throw in 700 pages, you'll get answers based on the beginning and the end, while the important part sits somewhere in the middle.
Updates. Documents change in the knowledge base. With RAG, I just re-index one file. With "stuff everything into context," you have to rebuild the entire prompt.
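The cost arithmetic above is easy to sanity-check. Here's a back-of-envelope sketch using the numbers from this section ($5 per 1M input tokens, 1000 queries/day, ~1M tokens for the full base vs ~10K tokens for a RAG context); the class and method names are mine, purely for illustration:

```java
// Back-of-envelope comparison: stuffing the whole base into the prompt
// vs passing only retrieved chunks. Numbers are from the text above.
public class ContextCostSketch {
    static double dailyInputCost(long queriesPerDay, long inputTokensPerQuery,
                                 double pricePerMillionTokens) {
        return queriesPerDay * inputTokensPerQuery / 1_000_000.0 * pricePerMillionTokens;
    }

    public static void main(String[] args) {
        double fullContext = dailyInputCost(1000, 1_000_000, 5.0); // whole base every query
        double ragContext  = dailyInputCost(1000, 10_000, 5.0);    // top-k chunks only
        System.out.printf("full context: $%.0f/day, RAG: $%.0f/day%n", fullContext, ragContext);
        // prints: full context: $5000/day, RAG: $50/day
    }
}
```

Caching and real traffic patterns narrow the gap, but the two orders of magnitude between the approaches don't go away.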
Where large context genuinely helps is when a query is complex and requires comparing several documents. For example, a lawyer client asks: "Compare the clauses on liability in these three contracts." Previously, I did this through separate retrieval queries and merged the answers. With Opus 4.7, I can load all three contracts entirely and ask one question. The answer quality is noticeably better.
Practical division in AskYourDocs:
Regular queries — RAG with chunking and top-k=5. We pass only relevant chunks to the model, context 8-12K tokens.
Complex analytical queries (comparisons, analysis) — we load several entire documents, context 50-200K. This is more expensive, but the client pays separately for "advanced query."
"Stuff everything" — I don't use it at all. I don't see a scenario where it's justified.
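The split above reduces to a trivial routing rule. A minimal sketch, assuming my own QueryType naming and the token budgets mentioned above (none of this is a library API):

```java
// Illustration of the regular-vs-advanced split described above.
// QueryType and the budgets are my own naming; tune them per service.
public class ContextBudget {
    enum QueryType { REGULAR, ANALYTICAL }

    // Regular queries get a small top-k context; analytical ones get whole documents.
    static int maxContextTokens(QueryType type) {
        return switch (type) {
            case REGULAR -> 12_000;      // top-k = 5 chunks
            case ANALYTICAL -> 200_000;  // several full documents
        };
    }

    public static void main(String[] args) {
        System.out.println(maxContextTokens(QueryType.REGULAR));    // 12000
        System.out.println(maxContextTokens(QueryType.ANALYTICAL)); // 200000
    }
}
```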
🎯 Hallucinations: what I saw when comparing with Llama
What I noticed in tests: Opus 4.7 honestly admits when there's no answer in the documents. Llama 3.3 70B can produce a plausible fabrication and not even doubt it.
I tested on a test base of ~400 public legal documents — sample contracts from open legal registries, typical regulations, agreement templates. I use this set for my own benchmarks of different models because it provides a realistic document structure without risking anyone's confidentiality. I wrote ~30 different queries — some with answers in the base, some intentionally without.
Here's a typical example of the difference:
Query: "What is the notice period for contract termination without penalties for clients with a corporate tariff?"
There is no such clause in the documents — they only cover a standard tariff and a VIP one.
Llama 3.3 70B: "For clients with a corporate tariff, the notice period for termination without penalties is 30 days from the date of written notice." It sounds confident, the number looks realistic. But it's a complete fabrication — this clause is not in any of the 400 documents.
Opus 4.7: "In the provided documents, I did not find specific conditions for the corporate tariff. The contracts contain information about standard and VIP tariffs — if you need details for one of them, I can answer."
The difference is critical. For a lawyer, a Llama answer is a ready-made problem: the client will quote it, and then it turns out to be untrue. Opus honestly says it doesn't know and offers to clarify.
I won't write "in 15% of cases Llama fabricated" — that would be dishonest after such short testing. But the pattern showed up consistently on the queries that deliberately had no answer in the base. And it aligns with what Anthropic's early-access partners report: Hex noted this exact behavior — "the model honestly reports the absence of data instead of plausible fabrications."
Second important point: Opus 4.7 follows the system prompt literally. If I write "answer only based on the provided context, do not use general knowledge" — it truly does so. Llama regularly "adds from itself" based on what it knows from training. This is unacceptable in a legal context.
💰 How much it actually costs in production
Official price $5/$25 per million tokens. But the new tokenizer added 23% to my costs for the same queries. Llama 3.3 70B via OpenRouter is 10-15 times cheaper. For mass clients, this is a decisive difference.
I calculated based on a typical AskYourDocs client profile: a law firm, ~20 employees, average load of 1000 queries per day, average context from the RAG window of 8-12K tokens.
| Model | Price per 1M input | Price per 1M output | Approximate monthly cost |
|---|---|---|---|
| Claude Opus 4.7 | $5 | $25 | ~$400-500 |
| Llama 3.3 70B (OpenRouter) | ~$0.23 | ~$0.40 | ~$30-40 |
| Gemini 3.1 Pro | $2 | $12 | ~$180-220 |
The numbers are approximate — they depend on caching, system prompt size, output share. But the order is like this: Opus is 10-15 times more expensive than Llama.
Now, an important detail about Opus 4.7's new tokenizer. Immediately after the announcement, one of my clients asked to connect the new model for testing — they wanted to see if it would improve the quality of answers on their base. I switched them from 4.6 to 4.7 by simply swapping the model ID in the configuration — literally one line. The system logic didn't change, the prompts were the same, the documents were the same. But the cost of processing the same queries increased by approximately 23%. The same text — more tokens. Anthropic officially warns that it will be 1.0-1.35x for the same text; mine was closer to the upper limit.
What to do about it:
Prompt caching is mandatory. If you have a stable system prompt and a large context, caching provides up to a 90% discount on cached tokens. For me, this offset most of the increase from the new tokenizer.
Batch API for non-real-time. If the client doesn't require an instant response (e.g., overnight document analysis) — batch processing gives a -50% discount on input and output.
Task budgets. A new feature — a hard limit on token expenditure for a task. An essential thing if you have an agent that can go into a loop and consume the monthly budget in an hour.
Review the system prompt. I shortened mine by ~30% — removed duplication, combined similar instructions. With 4.7 this matters even more, because the same instructions now cost more tokens.
My practical preset for clients migrating from 4.6 to 4.7: budget an additional 25% to the API budget for the first month, then optimize based on actual data.
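To make that "+25% for the first month" preset concrete, here's a rough estimator combining the effects from this list: the ~1.23x token growth I measured, the up-to-90% discount on cached input, and the 50% batch discount. All names are mine, and the discount arithmetic is a simplification (real billing separates cache writes and reads):

```java
// Rough monthly input-cost estimate after migrating 4.6 -> 4.7.
// Assumptions (mine): 1.23x tokenizer growth, cachedShare of input billed
// at a 90% discount, batchShare of traffic billed at a 50% discount.
public class MigrationBudget {
    static double monthlyInputCost(double baseMonthlyTokensMillions,
                                   double pricePerMillion,
                                   double tokenizerFactor,
                                   double cachedShare,   // 0..1 of input hitting the cache
                                   double batchShare) {  // 0..1 of traffic via Batch API
        double tokens = baseMonthlyTokensMillions * tokenizerFactor;
        double cacheMultiplier = cachedShare * 0.10 + (1 - cachedShare);
        double batchMultiplier = batchShare * 0.50 + (1 - batchShare);
        return tokens * pricePerMillion * cacheMultiplier * batchMultiplier;
    }

    public static void main(String[] args) {
        // 60M input tokens/month at $5/M, no optimizations yet:
        System.out.printf("naive 4.7 budget: $%.0f%n",
                monthlyInputCost(60, 5.0, 1.23, 0.0, 0.0)); // $369
        // Same load with 70% cache hits:
        System.out.printf("with caching:     $%.0f%n",
                monthlyInputCost(60, 5.0, 1.23, 0.7, 0.0));
    }
}
```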
🔧 My System Prompt for Document Q&A
Here's my current prompt. It's not perfect, but it works — it's gone through several iterations over the past few days:
```
You are an AI assistant for [COMPANY_NAME]'s internal documentation.

YOUR TASK:
Answer employee questions based ONLY on the documents provided to you in the
<context> block. Do not use general knowledge about jurisprudence, laws, or
standards — only what is present in the provided context.

RESPONSE RULES:

1. If the information is not in the context, state it directly:
   "I did not find the answer to this question in the provided documents."
   Do not invent. Do not "assume." Do not rely on "general rules."

2. If the answer is present, quote the document fragment and indicate the source
   (file name, section/point number).

3. If the answer is partial, provide what is available and clearly state what is
   missing: "The documents contain information about X, but not about Y."

4. If the question is legally complex and requires interpretation —
   do not interpret. State: "This question requires legal consultation;
   the documents only contain the wording of the clause: [quote]."

RESPONSE FORMAT:
- First, a direct answer in 1-2 sentences.
- Then, a quote from the document in quotation marks.
- At the end, the source in the format: [file.pdf, section X.Y].

<context>
{retrieved_chunks}
</context>

User question: {user_question}
```
Why it's structured this way:
Prohibition of general knowledge use — in the very first lines. Opus 4.7 follows instructions literally, so if you write "primarily based on context," it will start adding its own information. "Only based on" — and that's it.
Precise scenario for "I don't know." I provide a ready-made response phrase. The model uses it. Without this, it might write "unfortunately, there is not enough information, but I will try to assume" — and that's already a bad sign.
Citation format. I require a reference to the source — then the client can manually verify the answer in seconds.
Prohibition of legal interpretation. Important for the legal field. The model is not allowed to "interpret the law" — that's a lawyer's job. It should only show the text.
This same prompt works worse with Llama 3.3 70B. The model regularly misses the "only based on context" limitation and adds "according to the standard interpretation of such a clause...". I haven't seen this with Opus 4.7 — it listens literally.
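For completeness, here's roughly how the two placeholders get filled before the call. The Chunk record and the source-tagging format are my illustrative choices, not part of the template itself:

```java
import java.util.List;

// Minimal sketch of assembling the final prompt from the template above.
// The Chunk record and the "[file, section]" tag format are illustrative.
public class PromptAssembly {
    record Chunk(String file, String section, String text) {}

    // Prefix each chunk with its source so the model can cite it back.
    static String buildContext(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append("[").append(c.file()).append(", section ").append(c.section())
              .append("]\n").append(c.text()).append("\n\n");
        }
        return sb.toString().strip();
    }

    static String buildPrompt(String template, List<Chunk> chunks, String question) {
        return template
                .replace("{retrieved_chunks}", buildContext(chunks))
                .replace("{user_question}", question);
    }

    public static void main(String[] args) {
        String template = "<context>\n{retrieved_chunks}\n</context>\n\nUser question: {user_question}";
        var chunks = List.of(new Chunk("contract.pdf", "7.2", "Termination requires 30 days notice."));
        System.out.println(buildPrompt(template, chunks, "What is the notice period?"));
    }
}
```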
How it looks in Spring AI code
If you have Spring Boot and are using OpenRouter for flexibility between models — here's approximately how it looks:
I won't include the entire service architecture in the article — it's complex and commercially sensitive. But for clarity, I'll show the key idea of multi-vendor RAG on Spring: model selection is a value in application.yml, not a code change. Thanks to @ConditionalOnProperty, the right ChatClient bean is loaded — Opus 4.7, Llama via OpenRouter, or local Ollama.
Model Configuration via application.yml
```yaml
# application.yml
rag:
  llm:
    provider: openrouter   # anthropic | openrouter | ollama
    model: anthropic/claude-opus-4.7
    temperature: 0.2
    max-tokens: 2048
    system-prompt: |
      You are a company AI assistant. Respond only based on
      the provided context. If information is missing, say "not found".
...
```
Conditional Provider Connection
```java
@Configuration
public class LlmProviderConfig {

    @Bean
    @ConditionalOnProperty(
            name = "rag.llm.provider",
            havingValue = "anthropic"
    )
    public ChatClient anthropicChatClient(
            @Value("${rag.llm.model}") String model,
            AnthropicApi anthropicApi
    ) {
        var options = AnthropicChatOptions.builder()
                .model(model)
                .temperature(0.2)
                .build();
        return ChatClient.builder(new AnthropicChatModel(anthropicApi, options))
                .build();
    }

    @Bean
    @ConditionalOnProperty(
            name = "rag.llm.provider",
            havingValue = "openrouter"
    )
    public ChatClient openRouterChatClient(
            @Value("${rag.llm.model}") String model,
            @Value("${openrouter.api-key}") String apiKey
    ) {
        // OpenRouter is compatible with the OpenAI API, so we use OpenAiChatModel
        var openAiApi = new OpenAiApi("https://openrouter.ai/api", apiKey);
        var options = OpenAiChatOptions.builder()
                .model(model)
                .temperature(0.2)
                .build();
        return ChatClient.builder(new OpenAiChatModel(openAiApi, options))
                .build();
    }

    @Bean
    @ConditionalOnProperty(
            name = "rag.llm.provider",
            havingValue = "ollama"
    )
    public ChatClient ollamaChatClient(
            @Value("${rag.llm.model}") String model,
            OllamaApi ollamaApi
    ) {
        var options = OllamaOptions.builder()
                .model(model)
                .temperature(0.2)
                .build();
        return ChatClient.builder(new OllamaChatModel(ollamaApi, options))
                .build();
    }
}
```
The beauty of this approach is that the DocumentQaService itself doesn't know or care which model it's calling — it works against the ChatClient abstraction. To move a client from Llama to Opus 4.7, I change one line in application.yml.
Restart the container — and the client is already on Opus. No code changes, no redeploy of the service, no re-release of other components. For a production environment, I keep these configurations in separate Spring profiles (application-client-acme.yml), so each client gets its own model through external configuration, not hardcoding.
In a real service, on top of this basic skeleton, there are guardrails for input (prompt injection detection), reranking after similarity search, circuit breaker with fallback to a backup model, structured logging with correlation ID, and several other layers. But the core is precisely this: a fine abstraction over the provider and configuration in yml. This is what allows me to promise clients "want another model tomorrow — we'll change it without rewriting code."
A few important details in the code worth noting:
One DocumentQaService — three providers. The service works with the ChatClient abstraction, not a specific implementation. Spring itself loads the required @Bean depending on the value of rag.llm.provider. Adding a new provider means another class with @ConditionalOnProperty, without any changes to the rest of the code.
OpenRouter via OpenAI API. The trick is that OpenRouter is compatible with the OpenAI protocol. Therefore, for it, I use OpenAiChatModel, simply substituting the base URL. This allows connecting to Llama, Gemini, Mistral, and dozens of other models through a single OpenRouter bean — just by changing rag.llm.model.
Parameterized filter instead of string concatenation. FilterExpressionBuilder().eq("company_id", companyId) is the correct way to do pre-filtering for vector search. In a multi-tenant service, this is critical: client A will never see documents from client B. Naive concatenation via String.format is an injection vulnerability, especially if companyId ever comes from user input.
topK=5 for a reason. I settled on five chunks after several iterations. Fewer risks missing relevant information; more adds noise and the lost-in-the-middle problem, where the model performs worse on information buried in the middle of a long context. Stanford showed recall degrading to 45% on middle fragments — this is not theory, I've seen it in practice.
Switching models without redeploying code. Client configurations live in separate Spring profiles (application-client-acme.yml). When a client wants to change a model — I change one value in yml and restart the container. Compilation, tests, CI/CD — nothing is needed.
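To show why the parameterized filter matters, here's a self-contained mini-DSL in the same spirit. It's a simplified stand-in for Spring AI's FilterExpressionBuilder, not its actual API: the point is that the filter is a structured object whose values get escaped centrally, while naive String.format lets a hostile companyId rewrite the query.

```java
// Simplified stand-in for the idea behind FilterExpressionBuilder:
// the filter is data, not a concatenated string. Illustrative only.
public class FilterSketch {
    record Eq(String field, String value) {
        // Values are escaped in one place at render time.
        String render() {
            return field + " == '" + value.replace("'", "\\'") + "'";
        }
    }

    public static void main(String[] args) {
        System.out.println(new Eq("company_id", "acme-42").render());

        // Naive concatenation: an attacker-controlled id changes the filter's meaning.
        String hostile = "x' || company_id != 'x";
        System.out.println(String.format("company_id == '%s'", hostile)); // injected
        System.out.println(new Eq("company_id", hostile).render());       // quote escaped
    }
}
```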
⚖️ Opus 4.7 vs Llama 3.3 70B for RAG: When to Use What
Llama 3.3 70B is the default for most clients. Opus 4.7 is for niches with a high cost of error. Not "either-or," but the right tool for the job. And between them, there are many other options for different budgets.
I hear this question from clients every week. Here's my real decision table, without marketing:
| Parameter | Llama 3.3 70B | Claude Opus 4.7 |
|---|---|---|
| Price at typical load | ~$30-40/mo | ~$400-500/mo |
| Hallucinations in RAG | Occur, especially on queries without answers | Almost never seen |
| Following system prompt | Often "supplements" with general knowledge | Listens literally |
| Prompt injection resistance | Medium | High |
| Response speed | Faster | Slower, especially at xhigh |
| Self-hosted | Possible (via Ollama or vLLM) | Impossible |
| GDPR compliance | Full (self-hosted) | Via US API — problematic for some sectors |
| Tool use (self-querying) | Works, but suboptimal queries | Accurate and efficient |
In production, I actually offer clients a wider choice, not just Llama or Opus. Here are the models I currently recommend for different budgets and tasks:
Ultra-budget, minimal entry barrier (<$20/mo): deepseek/deepseek-chat via OpenRouter. Unexpectedly good quality for pennies, especially for simple document Q&A with 200-500 queries per day.
Small budget with better quality (<$50/mo): openai/gpt-4o-mini via OpenRouter. Consistently outperforms DeepSeek on more complex queries, but costs more.
Mid-range, default for most clients (<$100/mo): meta-llama/llama-3.3-70b-instruct. Strong balance of quality/price, open architecture, can be migrated to self-hosted if needed.
GDPR + sensitive data: Llama 3.3 70B in self-hosted mode on the client's server via vLLM or Ollama. No traffic going out. This is the basis of the closed AskYourDocs loop for medical and legal clients.
Legal/medical documents, public chatbot, high cost of error: anthropic/claude-opus-4.7. The premium is justified by the model's honesty and resistance to prompt injection.
Enterprise with thousands of queries per day: hybrid. Simpler queries go to a cheaper model (GPT-4o mini or Llama), complex and sensitive ones — to Opus 4.7. Routing by query type at the service level.
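The hybrid routing can start as something this simple. The model IDs mirror the list above; the keyword heuristic is a placeholder I made up, and in production you'd classify queries with a cheap model or an explicit client-side flag:

```java
// Illustrative query router for the hybrid setup described above.
// The keyword heuristic is a placeholder, not a production classifier.
public class ModelRouter {
    static String modelFor(String query, boolean sensitiveDomain) {
        if (sensitiveDomain) {
            return "anthropic/claude-opus-4.7";      // legal/medical: premium model
        }
        String q = query.toLowerCase();
        if (q.contains("compare") || q.contains("analyze")) {
            return "anthropic/claude-opus-4.7";      // complex analytical query
        }
        return "meta-llama/llama-3.3-70b-instruct";  // default cheap path
    }

    public static void main(String[] args) {
        System.out.println(modelFor("What is the refund policy?", false));
        System.out.println(modelFor("Compare liability clauses in these contracts", false));
    }
}
```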
Technically, this is implemented with minimal configuration changes. Switching providers is a single line of configuration (shown here in .properties syntax):

```properties
# application.properties
# Budget option: DeepSeek via OpenRouter
spring.ai.openai.chat.model=${SPRING_AI_OPENAI_CHAT_MODEL:deepseek/deepseek-chat}

# Higher quality mid-tier: GPT-4o mini
# spring.ai.openai.chat.model=${SPRING_AI_OPENAI_CHAT_MODEL:openai/gpt-4o-mini}

# Default for most: Llama 3.3 70B
# spring.ai.openai.chat.model=${SPRING_AI_OPENAI_CHAT_MODEL:meta-llama/llama-3.3-70b-instruct}

# Premium for legal and medical tasks
# spring.ai.openai.chat.model=${SPRING_AI_OPENAI_CHAT_MODEL:anthropic/claude-opus-4.7}
```
The value is taken from the environment variable SPRING_AI_OPENAI_CHAT_MODEL with a fallback to the model in the configuration. This is convenient for different environments: in dev, I run DeepSeek (to not burn budget on tests), and in client production — what we agreed on. No code redeploy when changing the model, only the env variable in Railway or Docker config changes.
If you want to try Llama 3.3 70B locally on your own hardware for testing — I have a separate article on which models run on 8GB RAM. Llama 70B won't run there — you'll need at least 64GB RAM for decent speed — but you can test smaller models from the family to understand their behavior.
🏗️ How I connected Opus 4.7 in AskYourDocs
A brief overview of the architecture: Spring Boot + PostgreSQL with pgvector + OpenRouter as a proxy to LLMs. Opus 4.7 is one of several models that the client can switch between depending on budget and requirements.
I built AskYourDocs to be multi-vendor from the very beginning. The service positioning is "your documents, your model, your server" — and this is not marketing, but an architectural decision. The client is not tied to a single LLM provider: today it might be Llama via OpenRouter, tomorrow Opus 4.7, the day after tomorrow a self-hosted model on their infrastructure.
Key components of the stack:
Spring Boot + Spring AI — the main backend. Spring AI provides a ChatClient abstraction over LLM providers, so connecting a new one is a new bean with @ConditionalOnProperty, not a rewrite of the service layer.
PostgreSQL + pgvector — the vector database. For my size, it's more than enough. There's no point in adding a separate Qdrant or Pinecone as long as millions of chunks fit into one table and search latency is <100ms.
OpenRouter — a proxy for external LLMs. Through it, I connect to Opus 4.7, Llama 3.3 70B, Gemini, GPT-4o mini, DeepSeek with a single API key. Billing is unified, model switching is a change of a single value in an environment variable.
Ollama and vLLM — for clients with closed environments. The same Spring AI ChatClient, just a different provider bean.
Railway — deployment. Simple, inexpensive, supports secrets via environment variables, which is critical for multi-tenant configuration.
The workflow looks like this:
```
Client uploads PDF
        ↓
[Indexing pipeline]
PDF → text → chunking → embedding → pgvector
        ↓
(index is ready, independent of LLM choice)

Client asks a question
        ↓
[Query pipeline]
Question → vector search → top-K chunks → prompt → ChatClient → LLM
        ↓
LLM is selected via client's Spring profile
(Opus 4.7 / Llama / Gemini / other)
        ↓
Answer + sources
```
The key principle is that indexing is independent of the LLM. A client can start with Llama, work for a month, realize they need a more accurate model, switch to Opus 4.7 — and there's no need to rebuild the vector database. This saves both time and money (document embedding costs pennies but takes time for large volumes).
Regarding indexing — it's a standard RAG pipeline: text extraction from PDF, chunking according to my strategy (recursive + metadata, parameters from my article on chunking), embedding via OpenAI, saving to pgvector with metadata for pre-filtering. The choice of embedding model is a separate topic, and I wrote about it in detail here.
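The real strategy lives in the chunking article; for orientation, here's a toy sliding-window splitter with overlap that shows the shape of that step. Sizes are illustrative and counted in characters rather than tokens:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sliding-window chunker with overlap. A toy stand-in for the
// recursive strategy referenced above; parameters are illustrative.
public class SimpleChunker {
    static List<String> chunk(String text, int size, int overlap) {
        if (overlap >= size) throw new IllegalArgumentException("overlap must be < size");
        List<String> chunks = new ArrayList<>();
        int step = size - overlap;  // how far the window advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 2500 chars, 1000-char window, 200-char overlap -> 3 chunks
        var chunks = chunk("a".repeat(2500), 1000, 200);
        System.out.println(chunks.size() + " chunks");
    }
}
```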
The rest are thin layers around this core: guardrails, rate limiting per client, logging with correlation IDs, token consumption metrics for billing, an admin panel for document management. But the key idea is simple: a thin abstraction over LLM + well-tuned retrieval = flexible RAG that lives longer than any specific model.
⚠️ When Opus 4.7 is definitely not needed for RAG
Honestly: for most RAG projects, Opus 4.7 is overkill. I'll break down scenarios where you can safely opt for a cheaper alternative.
Based on my tests of Opus 4.7 and previous experience with Opus 4.5/4.6 working with clients — here are scenarios where I don't see the point in using a premium model:
Internal FAQ with 30-50 questions. If you have a few dozen typical questions and the answers in the documents are straightforward — Llama 3.3 70B will perform at a 95% quality level for $30/month. Paying $400 extra is, in my opinion, unnecessary.
GDPR restrictions or corporate secrets. Opus works only through Anthropic's API (or AWS/Google/Azure with their terms). If you have medical, legal, or financial data and a strict requirement of "not leaving the perimeter" — opt for self-hosted Llama. This is the main positioning of AskYourDocs, by the way — you can deploy a completely closed environment without any external APIs.
Volume of 10K+ requests per day. At this scale, Opus will cost $4000+/month. Rarely justified — at large volumes, it's almost always better to combine models (cheaper for simple requests, Opus for complex ones).
Web-research scenarios. Opus 4.7 underperformed on BrowseComp against GPT-5.4 and Gemini 3.1 Pro. If your agent relies on browsing websites — I would look elsewhere.
Retrieval is not yet optimized. If your recall is 50% — Opus won't fix that. First, chunking, embedding, reranking. Then, think about the model. I see this constantly: developers switch LLMs expecting magic, but the problem is that retrieval returns irrelevant chunks.
When the business wants the bot to always respond. This is a rare but real request — "we need the model not to say 'I don't know', otherwise clients will think the bot is broken." Opus 4.7, with its honest behavior, will be perceived as "worse." Here, you either need to re-educate the client or choose a model with a higher tendency to generate. For Opus, this is simply not the task.
❓ FAQ: Claude Opus 4.7 for RAG
Can Claude Opus 4.7 be used for a RAG system?
Yes, and my initial tests show that it handles it well — especially where model honesty ("I don't know" instead of hallucination) and literal adherence to the system prompt are important. The main question is not "can it be used," but "is it justified." For most use cases, cheaper models like Llama 3.3 70B provide 90% of Opus's quality for 10% of the price.
How much does Opus 4.7 cost for a real RAG system?
According to my calculations for a typical profile (approximately 1000 requests per day, 8-12K token context) — it's about $400-500/month. 10-15 times more expensive than Llama 3.3 70B via OpenRouter. With prompt caching, it can be reduced by 30-40%, but it's still in the premium segment.
Opus 4.7 or GPT-5.4 for RAG?
Looking at benchmarks and early tests — Opus 4.7 is stronger where honesty and absence of fabrication are critical. GPT-5.4 wins in web research and where combining knowledge with context is needed. For pure document Q&A on a closed document base, I would choose Opus.
Does the 1M token context in Opus 4.7 help replace RAG?
No. Large context is a supplement to RAG, not a replacement. Stuffing the entire database into the prompt is expensive, models degrade with long context (lost in the middle effect), and data updates become a problem. RAG + selectively large context for complex analytical queries is optimal.
Is Opus 4.7 suitable for GDPR-sensitive data?
Difficult. Opus is only available via API — Anthropic, AWS Bedrock, Google Vertex AI, Azure Foundry. For some jurisdictions and data types, this is already sufficient (Bedrock in the EU region), for others — it's not. If you have medical or personal data with strict restrictions — self-hosted Llama will be a cleaner choice. Opus as an API simply cannot "not leave the perimeter." I discussed this topic in detail in a separate article on GDPR and AI on documents — it covers fines, real cases, and specific steps for businesses in the EU.
How to integrate Opus 4.7 into a Spring Boot application?
Via Spring AI. Basically, there are two ways: directly through the Anthropic provider (spring-ai-anthropic) or via OpenRouter as a proxy (with spring-ai-openai, as OpenRouter is compatible with the OpenAI protocol). The latter option is more convenient if you plan to switch between models — one API key for Opus, Llama, Gemini, and dozens of others.
Do hallucinations decrease in RAG if Opus 4.7 is used instead of Llama?
In my initial tests — yes, noticeably. Opus 4.7 more often admits the absence of information in the context instead of fabricating it. But this depends on the system prompt — if you don't forbid the model from "supplementing" with general knowledge, even Opus will do it. In my opinion, a correct prompt combined with Opus is more important than just the choice of model.
Can you switch between Opus 4.7 and Llama without re-indexing documents?
Yes. Indexing depends on the embedding model, not the LLM. As long as you don't change the embedding model, the vector database remains the same. You can switch Opus ↔ Llama ↔ Gemini without any effort, and this is precisely what the multi-vendor architecture of AskYourDocs is built upon.
Why did Opus 4.7 become more expensive if the price per token hasn't changed?
Due to the new tokenizer. The same text maps to 1.0-1.35× more tokens. In my test run, after changing the model ID, the cost increase was approximately 23% — without any changes in logic. Prompt caching and shortening the system prompt partially mitigate this.
What to use for local development of a RAG system before deploying to Opus 4.7?
I use Ollama with smaller models (qwen2.5 7B or llama3.2 3B) for a fast dev cycle. The behavior is not identical to Opus, but it allows testing the pipeline without API costs. On staging, I switch to the actual model — Llama via OpenRouter or Opus 4.7, depending on the task.
If you're unsure which model to choose — start with the AskYourDocs demo. I'll help you choose based on your document profile and budget, with no obligation.