What is a token in ChatGPT, Claude, and Gemini?

A token is not a word or a character but a text fragment of variable length that the AI uses as its unit of processing. It can be part of a word, a whole word, or even several words together. For example, "Hello" is 1 token, while "authorization" is 2 tokens. All large language models (GPT-5, Claude, Gemini) see your text as a sequence of numerical identifiers from their own vocabulary, which contains between 128,000 and 200,000 tokens.

How does BPE (Byte Pair Encoding) work?

BPE is an algorithm that builds a token vocabulary by starting with individual bytes and iteratively merging the most frequent pairs. For example, if the pair "l" + "o" occurs most often, it becomes the single token "lo". The process repeats thousands of times on a text corpus of hundreds of gigabytes until the vocabulary reaches the target size (100,000–200,000 tokens). BPE was invented in 1994 for data compression and adapted for NLP in 2016 by Sennrich, Haddow, and Birch.

What are glitch tokens and why are they dangerous?

Glitch tokens are tokens from the model's vocabulary for which the neural network never learned normal behavior. They entered the vocabulary through the tokenizer's training data but were missing from the main training corpus. For example, the token "SolidGoldMagikarp" made ChatGPT answer with the word "distribute". According to the GlitchMiner study (AAAI 2026), approximately 4.3% of the tokens in the GPT-4, Llama 2, and DeepSeek vocabularies are potential glitch tokens. They can bypass safety filters and cause unpredictable behavior.

How much does a token cost in the API in 2026?

Prices as of March 2026: GPT-5.4 costs $2.50/$10.00 per 1M input/output tokens, Claude Sonnet 4.6 costs $3.00/$15.00, Gemini 2.5 Pro costs $1.25/$10.00, and DeepSeek V3.2 costs $0.14/$0.28. Output tokens cost several times more than input tokens (2–8× for the models listed here). Prices have fallen by roughly 80% compared to 2025. The cheapest option is Gemini 2.0 Flash-Lite at $0.075/$0.30 per 1M tokens.

How many tokens are in one word?

It depends on the language and on how frequent the word is. A short, frequent English word ("the", "is", "cat") is 1 token. A long or rare English word is 2–3 tokens. A Cyrillic word of medium length is 3–6 tokens. For English, an average token corresponds to roughly 4 characters or ¾ of a word. You can check the tokenization of a specific word at tokenizer.openai.com.

Do input and output tokens cost the same?

No. Output tokens typically cost 3–10 times more because generation is sequential: each token depends on the previous one and requires more GPU time. For example, Claude Sonnet 4.6 charges $3/1M for input and $15/1M for output (a 5× difference). In reasoning models (o3, DeepSeek R1) the cost is higher still because of "thinking tokens", the model's internal reasoning, which you also pay for.

Why are non-English languages more expensive in AI APIs?

A BPE tokenizer is trained on a corpus in which more than 90% of the text is English. As a result, English words get whole tokens, while words in other languages (Cyrillic, Chinese, Arabic) are split into small pieces. One non-English word costs 2–5 tokens on average versus 1–2 for an English one. This means the same text in a non-English language is 2–4 times more expensive to process and takes up more space in the model's context window.

How can you reduce the number of tokens and save on the API?

The most effective methods: use prompt caching for the unchanging parts of the prompt (80–90% savings on input), the batch API for non-interactive requests (50% discount), and model routing, which sends simple tasks to a cheap model and complex ones to a more expensive model. Also compress your prompts: remove unnecessary formatting, minify JSON, and avoid repetition. For non-critical tasks, consider local AI via Ollama: $0/month in API fees.

What comes after BPE tokenization?

Several alternatives appeared in 2025–2026: SuperBPE (COLM 2025) creates multi-word tokens and improves results by 4% on average across 30 benchmarks; LiteToken (February 2026) removes redundant tokens from the vocabulary without retraining; the Byte Latent Transformer (BLT) from Meta drops the tokenizer entirely and works with raw bytes, matching Llama 3 quality at half the compute. BPE remains the standard, but a transition may begin in 2026–2027.

AI_TOOLS · 23 March 2026 · 22 min read

What Are Tokens in ChatGPT, Claude, and Gemini: How AI Sees Your Text and What It Really Costs in 2026

Updated: 24 June 2026


Dmitro Petrov

A Tech Lead who builds AI/ML systems for production — and writes about how they actually work.



You type "Hello" into ChatGPT and assume you've sent one word. In reality, the AI received numbers: one for the English "Hello", and typically 3–4 for the same greeting written in Cyrillic. These are tokens, the invisible units that all large language models think in. Spoiler: a single Cyrillic word already costs 3–4 tokens versus 1–2 for an English one, code formatting eats up to a quarter of the tokens, and some strings literally break GPT.

⚡ In Short

✅ Token ≠ Word: one English word is approximately 1 token, while one Cyrillic word is 3–4 tokens.
✅ BPE: the tokenizer builds a vocabulary by merging frequent character pairs — effective for English, expensive for Cyrillic.
✅ Glitch Tokens: ~4.3% of the tokens in the GPT-4, Llama 2, and DeepSeek vocabularies are potentially "broken" and can cause unpredictable behavior.
✅ Prices in 2026 are Falling: DeepSeek V3.2 costs $0.14/1M input tokens, GPT-5.4 costs $2.50/1M.
🎯 You Will Get: an understanding of tokens from basics to API practice, plus a price table for choosing a provider.
👇 Below are detailed explanations, examples, and tables

📚 Table of Contents

📌 1. Why AI Doesn't Read Words — It Reads Tokens
📌 2. Why a Token ≠ Word: Length Anomalies and Special Tokens
📌 3. How BPE Works: Merging Characters in Plain Terms
📌 4. Why Non-English Languages Are "More Expensive" in AI
📌 5. Glitch Tokens: Why "SolidGoldMagikarp" Breaks GPT
📌 6. Formatting Eats Tokens
💼 7. How Much a Token Costs in the API in 2026
💼 8. The Future of Tokenization: SuperBPE, BoundlessBPE, BLT
📌 9. Practice: Check Your Text
❓ Frequently Asked Questions (FAQ)
✅ Conclusions

🎯 1. Why AI Doesn't Read Words — It Reads Tokens

What is a Token in LLMs

A token is not a word or a character, but a text fragment of arbitrary length: part of a word, a whole word, or even several words together. GPT-5, Claude, and Gemini see your request as a sequence of numerical identifiers from their own vocabulary. The word "Hello" is 1 token (id: 9906), and "authorization" is already 2 tokens, even though it's also one word.

A computer cannot read letters. It can count numbers. A token is a bridge between human language and the numerical world of a neural network.

When you send a message to ChatGPT, it doesn't go directly to the neural network. First, the text goes through a tokenizer — a special program that breaks your string into fragments and replaces each with a number from the model's vocabulary.

For example, the sentence "The cat sat on the mat" is seen by the model approximately as: [791, 8415, 9137, 389, 279, 2450] — 6 numbers instead of 6 words. The model reads precisely these numbers, processes them through hundreds of transformer layers, and at the output, through the tokenizer again, converts the numbers back into text.
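The round trip described above (text → numbers → text) can be sketched with a toy tokenizer. The vocabulary and its IDs below are invented for illustration and do not match any real model's vocabulary, and the greedy longest-match lookup is a simplification of what BPE-based tokenizers actually do.

```python
# Hypothetical toy vocabulary: text fragment -> numeric ID.
# Real vocabularies hold roughly 100K-200K entries.
TOY_VOCAB = {"The": 0, " cat": 1, " sat": 2, " on": 3, " the": 4, " mat": 5}


def encode(text, vocab):
    """Greedy longest-match encoding: text -> list of token IDs."""
    ids, pos = [], 0
    longest = max(len(piece) for piece in vocab)
    while pos < len(text):
        # Try the longest candidate fragment first, then shorter ones.
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += size
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def decode(ids, vocab):
    """Map token IDs back to their fragments and join them."""
    by_id = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(by_id[token_id] for token_id in ids)


sentence = "The cat sat on the mat"
ids = encode(sentence, TOY_VOCAB)
print(ids)                     # [0, 1, 2, 3, 4, 5]
print(decode(ids, TOY_VOCAB))  # The cat sat on the mat
```

Note that "The" and " the" are separate entries with different IDs: capitalization and the leading space both change which token is produced.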

Why Not Just Letters or Words?

A character-based approach (one character = one unit) results in overly long sequences — the neural network processes them poorly due to the quadratic complexity of attention. More on this in the article about the context window.

A word-based approach (one word = one token) doesn't work either: English alone has hundreds of thousands of words, plus numbers, names, code, and emojis. The vocabulary would be effectively unbounded.

Therefore, a compromise prevailed: subword tokenization. Frequent words ("the", "is") stay whole, while rarer words are split into reusable parts ("run" + "ning"). This is what the BPE algorithm does (section 3).

✔️ GPT-5.4 / GPT-4o Vocabulary: ~200,000 tokens (o200k_base)
✔️ Llama 3 / 4 Vocabulary: ~128,000 tokens
✔️ DeepSeek V3 Vocabulary: ~128,000 tokens

Section Conclusion: AI reads neither letters nor words — it reads tokens, and the efficiency of your text tokenization affects both the quality of the response and your API bill.

📌 2. Why a Token ≠ Word: Length Anomalies and Special Tokens

Why Token Length is Unpredictable

The length of a token depends not on grammar, but on frequency: the more often a certain sequence of characters appeared in the training data — the higher the chance it became a single token. Hence the paradoxes: "GPT-4" = 2 tokens, "GPT4" = 1. The hyphen changes everything.

A token is not a unit of length, but a unit of frequency in the training data.

The easiest way to feel this is to check familiar words in the OpenAI tokenizer. The results often surprise even experienced developers.

Examples That Break Intuition

String	Tokens	Breakdown	Why so
`ChatGPT`	3	Chat + G + PT	New name, BPE didn't see it whole
`OpenAI`	2	Open + AI	Both parts are frequent separately
`tokenization`	3	token + iz + ation	Word is rarer than its parts
`GPT-4`	2	GPT + -4	Hyphen breaks the merge
`GPT4`	1	GPT4	Without hyphen — one token
`cat`	1	cat	Frequent word
` cat` (with leading space)	1	·cat	A different token than without a space

The last row is particularly important. In BPE, a space at the beginning of a word is part of the token, not a separate character. Therefore, `cat` and ` cat` (with a leading space) have different numerical IDs. The model literally sees them as different tokens, which is why the same word can be encoded differently at the start of a text and in the middle of a sentence.

Special Tokens: Model's Service Symbols

In addition to text tokens, each model has a set of special tokens — service markers that denote dialogue structure and control model behavior:

Token	Meaning	Where it's found
`<|endoftext|>`	End of document	GPT-4, GPT-4o
`<s>` (BOS)	Beginning of sequence	Llama 2, Mistral
`</s>` (EOS)	End of sequence	Llama 2, Mistral
`[PAD]`	Batch padding	Model training
`[INST]` / `[/INST]`	Instruction start/end	Llama 2 Instruct
`<|im_start|>`	Message start	GPT-4, ChatML format

Special tokens directly affect model behavior: if they end up in user input accidentally or intentionally — the model can exit "assistant" mode and behave unpredictably. This is one of the vectors of attack on LLM systems.

Section Conclusion: A token is a statistical unit, not a linguistic one. Spaces, hyphens, and case — all of these change tokenization. Understanding this gives an advantage when writing prompts and designing LLM-based systems.

📌 3. How BPE Works: Merging Characters in Plain Terms

Byte Pair Encoding Algorithm

BPE builds a vocabulary by starting with individual bytes and iteratively merging the most frequent adjacent pair. It's a greedy compression algorithm: if the pair "i" + "n" appears more often than any other, it becomes the single token "in", which can later merge with "g" into "ing". The process is repeated thousands of times until the vocabulary reaches the desired size.

BPE is like building words from Lego bricks: at first, there are only individual letter bricks, then you start joining the most popular combinations.

A Brief History: From Compression to GPT

BPE didn't originate in AI. In 1994, Philip Gage published the algorithm as a data compression method in The C Users Journal. The idea was simple: find the most frequent pair of bytes in a file, replace it with a single new byte, repeat.

In 2016, researchers Rico Sennrich, Barry Haddow, and Alexandra Birch adapted BPE for machine translation in the paper "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016). The goal shifted from compressing bytes for storage to compressing characters into subwords for neural networks. This work became the foundation of tokenization in all modern LLMs.

In 2019, OpenAI took the next step in GPT-2: they moved from character-level BPE to byte-level BPE. The difference is critical: instead of individual Unicode characters, the algorithm works with 256 base bytes. This means any text in any language is guaranteed to be encoded — there are no "unknown characters." All modern models — GPT-4o, Claude, Llama 3, DeepSeek — use byte-level BPE or its variations.

Step-by-Step Merging Example

Imagine we are building a tokenizer on one sentence: "low low low lower lower".

Step 0. Vocabulary: l, o, w, e, r (each letter separately)
Text:   l-o-w  l-o-w  l-o-w  l-o-w-e-r  l-o-w-e-r

Step 1. Most frequent pair: "l"+"o" appears 5 times → merge into "lo"
Text:   lo-w  lo-w  lo-w  lo-w-e-r  lo-w-e-r

Step 2. Most frequent pair: "lo"+"w" appears 5 times → merge into "low"
Text:   low  low  low  low-e-r  low-e-r

Step 3. Most frequent pair: "e"+"r" appears 2 times (tied with "low"+"e"; here the tie goes to "e"+"r") → merge into "er"
Text:   low  low  low  low-er  low-er

Step 4. Most frequent pair: "low"+"er" appears 2 times → merge into "lower"
Text:   low  low  low  lower  lower

Resulting vocabulary: l, o, w, e, r, lo, low, er, lower

This is a simplified version of the classic example from the original Sennrich et al. paper. On real data, the process is repeated tens of thousands of times on a corpus of hundreds of gigabytes. The final vocabulary contains from 30,000 to 200,000 tokens depending on the model.
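The merge loop for the toy corpus above can be written in a few lines of Python. This is a minimal educational sketch, not the implementation used by any production tokenizer: it works on characters rather than bytes, never merges across word boundaries, and breaks ties by picking the lexicographically smallest pair (an assumption made here so that the result is deterministic).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated corpus."""
    # Each distinct word is stored once, as a tuple of symbols, with its frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Highest count wins; ties go to the lexicographically smallest pair.
        best = min(pairs, key=lambda pair: (-pairs[pair], pair))
        merges.append(best)
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges, words


merges, words = train_bpe("low low low lower lower", num_merges=4)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('low', 'er')]
print(words)   # Counter({('low',): 3, ('lower',): 2})
```

After four merges both words are single symbols, so any further merge request would stop early.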

From BPE to Real Tokenizers: What GPT and Llama Add

No modern model uses "raw" BPE. Each provider adds its own modifications:

✔️ Regex pre-tokenization: before running BPE, the text is split by a regular expression into categories — letters, numbers, punctuation, spaces. This prevents "merging" across category boundaries: a number and a word won't become a single token. GPT-4 and GPT-4o use different regex patterns, which affects the outcome.
✔️ Predefined vocabulary: tiktoken (OpenAI's library) adds frequent words directly to the vocabulary. If a word is already in the vocabulary — it is returned whole, even if BPE merging rules wouldn't have created it.
✔️ Vocabulary growth: GPT-2 had ~50,000 tokens, GPT-4 — ~100,000 (cl100k_base), and GPT-4o — ~200,000 (o200k_base). A larger vocabulary means more efficient tokenization, especially for non-Latin languages and code.
✔️ SentencePiece: Llama and Mistral use SentencePiece from Google — an alternative implementation that supports BPE and Unigram algorithms and works directly with Unicode, without prior word splitting.
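The regex pre-tokenization step from the first item above can be illustrated with a deliberately simplified pattern. The pattern below is an ASCII-only stand-in written for this example; the real patterns used by GPT-style tokenizers rely on Unicode character classes and handle contractions, so treat it as a sketch of the idea rather than a copy of any production regex.

```python
import re

# Simplified, ASCII-only pre-tokenization pattern (an assumption for illustration).
# Each alternative keeps one category together and lets a single leading space
# attach to the chunk that follows it.
PRETOKENIZE = re.compile(
    r" ?[A-Za-z]+"            # letters, optionally preceded by one space
    r"| ?[0-9]+"              # digits, optionally preceded by one space
    r"| ?[^\sA-Za-z0-9]+"     # punctuation and other symbols
    r"|\s+"                   # any remaining whitespace
)


def pretokenize(text):
    """Split text into chunks; BPE merges are then applied inside each chunk only."""
    return PRETOKENIZE.findall(text)


chunks = pretokenize("The cat has 42 lives!")
print(chunks)  # ['The', ' cat', ' has', ' 42', ' lives', '!']
```

Because merges never cross chunk boundaries, a word and a number cannot fuse into one token, and the leading space travels with the word that follows it.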

Why BPE is Successful

Without any linguistic knowledge, the algorithm often recovers units that resemble morphemes (word roots, suffixes, prefixes). Frequent words become single tokens, rare ones are broken into known parts, and the model never encounters an "unknown word." This is the power of BPE: an open vocabulary with a fixed-size dictionary.

Want to see how it works in code? Andrej Karpathy (ex-OpenAI) created an educational implementation, minbpe, and Sebastian Raschka wrote a detailed guide on building a tokenizer from scratch. Both resources are ideal for a deeper understanding.

Read more about how tokens are processed within the transformer in our article on transformers and attention (in preparation).

Section Conclusion: BPE is an elegant 1994 compression algorithm that has become the foundation of tokenization for all modern LLMs. Its main drawback is uneven language coverage because the vocabulary is built on Anglocentric data. But even with this drawback, BPE remains the industry standard — alternatives have not yet proven scalable.

📌 4. Why Non-English Languages Are "More Expensive" in AI

Uneven BPE for Different Languages

BPE is trained on a corpus where 90%+ of the text is English and code. Therefore, English words get whole tokens, while words in other languages — Cyrillic, Chinese characters, Arabic script — are broken into small parts. One non-English word on average costs 2–5 tokens versus 1–2 for its English equivalent. In practice, this means: the same text in a non-English language is processed more expensively and takes up more space in the context window.

If the model's context window is 128K tokens, then in English, you can fit 2–3 times more text than in Cyrillic, and 3–4 times more than in Arabic.

Why This Happens: Three Levels of the Problem

Language inequality in tokenization is not an accident, but a result of three factors that reinforce each other:

1. Training Data Imbalance. Most LLMs are trained on Anglocentric corpora. For example, Llama 3 reports that 95% of training data is English and code, and only 5% is all other languages combined. BPE, during training, simply hasn't "seen" enough text in other languages to identify larger blocks as separate tokens.

2. UTF-8 Advantage for Latin Script. Byte-level BPE works with bytes, and UTF-8 encodes Latin letters with one byte, Cyrillic with two, and Chinese characters with three. Even if BPE were perfectly balanced by language, Latin script would have a structural advantage at the encoding level.

3. Morphological Complexity. Languages with rich morphology (Slavic, Turkic, Finno-Ugric) generate significantly more unique word forms from a single root. English "run" has ~5 forms, while the corresponding verb in Turkish or Finnish has dozens. For BPE, each rare form is a potential split into parts.
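The UTF-8 effect from the second factor is easy to verify directly. The short check below uses only the Python standard library and the sample words from this article; it counts bytes, not tokens, so it shows the starting point that byte-level BPE has to work from rather than the final token count.

```python
def utf8_bytes(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for sample in ["token", "токен", "授权"]:
    print(sample, len(sample), utf8_bytes(sample))
# token 5 5    (Latin: 1 byte per character)
# токен 5 10   (Cyrillic: 2 bytes per character)
# 授权 2 6     (Chinese: 3 bytes per character)
```

Before a single merge is applied, the Cyrillic word is already twice as long as its Latin equivalent in the units the tokenizer sees.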

Tokenization Comparison: One Word in Different Languages

Word	Language	Tokens (GPT-4o)
authorization	English	2
авторизація	Cyrillic	4–6
授权	Chinese	2–3
autorización	Spanish	3
Genehmigung	German	3–4
intelligence	English	2
інтелект	Cyrillic	4–5
智能	Chinese	2
inteligencia	Spanish	3
token	English	1
токен	Cyrillic	3–4

You can check your own words at tokenizer.openai.com, which shows the breakdown and exact token count for GPT-4o.

How Much It Costs: "Token Tariff" with Real Figures

The study "Do All Languages Cost the Same?" (Ahia et al., EMNLP 2023) analyzed 22 languages and showed: users of many languages are effectively overpaying for APIs while getting worse results. Some languages require up to 5 times more tokens for the same content.

The study "The Token Tax" (2025) went further: if a language requires twice as many tokens, training cost can grow by up to 4× (attention scales quadratically with sequence length, O(n²), so (2n)² = 4n²), with a corresponding increase in inference latency.

There is also good news: OpenAI has significantly improved the situation with each new tokenizer. According to an analysis of the 50,000 most frequent words in 12 languages, the number of tokens per word for Hindi dropped from 6.55 (GPT-2, 2019) to 1.89 (GPT-4o, 2024), a 71% improvement. For Russian it dropped from 5.16 to 1.96 (−62%). But even after that, Hindi is still 63% more expensive than English.

What a Developer Working with Non-English Languages Should Do

✔️ Factor in a 2–4× multiplier when calculating API budgets for non-Latin languages
✔️ Prompt engineering is more critical for non-English languages — each extra word costs more and consumes more context
✔️ Choose a model with a better vocabulary: GPT-4o (o200k_base, 200K tokens) is significantly more efficient for multilingual tasks than older models with smaller vocabularies
✔️ Consider specialized models: for a specific language, local or multilingual models (e.g., Qwen for Chinese or multilingual-e5-large for embedding tasks) often have better tokenization
✔️ Prompt caching: if the system prompt in a non-English language is large — prompt caching will reduce the cost of repeated requests by 80–90%. More details in our article about LLMs for business

Conclusion: Language inequality in tokenization is a documented and measurable problem that affects all non-Latin languages. It directly impacts API budgets, efficient use of the context window, and even the quality of model responses. The trend is positive — vocabularies are growing from 50K to 200K+ tokens, and the gap is narrowing — but complete equality has not yet been achieved.

📌 5. Glitch Tokens: Why "SolidGoldMagikarp" Breaks GPT

What are Glitch Tokens

Glitch tokens are tokens from the model's vocabulary for which the neural network has not learned normal behavior. They ended up in the vocabulary (because they were in the tokenizer's training data) but were absent or extremely rare in the main model training corpus. Consequence: When encountering such a token, the model generates unpredictable, chaotic, or offensive output.

Imagine a library where there's a card for a book, but the book itself is not on the shelf. The librarian (the model) gets disoriented and says something nonsensical.

2023 Discovery: SolidGoldMagikarp

In February 2023, researchers Jessica Rumbelow and Matthew Watkins, working within the SERI-MATS program, published an unexpected discovery on LessWrong: when asked to repeat the word "SolidGoldMagikarp," ChatGPT responded with "distribute." In other cases it refused entirely, replied in all caps, or insulted the user. The behavior was completely unpredictable.

The explanation: "SolidGoldMagikarp" is the nickname of a Reddit user who made hundreds of thousands of posts in a number-counting thread. GPT's tokenizer "learned" from this text and gave the nickname its own token. However, this Reddit content was filtered out before the model itself was trained, so the token stayed in the vocabulary without any learned meaning.

The token " petertodd" (with a leading space) behaved even more strangely. When GPT-3 was asked to repeat it, the model produced chaotic responses, from mystical poems to aggressive exclamations. As it turned out, Peter Todd is a Canadian cryptographer whose name was the subject of massive attacks on Reddit due to his work with Bitcoin. These comments made it into the tokenizer's data but not into the model's training corpus. A detailed study of this phenomenon is described on LessWrong: The 'petertodd' phenomenon.

Why It's Dangerous: From Curiosity to Vulnerability

At first glance, glitch tokens are a funny artifact. But for production systems, they create real risks:

✔️ Bypassing Safety Filters: A glitch token can "knock" the model out of assistant mode, causing it to ignore the system prompt and guardrails.
✔️ Unpredictable Hallucinations: Instead of refusing, the model generates chaotic content – from nonsense to offensive text.
✔️ Violation of Determinism: Even with temperature=0, glitch tokens break reproducibility – the same model gives different answers to the same query.
✔️ Attack Vector: An attacker can intentionally insert glitch tokens into input data to disrupt the LLM system's operation.

GlitchMiner: The Scale of the Problem in 2026

In 2024–2025, researchers developed an automated framework for finding glitch tokens – GlitchMiner (arXiv), accepted to the AAAI 2026 conference. The tool uses gradient optimization to find tokens with abnormally high prediction entropy.

Results: Approximately 4.3% of tokens in the vocabularies of GPT-4, Llama 2, and DeepSeek are potential glitch tokens. For a vocabulary of 100,000 tokens, this is ~4,300 "broken" units.

What Providers Have Done

OpenAI reacted quickly: on February 14, 2023, ChatGPT received a patch that prevents direct encounters with known glitch tokens. When transitioning from GPT-3 (r50k_base, ~50K tokens) to GPT-4 (cl100k_base, ~100K) and further to GPT-4o (o200k_base, ~200K), the vocabulary was completely rebuilt – old glitch tokens disappeared.

But the problem didn't disappear with them. Research into new glitch tokens in GPT-4 showed that each new tokenizer creates its own set of anomalous tokens. Tokens like ForCanBeConverted, YYSTACK, JSBracketAccess were found in cl100k_base and exhibit similar unpredictable behavior. This suggests that glitch tokens are a systemic property of the BPE approach, not a one-off bug.

What a Developer Should Do

✔️ Test Models for Glitch Tokens Before Production: NVIDIA Garak is an open-source LLM vulnerability scanner that includes a special probes.glitch module for automated testing.
✔️ Filter Input Data: If your system accepts arbitrary text from users, add a check for known glitch tokens in the input pipeline.
✔️ Use GlitchMiner for Deeper Analysis: GlitchMiner on GitHub allows you to find anomalous tokens in any model with accessible weights.
✔️ Monitor Output: Log instances where the model responds atypically – this could be a sign of encountering a glitch token.
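The input-filtering item above can start as a simple denylist check. The sketch below is a hypothetical starting point: the list contains only the strings mentioned in this article, glitch tokens are specific to each tokenizer, and a plain substring match is cruder than checking the actual token IDs produced by your model's tokenizer.

```python
# Illustrative denylist built from the strings named in this article.
# A real deployment would maintain a list per model/tokenizer.
KNOWN_GLITCH_STRINGS = (
    "SolidGoldMagikarp",
    "petertodd",
    "ForCanBeConverted",
    "YYSTACK",
    "JSBracketAccess",
)


def find_glitch_strings(user_text, denylist=KNOWN_GLITCH_STRINGS):
    """Return the denylisted strings that occur in the user's input."""
    return [entry for entry in denylist if entry in user_text]


hits = find_glitch_strings("Please repeat the word SolidGoldMagikarp back to me")
if hits:
    # What to do next (reject, strip, or just log) is a product decision.
    print("suspicious input:", hits)
```

Logging the hits, rather than silently dropping them, also feeds the output-monitoring step from the same list.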

Conclusion: Glitch tokens are not a theoretical vulnerability but a documented systemic problem present in all large models and reproducible with each new tokenizer. Providers patch known cases, but the BPE approach itself generates new anomalies. For production systems, testing for glitch tokens should be part of the security pipeline.

📌 6. Formatting Eats Tokens

How Formatting Affects Token Count

Spaces, indentation, line breaks, parentheses – all of them are tokenized. In indented code, whitespace can account for a significant portion of the tokens in the entire file. Markdown markup (asterisks, hashes, hyphens) also adds tokens. This directly impacts the cost of API requests.

Every space in your code is potentially a token you pay for.

Studies on Python code tokenization show that indentation, spaces, and special characters account for 15% to 25% of the total tokens in a typical file. For large codebases, that adds up to real money when using the API.

Practical Implications for Developers

✔️ Minimize indentation in system prompts (4 spaces → 2 spaces or a tab).
✔️ JSON without line breaks takes up fewer tokens than pretty-printed JSON.
✔️ Markdown headers (###) and lists (- item) add tokens – avoid them in system prompts where not needed.
✔️ Repeating patterns (e.g., the same prefix in each array element) are efficiently compressed by BPE.
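The JSON advice above is easy to check with the standard library. The payload below is made up for the example, and the comparison counts characters rather than tokens, which is only a rough proxy: the exact token savings depend on the tokenizer.

```python
import json

# Hypothetical payload used only for this comparison.
payload = {"user": {"id": 42, "roles": ["admin", "editor"]}, "active": True}

# Pretty-printed: newlines plus 4-space indentation on every nested line.
pretty = json.dumps(payload, indent=4)

# Compact: no newlines, no spaces after separators.
compact = json.dumps(payload, separators=(",", ":"))

print(len(pretty), len(compact))
print(compact)
```

Both strings decode to the same object, so the compact form is a safe default for anything the model reads but a human does not.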

Prompt Caching – How to Save Money

All major providers (OpenAI, Anthropic, Google) support prompt caching: if the prefix of your prompt doesn't change between requests, reprocessing costs 80–90% less. For products with a large system prompt, this is the easiest way to reduce costs. More details can be found in our article on LLMs for Business.

Section Conclusion: Formatting is not free. Optimizing prompts with token count in mind can reduce costs by 15–30% without loss of quality.

💼 7. How Much a Token Costs in the API in 2026

Token Prices in 2026

LLM provider APIs charge separately for input tokens (your request) and output tokens (the model's response). Output is usually several times more expensive (4–8× for most models in the table below, 2× or less for DeepSeek V3.2 and Grok 3 mini) because generation is sequential and costlier than reading the prompt in parallel. In 2026, prices have dropped by approximately 80% compared to 2025 due to competition from DeepSeek and open-source models.

DeepSeek has shaken up the market: frontier quality at a price that seemed impossible just a year ago.

Current prices as of March 2026 (sources: TLDL LLM API Pricing, March 2026, CostGoat LLM Pricing, PricePerToken.com):

Main (Chat) Models

Model	Input ($/1M)	Output ($/1M)	Context	Comment
GPT-5.4 (OpenAI)	$2.50	$10.00	128K	OpenAI's flagship, replaced GPT-4o
GPT-5 mini (OpenAI)	$0.25	$2.00	128K	Budget option for simple tasks
GPT-5 nano (OpenAI)	$0.05	$0.40	128K	Cheapest from OpenAI
Claude Sonnet 4.6 (Anthropic)	$3.00	$15.00	200K	Top for complex instructions and code
Claude Haiku 4.5 (Anthropic)	$0.25	$1.25	200K	Budget Claude, updated price
Gemini 2.5 Pro (Google)	$1.25	$10.00	1M	Largest context among main models
Gemini 2.5 Flash (Google)	$0.30	$2.50	1M	Excellent price/quality ratio
Gemini 2.0 Flash-Lite (Google)	$0.075	$0.30	1M	Cheapest among major providers
DeepSeek V3.2	$0.14	$0.28	128K	Chat + reasoning in one model
Grok 3 mini (xAI)	$0.30	$0.50	128K	Best output/input ratio

Reasoning Models: Thinking is More Expensive

Reasoning models (o3, DeepSeek R1) generate "thinking tokens" – internal thoughts, for which you also pay. This means that the output for a reasoning task can be 5–20 times longer than the final answer. More details about reasoning – in our article (in preparation).

Model	Input ($/1M)	Output ($/1M)	Note
o3 Pro (OpenAI)	$150.00	$600.00	Most expensive model on the market
o3 (OpenAI)	$10.00	$40.00	Reasoning flagship
DeepSeek R1	$0.55	$2.19	Reasoning at the price of a chat model

How a Real Request is Calculated

Imagine: you send a system prompt (500 tokens) + a question (50 tokens) and receive a response (200 tokens).

On Claude Sonnet 4.6: (550 × $3.00 + 200 × $15.00) / 1,000,000 = ~$0.0047 per request. At 10,000 requests per day – $47/day or ~$1,400/month.

The same task on DeepSeek V3.2: (550 × $0.14 + 200 × $0.28) / 1,000,000 = ~$0.00013 per request. At 10,000 requests/day – $1.3/day or ~$40/month. The difference is 35 times.
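The arithmetic above fits in one small function. The sketch below uses the per-1M prices from the table in this section; the function name and the 30-day month are assumptions made for the example.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost of one request in USD; prices are given in USD per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# 500-token system prompt + 50-token question in, 200 tokens out.
sonnet = request_cost(550, 200, input_price=3.00, output_price=15.00)
deepseek = request_cost(550, 200, input_price=0.14, output_price=0.28)

requests_per_day = 10_000
for name, cost in [("Claude Sonnet 4.6", sonnet), ("DeepSeek V3.2", deepseek)]:
    daily = cost * requests_per_day
    print(f"{name}: ${cost:.6f}/request, ${daily:.2f}/day, ${daily * 30:.0f}/month")

print(f"ratio: {sonnet / deepseek:.0f}x")  # ratio: 35x
```

Swapping in your own token counts and prices gives a quick budget estimate before you commit to a provider.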

Important for non-Latin languages: if your requests are in Cyrillic, Chinese, or Arabic – multiply tokens by a coefficient of 2–4× (see Section 4 on language inequality). This directly increases your bill – and makes choosing a cheaper model even more critical.

How to Save on API: 5 Proven Methods

✔️ Prompt Caching – if the system prompt doesn't change between requests, caching reduces input cost by 80–90%. OpenAI, Anthropic, and Google support this feature. DeepSeek offers cache hits for $0.028/1M – an 80% discount from the $0.14 base price.
✔️ Batch API – send non-interactive requests in batches. Anthropic offers a 50% discount on batch requests, OpenAI similarly.
✔️ Model Routing – use a cheap model (Gemini Flash, DeepSeek V3.2) for simple requests and a more expensive one (Claude Sonnet, GPT-5.4) only for complex ones. A router based on request classification can save 60–80%.
✔️ Compress Prompts – remove unnecessary formatting, minimize JSON, avoid repetition in the system prompt (more details in Section 6 on formatting).
✔️ Local AI – for non-critical tasks, Ollama + an open-source model (Llama 4 Maverick, DeepSeek V3) costs $0/month for API. More details in our article on Ollama.
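To see what prompt caching does to the input bill, the sketch below prices a cached prefix separately from the fresh part of the request. The 90% discount is an assumed figure taken from the upper end of the range quoted above, and the model ignores any surcharge a provider may apply when the cache is first written.

```python
def cached_input_cost(prefix_tokens, fresh_tokens, input_price, cache_discount):
    """Input cost in USD for one request with a cached prompt prefix.

    input_price is USD per 1M tokens; cache_discount is a fraction (0.9 = 90% off).
    """
    cached = prefix_tokens * input_price * (1 - cache_discount)
    fresh = fresh_tokens * input_price
    return (cached + fresh) / 1_000_000


# 500-token system prompt (cacheable) + 50-token question at $3.00 per 1M input tokens.
without_cache = cached_input_cost(500, 50, input_price=3.00, cache_discount=0.0)
with_cache = cached_input_cost(500, 50, input_price=3.00, cache_discount=0.9)

saving = 1 - with_cache / without_cache
print(f"{without_cache:.6f} -> {with_cache:.6f} USD ({saving:.0%} saved on input)")
```

The saving on the whole input is lower than the headline discount because the uncached question is still billed at full price; the larger the fixed prefix, the closer you get to the quoted 80–90%.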

Interactive calculators for cost calculation: LangCopilot Token Calculator (41 models, updated March 2026) and LLM Pricing Calculator.

Detailed calculation of AI costs for various business scenarios – in our article on AI costs (in preparation). Also see LLM vs RAG: the right architecture can reduce the number of tokens by an order of magnitude.

Conclusion: Prices have dropped by ~80% in a year, but the difference between models can reach 35–100× for the same task. Reasoning models cost an order of magnitude more due to thinking tokens. For production systems, a combination of prompt caching + model routing + batch API reduces costs by 5–10 times.

💼 8. The Future of Tokenization: SuperBPE, BoundlessBPE, BLT

Where Tokenization is Heading

In 2025–2026, several competing directions emerged: multi-word tokens (SuperBPE, BoundlessBPE), cleaning existing vocabularies (LiteToken), and a complete abandonment of tokens in favor of bytes (BLT from Meta). None have yet replaced BPE in production, but the pressure is growing.

BPE is like the QWERTY keyboard: not optimal, but everyone is used to it. Alternatives exist, the transition is slow.

SuperBPE and BoundlessBPE — Tokens Longer Than One Word

Classic BPE never merges tokens across word boundaries: "New" and "York" always remain separate. Two new approaches, accepted at the COLM 2025 conference, remove this limitation:

✔️ SuperBPE (Liu et al., 2025) — a two-pass BPE: first, standard subword training, then a second pass without word boundary restrictions. Result: 33% fewer tokens, +4.0% average improvement on 30 benchmarks, and +8.2% on MMLU — solely due to better tokenization.
✔️ BoundlessBPE (Schmidt et al., 2025) — a single-pass variant where frequent phrases ("of the", "machine learning") become a single token. Compression improvement: up to 20% more bytes per token.

LiteToken — Cleaning Up the Vocabulary

LiteToken (arXiv, February 2026) — a lightweight algorithm for removing "intermediate merge remnants" from BPE vocabularies. These are tokens that entered the vocabulary during training but are never used independently in real text. For DeepSeek-V3, LiteToken reduced the 3-gram vocabulary by ~22% without retraining the model — plug-and-play. A smaller vocabulary means fewer parameters and a lower risk of glitch tokens (Section 5).

BLT from Meta — No Tokenizer at All

The most radical approach is the Byte Latent Transformer (BLT, Meta AI). Instead of a fixed vocabulary, BLT processes raw bytes and dynamically groups them into "patches" based on the next byte's entropy: where the text is predictable — a large patch, where it's complex — a smaller one.
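The patching idea can be sketched as a simple threshold rule. This is a simplified reading of the approach, not Meta's implementation: in BLT the per-byte entropies come from a small byte-level language model, while the numbers below are invented, and the threshold value is an arbitrary choice for the example.

```python
def patch_boundaries(entropies, threshold):
    """Group byte positions into patches.

    A new patch starts whenever the entropy of the next byte exceeds `threshold`,
    so predictable stretches end up in long patches and surprising bytes open new ones.
    """
    patches, current = [], []
    for position, entropy in enumerate(entropies):
        if current and entropy > threshold:
            patches.append(current)
            current = []
        current.append(position)
    if current:
        patches.append(current)
    return patches


# Invented next-byte entropies for a 9-byte input (high = hard to predict).
toy_entropies = [2.1, 0.3, 0.2, 0.1, 1.9, 0.4, 0.2, 2.5, 0.1]
patches = patch_boundaries(toy_entropies, threshold=1.5)
print(patches)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8]]
```

Each patch is then processed as one unit by the large model, which is where the compute savings come from.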

On tests up to 8B parameters, BLT matches Llama 3 in quality with up to half the inference FLOPs. The main advantage for multilingual tasks: there is no learned vocabulary that favors one language, so the vocabulary-level source of language inequality (Section 4) is removed at the architecture level. One nuance remains: UTF-8 encodes Latin characters with 1 byte and Cyrillic with 2, so complete equality is still not guaranteed. Code and weights: GitHub facebookresearch/blt.

Conclusion: BPE still reigns, but alternatives are already showing concrete results: +8.2% on MMLU (SuperBPE), -22% vocabulary (LiteToken), up to 2x fewer inference FLOPs (BLT). 2026–2027 could be years of transition, especially if BLT proves its scalability on larger models.

📌 9. Practice: Check Your Text

The best way to understand tokens is to see them in your own text. Here are the tools:

✔️ platform.openai.com/tokenizer: OpenAI's official tokenizer for GPT-4o. Shows a color-coded breakdown and the exact token count.
✔️ tokencalculator.ai: Compares token counts and costs for several models at once.
✔️ digiqt.com/tools/llm-cost-calculator: API cost calculator for 30+ models (updated March 2026).

Try pasting the same sentence in Ukrainian and English and compare the token count — the difference will surprise you.

❓ Frequently Asked Questions (FAQ)

How many tokens are in one word?

It depends on the language and word frequency. A short, frequent English word ("the", "is", "cat") is 1 token. A long or rare one is 2–3. A Ukrainian or other Cyrillic word of medium length is 3–6 tokens in most models. An average token for English corresponds to approximately 4 characters or ¾ of a word.
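
If you only need a ballpark figure for English text, the two rules of thumb above can be turned into a quick estimator. This is a rough sketch, not a tokenizer: the function names are hypothetical, and the result can be far off for code, rare words, or Cyrillic text.

```python
import math

def estimate_tokens_from_chars(text: str) -> int:
    """Rule of thumb for English: about 4 characters per token."""
    return math.ceil(len(text) / 4)

def estimate_tokens_from_words(word_count: int) -> int:
    """Rule of thumb for English: one token is about 3/4 of a word."""
    return math.ceil(word_count * 4 / 3)

prompt = "Explain how byte pair encoding builds its vocabulary."
print(estimate_tokens_from_chars(prompt))
print(estimate_tokens_from_words(len(prompt.split())))
```

For billing or context-window limits, rely on the provider's own tokenizer instead of an estimate like this.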

Why does ChatGPT count tokens, not words?

Words are a human concept that depends on language. Tokens are the operational unit of a neural network, independent of any grammar. With tokens, it's easier to calculate computational cost, manage the context window, and compare models.

What is a glitch token and is it dangerous for my application?

A glitch token is a token from the vocabulary for which the model never learned normal behavior, because its embedding was undertrained. For a regular chat application, the risk is minimal, since users rarely input such strings. However, for security systems, classifiers, and anything else where input is not controlled, it's worth testing for known glitch tokens. Tool: GlitchMiner on GitHub.

How to reduce the number of tokens in a prompt?

The most effective methods: remove unnecessary formatting (indentation, Markdown where not needed), compress JSON (no pretty-printing), and remove redundant repetitions in the system prompt. For code, minimize comments and blank lines where possible. Prompt caching for unchanging parts does not reduce the token count, but it does lower what you pay for those tokens.
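
As an illustration of the JSON point, the sketch below serializes the same made-up payload twice with Python's standard `json` module and compares the sizes. Character counts are used as a proxy here. The exact token savings depend on the model's tokenizer.

```python
import json

# Hypothetical payload, invented for this example.
payload = {
    "user": {"id": 42, "name": "Olena", "roles": ["admin", "editor"]},
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
}

# Pretty-printed: indentation and newlines are all extra characters.
pretty = json.dumps(payload, indent=2)

# Compact: no whitespace after "," or ":".
compact = json.dumps(payload, separators=(",", ":"))

print(len(pretty), "characters pretty-printed")
print(len(compact), "characters compact")
```

Both strings decode to the same object, so the model receives identical data in fewer characters.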

Do input and output tokens cost the same?

No. Output tokens are 3–10 times more expensive depending on the provider because generation is sequential (each token depends on the previous one) and requires more GPU time. For example, in Claude Sonnet 4.6: input — $3/1M, output — $15/1M (5x difference). In GPT-4o: input — $2.50/1M, output — $10/1M (4x difference).
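
To see how the input/output split affects a bill, here is a minimal cost calculator using the two price pairs quoted above. The dictionary keys are just labels for this example, not official API model identifiers, and prices change often, so treat the numbers as a snapshot.

```python
# (input, output) price in USD per 1M tokens, taken from the figures above.
PRICES_PER_MILLION = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "gpt-4o": (2.50, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request with the given token counts."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A long prompt with a short answer vs. a short prompt with a long answer.
for model in PRICES_PER_MILLION:
    long_prompt = request_cost(model, input_tokens=8_000, output_tokens=500)
    long_answer = request_cost(model, input_tokens=500, output_tokens=8_000)
    print(f"{model}: {long_prompt:.4f} USD vs {long_answer:.4f} USD")
```

The same total number of tokens costs several times more when most of them are generated rather than read, which is why capping output length is often the cheapest optimization.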

What will come after BPE?

The most promising direction is byte-based models like Meta's BLT, which don't require a tokenizer at all. However, BPE remains the standard in all top models for now. The transition, if it happens, will take several years after scalability is proven.

✅ Conclusions

🔹 A token is not a word or a character, but a statistical unit: a text fragment frequent enough to earn its own vocabulary entry. All LLMs, including ChatGPT, Claude, and Gemini, see your text as a sequence of numbers.
🔹 BPE builds a vocabulary by iteratively merging frequent pairs, which is efficient for English but costs Cyrillic text 3–4x more tokens.
🔹 ~4.3% of tokens in GPT-4, Llama 2, and DeepSeek vocabularies are potential glitch tokens, causing unpredictable behavior (GlitchMiner, AAAI 2026).
🔹 API prices have dropped ~80% in a year: from $10/1M (2025) to $0.14–2.50/1M (2026). Depending on the model you choose, cost can differ by about 20x.
🔹 The future is Meta's BLT (no tokenizer at all) and LiteToken (cleaner BPE), but BPE will dominate for a few more years.

Main takeaway: Tokens are the foundation upon which everything rests: the context window, API cost, quality of work with non-Latin languages, and even system security. Understanding them means understanding what's actually happening inside AI.

Read also in the "How LLMs Work" series:

📌 How ChatGPT, Claude, and Gemini Actually Work: The Complete 2026 Guide (anchor article of the series)
📌 Context Window: Why AI Forgets and How Much It Costs
📌 LLM vs RAG in 2026: When to Use What
📌 AI Hallucinations: What They Are, Why They Are Dangerous, and How to Avoid Them
📌 Embedding Models for RAG in 2026: How to Choose and Provider Comparison
📌 Ollama in 2026: What It Is and Why Developers Are Massively Switching to Local AI
