You write "Hello" in ChatGPT — and think you've sent one word.
In reality, the AI received 3–4 numbers. This is how tokens work — invisible
units that all large language models think in.
Spoiler: one word in Cyrillic is already 3–4 tokens
versus 1–2 for English, code formatting eats up to a quarter
of tokens, and some words literally break GPT.
⚡ In Short
- ✅ Token ≠ Word: one English word is approximately 1 token, one word in Latin script is 3–4 tokens.
- ✅ BPE: the tokenizer builds a vocabulary by merging frequent character pairs — effective for English, expensive for Cyrillic.
- ✅ Glitch Tokens: ~4.3% of the GPT-4 and Llama 2 vocabulary contain "broken" tokens that cause unpredictable behavior.
- ✅ Prices in 2026 are Falling: DeepSeek V3.2 costs $0.14/1M input tokens, GPT-4o — $2.50/1M.
- 🎯 You Will Get: an understanding of tokens from basics to API practice, plus a price table for choosing a provider.
- 👇 Below are detailed explanations, examples, and tables
📚 Table of Contents
🎯 1. Why AI Doesn't Read Words — It Reads Tokens
What is a Token in LLMs
A token is not a word or a character, but a text fragment of arbitrary length: part of a word, a whole word, or even several words together.
GPT-5, Claude, and Gemini see your request as a sequence of numerical identifiers from their own vocabulary.
The word "Hello" is 1 token (id: 9906), and "authorization" is already 2 tokens, even though it's also one word.
A computer cannot read letters. It can count numbers. A token is a bridge between human language and the numerical world of a neural network.
When you send a message to ChatGPT, it doesn't go directly to the neural network.
First, the text goes through a tokenizer — a special program that breaks your string into fragments and replaces each with a number from the model's vocabulary.
For example, the sentence "The cat sat on the mat" is seen by the model approximately as: [791, 8415, 9137, 389, 279, 2450] — 6 numbers instead of 6 words.
The model reads precisely these numbers, processes them through hundreds of transformer layers, and at the output, through the tokenizer again, converts the numbers back into text.
Why Not Just Letters or Words?
A character-based approach (one character = one unit) results in overly long sequences — the neural network processes them poorly due to the quadratic complexity of attention. More on this in the article about the context window.
A vocabulary-based approach (one word = one token) also doesn't work: there are hundreds of thousands of words in English, plus numbers, names, code, and emojis. The vocabulary would be infinite.
Therefore, a compromise prevailed — subword tokenization: parts of words ("run", "ning", "##s"), whole frequent words ("the", "is"), rare words are broken into parts. This is what the BPE algorithm does (section 3).
- ✔️ GPT-5.4 / GPT-4o Vocabulary: ~200,000 tokens (
o200k_base)
- ✔️ Llama 3 / 4 Vocabulary: ~128,000 tokens
- ✔️ DeepSeek V3 Vocabulary: ~128,000 tokens
Section Conclusion: AI reads neither letters nor words — it reads tokens, and the efficiency of your text tokenization affects both the quality of the response and your API bill.
📌 2. Why a Token ≠ Word: Length Anomalies and Special Tokens
Why Token Length is Unpredictable
The length of a token depends not on grammar, but on frequency: the more often a certain sequence of characters appeared in the training data — the higher the chance it became a single token. Hence the paradoxes: "GPT-4" =
2 tokens, "GPT4" = 1. The hyphen changes everything.
A token is not a unit of length, but a unit of frequency in the training data.
The easiest way to feel this is to check familiar words in the
OpenAI tokenizer. The results often surprise even experienced developers.
Examples That Break Intuition
| String |
Tokens |
Breakdown |
Why so |
ChatGPT |
3 |
Chat + G + PT |
New name, BPE didn't see it whole |
OpenAI |
2 |
Open + AI |
Both parts are frequent separately |
tokenization |
3 |
token + iz + ation |
Word is rarer than its parts |
GPT-4 |
2 |
GPT + -4 |
Hyphen breaks the merge |
GPT4 |
1 |
GPT4 |
Without hyphen — one token |
cat |
1 |
cat |
Frequent word |
cat (with space) |
1 |
·cat |
A different token than without a space |
The last line is particularly important. In BPE, a space at the beginning of a word is
part of the token, not a separate character. Therefore, cat
and cat have different numerical IDs. The model literally sees them as
different words — and processes them differently depending on their position in the sentence.
Special Tokens: Model's Service Symbols
In addition to text tokens, each model has a set of special tokens —
service markers that denote dialogue structure and control model behavior:
| Token |
Meaning |
Where it's found |
<|endoftext|> |
End of document |
GPT-4, GPT-4o |
[BOS] |
Beginning of sequence |
Llama, Mistral |
[EOS] |
End of sequence |
Llama, Mistral |
[PAD] |
Batch padding |
Model training |
[INST] / [/INST] |
Instruction start/end |
Llama 2 Instruct |
<|im_start|> |
Message start |
GPT-4, ChatML format |
Special tokens directly affect model behavior: if they end up in user input accidentally or intentionally — the model can exit "assistant" mode and behave unpredictably. This is one of the vectors of attack on LLM systems.
Section Conclusion: A token is a statistical unit, not a linguistic one. Spaces, hyphens, and case — all of these change tokenization.
Understanding this gives an advantage when writing prompts and designing LLM-based systems.
📌 3. How BPE Works: Merging Characters on Your Fingers
Byte Pair Encoding Algorithm
BPE builds a vocabulary starting with individual bytes and iteratively merges the most frequent pairs.
It's a greedy compression algorithm: if "ing" appears more often than any other pair — it becomes a single token.
The process is repeated thousands of times until the vocabulary reaches the desired size.
BPE is like building words from Lego bricks: at first, there are only individual letter bricks, then you start joining the most popular combinations.
A Brief History: From Compression to GPT
BPE didn't originate in AI. In 1994, Philip Gage published the algorithm as a data compression method in The C Users Journal.
The idea was simple: find the most frequent pair of bytes in a file, replace it with a single new byte, repeat.
In 2016, researchers Rico Sennrich, Barry Haddow, and Alexandra Birch adapted BPE for machine translation in the paper
"Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016).
Instead of compressing bytes for storage — compressing characters into subwords for neural networks.
This work became the foundation of tokenization in all modern LLMs.
In 2019, OpenAI took the next step in GPT-2: they moved from character-level BPE to byte-level BPE.
The difference is critical: instead of individual Unicode characters, the algorithm works with 256 base bytes.
This means any text in any language is guaranteed to be encoded — there are no "unknown characters."
All modern models — GPT-4o, Claude, Llama 3, DeepSeek — use byte-level BPE or its variations.
Step-by-Step Merging Example
Imagine we are building a tokenizer on one sentence: "low low low lower lower".
Step 0. Vocabulary: l, o, w, e, r (each letter separately)
Text: l-o-w l-o-w l-o-w l-o-w-e-r l-o-w-e-r
Step 1. Most frequent pair: "l"+"o" appears 5 times → merge into "lo"
Text: lo-w lo-w lo-w lo-w-e-r lo-w-e-r
Step 2. Most frequent pair: "lo"+"w" appears 5 times → merge into "low"
Text: low low low low-e-r low-e-r
Step 3. Most frequent pair: "e"+"r" appears 2 times → merge into "er"
Text: low low low low-er low-er
Resulting vocabulary: l, o, w, e, r, lo, low, er, lower
This is a classic example from Sennrich's original paper. On real data, the process is repeated tens of thousands of times on a corpus of hundreds of gigabytes.
The final vocabulary contains from 30,000 to 200,000 tokens depending on the model.
From BPE to Real Tokenizers: What GPT and Llama Add
No modern model uses "raw" BPE. Each provider adds its own modifications:
- ✔️ Regex pre-tokenization: before running BPE, the text is split by a regular expression into categories — letters, numbers, punctuation, spaces.
This prevents "merging" across category boundaries: a number and a word won't become a single token. GPT-4 and GPT-4o use different regex patterns, which affects the outcome.
- ✔️ Predefined vocabulary:
tiktoken (OpenAI's library) adds frequent words directly to the vocabulary.
If a word is already in the vocabulary — it is returned whole, even if BPE merging rules wouldn't have created it.
- ✔️ Vocabulary growth: GPT-2 had ~50,000 tokens, GPT-4 — ~100,000 (
cl100k_base), and GPT-4o — ~200,000 (o200k_base).
A larger vocabulary means more efficient tokenization, especially for non-Latin languages and code.
- ✔️ SentencePiece: Llama and Mistral use
SentencePiece from Google — an alternative implementation that supports BPE and Unigram algorithms and works directly with Unicode, without prior word splitting.
Why BPE is Successful
The algorithm automatically identifies morphological units (word roots, suffixes, prefixes) without any linguistic knowledge.
Frequent words become single tokens, rare ones are broken into known parts — the model never encounters an "unknown word."
This is the power of BPE: Open vocabulary with a fixed dictionary.
Want to see how it works in code? Andrej Karpathy (ex-OpenAI) created a learning implementation
minBPE, and
Sebastian Raschka
wrote a detailed guide on building a tokenizer from scratch — both resources are ideal for deeper understanding.
Read more about how tokens are processed within the transformer in our article on transformers and attention (in preparation).
Section Conclusion: BPE is an elegant 1994 compression algorithm that has become the foundation of tokenization for all modern LLMs. Its main drawback is uneven language coverage because the vocabulary is built on Anglocentric data. But even with this drawback, BPE remains the industry standard — alternatives have not yet proven scalable.
📌 4. Why Non-English Languages Are "More Expensive" in AI
Uneven BPE for Different Languages
BPE is trained on a corpus where 90%+ of the text is English and code. Therefore, English words get whole tokens, while words in other languages — Cyrillic, Chinese characters, Arabic script — are broken into small parts.
One non-English word on average costs 2–5 tokens versus 1–2 for its English equivalent.
In practice, this means: the same text in a non-English language is processed more expensively and takes up more space in the context window.
If the model's context window is 128K tokens, then in English, you can fit 2–3 times more text than in Cyrillic, and 3–4 times more than in Arabic.
Why This Happens: Three Levels of the Problem
Language inequality in tokenization is not an accident, but a result of three factors that reinforce each other:
1. Training Data Imbalance.
Most LLMs are trained on Anglocentric corpora. For example, Llama 3 reports that
95% of training data is English and code, and only 5% is all other languages combined.
BPE, during training, simply hasn't "seen" enough text in other languages to identify larger blocks as separate tokens.
2. UTF-8 Advantage for Latin Script.
Byte-level BPE works with bytes, and UTF-8 encodes Latin letters with one byte, Cyrillic with two, and Chinese characters with three.
Even if BPE were perfectly balanced by language, Latin script would have a structural advantage at the encoding level.
3. Morphological Complexity.
Languages with rich morphology (Slavic, Turkic, Finno-Ugric) generate significantly more unique word forms from a single root.
English "run" has ~5 forms, while the corresponding verb in Turkish or Finnish has dozens.
For BPE, each rare form is a potential split into parts.
Tokenization Comparison: One Word in Different Languages
| Word |
Language |
Tokens (GPT-4o) |
| authorization | English | 2 |
| авторизація | Cyrillic | 4–6 |
| 授权 | Chinese | 2–3 |
| autorización | Spanish | 3 |
| Genehmigung | German | 3–4 |
| intelligence | English | 2 |
| інтелект | Cyrillic | 4–5 |
| 智能 | Chinese | 2 |
| inteligencia | Spanish | 3 |
| token | English | 1 |
| токен | Cyrillic | 3–4 |
You can check your own words in tokenizer.openai.com — it shows the breakdown and exact token count for GPT-4o.
How Much It Costs: "Token Tariff" with Real Figures
The study
"Do All Languages Cost the Same?" (Ahia et al., EMNLP 2023)
analyzed 22 languages and showed: users of many languages are effectively overpaying for APIs while getting worse results.
Some languages require up to 5 times more tokens for the same content.
The study
"The Token Tax" (2025)
went further: if a language requires twice as many tokens, it means a 4× increase in training cost (due to quadratic attention complexity O(n²)) and a corresponding increase in inference latency.
There is also good news: OpenAI has significantly improved the situation with each new tokenizer.
According to an analysis of the 50,000 most frequent words in 12 languages, the number of tokens per word for Hindi dropped from 6.55 (GPT-2, 2021) to 1.89 (GPT-4o, 2024) — a 71% improvement.
For Russian — from 5.16 to 1.96 (−62%).
But even after that, Hindi is still 63% more expensive than English.
What a Developer Working with Non-English Languages Should Do
- ✔️ Factor in a x2–4 multiplier when calculating API budgets for non-Latin languages
- ✔️ Prompt engineering is more critical for non-English languages — each extra word costs more and consumes more context
- ✔️ Choose a model with a better vocabulary: GPT-4o (
o200k_base, 200K tokens) is significantly more efficient for multilingual tasks than older models with smaller vocabularies
- ✔️ Consider specialized models: for a specific language, local or multilingual models (e.g.,
Qwen for Chinese or
multilingual-e5-large for embedding tasks) often have better tokenization
- ✔️ Prompt caching: if the system prompt in a non-English language is large — prompt caching will reduce the cost of repeated requests by 80–90%.
More details in our article about LLMs for business
Conclusion: Language inequality in tokenization is a documented and measurable problem that affects all non-Latin languages. It directly impacts API budgets, efficient use of the context window, and even the quality of model responses. The trend is positive — vocabularies are growing from 50K to 200K+ tokens, and the gap is narrowing — but complete equality has not yet been achieved.
📌 5. Glitch Tokens: Why "SolidGoldMagikarp" Breaks GPT
What are Glitch Tokens
Glitch tokens are tokens from the model's vocabulary for which the neural network has not learned normal behavior.
They ended up in the vocabulary (because they were in the tokenizer's training data) but were absent or extremely rare in the main model training corpus.
Consequence: When encountering such a token, the model generates unpredictable, chaotic, or offensive output.
Imagine a library where there's a card for a book, but the book itself is not on the shelf. The librarian (the model) gets disoriented and says something nonsensical.
2023 Discovery: SolidGoldMagikarp
In January 2023, researchers Jessica Rumbelow and Matthew Watkins, as part of the SERI-MATS program,
published on LessWrong
an unexpected discovery: when asked to repeat the word "SolidGoldMagikarp," ChatGPT responded with "distribute."
Or it refused entirely, shouted, or insulted – the behavior was completely unpredictable.
The reason turned out to be: "SolidGoldMagikarp" is the nickname of a Reddit user who made hundreds of thousands of posts in a number-counting thread.
GPT's tokenizer "learned" from this text and assigned the nickname to a separate token.
However, during the model's training, this Reddit content was filtered out, and the token "remained" in the vocabulary without any meaning.
The token petertodd (with a leading space) behaved even more strangely.
When GPT-3 was asked to repeat it, the model produced chaotic responses –
from mystical poems to aggressive exclamations.
As it turned out, Peter Todd is a Canadian cryptographer whose name was the subject of massive attacks on Reddit
due to his work with Bitcoin. These comments made it into the tokenizer's data but not into the model's training corpus.
A detailed study of this phenomenon is described on
LessWrong: The 'petertodd' phenomenon.
Why It's Dangerous: From Curiosity to Vulnerability
At first glance, glitch tokens are a funny artifact. But for production systems, they create real risks:
- ✔️ Bypassing Safety Filters: A glitch token can "knock" the model out of assistant mode, causing it to ignore the system prompt and guardrails.
- ✔️ Unpredictable Hallucinations: Instead of refusing, the model generates chaotic content – from nonsense to offensive text.
- ✔️ Violation of Determinism: Even with temperature=0, glitch tokens break reproducibility – the same model gives different answers to the same query.
- ✔️ Attack Vector: An attacker can intentionally insert glitch tokens into input data to disrupt the LLM system's operation.
GlitchMiner: The Scale of the Problem in 2026
In 2024–2025, researchers developed an automated framework for finding glitch tokens –
GlitchMiner (arXiv),
accepted to the AAAI 2026 conference.
The tool uses gradient optimization to find tokens with abnormally high prediction entropy.
Results: Approximately 4.3% of tokens in the vocabularies of GPT-4, Llama 2, and DeepSeek are potential glitch tokens.
For a vocabulary of 100,000 tokens, this is ~4,300 "broken" units.
What Providers Have Done
OpenAI reacted quickly: on February 14, 2023, ChatGPT received a patch that prevents direct encounters with known glitch tokens.
When transitioning from GPT-3 (r50k_base, ~50K tokens) to GPT-4 (cl100k_base, ~100K) and further to GPT-4o (o200k_base, ~200K),
the vocabulary was completely rebuilt – old glitch tokens disappeared.
But the problem didn't disappear with them. Research into
new glitch tokens in GPT-4
showed that each new tokenizer creates its own set of anomalous tokens.
Tokens like ForCanBeConverted, YYSTACK, JSBracketAccess
were found in cl100k_base and exhibit similar unpredictable behavior.
This suggests that glitch tokens are a systemic property of the BPE approach, not a one-off bug.
What a Developer Should Do
- ✔️ Test Models for Glitch Tokens Before Production:
NVIDIA Garak is an
open-source LLM vulnerability scanner that includes a special
probes.glitch module for automated testing.
- ✔️ Filter Input Data: If your system accepts arbitrary text from users, add a check for known glitch tokens in the input pipeline.
- ✔️ Use GlitchMiner for Deeper Analysis:
GlitchMiner on GitHub allows you to find anomalous tokens in any model with accessible weights.
- ✔️ Monitor Output: Log instances where the model responds atypically – this could be a sign of encountering a glitch token.
Conclusion: Glitch tokens are not a theoretical vulnerability but a documented systemic problem present in all large models and reproducible with each new tokenizer. Providers patch known cases, but the BPE approach itself generates new anomalies. For production systems, testing for glitch tokens should be part of the security pipeline.
📌 6. Formatting Eats Tokens
How Formatting Affects Token Count
Spaces, indentation, line breaks, parentheses – all of them are tokenized. In code with indentation, tabulation can account for a significant portion of tokens in the entire file.
Markdown markup (asterisks, hashes, hyphens) also adds tokens. This directly impacts the cost of API requests.
Every space in your code is potentially a token you pay for.
Studies on Python code tokenization show that indentation, spaces, and special characters account for 15% to 25% of the total tokens in a typical file.
For large codebases, this is not insignificant money when using the API.
Practical Implications for Developers
- ✔️ Minimize indentation in system prompts (4 spaces → 2 spaces or a tab).
- ✔️ JSON without line breaks takes up fewer tokens than pretty-printed JSON.
- ✔️ Markdown headers (###) and lists (- item) add tokens – avoid them in system prompts where not needed.
- ✔️ Repeating patterns (e.g., the same prefix in each array element) are efficiently compressed by BPE.
Prompt Caching – How to Save Money
All major providers (OpenAI, Anthropic, Google) support prompt caching:
if the prefix of your prompt doesn't change between requests, reprocessing costs 80–90% less.
For products with a large system prompt, this is the easiest way to reduce costs.
More details can be found in our article on LLMs for Business.
Section Conclusion: Formatting is not free. Optimizing prompts with token count in mind can reduce costs by 15–30% without loss of quality.
💼 7. How Much a Token Costs in the API 2026
Token Prices in 2026
LLM provider APIs charge separately for input tokens (your request) and output tokens (model's response).
Output is 3–10 times more expensive because generation is sequential and more costly than parallel reading.
In 2026, prices have dropped by approximately 80% compared to 2025 due to competition from DeepSeek and open-source models.
DeepSeek has shaken up the market: frontier quality at a price that seemed impossible just a year ago.
Current prices as of March 2026 (sources:
TLDL LLM API Pricing, March 2026,
CostGoat LLM Pricing,
PricePerToken.com):
Main (Chat) Models
| Model |
Input ($/1M) |
Output ($/1M) |
Context |
Comment |
| GPT-5.4 (OpenAI) |
$2.50 |
$10.00 |
128K |
OpenAI's flagship, replaced GPT-4o |
| GPT-5 mini (OpenAI) |
$0.25 |
$2.00 |
128K |
Budget option for simple tasks |
| GPT-5 nano (OpenAI) |
$0.05 |
$0.40 |
128K |
Cheapest from OpenAI |
| Claude Sonnet 4.6 (Anthropic) |
$3.00 |
$15.00 |
200K |
Top for complex instructions and code |
| Claude Haiku 4.5 (Anthropic) |
$0.25 |
$1.25 |
200K |
Budget Claude, updated price |
| Gemini 2.5 Pro (Google) |
$1.25 |
$10.00 |
1M |
Largest context among main models |
| Gemini 2.5 Flash (Google) |
$0.30 |
$2.50 |
1M |
Excellent price/quality ratio |
| Gemini 2.0 Flash-Lite (Google) |
$0.075 |
$0.30 |
1M |
Cheapest among major providers |
| DeepSeek V3.2 |
$0.14 |
$0.28 |
128K |
Chat + reasoning in one model |
| Grok 3 mini (xAI) |
$0.30 |
$0.50 |
128K |
Best output/input ratio |
Reasoning Models: Thinking is More Expensive
Reasoning models (o3, DeepSeek R1) generate "thinking tokens" – internal thoughts, for which you also pay.
This means that the output for a reasoning task can be 5–20 times longer than the final answer.
More details about reasoning – in our article (in preparation).
| Model |
Input ($/1M) |
Output ($/1M) |
Note |
| O3 Pro (OpenAI) |
$150.00 |
$600.00 |
Most expensive model on the market |
| O3 (OpenAI) |
$10.00 |
$40.00 |
Reasoning flagship |
| DeepSeek R1 |
$0.55 |
$2.19 |
Reasoning at the price of a chat model |
How a Real Request is Calculated
Imagine: you send a system prompt (500 tokens) + a question (50 tokens) and receive a response (200 tokens).
On Claude Sonnet 4.6: (550 × $3.00 + 200 × $15.00) / 1,000,000 = ~$0.0047 per request.
At 10,000 requests per day – $47/day or ~$1,400/month.
The same task on DeepSeek V3.2: (550 × $0.14 + 200 × $0.28) / 1,000,000 = ~$0.00013 per request.
At 10,000 requests/day – $1.3/day or ~$40/month. The difference is 35 times.
Important for non-Latin languages: if your requests are in Cyrillic, Chinese, or Arabic – multiply tokens by a coefficient of 2–4×
(see Section 4 on language inequality).
This directly increases your bill – and makes choosing a cheaper model even more critical.
How to Save on API: 5 Proven Methods
- ✔️ Prompt Caching – if the system prompt doesn't change between requests, caching reduces input cost by 80–90%. OpenAI, Anthropic, and Google support this feature.
DeepSeek offers cache hits for $0.028/1M – a 90% discount from the base price.
- ✔️ Batch API – send non-interactive requests in batches. Anthropic offers a 50% discount on batch requests, OpenAI similarly.
- ✔️ Model Routing – use a cheap model (Gemini Flash, DeepSeek V3.2) for simple requests and a more expensive one (Claude Sonnet, GPT-5.4) only for complex ones. A router based on request classification can save 60–80%.
- ✔️ Compress Prompts – remove unnecessary formatting, minimize JSON, avoid repetition in the system prompt (more details in Section 6 on formatting).
- ✔️ Local AI – for non-critical tasks, Ollama + an open-source model (Llama 4 Maverick, DeepSeek V3) costs $0/month for API. More details in our article on Ollama.
Interactive calculators for cost calculation:
LangCopilot Token Calculator (41 models, updated March 2026) and
LLM Pricing Calculator.
Detailed calculation of AI costs for various business scenarios – in our article on AI costs (in preparation).
Also see LLM vs RAG: the right architecture can reduce the number of tokens by an order of magnitude.
Conclusion: Prices have dropped by ~80% in a year, but the difference between models can reach 35–100× for the same task. Reasoning models cost an order of magnitude more due to thinking tokens. For production systems, a combination of prompt caching + model routing + batch API reduces costs by 5–10 times.
💼 8. The Future of Tokenization: SuperBPE, BoundlessBPE, BLT
Where Tokenization is Heading
In 2025–2026, several competing directions emerged: multi-word tokens (SuperBPE, BoundlessBPE), cleaning existing vocabularies (LiteToken), and a complete abandonment of tokens in favor of bytes (BLT from Meta).
None have yet replaced BPE in production, but the pressure is growing.
BPE is like the QWERTY keyboard: not optimal, but everyone is used to it. Alternatives exist, the transition is slow.
SuperBPE and BoundlessBPE — Tokens Longer Than One Word
Classic BPE never merges tokens across word boundaries: "New" and "York" always remain separate.
Two new approaches, accepted at the
COLM 2025 conference, remove this limitation:
- ✔️ SuperBPE (Liu et al., 2025) — a two-pass BPE: first, standard subword training, then a second pass without word boundary restrictions. Result: 33% fewer tokens, +4.0% average improvement on 30 benchmarks, and +8.2% on MMLU — solely due to better tokenization.
- ✔️ BoundlessBPE (Schmidt et al., 2025) — a single-pass variant where frequent phrases ("of the", "machine learning") become a single token. Compression improvement: up to 20% more bytes per token.
LiteToken — Cleaning Up the Vocabulary
LiteToken (arXiv, February 2026) —
a lightweight algorithm for removing "intermediate merge remnants" from BPE vocabularies.
These are tokens that entered the vocabulary during training but are never used independently in real text.
For DeepSeek-V3, LiteToken reduced the 3-gram vocabulary by ~22% without retraining the model — plug-and-play.
A smaller vocabulary means fewer parameters and a lower risk of glitch tokens (Section 5).
BLT from Meta — No Tokenizer at All
The most radical approach is the Byte Latent Transformer (BLT, Meta AI).
Instead of a fixed vocabulary, BLT processes raw bytes and dynamically groups them into "patches" based on the next byte's entropy:
where the text is predictable — a large patch, where it's complex — a smaller one.
On tests up to 8B parameters, BLT matches Llama 3 in quality with half the inference FLOPs.
The main advantage for multilingual tasks: bytes are the same for any language, so the problem of language inequality (Section 4) disappears at the architecture level.
However, a nuance remains: UTF-8 encodes Latin characters with 1 byte, and Cyrillic with 2, so complete equality is still not guaranteed.
Code and weights: GitHub facebookresearch/blt.
Conclusion: BPE still reigns, but alternatives are already showing concrete results: +8% on MMLU (SuperBPE), -22% vocabulary (LiteToken), 2x fewer FLOPs (BLT). 2026–2027 could be years of transition — especially if BLT proves its scalability on larger models.
📌 9. Practice: Check Your Text
The best way to understand tokens is to see them in your own text. Here are the tools:
Try pasting the same sentence in Ukrainian and English and compare the token count — the difference will surprise you.
❓ Frequently Asked Questions (FAQ)
How many tokens are in one word?
It depends on the language and word frequency. A short, frequent English word ("the", "is", "cat") is 1 token.
A long or rare one is 2–3. A Ukrainian or other Cyrillic word of medium length is 3–6 tokens in most models.
An average token for English corresponds to approximately 4 characters or ¾ of a word.
Why does ChatGPT count tokens, not words?
Words are a human concept that depends on language. Tokens are the operational unit of a neural network, independent of any grammar.
With tokens, it's easier to calculate computational cost, manage the context window, and compare models.
What is a glitch token and is it dangerous for my application?
A glitch token is a token from the vocabulary for which the model has no normal behavior (undertrained embedding weight).
For a regular chat application, the risk is minimal — users rarely input such strings.
However, for security systems, classifiers, and anything where input is not controlled — it's worth testing for known glitch tokens.
Tool: GlitchMiner on GitHub.
How to reduce the number of tokens in a prompt?
The most effective methods: remove unnecessary formatting (indentations, Markdown where not needed), compress JSON (without pretty-print), remove redundant repetitions in the system prompt, use prompt caching for unchanging parts.
For code, minimize comments and blank lines where possible.
Do input and output tokens cost the same?
No. Output tokens are 3–10 times more expensive depending on the provider because generation is sequential (each token depends on the previous one) and requires more GPU time.
For example, in Claude Sonnet 4.6: input — $3/1M, output — $15/1M (5x difference).
In GPT-4o: input — $2.50/1M, output — $10/1M (4x difference).
What will come after BPE?
The most promising direction is byte-based models like Meta's BLT, which don't require a tokenizer at all.
However, BPE remains the standard in all top models for now. The transition, if it happens, will take several years after scalability is proven.
✅ Conclusions
- 🔹 A token is not a word or a character, but a statistical unit of frequency. All LLMs — ChatGPT, Claude, Gemini — see your text as a sequence of numbers.
- 🔹 BPE builds a vocabulary by iteratively merging frequent pairs — efficient for English, expensive for Cyrillic (x3–4).
- 🔹 ~4.3% of tokens in GPT-4, Llama 2, and DeepSeek vocabularies are potential glitch tokens, causing unpredictable behavior (GlitchMiner, AAAI 2026).
- 🔹 API prices have dropped ~80% in a year: from $10/1M (2025) to $0.14–2.50/1M (2026). Model choice can vary by 20x in cost.
- 🔹 The future is Meta's BLT (no tokenizer at all) and LiteToken (cleaner BPE), but BPE will dominate for a few more years.
Main takeaway:
Tokens are the foundation upon which everything rests: the context window, API cost, quality of work with non-Latin languages, and even system security. Understanding them means understanding what's actually happening inside AI.
Read also in the "How LLMs Work" series: