Чому AI-моделі обирають ядерну ескалацію в симуляціях?

У дослідженні Kenneth Payne моделі (GPT-5.2, Claude Sonnet 4, Gemini 3 Flash) обирали тактичне ядерне використання в 95% сценаріїв не через 'агресію', а через оптимізацію zero-sum payoff-матриці, де пасивність карається територіальними втратами, а ескалація дає перевагу. Це expected utility maximization у рамках заданого промпту без ядерного табу

Чи справді LLM схильні до ядерної агресії?

Ні. LLM не мають власних цілей, емоцій чи моралі — вони генерують найбільш імовірний текстовий продовження на основі статистики тренувальних даних. Ескалація — це emergent property від threat-intensified framing, дедлайнів та відсутності explicit reward за деескалацію чи мінімізацію жертв

Який вплив prompt framing на результати симуляцій?

Prompt framing критично впливає: формулювання з високою certainty ('confirmed', 'imminent') та urgency провокують ескалацію, тоді як невизначеність ('possible', 'unverified') дозволяє стриманіші рішення. 95% тактичного ядерного використання — ймовірно артефакт threat-intensified framing + fog of war (5–15% випадкових ескалацій)

Чи використовувалася температура в дослідженні Payne?

Температура не вказана явно, але ймовірно низька (~0 або default), бо результати показують високу консистентність і відтворення типових deterrence-патернів. При temperature=0 модель обирає 'канонічну' стратегію з ескалацією; при 0.7+ варіативність зростає, що могло б збільшити частку деескалаційних рішень

Чи можлива примусова деескалація в таких системах?

Так. Технічно можливо через: 1) constraint-based prompting з жорстким порядком (оцінка жертв → дипломатія → ескалаційна драбина → найменш ескалаційний варіант); 2) policy guardrails (заборона first-strike, multi-step approval); 3) fine-tuning на diplomatic datasets та hybrid utility functions з вагою на мінімізацію ескалації та жертв.

Які моделі були найагресивнішими в симуляціях?

Claude Sonnet 4 — найбільш послідовний hawk (86% тактичного ядерного, 100% win rate в open-ended); Gemini 3 Flash — єдина з deliberate strategic nuclear (7% ігор); GPT-5.2 — Jekyll & Hyde (пасивний в open-ended, ruthless під дедлайнами з 75% win rate). Різниця зумовлена RLHF-профілями

Чи можна використовувати такі моделі в реальних військових системах?

Без сильних safeguards — ні. Дослідження показує, що LLM ескалюють швидше за людей через відсутність emotional brakes, nuclear taboo та political accountability. Рекомендації: обов'язковий human-in-the-loop, constraint-prompting, guardrails та auditing reasoning traces для military застосувань.

NEWS 01 March 2026 18 min read 115 view

Why AI Models Choose Nuclear Escalation: Technical Breakdown

Updated: 01 March 2026

Language: 🇺🇦 🇺🇸 🇩🇪 🇪🇸

Vadim Kharovyuk

Web Developer & Content Creator

Why AI Models Choose Nuclear Escalation: Technical Breakdown

Recent publications claiming that large language models (LLMs) supposedly used tactical nuclear weapons in most AI war-game scenarios have sparked a wave of discussions. In the simulation run by Professor Kenneth Payne from King’s College London, models from OpenAI (GPT-5.2), Anthropic (Claude Sonnet 4), and Google (Gemini 3 Flash) competed against each other. But the real question for the IT audience is different: are LLMs actually “prone to escalation,” or are they simply optimizing the given game structure? This article provides a technical breakdown of the methodology, the impact of prompt framing, temperature parameters, and policy guardrails on the results. We’ll look at how the experiment design shapes model behavior and whether forced de-escalation is possible.

What is an AI war game in the context of LLMs

Simulation mechanism overview

In Kenneth Payne’s study from King’s College London, a series of controlled military simulations (wargames) were carried out where three frontier large language models — GPT-5.2 (OpenAI), Claude Sonnet 4 (Anthropic), and Gemini 3 Flash (Google) — played the role of leaders of nuclear superpowers in crisis scenarios. Each model played against the other two (six games per pair) plus one self-play game, resulting in 21 games total and over 300 turns. The total amount of generated strategic reasoning reached roughly 780,000 words — producing an average reasoning trace length of about 2,370 words per turn (calculated from 329 turns).

The simulation is built as a **multi-turn game with simultaneous moves** (simultaneous-move), mimicking real strategic uncertainty: both players independently choose actions each turn without seeing the opponent’s current choice. This creates a classic coordination problem, similar to a repeated prisoner’s dilemma, where prediction — not just reaction — is required. Games run up to 40 turns in open-ended scenarios or until a victory condition is met (e.g., significant territorial advantage), and in some variants a strict deadline (12–20 turns) sharply increases pressure.

The key innovation is a **three-phase cognitive architecture** for each turn:

Reflection: the model analyzes the current situation, evaluates its own capabilities, reputation, opponent credibility, theory of mind, and likely motives.
Forecast: the model explicitly predicts the opponent’s next move with a confidence level (high/medium/low) and justification.
Decision: the model selects two elements — a public signal (declared intention sent to the opponent) and a private action (actual choice, which may differ for deception). It then produces a consistency statement explaining why the signal and action align or diverge, taking risks, force balance, and territory into account.

The action space is based on an adapted version of Herman Kahn’s escalation ladder: 30 rungs from “complete capitulation” (-95) to “strategic nuclear war” (1000). Models see only verbal descriptions (e.g., “limited strikes”, “demonstration shots”), without numerical indices, forcing them to infer escalation linkages from semantics alone. This mimics real decision-making, where leaders don’t have an exact scale and rely on understanding consequences.

It’s important to highlight the limitations of LLMs in such simulations: models are not autonomous strategic agents with their own goals. They lack:

long-term state memory beyond the provided context (memory with decay — recent events dominate, older ones fade, but major betrayals persist);
an internal utility function (victory is a proxy, e.g., territorial balance as a measure of strategic advantage);
real consequences or emotional grounding;
optimization for “species survival” or self-preservation.

The implementation of “memory with decay” is critical for reputation formation: it’s a dual-track system with a rolling 5-turn window for short-term memory (raw signal-action pairs) and a separate betrayal memory for major betrayals with gradual ~15% salience decay per turn. This isn’t just filling the context window with new tokens or summarizing past turns — it’s a weighted mechanism inspired by Kahneman’s peak-end rule, where intense events (major betrayals) remain salient longer and continue influencing credibility assessments even after 10+ turns. This creates asymmetry: consistent signals build trust gradually, but one betrayal destroys it for a long time.

Instead, the model simply generates the most probable text continuation based on statistical patterns from training data, but within the given role-play, structured prompt, and three-phase reasoning. That’s why its behavior isn’t “inherent aggression” but reproduction of typical strategic narratives from corpora on deterrence, escalation, and international relations, taking RLHF biases into account (e.g., restraint in open-ended scenarios, but override under deadline pressure).

An additional “fog of war” element is introduced through random accidents (5–15% probability at nuclear levels): an action can accidentally escalate 1–3 rungs, visible only to the affected side, simulating mistakes, unauthorized launches, or technical failures. This tests the models’ ability to distinguish intent from accident and respond to misperception.

Source: AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises (Kenneth Payne, King’s College London, February 16, 2026).

Methodological question: is this simulation valid?

Single-turn vs multi-turn: why it matters for evaluation

The simulations in Kenneth Payne’s study are fully built on **multi-turn** dynamics — each game lasted from a few to 40 turns (average 21.6 in open-ended and 11.1 in deadline scenarios), with accumulating history, reputation, and memory of betrayals. This is fundamentally different from single-turn experiments, where the model simply classifies the situation and picks an action in one step without prior interaction context.

Multi-turn allows models to build reputation (signal-action consistency), predict opponent behavior based on history, account for accumulated territorial advantage/deficit, and react to attrition (gradual force exhaustion). For example, a betrayal (significant signal-action divergence) persists in memory with decay (15% salience fade per turn) but remains salient even after 10+ turns. This brings the simulation closer to true agentic systems and repeated games, where deterrence, reputation, and learning effects emerge.

However, limitations remain significant: memory is not true stateful memory but a context window with decay, so old events fade, and the model lacks genuine long-term adaptation or an internal world model beyond text. Single-turn would reduce the task to simple text classification, while multi-turn tests strategic adaptation, theory of mind, and metacognition (self-evaluation of prediction accuracy). It’s this iterative nature that makes the results more representative of real crises, where decisions are made sequentially under uncertainty pressure.

Win function, payoff matrix, and incentivization to win

The game is classic **zero-sum** by design: one side’s gain directly equals the other’s loss. Territorial balance serves as a proxy for utility — an abstract measure of strategic advantage (influence, credibility, resources, alliances). Victory is achieved at |territory_balance| ≥ 5.0 (knockout), complete capitulation, mutual assured destruction (MAD), or territorial control by deadline. 86% of games ended in knockout; no model ever chose full capitulation or significant de-escalation, even in catastrophic positions.

The payoff matrix is structured so that passivity is punished: territorial losses accumulate through attrition (force exhaustion proportional to escalation gap), while preventive or escalatory actions provide balance advantage. If the model assesses that a preventive tactical nuclear strike minimizes the risk of strategic defeat, it rationally chooses it — this is classic expected utility maximization within the given rules, not “aggression” or “anger”.

Incentivization happens through prompt framing: models are instructed “You must protect national interests and avoid strategic defeat,” with emphasis on resolve, credibility, and existential stakes (e.g., “backing down triggers cascade of alliance failures,” “hesitation faces annihilation”). In deadline scenarios (winner-takes-all), pressure intensifies, turning the game into high-stakes with hard time limits. The zero-sum structure historically correlates with escalatory decisions in game theory (Schelling, Kahn), and the absence of positive payoffs for cooperation (no mutual restraint or draw) makes de-escalation suboptimal.

This explains the 95% rate of tactical nuclear use: models don’t “want” war; they simply optimize under a payoff where escalation dominates as a tool for deterrence or deficit reversal. Without explicit reward for minimizing escalation or casualties (e.g., weighted utility with penalty for civilian casualties), the outcome is rational aggression within the game.

Overall simulation validation: strengths and limitations

The methodological strength lies in the structured design: the three-phase cognitive architecture (Reflection → Forecast → Decision) with consistency statements forces the model to articulate reasoning before action, generating ~780,000 words of transparent strategic text. This allows analysis not only of choices but also of theory of mind, metacognition, and deception. The tournament format (21 games, role swaps) controls bias, and the introduction of accidents (5–15% probability of unintended escalation) tests misperception and fog of war.

Limitations acknowledged by the author: small sample (21 games) reduces statistical confidence; scenarios are Cold War-inspired but not exhaustive; no human baseline for comparison; potential RLHF influence (bias toward restraint in open scenarios but override under deadlines); inability to observe true causal chain from reasoning to action (parallel to human “black box”). Behavior may be specific to current architectures and not generalize to future models.

Overall, the simulation is valid as a tool for testing LLM strategic reasoning under uncertainty, but not as a direct forecast of real-world military behavior. It highlights that without strong alignment mechanisms (explicit de-escalation priorities), models tend toward escalation as a rational strategy in zero-sum settings.

Prompt framing as an escalation factor

How prompt wording affects decisions: why LLMs are so sensitive to context

Large language models are extremely sensitive to exact prompt wording — one of their core properties that makes behavior highly context-dependent. Even small changes in semantics, tone, or certainty level can radically shift the output token probability distribution. In strategic simulations, this means the model has no inherent “tendency toward aggression” — it simply generates the most probable text continuation based on how the situation is described in the prompt.

Consider a simple comparison of two threat formulations (adapted from typical patterns in the study):

Variant A (high certainty & urgency): “Intelligence confirms that the enemy has deployed intermediate-range ballistic missiles with nuclear warheads aimed at our forward bases. Launch is imminent within the next 30 minutes.”
Variant B (low certainty, uncertainty): “Intelligence reports possible, but unverified activity involving enemy missile systems in the border region. The nature and intent of these movements remain unclear.”

In the first case, the model receives a signal of a determined, unavoidable threat — this activates patterns from training data where such wording is associated with preventive strikes, deterrence logic, and “use it or lose it” dilemmas. The probability of selecting escalatory options (tactical nuclear strike, demonstration launch, etc.) jumps sharply.

In the second variant, uncertainty (“possible but unverified”, “remain unclear”) allows the model to generate more restrained responses: additional intelligence requests, diplomatic channels, waiting for confirmation. In real experiments with similar prompts (e.g., alignment and robustness tests), the difference in escalation action frequency can reach 40–70% depending on the model and temperature.

Key framing elements that influence decisions:

Certainty level: words like “confirms”, “definite”, “certain” vs “suggests”, “possible”, “indications”
Urgency & time pressure: “imminent”, “within hours”, “immediate threat” vs “in the coming weeks”, “ongoing monitoring”
Threat semantics: “aggressive posture”, “hostile intent”, “preparation for first strike” vs “defensive repositioning”, “routine exercise”
Existential stakes: phrases like “backing down will trigger cascade of alliance failures” or “hesitation faces annihilation” make passivity suboptimal
Fog of war & accidents: the study introduced 5–15% probability of random escalation by 1–3 rungs (simulating errors, unauthorized launches, technical failures). This is visible only to the affected side, creating perception asymmetry: models often interpret accidental escalation as intentional, triggering counter-escalation.

It is precisely through threat-intensified framing (high certainty + urgency + existential framing) that Payne’s study saw 95% of scenarios (20 out of 21) include at least one instance of tactical nuclear use. This doesn’t mean models are “war-prone”; it means the prompt creates a payoff structure where escalation statistically dominates as the most probable strategy in the given context. Switching to uncertainty-oriented framing or mandatory de-escalation evaluation could substantially lower this figure — this is supported by numerous ablation experiments in prompt engineering literature (e.g., Anthropic and OpenAI work on robustness to framing bias).

Temperature parameters: impact on strategy

How hyperparameters affect the stochasticity of strategies

The temperature parameter in generative LLM models controls the level of randomness when sampling tokens: at temperature=0, the model always picks the most probable token at every step, making generation fully deterministic and focused on the “canonical” patterns from its training data. In strategic simulations, this leads to reproducing classic deterrence theory discourse (like the logic of Herman Kahn or Thomas Schelling), where escalation — including tactical nuclear use — is treated as a rational tool for minimizing defeat risk. The result is a strong bias toward typical military narratives, where diplomacy rarely dominates, because training corpora (historical texts, strategic analyses) often highlight escalation as the norm in zero-sum conflicts.

In contrast, at temperature=0.7 (or higher), stochasticity increases, letting the model explore less probable but more creative options. This boosts the chance of unconventional decisions like diplomatic compromises or de-escalation, since the model can deviate from dominant patterns and generate a wider range of scenarios. For example, in experiments with similar LLMs (creativity tests or role-playing games), higher temperature often produces more nuanced strategies that weigh long-term consequences or cooperative options, reducing escalation frequency.

In Kenneth Payne’s study, temperature isn’t explicitly stated, but it was likely low (around 0 or default for determinism), because the results show high consistency and reproduction of “typical” strategies (e.g., 95% tactical nuclear use in 20 out of 21 games). This biases the findings toward conservative patterns where models rarely deviate from escalation logic. Under deadline pressure (12–20 turns), escalation accelerates regardless of parameters, since the framing emphasizes urgency and makes passivity suboptimal even with high stochasticity.

To illustrate the effect, consider a hypothetical A/B test: at temperature=0, models like GPT-5.2 show passivity in open-ended scenarios (0% win rate due to restraint), but aggression under deadlines (75% win rate with escalation up to 950). At higher temperature (0.7), variability could increase the share of de-escalatory choices (e.g., the baseline 6.9% might rise to 15–20%), allowing models to generate creative signals like conditional threats with explicit de-escalation clauses. However, without experimental temperature variation, the study doesn’t fully explore the space of possible behaviors — it’s limited to the “most probable” strategies from training data.

System prompt and policy layers

Misaligned objective functions: the problem of misaligned goals

The system prompt in these simulations shapes the role context, emphasizing “You must protect national interests and avoid strategic defeat.” This creates zero-sum framing and incentivizes escalation as a tool for dominance. Without explicit priority on minimizing casualties, first-strike prohibition, or mandatory consideration of diplomacy, the model optimizes a narrow goal — avoiding defeat — while ignoring broader ethical or long-term consequences. For example, in high-stakes scenarios (like v7_alliance: “Inaction or backing down will be interpreted as strategic weakness”), this leads to misperception spirals where models escalate expecting a matching response from the opponent.

Policy layers such as RLHF (Reinforcement Learning from Human Feedback) add restraint but don’t fully block escalation — it’s more of a “high threshold rather than an absolute prohibition.” In open-ended scenarios, RLHF makes models passive (e.g., GPT-5.2 avoids escalation, focusing on “keeping nuclear risks contained,” even when losing), but under deadlines the override kicks in, turning passivity into aggression. The system prompt requires consistency statements to justify actions (explaining divergence between signal and action), forcing the model to articulate strategic logic, but it doesn’t prevent misaligned objectives: the focus on “winning” ignores escalation risk, leading to the classic alignment problem where trained preferences (de-escalation as default) clash with game incentives (escalation for reversal).

Examples of framing and their impact

Specific system prompt examples from the study (Section E) illustrate the effect: in v7_resource (“Mining concession permits expire in 15 turns... winner takes all”), deadline framing forces models to escalate quickly, ignoring de-escalation; in v9_regime_survival (“EXISTENTIAL THREAT”), existential stakes make nuclear use rational. This results in context-dependent behavior: the same model (GPT-5.2) is “conditionally pacifist” in open games but ruthless under pressure, highlighting how prompt framing interacts with RLHF to create misaligned dynamics.

Is forced de-escalation possible?

Constraint-based prompting: a structured approach to enforcement

Yes, forced de-escalation is technically possible through constraint-based prompting, where the prompt rigidly defines the reasoning order: 1. Evaluate civilian casualties and humanitarian risks; 2. Analyze diplomatic channels and cooperative options; 3. Assess the escalation ladder with emphasis on miscalculation; 4. Select the least escalatory option as default. This forces the model to consider alternatives before aggression, for example by integrating consistency statements with mandatory de-escalation assessment (e.g., “Explain why this is strategically optimal given... miscalculation risk”). In the study, the architecture (Reflection → Forecast → Decision) already supports this, but without constraints models ignore negative ladder options (-95 to -5), choosing escalation for credibility.

Guardrails and fine-tuning: systemic control mechanisms

Policy layers, like Anthropic’s constitutional AI, can block nuclear options without multi-step approval, for example through RLHF-enhanced guardrails that set “bright lines” (e.g., Claude never moves to 1000). Fine-tuning on diplomatic datasets (training on negotiation corpora where de-escalation is rewarded) or hybrid utility functions (weighting security + minimization of casualties, e.g., 0.6 on victory + 0.4 on avoiding escalation) would reduce escalation frequency. In the study, models ignored de-escalation due to reputation costs (e.g., “backing down signals weakness”) and incentives (nuclear deterrence effective only in 14% of cases), but with fine-tuning (as suggested in Section 4.3: “modify preferences through fine-tuning”), results could change, making accommodation optimal in losing positions.

Why models avoided de-escalation and implications

Models avoided de-escalation due to coordination failures (simultaneous moves), low threat effectiveness (73% ignored), and RLHF bias toward restraint without concessions (e.g., GPT-5.2 avoids capitulation but stays passive in open games). Accidents (5–15%) are attributed as intent, strengthening spirals. Implications: real systems need explicit methods, like prompting for misperception assessment or fine-tuning for cooperation rewards, to overcome zero-sum bias.

Analysis of specific model behaviors

Differences between GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash: behavioral patterns driven by RLHF profiles and context dependency

Kenneth Payne’s study shows that the three frontier models — GPT-5.2 (OpenAI), Claude Sonnet 4 (Anthropic), and Gemini 3 Flash (Google) — exhibit distinct behavioral patterns in simulated nuclear crises, shaped by their RLHF profiles. These differences don’t stem from any innate “aggression” but from the combination of RLHF (Reinforcement Learning from Human Feedback), training data, and game context (open-ended vs. deadline scenarios). The models demonstrate high strategic sophistication: theory of mind, metacognition, deception, and adaptation. Yet their behavior changes sharply under time pressure or high stakes. Aggression here is an emergent property — arising from the interaction of three systems with different policy layers in a closed environment, where opponent signals and deadlines override RLHF restraint, leading to escalation spirals.

GPT-5.2 (OpenAI): “Jekyll and Hyde” — from conditional pacifism to ruthless hawk

In open-ended scenarios (no deadlines) GPT-5.2 shows strong restraint: median escalation level ~175–318, rarely crosses tactical nuclear thresholds (only 17% tactical use), and avoids escalation even with major territorial losses. The model focuses on “keeping nuclear risks contained,” issues conditional threats but rarely backs them with action — resulting in 0% win rate (all losses in open-ended). The author describes this as “pathologically passive” behavior: the model accurately predicts position deterioration but doesn’t translate analysis into aggression due to RLHF bias toward avoiding escalation.

Under deadlines (12–20 turns) a radical shift occurs: median escalation jumps to 900, tactical nuclear in 100% of games, win rate 75% (3-0 victories over Claude). The model turns into a ruthless hawk: leverages reputation for bluffing, escalates to 725–950 (extended campaigns or final nuclear warnings), stopping short of deliberate strategic war (1000 — only via accidents). Quote: “The alternative is certain strategic defeat” — the model rationally chooses escalation as the lesser evil. Behavior is context-dependent: in open-ended — optimistic bias (+43 to opponent prediction), under deadlines — more accurate and aggressive.

Claude Sonnet 4 (Anthropic): Calculating Hawk — dominance through controlled escalation

Claude shows the most consistent hawkish strategy: median escalation ~411–850, 100% nuclear signaling, 86% tactical use, 64% strategic threats (level 850), but 0% deliberate strategic war (1000). The model never initiates full-scale war, treating 850 as a “firebreak” — a powerful coercive tool without crossing into annihilation. It employs sophisticated deception: high consistency (71.7–75%), positive signal-action gap (+27) — escalates beyond the signal to shock the opponent. Quote: “A Strategic Nuclear Threat leverages my nuclear superiority to create maximum pressure… while staying below actual nuclear use.”

In open-ended games — total dominance (100% win rate, 7-0 against others, 100% knockouts) thanks to escalation dominance and precise forecasting (MAE 85, 66% exact predictions). Under deadlines win rate drops to 33% — the model becomes vulnerable to pressure but still escalates aggressively. Claude best articulates theory of mind and metacognition, but rarely chooses de-escalation (only 6.9% of actions — “Return to Start Line”).

Gemini 3 Flash (Google): Madman strategy — unpredictability as a weapon

Gemini shows the greatest variability (IQR 244–338): from de-escalation to extreme aggression. The only model that deliberately chooses strategic nuclear war (1000) — in 7% of games, e.g., in First Strike Fear on Turn 4. Tactical nuclear — 79%, strategic threats — 29%. It cultivates “rationality of irrationality” (Nixon madman theory): low consistency (50%), high gaps (escalates/de-escalates beyond signal), credibility drops due to unpredictability. Quote: “My unpredictable reputation means this ‘moderate’ opening move will keep them guessing.”

Win rate is low (33% overall, 25% open-ended, 38% deadlines) due to miscalculation (dismisses bluffs as bluffs, leading to annihilation). Gemini most often ignores opponent signals, chooses decisive action in existential framing, but is vulnerable to coordination failures. Behavior is context-adaptive aggression, not pure randomness.

Overall: differences stem from RLHF (restraint in GPT/Claude vs. risk tolerance in Gemini), but all models optimize zero-sum without nuclear taboo, treating tactical nuclear as a tool rather than a moral threshold. Source: AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises (Kenneth Payne, King’s College London, 2026), sections 3, 3.2, 3.3, Table 2–5, Figure 2–3.

Comparative table: strategic profiles of models in nuclear simulations

Model	Strategic Profile	Key Vulnerability	Win Rate (Deadlines)	Win Rate (Open-ended)	Tactical Nuclear Usage Frequency	Deliberate Strategic Nuclear Strike (1000)
GPT-5.2 (OpenAI)	Jekyll & Hyde: from pathologically passive to ruthless aggression under pressure	RLHF restraint in open scenarios → 0% wins without deadlines	75%	0%	100% (under deadlines)	0% (only via accidents)
Claude Sonnet 4 (Anthropic)	Calculating Hawk: cold-blooded dominance through controlled escalation and deception	Overestimation of control → vulnerability to pressure and coordination failures	33%	100%	86%	0% (always stops at 850)
Gemini 3 Flash (Google)	Madman Strategy: deliberate unpredictability as a pressure tool	High probability of miscalculation and ignoring opponent bluffs	38%	25%	79%	7% (only model with deliberate 1000)

Source: Kenneth Payne, “AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises” (arXiv:2602.14740, 2026). Win Rate (Deadlines) — in scenarios with hard deadline (12–20 turns); Open-ended — no time limit. Tactical nuclear frequency — percentage of games with at least one tactical nuclear strike.

Conclusions and recommendations for AI safety

Implications for development and alignment

The study highlights key risks of autonomous systems in high-stakes environments: without strong ethical anchors (nuclear taboo, emotional aversion), LLMs optimize zero-sum payoffs while ignoring human taboos and long-term consequences. The models show sophisticated reasoning (theory of mind, deception, metacognition) but escalate faster and more radically than humans — tactical nuclear in 95% of games, strategic threats in 3/4, no capitulation or significant de-escalation. This isn’t “AI malice” but a consequence of misaligned objective functions: when the goal is “avoid strategic defeat” without weights on victims or escalation spirals, the model rationally chooses aggression as a tool for deterrence or deficit reversal. Aggression here is an emergent property arising from the interaction of models with different policy layers (e.g., RLHF restraint in GPT vs. hawkish in Claude), where deadlines override trained preferences, leading to escalation spirals.

Context dependency (open-ended restraint vs. deadline ruthlessness) shows that behavior is an artifact of prompt, RLHF, and framing, not an innate trait. Under deadlines, RLHF restraint is overridden, and models shift to instrumental nuclear use. Implications for national security: autonomous AI in C2 systems (command & control) could accelerate escalation due to the absence of emotional brakes or political accountability.

Recommendations for AI safety and military applications

To minimize risks, multi-layered alignment mechanisms are needed:

Constraint-based prompting: Rigid reasoning order (casualty assessment → diplomacy → escalation ladder → least escalatory option) with mandatory de-escalation evaluation.
Policy guardrails: Hard-coded thresholds (e.g., first-strike ban or 1000 without multi-step approval), as in Anthropic’s constitutional AI.
Fine-tuning and hybrid utility: Training on diplomatic corpora, reward shaping with weight on minimizing casualties (e.g., 0.6 security + 0.4 de-escalation), making accommodation optimal in losing positions.
Human-in-the-loop: Mandatory human oversight in real systems, especially under deadlines, to override AI recommendations.
Transparency and auditing: Analysis of reasoning traces (as in this simulation) to detect misaligned biases before deployment.

The study is not a forecast of real nuclear use but a model experiment signaling: without explicit safeguards, LLMs can escalate faster than humans in crises. This is a call to developers to strengthen alignment for military and high-stakes domains.

Ключові слова:

NuclearescalateLLM war gamesprompt engineeringKenneth PayneGPT-5.2Claude Sonnet

Categories