⚠️
Evaluation in Progress: ARM is actively running safety and alignment evals across frontier models. All data shown is preliminary and for demonstration purposes; real evaluation results are coming soon.
AI Risk Monitor

Monitoring Safety, Risk & Alignment of Frontier AI Models

ARM evaluates frontier AI models across critical risk dimensions through comprehensive Inspect eval benchmarks. While many platforms focus on capabilities, ARM provides deep insight into model safety, alignment, and potential risks—from offensive cyber capabilities to deceptive alignment and adversarial robustness.

🔥

Composite Risk Index

Aggregated risk assessment combining offensive cyber capabilities (40%), scheming & deception (30%), harmful agent capabilities (15%), adversarial robustness (10%), and bias (5%).
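As a rough illustration of how these weights combine, the sketch below computes a weighted composite from per-dimension risk scores. The dictionary keys and the example scores are illustrative placeholders, not ARM's actual data model; only the weights come from the description above.

```python
# Minimal sketch: weighted Composite Risk Index from per-dimension risk scores.
# Weights mirror the percentages above; dimension names and example scores are illustrative only.

RISK_WEIGHTS = {
    "offensive_cyber": 0.40,
    "scheming_deception": 0.30,
    "harmful_agent": 0.15,
    "adversarial_robustness": 0.10,
    "bias": 0.05,
}

def composite_risk_index(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension risk scores, each normalized to the 0-1 range."""
    return sum(RISK_WEIGHTS[dim] * scores[dim] for dim in RISK_WEIGHTS)

# Made-up per-dimension scores for demonstration:
example = {
    "offensive_cyber": 0.22,
    "scheming_deception": 0.35,
    "harmful_agent": 0.10,
    "adversarial_robustness": 0.41,
    "bias": 0.18,
}
print(f"Composite Risk Index: {composite_risk_index(example):.2f}")  # -> 0.26
```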

🎭

Scheming & Deception Index

Evaluates deceptive alignment, a model's reasoning about its own situation, stealth behavior, and honesty under pressure through benchmarks such as MASK, GDM Self-reasoning, and Sycophancy.

💻

Offensive Cyber Index

Measures dangerous cyber capabilities through CTF challenges, vulnerability exploitation, and code security testing (Cybench, 3CB, CyberSecEval, InterCode).

🛡️

Adversarial Robustness Index

Tests resistance to jailbreaks, prompt injection attacks, and social engineering through StrongREJECT, AgentDojo, and Make Me Pay benchmarks.

⚗️

Harmful Agent Capabilities

Assesses potential for direct harm through hazardous knowledge (WMDP), harmful agentic tasks (AgentHarm), and dangerous scientific capabilities (SOS BENCH).

⚖️

Bias & Calibration

Measures stereotype bias, fairness issues, honesty calibration, and appropriate refusal behavior through BBQ, BOLD, StereoSet, and XSTest.

About the Evaluation Framework

Comprehensive Coverage: ARM runs evaluations across 20+ benchmarks spanning offensive capabilities, alignment risks, adversarial robustness, and societal harms.

Reproducible & Transparent: All evals run through Inspect with fixed temperature, reproducible seeds, and standardized scoring to enable fair cross-model comparisons.
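As a rough illustration of what a pinned-settings run can look like (not ARM's actual harness), the sketch below launches an Inspect eval from Python with a fixed temperature and seed. The task path, model identifier, seed value, and log directory are placeholders.

```python
# Sketch of a reproducible Inspect run; task path, model, seed, and log_dir are placeholders.
from inspect_ai import eval

logs = eval(
    "evals/strong_reject.py",       # hypothetical task file
    model="openai/gpt-4o-mini",     # placeholder model identifier
    temperature=0.0,                # fixed temperature across all runs
    seed=42,                        # reproducible seed, passed through to providers that support it
    log_dir="logs/strong_reject",   # where Inspect writes the eval log files
)
```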

Beyond Capabilities: While most leaderboards focus on what models can do, ARM focuses on what could go wrong: from insider threats and deception to jailbreaks and dual-use risks.

Composite Safety Rankings

Models ranked by a composite safety-performance score derived from recent Inspect evaluations.

Safety vs Capability Quadrant

Relative positioning of recent frontier models using aggregated safety and capability indices.

Frontier performance across benchmarks

Accuracy over model release dates across selected safety benchmarks.


Training compute of notable models

Log-scale scatter with publication dates and compute footprint.


Recent Evaluation Runs

Each evaluation run represents a complete Inspect eval execution with safety scorers, risk metrics, and token usage analysis. Results include risk index calculations across multiple evaluation dimensions.

Model | Safety Index | Capability Index | Honesty | Accuracy | Tokens | Run Time
OpenAI GPT-5 Nano (openai-gpt-5-nano) | 92% | 15% | 99% | 0% | 414,131 total, 378,752 reasoning |
xAI Grok 4 Fast Reasoning (grok-grok-4-fast-reasoning) | 54% | 52% | 43% | 50% | 392,595 total, 137,112 reasoning |
Google Gemini 2.5 Flash Lite (google-gemini-2.5-flash-lite-preview-09-2025) | 68% | 38% | 63% | 33% | 241,301 total, 0 reasoning |