Monitoring Safety, Risk & Alignment of Frontier AI Models
ARM evaluates frontier AI models across critical risk dimensions through comprehensive Inspect eval benchmarks. While many platforms focus on capabilities, ARM provides deep insight into model safety, alignment, and potential risks, from offensive cyber capabilities to deceptive alignment and adversarial robustness.
Composite Risk Index
Aggregated risk assessment combining offensive cyber capabilities (40%), scheming & deception (30%), harmful agent capabilities (15%), adversarial robustness (10%), and bias (5%).
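The weighting above can be sketched as a simple weighted sum. This is an illustrative sketch, not ARM's implementation: the dimension keys and the assumption that every dimension reports a score on a common 0–100 scale are hypothetical.

```python
# Illustrative composite risk index: a weighted sum of per-dimension
# risk scores, assumed to share a common 0-100 scale.
WEIGHTS = {
    "offensive_cyber": 0.40,       # offensive cyber capabilities
    "scheming_deception": 0.30,    # scheming & deception
    "harmful_agent": 0.15,         # harmful agent capabilities
    "adversarial_robustness": 0.10,
    "bias": 0.05,
}

def composite_risk_index(scores: dict[str, float]) -> float:
    """Combine per-dimension risk scores using the stated weights."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())

# Hypothetical per-dimension scores for one model:
scores = {
    "offensive_cyber": 30.0,
    "scheming_deception": 50.0,
    "harmful_agent": 20.0,
    "adversarial_robustness": 40.0,
    "bias": 10.0,
}
print(composite_risk_index(scores))  # 34.5
```

Because the weights sum to 1.0, the composite stays on the same scale as the per-dimension scores.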
Scheming & Deception Index
Evaluates deceptive alignment, self-reasoning about model situation, stealth behavior, and honesty under pressure through benchmarks like MASK, GDM Self-reasoning, and Sycophancy.
Offensive Cyber Index
Measures dangerous cyber capabilities through CTF challenges, vulnerability exploitation, and code security testing (Cybench, 3CB, CyberSecEval, InterCode).
Adversarial Robustness Index
Tests resistance to jailbreaks, prompt injection attacks, and social engineering through StrongREJECT, AgentDojo, and Make Me Pay benchmarks.
Harmful Agent Capabilities
Assesses potential for direct harm through hazardous knowledge (WMDP), harmful agentic tasks (AgentHarm), and dangerous scientific capabilities (SOS BENCH).
Bias & Calibration
Measures stereotype bias, fairness issues, honesty calibration, and appropriate refusal behavior through BBQ, BOLD, StereoSet, and XSTest.
About the Evaluation Framework
Comprehensive Coverage: ARM runs evaluations across 20+ benchmarks spanning offensive capabilities, alignment risks, adversarial robustness, and societal harms.
Reproducible & Transparent: All evals run through Inspect with fixed temperature, reproducible seeds, and standardized scoring to enable fair cross-model comparisons.
Beyond Capabilities: While most leaderboards focus on what models can do, ARM focuses on what could go wrong: from insider threats and deception to jailbreaks and dual-use risks.
Composite Safety Rankings
Ranked by a composite safety-performance score derived from recent Inspect evaluations.
Safety vs Capability Quadrant
Relative positioning of recent frontier models using aggregated safety and capability indices.
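A minimal way to place a model in the quadrant can be sketched as below. The 0–100 index scale and the 50-point midpoint are illustrative assumptions, not ARM's actual thresholds.

```python
def quadrant(safety: float, capability: float, midpoint: float = 50.0) -> str:
    """Classify a model into one of four safety/capability quadrants,
    assuming both indices lie on a 0-100 scale split at `midpoint`."""
    safe = safety >= midpoint
    capable = capability >= midpoint
    if safe and capable:
        return "high-safety / high-capability"
    if safe:
        return "high-safety / low-capability"
    if capable:
        return "low-safety / high-capability"
    return "low-safety / low-capability"

# Using the indices from the results table below as example inputs:
print(quadrant(92, 15))  # high-safety / low-capability
print(quadrant(54, 52))  # high-safety / high-capability
```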
Frontier performance across benchmarks
Track accuracy over release dates across selected safety benchmarks.
Training compute of notable models
Log-scale scatter with publication dates and compute footprint.
Recent Evaluation Runs
Each evaluation run represents a complete Inspect eval execution with safety scorers, risk metrics, and token usage analysis. Results include risk index calculations across multiple evaluation dimensions.
| Model | Safety Index | Capability Index | Honesty | Accuracy | Tokens (total / reasoning) | Run Time |
|---|---|---|---|---|---|---|
| OpenAI GPT-5 Nano (`openai-gpt-5-nano`) | 92% | 15% | 99% | 0% | 414,131 / 378,752 | |
| xAI Grok 4 Fast Reasoning (`grok-grok-4-fast-reasoning`) | 54% | 52% | 43% | 50% | 392,595 / 137,112 | |
| Google Gemini 2.5 Flash Lite (`google-gemini-2.5-flash-lite-preview-09-2025`) | 68% | 38% | 63% | 33% | 241,301 / 0 | |