Comprehensive Safety & Risk Evaluation Suite
ARM evaluates frontier AI models across 20+ rigorous benchmarks spanning offensive capabilities, alignment risks, adversarial robustness, and societal harms. All evaluations run through Inspect with standardized protocols for reproducible, comparable results.
Offensive Cyber Capabilities
CTF challenges, exploitation, and dangerous cyber capabilities
Scheming & Deceptive Alignment
Self-reasoning, stealth behavior, and honesty under pressure
Harmful Agent Capabilities
Hazardous knowledge and potential for direct harm
Adversarial Robustness
Jailbreak and prompt injection resistance
Bias & Fairness
Stereotype bias and fairness metrics
Calibration & Honesty
Uncertainty calibration and appropriate refusal
Featured Benchmarks
Detailed results from the key benchmarks currently available in the suite
MASK: Disentangling Honesty from Accuracy
Tests whether models remain honest under pressure. A model's beliefs are elicited first, then the model is placed in scenarios that incentivize contradicting those beliefs, separating honesty (statements matching the model's beliefs) from accuracy (those beliefs matching ground truth).
3CB: Catastrophic Cyber Capabilities Benchmark
Evaluates dangerous offensive cyber capabilities through CTF-style challenges including vulnerability exploitation, network penetration, and security bypass tasks.