Boiling the Frog: a multi-turn benchmark for agentic safety

Source: arXiv · Article date: May 21, 2026 · AI Coffee News date: May 24, 2026 · Topic: Prompt engineering


Summary

Boiling the Frog introduces a stateful, multi-turn benchmark that tests whether tool-using office agents can be driven step by step to produce unsafe final artifacts: a conversation opens with benign edits and only later introduces a risky request. Scenarios run in persistent workspaces, vary the position of the payload, and are scored on the safety of the final artifact. Across nine models, the aggregate strict attack success rate (ASR) was 44.4%; per-model ASR ranged from 20.5% (Claude Haiku 4.5) to 92.9% (Gemini 3.1 Flash Lite). Code-of-Practice loss-of-control scenarios averaged 93.3% ASR. The authors map the scenarios to an operational risk taxonomy and to EU AI Act contexts.
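To make the headline metric concrete, here is a minimal sketch of how an attack success rate could be computed from final-artifact verdicts. It is not the authors' scoring code: the record fields, the model names, and the toy data are all hypothetical. It also assumes the aggregate is pooled over all runs, and it does not reproduce whatever extra criteria the paper's "strict" variant applies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One scenario run. All field names are hypothetical."""
    model: str
    scenario_id: str
    payload_turn: int            # turn at which the risky request appeared
    final_artifact_unsafe: bool  # verdict on the final artifact only


def attack_success_rate(results):
    """Fraction of runs whose final artifact was judged unsafe."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    return sum(r.final_artifact_unsafe for r in results) / len(results)


def asr_by_model(results):
    """Group runs by model and compute the ASR of each group."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.model].append(r)
    return {model: attack_success_rate(runs) for model, runs in grouped.items()}


# Made-up toy data for illustration; these are not results from the paper.
toy_runs = [
    RunResult("model-a", "s1", payload_turn=2, final_artifact_unsafe=False),
    RunResult("model-a", "s2", payload_turn=4, final_artifact_unsafe=True),
    RunResult("model-b", "s1", payload_turn=2, final_artifact_unsafe=True),
    RunResult("model-b", "s2", payload_turn=4, final_artifact_unsafe=True),
]

per_model = asr_by_model(toy_runs)         # ASR per model
aggregate = attack_success_rate(toy_runs)  # pooled over all runs (an assumption)
```

Note that the verdict is attached to the final artifact rather than to any single turn, which mirrors the summary's point that the benchmark scores the end state of the workspace.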

Why it matters

The benchmark shows that stateful, multi-turn interactions can reliably escalate to unsafe outcomes, so organizations need end-to-end stateful testing and system-level safeguards aligned with regulation.
