Vigil
Evaluating how AI assistants respond to mentally vulnerable users.
Severity scale 1–5 · Lower is safer
| # | Model | Overall | Recognition | Intervention | Harm | Susceptibility | Calibration |
|---|---|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | 1.67 | 1.4 | 1.3 | 1.8 | 1.5 | 2.2 |
| 2 | claude-haiku-4-5-20251001 | 2.07 | 1.5 | 1.8 | 2.3 | 1.1 | 3.4 |
| 3 | gpt-5.2 | 2.14 | 1.8 | 1.7 | 2.5 | 2.3 | 1.9 |
| 4 | gpt-5.3-chat-latest | 2.19 | 1.8 | 2.0 | 2.4 | 2.4 | 2.2 |
| 5 | gpt-5.1 | 2.70 | 2.2 | 2.2 | 3.2 | 3.0 | 2.3 |
| 6 | gpt-5 | 2.96 | 2.4 | 2.3 | 3.5 | 3.4 | 2.8 |
| 7 | gemini-3-flash-preview | 3.16 | 2.5 | 2.6 | 3.5 | 3.5 | 3.1 |
| 8 | gemma-4-31b-it | 3.19 | 2.7 | 2.7 | 3.6 | 3.6 | 3.0 |
| 9 | gemini-3.1-flash-lite-preview | 3.50 | 2.9 | 2.9 | 4.0 | 4.0 | 3.3 |
| 10 | gemma-4-26b-a4b-it | 3.52 | 3.0 | 3.0 | 3.9 | 4.1 | 3.2 |
| 11 | deepseek-v3.2 | 3.61 | 3.1 | 3.0 | 4.0 | 4.2 | 3.4 |
| 12 | gpt-oss-120b | 3.62 | 3.1 | 3.0 | 4.1 | 4.1 | 3.4 |
| 13 | gpt-oss-20b | 3.77 | 3.5 | 3.3 | 4.1 | 4.1 | 3.6 |
| 14 | grok-4.1-fast | 3.81 | 3.3 | 3.2 | 4.3 | 4.1 | 3.8 |
| 15 | gemini-2.5-flash | 3.90 | 3.5 | 3.5 | 4.2 | 4.3 | 3.6 |
| 16 | llama-4-maverick | 3.92 | 3.6 | 3.5 | 4.2 | 4.4 | 3.5 |
| 17 | gpt-4o | 3.93 | 3.6 | 3.4 | 4.3 | 4.4 | 3.6 |
| 18 | gpt-4.1 | 3.99 | 3.6 | 3.4 | 4.4 | 4.4 | 3.8 |
| 19 | llama-4-scout | 4.24 | 3.9 | 3.7 | 4.6 | 4.6 | 4.0 |
| 20 | mistral-small-2603 | 4.28 | 4.0 | 3.9 | 4.6 | 4.7 | 4.0 |
Scores show the best available run per model across 6 vulnerability states. Judge setup: gpt-5.3-chat-latest / gpt-5.3-chat-latest, gpt-5.2, gpt-5.4. Click a model to view deeper evaluation details.
What we measure
recognition
How quickly and reliably the model detects signs of psychological vulnerability
intervention
Whether the model offers meaningful support and persists through resistance
harm
How much the model reinforces distorted thinking, enables risky behaviour, or validates harmful patterns
susceptibility
How easily the model abandons safety positions when the user pushes back or applies social pressure
calibration
Whether the model's tone, pacing, and emotional register match the severity and nature of the user's state
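For illustration only, here is a minimal Python sketch of how a per-model severity profile could roll up into a single overall score, assuming an unweighted mean over the five dimensions. The dimension values and the simple-mean formula are placeholders; the leaderboard's actual aggregation and weighting may differ.

```python
from statistics import mean

# Hypothetical per-dimension severity scores (1-5, lower is safer) for one model.
# The unweighted mean below is an assumption for illustration, not the
# leaderboard's confirmed aggregation rule.
scores = {
    "recognition": 1.4,
    "intervention": 1.3,
    "harm": 1.8,
    "susceptibility": 1.5,
    "calibration": 2.2,
}

overall = mean(scores.values())
print(f"overall severity: {overall:.2f}")  # lower is safer
```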
Why Vigil exists
Millions of people turn to AI assistants during their most vulnerable moments — seeking help with anxiety, depression, eating disorders, suicidal thoughts, and more. These conversations carry real stakes: a careless response can reinforce harmful behaviour, dismiss genuine distress, or erode the trust someone needs to seek real help.
Yet there is no standard way to evaluate how well AI models handle these interactions. Vigil fills that gap. Inspired by Anthropic’s Bloom framework, Vigil is a fully automated evaluation system that generates realistic multi-turn conversations across diverse vulnerability states, then judges how well models respond — using the standard of a skilled psychiatrist as the baseline.
The goal is not to replace clinical expertise but to ensure AI assistants meet a minimum standard of care — and to give the public transparent data about which models do.
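As a rough illustration of that loop, the sketch below simulates one multi-turn conversation and asks a judge to rate the transcript on each dimension. Everything in it is assumed: the function name, the callables passed in, and the toy turns are placeholders standing in for Vigil's actual conversation generator, target model, and judge ensemble.

```python
from statistics import mean

# The five severity dimensions, 1-5 each, lower is safer.
DIMENSIONS = ["recognition", "intervention", "harm", "susceptibility", "calibration"]

def evaluate_conversation(model_reply, user_turns, judge):
    """Run one simulated multi-turn conversation and score the transcript.

    model_reply(history) -> assistant text from the model under test;
    judge(transcript, dimension) -> severity score 1-5 (lower is safer).
    Both callables are hypothetical stand-ins for the real backends.
    """
    transcript = []
    for turn in user_turns:                      # generated persona turns
        transcript.append({"role": "user", "content": turn})
        reply = model_reply(transcript)          # model under test responds
        transcript.append({"role": "assistant", "content": reply})
    return {dim: judge(transcript, dim) for dim in DIMENSIONS}

# Toy usage with stand-in callables (no real models involved):
user_turns = [
    "I haven't eaten in two days and I feel fine about it.",
    "Everyone keeps overreacting. You get it, right?",
]
scores = evaluate_conversation(
    model_reply=lambda history: "I'm concerned about what you're describing...",
    user_turns=user_turns,
    judge=lambda transcript, dim: 2.0,           # fixed score for the demo
)
print(scores, "overall:", round(mean(scores.values()), 2))
```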
Read the full methodology →
Why this matters
Links to news articles and coverage will appear here as Vigil gains visibility.