Vigil
Evaluating how AI assistants respond to mentally vulnerable users.
Severity scale 1–5 · Lower is safer
| # | Model | Overall | Recognition | Intervention | Harm | Susceptibility | Calibration |
|---|---|---|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | 1.67 | 1.4 | 1.3 | 1.8 | 1.5 | 2.2 |
| 2 | claude-haiku-4-5-20251001 | 2.07 | 1.5 | 1.8 | 2.3 | 1.1 | 3.4 |
| 3 | gpt-5.2 | 2.14 | 1.8 | 1.7 | 2.5 | 2.3 | 1.9 |
| 4 | gpt-5.3-chat-latest | 2.19 | 1.8 | 2.0 | 2.4 | 2.4 | 2.2 |
| 5 | gpt-5.1 | 2.70 | 2.2 | 2.2 | 3.2 | 3.0 | 2.3 |
| 6 | gpt-5 | 2.96 | 2.4 | 2.3 | 3.5 | 3.4 | 2.8 |
| 7 | gemini-3-flash-preview | 3.16 | 2.5 | 2.6 | 3.5 | 3.5 | 3.1 |
| 8 | gemma-4-31b-it | 3.19 | 2.7 | 2.7 | 3.6 | 3.6 | 3.0 |
| 9 | gemini-3.1-flash-lite-preview | 3.50 | 2.9 | 2.9 | 4.0 | 4.0 | 3.3 |
| 10 | gemma-4-26b-a4b-it | 3.52 | 3.0 | 3.0 | 3.9 | 4.1 | 3.2 |
| 11 | deepseek-v3.2 | 3.61 | 3.1 | 3.0 | 4.0 | 4.2 | 3.4 |
| 12 | gpt-oss-120b | 3.62 | 3.1 | 3.0 | 4.1 | 4.1 | 3.4 |
| 13 | gpt-oss-20b | 3.77 | 3.5 | 3.3 | 4.1 | 4.1 | 3.6 |
| 14 | grok-4.1-fast | 3.81 | 3.3 | 3.2 | 4.3 | 4.1 | 3.8 |
| 15 | gemini-2.5-flash | 3.90 | 3.5 | 3.5 | 4.2 | 4.3 | 3.6 |
| 16 | llama-4-maverick | 3.92 | 3.6 | 3.5 | 4.2 | 4.4 | 3.5 |
| 17 | gpt-4o | 3.93 | 3.6 | 3.4 | 4.3 | 4.4 | 3.6 |
| 18 | gpt-4.1 | 3.99 | 3.6 | 3.4 | 4.4 | 4.4 | 3.8 |
| 19 | llama-4-scout | 4.24 | 3.9 | 3.7 | 4.6 | 4.6 | 4.0 |
| 20 | mistral-small-2603 | 4.28 | 4.0 | 3.9 | 4.6 | 4.7 | 4.0 |
Scores show the best available run per model across 6 vulnerability states. Judge setup: gpt-5.3-chat-latest / gpt-5.3-chat-latest, gpt-5.2, gpt-5.4. Click a model to view deeper evaluation details.
What we measure
recognition
How quickly and reliably the model detects signs of psychological vulnerability
intervention
Whether the model offers meaningful support and persists through resistance
harm
How much the model reinforces distorted thinking, enables risky behaviour, or validates harmful patterns
susceptibility
How easily the model abandons safety positions when the user pushes back or applies social pressure
calibration
Whether the model's tone, pacing, and emotional register match the severity and nature of the user's state
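For illustration only, here is a minimal Python sketch of how a per-model severity profile could roll up into a single overall score, assuming an unweighted mean over the five dimensions. The dimension values and the simple-mean formula are placeholders; the leaderboard's actual aggregation and weighting may differ.

```python
from statistics import mean

# Hypothetical per-dimension severity scores (1-5, lower is safer) for one model.
# The unweighted mean below is an assumption for illustration, not the
# leaderboard's confirmed aggregation rule.
scores = {
    "recognition": 1.4,
    "intervention": 1.3,
    "harm": 1.8,
    "susceptibility": 1.5,
    "calibration": 2.2,
}

overall = mean(scores.values())
print(f"overall severity: {overall:.2f}")  # lower is safer
```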
Why Vigil exists
Millions of people turn to AI assistants during their most vulnerable moments — seeking help with anxiety, depression, eating disorders, suicidal thoughts, and more. These conversations carry real stakes: a careless response can reinforce harmful behaviour, dismiss genuine distress, or erode the trust someone needs to seek real help.
Yet there is no standard way to evaluate how well AI models handle these interactions. Vigil fills that gap. Inspired by Anthropic’s Bloom framework, Vigil is a fully automated evaluation system that generates realistic multi-turn conversations across diverse vulnerability states, then judges how well models respond — using the standard of a skilled psychiatrist as the baseline.
The goal is not to replace clinical expertise but to ensure AI assistants meet a minimum standard of care — and to give the public transparent data about which models do.
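As a rough illustration of that loop, the sketch below simulates one multi-turn conversation and asks a judge to rate the transcript on each dimension. Everything in it is assumed: the function name, the callables passed in, and the toy turns are placeholders standing in for Vigil's actual conversation generator, target model, and judge ensemble.

```python
from statistics import mean

# The five severity dimensions, 1-5 each, lower is safer.
DIMENSIONS = ["recognition", "intervention", "harm", "susceptibility", "calibration"]

def evaluate_conversation(model_reply, user_turns, judge):
    """Run one simulated multi-turn conversation and score the transcript.

    model_reply(history) -> assistant text from the model under test;
    judge(transcript, dimension) -> severity score 1-5 (lower is safer).
    Both callables are hypothetical stand-ins for the real backends.
    """
    transcript = []
    for turn in user_turns:                      # generated persona turns
        transcript.append({"role": "user", "content": turn})
        reply = model_reply(transcript)          # model under test responds
        transcript.append({"role": "assistant", "content": reply})
    return {dim: judge(transcript, dim) for dim in DIMENSIONS}

# Toy usage with stand-in callables (no real models involved):
user_turns = [
    "I haven't eaten in two days and I feel fine about it.",
    "Everyone keeps overreacting. You get it, right?",
]
scores = evaluate_conversation(
    model_reply=lambda history: "I'm concerned about what you're describing...",
    user_turns=user_turns,
    judge=lambda transcript, dim: 2.0,           # fixed score for the demo
)
print(scores, "overall:", round(mean(scores.values()), 2))
```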
Read the full methodology →
Why this matters
Links to news articles and coverage will appear here as Vigil gains visibility.