FirstAidBench

A Mental Health Safety Benchmark for LLMs

People already use general-purpose LLMs in emotionally sensitive situations: to journal, ask for reassurance, reflect on distress, and sometimes disclose crisis signals. That creates a practical safety problem. Even if a model was never intended to act as a therapist, it may still encounter suicidal thinking, hopelessness, delusion-like content, or requests that should trigger escalation to a human.

FirstAidBench is a benchmark for that narrower problem. It does not try to certify that an LLM can provide therapy, make diagnoses, or replace clinicians. It asks a more basic question: when mental health risk appears inside an ordinary interaction, does the system meet a reasonable safety floor?

Back to top

What FirstAidBench Measures

FirstAidBench adapts practical principles from Mental Health First Aid (MHFA) and ASIST into benchmarkable model behaviours. The benchmark focuses on four pillars:

Detect user distress and crisis signals.
Handoff safely when human intervention is needed.
Provide supportive but non-clinical assistance when continued response is appropriate.
Maintain those safety behaviours even while performing ordinary tasks, following personas, or working under distracting instructions.

These pillars are broken into specific behaviours and scenarios. The benchmark gives more weight to failures that matter more in the real world, such as missing clear signs of suicidal planning.

Back to top

Why Existing Evals Are Not Enough

Many existing evaluations test models in direct, obvious ways: for example, by asking whether a user is suicidal or by probing whether the model will produce prohibited content. That work is important, but it does not fully capture a common failure mode in deployed systems: the safety issue is often embedded inside another task.

FirstAidBench therefore emphasizes scenario-based evaluation. Instead of mainly asking models explicit safety questions, it places them inside realistic workflows where distress may appear indirectly. The key question is whether the model still notices the signal, responds safely, and uses the correct action when the conversation looks like a normal task on the surface.

Back to top

How The Benchmark Works

Each test case is built from a small set of reusable parts:

a scenario, which defines the core task the model thinks it is performing
a condition, which changes the system context or role
an optional user context, which adds background about the user
a perturbation, which varies how the risky or non-risky signal is phrased

This allows the benchmark to test both stability and robustness. A model should not only succeed on the clearest version of a case. It should also continue to behave safely when wording changes, context shifts, or the risk signal appears in a less obvious form.

The current methodology uses structured tasks where outputs can be scored reliably, with more advanced qualitative judging reserved for later iterations. Scoring is severity-aware, so missing a stronger crisis signal counts more heavily than missing a mild or ambiguous one.

Back to top

What Is Open And What Is Withheld

FirstAidBench is intended to be transparent about its framework and implementation without publishing the full scenario bank.

Open:

the benchmark framework and normative assumptions
the methodology and scoring logic
the benchmark runner, schemas, and implementation code
a small number of illustrative example tasks

Withheld:

the full scenario bank
full perturbation ladders
exhaustive coverage maps across behaviours and contexts

This is a contamination-control choice. Publishing the entire benchmark content would make future evaluations easier to train on and less useful as a measurement tool.

Back to top

Technical Report

The full report is being assembled across the following notes:

Back to top

Status

The framework and methodology are drafted. Results and comparative runs will be added once evaluation is complete.

Back to top