Skip to content
Back to Blog

199 Evals Before Our Dental AI Answers a Call

Bryan Mathews
AnthropicEvalsDental AIProduction

The shortest answer to "is your dental AI safe?" is the eval suite. Not because evals catch everything — they don't — but because the discipline of having one and running it forces you to enumerate, in code, the specific things the AI should and shouldn't do. That enumeration is what an answering service or a junior dev can't write down.

Velyn Dental's current committed offline baseline is 22 suites, 199 cases, 199 passed. Captured 2026-05-12. Committed to the repo as docs/partner/anthropic/artifacts/evals/anthropic-offline-2026-05-12. The baseline runs locally without API credentials, takes seconds, and gates every dental-runtime PR.

This post walks through what those 199 cases cover, the redaction policy that lets us commit results to a public repo, and why we treat offline coverage as the floor rather than chasing live-demo evidence first.

Why offline first

Live evals are tempting because they look good in a partner deck. "We tested Claude with real production data" sounds stronger than "we tested Claude with fixtures."

It isn't. Production data carries PHI. PHI in a public repo is a HIPAA violation. PHI in a partner pack you send to Anthropic is a sub-processor problem. Live evals captured against real practice data are not commit-safe, which means they can't be the gate that runs on every PR.

Offline evals against synthetic fixtures are commit-safe, run in seconds locally, run for free in CI on every PR, and gate the dental runtime changes that would otherwise ship without coverage. They're the floor. Live evals are additive — they prove the harness works against the real API and the real model — but they're not the substitute.

Per the evaluations doc:

npm --prefix apps/portal run eval:anthropic

That command runs the entire 22-suite, 199-case offline baseline. No API key needed.

The redaction policy that lets us commit

The single architectural decision that made committed evals possible is the result schema. AnthropicEvalRunResult retains only:

id, workflow, mode, passed, issues[]

No raw model output. No user prompts. No tool call inputs or outputs. No PHI source data. The committed JSON is a binary pass/fail per case with a stable case ID and an issues array that names the specific safety findings (without quoting them).

That schema lets the result JSON live in docs/partner/anthropic/artifacts/evals/ in a public repo without leaking anything sensitive. The redactionNote field in every artifact documents this explicitly, so a reviewer pulling the artifact six months from now knows exactly what was kept and what was discarded.

This is the kind of design decision that has to be made before the eval harness ships. Retrofitting redaction onto an existing eval format that captured raw outputs is a much bigger problem than starting with a strict schema and never widening it.

What the 22 suites cover

The current suite list (as of the 2026-05-12 baseline):

Core dental workflows (7 suites)

| Suite | Workflow | What it catches | |---|---|---| | chart-examples | Treatment plan explanation | Forces human review on higher-risk patient-facing explanations | | claims-scenarios | Insurance / knowledge answers | Grounded safe responses; catches unsafe coverage promises | | dental-transcripts | Summarization | Redaction discipline + grounded voicemail / call summary behavior | | intent-classification | Intent classification | Dental caller intent hinting for intake and after-hours IVR flows | | ivr-speech-routing | IVR speech routing | Route decisions, grounding drift, hallucination catch on the live receptionist path | | recall-outreach | Outreach drafting | Grounded recall copy for routine, overdue, and insurance-update scheduling | | no-show-reengagement | Outreach drafting | Grounded rebooking copy after first and repeat missed appointments |

Outbound recall persona safety (5 suites)

Added in the W-V outbound-recall sprint. Each suite gates a different invariant from apps/portal/lib/anthropic/personas/outbound_recall.json:

  • TCPA opener / AI disclosure — Brynn must self-identify as AI in the first turn
  • No-fabricated-appointments — Slot offers must validate against PracticeSlotPool
  • Voicemail-vs-live-answer routing — Voicemail and live answers route to different prompts
  • Opt-out detection — STOP intent triggers immediate suppression
  • Time-of-day awareness — TCPA 8 AM to 9 PM window enforced in patient's local time zone

Velyn fixture coverage (5 suites)

Added in T-044 to close the audit gap from the 2026-05-11 Velyn infra audit:

  • velyn-missed-call-draft — Missed-call recovery SMS drafting
  • velyn-missed-call-reply-intent — Patient reply intent classification
  • velyn-recall-personalize — Recall message personalization without fabricating prior history
  • velyn-reactivation-personalize — Reactivation messaging for long-lapsed patients
  • velyn-insurance-eligibility — Insurance routing without quoting coverage

Remaining (5 suites)

The remaining suites cover supplementary cases under the seven core workflows above — additional fixtures stress-testing specific safety patterns surfaced during dogfooding. Full coverage detail lives in evals/anthropic/.

What a suite actually looks like

Each suite is a directory of JSON fixtures under evals/anthropic/<suite-name>/. Each fixture has:

  • A scenario (the prompt or call context)
  • The workflow being tested (summarization, knowledge, ivr-speech-routing, etc.)
  • A set of safety assertions (what the output must do, what it must not do)
  • Optionally, a deterministic-fallback expectation (what the safety floor should produce if Claude fails)

The eval runner reads the fixture, calls the workflow's runtime (which calls Claude if --live, or a deterministic stub if --offline), and checks the output against the assertions. The result row is just { id, workflow, mode, passed, issues[] } — committed shape.

A failure in claims-scenarios doesn't mean Claude is wrong about insurance — it means the workflow returned a coverage assertion the runtime should never make. The fix is usually in the workflow's grounding logic, not in the model. The eval catches the regression before the workflow ships.

How the eval gates dental runtime changes

Every PR that touches apps/portal/lib/anthropic/** or any dental workflow file is expected to leave the offline baseline at 199/199. This is documented in CLAUDE.md:

Before shipping any change to a dental AI workflow: run eval:anthropic and confirm the offline baseline still passes.

That's the gate. Not "run the eval if you remember." Not "let CI catch it." The reviewer runs it, the contributor runs it, the baseline holds.

When the baseline breaks, the failure mode is informative. A regression in dental-transcripts means the summarization prompt or the redaction logic changed in a way that broke an existing assertion. A regression in ivr-speech-routing means the routing decision logic changed in a way that breaks an existing safety case. The fix is usually obvious from the failing case ID — that's the upside of the per-case-ID committed result format.

The live capture story

The current committed live capture (anthropic-live-2026-03-16) is a placeholder. The capture script wrote the artifact pair, but the JSON summary fields are still null — pending an authenticated rerun against a valid ANTHROPIC_API_KEY with the full 22-suite, 199-case harness.

That's an honest gap, not a hidden one. The evaluations doc names it explicitly:

Current live capture state: Capture date: 2026-03-16 placeholder artifact. Suites: pending authenticated rerun. Cases: pending authenticated rerun. Status: JSON summary fields are still null; rerun the capture script with a valid ANTHROPIC_API_KEY to capture the current 7-suite, 21-case harness.

The acceptance bar for qualifying live partner evidence is also explicit: at least one row with "mode":"live" and "passed":true; artifact JSON contains captureDate and redactionNote; no PHI, secrets, or raw model text in the file. The placeholder doesn't meet the bar, and we don't pretend it does.

The offline baseline does meet the bar for committed coverage. The live capture is the next milestone. The Anthropic Development Partner Program path requires live evidence eventually; the offline baseline is what gates the day-to-day shipping discipline in the meantime.

What evals don't catch

Evals catch regressions against known cases. They don't catch novel failure modes. A new way the model can be wrong that we haven't written a case for is still a way the model can be wrong.

That's why the eval suite isn't the only safety layer — it sits on top of the deterministic safety floor (covered in the previous post in this series) and the COE review cadence (covered in governance.md). Evals catch known bugs; the safety floor catches unknown ones; the COE catches systemic patterns the floor can't see.

The 199 cases are the floor of what we know to test. The list grows when production behavior surfaces a new failure mode. When that happens, the workflow is to (1) reproduce the failure as a new fixture, (2) add the safety assertion, (3) confirm the case fails against current runtime, (4) fix the runtime, (5) confirm the case passes. The case stays in the suite forever.

That's the eval discipline that makes "we tested 199 cases" meaningful. The cases earned their place by catching something. The number going up is evidence of new bugs found and fixed, not evidence of expanded surface area for its own sake.


This post is the third in a three-part series on Velyn Dental's Claude-native architecture. See also: the architecture itself and the five governance exceptions.