In any given week, our team merges roughly 200 prompt changes into the [TalkDrill](https://talkdrill.com) production codebase: new tutoring scenarios, tighter scoring rubrics, A/B variants for an experiment, hotfixes for a phrase the model started getting wrong. None of them ship without passing a Promptfoo eval. This post covers the actual YAML, the real test cases, the CI integration, and the regression numbers from a year of running this pipeline at production scale.
- 200+ prompt changes merged weekly
- 1,400 test cases in the golden set
- 7 min median PR eval runtime
- 23 regressions caught before production (2026 YTD)
## TL;DR
We use [Promptfoo](https://github.com/promptfoo/promptfoo) as the test runner, a 1,400-case golden set partitioned by scenario type, and a GitHub Actions workflow that runs against Claude Sonnet 4.5 + GPT-5.4 on every PR touching a prompt file. Failure thresholds are per-scenario, not global. Median PR eval runs in 7 minutes; full nightly eval takes 90. The 23 regressions we caught before production this year include three that would have shipped band-7 IELTS feedback as band-9.
## Why this matters now
Prompt engineering without evals is the same mistake as deploying code without tests. The [Promptfoo project on GitHub](https://github.com/promptfoo/promptfoo) is the industry-standard tool, used by OpenAI and Anthropic internally, and Anthropic's [own engineering blog](https://www.anthropic.com/) has covered why prompt evals matter at production scale. As LLMs get more capable, the cost of a silent prompt regression goes up, not down: a more capable model is more confidently wrong.
## The pipeline at 30,000 feet
- Prompt files as source code. Prompts live in `prompts/*.tmpl` in the main repo, versioned in git like any other code.
- Golden set in YAML. 1,400 test cases across IELTS, business English, pronunciation, and conversation scenarios, each with a Promptfoo assertion.
- PR-triggered CI. A GitHub Action runs Promptfoo on any PR touching `prompts/` and fails the build if a regression threshold is breached.
- Nightly full sweep. The full 1,400-case eval runs nightly at 2am IST against both Claude and GPT, with a Slack alert if any per-scenario metric drifts >5%.
## Real promptfooconfig.yaml from our IELTS speaking grader
This is the actual config we use, lightly redacted for clarity:
```yaml
description: TalkDrill IELTS Speaking Grader v2026.05

prompts:
  - file://prompts/ielts_speaking_grader.tmpl

# Every test runs against both providers
providers:
  - id: anthropic:messages:claude-sonnet-4-5
    config:
      temperature: 0.0
      max_tokens: 800
  - id: openai:chat:gpt-5.4
    config:
      temperature: 0.0

# Assertions applied to every test case
defaultTest:
  assert:
    - type: is-json
      value:
        type: object
        required: ['fluency', 'lexical', 'grammar', 'pronunciation']
    - type: javascript
      value: |
        const r = JSON.parse(output)
        return r.fluency.band >= 1 && r.fluency.band <= 9

tests:
  - description: Band 8 candidate, fluent and idiomatic
    vars:
      transcript_path: file://fixtures/ielts/b8_idiomatic_001.txt
      audio_features_path: file://fixtures/ielts/b8_idiomatic_001.json
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output)
          return Math.abs(r.fluency.band - 8.0) <= 0.5 &&
            Math.abs(r.lexical.band - 8.0) <= 0.5
      - type: llm-rubric
        value: |
          The feedback must:
          1. Reference at least one specific idiom from the transcript
          2. Not over-praise grammar errors
          3. Mention coherence concretely, not vaguely

  - description: Band 5 candidate, fluent but rambling
    vars:
      transcript_path: file://fixtures/ielts/b5_rambling_007.txt
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output)
          // 2024 rubric caps fluent-but-rambling at band 6
          return r.fluency.band <= 6.0
      - type: javascript
        value: |
          const r = JSON.parse(output)
          // must NOT score >7 lexical for memorised C1 words
          return r.lexical.band <= 7.0
```
The interesting bit is the mix of deterministic assertions (JSON shape, band range) and model-graded `llm-rubric` assertions (the feedback quality check). The deterministic ones run first; if they fail, we don't waste a model call on rubric grading.
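The default assertion in the config only range-checks `fluency.band`. A minimal sketch of the same check extended to all four criteria follows. It assumes every criterion is an object with a numeric `band` field, which the config only shows for `fluency` and `lexical`; the function name `allBandsInRange` is hypothetical.

```javascript
// Sketch: range-check every criterion, not just fluency.
// Assumption: each criterion is an object with a numeric `band` field.
const CRITERIA = ['fluency', 'lexical', 'grammar', 'pronunciation'];

function allBandsInRange(output, min = 1, max = 9) {
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch (err) {
    return false; // not JSON at all: fail the check instead of throwing
  }
  if (parsed === null || typeof parsed !== 'object') return false;
  return CRITERIA.every((name) => {
    const band = parsed[name] && parsed[name].band;
    return typeof band === 'number' && band >= min && band <= max;
  });
}
```

Returning `false` on malformed output keeps the failure visible as a failed assertion rather than a crashed grader, which tends to be easier to read in a PR comment.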
## The golden set structure
We organize the 1,400 cases into nine scenario groups. Each group has its own threshold for "acceptable regression."
| Scenario group | Cases | Pass-rate target | Regression alert |
|---|---|---|---|
| IELTS Speaking grader | 220 | ≥ 92% | drop ≥3 pts |
| Pronunciation feedback | 180 | ≥ 90% | drop ≥4 pts |
| Business English roleplay | 160 | ≥ 88% | drop ≥5 pts |
| HR interview practice | 140 | ≥ 90% | drop ≥4 pts |
| Casual conversation | 200 | ≥ 85% | drop ≥5 pts |
| Lexical drill generator | 120 | ≥ 95% | drop ≥2 pts |
| Grammar coach | 160 | ≥ 92% | drop ≥3 pts |
| Safety / refusal | 120 | 100% | any failure |
| Adversarial / red-team | 100 | 100% | any failure |
The safety and adversarial groups have zero-tolerance thresholds — one failure blocks the merge. The conversational groups have wider tolerance because subjective feedback genuinely varies turn-to-turn.
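To make the per-scenario gating concrete, here is a minimal sketch of how such a check could look. It is not the actual `enforce_thresholds.js`; the scenario keys, the `THRESHOLDS` shape, and the `checkScenario` function are hypothetical, and only three rows of the table are encoded. A `maxDrop` of 0 is used here to represent the zero-tolerance groups, where the 100% target alone blocks the merge.

```javascript
// Hypothetical sketch of per-scenario gating; not the real enforce_thresholds.js.
// target: minimum pass rate in percent. maxDrop: alerting drop in points vs. baseline.
const THRESHOLDS = {
  ielts_speaking: { target: 92, maxDrop: 3 },
  lexical_drill: { target: 95, maxDrop: 2 },
  safety: { target: 100, maxDrop: 0 }, // zero tolerance: any failure misses the target
};

function checkScenario(name, passRate, baselineRate) {
  const rule = THRESHOLDS[name];
  if (!rule) throw new Error(`No threshold configured for scenario: ${name}`);
  const reasons = [];
  if (passRate < rule.target) {
    reasons.push(`pass rate ${passRate}% is below target ${rule.target}%`);
  }
  const drop = baselineRate - passRate;
  if (rule.maxDrop > 0 && drop >= rule.maxDrop) {
    reasons.push(`dropped ${drop.toFixed(1)} pts vs. baseline (alert at ${rule.maxDrop})`);
  }
  return { name, ok: reasons.length === 0, reasons };
}
```

A CI script built on this would call `checkScenario` once per scenario group and exit non-zero if any result has `ok: false`, printing the `reasons` so the PR author sees which rule tripped.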
## The GitHub Action
Triggered on any PR touching `prompts/` or `prompt_evals/`. Cached aggressively so unchanged tests don't re-run.
```yaml
name: prompt-evals

on:
  pull_request:
    paths:
      # Globs are needed here: a bare 'prompts/' does not match files inside the directory
      - 'prompts/**'
      - 'prompt_evals/**'

jobs:
  promptfoo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm install -g promptfoo@latest

      - name: Restore eval cache
        uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: pfoo-${{ hashFiles('prompts/**', 'prompt_evals/**') }}
          # Assumption: prefix fallback so an exact-key miss still restores the latest cache
          restore-keys: |
            pfoo-

      - name: Run scenario evals (parallel)
        run: |
          promptfoo eval -c prompt_evals/ielts.yaml \
            --no-write --output results/ielts.json
          promptfoo eval -c prompt_evals/pronunciation.yaml \
            --no-write --output results/pron.json
          # ... 9 scenarios total, run in parallel via matrix
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      - name: Enforce per-scenario thresholds
        run: node prompt_evals/enforce_thresholds.js results/

      - name: Comment PR with diff vs main
        uses: promptfoo/promptfoo-action@v1
        with:
          baseline: results-main/
```
The `promptfoo-action` step posts a diff comment on the PR showing what changed in scores per scenario versus the main branch. Reviewers can see at a glance whether a prompt edit improved or regressed something.
## How we cache to keep runtime down
A naive Promptfoo run on 1,400 cases costs roughly ₹420 in API calls (Claude + GPT). Running this on every PR would be ~₹85,000/month at our PR volume. The fix is aggressive caching.
1. Hash the prompt + inputs + provider config. Promptfoo's built-in cache hashes these. If the hash matches a cached result, the API call is skipped. Same inputs + same prompt = same cached output.
2. Mount the cache directory in GitHub Actions cache. The cache key is the hash of `prompts/` and `prompt_evals/`. Within a restored cache directory, a PR that changes one prompt file invalidates only the tests for that file; others re-use the cache. This depends on a previous cache being restored when the exact key misses (for example through a `restore-keys` prefix in `actions/cache`).
3. Force-fresh on the nightly sweep. Once a day we run with `--no-cache` to catch silent provider drift (a model update under the same version name). This is the only way to detect Anthropic or OpenAI shipping a stealth quality regression.
Effect: median PR eval costs ~₹40 in actual API spend. Monthly cost roughly ₹6,400 instead of ₹85,000.
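The idea behind request-level caching can be sketched in a few lines. This is an illustration of the principle only, not Promptfoo's actual implementation; `stableStringify` and `cacheKey` are hypothetical names, and SHA-256 is an arbitrary choice of hash.

```javascript
const crypto = require('node:crypto');

// Serialize with sorted object keys so logically equal inputs hash identically.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Illustrative cache key: any change to prompt, vars, or provider config yields a new key.
function cacheKey(prompt, vars, providerConfig) {
  const payload = stableStringify({ prompt, vars, providerConfig });
  return crypto.createHash('sha256').update(payload).digest('hex');
}
```

Sorting keys before hashing matters because two configs that differ only in key order should hit the same cache entry; without it, harmless YAML reordering would trigger paid API calls.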
## A regression we caught last week (real example)
A PR tightened the lexical-resource prompt with "score harshly for memorized phrases." Reasonable change. The PR author tested it on 8 cases and it looked good. The Promptfoo eval caught that it also dropped legitimate band-7 candidates by 0.6 bands on average, which is too harsh. The PR comment showed it on the diff. The author tweaked the wording ("score lower for memorized phrases" instead of "harshly") and the metrics returned to baseline. Total cycle time: 35 minutes. Production impact if it had shipped: ~12,000 students would have seen unfair scores in the next week.
## DIY: a minimal eval pipeline for your prompts
1. Install Promptfoo and write 30 test cases. Run `npm install -g promptfoo`. Write a single `promptfooconfig.yaml` with 30 representative test cases: your real production prompts, real inputs, and what good output looks like.
2. Add deterministic assertions first. Use `is-json`, `contains`, and `javascript` before reaching for `llm-rubric`. Deterministic checks are free, repeatable, and catch most regressions.
3. Run it against two models, not one. Setting `providers:` to a list runs the same prompt against multiple LLMs in parallel. You'll learn which model is robust to your prompt and which is brittle.
4. Wire it into your PR check. Use the official `promptfoo-action`. Five lines of YAML. Block merges on failure.
5. Grow the test set on every incident. Every prompt-related bug you hit in production becomes a new test case. The golden set grows by 5-10 cases per week organically. After 6 months you have a real safety net.
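The last step, turning incidents into test cases, is easy to mechanize. A minimal sketch follows, assuming an incident record with `id`, `summary`, `vars`, `mustContain`, and `mustNotContain` fields; that record shape and the `incidentToTestCase` helper are hypothetical. The returned object follows the `description` / `vars` / `assert` test shape used in the config earlier in this post.

```javascript
// Hypothetical helper: turn a production incident into a regression test case.
// `contains` and `not-contains` are deterministic Promptfoo assertion types.
function incidentToTestCase(incident) {
  const assertions = [];
  for (const phrase of incident.mustContain || []) {
    assertions.push({ type: 'contains', value: phrase });
  }
  for (const phrase of incident.mustNotContain || []) {
    assertions.push({ type: 'not-contains', value: phrase });
  }
  if (assertions.length === 0) {
    throw new Error('An incident needs at least one checkable expectation');
  }
  return {
    description: `Incident ${incident.id}: ${incident.summary}`,
    vars: { ...incident.vars },
    assert: assertions,
  };
}
```

Refusing incidents without a checkable expectation is a deliberate choice in this sketch: a test case with no assertion always passes and only adds cost.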
Production gotcha: model versions drift silently. Anthropic and OpenAI sometimes ship behavior changes under the same model ID (claude-sonnet-4-5, gpt-5.4). Our nightly --no-cache sweep is the only thing that catches this. Without it, we'd merge a clean PR and see production drift a week later with no obvious cause.
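The nightly drift check can be sketched as a comparison of per-scenario pass rates between two runs. This assumes "drift" means an absolute change in percentage points in either direction, which the post does not specify; the `detectDrift` function and its input shape (an object mapping scenario name to pass rate) are hypothetical.

```javascript
// Sketch: compare tonight's per-scenario pass rates with the previous night's.
// Assumption: drift is an absolute change in percentage points, in either direction.
function detectDrift(previous, current, thresholdPts = 5) {
  const alerts = [];
  for (const [scenario, rate] of Object.entries(current)) {
    if (!(scenario in previous)) continue; // new scenario: no baseline yet
    const delta = rate - previous[scenario];
    if (Math.abs(delta) > thresholdPts) {
      alerts.push({ scenario, previous: previous[scenario], current: rate, delta });
    }
  }
  return alerts;
}
```

Alerting on improvements as well as drops is intentional here: an unexplained jump under an unchanged prompt is also evidence that the provider changed something.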
## When NOT to invest in this pipeline
- Under 10 prompts in production. Manually testing 10 prompts after every change is fine. The pipeline overhead isn't worth it.
- Prompts are heavily randomized (temperature > 0.5). With high temperature, the same input produces different outputs. You can still test for structural assertions (must return JSON, must not contain X), but score-based assertions get noisy. Lower temperature where you can.
- Prompts go to a single LLM provider you control. If you run your own fine-tuned model with a frozen version, drift is your problem, not your provider's. The nightly `--no-cache` sweep is less valuable; you only need PR-level eval.
## Real numbers (May 2026)
- 23 prompt regressions caught before production this calendar year
- 7 of those would have been noticeable to users within a week
- 3 of those would have been catastrophic (band-7 → band-9 inflation in IELTS feedback)
- 0 prompt regressions reached production via the PR path (one snuck in via direct main commit — different bug)
- Median PR feedback time: 9 minutes (eval runtime + queue)
- Engineering hours saved per quarter, estimated: ~80
- Monthly API cost of running evals: ~₹6,400
## Why this matters for our work
[TalkDrill](https://talkdrill.com) ships with 200+ prompts in production at any given time across the IELTS grader, pronunciation feedback, business roleplay, and conversation scenarios. Without an eval pipeline, prompt iteration would be either slow (manual testing) or risky (no testing). The same setup — Promptfoo, golden set, GitHub Actions, scenario-partitioned thresholds — is what we now bring to client AI projects when they ask "how do we ship prompt changes safely?" The infrastructure is portable; the test cases are the IP.
For the live conversation, [the Promptfoo GitHub Discussions](https://github.com/promptfoo/promptfoo/discussions) is the single best place to see how other teams are using the tool, and the [Kinde piece on CI/CD for evals](https://www.kinde.com/learn/ai-for-software-engineering/ai-devops/ci-cd-for-evals-running-prompt-and-agent-regression-tests-in-github-actions/) is the most-cited writeup of the pattern in 2026.
## FAQ
### What's the difference between Promptfoo and LangSmith?
Promptfoo is local-first, open-source, CI-friendly, and free to self-host. LangSmith is hosted, has nicer UX for browsing runs, and integrates tightly with LangChain. We use Promptfoo because we don't want to send our test data to a third party. Different teams reasonably pick differently.
### How do you handle test cases that depend on RAG or external state?
Stub the external state in the test. For our pronunciation tests, we save the audio-feature JSON output once and replay it from the fixture file. The LLM under test doesn't know it's a fixture. Pure prompt-and-deterministic-input is testable; anything that depends on live network calls isn't and shouldn't be in the eval suite.
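The record-once, replay-afterwards pattern can be sketched as below. The `loadOrRecordFixture` helper and the `produce` callback are hypothetical; `produce` stands in for whatever expensive or non-deterministic step generates the data (for example, audio feature extraction).

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Record-once, replay-after: `produce` is only called when the fixture file is missing.
function loadOrRecordFixture(fixturePath, produce) {
  if (fs.existsSync(fixturePath)) {
    return JSON.parse(fs.readFileSync(fixturePath, 'utf8'));
  }
  const value = produce();
  fs.mkdirSync(path.dirname(fixturePath), { recursive: true });
  fs.writeFileSync(fixturePath, JSON.stringify(value, null, 2));
  return value;
}
```

Committing the recorded fixture to the repo means CI never calls `produce` at all, so the eval depends only on the prompt and the frozen input.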
### What's the most common failure mode of LLM-rubric assertions?
The rubric grader (which is itself an LLM) is generous. We caught it scoring "the feedback was thoughtful" as a pass when the underlying feedback was actually wrong. The fix: write the rubric as a list of specific must-have phrases or pieces of evidence, not a vibe-check.
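One way to back a rubric with hard evidence is a deterministic pre-check that runs before any model-graded assertion. A minimal sketch, assuming each test case carries a list of phrases the feedback must mention (for instance, idioms taken from the transcript); `missingEvidence` and `requiredEvidence` are hypothetical names.

```javascript
// Sketch: deterministic pre-check that feedback cites concrete evidence.
// Returns the required items that the feedback does not mention (case-insensitive).
function missingEvidence(feedback, requiredEvidence) {
  const haystack = feedback.toLowerCase();
  return requiredEvidence.filter((item) => !haystack.includes(item.toLowerCase()));
}
```

An empty result means every required item appears in the feedback. A non-empty result names exactly what is missing, which is more actionable than a grader's pass/fail verdict.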
### How big should the golden set be to start?
30 cases for a small product. 200 if you're already in production. 1,400 (our size) only after a year of operating. Grow it incrementally — adding 5 cases a week is sustainable.
### Should evals block deploys, not just PRs?
Yes for safety/adversarial scenarios. We have an additional eval that runs against the deploy artifact (not just the PR diff) before promotion to production. Costs us another ~3 minutes; worth it.
### Can Promptfoo grade against a previous-run baseline?
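Yes. `promptfoo eval --compare baseline.json --fail-on-regression` is the canonical pattern. We use it heavily for the nightly sweep against the previous night's results. Confirm the flag names against `promptfoo eval --help` for your installed version, since CLI options change between releases.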
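Independently of the CLI, the core of a baseline comparison is small enough to script. A minimal sketch, assuming each run has already been reduced to an array of `{ description, pass }` records; that shape is an assumption for this example, not Promptfoo's output schema, and `findRegressions` is a hypothetical name.

```javascript
// Sketch: find cases that passed in the baseline run but fail in the current run.
// Assumes each run is an array of { description, pass } records (not Promptfoo's schema).
function findRegressions(baseline, current) {
  const passedBefore = new Set(
    baseline.filter((r) => r.pass).map((r) => r.description)
  );
  return current
    .filter((r) => !r.pass && passedBefore.has(r.description))
    .map((r) => r.description);
}
```

Cases that were already failing in the baseline are deliberately excluded, so the result lists only what the current change (or provider drift) broke.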
### What's the most valuable single test case in our set?
The "memorized C1 vocabulary, capped at band 6" IELTS lexical test. It's caught four separate regressions where a well-meaning prompt change re-introduced the pre-2024 vocabulary-range bias. Worth more than a dozen general-purpose cases.
Need a Prompt-Eval Pipeline for Your AI Product?
We build Promptfoo-based eval pipelines for production AI products — golden set construction, scenario partitioning, CI integration, regression alerting. Typical engagement: 4-6 weeks from scoping to a working PR-blocking pipeline. The first call is with the engineer who'd lead your project.