Prompt Eval Pipelines: 200 Changes a Week Without Breaking TalkDrill

In any given week, our team merges roughly 200 prompt changes into the [TalkDrill](https://talkdrill.com) production codebase — new tutoring scenarios, tighter scoring rubrics, A/B variants for an experiment, hotfixes for a phrase the model started getting wrong. None of them ship without passing a Promptfoo eval. This post is the actual YAML, the real test cases, the CI integration, and the regression numbers from a year of running this pipeline at production scale.

- 200+ prompt changes merged weekly
- 1,400 test cases in the golden set
- 7 min median PR eval runtime
- 23 regressions caught before production (2026 YTD)

## TL;DR

We use [Promptfoo](https://github.com/promptfoo/promptfoo) as the test runner, a 1,400-case golden set partitioned by scenario type, and a GitHub Actions workflow that runs against Claude Sonnet 4.5 + GPT-5.4 on every PR touching a prompt file. Failure thresholds are per-scenario, not global. Median PR eval runs in 7 minutes; full nightly eval takes 90. The 23 regressions we caught before production this year include three that would have shipped band-7 IELTS feedback as band-9.

## Why this matters now

Prompt engineering without evals is the same mistake as deploying code without tests. The [Promptfoo project on GitHub](https://github.com/promptfoo/promptfoo) is the industry-standard tool (used by OpenAI and Anthropic internally), and Anthropic's [own engineering blog](https://www.anthropic.com/) has covered why prompt evals matter at production scale. As LLMs themselves get more capable, the cost of a silent prompt regression goes up, not down: a more capable model is more confidently wrong.

## The pipeline at 30,000 feet

- Prompt files as source code: prompts live in `prompts/*.tmpl` in the main repo, versioned in git like any other code.
- Golden set in YAML: 1,400 test cases across IELTS, business English, pronunciation, and conversation scenarios, each with a Promptfoo assertion.
- PR-triggered CI: a GitHub Action runs Promptfoo on any PR touching `prompts/` and fails the build if a regression threshold is breached.
- Nightly full sweep: the full 1,400-case eval runs nightly at 2am IST against both Claude and GPT, with a Slack alert if any per-scenario metric drifts >5%.

## Real promptfooconfig.yaml from our IELTS speaking grader

This is the actual config we use, lightly redacted for clarity:

```yaml
description: TalkDrill IELTS Speaking Grader v2026.05

prompts:
  - file://prompts/ielts_speaking_grader.tmpl

providers:
  - id: anthropic:messages:claude-sonnet-4-5
    config:
      temperature: 0.0
      max_tokens: 800
  - id: openai:chat:gpt-5.4
    config:
      temperature: 0.0

defaultTest:
  assert:
    - type: is-json
      value:
        type: object
        required: ['fluency', 'lexical', 'grammar', 'pronunciation']
    - type: javascript
      value: |
        const r = JSON.parse(output)
        return r.fluency.band >= 1 && r.fluency.band <= 9

tests:
  - description: Band 8 candidate, fluent and idiomatic
    vars:
      transcript_path: file://fixtures/ielts/b8_idiomatic_001.txt
      audio_features_path: file://fixtures/ielts/b8_idiomatic_001.json
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output)
          return Math.abs(r.fluency.band - 8.0) <= 0.5 &&
                 Math.abs(r.lexical.band - 8.0) <= 0.5
      - type: llm-rubric
        value: |
          The feedback must:
          1. Reference at least one specific idiom from the transcript
          2. Not over-praise grammar errors
          3. Mention coherence concretely, not vaguely

  - description: Band 5 candidate, fluent but rambling
    vars:
      transcript_path: file://fixtures/ielts/b5_rambling_007.txt
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output)
          // 2024 rubric caps fluent-but-rambling at band 6
          return r.fluency.band <= 6.0
      - type: javascript
        value: |
          const r = JSON.parse(output)
          // must NOT score >7 lexical for memorised C1 words
          return r.lexical.band <= 7.0
```
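The range check in `defaultTest` above only looks at `fluency`. One way to cover all four criteria is to move the check into a shared assertion file, which a `javascript` assertion can load through a `file://` value. The sketch below is illustrative rather than our production file: the filename `assert_bands.js`, the helper name `checkBands`, and the half-band step rule are all assumptions.

```javascript
// assert_bands.js (hypothetical filename)
// Deterministic check: every criterion must carry a band from 1 to 9 in
// half-band steps. Assumes the output shape used in the config above:
// { fluency: { band }, lexical: { band }, grammar: { band }, pronunciation: { band } }

const CRITERIA = ['fluency', 'lexical', 'grammar', 'pronunciation'];

function checkBands(output) {
  let parsed;
  try {
    parsed = typeof output === 'string' ? JSON.parse(output) : output;
  } catch (err) {
    return { pass: false, score: 0, reason: `Output is not valid JSON: ${err.message}` };
  }

  const problems = [];
  for (const name of CRITERIA) {
    const band = parsed && parsed[name] ? parsed[name].band : undefined;
    const inRange = typeof band === 'number' && band >= 1 && band <= 9;
    // Half-band steps: doubling a valid band always gives an integer.
    const halfStep = inRange && Number.isInteger(band * 2);
    if (!halfStep) {
      problems.push(`${name}.band is ${JSON.stringify(band)}`);
    }
  }

  const pass = problems.length === 0;
  return {
    pass,
    score: pass ? 1 : 0,
    reason: pass ? 'All bands valid' : problems.join('; '),
  };
}

// Promptfoo calls the exported function with (output, context).
if (typeof module !== 'undefined') {
  module.exports = checkBands;
}
```

Under these assumptions the inline check in `defaultTest` would become `value: file://assert_bands.js`, and the `reason` string shows up in the eval output when a band is missing or out of range.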

The interesting bit is the mix of deterministic assertions (JSON shape, band range) and model-graded llm-rubric assertions (the feedback quality check). The deterministic ones run first; if they fail, we don't waste a model call on rubric grading.

## The golden set structure

We organize the 1,400 cases into nine scenario groups. Each group has its own threshold for "acceptable regression."

| Scenario group | Cases | Pass-rate target | Regression alert |
| --- | --- | --- | --- |
| IELTS Speaking grader | 220 | ≥ 92% | drop ≥3 pts |
| Pronunciation feedback | 180 | ≥ 90% | drop ≥4 pts |
| Business English roleplay | 160 | ≥ 88% | drop ≥5 pts |
| HR interview practice | 140 | ≥ 90% | drop ≥4 pts |
| Casual conversation | 200 | ≥ 85% | drop ≥5 pts |
| Lexical drill generator | 120 | ≥ 95% | drop ≥2 pts |
| Grammar coach | 160 | ≥ 92% | drop ≥3 pts |
| Safety / refusal | 120 | 100% | any failure |
| Adversarial / red-team | 100 | 100% | any failure |
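To make the per-scenario idea concrete, here is a minimal sketch of how a gate over this table could be written. It is not our real `enforce_thresholds.js`: the scenario keys, the function names, and the rule that either a missed target or an alert-sized drop blocks the merge are assumptions for illustration. It takes pass rates as plain percentages, so reading them out of the eval output files is left out.

```javascript
// Hypothetical per-scenario gate. Targets and alert sizes are copied from
// the table above (three of the nine rows shown).
const THRESHOLDS = {
  ielts: { target: 92, maxDrop: 3 },
  casual_conversation: { target: 85, maxDrop: 5 },
  safety: { target: 100, maxDrop: 0 }, // zero tolerance: the 100% target does the work
};

// current and baseline are pass rates in percent for one scenario group;
// baseline is the same scenario evaluated on main.
function checkScenario(name, current, baseline) {
  const rule = THRESHOLDS[name];
  if (!rule) {
    throw new Error(`No threshold configured for "${name}"`);
  }

  const reasons = [];
  if (current < rule.target) {
    reasons.push(`pass rate ${current}% is below target ${rule.target}%`);
  }
  const drop = baseline - current;
  if (rule.maxDrop > 0 && drop >= rule.maxDrop) {
    reasons.push(`dropped ${drop.toFixed(1)} pts vs main (alert at ${rule.maxDrop})`);
  }
  return { name, ok: reasons.length === 0, reasons };
}

// Returns the overall verdict plus the scenarios that failed.
function enforce(results) {
  const failures = results
    .map(({ name, current, baseline }) => checkScenario(name, current, baseline))
    .filter((result) => !result.ok);
  return { ok: failures.length === 0, failures };
}
```

A CI wrapper would call `enforce(...)` with one entry per scenario and exit non-zero when `ok` is false. Keeping the rules in one table-shaped object means a threshold change is a one-line diff that reviewers can see.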

The safety and adversarial groups have zero-tolerance thresholds: one failure blocks the merge. The conversational groups have wider tolerance because subjective feedback genuinely varies turn-to-turn.

## The GitHub Action

Triggered on any PR touching prompts/ or prompt_evals/. Cached aggressively so unchanged tests don't re-run.

```yaml
name: prompt-evals
on:
  pull_request:
    paths:
      # Path filters are globs: ** matches every file under the directory
      - 'prompts/**'
      - 'prompt_evals/**'

jobs:
  promptfoo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }
      - run: npm install -g promptfoo@latest
      - name: Restore eval cache
        uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: pfoo-${{ hashFiles('prompts/**', 'prompt_evals/**') }}
          # Fall back to the newest earlier cache so unchanged tests stay cached
          restore-keys: |
            pfoo-
      - name: Run scenario evals (parallel)
        run: |
          promptfoo eval -c prompt_evals/ielts.yaml \
            --no-write --output results/ielts.json
          promptfoo eval -c prompt_evals/pronunciation.yaml \
            --no-write --output results/pron.json
          # ... 9 scenarios total, run in parallel via matrix
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Enforce per-scenario thresholds
        run: node prompt_evals/enforce_thresholds.js results/
      - name: Comment PR with diff vs main
        uses: promptfoo/promptfoo-action@v1
        with:
          baseline: results-main/
```

The promptfoo-action step posts a nice diff comment on the PR: what changed in scores per scenario versus main branch. Reviewers can see at a glance whether a prompt edit improved or regressed something.

## How we cache to keep runtime down

A naive Promptfoo run on 1,400 cases costs roughly ₹420 in API calls (Claude + GPT). Running this on every PR would be ~₹85,000/month at our PR volume. The fix is aggressive caching.

1. Hash the prompt + inputs + provider config. Promptfoo's built-in cache hashes these. If the hash matches a cached result, skip the API call. Same inputs + same prompt = same cached output.
2. Mount the cache directory in GitHub Actions cache. Cache key is the hash of prompts/ and prompt_evals/. A PR that changes one prompt file invalidates only the tests for that file; others re-use the cache.
3. Force-fresh on the nightly sweep. Once a day we run with --no-cache to catch silent provider drift (a model update under the same version name). This is the only way to detect Anthropic or OpenAI shipping a stealth quality regression.
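The idea behind the first step can be shown in a few lines. This is a sketch of content-addressed cache keys in general, not Promptfoo's actual implementation; the function names `stableStringify` and `cacheKey` and the choice of SHA-256 are assumptions.

```javascript
const { createHash } = require('node:crypto');

// Serialize with sorted object keys so that logically equal configs
// produce the same string regardless of key order.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value);
}

// Illustrative cache key: any change to the prompt text, the test
// variables, or the provider settings yields a different key.
function cacheKey({ prompt, vars, provider }) {
  return createHash('sha256')
    .update(stableStringify({ prompt, vars, provider }))
    .digest('hex');
}

const base = {
  prompt: 'Grade this IELTS transcript: {{transcript}}',
  vars: { transcript: 'Well, I reckon the city has changed a lot.' },
  provider: { id: 'anthropic:messages:claude-sonnet-4-5', temperature: 0 },
};

// Editing the prompt changes the key, so only that test misses the cache.
const edited = { ...base, prompt: base.prompt + ' Score harshly.' };
console.log(cacheKey(base) === cacheKey(edited)); // false
```

The same property is what makes the nightly `--no-cache` run necessary: a provider-side model change does not alter any of the hashed inputs, so a cached answer would hide it.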

Effect: median PR eval costs ~₹40 in actual API spend. Monthly cost roughly ₹6,400 instead of ₹85,000.

## A regression we caught last week (real example)

A PR tightened the lexical-resource prompt with "score harshly for memorized phrases." Reasonable change. The PR author tested it on 8 cases and it looked good. The Promptfoo eval caught that it also dropped legitimate band-7 candidates by 0.6 bands on average, which was too harsh. The PR comment showed it on the diff. The author tweaked the wording ("score lower for memorized phrases" instead of "harshly") and the metrics returned to baseline. Total cycle time: 35 minutes. Production impact if it had shipped: ~12,000 students would have seen unfair scores in the next week.

## DIY: a minimal eval pipeline for your prompts

1. Install Promptfoo and write 30 test cases. Run `npm install -g promptfoo`, then write a single promptfooconfig.yaml with 30 representative test cases: your real production prompts, real inputs, and what good output looks like.
2. Add deterministic assertions first. Use is-json, contains, and javascript before reaching for llm-rubric. Deterministic checks cost nothing to run, give the same answer every time, and catch most regressions.
3. Run it against two models, not one. Setting providers: to a list runs the same prompt against multiple LLMs in parallel. You'll learn which model is robust to your prompt and which is brittle.
4. Wire it into your PR check. Use the official promptfoo-action. Five lines of YAML. Block merges on failure.
5. Grow the test set on every incident. Every prompt-related bug you hit in production becomes a new test case. The golden set grows by 5-10 cases per week organically. After 6 months you have a real safety net.
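The last step is easier to keep up if turning an incident into a test case is mechanical. The helper below is a hypothetical sketch: the shape of the `incident` object (`id`, `summary`, `fixturePath`, `criterion`, `maxBand`) is invented for illustration, while the output uses the same `description`, `vars`, and `assert` keys as the config shown earlier.

```javascript
// Hypothetical helper: turn a production incident into a golden-set entry.
const ALLOWED_CRITERIA = ['fluency', 'lexical', 'grammar', 'pronunciation'];

function incidentToTestCase(incident) {
  if (!incident.id || !incident.fixturePath) {
    throw new Error('incident needs an id and a fixturePath');
  }
  if (!ALLOWED_CRITERIA.includes(incident.criterion)) {
    throw new Error(`unknown criterion "${incident.criterion}"`);
  }
  if (typeof incident.maxBand !== 'number') {
    throw new Error('maxBand must be a number');
  }

  return {
    description: `Incident ${incident.id}: ${incident.summary}`,
    vars: {
      // The fixture is the saved input that triggered the bug.
      transcript_path: `file://${incident.fixturePath}`,
    },
    assert: [
      {
        type: 'javascript',
        // Multi-line javascript assertions need an explicit return.
        value: [
          'const r = JSON.parse(output)',
          `return r.${incident.criterion}.band <= ${incident.maxBand}`,
        ].join('\n'),
      },
    ],
  };
}

const testCase = incidentToTestCase({
  id: 'INC-0000', // placeholder id
  summary: 'memorised vocabulary over-scored',
  fixturePath: 'fixtures/ielts/example_incident.txt', // placeholder path
  criterion: 'lexical',
  maxBand: 7,
});

// JSON is valid YAML, so this can be pasted under tests: as-is.
console.log(JSON.stringify(testCase, null, 2));
```

Validating `criterion` against a fixed list matters because it is interpolated into assertion code; anything outside the list is rejected before it reaches the config.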

Production gotcha: model versions drift silently. Anthropic and OpenAI sometimes ship behavior changes under the same model ID (claude-sonnet-4-5, gpt-5.4). Our nightly --no-cache sweep is the only thing that catches this. Without it, we'd merge a clean PR and see production drift a week later with no obvious cause.
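A drift check over two nightly runs can be very small. The sketch below assumes per-scenario pass rates have already been extracted as percentages keyed by scenario name; the function name `findDrift` is hypothetical, and reading the 5% alert limit as absolute percentage points in either direction is an assumption.

```javascript
// Sketch of a nightly drift check between two uncached runs.
const DRIFT_LIMIT_PTS = 5;

function findDrift(previous, current, limit = DRIFT_LIMIT_PTS) {
  const drifted = [];
  for (const [scenario, rate] of Object.entries(current)) {
    if (!(scenario in previous)) continue; // new scenario, nothing to compare
    const delta = rate - previous[scenario];
    // Flag movement in either direction: a sudden jump up can also
    // mean the provider's behavior changed.
    if (Math.abs(delta) > limit) {
      drifted.push({ scenario, previous: previous[scenario], current: rate, delta });
    }
  }
  return drifted;
}

const lastNight = { ielts: 93, grammar: 94, casual_conversation: 88 };
const tonight = { ielts: 87, grammar: 93, casual_conversation: 88 };

for (const d of findDrift(lastNight, tonight)) {
  console.log(`${d.scenario}: ${d.previous}% -> ${d.current}% (${d.delta} pts)`);
}
```

Because no prompt changed between the two runs, anything this flags points at the provider or the fixtures rather than at a PR.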

## When NOT to invest in this pipeline

- Under 10 prompts in production. Manually testing 10 prompts after every change is fine. The pipeline overhead isn't worth it.
- Prompts are heavily randomized (temperature > 0.5). With high temperature, the same input produces different outputs. You can still test for structural assertions (must return JSON, must not contain X), but score-based assertions get noisy. Lower temperature where you can.
- Prompts go to a single LLM provider you control. If you run your own fine-tuned model with a frozen version, drift is your problem, not your provider's. The nightly --no-cache sweep is less valuable; you only need PR-level eval.

## Real numbers (May 2026)

- 23 prompt regressions caught before production this calendar year
- 7 of those would have been "noticeable" by users within a week
- 3 of those would have been catastrophic (band-7 → band-9 inflation in IELTS feedback)
- 0 prompt regressions reached production via the PR path (one snuck in via direct main commit, a different bug)
- Median PR feedback time: 9 minutes (eval runtime + queue)
- Engineering hours saved per quarter, estimated: ~80
- Monthly API cost of running evals: ~₹6,400

## Why this matters for our work

[TalkDrill](https://talkdrill.com) ships with 200+ prompts in production at any given time across the IELTS grader, pronunciation feedback, business roleplay, and conversation scenarios. Without an eval pipeline, prompt iteration would be either slow (manual testing) or risky (no testing). The same setup (Promptfoo, golden set, GitHub Actions, scenario-partitioned thresholds) is what we now bring to client AI projects when they ask "how do we ship prompt changes safely?" The infrastructure is portable; the test cases are the IP.

For the live conversation, [the Promptfoo GitHub Discussions](https://github.com/promptfoo/promptfoo/discussions) is the single best place to see how other teams are using the tool, and the [Kinde piece on CI/CD for evals](https://www.kinde.com/learn/ai-for-software-engineering/ai-devops/ci-cd-for-evals-running-prompt-and-agent-regression-tests-in-github-actions/) is the most-cited writeup of the pattern in 2026.

## FAQ

### What's the difference between Promptfoo and LangSmith?

Promptfoo is local-first, open-source, CI-friendly, and free to self-host. LangSmith is hosted, has nicer UX for browsing runs, and integrates tightly with LangChain. We use Promptfoo because we don't want to send our test data to a third party. Different teams reasonably pick differently.

### How do you handle test cases that depend on RAG or external state?

Stub the external state in the test. For our pronunciation tests, we save the audio-feature JSON output once and replay it from the fixture file. The LLM under test doesn't know it's a fixture. Pure prompt-and-deterministic-input is testable; anything that depends on live network calls isn't and shouldn't be in the eval suite.

### What's the most common failure mode of LLM-rubric assertions?

The rubric grader (which is itself an LLM) is generous. We caught it scoring "the feedback was thoughtful" as a pass when the underlying feedback was actually wrong. The fix: write the rubric as a list of specific must-have phrases or pieces of evidence, not a vibe-check.

### How big should the golden set be to start?

30 cases for a small product. 200 if you're already in production. 1,400 (our size) only after a year of operating. Grow it incrementally; adding 5 cases a week is sustainable.

### Should evals block deploys, not just PRs?

Yes for safety/adversarial scenarios. We have an additional eval that runs against the deploy artifact (not just the PR diff) before promotion to production. Costs us another ~3 minutes; worth it.

### Can Promptfoo grade against a previous-run baseline?

Yes. We run `promptfoo eval --compare baseline.json --fail-on-regression` and use it heavily for the nightly sweep against the previous night's results. Confirm both flags against `promptfoo eval --help` for your installed version before relying on them.

### What's the most valuable single test case in our set?

The "memorized C1 vocabulary, capped at band 6" IELTS lexical test. It's caught four separate regressions where a well-meaning prompt change re-introduced the pre-2024 vocabulary-range bias. Worth more than a dozen general-purpose cases.

Need a Prompt-Eval Pipeline for Your AI Product?

We build Promptfoo-based eval pipelines for production AI products — golden set construction, scenario partitioning, CI integration, regression alerting. Typical engagement: 4-6 weeks from scoping to a working PR-blocking pipeline. The first call is with the engineer who'd lead your project.


Tags: Promptfoo, LLM Testing, CI/CD, TalkDrill, Prompt Engineering, DevOps, Evals


Hrishikesh Baidya

CTO at Softechinfra specializing in Python, system architecture, and building secure, scalable software solutions.
