A Tamil-speaking software engineer in Chennai opens [TalkDrill](https://talkdrill.com) at 11 PM and reads "this is a thin sheet of paper." The AI doesn't just transcribe it. It flags that her /θ/ in "thin" was produced as /t̪/ (a known Tamil-substitution pattern), her /ʃ/ in "sheet" came out cleanly, and her vowel /ɪ/ in "this" was pulled toward /iː/. Then it tells her exactly which mouth shape to try next. Building that pipeline at sub-second latency, for the 14 mother-tongue backgrounds we see across Indian English, is what this post is about.
- 14 Indian L1 backgrounds the engine distinguishes
- 44 English phonemes scored per utterance
- 0.81 Pearson correlation vs. phoneticians
- 280ms median scoring latency (after audio end)
## TL;DR
TalkDrill's pronunciation scorer runs Montreal Forced Aligner against a target transcript, then computes Goodness-of-Pronunciation (GOP) scores per phoneme from a wav2vec 2.0 acoustic model fine-tuned on the IIT-Bombay TIMIT-Indic corpus and our own 600-hour Indian-English dataset. A prompt-tuned Claude layer interprets the phoneme scores against the speaker's declared L1 (Hindi, Tamil, Bengali, etc.) and produces actionable feedback. Median latency: 280ms after audio end. Pearson correlation with three professional phoneticians: 0.81.
## Why this matters now
Generic pronunciation tools penalize Indian-English speakers for sounds they speak correctly. The retroflex /ɖ/ in "doctor" is correct Indian English: it's part of the accepted [GA-IE phoneme inventory documented by Cambridge](https://assets.cambridgeenglish.org/webinars/ielts-speaking-band-descriptors.pdf). Most off-the-shelf scorers, trained on American English, mark it wrong. For 5,000+ adult Indian users who paid us to improve their English, that's a non-starter. We had to build a stack that knows the difference between "you have an Indian accent" (fine) and "your /θ/ came out as /t̪/ in a word where it matters" (worth flagging).
## The stack, end to end
- Stage 1: Forced Alignment (MFA). Montreal Forced Aligner aligns each phoneme in the target transcript to the audio waveform. Output: time-coded phoneme boundaries.
- Stage 2: GOP Scoring. Wav2vec 2.0 fine-tuned on Indian English produces phoneme posteriors. Goodness-of-Pronunciation compares the log-posterior of the canonical phoneme with that of the best competing phoneme.
- Stage 3: L1-Aware Interpretation. Rule layer + Claude prompt converts raw GOP scores into actionable feedback such as "your Tamil L1 caused /θ/ → /t̪/ substitution in word X".
- Stage 4: Drill Generator. Auto-generates 3 follow-up drill words targeting the specific substitution. Closes the loop within the same session.
## How GOP scoring works (in 80 words)
For each frame of audio, the wav2vec 2.0 model emits a probability distribution over the 44 English phonemes. The GOP for an aligned phoneme p over T frames is:
$$\mathrm{GOP}(p) = \frac{1}{T} \sum_{t=1}^{T} \left[ \log P(p \mid x_t) - \max_{q \neq p} \log P(q \mid x_t) \right]$$
where q ranges over all phonemes other than p, so the score goes positive when p is the best fit. High GOP means "the model is confident this is p"; low GOP means "another phoneme fits better." The full math is in [Mathad et al.'s Interspeech 2021 paper](https://www.isca-archive.org/interspeech_2021/mathad21_interspeech.pdf), which we used as our starting point.
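To make the formula concrete, here is a minimal NumPy sketch. It assumes you already have per-frame log-posteriors for the frames aligned to one phoneme; the function name `gop_score` and the toy numbers are illustrative, not our production code. One assumption to flag: the competitor set excludes the target phoneme itself, which is what lets the score go positive as in the prompt example later in this post.

```python
import numpy as np

def gop_score(log_post: np.ndarray, target: int) -> float:
    """Mean over frames of log P(target | x_t) minus the best competing log-posterior.

    log_post: shape (T, num_phonemes), log-posteriors for the frames aligned to one phoneme.
    target:   column index of the canonical phoneme p.
    """
    if log_post.ndim != 2 or log_post.shape[0] == 0 or log_post.shape[1] < 2:
        raise ValueError("expected a non-empty (T, num_phonemes) array with >= 2 phonemes")
    target_lp = log_post[:, target]
    # Assumption: drop the target column so the max runs over competitors only (q != p).
    competitors = np.delete(log_post, target, axis=1)
    best_other = competitors.max(axis=1)
    return float(np.mean(target_lp - best_other))

# Toy posteriors for 3 frames over 3 phonemes (hypothetical numbers).
post = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.8, 0.1, 0.1]])
log_post = np.log(post)
print(gop_score(log_post, target=0))  # positive: phoneme 0 beats every competitor
print(gop_score(log_post, target=1))  # negative: another phoneme fits better
```

In a real pipeline the rows of `log_post` would be sliced from the acoustic model's output using the start and end times from forced alignment.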
## The Indian-English problem (and the data fix)
Generic Whisper or Google Speech-to-Text models trained on American English have two failure modes on Indian-English audio.
- They reject correct retroflexes. The Indian-English /ʈ/ in "letter" gets scored as a /t/ substitution and marked wrong. It isn't wrong: it's a standard feature of [General Indian English](https://www.academia.edu/103641028/Pronunciation_scoring_for_Indian_English_learners_using_a_phone_recognition_system).
- They miss the L1 patterns that matter. A Bengali speaker's tendency to swap /v/ and /b/ ("very" → "bery") is exactly the kind of error worth flagging. Generic ASR often substitutes the word silently and gives full pronunciation marks.
Our fix had two components. First, we expanded the phoneme inventory to include Hindi/IPA additions (/ʈ/, /ɖ/, /ɳ/, /ɭ/) — based on [IIT Bombay's acoustic model work](https://www.ee.iitb.ac.in/student/~daplab/publications/2013/sj_cocosda13_Camera_V4_22Sept2013.pdf). Second, we fine-tuned wav2vec 2.0 on 600 hours of Indian-English speech sampled across 14 L1 backgrounds — collected from TalkDrill users who consented, and topped up with the publicly available TIMIT-Indic and IIT-Madras IE-LSA corpora.
## The L1 substitution map
This is the table the rule layer uses. Each L1 has 8–15 known substitution patterns documented in linguistic literature on Indian English phonology. Here is a sample covering four L1s:

| L1 | Target phoneme | Common substitution | Example word |
| --- | --- | --- | --- |
| Tamil | /θ/ | /t̪/ | thin → tin |
| Tamil | /ʒ/ | /dʒ/ | measure → mejure |
| Bengali | /v/ | /b/ | very → bery |
| Bengali | /s/ | /ʃ/ | simple → shimple |
| Hindi | /w/ | /v/ | wine → vine |
| Hindi | /æ/ | /e/ | bat → bet |
| Telugu | /f/ | /pʰ/ | phone → p-hone |

When the GOP score for /θ/ in "thin" comes back low and the closest competitor phoneme is /t̪/, the rule fires "Tamil L1 substitution detected on /θ/". Confidence rises if the user has self-declared Tamil as L1 in onboarding.
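The rule described above is simple enough to sketch in plain Python. This is an illustration under stated assumptions, not our production rule layer: the dictionary holds only the sample rows from the table, and `GOP_THRESHOLD`, `PhonemeResult`, `SubstitutionFlag` and `detect_substitution` are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Optional

# Sample rows from the substitution table: (L1, target phoneme) -> common substitute.
L1_SUBSTITUTIONS = {
    ("Tamil", "θ"): "t̪",
    ("Tamil", "ʒ"): "dʒ",
    ("Bengali", "v"): "b",
    ("Bengali", "s"): "ʃ",
    ("Hindi", "w"): "v",
    ("Hindi", "æ"): "e",
    ("Telugu", "f"): "pʰ",
}

GOP_THRESHOLD = -1.0  # hypothetical cut-off; in practice this is calibrated per phoneme

@dataclass
class PhonemeResult:
    target: str      # canonical phoneme from the transcript
    word: str        # word the phoneme belongs to
    gop: float       # GOP score for the aligned segment
    competitor: str  # phoneme with the highest competing posterior

@dataclass
class SubstitutionFlag:
    l1: str
    target: str
    substitute: str
    word: str
    l1_declared: bool  # True when the pattern matches the user's self-declared L1

def detect_substitution(result: PhonemeResult,
                        declared_l1: Optional[str] = None) -> Optional[SubstitutionFlag]:
    """Fire a flag when GOP is low and the competitor matches a known L1 pattern."""
    if result.gop >= GOP_THRESHOLD:
        return None
    candidates = [
        l1 for (l1, target), substitute in L1_SUBSTITUTIONS.items()
        if target == result.target and substitute == result.competitor
    ]
    if not candidates:
        return None
    declared = declared_l1 in candidates
    l1 = declared_l1 if declared else candidates[0]
    return SubstitutionFlag(l1, result.target, result.competitor, result.word, declared)

tamil_sub = L1_SUBSTITUTIONS[("Tamil", "θ")]
print(detect_substitution(PhonemeResult("θ", "thin", -3.1, tamil_sub), declared_l1="Tamil"))
```

The `l1_declared` field is one way to carry the "confidence rises" signal downstream; a numeric confidence would work just as well.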
## The prompt-tuning trick that beat GPT-4o speech evals
We initially tried giving GPT-4o the raw audio and asking it to score pronunciation. Quality was mediocre: Pearson 0.62 against phoneticians. The fix was decoupling. We feed Claude only the structured output (GOP scores per phoneme, flagged substitutions, target transcript, declared L1), not the audio. Claude is asked to produce explanation text, not the score. Pearson jumped to 0.81, and inference cost dropped 80% because we're not paying for audio tokens.

    You are a pronunciation coach for Indian-English learners.
    User L1: Tamil
    Target sentence: "this is a thin sheet of paper"
    Phoneme-level GOP scores (lower is worse):
    /ð/ in 'this': -2.4 (closest competitor /d̪/, p=0.31)
    /ɪ/ in 'this': +0.6 OK
    /θ/ in 'thin': -3.1 (closest competitor /t̪/, p=0.62)
    /ʃ/ in 'sheet': +1.2 OK
    /iː/ in 'sheet': +0.8 OK
    [...]
    Produce exactly:
    1. One sentence praising what worked
    2. One sentence naming ONE substitution to fix
    3. A 3-word minimal-pair drill targeting that phoneme
    Constraints:
    - Reference the L1 substitution explicitly
    - No phonetic IPA in the user-facing text
    - Praise must be specific, not generic

The "produce exactly N sentences" pattern is doing a lot of work — it forces specificity and prevents the model from sliding into generic motivational speech.
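If you want to assemble a prompt like this from structured scorer output, a small formatter is enough. The sketch below only builds the string; the call to the model is left out on purpose. `Row`, `format_row` and `build_prompt` are hypothetical names, and the row layout is an assumption.

```python
from typing import List, Optional, Tuple

# (phoneme, word, GOP score, closest competitor or None, competitor probability or None)
Row = Tuple[str, str, float, Optional[str], Optional[float]]

INSTRUCTIONS = """Produce exactly:
1. One sentence praising what worked
2. One sentence naming ONE substitution to fix
3. A 3-word minimal-pair drill targeting that phoneme
Constraints:
- Reference the L1 substitution explicitly
- No phonetic IPA in the user-facing text
- Praise must be specific, not generic"""

def format_row(row: Row) -> str:
    """Render one phoneme line in the same shape as the prompt above."""
    phoneme, word, gop, competitor, prob = row
    if competitor is None or prob is None:
        return f"/{phoneme}/ in '{word}': {gop:+.1f} OK"
    return f"/{phoneme}/ in '{word}': {gop:+.1f} (closest competitor /{competitor}/, p={prob:.2f})"

def build_prompt(l1: str, sentence: str, rows: List[Row]) -> str:
    """Join the fixed header, the per-phoneme lines and the output constraints."""
    header = [
        "You are a pronunciation coach for Indian-English learners.",
        f"User L1: {l1}",
        f'Target sentence: "{sentence}"',
        "Phoneme-level GOP scores (lower is worse):",
    ]
    return "\n".join(header + [format_row(r) for r in rows] + [INSTRUCTIONS])

rows: List[Row] = [
    ("ɪ", "this", 0.6, None, None),
    ("θ", "thin", -3.1, "t̪", 0.62),
    ("ʃ", "sheet", 1.2, None, None),
]
print(build_prompt("Tamil", "this is a thin sheet of paper", rows))
```

Keeping the instructions in one constant makes it harder for the "produce exactly" contract to drift between code paths.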
## DIY: a minimal pronunciation scorer
For developers wanting to build a smaller version, the path below works on a single GPU box.
1. Install Montreal Forced Aligner. Run `conda install -c conda-forge montreal-forced-aligner=2.2.17`, then grab the English (US) acoustic and lexicon models. For an Indian-English MVP, the US lexicon plus a small Hindi-English supplemental dictionary works.
2. Use wav2vec2-large-960h for GOP. The Hugging Face checkpoint `facebook/wav2vec2-large-960h` ships with a character-level CTC head, so frame-level phoneme posteriors come from a small linear head that you train on top of its hidden states. Compute GOP from those posteriors using the formula above.
3. Calibrate a per-phoneme threshold on 50 audio clips. Record 50 sentences with known-good and known-bad pronunciation. Find the GOP threshold per phoneme that maximizes F1 against your judgement. /θ/ and /ð/ thresholds will be looser than /s/ and /m/.
4. Map errors to L1 patterns from public literature. For Hindi L1 alone, the [IIT Bombay corpus paper](https://www.ee.iitb.ac.in/student/~daplab/publications/2013/sj_cocosda13_Camera_V4_22Sept2013.pdf) documents the 10 most common substitutions. Hard-code the rules; you don't need ML for this part.
5. Generate feedback with a constrained Claude prompt. Use the structure shown above. The constraint to never include IPA in user-facing text matters: IPA scares people.
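The calibration in step 3 is the part people tend to skip, so here is a minimal sketch of it in plain Python. It assumes that for one phoneme you have a list of GOP scores and a matching list of your own good/bad labels; `f1_at_threshold`, `best_threshold` and the toy numbers are hypothetical.

```python
from typing import List, Tuple

def f1_at_threshold(scores: List[float], is_bad: List[bool], threshold: float) -> float:
    """F1 for the 'mispronounced' class when GOP below the threshold is predicted as bad."""
    tp = fp = fn = 0
    for score, bad in zip(scores, is_bad):
        predicted_bad = score < threshold
        if predicted_bad and bad:
            tp += 1
        elif predicted_bad and not bad:
            fp += 1
        elif not predicted_bad and bad:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores: List[float], is_bad: List[bool]) -> Tuple[float, float]:
    """Try midpoints between neighbouring scores and keep the one with the highest F1."""
    unique = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    if not candidates:
        raise ValueError("need at least two distinct scores to calibrate")
    best = max(candidates, key=lambda t: f1_at_threshold(scores, is_bad, t))
    return best, f1_at_threshold(scores, is_bad, best)

# Toy calibration set for one phoneme (hypothetical scores and labels).
scores = [-3.1, -2.4, -1.9, -0.2, 0.4, 0.8, 1.2]
is_bad = [True, True, True, False, False, False, False]
print(best_threshold(scores, is_bad))
```

Run this once per phoneme. With only 50 clips some phonemes will have very few examples, so treat thresholds for rare phonemes as provisional.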
## Pre-flight checklist for shipping a pronunciation scorer
- Phoneme inventory documented for the target accent (e.g., GA-IE for Indian English)
- Acoustic model fine-tuned on >100 hours of in-target-accent speech
- Forced aligner with deterministic boundaries (we use MFA)
- L1 substitution map per declared mother tongue, sourced from linguistic literature
- Fallback for forced-alignment failure (do not score garbage)
- User opt-in for any audio used in training, with audit trail
- Per-phoneme F1 calibrated on 50+ recordings against a phonetician
- Feedback layer that never shows IPA in user-facing text
## When NOT to use this architecture
- Conversation-tutor use cases. If your product is "speak with an AI partner in free-form conversation," you don't need phoneme-level scoring. Word-error-rate from Whisper plus fluency metrics (pause length, words-per-minute) is sufficient.
- Tonal languages. Indian English is not tonal, so the GOP-based approach maps cleanly. For Mandarin or Vietnamese learners, tone-level scoring needs a separate pitch contour model, which is a different stack.
- Sub-100-word vocabulary apps. If the entire app is teaching 80 words, MFA + GOP is overkill. Use a simpler CTC-based model and pre-recorded reference audio with cosine similarity.
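For the conversation-tutor case, the fluency metrics mentioned above need nothing more than word timestamps. A minimal sketch, assuming your ASR returns `(word, start_seconds, end_seconds)` tuples; `fluency_metrics` and the 0.25-second pause cut-off are hypothetical choices.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, float, float]  # (word, start time in seconds, end time in seconds)

def fluency_metrics(words: List[Word], min_pause: float = 0.25) -> Dict[str, float]:
    """Words per minute plus count and mean length of inter-word pauses."""
    if not words:
        raise ValueError("no words to measure")
    duration = words[-1][2] - words[0][1]
    # Gap between the end of each word and the start of the next one.
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= min_pause]
    return {
        "words_per_minute": len(words) / duration * 60 if duration > 0 else 0.0,
        "pause_count": float(len(pauses)),
        "mean_pause_s": sum(pauses) / len(pauses) if pauses else 0.0,
    }

# Toy timestamps (hypothetical).
words = [("this", 0.0, 0.3), ("is", 0.35, 0.5), ("a", 1.0, 1.1), ("test", 1.2, 2.0)]
print(fluency_metrics(words))
```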
Production gotcha: forced-alignment failures. About 1.5% of user recordings cause MFA to fail outright — usually because the user said something different from the target transcript. We trap the alignment failure, run a fallback ASR pass to identify the mismatch, and surface "We didn't catch the exact sentence — try again" rather than a low score. Scoring a mismatch as bad pronunciation is the single biggest source of unfair feedback.
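The guard can be sketched with the standard library alone. This assumes the aligner reports success or failure as a boolean and that a fallback ASR pass has already produced a text hypothesis; `decide`, `transcript_match`, the 0.8 cut-off and the second retry message are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple

RETRY_MESSAGE = "We didn't catch the exact sentence. Please try again."

def normalize(text: str) -> List[str]:
    """Lowercase and keep only word tokens so punctuation doesn't affect the match."""
    return re.findall(r"[a-z']+", text.lower())

def transcript_match(target: str, asr_hypothesis: str) -> float:
    """Word-level similarity in [0, 1] between the target and what ASR heard."""
    return SequenceMatcher(None, normalize(target), normalize(asr_hypothesis)).ratio()

def decide(alignment_ok: bool, target: str, asr_hypothesis: str,
           min_match: float = 0.8) -> Tuple[str, str]:
    """Return (action, user_message). Never turn an alignment failure into a low score."""
    if alignment_ok:
        return "score", ""
    if transcript_match(target, asr_hypothesis) < min_match:
        return "retry", RETRY_MESSAGE
    # Alignment failed although the words match: assume an audio problem (hypothetical branch).
    return "retry", "We couldn't hear that clearly. Please try again."

target = "this is a thin sheet of paper"
print(decide(False, target, "I want a coffee"))
print(decide(True, target, target))
```

The important property is that the scoring path is only reachable when alignment succeeded.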
## The training-data ethics question
We collect user audio only with explicit opt-in, and only from users who chose the "help improve TalkDrill" toggle during onboarding. Audio used for training has user identifiers stripped and is reviewed by an internal team before joining the corpus. We do not buy datasets of unconsented speech. We documented the corpus build process internally and would share it with any client that asked. For an Indian-context speech product, this matters: under the [DPDP Act 2023](https://www.meity.gov.in/), voice recordings that can identify a person are personal data, and processing them generally requires the user's consent.
## Real example: a Telugu speaker over 6 weeks
One TalkDrill user, a 28-year-old software engineer in Hyderabad, recorded 142 sessions over 6 weeks targeting the /f/ → /pʰ/ pattern flagged in her first week. The GOP score for /f/ in tracked target words moved from -2.8 to +0.4 over that period. The blind phonetician panel scored her recordings before and after; their mean rating moved from 3.1/5 to 4.2/5. Single anecdote, real numbers, included to show what the engine looks like in someone's hands rather than in aggregate.
## Why we built this
[TalkDrill](https://talkdrill.com) is our in-house English-speaking app, and pronunciation scoring is the feature our paying users explicitly asked for. We tried two third-party APIs first — both were trained primarily on American or British English and gave Indian speakers unfair scores. Building the stack ourselves was 14 weeks of engineering, including the data collection ethics review. We'd do it again. The same architecture — forced alignment, GOP from a domain-tuned acoustic model, L1-aware interpretation layer — ports to any non-native English market or any non-English language with a documented phoneme inventory.
For the technical conversation, the [Reddit r/speechrecognition](https://www.reddit.com/r/speechrecognition/) community and the [Montreal Forced Aligner Discord](https://montreal-forced-aligner.readthedocs.io/) are the two best places to debug a stack like this.
## FAQ
### Why MFA and not a pure end-to-end model?
End-to-end models give you ASR but not the per-phoneme time codes we need for feedback. MFA's Kaldi-based alignment is still the most accurate open-source aligner in 2026, and it's deterministic: the same input always produces the same boundaries. End-to-end ASR decoding can be non-deterministic (for example, when it falls back to sampling at a non-zero temperature), which is a problem when a user reruns the same recording and gets a different score.
### How do you handle code-switched speech (Hindi-English mixing)?
We detect code-switching at the word level. If a span is Hindi rather than English, we exclude it from scoring and surface a note in the UI ("Skipped Hindi span"). Forcing English scoring on a Hindi word produces nonsense.
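A minimal sketch of the exclusion step, assuming an upstream detector has already tagged each word with a language code. The detector itself is out of scope here, and `split_spans`, `filter_scorable` and the tag values are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

LANG_NAMES = {"hi": "Hindi", "en": "English"}

def split_spans(tagged: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group consecutive (word, lang) pairs into (lang, [words]) spans."""
    return [
        (lang, [word for word, _ in group])
        for lang, group in groupby(tagged, key=lambda pair: pair[1])
    ]

def filter_scorable(tagged: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Return the English words to score and one UI note per skipped span."""
    scorable: List[str] = []
    notes: List[str] = []
    for lang, words in split_spans(tagged):
        if lang == "en":
            scorable.extend(words)
        else:
            notes.append(f"Skipped {LANG_NAMES.get(lang, lang)} span: {' '.join(words)}")
    return scorable, notes

tagged = [("I", "en"), ("need", "en"), ("thoda", "hi"), ("sa", "hi"), ("time", "en")]
print(filter_scorable(tagged))
```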
### What hardware does this run on in production?
Two L4 GPU pods on Lambda Labs handle pronunciation requests for our entire user base. P50 GPU utilization is 38%, so we have substantial headroom. MFA runs CPU-only on the same boxes.
### Can the engine grade IELTS speaking band 7+?
It can predict band scores for the Pronunciation criterion only — we run a regression from our internal scores onto teacher-rated IELTS Pronunciation bands. Other IELTS criteria (Fluency, Lexical, Grammar) use separate scoring; we cover those in a different post.
### How accurate is the L1 detection if the user doesn't declare it?
We don't try to detect L1 from audio in production — too unreliable, and risks ethnic profiling. We ask users to self-declare L1 during onboarding. If they skip it, we fall back to a generic Indian-English profile that's less precise on substitutions.
### Why not just use Whisper for everything?
Whisper transcribes; it doesn't score phonemes. You can't compute pronunciation accuracy from Whisper's output — the text doesn't carry the acoustic information needed. We use Whisper elsewhere in TalkDrill for general STT, but not for pronunciation scoring.