How TalkDrill Hits 800ms Voice AI Round-Trip Latency on 4G
A field-tested breakdown of TalkDrill's Whisper + Claude + streaming-TTS pipeline — where every millisecond of the 800ms budget gets spent, and what we'd redesign today.
Hrishikesh Baidya
March 29, 202614 min read
0%
An Indian college student opens [TalkDrill](https://talkdrill.com) on a 4G connection in Lucknow, taps the AI tutor, and starts speaking. 820 milliseconds later, she hears the bot reply — and that gap is the difference between a real conversation and a walkie-talkie. This post is the engineering tour of how we got there: every stage of our Whisper + Claude + streaming-TTS pipeline, where the budget gets spent, and the three things we'd redesign if we started the codebase today.
820ms
Median Voice-to-Voice Round-Trip (4G, p50)
1.4s
p95 Latency (4G, Tier-2 Indian Cities)
5,000+
Active Users Running This Stack Daily
240KB
Opus-Encoded Audio Per 20s Reply
## TL;DR: the answer in 60 words
TalkDrill's voice pipeline runs Opus audio over WebRTC into a US-East edge node, streams it into faster-whisper (medium model) with VAD chunking, pipes partial transcripts to Claude Sonnet 4.5 with prefix caching, and streams the reply through ElevenLabs Turbo v2 back over WebRTC. p50 round-trip on a 4G connection in India is 820ms. The biggest budget sinks are network up (160ms) and STT first chunk (140ms).
## Why this matters now
Voice-AI conversation feels broken above ~1 second. Recent research from [Hamming AI](https://hamming.ai/resources/voice-ai-latency-whats-fast-whats-slow-how-to-fix-it) confirms 800ms is the threshold where users stop treating the bot as a peer. For an Indian English-fluency app — where the user is already nervous about speaking — that threshold is even tighter. Below 800ms, students keep talking. Above 1.2s, retention drops.
## The 800ms latency budget, stage by stage
The chart below is what we measured on a Reliance Jio 4G connection in Pune, May 2026 — 1,200 round-trips, median values shown. Total p50: 820ms.
The structure follows the [LiveKit sequential-pipeline pattern](https://livekit.com/blog/sequential-pipeline-architecture-voice-agents): each stage starts the next stage's input as soon as its own first byte is ready. We never wait for a stage to finish before piping the partial output forward.
## The actual stack (what's wired together)
📱
Client: React Native + LiveKit SDK
Captures 48kHz mic audio, downsamples to 16kHz mono, Opus-encodes at 24 kbps, ships over a WebRTC data channel.
🎙️
STT: faster-whisper medium
Runs on an L4 GPU pod with Silero VAD. 380ms VAD utterance, then chunked decode with 5-second context window.
🧠
LLM: Claude Sonnet 4.5 streaming
5-minute prefix cache on the lesson-context prompt block. First token typically 180ms after final transcript.
🔊
TTS: ElevenLabs Turbo v2
Streams MP3 frames as text arrives. First audio chunk emitted after ~90ms of accumulated text.
The orchestrator is a Python service on FastAPI. Each conversation gets a dedicated asyncio task per stage, connected by asyncio.Queue instances. No threads — we want backpressure to bubble.
## DIY walkthrough: build the 800ms loop yourself
The version below is the trimmed-down reference we open-source friends with. It's not the full TalkDrill stack — it's the minimum runnable version that hits sub-second p50 on a developer laptop with a wired connection.
1
Stand up faster-whisper with VAD
Install faster-whisper==1.0.3 and silero-vad. Pin the model to medium for English (not large-v3 — the quality gain isn't worth the 60ms cost on this task). Run on a GPU pod with at least 8GB VRAM; an L4 at ₹26/hour on Lambda is sufficient.
2
Wire WebRTC with LiveKit Cloud
Use the LiveKit React Native SDK on the client. The server agent subscribes to the audio track and forwards raw PCM frames to your STT queue. WebRTC handles jitter and packet loss — do not write your own.
3
Wire prefix caching on Claude
Put the lesson context (rubric, scenario prompt, last 6 turns) inside cache_control: {"type": "ephemeral"}. Hit-rate of 92% in our production logs; saves ~110ms TTFT and ~70% of input token cost on follow-up turns.
4
Pipe text-to-TTS chunks as they arrive
Do not wait for Claude to finish. Emit each sentence-boundary chunk to ElevenLabs Turbo v2's streaming API. ElevenLabs starts streaming MP3 within ~90ms of the first text byte.
5
Measure end-to-end with a synthetic probe
Record a fixed WAV file on the client side, mark the timestamp at the end of audio, and watch for the first audio packet from the server. The diff is your p50. We run this every 60 seconds from probes in Mumbai, Bangalore, and Delhi.
## Where the network kills you (and what to do about it)
The single biggest contributor to p95 blow-up is the upstream network leg. On a stressed cell tower at 7 PM in Delhi, the upstream RTT for the first packet can balloon to 380ms. Three things help.
Use UDP, not TCP. WebRTC's data channel is UDP-based — packet loss is recoverable inside 20ms with NACK. A TCP-based audio pipe (which several "fast" voice frameworks still use under the hood) stalls for a full retransmission window every time a packet drops, which it will on Indian 4G.
Pick the geographically closest edge node. Our orchestrator runs in three regions: ap-south-1 (Mumbai), us-east-1, and eu-west-1. We route based on the client's STUN-discovered IP. The Mumbai pod saves 60–90ms of network up for South Asian users versus US-East.
Push VAD onto the device. TalkDrill's client runs Silero VAD locally on the handset and only opens the upstream when speech is detected. This cuts the median number of bytes shipped per session by ~40% and prevents the cell tower from throttling our flow.
Production gotcha: jitter buffer creep. If your client jitter buffer grows past 80ms during a session, audio playback drifts behind real-time and the user perceives "the bot got slow." We force-flush the buffer at 80ms and accept the audible discontinuity — users complain less about a brief glitch than about cumulative drift.
## When NOT to use this architecture
Three cases where the 800ms pipeline is overkill or wrong.
Asynchronous voice messages. If the product is "leave a voice note, get a written reply within a minute," you do not need WebRTC. Use HTTP file upload, batch-process the audio, and stream the reply over server-sent events. The infrastructure cost is 30% of the real-time stack.
On-device-only voice. If your privacy or offline story requires no cloud, this pipeline doesn't fit. Local Whisper + a 4B-parameter on-device LLM (Phi-3 mini or Gemma 2B) on a recent iPhone will deliver ~1.2s p50, but with full offline capability. We considered this for an internal use case and rejected it — too many older Android handsets in our Indian user base couldn't sustain it.
Single-turn voice command. If the user is just saying "set timer for 5 minutes," you don't need a streaming LLM at all. Run Whisper, regex-match the intent, respond. The 800ms infrastructure is built for free-form conversation, not commands.
## Real production numbers from TalkDrill (May 2026)
A real cut from one week of production traffic, with the carrier-by-carrier split. Jio leads because its packet-core peering with our Mumbai pod is the shortest. BSNL is the long tail — and we lose engagement on BSNL users at a measurable rate.
Carrier (India)
p50 ms
p95 ms
Sessions / wk
D7 retention
Reliance Jio 4G
780
1,290
14,400
38%
Airtel 4G
860
1,520
11,800
34%
Vi 4G
910
1,640
5,200
31%
BSNL 4G
1,120
2,180
1,400
22%
5G (mixed)
640
990
8,700
42%
The retention correlation isn't subtle. Every 200ms of additional p50 latency costs us roughly 4–5 percentage points of D7 retention. That's the business case for the engineering work, in one sentence.
## Quick checklist before you ship a voice pipeline
Use UDP-based WebRTC, not TCP-based audio transport
Run device-side VAD to gate the upstream
Pin a single STT model version; do not auto-upgrade
Enable Anthropic prefix caching with structured prompt blocks
Stream TTS chunks the moment first text arrives — never wait
Pick the geographically closest LLM region to your users
Force-flush jitter buffer above 80ms
Probe end-to-end latency every 60s from real-user regions
## What we'd do differently today
If we were re-architecting from zero in May 2026, three changes.
Switch STT to Voxtral Transcribe 2. Mistral's [Voxtral release](https://weesperneonflow.ai/en/blog/2026-03-31-voxtral-whisper-open-source-speech-models-comparison-2026/) uses a causal audio encoder that processes left-to-right and emits partial transcripts with ~80ms of trailing latency. Whisper's encoder-decoder design requires a buffered window. We'd save 60ms of STT time and ~30% of GPU cost.
Adopt Claude's extended_thinking only on high-error turns. Currently we run vanilla Sonnet 4.5 on every turn. For ~8% of inputs (low-confidence transcripts), we'd swap to thinking mode at the cost of ~400ms. The error reduction is worth the latency hit only on those turns.
Move TTS to streaming-aware diffusion. OpenAI's [low-latency voice work](https://openai.com/index/delivering-low-latency-voice-ai-at-scale/) and the new wave of streaming diffusion TTS (Style-Bert-VITS3, F5-TTS) push first-chunk latency below 50ms. ElevenLabs is fine — we'd just save 30–40ms on TTS and reclaim some headroom for the network leg.
The honest tradeoff: All three changes shave another ~150ms of p50 latency at a 2x increase in self-managed-infra complexity. ElevenLabs and Anthropic eat the ops cost for us today; running Voxtral and a diffusion-TTS pipeline in-house means three more services to keep alive at 99.9%. We have not made the switch yet, and that's a deliberate call.
## Why this matters for the work we do
This stack lives inside [TalkDrill](https://talkdrill.com), the in-house English-fluency app we built and operate at Softechinfra. It serves 5,000+ active users daily, and the cost-per-conversation is low enough that we can offer a free tier without bleeding. The same architecture — WebRTC ingress, GPU-served STT, streaming LLM, streaming TTS, edge-routed orchestrator — is what we use when clients ask us to build voice products for healthcare triage, customer service, or interview prep. The lessons port; the latency budget is the lessons.
For the live conversation around this stack, [Hacker News thread #43899028](https://news.ycombinator.com/item?id=43899028) is the best single read on production voice-AI latency we've found this year — multiple practitioners sharing real numbers, not vendor pitches.
## FAQ
### How do you measure voice-to-voice latency in production?
We use a synthetic probe: a fixed WAV is recorded on the client at a marked timestamp, sent through the live pipeline, and the time-to-first-audio-byte on the return path is measured. Probes run from Mumbai, Bangalore, and Delhi every 60 seconds. Real-user measurements come from RUM telemetry on the LiveKit SDK.
### Why not use OpenAI Realtime API instead of your own pipeline?
We evaluated it. For free-form English tutoring, Claude Sonnet 4.5 with our custom rubric prompts outperforms Realtime API on pedagogical correctness in our internal evals. The latency difference is roughly a wash. We'd revisit if Anthropic shipped a native realtime endpoint.
### Does prefix caching really cut 100ms off TTFT?
In our production logs, the median time-to-first-token on cached prompts is 180ms versus 290ms on cold prompts. Across 5,000 daily active users, cache hit-rate is 92%. The effect is real but only matters because we structured the prompt to put stable context first.
### What does this cost per user per month?
At a typical session length of 14 minutes and 18 turns, our infrastructure cost per active user per month is roughly ₹42 — split across GPU STT (35%), Claude API (40%), ElevenLabs (18%), and WebRTC/network (7%). We charge ₹299/mo on Pro, so unit economics are healthy.
### Is faster-whisper better than OpenAI's hosted Whisper API for this?
For our use case, yes. Self-hosted faster-whisper on an L4 GPU gives us deterministic first-chunk latency under 150ms and avoids a third-party round-trip. The OpenAI hosted API works fine for batch transcription but adds 200–300ms of API overhead we can't afford in a real-time loop.
### What's the cheapest viable version of this stack?
For a hobby or MVP build, you can run faster-whisper on a single L4 (₹26/hour spot), use Groq's Llama 3.3 70B for the LLM layer (~₹0.50 per 1k tokens), and use OpenAI TTS for output (~₹100/hour of audio). End-to-end you can be live with one engineer in 10 days. Production hardening — observability, multi-region, retries, fallback — is what eats the next 60 days.
Want a Voice-AI Mobile App Built on This Stack?
We ship production voice-AI products on the same architecture that runs TalkDrill — WebRTC ingress, streaming STT, streaming LLM, streaming TTS, edge-routed orchestrator. Typical project: 10–14 weeks from kickoff to public beta. First call is technical, with the engineer who'll lead your build.