Tackling Turn Detection in Voice AI: Overcoming Noise and Interruption Challenges
Voice is the hardest channel to make feel "obviously natural". Not because speech-to-text (STT) or text-to-speech (TTS) is the bottleneck - but because conversational timing is brutal. Humans interrupt, hesitate, trail off, speak over background noise, and start talking before the other side is done. If your AI agent cannot handle that turn-taking dance, everything else (LLM quality, tool calls, workflows) barely matters.
This article walks through a practical, engineer-first approach to turn detection for voice AI, focused on one goal: detect when a human is speaking with high accuracy so the AI agent can behave as human-like as possible. We will cover the core detection signals, the real-time decision logic, and the downstream behaviors that make interruptions feel clean rather than chaotic, along with practical solutions you can implement in your own systems.
What "turn detection" actually means in production
In a voice autonomous agent, turn detection is the system that decides:
- when the user has started speaking (barge-in)
- when the user has stopped speaking (end-of-turn)
Those two events drive everything else:
- whether we interrupt the TTS agent while speaking
- when we finalize an audio segment for transcription
- when we commit the agent's "speaking turn" state and transition to "listening"
- how we avoid double-speaking and repeated information
A surprising amount of "voice UX quality" is basically how correct and how fast these two event signals are.
Step zero: Voice isolation (cleaning the audio before detection)
Before we even start detecting speech, we run the incoming audio through a voice isolation layer. This is a neural-network-based processor that:
- Removes background noise (traffic, air conditioning, machinery, people talking in the background)
- Isolates the human voice from ambient sounds
- Suppresses non-speech sounds like laughter, coughs, and environmental artifacts
- Reduces false VAD triggers significantly
- Improves downstream STT transcription quality
The key insight is that voice isolation happens before VAD processing. By the time our turn detection logic sees the audio, it is already cleaned and voice-focused. This dramatically reduces the noise floor and makes our threshold-based detection much more reliable.
In noisy environments like drive-thrus or call centers, this preprocessing step is the difference between a jittery, constantly-interrupting agent and one that stays calm and only responds when the user actually speaks.
The two core signals we use
We base turn detection on two streaming signals computed over short audio frames:
- VAD probability: a voice activity detector outputs a probability that the frame contains human speech (or, more specifically, speech-like activity)
- Speech volume: a measure of audio energy (RMS-based) that helps differentiate speech from low-energy artifacts and also helps adapt to different mics and environments
On their own, both signals fail in real environments:
- VAD can spike on non-target speech that is speech-like, such as background conversations, even when they are quiet or distant.
- Volume can spike on loud noise that is not speech (drive-thru beeps, dish clanks, wind).
The trick is combining them and being very intentional about smoothing and time gating.
The detection algorithm: combine, smooth, then time-gate
Step 1: Combine VAD and volume
At every frame, we compute:
- VAD probability (a value between 0 and 1)
- Volume level (normalized to a value between 0 and 1)
We then apply a simple but powerful rule: a frame counts as "speaking" only when BOTH the VAD probability AND the volume are above their respective thresholds.
This AND logic is crucial:
- A beep might be loud (high volume) but not speech-like (low VAD probability) → rejected
- Faint speech-like noise might trigger VAD but have low volume → rejected
- Real human speech has both high VAD probability AND sufficient volume → accepted
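Here is a minimal sketch of that per-frame decision in Python. The threshold values and the RMS helper are illustrative assumptions; the VAD probability is whatever your detector emits for the frame.

```python
import numpy as np

# Illustrative thresholds; the real values come from the configuration
# surface described later in the article.
VAD_THRESHOLD = 0.6
VOLUME_THRESHOLD = 0.15

def frame_volume(frame: np.ndarray) -> float:
    """RMS energy of a float32 audio frame (samples in [-1, 1])."""
    return float(np.sqrt(np.mean(np.square(frame))))

def frame_is_speech(vad_probability: float, volume: float) -> bool:
    """A frame counts as 'speaking' only when BOTH signals agree."""
    return vad_probability >= VAD_THRESHOLD and volume >= VOLUME_THRESHOLD
```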
Step 2: Smooth the volume signal with exponential smoothing
Real audio is spiky. Even clean utterances have per-frame variation. Raw volume measurements can jump around significantly.
We apply exponential smoothing to the volume signal. This is a simple but effective technique where each new value is blended with the previous smoothed value:
smoothed_volume = previous_smoothed + factor × (raw_volume - previous_smoothed)
The smoothing factor controls responsiveness: lower values create more stable signals but react slower; higher values are more responsive but noisier. The goal is to reduce false positives from random spikes while still reacting quickly to genuine speech.
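A minimal implementation of that formula, with an illustrative default factor, looks like this:

```python
class VolumeSmoother:
    """Exponential smoothing of the raw volume signal.

    `factor` is the smoothing factor from the formula above: lower values
    give a more stable (but slower) signal, higher values react faster but
    pass through more spikes. 0.3 is an illustrative default.
    """

    def __init__(self, factor: float = 0.3):
        self.factor = factor
        self.smoothed = 0.0

    def update(self, raw_volume: float) -> float:
        # smoothed += factor * (raw - smoothed)
        self.smoothed += self.factor * (raw_volume - self.smoothed)
        return self.smoothed
```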
Step 3: Time-gate with sustained duration requirements
Now we measure whether the combined signal stays above threshold for a minimum duration before we commit to an event.
Two separate durations matter:
- Speech start duration: how long the signal must remain "speaking" before we emit a speech-started event
- Speech end duration: how long the signal must remain "silent" before we emit a speech-stopped event
This time-gating solves a large class of "false triggers from random artifacts", because most artifacts do not stay "speech-like" for long enough.
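A sketch of the time gate, assuming 20 ms frames and placeholder durations (the real values are part of the configuration surface discussed later):

```python
from typing import Optional

FRAME_MS = 20
SPEECH_START_MS = 200   # how long the signal must stay "speaking"
SPEECH_END_MS = 700     # how long the signal must stay "silent"

class TimeGate:
    def __init__(self) -> None:
        self.speaking = False
        self.active_ms = 0   # consecutive "speaking" time
        self.silent_ms = 0   # consecutive "silent" time

    def update(self, frame_active: bool) -> Optional[str]:
        """Feed one frame's combined decision; returns an event or None."""
        if frame_active:
            self.active_ms += FRAME_MS
            self.silent_ms = 0
            if not self.speaking and self.active_ms >= SPEECH_START_MS:
                self.speaking = True
                return "speech_started"
        else:
            self.silent_ms += FRAME_MS
            self.active_ms = 0
            if self.speaking and self.silent_ms >= SPEECH_END_MS:
                self.speaking = False
                return "speech_stopped"
        return None
```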
Step 4: Hysteresis with separate thresholds
We use different thresholds for starting vs. stopping speech. The threshold to detect speech start is higher than the threshold to detect speech end.
This creates a "hysteresis zone" between the two thresholds where the system maintains its previous state. The benefits:
- Prevents rapid flickering between "speaking" and "not speaking"
- A user who starts speaking confidently will stay in "speaking" state even if their voice briefly dips
- Avoids cutting off speech during natural pauses mid-sentence
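In code, the hysteresis band is just two thresholds plus a memory of the previous state; the values below are illustrative:

```python
SPEECH_START_VAD = 0.6   # required to enter the "speaking" state
SPEECH_END_VAD = 0.35    # must drop below this to leave it

def classify_frame(vad_probability: float, currently_speaking: bool) -> bool:
    if vad_probability >= SPEECH_START_VAD:
        return True
    if vad_probability <= SPEECH_END_VAD:
        return False
    # Hysteresis zone between the two thresholds: keep the previous state.
    return currently_speaking
```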
The only two events we emit
From this logic we emit:
- speech_started
- speech_stopped
These events are the contract between turn detection and the rest of the voice stack.
How this tackles the two big real-world challenges
Challenge 1: Differentiating human speech from background noise
For example, drive-thru audio is full of beeps, engine sounds, wind, overlapping voices, and random bursts of noise. If you run a naive VAD threshold, your AI agent will constantly cut itself off.
What helps here:
- Voice isolation preprocessing: Cleans the audio before detection even begins
- Two-signal AND agreement: beeps might be loud but not speech-like, or speech-like but not sustained
- Exponential smoothing: reduces spiky false positives in volume measurements
- Time gating: prevents short artifacts from triggering barge-in
Result: Notch’s agent stays stable in noisy environments and only yields the floor when the user actually starts speaking.
Challenge 2: Avoiding false interruptions from short human noises
Humans make sounds that are not turn-taking: coughs, laughs, throat clears, "uh" while thinking, etc. Many of these briefly look like speech.
What helps here:
- Voice isolation: Can suppress laughter and non-speech human sounds
- Speech start duration is your best friend: requires slightly more persistence than a typical "cough spike"
- Speech end duration: avoids chopping a user's utterance when they pause briefly mid-sentence
- Hysteresis thresholds: maintains state during brief dips
Result: fewer interruptions, less "nervous" behavior, and fewer cut-off turns.
Technical implementation pieces that make this work in practice
1) Pre-buffering audio frames (so we never miss the first words)
Even great speech detection can be "late" by a fraction of a second. If you only start considering audio after speech_started, you might miss the beginning of the user's utterance.
The fix is architectural:
- We accumulate audio frames continuously, even before speech is officially detected
- When the system is not speaking, we keep a rolling buffer of recent frames
- When speech_started fires, we already have the preceding audio that led up to the detection
- The speech timestamp points back to when VAD first went high, not when we committed the detection
Practically, this means:
- A user can start speaking, and even if we commit the detection slightly late, we retain the beginning of what they said
- No audio is lost during the "warm-up" period before the time-gate duration is met
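A rolling pre-buffer is easy to sketch with a bounded deque; the 500 ms window and 20 ms frame size below are illustrative assumptions:

```python
from collections import deque

FRAME_MS = 20
PREBUFFER_MS = 500

class PreBuffer:
    def __init__(self) -> None:
        # Oldest frames fall off automatically once the window is full.
        self.frames = deque(maxlen=PREBUFFER_MS // FRAME_MS)

    def push(self, frame: bytes) -> None:
        """Called for every incoming frame, speech or not."""
        self.frames.append(frame)

    def drain(self) -> list:
        """On speech_started, hand the buffered lead-in audio to the
        transcription path so the first words are not lost."""
        buffered = list(self.frames)
        self.frames.clear()
        return buffered
```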
2) Dynamic adjustment of sensitivity levels (adaptive volume normalization)
A fixed volume threshold is a trap. Users speak at different loudness levels, mics vary wildly, and environments change mid-call.
So instead of hard-coding “what volume means speech”, we adapt continuously:
- As the user speaks, we estimate their recent speaking energy (rolling window).
- We normalize the current volume relative to that baseline.
- We adjust parameters so the system stays sensitive to the user, without becoming hypersensitive to background noise.
Intuition:
- If a user is naturally loud, we raise what “counts” as meaningful activity so beeps do not trigger barge-in.
- If a user is naturally quiet, we lower the effective threshold so we do not miss their first words.
This is not “training the model”. It’s tuning the decision boundary in real time using short-term context.
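One possible sketch of that adaptation, assuming we track volumes observed during confirmed speech and derive the effective threshold from a rolling baseline (the window size, percentile, and 0.5 ratio are illustrative):

```python
from collections import deque
import numpy as np

class AdaptiveVolumeThreshold:
    """Per-session sensitivity adjustment based on recent speaking energy."""

    def __init__(self, floor: float = 0.05, window: int = 200):
        self.floor = floor                  # never go below this
        self.recent = deque(maxlen=window)  # volumes observed during speech

    def observe_speech(self, volume: float) -> None:
        """Call for frames already confirmed as speech."""
        self.recent.append(volume)

    def threshold(self) -> float:
        if not self.recent:
            return self.floor
        baseline = float(np.percentile(self.recent, 50))  # typical loudness
        # Require roughly half of the user's typical speaking energy,
        # so a loud talker's threshold rises and a quiet talker's falls.
        return max(self.floor, 0.5 * baseline)
```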
3) Continuous speech recognition in parallel
The transcriber receives audio continuously in streaming mode, running alongside turn detection:
- Audio chunks are sent to the STT client as they arrive, before VAD even processes them
- Turn detection runs in parallel
- When speech_started fires, we already have ongoing transcription context
- When the turn ends, we finalize the segment and commit it downstream
Practically, this means:
- A user can start speaking under the agent, and even if the system commits the interruption slightly late, we still retain the beginning of what they said
- The AI agent response might speak one extra word or two, but the user's utterance is intact
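Architecturally, this is a single audio pump feeding two consumers. In the sketch below, stt_client and turn_detector are hypothetical stand-ins for your streaming STT SDK and the detector described above, not a specific vendor API:

```python
import asyncio

async def pump_audio(mic_frames, stt_client, turn_detector, on_event):
    """Forward every frame to STT while turn detection runs alongside."""
    async for frame in mic_frames:
        # 1) Send audio to the transcriber immediately, before any gating...
        await stt_client.send_audio(frame)
        # 2) ...while turn detection processes the same frame in parallel.
        event = turn_detector.update(frame)
        if event is not None:
            await on_event(event)   # "speech_started" / "speech_stopped"
```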
4) Using markers and timestamps for precise interruption handling in TTS (avoid repeats)
There is a second "turn detection" problem that shows up only when you care about polish:
"The agent generated 3 sentences, the user interrupts after sentence 1. How do we know exactly what the user already heard, so we do not repeat it?"
The core issue is: audio playback happens in time, while the agent's response exists as text. If you do not map time to text, you cannot know where you got cut off.
A pragmatic solution is to introduce a text-audio alignment layer:
- During TTS playback, attach timestamps or markers to the output stream
- Markers can be at word boundaries, punctuation boundaries, or regular intervals
- Maintain a running "playhead" of which marker index has been emitted
When barge-in occurs:
- You know the interruption timestamp
- You map it to the latest marker delivered
- Now you know exactly which part of the response was actually heard
This enables two important behaviors:
- No double-speaking: if the user asks "wait, what about X?", the agent can continue from the remaining portion rather than repeating the whole response.
- Clean resume or rephrase: you can re-surface only the missing information, or rephrase the remainder concisely.
Implementation note:
- Some TTS engines provide word-level time alignment out of the box
- If not, you can approximate with chunk-level markers, which is still enough to prevent repeating entire paragraphs
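A minimal alignment layer can be a sorted list of markers plus a binary search at interruption time. The sketch below assumes the TTS path can hand us (playback offset, text chunk) pairs, whether word-level or chunk-level:

```python
import bisect

class PlaybackTracker:
    """Maps a barge-in timestamp back to the text already delivered."""

    def __init__(self) -> None:
        self.marker_times = []   # playback offsets in ms, ascending
        self.marker_texts = []   # text covered up to each marker

    def add_marker(self, offset_ms: int, text_chunk: str) -> None:
        """Called as each word/chunk is emitted to the audio stream."""
        self.marker_times.append(offset_ms)
        self.marker_texts.append(text_chunk)

    def heard_before(self, interrupt_offset_ms: int) -> str:
        """Return the portion of the response delivered before barge-in."""
        idx = bisect.bisect_right(self.marker_times, interrupt_offset_ms)
        return "".join(self.marker_texts[:idx])
```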
5) VAD silencing during critical operations
In some scenarios, especially in highly regulated industries (such as AI for insurance or finance), the agent must temporarily disable turn detection and deliver specific statements in full, like:
- “This call is being recorded…”
- legally required disclosures in collections, insurance, healthcare, etc.
If you allow normal barge-in behavior during these moments, you risk non-compliance.
So we support policy-driven turn-taking rules, for example:
- mark certain phrases as protected or non-interruptible
- during these protected segments:
- disable barge-in entirely, or
- raise thresholds so only extremely confident speech triggers an interruption, or
- delay reaction until the compliance sentence is completed
This keeps the experience natural while ensuring regulatory text is reliably delivered.
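One way to sketch this is a small policy object the playback layer consults before honoring a barge-in; the field names below are illustrative, not a specific product configuration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TurnTakingPolicy:
    protected: bool = False            # segment must be delivered fully
    allow_barge_in: bool = True
    vad_threshold_override: Optional[float] = None  # raise the bar if set

RECORDING_DISCLOSURE = TurnTakingPolicy(protected=True, allow_barge_in=False)
NORMAL_RESPONSE = TurnTakingPolicy()

def should_interrupt(policy: TurnTakingPolicy, vad_probability: float,
                     default_threshold: float) -> bool:
    """Decide whether detected speech is allowed to interrupt playback."""
    if policy.protected and not policy.allow_barge_in:
        return False
    threshold = policy.vad_threshold_override or default_threshold
    return vad_probability >= threshold
```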
What happens downstream when events fire
When speech_started fires:
- We record the timestamp for staleness checking
- We start streaming transcription
- We cut off any ongoing responses of our agent
- We cancel pending agent responses that haven't been spoken yet
When speech_stopped fires:
- The accumulated audio buffer is finalized
- The STT stream is completed and we get a final transcript
- The transcript is sent to the “Brain” for response generation
- The response generation → TTS pipeline begins
Even though the explanation is simplified, the full system runs in streaming mode:
- audio is chunked continuously to the transcriber
- turn detection still works in real time alongside it
- segments are “logical boundaries” applied on top of a continuous stream
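Wiring the two events into the rest of the stack can be as simple as two handlers. In the sketch below, tts, stt, and brain are hypothetical stand-ins for the real playback, transcription, and response components:

```python
import time

class TurnEventHandler:
    def __init__(self, tts, stt, brain):
        self.tts, self.stt, self.brain = tts, stt, brain
        self.last_speech_ts = 0.0

    def on_speech_started(self) -> None:
        self.last_speech_ts = time.monotonic()   # recorded for staleness checks
        self.tts.stop_playback()                 # cut off the agent's ongoing response
        self.brain.cancel_pending_responses()    # drop responses not yet spoken

    def on_speech_stopped(self) -> None:
        transcript = self.stt.finalize_segment()  # close the streaming segment
        self.brain.generate_response(transcript)  # response generation -> TTS pipeline
```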
Staleness checks: handling race conditions
One subtle but critical aspect: what happens when the user speaks again while we are still processing their previous utterance?
We implement staleness checks at every boundary:
- Every speech path captures a processTimestamp when it starts
- Before performing expensive operations (LLM calls, TTS generation), we check if any newer speech has been detected
- This prevents stale responses from overtaking fresher user input and ensures we are always generating on the most up-to-date context
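A minimal staleness guard, assuming a monotonic clock and a shared timestamp updated on every speech_started:

```python
import time

class StalenessGuard:
    def __init__(self) -> None:
        self.latest_speech_ts = 0.0

    def new_speech(self) -> float:
        """Called on speech_started; returns the processTimestamp for this path."""
        self.latest_speech_ts = time.monotonic()
        return self.latest_speech_ts

    def is_stale(self, process_timestamp: float) -> bool:
        """True if newer speech arrived after this path started."""
        return self.latest_speech_ts > process_timestamp

# Usage in a response path (re-check before each expensive step):
# ts = guard.latest_speech_ts
# ...finalize STT...
# if guard.is_stale(ts): return       # abort before the LLM call
# ...call LLM, check again before TTS...
```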
Configuration surface area
The following parameters are typically exposed for tuning:
- Speech detection threshold: VAD probability required to consider a frame as speech
- Silence detection threshold: VAD probability below which we consider a frame as silence (lower than speech threshold for hysteresis)
- Minimum volume threshold: Volume level required for speech (prevents false positives from low-volume noise)
- Speech start duration: How long signals must stay high before triggering speech-started
- Speech end duration: How long signals must stay low before triggering speech-stopped
- Volume smoothing factor: How aggressively to smooth volume measurements
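One possible shape for that tuning surface as a config object; the names and defaults below are illustrative, not canonical values:

```python
from dataclasses import dataclass

@dataclass
class TurnDetectionConfig:
    speech_vad_threshold: float = 0.6     # VAD probability to count a frame as speech
    silence_vad_threshold: float = 0.35   # lower bound for hysteresis
    min_volume_threshold: float = 0.15    # normalized RMS floor
    speech_start_ms: int = 200            # sustained speech before speech_started
    speech_end_ms: int = 700              # sustained silence before speech_stopped
    volume_smoothing_factor: float = 0.3  # exponential smoothing factor
```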
Looking ahead: where turn detection gets even better
A few directions consistently pay off as you push from “works” to “feels human”:
- Personalization at scale: learn stable per-user or per-environment profiles (mic gain, typical loudness, background noise patterns) while still adapting per session.
- Richer signals: go beyond VAD and volume with spectral features, pitch contours, prosody cues, and overlap detection to improve intent-to-interrupt classification.
- Better "intent to barge-in" modeling: not all speech during agent playback is an interruption. Sometimes it's backchannel feedback ("mm-hmm", "yeah"). Modeling this reduces unnecessary cut-offs.
- Continuous evaluation with real audio: turn detection quality depends on noise distribution. Replay-based testing, synthetic noise injection, and environment-specific benchmarks prevent regressions.
Key takeaway
Turn detection in voice AI boils down to getting the full stack right, not any single trick:
- Clean the audio first: Voice isolation removes noise before detection
- Combine signals with AND logic: VAD probability and volume must agree
- Smooth noisy signals: exponential smoothing reduces false positives from volume spikes
- Time-gate events: Require sustained activity before committing
- Use hysteresis: Separate thresholds for start vs stop prevent flickering
- Pre-buffer audio: Never lose the beginning of an utterance
- Run STT in parallel: Transcription streams alongside detection
- Track TTS playback position: Know what was already spoken when interrupted
- Check for staleness: Abort stale work when newer speech arrives
Get these pieces right, and you get an AI agent that feels responsive but not jittery, handles interruptions gracefully, and stays reliable even in difficult acoustic environments. The “magic” is in layering preprocessing (voice isolation), signal logic (VAD + volume), noise reduction (smoothing), temporal filtering (time gates + hysteresis), and the right architecture (pre-buffering, parallel STT, TTS markers, staleness checks).
Conclusion: turning a hard UX problem into an engineering advantage
Turn detection is both a signal processing problem and a product experience problem. By combining VAD probability with volume, smoothing with an exponential filter, gating with time thresholds, and emitting clean speech_started / speech_stopped events into the rest of the voice stack, you can make an agent behave like a good conversational partner even in messy real-world environments like drive-thrus. Add adaptive sensitivity, always-on streaming transcription, and timestamped TTS playback, and interruptions stop being a failure case - they become a first-class feature of a voice agent that feels fast, natural, and reliable.




