Tackling Turn Detection in Voice AI: Overcoming Noise and Interruption Challenges
Voice is the hardest channel to make feel "obviously natural". Not because speech-to-text (STT) or text-to-speech (TTS) is the bottleneck - but because conversational timing is brutal. Humans interrupt, hesitate, trail off, speak over background noise, and start talking before the other side is done. If your AI agent cannot handle that turn-taking dance, everything else (LLM quality, tool calls, workflows) barely matters.
This article walks through a practical, engineer-first approach to turn detection for voice AI, focused on one goal: detect when a human is speaking with high accuracy so the AI agent can behave as human-like as possible. We will cover the core detection signals, the real-time decision logic, and the downstream behaviors that make interruptions feel clean rather than chaotic, along with practical solutions you can implement in your own systems.
What "turn detection" actually means in production
In a voice autonomous agent, turn detection is the system that decides:
- when the user has started speaking (barge-in)
- when the user has stopped speaking (end-of-turn)
Those two events drive everything else:
- whether we interrupt the TTS agent while speaking
- when we finalize an audio segment for transcription
- when we commit the agent's "speaking turn" state and transition to "listening"
- how we avoid double-speaking and repeated information
A surprising amount of "voice UX quality" is basically how correct and how fast these two event signals are.
Step zero: Voice isolation (cleaning the audio before detection)
Before we even start detecting speech, we run the incoming audio through a voice isolation layer. This is a neural-network-based processor that:
- Removes background noise (traffic, air conditioning, machinery, people talking in the background)
- Isolates the human voice from ambient sounds
- Suppresses non-speech sounds like laughter, coughs, and environmental artifacts
- Reduces false VAD triggers significantly
- Improves downstream STT transcription quality
The key insight is that voice isolation happens before VAD processing. By the time our turn detection logic sees the audio, it is already cleaned and voice-focused. This dramatically reduces the noise floor and makes our threshold-based detection much more reliable.
In noisy environments like drive-thrus or call centers, this preprocessing step is the difference between a jittery, constantly-interrupting agent and one that stays calm and only responds when the user actually speaks.
The two core signals we use
We base turn detection on two streaming signals computed over short audio frames:
- VAD probability: a voice activity detector outputs a probability that the frame contains human speech (or, more specifically, speech-like activity)
- Speech volume: a measure of audio energy (RMS-based) that helps differentiate speech from low-energy artifacts and also helps adapt to different mics and environments
On their own, both signals fail in real environments:
- VAD can spike on non-target speech that is speech-like, such as background conversations, even when they are quiet or distant.
- Volume can spike on loud noise that is not speech (drive-thru beeps, dish clanks, wind).
The trick is combining them and being very intentional about smoothing and time gating.
The detection algorithm: combine, smooth, then time-gate
Step 1: Combine VAD and volume
At every frame, we compute:
- VAD probability (a value between 0 and 1)
- Volume level (normalized to a value between 0 and 1)
We then apply a simple but powerful rule: a frame counts as "speaking" only when BOTH the VAD probability AND the volume are above their respective thresholds.
This AND logic is crucial:
- A beep might be loud (high volume) but not speech-like (low VAD probability) → rejected
- Faint speech-like noise might trigger VAD but have low volume → rejected
- Real human speech has both high VAD probability AND sufficient volume → accepted
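Here is a minimal sketch of that per-frame decision in Python. The threshold values and the RMS helper are illustrative assumptions; the VAD probability is whatever your detector emits for the frame.

```python
import numpy as np

# Illustrative thresholds; the real values come from the configuration
# surface described later in the article.
VAD_THRESHOLD = 0.6
VOLUME_THRESHOLD = 0.15

def frame_volume(frame: np.ndarray) -> float:
    """RMS energy of a float32 audio frame (samples in [-1, 1])."""
    return float(np.sqrt(np.mean(np.square(frame))))

def frame_is_speech(vad_probability: float, volume: float) -> bool:
    """A frame counts as 'speaking' only when BOTH signals agree."""
    return vad_probability >= VAD_THRESHOLD and volume >= VOLUME_THRESHOLD
```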
Step 2: Smooth the volume signal with exponential smoothing
Real audio is spiky. Even clean utterances have per-frame variation. Raw volume measurements can jump around significantly.
We apply exponential smoothing to the volume signal. This is a simple but effective technique where each new value is blended with the previous smoothed value:
smoothed_volume = previous_smoothed + factor × (raw_volume - previous_smoothed)
The smoothing factor controls responsiveness: lower values create more stable signals but react slower; higher values are more responsive but noisier. The goal is to reduce false positives from random spikes while still reacting quickly to genuine speech.
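A minimal implementation of that formula, with an illustrative default factor, looks like this:

```python
class VolumeSmoother:
    """Exponential smoothing of the raw volume signal.

    `factor` is the smoothing factor from the formula above: lower values
    give a more stable (but slower) signal, higher values react faster but
    pass through more spikes. 0.3 is an illustrative default.
    """

    def __init__(self, factor: float = 0.3):
        self.factor = factor
        self.smoothed = 0.0

    def update(self, raw_volume: float) -> float:
        # smoothed += factor * (raw - smoothed)
        self.smoothed += self.factor * (raw_volume - self.smoothed)
        return self.smoothed
```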
Step 3: Time-gate with sustained duration requirements
Now we measure whether the combined signal stays above threshold for a minimum duration before we commit to an event.
Two separate durations matter:
- Speech start duration: how long the signal must remain "speaking" before we emit a speech-started event
- Speech end duration: how long the signal must remain "silent" before we emit a speech-stopped event
This time-gating solves a large class of "false triggers from random artifacts", because most artifacts do not stay "speech-like" for long enough.
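A sketch of the time gate, assuming 20 ms frames and placeholder durations (the real values are part of the configuration surface discussed later):

```python
from typing import Optional

FRAME_MS = 20
SPEECH_START_MS = 200   # how long the signal must stay "speaking"
SPEECH_END_MS = 700     # how long the signal must stay "silent"

class TimeGate:
    def __init__(self) -> None:
        self.speaking = False
        self.active_ms = 0   # consecutive "speaking" time
        self.silent_ms = 0   # consecutive "silent" time

    def update(self, frame_active: bool) -> Optional[str]:
        """Feed one frame's combined decision; returns an event or None."""
        if frame_active:
            self.active_ms += FRAME_MS
            self.silent_ms = 0
            if not self.speaking and self.active_ms >= SPEECH_START_MS:
                self.speaking = True
                return "speech_started"
        else:
            self.silent_ms += FRAME_MS
            self.active_ms = 0
            if self.speaking and self.silent_ms >= SPEECH_END_MS:
                self.speaking = False
                return "speech_stopped"
        return None
```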
Step 4: Hysteresis with separate thresholds
We use different thresholds for starting vs. stopping speech. The threshold to detect speech start is higher than the threshold to detect speech end.
This creates a "hysteresis zone" between the two thresholds where the system maintains its previous state. The benefits:
- Prevents rapid flickering between "speaking" and "not speaking"
- A user who starts speaking confidently will stay in "speaking" state even if their voice briefly dips
- Avoids cutting off speech during natural pauses mid-sentence
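In code, the hysteresis band is just two thresholds plus a memory of the previous state; the values below are illustrative:

```python
SPEECH_START_VAD = 0.6   # required to enter the "speaking" state
SPEECH_END_VAD = 0.35    # must drop below this to leave it

def classify_frame(vad_probability: float, currently_speaking: bool) -> bool:
    if vad_probability >= SPEECH_START_VAD:
        return True
    if vad_probability <= SPEECH_END_VAD:
        return False
    # Hysteresis zone between the two thresholds: keep the previous state.
    return currently_speaking
```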
The only two events we emit
From this logic we emit:
- speech_started
- speech_stopped
These events are the contract between turn detection and the rest of the voice stack.
How this tackles the two big real-world challenges
Challenge 1: Differentiating human speech from background noise
For example, drive-thru audio is full of beeps, engine sounds, wind, overlapping voices, and random bursts of noise. If you run a naive VAD threshold, your AI agent will constantly cut itself off.
What helps here:
- Voice isolation preprocessing: Cleans the audio before detection even begins
- Two-signal AND agreement: beeps might be loud but not speech-like, or speech-like but not sustained
- Exponential smoothing: reduces spiky false positives in volume measurements
- Time gating: prevents short artifacts from triggering barge-in
Result: Notch’s agent stays stable in noisy environments and only yields the floor when the user actually starts speaking.
Challenge 2: Avoiding false interruptions from short human noises
Humans make sounds that are not turn-taking: coughs, laughs, throat clears, "uh" while thinking, etc. Many of these briefly look like speech.
What helps here:
- Voice isolation: Can suppress laughter and non-speech human sounds
- Speech start duration is your best friend: requires slightly more persistence than a typical "cough spike"
- Speech end duration: avoids chopping a user's utterance when they pause briefly mid-sentence
- Hysteresis thresholds: maintains state during brief dips
Result: fewer interruptions, less "nervous" behavior, and fewer cut-off turns.
Technical implementation pieces that make this work in practice
1) Pre-buffering audio frames (so we never miss the first words)
Even great speech detection can be "late" by a fraction of a second. If you only start considering audio after speech_started, you might miss the beginning of the user's utterance.
The fix is architectural:
- We accumulate audio frames continuously, even before speech is officially detected
- When the system is not speaking, we keep a rolling buffer of recent frames
- When speech_started fires, we already have the preceding audio that led up to the detection
- The speech timestamp points back to when VAD first went high, not when we committed the detection
Practically, this means:
- A user can start speaking, and even if we commit the detection slightly late, we retain the beginning of what they said
- No audio is lost during the "warm-up" period before the time-gate duration is met
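A rolling pre-buffer is easy to sketch with a bounded deque; the 500 ms window and 20 ms frame size below are illustrative assumptions:

```python
from collections import deque

FRAME_MS = 20
PREBUFFER_MS = 500

class PreBuffer:
    def __init__(self) -> None:
        # Oldest frames fall off automatically once the window is full.
        self.frames = deque(maxlen=PREBUFFER_MS // FRAME_MS)

    def push(self, frame: bytes) -> None:
        """Called for every incoming frame, speech or not."""
        self.frames.append(frame)

    def drain(self) -> list:
        """On speech_started, hand the buffered lead-in audio to the
        transcription path so the first words are not lost."""
        buffered = list(self.frames)
        self.frames.clear()
        return buffered
```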
2) Dynamic adjustment of sensitivity levels (adaptive volume normalization)
A fixed volume threshold is a trap. Users speak at different loudness levels, mics vary wildly, and environments change mid-call.
So instead of hard-coding “what volume means speech”, we adapt continuously:
- As the user speaks, we estimate their recent speaking energy (rolling window).
- We normalize the current volume relative to that baseline.
- We adjust parameters so the system stays sensitive to the user, without becoming hypersensitive to background noise.
Intuition:
- If a user is naturally loud, we raise what “counts” as meaningful activity so beeps do not trigger barge-in.
- If a user is naturally quiet, we lower the effective threshold so we do not miss their first words.
This is not “training the model”. It’s tuning the decision boundary in real time using short-term context.
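One possible sketch of that adaptation, assuming we track volumes observed during confirmed speech and derive the effective threshold from a rolling baseline (the window size, percentile, and 0.5 ratio are illustrative):

```python
from collections import deque
import numpy as np

class AdaptiveVolumeThreshold:
    """Per-session sensitivity adjustment based on recent speaking energy."""

    def __init__(self, floor: float = 0.05, window: int = 200):
        self.floor = floor                  # never go below this
        self.recent = deque(maxlen=window)  # volumes observed during speech

    def observe_speech(self, volume: float) -> None:
        """Call for frames already confirmed as speech."""
        self.recent.append(volume)

    def threshold(self) -> float:
        if not self.recent:
            return self.floor
        baseline = float(np.percentile(self.recent, 50))  # typical loudness
        # Require roughly half of the user's typical speaking energy,
        # so a loud talker's threshold rises and a quiet talker's falls.
        return max(self.floor, 0.5 * baseline)
```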
3) Continuous speech recognition in parallel
The transcriber receives audio continuously in streaming mode, running alongside turn detection:
- Audio chunks are sent to the STT client as they arrive, before VAD even processes them
- Turn detection runs in parallel
- When speech_started fires, we already have ongoing transcription context
- When the turn ends, we finalize the segment and commit it downstream
Practically, this means:
- A user can start speaking under the agent, and even if the system commits the interruption slightly late, we still retain the beginning of what they said
- The AI agent response might speak one extra word or two, but the user's utterance is intact
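Architecturally, this is a single audio pump feeding two consumers. In the sketch below, stt_client and turn_detector are hypothetical stand-ins for your streaming STT SDK and the detector described above, not a specific vendor API:

```python
import asyncio

async def pump_audio(mic_frames, stt_client, turn_detector, on_event):
    """Forward every frame to STT while turn detection runs alongside."""
    async for frame in mic_frames:
        # 1) Send audio to the transcriber immediately, before any gating...
        await stt_client.send_audio(frame)
        # 2) ...while turn detection processes the same frame in parallel.
        event = turn_detector.update(frame)
        if event is not None:
            await on_event(event)   # "speech_started" / "speech_stopped"
```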
4) Using markers and timestamps for precise interruption handling in TTS (avoid repeats)
There is a second "turn detection" problem that shows up only when you care about polish:
"The agent generated 3 sentences, the user interrupts after sentence 1. How do we know exactly what the user already heard, so we do not repeat it?"
The core issue is: audio playback happens in time, while the agent's response exists as text. If you do not map time to text, you cannot know where you got cut off.
A pragmatic solution is to introduce a text-audio alignment layer:
- During TTS playback, attach timestamps or markers to the output stream
- Markers can be at word boundaries, punctuation boundaries, or regular intervals
- Maintain a running "playhead" of which marker index has been emitted
When barge-in occurs:
- You know the interruption timestamp
- You map it to the latest marker delivered
- Now you know exactly which part of the response was actually heard
This enables two important behaviors:
- No double-speaking: if the user asks "wait, what about X?", the agent can continue from the remaining portion rather than repeating the whole response.
- Clean resume or rephrase: you can re-surface only the missing information, or rephrase the remainder concisely.
Implementation note:
- Some TTS engines provide word-level time alignment out of the box
- If not, you can approximate with chunk-level markers, which is still enough to prevent repeating entire paragraphs
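A minimal alignment layer can be a sorted list of markers plus a binary search at interruption time. The sketch below assumes the TTS path can hand us (playback offset, text chunk) pairs, whether word-level or chunk-level:

```python
import bisect

class PlaybackTracker:
    """Maps a barge-in timestamp back to the text already delivered."""

    def __init__(self) -> None:
        self.marker_times = []   # playback offsets in ms, ascending
        self.marker_texts = []   # text covered up to each marker

    def add_marker(self, offset_ms: int, text_chunk: str) -> None:
        """Called as each word/chunk is emitted to the audio stream."""
        self.marker_times.append(offset_ms)
        self.marker_texts.append(text_chunk)

    def heard_before(self, interrupt_offset_ms: int) -> str:
        """Return the portion of the response delivered before barge-in."""
        idx = bisect.bisect_right(self.marker_times, interrupt_offset_ms)
        return "".join(self.marker_texts[:idx])
```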
5) VAD silencing during critical operations
In some scenarios, especially in highly regulated industries (such as AI for insurance or finance), the agent must temporarily disable turn detection and deliver specific statements in full, like:
- “This call is being recorded…”
- legally required disclosures in collections, insurance, healthcare, etc.
If you allow normal barge-in behavior during these moments, you risk non-compliance.
So we support policy-driven turn-taking rules, for example:
- mark certain phrases as protected or non-interruptible
- during these protected segments:
- disable barge-in entirely, or
- raise thresholds so only extremely confident speech triggers an interruption, or
- delay reaction until the compliance sentence is completed
This keeps the experience natural while ensuring regulatory text is reliably delivered.
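One way to sketch this is a small policy object the playback layer consults before honoring a barge-in; the field names below are illustrative, not a specific product configuration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TurnTakingPolicy:
    protected: bool = False            # segment must be delivered fully
    allow_barge_in: bool = True
    vad_threshold_override: Optional[float] = None  # raise the bar if set

RECORDING_DISCLOSURE = TurnTakingPolicy(protected=True, allow_barge_in=False)
NORMAL_RESPONSE = TurnTakingPolicy()

def should_interrupt(policy: TurnTakingPolicy, vad_probability: float,
                     default_threshold: float) -> bool:
    """Decide whether detected speech is allowed to interrupt playback."""
    if policy.protected and not policy.allow_barge_in:
        return False
    threshold = policy.vad_threshold_override or default_threshold
    return vad_probability >= threshold
```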
What happens downstream when events fire
When speech_started fires:
- We record the timestamp for staleness checking
- We start streaming transcription
- We cut off any ongoing responses of our agent
- We cancel pending agent responses that haven't been spoken yet
When speech_stopped fires:
- The accumulated audio buffer is finalized
- The STT stream is completed and we get a final transcript
- The transcript is sent to the “Brain” for response generation
- The response generation → TTS pipeline begins
Even though the explanation is simplified, the full system runs in streaming mode:
- audio is chunked continuously to the transcriber
- turn detection still works in real time alongside it
- segments are “logical boundaries” applied on top of a continuous stream
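Wiring the two events into the rest of the stack can be as simple as two handlers. In the sketch below, tts, stt, and brain are hypothetical stand-ins for the real playback, transcription, and response components:

```python
import time

class TurnEventHandler:
    def __init__(self, tts, stt, brain):
        self.tts, self.stt, self.brain = tts, stt, brain
        self.last_speech_ts = 0.0

    def on_speech_started(self) -> None:
        self.last_speech_ts = time.monotonic()   # recorded for staleness checks
        self.tts.stop_playback()                 # cut off the agent's ongoing response
        self.brain.cancel_pending_responses()    # drop responses not yet spoken

    def on_speech_stopped(self) -> None:
        transcript = self.stt.finalize_segment()  # close the streaming segment
        self.brain.generate_response(transcript)  # response generation -> TTS pipeline
```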
Staleness checks: handling race conditions
One subtle but critical aspect: what happens when the user speaks again while we are still processing their previous utterance?
We implement staleness checks at every boundary:
- Every speech path captures a processTimestamp when it starts
- Before performing expensive operations (LLM calls, TTS generation), we check if any newer speech has been detected
- This prevents stale responses from overtaking fresher user input and ensures we are always generating on the most up-to-date context
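A minimal staleness guard, assuming a monotonic clock and a shared timestamp updated on every speech_started:

```python
import time

class StalenessGuard:
    def __init__(self) -> None:
        self.latest_speech_ts = 0.0

    def new_speech(self) -> float:
        """Called on speech_started; returns the processTimestamp for this path."""
        self.latest_speech_ts = time.monotonic()
        return self.latest_speech_ts

    def is_stale(self, process_timestamp: float) -> bool:
        """True if newer speech arrived after this path started."""
        return self.latest_speech_ts > process_timestamp

# Usage in a response path (re-check before each expensive step):
# ts = guard.latest_speech_ts
# ...finalize STT...
# if guard.is_stale(ts): return       # abort before the LLM call
# ...call LLM, check again before TTS...
```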
Configuration surface area
The following parameters are typically exposed for tuning:
- Speech detection threshold: VAD probability required to consider a frame as speech
- Silence detection threshold: VAD probability below which we consider a frame as silence (lower than speech threshold for hysteresis)
- Minimum volume threshold: Volume level required for speech (prevents false positives from low-volume noise)
- Speech start duration: How long signals must stay high before triggering speech-started
- Speech end duration: How long signals must stay low before triggering speech-stopped
- Volume smoothing factor: How aggressively to smooth volume measurements
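One possible shape for that tuning surface as a config object; the names and defaults below are illustrative, not canonical values:

```python
from dataclasses import dataclass

@dataclass
class TurnDetectionConfig:
    speech_vad_threshold: float = 0.6     # VAD probability to count a frame as speech
    silence_vad_threshold: float = 0.35   # lower bound for hysteresis
    min_volume_threshold: float = 0.15    # normalized RMS floor
    speech_start_ms: int = 200            # sustained speech before speech_started
    speech_end_ms: int = 700              # sustained silence before speech_stopped
    volume_smoothing_factor: float = 0.3  # exponential smoothing factor
```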
Looking ahead: where turn detection gets even better
A few directions consistently pay off as you push from “works” to “feels human”:
- Personalization at scale: learn stable per-user or per-environment profiles (mic gain, typical loudness, background noise patterns) while still adapting per session.
- Richer signals: go beyond VAD and volume with spectral features, pitch contours, prosody cues, and overlap detection to improve intent-to-interrupt classification.
- Better "intent to barge-in" modeling: not all speech during agent playback is an interruption. Sometimes it's backchannel feedback ("mm-hmm", "yeah"). Modeling this reduces unnecessary cut-offs.
- Continuous evaluation with real audio: turn detection quality depends on noise distribution. Replay-based testing, synthetic noise injection, and environment-specific benchmarks prevent regressions.
Key takeaway
Turn detection in voice AI boils down to getting the full stack right, not any single trick:
- Clean the audio first: Voice isolation removes noise before detection
- Combine signals with AND logic: VAD probability and volume must agree
- Smooth noisy signals: exponential smoothing reduces false positives from volume spikes
- Time-gate events: Require sustained activity before committing
- Use hysteresis: Separate thresholds for start vs stop prevent flickering
- Pre-buffer audio: Never lose the beginning of an utterance
- Run STT in parallel: Transcription streams alongside detection
- Track TTS playback position: Know what was already spoken when interrupted
- Check for staleness: Abort stale work when newer speech arrives
Get these pieces right, and you get an AI agent that feels responsive but not jittery, handles interruptions gracefully, and stays reliable even in difficult acoustic environments. The “magic” is in layering preprocessing (voice isolation), signal logic (VAD + volume), noise reduction (smoothing), temporal filtering (time gates + hysteresis), and the right architecture (pre-buffering, parallel STT, TTS markers, staleness checks).
Conclusion: turning a hard UX problem into an engineering advantage
Turn detection is both a signal processing problem and a product experience problem. By combining VAD probability with volume, smoothing with an exponential filter, gating with time thresholds, and emitting clean speech_started / speech_stopped events into the rest of the voice stack, you can make an agent behave like a good conversational partner even in messy real-world environments like drive-thrus. Add adaptive sensitivity, always-on streaming transcription, and timestamped TTS playback, and interruptions stop being a failure case - they become a first-class feature of a voice agent that feels fast, natural, and reliable.




