How to build production-grade voice agents using LiveKit Agents, Whisper, and LLMs — from WebRTC basics to deployment.
Voice is eating software. In the last two years we have seen AI voice agents go from lab curiosities to the front line of customer service, healthcare intake, language learning, and developer tooling. The technology is no longer experimental — it is in production, at scale, handling millions of real conversations.
But here is the uncomfortable truth: most tutorials on voice AI are shallow. They show you how to call a transcription API, pipe the text into ChatGPT, and play back a TTS response. That is fine for a weekend demo. It is not how you build something that works reliably for real users.
This guide is different. We are going to build a real-time voice AI system from the ground up using LiveKit — the open-source real-time communications infrastructure that powers some of the most serious voice AI products in production today. By the end, you will understand every layer of the stack: WebRTC and audio, the LiveKit Agents framework, STT/TTS integrations, LLM orchestration, and deployment.
What you will build: a fully functional voice agent that joins a LiveKit room, listens with VAD-gated transcription, reasons with an LLM, calls tools, and responds with low-latency TTS, all deployable on a live URL.
Before diving into code, it is worth asking why LiveKit specifically. There are plenty of voice AI APIs — Vapi, Retell, Play.ai, even Twilio AI assistants. Why build on the infrastructure layer?
The answer is control. When you build on top of a voice AI platform, you inherit their latency, their supported models, their pricing, and their architectural constraints. When you build on LiveKit, you control every layer: the models you run, the latency budget, the cost structure, and the architecture itself.
LiveKit is an open-source Selective Forwarding Unit (SFU) built on WebRTC. It handles the hard real-time infrastructure: signaling, STUN/TURN, media routing, and connection management. The LiveKit Agents framework sits on top and gives you a Python SDK for building AI participants that join rooms alongside human users.
LiveKit Cloud vs Self-Hosted
You can run LiveKit entirely on your own infrastructure (open-source, MIT licensed) or use LiveKit Cloud for managed hosting. For most teams, Cloud is the right starting point — you pay for bandwidth but skip the ops burden. Self-hosting makes sense when you need data residency or are running very high volume.
Building voice AI without understanding audio is like building a web app without understanding HTTP. You can copy examples, but you will not understand why things break. Let us cover the essentials.
When your microphone captures sound, it produces PCM (Pulse-Code Modulation) audio — a stream of numeric samples. Each sample represents the amplitude of the sound wave at a moment in time. The sample rate determines how many samples are captured per second.
Most voice AI pipelines want 16kHz mono PCM. If your audio source gives you stereo 48kHz (which browsers do by default), you need to downsample and mix channels before sending it to your STT model. LiveKit handles this automatically when you set up your agent correctly — but understanding it helps when things go wrong.
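LiveKit's plugins do this resampling for you, but it is worth seeing roughly what the conversion involves. A minimal sketch, assuming interleaved 16-bit stereo input and using scipy for the resampling:

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k_mono(pcm_bytes: bytes, src_rate: int = 48_000) -> np.ndarray:
    """Convert interleaved 16-bit stereo PCM to 16 kHz mono float32."""
    samples = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    stereo = samples.reshape(-1, 2)        # interleaved L/R -> (n, 2)
    mono = stereo.mean(axis=1)             # mix the two channels
    # Resample 48 kHz -> 16 kHz (polyphase filtering avoids aliasing)
    return resample_poly(mono, up=16_000, down=src_rate)
```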
WebRTC is the protocol that makes real-time audio work in browsers without plugins. It handles peer authentication, encryption (DTLS-SRTP), and NAT traversal (via STUN/TURN servers). Direct peer-to-peer works for two users, but it does not scale to rooms with multiple participants — every participant would need a connection to every other participant.
This is where the SFU comes in. LiveKit acts as a Selective Forwarding Unit: it receives media from each participant and routes it to the others, without mixing or decoding it. Your agent joins as a participant, subscribes to audio tracks from humans in the room, and publishes its own audio back.
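The Agents framework handles all of this for you, but it helps to see what "joining as a participant" looks like one level down. Here is a rough sketch using the livekit Python SDK; the URL and token are placeholders, and the consume loop is where your own VAD/STT would plug in:

```python
import asyncio
from livekit import rtc

async def main():
    room = rtc.Room()

    @room.on("track_subscribed")
    def on_track(track: rtc.Track, pub: rtc.RemoteTrackPublication,
                 participant: rtc.RemoteParticipant):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            # One stream per speaker: this is what makes per-speaker processing
            # (diarization, interruption handling) possible with an SFU
            asyncio.create_task(consume(rtc.AudioStream(track)))

    async def consume(stream: rtc.AudioStream):
        async for event in stream:
            frame = event.frame   # raw PCM frame from this one participant
            # ...hand the frame to your VAD / STT here

    await room.connect("wss://your-project.livekit.cloud", "<access-token>")
    await asyncio.Event().wait()  # keep the connection alive

asyncio.run(main())
```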
SFU vs MCU
An MCU (Multipoint Control Unit) mixes all audio/video into a single stream server-side. Simpler for clients, but compute-intensive and inflexible. An SFU routes streams individually, which is more scalable and gives your AI agent access to per-speaker audio — essential for diarization and interruption handling.
LiveKit uses the Opus codec for audio. Opus is a variable-bitrate codec designed for real-time communication: it trades off quality for bandwidth dynamically based on network conditions. At 32 kbps it produces excellent voice quality; at 8 kbps it degrades gracefully rather than dropping frames. For your cost modelling, budget roughly 32 kbps per participant in a LiveKit room.
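That budget is easy to turn into concrete numbers. Simple arithmetic, assuming a constant 32 kbps stream per participant:

```python
# Back-of-envelope bandwidth for one participant at 32 kbps
bitrate_bps = 32_000
bytes_per_hour = bitrate_bps / 8 * 3600            # ~14.4 MB/hour
gb_per_month = bytes_per_hour * 24 * 30 / 1e9      # ~10.4 GB/month if always on
print(f"{bytes_per_hour / 1e6:.1f} MB/hour, {gb_per_month:.1f} GB/month")
```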
LiveKit Agents is a Python framework for building AI participants — server-side processes that join LiveKit rooms and interact with human participants in real time. It abstracts the audio pipeline so you focus on application logic.
Here is the minimal working agent — genuinely the entire thing:
```python
# agent.py
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import silero, openai

async def entrypoint(ctx: JobContext):
    # Connect to the room, subscribe to audio only
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # System prompt + conversation history
    initial_ctx = llm.ChatContext().append(
        role='system',
        text='You are a helpful voice assistant. Keep your answers short and conversational.',
    )

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),            # Voice activity detection
        stt=openai.STT(),                 # Whisper via OpenAI API
        llm=openai.LLM(model='gpt-4o'),   # Language model
        tts=openai.TTS(voice='nova'),     # Text-to-speech
        chat_ctx=initial_ctx,             # System prompt + history
    )
    assistant.start(ctx.room)

if __name__ == '__main__':
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```

When a human speaks in the room, the agent pipeline runs in this sequence:

1. The VAD detects that the user has finished an utterance.
2. The STT transcribes the buffered audio.
3. The LLM generates a response from the transcript and conversation history.
4. The TTS synthesizes audio from the streamed LLM output.
5. The agent publishes that audio back into the room.
The whole thing runs asynchronously. TTS starts playing before the LLM finishes generating, which is how you get sub-second perceived response times.
STT quality and latency are the biggest variables in voice AI UX. A slow or inaccurate transcription breaks the entire experience. Here is how the main options compare.
OpenAI Whisper is a transformer-based encoder-decoder model trained on 680,000 hours of multilingual audio. It is extraordinarily accurate — especially for accented English, technical jargon, and multilingual content. The tradeoff is that it is not a streaming model. It processes audio in 30-second windows and returns complete transcripts.
For voice AI this matters. You cannot wait for 30 seconds of audio before responding. The solution is VAD-gated chunking: use a VAD model to detect natural speech boundaries, chunk the audio on those boundaries, and send each chunk to Whisper separately. This gives you utterance-level transcription with latency in the 200–500ms range.
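Conceptually, the chunking loop looks something like the sketch below. The frame size and the `is_speech` helper are stand-ins for whatever VAD you use (Silero, covered later); the point is that audio is buffered until the VAD reports sustained silence, and only then flushed to the STT.

```python
import numpy as np

FRAME_MS = 30                 # VAD models classify short frames
SILENCE_MS_TO_FLUSH = 500     # end-of-utterance threshold

def is_speech(frame: np.ndarray) -> bool:
    """Placeholder for a real VAD such as Silero."""
    raise NotImplementedError

def vad_gated_chunks(frames):
    """Yield one utterance-sized chunk of audio per detected pause."""
    buffer, silent_ms, heard_speech = [], 0, False
    for frame in frames:                   # frames: iterable of 30 ms numpy arrays
        buffer.append(frame)
        if is_speech(frame):
            silent_ms, heard_speech = 0, True
        else:
            silent_ms += FRAME_MS
        if heard_speech and silent_ms >= SILENCE_MS_TO_FLUSH:
            yield np.concatenate(buffer)   # one utterance, ready for the STT
            buffer, silent_ms, heard_speech = [], 0, False
```

Each flushed chunk then goes to Whisper. With faster-whisper, the transcription step looks like this: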
```python
from faster_whisper import WhisperModel
import numpy as np

# faster-whisper: 4x faster than original, same accuracy
# Use 'large-v3' for best accuracy, 'base' for speed
model = WhisperModel('large-v3', device='cuda', compute_type='float16')

def transcribe(audio_bytes: bytes, language: str = 'en') -> str:
    # Convert 16-bit PCM bytes to a float32 numpy array
    audio = np.frombuffer(audio_bytes, dtype=np.int16)
    audio = audio.astype(np.float32) / 32768.0

    segments, info = model.transcribe(
        audio,
        language=language,
        beam_size=5,
        word_timestamps=True,  # for subtitle-style output
        vad_filter=True,       # skip internal silence
    )
    return ' '.join(seg.text.strip() for seg in segments)
```

Deepgram is purpose-built for streaming real-time transcription. Nova-2 supports websocket-based streaming with partial transcripts arriving in under 200ms. If you need the absolute lowest latency and are building in English, Deepgram is hard to beat. It is the default STT in many production LiveKit deployments.
AssemblyAI sits between Whisper and Deepgram — good streaming support, strong accuracy, and it offers features Whisper does not: speaker diarization, sentiment analysis, and content moderation built into the transcription pipeline.
Which should you use?
Start with the OpenAI STT plugin (Whisper) — zero setup, good accuracy, easy to swap later. For production at scale, benchmark Deepgram Nova-2 against your specific audio conditions. Deepgram wins on latency; Whisper wins on accuracy for non-native English and technical vocabulary.
VAD is the component that decides when a person has finished speaking. Get it wrong and your agent either cuts people off mid-sentence or waits forever after they stop. It is one of the most impactful components in the entire pipeline.
Silero VAD is a lightweight neural network model that classifies 30ms audio frames as speech or non-speech with very high accuracy. It runs on CPU in real time with minimal overhead. LiveKit Agents ships a plugin for it.
```python
from livekit.plugins import silero

# Load once at startup
vad = silero.VAD.load()

# The VoiceAssistant uses it automatically:
assistant = VoiceAssistant(
    vad=vad,
    # ... other plugins
    # min_endpointing_delay=0.5,  # seconds of silence before end-of-turn
    # interrupt_min_words=3,      # minimum words before interruption allowed
)
```

Two parameters matter most:

- `min_endpointing_delay`: how long to wait after silence before treating it as end-of-turn. 0.5s is a good default; increase it for users who think aloud with natural pauses.
- `interrupt_min_words`: prevents accidental interruptions on short sounds like "uh-huh". Set it to 3–5 for most use cases.

Connecting an LLM is the easy part. Designing prompts that work well in voice is the hard part.
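A system prompt tuned for voice looks different from one tuned for chat. The wording below is illustrative rather than canonical, but it captures the constraints that matter when the output is going to be spoken aloud:

```python
VOICE_SYSTEM_PROMPT = """
You are a friendly voice assistant for an online course platform.
Your replies will be read aloud by a text-to-speech engine, so:
- Keep answers to one or two short sentences unless asked for more detail.
- Never use markdown, bullet points, emoji, or code in your replies.
- Spell things out the way a person would say them: 'two thousand rupees',
  not 'Rs. 2000'; 'nine thirty AM', not '09:30'.
- If you need information, call a tool instead of guessing.
- If you did not understand the user, ask them to repeat rather than guessing.
"""
```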
LiveKit Agents supports OpenAI, Anthropic, Google Gemini, and any LiteLLM-compatible model. Swapping providers is a one-line change:
```python
# OpenAI GPT-4o
from livekit.plugins import openai
llm = openai.LLM(model='gpt-4o-mini')

# Anthropic Claude
from livekit.plugins import anthropic
llm = anthropic.LLM(model='claude-3-5-haiku-latest')

# Any OpenAI-compatible endpoint (local Ollama, etc.)
llm = openai.LLM.with_ollama(model='llama3.2', base_url='http://localhost:11434/v1')
```

Tool calling transforms a voice chatbot into a voice agent. Your agent can look up data, trigger workflows, book appointments, and take actions in the world — all while maintaining a natural conversation.
```python
from typing import Annotated
from livekit.agents import llm

class AssistantTools(llm.FunctionContext):
    @llm.ai_callable(description='Search the course catalogue by topic')
    async def search_courses(
        self,
        topic: Annotated[str, llm.TypeInfo(description='Topic keyword, e.g. Excel, Blender, dance')],
    ) -> str:
        # 'db' is your own async data-access layer
        results = await db.search_courses(topic)
        if not results:
            return 'No courses found for that topic.'
        names = ', '.join(r.title for r in results[:3])
        return f'I found {len(results)} courses including: {names}.'

    @llm.ai_callable(description='Get the price of a course by its ID')
    async def get_price(
        self,
        course_id: Annotated[str, llm.TypeInfo(description='The course ID')],
    ) -> str:
        course = await db.get_course(course_id)
        return f'{course.title} is priced at {course.price} rupees.'

# Pass to VoiceAssistant
assistant = VoiceAssistant(
    vad=..., stt=..., llm=..., tts=...,
    fnc_ctx=AssistantTools(),
)
```

A critical design decision: tool results must be voice-friendly. Return natural language strings, not JSON or structured data. The LLM will often relay the tool result to the user almost verbatim.
TTS is often the last mile problem in voice AI. Even a perfect pipeline sounds broken if the voice is robotic or the response takes two seconds to start.
Time-to-First-Byte (TTFB) for TTS is the delay between the LLM generating the first word and your agent starting to speak it. For perceived responsiveness, this number should be under 300ms. Streaming TTS — where audio starts playing before the full sentence is generated — is how you achieve this.
This matters more than developers expect. A voice that sounds confident and warm makes users trust the agent more, stay on longer, and report higher satisfaction. For product-facing agents, run a quick A/B test with your users — voice preference is surprisingly personal.
For Indian audiences specifically, test your chosen voice on Hindi-accented English input. Some TTS voices handle accented speech output well; others produce a jarring mismatch between the transcribed text and the spoken response.
The sum of all pipeline stages determines how responsive your agent feels. Here is a realistic breakdown for a typical deployment:
| Pipeline Stage | Typical | Optimized | Primary Lever |
|---|---|---|---|
| VAD end-of-utterance | 200–500ms | 100–200ms | Endpointing delay |
| STT transcription | 100–400ms | 80–150ms | Provider / model size |
| LLM first token | 300–800ms | 150–350ms | Model + region |
| TTS first audio | 200–400ms | 50–150ms | Provider selection |
| Network (LiveKit) | 50–100ms | 20–50ms | Server region |
| Total perceived | 850ms–2.2s | 400–900ms | |
The biggest single win is usually switching from a high-latency TTS provider to one built for real-time use (Cartesia in particular). The second-biggest win is reducing LLM TTFT by choosing a smaller, faster model or co-locating your agent with the model's data centre.
Real conversations are not sequential request-response cycles. People interrupt. They trail off. They say "yeah, I know" halfway through an explanation. Your agent needs to handle all of this.
Barge-in is when the user speaks while the agent is still talking. LiveKit Agents handles this by default — when the VAD detects speech from a human participant while the agent is speaking, it stops the TTS playback and processes the new utterance. You can tune aggressiveness with the interrupt_min_words parameter.
Backchannels are short acknowledgement sounds — "mm-hmm", "right", "sure" — that humans use to signal they are listening. Without them, silence from the agent while processing feels like a disconnection. A simple approach: inject a message after the STT returns but before the LLM responds.
```python
import random

# Play a filler while the LLM thinks
FILLERS = ['One moment...', 'Let me check that...', 'Sure, give me a second...']

async def on_user_speech_committed(text: str):
    # Start the filler immediately; the real LLM response follows and can
    # interrupt it because allow_interruptions=True
    await assistant.say(random.choice(FILLERS), allow_interruptions=True)
```

Local development is one thing. Production means your agent needs to run 24/7, handle multiple concurrent rooms, and restart gracefully when it crashes.
```dockerfile
# Dockerfile
FROM python:3.11-slim
WORKDIR /app

RUN apt-get update && apt-get install -y ffmpeg libsndfile1 && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pre-download Silero VAD weights at build time
RUN python -c 'from livekit.plugins import silero; silero.VAD.load()'

COPY . .
CMD ["python", "agent.py", "start"]
```

```yaml
# docker-compose.yml
services:
  agent:
    build: .
    environment:
      - LIVEKIT_URL=wss://your-project.livekit.cloud
      - LIVEKIT_API_KEY=your-api-key
      - LIVEKIT_API_SECRET=your-api-secret
      - OPENAI_API_KEY=your-openai-key
    restart: unless-stopped
    deploy:
      replicas: 3  # run three workers to share the load of concurrent rooms
```

Each agent job runs in its own process, one job per room. LiveKit's job dispatching handles assignment automatically: workers register with the server and receive jobs as rooms are created. To handle more concurrent rooms, scale out worker replicas.
For auto-scaling on Kubernetes, expose a metric for active_rooms and use an HPA to scale worker replicas based on utilisation. Aim for 70–80% utilisation to leave headroom for burst demand.
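A minimal sketch of the metric side, assuming you export it with prometheus_client and increment/decrement it from your own job lifecycle hooks (the hook names here are illustrative, not part of the framework):

```python
from prometheus_client import Gauge, start_http_server

# Exposed at :9100/metrics; scrape it with Prometheus and feed it to an HPA
# via the Prometheus Adapter (or KEDA)
ACTIVE_ROOMS = Gauge('active_rooms', 'Rooms currently served by this worker')

def on_job_started():      # call this when your entrypoint begins
    ACTIVE_ROOMS.inc()

def on_job_finished():     # call this when the room closes
    ACTIVE_ROOMS.dec()

start_http_server(9100)
```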
At roughly 100 voice session-hours per day: LiveKit Cloud bandwidth ~$50/day, Deepgram STT ~$35/day, GPT-4o-mini ~$20/day, ElevenLabs TTS ~$40/day. Total: ~$145/day, or ~$1.45 per active session-hour. Price your product accordingly — most B2C voice products charge $20–50/month for 5–10 hours of voice time.
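A small cost model makes it easy to re-run these numbers with your own providers and usage. The per-hour rates below are simply the daily figures above divided back out, so treat them as placeholders:

```python
# Rough per-session-hour unit costs; swap in your own quotes from each provider
COST_PER_HOUR = {
    'livekit_bandwidth': 0.50,
    'stt_deepgram':      0.35,
    'llm_gpt4o_mini':    0.20,
    'tts_elevenlabs':    0.40,
}

def daily_cost(session_hours_per_day: float) -> float:
    return session_hours_per_day * sum(COST_PER_HOUR.values())

print(sum(COST_PER_HOUR.values()))  # ~1.45 dollars per session-hour
print(daily_cost(100))              # ~145 dollars/day at 100 session-hours
```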
If your agent joins a room with multiple human participants, you need to know who said what. pyannote.audio provides state-of-the-art speaker diarization.
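A minimal offline example with pyannote.audio; the pretrained pipeline requires a Hugging Face access token, and the file name here is a placeholder:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    'pyannote/speaker-diarization-3.1',
    use_auth_token='hf_your_token',
)

# Returns speaker-labelled time segments for the recording
diarization = pipeline('call_recording.wav')
for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f'{speaker}: {segment.start:.1f}s – {segment.end:.1f}s')
```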
Whisper's language detection is excellent. Route detected language to the appropriate TTS voice and translation service.
LiveKit's SIP server lets your agent receive/make phone calls through any SIP trunk provider (Twilio, Telnyx). Connect to 8 billion phones.
Use pgvector or Qdrant with sentence-transformers embeddings, and inject retrieved chunks into the LLM context before each response.
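As a sketch of the retrieval step, using sentence-transformers for embeddings and an in-memory cosine search standing in for pgvector or Qdrant; the `inject_context` helper and the sample documents are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
docs = ['Refunds are processed within 7 days.', 'Courses include lifetime access.']
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                     # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(-scores)[:k]]

# Before each LLM turn, prepend the retrieved chunks to the chat context
def inject_context(chat_ctx, user_text: str):
    chunks = '\n'.join(retrieve(user_text))
    chat_ctx.append(role='system', text=f'Relevant knowledge:\n{chunks}')
```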
Real-time voice AI has crossed the threshold from impressive demo to practical technology. The stack we have covered — LiveKit, Whisper, an LLM, a streaming TTS — is what production voice agents are built on today. The pieces are all open-source or available as affordable APIs. The barrier is now knowledge and execution, not technology access.
The highest-leverage next step is to build something. A voice agent that does one thing well — answers questions about a product, books appointments, tutors on a subject — is far more valuable than a feature-complete prototype that does nothing well.
Our full course — 8 modules, 30+ lessons, 6 deployable projects — takes you from these fundamentals to shipping a real product.