Deep Dive • Voice AI • LiveKit • 15 min read

Building Real-Time Voice AI with LiveKit: The Complete Guide

How to build production-grade voice agents using LiveKit Agents, Whisper, and LLMs — from WebRTC basics to deployment.

Celoris

March 8, 2026

Voice is eating software. In the last two years we have seen AI voice agents go from lab curiosities to the front line of customer service, healthcare intake, language learning, and developer tooling. The technology is no longer experimental — it is in production, at scale, handling millions of real conversations.

But here is the uncomfortable truth: most tutorials on voice AI are shallow. They show you how to call a transcription API, pipe the text into ChatGPT, and play back a TTS response. That is fine for a weekend demo. It is not how you build something that works reliably for real users.

This guide is different. We are going to build a real-time voice AI system from the ground up using LiveKit — the open-source real-time communications infrastructure that powers some of the most serious voice AI products in production today. By the end, you will understand every layer of the stack: WebRTC and audio, the LiveKit Agents framework, STT/TTS integrations, LLM orchestration, and deployment.

What you will build

A fully functional voice agent that joins a LiveKit room, listens with VAD-gated transcription, reasons with an LLM, calls tools, and responds with low-latency TTS — all deployable on a live URL.

What We Are Covering

  • Why LiveKit — and why not just use a voice API
  • How real-time audio actually works (WebRTC, codecs, SFUs)
  • The LiveKit Agents framework: architecture and key abstractions
  • Speech-to-text: Whisper, Deepgram, and streaming strategies
  • LLM integration with tool calling for voice agents
  • Text-to-speech: latency, quality, and provider tradeoffs
  • End-to-end latency: where time is spent and how to cut it
  • Production deployment with Docker and LiveKit Cloud
  • What to build next

1. Why LiveKit?

Before diving into code, it is worth asking why LiveKit specifically. There are plenty of voice AI APIs — Vapi, Retell, Play.ai, even Twilio AI assistants. Why build on the infrastructure layer?

The answer is control. When you build on top of a voice AI platform, you inherit their latency, their supported models, their pricing, and their architectural constraints. When you build on LiveKit, you control every layer:

  • Which STT model you use and how you chunk audio
  • Which LLM you connect (and you can swap it per-session)
  • Which TTS voice and provider
  • How interruptions are handled
  • How you store, replay, and analyze conversations
  • Your cost structure

LiveKit is an open-source Selective Forwarding Unit (SFU) built on WebRTC. It handles the hard real-time infrastructure: signaling, STUN/TURN, media routing, and connection management. The LiveKit Agents framework sits on top and gives you a Python SDK for building AI participants that join rooms alongside human users.

LiveKit Cloud vs Self-Hosted

You can run LiveKit entirely on your own infrastructure (open-source, MIT licensed) or use LiveKit Cloud for managed hosting. For most teams, Cloud is the right starting point — you pay for bandwidth but skip the ops burden. Self-hosting makes sense when you need data residency or are running very high volume.

2. Real-Time Audio Fundamentals

Building voice AI without understanding audio is like building a web app without understanding HTTP. You can copy examples, but you will not understand why things break. Let us cover the essentials.

Audio as data: PCM and sample rates

When your microphone captures sound, it produces PCM (Pulse-Code Modulation) audio — a stream of numeric samples. Each sample represents the amplitude of the sound wave at a moment in time. The sample rate determines how many samples are captured per second.

  • 8,000 Hz — telephone quality, barely acceptable for voice
  • 16,000 Hz — the sweet spot for voice AI (Whisper default, most STT models)
  • 44,100 / 48,000 Hz — music and broadcast, unnecessary for voice

Most voice AI pipelines want 16kHz mono PCM. If your audio source gives you stereo 48kHz (which browsers do by default), you need to downsample and mix channels before sending it to your STT model. LiveKit handles this automatically when you set up your agent correctly — but understanding it helps when things go wrong.
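To see what that conversion involves, here is a minimal NumPy sketch of turning 48 kHz stereo PCM into 16 kHz mono. The function name is illustrative, and the box-filter decimation is deliberately crude — a production pipeline should use a proper anti-aliased resampler (e.g. scipy.signal.resample_poly or libsamplerate).

```python
import numpy as np

def to_16k_mono(pcm: bytes, channels: int = 2, factor: int = 3) -> bytes:
    """Crude 48 kHz stereo -> 16 kHz mono conversion, for illustration only.

    Averaging every `factor` samples is just a box filter; use a real
    polyphase resampler in production to avoid aliasing artifacts.
    """
    samples = np.frombuffer(pcm, dtype=np.int16).reshape(-1, channels)
    mono = samples.mean(axis=1)                       # mix L/R down to mono
    trimmed = mono[: len(mono) - len(mono) % factor]  # drop the remainder
    down = trimmed.reshape(-1, factor).mean(axis=1)   # decimate 48k -> 16k
    return down.astype(np.int16).tobytes()
```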

WebRTC and the SFU model

WebRTC is the protocol that makes real-time audio work in browsers without plugins. It handles peer authentication, encryption (DTLS-SRTP), and NAT traversal (via STUN/TURN servers). Direct peer-to-peer works for two users, but it does not scale to rooms with multiple participants — every participant would need a connection to every other participant.

This is where the SFU comes in. LiveKit acts as a Selective Forwarding Unit: it receives media from each participant and routes it to the others, without mixing or decoding it. Your agent joins as a participant, subscribes to audio tracks from humans in the room, and publishes its own audio back.

SFU vs MCU

An MCU (Multipoint Control Unit) mixes all audio/video into a single stream server-side. Simpler for clients, but compute-intensive and inflexible. An SFU routes streams individually, which is more scalable and gives your AI agent access to per-speaker audio — essential for diarization and interruption handling.

Opus: the codec that matters

LiveKit uses the Opus codec for audio. Opus is a variable-bitrate codec designed for real-time communication — it trades bandwidth against quality dynamically based on network conditions. At 32 kbps it produces excellent voice quality. At 8 kbps it degrades gracefully rather than dropping frames. For your cost modelling, budget roughly 32 kbps per participant in a LiveKit room.
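As a quick sanity check on that budget, 32 kbps works out to roughly 14 MB per participant per hour. A hypothetical helper (not part of any LiveKit API) makes the arithmetic explicit:

```python
def audio_mb_per_hour(kbps: float = 32) -> float:
    """Approximate audio bandwidth per participant at a given Opus bitrate."""
    bytes_per_second = kbps * 1000 / 8      # kilobits/s -> bytes/s
    return bytes_per_second * 3600 / 1e6    # over one hour, in megabytes

# At 32 kbps: about 14.4 MB per participant-hour
```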

3. The LiveKit Agents Framework

LiveKit Agents is a Python framework for building AI participants — server-side processes that join LiveKit rooms and interact with human participants in real time. It abstracts the audio pipeline so you focus on application logic.

Core abstractions

  • Worker — a long-running process that listens for job dispatch from the LiveKit server
  • Job / JobContext — a single room session assigned to a worker
  • Agent / VoiceAssistant — the AI participant with STT → LLM → TTS pipeline
  • Plugin — a swappable component: VAD, STT, LLM, or TTS

Here is the minimal working agent — genuinely the entire thing:

# agent.py
from livekit.agents import AutoSubscribe, JobContext, WorkerOptions, cli, llm
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import silero, openai

async def entrypoint(ctx: JobContext):
    # Connect to the room, subscribe to audio only
    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

    # Seed the conversation with a system prompt
    initial_ctx = llm.ChatContext().append(
        role='system',
        text='You are a helpful voice assistant. Keep replies short and conversational.',
    )

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),            # Voice activity detection
        stt=openai.STT(),                 # Whisper via OpenAI API
        llm=openai.LLM(model='gpt-4o'),   # Language model
        tts=openai.TTS(voice='nova'),     # Text-to-speech
        chat_ctx=initial_ctx,             # System prompt + history
    )
    assistant.start(ctx.room)

if __name__ == '__main__':
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

The pipeline under the hood

When a human speaks in the room, the agent pipeline runs in this sequence:

  1. Opus audio arrives from the LiveKit server
  2. VAD detects end of utterance (silence after speech)
  3. PCM audio chunk is sent to the STT model
  4. Transcript text is appended to conversation history
  5. LLM generates a response (streaming)
  6. TTS converts response text to audio (streaming)
  7. Agent publishes audio back to the LiveKit room

The whole thing runs asynchronously. TTS starts playing before the LLM finishes generating, which is how you get sub-second perceived response times.
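The overlap between steps 5 and 6 is the key trick. A simplified sketch of the idea — streamed LLM tokens flushed to TTS at sentence boundaries, where `llm_tokens` and `speak` are hypothetical stand-ins for the framework's internals:

```python
import asyncio
import re

async def stream_response(llm_tokens, speak):
    """Flush LLM output to TTS sentence-by-sentence so audio starts
    playing before generation finishes.

    `llm_tokens` is an async iterator of text deltas and `speak` is an
    async TTS call — both are illustrative placeholders.
    """
    buffer = ''
    async for token in llm_tokens:
        buffer += token
        # Flush each completed sentence so TTS can start early
        while (m := re.search(r'[.!?]\s', buffer)):
            sentence, buffer = buffer[:m.end()], buffer[m.end():]
            await speak(sentence.strip())
    if buffer.strip():
        await speak(buffer.strip())   # flush the trailing fragment
```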

4. Speech-to-Text: Choosing the Right Model

STT quality and latency are the biggest variables in voice AI UX. A slow or inaccurate transcription breaks the entire experience. Here is how the main options compare.

Whisper

OpenAI Whisper is a transformer-based encoder-decoder model trained on 680,000 hours of multilingual audio. It is extraordinarily accurate — especially for accented English, technical jargon, and multilingual content. The tradeoff is that it is not a streaming model. It processes audio in 30-second windows and returns complete transcripts.

For voice AI this matters. You cannot wait for 30 seconds of audio before responding. The solution is VAD-gated chunking: use a VAD model to detect natural speech boundaries, chunk the audio on those boundaries, and send each chunk to Whisper separately. This gives you utterance-level transcription with latency in the 200–500ms range.

from faster_whisper import WhisperModel
import numpy as np

# faster-whisper: 4x faster than original, same accuracy
# Use 'large-v3' for best accuracy, 'base' for speed
model = WhisperModel('large-v3', device='cuda', compute_type='float16')

def transcribe(audio_bytes: bytes, language: str = 'en') -> str:
    # Convert bytes to float32 numpy array
    audio = np.frombuffer(audio_bytes, dtype=np.int16)
    audio = audio.astype(np.float32) / 32768.0

    segments, info = model.transcribe(
        audio,
        language=language,
        beam_size=5,
        word_timestamps=True,   # for subtitle-style output
        vad_filter=True,        # skip internal silence
    )

    return ' '.join(seg.text.strip() for seg in segments)

Deepgram Nova-2

Deepgram is purpose-built for streaming real-time transcription. Nova-2 supports websocket-based streaming with partial transcripts arriving in under 200ms. If you need the absolute lowest latency and are building in English, Deepgram is hard to beat. It is the default STT in many production LiveKit deployments.

AssemblyAI

AssemblyAI sits between Whisper and Deepgram — good streaming support, strong accuracy, and it offers features Whisper does not: speaker diarization, sentiment analysis, and content moderation built into the transcription pipeline.

Which should you use?

Start with the OpenAI STT plugin (Whisper) — zero setup, good accuracy, easy to swap later. For production at scale, benchmark Deepgram Nova-2 against your specific audio conditions. Deepgram wins on latency; Whisper wins on accuracy for non-native English and technical vocabulary.

5. Voice Activity Detection: The Unsung Hero

VAD is the component that decides when a person has finished speaking. Get it wrong and your agent either cuts people off mid-sentence or waits forever after they stop. It is one of the most impactful components in the entire pipeline.

Silero VAD

Silero VAD is a lightweight neural network model that classifies 30ms audio frames as speech or non-speech with very high accuracy. It runs on CPU in real time with minimal overhead. LiveKit Agents ships a plugin for it.

from livekit.plugins import silero

# Load once at startup
vad = silero.VAD.load()

# The VoiceAssistant uses it automatically:
assistant = VoiceAssistant(
    vad=vad,
    # ... other plugins
    # min_endpointing_delay=0.5,  # seconds of silence before end-of-turn
    # interrupt_min_words=3,      # minimum words before interruption allowed
)

Tuning VAD for your use case

Two parameters matter most:

  • min_endpointing_delay — how long to wait after silence before treating it as end-of-turn. 0.5s is a good default; increase it for users who think aloud with natural pauses.
  • interrupt_min_words — prevents accidental interruptions on short sounds like 'uh-huh'. Set it to 3–5 for most use cases.

6. LLM Integration & Voice Prompts

Connecting an LLM is the easy part. Designing prompts that work well in voice is the hard part.

Connecting your LLM

LiveKit Agents supports OpenAI, Anthropic, Google Gemini, and any LiteLLM-compatible model. Swapping providers is a one-line change:

# OpenAI GPT-4o
from livekit.plugins import openai
llm = openai.LLM(model='gpt-4o-mini')

# Anthropic Claude
from livekit.plugins import anthropic
llm = anthropic.LLM(model='claude-3-5-haiku-latest')

# Any OpenAI-compatible endpoint (local Ollama, etc.)
llm = openai.LLM.with_ollama(model='llama3.2', base_url='http://localhost:11434/v1')

Voice Prompt Principles

  1. No markdown — no bold, bullets, or headers. The TTS will read them aloud as noise.
  2. No URLs — spell out domain names if needed, never paste full links.
  3. Short sentences — complex nested clauses are hard to follow by ear.
  4. Acknowledge before answering — 'Great question, here is what I know...' buys thinking time and feels human.
  5. Strip chain-of-thought — use a scratchpad tool or system prompt instruction to keep reasoning internal.
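Put together, a system prompt applying these five principles might look like the following. The wording is illustrative, not a recommended canonical prompt:

```python
# An illustrative system prompt applying the voice principles above
VOICE_SYSTEM_PROMPT = (
    "You are a friendly course advisor speaking to a user over voice. "
    "Keep every answer to two or three short sentences. "
    "Never use markdown formatting, bullet points, or URLs; say everything "
    "in plain speech. "
    "Briefly acknowledge the user's question before answering it. "
    "Keep any reasoning to yourself and speak only the final answer."
)
```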

Tool calling for voice agents

Tool calling transforms a voice chatbot into a voice agent. Your agent can look up data, trigger workflows, book appointments, and take actions in the world — all while maintaining a natural conversation.

from typing import Annotated

from livekit.agents import llm

class AssistantTools(llm.FunctionContext):
    # `db` is your application's data layer, assumed to be in scope

    @llm.ai_callable(description='Search the course catalogue by topic')
    async def search_courses(
        self,
        topic: Annotated[str, llm.TypeInfo(description='Topic keyword, e.g. Excel, Blender, dance')],
    ) -> str:
        results = await db.search_courses(topic)
        if not results:
            return 'No courses found for that topic.'
        names = ', '.join(r.title for r in results[:3])
        return f'I found {len(results)} courses including: {names}.'

    @llm.ai_callable(description='Get the price of a course by its ID')
    async def get_price(
        self,
        course_id: Annotated[str, llm.TypeInfo(description='The course ID')],
    ) -> str:
        course = await db.get_course(course_id)
        return f'{course.title} is priced at {course.price} rupees.'

# Pass to VoiceAssistant
assistant = VoiceAssistant(
    vad=..., stt=..., llm=..., tts=...,
    fnc_ctx=AssistantTools(),
)

A critical design decision: tool results must be voice-friendly. Return natural language strings, not JSON or structured data. The LLM will read the tool result verbatim to the user.
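One way to enforce this is a small formatting helper between your data layer and the tool's return value. The function name and result shape here are illustrative:

```python
def format_for_voice(results: list[dict]) -> str:
    """Turn structured search results into one spoken sentence,
    rather than handing the LLM raw JSON to read aloud."""
    if not results:
        return "I could not find any matching courses."
    names = ", ".join(r["title"] for r in results[:3])
    more = len(results) - 3
    suffix = f", and {more} more" if more > 0 else ""
    return f"I found {len(results)} courses, including {names}{suffix}."
```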

7. Text-to-Speech: Latency & Quality

TTS is often the last mile problem in voice AI. Even a perfect pipeline sounds broken if the voice is robotic or the response takes two seconds to start.

The metric that matters: TTFB

Time-to-First-Byte (TTFB) for TTS is the delay between the LLM generating the first word and your agent starting to speak it. For perceived responsiveness, this number should be under 300ms. Streaming TTS — where audio starts playing before the full sentence is generated — is how you achieve this.

  • OpenAI TTS — good quality, ~300ms TTFB, streaming supported, easy integration
  • ElevenLabs — highest quality, voice cloning, ~200ms TTFB on Turbo v2.5
  • Cartesia Sonic — purpose-built for real-time, sub-100ms TTFB, strong quality
  • Google Cloud TTS — reliable, multilingual, ~250ms TTFB

Choosing a voice

This matters more than developers expect. A voice that sounds confident and warm makes users trust the agent more, stay on longer, and report higher satisfaction. For product-facing agents, run a quick A/B test with your users — voice preference is surprisingly personal.

For Indian audiences specifically, test your chosen voice on Hindi-accented English input. Some TTS voices handle accented speech output well; others produce a jarring mismatch between the transcribed text and the spoken response.

8. End-to-End Latency Breakdown

The sum of all pipeline stages determines how responsive your agent feels. Here is a realistic breakdown for a typical deployment:

Pipeline Stage         | Typical     | Optimized   | Primary Lever
VAD end-of-utterance   | 200–500ms   | 100–200ms   | Endpointing delay
STT transcription      | 100–400ms   | 80–150ms    | Provider / model size
LLM first token        | 300–800ms   | 150–350ms   | Model + region
TTS first audio        | 200–400ms   | 50–150ms    | Provider selection
Network (LiveKit)      | 50–100ms    | 20–50ms     | Server region
Total perceived        | 850ms–2.2s  | 400–900ms   |

The biggest single win is usually switching from a high-latency TTS provider to one built for real-time use (Cartesia in particular). The second-biggest win is reducing LLM TTFT by choosing a smaller, faster model or co-locating your agent with the model's data centre.
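It pays to write the budget down as numbers and track it per release. A sketch using rough midpoints of the per-stage ranges quoted above:

```python
# Rough per-stage midpoints of the latency ranges above, in milliseconds
TYPICAL = {'vad': 350, 'stt': 250, 'llm_first_token': 550,
           'tts_first_audio': 300, 'network': 75}
OPTIMIZED = {'vad': 150, 'stt': 115, 'llm_first_token': 250,
             'tts_first_audio': 100, 'network': 35}

def total_latency_ms(stages: dict[str, int]) -> int:
    """Sum the per-stage budget to get total perceived latency."""
    return sum(stages.values())

# Typical midpoints total ~1525 ms; optimized midpoints total ~650 ms
```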

9. Handling Interruptions

Real conversations are not sequential request-response cycles. People interrupt. They trail off. They say "yeah, I know" halfway through an explanation. Your agent needs to handle all of this.

Barge-in

Barge-in is when the user speaks while the agent is still talking. LiveKit Agents handles this by default — when the VAD detects speech from a human participant while the agent is speaking, it stops the TTS playback and processes the new utterance. You can tune aggressiveness with the interrupt_min_words parameter.

Backchannels

Backchannels are short acknowledgement sounds — "mm-hmm", "right", "sure" — that humans use to signal they are listening. Without them, silence from the agent while processing feels like a disconnection. A simple approach: inject a message after the STT returns but before the LLM responds.

import random

# Play a filler while the LLM thinks (assumes `assistant` is in scope)
FILLERS = ['One moment...', 'Let me check that...', 'Sure, give me a second...']

async def on_user_speech_committed(text: str):
    # Start the filler immediately; the LLM response will follow naturally
    await assistant.say(random.choice(FILLERS), allow_interruptions=True)

10. Deploying Your Voice Agent

Local development is one thing. Production means your agent needs to run 24/7, handle multiple concurrent rooms, and restart gracefully when it crashes.

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN apt-get update && apt-get install -y ffmpeg libsndfile1 && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pre-download Silero VAD weights at build time
RUN python -c 'from livekit.plugins import silero; silero.VAD.load()'

COPY . .

CMD ["python", "agent.py", "start"]

# docker-compose.yml
services:
  agent:
    build: .
    environment:
      - LIVEKIT_URL=wss://your-project.livekit.cloud
      - LIVEKIT_API_KEY=your-api-key
      - LIVEKIT_API_SECRET=your-api-secret
      - OPENAI_API_KEY=your-openai-key
    restart: unless-stopped
    deploy:
      replicas: 3   # Three workers handle 3 concurrent rooms

Scaling

Each agent worker process handles one room at a time. To handle N concurrent rooms, run N worker processes. LiveKit's job dispatching handles assignment automatically — workers register with the server and receive jobs as rooms are created.

For auto-scaling on Kubernetes, expose a metric for active_rooms and use an HPA to scale worker replicas based on utilisation. Aim for 70–80% utilisation to leave headroom for burst demand.

Cost modelling before launch

At 100 concurrent voice sessions: LiveKit Cloud bandwidth ~$50/day, Deepgram STT ~$35/day, GPT-4o-mini ~$20/day, ElevenLabs TTS ~$40/day. Total: ~$145/day or ~$1.45 per active session-hour. Price your product accordingly — most B2C voice products charge $20–50/month for 5–10 hours of voice time.
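The arithmetic behind those figures, written out. The per-provider numbers are this article's estimates, and the $1.45 result assumes roughly 100 active session-hours per day:

```python
# Daily cost estimates from the text, in USD, at ~100 active session-hours/day
DAILY_COSTS = {
    'livekit_bandwidth': 50,
    'deepgram_stt': 35,
    'gpt_4o_mini': 20,
    'elevenlabs_tts': 40,
}

def cost_per_session_hour(daily_costs: dict[str, float],
                          session_hours: float = 100) -> float:
    """Total daily spend divided by active session-hours."""
    return sum(daily_costs.values()) / session_hours

# Total: $145/day, i.e. $1.45 per active session-hour under this assumption
```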

11. Advanced Patterns Worth Knowing

Speaker diarization

If your agent joins a room with multiple human participants, you need to know who said what. pyannote.audio provides state-of-the-art speaker diarization.

Multi-language support

Whisper's language detection is excellent. Route detected language to the appropriate TTS voice and translation service.

SIP and telephony

LiveKit's SIP server lets your agent receive and make phone calls through any SIP trunk provider (Twilio, Telnyx), connecting it to the global telephone network.

RAG for knowledge

Use pgvector or Qdrant with sentence-transformers embeddings, and inject retrieved chunks into the LLM context before each response.


Where to Go From Here

Real-time voice AI has crossed the threshold from impressive demo to practical technology. The stack we have covered — LiveKit, Whisper, an LLM, a streaming TTS — is what production voice agents are built on today. The pieces are all open-source or available as affordable APIs. The barrier is now knowledge and execution, not technology access.

The highest-leverage next step is to build something. A voice agent that does one thing well — answers questions about a product, books appointments, tutors on a subject — is far more valuable than a feature-complete prototype that does nothing well.

Ready to build production voice AI?

Our full course — 8 modules, 30+ lessons, 6 deployable projects — takes you from these fundamentals to shipping a real product.

View Course Curriculum
Voice AI • LiveKit • Whisper • LLM • WebRTC • Real-Time AI • Python • AI Agents

Published by Celoris | celoris.in | Your Creative Learning Platform

© 2026 Celoris.in • All Rights Reserved
