Voice AI / Real-Time Agent

IntelliCatalog

A real-time voice and chat assistant for Electronic Parts Catalogs (EPC) built with LiveKit Agents. Find parts, check availability, navigate catalogs, and place orders using natural language.

Stack · Python · LiveKit · Deepgram Nova-3 · GPT-4o · Cartesia Sonic-3 · WebRTC Year · 2026 Status · Functional — full voice pipeline operational

01Problem

Automotive and industrial parts catalogs (EPCs) are notoriously complex. Technicians, mechanics, and parts dealers spent minutes navigating clunky search interfaces, filtering through obscure part numbers, and cross-referencing availability across multiple systems. The friction of traditional catalog browsing — typing exact part numbers, tabbing between categories, and manually checking stock — slows down every repair or order.

The core question: could a voice-first AI agent make parts catalog interaction as fast and natural as asking a colleague? Not just searching, but understanding intent, refining across turns, and taking action — all in real time.

02Architecture

The system routes every user interaction through a LiveKit WebRTC room that chains speech-to-text, LLM reasoning with function calling, and text-to-speech into a seamless real-time pipeline.

User (Voice / Chat)
    │
    ▼
┌────────────────────────────────────────────────┐
│              LiveKit Room (WebRTC)              │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐ │
│  │  STT     │  │  LLM     │  │  TTS         │ │
│  │ Deepgram │─▶│ GPT-4o   │─▶│ Cartesia     │ │
│  │ Nova-3   │  │          │  │ Sonic-3      │ │
│  └──────────┘  └────┬─────┘  └──────────────┘ │
│                     │                           │
│              ┌──────▼──────┐                    │
│              │  Function    │                    │
│              │  Calling     │                    │
│              │  ┌────────┐  │                    │
│              │  │search  │  │                    │
│              │  │parts() │  │                    │
│              │  │details()│ │                    │
│              │  │cart()  │  │                    │
│              │  │avail() │  │                    │
│              │  └────────┘  │                    │
│              └──────────────┘                    │
└────────────────────────────────────────────────┘
    │
    ▼
┌────────────────────────────────────────────────┐
│              Parts Catalog (EPC)                │
│  • Part search & lookup                         │
│  • Availability & supersession                  │
│  • Cart & ordering                              │
│  • Category navigation                          │
│  • Multi-turn context                           │
└────────────────────────────────────────────────┘

Design Decision

LiveKit was chosen over raw WebRTC or a simple HTTP pipeline because it provides managed STT/TTS agent integration out of the box, handles room lifecycle and reconnection, and lets us swap model providers without touching transport code.

03Tech Stack

Component	Chosen Technology	Role
Orchestration	Python	Application logic, agent pipeline, state management
Real-Time Transport	LiveKit Agents	WebRTC communication, room management, agent lifecycle
Speech-to-Text	Deepgram Nova-3	Low-latency speech recognition with high accuracy
Language Model	GPT-4o	Natural language understanding, intent classification, function calling
Text-to-Speech	Cartesia Sonic-3	Expressive, natural-sounding voice output
Function Calling	OpenAI Function API	Search parts, get details, manage cart, check availability

Why This Stack

Each component was chosen for latency. Deepgram Nova-3 delivers sub-300ms STT. GPT-4o classifies intent and triggers function calls on the same turn. Cartesia Sonic-3 streams TTS at speeds that don't bottleneck the audio pipeline. Together they keep the voice round trip under a second.

04Key Challenges

1. Real-time latency

Voice conversations require sub-300ms STT → LLM → TTS round trip. Optimizing each stage without sacrificing quality was a tight constraint. We tuned Deepgram's endpointing sensitivity, reduced LLM response tokens via prompt compression, and streamed TTS in parallel with LLM response generation.

2. Function calling reliability

The LLM must correctly parse part queries and call the right catalog functions. A false positive — calling the wrong function or hallucinating a part number — would show incorrect results. We solved this with strict function schemas, few-shot examples in the system prompt, and validation layers that reject unsupported function calls before reaching the catalog API.

3. Multi-turn context

Users frequently refine searches across turns: "Find oil filters" → "for Traxion 5200" → "how much?". Each turn must carry context. We maintain a conversation history scoped to the LiveKit room session and inject it into every LLM request, with an automatic summary mechanism to prevent context window overflow on long sessions.

4. WebRTC complexity

LiveKit handles the WebRTC transport, but configuring STT/TTS agent pipelines and managing room lifecycle added complexity. Audio format mismatches, network jitter, and stream synchronization all had to be tested across different browsers and network conditions.

Lesson Learned

Building a test harness that simulated partial audio, network degradation, and rapid-turn interactions was invaluable. Real-time voice agents behave differently than request-response APIs — you can't just unit-test the LLM and call it done.

⚡ Code Highlight

The LiveKit agent pipeline wires STT → LLM with function calling → TTS into a real-time voice loop.

async def entrypoint(ctx: JobContext):
    # Join the LiveKit room
    room = await ctx.connect()

    # Set up the voice pipeline agent
    agent = VoicePipelineAgent(
        vad = ctx.create_vad(),
        stt = DeepgramSTT(language="en-US"),
        llm = OpenAILLM(model="gpt-4o"),
        tts = CartesiaTTS(voice="79a125e8"),
    )

    # Register catalog functions
    @agent.on("search_parts")
    async def search_parts(query: str) -> list[Part]:
        """Search the EPC catalog by description or part number."""
        return await catalog.search(query)

    await agent.start(room)
    await agent.say("Welcome to IntelliCatalog. Say a part name to begin.")

05Results

3 Pipeline: STT → LLM → TTS

Multi-Turn Context across interactions

Voice+Chat Dual-mode interface

Real-Time WebRTC sub-second audio

Complete voice-first EPC assistant with a chat fallback mode — users can speak or type interchangeably within the same session.
Natural language part search — "find the oil filter for a 2022 Traxion 5200" returns the correct part without needing the exact part number.
Availability checking and cart management — users can ask "is it in stock?" or "add two to my cart" and the agent handles the entire lifecycle.
Real-time voice with Deepgram Nova-3 (STT) + Cartesia Sonic-3 (TTS) delivering natural, low-latency conversations.
Deployable as a LiveKit agent — runs anywhere LiveKit runs, from a local machine to a production Kubernetes cluster.

06What I Learned

This project was a deep dive into the practical realities of building production-grade voice AI. Some takeaways:

Latency is a product feature. A voice agent that pauses for 800ms feels broken. Every millisecond in the pipeline matters — from model selection to prompt design to streaming strategy.
Voice-first uncovers usability patterns that chat doesn't. Users speak differently than they type — more fragmentary, more VUI-style commands. The system prompt and function calling had to be tuned for this different input modality.
WebRTC and audio pipelines demand tooling. Debugging a voice call is harder than debugging an HTTP request. We invested in room logging, audio file capture for replay, and structured event tracing.
Function calling is the bridge. The LLM is powerful only if it can reliably drive real-world actions. Getting function calling right — schemas, validation, error recovery — was the most important engineering work on the project.

What's Next

Multilingual support, agentic navigation (the LLM drives multi-step workflows like "check stock for all Traxion 5000 filters and add the cheapest to cart"), and a WebSocket-based chat fallback for environments without microphone access.