Voice AI / Real-Time Agent

IntelliCatalog

A real-time voice and chat assistant for Electronic Parts Catalogs (EPC) built with LiveKit Agents. Find parts, check availability, navigate catalogs, and place orders using natural language.

Stack · Python · LiveKit · Deepgram Nova-3 · GPT-4o · Cartesia Sonic-3 · WebRTC Year · 2026 Status · Functional — full voice pipeline operational

01Problem

Automotive and industrial parts catalogs (EPCs) are notoriously complex. Technicians, mechanics, and parts dealers spent minutes navigating clunky search interfaces, filtering through obscure part numbers, and cross-referencing availability across multiple systems. The friction of traditional catalog browsing — typing exact part numbers, tabbing between categories, and manually checking stock — slows down every repair or order.

The core question: could a voice-first AI agent make parts catalog interaction as fast and natural as asking a colleague? Not just searching, but understanding intent, refining across turns, and taking action — all in real time.

02Architecture

The system routes every user interaction through a LiveKit WebRTC room that chains speech-to-text, LLM reasoning with function calling, and text-to-speech into a seamless real-time pipeline.

User (Voice / Chat)
    │
    ▼
┌────────────────────────────────────────────────┐
│              LiveKit Room (WebRTC)              │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐ │
│  │  STT     │  │  LLM     │  │  TTS         │ │
│  │ Deepgram │─▶│ GPT-4o   │─▶│ Cartesia     │ │
│  │ Nova-3   │  │          │  │ Sonic-3      │ │
│  └──────────┘  └────┬─────┘  └──────────────┘ │
│                     │                           │
│              ┌──────▼──────┐                    │
│              │  Function    │                    │
│              │  Calling     │                    │
│              │  ┌────────┐  │                    │
│              │  │search  │  │                    │
│              │  │parts() │  │                    │
│              │  │details()│ │                    │
│              │  │cart()  │  │                    │
│              │  │avail() │  │                    │
│              │  └────────┘  │                    │
│              └──────────────┘                    │
└────────────────────────────────────────────────┘
    │
    ▼
┌────────────────────────────────────────────────┐
│              Parts Catalog (EPC)                │
│  • Part search & lookup                         │
│  • Availability & supersession                  │
│  • Cart & ordering                              │
│  • Category navigation                          │
│  • Multi-turn context                           │
└────────────────────────────────────────────────┘
Design Decision

LiveKit was chosen over raw WebRTC or a simple HTTP pipeline because it provides managed STT/TTS agent integration out of the box, handles room lifecycle and reconnection, and lets us swap model providers without touching transport code.

03Tech Stack

Component Chosen Technology Role
Orchestration Python Application logic, agent pipeline, state management
Real-Time Transport LiveKit Agents WebRTC communication, room management, agent lifecycle
Speech-to-Text Deepgram Nova-3 Low-latency speech recognition with high accuracy
Language Model GPT-4o Natural language understanding, intent classification, function calling
Text-to-Speech Cartesia Sonic-3 Expressive, natural-sounding voice output
Function Calling OpenAI Function API Search parts, get details, manage cart, check availability
Why This Stack

Each component was chosen for latency. Deepgram Nova-3 delivers sub-300ms STT. GPT-4o classifies intent and triggers function calls on the same turn. Cartesia Sonic-3 streams TTS at speeds that don't bottleneck the audio pipeline. Together they keep the voice round trip under a second.

04Key Challenges

1. Real-time latency

Voice conversations require sub-300ms STT → LLM → TTS round trip. Optimizing each stage without sacrificing quality was a tight constraint. We tuned Deepgram's endpointing sensitivity, reduced LLM response tokens via prompt compression, and streamed TTS in parallel with LLM response generation.

2. Function calling reliability

The LLM must correctly parse part queries and call the right catalog functions. A false positive — calling the wrong function or hallucinating a part number — would show incorrect results. We solved this with strict function schemas, few-shot examples in the system prompt, and validation layers that reject unsupported function calls before reaching the catalog API.

3. Multi-turn context

Users frequently refine searches across turns: "Find oil filters""for Traxion 5200""how much?". Each turn must carry context. We maintain a conversation history scoped to the LiveKit room session and inject it into every LLM request, with an automatic summary mechanism to prevent context window overflow on long sessions.

4. WebRTC complexity

LiveKit handles the WebRTC transport, but configuring STT/TTS agent pipelines and managing room lifecycle added complexity. Audio format mismatches, network jitter, and stream synchronization all had to be tested across different browsers and network conditions.

Lesson Learned

Building a test harness that simulated partial audio, network degradation, and rapid-turn interactions was invaluable. Real-time voice agents behave differently than request-response APIs — you can't just unit-test the LLM and call it done.

Code Highlight

The LiveKit agent pipeline wires STT → LLM with function calling → TTS into a real-time voice loop.

async def entrypoint(ctx: JobContext):
    # Join the LiveKit room
    room = await ctx.connect()

    # Set up the voice pipeline agent
    agent = VoicePipelineAgent(
        vad = ctx.create_vad(),
        stt = DeepgramSTT(language="en-US"),
        llm = OpenAILLM(model="gpt-4o"),
        tts = CartesiaTTS(voice="79a125e8"),
    )

    # Register catalog functions
    @agent.on("search_parts")
    async def search_parts(query: str) -> list[Part]:
        """Search the EPC catalog by description or part number."""
        return await catalog.search(query)

    await agent.start(room)
    await agent.say("Welcome to IntelliCatalog. Say a part name to begin.")

05Results

3 Pipeline: STT → LLM → TTS
Multi-Turn Context across interactions
Voice+Chat Dual-mode interface
Real-Time WebRTC sub-second audio

06What I Learned

This project was a deep dive into the practical realities of building production-grade voice AI. Some takeaways:

What's Next

Multilingual support, agentic navigation (the LLM drives multi-step workflows like "check stock for all Traxion 5000 filters and add the cheapest to cart"), and a WebSocket-based chat fallback for environments without microphone access.