Audio AI / TTS

VoxCraft

Enterprise-grade text-to-speech studio generating 44.1 kHz studio-quality voiceovers. 10 voice styles, 31 languages, 100% local.

📅 2026 ⚡ Operational 🐍 Python + FastAPI 🔊 Supertonic 3
Frontend (WaveSurfer.js + Vanilla JS)
    │ HTTP
Backend (FastAPI + Python 3.14)
    ├── Supertonic 3 Engine (99M ONNX)
    │   ├── Voice Models (M1-M5, F1-F5)
    │   └── Expression Parser (<laugh>, <whisper>, etc.)
    └── Generation History (SQLite)
100% Local — No API Calls
  

01 Problem

Modern voiceover production relies on a patchwork of cloud TTS services — each with per-character billing, network latency, opaque privacy policies, and limited control over voice parameters. Content creators, indie studios, and accessibility tool builders needed a solution that was fast, private, and fully controllable without recurring API costs.

VoxCraft was built to prove that a 100% local TTS engine — powered by a compact 99M parameter neural model — could deliver broadcast-ready 44.1 kHz audio across dozens of languages and voice profiles, all from a single laptop.

02 Architecture

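The system is split into two layers connected over HTTP, both running on-device: a browser frontend (WaveSurfer.js + Vanilla JS) and a FastAPI backend that hosts the Supertonic 3 engine, its voice models and expression parser, and the SQLite generation history.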

The Expression Parser sits between the user's input text and the model's tokenizer, converting human-friendly tags like <laugh> into Supertonic control tokens while preserving natural prosody across the surrounding sentence.

03 Tech Stack

Layer            | Technology                     | Notes
Runtime          | Python 3.14                    | Latest CPython with TOML-native config, free-threaded compatible
API Server       | FastAPI                        | Async endpoints for generation, history CRUD, voice listing
Inference Engine | Supertonic 3 (99M params ONNX) | 31 languages, no GPU required, ~300 ms inference
Frontend         | WaveSurfer.js + Vanilla JS     | Waveform rendering, playback, region selection
Storage          | SQLite                         | Generation history with full-text search
Audio Format     | 44.1 kHz 16-bit WAV            | Studio-standard PCM, cross-platform consistent
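The Storage row mentions generation history with full-text search. One way this could look with SQLite's FTS5 extension is sketched below. The table name, column names, and helper functions are assumptions for illustration, not VoxCraft's actual schema, and the sketch requires an SQLite build with FTS5 enabled.

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create a hypothetical FTS5-backed history table."""
    conn = sqlite3.connect(path)
    # Only the input text is indexed for search; the other columns are stored as-is.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS history "
        "USING fts5(text, voice UNINDEXED, lang UNINDEXED, wav_path UNINDEXED)"
    )
    return conn


def add_generation(conn: sqlite3.Connection, text: str, voice: str,
                   lang: str, wav_path: str) -> int:
    """Record one generation and return its row id."""
    cur = conn.execute(
        "INSERT INTO history (text, voice, lang, wav_path) VALUES (?, ?, ?, ?)",
        (text, voice, lang, wav_path),
    )
    conn.commit()
    return cur.lastrowid


def search_history(conn: sqlite3.Connection, query: str) -> list[tuple]:
    """Return matching generations, best match first."""
    return conn.execute(
        "SELECT rowid, text, voice, lang, wav_path FROM history "
        "WHERE history MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```

The query string is interpreted using FTS5 query syntax, so raw user input containing quotes or operators would need escaping before it reaches `search_history`.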

04 Challenges

Real-time generation vs quality

The model's denoising steps are the primary control knob for audio quality. Fewer steps deliver sub-second previews but introduce audible artifacts; more steps produce studio-grade output at the cost of latency. We solved this with a Draft → Studio quality toggle: draft mode uses minimal steps for rapid iteration, studio mode runs the full pipeline and caches the result.

Key Insight

Separating preview from final quality let users iterate 5× faster without sacrificing the final render. The same model handles both — just a parameter change.
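The Draft → Studio toggle can be sketched as a thin wrapper that maps a mode name to a step count and caches only studio renders. The step values 5 and 12 come from the documented quality range. The `QualityToggle` class, the `cache_key` helper, and the in-memory dict cache are assumptions for illustration, not VoxCraft's actual code.

```python
import hashlib
from typing import Any, Callable

# Step counts taken from the documented quality range: 5 (draft) to 12 (studio).
QUALITY_STEPS = {"draft": 5, "studio": 12}


def cache_key(text: str, voice: str, lang: str, speed: float, steps: int) -> str:
    """Hash every parameter that affects the rendered audio."""
    payload = "\x1f".join([text, voice, lang, f"{speed:.3f}", str(steps)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class QualityToggle:
    """Hypothetical wrapper: drafts are re-rendered, studio renders are cached."""

    def __init__(self, synthesize: Callable[..., Any]) -> None:
        self._synthesize = synthesize  # e.g. the engine's synthesize method
        self._cache: dict[str, Any] = {}

    def render(self, text: str, mode: str = "draft", voice: str = "M1",
               lang: str = "en", speed: float = 1.05) -> Any:
        steps = QUALITY_STEPS[mode]  # raises KeyError for an unknown mode
        kwargs = dict(text=text, voice=voice, lang=lang, speed=speed, quality=steps)
        if mode == "draft":
            # Drafts are cheap and change often, so they bypass the cache.
            return self._synthesize(**kwargs)
        key = cache_key(text, voice, lang, speed, steps)
        if key not in self._cache:
            self._cache[key] = self._synthesize(**kwargs)
        return self._cache[key]
```

Because the step count is part of the cache key, a studio render is never served in place of a draft or vice versa, and changing any parameter forces a fresh render.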

Expression tag parsing

Tags like <laugh>, <whisper>, and <sigh> must be converted to Supertonic control tokens without breaking the rhythm or prosody of the surrounding sentence. Early attempts produced robotic transitions or swallowed adjacent punctuation.

The solution: a two-pass parser that first expands tags to token-level control sequences, then runs a prosody-preservation pass that adjusts duration and pitch contours at tag boundaries.
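A minimal sketch of the first pass (tag expansion) follows. The control-token strings are invented placeholders, since the real Supertonic token vocabulary is not shown here, and the segment representation is an assumption. The second, prosody-preservation pass depends on model internals and is not sketched.

```python
import re

# Placeholder token names: the actual Supertonic control tokens are not documented here.
CONTROL_TOKENS = {
    "laugh": "[CTRL_LAUGH]",
    "whisper": "[CTRL_WHISPER]",
    "sigh": "[CTRL_SIGH]",
}
TAG_RE = re.compile(r"<(\w+)>")


def expand_tags(text: str) -> list[tuple[str, str]]:
    """Split input into ("text", ...) and ("control", ...) segments.

    Surrounding text, including punctuation and spacing, is kept verbatim
    so that a later pass can reason about what sits next to each tag.
    """
    segments: list[tuple[str, str]] = []
    pos = 0
    for match in TAG_RE.finditer(text):
        name = match.group(1).lower()
        if name not in CONTROL_TOKENS:
            continue  # unknown tags are left inside the surrounding text
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        segments.append(("control", CONTROL_TOKENS[name]))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments
```

Keeping adjacent punctuation inside the text segments, rather than trimming around tags, is one way to avoid the swallowed-punctuation problem described above.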

Technical Detail

Tags are treated as phoneme-level interrupts rather than sentence breaks. The parser inserts a short fade window (5 ms) around each tag boundary to prevent audible clicks.
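To make the 5 ms figure concrete: at 44.1 kHz it corresponds to 220 samples on each side of a boundary. The sketch below applies a linear fade-out and fade-in around a boundary index in a list of float samples. It assumes the fade is applied in the sample domain with a linear ramp; the source does not say which domain or curve VoxCraft uses, and both function names are hypothetical.

```python
def fade_samples(sample_rate: int = 44100, fade_ms: float = 5.0) -> int:
    """Number of whole samples covered by the fade window."""
    return int(sample_rate * fade_ms / 1000)


def apply_boundary_fade(samples: list[float], boundary: int,
                        sample_rate: int = 44100,
                        fade_ms: float = 5.0) -> list[float]:
    """Attenuate audio on both sides of `boundary` with a linear ramp.

    Samples closest to the boundary get the lowest gain; the gain reaches
    1.0 at the outer edge of the window. The input list is not modified.
    """
    n = fade_samples(sample_rate, fade_ms)
    out = list(samples)
    for i in range(n):
        gain = (i + 1) / n            # 1/n next to the boundary, 1.0 at the edge
        left = boundary - 1 - i       # walking backwards from the boundary
        right = boundary + i          # walking forwards from the boundary
        if 0 <= left < len(out):
            out[left] *= gain
        if 0 <= right < len(out):
            out[right] *= gain
    return out
```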

WaveSurfer.js sync

Rendering a 44.1 kHz WAV waveform in the browser without dropping frames requires careful buffer management. WaveSurfer.js expects pre-decoded PCM data; sending raw 16-bit samples over HTTP and decoding on the client introduced jank on larger generations.
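One common way to reduce this kind of jank is to downsample the audio to a small array of peak values on the server, so the browser draws the waveform without decoding every sample. The sketch below shows that reduction for 16-bit little-endian mono PCM. It illustrates the general technique and is not confirmed as the approach VoxCraft took; `compute_peaks` and its bucket count are assumptions.

```python
import struct


def compute_peaks(pcm: bytes, buckets: int = 1000) -> list[float]:
    """Reduce 16-bit little-endian mono PCM to per-bucket peaks in [0, 1]."""
    count = len(pcm) // 2
    if count == 0:
        return []
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    buckets = min(buckets, count)  # never more buckets than samples
    peaks = []
    for b in range(buckets):
        start = b * count // buckets
        end = (b + 1) * count // buckets
        peak = max(abs(s) for s in samples[start:end])
        peaks.append(round(peak / 32768, 4))
    return peaks
```

A 30-second clip at 44.1 kHz holds about 1.3 million samples, so 1000 peaks is a far smaller payload for drawing. The resulting list could then be handed to the frontend as precomputed waveform data, while the full WAV is fetched separately for playback.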

Cross-platform WAV consistency

WAV files generated on macOS vs Linux had subtle header differences (chunk sizes, byte ordering) that broke playback on certain players. We standardised on a canonical WAV writer that writes RIFF headers by hand, guaranteeing bit-identical output across platforms.

Lesson

Never trust platform stdlib WAV writers for production audio. A 50-line manual RIFF writer eliminates a surprising class of heisenbugs.
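A hand-written RIFF header for 16-bit PCM is short enough to show in full. The sketch below builds the canonical 44-byte header with explicit little-endian packing, so the output depends only on the input bytes and parameters. The function name and defaults are assumptions, and VoxCraft's actual writer is not shown in the source.

```python
import struct


def write_wav_bytes(pcm: bytes, sample_rate: int = 44100, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM in a canonical 44-byte RIFF/WAVE header."""
    bits_per_sample = 16
    block_align = channels * bits_per_sample // 8   # bytes per frame
    byte_rate = sample_rate * block_align           # bytes per second
    if len(pcm) % block_align:
        raise ValueError("PCM length is not a whole number of frames")
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",       # "<" forces little-endian, no padding
        b"RIFF", 36 + len(pcm), b"WAVE",
        b"fmt ", 16,                # fmt chunk is always 16 bytes for PCM
        1,                          # audio format 1 = uncompressed PCM
        channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", len(pcm),
    )
    return header + pcm
```

Writing only the `fmt ` and `data` chunks, with no optional metadata chunks, keeps the layout fixed, which is what makes byte-for-byte comparison across platforms practical.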

Code Highlight

The TTS engine wraps Supertonic 3 with clamped parameters, voice caching, and expression tag support — keeping the generation API simple while exposing full control.

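import numpy as np

# Method of the TTS engine wrapper class (class body omitted in this excerpt).
def synthesize(
    self, text: str, voice: str = "M1",
    lang: str = "en", speed: float = 1.05,
    quality: int = 8,
) -> tuple[np.ndarray, float]:
    """Synthesize text to 44.1 kHz audio samples.

    Args:
        text:    Input with optional <laugh>, <whisper> tags
        voice:   M1-M5 (male) or F1-F5 (female)
        lang:    One of the 31 supported languages
        speed:   0.7–2.0x without pitch distortion
        quality: Denoising steps, 5 (draft) – 12 (studio)

    Returns:
        Tuple of (audio samples, duration of the clip).

    Raises:
        ValueError: If text is empty or whitespace-only.
    """
    if not text.strip():
        raise ValueError("Text cannot be empty")

    # Resolve the voice profile, then clamp numeric parameters
    # to their supported ranges instead of rejecting them.
    voice_style = self._get_voice(voice)
    quality = max(5, min(12, quality))
    speed = max(0.7, min(2.0, speed))

    # Delegate to the underlying Supertonic engine; quality maps to step count.
    wav, duration = self._tts.synthesize(
        text=text, lang=lang,
        voice_style=voice_style,
        total_steps=quality, speed=speed,
    )
    return wav, float(duration[0])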

05 Results

VoxCraft ships as a fully self-contained desktop studio with no network or cloud-service dependencies at inference time. The application has been validated on macOS and Linux with consistent output quality.

10 Voice Profiles: M1–M5, F1–F5
31 Languages: Global coverage
44.1 kHz Studio Quality: 16-bit WAV
100% Local: No API calls

06 Lessons Learned

Lesson #1 — Local-first doesn't mean simple

Shipping a local ML application means owning the entire stack — model loading, device management, audio encoding, cross-platform file I/O. The complexity you'd normally outsource to a cloud API becomes your responsibility, but the payoff in privacy and latency is enormous.

Lesson #2 — Quality is a UX parameter, not a model knob

Users don't want to tweak denoising steps or sample rates. They want "preview" and "final". Mapping technical knobs to human-meaningful settings (Draft/Studio) made the tool approachable for non-technical users without hiding power from engineers.

Lesson #3 — Small models are underrated

99M parameters is tiny by modern LLM standards, yet Supertonic 3 produces TTS quality that rivals models 10× its size — on CPU. The efficiency of the architecture (attention-free feed-forward networks) meant we could run on a MacBook Air without a GPU. Not every problem needs a trillion-parameter model.