Audio AI / TTS

VoxCraft

Enterprise-grade text-to-speech studio generating 44.1 kHz studio-quality voiceovers. 10 voice styles, 31 languages, 100% local.

📅 2026 ⚡ Operational 🐍 Python + FastAPI 🔊 Supertonic 3

Frontend (WaveSurfer.js + Vanilla JS)
    │ HTTP
Backend (FastAPI + Python 3.14)
    ├── Supertonic 3 Engine (99M ONNX)
    │   ├── Voice Models (M1-M5, F1-F5)
    │   └── Expression Parser (<laugh>, <whisper>, etc.)
    └── Generation History (SQLite)
100% Local — No API Calls

01 Problem

Modern voiceover production relies on a patchwork of cloud TTS services — each with per-character billing, network latency, opaque privacy policies, and limited control over voice parameters. Content creators, indie studios, and accessibility tool builders needed a solution that was fast, private, and fully controllable without recurring API costs.

VoxCraft was built to prove that a 100% local TTS engine — powered by a compact 99M parameter neural model — could deliver broadcast-ready 44.1 kHz audio across dozens of languages and voice profiles, all from a single laptop.

02 Architecture

The system is split into two layers connected over HTTP, keeping everything on-device:

Backend (FastAPI + Python 3.14) — Loads the Supertonic 3 ONNX model, manages voice profiles, parses expression tags, and serves the generation API. A SQLite store keeps a searchable history of every generation.
Frontend (WaveSurfer.js + Vanilla JS) — A framework-free studio UI (WaveSurfer.js is the only library) with waveform display, voice/language selection, speed control, draft/studio quality toggle, and batch mode. All rendering is client-side.

The Expression Parser sits between the user's input text and the model's tokenizer, converting human-friendly tags like <laugh> into Supertonic control tokens while preserving natural prosody across the surrounding sentence.
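The tag-expansion step can be sketched roughly as follows. This is a minimal illustration, not the project's parser: the control-token names (`[CTRL_LAUGH]` and so on) and the `expand_tags` helper are hypothetical, since the real Supertonic token vocabulary is not shown here. The sketch only covers splitting the input into text and control segments; the prosody adjustments are out of scope.

```python
import re

# Hypothetical tag -> control-token table; the real Supertonic 3 token
# names are not documented in this write-up.
CONTROL_TOKENS = {
    "laugh": "[CTRL_LAUGH]",
    "whisper": "[CTRL_WHISPER]",
    "sigh": "[CTRL_SIGH]",
}

TAG_PATTERN = re.compile(r"<(\w+)>")


def expand_tags(text: str) -> list[tuple[str, str]]:
    """Split text into ("text", ...) and ("control", ...) segments.

    Unknown tags are kept as literal text so nothing is silently dropped,
    and punctuation next to a tag stays in its text segment.
    """
    segments: list[tuple[str, str]] = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        token = CONTROL_TOKENS.get(match.group(1).lower())
        if token is None:
            segments.append(("text", match.group(0)))
        else:
            segments.append(("control", token))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments
```

Keeping text and control segments separate means a later pass can inspect what sits on either side of each tag without re-parsing the string.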

03 Tech Stack

Layer	Technology	Notes
Runtime	Python 3.14	CPython with standard-library TOML parsing for config; compatible with the free-threaded build
API Server	FastAPI	Async endpoints for generation, history CRUD, voice listing
Inference Engine	Supertonic 3 (99M params ONNX)	31 languages, no GPU required, ~300 ms inference
Frontend	WaveSurfer.js + Vanilla JS	Waveform rendering, playback, region selection
Storage	SQLite	Generation history with full-text search
Audio Format	44.1 kHz 16-bit WAV	Studio-standard PCM, cross-platform consistent

04 Challenges

Real-time generation vs quality

The model's denoising steps are the primary control knob for audio quality. Fewer steps deliver sub-second previews but introduce audible artifacts; more steps produce studio-grade output at the cost of latency. We solved this with a Draft → Studio quality toggle: draft mode uses minimal steps for rapid iteration, studio mode runs the full pipeline and caches the result.

Key Insight

Separating preview from final quality let users iterate 5× faster without sacrificing the final render. The same model handles both — just a parameter change.
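One way to express the toggle in code is a small preset table plus a cache that only stores studio renders. The step counts (5 and 12) come from the `quality` range documented in the code highlight; the `PresetRenderer` class and the `render` callable are hypothetical stand-ins for the real engine, and a production cache key would presumably also include voice, language, and speed.

```python
from typing import Callable

# Step counts taken from the documented quality range
# (5 = draft, 12 = studio); preset names mirror the UI toggle.
QUALITY_PRESETS = {"draft": 5, "studio": 12}


class PresetRenderer:
    """Maps a UI preset to a step count and caches studio renders only."""

    def __init__(self, render: Callable[[str, int], bytes]) -> None:
        # `render` is a hypothetical stand-in for the real engine call.
        self._render = render
        self._studio_cache: dict[str, bytes] = {}

    def generate(self, text: str, preset: str = "draft") -> bytes:
        steps = QUALITY_PRESETS[preset]  # raises KeyError on an unknown preset
        if preset != "studio":
            # Drafts are cheap and disposable, so they are never cached.
            return self._render(text, steps)
        if text not in self._studio_cache:
            self._studio_cache[text] = self._render(text, steps)
        return self._studio_cache[text]
```

Caching only the studio pass keeps memory use tied to finished renders rather than to every preview a user discards.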

Expression tag parsing

Tags like <laugh>, <whisper>, and <sigh> must be converted to Supertonic control tokens without breaking the rhythm or prosody of the surrounding sentence. Early attempts produced robotic transitions or swallowed adjacent punctuation.

The solution: a two-pass parser that first expands tags to token-level control sequences, then runs a prosody-preservation pass that adjusts duration and pitch contours at tag boundaries.

Technical Detail

Tags are treated as phoneme-level interrupts rather than sentence breaks. The parser inserts a short fade window (5 ms) around each tag boundary to prevent audible clicks.
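A rough sketch of such a fade, assuming the 5 ms applies on each side of the boundary and that a linear ramp is good enough (neither detail is specified above). The `fade_boundary` helper is hypothetical and works on a mono sample array.

```python
import numpy as np

SAMPLE_RATE = 44_100
FADE_MS = 5  # fade length quoted in the text


def fade_boundary(audio: np.ndarray, boundary: int,
                  sample_rate: int = SAMPLE_RATE,
                  fade_ms: float = FADE_MS) -> np.ndarray:
    """Return a copy of `audio` that ramps down to silence just before
    sample index `boundary` and ramps back up just after it."""
    n = int(sample_rate * fade_ms / 1000)  # 220 samples at 44.1 kHz
    out = audio.astype(np.float32, copy=True)
    start = max(0, boundary - n)
    end = min(len(out), boundary + n)
    # Fade out: gain goes from 1.0 towards 0.0 approaching the boundary.
    out[start:boundary] *= np.linspace(1.0, 0.0, boundary - start, endpoint=False)
    # Fade in: gain is exactly 0.0 at the boundary, then climbs towards 1.0.
    out[boundary:end] *= np.linspace(0.0, 1.0, end - boundary, endpoint=False)
    return out
```

Because the gain reaches zero exactly at the boundary sample, any discontinuity between the two segments is masked rather than played back as a click.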

WaveSurfer.js sync

Rendering a 44.1 kHz WAV waveform in the browser without dropping frames requires careful buffer management. WaveSurfer.js expects pre-decoded PCM data; sending raw 16-bit samples over HTTP and decoding on the client introduced jank on larger generations.

Cross-platform WAV consistency

WAV files generated on macOS vs Linux had subtle header differences (chunk sizes, byte ordering) that broke playback on certain players. We standardised on a canonical WAV writer that writes RIFF headers by hand, guaranteeing bit-identical output across platforms.

Lesson

Never trust platform stdlib WAV writers for production audio. A 50-line manual RIFF writer eliminates a surprising class of heisenbugs.
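For illustration, a hand-written RIFF writer for the 44.1 kHz 16-bit case might look like the sketch below. It is not the project's actual writer: `write_wav` is a hypothetical name, and the sketch assumes mono float samples in the range [-1, 1]. Every multi-byte field is packed explicitly as little-endian, so the output does not depend on the host platform.

```python
import struct

import numpy as np


def write_wav(path: str, samples: np.ndarray, sample_rate: int = 44_100) -> None:
    """Write mono float samples in [-1, 1] as a canonical 16-bit PCM WAV."""
    # Convert to little-endian int16 explicitly so host byte order never matters.
    pcm = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2").tobytes()

    channels, bits = 1, 16
    block_align = channels * bits // 8     # bytes per sample frame
    byte_rate = sample_rate * block_align  # bytes per second

    header = b"".join([
        # RIFF chunk: size covers everything after these first 8 bytes.
        b"RIFF", struct.pack("<I", 36 + len(pcm)), b"WAVE",
        # fmt chunk: 16-byte body, format tag 1 = uncompressed PCM.
        b"fmt ", struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                             byte_rate, block_align, bits),
        # data chunk: raw PCM payload follows directly.
        b"data", struct.pack("<I", len(pcm)),
    ])
    with open(path, "wb") as f:
        f.write(header + pcm)
```

The header is always exactly 44 bytes with no optional chunks, which is the layout the widest range of players accepts.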

⚡ Code Highlight

The TTS engine wraps Supertonic 3 with clamped parameters, voice caching, and expression tag support — keeping the generation API simple while exposing full control.

import numpy as np

# Excerpt: a method of the TTS engine class.
def synthesize(
    self, text: str, voice: str = "M1",
    lang: str = "en", speed: float = 1.05,
    quality: int = 8,
) -> tuple[np.ndarray, float]:
    """Synthesize text into 44.1 kHz audio samples.

    Args:
        text:    Input with optional <laugh>, <whisper> tags
        voice:   M1-M5 (male) or F1-F5 (female)
        lang:    31 supported languages
        speed:   0.7–2.0x without pitch distortion
        quality: 5 (draft) – 12 (studio)

    Returns:
        (wav, duration): the sample array and the clip duration
        as reported by the engine.
    """
    if not text.strip():
        raise ValueError("Text cannot be empty")

    # Clamp and resolve
    voice_style = self._get_voice(voice)
    quality = max(5, min(12, quality))
    speed = max(0.7, min(2.0, speed))

    wav, duration = self._tts.synthesize(
        text=text, lang=lang,
        voice_style=voice_style,
        total_steps=quality, speed=speed,
    )
    return wav, float(duration[0])

05 Results

VoxCraft ships as a self-contained desktop studio that needs no network access or external services at inference time. It has been validated on macOS and Linux, with consistent output quality on both.

10 Voice Profiles
M1–M5, F1–F5

31 Languages
Global coverage

44.1 kHz Studio Quality
16-bit WAV

100% Local
No API calls

Full studio UI — Waveform viewer, voice/language selector, speed slider (0.7–2.0×), and quality toggle.
10 voice profiles — 5 male (M1–M5) and 5 female (F1–F5), each with distinct timbre and cadence.
31 languages — Supported by the underlying Supertonic 3 model, covering major European, Asian, and Middle Eastern language families.
Batch processing — Queue multiple texts for sequential or parallel generation.
SQLite history — Every generation is stored with full-text search; users can revisit, re-download, or tweak past renders.
Zero API dependency — All inference runs locally; no data ever leaves the machine.
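The searchable history could be backed by SQLite's FTS5 extension, roughly as sketched here. The table layout, column names, and helper functions are all hypothetical, and the sketch assumes the local SQLite build has FTS5 enabled (common, but not guaranteed).

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the history store, creating a (hypothetical) FTS5 table."""
    conn = sqlite3.connect(path)
    # Only `text` is indexed for search; the other columns are stored as-is.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS history USING fts5("
        "text, voice UNINDEXED, lang UNINDEXED, wav_path UNINDEXED)"
    )
    return conn


def add_generation(conn: sqlite3.Connection, text: str, voice: str,
                   lang: str, wav_path: str) -> None:
    """Record one generation; the `with` block commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO history (text, voice, lang, wav_path) VALUES (?, ?, ?, ?)",
            (text, voice, lang, wav_path),
        )


def search(conn: sqlite3.Connection, query: str) -> list[tuple]:
    """Full-text search over past input texts, best matches first."""
    return conn.execute(
        "SELECT text, voice, lang, wav_path FROM history "
        "WHERE history MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```

A call such as `search(conn, "welcome")` would then return every stored generation whose input text contains that word, along with the voice, language, and file path needed to replay or re-render it.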

06 Lessons Learned

Lesson #1 — Local-first doesn't mean simple

Shipping a local ML application means owning the entire stack — model loading, device management, audio encoding, cross-platform file I/O. The complexity you'd normally outsource to a cloud API becomes your responsibility, but the payoff in privacy and latency is enormous.

Lesson #2 — Quality is a UX parameter, not a model knob

Users don't want to tweak denoising steps or sample rates. They want "preview" and "final". Mapping technical knobs to human-meaningful settings (Draft/Studio) made the tool approachable for non-technical users without hiding power from engineers.

Lesson #3 — Small models are underrated

99M parameters is tiny by modern LLM standards, yet Supertonic 3 produces TTS quality that rivals models 10× its size — on CPU. The efficiency of the architecture (attention-free feed-forward networks) meant we could run on a MacBook Air without a GPU. Not every problem needs a trillion-parameter model.