VoxCraft
Enterprise-grade text-to-speech studio generating 44.1 kHz studio-quality voiceovers. 10 voice styles, 31 languages, 100% local.
    Frontend (WaveSurfer.js + Vanilla JS)
        │ HTTP
    Backend (FastAPI + Python 3.14)
    ├── Supertonic 3 Engine (99M ONNX)
    │   ├── Voice Models (M1-M5, F1-F5)
    │   └── Expression Parser (<laugh>, <whisper>, etc.)
    └── Generation History (SQLite)
100% Local — No API Calls
01 Problem
Modern voiceover production relies on a patchwork of cloud TTS services — each with per-character billing, network latency, opaque privacy policies, and limited control over voice parameters. Content creators, indie studios, and accessibility tool builders needed a solution that was fast, private, and fully controllable without recurring API costs.
VoxCraft was built to prove that a 100% local TTS engine — powered by a compact 99M parameter neural model — could deliver broadcast-ready 44.1 kHz audio across dozens of languages and voice profiles, all from a single laptop.
02 Architecture
The system is split into two layers connected over HTTP, keeping everything on-device:
- Backend (FastAPI + Python 3.14): Loads the Supertonic 3 ONNX model, manages voice profiles, parses expression tags, and serves the generation API. A SQLite store keeps a searchable history of every generation.
- Frontend (WaveSurfer.js + Vanilla JS): A framework-free studio UI built on WaveSurfer.js, with waveform display, voice/language selection, speed control, a Draft/Studio quality toggle, and batch mode. All rendering is client-side.
The Expression Parser sits between the user's input text and the model's tokenizer, converting human-friendly tags like <laugh> into Supertonic control tokens while preserving natural prosody across the surrounding sentence.
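As a rough illustration of the tag-expansion step, the sketch below splits input into plain-text and control segments. The token strings such as `[CTRL_LAUGH]` and the `expand_tags` helper are hypothetical placeholders; the real Supertonic control-token format is not described here.

```python
import re

# Hypothetical mapping: the actual Supertonic control tokens are an assumption.
CONTROL_TOKENS = {
    "laugh": "[CTRL_LAUGH]",
    "whisper": "[CTRL_WHISPER]",
    "sigh": "[CTRL_SIGH]",
}

TAG_RE = re.compile(r"<(\w+)>")


def expand_tags(text: str) -> list[tuple[str, str]]:
    """Split text into ("text", ...) and ("control", ...) segments.

    Unknown tags are kept as literal text so nothing is silently dropped,
    and punctuation next to a tag stays in the neighbouring text segment.
    """
    segments: list[tuple[str, str]] = []
    pos = 0
    for match in TAG_RE.finditer(text):
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        name = match.group(1).lower()
        if name in CONTROL_TOKENS:
            segments.append(("control", CONTROL_TOKENS[name]))
        else:
            segments.append(("text", match.group(0)))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments
```

Keeping the surrounding text in separate segments means a later pass can adjust prosody at the boundaries without re-parsing the input.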
03 Tech Stack
| Layer | Technology | Notes |
|---|---|---|
| Runtime | Python 3.14 | Recent CPython release; built-in TOML parsing (`tomllib`) for config; compatible with the free-threaded build |
| API Server | FastAPI | Async endpoints for generation, history CRUD, voice listing |
| Inference Engine | Supertonic 3 (99M params ONNX) | 31 languages, no GPU required, ~300 ms inference |
| Frontend | WaveSurfer.js + Vanilla JS | Waveform rendering, playback, region selection |
| Storage | SQLite | Generation history with full-text search |
| Audio Format | 44.1 kHz 16-bit WAV | Studio-standard PCM, cross-platform consistent |
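The full-text-searchable history could be sketched with SQLite's FTS5 extension as below. The table name, column names, and helper functions are hypothetical, and the sketch assumes the local SQLite build includes FTS5 (most Python distributions do, but it is not guaranteed).

```python
import sqlite3


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open the history store and create the FTS5 table if needed."""
    conn = sqlite3.connect(path)
    # Only input_text is indexed for search; voice and lang are stored as-is.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS history "
        "USING fts5(input_text, voice UNINDEXED, lang UNINDEXED)"
    )
    return conn


def record_generation(conn: sqlite3.Connection, text: str, voice: str, lang: str) -> None:
    """Store one generation request."""
    conn.execute(
        "INSERT INTO history (input_text, voice, lang) VALUES (?, ?, ?)",
        (text, voice, lang),
    )
    conn.commit()


def search_history(conn: sqlite3.Connection, query: str) -> list[tuple[str, str, str]]:
    """Return matching generations, best match first."""
    return conn.execute(
        "SELECT input_text, voice, lang FROM history "
        "WHERE history MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
```

A real store would also keep timestamps and a path to the rendered WAV; those are left out to keep the sketch short.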
04 Challenges
Real-time generation vs quality
The model's denoising steps are the primary control knob for audio quality. Fewer steps deliver sub-second previews but introduce audible artifacts; more steps produce studio-grade output at the cost of latency. We solved this with a Draft → Studio quality toggle: draft mode uses minimal steps for rapid iteration, studio mode runs the full pipeline and caches the result.
Separating preview from final quality let users iterate 5× faster without sacrificing the final render. The same model handles both — just a parameter change.
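A minimal sketch of the toggle, assuming draft and studio map to the two ends of the 5 to 12 step range given in the `synthesize` docstring. The `RenderCache` class, the cache key, and the injected `synth` callable are hypothetical names for illustration.

```python
from typing import Callable

# Assumed mapping: the ends of the 5 (draft) to 12 (studio) step range.
QUALITY_PRESETS = {"draft": 5, "studio": 12}


def steps_for(mode: str) -> int:
    """Translate a user-facing quality mode into a denoising step count."""
    try:
        return QUALITY_PRESETS[mode]
    except KeyError:
        raise ValueError(f"unknown quality mode: {mode!r}") from None


class RenderCache:
    """Cache studio renders; draft previews are always regenerated."""

    def __init__(self, synth: Callable[..., bytes]) -> None:
        self._synth = synth
        self._cache: dict[tuple, bytes] = {}

    def render(self, text: str, voice: str, lang: str, speed: float, mode: str) -> bytes:
        steps = steps_for(mode)
        if mode != "studio":
            return self._synth(text, voice, lang, speed, steps)
        key = (text, voice, lang, speed)
        if key not in self._cache:
            self._cache[key] = self._synth(text, voice, lang, speed, steps)
        return self._cache[key]
```

Only the step count differs between the two modes, which matches the point that the same model serves both.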
Expression tag parsing
Tags like <laugh>, <whisper>, and <sigh> must be converted to Supertonic control tokens without breaking the rhythm or prosody of the surrounding sentence. Early attempts produced robotic transitions or swallowed adjacent punctuation.
The solution: a two-pass parser that first expands tags to token-level control sequences, then runs a prosody-preservation pass that adjusts duration and pitch contours at tag boundaries.
Tags are treated as phoneme-level interrupts rather than sentence breaks. The parser inserts a short fade window (5 ms) around each tag boundary to prevent audible clicks.
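At 44.1 kHz, a 5 ms window works out to 44100 × 0.005 = 220.5, so 220 whole samples. The sketch below applies a fade-out before a tag boundary and a fade-in after it. The linear ramp shape and the function names are assumptions; the write-up only specifies the window length.

```python
def fade_length(sample_rate: int = 44_100, fade_ms: float = 5.0) -> int:
    """Number of whole samples covered by the fade window."""
    return int(sample_rate * fade_ms / 1000)


def apply_boundary_fade(
    before: list[float],
    after: list[float],
    sample_rate: int = 44_100,
    fade_ms: float = 5.0,
) -> tuple[list[float], list[float]]:
    """Fade out the tail of `before` and fade in the head of `after`.

    Uses a linear gain ramp (an assumption) so both sides reach zero
    exactly at the boundary, which avoids a step discontinuity.
    """
    n = fade_length(sample_rate, fade_ms)
    n_out = min(n, len(before))
    n_in = min(n, len(after))

    faded_before = list(before)
    start = len(before) - n_out
    for i in range(n_out):
        # Gain falls towards 0.0 at the final sample.
        faded_before[start + i] *= (n_out - 1 - i) / n_out

    faded_after = list(after)
    for i in range(n_in):
        # Gain rises from 0.0 at the first sample.
        faded_after[i] *= i / n_in

    return faded_before, faded_after
```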
WaveSurfer.js sync
Rendering a 44.1 kHz WAV waveform in the browser without dropping frames requires careful buffer management. WaveSurfer.js expects pre-decoded PCM data; sending raw 16-bit samples over HTTP and decoding on the client introduced jank on larger generations.
Cross-platform WAV consistency
WAV files generated on macOS vs Linux had subtle header differences (chunk sizes, byte ordering) that broke playback on certain players. We standardised on a canonical WAV writer that writes RIFF headers by hand, guaranteeing bit-identical output across platforms.
Never trust platform stdlib WAV writers for production audio. A 50-line manual RIFF writer eliminates a surprising class of heisenbugs.
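A hand-written RIFF header is small enough to show in full. The sketch below builds the canonical 44-byte PCM header with explicit little-endian packing; the function name and the mono default are assumptions, not details from the project.

```python
import struct


def wav_bytes(pcm: bytes, sample_rate: int = 44_100, channels: int = 1, bits: int = 16) -> bytes:
    """Wrap raw little-endian PCM samples in a canonical 44-byte RIFF/WAVE header."""
    block_align = channels * bits // 8   # bytes per sample frame
    byte_rate = sample_rate * block_align
    if len(pcm) % block_align:
        raise ValueError("PCM data must contain whole sample frames")

    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",  # "<" forces little-endian with no padding
        b"RIFF",
        36 + len(pcm),         # file size minus the 8-byte RIFF preamble
        b"WAVE",
        b"fmt ",
        16,                    # fmt chunk size for plain PCM
        1,                     # audio format 1 = uncompressed PCM
        channels,
        sample_rate,
        byte_rate,
        block_align,
        bits,
        b"data",
        len(pcm),
    )
    return header + pcm
```

Because every field is packed explicitly, the output depends only on the inputs, not on the host platform's byte order or library version.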
⚡ Code Highlight
The TTS engine wraps Supertonic 3 with clamped parameters, voice caching, and expression tag support — keeping the generation API simple while exposing full control.
    def synthesize(
        self,
        text: str,
        voice: str = "M1",
        lang: str = "en",
        speed: float = 1.05,
        quality: int = 8,
    ) -> tuple[np.ndarray, float]:
        """Synthesize text → 44.1kHz WAV audio.

        Args:
            text: Input with optional <laugh>, <whisper> tags
            voice: M1-M5 (male) or F1-F5 (female)
            lang: 31 supported languages
            speed: 0.7–2.0x without pitch distortion
            quality: 5 (draft) – 12 (studio)
        """
        if not text.strip():
            raise ValueError("Text cannot be empty")

        # Clamp and resolve
        voice_style = self._get_voice(voice)
        quality = max(5, min(12, quality))
        speed = max(0.7, min(2.0, speed))

        wav, duration = self._tts.synthesize(
            text=text,
            lang=lang,
            voice_style=voice_style,
            total_steps=quality,
            speed=speed,
        )
        return wav, float(duration[0])
05 Results
VoxCraft ships as a fully self-contained desktop studio with zero external dependencies at inference time. The application has been validated across macOS and Linux with consistent output quality.
10 voices (M1–M5, F1–F5) · 31 languages (global coverage) · 44.1 kHz (16-bit WAV) · 100% local (no API calls)
- Full studio UI: Waveform viewer, voice/language selector, speed slider (0.7–2.0×), and Draft/Studio quality toggle.
- 10 voice profiles: 5 male (M1–M5) and 5 female (F1–F5), each with distinct timbre and cadence.
- 31 languages: All supported by the underlying Supertonic 3 model, covering major European, Asian, and Middle Eastern language families.
- Batch processing: Queue multiple texts for sequential or parallel generation.
- SQLite history: Every generation is stored with full-text search; users can revisit, re-download, or tweak past renders.
- Zero API dependency: All inference runs locally; no data ever leaves the machine.
06 Lessons Learned
Shipping a local ML application means owning the entire stack — model loading, device management, audio encoding, cross-platform file I/O. The complexity you'd normally outsource to a cloud API becomes your responsibility, but the payoff in privacy and latency is enormous.
Users don't want to tweak denoising steps or sample rates. They want "preview" and "final". Mapping technical knobs to human-meaningful settings (Draft/Studio) made the tool approachable for non-technical users without hiding power from engineers.
99M parameters is tiny by modern LLM standards, yet Supertonic 3 produces TTS quality that rivals models 10× its size — on CPU. The efficiency of the architecture (attention-free feed-forward networks) meant we could run on a MacBook Air without a GPU. Not every problem needs a trillion-parameter model.