KittenTTS: The 25MB Model That Makes On-Device TTS Finally Practical
KittenTTS ships a 15M-parameter TTS model in 25MB that runs on CPU at 1.5x realtime — no GPU, no API key, no per-character billing.
On-device text-to-speech in 2026 still feels like it should be solved — and yet every practical option has a catch. Cloud APIs (ElevenLabs, OpenAI TTS) charge per character and send your audio to someone else's servers. Local heavyweights like Bark and XTTS demand 4–8GB of VRAM and take several seconds to synthesize a single sentence. Kokoro is excellent but clocks in at 82MB even in its smallest form, and still has dependency weight that makes shipping it non-trivial.
KittenTTS dropped on Hacker News this week with a deceptively simple claim: a 15-million-parameter TTS model in 25MB (int8), running on CPU without any GPU dependency, generating 24 kHz audio in real time. It collected 432 upvotes and 160 comments in under 18 hours — a reliable signal that developers immediately recognized what this unlocks.
This isn't a toy demo. It's a working Python library, Apache 2.0 licensed, with an ONNX runtime backend, 8 built-in voices, and a clean API that gets you to playable audio in five lines of code. Here's everything you need to know, and exactly how to build with it today.
Why This Matters
The practical ceiling for embedding TTS into a product has always been set by the smallest legitimate use case: a reading app, a voice UI, a screen reader, a CLI tool that narrates output. None of these justify spinning up a GPU server or signing a contract with an API vendor.
KittenTTS resets that ceiling. At 25MB, the model is shippable alongside your application — smaller than most web font bundles. One HN commenter immediately noted they want to test it inside a Vercel Edge Function, where bundle size is the hard constraint. At that scale, TTS becomes a first-class feature rather than an expensive add-on.
The implications extend across several verticals:
- Privacy-first products: No audio leaves the machine. Medical, legal, and enterprise apps that can't send data to third-party servers finally have a viable local option.
- Accessibility: The author themselves called it "an amazing accessibility tool" — offline TTS means users with poor connectivity aren't second-class citizens.
- Embedded and edge devices: Raspberry Pi 4, Jetson Nano, even a Pi Zero for the most constrained cases. No CUDA, no driver hell.
- Cost elimination: Zero per-character pricing. An audiobook's worth of synthesis costs the same as silence.
KittenTTS v0.8 is a developer preview — APIs may change between releases. The team is actively fixing installation issues (see Python version gotcha below) and shipping a simplified installer within weeks.
Under the Hood: StyleTTS 2 Made Small
KittenTTS is built on the StyleTTS 2 architecture (GitHub), one of the most significant TTS papers of the past three years. StyleTTS 2 achieved a landmark: human-level speech synthesis on the LJSpeech benchmark, surpassing human recordings in listener preference tests. It accomplished this by combining:
- Style diffusion: Speaking style is modeled as a latent random variable via a diffusion model, letting the system synthesize appropriate prosody for any text without reference audio.
- Adversarial training with SLMs: A pre-trained WavLM model acts as a discriminator, pushing the synthesized audio toward the acoustic properties of real human speech.
- Differentiable duration modeling: End-to-end training with learnable phoneme durations, dramatically improving naturalness versus fixed-duration predecessors.
The KittenTTS team's contribution is distilling this architecture into ONNX models at three aggressively pruned sizes, then quantizing the smallest to int8 to hit the 25MB target. The critical benchmark the author shared in the HN thread: the new 15M nano model outperforms their previous 80M model (v0.1) on quality metrics. That's a 5× size reduction with a quality improvement — a sign of a team that knows how to compress models, not just release them.
The ONNX backend is key to the deployment story. There are no CUDA dependencies at inference time, no PyTorch required at runtime (though the Python package does pull it as a build dependency — more on that below), and it runs on any platform ONNX Runtime supports: x86, ARM, macOS, Windows, Linux.
Three Models, Three Use Cases
KittenTTS ships three model tiers as of v0.8:
| Model | Parameters | Disk Size | Best For |
|---|---|---|---|
| kitten-tts-nano-0.8-int8 | 15M | 25 MB | Browser embedding, MCUs, Pi Zero, latency-critical |
| kitten-tts-micro-0.8 | 40M | 41 MB | Raspberry Pi 4, constrained servers |
| kitten-tts-mini-0.8 | 80M | 80 MB | Best quality, production voice UIs, laptops |
The author's own recommendation from the HN comments: "The 80M is the highest quality while also being quite efficient. The 40M is quite similar to 80M for most use cases. 15M is for resource-constrained CPUs, loading onto a browser, etc."
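That guidance maps cleanly to a selection helper. The sketch below is illustrative, not part of the KittenTTS API; the model IDs come from the table above, and the tier names are my own labels.

```python
def pick_model(target: str) -> str:
    """Map a deployment profile to a model ID from the table above.

    The mapping follows the author's HN guidance; this helper is
    illustrative, not part of the KittenTTS API.
    """
    tiers = {
        "browser": "KittenML/kitten-tts-nano-0.8-int8",  # 25 MB, constrained CPUs
        "edge": "KittenML/kitten-tts-micro-0.8",         # 41 MB, Pi 4 class
        "quality": "KittenML/kitten-tts-mini-0.8",       # 80 MB, best quality
    }
    return tiers[target]

print(pick_model("quality"))  # KittenML/kitten-tts-mini-0.8
```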
How does it compare to alternatives?
- vs. ElevenLabs: Zero cost, fully offline, no account needed. Trade-off: fewer voices, no voice cloning yet (coming May 2026).
- vs. Kokoro TTS: KittenTTS author claims "competitive on Artificial Analysis benchmarks" at a fraction of the size. The team targets "Kokoro quality at 1/5 the size" for their next release.
- vs. Bark / XTTS: Night-and-day on resource requirements. Bark needs 4–12GB VRAM; KittenTTS runs on a 2015 laptop CPU.
- vs. Web Speech API: Offline, consistent cross-platform voice, no browser variation in quality, programmatic control over speed and voice.
Step-by-Step: Installation and First Audio
Before installing, two critical environment notes that the HN thread surfaced the hard way:
Python version: KittenTTS requires Python 3.8–3.12. Python 3.14 hits a known spaCy bug that breaks a transitive dependency. Check your version first: python --version.
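Since the supported range is 3.8–3.12, a startup guard can fail fast instead of hitting the spaCy error mid-install. A minimal sketch (this check is mine, not something KittenTTS ships):

```python
import sys

def supported(version):
    """True if a (major, minor) Python version is inside
    KittenTTS's documented support range of 3.8 through 3.12."""
    return (3, 8) <= version <= (3, 12)

# Warn early rather than letting the spaCy transitive dependency break later
if not supported(sys.version_info[:2]):
    print("Warning: unsupported Python version for KittenTTS (need 3.8-3.12)")
```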
Linux NVIDIA bloat: The default installation on Linux will pull CUDA packages (NVIDIA cuBLAS, cuDNN, etc.) — over 1GB — through an indirect torch dependency, even though the ONNX runtime doesn't use them. Force CPU-only torch first:
```bash
# Linux users: install CPU-only torch first to avoid ~1GB of NVIDIA packages
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Then install KittenTTS
pip install https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl

# Linux users may also need PortAudio for playback
# sudo apt install libportaudio2
```

With a clean install, here's the full workflow from text to WAV file:
```python
from kittentts import KittenTTS
import soundfile as sf
import numpy as np

# Load the 25MB nano model (downloads from HuggingFace on first run)
# Swap in "KittenML/kitten-tts-mini-0.8" for best quality
model = KittenTTS("KittenML/kitten-tts-nano-0.8-int8")

# Single-sentence synthesis
audio = model.generate(
    "KittenTTS delivers high-quality speech synthesis without a GPU.",
    voice="Jasper",   # Options: Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo
    speed=1.0,        # 0.5 = half speed, 1.5 = 50% faster
    clean_text=True   # Expands numbers, currencies, and units to words
)

# Save to WAV at 24kHz
sf.write("output.wav", audio, 24000)
print(f"Generated {len(audio) / 24000:.1f}s of audio")

# Batch synthesis — concatenate segments for longer content
paragraphs = [
    "Chapter one. The introduction.",
    "It was March 2026, and TTS models had finally gotten small.",
    "The 25 megabyte model ran without a GPU in sight.",
]
segments = [
    model.generate(text, voice="Bruno", clean_text=True)
    for text in paragraphs
]
full_audio = np.concatenate(segments)
sf.write("chapter_one.wav", full_audio, 24000)
print(f"Total duration: {len(full_audio) / 24000:.1f}s")
```

The clean_text=True flag matters. Without it, feeding "The model is 25MB and costs $0.00" will produce garbled output — the model doesn't handle numeric literals well at this size. With preprocessing enabled, numbers and currencies get expanded to their spoken form before phonemization.
CLI shortcut: Community member @newptcai released purr, a CLI wrapper for KittenTTS that strips the unnecessary dependency chain and adds a --play flag for direct playback. Install it via pip install purr for faster iteration if you don't need the Python API.
Saving Directly to File and Speed Control
The library also has a generate_to_file method that skips the NumPy array entirely — useful for scripts where you don't need to manipulate the audio:
```python
from kittentts import KittenTTS

model = KittenTTS("KittenML/kitten-tts-mini-0.8")  # 80MB, best quality

# Direct-to-file synthesis with speed control
model.generate_to_file(
    "Generating voice output for a product demo.",
    output_path="demo.wav",
    voice="Luna",
    speed=0.9,          # Slightly slower = more gravitas
    sample_rate=24000,
    clean_text=True
)

# List all available voices programmatically
print(model.available_voices)
# ['Bella', 'Jasper', 'Luna', 'Bruno', 'Rosie', 'Hugo', 'Kiki', 'Leo']
```

For Raspberry Pi and other ARM deployments, the ONNX backend is the same — no code changes. The KittenTTS() constructor accepts a cache_dir parameter to keep downloaded model files local:
```python
model = KittenTTS(
    "KittenML/kitten-tts-nano-0.8-int8",
    cache_dir="/opt/models/kittentts"  # Persist across deployments
)
```

Benchmarks and Real Performance
Based on community testing reported in the HN thread:
| Hardware | Model | Real-Time Factor |
|---|---|---|
| Intel i7-9700 CPU | 80M (mini) | ~1.5× realtime |
| Intel i7-9700 CPU | 15M (nano) | Faster than mini |
| NVIDIA RTX 3080 | 80M (mini) | ~1.5× realtime (no improvement over CPU) |
The GPU parity is notable: the RTX 3080 showed no speedup over the i7 CPU, because the ONNX inference graph is already CPU-optimized and the overhead of GPU memory transfers cancels out gains at this model size. This is a feature, not a limitation — it means there's zero reason to provision GPU capacity for TTS workloads using KittenTTS.
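Real-time factor here means audio seconds produced per wall-clock second of synthesis, so anything above 1.0 is faster than playback. A small stdlib helper lets you reproduce the table on your own hardware; the fake synthesizer below is a stand-in so the sketch runs without the model installed.

```python
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    """Return audio_seconds / wall_seconds for one synthesis call.

    `synthesize` is any callable returning a sample array at
    `sample_rate`; with KittenTTS you would pass something like
    lambda t: model.generate(t).
    """
    start = time.perf_counter()
    audio = synthesize(text)
    wall = time.perf_counter() - start
    return (len(audio) / sample_rate) / wall

# Illustrative stand-in: instantly "synthesizes" 2.4s of silence
fake = lambda t: [0.0] * 57600  # 57600 samples / 24000 Hz = 2.4 s
rtf = real_time_factor(fake, "hello")
print(f"RTF = {rtf:.1f}x")
```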
On the quality axis, the team published a key benchmark claim: their current 15M model is better than their previous 80M model (v0.1). That trajectory — higher quality at smaller size — is consistent with the StyleTTS 2 foundation providing headroom that smaller architectures typically don't have.
The HuggingFace Spaces demo is live for zero-install testing. Community feedback on voice quality is positive for standard prose, with known rough edges on technical jargon, acronyms, and domain-specific words — consistent with what you'd expect from a 15M-parameter model.
Limitations and What to Watch
Number pronunciation: This is the clearest quality gap. Numeric literals in strings ("the model has 135ms latency") produce garbled output. Always use clean_text=True, and for technical content with edge cases (acronyms like "SECDED", units like "ms"), pre-process your text with an LLM before feeding it to the TTS pipeline. The team says a model-level fix is coming in the next release.
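If an LLM pass is overkill for your content, even a crude regex normalizer for number-plus-unit strings helps. This is a sketch of the pre-processing idea, assuming a hand-picked unit table; it is not KittenTTS's built-in clean_text pipeline.

```python
import re

# Minimal pre-normalizer for unit strings TTS models tend to mispronounce.
# Hand-picked table; extend for your domain. Not the library's clean_text.
UNIT_WORDS = {"ms": "milliseconds", "MB": "megabytes",
              "GB": "gigabytes", "kHz": "kilohertz"}

def expand_units(text):
    """Rewrite '135ms' as '135 milliseconds' so the downstream cleaner
    only has to expand the digits themselves."""
    pattern = r"(\d+)\s*(" + "|".join(UNIT_WORDS) + r")\b"
    return re.sub(pattern,
                  lambda m: f"{m.group(1)} {UNIT_WORDS[m.group(2)]}",
                  text)

print(expand_units("the model has 135ms latency"))
# the model has 135 milliseconds latency
```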
Python version constraints: Pin to Python 3.8–3.12 in your project. Python 3.14 breaks a spaCy transitive dependency. The team is aware and working on a fix, but until it's resolved this is a real footgun for anyone on bleeding-edge Python.
Dependency footprint contradiction: The model is 25MB. Your Python environment to run it will be ~700MB (macOS) or up to 3GB+ (Linux, without the CPU-only torch workaround). This is a developer experience problem, not a deployment problem — once packaged, you can strip to just the ONNX model and runtime. A CLI executable and a mobile SDK are both on the roadmap.
English-only: The current release is English only. Multilingual support (French, Spanish, German first) is targeting April 2026, with lower-resource languages following. Japanese is confirmed "~3 weeks" away per the HN thread.
Prosody complexity: Sentence-final intonation and rhythm on long, complex sentences show the limits of 15M parameters. The 80M model is significantly better here. For production use cases where naturalness matters, use kitten-tts-mini.
Research preview status: The team calls this a developer preview explicitly. Expect breaking changes. Pin your version.
Final Thoughts
The on-device AI story has been "coming soon" for years. KittenTTS is one of the clearest demonstrations yet that "soon" is now. A StyleTTS 2 distillation that fits in 25MB, runs on CPU at 1.5× realtime, and ships under Apache 2.0 — that's not a research artifact, it's a building block.
The roadmap makes it more compelling: voice cloning at 15M parameters by May 2026, mobile SDKs in weeks, and the team explicitly targeting Kokoro-level quality at 1/5 the size. If they hit even half of that, the remaining arguments for cloud TTS APIs in privacy-sensitive or latency-critical applications evaporate.
Try the HuggingFace demo first to calibrate expectations on voice quality. Then install and iterate locally. The five-line API is genuinely that simple, and the edge cases are well-documented enough that you can work around them today.
Resources:
- KittenTTS GitHub — source, README, and issue tracker
- HuggingFace Demo — try before you install
- nano int8 model card — 25MB model on HF Hub
- purr CLI wrapper — community CLI with cleaner deps
- StyleTTS 2 paper — the architecture underneath
- StyleTTS2 GitHub — reference implementation
- kittenml.com — commercial support and updates
- HN discussion — 160 comments of community benchmarks and feedback