KittenTTS: The 25MB Model That Makes On-Device TTS Finally Practical
KittenTTS ships a 15M-parameter TTS model in 25MB that runs on CPU at 1.5x realtime — no GPU, no API key, no per-character billing.
On-device text-to-speech in 2026 still feels like it should be solved — and yet every practical option has a catch. Cloud APIs (ElevenLabs, OpenAI TTS) charge per character and send your audio to someone else's servers. Local heavyweights like Bark and XTTS demand 4–8GB of VRAM and take several seconds to synthesize a single sentence. Kokoro is excellent but clocks in at 82MB even in its smallest form, and still has dependency weight that makes shipping it non-trivial.
KittenTTS dropped on Hacker News this week with a deceptively simple claim: a 15-million-parameter TTS model in 25MB (int8), running on CPU without any GPU dependency, generating 24 kHz audio in real time. It collected 432 upvotes and 160 comments in under 18 hours — a reliable signal that developers immediately recognized what this unlocks.
This isn't a toy demo. It's a working Python library, Apache 2.0 licensed, with an ONNX runtime backend, 8 built-in voices, and a clean API that gets you to playable audio in five lines of code. Here's everything you need to know, and exactly how to build with it today.
Why This Matters
The practical ceiling for embedding TTS into a product has always been set by the smallest legitimate use case: a reading app, a voice UI, a screen reader, a CLI tool that narrates output. None of these justify spinning up a GPU server or signing a contract with an API vendor.
KittenTTS resets that ceiling. At 25MB, the model is shippable alongside your application — smaller than most web font bundles. One HN commenter immediately noted they want to test it inside a Vercel Edge Function, where bundle size is the hard constraint. At that scale, TTS becomes a first-class feature rather than an expensive add-on.
The implications extend across several verticals:
- Privacy-first products: No audio leaves the machine. Medical, legal, and enterprise apps that can't send data to third-party servers finally have a viable local option.
- Accessibility: The author themselves called it "an amazing accessibility tool" — offline TTS means users with poor connectivity aren't second-class citizens.
- Embedded and edge devices: Raspberry Pi 4, Jetson Nano, even a Pi Zero for the most constrained cases. No CUDA, no driver hell.
- Cost elimination: Zero per-character pricing. An audiobook's worth of synthesis costs the same as silence.
KittenTTS v0.8 is a developer preview — APIs may change between releases. The team is actively fixing installation issues (see Python version gotcha below) and shipping a simplified installer within weeks.
Under the Hood: StyleTTS 2 Made Small
KittenTTS is built on the StyleTTS 2 architecture (GitHub), one of the most significant TTS papers of the past three years. StyleTTS 2 achieved a landmark: human-level speech synthesis on the LJSpeech benchmark, surpassing human recordings in listener preference tests. It accomplished this by combining:
- Style diffusion: Speaking style is modeled as a latent random variable via a diffusion model, letting the system synthesize appropriate prosody for any text without reference audio.
- Adversarial training with SLMs: A pre-trained WavLM model acts as a discriminator, pushing the synthesized audio toward the acoustic properties of real human speech.
- Differentiable duration modeling: End-to-end training with learnable phoneme durations, dramatically improving naturalness versus fixed-duration predecessors.
The KittenTTS team's contribution is distilling this architecture into ONNX models at three aggressively pruned sizes, then quantizing the smallest to int8 to hit the 25MB target. The critical benchmark the author shared in the HN thread: the new 15M nano model outperforms their previous 80M model (v0.1) on quality metrics. That's a 5× size reduction with a quality improvement — a sign of a team that knows how to compress models, not just release them.
The ONNX backend is key to the deployment story. There are no CUDA dependencies at inference time, no PyTorch required at runtime (though the Python package does pull it as a build dependency — more on that below), and it runs on any platform ONNX Runtime supports: x86, ARM, macOS, Windows, Linux.
Three Models, Three Use Cases
KittenTTS ships three model tiers as of v0.8:
| Model | Parameters | Disk Size | Best For |
|---|---|---|---|
| kitten-tts-nano-0.8-int8 | 15M | 25 MB | Browser embedding, MCUs, Pi Zero, latency-critical |
| kitten-tts-micro-0.8 | 40M | 41 MB | Raspberry Pi 4, constrained servers |
| kitten-tts-mini-0.8 | 80M | 80 MB | Best quality, production voice UIs, laptops |
The author's own recommendation from the HN comments: "The 80M is the highest quality while also being quite efficient. The 40M is quite similar to 80M for most use cases. 15M is for resource-constrained CPUs, loading onto a browser, etc."
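That guidance maps cleanly to a selection helper. The sketch below is illustrative, not part of the KittenTTS API; the model IDs come from the table above, and the tier names are my own labels.

```python
def pick_model(target: str) -> str:
    """Map a deployment profile to a model ID from the table above.

    The mapping follows the author's HN guidance; this helper is
    illustrative, not part of the KittenTTS API.
    """
    tiers = {
        "browser": "KittenML/kitten-tts-nano-0.8-int8",  # 25 MB, constrained CPUs
        "edge": "KittenML/kitten-tts-micro-0.8",         # 41 MB, Pi 4 class
        "quality": "KittenML/kitten-tts-mini-0.8",       # 80 MB, best quality
    }
    return tiers[target]

print(pick_model("quality"))  # KittenML/kitten-tts-mini-0.8
```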
How does it compare to alternatives?
- vs. ElevenLabs: Zero cost, fully offline, no account needed. Trade-off: fewer voices, no voice cloning yet (coming May 2026).
- vs. Kokoro TTS: KittenTTS author claims "competitive on Artificial Analysis benchmarks" at a fraction of the size. The team targets "Kokoro quality at 1/5 the size" for their next release.
- vs. Bark / XTTS: Night-and-day on resource requirements. Bark needs 4–12GB VRAM; KittenTTS runs on a 2015 laptop CPU.
- vs. Web Speech API: Offline, consistent cross-platform voice, no browser variation in quality, programmatic control over speed and voice.
Step-by-Step: Installation and First Audio
Before installing, two critical environment notes that the HN thread surfaced the hard way:
Python version: KittenTTS requires Python 3.8–3.12. Python 3.14 hits a known spaCy bug that breaks a transitive dependency. Check your version first: python --version.
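Since the supported range is 3.8–3.12, a startup guard can fail fast instead of hitting the spaCy error mid-install. A minimal sketch (this check is mine, not something KittenTTS ships):

```python
import sys

def supported(version):
    """True if a (major, minor) Python version is inside
    KittenTTS's documented support range of 3.8 through 3.12."""
    return (3, 8) <= version <= (3, 12)

# Warn early rather than letting the spaCy transitive dependency break later
if not supported(sys.version_info[:2]):
    print("Warning: unsupported Python version for KittenTTS (need 3.8-3.12)")
```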
Linux NVIDIA bloat: The default installation on Linux will pull CUDA packages (NVIDIA cuBLAS, cuDNN, etc.) — over 1GB — through an indirect torch dependency, even though the ONNX runtime doesn't use them. Force CPU-only torch first:
```bash
# Linux users: install CPU-only torch first to avoid ~1GB of NVIDIA packages
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Then install KittenTTS
pip install https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl

# Linux users may also need PortAudio for playback
# sudo apt install libportaudio2
```

With a clean install, here's the full workflow from text to WAV file:
```python
from kittentts import KittenTTS
import soundfile as sf
import numpy as np

# Load the 25MB nano model (downloads from HuggingFace on first run)
# Swap in "KittenML/kitten-tts-mini-0.8" for best quality
model = KittenTTS("KittenML/kitten-tts-nano-0.8-int8")

# Single-sentence synthesis
audio = model.generate(
    "KittenTTS delivers high-quality speech synthesis without a GPU.",
    voice="Jasper",   # Options: Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo
    speed=1.0,        # 0.5 = half speed, 1.5 = 50% faster
    clean_text=True   # Expands numbers, currencies, and units to words
)

# Save to WAV at 24kHz
sf.write("output.wav", audio, 24000)
print(f"Generated {len(audio) / 24000:.1f}s of audio")

# Batch synthesis — concatenate segments for longer content
paragraphs = [
    "Chapter one. The introduction.",
    "It was March 2026, and TTS models had finally gotten small.",
    "The 25 megabyte model ran without a GPU in sight.",
]
segments = [
    model.generate(text, voice="Bruno", clean_text=True)
    for text in paragraphs
]
full_audio = np.concatenate(segments)
sf.write("chapter_one.wav", full_audio, 24000)
print(f"Total duration: {len(full_audio) / 24000:.1f}s")
```

The clean_text=True flag matters. Without it, feeding "The model is 25MB and costs $0.00" will produce garbled output — the model doesn't handle numeric literals well at this size. With preprocessing enabled, numbers and currencies get expanded to their spoken form before phonemization.
CLI shortcut: Community member @newptcai released purr, a CLI wrapper for KittenTTS that strips the unnecessary dependency chain and adds a --play flag for direct playback. Install it via pip install purr for faster iteration if you don't need the Python API.
Saving Directly to File and Speed Control
The library also has a generate_to_file method that skips the NumPy array entirely — useful for scripts where you don't need to manipulate the audio:
```python
from kittentts import KittenTTS

model = KittenTTS("KittenML/kitten-tts-mini-0.8")  # 80MB, best quality

# Direct-to-file synthesis with speed control
model.generate_to_file(
    "Generating voice output for a product demo.",
    output_path="demo.wav",
    voice="Luna",
    speed=0.9,          # Slightly slower = more gravitas
    sample_rate=24000,
    clean_text=True
)

# List all available voices programmatically
print(model.available_voices)
# ['Bella', 'Jasper', 'Luna', 'Bruno', 'Rosie', 'Hugo', 'Kiki', 'Leo']
```

For Raspberry Pi and other ARM deployments, the ONNX backend is the same — no code changes. The KittenTTS() constructor accepts a cache_dir parameter to keep downloaded model files local:
```python
model = KittenTTS(
    "KittenML/kitten-tts-nano-0.8-int8",
    cache_dir="/opt/models/kittentts"  # Persist across deployments
)
```

Benchmarks and Real Performance
Based on community testing reported in the HN thread:
| Hardware | Model | Real-Time Factor |
|---|---|---|
| Intel i7-9700 CPU | 80M (mini) | ~1.5× realtime |
| Intel i7-9700 CPU | 15M (nano) | Faster than mini |
| NVIDIA RTX 3080 | 80M (mini) | ~1.5× realtime (no improvement over CPU) |
The GPU parity is notable: the RTX 3080 showed no speedup over the i7 CPU, because the ONNX inference graph is already CPU-optimized and the overhead of GPU memory transfers cancels out gains at this model size. This is a feature, not a limitation — it means there's zero reason to provision GPU capacity for TTS workloads using KittenTTS.
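Real-time factor here means audio seconds produced per wall-clock second of synthesis, so anything above 1.0 is faster than playback. A small stdlib helper lets you reproduce the table on your own hardware; the fake synthesizer below is a stand-in so the sketch runs without the model installed.

```python
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    """Return audio_seconds / wall_seconds for one synthesis call.

    `synthesize` is any callable returning a sample array at
    `sample_rate`; with KittenTTS you would pass something like
    lambda t: model.generate(t).
    """
    start = time.perf_counter()
    audio = synthesize(text)
    wall = time.perf_counter() - start
    return (len(audio) / sample_rate) / wall

# Illustrative stand-in: instantly "synthesizes" 2.4s of silence
fake = lambda t: [0.0] * 57600  # 57600 samples / 24000 Hz = 2.4 s
rtf = real_time_factor(fake, "hello")
print(f"RTF = {rtf:.1f}x")
```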
On the quality axis, the team published a key benchmark claim: their current 15M model is better than their previous 80M model (v0.1). That trajectory — higher quality at smaller size — is consistent with the StyleTTS 2 foundation providing headroom that smaller architectures typically don't have.
The HuggingFace Spaces demo is live for zero-install testing. Community feedback on voice quality is positive for standard prose, with known rough edges on technical jargon, acronyms, and domain-specific words — consistent with what you'd expect from a 15M-parameter model.
Limitations and What to Watch
Number pronunciation: This is the clearest quality gap. Numeric literals in strings ("the model has 135ms latency") produce garbled output. Always use clean_text=True, and for technical content with edge cases (acronyms like "SECDED", units like "ms"), pre-process your text with an LLM before feeding it to the TTS pipeline. The team says a model-level fix is coming in the next release.
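If an LLM pass is overkill for your content, even a crude regex normalizer for number-plus-unit strings helps. This is a sketch of the pre-processing idea, assuming a hand-picked unit table; it is not KittenTTS's built-in clean_text pipeline.

```python
import re

# Minimal pre-normalizer for unit strings TTS models tend to mispronounce.
# Hand-picked table; extend for your domain. Not the library's clean_text.
UNIT_WORDS = {"ms": "milliseconds", "MB": "megabytes",
              "GB": "gigabytes", "kHz": "kilohertz"}

def expand_units(text):
    """Rewrite '135ms' as '135 milliseconds' so the downstream cleaner
    only has to expand the digits themselves."""
    pattern = r"(\d+)\s*(" + "|".join(UNIT_WORDS) + r")\b"
    return re.sub(pattern,
                  lambda m: f"{m.group(1)} {UNIT_WORDS[m.group(2)]}",
                  text)

print(expand_units("the model has 135ms latency"))
# the model has 135 milliseconds latency
```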
Python version constraints: Pin to Python 3.8–3.12 in your project. Python 3.14 breaks a spaCy transitive dependency. The team is aware and working on a fix, but until it's resolved this is a real footgun for anyone on bleeding-edge Python.
Dependency footprint contradiction: The model is 25MB. Your Python environment to run it will be ~700MB (macOS) or up to 3GB+ (Linux, without the CPU-only torch workaround). This is a developer experience problem, not a deployment problem — once packaged, you can strip to just the ONNX model and runtime. A CLI executable and a mobile SDK are both on the roadmap.
English-only: The current release is English only. Multilingual support (French, Spanish, German first) is targeting April 2026, with lower-resource languages following. Japanese is confirmed "~3 weeks" away per the HN thread.
Prosody complexity: Sentence-final intonation and rhythm on long, complex sentences show the limits of 15M parameters. The 80M model is significantly better here. For production use cases where naturalness matters, use kitten-tts-mini.
Research preview status: The team calls this a developer preview explicitly. Expect breaking changes. Pin your version.
Final Thoughts
The on-device AI story has been "coming soon" for years. KittenTTS is one of the clearest demonstrations yet that "soon" is now. A StyleTTS 2 distillation that fits in 25MB, runs on CPU at 1.5× realtime, and ships under Apache 2.0 — that's not a research artifact, it's a building block.
The roadmap makes it more compelling: voice cloning at 15M parameters by May 2026, mobile SDKs in weeks, and the team explicitly targeting Kokoro-level quality at 1/5 the size. If they hit even half of that, the remaining arguments for cloud TTS APIs in privacy-sensitive or latency-critical applications evaporate.
Try the HuggingFace demo first to calibrate expectations on voice quality. Then install and iterate locally. The five-line API is genuinely that simple, and the edge cases are well-documented enough that you can work around them today.
Resources:
- KittenTTS GitHub — source, README, and issue tracker
- HuggingFace Demo — try before you install
- nano int8 model card — 25MB model on HF Hub
- purr CLI wrapper — community CLI with cleaner deps
- StyleTTS 2 paper — the architecture underneath
- StyleTTS2 GitHub — reference implementation
- kittenml.com — commercial support and updates
- HN discussion — 160 comments of community benchmarks and feedback