README
Overview
Research on open-source TTS models for local deployment, with focus on quality, speed, and AMD GPU compatibility.
Key distinction from Whisper: TTS converts text → audio (synthesis), while Whisper converts audio → text (transcription). They are complementary technologies.
Quick Recommendation
| Use Case | Recommended Model | Why |
|---|---|---|
| Fast & lightweight | Kokoro-82M | 82M params, Apache license, CPU-friendly |
| Best quality | Chatterbox, Fish Speech | State-of-the-art, emotion support |
| Edge/Raspberry Pi | Piper | Optimized for low-resource devices |
| Voice cloning | XTTS-v2, Chatterbox | 6-sec sample cloning |
| AMD GPU | Fish Speech (ROCm fork) | Native ROCm support |
| Dialogue/characters | Dia | Multi-speaker, nonverbal sounds |
Top Models (2024-2025)
Tier 1: State-of-the-Art
Kokoro-82M
- Parameters: 82M (lightweight!)
- License: Apache 2.0 (commercial use OK)
- Quality: Comparable to larger models
- Speed: Very fast, runs on CPU
- Languages: 8 languages, 54 voices
```bash
pip install "kokoro>=0.9.2" soundfile
apt-get install espeak-ng  # Linux
```

```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English
generator = pipeline("Hello world!", voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'{i}.wav', audio, 24000)
```

Links: GitHub | Hugging Face
Chatterbox (Resemble AI)
- Parameters: 500M (Llama backbone)
- License: MIT (fully open)
- Training: 500K+ hours of audio
- Special: First open-source with emotion exaggeration control
- Quality: Benchmarked favorably against ElevenLabs
Best for: Games, storytelling, dynamic characters
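A minimal sketch, assuming the `chatterbox-tts` package and its `ChatterboxTTS` API (verify names against the Resemble AI repo before relying on them):

```python
# Sketch only: package and method names assumed from the chatterbox-tts project.
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
# exaggeration is the emotion-intensity control; higher = more dramatic delivery
wav = model.generate("Ha! You'll never take me alive!", exaggeration=0.7)
torchaudio.save("villain.wav", wav, model.sr)
```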
Links: Resemble AI
Fish Speech V1.5
- Architecture: DualAR (dual autoregressive transformer)
- Training: 300K+ hours (English, Chinese), 100K+ hours (Japanese)
- Elo score: 1339 (TTS Arena)
- Word Error Rate: 3.5%
- AMD Support: ROCm fork available!
```bash
# ROCm version for AMD GPUs (Linux)
git clone https://github.com/moyutegong/fish-speech-rocm
```

Links: GitHub ROCm | ZLUDA Windows
Dia (Nari Labs)
- Parameters: 1.6B
- Specialty: Dialogue generation
- Features: Multi-speaker, nonverbal sounds (laughter, coughing, sighing)
- Use case: Audiobooks, podcasts, game dialogue
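A hedged sketch of Dia's tagged-dialogue style, assuming the `dia` package and `Dia.from_pretrained` entry point from the Nari Labs repo (confirm the exact API and output sample rate against the official README):

```python
# Sketch only: API assumed from the nari-labs/dia repository.
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
# [S1]/[S2] mark speakers; parentheticals request nonverbal sounds
text = "[S1] Did you hear that? [S2] (laughs) It was just the wind. [S1] (sighs)"
audio = model.generate(text)
sf.write("dialogue.wav", audio, 44100)  # assumed 44.1 kHz output
```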
Tier 2: Practical & Fast
Piper TTS
- Optimization: Raspberry Pi 4, edge devices
- Format: ONNX models (VITS-trained)
- Speed: 10x faster than real-time on CPU
- Privacy: Fully offline, no cloud
```bash
# Installation
python3 -m venv .venv
source .venv/bin/activate
pip install piper-tts
```

```bash
# Usage
echo "Hello world" | piper -m en_US-amy-medium.onnx --output_file hello.wav
```

GPU acceleration (CUDA):

```bash
pip install onnxruntime-gpu
echo "Hello!" | piper -m en_US-amy-medium.onnx --cuda --output-raw | aplay -r 22050 -f S16_LE -t raw -
```

Speed adjustment: edit the voice's `.onnx.json` config and change `length_scale` (higher = slower), as sketched below.
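For example, a small script to slow a voice down by about 20% (assumes the standard Piper voice config layout with an `inference.length_scale` field):

```python
# Adjust speaking rate for a Piper voice by editing its JSON config.
# length_scale: 1.0 = default speed, >1.0 slower, <1.0 faster.
import json

cfg_path = "en_US-amy-medium.onnx.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["inference"]["length_scale"] = 1.2  # ~20% slower
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```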
XTTS-v2 (Coqui)
- Voice cloning: 6-second sample only!
- Languages: Cross-lingual cloning
- License: ⚠️ Non-commercial only (Coqui Public Model License)
```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice_sample.wav",
    language="en",
    file_path="output.wav",
)
```

Links: Coqui TTS
Tier 3: Specialized
Bark (Suno AI)
- Specialty: Expressive, creative synthesis
- Features: Emotions, laughter, music, non-speech sounds
- Speed: Slower (quality-focused)
- Use case: Creative audio content
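A short example using Bark's documented Python API (`generate_audio` from the `bark` package); bracketed cues like `[laughs]` are rendered as nonverbal sounds:

```python
# Bark renders bracketed cues such as [laughs] and [sighs] as nonverbal audio.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads model weights on first run
audio = generate_audio("Well... [laughs] I did not expect that! [sighs]")
write_wav("bark_out.wav", SAMPLE_RATE, audio)
```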
Higgs Audio V2 (BosonAI)
- Base: Llama 3.2 3B
- Training: 10M+ hours
- Specialty: Emotion, question intonation
- Status: Top trending on Hugging Face
eSpeak NG
- Languages: 100+ languages
- Size: Extremely lightweight
- Use case: Accessibility, embedded systems
- Note: Robotic voice, but works anywhere
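Because eSpeak NG is a plain CLI, it is easy to drive from Python for accessibility scripting. A minimal sketch using `subprocess` (assumes `espeak-ng` is on PATH; installation shown below):

```python
# Speak arbitrary text through eSpeak NG via its CLI.
import subprocess

def say(text: str, wpm: int = 160) -> None:
    # -s sets speaking rate in words per minute; audio plays directly
    subprocess.run(["espeak-ng", "-s", str(wpm), text], check=True)

say("Hello from eSpeak NG")
```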
```bash
apt-get install espeak-ng
espeak-ng "Hello world"
```

AMD GPU Compatibility
Working Solutions
| Model | AMD Method | Notes |
|---|---|---|
| Fish Speech | ROCm fork | Native Linux support |
| Fish Speech | ZLUDA (Windows) | Requires ROCm 5.7 setup |
| Kokoro | CPU | Fast enough without GPU |
| Piper | CPU (ONNX) | Optimized for CPU |
| Coqui TTS | PyTorch + ROCm | Works, but only community-supported |
Fish Speech ROCm Setup (Linux)
```bash
# Clone ROCm fork
git clone https://github.com/moyutegong/fish-speech-rocm
cd fish-speech-rocm

# Install dependencies (ensure ROCm is installed)
pip install -r requirements.txt

# Run WebUI
python webui.py
```

Fish Speech ZLUDA Setup (Windows)
- Install ROCm 5.7 for Windows
- Set environment variables:
  - `HIP_PATH` = `C:\Program Files\AMD\ROCm\5.7\`
- Add to PATH:
  - `C:\Program Files\AMD\ROCm\5.7\bin`
- Clone and run:

```bash
git clone https://github.com/patientx/fish-speech-zluda
```
Note: The first generation is slow while ZLUDA compiles kernels; subsequent runs are faster.
Comparison Matrix
| Model | Params | License | Quality | Speed | Voice Clone | AMD GPU |
|---|---|---|---|---|---|---|
| Kokoro-82M | 82M | Apache 2.0 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ | CPU fast |
| Chatterbox | 500M | MIT | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ? |
| Fish Speech | Large | Apache 2.0 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ✅ ROCm |
| Piper | Small | MIT | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ | CPU |
| XTTS-v2 | Medium | Non-commercial | ⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ? |
| Bark | Large | MIT | ⭐⭐⭐⭐ | ⭐⭐ | ❌ | ? |
| Dia | 1.6B | ? | ⭐⭐⭐⭐⭐ | ⭐⭐ | ❌ | ? |
Recommendations by Use Case
Real-Time Voice Assistant
Use: Piper or Kokoro
- Low latency critical
- CPU performance sufficient
- Offline operation
Content Creation / Audiobooks
Use: Fish Speech or Chatterbox
- Quality prioritized
- GPU beneficial
- Emotion/expression support
Voice Cloning
Use: XTTS-v2 (non-commercial) or Chatterbox (commercial)
- Only 6-sec sample needed
- Cross-lingual support
Games / Interactive
Use: Dia or Chatterbox
- Multi-character dialogue
- Emotion exaggeration
- Nonverbal sounds
Edge / Embedded
Use: Piper
- Raspberry Pi optimized
- ONNX runtime
- Minimal dependencies
Related Research
- Ubuntu Audio Streaming + Whisper - Speech-to-text (complementary)