README
Overview
Research on open-source TTS models for local deployment, with focus on quality, speed, and AMD GPU compatibility.
Key distinction from Whisper: TTS converts text → audio (synthesis), while Whisper converts audio → text (transcription). They are complementary technologies.
Quick Recommendation
| Use Case | Recommended Model | Why |
|---|---|---|
| Fast & lightweight | Kokoro-82M | 82M params, Apache license, CPU-friendly |
| Best quality | Chatterbox, Fish Speech | State-of-the-art, emotion support |
| Edge/Raspberry Pi | Piper | Optimized for low-resource devices |
| Voice cloning | XTTS-v2, Chatterbox | 6-sec sample cloning |
| AMD GPU | Fish Speech (ROCm fork) | Native ROCm support |
| Dialogue/characters | Dia | Multi-speaker, nonverbal sounds |
Top Models (2024-2025)
Tier 1: State-of-the-Art
Kokoro-82M
- Parameters: 82M (lightweight!)
- License: Apache 2.0 (commercial use OK)
- Quality: Comparable to larger models
- Speed: Very fast, runs on CPU
- Languages: 8 languages, 54 voices
```bash
pip install "kokoro>=0.9.2" soundfile
apt-get install espeak-ng  # Linux
```

```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English
generator = pipeline("Hello world!", voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):
    sf.write(f'{i}.wav', audio, 24000)
```

Links: GitHub | Hugging Face
Chatterbox (Resemble AI)
- Parameters: 500M (Llama backbone)
- License: MIT (fully open)
- Training: 500K+ hours of audio
- Special: First open-source with emotion exaggeration control
- Quality: Benchmarked favorably against ElevenLabs
Best for: Games, storytelling, dynamic characters
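A minimal sketch, assuming the `chatterbox-tts` package and its `ChatterboxTTS` API (verify names against the Resemble AI repo before relying on them):

```python
# Sketch only: package and method names assumed from the chatterbox-tts project.
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
# exaggeration is the emotion-intensity control; higher = more dramatic delivery
wav = model.generate("Ha! You'll never take me alive!", exaggeration=0.7)
torchaudio.save("villain.wav", wav, model.sr)
```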
Links: Resemble AI
Fish Speech V1.5
- Architecture: DualAR (dual autoregressive transformer)
- Training: 300K+ hours (English, Chinese), 100K+ hours (Japanese)
- Elo score: 1339 (TTS Arena)
- Word Error Rate: 3.5%
- AMD Support: ROCm fork available!
```bash
# ROCm version for AMD GPUs (Linux)
git clone https://github.com/moyutegong/fish-speech-rocm
```

Links: GitHub ROCm | ZLUDA Windows
Dia (Nari Labs)
- Parameters: 1.6B
- Specialty: Dialogue generation
- Features: Multi-speaker, nonverbal sounds (laughter, coughing, sighing)
- Use case: Audiobooks, podcasts, game dialogue
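A hedged sketch of Dia's tagged-dialogue style, assuming the `dia` package and `Dia.from_pretrained` entry point from the Nari Labs repo (confirm the exact API and output sample rate against the official README):

```python
# Sketch only: API assumed from the nari-labs/dia repository.
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
# [S1]/[S2] mark speakers; parentheticals request nonverbal sounds
text = "[S1] Did you hear that? [S2] (laughs) It was just the wind. [S1] (sighs)"
audio = model.generate(text)
sf.write("dialogue.wav", audio, 44100)  # assumed 44.1 kHz output
```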
Tier 2: Practical & Fast
Piper TTS
- Optimization: Raspberry Pi 4, edge devices
- Format: ONNX models (VITS-trained)
- Speed: 10x faster than real-time on CPU
- Privacy: Fully offline, no cloud
```bash
# Installation
python3 -m venv .venv
source .venv/bin/activate
pip install piper-tts
```

```bash
# Usage
echo "Hello world" | piper -m en_US-amy-medium.onnx --output_file hello.wav
```

GPU acceleration (CUDA):

```bash
pip install onnxruntime-gpu
echo "Hello!" | piper -m en_US-amy-medium.onnx --cuda --output-raw | aplay -r 22050 -f S16_LE -t raw -
```

Speed adjustment: edit the voice's `.onnx.json` config and change `length_scale` (higher = slower), as sketched below.
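For example, a small script to slow a voice down by about 20% (assumes the standard Piper voice config layout with an `inference.length_scale` field):

```python
# Adjust speaking rate for a Piper voice by editing its JSON config.
# length_scale: 1.0 = default speed, >1.0 slower, <1.0 faster.
import json

cfg_path = "en_US-amy-medium.onnx.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["inference"]["length_scale"] = 1.2  # ~20% slower
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```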
XTTS-v2 (Coqui)
- Voice cloning: 6-second sample only!
- Languages: Cross-lingual cloning
- License: ⚠️ Non-commercial only (Coqui Public Model License)
```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice_sample.wav",
    language="en",
    file_path="output.wav",
)
```

Links: Coqui TTS
Tier 3: Specialized
Bark (Suno AI)
- Specialty: Expressive, creative synthesis
- Features: Emotions, laughter, music, non-speech sounds
- Speed: Slower (quality-focused)
- Use case: Creative audio content
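A short example using Bark's documented Python API (`generate_audio` from the `bark` package); bracketed cues like `[laughs]` are rendered as nonverbal sounds:

```python
# Bark renders bracketed cues such as [laughs] and [sighs] as nonverbal audio.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads model weights on first run
audio = generate_audio("Well... [laughs] I did not expect that! [sighs]")
write_wav("bark_out.wav", SAMPLE_RATE, audio)
```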
Higgs Audio V2 (BosonAI)
- Base: Llama 3.2 3B
- Training: 10M+ hours
- Specialty: Emotion, question intonation
- Status: Top trending on Hugging Face
eSpeak NG
- Languages: 100+ languages
- Size: Extremely lightweight
- Use case: Accessibility, embedded systems
- Note: Robotic voice, but works anywhere
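Because eSpeak NG is a plain CLI, it is easy to drive from Python for accessibility scripting. A minimal sketch using `subprocess` (assumes `espeak-ng` is on PATH; installation shown below):

```python
# Speak arbitrary text through eSpeak NG via its CLI.
import subprocess

def say(text: str, wpm: int = 160) -> None:
    # -s sets speaking rate in words per minute; audio plays directly
    subprocess.run(["espeak-ng", "-s", str(wpm), text], check=True)

say("Hello from eSpeak NG")
```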
```bash
apt-get install espeak-ng
espeak-ng "Hello world"
```

AMD GPU Compatibility
Working Solutions
| Model | AMD Method | Notes |
|---|---|---|
| Fish Speech | ROCm fork | Native Linux support |
| Fish Speech | ZLUDA (Windows) | Requires ROCm 5.7 setup |
| Kokoro | CPU | Fast enough without GPU |
| Piper | CPU (ONNX) | Optimized for CPU |
| Coqui TTS | PyTorch + ROCm | Works, but only community-supported |
Fish Speech ROCm Setup (Linux)
```bash
# Clone ROCm fork
git clone https://github.com/moyutegong/fish-speech-rocm
cd fish-speech-rocm

# Install dependencies (ensure ROCm is installed)
pip install -r requirements.txt

# Run WebUI
python webui.py
```

Fish Speech ZLUDA Setup (Windows)
- Install ROCm 5.7 for Windows
- Set environment variables:
  - `HIP_PATH` = `C:\Program Files\AMD\ROCm\5.7\`
- Add to PATH:
  - `C:\Program Files\AMD\ROCm\5.7\bin`
- Clone and run:

```bash
git clone https://github.com/patientx/fish-speech-zluda
```
Note: The first generation is slow while ZLUDA compiles kernels; subsequent runs are faster.
Comparison Matrix
| Model | Params | License | Quality | Speed | Voice Clone | AMD GPU |
|---|---|---|---|---|---|---|
| Kokoro-82M | 82M | Apache 2.0 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ | CPU fast |
| Chatterbox | 500M | MIT | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ? |
| Fish Speech | Large | Apache 2.0 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ✅ ROCm |
| Piper | Small | MIT | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ | CPU |
| XTTS-v2 | Medium | Non-commercial | ⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ? |
| Bark | Large | MIT | ⭐⭐⭐⭐ | ⭐⭐ | ❌ | ? |
| Dia | 1.6B | ? | ⭐⭐⭐⭐⭐ | ⭐⭐ | ❌ | ? |
Recommendations by Use Case
Real-Time Voice Assistant
Use: Piper or Kokoro
- Low latency critical
- CPU performance sufficient
- Offline operation
Content Creation / Audiobooks
Use: Fish Speech or Chatterbox
- Quality prioritized
- GPU beneficial
- Emotion/expression support
Voice Cloning
Use: XTTS-v2 (non-commercial) or Chatterbox (commercial)
- Only 6-sec sample needed
- Cross-lingual support
Games / Interactive
Use: Dia or Chatterbox
- Multi-character dialogue
- Emotion exaggeration
- Nonverbal sounds
Edge / Embedded
Use: Piper
- Raspberry Pi optimized
- ONNX runtime
- Minimal dependencies
Related Research
- Ubuntu Audio Streaming + Whisper - Speech-to-text (complementary)