Overview

Research on open-source TTS models for local deployment, with focus on quality, speed, and AMD GPU compatibility.

Key distinction from Whisper: TTS converts text → audio (synthesis), while Whisper converts audio → text (transcription). They are complementary technologies.

Quick Recommendation

| Use Case | Recommended Model | Why |
| --- | --- | --- |
| Fast & lightweight | Kokoro-82M | 82M params, Apache license, CPU-friendly |
| Best quality | Chatterbox, Fish Speech | State-of-the-art, emotion support |
| Edge/Raspberry Pi | Piper | Optimized for low-resource devices |
| Voice cloning | XTTS-v2, Chatterbox | 6-sec sample cloning |
| AMD GPU | Fish Speech (ROCm fork) | Native ROCm support |
| Dialogue/characters | Dia | Multi-speaker, nonverbal sounds |

Top Models (2024-2025)

Tier 1: State-of-the-Art

Kokoro-82M

  • Parameters: 82M (lightweight!)
  • License: Apache 2.0 (commercial use OK)
  • Quality: Comparable to larger models
  • Speed: Very fast, runs on CPU
  • Languages: 8 languages, 54 voices
```shell
pip install "kokoro>=0.9.2" soundfile   # quote the spec so the shell doesn't treat > as redirection
apt-get install espeak-ng               # Linux (phonemizer backend)
```

```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' = American English
generator = pipeline("Hello world!", voice='af_heart')
for i, (gs, ps, audio) in enumerate(generator):  # gs = graphemes, ps = phonemes
    sf.write(f'{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
```

Links: GitHub | Hugging Face


Chatterbox (Resemble AI)

  • Parameters: 500M (Llama backbone)
  • License: MIT (fully open)
  • Training: 500K+ hours of audio
  • Special: First open-source with emotion exaggeration control
  • Quality: Benchmarked favorably against ElevenLabs

Best for: Games, storytelling, dynamic characters

Links: Resemble AI
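The exaggeration control is exposed as a generation parameter. A hedged sketch, assuming the `chatterbox-tts` package layout (`chatterbox.tts.ChatterboxTTS`, `generate(..., exaggeration=...)`); the `pick_exaggeration` helper and its style labels are illustrative additions, not part of the library:

```python
def pick_exaggeration(style: str) -> float:
    """Map a coarse style label to an exaggeration value in [0, 1].
    Hypothetical helper; the labels are illustrative, not Chatterbox API."""
    levels = {"flat": 0.25, "neutral": 0.5, "dramatic": 0.7, "intense": 0.9}
    return levels.get(style, 0.5)


def synthesize(text: str, style: str = "neutral", out_path: str = "out.wav") -> None:
    """Generate speech with Chatterbox (assumes `pip install chatterbox-tts`)."""
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS  # heavy imports kept local

    model = ChatterboxTTS.from_pretrained(device="cpu")
    wav = model.generate(text, exaggeration=pick_exaggeration(style))
    ta.save(out_path, wav, model.sr)
```

Higher exaggeration values push delivery toward dramatic; values near 0.5 stay neutral.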


Fish Speech V1.5

  • Architecture: DualAR (dual autoregressive transformer)
  • Training: 300K+ hours (English, Chinese), 100K+ hours (Japanese)
  • ELO Score: 1339 (TTS Arena)
  • Word Error Rate: 3.5%
  • AMD Support: ROCm fork available!
```shell
# ROCm version for AMD GPUs (Linux)
git clone https://github.com/moyutegong/fish-speech-rocm
```

Links: GitHub ROCm | ZLUDA Windows


Dia (Nari Labs)

  • Parameters: 1.6B
  • Specialty: Dialogue generation
  • Features: Multi-speaker, nonverbal sounds (laughter, coughing, sighing)
  • Use case: Audiobooks, podcasts, game dialogue

Tier 2: Practical & Fast

Piper TTS

  • Optimization: Raspberry Pi 4, edge devices
  • Format: ONNX models (VITS-trained)
  • Speed: 10x faster than real-time on CPU
  • Privacy: Fully offline, no cloud
```shell
# Installation
python3 -m venv .venv
source .venv/bin/activate
pip install piper-tts

# Usage
echo "Hello world" | piper -m en_US-amy-medium.onnx --output_file hello.wav
```

GPU acceleration (CUDA):

```shell
pip install onnxruntime-gpu
echo "Hello!" | piper -m en_US-amy-medium.onnx --cuda | aplay
```

Speed adjustment: edit the voice's `.onnx.json` config and change `length_scale` (higher = slower speech).
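That tweak can be scripted. A minimal sketch, assuming the voice config keeps its synthesis settings under an `inference` key (the layout used by Piper voice configs):

```python
import json
from pathlib import Path


def set_speaking_rate(config_path: str, length_scale: float) -> None:
    """Rewrite length_scale in a Piper voice's .onnx.json config.
    Values > 1.0 slow speech down; values < 1.0 speed it up."""
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    cfg.setdefault("inference", {})["length_scale"] = length_scale
    path.write_text(json.dumps(cfg, indent=2))
```

Run it once per voice; Piper picks the new value up on the next invocation.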

Links: GitHub | Voices


XTTS-v2 (Coqui)

  • Voice cloning: 6-second sample only!
  • Languages: Cross-lingual cloning
  • License: ⚠️ Non-commercial only (Coqui Public Model License)
```python
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Hello world",
    speaker_wav="voice_sample.wav",
    language="en",
    file_path="output.wav",
)
```

Links: Coqui TTS


Tier 3: Specialized

Bark (Suno AI)

  • Specialty: Expressive, creative synthesis
  • Features: Emotions, laughter, music, non-speech sounds
  • Speed: Slower (quality-focused)
  • Use case: Creative audio content

Higgs Audio V2 (BosonAI)

  • Base: Llama 3.2 3B
  • Training: 10M+ hours
  • Specialty: Emotion, question intonation
  • Status: Top trending on Hugging Face

eSpeak NG

  • Languages: 100+ languages
  • Size: Extremely lightweight
  • Use case: Accessibility, embedded systems
  • Note: Robotic voice, but works anywhere
```shell
apt-get install espeak-ng
espeak-ng "Hello world"
```
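For programmatic use, eSpeak NG is easiest to drive via subprocess; `-v` selects the voice/language and `-s` sets speed in words per minute. The `speak` wrapper below is an illustrative helper, not an eSpeak API:

```python
import shutil
import subprocess


def speak(text: str, voice: str = "en", speed_wpm: int = 175) -> list:
    """Build the espeak-ng command line, and run it if the binary is installed."""
    cmd = ["espeak-ng", "-v", voice, "-s", str(speed_wpm), text]
    if shutil.which("espeak-ng"):
        subprocess.run(cmd, check=True)
    return cmd
```

Because it returns the command list, the same helper also works as a dry-run for logging or testing.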

AMD GPU Compatibility

Working Solutions

| Model | AMD Method | Notes |
| --- | --- | --- |
| Fish Speech | ROCm fork | Native Linux support |
| Fish Speech | ZLUDA (Windows) | Requires ROCm 5.7 setup |
| Kokoro | CPU | Fast enough without GPU |
| Piper | CPU (ONNX) | Optimized for CPU |
| Coqui TTS | PyTorch + ROCm | Works, but community-supported only |

Fish Speech ROCm Setup (Linux)

```shell
# Clone ROCm fork
git clone https://github.com/moyutegong/fish-speech-rocm
cd fish-speech-rocm

# Install dependencies (ensure ROCm is installed first)
pip install -r requirements.txt

# Run WebUI
python webui.py
```
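Before launching the WebUI, it is worth confirming that the ROCm build of PyTorch actually sees the GPU; on ROCm builds, `torch.cuda` is backed by HIP, so the usual CUDA checks apply. A small sanity-check sketch (the `rocm_status` helper is illustrative):

```python
def rocm_status() -> str:
    """Report whether PyTorch can see a GPU.
    Works for ROCm builds, where torch.cuda is backed by HIP."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        return f"GPU OK: {torch.cuda.get_device_name(0)}"
    return "no GPU visible -- check the ROCm install"
```

If no GPU shows up, recheck the ROCm installation before blaming the Fish Speech fork.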

Fish Speech ZLUDA Setup (Windows)

  1. Install ROCm 5.7 for Windows
  2. Set environment variables:
    • HIP_PATH: C:\Program Files\AMD\ROCm\5.7\
    • Add to PATH: C:\Program Files\AMD\ROCm\5.7\bin
  3. Clone and run:
     ```shell
     git clone https://github.com/patientx/fish-speech-zluda
     ```

Note: the first generation is slow while ZLUDA compiles its kernels; subsequent runs are faster.


Comparison Matrix

| Model | Params | License | Quality | Speed | Voice Clone | AMD GPU |
| --- | --- | --- | --- | --- | --- | --- |
| Kokoro-82M | 82M | Apache 2.0 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ | CPU fast |
| Chatterbox | 500M | MIT | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ? |
| Fish Speech | Large | Apache 2.0 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ✅ ROCm |
| Piper | Small | MIT | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ❌ | CPU |
| XTTS-v2 | Medium | Non-commercial | ⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ? |
| Bark | Large | MIT | ⭐⭐⭐⭐ | ⭐⭐ | ❌ | ? |
| Dia | 1.6B | ? | ⭐⭐⭐⭐ | ⭐⭐⭐ | ✅ | ? |

Recommendations by Use Case

Real-Time Voice Assistant

Use: Piper or Kokoro

  • Low latency critical
  • CPU performance sufficient
  • Offline operation
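A common pattern for keeping latency low is to chunk text at sentence boundaries and synthesize each chunk as it arrives, rather than waiting for the full reply. An engine-agnostic sketch (the splitter is illustrative, not part of Piper or Kokoro):

```python
import re


def sentence_chunks(text: str) -> list:
    """Split text into sentence-sized chunks so a TTS engine can start
    speaking the first sentence while later ones are still arriving."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

Each chunk would then be handed to the engine as soon as it is complete, e.g. piped to `piper` one sentence at a time.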

Content Creation / Audiobooks

Use: Fish Speech or Chatterbox

  • Quality prioritized
  • GPU beneficial
  • Emotion/expression support

Voice Cloning

Use: XTTS-v2 (non-commercial) or Chatterbox (commercial)

  • Only 6-sec sample needed
  • Cross-lingual support

Games / Interactive

Use: Dia or Chatterbox

  • Multi-character dialogue
  • Emotion exaggeration
  • Nonverbal sounds

Edge / Embedded

Use: Piper

  • Raspberry Pi optimized
  • ONNX runtime
  • Minimal dependencies

Sources

  1. Modal Blog - Top Open-Source TTS Models
  2. BentoML - Open-Source TTS Models
  3. Northflank - Best Open Source TTS
  4. DataCamp - 9 Best Open Source TTS Engines
  5. Resemble AI - Best Open Source TTS 2025
  6. GitHub - Kokoro-82M
  7. GitHub - Piper TTS
  8. GitHub - Fish Speech ROCm