whisper-rocm-compatibility
The Root Cause
You had to use openai-whisper instead of faster-whisper because of CTranslate2’s lack of native ROCm support.
Technical Breakdown
Architecture Differences
| Component | openai-whisper | faster-whisper |
|---|---|---|
| Backend | Pure PyTorch | CTranslate2 inference engine |
| GPU Support | CUDA + ROCm ✅ | CUDA only ❌ |
| Speed | Baseline (1-3x real-time) | 4-8x faster |
| Memory | Higher VRAM usage | 50% less VRAM |
| ROCm 7 Compatible | ✅ Yes (via PyTorch) | ❌ No (CTranslate2 limitation) |
Why CTranslate2 Doesn’t Support ROCm
CTranslate2 is an optimized inference engine that provides:
- INT8 quantization (vs FP16)
- Kernel fusion
- Better memory management
- CPU/GPU parallelization
However, it was built specifically for NVIDIA CUDA and does not have native ROCm support. The official PyPI packages (pip install ctranslate2) are CUDA-only.
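For context, this is roughly what the faster-whisper API looks like on an NVIDIA GPU, including the INT8 quantization option that CTranslate2 provides; on your ROCm system this device="cuda" path fails because the official CTranslate2 wheels have no HIP/ROCm backend (snippet is illustrative, with a placeholder audio path):
```python
from faster_whisper import WhisperModel

# Works on NVIDIA CUDA; on AMD/ROCm the official CTranslate2 wheels
# cannot find a usable GPU backend, so this raises an error instead.
model = WhisperModel("base", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("audio.wav")  # "audio.wav" is a placeholder
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```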
PyTorch’s Role
- openai-whisper uses PyTorch directly, which has excellent ROCm support through PyTorch ROCm builds
- faster-whisper bypasses PyTorch and uses CTranslate2 for inference, losing ROCm compatibility
- Your RX 7800 XT works perfectly with PyTorch ROCm 7.x, but CTranslate2 doesn’t use PyTorch’s ROCm backend
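A quick sanity check of that chain: PyTorch ROCm builds expose the HIP backend through the familiar "cuda" device string, so openai-whisper needs no AMD-specific changes at all.
```python
import torch
import whisper

# On a ROCm build of PyTorch, "cuda" transparently maps to the HIP backend.
assert torch.cuda.is_available()
model = whisper.load_model("base", device="cuda")
print(next(model.parameters()).device)  # cuda:0 -> the RX 7800 XT via ROCm
```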
Your Current Performance
With openai-whisper + ROCm 7 + RX 7800 XT:
- Base model: ~38x real-time
- Large-v3 model: ~2-5x real-time
- VRAM usage: 1-8GB depending on model
This is actually excellent performance - near what faster-whisper achieves on NVIDIA GPUs!
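If you want to reproduce those real-time-factor numbers yourself, a minimal timing sketch looks like this (sample.wav is a placeholder path for any audio clip):
```python
import time
import whisper

AUDIO_PATH = "sample.wav"  # placeholder: any clip you want to benchmark

model = whisper.load_model("base", device="cuda")
audio = whisper.load_audio(AUDIO_PATH)              # resampled to 16 kHz mono float32
clip_seconds = len(audio) / whisper.audio.SAMPLE_RATE

start = time.perf_counter()
result = model.transcribe(audio, fp16=False)
elapsed = time.perf_counter() - start

print(f"{clip_seconds / elapsed:.1f}x real-time")
print(result["text"])
```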
Community Workarounds
Option 1: Community CTranslate2-ROCm Forks
There are unofficial ROCm builds of CTranslate2:
- arlo-phoenix/CTranslate2-rocm
  - Reports ~60% faster than whisper.cpp
  - Requires building from source
  - May not work perfectly with ROCm 7
- ROCm/CTranslate2 (amd_dev branch)
  - AMD’s official fork
  - Significantly behind mainline CTranslate2
  - Not recommended (as of Oct 2025)
- Donkey545/wyoming-faster-whisper-rocm
  - Pre-built libraries for ROCm
  - Used in Home Assistant Wyoming protocol
  - Requires specific ROCm versions
Challenges with Community Builds
- Architecture-specific: Must build for your exact GPU (gfx1101 for RX 7800 XT)
- ROCm version sensitivity: May not work with ROCm 7.x
- Maintenance lag: Community forks fall behind mainline CTranslate2
- Complex setup: Requires manual compilation with specific flags
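On the architecture question, ROCm builds of PyTorch can report the gfx target directly, so you don't have to guess which architecture a fork must be compiled for (as far as I know, gcnArchName is only populated on ROCm builds, hence the defensive getattr):
```python
import torch

# Device properties on ROCm builds include the GCN/RDNA architecture name
# that community CTranslate2 forks must be compiled for.
props = torch.cuda.get_device_properties(0)
print(props.name)                            # e.g. AMD Radeon RX 7800 XT
print(getattr(props, "gcnArchName", None))   # e.g. gfx1101 on ROCm builds
```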
Impact on Real-Time Transcription Libraries
WhisperLive
Repository: https://github.com/collabora/WhisperLive
Backend: faster-whisper (CTranslate2)
ROCm Compatibility: ❌ Will NOT work with AMD GPUs
Why: WhisperLive explicitly uses faster-whisper as its backend for “nearly-live” transcription. Since faster-whisper requires CTranslate2, and CTranslate2 doesn’t support ROCm, WhisperLive inherits the same limitation.
Alternative: You would need to modify WhisperLive to use openai-whisper instead of faster-whisper, but this would significantly reduce performance.
RealtimeSTT
Repository: https://github.com/KoljaB/RealtimeSTT
Backend: faster_whisper for transcription
ROCm Compatibility: ❌ Will NOT work with AMD GPUs
Why: RealtimeSTT uses faster_whisper for its instant GPU-accelerated transcription feature. The library’s architecture includes:
- WebRTCVAD + SileroVAD for voice activity detection
- faster_whisper for transcription (requires CTranslate2)
- Porcupine for wake word detection
Default Installation: RealtimeSTT installs CPU-only PyTorch by default. Even if you swap in a GPU-enabled (ROCm) PyTorch build, the faster_whisper dependency will still fail on AMD GPUs.
Alternative: The library architecture would need to be modified to support openai-whisper as a backend option.
Working Real-Time Solutions for AMD GPUs
Since both WhisperLive and RealtimeSTT won’t work out-of-the-box, here are your options:
Option 1: Custom Implementation with openai-whisper
Build your own real-time transcription using:
- PyAudio or PulseAudio for audio capture
- openai-whisper with PyTorch ROCm for transcription
- Chunked processing (process audio in 1-2 second segments)
Performance: Should achieve near-real-time with your RX 7800 XT
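A rough sketch of the chunked-processing idea using a rolling window, so words at chunk boundaries are less likely to be cut. The 3-second window, 1-second hop, and language="en" are arbitrary assumptions to adjust for your setup:
```python
import numpy as np
import pyaudio
import whisper

SAMPLE_RATE = 16000
WINDOW_SEC = 3   # how much audio each transcribe() call sees
HOP_SEC = 1      # how often a new transcription is produced

model = whisper.load_model("base", device="cuda")

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                 input=True, frames_per_buffer=SAMPLE_RATE * HOP_SEC)

buffer = np.zeros(0, dtype=np.float32)
try:
    while True:
        raw = stream.read(SAMPLE_RATE * HOP_SEC)
        chunk = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
        # Keep only the most recent WINDOW_SEC seconds of audio.
        buffer = np.concatenate([buffer, chunk])[-SAMPLE_RATE * WINDOW_SEC:]
        result = model.transcribe(buffer, fp16=False, language="en")
        print(result["text"].strip())
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()
```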
Option 2: insanely-fast-whisper-rocm
Repository: https://github.com/beecave-homelab/insanely-fast-whisper-rocm
This is a Docker-based solution specifically designed for AMD GPUs with ROCm 6.1. It includes:
- Pre-configured PyTorch + ROCm environment
- Optimized Whisper implementation
- Easier setup than building from source
Caveat: Designed for ROCm 6.1, may need adaptation for ROCm 7.x
Option 3: whisper.cpp with ROCm
Repository: https://github.com/ggerganov/whisper.cpp
- Written in C++, uses GGML format (like llama.cpp)
- Has ROCm/HIP support
- Can be 2-3x faster than openai-whisper
- Requires compilation with ROCm flags
Pros:
- Native ROCm support
- Very fast inference
- Low memory usage
Cons:
- More complex to integrate into Python projects
- Requires building from source with correct flags
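If you do go this route, the lowest-friction way to use whisper.cpp from Python is to shell out to its CLI. A sketch assuming the classic main binary and a GGML base model; the binary name and flags vary between whisper.cpp versions, so check your build:
```python
import subprocess

# Placeholders: point these at your HIP-enabled whisper.cpp build and model.
WHISPER_CPP_BIN = "./main"
MODEL_PATH = "models/ggml-base.bin"

def transcribe_with_whisper_cpp(wav_path: str) -> str:
    """Run whisper.cpp on a 16 kHz WAV file and return its stdout transcript."""
    proc = subprocess.run(
        [WHISPER_CPP_BIN, "-m", MODEL_PATH, "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.strip()

print(transcribe_with_whisper_cpp("sample.wav"))  # placeholder input file
```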
Recommendations
For Your Use Case (Real-Time Audio Capture + Transcription)
Best approach: Custom implementation with openai-whisper
```python
import whisper
import pyaudio
import numpy as np

# Your existing setup already works!
model = whisper.load_model("base")  # ROCm-accelerated

# Audio capture from mic or speakers
audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    frames_per_buffer=16000,  # 1 second chunks
)

# Real-time transcription loop
while True:
    audio_chunk = stream.read(16000)
    audio_np = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float32) / 32768.0
    result = model.transcribe(audio_np, fp16=False)  # fp16=False for ROCm
    print(result["text"])
```
Why this works:
- ✅ Uses your existing openai-whisper + ROCm setup
- ✅ Near-real-time performance (38x for base model)
- ✅ No complex dependencies
- ✅ Full control over audio sources (mic, speakers, or both)
If You Need Maximum Speed
Consider investing time in whisper.cpp with ROCm, but understand:
- Significant compilation complexity
- Less Python-friendly API
- Marginal speed improvement over your current 38x real-time
Don’t Bother With
- ❌ WhisperLive (requires CTranslate2/CUDA)
- ❌ RealtimeSTT (requires CTranslate2/CUDA)
- ❌ faster-whisper community forks (too much hassle for ROCm 7)
Summary: Why You’re Stuck with openai-whisper
- faster-whisper requires CTranslate2
- CTranslate2 only supports CUDA (no native ROCm)
- WhisperLive and RealtimeSTT both use faster-whisper
- Therefore, all three fail on AMD GPUs
The good news: Your openai-whisper + PyTorch ROCm setup already provides excellent performance (38x real-time), which is competitive with faster-whisper on NVIDIA GPUs for your use case.
ROCm 7 Specific Issues
PyTorch ROCm Compatibility
✅ PyTorch with ROCm 7 works great for openai-whisper:
- ROCm 7.0+: Improved performance, expanded datatype support
- ROCm 7.1: Faster, more reliable, easier for developers
- Compatible PyTorch versions: 2.2.1+ recommended
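To confirm which build you are actually running, torch.version.hip is the simplest check: it is a ROCm version string on ROCm builds and None on CUDA-only or CPU-only builds.
```python
import torch

print("PyTorch:", torch.__version__)    # ROCm pip wheels carry a +rocm tag in the version
print("HIP/ROCm:", torch.version.hip)   # None means this is not a ROCm build
print("GPU visible:", torch.cuda.is_available())
```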
CTranslate2 ROCm Status (as of Nov 2025)
❌ Still no official ROCm support:
- Official builds: CUDA only
- AMD fork (amd_dev branch): Too far behind mainline
- Community forks: Most target ROCm 5.x or 6.x, not 7.x
MIOpen Issues (Bonus Context)
Your past experience with pyannote (speaker diarization) failing on GPU was due to MIOpen compilation issues:
- LSTM layers fail to compile for gfx1101 (RX 7800 XT)
- Missing <utility> header in kernel compilation
- Not related to Whisper, but a broader ROCm ecosystem issue
This doesn’t affect openai-whisper but shows ROCm 7 still has rough edges with some PyTorch operations.
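If you ever want to check whether a given ROCm/PyTorch combination still hits the MIOpen LSTM problem, a tiny forward pass is enough to trigger the RNN kernel compilation. This is purely illustrative and not taken from the pyannote code:
```python
import torch

# A single small LSTM layer forces MIOpen to compile its RNN kernels;
# if that step is broken for gfx1101, this raises a RuntimeError.
lstm = torch.nn.LSTM(input_size=64, hidden_size=64, batch_first=True).to("cuda")
x = torch.randn(1, 100, 64, device="cuda")
try:
    out, _ = lstm(x)
    print("LSTM forward pass OK:", out.shape)
except RuntimeError as exc:
    print("MIOpen/LSTM failure:", exc)
```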
Future Outlook
What Would Fix This?
For faster-whisper to work on AMD GPUs, one of these needs to happen:
- CTranslate2 adds native ROCm support (official maintainers)
- AMD fork catches up to mainline and gets proper maintenance
- Community forks target ROCm 7+ and provide easy installation
Likelihood?
Low-to-medium in the near term:
- CTranslate2 maintainers show no signs of adding ROCm support
- AMD’s focus is on larger enterprise GPUs (MI series)
- Community efforts are fragmented and version-specific
Alternative trajectory:
- More projects may follow whisper.cpp’s approach (native ROCm/HIP support)
- Or new inference engines emerge with first-class ROCm support
Conclusion
You’re not missing anything - this is a fundamental architectural limitation. The “faster” implementations (faster-whisper, WhisperLive, RealtimeSTT) all rely on CTranslate2, which is CUDA-exclusive.
Your solution: Build custom real-time transcription with openai-whisper + PyTorch ROCm, which already provides excellent performance on your RX 7800 XT.