whisper-rocm-compatibility
The Root Cause
You had to use openai-whisper instead of faster-whisper because of CTranslate2’s lack of native ROCm support.
Technical Breakdown
Architecture Differences
| Component | openai-whisper | faster-whisper |
|---|---|---|
| Backend | Pure PyTorch | CTranslate2 inference engine |
| GPU Support | CUDA + ROCm ✅ | CUDA only ❌ |
| Speed | Baseline (1-3x real-time) | 4-8x faster |
| Memory | Higher VRAM usage | 50% less VRAM |
| ROCm 7 Compatible | ✅ Yes (via PyTorch) | ❌ No (CTranslate2 limitation) |
Why CTranslate2 Doesn’t Support ROCm
CTranslate2 is an optimized inference engine that provides:
- INT8 quantization (vs FP16)
- Kernel fusion
- Better memory management
- CPU/GPU parallelization
However, it was built specifically for NVIDIA CUDA and does not have native ROCm support. The official PyPI packages (pip install ctranslate2) are CUDA-only.
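For context, this is roughly what the faster-whisper API looks like on an NVIDIA GPU, including the INT8 quantization option that CTranslate2 provides; on your ROCm system this device="cuda" path fails because the official CTranslate2 wheels have no HIP/ROCm backend (snippet is illustrative, with a placeholder audio path):
```python
from faster_whisper import WhisperModel

# Works on NVIDIA CUDA; on AMD/ROCm the official CTranslate2 wheels
# cannot find a usable GPU backend, so this raises an error instead.
model = WhisperModel("base", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("audio.wav")  # "audio.wav" is a placeholder
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```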
PyTorch’s Role
- openai-whisper uses PyTorch directly, which has excellent ROCm support through PyTorch ROCm builds
- faster-whisper bypasses PyTorch and uses CTranslate2 for inference, losing ROCm compatibility
- Your RX 7800 XT works perfectly with PyTorch ROCm 7.x, but CTranslate2 doesn’t use PyTorch’s ROCm backend
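A quick sanity check of that chain: PyTorch ROCm builds expose the HIP backend through the familiar "cuda" device string, so openai-whisper needs no AMD-specific changes at all.
```python
import torch
import whisper

# On a ROCm build of PyTorch, "cuda" transparently maps to the HIP backend.
assert torch.cuda.is_available()
model = whisper.load_model("base", device="cuda")
print(next(model.parameters()).device)  # cuda:0 -> the RX 7800 XT via ROCm
```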
Your Current Performance
With openai-whisper + ROCm 7 + RX 7800 XT:
- Base model: ~38x real-time
- Large-v3 model: ~2-5x real-time
- VRAM usage: 1-8GB depending on model
This is actually excellent performance - near what faster-whisper achieves on NVIDIA GPUs!
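If you want to reproduce those real-time-factor numbers yourself, a minimal timing sketch looks like this (sample.wav is a placeholder path for any audio clip):
```python
import time
import whisper

AUDIO_PATH = "sample.wav"  # placeholder: any clip you want to benchmark

model = whisper.load_model("base", device="cuda")
audio = whisper.load_audio(AUDIO_PATH)              # resampled to 16 kHz mono float32
clip_seconds = len(audio) / whisper.audio.SAMPLE_RATE

start = time.perf_counter()
result = model.transcribe(audio, fp16=False)
elapsed = time.perf_counter() - start

print(f"{clip_seconds / elapsed:.1f}x real-time")
print(result["text"])
```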
Community Workarounds
Option 1: Community CTranslate2-ROCm Forks
There are unofficial ROCm builds of CTranslate2:
- arlo-phoenix/CTranslate2-rocm
  - Reports ~60% faster than whisper.cpp
  - Requires building from source
  - May not work perfectly with ROCm 7
- ROCm/CTranslate2 (amd_dev branch)
  - AMD’s official fork
  - Significantly behind mainline CTranslate2
  - Not recommended (as of Oct 2025)
- Donkey545/wyoming-faster-whisper-rocm
  - Pre-built libraries for ROCm
  - Used in Home Assistant Wyoming protocol
  - Requires specific ROCm versions
Challenges with Community Builds
- Architecture-specific: Must build for your exact GPU (gfx1101 for RX 7800 XT)
- ROCm version sensitivity: May not work with ROCm 7.x
- Maintenance lag: Community forks fall behind mainline CTranslate2
- Complex setup: Requires manual compilation with specific flags
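On the architecture question, ROCm builds of PyTorch can report the gfx target directly, so you don't have to guess which architecture a fork must be compiled for (as far as I know, gcnArchName is only populated on ROCm builds, hence the defensive getattr):
```python
import torch

# Device properties on ROCm builds include the GCN/RDNA architecture name
# that community CTranslate2 forks must be compiled for.
props = torch.cuda.get_device_properties(0)
print(props.name)                            # e.g. AMD Radeon RX 7800 XT
print(getattr(props, "gcnArchName", None))   # e.g. gfx1101 on ROCm builds
```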
Impact on Real-Time Transcription Libraries
WhisperLive
Repository: https://github.com/collabora/WhisperLive
Backend: faster-whisper (CTranslate2)
ROCm Compatibility: ❌ Will NOT work with AMD GPUs
Why: WhisperLive explicitly uses faster-whisper as its backend for “nearly-live” transcription. Since faster-whisper requires CTranslate2, and CTranslate2 doesn’t support ROCm, WhisperLive inherits the same limitation.
Alternative: You would need to modify WhisperLive to use openai-whisper instead of faster-whisper, but this would significantly reduce performance.
RealtimeSTT
Repository: https://github.com/KoljaB/RealtimeSTT
Backend: faster_whisper for transcription
ROCm Compatibility: ❌ Will NOT work with AMD GPUs
Why: RealtimeSTT uses faster_whisper for its instant GPU-accelerated transcription feature. The library’s architecture includes:
- WebRTCVAD + SileroVAD for voice activity detection
- faster_whisper for transcription (requires CTranslate2)
- Porcupine for wake word detection
Default Installation: RealtimeSTT installs CPU-only PyTorch by default. Even if you swap in a GPU-enabled (ROCm) PyTorch build, the faster_whisper dependency will still fail on AMD GPUs.
Alternative: The library architecture would need to be modified to support openai-whisper as a backend option.
Working Real-Time Solutions for AMD GPUs
Since both WhisperLive and RealtimeSTT won’t work out-of-the-box, here are your options:
Option 1: Custom Implementation with openai-whisper
Build your own real-time transcription using:
- PyAudio or PulseAudio for audio capture
- openai-whisper with PyTorch ROCm for transcription
- Chunked processing (process audio in 1-2 second segments)
Performance: Should achieve near-real-time with your RX 7800 XT
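A rough sketch of the chunked-processing idea using a rolling window, so words at chunk boundaries are less likely to be cut. The 3-second window, 1-second hop, and language="en" are arbitrary assumptions to adjust for your setup:
```python
import numpy as np
import pyaudio
import whisper

SAMPLE_RATE = 16000
WINDOW_SEC = 3   # how much audio each transcribe() call sees
HOP_SEC = 1      # how often a new transcription is produced

model = whisper.load_model("base", device="cuda")

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=SAMPLE_RATE,
                 input=True, frames_per_buffer=SAMPLE_RATE * HOP_SEC)

buffer = np.zeros(0, dtype=np.float32)
try:
    while True:
        raw = stream.read(SAMPLE_RATE * HOP_SEC)
        chunk = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
        # Keep only the most recent WINDOW_SEC seconds of audio.
        buffer = np.concatenate([buffer, chunk])[-SAMPLE_RATE * WINDOW_SEC:]
        result = model.transcribe(buffer, fp16=False, language="en")
        print(result["text"].strip())
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()
```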
Option 2: insanely-fast-whisper-rocm
Repository: https://github.com/beecave-homelab/insanely-fast-whisper-rocm
This is a Docker-based solution specifically designed for AMD GPUs with ROCm 6.1. It includes:
- Pre-configured PyTorch + ROCm environment
- Optimized Whisper implementation
- Easier setup than building from source
Caveat: Designed for ROCm 6.1, may need adaptation for ROCm 7.x
Option 3: whisper.cpp with ROCm
Repository: https://github.com/ggerganov/whisper.cpp
- Written in C++, uses GGML format (like llama.cpp)
- Has ROCm/HIP support
- Can be 2-3x faster than openai-whisper
- Requires compilation with ROCm flags
Pros:
- Native ROCm support
- Very fast inference
- Low memory usage
Cons:
- More complex to integrate into Python projects
- Requires building from source with correct flags
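If you do go this route, the lowest-friction way to use whisper.cpp from Python is to shell out to its CLI. A sketch assuming the classic main binary and a GGML base model; the binary name and flags vary between whisper.cpp versions, so check your build:
```python
import subprocess

# Placeholders: point these at your HIP-enabled whisper.cpp build and model.
WHISPER_CPP_BIN = "./main"
MODEL_PATH = "models/ggml-base.bin"

def transcribe_with_whisper_cpp(wav_path: str) -> str:
    """Run whisper.cpp on a 16 kHz WAV file and return its stdout transcript."""
    proc = subprocess.run(
        [WHISPER_CPP_BIN, "-m", MODEL_PATH, "-f", wav_path, "--no-timestamps"],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout.strip()

print(transcribe_with_whisper_cpp("sample.wav"))  # placeholder input file
```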
Recommendations
For Your Use Case (Real-Time Audio Capture + Transcription)
Best approach: Custom implementation with openai-whisper
```python
import whisper
import pyaudio
import numpy as np

# Your existing setup already works!
model = whisper.load_model("base")  # ROCm-accelerated

# Audio capture from mic or speakers
audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    frames_per_buffer=16000,  # 1 second chunks
)

# Real-time transcription loop
while True:
    audio_chunk = stream.read(16000)
    audio_np = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float32) / 32768.0
    result = model.transcribe(audio_np, fp16=False)  # fp16=False for ROCm
    print(result["text"])
```
Why this works:
- ✅ Uses your existing openai-whisper + ROCm setup
- ✅ Near-real-time performance (38x for base model)
- ✅ No complex dependencies
- ✅ Full control over audio sources (mic, speakers, or both)
If You Need Maximum Speed
Consider investing time in whisper.cpp with ROCm, but understand:
- Significant compilation complexity
- Less Python-friendly API
- Marginal speed improvement over your current 38x real-time
Don’t Bother With
- ❌ WhisperLive (requires CTranslate2/CUDA)
- ❌ RealtimeSTT (requires CTranslate2/CUDA)
- ❌ faster-whisper community forks (too much hassle for ROCm 7)
Summary: Why You’re Stuck with openai-whisper
- faster-whisper requires CTranslate2
- CTranslate2 only supports CUDA (no native ROCm)
- WhisperLive and RealtimeSTT both use faster-whisper
- Therefore, all three fail on AMD GPUs
The good news: Your openai-whisper + PyTorch ROCm setup already provides excellent performance (38x real-time), which is competitive with faster-whisper on NVIDIA GPUs for your use case.
ROCm 7 Specific Issues
PyTorch ROCm Compatibility
✅ PyTorch with ROCm 7 works great for openai-whisper:
- ROCm 7.0+: Improved performance, expanded datatype support
- ROCm 7.1: Faster, more reliable, easier for developers
- Compatible PyTorch versions: 2.2.1+ recommended
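To confirm which build you are actually running, torch.version.hip is the simplest check: it is a ROCm version string on ROCm builds and None on CUDA-only or CPU-only builds.
```python
import torch

print("PyTorch:", torch.__version__)    # ROCm pip wheels carry a +rocm tag in the version
print("HIP/ROCm:", torch.version.hip)   # None means this is not a ROCm build
print("GPU visible:", torch.cuda.is_available())
```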
CTranslate2 ROCm Status (as of Nov 2025)
❌ Still no official ROCm support:
- Official builds: CUDA only
- AMD fork (amd_dev branch): Too far behind mainline
- Community forks: Most target ROCm 5.x or 6.x, not 7.x
MIOpen Issues (Bonus Context)
Your past experience with pyannote (speaker diarization) failing on GPU was due to MIOpen compilation issues:
- LSTM layers fail to compile for gfx1101 (RX 7800 XT)
- Missing <utility> header in kernel compilation
- Not related to Whisper, but a broader ROCm ecosystem issue
This doesn’t affect openai-whisper but shows ROCm 7 still has rough edges with some PyTorch operations.
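If you ever want to check whether a given ROCm/PyTorch combination still hits the MIOpen LSTM problem, a tiny forward pass is enough to trigger the RNN kernel compilation. This is purely illustrative and not taken from the pyannote code:
```python
import torch

# A single small LSTM layer forces MIOpen to compile its RNN kernels;
# if that step is broken for gfx1101, this raises a RuntimeError.
lstm = torch.nn.LSTM(input_size=64, hidden_size=64, batch_first=True).to("cuda")
x = torch.randn(1, 100, 64, device="cuda")
try:
    out, _ = lstm(x)
    print("LSTM forward pass OK:", out.shape)
except RuntimeError as exc:
    print("MIOpen/LSTM failure:", exc)
```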
Future Outlook
What Would Fix This?
For faster-whisper to work on AMD GPUs, one of these needs to happen:
- CTranslate2 adds native ROCm support (official maintainers)
- AMD fork catches up to mainline and gets proper maintenance
- Community forks target ROCm 7+ and provide easy installation
Likelihood?
Low-to-medium in the near term:
- CTranslate2 maintainers show no signs of adding ROCm support
- AMD’s focus is on larger enterprise GPUs (MI series)
- Community efforts are fragmented and version-specific
Alternative trajectory:
- More projects may follow whisper.cpp’s approach (native ROCm/HIP support)
- Or new inference engines emerge with first-class ROCm support
Conclusion
You’re not missing anything - this is a fundamental architectural limitation. The “faster” implementations (faster-whisper, WhisperLive, RealtimeSTT) all rely on CTranslate2, which is CUDA-exclusive.
Your solution: Build custom real-time transcription with openai-whisper + PyTorch ROCm, which already provides excellent performance on your RX 7800 XT.