The Root Cause

You had to use openai-whisper instead of faster-whisper because of CTranslate2’s lack of native ROCm support.

Technical Breakdown

Architecture Differences

| Component | openai-whisper | faster-whisper |
|---|---|---|
| Backend | Pure PyTorch | CTranslate2 inference engine |
| GPU Support | CUDA + ROCm | CUDA only ❌ |
| Speed | Baseline (1-3x real-time) | 4-8x faster |
| Memory | Higher VRAM usage | 50% less VRAM |
| ROCm 7 Compatible | ✅ Yes (via PyTorch) | ❌ No (CTranslate2 limitation) |

Why CTranslate2 Doesn’t Support ROCm

CTranslate2 is an optimized inference engine that provides:

  • INT8 quantization (vs FP16)
  • Kernel fusion
  • Better memory management
  • CPU/GPU parallelization

However, it was built specifically for NVIDIA CUDA and does not have native ROCm support. The official PyPI packages (pip install ctranslate2) are CUDA-only.
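
For reference, this is roughly what the CTranslate2 path looks like through faster-whisper on an NVIDIA machine (a sketch; the model size and audio file name are placeholders):

from faster_whisper import WhisperModel

# CTranslate2 backend: INT8 quantization roughly halves VRAM vs FP16
model = WhisperModel("base", device="cuda", compute_type="int8")

segments, info = model.transcribe("audio.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

On an AMD GPU, constructing this model would fail: device="cuda" resolves through CTranslate2's CUDA runtime, not through PyTorch.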

PyTorch’s Role

  • openai-whisper uses PyTorch directly, which has excellent ROCm support through PyTorch ROCm builds
  • faster-whisper bypasses PyTorch and uses CTranslate2 for inference, losing ROCm compatibility
  • Your RX 7800 XT works perfectly with PyTorch ROCm 7.x, but CTranslate2 doesn’t use PyTorch’s ROCm backend
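
A quick way to confirm the PyTorch side is working (a sketch; on ROCm builds the AMD GPU is exposed through PyTorch's cuda API):

import torch

print(torch.version.hip)              # set on ROCm builds of PyTorch, None on CUDA builds
print(torch.cuda.is_available())      # True when the RX 7800 XT is visible
print(torch.cuda.get_device_name(0))  # e.g. "AMD Radeon RX 7800 XT"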

Your Current Performance

With openai-whisper + ROCm 7 + RX 7800 XT:

  • Base model: ~38x real-time
  • Large-v3 model: ~2-5x real-time
  • VRAM usage: 1-8GB depending on model

This is actually excellent performance - near what faster-whisper achieves on NVIDIA GPUs!
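
If you want to reproduce those multiples yourself, the usual measurement is audio duration divided by wall-clock transcription time (a sketch; sample.wav is a placeholder):

import time
import whisper

model = whisper.load_model("base")
audio = whisper.load_audio("sample.wav")           # resampled to 16 kHz mono
duration = len(audio) / whisper.audio.SAMPLE_RATE  # seconds of audio

start = time.perf_counter()
model.transcribe(audio, fp16=False)
elapsed = time.perf_counter() - start
print(f"{duration / elapsed:.1f}x real-time")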

Community Workarounds

Community CTranslate2-ROCm Forks

There are unofficial ROCm builds of CTranslate2:

  1. arlo-phoenix/CTranslate2-rocm

    • Reportedly ~60% faster than whisper.cpp
    • Requires building from source
    • May not work perfectly with ROCm 7
  2. ROCm/CTranslate2 (amd_dev branch)

    • AMD’s official fork
    • Significantly behind mainline CTranslate2
    • Not recommended (as of Oct 2025)
  3. Donkey545/wyoming-faster-whisper-rocm

    • Pre-built libraries for ROCm
    • Used in Home Assistant Wyoming protocol
    • Requires specific ROCm versions

Challenges with Community Builds

  • Architecture-specific: Must build for your exact GPU (gfx1101 for RX 7800 XT); see the check after this list
  • ROCm version sensitivity: May not work with ROCm 7.x
  • Maintenance lag: Community forks fall behind mainline CTranslate2
  • Complex setup: Requires manual compilation with specific flags
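
To find the architecture string to build against, rocminfo reports it, and recent ROCm builds of PyTorch expose it as well (a sketch; gcnArchName is only present on ROCm builds):

import torch

props = torch.cuda.get_device_properties(0)
# On ROCm builds this is the LLVM target, e.g. "gfx1101" for the RX 7800 XT
print(props.gcnArchName)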

Impact on Real-Time Transcription Libraries

WhisperLive

Repository: https://github.com/collabora/WhisperLive

Backend: faster-whisper (CTranslate2)

ROCm Compatibility: ❌ Will NOT work with AMD GPUs

Why: WhisperLive explicitly uses faster-whisper as its backend for “nearly-live” transcription. Since faster-whisper requires CTranslate2, and CTranslate2 doesn’t support ROCm, WhisperLive inherits the same limitation.

Alternative: You would need to modify WhisperLive to use openai-whisper instead of faster-whisper, but this would significantly reduce performance.

RealtimeSTT

Repository: https://github.com/KoljaB/RealtimeSTT

Backend: faster_whisper for transcription

ROCm Compatibility: ❌ Will NOT work with AMD GPUs

Why: RealtimeSTT uses faster_whisper for its instant GPU-accelerated transcription feature. The library’s architecture includes:

  • WebRTCVAD + SileroVAD for voice activity detection
  • faster_whisper for transcription (requires CTranslate2)
  • Porcupine for wake word detection

Default Installation: RealtimeSTT installs CPU-only PyTorch by default. Even if you swap in a ROCm build of PyTorch, the faster_whisper dependency will still fail on AMD GPUs, because its inference goes through CTranslate2 rather than PyTorch.

Alternative: The library architecture would need to be modified to support openai-whisper as a backend option.

Working Real-Time Solutions for AMD GPUs

Since both WhisperLive and RealtimeSTT won’t work out-of-the-box, here are your options:

Option 1: Custom Implementation with openai-whisper

Build your own real-time transcription using:

  • PyAudio or PulseAudio for audio capture
  • openai-whisper with PyTorch ROCm for transcription
  • Chunked processing (process audio in 1-2 second segments)

Performance: Should achieve near-real-time with your RX 7800 XT (a concrete sketch appears under Recommendations below)

Option 2: insanely-fast-whisper-rocm

Repository: https://github.com/beecave-homelab/insanely-fast-whisper-rocm

This is a Docker-based solution specifically designed for AMD GPUs with ROCm 6.1. It includes:

  • Pre-configured PyTorch + ROCm environment
  • Optimized Whisper implementation
  • Easier setup than building from source

Caveat: Designed for ROCm 6.1, may need adaptation for ROCm 7.x

Option 3: whisper.cpp with ROCm

Repository: https://github.com/ggerganov/whisper.cpp

  • Written in C++, uses GGML format (like llama.cpp)
  • Has ROCm/HIP support
  • Can be 2-3x faster than openai-whisper
  • Requires compilation with ROCm flags

Pros:

  • Native ROCm support
  • Very fast inference
  • Low memory usage

Cons:

  • More complex to integrate into Python projects (usually by shelling out to its CLI; see the sketch after this list)
  • Requires building from source with correct flags
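
For the integration point, the simplest bridge is calling the whisper.cpp CLI from Python (a sketch; the binary and model paths are assumptions, and the binary is named main in older builds):

import subprocess

WHISPER_CLI = "./build/bin/whisper-cli"  # assumed path after a ROCm/HIP build
MODEL = "models/ggml-base.bin"

def transcribe(wav_path: str) -> str:
    # -m selects the GGML model, -f the 16 kHz WAV input, -nt drops timestamps
    result = subprocess.run(
        [WHISPER_CLI, "-m", MODEL, "-f", wav_path, "-nt"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(transcribe("sample.wav"))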

Recommendations

For Your Use Case (Real-Time Audio Capture + Transcription)

Best approach: Custom implementation with openai-whisper

import whisper
import pyaudio
import numpy as np

# Your existing setup already works: load_model picks up the PyTorch ROCm backend
model = whisper.load_model("base")  # ROCm-accelerated

# Audio capture from mic or speakers (16 kHz mono, the format Whisper expects)
audio = pyaudio.PyAudio()
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    frames_per_buffer=16000,  # 1-second chunks
)

# Real-time transcription loop
while True:
    audio_chunk = stream.read(16000)
    # Convert 16-bit PCM to float32 in [-1, 1]
    audio_np = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float32) / 32768.0
    result = model.transcribe(audio_np, fp16=False)  # fp16=False for ROCm
    print(result["text"])
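
One caveat: stream.read blocks while model.transcribe runs, so audio captured during inference can be dropped. A minimal sketch of decoupling capture from transcription with a background thread (reuses model, stream, and np from the code above and replaces that loop):

import queue
import threading

audio_q = queue.Queue()

def capture():
    # Keep pulling audio while the model is busy so nothing is lost
    while True:
        audio_q.put(stream.read(16000, exception_on_overflow=False))

threading.Thread(target=capture, daemon=True).start()

while True:
    chunk = audio_q.get()
    audio_np = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0
    print(model.transcribe(audio_np, fp16=False)["text"])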

Why this works:

  • ✅ Uses your existing openai-whisper + ROCm setup
  • ✅ Near-real-time performance (38x for base model)
  • ✅ No complex dependencies
  • ✅ Full control over audio sources (mic, speakers, or both)

If You Need Maximum Speed

Consider investing time in whisper.cpp with ROCm, but understand:

  • Significant compilation complexity
  • Less Python-friendly API
  • Marginal real-world gain over your current ~38x real-time, even if raw inference is 2-3x faster

Don’t Bother With

  • ❌ WhisperLive (requires CTranslate2/CUDA)
  • ❌ RealtimeSTT (requires CTranslate2/CUDA)
  • ❌ faster-whisper community forks (too much hassle for ROCm 7)

Summary: Why You’re Stuck with openai-whisper

  1. faster-whisper requires CTranslate2
  2. CTranslate2 only supports CUDA (no native ROCm)
  3. WhisperLive and RealtimeSTT both use faster-whisper
  4. Therefore, all three fail on AMD GPUs

The good news: Your openai-whisper + PyTorch ROCm setup already provides excellent performance (38x real-time), which is competitive with faster-whisper on NVIDIA GPUs for your use case.

ROCm 7 Specific Issues

PyTorch ROCm Compatibility

PyTorch with ROCm 7 works great for openai-whisper:

  • ROCm 7.0+: Improved performance, expanded datatype support
  • ROCm 7.1: Faster, more reliable, easier for developers
  • Compatible PyTorch versions: 2.2.1+ recommended

CTranslate2 ROCm Status (as of Nov 2025)

Still no official ROCm support:

  • Official builds: CUDA only
  • AMD fork (amd_dev branch): Too far behind mainline
  • Community forks: Most target ROCm 5.x or 6.x, not 7.x

MIOpen Issues (Bonus Context)

Your past experience with pyannote (speaker diarization) failing on GPU was due to MIOpen compilation issues:

  • LSTM layers fail to compile for gfx1101 (RX 7800 XT)
  • Missing <utility> header in kernel compilation
  • Not related to Whisper, but a broader ROCm ecosystem issue

This doesn’t affect openai-whisper but shows ROCm 7 still has rough edges with some PyTorch operations.

Future Outlook

What Would Fix This?

For faster-whisper to work on AMD GPUs, one of these needs to happen:

  1. CTranslate2 adds native ROCm support (official maintainers)
  2. AMD fork catches up to mainline and gets proper maintenance
  3. Community forks target ROCm 7+ and provide easy installation

Likelihood?

Low-to-medium in the near term:

  • CTranslate2 maintainers show no signs of adding ROCm support
  • AMD’s focus is on larger enterprise GPUs (MI series)
  • Community efforts are fragmented and version-specific

Alternative trajectory:

  • More projects may follow whisper.cpp’s approach (native ROCm/HIP support)
  • Or new inference engines emerge with first-class ROCm support

Conclusion

You’re not missing anything - this is a fundamental architectural limitation. The “faster” implementations (faster-whisper, WhisperLive, RealtimeSTT) all rely on CTranslate2, which is CUDA-exclusive.

Your solution: Build custom real-time transcription with openai-whisper + PyTorch ROCm, which already provides excellent performance on your RX 7800 XT.