Overview

Research on real-time audio capture and transcription on Ubuntu with AMD GPU (RX 7800 XT).

Key Findings

Why faster-whisper Doesn’t Work on AMD GPUs

Root cause: CTranslate2 (the inference engine behind faster-whisper) only supports CUDA, not ROCm.

This affects:

  • faster-whisper: Requires CTranslate2 → CUDA only
  • WhisperLive: Uses faster-whisper → Won’t work on AMD
  • RealtimeSTT: Uses faster-whisper → Won’t work on AMD

See: whisper-rocm-compatibility.md for full technical details
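
A quick way to verify this on your own machine (assuming the ctranslate2 package is installed; the check below is illustrative, not from the compatibility doc):

import ctranslate2

# On a ROCm-only AMD system this prints 0: CTranslate2 ships no HIP/ROCm
# backend, so faster-whisper has no GPU backend to fall back on.
print("CUDA devices visible to CTranslate2:", ctranslate2.get_cuda_device_count())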

Solutions for Real-Time Transcription

Approach                  Latency  Complexity  Performance    Recommended For
openai-whisper (chunked)  2s       Low         38x real-time  Good enough, already working
whisper.cpp + Vulkan      500ms    Medium      25x real-time  Real-time streaming
whisper.cpp + ROCm        300ms    High        38x real-time  Maximum performance needed

Documentation Files

1. whisper-rocm-compatibility.md

Comprehensive technical analysis:

  • Why CTranslate2 doesn’t support ROCm
  • Architecture differences between openai-whisper and faster-whisper
  • Why WhisperLive and RealtimeSTT won’t work
  • Stateless vs stateful streaming
  • KV-cache and beam search internals
  • Vulkan’s role in GPU acceleration

2. audio-capture-guide.md

Complete Ubuntu audio setup:

  • Capturing microphone input (PyAudio, PulseAudio, ALSA)
  • Capturing speaker output (monitor sources)
  • Combining mic + speakers (virtual sinks)
  • Real-time transcription implementation
  • Voice Activity Detection (VAD)
  • Performance optimization for RX 7800 XT

3. whisper-cpp-vulkan-quickstart.md

Practical implementation guide:

  • 15-minute setup for whisper.cpp + Vulkan
  • Real-time streaming transcription
  • Python integration examples
  • Combined audio source (mic + speakers)
  • Troubleshooting common issues
  • Performance tuning recommendations

Quick Start

If You Want: Simple Chunked Transcription (Already Working)

import whisper
import pyaudio
import numpy as np

model = whisper.load_model("base", device="cuda")  # PyTorch ROCm exposes the AMD GPU as "cuda"
audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True)
while True:
    chunk = stream.read(32000, exception_on_overflow=False)  # 32000 frames at 16 kHz = 2 seconds
    # Scale 16-bit PCM to the float32 range [-1.0, 1.0] that Whisper expects
    audio_np = np.frombuffer(chunk, dtype=np.int16).astype(np.float32) / 32768.0
    result = model.transcribe(audio_np, fp16=False)
    print(result["text"])

Latency: 2 seconds
Performance: 38x real-time (excellent)
Complexity: Low (about a dozen lines of Python)

If You Want: True Real-Time Streaming (<1s latency)

# Install whisper.cpp with Vulkan (the stream example also needs SDL2;
# the Vulkan backend may additionally need the glslc shader compiler)
sudo apt install libvulkan-dev libsdl2-dev
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
cmake -B build -DGGML_VULKAN=ON -DWHISPER_SDL2=ON -DWHISPER_BUILD_EXAMPLES=ON
cmake --build build -j$(nproc)
bash models/download-ggml-model.sh base
# Stream transcription (the binary is named whisper-stream in newer builds)
./build/bin/stream -m models/ggml-base.bin --step 3000 --length 8000

Latency: ~500ms
Performance: ~25x real-time (good enough)
Complexity: Medium (15-minute setup)

See: whisper-cpp-vulkan-quickstart.md for full guide
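
A minimal Python integration sketch, assuming the binary and model paths produced by the build commands above (adjust to your layout; the quickstart doc has the full version):

import subprocess

# Spawn the whisper.cpp stream example and forward its transcript lines.
# Flags mirror the command above; stream also prints status lines, so
# filter its output as needed for your application.
proc = subprocess.Popen(
    ["./build/bin/stream", "-m", "models/ggml-base.bin",
     "--step", "3000", "--length", "8000"],
    stdout=subprocess.PIPE,
    text=True,
)
for line in proc.stdout:
    text = line.strip()
    if text:
        print("transcript:", text)  # hand off to your application here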

Recommendations by Use Case

Meeting/Conversation Transcription

  • Use: openai-whisper (chunked)
  • Why: 2s latency is fine, already working, simple Python code

Live Captions / Real-Time Monitoring

  • Use: whisper.cpp + Vulkan
  • Why: ~500ms latency feels live, and streaming handles mid-sentence cuts gracefully

Maximum Performance Needed

  • Use: whisper.cpp + ROCm
  • Why: Best performance, but complex setup (2-4 hours)

Distributing to Others (Unknown GPUs)

  • Use: whisper.cpp + Vulkan
  • Why: Works on AMD/NVIDIA/Intel, no special drivers needed

Hardware Context

System: Ubuntu + AMD RX 7800 XT (gfx1101) + ROCm 7.x

Performance benchmarks:

  • openai-whisper (base): 38x real-time
  • whisper.cpp Vulkan (base): ~25x real-time (estimated)
  • whisper.cpp ROCm (base): ~38x real-time (estimated)

Key Technical Insights

Chunking vs Streaming

Every approach must chunk: audio is a continuous stream, but models can only process discrete segments.

The difference:

  • Stateless chunking (openai-whisper): Each chunk is processed independently, with no memory of earlier audio
  • Stateful streaming (whisper.cpp): Maintains a KV-cache across windows and can revise previous output (see the sketch after this list)
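
A minimal sketch of the difference in plain Python. Here transcribe stands in for any model above, and chunks yields 16 kHz float32 audio; carrying an overlap only approximates at the audio level what a true stateful decoder does with its KV-cache:

import numpy as np

SAMPLE_RATE = 16000
OVERLAP = SAMPLE_RATE  # carry 1 second of audio into the next window

def stateless(chunks, transcribe):
    # openai-whisper style: each chunk is decoded with no memory of the last
    for chunk in chunks:
        yield transcribe(chunk)

def with_overlap(chunks, transcribe):
    # Poor man's streaming: prepend the tail of the previous chunk so words
    # cut at a boundary reappear in the next window (then deduplicate).
    tail = np.zeros(0, dtype=np.float32)
    for chunk in chunks:
        window = np.concatenate([tail, chunk])
        tail = chunk[-OVERLAP:]
        yield transcribe(window)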

Mid-Sentence Handling

Chunked approaches (openai-whisper):

  • Cut audio at fixed intervals
  • Can split mid-sentence
  • Mitigation: Overlapping chunks, VAD-based chunking, deduplication (see the sketch after this list)
  • ~85% accuracy at sentence boundaries
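
A sketch of the VAD-based chunking mitigation, using the webrtcvad package (an assumption for illustration; the audio-capture guide covers VAD in detail). Cuts land in pauses rather than at fixed intervals:

import webrtcvad

SAMPLE_RATE = 16000
vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (strict)

def vad_chunks(frames, max_silent_frames=10):
    # `frames` yields 30 ms of 16-bit mono PCM bytes each, as webrtcvad requires.
    # Emit a chunk once ~300 ms of trailing silence accumulates, so the cut
    # falls in a pause rather than mid-word.
    buf, silent = b"", 0
    for frame in frames:
        buf += frame
        silent = 0 if vad.is_speech(frame, SAMPLE_RATE) else silent + 1
        if silent >= max_silent_frames and len(buf) > len(frame) * max_silent_frames:
            yield buf
            buf, silent = b"", 0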

Streaming approaches (whisper.cpp):

  • Sliding windows with overlap
  • Maintains context across chunks
  • Can revise previous transcription
  • ~95%+ accuracy at sentence boundaries

Vulkan’s Role

Vulkan: Cross-platform GPU compute API (works on AMD/NVIDIA/Intel)

Performance: ~60-80% of native (ROCm/CUDA) performance

Use when: Portability matters more than maximum performance

Don’t use when: Single platform + native API already working (your case: ROCm works)

For your goal (real-time streaming, performance not critical): Vulkan is perfect!
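
Before building, a quick sanity check that the card is visible to Vulkan (assuming Ubuntu's vulkan-tools package is installed; the RX 7800 XT should appear in the summary):

import subprocess

# Dump the Vulkan device summary; look for the RADV/AMD entry.
print(subprocess.run(["vulkaninfo", "--summary"],
                     capture_output=True, text=True).stdout)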

Decision Tree

Do you need real-time streaming (<1s latency)?
├─ No → Use openai-whisper (chunked)
│         ✓ Already working
│         ✓ Simple Python code
│         ✓ 2s latency acceptable
└─ Yes → Is performance critical?
    ├─ No → Use whisper.cpp + Vulkan
    │         ✓ 500ms latency
    │         ✓ Simple build
    │         ✓ 25x real-time (good enough)
    └─ Yes → Use whisper.cpp + ROCm
              ✓ 300ms latency
              ✓ 38x real-time
              ✓ Complex build (2-4 hours)

Your path: whisper.cpp + Vulkan ✓

Next Steps

  1. Read: whisper-cpp-vulkan-quickstart.md
  2. Build: whisper.cpp with Vulkan (15 minutes)
  3. Test: Basic streaming with ./build/bin/stream
  4. Integrate: Set up combined audio source (mic + speakers; see the sketch below)
  5. Deploy: Python wrapper for your application

Estimated time to working solution: 30-60 minutes
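
For step 4, a sketch of the combined source using PulseAudio's stock modules, wrapped in Python for convenience. The source names are placeholders; list yours with pactl list short sources (the audio-capture guide has the full walkthrough):

import subprocess

def pactl(*args):
    subprocess.run(["pactl", *args], check=True)

# Null sink that carries the mix of microphone and speaker audio.
pactl("load-module", "module-null-sink", "sink_name=combined")

# Loop both sources into it. Replace the placeholders with real names
# from `pactl list short sources`.
pactl("load-module", "module-loopback", "source=<your-mic-source>", "sink=combined")
pactl("load-module", "module-loopback", "source=<your-output-sink>.monitor", "sink=combined")

# Then record from "combined.monitor" with PyAudio or whisper.cpp stream.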

Additional Resources