# whisper.cpp + Vulkan Quickstart
## Goal
Get real-time streaming audio transcription working with minimal complexity using whisper.cpp + Vulkan on Ubuntu with AMD GPU.
## Why This Approach?
- ✅ True streaming: ~500ms latency (vs 2s chunked)
- ✅ Simple setup: Just need Vulkan drivers (already installed)
- ✅ Good enough performance: ~25x real-time
- ✅ No ROCm complexity: Easier build, fewer flags
## Prerequisites Check
```bash
# 1. Verify Vulkan is available
vulkaninfo | grep -i "deviceName"
# Should show: AMD Radeon RX 7800 XT

# 2. Check Vulkan version
vulkaninfo | grep -i "apiVersion"
# Should show: 1.3.x or higher

# 3. If vulkaninfo is not found:
sudo apt install vulkan-tools

# 4. Verify the GPU is visible to Vulkan
vulkaninfo | grep -i "discrete"
# Should show your RX 7800 XT as a discrete GPU
```

## Installation (15 minutes)
### Step 1: Install Dependencies
```bash
# Build tools
sudo apt install -y git build-essential cmake

# Vulkan development libraries
sudo apt install -y libvulkan-dev vulkan-tools

# SDL2 (required by the stream example) and audio tools for testing
sudo apt install -y libsdl2-dev ffmpeg
```

### Step 2: Clone and Build whisper.cpp
```bash
# Clone repository
cd ~
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp

# Build with Vulkan support (WHISPER_SDL2 is needed for the stream example)
cmake -B build \
  -DGGML_VULKAN=ON \
  -DWHISPER_SDL2=ON \
  -DWHISPER_BUILD_EXAMPLES=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build -j$(nproc)

# Verify the build succeeded
ls build/bin/
# Should see: main, stream, bench, etc.
```

### Step 3: Download Models
```bash
# Download the base model (good balance of speed/accuracy)
cd ~/whisper.cpp
bash ./models/download-ggml-model.sh base

# Or download other models:
# bash ./models/download-ggml-model.sh tiny    # fastest
# bash ./models/download-ggml-model.sh small   # more accurate
# bash ./models/download-ggml-model.sh medium  # even more accurate

# Verify the model downloaded
ls models/
# Should see: ggml-base.bin
```

## Basic Usage
### Test with Audio File
```bash
# Download sample audio
cd ~/whisper.cpp
wget https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav -O samples/test.wav

# Transcribe
./build/bin/main -m models/ggml-base.bin -f samples/test.wav

# Should output the transcription text
```

### Test GPU Acceleration
```bash
# whisper.cpp prints backend/device info at startup; check for Vulkan
./build/bin/main -m models/ggml-base.bin -f samples/test.wav 2>&1 | grep -i vulkan

# Look for lines mentioning Vulkan and your GPU, e.g.:
# "ggml_vulkan: ... AMD Radeon RX 7800 XT"
```

## Real-Time Streaming Transcription
### Basic Streaming (Microphone Input)
```bash
# Start streaming transcription
cd ~/whisper.cpp
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 8000 \
  --keep 200 \
  --max-tokens 32 \
  --audio-ctx 0

# Parameters explained:
# --step 3000     : Process every 3 seconds of audio
# --length 8000   : Use 8 seconds of context
# --keep 200      : Keep the last 200ms for continuity
# --max-tokens 32 : Max tokens per segment
# --audio-ctx 0   : 0 = full audio context; smaller values trade accuracy for speed
```

What you’ll see:
```
[00:00.000 --> 00:03.000]  Hello, this is a test
[00:03.000 --> 00:06.000]  of real-time transcription
[00:06.000 --> 00:09.000]  using Whisper and Vulkan
```

Latency: ~500ms from speech to text output
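The sliding-window behavior these flags describe can be easier to see numerically. Below is a minimal illustrative sketch (plain Python, an assumption about the semantics described above, not whisper.cpp's actual implementation) of how `--step`, `--length`, and `--keep` carve up the audio timeline:

```python
# Illustrative sketch of the sliding window described above; an assumption
# about the parameter semantics, not whisper.cpp's actual code.
STEP_MS = 3000    # new audio consumed per inference pass (--step)
LENGTH_MS = 8000  # maximum context window fed to the model (--length)
KEEP_MS = 200     # audio carried over between passes (--keep)

for i in range(4):
    end = (i + 1) * STEP_MS           # each pass ends one step later
    start = max(0, end - LENGTH_MS)   # the model sees at most LENGTH_MS of audio
    print(f"pass {i}: window {start / 1000:.1f}s -> {end / 1000:.1f}s "
          f"(carrying {KEEP_MS}ms into the next pass)")
```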
### Advanced Streaming Options
```bash
# Higher accuracy (slower)
./build/bin/stream -m models/ggml-base.bin \
  --step 2000 \
  --length 10000 \
  --keep 500 \
  --max-tokens 64

# Lower latency (less accurate)
./build/bin/stream -m models/ggml-base.bin \
  --step 1500 \
  --length 5000 \
  --keep 100 \
  --max-tokens 16

# With a specific audio device
./build/bin/stream -m models/ggml-base.bin \
  --capture 1   # Use audio device 1 (check with: arecord -l)
```

### Stream-to-File
```bash
# Save the transcription to a file (stream writes plain text to stdout)
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 8000 > transcription.txt
```

Note: the stream example only emits plain text. For SRT subtitles, transcribe a recorded file instead: `./build/bin/main -m models/ggml-base.bin -f audio.wav --output-srt`.

## Capture Speaker Output + Mic (Combined)
Since you want both mic and speaker audio:
### Step 1: Set up PulseAudio Combined Sink
```bash
# Create a combined virtual sink (run once)
pactl load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"

# Route the microphone to the combined sink
pactl load-module module-loopback \
  source=alsa_input.pci-0000_00_1f.3.analog-stereo \
  sink=combined \
  latency_msec=1

# Route the speaker monitor to the combined sink
pactl load-module module-loopback \
  source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor \
  sink=combined \
  latency_msec=1
```

### Step 2: Stream from Combined Source
```bash
# The stock stream example captures audio via SDL and does not read stdin,
# so point the default capture source at the combined monitor first:
pactl set-default-source combined.monitor

# Then stream as usual
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 8000
```

Note: you can also select the monitor explicitly with `--capture N`; stream lists the available capture devices at startup.
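The hardware source names used in Step 1 are machine-specific. Here is a small sketch to enumerate what PulseAudio exposes, so you can substitute the right names in the `pactl` commands above (it parses the tab-separated output of `pactl list sources short`):

```python
# List PulseAudio sources so the pactl commands above can use the right
# machine-specific names. Relies only on `pactl list sources short`.
import subprocess

out = subprocess.run(
    ["pactl", "list", "sources", "short"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    # Columns: index, name, driver, sample spec, state (tab-separated)
    fields = line.split("\t")
    index, name = fields[0], fields[1]
    kind = "monitor" if name.endswith(".monitor") else "input"
    print(f"{index:>3}  {kind:<7}  {name}")
```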
## Python Integration
### Simple Subprocess Wrapper
```python
#!/usr/bin/env python3
"""
Real-time streaming transcription with whisper.cpp + Vulkan.
Captures either the microphone or the combined (mic + speakers) source.
"""

import subprocess
from pathlib import Path

# Paths
WHISPER_CPP = Path.home() / "whisper.cpp"
MODEL = WHISPER_CPP / "models" / "ggml-base.bin"
STREAM_BIN = WHISPER_CPP / "build" / "bin" / "stream"


def stream_transcribe(audio_source="default", step=3000, length=8000):
    """
    Start real-time streaming transcription.

    Args:
        audio_source: "default" (mic) or "combined.monitor" (mic + speakers)
        step: Processing interval in ms
        length: Context window in ms
    """
    if audio_source == "combined.monitor":
        # The stock stream binary captures via SDL and does not read stdin,
        # so point the default PulseAudio capture source at the combined
        # monitor before launching it.
        subprocess.run(["pactl", "set-default-source", "combined.monitor"],
                       check=True)
        print("Streaming transcription from mic + speakers...")
    else:
        print("Streaming transcription from microphone...")

    cmd = [
        str(STREAM_BIN),
        "-m", str(MODEL),
        "--step", str(step),
        "--length", str(length),
        "--keep", "200",
        "--max-tokens", "32",
    ]

    print("Press Ctrl+C to stop\n")
    try:
        subprocess.run(cmd)
    except KeyboardInterrupt:
        print("\nStopping...")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Real-time streaming transcription")
    parser.add_argument(
        "--source",
        choices=["mic", "combined"],
        default="mic",
        help="Audio source: mic (microphone only) or combined (mic + speakers)",
    )
    parser.add_argument("--step", type=int, default=3000, help="Processing step (ms)")
    parser.add_argument("--length", type=int, default=8000, help="Context length (ms)")

    args = parser.parse_args()

    audio_source = "combined.monitor" if args.source == "combined" else "default"
    stream_transcribe(audio_source, args.step, args.length)
```

Usage:
```bash
# Mic only
python stream_transcribe.py --source mic

# Mic + speakers
python stream_transcribe.py --source combined

# Adjust latency/accuracy
python stream_transcribe.py --source combined --step 2000 --length 10000
```

## Performance Tuning
### Model Selection
| Model | Speed (RX 7800 XT) | Accuracy | Latency | Best For |
|---|---|---|---|---|
| tiny | ~50x real-time | Basic | ~300ms | Ultra-fast, casual speech |
| base | ~25x real-time | Good | ~500ms | Recommended balance |
| small | ~10x real-time | Better | ~800ms | Higher accuracy needed |
| medium | ~4x real-time | Great | ~1.5s | Offline processing |
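These numbers will vary with drivers and clocks; whisper.cpp ships a `bench` example you can use to check where each model lands on your own card. A minimal sketch (paths assume the `~/whisper.cpp` layout from the install steps; only models you have already downloaded are benchmarked):

```python
# Benchmark each downloaded model with whisper.cpp's bench example.
# Paths assume the ~/whisper.cpp layout from the installation steps above.
import subprocess
from pathlib import Path

WHISPER_CPP = Path.home() / "whisper.cpp"
BENCH = WHISPER_CPP / "build" / "bin" / "bench"

for name in ["tiny", "base", "small", "medium"]:
    model = WHISPER_CPP / "models" / f"ggml-{name}.bin"
    if not model.exists():
        print(f"{name}: not downloaded, skipping")
        continue
    print(f"--- {name} ---")
    subprocess.run([str(BENCH), "-m", str(model)])
```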
### Latency vs Accuracy Trade-off
```bash
# Lowest latency (~300ms, less accurate)
./build/bin/stream -m models/ggml-tiny.bin \
  --step 1000 \
  --length 3000 \
  --keep 100

# Balanced (~500ms, good accuracy)
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 8000 \
  --keep 200

# Higher accuracy (~1s, better context)
./build/bin/stream -m models/ggml-small.bin \
  --step 4000 \
  --length 12000 \
  --keep 500
```

## Troubleshooting
### Issue: “No Vulkan device found”
```bash
# Check Vulkan is working
vulkaninfo | grep -i device

# If there is no output, reinstall Vulkan
sudo apt install --reinstall libvulkan1 mesa-vulkan-drivers vulkan-tools

# Verify the GPU driver
lspci -k | grep -A 3 VGA
# Should show the amdgpu kernel driver
```

### Issue: Stream binary not found
```bash
# Rebuild with examples and SDL2 enabled (stream requires SDL2)
cd ~/whisper.cpp
rm -rf build
cmake -B build -DGGML_VULKAN=ON -DWHISPER_SDL2=ON -DWHISPER_BUILD_EXAMPLES=ON
cmake --build build -j$(nproc)

# Verify
ls build/bin/stream
```

### Issue: “Cannot open audio device”
```bash
# List audio devices
arecord -l

# Test audio capture
arecord -d 3 -f S16_LE -r 16000 test.wav
aplay test.wav

# If there is no audio, check PulseAudio
pactl list sources short
```

### Issue: Poor transcription quality
```bash
# Try a larger model
bash models/download-ggml-model.sh small
./build/bin/stream -m models/ggml-small.bin

# Increase the context window
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 12000   # Increased from 8000

# Check audio input quality
parecord --device=combined.monitor test.wav
# Listen to test.wav - it should be clear
```

### Issue: High CPU usage / slow performance
```bash
# Verify Vulkan is actually being used
./build/bin/main -m models/ggml-base.bin -f samples/test.wav 2>&1 | grep -i vulkan

# Should see Vulkan device lines. If not, rebuild:
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Check GPU usage while running
watch -n 1 rocm-smi   # if ROCm tools are installed
# or, without ROCm:
radeontop
# Should show GPU activity while transcribing
```

### Issue: Streaming lags behind real-time
```bash
# Reduce the processing step (faster, but may cut sentences)
./build/bin/stream -m models/ggml-base.bin \
  --step 2000 \
  --length 6000

# Or use the tiny model
./build/bin/stream -m models/ggml-tiny.bin \
  --step 2000 \
  --length 6000
```

## Making PulseAudio Combined Sink Permanent
Add to ~/.config/pulse/default.pa:
```bash
# Create the file if it doesn't exist (run once; >> appends)
mkdir -p ~/.config/pulse
cat >> ~/.config/pulse/default.pa << 'EOF'
.include /etc/pulse/default.pa

# Combined audio sink (mic + speakers)
load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"

# Route mic to combined (replace with your mic source)
load-module module-loopback source=alsa_input.pci-0000_00_1f.3.analog-stereo sink=combined latency_msec=1

# Route speakers to combined (replace with your speaker monitor)
load-module module-loopback source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor sink=combined latency_msec=1
EOF
```
```bash
# Restart PulseAudio
pulseaudio -k
pulseaudio --start

# Verify
pactl list sources | grep -i combined
```

## Next Steps
### Integration Ideas
1. **Live Subtitles for Video Calls**

```bash
# Transcribe meeting audio in real time (stream emits plain text)
./build/bin/stream -m models/ggml-base.bin | \
while read -r line; do
  echo "$line"   # Send to a subtitle overlay app
done
```

2. **Voice Command Detection**
```python
# Watch for specific keywords
import subprocess

proc = subprocess.Popen(
    ["./build/bin/stream", "-m", "models/ggml-base.bin"],
    stdout=subprocess.PIPE,
    text=True,
)

for line in proc.stdout:
    text = line.lower()
    if "hey computer" in text:
        print("Wake word detected!")
    if "open browser" in text:
        subprocess.run(["firefox"])
```

3. **Meeting Transcription Logger**
```bash
# Save all transcriptions with timestamps
./build/bin/stream -m models/ggml-base.bin | \
while read -r line; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') | $line" >> meeting_log.txt
done
```

## Performance Expectations
### Your Setup (RX 7800 XT + Vulkan)
Expected performance:
- Model: base
- Real-time factor: ~25x (processes 1s audio in ~0.04s)
- Streaming latency: ~500ms
- Accuracy: Good for conversational speech
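The real-time factor is simply audio duration divided by wall-clock processing time, so it is easy to verify on your own machine. A rough sketch using the sample clip from earlier (assumes the `~/whisper.cpp` layout; jfk.wav is about 11 seconds, so adjust `AUDIO_SECONDS` for other files):

```python
# Rough real-time-factor check: audio duration / wall-clock transcription time.
# Assumes the ~/whisper.cpp layout and the sample downloaded earlier.
# Note: includes model-load time, so steady-state throughput is higher.
import subprocess
import time
from pathlib import Path

WHISPER_CPP = Path.home() / "whisper.cpp"
MAIN = WHISPER_CPP / "build" / "bin" / "main"
MODEL = WHISPER_CPP / "models" / "ggml-base.bin"
AUDIO = WHISPER_CPP / "samples" / "test.wav"
AUDIO_SECONDS = 11.0  # jfk.wav is ~11s; change for other files

start = time.monotonic()
subprocess.run([str(MAIN), "-m", str(MODEL), "-f", str(AUDIO)],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
elapsed = time.monotonic() - start
print(f"processed {AUDIO_SECONDS:.0f}s of audio in {elapsed:.2f}s "
      f"-> ~{AUDIO_SECONDS / elapsed:.0f}x real-time")
```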
Comparison:
- openai-whisper (chunked): 2s latency, 38x real-time
- whisper.cpp (Vulkan): 500ms latency, 25x real-time ← Your choice
- whisper.cpp (ROCm): 300ms latency, 38x real-time (harder to build)
You chose: Better latency over maximum performance ✓
## Summary
What you got:
- ✅ Real-time streaming transcription
- ✅ ~500ms latency (good enough!)
- ✅ Simple setup (no ROCm complexity)
- ✅ Works with mic + speakers
- ✅ Python-friendly integration
- Build time: ~15 minutes
- Complexity: Medium (much simpler than a ROCm build)
- Performance: Good enough for real-time use
You’re ready to start streaming!
```bash
cd ~/whisper.cpp
./build/bin/stream -m models/ggml-base.bin
```