# whisper.cpp + Vulkan Quickstart
## Goal
Get real-time streaming audio transcription working with minimal complexity using whisper.cpp + Vulkan on Ubuntu with AMD GPU.
## Why This Approach?
- ✅ True streaming: ~500ms latency (vs 2s chunked)
- ✅ Simple setup: Just need Vulkan drivers (already installed)
- ✅ Good enough performance: ~25x real-time
- ✅ No ROCm complexity: Easier build, fewer flags
## Prerequisites Check
```bash
# 1. Verify Vulkan is available
vulkaninfo | grep -i "deviceName"
# Should show: AMD Radeon RX 7800 XT

# 2. Check Vulkan version
vulkaninfo | grep -i "apiVersion"
# Should show: 1.3.x or higher

# 3. If vulkaninfo is not found:
sudo apt install vulkan-tools

# 4. Verify the GPU is visible to Vulkan
vulkaninfo | grep -i "discrete"
# Should show your RX 7800 XT as a discrete GPU
```

## Installation (15 minutes)
### Step 1: Install Dependencies
```bash
# Build tools
sudo apt install -y git build-essential cmake

# Vulkan development libraries
sudo apt install -y libvulkan-dev vulkan-tools

# SDL2 (required by the stream example) and audio tools for testing
sudo apt install -y libsdl2-dev ffmpeg
```

### Step 2: Clone and Build whisper.cpp
```bash
# Clone repository
cd ~
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp

# Build with Vulkan support (WHISPER_SDL2 is needed for the stream example)
cmake -B build \
  -DGGML_VULKAN=ON \
  -DWHISPER_SDL2=ON \
  -DWHISPER_BUILD_EXAMPLES=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build -j$(nproc)

# Verify the build succeeded
ls build/bin/
# Should see: main, stream, bench, etc.
```

### Step 3: Download Models
```bash
# Download the base model (good balance of speed/accuracy)
cd ~/whisper.cpp
bash ./models/download-ggml-model.sh base

# Or download other models:
# bash ./models/download-ggml-model.sh tiny    # fastest
# bash ./models/download-ggml-model.sh small   # more accurate
# bash ./models/download-ggml-model.sh medium  # even more accurate

# Verify the model downloaded
ls models/
# Should see: ggml-base.bin
```

## Basic Usage
### Test with Audio File
```bash
# Download sample audio
cd ~/whisper.cpp
wget https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav -O samples/test.wav

# Transcribe
./build/bin/main -m models/ggml-base.bin -f samples/test.wav

# Should output the transcription text
```

### Test GPU Acceleration
```bash
# whisper.cpp prints backend/device info at startup; check for Vulkan
./build/bin/main -m models/ggml-base.bin -f samples/test.wav 2>&1 | grep -i vulkan

# Look for lines mentioning Vulkan and your GPU, e.g.:
# "ggml_vulkan: ... AMD Radeon RX 7800 XT"
```

## Real-Time Streaming Transcription
### Basic Streaming (Microphone Input)
```bash
# Start streaming transcription
cd ~/whisper.cpp
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 8000 \
  --keep 200 \
  --max-tokens 32 \
  --audio-ctx 0

# Parameters explained:
# --step 3000     : Process every 3 seconds of audio
# --length 8000   : Use 8 seconds of context
# --keep 200      : Keep the last 200ms for continuity
# --max-tokens 32 : Max tokens per segment
# --audio-ctx 0   : 0 = full audio context; smaller values trade accuracy for speed
```

What you’ll see:
```
[00:00.000 --> 00:03.000]  Hello, this is a test
[00:03.000 --> 00:06.000]  of real-time transcription
[00:06.000 --> 00:09.000]  using Whisper and Vulkan
```

Latency: ~500ms from speech to text output
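The sliding-window behavior these flags describe can be easier to see numerically. Below is a minimal illustrative sketch (plain Python, an assumption about the semantics described above, not whisper.cpp's actual implementation) of how `--step`, `--length`, and `--keep` carve up the audio timeline:

```python
# Illustrative sketch of the sliding window described above; an assumption
# about the parameter semantics, not whisper.cpp's actual code.
STEP_MS = 3000    # new audio consumed per inference pass (--step)
LENGTH_MS = 8000  # maximum context window fed to the model (--length)
KEEP_MS = 200     # audio carried over between passes (--keep)

for i in range(4):
    end = (i + 1) * STEP_MS           # each pass ends one step later
    start = max(0, end - LENGTH_MS)   # the model sees at most LENGTH_MS of audio
    print(f"pass {i}: window {start / 1000:.1f}s -> {end / 1000:.1f}s "
          f"(carrying {KEEP_MS}ms into the next pass)")
```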
### Advanced Streaming Options
```bash
# Higher accuracy (slower)
./build/bin/stream -m models/ggml-base.bin \
  --step 2000 \
  --length 10000 \
  --keep 500 \
  --max-tokens 64

# Lower latency (less accurate)
./build/bin/stream -m models/ggml-base.bin \
  --step 1500 \
  --length 5000 \
  --keep 100 \
  --max-tokens 16

# With a specific audio device
./build/bin/stream -m models/ggml-base.bin \
  --capture 1   # Use audio device 1 (check with: arecord -l)
```

### Stream-to-File
```bash
# Save the transcription to a file (stream writes plain text to stdout)
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 8000 > transcription.txt
```

Note: the stream example only emits plain text. For SRT subtitles, transcribe a recorded file instead: `./build/bin/main -m models/ggml-base.bin -f audio.wav --output-srt`.

## Capture Speaker Output + Mic (Combined)
Since you want both mic and speaker audio:
### Step 1: Set up PulseAudio Combined Sink
```bash
# Create a combined virtual sink (run once)
pactl load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"

# Route the microphone to the combined sink
pactl load-module module-loopback \
  source=alsa_input.pci-0000_00_1f.3.analog-stereo \
  sink=combined \
  latency_msec=1

# Route the speaker monitor to the combined sink
pactl load-module module-loopback \
  source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor \
  sink=combined \
  latency_msec=1
```

### Step 2: Stream from Combined Source
```bash
# The stock stream example captures audio via SDL and does not read stdin,
# so point the default capture source at the combined monitor first:
pactl set-default-source combined.monitor

# Then stream as usual
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 8000
```

Note: you can also select the monitor explicitly with `--capture N`; stream lists the available capture devices at startup.
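The hardware source names used in Step 1 are machine-specific. Here is a small sketch to enumerate what PulseAudio exposes, so you can substitute the right names in the `pactl` commands above (it parses the tab-separated output of `pactl list sources short`):

```python
# List PulseAudio sources so the pactl commands above can use the right
# machine-specific names. Relies only on `pactl list sources short`.
import subprocess

out = subprocess.run(
    ["pactl", "list", "sources", "short"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    # Columns: index, name, driver, sample spec, state (tab-separated)
    fields = line.split("\t")
    index, name = fields[0], fields[1]
    kind = "monitor" if name.endswith(".monitor") else "input"
    print(f"{index:>3}  {kind:<7}  {name}")
```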
## Python Integration
### Simple Subprocess Wrapper
```python
#!/usr/bin/env python3
"""
Real-time streaming transcription with whisper.cpp + Vulkan.
Captures either the microphone or the combined (mic + speakers) source.
"""

import subprocess
from pathlib import Path

# Paths
WHISPER_CPP = Path.home() / "whisper.cpp"
MODEL = WHISPER_CPP / "models" / "ggml-base.bin"
STREAM_BIN = WHISPER_CPP / "build" / "bin" / "stream"


def stream_transcribe(audio_source="default", step=3000, length=8000):
    """
    Start real-time streaming transcription.

    Args:
        audio_source: "default" (mic) or "combined.monitor" (mic + speakers)
        step: Processing interval in ms
        length: Context window in ms
    """
    if audio_source == "combined.monitor":
        # The stock stream binary captures via SDL and does not read stdin,
        # so point the default PulseAudio capture source at the combined
        # monitor before launching it.
        subprocess.run(["pactl", "set-default-source", "combined.monitor"],
                       check=True)
        print("Streaming transcription from mic + speakers...")
    else:
        print("Streaming transcription from microphone...")

    cmd = [
        str(STREAM_BIN),
        "-m", str(MODEL),
        "--step", str(step),
        "--length", str(length),
        "--keep", "200",
        "--max-tokens", "32",
    ]

    print("Press Ctrl+C to stop\n")
    try:
        subprocess.run(cmd)
    except KeyboardInterrupt:
        print("\nStopping...")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Real-time streaming transcription")
    parser.add_argument(
        "--source",
        choices=["mic", "combined"],
        default="mic",
        help="Audio source: mic (microphone only) or combined (mic + speakers)",
    )
    parser.add_argument("--step", type=int, default=3000, help="Processing step (ms)")
    parser.add_argument("--length", type=int, default=8000, help="Context length (ms)")

    args = parser.parse_args()

    audio_source = "combined.monitor" if args.source == "combined" else "default"
    stream_transcribe(audio_source, args.step, args.length)
```

Usage:
```bash
# Mic only
python stream_transcribe.py --source mic

# Mic + speakers
python stream_transcribe.py --source combined

# Adjust latency/accuracy
python stream_transcribe.py --source combined --step 2000 --length 10000
```

## Performance Tuning
### Model Selection
| Model | Speed (RX 7800 XT) | Accuracy | Latency | Best For |
|---|---|---|---|---|
| tiny | ~50x real-time | Basic | ~300ms | Ultra-fast, casual speech |
| base | ~25x real-time | Good | ~500ms | Recommended balance |
| small | ~10x real-time | Better | ~800ms | Higher accuracy needed |
| medium | ~4x real-time | Great | ~1.5s | Offline processing |
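These numbers will vary with drivers and clocks; whisper.cpp ships a `bench` example you can use to check where each model lands on your own card. A minimal sketch (paths assume the `~/whisper.cpp` layout from the install steps; only models you have already downloaded are benchmarked):

```python
# Benchmark each downloaded model with whisper.cpp's bench example.
# Paths assume the ~/whisper.cpp layout from the installation steps above.
import subprocess
from pathlib import Path

WHISPER_CPP = Path.home() / "whisper.cpp"
BENCH = WHISPER_CPP / "build" / "bin" / "bench"

for name in ["tiny", "base", "small", "medium"]:
    model = WHISPER_CPP / "models" / f"ggml-{name}.bin"
    if not model.exists():
        print(f"{name}: not downloaded, skipping")
        continue
    print(f"--- {name} ---")
    subprocess.run([str(BENCH), "-m", str(model)])
```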
### Latency vs Accuracy Trade-off
```bash
# Lowest latency (~300ms, less accurate)
./build/bin/stream -m models/ggml-tiny.bin \
  --step 1000 \
  --length 3000 \
  --keep 100

# Balanced (~500ms, good accuracy)
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 8000 \
  --keep 200

# Higher accuracy (~1s, better context)
./build/bin/stream -m models/ggml-small.bin \
  --step 4000 \
  --length 12000 \
  --keep 500
```

## Troubleshooting
### Issue: “No Vulkan device found”
```bash
# Check Vulkan is working
vulkaninfo | grep -i device

# If there is no output, reinstall Vulkan
sudo apt install --reinstall libvulkan1 mesa-vulkan-drivers vulkan-tools

# Verify the GPU driver
lspci -k | grep -A 3 VGA
# Should show the amdgpu kernel driver
```

### Issue: Stream binary not found
```bash
# Rebuild with examples and SDL2 enabled (stream requires SDL2)
cd ~/whisper.cpp
rm -rf build
cmake -B build -DGGML_VULKAN=ON -DWHISPER_SDL2=ON -DWHISPER_BUILD_EXAMPLES=ON
cmake --build build -j$(nproc)

# Verify
ls build/bin/stream
```

### Issue: “Cannot open audio device”
```bash
# List audio devices
arecord -l

# Test audio capture
arecord -d 3 -f S16_LE -r 16000 test.wav
aplay test.wav

# If there is no audio, check PulseAudio
pactl list sources short
```

### Issue: Poor transcription quality
```bash
# Try a larger model
bash models/download-ggml-model.sh small
./build/bin/stream -m models/ggml-small.bin

# Increase the context window
./build/bin/stream -m models/ggml-base.bin \
  --step 3000 \
  --length 12000   # Increased from 8000

# Check audio input quality
parecord --device=combined.monitor test.wav
# Listen to test.wav - it should be clear
```

### Issue: High CPU usage / slow performance
```bash
# Verify Vulkan is actually being used
./build/bin/main -m models/ggml-base.bin -f samples/test.wav 2>&1 | grep -i vulkan

# Should see Vulkan device lines. If not, rebuild:
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Check GPU usage while running
watch -n 1 rocm-smi   # if ROCm tools are installed
# or, without ROCm:
radeontop
# Should show GPU activity while transcribing
```

### Issue: Streaming lags behind real-time
```bash
# Reduce the processing step (faster, but may cut sentences)
./build/bin/stream -m models/ggml-base.bin \
  --step 2000 \
  --length 6000

# Or use the tiny model
./build/bin/stream -m models/ggml-tiny.bin \
  --step 2000 \
  --length 6000
```

## Making PulseAudio Combined Sink Permanent
Add to ~/.config/pulse/default.pa:
```bash
# Create the file if it doesn't exist (run once; >> appends)
mkdir -p ~/.config/pulse
cat >> ~/.config/pulse/default.pa << 'EOF'
.include /etc/pulse/default.pa

# Combined audio sink (mic + speakers)
load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"

# Route mic to combined (replace with your mic source)
load-module module-loopback source=alsa_input.pci-0000_00_1f.3.analog-stereo sink=combined latency_msec=1

# Route speakers to combined (replace with your speaker monitor)
load-module module-loopback source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor sink=combined latency_msec=1
EOF
```
```bash
# Restart PulseAudio
pulseaudio -k
pulseaudio --start

# Verify
pactl list sources | grep -i combined
```

## Next Steps
### Integration Ideas
1. **Live Subtitles for Video Calls**

```bash
# Transcribe meeting audio in real time (stream emits plain text)
./build/bin/stream -m models/ggml-base.bin | \
while read -r line; do
  echo "$line"   # Send to a subtitle overlay app
done
```

2. **Voice Command Detection**
```python
# Watch for specific keywords
import subprocess

proc = subprocess.Popen(
    ["./build/bin/stream", "-m", "models/ggml-base.bin"],
    stdout=subprocess.PIPE,
    text=True,
)

for line in proc.stdout:
    text = line.lower()
    if "hey computer" in text:
        print("Wake word detected!")
    if "open browser" in text:
        subprocess.run(["firefox"])
```

3. **Meeting Transcription Logger**
```bash
# Save all transcriptions with timestamps
./build/bin/stream -m models/ggml-base.bin | \
while read -r line; do
  echo "$(date '+%Y-%m-%d %H:%M:%S') | $line" >> meeting_log.txt
done
```

## Performance Expectations
### Your Setup (RX 7800 XT + Vulkan)
Expected performance:
- Model: base
- Real-time factor: ~25x (processes 1s audio in ~0.04s)
- Streaming latency: ~500ms
- Accuracy: Good for conversational speech
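The real-time factor is simply audio duration divided by wall-clock processing time, so it is easy to verify on your own machine. A rough sketch using the sample clip from earlier (assumes the `~/whisper.cpp` layout; jfk.wav is about 11 seconds, so adjust `AUDIO_SECONDS` for other files):

```python
# Rough real-time-factor check: audio duration / wall-clock transcription time.
# Assumes the ~/whisper.cpp layout and the sample downloaded earlier.
# Note: includes model-load time, so steady-state throughput is higher.
import subprocess
import time
from pathlib import Path

WHISPER_CPP = Path.home() / "whisper.cpp"
MAIN = WHISPER_CPP / "build" / "bin" / "main"
MODEL = WHISPER_CPP / "models" / "ggml-base.bin"
AUDIO = WHISPER_CPP / "samples" / "test.wav"
AUDIO_SECONDS = 11.0  # jfk.wav is ~11s; change for other files

start = time.monotonic()
subprocess.run([str(MAIN), "-m", str(MODEL), "-f", str(AUDIO)],
               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
elapsed = time.monotonic() - start
print(f"processed {AUDIO_SECONDS:.0f}s of audio in {elapsed:.2f}s "
      f"-> ~{AUDIO_SECONDS / elapsed:.0f}x real-time")
```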
Comparison:
- openai-whisper (chunked): 2s latency, 38x real-time
- whisper.cpp (Vulkan): 500ms latency, 25x real-time ← Your choice
- whisper.cpp (ROCm): 300ms latency, 38x real-time (harder to build)
You chose: Better latency over maximum performance ✓
## Summary
What you got:
- ✅ Real-time streaming transcription
- ✅ ~500ms latency (good enough!)
- ✅ Simple setup (no ROCm complexity)
- ✅ Works with mic + speakers
- ✅ Python-friendly integration
- Build time: ~15 minutes
- Complexity: Medium (much simpler than a ROCm build)
- Performance: Good enough for real-time use
You’re ready to start streaming!
```bash
cd ~/whisper.cpp
./build/bin/stream -m models/ggml-base.bin
```