Overview

This guide covers how to capture and stream audio from both microphone input and speaker output on Ubuntu, suitable for real-time Whisper transcription with AMD GPUs.

Ubuntu Audio Stack

Ubuntu uses one of two audio systems:

PulseAudio (Traditional)

  • Default on Ubuntu 22.04 and earlier
  • Mature, well-documented
  • Better for simple setups

PipeWire (Modern)

  • Default on Ubuntu 22.10+
  • Replaces PulseAudio + JACK
  • Better performance, lower latency
  • Backward compatible with PulseAudio commands

Check which you’re using:

Terminal window
# PipeWire
pactl info | grep "Server Name"
# Shows: PulseAudio (on PipeWire X.X.X) if using PipeWire
# Shows: pulseaudio X.X.X if using traditional PulseAudio
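
If a script needs to know which server is running, the same check can be done programmatically. A small sketch (assumes pactl is on PATH):

import subprocess

def detect_audio_server():
    """Return 'pipewire' or 'pulseaudio' based on `pactl info` output."""
    out = subprocess.run(["pactl", "info"], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith("Server Name:"):
            return "pipewire" if "PipeWire" in line else "pulseaudio"
    return "unknown"

print(detect_audio_server())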

Method 1: Capturing Microphone Input

Using PulseAudio/PipeWire Commands

List available input sources:

Terminal window
pactl list sources short

Example output:

0 alsa_input.pci-0000_00_1f.3.analog-stereo module-alsa-card.c s16le 2ch 44100Hz
1 alsa_input.usb-Blue_Microphones_Yeti_Stereo module-alsa-card.c s16le 2ch 48000Hz
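
For scripting, the same listing can be parsed from Python. A minimal sketch using subprocess:

import subprocess

def list_sources():
    """Return (index, name) pairs from `pactl list sources short`."""
    out = subprocess.run(["pactl", "list", "sources", "short"],
                         capture_output=True, text=True).stdout
    sources = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            sources.append((int(fields[0]), fields[1]))
    return sources

for idx, name in list_sources():
    print(idx, name)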

Record from microphone:

Terminal window
# Replace source name with your actual mic
parec --device=alsa_input.pci-0000_00_1f.3.analog-stereo | your-processing-script
# Or specify format (16kHz mono for Whisper)
parec --device=alsa_input.pci-0000_00_1f.3.analog-stereo \
--format=s16le \
--rate=16000 \
--channels=1 | your-processing-script
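
Here, your-processing-script is a placeholder for anything that reads raw PCM from stdin. A minimal sketch of what it might look like in Python (assumes the flags above: s16le, 16 kHz, mono):

#!/usr/bin/env python3
# Hypothetical "your-processing-script": read raw s16le PCM from stdin
import sys
import numpy as np

CHUNK_SAMPLES = 16000  # 1 second at 16 kHz

while True:
    raw = sys.stdin.buffer.read(CHUNK_SAMPLES * 2)  # 2 bytes per sample
    if not raw:
        break
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    print(f"got {len(samples)} samples, peak={np.abs(samples).max():.3f}")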

Using ALSA Directly (Lower Level)

Terminal window
# List capture devices
arecord -l
# Record from mic (16kHz mono, 16-bit PCM)
arecord -f S16_LE -r 16000 -c 1 -t raw | your-processing-script
# Or specify device
arecord -D hw:0,0 -f S16_LE -r 16000 -c 1 -t raw | your-processing-script

Using PyAudio (Python)

import pyaudio
import numpy as np

audio = pyaudio.PyAudio()

# List input devices
for i in range(audio.get_device_count()):
    info = audio.get_device_info_by_index(i)
    if info['maxInputChannels'] > 0:
        print(f"{i}: {info['name']} (inputs: {info['maxInputChannels']})")

# Open microphone stream
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    input_device_index=None,  # None = default, or specify a device index
    frames_per_buffer=1024
)

# Read audio chunks
while True:
    audio_chunk = stream.read(1024)
    audio_np = np.frombuffer(audio_chunk, dtype=np.int16)
    # Process audio_np...

Method 2: Capturing Speaker Output (System Audio)

Understanding Monitor Sources

PulseAudio/PipeWire creates “monitor” sources for each audio output. These capture what’s being played through your speakers.

List monitor sources:

Terminal window
pactl list sources short | grep monitor

Example output:

2 alsa_output.pci-0000_00_1f.3.analog-stereo.monitor module-alsa-card.c s16le 2ch 44100Hz

Record speaker output:

Terminal window
parec --device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor | your-processing-script
# With format specification
parec --device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor \
--format=s16le \
--rate=16000 \
--channels=1 | your-processing-script

Finding the Correct Monitor

If you have multiple audio outputs:

Terminal window
# List all sinks (outputs)
pactl list sinks short
# Each sink has a corresponding .monitor source
# Example: alsa_output.pci-0000_00_1f.3.hdmi-stereo
# Monitor: alsa_output.pci-0000_00_1f.3.hdmi-stereo.monitor
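
To pick the right monitor programmatically, one option is to derive it from the default sink. A small sketch (pactl get-default-sink is available in recent PulseAudio and PipeWire releases):

import subprocess

def default_monitor_name():
    """Monitor source corresponding to the current default output sink."""
    sink = subprocess.run(["pactl", "get-default-sink"],
                          capture_output=True, text=True).stdout.strip()
    return f"{sink}.monitor"

print(default_monitor_name())
# e.g. alsa_output.pci-0000_00_1f.3.analog-stereo.monitor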

Using ALSA Loopback (Alternative)

Terminal window
# Load loopback module (may require root)
sudo modprobe snd-aloop
# This creates a virtual loopback card: audio played to hw:Loopback,0
# can be captured from hw:Loopback,1 (you must point your applications
# at the loopback device for this to work)
arecord -D hw:Loopback,1 -f S16_LE -r 16000 -c 1 | your-processing-script

Method 3: Capturing BOTH Mic + Speakers Simultaneously

This is the most useful for comprehensive audio transcription.

Option A: Using PulseAudio/PipeWire Virtual Sink

Step 1: Create a virtual sink (combined audio)

Terminal window
# Create a null sink (virtual audio device)
pactl load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"

Step 2: Route microphone to virtual sink

Terminal window
# Replace with your actual mic source
pactl load-module module-loopback \
source=alsa_input.pci-0000_00_1f.3.analog-stereo \
sink=combined \
latency_msec=1

Step 3: Route speaker output to virtual sink

Terminal window
# Replace with your actual speaker monitor
pactl load-module module-loopback \
source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor \
sink=combined \
latency_msec=1

Step 4: Record from the combined monitor

Terminal window
parec --device=combined.monitor | your-processing-script
# Or with format
parec --device=combined.monitor \
--format=s16le \
--rate=16000 \
--channels=1 | your-processing-script
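
If you only need the routing while a script is running, the same setup can be managed from Python and torn down on exit. A rough sketch (the source names are the examples from above; substitute your own, found via pactl list sources short):

import subprocess

# Example device names from above; replace with your own
MIC = "alsa_input.pci-0000_00_1f.3.analog-stereo"
SPEAKER_MONITOR = "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"

def load_module(*args):
    """Run `pactl load-module ...` and return the module ID for later unload."""
    result = subprocess.run(["pactl", "load-module", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

module_ids = [
    load_module("module-null-sink", "sink_name=combined",
                "sink_properties=device.description=Combined_Audio"),
    load_module("module-loopback", f"source={MIC}", "sink=combined", "latency_msec=1"),
    load_module("module-loopback", f"source={SPEAKER_MONITOR}", "sink=combined", "latency_msec=1"),
]

try:
    input("Combined sink is up; record from combined.monitor. Press Enter to tear it down...")
finally:
    for module_id in reversed(module_ids):
        subprocess.run(["pactl", "unload-module", module_id])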

To make permanent (PulseAudio only; PipeWire does not read default.pa, so on PipeWire run the pactl commands from a startup script instead):

Edit /etc/pulse/default.pa or ~/.config/pulse/default.pa:

Terminal window
# Add these lines
load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"
load-module module-loopback source=alsa_input.pci-0000_00_1f.3.analog-stereo sink=combined latency_msec=1
load-module module-loopback source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor sink=combined latency_msec=1

Then restart PulseAudio:

Terminal window
pulseaudio -k
pulseaudio --start

Option B: Using Python with Multiple Streams

import pyaudio
import numpy as np
import threading

audio = pyaudio.PyAudio()

# Buffer to store combined audio
combined_buffer = []
buffer_lock = threading.Lock()

def capture_mic(device_index):
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        input_device_index=device_index,
        frames_per_buffer=1024
    )
    while True:
        data = stream.read(1024)
        audio_np = np.frombuffer(data, dtype=np.int16)
        with buffer_lock:
            combined_buffer.append(('mic', audio_np))

def capture_speakers(device_index):
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        input_device_index=device_index,
        frames_per_buffer=1024
    )
    while True:
        data = stream.read(1024)
        audio_np = np.frombuffer(data, dtype=np.int16)
        with buffer_lock:
            combined_buffer.append(('speaker', audio_np))

# Device indices for your mic and the speaker monitor
# (find them with the device-listing loop from Method 1)
mic_device_index = 0      # example value; replace with your mic's index
monitor_device_index = 2  # example value; replace with your monitor's index

# Start capture threads
mic_thread = threading.Thread(target=capture_mic, args=(mic_device_index,))
speaker_thread = threading.Thread(target=capture_speakers, args=(monitor_device_index,))
mic_thread.daemon = True
speaker_thread.daemon = True
mic_thread.start()
speaker_thread.start()

# Process combined buffer
while True:
    with buffer_lock:
        if combined_buffer:
            source, audio_chunk = combined_buffer.pop(0)
            # Process audio_chunk...
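
Note that this option only interleaves tagged chunks; it does not produce a single mixed signal the way the virtual-sink approach does. If you want one stream for transcription, you have to sum time-aligned chunks yourself. A naive sketch that assumes both chunks are int16 and the same length, and ignores clock drift between the two devices:

import numpy as np

def mix_chunks(mic_chunk, speaker_chunk):
    """Sum two equal-length int16 chunks into one, clipping to the int16 range."""
    mixed = mic_chunk.astype(np.int32) + speaker_chunk.astype(np.int32)
    return np.clip(mixed, -32768, 32767).astype(np.int16)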

Option C: Using GStreamer

Terminal window
# Install GStreamer
sudo apt install gstreamer1.0-tools gstreamer1.0-plugins-good
# Capture mic + speakers and mix
gst-launch-1.0 \
  audiomixer name=mix ! \
  audioconvert ! audioresample ! \
  audio/x-raw,rate=16000,channels=1 ! \
  fdsink fd=1 \
  pulsesrc device=alsa_input.pci-0000_00_1f.3.analog-stereo ! mix. \
  pulsesrc device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor ! mix.

Integration with Whisper for Real-Time Transcription

Complete Working Example

#!/usr/bin/env python3
"""
Real-time audio transcription with AMD GPU support
Captures both mic and speakers, transcribes with openai-whisper
"""
import whisper
import pyaudio
import numpy as np
import subprocess
import sys
from threading import Thread, Lock
from queue import Queue

# Load Whisper model (ROCm-accelerated on AMD GPUs)
print("Loading Whisper model...")
model = whisper.load_model("base", device="cuda")  # "cuda" works for ROCm too
print("Model loaded!")

# Audio configuration
RATE = 16000
CHUNK = RATE * 2  # 2 seconds of audio
FORMAT = pyaudio.paInt16
CHANNELS = 1

# Transcription queue
audio_queue = Queue()

def get_monitor_device(name_contains="monitor"):
    """Find monitor device for speaker capture"""
    p = pyaudio.PyAudio()
    for i in range(p.get_device_count()):
        info = p.get_device_info_by_index(i)
        if name_contains.lower() in info['name'].lower() and info['maxInputChannels'] > 0:
            print(f"Found monitor device: {info['name']}")
            return i
    return None

def get_mic_device():
    """Find default microphone"""
    p = pyaudio.PyAudio()
    return p.get_default_input_device_info()['index']

def capture_audio_pulseaudio():
    """
    Capture combined audio using PulseAudio virtual sink
    Assumes you've set up the 'combined' sink as shown above
    """
    cmd = [
        'parec',
        '--device=combined.monitor',
        '--format=s16le',
        '--rate=16000',
        '--channels=1'
    ]
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=CHUNK * 2)
    print("Capturing audio from combined source (mic + speakers)...")
    while True:
        audio_data = process.stdout.read(CHUNK * 2)  # 2 bytes per sample
        if not audio_data:
            break
        audio_np = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
        audio_queue.put(audio_np)

def capture_audio_pyaudio(source='mic'):
    """
    Capture audio using PyAudio
    source: 'mic' or 'speakers'
    """
    audio = pyaudio.PyAudio()
    if source == 'mic':
        device_index = get_mic_device()
        print(f"Capturing from microphone (device {device_index})...")
    elif source == 'speakers':
        device_index = get_monitor_device()
        print(f"Capturing from speakers (device {device_index})...")
    else:
        print("Invalid source. Use 'mic' or 'speakers'")
        return
    stream = audio.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        input_device_index=device_index,
        frames_per_buffer=CHUNK
    )
    while True:
        audio_data = stream.read(CHUNK, exception_on_overflow=False)
        audio_np = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
        audio_queue.put(audio_np)

def transcribe_worker():
    """Process audio chunks and transcribe"""
    print("Transcription worker started...")
    while True:
        audio_chunk = audio_queue.get()
        if audio_chunk is None:
            break
        # Check if audio has speech (simple energy threshold)
        energy = np.abs(audio_chunk).mean()
        if energy < 0.01:  # Silence threshold
            continue
        # Transcribe
        try:
            result = model.transcribe(
                audio_chunk,
                fp16=False,      # Use fp32 for ROCm
                language='en',   # Specify if known
                task='transcribe'
            )
            text = result["text"].strip()
            if text:
                print(f"[TRANSCRIPT]: {text}")
        except Exception as e:
            print(f"Transcription error: {e}")

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Real-time audio transcription')
    parser.add_argument('--source', choices=['mic', 'speakers', 'combined'],
                        default='combined', help='Audio source to capture')
    args = parser.parse_args()

    # Start transcription worker
    transcribe_thread = Thread(target=transcribe_worker, daemon=True)
    transcribe_thread.start()

    # Start audio capture
    try:
        if args.source == 'combined':
            capture_audio_pulseaudio()
        else:
            capture_audio_pyaudio(args.source)
    except KeyboardInterrupt:
        print("\nStopping...")
        audio_queue.put(None)
        sys.exit(0)

Usage

Terminal window
# Install dependencies
pip install openai-whisper pyaudio numpy
# Set up combined audio source (one-time)
pactl load-module module-null-sink sink_name=combined
pactl load-module module-loopback source=alsa_input.pci-0000_00_1f.3.analog-stereo sink=combined latency_msec=1
pactl load-module module-loopback source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor sink=combined latency_msec=1
# Run transcription
python real_time_transcribe.py --source combined

Performance Optimization

For AMD GPUs (ROCm)

Key settings for optimal performance:

import os

# May need to override the GPU architecture; set this before torch/whisper are imported
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.1"  # For RX 7800 XT (gfx1101)

import whisper

# Load model
model = whisper.load_model("base", device="cuda")  # "cuda" works for ROCm
# Transcribe with fp32 (fp16 can be problematic on ROCm)
result = model.transcribe(audio, fp16=False)

Model Selection for Real-Time

Model       Speed (RX 7800 XT)   Accuracy   Best For
tiny        ~80x real-time       Basic      Ultra-fast, low accuracy OK
base        ~38x real-time       Good       Best balance for real-time
small       ~12x real-time       Better     High accuracy needed
medium      ~4x real-time        Great      Offline processing
large-v3    ~2-5x real-time      Best       Offline processing

Recommendation: Use base model for real-time transcription.

Chunk Size Tuning

# Smaller chunks = lower latency, less context
CHUNK = RATE * 1 # 1 second (faster, less accurate)
# Larger chunks = higher latency, better context
CHUNK = RATE * 3 # 3 seconds (slower, more accurate)
# Recommended for real-time
CHUNK = RATE * 2 # 2 seconds (good balance)

Troubleshooting

Issue: PyAudio installation fails

Terminal window
# Install PortAudio development headers
sudo apt install portaudio19-dev python3-pyaudio
# Then install PyAudio
pip install pyaudio

Issue: Can’t find monitor device

Terminal window
# List all sources with details
pactl list sources
# Look for "Monitor of Sink" in the output
# The source name will be like: alsa_output.XXX.monitor

Issue: No audio captured from speakers

Check if anything is playing:

Terminal window
pactl list sink-inputs

Test monitor directly:

Terminal window
parec --device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor --format=s16le --rate=44100 --channels=2 | aplay -f cd
# You should hear what's playing (with a short delay)

Issue: High latency with virtual sink

The loopback modules in the examples already use latency_msec=1, the practical minimum. If you used a higher value, unload the loopbacks (pactl unload-module module-loopback) and reload them with a lower latency_msec:

Terminal window
pactl load-module module-loopback \
source=alsa_input.pci-0000_00_1f.3.analog-stereo \
sink=combined \
latency_msec=1 # Lower = less latency (but may cause dropouts)

Issue: ROCm not detecting GPU

Terminal window
# Check ROCm installation
rocm-smi
# Check PyTorch ROCm
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True
# If False, reinstall PyTorch with ROCm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0

Issue: Transcription too slow for real-time

  1. Use smaller model: Switch from base → tiny
  2. Increase chunk size: Process 3-5 second chunks instead of 1-2 seconds
  3. Check GPU usage: rocm-smi should show GPU utilization
  4. Verify fp16=False: fp16 can cause slowdowns on ROCm
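
A quick way to check whether a given model keeps up is to time a transcription against the chunk length; the real-time factor must stay above 1x. A minimal sketch reusing the settings from the example above:

import time
import numpy as np
import whisper

model = whisper.load_model("base", device="cuda")  # "cuda" also targets ROCm builds of PyTorch

CHUNK_SECONDS = 2
audio = np.zeros(16000 * CHUNK_SECONDS, dtype=np.float32)  # replace with a real captured chunk

start = time.time()
model.transcribe(audio, fp16=False)
elapsed = time.time() - start
print(f"Real-time factor: {CHUNK_SECONDS / elapsed:.1f}x (must stay above 1x)")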

Advanced: Voice Activity Detection (VAD)

To avoid transcribing silence, add VAD:

import webrtcvad
import numpy as np

vad = webrtcvad.Vad(3)  # Aggressiveness 0-3

def has_speech(audio_chunk, sample_rate=16000):
    """Check if audio chunk contains speech"""
    # Convert float32 [-1, 1] audio to 16-bit PCM bytes
    audio_bytes = (audio_chunk * 32768).astype(np.int16).tobytes()
    # VAD expects 10, 20, or 30 ms frames
    frame_duration = 30  # ms
    frame_size = int(sample_rate * frame_duration / 1000) * 2  # 2 bytes per sample
    # Check first frame
    return vad.is_speech(audio_bytes[:frame_size], sample_rate)

# In transcription loop
if has_speech(audio_chunk):
    result = model.transcribe(audio_chunk)
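
The check above only inspects the first 30 ms frame of each chunk. A more robust variant scans every frame and requires a minimum fraction to be voiced (same assumptions: 16 kHz, 16-bit mono, and the vad instance from above):

def speech_ratio(audio_chunk, sample_rate=16000, frame_ms=30):
    """Fraction of 30 ms frames that webrtcvad classifies as speech."""
    audio_bytes = (audio_chunk * 32768).astype(np.int16).tobytes()
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    frames = [audio_bytes[i:i + frame_bytes]
              for i in range(0, len(audio_bytes) - frame_bytes + 1, frame_bytes)]
    if not frames:
        return 0.0
    voiced = sum(vad.is_speech(frame, sample_rate) for frame in frames)
    return voiced / len(frames)

# In the transcription loop: only transcribe if, say, 20% of frames are voiced
if speech_ratio(audio_chunk) > 0.2:
    result = model.transcribe(audio_chunk)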

Next Steps

See whisper-rocm-compatibility.md for:

  • Why faster-whisper doesn’t work with AMD GPUs
  • Why WhisperLive and RealtimeSTT won’t work
  • Alternative solutions for maximum performance