Overview

This guide covers how to capture and stream audio from both microphone input and speaker output on Ubuntu, suitable for real-time Whisper transcription with AMD GPUs.

Ubuntu Audio Stack

Ubuntu uses one of two audio systems:

PulseAudio (Traditional)

  • Default on Ubuntu 22.04 and earlier
  • Mature, well-documented
  • Better for simple setups

PipeWire (Modern)

  • Default on Ubuntu 22.10+
  • Replaces PulseAudio + JACK
  • Better performance, lower latency
  • Backward compatible with PulseAudio commands

Check which you’re using:

Terminal window
# PipeWire
pactl info | grep "Server Name"
# Shows: PulseAudio (on PipeWire X.X.X) if using PipeWire
# Shows: pulseaudio X.X.X if using traditional PulseAudio
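
If a script needs to know which server is running, the same check can be done programmatically. A small sketch (assumes pactl is on PATH):

import subprocess

def detect_audio_server():
    """Return 'pipewire' or 'pulseaudio' based on `pactl info` output."""
    out = subprocess.run(["pactl", "info"], capture_output=True, text=True).stdout
    for line in out.splitlines():
        if line.startswith("Server Name:"):
            return "pipewire" if "PipeWire" in line else "pulseaudio"
    return "unknown"

print(detect_audio_server())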

Method 1: Capturing Microphone Input

Using PulseAudio/PipeWire Commands

List available input sources:

Terminal window
pactl list sources short

Example output:

0 alsa_input.pci-0000_00_1f.3.analog-stereo module-alsa-card.c s16le 2ch 44100Hz
1 alsa_input.usb-Blue_Microphones_Yeti_Stereo module-alsa-card.c s16le 2ch 48000Hz
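
For scripting, the same listing can be parsed from Python. A minimal sketch using subprocess:

import subprocess

def list_sources():
    """Return (index, name) pairs from `pactl list sources short`."""
    out = subprocess.run(["pactl", "list", "sources", "short"],
                         capture_output=True, text=True).stdout
    sources = []
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            sources.append((int(fields[0]), fields[1]))
    return sources

for idx, name in list_sources():
    print(idx, name)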

Record from microphone:

Terminal window
# Replace source name with your actual mic
parec --device=alsa_input.pci-0000_00_1f.3.analog-stereo | your-processing-script
# Or specify format (16kHz mono for Whisper)
parec --device=alsa_input.pci-0000_00_1f.3.analog-stereo \
--format=s16le \
--rate=16000 \
--channels=1 | your-processing-script
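
Here, your-processing-script is a placeholder for anything that reads raw PCM from stdin. A minimal sketch of what it might look like in Python (assumes the flags above: s16le, 16 kHz, mono):

#!/usr/bin/env python3
# Hypothetical "your-processing-script": read raw s16le PCM from stdin
import sys
import numpy as np

CHUNK_SAMPLES = 16000  # 1 second at 16 kHz

while True:
    raw = sys.stdin.buffer.read(CHUNK_SAMPLES * 2)  # 2 bytes per sample
    if not raw:
        break
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    print(f"got {len(samples)} samples, peak={np.abs(samples).max():.3f}")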

Using ALSA Directly (Lower Level)

Terminal window
# List capture devices
arecord -l
# Record from mic (16kHz mono, 16-bit PCM)
arecord -f S16_LE -r 16000 -c 1 -t raw | your-processing-script
# Or specify device
arecord -D hw:0,0 -f S16_LE -r 16000 -c 1 -t raw | your-processing-script

Using PyAudio (Python)

import pyaudio
import numpy as np

audio = pyaudio.PyAudio()

# List input devices
for i in range(audio.get_device_count()):
    info = audio.get_device_info_by_index(i)
    if info['maxInputChannels'] > 0:
        print(f"{i}: {info['name']} (inputs: {info['maxInputChannels']})")

# Open microphone stream
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    input_device_index=None,  # None = default, or specify a device index
    frames_per_buffer=1024
)

# Read audio chunks
while True:
    audio_chunk = stream.read(1024)
    audio_np = np.frombuffer(audio_chunk, dtype=np.int16)
    # Process audio_np...

Method 2: Capturing Speaker Output (System Audio)

Understanding Monitor Sources

PulseAudio/PipeWire creates “monitor” sources for each audio output. These capture what’s being played through your speakers.

List monitor sources:

Terminal window
pactl list sources short | grep monitor

Example output:

2 alsa_output.pci-0000_00_1f.3.analog-stereo.monitor module-alsa-card.c s16le 2ch 44100Hz

Record speaker output:

Terminal window
parec --device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor | your-processing-script
# With format specification
parec --device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor \
--format=s16le \
--rate=16000 \
--channels=1 | your-processing-script

Finding the Correct Monitor

If you have multiple audio outputs:

Terminal window
# List all sinks (outputs)
pactl list sinks short
# Each sink has a corresponding .monitor source
# Example: alsa_output.pci-0000_00_1f.3.hdmi-stereo
# Monitor: alsa_output.pci-0000_00_1f.3.hdmi-stereo.monitor
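
To pick the right monitor programmatically, one option is to derive it from the default sink. A small sketch (pactl get-default-sink is available in recent PulseAudio and PipeWire releases):

import subprocess

def default_monitor_name():
    """Monitor source corresponding to the current default output sink."""
    sink = subprocess.run(["pactl", "get-default-sink"],
                          capture_output=True, text=True).stdout.strip()
    return f"{sink}.monitor"

print(default_monitor_name())
# e.g. alsa_output.pci-0000_00_1f.3.analog-stereo.monitor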

Using ALSA Loopback (Alternative)

Terminal window
# Load loopback module (may require root)
sudo modprobe snd-aloop
# This creates a virtual loopback card: audio played to hw:Loopback,0
# can be captured from hw:Loopback,1 (you must point your applications
# at the loopback device for this to work)
arecord -D hw:Loopback,1 -f S16_LE -r 16000 -c 1 | your-processing-script

Method 3: Capturing BOTH Mic + Speakers Simultaneously

This is the most useful for comprehensive audio transcription.

Option A: Using PulseAudio/PipeWire Virtual Sink

Step 1: Create a virtual sink (combined audio)

Terminal window
# Create a null sink (virtual audio device)
pactl load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"

Step 2: Route microphone to virtual sink

Terminal window
# Replace with your actual mic source
pactl load-module module-loopback \
source=alsa_input.pci-0000_00_1f.3.analog-stereo \
sink=combined \
latency_msec=1

Step 3: Route speaker output to virtual sink

Terminal window
# Replace with your actual speaker monitor
pactl load-module module-loopback \
source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor \
sink=combined \
latency_msec=1

Step 4: Record from the combined monitor

Terminal window
parec --device=combined.monitor | your-processing-script
# Or with format
parec --device=combined.monitor \
--format=s16le \
--rate=16000 \
--channels=1 | your-processing-script
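
If you only need the routing while a script is running, the same setup can be managed from Python and torn down on exit. A rough sketch (the source names are the examples from above; substitute your own, found via pactl list sources short):

import subprocess

# Example device names from above; replace with your own
MIC = "alsa_input.pci-0000_00_1f.3.analog-stereo"
SPEAKER_MONITOR = "alsa_output.pci-0000_00_1f.3.analog-stereo.monitor"

def load_module(*args):
    """Run `pactl load-module ...` and return the module ID for later unload."""
    result = subprocess.run(["pactl", "load-module", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

module_ids = [
    load_module("module-null-sink", "sink_name=combined",
                "sink_properties=device.description=Combined_Audio"),
    load_module("module-loopback", f"source={MIC}", "sink=combined", "latency_msec=1"),
    load_module("module-loopback", f"source={SPEAKER_MONITOR}", "sink=combined", "latency_msec=1"),
]

try:
    input("Combined sink is up; record from combined.monitor. Press Enter to tear it down...")
finally:
    for module_id in reversed(module_ids):
        subprocess.run(["pactl", "unload-module", module_id])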

To make permanent (PulseAudio only; PipeWire does not read default.pa, so on PipeWire run the pactl commands from a startup script instead):

Edit /etc/pulse/default.pa or ~/.config/pulse/default.pa:

Terminal window
# Add these lines
load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"
load-module module-loopback source=alsa_input.pci-0000_00_1f.3.analog-stereo sink=combined latency_msec=1
load-module module-loopback source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor sink=combined latency_msec=1

Then restart PulseAudio:

Terminal window
pulseaudio -k
pulseaudio --start

Option B: Using Python with Multiple Streams

import pyaudio
import numpy as np
import threading

audio = pyaudio.PyAudio()

# Buffer to store combined audio
combined_buffer = []
buffer_lock = threading.Lock()

def capture_mic(device_index):
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        input_device_index=device_index,
        frames_per_buffer=1024
    )
    while True:
        data = stream.read(1024)
        audio_np = np.frombuffer(data, dtype=np.int16)
        with buffer_lock:
            combined_buffer.append(('mic', audio_np))

def capture_speakers(device_index):
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        input_device_index=device_index,
        frames_per_buffer=1024
    )
    while True:
        data = stream.read(1024)
        audio_np = np.frombuffer(data, dtype=np.int16)
        with buffer_lock:
            combined_buffer.append(('speaker', audio_np))

# Device indices for your mic and the speaker monitor
# (find them with the device-listing loop from Method 1)
mic_device_index = 0      # example value; replace with your mic's index
monitor_device_index = 2  # example value; replace with your monitor's index

# Start capture threads
mic_thread = threading.Thread(target=capture_mic, args=(mic_device_index,))
speaker_thread = threading.Thread(target=capture_speakers, args=(monitor_device_index,))
mic_thread.daemon = True
speaker_thread.daemon = True
mic_thread.start()
speaker_thread.start()

# Process combined buffer
while True:
    with buffer_lock:
        if combined_buffer:
            source, audio_chunk = combined_buffer.pop(0)
            # Process audio_chunk...
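
Note that this option only interleaves tagged chunks; it does not produce a single mixed signal the way the virtual-sink approach does. If you want one stream for transcription, you have to sum time-aligned chunks yourself. A naive sketch that assumes both chunks are int16 and the same length, and ignores clock drift between the two devices:

import numpy as np

def mix_chunks(mic_chunk, speaker_chunk):
    """Sum two equal-length int16 chunks into one, clipping to the int16 range."""
    mixed = mic_chunk.astype(np.int32) + speaker_chunk.astype(np.int32)
    return np.clip(mixed, -32768, 32767).astype(np.int16)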

Option C: Using GStreamer

Terminal window
# Install GStreamer
sudo apt install gstreamer1.0-tools gstreamer1.0-plugins-good
# Capture mic + speakers and mix
gst-launch-1.0 \
  audiomixer name=mix ! \
  audioconvert ! audioresample ! \
  audio/x-raw,rate=16000,channels=1 ! \
  fdsink fd=1 \
  pulsesrc device=alsa_input.pci-0000_00_1f.3.analog-stereo ! mix. \
  pulsesrc device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor ! mix.

Integration with Whisper for Real-Time Transcription

Complete Working Example

#!/usr/bin/env python3
"""
Real-time audio transcription with AMD GPU support
Captures both mic and speakers, transcribes with openai-whisper
"""
import whisper
import pyaudio
import numpy as np
import subprocess
import sys
from threading import Thread, Lock
from queue import Queue

# Load Whisper model (ROCm-accelerated on AMD GPUs)
print("Loading Whisper model...")
model = whisper.load_model("base", device="cuda")  # "cuda" works for ROCm too
print("Model loaded!")

# Audio configuration
RATE = 16000
CHUNK = RATE * 2  # 2 seconds of audio
FORMAT = pyaudio.paInt16
CHANNELS = 1

# Transcription queue
audio_queue = Queue()

def get_monitor_device(name_contains="monitor"):
    """Find monitor device for speaker capture"""
    p = pyaudio.PyAudio()
    for i in range(p.get_device_count()):
        info = p.get_device_info_by_index(i)
        if name_contains.lower() in info['name'].lower() and info['maxInputChannels'] > 0:
            print(f"Found monitor device: {info['name']}")
            return i
    return None

def get_mic_device():
    """Find default microphone"""
    p = pyaudio.PyAudio()
    return p.get_default_input_device_info()['index']

def capture_audio_pulseaudio():
    """
    Capture combined audio using PulseAudio virtual sink
    Assumes you've set up the 'combined' sink as shown above
    """
    cmd = [
        'parec',
        '--device=combined.monitor',
        '--format=s16le',
        '--rate=16000',
        '--channels=1'
    ]
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=CHUNK * 2)
    print("Capturing audio from combined source (mic + speakers)...")
    while True:
        audio_data = process.stdout.read(CHUNK * 2)  # 2 bytes per sample
        if not audio_data:
            break
        audio_np = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
        audio_queue.put(audio_np)

def capture_audio_pyaudio(source='mic'):
    """
    Capture audio using PyAudio
    source: 'mic' or 'speakers'
    """
    audio = pyaudio.PyAudio()
    if source == 'mic':
        device_index = get_mic_device()
        print(f"Capturing from microphone (device {device_index})...")
    elif source == 'speakers':
        device_index = get_monitor_device()
        print(f"Capturing from speakers (device {device_index})...")
    else:
        print("Invalid source. Use 'mic' or 'speakers'")
        return
    stream = audio.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        input_device_index=device_index,
        frames_per_buffer=CHUNK
    )
    while True:
        audio_data = stream.read(CHUNK, exception_on_overflow=False)
        audio_np = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
        audio_queue.put(audio_np)

def transcribe_worker():
    """Process audio chunks and transcribe"""
    print("Transcription worker started...")
    while True:
        audio_chunk = audio_queue.get()
        if audio_chunk is None:
            break
        # Check if audio has speech (simple energy threshold)
        energy = np.abs(audio_chunk).mean()
        if energy < 0.01:  # Silence threshold
            continue
        # Transcribe
        try:
            result = model.transcribe(
                audio_chunk,
                fp16=False,      # Use fp32 for ROCm
                language='en',   # Specify if known
                task='transcribe'
            )
            text = result["text"].strip()
            if text:
                print(f"[TRANSCRIPT]: {text}")
        except Exception as e:
            print(f"Transcription error: {e}")

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description='Real-time audio transcription')
    parser.add_argument('--source', choices=['mic', 'speakers', 'combined'],
                        default='combined', help='Audio source to capture')
    args = parser.parse_args()

    # Start transcription worker
    transcribe_thread = Thread(target=transcribe_worker, daemon=True)
    transcribe_thread.start()

    # Start audio capture
    try:
        if args.source == 'combined':
            capture_audio_pulseaudio()
        else:
            capture_audio_pyaudio(args.source)
    except KeyboardInterrupt:
        print("\nStopping...")
        audio_queue.put(None)
        sys.exit(0)

Usage

Terminal window
# Install dependencies
pip install openai-whisper pyaudio numpy
# Set up combined audio source (one-time)
pactl load-module module-null-sink sink_name=combined
pactl load-module module-loopback source=alsa_input.pci-0000_00_1f.3.analog-stereo sink=combined latency_msec=1
pactl load-module module-loopback source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor sink=combined latency_msec=1
# Run transcription
python real_time_transcribe.py --source combined

Performance Optimization

For AMD GPUs (ROCm)

Key settings for optimal performance:

import os

# May need to override the GPU architecture; set this before torch/whisper are imported
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.1"  # For RX 7800 XT (gfx1101)

import whisper

# Load model
model = whisper.load_model("base", device="cuda")  # "cuda" works for ROCm
# Transcribe with fp32 (fp16 can be problematic on ROCm)
result = model.transcribe(audio, fp16=False)

Model Selection for Real-Time

Model       Speed (RX 7800 XT)   Accuracy   Best For
tiny        ~80x real-time       Basic      Ultra-fast, low accuracy OK
base        ~38x real-time       Good       Best balance for real-time
small       ~12x real-time       Better     High accuracy needed
medium      ~4x real-time        Great      Offline processing
large-v3    ~2-5x real-time      Best       Offline processing

Recommendation: Use base model for real-time transcription.

Chunk Size Tuning

# Smaller chunks = lower latency, less context
CHUNK = RATE * 1 # 1 second (faster, less accurate)
# Larger chunks = higher latency, better context
CHUNK = RATE * 3 # 3 seconds (slower, more accurate)
# Recommended for real-time
CHUNK = RATE * 2 # 2 seconds (good balance)

Troubleshooting

Issue: PyAudio installation fails

Terminal window
# Install PortAudio development headers
sudo apt install portaudio19-dev python3-pyaudio
# Then install PyAudio
pip install pyaudio

Issue: Can’t find monitor device

Terminal window
# List all sources with details
pactl list sources
# Look for "Monitor of Sink" in the output
# The source name will be like: alsa_output.XXX.monitor

Issue: No audio captured from speakers

Check if anything is playing:

Terminal window
pactl list sink-inputs

Test monitor directly:

Terminal window
parec --device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor --format=s16le --rate=44100 --channels=2 | aplay -f cd
# You should hear what's playing (with a short delay)

Issue: High latency with virtual sink

The loopback modules in the examples already use latency_msec=1, the practical minimum. If you used a higher value, unload the loopbacks (pactl unload-module module-loopback) and reload them with a lower latency_msec:

Terminal window
pactl load-module module-loopback \
source=alsa_input.pci-0000_00_1f.3.analog-stereo \
sink=combined \
latency_msec=1 # Lower = less latency (but may cause dropouts)

Issue: ROCm not detecting GPU

Terminal window
# Check ROCm installation
rocm-smi
# Check PyTorch ROCm
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True
# If False, reinstall PyTorch with ROCm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0

Issue: Transcription too slow for real-time

  1. Use smaller model: Switch from base → tiny
  2. Increase chunk size: Process 3-5 second chunks instead of 1-2 seconds
  3. Check GPU usage: rocm-smi should show GPU utilization
  4. Verify fp16=False: fp16 can cause slowdowns on ROCm
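
A quick way to check whether a given model keeps up is to time a transcription against the chunk length; the real-time factor must stay above 1x. A minimal sketch reusing the settings from the example above:

import time
import numpy as np
import whisper

model = whisper.load_model("base", device="cuda")  # "cuda" also targets ROCm builds of PyTorch

CHUNK_SECONDS = 2
audio = np.zeros(16000 * CHUNK_SECONDS, dtype=np.float32)  # replace with a real captured chunk

start = time.time()
model.transcribe(audio, fp16=False)
elapsed = time.time() - start
print(f"Real-time factor: {CHUNK_SECONDS / elapsed:.1f}x (must stay above 1x)")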

Advanced: Voice Activity Detection (VAD)

To avoid transcribing silence, add VAD:

import webrtcvad
import numpy as np

vad = webrtcvad.Vad(3)  # Aggressiveness 0-3

def has_speech(audio_chunk, sample_rate=16000):
    """Check if audio chunk contains speech"""
    # Convert float32 [-1, 1] audio to 16-bit PCM bytes
    audio_bytes = (audio_chunk * 32768).astype(np.int16).tobytes()
    # VAD expects 10, 20, or 30 ms frames
    frame_duration = 30  # ms
    frame_size = int(sample_rate * frame_duration / 1000) * 2  # 2 bytes per sample
    # Check first frame
    return vad.is_speech(audio_bytes[:frame_size], sample_rate)

# In transcription loop
if has_speech(audio_chunk):
    result = model.transcribe(audio_chunk)
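
The check above only inspects the first 30 ms frame of each chunk. A more robust variant scans every frame and requires a minimum fraction to be voiced (same assumptions: 16 kHz, 16-bit mono, and the vad instance from above):

def speech_ratio(audio_chunk, sample_rate=16000, frame_ms=30):
    """Fraction of 30 ms frames that webrtcvad classifies as speech."""
    audio_bytes = (audio_chunk * 32768).astype(np.int16).tobytes()
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    frames = [audio_bytes[i:i + frame_bytes]
              for i in range(0, len(audio_bytes) - frame_bytes + 1, frame_bytes)]
    if not frames:
        return 0.0
    voiced = sum(vad.is_speech(frame, sample_rate) for frame in frames)
    return voiced / len(frames)

# In the transcription loop: only transcribe if, say, 20% of frames are voiced
if speech_ratio(audio_chunk) > 0.2:
    result = model.transcribe(audio_chunk)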

Next Steps

See whisper-rocm-compatibility.md for:

  • Why faster-whisper doesn’t work with AMD GPUs
  • Why WhisperLive and RealtimeSTT won’t work
  • Alternative solutions for maximum performance