# Audio Capture Guide
## Overview
This guide covers how to capture and stream audio from both microphone input and speaker output on Ubuntu, suitable for real-time Whisper transcription with AMD GPUs.
## Ubuntu Audio Stack
Ubuntu uses one of two audio systems:
### PulseAudio (Traditional)
- Default on Ubuntu 22.04 and earlier
- Mature, well-documented
- Better for simple setups
### PipeWire (Modern)
- Default on Ubuntu 22.10+
- Replaces PulseAudio + JACK
- Better performance, lower latency
- Backward compatible with PulseAudio commands
Check which you’re using:
```bash
# Check the server name (works on both systems)
pactl info | grep "Server Name"
# Shows: PulseAudio (on PipeWire X.X.X) if using PipeWire
# Shows: pulseaudio X.X.X if using traditional PulseAudio
```
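On PipeWire systems the native tooling is also available alongside the `pactl` compatibility layer; a quick sketch, assuming WirePlumber's `wpctl` is installed (it is on stock Ubuntu 22.10+):

```bash
# Summarize the PipeWire graph: sinks, sources, and their IDs
wpctl status
```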
## Method 1: Capturing Microphone Input

### Using PulseAudio/PipeWire Commands
List available input sources:
```bash
pactl list sources short
```

Example output:

```
0   alsa_input.pci-0000_00_1f.3.analog-stereo     module-alsa-card.c   s16le 2ch 44100Hz
1   alsa_input.usb-Blue_Microphones_Yeti_Stereo   module-alsa-card.c   s16le 2ch 48000Hz
```

Record from microphone:
```bash
# Replace source name with your actual mic
parec --device=alsa_input.pci-0000_00_1f.3.analog-stereo | your-processing-script

# Or specify format (16kHz mono for Whisper)
parec --device=alsa_input.pci-0000_00_1f.3.analog-stereo \
  --format=s16le \
  --rate=16000 \
  --channels=1 | your-processing-script
```

### Using ALSA Directly (Lower Level)
```bash
# List capture devices
arecord -l

# Record from mic (16kHz mono, 16-bit PCM)
arecord -f S16_LE -r 16000 -c 1 -t raw | your-processing-script

# Or specify device
arecord -D hw:0,0 -f S16_LE -r 16000 -c 1 -t raw | your-processing-script
```
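Before wiring `arecord` into a longer pipeline, it can be worth a quick sanity check that the mic actually records; a minimal sketch (the `/tmp/mic-test.wav` path is arbitrary):

```bash
# Record 5 seconds from the default device, then play it back
arecord -f S16_LE -r 16000 -c 1 -d 5 /tmp/mic-test.wav
aplay /tmp/mic-test.wav
```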
### Using PyAudio (Python)

```python
import pyaudio
import numpy as np

audio = pyaudio.PyAudio()

# List devices
for i in range(audio.get_device_count()):
    info = audio.get_device_info_by_index(i)
    if info['maxInputChannels'] > 0:
        print(f"{i}: {info['name']} (inputs: {info['maxInputChannels']})")

# Open microphone stream
stream = audio.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    input_device_index=None,  # None = default, or specify a device index
    frames_per_buffer=1024
)

# Read audio chunks
while True:
    audio_chunk = stream.read(1024)
    audio_np = np.frombuffer(audio_chunk, dtype=np.int16)
    # Process audio_np...
```
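As a stand-in for the `# Process audio_np...` step, a crude console level meter confirms samples are flowing; illustrative only (`print_level` is a hypothetical helper, not a library call):

```python
import numpy as np

def print_level(audio_np: np.ndarray, width: int = 50) -> None:
    """Print a rough level meter for one int16 chunk."""
    rms = np.sqrt(np.mean((audio_np.astype(np.float32) / 32768.0) ** 2))
    print("#" * int(rms * width))
```

Call `print_level(audio_np)` inside the read loop; a bar that tracks your voice means the device index is right.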
## Method 2: Capturing Speaker Output (System Audio)

### Understanding Monitor Sources
PulseAudio/PipeWire creates “monitor” sources for each audio output. These capture what’s being played through your speakers.
List monitor sources:
```bash
pactl list sources short | grep monitor
```

Example output:

```
2   alsa_output.pci-0000_00_1f.3.analog-stereo.monitor   module-alsa-card.c   s16le 2ch 44100Hz
```

Record speaker output:
```bash
parec --device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor | your-processing-script

# With format specification
parec --device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor \
  --format=s16le \
  --rate=16000 \
  --channels=1 | your-processing-script
```

### Finding the Correct Monitor
If you have multiple audio outputs:
```bash
# List all sinks (outputs)
pactl list sinks short

# Each sink has a corresponding .monitor source
# Example sink:  alsa_output.pci-0000_00_1f.3.hdmi-stereo
# Its monitor:   alsa_output.pci-0000_00_1f.3.hdmi-stereo.monitor
```
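On recent PulseAudio/PipeWire releases, `pactl get-default-sink` prints the name of the current default output, so the monitor name can be derived without scanning the full list; a small sketch:

```bash
# Capture whatever the current default output is playing
MONITOR="$(pactl get-default-sink).monitor"
parec --device="$MONITOR" | your-processing-script
```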
### Using ALSA Loopback (Alternative)

```bash
# Load loopback module (may require root)
sudo modprobe snd-aloop

# This creates a virtual loopback device. Playback sent to hw:Loopback,0
# appears on the capture side at hw:Loopback,1, so route your application's
# output to the loopback first.
arecord -D hw:Loopback,1 -f S16_LE -r 16000 -c 1 | your-processing-script
```

## Method 3: Capturing BOTH Mic + Speakers Simultaneously
This is the most useful for comprehensive audio transcription.
### Option A: Using PulseAudio/PipeWire Virtual Sink
**Step 1: Create a virtual sink (combined audio)**
```bash
# Create a null sink (virtual audio device)
pactl load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"
```

**Step 2: Route microphone to virtual sink**
```bash
# Replace with your actual mic source
pactl load-module module-loopback \
  source=alsa_input.pci-0000_00_1f.3.analog-stereo \
  sink=combined \
  latency_msec=1
```

**Step 3: Route speaker output to virtual sink**
```bash
# Replace with your actual speaker monitor
pactl load-module module-loopback \
  source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor \
  sink=combined \
  latency_msec=1
```

**Step 4: Record from the combined monitor**
```bash
parec --device=combined.monitor | your-processing-script

# Or with format
parec --device=combined.monitor \
  --format=s16le \
  --rate=16000 \
  --channels=1 | your-processing-script
```
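To undo this routing later, unload the modules. `pactl load-module` prints a module index when it succeeds, and `pactl unload-module` accepts either that index or a module name (the name removes all instances):

```bash
# Inspect loaded modules and their indices
pactl list modules short | grep -E "module-null-sink|module-loopback"

# Unload all loopbacks and the null sink by name
pactl unload-module module-loopback
pactl unload-module module-null-sink
```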
**To make it permanent:** edit `/etc/pulse/default.pa` or `~/.config/pulse/default.pa`:
```
# Add these lines
load-module module-null-sink sink_name=combined sink_properties=device.description="Combined_Audio"
load-module module-loopback source=alsa_input.pci-0000_00_1f.3.analog-stereo sink=combined latency_msec=1
load-module module-loopback source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor sink=combined latency_msec=1
```

Then restart PulseAudio:
```bash
pulseaudio -k
pulseaudio --start
```
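Note that on PipeWire systems `pipewire-pulse` does not read `default.pa` and there is no standalone PulseAudio daemon to kill; restart the user services instead (assuming the stock systemd user units) and run the `pactl load-module` commands from a login/autostart script:

```bash
systemctl --user restart pipewire pipewire-pulse wireplumber
```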
### Option B: Using Python with Multiple Streams

```python
import pyaudio
import numpy as np
import threading

audio = pyaudio.PyAudio()

# Buffer to store combined audio
combined_buffer = []
buffer_lock = threading.Lock()

def capture_mic(device_index):
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        input_device_index=device_index,
        frames_per_buffer=1024
    )
    while True:
        data = stream.read(1024)
        audio_np = np.frombuffer(data, dtype=np.int16)
        with buffer_lock:
            combined_buffer.append(('mic', audio_np))

def capture_speakers(device_index):
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        input_device_index=device_index,
        frames_per_buffer=1024
    )
    while True:
        data = stream.read(1024)
        audio_np = np.frombuffer(data, dtype=np.int16)
        with buffer_lock:
            combined_buffer.append(('speaker', audio_np))

# Device indices found with the listing snippet in Method 1
mic_device_index = None   # None = default input device
monitor_device_index = 2  # index of your .monitor device

# Start capture threads
mic_thread = threading.Thread(target=capture_mic, args=(mic_device_index,))
speaker_thread = threading.Thread(target=capture_speakers, args=(monitor_device_index,))

mic_thread.daemon = True
speaker_thread.daemon = True

mic_thread.start()
speaker_thread.start()

# Process combined buffer
while True:
    with buffer_lock:
        if combined_buffer:
            source, audio_chunk = combined_buffer.pop(0)
            # Process audio_chunk...
```
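The loop above only dequeues tagged chunks; to feed a single stream to Whisper, the two sources still have to be mixed. A minimal sketch (`mix_chunks` is a hypothetical helper, not a library call): sum in a wider dtype and clip to avoid int16 wrap-around:

```python
import numpy as np

def mix_chunks(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Mix two equal-length int16 chunks, clipping to avoid overflow."""
    mixed = a.astype(np.int32) + b.astype(np.int32)
    return np.clip(mixed, -32768, 32767).astype(np.int16)
```

Pairing chunks by arrival order only works if both streams run at the same rate; the virtual-sink approach in Option A avoids this bookkeeping entirely.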
### Option C: Using GStreamer

```bash
# Install GStreamer
sudo apt install gstreamer1.0-tools gstreamer1.0-plugins-good
```
```bash
# Capture mic + speakers and mix them.
# fdsink fd=1 writes the mixed raw PCM to stdout, so the whole command
# can be piped into your processing script.
gst-launch-1.0 \
  audiomixer name=mix ! \
  audioconvert ! \
  audio/x-raw,rate=16000,channels=1 ! \
  fdsink fd=1 \
  pulsesrc device=alsa_input.pci-0000_00_1f.3.analog-stereo ! mix. \
  pulsesrc device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor ! mix.
```

## Integration with Whisper for Real-Time Transcription
### Complete Working Example
```python
#!/usr/bin/env python3
"""
Real-time audio transcription with AMD GPU support.
Captures both mic and speakers, transcribes with openai-whisper.
"""

import subprocess
import sys
from queue import Queue
from threading import Thread

import numpy as np
import pyaudio
import whisper

# Load Whisper model (ROCm-accelerated on AMD GPUs)
print("Loading Whisper model...")
model = whisper.load_model("base", device="cuda")  # "cuda" works for ROCm too
print("Model loaded!")

# Audio configuration
RATE = 16000
CHUNK = RATE * 2  # 2 seconds of audio
FORMAT = pyaudio.paInt16
CHANNELS = 1

# Transcription queue
audio_queue = Queue()


def get_monitor_device(name_contains="monitor"):
    """Find monitor device for speaker capture"""
    p = pyaudio.PyAudio()
    for i in range(p.get_device_count()):
        info = p.get_device_info_by_index(i)
        if name_contains.lower() in info['name'].lower() and info['maxInputChannels'] > 0:
            print(f"Found monitor device: {info['name']}")
            return i
    return None


def get_mic_device():
    """Find default microphone"""
    p = pyaudio.PyAudio()
    return p.get_default_input_device_info()['index']


def capture_audio_pulseaudio():
    """
    Capture combined audio using the PulseAudio virtual sink.
    Assumes you've set up the 'combined' sink as shown above.
    """
    cmd = [
        'parec',
        '--device=combined.monitor',
        '--format=s16le',
        '--rate=16000',
        '--channels=1'
    ]

    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=CHUNK * 2)

    print("Capturing audio from combined source (mic + speakers)...")

    while True:
        audio_data = process.stdout.read(CHUNK * 2)  # 2 bytes per sample
        if not audio_data:
            break

        audio_np = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
        audio_queue.put(audio_np)


def capture_audio_pyaudio(source='mic'):
    """
    Capture audio using PyAudio.
    source: 'mic', 'speakers', or 'both' (not implemented yet)
    """
    audio = pyaudio.PyAudio()

    if source == 'mic':
        device_index = get_mic_device()
        print(f"Capturing from microphone (device {device_index})...")
    elif source == 'speakers':
        device_index = get_monitor_device()
        print(f"Capturing from speakers (device {device_index})...")
    else:
        print("Invalid source. Use 'mic' or 'speakers'")
        return

    stream = audio.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        input_device_index=device_index,
        frames_per_buffer=CHUNK
    )

    while True:
        audio_data = stream.read(CHUNK, exception_on_overflow=False)
        audio_np = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
        audio_queue.put(audio_np)


def transcribe_worker():
    """Process audio chunks and transcribe"""
    print("Transcription worker started...")

    while True:
        audio_chunk = audio_queue.get()

        if audio_chunk is None:
            break

        # Check if audio has speech (simple energy threshold)
        energy = np.abs(audio_chunk).mean()
        if energy < 0.01:  # Silence threshold
            continue

        # Transcribe
        try:
            result = model.transcribe(
                audio_chunk,
                fp16=False,       # Use fp32 for ROCm
                language='en',    # Specify if known
                task='transcribe'
            )

            text = result["text"].strip()
            if text:
                print(f"[TRANSCRIPT]: {text}")

        except Exception as e:
            print(f"Transcription error: {e}")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description='Real-time audio transcription')
    parser.add_argument('--source', choices=['mic', 'speakers', 'combined'],
                        default='combined', help='Audio source to capture')
    args = parser.parse_args()

    # Start transcription worker
    transcribe_thread = Thread(target=transcribe_worker, daemon=True)
    transcribe_thread.start()

    # Start audio capture
    try:
        if args.source == 'combined':
            capture_audio_pulseaudio()
        else:
            capture_audio_pyaudio(args.source)
    except KeyboardInterrupt:
        print("\nStopping...")
        audio_queue.put(None)
        sys.exit(0)
```

### Usage
```bash
# Install dependencies
pip install openai-whisper pyaudio numpy
```
```bash
# Set up combined audio source (one-time)
pactl load-module module-null-sink sink_name=combined
pactl load-module module-loopback source=alsa_input.pci-0000_00_1f.3.analog-stereo sink=combined latency_msec=1
pactl load-module module-loopback source=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor sink=combined latency_msec=1
```
```bash
# Run transcription
python real_time_transcribe.py --source combined
```

## Performance Optimization
### For AMD GPUs (ROCm)
Key settings for optimal performance:
```python
import os

# May need to override the GPU architecture.
# Set this before the first GPU call (i.e. before loading the model).
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.1"  # For RX 7800 XT (gfx1101)

# Load model
model = whisper.load_model("base", device="cuda")  # "cuda" works for ROCm

# Transcribe with fp32 (fp16 can be problematic on ROCm)
result = model.transcribe(audio, fp16=False)
```

### Model Selection for Real-Time
| Model | Speed (RX 7800 XT) | Accuracy | Best For |
|---|---|---|---|
| tiny | ~80x real-time | Basic | Ultra-fast, low accuracy OK |
| base | ~38x real-time | Good | Best balance for real-time |
| small | ~12x real-time | Better | High accuracy needed |
| medium | ~4x real-time | Great | Offline processing |
| large-v3 | ~2-5x real-time | Best | Offline processing |
**Recommendation:** Use the `base` model for real-time transcription.
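The speed figures in the table are hardware-dependent; a crude way to measure your own real-time multiple (a sketch: 30 s of silence stands in for real audio, so treat the result as a rough upper bound):

```python
import time

import numpy as np
import whisper

model = whisper.load_model("base", device="cuda")  # "cuda" works for ROCm

duration_s = 30
audio = np.zeros(16000 * duration_s, dtype=np.float32)  # silent 16 kHz audio

start = time.time()
model.transcribe(audio, fp16=False)
elapsed = time.time() - start
print(f"~{duration_s / elapsed:.1f}x real-time")
```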
### Chunk Size Tuning
```python
# Smaller chunks = lower latency, less context
CHUNK = RATE * 1  # 1 second (faster, less accurate)

# Larger chunks = higher latency, better context
CHUNK = RATE * 3  # 3 seconds (slower, more accurate)

# Recommended for real-time
CHUNK = RATE * 2  # 2 seconds (good balance)
```
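Fixed chunks can cut words at chunk boundaries. One common mitigation, not implemented in the script above, is to prepend a short tail of the previous chunk to each new window (a sketch, assuming the main script's `audio_queue` and `model`):

```python
import numpy as np

RATE = 16000
OVERLAP = RATE // 2  # carry 0.5 s of audio into the next window

tail = np.zeros(0, dtype=np.float32)
while True:
    chunk = audio_queue.get()               # queue from the main script
    window = np.concatenate([tail, chunk])  # previous tail + new chunk
    tail = chunk[-OVERLAP:]
    result = model.transcribe(window, fp16=False)
```

The overlap means boundary words appear in two consecutive windows, so expect some duplicated text in the output.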
## Troubleshooting

### Issue: PyAudio installation fails
```bash
# Install PortAudio development headers
sudo apt install portaudio19-dev python3-pyaudio

# Then install PyAudio
pip install pyaudio
```

### Issue: Can’t find monitor device
```bash
# List all sources with details
pactl list sources

# Look for "Monitor of Sink" in the output
# The source name will be like: alsa_output.XXX.monitor
```

### Issue: No audio captured from speakers
Check if anything is playing:
```bash
pactl list sink-inputs
```

Test the monitor directly:
```bash
parec --device=alsa_output.pci-0000_00_1f.3.analog-stereo.monitor | aplay
# You should hear what's playing (with a delay)
```

### Issue: High latency with virtual sink
Reduce `latency_msec`:
```bash
pactl load-module module-loopback \
  source=alsa_input.pci-0000_00_1f.3.analog-stereo \
  sink=combined \
  latency_msec=1  # Lower = less latency (but may cause dropouts)
```

### Issue: ROCm not detecting GPU
```bash
# Check ROCm installation
rocm-smi

# Check PyTorch ROCm
python -c "import torch; print(torch.cuda.is_available())"
# Should print: True

# If False, reinstall PyTorch with ROCm
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
```

### Issue: Transcription too slow for real-time
- Use a smaller model: switch from `base` → `tiny`
- Increase chunk size: process 3-5 second chunks instead of 1-2 seconds
- Check GPU usage: `rocm-smi` should show GPU utilization
- Verify `fp16=False`: fp16 can cause slowdowns on ROCm
## Advanced: Voice Activity Detection (VAD)
To avoid transcribing silence, add VAD:
```python
import numpy as np
import webrtcvad  # pip install webrtcvad

vad = webrtcvad.Vad(3)  # Aggressiveness 0-3

def has_speech(audio_chunk, sample_rate=16000):
    """Check if an audio chunk contains speech"""
    # Convert float32 samples back to 16-bit PCM bytes
    audio_bytes = (audio_chunk * 32768).astype(np.int16).tobytes()

    # VAD expects 10, 20, or 30ms frames
    frame_duration = 30  # ms
    frame_size = int(sample_rate * frame_duration / 1000) * 2  # 2 bytes per sample

    # Check first frame
    return vad.is_speech(audio_bytes[:frame_size], sample_rate)

# In the transcription loop
if has_speech(audio_chunk):
    result = model.transcribe(audio_chunk)
```
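Checking only the first 30 ms frame misses speech that starts later in the chunk. A more robust variant scores every frame (a sketch; `speech_ratio` is a hypothetical helper building on the `vad` object above):

```python
def speech_ratio(audio_chunk, sample_rate=16000, frame_ms=30):
    """Fraction of frames the VAD flags as speech across the whole chunk."""
    pcm = (audio_chunk * 32768).astype(np.int16).tobytes()
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2
    frames = [pcm[i:i + frame_bytes]
              for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]
    if not frames:
        return 0.0
    return sum(vad.is_speech(f, sample_rate) for f in frames) / len(frames)

# e.g. only transcribe when at least 20% of the chunk looks like speech
if speech_ratio(audio_chunk) > 0.2:
    result = model.transcribe(audio_chunk)
```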
## Resources

- PulseAudio Documentation: https://www.freedesktop.org/wiki/Software/PulseAudio/
- PipeWire Documentation: https://pipewire.org/
- PyAudio Documentation: http://people.csail.mit.edu/hubert/pyaudio/
- Whisper GitHub: https://github.com/openai/whisper
- ROCm Documentation: https://rocm.docs.amd.com/
## Next Steps
See `whisper-rocm-compatibility.md` for:
- Why faster-whisper doesn’t work with AMD GPUs
- Why WhisperLive and RealtimeSTT won’t work
- Alternative solutions for maximum performance