Project: Running Local LLM in Browser with WebGPU

Service: WebLLM Chat

  • URL: https://chat.webllm.ai
  • Technology: WebLLM library for running LLMs in browser using WebGPU
  • Model Tested: Llama 3.2 1B Instruct (Llama-3.2-1B-Instruct-q4f32_1-MLC)

Issue Encountered (2025-10-31)

Error Message:

Error: Unable to find a compatible GPU. This issue might be because your
computer doesn't have a GPU, or your system settings are not configured
properly. Please check if your device has a GPU properly set up and if
your browser supports WebGPU.

System Information

  • OS: Ubuntu 24.04
  • Browser: Google Chrome
  • GPUs Detected:
    • GPU0: AMD Radeon RX 7800 XT (Discrete) - Vulkan 1.4.305
    • GPU1: AMD Radeon Graphics (Integrated) - Vulkan 1.4.305
    • GPU2: llvmpipe (CPU fallback)

WebGPU Browser Support (2025)

  • ✅ Chrome/Edge: Stable since May 2023 (v113+)
  • ✅ Safari: WebGPU in Safari 26 (announced June 2025)
  • ⚠️ Firefox: Windows only since July 2025 (v141)

Troubleshooting Steps

  1. Check WebGPU Browser Support (quick console check after this list)

  2. Check GPU Drivers

    • AMD GPUs on Linux rely on Mesa's Vulkan driver (RADV); Chrome's WebGPU backend (Dawn) runs on top of it
    • Vulkan support is therefore required for WebGPU on Linux
  3. Browser Flags (Chrome/Chromium)

    • chrome://flags/#enable-unsafe-webgpu - Enable WebGPU
    • chrome://gpu - Check GPU feature status
  4. Firefox Specific

    • In about:config, set dom.webgpu.enabled → true
    • Note: Linux support is still experimental
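
A quick way to run step 1 from any page: the check below is standard WebGPU feature detection (plain browser API, nothing WebLLM-specific), and it distinguishes "API not exposed" from "no usable adapter":

// Paste into the DevTools console (F12)
if (!('gpu' in navigator)) {
  console.log('WebGPU API not exposed - browser too old or feature disabled');
} else {
  navigator.gpu.requestAdapter().then(adapter => {
    if (adapter) {
      console.log('WebGPU adapter available:', adapter);
    } else {
      console.log('API exposed, but no compatible adapter (driver or blocklist issue)');
    }
  });
}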

WebLLM Project Details

GitHub: https://github.com/mlc-ai/web-llm

Key Features:

  • Runs LLMs entirely in browser using WebGPU
  • No server required - complete privacy
  • Supports various models: Llama, Mistral, Phi, etc.
  • Model downloaded and cached locally
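
For context on how this looks in code, a minimal sketch of driving the library directly (rather than through the chat UI), following the OpenAI-style API from the project README; the model ID is the one tested later in these notes:

import { CreateMLCEngine } from "@mlc-ai/web-llm";

// First run downloads and caches the weights; later runs load from cache
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (p) => console.log(p.text), // download/compile progress
});

// OpenAI-compatible chat API, executed entirely in the browser
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(reply.choices[0].message.content);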

System Requirements:

  • WebGPU-compatible browser
  • GPU with sufficient VRAM (varies by model)
  • Modern GPU drivers with Vulkan support
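
WebGPU does not report VRAM directly; as a rough proxy, the adapter's buffer limits (standard WebGPU API, runnable from the console) indicate how large a single allocation can be:

// VRAM isn't exposed; buffer limits are the closest available proxy
const adapter = await navigator.gpu.requestAdapter();
console.log('maxBufferSize:', adapter.limits.maxBufferSize);
console.log('maxStorageBufferBindingSize:', adapter.limits.maxStorageBufferBindingSize);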

Troubleshooting Steps Completed

  1. Installed Vulkan Support

    sudo apt install -y mesa-vulkan-drivers vulkan-tools
  2. Verified Vulkan Installation

    • vulkaninfo --summary confirms both GPUs detected
    • AMD Radeon RX 7800 XT: Vulkan 1.4.305
    • AMD Radeon Graphics (integrated): Vulkan 1.4.305
  3. ⚠️ Chrome GPU Status Issues Found

    • WebGPU: Disabled (blocked via blocklist)
    • Vulkan: Disabled in Chrome
    • Dawn (WebGPU backend) detects both GPUs but WebGPU is blocked

Solution: Enable WebGPU in Chrome

Problem: WebGPU is disabled via Chrome’s blocklist despite Vulkan working.

Fix: Enable WebGPU manually in Chrome flags.
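
For a scriptable alternative to clicking through chrome://flags, the same settings can be passed as command-line switches when launching Chrome; --enable-unsafe-webgpu is confirmed by Chrome's own status output later in these notes, and --enable-features=Vulkan is the usual switch for forcing Vulkan on Linux:

google-chrome --enable-unsafe-webgpu --enable-features=Vulkan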

Next Steps

  • Verify GPU hardware details
  • Install Vulkan drivers
  • Check Chrome GPU status
  • Enable WebGPU in chrome://flags
  • Restart Chrome
  • Verify WebGPU is “Hardware accelerated”
  • Test WebLLM with small model
  • Document successful configuration

Status After Chrome Restart (2025-10-31)

WebGPU: Hardware accelerated (WORKING!)

  • Enabled via chrome://flags/#enable-unsafe-webgpu
  • Command line shows: --enable-unsafe-webgpu

⚠️ Vulkan: Still shows “Disabled” in main status

  • WebGPU can use OpenGL/ANGLE backend as fallback
  • However, Dawn detected Vulkan backend adapters for both GPUs:
    • AMD Radeon RX 7800 XT (Vulkan backend) - Available
    • AMD Radeon Graphics (Vulkan backend) - Available
  • WebGPU may already be using Vulkan backend internally

Performance Considerations:

  • Vulkan backend: Lower overhead, better performance for compute
  • ANGLE/OpenGL backend: Compatibility fallback, slightly more overhead
  • For LLM inference, Vulkan backend is preferred

Ready to test WebLLM!

Initial WebLLM Test Results (2025-10-31)

Model Loaded Successfully: Llama-3.2-1B-Instruct-q4f32_1-MLC

⚠️ Performance Issues:

  • Prefill: 1.6 tok/s (very slow)
  • Decode: 0.4 tok/s (extremely slow)
  • Expected on RX 7800 XT: 50-100+ tok/s

Likely Causes:

  1. Using integrated GPU instead of discrete RX 7800 XT
  2. Using ANGLE/OpenGL backend instead of Vulkan
  3. Need to force Vulkan backend and discrete GPU selection (adapter probe below)
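
Cause 1 can be probed directly from the console: WebGPU lets a page ask for the high-performance adapter, which on a dual-GPU machine should resolve to the discrete card (standard WebGPU API; adapter.info requires a recent Chrome):

// Compare what the browser hands out for each power preference
const discrete = await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
const integrated = await navigator.gpu.requestAdapter({ powerPreference: 'low-power' });
console.log('high-performance:', discrete?.info ?? discrete); // expect the RX 7800 XT here
console.log('low-power:', integrated?.info ?? integrated);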

Solutions to Try:

1. Enable Vulkan Backend in Chrome ✅ COMPLETED

Results After Enabling Vulkan:

MASSIVE PERFORMANCE IMPROVEMENT!

Before (ANGLE/OpenGL):

  • Prefill: 1.6 tok/s
  • Decode: 0.4 tok/s

After (Vulkan enabled):

  • Prefill: 320.8 tok/s (200x faster!)
  • Decode: 47.1 tok/s (117x faster!)

Conclusion:

  • Vulkan backend is now active and working
  • Likely using discrete AMD Radeon RX 7800 XT
  • Performance is now as expected for high-end GPU

Summary: How We Fixed It

  1. ✅ Installed Vulkan drivers (mesa-vulkan-drivers, vulkan-tools)
  2. ✅ Enabled WebGPU in Chrome (chrome://flags/#enable-unsafe-webgpu)
  3. ✅ Enabled Vulkan backend (chrome://flags/#enable-vulkan)
  4. ✅ Enabled ANGLE Vulkan (chrome://flags/#use-angle → Vulkan)
  5. ✅ Result: 117-200x performance improvement!

2. Check Which GPU is Being Used

Open Chrome DevTools (F12) in the WebLLM tab and run:

// Check which GPU WebGPU is using
navigator.gpu.requestAdapter().then(adapter => {
  console.log('=== GPU INFO ===');
  console.log('Adapter:', adapter);
  console.log('Features:', [...adapter.features]);
  // adapter.name was dropped from the spec; adapter.info (GPUAdapterInfo) replaces it
  console.log('Vendor/Device:', adapter.info ?? 'Not available');
  // Request the device to see more info (limits, features)
  adapter.requestDevice().then(device => {
    console.log('Device:', device);
  });
});

Alternative - Check in WebLLM logs:

  • Just reload the WebLLM page and watch console output
  • WebLLM logs which adapter it’s using during initialization

Expected: Should show “AMD Radeon RX 7800 XT” (discrete GPU)

Top Open Source LLMs - 2025 Rankings

LMSYS Chatbot Arena Leaderboard (Oct 2025)

Top Open Source Models by Arena Score:

  1. Qwen2.5-Max - 975.53 (Alibaba)
  2. DeepSeek-V3 - 959.80 (DeepSeek AI)
  3. Qwen2.5-Coder-32B-Instruct - 901.98 (Coding specialist)
  4. Meta Llama 4 Scout - Strong multimodal, 10M token context
  5. Google Gemma 3 - Top 10 Arena ranking, 27B variant

New Release: Qwen3 (April 2025)

Major Improvements Over Qwen2.5:

  • ArenaHard: 91.0 (vs 85.5 for DeepSeek-V3, 85.3 for GPT-4o)
  • AIME’24/25: 80.4 (ahead of QwQ-32B)
  • Best open-source LLM as of April 2025

Benchmark Comparison (2025)

MMLU-Pro CS Benchmark:

  • DeepSeek-V3: 78%
  • Qwen2.5-72B: 78% (tied)
  • Llama 3.3-70B: Solid, lower cost

Math/Reasoning (MATH-500):

  • DeepSeek-R1-Distill: 94.5% (best)
  • Qwen2.5: Strong
  • Llama 4: Good

Coding:

  • Qwen2.5-Coder: Leader (May 2025)
  • DeepSeek-V3: Excellent
  • Llama 4: Good general coding

Model Specializations

  • For Coding: Qwen2.5-Coder, DeepSeek-V3
  • For Math/Reasoning: DeepSeek-R1, Qwen3
  • For Long Context: Llama 4 (10M tokens)
  • For Efficiency: Gemma 3, Qwen2.5-3B
  • For Multilingual: Qwen2.5, GPT-4o

✨ UPDATED: Qwen3 Models Now Available in WebLLM! (2025-10-31)

WebLLM now supports Qwen3 models (0.6B, 1.7B, 4B, 8B), which are from the same family as Qwen3-235B (ranked on Chatbot Arena). These are significantly better than Qwen2.5 models.
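
Exact model ID strings (quantization suffixes and all) vary by WebLLM release, so rather than guessing them, the library's exported model registry can be filtered; a sketch assuming the documented prebuiltAppConfig export and its ModelRecord field names:

import { prebuiltAppConfig } from "@mlc-ai/web-llm";

// List every prebuilt Qwen3 variant with its reported VRAM requirement
prebuiltAppConfig.model_list
  .filter((m) => m.model_id.includes("Qwen3"))
  .forEach((m) => console.log(m.model_id, m.vram_required_MB, "MB"));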

🏆 Top Recommendations (Qwen3 Family):

Small Models (0.6B-4B):

  • 🥇 Qwen3-4B - BEST CHOICE! (from Arena family, 36T tokens training)
  • 🥈 Qwen3-1.7B - Excellent balance of speed and quality
  • 🥉 Qwen3-0.6B - Fastest option, still very capable

Medium Models (7-8B):

  • 🥇 Qwen3-8B - Top quality from Arena family, excellent for 16GB VRAM
  • 🥈 Qwen2.5-Coder-7B - Best for coding tasks specifically
  • 🥉 Llama 3.1-8B - Solid general purpose alternative

Legacy Options (Still Good, But Qwen3 is Better):

  • Qwen2.5-3B-Instruct - Good, but Qwen3-4B is superior
  • Qwen2.5-7B-Instruct - Good, but Qwen3-8B is superior
  • Gemma 3-4B - Efficient alternative
  • Llama 3.2-3B - Solid general purpose

Why Qwen3 Over Qwen2.5?

  • ✅ Newer generation (April 2025 vs 2024)
  • ✅ Better training (36T tokens vs 18T)
  • ✅ From Arena ranked family (Qwen3-235B)
  • ✅ Improved reasoning, coding, and instruction following
  • ✅ Same speed as Qwen2.5, but higher quality

Expected Performance on AMD RX 7800 XT:

  • Qwen3-0.6B: ~80-100 tok/s
  • Qwen3-1.7B: ~70-80 tok/s
  • Qwen3-4B: ~60-70 tok/s
  • Qwen3-8B: ~40-50 tok/s

Note: Larger models (DeepSeek-V3 671B, Qwen3-235B, Llama 4) are too large for browser deployment but excellent for local/server deployment.

Alternative Approaches

  1. Transformers.js - Hugging Face’s browser ML library
  2. ONNX Runtime Web - Microsoft’s web inference runtime
  3. TensorFlow.js - Google’s browser ML framework
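
For comparison, a minimal sketch of option 1 (Transformers.js) targeting WebGPU; this assumes the v3 package name and device option, and the model repo in the comment is illustrative, not verified:

import { pipeline } from "@huggingface/transformers";

// Transformers.js v3 can run pipelines on WebGPU instead of WASM
const generator = await pipeline(
  "text-generation",
  "onnx-community/Llama-3.2-1B-Instruct", // hypothetical ONNX build of the model tested above
  { device: "webgpu" },
);

const out = await generator("Hello!", { max_new_tokens: 64 });
console.log(out[0].generated_text);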

Performance Considerations

  • WebGPU compute shaders enable GPU acceleration
  • Model size vs. VRAM availability
  • Browser overhead vs. native performance
  • Network costs for model download (first run only)
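
On the last point: downloaded weights are cached in origin storage, and the standard Storage API gives a rough view of how much space that cache occupies:

// Rough check of origin storage used by cached model weights
const { usage, quota } = await navigator.storage.estimate();
console.log(`Using ${(usage / 1e9).toFixed(2)} GB of ${(quota / 1e9).toFixed(2)} GB quota`);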
