Project: Running Local LLM in Browser with WebGPU

Service: WebLLM Chat

  • URL: https://chat.webllm.ai
  • Technology: WebLLM library for running LLMs in browser using WebGPU
  • Model Tested: Llama 3.2 1B Instruct (Llama-3.2-1B-Instruct-q4f32_1-MLC)

Issue Encountered (2025-10-31)

Error Message:

Error: Unable to find a compatible GPU. This issue might be because your
computer doesn't have a GPU, or your system settings are not configured
properly. Please check if your device has a GPU properly set up and if
your browser supports WebGPU.

System Information

  • OS: Ubuntu 24.04
  • Browser: Google Chrome
  • GPUs Detected:
    • GPU0: AMD Radeon RX 7800 XT (Discrete) - Vulkan 1.4.305
    • GPU1: AMD Radeon Graphics (Integrated) - Vulkan 1.4.305
    • GPU2: llvmpipe (CPU fallback)

WebGPU Browser Support (2025)

  • ✅ Chrome/Edge: Stable since May 2023 (v113+)
  • ✅ Safari: WebGPU in Safari 26 (announced June 2025)
  • ⚠️ Firefox: Windows only since July 2025 (v141)

Troubleshooting Steps

  1. Check WebGPU Browser Support (quick console check after this list)

  2. Check GPU Drivers

    • AMD GPUs on Linux rely on Mesa's Vulkan driver (RADV); Chrome's WebGPU backend (Dawn) runs on top of it
    • Vulkan support is therefore required for WebGPU on Linux
  3. Browser Flags (Chrome/Chromium)

    • chrome://flags/#enable-unsafe-webgpu - Enable WebGPU
    • chrome://gpu - Check GPU feature status
  4. Firefox Specific

    • In about:config, set dom.webgpu.enabled → true
    • Note: Linux support is still experimental
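
A quick way to run step 1 from any page: the check below is standard WebGPU feature detection (plain browser API, nothing WebLLM-specific), and it distinguishes "API not exposed" from "no usable adapter":

// Paste into the DevTools console (F12)
if (!('gpu' in navigator)) {
  console.log('WebGPU API not exposed - browser too old or feature disabled');
} else {
  navigator.gpu.requestAdapter().then(adapter => {
    if (adapter) {
      console.log('WebGPU adapter available:', adapter);
    } else {
      console.log('API exposed, but no compatible adapter (driver or blocklist issue)');
    }
  });
}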

WebLLM Project Details

GitHub: https://github.com/mlc-ai/web-llm

Key Features:

  • Runs LLMs entirely in browser using WebGPU
  • No server required - complete privacy
  • Supports various models: Llama, Mistral, Phi, etc.
  • Model downloaded and cached locally
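
For context on how this looks in code, a minimal sketch of driving the library directly (rather than through the chat UI), following the OpenAI-style API from the project README; the model ID is the one tested later in these notes:

import { CreateMLCEngine } from "@mlc-ai/web-llm";

// First run downloads and caches the weights; later runs load from cache
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (p) => console.log(p.text), // download/compile progress
});

// OpenAI-compatible chat API, executed entirely in the browser
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(reply.choices[0].message.content);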

System Requirements:

  • WebGPU-compatible browser
  • GPU with sufficient VRAM (varies by model)
  • Modern GPU drivers with Vulkan support
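
WebGPU does not report VRAM directly; as a rough proxy, the adapter's buffer limits (standard WebGPU API, runnable from the console) indicate how large a single allocation can be:

// VRAM isn't exposed; buffer limits are the closest available proxy
const adapter = await navigator.gpu.requestAdapter();
console.log('maxBufferSize:', adapter.limits.maxBufferSize);
console.log('maxStorageBufferBindingSize:', adapter.limits.maxStorageBufferBindingSize);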

Troubleshooting Steps Completed

  1. Installed Vulkan Support

    sudo apt install -y mesa-vulkan-drivers vulkan-tools
  2. Verified Vulkan Installation

    • vulkaninfo --summary confirms both GPUs detected
    • AMD Radeon RX 7800 XT: Vulkan 1.4.305
    • AMD Radeon Graphics (integrated): Vulkan 1.4.305
  3. ⚠️ Chrome GPU Status Issues Found

    • WebGPU: Disabled (blocked via blocklist)
    • Vulkan: Disabled in Chrome
    • Dawn (WebGPU backend) detects both GPUs but WebGPU is blocked

Solution: Enable WebGPU in Chrome

Problem: WebGPU is disabled via Chrome’s blocklist despite Vulkan working.

Fix: Enable WebGPU manually in Chrome flags.
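
For a scriptable alternative to clicking through chrome://flags, the same settings can be passed as command-line switches when launching Chrome; --enable-unsafe-webgpu is confirmed by Chrome's own status output later in these notes, and --enable-features=Vulkan is the usual switch for forcing Vulkan on Linux:

google-chrome --enable-unsafe-webgpu --enable-features=Vulkan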

Next Steps

  • Verify GPU hardware details
  • Install Vulkan drivers
  • Check Chrome GPU status
  • Enable WebGPU in chrome://flags
  • Restart Chrome
  • Verify WebGPU is “Hardware accelerated”
  • Test WebLLM with small model
  • Document successful configuration

Status After Chrome Restart (2025-10-31)

WebGPU: Hardware accelerated (WORKING!)

  • Enabled via chrome://flags/#enable-unsafe-webgpu
  • Command line shows: --enable-unsafe-webgpu

⚠️ Vulkan: Still shows “Disabled” in main status

  • WebGPU can use OpenGL/ANGLE backend as fallback
  • However, Dawn detected Vulkan backend adapters for both GPUs:
    • AMD Radeon RX 7800 XT (Vulkan backend) - Available
    • AMD Radeon Graphics (Vulkan backend) - Available
  • WebGPU may already be using Vulkan backend internally

Performance Considerations:

  • Vulkan backend: Lower overhead, better performance for compute
  • ANGLE/OpenGL backend: Compatibility fallback, slightly more overhead
  • For LLM inference, Vulkan backend is preferred

Ready to test WebLLM!

Initial WebLLM Test Results (2025-10-31)

Model Loaded Successfully: Llama-3.2-1B-Instruct-q4f32_1-MLC

⚠️ Performance Issues:

  • Prefill: 1.6 tok/s (very slow)
  • Decode: 0.4 tok/s (extremely slow)
  • Expected on RX 7800 XT: 50-100+ tok/s

Likely Causes:

  1. Using integrated GPU instead of discrete RX 7800 XT
  2. Using ANGLE/OpenGL backend instead of Vulkan
  3. Need to force Vulkan backend and discrete GPU selection (adapter probe below)
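
Cause 1 can be probed directly from the console: WebGPU lets a page ask for the high-performance adapter, which on a dual-GPU machine should resolve to the discrete card (standard WebGPU API; adapter.info requires a recent Chrome):

// Compare what the browser hands out for each power preference
const discrete = await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
const integrated = await navigator.gpu.requestAdapter({ powerPreference: 'low-power' });
console.log('high-performance:', discrete?.info ?? discrete); // expect the RX 7800 XT here
console.log('low-power:', integrated?.info ?? integrated);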

Solutions to Try:

1. Enable Vulkan Backend in Chrome ✅ COMPLETED

Results After Enabling Vulkan:

MASSIVE PERFORMANCE IMPROVEMENT!

Before (ANGLE/OpenGL):

  • Prefill: 1.6 tok/s
  • Decode: 0.4 tok/s

After (Vulkan enabled):

  • Prefill: 320.8 tok/s (200x faster!)
  • Decode: 47.1 tok/s (117x faster!)

Conclusion:

  • Vulkan backend is now active and working
  • Likely using discrete AMD Radeon RX 7800 XT
  • Performance is now as expected for high-end GPU

Summary: How We Fixed It

  1. ✅ Installed Vulkan drivers (mesa-vulkan-drivers, vulkan-tools)
  2. ✅ Enabled WebGPU in Chrome (chrome://flags/#enable-unsafe-webgpu)
  3. ✅ Enabled Vulkan backend (chrome://flags/#enable-vulkan)
  4. ✅ Enabled ANGLE Vulkan (chrome://flags/#use-angle → Vulkan)
  5. ✅ Result: 117-200x performance improvement!

2. Check Which GPU is Being Used

Open Chrome DevTools (F12) in the WebLLM tab and run:

// Check which GPU WebGPU is using
navigator.gpu.requestAdapter().then(adapter => {
  console.log('=== GPU INFO ===');
  console.log('Adapter:', adapter);
  console.log('Features:', [...adapter.features]);
  // adapter.name was dropped from the spec; adapter.info (GPUAdapterInfo) replaces it
  console.log('Vendor/Device:', adapter.info ?? 'Not available');
  // Request the device to see more info (limits, features)
  adapter.requestDevice().then(device => {
    console.log('Device:', device);
  });
});

Alternative - Check in WebLLM logs:

  • Just reload the WebLLM page and watch console output
  • WebLLM logs which adapter it’s using during initialization

Expected: Should show “AMD Radeon RX 7800 XT” (discrete GPU)

Top Open Source LLMs - 2025 Rankings

LMSYS Chatbot Arena Leaderboard (Oct 2025)

Top Open Source Models by Arena Score:

  1. Qwen2.5-Max - 975.53 (Alibaba)
  2. DeepSeek-V3 - 959.80 (DeepSeek AI)
  3. Qwen2.5-Coder-32B-Instruct - 901.98 (Coding specialist)
  4. Meta Llama 4 Scout - Strong multimodal, 10M token context
  5. Google Gemma 3 - Top 10 Arena ranking, 27B variant

New Release: Qwen3 (April 2025)

Major Improvements Over Qwen2.5:

  • ArenaHard: 91.0 (vs 85.5 for DeepSeek-V3, 85.3 for GPT-4o)
  • AIME’24/25: 80.4 (ahead of QwQ-32B)
  • Best open-source LLM as of April 2025

Benchmark Comparison (2025)

MMLU-Pro CS Benchmark:

  • DeepSeek-V3: 78%
  • Qwen2.5-72B: 78% (tied)
  • Llama 3.3-70B: Solid, lower cost

Math/Reasoning (MATH-500):

  • DeepSeek-R1-Distill: 94.5% (best)
  • Qwen2.5: Strong
  • Llama 4: Good

Coding:

  • Qwen2.5-Coder: Leader (May 2025)
  • DeepSeek-V3: Excellent
  • Llama 4: Good general coding

Model Specializations

  • For Coding: Qwen2.5-Coder, DeepSeek-V3
  • For Math/Reasoning: DeepSeek-R1, Qwen3
  • For Long Context: Llama 4 (10M tokens)
  • For Efficiency: Gemma 3, Qwen2.5-3B
  • For Multilingual: Qwen2.5, GPT-4o

✨ UPDATED: Qwen3 Models Now Available in WebLLM! (2025-10-31)

WebLLM now supports Qwen3 models (0.6B, 1.7B, 4B, 8B), which are from the same family as Qwen3-235B (ranked on Chatbot Arena). These are significantly better than Qwen2.5 models.
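
Exact model ID strings (quantization suffixes and all) vary by WebLLM release, so rather than guessing them, the library's exported model registry can be filtered; a sketch assuming the documented prebuiltAppConfig export and its ModelRecord field names:

import { prebuiltAppConfig } from "@mlc-ai/web-llm";

// List every prebuilt Qwen3 variant with its reported VRAM requirement
prebuiltAppConfig.model_list
  .filter((m) => m.model_id.includes("Qwen3"))
  .forEach((m) => console.log(m.model_id, m.vram_required_MB, "MB"));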

🏆 Top Recommendations (Qwen3 Family):

Small Models (0.6B-4B):

  • 🥇 Qwen3-4B - BEST CHOICE! (from Arena family, 36T tokens training)
  • 🥈 Qwen3-1.7B - Excellent balance of speed and quality
  • 🥉 Qwen3-0.6B - Fastest option, still very capable

Medium Models (7-8B):

  • 🥇 Qwen3-8B - Top quality from Arena family, excellent for 16GB VRAM
  • 🥈 Qwen2.5-Coder-7B - Best for coding tasks specifically
  • 🥉 Llama 3.1-8B - Solid general purpose alternative

Legacy Options (Still Good, But Qwen3 is Better):

  • Qwen2.5-3B-Instruct - Good, but Qwen3-4B is superior
  • Qwen2.5-7B-Instruct - Good, but Qwen3-8B is superior
  • Gemma 3-4B - Efficient alternative
  • Llama 3.2-3B - Solid general purpose

Why Qwen3 Over Qwen2.5?

  • ✅ Newer generation (April 2025 vs 2024)
  • ✅ Better training (36T tokens vs 18T)
  • ✅ From Arena ranked family (Qwen3-235B)
  • ✅ Improved reasoning, coding, and instruction following
  • ✅ Same speed as Qwen2.5, but higher quality

Expected Performance on AMD RX 7800 XT:

  • Qwen3-0.6B: ~80-100 tok/s
  • Qwen3-1.7B: ~70-80 tok/s
  • Qwen3-4B: ~60-70 tok/s
  • Qwen3-8B: ~40-50 tok/s

Note: Larger models (DeepSeek-V3 671B, Qwen3-235B, Llama 4) are too large for browser deployment but excellent for local/server deployment.

Alternative Approaches

  1. Transformers.js - Hugging Face’s browser ML library
  2. ONNX Runtime Web - Microsoft’s web inference runtime
  3. TensorFlow.js - Google’s browser ML framework
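
For comparison, a minimal sketch of option 1 (Transformers.js) targeting WebGPU; this assumes the v3 package name and device option, and the model repo in the comment is illustrative, not verified:

import { pipeline } from "@huggingface/transformers";

// Transformers.js v3 can run pipelines on WebGPU instead of WASM
const generator = await pipeline(
  "text-generation",
  "onnx-community/Llama-3.2-1B-Instruct", // hypothetical ONNX build of the model tested above
  { device: "webgpu" },
);

const out = await generator("Hello!", { max_new_tokens: 64 });
console.log(out[0].generated_text);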

Performance Considerations

  • WebGPU compute shaders enable GPU acceleration
  • Model size vs. VRAM availability
  • Browser overhead vs. native performance
  • Network costs for model download (first run only)
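
On the last point: downloaded weights are cached in origin storage, and the standard Storage API gives a rough view of how much space that cache occupies:

// Rough check of origin storage used by cached model weights
const { usage, quota } = await navigator.storage.estimate();
console.log(`Using ${(usage / 1e9).toFixed(2)} GB of ${(quota / 1e9).toFixed(2)} GB quota`);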
