webgpu-llm-research
Project: Running Local LLM in Browser with WebGPU
Service: WebLLM Chat
- URL: https://chat.webllm.ai
- Technology: WebLLM library for running LLMs in browser using WebGPU
- Model Tested: Llama (Llama-3.2-1B-Instruct-q4f32_1-MLC; see test results below)
Issue Encountered (2025-10-31)
Error Message:
Error: Unable to find a compatible GPU. This issue might be because your computer doesn't have a GPU, or your system settings are not configured properly. Please check if your device has a GPU properly set up and if your browser supports WebGPU.
System Information
- OS: Ubuntu 24.04
- Browser: Google Chrome
- GPUs Detected:
- GPU0: AMD Radeon RX 7800 XT (Discrete) - Vulkan 1.4.305
- GPU1: AMD Radeon Graphics (Integrated) - Vulkan 1.4.305
- GPU2: llvmpipe (CPU fallback)
WebGPU Browser Support (2025)
- ✅ Chrome/Edge: Stable since April 2023 (v113+)
- ✅ Safari: Since June 2025 (v26+)
- ⚠️ Firefox: Windows only since July 2025 (v141)
Troubleshooting Steps
- Check WebGPU Browser Support
  - Visit: https://webgpureport.org/
  - Verify WebGPU is enabled in the browser (a quick detection snippet follows this list)
- Check GPU Drivers
  - AMD GPUs require Mesa drivers with Vulkan support (RADV for modern AMD cards)
  - Vulkan support is required for WebGPU on Linux
- Browser Flags (Chrome/Chromium)
  - `chrome://flags/#enable-unsafe-webgpu` - Enable WebGPU
  - `chrome://gpu` - Check GPU feature status
- Firefox Specific
  - `about:config` → `dom.webgpu.enabled` → `true`
  - Note: Linux support is still experimental
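For the first check, a minimal detection snippet you can paste into the browser's DevTools console. It uses only the standard `navigator.gpu` API; note that `adapter.info` requires a recent Chrome (older builds exposed `requestAdapterInfo()` instead):

```javascript
// Quick WebGPU availability check for the current browser/GPU.
if (!navigator.gpu) {
  console.log("WebGPU API not exposed - unsupported browser or disabled flag");
} else {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    console.log("WebGPU exposed, but no usable adapter - likely a driver/blocklist issue");
  } else {
    console.log("WebGPU adapter found:", adapter.info);
  }
}
```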
WebLLM Project Details
GitHub: https://github.com/mlc-ai/web-llm
Key Features:
- Runs LLMs entirely in browser using WebGPU
- No server required - complete privacy
- Supports various models: Llama, Mistral, Phi, etc.
- Model downloaded and cached locally
System Requirements:
- WebGPU-compatible browser
- GPU with sufficient VRAM (varies by model)
- Modern GPU drivers with Vulkan support
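For context, a minimal usage sketch based on WebLLM's documented OpenAI-style API (the model ID is the one tested later in these notes; check WebLLM's model list for current IDs):

```javascript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Downloads and caches the model weights on first run, then runs fully in-browser.
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f32_1-MLC", {
  initProgressCallback: (p) => console.log(p.text),
});

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Say hello in five words." }],
});
console.log(reply.choices[0].message.content);
```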
Troubleshooting Steps Completed
- ✅ Installed Vulkan Support
  ```bash
  sudo apt install -y mesa-vulkan-drivers vulkan-tools
  ```
- ✅ Verified Vulkan Installation
  - `vulkaninfo --summary` confirms both GPUs detected:
    - AMD Radeon RX 7800 XT: Vulkan 1.4.305
    - AMD Radeon Graphics (integrated): Vulkan 1.4.305
- ⚠️ Chrome GPU Status Issues Found
  - WebGPU: Disabled (blocked via blocklist)
  - Vulkan: Disabled in Chrome
  - Dawn (WebGPU backend) detects both GPUs, but WebGPU is blocked
Solution: Enable WebGPU in Chrome
Problem: WebGPU is disabled via Chrome’s blocklist despite Vulkan working.
Fix: Enable WebGPU manually in Chrome flags.
Next Steps
- Verify GPU hardware details
- Install Vulkan drivers
- Check Chrome GPU status
- Enable WebGPU in chrome://flags
- Restart Chrome
- Verify WebGPU is “Hardware accelerated”
- Test WebLLM with small model
- Document successful configuration
Status After Chrome Restart (2025-10-31)
✅ WebGPU: Hardware accelerated (WORKING!)
- Enabled via `chrome://flags/#enable-unsafe-webgpu`
- Command line shows: `--enable-unsafe-webgpu`
⚠️ Vulkan: Still shows “Disabled” in main status
- WebGPU can use OpenGL/ANGLE backend as fallback
- However, Dawn detected Vulkan backend adapters for both GPUs:
- AMD Radeon RX 7800 XT (Vulkan backend) - Available
- AMD Radeon Graphics (Vulkan backend) - Available
- WebGPU may already be using Vulkan backend internally
Performance Considerations:
- Vulkan backend: Lower overhead, better performance for compute
- ANGLE/OpenGL backend: Compatibility fallback, slightly more overhead
- For LLM inference, Vulkan backend is preferred
Ready to test WebLLM!
Initial WebLLM Test Results (2025-10-31)
✅ Model Loaded Successfully: Llama-3.2-1B-Instruct-q4f32_1-MLC
⚠️ Performance Issues:
- Prefill: 1.6 tok/s (very slow)
- Decode: 0.4 tok/s (extremely slow)
- Expected on RX 7800 XT: 50-100+ tok/s
Likely Causes:
- Using integrated GPU instead of discrete RX 7800 XT
- Using ANGLE/OpenGL backend instead of Vulkan
- Need to force Vulkan backend and discrete GPU selection
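One adapter-selection knob is available from JavaScript: `requestAdapter` accepts a `powerPreference` hint from the WebGPU spec, which nudges the browser toward the discrete GPU. It is only a hint; the browser and OS make the final choice:

```javascript
// Ask for the high-performance adapter (typically the discrete GPU).
const adapter = await navigator.gpu.requestAdapter({
  powerPreference: "high-performance",
});
console.log(adapter?.info); // vendor / architecture / device / description
```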
Solutions to Try:
1. Enable Vulkan Backend in Chrome ✅ COMPLETED
   - Enabled `chrome://flags/#enable-vulkan`
   - Enabled `chrome://flags/#default-angle-vulkan`
   - Result: Vulkan now shows “Enabled” in chrome://gpu
Results After Enabling Vulkan:
✅ MASSIVE PERFORMANCE IMPROVEMENT!
Before (ANGLE/OpenGL):
- Prefill: 1.6 tok/s
- Decode: 0.4 tok/s
After (Vulkan enabled):
- Prefill: 320.8 tok/s (200x faster!)
- Decode: 47.1 tok/s (117x faster!)
Conclusion:
- Vulkan backend is now active and working
- Likely using discrete AMD Radeon RX 7800 XT
- Performance is now as expected for high-end GPU
Summary: How We Fixed It
- ✅ Installed Vulkan drivers (`mesa-vulkan-drivers`, `vulkan-tools`)
- ✅ Enabled WebGPU in Chrome (`chrome://flags/#enable-unsafe-webgpu`)
- ✅ Enabled Vulkan backend (`chrome://flags/#enable-vulkan`)
- ✅ Enabled ANGLE Vulkan (`chrome://flags/#default-angle-vulkan`)
- ✅ Result: 117-200x performance improvement!
2. Check Which GPU is Being Used
Open Chrome DevTools (F12) in WebLLM tab and run:
```javascript
// Check which GPU WebGPU is using
navigator.gpu.requestAdapter().then(adapter => {
  console.log('=== GPU INFO ===');
  console.log('Adapter:', adapter);
  console.log('Features:', [...adapter.features]);
  // adapter.name was removed from the spec; adapter.info now carries
  // the vendor/architecture/device/description strings
  console.log('Vendor:', adapter.info?.vendor ?? 'Not available');
  console.log('Description:', adapter.info?.description ?? 'Not available');

  // Request a device to see more info
  adapter.requestDevice().then(device => {
    console.log('Device:', device);
  });
});
```
Alternative - Check in WebLLM logs:
- Just reload the WebLLM page and watch console output
- WebLLM logs which adapter it’s using during initialization
Expected: Should show “AMD Radeon RX 7800 XT” (discrete GPU)
Top Open Source LLMs - 2025 Rankings
LMSYS Chatbot Arena Leaderboard (Oct 2025)
Top Open Source Models by Arena Score:
- Qwen2.5-Max - 975.53 (Alibaba)
- DeepSeek-V3 - 959.80 (DeepSeek AI)
- Qwen2.5-Coder-32B-Instruct - 901.98 (Coding specialist)
- Meta Llama 4 Scout - Strong multimodal, 10M token context
- Google Gemma 3 - Top 10 Arena ranking, 27B variant
New Release: Qwen3 (April 2025)
Major Improvements Over Qwen2.5:
- ArenaHard: 91.0 (vs 85.5 for DeepSeek-V3, 85.3 for GPT-4o)
- AIME’24/25: 80.4 (ahead of QwQ-32B)
- Best open-source LLM as of April 2025
Benchmark Comparison (2025)
MMLU-Pro CS Benchmark:
- DeepSeek-V3: 78%
- Qwen2.5-72B: 78% (tied)
- Llama 3.3-70B: Solid, lower cost
Math/Reasoning (MATH-500):
- DeepSeek-R1-Distill: 94.5% (best)
- Qwen2.5: Strong
- Llama 4: Good
Coding:
- Qwen2.5-Coder: Leader (May 2025)
- DeepSeek-V3: Excellent
- Llama 4: Good general coding
Model Specializations
- For Coding: Qwen2.5-Coder, DeepSeek-V3
- For Math/Reasoning: DeepSeek-R1, Qwen3
- For Long Context: Llama 4 (10M tokens)
- For Efficiency: Gemma 3, Qwen2.5-3B
- For Multilingual: Qwen2.5, GPT-4o
Recommended for WebLLM (Browser Usage)
✨ UPDATED: Qwen3 Models Now Available in WebLLM! (2025-10-31)
WebLLM now supports Qwen3 models (0.6B, 1.7B, 4B, 8B), which are from the same family as Qwen3-235B (ranked #3 on Chatbot Arena). These are significantly better than Qwen2.5 models.
🏆 Top Recommendations (Qwen3 Family):
Small Models (0.6B-4B):
- 🥇 Qwen3-4B - BEST CHOICE! (from #3 Arena family, 36T tokens training)
- 🥈 Qwen3-1.7B - Excellent balance of speed and quality
- 🥉 Qwen3-0.6B - Fastest option, still very capable
Medium Models (7-8B):
- 🥇 Qwen3-8B - Top quality from #3 Arena family, excellent for 16GB VRAM
- 🥈 Qwen2.5-Coder-7B - Best for coding tasks specifically
- 🥉 Llama 3.1-8B - Solid general purpose alternative
Legacy Options (Still Good, But Qwen3 is Better):
- Qwen2.5-3B-Instruct - Good, but Qwen3-4B is superior
- Qwen2.5-7B-Instruct - Good, but Qwen3-8B is superior
- Gemma 3-4B - Efficient alternative
- Llama 3.2-3B - Solid general purpose
Why Qwen3 Over Qwen2.5?
- ✅ Newer generation (April 2025 vs 2024)
- ✅ Better training (36T tokens vs 18T)
- ✅ From Arena #3 ranked family (Qwen3-235B)
- ✅ Improved reasoning, coding, and instruction following
- ✅ Same speed as Qwen2.5, but higher quality
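To find the exact Qwen3 model IDs a given WebLLM build ships with (IDs change between releases, so listing the registry beats hardcoding), something like this should work against WebLLM's exported `prebuiltAppConfig`:

```javascript
import { prebuiltAppConfig, CreateMLCEngine } from "@mlc-ai/web-llm";

// List the Qwen3 variants in WebLLM's prebuilt model registry.
const qwen3Ids = prebuiltAppConfig.model_list
  .map((m) => m.model_id)
  .filter((id) => id.startsWith("Qwen3"));
console.log(qwen3Ids);

// Load the first match (in practice, pick a specific size/quantization).
const engine = await CreateMLCEngine(qwen3Ids[0]);
```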
Expected Performance on AMD RX 7800 XT:
- Qwen3-0.6B: ~80-100 tok/s
- Qwen3-1.7B: ~70-80 tok/s
- Qwen3-4B: ~60-70 tok/s
- Qwen3-8B: ~40-50 tok/s
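To sanity-check numbers like these, a rough decode-throughput sketch. It assumes roughly one token per streamed delta, which only approximates real tokenization; recent WebLLM versions also report their own measured prefill/decode speeds (e.g. via `engine.runtimeStatsText()`), which should be preferred when available:

```javascript
// Rough decode-speed estimate: stream a completion and time the deltas.
async function measureDecodeSpeed(engine, prompt) {
  const stream = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  let tokens = 0;
  let start;
  for await (const chunk of stream) {
    if (start === undefined) start = performance.now(); // skip prefill time
    if (chunk.choices[0]?.delta?.content) tokens++;
  }
  if (start === undefined || tokens === 0) return console.log("no output streamed");
  const seconds = (performance.now() - start) / 1000;
  console.log(`~${(tokens / seconds).toFixed(1)} tok/s decode (approximate)`);
}
```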
Note: Larger models (DeepSeek-V3 671B, Qwen3-235B, Llama 4) are too large for browser deployment but excellent for local/server deployment.
Related Research
Alternative Approaches
- Transformers.js - Hugging Face’s browser ML library
- ONNX Runtime Web - Microsoft’s web inference runtime
- TensorFlow.js - Google’s browser ML framework
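Of these, Transformers.js also targets WebGPU directly. A minimal sketch of its v3 API; the model ID here is illustrative, so substitute any ONNX text-generation model that fits in VRAM:

```javascript
import { pipeline } from "@huggingface/transformers";

// device: "webgpu" selects the WebGPU backend (otherwise falls back to WASM).
const generator = await pipeline(
  "text-generation",
  "onnx-community/Llama-3.2-1B-Instruct", // illustrative model ID
  { device: "webgpu" },
);

const out = await generator("Hello,", { max_new_tokens: 16 });
console.log(out);
```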
Performance Considerations
- WebGPU compute shaders enable GPU acceleration
- Model size vs. VRAM availability
- Browser overhead vs. native performance
- Network costs for model download (first run only)
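The size-vs-VRAM point can be checked before downloading anything: as I understand WebLLM's config schema, its model records carry a VRAM estimate in a `vram_required_MB` field, so you can filter the registry up front:

```javascript
import { prebuiltAppConfig } from "@mlc-ai/web-llm";

// Show the smallest prebuilt models by WebLLM's own VRAM estimate.
// vram_required_MB is assumed from WebLLM's ModelRecord metadata.
const byVram = prebuiltAppConfig.model_list
  .filter((m) => m.vram_required_MB)
  .sort((a, b) => a.vram_required_MB - b.vram_required_MB);
for (const m of byVram.slice(0, 5)) {
  console.log(`${m.model_id}: ~${m.vram_required_MB} MB VRAM`);
}
```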
References
- WebGPU Specification: https://www.w3.org/TR/webgpu/
- WebLLM Documentation: https://webllm.mlc.ai/
- Browser Compatibility: https://caniuse.com/webgpu