m3-ultra-performance-benchmarks
Purpose
Comprehensive performance benchmarks for running local LLMs on Mac Studio M3 Ultra with 256GB unified memory, including MLX-specific optimizations, tokens-per-second data, and practical deployment guidance.
Hardware Specifications
Mac Studio M3 Ultra Configuration (varies by model):
| Spec | Your Config (Mac15,14) | Max Config |
|---|---|---|
| CPU | 28-core (20P + 8E) | 32-core (24P + 8E) |
| GPU | 60-core (likely) | 80-core |
| Unified Memory | 256GB LPDDR5x | Up to 512GB |
| Memory Bandwidth | 800 GB/s | 800 GB/s |
| Storage | 2TB Apple SSD | Up to 8TB |
Note: M3 Ultra ships in 28-core CPU/60-core GPU and 32-core CPU/80-core GPU configurations. Benchmarks in this doc are generally from the max configuration but should be close on the 28/60, since LLM inference is largely limited by memory bandwidth (800 GB/s on both).
Key Advantage: Unified memory architecture allows GPU to access all 256GB directly, eliminating VRAM bottleneck common with discrete GPUs.
MLX Performance Benchmarks
What is MLX?
MLX is Apple’s machine learning framework built specifically for Apple Silicon. For local LLM inference on the M3 Ultra, it delivers roughly 26-30% higher throughput than Ollama.
Installation:
```bash
pip install mlx-lm
```
Model Performance by Size
Small Models (1B-8B)
| Model | Quantization | Prompt Speed | Generation Speed | Use Case |
|---|---|---|---|---|
| Gemma 3 1B | Q4 | - | 237 tok/s | Ultra-fast prototyping |
| Gemma 3 4B | Q4 | - | 134 tok/s | Fast general purpose |
| Llama 2 7B | Q4_0 | - | 115.54 tok/s | Development/testing |
Performance Notes:
- 1B models: Ideal for rapid iteration, near-instant responses
- 4B models: Best balance of speed and capability for most tasks
- 7B models: Solid performance with good reasoning
Medium Models (8B-32B)
| Model | Quantization | Prompt Speed | Generation Speed | Context Size |
|---|---|---|---|---|
| Gemma 3 27B | Q4 | - | 33-41 tok/s | 4K |
| Qwen3 32B | 4-bit Dense | 234.22 tok/s | 16.81 tok/s | 32K |
| Qwen2.5-Coder-32B | 4-bit | - | ~18-20 tok/s (est.) | - |
Performance Notes:
- Qwen3 32B: Exceptional prompt processing (234 tok/s) at 32K context
- Gemma 3 27B: Consistent 33-41 tok/s across workloads
- 27-32B models: Optimal for M3 Ultra - high quality without sacrificing speed
Large Models (70B)
| Model | Quantization | Prompt Speed | Generation Speed | Notes |
|---|---|---|---|---|
| Llama 3 70B | Q4_K_M | - | 12-18 tok/s | GPU accelerated |
| Llama 2 70B | 4-bit | - | 3-5 tok/s (CPU) | CPU-only fallback |
| Llama 3 70B | 4-bit | - | ~15 tok/s (Metal) | M3 Ultra GPU optimized |
Performance Notes:
- GPU acceleration (Metal) provides 3-5x speedup over CPU
- 12-18 tok/s gives reasonably responsive output for interactive use
- M3 Ultra significantly outperforms M2 Ultra on 70B models
- Single M3 Ultra > 4x Mac Mini M4 cluster for 70B inference
Very Large Models (235B+)
| Model | Quantization | Prompt Speed | Generation Speed | First Token Latency |
|---|---|---|---|---|
| Qwen3 235B | FP8 | - | 25-35 tok/s | - |
| Qwen3 235B | Q3 | - | 19.43 tok/s | 87.57s |
| DeepSeek-V3 | 4-bit | - | >20 tok/s | - |
Requirements: 512GB Mac Studio recommended for 235B+ models
Performance Notes:
- 256GB sufficient for FP8/Q3 quantized 235B models
- 20-35 tok/s usable for most applications including code generation
- First token latency high (~88s) but generation speed acceptable
405B Models - Feasibility Analysis
Memory Requirements:
- FP16: 810GB base + overhead = ~1TB total (NOT POSSIBLE on any Mac Studio)
- FP8: ~405GB + overhead = Requires 512GB Mac Studio with aggressive optimization
- 4-bit (Q4): ~203GB + overhead = Feasible on 256GB Mac Studio but performance severely limited
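A quick way to sanity-check these figures: weight memory is roughly parameter count × bits per weight ÷ 8, plus overhead for KV cache and runtime buffers. A minimal sketch (the ~20% overhead factor is an assumption for illustration, not a measured value):
```python
def est_weight_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough memory estimate: weights only, padded by an assumed overhead factor
    for KV cache, activations, and runtime buffers."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9  # decimal GB

for label, bits in [("FP16", 16), ("FP8", 8), ("Q4", 4)]:
    gb = est_weight_gb(405, bits)
    verdict = "fits" if gb <= 256 else "does not fit"
    print(f"405B @ {label}: ~{gb:.0f} GB -> {verdict} in 256GB unified memory")
```
With the assumed overhead, the Q4 case lands around 240GB, which is why it is technically feasible on 256GB but leaves almost no headroom for the KV cache and the OS.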
Practical Reality:
- 256GB Mac Studio: Cannot run 405B practically
- 512GB Mac Studio: Possible with heavy quantization, very slow (estimated 5-10 tok/s)
- Recommendation: Use cloud GPUs (8× H100) for 405B models, Mac Studio for development/testing
Framework Comparisons
MLX vs Ollama vs llama.cpp
| Framework | Optimization | M3 Ultra Performance | Best For |
|---|---|---|---|
| MLX | Apple Silicon-specific | +26-30% faster than Ollama | Production inference |
| Ollama | General Mac support | Baseline | Easy setup, broad model support |
| llama.cpp | Cross-platform | Good, slightly slower than MLX | Maximum flexibility |
Recommendation: Use MLX for best performance on M3 Ultra
Steady-State Performance (MLX)
Under optimal conditions, MLX achieves:
- Throughput: ~230 tokens/sec (sustained)
- Latency: 5-7ms per token
- Consistency: Highly stable performance
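To check these numbers on your own machine, you can time a generation through the mlx-lm Python API. A minimal sketch, assuming a recent mlx-lm release (the load/generate signatures have shifted across versions, and the token count from tokenizer.encode is approximate); it uses the same 7B chat model referenced in the test command later in this doc:
```python
import time
from mlx_lm import load, generate  # pip install mlx-lm

# Any MLX-converted model works; a small one keeps the test quick.
model, tokenizer = load("mlx-community/Llama-2-7b-chat-mlx")

prompt = "Explain the trade-offs of 4-bit quantization in two paragraphs."
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

gen_tokens = len(tokenizer.encode(text))  # rough count of generated tokens
print(f"{gen_tokens} tokens in {elapsed:.1f}s -> {gen_tokens / elapsed:.1f} tok/s")
```
Recent mlx-lm versions also print prompt and generation tokens-per-second directly when verbose output is enabled (or via the mlx_lm.generate CLI), which is usually the easier path for benchmarking.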
Practical Performance Guide
Model Selection by Use Case
Interactive Chat / Real-Time Applications:
- Best: Gemma 3 4B (134 tok/s) or Qwen3 8B (~70-80 tok/s estimated)
- Good: Gemma 3 27B (41 tok/s)
- Target: >30 tok/s for responsive UX
Code Generation / Development:
- Best: Qwen2.5-Coder-32B (~18-20 tok/s) or DeepSeek Coder V2
- Good: Llama 3 70B (15 tok/s)
- Target: >15 tok/s acceptable for coding workflows
Complex Reasoning / Long Context:
- Best: Qwen3 235B (25-35 tok/s, FP8) on 512GB
- 256GB Option: Qwen3 32B (16.81 tok/s at 32K context)
- Target: >15 tok/s with long context support
Production Batch Processing:
- Best: Llama 3 70B (15-18 tok/s) or Qwen3 235B
- Latency: Not critical, focus on quality
- Target: >10 tok/s acceptable
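If you want these targets in script form, for example to flag whether a measured benchmark clears the bar for a given workload, a small lookup sketch; the model names and thresholds simply restate the lists above:
```python
from dataclasses import dataclass

@dataclass
class UseCase:
    recommended: list[str]
    min_tok_s: float  # target generation speed for acceptable UX

USE_CASES = {
    "interactive_chat":  UseCase(["Gemma 3 4B", "Qwen3 8B"], 30),
    "code_generation":   UseCase(["Qwen2.5-Coder-32B", "DeepSeek Coder V2"], 15),
    "complex_reasoning": UseCase(["Qwen3 235B (512GB)", "Qwen3 32B (256GB)"], 15),
    "batch_processing":  UseCase(["Llama 3 70B", "Qwen3 235B"], 10),
}

def meets_target(use_case: str, measured_tok_s: float) -> bool:
    return measured_tok_s >= USE_CASES[use_case].min_tok_s

print(meets_target("interactive_chat", 41))    # Gemma 3 27B: True
print(meets_target("interactive_chat", 16.8))  # Qwen3 32B: False
```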
Expected Performance by Model Size
| Model Size | Quantization | Expected tok/s | Interactive Use? |
|---|---|---|---|
| 1B-4B | Q4 | 80-240 tok/s | ✅ Excellent |
| 7B-8B | Q4 | 40-115 tok/s | ✅ Excellent |
| 27B-32B | Q4 | 16-41 tok/s | ✅ Good |
| 70B | Q4 | 12-18 tok/s | ⚠️ Acceptable |
| 235B | FP8/Q3 | 19-35 tok/s | ⚠️ Acceptable (512GB) |
| 405B | Q4 | 5-10 tok/s (est.) | ❌ Too slow |
Hardware Comparison
M3 Ultra vs Other Apple Silicon
| Chip | Llama 7B (Q4) generation | Llama 70B (Q4) generation | Notes |
|---|---|---|---|
| M3 Max | 30-40 tok/s | - | Good for 7B-13B models |
| M3 Ultra | 115 tok/s | 12-18 tok/s | Best for 70B+ models |
| M2 Ultra | - | 8-12 tok/s | Previous gen reference |
Performance Scaling: M3 Ultra ~1.3x faster than M2 Ultra on GPU tasks
M3 Ultra vs High-End PC
Based on real-world benchmarks, the M3 Ultra Mac Studio delivers faster LLM inference than high-end PCs (e.g., Intel i9-13900K + RTX 5090) once a model is too large for the discrete GPU's VRAM but still fits in the M3 Ultra's 256GB unified memory.
M3 Ultra Advantages:
- No VRAM wall (256GB fully accessible)
- Lower power consumption
- Silent operation under load
- Unified memory eliminates PCIe bottlenecks
M3 Ultra Limitations:
- Cannot match 8× H100 cluster for 405B+ models
- Training large models still requires cloud GPUs
- FP16 full-precision slower than dedicated tensor cores
Deployment Recommendations
For Your 256GB M3 Ultra
Optimal Models (will run well):
- Qwen3 32B (4-bit): 16.81 tok/s, 32K context - RECOMMENDED
- Gemma 3 27B (Q4): 33-41 tok/s - Fast and capable
- Llama 3.3 70B (Q4): 12-18 tok/s - Best open 70B model
- Qwen2.5-Coder-32B (4-bit): ~18-20 tok/s - Best for coding
- DeepSeek 67B (4-bit): Similar to Llama 70B - Coding specialist
Aggressive but Possible:
- Qwen3 235B (Q3): 19.43 tok/s - Will work but tight memory
- DeepSeek-V3 (4-bit): >20 tok/s - Requires optimization
Not Recommended:
- Llama 3.1 405B: Requires 512GB minimum with heavy quantization
- Any model requiring >230GB after quantization
Recommended Setup
1. Install MLX:
```bash
pip install mlx-lm
```
2. Download & Run Models:
```bash
# Fast general purpose (27B)
mlx_lm.generate --model mlx-community/gemma-3-27b-4bit

# Best coding (32B)
mlx_lm.generate --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit

# High quality reasoning (70B)
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit

# Ultimate quality (235B) - requires memory optimization
mlx_lm.generate --model mlx-community/Qwen3-235B-FP8
```
3. Alternative: Ollama (easier setup, slightly slower):
```bash
brew install ollama
ollama run llama3.3:70b
```
Performance Optimization Tips
1. Use MLX for Maximum Performance
- 26-30% faster than Ollama
- Apple Silicon-specific optimizations
- Best tokens/sec across all model sizes
2. GPU Acceleration (Metal)
- Ensure the workload runs on the GPU (60- or 80-core), not the CPU
- 3-5x speedup for 70B models (3-5 tok/s CPU → 15-18 tok/s GPU)
- Check Activity Monitor → GPU tab during inference
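For MLX specifically, you can also confirm from Python that work is dispatched to the GPU; a minimal sketch using mlx.core as of recent releases (output formatting may vary by version):
```python
import mlx.core as mx

# On Apple Silicon the default device should already be the GPU.
print(mx.default_device())  # expected: Device(gpu, 0)

# A large matmul generates visible load in Activity Monitor's GPU history.
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))
mx.eval(a @ b)
```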
3. Quantization Strategy
- Q4 (4-bit): Best balance (optimal for most models)
- FP8 (8-bit): Higher quality, 2x memory usage vs Q4
- Q3 (3-bit): Aggressive compression for 235B+ on 256GB
4. Context Length Trade-offs
- Longer context = slower generation
- Qwen3 32B: 234 tok/s prompt, 16.81 tok/s generation at 32K context
- Reduce context window if speed critical
5. Model Selection by Priority
- Speed Priority: Gemma 3 27B (41 tok/s) or smaller
- Quality Priority: Llama 3.3 70B (15 tok/s) or Qwen3 235B
- Balance: Qwen3 32B (16.81 tok/s, excellent quality)
Validation Tests (Your Mac Studio)
Based on your stress test results, your Mac Studio is in excellent condition:
- ✅ Memory passed (8 stressors, 30GB, 10 min)
- ✅ SSD SMART verified (9 hours power-on, 0% wear)
- ✅ No thermal issues
- ✅ 256GB unified memory accessible
Recommended First Test:
```bash
# Install MLX
pip install mlx-lm

# Test with fast 7B model
mlx_lm.generate --model mlx-community/Llama-2-7b-chat-mlx --prompt "Explain quantum computing"

# Benchmark: Should see ~115 tok/s
```
Cost-Performance Analysis
Mac Studio M3 Ultra vs Cloud GPUs
Mac Studio (256GB) One-time Cost: $5,499
Cloud GPU Costs (for equivalent 70B performance):
- AWS p4d.24xlarge (8× A100): ~$32/hour
- Break-even: ~172 hours of usage
- Spread over one year, break-even works out to ~15 hours of usage per month
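The break-even arithmetic is simple enough to check directly; a minimal sketch using the figures above (cloud pricing varies by region and changes over time, so the $32/hour rate is an assumption carried over from this list):
```python
mac_studio_cost = 5_499  # USD, 256GB M3 Ultra (one-time)
cloud_rate = 32          # USD/hour, ~8x A100 on-demand (assumed)

break_even_hours = mac_studio_cost / cloud_rate
print(f"Break-even: {break_even_hours:.0f} hours of cloud time")    # ~172 h
print(f"Within one year: {break_even_hours / 12:.1f} hours/month")  # ~14.3 h/month
```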
Recommendation:
- Mac Studio: Justified if >15 hours/month local inference needed
- Cloud GPUs: Better for 405B models or <15 hours/month usage
Limitations & Considerations
What M3 Ultra Does Well
- ✅ 70B models at usable speeds (12-18 tok/s)
- ✅ 32B models at excellent speeds (16-41 tok/s)
- ✅ Development and testing workflows
- ✅ Silent, low-power operation
- ✅ No VRAM constraints for models <230GB
What M3 Ultra Struggles With
- ❌ 405B models (need 512GB + heavy quantization)
- ❌ Training large models from scratch (use cloud)
- ❌ FP16 full-precision large models (limited by memory bandwidth)
- ❌ Matching 8× H100 cluster performance
Sweet Spot
The M3 Ultra 256GB excels at:
- Running 70B models for development
- Experimenting with 32B models at production speeds
- Interactive coding assistance with specialized models
- Prototyping agentic workflows before cloud deployment
Future-Proofing
Upgrade Path
- Current: 256GB sufficient for most use cases
- Future-Proof: Consider 512GB if planning to run 235B+ models regularly
- Alternative: Use current 256GB for development, cloud for production 235B+
Model Evolution
- Open-source models improving rapidly (7.3x better value in 2025)
- Quantization techniques advancing (better quality at same bit depth)
- MLX framework continues optimization for Apple Silicon
Sources
- Apple Silicon vs NVIDIA CUDA: AI Comparison 2025 - Scalastic
- GitHub - TristanBilot/mlx-benchmark
- Performance of llama.cpp on Apple Silicon M-series - GitHub
- Qwen3-30B-A3B Model with MLX Weights - DeepNewz
- Gemma 3 Performance: Tokens Per Second in LM Studio vs. Ollama - Medium
- Apple Mac Studio with M3 Ultra Review - Creative Strategies
- Mac Studio M3 Ultra Tested - Hostbor
- Local LLM Hardware Guide 2025 - Introl
- Run Llama 3.3 70B on Mac - Private LLM
- Mac Studio M3 Ultra (512 GB RAM) for Local LLM Inference - LLM Perf
- Qwen2.5-Coder-32B runs on my Mac - Simon Willison