Purpose

Comprehensive performance benchmarks for running local LLMs on Mac Studio M3 Ultra with 256GB unified memory, including MLX-specific optimizations, tokens-per-second data, and practical deployment guidance.

Hardware Specifications

Mac Studio M3 Ultra Configuration (varies by model):

| Spec | Your Config (Mac15,14) | Max Config |
| --- | --- | --- |
| CPU | 28-core (20P + 8E) | 32-core (24P + 8E) |
| GPU | 60-core (likely) | 80-core |
| Unified Memory | 256GB LPDDR5x | Up to 512GB |
| Memory Bandwidth | 800 GB/s | 800 GB/s |
| Storage | 2TB Apple SSD | Up to 8TB |

Note: The M3 Ultra ships in 28-core CPU / 60-core GPU and 32-core CPU / 80-core GPU configurations. The benchmarks in this doc are generally based on the max config but should be broadly similar on the 28/60.

Key Advantage: Unified memory architecture allows GPU to access all 256GB directly, eliminating VRAM bottleneck common with discrete GPUs.

MLX Performance Benchmarks

What is MLX?

MLX is Apple's machine learning framework designed specifically for Apple Silicon. For local LLM inference on the M3 Ultra it delivers roughly 26-30% higher throughput than Ollama.

Installation:

```bash
pip install mlx-lm
```
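
Once installed, mlx-lm also exposes a Python API. A minimal usage sketch, following the load/generate pattern from the mlx-lm README (the repo id below is the 4-bit Gemma 3 27B community build referenced later in this doc; substitute any MLX-format model you have downloaded):

```python
# Minimal sketch of the mlx-lm Python API: load a model and generate text.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-3-27b-4bit")
response = generate(
    model,
    tokenizer,
    prompt="Summarize the advantages of unified memory for LLM inference.",
    max_tokens=256,
    verbose=True,  # also prints prompt and generation speed (tok/s)
)
print(response)
```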

Model Performance by Size

Small Models (1B-8B)

| Model | Quantization | Prompt Speed | Generation Speed | Use Case |
| --- | --- | --- | --- | --- |
| Gemma 3 1B | Q4 | - | 237 tok/s | Ultra-fast prototyping |
| Gemma 3 4B | Q4 | - | 134 tok/s | Fast general purpose |
| Llama 2 7B | Q4_0 | 115.54 tok/s | - | Development/testing |

Performance Notes:

  • 1B models: Ideal for rapid iteration, near-instant responses
  • 4B models: Best balance of speed and capability for most tasks
  • 7B models: Solid performance with good reasoning

Medium Models (8B-32B)

| Model | Quantization | Prompt Speed | Generation Speed | Context Size |
| --- | --- | --- | --- | --- |
| Gemma 3 27B | Q4 | - | 33-41 tok/s | 4K |
| Qwen3 32B | 4-bit Dense | 234.22 tok/s | 16.81 tok/s | 32K |
| Qwen2.5-Coder-32B | 4-bit | - | ~18-20 tok/s (est.) | - |

Performance Notes:

  • Qwen3 32B: Exceptional prompt processing (234 tok/s) at 32K context
  • Gemma 3 27B: Consistent 33-41 tok/s across workloads
  • 27-32B models: Optimal for M3 Ultra - high quality without sacrificing speed

Large Models (70B)

| Model | Quantization | Prompt Speed | Generation Speed | Notes |
| --- | --- | --- | --- | --- |
| Llama 3 70B | Q4_K_M | - | 12-18 tok/s | GPU accelerated |
| Llama 2 70B | 4-bit | - | 3-5 tok/s (CPU) | CPU-only fallback |
| Llama 3 70B | 4-bit | - | ~15 tok/s (Metal) | M3 Ultra GPU optimized |

Performance Notes:

  • GPU acceleration (Metal) provides 3-5x speedup over CPU
  • 12-18 tok/s yields reasonable response times for interactive use
  • M3 Ultra significantly outperforms M2 Ultra on 70B models
  • A single M3 Ultra outperforms a 4× Mac Mini M4 cluster for 70B inference

Very Large Models (235B+)

| Model | Quantization | Prompt Speed | Generation Speed | First Token Latency |
| --- | --- | --- | --- | --- |
| Qwen3 235B | FP8 | - | 25-35 tok/s | - |
| Qwen3 235B | Q3 | - | 19.43 tok/s | 87.57s |
| DeepSeek-V3 | 4-bit | - | >20 tok/s | - |

Requirements: 512GB Mac Studio recommended for 235B+ models

Performance Notes:

  • 256GB is sufficient for Q3-quantized 235B models; FP8 (~235GB of weights alone) realistically needs the 512GB configuration
  • 20-35 tok/s usable for most applications including code generation
  • First token latency high (~88s) but generation speed acceptable

405B Models - Feasibility Analysis

Memory Requirements (the arithmetic is sketched after this list):

  • FP16: 810GB base + overhead = ~1TB total (NOT POSSIBLE on any Mac Studio)
  • FP8: ~405GB + overhead = Requires 512GB Mac Studio with aggressive optimization
  • 4-bit (Q4): ~203GB + overhead = Feasible on 256GB Mac Studio but performance severely limited
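
A back-of-the-envelope sketch of that arithmetic (the ~20% overhead factor for KV cache, activations, and runtime buffers is an illustrative assumption, not a measured value):

```python
# Rough weight-memory estimate for a dense LLM at a given precision.
def weight_memory_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9  # decimal GB, including assumed overhead

for bits, label in [(16, "FP16"), (8, "FP8"), (4, "Q4")]:
    print(f"405B @ {label}: ~{weight_memory_gb(405, bits):.0f} GB")
# 405B @ FP16: ~972 GB | FP8: ~486 GB | Q4: ~243 GB
```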

Practical Reality:

  • 256GB Mac Studio: Cannot run 405B practically
  • 512GB Mac Studio: Possible with heavy quantization, very slow (estimated 5-10 tok/s)
  • Recommendation: Use cloud GPUs (8× H100) for 405B models, Mac Studio for development/testing

Framework Comparisons

MLX vs Ollama vs llama.cpp

| Framework | Optimization | M3 Ultra Performance | Best For |
| --- | --- | --- | --- |
| MLX | Apple Silicon-specific | +26-30% faster than Ollama | Production inference |
| Ollama | General Mac support | Baseline | Easy setup, broad model support |
| llama.cpp | Cross-platform | Good, slightly slower than MLX | Maximum flexibility |

Recommendation: Use MLX for best performance on M3 Ultra

Steady-State Performance (MLX)

Under optimal conditions, MLX achieves the following (a simple way to measure this on your own machine is sketched after the list):

  • Throughput: ~230 tokens/sec (sustained)
  • Latency: 5-7ms per token
  • Consistency: Highly stable performance
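
A crude way to reproduce these figures is to time a generation call and count the output tokens. The sketch below reuses the mlx_lm API shown earlier; note that it folds prompt processing into the measurement, and that `verbose=True` on `generate` reports the same statistics directly:

```python
# Minimal throughput measurement sketch with mlx-lm.
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-3-27b-4bit")  # any downloaded MLX model

prompt = "Write a short summary of unified memory on Apple Silicon."
start = time.perf_counter()
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated_tokens = len(tokenizer.encode(response))
print(f"{generated_tokens} tokens in {elapsed:.1f}s -> {generated_tokens / elapsed:.1f} tok/s")
```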

Practical Performance Guide

Model Selection by Use Case

Interactive Chat / Real-Time Applications:

  • Best: Gemma 3 4B (134 tok/s) or Qwen3 8B (~70-80 tok/s estimated)
  • Good: Gemma 3 27B (41 tok/s)
  • Target: >30 tok/s for responsive UX

Code Generation / Development:

  • Best: Qwen2.5-Coder-32B (~18-20 tok/s) or DeepSeek Coder V2
  • Good: Llama 3 70B (15 tok/s)
  • Target: >15 tok/s acceptable for coding workflows

Complex Reasoning / Long Context:

  • Best: Qwen3 235B (25-35 tok/s, FP8) on 512GB
  • 256GB Option: Qwen3 32B (16.81 tok/s at 32K context)
  • Target: >15 tok/s with long context support

Production Batch Processing:

  • Best: Llama 3 70B (15-18 tok/s) or Qwen3 235B
  • Latency: Not critical, focus on quality
  • Target: >10 tok/s acceptable

Expected Performance by Model Size

| Model Size | Quantization | Expected tok/s | Interactive Use? |
| --- | --- | --- | --- |
| 1B-4B | Q4 | 80-240 tok/s | ✅ Excellent |
| 7B-8B | Q4 | 40-115 tok/s | ✅ Excellent |
| 27B-32B | Q4 | 16-41 tok/s | ✅ Good |
| 70B | Q4 | 12-18 tok/s | ⚠️ Acceptable |
| 235B | FP8/Q3 | 19-35 tok/s | ⚠️ Acceptable (512GB) |
| 405B | Q4 | 5-10 tok/s (est.) | ❌ Too slow |
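
For scripting model selection, the table can be restated as a small lookup; the ranges below are simply copied from above and should be treated as rough guides:

```python
# Expected generation-speed ranges on M3 Ultra, restating the table above.
EXPECTED_TOK_S = {
    "1B-4B":   (80, 240),
    "7B-8B":   (40, 115),
    "27B-32B": (16, 41),
    "70B":     (12, 18),
    "235B":    (19, 35),   # FP8/Q3; realistically a 512GB configuration
    "405B":    (5, 10),    # estimated; not recommended
}

def interactive_ok(size: str, target_tok_s: float = 30.0) -> bool:
    """True if the upper end of the expected range meets an interactivity target."""
    _, high = EXPECTED_TOK_S[size]
    return high >= target_tok_s

print(interactive_ok("27B-32B"))  # True: the 41 tok/s upper bound clears a 30 tok/s target
```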

Hardware Comparison

M3 Ultra vs Other Apple Silicon

| Chip | Llama 7B (Q4) | Llama 70B (Q4) | Notes |
| --- | --- | --- | --- |
| M3 Max | 30-40 tok/s | - | Good for 7B-13B models |
| M3 Ultra | 115 tok/s | 12-18 tok/s | Best for 70B+ models |
| M2 Ultra | - | 8-12 tok/s | Previous gen reference |

Performance Scaling: M3 Ultra ~1.3x faster than M2 Ultra on GPU tasks

M3 Ultra vs High-End PC

Based on real-world benchmarks, the M3 Ultra Mac Studio delivers higher LLM inference speeds than high-end PCs such as an Intel i9-13900K + RTX 5090 for models that exceed the discrete GPU's VRAM but still fit in unified memory.

M3 Ultra Advantages:

  • No VRAM wall (256GB fully accessible)
  • Lower power consumption
  • Silent operation under load
  • Unified memory eliminates PCIe bottlenecks

M3 Ultra Limitations:

  • Cannot match 8× H100 cluster for 405B+ models
  • Training large models still requires cloud GPUs
  • FP16 full-precision slower than dedicated tensor cores

Deployment Recommendations

For Your 256GB M3 Ultra

Optimal Models (will run well):

  1. Qwen3 32B (4-bit): 16.81 tok/s, 32K context - RECOMMENDED
  2. Gemma 3 27B (Q4): 33-41 tok/s - Fast and capable
  3. Llama 3.3 70B (Q4): 12-18 tok/s - Best open 70B model
  4. Qwen2.5-Coder-32B (4-bit): ~18-20 tok/s - Best for coding
  5. DeepSeek 67B (4-bit): Similar to Llama 70B - Coding specialist

Aggressive but Possible:

  • Qwen3 235B (Q3): 19.43 tok/s - Will work but tight memory
  • DeepSeek-V3 (4-bit): >20 tok/s - Requires optimization

Not Recommended:

  • Llama 3.1 405B: Requires 512GB minimum with heavy quantization
  • Any model requiring >230GB after quantization

Quick Start

1. Install MLX:

```bash
pip install mlx-lm
```

2. Download & Run Models:

```bash
# Fast general purpose (27B)
mlx_lm.generate --model mlx-community/gemma-3-27b-4bit
# Best coding (32B)
mlx_lm.generate --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit
# High quality reasoning (70B)
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit
# Ultimate quality (235B) - requires memory optimization
mlx_lm.generate --model mlx-community/Qwen3-235B-FP8
```

3. Alternative: Ollama (easier setup, slightly slower):

```bash
brew install ollama
brew services start ollama   # start the background server (or run `ollama serve` in a separate terminal)
ollama run llama3.3:70b
```
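
Ollama also serves a local HTTP API on port 11434 while it is running; a minimal sketch of calling it from Python, assuming the model has already been pulled with `ollama run` or `ollama pull`:

```python
# Sketch: query a locally running Ollama server via its /api/generate endpoint.
import json
import urllib.request

payload = {
    "model": "llama3.3:70b",
    "prompt": "Explain unified memory in one paragraph.",
    "stream": False,  # return one JSON object instead of streamed chunks
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```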

Performance Optimization Tips

1. Use MLX for Maximum Performance

  • 26-30% faster than Ollama
  • Apple Silicon-specific optimizations
  • Best tokens/sec across all model sizes

2. GPU Acceleration (Metal)

  • Ensure workload uses 80-core GPU, not CPU
  • 3-5x speedup for 70B models (3-5 tok/s CPU → 15-18 tok/s GPU)
  • Check Activity Monitor → GPU tab during inference (see the device check sketch after this list)
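
Beyond Activity Monitor, you can confirm that MLX itself is targeting the GPU; a quick check, assuming a recent MLX release:

```python
# Verify MLX is using the Metal GPU rather than the CPU.
import mlx.core as mx

print(mx.default_device())       # expected: Device(gpu, 0) on Apple Silicon
mx.set_default_device(mx.gpu)    # force the GPU explicitly if needed
```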

3. Quantization Strategy

  • Q4 (4-bit): Best balance (optimal for most models; a conversion sketch follows this list)
  • FP8 (8-bit): Higher quality, 2x memory usage vs Q4
  • Q3 (3-bit): Aggressive compression for 235B+ on 256GB
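
To produce your own 4-bit MLX weights from a Hugging Face checkpoint, mlx-lm includes a conversion helper; a sketch following the `convert` usage in the mlx-lm README (the repo id is an example placeholder, and the defaults produce 4-bit quantization):

```python
# Sketch: convert a Hugging Face checkpoint to quantized MLX weights.
from mlx_lm import convert

convert(
    "mistralai/Mistral-7B-Instruct-v0.3",  # example HF repo; use any supported model
    mlx_path="mlx_model",                  # output directory for the converted weights
    quantize=True,                         # defaults to 4-bit quantization
)
```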

4. Context Length Trade-offs

  • Longer context = slower generation
  • Qwen3 32B: 234 tok/s prompt, 16.81 tok/s generation at 32K context
  • Reduce context window if speed critical

5. Model Selection by Priority

  • Speed Priority: Gemma 3 27B (41 tok/s) or smaller
  • Quality Priority: Llama 3.3 70B (15 tok/s) or Qwen3 235B
  • Balance: Qwen3 32B (16.81 tok/s, excellent quality)

Validation Tests (Your Mac Studio)

Based on your stress test results, your Mac Studio is in excellent condition:

  • ✅ Memory passed (8 stressors, 30GB, 10 min)
  • ✅ SSD SMART verified (9 hours power-on, 0% wear)
  • ✅ No thermal issues
  • ✅ 256GB unified memory accessible

Recommended First Test:

```bash
# Install MLX
pip install mlx-lm
# Test with fast 7B model
mlx_lm.generate --model mlx-community/Llama-2-7b-chat-mlx --prompt "Explain quantum computing"
# Benchmark: Should see ~115 tok/s
```

Cost-Performance Analysis

Mac Studio M3 Ultra vs Cloud GPUs

Mac Studio (256GB) One-time Cost: $5,499

Cloud GPU Costs (for equivalent 70B performance):

  • AWS p4d.24xlarge (8× A100): ~$32/hour
  • Break-even: ~172 hours of cloud usage
  • Break-even within one year: ~15 hours/month of local inference (arithmetic sketched below)
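
The break-even arithmetic, spelled out (the prices are the example figures above and will vary by region, configuration, and instance pricing):

```python
# Break-even: one-time Mac Studio cost vs. hourly cloud GPU rental.
mac_studio_cost = 5_499   # USD, one-time (256GB M3 Ultra as configured above)
cloud_rate = 32           # USD/hour, approximate 8x A100 on-demand rate

break_even_hours = mac_studio_cost / cloud_rate
print(f"Break-even: ~{break_even_hours:.0f} hours")                    # ~172 hours
print(f"Spread over 12 months: ~{break_even_hours / 12:.0f} h/month")  # ~14, i.e. roughly the ~15 hours/month above
```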

Recommendation:

  • Mac Studio: Justified if >15 hours/month local inference needed
  • Cloud GPUs: Better for 405B models or <15 hours/month usage

Limitations & Considerations

What M3 Ultra Does Well

✅ 70B models at usable speeds (12-18 tok/s)
✅ 32B models at excellent speeds (16-41 tok/s)
✅ Development and testing workflows
✅ Silent, low-power operation
✅ No VRAM constraints for models <230GB

What M3 Ultra Struggles With

❌ 405B models (need 512GB + heavy quantization)
❌ Training large models from scratch (use cloud)
❌ FP16 full-precision large models (limited by memory bandwidth)
❌ Matching 8× H100 cluster performance

Sweet Spot

The M3 Ultra 256GB excels at:

  • Running 70B models for development
  • Experimenting with 32B models at production speeds
  • Interactive coding assistance with specialized models
  • Prototyping agentic workflows before cloud deployment

Future-Proofing

Upgrade Path

  • Current: 256GB sufficient for most use cases
  • Future-Proof: Consider 512GB if planning to run 235B+ models regularly
  • Alternative: Use current 256GB for development, cloud for production 235B+

Model Evolution

  • Open-source models improving rapidly (7.3x better value in 2025)
  • Quantization techniques advancing (better quality at same bit depth)
  • MLX framework continues optimization for Apple Silicon

Sources

  1. Apple Silicon vs NVIDIA CUDA: AI Comparison 2025 - Scalastic
  2. GitHub - TristanBilot/mlx-benchmark
  3. Performance of llama.cpp on Apple Silicon M-series - GitHub
  4. Qwen3-30B-A3B Model with MLX Weights - DeepNewz
  5. Gemma 3 Performance: Tokens Per Second in LM Studio vs. Ollama - Medium
  6. Apple Mac Studio with M3 Ultra Review - Creative Strategies
  7. Mac Studio M3 Ultra Tested - Hostbor
  8. Local LLM Hardware Guide 2025 - Introl
  9. Run Llama 3.3 70B on Mac - Private LLM
  10. Mac Studio M3 Ultra (512 GB RAM) for Local LLM Inference - LLM Perf
  11. Qwen2.5-Coder-32B runs on my Mac - Simon Willison