Purpose

Comprehensive performance benchmarks for running local LLMs on Mac Studio M3 Ultra with 256GB unified memory, including MLX-specific optimizations, tokens-per-second data, and practical deployment guidance.

Hardware Specifications

Mac Studio M3 Ultra Configuration (varies by model):

| Spec | Your Config (Mac15,14) | Max Config |
| --- | --- | --- |
| CPU | 28-core (20P + 8E) | 32-core (24P + 8E) |
| GPU | 60-core (likely) | 80-core |
| Unified Memory | 256GB LPDDR5x | Up to 512GB |
| Memory Bandwidth | 800 GB/s | 800 GB/s |
| Storage | 2TB Apple SSD | Up to 8TB |

Note: The M3 Ultra ships in 28-core CPU / 60-core GPU and 32-core CPU / 80-core GPU configurations. The benchmarks in this doc are generally based on the max config but should be broadly similar on the 28/60.

Key Advantage: Unified memory architecture allows GPU to access all 256GB directly, eliminating VRAM bottleneck common with discrete GPUs.

MLX Performance Benchmarks

What is MLX?

MLX is Apple's machine learning framework designed specifically for Apple Silicon. For local LLM inference on the M3 Ultra it delivers roughly 26-30% higher throughput than Ollama.

Installation:

```bash
pip install mlx-lm
```
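
Once installed, mlx-lm also exposes a Python API. A minimal usage sketch, following the load/generate pattern from the mlx-lm README (the repo id below is the 4-bit Gemma 3 27B community build referenced later in this doc; substitute any MLX-format model you have downloaded):

```python
# Minimal sketch of the mlx-lm Python API: load a model and generate text.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-3-27b-4bit")
response = generate(
    model,
    tokenizer,
    prompt="Summarize the advantages of unified memory for LLM inference.",
    max_tokens=256,
    verbose=True,  # also prints prompt and generation speed (tok/s)
)
print(response)
```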

Model Performance by Size

Small Models (1B-8B)

| Model | Quantization | Prompt Speed | Generation Speed | Use Case |
| --- | --- | --- | --- | --- |
| Gemma 3 1B | Q4 | - | 237 tok/s | Ultra-fast prototyping |
| Gemma 3 4B | Q4 | - | 134 tok/s | Fast general purpose |
| Llama 2 7B | Q4_0 | 115.54 tok/s | - | Development/testing |

Performance Notes:

  • 1B models: Ideal for rapid iteration, near-instant responses
  • 4B models: Best balance of speed and capability for most tasks
  • 7B models: Solid performance with good reasoning

Medium Models (8B-32B)

| Model | Quantization | Prompt Speed | Generation Speed | Context Size |
| --- | --- | --- | --- | --- |
| Gemma 3 27B | Q4 | - | 33-41 tok/s | 4K |
| Qwen3 32B | 4-bit Dense | 234.22 tok/s | 16.81 tok/s | 32K |
| Qwen2.5-Coder-32B | 4-bit | - | ~18-20 tok/s (est.) | - |

Performance Notes:

  • Qwen3 32B: Exceptional prompt processing (234 tok/s) at 32K context
  • Gemma 3 27B: Consistent 33-41 tok/s across workloads
  • 27-32B models: Optimal for M3 Ultra - high quality without sacrificing speed

Large Models (70B)

| Model | Quantization | Prompt Speed | Generation Speed | Notes |
| --- | --- | --- | --- | --- |
| Llama 3 70B | Q4_K_M | - | 12-18 tok/s | GPU accelerated |
| Llama 2 70B | 4-bit | - | 3-5 tok/s (CPU) | CPU-only fallback |
| Llama 3 70B | 4-bit | - | ~15 tok/s (Metal) | M3 Ultra GPU optimized |

Performance Notes:

  • GPU acceleration (Metal) provides 3-5x speedup over CPU
  • 12-18 tok/s yields reasonable response times for interactive use
  • M3 Ultra significantly outperforms M2 Ultra on 70B models
  • A single M3 Ultra outperforms a 4× Mac Mini M4 cluster for 70B inference

Very Large Models (235B+)

| Model | Quantization | Prompt Speed | Generation Speed | First Token Latency |
| --- | --- | --- | --- | --- |
| Qwen3 235B | FP8 | - | 25-35 tok/s | - |
| Qwen3 235B | Q3 | - | 19.43 tok/s | 87.57s |
| DeepSeek-V3 | 4-bit | - | >20 tok/s | - |

Requirements: 512GB Mac Studio recommended for 235B+ models

Performance Notes:

  • 256GB is sufficient for Q3-quantized 235B models; FP8 (~235GB of weights alone) realistically needs the 512GB configuration
  • 20-35 tok/s usable for most applications including code generation
  • First token latency high (~88s) but generation speed acceptable

405B Models - Feasibility Analysis

Memory Requirements (the arithmetic is sketched after this list):

  • FP16: 810GB base + overhead = ~1TB total (NOT POSSIBLE on any Mac Studio)
  • FP8: ~405GB + overhead = Requires 512GB Mac Studio with aggressive optimization
  • 4-bit (Q4): ~203GB + overhead = Feasible on 256GB Mac Studio but performance severely limited
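
A back-of-the-envelope sketch of that arithmetic (the ~20% overhead factor for KV cache, activations, and runtime buffers is an illustrative assumption, not a measured value):

```python
# Rough weight-memory estimate for a dense LLM at a given precision.
def weight_memory_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9  # decimal GB, including assumed overhead

for bits, label in [(16, "FP16"), (8, "FP8"), (4, "Q4")]:
    print(f"405B @ {label}: ~{weight_memory_gb(405, bits):.0f} GB")
# 405B @ FP16: ~972 GB | FP8: ~486 GB | Q4: ~243 GB
```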

Practical Reality:

  • 256GB Mac Studio: Cannot run 405B practically
  • 512GB Mac Studio: Possible with heavy quantization, very slow (estimated 5-10 tok/s)
  • Recommendation: Use cloud GPUs (8× H100) for 405B models, Mac Studio for development/testing

Framework Comparisons

MLX vs Ollama vs llama.cpp

| Framework | Optimization | M3 Ultra Performance | Best For |
| --- | --- | --- | --- |
| MLX | Apple Silicon-specific | +26-30% faster than Ollama | Production inference |
| Ollama | General Mac support | Baseline | Easy setup, broad model support |
| llama.cpp | Cross-platform | Good, slightly slower than MLX | Maximum flexibility |

Recommendation: Use MLX for best performance on M3 Ultra

Steady-State Performance (MLX)

Under optimal conditions, MLX achieves the following (a simple way to measure this on your own machine is sketched after the list):

  • Throughput: ~230 tokens/sec (sustained)
  • Latency: 5-7ms per token
  • Consistency: Highly stable performance
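
A crude way to reproduce these figures is to time a generation call and count the output tokens. The sketch below reuses the mlx_lm API shown earlier; note that it folds prompt processing into the measurement, and that `verbose=True` on `generate` reports the same statistics directly:

```python
# Minimal throughput measurement sketch with mlx-lm.
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-3-27b-4bit")  # any downloaded MLX model

prompt = "Write a short summary of unified memory on Apple Silicon."
start = time.perf_counter()
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

generated_tokens = len(tokenizer.encode(response))
print(f"{generated_tokens} tokens in {elapsed:.1f}s -> {generated_tokens / elapsed:.1f} tok/s")
```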

Practical Performance Guide

Model Selection by Use Case

Interactive Chat / Real-Time Applications:

  • Best: Gemma 3 4B (134 tok/s) or Qwen3 8B (~70-80 tok/s estimated)
  • Good: Gemma 3 27B (41 tok/s)
  • Target: >30 tok/s for responsive UX

Code Generation / Development:

  • Best: Qwen2.5-Coder-32B (~18-20 tok/s) or DeepSeek Coder V2
  • Good: Llama 3 70B (15 tok/s)
  • Target: >15 tok/s acceptable for coding workflows

Complex Reasoning / Long Context:

  • Best: Qwen3 235B (25-35 tok/s, FP8) on 512GB
  • 256GB Option: Qwen3 32B (16.81 tok/s at 32K context)
  • Target: >15 tok/s with long context support

Production Batch Processing:

  • Best: Llama 3 70B (15-18 tok/s) or Qwen3 235B
  • Latency: Not critical, focus on quality
  • Target: >10 tok/s acceptable

Expected Performance by Model Size

| Model Size | Quantization | Expected tok/s | Interactive Use? |
| --- | --- | --- | --- |
| 1B-4B | Q4 | 80-240 tok/s | ✅ Excellent |
| 7B-8B | Q4 | 40-115 tok/s | ✅ Excellent |
| 27B-32B | Q4 | 16-41 tok/s | ✅ Good |
| 70B | Q4 | 12-18 tok/s | ⚠️ Acceptable |
| 235B | FP8/Q3 | 19-35 tok/s | ⚠️ Acceptable (512GB) |
| 405B | Q4 | 5-10 tok/s (est.) | ❌ Too slow |
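
For scripting model selection, the table can be restated as a small lookup; the ranges below are simply copied from above and should be treated as rough guides:

```python
# Expected generation-speed ranges on M3 Ultra, restating the table above.
EXPECTED_TOK_S = {
    "1B-4B":   (80, 240),
    "7B-8B":   (40, 115),
    "27B-32B": (16, 41),
    "70B":     (12, 18),
    "235B":    (19, 35),   # FP8/Q3; realistically a 512GB configuration
    "405B":    (5, 10),    # estimated; not recommended
}

def interactive_ok(size: str, target_tok_s: float = 30.0) -> bool:
    """True if the upper end of the expected range meets an interactivity target."""
    _, high = EXPECTED_TOK_S[size]
    return high >= target_tok_s

print(interactive_ok("27B-32B"))  # True: the 41 tok/s upper bound clears a 30 tok/s target
```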

Hardware Comparison

M3 Ultra vs Other Apple Silicon

| Chip | Llama 7B (Q4) | Llama 70B (Q4) | Notes |
| --- | --- | --- | --- |
| M3 Max | 30-40 tok/s | - | Good for 7B-13B models |
| M3 Ultra | 115 tok/s | 12-18 tok/s | Best for 70B+ models |
| M2 Ultra | - | 8-12 tok/s | Previous gen reference |

Performance Scaling: M3 Ultra ~1.3x faster than M2 Ultra on GPU tasks

M3 Ultra vs High-End PC

Based on real-world benchmarks, the M3 Ultra Mac Studio delivers higher LLM inference speeds than high-end PCs such as an Intel i9-13900K + RTX 5090 for models that exceed the discrete GPU's VRAM but still fit in unified memory.

M3 Ultra Advantages:

  • No VRAM wall (256GB fully accessible)
  • Lower power consumption
  • Silent operation under load
  • Unified memory eliminates PCIe bottlenecks

M3 Ultra Limitations:

  • Cannot match 8× H100 cluster for 405B+ models
  • Training large models still requires cloud GPUs
  • FP16 full-precision slower than dedicated tensor cores

Deployment Recommendations

For Your 256GB M3 Ultra

Optimal Models (will run well):

  1. Qwen3 32B (4-bit): 16.81 tok/s, 32K context - RECOMMENDED
  2. Gemma 3 27B (Q4): 33-41 tok/s - Fast and capable
  3. Llama 3.3 70B (Q4): 12-18 tok/s - Best open 70B model
  4. Qwen2.5-Coder-32B (4-bit): ~18-20 tok/s - Best for coding
  5. DeepSeek 67B (4-bit): Similar to Llama 70B - Coding specialist

Aggressive but Possible:

  • Qwen3 235B (Q3): 19.43 tok/s - Will work but tight memory
  • DeepSeek-V3 (4-bit): >20 tok/s - Requires optimization

Not Recommended:

  • Llama 3.1 405B: Requires 512GB minimum with heavy quantization
  • Any model requiring >230GB after quantization

Quick Start

1. Install MLX:

```bash
pip install mlx-lm
```

2. Download & Run Models:

```bash
# Fast general purpose (27B)
mlx_lm.generate --model mlx-community/gemma-3-27b-4bit
# Best coding (32B)
mlx_lm.generate --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit
# High quality reasoning (70B)
mlx_lm.generate --model mlx-community/Llama-3.3-70B-Instruct-4bit
# Ultimate quality (235B) - requires memory optimization
mlx_lm.generate --model mlx-community/Qwen3-235B-FP8
```

3. Alternative: Ollama (easier setup, slightly slower):

```bash
brew install ollama
brew services start ollama   # start the background server (or run `ollama serve` in a separate terminal)
ollama run llama3.3:70b
```
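
Ollama also serves a local HTTP API on port 11434 while it is running; a minimal sketch of calling it from Python, assuming the model has already been pulled with `ollama run` or `ollama pull`:

```python
# Sketch: query a locally running Ollama server via its /api/generate endpoint.
import json
import urllib.request

payload = {
    "model": "llama3.3:70b",
    "prompt": "Explain unified memory in one paragraph.",
    "stream": False,  # return one JSON object instead of streamed chunks
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```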

Performance Optimization Tips

1. Use MLX for Maximum Performance

  • 26-30% faster than Ollama
  • Apple Silicon-specific optimizations
  • Best tokens/sec across all model sizes

2. GPU Acceleration (Metal)

  • Ensure workload uses 80-core GPU, not CPU
  • 3-5x speedup for 70B models (3-5 tok/s CPU → 15-18 tok/s GPU)
  • Check Activity Monitor → GPU tab during inference (see the device check sketch after this list)
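
Beyond Activity Monitor, you can confirm that MLX itself is targeting the GPU; a quick check, assuming a recent MLX release:

```python
# Verify MLX is using the Metal GPU rather than the CPU.
import mlx.core as mx

print(mx.default_device())       # expected: Device(gpu, 0) on Apple Silicon
mx.set_default_device(mx.gpu)    # force the GPU explicitly if needed
```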

3. Quantization Strategy

  • Q4 (4-bit): Best balance (optimal for most models; a conversion sketch follows this list)
  • FP8 (8-bit): Higher quality, 2x memory usage vs Q4
  • Q3 (3-bit): Aggressive compression for 235B+ on 256GB
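
To produce your own 4-bit MLX weights from a Hugging Face checkpoint, mlx-lm includes a conversion helper; a sketch following the `convert` usage in the mlx-lm README (the repo id is an example placeholder, and the defaults produce 4-bit quantization):

```python
# Sketch: convert a Hugging Face checkpoint to quantized MLX weights.
from mlx_lm import convert

convert(
    "mistralai/Mistral-7B-Instruct-v0.3",  # example HF repo; use any supported model
    mlx_path="mlx_model",                  # output directory for the converted weights
    quantize=True,                         # defaults to 4-bit quantization
)
```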

4. Context Length Trade-offs

  • Longer context = slower generation
  • Qwen3 32B: 234 tok/s prompt, 16.81 tok/s generation at 32K context
  • Reduce context window if speed critical

5. Model Selection by Priority

  • Speed Priority: Gemma 3 27B (41 tok/s) or smaller
  • Quality Priority: Llama 3.3 70B (15 tok/s) or Qwen3 235B
  • Balance: Qwen3 32B (16.81 tok/s, excellent quality)

Validation Tests (Your Mac Studio)

Based on your stress test results, your Mac Studio is in excellent condition:

  • ✅ Memory passed (8 stressors, 30GB, 10 min)
  • ✅ SSD SMART verified (9 hours power-on, 0% wear)
  • ✅ No thermal issues
  • ✅ 256GB unified memory accessible

Recommended First Test:

```bash
# Install MLX
pip install mlx-lm
# Test with fast 7B model
mlx_lm.generate --model mlx-community/Llama-2-7b-chat-mlx --prompt "Explain quantum computing"
# Benchmark: Should see ~115 tok/s
```

Cost-Performance Analysis

Mac Studio M3 Ultra vs Cloud GPUs

Mac Studio (256GB) One-time Cost: $5,499

Cloud GPU Costs (for equivalent 70B performance):

  • AWS p4d.24xlarge (8× A100): ~$32/hour
  • Break-even: ~172 hours of cloud usage
  • Break-even within one year: ~15 hours/month of local inference (arithmetic sketched below)
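
The break-even arithmetic, spelled out (the prices are the example figures above and will vary by region, configuration, and instance pricing):

```python
# Break-even: one-time Mac Studio cost vs. hourly cloud GPU rental.
mac_studio_cost = 5_499   # USD, one-time (256GB M3 Ultra as configured above)
cloud_rate = 32           # USD/hour, approximate 8x A100 on-demand rate

break_even_hours = mac_studio_cost / cloud_rate
print(f"Break-even: ~{break_even_hours:.0f} hours")                    # ~172 hours
print(f"Spread over 12 months: ~{break_even_hours / 12:.0f} h/month")  # ~14, i.e. roughly the ~15 hours/month above
```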

Recommendation:

  • Mac Studio: Justified if >15 hours/month local inference needed
  • Cloud GPUs: Better for 405B models or <15 hours/month usage

Limitations & Considerations

What M3 Ultra Does Well

✅ 70B models at usable speeds (12-18 tok/s)
✅ 32B models at excellent speeds (16-41 tok/s)
✅ Development and testing workflows
✅ Silent, low-power operation
✅ No VRAM constraints for models <230GB

What M3 Ultra Struggles With

❌ 405B models (need 512GB + heavy quantization)
❌ Training large models from scratch (use cloud)
❌ FP16 full-precision large models (limited by memory bandwidth)
❌ Matching 8× H100 cluster performance

Sweet Spot

The M3 Ultra 256GB excels at:

  • Running 70B models for development
  • Experimenting with 32B models at production speeds
  • Interactive coding assistance with specialized models
  • Prototyping agentic workflows before cloud deployment

Future-Proofing

Upgrade Path

  • Current: 256GB sufficient for most use cases
  • Future-Proof: Consider 512GB if planning to run 235B+ models regularly
  • Alternative: Use current 256GB for development, cloud for production 235B+

Model Evolution

  • Open-source models improving rapidly (7.3x better value in 2025)
  • Quantization techniques advancing (better quality at same bit depth)
  • MLX framework continues optimization for Apple Silicon

Sources

  1. Apple Silicon vs NVIDIA CUDA: AI Comparison 2025 - Scalastic
  2. GitHub - TristanBilot/mlx-benchmark
  3. Performance of llama.cpp on Apple Silicon M-series - GitHub
  4. Qwen3-30B-A3B Model with MLX Weights - DeepNewz
  5. Gemma 3 Performance: Tokens Per Second in LM Studio vs. Ollama - Medium
  6. Apple Mac Studio with M3 Ultra Review - Creative Strategies
  7. Mac Studio M3 Ultra Tested - Hostbor
  8. Local LLM Hardware Guide 2025 - Introl
  9. Run Llama 3.3 70B on Mac - Private LLM
  10. Mac Studio M3 Ultra (512 GB RAM) for Local LLM Inference - LLM Perf
  11. Qwen2.5-Coder-32B runs on my Mac - Simon Willison