quantization-bit-width-performance
Purpose
Explain why sub-4bit quantization (2-bit, 3-bit) is not optimal for MLX and Apple Silicon despite offering better memory compression, due to emulation overhead and lack of native hardware support.
Key Findings
The 4-Bit Sweet Spot
4-bit quantization is the practical lower bound for optimal MLX performance. While 2-bit and 3-bit quantization offer better memory compression, they suffer from significant performance penalties on Apple Silicon.
Performance Degradation Below 4-Bit
Critical Finding: Apple Silicon handles IQ-series quantization (2-bit, 3-bit) poorly. These schemes use a “codebook” to encode the quantized values of groups of 4 or 8 model weights, which requires many lookup-table loads during matrix multiplication, an access pattern Apple Silicon handles far less efficiently than NVIDIA GPUs.
Benchmark Evidence:
- IQ2_XS on RTX 4080: 175 tokens/sec
- IQ2_XS on M2 Max: 50 tokens/sec
- Performance ratio: 3.5x slower on Apple Silicon (vs 2x for standard 4-bit)
The difference is less pronounced on precisions natively supported by Apple Silicon (4-bit and above).
Why Sub-4Bit Underperforms on Apple Silicon
1. Lookup Table Overhead
2-bit and 3-bit (IQ quants):
- Use “codebook” encoding for groups of 4-8 weights
- Require frequent lookup table reads during matrix multiplication
- Apple’s GPU architecture handles these indirect memory accesses inefficiently
4-bit quantization:
- Direct representation without lookup tables
- Native Metal Performance Shaders support
- Efficient GPU kernel implementation
2. Native Hardware Support
Metal Performance Shaders (MPS):
- Native 4-bit integer format: Announced 2024, provides hardware-accelerated 4-bit quantization
- 16 quantization points: Efficient linear distribution for 4-bit
- Affine quantization: Direct 4-bit or 8-bit support without emulation
Below 4-bit:
- No native MPS support
- Must emulate through lookup tables
- Lacks optimized Metal GPU kernels
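Part of the emulation cost is visible before the codebook is even consulted: sub-4-bit indices do not align to byte boundaries, so a kernel must unpack them with bit-twiddling first. A minimal pure-Python sketch of this unpacking step (the packing layout here is illustrative, not llama.cpp's actual IQ format):

```python
# Pack 3-bit indices (values 0-7) into a byte stream, then unpack them.
# Real IQ quants use more elaborate layouts; this shows the extra
# bit manipulation a GPU kernel must do before any codebook lookup.

def pack_3bit(indices):
    """Concatenate 3-bit values into bytes (little-endian bit order)."""
    bits, nbits, out = 0, 0, []
    for idx in indices:
        assert 0 <= idx < 8
        bits |= idx << nbits
        nbits += 3
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)
    return bytes(out)

def unpack_3bit(data, count):
    """Recover `count` 3-bit indices from the packed byte stream."""
    bits, nbits, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < 3:
            bits |= next(it) << nbits
            nbits += 8
        out.append(bits & 0b111)
        bits >>= 3
        nbits -= 3
    return out

indices = [5, 1, 7, 0, 3, 6, 2, 4]
assert unpack_3bit(pack_3bit(indices), len(indices)) == indices
```

By contrast, two 4-bit values fit exactly in one byte and can be extracted with a single mask and shift, which is one reason 4-bit maps so cleanly onto GPU kernels.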
3. Memory Access Patterns
Apple Silicon Architecture:
- Optimized for unified memory direct access
- Excels at sequential memory patterns
- Struggles with indirect/random lookup patterns
Codebook lookups (2-bit/3-bit):
- Create random memory access patterns
- Defeat unified memory advantages
- Bottleneck GPU performance
Quantization Performance by Bit-Width
Native Support (Optimal Performance)
| Bit-Width | Metal Support | Performance | Use Case |
|---|---|---|---|
| 8-bit | ✅ Native | Excellent | High quality, 2x memory savings |
| 4-bit | ✅ Native | Excellent | Optimal balance, 4x memory savings |
Emulated Support (Degraded Performance)
| Bit-Width | Metal Support | Performance | Penalty vs 4-bit | Use Case |
|---|---|---|---|---|
| 3-bit | ❌ Emulated (codebook) | Poor | ~1.75x slower | Not recommended |
| 2-bit | ❌ Emulated (codebook) | Poor | ~1.75x slower | Not recommended |
Performance Penalty Calculation:
- Standard 4-bit vs NVIDIA: ~2x slower on Apple Silicon
- IQ2/IQ3 vs NVIDIA: ~3.5x slower on Apple Silicon
- Extra penalty for sub-4bit: ~1.75x (3.5x ÷ 2x)
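The penalty figures follow directly from the benchmark numbers above; a quick arithmetic check:

```python
# Derive the quoted penalty ratios from the IQ2_XS benchmark figures.
rtx_4080_iq2 = 175   # tokens/sec, IQ2_XS on RTX 4080
m2_max_iq2 = 50      # tokens/sec, IQ2_XS on M2 Max

iq2_ratio = rtx_4080_iq2 / m2_max_iq2      # gap vs NVIDIA for IQ2
baseline_ratio = 2.0                        # typical 4-bit gap vs NVIDIA
extra_penalty = iq2_ratio / baseline_ratio  # penalty attributable to sub-4bit

print(iq2_ratio, extra_penalty)  # 3.5 1.75
```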
Research Findings
Debunking Common Myths
Myth: “Compressing models to lower bit precision is a de facto promise of faster inference across all hardware platforms.”
Reality: Research shows this is false. Lower bit precision can actually slow down inference on Apple Silicon when it requires emulation.
MLX Supported Quantization
MLX officially supports the following bit-widths:
- 2, 3, 4, 5, 6, and 8 bits per quantized weight
However, practical performance varies dramatically:
- 4-bit and 8-bit: Excellent native performance
- 2-bit and 3-bit: Significant performance degradation despite technical support
Group-Wise Quantization
MLX implements efficient quantization where:
- Consecutive weights share scale and optionally bias parameters
- Reduces memory footprint while preserving model quality
- Most efficient at 4-bit due to native hardware support
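The idea can be shown with a minimal pure-Python sketch of group-wise affine 4-bit quantization. The group size and rounding details here are illustrative (MLX typically uses groups of 32 or 64 weights):

```python
# Group-wise affine 4-bit quantization: each group of consecutive weights
# shares one scale and one bias, so a weight is stored as a 4-bit code
# plus its group's two shared parameters.

GROUP_SIZE = 4          # illustrative; MLX typically uses 32 or 64
BITS = 4
LEVELS = 2 ** BITS - 1  # 15 steps -> 16 quantization points

def quantize_group(group):
    lo, hi = min(group), max(group)
    scale = (hi - lo) / LEVELS or 1.0       # avoid zero scale
    codes = [round((w - lo) / scale) for w in group]  # each code in 0..15
    return codes, scale, lo

def dequantize_group(codes, scale, bias):
    return [c * scale + bias for c in codes]  # one multiply-add per weight

weights = [0.12, -0.40, 0.33, 0.05, 1.10, 0.90, 0.75, 1.30]
recon = []
for i in range(0, len(weights), GROUP_SIZE):
    codes, scale, bias = quantize_group(weights[i:i + GROUP_SIZE])
    recon.extend(dequantize_group(codes, scale, bias))

# Reconstruction error is bounded by half a quantization step per group.
errors = [abs(a - b) for a, b in zip(weights, recon)]
print(max(errors))
```

Note that dequantization here is a single multiply-add per weight over sequentially stored codes, exactly the access pattern Apple Silicon executes well.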
Practical Recommendations
For Apple Silicon (M1/M2/M3/M4/M5)
Primary Recommendation: Use 4-bit quantization
- Best balance of compression and performance
- Native Metal GPU kernel support
- No emulation overhead
Alternative: Use 8-bit quantization
- Near full-precision quality
- 2x memory savings vs FP16
- Excellent performance
Avoid: 2-bit and 3-bit quantization
- 1.75x performance penalty vs 4-bit
- No practical benefit despite smaller memory footprint
- Speed loss negates memory savings
Memory vs Performance Trade-off
| Quantization | Memory Savings | Performance | Verdict |
|---|---|---|---|
| FP16 | Baseline (1x) | Baseline | Maximum quality |
| 8-bit | 2x | Excellent (0.95x) | High quality option |
| 4-bit | 4x | Excellent (1.0x) | ⭐ Optimal choice |
| 3-bit | 5.3x | Poor (0.57x) | ❌ Not worth it |
| 2-bit | 8x | Poor (0.57x) | ❌ Not worth it |
Performance normalized to 4-bit as 1.0x baseline
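The savings column follows directly from bits per weight. For a concrete sense of scale, a sketch computing raw weight storage for a 70B-parameter model (weights only; real checkpoints add per-group scales and biases, and runtime memory adds KV cache and activations):

```python
# Raw weight storage for a 70B-parameter model at each bit-width.
PARAMS = 70e9

for bits in (16, 8, 4, 3, 2):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit: {gb:6.1f} GB  ({16 / bits:.1f}x savings vs FP16)")
```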
MLX Framework Optimizations
M5 Neural Accelerators (2026)
The M5 chip introduces GPU Neural Accelerators that provide:
- 4x speedup vs M4 for time-to-first-token
- Dedicated matrix multiplication operations
- Native support for TensorOps through Metal 4
Critical: These accelerators are optimized for 4-bit and 8-bit operations, not sub-4bit.
Metal 4 Features
Metal 4 introduces:
- Tensor Operations (TensorOps): Native 4-bit quantization support
- Metal Performance Primitives: Efficient 4-bit kernels
- Quadgroup operations: Optimized for 4-bit matrix multiplications
MLX vs Other Frameworks
On Apple Silicon (4-bit quantization):
- MLX: 26-30% faster than Ollama
- vllm-mlx: 21-87% higher throughput than llama.cpp
- Optimized models: 400+ tokens/sec achievable
This performance advantage disappears with sub-4bit quantization because of the emulation overhead.
Real-World Implications
When Sub-4Bit Makes Sense
Only in extreme memory constraints:
- Cannot fit 4-bit model in available RAM
- No alternative except smaller model
- Quality degradation acceptable
Example: Running 235B model on 256GB Mac Studio
- 4-bit: ~203GB (tight fit)
- 3-bit: ~135GB (comfortable fit)
- Trade-off: Accept 1.75x performance penalty
When 4-Bit is Superior
Almost always preferred:
- Fits in available memory
- Interactive applications (>15 tok/s target)
- Production deployments
- Real-time inference
Example: Running 70B model on M3 Ultra
- 4-bit: 12-18 tok/s ✅ Usable
- 3-bit: 7-10 tok/s ❌ Too slow
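These figures are consistent with the ~1.75x penalty derived earlier; applying it to the 4-bit numbers predicts the 3-bit range:

```python
# Apply the ~1.75x sub-4bit penalty to the 4-bit throughput figures.
penalty = 1.75
low, high = 12, 18                 # 4-bit tok/s on M3 Ultra
pred_low, pred_high = low / penalty, high / penalty
print(round(pred_low, 1), round(pred_high, 1))  # 6.9 10.3
```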
Technical Details
Affine Quantization (4-bit)
How it works:
- 16 quantization points (2^4)
- Linearly distributed along number line
- Direct integer representation
- No lookup required
Performance:
- Single memory read per weight
- Sequential access pattern
- Metal GPU kernel optimized
Codebook Quantization (2-bit/3-bit)
How it works:
- Assign each weight 2-bit or 3-bit index
- Index points to codebook entry
- Two memory reads per weight
Performance bottleneck:
- First read: Get weight index
- Second read: Lookup codebook value
- Random access pattern
- Not Metal-optimized
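The two access patterns can be sketched side by side. The codebook path touches memory twice per weight, and the second touch is data-dependent (the index determines the address), which is what defeats prefetching. A pure-Python sketch that counts the memory touches (instrumentation only, not a real kernel):

```python
# Contrast dequantization access patterns: affine (one sequential read
# per weight) vs codebook (an index read plus a data-dependent table read).

def dequant_affine(codes, scale, bias, reads):
    out = []
    for c in codes:                      # sequential walk over code array
        reads.append(("code", len(out)))
        out.append(c * scale + bias)
    return out

def dequant_codebook(indices, codebook, reads):
    out = []
    for i, idx in enumerate(indices):
        reads.append(("index", i))       # first read: the stored index
        reads.append(("table", idx))     # second read: data-dependent lookup
        out.append(codebook[idx])
    return out

codes = [3, 7, 0, 12]
affine_reads = []
dequant_affine(codes, scale=0.1, bias=-0.5, reads=affine_reads)

indices = [2, 5, 5, 1]
codebook = [-1.0, -0.4, -0.1, 0.1, 0.4, 1.0]
lut_reads = []
dequant_codebook(indices, codebook, lut_reads)

print(len(affine_reads), len(lut_reads))  # 4 vs 8 memory touches
```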
Comparison with NVIDIA GPUs
NVIDIA Architecture Advantages
For sub-4bit quantization:
- Optimized for indirect memory access
- Dedicated lookup table hardware
- Better random access performance
Performance ratio:
- 4-bit quantization: NVIDIA ~2x faster
- 2-bit/3-bit (IQ): NVIDIA ~3.5x faster
Conclusion: NVIDIA’s advantage increases with sub-4bit quantization, while Apple’s advantage is with 4-bit native support.
Apple Silicon Advantages
For 4-bit+ quantization:
- Unified memory architecture
- Native Metal GPU kernels
- Efficient sequential access
- Lower power consumption
Sweet spot: 4-bit to 8-bit quantization on Apple Silicon offers competitive performance with much better power efficiency.
Future Considerations
Potential Improvements
- Hardware: Future Apple Silicon may add native 2-bit/3-bit support
- Software: MLX may develop optimized codebook kernels for Metal
- Algorithms: New quantization schemes could avoid lookup tables entirely
Current State (2026)
As of early 2026:
- M5 chip continues focus on 4-bit/8-bit optimization
- No announced sub-4bit native support
- MLX framework prioritizes 4-bit performance
Recommendation: Plan deployments around 4-bit as minimum, avoid sub-4bit unless absolutely necessary.
Summary
The Answer to “What bit-width is MLX not optimal for?”
2-bit and 3-bit quantization (sub-4bit) are not optimal for MLX and Apple Silicon because:
- Lookup table emulation: Requires codebook reads that Apple’s GPU handles poorly
- No native support: Metal Performance Shaders only natively support 4-bit and 8-bit
- Performance penalty: ~1.75x slower than 4-bit despite better memory compression
- Architecture mismatch: Apple Silicon optimized for direct access, not indirect lookups
4-bit is the practical lower bound for optimal MLX performance on Apple Silicon.
Sources
- Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU | Apple Machine Learning Research
- Very slow IQ quant performance on Apple Silicon | llama.cpp Discussion
- Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective | ResearchGate
- Native LLM and MLLM Inference at Scale on Apple Silicon | arXiv
- Accelerate machine learning with Metal | WWDC24
- Metal Performance Shaders | Apple Developer Documentation
- Quantization | MLX DeepWiki
- Local AI with MLX on the Mac | Markus Schall
- Benchmarking On-Device Machine Learning on Apple Silicon with MLX | arXiv
- vllm-mlx GitHub Repository