quantization-bit-width-performance
Purpose
Explain why sub-4bit quantization (2-bit, 3-bit) is not optimal for MLX and Apple Silicon despite offering better memory compression, due to emulation overhead and lack of native hardware support.
Key Findings
The 4-Bit Sweet Spot
4-bit quantization is the practical lower bound for optimal MLX performance. While 2-bit and 3-bit quantization offer better memory compression, they suffer from significant performance penalties on Apple Silicon.
Performance Degradation Below 4-Bit
Critical Finding: Apple Silicon handles IQ-series quantization (2-bit, 3-bit) poorly. These schemes use a “codebook” to encode the quantized values of groups of 4 or 8 model weights, which requires many lookup-table loads during matrix multiplication, an access pattern Apple Silicon handles far less efficiently than NVIDIA GPUs.
Benchmark Evidence:
- IQ2_XS on RTX 4080: 175 tokens/sec
- IQ2_XS on M2 Max: 50 tokens/sec
- Performance ratio: 3.5x slower on Apple Silicon (vs 2x for standard 4-bit)
The difference is less pronounced on precisions natively supported by Apple Silicon (4-bit and above).
Why Sub-4Bit Underperforms on Apple Silicon
1. Lookup Table Overhead
2-bit and 3-bit (IQ quants):
- Use “codebook” encoding for groups of 4-8 weights
- Require frequent lookup table reads during matrix multiplication
- Apple’s GPU architecture handles these indirect memory accesses inefficiently
4-bit quantization:
- Direct representation without lookup tables
- Native Metal Performance Shaders support
- Efficient GPU kernel implementation
2. Native Hardware Support
Metal Performance Shaders (MPS):
- Native 4-bit integer format: Announced 2024, provides hardware-accelerated 4-bit quantization
- 16 quantization points: Efficient linear distribution for 4-bit
- Affine quantization: Direct 4-bit or 8-bit support without emulation
Below 4-bit:
- No native MPS support
- Must emulate through lookup tables
- Lacks optimized Metal GPU kernels
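Part of the emulation cost is visible before the codebook is even consulted: sub-4-bit indices do not align to byte boundaries, so a kernel must unpack them with bit-twiddling first. A minimal pure-Python sketch of this unpacking step (the packing layout here is illustrative, not llama.cpp's actual IQ format):

```python
# Pack 3-bit indices (values 0-7) into a byte stream, then unpack them.
# Real IQ quants use more elaborate layouts; this shows the extra
# bit manipulation a GPU kernel must do before any codebook lookup.

def pack_3bit(indices):
    """Concatenate 3-bit values into bytes (little-endian bit order)."""
    bits, nbits, out = 0, 0, []
    for idx in indices:
        assert 0 <= idx < 8
        bits |= idx << nbits
        nbits += 3
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)
    return bytes(out)

def unpack_3bit(data, count):
    """Recover `count` 3-bit indices from the packed byte stream."""
    bits, nbits, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < 3:
            bits |= next(it) << nbits
            nbits += 8
        out.append(bits & 0b111)
        bits >>= 3
        nbits -= 3
    return out

indices = [5, 1, 7, 0, 3, 6, 2, 4]
assert unpack_3bit(pack_3bit(indices), len(indices)) == indices
```

By contrast, two 4-bit values fit exactly in one byte and can be extracted with a single mask and shift, which is one reason 4-bit maps so cleanly onto GPU kernels.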
3. Memory Access Patterns
Apple Silicon Architecture:
- Optimized for unified memory direct access
- Excels at sequential memory patterns
- Struggles with indirect/random lookup patterns
Codebook lookups (2-bit/3-bit):
- Create random memory access patterns
- Defeat unified memory advantages
- Bottleneck GPU performance
Quantization Performance by Bit-Width
Native Support (Optimal Performance)
| Bit-Width | Metal Support | Performance | Use Case |
|---|---|---|---|
| 8-bit | ✅ Native | Excellent | High quality, 2x memory savings |
| 4-bit | ✅ Native | Excellent | Optimal balance, 4x memory savings |
Emulated Support (Degraded Performance)
| Bit-Width | Metal Support | Performance | Penalty vs 4-bit | Use Case |
|---|---|---|---|---|
| 3-bit | ❌ Emulated (codebook) | Poor | ~1.75x slower | Not recommended |
| 2-bit | ❌ Emulated (codebook) | Poor | ~1.75x slower | Not recommended |
Performance Penalty Calculation:
- Standard 4-bit vs NVIDIA: ~2x slower on Apple Silicon
- IQ2/IQ3 vs NVIDIA: ~3.5x slower on Apple Silicon
- Extra penalty for sub-4bit: ~1.75x (3.5x ÷ 2x)
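The penalty figures follow directly from the benchmark numbers above; a quick arithmetic check:

```python
# Derive the quoted penalty ratios from the IQ2_XS benchmark figures.
rtx_4080_iq2 = 175   # tokens/sec, IQ2_XS on RTX 4080
m2_max_iq2 = 50      # tokens/sec, IQ2_XS on M2 Max

iq2_ratio = rtx_4080_iq2 / m2_max_iq2      # gap vs NVIDIA for IQ2
baseline_ratio = 2.0                        # typical 4-bit gap vs NVIDIA
extra_penalty = iq2_ratio / baseline_ratio  # penalty attributable to sub-4bit

print(iq2_ratio, extra_penalty)  # 3.5 1.75
```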
Research Findings
Debunking Common Myths
Myth: “Compressing models to lower bit precision is a de facto promise of faster inference across all hardware platforms.”
Reality: Research shows this is false. Lower bit precision can actually slow down inference on Apple Silicon when it requires emulation.
MLX Supported Quantization
MLX officially supports the following bit-widths:
- 2, 3, 4, 5, 6, and 8 bits per quantized weight
However, practical performance varies dramatically:
- 4-bit and 8-bit: Excellent native performance
- 2-bit and 3-bit: Significant performance degradation despite technical support
Group-Wise Quantization
MLX implements efficient quantization where:
- Consecutive weights share scale and optionally bias parameters
- Reduces memory footprint while preserving model quality
- Most efficient at 4-bit due to native hardware support
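The idea can be shown with a minimal pure-Python sketch of group-wise affine 4-bit quantization. The group size and rounding details here are illustrative (MLX typically uses groups of 32 or 64 weights):

```python
# Group-wise affine 4-bit quantization: each group of consecutive weights
# shares one scale and one bias, so a weight is stored as a 4-bit code
# plus its group's two shared parameters.

GROUP_SIZE = 4          # illustrative; MLX typically uses 32 or 64
BITS = 4
LEVELS = 2 ** BITS - 1  # 15 steps -> 16 quantization points

def quantize_group(group):
    lo, hi = min(group), max(group)
    scale = (hi - lo) / LEVELS or 1.0       # avoid zero scale
    codes = [round((w - lo) / scale) for w in group]  # each code in 0..15
    return codes, scale, lo

def dequantize_group(codes, scale, bias):
    return [c * scale + bias for c in codes]  # one multiply-add per weight

weights = [0.12, -0.40, 0.33, 0.05, 1.10, 0.90, 0.75, 1.30]
recon = []
for i in range(0, len(weights), GROUP_SIZE):
    codes, scale, bias = quantize_group(weights[i:i + GROUP_SIZE])
    recon.extend(dequantize_group(codes, scale, bias))

# Reconstruction error is bounded by half a quantization step per group.
errors = [abs(a - b) for a, b in zip(weights, recon)]
print(max(errors))
```

Note that dequantization here is a single multiply-add per weight over sequentially stored codes, exactly the access pattern Apple Silicon executes well.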
Practical Recommendations
For Apple Silicon (M1/M2/M3/M4/M5)
Primary Recommendation: Use 4-bit quantization
- Best balance of compression and performance
- Native Metal GPU kernel support
- No emulation overhead
Alternative: Use 8-bit quantization
- Near full-precision quality
- 2x memory savings vs FP16
- Excellent performance
Avoid: 2-bit and 3-bit quantization
- 1.75x performance penalty vs 4-bit
- No practical benefit despite smaller memory footprint
- Speed loss negates memory savings
Memory vs Performance Trade-off
| Quantization | Memory Savings | Performance | Verdict |
|---|---|---|---|
| FP16 | Baseline (1x) | Baseline | Maximum quality |
| 8-bit | 2x | Excellent (0.95x) | High quality option |
| 4-bit | 4x | Excellent (1.0x) | ⭐ Optimal choice |
| 3-bit | 5.3x | Poor (0.57x) | ❌ Not worth it |
| 2-bit | 8x | Poor (0.57x) | ❌ Not worth it |
Performance normalized to 4-bit as 1.0x baseline
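The savings column follows directly from bits per weight. For a concrete sense of scale, a sketch computing raw weight storage for a 70B-parameter model (weights only; real checkpoints add per-group scales and biases, and runtime memory adds KV cache and activations):

```python
# Raw weight storage for a 70B-parameter model at each bit-width.
PARAMS = 70e9

for bits in (16, 8, 4, 3, 2):
    gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit: {gb:6.1f} GB  ({16 / bits:.1f}x savings vs FP16)")
```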
MLX Framework Optimizations
M5 Neural Accelerators (2026)
The M5 chip introduces GPU Neural Accelerators that provide:
- 4x speedup vs M4 for time-to-first-token
- Dedicated matrix multiplication operations
- Native support for TensorOps through Metal 4
Critical: These accelerators are optimized for 4-bit and 8-bit operations, not sub-4bit.
Metal 4 Features
Metal 4 introduces:
- Tensor Operations (TensorOps): Native 4-bit quantization support
- Metal Performance Primitives: Efficient 4-bit kernels
- Quadgroup operations: Optimized for 4-bit matrix multiplications
MLX vs Other Frameworks
On Apple Silicon (4-bit quantization):
- MLX: 26-30% faster than Ollama
- vllm-mlx: 21-87% higher throughput than llama.cpp
- Optimized models: 400+ tokens/sec achievable
This performance advantage disappears with sub-4bit quantization because of the emulation overhead.
Real-World Implications
When Sub-4Bit Makes Sense
Only in extreme memory constraints:
- Cannot fit 4-bit model in available RAM
- No alternative except smaller model
- Quality degradation acceptable
Example: Running 235B model on 256GB Mac Studio
- 4-bit: ~203GB (tight fit)
- 3-bit: ~135GB (comfortable fit)
- Trade-off: Accept 1.75x performance penalty
When 4-Bit is Superior
Almost always preferred:
- Fits in available memory
- Interactive applications (>15 tok/s target)
- Production deployments
- Real-time inference
Example: Running 70B model on M3 Ultra
- 4-bit: 12-18 tok/s ✅ Usable
- 3-bit: 7-10 tok/s ❌ Too slow
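These figures are consistent with the ~1.75x penalty derived earlier; applying it to the 4-bit numbers predicts the 3-bit range:

```python
# Apply the ~1.75x sub-4bit penalty to the 4-bit throughput figures.
penalty = 1.75
low, high = 12, 18                 # 4-bit tok/s on M3 Ultra
pred_low, pred_high = low / penalty, high / penalty
print(round(pred_low, 1), round(pred_high, 1))  # 6.9 10.3
```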
Technical Details
Affine Quantization (4-bit)
How it works:
- 16 quantization points (2^4)
- Linearly distributed along number line
- Direct integer representation
- No lookup required
Performance:
- Single memory read per weight
- Sequential access pattern
- Metal GPU kernel optimized
Codebook Quantization (2-bit/3-bit)
How it works:
- Assign each weight 2-bit or 3-bit index
- Index points to codebook entry
- Two memory reads per weight
Performance bottleneck:
- First read: Get weight index
- Second read: Lookup codebook value
- Random access pattern
- Not Metal-optimized
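The two access patterns can be sketched side by side. The codebook path touches memory twice per weight, and the second touch is data-dependent (the index determines the address), which is what defeats prefetching. A pure-Python sketch that counts the memory touches (instrumentation only, not a real kernel):

```python
# Contrast dequantization access patterns: affine (one sequential read
# per weight) vs codebook (an index read plus a data-dependent table read).

def dequant_affine(codes, scale, bias, reads):
    out = []
    for c in codes:                      # sequential walk over code array
        reads.append(("code", len(out)))
        out.append(c * scale + bias)
    return out

def dequant_codebook(indices, codebook, reads):
    out = []
    for i, idx in enumerate(indices):
        reads.append(("index", i))       # first read: the stored index
        reads.append(("table", idx))     # second read: data-dependent lookup
        out.append(codebook[idx])
    return out

codes = [3, 7, 0, 12]
affine_reads = []
dequant_affine(codes, scale=0.1, bias=-0.5, reads=affine_reads)

indices = [2, 5, 5, 1]
codebook = [-1.0, -0.4, -0.1, 0.1, 0.4, 1.0]
lut_reads = []
dequant_codebook(indices, codebook, lut_reads)

print(len(affine_reads), len(lut_reads))  # 4 vs 8 memory touches
```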
Comparison with NVIDIA GPUs
NVIDIA Architecture Advantages
For sub-4bit quantization:
- Optimized for indirect memory access
- Dedicated lookup table hardware
- Better random access performance
Performance ratio:
- 4-bit quantization: NVIDIA ~2x faster
- 2-bit/3-bit (IQ): NVIDIA ~3.5x faster
Conclusion: NVIDIA’s advantage increases with sub-4bit quantization, while Apple’s advantage is with 4-bit native support.
Apple Silicon Advantages
For 4-bit+ quantization:
- Unified memory architecture
- Native Metal GPU kernels
- Efficient sequential access
- Lower power consumption
Sweet spot: 4-bit to 8-bit quantization on Apple Silicon offers competitive performance with much better power efficiency.
Future Considerations
Potential Improvements
- Hardware: Future Apple Silicon may add native 2-bit/3-bit support
- Software: MLX may develop optimized codebook kernels for Metal
- Algorithms: New quantization schemes could avoid lookup tables entirely
Current State (2026)
As of early 2026:
- M5 chip continues focus on 4-bit/8-bit optimization
- No announced sub-4bit native support
- MLX framework prioritizes 4-bit performance
Recommendation: Plan deployments around 4-bit as minimum, avoid sub-4bit unless absolutely necessary.
Summary
The Answer to “What bit-width is MLX not optimal for?”
2-bit and 3-bit quantization (sub-4bit) are not optimal for MLX and Apple Silicon because:
- Lookup table emulation: Requires codebook reads that Apple’s GPU handles poorly
- No native support: Metal Performance Shaders only natively support 4-bit and 8-bit
- Performance penalty: ~1.75x slower than 4-bit despite better memory compression
- Architecture mismatch: Apple Silicon optimized for direct access, not indirect lookups
4-bit is the practical lower bound for optimal MLX performance on Apple Silicon.
Sources
- Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU | Apple Machine Learning Research
- Very slow IQ quant performance on Apple Silicon | llama.cpp Discussion
- Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective | ResearchGate
- Native LLM and MLLM Inference at Scale on Apple Silicon | arXiv
- Accelerate machine learning with Metal | WWDC24
- Metal Performance Shaders | Apple Developer Documentation
- Quantization | MLX DeepWiki
- Local AI with MLX on the Mac | Markus Schall
- Benchmarking On-Device Machine Learning on Apple Silicon with MLX | arXiv
- vllm-mlx GitHub Repository