Purpose

Explain why sub-4bit quantization (2-bit, 3-bit) is not optimal for MLX and Apple Silicon despite offering better memory compression, due to emulation overhead and lack of native hardware support.

Key Findings

The 4-Bit Sweet Spot

4-bit quantization is the practical lower bound for optimal MLX performance. While 2-bit and 3-bit quantization offer better memory compression, they suffer from significant performance penalties on Apple Silicon.

Performance Degradation Below 4-Bit

Critical Finding: Apple Silicon handles IQ-series quantization (2-bit, 3-bit) poorly. These quantization schemes use a “codebook” to encode the quantized values of groups of 4 or 8 model weights, requiring many lookup-table loads during matrix multiplication. Apple Silicon handles this access pattern far less efficiently than NVIDIA GPUs.

Benchmark Evidence:

  • IQ2_XS on RTX 4080: 175 tokens/sec
  • IQ2_XS on M2 Max: 50 tokens/sec
  • Performance ratio: 3.5x slower on Apple Silicon (vs 2x for standard 4-bit)

The difference is less pronounced on precisions natively supported by Apple Silicon (4-bit and above).

Why Sub-4Bit Underperforms on Apple Silicon

1. Lookup Table Overhead

2-bit and 3-bit (IQ quants):

  • Use “codebook” encoding for groups of 4-8 weights
  • Require frequent lookup table reads during matrix multiplication
  • Apple’s GPU architecture handles these indirect memory accesses inefficiently

4-bit quantization:

  • Direct representation without lookup tables
  • Native Metal Performance Shaders support
  • Efficient GPU kernel implementation

2. Native Hardware Support

Metal Performance Shaders (MPS):

  • Native 4-bit integer format: Announced 2024, provides hardware-accelerated 4-bit quantization
  • 16 quantization points: Efficient linear distribution for 4-bit
  • Affine quantization: Direct 4-bit or 8-bit support without emulation

Below 4-bit:

  • No native MPS support
  • Must emulate through lookup tables
  • Lacks optimized Metal GPU kernels

3. Memory Access Patterns

Apple Silicon Architecture:

  • Optimized for unified memory direct access
  • Excels at sequential memory patterns
  • Struggles with indirect/random lookup patterns

Codebook lookups (2-bit/3-bit):

  • Create random memory access patterns
  • Defeat unified memory advantages
  • Bottleneck GPU performance

Quantization Performance by Bit-Width

Native Support (Optimal Performance)

| Bit-Width | Metal Support | Performance | Use Case |
|---|---|---|---|
| 8-bit | ✅ Native | Excellent | High quality, 2x memory savings |
| 4-bit | ✅ Native | Excellent | Optimal balance, 4x memory savings |

Emulated Support (Degraded Performance)

| Bit-Width | Metal Support | Performance | Penalty vs 4-bit | Use Case |
|---|---|---|---|---|
| 3-bit | ❌ Emulated (codebook) | Poor | ~1.75x slower | Not recommended |
| 2-bit | ❌ Emulated (codebook) | Poor | ~1.75x slower | Not recommended |

Performance Penalty Calculation:

  • Standard 4-bit vs NVIDIA: ~2x slower on Apple Silicon
  • IQ2/IQ3 vs NVIDIA: ~3.5x slower on Apple Silicon
  • Extra penalty for sub-4bit: ~1.75x (3.5x ÷ 2x)
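The ~1.75x figure falls out of the two slowdown ratios above. A quick sketch of the arithmetic (ratios are approximate, taken from the RTX 4080 vs M2 Max benchmark):

```python
# Slowdown of Apple Silicon vs NVIDIA, using the benchmark figures above
# (approximate ratios from the RTX 4080 vs M2 Max comparison).
slowdown_4bit = 2.0        # standard 4-bit: ~2x slower than NVIDIA
slowdown_iq2 = 175 / 50    # IQ2_XS: 175 tok/s vs 50 tok/s = 3.5x

# Extra penalty attributable to sub-4bit emulation alone:
extra_penalty = slowdown_iq2 / slowdown_4bit
print(f"~{extra_penalty:.2f}x extra penalty")  # → ~1.75x extra penalty
```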

Research Findings

Debunking Common Myths

Myth: “Compressing models to lower bit precision is a de facto promise of faster inference across all hardware platforms”

Reality: Research shows this is false. Lower bit precision can actually slow down inference on Apple Silicon when it requires emulation.

MLX Supported Quantization

MLX officially supports the following bit-widths:

  • 2, 3, 4, 5, 6, and 8 bits per quantized weight

However, practical performance varies dramatically:

  • 4-bit and 8-bit: Excellent native performance
  • 2-bit and 3-bit: Significant performance degradation despite technical support

Group-Wise Quantization

MLX implements efficient quantization where:

  • Consecutive weights share scale and optionally bias parameters
  • Reduces memory footprint while preserving model quality
  • Most efficient at 4-bit due to native hardware support
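The group-wise scheme above can be sketched in pure Python. This is an illustration only: real MLX kernels operate on packed arrays with a default group size of 64, not Python lists, and the helper names here are hypothetical.

```python
def quantize_group(weights, bits=4):
    # Shared scale/bias per group of consecutive weights, in the spirit of
    # MLX's group-wise affine scheme (pure-Python sketch, tiny group).
    levels = 2 ** bits - 1               # 15 representable steps at 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0    # avoid zero scale for constant groups
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo                  # lo doubles as the bias

def dequantize_group(q, scale, bias):
    return [qi * scale + bias for qi in q]

group = [0.12, -0.40, 0.33, 0.05]        # one group of consecutive weights
q, scale, bias = quantize_group(group)
recon = dequantize_group(q, scale, bias)
assert all(0 <= qi <= 15 for qi in q)    # every index fits in 4 bits
```

Because scale and bias are shared across the group, the per-weight storage collapses to the 4-bit index, and reconstruction error stays within half a quantization step.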

Practical Recommendations

For Apple Silicon (M1/M2/M3/M4/M5)

Primary Recommendation: Use 4-bit quantization

  • Best balance of compression and performance
  • Native Metal GPU kernel support
  • No emulation overhead

Alternative: Use 8-bit quantization

  • Near full-precision quality
  • 2x memory savings vs FP16
  • Excellent performance

Avoid: 2-bit and 3-bit quantization

  • 1.75x performance penalty vs 4-bit
  • No practical benefit despite smaller memory footprint
  • Speed loss negates memory savings

Memory vs Performance Trade-off

| Quantization | Memory Savings | Performance | Verdict |
|---|---|---|---|
| FP16 | Baseline (1x) | Baseline | Maximum quality |
| 8-bit | 2x | Excellent (0.95x) | High quality option |
| 4-bit | 4x | Excellent (1.0x) | ⭐ Optimal choice |
| 3-bit | 5.3x | Poor (0.57x) | ❌ Not worth it |
| 2-bit | 8x | Poor (0.57x) | ❌ Not worth it |

Performance normalized to 4-bit as 1.0x baseline
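The memory side of the trade-off is easy to estimate: the step from 4-bit to 3-bit saves only about 9 GB on a 70B model while costing ~1.75x in speed. A back-of-the-envelope sketch (weight-only footprint, deliberately ignoring quantization scales/biases, KV cache, and activations):

```python
def weight_gb(params_billion, bits):
    # Weight-only footprint: 1e9 params * (bits / 8) bytes per param = GB.
    # Simplification for illustration; real models add metadata overhead.
    return params_billion * bits / 8

for bits in (16, 8, 4, 3, 2):
    print(f"70B model at {bits:>2}-bit: {weight_gb(70, bits):6.2f} GB")
```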

MLX Framework Optimizations

M5 Neural Accelerators (2026)

The M5 chip introduces GPU Neural Accelerators that provide:

  • 4x speedup vs M4 for time-to-first-token
  • Dedicated matrix multiplication operations
  • Native support for TensorOps through Metal 4

Critical: These accelerators are optimized for 4-bit and 8-bit operations, not sub-4bit.

Metal 4 Features

Metal 4 introduces:

  • Tensor Operations (TensorOps): Native 4-bit quantization support
  • Metal Performance Primitives: Efficient 4-bit kernels
  • Quadgroup operations: Optimized for 4-bit matrix multiplications

MLX vs Other Frameworks

On Apple Silicon (4-bit quantization):

  • MLX: 26-30% faster than Ollama
  • vllm-mlx: 21-87% higher throughput than llama.cpp
  • Optimized models: 400+ tokens/sec achievable

Performance advantage disappears with sub-4bit quantization due to emulation overhead.

Real-World Implications

When Sub-4Bit Makes Sense

Only in extreme memory constraints:

  • Cannot fit 4-bit model in available RAM
  • No alternative except smaller model
  • Quality degradation acceptable

Example: Running 235B model on 256GB Mac Studio

  • 4-bit: ~203GB (tight fit)
  • 3-bit: ~135GB (comfortable fit)
  • Trade-off: Accept 1.75x performance penalty

When 4-Bit is Superior

Almost always preferred:

  • Fits in available memory
  • Interactive applications (>15 tok/s target)
  • Production deployments
  • Real-time inference

Example: Running 70B model on M3 Ultra

  • 4-bit: 12-18 tok/s ✅ Usable
  • 3-bit: 7-10 tok/s ❌ Too slow

Technical Details

Affine Quantization (4-bit)

How it works:

  • 16 quantization points (2^4)
  • Linearly distributed along number line
  • Direct integer representation
  • No lookup required

Performance:

  • Single memory read per weight
  • Sequential access pattern
  • Metal GPU kernel optimized
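The "direct representation, no lookup" property comes from how 4-bit weights are stored: two per byte, decoded with pure bit arithmetic. A minimal sketch of the packing (illustrative only, not the actual Metal kernel layout):

```python
# Two 4-bit weights pack into one byte; decoding is shifts and masks,
# with no table lookup and no data-dependent memory address.
def pack_nibbles(a, b):
    assert 0 <= a <= 15 and 0 <= b <= 15
    return (b << 4) | a          # high nibble = b, low nibble = a

def unpack_nibbles(byte):
    return byte & 0x0F, byte >> 4

byte = pack_nibbles(7, 12)
lo, hi = unpack_nibbles(byte)
assert (lo, hi) == (7, 12)
# Dequantization is then a fused multiply-add: w = q * scale + bias.
```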

Codebook Quantization (2-bit/3-bit)

How it works:

  • Assign each weight 2-bit or 3-bit index
  • Index points to codebook entry
  • Two memory reads per weight

Performance bottleneck:

  • First read: Get weight index
  • Second read: Lookup codebook value
  • Random access pattern
  • Not Metal-optimized
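The two-read pattern can be sketched as follows. The second read (`codebook[i]`) has a data-dependent address, which is exactly the random-access pattern this section describes. This is a simplified illustration: real IQ quants map each codebook entry to a group of 4-8 weights, not a single value.

```python
# Codebook (IQ-style) decode: each stored index triggers a second,
# data-dependent read into a shared lookup table.
codebook = [-1.0, -0.33, 0.33, 1.0]       # 2-bit: 2**2 = 4 entries
indices = [3, 0, 2, 2, 1]                 # first read: the 2-bit indices

weights = [codebook[i] for i in indices]  # second read: table lookup per index
assert weights == [1.0, -1.0, 0.33, 0.33, -0.33]
```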

Comparison with NVIDIA GPUs

NVIDIA Architecture Advantages

For sub-4bit quantization:

  • Optimized for indirect memory access
  • Dedicated lookup table hardware
  • Better random access performance

Performance ratio:

  • 4-bit quantization: NVIDIA ~2x faster
  • 2-bit/3-bit (IQ): NVIDIA ~3.5x faster

Conclusion: NVIDIA’s advantage widens with sub-4bit quantization, while Apple Silicon is most competitive at natively supported 4-bit.

Apple Silicon Advantages

For 4-bit+ quantization:

  • Unified memory architecture
  • Native Metal GPU kernels
  • Efficient sequential access
  • Lower power consumption

Sweet spot: 4-bit to 8-bit quantization on Apple Silicon offers competitive performance with much better power efficiency.

Future Considerations

Potential Improvements

  • Hardware: Future Apple Silicon may add native 2-bit/3-bit support
  • Software: MLX may develop optimized codebook kernels for Metal
  • Algorithms: New quantization schemes without lookup tables

Current State (2026)

As of early 2026:

  • M5 chip continues focus on 4-bit/8-bit optimization
  • No announced sub-4bit native support
  • MLX framework prioritizes 4-bit performance

Recommendation: Plan deployments around 4-bit as minimum, avoid sub-4bit unless absolutely necessary.

Summary

The Answer to “What bit-width is MLX not optimal for?”

2-bit and 3-bit quantization (sub-4bit) are not optimal for MLX and Apple Silicon because:

  1. Lookup table emulation: Requires codebook reads that Apple’s GPU handles poorly
  2. No native support: Metal Performance Shaders only natively support 4-bit and 8-bit
  3. Performance penalty: ~1.75x slower than 4-bit despite better memory compression
  4. Architecture mismatch: Apple Silicon optimized for direct access, not indirect lookups

4-bit is the practical lower bound for optimal MLX performance on Apple Silicon.

Sources

  1. Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU | Apple Machine Learning Research
  2. Very slow IQ quant performance on Apple Silicon | llama.cpp Discussion
  3. Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective | ResearchGate
  4. Native LLM and MLLM Inference at Scale on Apple Silicon | arXiv
  5. Accelerate machine learning with Metal | WWDC24
  6. Metal Performance Shaders | Apple Developer Documentation
  7. Quantization | MLX DeepWiki
  8. Local AI with MLX on the Mac | Markus Schall
  9. Benchmarking On-Device Machine Learning on Apple Silicon with MLX | arXiv
  10. vllm-mlx GitHub Repository