Overview
Purpose
Z80-μLM (z80ai) is a research project by HarryR exploring the extreme limits of language model compression by running conversational AI on 1970s-era 8-bit Z80 processors. The project answers the question: “How small can we go while maintaining personality?”
Key Findings
What It Is
Z80-μLM enables running neural language models on vintage Z80 processors (4MHz, 64KB RAM) by using aggressive 2-bit quantization and integer-only arithmetic. The entire system (inference code, weights, and chat UI) fits in a ~40KB CP/M .COM binary that can run on hardware from 1976.
Core Innovation
The project demonstrates that with extreme optimization techniques, you can run conversational AI on hardware with less computing power than a modern calculator:
- 2-bit weight quantization: Restricts all weights to {-2, -1, 0, +1}, packed four per byte (see the packing sketch after this list)
- 16-bit integer inference: Uses Z80-native register pairs for all math operations
- No floating-point: Everything uses integer arithmetic with fixed-point scaling
- Trigram hash encoding: Compresses input into 128 semantic buckets (typo-tolerant but loses word order)
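A minimal Python sketch of how four such weights could pack into one byte. The `(w + 2)` code mapping and low-bits-first ordering are illustrative assumptions, not the repository's actual format:

```python
def pack_weights(weights):
    """Pack weights drawn from {-2, -1, 0, +1} into bytes, four per byte."""
    assert len(weights) % 4 == 0
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 2) << (2 * j)  # map {-2..+1} to codes 0..3, low bits first
        packed.append(byte)
    return bytes(packed)

def unpack_weights(packed):
    """Recover the signed weights from the packed bytes."""
    return [((b >> (2 * j)) & 0b11) - 2 for b in packed for j in range(4)]

assert unpack_weights(pack_weights([-2, -1, 0, 1])) == [-2, -1, 0, 1]
```

Whatever the exact encoding, the payoff is the same: a weight matrix shrinks to one quarter of its 8-bit size, which is what lets weights, code, and UI share 64KB.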
Technical Architecture
Input Processing
Text is hashed into 256 buckets total:
- 128 query buckets (current input)
- 128 context buckets (conversation history)
This creates abstract “tag cloud” representations rather than parsing grammar. The approach is typo-tolerant and word-order invariant, but longer sentences cause semantic collisions.
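A toy Python version of such an encoder. The specific hash function, the lower-casing, and the binary 0/1 bucket values are assumptions for illustration; the project's actual scheme may differ:

```python
def encode(text, n_buckets=128):
    """Hash overlapping character trigrams into a fixed set of buckets."""
    text = text.lower()
    buckets = [0] * n_buckets
    for i in range(len(text) - 2):
        h = 0
        for ch in text[i:i + 3]:
            h = (h * 31 + ord(ch)) & 0xFFFF  # cheap 16-bit rolling hash
        buckets[h % n_buckets] = 1
    return buckets

# "helo wrld" shares most trigrams with "hello world", so a typo only
# flips a few buckets; word order never enters the representation, and
# long inputs eventually collide within the same 128 slots.
```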
Model Structure
A configurable neural network with a typical architecture (sketched after this list):
- Input layer (256 buckets)
- Hidden layers (e.g., 256→192→128 neurons)
- Output layer (one neuron per character)
- ReLU activation between layers
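A shape-level Python sketch of that topology. Float arithmetic is used here for readability, and the 64-character output alphabet is an assumption; the deployed model is integer-only, as sketched under inference below:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [256, 192, 128, 64]  # 256 input buckets; 64 output characters is a guess

# Stand-in weights already on the 2-bit grid {-2, -1, 0, +1}
layers = [(rng.choice([-2, -1, 0, 1], size=(m, n)), np.zeros(m, dtype=int))
          for n, m in zip(sizes, sizes[1:])]

def forward(x, layers):
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0)  # ReLU between layers, none after the output
    return x  # one logit per output character

logits = forward(rng.integers(0, 2, size=256), layers)
```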
Training Process
Uses Quantization-Aware Training (QAT):
- Runs float and integer-quantized forward passes in parallel
- Scores how well knowledge survives compression to the 2-bit grid
- Uses straight-through estimators for gradient flow (see the sketch below)
- Optimizes specifically for the quantization constraints
Community insights note that “the first layer is the most sensitive” to quantization, while “middle layers take best to quantization.”
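The straight-through trick is a standard QAT pattern; here is a minimal PyTorch-style sketch of it. The project uses its own custom QAT library, which may differ in detail:

```python
import torch

GRID = torch.tensor([-2.0, -1.0, 0.0, 1.0])  # the 2-bit weight grid

def quantize_ste(w):
    """Forward pass: snap each weight to its nearest grid point.
    Backward pass: gradients flow through as if no rounding happened."""
    idx = (w.unsqueeze(-1) - GRID).abs().argmin(dim=-1)
    w_q = GRID[idx]
    return w + (w_q - w).detach()  # value equals w_q, gradient equals dL/dw

# The optimizer updates the latent float weights, but every forward pass
# sees only their quantized shadow, so the training loss directly scores
# how well the model's knowledge survives the 2-bit grid.
```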
Z80 Inference Implementation
The inner loop on the Z80 (modeled in the sketch after this list):
- Unpacks 2-bit weights from packed bytes
- Performs multiply-accumulate using 16-bit accumulators
- Applies arithmetic right-shifts by 2 to prevent overflow
- Generates characters one at a time autoregressively
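A behavioral Python model of that loop. The real implementation is hand-written Z80 assembly; the register assignment and exact unpack order here are assumptions:

```python
def mac_neuron(packed_row, activations):
    """Compute one output neuron: dot(unpacked 2-bit weights, activations)."""
    acc = 0  # stands in for a 16-bit Z80 register pair such as HL
    for byte_index, byte in enumerate(packed_row):
        for j in range(4):
            w = ((byte >> (2 * j)) & 0b11) - 2  # unpack one 2-bit weight
            a = activations[4 * byte_index + j]
            acc += w * a                        # w is in {-2, -1, 0, +1}
    return acc >> 2  # arithmetic right-shift by 2 keeps the sum in 16-bit range
```

Because every weight is in {-2, -1, 0, +1}, each multiply-accumulate step reduces to a skip, an add, a subtract, or a shifted subtract. That matters on the Z80, which has no hardware multiply instruction.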
Capabilities and Limitations
What It Can Do
- Interactive chat mode with distinct personalities
- Play 20-questions game
- Maintain simple conversations
- Run entirely self-contained on vintage hardware
Limitations
- “Won’t pass the Turing test, but it might make you smile at the green screen”
- Best with short inputs
- Longer sentences cause semantic hash collisions
- Word order information is lost in encoding
Implementation Details
Technology Stack
- Training: Python with custom QAT library
- Export: Z80 assembly code generation
- Target Platform: CP/M operating system
- Binary Format: Self-contained .COM executable (~40KB)
Pre-built Examples
- tinychat: General chatbot
- guess: 20-questions game
No external dependencies are required beyond the training tools.
Community Reception
Hacker News Discussion
The project generated significant interest on Hacker News (item 46417815):
Historical Perspective: Commenters noted this “could have worked on 60s-era hardware” and would have “completely changed the world…back then.”
Modern Context: Users highlighted the contrast with modern software bloat, joking that this demonstrates why “Slack needs 1.5GB of ram” for text chat.
Practical Applications Discussed:
- Embedding in retro games (ZX Spectrum, Game Boy)
- IoT device integration for lightweight AI
- Proof-of-concept for extreme compression limits
Technical Synchronicity: Two developers working independently on CP/M emulators discovered each other’s projects through this release.
Significance
Z80-μLM represents an extreme point on the model compression spectrum, demonstrating that conversational AI capabilities can be achieved with:
- 1970s hardware specifications
- No floating-point support
- Severe memory constraints (64KB total)
- Ultra-aggressive quantization (2-bit)
While not practically useful for modern applications, it serves as a valuable proof-of-concept for understanding the fundamental limits of model compression and the minimal hardware requirements for neural language models.
Sources
- GitHub: HarryR/z80ai (official repository)
- Hacker News: “Show HN: Z80-μLM, a ‘Conversational AI’ That Fits in 40KB” (community discussion)