Overview
Purpose
Z80-μLM (z80ai) is a research project by HarryR exploring the extreme limits of language model compression by running conversational AI on 1970s-era 8-bit Z80 processors. The project answers the question: “How small can we go while maintaining personality?”
Key Findings
What It Is
Z80-μLM enables running neural language models on vintage Z80 processors (4MHz, 64KB RAM) by using aggressive 2-bit quantization and integer-only arithmetic. The entire system (inference code, weights, and chat UI) fits in a ~40KB CP/M .COM binary that can run on hardware from 1976.
Core Innovation
The project demonstrates that with extreme optimization techniques, you can run conversational AI on hardware with less computing power than a modern calculator:
- 2-bit weight quantization: Restricts all weights to {-2, -1, 0, +1}, packed four per byte (see the packing sketch after this list)
- 16-bit integer inference: Uses Z80-native register pairs for all math operations
- No floating-point: Everything uses integer arithmetic with fixed-point scaling
- Trigram hash encoding: Compresses input into 128 semantic buckets (typo-tolerant but loses word order)
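A minimal Python sketch of how four such weights could pack into one byte. The `(w + 2)` code mapping and low-bits-first ordering are illustrative assumptions, not the repository's actual format:

```python
def pack_weights(weights):
    """Pack weights drawn from {-2, -1, 0, +1} into bytes, four per byte."""
    assert len(weights) % 4 == 0
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            byte |= (w + 2) << (2 * j)  # map {-2..+1} to codes 0..3, low bits first
        packed.append(byte)
    return bytes(packed)

def unpack_weights(packed):
    """Recover the signed weights from the packed bytes."""
    return [((b >> (2 * j)) & 0b11) - 2 for b in packed for j in range(4)]

assert unpack_weights(pack_weights([-2, -1, 0, 1])) == [-2, -1, 0, 1]
```

Whatever the exact encoding, the payoff is the same: a weight matrix shrinks to one quarter of its 8-bit size, which is what lets weights, code, and UI share 64KB.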
Technical Architecture
Input Processing
Text is hashed into 256 buckets total:
- 128 query buckets (current input)
- 128 context buckets (conversation history)
This creates abstract “tag cloud” representations rather than parsing grammar. The approach is typo-tolerant and word-order invariant, but longer sentences cause semantic collisions.
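A toy Python version of such an encoder. The specific hash function, the lower-casing, and the binary 0/1 bucket values are assumptions for illustration; the project's actual scheme may differ:

```python
def encode(text, n_buckets=128):
    """Hash overlapping character trigrams into a fixed set of buckets."""
    text = text.lower()
    buckets = [0] * n_buckets
    for i in range(len(text) - 2):
        h = 0
        for ch in text[i:i + 3]:
            h = (h * 31 + ord(ch)) & 0xFFFF  # cheap 16-bit rolling hash
        buckets[h % n_buckets] = 1
    return buckets

# "helo wrld" shares most trigrams with "hello world", so a typo only
# flips a few buckets; word order never enters the representation, and
# long inputs eventually collide within the same 128 slots.
```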
Model Structure
A configurable neural network with a typical architecture (sketched after this list):
- Input layer (256 buckets)
- Hidden layers (e.g., 256→192→128 neurons)
- Output layer (one neuron per character)
- ReLU activation between layers
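A shape-level Python sketch of that topology. Float arithmetic is used here for readability, and the 64-character output alphabet is an assumption; the deployed model is integer-only, as sketched under inference below:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [256, 192, 128, 64]  # 256 input buckets; 64 output characters is a guess

# Stand-in weights already on the 2-bit grid {-2, -1, 0, +1}
layers = [(rng.choice([-2, -1, 0, 1], size=(m, n)), np.zeros(m, dtype=int))
          for n, m in zip(sizes, sizes[1:])]

def forward(x, layers):
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0)  # ReLU between layers, none after the output
    return x  # one logit per output character

logits = forward(rng.integers(0, 2, size=256), layers)
```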
Training Process
Uses Quantization-Aware Training (QAT):
- Runs float and integer-quantized forward passes in parallel
- Scores how well knowledge survives compression to the 2-bit grid
- Uses straight-through estimators for gradient flow (see the sketch below)
- Optimizes specifically for the quantization constraints
Community insights note that “the first layer is the most sensitive” to quantization, while “middle layers take best to quantization.”
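The straight-through trick is a standard QAT pattern; here is a minimal PyTorch-style sketch of it. The project uses its own custom QAT library, which may differ in detail:

```python
import torch

GRID = torch.tensor([-2.0, -1.0, 0.0, 1.0])  # the 2-bit weight grid

def quantize_ste(w):
    """Forward pass: snap each weight to its nearest grid point.
    Backward pass: gradients flow through as if no rounding happened."""
    idx = (w.unsqueeze(-1) - GRID).abs().argmin(dim=-1)
    w_q = GRID[idx]
    return w + (w_q - w).detach()  # value equals w_q, gradient equals dL/dw

# The optimizer updates the latent float weights, but every forward pass
# sees only their quantized shadow, so the training loss directly scores
# how well the model's knowledge survives the 2-bit grid.
```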
Z80 Inference Implementation
The inner loop on the Z80 (modeled in the sketch after this list):
- Unpacks 2-bit weights from packed bytes
- Performs multiply-accumulate using 16-bit accumulators
- Applies arithmetic right-shifts by 2 to prevent overflow
- Generates characters one at a time autoregressively
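A behavioral Python model of that loop. The real implementation is hand-written Z80 assembly; the register assignment and exact unpack order here are assumptions:

```python
def mac_neuron(packed_row, activations):
    """Compute one output neuron: dot(unpacked 2-bit weights, activations)."""
    acc = 0  # stands in for a 16-bit Z80 register pair such as HL
    for byte_index, byte in enumerate(packed_row):
        for j in range(4):
            w = ((byte >> (2 * j)) & 0b11) - 2  # unpack one 2-bit weight
            a = activations[4 * byte_index + j]
            acc += w * a                        # w is in {-2, -1, 0, +1}
    return acc >> 2  # arithmetic right-shift by 2 keeps the sum in 16-bit range
```

Because every weight is in {-2, -1, 0, +1}, each multiply-accumulate step reduces to a skip, an add, a subtract, or a shifted subtract. That matters on the Z80, which has no hardware multiply instruction.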
Capabilities and Limitations
What It Can Do
- Interactive chat mode with distinct personalities
- Play 20-questions game
- Maintain simple conversations
- Run entirely self-contained on vintage hardware
Limitations
- “Won’t pass the Turing test, but it might make you smile at the green screen”
- Best with short inputs
- Longer sentences cause semantic hash collisions
- Word order information is lost in encoding
Implementation Details
Technology Stack
- Training: Python with custom QAT library
- Export: Z80 assembly code generation
- Target Platform: CP/M operating system
- Binary Format: Self-contained .COM executable (~40KB)
Pre-built Examples
- tinychat: General chatbot
- guess: 20-questions game
No external dependencies are required beyond the training tools.
Community Reception
Hacker News Discussion
The project generated significant interest on Hacker News (item 46417815):
Historical Perspective: Commenters noted this “could have worked on 60s-era hardware” and would have “completely changed the world…back then.”
Modern Context: Users highlighted the contrast with modern software bloat, joking that this demonstrates why “Slack needs 1.5GB of ram” for text chat.
Practical Applications Discussed:
- Embedding in retro games (ZX Spectrum, Game Boy)
- IoT device integration for lightweight AI
- Proof-of-concept for extreme compression limits
Technical Synchronicity: Two developers working independently on CP/M emulators discovered each other’s projects through this release.
Significance
Z80-μLM represents an extreme point on the model compression spectrum, demonstrating that conversational AI capabilities can be achieved with:
- 1970s hardware specifications
- No floating-point support
- Severe memory constraints (64KB total)
- Ultra-aggressive quantization (2-bit)
While not practically useful for modern applications, it serves as a valuable proof-of-concept for understanding the fundamental limits of model compression and the minimal hardware requirements for neural language models.
Sources
- GitHub: HarryR/z80ai (official repository)
- Hacker News: “Show HN: Z80-μLM, a ‘Conversational AI’ That Fits in 40KB” (community discussion)