CFR Poker Development Journey
A chronological timeline documenting the development of a CFR (Counterfactual Regret Minimization) poker AI solver, from initial research through GPU optimization.
Table of Contents
- Phase 0: Research & Foundation
- Phase 1: Preflop-Only MCCFR
- Phase 2: Full 4-Street MCCFR
- Phase 3: GPU Acceleration
- Phase 4: Monte Carlo Sample Reduction
- Phase 5: Save/Load & Interactive Play
- Phase 6: Matrix CFR Exploration
- Key Learnings
- Future Work
Phase 0: Research & Foundation
Date: December 2, 2025
Initial Research
Started with a research request covering:
- Incomplete information game theory
- Kuhn Poker
- Monte-Carlo Counterfactual Regret Minimization (MCCFR)
- Nash equilibrium convergence
Documentation Created
Created comprehensive documentation at docs/game-theory-cfr/:
| Document | Description |
|---|---|
README.md | Topic index and navigation |
mccfr-nash-equilibrium.md | CFR algorithm deep-dive, MCCFR variants, Libratus/Pluribus architecture |
kuhn-poker.md | Benchmark game with 12 information sets, Nash equilibrium strategies |
implementation-guide.md | Step-by-step tutorial with working code |
Package Creation
Created packages/cfr-poker/ using uv with:
packages/cfr-poker/├── pyproject.toml # uv config with ROCm/CUDA optional deps├── README.md # Usage documentation├── src/cfr_poker/│ ├── __init__.py # Package exports│ ├── information_set.py # Regret tracking│ ├── kuhn.py # Full Kuhn Poker CFR│ ├── cli.py # CLI commands│ └── jax_cfr.py # JAX GPU acceleration (experimental)└── tests/ └── test_kuhn.py # 19 testsGPU Exploration: JAX vs PyTorch
JAX ROCm Attempt (Failed)
Attempted JAX ROCm for GPU acceleration:
# Errors encountered- libhipfftw.so missing- libdw.so.1 missing- Memory access faults on gfx1100Conclusion: JAX ROCm only supports MI Instinct datacenter GPUs, NOT consumer RDNA3 (gfx1100/gfx1101).
PyTorch ROCm (Partial Success)
Switched to PyTorch ROCm:
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2# torch==2.5.1+rocm6.2- Detected GPUs: RX 7800 XT + integrated graphics
- Created
torch_cfr.pywithTorchKuhnCFRclass - GPU execution failed with “HIP error: invalid device function”
Initial Benchmarks
| Backend | Kuhn (12 info sets) | Leduc (288 info sets) |
|---|---|---|
| NumPy CPU | ~6,500 iter/sec | ~22 iter/sec |
| PyTorch CPU | ~1,500 iter/sec | - |
| PyTorch GPU | FAILED | - |
Key Finding: NumPy is fastest for small games due to framework overhead.
Leduc Poker Implementation
Implemented Leduc Poker (~288 info sets):
- 6-card deck (J, Q, K in two suits)
- 2 betting rounds with community card
- Max 2 raises per round
- Pair beats high card at showdown
Phase 1: Preflop-Only MCCFR
Date: December 2-3, 2025
Implementation
Created Hold’em modules in src/cfr_poker/holdem/:
| File | Purpose |
|---|---|
cards.py | 52-card representation |
hand_evaluator.py | Multi-backend hand evaluation |
abstraction.py | Preflop 10 buckets by equity |
preflop_mccfr.py | External sampling MCCFR |
Results
- Performance: ~5,000 iterations/sec on CPU
- Information Sets: 100 (10 buckets × 10 betting histories)
- Convergence: ~20 seconds (100k iterations)
- Tests: 70 passing
Strategy Insights
- Premium hands (AA, KK) raise ~85% from SB
- Trash hands (72o) mostly call/fold
- SB loses ~0.006 chips/hand (BB position advantage)
Phase 2: Full 4-Street MCCFR
Date: December 3, 2025
Implementation
Extended to all betting streets: Preflop → Flop → Turn → River
New Classes:
LimitHoldemMCCFR- Full 4-street solverPostflopAbstraction- EHS (Expected Hand Strength) bucketing
Key Design Decisions:
# Info set key format"PF{bucket}:F{bucket}:T{bucket}:R{bucket}|{history}"
# History encoding with street separators"rc/cr/c/rc" # raise-call / check-raise / call / raise-call
# Bet sizesPreflop/Flop: 2 chipsTurn/River: 4 chips
# PositionSB acts first preflopBB acts first postflopResults (5/10/10/10 Buckets)
| Metric | Value |
|---|---|
| Information Sets | 32,887 |
| Speed | 26 iter/sec |
| SB Expected Value | +0.20 chips/hand |
| Memory | ~50 MB |
| Tests | 131 passing |
Usage
# Preflop only (fast)uv run python scripts/train_holdem.py -i 100000 -v
# Full multi-streetuv run python scripts/train_holdem.py --multistreet -i 1000 -vPhase 3: GPU Acceleration
Date: December 3, 2025
Initial Attempt: Direct GPU
Created GPU modules in src/cfr_poker/holdem/gpu/:
| File | Purpose |
|---|---|
__init__.py | GPU availability detection |
utils.py | Combinatorial indexing (C(52,5) = 2,598,960) |
lookup_table.py | Pre-computed hand rank table (~5MB) |
batched_ehs.py | Batched Monte Carlo EHS |
The Problem: GPU Was Slower!
CPU: 26 iter/secGPU: 10 iter/sec (2.6x SLOWER)Root Cause: Sequential MCCFR algorithm creates 300+ tiny GPU kernel launches per hand. Kernel launch overhead (~50μs) exceeded actual computation time.
Solution: Pre-computation
Shifted strategy to pre-computing EHS tables offline:
Pre-computation speedup: 76x (7140s → 94s)Training improvement: ~4% (26 → 27 iter/sec)The training improvement was minimal because the bottleneck was flop/turn computation, not river.
Profiling Discovery
# In-memory cache statisticsHit rate: 97.4%Pre-computed coverage: 0.036% of 2.8B combinationsPre-computed tables were ineffective because:
- In-memory cache already handled most repeated queries
- Random sampling covered only 0.036% of possible combinations
- Per-call GPU overhead remained the bottleneck
Cleanup
Removed ineffective pre-computation code, keeping the codebase clean.
Phase 4: Monte Carlo Sample Reduction
Date: December 3, 2025
Hypothesis
EHS computation uses Monte Carlo sampling (default: 100 samples). Reducing samples trades accuracy for speed.
Benchmarks
| EHS Samples | Iterations/sec | Speedup | Accuracy Impact |
|---|---|---|---|
| 100 | 255 | baseline | Best |
| 50 | 372 | 1.46x | Minimal |
| 25 | 448 | 1.76x | Noticeable |
Mathematical Basis
EHS estimation error scales with 1/√n:
- 100 samples: ±3.2% error
- 50 samples: ±4.5% error
- 25 samples: ±6.3% error
CFR is robust to noise, so 50 samples provides good tradeoff.
Combined Speedups
| Phase | Optimization | Speedup |
|---|---|---|
| Phase 3 | GPU vectorization | 10.4x |
| Phase 4 | Sample reduction | 1.46x |
| Total | Combined | 15.2x |
Current Bottleneck
Profiling revealed CPU showdown evaluation (_evaluate_5card) consumes 39.4% of training time.
Recommendations
# Development (1.46x faster)uv run python scripts/train_holdem.py --multistreet --ehs-samples 50 -i 10000
# Final accuracy testinguv run python scripts/train_holdem.py --multistreet --ehs-samples 100 -i 100000Phase 5: Save/Load & Interactive Play
Date: December 4-5, 2025
Motivation
After completing GPU optimization, the next question was: “How do I actually use this trained model?” This led to implementing:
- Strategy persistence (save/load to JSON)
- Interactive play mode against the trained bot
Implementation (TDD Approach)
New Files Created:
| File | Purpose | Tests |
|---|---|---|
src/cfr_poker/holdem/strategy_io.py | Save/load strategies to JSON | 13 |
src/cfr_poker/holdem/play.py | Interactive play with Unicode cards | 22 |
scripts/play_holdem.py | CLI entry point | - |
Files Modified:
| File | Change |
|---|---|
scripts/train_holdem.py | Added --save and --load flags |
src/cfr_poker/holdem/mccfr.py | Fixed fold payoff calculation bug |
Strategy I/O
class StrategyCodec: VERSION = "1.0"
@staticmethod def encode_strategy(mccfr, game_value=None) -> Dict: # Convert MCCFR to JSON-serializable dict
@staticmethod def decode_strategy(data: Dict) -> Dict[str, InformationSet]: # Reconstruct InformationSet objects from JSON
@staticmethod def validate_compatibility(data, preflop_buckets, flop_buckets, ...): # Check bucket configuration matchesInteractive Play Features
- Unicode card display:
A♠ K♥format using suit symbols (♠♥♦♣) - Full game loop: Deal → Bet → Showdown with pot tracking
- Position selection: Play as SB or BB
- Quit support: Exit mid-game with ‘q’
Bug Fixes
Fold Payoff Bug: User reported “Bot folds” but “You lose 12 chips”
# Before (buggy) - action_count incremented inside inner loopaction_count = 0for part in history.split("/"): for i, action in enumerate(part): if action == "f": folder = (action_count + 1) % 2 # Wrong! break action_count += 1 # BUG: Inside inner loop!
# After (fixed) - proper fold position detectionstreets = history.split("/")for street_idx, street_history in enumerate(streets): if "f" in street_history: fold_pos = street_history.index("f") if street_idx == 0: folder = fold_pos % 2 # Preflop: SB acts first else: folder = (fold_pos + 1) % 2 # Postflop: BB acts first breakDisplay Bug: Hands weren’t shown when someone folded. Fixed to show both hands on any terminal action.
Usage
# Train and save strategypython scripts/train_holdem.py --multistreet -i 100000 --save strategy.json
# Play against the botpython scripts/play_holdem.py --load strategy.json --position sbResults
- 227 tests passing (35 new tests added)
- All phases complete (1-5)
- Final performance: ~372 iter/sec (15.2x from baseline)
Phase 6: Matrix CFR Exploration
Date: December 5, 2025
Motivation
Traditional recursive CFR doesn’t parallelize well on GPUs because:
- Irregular tree traversal with unpredictable branching
- Sequential dependencies between parent/child nodes
- Control flow divergence kills GPU efficiency
Matrix CFR reformulates the algorithm as:
- Forward pass: Compute reach probabilities via sparse matrix multiply
- Backward pass: Compute counterfactual values via transposed matrix multiply
- Update: Vectorized regret and strategy accumulation
Reference: GPU-Accelerated CFR
Implementation
Created new matrix_cfr/ module alongside existing MCCFR:
| File | Purpose | Lines |
|---|---|---|
game_tree.py | Build game tree as sparse matrices | ~560 |
info_set_manager.py | Vectorized info set storage | ~150 |
matrix_cfr.py | Core matrix CFR algorithm | ~200 |
benchmark_matrix_cfr.py | Performance comparison | ~180 |
Key Data Structures:
@dataclassclass GameTreeMatrices: num_histories: int # 22,141 betting sequences num_info_sets: int # 4.7M with 5/5/5/5 buckets is_terminal: Tensor # (N,) boolean mask is_player0: Tensor # (N,) SB action nodes is_player1: Tensor # (N,) BB action nodes parent: Tensor # (N,) parent node indices children: Tensor # (N, 3) child indices per action levels: List[Tensor] # BFS level structure (28 levels)Benchmark Results
| Implementation | Device | Iter/sec | Speedup |
|---|---|---|---|
| Recursive MCCFR | CPU | 180 | baseline |
| Matrix CFR | CPU | 14 | 0.08x |
| Matrix CFR | GPU | 34 | 0.19x |
Key Finding: Matrix CFR is currently slower than recursive MCCFR!
Why Matrix CFR Is Slower for Poker
-
Pre-allocation Overhead: Matrix CFR pre-allocates for ALL 4.7M possible info sets, but MCCFR only visits ~42k (0.9%) per 500 iterations.
-
Tensor Operation Overhead: Python loops through 28 tree levels with tensor slicing has more overhead than pure Python recursion for this tree size.
-
No Card Sampling: Current matrix CFR lacks terminal payoff computation (needs card distribution sampling).
-
Monte Carlo Wins: Random sampling naturally prunes the tree - most game states are never reached in practice.
Lessons Learned
The GPU-CFR paper achieved 352x speedups because:
- Games had explicit enumerable states (not abstraction-based)
- All states were visited every iteration
- Larger game trees amortized kernel overhead
For abstracted poker:
- Only 0.9% of info sets are visited
- Monte Carlo sampling is inherently efficient
- Pre-allocating unused info sets is wasteful
Tests Added
- 47 new tests for matrix CFR module
- All tests passing (47/47)
Future Directions for GPU Poker
Better approaches than pure Matrix CFR:
- Batched Monte Carlo: Sample 1000+ deals per iteration, process in parallel
- Sparse Info Set Storage: Lazily allocate only visited info sets
- GPU Hand Evaluation: Accelerate the 39.4% bottleneck (_evaluate_5card)
Key Learnings
1. GPU ≠ Automatically Faster
Algorithm structure matters more than raw compute power. Sequential tree traversal with thousands of tiny operations is hostile to GPU architecture.
2. Consumer AMD GPUs Have Limited ML Support
| Framework | AMD RDNA3 Support |
|---|---|
| JAX ROCm | Not supported (datacenter only) |
| PyTorch ROCm | Partial (kernel issues) |
3. NumPy Beats PyTorch for Small Sequential Operations
Framework overhead dominates for small games. NumPy is ~4x faster than PyTorch CPU for Kuhn Poker.
4. MCCFR Needs Batching for GPU Benefits
To leverage GPU parallelization, the algorithm must be restructured to batch operations across multiple hands simultaneously.
5. Monte Carlo Sampling Is Tunable
Reducing EHS samples from 100 to 50 provides 1.46x speedup with negligible accuracy impact. CFR’s iterative nature is robust to estimation noise.
6. Caching Is Powerful
In-memory caching achieved 97.4% hit rate, making pre-computation largely redundant for random sampling patterns.
Future Work
Phase 6: Push/Fold No-Limit Hold’em (Proposed)
A simplified No-Limit format where you only have two actions: Push (all-in) or Fold.
When It’s Used: Tournament play with short stacks (10-15 big blinds).
| Aspect | Full No-Limit | Push/Fold |
|---|---|---|
| Bet sizes | Continuous (infinite) | 1 (all-in) |
| Post-flop streets | 3 (flop/turn/river) | 0 (showdown only) |
| Actions | fold/check/call/raise/all-in | fold/push |
| Info sets | Millions | ~10,000 |
Why It’s a Good Target:
- Demonstrates No-Limit concepts
- Manageable complexity for basic MCCFR
- Has known Nash equilibrium (published Push/Fold charts)
- Can validate against established optimal play
GPU Showdown Evaluation
Current bottleneck is _evaluate_5card at 39.4% of training time. GPU-accelerating hand evaluation could provide 1.5-2x additional speedup.
Batched MCCFR
Restructure algorithm to process 1000+ hands in parallel:
# Current (sequential)for iteration in range(n): hand = deal() traverse(hand) # 300+ tiny operations
# Future (batched)hands = deal_batch(1000)ehs_batch = gpu_compute_all_ehs(hands) # One kerneltraverse_batch(hands, ehs_batch)Bucket-Based Caching
Use 5,000 abstraction buckets instead of 2.8B exact states for >90% cache hit rate potential.
Exploitability Analysis
Compute exact exploitability via best response to measure convergence quality.
Final Package State
packages/cfr-poker/├── src/cfr_poker/│ ├── kuhn.py # Kuhn Poker (12 info sets)│ ├── leduc.py # Leduc Poker (288 info sets)│ ├── torch_cfr.py # PyTorch backend│ └── holdem/│ ├── mccfr.py # Full 4-street MCCFR│ ├── abstraction.py # EHS bucketing│ ├── strategy_io.py # Save/load strategies│ ├── play.py # Interactive play mode│ ├── gpu/ # GPU acceleration│ └── matrix_cfr/ # Matrix CFR (experimental)│ ├── game_tree.py # Sparse matrix tree│ ├── info_set_manager.py # Vectorized storage│ └── matrix_cfr.py # Core algorithm├── scripts/│ ├── train_holdem.py│ ├── play_holdem.py│ ├── benchmark_ehs_samples.py│ └── benchmark_matrix_cfr.py└── tests/ # 256 tests passingPerformance Summary
| Game | Info Sets | Speed (Optimized) |
|---|---|---|
| Kuhn Poker | 12 | ~6,500 iter/sec |
| Leduc Poker | 288 | ~22 iter/sec |
| Limit Hold’em (5/10/10/10) | ~33k | ~372 iter/sec |