A chronological timeline documenting the development of a CFR (Counterfactual Regret Minimization) poker AI solver, from initial research through GPU optimization.

Table of Contents


Phase 0: Research & Foundation

Date: December 2, 2025

Initial Research

Started with a research request covering:

  • Incomplete information game theory
  • Kuhn Poker
  • Monte-Carlo Counterfactual Regret Minimization (MCCFR)
  • Nash equilibrium convergence

Documentation Created

Created comprehensive documentation at docs/game-theory-cfr/:

DocumentDescription
README.mdTopic index and navigation
mccfr-nash-equilibrium.mdCFR algorithm deep-dive, MCCFR variants, Libratus/Pluribus architecture
kuhn-poker.mdBenchmark game with 12 information sets, Nash equilibrium strategies
implementation-guide.mdStep-by-step tutorial with working code

Package Creation

Created packages/cfr-poker/ using uv with:

packages/cfr-poker/
├── pyproject.toml # uv config with ROCm/CUDA optional deps
├── README.md # Usage documentation
├── src/cfr_poker/
│ ├── __init__.py # Package exports
│ ├── information_set.py # Regret tracking
│ ├── kuhn.py # Full Kuhn Poker CFR
│ ├── cli.py # CLI commands
│ └── jax_cfr.py # JAX GPU acceleration (experimental)
└── tests/
└── test_kuhn.py # 19 tests

GPU Exploration: JAX vs PyTorch

JAX ROCm Attempt (Failed)

Attempted JAX ROCm for GPU acceleration:

Terminal window
# Errors encountered
- libhipfftw.so missing
- libdw.so.1 missing
- Memory access faults on gfx1100

Conclusion: JAX ROCm only supports MI Instinct datacenter GPUs, NOT consumer RDNA3 (gfx1100/gfx1101).

PyTorch ROCm (Partial Success)

Switched to PyTorch ROCm:

Terminal window
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
# torch==2.5.1+rocm6.2
  • Detected GPUs: RX 7800 XT + integrated graphics
  • Created torch_cfr.py with TorchKuhnCFR class
  • GPU execution failed with “HIP error: invalid device function”

Initial Benchmarks

BackendKuhn (12 info sets)Leduc (288 info sets)
NumPy CPU~6,500 iter/sec~22 iter/sec
PyTorch CPU~1,500 iter/sec-
PyTorch GPUFAILED-

Key Finding: NumPy is fastest for small games due to framework overhead.

Leduc Poker Implementation

Implemented Leduc Poker (~288 info sets):

  • 6-card deck (J, Q, K in two suits)
  • 2 betting rounds with community card
  • Max 2 raises per round
  • Pair beats high card at showdown

Phase 1: Preflop-Only MCCFR

Date: December 2-3, 2025

Implementation

Created Hold’em modules in src/cfr_poker/holdem/:

FilePurpose
cards.py52-card representation
hand_evaluator.pyMulti-backend hand evaluation
abstraction.pyPreflop 10 buckets by equity
preflop_mccfr.pyExternal sampling MCCFR

Results

  • Performance: ~5,000 iterations/sec on CPU
  • Information Sets: 100 (10 buckets × 10 betting histories)
  • Convergence: ~20 seconds (100k iterations)
  • Tests: 70 passing

Strategy Insights

  • Premium hands (AA, KK) raise ~85% from SB
  • Trash hands (72o) mostly call/fold
  • SB loses ~0.006 chips/hand (BB position advantage)

Phase 2: Full 4-Street MCCFR

Date: December 3, 2025

Implementation

Extended to all betting streets: Preflop → Flop → Turn → River

New Classes:

  • LimitHoldemMCCFR - Full 4-street solver
  • PostflopAbstraction - EHS (Expected Hand Strength) bucketing

Key Design Decisions:

# Info set key format
"PF{bucket}:F{bucket}:T{bucket}:R{bucket}|{history}"
# History encoding with street separators
"rc/cr/c/rc" # raise-call / check-raise / call / raise-call
# Bet sizes
Preflop/Flop: 2 chips
Turn/River: 4 chips
# Position
SB acts first preflop
BB acts first postflop

Results (5/10/10/10 Buckets)

MetricValue
Information Sets32,887
Speed26 iter/sec
SB Expected Value+0.20 chips/hand
Memory~50 MB
Tests131 passing

Usage

Terminal window
# Preflop only (fast)
uv run python scripts/train_holdem.py -i 100000 -v
# Full multi-street
uv run python scripts/train_holdem.py --multistreet -i 1000 -v

Phase 3: GPU Acceleration

Date: December 3, 2025

Initial Attempt: Direct GPU

Created GPU modules in src/cfr_poker/holdem/gpu/:

FilePurpose
__init__.pyGPU availability detection
utils.pyCombinatorial indexing (C(52,5) = 2,598,960)
lookup_table.pyPre-computed hand rank table (~5MB)
batched_ehs.pyBatched Monte Carlo EHS

The Problem: GPU Was Slower!

CPU: 26 iter/sec
GPU: 10 iter/sec (2.6x SLOWER)

Root Cause: Sequential MCCFR algorithm creates 300+ tiny GPU kernel launches per hand. Kernel launch overhead (~50μs) exceeded actual computation time.

Solution: Pre-computation

Shifted strategy to pre-computing EHS tables offline:

Pre-computation speedup: 76x (7140s → 94s)
Training improvement: ~4% (26 → 27 iter/sec)

The training improvement was minimal because the bottleneck was flop/turn computation, not river.

Profiling Discovery

# In-memory cache statistics
Hit rate: 97.4%
Pre-computed coverage: 0.036% of 2.8B combinations

Pre-computed tables were ineffective because:

  1. In-memory cache already handled most repeated queries
  2. Random sampling covered only 0.036% of possible combinations
  3. Per-call GPU overhead remained the bottleneck

Cleanup

Removed ineffective pre-computation code, keeping the codebase clean.


Phase 4: Monte Carlo Sample Reduction

Date: December 3, 2025

Hypothesis

EHS computation uses Monte Carlo sampling (default: 100 samples). Reducing samples trades accuracy for speed.

Benchmarks

EHS SamplesIterations/secSpeedupAccuracy Impact
100255baselineBest
503721.46xMinimal
254481.76xNoticeable

Mathematical Basis

EHS estimation error scales with 1/√n:

  • 100 samples: ±3.2% error
  • 50 samples: ±4.5% error
  • 25 samples: ±6.3% error

CFR is robust to noise, so 50 samples provides good tradeoff.

Combined Speedups

PhaseOptimizationSpeedup
Phase 3GPU vectorization10.4x
Phase 4Sample reduction1.46x
TotalCombined15.2x

Current Bottleneck

Profiling revealed CPU showdown evaluation (_evaluate_5card) consumes 39.4% of training time.

Recommendations

Terminal window
# Development (1.46x faster)
uv run python scripts/train_holdem.py --multistreet --ehs-samples 50 -i 10000
# Final accuracy testing
uv run python scripts/train_holdem.py --multistreet --ehs-samples 100 -i 100000

Phase 5: Save/Load & Interactive Play

Date: December 4-5, 2025

Motivation

After completing GPU optimization, the next question was: “How do I actually use this trained model?” This led to implementing:

  1. Strategy persistence (save/load to JSON)
  2. Interactive play mode against the trained bot

Implementation (TDD Approach)

New Files Created:

FilePurposeTests
src/cfr_poker/holdem/strategy_io.pySave/load strategies to JSON13
src/cfr_poker/holdem/play.pyInteractive play with Unicode cards22
scripts/play_holdem.pyCLI entry point-

Files Modified:

FileChange
scripts/train_holdem.pyAdded --save and --load flags
src/cfr_poker/holdem/mccfr.pyFixed fold payoff calculation bug

Strategy I/O

class StrategyCodec:
VERSION = "1.0"
@staticmethod
def encode_strategy(mccfr, game_value=None) -> Dict:
# Convert MCCFR to JSON-serializable dict
@staticmethod
def decode_strategy(data: Dict) -> Dict[str, InformationSet]:
# Reconstruct InformationSet objects from JSON
@staticmethod
def validate_compatibility(data, preflop_buckets, flop_buckets, ...):
# Check bucket configuration matches

Interactive Play Features

  • Unicode card display: A♠ K♥ format using suit symbols (♠♥♦♣)
  • Full game loop: Deal → Bet → Showdown with pot tracking
  • Position selection: Play as SB or BB
  • Quit support: Exit mid-game with ‘q’

Bug Fixes

Fold Payoff Bug: User reported “Bot folds” but “You lose 12 chips”

# Before (buggy) - action_count incremented inside inner loop
action_count = 0
for part in history.split("/"):
for i, action in enumerate(part):
if action == "f":
folder = (action_count + 1) % 2 # Wrong!
break
action_count += 1 # BUG: Inside inner loop!
# After (fixed) - proper fold position detection
streets = history.split("/")
for street_idx, street_history in enumerate(streets):
if "f" in street_history:
fold_pos = street_history.index("f")
if street_idx == 0:
folder = fold_pos % 2 # Preflop: SB acts first
else:
folder = (fold_pos + 1) % 2 # Postflop: BB acts first
break

Display Bug: Hands weren’t shown when someone folded. Fixed to show both hands on any terminal action.

Usage

Terminal window
# Train and save strategy
python scripts/train_holdem.py --multistreet -i 100000 --save strategy.json
# Play against the bot
python scripts/play_holdem.py --load strategy.json --position sb

Results

  • 227 tests passing (35 new tests added)
  • All phases complete (1-5)
  • Final performance: ~372 iter/sec (15.2x from baseline)

Phase 6: Matrix CFR Exploration

Date: December 5, 2025

Motivation

Traditional recursive CFR doesn’t parallelize well on GPUs because:

  • Irregular tree traversal with unpredictable branching
  • Sequential dependencies between parent/child nodes
  • Control flow divergence kills GPU efficiency

Matrix CFR reformulates the algorithm as:

  1. Forward pass: Compute reach probabilities via sparse matrix multiply
  2. Backward pass: Compute counterfactual values via transposed matrix multiply
  3. Update: Vectorized regret and strategy accumulation

Reference: GPU-Accelerated CFR

Implementation

Created new matrix_cfr/ module alongside existing MCCFR:

FilePurposeLines
game_tree.pyBuild game tree as sparse matrices~560
info_set_manager.pyVectorized info set storage~150
matrix_cfr.pyCore matrix CFR algorithm~200
benchmark_matrix_cfr.pyPerformance comparison~180

Key Data Structures:

@dataclass
class GameTreeMatrices:
num_histories: int # 22,141 betting sequences
num_info_sets: int # 4.7M with 5/5/5/5 buckets
is_terminal: Tensor # (N,) boolean mask
is_player0: Tensor # (N,) SB action nodes
is_player1: Tensor # (N,) BB action nodes
parent: Tensor # (N,) parent node indices
children: Tensor # (N, 3) child indices per action
levels: List[Tensor] # BFS level structure (28 levels)

Benchmark Results

ImplementationDeviceIter/secSpeedup
Recursive MCCFRCPU180baseline
Matrix CFRCPU140.08x
Matrix CFRGPU340.19x

Key Finding: Matrix CFR is currently slower than recursive MCCFR!

Why Matrix CFR Is Slower for Poker

  1. Pre-allocation Overhead: Matrix CFR pre-allocates for ALL 4.7M possible info sets, but MCCFR only visits ~42k (0.9%) per 500 iterations.

  2. Tensor Operation Overhead: Python loops through 28 tree levels with tensor slicing has more overhead than pure Python recursion for this tree size.

  3. No Card Sampling: Current matrix CFR lacks terminal payoff computation (needs card distribution sampling).

  4. Monte Carlo Wins: Random sampling naturally prunes the tree - most game states are never reached in practice.

Lessons Learned

The GPU-CFR paper achieved 352x speedups because:

  • Games had explicit enumerable states (not abstraction-based)
  • All states were visited every iteration
  • Larger game trees amortized kernel overhead

For abstracted poker:

  • Only 0.9% of info sets are visited
  • Monte Carlo sampling is inherently efficient
  • Pre-allocating unused info sets is wasteful

Tests Added

  • 47 new tests for matrix CFR module
  • All tests passing (47/47)

Future Directions for GPU Poker

Better approaches than pure Matrix CFR:

  1. Batched Monte Carlo: Sample 1000+ deals per iteration, process in parallel
  2. Sparse Info Set Storage: Lazily allocate only visited info sets
  3. GPU Hand Evaluation: Accelerate the 39.4% bottleneck (_evaluate_5card)

Key Learnings

1. GPU ≠ Automatically Faster

Algorithm structure matters more than raw compute power. Sequential tree traversal with thousands of tiny operations is hostile to GPU architecture.

2. Consumer AMD GPUs Have Limited ML Support

FrameworkAMD RDNA3 Support
JAX ROCmNot supported (datacenter only)
PyTorch ROCmPartial (kernel issues)

3. NumPy Beats PyTorch for Small Sequential Operations

Framework overhead dominates for small games. NumPy is ~4x faster than PyTorch CPU for Kuhn Poker.

4. MCCFR Needs Batching for GPU Benefits

To leverage GPU parallelization, the algorithm must be restructured to batch operations across multiple hands simultaneously.

5. Monte Carlo Sampling Is Tunable

Reducing EHS samples from 100 to 50 provides 1.46x speedup with negligible accuracy impact. CFR’s iterative nature is robust to estimation noise.

6. Caching Is Powerful

In-memory caching achieved 97.4% hit rate, making pre-computation largely redundant for random sampling patterns.


Future Work

Phase 6: Push/Fold No-Limit Hold’em (Proposed)

A simplified No-Limit format where you only have two actions: Push (all-in) or Fold.

When It’s Used: Tournament play with short stacks (10-15 big blinds).

AspectFull No-LimitPush/Fold
Bet sizesContinuous (infinite)1 (all-in)
Post-flop streets3 (flop/turn/river)0 (showdown only)
Actionsfold/check/call/raise/all-infold/push
Info setsMillions~10,000

Why It’s a Good Target:

  • Demonstrates No-Limit concepts
  • Manageable complexity for basic MCCFR
  • Has known Nash equilibrium (published Push/Fold charts)
  • Can validate against established optimal play

GPU Showdown Evaluation

Current bottleneck is _evaluate_5card at 39.4% of training time. GPU-accelerating hand evaluation could provide 1.5-2x additional speedup.

Batched MCCFR

Restructure algorithm to process 1000+ hands in parallel:

# Current (sequential)
for iteration in range(n):
hand = deal()
traverse(hand) # 300+ tiny operations
# Future (batched)
hands = deal_batch(1000)
ehs_batch = gpu_compute_all_ehs(hands) # One kernel
traverse_batch(hands, ehs_batch)

Bucket-Based Caching

Use 5,000 abstraction buckets instead of 2.8B exact states for >90% cache hit rate potential.

Exploitability Analysis

Compute exact exploitability via best response to measure convergence quality.


Final Package State

packages/cfr-poker/
├── src/cfr_poker/
│ ├── kuhn.py # Kuhn Poker (12 info sets)
│ ├── leduc.py # Leduc Poker (288 info sets)
│ ├── torch_cfr.py # PyTorch backend
│ └── holdem/
│ ├── mccfr.py # Full 4-street MCCFR
│ ├── abstraction.py # EHS bucketing
│ ├── strategy_io.py # Save/load strategies
│ ├── play.py # Interactive play mode
│ ├── gpu/ # GPU acceleration
│ └── matrix_cfr/ # Matrix CFR (experimental)
│ ├── game_tree.py # Sparse matrix tree
│ ├── info_set_manager.py # Vectorized storage
│ └── matrix_cfr.py # Core algorithm
├── scripts/
│ ├── train_holdem.py
│ ├── play_holdem.py
│ ├── benchmark_ehs_samples.py
│ └── benchmark_matrix_cfr.py
└── tests/ # 256 tests passing

Performance Summary

GameInfo SetsSpeed (Optimized)
Kuhn Poker12~6,500 iter/sec
Leduc Poker288~22 iter/sec
Limit Hold’em (5/10/10/10)~33k~372 iter/sec