CFR Poker Development Journey

A chronological timeline documenting the development of a CFR (Counterfactual Regret Minimization) poker AI solver, from initial research through GPU optimization.

Phase 0: Research & Foundation
Phase 1: Preflop-Only MCCFR
Phase 2: Full 4-Street MCCFR
Phase 3: GPU Acceleration
Phase 4: Monte Carlo Sample Reduction
Phase 5: Save/Load & Interactive Play
Phase 6: Matrix CFR Exploration
Key Learnings
Future Work

Phase 0: Research & Foundation

Date: December 2, 2025

Initial Research

Started with a research request covering:

Incomplete information game theory
Kuhn Poker
Monte-Carlo Counterfactual Regret Minimization (MCCFR)
Nash equilibrium convergence

Documentation Created

Created comprehensive documentation at docs/game-theory-cfr/:

Document	Description
`README.md`	Topic index and navigation
`mccfr-nash-equilibrium.md`	CFR algorithm deep-dive, MCCFR variants, Libratus/Pluribus architecture
`kuhn-poker.md`	Benchmark game with 12 information sets, Nash equilibrium strategies
`implementation-guide.md`	Step-by-step tutorial with working code

Package Creation

Created packages/cfr-poker/ using uv with:

packages/cfr-poker/
├── pyproject.toml          # uv config with ROCm/CUDA optional deps
├── README.md               # Usage documentation
├── src/cfr_poker/
│   ├── __init__.py         # Package exports
│   ├── information_set.py  # Regret tracking
│   ├── kuhn.py             # Full Kuhn Poker CFR
│   ├── cli.py              # CLI commands
│   └── jax_cfr.py          # JAX GPU acceleration (experimental)
└── tests/
    └── test_kuhn.py        # 19 tests

GPU Exploration: JAX vs PyTorch

JAX ROCm Attempt (Failed)

Attempted JAX ROCm for GPU acceleration:

# Errors encountered
- libhipfftw.so missing
- libdw.so.1 missing
- Memory access faults on gfx1100

Conclusion: JAX ROCm only supports MI Instinct datacenter GPUs, NOT consumer RDNA3 (gfx1100/gfx1101).

PyTorch ROCm (Partial Success)

Switched to PyTorch ROCm:

pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
# torch==2.5.1+rocm6.2

Detected GPUs: RX 7800 XT + integrated graphics
Created torch_cfr.py with TorchKuhnCFR class
GPU execution failed with “HIP error: invalid device function”

Initial Benchmarks

Backend	Kuhn (12 info sets)	Leduc (288 info sets)
NumPy CPU	~6,500 iter/sec	~22 iter/sec
PyTorch CPU	~1,500 iter/sec	-
PyTorch GPU	FAILED	-

Key Finding: NumPy is fastest for small games due to framework overhead.

Leduc Poker Implementation

Implemented Leduc Poker (~288 info sets):

6-card deck (J, Q, K in two suits)
2 betting rounds with community card
Max 2 raises per round
Pair beats high card at showdown

Phase 1: Preflop-Only MCCFR

Date: December 2-3, 2025

Implementation

Created Hold’em modules in src/cfr_poker/holdem/:

File	Purpose
`cards.py`	52-card representation
`hand_evaluator.py`	Multi-backend hand evaluation
`abstraction.py`	Preflop 10 buckets by equity
`preflop_mccfr.py`	External sampling MCCFR

Results

Performance: ~5,000 iterations/sec on CPU
Information Sets: 100 (10 buckets × 10 betting histories)
Convergence: ~20 seconds (100k iterations)
Tests: 70 passing

Strategy Insights

Premium hands (AA, KK) raise ~85% from SB
Trash hands (72o) mostly call/fold
SB loses ~0.006 chips/hand (BB position advantage)

Phase 2: Full 4-Street MCCFR

Date: December 3, 2025

Implementation

Extended to all betting streets: Preflop → Flop → Turn → River

New Classes:

LimitHoldemMCCFR - Full 4-street solver
PostflopAbstraction - EHS (Expected Hand Strength) bucketing

Key Design Decisions:

# Info set key format
"PF{bucket}:F{bucket}:T{bucket}:R{bucket}|{history}"

# History encoding with street separators
"rc/cr/c/rc"  # raise-call / check-raise / call / raise-call

# Bet sizes
Preflop/Flop: 2 chips
Turn/River: 4 chips

# Position
SB acts first preflop
BB acts first postflop

Results (5/10/10/10 Buckets)

Metric	Value
Information Sets	32,887
Speed	26 iter/sec
SB Expected Value	+0.20 chips/hand
Memory	~50 MB
Tests	131 passing

Usage

# Preflop only (fast)
uv run python scripts/train_holdem.py -i 100000 -v

# Full multi-street
uv run python scripts/train_holdem.py --multistreet -i 1000 -v

Phase 3: GPU Acceleration

Date: December 3, 2025

Initial Attempt: Direct GPU

Created GPU modules in src/cfr_poker/holdem/gpu/:

File	Purpose
`__init__.py`	GPU availability detection
`utils.py`	Combinatorial indexing (C(52,5) = 2,598,960)
`lookup_table.py`	Pre-computed hand rank table (~5MB)
`batched_ehs.py`	Batched Monte Carlo EHS

The Problem: GPU Was Slower!

CPU: 26 iter/sec
GPU: 10 iter/sec  (2.6x SLOWER)

Root Cause: Sequential MCCFR algorithm creates 300+ tiny GPU kernel launches per hand. Kernel launch overhead (~50μs) exceeded actual computation time.

Solution: Pre-computation

Shifted strategy to pre-computing EHS tables offline:

Pre-computation speedup: 76x (7140s → 94s)
Training improvement: ~4% (26 → 27 iter/sec)

The training improvement was minimal because the bottleneck was flop/turn computation, not river.

Profiling Discovery

# In-memory cache statistics
Hit rate: 97.4%
Pre-computed coverage: 0.036% of 2.8B combinations

Pre-computed tables were ineffective because:

In-memory cache already handled most repeated queries
Random sampling covered only 0.036% of possible combinations
Per-call GPU overhead remained the bottleneck

Cleanup

Removed ineffective pre-computation code, keeping the codebase clean.

Phase 4: Monte Carlo Sample Reduction

Date: December 3, 2025

Hypothesis

EHS computation uses Monte Carlo sampling (default: 100 samples). Reducing samples trades accuracy for speed.

Benchmarks

EHS Samples	Iterations/sec	Speedup	Accuracy Impact
100	255	baseline	Best
50	372	1.46x	Minimal
25	448	1.76x	Noticeable

Mathematical Basis

EHS estimation error scales with 1/√n:

100 samples: ±3.2% error
50 samples: ±4.5% error
25 samples: ±6.3% error

CFR is robust to noise, so 50 samples provides good tradeoff.

Combined Speedups

Phase	Optimization	Speedup
Phase 3	GPU vectorization	10.4x
Phase 4	Sample reduction	1.46x
Total	Combined	15.2x

Current Bottleneck

Profiling revealed CPU showdown evaluation (_evaluate_5card) consumes 39.4% of training time.

Recommendations

# Development (1.46x faster)
uv run python scripts/train_holdem.py --multistreet --ehs-samples 50 -i 10000

# Final accuracy testing
uv run python scripts/train_holdem.py --multistreet --ehs-samples 100 -i 100000

Phase 5: Save/Load & Interactive Play

Date: December 4-5, 2025

Motivation

After completing GPU optimization, the next question was: “How do I actually use this trained model?” This led to implementing:

Strategy persistence (save/load to JSON)
Interactive play mode against the trained bot

Implementation (TDD Approach)

New Files Created:

File	Purpose	Tests
`src/cfr_poker/holdem/strategy_io.py`	Save/load strategies to JSON	13
`src/cfr_poker/holdem/play.py`	Interactive play with Unicode cards	22
`scripts/play_holdem.py`	CLI entry point	-

Files Modified:

File	Change
`scripts/train_holdem.py`	Added `--save` and `--load` flags
`src/cfr_poker/holdem/mccfr.py`	Fixed fold payoff calculation bug

Strategy I/O

class StrategyCodec:
    VERSION = "1.0"

    @staticmethod
    def encode_strategy(mccfr, game_value=None) -> Dict:
        # Convert MCCFR to JSON-serializable dict

    @staticmethod
    def decode_strategy(data: Dict) -> Dict[str, InformationSet]:
        # Reconstruct InformationSet objects from JSON

    @staticmethod
    def validate_compatibility(data, preflop_buckets, flop_buckets, ...):
        # Check bucket configuration matches

Interactive Play Features

Unicode card display: A♠ K♥ format using suit symbols (♠♥♦♣)
Full game loop: Deal → Bet → Showdown with pot tracking
Position selection: Play as SB or BB
Quit support: Exit mid-game with ‘q’

Bug Fixes

Fold Payoff Bug: User reported “Bot folds” but “You lose 12 chips”

# Before (buggy) - action_count incremented inside inner loop
action_count = 0
for part in history.split("/"):
    for i, action in enumerate(part):
        if action == "f":
            folder = (action_count + 1) % 2  # Wrong!
            break
        action_count += 1  # BUG: Inside inner loop!

# After (fixed) - proper fold position detection
streets = history.split("/")
for street_idx, street_history in enumerate(streets):
    if "f" in street_history:
        fold_pos = street_history.index("f")
        if street_idx == 0:
            folder = fold_pos % 2  # Preflop: SB acts first
        else:
            folder = (fold_pos + 1) % 2  # Postflop: BB acts first
        break

Display Bug: Hands weren’t shown when someone folded. Fixed to show both hands on any terminal action.

Usage

# Train and save strategy
python scripts/train_holdem.py --multistreet -i 100000 --save strategy.json

# Play against the bot
python scripts/play_holdem.py --load strategy.json --position sb

Results

227 tests passing (35 new tests added)
All phases complete (1-5)
Final performance: ~372 iter/sec (15.2x from baseline)

Phase 6: Matrix CFR Exploration

Date: December 5, 2025

Motivation

Traditional recursive CFR doesn’t parallelize well on GPUs because:

Irregular tree traversal with unpredictable branching
Sequential dependencies between parent/child nodes
Control flow divergence kills GPU efficiency

Matrix CFR reformulates the algorithm as:

Forward pass: Compute reach probabilities via sparse matrix multiply
Backward pass: Compute counterfactual values via transposed matrix multiply
Update: Vectorized regret and strategy accumulation

Reference: GPU-Accelerated CFR

Implementation

Created new matrix_cfr/ module alongside existing MCCFR:

File	Purpose	Lines
`game_tree.py`	Build game tree as sparse matrices	~560
`info_set_manager.py`	Vectorized info set storage	~150
`matrix_cfr.py`	Core matrix CFR algorithm	~200
`benchmark_matrix_cfr.py`	Performance comparison	~180

Key Data Structures:

@dataclass
class GameTreeMatrices:
    num_histories: int       # 22,141 betting sequences
    num_info_sets: int       # 4.7M with 5/5/5/5 buckets
    is_terminal: Tensor      # (N,) boolean mask
    is_player0: Tensor       # (N,) SB action nodes
    is_player1: Tensor       # (N,) BB action nodes
    parent: Tensor           # (N,) parent node indices
    children: Tensor         # (N, 3) child indices per action
    levels: List[Tensor]     # BFS level structure (28 levels)

Benchmark Results

Implementation	Device	Iter/sec	Speedup
Recursive MCCFR	CPU	180	baseline
Matrix CFR	CPU	14	0.08x
Matrix CFR	GPU	34	0.19x

Key Finding: Matrix CFR is currently slower than recursive MCCFR!

Why Matrix CFR Is Slower for Poker

Pre-allocation Overhead: Matrix CFR pre-allocates for ALL 4.7M possible info sets, but MCCFR only visits ~42k (0.9%) per 500 iterations.
Tensor Operation Overhead: Python loops through 28 tree levels with tensor slicing has more overhead than pure Python recursion for this tree size.
No Card Sampling: Current matrix CFR lacks terminal payoff computation (needs card distribution sampling).
Monte Carlo Wins: Random sampling naturally prunes the tree - most game states are never reached in practice.

Lessons Learned

The GPU-CFR paper achieved 352x speedups because:

Games had explicit enumerable states (not abstraction-based)
All states were visited every iteration
Larger game trees amortized kernel overhead

For abstracted poker:

Only 0.9% of info sets are visited
Monte Carlo sampling is inherently efficient
Pre-allocating unused info sets is wasteful

Tests Added

47 new tests for matrix CFR module
All tests passing (47/47)

Future Directions for GPU Poker

Better approaches than pure Matrix CFR:

Batched Monte Carlo: Sample 1000+ deals per iteration, process in parallel
Sparse Info Set Storage: Lazily allocate only visited info sets
GPU Hand Evaluation: Accelerate the 39.4% bottleneck (_evaluate_5card)

Key Learnings

1. GPU ≠ Automatically Faster

Algorithm structure matters more than raw compute power. Sequential tree traversal with thousands of tiny operations is hostile to GPU architecture.

2. Consumer AMD GPUs Have Limited ML Support

Framework	AMD RDNA3 Support
JAX ROCm	Not supported (datacenter only)
PyTorch ROCm	Partial (kernel issues)

3. NumPy Beats PyTorch for Small Sequential Operations

Framework overhead dominates for small games. NumPy is ~4x faster than PyTorch CPU for Kuhn Poker.

4. MCCFR Needs Batching for GPU Benefits

To leverage GPU parallelization, the algorithm must be restructured to batch operations across multiple hands simultaneously.

5. Monte Carlo Sampling Is Tunable

Reducing EHS samples from 100 to 50 provides 1.46x speedup with negligible accuracy impact. CFR’s iterative nature is robust to estimation noise.

6. Caching Is Powerful

In-memory caching achieved 97.4% hit rate, making pre-computation largely redundant for random sampling patterns.

Future Work

Phase 6: Push/Fold No-Limit Hold’em (Proposed)

A simplified No-Limit format where you only have two actions: Push (all-in) or Fold.

When It’s Used: Tournament play with short stacks (10-15 big blinds).

Aspect	Full No-Limit	Push/Fold
Bet sizes	Continuous (infinite)	1 (all-in)
Post-flop streets	3 (flop/turn/river)	0 (showdown only)
Actions	fold/check/call/raise/all-in	fold/push
Info sets	Millions	~10,000

Why It’s a Good Target:

Demonstrates No-Limit concepts
Manageable complexity for basic MCCFR
Has known Nash equilibrium (published Push/Fold charts)
Can validate against established optimal play

GPU Showdown Evaluation

Current bottleneck is _evaluate_5card at 39.4% of training time. GPU-accelerating hand evaluation could provide 1.5-2x additional speedup.

Batched MCCFR

Restructure algorithm to process 1000+ hands in parallel:

# Current (sequential)
for iteration in range(n):
    hand = deal()
    traverse(hand)  # 300+ tiny operations

# Future (batched)
hands = deal_batch(1000)
ehs_batch = gpu_compute_all_ehs(hands)  # One kernel
traverse_batch(hands, ehs_batch)

Bucket-Based Caching

Use 5,000 abstraction buckets instead of 2.8B exact states for >90% cache hit rate potential.

Exploitability Analysis

Compute exact exploitability via best response to measure convergence quality.

Final Package State

packages/cfr-poker/
├── src/cfr_poker/
│   ├── kuhn.py           # Kuhn Poker (12 info sets)
│   ├── leduc.py          # Leduc Poker (288 info sets)
│   ├── torch_cfr.py      # PyTorch backend
│   └── holdem/
│       ├── mccfr.py      # Full 4-street MCCFR
│       ├── abstraction.py # EHS bucketing
│       ├── strategy_io.py # Save/load strategies
│       ├── play.py       # Interactive play mode
│       ├── gpu/          # GPU acceleration
│       └── matrix_cfr/   # Matrix CFR (experimental)
│           ├── game_tree.py       # Sparse matrix tree
│           ├── info_set_manager.py # Vectorized storage
│           └── matrix_cfr.py      # Core algorithm
├── scripts/
│   ├── train_holdem.py
│   ├── play_holdem.py
│   ├── benchmark_ehs_samples.py
│   └── benchmark_matrix_cfr.py
└── tests/                # 256 tests passing

Performance Summary

Game	Info Sets	Speed (Optimized)
Kuhn Poker	12	~6,500 iter/sec
Leduc Poker	288	~22 iter/sec
Limit Hold’em (5/10/10/10)	~33k	~372 iter/sec

CFR Poker Development Journey

Table of Contents

Phase 0: Research & Foundation

Initial Research

Documentation Created

Package Creation

GPU Exploration: JAX vs PyTorch

JAX ROCm Attempt (Failed)

PyTorch ROCm (Partial Success)

Initial Benchmarks

Leduc Poker Implementation

Phase 1: Preflop-Only MCCFR

Implementation

Results

Strategy Insights

Phase 2: Full 4-Street MCCFR

Implementation

Results (5/10/10/10 Buckets)

Usage

Phase 3: GPU Acceleration

Initial Attempt: Direct GPU

The Problem: GPU Was Slower!

Solution: Pre-computation

Profiling Discovery

Cleanup

Phase 4: Monte Carlo Sample Reduction

Hypothesis

Benchmarks

Mathematical Basis

Combined Speedups

Current Bottleneck

Recommendations

Phase 5: Save/Load & Interactive Play

Motivation

Implementation (TDD Approach)

Strategy I/O

Interactive Play Features

Bug Fixes

Usage

Results

Phase 6: Matrix CFR Exploration

Motivation

Implementation

Benchmark Results

Why Matrix CFR Is Slower for Poker

Lessons Learned

Tests Added

Future Directions for GPU Poker

Key Learnings

1. GPU ≠ Automatically Faster

2. Consumer AMD GPUs Have Limited ML Support

3. NumPy Beats PyTorch for Small Sequential Operations

4. MCCFR Needs Batching for GPU Benefits

5. Monte Carlo Sampling Is Tunable

6. Caching Is Powerful

Future Work

Phase 6: Push/Fold No-Limit Hold’em (Proposed)

GPU Showdown Evaluation

Batched MCCFR

Bucket-Based Caching

Exploitability Analysis

Final Package State

Performance Summary

Related Documentation