Summary

We implemented GPU acceleration for EHS (Expected Hand Strength) computation in the CFR poker solver. Contrary to expectations, the GPU implementation is slower than CPU for the current workload pattern. This document explains why.

Initial Hypothesis

We expected GPU acceleration would provide significant speedup because:

  1. Large computation volume: Training runs ~47.3M hand evaluations per 10k iterations
  2. Parallel workload: Monte Carlo sampling is embarrassingly parallel
  3. Pre-computed lookup table: 2.6M 5-card hand rankings can be stored on GPU (~5MB)

What We Implemented

Phase 3 Components

| Component | Purpose | Location |
| --- | --- | --- |
| GPUHandRankTable | Pre-computed lookup for all C(52,5) hands | gpu/lookup_table.py |
| BatchedEHSComputer | GPU-accelerated Monte Carlo EHS | gpu/batched_ehs.py |
| PostflopAbstraction | GPU routing for EHS computation | abstraction.py |
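
The table is addressable because every 5-card combination maps to a unique flat index. A minimal sketch of that mapping using the combinatorial number system (the function name is illustrative; the actual gpu/utils.py indexing may differ):

```python
from math import comb

def comb_index(cards):
    """Map a sorted 5-card combination (card indices 0-51) to a unique
    index in [0, C(52,5)), i.e. 0 to 2,598,959."""
    return sum(comb(c, k + 1) for k, c in enumerate(sorted(cards)))

# Stored as uint16 ranks, C(52,5) = 2,598,960 entries comes to ~5MB.
```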

Key Design Decisions

  1. Lookup table approach: Pre-compute all 2,598,960 5-card hand rankings once
  2. Monte Carlo sampling on GPU: Sample opponent hands using GPU tensors
  3. 7-card evaluation: Evaluate all 21 5-card combinations for each 7-card hand (see the sketch below)
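
A minimal sketch of decision 3, assuming a lookup_5card function that returns the precomputed rank of a 5-card hand (the name and the lower-rank-is-stronger convention, matching phevaluator, are assumptions):

```python
from itertools import combinations

def rank_7card(cards, lookup_5card):
    """Rank a 7-card hand as the best of its C(7,5) = 21 five-card subsets."""
    # lookup_5card: 5-card combination -> rank, lower = stronger (assumed).
    return min(lookup_5card(combo) for combo in combinations(sorted(cards), 5))
```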

Benchmark Results

| Mode | Iterations | Time | Iter/sec | Notes |
| --- | --- | --- | --- | --- |
| CPU (phevaluator) | 100 | 3.90s | 26 | Baseline |
| GPU-on-CPU (PyTorch CPU) | 100 | 2.86s | 35 | 36% faster |
| GPU-on-GPU (AMD 7800 XT) | 100 | 10.28s | 10 | 2.6x slower |

Why GPU Is Slower: Analysis

1. Sequential Operation Pattern

The MCCFR algorithm traverses the game tree one iteration at a time, processing hands sequentially:

```python
for iteration in range(N):
    hand = deal_hand()       # Single hand
    traverse_tree(hand)      # Recursive, sequential;
                             # calls compute_ehs(hole, board) many times
```

Each compute_ehs() call does:

```python
for sample in range(100):              # Sequential loop
    torch.randperm(45)                 # GPU kernel launch
    torch.cat([...])                   # GPU kernel launch
    lookup_table.lookup_7card(...)     # GPU kernel launch
```

Result: 300+ small GPU kernel launches per EHS computation, ~thousands per iteration.

2. GPU Kernel Launch Overhead

GPU operations have fixed overhead per kernel launch:

  • Kernel launch: ~10-50μs on ROCm
  • Memory transfer: ~1-10μs for small tensors
  • Synchronization: Implicit sync on .item() calls

For 100 Monte Carlo samples with 3 operations each, that is 300 kernel launches, or roughly 3-15ms of pure launch overhead per EHS call.

CPU equivalent: ~0.01μs per operation (cache hit)
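
An illustrative microbenchmark of the gap (not from the original runs; ROCm builds of PyTorch expose the GPU under the "cuda" device name):

```python
import time
import torch

dev = "cuda"  # ROCm PyTorch also presents the GPU as "cuda"
x = torch.rand(100, 45, device=dev)

def timed_ms(fn):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3

many = timed_ms(lambda: [row.sort() for row in x])  # 100 tiny kernel launches
one = timed_ms(lambda: x.sort(dim=1))               # 1 batched kernel launch
print(f"100 launches: {many:.2f}ms, 1 launch: {one:.2f}ms")
```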

3. Small Batch Size Problem

The current implementation processes one hand at a time:

```python
# Current: Sequential, small batches
for i in range(iterations):
    ehs = compute_ehs(single_hole, single_board)  # 100 samples
```

GPU needs large batches to amortize overhead:

```python
# Ideal: Batch across many hands
holes = torch.stack([...])                    # 10,000 hands
boards = torch.stack([...])                   # 10,000 boards
ehs_batch = compute_ehs_batch(holes, boards)  # 1M samples in one kernel
```

4. AMD ROCm Specific Issues

  • RDNA 3 (7800 XT) needs workaround: HSA_OVERRIDE_GFX_VERSION=11.0.0
  • Kernel compilation overhead: First run compiles HIP kernels
  • Less optimized than CUDA: ROCm still maturing vs NVIDIA’s ecosystem

Where GPU Would Actually Help

GPU acceleration makes sense when:

| Scenario | Why GPU Helps |
| --- | --- |
| Pre-computing abstraction | Batch evaluate millions of hands at once (sketched below) |
| Real-time inference | Lookup table queries are O(1) on GPU |
| Batched training | Process 1000+ hands in parallel |
| Larger games (PLO) | 4-card hands = 270,725 hole combinations |
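
A hedged sketch of the first row: offline pre-computation batches naturally, amortizing launch overhead over millions of lookups (tensor and function names are illustrative):

```python
import torch

def precompute_ranks(rank_table, hand_indices, batch_size=1_000_000):
    """Evaluate millions of pre-indexed hands in large GPU batches, offline."""
    chunks = []
    for start in range(0, hand_indices.shape[0], batch_size):
        batch = hand_indices[start:start + batch_size]
        chunks.append(rank_table[batch])  # one gather kernel per million hands
    return torch.cat(chunks).cpu()        # transfer results back once
```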

The Right Architecture

For GPU speedup, restructure to batch across iterations:

```python
# Batch across iterations (not within a single EHS call)
def train_batched(iterations_per_batch=1000):
    # Deal all hands upfront
    hands = deal_hands(iterations_per_batch)  # Batch
    # Compute all EHS values in one GPU call
    all_ehs = compute_ehs_batch(hands)        # Single kernel
    # Update regrets using pre-computed EHS
    for hand, ehs in zip(hands, all_ehs):
        update_regrets(hand, ehs)
```

Conclusions

What Worked

  • Lookup table architecture is sound (~5MB for 2.6M hands)
  • PyTorch tensor operations on CPU are actually faster than raw Python
  • 36% speedup with --gpu-device cpu (no actual GPU)

What Didn’t Work

  • Sequential MCCFR traversal prevents GPU parallelism
  • Small batch sizes don’t amortize GPU kernel overhead
  • AMD ROCm has compatibility issues with RDNA 3

Recommendations

  1. For current workload: Use --gpu-device cpu for best performance
  2. For pre-computation: GPU is ideal for building abstraction tables
  3. For major speedup: Restructure MCCFR to batch across iterations
  4. For AMD GPUs: Set HSA_OVERRIDE_GFX_VERSION=11.0.0 in environment

Technical Details

Environment

  • PyTorch 2.5.1+rocm6.2
  • AMD Radeon RX 7800 XT (RDNA 3, gfx1101)
  • ROCm 6.2 with ROCk module 6.14.14

ROCm Workaround

```bash
# Add to ~/.bashrc
export HSA_OVERRIDE_GFX_VERSION=11.0.0
```

Files Modified

  • src/cfr_poker/holdem/gpu/__init__.py - GPU detection
  • src/cfr_poker/holdem/gpu/utils.py - Combinatorial indexing
  • src/cfr_poker/holdem/gpu/lookup_table.py - Hand rank lookup
  • src/cfr_poker/holdem/gpu/batched_ehs.py - Batched EHS
  • src/cfr_poker/holdem/abstraction.py - GPU routing
  • src/cfr_poker/holdem/mccfr.py - GPU parameters
  • scripts/train_holdem.py - CLI flags

Solution: Batched EHS Optimization

UPDATE (2025-12-03): We solved the kernel launch overhead problem!

See Batched EHS Optimization for the solution achieving 76x speedup:

  • Sequential GPU: 140 hands/sec
  • Batched GPU: 10,000+ hands/sec

Key techniques:

  1. BatchEHSCollector - Collect queries, compute in one batch (sketched below)
  2. RiverEHSTable - Persistent O(1) lookup cache
  3. Gumbel-max sampling for vectorized opponent selection
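
A minimal sketch of the collect-then-compute pattern behind technique 1 (class and method names below are illustrative, not the actual BatchEHSCollector API; a fixed board length per batch is assumed):

```python
import torch

class EHSQueryBatcher:
    """Accumulate (hole, board) EHS queries; resolve them in one GPU batch."""

    def __init__(self, compute_ehs_batch):
        self._compute_ehs_batch = compute_ehs_batch  # batched GPU routine
        self._queries = []

    def request(self, hole, board):
        """Register a query; returns its index into the eventual result batch."""
        self._queries.append((hole, board))
        return len(self._queries) - 1

    def flush(self):
        """Resolve all pending queries with a single batched computation."""
        holes = torch.stack([h for h, _ in self._queries])
        boards = torch.stack([b for _, b in self._queries])
        self._queries.clear()
        return self._compute_ehs_batch(holes, boards)  # one batched launch
```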

Profiling Analysis: Where Time Actually Goes

UPDATE (2025-12-03): We profiled 200 iterations to identify the actual bottleneck.

Time Breakdown (21.29s total, 9.4 iter/sec)

| Function | Self Time | % | Description |
| --- | --- | --- | --- |
| _compute_ehs_partial_board | 7.81s | 36.7% | GPU EHS for flop/turn |
| _compute_indices | 4.32s | 20.3% | Combinatorial index math |
| _sample_opponents | 2.12s | 10.0% | Random opponent sampling |
| torch.sort | 1.32s | 6.2% | 7-card hand sorting |
| torch.randperm | 1.31s | 6.2% | Random permutations |
| Other | ~4.4s | 20.6% | Tree traversal, regret updates |

Key Discovery: In-Memory Cache Already Has 97.4% Hit Rate!

```
Total _compute_ehs calls:  27,398
Actual GPU compute calls:  720
Cache hits:                26,678 (97.4% hit rate!)
In-memory cache entries:   1,080
```

The existing _ehs_cache in PostflopAbstraction is highly effective. Pre-computed tables provide no benefit because the cache is already working.
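
The mechanism is plain session-level memoization. A minimal sketch of the pattern (the real _ehs_cache lives in PostflopAbstraction; the exact key format is an assumption):

```python
def compute_ehs_cached(self, hole, board):
    """Memoize EHS by exact (hole, board) so repeat visits skip the GPU."""
    key = (tuple(sorted(hole)), tuple(sorted(board)))  # canonical key (assumed)
    if key not in self._ehs_cache:
        self._ehs_cache[key] = self._compute_ehs(hole, board)  # ~23.8ms GPU call
    return self._ehs_cache[key]
```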

The Real Bottleneck: Per-Call GPU Overhead

720 GPU EHS calls take 17.1 seconds = 23.8ms per call

Each EHS computation involves:

  • 100 Monte Carlo samples
  • Per sample: randperm (1.31s total) + sort (1.32s) + lookup_7card + comparisons
  • These are small tensor operations with high kernel launch overhead

Why Pre-computed Tables Don’t Help

| Scenario | Cache Hits | Bottleneck |
| --- | --- | --- |
| Without pre-computed | 97.4% (in-memory) | 720 GPU calls × 23.8ms |
| With pre-computed | ~97.5% (marginal gain) | Same 700+ GPU calls × 23.8ms |

The pre-computed tables only help with the ~2.6% cache misses, which are already being cached in-memory after first computation.

River Pre-computed Table: Also Useless

| Metric | Value |
| --- | --- |
| River bucket requests | 8,531 |
| Actual GPU computations | 360 |
| In-memory cache hit rate | 95.8% |
| Pre-computed entries | 1,000,000 |
| Total river combinations | 2,809,475,760 |
| Pre-computed coverage | 0.0356% |
| Expected hits from 1M table | 0.13 out of 360 |
| Time saved | 0.001 seconds |

The 1M pre-computed river entries cover only 0.036% of the 2.8B possible combinations, so the 360 GPU computations would be expected to hit the table only 360 × 0.000356 ≈ 0.13 times. The in-memory cache is what provides the 95.8% hit rate.

Actual Optimization Opportunities

  1. Batch multiple EHS queries - Collect queries across tree branches, compute in one GPU call
  2. Reduce samples per EHS - 100 samples → 50 samples (2x faster, slightly less accurate)
  3. Optimize _compute_indices - 20% of time in combinatorial math
  4. Async GPU - Overlap GPU work with tree traversal

Pre-computed EHS Tables: Why Random Sampling Doesn’t Help

UPDATE (2025-12-03): We extended pre-computed EHS to all streets (flop, turn, river) but found no significant improvement in training speed.

Implementation

  • Created unified precompute_ehs.py script for all streets
  • Extended BatchEHSCollectorV2 to support variable board lengths (3/4/5 cards)
  • Added CLI flags: --precomputed-ehs, --flop-cache-path, --turn-cache-path, --river-cache-path

Benchmark Results

| Configuration | Time (500 iter) | Iter/sec |
| --- | --- | --- |
| GPU baseline | 55.38s | 9 |
| GPU + Pre-computed EHS (100k/street) | 54.01s | 9 |

No significant improvement (~2% faster, within noise).

Why Random Pre-computation Fails

The fundamental issue is the state space explosion:

| Street | Total Combinations | 100k Cache | Coverage |
| --- | --- | --- | --- |
| Flop | 1,326 × C(50,3) = ~26M | 100,000 | 0.38% |
| Turn | 1,326 × C(50,4) = ~305M | 100,000 | 0.033% |
| River | 1,326 × C(50,5) = ~2.8B | 100,000 | 0.0036% |
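
These coverage figures follow directly from the combinatorics:

```python
from math import comb

hole_pairs = comb(52, 2)  # 1,326 possible hole-card pairs
for street, k in [("flop", 3), ("turn", 4), ("river", 5)]:
    total = hole_pairs * comb(50, k)  # boards drawn from the 50 unseen cards
    print(f"{street}: {total:,} states, 100k cache covers {100_000 / total:.4%}")
```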

Problem: Random pre-computed samples have <1% hit rate because training visits different random paths each iteration.

See EHS Caching Strategies for detailed analysis with diagrams.

Why Bucket-Based Caching Doesn’t Help Either

Initially we thought caching by bucket instead of exact (hole, board) would help, but this is logically circular:

  1. To look up by bucket, you need to know which bucket
  2. To know the bucket, you must compute EHS first
  3. If you already computed EHS, you have the answer

The in-memory (hole, board) → bucket cache already achieves 97.4% hit rate during training. The bottleneck is the ~720 cache misses that each require expensive GPU computation (23.8ms/call).

Conclusion: Pre-computation Approaches Exhausted

All pre-computation strategies have been evaluated and found ineffective:

| Approach | Result | Why |
| --- | --- | --- |
| Random pre-computed EHS | ❌ Ineffective | <1% hit rate on 2.8B state space |
| Bucket-based pre-computation | ❌ Logically impossible | Need EHS to determine bucket |
| In-memory session cache | ✅ Already working | 97.4% hit rate |

The real bottleneck was per-call GPU overhead (23.8ms × 720 calls = 17.1s).

Solution: Vectorized Monte Carlo Sampling (UPDATE 2025-12-03)

10.4x speedup achieved by vectorizing the Monte Carlo loop in _compute_ehs_partial_board():

The Problem

The original implementation had 300+ GPU kernel launches per EHS computation:

```python
for _ in range(100):               # 100 samples
    perm = torch.randperm(...)     # kernel launch
    hero_rank = lookup_7card(...)  # kernel launch
    opp_rank = lookup_7card(...)   # kernel launch
```

The Solution

Use the Gumbel-top-k trick for vectorized sampling and batch all rank computations:

# Single noise generation + topk selection
noise = torch.rand((100, n_available), device=self.device)
_, indices = noise.topk(n_cards_per_sample, dim=1)
# Batch all hero/opponent rank computations into 2 calls
hero_ranks = self.rank_table.lookup_7card(hero_hands) # (100,)
opp_ranks = self.rank_table.lookup_7card(opp_hands) # (100,)
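
Put together, a self-contained sketch of the vectorized computation for the river case with a complete board (the batched lookup_7card signature, the lower-rank-wins convention, and EHS = P(win) + 0.5·P(tie) are assumptions; the partial-board version additionally samples board completions the same way):

```python
import torch

def ehs_vectorized(lookup_7card, available, hole, board, n_samples=100,
                   device="cpu"):
    """Vectorized Monte Carlo EHS: all samples handled in a few batched kernels.

    lookup_7card: batched 7-card rank lookup, (n, 7) -> (n,), lower = stronger
    available:    (n_available,) tensor of undealt card indices
    hole:         (2,) hero hole cards; board: (5,) complete board
    """
    n_available = available.shape[0]
    # Uniform noise + top-k draws 2 opponent cards without replacement,
    # for all samples at once (replaces the per-sample randperm loop).
    noise = torch.rand((n_samples, n_available), device=device)
    _, idx = noise.topk(2, dim=1)                  # (n_samples, 2)
    opp_holes = available[idx]                     # (n_samples, 2)
    # Broadcast the hero hand and board across the sample dimension.
    hero_hands = torch.cat([hole, board]).expand(n_samples, 7)
    opp_hands = torch.cat([opp_holes, board.expand(n_samples, 5)], dim=1)
    hero_ranks = lookup_7card(hero_hands)          # one batched call
    opp_ranks = lookup_7card(opp_hands)            # one batched call
    # EHS = P(win) + 0.5 * P(tie), with the lower rank winning.
    wins = (hero_ranks < opp_ranks).float()
    ties = (hero_ranks == opp_ranks).float()
    return (wins + 0.5 * ties).mean().item()
```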

Results

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| 200 iterations | 21.29s | 2.04s | 10.4x |
| Iterations/sec | 9.4 | 98.1 | 10.4x |
| _compute_ehs_partial_board | 7.81s | 0.313s | 25x |
| GPU kernel launches/EHS | ~300 | ~8 | 37x fewer |

New Bottleneck

After vectorization, the bottleneck shifted to CPU showdown evaluation:

  • _evaluate_5card: 39.4% of time (comparing 7-card hands at showdown)
  • This is the expected final bottleneck for this architecture

Phase 4: Monte Carlo Sample Reduction (UPDATE 2025-12-03)

1.46x-1.76x speedup confirmed by reducing EHS samples during training:

Benchmark Results

| EHS Samples | Iterations/sec | Speedup vs 100 | Game Value | Notes |
| --- | --- | --- | --- | --- |
| 100 | 255 | 1.00x (baseline) | 0.7108 | Original |
| 75 | 313 | 1.23x | - | Linear improvement |
| 50 | 372 | 1.46x | 0.7958 | Good balance |
| 25 | 448 | 1.76x | 0.3992 | Fast but less accurate |

Trade-offs

  • 100 samples: Best statistical accuracy, slowest wall-clock training
  • 50 samples: Good balance - nearly 1.5x faster with acceptable accuracy (Monte Carlo error grows as 1/√n, so halving the samples raises EHS noise by only ~1.4x)
  • 25 samples: Very fast, but accuracy degrades noticeably

Recommendation

Default: Use --ehs-samples 50 for training:

  • 1.46x speedup with minimal accuracy loss
  • Still stochastically valid for MCCFR convergence
  • Sufficient for pre-flop strategy development

Use 100 samples only when high precision is critical (final exploitation tests).

Future Work

  1. Batch MCCFR implementation: Process multiple game states in parallel ✅ Done via BatchEHSCollector
  2. Pre-compute abstraction offline: Use GPU to build EHS lookup tables ✅ Attempted, ineffective
  3. Pre-compute flop/turn EHS: Extend pre-computation to all streets ✅ Attempted, ineffective
  4. Bucket-based EHS caching: Cache by abstraction bucket ❌ Logically impossible
  5. Vectorize Monte Carlo sampling: Batch all samples in a single GPU call ✅ 10.4x speedup achieved
  6. Reduce Monte Carlo samples: 100 → 50 samples for 2x speedup ✅ 1.46x speedup confirmed
  7. GPU showdown evaluation: Move _evaluate_5card to GPU (current 39.4% bottleneck)
  8. Adaptive sampling: Use fewer samples early, more samples for convergence
  9. CUDA comparison: Test on NVIDIA GPU for comparison