GPU Acceleration for CFR Poker - Findings and Analysis
Summary
We implemented GPU acceleration for EHS (Expected Hand Strength) computation in the CFR poker solver. Contrary to expectations, the GPU implementation is slower than CPU for the current workload pattern. This document explains why.
Initial Hypothesis
We expected GPU acceleration would provide significant speedup because:
- Large computation volume: Training runs ~47.3M hand evaluations per 10k iterations
- Parallel workload: Monte Carlo sampling is embarrassingly parallel
- Pre-computed lookup table: 2.6M 5-card hand rankings can be stored on GPU (~5MB)
What We Implemented
Phase 3 Components
| Component | Purpose | Location |
|---|---|---|
| `GPUHandRankTable` | Pre-computed lookup for all C(52,5) hands | `gpu/lookup_table.py` |
| `BatchedEHSComputer` | GPU-accelerated Monte Carlo EHS | `gpu/batched_ehs.py` |
| `PostflopAbstraction` | GPU routing for EHS computation | `abstraction.py` |
Key Design Decisions
- Lookup table approach: Pre-compute all 2,598,960 5-card hand rankings once
- Monte Carlo sampling on GPU: Sample opponent hands using GPU tensors
- 7-card evaluation: Evaluate all 21 5-card combinations for each 7-card hand
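To illustrate the third decision, a 7-card hand reduces to its best 5-card rank by checking all C(7,5) = 21 five-card subsets against the precomputed table. A minimal CPU-side sketch follows; the `rank_of_5card` mapping is hypothetical and stands in for the GPU lookup table:

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple

def best_5card_rank(cards7: Sequence[int],
                    rank_of_5card: Dict[Tuple[int, ...], int]) -> int:
    """Best rank among the C(7,5) = 21 five-card subsets of a 7-card hand.

    `rank_of_5card` is a hypothetical mapping from a sorted 5-card tuple to a
    precomputed rank (lower = stronger, as in phevaluator); the GPU version
    indexes a flat tensor by combinatorial rank instead of using a dict.
    """
    return min(rank_of_5card[tuple(sorted(combo))]
               for combo in combinations(cards7, 5))
```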
Benchmark Results
| Mode | Iterations | Time | Iter/sec | Notes |
|---|---|---|---|---|
| CPU (phevaluator) | 100 | 3.90s | 26 | Baseline |
| GPU-on-CPU (PyTorch CPU) | 100 | 2.86s | 35 | 36% faster |
| GPU-on-GPU (AMD 7800 XT) | 100 | 10.28s | 10 | 2.6x slower |
Why GPU Is Slower: Analysis
1. Sequential Operation Pattern
The MCCFR algorithm traverses the game tree one iteration at a time, processing hands sequentially:
```python
for iteration in range(N):
    hand = deal_hand()                 # Single hand
    traverse_tree(hand)                # Recursive, sequential
    compute_ehs(hole, board)           # Called many times per traversal
```

Each `compute_ehs()` call does:

```python
for sample in range(100):              # Sequential loop
    torch.randperm(45)                 # GPU kernel launch
    torch.cat([...])                   # GPU kernel launch
    lookup_table.lookup_7card(...)     # GPU kernel launch
```

Result: 300+ small GPU kernel launches per EHS computation, i.e. thousands per iteration.
2. GPU Kernel Launch Overhead
GPU operations have fixed overhead per kernel launch:
- Kernel launch: ~10-50μs on ROCm
- Memory transfer: ~1-10μs for small tensors
- Synchronization: Implicit sync on `.item()` calls
For 100 Monte Carlo samples with 3 operations each = 300 kernel launches.
CPU equivalent: ~0.01μs per operation (cache hit)
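To make the overhead concrete, here is a minimal timing sketch (illustrative only, not from the repository; absolute numbers vary by GPU, driver, and ROCm/CUDA version) contrasting many tiny launches with one batched op:

```python
import time
import torch

def launch_overhead_demo(n_launches: int = 300) -> None:
    """Contrast many tiny kernel launches with a single batched op."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.ones(8, device=device)
    _ = x + 1                                            # warm-up / lazy init
    if device == "cuda":
        torch.cuda.synchronize()

    t0 = time.perf_counter()
    for _ in range(n_launches):
        x = x + 1                                        # one tiny launch each
    if device == "cuda":
        torch.cuda.synchronize()
    many = time.perf_counter() - t0

    t0 = time.perf_counter()
    torch.ones((n_launches, 8), device=device).add_(1)   # one batched launch
    if device == "cuda":
        torch.cuda.synchronize()
    one = time.perf_counter() - t0

    print(f"{n_launches} tiny ops: {many * 1e3:.2f} ms "
          f"(~{many / n_launches * 1e6:.0f} µs each); batched: {one * 1e3:.2f} ms")

launch_overhead_demo()
```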
3. Small Batch Size Problem
The current implementation processes one hand at a time:
```python
# Current: Sequential, small batches
for i in range(iterations):
    ehs = compute_ehs(single_hole, single_board)    # 100 samples
```

GPU needs large batches to amortize overhead:

```python
# Ideal: Batch across many hands
holes = torch.stack([...])                       # 10,000 hands
boards = torch.stack([...])                      # 10,000 boards
ehs_batch = compute_ehs_batch(holes, boards)     # 1M samples in one kernel
```

4. AMD ROCm Specific Issues
- RDNA 3 (7800 XT) needs workaround: `HSA_OVERRIDE_GFX_VERSION=11.0.0`
- Kernel compilation overhead: First run compiles HIP kernels
- Less optimized than CUDA: ROCm still maturing vs NVIDIA’s ecosystem
Where GPU Would Actually Help
GPU acceleration makes sense when:
| Scenario | Why GPU Helps |
|---|---|
| Pre-computing abstraction | Batch evaluate millions of hands at once |
| Real-time inference | Lookup table queries are O(1) on GPU |
| Batched training | Process 1000+ hands in parallel |
| Larger games (PLO) | 4-card hands = 270,725 hole combinations |
The Right Architecture
For GPU speedup, restructure to batch across iterations:
```python
# Batch across iterations (not within a single EHS)
def train_batched(iterations_per_batch=1000):
    # Deal all hands upfront
    hands = deal_hands(iterations_per_batch)     # Batch

    # Compute all EHS values in one GPU call
    all_ehs = compute_ehs_batch(hands)           # Single kernel

    # Update regrets using the pre-computed EHS
    for hand, ehs in zip(hands, all_ehs):
        update_regrets(hand, ehs)
```

Conclusions
What Worked
- Lookup table architecture is sound (~5MB for 2.6M hands)
- PyTorch tensor operations on CPU are actually faster than raw Python
- 36% speedup with `--gpu-device cpu` (no actual GPU)
What Didn’t Work
- Sequential MCCFR traversal prevents GPU parallelism
- Small batch sizes don’t amortize GPU kernel overhead
- AMD ROCm has compatibility issues with RDNA 3
Recommendations
- For current workload: Use `--gpu-device cpu` for best performance
- For pre-computation: GPU is ideal for building abstraction tables
- For major speedup: Restructure MCCFR to batch across iterations
- For AMD GPUs: Set `HSA_OVERRIDE_GFX_VERSION=11.0.0` in the environment
Technical Details
Environment
- PyTorch 2.5.1+rocm6.2
- AMD Radeon RX 7800 XT (RDNA 3, gfx1101)
- ROCm 6.2 with ROCk module 6.14.14
ROCm Workaround
```bash
# Add to ~/.bashrc
export HSA_OVERRIDE_GFX_VERSION=11.0.0
```

Files Modified
- `src/cfr_poker/holdem/gpu/__init__.py` - GPU detection
- `src/cfr_poker/holdem/gpu/utils.py` - Combinatorial indexing
- `src/cfr_poker/holdem/gpu/lookup_table.py` - Hand rank lookup
- `src/cfr_poker/holdem/gpu/batched_ehs.py` - Batched EHS
- `src/cfr_poker/holdem/abstraction.py` - GPU routing
- `src/cfr_poker/holdem/mccfr.py` - GPU parameters
- `scripts/train_holdem.py` - CLI flags
Solution: Batched EHS Optimization
UPDATE (2025-12-03): We solved the kernel launch overhead problem!
See Batched EHS Optimization for the solution achieving 76x speedup:
- Sequential GPU: 140 hands/sec
- Batched GPU: 10,000+ hands/sec
Key techniques:
- `BatchEHSCollector` - Collect queries, compute in one batch (see the sketch below)
- `RiverEHSTable` - Persistent O(1) lookup cache
- Gumbel-max sampling for vectorized opponent selection
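A minimal sketch of the collect-then-flush pattern behind the first technique; the class and method names below are hypothetical and do not reflect the repository's `BatchEHSCollector` API, and the batched compute function is passed in:

```python
from typing import Callable, Dict, List, Tuple

Query = Tuple[Tuple[int, ...], Tuple[int, ...]]   # (hole cards, board cards)

class EHSQueryBatcher:
    """Hypothetical collect-then-flush batcher (not the repo's BatchEHSCollector).

    Queries accumulate during tree traversal; a single batched GPU call then
    resolves all of them, paying kernel-launch overhead once per flush.
    """

    def __init__(self, batch_compute: Callable[[List[Query]], List[float]]):
        self._batch_compute = batch_compute        # e.g. a GPU compute_ehs_batch
        self._pending: List[Query] = []
        self._results: Dict[Query, float] = {}

    def request(self, hole: Tuple[int, ...], board: Tuple[int, ...]) -> None:
        """Record a query; its value is available after the next flush()."""
        self._pending.append((hole, board))

    def flush(self) -> Dict[Query, float]:
        """Resolve all pending queries with one batched call."""
        if self._pending:
            values = self._batch_compute(self._pending)
            self._results.update(zip(self._pending, values))
            self._pending.clear()
        return self._results
```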
Profiling Analysis: Where Time Actually Goes
UPDATE (2025-12-03): We profiled 200 iterations to identify the actual bottleneck.
Time Breakdown (21.29s total, 9.4 iter/sec)
| Function | Self Time | % | Description |
|---|---|---|---|
| `_compute_ehs_partial_board` | 7.81s | 36.7% | GPU EHS for flop/turn |
| `_compute_indices` | 4.32s | 20.3% | Combinatorial index math |
| `_sample_opponents` | 2.12s | 10.0% | Random opponent sampling |
| `torch.sort` | 1.32s | 6.2% | 7-card hand sorting |
| `torch.randperm` | 1.31s | 6.2% | Random permutations |
| Other | ~4.4s | 20.6% | Tree traversal, regret updates |
Key Discovery: In-Memory Cache Already Has 97.4% Hit Rate!
```
Total _compute_ehs calls:   27,398
Actual GPU compute calls:   720
Cache hits:                 26,678 (97.4% hit rate!)
In-memory cache entries:    1,080
```

The existing `_ehs_cache` in `PostflopAbstraction` is highly effective. Pre-computed tables provide no benefit because the cache is already working.
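For reference, the memo involved is essentially a dict keyed by the exact (hole, board) pair; a tiny illustrative version with hit counting (hypothetical wrapper, not the actual `_ehs_cache` code):

```python
from typing import Callable, Dict, Tuple

Key = Tuple[Tuple[int, ...], Tuple[int, ...]]      # (hole, board)

class EHSMemo:
    """Illustrative (hole, board) -> EHS memo with hit/miss counters."""

    def __init__(self, compute: Callable[[Key], float]):
        self._compute = compute                    # the expensive GPU path
        self._cache: Dict[Key, float] = {}
        self.hits = 0
        self.misses = 0

    def ehs(self, hole: Tuple[int, ...], board: Tuple[int, ...]) -> float:
        key = (tuple(sorted(hole)), tuple(sorted(board)))
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._compute(key)  # only on a miss
        return self._cache[key]
```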
The Real Bottleneck: Per-Call GPU Overhead
720 GPU EHS calls take 17.1 seconds = 23.8ms per call
Each EHS computation involves:
- 100 Monte Carlo samples
- Per sample: `randperm` (1.31s total) + `sort` (1.32s) + `lookup_7card` + comparisons
- These are small tensor operations with high kernel launch overhead
Why Pre-computed Tables Don’t Help
| Scenario | Cache Hits | Bottleneck |
|---|---|---|
| Without pre-computed | 97.4% (in-memory) | 720 GPU calls × 23.8ms |
| With pre-computed | ~97.5% (marginal gain) | Same 700+ GPU calls × 23.8ms |
The pre-computed tables only help with the ~2.6% cache misses, which are already being cached in-memory after first computation.
River Pre-computed Table: Also Useless
| Metric | Value |
|---|---|
| River bucket requests | 8,531 |
| Actual GPU computations | 360 |
| In-memory cache hit rate | 95.8% |
| Pre-computed entries | 1,000,000 |
| Total river combinations | 2,809,475,760 |
| Pre-computed coverage | 0.0356% |
| Expected hits from 1M table | 0.13 out of 360 |
| Time saved | 0.001 seconds |
The 1M pre-computed river entries cover only 0.036% of the 2.8B possible combinations. The in-memory cache is what provides the 95.8% hit rate.
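The expected-hit figure follows directly from the coverage; a quick check with standard combinatorics (no project code involved):

```python
from math import comb

total = comb(52, 2) * comb(50, 5)      # 2,809,475,760 (hole, board) river states
coverage = 1_000_000 / total           # ~0.0356% of the space pre-computed
expected_hits = 360 * coverage         # ~0.13 of the 360 GPU computations
print(f"coverage {coverage:.4%}, expected hits {expected_hits:.2f}")
```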
Actual Optimization Opportunities
- Batch multiple EHS queries - Collect queries across tree branches, compute in one GPU call
- Reduce samples per EHS - 100 samples → 50 samples (2x faster, slightly less accurate)
- Optimize `_compute_indices` - 20% of time in combinatorial index math
- Async GPU - Overlap GPU work with tree traversal
Pre-computed EHS Tables: Why Random Sampling Doesn’t Help
UPDATE (2025-12-03): We extended pre-computed EHS to all streets (flop, turn, river) but found no significant improvement in training speed.
Implementation
- Created unified `precompute_ehs.py` script for all streets
- Extended `BatchEHSCollectorV2` to support variable board lengths (3/4/5 cards)
- Added CLI flags: `--precomputed-ehs`, `--flop-cache-path`, `--turn-cache-path`, `--river-cache-path`
Benchmark Results
| Configuration | Time (500 iter) | Iter/sec |
|---|---|---|
| GPU baseline | 55.38s | 9 |
| GPU + Pre-computed EHS (100k/street) | 54.01s | 9 |
No significant improvement (~2% faster, within noise).
Why Random Pre-computation Fails
The fundamental issue is the state space explosion:
| Street | Total Combinations | 100k Cache | Coverage |
|---|---|---|---|
| Flop | 1,326 × C(50,3) = ~26M | 100,000 | 0.38% |
| Turn | 1,326 × C(50,4) = ~305M | 100,000 | 0.033% |
| River | 1,326 × C(50,5) = ~2.8B | 100,000 | 0.0036% |
Problem: Random pre-computed samples have <1% hit rate because training visits different random paths each iteration.
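The combination counts and coverage figures in the table above can be reproduced with standard combinatorics:

```python
from math import comb

# (hole, board) state counts per street and coverage of a 100k-entry cache
for street, board_size in (("flop", 3), ("turn", 4), ("river", 5)):
    total = comb(52, 2) * comb(50, board_size)
    print(f"{street:5s}: {total:>13,} combinations, "
          f"100k cache covers {100_000 / total:.4%}")
```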
See EHS Caching Strategies for detailed analysis with diagrams.
Why Bucket-Based Caching Doesn’t Help Either
Initially we thought caching by bucket instead of exact (hole, board) would help, but this is logically circular:
- To look up by bucket, you need to know which bucket
- To know the bucket, you must compute EHS first
- If you already computed EHS, you have the answer
The in-memory (hole, board) → bucket cache already achieves 97.4% hit rate during training. The bottleneck is the ~720 cache misses that each require expensive GPU computation (23.8ms/call).
Conclusion: Pre-computation Approaches Exhausted
All pre-computation strategies have been evaluated and found ineffective:
| Approach | Result | Why |
|---|---|---|
| Random pre-computed EHS | ❌ Ineffective | <1% hit rate on 2.8B state space |
| Bucket-based pre-computation | ❌ Logically impossible | Need EHS to determine bucket |
| In-memory session cache | ✅ Already working | 97.4% hit rate |
The real bottleneck was per-call GPU overhead (23.8ms × 720 calls = 17.1s).
Solution: Vectorized Monte Carlo Sampling (UPDATE 2025-12-03)
10.4x speedup achieved by vectorizing the Monte Carlo loop in _compute_ehs_partial_board():
The Problem
The original implementation had 300+ GPU kernel launches per EHS computation:
```python
for _ in range(100):                   # 100 samples
    perm = torch.randperm(...)         # kernel launch
    hero_rank = lookup_7card(...)      # kernel launch
    opp_rank = lookup_7card(...)       # kernel launch
```

The Solution
Use the Gumbel-top-k trick for vectorized sampling and batch all rank computations:
```python
# Single noise generation + topk selection
noise = torch.rand((100, n_available), device=self.device)
_, indices = noise.topk(n_cards_per_sample, dim=1)

# Batch all hero/opponent rank computations into 2 calls
hero_ranks = self.rank_table.lookup_7card(hero_hands)   # (100,)
opp_ranks = self.rank_table.lookup_7card(opp_hands)     # (100,)
```
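For context, here is a self-contained sketch of the vectorized sampling step; the helper name, shapes, and toy example are illustrative rather than the repository's code:

```python
import torch

def sample_opponent_holes(available: torch.Tensor, n_samples: int = 100,
                          cards_per_hand: int = 2) -> torch.Tensor:
    """Draw n_samples opponent hole-card pairs without replacement, vectorized.

    With uniform card probabilities, the Gumbel-top-k trick reduces to taking
    the top-k of plain random noise: each row's top-k indices form a uniform
    random subset, so one (n_samples, n_available) tensor replaces a Python
    loop of torch.randperm kernel launches.
    """
    noise = torch.rand((n_samples, available.shape[0]), device=available.device)
    _, idx = noise.topk(cards_per_hand, dim=1)      # (n_samples, cards_per_hand)
    return available[idx]                           # gather actual card indices

# Toy example: 45 unseen cards on the turn, 100 Monte Carlo samples
opp_holes = sample_opponent_holes(torch.arange(45))
print(opp_holes.shape)                              # torch.Size([100, 2])
```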
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| 200 iterations | 21.29s | 2.04s | 10.4x |
| Iterations/sec | 9.4 | 98.1 | 10.4x |
| `_compute_ehs_partial_board` | 7.81s | 0.313s | 25x |
| GPU kernel launches/EHS | ~300 | ~8 | 37x fewer |
New Bottleneck
After vectorization, the bottleneck shifted to CPU showdown evaluation:
- `_evaluate_5card`: 39.4% of time (comparing 7-card hands at showdown)
- This is the expected final bottleneck for this architecture
Phase 4: Monte Carlo Sample Reduction (UPDATE 2025-12-03)
1.46x-1.76x speedup confirmed by reducing EHS samples during training:
Benchmark Results
| EHS Samples | Iterations/sec | Speedup vs 100 | Game Value | Notes |
|---|---|---|---|---|
| 100 | 255 | 1.00x (baseline) | 0.7108 | Original |
| 75 | 313 | 1.23x | - | Linear improvement |
| 50 | 372 | 1.46x | 0.7958 | Good balance |
| 25 | 448 | 1.76x | 0.3992 | Fast but less accurate |
Trade-offs
- 100 samples: Better statistical accuracy, slower convergence
- 50 samples: Good balance - nearly 1.5x faster with acceptable accuracy
- 25 samples: Very fast but precision noticeably affected
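A back-of-the-envelope way to see the accuracy cost: the Monte Carlo EHS estimate averages roughly Bernoulli outcomes (ties aside), so its worst-case standard error scales as 0.5/sqrt(n):

```python
from math import sqrt

# Worst-case (p = 0.5) standard error of an n-sample Monte Carlo EHS estimate
for n in (25, 50, 100):
    print(f"{n:3d} samples -> standard error ~ {0.5 / sqrt(n):.3f}")
# 25 -> 0.100, 50 -> 0.071, 100 -> 0.050
```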
Recommendation
Default: Use `--ehs-samples 50` for training:
- 1.46x speedup with minimal accuracy loss
- Still stochastically valid for MCCFR convergence
- Sufficient for pre-flop strategy development
Use 100 samples only when high precision is critical (final exploitation tests).
Future Work
- Batch MCCFR implementation: Process multiple game states in parallel ✅ Done via `BatchEHSCollector`
- Pre-compute abstraction offline: Use GPU to build EHS lookup tables ✅ Attempted, ineffective
- Pre-compute flop/turn EHS: Extend pre-computation to all streets ✅ Attempted, ineffective
- Bucket-based EHS caching: Cache by abstraction bucket ❌ Logically impossible
- Vectorize Monte Carlo sampling: Batch all samples in a single GPU call ✅ 10.4x speedup achieved
- Reduce Monte Carlo samples: 100 → 50 samples for 2x speedup ✅ 1.46x speedup confirmed
- GPU showdown evaluation: Move `_evaluate_5card` to GPU (current 39.4% bottleneck)
- Adaptive sampling: Use fewer samples early, more samples for convergence
- CUDA comparison: Test the same workload on an NVIDIA GPU