Summary

We implemented GPU acceleration for EHS (Expected Hand Strength) computation in the CFR poker solver. Contrary to expectations, the GPU implementation is slower than CPU for the current workload pattern. This document explains why.

Initial Hypothesis

We expected GPU acceleration would provide significant speedup because:

  1. Large computation volume: Training runs ~47.3M hand evaluations per 10k iterations
  2. Parallel workload: Monte Carlo sampling is embarrassingly parallel
  3. Pre-computed lookup table: 2.6M 5-card hand rankings can be stored on GPU (~5MB)

What We Implemented

Phase 3 Components

| Component | Purpose | Location |
| --- | --- | --- |
| GPUHandRankTable | Pre-computed lookup for all C(52,5) hands | gpu/lookup_table.py |
| BatchedEHSComputer | GPU-accelerated Monte Carlo EHS | gpu/batched_ehs.py |
| PostflopAbstraction | GPU routing for EHS computation | abstraction.py |
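
The table is addressable because every 5-card combination maps to a unique flat index. A minimal sketch of that mapping using the combinatorial number system (the function name is illustrative; the actual gpu/utils.py indexing may differ):

```python
from math import comb

def comb_index(cards):
    """Map a sorted 5-card combination (card indices 0-51) to a unique
    index in [0, C(52,5)), i.e. 0 to 2,598,959."""
    return sum(comb(c, k + 1) for k, c in enumerate(sorted(cards)))

# Stored as uint16 ranks, C(52,5) = 2,598,960 entries comes to ~5MB.
```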

Key Design Decisions

  1. Lookup table approach: Pre-compute all 2,598,960 5-card hand rankings once
  2. Monte Carlo sampling on GPU: Sample opponent hands using GPU tensors
  3. 7-card evaluation: Evaluate all 21 5-card combinations for each 7-card hand (see the sketch below)
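
A minimal sketch of decision 3, assuming a lookup_5card function that returns the precomputed rank of a 5-card hand (the name and the lower-rank-is-stronger convention, matching phevaluator, are assumptions):

```python
from itertools import combinations

def rank_7card(cards, lookup_5card):
    """Rank a 7-card hand as the best of its C(7,5) = 21 five-card subsets."""
    # lookup_5card: 5-card combination -> rank, lower = stronger (assumed).
    return min(lookup_5card(combo) for combo in combinations(sorted(cards), 5))
```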

Benchmark Results

| Mode | Iterations | Time | Iter/sec | Notes |
| --- | --- | --- | --- | --- |
| CPU (phevaluator) | 100 | 3.90s | 26 | Baseline |
| GPU-on-CPU (PyTorch CPU) | 100 | 2.86s | 35 | 36% faster |
| GPU-on-GPU (AMD 7800 XT) | 100 | 10.28s | 10 | 2.6x slower |

Why GPU Is Slower: Analysis

1. Sequential Operation Pattern

The MCCFR algorithm traverses the game tree one iteration at a time, processing hands sequentially:

```python
for iteration in range(N):
    hand = deal_hand()       # Single hand
    traverse_tree(hand)      # Recursive, sequential;
                             # calls compute_ehs(hole, board) many times
```

Each compute_ehs() call does:

```python
for sample in range(100):              # Sequential loop
    torch.randperm(45)                 # GPU kernel launch
    torch.cat([...])                   # GPU kernel launch
    lookup_table.lookup_7card(...)     # GPU kernel launch
```

Result: 300+ small GPU kernel launches per EHS computation, ~thousands per iteration.

2. GPU Kernel Launch Overhead

GPU operations have fixed overhead per kernel launch:

  • Kernel launch: ~10-50μs on ROCm
  • Memory transfer: ~1-10μs for small tensors
  • Synchronization: Implicit sync on .item() calls

For 100 Monte Carlo samples with 3 operations each, that is 300 kernel launches, or roughly 3-15ms of pure launch overhead per EHS call.

CPU equivalent: ~0.01μs per operation (cache hit)
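
An illustrative microbenchmark of the gap (not from the original runs; ROCm builds of PyTorch expose the GPU under the "cuda" device name):

```python
import time
import torch

dev = "cuda"  # ROCm PyTorch also presents the GPU as "cuda"
x = torch.rand(100, 45, device=dev)

def timed_ms(fn):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1e3

many = timed_ms(lambda: [row.sort() for row in x])  # 100 tiny kernel launches
one = timed_ms(lambda: x.sort(dim=1))               # 1 batched kernel launch
print(f"100 launches: {many:.2f}ms, 1 launch: {one:.2f}ms")
```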

3. Small Batch Size Problem

The current implementation processes one hand at a time:

```python
# Current: Sequential, small batches
for i in range(iterations):
    ehs = compute_ehs(single_hole, single_board)  # 100 samples
```

GPU needs large batches to amortize overhead:

```python
# Ideal: Batch across many hands
holes = torch.stack([...])                    # 10,000 hands
boards = torch.stack([...])                   # 10,000 boards
ehs_batch = compute_ehs_batch(holes, boards)  # 1M samples in one kernel
```

4. AMD ROCm Specific Issues

  • RDNA 3 (7800 XT) needs workaround: HSA_OVERRIDE_GFX_VERSION=11.0.0
  • Kernel compilation overhead: First run compiles HIP kernels
  • Less optimized than CUDA: ROCm still maturing vs NVIDIA’s ecosystem

Where GPU Would Actually Help

GPU acceleration makes sense when:

| Scenario | Why GPU Helps |
| --- | --- |
| Pre-computing abstraction | Batch evaluate millions of hands at once (sketched below) |
| Real-time inference | Lookup table queries are O(1) on GPU |
| Batched training | Process 1000+ hands in parallel |
| Larger games (PLO) | 4-card hands = 270,725 hole combinations |
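
A hedged sketch of the first row: offline pre-computation batches naturally, amortizing launch overhead over millions of lookups (tensor and function names are illustrative):

```python
import torch

def precompute_ranks(rank_table, hand_indices, batch_size=1_000_000):
    """Evaluate millions of pre-indexed hands in large GPU batches, offline."""
    chunks = []
    for start in range(0, hand_indices.shape[0], batch_size):
        batch = hand_indices[start:start + batch_size]
        chunks.append(rank_table[batch])  # one gather kernel per million hands
    return torch.cat(chunks).cpu()        # transfer results back once
```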

The Right Architecture

For GPU speedup, restructure to batch across iterations:

```python
# Batch across iterations (not within a single EHS call)
def train_batched(iterations_per_batch=1000):
    # Deal all hands upfront
    hands = deal_hands(iterations_per_batch)  # Batch
    # Compute all EHS values in one GPU call
    all_ehs = compute_ehs_batch(hands)        # Single kernel
    # Update regrets using pre-computed EHS
    for hand, ehs in zip(hands, all_ehs):
        update_regrets(hand, ehs)
```

Conclusions

What Worked

  • Lookup table architecture is sound (~5MB for 2.6M hands)
  • PyTorch tensor operations on CPU are actually faster than raw Python
  • 36% speedup with --gpu-device cpu (no actual GPU)

What Didn’t Work

  • Sequential MCCFR traversal prevents GPU parallelism
  • Small batch sizes don’t amortize GPU kernel overhead
  • AMD ROCm has compatibility issues with RDNA 3

Recommendations

  1. For current workload: Use --gpu-device cpu for best performance
  2. For pre-computation: GPU is ideal for building abstraction tables
  3. For major speedup: Restructure MCCFR to batch across iterations
  4. For AMD GPUs: Set HSA_OVERRIDE_GFX_VERSION=11.0.0 in environment

Technical Details

Environment

  • PyTorch 2.5.1+rocm6.2
  • AMD Radeon RX 7800 XT (RDNA 3, gfx1101)
  • ROCm 6.2 with ROCk module 6.14.14

ROCm Workaround

```bash
# Add to ~/.bashrc
export HSA_OVERRIDE_GFX_VERSION=11.0.0
```

Files Modified

  • src/cfr_poker/holdem/gpu/__init__.py - GPU detection
  • src/cfr_poker/holdem/gpu/utils.py - Combinatorial indexing
  • src/cfr_poker/holdem/gpu/lookup_table.py - Hand rank lookup
  • src/cfr_poker/holdem/gpu/batched_ehs.py - Batched EHS
  • src/cfr_poker/holdem/abstraction.py - GPU routing
  • src/cfr_poker/holdem/mccfr.py - GPU parameters
  • scripts/train_holdem.py - CLI flags

Solution: Batched EHS Optimization

UPDATE (2025-12-03): We solved the kernel launch overhead problem!

See Batched EHS Optimization for the solution achieving 76x speedup:

  • Sequential GPU: 140 hands/sec
  • Batched GPU: 10,000+ hands/sec

Key techniques:

  1. BatchEHSCollector - Collect queries, compute in one batch (sketched below)
  2. RiverEHSTable - Persistent O(1) lookup cache
  3. Gumbel-max sampling for vectorized opponent selection
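
A minimal sketch of the collect-then-compute pattern behind technique 1 (class and method names below are illustrative, not the actual BatchEHSCollector API; a fixed board length per batch is assumed):

```python
import torch

class EHSQueryBatcher:
    """Accumulate (hole, board) EHS queries; resolve them in one GPU batch."""

    def __init__(self, compute_ehs_batch):
        self._compute_ehs_batch = compute_ehs_batch  # batched GPU routine
        self._queries = []

    def request(self, hole, board):
        """Register a query; returns its index into the eventual result batch."""
        self._queries.append((hole, board))
        return len(self._queries) - 1

    def flush(self):
        """Resolve all pending queries with a single batched computation."""
        holes = torch.stack([h for h, _ in self._queries])
        boards = torch.stack([b for _, b in self._queries])
        self._queries.clear()
        return self._compute_ehs_batch(holes, boards)  # one batched launch
```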

Profiling Analysis: Where Time Actually Goes

UPDATE (2025-12-03): We profiled 200 iterations to identify the actual bottleneck.

Time Breakdown (21.29s total, 9.4 iter/sec)

| Function | Self Time | % | Description |
| --- | --- | --- | --- |
| _compute_ehs_partial_board | 7.81s | 36.7% | GPU EHS for flop/turn |
| _compute_indices | 4.32s | 20.3% | Combinatorial index math |
| _sample_opponents | 2.12s | 10.0% | Random opponent sampling |
| torch.sort | 1.32s | 6.2% | 7-card hand sorting |
| torch.randperm | 1.31s | 6.2% | Random permutations |
| Other | ~4.4s | 20.6% | Tree traversal, regret updates |

Key Discovery: In-Memory Cache Already Has 97.4% Hit Rate!

```
Total _compute_ehs calls:  27,398
Actual GPU compute calls:  720
Cache hits:                26,678 (97.4% hit rate!)
In-memory cache entries:   1,080
```

The existing _ehs_cache in PostflopAbstraction is highly effective. Pre-computed tables provide no benefit because the cache is already working.
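
The mechanism is plain session-level memoization. A minimal sketch of the pattern (the real _ehs_cache lives in PostflopAbstraction; the exact key format is an assumption):

```python
def compute_ehs_cached(self, hole, board):
    """Memoize EHS by exact (hole, board) so repeat visits skip the GPU."""
    key = (tuple(sorted(hole)), tuple(sorted(board)))  # canonical key (assumed)
    if key not in self._ehs_cache:
        self._ehs_cache[key] = self._compute_ehs(hole, board)  # ~23.8ms GPU call
    return self._ehs_cache[key]
```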

The Real Bottleneck: Per-Call GPU Overhead

720 GPU EHS calls take 17.1 seconds = 23.8ms per call

Each EHS computation involves:

  • 100 Monte Carlo samples
  • Per sample: randperm (1.31s total) + sort (1.32s) + lookup_7card + comparisons
  • These are small tensor operations with high kernel launch overhead

Why Pre-computed Tables Don’t Help

| Scenario | Cache Hits | Bottleneck |
| --- | --- | --- |
| Without pre-computed | 97.4% (in-memory) | 720 GPU calls × 23.8ms |
| With pre-computed | ~97.5% (marginal gain) | Same 700+ GPU calls × 23.8ms |

The pre-computed tables only help with the ~2.6% cache misses, which are already being cached in-memory after first computation.

River Pre-computed Table: Also Useless

| Metric | Value |
| --- | --- |
| River bucket requests | 8,531 |
| Actual GPU computations | 360 |
| In-memory cache hit rate | 95.8% |
| Pre-computed entries | 1,000,000 |
| Total river combinations | 2,809,475,760 |
| Pre-computed coverage | 0.0356% |
| Expected hits from 1M table | 0.13 out of 360 |
| Time saved | 0.001 seconds |

The 1M pre-computed river entries cover only 0.036% of the 2.8B possible combinations, so the 360 GPU computations would be expected to hit the table only 360 × 0.000356 ≈ 0.13 times. The in-memory cache is what provides the 95.8% hit rate.

Actual Optimization Opportunities

  1. Batch multiple EHS queries - Collect queries across tree branches, compute in one GPU call
  2. Reduce samples per EHS - 100 samples → 50 samples (2x faster, slightly less accurate)
  3. Optimize _compute_indices - 20% of time in combinatorial math
  4. Async GPU - Overlap GPU work with tree traversal

Pre-computed EHS Tables: Why Random Sampling Doesn’t Help

UPDATE (2025-12-03): We extended pre-computed EHS to all streets (flop, turn, river) but found no significant improvement in training speed.

Implementation

  • Created unified precompute_ehs.py script for all streets
  • Extended BatchEHSCollectorV2 to support variable board lengths (3/4/5 cards)
  • Added CLI flags: --precomputed-ehs, --flop-cache-path, --turn-cache-path, --river-cache-path

Benchmark Results

| Configuration | Time (500 iter) | Iter/sec |
| --- | --- | --- |
| GPU baseline | 55.38s | 9 |
| GPU + Pre-computed EHS (100k/street) | 54.01s | 9 |

No significant improvement (~2% faster, within noise).

Why Random Pre-computation Fails

The fundamental issue is the state space explosion:

| Street | Total Combinations | 100k Cache | Coverage |
| --- | --- | --- | --- |
| Flop | 1,326 × C(50,3) = ~26M | 100,000 | 0.38% |
| Turn | 1,326 × C(50,4) = ~305M | 100,000 | 0.033% |
| River | 1,326 × C(50,5) = ~2.8B | 100,000 | 0.0036% |
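
These coverage figures follow directly from the combinatorics:

```python
from math import comb

hole_pairs = comb(52, 2)  # 1,326 possible hole-card pairs
for street, k in [("flop", 3), ("turn", 4), ("river", 5)]:
    total = hole_pairs * comb(50, k)  # boards drawn from the 50 unseen cards
    print(f"{street}: {total:,} states, 100k cache covers {100_000 / total:.4%}")
```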

Problem: Random pre-computed samples have <1% hit rate because training visits different random paths each iteration.

See EHS Caching Strategies for detailed analysis with diagrams.

Why Bucket-Based Caching Doesn’t Help Either

Initially we thought caching by bucket instead of exact (hole, board) would help, but this is logically circular:

  1. To look up by bucket, you need to know which bucket
  2. To know the bucket, you must compute EHS first
  3. If you already computed EHS, you have the answer

The in-memory (hole, board) → bucket cache already achieves 97.4% hit rate during training. The bottleneck is the ~720 cache misses that each require expensive GPU computation (23.8ms/call).

Conclusion: Pre-computation Approaches Exhausted

All pre-computation strategies have been evaluated and found ineffective:

| Approach | Result | Why |
| --- | --- | --- |
| Random pre-computed EHS | ❌ Ineffective | <1% hit rate on 2.8B state space |
| Bucket-based pre-computation | ❌ Logically impossible | Need EHS to determine bucket |
| In-memory session cache | ✅ Already working | 97.4% hit rate |

The real bottleneck was per-call GPU overhead (23.8ms × 720 calls = 17.1s).

Solution: Vectorized Monte Carlo Sampling (UPDATE 2025-12-03)

10.4x speedup achieved by vectorizing the Monte Carlo loop in _compute_ehs_partial_board():

The Problem

The original implementation had 300+ GPU kernel launches per EHS computation:

```python
for _ in range(100):               # 100 samples
    perm = torch.randperm(...)     # kernel launch
    hero_rank = lookup_7card(...)  # kernel launch
    opp_rank = lookup_7card(...)   # kernel launch
```

The Solution

Use the Gumbel-top-k trick for vectorized sampling and batch all rank computations:

# Single noise generation + topk selection
noise = torch.rand((100, n_available), device=self.device)
_, indices = noise.topk(n_cards_per_sample, dim=1)
# Batch all hero/opponent rank computations into 2 calls
hero_ranks = self.rank_table.lookup_7card(hero_hands) # (100,)
opp_ranks = self.rank_table.lookup_7card(opp_hands) # (100,)
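
Put together, a self-contained sketch of the vectorized computation for the river case with a complete board (the batched lookup_7card signature, the lower-rank-wins convention, and EHS = P(win) + 0.5·P(tie) are assumptions; the partial-board version additionally samples board completions the same way):

```python
import torch

def ehs_vectorized(lookup_7card, available, hole, board, n_samples=100,
                   device="cpu"):
    """Vectorized Monte Carlo EHS: all samples handled in a few batched kernels.

    lookup_7card: batched 7-card rank lookup, (n, 7) -> (n,), lower = stronger
    available:    (n_available,) tensor of undealt card indices
    hole:         (2,) hero hole cards; board: (5,) complete board
    """
    n_available = available.shape[0]
    # Uniform noise + top-k draws 2 opponent cards without replacement,
    # for all samples at once (replaces the per-sample randperm loop).
    noise = torch.rand((n_samples, n_available), device=device)
    _, idx = noise.topk(2, dim=1)                  # (n_samples, 2)
    opp_holes = available[idx]                     # (n_samples, 2)
    # Broadcast the hero hand and board across the sample dimension.
    hero_hands = torch.cat([hole, board]).expand(n_samples, 7)
    opp_hands = torch.cat([opp_holes, board.expand(n_samples, 5)], dim=1)
    hero_ranks = lookup_7card(hero_hands)          # one batched call
    opp_ranks = lookup_7card(opp_hands)            # one batched call
    # EHS = P(win) + 0.5 * P(tie), with the lower rank winning.
    wins = (hero_ranks < opp_ranks).float()
    ties = (hero_ranks == opp_ranks).float()
    return (wins + 0.5 * ties).mean().item()
```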

Results

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| 200 iterations | 21.29s | 2.04s | 10.4x |
| Iterations/sec | 9.4 | 98.1 | 10.4x |
| _compute_ehs_partial_board | 7.81s | 0.313s | 25x |
| GPU kernel launches/EHS | ~300 | ~8 | 37x fewer |

New Bottleneck

After vectorization, the bottleneck shifted to CPU showdown evaluation:

  • _evaluate_5card: 39.4% of time (comparing 7-card hands at showdown)
  • This is the expected final bottleneck for this architecture

Phase 4: Monte Carlo Sample Reduction (UPDATE 2025-12-03)

1.46x-1.76x speedup confirmed by reducing EHS samples during training:

Benchmark Results

| EHS Samples | Iterations/sec | Speedup vs 100 | Game Value | Notes |
| --- | --- | --- | --- | --- |
| 100 | 255 | 1.00x (baseline) | 0.7108 | Original |
| 75 | 313 | 1.23x | - | Linear improvement |
| 50 | 372 | 1.46x | 0.7958 | Good balance |
| 25 | 448 | 1.76x | 0.3992 | Fast but less accurate |

Trade-offs

  • 100 samples: Best statistical accuracy, slowest wall-clock training
  • 50 samples: Good balance - nearly 1.5x faster with acceptable accuracy (Monte Carlo error grows as 1/√n, so halving the samples raises EHS noise by only ~1.4x)
  • 25 samples: Very fast, but accuracy degrades noticeably

Recommendation

Default: Use --ehs-samples 50 for training:

  • 1.46x speedup with minimal accuracy loss
  • Still stochastically valid for MCCFR convergence
  • Sufficient for pre-flop strategy development

Use 100 samples only when high precision is critical (final exploitation tests).

Future Work

  1. Batch MCCFR implementation: Process multiple game states in parallel ✅ Done via BatchEHSCollector
  2. Pre-compute abstraction offline: Use GPU to build EHS lookup tables ✅ Attempted, ineffective
  3. Pre-compute flop/turn EHS: Extend pre-computation to all streets ✅ Attempted, ineffective
  4. Bucket-based EHS caching: Cache by abstraction bucket ❌ Logically impossible
  5. Vectorize Monte Carlo sampling: Batch all samples in a single GPU call ✅ 10.4x speedup achieved
  6. Reduce Monte Carlo samples: 100 → 50 samples for 2x speedup ✅ 1.46x speedup confirmed
  7. GPU showdown evaluation: Move _evaluate_5card to GPU (current 39.4% bottleneck)
  8. Adaptive sampling: Use fewer samples early, more samples for convergence
  9. CUDA comparison: Test on NVIDIA GPU for comparison