Table of Contents

  1. Introduction
  2. Game Theory Foundations
  3. Counterfactual Regret Minimization
  4. Monte Carlo CFR (MCCFR)
  5. CFR Variants
  6. Real-World Applications
  7. Implementation Guide
  8. Sources

Introduction

Counterfactual Regret Minimization (CFR) is a family of algorithms that find approximate Nash equilibrium strategies in imperfect information games. Originally introduced by Zinkevich et al. in 2007, CFR has become the foundation of modern poker AI systems, including superhuman players such as Libratus and Pluribus.

Key insight: By minimizing regret at each decision point independently, the overall strategy converges to Nash equilibrium.

Game Theory Foundations

Information Types in Games

Type                   | Definition                                | Example
Perfect Information    | All players see all moves                 | Chess, Go
Imperfect Information  | Players can’t observe all previous moves  | Poker (hidden cards)
Complete Information   | All game parameters known                 | Standard board games
Incomplete Information | Unknown payoffs, types, or rules          | Auctions, negotiations

Harsanyi Transformation: Games of incomplete information can be converted to imperfect information games by introducing a “Nature” player who makes initial random moves to determine hidden parameters.

Information Sets

An information set groups all game states that a player cannot distinguish between when making a decision. In poker, a player’s information set includes:

  • Their own cards (private)
  • All betting history (public)
  • Board cards if any (public)

The player cannot see opponent cards, so all game states differing only in opponent hands belong to the same information set.
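
To make this concrete, here is a minimal sketch of how an information-set key could be built for Kuhn Poker, a standard three-card toy game. The function name and string encoding are illustrative choices, not part of any particular library.

def info_set_key(private_card, betting_history):
    """Illustrative information-set key: own card plus public betting history.

    Every game state that differs only in the opponent's hidden card maps to
    the same key, so those states share one information set.
    """
    return f"{private_card}:{betting_history}"

# Holding a King after a pass-bet sequence gives the key "K:pb" no matter
# which of the other two cards the opponent was dealt.
key = info_set_key("K", "pb")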

Nash Equilibrium

A Nash equilibrium is a strategy profile where no player can improve their expected payoff by unilaterally changing their strategy.

Properties in two-player zero-sum games:

  • Always exists (Nash’s Theorem)
  • May have multiple equilibria, but all share the same game value
  • Computationally tractable to approximate
  • Playing Nash equilibrium guarantees at least the game’s value

Key quote: “Every one of those superhuman AI systems was generated by attempting to approximate a Nash equilibrium strategy rather than by trying to detect and exploit weaknesses in the opponent.”

Counterfactual Regret Minimization

Core Concepts

Regret: The difference between what you could have earned by always playing a particular action and what you actually earned by following your current strategy.

Regret(action a) = Value(always play a) - Value(current strategy)

Counterfactual Regret: Regret at an information set, weighted by the probability that chance and the opponents bring the game there, computed as if you had played to reach that point even when your own strategy would rarely do so. This isolates the quality of decisions at each information set.

Regret Matching: At each iteration, actions are chosen with probability proportional to their positive accumulated regret:

def regret_matching(regrets):
    """Choose each action in proportion to its positive accumulated regret."""
    positive_regrets = [max(0.0, r) for r in regrets]
    total = sum(positive_regrets)
    if total > 0:
        return [r / total for r in positive_regrets]
    else:
        # No positive regret anywhere: fall back to a uniform distribution
        return [1.0 / len(regrets)] * len(regrets)
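
For example, with accumulated regrets of 5, -2, and 3 for three actions, only the positive regrets contribute, so the strategy puts probability 5/8 and 3/8 on the first and third actions:

print(regret_matching([5.0, -2.0, 3.0]))    # [0.625, 0.0, 0.375]
print(regret_matching([-1.0, -4.0, -2.0]))  # all regrets negative -> uniform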

Algorithm Steps

1. INITIALIZE all regrets and strategies to zero
2. FOR each iteration t = 1 to T:
   a. Traverse the game tree from the root
   b. At each information set I:
      - Compute strategy σ(I) using regret matching
      - Compute counterfactual value v(I, a) for each action a
      - Update cumulative regrets:  R(I, a) += v(I, a) - v(I)
      - Update cumulative strategy: S(I, a) += reach_prob × σ(I, a)
3. OUTPUT the average strategy: σ̄(I, a) = S(I, a) / Σ_a' S(I, a')

Convergence Guarantee

In two-player zero-sum games, CFR guarantees the average strategy converges to a Nash equilibrium with exploitability bounded by:

ε ≤ O(1/√T)

where T is the number of iterations.

Monte Carlo CFR (MCCFR)

Vanilla CFR requires traversing the entire game tree each iteration, which is infeasible for large games like No-Limit Texas Hold’em (~10^160 states). MCCFR uses sampling to make this tractable.

Sampling Variants

Variant           | What’s Sampled                 | Variance | Speed   | Memory
Chance Sampling   | Only chance events             | Lowest   | Slowest | Highest
External Sampling | Opponent actions + chance      | Medium   | Medium  | Medium
Outcome Sampling  | Everything (single trajectory) | Highest  | Fastest | Lowest

External Sampling (Most Common)

External sampling traverses all actions for the current player but samples:

  • Chance events (card deals)
  • Opponent actions

Pseudocode:

def mccfr_external(history, player):
    if terminal(history):
        return utility(history, player)
    if chance_node(history):
        a = sample_chance_action()
        return mccfr_external(history + a, player)
    info_set = get_info_set(history)
    strategy = regret_matching(info_set.regrets)
    if current_player(history) == player:
        # Traverse every action for the traversing player
        values = {}
        for action in actions(history):
            values[action] = mccfr_external(history + action, player)
        node_value = sum(strategy[a] * values[a] for a in actions(history))
        for action in actions(history):
            # Opponent and chance reach probabilities are accounted for by
            # sampling, so the sampled regret needs no explicit weighting
            info_set.regrets[action] += values[action] - node_value
        return node_value
    else:
        # Sample a single opponent action; accumulate the average strategy here
        for action in actions(history):
            info_set.strategy_sum[action] += strategy[action]
        action = sample(actions(history), weights=strategy)
        return mccfr_external(history + action, player)

Outcome Sampling

Samples a single trajectory through the entire game tree. Higher variance but:

  • Much faster per iteration
  • Lower memory requirements
  • Easier to parallelize

Used when memory is constrained or massive parallelization is needed.
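
For illustration, here is a minimal sketch of an outcome-sampling traversal, reusing the hypothetical helpers from the external-sampling pseudocode above (terminal, utility, chance_node, get_info_set, actions, current_player, regret_matching). Actions are indexed by position here, sample_chance_action is assumed to also return the chance probability, and the epsilon-exploration constant is an illustrative choice. Every update is importance-weighted by the probability of sampling the trajectory.

import random

def mccfr_outcome(history, player, reach_opp, sample_prob, epsilon=0.6):
    """One sampled trajectory; returns (importance-weighted value, tail prob)."""
    if terminal(history):
        # Weight the payoff by 1 / probability of having sampled this trajectory
        return utility(history, player) / sample_prob, 1.0
    if chance_node(history):
        a, p = sample_chance_action(history)  # assumed to return (action, prob)
        value, tail = mccfr_outcome(history + a, player,
                                    reach_opp * p, sample_prob * p, epsilon)
        return value, tail * p
    info_set = get_info_set(history)
    strategy = regret_matching(info_set.regrets)
    acts = actions(history)
    if current_player(history) == player:
        # Epsilon-exploration at our own nodes keeps every action reachable
        sampling = [epsilon / len(acts) + (1 - epsilon) * s for s in strategy]
    else:
        sampling = strategy
    i = random.choices(range(len(acts)), weights=sampling)[0]
    if current_player(history) == player:
        value, tail = mccfr_outcome(history + acts[i], player,
                                    reach_opp, sample_prob * sampling[i], epsilon)
        w = value * reach_opp  # payoff * opponent reach / sample probability
        for j in range(len(acts)):
            if j == i:
                info_set.regrets[j] += w * tail * (1 - strategy[i])
            else:
                info_set.regrets[j] -= w * tail * strategy[i]
    else:
        value, tail = mccfr_outcome(history + acts[i], player,
                                    reach_opp * strategy[i],
                                    sample_prob * sampling[i], epsilon)
        # Average-strategy accumulation is omitted here; several weighting
        # schemes appear in the literature
    return value, tail * strategy[i]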

CFR Variants

CFR+ (2014)

Introduced by Oskari Tammelin, CFR+ improves convergence speed by:

  • Setting negative regrets to zero immediately (not just for strategy calculation)
  • Using weighted averaging that emphasizes later iterations

In practice CFR+ converges dramatically faster than vanilla CFR, often empirically approaching O(1/T) even though its proven worst-case bound remains O(1/√T), the same as vanilla CFR's.
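
As a minimal sketch of those two changes, assuming NumPy arrays like the InformationSet class in the Implementation Guide below (the function and argument names here are illustrative):

import numpy as np

def cfr_plus_update(regret_sum, instant_regrets, strategy_sum, strategy,
                    reach_prob, t):
    """Illustrative CFR+ bookkeeping for one information set on iteration t."""
    # Regret matching+: clamp cumulative regrets at zero after every update,
    # so accumulated negative regret cannot suppress an action indefinitely
    regret_sum = np.maximum(regret_sum + instant_regrets, 0)
    # Weighted averaging: iteration t contributes with weight t, so later
    # (better) strategies dominate the final average
    strategy_sum = strategy_sum + t * reach_prob * strategy
    return regret_sum, strategy_sum

In practice CFR+ is also usually run with alternating updates, recomputing one player's regrets at a time.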

Linear CFR

Weights iterations linearly: later iterations count more than early ones.

Discounted CFR (DCFR)

Applies discount factors to both regrets and strategy accumulation. Parameters α, β, γ control:

  • α: Positive regret discount
  • β: Negative regret discount
  • γ: Strategy contribution discount

Performance: On small benchmarks such as Kuhn Poker, DCFR typically reaches the lowest exploitability of the variants listed here.
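
A minimal sketch of the discounting step, following the update rule described in the Discounted CFR paper: at the end of iteration t, positive cumulative regrets are multiplied by t^α / (t^α + 1), negative ones by t^β / (t^β + 1), and the cumulative strategy by (t / (t + 1))^γ. Linear CFR corresponds to α = β = γ = 1. The variable names below are illustrative; α = 1.5, β = 0, γ = 2 are the values recommended by the paper's authors.

import numpy as np

def dcfr_discount(regret_sum, strategy_sum, t, alpha=1.5, beta=0.0, gamma=2.0):
    """Apply DCFR discounting at the end of iteration t (illustrative sketch)."""
    pos_factor = t**alpha / (t**alpha + 1)   # shrink positive regrets slowly
    neg_factor = t**beta / (t**beta + 1)     # shrink negative regrets faster
    regret_sum = np.where(regret_sum > 0,
                          regret_sum * pos_factor,
                          regret_sum * neg_factor)
    strategy_sum = strategy_sum * (t / (t + 1))**gamma
    return regret_sum, strategy_sum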

Deep CFR

Combines CFR with neural networks:

  • Uses deep networks to approximate regrets/values
  • Enables generalization across similar game states
  • Foundational for scaling to truly massive games

Real-World Applications

Libratus (2017)

Defeated world-class professionals in heads-up no-limit Texas Hold’em with 99.98% statistical significance.

Architecture:

  1. Game Abstraction: Groups similar actions/hands to reduce state space
  2. Blueprint Solving: CFR+ on abstracted game (~10^12 states)
  3. Real-time Subgame Solving: Expands strategies during play using nested safe subgame solving

Key innovation: “Reach” subgame solving balances exploitability prevention with finding better strategies.

Pluribus (2019)

First AI to beat professionals in six-player no-limit Hold’em.

Approach:

  • 12,400 CPU core-hours of self-play (8 days on 64-core server)
  • Monte Carlo Linear CFR for large subgames
  • Vector-based Linear CFR for smaller subgames
  • No deep learning required

Efficiency: Runs on 2 CPUs with <128GB RAM (vs AlphaGo’s 1920 CPUs + 280 GPUs).

Key Researchers

  • Noam Brown (Meta AI): Co-created Libratus and Pluribus
  • Tuomas Sandholm (CMU): Pioneered computational game theory for imperfect information games
  • Martin Zinkevich: Introduced original CFR algorithm

Implementation Guide

Basic CFR Implementation Structure

import numpy as np

class InformationSet:
    def __init__(self, num_actions):
        self.regret_sum = np.zeros(num_actions)
        self.strategy_sum = np.zeros(num_actions)
        self.num_actions = num_actions

    def get_strategy(self, reach_prob):
        """Get current strategy via regret matching"""
        strategy = np.maximum(self.regret_sum, 0)
        total = strategy.sum()
        if total > 0:
            strategy /= total
        else:
            strategy = np.ones(self.num_actions) / self.num_actions
        self.strategy_sum += reach_prob * strategy
        return strategy

    def get_average_strategy(self):
        """Get Nash equilibrium approximation"""
        total = self.strategy_sum.sum()
        if total > 0:
            return self.strategy_sum / total
        return np.ones(self.num_actions) / self.num_actions

Training Loop

def train_cfr(num_iterations):
    info_sets = {}  # Maps info_set_key -> InformationSet
    for i in range(num_iterations):
        for player in [0, 1]:
            cfr(
                history="",
                reach_probs=[1.0, 1.0],
                player=player,
                info_sets=info_sets
            )
    # Extract Nash equilibrium strategies
    strategies = {}
    for key, info_set in info_sets.items():
        strategies[key] = info_set.get_average_strategy()
    return strategies
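
The training loop above calls a cfr traversal that is not shown. Here is a minimal sketch of what that recursion could look like for a small two-player zero-sum game, reusing the InformationSet class and numpy import above. The game-specific helpers (is_terminal, payoff, acting_player, legal_actions, info_set_key) are hypothetical and would come from your own game definition, and chance nodes are omitted for brevity.

def cfr(history, reach_probs, player, info_sets):
    """One vanilla-CFR traversal that updates regrets for `player` only."""
    if is_terminal(history):
        return payoff(history, player)
    acting = acting_player(history)
    actions = legal_actions(history)
    key = info_set_key(history, acting)
    if key not in info_sets:
        info_sets[key] = InformationSet(len(actions))
    node = info_sets[key]
    # Accumulate the average strategy only on the traversing player's own turn,
    # weighted by that player's reach probability
    strategy = node.get_strategy(reach_probs[acting] if acting == player else 0.0)
    action_values = np.zeros(len(actions))
    for i, action in enumerate(actions):
        next_reach = list(reach_probs)
        next_reach[acting] *= strategy[i]
        action_values[i] = cfr(history + action, next_reach, player, info_sets)
    node_value = float(np.dot(strategy, action_values))
    if acting == player:
        # Counterfactual regret: weight by the opponent's reach probability
        node.regret_sum += reach_probs[1 - player] * (action_values - node_value)
    return node_value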

Practical Considerations

  1. Memory: Store only information sets, not game states
  2. Parallelization: Outcome sampling parallelizes well
  3. Warm starting: Initialize from simpler abstractions
  4. Pruning: Skip actions with very negative regret (see the sketch after this list)
  5. Convergence monitoring: Track exploitability on simplified versions of the game
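
As a small illustration of point 4, a regret-based pruning check might look like the following. The threshold value and function name are illustrative assumptions; real systems revisit pruned actions occasionally so they can recover.

PRUNE_THRESHOLD = -1e6  # illustrative; tuned per game in practice

def prunable(node, action_index, threshold=PRUNE_THRESHOLD):
    """Return True if this action's subtree can be skipped this iteration."""
    return node.regret_sum[action_index] < threshold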

Sources

  1. Vanilla CFR for Engineers - Justin Sermeno
  2. CFR on Kuhn Poker - labml.ai
  3. Libratus: World’s Best Poker Player - The Gradient
  4. Superhuman AI for Multiplayer Poker - Science
  5. Libratus - Wikipedia
  6. GitHub: Poker CFR Implementation
  7. Zinkevich et al. (2007) - “Regret Minimization in Games with Incomplete Information”
  8. Lanctot et al. (2009) - “Monte Carlo Sampling for Regret Minimization in Extensive Games”