MCCFR and Nash Equilibrium
Table of Contents
- Introduction
- Game Theory Foundations
- Counterfactual Regret Minimization
- Monte Carlo CFR (MCCFR)
- CFR Variants
- Real-World Applications
- Implementation Guide
- Sources
Introduction
Counterfactual Regret Minimization (CFR) is a family of algorithms for finding approximate Nash equilibrium strategies in imperfect-information games. Originally introduced by Zinkevich et al. in 2007, CFR is the foundation of modern superhuman poker AI systems, including Libratus and Pluribus.
Key insight: By minimizing counterfactual regret independently at each information set, a player's overall regret is bounded, and the average strategy over iterations converges to a Nash equilibrium (in two-player zero-sum games).
Game Theory Foundations
Information Types in Games
| Type | Definition | Example |
|---|---|---|
| Perfect Information | All players see all moves | Chess, Go |
| Imperfect Information | Players can’t observe all previous moves | Poker (hidden cards) |
| Complete Information | All game parameters known | Standard board games |
| Incomplete Information | Unknown payoffs, types, or rules | Auctions, negotiations |
Harsanyi Transformation: Games of incomplete information can be converted to imperfect information games by introducing a “Nature” player who makes initial random moves to determine hidden parameters.
Information Sets
An information set groups all game states that a player cannot distinguish between when making a decision. In poker, a player’s information set includes:
- Their own cards (private)
- All betting history (public)
- Board cards if any (public)
The player cannot see opponent cards, so all game states differing only in opponent hands belong to the same information set.
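To make the bookkeeping concrete, here is a minimal sketch of how an information set can be keyed in code. The Kuhn-poker-style cards and history encoding are illustrative assumptions, not taken from the sources:

```python
# Illustrative only: a Kuhn-poker-style information set key.
# The key contains everything the acting player can observe:
# their private card plus the public betting history -- but
# never the opponent's hidden card.

def info_set_key(private_card: str, betting_history: str) -> str:
    """Build a dictionary key identifying an information set."""
    return f"{private_card}|{betting_history}"

# Two different true game states...
state_a = {"p0_card": "K", "p1_card": "Q", "history": "check-bet"}
state_b = {"p0_card": "K", "p1_card": "J", "history": "check-bet"}

# ...map to the same information set from player 0's point of view,
# because only the opponent's hidden card differs.
key_a = info_set_key(state_a["p0_card"], state_a["history"])
key_b = info_set_key(state_b["p0_card"], state_b["history"])
assert key_a == key_b
```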
Nash Equilibrium
A Nash equilibrium is a strategy profile where no player can improve their expected payoff by unilaterally changing their strategy.
Properties in two-player zero-sum games:
- Always exists (Nash’s Theorem)
- May have multiple equilibria, but all share the same game value
- Computationally tractable to approximate
- Playing Nash equilibrium guarantees at least the game’s value
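As a small worked illustration of these properties (an example added here, not taken from the sources), consider rock-paper-scissors: the unique Nash equilibrium is the uniform mixed strategy, it guarantees the game value of 0, and any deviation can be exploited by a best response.

```python
import numpy as np

# Payoff matrix for the row player in rock-paper-scissors.
# Rows/columns: 0 = rock, 1 = paper, 2 = scissors.
A = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
])

def best_response_value(opponent_strategy: np.ndarray) -> float:
    """Best payoff the row player can get against a fixed column strategy."""
    return float(np.max(A @ opponent_strategy))

uniform = np.array([1/3, 1/3, 1/3])
biased  = np.array([0.5, 0.3, 0.2])   # over-plays rock

print(best_response_value(uniform))   # 0.0 -> uniform cannot be exploited
print(best_response_value(biased))    # 0.3 -> play paper and profit
```

This best-response gain is exactly the exploitability that the convergence guarantees below are stated in terms of.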
Key quote: “Every one of those superhuman AI systems was generated by attempting to approximate a Nash equilibrium strategy rather than by trying to detect and exploit weaknesses in the opponent.”
Counterfactual Regret Minimization
Core Concepts
Regret: The difference between what you could have earned by always playing a given action and what you actually earned under your current strategy.

Regret(a) = Value(always play a) - Value(current strategy)

For example, if your mixed strategy earned 0 on average while always playing paper in rock-paper-scissors would have earned +0.3 per round, your regret for paper is 0.3 per round, and regret matching will shift probability toward it.

Counterfactual Regret: Regret calculated as if you had tried to reach a given game state, even when your own strategy would rarely lead there; the state's value is weighted by the probability that chance and the opponents bring the game there. This isolates the quality of the decision at each information set from earlier decisions.
Regret Matching: At each iteration, actions are chosen with probability proportional to their positive accumulated regret:
```python
def regret_matching(regrets):
    """Choose actions in proportion to positive accumulated regret."""
    positive_regrets = [max(0.0, r) for r in regrets]
    total = sum(positive_regrets)
    if total > 0:
        return [r / total for r in positive_regrets]
    else:
        # No positive regret yet: fall back to a uniform distribution
        return [1.0 / len(regrets)] * len(regrets)
```
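As a quick sanity check (an illustrative demo, not part of the original text), two players repeatedly using the regret_matching function above in rock-paper-scissors drive their average strategies toward the uniform equilibrium:

```python
import random

PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # row vs column: rock, paper, scissors

regrets = [[0.0] * 3, [0.0] * 3]
strategy_sums = [[0.0] * 3, [0.0] * 3]

for _ in range(100_000):
    strategies = [regret_matching(regrets[p]) for p in (0, 1)]
    actions = [random.choices(range(3), weights=s)[0] for s in strategies]
    for p in (0, 1):
        opponent_action = actions[1 - p]
        realized = PAYOFF[actions[p]][opponent_action]
        for a in range(3):
            # Regret for not having played action a this round
            regrets[p][a] += PAYOFF[a][opponent_action] - realized
        for a in range(3):
            strategy_sums[p][a] += strategies[p][a]

average = [s / sum(strategy_sums[0]) for s in strategy_sums[0]]
print(average)  # tends toward [1/3, 1/3, 1/3] as iterations grow
```

Vanilla CFR applies this same update at every information set of a game tree rather than at a single matrix game.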
Algorithm Steps
1. INITIALIZE all regrets and cumulative strategies to zero
2. FOR each iteration t = 1 to T:
   a. Traverse the game tree from the root
   b. At each information set I:
      - Compute the current strategy σ(I) using regret matching
      - Compute the counterfactual value v(I, a) for each action a
      - Update cumulative regrets: R(I, a) += v(I, a) - v(I)
      - Update the cumulative strategy: S(I, a) += reach_prob × σ(I, a)
3. OUTPUT: Average strategy = S / Σ reach probabilities
Convergence Guarantee
In two-player zero-sum games, CFR guarantees that the average strategy converges to a Nash equilibrium, with exploitability bounded by:
ε ≤ O(1/√T)
where T is the number of iterations.
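Stated more precisely, the bound from the original analysis (Zinkevich et al., 2007, listed in the sources) limits each player's average overall regret, and low regret then yields an approximate equilibrium:

$$\frac{R_i^T}{T} \le \frac{\Delta_i\,|\mathcal{I}_i|\,\sqrt{|A_i|}}{\sqrt{T}}$$

where $\Delta_i$ is the range of payoffs, $|\mathcal{I}_i|$ the number of information sets, and $|A_i|$ the maximum number of actions for player $i$. If both players' average regrets are below $\varepsilon$, their average strategy profile is a $2\varepsilon$-Nash equilibrium.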
Monte Carlo CFR (MCCFR)
Vanilla CFR requires traversing the entire game tree each iteration, which is infeasible for large games like No-Limit Texas Hold’em (~10^160 states). MCCFR uses sampling to make this tractable.
Sampling Variants
| Variant | What’s Sampled | Variance | Speed (per iteration) | Memory |
|---|---|---|---|---|
| Chance Sampling | Only chance events | Lowest | Slowest | Highest |
| External Sampling | Opponent actions + chance | Medium | Medium | Medium |
| Outcome Sampling | Everything (single trajectory) | Highest | Fastest | Lowest |
External Sampling (Most Common)
External sampling traverses all actions for the current player but samples:
- Chance events (card deals)
- Opponent actions
Pseudocode:
```python
def mccfr_external(history, player, reach_probs):
    """External-sampling MCCFR traversal for the update player `player`.

    reach_probs[p] tracks how likely player p's own strategy is to reach
    this history; chance and opponent actions are sampled instead of traversed.
    """
    if terminal(history):
        return utility(history, player)

    if chance_node(history):
        a = sample_chance_action(history)
        return mccfr_external(history + a, player, reach_probs)

    info_set = get_info_set(history)
    strategy = regret_matching(info_set.regrets)

    if current_player(history) == player:
        # Traverse all actions for the update player
        values = {}
        for action in actions(history):
            next_probs = list(reach_probs)
            next_probs[player] *= strategy[action]
            values[action] = mccfr_external(history + action, player, next_probs)

        node_value = sum(strategy[a] * values[a] for a in actions(history))

        # Regret update; because chance and opponent actions are sampled,
        # the opponent's tracked reach probability stays 1, so this matches
        # vanilla CFR's counterfactual weighting in expectation.
        for action in actions(history):
            regret = values[action] - node_value
            info_set.regrets[action] += reach_probs[1 - player] * regret

        return node_value
    else:
        # Sample a single opponent action from the current strategy
        # (average-strategy accumulation, omitted here, is typically done at these nodes)
        action = sample(actions(history), weights=strategy)
        return mccfr_external(history + action, player, reach_probs)
```
Outcome Sampling
Samples a single trajectory through the entire game tree each iteration. Variance is higher, but:
- Much faster per iteration
- Lower memory requirements
- Easier to parallelize
Used when memory is constrained or massive parallelization is needed.
CFR Variants
CFR+ (2014)
Introduced by Oskari Tammelin, CFR+ improves convergence speed by:
- Setting negative regrets to zero immediately (not just for strategy calculation)
- Using weighted averaging that emphasizes later iterations
In practice CFR+ often converges at a rate close to O(1/T), far faster than the O(1/√T) worst-case behavior of vanilla CFR, even though its proven bound is of the same order.
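The two changes can be sketched directly; this is an illustrative snippet with assumed array names, not code from a specific CFR+ implementation:

```python
import numpy as np

def cfr_plus_regret_update(regret_sum, action_values, node_value):
    """Regret matching+: accumulate regret, then floor the running totals at zero."""
    regret_sum += action_values - node_value
    np.maximum(regret_sum, 0.0, out=regret_sum)   # negative totals are zeroed immediately
    return regret_sum

def cfr_plus_average_update(strategy_sum, current_strategy, reach_prob, t):
    """Weighted averaging: iteration t's contribution is weighted by t."""
    strategy_sum += t * reach_prob * current_strategy
    return strategy_sum
```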
Linear CFR
Weights iteration t's regret and average-strategy contributions by t, so later iterations count more than early ones.
Discounted CFR (DCFR)
Applies discount factors to both accumulated regrets and strategy accumulation (a sketch of the discounting step follows the list). Parameters α, β, γ control:
- α: Positive regret discount
- β: Negative regret discount
- γ: Strategy contribution discount
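A minimal sketch of the per-iteration discounting step, with the commonly recommended defaults α = 1.5, β = 0, γ = 2; the array names are illustrative assumptions:

```python
import numpy as np

def dcfr_discount(regret_sum, strategy_sum, t, alpha=1.5, beta=0.0, gamma=2.0):
    """Apply DCFR discounting at the end of iteration t (t >= 1)."""
    pos = t**alpha / (t**alpha + 1)                       # factor for positive regrets
    neg = t**beta / (t**beta + 1)                         # factor for negative regrets
    regret_sum = np.where(regret_sum > 0, regret_sum * pos, regret_sum * neg)
    strategy_sum = strategy_sum * (t / (t + 1))**gamma    # discount average-strategy weights
    return regret_sum, strategy_sum
```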
Performance: On Kuhn Poker benchmarks, DCFR reaches the lowest exploitability among the variants compared.
Deep CFR
Combines CFR with neural networks:
- Uses deep networks to approximate regrets/values
- Enables generalization across similar game states
- Foundational for scaling to truly massive games
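As a rough illustration of the idea (an assumed PyTorch-style sketch with illustrative layer sizes, not the architecture from the Deep CFR paper), a network can map information-set features to per-action advantages, and the strategy is obtained by regret matching over the predicted advantages:

```python
import torch
import torch.nn as nn

class AdvantageNet(nn.Module):
    """Maps information-set features to one predicted advantage (regret) per action."""
    def __init__(self, num_features: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

def strategy_from_advantages(advantages: torch.Tensor) -> torch.Tensor:
    """Regret matching over predicted advantages (uniform if none are positive)."""
    positive = torch.clamp(advantages, min=0.0)
    total = positive.sum(dim=-1, keepdim=True)
    uniform = torch.full_like(advantages, 1.0 / advantages.shape[-1])
    return torch.where(total > 0, positive / total.clamp(min=1e-12), uniform)
```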
Real-World Applications
Libratus (2017)
Defeated world-class professionals in heads-up no-limit Texas Hold’em with 99.98% statistical significance.
Architecture:
- Game Abstraction: Groups similar actions/hands to reduce state space
- Blueprint Solving: Solves the abstracted game (~10^12 states) with an MCCFR variant
- Real-time Subgame Solving: Expands strategies during play using nested safe subgame solving
Key innovation: “Reach” subgame solving balances exploitability prevention with finding better strategies.
Pluribus (2019)
First AI to beat professionals in six-player no-limit Hold’em.
Approach:
- 12,400 CPU core-hours of self-play (8 days on 64-core server)
- Monte Carlo Linear CFR for large subgames
- Vector-based Linear CFR for smaller subgames
- No deep learning required
Efficiency: Runs on 2 CPUs with <128GB RAM (vs AlphaGo’s 1920 CPUs + 280 GPUs).
Key Researchers
- Noam Brown (Meta AI): Co-created Libratus and Pluribus
- Tuomas Sandholm (CMU): Pioneered computational game theory for imperfect information games
- Martin Zinkevich: Lead author of the original CFR paper (Zinkevich et al., 2007)
Implementation Guide
Basic CFR Implementation Structure
```python
import numpy as np

class InformationSet:
    def __init__(self, num_actions):
        self.regret_sum = np.zeros(num_actions)
        self.strategy_sum = np.zeros(num_actions)
        self.num_actions = num_actions

    def get_strategy(self, reach_prob):
        """Get the current strategy via regret matching."""
        strategy = np.maximum(self.regret_sum, 0)
        total = strategy.sum()
        if total > 0:
            strategy /= total
        else:
            strategy = np.ones(self.num_actions) / self.num_actions

        # Accumulate the average strategy, weighted by the player's reach probability
        self.strategy_sum += reach_prob * strategy
        return strategy

    def get_average_strategy(self):
        """Get the Nash equilibrium approximation."""
        total = self.strategy_sum.sum()
        if total > 0:
            return self.strategy_sum / total
        return np.ones(self.num_actions) / self.num_actions
```
Training Loop

```python
def train_cfr(num_iterations):
    info_sets = {}  # Maps info_set_key -> InformationSet

    for i in range(num_iterations):
        for player in [0, 1]:
            cfr(
                history="",
                reach_probs=[1.0, 1.0],
                player=player,
                info_sets=info_sets,
            )

    # Extract the Nash equilibrium approximation
    strategies = {}
    for key, info_set in info_sets.items():
        strategies[key] = info_set.get_average_strategy()

    return strategies
```
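The train_cfr loop above calls a cfr traversal that the original snippet leaves undefined. Below is a minimal sketch of what such a traversal could look like, reusing the InformationSet class (and numpy import) from above; the game-specific helpers (is_terminal, get_utility, is_chance_node, sample_chance_action, next_player, info_set_key, legal_actions) are assumptions for illustration, not part of any particular library:

```python
def cfr(history, reach_probs, player, info_sets):
    """CFR traversal (chance-sampled for brevity) returning the value for `player`."""
    if is_terminal(history):
        return get_utility(history, player)

    if is_chance_node(history):
        action = sample_chance_action(history)
        return cfr(history + action, reach_probs, player, info_sets)

    acting = next_player(history)
    key = info_set_key(history, acting)
    actions = legal_actions(history)
    if key not in info_sets:
        info_sets[key] = InformationSet(len(actions))
    node = info_sets[key]

    strategy = node.get_strategy(reach_probs[acting])

    # Recurse on every action, scaling the acting player's own reach probability
    action_values = np.zeros(len(actions))
    for i, action in enumerate(actions):
        next_reach = list(reach_probs)
        next_reach[acting] *= strategy[i]
        action_values[i] = cfr(history + action, next_reach, player, info_sets)

    node_value = float(np.dot(strategy, action_values))

    # Only the player being updated accumulates regret, weighted by the
    # other player's (counterfactual) reach probability
    if acting == player:
        counterfactual_reach = reach_probs[1 - player]
        node.regret_sum += counterfactual_reach * (action_values - node_value)

    return node_value
```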
Practical Considerations
- Memory: Store only information sets, not game states
- Parallelization: Outcome sampling parallelizes well
- Warm starting: Initialize from simpler abstractions
- Pruning: Skip actions with very negative regret
- Convergence monitoring: Track exploitability on simplified versions of the game
Sources
- Vanilla CFR for Engineers - Justin Sermeno
- CFR on Kuhn Poker - labml.ai
- Libratus: World’s Best Poker Player - The Gradient
- Superhuman AI for Multiplayer Poker - Science
- Libratus - Wikipedia
- GitHub: Poker CFR Implementation
- Zinkevich et al. (2007) - “Regret Minimization in Games with Incomplete Information”
- Lanctot et al. (2009) - “Monte Carlo Sampling for Regret Minimization in Extensive Games”