MCCFR and Nash Equilibrium
Table of Contents
- Introduction
- Game Theory Foundations
- Counterfactual Regret Minimization
- Monte Carlo CFR (MCCFR)
- CFR Variants
- Real-World Applications
- Implementation Guide
- Sources
Introduction
Counterfactual Regret Minimization (CFR) is a family of algorithms for finding approximate Nash equilibrium strategies in imperfect-information games. Originally introduced by Zinkevich et al. in 2007, CFR is the foundation of modern superhuman poker AI systems, including Libratus and Pluribus.
Key insight: By minimizing counterfactual regret independently at each information set, a player's overall regret is bounded, and the average strategy over iterations converges to a Nash equilibrium (in two-player zero-sum games).
Game Theory Foundations
Information Types in Games
| Type | Definition | Example |
|---|---|---|
| Perfect Information | All players see all moves | Chess, Go |
| Imperfect Information | Players can’t observe all previous moves | Poker (hidden cards) |
| Complete Information | All game parameters known | Standard board games |
| Incomplete Information | Unknown payoffs, types, or rules | Auctions, negotiations |
Harsanyi Transformation: Games of incomplete information can be converted to imperfect information games by introducing a “Nature” player who makes initial random moves to determine hidden parameters.
Information Sets
An information set groups all game states that a player cannot distinguish between when making a decision. In poker, a player’s information set includes:
- Their own cards (private)
- All betting history (public)
- Board cards if any (public)
The player cannot see opponent cards, so all game states differing only in opponent hands belong to the same information set.
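To make the bookkeeping concrete, here is a minimal sketch of how an information set can be keyed in code. The Kuhn-poker-style cards and history encoding are illustrative assumptions, not taken from the sources:

```python
# Illustrative only: a Kuhn-poker-style information set key.
# The key contains everything the acting player can observe:
# their private card plus the public betting history -- but
# never the opponent's hidden card.

def info_set_key(private_card: str, betting_history: str) -> str:
    """Build a dictionary key identifying an information set."""
    return f"{private_card}|{betting_history}"

# Two different true game states...
state_a = {"p0_card": "K", "p1_card": "Q", "history": "check-bet"}
state_b = {"p0_card": "K", "p1_card": "J", "history": "check-bet"}

# ...map to the same information set from player 0's point of view,
# because only the opponent's hidden card differs.
key_a = info_set_key(state_a["p0_card"], state_a["history"])
key_b = info_set_key(state_b["p0_card"], state_b["history"])
assert key_a == key_b
```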
Nash Equilibrium
A Nash equilibrium is a strategy profile where no player can improve their expected payoff by unilaterally changing their strategy.
Properties in two-player zero-sum games:
- Always exists (Nash’s Theorem)
- May have multiple equilibria, but all share the same game value
- Computationally tractable to approximate
- Playing Nash equilibrium guarantees at least the game’s value
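As a small worked illustration of these properties (an example added here, not taken from the sources), consider rock-paper-scissors: the unique Nash equilibrium is the uniform mixed strategy, it guarantees the game value of 0, and any deviation can be exploited by a best response.

```python
import numpy as np

# Payoff matrix for the row player in rock-paper-scissors.
# Rows/columns: 0 = rock, 1 = paper, 2 = scissors.
A = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
])

def best_response_value(opponent_strategy: np.ndarray) -> float:
    """Best payoff the row player can get against a fixed column strategy."""
    return float(np.max(A @ opponent_strategy))

uniform = np.array([1/3, 1/3, 1/3])
biased  = np.array([0.5, 0.3, 0.2])   # over-plays rock

print(best_response_value(uniform))   # 0.0 -> uniform cannot be exploited
print(best_response_value(biased))    # 0.3 -> play paper and profit
```

This best-response gain is exactly the exploitability that the convergence guarantees below are stated in terms of.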
Key quote: “Every one of those superhuman AI systems was generated by attempting to approximate a Nash equilibrium strategy rather than by trying to detect and exploit weaknesses in the opponent.”
Counterfactual Regret Minimization
Core Concepts
Regret: The difference between what you could have earned by always playing a given action and what you actually earned under your current strategy.

Regret(a) = Value(always play a) - Value(current strategy)

For example, if your mixed strategy earned 0 on average while always playing paper in rock-paper-scissors would have earned +0.3 per round, your regret for paper is 0.3 per round, and regret matching will shift probability toward it.

Counterfactual Regret: Regret calculated as if you had tried to reach a given game state, even when your own strategy would rarely lead there; the state's value is weighted by the probability that chance and the opponents bring the game there. This isolates the quality of the decision at each information set from earlier decisions.
Regret Matching: At each iteration, actions are chosen with probability proportional to their positive accumulated regret:
```python
def regret_matching(regrets):
    """Choose actions in proportion to positive accumulated regret."""
    positive_regrets = [max(0.0, r) for r in regrets]
    total = sum(positive_regrets)
    if total > 0:
        return [r / total for r in positive_regrets]
    else:
        # No positive regret yet: fall back to a uniform distribution
        return [1.0 / len(regrets)] * len(regrets)
```
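As a quick sanity check (an illustrative demo, not part of the original text), two players repeatedly using the regret_matching function above in rock-paper-scissors drive their average strategies toward the uniform equilibrium:

```python
import random

PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # row vs column: rock, paper, scissors

regrets = [[0.0] * 3, [0.0] * 3]
strategy_sums = [[0.0] * 3, [0.0] * 3]

for _ in range(100_000):
    strategies = [regret_matching(regrets[p]) for p in (0, 1)]
    actions = [random.choices(range(3), weights=s)[0] for s in strategies]
    for p in (0, 1):
        opponent_action = actions[1 - p]
        realized = PAYOFF[actions[p]][opponent_action]
        for a in range(3):
            # Regret for not having played action a this round
            regrets[p][a] += PAYOFF[a][opponent_action] - realized
        for a in range(3):
            strategy_sums[p][a] += strategies[p][a]

average = [s / sum(strategy_sums[0]) for s in strategy_sums[0]]
print(average)  # tends toward [1/3, 1/3, 1/3] as iterations grow
```

Vanilla CFR applies this same update at every information set of a game tree rather than at a single matrix game.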
Algorithm Steps
1. INITIALIZE all regrets and cumulative strategies to zero
2. FOR each iteration t = 1 to T:
   a. Traverse the game tree from the root
   b. At each information set I:
      - Compute the current strategy σ(I) using regret matching
      - Compute the counterfactual value v(I, a) for each action a
      - Update cumulative regrets: R(I, a) += v(I, a) - v(I)
      - Update the cumulative strategy: S(I, a) += reach_prob × σ(I, a)
3. OUTPUT: Average strategy = S / Σ reach probabilities
Convergence Guarantee
In two-player zero-sum games, CFR guarantees that the average strategy converges to a Nash equilibrium, with exploitability bounded by:
ε ≤ O(1/√T)
where T is the number of iterations.
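Stated more precisely, the bound from the original analysis (Zinkevich et al., 2007, listed in the sources) limits each player's average overall regret, and low regret then yields an approximate equilibrium:

$$\frac{R_i^T}{T} \le \frac{\Delta_i\,|\mathcal{I}_i|\,\sqrt{|A_i|}}{\sqrt{T}}$$

where $\Delta_i$ is the range of payoffs, $|\mathcal{I}_i|$ the number of information sets, and $|A_i|$ the maximum number of actions for player $i$. If both players' average regrets are below $\varepsilon$, their average strategy profile is a $2\varepsilon$-Nash equilibrium.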
Monte Carlo CFR (MCCFR)
Vanilla CFR requires traversing the entire game tree each iteration, which is infeasible for large games like No-Limit Texas Hold’em (~10^160 states). MCCFR uses sampling to make this tractable.
Sampling Variants
| Variant | What’s Sampled | Variance | Speed (per iteration) | Memory |
|---|---|---|---|---|
| Chance Sampling | Only chance events | Lowest | Slowest | Highest |
| External Sampling | Opponent actions + chance | Medium | Medium | Medium |
| Outcome Sampling | Everything (single trajectory) | Highest | Fastest | Lowest |
External Sampling (Most Common)
External sampling traverses all actions for the current player but samples:
- Chance events (card deals)
- Opponent actions
Pseudocode:
```python
def mccfr_external(history, player, reach_probs):
    """External-sampling MCCFR traversal for the update player `player`.

    reach_probs[p] tracks how likely player p's own strategy is to reach
    this history; chance and opponent actions are sampled instead of traversed.
    """
    if terminal(history):
        return utility(history, player)

    if chance_node(history):
        a = sample_chance_action(history)
        return mccfr_external(history + a, player, reach_probs)

    info_set = get_info_set(history)
    strategy = regret_matching(info_set.regrets)

    if current_player(history) == player:
        # Traverse all actions for the update player
        values = {}
        for action in actions(history):
            next_probs = list(reach_probs)
            next_probs[player] *= strategy[action]
            values[action] = mccfr_external(history + action, player, next_probs)

        node_value = sum(strategy[a] * values[a] for a in actions(history))

        # Regret update; because chance and opponent actions are sampled,
        # the opponent's tracked reach probability stays 1, so this matches
        # vanilla CFR's counterfactual weighting in expectation.
        for action in actions(history):
            regret = values[action] - node_value
            info_set.regrets[action] += reach_probs[1 - player] * regret

        return node_value
    else:
        # Sample a single opponent action from the current strategy
        # (average-strategy accumulation, omitted here, is typically done at these nodes)
        action = sample(actions(history), weights=strategy)
        return mccfr_external(history + action, player, reach_probs)
```
Outcome Sampling
Samples a single trajectory through the entire game tree each iteration. Variance is higher, but:
- Much faster per iteration
- Lower memory requirements
- Easier to parallelize
Used when memory is constrained or massive parallelization is needed.
CFR Variants
CFR+ (2014)
Introduced by Oskari Tammelin, CFR+ improves convergence speed by:
- Setting negative regrets to zero immediately (not just for strategy calculation)
- Using weighted averaging that emphasizes later iterations
In practice CFR+ often converges at a rate close to O(1/T), far faster than the O(1/√T) worst-case behavior of vanilla CFR, even though its proven bound is of the same order.
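The two changes can be sketched directly; this is an illustrative snippet with assumed array names, not code from a specific CFR+ implementation:

```python
import numpy as np

def cfr_plus_regret_update(regret_sum, action_values, node_value):
    """Regret matching+: accumulate regret, then floor the running totals at zero."""
    regret_sum += action_values - node_value
    np.maximum(regret_sum, 0.0, out=regret_sum)   # negative totals are zeroed immediately
    return regret_sum

def cfr_plus_average_update(strategy_sum, current_strategy, reach_prob, t):
    """Weighted averaging: iteration t's contribution is weighted by t."""
    strategy_sum += t * reach_prob * current_strategy
    return strategy_sum
```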
Linear CFR
Weights iteration t's regret and average-strategy contributions by t, so later iterations count more than early ones.
Discounted CFR (DCFR)
Applies discount factors to both accumulated regrets and strategy accumulation (a sketch of the discounting step follows the list). Parameters α, β, γ control:
- α: Positive regret discount
- β: Negative regret discount
- γ: Strategy contribution discount
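A minimal sketch of the per-iteration discounting step, with the commonly recommended defaults α = 1.5, β = 0, γ = 2; the array names are illustrative assumptions:

```python
import numpy as np

def dcfr_discount(regret_sum, strategy_sum, t, alpha=1.5, beta=0.0, gamma=2.0):
    """Apply DCFR discounting at the end of iteration t (t >= 1)."""
    pos = t**alpha / (t**alpha + 1)                       # factor for positive regrets
    neg = t**beta / (t**beta + 1)                         # factor for negative regrets
    regret_sum = np.where(regret_sum > 0, regret_sum * pos, regret_sum * neg)
    strategy_sum = strategy_sum * (t / (t + 1))**gamma    # discount average-strategy weights
    return regret_sum, strategy_sum
```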
Performance: On Kuhn Poker benchmarks, DCFR reaches the lowest exploitability among the variants compared.
Deep CFR
Combines CFR with neural networks:
- Uses deep networks to approximate regrets/values
- Enables generalization across similar game states
- Foundational for scaling to truly massive games
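As a rough illustration of the idea (an assumed PyTorch-style sketch with illustrative layer sizes, not the architecture from the Deep CFR paper), a network can map information-set features to per-action advantages, and the strategy is obtained by regret matching over the predicted advantages:

```python
import torch
import torch.nn as nn

class AdvantageNet(nn.Module):
    """Maps information-set features to one predicted advantage (regret) per action."""
    def __init__(self, num_features: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

def strategy_from_advantages(advantages: torch.Tensor) -> torch.Tensor:
    """Regret matching over predicted advantages (uniform if none are positive)."""
    positive = torch.clamp(advantages, min=0.0)
    total = positive.sum(dim=-1, keepdim=True)
    uniform = torch.full_like(advantages, 1.0 / advantages.shape[-1])
    return torch.where(total > 0, positive / total.clamp(min=1e-12), uniform)
```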
Real-World Applications
Libratus (2017)
Defeated world-class professionals in heads-up no-limit Texas Hold’em with 99.98% statistical significance.
Architecture:
- Game Abstraction: Groups similar actions/hands to reduce state space
- Blueprint Solving: Solves the abstracted game (~10^12 states) with an MCCFR variant
- Real-time Subgame Solving: Expands strategies during play using nested safe subgame solving
Key innovation: “Reach” subgame solving balances exploitability prevention with finding better strategies.
Pluribus (2019)
First AI to beat professionals in six-player no-limit Hold’em.
Approach:
- 12,400 CPU core-hours of self-play (8 days on 64-core server)
- Monte Carlo Linear CFR for large subgames
- Vector-based Linear CFR for smaller subgames
- No deep learning required
Efficiency: Runs on 2 CPUs with <128GB RAM (vs AlphaGo’s 1920 CPUs + 280 GPUs).
Key Researchers
- Noam Brown (Meta AI): Co-created Libratus and Pluribus
- Tuomas Sandholm (CMU): Pioneered computational game theory for imperfect information games
- Martin Zinkevich: Lead author of the original CFR paper (Zinkevich et al., 2007)
Implementation Guide
Basic CFR Implementation Structure
```python
import numpy as np

class InformationSet:
    def __init__(self, num_actions):
        self.regret_sum = np.zeros(num_actions)
        self.strategy_sum = np.zeros(num_actions)
        self.num_actions = num_actions

    def get_strategy(self, reach_prob):
        """Get the current strategy via regret matching."""
        strategy = np.maximum(self.regret_sum, 0)
        total = strategy.sum()
        if total > 0:
            strategy /= total
        else:
            strategy = np.ones(self.num_actions) / self.num_actions

        # Accumulate the average strategy, weighted by the player's reach probability
        self.strategy_sum += reach_prob * strategy
        return strategy

    def get_average_strategy(self):
        """Get the Nash equilibrium approximation."""
        total = self.strategy_sum.sum()
        if total > 0:
            return self.strategy_sum / total
        return np.ones(self.num_actions) / self.num_actions
```
Training Loop

```python
def train_cfr(num_iterations):
    info_sets = {}  # Maps info_set_key -> InformationSet

    for i in range(num_iterations):
        for player in [0, 1]:
            cfr(
                history="",
                reach_probs=[1.0, 1.0],
                player=player,
                info_sets=info_sets,
            )

    # Extract the Nash equilibrium approximation
    strategies = {}
    for key, info_set in info_sets.items():
        strategies[key] = info_set.get_average_strategy()

    return strategies
```
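The train_cfr loop above calls a cfr traversal that the original snippet leaves undefined. Below is a minimal sketch of what such a traversal could look like, reusing the InformationSet class (and numpy import) from above; the game-specific helpers (is_terminal, get_utility, is_chance_node, sample_chance_action, next_player, info_set_key, legal_actions) are assumptions for illustration, not part of any particular library:

```python
def cfr(history, reach_probs, player, info_sets):
    """CFR traversal (chance-sampled for brevity) returning the value for `player`."""
    if is_terminal(history):
        return get_utility(history, player)

    if is_chance_node(history):
        action = sample_chance_action(history)
        return cfr(history + action, reach_probs, player, info_sets)

    acting = next_player(history)
    key = info_set_key(history, acting)
    actions = legal_actions(history)
    if key not in info_sets:
        info_sets[key] = InformationSet(len(actions))
    node = info_sets[key]

    strategy = node.get_strategy(reach_probs[acting])

    # Recurse on every action, scaling the acting player's own reach probability
    action_values = np.zeros(len(actions))
    for i, action in enumerate(actions):
        next_reach = list(reach_probs)
        next_reach[acting] *= strategy[i]
        action_values[i] = cfr(history + action, next_reach, player, info_sets)

    node_value = float(np.dot(strategy, action_values))

    # Only the player being updated accumulates regret, weighted by the
    # other player's (counterfactual) reach probability
    if acting == player:
        counterfactual_reach = reach_probs[1 - player]
        node.regret_sum += counterfactual_reach * (action_values - node_value)

    return node_value
```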
Practical Considerations
- Memory: Store only information sets, not game states
- Parallelization: Outcome sampling parallelizes well
- Warm starting: Initialize from simpler abstractions
- Pruning: Skip actions with very negative regret
- Convergence monitoring: Track exploitability on simplified versions of the game
Sources
- Vanilla CFR for Engineers - Justin Sermeno
- CFR on Kuhn Poker - labml.ai
- Libratus: World’s Best Poker Player - The Gradient
- Superhuman AI for Multiplayer Poker - Science
- Libratus - Wikipedia
- GitHub: Poker CFR Implementation
- Zinkevich et al. (2007) - “Regret Minimization in Games with Incomplete Information”
- Lanctot et al. (2009) - “Monte Carlo Sampling for Regret Minimization in Extensive Games”