Bridge: GANs as Zero-Sum Games, Multi-Agent RL & RLHF as Mechanism Design
Game theory was developed to model economic behavior, but its mathematical structures appear throughout modern machine learning: GANs formalize adversarial training as a minimax game, multi-agent reinforcement learning studies learning in strategic environments, and RLHF can be understood as a mechanism design problem where the goal is to elicit honest human preferences. Understanding these connections reveals why certain training procedures succeed or fail — and what theoretical guarantees we can or cannot expect.
Concepts
[Interactive demo: minimax tree traversal with MAX nodes (root and depth 2) and MIN nodes (depth 1), stepping through the algorithm.]
GAN training, ad auction bidding, multi-agent reinforcement learning, and RLHF preference learning all look like engineering problems — until you recognize the game-theoretic structure underneath. Each involves multiple agents with conflicting objectives, and the outcome depends on what all of them do simultaneously. Game theory provides the vocabulary: Nash equilibrium tells you what stable play looks like, mechanism design tells you how to engineer rules that produce the outcome you want, and the minimax theorem tells you when a stable outcome is guaranteed to exist. The gap between these theoretical guarantees and what gradient descent actually does in practice is where GAN instability, reward hacking, and multi-agent nonstationarity all live.
GANs as Zero-Sum Games
A Generative Adversarial Network (GAN) (Goodfellow et al., 2014) trains a generator $G$ and discriminator $D$ via:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
This is a zero-sum game: $G$'s loss equals $D$'s gain. By von Neumann's minimax theorem, a Nash equilibrium exists (in the space of all measurable functions). At Nash, $p_g = p_{\text{data}}$ and $D(x) = \tfrac{1}{2}$ everywhere.
Optimal discriminator. For fixed $G$, the optimal discriminator is:

$$D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

Substituting $D^*_G$ into $V$:

$$C(G) = \max_D V(D, G) = -\log 4 + 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$$

where $\mathrm{JSD}$ is the Jensen-Shannon divergence. Minimizing $C(G)$ over $G$ minimizes the JSD, achieving $C(G) = -\log 4$ at the global minimum.
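A quick numeric check of the closed form for $D^*_G$, using made-up discrete probabilities and a brute-force grid search over discriminator outputs:

```python
import math

# Toy discrete supports for p_data and p_g (made-up probabilities).
p_data = {0: 0.7, 1: 0.2, 2: 0.1}
p_g    = {0: 0.1, 1: 0.3, 2: 0.6}

def pointwise_value(x, d):
    # Integrand of V(D, G) at x: p_data(x) log D(x) + p_g(x) log(1 - D(x))
    return p_data[x] * math.log(d) + p_g[x] * math.log(1 - d)

for x in p_data:
    d_star = p_data[x] / (p_data[x] + p_g[x])           # closed form
    d_grid = max((pointwise_value(x, d / 1000), d / 1000)
                 for d in range(1, 1000))[1]            # brute-force argmax
    assert abs(d_star - d_grid) < 1e-2                  # they agree
```

The maximization decomposes pointwise in $x$, which is why the closed form holds per-point rather than only in expectation.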
The JSD formulation reveals why the GAN objective is theoretically beautiful but practically difficult: JSD is zero only when the two distributions are identical, but its gradient vanishes when they are far apart (saturating discriminator). This is forced by the zero-sum structure — the discriminator's loss and the generator's loss must cancel. Wasserstein GANs escape this by replacing JSD with a distance that has non-vanishing gradients even for non-overlapping distributions, at the cost of a more complex constraint on the discriminator.
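The saturation is easy to see numerically: for two point masses with disjoint support, the JSD equals $\log 2$ no matter how far apart they sit, so it carries no gradient signal about the distance. A small sketch with discrete bins:

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Point mass at bin 0 vs point mass at bin k on a line of 11 bins:
# the JSD is log 2 however far apart the masses are, so its "gradient"
# with respect to the generator's position is zero -- the saturation
# described above. (A Wasserstein distance would grow linearly in k.)
for k in (1, 5, 10):
    p = [1.0] + [0.0] * 10
    q = [0.0] * k + [1.0] + [0.0] * (10 - k)
    assert abs(jsd(p, q) - math.log(2)) < 1e-12
```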
Multi-Agent Reinforcement Learning
In multi-agent RL, $n$ agents interact in a shared environment. Each agent $i$ has policy $\pi_i$, state observations $o_i$, and reward $r_i$. The central challenge beyond single-agent RL is nonstationarity: from each agent's perspective, the environment shifts as the other agents update their policies.
Nash Q-learning. In cooperative or competitive settings, agents can learn Nash equilibria by maintaining Q-functions over joint actions and computing Nash equilibria at each step. In practice, this requires solving a small game at each update — feasible for two-player zero-sum, PPAD-hard in general.
Opponent modeling. An agent that maintains a model of opponents' policies can exploit predictable behavior. But if both agents model each other recursively, the reasoning can regress infinitely — "I think you think I think..."
Credit assignment. In cooperative settings with a shared reward, attributing individual contributions is hard. The counterfactual baseline (Foerster et al., 2018) compares an agent's return to what the team would have received had the agent played a default action, approximating the VCG externality computation.
Emergent communication. When agents must communicate to achieve joint goals and the communication channel is learned (not designed), agents develop proto-languages. The structure of these emergent languages is studied through the lens of signaling game equilibria.
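The counterfactual baseline above can be sketched for two agents with a hypothetical tabular team Q-function and an assumed fixed default action:

```python
# COMA-style counterfactual baseline sketch for two agents.
# Q is a made-up team Q-function over joint actions.
Q = {
    ("help", "help"): 10.0,
    ("help", "idle"):  4.0,
    ("idle", "help"):  6.0,
    ("idle", "idle"):  0.0,
}
DEFAULT = "idle"  # assumed default action for the baseline

def counterfactual_advantage(agent, joint):
    """Team value of the joint action minus the value the team would
    have received had `agent` played the default action instead
    (the other agent's action held fixed)."""
    default_joint = tuple(DEFAULT if i == agent else a
                          for i, a in enumerate(joint))
    return Q[joint] - Q[default_joint]

# Each agent's marginal contribution when both help:
adv0 = counterfactual_advantage(0, ("help", "help"))   # 10 - 6 = 4
adv1 = counterfactual_advantage(1, ("help", "help"))   # 10 - 4 = 6
```

Note the resemblance to the VCG externality computation: each agent is credited with the difference its action made to the team, not the raw shared reward.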
Auction Theory in Ad Systems
Online advertising uses auction mechanisms at scale. For a single keyword slot, second-price (Vickrey) auctions are DSIC: advertisers bid their true value, winner pays the second-highest bid.
For multi-slot auctions (search results, multiple ad positions), the Generalized Second Price (GSP) mechanism assigns slots in bid order, with each winner paying the bid of the next-lower bidder. GSP is not DSIC — truthful bidding is not a dominant strategy. However, GSP has an envy-free Nash equilibrium (Edelman, Ostrovsky, and Schwarz, 2007) that generates the same revenue as VCG. In practice, GSP is preferred for its simplicity and resistance to certain forms of manipulation.
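The two payment rules can be sketched with made-up bids (real systems additionally weight bids by predicted click-through rate, which this sketch omits):

```python
# Second-price (single slot) and GSP (multi-slot) payments, sketched.
def second_price(bids):
    """Winner pays the second-highest bid."""
    order = sorted(bids, key=lambda b: -b[1])
    (winner, _), (_, runner_up) = order[0], order[1]
    return winner, runner_up

def gsp(bids, num_slots):
    """Slots assigned in bid order; each winner pays the next-lower bid."""
    order = sorted(bids, key=lambda b: -b[1])
    return [(order[i][0], order[i + 1][1]) for i in range(num_slots)]

bids = [("A", 5.0), ("B", 3.0), ("C", 2.0), ("D", 1.0)]
winner, price = second_price(bids)      # A wins, pays 3.0
slots = gsp(bids, num_slots=2)          # [("A", 3.0), ("B", 2.0)]
```

In the single-slot case GSP and second-price coincide; the incentive problems appear only with multiple slots, where shading your bid can move you to a cheaper slot.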
Mean-Field Games
When the number of strategic agents grows large ($n \to \infty$), tracking all pairwise interactions becomes intractable. Mean-field game (MFG) theory (Lasry & Lions, 2007) approximates large populations of homogeneous strategic agents by replacing pairwise interactions with a single mean-field distribution $\mu$ over the population state.
A mean-field equilibrium is a fixed point: each agent best-responds to $\mu^*$, and the induced distribution of agents' states equals $\mu^*$. Formally:

$$\pi^* \in \arg\max_\pi J(\pi, \mu^*), \qquad \mu^* = \mu[\pi^*]$$

where $J(\pi, \mu)$ is a representative agent's objective given population distribution $\mu$, and $\mu[\pi]$ is the state distribution induced when every agent plays $\pi$.
MFGs apply to pricing (each firm best-responds to the aggregate market price), traffic routing (each driver best-responds to aggregate congestion), and epidemic control (each agent best-responds to population-level infection rate).
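A toy routing instance illustrates the fixed-point computation. The cost functions are made up, and damped best-response iteration is one simple way (not the only one) to find the equilibrium:

```python
# Toy mean-field routing game: two routes, mu = fraction of the population
# on route 0. Each agent best-responds to the aggregate congestion; a damped
# fixed-point iteration drives mu toward the mean-field equilibrium.
def costs(mu):
    # Assumed congestion costs: route 0 gets worse as mu rises,
    # route 1 as (1 - mu) rises.
    return 1.0 + 2.0 * mu, 2.0 + 1.0 * (1.0 - mu)

mu = 0.5
for t in range(10000):
    c0, c1 = costs(mu)
    best_response = 1.0 if c0 < c1 else 0.0   # all mass to the cheaper route
    alpha = 1.0 / (t + 2)                     # decaying damping (averaging)
    mu = (1 - alpha) * mu + alpha * best_response

# At the mean-field equilibrium the two route costs are equal,
# which for these costs gives mu* = 2/3.
```

The fixed point equalizes costs across routes: no individual driver can gain by switching, given the aggregate congestion.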
Inverse Game Theory and IRL
Inverse reinforcement learning (IRL) infers a reward function $r$ from observed expert behavior, assuming the expert is optimizing $r$. This is the inverse problem to RL.
Maximum entropy IRL (Ziebart et al., 2008) resolves the ill-posedness (many rewards are consistent with any behavior) by finding the reward that makes observed behavior maximum-likelihood under a maximum-entropy policy. The agent's policy satisfies:

$$\pi(a \mid s) \propto \exp\!\big(Q^*_r(s, a)\big)$$
which is the Boltzmann rational model. Maximum entropy IRL recovers the reward by gradient ascent on the log-likelihood of observed trajectories.
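The Boltzmann-rational policy is just a softmax over Q-values; a minimal sketch with a made-up Q-table and a hypothetical rationality parameter `beta`:

```python
import math

# Boltzmann-rational policy: actions are chosen with probability
# proportional to exp(beta * Q(s, a)).
def boltzmann_policy(q_values, beta=1.0):
    logits = [beta * q for q in q_values]
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

pi = boltzmann_policy([2.0, 1.0, 0.0], beta=1.0)
# Higher-Q actions are exponentially more likely, but every action has
# nonzero probability -- this keeps the IRL likelihood well-defined even
# for suboptimal demonstrations.
```

As `beta` grows the policy approaches the greedy argmax; as `beta` shrinks it approaches uniform random, which is the sense in which `beta` models the expert's degree of rationality.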
Worked Example
GAN Training Instability: The Rock-Paper-Scissors Analogy
Consider training a GAN where at each step the generator picks one of three "styles" and the discriminator picks one of three "detectors." The payoff matrix is the RPS matrix

$$A = \begin{pmatrix} 0 & -1 & 1 \\ 1 & 0 & -1 \\ -1 & 1 & 0 \end{pmatrix}$$

with value $0$ and unique Nash equilibrium $(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3})$ for both players.
Simultaneous gradient descent on this game performs the updates:

$$x_{t+1} = x_t + \eta A y_t, \qquad y_{t+1} = y_t - \eta A^\top x_t$$
The eigenvalues of the update Jacobian are purely imaginary for the RPS matrix — gradient descent orbits around the Nash equilibrium rather than converging to it. The iterates cycle: if today the generator favors Rock, the discriminator shifts to Paper, the generator shifts to Scissors, and so on indefinitely. This limit-cycle behavior is the root cause of GAN training instability.
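A short simulation makes the orbiting concrete. This is a sketch that treats $x$ and $y$ as unconstrained deviations from uniform play (it ignores the simplex constraint), so the Nash point sits at the origin:

```python
# Simultaneous gradient steps on the bilinear RPS game x^T A y.
# The distance from the Nash point (the origin) never shrinks:
# the iterates spiral outward instead of converging.
A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(M):
    return [[M[j][i] for j in range(3)] for i in range(3)]

eta = 0.1
x, y = [0.1, 0.0, -0.1], [0.0, 0.1, -0.1]   # small deviations from Nash
norms = []
for _ in range(200):
    gx = matvec(A, y)               # gradient for the maximizing player
    gy = matvec(transpose(A), x)    # gradient for the minimizing player
    x = [xi + eta * g for xi, g in zip(x, gx)]   # simultaneous updates
    y = [yi - eta * g for yi, g in zip(y, gy)]
    norms.append(sum(v * v for v in x + y) ** 0.5)
```

Because the bilinear cross-terms cancel, each step adds a strictly positive $\eta^2$ term to the squared distance from Nash, so the orbit slowly expands rather than merely cycling.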
Fixes. Techniques like simultaneous gradient ascent-descent with momentum, optimistic gradient descent (Daskalakis et al., 2018), or consensus optimization (Mescheder et al., 2017) introduce extra-gradient-style corrections that dampen cycling. These converge in convex-concave games, though GAN objectives are neither convex in the generator's parameters nor concave in the discriminator's parameters.
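A sketch of the optimistic update on the same bilinear RPS game, using the standard $2 g_t - g_{t-1}$ extrapolation; unlike plain simultaneous gradient steps, the iterates contract toward the Nash point:

```python
# Optimistic gradient descent-ascent (OGDA) on the bilinear RPS game,
# again treating x and y as unconstrained deviations from uniform play.
A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(M):
    return [[M[j][i] for j in range(3)] for i in range(3)]

eta = 0.1
x, y = [0.1, 0.0, -0.1], [0.0, 0.1, -0.1]
prev_gx, prev_gy = matvec(A, y), matvec(transpose(A), x)
start = sum(v * v for v in x + y) ** 0.5
for _ in range(500):
    gx = matvec(A, y)
    gy = matvec(transpose(A), x)
    # Optimistic extrapolation: step along 2*g_t - g_{t-1}.
    x = [xi + eta * (2 * g - pg) for xi, g, pg in zip(x, gx, prev_gx)]
    y = [yi - eta * (2 * g - pg) for yi, g, pg in zip(y, gy, prev_gy)]
    prev_gx, prev_gy = gx, gy
end = sum(v * v for v in x + y) ** 0.5
# end is well below start: the optimistic iterates spiral inward.
```

The extrapolated gradient anticipates the opponent's next move, which tilts the rotation field inward; for bilinear games this yields linear convergence for small enough step sizes.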
Deriving the JSD Formulation
Starting from $D^*_G$ above, let $m = \tfrac{1}{2}(p_{\text{data}} + p_g)$. Then:

$$C(G) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{2\,m(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{2\,m(x)}\right] = -\log 4 + \mathrm{KL}(p_{\text{data}} \,\|\, m) + \mathrm{KL}(p_g \,\|\, m) = -\log 4 + 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$$
Since $\mathrm{JSD} \ge 0$ with equality iff $p_{\text{data}} = p_g$, the global minimum of $C(G)$ over $G$ is $-\log 4$, achieved exactly when the generator matches the data distribution.
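The identity checks out numerically on a small discrete support (made-up probabilities):

```python
import math

# Numeric check of V(G, D*_G) = -log 4 + 2 * JSD(p_data || p_g)
# on a three-point support.
p_data = [0.7, 0.2, 0.1]
p_g    = [0.1, 0.3, 0.6]

d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]
v = sum(pd * math.log(d) + pg * math.log(1 - d)
        for pd, pg, d in zip(p_data, p_g, d_star))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

m = [(pd + pg) / 2 for pd, pg in zip(p_data, p_g)]
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

assert abs(v - (-math.log(4) + 2 * jsd)) < 1e-12
```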
Connections
Where Your Intuition Breaks
RLHF is often framed as "training a reward model from human feedback." The mechanism design framing exposes what this actually requires: the human evaluator is an agent with private preferences (their type), the preference collection protocol is a direct mechanism, and reward hacking (Goodhart's Law) is what happens when the mechanism is not incentive-compatible. The revelation principle says you lose nothing by asking humans to directly report their preferences — but it assumes humans are rational agents maximizing their utility, with well-defined types. Real human evaluators are inconsistent, context-sensitive, and susceptible to framing effects. When the mechanism design assumption fails, RLHF trains a model that optimizes a proxy of human preferences, not the preferences themselves. This is not a failure of the mechanism design analogy — it is the analogy correctly identifying where the theoretical guarantee breaks down.
GAN equilibrium exists but gradient descent doesn't reach it. The minimax theorem guarantees Nash equilibrium exists for the GAN objective — but the theorem says nothing about how to find it algorithmically. Gradient descent converges to Nash only in convex-concave games (by the theory of monotone operators). The GAN objective is not convex-concave in the neural network parameters, so simultaneous gradient updates cycle rather than converge. This is not a failure of game theory — it is a precise diagnosis of why gradient descent is the wrong algorithm for this game.
| Framework | Game-theoretic concept | ML instantiation |
|---|---|---|
| GANs | Zero-sum minimax | Generator vs. discriminator |
| MARL | Nash / correlated equilibrium | Multi-agent policy learning |
| Ad auctions | VCG / GSP mechanism | Bid optimization, click prediction |
| RLHF | Mechanism design | Reward model as preference elicitor |
| IRL | Inverse game theory | Reward learning from demonstrations |
| Mean-field games | Mean-field equilibrium | Large-scale agent simulation |
RLHF as mechanism design. In Reinforcement Learning from Human Feedback (RLHF), a human evaluator (principal) provides preference labels to train a reward model, which then guides LLM fine-tuning. The mechanism design framing makes the strategic structure explicit: the human evaluator has private preferences (their type), the reward model is the designer's attempt to elicit those preferences honestly, and the LLM is the agent whose behavior the designer wishes to control. The revelation principle applies: if humans are rational, a direct mechanism (ask for types, not actions) is without loss of generality. In practice, humans are not rational evaluators and the reward model is an approximation — so reward hacking (Goodhart's Law) is the mechanism design analogue of a strategic agent manipulating a flawed mechanism.
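The usual concrete instantiation of preference elicitation is a Bradley-Terry likelihood over pairwise comparisons; a minimal sketch with a made-up scalar reward table standing in for the reward model:

```python
import math

# Bradley-Terry sketch of RLHF preference learning: the probability the
# evaluator prefers response a over response b is sigmoid(r(a) - r(b)).
reward = {"a": 2.0, "b": 0.5}   # hypothetical reward-model outputs

def pref_prob(ra, rb):
    return 1.0 / (1.0 + math.exp(-(ra - rb)))

def nll(prefs):
    """Negative log-likelihood of observed (winner, loser) preference pairs.
    Training the reward model means minimizing this over its parameters."""
    return -sum(math.log(pref_prob(reward[w], reward[l])) for w, l in prefs)

p = pref_prob(reward["a"], reward["b"])   # P(evaluator prefers a to b)
loss = nll([("a", "b"), ("a", "b"), ("b", "a")])
```

The mechanism design lens applies here: the likelihood assumes the evaluator's comparisons reflect a consistent underlying utility, which is exactly the rationality assumption that real, inconsistent evaluators violate.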