Fictitious Play
Last modified 9/21/2004
Introduction
There are a number of different ways to find
Nash
equilibria, including the mathematical techniques discussed in the
previous
lecture. Later in the semester, we will spend a lot of time
talking
about using multi-agent learning to find solutions to certain
multi-agent
games where the agents do not know the payoff matrix. Because of
this emphasis on learning, I think it is useful to discuss a
learning-based
technique for finding Nash equilibria. This learning-based
technique
is called Fictitious Play.
Note that for learning to
take
place, there must be repeated encounters with another agent.
Thus,
when we talk about learning, we will talk about stage games which
are simply iterated plays on the same game. Each new encounter on
the game is called a stage. There are two types of learning that
can occur in stage games. The first type of learning assumes that
the stage game will stop at some point; after the game has stopped, the
agent will take what it has learned and use this strategy forever
after.
Fictitious play is an algorithm from this type of learning. The
second
type of learning assumes that the stage game may never stop, or that
the
agents cannot determine when it will stop; this implies that the agent
is learning online and must be capable of learning indefinitely.
The no-regret learning algorithms (check with me for a reference)
are examples of this type of learning.
In the remainder of these notes, we will discuss the
motivation for the algorithm, two variants of the algorithm, and some
characteristics
of the algorithm.
I will base my description of Cournot adjustment
and fictitious play on the book "The Theory of Learning in Games" by D.
Fudenberg and D. K. Levine, MIT Press, 1999. I will also use some
of the material from Chapter 8 of "Behavioral Game Theory: Experiments
in Strategic Interaction" by C. F. Camerer, Princeton University Press,
2003.
Modeling Other Agents and Cournot Adjustment
Fictitious play is a technique for finding the
Nash
equilibrium to a game. The idea is to
-
guess what the other agent is going to do,
-
select a best
response to this guess,
-
observe what the other agent did and update the guess, and
-
repeat.
This is done before real play actually occurs; it is an imaginary
"battle
of the wits" with the other agent (hence the name fictitious
play).
During this imaginary contest, each agent counts the number of times
that
the other agent plays a particular strategy. At the end of the
imaginary
contest, each agent has developed a description of what strategy the
other
agent is likely to play. If all of the individual descriptions
are
put together and if the updating went well, the combined descriptions
correspond
to the Nash equilibrium.
Before getting into standard
fictitious play, it is useful to study a very simple way to model
another
agent. A simple approach is to guess that another agent will
repeat
the same action this stage as they did the previous time you met.
Given this simple model, the agent generates a best response to this
model.
This model and technique for generating a solution is called Cournot
adjustment; it is perhaps the most simple model of competition
treated
in the economics literature.
Example of Cournot dynamics: Best Response and Adjustment.
Suppose that two players are repeatedly playing the Fighters
and Bombers problem discussed in the online notes.
|
Payoff Matrix for the Fighter vs. Bomber
|
|
Fighter/Bomber
|
Look Up
|
Look Down
|
|
Sun Attack
|
0.95
|
1
|
|
Bottom Attack
|
1
|
0
|
Each agent has a prior model of the other
agent
(or, the randomly select some model). Suppose that the fighter
has
some beliefs about what the bomber will do. We can write this
model
as Pfighter(Up) and Pfighter(Down).
The
Pfighter() denotes the belief that the fighter has about the
other agent. Thus, Pfighter(Up) denotes the fighter's
belief that the bomber will look up.
If the fighter believes that the bomber will look up every time,
then
the mode is written as Pfighter(Up)=1 and Pfighter(Down)=0.
Based on this model, we can compute the expected utility of each of the
fighter's possible actions, and then choose the action with highest
expected
utility. The expected utility of the sun attack is E(Sun)=0.95 *
1 + 1 * 0=0.95, and the expected utility of the bottom attack is
E(Bottom)
= 1*1 + 0*0 = 1. The best response for the fighter is to do
a bottom attack.
So, the fighter chooses a bottom attack.
Suppose that the bomber chose to look down. If this is the case,
then the fighter gets 0 points (this is a fictitious play, so death
does
not mean the end of the stages). According to the Cournot model,
the fighter should adjust its model such that all belief is placed in
the
bomber's most recently observed strategy. This means that Pfighter(Up)=0
and Pfighter(Down)=1. Corresponding to these
probabilities,
the fighter computes new expected values for a sun attack as
E(Sun)=0.95
* 0 + 1 * 1=1, and the expected utility for a bottom attack as
E(Bottom)
= 1*0 + 0*1 = 0. The best response is now the Sun attack.
Sometimes, Cournot dynamics find the Nash
equilibrium
very easily. Consider what would happen if both agents used
Cournot
adjustment in the Prisoner's dilemma --- the Nash equilibrium would
quickly
be reached. However, sometimes Cournot adjustment leads to
cycles.
Consider the following example.
Example of Cournot dynamics: Cycles. Consider the
game
of rock-paper-scissors.
|
Payoff Matrix for Rock-Paper-Scissors
|
|
Row/Column
|
Rock
|
Paper
|
Scissors
|
|
Rock
|
(0,0)
|
(-1,1)
|
(1,-1)
|
|
Paper
|
(1,-1)
|
(0,0)
|
(-1,1)
|
|
Scissors
|
(-1,1)
|
(1,-1)
|
(0,0)
|
The payoff for the row player is listed first in each ordered pair,
and the payoff for the column player is listed second. For
example,
if row plays rock and column plays paper, then row loses 1
point and column wins 1
point.
Under Cournot dynamics, starting the game
requires
each player to make an initial guess about what the other player is
going
to do. Suppose that each player guesses that the other will play
Rock. This guess can be formally described as follows:
-
Row Player's Model of Column Player
-
Prow(Rock)=1
-
Prow(Paper)=0
-
Prow(Scissors)=0
-
Column Player's Model of Row Player
-
Pcolumn(Rock)=1
-
Pcolumn(Paper)=0
-
Pcolumn(Scissors)=0
In this notation, Prow(Rock)=1 means that the row player
believes that the column player will play rock every time. The P
represents a probability, and since all guess is placed on a single
action,
all probability is on Rock.
Both row and column now get to make a
choice.
They do this by finding the best response to the other player using the
guess of what the other player might do. According to these
models,
Row's best response given its model is to play Paper, and Column's best
response given its model is to play Paper. Each player gets 0
points.
After receiving the points, the players modify
their models using the Cournot adjustment process. This means
that
the new models are
-
Row Player's Model of Column Player
-
Prow(Rock)=0
-
Prow(Paper)=1
-
Prow(Scissors)=0
-
Column Player's Model of Row Player
-
Pcolumn(Rock)=0
-
Pcolumn(Paper)=1
-
Pcolumn(Scissors)=0
This new model was obtained by changing all belief from rock to
paper.
In essence, we say "She played paper, so I think that she will play
paper
again." We ignore the fact that on the first play we believed
that
the other agent would play rock. This is what Cournot adjustment
demands. The new best response for each agent is scissors, each player
gets 0 points, and each player updates its model to place all weight on
scissors.
This cycle continued indefinitely. Thus,
the Cournot dynamics did not work very well.
What do you think would happen with Cournot
adjustment
in the Battle of the Sexes game? Would it depend on the original
model?
Fictitious Play Basics
Cournot adjustment is too simple to work in most games, but it does
give
us some insight into what we could do better. Fictitious play is
a substantial improvement on Cournot dynamics that allows each agent to
remember everything that went on before. In essence, fictitious
play
changes the Cournot model so that, rather than placing all weight on
the
previously played action, weight is shifted between actions in such a
way
that a probabilistic model is produced. The probabilities are
obtained
by counting the frequencies with which particular actions are
chosen.
Simply put, we count the number of times the other player selects an
action,
and then divide this by the total number of stages chosen.
Example of Fictitious Play dynamics. Consider the
game
of rock-paper-scissors, and suppose that each player has an initial
model
as follows:
-
Row Player's Model of Column Player
-
Prow(Rock)=1
-
Prow(Paper)=0
-
Prow(Scissors)=0
-
Column Player's Model of Row Player
-
Pcolumn(Rock)=1
-
Pcolumn(Paper)=0
Pcolumn(Scissors)=0
As before, each player selects the best response to this model,
yielding
a choice of Paper for each player. Now, however, the players
change
their model as follows:
-
Row Player's Model of Column Player
-
Krow(Rock)=1
Prow(Rock)=1/2
-
Krow(Paper)=1
Prow(Paper)=1/2
-
Krow(Scissors)=0
Prow(Scissors)=0/2
-
Column Player's Model of Row Player
-
Kcolumn(Rock)=1
Pcolumn(Rock)=1/2
-
Kcolumn(Paper)=1
Pcolumn(Paper)=1/2
-
Kcolumn(Scissors)=0
Pcolumn(Scissors)=0/2
This model has two elements: a count, which is denoted by K, and a
probablity,
which is denoted by P. The count is simply the number of times
that
each element has been chosen. Since my initial model assumed the
probability of choosing rock was 1, it made sense to me to say that
rock
had been played once. (I'll talk more about this bootstrapping
process
in class, but trust me for now.) So, assuming I pretended that I
initially saw them play rock, I set Krow(Rock)=1. I
then
saw the other agent play Paper, so I set Krow(Paper)=1.
Since Scissors has not been observed, I set Krow(Scissors)=0.
Probabilities are then created by taking the K number and dividing it
by
the total number of actions that have been played. Thus, Prow(Rock)=1/2
because two actions have been played, and one of those actions was Rock.
On the first round of play, each player chose
paper so each player incremented the count for paper.
Probabilities
were then formed by dividing the count associated with each action by
the
total number of plays. Since my initial model assumed that rock
was
important, it made sense to me to say that there had been two plays ---
one real play, and one guessed play.
Given this new model, what is the best
response
for each agent? Each agent assumes that the other agent has a 50%
chance of playing rock and a 50% chance of playing paper, so the
expected
utilities of each possible counter is
-
E(Rock) = 0*.5 + -1*.5 + 1*0 = -.5
-
E(Paper) = 1*.5 + 0*.5 + -1*0 = .5
-
E(Scissors) = -1*.5 + 1*.5 + 0*0 = 0
In essence, we assume that the other agent will act according to the
model we have constructed of them. We then search through all of
our actions to find the action that gives us the maximum expected
utility
with respect to this model. We choose this maximal action.
Given our model above, the best response for each agent is therefore to
play Paper.
Updating the model gives:
-
Row Player's Model of Column Player
-
Krow(Rock)=1
Prow(Rock)=1/3
-
Krow(Paper)=2
Prow(Paper)=2/3
-
Krow(Scissors)=0
Prow(Scissors)=0/3
-
Column Player's Model of Row Player
-
Kcolumn(Rock)=1
Pcolumn(Rock)=1/3
-
Kcolumn(Paper)=2
Pcolumn(Paper)=2/3
-
Kcolumn(Scissors)=0
Pcolumn(Scissors)=0/3
Why did Krow(Paper)=2 go up? Because I have now
seen
that column has played Paper twice so I start to believe more that he
or
she will play paper again in the future. The probability becomes
Prow(Paper)=2/3 because I have seen paper two of the three
times
that the game was played.
Taking expectations with respect to these probabilistic models
gives
-
E(Rock) = 0*1/3 + -1*2/3 + 1*0 = -2/3
-
E(Paper) = 1*1/3 + 0*2/3 + -1*0 = 1/3
-
E(Scissors) = -1*1/3 + 1*2/3 + 0*0 = 1/3
Rather than repeating this for a long time, it is better to show a
movie
of how the probabilities change as a function of experience. This
movie is available in avi format here.
Note how the probabilities initially jump around a lot, but they
eventually
converge to [1/3,1/3,1/3]. Interestingly, this is the Nash
equilibrium
for this game.
Fictitious Play Characteristics
The behavior of the fictitious play algorithm is typical; agents jump
around
a lot until eventually the probabilties converge to the Nash
equilibrium
values. It is worth considering some properties of the
algorithm.
To help you better understand what I am talking about, I will have you
play with each of these properties in the corresponding
homework
assignment.
Property 0:
The solution to fictitious play is the probability models obtained by
each
agent. For example, if the rock-paper-scissors agents continue
playing
fictitious play as in the example and the movie, the probabilities
approach
-
Row Player's Model of Column Player
-
Prow(Rock)=1/3
-
Prow(Paper)=1/3
-
Prow(Scissors)=1/3
-
Column Player's Model of Row Player
-
Pcolumn(Rock)=1/3
-
Pcolumn(Paper)=1/3
-
Pcolumn(Scissors)=1/3
These probabilities are the Nash equilibrium values in mixed strategy
space.
This is why fictitious play is useful; for many problems, the
probabilities
in ficitious play approach the Nash equilibrium solutions.
Property 1:
The first property is actually a formal statement of the algorithm.
-
Start with some initial guess of probabilities, encoded as a
count.
This count acts as a prior probability.
-
Find the best response using the estimated probabilities. Play
this
best response. If there are more than one action with highest
expected
utility, choose one at random.
-
Observe what the other agent did, and adjust the counts accordingly.
-
Repeat steps 2-3 until you get tired.
Property 2: Prior values matter
The first property is that the numbers that initially go into K affect
the dynamics of the algorithm. Consider the rock paper scissors
game
with initial values of K=[1000,0,0] as shown in this avi
movie.
Did you notice how many iterations it took to overcome the initial
bias.
Property 3: Sometimes ficitious play cycles
For some games, the fictitious play algorithm does not converge to a
probability.
Instead, the probabilities cycle indefinitely. I will not discuss
these technical details here, but you will experiment with them in the
homework.
Property 4: Sometimes ficitious play fails to converge or cycle
For some games, fictitious play does not find a nice cycle to
follow.
Instead, the process converges to a strange attractor a la chaos
theory.
Property 5: Multiplayer games can be tricky
The key question in a multiplayer (more than two player) game is how to
keep track of the moves of other agents. Should each other agent
be modelled, or should the group as a whole be modelled? By
modelling
each agent individually, it is possible to detect correlations in the
choices
made by other agents; this requires a lot of storage space and effort,
but it allows fictitious play to exploit patterns better. By
modelling
the group as a whole, space is reduced but so is power.
Property 6: Initial guesses become less important as time passes
You were probably a bit skeptical of my initial guess in the rock paper
scissors game. Why was I counting this guess as if I had really
seen
rock played once? The answer is that it does not really matter if
the game is played a long time. The contribution from the initial
guess becomes vanishingly small as more and more observations add
up.
For example, if I have never seen Rock, but seen Paper and Scissors one
million times each, my subjective probability for rock becomes 1 (the
initial
guess ) divided by two million. For those of you who are familiar
with Bayes theory, this is analogous to the way a Bayesian prior
becomes
less important as more and more evidential updating occurs.
Phase Trajectories
As a homework assignment, you will be asked to program the fictitious
play
algorithm and evaluate it on a couple of different problems. It
will
be helpful to you to have a way of discussing the results of your
program
on games where fictitious play does not converge. The notion of a
phase trajectory may help you better understand how the algorithm
works.
The main reason that phase trajectories are worth looking at is that
patterns
of behavior are easier to see. Thus, phase trajectories present a
visualization technique that helps you see what is happening with the
learning
dynamics.
By now, you have already looked at the movies that show how the
probabilities
in rock-paper-scissors converge to the Nash equilibrium. These
movies
showed the probabilities of each action as a bar graph. By observing
the
bar graph, you may have noticed that certain patterns appeared in how
the probabilites are updated. One way to visualize these
probabilities is
to cross plot one probability against the others. Cross plotting
how parameters or states change over time is known as a phase
trajectory
or as a state-space trajectory.
The idea is best illustrated by an animated example. In the movie,
I show the probabilities from the rock-paper-scissors example presented
above. The x-axis is the probability of playing Rock, and the
y-axis
is the probability of playing Paper. The probability of playing
Scissors
is not shown since it is given by P(Scissors) = 1 - P(Rock) - P(Paper)
--- (can you see why?). Each point on the graph is a different
step
of the algorithm. The first point, where all probability is
placed
on the other agent playing Rock, is in the lower right corner.
The key thing to observe is that cycles are clearly illustrated in
the
phase trajectory. Notice how the plot is a series of triangles
that
get smaller as learning occurs. (At the end, I only plot every
1000th
point, so the triangle pattern is not shown.) This sequence of
triangles
converges to the Nash equilibrium point.
Weighted Fictitious Play
There are several variants of fictitious play that seek to overcome the
game's deficiencies. The first variant is called weighted
ficitious
play. If I have time, I'll write more about these variants later.