Cooperative Games: Repeated Play
Many of the concepts, including all of the theorems at the bottom of the page, are taken from Axelrod's great 1984 book The Evolution of Cooperation.
Updated Aug 29, 2005.
Battle of the Sexes Revisited
Remember the famous game of Battle of the Sexes. In the
game,
a husband and wife must independently decide on a date activity.
The husband would prefer one form of entertainment, say fishing, and
the
wife would prefer another form of entertainment, say shopping for
clothes.
Although both have their most preferred activity, both prefer being
together
to being alone.
Battle of the Sexes Consequences
| husband/wife | Shopping | Fishing |
|---|---|---|
| Fishing | Though the husband gets to go fishing and the wife shopping, they are apart; both say "Ehh." | The husband gets to spend time with his wife and go fishing. The wife gets to spend time with her husband. |
| Shopping | The wife gets to spend time with her husband and go shopping. The husband gets to spend time with his wife. | The husband goes shopping and the wife goes fishing; both say "Yuck!" |
By deciding to go fishing, the wife defects to her husband's side, and by deciding to go shopping, the husband defects to his wife's side; each one cooperates by sticking with his or her own preferred activity.
Using this cooperate/defect terminology, with 4 as the most preferred outcome and 1 as the least preferred, the payoffs for this game are as follows:
Battle of the Sexes Payoffs
| P1/P2 | Cooperate | Defect |
|---|---|---|
| Cooperate | (2,2) | (4,3) |
| Defect | (3,4) | (1,1) |
Note that I switched from the game-specific "husband/wife" formulation
to the abstract "P1/P2" formulation. Since the game is symmetric,
there is no need to keep the stereotypical labels once the abstraction
to payoffs is performed.
Maximin vs. Equilibrium
For this game, the maximin solution for both the husband (P1) and the wife (P2) is to cooperate. The bad news is that neither partner is particularly content with this solution; if either one defected, then both would be better off. In fact, either outcome that results when one cooperates and the other defects Pareto-dominates the maximin solution, and both of these outcomes are equilibria (since neither player benefits by unilaterally changing his/her mind).
Unfortunately, with no way to communicate, the players are left making independent choices to try to reach an equilibrium. Additionally, using a mixed strategy does not help their chances. In fact, the expected payoffs for two independent mixed strategies are pretty bad; if both flip an unbiased coin and let it make their choice for them, then they both receive an average (expected) payoff of only 2.5 -- not much better than the maximin value of 2.
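The claims in this section can be checked mechanically. The following is a minimal Python sketch, not part of the original notes; the names PAYOFFS, maximin_action, pure_equilibria, and expected_payoffs are hypothetical helpers chosen for illustration, and actions are encoded as 0 = cooperate, 1 = defect.

```python
# Payoffs from the "Battle of the Sexes Payoffs" table, indexed
# [P1 action][P2 action] with 0 = cooperate and 1 = defect.
# Each entry is (payoff to P1, payoff to P2).
PAYOFFS = [[(2, 2), (4, 3)],
           [(3, 4), (1, 1)]]

def maximin_action(player):
    """Return (action, guaranteed payoff) maximizing the player's worst case."""
    def worst_case(action):
        if player == 0:
            return min(PAYOFFS[action][other][0] for other in (0, 1))
        return min(PAYOFFS[other][action][1] for other in (0, 1))
    best = max((0, 1), key=worst_case)
    return best, worst_case(best)

def pure_equilibria():
    """List action pairs from which neither player gains by deviating alone."""
    result = []
    for a1 in (0, 1):
        for a2 in (0, 1):
            u1, u2 = PAYOFFS[a1][a2]
            if u1 >= PAYOFFS[1 - a1][a2][0] and u2 >= PAYOFFS[a1][1 - a2][1]:
                result.append((a1, a2))
    return result

def expected_payoffs(p1, p2):
    """Expected payoffs when P1 cooperates with probability p1 and P2 with p2."""
    eu1 = eu2 = 0.0
    for a1, q1 in ((0, p1), (1, 1 - p1)):
        for a2, q2 in ((0, p2), (1, 1 - p2)):
            eu1 += q1 * q2 * PAYOFFS[a1][a2][0]
            eu2 += q1 * q2 * PAYOFFS[a1][a2][1]
    return eu1, eu2
```

Under this encoding, maximin_action returns cooperate with a guaranteed payoff of 2 for either player, pure_equilibria returns the two cooperate/defect pairs, and expected_payoffs(0.5, 0.5) gives 2.5 to each player.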
For your information, the set
of possible payoffs for all possible combinations of mixed strategies
is
illustrated below.

This figure represents the expected payoffs to player 1 on the
x-axis and the expected payoffs to player 2 on the y-axis.
Expectations (or averages) are taken with respect to the various
probabilistic strategies that the two players can choose. For
example, if both P1 and P2 cooperate 50% of the time and
defect 50% of the time, then P1 gets an expected payoff of
(1/4)(2) + (1/4)(3) + (1/4)(4) + (1/4)(1) = 2.5.
The graph was created by stepping through a discretized set of
probabilities for player 1 and player 2, computing the expected values
for player 1 and player 2, and then cross-plotting these expected
values. A MATLAB script that will generate the plot is given
below -- it might help you understand how the plot is generated.
(You can open a MATLAB window, cut and paste this text, and run it to
generate the plot.)
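% Plot the expected payoffs reachable with independent mixed strategies.
% p1 and p2 are the probabilities that P1 and P2 cooperate.
clf;
hold on;
for p1 = 0:0.01:1
    for p2 = 0:0.01:1
        % Expected payoff to P1 (terms in order: CC, DC, CD, DD)
        EU1 = 2*p1*p2 + 3*(1-p1)*p2 + ...
              4*p1*(1-p2) + 1*(1-p1)*(1-p2);
        % Expected payoff to P2 (terms in order: CC, DC, CD, DD)
        EU2 = 2*p1*p2 + 4*(1-p1)*p2 + ...
              3*p1*(1-p2) + 1*(1-p1)*(1-p2);
        plot(EU1, EU2, 'b.');
    end
end
hold off;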
One of the frustrating things about this game is the fact that if we
could communicate then we could both do much better. What are
some
of the ways that communication can affect the outcome of this game?
- Taking Turns in Repeated Play: Suppose that we can talk before we
play
the game and come to an agreement about how we might take turns in
playing
this game multiple times. For example, suppose that the pair
decides
to start with fishing and then alternate shopping and fishing.
Then,
on the average, each player receives a payoff of 3.5 units. This
ability to take turns can be visualized using the figure above; taking
turns allows us to fill in the gap between the (3,4) and (4,3) payoffs
(i.e., we can take the convex hull of the above figure).
- Side Payments: Suppose we associate the payoffs in the game
matrix with
a standard unit which we'll call the happiness unit (HU).
One way to make sure that I get what I want in this game is to do some
service for you that increases your happiness and still allows me to
get
what I want. So, for example, if the husband wants to go fishing
then he can offer to do some service for his wife that is worth, say, d
HUs, if she'll agree to go fishing. (For concreteness, we'll let
this service be visiting his wife's mother on the next Sunday.)
Let c denote the cost to the husband of performing this service (in HUs); the game payoffs then change to those in the table below. For this modified game,
if c<1 and d>1 then the game has a unique Pareto
optimal
solution (which one is it?), and this solution is also an equilibrium
solution. Is there a unique Pareto optimal solution if
c=d=1? What about if c>1 and d<1? (Note that I switched
away from the abstract P1/P2 notation since the motivation for the type
of payment depends on the stereotypical roles.)
Battle of the Sexes Payoffs (with side payments)
| Husband/Wife | Cooperate | Defect |
|---|---|---|
| Cooperate | (2,2) | (4-c,3+d) |
| Defect | (3,4) | (1,1) |
- Threats: Threats are similar to side payments, but instead of increasing your opponent's payoff for cooperating, a threat decreases your opponent's payoff if they don't cooperate. For the Battle of the Sexes game that we are discussing, the husband could tell his wife that he won't go to his mother-in-law's house next Sunday if she refuses to go fishing. (For concreteness purposes, this husband is a jerk.) This means that the wife stands to lose d HUs if she chooses to go shopping. The husband also stands to lose something (since the wife is likely to be justifiably mad at him on the fishing trip), and we'll denote this loss by e HUs. The new payoff matrix is shown below. For this modified game, if d>1 and e<1 then the game has a unique Pareto optimal solution (which one is it?), and this solution is also an equilibrium solution. Note that the threat payoff matrix is not as clear-cut as the side payments payoff matrix, because a threat can cost everybody on every outcome of the game.
Battle of the Sexes Payoffs (with threats)
| Husband/Wife | Cooperate | Defect |
|---|---|---|
| Cooperate | (2,2) | (4-e,3) |
| Defect | (3,4-d) | (1,1) |
- Negotiation Protocols: Threats and side payments effectively change the payoffs for the game by introducing extraneous factors. As an alternative to these payoff-modifying forms of communication, some systems have forms of authority that constrain how groups interact and thereby constrain communication and choice. There are dozens of these negotiation (arbitration) protocols, and we'll spend a lot of time discussing some of these. Unfortunately, this discussion will probably be unsatisfactory because every one of the results produced by these protocols has at least one weakness. Some of this discomfort will result from a central theorem called Arrow's Impossibility Theorem, which states that there is no protocol that satisfies all of a particular set of desirable axioms.
- Repeated Play: Even without explicit communication or
authoritarian
restrictions,
cooperation can emerge between agents when they play against each
other
several times. I think this is fascinating, but not
surprising.
As we talk about the emergence of cooperation, we'll find it useful to
use terms and concepts from evolutionary biology. We'll discuss ecologically
fit strategies, which are strategies that are well adapted to the
environment
of a repeated play game; and we'll discuss evolutionary justification
for how these strategies can come to be adopted by a population within
an environment.
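The side-payment and threat variations in the list above can be explored numerically. The sketch below is a minimal Python illustration and not part of the original notes; pareto_optimal, side_payment_game, and threat_game are hypothetical helper names, and the payoff cells are copied from the two modified tables (the first letter is the husband's action, the second the wife's).

```python
def pareto_optimal(cells):
    """Return the labels of outcomes that no other outcome Pareto-dominates."""
    def dominates(x, y):
        # x dominates y if it is at least as good for both and not identical.
        return x[0] >= y[0] and x[1] >= y[1] and x != y
    return [label for label, payoff in cells.items()
            if not any(dominates(other, payoff) for other in cells.values())]

def side_payment_game(c, d):
    """Payoffs when the husband pays cost c to give the wife a benefit d."""
    return {"CC": (2, 2), "CD": (4 - c, 3 + d), "DC": (3, 4), "DD": (1, 1)}

def threat_game(d, e):
    """Payoffs when the threat costs the wife d and the husband e."""
    return {"CC": (2, 2), "CD": (4 - e, 3), "DC": (3, 4 - d), "DD": (1, 1)}
```

Calling pareto_optimal(side_payment_game(c, d)) or pareto_optimal(threat_game(d, e)) for a few values of c, d, and e is one way to check your answers to the questions posed in those two bullets.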
Cooperation in the Prisoner's Dilemma
As we talk about how cooperation emerges in repeated play games, we'll
restrict attention to the Prisoner's Dilemma game. We'll begin by
talking about the wrong way to structure such a game -- a way in which
cooperation is unlikely to emerge. Suppose that there is a
population
of N agents who will interact with one another in a series of
Prisoner's
Dilemma situations. In these situations, two agents (i and
j)
will meet and play L iterations of the game. How should i
make its choice? Relying on the principle of optimality and the
notion
of dynamic programming, we decide to consider what action
i should
take in the Lth (last) iteration of the game. (We can do
this
since the principle of optimality dictates that it doesn't matter what
choices I've made to this point, for my strategy to be optimal it must
be optimal from here on out.) Since agent j will have no
incentive
to cooperate with me (because I can't affect his or her future) I have
no idea what j might do. The rational choice is to play
the
maximin strategy and defect. If j is rational, then
j
will play the maximin strategy too. We can then consider the (L-1)th iteration. Since the payoff for the last iteration is determined, we can (in effect) ignore it, which means we are left with a similar situation and the same solution. Repeating for all iterations, we determine that the rational strategy is to always defect. Each agent can then expect a payoff of L*2, which is much worse than the L*3 available from mutual cooperation when L is large.
How do we correct this problem? Easy! We let the game go on
indefinitely.
You're probably thinking, "That's a stupid idea. The best way to
solve the problem of cooperation in a repeated play game cannot be to
assume
that the players are immortal. Is it too late to drop this
class?"
To which I reply, "I wouldn't drop the class if I were you.
Dropping
the class will hurt my feelings, and that means if I teach a class
which
you must take, or if I'm assigned to your graduate committee, etc.,
then
we will have to interact again. When we interact again, I'll show
you!" (By the way, this example foreshadows the paragraph that
follows.)
It turns out that there is an equivalence between immortal agents
and
agents that will probably interact again. Let me explain.
When
you are an immortal agent, you must figure out some way to balance your
needs today with your needs for the indefinite future. (Think of the movie "Death Becomes Her" with Goldie Hawn, Meryl Streep, and Bruce Willis. In the movie, two shrew-like women are granted immortality, but in their jealousy and pettiness they surrender eternal health for momentary revenge.) One way for immortal agents to achieve this balance is to discount future payoffs. Let 0 <= w < 1 denote this discount factor. If I receive payoff v(i) at iteration i, then my overall utility is given by $V = \sum_{i=0}^{\infty} w^i v(i)$, that is, the discounted sum of all future rewards.
But stop and think for just a minute. If w is the probability that the game will continue to the next iteration, then $w^i$ is the probability that the game will continue to the ith iteration. The expected value for continuing to play the game with these odds is V.
Thus, an eternal interaction between immortals is equivalent to an
interaction
between mortals who are likely to meet in the future. It turns
out
that if agents have a high enough probability of meeting in the future
(and can remember their past interactions) then cooperation can emerge
between the agents and they can expect a higher payoff than possible by
using the always-defect strategy. The most important
strategy
for immortal agents is called tit-for-tat.
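The equivalence between discounting and a chance of meeting again can be checked numerically. The following is a minimal Python sketch under stated assumptions, not part of the original notes: discounted_value and average_total_when_game_may_end are hypothetical names, the infinite sum is truncated to a finite list, and the second function estimates the expectation by random simulation with a fixed seed.

```python
import random

def discounted_value(payoffs, w):
    """Sum of w**i * payoffs[i]: what an 'immortal' agent assigns to the stream."""
    return sum((w ** i) * v for i, v in enumerate(payoffs))

def average_total_when_game_may_end(payoff_at, w, trials=20000, seed=0):
    """Average undiscounted total when play continues after each round with probability w."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        i = 0
        while True:
            total += payoff_at(i)      # round i is reached with probability w**i
            if rng.random() >= w:      # the game ends with probability 1 - w
                break
            i += 1
    return total / trials
```

For a constant payoff of 3 per round and w = 0.9, discounted_value([3] * 500, 0.9) is essentially 30, and the simulated average from average_total_when_game_may_end(lambda i: 3, 0.9) should land close to the same number, up to sampling noise.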
Tit for Tat
What is tit-for-tat? It is a strategy for playing repeated play
Prisoner's
Dilemma for immortal agents that is about as good as you can get.
The strategy is simple. I begin by cooperating, and thereafter I
simply play whatever you played on the previous round.
Remarkably, this simple strategy is markedly superior to other strategies in an ecological sense, even though it never wins a head-to-head competition against any other strategy (at best it ties). To see what this means, consider how tit-for-tat would perform in a very small population with only four strategies: always defect (AD), never forgive (NF), random (R), and tit-for-tat (TfT). In the never forgive strategy, the agent begins by cooperating, but if its opponent ever defects then the NF player defects on every play thereafter. Let's construct a table of the resulting payoffs for w=0.9. We'll find it useful to use the following relation for 0<=w<1:
$\sum_{i=0}^{\infty} w^i V = V/(1-w)$
Payoffs for column strategies against row strategies in the iterated Prisoner's Dilemma
| | AD | NF | R | TfT |
|---|---|---|---|---|
| AD | 2/0.1 | 0.9*2/0.1+1 | (1+2)/(0.1*2) | 0.9*2/0.1+1 |
| NF | 0.9*2/0.1+4 | 3/0.1 | B | 3/0.1 |
| R | (4+2)/(2*0.1) | A | (3+1+2+4)/(0.1*4) | D |
| TfT | 0.9*2/0.1+4 | 3/0.1 | C | 3/0.1 |
| Total | 94 | 79+A | 40+B+C | 79+D |
Table 1: Payoffs for Prisoner's Dilemma
| P1/P2 | Cooperate | Defect |
|---|---|---|
| Cooperate | (3,3) | (1,4) |
| Defect | (4,1) | (2,2) |
Although you should probably check my math, I think that these
answers
are pretty close. Given these answers, we want to figure out
which
one performs best. We note that the worst case performance for
any strategy occurs when always cooperate
plays
always defect, which produces a payoff to always cooperate of
1/0.1=10. Given
this
worst-case payoff, we know that A, B, C, and D are all greater than
10.
We also note that I'm too lazy to calculate these answers out
explicitly,
but we can do a little hand-waving to convince ourselves that TfT is
better
than the others.
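The unknown entries A, B, C, and D can also be worked out directly. The following is a minimal Python sketch, not part of the original notes, and it rests on one assumption the text does not spell out: the random strategy R cooperates with probability 1/2 on every round, independently of everything else. The names REWARD, SUCKER, TEMPTATION, PUNISHMENT, and steady are hypothetical.

```python
W = 0.9                                               # probability of another round
REWARD, SUCKER, TEMPTATION, PUNISHMENT = 3, 1, 4, 2   # payoffs from Table 1

def steady(per_round):
    """Value of receiving the same expected payoff on every round from now on."""
    return per_round / (1 - W)

# A: NF against R. NF cooperates until R first defects, then defects forever.
# While cooperating, V = 0.5*(REWARD + W*V) + 0.5*(SUCKER + W*steady(...)).
A = (0.5 * REWARD
     + 0.5 * (SUCKER + W * steady((TEMPTATION + PUNISHMENT) / 2))) / (1 - 0.5 * W)

# B: R against NF. R earns REWARD or TEMPTATION while NF still cooperates;
# after R's first defection NF defects forever and R averages SUCKER/PUNISHMENT.
B = (0.5 * REWARD
     + 0.5 * (TEMPTATION + W * steady((SUCKER + PUNISHMENT) / 2))) / (1 - 0.5 * W)

# C: R against TfT. TfT opens with cooperation, then echoes R's previous
# (random) move, so every later round averages all four payoffs.
C = (REWARD + TEMPTATION) / 2 + W * steady(
    (REWARD + SUCKER + TEMPTATION + PUNISHMENT) / 4)

# D: TfT against R. TfT cooperates first against a random move, then echoes.
D = (REWARD + SUCKER) / 2 + W * steady(
    (REWARD + SUCKER + TEMPTATION + PUNISHMENT) / 4)
```

Under this assumption A comes out at about 28.2, B at about 18.6, C at 26.0, and D at 24.5, all comfortably above the worst-case bound of 10.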
I did one thousand simulations of these different strategies against
each other. The results are tabulated below:
Table 2: Payoffs for column strategies against row strategies in the iterated Prisoner's Dilemma
| | AD | NF | R | TfT |
|---|---|---|---|---|
| AD | 2/0.1 | 0.9*2/0.1+1 | (1+2)/(0.1*2) | 0.9*2/0.1+1 |
| NF | 0.9*2/0.1+4 | 3/0.1 | 18.5 | 3/0.1 |
| R | (4+2)/(2*0.1) | 28.3 | (3+1+2+4)/(0.1*4) | 24.5 |
| TfT | 0.9*2/0.1+4 | 3/0.1 | 26.0 | 3/0.1 |
| Total | 94 | 107.3 | 84.5 | 103.5 |
Thus, TfT did better than AD and R, but worse (by a little bit)
than
NF. What would have happened if we had increased the payoff for
cooperating?
What would have happened if I had increased the likelihood of having
future
interactions from 0.9 to 0.99?
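One way to explore these questions is to rerun the experiment with different settings. The following is a minimal Python sketch of such a simulation; it is not the code that produced Table 2, and the function names, the fixed random seed, and the assumption that R picks C or D with equal probability are all illustrative choices.

```python
import random

# Payoff to the first player for (own move, opponent's move), from Table 1.
PAYOFF = {("C", "C"): 3, ("C", "D"): 1, ("D", "C"): 4, ("D", "D"): 2}

def always_defect(mine, theirs, rng):
    return "D"

def never_forgive(mine, theirs, rng):
    return "D" if "D" in theirs else "C"

def random_play(mine, theirs, rng):
    return rng.choice("CD")

def tit_for_tat(mine, theirs, rng):
    return theirs[-1] if theirs else "C"

STRATEGIES = {"AD": always_defect, "NF": never_forgive,
              "R": random_play, "TfT": tit_for_tat}

def average_payoff(col, row, w=0.9, trials=1000, seed=0):
    """Average total payoff to `col` against `row`; play continues with probability w."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        hist_col, hist_row = [], []
        while True:
            a = STRATEGIES[col](hist_col, hist_row, rng)
            b = STRATEGIES[row](hist_row, hist_col, rng)
            total += PAYOFF[(a, b)]
            hist_col.append(a)
            hist_row.append(b)
            if rng.random() >= w:      # the pair never meets again
                break
    return total / trials

def column_totals(w=0.9, trials=1000, seed=0):
    """Sum each column strategy's average payoff over all row strategies."""
    return {col: sum(average_payoff(col, row, w, trials, seed)
                     for row in STRATEGIES)
            for col in STRATEGIES}
```

Calling column_totals(w=0.99), or editing the entries of PAYOFF, shows how the ranking of the four strategies shifts; the exact numbers vary with the seed and the number of trials.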
Interestingly enough, two tournaments were held (with slightly different payoffs, but still a Prisoner's Dilemma) with participants including computer hobbyists, game theorists, social psychologists, computer scientists, etc. In both tournaments (see Axelrod), Tit for Tat was a convincing winner. Why? We'll now discuss the reasons. Note one interesting phenomenon before continuing. When we compare the values in the TfT column (the payoffs received by TfT against all other strategies) with the values in the TfT row (the payoffs received by strategies against TfT), we note that TfT never beats any strategy, but its overall payoff is excellent.
Some Results (taken from Axelrod)
I hope as you study these theorems that you consciously notice how
evolutionary
and ecological concepts influence what is being discussed.
Theorem 1: Under Certain Conditions (w high enough), No Strategy
Dominates
Against all other Strategies
Let's see if we believe this. Consider the AD and NF
strategies.
What strategy works best against the AD strategy? Another AD
strategy.
This is because if player P1 always defects then this restricts the
payoff
matrix in Table 1 to the bottom row. Looking carefully at this
bottom
row reveals that the best that P2 can do in this row is to always
defect.
Thus, AD is the best strategy to play against AD. It's also nice
to observe that our logic is verified when we compare the values in the
AD row of Table 2. Recall that each cell shows how much the
strategy
shown in the column receives when it plays against the strategy shown
in
the row. Thus, when I look at the AD row I see that the AD
strategy receives a payoff of 20, NF receives a payoff of 19, R
receives a payoff of 15, and TfT receives a payoff of 19. These results
show that no other strategy can do as well against AD as AD itself
(unless w=0).
Now, let's turn our attention from what works best against AD to
what
works best against NF. All we need to do is to show that
some
strategy other than AD yields a higher payoff than AD. The payoff for AD when it is used against NF is the temptation payoff on the first round (+4 in this case), and the mutual defection payoff thereafter (+2 in this case). By contrast, when NF or TfT plays against NF both receive the mutual cooperation payoff (+3 in this case) for the duration of the game. For w high enough, TfT will do better against NF than AD will. (Can you compute this value for w?) Thus, the best strategy against AD is not the best strategy against NF.
The first paragraph states that the best response strategy against AD is AD. The second paragraph states that the best response strategy against NF is not AD. Thus, no strategy is best against all other strategies.
Theorem 2: Under Certain Conditions, No Strategy Can Invade a
Society
of Tit-for-Tat-ers
What does it mean for an agent who uses strategy A to "invade"
a
society of agents who are using strategy B? Let's answer
this
question first by an example and then by a formal definition.
Consider
a society of agents who always cooperate. In this society, each
agent
always receives a payoff of 3 units every time two agents get
together.
On the average, therefore, each agent can expect a payoff of 3/0.1 = 30
units. We'll denote this payoff V(B|B), where B
stands for "Be nice by always cooperating" and V(B|B) is the
expected
value that a player using B will receive when it faces another
player
using B. Now what happens to an agent who enters this
community,
but who uses the AD strategy? Such an agent always gets the
"temptation"
payoff of 4 units, so the payoff of using AD against a society of B-playing
agents is given by V(AD|B) = 4/0.1 = 40. Since V(AD|B)
>V(B|B) , we say that the agent using AD has invaded
the society.
Thus, a strategy A has invaded a society of agents using
strategy
B
when V(A|B) >V(B|B). When no strategy exists which
can invade
a society of agents using a strategy A then this strategy is
said
to be collectively stable. Theorem 2 simply states that TfT is
a
collectively stable strategy. Our task is to figure out what
conditions
must hold for this theorem to be true.
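The invasion test V(A|B) > V(B|B) is easy to evaluate for deterministic strategies. The following is a minimal Python sketch, not part of the original notes; value and invades are hypothetical helper names, each strategy is written as a function of the opponent's past moves, and the infinite discounted sum is truncated after a fixed number of rounds (an approximation that is only reasonable when w is not too close to 1).

```python
# Payoff to the first player for (own move, opponent's move), from Table 1.
PAYOFF = {("C", "C"): 3, ("C", "D"): 1, ("D", "C"): 4, ("D", "D"): 2}

def always_cooperate(opponent_moves):
    return "C"

def always_defect(opponent_moves):
    return "D"

def tit_for_tat(opponent_moves):
    return opponent_moves[-1] if opponent_moves else "C"

def value(strategy_a, strategy_b, w=0.9, rounds=500):
    """Approximate V(A|B): discounted payoff to A against B over `rounds` rounds."""
    hist_a, hist_b = [], []
    total = 0.0
    for i in range(rounds):
        a = strategy_a(hist_b)
        b = strategy_b(hist_a)
        total += (w ** i) * PAYOFF[(a, b)]
        hist_a.append(a)
        hist_b.append(b)
    return total

def invades(newcomer, native, w=0.9):
    """True when a lone newcomer does strictly better than the natives do among themselves."""
    return value(newcomer, native, w) > value(native, native, w)
```

With w = 0.9, value(always_defect, always_cooperate) is about 40 and value(always_cooperate, always_cooperate) is about 30, so invades(always_defect, always_cooperate) is True, matching the example above.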
Table 3: Payoffs for Prisoner's Dilemma
| P1/P2 | C | D |
|---|---|---|
| C | (R=3,R=3) | (S=1,T=4) |
| D | (T=4,S=1) | (P=2,P=2) |
To find these conditions, we'll find it convenient to use a little
bit
more abstract notation than we have previously done. More
specifically,
I've modified Table 1 by adding R (reward for mutual cooperation), T
(reward
for yielding to temptation), S (reward for being a sucker), and P
(reward
for punishing each other); the changes are shown in Table 3.
We'll now return to our theorem. We begin by showing that we
only
need to consider two strategies to
see if TfT is collectively stable. Observe that TfT has only two
states, depending on what the other player did the previous move (and
assuming
cooperation on the first move). Thus, when we look at strategy A
all we really need to do is look at its past choice. On the
previous
move, suppose that A chose action D.
Then, A will use what it knows about
how
the TfT strategy will respond to D and choose either C or D
on the current move. A similar statement can be made when
strategy
A
chose action C on the previous move. Since TfT has only two
states,
this means that there are only four possibilities for the best that A
can do against TfT: repeated sequences of CC, CD, DC, or DD (can
you see why?). Let's look at these repeated sequences
one at a time.
- The sequence CC behaves just the same as TfT, so such a strategy cannot invade a society of TfT'ers (recall that invasion of A into a society of TfT'ers means that V(A|TfT) > V(TfT|TfT) -- key in on the strictly-greater-than symbol in the inequality).
- The sequence CD cannot do better than both the DC and the CC
sequence,
so we can ignore this. This is not trivial to see. Try and
prove this. Don't worry if you don't get it right the first time,
we'll prove it in class using a simple technique -- be sure to remind
me to show you.
- This means that we need only consider strategies that produce DC
and DD
sequences.
So, what strategy produces DD sequences? None other than our
beloved
AD strategy. To say that AD cannot invade TfT means that V(AD|TfT) <= V(TfT|TfT). When AD meets TfT, it gets T on the first move and P thereafter, making V(AD|TfT) = T + wP/(1-w). By contrast, when TfT meets its twin it receives V(TfT|TfT) = R/(1-w). Thus, the non-invasion requirement translates into T + wP/(1-w) <= R/(1-w) or, equivalently, w >= (T-R)/(T-P). For our payoffs, this means that the probability of meeting again must be at least (4-3)/(4-2) = 0.5. If this is the case, then AD cannot invade TfT.
Now, what strategy produces DC sequences? We'll define a new
strategy
that alternates between D and C, and we'll call it DC (clever,
huh?).
To say that DC cannot invade TfT means that V(DC|TfT) <=
V(TfT|TfT).
But $V(DC|TfT) = (T+wS)/(1-w^2)$. (Can you see why? In DC versus TfT, the rewards that are received go something like $T, wS, w^2T, w^3S, \ldots$. Thus, the total payoff consists of two series: $T + w^2T + w^4T + \cdots + w(S + w^2S + w^4S + \cdots)$. Using the substitution $v = w^2$, these series become $T + vT + v^2T + \cdots + w(S + vS + v^2S + \cdots)$. Applying our useful relation, these two series reduce to $T/(1-v) + wS/(1-v)$, and when we put $v = w^2$ back in and gather like terms, we get $(T+wS)/(1-w^2)$.) The non-invasion requirement therefore translates into w >= (T-R)/(R-S) which, for our payoffs, means that the probability of meeting again must be at least (4-3)/(3-1) = 0.5. If this is the case, then DC cannot invade TfT.
If both of these restrictions on the probability of meeting again are satisfied, then no strategy can invade a society of Tit-for-Tat-ers.
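The two thresholds and the closed-form payoffs can be cross-checked numerically. The following is a minimal Python sketch, not part of the original notes; value_against_tft, AD_THRESHOLD, and DC_THRESHOLD are hypothetical names, the payoffs are those of Table 3, and the infinite sum is truncated after a fixed number of rounds.

```python
T, R, P, S = 4, 3, 2, 1          # temptation, reward, punishment, sucker (Table 3)
PAYOFF = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}

AD_THRESHOLD = (T - R) / (T - P)     # smallest w for which AD cannot invade TfT
DC_THRESHOLD = (T - R) / (R - S)     # smallest w for which DC cannot invade TfT

def value_against_tft(pattern, w, rounds=2000):
    """Discounted payoff from repeating `pattern` (e.g. "DC") forever against TfT."""
    total = 0.0
    tft_move = "C"                   # TfT cooperates on the first round
    for i in range(rounds):
        mine = pattern[i % len(pattern)]
        total += (w ** i) * PAYOFF[(mine, tft_move)]
        tft_move = mine              # TfT echoes my move on the next round
    return total
```

Comparing value_against_tft("D", w) and value_against_tft("DC", w) with value_against_tft("C", w) for values of w on either side of the thresholds shows where the invaders stop gaining; "C" here plays exactly like TfT against TfT.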
Theorem 3: No Individual Can Invade a Society of All-Defectors
I'll leave the proof of this to you. The trick is to show that
for
all strategies A, the collectively stable condition holds V(A|AD)
<= V(AD|AD).
Theorem 4: A Family of Tit-for-Tat-ers Can Invade a Society of
All-Defectors.
I think that this is one of the most interesting theorems that we will
discuss. In essence, it states that whenever a cluster of agents
playing TfT get to play often enough against each other, then this
cluster
can invade a society of agents playing AD. Before we prove the
theorem,
let's define an important concept. Let's talk about what it means
to
invade a society in a family (or cluster). A p-cluster of A
invades
B if pV(A|A) + (1-p) V(A|B) > V(B|B), where p is the
proportion
of interactions by a player using strategy A with another such
player.
So, what we need to show to prove the theorem is that there exists a
p
such that pV(TfT|TfT) + (1-p) V(TfT|AD) > V(AD|AD)
for a given discount parameter w. Let's plug and chug,
and
see what happens. We know that V(TfT|TfT)=R/(1-w),
and that V(AD|AD)=P/(1-w). We also know that V(TfT|AD)=S+wP/(1-w).
Plugging these values into the equation that represents the conditions
for p-cluster invasion, gives
pV(TfT|TfT) + (1-p) V(TfT|AD) > V(AD|AD)
pR/(1-w) + (1-p)(S+wP/(1-w)) > P/(1-w)
pR -p(S(1-w)+wP) > P-wP-S(1-w)
p(R-S(1-w)-wP) > (P-S)(1-w)
p> (P-S)(1-w)/(R-S(1-w)-wP).
For the payoff values that we've been using in the Prisoner's Dilemma and for w=0.9 we find that
p > (1)(0.1)/(3 - (1)(0.1) - (0.9)(2))
p > 0.091
This means that if a TfT player has more than a 9.1% chance of meeting another TfT player, then this family can invade a society of all-defectors. (Note -- when I compute this value I get a different number than Axelrod because I used different values for T, R, S, and P than he did.)
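The threshold on p is simple to recompute for other payoffs or other values of w. The following is a minimal Python sketch, not part of the original notes; min_cluster_proportion and closed_form are hypothetical names, and the first function solves the p-cluster inequality directly from the three V values while the second uses the closed-form expression derived above.

```python
def min_cluster_proportion(R, S, T, P, w):
    """Smallest p (exclusive) with p*V(TfT|TfT) + (1-p)*V(TfT|AD) > V(AD|AD)."""
    v_tft_tft = R / (1 - w)             # mutual cooperation forever
    v_tft_ad = S + w * P / (1 - w)      # suckered once, then mutual defection
    v_ad_ad = P / (1 - w)               # mutual defection forever
    return (v_ad_ad - v_tft_ad) / (v_tft_tft - v_tft_ad)

def closed_form(R, S, T, P, w):
    """The same bound written as (P-S)(1-w) / (R - S(1-w) - wP)."""
    return (P - S) * (1 - w) / (R - S * (1 - w) - w * P)
```

For R=3, S=1, T=4, P=2 and w=0.9 both functions return about 0.091. T does not enter either formula; it is kept in the signature only so that a full payoff set can be passed in.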
Theorem 5: If an individual all-defector cannot invade a
society
of Tit-for-Tat-ers, then a family of All-Defectors Cannot Invade a
Society
of Tit-for-Tat-ers.
We now need to ask ourselves what will happen if a cluster/family of
players
using AD tries to invade a society of players using TfT. The central idea
of the proof is that if a single individual cannot invade a society of
TfT'ers, then no cluster of such individuals can. Let's prove
this
for AD against TfT. For an AD cluster to invade a population of
TfT'ers,
there must be a p<=1 such that pV(AD|AD) + (1-p)V(AD|TfT)
> V(TfT|TfT). First, we note that V(AD|AD)
< V(TfT|TfT). But this means that AD can invade as
a cluster only if V(AD|TfT) > V(TfT|TfT) which
is
equivalent to saying that an individual AD has invaded TfT. Since
this cannot happen, TfT cannot be invaded by a cluster of AD.
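The step from the individual to the cluster can be written out as a short chain of inequalities. This is a sketch of the argument above in LaTeX, using the same symbols; the only added ingredient is the observation that a weighted average of two numbers cannot exceed the larger of them.

```latex
\begin{aligned}
p\,V(AD|AD) + (1-p)\,V(AD|TfT) &\le \max\{V(AD|AD),\; V(AD|TfT)\}
  && \text{for } 0 \le p \le 1, \\
V(AD|AD) = \frac{P}{1-w} &< \frac{R}{1-w} = V(TfT|TfT)
  && \text{since } P < R, \\
V(AD|TfT) &\le V(TfT|TfT)
  && \text{(no individual invasion)}, \\
\Rightarrow\quad p\,V(AD|AD) + (1-p)\,V(AD|TfT) &\le V(TfT|TfT)
  && \text{for every } 0 \le p \le 1.
\end{aligned}
```

So the strict inequality required for a p-cluster of AD to invade can never hold once the individual non-invasion condition is met.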
Brief Summary
- TfT cannot beat any other strategy outright (the best it can do
is
tie).
Nevertheless, when TfT interacts with enough individuals (if the society
is large enough), all of the strategies that are designed to win in
head-to-head competitions beat each other up, and TfT's total payoff
across its interactions is the largest. Thus, from an ecological
perspective,
TfT has the highest fitness.
- Once TfT gets established in a society, you cannot root it
out.
It
is possible to get established if clusters of TfT'ers invade
together.
Once they are there, their fitness level is higher than other
approaches
so they can eventually become a substantial portion of the
population.
Once they are a substantial portion of the population, you cannot
invade
with a new cluster of meanies.
- One point that we did not discuss is that for a nice strategy (one that will be the first one to try to cooperate) to survive in a society of meanies, it must be willing to retaliate against those who defect against it. This prevents it from being repeatedly exploited.