Cooperative Games: Repeated Play

Many of the concepts, including all of the theorems at the bottom of the page, are taken from Axelrod's great 1984 book The Evolution of Cooperation.

Updated Aug 29, 2005.

Battle of the Sexes Revisited

Remember the famous game of Battle of the Sexes.  In the game, a husband and wife must independently decide on a date activity.  The husband would prefer one form of entertainment, say fishing, and the wife would prefer another form of entertainment, say shopping for clothes.  Although both have their most preferred activity, both prefer being together to being alone.
 
Battle of the Sexes Consequences (rows are the husband's choice, columns are the wife's choice)

Husband Fishing, Wife Shopping: Though the husband gets to go fishing and the wife shopping, they are apart; both say "Ehh."
Husband Fishing, Wife Fishing: The husband gets to spend time with his wife and go fishing.  The wife gets to spend time with her husband.
Husband Shopping, Wife Shopping: The wife gets to spend time with her husband and go shopping.  The husband gets to spend time with his wife.
Husband Shopping, Wife Fishing: The husband goes shopping and the wife goes fishing; both say "Yuck!"

By deciding to go fishing, the wife is defecting to her husband's preferred activity, and by deciding to go shopping, the husband is defecting to his wife's.  (Cooperating, in this terminology, means sticking with one's own preferred activity.)  Using the cooperate/defect terminology and using 4 as the most preferred choice and 1 as the least preferred, the payoffs for this game are as follows:
 

Battle of the Sexes Payoffs

P1/P2        Cooperate    Defect
Cooperate    (2,2)        (4,3)
Defect       (3,4)        (1,1)

Note that I switched from the game-specific "husband/wife" formulation to the abstract "P1/P2" formulation.  Since the game is symmetric, there is no need to keep the stereotypical labels once the abstraction to payoffs is performed.

Maximin vs. Equilibrium

For this game, the maximin solution for both the husband (P1) and the wife (P2) is to cooperate.  The bad news is that neither of the partners is particularly content with this solution; if either one defected, then both would be better off.  In fact, either outcome that results when one cooperates and the other defects Pareto-dominates the maximin solution, and both of these outcomes are equilibria (since neither player benefits by unilaterally changing his/her mind).  Unfortunately, with no way to communicate, the players are left making independent choices to try to reach an equilibrium.  Additionally, using a mixed strategy does not help their chances.  In fact, the expected payoffs for two independent mixed strategies are pretty bad; if both flip an unbiased coin to make their choice for them, then each receives an average (expected) payoff of only 2.5 -- not much better than the maximin value of 2.
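The two numbers quoted above (a maximin value of 2 and a coin-flip payoff of 2.5) can be checked with a few lines of MATLAB.  This is only a minimal sketch: the matrix name U1 and the encoding 1 = cooperate, 2 = defect are assumptions made for illustration.

```matlab
% P1's payoffs in Battle of the Sexes: U1(i, j) with i = P1's move, j = P2's move.
% Encoding (an assumption for this sketch): 1 = cooperate, 2 = defect.
U1 = [2 4; 3 1];

% Maximin: take the worst case of each of P1's moves, then pick the best of those.
worstCase = min(U1, [], 2);
[maximinValue, maximinMove] = max(worstCase);

% Expected payoff to P1 when both players flip a fair coin.
coinFlip = [0.5 0.5] * U1 * [0.5; 0.5];

fprintf('maximin move = %d, value = %d, coin-flip payoff = %.1f\n', ...
        maximinMove, maximinValue, coinFlip);
```

Since the game is symmetric, the same calculation applies to P2.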

For your information, the set of possible payoffs for all possible combinations of mixed strategies is illustrated below.


This figure represents the expected payoffs to player 1 on the x-axis and the expected payoffs to player 2 on the y-axis.  Expectations (or averages) are taken with respect to the various probabilistic strategies that the two players can choose.  For example, if  both P1 and P2 play cooperate 50% of the time and defect 50% of the time, then P1 gets an expected payoff of

(1/4)2 + (1/4)3 + (1/4)4 + (1/4)1 = 2.5.

The graph was created by stepping through a discretized set of probabilities for player 1 and player 2, computing the expected values for player 1 and player 2, and then cross-plotting these expected values.  A MATLAB script that will generate the plot is given below -- it might help you understand how the plot is generated.  (You can open a MATLAB window, cut and paste this text, and run it to generate the plot.)

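% Plot the expected payoffs (EU1, EU2) for every pair of mixed strategies.
% p1 and p2 are the probabilities that P1 and P2 cooperate.
clf;
hold on;
for p1 = 0:0.01:1
    for p2 = 0:0.01:1
        % Weight each outcome (CC, DC, CD, DD) by its probability.
        EU1 = 2*p1*p2 + 3*(1-p1)*p2 + 4*p1*(1-p2) + 1*(1-p1)*(1-p2);
        EU2 = 2*p1*p2 + 4*(1-p1)*p2 + 3*p1*(1-p2) + 1*(1-p1)*(1-p2);
        plot(EU1, EU2, 'b.');
    end
end
xlabel('Expected payoff to P1');
ylabel('Expected payoff to P2');
hold off;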

One of the frustrating things about this game is the fact that if we could communicate then we could both do much better.  What are some of the ways that communication can affect the outcome of this game?


Cooperation in the Prisoner's Dilemma

As we talk about how cooperation emerges in repeated play games, we'll restrict attention to the Prisoner's Dilemma game (the payoffs are given in Table 1 below).  We'll begin by talking about the wrong way to structure such a game -- a way in which cooperation is unlikely to emerge.  Suppose that there is a population of N agents who will interact with one another in a series of Prisoner's Dilemma situations.  In these situations, two agents (i and j) will meet and play L iterations of the game.  How should i make its choice?  Put yourself in i's place.  Relying on the principle of optimality and the notion of dynamic programming, we start by considering what action i should take in the Lth (last) iteration of the game.  (We can do this since the principle of optimality dictates that, no matter what choices I've made to this point, for my strategy to be optimal it must be optimal from here on out.)  Since agent j will have no incentive to cooperate with me (because I can't affect his or her future), I have no idea what j might do.  The rational choice is to play the maximin strategy and defect.  If j is rational, then j will play the maximin strategy too.  We can then consider iteration L-1.  Since the payoff for the last iteration is determined, we can (in effect) ignore it, which means we are left with a similar situation and the same solution.  Repeating for all iterations, we determine that the rational strategy is to always defect.  Each agent can then expect a payoff of L*2 (mutual defection in every round), which is much worse than the L*3 from mutual cooperation when L is large.
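The backward-induction argument can be sketched in MATLAB as follows.  This is an illustration only: it assumes the Table 1 payoffs, a hypothetical horizon of L = 10 rounds, and that each player picks the stage-game maximin move, as argued above.

```matlab
% Backward induction for an L-round Prisoner's Dilemma (Table 1 payoffs).
payoff = [3 1; 4 2];     % payoff(myMove, yourMove); 1 = cooperate, 2 = defect
L = 10;                  % hypothetical number of rounds
valueToGo = 0;           % payoff already determined for the later rounds

for k = L:-1:1
    % The later rounds are fixed, so round k is solved like a one-shot game:
    % pick the move whose worst-case payoff is largest.
    [stageValue, action] = max(min(payoff, [], 2));
    valueToGo = valueToGo + stageValue;
end

fprintf('chosen move = %d, always-defect total = %d, mutual-cooperation total = %d\n', ...
        action, valueToGo, 3 * L);
```

Every round selects move 2 (defect), so the total is 2*L rather than the 3*L that mutual cooperation would give.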

How do we correct this problem? Easy! We let the game go on indefinitely.  You're probably thinking, "That's a stupid idea.  The best way to solve the problem of cooperation in a repeated play game cannot be to assume that the players are immortal.  Is it too late to drop this class?"  To which I reply, "I wouldn't drop the class if I were you.  Dropping the class will hurt my feelings, and that means if I teach a class which you must take, or if I'm assigned to your graduate committee, etc., then we will have to interact again.  When we interact again, I'll show you!"  (By the way, this example is a type for the paragraph to follow.)

It turns out that there is an equivalence between immortal agents and agents that will probably interact again.  Let me explain.  When you are an immortal agent, you must figure out some way to balance your needs today with your needs for the indefinite future.  (Think of the movie "Death Becomes Her" with Goldie Hawn, Meryl Streep, and Bruce Willis.  In the movie, two shrew-like women are granted immortality, but in their jealousy and pettiness they surrender eternal health for momentary revenge.)  One way for immortal agents to achieve this balance is to discount future payoffs.  Let 0 <= w < 1 denote this discount factor.  If I receive payoff v(i) at iteration i, then my overall utility function is given by $V = \sum_{i=0}^{\infty} w^i v(i)$, that is, the discounted sum of all future rewards.  But stop and think for just a minute.  If w is the probability that the game will continue to the next iteration, then $w^i$ is the probability that the game will continue to the ith iteration.  The expected value for continuing to play the game with these odds is V.  Thus, an eternal interaction between immortals is equivalent to an interaction between mortals who are likely to meet in the future.  It turns out that if agents have a high enough probability of meeting in the future (and can remember their past interactions) then cooperation can emerge between the agents and they can expect a higher payoff than possible by using the always-defect strategy.  The most important strategy for immortal agents is called tit-for-tat.
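The equivalence can be written out in one line.  The sketch below assumes that iteration 0 is always played and that each later iteration is reached with probability w, independently of the payoffs; the constant-payoff example uses the mutual-cooperation payoff of 3 from Table 1 and w = 0.9.

```latex
\begin{align*}
\Pr(\text{iteration } i \text{ is played}) &= w^i, \\
E[\text{total payoff}] &= \sum_{i=0}^{\infty} \Pr(\text{iteration } i \text{ is played})\, v(i)
   = \sum_{i=0}^{\infty} w^i v(i) = V, \\
v(i) = 3 \text{ for all } i,\ w = 0.9 \quad &\Longrightarrow \quad V = \frac{3}{1 - 0.9} = 30 .
\end{align*}
```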


Tit for Tat

What is tit-for-tat?  It is a strategy for playing repeated play Prisoner's Dilemma for immortal agents that is about as good as you can get.  The strategy is simple.  I begin by cooperating, and thereafter I simply play whatever you played on the previous round.  Remarkably, this simple strategy is markedly superior to other strategies in an ecological sense, even though it loses in head-to-head competition with almost every other strategy.  To see what this means, consider how tit-for-tat would perform in a very small population with only four strategies: always defect (AD), never forgive (NF), random (R), and Tit-for-Tat (TfT).  In the never forgive strategy, the agent begins by cooperating, but if its opponent ever defects then the NF player defects on every play thereafter.  In the random strategy, the agent cooperates or defects with equal probability on every play (this is what the table entries below assume).  Let's construct a table of the resulting payoffs for w=0.9, using the Prisoner's Dilemma payoffs in Table 1.  We'll find it useful to use the following relation for 0<=w<1:

$\sum_{i=0}^{\infty} w^i V = \frac{V}{1-w}$

Payoffs for column strategies against row strategies in the iterated Prisoner's Dilemma

         AD               NF             R                     TfT
AD       2/0.1            0.9*2/0.1+1    (1+2)/(0.1*2)         0.9*2/0.1+1
NF       0.9*2/0.1+4      3/0.1          B                     3/0.1
R        (4+2)/(2*0.1)    A              (3+1+2+4)/(0.1*4)     D
TfT      0.9*2/0.1+4      3/0.1          C                     3/0.1
Total    94               79+A           40+B+C                79+D

 
 
Table 1: Payoffs for Prisoner's Dilemma

P1/P2        Cooperate    Defect
Cooperate    (3,3)        (1,4)
Defect       (4,1)        (2,2)

Although you should probably check my math, I think that these answers are pretty close.  Given these answers, we want to figure out which one performs best.  We note that the worst case performance for any strategy occurs when always cooperate plays always defect, which produces a payoff to always cooperate of 1/0.1=10.  Given this worst-case payoff, we know that A, B, C, and D are all greater than 10.  We also note that I'm too lazy to calculate these answers out explicitly, but we can do a little hand-waving to convince ourselves that TfT is better than the others.

I did one thousand simulations of these different strategies against each other.  The results are tabulated below:
 

Table 2: Payoffs for column strategies against row strategies in the iterated Prisoner's Dilemma

         AD               NF             R                     TfT
AD       2/0.1            0.9*2/0.1+1    (1+2)/(0.1*2)         0.9*2/0.1+1
NF       0.9*2/0.1+4      3/0.1          18.5                  3/0.1
R        (4+2)/(2*0.1)    28.3           (3+1+2+4)/(0.1*4)     24.5
TfT      0.9*2/0.1+4      3/0.1          26.0                  3/0.1
Total    94               107.3          84.5                  103.5

 Thus, TfT did better than AD and R, but worse (by a little bit) than NF.  What would have happened if we had increased the payoff for cooperating?  What would have happened if I had increased the likelihood of having future interactions from 0.9 to 0.99?
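The notes do not say exactly how the simulations were set up, so the following MATLAB sketch is just one plausible way to run such a round robin.  It assumes the Table 1 payoffs, discounted sums truncated at a hypothetical nRounds = 150, a fair coin for R, and a hypothetical nTrials = 200 runs for every pairing that involves R; the strategy numbering 1 = AD, 2 = NF, 3 = R, 4 = TfT is also an assumption.

```matlab
% Round-robin sketch: V(r, c) is the payoff to column strategy c against row strategy r.
% Strategy ids (assumed numbering): 1 = AD, 2 = NF, 3 = R, 4 = TfT.
w = 0.9;                    % discount factor (probability of another round)
payoff = [3 1; 4 2];        % payoff(myMove, yourMove) from Table 1; 1 = C, 2 = D
nRounds = 150;              % w^150 is about 1e-7, so the truncation error is negligible
nTrials = 200;              % only pairings that involve R need more than one run
V = zeros(4, 4);
for r = 1:4
    for c = 1:4
        ids = [r c];
        trials = 1;
        if any(ids == 3)
            trials = nTrials;
        end
        total = 0;
        for trial = 1:trials
            prev = [1 1];               % lets TfT open by cooperating
            betrayed = [false false];   % NF memory: has my opponent ever defected?
            for t = 0:nRounds-1
                moves = [0 0];
                for k = 1:2
                    switch ids(k)
                        case 1          % AD: always defect
                            moves(k) = 2;
                        case 2          % NF: cooperate until betrayed
                            moves(k) = 1 + betrayed(k);
                        case 3          % R: fair coin
                            moves(k) = 1 + (rand < 0.5);
                        case 4          % TfT: copy the opponent's previous move
                            moves(k) = prev(3 - k);
                    end
                end
                % Discounted payoff to the column player (index 2).
                total = total + w^t * payoff(moves(2), moves(1));
                betrayed = betrayed | (moves([2 1]) == 2);
                prev = moves;
            end
        end
        V(r, c) = total / trials;
    end
end
disp(V);
disp(sum(V, 1));            % column totals, comparable to the Total row of Table 2
```

The deterministic pairings reproduce the closed-form entries (for example, 20 for AD against AD and 19 for TfT against AD), while the entries involving R vary a little from run to run.  Changing w or the payoff matrix is a quick way to explore the two questions above.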

Interestingly enough, two tournaments were held (with slightly different payoffs, but still a Prisoner's Dilemma) with participants including computer hobbyists, game theorists, social psychologists, computer scientists, etc.  In both tournaments (see Axelrod), Tit for Tat was a convincing winner.  Why?  We'll now discuss the reasons.  Note one interesting phenomenon before continuing.  When we compare the values in the TfT column (the payoff received by TfT against all other strategies) with the values in the TfT row (the payoffs received by strategies against TfT) we note that TfT never beats any strategy, but the overall payoff is excellent.


Some Results (taken from Axelrod)

I hope as you study these theorems that you consciously notice how evolutionary and ecological concepts influence what is being discussed.

Theorem 1: Under Certain Conditions (w high enough), No Strategy Dominates Against all other Strategies

Let's see if we believe this.  Consider the AD and NF strategies.  What strategy works best against the AD strategy?  Another AD strategy.  This is because if player P1 always defects then this restricts the payoff matrix in Table 1 to the bottom row.  Looking carefully at this bottom row reveals that the best that P2 can do in this row is to always defect.  Thus, AD is the best strategy to play against AD.  It's also nice to observe that our logic is verified when we compare the values in the AD row of Table 2.  Recall that each cell shows how much the strategy shown in the column receives when it plays against the strategy shown in the row.  Thus, when I look at the AD row I see that the AD strategy receives a payoff of 20, NF receives a payoff of 19, R receives a payoff of 15, and TfT receives a payoff of 19.  These results show that no other strategy can do as well against AD as AD itself (unless w=0).

Now, let's turn our attention from what works best against AD to what works best against NF.   All we need to do is to show that some strategy other than AD yields a higher payoff than AD.  The payoff for AD when it is used against NF is the temptation payoff on the first round (+4 in this case), and the mutual defection payoff thereafter (+2 in this case).  By contrast, when NF or TfT plays against NF both receive the mutual cooperation payoff (+3 in this case) for the duration of the game.  For w high enough, TfT will do better against NF than AD will.  (Can you compute this value for w?) Thus, the best strategy against AD is not the best strategy against NF.
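For readers who want to check their answer to the question above, here is one way to work it out.  This is a sketch that assumes the Table 1 payoffs (4 for temptation, 3 for mutual cooperation, 2 for mutual defection).

```latex
\begin{align*}
\text{payoff to TfT against NF} &= \frac{3}{1-w}, \\
\text{payoff to AD against NF} &= 4 + \frac{2w}{1-w}, \\
\frac{3}{1-w} > 4 + \frac{2w}{1-w}
  \iff 3 > 4(1-w) + 2w = 4 - 2w
  &\iff w > \tfrac{1}{2}.
\end{align*}
```

At w = 0.9 this gives 30 for TfT versus 22 for AD, which agrees with the NF row of Table 2.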

The first paragraph states that the best-response strategy against AD is AD.  The second paragraph states that the best-response strategy against NF is not AD.  Thus, no strategy is best against all other strategies.

Theorem 2: Under Certain Conditions, No Strategy Can Invade a Society of Tit-for-Tat-ers

What does it mean for an agent who uses strategy A to "invade" a society of agents who are using strategy B?  Let's answer this question first by an example and then by a formal definition.  Consider a society of agents who always cooperate.  In this society, each agent always receives a payoff of 3 units every time two agents get together.  On the average, therefore, with w = 0.9 each agent can expect a payoff of 3/0.1 = 30 units.  We'll denote this payoff V(B|B) where B stands for "Be nice by always cooperating" and V(B|B) is the expected value that a player using B will receive when it faces another player using B.  Now what happens to an agent who enters this community, but who uses the AD strategy?  Such an agent always gets the "temptation" payoff of 4 units, so the payoff of using AD against a society of B-playing agents is given by V(AD|B) = 4/0.1 = 40.  Since V(AD|B) > V(B|B), we say that the agent using AD has invaded the society.

Thus, a strategy A has invaded a society of agents using strategy B when V(A|B) >V(B|B). When no strategy exists which can invade a society of agents using a strategy A then this strategy is said to be collectively stable. Theorem 2 simply states that TfT is a collectively stable strategy.  Our task is to figure out what conditions must hold for this theorem to be true.
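The invasion test is easy to state in code.  Here is a minimal MATLAB sketch for the always-cooperate example, assuming w = 0.9 and the Table 1 payoffs; the variable names are made up for illustration.

```matlab
% Does a lone AD player invade a society of always-cooperators (B)?
w = 0.9;                 % probability of meeting again
R = 3;                   % mutual-cooperation payoff (Table 1)
T = 4;                   % temptation payoff (Table 1)

V_B_B  = R / (1 - w);    % B against B: R in every round
V_AD_B = T / (1 - w);    % AD against B: T in every round

invades = V_AD_B > V_B_B;   % definition of invasion: V(A|B) > V(B|B)
fprintf('V(B|B) = %.0f, V(AD|B) = %.0f, invades = %d\n', V_B_B, V_AD_B, invades);
```

The same comparison, with different formulas for the two values, is what each of the remaining theorems checks.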

Table 3: Payoffs for Prisoner's Dilemma

P1/P2    C                D
C        (R=3,R=3)        (S=1,T=4)
D        (T=4,S=1)        (P=2,P=2)

To find these conditions, we'll find it convenient to use a little bit more abstract notation than we have previously done.  More specifically, I've modified Table 1 by adding R (reward for mutual cooperation), T (reward for yielding to temptation), S (reward for being a sucker), and P (reward for punishing each other); the changes are shown in Table 3.

We'll now return to our theorem.  We begin by showing that we only need to consider two strategies to see if TfT is collectively stable.  Observe that TfT has only two states, depending on what the other player did the previous move (and assuming cooperation on the first move).  Thus, when we look at strategy A all we really need to do is look at its past choice.  On the previous move, suppose that A chose action D.  Then, A will use what it knows about how the TfT strategy will respond to D and choose either C or D on the current move.  A similar statement can be made when strategy A chose action C on the previous move.  Since TfT has only two states, this means that there are only four possibilities for the best that A can do against TfT: repeated sequences of CC, CD, DC, or DD (can you see why?).  Let's look at these repeated sequences one at a time.

So, what strategy produces DD sequences?  None other than our beloved AD strategy.  To say that AD cannot invade TfT means that V(AD|TfT) <= V(TfT|TfT).  When AD meets TfT, it gets T on the first move and P thereafter, making V(AD|TfT) = T + wP/(1-w).  By contrast, when TfT meets its twin it receives V(TfT|TfT) = R/(1-w).  Thus, the non-invasion requirement translates into T + wP/(1-w) <= R/(1-w) or, equivalently, w >= (T-R)/(T-P).  For our payoffs, this means that the probability of meeting again must be at least (4-3)/(4-2) = 0.5.  If this is the case, then AD cannot invade TfT.

Now, what strategy produces DC sequences?  We'll define a new strategy that alternates between D and C, and we'll call it DC (clever, huh?).  To say that DC cannot invade TfT means that V(DC|TfT) <= V(TfT|TfT).  But $V(DC|TfT) = (T+wS)/(1-w^2)$.  (Can you see why? In DC versus TfT, the rewards that are received go something like $T, wS, w^2 T, w^3 S, \ldots$. Thus, the total payoff consists of two series: $T + w^2 T + w^4 T + \cdots + w(S + w^2 S + w^4 S + \cdots)$. Using the substitution $v = w^2$, these series become $T + vT + v^2 T + \cdots + w(S + vS + v^2 S + \cdots)$. Applying our useful relation, these two series reduce to T/(1-v) + wS/(1-v), and when we put $v = w^2$ back in and gather like terms, we get $(T+wS)/(1-w^2)$.) The non-invasion requirement is therefore $(T+wS)/(1-w^2) \le R/(1-w)$; multiplying both sides by $1-w^2 = (1-w)(1+w)$ gives T + wS <= R(1+w), which translates into w >= (T-R)/(R-S).  For our payoffs, this means that the probability of meeting again must be at least (4-3)/(3-1) = 0.5.  If this is the case, then DC cannot invade TfT.

If both of these restrictions on the probability of meeting again are satisfied, then no strategy can invade a society of Tit-for-Tat-ers.
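The two thresholds can be checked numerically.  The MATLAB sketch below assumes the Table 3 payoffs and compares a hypothetical low value (w = 0.4) with the w = 0.9 used throughout; the variable names are chosen for illustration only.

```matlab
% Collective stability of TfT: thresholds on w and a numerical check.
T = 4; R = 3; P = 2; S = 1;          % Table 3 payoffs

wAD = (T - R) / (T - P);             % threshold against always-defect
wDC = (T - R) / (R - S);             % threshold against the alternating DC strategy
fprintf('thresholds: wAD = %.2f, wDC = %.2f\n', wAD, wDC);

for w = [0.4 0.9]
    V_TfT = R / (1 - w);             % V(TfT|TfT)
    V_AD  = T + w * P / (1 - w);     % V(AD|TfT)
    V_DC  = (T + w * S) / (1 - w^2); % V(DC|TfT)
    stable = (V_AD <= V_TfT) && (V_DC <= V_TfT);
    fprintf('w = %.1f: V(TfT|TfT) = %.2f, V(AD|TfT) = %.2f, V(DC|TfT) = %.2f, stable = %d\n', ...
            w, V_TfT, V_AD, V_DC, stable);
end
```

For these payoffs both thresholds equal 0.5, so TfT resists both invaders at w = 0.9 but neither at w = 0.4.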

Theorem 3: No Individual Can Invade a Society of All-Defectors

I'll leave the proof of this to you.  The trick is to show that for all strategies A, the collectively stable condition holds V(A|AD) <= V(AD|AD).

Theorem 4: A Family of Tit-for-Tat-ers Can Invade a Society of All-Defectors.

I think that this is one of the most interesting theorems that we will discuss.  In essence, it states that whenever a cluster of agents playing TfT get to play often enough against each other, then this cluster can invade a society of agents playing AD.  Before we prove the theorem, let's define an important concept.  Let's talk about what it means to invade a society in a family (or cluster).  A p-cluster of A invades B if pV(A|A) + (1-p) V(A|B) > V(B|B), where p is the proportion of interactions by a player using strategy A with another such player.

So, what we need to show to prove the theorem is that there exists a p such that pV(TfT|TfT) + (1-p) V(TfT|AD) > V(AD|AD) for a given discount parameter w.  Let's plug and chug, and see what happens.  We know that V(TfT|TfT)=R/(1-w), and that V(AD|AD)=P/(1-w).  We also know that V(TfT|AD)=S+wP/(1-w).  Plugging these values into the equation that represents the conditions for p-cluster invasion, gives

pV(TfT|TfT) + (1-p) V(TfT|AD) > V(AD|AD)
pR/(1-w) + (1-p)(S+wP/(1-w)) > P/(1-w)
pR -p(S(1-w)+wP) > P-wP-S(1-w)
p(R-S(1-w)-wP) > (P-S)(1-w)
p> (P-S)(1-w)/(R-S(1-w)-wP).
For the payoff values that we've been using in the Prisoner's Dilemma and for w=0.9 we find that
p > (2-1)(0.1) / (3 - 1(0.1) - 0.9(2)) = 0.1/1.1
p > 0.091
This means that if there is a 9.1% chance of meeting another TfT player, then this family can invade a society of all-defectors.  (Note -- when I compute this value I get a different number than Axelrod because I used different values for T,R,S, and P than he did.)
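The threshold is easy to recompute for other payoffs or discount factors.  The MATLAB sketch below assumes the Table 3 payoffs and w = 0.9, and tests a hypothetical cluster proportion p = 0.10 just above the threshold.

```matlab
% Smallest cluster proportion p that lets TfT invade a society of all-defectors.
T = 4; R = 3; P = 2; S = 1;          % Table 3 payoffs
w = 0.9;                             % probability of meeting again

pMin = (P - S) * (1 - w) / (R - S * (1 - w) - w * P);
fprintf('pMin = %.3f\n', pMin);

% Check the invasion condition directly for a hypothetical p just above pMin.
p = 0.10;
V_TT = R / (1 - w);                  % V(TfT|TfT)
V_TA = S + w * P / (1 - w);          % V(TfT|AD)
V_AA = P / (1 - w);                  % V(AD|AD)
lhs = p * V_TT + (1 - p) * V_TA;
fprintf('p = %.2f: cluster payoff = %.2f, V(AD|AD) = %.2f\n', p, lhs, V_AA);
```

With p = 0.10 the cluster payoff is 20.1, just above the 20 that the all-defectors earn against each other.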

Theorem 5: If an individual all-defector cannot invade a society of Tit-for-Tat-ers, then a family of All-Defectors Cannot Invade a Society of Tit-for-Tat-ers.

We now need to ask ourselves what will happen if a cluster/family of players using AD tries to invade a society of players using TfT.  The central idea of the proof is that if a single individual cannot invade a society of TfT'ers, then no cluster of such individuals can.  Let's prove this for AD against TfT.  For an AD cluster to invade a population of TfT'ers, there must be a p <= 1 such that pV(AD|AD) + (1-p)V(AD|TfT) > V(TfT|TfT).  First, we note that V(AD|AD) < V(TfT|TfT).  But this means that AD can invade as a cluster only if V(AD|TfT) > V(TfT|TfT), which is equivalent to saying that an individual AD has invaded TfT.  Since this cannot happen, TfT cannot be invaded by a cluster of AD.
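The step from the cluster condition to the individual condition is a weighted-average bound.  The sketch below spells it out in the notation above; the last line assumes the Table 3 payoffs and w = 0.9.

```latex
\begin{align*}
p\,V(AD|AD) + (1-p)\,V(AD|TfT) &\le \max\{V(AD|AD),\ V(AD|TfT)\}, \qquad 0 \le p \le 1, \\
V(AD|AD) = \frac{P}{1-w} &< \frac{R}{1-w} = V(TfT|TfT), \\
\text{so a cluster can invade only if } V(AD|TfT) &> V(TfT|TfT). \\
\text{Numerically: } 20p + 22(1-p) &\le 22 < 30 .
\end{align*}
```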

Brief Summary