Most artificial multiagent learning algorithms for repeated games that have been presented of late in the literature have focused on learning a one-shot Nash equilibrium. While this technique is appropriate and successful in some contexts, it is not appropriate for repeated general-sum games as a whole if an agent seeks to maximize its own payoffs over time. This principle is shown by the folk theorem from game theory. We review the folk theorem below and discuss some of what it entails to the multiagent learning problem for repeated general-sum games.
Quick Review of the Folk TheoremThe folk theorem for repeated games is an important part of game theory. A good description of it can be found in, among other places, (Gintis 2000). We simply state it, and then discuss its implications to multiagent learning in repeated general-sum games. To do so, we consider the prisoner’s dilemma matrix game shown in Table 1, where the payoff for the row player is listed first, followed by the payoff for the column player.
The joint payoff space of the prisoner’s dilemma game is shown in Figure 1. In the figure, the x-axis shows the payoffs to the row player and the y-axis shows the payoffs to the column player. Convex combinations of the various joint payoffs of the game form the games payoff space (its convex hull), which is depicted by the union of the shaded regions (light and dark) in the figure. Since each player can guarantee itself a non-negative payoff by playing defect, neither of the players has incentive to receive an average payoff (over time) less than 0. Thus, the darkly shaded region of the convex hull shows the set of joint payoffs that the agents may possibly accept as reasonable payoffs per iteration of the game. The folk theorem says that any joint payoff in this region can be sustained by a Nash equilibrium of the repeated game, provided that the discount rates of the players are close to unity (i.e., players believe that play will continue with high probability after each stage of the game).
More formally, let C be the set of points in the convex hull of the game. Let P_i(a, b) be the expected payoff to agent i when the row player plays strategy a and the column player plays strategy b. Then, let m_i (agent i's security level) be the maximum expected payoff agent i can guarantee itself if it is "punished" by the other agent. That is,
Any joint payoff p = (p_1, p_2) in C such that p_1 >= m_1 and p_2 >= m_2 can be sustained as a Nash equilibrium of the repeated game provided, again, that the discount rates of the players are close to unity. The folk theorem can be extended to games of more than two players.
Which Nash Equilibrium of the Repeated Game is Best?The folk theorem shows that there are an infinite number of Nash equilibria in all games in which the joint payoff m = (m_1, m_2) is not pareto efficient (which constitutes a large portion of general-sum games). While the folk theorem identifies the existence of Nash equilibria of the repeated game, it does little to identify the equilibrium that agents should learn to play. It does, however, help to identify the types of payoffs an agent can expect to receive. Since a self-interested agent should seek to maximize its own payoffs over time, it should not, in most cases, be content to play its part of a Nash equilibrium in which it only receives its security level (m_i). Whether this is the case largely depends on its associate(s), but in many games, such as the prisoner’s dilemma, all agents may achieve payoffs higher than their security levels if they are willing to cooperate. In most circumstances, we believe that the Nash equilibria that agents learn (if all agents learn successfully) should be, at least, pareto efficient.
Littman and Stone (Littman and Stone 2003) give an algorithm to compute, in polynomial time, a Nash equilibrium strategy of the repeated game for two agent games. In their algorithm, they select a Nash equilibrium strategy that yields payoffs that maximize the product of the agents’ advantages. This type of equilibrium solution is suggested by Nash (Nash 1950).
While maximizing the product of the agents’ advantages is a reasonable target solution (it is pareto efficient and "fair"), it may not be the solution that a self-interested agent would want to adopt. To see this, consider the game chicken shown in Table 2. The game’s payoff space is shown in Figure 2.
In chicken, each agent can guarantee itself at least 2 points per stage if it always plays swerve. The one-shot Nash equilibria of chicken are for the row player to play swerve and the column player to play straight and vice versa (labeled as points C and B in Figure 2, respectively). However, when the game repeats many times, a self-interested agent may be a fool to always accept a payoff of 2, since there are many Nash equilibria of the repeated game that give it higher payoffs. Littman’s algorithm calculates that agents should always swerve resulting in a payoff of 3 for each agent (labeled A in Figure 2). However, if the column player will continue to swerve as long as the row player plays straight no more than every fifth play, then the row player should learn to play straight every fifth play. This would result in average payoffs of 3.1 and 2.8 (shown as point D in Figure 2), respectively. However, learning to play straight every fifth play could result in lower payoffs to the row player over time if the column player will not "accept" this behavior. Therefore, the solution that the row player should learn to play is unknown, and depends on the reputation of the column player, which the row player must identify. Informally, an agent’s reputation is what other agents believe it will do in a given situation.
ConclusionWhile each one-shot Nash equilibria of a game are indeed Nash equilibrium of the repeated game, they may yield an agent undesirable payoffs. One-shot Nash equilibria are frequently too myopic for repeated games, and thus multiagent learning algorithms should learn to play non-myopic strategies. However, which strategy is appropriate (with respect to payoff maximization) depends on how an agent's associates will react, which is probably not completely known. This is what makes multiagent learning so fascinating!
While the "optimal" strategy may be impossible to know, the folk theorem tells us that an agent should not receive a expected payoff of less than its security level over time. The folk theorem also shows that an agent should frequently be able to do better than its security level, as long as it can somehow get its associates to cooperate. (I add that playing myopic best-response strategies is frequently not a very good way of coaxing one's associates to cooperate.)
ReferencesHerbert Gintis. Game Theory Evolving: A Problem-Centered Introduction to Modeling Strategic Behavior. Princeton University Press, Princeton, New Jersey, 2000.
Michael L. Littman and Peter Stone. A polnomial-time nash equilibrium algorithm for repeated games. In ACM Conference on Electronic Commerce, 2003.
John F. Nash. The bargaining problem. Econometrica, 28:155–162, 1950.