The Prisoner's Dilemma is a problem in game theory. It was originally framed by Merrill Flood and Melvin Dresher while working at RAND in 1950. Albert W. Tucker formalized the game with prison sentence payoffs and gave it the name "Prisoner's Dilemma" (Poundstone, 1992).
In its "classical" form, the prisoner's dilemma (PD) concerns two prisoners, each of whom must choose either to stay silent or to betray the other.
If we assume that each player prefers shorter sentences to longer ones, that each gets no utility out of lowering the other player's sentence, and that a player's decision has no reputation effects, then the prisoner's dilemma forms a non-zero-sum game in which two players may each "cooperate" with or "defect" from (i.e., betray) the other player. In this game, as in standard game-theoretic analysis, each individual player ("prisoner") is concerned only with maximizing his or her own payoff, without any concern for the other player's payoff. The unique equilibrium of this game is Pareto-suboptimal; that is, rational choice leads both players to defect even though each player's individual reward would be greater if both cooperated.
In the classic form of this game, cooperating is strictly dominated by defecting, so that the only possible equilibrium for the game is for all players to defect. In simpler terms, no matter what the other player does, one player will always gain a greater payoff by playing defect. Since in any situation playing defect is more beneficial than cooperating, all rational players will play defect, all things being equal.
In the iterated prisoner's dilemma, the game is played repeatedly, so each player has an opportunity to "punish" the other for previous non-cooperative play. Cooperation may then arise as an equilibrium outcome: the incentive to defect is overcome by the threat of punishment. If the game is infinitely repeated, cooperation may be a subgame perfect Nash equilibrium, although mutual defection always remains an equilibrium and there are many other equilibrium outcomes.
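One standard way to make the punishment argument concrete is the grim-trigger strategy: cooperate until the opponent defects once, then defect forever. This strategy is not named in this article, so treat the sketch as an illustration only. It assumes that future rounds are discounted by a factor `delta` per round (0 < delta < 1) and uses the temptation, reward and punishment payoffs T, R and P that are defined formally later in this article. A single defection against grim trigger does not pay when R/(1 - delta) >= T + delta*P/(1 - delta), which rearranges to delta >= (T - R)/(T - P).

```python
def grim_trigger_sustainable(T, R, P, delta):
    """Return True if a player facing a grim-trigger opponent (who cooperates
    until the first defection, then defects forever) does at least as well by
    cooperating forever as by defecting once. Payoffs are discounted by delta."""
    cooperate_forever = R / (1 - delta)        # R + R*delta + R*delta**2 + ...
    defect_once = T + delta * P / (1 - delta)  # T now, then P in every later round
    return cooperate_forever >= defect_once

def critical_delta(T, R, P):
    """Smallest discount factor at which the condition above holds."""
    return (T - R) / (T - P)

# Using the 5/3/1 point values (T = 5, R = 3, P = 1):
print(critical_delta(5, 3, 1))                 # 0.5
print(grim_trigger_sustainable(5, 3, 1, 0.6))  # True
print(grim_trigger_sustainable(5, 3, 1, 0.4))  # False
```

With these values, cooperation survives this one-shot-deviation check only when players weight the next round at least half as much as the current one.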
In casual usage, the label "prisoner's dilemma" may be applied to situations not strictly matching the formal criteria of the classic or iterated games; for instance, those in which two entities could gain important benefits from cooperating, or suffer from failing to do so, but find it difficult or expensive, though not necessarily impossible, to coordinate their activities to achieve cooperation.
The classical prisoner's dilemma can be summarized thus:
| | Prisoner B Stays Silent | Prisoner B Betrays |
|---|---|---|
| Prisoner A Stays Silent | Each serves 6 months | Prisoner A: 10 years; Prisoner B: goes free |
| Prisoner A Betrays | Prisoner A: goes free; Prisoner B: 10 years | Each serves 5 years |
In this game, regardless of what the opponent chooses, each player always receives a higher payoff (a shorter sentence) by betraying; that is to say, betraying is the strictly dominant strategy. For instance, Prisoner A can accurately say, "No matter what Prisoner B does, I personally am better off betraying than staying silent. Therefore, for my own sake, I should betray." However, if the other player acts similarly, then they both betray and both get a lower payoff than they would get by staying silent. Rational self-interested decisions leave each prisoner worse off than if each had chosen to lessen the accomplice's sentence at the cost of staying a little longer in jail himself. Hence a seeming dilemma. In game theory, this demonstrates very elegantly that in a non-zero-sum game a Nash equilibrium need not be a Pareto optimum.
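The dominance argument can be checked mechanically. This minimal sketch encodes the sentence table in years (lower is better), writing six months as 0.5 and "goes free" as 0, and verifies that betraying is strictly better for each prisoner whatever the other does, while mutual silence still gives both a shorter sentence than mutual betrayal. The names are illustrative only.

```python
SILENT, BETRAY = "silent", "betray"
ACTIONS = (SILENT, BETRAY)

# SENTENCE[(A's choice, B's choice)] = (A's years, B's years); lower is better.
SENTENCE = {
    (SILENT, SILENT): (0.5, 0.5),  # each serves 6 months
    (SILENT, BETRAY): (10, 0),     # A serves 10 years, B goes free
    (BETRAY, SILENT): (0, 10),     # A goes free, B serves 10 years
    (BETRAY, BETRAY): (5, 5),      # each serves 5 years
}

def betrayal_dominates_for_a():
    """True if A's sentence is strictly shorter when betraying, whatever B does."""
    return all(SENTENCE[(BETRAY, b)][0] < SENTENCE[(SILENT, b)][0] for b in ACTIONS)

def betrayal_dominates_for_b():
    """The same check from B's side (B's choice is the second element)."""
    return all(SENTENCE[(a, BETRAY)][1] < SENTENCE[(a, SILENT)][1] for a in ACTIONS)

def silence_pareto_dominates_betrayal():
    """True if mutual silence gives both prisoners a shorter sentence than mutual betrayal."""
    silent, betray = SENTENCE[(SILENT, SILENT)], SENTENCE[(BETRAY, BETRAY)]
    return silent[0] < betray[0] and silent[1] < betray[1]

print(betrayal_dominates_for_a(), betrayal_dominates_for_b(),
      silence_pareto_dominates_betrayal())  # True True True
```

Both prisoners following their dominant strategy lands them on (betray, betray), the one outcome that both would rather trade for (silent, silent).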
If player 1 defects and player 2 cooperates, player 1 gets the Temptation to Defect payoff of 5 points while player 2 receives the Sucker's payoff of 0 points. If both cooperate, they each get the Reward for Mutual Cooperation payoff of 3 points, while if both defect, they each get the Punishment for Mutual Defection payoff of 1 point. The payoff matrix below lists player 1's (row) payoff first and player 2's (column) payoff second.
| | Cooperate | Defect |
|---|---|---|
| Cooperate | 3, 3 | 0, 5 |
| Defect | 5, 0 | 1, 1 |
In "win-lose" terminology the table looks like this:
| | Cooperate | Defect |
|---|---|---|
| Cooperate | win-win | lose much-win much |
| Defect | win much-lose much | lose-lose |
These point assignments are given arbitrarily for illustration. It is possible to generalize them, as follows:
| | Cooperate | Defect |
|---|---|---|
| Cooperate | R, R | S, T |
| Defect | T, S | P, P |
Here T stands for the Temptation to defect, R for the Reward for mutual cooperation, P for the Punishment for mutual defection and S for the Sucker's payoff. For the game to be a prisoner's dilemma, the following inequalities must hold:
T > R > P > S
This condition ensures that the equilibrium outcome is defection, but that cooperation Pareto dominates equilibrium play. In addition to the above condition, if the game is repeatedly played by two players, the following condition should be added.
2 R > T + S
If that condition does not hold, then full cooperation is not necessarily Pareto optimal, as the players are collectively better off by having each player alternate between cooperate and defect.
These rules were established by cognitive scientist Douglas Hofstadter and form the formal canonical description of a typical game of Prisoner's Dilemma.
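The two conditions are easy to check for a given set of payoffs. The sketch below tests them for the 5/3/1/0 values used earlier and for a second, purely hypothetical set (T = 10, R = 3, P = 1, S = 0) that satisfies the ordering but fails 2R > T + S, so that alternating between cooperation and defection (an average of (T + S)/2 = 5 per round) beats steady cooperation (R = 3).

```python
def is_prisoners_dilemma(T, R, P, S):
    """One-shot condition: temptation > reward > punishment > sucker's payoff."""
    return T > R > P > S

def cooperation_beats_alternation(T, R, S):
    """Iterated condition 2R > T + S: over any two rounds, mutual cooperation
    (R + R) earns each player more than taking turns exploiting each other (T + S)."""
    return 2 * R > T + S

# The 5/3/1/0 points used earlier satisfy both conditions.
print(is_prisoners_dilemma(5, 3, 1, 0), cooperation_beats_alternation(5, 3, 0))    # True True
# Hypothetical payoffs: still a one-shot PD, but alternating now pays more.
print(is_prisoners_dilemma(10, 3, 1, 0), cooperation_beats_alternation(10, 3, 0))  # True False
```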
A simple special case occurs when the advantage of defection over cooperation is independent of what the co-player does and the cost of the co-player's defection is independent of one's own action, i.e. T + S = P + R.
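A common parametrization that falls into this special case, often called the donation game, treats cooperation as paying a cost c to give the co-player a benefit b, with b > c > 0; this gives T = b, R = b - c, P = 0 and S = -c. The sketch below (function names are illustrative) checks that the gain from defecting, T - R = P - S = c, and the cost of the co-player's defection, R - S = T - P = b, do not depend on the other player's move, and that the 5/3/1/0 values used earlier are not of this type.

```python
def equal_gains_from_switching(T, R, P, S):
    """Special case T + S == P + R, equivalently T - R == P - S:
    defecting gains the same amount whatever the co-player does."""
    return T + S == P + R

def donation_game(b, c):
    """Payoffs when cooperating means paying cost c to give the co-player benefit b (b > c > 0)."""
    return {"T": b, "R": b - c, "P": 0, "S": -c}

g = donation_game(b=3, c=1)
print(g)                                       # {'T': 3, 'R': 2, 'P': 0, 'S': -1}
print(equal_gains_from_switching(**g))         # True
print(g["T"] - g["R"], g["P"] - g["S"])        # 1 1 (gain from defecting)
print(g["R"] - g["S"], g["T"] - g["P"])        # 3 3 (cost of the co-player's defection)
print(equal_gains_from_switching(5, 3, 1, 0))  # False, since 5 + 0 != 1 + 3
```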
In computer tournaments of the iterated PD, Robert Axelrod discovered that when these encounters were repeated over a long period of time with many players, each with different strategies, greedy strategies tended to do very poorly in the long run while more altruistic strategies did better, as judged purely by self-interest. He used this to show a possible mechanism by which altruistic behaviour could evolve, through natural selection, from mechanisms that are initially purely selfish.
The best deterministic strategy was found to be "Tit for Tat," which Anatol Rapoport developed and entered into the tournament. It was the simplest of any program entered, containing only four lines of BASIC, and won the contest. The strategy is simply to cooperate on the first iteration of the game; after that, the player does what his opponent did on the previous move. Depending on the situation, a slightly better strategy can be "Tit for Tat with forgiveness." When the opponent defects, on the next move, the player sometimes cooperates anyway, with a small probability (around 1%-5%). This allows for occasional recovery from getting trapped in a cycle of defections. The exact probability depends on the line-up of opponents.
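A minimal Python sketch of Tit for Tat and of the forgiving variant, using the 3/0/5/1 point values from the payoff matrix above. Each strategy is a function of the two move histories; the function names, the `play` helper and the forgiveness value 0.03 are illustrative assumptions, not Rapoport's original BASIC program.

```python
import random

C, D = "C", "D"
# PAYOFF[(A's move, B's move)] = (A's points, B's points)
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

def tit_for_tat(my_history, their_history):
    """Cooperate first, then copy the opponent's previous move."""
    return their_history[-1] if their_history else C

def make_forgiving_tft(forgiveness, rng):
    """Tit for Tat that, after an opponent defection, still cooperates
    with probability `forgiveness` (the text suggests roughly 0.01 to 0.05)."""
    def strategy(my_history, their_history):
        if not their_history or their_history[-1] == C:
            return C
        return C if rng.random() < forgiveness else D
    return strategy

def always_defect(my_history, their_history):
    return D

def play(strategy_a, strategy_b, rounds):
    """Play `rounds` iterations and return the total points of A and B."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(a, b)]
        score_a += pa
        score_b += pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat, 10))    # (30, 30)
print(play(tit_for_tat, always_defect, 10))  # (9, 14)
forgiving = make_forgiving_tft(0.03, random.Random(42))
print(play(forgiving, tit_for_tat, 10))      # (30, 30): no defection to forgive
```

Seeding the generator (here `random.Random(42)`) makes noisy experiments reproducible. Against an always-defector, forgiveness only costs points, which is one reason the best probability depends on the line-up of opponents.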
By analysing the top-scoring strategies, Axelrod stated several conditions necessary for a strategy to be successful:
- Nice: The most important condition is that the strategy must be "nice"; that is, it will not defect before its opponent does (this is sometimes referred to as an "optimistic" algorithm). Almost all of the top-scoring strategies were nice, so a purely selfish strategy, for purely utilitarian reasons, will not be the first to "cheat" on its opponent.
- Retaliating: However, Axelrod contended, the successful strategy must not be a blind optimist. It must sometimes retaliate. An example of a non-retaliating strategy is Always Cooperate. This is a very bad choice, as "nasty" strategies will ruthlessly exploit such players.
- Forgiving: Successful strategies must also be forgiving. Though players will retaliate, they will once again fall back to cooperating if the opponent does not continue to defect. This stops long runs of revenge and counter-revenge, maximizing points.
- Non-envious: The last quality is being non-envious, that is, not striving to score more than the opponent (Tit for Tat, for example, can never score more than its opponent in any single match).
Therefore, Axelrod reached the seemingly paradoxical conclusion that selfish individuals, for their own selfish good, will tend to be nice, forgiving and non-envious.
The optimal (points-maximizing) strategy for the one-time PD game is simply defection; as explained above, this is true whatever the composition of opponents may be. However, in the iterated-PD game the optimal strategy depends upon the strategies of likely opponents, and how they will react to defections and cooperations. For example, consider a population where everyone defects every time, except for a single individual following the Tit-for-Tat strategy. That individual is at a slight disadvantage because of the loss on the first turn. In such a population, the optimal strategy for that individual is to defect every time. In a population with a certain percentage of always-defectors and the rest being Tit-for-Tat players, the optimal strategy for an individual depends on the percentage, and on the length of the game.
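The dependence on the mix of opponents and on game length can be computed directly for the two-strategy population described above. This sketch assumes a large population (so one individual's choice does not change the mix), matches of a fixed length n, and the 3/0/5/1 points; `p_alld` is the fraction of always-defectors, and all names are illustrative. Tit for Tat earns R*n against itself and S + P*(n - 1) against an always-defector, who in turn earns T + P*(n - 1) against Tit for Tat.

```python
T, R, P, S = 5, 3, 1, 0  # points from the payoff matrix above

def match_score(me, opponent, n):
    """Points `me` earns in an n-round match; strategies are "TFT" or "ALLD"."""
    if me == "TFT" and opponent == "TFT":
        return R * n              # mutual cooperation every round
    if me == "TFT" and opponent == "ALLD":
        return S + P * (n - 1)    # exploited once, then mutual defection
    if me == "ALLD" and opponent == "TFT":
        return T + P * (n - 1)    # exploits once, then mutual defection
    return P * n                  # ALLD against ALLD

def expected_score(me, p_alld, n):
    """Expected points per match against an opponent drawn at random from a
    population in which a fraction p_alld always defects and the rest play TFT."""
    return (1 - p_alld) * match_score(me, "TFT", n) + p_alld * match_score(me, "ALLD", n)

def best_response(p_alld, n):
    """The better of the two strategies for one individual; ties go to ALLD."""
    tft = expected_score("TFT", p_alld, n)
    alld = expected_score("ALLD", p_alld, n)
    return "TFT" if tft > alld else "ALLD"

print(best_response(1.0, 10))  # ALLD: everyone else always defects
print(best_response(0.5, 10))  # TFT
print(best_response(0.5, 2))   # ALLD: the match is too short
```

For example, with n = 10 Tit for Tat does better whenever fewer than 16/17 (about 94%) of opponents always defect, but with n = 2 always defecting is at least as good for any mix.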
A strategy called Pavlov (an example of Win-Stay, Lose-Switch) cooperates at the first iteration and whenever the player and co-player did the same thing at the previous iteration; Pavlov defects when the player and co-player did different things at the previous iteration. For a certain range of parameters, Pavlov beats all other strategies by giving preferential treatment to co-players that resemble Pavlov.
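One property often cited in favour of Pavlov is that two Pavlov players recover quickly from a single mistake, whereas two Tit-for-Tat players fall into an endless echo of alternating retaliation. The sketch below shows this by flipping one intended move of player A; the mistake-injection parameter `flip_a_at` and the other names are illustrative.

```python
C, D = "C", "D"

def pavlov(my_history, their_history):
    """Win-Stay, Lose-Switch: cooperate first; afterwards cooperate if both
    players made the same move in the previous round, otherwise defect."""
    if not my_history:
        return C
    return C if my_history[-1] == their_history[-1] else D

def tit_for_tat(my_history, their_history):
    """Cooperate first, then copy the opponent's previous move."""
    return their_history[-1] if their_history else C

def play(strategy_a, strategy_b, rounds, flip_a_at=None):
    """Return the list of (A's move, B's move) pairs. If flip_a_at is given,
    A's intended move in that (0-based) round is flipped once to model a mistake."""
    hist_a, hist_b = [], []
    for t in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        if t == flip_a_at:
            a = D if a == C else C
        hist_a.append(a)
        hist_b.append(b)
    return list(zip(hist_a, hist_b))

print(play(pavlov, pavlov, 6, flip_a_at=1))
# [('C', 'C'), ('D', 'C'), ('D', 'D'), ('C', 'C'), ('C', 'C'), ('C', 'C')]
print(play(tit_for_tat, tit_for_tat, 6, flip_a_at=1))
# [('C', 'C'), ('D', 'C'), ('C', 'D'), ('D', 'C'), ('C', 'D'), ('D', 'C')]
```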
Deriving the optimal strategy is generally done in two ways:
Although Tit-for-Tat is considered to be the most robust basic strategy, a team from Southampton University in England (led by Professor Nicholas Jennings and consisting of Rajdeep Dash, Sarvapali Ramchurn, Alex Rogers and Perukrishnen Vytelingum) introduced a new strategy at the 20th-anniversary Iterated Prisoner's Dilemma competition, which proved to be more successful than Tit-for-Tat. This strategy relied on cooperation between programs to achieve the highest number of points for a single program. The University submitted 60 programs to the competition, which were designed to recognize each other through a series of five to ten moves at the start. Once this recognition was made, one program would always cooperate and the other would always defect, ensuring the maximum number of points for the defector. If the program realized that it was playing a non-Southampton player, it would continuously defect in an attempt to minimize the score of the competing program. As a result, this strategy ended up taking the top three positions in the competition, as well as a number of positions towards the bottom.
This strategy takes advantage of the fact that multiple entries were allowed in this particular competition, and that the performance of a team was measured by that of the highest-scoring player (meaning that the use of self-sacrificing players was a form of minmaxing). In a competition where one has control of only a single player, Tit-for-Tat is certainly a better strategy. Because of this new rule, this competition also has little theoretical significance when analysing single agent strategies as compared to Axelrod's seminal tournament. However, it provided the framework for analysing how to achieve cooperative strategies in multi-agent frameworks, especially in the presence of noise. In fact, long before this new-rules tournament was played, Richard Dawkins in his book The Selfish Gene pointed out the possibility of such strategies winning if multiple entries were allowed, but remarked that most probably Axelrod would not have allowed them if they had been submitted. It also relies on circumventing rules about the prisoner's dilemma in that there is no communication allowed between the two players. When the Southampton programs engage in an opening "ten move dance" to recognize one another, this only reinforces just how valuable communication can be in shifting the balance of the game.
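The recognition-then-roles scheme described above can be sketched as a single strategy factory. Everything concrete here is a hypothetical stand-in: the actual Southampton recognition sequence is not given in this article, so `CODE` is an arbitrary five-move pattern, and the role names "scorer" and "feeder" are labels chosen for illustration.

```python
C, D = "C", "D"
CODE = (C, D, D, C, D)  # hypothetical opening sequence, not the real one

def make_team_member(role, code=CODE):
    """Hypothetical team strategy. role is "scorer" (defects against teammates
    to collect points) or "feeder" (cooperates with teammates)."""
    def strategy(my_history, their_history):
        t = len(my_history)
        seen = min(t, len(code))
        # Treat the opponent as a teammate only while its moves match the code.
        teammate = tuple(their_history[:seen]) == code[:seen]
        if not teammate:
            return D          # outsider: defect for the rest of the match
        if t < len(code):
            return code[t]    # still playing the recognition sequence
        return D if role == "scorer" else C
    return strategy

def tit_for_tat(my_history, their_history):
    return their_history[-1] if their_history else C

def play(strategy_a, strategy_b, rounds):
    """Return the list of (A's move, B's move) pairs."""
    hist_a, hist_b = [], []
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        hist_a.append(a)
        hist_b.append(b)
    return list(zip(hist_a, hist_b))

scorer, feeder = make_team_member("scorer"), make_team_member("feeder")
print(play(scorer, feeder, 8)[5:])       # [('D', 'C'), ('D', 'C'), ('D', 'C')]
print(play(scorer, tit_for_tat, 8)[:3])  # [('C', 'C'), ('D', 'C'), ('D', 'D')]
```

Any deviation from the code during the opening moves marks the opponent as an outsider, after which the program defects for the rest of the match.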
If an iterated PD is going to be played exactly N times, for some known constant N, then it is always game-theoretically optimal to defect in all rounds; the only possible Nash equilibrium is to always defect. The proof goes like this: one might as well defect on the last turn, since the opponent will not have a chance to punish the player. Therefore, both will defect on the last turn. Thus, the player might as well defect on the second-to-last turn, since the opponent will defect on the last no matter what is done, and so on. For cooperation to emerge between game-theoretically rational players, the total number of rounds must be random, or at least unknown to the players. Even in the fixed-N case, however, always defecting is not a strictly dominant strategy, only a Nash equilibrium. The superrational strategy in this case is to cooperate against a superrational opponent, and in the limit of large fixed N, experimental results on strategies agree with the superrational version, not the game-theoretic rational one.
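The first steps of this backward-induction argument can be reproduced numerically. The sketch assumes a known length N = 10 and the 3/0/5/1 points, and pits variants of Tit for Tat that defect unconditionally in their final k rounds against each other; the helper names are illustrative. It illustrates the unravelling rather than proving it.

```python
R, S, T, P = 3, 0, 5, 1
# PAYOFF[(A's move, B's move)] = (A's points, B's points)
PAYOFF = {("C", "C"): (R, R), ("C", "D"): (S, T), ("D", "C"): (T, S), ("D", "D"): (P, P)}

def tft_defecting_last(k, n):
    """Tit for Tat that defects unconditionally in the final k of n known rounds."""
    def strategy(my_history, their_history):
        if len(my_history) >= n - k:
            return "D"
        return their_history[-1] if their_history else "C"
    return strategy

def scores(strategy_a, strategy_b, n):
    """Total points for A and B over exactly n rounds."""
    hist_a, hist_b = [], []
    total_a = total_b = 0
    for _ in range(n):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(a, b)]
        total_a += pa
        total_b += pb
        hist_a.append(a)
        hist_b.append(b)
    return total_a, total_b

N = 10
print(scores(tft_defecting_last(0, N), tft_defecting_last(0, N), N))  # (30, 30)
print(scores(tft_defecting_last(1, N), tft_defecting_last(0, N), N))  # (32, 27)
print(scores(tft_defecting_last(1, N), tft_defecting_last(1, N), N))  # (28, 28)
print(scores(tft_defecting_last(1, N), tft_defecting_last(2, N), N))  # (25, 30)
```

Against plain Tit for Tat, defecting in the last round raises the score from 30 to 32; once the opponent also defects in the last round, defecting one round earlier raises it from 28 to 30. Repeating this step pushes defection back to the first round.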
Another odd case is "play forever" prisoner's dilemma. The game is repeated infinitely many times, and the player's score is the average (suitably computed).
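There is more than one way to "suitably compute" the average over an infinite game. Two common choices, stated here as standard textbook definitions rather than as this article's own, are the limit of means and the normalized discounted sum, where u_i(t) is player i's payoff in round t:

```latex
% Limit of means; liminf is used because the running average need not converge
U_i = \liminf_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} u_i(t)

% Normalized discounted sum with discount factor 0 < \delta < 1
V_i = (1 - \delta) \sum_{t=1}^{\infty} \delta^{t-1} \, u_i(t)
```

With the 5/3/1/0 points, two players who cooperate in every round each get a limit-of-means payoff of R = 3, while two players who take turns exploiting each other average (T + S)/2 = 2.5 per round, consistent with the condition 2R > T + S.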
The prisoner's dilemma game is fundamental to certain theories of human cooperation and trust. On the assumption that the PD can model transactions between two people requiring trust, cooperative behaviour in populations may be modelled by a multi-player, iterated, version of the game. It has, consequently, fascinated many scholars over the years. In 1975, Grofman and Pool estimated the count of scholarly articles devoted to it at over 2,000. The iterated prisoner's dilemma has also been referred to as the "Peace-War game".
The likelihood of defection in a population may be reduced by the experience of cooperation in earlier games allowing trust to build up. Hence self-sacrificing behaviour may, in some instances, strengthen the moral fibre of a group. If the group is small the positive behaviour is more likely to feed back in a mutually affirming way, encouraging individuals within that group to continue to cooperate. This is allied to the twin dilemma of encouraging those people whom one would aid to indulge in behaviour that might put them at risk. Such processes are major concerns within the study of reciprocal altruism, group selection, kin selection and moral philosophy.
Douglas Hofstadter in his Metamagical Themas proposed that the definition of "rational" that led "rational" players to defect is faulty. He proposed that there is another type of rational behavior, which he called "superrational", where players take into account that the other person is presumably superrational, like them. Superrational players behave identically, and know that they will behave identically. They take that into account before they maximize their payoffs, and they therefore cooperate.
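This view of the one-shot PD leads to cooperation as follows: since both superrational players know they will make the same choice, the only outcomes either needs to consider are mutual silence (six months each) and mutual betrayal (five years each). Mutual silence is better for each of them, so both stay silent.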
However, if a superrational player plays against a rational opponent, he will serve a 10-year sentence, and the rational player will go free.
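The contrast between the two kinds of reasoning can be written as two decision rules over the sentence table (years, lower is better; six months written as 0.5). This is a sketch under the simplifying assumption that a superrational player compares only the symmetric outcomes; the function names are illustrative.

```python
SILENT, BETRAY = "silent", "betray"
ACTIONS = (SILENT, BETRAY)
# SENTENCE[(A's choice, B's choice)] = (A's years, B's years); lower is better.
SENTENCE = {
    (SILENT, SILENT): (0.5, 0.5),
    (SILENT, BETRAY): (10, 0),
    (BETRAY, SILENT): (0, 10),
    (BETRAY, BETRAY): (5, 5),
}

def dominant_choice():
    """Classically rational choice for A: an action that gives A a strictly
    shorter sentence than any alternative, whatever B does."""
    for a in ACTIONS:
        if all(SENTENCE[(a, b)][0] < SENTENCE[(x, b)][0]
               for b in ACTIONS for x in ACTIONS if x != a):
            return a
    return None

def superrational_choice():
    """Assume the opponent reasons identically, so only the symmetric outcomes
    (a, a) are possible; pick the one with the shorter sentence."""
    return min(ACTIONS, key=lambda a: SENTENCE[(a, a)][0])

print(dominant_choice(), superrational_choice())                   # betray silent
print(SENTENCE[(superrational_choice(), superrational_choice())])  # (0.5, 0.5)
print(SENTENCE[(superrational_choice(), dominant_choice())])       # (10, 0)
```

The last line reproduces the case just described: the superrational prisoner serves 10 years and the rational one goes free.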
One-shot cooperation is observed in human culture, wherever religious and ethical codes exist.
Superrationality is generally not studied by academic economists, since the standard notion of rationality excludes superrational behavior.
Douglas Hofstadter expresses a strong personal belief that the mathematical symmetry is reinforced by a moral symmetry, along the lines of the Kantian categorical imperative: defecting in the hope that the other player cooperates is morally indefensible. If players treat each other as they would treat themselves, then they will cooperate.
In political science, for instance, the PD scenario is often used to illustrate the problem of two states engaged in an arms race. Both will reason that they have two options, either to increase military expenditure or to make an agreement to reduce weapons. Neither state can be certain that the other one will keep to such an agreement; therefore, they both incline towards military expansion. The paradox is that both states are acting rationally, but producing an apparently irrational result. This could be considered a corollary to deterrence theory.
In sociology or criminology, the PD may be applied to an actual dilemma facing two inmates. The game theorist Marek Kaminski, a former political prisoner, analysed the factors contributing to payoffs in the game set up by a prosecutor for arrested defendants (cf. References). He concluded that while the PD is the ideal game of a prosecutor, numerous factors may strongly affect the payoffs and potentially change the properties of the game.
In environmental studies, the PD is evident in crises such as global climate change. All countries will benefit from a stable climate, but any single country is often hesitant to curb emissions. The immediate benefit to an individual country of maintaining its current behavior is perceived as greater than the benefit to all countries if behavior were changed, which helps explain the current impasse concerning climate change.
In program management and technology development, the PD applies to the relationship between the customer and the developer. Capt Dan Ward, an officer in the US Air Force, examined The Program Manager's Dilemma in an article published in Defense AT&L, a defense technology journal.
The PD frequently occurs in cycling races, for instance in the Tour de France. Consider two cyclists halfway through a race, with the peloton (larger group) at a great distance behind them. The two riders often work together (mutual cooperation) by sharing the tough load of the front position, where there is no shelter from the wind. If neither rider makes an effort to stay ahead, the peloton will soon catch up (mutual defection). An often-seen scenario is one rider doing the hard work alone (cooperating), keeping the two ahead of the peloton. Nearer to the finish (where the threat of the peloton has disappeared), the game becomes a simple zero-sum game, with each rider trying at all costs to avoid giving a slipstream advantage to the other. If there was a (single) defecting rider in the preceding prisoner's dilemma, it is usually he who will win this zero-sum game, having saved energy in the cooperating rider's slipstream. The cooperating rider's attitude may seem extremely naive, but he often has no other choice when the two riders have different physical profiles. The cooperating rider typically has an endurance profile, whereas the defecting rider will more likely be a sprinter. By continuously taking the head position of the twosome, the "cooperating" rider is merely trying to ride away from the defecting sprinter using his endurance advantage over a long distance, thus avoiding a sprint duel at the finish, which he would be bound to lose even if the sprinting rider had cooperated. Just after the escape from the peloton, the endurance-sprinter difference matters less, and it is therefore at this stage of the race that mutual cooperation in the PD can usually be observed. Arguably, it is this almost unavoidable presence of the PD (and its transition into a zero-sum game) that (unconsciously) makes cycling an exciting sport to watch.
PD hardly applies to running sports, because of the negligible importance of air resistance (and shelter from it).
In high school wrestling, sometimes participants intentionally lose unnaturally large amounts of weight so as to compete against lighter opponents. In doing so, the participants are clearly not at their top level of physical and athletic fitness and yet often end up competing against the same opponents anyway, who have also followed this practice (mutual defection). The result is a reduction in the level of competition. Yet if a participant maintains their natural weight (cooperating), they will most likely compete against a stronger opponent who has lost considerable weight.
Advertising is sometimes cited as a real-life example of the prisoner’s dilemma. When cigarette advertising was legal in the United States, competing cigarette manufacturers had to decide how much money to spend on advertising. The effectiveness of Firm A’s advertising was partially determined by the advertising conducted by Firm B. Likewise, the profit derived from advertising for Firm B was affected by the advertising conducted by Firm A. If both Firm A and Firm B chose to advertise during a given period, the advertising would cancel out, receipts would remain constant, and expenses would increase due to the cost of advertising. Both firms would benefit from a reduction in advertising. However, should Firm B choose not to advertise, Firm A could benefit greatly by advertising. Nevertheless, the optimal amount of advertising by one firm depends on how much advertising the other undertakes. As the best strategy depends on what the other firm chooses, there is no dominant strategy, and this is not a prisoner's dilemma but rather an example of a stag hunt. The outcome is similar, though, in that both firms would be better off were they to advertise less than in the equilibrium. Sometimes cooperative behaviors do emerge in business situations. For instance, cigarette manufacturers endorsed the creation of laws banning cigarette advertising, understanding that this would reduce costs and increase profits across the industry. This analysis is likely to be pertinent in many other business situations involving advertising.
Members of a cartel are also involved in a (multi-player) prisoner's dilemma. 'Cooperating' typically means keeping prices at a pre-agreed minimum level; 'defecting' means selling below this minimum level, instantly stealing business (and profits) from other cartel members. Ironically, anti-trust authorities want potential cartel members to mutually defect, ensuring the lowest possible prices for consumers.
The theoretical conclusion of the PD is one reason why, in many countries, plea bargaining is forbidden. Often, precisely the PD scenario applies: it is in the interest of both suspects to confess and testify against the other prisoner/suspect, even if each is innocent of the alleged crime. Arguably, the worst case is when only one party is guilty: here, the innocent one is unlikely to confess, while the guilty one is likely to confess and testify against the innocent.
In the 2008 edition of Big Brother (UK), the dilemma was applied to two of the housemates. A prize fund of £50,000 was available. If both housemates chose to share the prize fund, each would receive £25,000. If one chose to share and the other chose to take, the one who took it would receive the entire £50,000. If both chose to take, both housemates would receive nothing. The housemates had a minute to discuss their decision and were allowed to lie. Both housemates declared they would share the prize fund, but either could have been lying. When Big Brother asked for their final answers, both housemates did indeed choose to share, and so won £25,000 each.
Many real-life dilemmas involve multiple players. Although metaphorical, Hardin's tragedy of the commons may be viewed as an example of a multi-player generalization of the PD: Each villager makes a choice for personal gain or restraint. The collective reward for unanimous (or even frequent) defection is very low payoffs (representing the destruction of the "commons"). Such multi-player PDs are not formal as they can always be decomposed into a set of classical two-player games. The commons are not always exploited: William Poundstone, in a book about the Prisoner's Dilemma (see References below), describes a situation in New Zealand where newspaper boxes are left unlocked. It is possible for someone to take a paper without paying (defecting) but very few do, feeling that if they do not pay then neither will others, destroying the system.
Because there is no mechanism for personal choice to influence others' decisions, this type of thinking relies on correlations between behavior, not on causation. Because of this property, those who do not understand superrationality often mistake it for magical thinking. Without superrationality, not only petty theft, but voluntary voting requires widespread magical thinking, since a non-voter is a free rider on a democratic system.
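In the following variant, defection is never worse than cooperation, but it is only weakly dominant: against a cooperating opponent, defecting earns 2 points instead of 1, while against a defecting opponent both choices earn nothing, because here the sucker's payoff and the punishment for mutual defection are both zero. A self-interested player therefore still has no reason to prefer cooperation, although a cooperator who meets a defector loses nothing relative to defecting.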
The payoff matrix is
| | Cooperate | Defect |
|---|---|---|
| Cooperate | 1, 1 | 0, 2 |
| Defect | 2, 0 | 0, 0 |
This payoff matrix was later used on the British television programmes Shafted and Golden Balls.
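Whether defection is strictly or only weakly dominant can be checked by comparing the row player's payoffs column by column. The sketch below does this for the matrix above, for the classic 3/0/5/1 matrix, and for the Big Brother prize game described earlier (in pounds, with sharing as cooperation and taking as defection); the function and variable names are illustrative.

```python
C, D = "C", "D"
ACTIONS = (C, D)
# payoff[(row action, column action)] = (row payoff, column payoff)
VARIANT = {(C, C): (1, 1), (C, D): (0, 2), (D, C): (2, 0), (D, D): (0, 0)}
CLASSIC = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}
SHARE_OR_TAKE = {(C, C): (25000, 25000), (C, D): (0, 50000),
                 (D, C): (50000, 0), (D, D): (0, 0)}

def dominance(payoff, a, x):
    """How row action a compares with row action x for the row player:
    "strict" if a is better against every column action, "weak" if it is
    never worse and better against at least one, otherwise None."""
    diffs = [payoff[(a, b)][0] - payoff[(x, b)][0] for b in ACTIONS]
    if all(d > 0 for d in diffs):
        return "strict"
    if all(d >= 0 for d in diffs) and any(d > 0 for d in diffs):
        return "weak"
    return None

print(dominance(CLASSIC, D, C))        # strict
print(dominance(VARIANT, D, C))        # weak
print(dominance(SHARE_OR_TAKE, D, C))  # weak
```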
