Fans of baseball, and more and more often managers and general managers of professional baseball teams, are known for their interest in statistics, and also for inferring the probabilities of future events from statistics that summarize what has happened historically (although they might not think about this in exactly these terms!). This amounts to adopting the frequentist interpretation of probability.
I’m a lifelong fan of the Boston Red Sox, who tonight faced a Game 3 battle against the Houston Astros in the 2018 American League Championship Series. The series was tied 1-1 after two games at Fenway, and then moved to Houston for the next 3 games. ESPN previewed Game 3, and made sure to emphasize how important it is to win Game 3: when a 7-game MLB postseason series is tied 1-1, the team that wins Game 3 goes on to win the series 69% of the time.
Is this a surprising statistic? Well, let’s think about the situation. The team winning Game 3 has just taken a 2-1 series lead with (at most) 4 games left to play. They win the series by winning two of the remaining games, while the losing team must win three. What then is the probability that the team that wins Game 3 wins the series?
Suppose that the two teams are equally matched in each game, regardless of where the game is played and the current series score. In this case, the team with the 2-1 series lead wins any remaining game with probability $p = 1/2$, and loses with probability $1 - p = 1/2$. If you have just learned about the binomial distribution, you might be tempted to think that the probability that the leading team wins the series is given by the binomial probability that they win 2 of the remaining 4 games:
$$\binom{4}{2} \left(\frac{1}{2}\right)^{2} \left(\frac{1}{2}\right)^{2} = \frac{6}{16} = 37.5\%$$
Intuition indicates that this approach must be wrong; why would the team down 2-1 in the series have a higher likelihood of winning? Another wrong approach is to use the binomial probabilities again to compute the likelihood that the leading team loses 3 of the remaining 4 games, and then to subtract this likelihood from one to determine the likelihood that they win the series:
$$1 - \binom{4}{3} \left(\frac{1}{2}\right)^{3} \left(\frac{1}{2}\right)^{1} = 1 - \frac{4}{16} = 75\%$$
This estimate is of course more plausible, since you would certainly guess that the team with a 2-1 series lead is more likely to win the series. But it is still not correct.
A safer approach to computing probabilities is to examine the possible events directly. Let the outcomes of the remaining games be represented by a tuple with an entry $W$ when the leading team wins, and $L$ when they lose. Here is the complete set of game outcome events resulting in the leading team winning the series: $(W, W)$, $(W, L, W)$, $(L, W, W)$, $(W, L, L, W)$, $(L, W, L, W)$, and $(L, L, W, W)$. The likelihood of each of these events depends only on the number of wins $w$ and losses $l$ in the tuple: $(1/2)^{w + l}$. Thus, the likelihoods of these events are respectively $1/4$, $1/8$, $1/8$, $1/16$, $1/16$, and $1/16$. Adding them together yields $11/16$, or a probability of 68.75%. Interesting.
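The enumeration above can also be checked mechanically. The following is only a sketch under the same assumption of evenly matched teams; the function names and the `W`/`L` labels are illustrative choices, not anything from ESPN or MLB. It generates every sequence of remaining games that stops as soon as the leading team has 2 more wins (or 3 more losses), and adds up the probabilities of the winning sequences using exact fractions.

```python
from fractions import Fraction

# Assumption from the text: each remaining game is a fair coin flip.
P_WIN = Fraction(1, 2)


def winning_sequences(wins_needed=2, losses_to_lose=3, prefix=()):
    """Yield each 'W'/'L' sequence that ends with the leading team clinching."""
    if prefix.count("W") == wins_needed:
        # Series is over: the leading team has clinched.
        yield prefix
        return
    if prefix.count("L") == losses_to_lose:
        # Series is over: the trailing team has clinched.
        return
    yield from winning_sequences(wins_needed, losses_to_lose, prefix + ("W",))
    yield from winning_sequences(wins_needed, losses_to_lose, prefix + ("L",))


def sequence_probability(seq, p=P_WIN):
    """Probability of one specific sequence of independent game outcomes."""
    return p ** seq.count("W") * (1 - p) ** seq.count("L")


sequences = list(winning_sequences())
total = sum(sequence_probability(s) for s in sequences)

for s in sequences:
    print("".join(s), sequence_probability(s))
print(total, float(total))  # 11/16 0.6875
```

Changing `P_WIN` to something other than one half gives the corresponding series probability for unevenly matched teams, still under the assumption that games are independent.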
Assuming that the two teams have equal likelihood of winning any remaining game leads to an estimate of the likelihood of a series win for the team with a 2-1 lead that is essentially equal to the statistic summarizing what is actually observed. The hypothesis that teams tied 1-1 after two games are equally likely to win any remaining game seems consistent with the observed series results.
So what was wrong with the binomial distribution approach to computing the probability? It is important to remember that the binomial counts successes from exactly $n$ Bernoulli trials. Setting $n = 4$ requires care in this computation, since in many cases the series is concluded without playing all four remaining games. We can get the right answers from the binomial if we assume that the unnecessary games will be played. Then, the leading team would win the series if they win 2, 3, or 4 games. Similarly, the trailing team would win if they win 3 or 4 games. Yet another way to view this is that the leading team will not win the series if they win 0 or 1 more games. But again, each of these approaches only yields the correct probability if each likelihood is computed assuming that all four games are played. For example, using the approach of computing one minus the likelihood that the leading team wins 1 or 0 additional games:
$$1 - \left[\binom{4}{0} + \binom{4}{1}\right] \left(\frac{1}{2}\right)^{4} = 1 - \frac{5}{16} = \frac{11}{16} = 68.75\%$$
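As a rough check on the three equivalent binomial computations just described, here is a short sketch that assumes all four remaining games are played and that each is a fair coin flip. The helper `binom_pmf` and the variable names are hypothetical, written just for this post. It also reproduces the two tempting but wrong numbers from earlier for comparison.

```python
from fractions import Fraction
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n = 4               # pretend all four remaining games are played
p = Fraction(1, 2)  # assumption: evenly matched teams

# Three equivalent ways to get the leading team's series win probability.
leader_wins_2_or_more = sum(binom_pmf(k, n, p) for k in (2, 3, 4))
one_minus_trailer_wins_3_or_4 = 1 - sum(binom_pmf(k, n, p) for k in (3, 4))
one_minus_leader_wins_0_or_1 = 1 - sum(binom_pmf(k, n, p) for k in (0, 1))

# The two tempting but incorrect computations.
naive_exactly_two_wins = binom_pmf(2, n, p)        # 3/8
naive_not_three_losses = 1 - binom_pmf(3, n, p)    # 3/4

print(leader_wins_2_or_more)          # 11/16
print(one_minus_trailer_wins_3_or_4)  # 11/16
print(one_minus_leader_wins_0_or_1)   # 11/16
print(naive_exactly_two_wins, naive_not_three_losses)
```

The naive versions go wrong because each uses a single term of the binomial where a sum over several outcomes is needed.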
As I was writing this post, the Red Sox won Game 3 and took a 2-1 series lead!