In-game win probabilities

Remember when Billy Packer declared the 2008 Final Four game between Kansas and North Carolina over? Billy got a bit of blowback for that, especially after UNC was able to pull within four points midway through the second half. I always felt like Billy was on safe ground with his statement. Granted, I suppose “over” taken literally means that there was no chance of the game becoming interesting. I took it to mean UNC had no realistic chance of winning, although of course there was some small chance. But just how safe was Billy’s statement?

Previous attempts to quantify in-game win probabilities in college basketball are limited and have left me unsatisfied because none of them accounted for information known before the game starts. For instance, if Kansas and Alcorn State were tied five minutes into a game, we could come up with a better estimate than just saying each team has an equal chance of winning at that point. We can do better and this post documents my first attempt to do so.

My first step was to estimate a team’s chances of winning, knowing the time and score, and assuming a game between teams of equal strength. To do this, I filtered play-by-play data using my ratings (while accounting for game location). This limits the sample to about 700 play-by-plays involving nearly equal teams, but that’s enough to make reasonable estimates of the probability. With each game, I recorded the lead at a given time and then whether that team won the game. As an example, there were 76 times that a team led by four with ten minutes to go in the first half. Those teams won 56.6% of the time.
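To get a feel for how noisy a 76-game bucket is, here is a minimal sketch of the binomial standard error on that win rate. The count of 43 wins is inferred from 56.6% of 76 rather than stated in the data, and the normal approximation is an assumption made for illustration.

```python
import math

def win_rate_with_error(wins, games):
    """Return the observed win rate and its binomial standard error."""
    rate = wins / games
    std_err = math.sqrt(rate * (1 - rate) / games)
    return rate, std_err

# 43 wins in 76 games matches the 56.6% figure (43 is inferred, not quoted).
rate, std_err = win_rate_with_error(43, 76)

# Rough 95% range using the normal approximation (an assumption).
low, high = rate - 1.96 * std_err, rate + 1.96 * std_err

print(f"win rate {rate:.1%}, standard error {std_err:.1%}")
print(f"approximate 95% range: {low:.1%} to {high:.1%}")
```

A standard error of several percentage points on one bucket means neighboring buckets can easily land out of order, which is the kind of noise smoothing is meant to remove.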

We can’t take that number literally because teams with a 5-point lead at that time had a winning percentage of 67.2. That is a jump of more than ten percentage points for one extra point of lead, which is a larger difference than is logical. So some smoothing of the data had to be applied, then some logistic regression, and finally I got a table of values that makes sense, as shown below.

               Minutes left
Lead  35   30   25   20   15   10   5
0    500  500  500  500  500  500  500
1    514  520  526  534  539  547  569
2    529  541  553  568  578  593  636
3    543  561  579  602  616  637  698
4    557  581  604  634  653  679  753
5    572  601  630  666  688  719  801
6    586  620  654  696  721  755  842
7    600  639  677  724  751  788  876
8    613  658  700  751  780  818  903
9    627  676  722  775  806  844  925
10   640  694  743  798  829  867  942
11   653  711  762  820  850  888  956
12   666  728  781  839  869  905  966
13   679  743  799  857  886  920  974
14   692  759  815  873  901  933  980
15   704  773  831  887  915  944  985
16   716  787  845  901  926  953  989
17   727  801  858  912  936  961  991
18   738  814  871  923  945  967  993
19   749  826  882  932  953  973  995
20   760  837  893  940  959  977  996
21   770  848  903  947  965  981  997
22   780  858  912  954  970  984  998
23   790  868  920  960  974  987  998
24   800  877  927  965  978  989  999
25   809  886  934  969  981  991  999

You can read the values as percent times 10. So that team with a four-point lead with 10 minutes left in the first half has a 58.1% chance of winning. This table ignores a couple of important things, namely which team has possession of the ball and the pace of the game. I’m going to punt on the latter for now, since the effect of pace on winning probabilities is an issue requiring additional study. For the possession issue, it seems reasonable to add a point to whichever team has possession since that’s the expected value of a possession. (Update: My original logic was batty on this issue. It’s more correct to add a half-point for possession.)
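Each column of the table looks consistent with a logistic curve that passes through 50% at a lead of zero, so a single slope per column is enough to handle half-point leads. The sketch below is an inference from the table values, not the formula actually used to build it. The function names and the `has_ball` parameter are made up for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

def implied_slope(table_value, lead):
    """Logistic slope per point of lead implied by one table entry (percent x 10)."""
    return logit(table_value / 1000) / lead

def even_strength_prob(lead, slope, has_ball=None):
    """Win probability for the leading team in an even matchup.

    has_ball: True if the leading team has possession, False if the
    trailing team does, None if unknown. Possession is treated as
    worth half a point, per the update in the text.
    """
    if has_ball is True:
        lead += 0.5
    elif has_ball is False:
        lead -= 0.5
    return 1 / (1 + math.exp(-slope * lead))

# Slope for the 30-minutes-left column, taken from the 10-point entry (694).
slope_30 = implied_slope(694, 10)

# Compare against other entries in the same column of the table.
for lead in (1, 4, 20, 25):
    print(lead, round(1000 * even_strength_prob(lead, slope_30)))

# Four-point lead with the ball, 10 minutes left in the first half.
print(f"{even_strength_prob(4, slope_30, has_ball=True):.3f}")
```

Fitting the slope from one entry and checking it against the rest of the column is only a sanity check on the shape. A proper fit would use every entry in the column.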

I feel that this table is very accurate for teams of even strength, but unfortunately such a matchup is rare in college basketball. Even the two games in the national semifinals, which are matchups of comparable teams, would not have made it through my filter for finding a battle of nearly equal teams. The difficult part is trying to account for team strength.

I need to use an example to explain why. Let’s say we have a game where we assume one team has a 90% chance to win before the game starts. Now suppose that the game is tied at halftime. From our trusty chart, our favorite would have a 50% chance of winning were it an even match with its opponent. The simple thing to do would be to average our two values, weighting each by the share of the game it covers. With half the game gone, that is a straight average: our team has a 70% chance to win now. It seems to make sense to use this linear approach, but one can quickly poke holes in it.

Suppose the favorite jumped out to a 15-point lead five minutes into the game. Our chart gives the even-strength team a 70.4% chance of winning in that case. Using the linear method, the favorite would now have an 87% chance of winning. But wait, our favorite just jumped all over their opponent, and their chance of winning dropped slightly? Think of it another way. With these two teams starting tied and 40 minutes of basketball ahead of them, the underdog had a 10% chance for victory. Now faced with a 15-point deficit and just 35 minutes remaining, the ‘dog has a better chance of winning? It doesn’t make sense.
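The linear approach in these two examples can be sketched as a time-weighted average. The weighting by share of the game remaining is an assumption, chosen because it reproduces the 70% halftime figure and lands close to the 87% quoted for the early run. The function name is hypothetical.

```python
def linear_blend(pregame_prob, even_strength_prob, minutes_left, game_minutes=40):
    """Weight the pregame estimate by time remaining and the chart value by time elapsed."""
    share_left = minutes_left / game_minutes
    return share_left * pregame_prob + (1 - share_left) * even_strength_prob

# Tied at halftime: the chart says 50% for an even matchup.
halftime = linear_blend(0.90, 0.50, minutes_left=20)

# Up 15 with 35 minutes left: the chart says 70.4% for an even matchup.
early_run = linear_blend(0.90, 0.704, minutes_left=35)

print(f"halftime tie: {halftime:.3f}")
print(f"early 15-point lead: {early_run:.3f}")
print("favorite's chances fell after the run:", early_run < 0.90)
```

Any weighted average of 90% and 70.4% has to land between the two, so this method can never push the favorite above its pregame number no matter how big the early lead.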

(From this point on, I only recommend reading if you like awkwardly-structured sentences and math. Just know that I have a good formula to calculate win probabilities given the score, time remaining, team possession, and the relative strength of the teams involved. And also know that I’ll be tweeting the in-game probabilities at five-minute game-time intervals during the Final Four.)

I’ve used two tricks to overcome this. First, I’m not going to treat time as linear. This doesn’t change much in the example provided at the 35-minute mark, but think about the halftime example. I don’t believe our favorite had a 70% chance to win at that point. I believe it was higher. I’m not going to bore you with theory on this point, and I haven’t looked at data to support the idea. For now, I’m accepting it. If need be, players are going to try harder as the game goes on. In order to account for this, I’m altering the time scale of the game by taking the square root of the fractional time remaining. That’s a mouthful, but at halftime, instead of assuming there is 50% of the game yet to be played, I’m going to pretend like there’s 70.7% of the game left to be played.
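A small sketch of the adjusted time scale, compared with the plain linear share at each five-minute mark. The function name and the 40-minute default are choices made for illustration.

```python
import math

def remaining_weight(minutes_left, game_minutes=40):
    """Square root of the fraction of the game still to be played."""
    return math.sqrt(minutes_left / game_minutes)

# Compare the linear share of time left with the square-root version.
for minutes_left in (35, 30, 25, 20, 15, 10, 5):
    linear = minutes_left / 40
    adjusted = remaining_weight(minutes_left)
    print(f"{minutes_left:2d} min left: linear {linear:.3f}, adjusted {adjusted:.3f}")
```

The adjusted weight is always at least as large as the linear one, so the pregame estimate keeps more of its influence throughout the game, and the gap is widest late.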

However, at the 35-minute mark, no combination of our initial 90% and the predicted 70.4% will give us a number higher than 90%, which is what would make sense. For this, I’m using log5 to adjust our initial estimate of our favorite, using 90% for the favorite, and the 29.6% (100%-70.4%) that’s the even-strength estimate for the opposing team at this point. That returns a value of 95.5%. I can use that in the linear calculation of win probability. I actually convert the probability to odds before I do this. But putting 95.5% and 70.4% into this sausage machine returns a probability of 95.3% that our favored team will win once they have a 15-point lead five minutes into the game. That our favorite’s chances went from 90% to 95.3% with their early run sounds reasonable.
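Here is one way the whole calculation might fit together. The log5 step is the standard formula. The blending step is an assumption, since the exact recipe isn't spelled out: it weights the odds of the log5-adjusted estimate by the square root of the fractional time remaining and the odds of the chart value by the rest. That reading reproduces both the 95.5% and the 95.3% in the example. All function names are hypothetical.

```python
import math

def log5(p_a, p_b):
    """Chance that A beats B, given each side's win probability against an average opponent."""
    return p_a * (1 - p_b) / (p_a * (1 - p_b) + p_b * (1 - p_a))

def to_odds(p):
    return p / (1 - p)

def from_odds(odds):
    return odds / (1 + odds)

def in_game_prob(pregame_prob, even_strength_prob, minutes_left, game_minutes=40):
    """In-game win probability for a team (assumed blending recipe).

    pregame_prob: the team's chance of winning before tip-off.
    even_strength_prob: chart value for the team's current lead and time,
        which must be strictly between 0 and 1.
    """
    # Step 1: log5 of the pregame estimate against the opponent's chart value.
    adjusted = log5(pregame_prob, 1 - even_strength_prob)
    # Step 2: square-root time scale for how much game is left.
    weight = math.sqrt(minutes_left / game_minutes)
    # Step 3: blend in odds space, then convert back to a probability.
    blended = weight * to_odds(adjusted) + (1 - weight) * to_odds(even_strength_prob)
    return from_odds(blended)

adjusted = log5(0.90, 1 - 0.704)
final = in_game_prob(0.90, 0.704, minutes_left=35)
print(f"log5-adjusted estimate: {adjusted:.3f}")
print(f"in-game probability:    {final:.3f}")
```

One property of this reading is that a pregame estimate of 50% returns the chart value unchanged, so the even-strength table is recovered as a special case.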

There’s lots more calibration to do with this system, but since I just thought about doing this a few days ago, it was necessary to get something done before the Final Four started. This will allow us to get a feel for how important events affect the outcome of each game this weekend.

By the way, according to the formula, UNC had about a 5% chance of coming back on Kansas when they were down 28 with 5 minutes to go in the first half. If that seems high, it may be. In my database of evenly-matched games, the largest deficit a team faced at that point in the game was 22. But amazingly, I have cases where a team overcame a 21- and a 19-point deficit. So perhaps Billy Packer was slightly crazy for jumping to conclusions when he did.
