by Ken Pomeroy on Saturday, April 3, 2010
Remember when Billy Packer declared the 2008 Final Four game between Kansas and North Carolina over? Billy got a bit of blowback for that, especially after UNC was able to pull within four points midway through the second half. I always felt like Billy was on safe ground with his statement. Granted, I suppose “over” taken literally means that there was no chance of the game becoming interesting. I took it to mean UNC had no chance of winning, although of course there was some small chance. But just how safe was Billy’s statement?
Previous attempts to quantify in-game win probabilities in college basketball are limited and have left me unsatisfied because none of them accounted for information known before the game starts. For instance, if Kansas and Alcorn State were tied five minutes into a game, we could come up with a better estimate than just saying each team has an equal chance of winning at that point. We can do better and this post documents my first attempt to do so.
My first step was to estimate a team’s chances of winning, knowing the time and score, and assuming a game between teams of equal strength. To do this, I filtered play-by-play data using my ratings (while accounting for game location). This limits the sample to about 700 play-by-plays involving nearly equal teams, but that’s enough to make reasonable estimates of the probability. With each game, I recorded the lead at a given time and then whether that team won the game. As an example, there were 76 times that a team led by four with ten minutes to go in the first half. Those teams won 56.6% of the time.
We can’t take that number literally because teams with a 5-point lead at that time had a winning percentage of 67.2, which is a larger difference than is logical. So some smoothing of the data had to be applied, then some logistic regression, and finally I got a table of values that makes sense, as shown below.
        Minutes left
Lead     35    30    25    20    15    10     5
   0    500   500   500   500   500   500   500
   1    514   520   526   534   539   547   569
   2    529   541   553   568   578   593   636
   3    543   561   579   602   616   637   698
   4    557   581   604   634   653   679   753
   5    572   601   630   666   688   719   801
   6    586   620   654   696   721   755   842
   7    600   639   677   724   751   788   876
   8    613   658   700   751   780   818   903
   9    627   676   722   775   806   844   925
  10    640   694   743   798   829   867   942
  11    653   711   762   820   850   888   956
  12    666   728   781   839   869   905   966
  13    679   743   799   857   886   920   974
  14    692   759   815   873   901   933   980
  15    704   773   831   887   915   944   985
  16    716   787   845   901   926   953   989
  17    727   801   858   912   936   961   991
  18    738   814   871   923   945   967   993
  19    749   826   882   932   953   973   995
  20    760   837   893   940   959   977   996
  21    770   848   903   947   965   981   997
  22    780   858   912   954   970   984   998
  23    790   868   920   960   974   987   998
  24    800   877   927   965   978   989   999
  25    809   886   934   969   981   991   999
You can read the values as percent times 10. So that team with a four-point lead with 10 minutes left in the first half has a 58.1% chance of winning. This table ignores a couple of important things, namely which team has possession of the ball and the pace of the game. I’m going to punt on the latter for now, since the effect of pace on winning probabilities is an issue requiring additional study. For the possession issue, it seems reasonable to add a point to whichever team has possession since that’s the expected value of a possession. (Update: My original logic was batty on this issue. It’s more correct to add a half-point for possession.)
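As a rough illustration of reading the table with the half-point possession adjustment, here is a minimal Python sketch. It carries only a small slice of the table (three time columns, leads of 0 to 6). The function name, the linear interpolation between whole-point leads, and the symmetry rule for the trailing team are assumptions made for the example, not something the post specifies.

```python
# Slice of the table above: win probability x 1000 for the leading team,
# keyed by minutes left, listed for leads of 0 through 6 points.
TABLE = {
    30: [500, 520, 541, 561, 581, 601, 620],
    20: [500, 534, 568, 602, 634, 666, 696],
    5:  [500, 569, 636, 698, 753, 801, 842],
}


def even_strength_win_prob(lead, minutes_left, has_ball=False,
                           possession_value=0.5):
    """Win probability for a team in a game between equal teams.

    `lead` is that team's margin (negative if trailing). Possession is
    treated as worth `possession_value` points, per the update in the post.
    Interpolating linearly between table rows is an assumption.
    """
    column = TABLE[minutes_left]
    effective = lead + (possession_value if has_ball else 0.0)
    x = abs(effective)
    if x > len(column) - 1:
        raise ValueError("lead is outside the slice of the table included here")
    lo = int(x)                        # whole-point lead at or below x
    hi = min(lo + 1, len(column) - 1)  # next row, clamped to the slice
    frac = x - lo
    p = (column[lo] * (1 - frac) + column[hi] * frac) / 1000
    # With equal teams, the trailing side's chance is the complement.
    return p if effective >= 0 else 1 - p


print(even_strength_win_prob(4, 30))                  # up 4, 30 minutes left
print(even_strength_win_prob(4, 30, has_ball=True))   # same, with the ball
```

The first call reproduces the 58.1% figure quoted above; the second lands halfway between the 4-point and 5-point rows.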
I feel that this table is very accurate for teams of even strength, but unfortunately such a matchup is rare in college basketball. Even the two games in the national semifinals, which are matchups of comparable teams, would not have made it through my filter for finding a battle of nearly equal teams. The difficult part is trying to account for team strength.
I need to use an example to explain why. Let’s say we have a game where we assume one team has a 90% chance to win before the game starts. Now suppose that the game is tied at halftime. From our trusty chart, our favorite would have a 50% chance of winning were it an even match with its opponent. The simple thing to do would be to average our two values – our team has a 70% chance to win now. It seems to make sense to use this linear approach, but one can quickly poke holes in it.
Suppose the favorite jumped out to a 15-point lead five minutes into the game. Our chart gives the even-strength team a 70.4% chance of winning in that case. Using the linear method, the favorite would now have an 87% chance of winning. But wait, our favorite just jumped all over their opponent, and their chance of winning dropped slightly? Think of it another way. With these two teams starting tied and 40 minutes of basketball ahead of them, the underdog had a 10% chance for victory. Now faced with a 15-point deficit and just 35 minutes remaining, the ‘dog has a better chance of winning? It doesn’t make sense.
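The linear method can be sketched in a few lines of Python. This assumes the weight on the pregame estimate is simply the fraction of the game remaining, which is what produces the plain average at halftime; the function and parameter names are made up for the illustration.

```python
def linear_blend(pregame, even_strength, minutes_left, game_minutes=40):
    """Time-weighted average of the pregame and even-strength estimates.

    Assumption: the pregame estimate is weighted by the fraction of the
    game still to be played, and the table value gets the rest.
    """
    weight = minutes_left / game_minutes
    return weight * pregame + (1 - weight) * even_strength


# Tied at halftime: a 90% favorite falls to 70%.
print(linear_blend(0.90, 0.50, minutes_left=20))

# Up 15 with 35 minutes left: the result is still below the pregame 90%.
print(linear_blend(0.90, 0.704, minutes_left=35))
```

Under that weighting the second case comes out a little under 88%, in line with the roughly 87% quoted above. Because it is a weighted average, the result can never exceed the larger of its two inputs, which is exactly the hole being poked here.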
(From this point on, I only recommend reading if you like awkwardly-structured sentences and math. Just know that I have a good formula to calculate win probabilities given the score, time remaining, team possession, and the relative strength of the teams involved. And also know that I’ll be tweeting the in-game probabilities at five-minute game-time intervals during the Final Four.)
I’ve used two tricks to overcome this. First, I’m not going to treat time as linear. This doesn’t change much in the example provided at the 35-minute mark, but think about the halftime example. I don’t believe our favorite had a 70% chance to win at that point. I believe it was higher. I’m not going to bore you with theory on this point, and I haven’t looked at data to support the idea. For now, I’m accepting it. If need be, players are going to try harder as the game goes on. In order to account for this, I’m altering the time scale of the game by taking the square root of the fractional time remaining. That’s a mouthful, but at halftime, instead of assuming there is 50% of the game yet to be played, I’m going to pretend like there’s 70.7% of the game left to be played.
However, at the 35-minute mark, no combination of our initial 90% and the predicted 70.4% will give us a number higher than 90%, which is what would make sense. For this, I’m using log5 to adjust our initial estimate of our favorite, using 90% for the favorite, and the 29.6% (100%-70.4%) that’s the even-strength estimate for the opposing team at this point. That returns a value of 95.5%. I can use that in the linear calculation of win probability. I actually convert the probability to odds before I do this. But putting 95.5% and 70.4% into this sausage machine returns a probability of 95.3% that our favored team will win once they have a 15-point lead five minutes into the game. That our favorite’s chances went from 90% to 95.3% with their early run sounds reasonable.
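One reading of the steps described above, sketched in Python: adjust the pregame estimate with log5, take the square root of the fraction of the game remaining as the weight, and blend the two estimates linearly in odds (not log-odds) before converting back to a probability. The post does not spell out the exact formula, so the function names and the way the pieces are combined are assumptions; this particular combination does reproduce the 95.5% and 95.3% figures.

```python
import math


def log5(a, b):
    """Standard log5 estimate of A beating B, given each side's win rate."""
    return a * (1 - b) / (a * (1 - b) + b * (1 - a))


def to_odds(p):
    return p / (1 - p)


def from_odds(odds):
    return odds / (1 + odds)


def win_prob(pregame, even_strength, minutes_left, game_minutes=40):
    """In-game win probability for the team with the given pregame chance.

    `even_strength` is the table value for that team's current lead and
    time. Assumed combination: log5-adjusted pregame estimate and the
    table value, blended in odds with weight sqrt(fraction remaining).
    Inputs of exactly 0 or 1 are not handled.
    """
    adjusted = log5(pregame, 1 - even_strength)
    weight = math.sqrt(minutes_left / game_minutes)
    blended = weight * to_odds(adjusted) + (1 - weight) * to_odds(even_strength)
    return from_odds(blended)


# 90% favorite, up 15 with 35 minutes left (table value 70.4%).
print(round(log5(0.90, 1 - 0.704), 3))          # log5-adjusted estimate
print(round(win_prob(0.90, 0.704, 35), 3))      # blended win probability
```

For the tied-at-halftime case, `win_prob(0.90, 0.50, 20)` comes out near 87% under these assumptions, well above the 70% from the linear method.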
There’s lots more calibration to do with this system, but since I just thought about doing this a few days ago, it was necessary to get something done before the Final Four started. This will allow us to get a feel for how important events affect the outcome of each game this weekend.
By the way, according to the formula, UNC had about a 5% chance of coming back on Kansas when they were down 28 with 5 minutes to go in the first half. If that seems high, it may be. In my database of evenly-matched games, the largest deficit a team faced at that point in the game was 22. But amazingly, I have cases where a team overcame a 21- and a 19-point deficit. So perhaps Billy Packer was slightly crazy for jumping to conclusions when he did.