Can the BCS computers pick next week’s—or last week’s—winners?
There’s a great scene near the climax of the 2001 film Zoolander in which Hansel, played by Owen Wilson (hook ‘em), grabs a late-90s iMac and, as then-BCS Coordinator John Swofford looks on in horror, yells: "We got thirty years of computer rankings, right here in this computer, that are gonna bring you down!" At least that’s how I remember it.
Hansel is of course alluding to the controversy and interest that has followed the inclusion of mysterious algorithms as part of the selection process of the two teams deemed worthy to play for the national championship.
The use of computer ratings as part of the BCS formula has an interesting history. It’s worth looking back at the BCS’s byzantine methodology in its early years to better appreciate the simplicity of the current 1/3-1/3-1/3 poll-poll-computers formula. No more arbitrary “quality win” bonuses and no stand-alone strength of schedule category. No more consideration of margin of victory by the computers either: it was deemed not PC for PCs to seemingly endorse running up the score. Apparently it is more difficult to make an algorithm account for the diminishing importance of score margin than to tell sports information directors not to just look at the ticker on the bottom of ESPN.
This no-score-margin requirement led to the selection, prior to the 2004 season, of six different computer rating systems (Anderson-Hester, Richard Billingsley, Wes Colley’s Matrix, Ken Massey, Jeff Sagarin, and Peter Wolfe). A score for each team is calculated by throwing out its highest and lowest computer rankings and taking the average of the other four. These six ranking systems have been explored by others many times before, e.g. here, here, here, and here, so I won’t go into much detail beyond a few fun tidbits:
-The Billingsley ratings have a preseason rating component. Wow.
-Wes Colley lets you play Football God and see how hypothetical games affect his rankings.
-Massey & Wolfe provide rankings for ALL college football teams—700+—while Sagarin includes I-A and I-AA (the other three just rank I-A teams).
-Special thanks to Ken Massey for giving me the go-to mascot look-up site.
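For the curious, the throw-out-the-extremes averaging described above can be sketched in a few lines. This is only an illustration of the description in this post: the function name and the example ranks are made up, and the official BCS arithmetic may differ in its details (for example, in how ranks are converted to a score).

```python
def bcs_computer_average(ranks):
    """Drop the single best and single worst rank, then average the other four.

    `ranks` is a hypothetical list of one team's rank in each of the
    six computer systems (1 = best).
    """
    if len(ranks) != 6:
        raise ValueError("expected exactly six computer rankings")
    ordered = sorted(ranks)
    kept = ordered[1:-1]  # discard the lowest and highest values
    return sum(kept) / len(kept)


# Made-up ranks for one team across the six systems.
example_ranks = [3, 5, 4, 9, 2, 4]
print(bcs_computer_average(example_ranks))  # 4.0
```

Note that one outlier system (the 9 here) has no effect on the result, which is the whole point of trimming.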
There is great debate as to what a rating/ranking system should measure, be it a human voter or an algorithm. For instance: how do you weigh “big wins”? Does rewarding the schedule played mean more than predicting a future or hypothetical contest? All of these arguments have merit, and I believe the disagreement between pollsters and statisticians on what is of paramount importance can provide for a healthy diversity in the rankings. To cut to the chase, assessing the computer rankings led me to three observations: 1) the BCS computers behave very similarly to one another; 2) very few college football games have surprising outcomes; 3) computer rankings raise the same philosophical voting conundrums as the human polls.
There are two easily-quantified ways to rate a rating system: how much is it in line with past results, and how well does it predict future events? Given the brevity of the football season, a common debate is the ‘who would beat who on a neutral field next week’ discussion. Or, to use an irrelevant hypothetical, would Oklahoma State beat Alabama? While the computer ratings are not supposed to predict winners per se, their ability to do so is an indication that they accurately assess the ability of teams. Using the most recent rankings to predict the next week’s games, from the release of the first BCS rankings onward (not all of the BCS computers released rankings until then), the rating systems’ predictive abilities are shown in Table 1 and compared to the Vegas oddsmakers.
|Vegas, Straight Up||256||93||73.35%|
Table 1 - Predictive ability of various computer rankings. The top group includes the six algorithms used in the BCS ratings for games between two I-A (FBS) teams. The second group are the Massey and Sagarin variations that include victory margins. The third group compares the algorithms that rate I-AA (FCS) and I-A teams for all games including those teams for which I could find a point spread. Finally, the fourth group shows the success of Vegas oddsmakers at predicting winners and the percentage of favorites who covered the spread. Note: “MOV” for Massey is for “margin of victory”.
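The scoring behind a table like this is simple: for each game, pick the team with the higher rating and count how often that pick won. Below is a minimal sketch of that tally, not the exact script I used. The team names and ratings are invented, and to predict rather than postdict you would pass in the most recent ratings released before each week's games.

```python
def pick_accuracy(ratings, games):
    """Count how often the higher-rated team won.

    ratings: dict mapping team name -> rating (higher is better).
    games:   list of (winner, loser) tuples.
    Returns (correct, incorrect, percent_correct).
    """
    correct = incorrect = 0
    for winner, loser in games:
        if ratings[winner] > ratings[loser]:
            correct += 1
        else:
            incorrect += 1  # equal ratings are counted as a miss
    total = correct + incorrect
    pct = round(100.0 * correct / total, 2) if total else 0.0
    return correct, incorrect, pct


# Invented ratings and results, for illustration only.
ratings = {"Team A": 0.91, "Team B": 0.75, "Team C": 0.40}
games = [("Team A", "Team B"), ("Team C", "Team B"), ("Team A", "Team C")]
print(pick_accuracy(ratings, games))  # (2, 1, 66.67)
```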
A few points worth mentioning from Table 1. The Colley Matrix was the prediction winner this season while Peter Wolfe’s ranking was the loser; the other four algorithms were essentially identical. Also, the two algorithms that used margin of victory slightly outperformed Vegas. That is pretty impressive given that those algorithms take no intangible information into account, e.g. injuries or matchup advantages (note: I didn’t adjust the Sagarin Predictor ratings for home versus away games). These percentages, however, are somewhat inflated by the large number of mismatches on the college football schedule. Some games are so easy to pick that even the preseason Coaches’ Poll would do well. Therefore, in Table 2 I limited the sample set to games in which the point spread was a touchdown or less. As expected, neither the gamblers nor the computers fared as well.
|Vegas, Straight Up||71||58||55.04%|
Table 2 - The BCS computers aren't much better than a coin flip when the games are between comparable squads. Note I removed the I-AA inclusive section here because those point spreads were all greater than 7.
The most notable stat in Table 2 is how well Sagarin’s Predictor algorithm did at… predicting. Four games better than the Vegas oddsmakers, not too shabby. Strangely, despite picking different teams (at times), the two versions of Massey’s rating system had the same success rate. The success rates in Table 2 compare to over 80% for games where the point spread was greater than 7.
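As a sanity check on the "over 80%" figure, the Vegas straight-up rows from Tables 1 and 2 can be subtracted to get the record in lopsided games. The sketch below uses only the numbers quoted in those two rows and assumes Table 2 is a strict subset of the games in Table 1; the helper name `win_pct` is just for illustration.

```python
def win_pct(right, wrong):
    """Percentage of correct picks, rounded to two decimals."""
    return round(100.0 * right / (right + wrong), 2)


# Vegas straight-up records quoted in Tables 1 and 2.
all_right, all_wrong = 256, 93       # all games with a line
close_right, close_wrong = 71, 58    # point spread of 7 or less

# Games with a spread greater than 7 are whatever is left over.
big_right = all_right - close_right  # 185
big_wrong = all_wrong - close_wrong  # 35

print(win_pct(all_right, all_wrong))      # 73.35
print(win_pct(close_right, close_wrong))  # 55.04
print(win_pct(big_right, big_wrong))      # 84.09
```

So the favorite won about 84% of the time when the spread was more than a touchdown, versus about 55% when it was not.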
On the opposite end of the spectrum, how well do these algorithms “fit” the season? In other words, looking at the post-conference-title-game computer ratings, how accurately could you guess the outcome of individual games over the course of the season? As it turns out, very accurately (Table 3).
|Point Spread <= 7||All Games|
|Vegas, Straight Up||155||121||56.16%||588||185||76.07%|
Table 3 - How the final regular season computer ratings do at “postdicting” all regular season games.
Three points to make here. First, only about 1 in 6 games can be considered “upsets” after the fact; that is, these games deviate from the fit (computer rating) to the data (games played). Second, as the BCS rankings only account for wins and losses, and only wins and losses are used to rate the ratings in this table, the Sagarin Predictor algorithm actually fares worse than its politically correct counterpart (interestingly, Massey’s doesn’t). Third, the six BCS computers are within 1% of one another: only 4 games’ difference top-to-bottom out of almost 700 games! Only 146 games are postdicted incorrectly by any of the computers. (Note: last year ~81% of games were correctly postdicted, after bowls, using final computer ratings.)
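Counting the games that any computer gets wrong is a union of each system's misses. Here is a rough sketch of that bookkeeping, with invented system names, teams, and ratings; it is not the code behind the 146 figure, just the idea.

```python
def missed_by_any(ratings_by_system, games):
    """Return the indices of games that at least one system postdicts wrongly.

    ratings_by_system: dict mapping system name -> {team: rating},
                       where a higher rating is better.
    games:             list of (winner, loser) tuples.
    """
    missed = set()
    for idx, (winner, loser) in enumerate(games):
        # A system "misses" when it does not rate the winner above the loser.
        if any(r[winner] <= r[loser] for r in ratings_by_system.values()):
            missed.add(idx)
    return missed


# Two hypothetical systems that disagree about Team B and Team C.
systems = {
    "System X": {"Team A": 3, "Team B": 2, "Team C": 1},
    "System Y": {"Team A": 3, "Team B": 1, "Team C": 2},
}
games = [("Team A", "Team B"), ("Team C", "Team B"), ("Team B", "Team C")]
print(sorted(missed_by_any(systems, games)))  # [1, 2]
```

When two teams split a pair of games, as Team B and Team C do here, every system has to miss at least one of them, which is part of why some share of the season is unexplainable no matter how the ratings are built.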
Building on the third point, the similarity of fit in the six algorithms suggests that none of them produce bizarre or unreasonable results. But this also raises a concern: is there a lack of diversity in their methods and weighting systems*? Perhaps we should be glad there is no digital Craig James (for the record, Boise State is no worse than 12th in the computer polls) but this gets back to the original question: how should a computer measure a team? (*Relevant evidence that there is some diversity among the computers: the median gap between a team’s highest and lowest rank is 12, and Tennessee and Texas Tech both vary by 32 positions!)
As I said, I don’t think there is necessarily a right answer. I reckon most human voters reach some happy equilibrium of weighing wins accrued, losses, and the apparent ability to beat certain teams they haven’t faced. On an individual level this is highly subjective, yet we’re generally comfortable with the consensus result. When it comes to the computer ratings, we could, if we desired, actually achieve a transparent and consistent weighting scheme; yet as it stands now, only the Colley rankings are completely transparent and reproducible.
Seeing how the computer polls predict and react to game outcomes has definitely helped me put numbers on some relevant characteristics of the college football season. I never knew, for instance, that about 15-20% of games are “unexplainable” when measuring the season as a whole, or that a point spread of a touchdown or less means the game isn’t much more predictable than a coin flip. But strangely enough, I come back to the same questions I ask about human voting philosophies.
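The high-minus-low gap in the footnote is easy to compute once each team's six ranks are in hand. A small sketch follows, using invented teams and ranks rather than the real 2011 data.

```python
from statistics import median


def rank_gaps(ranks_by_team):
    """Gap between each team's worst and best rank across the systems."""
    return {team: max(ranks) - min(ranks) for team, ranks in ranks_by_team.items()}


# Invented ranks for three teams across six systems (1 = best).
ranks_by_team = {
    "Team A": [1, 1, 2, 1, 1, 2],
    "Team B": [10, 14, 22, 9, 30, 12],
    "Team C": [40, 45, 38, 50, 41, 44],
}

gaps = rank_gaps(ranks_by_team)
print(gaps)                   # {'Team A': 1, 'Team B': 21, 'Team C': 12}
print(median(gaps.values()))  # 12
```

The median is a sensible summary here because a handful of teams with very large gaps would otherwise pull an average upward.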
I’d love to hear the readers’ thoughts in the comments section on a couple of different points:
-If you were an AP voter, how would you weigh variables like wins/losses/talent when rating a team?
-Do you think the diversity of voting philosophies is good or should there be uniformity?
-How would you design an array of computer rating systems?
Before closing, a couple things I came across when crunching some of these numbers. First, a table of who the computers have ranked higher in this season’s bowl matchups versus the Vegas lines.
Second, the plot below shows how well the Sagarin rankings do at pre- and post-dicting the entire season’s worth of games. As expected the later ratings are consistent with more of the season’s games, but remarkably, Sagarin’s rankings using only the first week’s results and his preseason weightings (not sure how he rates teams in the preseason) can be used to predict 70% of the games for the entire 2011 season (or over two-thirds of the games from week 2 on). (Note: by October 16th the preseason component is removed from the ratings.)
Unfortunately I don’t have the requisite data to do all these comparisons for past seasons, but I’m happy to run the numbers on anything you’d like to see if possible; this site is also a great resource for similar metrics. There are many mathematical studies of rating systems, e.g. here, for those interested in further reading.
Figure 1 - Plot of how well Sagarin's various ratings predict (and/or post-dict) the entire regular season based on which week the ratings were released.