The BCS System: Rate Not, Lest Ye Be Rated
Can the BCS computers pick next week’s—or last week’s—winners?
There’s a great scene near the climax of the 2001 film Zoolander in which Hansel, played by Owen Wilson (hook ‘em), grabs a late-90s iMac and, as then-BCS Coordinator John Swofford looks on in horror, yells: "We got thirty years of computer rankings, right here in this computer, that are gonna bring you down!" At least that’s how I remember it.
Hansel is of course alluding to the controversy and interest that has followed the inclusion of mysterious algorithms as part of the selection process of the two teams deemed worthy to play for the national championship.
The use of computer ratings as part of the BCS formula has an interesting history. It’s worth looking back at the BCS’s byzantine methodology in its early years to better appreciate the simplicity of the current 1/3-1/3-1/3 poll-poll-computers formula. No more arbitrary “quality win” bonuses and no stand-alone strength of schedule category. No more consideration of margin-of-victory by the computers either—it was deemed not PC for PCs to seemingly endorse running up the score. Apparently it is more difficult to make an algorithm account for the diminishing importance of score margin than tell sports information directors to not just look at the ticker on the bottom of ESPN.
This no-score-margin requirement led to the selection, prior to the 2004 season, of six different computer rating systems (Anderson-Hester, Richard Billingsley, Wes Colley’s Matrix, Ken Massey, Jeff Sagarin, and Peter Wolfe), from which a score for each team would be calculated by throwing out the highest and lowest computer rankings and taking the average of the other four. These six ranking systems have been explored by others many times before, e.g. here, here, here, and here, so I won’t go into much detail beyond a few fun tidbits:
-The Billingsley ratings have a preseason rating component. Wow.
-Wes Colley lets you play Football God and see how hypothetical games affect his rankings.
-Ken Massey & Jeff Sagarin have BCS-unfriendly versions of their polls that consider margin of victory (I’ve used Sagarin’s “Predictor” rankings twice before here).
-Massey & Wolfe provide rankings for ALL college football teams—700+—while Sagarin includes I-A and I-AA (the other three just rank I-A teams).
-Colley is the only one who releases his formula. Billingsley, Massey, and Wolfe offer some insight.
-Special thanks to Ken Massey for giving me the go-to mascot look-up site.
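For the curious, the throw-out-the-extremes step can be sketched in a few lines of Python. This follows the description above literally (average the four middle rankings) and isn't meant to reproduce the official BCS arithmetic; the function name and the example ranks are made up for illustration.

```python
def trimmed_computer_average(ranks):
    """Drop the best and worst of six rankings and average the remaining four."""
    if len(ranks) != 6:
        raise ValueError("expected exactly six computer rankings")
    ordered = sorted(ranks)
    middle_four = ordered[1:-1]  # discard one minimum and one maximum
    return sum(middle_four) / len(middle_four)


# Hypothetical ranks for one team across the six systems (not real data).
example_ranks = [3, 5, 4, 9, 2, 4]
print(trimmed_computer_average(example_ranks))
```

Note that one outlier system (the 9 in the example) has no effect on the result, which is the point of trimming.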
There is great debate as to what a rating/ranking system should measure, be it a human voter or an algorithm. For instance, how do you weigh “big wins”? Does rewarding the schedule played mean more than predicting a future or hypothetical contest? All of these arguments have merit, and I believe the disagreement between pollsters and statisticians on what is of paramount importance can provide for a healthy diversity in the rankings. To cut to the chase, assessing the computer rankings led me to three observations: 1.) The BCS computers behave very similarly to one another; 2.) Very few college football games have surprising outcomes; 3.) Computer rankings raise the same philosophical voting conundrums as the human polls.
There are two easily-quantified ways to rate a rating system: how much is it in line with past results, and how well does it predict future events? Given the brevity of the football season, a common debate is the ‘who would beat who on a neutral field next week’ discussion. Or, to use an irrelevant hypothetical, would Oklahoma State beat Alabama? While the computer ratings are not supposed to predict winners per se, their ability to do so is an indication that they accurately assess the ability of teams. Using the most recent rankings to predict the next week’s games, from the release of the first BCS rankings onward (not all of the BCS computers released rankings until then), the rating systems’ predictive abilities are shown in Table 1 and compared to the Vegas oddsmakers.
| Rating System | Correct | Incorrect | PCT. |
| Anderson-Hester | 240 | 102 | 70.18% |
| Billingsley | 241 | 101 | 70.47% |
| Colley Matrix | 245 | 97 | 71.64% |
| Wolfe (FBS) | 236 | 106 | 69.01% |
| Sagarin (FBS)(ELO-Chess) | 240 | 102 | 70.18% |
| Massey (FBS)(BCS) | 241 | 101 | 70.47% |
| Sagarin (FBS)(Predictor) | 251 | 91 | 73.39% |
| Massey (FBS)(MOV) | 249 | 93 | 72.81% |
| Wolfe | 244 | 106 | 69.71% |
| Sagarin (ELO-Chess) | 248 | 102 | 70.86% |
| Massey (BCS) | 249 | 101 | 71.14% |
| Sagarin (Predictor) | 259 | 91 | 74.00% |
| Massey (MOV) | 257 | 93 | 73.43% |
| Vegas, Straight Up | 256 | 93 | 73.35% |
| Favorites Covered: | 169 | 173 | 49.42% |
Table 1 - Predictive ability of various computer rankings. The top group includes the six algorithms used in the BCS ratings for games between two I-A (FBS) teams. The second group are the Massey and Sagarin variations that include victory margins. The third group compares the algorithms that rate I-AA (FCS) and I-A teams for all games including those teams for which I could find a point spread. Finally, the fourth group shows the success of Vegas oddsmakers at predicting winners and the percentage of favorites who covered the spread. Note: “MOV” for Massey is for “margin of victory”.
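The tallies in Table 1 boil down to "pick whichever team is rated higher and check the result." Here is a rough sketch of that bookkeeping, assuming ratings come as a simple team-to-number mapping where higher is better; the team names, ratings, and the `tally_picks` helper are all hypothetical, and no home-field adjustment is applied.

```python
def tally_picks(ratings, games):
    """Count how often the higher-rated team won.

    ratings: dict mapping team name -> rating (higher is better)
    games:   list of (winner, loser) tuples
    """
    correct = incorrect = 0
    for winner, loser in games:
        if winner not in ratings or loser not in ratings:
            continue  # skip games involving unrated teams
        if ratings[winner] == ratings[loser]:
            continue  # no pick possible on an exact tie
        if ratings[winner] > ratings[loser]:
            correct += 1
        else:
            incorrect += 1
    total = correct + incorrect
    pct = 100.0 * correct / total if total else 0.0
    return correct, incorrect, pct


# Made-up ratings and results, purely for illustration.
ratings = {"Team A": 0.9, "Team B": 0.7, "Team C": 0.4}
games = [("Team A", "Team B"), ("Team C", "Team B"), ("Team A", "Team C")]
correct, incorrect, pct = tally_picks(ratings, games)
print(f"{correct} correct, {incorrect} incorrect, {pct:.2f}%")
```

For prediction you would feed in the ratings released before each week's games; feeding in the final ratings for every game instead gives a "postdiction" tally.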
A few points worth mentioning from Table 1. The Colley Matrix was the prediction winner this season while Peter Wolfe’s ranking was the loser; the other four algorithms were essentially identical. Also, the two algorithms that used margin of victory slightly outperformed Vegas. Pretty impressive given those algorithms take no intangible information into account (e.g. injuries, matchup advantages) (note: I didn’t adjust Sagarin Predictor ratings for home versus away games). These percentages, however, are somewhat inflated by the large number of mismatches on the college football schedule. Some games are so easy to pick, even the preseason Coaches’ Poll would do well. Therefore in Table 2 I limited the sample set to games in which the point spread was a touchdown or less. As expected, neither gamblers nor computers fared as well.
| Rating System | Correct | Incorrect | PCT. |
| Anderson-Hester | 67 | 63 | 51.54% |
| Billingsley | 69 | 61 | 53.08% |
| Colley Matrix | 70 | 60 | 53.85% |
| Wolfe (FBS) | 63 | 67 | 48.46% |
| Sagarin (FBS)(ELO-Chess) | 69 | 61 | 53.08% |
| Massey (FBS)(BCS) | 71 | 59 | 54.62% |
| Sagarin (FBS)(Predictor) | 75 | 55 | 57.69% |
| Massey (FBS)(MOV) | 71 | 59 | 54.62% |
| Vegas, Straight Up | 71 | 58 | 55.04% |
| Favorites Covered: | 60 | 66 | 47.62% |
Table 2 - The BCS computers aren't much better than a coin flip when the games are between comparable squads. Note I removed the I-AA inclusive section here because those point spreads were all greater than 7.
The most notable stat in Table 2 is how well Sagarin’s Predictor algorithm did at… predicting. Four games better than the Vegas oddsmakers, not too shabby. Strangely, despite picking different teams (at times), the two versions of Massey’s rating system had the same success rate. The success rates in Table 2 compare to over 80% for games where the point spread was greater than 7.
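One way to put the "coin flip" comparison on firmer footing is to ask how often a pure coin flipper would match each tally over 130 games. A coin flipper expects 65 correct, give or take about 5.7 games (one standard deviation, sqrt(130)/2). The sketch below computes the exact binomial tail probability; it assumes games are independent 50/50 propositions, which is a simplification, and `prob_at_least` is just an illustrative helper name.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Correct picks out of 130 close games, taken from Table 2.
table2 = {
    "Sagarin (FBS)(Predictor)": 75,
    "Colley Matrix": 70,
    "Wolfe (FBS)": 63,
}
for system, correct in table2.items():
    chance = prob_at_least(correct, 130)
    print(f"{system}: P(>= {correct} of 130 by coin flip) = {chance:.3f}")
```

The smaller the probability, the harder it is to explain that system's record as luck alone; with a single season of close games, most of these tallies are hard to distinguish from guessing.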
On the opposite end of the spectrum, how well do these algorithms “fit” the season? In other words, looking at the post-conference-title-game computer ratings, how accurately could you guess the outcome of individual games over the course of the season? As it turns out, very accurately (Table 3).
| | Point Spread <= 7 | | | All Games | | |
| Rating System | Correct | Incorrect | Pct. | Correct | Incorrect | Pct. |
| Anderson-Hester | 208 | 65 | 76.19% | 571 | 108 | 84.09% |
| Billingsley | 210 | 63 | 76.92% | 568 | 111 | 83.65% |
| Colley Matrix | 210 | 63 | 76.92% | 566 | 113 | 83.36% |
| Wolfe (FBS) | 210 | 63 | 76.92% | 568 | 111 | 83.65% |
| Sagarin (FBS)(ELO-Chess) | 207 | 66 | 75.82% | 568 | 111 | 83.65% |
| Massey (FBS)(BCS) | 205 | 68 | 75.09% | 567 | 112 | 83.51% |
| Sagarin (FBS)(Predictor) | 189 | 84 | 69.23% | 545 | 134 | 80.27% |
| Massey (FBS)(MOV) | 207 | 66 | 75.82% | 563 | 116 | 82.92% |
| Wolfe | 216 | 63 | 77.42% | 661 | 115 | 85.18% |
| Sagarin (ELO-Chess) | 213 | 66 | 76.34% | 660 | 116 | 85.05% |
| Massey (BCS) | 210 | 68 | 75.54% | 650 | 117 | 84.75% |
| Sagarin (Predictor) | 195 | 84 | 69.89% | 637 | 139 | 82.09% |
| Massey (MOV) | 212 | 66 | 76.26% | 646 | 121 | 84.22% |
| Vegas, Straight Up | 155 | 121 | 56.16% | 588 | 185 | 76.07% |
| Favorites Covered: | 124 | 147 | 45.76% | 372 | 383 | 49.27% |
Table 3 - How the final regular season computer ratings do at “postdicting” all regular season games.
Three points to make here. First, only about 1 in 6 games can be considered “upsets” after the fact; that is, these games deviate from the “fit (computer rating) to the data (games played)”. Second, as the BCS rankings only account for wins and losses (and only wins and losses are used to rate the ratings in this table), the Sagarin Predictor algorithm actually fares worse than its politically correct counterpart (interestingly, Massey’s doesn’t). Third, the 6 BCS computers are within 1% of one another: only 5 games’ difference top-to-bottom out of almost 700 games! Only 146 games are postdicted incorrectly by any of the computers. (Note: last year ~81% of games (after bowls, using final computer ratings) were correctly postdicted.)
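Counting games missed "by any of the computers" is a set union across systems, while the games missed by every system are the intersection (arguably the true upsets). A small sketch, assuming each system is a team-to-rating mapping and each game is a (winner, loser) pair; all names and numbers below are invented.

```python
def missed_games(ratings, games):
    """Return the indices of games where the lower-rated team won."""
    return {
        idx
        for idx, (winner, loser) in enumerate(games)
        if ratings[winner] < ratings[loser]
    }


def missed_by_any(all_ratings, games):
    """Union of misses: games at least one system gets wrong."""
    union = set()
    for ratings in all_ratings.values():
        union |= missed_games(ratings, games)
    return union


def missed_by_all(all_ratings, games):
    """Intersection of misses: games every system gets wrong."""
    miss_sets = [missed_games(r, games) for r in all_ratings.values()]
    return set.intersection(*miss_sets) if miss_sets else set()


# Hypothetical three-team round robin rated by two made-up systems.
games = [("A", "B"), ("B", "C"), ("C", "A")]
systems = {
    "System X": {"A": 3, "B": 2, "C": 1},
    "System Y": {"A": 2, "B": 3, "C": 1},
}
print(sorted(missed_by_any(systems, games)))
print(sorted(missed_by_all(systems, games)))
```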
Building on the third point, the similarity of fit in the six algorithms suggests that none of them produce bizarre or unreasonable results. But this also raises a concern: is there a lack of diversity in their methods and weighting systems*? Perhaps we should be glad there is no digital Craig James (for the record, Boise State is no worse than 12 in the computer polls) but this gets back to the original question: how should a computer measure a team? (*Relevant evidence that there is some diversity among the computers: median gap between high and low rank for teams is 12—Tennessee and Texas Tech both vary by 32 positions!)
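The "gap between high and low rank" figure is simple to compute once every system's rankings are in hand. A minimal sketch, assuming each system is a dict of team to rank (1 is best) and only teams ranked by every system are compared; the three systems and their ranks here are hypothetical.

```python
from statistics import median


def rank_gaps(rankings_by_system):
    """Map each team to (worst rank - best rank) across all systems."""
    systems = list(rankings_by_system.values())
    # Only compare teams that every system has ranked.
    teams = set.intersection(*(set(r) for r in systems))
    return {
        team: max(r[team] for r in systems) - min(r[team] for r in systems)
        for team in teams
    }


# Made-up rankings from three imaginary systems.
rankings = {
    "System X": {"A": 1, "B": 2, "C": 3},
    "System Y": {"A": 1, "B": 3, "C": 2},
    "System Z": {"A": 2, "B": 1, "C": 3},
}
gaps = rank_gaps(rankings)
print(gaps)
print("median gap:", median(gaps.values()))
```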
As I said, I don’t think there is necessarily a right answer. I reckon most human voters reach some happy equilibrium of weighing wins accrued, losses, and the apparent ability to beat certain teams they haven’t faced. On an individual level this is highly subjective, yet we’re generally comfortable with the consensus result. When it comes to the computer ratings, we could, if we desired, actually achieve a transparent and consistent weighting scheme; yet as it stands now, only the Colley rankings are completely transparent and reproducible.
Seeing how the computer polls predict and react to game outcomes has definitely helped me put numbers on some features that characterize the college football season. I never knew, for instance, that about 15-20% of games are “unexplainable” when measuring the season as a whole, or that a game with a point spread of a touchdown or less isn’t much more predictable than a coin flip. But strangely enough I come back to the same questions I ask about human voting philosophies.
I’d love to hear the readers’ thoughts in the comments section on a couple different points:
-If you were an AP voter, how would you weigh variables like wins/losses/talent when rating a team?
-Do you think the diversity of voting philosophies is good or should there be uniformity?
-How would you design an array of computer rating systems?
Before closing, a couple things I came across when crunching some of these numbers. First, a table of who the computers have ranked higher in this season’s bowl matchups versus the Vegas lines.
Second, the plot below shows how well the Sagarin rankings do at pre- and post-dicting the entire season’s worth of games. As expected the later ratings are consistent with more of the season’s games, but remarkably, Sagarin’s rankings using only the first week’s results and his preseason weightings (not sure how he rates teams in the preseason) can be used to predict 70% of the games for the entire 2011 season (or over two-thirds of the games from week 2 on). (Note: by October 16th the preseason component is removed from the ratings.)
Unfortunately I don’t have the requisite data to do all these comparisons for past seasons, but I’m happy to run the numbers on anything you’d like to see if possible; this site is also a great resource for similar metrics. There are many mathematical studies of rating systems, e.g. here, for those interested in further reading.

Figure 1 - Plot of how well Sagarin's various ratings predict (and/or post-dict) the entire regular season based on which week the ratings were released.
Comments
The plethora of numbers and graphs required for you to make your point here is evidence enough as to the lunacy of the BCS.
by Zzzizzzy on Dec 14, 2011 4:28 PM CST
Heh, yeah I included all the numbers to be thorough, but the interesting part is that all the BCS computers behave quite similarly.
To sum up, the BCS computer rankings:
-Successfully predict ~70% of the next week’s games…
-…but barely over 50% of games where the spread is a touchdown or less
-“Post-dict” (or “retrodict”) ~84% of the entire season, meaning ~15% of games “don’t fit the pattern” of the full season
-Also, the season as a whole is quite predictable; after just a week, Sagarin’s rankings can predict 67% of the remaining games. Unsurprising given the large number of on-paper mismatches, but I never would have guessed such a high figure.
So the CliffsNotes numbers are 70%, 52%, 84%, 67%.
by Tom Brennan on Dec 14, 2011 4:41 PM CST
Imagine a team plays a Top 10 schedule, beats 7 teams with winning records and wins its conference. Then that same team is 3rd in the BCS and misses the MNC game to a team with a 40ish ranked SOS, only beats 3 winning teams and fails to even win its division. Now, imagine that team is Alabama.
by Eskimohorn on Dec 14, 2011 4:47 PM CST
I think the interesting number for me is the ~15% retrodict (love it!) outlier. That, to me, lends support to the idea that comparing losses should not be the metric for an AP voter to use when deciding whether to put Alabama over Okie State.
by noone on Dec 14, 2011 5:13 PM CST
Radical.
O wait. You said “imagine.”
I’m trying. I can’t. Truth IS apparently stranger than fiction.
by lurkerinthedark on Dec 14, 2011 5:14 PM CST
Ratings won’t be appreciably larger for this BCSCG than they were for the first LSU-Bama game. I am willing to bet they will be lower.
That will push us down the road toward plus-one.
by Bob in Houston on Dec 14, 2011 5:19 PM CST
The thing I like about the computers is that they don’t tell us one year that a team that didn’t win its conference shouldn’t play for the MNC, and then do a 180 3 years later. They recognize that a home loss to ole miss is worse than a road loss to tech, and can actually remember it a few games later.
by Horncasting on Dec 14, 2011 5:29 PM CST
Vegas isn’t picking games and does not measure itself by the outcomes you are using. They are strictly market makers and their only objective is to have a number that encourages, as close as they can possibly get, the betting public to bet half the action on either side. They measure by total “handle”, then do what tinkering they can (adjusting lines, laying off, etc) to balance the action.
If they do this, they know statistically what they can expect to earn as a percentage of the handle, less bad debts.
If you want to say they are predicting something, then it is the number that splits the action and not the outcome of the game. If there is any credit to be given to how close the betting lines are to the final score of the contest it is attributable to the wisdom of crowds more than anything else.
by two521 on Dec 14, 2011 6:08 PM CST
I actually disagree with that thought process. Vegas is the stock market for sports. Its final line is where buyers and sellers are equal, just like a stock. All the information is in those lines. Both buyers and sellers are acting in their best interest; Vegas itself does not predict, but its “investors” do.
by codaxx on Dec 14, 2011 7:45 PM CST
Re: Vegas
It’s a good point that “Vegas” itself doesn’t pick the lines but I think it should be a pretty efficient market. I don’t know what the breakdown is in sports betting between recreational and serious bettors, but I imagine it’s the more expert analyses (through combinations of algorithms and human assessment) that are driving the point spreads, otherwise someone would be cleaning up.
Basically the net out is that the point spread should be ~the median of the expert bettors’ predictions.
by Tom Brennan on Dec 14, 2011 7:58 PM CST
Awesome Zoolander references! I love that movie! (I feel like I’m taking crazy pills!) Anyway, I can’t help but think that we would never have gotten computer ratings included in the original system in ’98 if the AP had voted for Nebraska in ’97 instead of insisting that if “number 1” doesn’t lose, then they can’t move down.
by Hoju on Dec 14, 2011 8:51 PM CST
“Heh, yeah I included all the numbers to be thorough, but the interesting part is that all the BCS computers behave quite similarly.”
Not throwing stones here. Your analysis is peach. Just the monstrous amount of calculations required to justify keeping a crony bowl system is laughable.
by Zzzizzzy on Dec 14, 2011 8:55 PM CST
Speaking of predictions, who would win between OKSt and Alabama on a neutral field? What percent of the time? Yeah, it’s irrelevant, but I want to know, and I’m not smart enough to do the math.
by Fried Rice on Dec 14, 2011 11:08 PM CST
I think the Olympics should start using the BCS for the medal rounds.
by Horns Up on Dec 15, 2011 3:52 AM CST
Re: A Bama-OSU prediction.
Sagarin’s Predictor has Bama at 2 and OSU at 3, as does Massey’s margin-of-victory inclusive algorithm. I’m not sure how sophisticated either of these is in terms of analyzing relevant offensive and defensive stats versus just the final score margin. Massey’s, at least, appears to include some sort of sophisticated analysis of O & D.
BC’s own Huckleberry is probably the man to ask on how these more nuanced predictions are generated (i.e. those factoring yards-per-play and the like). Other sites like www.thepowerrank.com, which rates O & D separately, also have Bama ahead of OSU.
Unrelated, but on the Colley Matrix “football god” tool, switching the outcome of the OU-Baylor game results in OU jumping up from 6 to 3, comfortably ahead of Bama.
by Tom Brennan on Dec 15, 2011 5:01 AM CST
Now my head hurts. But not as badly as when I think about another Alabama/LSU game.
by JoeT63 on Dec 15, 2011 9:45 AM CST
I think you kind of touch on the fundamental problem with the BCS, or even more generally, with any system which determines championships off the field. Computers are really good at measuring things, people less so, but at the end of the day someone has to define what exactly is being measured.
And that’s exactly what we do not have.
Are we supposed to be determining who the best/most talented team is? Or who has had the better year – and whether that should be determined by who’s got the best wins or who’s got the worst loss? Or who would win (or at least be favored) if they actually did play the games? Or some arcane combination of some or all of the above?
There truly are no established criteria, and so you have computer programs designed according to someone’s notion of what we should be seeking to measure, along with actual human voters who establish their own criteria (and often change those criteria on a year-by-year, or even week-by-week basis) and rank accordingly. And with the human vote, that’s not even taking into account whether they’ve even input enough data to accurately determine whether their perceptions of which teams “win” based on their criteria are remotely accurate…
In the end, that’s the main reason a playoff is so desirable. If you set the field large enough to actually include all the teams with legitimate claims to a championship, it all comes down to actual results. There is no “best” team, there’s just a champion.
by The Bobs on Dec 15, 2011 10:26 AM CST
Tom, the link where you have the computer picks for the bowls, are they versus the spread or straight up? I could see the Vegas lines and the Sagarin predictor had the lines in them, but wasn’t so sure about the other choices. Nice writeup.
by Kilgore Trout on Dec 15, 2011 12:52 PM CST
“(*Relevant evidence that there is some diversity among the computers: median gap between high and low rank for teams is 12—Tennessee and Texas Tech both vary by 32 positions!)”
While this does prove diversity among the computer formulas, it also emphasizes that we force stack rank a large chunk of teams that are largely equal in quality. Are Tennessee and Texas Tech actually 32 teams different? They’re likely nearly the same with the other 30 teams in that span. Rank gets forced from statistically insignificant numbers.
by BurntOrangeBen on Dec 15, 2011 1:00 PM CST
For the bowl “picks” I just used the higher ranking to select the teams, straight up. Only Sagarin’s Predictor has an easily extractable point margin for comparing teams, so I included that as well in its column.
Re: the variance of teams in the middle of the rankings, yeah, I imagine that is strongly dependent upon how strongly the computer weighs a team’s conference and, especially with Texas Tech, how it evaluates flukish but impressive wins. I’ve often thought about making plots of normalized ratings to see roughly where “the middle” begins; that would also be a good way to compare the dominance of the top teams across different seasons.
by Tom Brennan on Dec 15, 2011 1:31 PM CST
It seems important to note that the only system included that is intended as an effort to predict future matchups is Sagarin’s Predictor rating. Every other system is either entirely “accomplishment-based” (e.g., Wolfe and Sagarin ELO-Chess), or includes a Bayesian correction reflecting accomplishments to date (e.g., Massey MOV and Sagarin overall rating). Accomplishment in this case merely meaning the W/L results of each team against the schedule faced.
Therefore, judging the ratings sets based on a preconceived notion that they should be able to “assess the ability of teams” is possibly invalid because the systems are not designed to assess the ability of teams, but rather to rank the teams based on their accomplishments.
Because I’m interested, I may run this study using my system for this season later today.
by Huckleberry on Dec 16, 2011 7:44 AM CST
I used the Sagarin Predictor for most of my bowl picks, so I’ll see how it ends up performing in my office pool.
by KilgoreTrout on Dec 16, 2011 9:41 AM CST
Is there going to be a stat-off between Huck and the newcomer, Tom?
by ut-06 on Dec 16, 2011 9:55 AM CST
Re: Huckleberry, yeah, that’s why I was surprised how the BCS computers actually do a decent prediction job, despite not being designed to do so. Do you know if most prediction-geared algorithms (like your own) give a single rating output wherein any prediction is determined by looking at this single rating (e.g. as in the Sagarin ratings) or do they generate predictions for individual matchups using offensive & defensive metrics?
It’s interesting how much more hamstrung the computers are in evaluating teams relative to voters. Voters can consider who they think would win in a matchup whereas computers can only measure past wins & losses (without margin of victory, a prediction element is essentially impossible, although as I show, interpreting them as predictions works out okay) and are further limited in their post-game/body of work analysis by not looking at the score. For this reason I really think the BCS should hire some stats guys to design an array of computer ratings that weight all these things to different extents. Would be highly transparent, and weightings could be agreed upon beforehand. Would take out biases created by hype and bowing to Lord Saban’s every demand.
by Tom Brennan on Dec 16, 2011 12:57 PM CST
I agree on your end goal but I also understand it would never happen.
My power rating is a predictive rating when combined with standard homefield advantage. I also, of course, have stat-based predictive ratings. My most fervent agreement, though, is that the models used by the BCS should be transparent. It would be fairly easy for the BCS to get a group of guys, obviously including me, to publish a set of transparent ratings that all hit on the various options in publishing a computer rating set under the no-MOV constraint. I addressed that a little in this post from 2008.
by Huckleberry on Dec 16, 2011 1:14 PM CST
“ut-06 said: December 16th, 2011 at 8:55 am
Is there going to be a stat-off between Huck and the newcomer, Tom?”
Possibly a more exciting matchup than the Sugar Bowl rematch, IMO.
by Jake Lonergan on Dec 16, 2011 2:21 PM CST
TheBobs is correct: the fundamental problem with the football polling construct is that there’s no agreement on what is being measured. At least in the basketball tournament selection process, the selection committee says, ex ante, what factors are good (say, wins on road or at neutral sites), bad (low SOS), and not material (say, not winning conference). On the one hand, I find it ridiculous when football poll voters explain they’ve raised their ranking of Team X because player Y has become healthy again; on the other hand, who am I to say that the absolute quality of the roster, rather than on-field-results, shouldn’t be considered?
I will make the extraordinarily iconoclastic, never before heard in this forum point that the BCS regime, whether using human polls or computers, is fatally flawed. As long as you have some small, but greater-than-two, number of teams that have comparable credentials and haven’t played each other head-to-head, you will end up with credible teams shut out of the opportunity to play for a title. Move to a “+1” format and make the discussion a less high stakes one about “who’s #5 and on the outside looking in” rather than “who’s #3 or #4?” Move to an 8-team playoff and the controversy diminishes further. Move to 16-teams and #17 might whine, but you won’t have the whole system called into question. You’ll also have bowl games that have real consequence.
by Major Major on Dec 16, 2011 5:05 PM CST
No real championship contestants should ever be chosen by voters or computer rankings. Only the outcome of real games within the confines of a real playoff system, where no participant is awarded a spot in the playoffs through any type of ranking or vote, should determine the champion. Until that happens, all Div 1 champions in college football are mythical and unearned.
by prehist51 on Dec 18, 2011 3:35 PM CST
