Which Statistics Correlate Best to Winning?
The Texas Staff Takes the Plunge into Statistical Analysis
Will Muschamp famously declared that stats are for losers. Greg Davis? It turns out that he wants to know which stats are for winners. It has come to Barking Carnival's attention that the Texas staff, and Davis in particular, ran some numbers to determine which stats correlate best to wins. My excitement that the Texas staff was willing to use statistical research in their efforts to continuously improve was quickly tempered by the study's content.
It's not that their method was completely without merit. It's just that it leaves a lot to be desired from a statistical analysis standpoint. And if it was indeed Davis himself who ran the numbers, it's completely understandable that his time would be better spent on the ins and outs of the offense than on improving his analysis technique. Based on the information HenryJames mailed to me on what appears to be a Rainbow Cattle Company cocktail napkin, the study was performed by taking the season-ending Top 10 in various statistical categories and summing up that group's wins and losses for the year. That gave each statistic a Top 10 Winning Percentage. The study was performed for the 2008 season alone as well as for the nine-year period beginning in 2000 and ending last year. The results:
Nine-Year Period -
| Rank | Statistic | W-L | WPCT |
|---|---|---|---|
| 1 | Scoring Offense | 927-226 | 0.804 |
| 2 | Scoring Defense | 913-227 | 0.801 |
| 3 | Rush Defense | 847-263 | 0.763 |
| 4 | Total Defense | 843-283 | 0.749 |
| 5 | Turnover Margin | 827-291 | 0.740 |
| 6 | Total Offense | 823-301 | 0.732 |
| 7 | Pass Offense | 731-389 | 0.653 |
| 8 | Rush Offense | 712-400 | 0.640 |
| 9 | Pass Defense | 723-412 | 0.637 |
2008 Only -
| Rank | Statistic | W-L | WPCT |
|---|---|---|---|
| 1 | Scoring Defense | 105-27 | 0.795 |
| 2 | Scoring Offense | 106-28 | 0.791 |
| 3 | Rush Defense | 100-32 | 0.758 |
| 4 | Total Offense | 100-33 | 0.752 |
| 5 | Total Defense | 99-34 | 0.744 |
| 6 | Pass Offense | 95-37 | 0.720 |
| 7 | Turnover Margin | 91-42 | 0.684 |
| 8 | Rush Offense | 84-46 | 0.646 |
| 9 | Pass Defense | 67-59 | 0.532 |
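The Top 10 Winning Percentage arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function name `top10_win_pct` is made up here, and the inputs are the summed records copied from the tables.

```python
def top10_win_pct(wins, losses):
    """Winning percentage of a group's combined record, rounded to three places."""
    return round(wins / (wins + losses), 3)

# Combined Top 10 records copied from the tables above.
print(top10_win_pct(927, 226))  # nine-year Scoring Offense -> 0.804
print(top10_win_pct(67, 59))    # 2008 Pass Defense -> 0.532
```

Note that nothing in this calculation looks at how good each team's statistic actually was; it only asks whether a team made the Top 10.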
Some issues with the study:
- I haven't checked, but I doubt that they pulled out games between teams that were both in the Top 10 of each stat. Such games would artificially lower the winning percentage by bringing it toward .500 because of the 1-1 mark in the game.
- Using per game totals for the yardage stats. The biggest effect of this can be seen by the relative placement of rush defense and pass defense in each list. Teams that are trying to come from behind are more likely to throw the ball. Using total yardage numbers here instead of efficiency numbers probably skews the study heavily.
- Using raw stats instead of adjusted stats. Can't complain too much here simply because adjusted stats weren't available to the Texas staff. They simply used what the NCAA publishes, which leads to all kinds of problems. The NCAA year-end stats include games against FCS competition and they clearly don't account for the schedule each team faced.
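The first issue in the list is easy to see with a little arithmetic. The sketch below starts from the 2008 Scoring Defense record in the table (105-27) and removes games played between two Top 10 teams. The count of five such games is a made-up number for illustration only; the real count is not known from the study.

```python
def win_pct(wins, losses):
    return wins / (wins + losses)

def remove_head_to_head(wins, losses, games):
    """Each game between two teams in the group adds one win and one loss."""
    return wins - games, losses - games

wins, losses = 105, 27   # 2008 Scoring Defense Top 10, from the table
h2h_games = 5            # hypothetical count, for illustration only

adj_wins, adj_losses = remove_head_to_head(wins, losses, h2h_games)
print(f"as published:         {win_pct(wins, losses):.3f}")         # 0.795
print(f"head-to-head removed: {win_pct(adj_wins, adj_losses):.3f}")  # 0.820
```

Because the number of head-to-head games differs from one stat's Top 10 to the next, the size of this drag toward .500 is not the same for every row of the table.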
So What the Hell Are You Going To Do About It?
The first thing we need to do is identify a better method. A linear regression analysis would be nice to do, as Trojan Football Analysis did here using raw totals. However, Sailor Ripley is way too cheap to spring $185 to get me a copy of the analysis software TFA used, so we're stuck with basic Excel capabilities. Good for us, then, that we can calculate statistical correlation fairly easily using season long data and simple functions. The correlation coefficient will serve our purposes quite well. Linear regression would be better able to tell us the exact form of the relationship between each stat and its dependent result, but correlation will tell us everything we need to know about the strength of that relationship. For those unfamiliar with correlation coefficients, the values range from -1 to 1 where -1 is a perfect negative correlation, 0 represents uncorrelated variables, and 1 is a perfect positive correlation.

The finite population equation was used for the analyses.
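For readers who want to see the calculation outside of Excel, here is a minimal Python sketch. It assumes that "finite population equation" means the population form of the Pearson correlation coefficient (dividing by n rather than n - 1). The team numbers are invented for illustration and are not taken from the stats database.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Population-form Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and standard deviations (divide by n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# Hypothetical teams: points per game and winning percentage (made-up numbers).
points_per_game = [38.2, 31.5, 27.0, 24.1, 19.8]
win_pct = [0.923, 0.769, 0.615, 0.500, 0.250]
print(round(pearson_r(points_per_game, win_pct), 3))
```

The divisor cancels between the numerator and denominator, so the sample form (n - 1) gives the same coefficient.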
The second step, of course, is to find adjusted stats. It occurred to me during the past college basketball season that Ken Pomeroy's site for that sport could really use an equivalent for college football. Dozens of times on this blog and various message boards I have argued with other college football fans about using raw stat totals without context. Usually I would then go out and calculate the adjusted stat for that single use. It was the last time I did this that I realized I was going to have to publish an adjusted stats database myself that could be used in these situations. Go ahead, take a week or two to digest the information at that link. There's a lot there. The numbers include only games against FBS opponents.
Okay, I'm with You on the Method, but Prove the Adjusted Stats to Me
Hopefully the superiority of correlation analysis compared to a winning percentage summation needs no explanation. But I admit that the adjusted stats argument listed above may stick out as an unwarranted assumption to some people. This is especially true if we are comparing a team's stat to its own W/L record, since a weak schedule that allows a high ranking in a statistic should also allow a better record overall. Quite true, and as we'll see in a minute, the use of adjusted stats becomes more important as you drill down further from the end goal - the wins and losses.
So the first thing I set out to do was to compare the correlation coefficients using raw stats with the coefficients using adjusted stats. In addition to the adjusted stats available above, I also adjusted each team's winning percentage using the same method used for the stats. Here is a single table showing both the raw and the adjusted coefficients for point statistics. They are sorted by the absolute value of the coefficient; keep in mind that a higher defensive statistic total, such as more points allowed, leads to fewer wins, so the negative coefficient is correct. The stats that don't say adjusted are correlated to the unadjusted winning percentage, while the adjusted stats are correlated to the adjusted percentage.
| Rank | Statistic | Coefficient |
|---|---|---|
| 1 | Adjusted Point Margin per Game | .935 |
| 2 | Point Margin per Game | .928 |
| 3 | Adjusted Scoring Offense | .816 |
| 4 | Adjusted Scoring Defense | -.811 |
| 5 | Scoring Offense | .755 |
| 6 | Scoring Defense | -.711 |
Feeling confident that the point was proven, I have not run any further raw stat coefficients. We can see above that the obviously most important statistic, point margin, shows up where it should at the top of the list. Point margin was included only as a sanity check on the calculations because its position on the list has to be at the top. Also, each adjusted stat outpaces its raw counterpart, and the gap at the secondary stat level (.061 for scoring offense and .100 for scoring defense) is very large compared to the gap at the first level (.007 for point margin). I will use only adjusted stats moving forward. Some other quick thoughts on this table are that winning is 7% luck (heart or clutch for you fairy tale lovers) - how's that for drawing an overly specific conclusion from a statistic? - while offense and defense are pretty much just as important as each other. Using the adjusted numbers, the difference between the .816 and -.811 coefficients is extremely small. Now that the method and statistical basis are both settled, it's time to move forward and re-run the numbers the Texas staff ran.
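A quick check on the tongue-in-cheek 7% figure: it appears to come from subtracting the .935 adjusted point margin coefficient from one, though that reading is an assumption. The usual measure of how much of the variation in one variable is accounted for by another is the square of the coefficient. Both are computed below from the .935 value alone.

```python
r = 0.935  # adjusted point margin per game vs. adjusted winning percentage

one_minus_r = 1 - r          # the back-of-the-napkin "luck" share
r_squared = r ** 2           # share of variance in winning percentage explained
unexplained = 1 - r_squared  # share left unexplained

print(f"1 - r   = {one_minus_r:.3f}")   # 0.065
print(f"r^2     = {r_squared:.3f}")     # 0.874
print(f"1 - r^2 = {unexplained:.3f}")   # 0.126
```

Either way, the overwhelming majority of the variation in winning percentage tracks point margin, which is all the sanity check needed to show.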
The Texas Study Revised
Here is the revised study for 2008 followed by some more comments:
2008 only -
| Rank | Statistic | Coefficient |
|---|---|---|
| 1 | Scoring Offense | .816 |
| 2 | Scoring Defense | -.811 |
| 3 | Rush Defense | -.767 |
| 4 | Total Defense | -.694 |
| 5 | Total Offense | .610 |
| 6 | Rush Offense | .485 |
| 7 | Turnover Margin | .425 |
| 8 | Pass Offense | .238 |
| 9 | Pass Defense | -.134 |
- No, I don't have adjusted statistical data for 2000-2007. There is no automated way that I know of to get box scores for all games in previous years. I estimate roughly 12 hours of data entry just for 2008. I now have a single file with every 2008 box score and will be compiling one as 2009 progresses.
- The numbers used were adjusted per game stats as the Texas staff used per game numbers.
- With the above comment in mind, it's important to remember that correlation numbers do not tell us which variable is the cause and which is the effect. The most immediately obvious stat issue above is that rush defense, in adjusted yards per game, ranks so highly compared to pass defense. But as stated above, we know that teams that are winning run the ball and teams that are losing throw the ball. So might the correlation number above be because winning creates better rush defense numbers rather than vice versa? And the same idea holds for the low pass defense ranking.
To deal with that last thought, I ran the coefficients using my adjusted per play stats instead of the per game figures. Each stat was replaced in this manner and the result is shown here:
| Rank | Statistic | Coefficient |
|---|---|---|
| 1 | Points Scored per Play | .833 |
| 2 | Points Allowed per Play | -.803 |
| 3 | Yards Allowed per Play | -.696 |
| 4 | Total Passing Yards Allowed per Attempt | -.681 |
| 5 | Total Passing Yards per Attempt | .663 |
| 6 | Yards Gained per Play | .649 |
| 7 | Total Rushing Yards Allowed per Carry | -.619 |
| 8 | Turnover Margin per Play | .484 |
| 9 | Total Rushing Yards per Carry | .434 |
- Now the passing and rushing phases of defense are much closer in correlation values with passing defense being slightly more correlated to winning percentage.
- The "total" yards in the two categories simply means that sacks and sack yardage were moved from rushing to passing numbers.
- Being able to run the ball effectively is the least strongly correlated statistic in the study. However, a .434 value shows that it is still significantly important to winning.
What Next?
The adjusted stats database makes possible more correlation studies than anyone could reasonably tackle, so it's important to determine a starting point. Whereas the Texas Study chose to correlate substats directly to wins, I personally believe that scoring offense, defense, and margin are the stats one should correlate to wins. I think that stats should be considered building blocks and that a team's scoring stats are the level below winning and losing (yes, it's true that scoring offense and defense are actually substats of scoring margin, but you get the point). Contributing to those stats are substats such as yardage and efficiencies. While it's true that there is interplay in football between the offense and defense, as opposed to baseball, I think time is better spent analyzing how yards per carry correlates with offensive scoring instead of directly to wins. I will probably determine some secondary correlation figures, such as the coefficient between defensive yards per play allowed and offensive scoring, to isolate the interplay mentioned above, but more time will be devoted to offense and defense in isolation from each other. Another avenue I have already decided to go down is to determine a better passing efficiency formula than the one currently used by the NCAA. Using relative coefficients, we should be able to modify the formula to better account for each statistic's correlation with offensive scoring.
Thoughts and discussion are appreciated on either the correlation coefficient discussion or the stats pages. Also, any corrections on the stat pages would be greatly appreciated. My data entry consisted of 78,870 fields. Not all of those were manually populated, but there are probably some errors. There are still some tweaks I'd like to make to the pages. For example, the total passing and total rushing numbers are only included in the tables for all teams and are not on each team's page. I should add those to the team pages in the near future.
Comments
You can probably find a pirated version of R somewhere, but Excel is cheap and available. Write a macro.
Thanks to Huckleberry for giving us stat-geeks some fat to chew on.
by CurrentLonghornStudent on Aug 14, 2009 10:12 PM CDT
Heck, maybe I should volunteer my time to Huck and help out.
by CurrentLonghornStudent on Aug 14, 2009 10:12 PM CDT
R is free, and widely used in statistics departments and analytical companies.
by mm on Aug 14, 2009 10:18 PM CDT
Well, being in macro-writing mode to run the stat adjustment calculations caused me to write a macro to calculate the correlation coefficients. Of course, Excel comes with the CORREL function that does exactly that. Whoops. That was a wasted 5 minutes.
Also, the Analysis ToolPak has a regression capability, too. Just noticed that. Correlation works for now, though.
by Huckleberry on Aug 14, 2009 10:21 PM CDT
What the hell do those numbers tell Greg Davis, or US?!
This is when statistics can be stifling.
by scagnetti on Aug 14, 2009 10:27 PM CDT
Which numbers? The correlation values or the adjusted stats?
by Huckleberry on Aug 14, 2009 10:29 PM CDT
I will assume you’re talking about the correlation values.
They tell us lots of things even though these are very vague and not nearly the relationships I would look at. But if the Texas staff is paying attention to their own study then there is some danger attached to it.
For example, look at where passing defense ranks in their study. Look at where it is in my game averages study. Dead last in both.
Now look where it is when you use both the correlation analysis and the per play numbers instead of per game (remember that the per game numbers are affected by the ratio of run plays to pass plays called when teams are winning versus when they’re losing). It leaps all the way to 4th and is the highest rated unit stat.
So if the Texas staff were to use their study, even just a little bit, to prioritize practice time, etc., then they would be shooting themselves in the foot. If pass defense is relegated to lesser importance than rush defense, as their study suggests should be done, then they will be doing the wrong thing.
by Huckleberry on Aug 14, 2009 10:36 PM CDT
It would be interesting to see the most significant winning statistics between similarly ranked teams, like between UT / OU or tOSU / Penn last year, and see if there’s any difference between season long stats and tough contests.
Also if there are any stats that correlate to upsets.
by jw on Aug 14, 2009 10:51 PM CDT
The correlation of scoring to wins is definitional, i.e., winning means outscoring. In other words, it doesn’t help a lot to say, if you want to win a lot of games, just outscore each opponent.
Your proposed secondary correlations are where you can add insight. Even better would be analysis of scoring potential for various offensive plays and defensive sets and analysis of sustained versus unsustained drives by play composition. Then coaches would have something to work with.
by OldTimeHorn on Aug 14, 2009 11:02 PM CDT
I always use jmp for my statistical analysis but since work pays for it price is no object. I agree secondary correlations are where it’s at:
1. Where does going for it on 4th down make sense (Distance and position on the field).
2. # of plays in scoring drives.
3. Scoring vs starting field position.
by KilgoreTrout on Aug 15, 2009 12:24 AM CDT
“This is a very simple game. You throw the ball, you catch the ball, you hit the ball. Sometimes you win, sometimes you lose, sometimes it rains. Think about that for a while.”
by Ebby Calvin LaLoosh on Aug 15, 2009 1:18 AM CDT
MATLAB/Octave is a good substitute – any kind of statistical function you’d want to model is available on the MATLAB file exchange.
by pleaseplaykindle on Aug 15, 2009 1:39 AM CDT
If the Texas staff did this, say, 5 years ago, I’d have been really worried that it was a way to get fans and other critics to STFU. As in, “Look, we did what the stats told us to do. Don’t blame us, blame the math!”
Thankfully, a MNC and some wins over the Sooners have dampened Mack’s more petulant tendencies.
Still, I share Huck’s sense of unease about this, since there’s a real danger that they’ll draw incorrect conclusions due to half-assed (if completely well-meaning) modeling.
I hope Boom retains his skepticism about the efficacy of stats. They’re great when you want to develop, say, an ERP; less so when trying to figure out how best to herd a bunch of irrational young men into a coherent football team.
by CrazyJoeDavola on Aug 15, 2009 1:46 AM CDT
So scoring is what best correlates to winning. Who in the hell could have guessed that?
The problem with the study is that there’s no useful information gleaned from it. Of course scoring is what correlates best to winning because that is the only statistic that is used to decide the outcome of a game.
If you are to try to use this data to improve your team, you’d try to increase your scoring and decrease the opponent’s (again, no shit). Unfortunately, there’s no actionable strategy to achieve that goal in the absence of more data. What correlates best to scoring? Let’s say it is passing offense, for argument’s sake. What correlates best to passing offense? There are several layers of analysis that need to occur to get to data sufficiently granular to model an actionable strategy.
by CS on Aug 15, 2009 10:46 AM CDT
Uh, yeah. And that’s where it’s headed.
You have a firm grasp of the obvious, though, so that’s good. Scoring was included for two reasons. The first is to make sure the analysis was being done properly because we know that scoring should be at the top. That was mentioned above. The second was because the Texas staff actually included it in their study.
by Huckleberry on Aug 15, 2009 1:06 PM CDT
Points per play. Interesting.
Realise also that a correlation and an ordinary least squares (OLS) regression are the same thing. Once you have the correlation and the mean of each variable you can calculate the linear regression slope & intercept with basic algebra.
The other thing that would be interesting would be to look at the correlations among the predictors. I imagine that you might have some suppressor effects that would actually improve the prediction of defensive stats if you control for the fact that OU and Texas aren’t terribly concerned about giving up points when they are up by 30.
by Pancho Claus on Aug 15, 2009 4:06 PM CDT
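The algebra mentioned in the comment above can be sketched as follows for a simple one-predictor regression. Besides the correlation and the two means, the standard deviation of each variable is also needed: the slope is r times the ratio of the standard deviations, and the intercept follows from the means. The function name and the data are made up for illustration.

```python
from math import sqrt

def ols_from_correlation(xs, ys):
    """Return (slope, intercept, r) for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    r = cov / (sd_x * sd_y)
    slope = r * sd_y / sd_x              # slope = r * (sd of y / sd of x)
    intercept = mean_y - slope * mean_x  # the fitted line passes through the means
    return slope, intercept, r

# Made-up data for illustration only.
slope, intercept, r = ols_from_correlation([1, 2, 3, 4], [2, 4, 5, 8])
print(round(slope, 3), round(r, 3))
```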
Nice stuff, Huck…reminds me of myself sometimes.
Yes, that’s a huge concern of mine as well: that the relative novice comes up with the wrong, overly simplistic conclusions and bases his/their strategy on that.
Some think you can overdo the stats stuff, but you must have more than a surface understanding to get real value from them.
by SlickStreet on Aug 15, 2009 5:59 PM CDT
Best chance of winning—play a weaker opponent.
I’m a fool, R is free. It’s been six years since I dealt with it…
by CurrentLonghornStudent on Aug 15, 2009 6:47 PM CDT
I did a similar analysis in grad school and came up with
- turnover margin
- rush defense
- rush offense
by Crazy Joe Clark on Aug 15, 2009 7:22 PM CDT
Points per play. Interesting.
I’m working on per possession analysis now. Given that possessions aren’t tracked in the box scores, I’m simply going to use
- Possessions = Rushing TDs + Passing TDs + Interceptions Thrown + Fumbles Lost + Punts + (Fourth Down Attempts – Fourth Down Conversions)
The only issue I can see there is that fumbles lost on returns will inflate the possession totals. A good thing about that calculation is that possessions that end because of the half ending will not be included.
Anyone have any improvements on that possessions estimate?
by Huckleberry on Aug 16, 2009 12:52 AM CDT
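The possessions estimate in the comment above translates directly into code. This is a minimal sketch: the function and parameter names are invented here, and the box score line is hypothetical.

```python
def estimate_possessions(rush_td, pass_td, interceptions, fumbles_lost,
                         punts, fourth_down_attempts, fourth_down_conversions):
    """Possessions estimate built only from standard box score totals."""
    failed_fourth_downs = fourth_down_attempts - fourth_down_conversions
    return (rush_td + pass_td + interceptions + fumbles_lost
            + punts + failed_fourth_downs)

# Hypothetical single-game box score line, for illustration only.
possessions = estimate_possessions(rush_td=2, pass_td=3, interceptions=1,
                                   fumbles_lost=1, punts=4,
                                   fourth_down_attempts=2,
                                   fourth_down_conversions=1)
print(possessions)  # 12
```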
Some other quick thoughts on this table are that winning is 7% luck (heart or clutch for you fairy tale lovers)
This is sort of what I was thinking while reading it. Wouldn’t it be better to try to map the statistics against point differential than winning percentage?
With mapping to winning percentage, you are favoring a team that wins all of its games 13-10 over a team that wins all of its games but one 50-0 and loses one by a point. Doesn’t seem like the 13-10 model is the one you’d try to emulate.
by PatronSaint on Aug 16, 2009 9:05 AM CDT
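The comparison in the comment above can be put in numbers. The sketch assumes a 12-game season, and the 20-21 score for the one-point loss is made up; neither figure comes from the comment.

```python
def season_summary(results):
    """results: list of (points_for, points_against) tuples, one per game."""
    games = len(results)
    wins = sum(1 for pf, pa in results if pf > pa)
    avg_margin = sum(pf - pa for pf, pa in results) / games
    return wins / games, avg_margin

GAMES = 12  # assumed season length
team_a = [(13, 10)] * GAMES                    # wins every game 13-10
team_b = [(50, 0)] * (GAMES - 1) + [(20, 21)]  # eleven 50-0 wins, one 1-point loss

print(season_summary(team_a))  # (1.0, 3.0)
print(season_summary(team_b))  # winning percentage about 0.917, margin 45.75
```

Ranked by winning percentage, the first team comes out ahead; ranked by average point margin, the second team is far in front.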
“I hope Boom retains his skepticism about the efficacy of stats. They’re great when you want to develop, say, an ERP; less so when trying to figure out how best to herd a bunch of irrational young men into a coherent football team.” says Crazy Joe.
BTW, I thought OldTimeHorn’s thoughts were timely and accurate…
Although I agree with this OP, I also find it more helpful to get down into the secondary stats like KilgoreTrout suggested…that is where stats seem to be most helpful in developing best probable strategies…further, as others have said the whole point of this exercise is to get the data sets in reliable ranking as to effect on the game outcomes…to do otherwise is scary as has been said already.
I’m with Boom here…play with passion & energy, execute assignments to perfection and above all, play as a unified TEAM. We have the talent, so the stats will take care of themselves if the coordinator puts the TEAM in a position to win with the right plays & schemes in place on each play…
by uttotop on Aug 16, 2009 9:42 AM CDT
You never lose when you score more points than the other team.
by yogi berra on Aug 16, 2009 9:48 AM CDT
“One more fagot of these adamantine bandages, is, the new science of Statistics.”
I’m pretty sure Emerson thinks you’re gay.
Seriously, nice work. Your concern about Davis and stats equating to drunkards and lamp posts is probably valid.
by Doperbo on Aug 16, 2009 3:47 PM CDT
Huck, you should probably go here as you get started. Bill’s been doing a lot of the stuff you’re thinking about doing for about a year now. You can save yourself a lot of time on stuff that would be duplicative.
The more quantitative analysis, the better, far as I’m concerned.
by PB @ BON on Aug 16, 2009 4:49 PM CDT
PB -
Looks like some good stuff. I think he’d appreciate the adjusted stats database. Seems like he’s developed quite a few of his own stats that I’d have to read up on.
by Huckleberry on Aug 16, 2009 7:12 PM CDT
Definitely. Shoot me an email sometime, actually. We’re about to start on a big CFB stats database project you might enjoy being a part of.
Keep up the great work.
by PB @ BON on Aug 16, 2009 8:25 PM CDT
Huck, you might be interested in this:
http://gamblersbookclub.com/product.php?productid=18231&cat=903&page=1
I’ve picked it up the last couple seasons, makes for decent bathroom reading at the very least.
by Professah Funkensteen on Aug 16, 2009 10:25 PM CDT
So you’re telling me if I kiss a hermaphrodite that makes me gay and straight all at once?
by WTF? on Aug 18, 2009 5:45 PM CDT
Huckleberry, I am not certain how you did this. Did you do the regression analysis completely after the season was over? Or did you accumulate statistics during the season and use those to predict the game at that point? I would think the latter is the better way to do it.
thanks
by John on Oct 24, 2009 6:34 PM CDT
