Which Statistics Correlate Best to Winning?

The Texas Staff Takes the Plunge into Statistical Analysis

Will Muschamp famously declared that stats are for losers. Greg Davis? It turns out that he wants to know which stats are for winners. It has come to Barking Carnival's attention that the Texas staff, and Davis in particular, ran some numbers to determine which stats correlate best to wins. My excitement that the Texas staff was willing to use statistical research in their efforts to continuously improve was quickly tempered by the study's content.

It's not that their method was completely without merit. It's just that it leaves a lot to be desired from a statistical analysis standpoint. And if it was indeed Davis himself who ran the numbers, it's completely understandable that his time would be better spent on the ins and outs of the offense than on improving his analysis technique. Based on the information HenryJames mailed to me on what appears to be a Rainbow Cattle Company cocktail napkin, the study took the season-ending Top 10 teams in each of various statistical categories and summed that group's wins and losses for the year. That gave each statistic a Top 10 Winning Percentage. The study was performed for the 2008 season alone as well as for the nine-year period beginning in 2000 and ending last year. The results:

Nine-Year Period -

Rank Statistic W-L WPCT
1 Scoring Offense 927-226 0.804
2 Scoring Defense 913-227 0.801
3 Rush Defense 847-263 0.763
4 Total Defense 843-283 0.749
5 Turnover Margin 827-291 0.740
6 Total Offense 823-301 0.732
7 Pass Offense 731-389 0.653
8 Rush Offense 712-400 0.640
9 Pass Defense 723-412 0.637

2008 Only -

Rank Statistic W-L WPCT
1 Scoring Defense 105-27 0.795
2 Scoring Offense 106-28 0.791
3 Rush Defense 100-32 0.758
4 Total Offense 100-33 0.752
5 Total Defense 99-34 0.744
6 Pass Offense 95-37 0.720
7 Turnover Margin 91-42 0.684
8 Rush Offense 84-46 0.646
9 Pass Defense 67-59 0.532
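
The Top 10 Winning Percentage described above is simple enough to sketch in a few lines. This is only an illustration of the method as I understand it, not the staff's actual worksheet. The `teams` structure and its key names are assumptions made up for the example.

```python
def win_pct(wins, losses):
    """Winning percentage for a combined win-loss record."""
    return wins / (wins + losses)


def top10_win_pct(teams, stat, higher_is_better=True):
    """Sum the records of the ten best teams in `stat` and return the group's winning percentage.

    `teams` is a list of dicts with hypothetical keys "wins" and "losses"
    plus one key per statistic. Pass higher_is_better=False for defensive
    stats, where a smaller number is the better one.
    """
    ranked = sorted(teams, key=lambda t: t[stat], reverse=higher_is_better)
    top10 = ranked[:10]
    wins = sum(t["wins"] for t in top10)
    losses = sum(t["losses"] for t in top10)
    return win_pct(wins, losses)


# Check two rows of the tables above against their W-L columns.
print(round(win_pct(927, 226), 3))  # nine-year Scoring Offense
print(round(win_pct(67, 59), 3))    # 2008 Pass Defense
```

The two printed values should match the WPCT column for those rows, which confirms that the column is just wins divided by total games for the group.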

Some issues with the study:

  1. I haven't checked, but I doubt that they pulled out games between teams that were both in the Top 10 of each stat. Such games would artificially lower the winning percentage by bringing it toward .500 because of the 1-1 mark in the game.
  2. Using per game totals for the yardage stats. The biggest effect of this can be seen by the relative placement of rush defense and pass defense in each list. Teams that are trying to come from behind are more likely to throw the ball. Using total yardage numbers here instead of efficiency numbers probably skews the study heavily.
  3. Using raw stats instead of adjusted stats. Can't complain too much here simply because adjusted stats weren't available to the Texas staff. They simply used what the NCAA publishes, which leads to all kinds of problems. The NCAA year-end stats include games against FCS competition and they clearly don't account for the schedule each team faced.
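
To put a number on the first issue, here is a rough sketch of what pulling out games between two Top 10 teams does to the percentage. The 105-27 record comes from the 2008 table above. The count of five head-to-head games is invented purely for illustration, since I haven't counted the real ones.

```python
def win_pct_without_head_to_head(wins, losses, head_to_head_games):
    """Winning percentage of a Top 10 group after dropping games played within the group.

    Each game between two Top 10 teams adds exactly one win and one loss to
    the group's combined record, so removing it subtracts one from each column.
    """
    adj_wins = wins - head_to_head_games
    adj_losses = losses - head_to_head_games
    return adj_wins / (adj_wins + adj_losses)


# 2008 Scoring Defense record from the table above: 105-27.
# Five head-to-head games is a hypothetical count, not a researched one.
raw = 105 / (105 + 27)
adjusted = win_pct_without_head_to_head(105, 27, 5)
print(f"{raw:.3f} -> {adjusted:.3f}")
```

For any group with a winning record, removing those games pushes the percentage up, which is the pull toward .500 described in the first item.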

So What the Hell Are You Going To Do About It?

The first thing we need to do is identify a better method. A linear regression analysis would be nice to do, as Trojan Football Analysis did here using raw totals. However, Sailor Ripley is way too cheap to spring $185 to get me a copy of the analysis software TFA used, so we're stuck with basic Excel capabilities. Good for us, then, that we can calculate statistical correlation fairly easily using season-long data and simple functions. The correlation coefficient will serve our purposes quite well. Linear regression would be better able to tell us the exact form of the relationship between each stat and the result that depends on it, but correlation will tell us everything we need to know about the strength of that relationship. For those unfamiliar with correlation coefficients, the values range from -1 to 1, where -1 is a perfect negative correlation, 0 represents uncorrelated variables, and 1 is a perfect positive correlation.

The finite population form of the correlation equation was used for the analyses.
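
For anyone who wants to reproduce the calculation outside of Excel, here is a minimal sketch of the population (divide-by-N) Pearson correlation coefficient. The function name and the sample numbers are made up for the example. Note that dividing by N or by N - 1 gives the same coefficient, because the factor cancels between the covariance and the two standard deviations.

```python
import math


def pearson_r(xs, ys):
    """Population (divide-by-N) Pearson correlation of two equal-length lists.

    Raises ZeroDivisionError if either list has no variation at all.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


# Hypothetical season data for five teams: points per game and winning percentage.
points_per_game = [38.2, 31.5, 27.0, 24.1, 17.8]
winning_pct = [0.923, 0.750, 0.583, 0.500, 0.250]
print(round(pearson_r(points_per_game, winning_pct), 3))
```

Each coefficient in the tables that follow is this calculation with one statistic as `xs` and winning percentage as `ys`, one pair of values per team.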

The second step, of course, is to find adjusted stats. It occurred to me during the past college basketball season that Ken Pomeroy's site for that sport could really use an equivalent for college football. Dozens of times on this blog and various message boards I have argued with other college football fans about using raw stat totals without context. Usually I would then go out and calculate the adjusted stat for that single use. It was the last time I did this that I realized I was going to have to publish an adjusted stats database myself that could be used in these situations. Go ahead, take a week or two to digest the information at that link. There's a lot there. The numbers include only games against FBS opponents.

Okay, I'm with You on the Method, but Prove the Adjusted Stats to Me

Hopefully the superiority of correlation analysis over a winning percentage summation needs no explanation. But I admit that the adjusted stats argument listed above may stick out as an unwarranted assumption to some people. This is especially true if we are comparing a team's stat to its own W/L record, as a weak schedule that allows a high ranking in a statistic should also allow a better record overall. Quite true, and as we'll see in a minute, the use of adjusted stats becomes more important as you drill down further from the end goal - the wins and losses.

The first thing I set out to do was to compare the correlation coefficients calculated from raw stats against those calculated from adjusted stats. In addition to the adjusted stats available above, I also adjusted each team's winning percentage using the same method used for the stats. Here is a single table showing both the raw coefficients and the adjusted coefficients for point statistics. They are sorted by the absolute value of the coefficient, keeping in mind that a higher defensive statistic total, such as more points allowed, leads to fewer wins, so the negative coefficient is correct. The stats that don't say adjusted are correlated to the unadjusted winning percentage, while the adjusted stats are correlated to the adjusted percentage.

Rank Statistic Coefficient
1 Adjusted Point Margin per Game .935
2 Point Margin per Game .928
3 Adjusted Scoring Offense .816
4 Adjusted Scoring Defense -.811
5 Scoring Offense .755
6 Scoring Defense -.711
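
The sort order used in the table above, strongest relationship first regardless of sign, can be sketched like this. The coefficients are copied from the table. The function name is just a label chosen for the example.

```python
# Coefficients copied from the table above, deliberately listed out of order.
coefficients = {
    "Scoring Defense": -0.711,
    "Adjusted Scoring Offense": 0.816,
    "Point Margin per Game": 0.928,
    "Adjusted Scoring Defense": -0.811,
    "Scoring Offense": 0.755,
    "Adjusted Point Margin per Game": 0.935,
}


def rank_by_strength(coeffs):
    """Return (statistic, coefficient) pairs ordered from strongest to weakest correlation."""
    return sorted(coeffs.items(), key=lambda item: abs(item[1]), reverse=True)


for rank, (stat, r) in enumerate(rank_by_strength(coefficients), start=1):
    print(rank, stat, f"{r:+.3f}")
```

Sorting on the absolute value is what lets a defensive stat with a coefficient of -.811 rank ahead of an offensive stat at .755.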

Feeling confident that the point was proven, I have not run any further raw stat coefficients. We can see above that the obviously most important statistic, point margin, shows up where it should at the top of the list. Point margin was included only as a sanity check on the calculations because its position on the list has to be at the top. Also, each adjusted stat outpaces its raw counterpart, and the gap is much larger at the secondary stat level (.816 versus .755 on offense, -.811 versus -.711 on defense) than at the first level (.935 versus .928). I will use only adjusted stats moving forward. Some other quick thoughts on this table are that winning is 7% luck (heart or clutch for you fairy tale lovers) - how's that for drawing an overly specific conclusion from a statistic? - while offense and defense are pretty much just as important as each other. Using the adjusted numbers, the difference between the .816 and -.811 coefficients is extremely small. Now that the method and statistical basis are both settled, it's time to move forward and re-run the numbers the Texas staff ran.

The Texas Study Revised

Here is the revised study for 2008 followed by some more comments:

2008 only -

Rank Statistic Coefficient
1 Scoring Offense .816
2 Scoring Defense -.811
3 Rush Defense -.767
4 Total Defense -.694
5 Total Offense .610
6 Rush Offense .485
7 Turnover Margin .425
8 Pass Offense .238
9 Pass Defense -.134

  1. No, I don't have adjusted statistical data for 2000-2007. There is no automated way that I know of to get box scores for all games in previous years. I estimate roughly 12 hours of data entry just for 2008. I now have a single file with every 2008 box score and will be compiling one as 2009 progresses.
  2. The numbers used were adjusted per game stats as the Texas staff used per game numbers.
  3. With the above comment in mind, it's important to remember that correlation numbers do not tell us which variable is the cause and which is the effect. The most immediately obvious stat issue above is that rush defense, in adjusted yards per game, ranks so highly compared to pass defense. But as stated above, we know that teams that are winning run the ball and teams that are losing throw the ball. So might the correlation number above be because winning creates better rush defense numbers rather than vice versa? And the same idea holds for the low pass defense ranking.

To deal with that last thought, I ran the coefficients using my adjusted per play stats instead of the per game figures. Each stat was replaced in this manner and the result is shown here:

Rank Statistic Coefficient
1 Points Scored per Play .833
2 Points Allowed per Play -.803
3 Yards Allowed per Play -.696
4 Total Passing Yards Allowed per Attempt -.681
5 Total Passing Yards per Attempt .663
6 Yards Gained per Play .649
7 Total Rushing Yards Allowed per Carry -.619
8 Turnover Margin per Play .484
9 Total Rushing Yards per Carry .434

  1. Now the passing and rushing phases of defense are much closer in correlation values with passing defense being slightly more correlated to winning percentage.
  2. The "total" yards in the two categories simply means that sacks and sack yardage were moved from rushing to passing numbers.
  3. Being able to run the ball effectively is the least strongly correlated statistic in the study. However, a .434 value shows that it is still significantly important to winning.
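
The "total" adjustment from the second item can be sketched as follows. It assumes the box-score rushing line counts each sack as a carry for negative yardage, which is how NCAA stats record them. The function name, argument names, and sample numbers are all made up for the example.

```python
def total_rushing_and_passing(rush_att, rush_yds, pass_att, pass_yds, sacks, sack_yds_lost):
    """Move sacks and sack yardage from the rushing line to the passing line.

    Assumes `rush_att` and `rush_yds` already include sacks as carries for
    lost yardage. `sack_yds_lost` is a positive number of yards lost.
    Returns total rushing yards per carry and total passing yards per attempt.
    """
    total_rush_att = rush_att - sacks
    total_rush_yds = rush_yds + sack_yds_lost  # give back the yards sacks took away
    total_pass_att = pass_att + sacks          # a sack is a called pass play
    total_pass_yds = pass_yds - sack_yds_lost
    return {
        "rush_yds_per_carry": total_rush_yds / total_rush_att,
        "pass_yds_per_attempt": total_pass_yds / total_pass_att,
    }


# Hypothetical single-game box score: 40 carries for 150 yards (sacks included),
# 30 pass attempts for 250 yards, 3 sacks for 21 yards lost.
result = total_rushing_and_passing(40, 150, 30, 250, sacks=3, sack_yds_lost=21)
print(result)
```

Total plays and total yards are unchanged by the move. Only the split between rushing and passing shifts, which is why the yards per play figures are unaffected.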

What Next?

The adjusted stats database makes possible more correlation studies than anyone could reasonably tackle, so it's important to determine a starting point. Whereas the Texas Study chose to correlate substats directly to wins, I personally believe that scoring offense, defense, and margin are the stats one should correlate to wins. I think that stats should be considered building blocks and that a team's scoring stats are the level below winning and losing (yes, it's true that scoring offense and defense are actually substats of scoring margin, but you get the point). Contributing to those stats are substats such as yardage and efficiencies. While it's true that there is interplay in football between the offense and defense, as opposed to baseball, I think time is better spent analyzing how yards per carry correlates with offensive scoring instead of directly with wins. I will probably determine some secondary correlation figures, such as the coefficient between defensive yards per play allowed and offensive scoring, to isolate the interplay mentioned above, but more time will be devoted to offense and defense in isolation from each other. Another avenue I have already decided to go down is to determine a better passing efficiency formula than the one currently used by the NCAA. Using relative coefficients, we should be able to modify the formula to better account for each statistic's correlation with offensive scoring.
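
For reference, here is a sketch of the NCAA passing efficiency formula as it is commonly published: 8.4 per yard, 330 per touchdown, 100 per completion, and minus 200 per interception, all divided by attempts. The function name and the `weights` argument are assumptions added for the example. Making the weights swappable is only one possible way to try out a revised formula, not a finished proposal.

```python
# Weights of the NCAA passing efficiency formula as commonly published.
NCAA_WEIGHTS = {
    "yards": 8.4,
    "touchdowns": 330.0,
    "completions": 100.0,
    "interceptions": -200.0,
}


def passing_efficiency(attempts, completions, yards, touchdowns, interceptions,
                       weights=NCAA_WEIGHTS):
    """Passing efficiency rating per attempt using the given weights.

    Passing a different `weights` dict is a hypothetical hook for trying
    out alternative formulas. It is not part of the NCAA definition.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    total = (
        weights["yards"] * yards
        + weights["touchdowns"] * touchdowns
        + weights["completions"] * completions
        + weights["interceptions"] * interceptions
    )
    return total / attempts


# Hypothetical stat line: 20 of 30 for 250 yards, 2 touchdowns, 1 interception.
print(round(passing_efficiency(30, 20, 250, 2, 1), 1))
```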

Thoughts and discussion are appreciated on either the correlation coefficient analysis or the stats pages. Also, any corrections on the stat pages would be greatly appreciated. My data entry consisted of 78,870 fields. Not all of those were manually populated, but there are probably some errors. There are still some tweaks I'd like to make to the pages. For example, the total passing and total rushing numbers are only included in the tables for all teams and are not on each team's page. I should add those to the team pages in the near future.