Measuring coach quality: the jury is still out on Charlie Strong

With all this talk regarding a potential coaching change at Texas, I’ve spent some time thinking about how we evaluate the success or failure of football coaches. The obvious metric is some combination of winning percentage and championships, but that’s not everything. And, even if it were, we often don’t have enough data to make any meaningful inference regarding a coach’s true winning percentage. Consider two scenarios: 1) a coach inherits a cesspool after the previous coach has been fired; 2) a coach inherits a loaded program from a departing legend, who’s moved on to bigger and better things. If the coach in scenario 1 struggles to win more than 50% of his games in his first two years, can we really conclude that his innate winning percentage is low? Similarly, if the coach in scenario 2 continues a run of successful seasons in his first few years, can we be certain that his long-term winning percentage is high? It seems like an unfair comparison if we confine our assessment to only the first few years of a coach’s tenure.

Obviously, the thing to do would be to lengthen the observation period and allow the long-run averages to emerge from the data. But how long is long enough? I think that depends on what you believe a data point is. If you think each game is a sufficient observation, you see three seasons as an adequate sample size. But, you would have to believe that the quality of a team, to the extent that a coach influences it, fluctuates significantly throughout a season. I think that’s a bit unrealistic. More likely, team quality is more or less fixed within years with most of the advancement or decline realized during the offseason due to player turnover, physical conditioning, new offensive and defensive strategies, etc. That being the case, observations should be made at annual frequency. And if an entire season comprises a single data point, we’re talking about years of observation to ensure an adequate sample size.

Here’s how I think about the assessment of coach quality. Every coach has some stationary distribution over winning percentage. The best coaches have distributions not only characterized by high averages, but also low standard deviations around those averages. They win consistently. Coaches within the next tier also have high averages, but with higher standard deviations, meaning they have good chances of winning championships but at the cost of the occasional disastrous season. The remainder of the ranking follows: lower averages and higher volatility indicate lower coach quality.

Assessing expected winning percentage

Consider how we might estimate the distribution of a coach’s winning percentage. It needs to be a very simple model since we have very little data to draw from. As a general rule we should have at least 10 observations for every parameter we try to estimate. For a lot of coaches, we’re going to fall shy of that just trying to determine means. But, we also need some way of controlling for the scenarios I mentioned in the first paragraph; we need to penalize coaches that inherit a stable program and run it into the ground while also handicapping coaches that step into a nightmare. One approach is a slight tweak of the AR(1) model used for time series. For each year, we estimate the deviation of a team’s winning percentage from its head coach’s stationary average as a function of the previous year’s deviation. So, if a coach has a long-run average winning percentage of 80% but inherits a team that went 5-7 and proceeds to go 6-7 and 7-6 in his first two years, we won’t necessarily reject that his innate winning percentage is 80%. The model would attribute the slow start to the initial (poor) quality of the team.
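To make that intuition concrete, here is a minimal Python sketch of how the expected winning percentage drifts back toward a coach’s stationary average after he inherits a 5-7 team. The values of mu and phi are assumptions picked purely for illustration, not estimates from the data, and the function name is hypothetical.

```python
# Illustration only: mu (long-run win%) and phi (persistence) are assumed
# values, not estimates from the rankings in this post.

def expected_path(mu, phi, y_prev, years):
    """Expected win% in each of the next `years` seasons, given last year's win%."""
    path = []
    deviation = y_prev - mu           # how far the inherited team sits from mu
    for _ in range(years):
        deviation *= phi              # the gap shrinks by a factor of phi each year
        path.append(mu + deviation)
    return path

inherited = 5 / 12                    # the previous coach went 5-7
path = expected_path(mu=0.80, phi=0.6, y_prev=inherited, years=3)
print([round(p, 3) for p in path])
```

Under these assumed numbers, a first season well below 80% is exactly what the model expects, so a slow start alone is weak evidence against a high long-run average.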

One unfortunate drawback to this approach is the lack of institutional effects. Coaches that remain at disadvantaged schools like Rice or Duke for long periods of time will likely show lower estimated average winning percentages because of it. This is unavoidable. We cannot make the model more complex without sacrificing statistical significance. However, I would argue that great coaches win everywhere and don’t stay at those places for long anyway.

Ranking coaches

I offer some detail on the estimation in a separate section below, for those that are into that sort of thing. Here are the rankings, according to expected long-run winning percentage (note: estimates that are not statistically significant are crossed out and should be ignored). Due to laziness, I’ve neglected to gather data on several coaches that likely don’t have enough observations to make any meaningful inference, e.g. James Franklin, David Shaw, Chip Kelly, Hugh Freeze, etc. The coaches listed here are those that I thought might be interesting to look at. It’s by no means an exhaustive list.

Rank	Coach	Salary ($mil)	E[Win%]	Observations
1	Jimbo Fisher	5.2	83.3%	6
2	Pete Carroll		83.0%	9
3	Urban Meyer	5.9	81.6%	15
4	Nick Saban	7.1	78.6%	19
	~~Dabo Swinney~~	~~3.3~~	~~78.4%~~	7
5	Bob Stoops	5.4	78.2%	17
	~~Charlie Strong~~	~~5.4~~	~~77.9%~~	6
	~~Chris Petersen~~	~~3.4~~	~~76.2%~~	10
6	Les Miles	4.4	74.7%	15
7	Mark Richt	4.1	73.4%	15
8	Bobby Petrino	3.0	72.7%	11
9	Gary Patterson	3.9	70.4%	15
10	Bronco Mendenhall		69.6%	11
11	Mack Brown		69.1%	29
12	Brian Kelly		69.1%	12
13	Mark Dantonio	3.7	67.2%	12
14	Bret Bielema	4.0	66.6%	10
15	Mike Gundy	3.7	66.2%	11
16	Bill Snyder	3.0	66.1%	24
17	Kevin Sumlin	5.0	65.7%	8
18	Kyle Whittingham	2.6	65.1%	11
19	Ken Niumatalolo	1.6	64.6%	8
20	Dan Mullen	4.0	62.4%	7
21	Kirk Ferentz	4.1	62.1%	17
	~~Larry Fedora~~	~~1.9~~	~~61.1%~~	8
22	Paul Johnson	2.8	59.2%	14
23	Troy Calhoun	0.9	56.1%	9
24	Pat Fitzgerald	2.5	55.9%	9
25	David Cutcliffe	2.0	53.9%	14
26	Mike Riley	2.7	50.6%	15
27	Rick Stockstill	0.8	50.6%	10
28	Jim Grobe		50.1%	19
29	David Bailiff	0.8	49.3%	9
	~~Gary Andersen~~	~~2.5~~	~~17.2%~~	7

There are two caveats I’d like to make here. First, this only uses data up through last year, so it doesn’t take anything from this season into account. Second, even though Jimbo Fisher’s estimate is statistically significant, it’s hard to get a read on him because he only has six seasons’ worth of data, all from a single school. I feel much more comfortable with the estimates for coaches that have at least eight or nine years’ worth of experience.

The top of the ranking is about what I expected. Pete Carroll was an amazing college coach. Urban Meyer and Nick Saban are amazing college coaches. Kirk Ferentz’s agent is an amazing agent. And, Les Miles and Mark Richt don’t get the credit they deserve.

As a sense-check, here’s a plot of the statistically significant expected winning percentages against coach salaries along with a local linear regression to highlight the relationship. The expected winning percentages from the model seem to align pretty well with the compensation these coaches receive.

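[Figure: expected winning percentage vs. coach salary for the statistically significant estimates, with a local linear regression fit]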

Here are the estimated distributions over winning percentage for the top five coaches…

[Figure: estimated distributions over winning percentage, top five coaches]

the next five…

[Figure: estimated distributions over winning percentage, next five coaches]

and the bottom five.

[Figure: estimated distributions over winning percentage, bottom five coaches]

Regarding Texas...

One of the advantages that Texas has over most programs is money. And, one of the few ways in which that money affects on-the-field performance is through the coaching staff. If you’re Texas, you don’t have to make a gamble on the young up-and-coming coach with a limited track record. You have the resources to get someone with a demonstrated history of success, someone whose distribution over winning percentage can be confidently estimated. If you think Tom Herman is an elite coach, I won’t argue with you. But I also won’t concede that he is because I have no data to suggest that he’s more like Nick Saban than Jim Grobe. Now, I’m not saying Texas can get anybody it wants; there are other considerations than money, especially for millionaires more concerned with legacy than bank statements. For whatever reason, Texas hired an unproven coach three years ago. I argue that there still isn’t enough data to discern what Charlie Strong’s quality as a head coach is. We will get a clearer picture in a couple of years. Until then, I think dismissing him and starting over is a mistake, especially if the new guy comes with the same data limitations.

Calculation

Assume a team’s winning percentage, denoted as y, during year t is given by:

y_t - µ = φ(y_{t-1} - µ) + ε_t

where µ is the coach’s mean (stationary) winning percentage, φ is an autocorrelation parameter dictating how much of year t’s (y - µ) is dependent on the previous year’s, and ε_t is a normally distributed disturbance term with a mean equal to zero and a standard deviation equal to the coach’s volatility, denoted σ. Rearranging this equation, we get something that looks like the following:

y_t = µ (1 – φ) + φ y_{t-1} + ε_t.

If we run the following regression:

y_t = α + φ y_{t-1} + ε_t,

we can recover the parameter estimates for a coach’s stationary distribution over winning percentage. Using estimates of α* and φ*, a coach’s distribution is fully described (since we’re assuming normality) by its mean, given by

µ* = α* / (1 – φ*),

and its standard deviation, σ*, which is just the residual standard error from the regression.
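For readers who want to try this at home, the regression step can be sketched in a few lines of Python with NumPy. This is a sketch under stated assumptions, not the exact code behind the rankings: the function name fit_coach is hypothetical, the sample series is made up, and the first entry of the series is assumed to be the winning percentage of the team the coach inherited.

```python
import numpy as np

def fit_coach(win_pct):
    """OLS fit of y_t = alpha + phi * y_{t-1} + eps_t.

    Returns (mu, sigma, alpha, phi), where mu = alpha / (1 - phi) and sigma
    is the residual standard error, following the definitions in the text.
    """
    y = np.asarray(win_pct, dtype=float)
    y_lag, y_now = y[:-1], y[1:]                        # pair each season with the prior one
    X = np.column_stack([np.ones_like(y_lag), y_lag])   # intercept column plus lagged win%
    (alpha, phi), *_ = np.linalg.lstsq(X, y_now, rcond=None)
    resid = y_now - (alpha + phi * y_lag)
    dof = len(y_now) - 2                                # two estimated coefficients
    sigma = np.sqrt(resid @ resid / dof)
    mu = alpha / (1.0 - phi)
    return mu, sigma, alpha, phi

# Made-up series for illustration only; the first entry is assumed to be the
# inherited team's win% from the year before the coach arrived.
sample = [0.417, 0.462, 0.538, 0.692, 0.769, 0.786, 0.846, 0.714, 0.857, 0.786]
mu_hat, sigma_hat, alpha_hat, phi_hat = fit_coach(sample)
print(f"mu*={mu_hat:.3f} sigma*={sigma_hat:.3f} phi*={phi_hat:.3f}")
```

Note that µ* blows up as φ* approaches one, which is one reason an estimate from only a handful of seasons can be wildly off.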

At this point many of you are screaming that winning percentage is bounded between zero and one and that we should be talking about truncated normal distributions. I agree, but consider the following. 1) While there might be a better way of estimating this by programming an estimator that incorporates a truncated error, this is a blog post, not Econometrica. There are a lot of issues here, namely that we’re ignoring a lot of potentially influential variables and that the order of autocorrelation is probably higher than one. While I don’t want to make light of any bias that might arise from this specification, I also want this to be as simple as possible. The whole point of this exercise is to try to apply some statistical machinery to a scant dataset in the hope that some objective measure of coaching quality bubbles up to the surface. With so few observations, we cannot hope to add additional parameters to the model and get reliable estimates on all of them unless we confine our study to only coaches with decades of head coaching experience. Besides, I think this model does do a good job of pinning down coaching quality, at least in a relative sense. 2) I correct for the truncation after the fact by calculating a coach’s expected winning percentage assuming µ* and σ* are the location and scale parameters characterizing a truncated normal distribution over the support [0,1]. The statistic by which I measure coach quality is therefore given by

E[Win%] = µ* + σ* [N’(-µ* / σ*) – N’((1 - µ*) / σ*)] / [N((1 - µ*) / σ*) - N(-µ* / σ*)],

where N’(･) and N(･) are the standard normal density and distribution functions, respectively.
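The truncation correction is easy to check numerically. The following Python sketch implements the formula above using only the standard library; the function names are hypothetical and the inputs in the example are assumed values, not estimates for any particular coach.

```python
import math

def norm_pdf(z):
    """Standard normal density, N'(z) in the notation above."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal distribution function, N(z) in the notation above."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_win_pct(mu, sigma):
    """Mean of a normal(mu, sigma) truncated to the interval [0, 1]."""
    lower = (0.0 - mu) / sigma        # standardized lower bound, -mu/sigma
    upper = (1.0 - mu) / sigma        # standardized upper bound, (1 - mu)/sigma
    numerator = norm_pdf(lower) - norm_pdf(upper)
    denominator = norm_cdf(upper) - norm_cdf(lower)
    return mu + sigma * numerator / denominator

# Assumed inputs for illustration: a high mean with moderate volatility.
print(round(expected_win_pct(0.85, 0.15), 3))
```

When µ* sits close to one, the upper bound cuts off more of the distribution than the lower bound does, so the corrected E[Win%] lands below µ*. When µ* is exactly 0.5 the two tails cancel and the correction vanishes.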