Soren P. Sorensen
University of Tennessee
Knoxville, TN 37996-1200
soren-sorensen@utk.edu
The purpose of this report is to give a detailed description of the ranking method I use for college football teams.
To do this, mathematics is used extensively, which might make the text more difficult to read, but it ensures that the method is well documented and reproducible by others, who might want to use it or derive another ranking method from it. The report is, on the other hand, also more detailed than a ''typical'' scientific paper and discusses details which would be omitted in a scientific paper intended for publication.
In this report we will focus on NCAA 1-A football, but the methods described here are very general and can be applied to most other sports with only minor modifications.
In general most ranking systems fall into one of the following two categories: predictive or earned rankings. The goal of an earned ranking is to rank the teams according to their past performance in the season in order to provide a method for selecting either a champion or a set of teams that should participate in a playoff (or bowl games). The goal of a predictive ranking method, on the other hand, is to provide the best possible prediction of the outcome of a future game between two teams.
In an earned system, objective and well publicized criteria should be used to rank the teams, like who won, the score difference, or a combination of both. By using well defined criteria for the ranking, the teams know exactly what the consequences of a win or a loss will be. This is done in most football conferences to select the conference champion or at least the two teams selected for a championship game. In general an earned ranking system allocates a number (often called the power ranking) to each team, which is used to order the teams in a linear sequence.
Most systems found on the WWW are predictive, even many of the BCS systems. In order to make a predictive system as accurate as possible, it is allowed to include any information deemed useful, like the strength of the quarterback, yards earned, number of fumbles, etc. In particular it is very common to put more weight on recent games than on older games. This allows a more precise extrapolation to the next week's games. In a more advanced predictive system the teams are not necessarily linearly ordered. One can easily imagine situations where a good predictive system will predict that A beats B, B beats C, but C beats A. This is not possible in a pure earned system.
Unfortunately most WWW ranking systems are a strange mix of the two types of systems described here. The BCS ranking, which determines which teams have earned the right to play in the various bowls, in particular the bowl determining the national championship, should be a pure earned ranking system. But many, if not most, of the BCS systems seem to be predictive. It is even so bad that a particular web site ranks the BCS systems according to how well they predict next week's games!
The methods described below are all intended as earned systems, and no effort is spent on trying to optimize their predictive capabilities.
Currently the Bowl Championship Series (BCS) system [] is based on four components: 1) two subjective polls by coaches and journalists, 2) eight computer programs, 3) a well defined, but primitive, method for calculating the strength of schedule, and 4) the number of losses of each team.
Here we will concern ourselves with the eight BCS computer based methods for ranking the teams.
Of all these systems only Rothman offers the code to others. However, Massey has a fairly detailed description of his system, so it should be possible to recreate his rankings. Since all the other systems seem to be well guarded secrets, it is not possible for others to check the calculations or to estimate what effect wins or losses in the coming weeks will have. It is especially problematic in the case of Jeff Sagarin, who is making a living from publishing ratings for various sports in USA Today. Since Mr. Sagarin has an economic advantage in keeping his system proprietary, it will be very difficult to obtain a detailed description or the actual source code of his system.
This secrecy hurts the computer rankings tremendously, since it creates the impression that they are incomprehensible, unfair, and reward ''running up the score''.
It is also a very uncommon situation. In other sports, where rankings play an important role (chess, golf, tennis, etc.) the ranking method is completely public and can be checked by any interested person.
I will therefore suggest that the current system be replaced with an open system based on publicly accessible code with a detailed mathematical description. Ideally the whole package should be available for download from the BCS WWW site, so coaches, players, journalists and fans can perform the rankings themselves. The criteria the teams are ranked on should be well publicized and should be quantifiable in terms of either simple formulas or tables.
For this reason I have chosen to give a detailed report on the ranking system I use. As will be seen, it is based on the work of others. It does, however, contain a few original, minor ideas.
I will also propose that the NCAA should be responsible for the WWW posting of all NCAA results in a standard ASCII based format that can be used by everybody ranking sports teams.
Often a ranking between teams is obtained by letting them play in tournaments with the purpose of either establishing a ranking between them or at least defining ''the best team''. This tournament can either be stretched over the whole season or, more commonly, be used in a playoff at the end of the season.
All teams in the tournament play each other and the ranking is determined by the number of points each team accumulates (see the section on accumulative point systems). This is a very ''fair'' system, but requires a large number of games.
The tournament is played in ''rounds''. Only winners are allowed to play in the next round. Eventually only one team is left and is declared the winner. The theme can be varied by introducing a losers bracket, so a team is only eliminated after two losses. This is a very popular system, since it will find an undisputed ''best'' team using the smallest number of games. However, due to the unavoidable fluctuations in the performance of each team from game to game, it happens very often that a better team loses a match to an inferior opponent and is then eliminated. This is not fair from a ranking point of view, but has a lot of appeal from a spectator point of view.
The Monrad system is a very interesting variation of the cup system, which to my knowledge is only used on a regular basis in chess tournaments. In the first round all teams are paired randomly. The winner gets 1 point and the loser zero. In each successive round all teams with the same number of points are paired randomly (except that teams which have played each other earlier cannot be paired if there are other pairing possibilities). This system has the advantage that all teams keep playing, in contrast to the cup system, and as the season (or tournament) advances, teams of equal strength will be meeting each other. There is no limitation to the number of rounds that can be played, but eventually teams have to be paired if they have similar, but not necessarily identical, numbers of points. The team with the largest number of points after a predefined set of rounds is the winner.
This system would be ideally suited for NCAA football, if only tradition and logistics would not interfere. Early in the season teams would play 5-6 Monrad games within their conference, and for the rest of the season they would be paired nationally, but with a pairing preference for teams geographically close. This would result in some very exciting games in November and would create optimally matched bowl games.
Most sports use ranking systems based on the idea that for each match or tournament the team (or player) acquires a certain number of points depending on its performance, and the team's ranking is based on the total number of points it has accumulated during the season. If several teams have the same number of points, additional objective criteria are used, like the winner of the mutual game or the accumulated score difference.
In most soccer leagues the winning team gets 2 (or 3) points and the loser none, and each team gets 1 point for a tie. If all teams play each other (round robin as described above) this provides a very simple and effective method for evaluating the integrated performance of each team. Typical league sizes range from 6 to 24 teams. Usually the best teams from a league will move up to a higher league next season and the lowest ranked teams will be moved to a lower league. The winner of the highest league will be the overall winner.
Golf and tennis use systems where each player accumulates points based on their placement in each tournament according to published tables. Prestigious tournaments provide more points and small local tournaments of course provide fewer points. Usually the points are accumulated over a sliding time interval of one year.
The advantage of accumulative point systems is their simplicity. It is easy for each team or the spectators to figure out the accumulated score and therefore the ranking. All participants know what they gain or lose by winning or losing a game. The disadvantage is that the point allocation, especially in tournaments for large systems like golf and tennis, becomes somewhat arbitrary and not based on the actual strength of the participants. It is also sometimes possible to rack up a lot of points by carefully selecting weak tournaments or by playing a large number of tournaments.
The Elo rating system was first used by the International Chess Federation in 1970 to rank chess players. The system was proposed by Arpad E. Elo. It is partly based on earlier work done by Anton Hoesslinger. The official description of the system as it is used in chess can be found at []. Jones has given a nice overview of the system in [].
The basic idea in the system is to continuously change a player's rating Rp based on whether she performs better or worse than expected in tournaments or matches. For a new player with a total of N matches, where N ≤ Ncut (Ncut = 20), the rating Rp is calculated as
R_p = \langle R_c \rangle + a \, \frac{N_w - N_l}{N}
where ⟨Rc⟩ is the arithmetic average of the competitors' ratings at the time of the match, Nw and Nl are the numbers of wins and losses, respectively, and a = 400 is an initial scale factor. This is the basic Ingo system, named after the place of origin, Ingolstadt in Germany, of its inventor Anton Hoesslinger.
Elo's important improvement to this system was to introduce the Win Expectancy Function We, which is defined as
W_e = \frac{1}{1 + 10^{-\Delta R / a}}
where
\Delta R = R_p - R_c
For a tournament with M matches played by a player with a rating Rp the new rating Rp,new after the tournament becomes
R_{p,new} = R_p + K \left( S - \sum_{i=1}^{M} W_{e,i} \right)
where the score is defined as (Nt is the number of tied games)
S = N_w + \frac{N_t}{2}
and the sum over i runs over each of the games the player played in the tournament. K is in principle a constant, but in reality it is varied slightly depending on the rating of the player.
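As a concrete illustration, the following C fragment implements the win expectancy and the rating update described above. The function names, the ratings in main, and the value K = 32 are illustrative choices of mine, not part of the official chess regulations.

#include <stdio.h>
#include <math.h>

#define ELO_SCALE 400.0   /* the scale factor a used in the formulas above */

/* Win expectancy W_e for a player rated Rp against an opponent rated Rc */
double WinExpectancy( double Rp, double Rc )
{
    return 1.0 / ( 1.0 + pow( 10.0, -( Rp - Rc ) / ELO_SCALE ) );
}

/* New rating after M games against opponents Rc[0..M-1].
   Score is the actual score S = Nw + Nt/2 and K is the (nearly constant) K factor. */
double NewRating( double Rp, const double Rc[], int M, double Score, double K )
{
    double expected = 0.0;
    for ( int i = 0; i < M; i++ )
        expected += WinExpectancy( Rp, Rc[i] );
    return Rp + K * ( Score - expected );
}

int main( void )
{
    double opponents[3] = { 1700.0, 1850.0, 2000.0 };   /* hypothetical opponent ratings */
    /* a player rated 1800 scores 2 points (two wins, one loss) in three games */
    printf( "new rating = %.1f\n", NewRating( 1800.0, opponents, 3, 2.0, 32.0 ) );
    return 0;
}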
The Elo system seems especially well suited to sports with a large number of participants (of the order of 10,000), where methods based on linear algebra have problems due to memory limitations in computers. However, the idea of introducing the probability function We is very powerful and can be used in other ranking systems. In particular, Massey has used part of these ideas in his very interesting BCS ranking system.
Select a ranking that minimizes the number of violations. A violation is a game where a team with a lower ranking defeats a team with a higher ranking.
.......
Let us consider a set of teams T consisting of NT teams playing a total of NG games between each other. Depending on the nature of the sport the term ''team'' can refer to either an individual (chess, boxing, singles tennis etc.) or a set of individuals (football, baseball, basketball, doubles tennis etc.). We will only consider games consisting of a set of two teams, but the method outlined in this paper can easily be generalized to consider games consisting of n > 2 teams (common in track and field, swimming etc.).
In game g (g = 1,...,NG) the home team is denoted th (h = 1,...,NT) and the away team is ta (a = 1,...,NT). In this game the home team obtains a score of Sh and the away team a score of Sa. The game is played at time Tg within a given season y. Results of games are assumed to be available from a total of Ny seasons (y = 1,...,Ny). The score of the winner is Sw and the score of the loser is Sl. We will assume for simplicity that the winner of a game is the team with the larger score. The margin of victory or point spread ΔS can be defined in two different ways.
\Delta S_{wl} = S_w - S_l
or
\Delta S_{ha} = S_h - S_a
ΔSwl will always be non-negative, whereas ΔSha will be positive if the home team wins and negative if the away team wins. The relation between them is
\Delta S_{wl} = \left| \Delta S_{ha} \right|
The assumptions of the current ranking model are:
In the following the discussion will be based on examples from football, but the formalism is completely general and could be used for any binary game resulting in a final set of scores. It follows from assumption 4 that the current ranking method does not take into account other ''unofficial'' statistics from a game, like half time score, yardage gained or lost, fumbles, etc. While these variables might very well be of importance for a prediction algorithm, we consider them irrelevant and unfair to use for a ranking algorithm, whose purpose is to evaluate which team is the best or which set of teams is the best. In the latter case the teams need to know exactly what they are being evaluated on.
Let us now consider in more detail the result Rg of game g. As stated in assumption 3, the result Rg(Sh,Sa) is a real function of the two scores Sh and Sa. Rg will be considered a measurement of the strength of the two teams. As with any physical measurement, the result Rg does not provide an exact measurement of the strengths of the two teams, but an uncertainty σg is associated with each measurement g. There is, unfortunately, no universally accepted result function. Some commonly used functions are discussed below.
Only which team wins matters (i.e. which team has the higher score); the actual scores do not.
R_g^{WL} = \begin{cases} +1 & \text{if } S_h > S_a \\ \;\;\,0 & \text{if } S_h = S_a \\ -1 & \text{if } S_h < S_a \end{cases}
This system is often referred to as the JWB system (Just Win Baby). Effectively this is how many fans view the game outcome, since the only thing that matters is whether you win or not, not how you win or by how much.
The result of the game is defined as the difference between the scores of the two teams:
R_g^{SD} = \Delta S_{ha} = S_h - S_a
In football this system is often referred to as the BOMB Index (Bowden - Osborne Memorial Blowout Index), since it is perceived that Florida State and Nebraska used to run up the score against weak opponents, which helps their ranking in this system.
A modification of the SD system, where the score difference is truncated at some value ΔSmax in order to avoid too heavy an emphasis on games where one team ''runs up'' the score:
R_g^{SDt} = \Delta S_t = \mathrm{sign}(\Delta S_{ha}) \, \min\!\left( \left| \Delta S_{ha} \right| , \Delta S_{max} \right)
In football typical values of the maximum point spread ΔSmax range from 21 to 35. ΔSt is called the truncated point spread. Please note that the game outcome now also depends on the game-independent parameter ΔSmax.
A mathematically more elegant way of providing this cutoff is to use the hyperbolic tangent function

R_g = \Delta S_{max} \tanh\!\left( \frac{\Delta S_{ha}}{\Delta S_{max}} \right)

which behaves like ΔSha for small point spreads and approaches ±ΔSmax smoothly for large ones.
Since both the pure WL and SD systems tend to favor a particular, but maybe extreme, view of the game outcome, other systems have introduced a simple linear combination of the two:
R_g^{WLSD} = B_w R_g^{WL} + \Delta S_{ha}
In football typical values for the ''bonus'' Bw for a win are 50-100. For Bw = 0 the system reduces to the SD system, and for Bw >> |ΔS| it is identical to the WL system.
Instead of forming the difference between the scores, as in the SD system, the scoring ratio is used as the game outcome.
R_g^{SR} = \frac{S_h - S_a}{S_h + S_a}
In this case -1 ≤ RgSR ≤ 1, with |RgSR| = 1 if the losing team does not score any points (a shut-out). The rare case of Sh = Sa = 0 is not defined in this system, and it can therefore not be used in sports where this result is possible.
All the above mentioned methods for evaluating the outcome of a game have their virtues. It is therefore natural to form an outcome function that combines all of them. The simplest way of doing this is by forming a linear combination of the three types of outcome:
R_g^{WDR} = B_w R_g^{WL} + \Delta S_t + B_r R_g^{SR}
There are three parameters in this approach: 1) the win bonus Bw, 2) the maximum point spread ΔSmax, and 3) the scoring ratio weight factor Br. They are, however, not independent, since a common scaling of all of them will lead to the same ranking. Since the sum of the three parameters is equal to the maximum value of the game outcome function Rgmax, it is convenient to constrain the three parameters by choosing Rgmax to have a convenient value like 100.
B_w + \Delta S_{max} + B_r = R_g^{max} = 100
In this paper we will choose the following values
|
A subsequent paper will discuss techniques for choosing the optimal values of (Bw, ΔSmax, Br).
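To make the combined outcome function concrete, the following C sketch evaluates RgWDR from the final scores. The function names and the parameter values in main are hypothetical illustrations of mine only; they are not the values actually used in this report.

#include <stdio.h>

/* sign of the home-away point spread: +1, 0 or -1 (the WL outcome) */
static double WinLoss( int Sh, int Sa )
{
    return ( Sh > Sa ) ? 1.0 : ( Sh < Sa ) ? -1.0 : 0.0;
}

/* point spread truncated at +-DSmax (the truncated SD outcome) */
static double TruncatedSpread( int Sh, int Sa, double DSmax )
{
    double ds = (double)( Sh - Sa );
    if ( ds >  DSmax ) ds =  DSmax;
    if ( ds < -DSmax ) ds = -DSmax;
    return ds;
}

/* scoring ratio (Sh - Sa)/(Sh + Sa); undefined for a 0-0 game */
static double ScoreRatio( int Sh, int Sa )
{
    return (double)( Sh - Sa ) / (double)( Sh + Sa );
}

/* combined WDR outcome: Bw * WL + truncated spread + Br * SR */
double OutcomeWDR( int Sh, int Sa, double Bw, double DSmax, double Br )
{
    return Bw * WinLoss( Sh, Sa )
         + TruncatedSpread( Sh, Sa, DSmax )
         + Br * ScoreRatio( Sh, Sa );
}

int main( void )
{
    /* hypothetical parameters obeying Bw + DSmax + Br = 100 */
    double Bw = 60.0, DSmax = 30.0, Br = 10.0;
    printf( "R(35,14) = %.1f\n", OutcomeWDR( 35, 14, Bw, DSmax, Br ) );
    return 0;
}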
The outcome prediction function Pg(rh, ra | a) estimates the outcome of a game g between the two teams th and ta based on their strength ratings rh and ra, respectively. In addition P can depend on a number of additional parameters, depending on the choice of game outcome function Rg. In case the WDR game outcome function RgWDR is used, the parameter vector is a = (Bw, ΔSmax, Br). In addition other parameters, like the home field advantage Bh, can be added to the parameter set if needed. Actual choices of P will be discussed later in this paper, but first we will outline the method for determining the strength ratings ri independently of the functional expression of P.
During a season with NG games, a total of NG measurements of the NT strength ratings ri is performed, and we want to determine the vector r so that the difference between the game result Rg and the prediction (or hypothesis) Pg is as small as possible for as many games as possible. This can be done in a number of ways. We choose to minimize the sum of the squares of the differences between the game results and the predictions. This is the so-called Least Squares Method.
minimize: \quad \chi^2(\vec{r}) = \sum_{g=1}^{N_G} \left[ \frac{R_g(\vec{a}) - P_g(\vec{r}\,|\,\vec{a})}{\sigma_g} \right]^2
The measurement error σg for each game will be discussed later.
Other methods are based on the maximum norm ......
If we furthermore assume that the outcome prediction function depends linearly on the strength ratings ri, we can use the general framework for linear least squares fits. This is a very strong assumption, and a later paper will discuss non-linear approaches. We will furthermore assume that Pg does not depend on a, but only depends on the difference between the strength ratings of the two teams participating in the game
P_g = r_h - r_a
This reduces the problem to the minimization of
\chi^2(\vec{r}) = \sum_{g=1}^{N_G} \left[ \frac{R_g - \left( r_{h(g)} - r_{a(g)} \right)}{\sigma_g} \right]^2
Following the standard χ² nomenclature we introduce the design matrix A with elements
A_{g,i} = \frac{\delta_{i,h(g)} - \delta_{i,a(g)}}{\sigma_g}
the weighted result vector b with elements
b_g = \frac{R_g}{\sigma_g}
and the strength parameter vector r. In matrix notation the problem can now be written as
\mathrm{minimize:} \quad \chi^2(\vec{r}) = \left| A \vec{r} - \vec{b} \right|^2
where the | | 2 symbol indicates the Euclidean norm in the vector space spanned by all the games.
Using the Singular Value Decomposition algorithm the vector r that minimizes χ² is
\vec{r} = V \, \mathrm{diag}(1/v_j) \, U^T \vec{b}
where U, V are the SVD matrices and v is the singular value vector defined as
A = U \, \mathrm{diag}(v_j) \, V^T

or, in component form,

A_{g,i} = \sum_{j} U_{g,j} \, v_j \, V_{i,j}
We obtain the U and V matrices and the v vector using the routines described in Press et al. [1992].
The advantage of using the linear χ² method is speed, since it only takes a few seconds on a normal PC to solve for the rankings. The SVD algorithm is the standard tool for solving linear χ² problems due to its robustness. For further details see Press et al. [1992].
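For completeness, the sketch below shows how the design matrix A and the result vector b can be assembled from the game results. The function and variable names are my own; the SVD solve itself is delegated to the routines of Press et al. [1992] or an equivalent library and is not reproduced here.

/* Build the design matrix A (Ng x Nt, stored row-major) and the result
   vector b (length Ng) for the linear model P_g = r_home - r_away.
   home[g] and away[g] are 0-based team indices, R[g] is the game outcome
   and sigma[g] its assigned measurement error. */
void BuildDesignMatrix( int Ng, int Nt,
                        const int home[], const int away[],
                        const double R[], const double sigma[],
                        double *A, double *b )
{
    for ( int g = 0; g < Ng; g++ )
    {
        double w = 1.0 / sigma[g];          /* game weight w_g = 1/sigma_g */

        for ( int i = 0; i < Nt; i++ )      /* most entries are zero ...   */
            A[ g*Nt + i ] = 0.0;

        A[ g*Nt + home[g] ] =  w;           /* ... except +w_g for the home team */
        A[ g*Nt + away[g] ] = -w;           /*     and  -w_g for the away team   */
        b[ g ]              =  w * R[g];
    }
}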
The game weight factors, defined as
w_g = \frac{1}{\sigma_g}
provide the option to let various games influence the rankings differently. It is very common in other ranking models to make the weights time-dependent
w_g = w_g( T - T_g )
where T is the time at which the ranking is performed. This will put more emphasis on games played recently. This method is, however, more appropriate for prediction models than for ranking models. The current ranking model does not incorporate any time-dependence in the weights, with the exception of the initial period, as explained later.
If two teams of very similar strength are playing each other, we consider the outcome of this game a ''good'' measurement of the relative strengths of the two teams. In contrast, we consider mis-matched games between opponents with very different strengths as ''bad'' measurements. In mathematical terms this implies that we will associate a larger weight wg (or equivalently a smaller uncertainty σg) to games between teams where the difference between the rankings rh and ra is small. This can, however, only be done in an iterative procedure, since the rankings rh and ra are not known initially. We therefore use the following method:
Initial values of the ranking vector r(1) are determined by assuming wg(1) = 1. Afterwards the χ² system is solved again, but this time with the following weights
w_g^{(2)} = \frac{1}{1 + a_w \left| r_h^{(1)} - r_a^{(1)} \right|}
where aw is determined from the condition
1 + a_w \left( r_{max}^{(1)} - r_{min}^{(1)} \right) = b_w
In other words, we consider a game between the best and the worst team to have a measurement error bw times larger than when two completely equal teams play. bw is a free parameter in this ranking model. The numerical examples shown later have used the value bw = 3.
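A minimal sketch of this reweighting step, assuming the simple rational form written above for the second-iteration weights (any other functional form satisfying the same boundary condition would work equally well):

#include <math.h>

/* Second-iteration weight for a game between teams with first-iteration
   ratings rh and ra.  Two equally rated teams get weight 1; the game
   between the highest and lowest rated teams gets weight 1/bw, i.e. a
   measurement error bw times larger. */
double GameWeight( double rh, double ra, double rmin, double rmax, double bw )
{
    double aw = ( bw - 1.0 ) / ( rmax - rmin );
    return 1.0 / ( 1.0 + aw * fabs( rh - ra ) );
}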
Early in the season, when only few games have been played, the design matrix A has a rank smaller than NT. This has the effect that the relative rankings of many sub-sets of teams are undetermined. For NCAA 1-A division football teams, where NT = 112, the number of independent sub-sets of teams as a function of the number of weeks played is shown in the table below (based on the 1996-98 seasons):
                Number of independent sets
        week      1996      1997      1998
          1        109       104       104
          2         78        78        65
          3         36        36        18
          4          3         2         1
          5          1         1         1
Only after 5 playing weeks will it be possible to obtain a solution where all teams are ''connected''. For NCAA 1-A football the onset of full connectedness seems empirically to coincide with the condition that the average number of games per team is 2.5 - 3.0, or a total of 139 to 170 games between the 112 teams (only games where both teams are in division 1-A count).
It is, however, often required to obtain a ranking early in the season, before full connectedness has been obtained. Since we require that only the final results of games can be used in the ranking, this can only be achieved by using game results from the previous season(s). So early in the season, results from the previous season are also included in the ranking, but with a decreasing weight factor. The total weight factor on each game is
w_g^{total} = w_g \, w_s
where wg is defined above and
w_s = \begin{cases} 1 & \text{for current-season games} \\ 1 - \langle N_{gt} \rangle / R_{cut} & \text{for previous-season games, while } \langle N_{gt} \rangle < R_{cut} \\ 0 & \text{for previous-season games, once } \langle N_{gt} \rangle \ge R_{cut} \end{cases}
where ⟨Ngt⟩ is the average number of games played by each team in the current season and Rcut = 4.5 represents a cut-off value for the inclusion of previous seasons' games.
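A corresponding sketch of the season weight, again using the linear ramp written above; this should be read as my interpretation of the scheme rather than a verbatim transcription of the original formula:

/* Season weight w_s for a game: current-season games always carry full
   weight; previous-season games are ramped down linearly (assumed form)
   and dropped once the average number of games per team reaches Rcut. */
double SeasonWeight( int previous_season, double avg_games, double Rcut )
{
    if ( !previous_season )        /* game belongs to the current season  */
        return 1.0;
    if ( avg_games >= Rcut )       /* enough current-season data: drop it */
        return 0.0;
    return 1.0 - avg_games / Rcut; /* linear ramp, assumed form           */
}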
Currently it is cumbersome to obtain NCAA results, especially for the lower divisions. I would therefore like to propose that the NCAA post results of all NCAA games (football, basketball, tennis, etc.) in a standard ASCII based format that can be used directly as input to ranking programs.
For each sport two files should be created for each season: a team file and a game file.
The Team file should contain the following information on each team:
The format of the file should be ( generic C printf statement ):
fprintf( TeamFile, " %5d %-25s %-35s %-25s\n",
         Index, Name, Conference, Division );
The Game file should contain the following information on each game:
The format of the file should be ( generic C printf statement ):
fprintf( GameFile,
         " %4d %2d %2d %-25s %3d - %-25s %3d %5d %5d %1d\n",
         Year, Month, Day,
         AwayTeamName, AwayTeamScore,
         HomeTeamName, HomeTeamScore,
         AwayTeamIndex, HomeTeamIndex, FieldCode );
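For completeness, a minimal compilable example that writes one record in the proposed game file format; the file name, team names, scores, indices and field code below are made-up placeholders used only to illustrate the format.

#include <stdio.h>

int main( void )
{
    FILE *GameFile = fopen( "games_1998.txt", "a" );   /* hypothetical file name */
    if ( GameFile == NULL )
        return 1;

    /* all values below are placeholders, shown only to illustrate the format */
    fprintf( GameFile,
             " %4d %2d %2d %-25s %3d - %-25s %3d %5d %5d %1d\n",
             1998, 11, 14,
             "Away State",  17,
             "Home Tech",   24,
             101, 57, 1 );

    fclose( GameFile );
    return 0;
}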