Last year, I had mixed results. On the positive side, my bracket out-competed the members of my lab. However, in the NCAA bracket group my family ran, my mom - who I am pretty sure made rather random choices - beat my bracket.
This year, I'm hoping to improve on that (limited) success. Here's my strategy:
One of the most basic statistics about a team is their win-loss record. For example, this year Kansas State had a win-loss record of 21-10. Intuitively, we expect teams with favorable win-loss records to beat those with poor win-loss records.
Win-Loss record isn't the whole story, though, because all wins are not created equal. Defeating a bad team is easy (and tells us very little), but defeating a good team is indicative of skill. Therefore, a simple (though recursive) definition of a good team is: a team that is capable of winning games against other good teams.
Our major project, then, is to identify good teams without any a priori knowledge of team quality.
To find these teams, I modeled the win-loss history as a weighted, directed network. To do this, I analyzed 5242 games played between Division I teams in the 2011-2012 season.
Win-loss history is simply a list of game results (e.g. K-State beat Charleston Southern 72-67, then beat Loyola 74-61, then ... lost to West Virginia 80-85, ... etc.). A directed network is simply a collection of nodes (representing teams) joined by arrows, called directed edges. Every time a team defeated another, an arrow was drawn from the losing team's node to the winning team's node to represent that game.
Note though that all wins are not created equal. A game decided by a 20-point margin ("nuked from orbit", as I always say) is a lot different than a game decided by a single point. In the former case, one team is clearly superior, while in the latter, the teams are roughly equal. So let's revise our definition of a good team to be: a team that is capable of winning games against other good teams, by a large margin.
To account for this, each arrow is weighted between 0 and 1 to describe the magnitude of the victory. Winning a game by a large margin results in an arrow with weight 1 being drawn from the loser to the winner. By contrast, for games that are decided by only a few points, we'll draw arrows from both loser to winner AND winner to loser, both being weighted about 0.5 (slightly less for the loser).
Mathematically, edge weights can be calculated using a logistic function centered at zero. We'll need to choose an operational definition of "winning by a large margin" (which corresponds to choosing how large a margin is required before a weight of 0.9 is assigned to an edge). Some experimentation (guided by a tiny amount of intuition) suggests that about 6 points is enough to definitively resolve that one team is better than another.
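A minimal sketch of this weighting scheme, assuming the logistic scale is tuned so a 6-point margin maps to a weight of 0.9 (the exact scale I used may differ):

```python
import math

# Scale chosen (an assumption here) so that edge_weight(6) == 0.9,
# matching the "definitive win" threshold described above.
SCALE = 6 / math.log(9)  # ~2.73

def edge_weight(margin):
    """Weight of the loser -> winner arrow for a given point margin.

    The reverse (winner -> loser) arrow for close games gets the
    complementary weight, edge_weight(-margin) == 1 - edge_weight(margin).
    """
    return 1 / (1 + math.exp(-margin / SCALE))
```

A 1-point squeaker gives the winner an arrow of weight ~0.59 and the loser one of ~0.41, both near 0.5; a 20-point "nuked from orbit" blowout gives the winner ~0.999, so essentially only one arrow is drawn.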
To find good teams, I computed each team's eigenvector centrality in the network. For those who like math (if not, skip ahead to the next paragraph): any network can be represented as an adjacency matrix. The dominant eigenvector of this matrix can be found using power iteration. The score assigned to the team associated with the i'th row of the adjacency matrix is given by the i'th element of the dominant eigenvector.
A simplified (and mostly accurate) way to think about this is that every team starts out with an equal number of "quality points". Every time the computer says "Go", teams distribute their quality points to all the teams that beat them. Thus, good teams get more quality points than they gave away (and vice versa for bad teams). After a few rounds of this procedure, every team's quality-point total converges.
[ Detail note: to prevent all the quality points from getting totally absorbed by undefeated [or nearly-so] teams, every team gives a small fraction of their points to every other team, just to keep some points in circulation. ]
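The quality-points procedure above can be sketched as power iteration on the weighted adjacency matrix, with a small uniform "leak" playing the role of the keep-points-in-circulation trick. The four-team data below is a made-up illustration, not real season results, and the leak fraction is my assumption:

```python
import numpy as np

# A[i, j] is the weight of the arrow from team i (loser) to team j (winner).
teams = ["A", "B", "C", "D"]
A = np.array([
    [0.0, 0.9, 0.6, 0.0],   # A lost big to B, less decisively to C
    [0.0, 0.0, 0.0, 0.0],   # B is undefeated in this toy data
    [0.4, 0.9, 0.0, 0.0],   # C's close win over A leaves a 0.4 back-edge
    [0.9, 0.9, 0.9, 0.0],   # D lost big to everyone
])

def quality_scores(A, leak=0.15, iters=100):
    n = A.shape[0]
    # Row-normalize: each team hands out its points in proportion to edge
    # weights; an undefeated team (all-zero row) hands out uniformly.
    out = A.sum(axis=1, keepdims=True)
    P = np.where(out > 0, A / np.where(out > 0, out, 1), 1.0 / n)
    v = np.ones(n) / n
    for _ in range(iters):
        # Redistribute points, but leak a small fraction to everyone
        # so no team absorbs the whole pool.
        v = (1 - leak) * (P.T @ v) + leak / n
    return v / v.sum()

scores = quality_scores(A)
ranking = sorted(zip(teams, scores), key=lambda t: -t[1])
```

Here the undefeated team B ends up on top and D, which lost badly to everyone, ends up last, as the intuition suggests.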
Using this procedure, we can get a list of teams ranked in order of quality.
The top 10 ranked teams are (in order):
- Kentucky
- North Carolina
- Syracuse
- Missouri
- Michigan St.
- Ohio St.
- Belmont
- VCU
- Kansas
- Louisville
If you were wondering, the 5 lowest-scoring Division I teams - according to this particular analysis anyway - are (getting worse as you read down the list):
- Fairleigh Dickinson
- Binghamton
- Tenn-Martin
- Towson
- Nebraska - Omaha
If you're wondering where your favorite team is ranked, you can download the full ranked list:
Using this, we can predict the NCAA Tournament bracket using the simple heuristic that teams with a higher eigenvector-centrality score ("more quality points") beat those with a lower score. This produces what I call the Eigenbracket for the 2012 NCAA tournament.
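The bracket heuristic itself is trivial once the scores exist. A sketch, using made-up scores for illustration (not the actual computed centralities):

```python
# Hypothetical scores for illustration only; the real values come from
# the eigenvector-centrality computation on the full season network.
scores = {"Kentucky": 0.95, "Kansas": 0.80, "VCU": 0.78, "Belmont": 0.70}

def predict_winner(team_a, team_b, scores):
    """The team with more quality points advances."""
    return team_a if scores[team_a] >= scores[team_b] else team_b

def predict_round(matchups, scores):
    """Advance one round of the bracket: the predicted winner of each game."""
    return [predict_winner(a, b, scores) for a, b in matchups]

round_winners = predict_round([("Kentucky", "VCU"), ("Kansas", "Belmont")], scores)
```

Applying this round by round fills in the whole bracket deterministically.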
The ultimate test of the bracket will be - of course - how well it predicts the tournament outcomes. I tried to estimate the accuracy of my ranking method by using it to predict the results of all games across the whole season between either (a) all teams or (b) the top 80 teams. In both cases, the accuracy was around 75%.
This isn't the best testing methodology, because the test data is also being used to build the network. A better way to test is to leave some data out (say, 10% of all games played), re-build the network, and use those results to predict the outcomes of the games we held in reserve. I experimented with this a bit and think the results are similar (70-80% accuracy). This also seems to do a little better than an analysis that fails to take into account margin of victory; that strategy had about 65-75% accuracy.
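The holdout test above can be sketched generically. Here `build_scores` stands in for the centrality computation, and the sanity check at the bottom uses toy data with a naive win-count ranker, since the real network-building code isn't shown:

```python
import random

def holdout_accuracy(games, build_scores, frac=0.1, seed=0):
    """Hold out `frac` of the games, rank teams on the rest, then check
    how often the higher-ranked team actually won the held-out games.

    `games` is a list of (winner, loser, margin) tuples; `build_scores`
    is any function mapping a game list to a {team: score} dict (e.g.
    the eigenvector-centrality procedure described above).
    """
    rng = random.Random(seed)
    shuffled = games[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * frac))
    test, train = shuffled[:n_test], shuffled[n_test:]
    scores = build_scores(train)
    correct = sum(1 for winner, loser, _ in test
                  if scores.get(winner, 0) > scores.get(loser, 0))
    return correct / len(test)

# Sanity check on transitive toy data (A beats B beats C) with a naive
# win-count ranker standing in for the real centrality computation:
toy_games = [("A", "B", 10)] * 6 + [("B", "C", 5)] * 6 + [("A", "C", 8)] * 6

def win_counts(gs):
    s = {}
    for w, l, _ in gs:
        s[w] = s.get(w, 0) + 1
        s.setdefault(l, 0)
    return s

acc = holdout_accuracy(toy_games, win_counts, frac=0.2, seed=1)
```

On perfectly transitive data like this the held-out games are always predicted correctly; the interesting number is what happens on real, messier season data.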
In summary, network analysis can be applied to interesting problems to help make sense of huge datasets so that we can make informed predictions about the future.