Tuesday, March 13, 2012

EigenBracket 2012: Using Graph Theory to Predict NCAA March Madness Basketball

Last year one of the research associates in my lab asked if I wanted to make a bracket for the NCAA March Madness tournament.  Since I don't follow sports (other than what Jenny's interest in K-State sports exposes me to), I thought I wouldn't do very well just quasi-randomly picking teams.  Instead, I thought it would be a lot more fun to perform a mathematical analysis on the Win-Loss records for all the teams and use that data to "rationally" choose the bracket.

Last year, I had mixed results.  On the positive side, my bracket out-competed the members of my lab.  However, in the NCAA bracket group my family ran, my mom - who I am pretty sure made rather random choices - beat my bracket.

This year, I'm hoping to improve on that (limited) success.  Here's my strategy:

One of the most basic statistics about a team is their win-loss record.  For example, this year Kansas State had a win-loss record of 21-10.  Intuitively, we expect teams with favorable win-loss records to beat those with poor win-loss records.

Win-Loss record isn't the whole story, though, because all wins are not created equal.  Defeating a bad team is easy (and tells us very little), but defeating a good team is indicative of skill.  Therefore, a simple (though recursive) definition of a good team is: a team that is capable of winning games against other good teams.
 
Our major project, then, is to identify good teams without any a priori knowledge of team quality.
To find these teams, I modeled the win-loss history as a weighted, directed network.  To do this, I analyzed 5242 games played between Division I teams in the 2011-2012 season.

Win-loss history is simply a list of game results (e.g. K-State beat the Charleston South 72-67, then beat Loyola 74-61, then ... lost to West Virginia 80-85, ... etc).  A directed network is simply a connection of nodes (representing teams) and arrows connecting teams called directed edges.  Every time a team defeated another, an arrow was drawn from the losing team's node to the winning team's node to represent this game.

Note though that all wins are not created equal.  A game decided by a 20-point margin ("nuked from orbit", as I always say) is a lot different than a game decided by a single point.  In the former case, one team is clearly superior, while in the latter, the teams are roughly equal.  So let's revise our definition of a good team to be: a team that is capable of winning games against other good teams, by a large margin.

To account for this, each arrow is weighted between 0 and 1 to describe the magnitude of the victory.  Winning a game by a large margin results in an arrow with weight 1 being drawn from the loser to the winner.  By contrast, for games that are decided by only a few points, we'll draw arrows from both loser to winner AND winner to loser, both being weighted about 0.5 (slightly less for the loser).

Mathematically, edge weights can be calculated by using a a logistic function centered at zero.  We'll need to choose an operational definition of "winning by a large margin" (which corresponds to choosing how large a margin is required before a score of 0.9 is assigned to an edge).  Some experimentation (guided by a tiny amount of intuition) suggests that about 6 points is enough to definitively resolve that one team is better than another.

To find good teams, I computed each team's eigenvector centrality in the network.  For those who like math (if not, skip ahead to the next paragraph): any network can be modeled as an adjacency matrix.  The dominant eigenvector of this can be found by using power iteration.  The score assigned to the team associated with the i'th row of the adjacency matrix is given by the i'th element of the dominant eigenvector.

A simplified (and mostly accurate) way to think about this is that every team starts out with an equal number of "quality points".  Every time the computer says "Go", teams distribute their quality points to all the teams that beat them.  Thus, good teams get more quality points than they gave away (and vice versa for bad teams).  After a few rounds of this procedure, the quality points for every team approaches convergence.

[ Detail note: to prevent all the quality points from getting totally absorbed by undefeated [or nearly-so] teams, every team gives a small fraction of their points to every other team, just to keep some points in circulation. ]

Using this procedure, we can get a list of teams ranked in order of quality.

The top 10 ranked teams are (in order):
  1. Kentucky
  2. North Carolina
  3. Syracuse
  4. Missouri
  5. Michigan St.
  6. Ohio St.
  7. Belmont
  8. VCU
  9. Kansas
  10. Louisville
Note that the top 5 teams are all #1 or #2 seeded teams, but the algorithm did not know that a priori.  This provides a high-level validation to the algorithm.  Some other teams are seeded much lower.  For example, Belmont is a #14 seed, but has been talked about as an upset risk in their first game against Georgetown.  [ My analysis apparently does not think much of Georgetown, assigning it a rank of 65, which is in stark contrast to its #3 seed status ].

If you were wondering, the 5 lowest-scoring Division I teams - according to this particular analysis anyway - are (getting worse as you read down the list):
  1. Fairleigh Dickinson
  2. Binghamton
  3. Tenn-Martin
  4. Towson
  5. Nebraska - Omaha
All of these teams have bad win-loss ratios except Nebraska-Omaha.  Looking at their record, it seems Nebraska-Omaha played a large number of Division II (?) teams, which are not included in the analysis.  Because there is some dependence on overall number of games played (which cancels out for most teams, because most play ~ 30 Division I games), this probably artificially makes their score low, which highlights a (minor) limitation of the algorithm.

If you're wondering where your favorite team is ranked, you can download the full ranked list:


Using this, we can predict the NCAA Tournament bracket using the simple heuristic that teams with higher eigenvector centrality rank ("more quality points") beat those with lower rank.  This produces what I call the Eigenbracket for the 2012 NCAA tournament.


The ultimate test of the bracket will be - of course - how well it predicts the tournament outcomes.  I tried to estimate the accuracy of my ranking method by using it to predict the results of all games across the whole season between either (a) all teams or (b) the top 80 teams.  In both cases, the accuracy was around 75%.

This isn't the best testing methodology because the test data is also being used to build the network.  A better way to test is to leave some data out (say, 10% of all games played), re-build the network and use those results to predict the outcomes of the games we held in reserve.  I experimented with this a a bit and think the results are similar (70-80% accuracy).  This also seems to do a little better than an analysis that fails to take into account margin of victory.  That strategy had about 65-75% accuracy.

In summary, network analysis can be applied to interesting problems to help make sense of huge datasets so that we can make informed predictions about the future.

47 comments:

  1. wouldn't running the algorithm against past years' records and testing against past tournament results be the best possible test to tune the algorithm?

    ReplyDelete
    Replies
    1. Yes, I think so. I still have the dataset I assembled last year and originally planned to tune it using that data. I was working to get the bracket ready for today (before the first play-in games), so I didn't quite get to it yet.

      Delete
  2. You should strongly consider entering this into the StatGeekIdol Contest at TeamRankings.com... could win 1000 bucks. Awesome work.

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. Your approach, as you indicate, gives little weight to a team for winning a close game, however, a team that has a history of winning close games might have done so because of strong character. Couldn't that be an indicator of winning future close games?

    ReplyDelete
    Replies
    1. Yes, probably so!

      My idea in this case was to simply assign them both equal edges so that they could "share" their quality points, reasoning that teams that have close games are approximately equal in quality. The benefit you describe is partially accounted for by the model (because the winning team still gets a stronger edge weight), but I will have to think about how to include that idea for the future!

      Thanks for your suggestion!

      Delete
  5. Your methods are sound, but VU lose to Harvard? They defeated Kentucky for the SEC championship.

    ReplyDelete
  6. Any chance you'd be willing to share your data and/or source code?

    ReplyDelete
  7. Google 'Glicko'. It is used for ranking and predicting chess scorings.

    ReplyDelete
    Replies
    1. Thanks for the reference; I will check it out!

      Delete
  8. Have you tried adjusting the margin of victory that respresents a decisive victory? I would guess that by doing so, you can find a maximum accuracy for the model as a whole. I would also add one more adjustment factor -- home court. This is a major determinant in college basketball (see Arkansas's W/L at home and on the road). With some weighting of the home court advantage as a modifier to the "decisive victory" scale, I bet you could push this above 80% accuracy.

    ReplyDelete
    Replies
    1. Yes, I played around with the margin of victory thresholds a bit a monitored accuracy in predicting seasonal games. My initial guess was that ~6 points would be optimal. Some experiment suggested that somewhere between 5-7 is probably best. I left it at six to avoid trying to overfit the model to the data.

      Home court advantage is an interesting idea that I had not considered. I think maybe it could be incorporated into the weights somehow, possibly just by shifting the logistic curve a little. That way, home teams would need to win by a slightly larger margin to get the same benefit.

      Thanks for your comment!

      Delete
  9. It looks like really good mid-majors are weighted too highly. It will be interesting to see how this turns out.

    ReplyDelete
    Replies
    1. I think your comment is probably fairly accurate. My observation was that high-ranked and low-ranked teams remained (relatively) constant when I randomly deleted certain games from the win-loss history, but that mid-ranked teams exhibited substantially more variability.

      Probably this is reflective of some degree of overfitting among middle-of-the-road teams. A better way to build the network would be to build a bunch of networks using 90% of the data and averaging each team's rank order to produce a final list.

      Delete
  10. It seems like you have rediscovered the ELO rating system: http://en.wikipedia.org/wiki/Elo_rating_system. It is basically the same algorithm that is being used in googles page rank system. Graph theory has an amazing number of practical uses that are not immediately apparent until you ponder the problem.

    ReplyDelete
    Replies
    1. Thanks for your comment! I had heard about ELO before kind of in passing before, but haven't had the chance to look into it in much detail. I will give it a look though!

      Delete
  11. I had started doing something like this last week, but then quit because of the large about of time it was taking to copy all of the win-loss records for all of the teams. Did you copy this information manually or did you get it from somewhere that I could use for an analysis of my own.

    ReplyDelete
  12. I like this approach. I think a couple of factors could be incorporated for increased accuracy. There should be some consideration of stability between rankings. There are uncertain variances between games, where you would have a lower likelihood between 2 teams that were closely ranked. In this respect, you could assign probabilities for outcomes and have different outcomes with different likelihoods, which would be much more useful.

    ReplyDelete
    Replies
    1. btw, what do the numbers represent in the Ranked List?

      Delete
    2. Thanks for the suggestion! I'll have to think about that.

      The columns in the ranked ranked list are: Rank (#1, #2, #3, etc...), Eigenvector centrality score and team name.

      The eigenvector centrality score are all very low (Kentucky's node consumes most of them), despite redistributing them through the network. However, general trends in the rank order are pretty stable to changing the network a little, so I think they are meaningful. Also, one internal validation is that the top 5 teams correspond to top-seeded teams.

      Delete
  13. Cool idea! I hope you do well.

    Part of the fun is the "unpredictability" of the tourney, too. I don't think there is a way to predict injuries, "going cold" shooting, missed calls by referees, etc.

    ReplyDelete
    Replies
    1. Thanks! It should be a fun tournament.

      Delete
  14. I think you bring up a very interesting evaluation method. I am not familiar with all the math used here, but I think a good common sense question to ask is "does W-L record be reflect a team's ability?". While the answer to this question may be "yes", I think the teams performance (reflected in basketball statistics) could be a great indicator of tournament wins and losses. I know other math people (like KenPom), value in-game performance over end-game results. just something to consider

    ReplyDelete
    Replies
    1. I agree. One of my initial plans for this year was to try to use W/L history (without considering margin of victory) plus average free throw percentage on the idea that "free throws win games". However, I couldn't find an easy way to assemble those statistics, so I didn't pursue that this year.

      Thanks very much for your comment and suggestion!

      Delete
  15. Isn't this effectively what Professor Colley does as well? Colley doesn't account for margin of victory however- just wins and losses.

    http://www.colleyrankings.com/rank.html

    ReplyDelete
    Replies
    1. Thanks for the link! I'm always interested to see what others are doing. I haven't done much searching to see what was already known about this problem (since that's part of the fun!), but I will have a look to see what we do similarly and differently!

      Delete
  16. How do you account for team playing more than once? A weighted directed multigraph is hard to display on a adjacency matrix. Especially when the edges go the same direction from one vertex to another.

    ReplyDelete
    Replies
    1. I'm not sure what the BEST way to take this into account is, but what it currently does it just draws multiple arrows between those two nodes. The adjacency matrix takes the net arrow to simply be the weights of their sums.

      Probably a better strategy would be to calculate a single edge weight that summaries the all meetings of those two teams. However, I was worried that this would reduce the number of incoming nodes for teams in small conferences. Since reduced node degree is source of error in the analysis (see my comment on Nebraska-Omaha in the post), I used a more simple strategy for now.

      Delete
    2. I tried something a little different. Say team x beats team y by 20 in the first game , but team y beats team x by 2 in the next game, I averaged the 20 and (-2) , and had a arrow going from x to y with weight 9. This would eliminate the need of multiple paths.

      Delete
  17. Really interesting. Thanks for posting this and thanks for Slashdot for linking to it. During my commute (I take the subway) am going to try and recreate your algorithm and your results.

    The problem, however, that I see with your approach is that you're attempting to get a 1:1 ratio with the results of the rankings. That is, the rankings, a team ranked first in its bracket is presumably more likely to win more games than a team ranked 2nd, and so on...aren't accurate.

    Often, from my quick look at past winners and past brackets, the ranking of a team doesn't reflect how the team does in the tournament.

    So, in my opinion, a better approach would be to determine how wrong the rankings are. That is, if a team ranked first is actually ranked third, and so on...

    ReplyDelete
  18. This modeling does not include the effects of injuries and recoveries of key players. Example, a point guard sprains a wrist, missing close games that are lost or won narrowly. In one scenario, a backup player improves his skills as a result of getting more play time. By tournament time, the injured player returns to a stronger team. Scenario two is the replacement player(s) doesn't improve and the injured player doesn't recover before the tournament. Several top 25 times have had both scenarios

    ReplyDelete
    Replies
    1. Interesting comment about how injuries early on might affect team performance in the long run. I had not considered that.

      Delete
  19. When constructing your network, you should also factor in where the game was played. College basketball is notorious for having one of the largest home court advantages (although some of this is due to large programs being able to schedule more easy home games). Your new definition for a good team could be: a team that is capable of winning games against other good teams, by a large margin, playing on the road.

    When I make my bracket, I always check to see where the game will be played--many of the games turn out psudo-home games.

    I'll be interested to see if you can beat the Baysian rankings:
    http://www2.isye.gatech.edu/~jsokol/lrmc/

    ReplyDelete
    Replies
    1. Thanks; that's a great idea that I'll have to think about for the future!

      Thanks for the Bayesian ranking link; I'll take a look!

      Delete
  20. You should factor Fab Melo's ineligibility in the algorithm...

    ReplyDelete
    Replies
    1. I agree that this sort of data would be useful. I'm not currently sure how to model that. However, the more difficult part is finding a way to obtain that type of data in a machine-readable format.

      Delete
  21. Do you mind if I repost this on DZone.com? One of the topics I've been covering is NoSQL databases, and the readers at DZone like to read about graph theory. You can email me at egenesky@dzone.com. Thanks,

    Eric Genesky
    Community Curator
    DZone, Inc.
    egenesky@dzone.com

    ReplyDelete
    Replies
    1. Hello! Thanks for your interest! I'd really prefer that you link to the article here rather than copy the content. If you'd like though, please feel free to write a short (100-200 words?) introduction and with a "for more, see BioPhysEngr's blog at [link]!" (or something similar).

      Delete
  22. Can you provide a link to your bracket on ESPN? I'd like to follow it on their site, see how it does.

    ReplyDelete
    Replies
    1. I think this link should work:

      http://games.espn.go.com/tournament-challenge-bracket/en/entry?entryID=1339463

      Delete
  23. Replies
    1. Aye! I was sad to see them go down in the first round. Oh well; incentive to improve the analysis for next year!

      Delete
  24. This comment has been removed by the author.

    ReplyDelete
  25. Are the results of the regular season that you used to create the adj matrix available somewhere as an easy to read in file?

    ReplyDelete
    Replies
    1. To anyone interested, I found this page which has all of the results on it:
      http://www.ncaa.org/wps/wcm/connect/public/NCAA/Resources/Stats/M+Basketball/basketball.html

      Sadly, the database seems to break when you submit a query by date.

      Delete
  26. Any chance you'll do this for the 2013 Dance?

    ReplyDelete