MARCH MADNESS

Sanjit_3282
Nov 1, 2018

The competition, sponsored by Deloitte, involved predicting the outcomes of all 2,278 possible games between the 68 teams in the 2018 NCAA Men’s Basketball Tournament. We built a predictive model using R and Python to predict the result of each game along with the probability of team 1 winning. The model attained an accuracy of 72% and a log-loss of 0.66 for the 2018 tournament, and our team (Unofficial Intelligence) won 3rd prize in the competition.
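For readers unfamiliar with the two metrics, here is a minimal Python sketch of how the game count, accuracy, and log-loss can be computed. It is an illustration only, not our competition code: the function names and the four sample predictions are made up.

```python
import math

# Every unordered pair of the 68 tournament teams is a possible game.
n_teams = 68
possible_games = math.comb(n_teams, 2)  # 68 * 67 / 2


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 or 1
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def accuracy(y_true, p_pred, threshold=0.5):
    """Share of games where the predicted favourite actually won."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


# Made-up example: 1 means team 1 won, p is the predicted probability of that.
y = [1, 0, 1, 1]
p = [0.8, 0.3, 0.6, 0.4]
print(possible_games, accuracy(y, p), round(log_loss(y, p), 3))
```

For scale, predicting 0.5 for every game gives a log-loss of ln 2 ≈ 0.693, so a lower value means the probabilities carry real information.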

We were provided with March Madness game data for the years 2002-2017, covering game, coaching, polls, location, and RPI details. We decided to use logistic regression to predict the outcomes. Along with the data provided, we included an external variable, the Sagarin rating, to boost model performance. We scraped this rating using the Beautiful Soup package in Python.

What set our approach apart was that we built three separate train/test sets (75/25 split) and averaged the results across all three to assess the model's accuracy. We created three sets to avoid overfitting to a single test set, which could be skewed by an outlier season such as 2012, with its many upsets. The three sets were created as follows:

· Seasons 2002, 2003, 2016, 2017: the first two and last two years held out for testing

· Seasons 2004, 2008, 2012, 2016: four evenly dispersed years held out for testing

· A randomly selected 25% of the data records held out for testing
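The three splits above can be sketched in Python as follows. This is a rough illustration under assumptions, not our actual code: the `games` records, the helper names, and the random seed are all hypothetical, and each season is given the same number of games so that four seasons out of sixteen come to exactly 25%.

```python
import random

seasons = list(range(2002, 2018))  # 2002 through 2017, 16 seasons

# Hypothetical game records: (season, game_id), four games per season.
games = [(s, g) for s in seasons for g in range(4)]


def split_by_season(records, test_seasons):
    """Hold out every game from the given seasons; train on the rest."""
    test_seasons = set(test_seasons)
    train = [r for r in records if r[0] not in test_seasons]
    test = [r for r in records if r[0] in test_seasons]
    return train, test


def split_random(records, test_fraction=0.25, seed=0):
    """Hold out a random fraction of records, ignoring season boundaries."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


splits = {
    "first_and_last": split_by_season(games, [2002, 2003, 2016, 2017]),
    "evenly_dispersed": split_by_season(games, [2004, 2008, 2012, 2016]),
    "random_25pct": split_random(games),
}


def average_metric(values):
    """Average one metric (e.g. log-loss) across the three test sets."""
    return sum(values) / len(values)
```

A model would be fitted on each `train` part, scored on the matching `test` part, and the three scores passed to `average_metric`.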

We built models in R with different variable combinations and compared their performance. Eventually, we settled on a cross-validated lasso logistic regression. It gave us the best log-loss and cut down on overfitting compared to the other models.
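To show what cross-validated lasso logistic regression does, here is a small standard-library Python sketch. It is not the R code we used: the data are synthetic, the penalty grid is arbitrary, and the fit uses plain proximal gradient descent, chosen here only because it is short. All names are made up for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal step for the L1 penalty: shrink w toward zero by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def predict(X, w, b):
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) for x in X]


def fit_lasso_logistic(X, y, lam, lr=0.5, n_iter=300):
    """L1-penalised logistic regression via proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        errs = [p - t for p, t in zip(predict(X, w, b), y)]
        grad_b = sum(errs) / n
        grad_w = [sum(e * x[j] for e, x in zip(errs, X)) / n for j in range(d)]
        b -= lr * grad_b  # the intercept is not penalised
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def log_loss(y, p, eps=1e-15):
    p = [min(max(pi, eps), 1 - eps) for pi in p]
    return -sum(t * math.log(q) + (1 - t) * math.log(1 - q)
                for t, q in zip(y, p)) / len(y)


def cv_log_loss(X, y, lam, k=4):
    """Average held-out log-loss over k interleaved folds."""
    losses = []
    for f in range(k):
        held = set(range(f, len(X), k))
        X_tr = [x for i, x in enumerate(X) if i not in held]
        y_tr = [t for i, t in enumerate(y) if i not in held]
        X_te = [x for i, x in enumerate(X) if i in held]
        y_te = [t for i, t in enumerate(y) if i in held]
        w, b = fit_lasso_logistic(X_tr, y_tr, lam)
        losses.append(log_loss(y_te, predict(X_te, w, b)))
    return sum(losses) / k


# Synthetic data: feature 0 drives the outcome, feature 1 is pure noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(120)]
y = [1 if rng.random() < sigmoid(2.0 * x[0]) else 0 for x in X]

lambdas = [0.3, 0.1, 0.03, 0.01]  # arbitrary penalty grid
scores = {lam: cv_log_loss(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
w, b = fit_lasso_logistic(X, y, best_lam)
```

The penalty shrinks weak coefficients to exactly zero, which is how lasso trims variables, and cross-validation picks the penalty strength by held-out log-loss rather than by fit on the training data.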

Below is the video of our report for reference. It was a team competition, and my team members were Even DeCastros, Nakul Kaura and Yicong Hu.

#MarchMadness #CollegeBasketBall

SANJIT SISODIYA

DATA ENTHUSIAST

"Maybe stories are just data with a soul."
