
DATA ENTHUSIAST

"Maybe stories are just data with a soul."

I am Sanjit Sisodiya, a data enthusiast. I am passionate about deriving insights from data and making stories speak through visualizations and dashboards. Over the past four years of work experience, I have developed a belief in the process of working with data > extracting information > taking action. Here is a blog of some of my work involving machine learning and predictive analytics.

TECHNICAL SKILLS:

  • SQL

  • R

  • Python

  • Tableau

  • SPSS

  • Google Analytics

 


MARCH MADNESS

  • Writer: Sanjit_3282
  • Nov 1, 2018

The competition, sponsored by Deloitte, involved predicting the outcomes of all 2,278 possible games between the 68 teams in the 2018 NCAA Men’s Basketball Tournament. We built a predictive model using R and Python to predict the result of each game along with the probability of team 1 winning. The model attained an accuracy of 72% and a log-loss of 0.66 for the 2018 tournament, and our team (Unofficial Intelligence) won 3rd prize in the competition.
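For readers unfamiliar with the two metrics, here is a minimal Python sketch of how accuracy and log-loss can be computed from predicted win probabilities, and of where the 2,278 figure comes from (every unordered pairing of 68 teams). The toy outcomes and probabilities are made up for illustration and are not competition data. For context, a model that always predicts 0.5 scores a log-loss of ln 2, roughly 0.693.

```python
import math
from itertools import combinations


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) can never occur
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, p_pred, threshold=0.5):
    """Share of games where the predicted winner (p >= threshold) matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


# Every unordered pairing of 68 teams: C(68, 2)
n_pairings = sum(1 for _ in combinations(range(68), 2))
print(n_pairings)  # 2278

# Toy example (illustrative values only): 1 means team 1 won
y = [1, 0, 1, 1]
p = [0.8, 0.3, 0.6, 0.4]
print(accuracy(y, p))  # 0.75
print(round(log_loss(y, p), 4))
```

Log-loss rewards well-calibrated probabilities, so a confident wrong prediction is penalized far more heavily than a hesitant one.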


The UNOFFICIAL INTELLIGENCE team

We were provided with data on March Madness games from 2002 to 2017, covering game, coaching, poll, location, and RPI details. We decided to use logistic regression to predict the outcomes. Along with the data provided, we included an external variable, the Sagarin rating, to boost model performance. These ratings were scraped using the Beautiful Soup package in Python.
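As a rough illustration of the scraping step, the sketch below parses team ratings out of an HTML table with Beautiful Soup. The markup, the `ratings` table id, the team names, and the numbers are all hypothetical placeholders, since the layout of the actual ratings page is not described here; a real scraper would need to be adapted to that page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not the real ratings page.
SAMPLE_HTML = """
<table id="ratings">
  <tr><th>Team</th><th>Rating</th></tr>
  <tr><td>Team A</td><td>95.12</td></tr>
  <tr><td>Team B</td><td>88.40</td></tr>
</table>
"""


def parse_ratings(html):
    """Return a {team name: rating} dict from a two-column ratings table."""
    soup = BeautifulSoup(html, "html.parser")
    ratings = {}
    for row in soup.find("table", id="ratings").find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # skip the header row, which uses <th> cells
        team = cells[0].get_text(strip=True)
        ratings[team] = float(cells[1].get_text(strip=True))
    return ratings


ratings = parse_ratings(SAMPLE_HTML)
print(ratings)
```

The resulting dictionary could then be joined onto the game records by team name and season before modeling.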


What set our model apart was its validation scheme: we built three separate train/test sets (each a 75/25 split) and averaged the results across the three to check the accuracy of our model. Using three sets helped us avoid overfitting to a single set of test data, which would be susceptible to an outlier season like 2012, which had many upsets. The three sets were created as follows:

· Seasons 2002, 2003, 2016, 2017: first and last two years for testing

· Seasons 2004, 2008, 2012, 2016: four evenly dispersed years for testing

· A randomly selected 25% of data records for testing
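The three splits listed above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the competition code: the `season` field, the helper names, and the `fit_and_score` callback (which would train a model on one part and return a score on the other) are all hypothetical. Holding out 4 of the 16 seasons gives roughly the same 75/25 proportion as the random split.

```python
import random


def season_split(games, test_seasons):
    """Hold out every game from the given seasons; train on the rest."""
    test = [g for g in games if g["season"] in test_seasons]
    train = [g for g in games if g["season"] not in test_seasons]
    return train, test


def random_split(games, test_fraction=0.25, seed=0):
    """Hold out a random share of the records, regardless of season."""
    shuffled = games[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def average_score(games, fit_and_score):
    """Average a train/test score over the three splits described in the post."""
    splits = [
        season_split(games, {2002, 2003, 2016, 2017}),  # first and last two years
        season_split(games, {2004, 2008, 2012, 2016}),  # evenly dispersed years
        random_split(games),                            # random 25% of records
    ]
    scores = [fit_and_score(train, test) for train, test in splits]
    return sum(scores) / len(scores)


# Toy data: 10 placeholder games per season, 2002 through 2017
games = [{"season": s, "game_id": i} for s in range(2002, 2018) for i in range(10)]
train, test = season_split(games, {2002, 2003, 2016, 2017})
print(len(train), len(test))  # 120 40
```

Averaging over differently constructed test sets means no single unusual season dominates the evaluation.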


We built our models in R with different variable combinations and compared their performance. Eventually, we settled on a cross-validated lasso logistic regression. It gave us the best log-loss and cut down on overfitting compared to the other models.
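To show what "cross-validated lasso logistic regression" means mechanically, here is a small dependency-free Python sketch. It is not the R model we used: it fits an L1-penalized logistic regression by proximal gradient descent and picks the penalty strength `lam` by k-fold cross-validation on held-out log-loss. The toy data, the penalty grid, the fold assignment, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, amount):
    """Shrink toward zero; this is what lets lasso set weights exactly to 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def predict(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_lasso_logistic(X, y, lam, lr=0.1, n_iter=300):
    """Proximal gradient descent on mean log-loss plus lam * sum(|w|)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def log_loss(y_true, p_pred, eps=1e-15):
    clipped = [min(max(p, eps), 1 - eps) for p in p_pred]
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, clipped)) / len(y_true)


def cross_validate_lambda(X, y, lambdas, k=3):
    """Return the (lam, mean held-out log-loss) pair with the lowest loss."""
    best_lam, best_loss = None, float("inf")
    for lam in lambdas:
        fold_losses = []
        for fold in range(k):
            train = [i for i in range(len(X)) if i % k != fold]
            test = [i for i in range(len(X)) if i % k == fold]
            w, b = fit_lasso_logistic([X[i] for i in train], [y[i] for i in train], lam)
            preds = [predict(w, b, X[i]) for i in test]
            fold_losses.append(log_loss([y[i] for i in test], preds))
        mean_loss = sum(fold_losses) / k
        if mean_loss < best_loss:
            best_lam, best_loss = lam, mean_loss
    return best_lam, best_loss


# Toy data: feature 0 decides the outcome, feature 1 is pure noise
X = [[x0, x1] for x0 in (-1.0, -0.6, -0.2, 0.2, 0.6, 1.0) for x1 in (-1.0, 1.0)]
y = [1 if x[0] > 0 else 0 for x in X]
best_lam, best_loss = cross_validate_lambda(X, y, [0.0, 0.01, 0.1, 10.0])
print(best_lam, fit_lasso_logistic(X, y, best_lam))
```

A large enough penalty drives every weight to exactly zero, while smaller penalties keep only the variables that earn their place, which is how the lasso limits overfitting across many candidate variables.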


Below is the video of our report for reference. It was a team competition, and my teammates were Even DeCastros, Nakul Kaura, and Yicong Hu.



#MarchMadness #CollegeBasketBall



 
 
 




©2018 by Sanjit Sisodiya. Proudly created with Wix.com
