An Attempt to Predict the NBA with a Machine Learning System Written in Python, Part II

I now want to talk about the model I discussed in the first piece in more technical terms. Better a year late than never, I suppose.

To predict the outcome of a match I used a logistic regression model. I compared it against models based on naive Bayes, neural networks, random forests and support vector machines. Every model was cross-validated and tuned to its best-performing hyperparameters.
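
To make the comparison concrete, here is a minimal sketch of k-fold cross-validation in plain Python. The post does not say which library, fold count or scoring setup was used, so everything here is an assumption: the helper names (`k_fold_indices`, `cross_val_accuracy`), the five folds, and the toy labels where 1 means the home team won. The "always pick the home team" model is only a stand-in to show where each candidate model would plug in.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_accuracy(fit, predict, X, y, k=5, seed=0):
    """Average held-out accuracy of a model over k folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k

# Stand-in model (hypothetical): ignores the data and always picks the home team.
def fit_always_home(X_train, y_train):
    return None

def predict_always_home(model, x):
    return 1

# Toy data: placeholder features and labels (1 = home team won).
X = [[0.0] for _ in range(10)]
y = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]

baseline_score = cross_val_accuracy(fit_always_home, predict_always_home, X, y)
print(f"baseline cross-validated accuracy: {baseline_score:.2f}")
```

Each candidate model would be passed through the same `cross_val_accuracy` call, once per hyperparameter setting, so that all of them are judged on matches they were not fitted on.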

The reason I stuck with logistic regression was that its prediction accuracy was on par with, or superior to, the more complex solutions, and the transparency of the model means you can use it for qualitative analysis. With logistic regression you can see which features matter most and how much weight each one carries. Also, logistic regression returns probabilities that are pretty accurate, which matters for knowing how confident you are in a prediction.

Features

The model consists of the following features, listed with their coefficients. Features were standardized before fitting the model:
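
As a reminder of what standardizing means, here is a small sketch that assumes plain z-scoring (subtract the mean, divide by the standard deviation). The post does not say which scaler was used, and the sample values and function names are made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of one feature column."""
    return mean(column), pstdev(column)

def standardize(value, mu, sigma):
    """Convert a raw value to a z-score using statistics learned on training data."""
    return (value - mu) / sigma

# Hypothetical effective field goal percentage differences from past matches.
efg_diff = [0.02, -0.01, 0.03, 0.00, -0.04]

mu, sigma = fit_scaler(efg_diff)
z_scores = [standardize(v, mu, sigma) for v in efg_diff]
print([round(z, 2) for z in z_scores])
```

Because every feature ends up on the same scale, the size of each coefficient can be read as a rough measure of how much that feature matters.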

home court advantage: 0.10218887
effective field goal percentage difference: 0.16118265
turnover percentage difference: -0.05958713
offensive rebound percentage difference: 0.07061777
free throws to field goals attempts difference: 0.03267933
distance traveled in last 7 days difference: -0.01459163
form in last seven matches difference: 0.0828436
offensive rating difference: 0.17885523
defensive rating difference: -0.33924331
effective field goal percentage difference (court*): 0.10808104
turnover percentage difference (court*): -0.09548481
offensive rebound percentage difference (court*): 0.07055131
free throws to field goals attempts difference (court*): 0.0748545
form in last seven matches difference (court*): -0.00486437
offensive rating difference (court*): 0.14822224
defensive rating difference (court*): -0.21756487

*Considering court situation means that, for example, if Team A is the host and Team B is the visitor, the effective field goal percentage difference would be: A's effective field goal percentage when playing at home minus B's effective field goal percentage on the road.
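
In code, the court-aware version of a feature could be computed roughly like this. The team splits below are invented numbers and the function name is hypothetical; only the "host at home minus visitor on the road" rule comes from the note above.

```python
def court_aware_difference(host_home_value, visitor_road_value):
    """Host's value in its home games minus the visitor's value in its road games."""
    return host_home_value - visitor_road_value

# Hypothetical effective field goal percentage splits.
team_a = {"home": 0.540, "road": 0.505}  # Team A is the host
team_b = {"home": 0.525, "road": 0.490}  # Team B is the visitor

efg_diff_court = court_aware_difference(team_a["home"], team_b["road"])
print(f"court-aware eFG% difference: {efg_diff_court:.3f}")
```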

The input of a given match would be the difference in each of these metrics between the two teams.
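
To show how these coefficients turn into a prediction, here is a sketch of the logistic regression formula applied to one match. Several things are assumptions: the post does not report the intercept, so it is set to 0 here; the dictionary keys are my own shorthand for the features listed above; the input values are invented; and I am assuming the output is the probability that the team the differences are computed for wins the match.

```python
import math

# Coefficients as listed in the post, in the same order (key names are shorthand).
COEFFICIENTS = {
    "home_court_advantage": 0.10218887,
    "efg_pct_diff": 0.16118265,
    "tov_pct_diff": -0.05958713,
    "oreb_pct_diff": 0.07061777,
    "ft_per_fga_diff": 0.03267933,
    "distance_7d_diff": -0.01459163,
    "form_7_diff": 0.0828436,
    "off_rating_diff": 0.17885523,
    "def_rating_diff": -0.33924331,
    "efg_pct_diff_court": 0.10808104,
    "tov_pct_diff_court": -0.09548481,
    "oreb_pct_diff_court": 0.07055131,
    "ft_per_fga_diff_court": 0.0748545,
    "form_7_diff_court": -0.00486437,
    "off_rating_diff_court": 0.14822224,
    "def_rating_diff_court": -0.21756487,
}
INTERCEPT = 0.0  # assumption: the post does not report the intercept

def win_probability(features, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Sigmoid of the intercept plus the weighted sum of standardized features.

    Features missing from the dict are treated as 0, i.e. the standardized average.
    """
    score = intercept + sum(
        weight * features.get(name, 0.0) for name, weight in coefficients.items()
    )
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical standardized inputs for one match.
match = {"home_court_advantage": 1.0, "off_rating_diff": 1.5, "def_rating_diff": -1.0}
p = win_probability(match)
print(f"estimated win probability: {p:.3f}")
```

The signs read the way you would expect: a better offensive rating pushes the probability up, while a higher (worse) defensive rating pulls it down.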

Performance

Let’s take the last Celtics ring season as an example: 2007-2008. This model would have correctly predicted 70% of the matches.

Is this number good? We would obviously expect a dummy model that chooses winners randomly to be correct around 50% of the time. However, we have a better benchmark at our disposal: Vegas money lines. A model that simply predicts that Vegas’ favorite will win would have been correct 69.8% of the time. Considering this is what bookies do for a living, spending a lot of resources building models for setting the odds and leveraging the power of markets, I’d argue that performing on par with Vegas is a great result.
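
The Vegas benchmark is easy to sketch. With American money lines, the side with the lower number is the favorite (for example -250 against +210). The matches below are invented, and the function names are hypothetical; the point is only to show how the "always pick the favorite" baseline and the model are scored on the same games.

```python
def vegas_favorite(home_line, away_line):
    """Return the side with the lower money line (ties fall to 'away' in this sketch)."""
    return "home" if home_line < away_line else "away"

def accuracy(picks, winners):
    """Share of picks that match the actual winners."""
    return sum(pick == winner for pick, winner in zip(picks, winners)) / len(winners)

# Hypothetical matches: (home money line, away money line, actual winner, model pick)
matches = [
    (-250, +210, "home", "home"),
    (+150, -170, "away", "away"),
    (-120, +100, "away", "away"),
    (-400, +320, "home", "home"),
    (+130, -150, "home", "away"),
]

winners = [m[2] for m in matches]
favorite_picks = [vegas_favorite(m[0], m[1]) for m in matches]
model_picks = [m[3] for m in matches]

favorite_accuracy = accuracy(favorite_picks, winners)
model_accuracy = accuracy(model_picks, winners)
print(f"favorite baseline: {favorite_accuracy:.1%}, model: {model_accuracy:.1%}")
```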

Interestingly, our model picked Vegas’ favorite 87.6% of the time and the underdog the other 12.4%. When it picked the underdog, it was right 51% of the time.
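
These numbers can be cross-checked against each other with a bit of arithmetic. The sketch below reads the 51% as the model's hit rate on its underdog picks, which is my interpretation, and uses the rounded figures from the post, so the results are approximate.

```python
favorite_share = 0.876     # share of matches where the model picks Vegas' favorite
underdog_share = 0.124     # share of matches where the model picks the underdog
model_accuracy = 0.70      # overall model accuracy reported in the post
underdog_hit_rate = 0.51   # assumption: accuracy on the underdog picks only

# Accuracy on the matches where the model sides with the favorite.
favorite_hit_rate = (
    model_accuracy - underdog_share * underdog_hit_rate
) / favorite_share

# Whenever the model's underdog pick loses, the favorite won, so the
# favorite-only baseline gets those matches right.
implied_vegas_accuracy = (
    favorite_share * favorite_hit_rate
    + underdog_share * (1 - underdog_hit_rate)
)

print(f"model accuracy when siding with the favorite: {favorite_hit_rate:.1%}")
print(f"implied accuracy of always picking the favorite: {implied_vegas_accuracy:.1%}")
```

Under that reading the model is right on roughly 73% of the matches where it agrees with Vegas, and the implied favorite-only accuracy comes out at about 69.8%, which matches the benchmark figure quoted above.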