Using a multivariable linear regression model to predict the sprint speed of players in FIFA 19

Written by emmanuels | Published 2019/02/12
Tech Story Tags: machine-learning | data-science | linear-regression | python | hackernoon-top-story

I play FIFA games occasionally, but I consider myself a relatively strong player who wins more often than not against other casual players. I am not a huge soccer fan and do not try to play a strategic game. Instead, I rely heavily on player sprint speed and on making unpredictable runs and turns. I often combine these to find and make space in my opponent's defence and dribble my way to a victory, much to the frustration of my opponent. Given this backdrop, I downloaded the FIFA 19 dataset from Kaggle with the intention of predicting player sprint speed from the variables/features that I believed would best predict it.

Linear Regression

Linear regression can be summed up as an attempt to model the relationship between one or more independent variables and a particular outcome, or dependent variable. For this algorithm to be effective, there must be a linear relationship between the independent and dependent variables. Applied to data where a moderate to strong correlation exists between two or more variables, it can be a very useful starting point for predicting the value of an outcome by finding the line that best fits the data.

y = mx + b

The math behind this is fairly simple, particularly where you are only looking at one independent variable. Here y represents the outcome, or the dependent variable, while m denotes the slope, x the independent variable and b the y-intercept. Simply put, if you know the slope and intercept of the line and the value of the independent variable, you can predict the outcome, assuming a linear relationship exists between x and y.
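As a quick illustration of the formula, here is a minimal sketch in Python. The function name and the numbers are made up for the example and are not taken from the FIFA data.

```python
def predict_line(x, m, b):
    """Return y for a straight line with slope m and y-intercept b."""
    return m * x + b

# Hypothetical line: slope 0.5, intercept 20
print(predict_line(80, m=0.5, b=20))  # 60.0
```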

In my case, however, I am looking at multiple independent variables, so the formula changes slightly.

F(x) = A + (B1*X1) + (B2*X2) + (B3*X3) + (B4*X4) + ... + (Bn*Xn)

With this formula I am assuming that there are n independent variables under consideration. In this context F(x) is the predicted outcome of the linear model, A is the y-intercept, X1 to Xn are the predictors/independent variables and B1 to Bn are the regression coefficients (comparable to the slope in the simple linear regression formula). Plugging the appropriate numbers into this formula gives me a prediction of an outcome, in this case the sprint speed of a player in FIFA 19.
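The multivariable formula can be written as a small function. This is only a sketch: the function name, intercept, coefficients and values below are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
def predict(intercept, coefficients, values):
    """F(x) = A + B1*X1 + B2*X2 + ... + Bn*Xn."""
    if len(coefficients) != len(values):
        raise ValueError("need exactly one value per coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# Hypothetical model with three predictors
A = 10.0
B = [0.5, 0.25, 0.125]
X = [80, 60, 40]
print(predict(A, B, X))  # 10 + 40 + 15 + 5 = 70.0
```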

Interacting with the data

For this analysis I opted to use Python. I downloaded the data from Kaggle, uploaded it to my Google Drive, opened Google Colab and loaded the data using the pandas read_csv function. After importing the scipy, numpy and pandas libraries, I proceeded to the data cleanup process.

# libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from google.colab import drive

# mount Google Drive so that the path below resolves
drive.mount('/content/gdrive')

# uploading data
fifa_dataset = pd.read_csv('/content/gdrive/My Drive/Google Research/Learning/Kaggle Projects/FIFA19 dataset/data.csv')

Data Cleanup

I started off with a few assumptions: I assumed that sprint speed would be largely influenced by height, weight, age, acceleration stats and possibly the ratio between a player's weight and height. On inspecting the dataset I noticed that heights and weights were recorded as strings (e.g. 5'11 and 180lbs). Additionally, as someone who is more accustomed to the metric system, I wanted to change these measurements to centimetres and kilograms respectively.

# feet'inches to cm, e.g. 5'11 -> 5 * 30.48 + 11 * 2.54
height_parts = fifa_dataset['Height'].str.split("'", expand=True).astype(float)
fifa_dataset['Height'] = height_parts[0] * 30.48 + height_parts[1] * 2.54
fifa_dataset['Height'] = fifa_dataset['Height'].fillna(fifa_dataset['Height'].mean()).astype(np.int64)

# lbs to kg
fifa_dataset['Weight'] = fifa_dataset['Weight'].str.replace('lbs', '').astype(float) * 0.45359237
fifa_dataset['Weight'] = fifa_dataset['Weight'].fillna(fifa_dataset['Weight'].mean()).astype(np.int64)

For height, this conversion involved splitting the string at the apostrophe into feet and inches, converting both parts from str to float, and multiplying the feet by 30.48 and the inches by 2.54 to get centimetres. (Simply swapping the apostrophe for a decimal point would read 5'11 as 5.11 feet and understate the height.) I converted to float because the result of the calculation is a decimal number. After doing this I filled the NaN values (fewer than 100 rows) with the mean height in the dataframe and converted the result to an integer. I made the assumption that filling in the missing values with the mean would be better for my analysis than forward filling, dropping the NaN rows or changing them to zero. I later learnt that these columns would not be applicable to my analysis; however, I decided to include this to show the work I had to put in to clean some of the columns.

def func(x):
    # fill NaN values with the column mean, then convert to int
    x = x.fillna(x.mean()).astype(np.int64)
    return x

columns = ['Agility', 'Acceleration', 'Balance', 'Positioning', 'Skill Moves', 'BallControl', 'Crossing', 'Finishing', 'Reactions', 'SprintSpeed']
fifa_dataset[columns] = func(fifa_dataset[columns])

After applying the same cleanup to the weight column, I defined a function that, when applied to a column, fills all NaN values with the mean of that column and converts the values to int. After testing which columns I would be using for my analysis, I applied this function to the relevant columns.
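To see what this kind of cleanup does to individual values, here is a sketch on a toy dataframe. The four rows are made up, and the sketch assumes that the number before the apostrophe is feet and the number after it is inches.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up
df = pd.DataFrame({
    "Height": ["5'11", "6'2", None, "5'7"],
    "Weight": ["180lbs", "200lbs", "150lbs", None],
})

# Split feet and inches, then convert each part to centimetres
parts = df["Height"].str.split("'", expand=True).astype(float)
df["Height"] = parts[0] * 30.48 + parts[1] * 2.54

# Strip the unit and convert pounds to kilograms
df["Weight"] = df["Weight"].str.replace("lbs", "", regex=False).astype(float) * 0.45359237

# Fill gaps with the column mean, then truncate to whole numbers
df = df.fillna(df.mean()).astype(np.int64)
print(df)
```

Note that astype(np.int64) truncates rather than rounds, so 180.34 cm becomes 180 and 187.96 cm becomes 187.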

Testing Correlation and Significance

To test the correlation between each column and the outcome column (SprintSpeed) I opted to use the spearmanr function from the scipy package. This function returns both the correlation between x and y and the p-value, which is the probability of seeing a correlation at least this strong if the two variables were in fact uncorrelated.

# We want to test for moderate to strong correlations
def corr_test(x):
    x_corr = stats.spearmanr(x, fifa_dataset['SprintSpeed'])
    return x_corr

# repeat for each candidate column, e.g. 'Acceleration'
corr_test(fifa_dataset['Acceleration'])

Using this function I ran through the different columns in my dataset to determine which ones I would use for my regression model. I opted to use columns where a moderate to strong correlation of at least 0.50 (or at most -0.50) existed. Using this as a benchmark I ended up with the columns Agility, Acceleration, Balance, Positioning, Skill Moves, BallControl, Crossing, Finishing and Reactions; these are the independent variables.
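The column screening described above can be sketched as a loop. The toy dataframe, its values and the names threshold and selected are my own for illustration; only spearmanr and the 0.50 cut-off come from the analysis.

```python
import pandas as pd
import scipy.stats as stats

# Made-up ratings for six players; not taken from the FIFA 19 data
toy = pd.DataFrame({
    "SprintSpeed":  [40, 55, 60, 72, 80, 91],
    "Acceleration": [42, 50, 63, 70, 85, 90],  # rises with sprint speed
    "Strength":     [60, 90, 45, 70, 80, 50],  # no clear pattern
})

threshold = 0.50
selected = []
for column in toy.columns.drop("SprintSpeed"):
    rho, p_value = stats.spearmanr(toy[column], toy["SprintSpeed"])
    print(f"{column}: rho={rho:.2f}, p={p_value:.3f}")
    # keep the column if the correlation is at least moderate in either direction
    if abs(rho) >= threshold:
        selected.append(column)

print(selected)
```

With so few rows the p-values mean very little; the point is only to show the shape of the loop.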

Typically, when you test for linearity you may need to visualise each column with a scatterplot to confirm that a linear relationship does exist. The problem with relying purely on the correlation coefficient is that influential outliers can drastically increase or decrease it, making it appear as though a strong or weak correlation exists when the opposite is the case. With an understanding of how player scores are distributed in FIFA, I made the assumption that this was not necessary, since we would not get single values that are highly influential (with the exception of columns such as sale price).
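To show how much a single influential point can move a correlation coefficient, here is a small sketch using a hand-written Pearson correlation. The data are made up; the first six points have no real trend, and the seventh is a deliberately extreme outlier.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data with no real trend
x = [1, 2, 3, 4, 5, 6]
y = [5, 3, 6, 2, 4, 5]
print(round(pearson(x, y), 2))          # close to zero

# The same data plus one extreme point
x_out = x + [50]
y_out = y + [60]
print(round(pearson(x_out, y_out), 2))  # close to one
```

A scatterplot of the second dataset would make the outlier obvious at a glance, which is why plotting is the usual safeguard.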

Multivariate Linear Regression Model

# multivariate linear regression
# 80/20 split: 20% test data
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

train, test = train_test_split(fifa_dataset, test_size=0.2)

My machine learning algorithm (assuming you consider a linear regression model machine learning) relied heavily on the sklearn library. After importing it, I opted to apply the 80/20 rule in splitting my data, with 80% used as training data and 20% held out as test data. I reasoned that holding out 20% that the model never sees during training would give me more certainty that it generalises to the entire dataset.
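Because train_test_split shuffles the rows randomly, every run produces a slightly different split and therefore slightly different scores. A sketch of how the split could be made repeatable is shown here; the random_state argument and the stand-in list of rows are my additions, not part of the original analysis.

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))  # stand-in for 100 players

# test_size=0.2 holds out 20% for testing; random_state fixes the shuffle
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_rows), len(test_rows))  # 80 20
```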

# independent and dependent variables
features = ['Agility', 'Acceleration', 'Balance', 'Reactions', 'Positioning', 'Skill Moves', 'BallControl', 'Crossing', 'Finishing']
target = 'SprintSpeed'

# define model I am using
model = LinearRegression()

# training process
model.fit(train[features], train[target])

# mean absolute error for training data
data = train[target]
predict = model.predict(train[features])
training_error = mean_absolute_error(data, predict)

# mean absolute error for test data
test_data = test[target]
predict_test = model.predict(test[features])
test_data_error = mean_absolute_error(test_data, predict_test)

I went on to define the features I would be using for this model (the independent variables) and the target, the variable I sought to predict (the dependent variable), then proceeded to train the linear regression model. Training involves finding the intercept and the coefficient for each independent variable that minimise the squared differences between the predicted and actual sprint speeds in the training data; these are then used to predict outcomes for the test data.
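For a single predictor, the least-squares fit has a simple closed form, which gives a feel for what the training step computes. This is a sketch with made-up numbers and a hypothetical fit_line helper; sklearn's LinearRegression solves the same least-squares problem for many predictors at once.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1
acceleration = [1, 2, 3, 4]
sprint_speed = [3, 5, 7, 9]
m, b = fit_line(acceleration, sprint_speed)
print(m, b)  # 2.0 1.0
```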

Testing the Model

Mean Absolute Error formula: MAE = (1/n) * Σ |yj - ŷj|, summed over j = 1 to n

To measure the forecasting error (loss function) I calculated the mean absolute error (MAE) for both the training and test data, using the metrics module in sklearn. In this formula n represents the number of predictions, Σ simply means summation and |yj - ŷj| is the absolute error of each prediction, that is, the absolute difference between the actual and the predicted value. The formula sums the absolute errors and divides the total by the number of instances, giving me the average error between the predicted and actual sprint speed.
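The same calculation can be written out by hand in a few lines. The function name and the four made-up values are only for illustration; in the analysis itself the figure comes from sklearn's mean_absolute_error.

```python
def mean_absolute_error_manual(actual, predicted):
    """Average of |y_j - y_hat_j| over all n predictions."""
    n = len(actual)
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / n

actual = [70, 80, 90, 60]
predicted = [74, 78, 85, 61]
print(mean_absolute_error_manual(actual, predicted))  # (4 + 2 + 5 + 1) / 4 = 3.0
```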

I would ideally want this number to be as small as possible, and I report it together with my prediction success rate. I could alternatively have opted to use the root mean squared error (RMSE); like the MAE, this returns a figure showing the deviation of the predicted values from the actual values. RMSE squares each error, averages the squared errors and takes the square root of that average, so large errors are penalised more heavily than they are by the MAE.
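A small sketch makes the difference between the two metrics concrete. The helper names and numbers are made up: three predictions are off by 1 and one is off by 9.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the mean of the squared errors."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))

actual = [50, 50, 50, 50]
predicted = [51, 49, 51, 59]  # errors of 1, 1, 1 and 9

print(mae(actual, predicted))             # (1 + 1 + 1 + 9) / 4 = 3.0
print(round(rmse(actual, predicted), 2))  # sqrt((1 + 1 + 1 + 81) / 4) = 4.58
```

The single large miss pulls the RMSE well above the MAE, which is the behaviour to weigh up when choosing between them.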

# we need some metric to measure the accuracy of our regression model
from sklearn.metrics import r2_score

# on training data
true_value = train[target]
predicted_val = model.predict(train[features])
accuracy = r2_score(true_value, predicted_val)

# on test data
true_value2 = test[target]
predicted_val2 = model.predict(test[features])
accuracy2 = r2_score(true_value2, predicted_val2)

To test the accuracy of this model I relied on the r2_score metric (coefficient of determination). The R2 score, or R-Squared, measures how closely the data fit the regression model: the closer the number is to 1, the larger the share of the variation in sprint speed that is explained by the linear regression model, indicating stronger prediction capability.
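For intuition, R-Squared can be computed by hand as one minus the ratio of the unexplained variation to the total variation. The function name and values below are made up for the sketch; the analysis itself uses sklearn's r2_score.

```python
def r_squared(actual, predicted):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

actual = [60, 70, 80, 90]
predicted = [62, 68, 81, 89]

# residuals: 4 + 4 + 1 + 1 = 10; total: 225 + 25 + 25 + 225 = 500
print(round(r_squared(actual, predicted), 2))  # 1 - 10/500 = 0.98
```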

# accuracy and training_error were computed on the training data, accuracy2 and test_data_error on the test data
print('This model accounts for {}% of the training data with mean data error of {}'.format(round(accuracy*100,2), round(training_error,2)))
print('This model accounts for {}% of the testing data with mean data error of {}'.format(round(accuracy2*100,2), round(test_data_error,2)))

####RESULT####
>This model accounts for 84.96% of the training data with mean data error of 4.08
>This model accounts for 85.61% of the testing data with mean data error of 4.2

As reported by my console, the prediction model accounts for 85.61% of the variation in my testing data, with an average deviation of about 4.2 between the predicted and actual values. According to this result, if for example the model predicts a sprint speed of 90 for a player, the actual sprint speed will on average be about 4 points away, so typically somewhere between 86 and 94.

My R-Squared value will increase as I add more predictors, because each additional predictor accounts for a little more of the variability in my data, whether or not it is genuinely useful. To factor this in I could look at the adjusted R-Squared value, which penalises the use of more predictors; the magnitude of the penalty depends on how the number of predictors relates to the number of observations. The result is an increase in the adjusted R-Squared value only when an added predictor improves the model more than would be expected by chance.

Adjusted R-Squared = 1 - (1 - R2) * (n - 1) / (n - k - 1)

In the formula given above, k denotes the number of predictors, while n is normally the number of observations (rows) the score was calculated on. In my code I took n from the number of columns in the dataset instead, which overstates the penalty; with n set to the number of rows in the test data, the adjusted value would sit much closer to the plain R-Squared.

# n is taken from the column count here (see the note above); k is the number of predictors
n = len(list(fifa_dataset)) - 1
k = len(features)

# calculating adjusted r squared using formula given
r2 = 1 - (1 - accuracy2) * (n - 1) / (n - k - 1)
print('Adjusted R Squared is {}%'.format(round(r2*100, 2)))

####RESULT####
>Adjusted R Squared is 83.97%

This percentage would become more useful in instances where I want to compare the goodness of fit of this model with that of other models.
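To get a feel for how the penalty behaves, here is a sketch of the adjusted R-Squared as a function. In this sketch n is the number of observations (rows), which is the usual definition, and the R-Squared of 0.85 is a hypothetical value rather than a result from the model.

```python
def adjusted_r_squared(r2, n, k):
    """1 - (1 - R^2) * (n - 1) / (n - k - 1), for n observations and k predictors."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 0.85  # hypothetical R-squared
k = 9      # nine predictors, as in the feature list above

# the penalty shrinks as the number of observations grows
for n in (50, 500, 5000):
    print(n, round(adjusted_r_squared(r2, n, k), 4))
```

With only 50 observations the nine predictors cost a few percentage points; with thousands of observations the adjusted value is almost identical to the plain R-Squared.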

Another issue I noticed is that some of the predictors are correlated with other predictors, creating multicollinearity. However, from my understanding this does not have a significant effect on the prediction capability of my model; it has a larger effect on the ability to estimate the effect each individual predictor has on the outcome.

Making a prediction

Now let's imagine we want to use this model to make an actual prediction. We pick a random player in our dataset. This happens to be the now 21-year-old, England-born Josef Yarney, who appears in the 26th row of my test dataframe.

Josef Yarney Source: WorldFootball.net

josef = test.iloc[25]
# select the same columns, in the same order, that the model was trained on
josef_stats = josef[features]

# make prediction (predict expects a 2D array: one row per player)
josef_predic = model.predict(np.array([josef_stats], dtype=float))
josef_predic

####RESULT####
>array([51.32203933])

I proceed to locate this player and extract the relevant stats, saving them under the josef_stats variable. I then make a prediction of the player's sprint speed using the multivariate linear regression formula created and get a sprint speed of about 51 versus the actual sprint speed of 48.

Visualising the equation

Visualising the multivariate linear regression equation for the FIFA dataset

To visualise how the predict function works we need to revisit the multivariate linear regression equation. Simply put, the predicted sprint speed is found by multiplying the value of each predictor by its slope (i.e. if acceleration is 80 we multiply 80 by the slope of acceleration), adding these products together and adding the total to the y-intercept.

coefs = model.coef_

We can get the coefficients (slopes) of each predictor from the model's coef_ attribute.

We can then express this equation in Python using the following code:

speed = [a*b for a, b in zip(coefs, josef_stats)]
sum(speed) + model.intercept_

####RESULT####
>51.32203933182582

Here we multiply each value in josef_stats by its corresponding slope, add the products together and add the total to the intercept, which we get from the model.intercept_ attribute, to arrive at the predicted sprint speed of 51.

This wraps up my analysis.

Feel free to send through any feedback or reach out to me on Twitter @Emmoemm


Written by emmanuels | I am an aspiring data scientist and entrepreneur
Published by HackerNoon on 2019/02/12