Little-known Linear Regression Assumptions

The model should conform to these assumptions to produce the best Linear Regression fit to the data.

— All the images (plots) are generated and modified by Author.

Introduction

Linear Regression is a method of modelling the best linear relationship between the independent variables and the dependent variable. The simplest form of Linear Regression, with one independent and one dependent variable, can be defined by the following equation:

y = β0 + β1x

x is the independent variable,

y is the dependent variable,

β1 is the coefficient of x, i.e. slope,

β0 is the intercept (constant), i.e. the value of y at x = 0, where the line crosses the y-axis.
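To make the roles of β0 and β1 concrete, here is a minimal sketch that estimates both from data using the closed-form least-squares formulas. The data is synthetic and the true values (intercept 2, slope 3) are assumptions chosen purely for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3x plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=x.size)

# Closed-form least-squares estimates for a single predictor.
x_mean, y_mean = x.mean(), y.mean()
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
beta0 = y_mean - beta1 * x_mean  # intercept: fitted value of y at x = 0

print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
```

The estimates should land close to the values used to generate the data, though not exactly on them because of the added noise.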

Linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

— Wikipedia

Linear Regression Types

Simple Linear Regression: The simplest form of regression, which involves one independent variable and one dependent variable, as explained above, where we fit a line to the data.
Multiple Linear Regression: The more complex form of regression, which involves multiple independent variables and one dependent variable, and which can be described by the following equation:

y = β0 + β1x1 + β2x2 + … + βnxn

x1 to xn are the independent variables,

y is the dependent variable,

β1 to βn are the coefficients of respective x features, and

β0 is the intercept (constant), i.e. the value of y when every independent variable is zero.
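The multiple-regression case can be sketched the same way. The example below is an illustration rather than a reference implementation: the three features, the coefficient values in `true_beta`, and the noise level are all assumptions. It estimates the coefficients with NumPy's least-squares solver after adding a column of ones for the intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))                   # three independent variables x1..x3
true_beta = np.array([1.0, 2.0, -1.0, 0.5])   # [β0, β1, β2, β3], chosen arbitrarily
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0.0, 0.1, size=n)

# Prepend a column of ones so that β0 is estimated with the other coefficients.
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.round(beta_hat, 2))
```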

Assumptions in Linear Regression

1. Linear Relationship: It is assumed that the relationship between the independent variables and the dependent variable is linear. More precisely, the model must be linear in its coefficients, which are what we estimate when building the model and then use for prediction.

The predictor variables are treated as fixed values and may themselves be any function of the raw inputs, such as polynomial or trigonometric terms. The dependent variable must still be a weighted sum of those terms, with each coefficient entering only as a multiplier.

This is what makes Polynomial Regression possible: it uses linear regression to fit the response variable as a polynomial function of a predictor variable, and the model remains linear in the coefficients.
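A short sketch may help show why a polynomial fit still counts as linear regression. The quadratic and its coefficients below are assumptions made up for the example. The features 1, x and x² are nonlinear in x, but y is a plain weighted sum of them, so an ordinary least-squares solver recovers the weights.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3.0, 3.0, 100)
# Assumed ground truth for illustration: y = 1 - 2x + 0.5x^2 plus noise.
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 0.2, size=x.size)

# Each column is a (possibly nonlinear) function of x; the model is
# still linear in the coefficients that multiply those columns.
X_design = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.round(coef, 2))  # estimates of [β0, β1, β2]
```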

2. Homoscedasticity (Constant Variance): It is assumed that the error terms (that is, the “noise” or random disturbance in the relationship between the features and the target) have constant variance, i.e. the spread of the errors is the same regardless of the values of the predictor variables.

When the residuals are plotted against the predicted values, there should be no clear pattern in their spread; if there is a specific pattern, the data is heteroscedastic. The leftmost graph shows no definite pattern among the error terms, i.e. their spread stays constant. The middle graph shows a pattern where the error decreases and then increases with the estimated values, violating the constant variance rule. The rightmost graph also reveals a specific pattern, where the error terms decrease with the predicted values, which again indicates heteroscedasticity. More generally, two or more normal distributions are homoscedastic if they share a common covariance matrix.
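Besides looking at a residual plot, a crude numeric check is possible. The sketch below is an assumption-laden illustration, not a formal test: it sorts the residuals by fitted value, splits them into two halves, and compares the residual variance of the upper half to that of the lower half. The helper names `fit_line` and `variance_ratio` are hypothetical, and both data sets are synthetic.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line; returns fitted values and residuals."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    return fitted, y - fitted

def variance_ratio(fitted, residuals):
    """Residual variance in the upper half of fitted values over the lower half."""
    order = np.argsort(fitted)
    half = len(order) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return high.var(ddof=1) / low.var(ddof=1)

rng = np.random.default_rng(3)
x = np.linspace(1.0, 10.0, 400)
y_const = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=x.size)  # constant noise
y_grow = 2.0 + 3.0 * x + rng.normal(0.0, 0.2 * x)            # noise grows with x

ratio_const = variance_ratio(*fit_line(x, y_const))
ratio_grow = variance_ratio(*fit_line(x, y_grow))
print(f"constant noise: {ratio_const:.2f}, growing noise: {ratio_grow:.2f}")
```

A ratio near 1 is consistent with constant variance, while a ratio far from 1 suggests the spread changes with the predicted value. A two-way split like this would miss the "decreases and then increases" pattern described above, so it complements the plot rather than replacing it.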

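3. Multivariate Normality: It is assumed that the error terms are normally distributed with a mean of zero. (When the model includes an intercept, the least-squares residuals always sum to zero; that is a property of the fit rather than evidence of normality.) A less widely known fact is that, as the sample size grows large, the normality assumption becomes much less important, because the coefficient estimates are then approximately normally distributed even when the errors are not.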

The above q-q plot shows that the errors or residuals are normally distributed. The error term can be seen as the sum of many small, independent disturbances. As the number of these disturbances increases, the distribution of their sum tends to approach the normal distribution; this tendency is described by the Central Limit Theorem. The t-test and F-test are exact only if the error term is normally distributed; in large samples they hold approximately.
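The points of a q-q plot can be computed by hand, which also gives a single number to summarize how straight the plot is. The following is a sketch under stated assumptions: the plotting positions (i − 0.5)/n are one common convention among several, the helper `qq_points` is hypothetical, and the two residual samples are synthetic stand-ins for real model residuals.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal q-q plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    probs = (np.arange(1, n + 1) - 0.5) / n           # plotting positions
    std_normal = NormalDist()                         # mean 0, sd 1
    theoretical = np.array([std_normal.inv_cdf(p) for p in probs])
    sample = (r - r.mean()) / r.std(ddof=1)           # standardized residuals
    return theoretical, sample

rng = np.random.default_rng(4)
samples = {
    "normal": rng.normal(0.0, 1.0, size=500),
    "skewed": rng.exponential(1.0, size=500) - 1.0,   # mean zero, not normal
}

qq_corr = {}
for name, res in samples.items():
    theo, samp = qq_points(res)
    # Correlation between the two sets of quantiles: close to 1 when
    # the q-q points fall on a straight line.
    qq_corr[name] = np.corrcoef(theo, samp)[0, 1]
    print(f"{name}: q-q correlation = {qq_corr[name]:.4f}")
```

Plotting `theo` against `samp` gives the q-q plot itself; the skewed sample bends away from the diagonal, which shows up as a lower correlation.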

4. No Multicollinearity: Multicollinearity is the degree of inter-correlation among the independent variables used in the model. It is assumed that the independent feature variables are uncorrelated, or only weakly correlated, with one another. As a practical rule of thumb (not a universal standard), the correlation between two independent features should not exceed about 0.3 (30%) in absolute value, because strong correlation inflates the uncertainty of the coefficient estimates and weakens the statistical power of the model. To identify highly correlated features, pair plots (scatter plots) and heatmaps of the correlation matrix can be used.

Highly correlated features should not be used together in the model because they tend to change in unison. A regression coefficient is interpreted as the expected change in the outcome for a one-unit change in its feature while all other features are held constant. When two features are highly correlated, a change in one comes with a change in the other, so the other cannot be held constant and that interpretation of the coefficient no longer holds.
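The numbers behind a correlation heatmap are just the pairwise correlation matrix, so flagging suspicious pairs takes only a few lines. In this sketch the feature names and data are invented for illustration, and the 0.3 cutoff is simply the rule of thumb quoted above.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # nearly a rescaled copy of x1
x3 = rng.normal(size=n)                   # generated independently of x1 and x2
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

corr = np.corrcoef(X, rowvar=False)       # columns are the features
threshold = 0.3                           # rule-of-thumb cutoff from the text

# Collect every feature pair whose absolute correlation exceeds the cutoff.
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

for a, b, r in flagged:
    print(f"{a} and {b} are correlated: r = {r:.2f}")
```

Only the x1/x2 pair should be flagged here. Pairwise correlation cannot detect a feature that is a combination of several others, so it is a first screen rather than a complete check.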

5. No Auto-correlation: It is assumed that there is no auto-correlation among the residuals. Auto-correlation occurs when there is a dependency between residual errors; the residuals should not be correlated with one another, positively or negatively, and should be spread randomly. It usually arises in time series models, where the next instant depends on the previous one. Correlation in the residual terms also reduces the reliability of the model’s predictions.

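Autocorrelation can be tested with the Durbin-Watson test. The Durbin-Watson test statistic is defined as:

d = Σ_{t=2}^{T} (e_t − e_{t−1})² / Σ_{t=1}^{T} e_t²

where e_t is the residual at time t and T is the number of observations.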

The Durbin-Watson test statistic always lies between 0 and 4. A value of 2.0 indicates that no autocorrelation is detected in the sample. Values between 0 and 2 indicate positive autocorrelation, and values between 2 and 4 indicate negative autocorrelation.
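The statistic is easy to compute directly from a residual series: the sum of squared successive differences divided by the sum of squared residuals. The sketch below assumes three synthetic residual series (independent noise, a positively autocorrelated AR(1) series with an arbitrarily chosen coefficient of 0.8, and a strictly alternating series) to show the three regimes described above. The function name `durbin_watson` is our own.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(6)
white = rng.normal(size=1000)              # independent residuals

# Positively autocorrelated residuals: each value carries over 80% of the last.
ar = np.empty(1000)
ar[0] = white[0]
for t in range(1, 1000):
    ar[t] = 0.8 * ar[t - 1] + white[t]

alternating = np.array([1.0, -1.0] * 50)   # extreme negative autocorrelation

dw_white = durbin_watson(white)
dw_ar = durbin_watson(ar)
dw_alt = durbin_watson(alternating)
print(f"independent: {dw_white:.2f}, positive: {dw_ar:.2f}, negative: {dw_alt:.2f}")
```

The independent series gives a value near 2, the AR(1) series falls well below 2, and the alternating series comes out close to 4.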

6. No Extrapolation: Extrapolation is estimation beyond the original observation range. It is assumed that the trained model will be asked to predict the dependent variable only for independent feature values that lie within the range of the training data. The model cannot guarantee the accuracy of predictions for feature values beyond the range it was trained on.
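One simple safeguard is to flag any new observation that falls outside the range seen during training before trusting its prediction. This is a minimal sketch: the function name `outside_training_range` and the tiny data set are hypothetical, and the check looks at each feature's minimum and maximum independently.

```python
import numpy as np

def outside_training_range(X_train, X_new):
    """True for each row of X_new with any feature outside the training min/max."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    lo = X_train.min(axis=0)   # per-feature minimum seen in training
    hi = X_train.max(axis=0)   # per-feature maximum seen in training
    return np.any((X_new < lo) | (X_new > hi), axis=1)

# Two features; training ranges are [1, 3] and [10, 30].
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([
    [2.5, 15.0],   # inside both ranges
    [4.0, 15.0],   # first feature too large
    [2.0, 5.0],    # second feature too small
])

mask = outside_training_range(X_train, X_new)
print(mask.tolist())  # [False, True, True]
```

A per-feature check like this is deliberately simple. A point can sit inside every individual range and still be an unusual combination of features, so passing the check does not prove the prediction is an interpolation.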

Conclusion

We have explained the most important assumptions to check before fitting a Linear Regression model to a given set of data. These assumptions are a formal way to ensure that the predictive performance of the model is good enough to give the best possible results for a given data set. A Linear Regression model can still be built when they are not satisfied, but satisfying them gives good confidence in the model’s predictions.

Thanks for reading. You can find my other Machine Learning related posts here.

I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or at LinkedIn.

Previously published at https://towardsdatascience.com/assumptions-in-linear-regression-528bb7b0495d