This is the final post in our Linear Regression series. Here we discuss the key assumptions of Linear Regression and test them using Python code.
Assumptions of Linear Regression
1. Linearity – There should be a linear relationship between the dependent and independent variables. This is the most fundamental assumption of Linear Regression. It can be checked visually by making a scatter plot of the dependent variable against each independent variable.
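The scatter plot check can be sketched as follows. This is only an illustration: the data is synthetic, and the column names `x` and `y` are hypothetical stand-ins for your own independent and dependent variables. The correlation printed at the end is an optional numeric companion to the plot, not a replacement for looking at it.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: y depends linearly on x, plus random noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
df["y"] = 3.0 + 2.0 * df["x"] + rng.normal(scale=1.0, size=200)

# Points scattered around a straight line suggest a linear relationship
ax = sns.scatterplot(x="x", y="y", data=df)

# A Pearson correlation close to +1 or -1 supports what the plot shows
corr = df["x"].corr(df["y"])
print(f"Pearson correlation: {corr:.3f}")
```

A curved or fan-shaped cloud of points in this plot would be a hint that a straight-line model is not a good fit for that variable.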
2. Homoscedasticity – Constant error variance, i.e., the variance of the error term is the same across all values of the independent variables. It can be checked by making a scatter plot of residuals against fitted values. If the spread of the residuals stays roughly the same across the fitted values (no funnel or fan shape), the variance of the error term can be treated as constant.
import seaborn as sns
sns.lmplot(x="expected", y="residual", data=result)
A close look at the above plot shows that the variance of the residuals is relatively larger for higher fitted values. Note: In many real-life scenarios, it is difficult to ensure that all assumptions of linear regression hold exactly.
3. Normal Error – The error term should be normally distributed. A Q-Q plot is a good way of checking normality: if the points fall roughly along a straight line, we can assume the residuals are approximately normal.
import statsmodels.api as sm
sm.qqplot(result["residual"], ylabel="Residual Quantiles")
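4. No Autocorrelation of Residuals – This is typically applicable to time series data. Autocorrelation means that the value at time t is correlated with its own value at time t-n, where n is the lag period; for regression, the concern is that the residuals are correlated in this way. The Durbin-Watson test is a quick way to check for autocorrelation: the statistic ranges from 0 to 4, and a value close to 2 suggests no first-order autocorrelation.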
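A minimal sketch of the Durbin-Watson check, using the `durbin_watson` function from statsmodels. The two residual series are synthetic and purely illustrative: one is independent noise, the other is built so that each value depends on the previous one.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
n = 500

# Hypothetical residuals with no autocorrelation
independent = rng.normal(size=n)

# Hypothetical residuals where each value carries over 80% of the previous one
autocorrelated = np.zeros(n)
for t in range(1, n):
    autocorrelated[t] = 0.8 * autocorrelated[t - 1] + rng.normal()

dw_independent = durbin_watson(independent)
dw_autocorrelated = durbin_watson(autocorrelated)

# Near 2: little autocorrelation; well below 2: positive autocorrelation
print(f"Independent residuals:    {dw_independent:.2f}")
print(f"Autocorrelated residuals: {dw_autocorrelated:.2f}")
```

With the `result` frame used in this post, the call would be `durbin_watson(result["residual"])`, provided the rows are in time order.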
5. No Perfect Multi-Collinearity – Multi-collinearity occurs when two or more independent variables are highly correlated. It is checked using the Variance Inflation Factor (VIF). As a strict rule of thumb, no variable in the model should have a VIF above 2; cutoffs of 5 or 10 are also widely used. (…for more details see our blog on Multi-Collinearity)
6. Exogeneity – Exogeneity is a standard assumption of regression. It means that each X variable does not depend on the dependent variable Y; rather, Y depends on the Xs and on the error term (e). In simple terms, X is completely unaffected by Y. Formally, the Xs should be uncorrelated with the error term.
7. Sample Size – In linear regression, it is desirable that the number of records be at least 10 times the number of independent variables, to avoid the curse of dimensionality.
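The sample size rule of thumb is simple arithmetic, and can be written as a small helper. The function name `meets_sample_size_rule` and its `ratio` parameter are hypothetical, introduced only for this illustration.

```python
def meets_sample_size_rule(n_records, n_predictors, ratio=10):
    """Return True if there are at least `ratio` records per independent variable."""
    return n_records >= ratio * n_predictors


# 200 records and 5 predictors: 200 >= 50, so the rule is satisfied
print(meets_sample_size_rule(200, 5))

# 30 records and 5 predictors: 30 < 50, so the rule is not satisfied
print(meets_sample_size_rule(30, 5))
```

For a pandas DataFrame `X` of predictors, the call would be `meets_sample_size_rule(len(X), X.shape[1])`.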
With this post, we complete our 10-part blog series on Linear Regression. We hope that you, the student / blog reader, have enjoyed learning and practicing Linear Regression using R / Python.
Training and Testing – Any model built on a training set should be checked on unseen data, called the testing set. In this blog series, the focus was on learning the basics of the linear regression model and building it using R / Python. We will discuss Training – Testing in a separate blog.
For corrections, suggestions, or improvements, write to us in the comment section.