Linear Regression Assumptions Explained
Q: What are the assumptions underlying linear regression, and how can you evaluate if these assumptions are met?
- Statistics
- Senior level question
Linear regression relies on several critical assumptions to ensure the validity of the model and the reliability of its predictions. These assumptions include:
1. Linearity: The expected value of the dependent variable should be a linear function of the independent variables. With a single predictor, a scatter plot of the two variables is a quick check; with several predictors, a residuals vs. fitted values plot is more informative. If the points show no systematic curvature, the assumption is likely met.
2. Independence: The residuals (errors) should be independent of each other. This can be assessed using the Durbin-Watson test, whose statistic ranges from 0 to 4 with values near 2 indicating no first-order autocorrelation, or by plotting residuals against time or observation order (if applicable). If there’s no pattern, the assumption is plausible.
3. Homoscedasticity: The variance of residuals should be constant across all levels of the independent variable(s). You can evaluate this using a residuals vs. fitted values plot. If the spread of residuals remains roughly constant, homoscedasticity is plausible; a funnel or fan shape indicates heteroscedasticity.
4. Normality of Residuals: The residuals should be approximately normally distributed. This matters mainly for confidence intervals and hypothesis tests, especially in small samples. It can be checked using a Q-Q plot or a histogram of the residuals. If the points in the Q-Q plot align closely with the diagonal line, the normality assumption is reasonable.
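5. No Multicollinearity: Independent variables should not be highly correlated with each other. This can be assessed using the Variance Inflation Factor (VIF), computed for each predictor as 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. A common rule of thumb treats a VIF greater than 10 as a sign of high multicollinearity; some practitioners use a stricter cutoff of 5.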
6. No Influential Outliers: Outliers can disproportionately affect the fitted model, especially when they also have high leverage. Cook’s distance can be used to identify influential points. A common rule of thumb flags values greater than 1; a stricter cutoff of 4/n, where n is the number of observations, is also used.
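Several of the numerical checks in the list above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. It assumes a simple regression with one predictor (so leverage has a closed form), computes VIF only for the two-predictor case, where it reduces to 1 / (1 - r^2) with r the correlation between the two predictors, and uses made-up data. All function names are hypothetical; in practice a statistics library would normally do this work.

```python
def _centered_sums(a, b):
    """Return (Saa, Sbb, Sab): centered sums of squares and cross-products."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return saa, sbb, sab


def fit_residuals(x, y):
    """Fit y = b0 + b1 * x by least squares and return the residuals."""
    sxx, _, sxy = _centered_sums(x, y)
    b1 = sxy / sxx
    b0 = sum(y) / len(y) - b1 * sum(x) / len(x)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def durbin_watson(resid):
    """Squared successive differences over squared residuals (range 0 to 4)."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    return num / sum(e ** 2 for e in resid)


def vif_two_predictors(x1, x2):
    """VIF when there are exactly two predictors: 1 / (1 - r^2).

    Raises ZeroDivisionError if the predictors are perfectly collinear.
    """
    s11, s22, s12 = _centered_sums(x1, x2)
    r_squared = s12 ** 2 / (s11 * s22)
    return 1.0 / (1.0 - r_squared)


def cooks_distance(x, resid, n_params=2):
    """Cook's distance per observation for a one-predictor regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    mse = sum(e ** 2 for e in resid) / (n - n_params)
    distances = []
    for xi, e in zip(x, resid):
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx
        distances.append(e ** 2 / (n_params * mse) * leverage / (1.0 - leverage) ** 2)
    return distances


# Made-up data: nine points near y = 2x + 1, plus one point far from that line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 10.0]

resid = fit_residuals(x, y)
cooks = cooks_distance(x, resid)
flagged = [i for i, d in enumerate(cooks) if d > 1]  # rule-of-thumb cutoff of 1

print("Durbin-Watson:", round(durbin_watson(resid), 3))
print("Indices with Cook's distance > 1:", flagged)
print("VIF:", round(vif_two_predictors([1, 2, 3, 4, 5], [1.1, 2.0, 2.9, 4.2, 5.0]), 1))
```

The Durbin-Watson value is only meaningful when the observations have a natural order, such as time. The last observation combines high leverage with a large residual, which is the situation Cook's distance is designed to flag.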
Checking these assumptions gives confidence that the linear regression model is valid and provides reliable interpretations and predictions; diagnostics can reveal violations but cannot prove their absence. For example, if you find a non-linear relationship in your initial scatter plot, you might consider transforming your variables or adding polynomial terms instead of fitting a straight line.
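As a small illustration of the variable-transformation remedy, the sketch below compares a straight-line fit on raw data with the same fit after log-transforming the response. The data are synthetic and exactly exponential, chosen only so the effect is easy to see, and the helper name r_squared_simple is hypothetical.

```python
import math


def r_squared_simple(x, y):
    """R^2 of the least-squares line y = b0 + b1 * x.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


# Synthetic data that is exactly exponential in x (an assumption for the demo).
x = list(range(1, 11))
y = [math.exp(0.5 * xi) for xi in x]

r2_raw = r_squared_simple(x, y)                         # line fitted to y
r2_log = r_squared_simple(x, [math.log(v) for v in y])  # line fitted to log(y)

print("R^2 on raw y:  ", round(r2_raw, 3))
print("R^2 on log(y): ", round(r2_log, 3))
```

A higher R^2 after the transformation is only a hint. The residual plots for the transformed model should still be checked, and coefficients are then interpreted on the log scale.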


