Understanding Linear Regression Assumptions
Q: What are the assumptions behind a linear regression model, and how would you test for violations of these assumptions?
- Probability and Statistics
- Senior level question
Linear regression is built on several key assumptions, and it's crucial to ensure that these assumptions hold in order to make valid inferences from the model. The main assumptions are as follows:
1. Linearity: The relationship between the independent and dependent variables should be linear. This can be tested using scatterplots to visualize the relationship. If the relationship appears curvilinear, you might need to include polynomial terms or transformations.
2. Independence: The residuals (errors) should be independent of one another. For time-ordered data this can be checked with the Durbin-Watson statistic: a value near 2 suggests no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.
3. Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variables. We can check this by plotting the residuals against the predicted values. If the plot shows a funnel shape, this indicates a violation of homoscedasticity.
4. Normality of Residuals: The residuals should be approximately normally distributed, especially for smaller sample sizes. This can be assessed using a Q-Q plot or a Shapiro-Wilk test. If the residuals deviate significantly from normality, techniques such as transforming the dependent variable might be necessary.
5. No multicollinearity: The independent variables should not be exact linear combinations of one another (no perfect multicollinearity), and strong but imperfect multicollinearity is also a concern because it inflates the standard errors of the coefficients. Multicollinearity can be assessed with Variance Inflation Factor (VIF) values; a VIF above 10 typically indicates a severe problem.
6. Exogeneity: The independent variables should be uncorrelated with the error term. This is harder to test directly; if a valid instrumental variable is available, you can compare the OLS and instrumental-variables estimates with a Hausman test.
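These checks can be run numerically as well as visually. First, linearity (assumption 1): a minimal NumPy sketch on synthetic data, where the true relationship is deliberately curvilinear, shows how the residuals of a straight-line fit pick up the omitted curvature. All data and thresholds here are illustrative, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 0.5 * x**2 + rng.normal(0, 1.0, x.size)  # curvilinear truth

# Fit a straight line y ≈ b0 + b1*x and inspect the residuals.
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

# With a genuinely linear relationship the residuals are patternless in x;
# here they correlate strongly with the centred squared term, i.e. curvature.
curvature = abs(np.corrcoef((x - x.mean()) ** 2, resid)[0, 1])
print(curvature > 0.5)
```

In practice you would pair this with a scatterplot of residuals against each predictor; a large correlation with a squared term suggests adding polynomial terms or transforming a variable.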
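For independence (assumption 2), the Durbin-Watson statistic is simple enough to compute by hand. This sketch contrasts independent residuals with strongly autocorrelated AR(1) residuals; the data are simulated for illustration.

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum (e_t - e_{t-1})^2 / sum e_t^2; ~2 means no lag-1 autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
iid = rng.normal(size=500)           # independent residuals -> DW near 2

ar1 = np.empty(500)                  # positively autocorrelated -> DW well below 2
ar1[0] = rng.normal()
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + rng.normal()

print(round(durbin_watson(iid), 1), round(durbin_watson(ar1), 1))
```

Roughly, DW ≈ 2(1 − ρ) for lag-1 autocorrelation ρ, so ρ = 0.8 drives the statistic down toward 0.4.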
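For homoscedasticity (assumption 3), a crude numeric analogue of the "funnel" plot is to split the residuals at the median fitted value and compare the spread of the two halves. This is a sketch on simulated heteroscedastic data, not a formal test such as Breusch-Pagan.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 400)
y = 3.0 * x + rng.normal(0, x)       # noise sd grows with x: heteroscedastic

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
resid = y - fitted

# Compare residual spread in the low vs high half of the fitted values;
# a large ratio is the numeric counterpart of a funnel-shaped residual plot.
lo = resid[fitted <= np.median(fitted)]
hi = resid[fitted > np.median(fitted)]
ratio = hi.var() / lo.var()
print(ratio > 2)
```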
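For normality of residuals (assumption 4), the Shapiro-Wilk test can be run directly, assuming SciPy is available. The two samples below are simulated stand-ins for well-behaved and clearly non-normal residuals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=200)
skewed_resid = rng.exponential(size=200)   # clearly non-normal

# Shapiro-Wilk: a small p-value means normality is rejected.
_, p_normal = stats.shapiro(normal_resid)
_, p_skewed = stats.shapiro(skewed_resid)
print(p_skewed < 0.01)
```

A Q-Q plot of the residuals gives the same information visually; a heavy skew like the exponential sample above would show up as a curved Q-Q line.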
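For multicollinearity (assumption 5), the VIF for each predictor is 1/(1 − R²) from regressing that predictor on all the others. A minimal NumPy implementation, applied to simulated data where one predictor is nearly a copy of another:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        # Regress column j on the remaining columns (plus an intercept).
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        pred = others @ beta
        ss_res = np.sum((X[:, j] - pred) ** 2)
        ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)             # independent of x1 -> VIF near 1
x3 = x1 + rng.normal(0, 0.1, 300)     # nearly collinear with x1 -> large VIF

vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```

The two nearly collinear columns blow well past the VIF-above-10 rule of thumb, while the independent column stays near 1.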
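For exogeneity (assumption 6), the intuition behind instrument-based checks can be seen in a simulation: when a regressor is correlated with the error term, OLS is biased, while a simple instrumental-variables estimator recovers the true coefficient. The data-generating process below is entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
z = rng.normal(size=n)               # instrument: drives x, unrelated to u
u = rng.normal(size=n)               # structural error term
x = z + u + rng.normal(size=n)       # x is endogenous: correlated with u
y = 2.0 * x + u                      # true coefficient is 2

beta_ols = (x @ y) / (x @ x)         # biased upward by cov(x, u) > 0
beta_iv = (z @ y) / (z @ x)          # simple IV estimator, consistent
print(round(beta_ols, 2), round(beta_iv, 2))
```

A Hausman-style comparison asks whether these two estimates differ by more than sampling noise; a large gap, as here, points to an exogeneity violation.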
For example, if you were building a model to predict house prices from square footage and number of bedrooms, you would first create scatterplots to check for linearity, then analyze the residuals for independence and homoscedasticity, and finally evaluate the residuals for normality. If homoscedasticity appears violated, you could fit a weighted least squares regression, which down-weights the observations with larger error variance.
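The weighted least squares remedy can be sketched in NumPy. The house-price data here are simulated, with noise whose standard deviation grows with square footage; the weights 1/sqft² assume that variance structure is known, which in practice you would estimate.

```python
import numpy as np

rng = np.random.default_rng(6)
sqft = rng.uniform(500, 3500, 500)
# Hypothetical prices: true slope 100 per sqft, noise sd proportional to size.
price = 100.0 * sqft + rng.normal(0, 20.0 * sqft)

X = np.column_stack([np.ones_like(sqft), sqft])

# WLS: weight each observation by the inverse of its error variance.
# Here variance ~ sqft**2, so w = 1/sqft**2; multiply rows by sqrt(w).
w = 1.0 / sqft**2
Xw = X * np.sqrt(w)[:, None]
yw = price * np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
print(round(beta_wls[1], 1))  # slope estimate, near the true 100
```

Scaling each row by the square root of its weight turns the heteroscedastic problem into an ordinary least squares problem with constant error variance.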
In summary, checking these assumptions is a crucial step in ensuring that the linear regression model is valid and that the conclusions drawn from the analysis are reliable.


