
Chapter 21

vidhya36

Very Active Member
On page 17 of Chapter 21, I read this sentence - "In a normal linear regression model, as we include more variables, the proportion of the variance in the dependent variable that is explained cannot decrease."

What do they mean by this statement?
 
I think it just means that as you include more variables in your regression model, these will (naturally) pick up more of the marginal effects relating specifically to each variable added, helping to explain the response variable. So it's only by removing variables that the unexplained variance could potentially increase.
 
A very intuitive answer! Just to add to that, there are good theoretical reasons why this must be the case:

1. Definition of the model methodology

We are using OLS to fit the model. If we have an initial model with p variables and consider adding an additional one, then in the very worst case we can recover the original model by setting the p original parameters equal to their previous estimates and the parameter for the new variable to 0. So at the very least we will have as good a fit as we had previously; the only way to go is to improve (from the perspective mentioned in the notes - adding more variables this way doesn't mean an overall "better model"). This can also be seen by considering the relevant formulae.
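
To see this in action, here is a minimal R sketch (my own illustration, not from the notes) using the built-in mtcars data set. It adds a pure noise variable to a simple regression and compares the R-squared and residual sum of squares before and after - the object names (base_fit, wider_fit, noise) are just my own.

# minimal sketch: adding a pure noise variable cannot reduce R-squared
set.seed(1)
noise <- rnorm(nrow(mtcars))     # a variable with no real explanatory power

base_fit  <- lm(mpg ~ hp, data = mtcars)
wider_fit <- lm(mpg ~ hp + noise, data = mtcars)

summary(base_fit)$r.squared      # R-squared of the original model
summary(wider_fit)$r.squared     # never smaller than the value above

sum(residuals(base_fit)^2)       # residual sum of squares...
sum(residuals(wider_fit)^2)      # ...can only stay the same or fall

Even though the noise variable has no relationship with mpg, the wider model's R-squared will be at least as large as the original one (usually marginally larger).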

2. Understanding the extremes

It isn't only variables with "real" explanatory power that can cause overfitting issues. If you start with a simple linear regression model with one explanatory variable and, say, 32 observations, then adding 31 white noise variables to the model will result in a perfect fit (all residuals being 0). This is because, with at least as many free coefficients as observations, it effectively boils down to solving a system of 32 simultaneous equations, so the fitted values can match every observation exactly.

You can see this yourself with some R code using one of the built-in data sets. The overfitted model will have a sum of squared residuals of (effectively) 0. You can play around with the number of parameters to see how the fit changes as we go up to 32 explanatory variables in total (31 of them white noise).

This model is (not surprisingly) pretty useless for anything other than the data points it was fitted to; it would not be useful for prediction. Hence the issues with overfitting.

# check data
head(mtcars)

# quick linear model
mpg_hp <- lm(mpg ~ hp, data = mtcars)

# some benchmark measurement
mean((mpg_hp$residuals)^2)

# also anova
anova(mpg_hp)

# overfitted model: number of white noise variables to add (can be changed)
params <- 31
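
# set a seed so the white noise below is reproducible (optional addition, not in the original post)
set.seed(123)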

# generate white noise
x <- matrix(rnorm(params*length(mtcars$mpg)), ncol=params)

# fit model
mpg_hp_overfit <- lm(mtcars$mpg ~ mtcars$hp + x)

# test benchmark and anova
mean((mpg_hp_overfit$residuals)^2)
anova(mpg_hp_overfit)
summary(mpg_hp_overfit)
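
To illustrate the point about predictions, here is a rough follow-on sketch (my own addition, not from the original post). It assumes a 24/8 train/test split of mtcars and 20 noise variables rather than the numbers above; names like n_noise, fit_overfit and z1, z2, ... are just made up for the example.

# hold out some rows to see how badly an overfitted model predicts new data
set.seed(42)
n_noise <- 20                                   # assumed number of noise columns
dat <- mtcars[, c("mpg", "hp")]
noise <- matrix(rnorm(n_noise * nrow(dat)), ncol = n_noise)
colnames(noise) <- paste0("z", seq_len(n_noise))
dat <- cbind(dat, noise)

# fit on 24 rows, hold out the other 8
train_rows <- sample(seq_len(nrow(dat)), 24)
fit_simple  <- lm(mpg ~ hp, data = dat[train_rows, ])
fit_overfit <- lm(mpg ~ ., data = dat[train_rows, ])

# out-of-sample mean squared prediction error
test <- dat[-train_rows, ]
mean((test$mpg - predict(fit_simple, newdata = test))^2)
mean((test$mpg - predict(fit_overfit, newdata = test))^2)

The nearly saturated model fits its training rows very closely, but its prediction error on the held-out rows will typically be far larger than that of the simple model.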
 