CS1B Paper - Linear Regression

Discussion in 'CS1' started by User_111995, Sep 9, 2022.

  1. User_111995

    User_111995 Keen member

    Hi,
    I was wondering if someone could shed some light on forward selection for linear regression, and in particular how you decide which variable to add to the model. I understand that you want to add variables that are significant (p-values), and that you keep a variable if adding it increases the adjusted R squared, but what steps should I take to decide which variable to add first? I know that past papers tend to tell you which variables to add, or say to add the parameters with the highest significance, but I wanted to think more outside the box.

    If we think of the data used in CS1B September 2021 (claims experience), would the process be to first fit a null model (fit0) containing just an intercept for the response claim_number, then create four other models, each updating fit0 by adding one variable, and compare their adjusted R squared values? Would it be worthwhile looking at the AIC too? Would you then keep the variable that gives the highest adjusted R squared and continue updating the model, adding any new parameter that improves it? Would you apply the same logic to interaction terms too?
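    To make my question concrete, here is the first step of the loop I have in mind, sketched in Python with made-up data (the actual paper uses R and the real claims dataset, so the variable names and numbers below are purely hypothetical):

```python
def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (equals corr(x, y)^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def adjusted_r_squared(x, y):
    """Adjusted R^2 for one predictor: 1 - (1 - R^2)(n - 1)/(n - 2)."""
    n = len(x)
    return 1 - (1 - r_squared(x, y)) * (n - 1) / (n - 2)

# Hypothetical response and candidate predictors (not the real dataset)
claim_number = [10, 12, 15, 18, 22, 25, 28, 33]
candidates = {
    "vehicle_age": [1, 2, 3, 4, 5, 6, 7, 8],
    "driver_age":  [60, 55, 52, 44, 40, 35, 30, 22],
    "mileage":     [5, 9, 7, 12, 11, 16, 13, 19],
}

# Step 1 of forward selection: fit each one-variable model against the
# null model and keep the variable with the highest adjusted R^2
scores = {name: adjusted_r_squared(x, claim_number)
          for name, x in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

    Subsequent steps would repeat this, refitting with each remaining candidate added to the current model, which is where something like R's update() comes in.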

    I'm just confused because sometimes the adjusted R squared may increase but the new variable is only borderline significant (eg significant at the 5% level but not the 1% level), and how do you know you have definitely found the best model by stepping through? Also, if a dataset has a lot of variables, wouldn't adding a single parameter at a time be a very long task?

    Thank you
     
  2. John Lee

    John Lee ActEd Tutor Staff Member

    As a general rule of thumb, we add the most significant variables first (ie those with the highest correlation with the response).
    The adjusted \(R^2\) is a quick way to compare models (as is the AIC), but strictly we should perform a formal test (eg an F-test for nested models) and also double-check that the variables are significant within the fitted model.
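    For reference, with \(n\) observations and \(p\) explanatory variables the adjusted \(R^2\) is:

    \[ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} \]

    so adding a variable only increases it if the reduction in the residual sum of squares outweighs the lost degree of freedom.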

    However, in reality this doesn't always produce the best model, and so in practice we often try adding variables in different orders. Alternatively, we use backwards selection.
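    Whichever direction you step in, you need a criterion to compare the candidate models. A minimal AIC comparison between a null model and a one-variable model, sketched in Python on made-up numbers (R's AIC() keeps some likelihood constants that are dropped here, so the values differ from R's but the difference between models agrees):

```python
import math

def aic_linear(rss, n, k):
    """AIC for a Gaussian linear model: n*ln(RSS/n) + 2k, constants dropped.
    k counts the fitted coefficients plus the error variance."""
    return n * math.log(rss / n) + 2 * k

# Hypothetical data: response y and one candidate predictor x
y = [10, 12, 15, 18, 22, 25, 28, 33]
x = [1, 2, 3, 4, 5, 6, 7, 8]
n = len(y)

# Null model: intercept only, so RSS is the total sum of squares
my = sum(y) / n
rss_null = sum((b - my) ** 2 for b in y)

# One-variable model: least-squares slope and intercept
mx = sum(x) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
beta = sxy / sxx
alpha = my - beta * mx
rss_fit = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))

aic_null = aic_linear(rss_null, n, k=2)  # intercept + sigma
aic_fit = aic_linear(rss_fit, n, k=3)    # intercept + slope + sigma
print(aic_null, aic_fit)  # the lower AIC is preferred
```

    The same comparison works at every step of forward or backward selection: refit, recompute AIC, and keep the move that lowers it.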

    But yes, with modern data collection (eg telematics) we often have too many variables, and so standard selection methods take too long. Hence we use PCA to reduce the number of variables in a pre-modelling step, and other methods to reduce the number of parameters in the model (which you'll learn a little about in CS2).
     
  3. User_111995

    User_111995 Keen member

    Hi John,

    Thank you for your reply, that's really helpful. It makes sense, as I remember an exercise in the R material for linear regression that asks you to fit a linear regression model using the variables most correlated with the response, so the model started with the variable that had the highest correlation.

    Thank you!
    Lily
     
  4. John Lee

    John Lee ActEd Tutor Staff Member

    Glad I could help.
     