User_111995
Keen member
Hi,
I was wondering if someone could shed some light on forward selection for linear regression, and how you know which variable to add to the model. I understand that you want to add variables that are significant (based on p-values), and to keep a variable if the model has a high adjusted R-squared and the new variable increases the adjusted R-squared, but what steps should I take to decide which variable to add first? I know that past papers tend to tell you which variables to add, or say to add the parameters with the highest significance, but I wanted to think more outside the box.
If we think of the data used in CS1B - September 2021 (claims experience), would the process be to first fit a null model (fit0) with just the response variable claim_number, then create 4 other models that each update fit0 by adding one variable, and check the adjusted R-squared of each? Would it be worthwhile to look at the AIC too? Would you then take the variable that gives the highest adjusted R-squared and continue updating the model, adding any new parameter that improves it? Would you apply the same logic to interaction terms too?
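To make the loop I'm describing concrete, here is a minimal sketch of forward selection by adjusted R-squared. This is just a hypothetical illustration: it is written in Python rather than R (CS1B itself uses R, where you would fit with lm()/update()), it uses made-up data rather than the CS1B claims dataset, and the variable names x1, x2, x3 and all the helper functions are my own invention. It fits each candidate model by ordinary least squares, adds the variable that most improves adjusted R-squared, and stops when no candidate improves it.

```python
# Hypothetical forward-selection sketch (NOT the CS1B data, NOT R):
# add one variable at a time, keeping the one with the best adjusted R^2.
import random

def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivot
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def adj_r2(y, X_cols):
    """Fit y ~ intercept + X_cols by least squares; return adjusted R^2."""
    n, p = len(y), len(X_cols) + 1          # p = number of coefficients
    X = [[1.0] + [col[i] for col in X_cols] for i in range(n)]
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)                  # normal equations
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    ybar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - (sse / (n - p)) / (sst / (n - 1))

# Made-up data: y depends on x1 and x2; x3 is pure noise.
random.seed(1)
n = 60
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
y = [2 * a + b + random.gauss(0, 0.5) for a, b in zip(x1, x2)]

cols = {"x1": x1, "x2": x2, "x3": x3}
remaining = set(cols)
selected = []                # variables in the model so far
best = adj_r2(y, [])         # "null" model: intercept only (adj R^2 = 0)

while remaining:
    # Refit once per remaining candidate, each time adding that one variable
    scores = {v: adj_r2(y, [cols[s] for s in selected] + [cols[v]])
              for v in remaining}
    v = max(scores, key=scores.get)
    if scores[v] <= best:
        break                # no candidate improves adjusted R^2 -> stop
    best = scores[v]
    selected.append(v)
    remaining.remove(v)

print("selected:", selected, "adjusted R^2:", round(best, 3))
```

Swapping the adjusted R-squared criterion for AIC only changes the scoring function (and the direction: you would minimise AIC rather than maximise adjusted R-squared); the add-one-refit-compare loop is the same, which is essentially what R's step() automates.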
I'm just confused because sometimes the adjusted R-squared may increase even though the new variable is only significant at the 1% level, for example, and how do you know you have definitely found the best model by stepping through? If a dataset has a lot of variables, wouldn't adding a single parameter at a time be a very long task?
Thank you