Model assessment and variable selection

Making simpler models for complex analyses

Author

TheRimalaya

Published

March 5, 2017

Modified

October 8, 2024

Adding new variables to a model can introduce noise, complicating analysis. Simpler models tend to be better because they’re easier to understand and contain less noise. Statistical methods can help us choose the best variables and improve our models.

This tutorial will show you how to compare models to find the best one. Using the mtcars dataset available in R, it will explore methods for variable selection that help build efficient models with fewer variables. Here are two popular techniques for selecting the best subset of variables:

Best subset method

The best subset selection procedure identifies the optimal regression model by evaluating all possible subsets of variables. As the number of variables increases, the number of potential combinations grows exponentially. For instance, with 2 variables, there are 4 possible subsets: one with no predictors, two with single predictors, and one with both predictors. With 4 variables, the number of possible subsets jumps to 16. Consequently, fitting all possible subsets of a large number of predictors can become computationally intensive.
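The exponential growth is easy to verify: with \(p\) candidate predictors there are \(2^p\) possible subsets, since each predictor is either included or excluded. A quick check in R:

```r
# Number of possible subsets for p candidate predictors is 2^p
# (each predictor is either in or out of the model)
p <- c(2, 4, 10)
setNames(2^p, paste0("p=", p))
#>  p=2  p=4 p=10
#>    4   16 1024
```

So for the 10 predictors in mtcars, an exhaustive search must fit 1024 models.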

After fitting all possible models, they are compared based on various criteria. These criteria may include the coefficient of determination (\(R^2\)), adjusted \(R^2\), Mallows’ \(C_p\), AIC, and BIC. In R, we can use the leaps package for this task. Let’s look at the mtcars example to see this in action.

Complete/full model

Fitting a complete model
full.model <- lm(mpg ~ ., data = mtcars)
smry <- summary(full.model)
smry

Call:
lm(formula = mpg ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4506 -1.6044 -0.1196  1.2193  4.6271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 12.30337   18.71788   0.657   0.5181  
cyl         -0.11144    1.04502  -0.107   0.9161  
disp         0.01334    0.01786   0.747   0.4635  
hp          -0.02148    0.02177  -0.987   0.3350  
drat         0.78711    1.63537   0.481   0.6353  
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739  
vs           0.31776    2.10451   0.151   0.8814  
am           2.52023    2.05665   1.225   0.2340  
gear         0.65541    1.49326   0.439   0.6652  
carb        -0.19942    0.82875  -0.241   0.8122  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869, Adjusted R-squared:  0.8066 
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Here, we can see that the model has explained 86.9 percent of the variation in mpg, but none of the predictors is significant. This is a hint that unnecessary variables have increased the model error. Using the regsubsets function from the leaps package, we can select a subset of predictors based on some criterion.
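As a side note, these figures need not be read off the printed output: summary.lm stores them as components of the summary object. A quick sketch:

```r
full.model <- lm(mpg ~ ., data = mtcars)
smry <- summary(full.model)

round(smry$r.squared, 3)      # 0.869 -- variation in mpg explained
round(smry$adj.r.squared, 3)  # 0.807 -- penalised for model size
# No predictor reaches the 5% significance level:
any(coef(smry)[-1, "Pr(>|t|)"] < 0.05) # FALSE
```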

Selecting best subset

Selecting best subset model
library(leaps)
best.subset <- regsubsets(
  x      = mtcars[, -1], # predictor variables
  y      = mtcars[, 1], # response variable (mpg)
  nbest  = 1, # top 1 best model
  nvmax  = ncol(mtcars) - 1, # max. number of variable (all)
  method = "exhaustive" # search all possible subset
)
bs.smry <- summary(best.subset)

We can combine the following summary output with a plot created from additional estimates to get some insight. These estimates are also found in the summary object. The output shows which variables are included in each model with a star (*).

Summary of best subset
library(tidytable) # provides %>%, as_tidytable(), tidytable() and pivot_longer()

bs.smry$outmat %>% 
  as_tidytable(.keep_rownames = "model") %>% 
  gt::gt(rowname_col = "model") %>% 
  gt::opt_vertical_padding(0.5) %>% 
  gt::opt_row_striping() %>% 
  gt::tab_options(
    column_labels.font.weight = "bold",
    stub.font.weight = "bold"
  )

bs.est <- tidytable(
  nvar   = 1:(best.subset$nvmax - 1),
  adj.r2 = round(bs.smry$adjr2, 3),
  cp     = round(bs.smry$cp, 3),
  bic    = round(bs.smry$bic, 3)
) %>% pivot_longer(
  cols = c(adj.r2:bic),
  names_to = "estimates",
  values_to = "value"
)
         cyl disp hp drat wt qsec vs am gear carb
1  ( 1 )                  *
2  ( 1 )  *               *
3  ( 1 )                  *   *      *
4  ( 1 )          *       *   *      *
5  ( 1 )      *   *       *   *      *
6  ( 1 )      *   *   *   *   *      *
7  ( 1 )      *   *   *   *   *      *   *
8  ( 1 )      *   *   *   *   *      *   *    *
9  ( 1 )      *   *   *   *   *   *  *   *    *
10 ( 1 )  *   *   *   *   *   *   *  *   *    *

We can make a plot to visualise the properties of these individual models and select the model with a specific number of predictors that gives the minimum BIC, the minimum \(C_p\), or the maximum adjusted \(R^2\).

Plotting Adj-Rsq, BIC, and CP
bs.est.select <- bs.est %>%
  group_by(estimates) %>%
  filter(
    (value == max(value) & estimates == "adj.r2") |
      (value == min(value) & estimates != "adj.r2")
  )
library(ggplot2)

ggplot(bs.est, aes(nvar, value, color = estimates)) +
  geom_point(shape = 21, fill = "lightgray") +
  geom_line() +
  facet_wrap(~estimates, scales = "free_y") +
  theme(legend.position = "top") +
  labs(
    x = "Number of variables in the model",
    y = "Value of Estimate"
  ) +
  scale_x_continuous(breaks = seq(0, 10, 2)) +
  geom_point(
    data = bs.est.select, fill = "red",
    shape = 21
  ) +
  geom_text(
    aes(label = paste0("nvar:", nvar, "\n", "value:", value)),
    data = bs.est.select,
    size = 3, hjust = 0, vjust = c(1, -1, -1),
    color = "black", family = "monospace"
  )

Adj-Rsq, BIC, and Mallows’ CP for models with increasing number of predictors

From these plots, we see that with 5 variables we obtain the maximum adjusted coefficient of determination (\(R^2\)). Similarly, both BIC and Mallows’ \(C_p\) are minimised for the model with only 3 predictor variables. With the help of the table above, we can identify these variables: from the row corresponding to 3 variables, we see that the three predictors are wt, qsec and am. To obtain the maximum adjusted \(R^2\), disp and hp should be added to these 3 predictors.
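Rather than reading the star table by eye, the same information can be pulled out with leaps’ coef method for regsubsets objects, where id selects the model size:

```r
library(leaps)

best.subset <- regsubsets(
  x = mtcars[, -1], y = mtcars[, 1],
  nvmax = ncol(mtcars) - 1, method = "exhaustive"
)

coef(best.subset, id = 3) # best 3-variable model: wt, qsec, am
coef(best.subset, id = 5) # best 5-variable model adds disp and hp
```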

This way, we can reduce a model to a few variables by optimising different assessment criteria. Let’s look at the fit of these reduced models:

Reduced Model
model.3 <- lm(mpg ~ wt + qsec + am, data = mtcars)
model.5 <- update(model.3, . ~ . + disp + hp)

Model summaries

Model summary of model with 3 variables
summary(model.3)

Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4811 -1.5555 -0.7257  1.4110  4.6610 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6178     6.9596   1.382 0.177915    
wt           -3.9165     0.7112  -5.507 6.95e-06 ***
qsec          1.2259     0.2887   4.247 0.000216 ***
am            2.9358     1.4109   2.081 0.046716 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared:  0.8497,    Adjusted R-squared:  0.8336 
F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
Model summary of model with 5 variables
summary(model.5)

Call:
lm(formula = mpg ~ wt + qsec + am + disp + hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5399 -1.7398 -0.3196  1.1676  4.5534 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 14.36190    9.74079   1.474  0.15238   
wt          -4.08433    1.19410  -3.420  0.00208 **
qsec         1.00690    0.47543   2.118  0.04391 * 
am           3.47045    1.48578   2.336  0.02749 * 
disp         0.01124    0.01060   1.060  0.29897   
hp          -0.02117    0.01450  -1.460  0.15639   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.429 on 26 degrees of freedom
Multiple R-squared:  0.8637,    Adjusted R-squared:  0.8375 
F-statistic: 32.96 on 5 and 26 DF,  p-value: 1.844e-10

From these outputs, it seems that although the adjusted \(R^2\) has increased in the larger model, the additional variables are not significant. We can compare these two models with an ANOVA test, which compares their residual variances. We can write the hypotheses as,

\(H_0:\) Model 1 and Model 2 are the same vs \(H_1:\) Model 1 and Model 2 are different

where Model 1 and Model 2 represent the 3-variable and 5-variable models, respectively.

Anova comparing Model with 3 and 5 variables
anova(model.3, model.5)
Analysis of Variance Table

Model 1: mpg ~ wt + qsec + am
Model 2: mpg ~ wt + qsec + am + disp + hp
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     28 169.29                           
2     26 153.44  2    15.848 1.3427 0.2786

The ANOVA result cannot reject the null hypothesis (p = 0.2786), so we have no evidence that Model 1 and Model 2 differ. It is therefore better to select the simpler model with 3 predictor variables.
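Information criteria point the same way: both AIC and BIC (lower is better) prefer the 3-variable model.

```r
model.3 <- lm(mpg ~ wt + qsec + am, data = mtcars)
model.5 <- update(model.3, . ~ . + disp + hp)

AIC(model.3, model.5) # model.3 has the lower AIC
BIC(model.3, model.5) # BIC penalises the extra terms even harder
```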

Step-wise selection

In this section, we’ll explore another type of variable selection, similar to yet distinct from the best subset method: forward and backward stepwise selection, again using the mtcars dataset in R. We’ll briefly compare these methods and provide insights into their application.

Forward selection starts with an empty model, and predictors are added one by one based on a selection criterion, typically the Akaike Information Criterion (AIC).

Forward selection
# Initial empty model
initial_model <- lm(mpg ~ 1, data = mtcars)

# Full model with all predictors
full_model <- lm(mpg ~ ., data = mtcars)

# Stepwise forward selection
forward_model <- MASS::stepAIC(
  initial_model, 
  direction = "forward", 
  scope = list(lower = initial_model, upper = full_model)
)

# Summary of the forward selection model
summary(forward_model)
Start:  AIC=115.94
mpg ~ 1

       Df Sum of Sq     RSS     AIC
+ wt    1    847.73  278.32  73.217
+ cyl   1    817.71  308.33  76.494
+ disp  1    808.89  317.16  77.397
+ hp    1    678.37  447.67  88.427
+ drat  1    522.48  603.57  97.988
+ vs    1    496.53  629.52  99.335
+ am    1    405.15  720.90 103.672
+ carb  1    341.78  784.27 106.369
+ gear  1    259.75  866.30 109.552
+ qsec  1    197.39  928.66 111.776
<none>              1126.05 115.943

Step:  AIC=73.22
mpg ~ wt

       Df Sum of Sq    RSS    AIC
+ cyl   1    87.150 191.17 63.198
+ hp    1    83.274 195.05 63.840
+ qsec  1    82.858 195.46 63.908
+ vs    1    54.228 224.09 68.283
+ carb  1    44.602 233.72 69.628
+ disp  1    31.639 246.68 71.356
<none>              278.32 73.217
+ drat  1     9.081 269.24 74.156
+ gear  1     1.137 277.19 75.086
+ am    1     0.002 278.32 75.217

Step:  AIC=63.2
mpg ~ wt + cyl

       Df Sum of Sq    RSS    AIC
+ hp    1   14.5514 176.62 62.665
+ carb  1   13.7724 177.40 62.805
<none>              191.17 63.198
+ qsec  1   10.5674 180.60 63.378
+ gear  1    3.0281 188.14 64.687
+ disp  1    2.6796 188.49 64.746
+ vs    1    0.7059 190.47 65.080
+ am    1    0.1249 191.05 65.177
+ drat  1    0.0010 191.17 65.198

Step:  AIC=62.66
mpg ~ wt + cyl + hp

       Df Sum of Sq    RSS    AIC
<none>              176.62 62.665
+ am    1    6.6228 170.00 63.442
+ disp  1    6.1762 170.44 63.526
+ carb  1    2.5187 174.10 64.205
+ drat  1    2.2453 174.38 64.255
+ qsec  1    1.4010 175.22 64.410
+ gear  1    0.8558 175.76 64.509
+ vs    1    0.0599 176.56 64.654

Call:
lm(formula = mpg ~ wt + cyl + hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9290 -1.5598 -0.5311  1.1850  5.8986 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
wt          -3.16697    0.74058  -4.276 0.000199 ***
cyl         -0.94162    0.55092  -1.709 0.098480 .  
hp          -0.01804    0.01188  -1.519 0.140015    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.512 on 28 degrees of freedom
Multiple R-squared:  0.8431,    Adjusted R-squared:  0.8263 
F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11
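A detail worth noting: the AIC values printed by stepAIC come from extractAIC(), which drops additive constants from the log-likelihood, so they differ from AIC() by a fixed constant. The ranking of models, which is all that matters for selection, is unaffected.

```r
# Model chosen by forward selection above
model.fwd <- lm(mpg ~ wt + cyl + hp, data = mtcars)

extractAIC(model.fwd) # df and AIC on stepAIC's scale (constants dropped)
AIC(model.fwd)        # full log-likelihood based AIC; larger by a constant
```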

Backward selection begins with the full model, removing one predictor at a time to improve the model according to the chosen criterion, typically AIC.

Backward selection
# Full model with all predictors
full_model <- lm(mpg ~ ., data = mtcars)

# Stepwise backward selection
backward_model <- MASS::stepAIC(
  full_model, 
  direction = "backward"
)

# Summary of the backward selection model
summary(backward_model)
Start:  AIC=70.9
mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

       Df Sum of Sq    RSS    AIC
- cyl   1    0.0799 147.57 68.915
- vs    1    0.1601 147.66 68.932
- carb  1    0.4067 147.90 68.986
- gear  1    1.3531 148.85 69.190
- drat  1    1.6270 149.12 69.249
- disp  1    3.9167 151.41 69.736
- hp    1    6.8399 154.33 70.348
- qsec  1    8.8641 156.36 70.765
<none>              147.49 70.898
- am    1   10.5467 158.04 71.108
- wt    1   27.0144 174.51 74.280

Step:  AIC=68.92
mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb

       Df Sum of Sq    RSS    AIC
- vs    1    0.2685 147.84 66.973
- carb  1    0.5201 148.09 67.028
- gear  1    1.8211 149.40 67.308
- drat  1    1.9826 149.56 67.342
- disp  1    3.9009 151.47 67.750
- hp    1    7.3632 154.94 68.473
<none>              147.57 68.915
- qsec  1   10.0933 157.67 69.032
- am    1   11.8359 159.41 69.384
- wt    1   27.0280 174.60 72.297

Step:  AIC=66.97
mpg ~ disp + hp + drat + wt + qsec + am + gear + carb

       Df Sum of Sq    RSS    AIC
- carb  1    0.6855 148.53 65.121
- gear  1    2.1437 149.99 65.434
- drat  1    2.2139 150.06 65.449
- disp  1    3.6467 151.49 65.753
- hp    1    7.1060 154.95 66.475
<none>              147.84 66.973
- am    1   11.5694 159.41 67.384
- qsec  1   15.6830 163.53 68.200
- wt    1   27.3799 175.22 70.410

Step:  AIC=65.12
mpg ~ disp + hp + drat + wt + qsec + am + gear

       Df Sum of Sq    RSS    AIC
- gear  1     1.565 150.09 63.457
- drat  1     1.932 150.46 63.535
<none>              148.53 65.121
- disp  1    10.110 158.64 65.229
- am    1    12.323 160.85 65.672
- hp    1    14.826 163.35 66.166
- qsec  1    26.408 174.94 68.358
- wt    1    69.127 217.66 75.350

Step:  AIC=63.46
mpg ~ disp + hp + drat + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- drat  1     3.345 153.44 62.162
- disp  1     8.545 158.64 63.229
<none>              150.09 63.457
- hp    1    13.285 163.38 64.171
- am    1    20.036 170.13 65.466
- qsec  1    25.574 175.67 66.491
- wt    1    67.572 217.66 73.351

Step:  AIC=62.16
mpg ~ disp + hp + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- disp  1     6.629 160.07 61.515
<none>              153.44 62.162
- hp    1    12.572 166.01 62.682
- qsec  1    26.470 179.91 65.255
- am    1    32.198 185.63 66.258
- wt    1    69.043 222.48 72.051

Step:  AIC=61.52
mpg ~ hp + wt + qsec + am

       Df Sum of Sq    RSS    AIC
- hp    1     9.219 169.29 61.307
<none>              160.07 61.515
- qsec  1    20.225 180.29 63.323
- am    1    25.993 186.06 64.331
- wt    1    78.494 238.56 72.284

Step:  AIC=61.31
mpg ~ wt + qsec + am

       Df Sum of Sq    RSS    AIC
<none>              169.29 61.307
- am    1    26.178 195.46 63.908
- qsec  1   109.034 278.32 75.217
- wt    1   183.347 352.63 82.790

Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4811 -1.5555 -0.7257  1.4110  4.6610 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.6178     6.9596   1.382 0.177915    
wt           -3.9165     0.7112  -5.507 6.95e-06 ***
qsec          1.2259     0.2887   4.247 0.000216 ***
am            2.9358     1.4109   2.081 0.046716 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.459 on 28 degrees of freedom
Multiple R-squared:  0.8497,    Adjusted R-squared:  0.8336 
F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
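Note that forward and backward selection landed on different 3-variable models here (wt + cyl + hp vs wt + qsec + am): stepwise searches are greedy, so the two directions need not agree with each other or with best subset. Comparing the two picks on stepAIC’s scale shows backward’s choice is the better of the two:

```r
forward_pick  <- lm(mpg ~ wt + cyl + hp,  data = mtcars)
backward_pick <- lm(mpg ~ wt + qsec + am, data = mtcars)

extractAIC(forward_pick)[2]  # AIC of the forward selection result
extractAIC(backward_pick)[2] # lower AIC: backward's pick wins here
```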

Comparing best subset and stepwise selection method

Best subset selection considers all possible combinations of predictors and selects the model with the best performance based on a specified criterion (e.g., AIC, BIC, adjusted R²). While comprehensive, this method can be computationally expensive for datasets with a large number of predictors.

  • Best Subset Selection: Evaluates every possible model, providing the optimal model based on the criterion. It ensures the best fit but at a high computational cost.
  • Stepwise Selection (Forward and Backward): Evaluates models sequentially. It’s more computationally efficient but may not guarantee the globally optimal model.

Stepwise selection methods, whether forward or backward, provide a practical approach to model selection by adding or removing predictors based on defined criteria. While not as exhaustive as best subset selection, they offer a balance between computational efficiency and model performance. By understanding and applying these methods within the mtcars dataset, we can navigate model selection systematically and efficiently.
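The computational gap between the two approaches is easy to quantify. With \(p\) candidate predictors, best subset fits \(2^p\) models, while forward selection fits at most \(1 + p(p+1)/2\): the null model, then the remaining candidates tried at each step. For the 10 predictors in mtcars:

```r
p <- 10
c(
  best_subset = 2^p,                 # every possible subset: 1024 models
  forward     = 1 + p * (p + 1) / 2  # null model + candidates per step: 56
)
```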

Glossary

  1. R-squared (R²): A statistical measure representing the proportion of the variance for the dependent variable that’s explained by the independent variables in the model. R² values range from 0 to 1, with higher values indicating better model performance.

  2. Adjusted R-squared (R² adjusted): Adjusted R² modifies R² to account for the number of predictors in the model. It provides a more accurate measure when comparing models with a different number of predictors, as it penalizes the addition of non-informative predictors.

  3. Akaike Information Criterion (AIC): AIC is an estimator of the relative quality of statistical models for a given set of data. It balances the goodness of fit of the model with the number of predictors, penalizing more complex models to avoid overfitting. Lower AIC values indicate better models.

  4. Bayesian Information Criterion (BIC): Similar to AIC, BIC also evaluates model quality but imposes a more substantial penalty for the number of predictors. BIC is used to discourage overfitting, and lower BIC values indicate a better model.

  5. Mallows’ \(C_p\): Mallows’ \(C_p\) criterion assesses the trade-off between model complexity and goodness of fit. Lower values of \(C_p\) are desired, with values close to the number of predictors plus one indicating well-fitted models.
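For reference, the standard definitions behind these criteria, with \(n\) observations, \(p\) predictors in the submodel, \(k\) estimated parameters, \(\hat{L}\) the maximised likelihood, and \(\hat{\sigma}^2\) the error variance estimated from the full model:

\[
\begin{aligned}
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}, \\
\mathrm{AIC} &= 2k - 2\ln\hat{L}, \\
\mathrm{BIC} &= k\ln(n) - 2\ln\hat{L}, \\
C_p &= \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2(p + 1).
\end{aligned}
\]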