Trying to Trick Linear Regression - Estimating Coefficients for Variables in R

In this post we will try to trick linear regression into thinking that a redundant variable is statistically significant. By redundant, I mean a candidate predictor variable that in reality is just noise (no effect on the outcome) but that we might include in an experiment because we don’t know whether it is important or not. The trick is that we can set up the data generating process such that a redundant variable is highly correlated with the response. In such cases, the naive intuition is that the model would get “confused” about the importance of this variable and perhaps identify it as significant, just as a casual observer might when viewing plots of variable vs. outcome, given that they rise and fall together.

It is commonly said that one should consider removing variables from the model when they are highly correlated with each other…but why? Because it messes with your model’s predictions? Because it messes with the estimation of the variable coefficients? Perhaps the uncertainty around the estimates? Let’s use a simulation to see…

Load Libraries

library(tidyverse)
library(here)
library(broom)
library(gt)
library(GGally)

We’re going to build this dataset from scratch because we want to control the coefficients of each variable and also the overall effect size. There is going to be one numeric outcome and multiple predictor variables, which puts us squarely in the realm of multiple regression. The model for multiple regression is shown below:

\(Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \ldots + \beta_p X_{ip} + \epsilon_i\)

for each observation \(i = 1, ... , n\).

In the formula above we consider \(n\) observations of one dependent variable and \(p\) independent variables. Thus, \(Y_i\) is the \(i\)th observation of the dependent variable, \(X_{ij}\) is the \(i\)th observation of the \(j\)th independent variable, \(j = 1, 2, ..., p\). The values \(\beta_j\) represent parameters to be estimated, and \(\epsilon_i\) is the \(i\)th independent identically distributed normal error.1
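
To make the formula concrete, here is a minimal vectorized sketch of that data generating process. The object names below are just for illustration and aren’t reused later in the post:

# illustration only: n observations of p predictors, expressed in matrix form
n <- 5
p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p) # the X_ij values
beta_0 <- 0.5 # intercept
betas <- c(1, 0, 0) # one true effect, two null effects
epsilon <- rnorm(n, mean = 0, sd = 1) # iid normal errors
y <- as.vector(beta_0 + X %*% betas + epsilon) # Y_i = beta_0 + sum_j beta_j * X_ij + epsilon_i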

Let’s start by setting up some placeholder columns to hold the beta coefficients and the values of the x’s (predictor variables). Thinking ahead, we’ll want to be working row-wise to combine the coefficients, x’s, and error in order to generate an observation \(y\) for each row. Note that the values for the \(x\)’s are just placeholders; we’ll assign them values after getting the structure of our dataframe into good shape. We’ll make 10 predictor variables total and 10 coefficients to go with them - a relatively large but still reasonable number of variables for a benchtop experiment in the medical device domain. The first coefficient is set to 1 so that the value of that variable will have a real, causal impact on the outcome. All other coefficients are set to zero, meaning they have no true effect with respect to generating the observations.

betas_tbl <- tibble(
  var_prefix = rep("b", 10),
  var_suffix = seq(from = 1, to = 10, by = 1)
) %>%
  mutate(var = str_glue("{var_prefix}_{var_suffix}") %>% as_factor()) %>%
  select(var) %>%
  mutate(case_1 = c(1, rep(0, 9))) %>%
  pivot_wider(names_from = "var", values_from = "case_1")

x_tbl <- tibble(
  var_prefix = rep("x", 10),
  var_suffix = seq(from = 1, to = 10, by = 1)
) %>%
  mutate(var = str_glue("{var_prefix}_{var_suffix}") %>% as_factor()) %>%
  select(var) %>%
  mutate(case_1 = rep(0, 10)) %>%
  pivot_wider(names_from = "var", values_from = "case_1")

betas_tbl %>%
  gt()
b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8 b_9 b_10
1 0 0 0 0 0 0 0 0 0
x_tbl %>%
  gt()
x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10
0 0 0 0 0 0 0 0 0 0

Now that we have placeholders set up to handle our eventual row-wise operations, we can proceed with populating the values for the observed variables \(X_{ij}\) and the residuals \(\epsilon_i\) and then eventually assembling the set of observations.2 After combining the columns using bind_cols() we use slice() to copy the setup structure many times, making a duplicate row for each instance (set by the each argument of rep()). The setup is consistent with the tidy format for data structures, where each row is an observation.

After replicating the setup, we use rowwise() to group by individual rows and allow our random number generation to be unique for each row. The next step is to replace the \(X\) values with a random number from the standard normal distribution by mutating across the \(X\) columns.
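
If the slice() and rowwise() combination is unfamiliar, here’s a tiny standalone sketch of the idea using a throwaway one-column tibble (not part of the actual setup):

# toy example: one setup row becomes three rows, each with its own random draw
tibble(x_1 = 0) %>%
  slice(rep(1:n(), each = 3)) %>% # 1 row becomes 3 identical rows
  rowwise() %>% # group by row so each rnorm(1) call is drawn independently
  mutate(x_1 = rnorm(1, mean = 0, sd = 1)) %>%
  ungroup()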

Now to the part where we try to trick the algorithm. The line mutate(x_2 = x_1 + rnorm(n = 1, mean = 0, sd = 1)) replaces the randomly drawn \(X_2\) value from above with the exact value drawn for the one true predictor \(X_1\) plus a little noise. This is the critical step in our deception because we can see that \(X_2\) is still not involved in the data generation process, since it gets multiplied by \(\beta_2\) (which is zero) before being added into the observation. However, it will still be correlated with the true predictor \(X_1\) and also with the observation \(Y_i\).

After laying the trap, we move on to combining the \(X\)s and the \(\beta\)s before adding a residual error term through resid = rnorm(1, mean = 0, sd = residual_sd). The last step is to add up the bx and resid values to make the observation using rowSums().

set.seed(08062021)

residual_sd <- 1 # controls effect size
number_of_observations <- 100 # number of observations we want in the dataset

full_wide_tbl <- betas_tbl %>%
  bind_cols(x_tbl) %>%
  slice(rep(1:n(), each = number_of_observations)) %>% # copy the setup once for each desired observation in the dataset
  rowwise() %>%
  mutate(across(x_1:x_10, ~ rnorm(1, mean = 0, sd = 1))) %>% # assign x values
  mutate(x_2 = x_1 + rnorm(n = 1, mean = 0, sd = 1)) %>% # modify X_2 to match X_1 plus some noise
  mutate(
    bx_1 = b_1 * x_1,
    bx_2 = b_2 * x_2,
    bx_3 = b_3 * x_3,
    bx_4 = b_4 * x_4,
    bx_5 = b_5 * x_5,
    bx_6 = b_6 * x_6,
    bx_7 = b_7 * x_7,
    bx_8 = b_8 * x_8,
    bx_9 = b_9 * x_9,
    bx_10 = b_10 * x_10,
    resid = rnorm(1, mean = 0, sd = residual_sd)
  ) %>% # add a residual noise term to each observation
  ungroup() %>%
  mutate(observation = rowSums(select(., matches("bx|resid")))) # combine the bx's and the residual to make an observation for each row

full_wide_tbl %>%
  gt_preview()
b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8 b_9 b_10 x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10 bx_1 bx_2 bx_3 bx_4 bx_5 bx_6 bx_7 bx_8 bx_9 bx_10 resid observation
1 1 0 0 0 0 0 0 0 0 0 0.3395447 0.6799476 -0.3331312 0.02533631 -0.2857135 1.2768024 -0.3205689 0.3193416 -0.13623157 0.1504407 0.3395447 0 0 0 0 0 0 0 0 0 0.96925533 1.3088001
2 1 0 0 0 0 0 0 0 0 0 -0.8526105 -0.6350722 -1.7195343 -0.07721092 -0.7417792 0.4862414 -0.9801599 1.1703553 0.02182926 -0.4346919 -0.8526105 0 0 0 0 0 0 0 0 0 -0.04323569 -0.8958462
3 1 0 0 0 0 0 0 0 0 0 -2.1833921 -2.7169440 -0.8642178 -0.44676351 1.2837061 0.4657978 1.0804734 -1.3062251 1.21380041 -0.2091334 -2.1833921 0 0 0 0 0 0 0 0 0 0.88849224 -1.2948998
4 1 0 0 0 0 0 0 0 0 0 -0.7433093 -0.9123439 -0.3321398 0.94530764 -1.9000828 -0.4433316 -1.7927324 0.2250934 1.53592952 1.8170489 -0.7433093 0 0 0 0 0 0 0 0 0 0.28102444 -0.4622849
5 1 0 0 0 0 0 0 0 0 0 -0.1853604 -1.6034941 1.4372441 -0.02187797 -0.3158500 1.7377585 0.5419371 -0.8313007 -0.64240892 -1.3688378 -0.1853604 0 0 0 0 0 0 0 0 0 0.98018641 0.7948260
6..99
100 1 0 0 0 0 0 0 0 0 0 -0.1164012 -0.2047272 -0.2387522 0.65005294 0.7160411 0.9361878 0.2668423 -0.7293151 0.24998149 1.0019955 -0.1164012 0 0 0 0 0 0 0 0 0 0.27812099 0.1617198

Perfect. n=100 observations organized one per row, with the generative components available for examination in the other columns. Let’s see what happens if we plot the predictor variables vs. the outcome for each instance:

full_wide_tbl %>%
  pivot_longer(cols = x_1:x_10, names_to = "var", values_to = "val") %>%
  mutate(var = as_factor(var)) %>%
  ggplot(aes(x = val, y = observation)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~var) +
  labs(
    title = "Predictor Variables Plotted Against Outcome",
    subtitle = "x_1 is the only true predictor",
    x = ""
  ) +
  theme_bw()

Now imagine you are the engineer who set up this experiment and you saw the above plot. The obvious thing to conclude is that both variables x_1 and x_2 are important to the outcome. If you were feeling really brash, you might even say that changing x_1 and/or x_2 causes changes in y. Let’s see if linear regression falls into this same trap.

lm(observation ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10, data = full_wide_tbl) %>%
  summary() %>%
  tidy() %>%
  mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
  mutate(`significant?` = case_when(
    p.value <= .05 ~ "heck_yes",
    TRUE ~ "nope"
  )) %>%
  gt()
term estimate std.error statistic p.value significant?
(Intercept) 0.15 0.10 1.42 0.16 nope
x_1 1.04 0.14 7.62 0.00 heck_yes
x_2 -0.02 0.11 -0.16 0.87 nope
x_3 -0.09 0.13 -0.67 0.50 nope
x_4 -0.13 0.11 -1.15 0.25 nope
x_5 0.21 0.11 1.90 0.06 nope
x_6 0.23 0.11 2.17 0.03 heck_yes
x_7 0.05 0.11 0.47 0.64 nope
x_8 -0.06 0.11 -0.59 0.56 nope
x_9 -0.13 0.11 -1.15 0.25 nope
x_10 0.19 0.12 1.66 0.10 nope

This is really pretty interesting. Not only has the model seen through our trap and ignored x_2, but its estimate of x_1 is quite reasonable at 1.04 (true value is 1.00). Linear regression would not report to the boss that x_2 causes the outcomes, or even that x_2 is associated with the outcome, but the naive engineer might based on the visuals! This is nice evidence for using linear regression in cases where some would say a simple correlation analysis would do fine!3

But maybe this is just because we diluted the effect of x_2 too much. Perhaps if we just reduced the noise, then we could increase the correlation between x_2 and x_1 and cause the regression to think x_2 is important.
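
As a quick aside before we build the helpers: since x_2 = x_1 + noise and x_1 is drawn from a standard normal, the population correlation between the two works out to 1 / sqrt(1 + noise^2), so shrinking the noise pushes the correlation toward 1. A small sketch of that relationship (the noise values match the scenarios below):

# population correlation between x_1 and x_2 = x_1 + N(0, noise^2), with x_1 ~ N(0, 1)
cor_from_noise <- function(noise) 1 / sqrt(1 + noise^2)

cor_from_noise(2) # high noise -> weak-ish correlation, about 0.45
cor_from_noise(1) # medium noise -> about 0.71
cor_from_noise(0.1) # low noise -> about 0.995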

First, let’s convert our pipeline into a function that takes a noise factor as an input and spits out the table of simulated observations. Then we’ll write a couple more helper functions to make the facet plot and do the inference.

sim_fcn <- function(noise) {
  betas_tbl %>%
    bind_cols(x_tbl) %>%
    slice(rep(1:n(), each = number_of_observations)) %>%
    rowwise() %>%
    mutate(across(x_1:x_10, ~ rnorm(1, mean = 0, sd = 1))) %>%
    mutate(x_2 = x_1 + rnorm(n = 1, mean = 0, sd = noise)) %>% # control the strength of the correlation between x_1 and x_2
    mutate(
      bx_1 = b_1 * x_1,
      bx_2 = b_2 * x_2,
      bx_3 = b_3 * x_3,
      bx_4 = b_4 * x_4,
      bx_5 = b_5 * x_5,
      bx_6 = b_6 * x_6,
      bx_7 = b_7 * x_7,
      bx_8 = b_8 * x_8,
      bx_9 = b_9 * x_9,
      bx_10 = b_10 * x_10,
      resid = rnorm(1, mean = 0, sd = residual_sd)
    ) %>%
    ungroup() %>%
    mutate(observation = rowSums(select(., matches("bx|resid"))))
}

plt_fcn <- function(tbl) {
  tbl %>%
    pivot_longer(cols = x_1:x_10, names_to = "var", values_to = "val") %>%
    mutate(var = as_factor(var)) %>%
    ggplot(aes(x = val, y = observation)) +
    geom_point() +
    geom_smooth(method = "lm") +
    facet_wrap(~var) +
    labs(
      title = "Predictor Variables Plotted Against Outcome",
      subtitle = "x_1 is the only true predictor",
      x = ""
    ) +
    theme_bw()
}

inf_fcn <- function(tbl) {
  lm(observation ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10, data = tbl) %>%
    summary() %>%
    tidy() %>%
    mutate(across(where(is.numeric), ~ round(.x, 2))) %>%
    mutate(`significant?` = case_when(
      p.value <= .05 ~ "heck_yes",
      TRUE ~ "nope"
    ))
}

Sensitivity to Magnitude of Noise

With these helpers in hand, we can look at a few different scenarios for high, medium, and low noise. Another way to view this would be as low, medium, and high correlation, respectively, between the true predictor x_1 and the redundant x_2.

High Noise

set.seed(1)
high_noise_tbl <- sim_fcn(noise = 2)

high_noise_tbl %>% plt_fcn()

high_noise_tbl %>% inf_fcn()
## # A tibble: 11 x 6
##    term        estimate std.error statistic p.value `significant?`
##    <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>         
##  1 (Intercept)    -0.15      0.1      -1.42    0.16 nope          
##  2 x_1             0.91      0.12      7.42    0    heck_yes      
##  3 x_2             0.05      0.05      0.89    0.37 nope          
##  4 x_3            -0.03      0.1      -0.33    0.74 nope          
##  5 x_4             0.04      0.11      0.34    0.74 nope          
##  6 x_5            -0.03      0.09     -0.36    0.72 nope          
##  7 x_6            -0.01      0.11     -0.1     0.92 nope          
##  8 x_7            -0.01      0.1      -0.09    0.93 nope          
##  9 x_8            -0.03      0.1      -0.34    0.74 nope          
## 10 x_9            -0.13      0.1      -1.3     0.2  nope          
## 11 x_10            0.01      0.1       0.06    0.95 nope

Medium Noise

set.seed(2)
med_noise_tbl <- sim_fcn(noise = 1)

med_noise_tbl %>% plt_fcn()

med_noise_tbl %>% inf_fcn()
## # A tibble: 11 x 6
##    term        estimate std.error statistic p.value `significant?`
##    <chr>          <dbl>     <dbl>     <dbl>   <dbl> <chr>         
##  1 (Intercept)    -0.07      0.11     -0.67    0.51 nope          
##  2 x_1             0.84      0.13      6.55    0    heck_yes      
##  3 x_2             0.12      0.1       1.15    0.25 nope          
##  4 x_3             0.07      0.09      0.7     0.48 nope          
##  5 x_4             0.19      0.11      1.72    0.09 nope          
##  6 x_5             0.06      0.1       0.59    0.56 nope          
##  7 x_6             0.02      0.1       0.21    0.83 nope          
##  8 x_7            -0.05      0.1      -0.49    0.62 nope          
##  9 x_8             0.19      0.1       1.86    0.07 nope          
## 10 x_9             0.27      0.1       2.77    0.01 heck_yes      
## 11 x_10           -0.04      0.1      -0.42    0.68 nope

Low Noise

set.seed(3)
low_noise_tbl <- sim_fcn(noise = .1)

low_noise_tbl %>% plt_fcn()

low_noise_tbl %>% inf_fcn()

Even in the low noise (large correlation) condition, the model isn’t tempted to see x_2 as important. So is there any cost to pay for leaving all correlated variables in the model? The astute observer might point out that the std. error of the estimate for \(\beta_1\) seems to trend upwards when the noise is low. Does this mean we are more uncertain about the estimate of the true predictor \(x_1\) when it is correlated with another predictor like \(x_2\)? Or was that just due to chance and the nature of the random draws in the simulation? After all, we only built 1 model for each scenario.

To explore this further we need a bigger experiment - or set of experiments - that looks at a range of correlation strengths between x_1 and x_2 and repeats each setting many times. This will also provide a nice opportunity to practice with nested lists, a workflow that is challenging (for me) but also elegant and powerful. Here’s how we could go about it:

  • Set up a preliminary tibble as before that has columns for the betas and the x’s
  • Set up a separate tibble with a list of the desired “noise” values; the noise inversely controls the strength of the correlation between x_1 and x_2 - the smaller the noise, the stronger the correlation, and when noise is 0 the two variables are perfectly correlated.
  • Use crossing() to generate all combinations of noise with the betas and x’s (a quick toy illustration of crossing() follows this list)
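
If crossing() is new to you, here’s a toy illustration with throwaway tibbles (not the real setup) showing how every row of one input gets paired with every row of the other:

# toy illustration of tidyr::crossing()
crossing(
  tibble(letter = c("a", "b")),
  tibble(number = 1:3)
)
# returns 6 rows: a-1, a-2, a-3, b-1, b-2, b-3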

Set up the simulation parameters

prelim_tbl <- betas_tbl %>%
  bind_cols(x_tbl)

error_se_tbl <- tibble(
  error_sd = seq(from = 0.01, to = 3.01, by = .05)
)

setup_tbl <- prelim_tbl %>%
  crossing(error_se_tbl) %>%
  mutate(row_id = row_number()) %>%
  select(row_id, error_sd, everything())

setup_tbl %>%
  gt_preview()
row_id error_sd b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8 b_9 b_10 x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_10
1 1 0.01 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 0.06 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 3 0.11 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4 0.16 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 5 0.21 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6..60
61 61 3.01 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Make Function to Generate Observations

Now we need a function that can look at each row in the setup_tbl and simulate many observations based on those setup parameters. This is basically the same thing that we did before with slice(), only now the noise variable error_sd can vary (which changes the correlation strength between x_1 and x_2).

gen_observations_fcn <- function(sd_row_number = 1, reps = 100) {
  setup_tbl %>%
    mutate(n_row = row_number()) %>%
    filter(n_row == sd_row_number) %>%
    slice(rep(1:n(), each = reps)) %>%
    mutate(n_row = row_number()) %>%
    group_by(n_row) %>%
    mutate(across(x_1:x_10, ~ rnorm(1, mean = 0, sd = 1))) %>%
    mutate(x_2 = x_1 + rnorm(n = 1, mean = 0, sd = error_sd)) %>% # control the strength of the correlation between x_1 and x_2
    mutate(
      bx_1 = b_1 * x_1,
      bx_2 = b_2 * x_2,
      bx_3 = b_3 * x_3,
      bx_4 = b_4 * x_4,
      bx_5 = b_5 * x_5,
      bx_6 = b_6 * x_6,
      bx_7 = b_7 * x_7,
      bx_8 = b_8 * x_8,
      bx_9 = b_9 * x_9,
      bx_10 = b_10 * x_10,
      resid = rnorm(1, mean = 0, sd = 1)
    ) %>%
    ungroup()
}

Use the Function in a Loop to Repeat the Simulation

Sincere apologies to the for-loop haters but I don’t currently know a more elegant way to do this part (a purrr-flavored sketch of one possible alternative appears just after the loop). First we establish the number of simulations we want - in this case 100 at each noise level will do fine. We then initialize a holder tibble; the data inside this one isn’t important but we want the column names to match what comes out of the simulation loop so that we can bind_rows to build out the superset of simulation data. The final step is to loop the simulation 100 times, generating 100 synthetic observations at a specified correlation each time through. Inside the loop, after the 100 observations are generated at each condition, the model is built and the coefficients, standard errors, and p-values are extracted and stored. When all is done, for every strength of correlation we should have 100 estimates based on 100 models of 100 data points generated under those specified conditions.

n_sims <- 100

original_tbl <- tidy(summary(lm(observation ~ ., data = full_wide_tbl))) %>%
  filter(term == "(Intercept)") # holder tibble whose column names match the loop output; this row is filtered out later

for (i in 1:n_sims) {
  unnested_mods_tbl <- setup_tbl %>%
    rowwise() %>%
    mutate(observations_tbl = map(row_id, ~ gen_observations_fcn(.x, reps = 100))) %>%
    ungroup() %>%
    select(observations_tbl) %>%
    unnest(observations_tbl) %>%
    mutate(observation = rowSums(select(., matches("bx|resid")))) %>%
    group_by(error_sd) %>%
    nest() %>%
    mutate(model = map(data, function(data) tidy(summary(lm(observation ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10, data = data))))) %>%
    select(-data) %>%
    unnest(model) %>%
    filter(term == "x_1" | term == "x_2")

  original_tbl <- original_tbl %>%
    bind_rows(unnested_mods_tbl)
}
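
Here is the purrr-flavored alternative mentioned above - a sketch only, not benchmarked against the loop, and the helper name run_one_sim is just for illustration. It expresses the same generate-model-extract pattern, with map_dfr() doing the row binding instead of bind_rows() inside a loop:

# sketch: one simulation pass wrapped in a function, repeated with map_dfr()
run_one_sim <- function(sim_id) {
  setup_tbl %>%
    rowwise() %>%
    mutate(observations_tbl = map(row_id, ~ gen_observations_fcn(.x, reps = 100))) %>%
    ungroup() %>%
    select(observations_tbl) %>%
    unnest(observations_tbl) %>%
    mutate(observation = rowSums(select(., matches("bx|resid")))) %>%
    group_by(error_sd) %>%
    nest() %>%
    mutate(model = map(data, ~ tidy(summary(lm(observation ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10, data = .x))))) %>%
    select(-data) %>%
    unnest(model) %>%
    filter(term %in% c("x_1", "x_2"))
}

# equivalent to the loop above (minus the holder tibble):
# sim_results_tbl <- map_dfr(1:n_sims, run_one_sim)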

Here’s what the output looks like:

cln_sim_tbl <- original_tbl %>%
  filter(term != "(Intercept)") %>%
  select(error_sd, everything())

cln_sim_tbl %>%
  gt_preview()
error_sd term estimate std.error statistic p.value
1 0.01 x_1 7.96513254 12.8959803 0.6176446 0.53838682
2 0.01 x_2 -7.14281172 12.9118544 -0.5531980 0.58151483
3 0.06 x_1 3.31952029 1.6226721 2.0457124 0.04373651
4 0.06 x_2 -2.34648487 1.6284278 -1.4409511 0.15310755
5 0.11 x_1 -0.17504497 0.8655740 -0.2022299 0.84019853
6..12199
12200 3.01 x_2 0.01955293 0.0411947 0.4746467 0.63620143

Now we can visualize what happens to the coefficient estimates, their standard errors, and the p-values as the correlation between x_1 and x_2 changes in strength. First, let’s look at the estimates for the coefficients.

cln_sim_tbl %>%
  ggplot(aes(x = estimate, y = error_sd)) +
  geom_jitter(aes(color = term), alpha = .4, height = .05) +
  xlim(c(-4, 4)) +
  geom_vline(xintercept = 1, linetype = 2) +
  geom_vline(xintercept = 0, linetype = 2) +
  coord_flip() +
  theme_bw() +
  scale_color_manual(values = c("purple", "limegreen")) +
  theme(legend.title = element_blank()) +
  labs(
    y = "SD of Random Noise in Relationship Between x_1 and x_2",
    x = "Estimate of Beta Coefficient",
    title = "Variable Correlation Strength vs. Coefficient Estimates",
    subtitle = "True Values are x_1 = 1, x_2 = 0",
    caption = "Lower Noise is Higher Correlation"
  )

The above plot shows that we can indeed trick the model. When the correlation is high enough, the spread of the coefficient estimates gets very wide. Even though on average the estimate is still correct, the model is very uncertain about the value for any particular experiment and will occasionally think that x_2 has a large estimated coefficient. And we usually only get to run one experiment in real life.

Based on these data, we should expect the standard error of the estimates to grow large at high correlation. Let’s see if that’s true:

cln_sim_tbl %>%
  #  mutate(error_sd = as_factor(error_sd)) %>%
  ggplot(aes(x = std.error, y = error_sd)) +
  geom_jitter(aes(color = term), alpha = .4, height = .05) +
  xlim(c(0, 1.2)) +
  coord_flip() +
  theme_bw() +
  scale_color_manual(values = c("purple", "limegreen")) +
  theme(legend.title = element_blank()) +
  labs(
    y = "SD of Random Noise in Relationship Between x_1 and x_2",
    x = "Standard Error of Coefficient Estimate",
    title = "Variable Correlation Strength vs. Standard Error of Coefficient Estimates",
    caption = "Lower Noise is Higher Correlation"
  )

Just as expected - the uncertainty in the estimates spikes as the correlation between x_1 and x_2 approaches 1. But what happens to the p-values of the coefficients for each variable?

cln_sim_tbl %>%
  #  mutate(error_sd = as_factor(error_sd)) %>%
  ggplot(aes(x = p.value, y = error_sd)) +
  geom_jitter(aes(color = term), alpha = .5, height = .05) +
  geom_vline(xintercept = .05, linetype = 2, size = 1.3, alpha = .7) +
  coord_flip() +
  theme_bw() +
  scale_color_manual(values = c("purple", "limegreen")) +
  theme(legend.title = element_blank()) +
  labs(
    y = "SD of Random Noise in Relationship Between x_1 and x_2",
    x = "P-Value for Coefficient Estimate",
    title = "Variable Correlation Strength vs. P-Values for Coefficient Estimates",
    subtitle = "dotted line shows p = .05",
    caption = "Lower Noise is Higher Correlation"
  )

Yep - as noise goes to 0 and the correlation approaches 1, the p-values get very unstable and the true effect of x_1 can no longer be detected consistently relative to the null. But the good news is that as long as the correlation isn’t close to perfect, the model does its job quite well and the p-values for a true effect dive toward 0.

Just to drive the point home, we could look at the actual correlation coefficients between x_1 and x_2 at the point where things start to go unstable. From the above plot, it looks like the model starts to get confused when the noise is around 0.6. We can use our helper function from before to quickly make a single dataset at that noise level and then evaluate the correlation of x_1 and x_2 in that set:

unstable_tbl <- sim_fcn(noise = 0.6)
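
We can also check the induced correlation numerically. The exact value will bounce around with the random draws, but with noise = 0.6 the population correlation is 1 / sqrt(1 + 0.6^2), or roughly 0.86:

# sample correlation between the true predictor and the redundant one
unstable_tbl %>%
  summarise(cor_x1_x2 = cor(x_1, x_2))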

Here’s the correlation plot. We see that in this case, a correlation of around 0.8 is where things start to go unstable with the estimates.

unstable_tbl %>%
  select(x_1:x_10) %>%
  ggcorr(
    high = "red",
    low = "blue",
    label = TRUE,
    hjust = .75,
    size = 3,
    label_size = 3,
    nbreaks = 4
  ) +
  labs(
    title = "Correlation Matrix",
    subtitle = "Pearson Method Using Pairwise Observations"
  )

Here are the variables plotted against the outcome for the above scenario, with x_1 and x_2 correlated with each other at roughly the 0.8 level.

unstable_tbl %>% plt_fcn()

In the end, we found out that we can trick the linear regression model, just not in the way we expected. We were never really able to get the model to think that x_2 was important, but we were able to make it uncertain about x_1. The average coefficient estimate remains stable even in the presence of highly correlated predictors; it’s the uncertainty around the estimates that pays the price for correlated variables. So should we just leave them all in? Like most things in statistics - it depends on the purpose of the model!

Thank you for following along with this post if you’ve made it this far. If you prefer learning concepts from mathematical principles rather than simulation, please refer to this response on Cross Validated for some explanation of the behavior we just saw. You could also check out the variance inflation factor, which is the same idea presented in a different way.
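
If you want a number rather than a plot, the classic diagnostic here is the variance inflation factor, 1 / (1 - R_j^2) for each predictor j, which reports how much a coefficient’s variance is inflated by that variable’s correlation with the other predictors. A quick sketch, assuming the car package is installed (it isn’t loaded above, so this is outside the post’s pipeline):

# assumes car is available; fit the same model on the noisy dataset and ask for VIFs
unstable_fit <- lm(observation ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10,
  data = unstable_tbl
)
car::vif(unstable_fit)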

Session Info

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] GGally_2.0.0    gt_0.2.2        broom_0.7.8     here_1.0.0     
##  [5] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
##  [9] readr_1.4.0     tidyr_1.1.3     tibble_3.1.4    ggplot2_3.3.5  
## [13] tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7         lattice_0.20-41    lubridate_1.7.9.2  assertthat_0.2.1  
##  [5] rprojroot_2.0.2    digest_0.6.28      utf8_1.2.2         plyr_1.8.6        
##  [9] R6_2.5.1           cellranger_1.1.0   backports_1.2.0    reprex_0.3.0      
## [13] evaluate_0.14      highr_0.9          httr_1.4.2         blogdown_0.15     
## [17] pillar_1.6.2       rlang_0.4.11       readxl_1.3.1       rstudioapi_0.13   
## [21] jquerylib_0.1.4    Matrix_1.2-18      checkmate_2.0.0    rmarkdown_2.11    
## [25] labeling_0.4.2     splines_4.0.3      munsell_0.5.0      compiler_4.0.3    
## [29] modelr_0.1.8       xfun_0.26          pkgconfig_2.0.3    mgcv_1.8-33       
## [33] htmltools_0.5.2    tidyselect_1.1.1   bookdown_0.21      reshape_0.8.8     
## [37] fansi_0.5.0        crayon_1.4.1       dbplyr_2.0.0       withr_2.4.2       
## [41] grid_4.0.3         nlme_3.1-150       jsonlite_1.7.2     gtable_0.3.0      
## [45] lifecycle_1.0.1    DBI_1.1.0          magrittr_2.0.1     scales_1.1.1      
## [49] cli_3.0.1          stringi_1.7.4      farver_2.1.0       fs_1.5.0          
## [53] xml2_1.3.2         bslib_0.3.0        ellipsis_0.3.2     generics_0.1.0    
## [57] vctrs_0.3.8        RColorBrewer_1.1-2 tools_4.0.3        glue_1.4.2        
## [61] hms_0.5.3          fastmap_1.1.0      yaml_2.2.1         colorspace_2.0-2  
## [65] rvest_0.3.6        knitr_1.34         haven_2.3.1        sass_0.4.0

  1. This is a straight copy-paste job from the mathy explanation on Wikipedia here: https://en.wikipedia.org/wiki/Linear_regression#Simple_and_multiple_linear_regression↩︎

  2. Note that this structure for the simulation is perhaps a bit verbose and potentially not optimized for speed, but given that the purpose is to explore and understand, I prefer to make things more readable and easily observed at the expense of speed and length↩︎

  3. On a side note, we see that just by chance x_6 was flagged as statistically significant. This would be expected to happen 1 out of 20 times and should be another consideration when selecting variables for a model, but isn’t the main point of this post so we’ll set that aside for now.↩︎