Stats on [R]eliability

Boundary Conditions and Anatomy - Correlated Data and Kernel Density Estimation in R

Thu, 17 Dec 2020 00:00:00 +0000

Measurements taken from patient anatomy are often correlated. For example, larger blood vessels might tend to have less curvature. Additionally, data are rarely Gaussian, favoring skewed shapes with some very large values and a lower bound of zero. These properties can make simulation and inference hard. In this post I will walk through a workflow for an engineering problem that might be presented in my industry. It involves simulating a population of patients and identifying a subset of interest.

Imagine we have been assigned the task of identifying boundary conditions for a benchtop durability test of an implantable, artificial heart valve. In other words, we need to identify credible parameters for a physical test such that our test engineers can challenge the device under severe but realistic geometries and loads. To facilitate this task our clinical team has analyzed images and extracted measurements for the features of interest in a subset of n=300 patients. There are two main challenges when working with these data:

How do we use our sample to simulate the full population?

How do we use the simulated, full population to identify groups of interest and recommend boundary conditions for the test?

The rest of this post explores what we should do with these data to resolve these challenges and identify appropriate and realistic test conditions.

The Data

Suppose the three parameters our team cares about are the ellipticity of the vessel cross section, the curvature of the vessel in the region of interest, and the blood pressure. Features such as these are important because they influence both the equilibrium geometry and the magnitude of forces acting on the implantable valve (in other words: the boundary conditions). The image below shows a schematic/example of ellipticity and vessel curvature in the LVOT and aortic valve annulus as observed in CT imaging.¹

I enjoy the tidyverse toolset for exploring and working with data so let’s get that loaded up along with some other packages that will help in the analysis to come.

library(readxl)
library(knitr)
library(DiagrammeR)
library(fitdistrplus)
library(MASS)
library(ggrepel)
library(ks)
library(broom)
library(ggExtra)
library(GGally)
library(car)
library(rgl)
library(anySim)
library(tidyverse)
library(plotly)

Start by reading in the data and taking a look at the format.

sample_data <- readRDS(file = "sim_anatomy_data.rds")
sample_data

## # A tibble: 300 x 3
##    ellip  curv pressure
##    <dbl> <dbl>    <dbl>
##  1  1.26  4.51     92.7
##  2  1.28  5.02    183. 
##  3  1.29  4.03    154. 
##  4  1.23  2.14    109. 
##  5  1.13  3.67    124. 
##  6  1.22  2.37    114. 
##  7  1.10  3.06    113. 
##  8  1.04  2.31    105. 
##  9  1.11  5.31    115. 
## 10  1.09  2.04    109. 
## # ... with 290 more rows

As expected, 300 rows with our 3 features of interest.

It might seem tempting at this point to extract the maximum value of each variable (or maybe something like the 95th percentile) and report those values together as a conservative worst-case. The problem with this approach is that each row of data is from a specific patient, so the variables are likely to be correlated. It could be that those severe values for each variable never occur together in the same patient. If we choose them all, we could over-test and over-design the device, potentially setting the program way behind. A more sophisticated approach is to consider the variables as a joint distribution and respect any correlation that may be present.
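To make the over-testing risk concrete, here is a minimal base R sketch. It assumes three lognormal variables whose parameters are rounded from the fits reported later in this post, and it applies an illustrative correlation matrix on the log scale (an assumption made for simplicity). It then counts how often a simulated patient exceeds the 95th percentile of all three variables at once.

```r
set.seed(42)
n <- 100000

# Assumed, illustrative parameters (rounded from the fits shown later in the post)
meanlog <- c(ellip = 0.19, curv = 1.16, pressure = 4.78)
sdlog   <- c(ellip = 0.06, curv = 0.31, pressure = 0.19)

# Assumed correlation matrix, applied on the log scale for simplicity
R <- matrix(c(
  1.00, 0.27, 0.37,
  0.27, 1.00, 0.21,
  0.37, 0.21, 1.00
), nrow = 3, byrow = TRUE)

# Correlated standard normals via the Cholesky factor of R
z <- matrix(rnorm(n * 3), ncol = 3) %*% chol(R)

# Scale and shift each column, then exponentiate to get lognormal draws
log_x <- sweep(z, 2, sdlog, "*")
log_x <- sweep(log_x, 2, meanlog, "+")
x <- exp(log_x)
colnames(x) <- names(meanlog)

# Per-variable 95th percentiles
p95 <- apply(x, 2, quantile, probs = 0.95)

# Fraction of simulated patients that are "severe" in all three variables at once
frac_all_severe <- mean(
  x[, "ellip"] > p95["ellip"] &
    x[, "curv"] > p95["curv"] &
    x[, "pressure"] > p95["pressure"]
)
frac_all_severe
```

Each variable exceeds its own 95th percentile in 5% of patients, but the fraction exceeding all three together can never be larger than that and is much smaller under modest correlation. Stacking the marginal extremes therefore describes a patient far rarer than the one-in-twenty that each percentile suggests.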

Here is some code to visualize each pair of variables along with their marginal distributions.

ellip_curv_plt <- sample_data %>%
  ggplot(aes(x = ellip, y = curv)) +
  geom_point(alpha = .5) +
  labs(
    title = "Patient Data From n=300 Scans",
    subtitle = "Vessel Ellipticity and Vessel Curvature Joint Distribution",
    x = "Ellipticity",
    y = "Curvature (mm)"
  )

ellip_pressure_plt <- sample_data %>%
  ggplot(aes(x = ellip, y = pressure)) +
  geom_point(alpha = .5, color = "firebrick") +
  labs(
    title = "Patient Data From n=300 Scans",
    subtitle = "Vessel Ellipticity and Blood Pressure Joint Distribution",
    x = "Ellipticity",
    y = "Pressure (mm Hg)"
  )

curv_pressure_plt <- sample_data %>%
  ggplot(aes(x = curv, y = pressure)) +
  geom_point(alpha = .5, color = "limegreen") +
  labs(
    title = "Patient Data From n=300 Scans",
    subtitle = "Vessel Curvature and Blood Pressure Joint Distribution",
    x = "Curvature (mm)",
    y = "Pressure (mm Hg)"
  )

ellip_curv_mplt <- ggExtra::ggMarginal(ellip_curv_plt, type = "density", fill = "#2c3e50", alpha = .5)
ellip_pressure_mplt <- ggExtra::ggMarginal(ellip_pressure_plt, type = "density", fill = "firebrick", alpha = .5)
curv_pressure_mplt <- ggExtra::ggMarginal(curv_pressure_plt, type = "density", fill = "limegreen", alpha = .5)

The variables are strictly positive and show some skew. Let’s assume that from domain knowledge we know these variables to be well described by a lognormal. The visuals would be consistent with this assumption.

Correlations in the Original Dataset

ggcorr() from the GGally package is very convenient for visualizing correlations.

sample_data %>% ggcorr(
  high = "#20a486ff",
  low = "#fde725ff",
  label = TRUE,
  hjust = .75,
  size = 3,
  label_size = 3,
  label_round = 3,
  nbreaks = 3
) +
  labs(
    title = "Correlation Matrix - n=300 Patient Set",
    subtitle = "Pearson Method Using Pairwise Observations"
  )

We see that all three pairs of variables are positively correlated, with Pearson coefficients between roughly 0.2 and 0.4.

To build out the sample into a simulated population we will fit each marginal distribution by maximum likelihood (MLE) and use the fitted model to generate a large number of simulated patients.² If the variables were not correlated, we could just execute a few rlnorm()’s and bind them together. The job is more challenging when the variables are correlated because they must be simulated all at once.

I know of 2 convenient engines in R to generate an arbitrary number of random values from a correlated, multivariate distribution:

anySim::SimCorrRVs() : For this method you specify the parameters of the marginal distributions and the correlation matrix.³
MASS::mvrnorm() : For this method you transform each variable to normal and supply the mean vector and covariance matrix of the transformed variables.

My personal preference is for the anySim method, which I’ll show below. The code for executing a similar simulation with MASS::mvrnorm() is shown in Appendix A.

AnySim - Generate Simulated Population of Correlated Patient Data

The AnySim workflow:

First: fit distributions to the original data and calculate correlations.

ellip_fit <- fitdist(sample_data$ellip, "lnorm")
curv_fit <- fitdist(sample_data$curv, "lnorm")
pressure_fit <- fitdist(sample_data$pressure, "lnorm")

# store lognormal parameters of original data
ellip_meanlog <- ellip_fit$estimate[["meanlog"]]
ellip_sdlog <- ellip_fit$estimate[["sdlog"]]
curv_meanlog <- curv_fit$estimate[["meanlog"]]
curv_sdlog <- curv_fit$estimate[["sdlog"]]
pressure_meanlog <- pressure_fit$estimate[["meanlog"]]
pressure_sdlog <- pressure_fit$estimate[["sdlog"]]

# store correlations in original data
cor_ec <- cor(x = sample_data$ellip, y = sample_data$curv)
cor_ep <- cor(x = sample_data$ellip, y = sample_data$pressure)
cor_cp <- cor(x = sample_data$curv, y = sample_data$pressure)
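As a sanity check on the fitted parameters, the lognormal MLE has a closed form: meanlog is the mean of the logged data and sdlog is the standard deviation of the logged data with an n (not n - 1) denominator. The sketch below uses a hypothetical simulated vector as a stand-in for one column of the patient data; the fitdist() estimates should agree with these values up to the tolerance of the optimizer.

```r
set.seed(1)

# Hypothetical stand-in for one column of the patient data
x <- rlnorm(300, meanlog = 1.16, sdlog = 0.31)

log_x <- log(x)
n <- length(x)

# Closed-form maximum likelihood estimates for a lognormal
meanlog_hat <- mean(log_x)
sdlog_hat <- sqrt(mean((log_x - meanlog_hat)^2)) # divides by n, not n - 1

c(meanlog = meanlog_hat, sdlog = sdlog_hat)
```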

Apply the AnySim workflow. Note that this too goes through an auxiliary normal intermediate step.

set.seed(1234)

# Define the target distribution functions (ICDFs) of each random variable.

ellip_dist <- "qlnorm"
curv_dist <- "qlnorm"
pressure_dist <- "qlnorm"

# store the 3 ICDFs in a vector
dist_vec <- c(ellip_dist, curv_dist, pressure_dist)

# Define the parameters of the target distribution functions - store them in a list
ellip_params <- list(meanlog = ellip_meanlog, sdlog = ellip_sdlog)
curv_params <- list(meanlog = curv_meanlog, sdlog = curv_sdlog)
pressure_params <- list(meanlog = pressure_meanlog, sdlog = pressure_sdlog)

# this is a weird way to do it but I'm following along with an example from AnySim vignette :)
params_list <- list(NULL)
params_list[[1]] <- ellip_params
params_list[[2]] <- curv_params
params_list[[3]] <- pressure_params

# Define the target correlation matrix.
corr_matrix <- matrix(c(
  1, 0.268, 0.369,
  0.268, 1, .213,
  0.369, 0.213, 1
),
ncol = 3,
nrow = 3,
byrow = TRUE
)
# Estimate the parameters of the auxiliary Gaussian model.
aux_gaussion_param_tbl <- EstCorrRVs(
  R = corr_matrix, dist = dist_vec, params = params_list,
  NatafIntMethod = "GH", NoEval = 9, polydeg = 8
)


# Generate 10000 synthetic realizations of the 3 correlated RVs.
correlated_ln_draws_tbl <- as_tibble(SimCorrRVs(n = 10000, paramsRVs = aux_gaussion_param_tbl)) %>%
  rename(
    ellip = V1,
    curv = V2,
    pressure = V3
  )

correlated_ln_draws_tbl %>%
  head(10) %>%
  kable(align = "c")

ellip	curv	pressure
1.123496	1.674471	78.14516
1.234755	3.927320	104.53631
1.299794	4.071074	116.55043
1.045001	2.721336	106.23896
1.246727	5.091741	102.83055
1.252843	2.869394	93.24030
1.169606	1.613312	125.12784
1.171699	2.921069	156.06701
1.170371	2.672507	145.82545
1.146382	3.030319	89.91766

Evaluate recovered marginal distributions with some helper functions:

extract_params_sim_fcn <- function(var, fit_to) {
  tidy(fitdistr(correlated_ln_draws_tbl %>% pull(var), fit_to)) %>%
    mutate(
      var = {
        var
      },
      dataset = "sim_draws"
    )
}

extract_params_pat_fcn <- function(var, fit_to) {
  tidy(fitdistr(sample_data %>% pull(var), fit_to)) %>%
    mutate(
      var = {
        var
      },
      dataset = "patient_set"
    )
}

sim_results_tbl <- tibble(
  var = c("ellip", "curv", "pressure"),
  fit_to = rep("lognormal", 3)
) %>%
  mutate(params = map2(.x = var, .y = fit_to, .f = extract_params_sim_fcn)) %>%
  unnest() %>%
  dplyr::select(-var1)

pat_results_tbl <- tibble(
  var = c("ellip", "curv", "pressure"),
  fit_to = rep("lognormal", 3)
) %>%
  mutate(params = map2(.x = var, .y = fit_to, .f = extract_params_pat_fcn)) %>%
  unnest() %>%
  dplyr::select(-var1)

sim_results_tbl %>%
  bind_rows(pat_results_tbl) %>%
  select(-std.error) %>%
  pivot_wider(id_cols = everything(), names_from = "dataset", values_from = "estimate") %>%
  kable(align = "c")

var	fit_to	term	sim_draws	patient_set
ellip	lognormal	meanlog	0.1936145	0.1932254
ellip	lognormal	sdlog	0.0628128	0.0636092
curv	lognormal	meanlog	1.1561942	1.1579793
curv	lognormal	sdlog	0.3114360	0.3091606
pressure	lognormal	meanlog	4.7841496	4.7831767
pressure	lognormal	sdlog	0.1896585	0.1910081

Evaluate recovered correlations:

correlated_ln_draws_tbl %>% ggcorr(
  high = "#20a486ff",
  low = "#fde725ff",
  label = TRUE,
  hjust = .75,
  size = 3,
  label_size = 3,
  label_round = 3,
  nbreaks = 3
) +
  labs(
    title = "Correlation Matrix - n=10000 Simulation Set",
    subtitle = "Pearson Method Using Pairwise Observations"
  )
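The auxiliary Gaussian step exists because a Pearson correlation specified on the lognormal scale is not the same number as the correlation of the underlying normal variables. For the special case where both marginals are lognormal there is a closed-form relationship, sketched below with hypothetical helper functions. This is only meant to illustrate the idea; it is not how anySim computes the equivalent correlation (the call above uses its Nataf-based numerical method, which works for arbitrary marginals).

```r
# Pearson correlation of two lognormals, given the correlation of their logs
lnorm_cor_from_normal <- function(rho_z, sdlog1, sdlog2) {
  (exp(rho_z * sdlog1 * sdlog2) - 1) /
    sqrt((exp(sdlog1^2) - 1) * (exp(sdlog2^2) - 1))
}

# Inverse: correlation needed on the normal scale to hit a lognormal-scale target
normal_cor_from_lnorm <- function(rho_x, sdlog1, sdlog2) {
  log(1 + rho_x * sqrt((exp(sdlog1^2) - 1) * (exp(sdlog2^2) - 1))) /
    (sdlog1 * sdlog2)
}

# Example with the ellip/curv target correlation and rounded sdlog values from this post
rho_target <- 0.268
rho_z <- normal_cor_from_lnorm(rho_target, sdlog1 = 0.064, sdlog2 = 0.309)
rho_back <- lnorm_cor_from_normal(rho_z, sdlog1 = 0.064, sdlog2 = 0.309)

c(target = rho_target, normal_scale = rho_z, round_trip = rho_back)
```

Because the sdlog values here are small, the normal-scale correlation is only slightly larger than the target. With heavier skew the gap grows, which is why simulating on the normal scale with the raw Pearson correlation can miss the target.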

Let’s take a look at the simulated population:

fig <- plotly::plot_ly()

fig <- fig %>% add_trace(x = correlated_ln_draws_tbl$ellip, y = correlated_ln_draws_tbl$curv, z = correlated_ln_draws_tbl$pressure, type = "scatter3d", opacity = .4, hoverinfo = "none", size = .1)

fig <- fig %>%
  layout(scene = list(
    xaxis = list(title = "ellip"),
    yaxis = list(title = "curv"),
    zaxis = list(title = "pressure")
  )) %>%
  layout(scene = list(
    xaxis = list(showspikes = FALSE),
    yaxis = list(showspikes = FALSE),
    zaxis = list(showspikes = FALSE)
  ))

# fig

Kernel Density Estimation - Map Density Contours to Data

The above tables and figures confirm that the simulated population maintains the correlation structure and marginal distributions of the original sample, as intended. The next step is to build density estimates using a non-parametric kernel density estimator (kde). The density estimate shows the regions where data points are likely to fall, and its contours give us a reference for identifying the most extreme patients relative to the mode or to some region of interest.

Important Watch-Out: The exact workflow for generating and applying the kernel density estimate may vary depending on the data type. When the data have bounded support, the default kde procedures may assign probability to regions outside the boundary. This will occur for our dataset, since all of our variables are lognormal and can therefore never be zero or negative. Methods for addressing this behavior include variable bandwidth estimators, transformations of estimators, and boundary estimators. To illustrate this problem and provide an example of resolution, I will show 2 parallel workflows below:

In the first, I apply the default global bandwidth kde to the simulated data
In the second, I transform the data from lognormal to normal, apply the kde, then backtransform to lognormal

Towards that end, I’ll add some more variables for the log-transformed, normal version of each variable:

corr_draws_tbl <- correlated_ln_draws_tbl %>%
  mutate(
    ellip_n = log(ellip),
    curv_n = log(curv),
    pressure_n = log(pressure)
  ) %>%
  select(ellip_n, curv_n, pressure_n)

corr_draws_tbl %>%
  head(5) %>%
  kable(align = "c")

ellip_n	curv_n	pressure_n
0.1164450	0.5154974	4.358568
0.2108725	1.3679574	4.649534
0.2622058	1.4039068	4.758324
0.0440175	1.0011228	4.665691
0.2205217	1.6276199	4.633083

Quick visual check to verify the variables transformed properly:

corr_draws_tbl %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = value)) +
  geom_density() +
  facet_wrap(~name, scales = "free")
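The boundary issue described in the watch-out is easiest to see in one dimension. The sketch below uses base R's density() on a hypothetical lognormal sample (not the simulated patient data). The naive estimate puts some mass below zero, while estimating on the log scale and back-transforming keeps everything positive. Note that a density, unlike a contour point, picks up a Jacobian term when back-transformed: f_X(x) = f_Y(log x) / x.

```r
set.seed(1)

# Hypothetical strictly positive, skewed sample
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Naive kde on the original scale: the evaluation grid extends below zero
d_naive <- density(x)
dx <- diff(d_naive$x[1:2])
mass_below_zero <- sum(d_naive$y[d_naive$x < 0]) * dx

# kde on the log scale, then back-transform the grid and the density
d_log <- density(log(x))
x_bt <- exp(d_log$x)     # back-transformed grid is positive by construction
f_bt <- d_log$y / x_bt   # Jacobian of the log transform

mass_below_zero # probability mass the naive kde assigns to impossible values
range(x_bt)     # the back-transformed grid never crosses zero
```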

Naive Method - Apply Default KDE to Lognormal Data

Estimate kde

The kde is constructed as follows:⁴

This first chunk converts the data and generates the kde. The bandwidth parameter controls the “smoothness” or granularity of the estimate and can be hard to specify in multiple dimensions. Hscv() provides a method of determining a reasonable bandwidth through cross-validation; see documentation in footnotes for more information if interested.

# convert simulated data tibble to matrix
d3m <- correlated_ln_draws_tbl %>%
  as.matrix()

# cross-validated bandwidth for kd (takes a while to calculate)
# hscv1 <- Hscv(correlated_ln_draws_tbl)
# hscv1 %>% write_rds(here::here("hscv1.rds"))

hscv1 <- read_rds(here::here("hscv1.rds"))

# generate kernel density estimate from simulated population
kd_d3m <- ks::kde(d3m, H = hscv1, compute.cont = TRUE)
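For intuition about what the bandwidth does, here is a one-dimensional sketch with base R's density() on a hypothetical normal sample. A very small bandwidth produces a wiggly estimate and a large one produces a smooth estimate; a simple roughness measure (sum of squared second differences) makes the comparison explicit. The last line uses stats::bw.ucv(), a one-dimensional unbiased cross-validation selector, as a rough analogue of the data-driven choice. It is a different cross-validation variant from the smoothed cross-validation used by Hscv(), so treat it as an illustration only.

```r
set.seed(1)

# Hypothetical one-dimensional sample
x <- rnorm(200)

# Same evaluation grid for both estimates so they are directly comparable
d_small <- density(x, bw = 0.05, from = -4, to = 4, n = 512)
d_large <- density(x, bw = 1.00, from = -4, to = 4, n = 512)

# Roughness: sum of squared second differences of the estimated curve
roughness <- function(d) sum(diff(d$y, differences = 2)^2)

rough_small <- roughness(d_small)
rough_large <- roughness(d_large)

# Data-driven bandwidth from unbiased cross-validation (1D only)
bw_cv <- bw.ucv(x)

c(rough_small = rough_small, rough_large = rough_large, bw_cv = bw_cv)
```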

Density proportions from kde estimate

# see the kde's calculated density thresholds for specified proportions
cont_vals_tbl <- tidy(kd_d3m$cont) %>%
  mutate(n_row = row_number()) %>%
  mutate(probs = 100 - n_row) %>%
  select(probs, x)

reference_grid_probs_tbl <- cont_vals_tbl %>%
  rename(estimate = x)

reference_grid_probs_tbl %>%
  head(10) %>%
  kable(align = "c")

probs	estimate
99	0.0342569
98	0.0333578
97	0.0326260
96	0.0318672
95	0.0312299
94	0.0305985
93	0.0300632
92	0.0293781
91	0.0289256
90	0.0283849

KDE estimates in the range of the variables

By default the KDE provides density estimates for a grid of points that covers the space of the variables.

kd_grid_estimates <- kd_d3m

If we want to know the density estimate at each point in the simulated population we use the eval.points argument.

mc_estimates <- ks::kde(
  x = d3m, H = hscv1,
  compute.cont = TRUE,
  eval.points = correlated_ln_draws_tbl %>% as.matrix()
)

Here are a couple different ways to convert the kde object features into a tibble:

mc_est_tbl_10000 <- tibble(estimate = mc_estimates$estimate) %>%
  bind_cols(correlated_ln_draws_tbl)

kd_grid_est_tbl_29k <- broom:::tidy.kde(kd_grid_estimates) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(ellip = x1, curv = x2, pressure = x3) %>%
  select(-obs)

Each data point in our population has an estimate. Each data point on the grid that covers the space of interest has an estimate.

mc_est_tbl_10000 %>%
  head(10) %>%
  kable(align = "c")

estimate	ellip	curv	pressure
0.0032540	1.123496	1.674471	78.14516
0.0167218	1.234755	3.927320	104.53631
0.0114561	1.299794	4.071074	116.55043
0.0042927	1.045001	2.721336	106.23896
0.0050883	1.246727	5.091741	102.83055
0.0123645	1.252843	2.869394	93.24030
0.0073055	1.169606	1.613312	125.12784
0.0061654	1.171699	2.921069	156.06701
0.0103851	1.170371	2.672507	145.82545
0.0158417	1.146382	3.030319	89.91766

kd_grid_est_tbl_29k %>%
  head(10) %>%
  kable(align = "c")

ellip	curv	pressure
0.8949674	-0.0080549	30.51151
0.9187883	-0.0080549	30.51151
0.9426092	-0.0080549	30.51151
0.9664302	-0.0080549	30.51151
0.9902511	-0.0080549	30.51151
1.0140720	-0.0080549	30.51151
1.0378929	-0.0080549	30.51151
1.0617138	-0.0080549	30.51151
1.0855347	-0.0080549	30.51151
1.1093556	-0.0080549	30.51151

The ks package automatically stores the quantiles of the estimate variable when calculating the kde. We can access those probability boundaries by sub-setting the kd object.

# 5% contour line from kd grid based on 10k MC data
percentile_5 <- kd_d3m[["cont"]]["5%"]

Verify that 5% (500/10,000) of the values fall below the threshold:

mc_est_tbl_10000 %>% filter(estimate <= percentile_5)

## # A tibble: 500 x 4
##    estimate ellip  curv pressure
##       <dbl> <dbl> <dbl>    <dbl>
##  1 0.000418  1.41  4.68    184. 
##  2 0.000377  1.30  7.27    144. 
##  3 0.000951  1.06  3.32     72.1
##  4 0.000704  1.28  7.10    125. 
##  5 0.000719  1.17  2.59     62.1
##  6 0.000114  1.47  3.18    189. 
##  7 0.000905  1.01  3.21    102. 
##  8 0.000182  1.06  5.85    103. 
##  9 0.000521  1.36  4.04    200. 
## 10 0.000742  1.40  3.58     97.4
## # ... with 490 more rows

500 / 10,000 is the correct coverage for the 5/95 boundary.
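The logic behind that check is that a probability contour is just a quantile of the density estimates evaluated at the sample points: the density level that 5% of points fall below encloses the other 95%. Here is a minimal one-dimensional sketch of the same idea using base R and a hypothetical normal sample, with approx() standing in for the eval.points argument.

```r
set.seed(1)

# Hypothetical sample
x <- rnorm(10000)

# kde on a grid, then interpolate the density estimate at each sample point
d <- density(x)
f_at_x <- approx(d$x, d$y, xout = x)$y

# Density level that 5% of the points fall below
level_5 <- quantile(f_at_x, probs = 0.05)

# Points at or below that level are the 5% most extreme relative to the mode
is_extreme <- f_at_x <= level_5
mean(is_extreme)
```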

If we wanted to know the nearest probability contour line for every point we could make a function to do so.

get_probs_fcn <- function(value) {
  t <- reference_grid_probs_tbl %>%
    mutate(value = value) %>%
    mutate(dif = abs(estimate - value)) %>%
    filter(dif == min(dif))

  t[[1, 1]]
}

Map the function over each value in the dataset.

# mc_1_to_99_tbl <- mc_est_tbl_10000 %>%
#   mutate(nearest_prob = map_dbl(estimate, get_probs_fcn))

# mc_1_to_99_tbl %>% write_rds(here::here("mc_1_to_99_tbl.rds"))
mc_1_to_99_tbl <- read_rds(here::here("mc_1_to_99_tbl.rds"))

mc_1_to_99_tbl

## # A tibble: 10,000 x 5
##    estimate ellip  curv pressure nearest_prob
##       <dbl> <dbl> <dbl>    <dbl>        <dbl>
##  1  0.0140   1.27  2.16    133.            59
##  2  0.0119   1.20  4.44    127.            52
##  3  0.00265  1.38  2.65    160.            14
##  4  0.0194   1.24  3.62    142.            75
##  5  0.0122   1.32  2.54    129.            53
##  6  0.0168   1.26  3.24    147.            68
##  7  0.00555  1.33  3.39    168.            28
##  8  0.0112   1.24  4.25    146.            50
##  9  0.00826  1.19  1.85     87.7           39
## 10  0.00197  1.32  4.14     90.5           11
## # ... with 9,990 more rows

Now the data, kde estimate, and nearest probability contour region boundary are stored in one tibble.
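Mapping a tibble-based lookup over 10,000 rows is slow, which is why the result above is cached. A vectorized alternative is sketched below in base R with a hypothetical helper, nearest_prob(). It computes all pairwise distances between the density estimates and the contour levels at once and takes the closest level per row. The toy levels and probabilities are made up for illustration.

```r
# Hypothetical helper: for each value, return the prob of the nearest contour level
nearest_prob <- function(values, levels, probs) {
  # rows = values, columns = contour levels
  dist_mat <- abs(outer(values, levels, "-"))
  idx <- apply(dist_mat, 1, which.min) # first minimum wins on ties
  probs[idx]
}

# Toy contour table (made-up numbers)
levels <- c(0.030, 0.020, 0.010)
probs <- c(99, 98, 97)

nearest <- nearest_prob(c(0.029, 0.021, 0.012), levels, probs)
nearest
```

With the objects from this post, the equivalent call would pass the estimate column as values and the two columns of reference_grid_probs_tbl as levels and probs.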

Density Plot with Probability Contours in 3d

Honestly, this part is pretty easy thanks to a built-in plot.kde method. Just use the cont argument to specify which probability contours you want.

#plot(x = kd_d3m, cont = c(45, 70, 95), drawpoints = FALSE, col.pt = 1)

Add points using the points3d function. In this case I add 2 sets, 1 for the 5% most extreme and 1 for the 95% most common.

# plot(x = kd_d3m, cont = c(95) ,drawpoints = FALSE, col.pt = 1)
mc_lowest_5_tbl <- mc_1_to_99_tbl %>% filter(estimate < percentile_5)
mc_6_to_100_tbl <- mc_1_to_99_tbl %>% filter(estimate >= percentile_5)

# points3d(x = mc_lowest_5_tbl$ellip, y = mc_lowest_5_tbl$curv, z = mc_lowest_5_tbl$pressure, color = "dodgerblue",  size = 3, alpha = 1)

# points3d(x = mc_6_to_100_tbl$ellip, y = mc_6_to_100_tbl$curv, z = mc_6_to_100_tbl$pressure, color = "black",  size = 3, alpha = 1)

See the problem here? In the areas on the lower right of the middle and right-most images, the data stops but the surface keeps going. This is because the data has a boundary there due to being log-normal but the kde doesn’t know. See closeup below.

As previously mentioned, this can be addressed by using the normal dataset to fit the kde and then back-transforming both the data and the surface:

Fit KDE to Normal Data, Transform Later

# convert simulated data tibble to matrix
d3m_n <- corr_draws_tbl %>%
  as.matrix()

# cross-validated bandwidth for kd (takes a while to calculate)
# hscv1_n <- Hscv(corr_draws_tbl)
# hscv1_n %>% write_rds(here::here("hscv1_n.rds"))

hscv1_n <- read_rds(here::here("hscv1_n.rds"))

# generate kernel density estimate from simulated population
kd_d3m_n <- ks::kde(d3m_n, H = hscv1_n, compute.cont = TRUE)

Density proportions from kde estimate

# see the kde's calculated density thresholds for specified proportions
cont_vals_tbl_n <- tidy(kd_d3m_n$cont) %>%
  mutate(n_row = row_number()) %>%
  mutate(probs = 100 - n_row) %>%
  select(probs, x)

reference_grid_probs_tbl_n <- cont_vals_tbl_n %>%
  rename(estimate = x)

reference_grid_probs_tbl_n %>%
  head(10) %>%
  kable(align = "c")

probs	estimate
99	15.39736
98	14.90526
97	14.53676
96	14.16539
95	13.86481
94	13.56653
93	13.26884
92	12.98302
91	12.72985
90	12.51655

KDE estimates in the range of the variables

By default the KDE provides density estimates for a grid of points that covers the space of the variables.

kd_grid_estimates_n <- kd_d3m_n

If we want to know the density estimate at each point in the simulated population we use the eval.points argument.

mc_estimates_n <- ks::kde(
  x = d3m_n, H = hscv1_n,
  compute.cont = TRUE,
  eval.points = corr_draws_tbl %>% as.matrix()
)

Here are a couple different ways to convert the kde object features into a tibble:

mc_est_tbl_10000_n <- tibble(estimate = mc_estimates_n$estimate) %>%
  bind_cols(corr_draws_tbl)

kd_grid_est_tbl_29k_n <- broom:::tidy.kde(kd_grid_estimates_n) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  rename(ellip_n = x1, curv_n = x2, pressure_n = x3) %>%
  select(-obs)

Each data point in our population has an estimate. Each data point on the grid that covers the space of interest has an estimate.

mc_est_tbl_10000_n %>%
  head(10) %>%
  kable(align = "c")

estimate	ellip_n	curv_n	pressure_n
0.5480641	0.1164450	0.5154974	4.358568
8.5263883	0.2108725	1.3679574	4.649534
6.8647221	0.2622058	1.4039068	4.758324
1.2686865	0.0440175	1.0011228	4.665691
3.4062801	0.2205217	1.6276199	4.633083
4.3754649	0.2254152	1.0541010	4.535180
1.5083333	0.1566667	0.4782892	4.829336
3.3766180	0.1584546	1.0719497	5.050286
5.0036132	0.1573211	0.9830169	4.982410
4.9614118	0.1366109	1.1086680	4.498894

kd_grid_est_tbl_29k_n %>%
  head(10) %>%
  kable(align = "c")

ellip_n	curv_n	pressure_n
-0.0877414	-0.4268598	3.811282
-0.0685395	-0.4268598	3.811282
-0.0493375	-0.4268598	3.811282
-0.0301356	-0.4268598	3.811282
-0.0109337	-0.4268598	3.811282
0.0082682	-0.4268598	3.811282
0.0274701	-0.4268598	3.811282
0.0466721	-0.4268598	3.811282
0.0658740	-0.4268598	3.811282
0.0850759	-0.4268598	3.811282

# 5% contour line from kd grid based on 10k MC data
percentile_5_n <- kd_d3m_n[["cont"]]["5%"]

Verify that 5% (500/10,000) of the values fall below the threshold:

mc_est_tbl_10000_n %>% filter(estimate <= percentile_5_n)

## # A tibble: 500 x 4
##    estimate ellip_n curv_n pressure_n
##       <dbl>   <dbl>  <dbl>      <dbl>
##  1   0.437   0.347   1.54        5.21
##  2   0.218   0.0546  1.20        4.28
##  3   0.346   0.114   0.375       4.43
##  4   0.340   0.121   0.382       4.96
##  5   0.0880  0.153   0.951       4.13
##  6   0.0860  0.387   1.16        5.24
##  7   0.268   0.0116  1.17        4.62
##  8   0.0957  0.0610  1.77        4.63
##  9   0.513   0.310   1.40        5.30
## 10   0.300   0.195   0.260       4.75
## # ... with 490 more rows

500 / 10,000 is the correct coverage for the 5/95 boundary.

get_probs_fcn_n <- function(value) {
  t <- reference_grid_probs_tbl_n %>%
    mutate(value = value) %>%
    mutate(dif = abs(estimate - value)) %>%
    filter(dif == min(dif))

  t[[1, 1]]
}

Map the function over each value in the dataset and then the grid.

# mc_1_to_99_tbl_n <- mc_est_tbl_10000_n %>%
#   mutate(nearest_prob = map_dbl(estimate, get_probs_fcn_n))
# #
# mc_1_to_99_tbl_n %>% write_rds(here::here("mc_1_to_99_tbl_n.rds"))
mc_1_to_99_tbl_n <- read_rds(here::here("mc_1_to_99_tbl_n.rds"))

mc_1_to_99_tbl_n

## # A tibble: 10,000 x 5
##    estimate ellip_n curv_n pressure_n nearest_prob
##       <dbl>   <dbl>  <dbl>      <dbl>        <dbl>
##  1    0.548  0.116   0.515       4.36            5
##  2    8.53   0.211   1.37        4.65           70
##  3    6.86   0.262   1.40        4.76           58
##  4    1.27   0.0440  1.00        4.67           12
##  5    3.41   0.221   1.63        4.63           32
##  6    4.38   0.225   1.05        4.54           39
##  7    1.51   0.157   0.478       4.83           14
##  8    3.38   0.158   1.07        5.05           31
##  9    5.00   0.157   0.983       4.98           44
## 10    4.96   0.137   1.11        4.50           44
## # ... with 9,990 more rows

grid_probs_tbl_n <- kd_grid_est_tbl_29k_n %>%
  mutate(nearest_prob = map_dbl(estimate, get_probs_fcn_n))

grid_probs_tbl_n %>% write_rds(here::here("grid_probs_tbl_n.rds"))
grid_probs_tbl_n <- read_rds(here::here("grid_probs_tbl_n.rds"))



grid_probs_95_n <- grid_probs_tbl_n %>%
  filter(nearest_prob == 95)

grid_probs_95_n %>% arrange(desc(nearest_prob))

## # A tibble: 4 x 5
##   estimate ellip_n curv_n pressure_n nearest_prob
##      <dbl>   <dbl>  <dbl>      <dbl>        <dbl>
## 1     13.8   0.162   1.09       4.68           95
## 2     13.8   0.200   1.20       4.68           95
## 3     13.8   0.219   1.20       4.74           95
## 4     13.8   0.219   1.09       4.80           95

grid_probs_95_n %>%
  head(5) %>%
  kable(align = "c")

estimate	ellip_n	curv_n	pressure_n	nearest_prob
13.84582	0.1618836	1.094335	4.679386	95
13.75222	0.2002874	1.195748	4.679386	95
13.83862	0.2194893	1.195748	4.741394	95
13.83317	0.2194893	1.094335	4.803401	95

Density Plot with Probability Contours in 3d

Honestly, this part is pretty easy thanks to a built-in plot.kde method. Just use the cont argument to specify which probability contours you want.

plot(x = kd_d3m_n, cont = c(45, 70, 95), drawpoints = FALSE, col.pt = 1)

And with points:

mc_lowest_5_tbl_n <- mc_1_to_99_tbl_n %>% filter(estimate < percentile_5_n)
mc_6_to_100_tbl_n <- mc_1_to_99_tbl_n %>% filter(estimate >= percentile_5_n)

plot(x = kd_d3m_n, cont = c(95), drawpoints = FALSE, col.pt = 1)


points3d(x = mc_lowest_5_tbl_n$ellip_n, y = mc_lowest_5_tbl_n$curv_n, z = mc_lowest_5_tbl_n$pressure_n, color = "dodgerblue", size = 3, alpha = 1)

points3d(x = mc_6_to_100_tbl_n$ellip_n, y = mc_6_to_100_tbl_n$curv_n, z = mc_6_to_100_tbl_n$pressure_n, color = "black", size = 3, alpha = 1)

points3d(x = grid_probs_tbl_n$ellip_n, y = grid_probs_tbl_n$curv_n, z = grid_probs_tbl_n$pressure_n, color = "firebrick", size = 2, alpha = 1)

Transform data and kde contour to original scale

mc_lowest_5_tbl_nbt <- mc_lowest_5_tbl_n %>% mutate(
  ellip_bt = exp(ellip_n),
  curv_bt = exp(curv_n),
  pressure_bt = exp(pressure_n)
)
mc_6_to_100_tbl_nbt <- mc_6_to_100_tbl_n %>% mutate(
  ellip_bt = exp(ellip_n),
  curv_bt = exp(curv_n),
  pressure_bt = exp(pressure_n)
)

full_mc_bt_tbl <- mc_lowest_5_tbl_nbt %>%
  bind_rows(mc_6_to_100_tbl_nbt)

# nearest_prob == 5 is the 5% density quantile, i.e. the contour that encloses ~95% of the points
grid_probs_95_bt <- grid_probs_tbl_n %>%
  filter(nearest_prob == 5) %>%
  mutate(
    ellip_bt = exp(ellip_n),
    curv_bt = exp(curv_n),
    pressure_bt = exp(pressure_n)
  )

Plot Back-Transformed Data with Plotly

fig <- plotly::plot_ly()

fig <- fig %>% add_trace(x = grid_probs_95_bt$ellip_bt, y = grid_probs_95_bt$curv_bt, z = grid_probs_95_bt$pressure_bt, type = "mesh3d", alphahull = 0, opacity = .5, hoverinfo = "none")


fig <- fig %>% add_trace(x = mc_lowest_5_tbl_nbt$ellip_bt, y = mc_lowest_5_tbl_nbt$curv_bt, z = mc_lowest_5_tbl_nbt$pressure_bt, type = "scatter3d", size = 30)

fig <- fig %>% add_trace(x = mc_6_to_100_tbl_nbt$ellip_bt, y = mc_6_to_100_tbl_nbt$curv_bt, z = mc_6_to_100_tbl_nbt$pressure_bt, type = "scatter3d", size = 30)

fig <- fig %>%
  layout(scene = list(
    xaxis = list(title = "ellip"),
    yaxis = list(title = "curv"),
    zaxis = list(title = "pressure")
  )) %>%
  layout(scene = list(
    xaxis = list(showspikes = FALSE),
    yaxis = list(showspikes = FALSE),
    zaxis = list(showspikes = FALSE)
  ))

# fig

The new image (shown on right) looks different near the boundary. Because we fit the kde on the normal scale and back-transformed both the data and the contour, no portion of the contour goes beyond the point cloud. This is what we want!

Filter extreme points and assess points on 95-5 contour

Now that our kde contour is set up to properly segregate the extreme points relative to the mode, we can filter them away and assess the remaining points which lie on the contour. We do this by pulling the grid points that make up the 95/5 surface and evaluating them as percentiles.

First, the ecdfs to get the percentiles from each variable

e1f <- ecdf(full_mc_bt_tbl$ellip_bt)
e2f <- ecdf(full_mc_bt_tbl$curv_bt)
e3f <- ecdf(full_mc_bt_tbl$pressure_bt)

Map ecdfs over the variables and then use the sum of the percentiles as a way to identify the largest values.

# ecdf functions are vectorized, so no rowwise() or map_dbl() is needed
full_probs_95_tbl <- grid_probs_95_bt %>%
  mutate(
    percentile_e = e1f(ellip_bt),
    percentile_c = e2f(curv_bt),
    percentile_p = e3f(pressure_bt),
    pct_sum = percentile_e + percentile_c + percentile_p
  ) %>%
  arrange(desc(pct_sum)) %>%
  select(ellip_bt, curv_bt, pressure_bt, percentile_e, percentile_c, percentile_p, pct_sum)

full_probs_95_tbl %>%
  head(10) %>%
  kable(align = "c", digits = 2)

ellip_bt	curv_bt	pressure_bt	percentile_e	percentile_c	percentile_p	pct_sum
1.40	5.49	176.88	0.99	0.96	0.98	2.93
1.37	5.49	188.19	0.97	0.96	0.99	2.92
1.37	6.08	166.24	0.97	0.98	0.96	2.92
1.34	5.49	188.19	0.95	0.96	0.99	2.90
1.37	4.96	188.19	0.97	0.92	0.99	2.89
1.42	4.96	166.24	1.00	0.92	0.96	2.88
1.32	6.08	176.88	0.91	0.98	0.98	2.87
1.32	5.49	188.19	0.91	0.96	0.99	2.86
1.40	4.48	188.19	0.99	0.86	0.99	2.84
1.42	4.48	176.88	1.00	0.86	0.98	2.84
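If ecdf() is unfamiliar: it returns a function that maps a value to the fraction of the reference data at or below it, and that function accepts whole vectors. The toy sketch below (made-up numbers, base R only) shows that behavior and the same rank-by-summed-percentile idea used in the table above.

```r
# ecdf() returns a vectorized step function
x <- c(1, 2, 3, 4)
e <- ecdf(x)
pct <- e(c(0, 2, 4)) # fraction of x at or below each value: 0, 0.5, 1

# Toy candidates ranked by the sum of their per-variable percentiles
candidates <- data.frame(a = c(1, 4, 3), b = c(10, 40, 20))
e_a <- ecdf(candidates$a)
e_b <- ecdf(candidates$b)

candidates$pct_sum <- e_a(candidates$a) + e_b(candidates$b)
ranked <- candidates[order(-candidates$pct_sum), ]
ranked
```

Summing percentiles weights the three variables equally. That is a simple design choice for surfacing points that are high in every variable at once; a different weighting could be used if one variable matters more for the test.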

Finally, we can show a few of the points with large percentiles on the 95/5 surface:

top_10 <- full_probs_95_tbl %>%
  head(10)

fig <- fig %>% add_trace(x = top_10$ellip_bt, y = top_10$curv_bt, z = top_10$pressure_bt, type = "scatter3d", size = 30)

# fig

And there we have it! 10 candidate points representing credible points on the edge of the 5% probability region for 3 correlated lognormal variables with proper treatment of the boundary.

If you’ve made it this far, I thank you. Here are a couple appendices as a reward!

Appendix A - Simulating a Multivariate Distribution with MASS::mvrnorm()

Workflow:

Step 1 - Fit Distributions to Each Variable

ellip_fit <- fitdist(sample_data$ellip, "lnorm")
curv_fit <- fitdist(sample_data$curv, "lnorm")
pressure_fit <- fitdist(sample_data$pressure, "lnorm")

# store lognormal parameters of original data
ellip_meanlog <- ellip_fit$estimate[["meanlog"]]
ellip_sdlog <- ellip_fit$estimate[["sdlog"]]
curv_meanlog <- curv_fit$estimate[["meanlog"]]
curv_sdlog <- curv_fit$estimate[["sdlog"]]
pressure_meanlog <- pressure_fit$estimate[["meanlog"]]
pressure_sdlog <- pressure_fit$estimate[["sdlog"]]

# store correlations in original data
cor_ec <- cor(x = sample_data$ellip, y = sample_data$curv)
cor_ep <- cor(x = sample_data$ellip, y = sample_data$pressure)
cor_cp <- cor(x = sample_data$curv, y = sample_data$pressure)

# store covariances in original data
cov_ellip_curv <- cov(x = sample_data$ellip, y = sample_data$curv)
cov_ellip_ellip <- cov(x = sample_data$ellip, y = sample_data$ellip)
cov_curv_curv <- cov(x = sample_data$curv, y = sample_data$curv)
cov_ellip_pressure <- cov(x = sample_data$ellip, y = sample_data$pressure)
cov_pressure_pressure <- cov(x = sample_data$pressure, y = sample_data$pressure)
cov_curv_pressure <- cov(x = sample_data$curv, y = sample_data$pressure)

# summarize the parameters and reshape a bit
original_data_param_tbl <- tibble(
  ellip_meanlog = ellip_meanlog,
  ellip_sdlog = ellip_sdlog,
  curv_meanlog = curv_meanlog,
  curv_sdlog = curv_sdlog,
  pressure_meanlog = pressure_meanlog,
  pressure_sdlog = pressure_sdlog,
  ellip_curv_correlation = cor_ec,
  ellip_pressure_correlation = cor_ep,
  curv_pressure_correlation = cor_cp,
  ellip_ellip_covariance = cov_ellip_ellip,
  ellip_curv_covariance = cov_ellip_curv,
  curv_curv_covariance = cov_curv_curv,
  ellip_pressure_covariance = cov_ellip_pressure,
  pressure_pressure_covariance = cov_pressure_pressure,
  curv_pressure_covariance = cov_curv_pressure
) %>%
  pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%
  mutate(dataset = "original_data") %>%
  mutate_if(is.character, as_factor)

# View summary table of original data
original_data_param_tbl %>%
  kable(align = "c", digits = 3)

feature	value	dataset
ellip_meanlog	0.193	original_data
ellip_sdlog	0.064	original_data
curv_meanlog	1.158	original_data
curv_sdlog	0.309	original_data
pressure_meanlog	4.783	original_data
pressure_sdlog	0.191	original_data
ellip_curv_correlation	0.268	original_data
ellip_pressure_correlation	0.369	original_data
curv_pressure_correlation	0.213	original_data
ellip_ellip_covariance	0.006	original_data
ellip_curv_covariance	0.022	original_data
curv_curv_covariance	1.157	original_data
ellip_pressure_covariance	0.659	original_data
pressure_pressure_covariance	530.683	original_data
curv_pressure_covariance	5.285	original_data

Step 2 - Transform all variables to normal

A simple log operation brings the lognormal variable to normal.

# transform original, lognormal data to normal
normal_sample_data <- sample_data %>%
  mutate(
    n_ellip = log(ellip),
    n_curv = log(curv),
    n_pressure = log(pressure)
  )

normal_sample_data %>%
  head() %>%
  kable(align = "c", digits = 3)

ellip	curv	pressure	n_ellip	n_curv	n_pressure
1.255	4.506	92.739	0.228	1.505	4.530
1.285	5.019	182.970	0.251	1.613	5.209
1.289	4.027	153.858	0.254	1.393	5.036
1.234	2.139	108.669	0.210	0.760	4.688
1.133	3.673	123.633	0.125	1.301	4.817
1.219	2.373	113.944	0.198	0.864	4.736

Step 3 - Fit normal distributions to each transformed variable

We don’t actually have to formally fit normal distributions since it is convenient to obtain the mean and standard deviation at any time using the mean() or sd() functions. But we will extract and store correlations and covariances for the simulation to come.

# get correlations of transformed, normal data
ncor_ec <- cor(
  x = normal_sample_data$n_ellip,
  normal_sample_data$n_curv
)
ncor_ep <- cor(
  x = normal_sample_data$n_ellip,
  normal_sample_data$n_pressure
)
ncor_cp <- cor(
  x = normal_sample_data$n_curv,
  normal_sample_data$n_pressure
)

# get covariance of transformed, normal data
n_cov_ellip_curv <- cov(
  x = normal_sample_data$n_ellip,
  y = normal_sample_data$n_curv
)
n_cov_ellip_ellip <- cov(
  x = normal_sample_data$n_ellip,
  y = normal_sample_data$n_ellip
)
n_cov_curv_curv <- cov(
  x = normal_sample_data$n_curv,
  y = normal_sample_data$n_curv
)

n_cov_ellip_pressure <- cov(
  x = normal_sample_data$n_ellip,
  y = normal_sample_data$n_pressure
)
n_cov_pressure_pressure <- cov(
  x = normal_sample_data$n_pressure,
  y = normal_sample_data$n_pressure
)
n_cov_curv_pressure <- cov(
  x = normal_sample_data$n_curv,
  y = normal_sample_data$n_pressure
)
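As an aside, the six cov() calls above can be collapsed into a single matrix call, because cov() applied to a numeric matrix returns every variance and covariance at once. The sketch below is only an illustration: toy_data is a made-up stand-in for sample_data (its values are assumptions, not the real measurements), and n_cov_mat and n_mean_vec are names introduced here.

```r
# Hypothetical stand-in for sample_data; these values are made up for illustration
set.seed(1)
toy_data <- data.frame(
  ellip    = rlnorm(300, meanlog = 0.19, sdlog = 0.06),
  curv     = rlnorm(300, meanlog = 1.16, sdlog = 0.31),
  pressure = rlnorm(300, meanlog = 4.78, sdlog = 0.19)
)

# log-transform every column, then get all variances and covariances in one call
log_mat <- log(as.matrix(toy_data))
n_cov_mat <- cov(log_mat)        # 3 x 3 symmetric covariance matrix
n_mean_vec <- colMeans(log_mat)  # log-scale means, one per variable

# an individual entry matches the equivalent pairwise call
pairwise <- cov(x = log(toy_data$ellip), y = log(toy_data$curv))
all.equal(n_cov_mat["ellip", "curv"], pairwise)
```

A matrix built this way can be passed straight to the Sigma argument of MASS::mvrnorm(), which avoids typing the nine entries by hand.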

Step 4 - Draw joint distribution using mvrnorm() or equivalent function

Time to actually draw the correlated values. I store them here in an object called mult_norm.

# draw from multivariate normal with parameters from transformed normal distributions and correlation
set.seed(0118)

mult_norm <- as_tibble(MASS::mvrnorm(
  10000, c(
    mean(normal_sample_data$n_ellip),
    mean(normal_sample_data$n_curv),
    mean(normal_sample_data$n_pressure)
  ),
  matrix(c(
    n_cov_ellip_ellip,
    n_cov_ellip_curv,
    n_cov_ellip_pressure,
    n_cov_ellip_curv,
    n_cov_curv_curv,
    n_cov_curv_pressure,
    n_cov_ellip_pressure,
    n_cov_curv_pressure,
    n_cov_pressure_pressure
  ), 3, 3)
)) %>%
  rename(
    n_ellip_sim = V1,
    n_curv_sim = V2,
    n_pressure_sim = V3
  )
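To get a feel for what a multivariate normal draw is doing, here is a minimal base R sketch that builds correlated draws from independent ones using a Cholesky factor. This is a swapped-in illustration, not what MASS::mvrnorm() does internally (as far as I know it uses an eigendecomposition). The values in mu, sds, and R are illustrative assumptions loosely based on the summaries above, not the fitted log-scale values.

```r
# Sketch of a multivariate normal draw via a Cholesky factor (base R only).
# mu, sds, and R are illustrative assumptions, not the fitted values.
set.seed(118)
mu <- c(0.193, 1.158, 4.783)
sds <- c(0.064, 0.309, 0.191)
R <- matrix(c(
  1.00, 0.25, 0.35,
  0.25, 1.00, 0.20,
  0.35, 0.20, 1.00
), nrow = 3)
Sigma <- diag(sds) %*% R %*% diag(sds)     # covariance matrix from sds and correlations

n_draws <- 10000
Z <- matrix(rnorm(n_draws * 3), ncol = 3)  # independent standard normal draws
U <- chol(Sigma)                           # upper triangular, t(U) %*% U equals Sigma
X <- sweep(Z %*% U, 2, mu, "+")            # correlated draws shifted to mean mu

round(cor(X), 2)
```

The correlation matrix of X should land close to R, which is the same check done on the simulated data in Step 6.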

Step 5 - Back-transform simulated data to original distribution

Exponentiating the data brings it back to lognormal.

# convert back to lognormal
log_norm <- mult_norm %>%
  mutate(
    ellip_sim = exp(n_ellip_sim),
    curv_sim = exp(n_curv_sim),
    pressure_sim = exp(n_pressure_sim)
  )

log_norm %>%
  head() %>%
  kable(align = "c", digits = 3)

n_ellip_sim	n_curv_sim	n_pressure_sim	ellip_sim	curv_sim	pressure_sim
0.254	1.600	5.248	1.290	4.952	190.266
0.233	1.038	5.107	1.262	2.823	165.178
0.236	1.152	4.812	1.266	3.165	123.018
0.313	1.003	5.048	1.368	2.727	155.636
0.224	1.622	5.192	1.251	5.066	179.912
0.197	1.486	4.822	1.218	4.422	124.185

Step 6 - Evaluate parameters and marginal distributions of the back-transformed data

# evaluate the marginal distributions of the simulated data
ellip_sim_fit <- fitdistrplus::fitdist(log_norm$ellip_sim, "lnorm")
curv_sim_fit <- fitdistrplus::fitdist(log_norm$curv_sim, "lnorm")
pressure_sim_fit <- fitdistrplus::fitdist(log_norm$pressure_sim, "lnorm")

Obtain and store the correlation, covariances, and parameters of simulated set:

# get correlation and covariances of simulated data
sim_cor_ec <- cor(x = log_norm$ellip_sim, log_norm$curv_sim)
sim_cor_ep <- cor(x = log_norm$ellip_sim, log_norm$pressure_sim)
sim_cor_cp <- cor(x = log_norm$curv_sim, log_norm$pressure_sim)

sim_cov_ellip_curv <- cov(x = log_norm$ellip_sim, y = log_norm$curv_sim)
sim_cov_ellip_ellip <- cov(x = log_norm$ellip_sim, y = log_norm$ellip_sim)
sim_cov_curv_curv <- cov(x = log_norm$curv_sim, y = log_norm$curv_sim)

sim_cov_ellip_pressure <- cov(x = log_norm$ellip_sim, y = log_norm$pressure_sim)
sim_cov_pressure_pressure <- cov(x = log_norm$pressure_sim, y = log_norm$pressure_sim)
sim_cov_curv_pressure <- cov(x = log_norm$curv_sim, y = log_norm$pressure_sim)

# store parameters of simulated data
ellip_sim_meanlog <- ellip_sim_fit$estimate[["meanlog"]]
ellip_sim_sdlog <- ellip_sim_fit$estimate[["sdlog"]]
curv_sim_meanlog <- curv_sim_fit$estimate[["meanlog"]]
curv_sim_sdlog <- curv_sim_fit$estimate[["sdlog"]]
pressure_sim_meanlog <- pressure_sim_fit$estimate[["meanlog"]]
pressure_sim_sdlog <- pressure_sim_fit$estimate[["sdlog"]]

# collect parameters from simulated data
sim_data_param_tbl <- tibble(
  ellip_meanlog = ellip_sim_meanlog,
  ellip_sdlog = ellip_sim_sdlog,
  curv_meanlog = curv_sim_meanlog,
  curv_sdlog = curv_sim_sdlog,
  pressure_meanlog = pressure_sim_meanlog,
  pressure_sdlog = pressure_sim_sdlog,

  ellip_curv_correlation = sim_cor_ec,
  ellip_pressure_correlation = sim_cor_ep,
  curv_pressure_correlation = sim_cor_cp,

  ellip_curv_covariance = sim_cov_ellip_curv,
  ellip_ellip_covariance = sim_cov_ellip_ellip,
  curv_curv_covariance = sim_cov_curv_curv,

  ellip_pressure_covariance = sim_cov_ellip_pressure,
  pressure_pressure_covariance = sim_cov_pressure_pressure,
  curv_pressure_covariance = sim_cov_curv_pressure
) %>%
  pivot_longer(cols = everything(), names_to = "feature", values_to = "value") %>%
  mutate(dataset = "simulated_data") %>%
  mutate_if(is.character, as_factor)

sim_data_param_tbl %>%
  kable(align = "c")

feature	value	dataset
ellip_meanlog	0.1932042	simulated_data
ellip_sdlog	0.0630117	simulated_data
curv_meanlog	1.1626798	simulated_data
curv_sdlog	0.3092643	simulated_data
pressure_meanlog	4.7878497	simulated_data
pressure_sdlog	0.1900026	simulated_data
ellip_curv_correlation	0.2505145	simulated_data
ellip_pressure_correlation	0.3644292	simulated_data
curv_pressure_correlation	0.1956149	simulated_data
ellip_curv_covariance	0.0203344	simulated_data
ellip_ellip_covariance	0.0058779	simulated_data
curv_curv_covariance	1.1209300	simulated_data
ellip_pressure_covariance	0.6534943	simulated_data
pressure_pressure_covariance	547.0647415	simulated_data
curv_pressure_covariance	4.8440727	simulated_data

Compare Original Data to Simulated Data

A bit more wrangling lets us compare the features of the original dataset to the new, simulated population to see if they agree.

# one row per feature, one column per dataset
compare_tbl <- bind_rows(original_data_param_tbl, sim_data_param_tbl) %>%
  pivot_wider(names_from = dataset, values_from = value)

compare_tbl %>%
  kable(align = "c", digits = 3)

feature	original_data	simulated_data
ellip_meanlog	0.193	0.193
ellip_sdlog	0.064	0.063
curv_meanlog	1.158	1.163
curv_sdlog	0.309	0.309
pressure_meanlog	4.783	4.788
pressure_sdlog	0.191	0.190
ellip_curv_correlation	0.268	0.251
ellip_pressure_correlation	0.369	0.364
curv_pressure_correlation	0.213	0.196
ellip_ellip_covariance	0.006	0.006
ellip_curv_covariance	0.022	0.020
curv_curv_covariance	1.157	1.121
ellip_pressure_covariance	0.659	0.653
pressure_pressure_covariance	530.683	547.065
curv_pressure_covariance	5.285	4.844
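For reference, the standard lognormal moment identities connect the log-scale parameters used in the simulation to the original-scale covariances in the table. Writing \(\mu_i\) and \(\Sigma_{ij}\) for the log-scale means and covariances (symbols introduced here for convenience, they are not used elsewhere in this post):

\(E[X_i] = \exp(\mu_i + \Sigma_{ii}/2)\)

\(\text{Cov}(X_i, X_j) = E[X_i] \, E[X_j] \, (\exp(\Sigma_{ij}) - 1)\)

As a rough check with the ellipticity estimates (meanlog 0.193, sdlog 0.064): \(E[X] \approx \exp(0.193 + 0.064^2/2) \approx 1.215\) and \(\text{Var}(X) \approx 1.215^2 \, (\exp(0.064^2) - 1) \approx 0.006\), which is in line with ellip_ellip_covariance above.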

Appendix B - 2d kde plot with probability traces

First, select the 2 variables of interest.

d <- correlated_ln_draws_tbl %>% select(ellip, curv)

## density function
kd <- ks::kde(d, compute.cont = TRUE, h = 0.05)

Here’s ellipticity vs. curvature. Note that the default contour lines drawn here are not probability region boundaries, though they are related.

cp_plt <- correlated_ln_draws_tbl %>%
  ggplot(aes(x = ellip, y = curv)) +
  geom_point(alpha = .3, size = .5) +
  geom_density2d(size = 1.3) +
  theme_classic() +
  xlim(c(.9, 1.6)) +
  ylim(c(1, 7.5)) +
  labs(
    title = "Joint Distribution of Vessel Ellipticity and Curvature",
    subtitle = "Density Contours at Default Settings",
    x = "Ellipticity (unitless)",
    y = "Radius of Curvature (mm)"
  )

cp_plt

Now a function to extract the points of the contour line from the kde:

get_contour <- function(kd_out = kd, prob = "5%") {
  contour_95 <- with(kd_out, contourLines(
    x = eval.points[[1]], y = eval.points[[2]],
    z = estimate, levels = cont[prob]
  )[[1]])
  as_tibble(contour_95) %>%
    mutate(prob = prob)
}
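It may help to see why a density height can stand in for a probability region. My reading of ks::kde() with compute.cont = TRUE (an assumption on my part) is that cont["5%"] is the 5th percentile of the estimated density evaluated at the data points, so the region where the density exceeds that height holds roughly 95% of the points. The base R sketch below illustrates the principle with a simulated pair of variables whose true density is known. It does not use ks, and all names in it are made up for the example.

```r
# Base R sketch of the idea behind a "5%" density contour level.
# Assumption: the level is the 5th percentile of the density at the data points.
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n, sd = sqrt(0.75))  # correlated with x, unit variance overall

# exact joint density at each point: f(x, y) = f(x) * f(y | x)
dens_at_points <- dnorm(x) * dnorm(y, mean = 0.5 * x, sd = sqrt(0.75))

level_5 <- quantile(dens_at_points, probs = 0.05)  # density height for the "5%" level
frac_inside <- mean(dens_at_points >= level_5)     # share of points inside that contour
frac_inside
```

The share of points inside the contour should come out very close to 0.95, which is why the "5%" level traces the edge of the 95% region.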

Map it over the kd object.

dat_out <- map_dfr(c("5%", "20%", "40%", "60%", "80%", "95%"), ~ get_contour(kd, .)) %>%
  group_by(prob) %>%
  mutate(n_val = 1:n()) %>%
  ungroup()

dat_out %>%
  head(10) %>%
  kable(align = "c")

level	x	y	prob	n_val
0.1144314	1.027589	2.246533	5%	1
0.1144314	1.027172	2.265195	5%	2
0.1144314	1.025855	2.335547	5%	3
0.1144314	1.025083	2.405899	5%	4
0.1144314	1.025079	2.476250	5%	5
0.1144314	1.025998	2.546603	5%	6
0.1144314	1.027589	2.606175	5%	7
0.1144314	1.027847	2.616954	5%	8
0.1144314	1.030135	2.687306	5%	9
0.1144314	1.032082	2.738850	5%	10

Clean kde output

kd_df <- expand_grid(x = kd$eval.points[[1]], y = kd$eval.points[[2]]) %>%
  mutate(z = c(kd$estimate %>% t()))

Now visualize again, this time with probability contours at specified values and the 5% curve labeled with geom_label_repel().

label_tbl <- dat_out %>%
  filter(
    prob == "5%",
    n_val == 100
  )

# visualize
ellip_curv_2plt <- ggplot(data = kd_df, aes(x, y)) +
  geom_tile(aes(fill = z)) +
  geom_point(data = d, aes(x = ellip, y = curv), alpha = .4, size = .4, colour = "white") +
  geom_path(aes(x, y, group = prob),
    data = dat_out %>% filter(prob %in% c("5%", "20%", "40%", "60%", "80%", "95%")), colour = "white", size = 1.2, alpha = .8
  ) +
  #  geom_text(aes(label = prob), data =
  #              filter(dat_out, (prob %in% c("5%") & n_val==1)), # | (prob %in% c("90%") & n_val==20)),
  #            colour = "yellow", size = 5)+
  geom_label_repel(
    data = label_tbl, aes(x, y),
    label = label_tbl$prob[1],
    fill = "yellow",
    color = "black",
    segment.color = "yellow",
    #    segment.size = 1,
    min.segment.length = unit(1, "lines"),
    nudge_y = .5,
    nudge_x = -.025
  ) +
  xlim(c(.95, 1.5)) +
  ylim(c(0, 7.5)) +
  labs(
    title = "Joint Distribution [Ellipticity and Radius of Curvature]",
    subtitle = "Simulated Data",
    caption = "Density Contours shown at 5%, 20%, 40%, 60%, 80%, 95%"
  ) +
  scale_fill_viridis_c(end = .9) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = "Ellipticity (unitless)", y = "Radius of Curvature (mm)")

ggExtra::ggMarginal(ellip_curv_2plt, type = "density", fill = "#403891ff", alpha = .7)

Hamdan et al., Journal of the American College of Cardiology, Volume 59, Issue 2, 2012, Pages 119-127↩︎
This method would be analogous to creating prediction intervals and is conditional on the model in the sense that the only parameters considered are the maximum likelihood estimates. Alternate, more conservative ways to simulate the population could involve tolerance intervals or Bayesian methods with a simulated posterior distribution to push out predictions.↩︎
see Water 2020, 12, 1645; doi:10.3390/w12061645 ↩︎
Adapted from this Stack Overflow response: https://stackoverflow.com/questions/23437000/how-to-plot-a-contour-line-showing-where-95-of-values-fall-within-in-r-and-in ↩︎

Creating and Using a Simple, Bayesian Linear Model (in brms and R)

Sun, 01 Dec 2019 00:00:00 +0000

This post is my good-faith effort to create a simple linear model using the Bayesian framework and workflow described by Richard McElreath in his Statistical Rethinking book.¹ As always - please view this post through the lens of the eager student and not the learned master. I did my best to check my work, but it’s entirely possible that something was missed. Please let me know - I won’t take it personally. As McElreath notes in his lectures - “if you’re confused, it’s because you’re paying attention”. And sometimes I get confused - this is a lot harder than my old workflow which consisted of clicking “add a trendline” in Excel. Thinking Bayesian is still relatively new to me. Disclaimer over - let’s get to it.

I’m playing around with a bunch of fun libraries in this one.

library(tidyverse)
library(styler)
library(ggExtra)
library(knitr)
library(brms)
library(cowplot)
library(gridExtra)
library(skimr)
library(DiagrammeR)
library(rayshader)
library(av)
library(rgl)

I made up this data set. It represents hypothetical values of ablation time and tissue temperature as measured by sensors embedded in an RF ablation catheter. This type of device is designed to apply RF or thermal energy to the vessel wall. The result is a lesion that can help improve arrhythmia, reduce hypertension, or provide some other desired outcome.

In RF ablations, the tissue heats up over the course of the RF cycle, resulting in a drop in impedance that varies over time. The goal will be to see how much of the variation in temperature is described by time (over some limited range) and then communicate the uncertainty in the predictions visually. None of this detail is terribly important other than I like to frame my examples from within my industry and McElreath emphasizes grounding our modeling in real world science and domain knowledge. This is what an ablation catheter system looks like:²

To get things started, load the data and give it a look with skim(). There are no missing values.

ablation_dta_tbl <- read.csv(file = "abl_data_2.csv")
ablation_dta_tbl <- ablation_dta_tbl %>% select(temp, time)
ablation_dta_tbl %>% skim()

## Skim summary statistics
##  n obs: 331 
##  n variables: 2 
## 
## -- Variable type:numeric ------------------------------------------------------------------------------------------------------------
##  variable missing complete   n  mean   sd    p0   p25   p50   p75  p100
##      temp       0      331 331 77.37 3.9  68.26 74.61 77.15 80.33 89.53
##      time       0      331 331 22.57 3.22 15.83 20.22 22.54 24.69 31.5 
##      hist
##  ▁▅▇▆▆▃▁▁
##  ▂▆▇▇▇▃▂▁

Let’s start with a simple visualization. The code below builds out a scatterplot with marginal histograms which I think is a nice, clean way to evaluate scatter data.³ These data seem plausible since the tissue typically heats up as the procedure goes on. In reality the response goes asymptotic but we’ll work over a limited range of time where the behavior might reasonably be linear.

scatter_1_fig <- ablation_dta_tbl %>% ggplot(aes(x = time, y = temp)) +
  geom_point(
    colour = "#2c3e50",
    fill = "#2c3e50",
    size = 2,
    alpha = 0.4
  ) +
  labs(
    x = "Ablation Time (seconds)",
    y = "Tissue Temperature (deg C)",
    title = "Ablation Time vs. Tissue Temperature",
    subtitle = "Simulated Catheter RF Ablation"
  )

scatter_hist_1_fig <- ggMarginal(scatter_1_fig,
  type = "histogram",
  color = "white",
  alpha = 0.7,
  fill = "#2c3e50",
  xparams = list(binwidth = 1),
  yparams = list(binwidth = 2.5)
)

# ggExtra needs these explicit calls to display in Markdown docs *shrug*
grid::grid.newpage()
grid::grid.draw(scatter_hist_1_fig)

It helps to have a plan. If I can create a posterior distribution that captures reasonable values for the model parameters and confirm that the model makes reasonable predictions then I will be happy. Here’s the workflow that hopefully will get me there.

grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle, fillcolor = yellow]        
      tab1 [label = 'Step 1: Propose a distribution for the response variable \n Choose a maximum entropy distribution given the constraints you understand']
      tab2 [label = 'Step 2: Parameterize the mean \n The mean of the response distribution will vary linearly across the range of predictor values']
      tab3 [label = 'Step 3: Set priors \n Simulate what the model knows before seeing the data.  Use domain knowledge as constraints.']
      tab4 [label = 'Step 4: Define the model \n Create the model using the observed data, the likelihood function, and the priors']
      tab5 [label = 'Step 5: Draw from the posterior \n Plot plausible lines using parameters visited by the Markov chains']
      tab6 [label = 'Step 6: Push the parameters back through the model \n Simulate real data from plausible combinations of mean and sigma']
      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4 -> tab5 -> tab6;
      }
      ")

Step 1: Propose a distribution for the response variable

A Gaussian model is reasonable for the outcome variable Temperature as we know it is measured by the thermocouples on the distal end of the catheter. According to McElreath (pg 75):

Measurement errors, variations in growth, and the velocities of molecules all tend towards Gaussian distributions. These processes do this because at their heart, these processes add together fluctuations. And repeatedly adding finite fluctuations results in a distribution of sums that have shed all information about the underlying process, aside from mean and spread.
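A quick simulation makes the quote concrete. In the sketch below each value is the sum of 16 small uniform steps, which is an arbitrary choice on my part rather than anything from the book or the ablation data. The individual steps are flat, not bell-shaped, yet the sums should behave very much like a Gaussian.

```r
# Each simulated value is the sum of 16 small, independent, non-Gaussian steps
set.seed(1999)
n_sims <- 10000
n_steps <- 16
sums <- replicate(n_sims, sum(runif(n_steps, min = -1, max = 1)))

# a uniform(-1, 1) step has variance 1/3, so the sums should have sd near sqrt(16 / 3)
theoretical_sd <- sqrt(n_steps / 3)
c(mean = mean(sums), sd = sd(sums), theoretical_sd = theoretical_sd)

# compare the central 95% of the sums to the matching normal quantiles
rbind(
  simulated = quantile(sums, probs = c(0.025, 0.975)),
  normal    = qnorm(c(0.025, 0.975), mean = 0, sd = theoretical_sd)
)
```

If the two rows of quantiles are close, the sums have "shed" the shape of the underlying uniform steps and kept only a mean and a spread.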

Here’s us formally asserting Temperature as a normal distribution with mean mu and standard deviation sigma. These two parameters are all that is needed to completely describe the distribution and also pin down the likelihood function.

\(T_i \sim \text{Normal}(\mu_i, \sigma)\)

Step 2: Parameterize the mean

If we further parameterize the mean we can do some neat things like move it around with the predictor variable. This is a pretty key concept - you move the mean of the outcome variable around by parameterizing it. If we make it a line then it will move linearly with the predictor variable. The real data will still have a spread once the sigma term is folded back in, but we can think of the whole distribution shifting up and down based on the properties of the line.

Here’s us asserting we want mu to move linearly with changes in the predictor variable (time). Subtracting the mean from each value of the predictor variable “centers” the data which McElreath recommends in most cases. I will explore the differences between centered and un-centered later on.

\(\mu_i = \alpha + \beta (x_i - \bar{x})\)

Step 3: Set priors

We know some things about these data and we should use them to help regularize the model through the priors.

Temperature is a continuous variable so we want a continuous distribution. We also know from the nature of the treatment that there isn’t really any physical mechanism within the device that would be expected to cool down the tissue below normal body temperature. Since only heating is expected, the slope should be positive or zero.

McElreath emphasizes simulating from the priors to visualize “what the model knows before it sees the data”. Here are some priors to consider. Let’s evaluate.

# Set seed for repeatability
set.seed(1999)

# number of sims
n <- 150

# random draws from the specified prior distributions
# lognormal distribution is used to constrain slopes to positive values
a <- rnorm(n, 75, 15)

b <- rnorm(n, 0, 1)
b_ <- rlnorm(n, 0, 0.8)

# calc mean of time and temp for later use
mean_temp <- mean(ablation_dta_tbl$temp)
mean_time <- mean(ablation_dta_tbl$time)

# dummy tibble to feed ggplot()
empty_tbl <- tibble(x = 0)

# y = b(x - mean(var_1)) + a is equivalent to:
# y = bx + (a - b * mean(var_1))

# in this fig we use the uninformed prior that generates some unrealistic values
prior_fig_1 <- empty_tbl %>% ggplot() +
  geom_abline(
    intercept = a - b * mean_time,
    slope = b,
    color = "#2c3e50",
    alpha = 0.3,
    size = 1
  ) +
  ylim(c(0, 150)) +
  xlim(c(0, 150)) +
  labs(
    x = "time (sec)",
    y = "Temp (C)",
    title = "Prior Predictive Simulations",
    subtitle = "Uninformed Prior"
  )

# in this fig we confine the slopes to broad ranges informed by what we know about the domain
prior_fig_2 <- empty_tbl %>% ggplot() +
  geom_abline(
    intercept = a - b_ * mean_time,
    slope = b_,
    color = "#2c3e50",
    alpha = 0.3,
    size = 1
  ) +
  ylim(c(0, 150)) +
  xlim(c(0, 150)) +
  labs(
    x = "time (sec)",
    y = "Temp (C)",
    title = "Prior Predictive Simulations",
    subtitle = "Mildly Informed Prior"
  )

plot_grid(prior_fig_1, prior_fig_2)

The plots above show what the model thinks before seeing the data for two different sets of priors. In both cases, I have centered the data by subtracting the mean of the time from each individual value of time. This means the intercept has the meaning of the expected temperature at the mean of time. The family of lines on the right seems a lot more realistic despite having some slopes that predict strange values out of sample (blood coagulates at ~90C). Choosing a lognormal distribution for the slope ensures positive slopes. You could probably go even tighter on these priors but for this exercise I’m feeling good about proceeding.
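Another way to compare the two slope priors is to look at their quantiles directly instead of drawing lines. This small sketch uses only base R. The object names are made up for the example and the 95% range is just a convenient summary.

```r
# Quantiles of the two candidate slope priors
probs <- c(0.025, 0.5, 0.975)
normal_prior_q <- qnorm(probs, mean = 0, sd = 1)
lognormal_prior_q <- qlnorm(probs, meanlog = 0, sdlog = 0.8)

rbind(normal = normal_prior_q, lognormal = lognormal_prior_q)

# draws from the lognormal prior are strictly positive, so no cooling slopes
set.seed(1999)
all(rlnorm(150, meanlog = 0, sdlog = 0.8) > 0)
```

The LogNormal(0, 0.8) prior has a median slope of 1 and puts its central 95% between roughly 0.2 and 4.8 degrees per second, while the Normal(0, 1) prior puts half of its mass on negative slopes.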

Here is the same set of lines, looking only at the time window of the original observations and the Temp window bounded by body temperature (lower bound) and water boiling (upper bound).

empty_tbl %>% ggplot() +
  geom_abline(
    intercept = a - b_ * mean_time,
    slope = b_,
    color = "#2c3e50",
    alpha = 0.3,
    size = 1
  ) +
  ylim(c(37, 100)) +
  xlim(c(15, 40)) +
  labs(
    x = "time (sec)",
    y = "Temp (C)",
    title = "Prior Predictive Simulations",
    subtitle = "Mildly Informed Prior, Original Data Range"
  )

Here are the prior distributions selected to go forward.

\(\alpha \sim \text{Normal}(75, 15)\)

\(\beta \sim \text{LogNormal}(0, .8)\)

\(\sigma \sim \text{Uniform}(0, 30)\)

Step 4: Define the model

Here I use the brm() function in brms to build what I’m creatively calling: “model_1”. This one uses the un-centered data for time. This function uses Markov Chain Monte Carlo to survey the parameter space. After the warm up cycles, the relative amount of time the chains spend at each parameter value is a good approximation of the true posterior distribution. I’m using a lot of warm up cycles because I’ve heard chains for the uniform priors on sigma can take a long time to converge. This model still takes a bit of time to chug through the parameter space on my modest laptop.

#model_1 <-
#  brm(
#    data = ablation_dta_tbl, family = gaussian,
#    temp ~ 1 + time,
#    prior = c(
#      prior(normal(75, 15), class = Intercept),
#      prior(lognormal(0, .8), class = b),
#      prior(uniform(0, 30), class = sigma)
#    ),
#    iter = 41000, warmup = 40000, chains = 4, cores = 4,
#    seed = 4
#  )
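To make the "time spent at each parameter value" idea concrete, here is a toy random-walk Metropolis sampler in base R. To be clear about the swap: brms hands sampling to Stan, which uses Hamiltonian Monte Carlo rather than this simple algorithm, and the data below are made up. The sketch estimates only a mean, with sigma treated as known, so it is far simpler than model_1.

```r
# Toy random-walk Metropolis sampler for the mean of normal data (sd treated as known).
# This is NOT the algorithm brms/Stan uses; it only illustrates the idea of a chain.
set.seed(4)
y <- rnorm(50, mean = 77, sd = 2.6)  # made-up observations

log_post <- function(mu) {
  sum(dnorm(y, mean = mu, sd = 2.6, log = TRUE)) +  # log likelihood
    dnorm(mu, mean = 75, sd = 15, log = TRUE)       # log of the Normal(75, 15) prior
}

n_iter <- 5000
warmup <- 1000
chain <- numeric(n_iter)
chain[1] <- 60  # deliberately poor starting value

for (i in 2:n_iter) {
  proposal <- chain[i - 1] + rnorm(1, sd = 0.5)
  log_ratio <- log_post(proposal) - log_post(chain[i - 1])
  # accept with probability min(1, ratio); otherwise stay at the current value
  chain[i] <- if (log(runif(1)) < log_ratio) proposal else chain[i - 1]
}

kept <- chain[(warmup + 1):n_iter]  # discard the warm up portion
c(posterior_mean = mean(kept), sample_mean = mean(y))
```

With a prior this wide the posterior mean should sit very close to the sample mean, and a histogram of kept approximates the posterior for the mean.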

Step 5: Draw from the posterior

The fruits of all my labor! The posterior holds credible combinations for sigma and the slope and intercept (which together describe the mean of the outcome variable we care about). Let’s take a look.

post_samplesM1_tbl <-
  posterior_samples(model_1) %>%
  select(-lp__) %>%
  round(digits = 3)

post_samplesM1_tbl %>%
  head(10) %>%
  kable(align = rep("c", 3))

b_Intercept	b_time	sigma
58.509	0.841	2.682
55.983	0.949	2.648
56.195	0.937	2.540
56.661	0.919	2.474
55.143	0.978	2.593
55.170	0.977	2.667
54.908	0.996	2.621
58.453	0.836	2.534
54.134	1.031	2.647
58.713	0.828	2.707

The plotting function in brms is pretty sweet. I’m not expert in MCMC diagnostics but I do know the “fuzzy caterpillar” look of the trace plots is desirable.

plot(model_1)

The posterior_summary() function can grab the model results in table form.

mod_1_summary_tbl <-
  posterior_summary(model_1) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  as_tibble() %>%
  # round every numeric column to 2 significant figures (funs() is deprecated in dplyr)
  mutate(across(where(is.numeric), ~ signif(.x, 2)))

mod_1_summary_tbl %>%
  kable(align = rep("c", 5))

rowname	Estimate	Est.Error	Q2.5	Q97.5
b_Intercept	57.00	1.000	55.00	59.0
b_time	0.91	0.045	0.83	1.0
sigma	2.60	0.100	2.40	2.8
lp__	-790.00	1.300	-790.00	-790.0
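As far as I can tell, the columns in this table are just column-wise summaries of the posterior draws: the mean, the standard deviation, and the 2.5% and 97.5% quantiles. The sketch below recomputes them by hand from the first ten draws printed earlier in this step. Ten draws are far too few to reproduce the table, so treat it as a demonstration of the calculation only. The helper summarise_draws() is a name made up here.

```r
# First ten posterior draws copied from the table printed earlier in this step
draws <- data.frame(
  b_Intercept = c(58.509, 55.983, 56.195, 56.661, 55.143, 55.170, 54.908, 58.453, 54.134, 58.713),
  b_time = c(0.841, 0.949, 0.937, 0.919, 0.978, 0.977, 0.996, 0.836, 1.031, 0.828),
  sigma = c(2.682, 2.648, 2.540, 2.474, 2.593, 2.667, 2.621, 2.534, 2.647, 2.707)
)

# hypothetical helper: mean, sd, and the 2.5% / 97.5% quantiles of one column
summarise_draws <- function(x) {
  c(Estimate = mean(x), Est.Error = sd(x), quantile(x, probs = c(0.025, 0.975)))
}
manual_summary <- t(sapply(draws, summarise_draws))
manual_summary

# intercept and slope draws move in opposite directions in the un-centered model
cor(draws$b_Intercept, draws$b_time)
```

The strong negative correlation between b_Intercept and b_time is a preview of why centering the predictor changes the intercept so much.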

Now let’s see what changes if the time data is centered. Everything is the same here in model_2 except the time_c data which is transformed by subtracting the mean from each value.

ablation_dta_tbl <- ablation_dta_tbl %>% mutate(time_c = time - mean(time))

#model_2 <-
#  brm(
#    data = ablation_dta_tbl, family = gaussian,
#    temp ~ 1 + time_c,
#    prior = c(
#      prior(normal(75, 15), class = Intercept),
#      prior(lognormal(0, .8), class = b),
#      prior(uniform(0, 30), class = sigma)
#    ),
#    iter = 41000, warmup = 40000, chains = 4, cores = 4,
#    seed = 4
#  )

Plotting model_2 to compare with the output of model_1 above.

plot_mod_2_fig <- plot(model_2)

The slope and sigma are very similar between the two models. The intercept is the only real difference: in model_1 it ranges from the low to high 50s, while in model_2 it is tight around 77. We should visualize the lines proposed by the parameters in the posteriors of our models to understand the uncertainty associated with the mean and also understand why the intercepts are different between models. First, store the posterior samples as a tibble in anticipation of ggplot.

post_samplesM2_tbl <-
  posterior_samples(model_2) %>%
  select(-lp__) %>%
  round(digits = 3)

post_samplesM2_tbl %>%
  head(10) %>%
  kable(align = rep("c", 3))

b_Intercept	b_time_c	sigma
77.323	0.894	2.350
77.430	0.881	2.516
77.335	0.957	2.571
77.011	0.947	2.776
77.209	1.013	2.691
77.517	0.820	2.488
77.335	0.881	2.682
77.313	0.857	2.538
77.423	0.873	2.569
77.302	0.926	2.340

Visualize the original data (centered and un-centered versions) along with plausible values for regression line of the mean:

mean_regressionM1_fig <-
  ablation_dta_tbl %>%
  ggplot(aes(x = time, y = temp)) +
  geom_point(
    colour = "#481567FF",
    size = 2,
    alpha = 0.6
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time),
    data = post_samplesM1_tbl,
    alpha = 0.1, color = "gray50"
  ) +
  geom_abline(
    slope = mean(post_samplesM1_tbl$b_time),
    intercept = mean(post_samplesM1_tbl$b_Intercept),
    color = "blue", size = 1
  ) +
  labs(
    title = "Regression Line Representing Mean of Slope",
    subtitle = "Data is As-Observed (No Centering of Predictor)",
    x = "Time (s)",
    y = "Temperature (C)"
  )

mean_regressionM2_fig <-
  ablation_dta_tbl %>%
  ggplot(aes(x = time_c, y = temp)) +
  geom_point(
    color = "#55C667FF",
    size = 2,
    alpha = 0.6
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time_c),
    data = post_samplesM2_tbl,
    alpha = 0.1, color = "gray50"
  ) +
  geom_abline(
    slope = mean(post_samplesM2_tbl$b_time_c),
    intercept = mean(post_samplesM2_tbl$b_Intercept),
    color = "blue", size = 1
  ) +
  labs(
    title = "Regression Line Representing Mean of Slope",
    subtitle = "Predictor Data (Time) is Centered",
    x = "Time (Difference from Mean Time in seconds)",
    y = "Temperature (C)"
  )


combined_mean_fig <-
  ablation_dta_tbl %>%
  ggplot(aes(x = time, y = temp)) +
  geom_point(
    colour = "#481567FF",
    size = 2,
    alpha = 0.6
  ) +
  geom_point(
    data = ablation_dta_tbl, aes(x = time_c, y = temp),
    colour = "#55C667FF",
    size = 2,
    alpha = 0.6
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time),
    data = post_samplesM1_tbl,
    alpha = 0.1, color = "gray50"
  ) +
  geom_abline(
    slope = mean(post_samplesM1_tbl$b_time),
    intercept = mean(post_samplesM1_tbl$b_Intercept),
    color = "blue", size = 1
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time_c),
    data = post_samplesM2_tbl,
    alpha = 0.1, color = "gray50"
  ) +
  geom_abline(
    slope = mean(post_samplesM2_tbl$b_time_c),
    intercept = mean(post_samplesM2_tbl$b_Intercept),
    color = "blue", size = 1
  ) +
  labs(
    title = "Regression Line Representing Mean of Slope",
    subtitle = "Centered and Un-Centered Predictor Data",
    x = "Time (s)",
    y = "Temperature (C)"
  )

combined_predicts_fig <- combined_mean_fig + 
  ylim(c(56,90)) +
  labs(title = "Points Represent Observed Data (Green is Centered)",
       subtitle = "Regression Line Represents Rate of Change of Mean (Grey Bands are Uncertainty)")

Now everything is clear. The slopes are essentially the same (as we saw in the density plots from plot() for model_1 and model_2). The intercepts are different because in the centered data (green) the intercept occurs when the predictor equals 0 (its new mean). The outcome variable temp must therefore also be at its mean value in the “knot” of the bow-tie.

For the un-centered data (purple), the intercept is the value of Temperature when the un-adjusted time is at 0. That point lies well outside the observed time range (roughly 16 to 32 seconds), so the plausible intercepts are much more uncertain here.
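The same behavior shows up in an ordinary least squares fit, which is quicker to experiment with than re-running the sampler. The sketch below swaps in lm() for brm() and uses simulated stand-in data (the numbers are made up, loosely resembling the ablation data), so it illustrates the algebra rather than reproducing the models above.

```r
# Simulated stand-in for the ablation data; the numbers are made up for illustration
set.seed(4)
time <- runif(300, min = 16, max = 32)
temp <- 57 + 0.9 * time + rnorm(300, sd = 2.6)

fit_raw <- lm(temp ~ time)          # un-centered predictor
time_c <- time - mean(time)
fit_centered <- lm(temp ~ time_c)   # centered predictor

coef(fit_raw)
coef(fit_centered)

# the centered intercept equals the raw intercept plus slope * mean(time)
shifted <- coef(fit_raw)[["(Intercept)"]] + coef(fit_raw)[["time"]] * mean(time)
all.equal(shifted, coef(fit_centered)[["(Intercept)"]])

# the centered intercept is also estimated much more precisely
se_raw <- summary(fit_raw)$coefficients["(Intercept)", "Std. Error"]
se_centered <- summary(fit_centered)$coefficients["(Intercept)", "Std. Error"]
c(raw = se_raw, centered = se_centered)
```

Centering leaves the slope alone and simply moves the intercept to the middle of the data, where the line is pinned down best.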

Another way to look at the differences is as a map of the plausible parameter space. We need a plot that can represent 3 parameters: intercept, slope, and sigma. Each point will be a credible combination of the three parameters as observed in 1 row of the posterior distribution tibble(s).

First, the un-centered model.

p_spaceM1_fig <- 
  post_samplesM1_tbl[1:1000, ] %>%
  ggplot(aes(x = b_time, y = b_Intercept, color = sigma)) +
  geom_point(alpha = 0.5) +
  geom_density2d(color = "gray30") +
  scale_color_viridis_c() +
  labs(
    title = "Parameter Space - Model 1 (Un-Centered)",
    subtitle = "Intercept Represents the Expected Temp at Time = 0"
  )

Now the centered version:

p_spaceM2_fig <- 
  post_samplesM2_tbl[1:1000, ] %>%
  ggplot(aes(x = b_time_c, y = b_Intercept, color = sigma)) +
  geom_point(alpha = 0.5) +
  geom_density2d(color = "gray30") +
  scale_color_viridis_c() +
  labs(
    title = "Parameter Space - Model 2 (Centered)",
    subtitle = "Intercept Represents the Expected Temp at Mean Time"
  )

#p_spaceM2_fig 
#ggsave(filename = "p_spaceM2_fig.png")

These look way different, but part of it is an illusion of the scaling on the y-axis. Remember how the credible values of the intercept were much tighter for the centered model? If we plot them both on the same canvas we can understand better, and it’s pretty (to my eye at least).

p_spaceC_tbl <- 
  post_samplesM2_tbl[1:1000, ] %>%
  ggplot(aes(x = b_time_c, y = b_Intercept, color = sigma)) +
  geom_point(alpha = 0.5) +
  geom_point(data = post_samplesM1_tbl, aes(x = b_time, y = b_Intercept, color = sigma), alpha = 0.5) +
  scale_color_viridis_c() +
  labs(
    title = "Credible Parameter Values for Models 1 and 2",
    subtitle = "Model 1 is Un-Centered, Model 2 is Centered",
    x = expression(beta["time"]),
    y = expression(alpha["Intercept"])) +
  ylim(c(54, 80))

Now we see they aren’t as different as they first seemed. They cover very similar ranges for the slope and the un-centered model covers a wider range of plausible intercepts.
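The wider band of intercepts in the un-centered model goes hand in hand with a trade-off between intercept and slope: when time = 0 sits at the edge of the data, a steeper line needs a lower intercept to still pass through the points. A rough least-squares illustration of that trade-off is below. It uses hypothetical toy data (the `toy_*` objects are made up), not the posterior samples from the brms fits, and looks at the estimated correlation between the two coefficients.

```r
# Hypothetical toy data, for illustration only
set.seed(7)
toy_time   <- seq(0, 36, by = 2)
toy_temp   <- 57 + 0.6 * toy_time + rnorm(length(toy_time), sd = 2.5)
toy_time_c <- toy_time - mean(toy_time)

fit_raw      <- lm(toy_temp ~ toy_time)
fit_centered <- lm(toy_temp ~ toy_time_c)

# Correlation between the intercept and slope estimates in each fit
cor_raw      <- cov2cor(vcov(fit_raw))[1, 2]
cor_centered <- cov2cor(vcov(fit_centered))[1, 2]

c(un_centered = cor_raw, centered = cor_centered)
```

In this toy setup the un-centered fit shows a strong negative correlation between intercept and slope, while centering drives it to essentially zero.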

I’ve been looking for a good time to fire up the rayshader package and I’m not throwing away my shot here. Plotting with rayshader feels like a superpower that I shouldn’t be allowed to have. It’s silly how easy it is to make these ridiculous visuals. First, a fancy 3d plot providing some perspective on the relative “heights” of sigma (the parameter mapped to color in the combined plot).

#par(mfrow = c(1, 1))
#plot_gg(p_spaceC_tbl, width = 5, height = 4, scale = 300, multicore = TRUE, windowsize = c(1200, 960),
#        fov = 70, zoom = 0.45, theta = 330, phi = 40)

#Sys.sleep(0.2)
#render_depth(focus = 0.7, focallength = 200)

If you want more, this code below renders a video guaranteed to impress small children and executives. I borrowed this code from Joey Stanley who borrowed it from Tyler Morgan-Wall.⁴

#install.packages("av")
#library(av)

# Set up the camera position and angle
#phivechalf = 30 + 60 * 1/(1 + exp(seq(-7, 20, length.out = 180)/2))
#phivecfull = c(phivechalf, rev(phivechalf))
#thetavec = 0 + 60 * sin(seq(0,359,length.out = 360) * pi/180)
#zoomvec = 0.45 + 0.2 * 1/(1 + exp(seq(-5, 20, length.out = 180)))
#zoomvecfull = c(zoomvec, rev(zoomvec))

# Actually render the video.
#render_movie(filename = "hex_plot_fancy_2", type = "custom", 
#            frames = 360,  phi = phivecfull, zoom = zoomvecfull, theta = thetavec)

Step 6: Push the parameters back through the model

After a lot of work we have finally identified the credible values for our model parameters. We now want to see what sort of predictions our posterior makes. Again, I’ll work with both the centered and un-centered data to try to understand the difference between the approaches. The first step in both cases is to create a sequence of time data to predict off of. For some reason I couldn’t get the predict() function in brms to cooperate so I wrote my own function to predict values. You enter a time value and the function makes a temperature prediction for every combination of mean and standard deviation derived from the parameters in the posterior distribution. Our goal will be to map this function over the sequence of predictor values we just set up.

#sequence of time data to predict off of.  Could use the same for both models but I created 2 for clarity
time_seq_tbl   <- tibble(pred_time   = seq(from = -15, to = 60, by = 1))
time_seq_tbl_2 <- tibble(pred_time_2 = seq(from = -15, to = 60, by = 1))

#function that takes a time value and makes a prediction using model_1 (un-centered) 
rk_predict <- 
function(time_to_sim){
  rnorm(n = nrow(post_samplesM1_tbl),
        mean = post_samplesM1_tbl$b_Intercept + post_samplesM1_tbl$b_time*time_to_sim,
        sd = post_samplesM1_tbl$sigma
  )
}

#function that takes a time value and makes a prediction using model_2 (centered)
rk_predict2 <- 
function(time_to_sim){
  rnorm(n = nrow(post_samplesM2_tbl),
        mean = post_samplesM2_tbl$b_Intercept + post_samplesM2_tbl$b_time_c*time_to_sim,
        sd = post_samplesM2_tbl$sigma
  )
}

#map the first prediction function over all values in the time sequence
#then calculate the .025 and .975 quantiles in anticipation of 95% prediction intervals
predicts_m1_tbl <- time_seq_tbl %>%
  mutate(preds_for_this_time = map(pred_time, rk_predict)) %>%
  mutate(percentile_2.5  = map_dbl(preds_for_this_time, ~quantile(., .025))) %>%
  mutate(percentile_97.5 = map_dbl(preds_for_this_time, ~quantile(., .975)))
    
#same for the 2nd prediction function
predicts_m2_tbl <- time_seq_tbl_2 %>%
  mutate(preds_for_this_time = map(pred_time_2, rk_predict2)) %>%
  mutate(percentile_2.5  = map_dbl(preds_for_this_time, ~quantile(., .025))) %>%
  mutate(percentile_97.5 = map_dbl(preds_for_this_time, ~quantile(., .975)))   

#visualize what is stored in the nested prediction cells (sanity check)
test_array <- predicts_m2_tbl[1, 2] %>% unnest(cols = c(preds_for_this_time))
test_array %>% 
  round(digits = 2) %>%
  head(5) %>%
  kable(align = rep("c", 1))

preds_for_this_time
68.13
61.67
65.55
62.12
64.05
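To make the logic of the prediction functions concrete without needing the fitted brms models, here is a self-contained sketch. The `fake_post_tbl` draws are hypothetical stand-ins for posterior samples (made-up means and spreads), so the numbers themselves mean nothing. The point is the difference between an interval for the mean and a prediction interval that also includes sigma.

```r
# Hypothetical "posterior" draws, for illustration only
set.seed(123)
n_draws <- 4000
fake_post_tbl <- data.frame(
  b_Intercept = rnorm(n_draws, mean = 57,  sd = 1.0),
  b_time      = rnorm(n_draws, mean = 0.6, sd = 0.05),
  sigma       = abs(rnorm(n_draws, mean = 2.5, sd = 0.3))
)

time_to_sim <- 20

# Credible values of the mean at this time (parameter uncertainty only)
mu_draws <- fake_post_tbl$b_Intercept + fake_post_tbl$b_time * time_to_sim

# Predicted observations: one rnorm() draw per posterior row, so sigma is included
pred_draws <- rnorm(n = n_draws, mean = mu_draws, sd = fake_post_tbl$sigma)

mean_interval <- quantile(mu_draws,   c(0.025, 0.975))
pred_interval <- quantile(pred_draws, c(0.025, 0.975))

rbind(mean_interval, pred_interval)
```

The prediction interval comes out wider than the interval for the mean because each simulated observation adds noise with standard deviation sigma on top of the uncertainty in the line itself.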

And now the grand finale - overlay the 95% prediction intervals on the original data along with the credible values of mean. We see there is no difference between the predictions made from centered data vs. un-centered.

big_enchilada <- 
  tibble(h=0) %>%
  ggplot() +
  geom_point(
    data = ablation_dta_tbl, aes(x = time, y = temp),
    colour = "#481567FF",
    size = 2,
    alpha = 0.6
  ) +
  geom_point(
    data = ablation_dta_tbl, aes(x = time_c, y = temp),
    colour = "#55C667FF",
    size = 2,
    alpha = 0.6
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time),
    data = post_samplesM1_tbl,
    alpha = 0.1, color = "gray50"
  ) +
  geom_abline(
    slope = mean(post_samplesM1_tbl$b_time),
    intercept = mean(post_samplesM1_tbl$b_Intercept),
    color = "blue", size = 1
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time_c),
    data = post_samplesM2_tbl,
    alpha = 0.1, color = "gray50"
  ) +
  geom_abline(
    slope = mean(post_samplesM2_tbl$b_time_c),
    intercept = mean(post_samplesM2_tbl$b_Intercept),
    color = "blue", size = 1
  ) +
  geom_ribbon(
    data = predicts_m1_tbl,
    aes(x = pred_time, ymin = percentile_2.5, ymax = percentile_97.5),
    alpha = 0.25, fill = "pink", color = "black", size = .3
  ) +
  geom_ribbon(
    data = predicts_m2_tbl,
    aes(x = pred_time_2, ymin = percentile_2.5, ymax = percentile_97.5),
    alpha = 0.4, fill = "pink", color = "black", size = .3
  ) +
  labs(
    title = "Regression Line Representing Mean of Slope",
    subtitle = "Centered and Un-Centered Predictor Data",
    x = "Time (s)",
    y = "Temperature (C)"
  ) +
  scale_x_continuous(limits = c(-10, 37), expand = c(0, 0)) +
  scale_y_continuous(limits = c(40, 120), expand = c(0, 0))

What a ride! This seemingly simple problem really stretched my brain. There are still a lot of questions I want to go deeper on - diagnostics for the MCMC, impact of the regularizing priors, differences between this workflow and a frequentist one at various sample sizes and priors, etc. - but that will have to wait for another day.

For those looking for more interpretations of McElreath’s workflows using Tidyverse tools, Solomon Kurz has a brilliant collection here.⁵

Thank you for reading.

Statistical Rethinking, https://github.com/rmcelreath/statrethinking_winter2019 ↩
https://www.sciencedirect.com/science/article/abs/pii/S1547527116001806 ↩
There’s a funky bug in ggExtra which makes you break this code into 2 chunks when working in Markdown, https://cran.r-project.org/web/packages/ggExtra/vignettes/ggExtra.html ↩
3D Vowel Plots with Rayshader, http://joeystanley.com/blog/3d-vowel-plots-with-rayshader ↩
Statistical Rethinking with brms, ggplot2, and the tidyverse, https://bookdown.org/ajkurz/Statistical_Rethinking_recoded/ ↩

Confounders and Colliders - Modeling Spurious Correlations in R

Tue, 29 Oct 2019 00:00:00 +0000

Like many engineers, my first models were based on Designed Experiments in the tradition of Cox and Montgomery. I hadn’t seen anything like a causal diagram until I picked up The Book of Why, which explores all sorts of experimental relationships and structures I never imagined.¹ Colliders, confounders, causal diagrams, M-bias - these concepts are all relatively new to me and I want to understand them better. In this post I will attempt to create some simple structural causal models (SCMs) for myself using the dagitty and ggdag packages and then show the potential effects of confounders and colliders on a simulated experiment adapted from here.²

It turns out that it is not as simple as identifying lurking variables and holding them constant while we conduct the experiment of interest (as I was always taught).

First, load the libraries.

# Load libraries
library(tidyverse)
library(kableExtra)
library(tidymodels)
library(viridisLite)
library(GGally)
library(dagitty)
library(ggdag)
library(visreg)
library(styler)
library(cowplot)

A structural causal model (SCM) is a type of directed acyclic graph (DAG) that maps causal assumptions onto a simple model of experimental variables. In the figure below, each node (blue dot) represents a variable. The edges (yellow lines) between nodes represent assumed causal effects.

The ggdag package provides the dagify() function to create the relationships in the DAG. These are stored in a DAG object which is provided to ggplot and can then be customized and adjusted. Most of the code below the DAG object is just formatting the figure.

# create DAG object
g <- dagify(
  A ~ J,
  X ~ J,
  X ~ A
)

# tidy the dag object and supply to ggplot
set.seed(100)
g %>%
  tidy_dagitty() %>%
  mutate(x = c(0, 1, 1, 2)) %>%
  mutate(y = c(0, 2, 2, 0)) %>%
  mutate(xend = c(2, 0, 2, NA)) %>%
  mutate(yend = c(0, 0, 0, NA)) %>%
  dag_label(labels = c(
    "A" = "Independent\n Variable",
    "X" = "Dependent\n Variable",
    "J" = "The\n Confounder"
  )) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_edges(
    edge_colour = "#b8de29ff",
    edge_width = .8
  ) +
  geom_dag_node(
    color = "#2c3e50",
    alpha = 0.8
  ) +
  geom_dag_text(color = "white") +
  geom_dag_label_repel(aes(label = label),
    col = "white",
    label.size = .4,
    fill = "#20a486ff",
    alpha = 0.8,
    show.legend = FALSE,
    nudge_x = .7,
    nudge_y = .3
  ) +
  labs(
    title = " Directed Acyclic Graph",
    subtitle = " Two Variables of Interest with a Confounder"
  ) +
  xlim(c(-1.5, 3.5)) +
  ylim(c(-.33, 2.2)) +
  geom_rect(
    xmin = -.5,
    xmax = 3.25,
    ymin = -.25,
    ymax = .65,
    alpha = .04,
    fill = "white"
  ) +
  theme_void() +
  theme(
    plot.background = element_rect(fill = "#222222"),
    plot.title = element_text(color = "white"),
    plot.subtitle = element_text(color = "white")
  )

The relationship of interest is captured in the lower rectangle: we want to change the value of independent variable A and record the effect on dependent variable X (in epidemiology these might be called “treatment” and “outcome”). There also happens to be a confounding variable J that has a causal effect on both A and X.

We can set up a simulated experiment that follows the structure of the SCM above:

Each variable will have n=1000 values. J is generated by drawing randomly from a standard normal distribution. We want J to be a cause of A so we use J in the creation of A along with a random error term to represent noise. The model above shows a causal link from A to X but we don’t actually know if this exists - that’s the point of the experiment. It may or may not be there (from the point of view of the experimenter/engineer). For the purposes of demonstration we will structure the simulation such that there is no causal relationship between A and X (A will not be used in the creation of the variable X). Again we need J as a cause of X so we use J in the creation of the dependent_var_X object along with a random noise component.

The simulation is now set up to model an experiment where the experimenter/engineer wants to understand the effect of A on X but the true effect is zero. Meanwhile, there is a confounding variable J that is a parent to both A and X.

# set seed for repeatability
set.seed(805)

# n = 1000 points for the simulation
n <- 1000

# create variables
# J is random draws from standard normal (mean = 0, stdev = 1)
confounding_var_J <- rnorm(n)

# J is used in creation of A since it is a cause of A (confounder)
independent_var_A <- 1.1 * confounding_var_J + rnorm(n)

# J is used in creation of X since it is a cause of X (confounder)
dependent_var_X <- 1.9 * confounding_var_J + rnorm(n)

In reality, the experimenter may or may not be aware of the parent confounder J. We will create two different regression models below. In the first, denoted crude_model, we will assume the experimenter was unaware of the confounder. The model is then created with A as the only predictor variable of X.

In the second, denoted confounder_model, we will assume the experimenter was aware of the confounder and chose to include it in their model. This version is created with A and J as predictors of X.

# create crude regression model with A predicting X.  J is omitted
crude_model <- lm(dependent_var_X ~ independent_var_A)

# create confounder model with A and J predicting X
confounder_model <- lm(dependent_var_X ~ independent_var_A + confounding_var_J)

# tidy the crude model and examine it
crude_model_tbl <- summary(crude_model) %>% tidy()
crude_model_kbl <- summary(crude_model) %>%
  tidy() %>%
  kable(align = rep("c", 5), digits = 3)
crude_model_kbl

term	estimate	std.error	statistic	p.value
(Intercept)	-0.007	0.051	-0.135	0.893
independent_var_A	0.967	0.034	28.415	0.000

# Tidy the confounder model and examine it
confounder_model_tbl <- summary(confounder_model) %>% tidy()
confounder_model_kbl <- summary(confounder_model) %>%
  tidy() %>%
  kable(align = rep("c", 5), digits = 3)
confounder_model_kbl

term	estimate	std.error	statistic	p.value
(Intercept)	-0.005	0.032	-0.151	0.880
independent_var_A	0.005	0.033	0.153	0.878
confounding_var_J	1.860	0.048	38.460	0.000

# add column for labels
crude_model_tbl <- crude_model_tbl %>% mutate(model = "crude_model: no confounder")
confounder_model_tbl <- confounder_model_tbl %>% mutate(model = "confounder_model: with confounder")

# combine into a single kable
confounder_model_summary_tbl <- bind_rows(crude_model_tbl, confounder_model_tbl)
confounder_model_summary_tbl <- confounder_model_summary_tbl %>% select(model, everything())
confounder_model_summary_tbl %>% kable(align = rep("c", 6), digits = 3)

model	term	estimate	std.error	statistic	p.value
crude_model: no confounder	(Intercept)	-0.007	0.051	-0.135	0.893
crude_model: no confounder	independent_var_A	0.967	0.034	28.415	0.000
confounder_model: with confounder	(Intercept)	-0.005	0.032	-0.151	0.880
confounder_model: with confounder	independent_var_A	0.005	0.033	0.153	0.878
confounder_model: with confounder	confounding_var_J	1.860	0.048	38.460	0.000

The combined summary table above provides the effect sizes and the difference between the two models is striking. Conditional plots are a way to visualize regression models. The visreg package creates conditional plots by supplying a model object and a predictor variable to the visreg() function. The x-axis shows the value of the predictor variable and the y-axis shows change in the response variable. All other variables are held constant at their medians.

# visualize conditional plot of A vs X, crude model
v1 <- visreg(crude_model,
  "independent_var_A",
  gg = TRUE,
  line = list(col = "#E66101")
) +
  labs(
    title = "Relationship Between A and X",
    subtitle = "Neglecting Confounder Variable J"
  ) +
  ylab("Change in Response X") +
  ylim(-6, 6) +
  theme(plot.subtitle = element_text(face = "bold", color = "#404788FF"))

# visualize conditional plot of A vs X, confounder model
v2 <- visreg(confounder_model,
  "independent_var_A",
  gg = TRUE,
  line = list(col = "#E66101")
) +
  labs(
    title = "Relationship Between A and X",
    subtitle = "Considering Confounder Variable J"
  ) +
  ylab("Change in Response X") +
  ylim(-6, 6) +
  theme(plot.subtitle = element_text(face = "bold", color = "#20a486ff"))

plot_grid(v1, v2)

We know from creating the simulated data that A has no real effect on the outcome X. X was created using only J and some noise. But the left plot shows a large, positive slope and significant coefficient! How can this be? This faulty estimate of the true effect is biased; more specifically we are seeing “confounder bias” or “omitted variable bias”. Adding J to the regression model has the effect of conditioning on J and revealing the true relationship between A and X: no effect of A on X.
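The size of the spurious slope is not an accident. For this simulation design (A = 1.1 * J + noise, X = 1.9 * J + noise, with J and both noise terms standard normal), the slope of X on A alone converges to cov(A, X) / var(A). The arithmetic below is a sketch of that standard omitted-variable-bias calculation. It is my own back-of-the-envelope check rather than part of the original analysis, and the variable names are made up for the example.

```r
# Coefficients used to build A and X from J in the simulation
b_JA <- 1.1
b_JX <- 1.9

# With var(J) = 1 and unit-variance noise:
cov_AX <- b_JA * b_JX   # A and X share only J
var_A  <- b_JA^2 + 1    # variance from J plus variance from noise

expected_crude_slope <- cov_AX / var_A
expected_crude_slope    # about 0.946
```

That large-sample value is close to the 0.967 estimate from crude_model (within about one standard error), even though the true causal effect of A on X is zero.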

Confounding is pretty easy to understand. “Correlation does not imply causation” has been drilled into my brain effectively. Still, confounders that aren’t anticipated can derail studies and confuse observers. For example, the first generation of drug eluting stents was released in the early 2000’s. They showed great promise but their long-term risk profile was not well understood. Observational studies indicated an improved mortality rate for drug-eluting stents relative to their bare-metal counterparts. However, the performance benefit could not be replicated in randomized controlled trials.³

The disconnect was eventually linked (at least in part) to a confounding factor. Outside of an RCT, clinicians took into account the health of the patient going into the procedure. Specifically, if the patient was scheduled for a pending surgery or had a history of clotting then the clinician would hedge towards a bare-metal stent (since early gen DES tended to have thrombotic events at a greater frequency than BMS). Over the long term, these sicker patients were assigned BMS disproportionately, biasing the effect of stent type on long-term mortality via patient health as a confounder.

So we always want to include every variable we know about in our regression models, right? Wrong. Here is a case that looks similar to the confounder scenario but is slightly different. The question of interest is the same: evaluate the effect of predictor B on the outcome Y. Again, there is a 3rd variable at play. But this time, the third variable is caused by both B and Y rather than being itself the common cause.

# assign DAG object
h <- dagify(
  K ~ B + Y,
  Y ~ B
)

# tidy the dag object and supply to ggplot
set.seed(100)
h %>%
  tidy_dagitty() %>%
  mutate(x = c(0, 0, 2, 1)) %>%
  mutate(y = c(0, 0, 0, 2)) %>%
  mutate(xend = c(1, 2, 1, NA)) %>%
  mutate(yend = c(2, 0, 2, NA)) %>%
  dag_label(labels = c(
    "B" = "Independent\n Variable",
    "Y" = "Dependent\n Variable",
    "K" = "The\n Collider"
  )) %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_edges(
    edge_colour = "#b8de29ff",
    edge_width = .8
  ) +
  geom_dag_node(
    color = "#2c3e50",
    alpha = 0.8
  ) +
  geom_dag_text(color = "white") +
  geom_dag_label_repel(aes(label = label),
    col = "white",
    label.size = .4,
    fill = "#20a486ff",
    alpha = 0.8,
    show.legend = FALSE,
    nudge_x = .7,
    nudge_y = .3
  ) +
  labs(
    title = " Directed Acyclic Graph",
    subtitle = " Two Variables of Interest with a Collider"
  ) +
  xlim(c(-1.5, 3.5)) +
  ylim(c(-.33, 2.2)) +
  geom_rect(
    xmin = -.5,
    xmax = 3.25,
    ymin = -.25,
    ymax = .65,
    alpha = .04,
    fill = "white"
  ) +
  theme_void() +
  theme(
    plot.background = element_rect(fill = "#222222"),
    plot.title = element_text(color = "white"),
    plot.subtitle = element_text(color = "white")
  )

A variable like this is called a collider because the causal arrows from B and Y collide at K. K is created in the simulation below using both B and Y plus random noise. This time, the outcome Y is created using B as an input, thereby assigning a causal relation with an effect size of 0.3.

# create variables
# B is random draws from standard normal (mean = 0, stdev = 1)
independent_var_B <- rnorm(n)

# Y is created with B and noise. Effect size of B on Y is 0.3
dependent_var_Y <- .3 * independent_var_B + rnorm(n)

# K (collider) is created with B and Y + noise
collider_var_K <- 1.2 * independent_var_B + 0.9 * dependent_var_Y + rnorm(n)

Let’s assume that the experimenter knows about possible collider variable K. What should they do with it when they go to create their regression model? Let’s create two models again to compare results. Following the nomenclature from before: crude_model_b uses only B to predict Y and collider_model uses both B and K to predict Y.

# create crude regression model with B predicting Y.  K is omitted
crude_model_b <- lm(dependent_var_Y ~ independent_var_B)

# create collider model with B and K predicting Y
collider_model <- lm(dependent_var_Y ~ independent_var_B + collider_var_K)

# tidy the crude model and examine it
crude_model_b_kbl <- summary(crude_model_b) %>%
  tidy() %>%
  kable(align = rep("c", 5), digits = 3)
crude_model_b_tbl <- summary(crude_model_b) %>% tidy()
crude_model_b_kbl

term	estimate	std.error	statistic	p.value
(Intercept)	-0.021	0.032	-0.666	0.506
independent_var_B	0.247	0.032	7.820	0.000

# tidy the collider model and examine it
collider_model_kbl <- summary(collider_model) %>%
  tidy() %>%
  kable(align = rep("c", 5), digits = 3)
collider_model_tbl <- summary(collider_model) %>% tidy()
collider_model_kbl

term	estimate	std.error	statistic	p.value
(Intercept)	-0.011	0.023	-0.453	0.651
independent_var_B	-0.481	0.034	-14.250	0.000
collider_var_K	0.519	0.018	29.510	0.000

# add label column
crude_model_b_tbl <- crude_model_b_tbl %>% mutate(model = "crude_model_b: no collider")
collider_model_tbl <- collider_model_tbl %>% mutate(model = "collider_model: with collider")

# combine and examine
collider_model_summary_tbl <- bind_rows(crude_model_b_tbl, collider_model_tbl)
collider_model_summary_tbl <- collider_model_summary_tbl %>% select(model, everything())
collider_model_summary_tbl %>% kable(align = rep("c", 6), digits = 3)

model	term	estimate	std.error	statistic	p.value
crude_model_b: no collider	(Intercept)	-0.021	0.032	-0.666	0.506
crude_model_b: no collider	independent_var_B	0.247	0.032	7.820	0.000
collider_model: with collider	(Intercept)	-0.011	0.023	-0.453	0.651
collider_model: with collider	independent_var_B	-0.481	0.034	-14.250	0.000
collider_model: with collider	collider_var_K	0.519	0.018	29.510	0.000

This time, omitting the collider variable is the proper way to recover the true effect of B on Y. Let’s verify with conditional plots as before. Again, we know the true slope should be around 0.3.

# create conditional plot with crude_model_b and B
v3 <- visreg(crude_model_b,
  "independent_var_B",
  gg = TRUE,
  line = list(col = "#E66101")
) +
  labs(
    title = "Relationship Between B and Y",
    subtitle = "Neglecting Collider Variable K"
  ) +
  ylab("Change in Response Y") +
  ylim(-6, 6) +
  theme(plot.subtitle = element_text(face = "bold", color = "#f68f46b2"))

# create conditional plot with collider_model and B
v4 <- visreg(collider_model,
  "independent_var_B",
  gg = TRUE,
  line = list(col = "#E66101")
) +
  labs(
    title = "Relationship Between B and Y",
    subtitle = "Considering Collider Variable K"
  ) +
  ylab("Change in Response Y") +
  ylim(-6, 6) +
  theme(plot.subtitle = element_text(face = "bold", color = "#403891b2"))

plot_grid(v3, v4)

Incredibly, the conclusion one draws about the relationship between B and Y completely reverses depending upon which model is used. The true effect is positive (we only know this for sure because we created the data) but by including the collider variable in the model we observe it as negative. We should not control for a collider variable!
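The sign flip can also be worked out from the simulation design itself. The sketch below is a hedged side calculation, not part of the original analysis: it derives the large-sample coefficients of the collider model from the structural coefficients (0.3, 1.2, 0.9, with all noise terms standard normal) and then checks them against a much bigger simulated sample. The `sim_*` and `expected_*` names are invented for this example.

```r
# Structural coefficients from the simulation above
b_BY <- 0.3   # B -> Y
b_BK <- 1.2   # B -> K
b_YK <- 0.9   # Y -> K

# After removing B, Y leaves its noise u and K leaves (b_YK * u + e),
# where u and e both have variance 1
expected_coef_K <- b_YK / (b_YK^2 + 1)

# Total dependence of K on B: direct path plus the path through Y
total_B_to_K <- b_BK + b_YK * b_BY
expected_coef_B <- b_BY - expected_coef_K * total_B_to_K

c(B = expected_coef_B, K = expected_coef_K)   # about -0.431 and 0.497

# Large-sample check with the same structure
set.seed(2019)
n_big <- 1e5
sim_B <- rnorm(n_big)
sim_Y <- b_BY * sim_B + rnorm(n_big)
sim_K <- b_BK * sim_B + b_YK * sim_Y + rnorm(n_big)
sim_coefs <- coef(lm(sim_Y ~ sim_B + sim_K))
sim_coefs
```

The estimates from collider_model (-0.481 and 0.519) are in the same neighborhood as these large-sample values, so the negative coefficient on B is a systematic consequence of conditioning on K and not a fluke of one random sample.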

Controlling for a confounder reduces bias but controlling for a collider increases it - a simple summary that I will try to remember as I design future experiments or attempt to derive meaning from observational studies. These are the simple insights that make learning this stuff really fun (for me at least)!
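The DAG can also be asked directly which variables to control for. As a sketch (assuming the dagitty package loaded earlier, and re-declaring both graphs with its text syntax rather than reusing the ggdag objects above), `adjustmentSets()` returns the minimal sets of covariates that need to be adjusted for to estimate the effect of an exposure on an outcome:

```r
library(dagitty)

# Same structures as the two DAGs above, written in dagitty syntax
confounder_dag <- dagitty("dag { J -> A ; J -> X ; A -> X }")
collider_dag   <- dagitty("dag { B -> K ; Y -> K ; B -> Y }")

# Which variables should be adjusted for in each case?
adj_confounder <- adjustmentSets(confounder_dag, exposure = "A", outcome = "X")
adj_collider   <- adjustmentSets(collider_dag,   exposure = "B", outcome = "Y")

adj_confounder   # J should appear here
adj_collider     # K should not appear here
```

This lines up with the regression results: adjust for J in the confounder case, and leave K out of the model in the collider case.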

Thanks for reading.

Modeling Particulate Counts as a Poisson Process in R

Wed, 18 Sep 2019 00:00:00 +0000

I’ve never really worked much with Poisson data and wanted to get my hands dirty. I thought that for this project I might combine a Poisson data set with the simple Bayesian methods that I’ve explored before since it turns out the Poisson rate parameter lambda also has a nice conjugate prior (more on that later). Poisson distributed data are counts per unit time or space - they are events that arrive at random intervals but that have a characteristic rate parameter which also equals the variance. This rate parameter is usually denoted as lambda. No-hitters in baseball are often modeled as Poisson data, as are certain types of processing defects in electronics and medical devices. A particularly relevant application is in particulate testing for implantable devices. Particulate shed is an unassuming but potentially costly and dangerous phenomenon.

Particulate can be shed from the surface of medical devices even when the manufacturing environment is diligently controlled. The source of the particulate can vary: light particulate is attracted to the surface of sheaths and luers due to static charge; hydrophilic coatings may delaminate from the surface during delivery; therapeutic coating on the implant’s surface may degrade over time in the presence of blood.

The clinical harms that the patient may face due to particulate shed include neurological events if the particulate migrates cranially or embolism if it migrates caudally. The occurrence and severity of symptoms are understood to be functions of both size and quantity of particulate. In recent years, FDA and friends have been more stringent in requiring manufacturers to quantify and understand the nature of the particulate burden associated with their devices. In the analysis below, I’m going to simulate an experiment in which particulate data are collected for 20 devices.

Before I get there, I want to remind myself of what Poisson data look like for different rate parameters. I set up a function to compute the Poisson probability mass function (pmf) over a range of event counts for a given rate parameter lambda. The function then converts the information to a tibble for use with ggplot.

#Load libraries
library(tidyverse)
library(knitr)
library(kableExtra)
library(tolerance)
library(ggrepel)

#Sequence from 0 to 24 by 1 (x-axis of plot)
number_of_events <- seq(0, 24, by = 1)

#Function to make a Poisson probability vector over number_of_events for a given lambda, returned as a tibble
pois_fcn <- function(lambda){
            pois_vector <- dpois(x = number_of_events, lambda = lambda, log = FALSE)
            #Return the tibble as the (visible) value of the function
            tibble("num_of_events" = number_of_events,
                   "prob"          = pois_vector,
                   "lambda"        = lambda)
            }
#Objects to hold tibbles for different Poisson rates
pois_dist_1_tbl <-  pois_fcn(lambda = 1)
pois_dist_5_tbl <-  pois_fcn(lambda = 5)
pois_dist_15_tbl <- pois_fcn(lambda = 15)

#Combine in one df
pois_total_tbl <- bind_rows(pois_dist_1_tbl,
                            pois_dist_5_tbl,
                            pois_dist_15_tbl)

#Convert lambda from numeric to factor so ggplot maps aesthetics as levels, not gradient
pois_total_int_tbl <- pois_total_tbl %>% 
  mutate(lambda = as_factor(lambda))

#Make and store ggplot obj
h1 <- pois_total_int_tbl %>% ggplot(aes(x = num_of_events, y = prob)) +
  geom_col(aes(y = prob, fill = lambda), position = "dodge", color = "black") +
  scale_fill_manual(values = c("#2C728EFF", "#75D054FF", "#FDE725FF")) +
  labs(x        = "Number of Events", 
       y        = "Probability",
       title    = "Probability Mass Function",
       subtitle = "Poisson Distributions with Different Rates (Lambda)")

h1

Cool - so when the rate is low it looks sort of like the discrete version of an exponential curve. It’s still not symmetric at lambda = 5 but by lambda = 15 it looks a lot like a discrete version of a normal distribution.
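One way to put numbers on how the shape changes with lambda is to compare each pmf with a normal density that has the same mean and variance. This is a side check using a standard approximation, not something used in the rest of the analysis, and the object names are made up for the example. It also confirms the earlier statement that the mean and variance of a Poisson distribution both equal lambda.

```r
demo_lambdas <- c(1, 5, 15)
support <- 0:200   # wide enough that the omitted tail is negligible for these rates

shape_summary <- sapply(demo_lambdas, function(l) {
  pmf      <- dpois(support, lambda = l)
  pmf_mean <- sum(support * pmf)
  pmf_var  <- sum((support - pmf_mean)^2 * pmf)
  # Largest gap between the pmf and a normal density with matching mean and variance
  max_gap  <- max(abs(pmf - dnorm(support, mean = l, sd = sqrt(l))))
  c(lambda = l, mean = pmf_mean, variance = pmf_var, max_gap_vs_normal = max_gap)
})

round(t(shape_summary), 4)
```

The gap to the matching normal curve shrinks as lambda grows, which is the same story the bar chart tells visually.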

The data I simulate below are intended to represent the fluid collected during bench-top simulated use testing in a clean “flow loop” or vascular deployment model. The fluid would generally be passed through light obscuration sensors to quantify the size and counts of particulate relative to a control. Particulate requirements for many endovascular devices are borrowed from USP <788>. According to that standard, no more than 60 particles greater than 25 micron effective diameter are acceptable. I want to know the probability of passing the test but don’t know the rate parameter lambda. The end goal is to understand what the most credible values for lambda are based on the bench-top data from multiple devices. First I’ll try to quantify the uncertainty in the rate parameter lambda. Each lambda can then be used to estimate a reliability. The large number of simulated lambdas will make a large set of simulated reliabilities. From there I should be able to extract any information needed regarding the uncertainty of the device reliability as it relates to particulate shed. That’s the plan!

Note: I’m trying out knitr::kable() which generates html tables nicely. I’m not too good at it yet so bear with me please.

Take a look at the data:

#Peek at some data
particulate_data %>% head(5) %>%
  kable() %>% kable_styling("full_width" = F)

x
46
58
38
50
62

I’m using a Bayesian approach again - partially because I need practice and partially because the Poisson parameter lambda has a convenient conjugate prior: the gamma distribution. This means that some simple math can get me from the prior to the posterior. I love simple math. Using the gamma distribution to describe the prior belief in lambda, the posterior distribution for lambda is:

\[\text{prior: } \lambda \sim \mathrm{Gamma}(a, b)\]

As a reminder to myself, this is read as “lambda is distributed as a Gamma distribution with parameters a and b”.

\[\text{posterior: } \lambda \sim \mathrm{Gamma}\left(a + \sum_{i=1}^{n} x_i,\; b + n\right)\]

It is reasonable to use a relatively uninformative prior for lambda since I don’t have much preliminary knowledge about particulate data for my device design. Setting the shape a to 1 and the rate b to 0.1 allocates the credibility across a wide range of lambdas to start. To go from prior to posterior we need only sum up all the particulate counts in the data set and add the total to the shape a, then add the total number of devices tested (sample size n) to the rate b.
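If the conjugate shortcut feels too good to be true, it can be checked by brute force. The sketch below is a sanity check I am adding under stated assumptions, not part of the original workflow: it takes only the five counts printed above as a small demo data set (`demo_counts`), multiplies prior by likelihood on a grid of lambda values, and compares the normalized result with the closed-form Gamma posterior. The names `a_prior`, `b_prior`, and `lambda_grid` are invented for the example.

```r
# Illustrative counts: the five values printed above, used only as a small demo
demo_counts <- c(46, 58, 38, 50, 62)
a_prior <- 1
b_prior <- 0.1

# Grid of candidate lambdas
grid_step   <- 0.01
lambda_grid <- seq(grid_step, 120, by = grid_step)

# Brute force Bayes: log prior + log likelihood at every grid point
log_prior <- dgamma(lambda_grid, shape = a_prior, rate = b_prior, log = TRUE)
log_lik   <- sapply(lambda_grid, function(l) sum(dpois(demo_counts, lambda = l, log = TRUE)))
log_post  <- log_prior + log_lik

# Normalize so the grid posterior integrates to 1
unnormalized   <- exp(log_post - max(log_post))
grid_posterior <- unnormalized / sum(unnormalized * grid_step)

# Conjugate shortcut: add the total count to the shape and the sample size to the rate
exact_posterior <- dgamma(lambda_grid,
                          shape = a_prior + sum(demo_counts),
                          rate  = b_prior + length(demo_counts))

# The two curves should agree up to numerical error
max_abs_diff <- max(abs(grid_posterior - exact_posterior))
max_abs_diff
```

The grid version is far slower and only approximate, which is exactly why the conjugate update is so convenient here.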

#Set parameters and constants
a <- 1
b <- 0.1
n <- length(particulate_data)
total_particulate_count <- sum(particulate_data)

I like to peek at the prior and posterior distributions of lambda since they are easy to visualize via the relationships above. We are back into continuous distribution mode because the rate parameter lambda can be any positive value even though the particulate counts that come from the process are discrete.

#Set sequence of x values; generate prior using a,b; generate posterior 
x_values  <- seq(0, 60, length.out = 1000)
prior     <- dgamma(x_values, shape = a, rate = b)
posterior <- dgamma(x_values, shape = a + total_particulate_count, rate = b + n)

#Prior in tibble format
prior_tbl <- tibble(
  "x_values" = x_values,
  "prob"     = prior,
  "config"   = "prior"
)

#Posterior in tibble format
posterior_tbl <- tibble(
  "x_values" = x_values,
  "prob"     = posterior,
  "config"   = "posterior"
)

#Combine prior and posterior in 1 tibble
prior_post_tbl <- bind_rows(prior_tbl, posterior_tbl)

#Visualize 
prior_post_tbl %>% ggplot(aes(x = x_values, y = prob)) +
  geom_line(aes(color = config), size = 1.5, alpha = 0.8) +
  scale_y_continuous(name="Density", limits=c(0, 0.3)) +
  scale_color_manual(values = c("#75D054FF", "#2C728EFF")) +
  labs(
    title    = "Rate Parameter Lambda For Particle Counts",
    subtitle = "Modeled as Poisson Process",
    x        = "Lambda",
    color    = ""
  )

Having access to the posterior distribution of lambda enables simulation of possible values of lambda by drawing random values from the distribution. Values of lambda are drawn in proportion to the density shown on the y-axis (the probability of any single exact value is zero; probabilities come from integrating the density over a span of x). Each of the values randomly drawn from the posterior can be used to simulate a distribution of particulate counts for comparison with the spec. The workflow is essentially a series of questions:

What might the values of the rate parameter lambda be based on the data? -> Combine data with conjugate prior to generate the posterior distribution of credible lambdas. (Done and shown above)
If a random value of lambda is pulled from the posterior distribution , what would we expect regarding the uncertainty of the original experiment? -> Draw random values lambda and then evaluate what percentage of the cdf lies above the spec (could also run simulations for each random lambda and then count the number of simulated runs above the spec but this is time consuming (10,000 lambdas x 10,000 simulations to build out the particle count distribution for each one…)
Combine each of these tail areas into a new distribution. This new distribution represents the uncertainty in the reliability estimate based on uncertainty in lambda. How to estimate the reliability of the real device while taking uncertainty into account? -> Calculate the lower bound of the 95% credible interval by finding the .05 quantile from the set of simulated reliability values.

Let’s do this!

#Sample and store 10000 random lambda values from posterior 
n_posterior_samples <- 10000
sampled_posterior_lambda <- rgamma(n_posterior_samples, shape = a + total_particulate_count, rate = b + n)

#Initialize empty vector to hold reliability data
reliability_vector <- rep(NA, n_posterior_samples)

#For each lambda value, calc cumulative probability of less than or equal to q particles shed from 1 sample?
for(i in 1:n_posterior_samples){
  reliability_vector[i] <- ppois(q = 60, lambda = sampled_posterior_lambda[i])
}

#Visualize
reliability_vector %>% head() %>% 
  kable(align=rep('c')) %>% kable_styling("full_width" = F)

x
0.9147028
0.9506510
0.9431756
0.9700806
0.9546490
0.9540933
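
As a side note, the for loop is not strictly needed because ppois() is vectorized over lambda. A small sketch, assuming hypothetical posterior parameters (shape = 984 and rate = 20.1 are placeholders chosen for illustration), shows that the loop and the vectorized call give the same answer:

```r
# Hypothetical posterior parameters (assumption, for illustration only)
set.seed(1)
sampled_lambda <- rgamma(10000, shape = 984, rate = 20.1)

# Loop version: one ppois() call per sampled lambda
loop_result <- rep(NA_real_, length(sampled_lambda))
for (i in seq_along(sampled_lambda)) {
  loop_result[i] <- ppois(q = 60, lambda = sampled_lambda[i])
}

# Vectorized version: a single call over the whole vector of lambdas
vectorized_result <- ppois(q = 60, lambda = sampled_lambda)

print(all.equal(loop_result, vectorized_result))
```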

Checking what the simulated reliabilities are:

#Convert reliability vector to tibble
reliability_tbl <- reliability_vector %>% 
  as_tibble() %>%
  mutate("reliability" = value) %>%
  select(reliability)

#Visualize with histogram
reliability_tbl %>% ggplot(aes(reliability)) +
  geom_histogram(fill = "#2c3e50", color = "white", binwidth = .01, alpha = 0.8) +
    labs(
        x        = "Reliability",
        title    = "Estimated Reliability Range for Particulate Shed Performance",
        subtitle = "Requirement: 60 or less of 25 um or larger"
    )

The lower bound of the 1-sided 95% credible interval for the reliability (conformance rate) is the .05 quantile of this distribution, since the spec is 1-sided:

#Calculate .05 quantile
reliability_tbl$reliability %>% 
  quantile(probs = .05)     %>% 
  signif(digits = 3)

##    5% 
## 0.893

Finally, the answer! The lowest reliability expected is 89.3% based on a 1-sided 95% credible interval. This would likely not meet the product requirements (assigned based on risk of the harms that come from this particular failure mode) and we would likely need to improve our design or processes to reduce particulate shed from the product.
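
The simulated lower bound can be cross-checked without any random draws. For a fixed spec, the reliability ppois(60, lambda) only decreases as lambda increases, so the .05 quantile of the reliability is just the reliability evaluated at the .95 quantile of lambda. The sketch below assumes a hypothetical stand-in data set (20 simulated counts), so the numbers will not match the 89.3% above; the point is that the two routes agree with each other.

```r
# Hypothetical stand-in for particulate_data (assumption: 20 counts, lambda = 50)
set.seed(2020)
particulate_data <- rpois(n = 20, lambda = 50)

a <- 1
b <- 0.1
post_shape <- a + sum(particulate_data)
post_rate  <- b + length(particulate_data)
spec_limit <- 60

# Exact route: reliability evaluated at the 95th percentile of the posterior of lambda
lambda_upper <- qgamma(0.95, shape = post_shape, rate = post_rate)
exact_lower  <- ppois(q = spec_limit, lambda = lambda_upper)

# Simulation route: same steps as in the post
sim_lambda <- rgamma(10000, shape = post_shape, rate = post_rate)
sim_lower  <- quantile(ppois(q = spec_limit, lambda = sim_lambda), probs = 0.05)

print(exact_lower)
print(sim_lower)
```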

This concludes the Bayesian inference of reliability in Poisson distributed particle counts. But hey, since we’re here… one of the things I love about R is the ability to easily check sensitivities, assumptions, and alternatives. What would this analysis look like using the conventional frequentist approach? I admit I’m not sure exactly but I assume we would extend the standard tolerance interval approach that is common in Class III medical device submissions. Tolerance intervals are easy to pull from tables or software but actually pretty tricky (for me at least) to derive. They involve uncertainty in both the mean and the variance. For simplicity (and because I’m not confident enough to derive the formula), I’ll use the tolerance package in R to calculate tolerance intervals for Poisson data. It turns out that there are 8 methods and I’ll use them all because I’m feeling a little wild and I want to see if they give different results.

## 95%/95% 1-sided Poisson tolerance limits for future
## occurrences in a period of length 1 part. All eight methods
## are presented for comparison.
tl_tab <- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = "TAB") %>% mutate(method = "TAB") %>% as_tibble() 

tl_ls <- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = "LS") %>% mutate(method = "LS") %>% as_tibble() 

tl_sc <- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = "SC") %>% mutate(method = "SC") %>% as_tibble() 

tl_cc <- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = "CC") %>% mutate(method = "CC") %>% as_tibble()

tl_vs <- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = "VS") %>% mutate(method = "VS") %>% as_tibble() 

tl_rvs <- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = "RVS") %>% mutate(method = "RVS") %>% as_tibble() 

tl_ft <- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = "FT") %>% mutate(method = "FT") %>% as_tibble() 

tl_csc <- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = "CSC") %>% mutate(method = "CSC") %>% as_tibble() 

tl_all_tbl <-  bind_rows(tl_tab,
                         tl_ls,
                         tl_sc,
                         tl_cc,
                         tl_vs,
                         tl_rvs,
                         tl_ft,
                         tl_csc)

tl_all_tbl %>% kable(align=rep('c', 5))

alpha	P	lambda.hat	1-sided.lower	1-sided.upper	method
0.05	0.95	49.15	36	64	TAB
0.05	0.95	49.15	36	64	LS
0.05	0.95	49.15	36	64	SC
0.05	0.95	49.15	36	64	CC
0.05	0.95	49.15	36	64	VS
0.05	0.95	49.15	36	64	RVS
0.05	0.95	49.15	36	64	FT
0.05	0.95	49.15	36	64	CSC

For this data set it can be seen that all 8 methods produce the same 1-sided 95/95 upper tolerance limit of 64 counts per device. The requirement was a maximum of 60. Since the edge of our tolerance interval lies above the 1-sided spec, we would fail this test. This conclusion is consistent with the Bayesian method, which estimates the reliability to be below the 95% requirement.
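
To get a feel for where a number like 64 comes from, here is a sketch of one standard construction in base R: take an exact (chi-square based) upper confidence bound on lambda, then take the P quantile of a Poisson with that rate. I have not verified that this matches any of the eight poistol.int methods line for line, so treat it as an illustration and not as the package's implementation. The total count of 983 is my assumption, back-calculated from lambda.hat = 49.15 and n = 20; upper_tol_limit is a hypothetical helper name.

```r
# Hypothetical helper: 1-sided upper tolerance limit for Poisson counts
# x = total count observed, n = number of devices, m = future period length
upper_tol_limit <- function(x, n, m = 1, alpha = 0.05, P = 0.95) {
  # Exact upper (1 - alpha) confidence bound on the Poisson rate
  lambda_upper <- qchisq(1 - alpha, df = 2 * x + 2) / (2 * n)
  # Count that covers proportion P of a Poisson with that rate
  qpois(P, lambda = m * lambda_upper)
}

total_count <- 983  # assumption: 49.15 * 20
n_devices   <- 20

limit_95_95 <- upper_tol_limit(x = total_count, n = n_devices)
print(limit_95_95)
```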

But what sort of reliability claim could our data support? For the Bayesian approach we concluded that the answer was 89.3% (lower bound of 1-sided 95% credible interval). For the frequentist method, we don’t have a posterior distribution to examine. We could try using the tolerance interval function above with various values of P to impute the value of P which coincides with the spec limit of 60.

#Sequence of reliability values for which to use as P 
reliability_freq_tbl <- tibble(
  "proportion_covered_P" = seq(.40, .99, .01)
)

#Function that is just like poistol.int but extracts and reports only the upper limit
#of the tolerance interval
tol_interval_fcn <- function(data_vec = sum_part_data, n=20, m=1, alpha=.05, P=.95, side=1, method="TAB"){
  holder <- poistol.int(data_vec, n, m, alpha, P, side, method)
  #Row 1, column 5 holds the 1-sided upper limit; return it explicitly
  holder[1,5]
}

#Test the function
test_1 <- tol_interval_fcn(data_vec = sum_part_data, n=n, m=1, alpha = .05, P = .95, side = 1, method = "TAB")

#Test the function again, this time relying on the default arguments
test_2 <- tol_interval_fcn(P = .95)

#Map the function across a vector of proportions
#Note to future self: map() arguments are: the list of values map the fn over, the fn
#itself, then all the additional arguments of the fn that you aren't mapping over (odd syntax)
upper_tol_tbl <- reliability_freq_tbl %>% mutate(
  particles_per_part = map(proportion_covered_P, tol_interval_fcn, data_vec = sum_part_data, n=n, m=1, alpha = .05, side = 1, method = "TAB") %>% as.integer() 
)

#View head and tail of data
upper_tol_tbl %>% head(20) %>% kable(align=rep('c', 2))

proportion_covered_P	particles_per_part
0.40	50
0.41	50
0.42	50
0.43	50
0.44	51
0.45	51
0.46	51
0.47	51
0.48	51
0.49	51
0.50	52
0.51	52
0.52	52
0.53	52
0.54	52
0.55	53
0.56	53
0.57	53
0.58	53
0.59	53

upper_tol_tbl %>% tail(20) %>% kable(align=rep('c', 2))

proportion_covered_P	particles_per_part
0.80	58
0.81	58
0.82	58
0.83	59
0.84	59
0.85	59
0.86	60
0.87	60
0.88	60
0.89	61
0.90	61
0.91	62
0.92	62
0.93	63
0.94	63
0.95	64
0.96	65
0.97	66
0.98	67
0.99	69

#need this data to feed to geom_label_repel to tell it where to attach label
point_tbl <- tibble(x = .65, y = 60)

#visualize 
upper_tol_tbl %>% ggplot(aes(x = proportion_covered_P, y = particles_per_part)) +
  geom_line(color = "#2c3e50",
            size = 2.5) +
    labs(x = "Estimated Reliability at .95 Confidence Level",
         y = "Edge of 1-Sided Tolerance Interval (Particles per Device)",
         title = "Edge of Tolerance Interval vs. Specified Reliability",
         subtitle = "95% Confidence Level Using TAB Tolerance Technique") +
  scale_y_continuous(breaks = seq(40, 70, 5)) +
  geom_vline(xintercept = .88) +
  geom_hline(yintercept = 60) +
  geom_point(x = .65, y = 60, size = 0, alpha = 0) +
  geom_label_repel(data = point_tbl, aes(x, y), 
                   label = "Spec Limit: 60 Particles Max",
                   fill = "#2c3e50", 
                   color = "white",
                   segment.color = "#2c3e50",
                   segment.size = 1,
                   min.segment.length = unit(1, "lines"),
                   nudge_y = 2,
                   nudge_x = .05)

Here’s a plot that I’ve never made or seen before. For a given set of data (in this case: particulate_data from earlier with n=20 from a Poisson distribution, lambda = 50), the x-axis shows the estimated reliability and the y-axis shows the number of particles at the edge of the tolerance interval calculated with the TAB method. Each point is the standard calculation: the edge of the relevant tolerance interval for a specified proportion at a specified confidence level. For example, we could state we want to know the estimate for the 95th percentile at 95% confidence level - the answer would be 64 particles per device. Since the requirement for clinical safety is set at 60 particles max, we would not pass the test because we could not state with high confidence that 95 or more (out of 100) would pass. Usually it’s just a binary pass/fail decision.

It’s obvious that the 95/95 edge of the tolerance interval is out of spec… but what would be the greatest reliability we could claim at 95% confidence? It ends up being .88 or 88% - very close to the predicted lower bound of the 95% credible interval calculated from the Bayesian method (which was 89.3%, from above)! In this case, the frequentist and Bayesian methods happen to be similar (even though they aren’t measuring the same thing). Interesting stuff!
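
Reading the answer off the plot can also be done programmatically: scan a grid of proportions and keep the largest one whose upper tolerance limit still meets the spec. The sketch below uses a base R stand-in for the tolerance limit (an exact chi-square upper bound on lambda followed by a Poisson quantile), which is my assumption and not necessarily what poistol.int does internally. The total count of 983 is likewise assumed, back-calculated from lambda.hat = 49.15 and n = 20.

```r
# Hypothetical helper: 1-sided upper tolerance limit for Poisson counts
upper_tol_limit <- function(x, n, m = 1, alpha = 0.05, P = 0.95) {
  lambda_upper <- qchisq(1 - alpha, df = 2 * x + 2) / (2 * n)
  qpois(P, lambda = m * lambda_upper)
}

spec_limit <- 60
P_grid     <- seq(0.40, 0.99, by = 0.01)

# Upper tolerance limit for every candidate proportion
limits <- sapply(P_grid, function(P) upper_tol_limit(x = 983, n = 20, P = P))

# Largest proportion whose limit still sits at or below the spec
max_claim <- max(P_grid[limits <= spec_limit])
print(max_claim)
```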

Stopping Rules for Significance Testing in R

Fri, 06 Sep 2019 00:00:00 +0000

When doing comparative testing it can be tempting to stop when we see the result that we hoped for. In the case of null hypothesis significance testing (NHST), the desired outcome is often a p-value below .05. In the medical device industry, bench top testing can cost a lot of money. Why not just recalculate the p-value after every test and stop when the p-value reaches .05? The reason is that the confidence statement attached to your testing is only valid for a specific stopping rule. In other words, to achieve the desired false positive rate we must continue testing specimens until the pre-determined sample size is reached. Evaluating the p-value as you proceed through the testing is known as “peeking” and it’s a statistical no-no.

Suppose we are attempting to demonstrate that a raw material provided by a new vendor results in better corrosion resistance in finished stents relative to the standard supplier. A bench top test is set up to measure the breakdown potential of each sample in a cyclic potentiodynamic polarization (CPP) test. Our goal is to compare the means of the CPP data from the old supplier and the new supplier. The null hypothesis is that the means are equivalent and if the t-test results in a p-value of .05 or lower then we will reject the null and claim improved performance. What happens to the p-value over the course of the testing? We can run a simulation to monitor the p-value and calculate the effect of peeking on the long-term false positive rate. For the test to perform as intended, the long-term false positive rate should be controlled at a level equal to (1 - confidence level).

library(tidyverse)
library(knitr)
library(kableExtra)

First, initialize the objects to hold the data and establish any constants we might need later.

#Initial offset constant to keep minimum group size at n=6
INITIAL_OFFSET <- 5

#Initial values for number of inner and outer loop iterations
n_inner_loop <- 50
n_inner_data <- n_inner_loop + INITIAL_OFFSET
n_outer <- 100

#Initialize empty vector to store p values
store_p_values_vec <- rep(NA, n_inner_loop)

#Initialize a tibble with placeholder column
many_runs_tbl <-  tibble(
  V1 = rep(NA,  n_inner_loop)
  )

The simulation requires 2 for loops. The inner loop performs a series of t-tests, adding 1 more experimental observation to each group after each iteration. The p-value for that iteration is extracted and stored. In the outer loop, the initial data for the 2 groups are generated randomly from normal distributions. Since we can’t really run a t-test on groups with very low sample sizes, we use an initial offset value so that the t-test loops don’t start until both groups have a few observations from which to calculate the means.

The p-value for a traditional t-test is tied to the long-term false positive rate. In other words: if we ran a t-test on samples drawn from 2 identical populations many times we would see a few large differences in means simply due to chance draws. A p-value of .05 corresponds to a difference large enough that only about 5% of such chance draws would be at least as extreme.
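
Here is a quick sketch of that idea with a fixed sample size and no peeking. The number of replicates (5000) and the group size (55) are my choices for illustration; the population parameters are the ones used in the simulation in this post. If the t-test behaves as advertised, about 5% of the p-values should land at or below .05.

```r
# Fixed-sample-size check: both groups come from the same population
set.seed(99)
n_reps      <- 5000   # assumption: number of simulated experiments
n_per_group <- 55     # assumption: observations per group

p_values <- replicate(n_reps, {
  group_1 <- rnorm(n_per_group, mean = 10, sd = 4)
  group_2 <- rnorm(n_per_group, mean = 10, sd = 4)
  t.test(x = group_1, y = group_2)$p.value
})

# Share of experiments that would be declared significant by chance alone
false_positive_rate <- mean(p_values <= 0.05)
print(false_positive_rate)
```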

We can gut-check our simulation in this way by setting the two populations identical to each other and drawing random values in the outer loop as mentioned above.

#Set seed for repeatability
set.seed(1234)

#Outer loop: replicates a t-test between 2 groups
for(l in 1:n_outer) {
    
    #Generate simulated data for each group.  The parameters are set the same to represent 1 population
    example_group_1 <- rnorm(n = n_inner_data, mean = 10, sd = 4)
    example_group_2 <- rnorm(n = n_inner_data, mean = 10, sd = 4)
    
    #Inner loop: subset the first (i + initial offset) values from grp 1 (x) and grp 2 (y)
    #Perform t-test, extract p-value, store in a vector
    #Increment each group's size by 1 after each iteration
    for (i in 1:n_inner_loop) {
        t_test_obj <- t.test(x = example_group_1[1:(INITIAL_OFFSET + i)], y = example_group_2[1:(INITIAL_OFFSET + i)])
        store_p_values_vec[i] <- t_test_obj$p.value
    }
  
    #Store each vector of n_inner_loop p-values to a column in the many_runs_tbl
    many_runs_tbl[,l] <- store_p_values_vec
}

#visualize tibble 
many_runs_tbl[,1:12] %>% 
  signif(digits = 3) %>%
  head(10) %>% 
  kable(align=rep('c', 100))

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12
0.3960	0.0990	0.204	0.412	0.0686	0.1450	0.894	0.360	0.721	0.897	0.0535	0.668
0.1700	0.0628	0.106	0.951	0.2240	0.0834	0.802	0.614	0.750	0.886	0.3170	0.517
0.1410	0.0929	0.057	0.618	0.1360	0.0296	0.499	0.561	0.846	0.809	0.1740	0.410
0.1560	0.4050	0.146	0.800	0.1690	0.0625	0.724	0.700	0.857	0.687	0.3620	0.338
0.1140	0.2610	0.104	0.992	0.2550	0.1860	0.548	0.846	0.727	0.911	0.4270	0.334
0.0540	0.3400	0.143	0.889	0.3180	0.1740	0.775	0.768	0.795	0.666	0.5630	0.229
0.0693	0.4030	0.125	0.871	0.7340	0.0757	0.826	0.792	0.704	0.755	0.4810	0.694
0.0324	0.4050	0.181	0.930	0.8630	0.0617	0.738	0.564	0.501	0.611	0.3930	0.472
0.0206	0.4550	0.112	0.912	0.7560	0.0958	0.644	0.708	0.265	0.687	0.2520	0.638
0.0294	0.6690	0.103	0.777	0.8680	0.1700	0.664	0.703	0.284	0.912	0.2450	0.441

Each column above represents n=50 p-values, with each successive value calculated after observing the newest data point in the simulated test sequence. These are the p-values we see if we peek at the calculation every time.

We need to convert data into tidy format for better visualization. In the tidy format, every column should be a unique variable. The gather() function converts data from wide to long by adding a new variable called “rep_sim_number” and combining all the various runs from 1 to 100 in a single column. In total, we’ll have only 3 columns in the tidy version.

#add new column with row id numbers
final_runs_tbl <- many_runs_tbl %>% 
    mutate(row_id = row_number()) %>%
    select(row_id, everything())

#convert from wide format (untidy) to long (tidy) using gather()
final_runs_tidy_tbl <- final_runs_tbl %>% gather(key = "rep_sim_number", value = "p_value", -row_id)

#visualize tidy data structure
final_runs_tidy_tbl %>% 
  head(10) %>% 
  kable(align=rep('c', 3))

row_id	rep_sim_number	p_value
1	V1	0.3963352
2	V1	0.1704697
3	V1	0.1414021
4	V1	0.1557261
5	V1	0.1141854
6	V1	0.0539595
7	V1	0.0693410
8	V1	0.0324232
9	V1	0.0205511
10	V1	0.0293952

final_runs_tidy_tbl %>% 
  tail(10) %>% 
  kable(align=rep('c', 3))

row_id	rep_sim_number	p_value
41	V100	0.0515933
42	V100	0.0509430
43	V100	0.0386845
44	V100	0.0567804
45	V100	0.0762953
46	V100	0.0933081
47	V100	0.0755494
48	V100	0.0558263
49	V100	0.0731072
50	V100	0.0496300
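
A note for anyone reproducing this today: gather() still works, but the tidyr documentation marks it as superseded by pivot_longer(). Below is a sketch of the same wide-to-long reshape on a tiny, made-up table (the p-values are placeholders, not simulation output). One difference to be aware of is that pivot_longer() returns rows ordered by row first, so I add an arrange() to get the run-by-run ordering shown above.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Tiny made-up wide table: one column per simulated run (placeholder values)
wide_tbl <- tibble(
  V1 = c(0.40, 0.17, 0.14),
  V2 = c(0.10, 0.06, 0.09)
) %>%
  mutate(row_id = row_number())

# Wide to long: one row per (row_id, run) combination
long_tbl <- wide_tbl %>%
  pivot_longer(cols      = -row_id,
               names_to  = "rep_sim_number",
               values_to = "p_value") %>%
  arrange(rep_sim_number, row_id)

print(long_tbl)
```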

From here it is straightforward to visualize the trajectory of the p-values through the course of the testing for all 100 simulations.

#visualize history of n_outer p-values across n_inner_loop consecutive data points as lineplot
lp_1 <- final_runs_tidy_tbl %>% ggplot(aes(x = row_id, y = p_value, group = rep_sim_number)) +
  geom_line(show.legend = "none",
            color       = "grey",
            alpha       = 0.7) +
  labs(x        = "Sequential Benchtop Test Observations",
       title    = "P-Value History for Difference in Means, Standard T-Test",
       subtitle = "Both Groups Sampled From Same Population"
       )

lp_1

The p-values are all over the place! It makes sense that at the pre-determined stopping point (n=50) we would have a spread of p-values since the population parameters for the two groups were identical and p should only rarely land below .05. However, this visualization makes it clear that prior to the stopping point, the path of any particular p-value fluctuates wildly. This is the reason why we can’t stop early or peek!

Let’s take a look at the false positives, defined here as the runs where the p-value ended up less than or equal to .05 at the pre-determined stopping point of n=50.

#filter for runs that ended in false positives (p <= .05) at the last data point
filtered_endpoint_tbl <- final_runs_tidy_tbl %>% 
    filter(row_id == 50,
           p_value <= 0.05) %>%
    select(rep_sim_number) %>%
    rename("false_positives" = rep_sim_number)

filtered_endpoint_tbl %>% 
  head(10) %>% 
  kable(align='c') %>%
  kable_styling(full_width = FALSE)

false_positives
V1
V23
V48
V54
V77
V86
V89
V100

So 8 out of 100 simulations have p-values < .05. This is about as expected since the long term false positive rate should be 5%. Having now identified the false positives, we can visualize the trajectory of their p-values after obtaining each successive data point. This is what happens when we peek early or stop the test when we first see a desired outcome. The following code pulls the full history of the false positive test sequences so we can see their paths before the stopping point.

#extract full false positive test histories.  %in% filters rows that match anything in the false_positives vector
full_low_runs_tbl <- final_runs_tidy_tbl %>%
    filter(rep_sim_number %in% filtered_endpoint_tbl$false_positives)

#visualize trajectory of false positives by highlighting their traces
lp_2 <- final_runs_tidy_tbl %>% 
    ggplot(aes(x = row_id, y = p_value, group = rep_sim_number)) +
    geom_line(alpha = 0.7, show.legend = FALSE, color = "grey") +
    geom_line(aes(color = rep_sim_number), data = full_low_runs_tbl, show.legend = FALSE, size = .8, alpha = 0.7) +
    labs(x       = "Sequential Benchtop Test Observations",
        title    = "P-Value History for Difference in Means, Standard T-Test",
        subtitle = "Highlighted Traces Represent Sequences with p < .05 at n=50"
        )

lp_2

Indeed, the p-values that end up less than .05 do not take a straight line path to get there. Likewise, there may be tests that dip below p=.05 at some point but culminate well above .05 at the pre-determined stopping point. These represent additional false-positives we invite when we peek or stop early. Let’s identify and count these:

#filter for all runs whose p-value ever dipped to .05 or lower at any point 
low_p_tbl <- final_runs_tidy_tbl %>% 
    filter(p_value <= .05) %>% 
    distinct(rep_sim_number)

#visualize
low_p_tbl %>% 
  head(10) %>% 
  kable(align='c') %>% 
  kable_styling(full_width = FALSE)

rep_sim_number
V1
V6
V7
V16
V17
V20
V21
V23
V30
V33

#count total number of false positives with peeking
low_p_tbl %>% nrow() %>%
  kable(align = "c") %>% 
  kable_styling(full_width = FALSE)

x
37

The false positives go from 8 to 37!

#filter for only the rows where rep_sim_number here matches at least 1 value from low_p_tbl$rep_sim_number
#this extracts the full history of runs whose p-value dipped to .05 or lower at any point 
any_low_runs_tbl <- final_runs_tidy_tbl %>%
    filter(rep_sim_number %in% low_p_tbl$rep_sim_number)

#visualize
any_low_runs_tbl %>% 
  head(10) %>% 
  kable(align = rep("c", 3))

row_id	rep_sim_number	p_value
1	V1	0.3963352
2	V1	0.1704697
3	V1	0.1414021
4	V1	0.1557261
5	V1	0.1141854
6	V1	0.0539595
7	V1	0.0693410
8	V1	0.0324232
9	V1	0.0205511
10	V1	0.0293952

#visualize the trajectory of runs that dipped to .05 or below
lp_3 <- final_runs_tidy_tbl %>% 
    ggplot(aes(x = row_id, y = p_value, group = rep_sim_number)) +
    geom_line(alpha = 0.7, show.legend = FALSE, color = "grey") +
    geom_line(aes(color = rep_sim_number), data = any_low_runs_tbl, show.legend = FALSE, size = .8, alpha = 0.7) +
    labs(x       = "Sequential Benchtop Test Observations",
        title    = "P-Value History for Difference in Means, Standard T-Test",
        subtitle = "Highlighted Runs Represent p < .05 at Any Point"
        )

lp_3

All these differences in means would be considered significant if we don’t observe our pre-determined stopping rule. This could be a big deal. We might claim a performance benefit when there is none, or waste precious time and money trying to figure out why we can’t replicate an earlier experiment!
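
Since 100 simulations is a fairly small number, here is a compact sketch that repeats the experiment with more replicates and compares the two stopping rules directly. The replicate count (500) and the helper name sim_one_run are my own choices for illustration; the population parameters, the offset of 5, and the 50 looks mirror the setup in this post.

```r
set.seed(321)
n_reps         <- 500  # assumption: number of simulated test sequences
initial_offset <- 5
n_looks        <- 50

# Hypothetical helper: p-value after each added observation for one test sequence
sim_one_run <- function() {
  group_1 <- rnorm(initial_offset + n_looks, mean = 10, sd = 4)
  group_2 <- rnorm(initial_offset + n_looks, mean = 10, sd = 4)
  vapply(seq_len(n_looks), function(i) {
    idx <- seq_len(initial_offset + i)
    t.test(x = group_1[idx], y = group_2[idx])$p.value
  }, numeric(1))
}

# Matrix of p-values: one row per look, one column per test sequence
p_paths <- replicate(n_reps, sim_one_run())

# False positive rate when we only test at the pre-determined stopping point
fixed_rate <- mean(p_paths[n_looks, ] <= 0.05)

# False positive rate when we stop the first time p dips to .05 or lower
peek_rate <- mean(apply(p_paths, 2, min) <= 0.05)

print(c(fixed = fixed_rate, peeking = peek_rate))
```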

Thanks for reading.

Assessing Design Verification Risk with Bayesian Estimation in R

Fri, 23 Aug 2019 00:00:00 +0000

Suppose our team is preparing to freeze a new implant design. In order to move into the next phase of the PDP, it is common to perform a suite of formal “Design Freeze” testing. If the results of the Design Freeze testing are acceptable, the project can advance from Design Freeze (DF) into Design Verification (DV). DV is an expensive and resource intensive phase culminating in formal reports that are included in the regulatory submission. One key goal of DF is therefore to burn down enough risk to feel confident going into DV. Despite the high stakes, I haven’t ever seen a quantitative assessment of residual risk at the phase review. In this post we’ll attempt to use some simple Bayesian methods to quantify the DV risk as a function of DF sample size for a single, high-risk test.

Consider the requirement for accelerated durability (sometimes called fatigue resistance). In this test, the device is subjected to cyclic loading for a number of cycles equal to the desired service life. For 10 years of loading due to systolic - diastolic pressure cycles, vascular implants must survive approximately 400 million cycles. Accelerated durability is usually treated as attribute type data because the results can be only pass (if no fractures observed) or fail (if fractures are observed). Each test specimen can therefore be considered a Bernoulli trial and the number of passing units in n samples can be modeled with the binomial.

How many samples should we include in DF? We’ll set up some simulations to find out. In order to incorporate the outcome of the DF data into a statement about the probability of success for DV, we’ll need to apply Bayesian methods.

First, load the libraries:

library(tidyverse)
library(cowplot)
library(gghighlight)
library(knitr)
library(kableExtra)

The simulation should start off before we even execute Design Freeze testing. If we’re going to use Bayesian techniques we need to express our uncertainty about the parameters in terms of probability. In this case, the parameter we care about is the reliability. Before seeing any DF data we might know very little about what the true reliability is for this design. If we were asked to indicate what we thought the reliability might be, we should probably state a wide range of possibilities. The design might be good but it might be quite poor. Our belief about the reliability before we do any testing at all is called the prior and we express it as a probability density function, not a point estimate. We need a mathematical function to describe how we want to spread out our belief in the true reliability.

The beta is a flexible distribution that can be adjusted to take a variety of different forms. By tweaking the two shape factors of the beta we can customize the probability density curve in many different ways. If we were super confident that every part we ever made would pass the durability testing, we could put a “spike” prior right on 1.0. This is like saying “there’s no way any part could ever fail”. But the whole point is to communicate uncertainty and in reality there is always a chance the reliability might only be 97%, or 94%, etc. Since we haven’t really seen any DF data, we should probably drop some of our credibility into many different possible values of the reliability. Let’s be very conservative here and just use the flat prior. By evenly binning all of our credibility across the full range of reliability from 0 to 1, we’re saying we don’t want our pre-conceived notions to influence the final estimated reliability much. We’ll instead use the DF data themselves to re-allocate the credibility across the range of reliabilities appropriately according to Bayes’ rule after looking at the Design Freeze results. The more DF data we observe, the more precise the posterior estimate.

The mathematical way to turn the beta distribution into a straight line (flat prior) is to set the shape parameters alpha and beta to (1,1). Note the area under the curve must always integrate to 1. The image on the left shows a flat prior generated from a beta density with parameters (1,1).

Another way to display the prior is to build out the visualization manually by drawing random values from the beta(1,1) distribution and constructing a histogram. This method isn’t terribly useful since we already know the exact distribution we want to use but I like to include it to emphasize the idea of “binning” the credibility across different values of reliability. It’s also nice to see the uncertainty we might see when we start to randomly draw from the distribution (full disclosure: I also just like to practice my coding).

#Plot flat prior using stat_function and ggplot
p_1 <- tibble(x_canvas=c(0,1)) %>% ggplot(aes(x=x_canvas)) +
    stat_function(fun   = dbeta,
                  args  = list(1, 1),
                  color = "#2c3e50", 
                  size  = 1,
                  alpha = .8) +
    ylim(c(0,1.5)) +
    labs(
        y = "Density of Beta",
        x = "Reliability",
        title = "Credibility Allocation, Start of DF",
        subtitle = "Uninformed Prior with Beta (1,1)"
    )

#Set the number of random draws from beta(1,1) to construct histogram flat prior
set.seed(123)
n_draws <- 100000

#Draw random values from beta(1,1), store in object
prior_dist_sim <- rbeta(n = n_draws, shape1 = 1, shape2 = 1)

#Convert from vector to tibble
prior_dist_sim_tbl <- prior_dist_sim %>% as_tibble()

#Visualize with ggplot
p_2 <- prior_dist_sim_tbl %>% ggplot(aes(x = value)) +
    geom_histogram(
        boundary = 1, 
        binwidth = .05, 
        color    = "white",
        fill     = "#2c3e50",
        alpha    = 0.8
        ) +
    xlim(c(-0.05, 1.05)) +
    ylim(c(0, 7500)) +
    labs(
        y = "Count",
        title = "Credibility Simulation, Start of DF",
        subtitle = "Uninformed Prior with Beta (1,1)",
        x = "Reliability"
    
    ) 

plot_grid(p_1,p_2)
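
A quick numerical check of the two panels, as a sketch in base R: the beta(1,1) density should equal 1 everywhere on the interval, the area under it should be 1, and with 100,000 draws and a bin width of .05 each histogram bar should hold roughly the same count.

```r
# Density of beta(1, 1) at a few reliabilities: flat at 1
flat_density <- dbeta(c(0.1, 0.5, 0.9), shape1 = 1, shape2 = 1)
print(flat_density)

# Area under the density over the full range of reliability
area <- integrate(dbeta, lower = 0, upper = 1, shape1 = 1, shape2 = 1)$value
print(area)

# Expected count per histogram bin for the simulated prior
n_draws          <- 100000
binwidth         <- 0.05
expected_per_bin <- n_draws * binwidth
print(expected_per_bin)
```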

OK now the fun stuff. There is a cool, mathematical shortcut we can take to combine our simulated Design Freeze data with our flat prior to create the posterior distribution. It’s very simple: we just add the number of passing DF units to our alpha parameter and the number of failing DF units to our beta parameter. The reason why this works so well is beyond the scope of this post, but the main idea is that when the functional form of the prior (beta function in our case) is similar to the functional form of the likelihood function (Bernoulli in our case), then you can multiply them together easily and the product also takes a similar form. When this happens, the prior is said to be the “conjugate prior” of the likelihood function.¹ The beta and binomial are a special pair that go together like peanut butter and jelly.
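
For readers who want a glimpse of the algebra, here is a short sketch. I write \(\theta\) for the reliability (a symbol introduced here for illustration) and drop every constant that does not depend on \(\theta\):

\[p(\theta \mid \mbox{data}) \propto \underbrace{\theta^{\mbox{passes}}(1-\theta)^{\mbox{fails}}}_{\mbox{likelihood}} \times \underbrace{\theta^{\alpha_0-1}(1-\theta)^{\beta_0-1}}_{\mbox{prior}} = \theta^{\alpha_0+\mbox{passes}-1}(1-\theta)^{\beta_0+\mbox{fails}-1}\]

The right-hand side has the same shape as a beta density with parameters \(\alpha_0+\mbox{passes}\) and \(\beta_0+\mbox{fails}\), which is why the update amounts to simple addition.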

Again, to understand how our belief in the reliability should be allocated after observing the DF data, all we need to do is update the beta function by adding the number of passing units from DF testing to alpha (Shape1 parameter) and the number of failing units to beta (Shape2 parameter).

\[\mbox{Beta}(\alpha_0+\mbox{passes}, \beta_0+\mbox{fails})\]

We’re going to assume all units pass DF, so we only need to adjust the alpha parameter. The resulting beta distribution that we get after updating the alpha parameter represents our belief in where the true reliability may lie after observing the DF data. Remember, even though every unit passed, we can’t just say the reliability is 100% because we’re smart enough to know that if the sample size was, for example, n=15 - there is a reasonable chance that a product with true reliability of 97% could run off n=15 in a row without failing. Even 90% reliability might hit 15 straight every once in a while but it would be pretty unlikely.
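
The chance of a clean run of 15 is easy to compute directly with the binomial. This sketch evaluates it for a handful of candidate reliabilities (the specific values in true_reliability are my picks for illustration):

```r
# Probability that all n_df units pass, for several candidate true reliabilities
n_df             <- 15
true_reliability <- c(0.99, 0.97, 0.94, 0.90)  # assumption: illustrative values

prob_all_pass <- dbinom(x = n_df, size = n_df, prob = true_reliability)
names(prob_all_pass) <- true_reliability

print(round(prob_all_pass, 3))
```

Each entry is simply the reliability raised to the 15th power, so the same one-liner can be reused for any other DF sample size.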

The code below looks at four different possible sample size options for DF: n=15, n=30, n=45, and a full n=59 (just like we plan for DV).

#Draw randomly from 4 different beta distributions. Alpha parameter is adjusted based on DF sample size
posterior_dist_sim_15 <- rbeta(n_draws, 16, 1)
posterior_dist_sim_30 <- rbeta(n_draws, 31, 1)
posterior_dist_sim_45 <- rbeta(n_draws, 46, 1)
posterior_dist_sim_59 <- rbeta(n_draws, 60, 1)

#Function to convert vectors above into tibbles and add column for Sample Size 
pds_clean_fcn <- function(pds, s_size){
    pds %>% as_tibble() %>% mutate(Sample_Size = s_size) %>%
    mutate(Sample_Size = factor(Sample_Size, levels = unique(Sample_Size)))}

#Apply function to 4 vectors above
posterior_dist_sim_15_tbl <- pds_clean_fcn(posterior_dist_sim_15, 15)
posterior_dist_sim_30_tbl <- pds_clean_fcn(posterior_dist_sim_30, 30)
posterior_dist_sim_45_tbl <- pds_clean_fcn(posterior_dist_sim_45, 45)
posterior_dist_sim_59_tbl <- pds_clean_fcn(posterior_dist_sim_59, 59)

#Combine the tibbles in a tidy format for visualization
full_post_df_tbl <- bind_rows(
            posterior_dist_sim_15_tbl,
            posterior_dist_sim_30_tbl, 
            posterior_dist_sim_45_tbl, 
            posterior_dist_sim_59_tbl
            )

#Visualize with density plot
df_density_plt <- full_post_df_tbl %>% ggplot(aes(x = value, fill = Sample_Size)) +
    geom_density(alpha = .6) +
    xlim(c(0.85,1)) +
    labs(x = "",
         y = "Density of Beta",
         title = "Credibility Simulation, After Design Freeze",
         subtitle = "Updated Belief Modeled with Beta(1 + n,1)") +
    scale_fill_manual(values = c("#2C728EFF", "#20A486FF", "#75D054FF", "#FDE725FF")) 

#Visualize with histogram 
df_hist_plt <- full_post_df_tbl %>% ggplot(aes(x = value, fill = Sample_Size)) +
    geom_histogram(alpha = .9,
                   position = "dodge",
                   boundary = 1,
                   color = "black") +
    xlim(c(0.85,1)) +
    labs(x = "Reliability",
         y = "Count") +
    scale_fill_manual(values = c("#2C728EFF", "#20A486FF", "#75D054FF", "#FDE725FF"))

plot_grid(df_density_plt, df_hist_plt, ncol = 1)

If we unpack these charts a bit, we can see that if we only do n=15 in Design Freeze, we still need to allocate some credibility to reliability parameters below .90. For a full n=59, anything below .95 reliability is very unlikely, yet the 59 straight passing units could have very well come from a product with reliability = .98 or .97.
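As a rough numerical check on that reading of the charts, the posterior tail areas can be computed directly with pbeta(). This is only a sketch: it assumes the same Beta(1 + n, 1) posteriors used in the simulation, and the variable names are made up for illustration.

```r
# Posterior probability that reliability sits below a threshold,
# assuming the Beta(1 + n, 1) posteriors used in the simulation
p_below_90_n15 <- pbeta(0.90, shape1 = 16, shape2 = 1)  # DF with n = 15
p_below_95_n59 <- pbeta(0.95, shape1 = 60, shape2 = 1)  # DF with n = 59

round(c(n15_below_90 = p_below_90_n15, n59_below_95 = p_below_95_n59), 3)
```

Under these assumptions roughly 19% of the credibility sits below .90 after n=15, versus under 5% below .95 after n=59.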

We now have a good feel for our uncertainty about the reliability after DF, but what we really want to know is our likelihood of passing Design Verification. To answer this question, we’ll extend our simulation to perform many replicates of n=59 Bernoulli trials, each representing a round of Design Verification testing. The probability of success (the reliability) for each replicate will be randomly drawn from the posterior distributions above via Monte Carlo. Let’s see how many of these virtual DV tests end with 59/59 passing:

#Perform many sets of random binom runs, each with n=59 trials. p is taken from the probs previously generated 
DV_acceptable_units_15 <- rbinom(size = 59, n = n_draws, 
                                 prob = (posterior_dist_sim_15_tbl$value))
DV_acceptable_units_30 <- rbinom(size = 59, n = n_draws, 
                                 prob = (posterior_dist_sim_30_tbl$value))
DV_acceptable_units_45 <- rbinom(size = 59, n = n_draws, 
                                 prob = (posterior_dist_sim_45_tbl$value))
DV_acceptable_units_59 <- rbinom(size = 59, n = n_draws, 
                                 prob = (posterior_dist_sim_59_tbl$value))

#Function to convert vectors to tibbles and add col for sample size
setup_fcn <- function(vec, ss){
    vec %>% as_tibble() %>% mutate(DF_Sample_Size = ss) %>%
    mutate(DF_Sample_Size = factor(DF_Sample_Size, levels = unique(DF_Sample_Size)))}

#Apply function
DV_acceptable_units_15_tbl <- setup_fcn(DV_acceptable_units_15, 15)
DV_acceptable_units_30_tbl <- setup_fcn(DV_acceptable_units_30, 30)
DV_acceptable_units_45_tbl <- setup_fcn(DV_acceptable_units_45, 45)
DV_acceptable_units_59_tbl <- setup_fcn(DV_acceptable_units_59, 59)

#Combine the tibbles in a tidy format for visualization
DV_acceptable_full_tbl <- bind_rows(DV_acceptable_units_15_tbl,
                                    DV_acceptable_units_30_tbl,
                                    DV_acceptable_units_45_tbl,
                                    DV_acceptable_units_59_tbl)

#Visualize with ggplot.  Apply gghighlight where appropriate
g1 <- DV_acceptable_full_tbl %>%
   ggplot(aes(x = value)) +
   geom_histogram(aes(fill = DF_Sample_Size),binwidth = 1, color = "black", position = "dodge", alpha = .9) +
    scale_x_continuous(limits = c(45, 60), breaks=seq(45, 60, 1)) +
    scale_fill_manual(values = c("#2C728EFF", "#20A486FF", "#75D054FF", "#FDE725FF")) +
    labs(
        x = "Passing Parts out of 59 total",
        title = "Simulated Design Verification Testing",
        subtitle = "100,000 Simulated DV Runs of n=59"
    )

g2 <- g1 +
    gghighlight(value == 59, use_direct_label = FALSE) +
    labs(
        title = "Simulations that PASSED Design Verification"
     )
    
g3 <- g1 +
    gghighlight(value < 59, use_direct_label = FALSE) +
    labs(
        title = "Simulations that FAILED Design Verification"
    )

Taking into consideration the uncertainty of the true reliability after the DF testing, the percentage of times we expect to pass Design Verification is shown below. These percentages are calculated as the number of simulated DV runs that achieved 59/59 passing units divided by the total number of simulated DV runs. Any simulation with 58 or fewer passing units would have failed DV.

\[\mbox{expected probability of passing DV} = \frac{\mbox{number of sims with 59/59 passing}}{\mbox{total sims}}\]

#Function to calculate how many DV simulations resulted in 59/59 passing units
pct_pass_fct <- function(tbl, n){
    pct_dv_pass <- tbl %>% filter(value == 59) %>% nrow() / n_draws
    paste("DF with n = ",n, "(all pass): ", pct_dv_pass %>% scales::percent(), "expected probability of next 59/59 passing DV")}

DF with n = 15 (all pass): 21.3% expected probability of next 59/59 passing DV

DF with n = 30 (all pass): 34.5% expected probability of next 59/59 passing DV

DF with n = 45 (all pass): 43.5% expected probability of next 59/59 passing DV

DF with n = 59 (all pass): 50.3% expected probability of next 59/59 passing DV

The percentage of time we expect to pass Design Verification is shockingly low! Even when we did a full n=59 in Design Freeze, we would still only be able to predict about 50% success in DV! This is because even with 59/59 passes, we still must account for the possibility that the reliability isn’t 100%. We don’t have enough DF data to shift the credibility all the way near 100%, and when the credibility is spread to include possible reliabilities in the mid .90’s we should always be prepared for the possibility of failing Design Verification.
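These percentages can be cross-checked against a closed form. If the reliability p follows Beta(a, 1), then E[p^k] = a / (a + k), which is the chance that the next k units all pass. The sketch below assumes the flat Beta(1, 1) prior implied by the rbeta() calls earlier; prob_pass_dv and the other names are made up for illustration, and the seed is arbitrary.

```r
# Closed-form chance that all n_dv DV units pass, given n_df straight passes in DF.
# Assumes a flat Beta(1, 1) prior, so the posterior is Beta(1 + n_df, 1),
# and uses E[p^k] = a / (a + k) for p ~ Beta(a, 1).
prob_pass_dv <- function(n_df, n_dv = 59) {
  a <- 1 + n_df
  a / (a + n_dv)
}

df_sizes <- c(15, 30, 45, 59)
analytic_pass <- prob_pass_dv(df_sizes)

# Monte Carlo version of the same quantity, for comparison
set.seed(123)      # arbitrary seed
n_draws <- 100000  # matches the 100,000 simulated DV runs in the plot subtitle
simulated_pass <- sapply(df_sizes, function(n_df) {
  reliability <- rbeta(n_draws, 1 + n_df, 1)
  mean(rbinom(n_draws, size = 59, prob = reliability) == 59)
})

round(rbind(analytic = analytic_pass, simulated = simulated_pass), 3)
```

The analytic values work out to about 21.3%, 34.4%, 43.8%, and 50.4%, in line with the simulated percentages reported above. The formula also shows why more DF testing has diminishing returns: under these assumptions, passing DV with 90% probability would take several hundred straight passes in DF.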

We could just leave it at that but I have found that when discussing risk, stakeholders want more than just an estimation of the rate of bad outcomes. They want a recommendation and a mitigation plan. Here are a few ideas; can you think of any more?

Maintain multiple design configurations as long as possible (often not feasible, but provides an out if 1 design fails)
Perform durability testing as “fatigue-to-failure”. In this methodology, the devices are run to failure and the cycles to failure are treated as variable data. By varying the amplitude of the loading cycles, we can force the devices to fail and understand the uncertainty within the failure envelope.²
Fold in information from pre-DF testing, predicate testing, etc. to inform the prior better. I will look at the sensitivity of the reliability estimations to the prior in a future post.
Build redundant design cycles into the project schedule to accommodate additional design turns without falling behind the contracted timeline

Thanks for reading!

¹ Kruschke, Doing Bayesian Data Analysis, https://sites.google.com/site/doingbayesiandataanalysis/
² Fatigue-to-Fracture ASTM Standard: https://www.astm.org/Standards/F3211.htm

Permutation Test for NHST of 2 Samples in R

Sat, 10 Aug 2019 00:00:00 +0000

As engineers, it is not uncommon to be asked to determine whether or not two different configurations of a product perform the same. Perhaps we are asked to compare the durability of a next-generation prototype to the current generation. Sometimes we are testing the flexibility of our device versus a competitor for marketing purposes. Maybe we identify a new vendor for a raw material but must first understand whether the resultant finished product will perform any differently than when built using material from the standard supplier. All of these situations call for a comparison between two groups culminating in a statistically supported recommendation.

There are a lot of interesting ways to do this: regions of practical equivalence, Bayes Factors, etc. The most common method is still null hypothesis significance testing (NHST) and that’s what I want to explore in this first post. Frequentist methods yield the least useful inferences but have the advantage of a long usage history. Most medical device professionals will be looking for a p-value, so a p-value we must provide.

In NHST, the plan is usually to calculate a test statistic from our data and then ask how surprising that statistic would be in a world where the null hypothesis was true. We generally do this by comparing our statistic to a reference distribution or a table of tabulated values, often through a statistical program. Unfortunately, whenever our benchtop data violate an assumption of the reference model, we are no longer comparing apples to apples. We must make tweaks and adjustments to try to compensate. It is easy to get overwhelmed in a decision tree of test names and use cases.

A more robust and intuitive approach to NHST is to replace the off-the-shelf distributions and tables with a simulation built right from our dataset. The workflow for any such test is shown below.¹

The main difference here is that we create the distribution of the data under the null hypothesis using simulation instead of relying on a reference distribution. It’s intuitive, powerful, and fun.

Imagine we have just designed a benchtop experiment in which we intend to measure the pressure (in mm Hg) at which a pair of overlapped stent grafts starts to migrate or disconnect when deployed in a large thoracic aneurysm.²

A common null hypothesis for comparing groups is that there is no difference between them. Under this model, we can treat all the experimental data as one big group instead of 2 different groups. We therefore pool the data from our completed experiment into one big group, shuffle it, and randomly assign data points into two groups of the original size. This is our generative model. After each round of permutation and assignment, we calculate and store the test statistic for the observed effect (difference in means between the two groups). Once many simulations have been completed, we’ll see where our true data falls relative to the virtual data.

One way to set up and execute a simulation-based NHST for comparing two groups in R is as follows (note: there are quicker shortcuts to executing this type of testing but the long version below allows for customization, visualization, and adjustability):

First, we read in the libraries, transcribe the benchtop data into R, and evaluate the sample sizes:

library(tidyverse)
library(cowplot)
library(knitr)
library(kableExtra)

#Migration pressure for predicate device
predicate <-  c(186, 188, 189, 189, 192, 193, 194, 194, 194, 195, 195, 196, 196, 197, 197, 198, 198, 199, 199, 201, 206, 207, 210, 213, 216, 218)

#Migration pressure for next_gen device
next_gen <-  c(189, 190, 192, 193, 193, 196, 199, 199, 199, 202, 203, 204, 205, 206, 206, 207, 208, 208, 210, 210, 212, 214, 216, 216, 217, 218)

Sample Size of Predicate Device Data: 26

Sample Size of Next-Gen Device Data: 26

So we have equal group sizes (n=26 each) and relatively small sample sizes. No problem - assign each group to a variable and convert to tibble format:

#Assign variables for each group and convert to tibble
predicate_tbl <- tibble(Device = "Predicate",
                        Pressure = predicate)

next_gen_tbl <- tibble(Device = "Next_Gen",
                        Pressure = next_gen)

Combine predicate and next_gen data into a single, pooled group called results_tbl. Taking a look at the first few and last few rows in the pooled tibble confirms it was combined appropriately.

#Combine in tibble
results_tbl <- bind_rows(predicate_tbl, next_gen_tbl)
results_tbl %>% 
  head() %>% 
  kable(align = rep("c",2))

Device	Pressure
Predicate	186
Predicate	188
Predicate	189
Predicate	189
Predicate	192
Predicate	193

results_tbl %>% 
  tail() %>% 
  kable(align = rep("c",2))

Device	Pressure
Next_Gen	212
Next_Gen	214
Next_Gen	216
Next_Gen	216
Next_Gen	217
Next_Gen	218

Now we do some exploratory data analysis to identify general shape and distribution.

# Visualize w/ basic boxplot
boxplot_eda <- results_tbl %>% 
    ggplot(aes(x=Device, y=Pressure)) +
    geom_boxplot(
        alpha  = .6,
        width  = .4,
        size   = .8,
        fatten = .5,
        fill   = c("#FDE725FF","#20A486FF")) +
    labs(
        y        = "Pressure (mm Hg)",
        title    = "Predicate and Next-Gen Data",
        subtitle = "Modular Disconnect Pressure"
    )

boxplot_eda

#Visualize with density plot
density_eda <- results_tbl %>% 
    ggplot(aes(x = Pressure)) +
    geom_density(aes(fill = Device),
        color = "black",
        alpha = 0.6
        ) +
    scale_fill_manual(values = c("#FDE725FF","#20A486FF")) +
    labs(
        x        = "Pressure (mm Hg)",
        title    = "Predicate and Next-Gen Data",
        subtitle = "Modular Disconnect Pressure"
    )

density_eda

Yikes! These data do not look normal. Fortunately, the permutation test does not need the data to take on any particular distribution. The main assumption is exchangeability, meaning it must be reasonable that the labels could be arbitrarily permuted under the null hypothesis. Provided the sample sizes are approximately equal, the permutation test is robust against unequal variances.³ This gives us an attractive option for data shaped as shown above.

To get started with our permutation test we create a function that accepts 3 arguments: the pooled data from all trials in our benchtop experiment (x), the number of observations taken from Group 1 (n1), and the number of observations taken from Group 2 (n2). The function creates the indices 1:n, then randomly assigns them to two groups, A and B, with sizes that match the original group sizes. It then uses the randomly assigned indices to subset the dataset x, producing 2 “shuffled” groups from the original data. Finally, it computes and returns the difference in means between the 2 randomly assigned groups.

#Function to permute vector indices and then compute difference in group means
perm_fun <- function(x, n1, n2){
  n <- n1 + n2
  group_B <- sample(1:n, n1)
  group_A <- setdiff(1:n, group_B)
  mean_diff <- mean(x[group_B]) - mean(x[group_A])
  return(mean_diff)
}

Here we initialize a dummy vector called perm_diffs to hold the results of the loop we are about to use. It’ll have all 0’s to start and then we’ll assign values from each iteration of the for loop.

#Set number of simulations to run
n_sims <- 10000

#Initialize empty vector
perm_diffs <- rep(0,n_sims)
perm_diffs %>% head()  %>% 
  kable(align = "c", col.names = NULL)

Set up a simple for loop to execute the same evaluation using perm_fun() 10,000 times. On each iteration, we’ll store the results into the corresponding index within perm_diffs that we initialized above.

#Set seed for reproducibility
set.seed(2015)

#Iterate over desired number of simulations using permutation function
for (i in seq_len(n_sims)) {
  perm_diffs[i] <- perm_fun(results_tbl$Pressure, 26, 26)
}

Now we have 10,000 replicates of our permutation test stored in perm_diffs. We want to visualize the data with ggplot so we convert it into a tibble using tibble().

#Convert results to a tibble and look at it
perm_diffs_df <- tibble(perm_diffs)
perm_diffs_df %>% head()  %>% 
  kable(align = "c")

perm_diffs
-0.6153846
-3.3076923
0.6923077
-2.3846154
-0.3076923
3.1538462

Visualize the difference in means as a histogram and density plot:

#Visualize difference in means as a histogram
diffs_histogram_plot <- perm_diffs_df %>% ggplot(aes(perm_diffs)) +
  geom_histogram(fill = "#2c3e50", color = "white", binwidth = .3, alpha = 0.8) +
    labs(
        x = "Pressure (mm Hg)",
        title = "Histogram of Difference in Means",
        subtitle = "Generated Under Null Hypothesis"
    )

#Visualize difference in means as a density plot
diffs_density_plot <-  perm_diffs_df %>% ggplot(aes(perm_diffs)) +
  geom_density(fill = "#2c3e50", color = "white", alpha = 0.8) +
     labs(
        x = "Pressure (mm Hg)",
        title = "Density Plot of Difference in Means",
        subtitle = "Generated Under Null Hypothesis"
    )

plot_grid(diffs_histogram_plot, diffs_density_plot)

We just simulated many tests from the null hypothesis. These virtual data give us a good understanding of what sort of difference in means we might observe if there truly was no difference between the groups. As expected, most of the time the difference is around 0. But occasionally there is a noticeable difference in means just due to chance.

But how big was the difference in means from our real world dataset? We’ll call this “baseline difference”.

#Evaluate difference in means from true data set
predicate_pressure_mean <- mean(predicate_tbl$Pressure)
next_gen_pressure_mean <- mean(next_gen_tbl$Pressure)

baseline_difference <- predicate_pressure_mean - next_gen_pressure_mean
baseline_difference  %>% 
  signif(digits = 3) %>%
  kable(align = "c", col.names = NULL)

-5.85

So our real, observed data show a difference in means of -5.85. Is this large or small? With the context of the shuffle testing we already performed, we know exactly how extreme our observed data is and can visualize it with a vertical line.

#Visualize real data in context of simulations
g1 <- diffs_histogram_plot + 
  geom_vline(xintercept = baseline_difference, 
             linetype   = "dotted", 
             color      = "#2c3e50", 
             size       = 1
             ) 

g2 <- diffs_density_plot + 
  geom_vline(xintercept = baseline_difference, 
             linetype   ="dotted", 
             color      = "#2c3e50", 
             size       = 1
             ) 

plot_grid(g1,g2)

It looks like our benchtop data were pretty extreme relative to the null. We should start to consider the possibility that this effect was not due to chance alone. 0.05 is a commonly used threshold for declaring statistical significance. Let’s see if our data are more or less extreme than the 5% quantile of the simulated distribution (solid line).

#Calculate the 5% quantile of the simulated distribution for difference in means
the_five_percent_quantile <- quantile(perm_diffs_df$perm_diffs, probs = 0.05)
the_five_percent_quantile

##        5% 
## -4.153846

#Visualize the 5% quantile on the histogram and density plots
g3 <- g1 +
         geom_vline(xintercept = the_five_percent_quantile, 
             color      = "#2c3e50", 
             size       = 1
             )

g4 <- g2 +
        geom_vline(xintercept = the_five_percent_quantile, 
             color      = "#2c3e50", 
             size       = 1
             )

plot_grid(g3,g4)

We can see here that our data are more extreme than the 5% quantile, which means our one-sided p-value is less than 0.05. This satisfies the traditional, frequentist definition of statistically significant. If we want the actual p-value, we have to determine the percentage of simulated data that are as extreme or more extreme than our observed data.

#Calculate percentage of simulations as extreme or more extreme than the observed data (p-value)
p_value <- perm_diffs_df %>% 
    filter(perm_diffs <= baseline_difference) %>%
    nrow() / n_sims

paste("The empirical p-value is: ", p_value)  %>% 
  kable(align = "c", col.names = NULL)

The empirical p-value is: 0.0096

Our p-value is well below 0.05. This is likely enough evidence for us to claim that there was a statistically significant difference observed between the Next Gen device and the predicate device.
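The p-value above counts only simulated differences in the same direction as the observed one, so it is one-sided. If the question were simply whether the devices differ in either direction, a two-sided version could be computed instead. The sketch below is a compact alternative to the loop above, not a replacement for it: it uses replicate() rather than a for loop, the seed is arbitrary, and names such as null_diffs and p_two_sided are made up for illustration.

```r
# Benchtop data copied from the post
predicate <- c(186, 188, 189, 189, 192, 193, 194, 194, 194, 195, 195, 196, 196,
               197, 197, 198, 198, 199, 199, 201, 206, 207, 210, 213, 216, 218)
next_gen  <- c(189, 190, 192, 193, 193, 196, 199, 199, 199, 202, 203, 204, 205,
               206, 206, 207, 208, 208, 210, 210, 212, 214, 216, 216, 217, 218)

pooled        <- c(predicate, next_gen)
n1            <- length(predicate)
observed_diff <- mean(predicate) - mean(next_gen)

set.seed(2015)  # arbitrary seed, for reproducibility only
null_diffs <- replicate(10000, {
  idx <- sample(length(pooled), n1)       # random relabeling of the pooled data
  mean(pooled[idx]) - mean(pooled[-idx])  # difference in means under the null
})

# One-sided: simulated differences at least as negative as the observed one
p_one_sided <- mean(null_diffs <= observed_diff)

# Two-sided: simulated differences at least as large in magnitude
p_two_sided <- mean(abs(null_diffs) >= abs(observed_diff))

c(one_sided = p_one_sided, two_sided = p_two_sided)
```

Because the null distribution is roughly symmetric around 0, the two-sided value should come out near twice the one-sided value. Which one to report depends on the question agreed on before the test was run.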

Our marketing team will be thrilled, but we should always be wary that statistically significant does not mean practically important. Domain knowledge should provide the context to interpret the relevance of the observed difference. A difference in mean Pressure of a few mm Hg seems to be enough to claim a statistically significant improvement in our new device vs. the predicate, but is it enough for our marketing team to make a meaningful campaign? In reality, a few mm Hg is noticeable on the bench but is likely lost in the noise of anatomical variation within real patient anatomies.

¹ Probably Overthinking It, http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html
² J ENDOVASC THER 2011;18:559-568, open access https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163409/
³ Simulations and Explanation of Unequal Variance and Sample Sizes, https://stats.stackexchange.com/questions/87215/does-a-big-difference-in-sample-sizes-together-with-a-difference-in-variances-ma