<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Stats on [R]eliability</title>
    <link>https://rileyking.netlify.app/tags/stats/</link>
    <description>Recent content in Stats on [R]eliability</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <copyright>&amp;copy; 2019. All rights reserved.</copyright>
    <lastBuildDate>Thu, 17 Dec 2020 00:00:00 +0000</lastBuildDate>
    
        <atom:link href="https://rileyking.netlify.app/tags/stats/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Boundary Conditions and Anatomy - Correlated Data and Kernel Density Estimation in R</title>
      <link>https://rileyking.netlify.app/post/boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r/</link>
      <pubDate>Thu, 17 Dec 2020 00:00:00 +0000</pubDate>
      
      <guid>https://rileyking.netlify.app/post/boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r/</guid>
      <description>
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://rileyking.netlify.app/rmarkdown-libs/anchor-sections/anchor-sections.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/anchor-sections/anchor-sections.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/viz/viz.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://rileyking.netlify.app/rmarkdown-libs/DiagrammeR-styles/styles.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/grViz-binding/grViz.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Measurements taken from patient anatomy are often correlated. For example, larger blood vessels might tend to have less curvature. Additionally, data are rarely Gaussian, favoring skewed shapes with some very large values and a lower bound of zero. These properties can make simulation and inference hard. In this post I will walk through a workflow for an engineering problem that might be presented in my industry. It involves simulating a population of patients and identifying a subset of interest.&lt;/p&gt;
&lt;p&gt;Imagine we have been assigned the task of identifying boundary conditions for a benchtop durability test of an implantable, artificial heart valve. In other words, we need to identify credible parameters for a physical test such that our test engineers can challenge the device under severe but realistic geometries and loads. To facilitate this task our clinical team has analyzed images and extracted measurements for the features of interest in a subset of n=300 patients. There are two main challenges when working with these data:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;How do we use our sample to simulate the full population?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;How do we use the simulated, full population to identify groups of interest and recommend boundary conditions for the test&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;The rest of this post explores what we should do with these data to resolve these challenges and identify appropriate and realistic test conditions.&lt;/p&gt;
&lt;div id=&#34;the-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The Data&lt;/h1&gt;
&lt;p&gt;Suppose the three parameters our team cares about are the &lt;strong&gt;&lt;em&gt;ellipticity&lt;/em&gt;&lt;/strong&gt; of the vessel cross section, &lt;strong&gt;&lt;em&gt;curvature&lt;/em&gt;&lt;/strong&gt; of the vessel in the vessel region of interest, and the blood &lt;strong&gt;&lt;em&gt;pressure&lt;/em&gt;&lt;/strong&gt;. Features such as these are important because they influence both the equilibrium geometry and the magnitude of forces acting on the implantable valve (in other words: the boundary conditions). The image below shows a schematic/example of ellipticity and vessel curvature in the LVOT and aortic valve annulus as observed in CT imaging.&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/ellipticity_angulation.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I enjoy the tidyverse toolset for exploring and working with data so let’s get that loaded up along with some other packages that will help in the analysis to come.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(readxl)
library(knitr)
library(DiagrammeR)
library(fitdistrplus)
library(MASS)
library(ggrepel)
library(readxl)
library(ks)
library(broom)
library(ggExtra)
library(GGally)
library(car)
library(rgl)
library(anySim)
library(tidyverse)
library(plotly)&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#the-data&#34;&gt;The Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#correlations-in-the-original-dataset&#34;&gt;Correlations in the Original Dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#anysim---generate-simulated-population-of-correlated-patient-data&#34;&gt;AnySim - Generate Simulated Population of Correlated Patient Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#kernel-density-estimation---map-density-contours-to-data&#34;&gt;Kernel Density Estimation - Map Density Contours to Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#naive-method---apply-default-kde-to-lognormal-data&#34;&gt;Naive Method - Apply Default KDE to Lognormal Data&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#estimate-kde&#34;&gt;Estimate kde&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#density-proportions-from-kde-estimate&#34;&gt;Density proportions from kde estimate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#kde-estimates-in-the-range-of-the-variables&#34;&gt;KDE estimates in the range of the variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#density-plot-with-probability-contours-in-3d&#34;&gt;Density Plot with Probability Contours in 3d&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#fit-kde-to-normal-data-transform-later&#34;&gt;Fit KDE to normal data transform later&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#density-proportions-from-kde-estimate&#34;&gt;Density proportions from kde estimate&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#kde-estimates-in-the-range-of-the-variables&#34;&gt;KDE estimates in the range of the variables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#density-plot-with-probability-contours-in-3d&#34;&gt;Density Plot with Probability Contours in 3d&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#transform-data-and-kde-contour-to-original-scale&#34;&gt;Transform data and kde contour to original scale&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#plot-back-transformed-data-with-plotly&#34;&gt;Plot Back-Transformed Data with Plotly&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#filter-extreme-points-and-assess-points-on-95-5-contour&#34;&gt;Filter extreme points and assess points on 95-5 contour&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#appendix-a---simulating-a-multivariate-distribution-with-mass-mvnorm&#34;&gt;Appendix A - simulating a multivariate distribution with mass mvnorm&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#step-1---fit-distributions-to-each-variable&#34;&gt;Step 1 - Fit Distributions to Each Variable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step-2---transform-all-variables-to-normal&#34;&gt;Step 2 - Transform all variables to normal&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step-3---fit-normal-distributions-to-each-transformed-variable&#34;&gt;Step 3 - Fit normal distributions to each transformed variable&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step-4---draw-joint-distribution-using-mvrnorm()-or-equivalent-function&#34;&gt;Step 4 - Draw joint distribution using mvrnorm() or equivalent function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step-5---back-transform-simulated-data-to-original-distribution&#34;&gt;Step 5 - Back-transform simulated data to original distribution&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#step-6---evaluate-parameters-and-marginal-distributions-of-the-back-transfomed-data&#34;&gt;Step 6 - Evaluate parameters and marginal distributions of the back-transfomed data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#compare-original-data-to-simulated-data&#34;&gt;Compare Original Data to Simulated Data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#appendix-b---2d-kde-plot-with-probability-traces&#34;&gt;Appendix B - 2d kde plot with probability traces&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Start by reading in the data and taking a look at the format.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sample_data &amp;lt;- readRDS(file = &amp;quot;sim_anatomy_data.rds&amp;quot;)
sample_data&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 300 x 3
##    ellip  curv pressure
##    &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1  1.26  4.51     92.7
##  2  1.28  5.02    183. 
##  3  1.29  4.03    154. 
##  4  1.23  2.14    109. 
##  5  1.13  3.67    124. 
##  6  1.22  2.37    114. 
##  7  1.10  3.06    113. 
##  8  1.04  2.31    105. 
##  9  1.11  5.31    115. 
## 10  1.09  2.04    109. 
## # ... with 290 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As expected, 300 rows with our 3 features of interest.&lt;/p&gt;
&lt;p&gt;It might seem tempting at this point to extract the maximum value from each group (or maybe something like the 95th percentile) and report those values together as a conservative worst-case. The problem with this approach is that each row of data is from a specific patient, so the variables are likely to be correlated. It could be that those severe values for each variable never occur together in the same patient. If we choose them all, we could over-test the device and over-design the device, potentially setting the program way behind. A more sophisticated approach is to consider the variables as a joint distribution and respect any correlation that may be present.&lt;/p&gt;
&lt;p&gt;Here is some code to visualize the marginal distributions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ellip_curv_plt &amp;lt;- sample_data %&amp;gt;%
  ggplot(aes(x = ellip, y = curv)) +
  geom_point(alpha = .5) +
  labs(
    title = &amp;quot;Patient Data From n=300 Scans&amp;quot;,
    subtitle = &amp;quot;Vessel Ellipticity and Vessel Curvature Joint Distribution&amp;quot;,
    x = &amp;quot;Ellipticity&amp;quot;,
    y = &amp;quot;Curvature (mm)&amp;quot;
  )

ellip_pressure_plt &amp;lt;- sample_data %&amp;gt;%
  ggplot(aes(x = ellip, y = pressure)) +
  geom_point(alpha = .5, color = &amp;quot;firebrick&amp;quot;) +
  labs(
    title = &amp;quot;Patient Data From n=300 Scans&amp;quot;,
    subtitle = &amp;quot;Vessel Ellipticity and Blood Pressure Joint Distribution&amp;quot;,
    x = &amp;quot;Ellipticity&amp;quot;,
    y = &amp;quot;Pressure (mm Hg)&amp;quot;
  )

curv_pressure_plt &amp;lt;- sample_data %&amp;gt;%
  ggplot(aes(x = curv, y = pressure)) +
  geom_point(alpha = .5, color = &amp;quot;limegreen&amp;quot;) +
  labs(
    title = &amp;quot;Patient Data From n=300 Scans&amp;quot;,
    subtitle = &amp;quot;Vessel Curvature and Blood Pressure Joint Distribution&amp;quot;,
    x = &amp;quot;Curvature (mm)&amp;quot;,
    y = &amp;quot;Pressure (mm Hg&amp;quot;
  )

ellip_curv_mplt &amp;lt;- ggExtra::ggMarginal(ellip_curv_plt, type = &amp;quot;density&amp;quot;, fill = &amp;quot;#2c3e50&amp;quot;, alpha = .5)
ellip_pressure_mplt &amp;lt;- ggExtra::ggMarginal(ellip_pressure_plt, type = &amp;quot;density&amp;quot;, fill = &amp;quot;firebrick&amp;quot;, alpha = .5)
curv_pressure_mplt &amp;lt;- ggExtra::ggMarginal(curv_pressure_plt, type = &amp;quot;density&amp;quot;, fill = &amp;quot;limegreen&amp;quot;, alpha = .5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The variables are strictly positive and show some skew. Let’s assume that from domain knowledge we know these variables to be well described by a lognormal. The visuals would be consistent with this assumption.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;correlations-in-the-original-dataset&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Correlations in the Original Dataset&lt;/h1&gt;
&lt;p&gt;ggcorr() from the GGally package is very convenient for visualizing correlations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sample_data %&amp;gt;% ggcorr(
  high = &amp;quot;#20a486ff&amp;quot;,
  low = &amp;quot;#fde725ff&amp;quot;,
  label = TRUE,
  hjust = .75,
  size = 3,
  label_size = 3,
  label_round = 3,
  nbreaks = 3
) +
  labs(
    title = &amp;quot;Correlation Matrix - n=300 Patient Set&amp;quot;,
    subtitle = &amp;quot;Pearson Method Using Pairwise Observations&amp;quot;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;
We see that there are some positive correlations in this dataset.&lt;/p&gt;
&lt;p&gt;To build out the sample into a simulated population we will fit a MLE estimate and use the model to push out a lot of predictions.&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt; If the variables were not correlated, we could just execute a few rlnorm()’s and bind them together. The job is more challenging when the variables are correlated because they must be simulated all at once.&lt;/p&gt;
&lt;p&gt;I know of 2 convenient engines in R to generate an arbitrary number of random values from a correlated, multivariate distribution:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AnySim::SimCorrRVs&lt;/strong&gt; : For this method you specify the parameters of the marginal distributions and correlation matrix.&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;br /&gt;
&lt;strong&gt;mass::mvnorm()&lt;/strong&gt; : For this method you transform each distribution to normal and supply the mean and sd of each variable along with the covariance matrix.&lt;/p&gt;
&lt;p&gt;My personal preference is for the AnySim method which I’ll show below. The code for executing a similar simulation with mass::mvnorm() is shown in Appendix A.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;anysim---generate-simulated-population-of-correlated-patient-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;AnySim - Generate Simulated Population of Correlated Patient Data&lt;/h1&gt;
&lt;p&gt;The AnySim workflow:&lt;/p&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:100%;height:500px;&#34; class=&#34;grViz html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;diagram&#34;:&#34;digraph flowchart {\n      # node definitions with substituted label text\n      node [fontname = Helvetica, shape = rectangle, fillcolor = yellow]        \n      tab1 [label = \&#34;Step 1: Specify desired distributions for each variable and store as object\&#34;]\n      tab2 [label = \&#34;Step 2: Specify parameters for each variable and store as object\&#34;]\n      tab3 [label = \&#34;Step 3: Specify desired correlation matrix and store as object\&#34;]\n      tab4 [label = \&#34;Step 4: Provide the above information to EstCorrRVs() to estimate\n parameters of auxiliary Gaussian model\&#34;]\n      tab5 [label = \&#34;Step 5: Generate simulated values using SimcorrRVs()\&#34;]\n      # edge definitions with the node IDs\n      tab1 -&gt; tab2 -&gt; tab3 -&gt; tab4 -&gt; tab5;\n      }\n      &#34;,&#34;config&#34;:{&#34;engine&#34;:&#34;dot&#34;,&#34;options&#34;:null}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;p&gt;First: fit distributions to the original data and calculate correlations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ellip_fit &amp;lt;- fitdist(sample_data$ellip, &amp;quot;lnorm&amp;quot;)
curv_fit &amp;lt;- fitdist(sample_data$curv, &amp;quot;lnorm&amp;quot;)
pressure_fit &amp;lt;- fitdist(sample_data$pressure, &amp;quot;lnorm&amp;quot;)

# store lognormal parameters of original data
ellip_meanlog &amp;lt;- ellip_fit$estimate[[&amp;quot;meanlog&amp;quot;]]
ellip_sdlog &amp;lt;- ellip_fit$estimate[[&amp;quot;sdlog&amp;quot;]]
curv_meanlog &amp;lt;- curv_fit$estimate[[&amp;quot;meanlog&amp;quot;]]
curv_sdlog &amp;lt;- curv_fit$estimate[[&amp;quot;sdlog&amp;quot;]]
pressure_meanlog &amp;lt;- pressure_fit$estimate[[&amp;quot;meanlog&amp;quot;]]
pressure_sdlog &amp;lt;- pressure_fit$estimate[[&amp;quot;sdlog&amp;quot;]]

# store correlations in original data
cor_ec &amp;lt;- cor(x = sample_data$ellip, y = sample_data$curv)
cor_ep &amp;lt;- cor(x = sample_data$ellip, y = sample_data$pressure)
cor_cp &amp;lt;- cor(x = sample_data$curv, y = sample_data$pressure)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Apply the AnySim workflow. Note that this too goes through an auxiliary normal intermediate step.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1234)

# Define the target distribution functions (ICDFs) of each random variable.

ellip_dist &amp;lt;- &amp;quot;qlnorm&amp;quot;
curv_dist &amp;lt;- &amp;quot;qlnorm&amp;quot;
pressure_dist &amp;lt;- &amp;quot;qlnorm&amp;quot;

# store the 3 ICDFs in a vector
dist_vec &amp;lt;- c(ellip_dist, curv_dist, pressure_dist)

# Define the parameters of the target distribution functions - store them in a list
ellip_params &amp;lt;- list(meanlog = ellip_meanlog, sdlog = ellip_sdlog)
curv_params &amp;lt;- list(meanlog = curv_meanlog, sdlog = curv_sdlog)
pressure_params &amp;lt;- list(meanlog = pressure_meanlog, sdlog = pressure_sdlog)

# this is a weird way to do it but I&amp;#39;m following along with an example from AnySim vignette :)
params_list &amp;lt;- list(NULL)
params_list[[1]] &amp;lt;- ellip_params
params_list[[2]] &amp;lt;- curv_params
params_list[[3]] &amp;lt;- pressure_params

# Define the target correlation matrix.
corr_matrix &amp;lt;- matrix(c(
  1, 0.268, 0.369,
  0.268, 1, .213,
  0.369, 0.213, 1
),
ncol = 3,
nrow = 3,
byrow = T
)
# Estimate the parameters of the auxiliary Gaussian model.
aux_gaussion_param_tbl &amp;lt;- EstCorrRVs(
  R = corr_matrix, dist = dist_vec, params = params_list,
  NatafIntMethod = &amp;quot;GH&amp;quot;, NoEval = 9, polydeg = 8
)


# Generate 10000 synthetic realizations of the 3 correlated RVs.
correlated_ln_draws_tbl &amp;lt;- as_tibble(SimCorrRVs(n = 10000, paramsRVs = aux_gaussion_param_tbl)) %&amp;gt;%
  rename(
    ellip = V1,
    curv = V2,
    pressure = V3
  )

correlated_ln_draws_tbl %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;ellip&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;curv&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pressure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.123496&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.674471&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;78.14516&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.234755&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.927320&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;104.53631&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.299794&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.071074&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;116.55043&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.045001&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.721336&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;106.23896&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.246727&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.091741&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;102.83055&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.252843&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.869394&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;93.24030&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.169606&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.613312&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;125.12784&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.171699&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.921069&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;156.06701&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.170371&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.672507&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;145.82545&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.146382&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.030319&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;89.91766&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Evaluate recovered marginal distributions with some helper functions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;extract_params_sim_fcn &amp;lt;- function(var, fit_to) {
  tidy(fitdistr(correlated_ln_draws_tbl %&amp;gt;% pull(var), fit_to)) %&amp;gt;%
    mutate(
      var = {
        var
      },
      dataset = &amp;quot;sim_draws&amp;quot;
    )
}

extract_params_pat_fcn &amp;lt;- function(var, fit_to) {
  tidy(fitdistr(sample_data %&amp;gt;% pull(var), fit_to)) %&amp;gt;%
    mutate(
      var = {
        var
      },
      dataset = &amp;quot;patient_set&amp;quot;
    )
}

sim_results_tbl &amp;lt;- tibble(
  var = c(&amp;quot;ellip&amp;quot;, &amp;quot;curv&amp;quot;, &amp;quot;pressure&amp;quot;),
  fit_to = rep(&amp;quot;lognormal&amp;quot;, 3)
) %&amp;gt;%
  mutate(params = map2(.x = var, .y = fit_to, .f = extract_params_sim_fcn)) %&amp;gt;%
  unnest() %&amp;gt;%
  dplyr::select(-var1)

pat_results_tbl &amp;lt;- tibble(
  var = c(&amp;quot;ellip&amp;quot;, &amp;quot;curv&amp;quot;, &amp;quot;pressure&amp;quot;),
  fit_to = rep(&amp;quot;lognormal&amp;quot;, 3)
) %&amp;gt;%
  mutate(params = map2(.x = var, .y = fit_to, .f = extract_params_pat_fcn)) %&amp;gt;%
  unnest() %&amp;gt;%
  dplyr::select(-var1)

sim_results_tbl %&amp;gt;%
  bind_rows(pat_results_tbl) %&amp;gt;%
  select(-std.error) %&amp;gt;%
  pivot_wider(id_cols = everything(), names_from = &amp;quot;dataset&amp;quot;, values_from = &amp;quot;estimate&amp;quot;) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;var&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;fit_to&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;term&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;sim_draws&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;patient_set&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;lognormal&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1936145&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1932254&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;lognormal&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0628128&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0636092&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;lognormal&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.1561942&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.1579793&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;lognormal&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.3114360&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.3091606&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;lognormal&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.7841496&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.7831767&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;lognormal&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1896585&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1910081&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Evaluate recovered correlations:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;correlated_ln_draws_tbl %&amp;gt;% ggcorr(
  high = &amp;quot;#20a486ff&amp;quot;,
  low = &amp;quot;#fde725ff&amp;quot;,
  label = TRUE,
  hjust = .75,
  size = 3,
  label_size = 3,
  label_round = 3,
  nbreaks = 3
) +
  labs(
    title = &amp;quot;Correlation Matrix - n=10000 Simulation Set&amp;quot;,
    subtitle = &amp;quot;Pearson Method Using Pairwise Observations&amp;quot;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Let’s take a look at the simulated population:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fig &amp;lt;- plotly::plot_ly()

fig &amp;lt;- fig %&amp;gt;% add_trace(x = correlated_ln_draws_tbl$ellip, y = correlated_ln_draws_tbl$curv, z = correlated_ln_draws_tbl$pressure, type = &amp;quot;scatter3d&amp;quot;, opacity = .4, hoverinfo = &amp;quot;none&amp;quot;, size = .1)

fig &amp;lt;- fig %&amp;gt;%
  layout(scene = list(
    xaxis = list(title = &amp;quot;ellip&amp;quot;),
    yaxis = list(title = &amp;quot;curv&amp;quot;),
    zaxis = list(title = &amp;quot;pressure&amp;quot;)
  )) %&amp;gt;%
  layout(scene = list(
    xaxis = list(showspikes = FALSE),
    yaxis = list(showspikes = FALSE),
    zaxis = list(showspikes = FALSE)
  ))

# fig&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/j1.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/j2.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;
&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/j3.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;kernel-density-estimation---map-density-contours-to-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Kernel Density Estimation - Map Density Contours to Data&lt;/h1&gt;
&lt;p&gt;The above tables and figures confirm the simulated population maintains the correlation structure and marginal distributions from the original sample as intended. The next step will be to build out some density estimates using a non-parametric, kernel density estimator. The reason we would want to do this is to understand the regions where data points are likely to fall and we can use the reference contours to identify the most extreme patients relative to the mode or to some region of interest.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Important Watch-Out&lt;/strong&gt; : The exact workflow for generating and applying the kernel density estimate may vary depending on the data type. The default kde procedures may assign probabilities to regions outside the rigid boundaries when data does not have infinite support. This will occur for our dataset, since all of our variables are lognormal and should therefore never be negative. Methods for addressing this behavior include variable bandwidth estimators, transformations of estimators, and boundary estimators. To illustrate this problem and provide an example of resolution, I will show 2 parallel workflows below:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the first, I apply the default global bandwidth kde to the simulated data&lt;/li&gt;
&lt;li&gt;In the second, I transform the data from lognormal to normal, apply the kde, then backtransform to lognormal&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Towards that end, I’ll add some more variables for transformed, normal version of each variable:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;corr_draws_tbl &amp;lt;- correlated_ln_draws_tbl %&amp;gt;%
  mutate(
    ellip_n = log(ellip),
    curv_n = log(curv),
    pressure_n = log(pressure)
  ) %&amp;gt;%
  select(ellip_n, curv_n, pressure_n)

corr_draws_tbl %&amp;gt;%
  head(5) %&amp;gt;%
  kable(aalign = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;ellip_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;curv_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;pressure_n&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.1164450&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5154974&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.358568&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.2108725&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.3679574&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.649534&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.2622058&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.4039068&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.758324&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.0440175&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0011228&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.665691&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0.2205217&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.6276199&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.633083&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Quick visual check to verify the transformed properly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;corr_draws_tbl %&amp;gt;%
  pivot_longer(cols = everything()) %&amp;gt;%
  ggplot(aes(x = value)) +
  geom_density() +
  facet_wrap(~name, scales = &amp;quot;free&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/figure-html/unnamed-chunk-16-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;naive-method---apply-default-kde-to-lognormal-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Naive Method - Apply Default KDE to Lognormal Data&lt;/h1&gt;
&lt;div id=&#34;estimate-kde&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Estimate kde&lt;/h2&gt;
&lt;p&gt;The kde is constructed as follows:&lt;a href=&#34;#fn4&#34; class=&#34;footnote-ref&#34; id=&#34;fnref4&#34;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This first chunk converts the data and generates the kde. The bandwidth parameters controls the “smoothness” or granularity of the estimate and can be hard to specify in multiple dimensions. Hscv() provides a method of determining a reasonable bandwidth through cross-validation; see documentation in footnotes for more information if interested.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# convert simulated data tibble to matrix
d3m &amp;lt;- correlated_ln_draws_tbl %&amp;gt;%
  as.matrix()

# cross-validated bandwidth for kd (takes a while to calculate)
# hscv1 &amp;lt;- Hscv(correlated_ln_draws_tbl)
# hscv1 %&amp;gt;% write_rds(here::here(&amp;quot;hscv1.rds&amp;quot;))

hscv1 &amp;lt;- read_rds(here::here(&amp;quot;hscv1.rds&amp;quot;))

# generate kernel density estimate from simulated population
kd_d3m &amp;lt;- ks::kde(d3m, H = hscv1, compute.cont = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;density-proportions-from-kde-estimate&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Density proportions from kde estimate&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# see the kde&amp;#39;s calculated density thresholds for specified proportions
cont_vals_tbl &amp;lt;- tidy(kd_d3m$cont) %&amp;gt;%
  mutate(n_row = row_number()) %&amp;gt;%
  mutate(probs = 100 - n_row) %&amp;gt;%
  select(probs, x)

reference_grid_probs_tbl &amp;lt;- cont_vals_tbl %&amp;gt;%
  rename(estimate = x)

reference_grid_probs_tbl %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;probs&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;99&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0342569&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;98&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0333578&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;97&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0326260&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0318672&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;95&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0312299&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;94&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0305985&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;93&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0300632&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;92&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0293781&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;91&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0289256&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;90&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0283849&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;kde-estimates-in-the-range-of-the-variables&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;KDE estimates in the range of the variables&lt;/h2&gt;
&lt;p&gt;By default the KDE provides density estimates for a grid of points that covers the space of the variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kd_grid_estimates &amp;lt;- kd_d3m&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we want to know the value at each point in the simulated population we use the eval.points argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_estimates &amp;lt;- ks::kde(
  x = d3m, H = hscv1,
  compute.cont = TRUE,
  eval.points = correlated_ln_draws_tbl %&amp;gt;% as.matrix()
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are a couple different ways to convert the kde object features into a tibble:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_est_tbl_10000 &amp;lt;- tibble(estimate = mc_estimates$estimate) %&amp;gt;%
  bind_cols(correlated_ln_draws_tbl)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kd_grid_est_tbl_29k &amp;lt;- broom:::tidy.kde(kd_grid_estimates) %&amp;gt;%
  pivot_wider(names_from = variable, values_from = value) %&amp;gt;%
  rename(ellip = x1, curv = x2, pressure = x3) %&amp;gt;%
  select(-obs)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each data point in our population has a estimate. Each data point on the grid that covers the space of interest has an estimate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_est_tbl_10000 %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ellip&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;curv&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pressure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0032540&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.123496&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.674471&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;78.14516&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0167218&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.234755&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.927320&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;104.53631&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0114561&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.299794&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.071074&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;116.55043&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0042927&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.045001&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.721336&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;106.23896&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0050883&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.246727&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.091741&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;102.83055&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0123645&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.252843&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.869394&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;93.24030&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0073055&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.169606&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.613312&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;125.12784&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0061654&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.171699&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.921069&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;156.06701&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0103851&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.170371&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.672507&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;145.82545&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.0158417&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.146382&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.030319&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;89.91766&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kd_grid_est_tbl_29k %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ellip&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;curv&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pressure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.8949674&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.9187883&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.9426092&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.9664302&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.9902511&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.0140720&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.0378929&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.0617138&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.0855347&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.1093556&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0080549&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30.51151&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The ks package automatically stores the quantiles of the estimate variable when calculating the kde. We can access those probability boundaries by sub-setting the kd object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 5% contour line from kd grid based on 10k MC data
percentile_5 &amp;lt;- kd_d3m[[&amp;quot;cont&amp;quot;]][&amp;quot;5%&amp;quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify that 5% (500/10,000) values fall below the threshold:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_est_tbl_10000 %&amp;gt;% filter(estimate &amp;lt;= percentile_5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 500 x 4
##    estimate ellip  curv pressure
##       &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1 0.000418  1.41  4.68    184. 
##  2 0.000377  1.30  7.27    144. 
##  3 0.000951  1.06  3.32     72.1
##  4 0.000704  1.28  7.10    125. 
##  5 0.000719  1.17  2.59     62.1
##  6 0.000114  1.47  3.18    189. 
##  7 0.000905  1.01  3.21    102. 
##  8 0.000182  1.06  5.85    103. 
##  9 0.000521  1.36  4.04    200. 
## 10 0.000742  1.40  3.58     97.4
## # ... with 490 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;500 / 10,000 is the correct coverage for the 5/95 boundary.&lt;/p&gt;
&lt;p&gt;If we wanted to know the nearest probability contour line for every point we could make a function to do so.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_probs_fcn &amp;lt;- function(value) {
  t &amp;lt;- reference_grid_probs_tbl %&amp;gt;%
    mutate(value = value) %&amp;gt;%
    mutate(dif = abs(estimate - value)) %&amp;gt;%
    filter(dif == min(dif))

  t[[1, 1]]
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Map the function over each value in the dataset.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# mc_1_to_99_tbl &amp;lt;- mc_est_tbl_10000 %&amp;gt;%
#   mutate(nearest_prob = map_dbl(estimate, get_probs_fcn))

# mc_1_to_99_tbl %&amp;gt;% write_rds(here::here(&amp;quot;mc_1_to_99_tbl.rds&amp;quot;))
mc_1_to_99_tbl &amp;lt;- read_rds(here::here(&amp;quot;mc_1_to_99_tbl.rds&amp;quot;))

mc_1_to_99_tbl&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 10,000 x 5
##    estimate ellip  curv pressure nearest_prob
##       &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
##  1  0.0140   1.27  2.16    133.            59
##  2  0.0119   1.20  4.44    127.            52
##  3  0.00265  1.38  2.65    160.            14
##  4  0.0194   1.24  3.62    142.            75
##  5  0.0122   1.32  2.54    129.            53
##  6  0.0168   1.26  3.24    147.            68
##  7  0.00555  1.33  3.39    168.            28
##  8  0.0112   1.24  4.25    146.            50
##  9  0.00826  1.19  1.85     87.7           39
## 10  0.00197  1.32  4.14     90.5           11
## # ... with 9,990 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the data, kde estimate, and nearest probability contour region boundary are stored in one tibble.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;density-plot-with-probability-contours-in-3d&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Density Plot with Probability Contours in 3d&lt;/h2&gt;
&lt;p&gt;Honestly, this part is pretty easy thanks to a built in plot.kde method. Just use the cont argument to specify with probability contours you want.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#plot(x = kd_d3m, cont = c(45, 70, 95), drawpoints = FALSE, col.pt = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/3d_cont_1.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Add points using the points3d function. In this case I add 2 sets, 1 for the 5% most extreme and 1 for the 95% most common.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# plot(x = kd_d3m, cont = c(95) ,drawpoints = FALSE, col.pt = 1)
mc_lowest_5_tbl &amp;lt;- mc_1_to_99_tbl %&amp;gt;% filter(estimate &amp;lt; percentile_5)
mc_6_to_100_tbl &amp;lt;- mc_1_to_99_tbl %&amp;gt;% filter(estimate &amp;gt;= percentile_5)

# points3d(x = mc_lowest_5_tbl$ellip, y = mc_lowest_5_tbl$curv, z = mc_lowest_5_tbl$pressure, color = &amp;quot;dodgerblue&amp;quot;,  size = 3, alpha = 1)

# points3d(x = mc_6_to_100_tbl$ellip, y = mc_6_to_100_tbl$curv, z = mc_6_to_100_tbl$pressure, color = &amp;quot;black&amp;quot;,  size = 3, alpha = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/3d_cont_2.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;See the problem here? In the areas on the lower right of the middle and right-most images, the data stops but the surface keeps going. This is because the data has a boundary there due to being log-normal but the kde doesn’t know. See closeup below.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/3dd3.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As previously mentioned, this can be addressed by using the normal dataset to fit the kde and then back-transforming both the data and the surface:&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;fit-kde-to-normal-data-transform-later&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Fit KDE to normal data transform later&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# convert simulated data tibble to matrix
d3m_n &amp;lt;- corr_draws_tbl %&amp;gt;%
  as.matrix()

# cross-validated bandwidth for kd (takes a while to calculate)
# hscv1_n &amp;lt;- Hscv(corr_draws_tbl)
# hscv1_n %&amp;gt;% write_rds(here::here(&amp;quot;hscv1_n.rds&amp;quot;))

hscv1_n &amp;lt;- read_rds(here::here(&amp;quot;hscv1_n.rds&amp;quot;))

# generate kernel density estimate from simulated population
kd_d3m_n &amp;lt;- ks::kde(d3m_n, H = hscv1_n, compute.cont = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;density-proportions-from-kde-estimate-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Density proportions from kde estimate&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# see the kde&amp;#39;s calculated density thresholds for specified proportions
cont_vals_tbl_n &amp;lt;- tidy(kd_d3m_n$cont) %&amp;gt;%
  mutate(n_row = row_number()) %&amp;gt;%
  mutate(probs = 100 - n_row) %&amp;gt;%
  select(probs, x)

reference_grid_probs_tbl_n &amp;lt;- cont_vals_tbl_n %&amp;gt;%
  rename(estimate = x)

reference_grid_probs_tbl_n %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;probs&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;99&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15.39736&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;98&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;14.90526&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;97&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;14.53676&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;14.16539&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;95&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13.86481&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;94&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13.56653&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;93&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13.26884&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;92&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12.98302&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;91&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12.72985&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;90&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12.51655&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;kde-estimates-in-the-range-of-the-variables-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;KDE estimates in the range of the variables&lt;/h2&gt;
&lt;p&gt;By default the KDE provides density estimates for a grid of points that covers the space of the variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kd_grid_estimates_n &amp;lt;- kd_d3m_n&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If we want to know the value at each point in the simulated population we use the eval.points argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_estimates_n &amp;lt;- ks::kde(
  x = d3m_n, H = hscv1_n,
  compute.cont = TRUE,
  eval.points = corr_draws_tbl %&amp;gt;% as.matrix()
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here are a couple different ways to convert the kde object features into a tibble:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_est_tbl_10000_n &amp;lt;- tibble(estimate = mc_estimates_n$estimate) %&amp;gt;%
  bind_cols(corr_draws_tbl)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kd_grid_est_tbl_29k_n &amp;lt;- broom:::tidy.kde(kd_grid_estimates_n) %&amp;gt;%
  pivot_wider(names_from = variable, values_from = value) %&amp;gt;%
  rename(ellip_n = x1, curv_n = x2, pressure_n = x3) %&amp;gt;%
  select(-obs)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each data point in our population has a estimate. Each data point on the grid that covers the space of interest has an estimate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_est_tbl_10000_n %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ellip_n&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;curv_n&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pressure_n&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.5480641&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1164450&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.5154974&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.358568&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;8.5263883&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2108725&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.3679574&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.649534&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;6.8647221&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2622058&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.4039068&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.758324&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.2686865&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0440175&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.0011228&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.665691&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;3.4062801&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2205217&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.6276199&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.633083&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;4.3754649&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2254152&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.0541010&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.535180&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.5083333&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1566667&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.4782892&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.829336&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;3.3766180&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1584546&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.0719497&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.050286&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5.0036132&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1573211&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.9830169&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.982410&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;4.9614118&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1366109&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.1086680&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.498894&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kd_grid_est_tbl_29k_n %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ellip_n&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;curv_n&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pressure_n&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0877414&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0685395&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0493375&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0301356&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.0109337&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0082682&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0274701&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0466721&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0658740&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0850759&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.4268598&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.811282&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 5% contour line from kd grid based on 10k MC data
percentile_5_n &amp;lt;- kd_d3m_n[[&amp;quot;cont&amp;quot;]][&amp;quot;5%&amp;quot;]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Verify that 5% (500/10,000) values fall below the threshold:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_est_tbl_10000_n %&amp;gt;% filter(estimate &amp;lt;= percentile_5_n)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 500 x 4
##    estimate ellip_n curv_n pressure_n
##       &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;
##  1   0.437   0.347   1.54        5.21
##  2   0.218   0.0546  1.20        4.28
##  3   0.346   0.114   0.375       4.43
##  4   0.340   0.121   0.382       4.96
##  5   0.0880  0.153   0.951       4.13
##  6   0.0860  0.387   1.16        5.24
##  7   0.268   0.0116  1.17        4.62
##  8   0.0957  0.0610  1.77        4.63
##  9   0.513   0.310   1.40        5.30
## 10   0.300   0.195   0.260       4.75
## # ... with 490 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;500 / 10,000 is the correct coverage for the 5/95 boundary.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_probs_fcn_n &amp;lt;- function(value) {
  t &amp;lt;- reference_grid_probs_tbl_n %&amp;gt;%
    mutate(value = value) %&amp;gt;%
    mutate(dif = abs(estimate - value)) %&amp;gt;%
    filter(dif == min(dif))

  t[[1, 1]]
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Map the function over each value in the dataset and then the grid.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# mc_1_to_99_tbl_n &amp;lt;- mc_est_tbl_10000_n %&amp;gt;%
#   mutate(nearest_prob = map_dbl(estimate, get_probs_fcn_n))
# #
# mc_1_to_99_tbl_n %&amp;gt;% write_rds(here::here(&amp;quot;mc_1_to_99_tbl_n.rds&amp;quot;))
mc_1_to_99_tbl_n &amp;lt;- read_rds(here::here(&amp;quot;mc_1_to_99_tbl_n.rds&amp;quot;))

mc_1_to_99_tbl_n&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 10,000 x 5
##    estimate ellip_n curv_n pressure_n nearest_prob
##       &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
##  1    0.548  0.116   0.515       4.36            5
##  2    8.53   0.211   1.37        4.65           70
##  3    6.86   0.262   1.40        4.76           58
##  4    1.27   0.0440  1.00        4.67           12
##  5    3.41   0.221   1.63        4.63           32
##  6    4.38   0.225   1.05        4.54           39
##  7    1.51   0.157   0.478       4.83           14
##  8    3.38   0.158   1.07        5.05           31
##  9    5.00   0.157   0.983       4.98           44
## 10    4.96   0.137   1.11        4.50           44
## # ... with 9,990 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_probs_tbl_n &amp;lt;- kd_grid_est_tbl_29k_n %&amp;gt;%
  mutate(nearest_prob = map_dbl(estimate, get_probs_fcn_n))

grid_probs_tbl_n %&amp;gt;% write_rds(here::here(&amp;quot;grid_probs_tbl_n.rds&amp;quot;))
grid_probs_tbl_n &amp;lt;- read_rds(here::here(&amp;quot;grid_probs_tbl_n.rds&amp;quot;))



grid_probs_95_n &amp;lt;- grid_probs_tbl_n %&amp;gt;%
  filter(nearest_prob == 95)

grid_probs_95_n %&amp;gt;% arrange(desc(nearest_prob))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 4 x 5
##   estimate ellip_n curv_n pressure_n nearest_prob
##      &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
## 1     13.8   0.162   1.09       4.68           95
## 2     13.8   0.200   1.20       4.68           95
## 3     13.8   0.219   1.20       4.74           95
## 4     13.8   0.219   1.09       4.80           95&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid_probs_95_n %&amp;gt;%
  head(5) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ellip_n&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;curv_n&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pressure_n&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;nearest_prob&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;13.84582&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1618836&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.094335&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.679386&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;13.75222&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2002874&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.195748&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.679386&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;13.83862&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2194893&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.195748&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.741394&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;13.83317&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2194893&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.094335&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.803401&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;95&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;density-plot-with-probability-contours-in-3d-1&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Density Plot with Probability Contours in 3d&lt;/h2&gt;
&lt;p&gt;Honestly, this part is pretty easy thanks to a built in plot.kde method. Just use the cont argument to specify with probability contours you want.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(x = kd_d3m_n, cont = c(45, 70, 95), drawpoints = FALSE, col.pt = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/pp1.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;and with points&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_lowest_5_tbl_n &amp;lt;- mc_1_to_99_tbl_n %&amp;gt;% filter(estimate &amp;lt; percentile_5_n)
mc_6_to_100_tbl_n &amp;lt;- mc_1_to_99_tbl_n %&amp;gt;% filter(estimate &amp;gt;= percentile_5_n)

plot(x = kd_d3m_n, cont = c(95), drawpoints = FALSE, col.pt = 1)


points3d(x = mc_lowest_5_tbl_n$ellip_n, y = mc_lowest_5_tbl_n$curv_n, z = mc_lowest_5_tbl_n$pressure_n, color = &amp;quot;dodgerblue&amp;quot;, size = 3, alpha = 1)

points3d(x = mc_6_to_100_tbl_n$ellip_n, y = mc_6_to_100_tbl_n$curv_n, z = mc_6_to_100_tbl_n$pressure_n, color = &amp;quot;black&amp;quot;, size = 3, alpha = 1)

points3d(x = grid_probs_tbl_n$ellip_n, y = grid_probs_tbl_n$curv_n, z = grid_probs_tbl_n$pressure_n, color = &amp;quot;firebrick&amp;quot;, size = 2, alpha = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/pp2.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;transform-data-and-kde-contour-to-original-scale&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Transform data and kde contour to original scale&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mc_lowest_5_tbl_nbt &amp;lt;- mc_lowest_5_tbl_n %&amp;gt;% mutate(
  ellip_bt = exp(ellip_n),
  curv_bt = exp(curv_n),
  pressure_bt = exp(pressure_n)
)
mc_6_to_100_tbl_nbt &amp;lt;- mc_6_to_100_tbl_n %&amp;gt;% mutate(
  ellip_bt = exp(ellip_n),
  curv_bt = exp(curv_n),
  pressure_bt = exp(pressure_n)
)

full_mc_bt_tbl &amp;lt;- mc_lowest_5_tbl_nbt %&amp;gt;%
  bind_rows(mc_6_to_100_tbl_nbt)

grid_probs_95_bt &amp;lt;- grid_probs_tbl_n %&amp;gt;%
  filter(nearest_prob == 05) %&amp;gt;%
  mutate(
    ellip_bt = exp(ellip_n),
    curv_bt = exp(curv_n),
    pressure_bt = exp(pressure_n)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;plot-back-transformed-data-with-plotly&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plot Back-Transformed Data with Plotly&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fig &amp;lt;- plotly::plot_ly()

fig &amp;lt;- fig %&amp;gt;% add_trace(x = grid_probs_95_bt$ellip_bt, y = grid_probs_95_bt$curv_bt, z = grid_probs_95_bt$pressure_bt, type = &amp;quot;mesh3d&amp;quot;, alphahull = 0, opacity = .5, hoverinfo = &amp;quot;none&amp;quot;)


fig &amp;lt;- fig %&amp;gt;% add_trace(x = mc_lowest_5_tbl_nbt$ellip_bt, y = mc_lowest_5_tbl_nbt$curv_bt, z = mc_lowest_5_tbl_nbt$pressure_bt, type = &amp;quot;scatter3d&amp;quot;, size = 30)

fig &amp;lt;- fig %&amp;gt;% add_trace(x = mc_6_to_100_tbl_nbt$ellip_bt, y = mc_6_to_100_tbl_nbt$curv_bt, z = mc_6_to_100_tbl_nbt$pressure_bt, type = &amp;quot;scatter3d&amp;quot;, size = 30)

fig &amp;lt;- fig %&amp;gt;%
  layout(scene = list(
    xaxis = list(title = &amp;quot;ellip&amp;quot;),
    yaxis = list(title = &amp;quot;curv&amp;quot;),
    zaxis = list(title = &amp;quot;pressure&amp;quot;)
  )) %&amp;gt;%
  layout(scene = list(
    xaxis = list(showspikes = FALSE),
    yaxis = list(showspikes = FALSE),
    zaxis = list(showspikes = FALSE)
  ))

# fig&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/pp3.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The new image (shown on right) looks different near the boundary. Because we transformed everything from normal, no portion of the contour goes beyond the point cloud. This is what we want!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;filter-extreme-points-and-assess-points-on-95-5-contour&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Filter extreme points and assess points on 95-5 contour&lt;/h1&gt;
&lt;p&gt;Now that our kde contour is set up to properly segregate the extreme points relative to the mode, we can filter them away and assess the remaining points which lie on the contour. We do this by pulling the grid points that make up the 95/5 surface and evaluating them as percentiles.&lt;/p&gt;
&lt;p&gt;First, the ecdfs to get the percentiles from each variable&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;e1f &amp;lt;- ecdf(full_mc_bt_tbl$ellip_bt)
e2f &amp;lt;- ecdf(full_mc_bt_tbl$curv_bt)
e3f &amp;lt;- ecdf(full_mc_bt_tbl$pressure_bt)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Map ecdfs over the variables and then use the sum of the percentiles as a way to identify the largest values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;full_probs_95_tbl &amp;lt;- grid_probs_95_bt %&amp;gt;%
  rowwise() %&amp;gt;%
  mutate(
    percentile_e = map_dbl(ellip_bt, e1f),
    percentile_c = map_dbl(curv_bt, e2f),
    percentile_p = map_dbl(pressure_bt, e3f)
  ) %&amp;gt;%
  rowwise() %&amp;gt;%
  mutate(pct_sum = sum(c(percentile_e, percentile_c, percentile_p))) %&amp;gt;%
  ungroup() %&amp;gt;%
  arrange(desc(pct_sum)) %&amp;gt;%
  mutate(pct_sum_rank = row_number()) %&amp;gt;%
  select(ellip_bt, curv_bt, pressure_bt, percentile_e, percentile_c, percentile_p, pct_sum)

full_probs_95_tbl %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;ellip_bt&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;curv_bt&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pressure_bt&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentile_e&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentile_c&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;percentile_p&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pct_sum&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.40&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.49&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;176.88&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.99&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.98&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.37&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.49&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;188.19&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.97&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.99&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.37&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6.08&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;166.24&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.97&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.98&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.34&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.49&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;188.19&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.95&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.99&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.37&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;188.19&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.97&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.92&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.99&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.42&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;166.24&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.92&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.32&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6.08&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;176.88&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.91&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.98&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.98&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.87&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.32&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.49&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;188.19&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.91&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.96&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.99&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.86&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.40&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.48&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;188.19&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.99&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.86&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.99&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.42&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.48&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;176.88&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.86&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.98&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.84&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Finally, we can show a few of the points with large percentiles on the 95/5 surface:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top_10 &amp;lt;- full_probs_95_tbl %&amp;gt;%
  head(10)

fig &amp;lt;- fig %&amp;gt;% add_trace(x = top_10$ellip_bt, y = top_10$curv_bt, z = top_10$pressure_bt, type = &amp;quot;scatter3d&amp;quot;, size = 30)

# fig&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/pp4.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/pp5.png&#34; style=&#34;width:100.0%;height:100.0%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;And there we have it! 10 candidate points representing credible points on the edge of the 5% probability region for 3 correlated lognormal variables with proper treatment of the boundary.&lt;/p&gt;
&lt;p&gt;If you’ve made it this far, I thank you. Here are a couple appendices as a reward!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-a---simulating-a-multivariate-distribution-with-mass-mvnorm&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Appendix A - simulating a multivariate distribution with mass mvnorm&lt;/h1&gt;
&lt;p&gt;Workflow:&lt;/p&gt;
&lt;div id=&#34;htmlwidget-2&#34; style=&#34;width:100%;height:500px;&#34; class=&#34;grViz html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-2&#34;&gt;{&#34;x&#34;:{&#34;diagram&#34;:&#34;digraph flowchart {\n      # node definitions with substituted label text\n      node [fontname = Helvetica, shape = rectangle, fillcolor = yellow]        \n      tab1 [label = \&#34;Step 1: Fit distributions to each variable in the original dataset.\n Note parameters, correlations, covariances in original data\&#34;]\n      tab2 [label = \&#34;Step 2: Transform all variables to normal\&#34;]\n      tab3 [label = \&#34;Step 3: Fit normal distributions to each transformed variablet.\n Note parameters, correlations, covariances in transformed data\&#34;]\n      tab4 [label = \&#34;Step 4: Draw joint distribution using MASS::mvrnorm() or equivalent function.\n Use parameters and covariance matrix from normal, transformed data\&#34;]\n      tab5 [label = \&#34;Step 5: Back-transform simulated data to original distribution\&#34;]\n      tab6 [label = \&#34;Step 6: Evaluate parameters and marginal distributions of the back-transfomed data.\n Compare to raw, original data to see if marginals and correlations were recreated in the sim\&#34;]\n      # edge definitions with the node IDs\n      tab1 -&gt; tab2 -&gt; tab3 -&gt; tab4 -&gt; tab5 -&gt; tab6;\n      }\n      &#34;,&#34;config&#34;:{&#34;engine&#34;:&#34;dot&#34;,&#34;options&#34;:null}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;div id=&#34;step-1---fit-distributions-to-each-variable&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 1 - Fit Distributions to Each Variable&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ellip_fit &amp;lt;- fitdist(sample_data$ellip, &amp;quot;lnorm&amp;quot;)
curv_fit &amp;lt;- fitdist(sample_data$curv, &amp;quot;lnorm&amp;quot;)
pressure_fit &amp;lt;- fitdist(sample_data$pressure, &amp;quot;lnorm&amp;quot;)

# store lognormal parameters of original data
ellip_meanlog &amp;lt;- ellip_fit$estimate[[&amp;quot;meanlog&amp;quot;]]
ellip_sdlog &amp;lt;- ellip_fit$estimate[[&amp;quot;sdlog&amp;quot;]]
curv_meanlog &amp;lt;- curv_fit$estimate[[&amp;quot;meanlog&amp;quot;]]
curv_sdlog &amp;lt;- curv_fit$estimate[[&amp;quot;sdlog&amp;quot;]]
pressure_meanlog &amp;lt;- pressure_fit$estimate[[&amp;quot;meanlog&amp;quot;]]
pressure_sdlog &amp;lt;- pressure_fit$estimate[[&amp;quot;sdlog&amp;quot;]]

# store correlations in original data
cor_ec &amp;lt;- cor(x = sample_data$ellip, y = sample_data$curv)
cor_ep &amp;lt;- cor(x = sample_data$ellip, y = sample_data$pressure)
cor_cp &amp;lt;- cor(x = sample_data$curv, y = sample_data$pressure)

# store covariances in original data
cov_ellip_curv &amp;lt;- cov(x = sample_data$ellip, y = sample_data$curv)
cov_ellip_ellip &amp;lt;- cov(x = sample_data$ellip, y = sample_data$ellip)
cov_curv_curv &amp;lt;- cov(x = sample_data$curv, y = sample_data$curv)
cov_ellip_pressure &amp;lt;- cov(x = sample_data$ellip, y = sample_data$pressure)
cov_pressure_pressure &amp;lt;- cov(x = sample_data$pressure, y = sample_data$pressure)
cov_curv_pressure &amp;lt;- cov(x = sample_data$curv, y = sample_data$pressure)

# summarize the parameters and reshape a bit
original_data_param_tbl &amp;lt;- tibble(
  ellip_meanlog = ellip_meanlog,
  ellip_sdlog = ellip_sdlog,
  curv_meanlog = curv_meanlog,
  curv_sdlog = curv_sdlog,
  pressure_meanlog = pressure_meanlog,
  pressure_sdlog = pressure_sdlog,
  ellip_curv_correlation = cor_ec,
  ellip_pressure_correlation = cor_ep,
  curv_pressure_correlation = cor_cp,
  ellip_ellip_covariance = cov_ellip_ellip,
  ellip_curv_covariance = cov_ellip_curv,
  curv_curv_covariance = cov_curv_curv,
  ellip_pressure_covariance = cov_ellip_pressure,
  pressure_pressure_covariance = cov_pressure_pressure,
  curv_pressure_covariance = cov_curv_pressure
) %&amp;gt;%
  pivot_longer(cols = everything(), names_to = &amp;quot;feature&amp;quot;, values_to = &amp;quot;value&amp;quot;) %&amp;gt;%
  mutate(dataset = &amp;quot;original_data&amp;quot;) %&amp;gt;%
  mutate_if(is.character, as_factor)

# View summary table of original data
original_data_param_tbl %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;, digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;feature&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;value&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;dataset&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.193&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.064&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.158&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.309&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure_meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.783&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure_sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.191&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_curv_correlation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.268&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_pressure_correlation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.369&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_pressure_correlation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.213&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_ellip_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.006&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_curv_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.022&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_curv_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.157&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_pressure_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.659&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure_pressure_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;530.683&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_pressure_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.285&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;original_data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2---transform-all-variables-to-normal&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 2 - Transform all variables to normal&lt;/h2&gt;
&lt;p&gt;A simple log operation brings the lognormal variable to normal.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# transform original, lognormal data to normal
normal_sample_data &amp;lt;- sample_data %&amp;gt;%
  mutate(
    n_ellip = log(ellip),
    n_curv = log(curv),
    n_pressure = log(pressure)
  )

normal_sample_data %&amp;gt;%
  head() %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;, digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;ellip&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;curv&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pressure&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;n_ellip&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;n_curv&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;n_pressure&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.255&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.506&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;92.739&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.228&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.505&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.530&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.285&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.019&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;182.970&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.251&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.613&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.209&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.289&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.027&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;153.858&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.254&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.393&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.036&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.234&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.139&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;108.669&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.210&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.760&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.688&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.133&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.673&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;123.633&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.125&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.301&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.817&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;1.219&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.373&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;113.944&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.198&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.864&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.736&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3---fit-normal-distributions-to-each-transformed-variable&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 3 - Fit normal distributions to each transformed variable&lt;/h2&gt;
&lt;p&gt;We don’t actually have to formally fit normal distributions since it is convenient to obtain the mean and standard deviation at any time using the mean() or sd() functions. But we will extract and store correlations and covariances for the simulation to come.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# get correlations of transformed, normal data
ncor_ec &amp;lt;- cor(
  x = normal_sample_data$n_ellip,
  normal_sample_data$n_curv
)
ncor_ep &amp;lt;- cor(
  x = normal_sample_data$n_ellip,
  normal_sample_data$n_pressure
)
ncor_cp &amp;lt;- cor(
  x = normal_sample_data$n_curv,
  normal_sample_data$n_pressure
)

# get covariance of transformed, normal data
n_cov_ellip_curv &amp;lt;- cov(
  x = normal_sample_data$n_ellip,
  y = normal_sample_data$n_curv
)
n_cov_ellip_ellip &amp;lt;- cov(
  x = normal_sample_data$n_ellip,
  y = normal_sample_data$n_ellip
)
n_cov_curv_curv &amp;lt;- cov(
  x = normal_sample_data$n_curv,
  y = normal_sample_data$n_curv
)

n_cov_ellip_pressure &amp;lt;- cov(
  x = normal_sample_data$n_ellip,
  y = normal_sample_data$n_pressure
)
n_cov_pressure_pressure &amp;lt;- cov(
  x = normal_sample_data$n_pressure,
  y = normal_sample_data$n_pressure
)
n_cov_curv_pressure &amp;lt;- cov(
  x = normal_sample_data$n_curv,
  y = normal_sample_data$n_pressure
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4---draw-joint-distribution-using-mvrnorm-or-equivalent-function&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 4 - Draw joint distribution using mvrnorm() or equivalent function&lt;/h2&gt;
&lt;p&gt;Time to actually draw the correlated values. I store them here in an object called mult_norm.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# draw from multivariate normal with parameters from transformed normal distributions and correlation
set.seed(0118)

mult_norm &amp;lt;- as_tibble(MASS::mvrnorm(
  10000, c(
    mean(normal_sample_data$n_ellip),
    mean(normal_sample_data$n_curv),
    mean(normal_sample_data$n_pressure)
  ),
  matrix(c(
    n_cov_ellip_ellip,
    n_cov_ellip_curv,
    n_cov_ellip_pressure,
    n_cov_ellip_curv,
    n_cov_curv_curv,
    n_cov_curv_pressure,
    n_cov_ellip_pressure,
    n_cov_curv_pressure,
    n_cov_pressure_pressure
  ), 3, 3)
)) %&amp;gt;%
  rename(
    n_ellip_sim = V1,
    n_curv_sim = V2,
    n_pressure_sim = V3
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-5---back-transform-simulated-data-to-original-distribution&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 5 - Back-transform simulated data to original distribution&lt;/h2&gt;
&lt;p&gt;Exponentiating the data brings it back to lognormal.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# convert back to lognormal
log_norm &amp;lt;- mult_norm %&amp;gt;%
  mutate(
    ellip_sim = exp(n_ellip_sim),
    curv_sim = exp(n_curv_sim),
    pressure_sim = exp(n_pressure_sim)
  )

log_norm %&amp;gt;%
  head() %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;, digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;n_ellip_sim&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;n_curv_sim&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;n_pressure_sim&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ellip_sim&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;curv_sim&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;pressure_sim&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.254&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.600&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.248&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.290&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.952&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;190.266&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.233&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.038&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.107&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.262&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.823&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;165.178&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.236&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.152&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.812&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.266&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.165&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;123.018&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.313&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.003&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.048&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.368&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.727&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;155.636&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.224&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.622&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.192&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.251&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.066&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;179.912&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.197&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.486&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.822&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.218&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.422&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;124.185&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;step-6---evaluate-parameters-and-marginal-distributions-of-the-back-transfomed-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 6 - Evaluate parameters and marginal distributions of the back-transfomed data&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# evaluate the marginal distributions of the simulated data
ellip_sim_fit &amp;lt;- fitdistrplus::fitdist(log_norm$ellip_sim, &amp;quot;lnorm&amp;quot;)
curv_sim_fit &amp;lt;- fitdistrplus::fitdist(log_norm$curv_sim, &amp;quot;lnorm&amp;quot;)
pressure_sim_fit &amp;lt;- fitdistrplus::fitdist(log_norm$pressure_sim, &amp;quot;lnorm&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Obtain and store the correlation, covariances, and parameters of simulated set:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# get correlation and covariances of simulated data
sim_cor_ec &amp;lt;- cor(x = log_norm$ellip_sim, log_norm$curv_sim)
sim_cor_ep &amp;lt;- cor(x = log_norm$ellip_sim, log_norm$pressure_sim)
sim_cor_cp &amp;lt;- cor(x = log_norm$curv_sim, log_norm$pressure_sim)

sim_cov_ellip_curv &amp;lt;- cov(x = log_norm$ellip_sim, y = log_norm$curv_sim)
sim_cov_ellip_ellip &amp;lt;- cov(x = log_norm$ellip_sim, y = log_norm$ellip_sim)
sim_cov_curv_curv &amp;lt;- cov(x = log_norm$curv_sim, y = log_norm$curv_sim)

sim_cov_ellip_pressure &amp;lt;- cov(x = log_norm$ellip_sim, y = log_norm$pressure_sim)
sim_cov_pressure_pressure &amp;lt;- cov(x = log_norm$pressure_sim, y = log_norm$pressure_sim)
sim_cov_curv_pressure &amp;lt;- cov(x = log_norm$curv_sim, y = log_norm$pressure_sim)

# store parameters of simulated data
ellip_sim_meanlog &amp;lt;- ellip_sim_fit$estimate[[&amp;quot;meanlog&amp;quot;]]
ellip_sim_sdlog &amp;lt;- ellip_sim_fit$estimate[[&amp;quot;sdlog&amp;quot;]]
curv_sim_meanlog &amp;lt;- curv_sim_fit$estimate[[&amp;quot;meanlog&amp;quot;]]
curv_sim_sdlog &amp;lt;- curv_sim_fit$estimate[[&amp;quot;sdlog&amp;quot;]]
pressure_sim_meanlog &amp;lt;- pressure_sim_fit$estimate[[&amp;quot;meanlog&amp;quot;]]
pressure_sim_sdlog &amp;lt;- pressure_sim_fit$estimate[[&amp;quot;sdlog&amp;quot;]]

# collect parameters from simulated data
sim_data_param_tbl &amp;lt;- tibble(
  ellip_meanlog = ellip_sim_meanlog,
  ellip_sdlog = ellip_sim_sdlog,
  curv_meanlog = curv_sim_meanlog,
  curv_sdlog = curv_sim_sdlog,
  pressure_meanlog = pressure_sim_meanlog,
  pressure_sdlog = pressure_sim_sdlog,

  ellip_curv_correlation = sim_cor_ec,
  ellip_pressure_correlation = sim_cor_ep,
  curv_pressure_correlation = sim_cor_cp,

  ellip_curv_covariance = sim_cov_ellip_curv,
  ellip_ellip_covariance = sim_cov_ellip_ellip,
  curv_curv_covariance = sim_cov_curv_curv,

  ellip_pressure_covariance = sim_cov_ellip_pressure,
  pressure_pressure_covariance = sim_cov_pressure_pressure,
  curv_pressure_covariance = sim_cov_curv_pressure
) %&amp;gt;%
  pivot_longer(cols = everything(), names_to = &amp;quot;feature&amp;quot;, values_to = &amp;quot;value&amp;quot;) %&amp;gt;%
  mutate(dataset = &amp;quot;simulated_data&amp;quot;) %&amp;gt;%
  mutate_if(is.character, as_factor)

sim_data_param_tbl %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;feature&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;value&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;dataset&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1932042&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0630117&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.1626798&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.3092643&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure_meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.7878497&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure_sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1900026&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_curv_correlation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.2505145&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_pressure_correlation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.3644292&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_pressure_correlation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.1956149&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_curv_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0203344&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_ellip_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.0058779&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_curv_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.1209300&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_pressure_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.6534943&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure_pressure_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;547.0647415&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_pressure_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.8440727&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;simulated_data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;compare-original-data-to-simulated-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Compare Original Data to Simulated Data&lt;/h2&gt;
&lt;p&gt;A bit more wrangling let’s us compare the feature of the original dataset to the new, simulated population to see if they agree.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;compare_tbl &amp;lt;- bind_rows(original_data_param_tbl, sim_data_param_tbl) %&amp;gt;%
  pivot_wider(id_cols = everything(), names_from = dataset)

compare_tbl %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;, digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;feature&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;original_data&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;simulated_data&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.193&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.193&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.064&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.063&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.158&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.163&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.309&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.309&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure_meanlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.783&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.788&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure_sdlog&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.191&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.190&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_curv_correlation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.268&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.251&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_pressure_correlation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.369&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.364&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_pressure_correlation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.213&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.196&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_ellip_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.006&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.006&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_curv_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.022&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_curv_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.157&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.121&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ellip_pressure_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.659&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.653&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;pressure_pressure_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;530.683&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;547.065&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;curv_pressure_covariance&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.285&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.844&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-b---2d-kde-plot-with-probability-traces&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Appendix B - 2d kde plot with probability traces&lt;/h1&gt;
&lt;p&gt;First, select the 2 variables of interest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;d &amp;lt;- correlated_ln_draws_tbl %&amp;gt;% select(ellip, curv)

## density function
kd &amp;lt;- ks::kde(d, compute.cont = TRUE, h = 0.05)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here’s ellipticity vs. curvature (these lines are not probability region boundaries, but they are related)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cp_plt &amp;lt;- correlated_ln_draws_tbl %&amp;gt;%
  ggplot(aes(x = ellip, y = curv)) +
  geom_point(alpha = .3, size = .5) +
  geom_density2d(size = 1.3) +
  theme_classic() +
  xlim(c(.9, 1.6)) +
  ylim(c(1, 7.5)) +
  labs(
    title = &amp;quot;Joint Distribution of Vessel Ellipticity and Curvature&amp;quot;,
    subtitle = &amp;quot;Density Contours at Default Settings&amp;quot;,
    x = &amp;quot;Ellipticity (unitless)&amp;quot;,
    y = &amp;quot;Radius of Curvature (mm)&amp;quot;
  )

cp_plt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/figure-html/unnamed-chunk-65-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now a a function to extract the points of the contour line from the kde:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_contour &amp;lt;- function(kd_out = kd, prob = &amp;quot;5%&amp;quot;) {
  contour_95 &amp;lt;- with(kd_out, contourLines(
    x = eval.points[[1]], y = eval.points[[2]],
    z = estimate, levels = cont[prob]
  )[[1]])
  as_tibble(contour_95) %&amp;gt;%
    mutate(prob = prob)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Map it over the kd object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dat_out &amp;lt;- map_dfr(c(&amp;quot;5%&amp;quot;, &amp;quot;20%&amp;quot;, &amp;quot;40%&amp;quot;, &amp;quot;60%&amp;quot;, &amp;quot;80%&amp;quot;, &amp;quot;95%&amp;quot;), ~ get_contour(kd, .)) %&amp;gt;%
  group_by(prob) %&amp;gt;%
  mutate(n_val = 1:n()) %&amp;gt;%
  ungroup()

dat_out %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;level&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;x&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;y&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;prob&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;n_val&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.027589&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.246533&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.027172&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.265195&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.025855&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.335547&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.025083&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.405899&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.025079&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.476250&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.025998&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.546603&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.027589&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.606175&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.027847&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.616954&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.030135&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.687306&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;0.1144314&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.032082&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.738850&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Clean kde output&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;kd_df &amp;lt;- expand_grid(x = kd$eval.points[[1]], y = kd$eval.points[[2]]) %&amp;gt;%
  mutate(z = c(kd$estimate %&amp;gt;% t()))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now visualize again, this time with probability contours at specified values and the 5% curve labeled with geom_label_repel().&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;label_tbl &amp;lt;- dat_out %&amp;gt;%
  filter(
    prob == &amp;quot;5%&amp;quot;,
    n_val == 100
  )

# visualize
ellip_curv_2plt &amp;lt;- ggplot(data = kd_df, aes(x, y)) +
  geom_tile(aes(fill = z)) +
  geom_point(data = d, aes(x = ellip, y = curv), alpha = .4, size = .4, colour = &amp;quot;white&amp;quot;) +
  geom_path(aes(x, y, group = prob),
    data = dat_out %&amp;gt;% filter(prob %in% c(&amp;quot;5%&amp;quot;, &amp;quot;20%&amp;quot;, &amp;quot;40%&amp;quot;, &amp;quot;60%&amp;quot;, &amp;quot;80%&amp;quot;, &amp;quot;95%&amp;quot;)), colour = &amp;quot;white&amp;quot;, size = 1.2, alpha = .8
  ) +
  #  geom_text(aes(label = prob), data =
  #              filter(dat_out, (prob %in% c(&amp;quot;5%&amp;quot;) &amp;amp; n_val==1)), # | (prob %in% c(&amp;quot;90%&amp;quot;) &amp;amp; n_val==20)),
  #            colour = &amp;quot;yellow&amp;quot;, size = 5)+
  geom_label_repel(
    data = label_tbl, aes(x, y),
    label = label_tbl$prob[1],
    fill = &amp;quot;yellow&amp;quot;,
    color = &amp;quot;black&amp;quot;,
    segment.color = &amp;quot;yellow&amp;quot;,
    #    segment.size = 1,
    min.segment.length = unit(1, &amp;quot;lines&amp;quot;),
    nudge_y = .5,
    nudge_x = -.025
  ) +
  xlim(c(.95, 1.5)) +
  ylim(c(0, 7.5)) +
  labs(
    title = &amp;quot;Joint Distribution [Ellipticity and Radius of Curvature]&amp;quot;,
    subtitle = &amp;quot;Simulated Data&amp;quot;,
    caption = &amp;quot;Density Contours shown at 5%, 20%, 40%, 60%, 80%, 95%&amp;quot;
  ) +
  scale_fill_viridis_c(end = .9) +
  theme_bw() +
  theme(legend.position = &amp;quot;none&amp;quot;) +
  labs(x = &amp;quot;Ellipticity (unitless)&amp;quot;, y = &amp;quot;Radius of Curvature (mm)&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggExtra::ggMarginal(ellip_curv_2plt, type = &amp;quot;density&amp;quot;, fill = &amp;quot;#403891ff&amp;quot;, alpha = .7)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2020-12-06-boundary-conditions-and-anatomy-exploring-correlated-data-simulation-in-r_files/figure-html/unnamed-chunk-70-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Hamdan et. al. Journal of the American College of Cardiology, Volume 59, Issue 2, 2012, Pages 119-127&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;This method would be analogous to creating prediction intervals and are conditional on the model in the sense that the only parameters considered are the maximum likelihood estimates. Alternate, more conservative ways to simulate the population could involve tolerance intervals or bayesian methods with a simulated posterior distribution to push out predictions.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;see Water 2020, 12, 1645; &lt;a href=&#34;doi:10.3390/w12061645&#34; class=&#34;uri&#34;&gt;doi:10.3390/w12061645&lt;/a&gt;&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn4&#34;&gt;&lt;p&gt;Adapted from this Stack Overflow response: &lt;a href=&#34;https://stackoverflow.com/questions/23437000/how-to-plot-a-contour-line-showing-where-95-of-values-fall-within-in-r-and-in&#34; class=&#34;uri&#34;&gt;https://stackoverflow.com/questions/23437000/how-to-plot-a-contour-line-showing-where-95-of-values-fall-within-in-r-and-in&lt;/a&gt;&lt;a href=&#34;#fnref4&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title> Creating and Using a Simple, Bayesian Linear Model (in brms and R)</title>
      <link>https://rileyking.netlify.app/post/creating-and-using-a-simple-bayesian-linear-model-in-brms-and-r/</link>
      <pubDate>Sun, 01 Dec 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rileyking.netlify.app/post/creating-and-using-a-simple-bayesian-linear-model-in-brms-and-r/</guid>
      <description>
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/viz/viz.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;https://rileyking.netlify.app/rmarkdown-libs/DiagrammeR-styles/styles.css&#34; rel=&#34;stylesheet&#34; /&gt;
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/grViz-binding/grViz.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;This post is my good-faith effort to create a simple linear model using the Bayesian framework and workflow described by Richard McElreath in his Statistical Rethinking book.&lt;a href=&#34;#fn1&#34; class=&#34;footnoteRef&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; As always - please view this post through the lens of the eager student and not the learned master. I did my best to check my work, but it’s entirely possible that something was missed. Please let me know - I won’t take it personally. As McElreath notes in his lectures - “if you’re confused, it’s because you’re paying attention”. And sometimes I get confused - this a lot harder than my old workflow which consisted of clicking “add a trendline” in Excel. Thinking Bayesian is still relatively new to me. Disclaimer over - let’s get to it.&lt;/p&gt;
&lt;p&gt;I’m playing around with a bunch of fun libraries in this one.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(styler)
library(ggExtra)
library(knitr)
library(brms)
library(cowplot)
library(gridExtra)
library(skimr)
library(DiagrammeR)
library(rayshader)
library(av)
library(rgl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I made up this data set. It represents hypothetical values of ablation time and tissue impedance as measured by sensors embedded in a RF ablation catheter. This type of device is designed to apply RF or thermal energy to the vessel wall. The result is a lesion that can aid in improve arrhythmia, reduce hypertension, or provide some other desired outcome.&lt;/p&gt;
&lt;p&gt;In RF ablations, the tissue heats up over the course of the RF cycle, resulting in a drop in impedance that varies over time. As described above, the goal will be to see how much of the variation in impedance is described by time (over some limited range) and then communicate the uncertainty in the predictions visually. None of this detail is terribly important other than I like to frame my examples from within my industry and McElreath emphasizes grounding our modeling in real world science and domain knowledge. This is what an ablation catheter system looks like:&lt;a href=&#34;#fn2&#34; class=&#34;footnoteRef&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/rf_cath.jpg&#34; width=&#34;100%&#34; height=&#34;100%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;To get things started, load the data and give it a look with skim(). There are no missing values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ablation_dta_tbl &amp;lt;- read.csv(file = &amp;quot;abl_data_2.csv&amp;quot;)
ablation_dta_tbl &amp;lt;- ablation_dta_tbl %&amp;gt;% select(temp, time)
ablation_dta_tbl %&amp;gt;% skim()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Skim summary statistics
##  n obs: 331 
##  n variables: 2 
## 
## -- Variable type:numeric ------------------------------------------------------------------------------------------------------------
##  variable missing complete   n  mean   sd    p0   p25   p50   p75  p100
##      temp       0      331 331 77.37 3.9  68.26 74.61 77.15 80.33 89.53
##      time       0      331 331 22.57 3.22 15.83 20.22 22.54 24.69 31.5 
##      hist
##  &amp;lt;U+2581&amp;gt;&amp;lt;U+2585&amp;gt;&amp;lt;U+2587&amp;gt;&amp;lt;U+2586&amp;gt;&amp;lt;U+2586&amp;gt;&amp;lt;U+2583&amp;gt;&amp;lt;U+2581&amp;gt;&amp;lt;U+2581&amp;gt;
##  &amp;lt;U+2582&amp;gt;&amp;lt;U+2586&amp;gt;&amp;lt;U+2587&amp;gt;&amp;lt;U+2587&amp;gt;&amp;lt;U+2587&amp;gt;&amp;lt;U+2583&amp;gt;&amp;lt;U+2582&amp;gt;&amp;lt;U+2581&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s start with a simple visualization. The code below builds out a scatterplot with marginal histograms which I think is a nice, clean way to evaluate scatter data.&lt;a href=&#34;#fn3&#34; class=&#34;footnoteRef&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; These data seem plausible since the impedance will typically drop as the tissue heats up during the procedure. In reality the impedance goes asymptotic but we’ll work over a limited range of time where the behavior might reasonably be linear.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;scatter_1_fig &amp;lt;- ablation_dta_tbl %&amp;gt;% ggplot(aes(x = time, y = temp)) +
  geom_point(
    colour = &amp;quot;#2c3e50&amp;quot;,
    fill = &amp;quot;#2c3e50&amp;quot;,
    size = 2,
    alpha = 0.4
  ) +
  labs(
    x = &amp;quot;Ablation Time (seconds)&amp;quot;,
    y = &amp;quot;Tissue Temperature (deg C)&amp;quot;,
    title = &amp;quot;Ablation Time vs. Tissue Temperature&amp;quot;,
    subtitle = &amp;quot;Simulated Catheter RF Ablation&amp;quot;
  )

scatter_hist_1_fig &amp;lt;- ggMarginal(scatter_1_fig,
  type = &amp;quot;histogram&amp;quot;,
  color = &amp;quot;white&amp;quot;,
  alpha = 0.7,
  fill = &amp;quot;#2c3e50&amp;quot;,
  xparams = list(binwidth = 1),
  yparams = list(binwidth = 2.5)
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# ggExtra needs these explit calls to display in Markdown docs *shrug*
grid::grid.newpage()
grid::grid.draw(scatter_hist_1_fig)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-12-01-creating-and-using-a-simple-bayesian-linear-model-in-brms-and-r_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It helps to have a plan. If I can create a posterior distribution that captures reasonable values for the model parameters and confirm that the model makes reasonable predictions then I will be happy. Here’s the workflow that hopefully will get me there.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grViz(&amp;quot;digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle, fillcolor = yellow]        
      tab1 [label = &amp;#39;Step 1: Propose a distribution for the response variable \n Choose a maximum entropy distribution given the constraints you understand&amp;#39;]
      tab2 [label = &amp;#39;Step 2: Parameterize the mean \n The mean of the response distribution will vary linearly across the range of predictor values&amp;#39;]
      tab3 [label = &amp;#39;Step 3: Set priors \n Simulate what the model knows before seeing the data.  Use domain knowledge as constraints.&amp;#39;]
      tab4 [label = &amp;#39;Step 4: Define the model \n Create the model using the observed data, the likelihood function, and the priors&amp;#39;]
      tab5 [label = &amp;#39;Step 5: Draw from the posterior \n Plot plausible lines using parameters visited by the Markov chains&amp;#39;]
      tab6 [label = &amp;#39;Step 6: Push the parameters back through the model \n Simulate real data from plausible combinations of mean and sigma&amp;#39;]
      # edge definitions with the node IDs
      tab1 -&amp;gt; tab2 -&amp;gt; tab3 -&amp;gt; tab4 -&amp;gt; tab5 -&amp;gt; tab6;
      }
      &amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:100%;height:500px;&#34; class=&#34;grViz html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;diagram&#34;:&#34;digraph flowchart {\n      # node definitions with substituted label text\n      node [fontname = Helvetica, shape = rectangle, fillcolor = yellow]        \n      tab1 [label = \&#34;Step 1: Propose a distribution for the response variable \n Choose a maximum entropy distribution given the constraints you understand\&#34;]\n      tab2 [label = \&#34;Step 2: Parameterize the mean \n The mean of the response distribution will vary linearly across the range of predictor values\&#34;]\n      tab3 [label = \&#34;Step 3: Set priors \n Simulate what the model knows before seeing the data.  Use domain knowledge as constraints.\&#34;]\n      tab4 [label = \&#34;Step 4: Define the model \n Create the model using the observed data, the likelihood function, and the priors\&#34;]\n      tab5 [label = \&#34;Step 5: Draw from the posterior \n Plot plausible lines using parameters visited by the Markov chains\&#34;]\n      tab6 [label = \&#34;Step 6: Push the parameters back through the model \n Simulate real data from plausible combinations of mean and sigma\&#34;]\n      # edge definitions with the node IDs\n      tab1 -&gt; tab2 -&gt; tab3 -&gt; tab4 -&gt; tab5 -&gt; tab6;\n      }\n      &#34;,&#34;config&#34;:{&#34;engine&#34;:&#34;dot&#34;,&#34;options&#34;:null}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
&lt;div id=&#34;step-1-propose-a-distribution-for-the-response-variable&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;strong&gt;Step 1: Propose a distribution for the response variable&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;A Gaussian model is reasonable for the outcome variable Temperature as we know it is a measured from the thermocouples on the distal end of the catheter. According to McElreath (pg 75):&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Measurement errors, variations in growth, and the velocities of molecules all tend towards Gaussian distributions. These processes do this because at their heart, these processes add together fluctuations. And repeatedly adding finite fluctuations results in a distribution of sums that have shed all information about the underlying process, aside from mean and spread.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Here’s us formally asserting Temperature as a normal distribution with mean mu and standard deviation sigma. These two parameters are all that is needed to completely describe the distribution and also pin down the likelihood function.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(T_i \sim \text{Normal}(\mu_i, \sigma)\)&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-parameterize-the-mean&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;strong&gt;Step 2: Parameterize the mean&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;If we further parameterize the mean we can do some neat things like move it around with the predictor variable. This is a pretty key concept - &lt;em&gt;you move the mean of the outcome variable around by parameterizing it. If we make it a line then it will move linearly with the predictor variable.&lt;/em&gt; The real data will still have a spread once the sigma term is folded back in, but we can think of the whole distribution shifting up and down based on the properties of the line.&lt;/p&gt;
&lt;p&gt;Here’s us asserting we want mu to move linearly with changes in the predictor variable (time). Subtracting the mean from each value of the predictor variable “centers” the data which McElreath recommends in most cases. I will explore the differences between centered and un-centered later on.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\mu_i = \alpha + \beta (x_i - \bar{x})\)&lt;/span&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-set-priors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;strong&gt;Step 3: Set priors&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;We know some things about these data and we should use it to help regularize to model through the priors.&lt;/p&gt;
&lt;p&gt;Temperature is a continuous variable so we want a continuous distribution. We also know from the nature of the treatment that there isn’t really any physical mechanism within the device that would be expected to cool down the tissue below normal body temperature. Since only heating is expected, the slope should be positive or zero.&lt;/p&gt;
&lt;p&gt;McElreath emphasizes simulating from the priors to visualize “what the model knows before it sees the data”. Here are some priors to consider. Let’s evaluate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Set seed for repeatability
set.seed(1999)

# number of sims
n &amp;lt;- 150

# random draws from the specified prior distributions
# lognormal distribution is used to constrain slopes to positive values
a &amp;lt;- rnorm(n, 75, 15)

b &amp;lt;- rnorm(n, 0, 1)
b_ &amp;lt;- rlnorm(n, 0, 0.8)

# calc mean of time and temp for later use
mean_temp &amp;lt;- mean(ablation_dta_tbl$temp)
mean_time &amp;lt;- mean(ablation_dta_tbl$time)

# dummy tibble to feed ggplot()
empty_tbl &amp;lt;- tibble(x = 0)

# y = b(x - mean(var_1)) + a is equivalent to:
# y = bx + (a - b * mean(var_1))

# in this fig we use the uninformed prior that generates some unrealistic values
prior_fig_1 &amp;lt;- empty_tbl %&amp;gt;% ggplot() +
  geom_abline(
    intercept = a - b * mean_time,
    slope = b,
    color = &amp;quot;#2c3e50&amp;quot;,
    alpha = 0.3,
    size = 1
  ) +
  ylim(c(0, 150)) +
  xlim(c(0, 150)) +
  labs(
    x = &amp;quot;time (sec)&amp;quot;,
    y = &amp;quot;Temp (C)&amp;quot;,
    title = &amp;quot;Prior Predictive Simulations&amp;quot;,
    subtitle = &amp;quot;Uninformed Prior&amp;quot;
  )

# in this fig we confine the slopes to broad ranges informed by what we know about the domain
prior_fig_2 &amp;lt;- empty_tbl %&amp;gt;% ggplot() +
  geom_abline(
    intercept = a - b_ * mean_time,
    slope = b_,
    color = &amp;quot;#2c3e50&amp;quot;,
    alpha = 0.3,
    size = 1
  ) +
  ylim(c(0, 150)) +
  xlim(c(0, 150)) +
  labs(
    x = &amp;quot;time (sec)&amp;quot;,
    y = &amp;quot;Temp (C)&amp;quot;,
    title = &amp;quot;Prior Predictive Simulations&amp;quot;,
    subtitle = &amp;quot;Mildly Informed Prior&amp;quot;
  )

plot_grid(prior_fig_1, prior_fig_2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-12-01-creating-and-using-a-simple-bayesian-linear-model-in-brms-and-r_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The plots above show what the model thinks before seeing the data for two different sets of priors. In both cases, I have centered the data by subtracting the mean of the time from each individual value of time. This means the intercept has the meaning of the expected temperature at the mean of time. The family of lines on the right seem a lot more realistic despite having some slopes that predict strange values out of sample (blood coagulates at ~90C). Choosing a log normal distribution for time ensures positives slopes. You could probably go even tighter on these priors but for this exercise I’m feeling good about proceeding.&lt;/p&gt;
&lt;p&gt;Looking only at the time window of the original observations and the Temp window bounded by body temperature (lower bound) and water boiling (upper bound).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;empty_tbl %&amp;gt;% ggplot() +
  geom_abline(
    intercept = a - b_ * mean_time,
    slope = b_,
    color = &amp;quot;#2c3e50&amp;quot;,
    alpha = 0.3,
    size = 1
  ) +
  ylim(c(37, 100)) +
  xlim(c(15, 40)) +
  labs(
    x = &amp;quot;time (sec)&amp;quot;,
    y = &amp;quot;Temp (C)&amp;quot;,
    title = &amp;quot;Prior Predictive Simulations&amp;quot;,
    subtitle = &amp;quot;Mildly Informed Prior, Original Data Range&amp;quot;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-12-01-creating-and-using-a-simple-bayesian-linear-model-in-brms-and-r_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here are the prior distributions selected to go forward.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\alpha \sim \text{Normal}(75, 15)\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\beta \sim \text{LogNormal}(0, .8)\)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math inline&#34;&gt;\(\sigma \sim \text{Uniform}(0, 30)\)&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-define-the-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;strong&gt;Step 4: Define the model&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;Here I use the brm() function in brms to build what I’m creatively calling: “model_1”. This one uses the un-centered data for time. This function uses Markov Chain Monte Carlo to survey the parameter space. After the warm up cycles, the relative amount of time the chains spend at each parameter value is a good approximation of the true posterior distribution. I’m using a lot of warm up cycles because I’ve heard chains for the uniform priors on sigma can take a long time to converge. This model still takes a bit of time to chug through the parameter space on my modest laptop.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#model_1 &amp;lt;-
#  brm(
#    data = ablation_dta_tbl, family = gaussian,
#    temp ~ 1 + time,
#    prior = c(
#      prior(normal(75, 15), class = Intercept),
#      prior(lognormal(0, .8), class = b),
#      prior(uniform(0, 30), class = sigma)
#    ),
#    iter = 41000, warmup = 40000, chains = 4, cores = 4,
#    seed = 4
#  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-5-draw-from-the-posterior&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;strong&gt;Step 5: Draw from the posterior&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;The fruits of all my labor! The posterior holds credible combinations for sigma and the slope and intercept (which together describe the mean of the outcome variable we care about). Let’s take a look.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;post_samplesM1_tbl &amp;lt;-
  posterior_samples(model_1) %&amp;gt;%
  select(-lp__) %&amp;gt;%
  round(digits = 3)

post_samplesM1_tbl %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;b_Intercept&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;b_time&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;sigma&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;58.509&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.841&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.682&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;55.983&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.949&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.648&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;56.195&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.937&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.540&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;56.661&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.919&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.474&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;55.143&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.978&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.593&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;55.170&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.977&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.667&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;54.908&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.996&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.621&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;58.453&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.836&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.534&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;54.134&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.031&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.647&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;58.713&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.828&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.707&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The plotting function in brms is pretty sweet. I’m not expert in MCMC diagnostics but I do know the “fuzzy caterpillar” look of the trace plots is desirable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(model_1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-12-01-creating-and-using-a-simple-bayesian-linear-model-in-brms-and-r_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Posterior_summary() can grab the model results in table form.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mod_1_summary_tbl &amp;lt;-
  posterior_summary(model_1) %&amp;gt;%
  as.data.frame() %&amp;gt;%
  rownames_to_column() %&amp;gt;%
  as_tibble() %&amp;gt;%
  mutate_if(is.numeric, funs(as.character(signif(., 2)))) %&amp;gt;%
  mutate_at(.vars = c(2:5), funs(as.numeric(.)))

mod_1_summary_tbl %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;rowname&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Est.Error&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Q2.5&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Q97.5&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;b_Intercept&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;57.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.000&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;55.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;59.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;b_time&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.91&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.045&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.83&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;sigma&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.60&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.100&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.40&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;lp__&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-790.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.300&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-790.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-790.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now let’s see what changes if the time data is centered. Everything is the same here in model_2 except the time_c data which is transformed by subtracting the mean from each value.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ablation_dta_tbl &amp;lt;- ablation_dta_tbl %&amp;gt;% mutate(time_c = time - mean(time))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#model_2 &amp;lt;-
#  brm(
#    data = ablation_dta_tbl, family = gaussian,
#    temp ~ 1 + time_c,
#    prior = c(
#      prior(normal(75, 15), class = Intercept),
#      prior(lognormal(0, .8), class = b),
#      prior(uniform(0, 30), class = sigma)
#    ),
#    iter = 41000, warmup = 40000, chains = 4, cores = 4,
#    seed = 4
#  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Plotting model_2 to compare with the output of model_1 above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_mod_2_fig &amp;lt;- plot(model_2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-12-01-creating-and-using-a-simple-bayesian-linear-model-in-brms-and-r_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The slope B and sigma are very similar. The intercept is the only difference with model_1 ranging from low to high 50’s. Model 2 is tight around 77. We should visualize the lines proposed by the parameters in the posteriors of our models to understand the uncertainty associated with the mean and also understand why the intercepts are different between models. First, store the posterior samples as a tibble in anticipation for ggplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;post_samplesM2_tbl &amp;lt;-
  posterior_samples(model_2) %&amp;gt;%
  select(-lp__) %&amp;gt;%
  round(digits = 3)

post_samplesM2_tbl %&amp;gt;%
  head(10) %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;b_Intercept&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;b_time_c&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;sigma&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.323&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.894&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.350&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.430&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.881&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.516&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.335&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.957&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.571&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.011&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.947&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.776&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.209&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.013&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.691&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.517&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.820&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.488&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.335&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.881&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.682&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.313&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.857&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.538&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.423&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.873&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.569&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;77.302&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.926&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.340&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Visualize the original data (centered and un-centered versions) along with plausible values for regression line of the mean:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mean_regressionM1_fig &amp;lt;-
  ablation_dta_tbl %&amp;gt;%
  ggplot(aes(x = time, y = temp)) +
  geom_point(
    colour = &amp;quot;#481567FF&amp;quot;,
    size = 2,
    alpha = 0.6
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time),
    data = post_samplesM1_tbl,
    alpha = 0.1, color = &amp;quot;gray50&amp;quot;
  ) +
  geom_abline(
    slope = mean(post_samplesM1_tbl$b_time),
    intercept = mean(post_samplesM1_tbl$b_Intercept),
    color = &amp;quot;blue&amp;quot;, size = 1
  ) +
  labs(
    title = &amp;quot;Regression Line Representing Mean of Slope&amp;quot;,
    subtitle = &amp;quot;Data is As-Observed (No Centering of Predictor)&amp;quot;,
    x = &amp;quot;Time (s)&amp;quot;,
    y = &amp;quot;Temperature (C)&amp;quot;
  )

mean_regressionM2_fig &amp;lt;-
  ablation_dta_tbl %&amp;gt;%
  ggplot(aes(x = time_c, y = temp)) +
  geom_point(
    color = &amp;quot;#55C667FF&amp;quot;,
    size = 2,
    alpha = 0.6
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time_c),
    data = post_samplesM2_tbl,
    alpha = 0.1, color = &amp;quot;gray50&amp;quot;
  ) +
  geom_abline(
    slope = mean(post_samplesM2_tbl$b_time_c),
    intercept = mean(post_samplesM2_tbl$b_Intercept),
    color = &amp;quot;blue&amp;quot;, size = 1
  ) +
  labs(
    title = &amp;quot;Regression Line Representing Mean of Slope&amp;quot;,
    subtitle = &amp;quot;Predictor Data (Time) is Centered&amp;quot;,
    x = &amp;quot;Time (Difference from Mean Time in seconds)&amp;quot;,
    y = &amp;quot;Temperature (C)&amp;quot;
  )


combined_mean_fig &amp;lt;-
  ablation_dta_tbl %&amp;gt;%
  ggplot(aes(x = time, y = temp)) +
  geom_point(
    colour = &amp;quot;#481567FF&amp;quot;,
    size = 2,
    alpha = 0.6
  ) +
  geom_point(
    data = ablation_dta_tbl, aes(x = time_c, y = temp),
    colour = &amp;quot;#55C667FF&amp;quot;,
    size = 2,
    alpha = 0.6
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time),
    data = post_samplesM1_tbl,
    alpha = 0.1, color = &amp;quot;gray50&amp;quot;
  ) +
  geom_abline(
    slope = mean(post_samplesM1_tbl$b_time),
    intercept = mean(post_samplesM1_tbl$b_Intercept),
    color = &amp;quot;blue&amp;quot;, size = 1
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time_c),
    data = post_samplesM2_tbl,
    alpha = 0.1, color = &amp;quot;gray50&amp;quot;
  ) +
  geom_abline(
    slope = mean(post_samplesM2_tbl$b_time_c),
    intercept = mean(post_samplesM2_tbl$b_Intercept),
    color = &amp;quot;blue&amp;quot;, size = 1
  ) +
  labs(
    title = &amp;quot;Regression Line Representing Mean of Slope&amp;quot;,
    subtitle = &amp;quot;Centered and Un-Centered Predictor Data&amp;quot;,
    x = &amp;quot;Time (s)&amp;quot;,
    y = &amp;quot;Temperature (C)&amp;quot;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;combined_predicts_fig &amp;lt;- combined_mean_fig + 
  ylim(c(56,90)) +
  labs(title = &amp;quot;Points Represent Observed Data (Green is Centered)&amp;quot;,
       subtitle = &amp;quot;Regression Line Represents Rate of Change of Mean (Grey Bands are Uncertainty)&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/combined_predicts_fig.png&#34; width=&#34;100%&#34; height=&#34;100%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now everything is clear. The slopes are exactly the same (as we saw in the density plots between model_1 and model_2 in summary()). The intercepts are different because in the centered data (green) the intercept occurs when the predictor equals 0 (its new mean). The outcome variable temp must therefore also be at its mean value in the “knot” of the bow-tie.&lt;/p&gt;
&lt;p&gt;For the un-centered data (purple), the intercept is the value of Temperature when the un-adjusted time is at 0. The range of possible intercepts is much more uncertain here.&lt;/p&gt;
&lt;p&gt;Another way to look at the differences is as a map of the plausible parameter space. We need a plot that can represent 3 parameters: intercept, slope, and sigma. Each point will be a credible combination of the three parameters as observed in 1 row of the posterior distribution tibble(s).&lt;/p&gt;
&lt;p&gt;First, the un-centered model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p_spaceM1_fig &amp;lt;- 
  post_samplesM1_tbl[1:1000, ] %&amp;gt;%
  ggplot(aes(x = b_time, y = b_Intercept, color = sigma)) +
  geom_point(alpha = 0.5) +
  geom_density2d(color = &amp;quot;gray30&amp;quot;) +
  scale_color_viridis_c() +
  labs(
    title = &amp;quot;Parameter Space - Model 1 (Un-Centered)&amp;quot;,
    subtitle = &amp;quot;Intercept Represents the Expected Temp at Time = 0&amp;quot;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/p_spaceM1_fig.png&#34; width=&#34;100%&#34; height=&#34;100%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the centered version:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p_spaceM2_fig &amp;lt;- 
  post_samplesM2_tbl[1:1000, ] %&amp;gt;%
  ggplot(aes(x = b_time_c, y = b_Intercept, color = sigma)) +
  geom_point(alpha = 0.5) +
  geom_density2d(color = &amp;quot;gray30&amp;quot;) +
  scale_color_viridis_c() +
  labs(
    title = &amp;quot;Parameter Space - Model 2 (Centered)&amp;quot;,
    subtitle = &amp;quot;Intercept Represents the Expected Temp at Mean Time&amp;quot;
  )

#p_spaceM2_fig 
#ggsave(filename = &amp;quot;p_spaceM2_fig.png&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/p_spaceM2_fig.png&#34; width=&#34;100%&#34; height=&#34;100%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;These look way different, but part of it is an illusion of the scaling on the y-axis. Remember how the credible values of the intercept were much tighter for the centered model? If we plot them both on the same canvas we can understand better, and it’s pretty (to my eye at least).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p_spaceC_tbl &amp;lt;- 
  post_samplesM2_tbl[1:1000, ] %&amp;gt;%
  ggplot(aes(x = b_time_c, y = b_Intercept, color = sigma)) +
  geom_point(alpha = 0.5) +
  geom_point(data = post_samplesM1_tbl, aes(x = b_time, y = b_Intercept, color = sigma), alpha = 0.5) +
  scale_color_viridis_c() +
  labs(
    title = &amp;quot;Credible Parameter Values for Models 1 and 2&amp;quot;,
    subtitle = &amp;quot;Model 1 is Un-Centered, Model 2 is Centered&amp;quot;,
    x = expression(beta[&amp;quot;time&amp;quot;]),
    y = expression(alpha[&amp;quot;Intercept&amp;quot;])) +
  ylim(c(54, 80))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/p_spaceC_tbl.png&#34; width=&#34;100%&#34; height=&#34;100%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now we see they aren’t as different as they first seemed. They cover very similar ranges for the slope and the un-centered model covers a wider range of plausible intercepts.&lt;/p&gt;
&lt;p&gt;I’ve been looking for a good time to fire up the rayshader package and I’m not throwing away my shot here. Plotting with rayshader feels like a superpower that I shouldn’t be allowed to have. It’s silly how easy it is to make these ridiculous visuals. First, a fancy 3d plot providing some perspective on the relative “heights” of theta.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#par(mfrow = c(1, 1))
#plot_gg(p_spaceC_tbl, width = 5, height = 4, scale = 300, multicore = TRUE, windowsize = c(1200, 960),
#        fov = 70, zoom = 0.45, theta = 330, phi = 40)

#Sys.sleep(0.2)
#render_depth(focus = 0.7, focallength = 200)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/3d_params.png&#34; width=&#34;100%&#34; height=&#34;100%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If you want more, this code below renders a video guaranteed to impress small children and executives. I borrowed this code from Joey Stanley who borrowed it from Morgan Wall.&lt;a href=&#34;#fn4&#34; class=&#34;footnoteRef&#34; id=&#34;fnref4&#34;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#install.packages(&amp;quot;av&amp;quot;)
#library(av)

# Set up the camera position and angle
#phivechalf = 30 + 60 * 1/(1 + exp(seq(-7, 20, length.out = 180)/2))
#phivecfull = c(phivechalf, rev(phivechalf))
#thetavec = 0 + 60 * sin(seq(0,359,length.out = 360) * pi/180)
#zoomvec = 0.45 + 0.2 * 1/(1 + exp(seq(-5, 20, length.out = 180)))
#zoomvecfull = c(zoomvec, rev(zoomvec))

# Actually render the video.
#render_movie(filename = &amp;quot;hex_plot_fancy_2&amp;quot;, type = &amp;quot;custom&amp;quot;, 
#            frames = 360,  phi = phivecfull, zoom = zoomvecfull, theta = thetavec)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/hex_plot_fancy_2.gif&#34; width=&#34;100%&#34; height=&#34;100%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-6-push-the-parameters-back-through-the-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;&lt;strong&gt;Step 6: Push the parameters back through the model&lt;/strong&gt;&lt;/h2&gt;
&lt;p&gt;After a lot of work we have finally identified the credible values for our model parameters. We now want to see what sort of predictions our posterior makes. Again, I’ll work with both the centered and un-centered data to try to understand the difference between the approaches. The first step in both cases is to create a sequence of time data to predict off of. For some reason I couldn’t get the predict() function in brms to cooperate so I wrote my own function to predict values. You enter a time value and the function makes a temperature prediction for every combination of mean and standard deviation derived from the parameters in the posterior distribution. Our goal will be to map this function over the sequence of predictor values we just set up.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#sequence of time data to predict off of.  Could use the same for both models but I created 2 for clarity
time_seq_tbl   &amp;lt;- tibble(pred_time   = seq(from = -15, to = 60, by = 1))
time_seq_tbl_2 &amp;lt;- tibble(pred_time_2 = seq(from = -15, to = 60, by = 1))

#function that takes a time value and makes a prediction using model_1 (un-centered) 
rk_predict &amp;lt;- 
function(time_to_sim){
  rnorm(n = nrow(post_samplesM1_tbl),
        mean = post_samplesM1_tbl$b_Intercept + post_samplesM1_tbl$b_time*time_to_sim,
        sd = post_samplesM1_tbl$sigma
  )
}

#function that takes a time value and makes a prediction using model_2 (centered)
rk_predict2 &amp;lt;- 
function(time_to_sim){
  rnorm(n = nrow(post_samplesM2_tbl),
        mean = post_samplesM2_tbl$b_Intercept + post_samplesM2_tbl$b_time_c*time_to_sim,
        sd = post_samplesM2_tbl$sigma
  )
}

#map the first prediction function over all values in the time sequence
#then calculate the .025 and .975 quantiles in anticipation of 95% prediction intervals
predicts_m1_tbl &amp;lt;- time_seq_tbl %&amp;gt;%
  mutate(preds_for_this_time = map(pred_time, rk_predict)) %&amp;gt;%
  mutate(percentile_2.5  = map_dbl(preds_for_this_time, ~quantile(., .025))) %&amp;gt;%
  mutate(percentile_97.5 = map_dbl(preds_for_this_time, ~quantile(., .975)))
    
#same for the 2nd prediction function
predicts_m2_tbl &amp;lt;- time_seq_tbl_2 %&amp;gt;%
  mutate(preds_for_this_time = map(pred_time_2, rk_predict2)) %&amp;gt;%
  mutate(percentile_2.5  = map_dbl(preds_for_this_time, ~quantile(., .025))) %&amp;gt;%
  mutate(percentile_97.5 = map_dbl(preds_for_this_time, ~quantile(., .975)))   

#visualize what is stored in the nested prediction cells (sanity check)
test_array &amp;lt;- predicts_m2_tbl[1, 2] %&amp;gt;% unnest(cols = c(preds_for_this_time))
test_array %&amp;gt;% 
  round(digits = 2) %&amp;gt;%
  head(5) %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;, 1))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;preds_for_this_time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;68.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;61.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;65.55&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;62.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;64.05&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And now the grand finale - overlay the 95% prediction intervals on the original data along with the credible values of mean. We see there is no difference between the predictions made from centered data vs. un-centered.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;big_enchilada &amp;lt;- 
  tibble(h=0) %&amp;gt;%
  ggplot() +
  geom_point(
    data = ablation_dta_tbl, aes(x = time, y = temp),
    colour = &amp;quot;#481567FF&amp;quot;,
    size = 2,
    alpha = 0.6
  ) +
  geom_point(
    data = ablation_dta_tbl, aes(x = time_c, y = temp),
    colour = &amp;quot;#55C667FF&amp;quot;,
    size = 2,
    alpha = 0.6
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time),
    data = post_samplesM1_tbl,
    alpha = 0.1, color = &amp;quot;gray50&amp;quot;
  ) +
  geom_abline(
    slope = mean(post_samplesM1_tbl$b_time),
    intercept = mean(post_samplesM1_tbl$b_Intercept),
    color = &amp;quot;blue&amp;quot;, size = 1
  ) +
  geom_abline(aes(intercept = b_Intercept, slope = b_time_c),
    data = post_samplesM2_tbl,
    alpha = 0.1, color = &amp;quot;gray50&amp;quot;
  ) +
  geom_abline(
    slope = mean(post_samplesM2_tbl$b_time_c),
    intercept = mean(post_samplesM2_tbl$b_Intercept),
    color = &amp;quot;blue&amp;quot;, size = 1
  ) +
  geom_ribbon(
  data = predicts_m1_tbl, aes(x = predicts_m1_tbl$pred_time, ymin = predicts_m1_tbl$percentile_2.5, ymax = predicts_m1_tbl$percentile_97.5), alpha = 0.25, fill = &amp;quot;pink&amp;quot;, color = &amp;quot;black&amp;quot;, size = .3
) +
  geom_ribbon(
  data = predicts_m2_tbl, aes(x = predicts_m2_tbl$pred_time_2, ymin = predicts_m2_tbl$percentile_2.5, ymax = predicts_m2_tbl$percentile_97.5), alpha = 0.4, fill = &amp;quot;pink&amp;quot;, color = &amp;quot;black&amp;quot;, size = .3
) +
  labs(
    title = &amp;quot;Regression Line Representing Mean of Slope&amp;quot;,
    subtitle = &amp;quot;Centered and Un-Centered Predictor Data&amp;quot;,
    x = &amp;quot;Time (s)&amp;quot;,
    y = &amp;quot;Temperature (C)&amp;quot;
  ) +
  scale_x_continuous(limits = c(-10, 37), expand = c(0, 0)) +
  scale_y_continuous(limits = c(40, 120), expand = c(0, 0))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/big_enchilada.png&#34; width=&#34;100%&#34; height=&#34;100%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;What a ride! This seemingly simple problem really stretched my brain. There are still a lot of question I want to go deeper on - diagnostics for the MCMC, impact of the regularizing priors, different between this workflow and frequentist at various sample sizes and priors, etc… but that will have to wait for another day.&lt;/p&gt;
&lt;p&gt;For those looking for more interpretations of McElreath’s workflows using Tidyverse tools, Solomon Kurz has a brilliant collection here.&lt;a href=&#34;#fn5&#34; class=&#34;footnoteRef&#34; id=&#34;fnref5&#34;&gt;&lt;sup&gt;5&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Thank you for reading.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Statistical Rethinking, &lt;a href=&#34;https://github.com/rmcelreath/statrethinking_winter2019&#34; class=&#34;uri&#34;&gt;https://github.com/rmcelreath/statrethinking_winter2019&lt;/a&gt;&lt;a href=&#34;#fnref1&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://www.sciencedirect.com/science/article/abs/pii/S1547527116001806&#34; class=&#34;uri&#34;&gt;https://www.sciencedirect.com/science/article/abs/pii/S1547527116001806&lt;/a&gt;&lt;a href=&#34;#fnref2&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;There’s a funky bug in ggExtra which makes you break this code into 2 chunks when working in Markdown, &lt;a href=&#34;https://cran.r-project.org/web/packages/ggExtra/vignettes/ggExtra.html&#34; class=&#34;uri&#34;&gt;https://cran.r-project.org/web/packages/ggExtra/vignettes/ggExtra.html&lt;/a&gt;&lt;a href=&#34;#fnref3&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn4&#34;&gt;&lt;p&gt;3D Vowel Plots with Rayshader, &lt;a href=&#34;http://joeystanley.com/blog/3d-vowel-plots-with-rayshader&#34; class=&#34;uri&#34;&gt;http://joeystanley.com/blog/3d-vowel-plots-with-rayshader&lt;/a&gt;&lt;a href=&#34;#fnref4&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn5&#34;&gt;&lt;p&gt;Statistical Rethinking with brms, ggplot2, and the tidyverse, &lt;a href=&#34;https://bookdown.org/ajkurz/Statistical_Rethinking_recoded/&#34; class=&#34;uri&#34;&gt;https://bookdown.org/ajkurz/Statistical_Rethinking_recoded/&lt;/a&gt;&lt;a href=&#34;#fnref5&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Confounders and Colliders - Modeling Spurious Correlations in R</title>
      <link>https://rileyking.netlify.app/post/confounders-and-colliders-modeling-spurious-correlations-in-r/</link>
      <pubDate>Tue, 29 Oct 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rileyking.netlify.app/post/confounders-and-colliders-modeling-spurious-correlations-in-r/</guid>
      <description>


&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/dag.png&#34; width=&#34;100%&#34; height=&#34;100%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Like many engineers, my first models were based on Designed Experiments in the tradition of Cox and Montgomery. I hadn’t seen anything like a causal diagram until I picked the &lt;strong&gt;The Book of Why&lt;/strong&gt; which explores all sorts of experimental relationships and structures I never imagined.&lt;a href=&#34;#fn1&#34; class=&#34;footnoteRef&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; Colliders, confounders, causal diagrams, M-bias - these concepts are all relatively new to me and I want to understand them better. In this post I will attempt to create some simple structural causal models (SCMs) for myself using the Dagitty and GGDag packages and then show the potential effects of confounders and colliders on a simulated experiment adapted from here.&lt;a href=&#34;#fn2&#34; class=&#34;footnoteRef&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It turns out that it is not as simple as identifying lurking variables and holding them constant while we conduct the experiment of interest (as I was always taught).&lt;/p&gt;
&lt;p&gt;First, load the libraries.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Load libraries
library(tidyverse)
library(kableExtra)
library(tidymodels)
library(viridisLite)
library(GGally)
library(dagitty)
library(ggdag)
library(visreg)
library(styler)
library(cowplot)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A structural causal model (SCM) is a type of directed acyclic graph (DAG) that maps causal assumptions onto a simple model of experimental variables. In the figure below, each node(blue dot) represents a variable. The edges(yellow lines) between nodes represent assumed causal effects.&lt;/p&gt;
&lt;p&gt;Dagitty uses the dafigy() function to create the relationships in the DAG. These are stored in a DAG object which is provided to ggplot and can then be customized and adjusted. Most of the code below the DAG object is just formatting the figure.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# create DAG object
g &amp;lt;- dagify(
  A ~ J,
  X ~ J,
  X ~ A
)

# tidy the dag object and supply to ggplot
set.seed(100)
g %&amp;gt;%
  tidy_dagitty() %&amp;gt;%
  mutate(x = c(0, 1, 1, 2)) %&amp;gt;%
  mutate(y = c(0, 2, 2, 0)) %&amp;gt;%
  mutate(xend = c(2, 0, 2, NA)) %&amp;gt;%
  mutate(yend = c(0, 0, 0, NA)) %&amp;gt;%
  dag_label(labels = c(
    &amp;quot;A&amp;quot; = &amp;quot;Independent\n Variable&amp;quot;,
    &amp;quot;X&amp;quot; = &amp;quot;Dependent\n Variable&amp;quot;,
    &amp;quot;J&amp;quot; = &amp;quot;The\n Confounder&amp;quot;
  )) %&amp;gt;%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_edges(
    edge_colour = &amp;quot;#b8de29ff&amp;quot;,
    edge_width = .8
  ) +
  geom_dag_node(
    color = &amp;quot;#2c3e50&amp;quot;,
    alpha = 0.8
  ) +
  geom_dag_text(color = &amp;quot;white&amp;quot;) +
  geom_dag_label_repel(aes(label = label),
    col = &amp;quot;white&amp;quot;,
    label.size = .4,
    fill = &amp;quot;#20a486ff&amp;quot;,
    alpha = 0.8,
    show.legend = FALSE,
    nudge_x = .7,
    nudge_y = .3
  ) +
  labs(
    title = &amp;quot; Directed Acyclic Graph&amp;quot;,
    subtitle = &amp;quot; Two Variables of Interest with a Confounder&amp;quot;
  ) +
  xlim(c(-1.5, 3.5)) +
  ylim(c(-.33, 2.2)) +
  geom_rect(
    xmin = -.5,
    xmax = 3.25,
    ymin = -.25,
    ymax = .65,
    alpha = .04,
    fill = &amp;quot;white&amp;quot;
  ) +
  theme_void() +
  theme(
    plot.background = element_rect(fill = &amp;quot;#222222&amp;quot;),
    plot.title = element_text(color = &amp;quot;white&amp;quot;),
    plot.subtitle = element_text(color = &amp;quot;white&amp;quot;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-10-29-confounders-and-colliders-modeling-spurious-correlations-in-r_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt; The relationship of interest is captured in the lower rectangle: we want to change the value of independent variable &lt;strong&gt;A&lt;/strong&gt; and record the effect on dependent variable &lt;strong&gt;X&lt;/strong&gt; (in epidemiology these might be called “treatment” and “outcome”). There also happens to be a confounding variable &lt;strong&gt;J&lt;/strong&gt; that has a causal effect on both &lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;X&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We can set up a simulated experiment that follows the structure of the SCM above:&lt;/p&gt;
&lt;p&gt;Each variable will have n=1000 values. &lt;strong&gt;J&lt;/strong&gt; is generated by drawing randomly from a standard normal distribution. We want &lt;strong&gt;J&lt;/strong&gt; to be a cause of &lt;strong&gt;A&lt;/strong&gt; so we use &lt;strong&gt;J&lt;/strong&gt; in the creation of &lt;strong&gt;A&lt;/strong&gt; along with a random error term to represent noise. The model above shows a causal link from &lt;strong&gt;A&lt;/strong&gt; to &lt;strong&gt;X&lt;/strong&gt; but we don’t actually know if this exists - that’s the point of the experiment. It may or may not be there (from the point of view of the experimenter/engineer). For the purposes of demonstration we will structure the simulation such that there is &lt;strong&gt;no&lt;/strong&gt; causal relationship between &lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;X&lt;/strong&gt; (&lt;strong&gt;A&lt;/strong&gt; will not be used in the creation of the variable &lt;strong&gt;J&lt;/strong&gt;). Again we need &lt;strong&gt;J&lt;/strong&gt; as a cause of &lt;strong&gt;X&lt;/strong&gt; so we use &lt;strong&gt;J&lt;/strong&gt; in the creation of the &lt;strong&gt;dependent_var_X&lt;/strong&gt; object along with a random noise component.&lt;/p&gt;
&lt;p&gt;The simulation is now set up to model an experiment where the experimenter/engineer wants to understand the effect of &lt;strong&gt;A&lt;/strong&gt; on &lt;strong&gt;X&lt;/strong&gt; but the true effect is zero. Meanwhile, there is a confounding variable &lt;strong&gt;J&lt;/strong&gt; that is a parent to both &lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;X&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# set seed for repeatability
set.seed(805)

# n = 1000 points for the simulation
n &amp;lt;- 1000

# create variables
# J is random draws from standard normal (mean = 0, stdev = 1)
confounding_var_J &amp;lt;- rnorm(n)

# J is used in creation of A since it is a cause of A (confounder)
independent_var_A &amp;lt;- 1.1 * confounding_var_J + rnorm(n)

# J is used in creation of X since it is a cause of X (confounder)
dependent_var_X &amp;lt;- 1.9 * confounding_var_J + rnorm(n)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In reality, the experimenter may or may not be aware of the parent confounder &lt;strong&gt;J&lt;/strong&gt;. We will create two different regression models below. In the first, denoted &lt;strong&gt;crude_model&lt;/strong&gt;, we will assume the experimenter was unaware of the confounder. The model is then created with &lt;strong&gt;A&lt;/strong&gt; as the only predictor variable of &lt;strong&gt;X&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In the second, denoted &lt;strong&gt;confounder_model&lt;/strong&gt;, we will assume the experimenter was aware of the confounder and chose to include it in their model. This version is created with &lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;J&lt;/strong&gt; as predictors of &lt;strong&gt;X&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# create crude regression model with A predicting X.  J is omitted
crude_model &amp;lt;- lm(dependent_var_X ~ independent_var_A)

# create confounder model with A and J predicting X
confounder_model &amp;lt;- lm(dependent_var_X ~ independent_var_A + confounding_var_J)

# tidy the crude model and examine it
crude_model_tbl &amp;lt;- summary(crude_model) %&amp;gt;% tidy()
crude_model_kbl &amp;lt;- summary(crude_model) %&amp;gt;%
  tidy() %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;, 5), digits = 3)
crude_model_kbl&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
term
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
estimate
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
std.error
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
statistic
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
p.value
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
(Intercept)
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.007
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.051
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.135
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.893
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
independent_var_A
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.967
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.034
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
28.415
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Tidy the confounder model and examine it
confounder_model_tbl &amp;lt;- summary(confounder_model) %&amp;gt;% tidy()
confounder_model_kbl &amp;lt;- summary(confounder_model) %&amp;gt;%
  tidy() %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;, 5), digits = 3)
confounder_model_kbl&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
term
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
estimate
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
std.error
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
statistic
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
p.value
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
(Intercept)
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.005
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.032
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.151
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.880
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
independent_var_A
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.005
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.033
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.153
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.878
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
confounding_var_J
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
1.860
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.048
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
38.460
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# add column for labels
crude_model_tbl &amp;lt;- crude_model_tbl %&amp;gt;% mutate(model = &amp;quot;crude_model: no confounder&amp;quot;)
confounder_model_tbl &amp;lt;- confounder_model_tbl %&amp;gt;% mutate(model = &amp;quot;confounder_model: with confounder&amp;quot;)

# combine into a single kable
confounder_model_summary_tbl &amp;lt;- bind_rows(crude_model_tbl, confounder_model_tbl)
confounder_model_summary_tbl &amp;lt;- confounder_model_summary_tbl %&amp;gt;% select(model, everything())
confounder_model_summary_tbl %&amp;gt;% kable(align = rep(&amp;quot;c&amp;quot;, 6), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
model
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
term
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
estimate
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
std.error
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
statistic
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
p.value
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
crude_model: no confounder
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
(Intercept)
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.007
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.051
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.135
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.893
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
crude_model: no confounder
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
independent_var_A
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.967
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.034
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
28.415
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
confounder_model: with confounder
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
(Intercept)
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.005
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.032
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.151
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.880
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
confounder_model: with confounder
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
independent_var_A
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.005
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.033
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.153
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.878
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
confounder_model: with confounder
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
confounding_var_J
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
1.860
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.048
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
38.460
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The combined summary table above provides the effect sizes and the difference between the two models is striking. Conditional plots are a way to visualize regression models. The visreg package creates conditional plots by supplying a model object and a predictor variable to the visreg() function. The x-axis shows the value of the predictor variable and the y-axis shows change in the response variable. All other variables are held constant at their medians.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# visualize conditional plot of A vs X, crude model
v1 &amp;lt;- visreg(crude_model,
  &amp;quot;independent_var_A&amp;quot;,
  gg = TRUE,
  line = list(col = &amp;quot;#E66101&amp;quot;)
) +
  labs(
    title = &amp;quot;Relationship Between A and X&amp;quot;,
    subtitle = &amp;quot;Neglecting Confounder Variable J&amp;quot;
  ) +
  ylab(&amp;quot;Change in Response X&amp;quot;) +
  ylim(-6, 6) +
  theme(plot.subtitle = element_text(face = &amp;quot;bold&amp;quot;, color = &amp;quot;#404788FF&amp;quot;))

# visualize conditional plot of A vs X, confounder model
v2 &amp;lt;- visreg(confounder_model,
  &amp;quot;independent_var_A&amp;quot;,
  gg = TRUE,
  line = list(col = &amp;quot;#E66101&amp;quot;)
) +
  labs(
    title = &amp;quot;Relationship Between A and X&amp;quot;,
    subtitle = &amp;quot;Considering Confounder Variable J&amp;quot;
  ) +
  ylab(&amp;quot;Change in Response X&amp;quot;) +
  ylim(-6, 6) +
  theme(plot.subtitle = element_text(face = &amp;quot;bold&amp;quot;, color = &amp;quot;#20a486ff&amp;quot;))

plot_grid(v1, v2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-10-29-confounders-and-colliders-modeling-spurious-correlations-in-r_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We know from creating the simulated data that &lt;strong&gt;A&lt;/strong&gt; has no real effect on the outcome &lt;strong&gt;X&lt;/strong&gt;. &lt;strong&gt;X&lt;/strong&gt; was created using only &lt;strong&gt;J&lt;/strong&gt; and some noise. But the left plot shows a large, positive slope and significant coefficient! How can this be? This faulty estimate of the true effect is biased; more specifically we are seeing “confounder bias” or “omitted variable bias”. Adding &lt;strong&gt;J&lt;/strong&gt; to the regression model has the effect of conditioning on &lt;strong&gt;J&lt;/strong&gt; and revealing the true relationship between &lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;X&lt;/strong&gt;: no effect of &lt;strong&gt;A&lt;/strong&gt; on &lt;strong&gt;X&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Confounding is pretty easy to understand. “Correlation does not imply causation” has been drilled into my brain effectively. Still, confounders that aren’t anticipated can derail studies and confuse observers. For example, the first generation of drug eluting stents was released in the early 2000’s. They showed great promise but their long-term risk profile was not well understood. Observational studies indicated an improved mortality rate for drug-eluting stents relative to their bare-metal counterparts. However, the performance benefit could not be replicated in randomized controlled trials.&lt;a href=&#34;#fn3&#34; class=&#34;footnoteRef&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The disconnect was eventually linked (at least in part) to a confounding factor. Outside of a RCT, clinicians took into account the health of the patient going into the procedure. Specifically, if the patient was scheduled for a pending surgery or had a history of clotting then the clinician would hedge towards a bare-metal stent (since early gen DES tended to have thrombotic events at a greater frequency than BMS). Over the long term, these sicker patients were assigned BMS disproportionately, biasing the effect of stent type on long-term mortality via patient health as a confounder.&lt;/p&gt;
&lt;p&gt;So we always want to include every variable we know about in our regression models, right? Wrong. Here is a case that looks similar to the confounder scenario but is slightly different. The question of interest is the same: evaluate the effect of predictor &lt;strong&gt;B&lt;/strong&gt; on the outcome &lt;strong&gt;Y&lt;/strong&gt;. Again, there is a 3rd variable at play. But this time, the third variable is caused by both &lt;strong&gt;B&lt;/strong&gt; and &lt;strong&gt;Y&lt;/strong&gt; rather than being itself the common cause.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# assign DAG object
h &amp;lt;- dagify(
  K ~ B + Y,
  Y ~ B
)

# tidy the dag object and suppply to ggplot
set.seed(100)
h %&amp;gt;%
  tidy_dagitty() %&amp;gt;%
  mutate(x = c(0, 0, 2, 1)) %&amp;gt;%
  mutate(y = c(0, 0, 0, 2)) %&amp;gt;%
  mutate(xend = c(1, 2, 1, NA)) %&amp;gt;%
  mutate(yend = c(2, 0, 2, NA)) %&amp;gt;%
  dag_label(labels = c(
    &amp;quot;B&amp;quot; = &amp;quot;Independent\n Variable&amp;quot;,
    &amp;quot;Y&amp;quot; = &amp;quot;Dependent\n Variable&amp;quot;,
    &amp;quot;K&amp;quot; = &amp;quot;The\n Collider&amp;quot;
  )) %&amp;gt;%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_edges(
    edge_colour = &amp;quot;#b8de29ff&amp;quot;,
    edge_width = .8
  ) +
  geom_dag_node(
    color = &amp;quot;#2c3e50&amp;quot;,
    alpha = 0.8
  ) +
  geom_dag_text(color = &amp;quot;white&amp;quot;) +
  geom_dag_label_repel(aes(label = label),
    col = &amp;quot;white&amp;quot;,
    label.size = .4,
    fill = &amp;quot;#20a486ff&amp;quot;,
    alpha = 0.8,
    show.legend = FALSE,
    nudge_x = .7,
    nudge_y = .3
  ) +
  labs(
    title = &amp;quot; Directed Acyclic Graph&amp;quot;,
    subtitle = &amp;quot; Two Variables of Interest with a Collider&amp;quot;
  ) +
  xlim(c(-1.5, 3.5)) +
  ylim(c(-.33, 2.2)) +
  geom_rect(
    xmin = -.5,
    xmax = 3.25,
    ymin = -.25,
    ymax = .65,
    alpha = .04,
    fill = &amp;quot;white&amp;quot;
  ) +
  theme_void() +
  theme(
    plot.background = element_rect(fill = &amp;quot;#222222&amp;quot;),
    plot.title = element_text(color = &amp;quot;white&amp;quot;),
    plot.subtitle = element_text(color = &amp;quot;white&amp;quot;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-10-29-confounders-and-colliders-modeling-spurious-correlations-in-r_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A variable like this is called a collider because the causal arrows from from &lt;strong&gt;B&lt;/strong&gt; and &lt;strong&gt;Y&lt;/strong&gt; collide at &lt;strong&gt;K&lt;/strong&gt;. &lt;strong&gt;K&lt;/strong&gt; is created in the simulation below using both &lt;strong&gt;B&lt;/strong&gt; and &lt;strong&gt;Y&lt;/strong&gt; plus random noise. This time, the outcome &lt;strong&gt;Y&lt;/strong&gt; is created using &lt;strong&gt;B&lt;/strong&gt; as an input, thereby assigning a causal relation with an effect size of 0.3.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# create variables
# B is random draws from standard normal (mean = 0, stdev = 1)
independent_var_B &amp;lt;- rnorm(n)

# Y is created with B and noise. Effect size of B on Y is 0.3
dependent_var_Y &amp;lt;- .3 * independent_var_B + rnorm(n)

# K (collider) is created with B and Y + noise
collider_var_K &amp;lt;- 1.2 * independent_var_B + 0.9 * dependent_var_Y + rnorm(n)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s assume that the experimenter knows about possible collider variable &lt;strong&gt;K&lt;/strong&gt;. What should they do with it when they go to create their regression model? Let’s create two models again to compare results. Following the nomenclature from before: &lt;strong&gt;crude_model_b&lt;/strong&gt; uses only &lt;strong&gt;B&lt;/strong&gt; to predict &lt;strong&gt;Y&lt;/strong&gt; and &lt;strong&gt;collider_model&lt;/strong&gt; uses both &lt;strong&gt;B&lt;/strong&gt; and &lt;strong&gt;K&lt;/strong&gt; to predict &lt;strong&gt;Y&lt;/strong&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# create crude regression model with B predicting Y.  K is omitted
crude_model_b &amp;lt;- lm(dependent_var_Y ~ independent_var_B)

# create collider model with B and K predicting Y
collider_model &amp;lt;- lm(dependent_var_Y ~ independent_var_B + collider_var_K)

# tidy the crude model and examine it
crude_model_b_kbl &amp;lt;- summary(crude_model_b) %&amp;gt;%
  tidy() %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;, 5), digits = 3)
crude_model_b_tbl &amp;lt;- summary(crude_model_b) %&amp;gt;% tidy()
crude_model_b_kbl&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
term
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
estimate
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
std.error
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
statistic
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
p.value
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
(Intercept)
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.021
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.032
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.666
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.506
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
independent_var_B
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.247
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.032
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
7.820
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# tidy the collider model and examine it
collider_model_kbl &amp;lt;- summary(collider_model) %&amp;gt;%
  tidy() %&amp;gt;%
  kable(align = rep(&amp;quot;c&amp;quot;, 5), digits = 3)
collider_model_tbl &amp;lt;- summary(collider_model) %&amp;gt;% tidy()
collider_model_kbl&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
term
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
estimate
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
std.error
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
statistic
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
p.value
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
(Intercept)
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.011
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.023
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.453
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.651
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
independent_var_B
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.481
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.034
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-14.250
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
collider_var_K
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.519
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.018
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
29.510
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# add label column
crude_model_b_tbl &amp;lt;- crude_model_b_tbl %&amp;gt;% mutate(model = &amp;quot;crude_model_b: no collider&amp;quot;)
collider_model_tbl &amp;lt;- collider_model_tbl %&amp;gt;% mutate(model = &amp;quot;collider_model: with collider&amp;quot;)

# combine and examine
collider_model_summary_tbl &amp;lt;- bind_rows(crude_model_b_tbl, collider_model_tbl)
collider_model_summary_tbl &amp;lt;- collider_model_summary_tbl %&amp;gt;% select(model, everything())
collider_model_summary_tbl %&amp;gt;% kable(align = rep(&amp;quot;c&amp;quot;, 6), digits = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
model
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
term
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
estimate
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
std.error
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
statistic
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
p.value
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
crude_model_b: no collider
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
(Intercept)
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.021
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.032
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.666
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.506
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
crude_model_b: no collider
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
independent_var_B
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.247
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.032
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
7.820
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
collider_model: with collider
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
(Intercept)
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.011
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.023
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.453
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.651
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
collider_model: with collider
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
independent_var_B
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.481
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.034
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-14.250
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
collider_model: with collider
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
collider_var_K
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.519
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.018
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
29.510
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.000
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This time, omitting the collider variable is the proper way to recover the true effect of &lt;strong&gt;B&lt;/strong&gt; on &lt;strong&gt;Y&lt;/strong&gt;. Let’s verify with conditional plots as before. Again, we know the true slope should be around 0.3.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# create conditional plot with crude_model_b and B
v3 &amp;lt;- visreg(crude_model_b,
  &amp;quot;independent_var_B&amp;quot;,
  gg = TRUE,
  line = list(col = &amp;quot;#E66101&amp;quot;)
) +
  labs(
    title = &amp;quot;Relationship Between B and Y&amp;quot;,
    subtitle = &amp;quot;Neglecting Collider Variable K&amp;quot;
  ) +
  ylab(&amp;quot;Change in Response Y&amp;quot;) +
  ylim(-6, 6) +
  theme(plot.subtitle = element_text(face = &amp;quot;bold&amp;quot;, color = &amp;quot;#f68f46b2&amp;quot;))

# create conditional plot with collider_model and B
v4 &amp;lt;- visreg(collider_model,
  &amp;quot;independent_var_B&amp;quot;,
  gg = TRUE,
  line = list(col = &amp;quot;#E66101&amp;quot;)
) +
  labs(
    title = &amp;quot;Relationship Between B and Y&amp;quot;,
    subtitle = &amp;quot;Considering Collider Variable K&amp;quot;
  ) +
  ylab(&amp;quot;Change in Response Y&amp;quot;) +
  ylim(-6, 6) +
  theme(plot.subtitle = element_text(face = &amp;quot;bold&amp;quot;, color = &amp;quot;#403891b2&amp;quot;))

plot_grid(v3, v4)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-10-29-confounders-and-colliders-modeling-spurious-correlations-in-r_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt; Incredibly, the conclusion one draws about the relationship between &lt;strong&gt;B&lt;/strong&gt; and &lt;strong&gt;Y&lt;/strong&gt; completely reverses depending upon which model is used. The true effect is positive (we only know this for sure because we created the data) but by including the collider variable in the model we observe it as negative. &lt;strong&gt;We should not control for a collider variable!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Controlling for a confounder reduces bias but controlling for a collider increases it - a simple summary that I will try to remember as I design future experiments or attempt to derive meaning from observational studies. These are the simple insights that make learning this stuff really fun (for me at least)!&lt;/p&gt;
&lt;p&gt;Thanks for reading.&lt;/p&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;&lt;a href=&#34;http://bayes.cs.ucla.edu/WHY/&#34; class=&#34;uri&#34;&gt;http://bayes.cs.ucla.edu/WHY/&lt;/a&gt;&lt;a href=&#34;#fnref1&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://scholar.harvard.edu/files/malf/files/ijeluquecollider.pdf&#34; class=&#34;uri&#34;&gt;https://scholar.harvard.edu/files/malf/files/ijeluquecollider.pdf&lt;/a&gt;&lt;a href=&#34;#fnref2&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3681250/&#34; class=&#34;uri&#34;&gt;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3681250/&lt;/a&gt;&lt;a href=&#34;#fnref3&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Modeling Particulate Counts as a Poisson Process in R</title>
      <link>https://rileyking.netlify.app/post/modeling-particulate-counts-as-a-poisson-process-in-r/</link>
      <pubDate>Wed, 18 Sep 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rileyking.netlify.app/post/modeling-particulate-counts-as-a-poisson-process-in-r/</guid>
      <description>
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/kePrint/kePrint.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I’ve never really worked much with Poisson data and wanted to get my hands dirty. I thought that for this project I might combine a Poisson data set with the simple Bayesian methods that I’ve explored before since it turns out the Poisson rate parameter lambda also has a nice conjugate prior (more on that later). Poisson distributed data are counts per unit time or space - they are events that arrive at random intervals but that have a characteristic rate parameter which also equals the variance. This rate parameter is usually denoted as lambda. No-hitters in baseball are often modeled as Poisson data, as are certain types of processing defects in electronics and medical devices. A particularly relevant application is in particulate testing for implantable devices. Particulate shed is an unassuming but potentially costly and dangerous phenomenon.&lt;/p&gt;
&lt;p&gt;Particulate can be shed from the surface of medical devices even when the manufacturing environment is diligently controlled. The source of the particulate can vary: light particulate is attracted to the surface of sheaths and luers due to static charge; hydrophilic coatings may delaminate from the surface during delivery; therapeutic coating on the implant’s surface may degrade over time in the presence of blood.&lt;/p&gt;
&lt;p&gt;The clinical harms that the patient may face due to particulate shed include neurological events if the particulate migrates cranially or embolism it migrates caudally. The occurrence and severity of symptoms are understood to be functions of both size and quantity of particulate. In recent years, FDA and friends have been more stringent in requiring manufacturers to quantify and understand the nature of the particulate burden associated with their devices. In the analysis below, I’m going to simulate an experiment in which particulate data are collected for 20 devices.&lt;/p&gt;
&lt;p&gt;Before I get there, I want to remind myself of what Poisson data look like for different rate parameters. I set up a function to make a Poisson pdf based on number of events n and rate parameter lambda. The function then converts the information to a tibble for use with ggplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Load libraries
library(tidyverse)
library(knitr)
library(kableExtra)
library(tolerance)
library(ggrepel)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Sequence from 0 to 24 by 1 (x-axis of plot)
number_of_events &amp;lt;- seq(0, 24, by = 1)

#Function to make a Poisson density vector from n and lambda, convert into tibble
pois_fcn &amp;lt;- function(lambda){
            pois_vector &amp;lt;- dpois(x = number_of_events, lambda = lambda, log = FALSE)
            pois_tbl    &amp;lt;- tibble(&amp;quot;num_of_events&amp;quot; = number_of_events,
                                  &amp;quot;prob&amp;quot;          = pois_vector,
                                  &amp;quot;lambda&amp;quot;        = lambda)
            }
#Objects to hold tibbles for different Poisson rates
pois_dist_1_tbl &amp;lt;-  pois_fcn(lambda = 1)
pois_dist_5_tbl &amp;lt;-  pois_fcn(lambda = 5)
pois_dist_15_tbl &amp;lt;- pois_fcn(lambda = 15)

#Combine in one df
pois_total_tbl &amp;lt;- bind_rows(pois_dist_1_tbl,
                            pois_dist_5_tbl,
                            pois_dist_15_tbl)

#Convert lambda front int to factor so ggplot maps aesthetics as levels, not gradient
pois_total_int_tbl &amp;lt;- pois_total_tbl %&amp;gt;% 
  mutate(lambda = as_factor(lambda))

#Make and store ggplot obj
h1 &amp;lt;- pois_total_int_tbl %&amp;gt;% ggplot(aes(x = num_of_events, y = prob)) +
  geom_col(aes(y = prob, fill = lambda), position = &amp;quot;dodge&amp;quot;, color = &amp;quot;black&amp;quot;) +
  scale_fill_manual(values = c(&amp;quot;#2C728EFF&amp;quot;, &amp;quot;#75D054FF&amp;quot;, &amp;quot;#FDE725FF&amp;quot;)) +
  labs(x        = &amp;quot;Number of Events&amp;quot;, 
       y        = &amp;quot;Probability&amp;quot;,
       title    = &amp;quot;Probability Mass Function&amp;quot;,
       subtitle = &amp;quot;Poisson Distributions with Different Rates (Lambda)&amp;quot;)

h1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-09-18-modeling-device-particulate-counts-as-a-poisson-process_files/figure-html/unnamed-chunk-2-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Cool - so when the rate is low it looks sort of like the discrete version of an exponential curve. It’s still not symmetric at lambda = 5 but by lambda = 15 it looks a lot like a binomial distribution.&lt;/p&gt;
&lt;p&gt;The data I simulate below are intended to represent the fluid collected during bench-top simulated use testing in a clean “flow loop” or vascular deployment model. The fluid would generally be passed through light obscuration censors to quantify the size and counts of particulate relative to a control. Particulate requirements for many endovascular devices are borrowed from USP &amp;lt;788&amp;gt;. According to that standard, no more than 60 particles greater than 25 micron effective diameter are acceptable. I want to know the probability of passing the test but don’t know the rate parameter lambda. The end goal is to understand what the most credible values for lambda are based on the bench-top data from multiple devices. First I’ll try to quantify the uncertainty in the rate parameter lambda. Each lambda can then be used to estimate a reliability. The large number of simulated lamdas will make a large set of simulated reliabilities. From there I should be able to extract any information needed regarding the uncertainty of the device reliability as it relates to particulate shed. That’s the plan! Note: I’m trying out knitr::kable() which generates html tables nicely. I’m not too good at it yet so bare with me please.&lt;/p&gt;
&lt;p&gt;Take a look at the data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Peek at some data
particulate_data %&amp;gt;% head(5) %&amp;gt;%
  kable() %&amp;gt;% kable_styling(&amp;quot;full_width&amp;quot; = F)&lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34;table&#34; style=&#34;width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
x
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
46
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
58
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
38
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
50
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
62
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I’m using a Bayesian approach again - partially because I need practice and partially because the Poisson parameter lambda has a convenient conjugate prior: the gamma distribution. This means that some simple math can get me from the prior to the posterior. I love simple math. Using the gamma distribution to describe the prior belief in lambda, the posterior distribution for lambda is:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\mbox{prior:  lambda ~ Gamma}(a, b)\]&lt;/span&gt; As a reminder to myself, this is read as “lambda is distributed as a Gamma distribution with parameters a and b”.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\mbox{posterior:  lambda ~ Gamma}(a + \sum_{i=1}^{n} x_i\ , b + n)\]&lt;/span&gt; It is reasonable to use an relatively uninformed prior for lambda since I don’t have much preliminary knowledge about particulate data for my device design. Setting the shape a to 1 and the rate b to 0.1 provides allocates the credibility across a wide range of lambdas to start. To go from prior to posterior we need only sum up all the particulate counts in the data set and add the total to the shape a, then add the total number of devices tested (sample size n) to the rate b.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Set parameters and constants
a &amp;lt;- 1
b &amp;lt;- 0.1
n &amp;lt;- length(particulate_data)
total_particulate_count &amp;lt;- sum(particulate_data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I like to peek at the prior and posterior distributions of lambda since they are easy to visualize via the relationships above. We are back into continuous distribution mode because the rate parameter lambda can be any positive value even though the particulate counts the come from the process are discrete.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Set sequence of x values; generate prior using a,b; generate posterior 
x_values  &amp;lt;- seq(0, 60, length.out = 1000)
prior     &amp;lt;- dgamma(x_values, shape = 1, rate = 0.1)
posterior &amp;lt;- dgamma(x_values, shape = a + total_particulate_count, rate = b + n)

#Prior in tibble format
prior_tbl &amp;lt;- tibble(
  &amp;quot;x_values&amp;quot; = x_values,
  &amp;quot;prob&amp;quot;     = prior,
  &amp;quot;config&amp;quot;   = &amp;quot;prior&amp;quot;
)

#Posterior in tibble format
posterior_tbl &amp;lt;- tibble(
  &amp;quot;x_values&amp;quot; = x_values,
  &amp;quot;prob&amp;quot;     = posterior,
  &amp;quot;config&amp;quot;   = &amp;quot;posterior&amp;quot;
)

#Combine prior and posterior in 1 tibble
prior_post_tbl &amp;lt;- bind_rows(prior_tbl, posterior_tbl)

#Visualize 
prior_post_tbl %&amp;gt;% ggplot(aes(x = x_values, y = prob)) +
  geom_line(aes(color = config), size = 1.5, alpha = 0.8) +
  scale_y_continuous(name=&amp;quot;Density&amp;quot;, limits=c(0, 0.3)) +
  scale_color_manual(values = c(&amp;quot;#75D054FF&amp;quot;, &amp;quot;#2C728EFF&amp;quot;)) +
  labs(
    title    = &amp;quot;Rate Parameter Lambda For Particle Counts&amp;quot;,
    subtitle = &amp;quot;Modeled as Poisson Process&amp;quot;,
    x        = &amp;quot;Lambda&amp;quot;,
    color    = &amp;quot;&amp;quot;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-09-18-modeling-device-particulate-counts-as-a-poisson-process_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Having access to the posterior distribution of lambda enables simulation of possible values of lambda by drawing random values from the distributions. The probability of drawing any particular value of lambda is based on the density shown on the y-axis (although the probability of any particular point is zero; we must calculate over a span of x via integration). Each of the values randomly drawn from the posterior can be used to simulate a distribution of particulate counts for comparison with the spec. The workflow is essentially a series of questions:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;What might the values of the rate parameter lambda be based on the data? -&amp;gt; Combine data with conjugate prior to generate the posterior distribution of credible lambdas. (Done and shown above)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a random value of lambda is pulled from the posterior distribution , what would we expect regarding the uncertainty of the original experiment? -&amp;gt; Draw random values lambda and then evaluate what percentage of the cdf lies above the spec (could also run simulations for each random lambda and then count the number of simulated runs above the spec but this is time consuming (10,000 lambdas x 10,000 simulations to build out the particle count distribution for each one…)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Combine each of these tail areas into a new distribution. This new distribution represents the uncertainty in the reliability estimate based on uncertainty in lambda. How to estimate the reliability of the real device while taking uncertainty into account? -&amp;gt; Calculate the lower bound of the 95% credible interval by finding the .05 quantile from the set of simulated reliability values.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Let’s do this!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Sample and store 10000 random lambda values from posterior 
n_posterior_samples &amp;lt;- 10000
sampled_posterior_lambda &amp;lt;- rgamma(n_posterior_samples, shape = a + total_particulate_count, rate = b + n)

#Initialize empty vector to hold reliability data
reliability_vector &amp;lt;- rep(NA, n_posterior_samples)

#For each lambda value, calc cumulative probability of less than or equal to q particles shed from 1 sample?
for(i in 1:n_posterior_samples){
  reliability_vector[i] &amp;lt;- ppois(q = 60, lambda = sampled_posterior_lambda[i])
}

#Visualize
reliability_vector %&amp;gt;% head() %&amp;gt;% 
  kable(align=rep(&amp;#39;c&amp;#39;)) %&amp;gt;% kable_styling(&amp;quot;full_width&amp;quot; = F)&lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34;table&#34; style=&#34;width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
x
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.9147028
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.9506510
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.9431756
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.9700806
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.9546490
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.9540933
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Checking what the simulated reliabilities are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Convert reliability vector to tibble
reliability_tbl &amp;lt;- reliability_vector %&amp;gt;% 
  as_tibble() %&amp;gt;%
  mutate(&amp;quot;reliability&amp;quot; = value) %&amp;gt;%
  select(reliability)

#Visualize with histogram
reliability_tbl %&amp;gt;% ggplot(aes(reliability)) +
  geom_histogram(fill = &amp;quot;#2c3e50&amp;quot;, color = &amp;quot;white&amp;quot;, binwidth = .01, alpha = 0.8) +
    labs(
        x        = &amp;quot;Reliability&amp;quot;,
        title    = &amp;quot;Estimated Reliability Range for Particulate Shed Performance&amp;quot;,
        subtitle = &amp;quot;Requirement: 60 or less of 25 um or larger&amp;quot;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-09-18-modeling-device-particulate-counts-as-a-poisson-process_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The 95% credible interval for the reliability (conformance rate) is the .05 quantile of this distribution since the spec is 1-sided:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Calculate .05 quantile
reliability_tbl$reliability %&amp;gt;% 
  quantile(probs = .05)     %&amp;gt;% 
  signif(digits = 3)    &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    5% 
## 0.893&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, the answer! The lowest reliability expected is 89.3 % based on a 95% credible interval. This would likely not meet the product requirements (assigned based on risk of the harms that come from this particular failure mode) and we would likely need to improve our design or processes to reduce particulate shed from the product.&lt;/p&gt;
&lt;p&gt;This concludes the Bayesian inference of reliability in Poisson distributed particle counts. But hey, since we’re here… one of the things I love about R is the ability to easily check sensitivities, assumptions, and alternatives easily. What would this analysis look like using the conventional frequentist approach? I admit I’m not sure exactly but I assume we would extend the standard tolerance interval approach that is common in Class III medical device submissions. Tolerance intervals are easy to pull from tables or software but actually pretty tricky (for me at least) to derive. They involve uncertainty in both the mean and the variance. For simplicity (and because I’m not confident enough to derive the formula), I’ll use the tolerance package in R to calculate tolerance intervals for Poisson data. It turns out that there are 8 methods and I’ll use them all because I’m feeling a little wild and I want to see if they result in different results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## 95%/95% 1-sided Poisson tolerance limits for future
## occurrences in a period of length 1 part. All eight methods
## are presented for comparison.
tl_tab &amp;lt;- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = &amp;quot;TAB&amp;quot;) %&amp;gt;% mutate(method = &amp;quot;TAB&amp;quot;) %&amp;gt;% as_tibble() 

tl_ls &amp;lt;- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = &amp;quot;LS&amp;quot;) %&amp;gt;% mutate(method = &amp;quot;LS&amp;quot;) %&amp;gt;% as_tibble() 

tl_sc &amp;lt;- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = &amp;quot;SC&amp;quot;) %&amp;gt;% mutate(method = &amp;quot;SC&amp;quot;) %&amp;gt;% as_tibble() 

tl_cc &amp;lt;- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = &amp;quot;CC&amp;quot;) %&amp;gt;% mutate(method = &amp;quot;CC&amp;quot;) %&amp;gt;% as_tibble()

tl_vs &amp;lt;- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = &amp;quot;VS&amp;quot;) %&amp;gt;% mutate(method = &amp;quot;VS&amp;quot;) %&amp;gt;% as_tibble() 

tl_rvs &amp;lt;- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = &amp;quot;RVS&amp;quot;) %&amp;gt;% mutate(method = &amp;quot;RVS&amp;quot;) %&amp;gt;% as_tibble() 

tl_ft &amp;lt;- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = &amp;quot;FT&amp;quot;) %&amp;gt;%mutate(method = &amp;quot;FT&amp;quot;) %&amp;gt;% as_tibble() 

tl_csc &amp;lt;- poistol.int(x = sum_part_data, n = n, m = 1, alpha = 0.05, P = 0.95,
side = 1, method = &amp;quot;CSC&amp;quot;) %&amp;gt;% mutate(method = &amp;quot;CSC&amp;quot;) %&amp;gt;% as_tibble() 

tl_all_tbl &amp;lt;-  bind_rows(tl_tab,
                         tl_ls,
                         tl_sc,
                         tl_cc,
                         tl_vs,
                         tl_rvs,
                         tl_ft,
                         tl_csc)

tl_all_tbl %&amp;gt;% kable(align=rep(&amp;#39;c&amp;#39;, 5)) &lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
alpha
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
P
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
lambda.hat
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
1-sided.lower
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
1-sided.upper
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
method
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.05
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.95
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
49.15
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
36
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
64
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
TAB
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.05
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.95
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
49.15
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
36
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
64
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
LS
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.05
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.95
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
49.15
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
36
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
64
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
SC
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.05
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.95
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
49.15
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
36
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
64
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
CC
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.05
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.95
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
49.15
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
36
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
64
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
VS
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.05
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.95
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
49.15
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
36
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
64
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
RVS
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.05
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.95
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
49.15
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
36
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
64
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
FT
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.05
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.95
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
49.15
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
36
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
64
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
CSC
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For this data set it can be seen that all 8 methods produce the same 1-sided 95/95 upper tolerance interval 64 counts per device. N=60 was the requirement - since the edge of our tolerance interval lies above the 1-sided spec we would fail this test. This conclusion is consistent with the Bayesian method that estimates the reliability below the 95% requirement.&lt;/p&gt;
&lt;p&gt;But what sort of reliability claim could our data support? For the Bayesian approach we concluded that the answer was 89.3% (lower bound of 1-sided 95% credible interval). For the frequentist method, we don’t have a posterior distribution to examine. We could try using the tolerance interval function above with various values of P to impute the value of P which coincides with the spec limit of 60.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Sequence of reliability values for which to use as P 
reliability_freq_tbl &amp;lt;- tibble(
  &amp;quot;proportion_covered_P&amp;quot; = seq(.40, .99, .01)
)

#Function that is just like poistol.int but extracts and reports only the upper limit
#of the tolerance interval
tol_interval_fcn &amp;lt;- function(data_vec = sum_part_data, n=20, m=1, alpha=.05, P=.95, side=1, method=&amp;quot;TAB&amp;quot;){
  holder &amp;lt;- poistol.int(data_vec, n, m, alpha, P, side, method)
  holder_2 &amp;lt;- holder[1,5]
}

#Test the function
test_1 &amp;lt;- tol_interval_fcn(data_vec = sum_part_data, n=n, m=1, alpha = .05, P = .95, side = 1, method = &amp;quot;TAB&amp;quot;)

#Test the function
test_2 &amp;lt;- tol_interval_fcn(P = .95)

#Map the function across a vector of proportions
#Note to future self: map() arguments are: the list of values map the fn over, the fn
#itself, then all the additional arguments of the fn that you aren&amp;#39;t mapping over (odd syntax)
upper_tol_tbl &amp;lt;- reliability_freq_tbl %&amp;gt;% mutate(
  particles_per_part = map(proportion_covered_P, tol_interval_fcn, data_vec = sum_part_data, n=n, m=1, alpha = .05, side = 1, method = &amp;quot;TAB&amp;quot;) %&amp;gt;% as.integer() 
)

#View haead and tail of data
upper_tol_tbl %&amp;gt;% head(20) %&amp;gt;% kable(align=rep(&amp;#39;c&amp;#39;, 2))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
proportion_covered_P
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
particles_per_part
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.40
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
50
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.41
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
50
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.42
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
50
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.43
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
50
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.44
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
51
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.45
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
51
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.46
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
51
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.47
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
51
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.48
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
51
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.49
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
51
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.50
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
52
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.51
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
52
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.52
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
52
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.53
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
52
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.54
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
52
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.55
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
53
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.56
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
53
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.57
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
53
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.58
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
53
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.59
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
53
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;upper_tol_tbl %&amp;gt;% tail(20) %&amp;gt;% kable(align=rep(&amp;#39;c&amp;#39;, 2))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
proportion_covered_P
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
particles_per_part
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.80
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
58
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.81
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
58
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.82
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
58
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.83
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
59
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.84
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
59
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.85
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
59
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.86
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
60
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.87
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
60
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.88
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
60
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.89
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
61
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.90
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
61
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.91
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
62
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.92
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
62
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.93
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
63
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.94
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
63
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.95
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
64
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.96
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
65
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.97
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
66
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.98
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
67
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.99
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
69
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#need this data to feed to gg_label_repel to tell it where to attach label
point_tbl &amp;lt;- tibble(x = .65, y = 60)

#visualize 
upper_tol_tbl %&amp;gt;% ggplot(aes(x = proportion_covered_P, y = particles_per_part)) +
  geom_line(color = &amp;quot;#2c3e50&amp;quot;,
            size = 2.5) +
    labs(x = &amp;quot;Estimated Reliability at .95 Confidence Level&amp;quot;,
         y = &amp;quot;Edge of 1-Sided Tolerance Interval (Particles per Device)&amp;quot;,
         title = &amp;quot;Edge of Tolerance Interval vs. Specified Reliability&amp;quot;,
         subtitle = &amp;quot;95% Confidence Level Using TAB Tolerance Technique&amp;quot;) +
  scale_y_continuous(breaks = seq(40, 70, 5)) +
  geom_vline(xintercept = .88) +
  geom_hline(yintercept = 60) +
  geom_point(x = .65, y = 60, size = 0, alpha = 0) +
  geom_label_repel(data = point_tbl, aes(x, y), 
                   label = &amp;quot;Spec Limit: 60 Particles Max&amp;quot;,
                   fill = &amp;quot;#2c3e50&amp;quot;, 
                   color = &amp;quot;white&amp;quot;,
                   segment.color = &amp;quot;#2c3e50&amp;quot;,
                   segment.size = 1,
                   min.segment.length = unit(1, &amp;quot;lines&amp;quot;),
                   nudge_y = 2,
                   nudge_x = .05)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-09-18-modeling-device-particulate-counts-as-a-poisson-process_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Here’s a plot that I’ve never made or seen before. For given set of data (in this case: particulate_data from earlier with n=20 from a Poisson distribution, lambda = 50), the x-axis shows the estimated reliability and the y-axis represents the number of particles at the edge the calculated tolerance interval using the TAB method. That is to say: the standard approaches to calculate the edge of the relevant tolerance interval for a specified proportion at a specified confidence level. For example, we could state we want to know the estimate for the 95th percentile at 95% confidence level - the answer would be 64 particles per device. Since the requirement for clinical safety is set at 60 particles max, we would not pass the test because we could not state with high confidence that 95 or more (out of 100) would pass. Usually it’s just a binary pass/fail decision.&lt;/p&gt;
&lt;p&gt;It’s obvious that the 95/95 edge of the tolerance interval is out of spec… but what would be the greatest reliability we could claim at 95% confidence? It ends up being .88 or 88% - very close to the predicted lower bound of the 95% credible interval calculated from the Bayesian method (which was 89.3%, from above)! In this case, the frequentist and Bayesian methods happen to be similar (even though they aren’t measuring the same thing). Interesting stuff!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Stopping Rules for Significance Testing in R</title>
      <link>https://rileyking.netlify.app/post/stopping-rules-for-significance-testing-in-r/</link>
      <pubDate>Fri, 06 Sep 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rileyking.netlify.app/post/stopping-rules-for-significance-testing-in-r/</guid>
      <description>
&lt;script src=&#34;https://rileyking.netlify.app/rmarkdown-libs/kePrint/kePrint.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;When doing comparative testing it can be tempting to stop when we see the result that we hoped for. In the case of null hypothesis significance testing (NHST), the desired outcome is often a p-value of &amp;lt; .05. In the medical device industry, bench top testing can cost a lot of money. Why not just recalculate the p-value after every test and stop when the p-value reaches .05? The reason is that the confidence statement attached to your testing is only valid for a specific stopping rule. In other words, to achieve the desired false positive rate we must continue testing speciments until the pre-determined sample size is reached. Evaluating the p-value as you proceed through the testing is known as “peeking” and it’s a statistical no-no.&lt;/p&gt;
&lt;p&gt;Suppose we are attempting to demonstrate that a raw material provided by a new vendor results in better corrosion resistance in finished stents relative to the standard supplier. A bench top test is set up to measure the breakdown potential of each sample in a cyclic potentiodynamic polarization (CPP) test. Our goal is to compare the means of the CPP data from the old supplier and the new supplier. The null hypothesis is that the means are equivalent and if the t-test results in a p-value of .05 or lower then we will reject the null and claim improved performance. What happens to the p-value over the course of the testing? We can run a simulation to monitor the p-value and calculate the effect of peeking on the long-term false positive rate. For the test to perform as intended, the long-term false positive rate should be controlled at a level equal to (1 - confidence level).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(knitr)
library(kableExtra)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, initialize the objects to hold the data and establish any constants we might need later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Initial offset constant to keep minimum group size at n=6
INITIAL_OFFSET &amp;lt;- 5

#Initial values for number of inner and outer loop iterations
n_inner_loop &amp;lt;- 50
n_inner_data &amp;lt;- n_inner_loop + INITIAL_OFFSET
n_outer &amp;lt;- 100

#Initialize empty vector to store p values
store_p_values_vec &amp;lt;- rep(NA, n_inner_loop)

#Initialize a tibble with placeholder column
many_runs_tbl &amp;lt;-  tibble(
  V1 = rep(NA,  n_inner_loop)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The simulation requires 2 for loops. The inner loop performs a series of t-test adding 1 more experimental observation to each group after each iteration. The p-value for that iteration is extracted and stored. In the outer loop, the initial data for the 2 groups are generated randomly from normal distributions. Since we can’t really run a t-test on groups with very low sample sizes, we use an initial offset value so that the t-test loops don’t start until both groups have a few observations from which to calculate the means.&lt;/p&gt;
&lt;p&gt;The p-value for a traditional t-test should be an indication of the long-term false positive rate. In other words: if we ran a t-test on samples drawn from 2 identical populations many times we would see a few large differences in means simply due to chance draws. Among all such simulations, the value at the 95% quantile represents the p-value of .05.&lt;/p&gt;
&lt;p&gt;We can gut-check our simulation in this way by setting the two populations identical to each other and drawing random values in the outer loop as mentioned above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Set seed for repeatability
set.seed(1234)

#Outer loop: replicates a t-test between 2 groups
for(l in 1:n_outer) {
    
    #Generate simulated data for each group.  The parameters are set the same to represent 1 population
    example_group_1 &amp;lt;- rnorm(n = n_inner_data, mean = 10, sd = 4)
    example_group_2 &amp;lt;- rnorm(n = n_inner_data, mean = 10, sd = 4)
    
    #Inner loop: subset the first (i + initial offset) values from grp 1 and grp 2 (y)
    #Perform t-test, extract p-value, store in a vector
    #Increment each group&amp;#39;s size by 1 after each iteration
    for (i in 1:n_inner_loop) {
    t_test_obj &amp;lt;- t.test(x = example_group_1[1:(INITIAL_OFFSET + i)], y = example_group_2[1:(INITIAL_OFFSET + i)])
    store_p_values_vec[i] = t_test_obj$p.value
  }
  
    #Store each vector of n_inner_loop p-values to a column in the many_runs_tbl
    many_runs_tbl[,l] &amp;lt;- store_p_values_vec
}

#visualize tibble 
many_runs_tbl[,1:12] %&amp;gt;% 
  signif(digits = 3) %&amp;gt;%
  head(10) %&amp;gt;% 
  kable(align=rep(&amp;#39;c&amp;#39;, 100))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V1
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V2
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V3
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V4
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V5
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V6
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V7
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V8
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V9
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V10
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V11
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
V12
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.3960
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0990
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.204
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.412
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0686
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1450
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.894
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.360
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.721
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.897
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0535
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.668
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1700
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0628
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.106
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.951
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.2240
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0834
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.802
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.614
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.750
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.886
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.3170
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.517
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1410
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0929
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.057
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.618
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1360
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0296
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.499
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.561
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.846
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.809
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1740
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.410
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1560
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.4050
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.146
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.800
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1690
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0625
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.724
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.700
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.857
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.687
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.3620
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.338
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1140
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.2610
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.104
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.992
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.2550
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1860
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.548
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.846
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.727
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.911
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.4270
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.334
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0540
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.3400
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.143
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.889
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.3180
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1740
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.775
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.768
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.795
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.666
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.5630
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.229
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0693
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.4030
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.125
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.871
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.7340
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0757
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.826
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.792
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.704
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.755
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.4810
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.694
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0324
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.4050
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.181
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.930
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.8630
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0617
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.738
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.564
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.501
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.611
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.3930
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.472
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0206
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.4550
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.112
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.912
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.7560
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0958
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.644
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.708
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.265
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.687
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.2520
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.638
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0294
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.6690
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.103
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.777
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.8680
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1700
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.664
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.703
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.284
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.912
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.2450
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.441
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Each column above represents n=50 p-values, with each successive value calculated after observing the newest data point in the simulated test sequence. These are the p-values we see if we peek at the calculation every time.&lt;/p&gt;
&lt;p&gt;We need to convert data into tidy format for better visualization. In the tidy format, every column should be a unique variable. The gather() function converts data from wide to long by adding a new variable called “rep_sim_number” and combining all the various runs from 1 to 100 in a single column. In total, we’ll have only 3 columns in the tidy version.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#add new column with row id numbers
final_runs_tbl &amp;lt;- many_runs_tbl %&amp;gt;% 
    mutate(row_id = row_number()) %&amp;gt;%
    select(row_id, everything())

#convert from wide format (untidy) to long (tidy) using gather()
final_runs_tidy_tbl &amp;lt;- final_runs_tbl %&amp;gt;% gather(key = &amp;quot;rep_sim_number&amp;quot;, value = &amp;quot;p_value&amp;quot;, -row_id)

#visualize tidy data structure
final_runs_tidy_tbl %&amp;gt;% 
  head(10) %&amp;gt;% 
  kable(align=rep(&amp;#39;c&amp;#39;, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
row_id
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
rep_sim_number
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
p_value
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.3963352
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1704697
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
3
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1414021
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
4
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1557261
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1141854
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
6
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0539595
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
7
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0693410
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
8
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0324232
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0205511
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0293952
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_runs_tidy_tbl %&amp;gt;% 
  tail(10) %&amp;gt;% 
  kable(align=rep(&amp;#39;c&amp;#39;, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
row_id
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
rep_sim_number
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
p_value
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
41
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0515933
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
42
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0509430
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
43
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0386845
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
44
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0567804
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
45
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0762953
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
46
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0933081
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
47
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0755494
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
48
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0558263
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
49
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0731072
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
50
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0496300
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From here it is straightforward to visualize the trajectory of the p-values through the course of the testing for all 100 simulations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#visualize history of n_outer p-values across n_inner_loop consecutive data points as lineplot
lp_1 &amp;lt;- final_runs_tidy_tbl %&amp;gt;% ggplot(aes(x = row_id, y = p_value, group = rep_sim_number)) +
  geom_line(show.legend = &amp;quot;none&amp;quot;,
            color       = &amp;quot;grey&amp;quot;,
            alpha       = 0.7) +
  labs(x        = &amp;quot;Sequential Benchtop Test Observations&amp;quot;,
       title    = &amp;quot;P-Value History for Difference in Means, Standard T-Test&amp;quot;,
       subtitle = &amp;quot;Both Groups Sampled From Same Population&amp;quot;
       )

lp_1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-09-06-stopping-rules-for-significance-testing_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The p-values are all over the place! It makes sense that at the pre-determined stopping point (n=50) we would have a spread of p-values since the population parameters for the two groups were identical and p should only rarely land below .05. However, this visualization makes it clear that prior to the stopping point, the path of any particular p-value fluctuates wildly. This is the reason why we can’t stop early or peek!&lt;/p&gt;
&lt;p&gt;Let’s take a look at the false positives, defined here as the runs where the p-value ended up less than or equal to .05 at the pre-determined stopping point of n=50.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#filter for runs that ended in false positives (p &amp;lt; .05) at the last data point
filtered_endpoint_tbl &amp;lt;- final_runs_tidy_tbl %&amp;gt;% 
    filter(row_id == 50,
           p_value &amp;lt;= 0.05) %&amp;gt;%
    select(rep_sim_number) %&amp;gt;%
    rename(&amp;quot;false_positives&amp;quot; = rep_sim_number)

filtered_endpoint_tbl %&amp;gt;% 
  head(10) %&amp;gt;% 
  kable(align=&amp;#39;c&amp;#39;) %&amp;gt;%
  kable_styling(full_width = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34;table&#34; style=&#34;width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
false_positives
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V23
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V48
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V54
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V77
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V86
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V89
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V100
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So 8 out of 100 simulations have p-values &amp;lt; .05. This is about as expected since the long term false positive rate should be 5%. Having now identified the false positives, we can visualize the trajectory of their p-values after obtaining each successive data point. This is what happens when we peek early or stop the test when we first see a desired outcome. The following code pulls the full history of the false positive test sequences so we can see their paths before the stopping point.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#extract full false positive test histories.  %in% filters rows that match anything in the false_positives vector
full_low_runs_tbl &amp;lt;- final_runs_tidy_tbl %&amp;gt;%
    filter(rep_sim_number %in% filtered_endpoint_tbl$false_positives)

#visualize trajectory of false positives by highlighting their traces
lp_2 &amp;lt;- final_runs_tidy_tbl %&amp;gt;% 
    ggplot(aes(x = row_id, y = p_value, group = rep_sim_number)) +
    geom_line(alpha = 0.7, show.legend = FALSE, color = &amp;quot;grey&amp;quot;) +
    geom_line(aes(color = rep_sim_number), data = full_low_runs_tbl, show.legend = FALSE, size = .8, alpha = 0.7) +
    labs(x       = &amp;quot;Sequential Benchtop Test Observations&amp;quot;,
        title    = &amp;quot;P-Value History for Difference in Means, Standard T-Test&amp;quot;,
        subtitle = &amp;quot;Highlighted Traces Represent Sequences with p &amp;lt; .05 at n=50&amp;quot;
        )

lp_2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-09-06-stopping-rules-for-significance-testing_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt; Indeed, the p-values that end up less than .05 do not take a straight line path to get there. Likewise, there may be tests that dip below p=.05 at some point but culminate well above .05 at the pre-determined stopping point. These represent additional false-positives we invite when we peek or stop early. Let’s identify and count these:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#filter for all run who&amp;#39;s p-value ever dipped to .05 or lower at any point 
low_p_tbl &amp;lt;- final_runs_tidy_tbl %&amp;gt;% 
    filter(p_value &amp;lt;= .05) %&amp;gt;% 
    distinct(rep_sim_number)

#visualize
low_p_tbl %&amp;gt;% 
  head(10) %&amp;gt;% 
  kable(align=&amp;#39;c&amp;#39;) %&amp;gt;% 
  kable_styling(full_width = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34;table&#34; style=&#34;width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
rep_sim_number
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V6
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V7
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V16
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V17
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V20
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V21
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V23
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V30
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V33
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#count total number of false positives with peeking
low_p_tbl %&amp;gt;% nrow() %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;) %&amp;gt;% 
  kable_styling(full_width = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34;table&#34; style=&#34;width: auto !important; margin-left: auto; margin-right: auto;&#34;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
x
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
37
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The false positives go from 8 to 37!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#filter for only the rows where rep_sim_number here matches at least 1 value from low_p_tbl$rep_sim_number
#this extracts the full history of runs who&amp;#39;s p-value dipped to .05 or lower at any point 
any_low_runs_tbl &amp;lt;- final_runs_tidy_tbl %&amp;gt;%
    filter(rep_sim_number %in% low_p_tbl$rep_sim_number)

#visualize
any_low_runs_tbl %&amp;gt;% 
  head(10) %&amp;gt;% 
  kable(align = rep(&amp;quot;c&amp;quot;, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
row_id
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
rep_sim_number
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
p_value
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.3963352
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
2
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1704697
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
3
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1414021
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
4
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1557261
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
5
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.1141854
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
6
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0539595
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
7
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0693410
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
8
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0324232
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0205511
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
V1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.0293952
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#visualize the trajectory or runs that dipped to .05 or below
lp_3 &amp;lt;- final_runs_tidy_tbl %&amp;gt;% 
    ggplot(aes(x = row_id, y = p_value, group = rep_sim_number)) +
    geom_line(alpha = 0.7, show.legend = FALSE, color = &amp;quot;grey&amp;quot;) +
    geom_line(aes(color = rep_sim_number), data = any_low_runs_tbl, show.legend = FALSE, size = .8, alpha = 0.7) +
    labs(x       = &amp;quot;Sequential Benchtop Test Observations&amp;quot;,
        title    = &amp;quot;P-Value History for Difference in Means, Standard T-Test&amp;quot;,
        subtitle = &amp;quot;Highlighted Runs Represent p &amp;lt; .05 at Any Point&amp;quot;
        )

lp_3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-09-06-stopping-rules-for-significance-testing_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;All these differences in means would be considered significant if we don’t observe our pre-determined stopping rule. This could be a big deal. We might claim a performance benefit when there is none, or waste precious time and money trying to figure out why we can’t replicate an earlier experiment!&lt;/p&gt;
&lt;p&gt;Thanks for reading.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Assessing Design Verification Risk with Bayesian Estimation in R</title>
      <link>https://rileyking.netlify.app/post/assessing-dv-risk-w-bayesian-estimation-in-r/</link>
      <pubDate>Fri, 23 Aug 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rileyking.netlify.app/post/assessing-dv-risk-w-bayesian-estimation-in-r/</guid>
      <description>


&lt;p&gt;Suppose our team is preparing to freeze a new implant design. In order to move into the next phase of the PDP, it is common to perform a suite of formal “Design Freeze” testing. If the results of the Design Freeze testing are acceptable, the project can advance from Design Freeze (DF) into Design Verification (DV). DV is an expensive and resource intensive phase culminating in formal reports that are included in the regulatory submission. One key goal of DF is therefore to burn down enough risk to feel confident going into DV. Despite the high stakes, I haven’t ever seen a quantitative assessment of residual risk at the phase review. In this post we’ll attempt to use some simple Bayesian methods to quantify the DV risk as a function of DF sample size for a single, high-risk test.&lt;/p&gt;
&lt;p&gt;Consider the requirement for accelerated durability (sometimes called fatigue resistance). In this test, the device is subjected to cyclic loading for a number of cycles equal to the desired service life. For 10 years of loading due to systolic - diastolic pressure cycles, vascular implants must survive approximately 400 million cycles. Accelerated durability is usually treated as attribute type data because the results can be only pass (if no fractures observed) or fail (if fractures are observed). Each test specimen can therefore be considered a Bernoulli trial and the number of passing units in n samples can be modeled with the binomial.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/fatigue.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;How many samples should we include in DF? We’ll set up some simulations to find out. In order to incorporate the outcome of the DF data into a statement about the probability of success for DV, we’ll need to apply Bayesian methods.&lt;/p&gt;
&lt;p&gt;First, load the libraries:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(cowplot)
library(gghighlight)
library(knitr)
library(kableExtra)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The simulation should start off before we even execute Design Freeze testing. If we’re going to use Bayesian techniques we need to express our uncertainty about the parameters in terms of probability. In this case, the parameter we care about is the reliability. Before seeing any DF data we might know very little about what the true reliability is for this design. If we were asked to indicate what we thought the reliability might be, we should probably state a wide range of possibilities. The design might be good but it might be quite poor. Our belief about the reliability before we do any testing at all is called the prior and we expess it as a probability density function, not a point estimate. We need a mathematical function to describe how we want to spread out our belief in the true reliability.&lt;/p&gt;
&lt;p&gt;The beta is a flexible distribution that can be adjusted to take a variety of different forms. By tweaking the two shape factors of the beta we can customize the probability density curve in many different ways. If we were super confident that every part we ever made would pass the durability testing, we could put a “spike” prior right on 1.0. This is like saying “there’s no way any part could ever fail”. But the whole point is to communicate uncertainty and in reality there is always a chance the reliability might only be 97%, or 94%, etc. Since we haven’t really seen any DF data, we should probably drop some of our credibility into many different possible values of the reliability. Let’s be very conservative here and just use the flat prior. By evenly binning all of our credibility across the full range of reliability from 0 to 1, we’re saying we don’t want our pre-conceived notions to influence the final estimated reliability much. We’ll instead use the DF data themselves to re-allocate the credibility across the range of reliabilities appropriately according to Bayes’ rule after looking at the Design Freeze results. The more DF data we observe, the more precise the posterior estimate.&lt;/p&gt;
&lt;p&gt;The mathematical way to turn the beta distribution into a straight line (flat prior) is to set the shape parameters alpha and beta to (1,1). Note the area under the curve must always sum to 1. The image on the left shows a flat prior generated from a beta density with parameters (1,1).&lt;/p&gt;
&lt;p&gt;Another way to display the prior is to build out the visualization manually by drawing random values from the beta(1,1) distribution and constructing a histogram. This method isn’t terribly useful since we already know the exact distribution we want to use but I like to include it to emphasize the idea of “binning” the credibility across different values of reliability. It’s also nice to see the uncertainty we might see when we start to randomly draw from the distribution (full disclosure: I also just to practice my coding).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Plot flat prior using stat_function and ggplot
p_1 &amp;lt;- tibble(x_canvas=c(0,1)) %&amp;gt;% ggplot(aes(x=x_canvas)) +
    stat_function(fun   = dbeta,
                  args  = list(1, 1),
                  color = &amp;quot;#2c3e50&amp;quot;, 
                  size  = 1,
                  alpha = .8) +
    ylim(c(0,1.5)) +
    labs(
        y = &amp;quot;Density of Beta&amp;quot;,
        x = &amp;quot;Reliability&amp;quot;,
        title = &amp;quot;Credibility Allocation, Start of DF&amp;quot;,
        subtitle = &amp;quot;Uninformed Prior with Beta (1,1)&amp;quot;
    )

#Set the number of random draws from beta(1,1) to construct histogram flat prior
set.seed(123)
n_draws &amp;lt;- 100000

#Draw random values from beta(1,1), store in object
prior_dist_sim &amp;lt;- rbeta(n = n_draws, shape1 = 1, shape2 = 1)

#Convert from vector to tibble
prior_dist_sim_tbl &amp;lt;- prior_dist_sim %&amp;gt;% as_tibble()

#Visualize with ggplot
p_2 &amp;lt;- prior_dist_sim_tbl %&amp;gt;% ggplot(aes(x = value)) +
    geom_histogram(
        boundary = 1, 
        binwidth = .05, 
        color    = &amp;quot;white&amp;quot;,
        fill     = &amp;quot;#2c3e50&amp;quot;,
        alpha    = 0.8
        ) +
    xlim(c(-0.05, 1.05)) +
    ylim(c(0, 7500)) +
    labs(
        y = &amp;quot;Count&amp;quot;,
        title = &amp;quot;Credibility Simulation , Start of DF&amp;quot;,
        subtitle = &amp;quot;Uninformed Prior with Beta (1,1)&amp;quot;,
        x = &amp;quot;Reliability&amp;quot;
    
    ) 

plot_grid(p_1,p_2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-23-assessing-dv-risk-w-bayesian-estimation-in-r_files/figure-html/unnamed-chunk-2-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;OK now the fun stuff. There is a cool, mathematical shortcut we can take to combine our simulated Design Freeze data with our flat prior to create the posterior distribution. It’s very simple: we just add the number of passing DF units to our alpha parameter and the number of failing DF units to our beta parameter. The reason why this works so well is beyond the scope of this post, but the main idea is that when the functional form of the prior (beta function in our case) is similar to the functional form of the likelihood function (Bernoulli in our case), then you can multiply them together easily and the product also takes a similar form. When this happens, the prior is said to be the “conjugate prior” of the likelihood function &lt;a href=&#34;#fn1&#34; class=&#34;footnoteRef&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; The beta and binomial are a special case that go together like peanut butter and jelly.&lt;/p&gt;
&lt;p&gt;Again, to understand how our belief in the reliability should be allocated after observing the DF data, all we need to do is update the beta function by adding the number of passing units from DF testing to alpha (Shape1 parameter) and the number of failing units to beta (Shape2 parameter).&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\mbox{Beta}(\alpha_0+\mbox{passes}, \beta_0+\mbox{fails})\]&lt;/span&gt; We’re going to assume all units pass DF, so we only need to adjust the alpha parameter. The resulting beta distribution that we get after updating the alpha parameter represents our belief in where the true reliability may lie after observing the DF data. Remember, even though every unit passed, we can’t just say the reliability is 100% because we’re smart enough to know that if the sample size was, for example, n=15 - there is a reasonable chance that a product with true reliability of 97% could run off n=15 in a row without failing. Even 90% reliability might hit 15 straight every once in a while but it would be pretty unlikely.&lt;/p&gt;
&lt;p&gt;The code below looks at four different possible sample size options for DF: n=15, n=30, n=45, and a full n=59 (just like we plan for DV).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Draw radomly from 4 different beta distributions. Alpha parameter is adjusted based on DF sample size
posterior_dist_sim_15 &amp;lt;- rbeta(n_draws, 16, 1)
posterior_dist_sim_30 &amp;lt;- rbeta(n_draws, 31, 1)
posterior_dist_sim_45 &amp;lt;- rbeta(n_draws, 46, 1)
posterior_dist_sim_59 &amp;lt;- rbeta(n_draws, 60, 1)

#Function to convert vectors above into tibbles and add column for Sample Size 
pds_clean_fcn &amp;lt;- function(pds, s_size){
    pds %&amp;gt;% as_tibble() %&amp;gt;% mutate(Sample_Size = s_size) %&amp;gt;%
    mutate(Sample_Size = factor(Sample_Size, levels = unique(Sample_Size)))}

#Apply function to 4 vectors above
posterior_dist_sim_15_tbl &amp;lt;- pds_clean_fcn(posterior_dist_sim_15, 15)
posterior_dist_sim_30_tbl &amp;lt;- pds_clean_fcn(posterior_dist_sim_30, 30)
posterior_dist_sim_45_tbl &amp;lt;- pds_clean_fcn(posterior_dist_sim_45, 45)
posterior_dist_sim_59_tbl &amp;lt;- pds_clean_fcn(posterior_dist_sim_59, 59)

#Combine the tibbles in a tidy format for visualization
full_post_df_tbl &amp;lt;- bind_rows(
            posterior_dist_sim_15_tbl,
            posterior_dist_sim_30_tbl, 
            posterior_dist_sim_45_tbl, 
            posterior_dist_sim_59_tbl
            )

#Visualize with density plot
df_density_plt &amp;lt;- full_post_df_tbl %&amp;gt;% ggplot(aes(x = value, fill = Sample_Size)) +
    geom_density(alpha = .6) +
    xlim(c(0.85,1)) +
    labs(x = &amp;quot;&amp;quot;,
         y = &amp;quot;Density of Beta&amp;quot;,
         title = &amp;quot;Credibility Simulation, After Design Freeze&amp;quot;,
         subtitle = &amp;quot;Updated Belief Modeled with Beta(1 + n,1)&amp;quot;) +
    scale_fill_manual(values = c(&amp;quot;#2C728EFF&amp;quot;, &amp;quot;#20A486FF&amp;quot;, &amp;quot;#75D054FF&amp;quot;, &amp;quot;#FDE725FF&amp;quot;)) 

#Visualize with histogram 
df_hist_plt &amp;lt;- full_post_df_tbl %&amp;gt;% ggplot(aes(x = value, fill = Sample_Size)) +
    geom_histogram(alpha = .9,
                   position = &amp;quot;dodge&amp;quot;,
                   boundary = 1,
                   color = &amp;quot;black&amp;quot;) +
    xlim(c(0.85,1)) +
    labs(x = &amp;quot;Reliability&amp;quot;,
         y = &amp;quot;Count&amp;quot;) +
    scale_fill_manual(values = c(&amp;quot;#2C728EFF&amp;quot;, &amp;quot;#20A486FF&amp;quot;, &amp;quot;#75D054FF&amp;quot;, &amp;quot;#FDE725FF&amp;quot;))

plot_grid(df_density_plt, df_hist_plt, ncol = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-23-assessing-dv-risk-w-bayesian-estimation-in-r_files/figure-html/unnamed-chunk-3-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If we unpack these charts a bit, we can see that if we only do n=15 in Design Freeze, we still need to allocate some credibility to reliability parameters below .90. For a full n=59, anything below .95 reliability is very unlikely, yet the 59 straight passing units could have very well come from a product with reliability = .98 or .97.&lt;/p&gt;
&lt;p&gt;We now have a good feel for our uncertainty about the reliability after DF, but what we really want to know is our likelihood of passing Design Verification. To answer this question, we’ll extend our simulation to perform many replicates of n=59 Bernoulli trials, each representing a round of Design Verification testing. The probability of failure will be randomly drawn from the distributions via Monte Carlo. Let’s see how many of these virtual DV tests end with 59/59 passing:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Perform many sets of random binom runs, each with n=59 trials. p is taken from the probs previously generated 
DV_acceptable_units_15 &amp;lt;- rbinom(size = 59, n = n_draws, 
                                 prob = (posterior_dist_sim_15_tbl$value))
DV_acceptable_units_30 &amp;lt;- rbinom(size = 59, n = n_draws, 
                                 prob = (posterior_dist_sim_30_tbl$value))
DV_acceptable_units_45 &amp;lt;- rbinom(size = 59, n = n_draws, 
                                 prob = (posterior_dist_sim_45_tbl$value))
DV_acceptable_units_59 &amp;lt;- rbinom(size = 59, n = n_draws, 
                                 prob = (posterior_dist_sim_59_tbl$value))

#Function to convert vectors to tibbles and add col for sample size
setup_fcn &amp;lt;- function(vec, ss){
    vec %&amp;gt;% as_tibble() %&amp;gt;% mutate(DF_Sample_Size = ss) %&amp;gt;%
    mutate(DF_Sample_Size = factor(DF_Sample_Size, levels = unique(DF_Sample_Size)))}

#Apply function
DV_acceptable_units_15_tbl &amp;lt;- setup_fcn(DV_acceptable_units_15, 15)
DV_acceptable_units_30_tbl &amp;lt;- setup_fcn(DV_acceptable_units_30, 30)
DV_acceptable_units_45_tbl &amp;lt;- setup_fcn(DV_acceptable_units_45, 45)
DV_acceptable_units_59_tbl &amp;lt;- setup_fcn(DV_acceptable_units_59, 59)

#Combine the tibbles in a tidy format for visualization
DV_acceptable_full_tbl &amp;lt;- bind_rows(DV_acceptable_units_15_tbl,
                                    DV_acceptable_units_30_tbl,
                                    DV_acceptable_units_45_tbl,
                                    DV_acceptable_units_59_tbl)

#Visualize with ggplot.  Apply gghighlight where appropriate
g1 &amp;lt;- DV_acceptable_full_tbl %&amp;gt;%
   ggplot(aes(x = value)) +
   geom_histogram(aes(fill = DF_Sample_Size),binwidth = 1, color = &amp;quot;black&amp;quot;, position = &amp;quot;dodge&amp;quot;, alpha = .9) +
    xlim(c(45, 60)) +
    scale_x_continuous(limits = c(45, 60), breaks=seq(45, 60, 1)) +
    scale_fill_manual(values = c(&amp;quot;#2C728EFF&amp;quot;, &amp;quot;#20A486FF&amp;quot;, &amp;quot;#75D054FF&amp;quot;, &amp;quot;#FDE725FF&amp;quot;)) +
    labs(
        x = &amp;quot;Passing Parts out of 59 total&amp;quot;,
        title = &amp;quot;Simulated Design Verification Testing&amp;quot;,
        subtitle = &amp;quot;100,000 Simulated DV Runs of n=59&amp;quot;
    )

g2 &amp;lt;- g1 +
    gghighlight(value == 59, use_direct_label = FALSE) +
    labs(
        title = &amp;quot;Simulations that PASSED Design Verification&amp;quot;
     )
    
g3 &amp;lt;- g1 +
    gghighlight(value &amp;lt; 59, use_direct_label = FALSE) +
    labs(
        title = &amp;quot;Simulations that FAILED Design Verification&amp;quot;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-23-assessing-dv-risk-w-bayesian-estimation-in-r_files/figure-html/unnamed-chunk-5-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-23-assessing-dv-risk-w-bayesian-estimation-in-r_files/figure-html/unnamed-chunk-5-2.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-23-assessing-dv-risk-w-bayesian-estimation-in-r_files/figure-html/unnamed-chunk-5-3.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Taking into consideration the uncertainty of the true reliability after the DF testing, the percentage of times we expect to pass Design Verification is shown below. These percentages are calculated as the number of simulated DV runs that achieved 59/59 passing units divided by the total number of simulated DV runs. Any simulation with 58 or less passing units would have failed DV.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[\mbox{expected probability of passing DV  = (number of sims with n=59 pass) / (total sims) }\]&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Function to calculate how many DV simulations resulted in 59/59 passing units
pct_pass_fct &amp;lt;- function(tbl, n){
    pct_dv_pass &amp;lt;- tbl %&amp;gt;% filter(value == 59) %&amp;gt;% nrow() / n_draws
    paste(&amp;quot;DF with n = &amp;quot;,n, &amp;quot;(all pass): &amp;quot;, pct_dv_pass %&amp;gt;% scales::percent(), &amp;quot;expected probability of next 59/59 passing DV&amp;quot;)}&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
DF with n = 15 (all pass): 21.3% expected probability of next 59/59 passing DV
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
DF with n = 30 (all pass): 34.5% expected probability of next 59/59 passing DV
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
DF with n = 45 (all pass): 43.5% expected probability of next 59/59 passing DV
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
DF with n = 59 (all pass): 50.3% expected probability of next 59/59 passing DV
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The percentage of time we expect to pass Design Verification is shockingly low! Even when we did a full n=59 in Design Freeze, we still only be able to predict 50% success in DV! This is because even with 59/59 passes, we still must account for the possibility that the reliability isn’t 100%. We don’t have enough DF data to shift the credibility all the way near 100%, and when the credibility is spread to include possible reliabilities in the mid .90’s we should always be prepared for the possibility of failing Design Verification.&lt;/p&gt;
&lt;p&gt;We could just leave it at that but I have found that when discussing risk, stakeholders want more than just an estimation of the rate of bad outcomes. They want a recommendation and a mitigation plan. Here are a few ideas; can you think of any more?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Maintain multiple design configurations as long as possible (often not feasible, but provides an out if 1 design fails)&lt;/li&gt;
&lt;li&gt;Perform durability testing as “fatigue-to-failure”. In this methodology, the devices are run to failure and the cycles to failure are treated as variable data. By varying the amplitude of the loading cycles, we can force the devices to fail and understand the uncertainty within the failure envelope. &lt;a href=&#34;#fn2&#34; class=&#34;footnoteRef&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Fold in information from pre-DF testing, predicate testing, etc to inform the prior better. I will look at the sensitivity of the reliability estimations to the prior in a future post.&lt;/li&gt;
&lt;li&gt;Build redundant design cycles into the project schedule to accomodate additional design turns without falling behind the contracted timeline&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thanks for reading!&lt;/p&gt;
&lt;style&gt;
body {
text-align: justify}
&lt;/style&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Kruschke, Doing Bayesian Data Analysis, &lt;a href=&#34;https://sites.google.com/site/doingbayesiandataanalysis/&#34; class=&#34;uri&#34;&gt;https://sites.google.com/site/doingbayesiandataanalysis/&lt;/a&gt;&lt;a href=&#34;#fnref1&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;Fatigue-to Fracture ASTM Standard: &lt;a href=&#34;https://www.astm.org/Standards/F3211.htm&#34; class=&#34;uri&#34;&gt;https://www.astm.org/Standards/F3211.htm&lt;/a&gt;&lt;a href=&#34;#fnref2&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Permutation Test for NHST of 2 Samples in R</title>
      <link>https://rileyking.netlify.app/post/permutation-test-for-nhst-of-2-samples-in-r/</link>
      <pubDate>Sat, 10 Aug 2019 00:00:00 +0000</pubDate>
      
      <guid>https://rileyking.netlify.app/post/permutation-test-for-nhst-of-2-samples-in-r/</guid>
      <description>


&lt;p&gt;As engineers, it is not uncommon to be asked to determine whether or not two different configurations of a product perform the same. Perhaps we are asked to compare the durability of a next-generation prototype to the current generation. Sometimes we are testing the flexibility of our device versus a competitor for marketing purposes. Maybe we identify a new vendor for a raw material but must first understand whether the resultant finished product will perform any differently than when built using material from the standard supplier. All of these situations call for a comparison between two groups culminating in a statistically supported recommendation.&lt;/p&gt;
&lt;p&gt;There are a lot of interesting ways to do this: regions of practical equivalence, Bayes Factors, etc. The most common method is still null hypothesis significance testing (NHST) and that’s what I want to explore in this first post. Frequentist methods yield the least useful inferences but have the advantage of a long usage history. Most medical device professionals will be looking for a p-value, so a p-value we must provide.&lt;/p&gt;
&lt;p&gt;In NHST, the plan is usually to calculate a test statistic from our data and use a table of reference values or a statistical program to tell us how surprising our derived statistic would be in a world where the null hypothesis was true. We generally do this by comparing our statistic to a reference distribution or table of tabulated values. Unfortunately, whenever our benchtop data violates an assumption of the reference model, we are no longer comparing apples-to-apples. We must make tweaks and adjustments to try to compensate. It is easy to get overwhelmed in a decision tree of test names and use cases.&lt;/p&gt;
&lt;p&gt;A more robust and intuitive approach to NHST is to replace the off-the-shelf distributions and tables with a simulation built right from our dataset. The workflow any such test is shown below. &lt;a href=&#34;#fn1&#34; class=&#34;footnoteRef&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/workflow.png&#34; width=&#34;75%&#34; height=&#34;75%&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The main difference here is that we create the distribution of the data under the null hypothesis using simulation instead of relying on a reference distribution. It’s intuitive, powerful, and fun.&lt;/p&gt;
&lt;p&gt;Imagine we have just designed a benchtop experiment in which we intend to measure the pressure (in mm Hg) at which a pair of overlapped stent grafts started to migrate or disconnect when deployed in a large thoracic aneurysm. &lt;a href=&#34;#fn2&#34; class=&#34;footnoteRef&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/./img/migration_model.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A common null hypothesis for comparing groups is that there is no difference between them. Under this model, &lt;strong&gt;we can treat all the experimental data as one big group instead of 2 different groups&lt;/strong&gt;. We therefore pool the data from our completed experiment into one big group, shuffle it, and randomly assign data points into two groups of the original size. This is our generative model. After each round of permutation and assignment, we calculate and store the test statistic for the observed effect (difference in means between the two groups). Once many simulations have been completed, we’ll see where our true data falls relative to the virtual data.&lt;/p&gt;
&lt;p&gt;One way to setup and execute a simulation-based NHST for comparing two groups in R is as follows (note: there are quicker shortcuts to executing this type of testing but the long version below allows for customization, visualization, and adjust-ability):&lt;/p&gt;
&lt;p&gt;First, we read in the libraries and transcribe the benchtop data into R and evaluate sample size&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(cowplot)
library(knitr)
library(kableExtra)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Migration pressure for predicate device
predicate &amp;lt;-  c(186, 188, 189, 189, 192, 193, 194, 194, 194, 195, 195, 196, 196, 197, 197, 198, 198, 199, 199, 201, 206, 207, 210, 213, 216, 218)

#Migration pressure for next_gen device
next_gen &amp;lt;-  c(189, 190, 192, 193, 193, 196, 199, 199, 199, 202, 203, 204, 205, 206, 206, 207, 208, 208, 210, 210, 212, 214, 216, 216, 217, 218)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Sample Size of Predicate Device Data: 26
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Sample Size of Next-Gen Device Data: 26
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So we have slightly uneven groups and relatively small sample sizes. No problem - assign each group to a variable and convert to tibble format:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Assign variables for each group and convert to tibble
predicate_tbl &amp;lt;- tibble(Device = &amp;quot;Predicate&amp;quot;,
                        Pressure = predicate)

next_gen_tbl &amp;lt;- tibble(Device = &amp;quot;Next_Gen&amp;quot;,
                        Pressure = next_gen)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Combine predicate and next_gen data into a single, pooled group called results_tbl. Taking a look at the first few and last few rows in the pooled tibble confirm it was combined appropriately.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Combine in tibble
results_tbl &amp;lt;- bind_rows(predicate_tbl, next_gen_tbl)
results_tbl %&amp;gt;% 
  head() %&amp;gt;% 
  kable(align = rep(&amp;quot;c&amp;quot;,2))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
Device
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
Pressure
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Predicate
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
186
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Predicate
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
188
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Predicate
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
189
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Predicate
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
189
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Predicate
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
192
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Predicate
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
193
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results_tbl %&amp;gt;% tail() %&amp;gt;% 
  head() %&amp;gt;% 
  kable(align = rep(&amp;quot;c&amp;quot;,2))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
Device
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
Pressure
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Next_Gen
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
212
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Next_Gen
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
214
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Next_Gen
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
216
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Next_Gen
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
216
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Next_Gen
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
217
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
Next_Gen
&lt;/td&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
218
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now we do some exploratory data analysis to identify general shape and distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Visualize w/ basic boxplot
boxplot_eda &amp;lt;- results_tbl %&amp;gt;% 
    ggplot(aes(x=Device, y=Pressure)) +
    geom_boxplot(
        alpha  = .6,
        width  = .4,
        size   = .8,
        fatten = .5,
        fill   = c(&amp;quot;#FDE725FF&amp;quot;,&amp;quot;#20A486FF&amp;quot;)) +
    labs(
        y        = &amp;quot;Pressure (mm Hg)&amp;quot;,
        title    = &amp;quot;Predicate and Next-Gen Data&amp;quot;,
        subtitle = &amp;quot;Modular Disconnect Pressure&amp;quot;
    )

boxplot_eda&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-10-simple-permutation-test-for-nhst-of-2-samples_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Visualize with density plot
density_eda &amp;lt;- results_tbl %&amp;gt;% 
    ggplot(aes(x = Pressure)) +
    geom_density(aes(fill = Device),
        color = &amp;quot;black&amp;quot;,
        alpha = 0.6
        ) +
    scale_fill_manual(values = c(&amp;quot;#FDE725FF&amp;quot;,&amp;quot;#20A486FF&amp;quot;)) +
    labs(
        x        = &amp;quot;Pressure (mm Hg)&amp;quot;,
        title    = &amp;quot;Predicate and Next-Gen Data&amp;quot;,
        subtitle = &amp;quot;Modular Disconnect Pressure&amp;quot;
    )

density_eda&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-10-simple-permutation-test-for-nhst-of-2-samples_files/figure-html/unnamed-chunk-7-2.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Yikes! These data do not look normal. Fortunately, the permutation test does not need the data to take on any particular distribution. The main assumption is exchangability, meaning it must be reasonable that the labels could be arbitrarily permuted under the null hypothesis. Provided the sample size is approximately equal, the permutation test is robust against unequal variances.&lt;a href=&#34;#fn3&#34; class=&#34;footnoteRef&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt; This gives us an attractive option for data shaped as shown above.&lt;/p&gt;
&lt;p&gt;To get started with our permutation test we create a function that accepts 3 arguments: the pooled data from all trials in our benchtop experiment (x), the number of observations taken from Group 1 (n1), and the number of observations taken from Group 2 (n2). The function creates an object containing indices 1:n, then randomly assigns indices into two Groups A and B with sizes to match the original group sizes. It then uses the randomly assigned indices to splice the dataset x producing 2 “shuffled” groups from the original data. Finally, it computes and returns the mean between the 2 randomly assigned groups.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Function to permute vector indices and then compute difference in group means
perm_fun &amp;lt;- function(x, n1, n2){
  n &amp;lt;- n1 + n2
  group_B &amp;lt;- sample(1:n, n1)
  group_A &amp;lt;- setdiff(1:n, group_B)
  mean_diff &amp;lt;- mean(x[group_B] - mean(x[group_A]))
  return(mean_diff)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we initialize an dummy vector called perm_diffs to hold the results of the loop we are about to use. It’ll have all 0’s to start and then we’ll assign values from each iteration of the for loop.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Set number of simulations to run
n_sims &amp;lt;- 10000

#Initialize empty vector
perm_diffs &amp;lt;- rep(0,n_sims)
perm_diffs %&amp;gt;% head()  %&amp;gt;% 
  kable(align = &amp;quot;c&amp;quot;, col.names = NULL)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Set up a simple for loop to execute the same evaluation using perm_fun() 10,000 times. On each iteration, we’ll store the results into the corresponding index within perm_diffs that we initialized above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Set seed for reproducibility
set.seed(2015)

#Iterate over desired number of simulations using permutation function
for (i in 1:n_sims)
  perm_diffs[i] = perm_fun(results_tbl$Pressure, 26, 26)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have 10,000 replicates of our permutation test stored in perm_diffs. We want to visualize the data with ggplot so we convert it into a tibble frame using tibble().&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Convert results to a tibble and look at it
perm_diffs_df &amp;lt;- tibble(perm_diffs)
perm_diffs_df %&amp;gt;% head()  %&amp;gt;% 
  kable(align = &amp;quot;c&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
perm_diffs
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.6153846
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-3.3076923
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
0.6923077
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-2.3846154
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-0.3076923
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
3.1538462
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Visualize the difference in means as a histogram and density plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Visualize difference in means as a histogram
diffs_histogram_plot &amp;lt;- perm_diffs_df %&amp;gt;% ggplot(aes(perm_diffs)) +
  geom_histogram(fill = &amp;quot;#2c3e50&amp;quot;, color = &amp;quot;white&amp;quot;, binwidth = .3, alpha = 0.8) +
    labs(
        x = &amp;quot;Pressure (mm Hg)&amp;quot;,
        title = &amp;quot;Histogram of Difference in Means&amp;quot;,
        subtitle = &amp;quot;Generated Under Null Hypothesis&amp;quot;
    )

#Visualize difference in means as a density plot
diffs_density_plot &amp;lt;-  perm_diffs_df %&amp;gt;% ggplot(aes(perm_diffs)) +
  geom_density(fill = &amp;quot;#2c3e50&amp;quot;, color = &amp;quot;white&amp;quot;, alpha = 0.8) +
     labs(
        x = &amp;quot;Pressure (mm Hg)&amp;quot;,
        title = &amp;quot;Density Plot of Difference in Means&amp;quot;,
        subtitle = &amp;quot;Generated Under Null Hypothesis&amp;quot;
    )

plot_grid(diffs_histogram_plot, diffs_density_plot)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-10-simple-permutation-test-for-nhst-of-2-samples_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We just simulated many tests from the null hypothesis. These virtual data give us a good understanding of what sort of difference in means we might observe if there truly was no difference between the groups. As expected, most of the time the difference is around 0. But occasionally there is a noticeable difference in means just due to chance.&lt;/p&gt;
&lt;p&gt;But how big was the difference in means from our real world dataset? We’ll call this “baseline difference”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Evaluate difference in means from true data set
predicate_pressure_mean &amp;lt;- mean(predicate_tbl$Pressure)
next_gen_pressure_mean &amp;lt;- mean(next_gen_tbl$Pressure)

baseline_difference &amp;lt;- predicate_pressure_mean - next_gen_pressure_mean
baseline_difference  %&amp;gt;% 
  signif(digits = 3) %&amp;gt;%
  kable(align = &amp;quot;c&amp;quot;, col.names = NULL)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
-5.85
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So our real, observed data show a difference in means of -5.85. Is this large or small? With the context of the shuffle testing we already performed, we know exactly how extreme our observed data is and can visualize it with a vertical line.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Visualize real data in context of simulations
g1 &amp;lt;- diffs_histogram_plot + 
  geom_vline(xintercept = baseline_difference, 
             linetype   = &amp;quot;dotted&amp;quot;, 
             color      = &amp;quot;#2c3e50&amp;quot;, 
             size       = 1
             ) 

g2 &amp;lt;- diffs_density_plot + 
  geom_vline(xintercept = baseline_difference, 
             linetype   =&amp;quot;dotted&amp;quot;, 
             color      = &amp;quot;#2c3e50&amp;quot;, 
             size       = 1
             ) 

plot_grid(g1,g2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-10-simple-permutation-test-for-nhst-of-2-samples_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It looks like the our benchtop data was pretty extreme relative to the null. We should start to consider the possibility that this effect was not due solely to chance alone. 0.05 is a commonly used threshold for declaring statistical significance. Let’s see if our data is more or less extreme than 0.05 (solid line).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Calculate the 5% quantile of the simulated distribution for difference in means
the_five_percent_quantile &amp;lt;- quantile(perm_diffs_df$perm_diffs, probs = 0.05)
the_five_percent_quantile&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##        5% 
## -4.153846&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Visualize the 5% quantile on the histogram and density plots
g3 &amp;lt;- g1 +
         geom_vline(xintercept = the_five_percent_quantile, 
             color      = &amp;quot;#2c3e50&amp;quot;, 
             size       = 1
             )

g4 &amp;lt;- g2 +
        geom_vline(xintercept = the_five_percent_quantile, 
             color      = &amp;quot;#2c3e50&amp;quot;, 
             size       = 1
             )

plot_grid(g3,g4)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://rileyking.netlify.app/post/2019-08-10-simple-permutation-test-for-nhst-of-2-samples_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;100%&#34; height=&#34;500px&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can see here that our data is more extreme than the 5% quantile which means our p-value is less than 0.05. This satisfies the traditional, frequentist definition of statistically significant. If we want to actual p-value, we have to determine the percentage of simulated data that are as extreme or more extreme than our observed data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Calculate percentage of simulations as extreme or more extreme than the observed data (p-value)
p_value &amp;lt;- perm_diffs_df %&amp;gt;% 
    filter(perm_diffs &amp;lt;= baseline_difference) %&amp;gt;%
    nrow() / n_sims

paste(&amp;quot;The empirical p-value is: &amp;quot;, p_value)  %&amp;gt;% 
  kable(align = &amp;quot;c&amp;quot;, col.names = NULL)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:center;&#34;&gt;
The empirical p-value is: 0.0096
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Our p-value is well below 0.05. This is likely enough evidence for us to claim that there was a statistically significant difference observed between the Next Gen device and the predicate device.&lt;/p&gt;
&lt;p&gt;Our marketing team will be thrilled, but we should always be wary that statistically significant does not mean practically important. Domain knowledge should provide the context to interpret the relevance of the observed difference. A difference in mean Pressure of a few mm Hg seems to be enough to claim a statistically significant improvement in our new device vs. the predicate, but is it enough for our marketing team to make a meaningful campaign? In reality, a few mm Hg is noticeable on the bench but is likely lost in the noise of anatomical variation within real patient anatomies.&lt;/p&gt;
&lt;style&gt;
body {
text-align: justify}
&lt;/style&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Probably Overthinking It, &lt;a href=&#34;http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html&#34; class=&#34;uri&#34;&gt;http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html&lt;/a&gt;&lt;a href=&#34;#fnref1&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;J ENDOVASC THER 2011;18:559-568, open access &lt;a href=&#34;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163409/&#34; class=&#34;uri&#34;&gt;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163409/&lt;/a&gt;&lt;a href=&#34;#fnref2&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;Simulations and Explanation of Unequal Variance and Sample Sizes, &lt;a href=&#34;https://stats.stackexchange.com/questions/87215/does-a-big-difference-in-sample-sizes-together-with-a-difference-in-variances-ma&#34; class=&#34;uri&#34;&gt;https://stats.stackexchange.com/questions/87215/does-a-big-difference-in-sample-sizes-together-with-a-difference-in-variances-ma&lt;/a&gt;&lt;a href=&#34;#fnref3&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>