#the libraries we need
library(MASS)
library(dplyr)
library(rstan)
library(ggplot2)
library(reshape2)
# some STAN settings
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
Simulating data
We will test whether there is a relationship between mood and concentration. This is based on some smartphone data i collected in 120 individuals over 14 days.
I want to demonstrate the usefulness of multilevel correlations. Hence, we need data that has a multi-level structure, such as responses grouped by individuals.
This is strongly inspired by a blog post by Brendan Hasz (http://brendanhasz.github.io/), who implemented a multi-level correlation with PySTAN. Find the post here: https://brendanhasz.github.io/2018/06/27/bayesian-correlations.html
Explanation of simulation
Here is a quick run-through of the simulation procedure before it is implemented below:
1. drawing a correlation coefficient for individual i from a uniform distribution
- calculating a covariance matrix with this value, rho, and the hard-coded standard deviations (since corr_xy = cov_xy / SD_x * SD_y <=> cov_xy = corr_xy * SD_x * SD_y)
sd_x = 5
sd_y =10
cov_matrix= matrix(c(sd_x^2, rho * sd_x * sd_y,
rho * sd_y * sd_x, sd_y^2), nrow = 2)
cov_matrix
## [,1] [,2]
## [1,] 25 10
## [2,] 10 100
- With this individual covariance matrix, we can simulate data from a multivariate normal distribution, adding hard-coded mean values for both variables
## [,1] [,2]
## [1,] 54.10529 30.66347
## [2,] 50.69399 39.05639
## [3,] 47.38863 27.48681
## [4,] 44.50143 25.17216
## [5,] 48.92526 29.98349
## [6,] 54.12072 24.74203
## [7,] 47.58663 42.54078
## [8,] 45.98073 17.79705
## [9,] 54.16370 38.34444
## [10,] 52.48646 41.94266
- this is repeated for all individuals and their values are stored with ID columns to keep the structure intact.
Simulating
Here we simulate the data we want to analyse below.
set.seed(123) # Set seed for reproducibility
#library(MASS) # For mvrnorm function
# Parameters
n_individuals <- 120
n_timepoints <- 14
# Means and standard deviations for mood and concentration
mood_mean <- 50
mood_sd <- 8
concentration_mean <- 60
concentration_sd <- 14
# Function to simulate data for one individual
simulate_individual <- function(rho) {
# Define the covariance matrix based on the correlation rho
cov_matrix <- matrix(c(mood_sd^2, rho * mood_sd * concentration_sd,
rho * mood_sd * concentration_sd, concentration_sd^2), nrow = 2)
# Simulate 14 time points of mood and concentration using a multivariate normal distribution
mvrnorm(n_timepoints, mu = c(mood_mean, concentration_mean), Sigma = cov_matrix)
}
# Create an empty dataframe to store the results
data <- data.frame()
# Simulate data for each individual
for (i in 1:n_individuals) {
# Randomly sample a correlation between -0.2 and 0.8 for this individual
rho_i <- runif(1, -0.2, 0.8)
# Simulate data for this individual
individual_data <- simulate_individual(rho_i)
# Add individual ID and bind to the dataframe
individual_df <- data.frame(
id = rep(i, n_timepoints),
time = 1:n_timepoints,
mood = individual_data[,1],
concentration = individual_data[,2],
correlation = rep(rho_i, n_timepoints)
)
data <- rbind(data, individual_df)
}
# Display the first few rows of the simulated data
head(data)
## id time mood concentration correlation
## 1 1 1 59.14835 70.58300 0.08757752
## 2 1 2 40.95908 77.40735 0.08757752
## 3 1 3 41.70451 36.85095 0.08757752
## 4 1 4 51.72305 77.30507 0.08757752
## 5 1 5 56.10981 58.01596 0.08757752
## 6 1 6 55.69764 57.93003 0.08757752
Briefly testing the simulation
The values are as expected
## [1] 0.3345044
## [1] 8.036882
## [1] 13.89109
The individual correlations vary around:
individual_correlations <- data %>%
group_by(id) %>%
summarize(correlation_observed = cor(mood, concentration))
min(individual_correlations$correlation_observed)
## [1] -0.7648793
## [1] 0.9045649
## [1] 0.3935277
Multilevel correlation with STAN
Now for the multilevel version! The aim is to calculate correlations for each individual, as we saw just now, while also getting an overall estimate of the correlation.
A Bayesian approach lends itself well for this since it gives us not only a point estimate (the pooled correlation) but also a measure of uncertainty around this estimate. This is especially valuable when we want to learn how well this pooled correlation represents the population.
We will use STAN since it is flexible enough for us to write the covariance matrix that should be estimated. This is my first deep dive into STAN, so please excuse any errors and let me know if you find some!
first trying out a simple correlation
the model
the syntax follows the following structure:
1. the data - where the input to the model is defined.
2. the parameters - where the variables of the model are specified (we need the means and SDs per variable and their correlation, rho).
3. transformed parameters - here we will specify the multilevel correlation. For now it’s a simple 2x2 matrix, with each cell being defined individually.
4. the model - where we define how the data and the parameters should be linked.
stan_model_Corr_syntax <- "
data {
int<lower=0> N; // number of data points
vector[2] Y[N]; // observed data: rows are observations, columns are the two variables
}
parameters {
vector[2] mu; // per-variable mean
vector<lower=0>[2] sigma; // SD of each variable
real<lower=-1,upper=1> rho; // Pearson's rho
}
transformed parameters {
cov_matrix[2] C;
C[1,1] = sigma[1] * sigma[1];
C[1,2] = rho * sigma[1] * sigma[2];
C[2,1] = rho * sigma[1] * sigma[2];
C[2,2] = sigma[2] * sigma[2];
}
model {
Y ~ multi_normal(mu, C);
}
"
compiling the model
I will not get into the details of this, there are comprehensive explanations of rstan here: https://mc-stan.org/rstan/
## S4 class stanmodel 'anon_model' coded as follows:
## data {
## int<lower=0> N; // number of data points
## vector[2] Y[N]; // observed data: rows are observations, columns are the two variables
## }
##
## parameters {
## vector[2] mu; // per-variable mean
## vector<lower=0>[2] sigma; // SD of each variable
## real<lower=-1,upper=1> rho; // Pearson's rho
## }
##
## transformed parameters {
## cov_matrix[2] C;
## C[1,1] = sigma[1] * sigma[1];
## C[1,2] = rho * sigma[1] * sigma[2];
## C[2,1] = rho * sigma[1] * sigma[2];
## C[2,2] = sigma[2] * sigma[2];
## }
##
## model {
## Y ~ multi_normal(mu, C);
## }
preparing the data
The data input has to fit the STAN syntax. It has to be a list, with left hand variables corresponding to the model code we just wrote. Since our example data is rather simple, this doesn’t require much work.
With real-world data, this might require some more steps (e.g. excluding rows with missing data, changing the data-type).
sampling from posterior
Note: I had some issues when knitting the document due to the stan code, so I pre-compiled and sampled the model and read in the results here.
printing & visualising the data
## Inference for Stan model: anon_model.
## 4 chains, each with iter=10; warmup=5; thin=1;
## post-warmup draws per chain=5, total post-warmup draws=20.
##
## mean se_mean sd 2.5% 25%
## mu[1] -3.400000e-01 NaN 1.410000e+00 -1.730000e+00 -1.68000e+00
## mu[2] -2.300000e-01 NaN 1.180000e+00 -1.140000e+00 -9.70000e-01
## sigma[1] 2.520000e+00 NaN 1.630000e+00 7.500000e-01 1.46000e+00
## sigma[2] 7.800000e-01 NaN 2.700000e-01 3.700000e-01 6.60000e-01
## rho -6.000000e-02 NaN 5.100000e-01 -5.800000e-01 -4.70000e-01
## C[1,1] 8.890000e+00 NaN 9.960000e+00 5.600000e-01 2.30000e+00
## C[1,2] 4.200000e-01 NaN 1.650000e+00 -1.070000e+00 -5.90000e-01
## C[2,1] 4.200000e-01 NaN 1.650000e+00 -1.070000e+00 -5.90000e-01
## C[2,2] 6.900000e-01 NaN 3.900000e-01 1.400000e-01 4.60000e-01
## lp__ -2.184801e+10 NaN 1.662903e+10 -4.974495e+10 -2.33637e+10
## 50% 75% 97.5% n_eff Rhat
## mu[1] -5.30000e-01 8.0000e-01 1.400000e+00 NaN Inf
## mu[2] -7.60000e-01 -3.0000e-02 1.740000e+00 NaN Inf
## sigma[1] 2.16000e+00 3.2200e+00 5.020000e+00 NaN Inf
## sigma[2] 8.40000e-01 9.6000e-01 1.090000e+00 NaN Inf
## rho -1.70000e-01 2.3000e-01 6.700000e-01 NaN Inf
## C[1,1] 4.88000e+00 1.1470e+01 2.524000e+01 NaN Inf
## C[1,2] -1.90000e-01 8.2000e-01 3.120000e+00 NaN Inf
## C[2,1] -1.90000e-01 8.2000e-01 3.120000e+00 NaN Inf
## C[2,2] 7.10000e-01 9.3000e-01 1.180000e+00 NaN Inf
## lp__ -1.39907e+10 -1.2475e+10 -9.665684e+09 NaN Inf
##
## Samples were drawn using NUTS(diag_e) at Fri Oct 18 14:16:02 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
# Extract the posterior samples for rho
rho_samples <- extract(fit_Corr)$rho
# Convert rho_samples to a data frame for ggplot2
rho_samples_df <- data.frame(rho = rho_samples)
# Create a density plot
ggplot(rho_samples_df, aes(x = rho)) +
geom_density(fill = "#3DBEAB", alpha = 0.5) +
labs(title = "Posterior Distribution of rho",x = "rho", y = "Density") + theme_minimal()
As we can see, the posterior distribution is rather wide and uninformative. Now, we will see whether the multi-level version fares better.
Multi-level correlation
We follow the same steps as before: 1. Writing the Syntax, 2.Compiling the model, 3. Preparing the data, 4. Sampling, and 5. Working with the posterior.
Multilevel syntax
The main change is in the how we write the model:
1. In the data block, we describe the nested data structure: There are Ni individuals and N data points overall. With I[N], we assign each data point the individual from which it was recorded.
2. In the parameters block, we still estimate a mean and SD for mood and concentration. Now, we assign each individual, Ni, a correlation coefficient, rho, with rho[Ni]. There is a distribution for these rhos and we define a mean and SD for it. Finally, we state that there is a population-level rho.
3. The change in the covariance matrix may be subtle but important: We write a covariance matrix for each individual: C[Ni].
4. In the model block, we give the overall rho a fairly wide prior (Normal distribution with mu = 0 and SD = 1). A more informative prior could consider the fact that the correlation can only go from -1 to +1. We also specify that the individual correlation coefficients are drawn from a population of correlation coefficients and that the overall correlation is informed by these distributions of correlations. Finally, the data for each individual is linked to the model via a multivariate normal distribution, that considers each individual’s covariance matrix.
stan_model_MLMCorr_syntax <- "
data {
int<lower=0> N; // number of data points - variable*individual
int<lower=1> Ni; // number of individuals
int<lower=1> I[N]; // individual of each data point
vector[2] Y[N]; // mood and concentration
}
parameters {
vector[2] mu; // mean
vector<lower=0>[2] sigma; // SD
real<lower=-1,upper=1> rho[Ni]; // per-individual Pearson's rho
real<lower=-1,upper=1> mu_rho; // mean of population rhos
real<lower=0> sigma_rho; // SD of population rhos
real<lower=-1,upper=1> rho_pop; // population-level rho
}
transformed parameters {
cov_matrix[2] C[Ni]; // cov matrix for each individual
for (i in 1:Ni) {
C[i][1,1] = sigma[1] * sigma[1];
C[i][1,2] = rho[i] * sigma[1] * sigma[2];
C[i][2,1] = rho[i] * sigma[1] * sigma[2];
C[i][2,2] = sigma[2] * sigma[2];
}
}
model {
// Population-level rho prior
rho_pop ~ normal(0, 1); // Prior for population-level rho
// Each individual rho is drawn from population distribution
rho ~ normal(mu_rho, sigma_rho);
// Population rho is informed by the mean of the individual rhos
mu_rho ~ normal(rho_pop, sigma_rho);
// Each individual datapoint is drawn from its individual's distribution
for (i in 1:N) {
Y[i] ~ multi_normal(mu, C[I[i]]);
}
}
"
compiling the model
## S4 class stanmodel 'anon_model' coded as follows:
## data {
## int<lower=0> N; // number of data points - variable*individual
## int<lower=1> Ni; // number of individuals
## int<lower=1> I[N]; // individual of each data point
## vector[2] Y[N]; // mood and concentration
## }
##
## parameters {
## vector[2] mu; // mean
## vector<lower=0>[2] sigma; // SD
## real<lower=-1,upper=1> rho[Ni]; // per-individual Pearson's rho
## real<lower=-1,upper=1> mu_rho; // mean of population rhos
## real<lower=0> sigma_rho; // SD of population rhos
## real<lower=-1,upper=1> rho_pop; // population-level rho
## }
##
## transformed parameters {
## cov_matrix[2] C[Ni]; // cov matrix for each individual
## for (i in 1:Ni) {
## C[i][1,1] = sigma[1] * sigma[1];
## C[i][1,2] = rho[i] * sigma[1] * sigma[2];
## C[i][2,1] = rho[i] * sigma[1] * sigma[2];
## C[i][2,2] = sigma[2] * sigma[2];
## }
## }
##
## model {
## // Population-level rho prior
## rho_pop ~ normal(0, 1); // Prior for population-level rho
##
## // Each individual rho is drawn from population distribution
## rho ~ normal(mu_rho, sigma_rho);
##
## // Population rho is informed by the mean of the individual rhos
## mu_rho ~ normal(rho_pop, sigma_rho);
##
## // Each individual datapoint is drawn from its individual's distribution
## for (i in 1:N) {
## Y[i] ~ multi_normal(mu, C[I[i]]);
## }
## }
preparing the data
we need to add more indices to the list to account for the nested structure of the data.
data4STAN = list(
N = nrow(data),
Ni = length(unique(data$id)),
I = data$id,
Y = cbind(data$mood,data$concentration))
glimpse(data4STAN)
## List of 4
## $ N : int 1680
## $ Ni: int 120
## $ I : int [1:1680] 1 1 1 1 1 1 1 1 1 1 ...
## $ Y : num [1:1680, 1:2] 59.1 41 41.7 51.7 56.1 ...
fitting
extracting results
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5%
## mu[1] 49.96 0.00 0.19 49.58 49.84 49.97 50.09 50.33
## mu[2] 60.00 0.00 0.32 59.39 59.78 60.00 60.21 60.64
## sigma[1] 8.04 0.00 0.14 7.77 7.95 8.03 8.12 8.31
## sigma[2] 13.81 0.00 0.24 13.35 13.66 13.81 13.97 14.28
## rho[1] 0.20 0.00 0.16 -0.14 0.09 0.21 0.31 0.50
## rho[2] 0.23 0.00 0.21 -0.20 0.10 0.25 0.38 0.59
## rho[3] 0.65 0.00 0.14 0.31 0.58 0.67 0.75 0.85
## rho[4] 0.22 0.00 0.18 -0.15 0.11 0.23 0.35 0.54
## rho[5] 0.44 0.00 0.16 0.08 0.34 0.45 0.55 0.71
## rho[6] 0.48 0.00 0.14 0.17 0.39 0.50 0.59 0.72
## rho[7] 0.15 0.00 0.17 -0.19 0.04 0.16 0.28 0.45
## rho[8] 0.67 0.00 0.14 0.31 0.59 0.69 0.77 0.87
## rho[9] 0.40 0.00 0.14 0.10 0.31 0.41 0.51 0.65
## rho[10] 0.22 0.00 0.17 -0.14 0.10 0.23 0.34 0.52
## rho[11] 0.37 0.00 0.15 0.06 0.27 0.38 0.48 0.63
## rho[12] 0.45 0.00 0.17 0.07 0.34 0.46 0.57 0.73
## rho[13] 0.25 0.00 0.19 -0.15 0.12 0.26 0.40 0.58
## rho[14] 0.26 0.00 0.17 -0.09 0.15 0.27 0.39 0.56
## rho[15] 0.11 0.00 0.16 -0.21 0.01 0.12 0.22 0.40
## rho[16] 0.15 0.00 0.15 -0.15 0.05 0.16 0.26 0.44
## n_eff Rhat
## mu[1] 7211 1
## mu[2] 8067 1
## sigma[1] 7507 1
## sigma[2] 5752 1
## rho[1] 7684 1
## rho[2] 7610 1
## rho[3] 5231 1
## rho[4] 7241 1
## rho[5] 8510 1
## rho[6] 8661 1
## rho[7] 9004 1
## rho[8] 5173 1
## rho[9] 7855 1
## rho[10] 7631 1
## rho[11] 8029 1
## rho[12] 7715 1
## rho[13] 7212 1
## rho[14] 8317 1
## rho[15] 7547 1
## rho[16] 9257 1
## [ erreichte getOption("max.print") -- 588 Zeilen ausgelassen ]
##
## Samples were drawn using NUTS(diag_e) at Fri Oct 18 14:03:33 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We see that the chains mixed well (R-hat <1.01) and their is a sufficient number of effective samples. mu[1] is the estimated mean mood score, mu[2] is for concentration.
The rho[i] lines are the individual correlation coefficients for each individual. The overall correlation is not shown in this output but we can get it with:
## mean se_mean sd 2.5% 25% 50% 75% 97.5%
## 0.331 0.001 0.032 0.265 0.310 0.331 0.353 0.391
## n_eff Rhat
## 3837.756 1.000
post-processing the fitted model
# Convert the stanfit object to a dataframe
posterior_df <- as.data.frame(fit_MLMCorr)
# Extract only the columns corresponding to the rho parameters
rho_df <- posterior_df[, grep("^rho\\[", names(posterior_df))]
# Extract only the column for the average rho
mu_rho_df <- as.data.frame(posterior_df[, grep("^mu_rho", names(posterior_df))])
mu_rho_df$rho="mu_rho"
names(mu_rho_df)[1] <- "value"
# reshaping to long format
rho_df_long <- reshape2::melt(rho_df, variable.name = "rho", value.name = "value")
## No id variables; using all as measure variables
#adding mu_rho
rho_df_long = rbind(rho_df_long,mu_rho_df)
#adding index of whether is participant-rho or overall rho
rho_df_long$is_mu_rho = factor(ifelse(rho_df_long$rho=="mu_rho","yes","no"))
# View the first few rows of the dataframe
head(rho_df_long)
## rho value is_mu_rho
## 1 rho[1] 0.02631534 no
## 2 rho[1] -0.01727990 no
## 3 rho[1] 0.20368084 no
## 4 rho[1] 0.19331331 no
## 5 rho[1] 0.26387280 no
## 6 rho[1] 0.25776576 no
The nice thing about getting a posterior distribution with Bayesian models is that we can work with it! We can calculate the share of posterior density below 0, which is:
sum(subset(rho_df_long,is_mu_rho=="no")$value < 0) / length(subset(rho_df_long,is_mu_rho=="no")$value)
## [1] 0.1155104
plotting
ggplot(data=rho_df_long, aes(x=value, group=rho, group=is_mu_rho)) +
geom_density(lwd=0.4, aes(fill=is_mu_rho), alpha=0.4) +
xlab(expression(rho)) +
ylab("Posterior density") +
scale_colour_manual(values = c("#DB9D85","#3DBEAB")) +
scale_fill_manual(values = c("#DB9D85","#3DBEAB")) +
labs(color="Pooled",fill="Pooled") +
theme_minimal()
## Warning: Duplicated aesthetics after name standardisation: group
From this plot we can learn that, overall, the correlation is positive (and very close to the un-pooled version) but that the individual correlations vary substantially around this pooled estimate. That is something we would not know, if we only looked at the standard correlation coefficient.
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=German_Austria.utf8 LC_CTYPE=German_Austria.utf8
## [3] LC_MONETARY=German_Austria.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Austria.utf8
##
## time zone: Europe/Vienna
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.4 ggplot2_3.5.0 rstan_2.32.5 StanHeaders_2.32.5
## [5] dplyr_1.1.4 MASS_7.3-60
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 prettydoc_0.4.1
## [5] lattice_0.21-8 stringi_1.8.3 digest_0.6.34 magrittr_2.0.3
## [9] timechange_0.3.0 evaluate_0.23 grid_4.3.1 fastmap_1.1.1
## [13] Matrix_1.6-5 plyr_1.8.9 jsonlite_1.8.8 pkgbuild_1.4.4
## [17] nnet_7.3-19 gridExtra_2.3 fansi_1.0.6 QuickJSR_1.1.3
## [21] scales_1.3.0 modeltools_0.2-23 codetools_0.2-19 jquerylib_0.1.4
## [25] cli_3.6.2 crayon_1.5.2 diffobj_0.3.5 rlang_1.1.3
## [29] munsell_0.5.1 withr_3.0.0 cachem_1.0.8 yaml_2.3.8
## [33] flexmix_2.3-19 tools_4.3.1 inline_0.3.19 parallel_4.3.1
## [37] colorspace_2.1-0 vctrs_0.6.5 R6_2.5.1 lubridate_1.9.3
## [41] matrixStats_1.2.0 stats4_4.3.1 lifecycle_1.0.4 stringr_1.5.1
## [45] pkgconfig_2.0.3 RcppParallel_5.1.7 pillar_1.9.0 bslib_0.6.1
## [49] gtable_0.3.4 loo_2.7.0 glue_1.7.0 Rcpp_1.0.12
## [53] highr_0.10 xfun_0.42 tibble_3.2.1 tidyselect_1.2.1
## [57] rstudioapi_0.15.0 knitr_1.45 farver_2.1.1 htmltools_0.5.7
## [61] labeling_0.4.3 rmarkdown_2.26 compiler_4.3.1