# Load the necessary libraries
library(tidyverse)
library(brms)
library(tidybayes)
# Set up the trial data
optimise_data <- tibble(
  group = c("Intervention", "Control"),
  events = c(134, 158),
  total = c(366, 364)
)
This great educational article on the basics of statistical inference came up in resident doctor teaching. I don’t just think it’s great because I work with one of the authors, but because it manages to explain in detail how classical “p-value” medical statistics (or more formally, “frequentist statistics”) actually work. And it’s much more difficult to get your head around than you’d think.
The reason we use frequentist statistics in medical research is that they’re computationally relatively simple to perform (so could be done in the days before we carried powerful processors in our pockets), and they give us a nice clear binary outcome. It either is or isn’t significant. Easy.
As such, for decades medical research has been dominated by this statistical philosophy which is built on our familiar p-values and confidence intervals. But this is a widely misunderstood, and deeply non-intuitive, approach to medical statistics.
Fundamentally the frequentist question we’re trying to answer is: “Assuming there is no effect of our treatment, how surprising is our data?”. But what we actually want to know (and how we think about clinical problems) is “given our data, what is the probability that this treatment actually works?”
So let’s rethink the core concepts that the article discusses using a Bayesian framework.
A Bayesian Take on Statistical Concepts
The foundational concepts of statistics, such as variables, parameters, and distributions, are tools for understanding data. While Sidebotham and Hewson’s article explains them from a frequentist perspective (focused on the long-run frequency of events), a Bayesian perspective reinterprets them through the lens of updating beliefs based on new data.
Variables and parameters
In the traditional view, a variable is a quantity that changes between individuals in a sample, like age or blood pressure. A parameter, like the true population mean, is an unknown but fixed number that we try to estimate.
A Bayesian sees this differently:
- Variables are the same - they are the data we observe.
- Parameters, however, are not fixed constants. They are unknown quantities about which we have uncertainty. Therefore, a Bayesian treats a parameter as a random variable and represents our knowledge about it with a probability distribution. Our goal isn’t to estimate a single “true” value but to update our belief distribution about the parameter’s plausible values after seeing the data.
Sampling variability
Frequentist statistics defines sampling variability as the reason why an estimate (such as of a sample mean) differs between repeated samples from the same population. It is seen as a source of imprecision or error that needs to be quantified.
To a Bayesian this variability is not an error - rather it’s the engine of learning. The information contained in our one specific sample (called the likelihood) is what allows us to update our initial beliefs. The process looks like this:
- Prior Belief: We start with a probability distribution representing our belief about a parameter before seeing the data.
- Likelihood (The Data): We collect data. The likelihood tells us how probable our observed data is for each possible value of the parameter.
- Posterior Belief: We combine our prior belief with the data’s likelihood using Bayes’ theorem. The result is a new, updated probability distribution for the parameter, called the posterior.
So, while a frequentist uses sampling variability to describe the precision of an estimate over many hypothetical trials, a Bayesian uses the information from one sample to directly update their beliefs.
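To make this concrete, here is a minimal toy sketch in R of that three-step update for a single event rate, using a conjugate Beta prior; all of the numbers are made up purely for illustration.
# Toy example of Bayesian updating for an event rate (all numbers made up)
# Prior belief: Beta(2, 2) - probably somewhere around 50%, but very uncertain
# Data: 6 events observed in 20 patients
# Posterior: Beta(2 + 6, 2 + 14) - the prior "counts" plus the observed counts
curve(dbeta(x, 2, 2), from = 0, to = 1, lty = 2, xlab = "Event rate", ylab = "Density")
curve(dbeta(x, 2 + 6, 2 + 14), add = TRUE)
# The dashed prior shifts into a solid posterior concentrated near the observed rate of 30%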
Point estimates and summarising data
Both approaches use the same tools to summarise data, such as the mean, median, and standard deviation.
A frequentist would then use the sample data to calculate a point estimate (a single number, like the observed 6.8% absolute risk reduction in the OPTIMISE trial discussed in the article), which serves as the best guess for the true population parameter.
A Bayesian analysis, however, doesn’t produce just one point estimate. The primary result in a Bayesian analysis is the entire posterior distribution, which shows every plausible value for the parameter and its corresponding probability. While we can summarise this distribution with a single number (like the median or mean) for convenience, the full distribution provides a much richer picture of our uncertainty. The “point estimate” is just one feature of a larger landscape of belief.
Probability distributions and the Central Limit Theorem
In the frequentist framework, probability distributions (like the normal distribution) are used to model how variables are distributed in a population. The Central Limit Theorem is highlighted in the BJAEd article as a cornerstone of this, because it guarantees that the distribution of sample means from many repeated experiments will be approximately normal. This allows frequentists to calculate p-values and confidence intervals, which are statements about long-run procedural performance.
Bayesians use probability distributions more broadly:
- To describe our prior belief about a parameter.
- To describe the likelihood of the data.
- To describe our posterior (updated) belief about the parameter.
The CLT is therefore much less critical in modern Bayesian analysis. Instead of relying on its approximations about hypothetical repeated experiments, Bayesians use computational methods to directly calculate the exact posterior distribution of belief based on the one experiment that was actually conducted. The focus shifts from the frequency of a statistic to the probability of a parameter.
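As a toy illustration of what “directly calculating the posterior” can look like, here is a simple grid approximation for a single event rate (again with made-up numbers); brms does something far more sophisticated, sampling the posterior with Stan, but the underlying idea is the same.
# Toy grid approximation of a posterior (made-up data: 6 events in 20 patients)
theta <- seq(0, 1, length.out = 1000)               # candidate values for the event rate
prior <- dbeta(theta, 1, 1)                         # a flat prior belief
likelihood <- dbinom(6, size = 20, prob = theta)    # how well each candidate explains the data
posterior <- prior * likelihood / sum(prior * likelihood)  # Bayes' theorem, normalised
theta[which.max(posterior)]                         # the most plausible value, around 0.3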
How does this affect “The machinery of statistical inference”?
From a Bayesian perspective, the “machinery of statistical inference” is fundamentally different because it aims to directly estimate the probability of a hypothesis given the data, rather than calculating the probability of the data assuming a hypothesis is true.
The article describes the common frequentist framework, null hypothesis significance testing (NHST). This follows a set process:
- Assume no effect: You start by assuming the null hypothesis is true - i.e. that the intervention has no effect in the population.
- Calculate a test statistic: You compute a single value from your sample data, which measures how far your result is from the result that you’d expect if the null hypothesis is true.
- Consult a hypothetical distribution: You compare this statistic to its null distribution, which is the distribution of results you would get if you repeated the experiment an infinite number of times and the null hypothesis were actually true.
- Calculate the p-value: This is the probability of observing data at least as extreme as yours, assuming the null hypothesis is true.
NHST is therefore an indirect, counter-intuitive process. It simply answers the question “how surprising is my data if I assume there is no effect?” without telling you anything about how likely the effect really is.
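That recipe can be made concrete with a small simulation; the counts below are made up purely for illustration, and the “null distribution” is built by repeatedly re-running the experiment under the assumption of no effect.
# Toy NHST by simulation: 30/100 events vs 40/100 events (made-up data)
set.seed(123)
obs_diff <- 40 / 100 - 30 / 100                 # step 2: the observed difference
pooled_rate <- (30 + 40) / 200                  # step 1: assume one shared "null" event rate
null_diffs <- replicate(10000, {                # step 3: the null distribution
  diff(rbinom(2, size = 100, prob = pooled_rate)) / 100
})
mean(abs(null_diffs) >= abs(obs_diff))          # step 4: the (two-sided) p-value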
A Bayesian approach scraps this indirect machinery and instead uses Bayes’ Theorem to directly update beliefs in light of new evidence.
- No need for a null hypothesis: Instead of testing against a rigid “no effect” hypothesis, a Bayesian analysis calculates the entire posterior distribution for the effect size. This distribution represents our updated belief about all plausible values for the parameter after seeing the data.
- The result is a distribution, not a (p-)value: The main output is not a single value but a full probability distribution.
From this, we can derive intuitive summaries to describe our new beliefs:
- A credible interval: This is a range that contains the true parameter value with a specified probability (eg we’re now 95% certain that the true effect lies in this range). This is a direct statement of belief, unlike the confusing definition of a frequentist confidence interval.
- Posterior probabilities: We can then directly answer the questions we care about, such as “what is the probability that this treatment provides a clinically meaningful benefit?” or “what is the probability that the effect is greater than zero?”
In essence, the Bayesian machinery provides a direct, intuitive, and comprehensive summary of evidence that aligns with how we naturally reason about uncertainty. It moves beyond the rigid and frequently misinterpreted framework of null hypothesis testing to give researchers a more complete picture of what their data is actually telling them.
This is how our brains work when assessing patients!
So why don’t we use it? Because the nuances that make Bayesian statistics so intuitive and powerful also make them more computationally complex. Thankfully, even cheap NHS laptops are easily powerful enough to perform Bayesian statistics, and although it’s not a simple “point and click” t-test, the techniques are relatively easy to learn and apply. So let’s use OPTIMISE as an example, exactly as Sidebotham and Hewson did, but apply a Bayesian approach.
Re-re-analysis of the OPTIMISE trial
The trial compared cardiac output-guided management of fluids and vasopressors with usual care in patients undergoing major gastrointestinal surgery.
The Data:
- Intervention group: 134 complications in 366 patients (36.6%).
- Control group (usual care): 158 complications in 364 patients (43.4%).
The original frequentist analysis reported a p-value of 0.07 and a 95% confidence interval for the relative risk of a complication occurring from 0.71 (complications less likely with the intervention) to 1.01 (very slightly more likely). Because the confidence interval crosses 1 (no difference), the authors concluded that cardiac output monitors didn’t impact on the rate of complications.
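For comparison, this frequentist result can be reproduced (approximately) with a standard two-proportion test in base R. It reports the difference in proportions rather than the relative risk, but the p-value should come out close to the reported 0.07.
# A classical two-proportion (chi-squared) test on the OPTIMISE counts
prop.test(x = c(134, 158), n = c(366, 364))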
We can use the powerful brms and tidybayes packages in R to build a Bayesian model. We’ll model the probability of a complication based on the treatment group.
To start with, we need a prior - this is our belief before seeing the data.
A weakly informative prior is a good start: essentially, we believe there is probably no effect, but we aren’t very sure about it. We’ll use a normal distribution centered on zero (no effect) on the log-odds scale, which says that very large effects are less plausible than small or moderate ones.
That will do for this example, though in practice Bayesian re-analyses are often repeated with several different priors (for example sceptical and optimistic ones) to check how sensitive the conclusions are to that choice.
prior_b <- set_prior("normal(0, 1.5)", class = "b")
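To get a feel for what this prior implies, we can simulate from it and look at the result on the odds-ratio scale; roughly speaking, it centres on an odds ratio of 1 (no effect) and treats values much outside about 0.05 to 20 as implausible.
# Simulate from the normal(0, 1.5) prior, which sits on the log-odds scale
prior_draws <- rnorm(100000, mean = 0, sd = 1.5)
# ...and see what it implies on the odds-ratio scale (roughly 0.05 to 20 for the central 95%)
quantile(exp(prior_draws), probs = c(0.025, 0.5, 0.975))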
Now we can combine our prior belief with the data. This takes a few seconds to run:
# Run the Bayesian logistic regression model (log-odds of an event occurring)
brm_optimise <- brm(
  events | trials(total) ~ 1 + group,
  data = optimise_data,
  family = binomial(link = "logit"),
  prior = prior_b,
  seed = 2222 # for reproducibility
)
Compiling Stan program...
Start sampling
SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 2.3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.23 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.006 seconds (Warm-up)
Chain 1: 0.006 seconds (Sampling)
Chain 1: 0.012 seconds (Total)
Chain 1:
Having done this, we can finally examine the posterior distribution—our updated belief about the treatment’s effect. This graph is the core result of our Bayesian analysis. It shows the full range of plausible values for the treatment’s effect. The peak of the curve represents the most likely values, while the tails represent less likely ones, and the dashed red line the point of no effect.
# Extract the posterior draws for interpretation
posterior_analysis <- brm_optimise %>%
  spread_draws(b_Intercept, b_groupIntervention) %>%
  mutate(
    # Convert from log-odds back to probabilities and risk
    prob_control = plogis(b_Intercept),
    prob_intervention = plogis(b_Intercept + b_groupIntervention),
    # Calculate key metrics
    odds_ratio = exp(b_groupIntervention),
    risk_difference = (prob_intervention - prob_control) * 100 # as a percentage
  )
# Visualise the result for the risk difference
ggplot(posterior_analysis, aes(x = risk_difference)) +
  stat_halfeye(fill = "#0072B2", alpha = 0.8, p_limits = c(0.025, 0.975)) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  labs(
    x = "Absolute Risk Difference (%)",
    y = "Density"
  ) +
  theme_minimal()
Comparing the two outcomes
Let’s compare the conclusions from the two approaches side-by-side.
The frequentist 95% confidence interval for RR is 0.71 to 1.01. We can interpret this as “if this study were repeated infinitely, 95% of the calculated confidence intervals would contain the one true effect size.” This is a confusing statement about the procedure, not the result itself. Because the interval contains 1.0 (no effect), the result is “not statistically significant.”
The Bayesian 95% credible interval can be calculated from the model:
posterior_analysis %>%
  median_qi(risk_difference)
# A tibble: 1 × 6
risk_difference .lower .upper .width .point .interval
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 -6.70 -13.8 0.0784 0.95 median qi
Our 95% credible interval for the risk difference is -13.8% to +0.1%. The interpretation is simple and far more powerful: “Given the data, there is a 95% probability that the true effect lies somewhere between a 13.8% absolute risk reduction and a slight (0.1%) risk increase.”
But is it “significant”?
Again, the frequentists report that p = 0.07 (or 0.06 if you re-analyse it as per the article). This value is the probability of seeing data this extreme (or more extreme) if the treatment had no effect. Since 0.07 > 0.05, the result is deemed “not statistically significant.” This binary thinking creates a “cliff edge” where p = 0.049 is a success and p = 0.051 is a failure, even when the evidence is virtually identical.
The Bayesian approach is to instead estimate the probability of benefit. We can ask a more practical question: “What’s the probability that the treatment is beneficial (ie that the risk difference is less than zero)?” using the model:
posterior_analysis %>%
  summarise(p_benefit = mean(risk_difference < 0))
# A tibble: 1 × 1
p_benefit
<dbl>
1 0.974
Again, this value (our updated belief following the new evidence) is a far more powerful answer. The Bayesian interpretation is: “There is a 97.4% probability that cardiac output-guided management is better than usual care.”
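We can go further and put a number on a clinically meaningful benefit. As a sketch, suppose we (purely for illustration) defined “meaningful” as an absolute risk reduction of at least 5 percentage points; the same posterior draws answer that question directly:
# Probability of an absolute risk reduction of at least 5 percentage points
# (the 5% threshold is illustrative only - pick whatever margin matters clinically)
posterior_analysis %>%
  summarise(p_meaningful = mean(risk_difference < -5))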
So which should you use?
The frequentist analysis of the OPTIMISE trial leaves us in an awkward spot. A p-value of 0.07 (or even 0.06!) is often interpreted as “no effect,” but the authors correctly note it simply quantifies the data’s compatibility with the null hypothesis. You often see authors (incorrectly) refer to p-values of close to 0.05 as showing a “trend” towards a positive finding, but that’s not what the number means nor how NHST should be used.
The Bayesian analysis, however, provides a much clearer picture. While a very small harm isn’t entirely ruled out (the credible interval just crosses zero), the evidence strongly points towards a benefit (with a 97.4% probability). Yes, the computational approach is slightly more involved, but using it allows for a more nuanced discussion about whether a 6.7% risk reduction with a high probability of being real is clinically meaningful, moving us beyond the simplistic, and often misleading, tyranny of “statistical significance”.