REMAP-CAP have released their preliminary report into immune modulation and everyone is getting very excited by the potential for tocilizumab and sarilumab to improve COVID-19 outcomes in critical care. You even go so far as to think it would make a great journal club paper… until you read the methods section:
> The primary analysis was generated from a Bayesian cumulative logistic model, which calculated posterior probability distributions of the 21-day organ support-free days (primary outcome) based on evidence accumulated in the trial and assumed prior knowledge in the form of a prior distribution
It’s not that bad, honest. Let’s take a deep breath and talk statistical inference for doctors and, borrowing heavily from Daniël Lakens’ excellent Coursera course, consider how we do it in terms of yoga:
Karma: the Path of Action
The frequentist approach to statistics is the “classical” one we’re all taught at medical school. We run a null-hypothesis significance test (NHST): pick a statistical test, apply it to the data, and the lower the p-value, the better the stats, right?
Afraid not.
This type of statistics tells us about the outcomes of experiments in the long run, but actually tells us nothing about the experiment in front of us. The p-value is nothing but a measure of how surprising the data we have is, assuming there’s no true effect (i.e. that the null hypothesis is true). It doesn’t tell us anything about the alternative hypothesis, least of all how probable it is.
All of medical statistics is essentially asking whether something differs between two populations - e.g. is there a height difference between men and women, or do people with COVID given tocilizumab live longer than those who aren’t given it? Let’s call this the effect size. The most accurate way to calculate it would be to measure the parameter of interest in everyone in these two populations, but that’s an impossible ask: we clearly can’t measure all the men and women in the world, and although we can measure outcomes for all the COVID patients given tocilizumab so far, we need to be able to generalise this to all the people we want to give it to.
So the next best option is to measure the effect in a sample taken from the population of interest, then use a statistical test to estimate how likely the effect size - averaged over everyone we’ve measured - is to apply to the population(s) of interest. This is a function of the mean effect size in the sample, how variable it is, and how many people we’ve measured.

The NHST way to calculate this is to plug those numbers into an appropriate statistical test (a t-test, say) and read off a p-value.

The key thing to remember here is that the p-value is the probability of the data given the null hypothesis - P(data | H0) - not the probability of the null hypothesis given the data, and certainly not the probability that the alternative hypothesis is true.

[Figure: p-values for 100,000 simulated experiments, each time drawing a fresh sample - spread uniformly between 0 and 1 when the null hypothesis is true, piled up near zero when there is a real effect.]

From this, if we call p < 0.05 “significant”, we will wrongly reject a true null in 5% of experiments in the long run - and that long-run error control is all that frequentist statistics actually promises.
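If you want to see this for yourself, here’s a minimal sketch of that kind of simulation (the group size and effect size are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group = 100_000, 50

def simulate_p_values(true_effect):
    """Run n_sims two-sample t-tests and return one p-value per experiment."""
    control = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    treated = rng.normal(true_effect, 1.0, size=(n_sims, n_per_group))
    return stats.ttest_ind(control, treated, axis=1).pvalue

# Under the null the p-values are uniform: ~5% land below 0.05 by chance.
p_null = simulate_p_values(true_effect=0.0)
print(f"H0 true: {np.mean(p_null < 0.05):.3f} of p-values < 0.05")

# With a real effect they pile up near zero; this proportion is the power.
p_eff = simulate_p_values(true_effect=0.5)
print(f"H1 true: {np.mean(p_eff < 0.05):.3f} of p-values < 0.05")
```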
Jnana: the Path of Knowledge
In most uses of NHST in medical statistics, we should stop there. Unfortunately many authors will draw an inference from that probability of the data to try and say something about the hypothesis (even though we can’t without using Bayesian statistics, more on this later). Worse still, we often draw inferences from our probability of data to say something that we’ve a priori said we can’t! How many times do you read “there was a trend towards” when the data has failed the pre-specified significance test?! Please stop doing this.
If we want to say more (like “trends”, which is really trying to make a statement about the strength of our convictions regarding a hypothesis), we need to move from describing the data given purely the null hypothesis (there being no difference) to talking about the likelihood of different hypotheses being true, given said data. This is where Richard Royall’s likelihood paradigm comes in.
Each hypothesis has a potential set of parameters for the data - mean, variation, etc. Let’s call these θ. The likelihood function measures how probable the observed data is under a given hypothesis’s parameters, and the likelihood ratio (LR) - the likelihood of the data under H1 divided by its likelihood under H0 - quantifies the relative evidence for one hypothesis over the other.
Royall defined LR > 8 as moderately strong evidence for H1, and LR > 32 as strong evidence for H1. But likelihoods are simply relative evidence: just because the data is more likely under one hypothetical distribution than another doesn’t mean it came from either of them, only that there is more relative evidence for one than the other (while both could still be highly unlikely). To say more about where the data has come from we need the Reverend Bayes and his probability theorem.
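As a toy example (the hypotheses and counts are made up, and this is not any particular trial’s analysis), here’s a Royall-style likelihood ratio for binomial data:

```python
from scipy.stats import binom

# Hypothetical data: 16 recoveries in 20 patients.
# Compare H0: recovery rate p = 0.5 against H1: p = 0.8.
k, n = 16, 20
likelihood_h0 = binom.pmf(k, n, 0.5)  # P(data | H0)
likelihood_h1 = binom.pmf(k, n, 0.8)  # P(data | H1)

lr = likelihood_h1 / likelihood_h0
print(f"L(H1) = {likelihood_h1:.4f}, L(H0) = {likelihood_h0:.4f}, LR = {lr:.1f}")
# LR ~ 47 here: strong relative evidence for H1 on Royall's scale, even
# though neither likelihood proves the data actually came from either model.
```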
Bhakti: the Path of Belief
Bayesian statistics allows us to compare the likelihood of different hypotheses given the data, and use these to modify our belief in each of those hypotheses. It all relies on Bayes’ theorem and the probabilities we’ve observed to calculate the probability of our hypothesis being true, given the data we’ve seen, and what we think we know about it (and the strength of that knowledge):

P(hypothesis | data) = P(data | hypothesis) × P(hypothesis) / P(data)
The beauty of Bayesian thinking is that our degree of belief in our hypothesis depends on our prior knowledge and experience (blue curve), but each time we add data (red dots), their corresponding likelihood distribution (red curve) alters our subsequent (“posterior”) belief (purple curve), which in turn becomes the prior for the next set of data.
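That loop is easy to sketch with a conjugate Beta-Binomial model (the batch sizes and counts below are invented; real trials use far richer models):

```python
# Start from a weak prior belief that the success rate is around 50%,
# encoded as pseudo-counts of a Beta distribution.
a, b = 2, 2

# Hypothetical batches of (successes, failures) arriving over time.
for successes, failures in [(7, 3), (12, 8), (18, 2)]:
    # Conjugate update: add the counts; the posterior becomes
    # the prior for the next batch.
    a, b = a + successes, b + failures
    print(f"posterior Beta({a}, {b}): mean = {a / (a + b):.2f}")
```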
I find it very strange that, on the whole, doctors don’t get Bayesian statistics because this is exactly how clinical reasoning works:
- Let’s say you’re in A&E, and you’re told someone has chest pain. You already have a guess that this might be an MI, and that depends on where you work, its demographics, degree of deprivation, etc. That’s your prior.
- Now you set eyes on the patient, and they’re an overweight, middle-aged man, clammy and clutching at their chest. Your likelihood of this being an MI versus anything else is relatively high, so you update your posterior to an even higher probability.
- You’re handed an ECG. It’s essentially normal. The LR for hypothesis-MI based on this is pretty low, but your prior (the posterior from the last step) is still stonkingly high, so your new posterior given this new info is still reasonably sporting.
- The troponin comes back at 10,000. Very high LR, so your new posterior is so high that it’s essentially certain. You call cardiology and move onto the next patient.
(As an aside, theNNT is a brilliant website which actually has all these LRs for various conditions, and similar analyses of treatments.)
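Here’s that chest-pain story as arithmetic - Bayes’ theorem in odds form. The pre-test probability and LRs below are invented for illustration (theNNT has real ones):

```python
def update(prob, lr):
    """Convert a probability to odds, multiply by the likelihood ratio,
    and convert back: post-test odds = pre-test odds x LR."""
    odds = prob / (1 - prob)
    post_odds = odds * lr
    return post_odds / (1 + post_odds)

p = 0.20              # assumed prior: baseline MI probability in your A&E
p = update(p, 4.0)    # clammy, clutching their chest: assumed LR 4
print(f"after eyeballing the patient: {p:.2f}")   # 0.50
p = update(p, 0.6)    # essentially normal ECG: assumed LR 0.6
print(f"after the ECG:                {p:.2f}")   # 0.38
p = update(p, 100.0)  # troponin of 10,000: assumed LR 100
print(f"after the troponin:           {p:.2f}")   # 0.98
```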
Bayesian statistics is the process of applying this type of thinking to trial data. Instead of setting an end-point (“once we’ve recruited x people, according to our power calculation…”) and then doing a one-off test on the data (“our p-value was…”), a Bayesian trial updates the relative probabilities of the hypotheses being true every time a new piece of data is added, resulting in a “credible interval” of reasonably probable parameters for the true distribution (holding the data fixed and varying the hypotheses - the mirror image of a frequentist confidence interval, which holds the hypothesis fixed and imagines repeating the data).
If you’re desperate for a p-value, the other way to use Bayesian stats is to think in terms of the Bayes factor: the ratio of how well the data is predicted under one hypothesis versus another, which tells you how far the data should shift your relative belief between the two.
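For a toy binomial example (again invented numbers, and nothing to do with REMAP-CAP’s actual model), comparing a point null H0: p = 0.5 against an H1 with a uniform Beta(1, 1) prior on p gives a closed-form Bayes factor:

```python
from math import comb, exp, log
from scipy.special import betaln

k, n = 16, 20  # hypothetical data: 16 successes in 20 trials
a, b = 1, 1    # uniform Beta(1, 1) prior on p under H1

# Marginal likelihood under H1 (Beta-Binomial) and likelihood under H0.
log_m_h1 = log(comb(n, k)) + betaln(k + a, n - k + b) - betaln(a, b)
log_m_h0 = log(comb(n, k)) + k * log(0.5) + (n - k) * log(0.5)

bf_10 = exp(log_m_h1 - log_m_h0)
print(f"BF10 = {bf_10:.1f}")  # ~10: the data are ~10x better predicted by H1
```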
The most complicated part of all this to get your head around is the prior distribution: surely choosing different priors could lead to huge differences in the posterior distributions, at least initially? Well, that’s actually one of the major strengths of Bayesian analyses, as you can set different priors and collect data until they converge on something meaningful. Trials will usually report results given a null prior (i.e. no odds of there being anything meaningful), a neutral prior (no difference), an optimistic prior (as one would use for a power calculation in NHST), and a pessimistic prior (the opposite of what you hope to find). For example, see this figure from the Bayesian reanalysis of the ANDROMEDA-SHOCK trial:
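A sketch of that prior-sensitivity idea with the Beta-Binomial model again (all numbers invented): feed sceptical, neutral and optimistic priors the same data and watch the posteriors converge:

```python
from scipy.stats import beta

priors = {
    "sceptical":  (2, 8),  # prior mean 0.2: "this probably doesn't work"
    "neutral":    (1, 1),  # flat prior: no opinion either way
    "optimistic": (8, 2),  # prior mean 0.8: "this probably works"
}

successes, failures = 140, 60  # hypothetical accumulated trial data
for name, (a, b) in priors.items():
    a_post, b_post = a + successes, b + failures
    lo, hi = beta.interval(0.95, a_post, b_post)  # 95% credible interval
    print(f"{name:>10}: posterior mean {a_post / (a_post + b_post):.2f}, "
          f"95% CrI ({lo:.2f}, {hi:.2f})")
```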
What does this all mean for REMAP-CAP?
REMAP-CAP is brilliant. It’s the adaptive platform trial stats nerds like me have been waiting for in critical care, and it landed at the right time and place to add domain after domain studying the COVID-19 pandemic on top of its intended target of community-acquired pneumonia. Patients get randomised across multiple domains, which can be added or removed without major changes to the running of the trial, and the analysis allows interactions between these domains to be picked out and adjusted for. Sequential Bayesian analyses run within each domain not only until a signal is found, but also (and this is the coolest bit) adjust the probability with which interventions are delivered to patients as evidence accumulates to support them (after each “adaptive analysis”). In essence, if everyone were recruited to the study, there would be no need to announce the results, because the study mechanics would automatically put people on the therapies with the best evidence as that evidence becomes stronger and stronger. Yeah, I know. Awesome, right?
If you’re interested (and why wouldn’t you be in the future of clinical research?) CCR has put together a series of great podcasts on REMAP-CAP, and as always The Bottom Line have a great summary of the initial results regarding steroids.
Want even more?
If you’re keen for more details, Scholarpedia has a great article on Bayesian stats in much more depth (which I’m mostly linking to because I have an academic man-crush on David Spiegelhalter) and Eoin Travers has a lovely article on the maths of frequentist and Bayesian hypothesis testing if you’re really getting into it. From a more clinical point of view there’s a good BJA Education article and review from BJA worth checking out (which are also potential FRCA/FFICM fodder, be warned!).