Charles M. Beasley, Jr and Roy Tamura: What We Know and Do Not Know by Conventional Statistical Standards About Whether a Drug Does or Does Not Cause a Specific Side Effect (Adverse Drug Reaction)
Foreword by John M. Davis
This book addresses the statistical analysis of side effects (referred to in the specialty of pharmacovigilance as “adverse drug reactions”). The book focuses on estimating the sample sizes needed to conclude that an adverse event (i.e., a medical complaint reported by someone taking a drug or placebo, which the drug might have caused if the person was taking the drug) is caused by the experimental drug evaluated in a randomized controlled trial (RCT) or a set of RCTs. Patients in both the experimental drug group and the placebo group of an RCT commonly experience and report medical complaints referred to as adverse events. Which adverse events occurring in the experimental drug group should be considered “real” side effects caused by the drug?
Alternatively, are these adverse events in the experimental drug group an amalgam of three things: the base rate of common medical complaints unrelated to the drug; an effect of participating in the RCT, in which concern about and anticipation of side effects or symptoms of the disease being studied induce the medical complaints; and signs and symptoms caused by the disease being treated? For example, patients treated for COVID-19 with an experimental drug in an RCT might experience loss of taste due to the disease rather than as a side effect of the drug. How does an individual correctly interpret the occurrence of such adverse events? Interpretations by some individuals can sometimes be incorrect.
Much the same considerations apply to rare serious events. For example, four patients died in the Pfizer COVID-19 vaccination trial control group and two in the vaccine group. But four died in each group of the Moderna COVID-19 vaccine trial. How does one make sense of this information? If someone focused on the number of deaths in these trials, comparing deaths in the active treatment group to deaths in the control group, the Moderna vaccine has more deaths relative to its control group than the Pfizer vaccine. The analysis of the difference in deaths with vaccine versus control can be based on two comparisons. The first is the difference in the incidence of deaths between drug and placebo. The second is the ratio of the incidence of deaths on drug to that incidence on placebo. By the first comparison, the Pfizer vaccine group had two fewer deaths than its control group, whereas the Moderna vaccine group had the same number of deaths as its control group. By the second comparison, the Pfizer vaccine group had 50% of the deaths seen in its control group, whereas the Moderna vaccine group had 100%. If a layperson considers these numbers, the Pfizer vaccine might appear to them to be safer. However, statistically, the numbers of deaths with the two vaccines compared to control are not significantly different, even when 30,000 subjects were included in both the vaccine groups and the control groups for both vaccine trials.
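The two comparisons described above, and the absence of statistical significance, can be illustrated with a short calculation. This is a minimal sketch and not the authors' method: it uses the death counts quoted above, assumes a hypothetical 15,000 subjects per arm as an illustrative round number (the arm size and all function names are assumptions made here), and applies a two-sided Fisher exact test as one conventional choice for small counts.

```python
from math import comb


def risk_difference(d_drug, n_drug, d_ctrl, n_ctrl):
    """Incidence on drug minus incidence on control."""
    return d_drug / n_drug - d_ctrl / n_ctrl


def risk_ratio(d_drug, n_drug, d_ctrl, n_ctrl):
    """Incidence on drug divided by incidence on control."""
    return (d_drug / n_drug) / (d_ctrl / n_ctrl)


def fisher_exact_two_sided(d_drug, n_drug, d_ctrl, n_ctrl):
    """Two-sided Fisher exact p-value for a 2x2 table of deaths by arm.

    Conditions on the total number of deaths and sums the hypergeometric
    probabilities of every table that is no more likely than the observed one.
    """
    total_deaths = d_drug + d_ctrl
    denom = comb(n_drug + n_ctrl, total_deaths)

    def prob(k):
        # Probability that exactly k of the deaths fall in the drug arm.
        return comb(n_drug, k) * comb(n_ctrl, total_deaths - k) / denom

    p_obs = prob(d_drug)
    k_min = max(0, total_deaths - n_ctrl)
    k_max = min(total_deaths, n_drug)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(k) for k in range(k_min, k_max + 1)
            if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


N_PER_ARM = 15_000  # assumption: illustrative arm size, not taken from the trials

# (deaths on vaccine, deaths on control) as quoted in the foreword
trials = {"Pfizer": (2, 4), "Moderna": (4, 4)}

for name, (d_vaccine, d_control) in trials.items():
    args = (d_vaccine, N_PER_ARM, d_control, N_PER_ARM)
    print(
        f"{name}: difference = {risk_difference(*args):+.6f}, "
        f"ratio = {risk_ratio(*args):.2f}, "
        f"Fisher p = {fisher_exact_two_sided(*args):.3f}"
    )
```

Under these assumptions, neither p-value comes anywhere near the conventional 0.05 threshold, and doubling the assumed arm size barely changes them, because the test is driven almost entirely by how the handful of deaths splits between the arms.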
The authors consider the sample sizes necessary to prove that an adverse event is caused by a drug to essentially the same standard as proving a drug is efficacious. They consider the statistics of sample size estimation both for proving that a drug does cause a side effect and for proving that it does not. These sample sizes grow inordinately large as the incidence of an event in both the drug and placebo groups declines and/or the background rate of the adverse event increases.
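How quickly these sample sizes grow can be seen with a standard textbook approximation. The following is a minimal sketch, not the authors' own calculation: it uses the unpooled normal-approximation formula for comparing two proportions, and the function name `n_per_arm`, the two-sided alpha of 0.05 and the power of 80% are assumptions chosen here for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_placebo, p_drug, alpha=0.05, power=0.80):
    """Approximate subjects needed per arm to detect p_drug vs. p_placebo.

    Normal approximation with unpooled variance:
    n = (z_{1-alpha/2} + z_{power})^2 * (p0*q0 + p1*q1) / (p1 - p0)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_placebo * (1 - p_placebo) + p_drug * (1 - p_drug)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_drug - p_placebo) ** 2)


# Case 1: the drug doubles the placebo incidence, and both rates decline together.
for p0 in (0.10, 0.01, 0.001, 0.0001):
    print(f"placebo {p0:.4f} vs drug {2 * p0:.4f}: n per arm = {n_per_arm(p0, 2 * p0)}")

# Case 2: the drug adds a fixed one percentage point to a rising background rate.
for p0 in (0.01, 0.05, 0.10, 0.20):
    print(f"placebo {p0:.2f} vs drug {p0 + 0.01:.2f}: n per arm = {n_per_arm(p0, p0 + 0.01)}")
```

In the first loop each tenfold drop in incidence raises the required sample size roughly tenfold; in the second, the same absolute excess becomes harder to detect as the background rate rises. Exact or continuity-corrected methods would give somewhat different numbers, particularly for very rare events.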
Dr. Beasley, a psychiatrist, and Dr. Tamura, a statistician, worked at Eli Lilly and Company for more than 20 years managing the statistical issues involved in measuring and interpreting side effect data and are respected by both industry and academia for their expertise, comprehensive knowledge and integrity. The book begins with the statistical section, which deals with the calculations of estimating sample size and interpretation of statistical inference in such work. They discuss common side effects, uncommon side effects, rare side effects and extremely rare side effects. This discussion contextualizes several types of side effects with a detailed discussion of inferring causation. For example, does olanzapine directly cause diabetes?
The book’s coverage of these topics and their details is an excellent introduction to statistical significance, power analysis and sample size estimation issues for anyone analyzing RCTs, including RCTs in multiple therapeutic areas. Providing medical examples places the statistical discussion in real-world problem areas, making it a somewhat difficult read. I would highly recommend this book to those individuals interested in these complex issues. This book is technical, directed toward specific statistical analysis and inference about causation. However, the book also considers a more significant, more general matter of how seriously to take adverse events that might or might not be side effects. The authors suggest two factors that should influence the extent to which such adverse events should be seriously considered. The first factor is their clinical significance. The second factor is the extent to which the demonstration that the adverse event might be a side effect conforms with the standard required for a demonstration of efficacy.
I want to discuss these matters in the context of behavioral economics as exemplified by the work of Daniel Kahneman and described in his books Thinking, Fast and Slow (2011) and Noise: A Flaw in Human Judgment (2021). Kahneman, a psychologist, won a Nobel Prize in economics based on the implications for the field of economics in these books. He postulates two types of thinking: System 1 (fast thinking) and System 2 (slow thinking). Since System 1 takes place in the blink of an eye, is automatic, effortless and cannot be turned off, you cannot help being influenced by the biases that automatically come to mind. System 2 thinking takes effort and concentration over minutes, hours, days or years.
System 1 is what we use most of the time and works reasonably well, but cognitive errors, such as availability bias, confirmation bias and loss aversion, happen ubiquitously. System 1 thinking is not conducive to good mathematical thinking. System 1 deals poorly with rare and uncommon events, generally overweighting them, especially if the event is vivid. As a result, rare events are sometimes overweighted and, conversely, sometimes ignored. The human brain is not good at quantifying and conceptualizing uncommon events. This inability is compounded by uncertainty and by interpolation from limited quantitative data.
This difficulty with System 1 thinking is significant from a clinical perspective. It can cause serious harm: patients who overweight the occurrence of adverse events (whether rare or not) are more likely to be noncompliant with pharmaceutical interventions that benefit most patients. Also, reviewers of data, be they the sponsors of RCTs, academics or regulators, can be led astray in their interpretations through System 1 thinking. Should the same standards of statistical proof be used to assess efficacy and to determine whether the adverse events reported in an RCT (or set of RCTs) are side effects? The book provides the tools for making calculations to estimate the probability that a side effect is caused by a drug, is not caused by the drug, or falls somewhere in between.
In Noise: A Flaw in Human Judgment, Kahneman and his co-authors discuss the role of noise and bias in decision-making, which, in this case, would involve decisions about which adverse events are side effects and the choice among clinically used drugs when comparing alternative medicines from a safety perspective. Empirical research shows a large amount of variability in decision-making, and many major and minor biases can influence a given decision, causing noise and systemic biases. One can research which factors drive a decision, assess whether they are valid and use feedback to improve decision-making to align with sound science and statistics. Reduction of noise can also considerably improve the decision-making process and can be done immediately; no long-term follow-up is needed. Most importantly, reducing noise can improve decision-making when it involves a value judgment, even though different people might have different values. The existence of cognitive bias is supported by a massive number of controlled trials and other evidence. As said above, Beasley and Tamura’s book provides tools to better understand the statistical principles involved in optimizing the estimates of the probability that a given adverse event is caused by a drug, is not caused by a drug, or is of uncertain causation.
Should risks and benefits be weighted equally? Of course, patients should decide for themselves. But their doctors, in shared decision-making, play a crucial role in shaping patient decisions. Because of this, the standards of information in medical journals and in Food and Drug Administration labeling have an important impact on the doctor’s recommendation. Drug labeling influences clinical decisions about drug use. Therefore, should institutions such as the Food and Drug Administration shift their current practices to a more detailed and nonbinary description of adverse events, one that states the probabilities that listed side effects are, in fact, side effects? Obfuscating the answer to this question has considerable influence on the knowledge base of a given drug. This question of how best to describe individual side effects in drug labeling differs from the value judgment of how much weight to assign to a given side effect in describing a drug’s risk-benefit profile. More specifically, drug labeling should list the percentage of patients experiencing adverse events believed to be side effects in drug versus placebo groups and the statistical significance level and/or confidence interval for this drug-placebo comparison. There should be some indication of the probability, and the confidence around that probability, that adverse events listed as side effects are genuinely side effects. Listed adverse events could then be grouped as proven side effects, possible but unproven side effects or unlikely to be side effects (but listed in the drug label out of an abundance of caution because of the clinical significance of the adverse event).
Furthermore, some adverse events might be improved by a drug. Calling something a “side effect” or “adverse drug reaction” implies it is caused by the drug when it may not be. Knowing the probability, based on the cumulative controlled database for the drug, that an adverse event listed as a “side effect” / “adverse drug reaction” is genuinely caused by the drug, along with the clinical significance of the adverse event, would clarify the picture, decrease the noise and allow for a more balanced assessment of risk versus benefits. Such practice would improve medication decisions.
Drug companies vigorously promote their drugs, but side effects do not receive as much focus. Availability bias may favor the drug unless a side effect is newsworthy. The discovery of a side effect and, better still, of a way to treat it should not be considered less important than any other medical discovery.
There is still much to be done to improve this process of drug labeling. Using the same standard of proof for side effects as for efficacy and describing those adverse events of potentially significant clinical consequence would be a major step toward developing a balanced description of risk. As the authors point out, some adverse events that do not meet the efficacy standard of proof must be described as possible side effects due to their clinical significance (e.g., potentially fatal, potentially leading to permanent disability). This book provides a statistical text, but more importantly, it raises these critical issues.
Kahneman D. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux; 2011.
Kahneman D, Sibony O, Sunstein CR. Noise: A Flaw in Human Judgment. New York: Little, Brown; 2021.
August 26, 2021