David Healy: Do Randomized Controlled Trials Add to or Subtract from Clinical Knowledge?
In 1947, a clinical trial of streptomycin introduced randomized controlled trials (RCTs) to medicine. From then through to their incorporation into the 1962 amendments to the Food, Drug, and Cosmetic Act, occasioned by the thalidomide tragedy, doubts were expressed as to whether RCTs had an epistemological basis in clinical realities. While there have been ongoing disputes since about the best statistical approach to take to the data stemming from RCTs – whether confidence intervals are preferable to significance testing, for instance – fewer doubts have been expressed about their fundamental validity.
Repeated characterizations of RCTs as offering gold-standard evidence likely leave most clinicians thinking RCTs have a solid epistemological foundation, and if clinicians have no apparent qualms, few others likely feel compelled to dig deeper. More specifically, RCTs are cast as offering evidence that is generalizable and that provides knowledge within confidence limits in a manner that the idiosyncratic views of clinicians do not and individual case reports cannot.
Hill and Fisher
The initial streptomycin trial tested whether randomization was feasible, since randomization might offer a further control of the subtle biases involved in evaluating a medicine. The 1947 Medical Research Council RCT demonstrated that randomization was feasible. The information this trial produced, however, was less clinically relevant than that from a prior clinical trial of streptomycin, which controlled for confounders in the then standard way and depended on clinical judgement (Healy 2012; Healy 2020). The earlier trial noted a range of hazards of streptomycin and the fact that tolerance developed rapidly – important therapeutic issues that the RCT didn't spot.
Tony Hill, the RCT lead, was not in a position to comment on the epistemological validity of this new exercise in clinical evaluation. He had taken the idea of randomization from a horticultural thought experiment Ronald Fisher had outlined two decades previously in which Fisher proposed that randomization could control for unknown confounders.
Hill’s primary idea was that randomization might control for the difficult-to-detect ways in which clinicians might steer patients likely to respond well into an active treatment arm. There was no consideration as to whether randomization would control for other unknown unknowns in a clinical trial (Healy 2012; Healy 2020).
RCTs brought statistical significance testing in their wake, in part because Fisher tied randomization to significance testing. The only things that can stop an expert judgement from being correct every time, Fisher argued, are unknown confounders and chance. Significance testing could control for chance and randomization could take care of unknown confounders. This model had an anchor in the real world – an expert who knew what they were doing and whose judgements were invariably correct – such as offering a view that wearing a parachute if you jump from a plane at 5,000 feet will save your life. Experiments that got the same result every time demonstrated that we knew what we were doing.
There was no suggestion from Fisher or Hill that randomization could correct for an expert who didn’t know what she/he was doing. There are relatively few situations in medicine, other than those involving simple procedures, where doctors know what they are doing. The more doctors know, the more they approach Fisher’s scenario – their actions produce the predicted results every time. But no one thinks that we need to run randomized trials in situations where we are likely to get the predicted result every time.
In the case of breast cancer, on the basis of our knowledge of physiology, it was hoped that giving Herceptin to Her 2+ receptor breast cancers might produce better responses than cisplatin, a more indiscriminate toxin, which nevertheless produces better responses than placebo. Trials confirm this but also reveal that even using Herceptin in Her 2+ breast cancers, we do not get the same result every time – there is a lot we don’t know.
Our lack of knowledge is even more marked in trials testing the outcomes of stents compared to other cardiac procedures, where doing what appears physiologically obvious does not produce the expected results. The issue is not one of whether stents do not work where other treatments do but rather whether we know what we are doing, which we mostly don’t. While recent stent RCTs are a good demonstration of the power of RCTs to stall a therapeutic bandwagon, the view that stents do not necessarily produce a good outcome had been accepted clinical wisdom in vascular leg surgery for decades and for stents in some cardiac surgery quarters before any RCT had been run.
Taking issue with Fisher, Jerzy Neyman and Egon Pearson drew on the approach Carl Friedrich Gauss had taken, around 1810, to managing the error in astronomical measurements of stars. Gauss’ ideas were picked up by Pierre-Simon Laplace, and their combined input (1809-1827) to the central limit theorem, least-squares optimization and efficient linear algebra provided significant benefits for the physical sciences, engineering, astronomy (determining star locations and computing planetary orbits) and geodesy.
The Neyman-Pearson modification of Fisher’s proposal, by incorporating confidence intervals, goes some way to compensating for our failure to get the same result every time in trials like the Herceptin trials. And taking successive measurements of a pulse in an individual is in some respects similar to determining the precise location of a star – the tighter the confidence interval bounding these measurements, the more apparent it is that we can do things reliably. When applied to the problem of imprecise measuring instruments and invariant entities like stars, confidence intervals have an anchor in the real world.
Pulses, however, do not have a fixed position in the way stars do. It is possible to imagine that confidence intervals could be used in a manner consistent with us knowing what we are doing if they are used to distinguish between a repeated set of pulse measurements before and after the administration of a drug to one individual. Their current use in medicine seems predicated on the idea that a cohort of patients can be regarded as a single object like a galaxy and if that were the case, their use would be uncomplicated.
The problem, however, is that confidence intervals control for measurement error. When pulses increase in response to a drug in one individual and decrease in a second, this is not measurement error. Claiming the true effect of the drug likely lies near some mean of the effects in a number of individuals – in this case no effect – is wrong. In the case of stars, we knew enough about what we were doing to make reasonable inferences from measurement error. We don’t know enough to make comparable inferences when giving medicines, and while the use of confidence intervals in Herceptin trials has greater coherence than Fisher’s tests for statistical significance, the results do not provide a good basis for the practice of medicine – without the intervention of clinical judgement.
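A minimal sketch, using made-up pulse changes, of how a confidence interval around a group mean can report "no effect" when every individual had a marked effect:

```python
import statistics

# Hypothetical pulse changes (beats/min) after a drug in ten people:
# the drug raises the pulse in half of them and lowers it in the other half.
changes = [+12, +10, +11, +9, +13, -12, -10, -11, -9, -13]

mean = statistics.mean(changes)   # 0: "no effect" on average
sd = statistics.stdev(changes)
n = len(changes)

# 95% confidence interval for the mean (normal approximation)
half_width = 1.96 * sd / n ** 0.5
ci = (mean - half_width, mean + half_width)

print(f"mean change = {mean:.1f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
# The interval straddles zero, yet every individual's pulse changed
# by at least 9 beats/min - heterogeneity, not measurement error.
```

The interval here describes scatter around a group mean; it says nothing about whether any individual was unaffected.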
In the case of Fisher’s fertilizers and significance testing, and Gauss’ stars and confidence intervals, the exercise centers on a primary endpoint – are there one or two stars, or more or fewer ears of corn? Similarly, in RCTs as currently undertaken, a focus on a primary endpoint is viewed as key to ensuring that only chance or measurement error will get in the way of the correct result.
A horticultural expert focused on whether a fertilizer improved corn yield would likely not have any worthwhile expertise on its effects on worms in the ground or insects in the air afterwards. His/her view would likely be less accurate than that of a non-horticulturalist taking an overall view of the effects of fertilizer. Almost by definition, however, in this case significance testing would not be appropriate.
Similarly, the confidence intervals applied to an astronomer’s measurements of the location of a star are of no use when it comes to pinpointing the trajectories of satellites crossing the path of the observations.
In brief, the case for applying significance testing or confidence intervals in general to the primary endpoint in the trial of a medicine is less convincing than commonly thought but there is no case for applying them to effects other than the primary endpoint.
There is something of an assumption that the primary endpoint in an RCT is the commonest effect of a drug. It is often conceded that other effects may be missed – on the basis that they are rare or may not appear within the duration of the trial. But the commonest effect of an SSRI is genital anesthesia, which appears almost universally and within 30 minutes of taking a first pill. This effect has been missed in all RCTs to date – other than those trials looking at premature ejaculation. This scotoma stems from the focus on a primary endpoint that RCTs require. All sorts of treatment effects can be missed because of this focus.
The focus on a primary endpoint means that we might end up being able to say a lot about a pimple but very little about the person on whose back it is. RCTs can help evaluate one effect of a drug but are a poor method to evaluate a drug’s overall effects. They can play a useful role in tempering the enthusiasm for an evidently obvious course of action such as stenting blocked arteries, but this is an entirely different matter to using the supposed evidence from RCTs to deny evident effects of treatment – such as when a patient becomes suicidal on an antidepressant. Nor do the stent trials provide a basis for dismissing a link between an apparently good outcome in an individual case after stent insertion and the use of a stent in that case. The trials as conducted demonstrate that we do not know what we are doing in general but there may be a subset of patients in which a stent is the correct answer.
RCT evidence should never trump an evident effect that appears after treatment. If a person becomes suicidal after taking an antidepressant, the issue of what is happening in that case is a matter of assessing the effects of their condition, circumstances, prior exposure to similar drugs, dose changes on the medication and whether there are other evident effects of treatment consistent with a link between suicidality and treatment. Unless RCTs have been designed specifically to look at the effects of treatment on a possible emergence of suicidality (and there have been none), RCT evidence is irrelevant.
Confounding by Ignorance
In the case of RCTs and other epidemiological studies, inconvenient results often come with a rider that confounding by indication may make their interpretation difficult. This translates into a caution against assuming a treatment is causing some problem. Because of the rhetoric surrounding RCTs, many people will not be clear why randomization does not take care of all unknown unknowns. Consider the following cases of imipramine and paroxetine.
Imipramine, the first antidepressant, was discovered in 1957 and launched in 1958 without any RCT input. Among other actions, it is a serotonin reuptake inhibitor. In later RCTs it (and other older antidepressants all discovered and marketed without RCTs), “beat” SSRI antidepressants in patients with melancholia (severe depression). Melancholic patients are 80 times more likely to commit suicide than mildly depressed patients.
By 1959, clinicians had noted that, wonderful though imipramine was for many patients, it could cause agitation and suicidality in some that cleared when the drug was stopped and reappeared when restarted – on a Challenge-Dechallenge-Rechallenge (CDR) basis, it could cause suicidality.
In an RCT of imipramine, a drug that can cause suicide in melancholic patients, imipramine seems likely to protect against suicide on average by reducing the risk from melancholia to a greater extent than placebo. SSRIs will not do this. In contrast, in the RCTs that brought SSRIs to the market, these drugs doubled the rate of suicidal acts. This was because SSRIs are weaker than imipramine and had to be tested in people with mild depression, at little risk of suicide. The low placebo suicidal act rate revealed the risk from the SSRI – as it does for imipramine put into trials of mild depression. RCTs can, in other words, badly mislead as regards cause and effect – potentially getting results all the way along a spectrum from definitely causes to possible risk, likely protective and definitely cannot cause.
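The arithmetic behind this spectrum can be sketched with invented rates. Suppose, hypothetically, a drug directly adds a small suicidality risk of its own while halving the risk stemming from the illness it treats; in a high-risk melancholic population the net effect looks protective, while in a mildly depressed population the same causal harm dominates the result:

```python
# Hypothetical annual suicidal-act rates - all numbers invented for
# illustration, not taken from any trial.

drug_direct_risk = 0.005           # risk the drug itself adds
illness_risk_reduction = 0.5       # drug halves the illness-driven risk

# Melancholia: high background risk, which the drug substantially reduces
melancholia_placebo = 0.040
melancholia_drug = melancholia_placebo * illness_risk_reduction + drug_direct_risk
# 0.025 vs 0.040: the drug looks protective despite causing suicidality

# Mild depression: low background risk, so little for the drug to reduce
mild_placebo = 0.0005
mild_drug = mild_placebo * illness_risk_reduction + drug_direct_risk
# 0.00525 vs 0.0005: the same causal harm now multiplies the rate tenfold
```

The same drug, with the same causal effects, can thus appear protective or hazardous depending solely on the background risk of the population trialed.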
In any trial where both condition and treatment cause superficially similar problems, as when antidepressants and depression both cause suicidality or bisphosphonates and osteoporosis both lead to fractures, RCTs will give misleading answers. This is likely the case for a majority of RCTs in clinical conditions. The only trials free of this problem are in healthy volunteers where treatment hazards stand out clearly, although even here the results can be confounded by individual variation. Companies do healthy volunteer trials and don’t publish the data, but what they discover probably shapes their subsequent clinical trials.
Consider also the case of paroxetine, which was put into trials of patients with Major Depressive Disorder (MDD) and patients with Intermittent Brief Depressive Disorders (IBDD). IBDD patients (borderline personality disorder) are repeated self-harmers. Their depressive features, however, mean that IBDD patients can readily meet criteria for MDD.
In April 2006, GlaxoSmithKline (GSK) released RCT data in which MDD patients showed a worrying increase in the risk of suicidal events on paroxetine (Table 1). The data from the IBDD RCTs in the GSK release were somewhat better. These data may not be correct, but correct or not, we can add 16 suicidal events to the paroxetine column and still get the same protective rather than problematic result for paroxetine when the MDD and IBDD data are added together (Table 1).
Table 1: Suicidal Acts in MDD & IBDD Trials
MDD Trials – Acts/Patients
IBDD Trials – Acts/Patients
Relative risk: Inf (95% CI 1.3, inf)
This effect has been noted as a hazard of meta-analyses and is termed Simpson’s paradox, but it has to apply to some extent in every individual trial that recruits patients who have a superficially similar but in fact heterogeneous condition, such as depression, pain, breast cancer, Parkinson’s disease or diabetes – almost all disorders in medicine.
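Simpson's paradox can be made concrete with invented counts. Assume, purely hypothetically, that a drug produces more suicidal acts than placebo in both the MDD trials and the IBDD trials, but that the two arms end up carrying different proportions of the high-risk IBDD patients; pooling then reverses the verdict:

```python
# Hypothetical (events, patients) per arm - all numbers invented.
mdd = {"drug": (2, 1000), "placebo": (1, 1000)}    # lower-risk patients
ibdd = {"drug": (30, 100), "placebo": (40, 200)}   # high-risk self-harmers

def rate(events, n):
    return events / n

# Within each patient group the drug arm has the higher rate
assert rate(*mdd["drug"]) > rate(*mdd["placebo"])     # 0.2% vs 0.1%
assert rate(*ibdd["drug"]) > rate(*ibdd["placebo"])   # 30% vs 20%

# Pooling reverses the verdict, because a smaller share of the drug
# patients came from the high-risk group
pooled_drug = rate(2 + 30, 1000 + 100)        # ~2.9%
pooled_placebo = rate(1 + 40, 1000 + 200)     # ~3.4%
assert pooled_drug < pooled_placebo           # drug now looks safer
```

The reversal needs nothing more than uneven mixing of heterogeneous patient groups across the figures being combined.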
Every time there is a mixture of more than one patient group in the trial, randomization ensures some patients will hide some treatment effects – good and bad. Standard back pains and their treatments, for instance, mask the beneficial treatment effects of an antibiotic on back pains linked to infections (up to 10% of back pains).
As the above example illustrates, it is possible to design an RCT that uses a problem a drug causes to hide that same effect.
This confounder also applies to the use of placebo. The assumption is that placebo only acts in RCTs as a control for natural variation. Placebos can have potent effects in their own right, making them a second drug, putting them in the position of an antibiotic in the back pain scenario just outlined. The difficulty in this case is that we do not know enough about placebo or placebo responders to know what is happening in an individual case.
Every medicine that gets on the market by definition beats placebo. As a result, it has become unethical to use placebos in clinical practice, even though for some patients it is possible to do open heart surgery on placebo and, more generally, it seems a better idea to use a placebo if it is going to work.
Another problem with placebos is that a quantitative approach to data, generated by algorithm rather than based on judgement, increases the risk that minor events in a placebo arm will be offset against significant events in an “active” treatment arm, offering an opportunity for some to claim that nothing specific has happened when it has.
The only way to overcome biases in trials, assuming we want to do so, is in fact to know what we are doing, which includes knowing all about the clinical populations we are treating.
RCTs also introduce confounders through the choice of endpoint. As noted, choosing a primary endpoint such as a response on a mood scale can lead investigators to miss more common effects on sexual functioning. In this case, the confound stems from commercial considerations, but it seems likely to apply regardless.
The bottom line is there can be no assumption that RCTs deliver valid, unconfounded knowledge.
From 1962 to 1990, clinical knowledge of drug effects derived primarily from clinical experience, embodied in case reports and published in clinical journals. The steady rise of mechanical observing, however, allied to a sequestration of trial data, ultimately relegated clinical evaluations that drug X causes problem Y, even when buttressed by evidence of CDR, to the status of anecdotes. From 1990, journals stopped taking these observations.
As a result, where in the 1960s the harms of treatments had taken at most a few years to establish after a drug came on the market, by 1990 a set of processes were in place that meant it now takes decades for the harms of treatments to be established, as for instance in the case of impulse control disorders on dopamine agonists or persistent sexual dysfunction on SSRIs or isotretinoin.
There is a perception that pharmacovigilance is in crisis. Discussions about solutions mention the need for systems to detect the rare effects of treatments not found in RCTs. There has been a turn to mining electronic medical records and other observational approaches.
New signal detection methods and investigative approaches will always be welcome, but these are not the answer to the problems we now face. The main difficulty now lies not in detecting rare effects but in a systematic failure to acknowledge common effects.
There are better ways to detect common effects than new signal detection methods – healthy volunteer studies, for instance. These ordinarily do not have a primary endpoint. The healthy volunteer trials of SSRIs, for example, demonstrated that sexual effects were common, often debilitating, and in some cases enduring. They also demonstrated suicidal effects from SSRIs.
The difficulty in recognizing adverse effects has been compounded by a sequestration of clinical trial data by companies and a ghost writing of the clinical literature that hypes the benefits and hides the harms of treatments, compounded by a regulatory willingness to avoid deterring patients from treatment benefits by placing warnings on drugs.
Clinical practice is also compromised by licensed indications and by guidelines. There are no drugs licensed to treat adverse effects. When a person becomes depressed and suicidal on an SSRI, there is no treatment licensed to treat this toxicity. Many clinicians wanting to help feel compelled to diagnose depression rather than toxicity, but a depression diagnosis inevitably leads to further treatment with an antidepressant rather than something more appropriate like a benzodiazepine, a beta-blocker or red wine.
Similarly, guidelines list treatment benefits but none tackle the management of treatment harms. As a consequence, an increasing number of bereaved relatives, seeking to find out what happened to a family member, face services whose root cause analyses suggest the service did a great job – it only prescribed approved drugs and kept to all guidelines – leaving the cause of death unanswered. This is a profound dumbing down of clinical expertise.
These latter points risk suggesting that our current problems in therapeutics stem from company RCTs rather than from RCTs in their own right. The argument here is that RCTs, even if done by angels, are not a good way to evaluate a drug.
Whether dealing with case reports, RCTs, registry or Electronic Medical Record data, the key question is where objectivity comes from. Science traditionally generates data and challenges us to interpret them. New techniques (like a new drug) can throw up new observations, but while new data can challenge prior judgements, the mission of science has not been to replace judgement by technique. In the case of a pregnancy registry, for instance, provided there is a full dataset to which everyone has access, the question is whether objectivity about the best balance of benefits and harms stems from the combined scrutiny of women of child-bearing years, doctors, pharmacologists, pharmaceutical company personnel and others, rather than from signal detection methods operating on data “detached” from all human traces.
The issue is whether clinical practice is essentially a judicial rather than an algorithmic exercise. The view offered here is that our best evidence as to what has happened in an individual case lies in the ability to examine and cross-examine the person or persons (interrogate the data) to whom an effect has happened.
The supporters of RCTs can respond that of course judgement calls need to be made, just as the creators of the DSM operational criteria claimed. In practice, however, operational exercises like RCTs and DSM criteria call for a suspension of judgement and put a third party like the pharmaceutical industry in a strong position to contest any introduction of judgement by a doctor or a patient. Individual judgement of course can be suspect and one control for this is the judgement of others considering the same data, but if judgement is neglected RCTs become divorced from the real world or, at least, assertions to the contrary aside, no basis for thinking RCTs have a clear footing in reality has been articulated.
Arguments in favor of RCTs over clinical judgement point to a small series of treatments, such as internal mammary artery ligation, that RCTs demonstrated did not work, with the implication that clinical judgement can get things wrong. These arguments fail to note that most of the current treatment classes we have were introduced in the 1950s without RCTs, and that the treatments introduced then – from anti-hypertensives and hypoglycemics to antidepressants and other psychotropic drugs – are more effective than treatments introduced since. RCTs appear to facilitate the introduction of treatments with lesser effects.
More to the point, every day of the week doctors and patients continue or stop treatments on the basis of judgements as to whether the treatment is working or not. These judgements have to be mostly correct or else medicine would not work. While individual doctors or patients may err, it is not clear that individual judgements can be replaced in clinical practice for treatments that legal and regulatory systems have deemed unavoidably risky.
We need to restore confidence in clinical and patient judgements about drug effects. Doubts about the wisdom of placing patient input at the center of clinical evaluations may stem from the domain of later clinical practice when clinicians have to decide whether a drug did in fact cause a problem. An endorsement of clinical judgement does not suit the pharmaceutical industry, for whom the supposed generalizability of RCT knowledge and confidence intervals that can be offered for such knowledge are legally appealing. The judgement that a drug has caused an effect is, however, logically prior to studies such as RCTs in which the issue of whether this drug can cause this effect more generally may be explored.
The Place for Randomized Controlled Trials
RCTs came into use as a result of a regulatory change prompted by Louis Lasagna and Walter Modell – not because of their epistemological validity or any proven superiority to other methods of evaluation (Healy 2012; Healy 2020).
Twenty years after the first RCT, Tony Hill offered the view that RCTs have a place in the study of therapeutic efficacy, but they are only one way of doing so and any belief to the contrary was mad (Hill 1966).
The subsequent history of company trials demonstrates that the knowledge derived from RCTs is far from generalizable. When two positive placebo-controlled trials were put in place as the criterion of entry to the market, it was assumed that a first positive trial would almost always be replicated in subsequent trials. This is clearly not the case.
Louis Lasagna, who had been an enthusiastic advocate for RCTs up to 1962, also changed his view, adding in respect of adverse events that while the common view was that spontaneous reporting was unsophisticated and not scientifically rigorous, this was only the case in the dictionary sense of sophisticated meaning “adulterated” and spontaneous reporting was in fact more worldly-wise, knowing, subtle and intellectually appealing than RCTs (Lasagna 1983).
The idea of randomization as an extra control on clinical bias retains an appeal. There is a place for it, unhooked from primary endpoints and statistical significance, as happens in large pragmatic trials – but here the word pragmatic concedes that we do not really know what we are doing. There is a place for it in other clinical studies that do not involve drugs.
RCTs also have a merit as a gateway to the market – randomization means the trials require fewer patients and can be run quickly. A positive result in company trials may indicate a compound has an “effect” but will rarely show a compound is effective. Trials aimed at establishing effectiveness require hard outcomes and time, and are not a realistic gateway to the market. Demonstration of an effect, as with SSRIs for depression, means it is not correct to say this drug does nothing, and on this basis entry to the market could be permitted. By implication, the launch of a drug licensed on these terms would be the point when more comprehensive clinical evaluations should start, aimed at generating a clinical consensus as to the place of the drug in practice.
As RCTs, the standard through which industry would make gold, proliferated and became mechanical exercises, however, the mantra that RCTs provide gold standard medical evidence took hold. The ignorance of ignorance in claims that the only valid information on medicines comes from RCTs compounds a series of other factors that make RCTs a gold standard way to hide adverse events.
It is appropriate to use RCTs to raise the bar to those who would make money from one effect of a drug given to people at their most vulnerable, but seasoned clinicians, allied to increasingly health-literate patients, are better placed than RCTs to determine cause in the case of the 99 other effects every drug has. A recognition of common sexual or suicidal effects of antidepressants, for instance, needs input from patients able to distinguish condition effects from superficially similar treatment effects.
As a general tool to evaluate the effects of a drug, RCTs should take second place to a group of experienced clinicians whose observations are not constrained by checklists and an investigation tailored to one effect.
If an effect (good or bad) linked to a treatment is contested after its launch, as the example of Hormone Replacement Therapy (HRT) may demonstrate, randomization and an extreme focus on that one effect may help clarify whether there is an effect that needs further exploration, or whether those who wish to make money out of people at their most vulnerable need to seek out sub-groups of patients whose benefit is sufficiently robust to make a reasonable trade-off between the harms of the treatment (which will have to be established by other means) and the harms of their condition.
Drug interventions (therapeutic poisoning) harm; the hope is that some good can also be brought from their use. Similarly, evaluations by RCT harm (generate ignorance), but if used judiciously some good can be brought out of the ignorance they necessarily generate. It is less likely that good will be brought out of ignorance if we rely solely on a data handling formula.
Algorithmic analytic methods can describe data but whether good comes from their use requires the kind of judgement calls that statistical approaches ordinarily make a virtue of side-lining. A recent study looking at 29 ways to analyze a dataset demonstrated that different techniques can lead to a wide variation in results with none able to guarantee what is happening in the real world (Silberzahn, Uhlmann, Martin et al. 2018).
In addition to changes in how RCTs are viewed, we need access to clinical trial data. Medical journals should not take reports of clinical and healthy volunteer studies without full access to the data – or these reports should not be designated as science.
Our hierarchies of evidence need to come clean on whether they regard a ghostwritten article without access to clinical trial data as better than or inferior to a Case Report that embodies dose responsiveness and CDR elements for instance.
At present, clinical data remains undefined as a concept. This paper takes the position that data means the people entered into the study behind any table of figures or behind the outcome of any analytic process applied to those figures. This definition makes case reports with names attached at present the only form of controlled clinical investigation that offer access to the data, the possibility to interrogate the data and, accordingly, an opportunity to ground any conclusions in the real world.
In the case of figures resulting from an analytic process, there is an onus on those deploying the process to clarify how the figures translate into the real world rather than assuming they do.
Evaluating the effects of drugs is among the most important exercises we undertake. When drugs work, they can, like parachutes, save lives.
Given the importance of the task, the notion of a hierarchy of evidence at the top of which are mechanisms that do the deciding for us has a potent allure. Linking these mechanisms to significant names like Gauss and Laplace, responsible for some of the most solid advances in science, adds to the allure.
This paper does not take issue with these achievements, but relegating judgement to the bottom of the evidence hierarchy in medicine brings out more clearly what has happened – we are uncomfortable with clinical judgement. Succumbing to an operational solution, however, is at least as dangerous as trusting clinical judgement, if not more so.
In the case of airplanes, adding parachutes and other interventions that are effective, rather than just have an effect, enhances safety, although recent Boeing plane crashes point to the perils of too great a reliance on automatic decision tools.
The use of RCTs has led many to view drug treatments as comparable in effectiveness to parachutes. As a result of RCTs of drugs, by the age of 50, close to 50% of us are now on five or more drugs. For the past five years, our life expectancies have been falling and admissions to hospital for treatment-induced morbidity are rising, an outcome that contrasts with the added safety of having parachutes and other gadgets in planes (Healy 2020).
While it may simply be that combining five drug gadgets brings risks of interactions that airplane gadgets don’t bring, if the RCTs of medicines essentially produce evidence that it is not correct to say this drug has no possible benefit – it has an effect – it makes some sense to think that being on five or more drugs, of which all we can say of each is that it is not right to say it has no possible benefit, might be a contributing factor to increasing levels of mortality and morbidity.
Whatever one’s view of the epistemological basis of RCTs, the data on life expectancies and treatment linked morbidities call for an evaluation of the role of RCTs in the evaluation of drug treatments.
Healy D. Pharmageddon. California University Press, Berkeley (2012).
Healy D. The Shipwreck of the Singular; Healthcare’s Castaways. Samizdat Press, Toronto (Forthcoming 2020).
Hill AB. Reflections on the Controlled Trial. Ann Rheum Dis 1966; 25: 107-13.
Lasagna L. Discovering adverse drug reactions. JAMA 1983; 249: 2224-5.
Silberzahn R, Uhlmann EL, Martin DP, et al. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Advances in Methods and Practices in Psychological Science 2018; 1: 337-56.
December 3, 2020