David Healy: Do randomized clinical trials add or subtract from clinical knowledge

Charles M. Beasley, Jr’s comment – Part 2

In Part 1 of this Comment, I discussed multiple areas in which I potentially cannot entirely agree with Dr. Healy about his characterization of Randomized Clinical Trials (RCTs) and the extent to which RCTs increase accurate clinical knowledge that improves patient care, compared to expert clinical observation. These are potential disagreements because my interpretations of what Dr. Healy wrote could be inconsistent with what he intended to convey. Additionally, in Part 1, I pointed out several of Dr. Healy’s characterizations of the development of the RCT methodology and its associated statistical methods with which I also potentially disagree, and I summarized the bases for these potential disagreements.

In Part 1, I also agreed with Dr. Healy that RCTs could supply inaccurate information. Inaccurate information would certainly not foster improved patient care and outcomes, and at worst, might lead to patient harm. I introduced the term ‘proper RCT’* to describe an RCT that supplies accurate information generalizable to the complete set of patients who might receive the treatment studied in an RCT.

In this Part 2, I describe my schema for the steps that extend from the formulation of a conceptual hypothesis through the public dissemination of the cognitive interpretation of an RCT’s results. Multiple alternative schemas could accurately describe the steps in an RCT. The steps I include in my schema are influenced by my belief that several steps in the schema are steps in which design, conduct, analyses, interpretation, or dissemination can contribute to an RCT not being ‘proper.’ I use the term ‘bias’ to describe the information disseminated from an RCT that is not accurate. A bias can suggest both false-positive attributes for treatment (e.g., the treatment is effective) and false-negative attributes (e.g., the treatment causes clinically meaningful Adverse Drug Reaction [ADR] “X”). Additionally, a bias can be either unintentionally or intentionally created/reported. I describe how some biases can be created in several steps of the RCT process and ways to mitigate this potential and reduce its probability of occurrence.

This Part 2 is divided into three Sections, identified with the capital letters A, B, and C. Section A contains introductory information. Section B describes my schema, in outline format, with some discussion for several steps and sub-steps. Section C describes the ways biases might be created and the ways that such possibilities might be mitigated. The distinction between describing the steps of an RCT in Section B and proposing possible biases and their remedies in Section C is not always clear-cut. Some of the detailed descriptions of steps in Section B describe alternative processes and methods within those steps. When I describe multiple processes within a step, some of which might be excluded, and alternative processes within a step, I sometimes hint at my beliefs on reducing the potential for biases.

There are only two references that support the statements I make in Part 2 (excluding the reference for Dr. Healy’s Essay on which I am commenting), and one of these is embedded in the text because it is an FDA document without a named author, accessed on the FDA’s website. What I say in this Part 2 has been influenced by my experiences while a 28-year Eli Lilly & Company employee and a consultant to multiple pharmaceutical and biotechnology companies following my Lilly retirement in 2015. As such, what I suggest here might be characterized as speculation, non-empirical, and if a highly pejorative adjective is used, prejudices. In that I offer these thoughts, I am, to some extent, aligning myself with Dr. Healy’s position on the importance of information based on sources other than RCTs. But, in Part 1, I emphasized the need for consideration of information derived from sources other than RCTs and said that some matters of clinical importance could not be addressed with RCTs. My specific concerns with RCTs, however, appear to differ from those of Dr. Healy. Some things I suggest as potential sources of biases and their remedies might be assessed with RCTs. Some such work has been performed, although I do not cite the reporting literature. For one example of such empirical study, the utility of independent, expert, third-party interviewers/scorers of psychiatric rating scales has been empirically compared to this RCT activity performed by investigative site personnel. I believe that most of what I suggest has not been studied empirically and could not be assessed empirically with RCTs.

Part 2 – Framework for the Progressive Components of an RCT from Conceptual Hypothesis Formulation to the Public Presentation of the Interpretation of the Results and Avoidance of Intentional or Unintentional Bias in the Interpretation and Reporting of RCT Results

Section A – A Brief Review of the Initial Components of an RCT, Formulation of a Conceptual Hypothesis (or Hypotheses) through Planning the Inferential Statistical Analyses of the Hypotheses, a Description of the Therapeutic Areas and Types of RCTs that Influenced these Comments, and Other Introductory Information

There are multiple ways that the progressive steps toward public disclosure of an RCT can be sliced-and-diced and characterized for discussion, as noted above. Again, the following schema is purely mine. At least one of my steps is probably idiosyncratic to my schema and not often considered/discussed. I consider it a step worth naming and discussing based on my experience with RCTs that succeeded and RCTs that did not yield support for their conceptual hypotheses. These successes and failures are mainly within a set of specific psychiatric disorders where the conceptual hypotheses were that the drugs being studied improved the signs and symptoms of the disorder that constituted an inclusion criterion for the RCT. To a lesser extent, the schema is also based on RCTs that succeeded and failed when the conceptual hypotheses were that the drugs being studied did not adversely prolong the heart rate corrected QT interval (QTc) to a clinically meaningful degree (Thorough QT [TQT] RCTs). For the TQT RCTs, the primary inclusion criteria were the absence of a prolonged QTc based on an absolute value at baseline.

The formalized primary null hypotheses for the psychiatric disorder RCTs were the absence of improvement in the signs and symptoms of the disorder. The inferential statistical analyses were classical superiority analyses of several types. These inferential analyses included ANOVA or ANCOVA (because multiple doses of the experimental drug were often included along with placebo and positive control) with numerous factors included in the models (e.g., investigative site) and with the baseline score as the usual covariate in ANCOVA analyses. More recently, Mixed-Model, Repeated-Measure (MMRM) ANOVA or ANCOVA with various fixed and random effects (most commonly using unstructured variance-covariance matrices) has been used as an inferential analysis method. A simplified but important summary is that an MMRM analysis adjusts for missing data and estimates visit-wise changes based on the adjustments.

The TQT RCTs were not parallel design RCTs. These RCTs were three-treatment (experimental drug, placebo, positive-control drug) cross-over studies.

The formalized primary null hypotheses for the TQT RCTs were that the experimental treatment was associated with a clinically meaningful increase in the mean maximal QTc interval compared to placebo. The QTc interval was measured multiple times after subjects reached steady-state exposure to the experimental drug. The mean maximal increase in the QTc interval was the maximum difference between the drug-associated QTc interval and the placebo-associated QTc interval measured at the same times as measured for the drug.

The primary inferential analyses for the TQT RCTs were not superiority analyses but were non-inferiority analyses. A coprimary analysis required in the TQT RCTs was a demonstration that the RCT had assay sensitivity. A positive control, known to prolong the QT interval slightly, was included as a third arm in these studies, as noted above. A successful TQT RCT demonstrated that the positive control significantly prolonged the QTc interval and demonstrated that the experimental drug did not prolong the QTc interval to a clinically meaningful degree.
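The two coprimary TQT analyses described above can be illustrated with a minimal sketch. Several elements here are assumptions and are not taken from any specific study: the 10 ms non-inferiority margin, the normal-approximation one-sided 95% bounds, the rule that assay sensitivity requires a lower bound above zero at one or more time points, the function names, and all QTc values (which are invented). Real TQT analyses use model-based estimates and address multiplicity, both of which this sketch ignores.

```python
from math import sqrt
from statistics import mean, stdev

Z_ONE_SIDED_95 = 1.645  # normal approximation (assumption; real analyses use model-based CIs)


def diff_with_bounds(active, placebo):
    """Mean within-subject (cross-over) difference with one-sided 95% bounds."""
    diffs = [a - p for a, p in zip(active, placebo)]
    m = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    return m, m - Z_ONE_SIDED_95 * se, m + Z_ONE_SIDED_95 * se


def tqt_summary(drug, placebo, control, margin_ms=10.0):
    """Each argument maps a time point to per-subject QTc values (ms), same subject order.

    margin_ms is a hypothetical non-inferiority margin, not a value from the text.
    """
    drug_stats = {t: diff_with_bounds(drug[t], placebo[t]) for t in drug}
    ctrl_stats = {t: diff_with_bounds(control[t], placebo[t]) for t in control}
    # Mean maximal increase: the largest time-matched drug-minus-placebo mean difference.
    max_mean_diff = max(s[0] for s in drug_stats.values())
    # Non-inferiority: every upper bound must stay below the margin.
    largest_upper = max(s[2] for s in drug_stats.values())
    # Assay sensitivity (simplified): positive control separates from placebo somewhere.
    assay_sensitive = any(s[1] > 0 for s in ctrl_stats.values())
    return {
        "max_mean_diff_ms": max_mean_diff,
        "largest_upper_bound_ms": largest_upper,
        "non_inferior": largest_upper < margin_ms,
        "assay_sensitive": assay_sensitive,
    }


# Invented data: four subjects, two post-dose time points (hours).
placebo = {1: [400.0, 402.0, 398.0, 400.0], 2: [401.0, 399.0, 400.0, 400.0]}
drug = {1: [402.0, 405.0, 400.0, 403.0], 2: [404.0, 403.0, 405.0, 404.0]}
control = {1: [410.0, 413.0, 409.0, 411.0], 2: [409.0, 408.0, 410.0, 409.0]}

summary = tqt_summary(drug, placebo, control)
print(summary)
```

In this sketch, a study counts as successful only when both `non_inferior` and `assay_sensitive` are true, mirroring the requirement that the positive control prolongs the QTc interval while the experimental drug does not do so to a clinically meaningful degree.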

The brief description of RCTs where the conceptual hypothesis is that a drug improves the signs and symptoms involves a single formal null hypothesis. When I describe the several RCT steps and how they can be improved (i.e., the potential for positive or negative biases reduced), I focus on single conceptual hypotheses. Still, the suggestions apply to RCTs involving multiple comparisons. Furthermore, methods exist to allow statistically unbiased, sequential testing of several hypotheses. An example of such sequential testing is first to test if a drug is non-inferior to placebo for some outcome and then test if the drug is superior to the placebo for that same outcome. The testing of multiple doses of the same test drug compared to placebo or several test drugs compared to placebo and/or each other are other examples of multiple comparisons within a single RCT.
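One common way to carry out the sequential testing mentioned above is a fixed-sequence (hierarchical) procedure, in which hypotheses are tested in a prespecified order, each at the full significance level, and testing stops at the first hypothesis that is not rejected. The sketch below is an illustration of that general technique and not a description of any particular sponsor's method; the function name, the labels, and the p-values are hypothetical.

```python
def fixed_sequence_test(ordered_tests, alpha=0.05):
    """Apply a fixed-sequence procedure.

    ordered_tests: list of (label, p_value) pairs in the prespecified order.
    Returns a list of (label, rejected) pairs. Once one hypothesis fails to be
    rejected, no later hypothesis can be rejected, whatever its p-value.
    """
    results = []
    gate_open = True
    for label, p_value in ordered_tests:
        rejected = gate_open and p_value < alpha
        results.append((label, rejected))
        if not rejected:
            gate_open = False
    return results


# Hypothetical p-values: non-inferiority is tested first, then superiority.
outcome = fixed_sequence_test(
    [("non-inferiority", 0.01), ("superiority", 0.20), ("secondary endpoint", 0.001)]
)
print(outcome)
```

Because each test is performed only if all earlier tests succeeded, the overall type I error rate is controlled at `alpha` without splitting it among the hypotheses. The price is that a small p-value late in the sequence (the third entry here) cannot be claimed once an earlier test has failed.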

Section B – My Schema for Describing the Progressive Components of an RCT

The following outline is my schema describing the progressive steps in an RCT that lead to the dissemination of the cognitive interpretation of its results. This schema is selective and could provide even greater detail with more top-level components and more sub-components. The average length of protocols I worked on in 2015 had grown to 125 pages of single-spaced full text.

I am not reproducing that level of detail with an explanation of the rationale for the contents. The level of detail that is provided is geared to outlining the basics and to where I can comment on avoidance of intentionally or unintentionally creating positive or negative biases:

1. Formulation of the conceptual primary hypothesis: for example, experimental/test drug X improves the signs and symptoms of schizophrenia.

2. Development of the formal statistical primary null hypothesis.

2.1. Selection of the endpoint(s) to be studied for the primary null hypothesis: for example, mean change in an assessment instrument score.

2.1.1. Multiple endpoints might be studied/required for the primary null hypothesis: for example, in addition to mean change in signs/symptoms, the proportion of subjects meeting a set of criteria defining a “responder” to treatment (e.g., some percentage, such as 20%, improvement in the signs and symptoms of schizophrenia or the “Kane Criteria” for a patient with schizophrenia being a “responder” to treatment). A minimum effect size might be prospectively required to interpret the study results as supporting the primary conceptual hypothesis that the experimental treatment improves the signs and symptoms of the disorder under study. If such a requirement were prospectively included as a requirement for interpreting the RCT’s results as supporting efficacy, this requirement would not be embodied in a formal null hypothesis. To my knowledge, neither a proportion of “responders” nor a minimum effect size has ever been a condition of approval with the Psychiatric Drug Products Review Division within the Center for Drug Evaluation and Research (CDER) branch of FDA. I am not intimately familiar with efficacy criteria employed by other Review Divisions within CDER. However, I believe they are commonly numerically continuous variables, similar to a mean change in some sign and symptom rating scale in psychiatry. For example, mean survival time might be considered in oncology trials, and the total number of seizures might be considered in anticonvulsant trials. I am much less familiar with the efficacy approval requirements within the CBER (Center for Biologics Evaluation and Research) that reviews and approves vaccines and other large-molecule treatments in other therapeutic areas. As a critical outcome for a vaccine regarding efficacy is the reduction in cases among those receiving the active vaccine compared to those receiving placebo, this dichotomous/binary outcome is somewhat analogous to the concept of “responder” in RCTs within psychiatric disorders.

2.2. Selection of the assessment instrument(s) for the endpoint(s) evaluating the primary null hypothesis: for example, the PANSS-Total score for an RCT in schizophrenia.

3. (Optional – Development of one or more secondary conceptual hypotheses, their formalized statistical null hypotheses, as well as the selection of their endpoints and assessment instruments).

4. (Optional – Development of one or more exploratory conceptual hypotheses, their formalized statistical null hypotheses, as well as the selection of their endpoints and assessment instruments).

5. Development of two formal documents, the first defines most, but not necessarily all, the operational elements of the conduct of the RCT, and the second defines in detail the descriptive and inferential analyses for the study.

5.1. Formally define most of the operational elements of the conduct of the study – the Protocol.

5.1.1. Treatment choice, including the dose(s).

5.1.1.1. One or more doses of the experimental drug.

5.1.1.1.1. If one dose, fixed or variable within a dose range.

5.1.1.1.2. If multiple doses, each fixed or each variable with non-overlapping ranges.

5.1.1.1.3. Active control possible, usually a single dose.

5.1.1.1.3.1. If the primary intent of the RCT is to compare the experimental drug to the active control, a placebo is often omitted. If a placebo is omitted, the within-study efficacy cannot be decided for either treatment.

5.1.1.1.3.2. When a placebo is also included, the purpose of an active control is usually to prove assay sensitivity. If the test drug and the active control both do not separate from placebo, it might be possible to conclude that the RCT lacked assay sensitivity and that the RCT was a “failed” study rather than a “negative” study (i.e., a study possibly suggesting that the test drug lacks efficacy).

5.1.1.1.3.2.1. The risk to a commercial sponsor of inclusion of an active control is an outcome where the active control shows superiority to placebo, but the test drug does not. This outcome is even more suggestive of at least a relative lack of efficacy for the test drug than simply lack of separation from placebo in a test drug – placebo only study.

5.1.2. Inclusion criteria.

5.1.2.1. The diagnosis of interest.

5.1.2.2. Generally, some minimum severity criteria.

5.1.2.3. Possibly, some maximum severity criteria.

5.1.2.4. If the RCT is intended to assess efficacy in a subset of patients with inadequate response to other treatment or resistant to other treatments (“treatment-resistant” patients), an operational definition of this subset of patients. Inadequate response versus treatment resistance is an evolving regulatory concept, with the latter group showing less improvement to more treatments within an index episode of active symptoms than the former group.

5.1.3. Exclusion criteria.

5.1.3.1. Non-psychiatric medical disorders: I believe that there is a general belief that RCTs sponsored by industry are unrealistic and exclude “real-world” patients based on medical comorbidities. I cannot entirely agree with this belief, although I offer a few caveats. In Phase 2 studies, it is appropriate to exclude medical comorbidities that can worsen and suggest an ADR. However, with reasonable drug exposure in such studies that at least hint at what are likely to be ADRs, I believe most sponsors liberalize the comorbidities that are not exclusionary (fewer disorders are kept as exclusionary). Although the following might be extreme, I have written Phase 3 protocols where the only medical disorder exclusionary criterion was the presence of a disorder of such gravity that the patient stood a reasonable probability of dying before completing the RCT. There is some marketing-regulatory pressure to avoid medical disorder exclusionary criteria. If a sponsor excludes patients, those reasons for exclusion might find their way into regulatory labeling as contraindications, and a potential market has been lost. However, where there is reason to believe the experimental drug would pose an excess risk in a medical disorder, those disorders are appropriately included as exclusionary criteria.

5.1.3.2. Psychiatric medical disorders: These are more complicated than non-psychiatric disorders regarding exclusion versus non-exclusion. The issue is the signs and symptoms of one disorder confounding the assessment of the signs and symptoms of the disorder under study. A common issue is the amount of substance use disorder which should be exclusionary. Although somewhat dependent on the drug class or specific drug, a mild “X” use disorder is often not an exclusionary criterion. Inclusion or exclusion of a potential subject is sometimes decided on a case-by-case basis in consultation with the potential subject and the investigator’s perception of the potential subject’s willingness to abstain or limit their substance use during the RCT. It is highly desirable to have honest and collaborative subjects.

5.1.3.3. Non-psychiatric, medical disorder treatments/drugs: Drugs known to have significant pharmacokinetic or pharmacodynamic interactions are appropriately exclusionary, and relevant language will appear in product labeling as a contraindication, warning, or precaution.

5.1.3.4. Psychiatric treatments/drugs: Given that through Phase 3 studies, efficacy is a primary interest, other treatments that could affect efficacy assessment, for better or worse, must be excluded. A delicate balance must be maintained between treatments for secondary symptoms of the disorder under study (e.g., insomnia with some depressive episodes) and either blunting or inflating the apparent efficacy effect of the drug being studied. The same applies, primarily to blunting, an ADR (e.g., allowing excessively liberal use of anticholinergic agents, especially prophylactically, for dystonic or Parkinsonian symptoms with dopamine antagonists).

5.1.3.5. Suicidality: The late Dr. Alexander Glassman, a mentor for me after I began my work in the pharmaceutical industry, suggested recommendations I adopted. He strongly suggested the need for operational exclusionary criteria on which most psychiatrists acting as investigators for RCTs of experimental treatments for psychiatric disorders would easily agree for individual potential subjects. Such operational criteria would contrast with vague exclusionary criterion statements such as: “Considered to be at risk of suicide.” For inpatient studies, he suggested the criterion: “Of sufficient risk of suicide that ECT is considered the treatment of choice.” For outpatient studies, this was changed to: “Of sufficient risk of suicide, that inpatient status is considered the appropriate treatment environment.” While different clinicians might assess the same patient differently regarding the need for ECT or hospitalization, these criteria are less vague than the stated alternatives. These criteria were generally adopted in psychiatric RCTs for which I had later responsibility. These criteria have often been modified with the development and widespread use of the Columbia Suicide Severity Rating Scale (C-SSRS). Suicidal ideation of severity level 4 (Active suicidal ideation with some intent to act, without specific plan) or level 5 (Active suicidal ideation with specific plan and intent) is virtually always exclusionary. In some cases, level 3 (Active suicidal ideation with any methods [not plan] without intent to act) is exclusionary, at least for outpatient RCTs. My personal belief is that under certain conditions, excluding a patient with level 3 ideation from an outpatient study is excessively exclusionary and limits the generalizability of an RCT to the relevant population. 
If such a patient has adequate support resources, has access to the treating investigator, and lacks clinical factors that might increase risk (e.g., a history of suicidal acts, hopelessness/helplessness, agitation, a severe personality disorder, easy access to means of suicide, comorbid active substance abuse or a history of such abuse), this exclusion is excessively restrictive.

5.1.4. Administration of the efficacy assessment interviews and the scoring of the assessment instruments: There are two broad categories for how these data are collected/scored.

5.1.4.1. By site staff.

5.1.4.2. By a neutral third party: Within administration and scoring by a third party, there are also two alternatives.

5.1.4.2.1. An expert live rater conducts the interview and scores the assessment instrument based on an audio-visual interview, generally by a live broadcast technology.

5.1.4.2.2. The subject interacts with voice prompt questions or responds to questions posed on a computer terminal, laptop, tablet, or similar device.

5.1.5. Collection of data relevant to safety assessment, including adverse events (AEs), concomitant medications, any scale data systematically assessing safety matters, vital signs, weight, and the results of laboratory analytes assays from venous blood samples. The objective data, less prone to collection/measurement error (e.g., weight), call for little discussion. AEs that are subjective experiences in many cases are discussed when I discuss potential biases and their prevention. Some AEs, such as sexual dysfunction events, which subjects can be very hesitant to discuss, result in a need to discuss ways to minimize under-reporting of AEs. AEs can be collected through a clinical interview with general, open-ended inquiries. Alternatively, the clinician can collect data through an interview but inquire about the occurrence of specific events contained in a list, with or without levels of severity and associated criteria. Finally, the subject can complete an electronic or paper-and-pencil questionnaire about a specific set of events.

5.2. Formally define the analyses – the Statistical Analysis Plan (SAP).

5.2.1. This document is often not completed until some point during the conduct of the study. However, for studies that serve regulatory purposes, the SAP is completed before the statisticians and statistical analysts conducting the actual analyses are granted access to the unblinded datasets. The method for the inferential analysis of the primary null hypothesis is laid out in great and explicit detail. Usually, the inferential analyses for the secondary null hypotheses are also described in exact detail. It is often the case that inferential analysis methods for any exploratory analyses, even if briefly described in the Protocol, are not detailed. Some exploratory analyses may be developed following an initial unblinded data review and the conduct of the initial, planned analyses. These are, of course, post-hoc analyses but are conducted for very appropriate reasons. The sample sizes for each treatment arm are driven by expectations for the efficacy of the test drug compared to the control treatment. The past relative performance of recently approved drugs treating the same disorder as studied in the RCT is often used as the basis for estimating this potential relative efficacy. Some RCT sponsors might adopt the principle that while larger sample sizes are generally associated with greater variance in the change in the primary efficacy measure, this potentially increased variance is almost always offset by a greater probability of showing a significant effect with larger sample sizes. For any absolute, fixed difference between two groups being compared, the greater the sample sizes within those two groups, the more likely the difference will be found to be significant, assuming only modest increases in variance within those groups. Sponsors’ sample sizes generally do not receive substantial critical attention from regulatory authorities or journal reviewers. 
Because no minimum effect size is required for an RCT to support regulatory approval for a claim of efficacy, the bases for the sample size determinations might be adjusted within the sponsor’s rationale description to result in arbitrarily large sample sizes. For efficacy, sample sizes are usually set for an RCT with 80-90% power. The sample size estimations are based on a mean change and standard deviation for the change within the comparative treatment groups. If I want to use large sample sizes, I decrease the mean improvement in the experimental treated group, increase this value in the placebo-treated group and increase the standard deviations in both groups. I need to provide very little, if any, concrete justification for those values. Whereas with risperidone and olanzapine developed in the late 1980s and early 1990s, respectively, sample sizes on the order of 50 drug- and 50 placebo-treated patients were sufficient to demonstrate conventionally statistically significant efficacy, the dopamine antagonist antipsychotic most recently approved by the FDA (transdermal formulation of asenapine) for schizophrenia included 204 subjects in each of two active-drug arms and 206 subjects in the placebo-treatment arm in the one necessary pivotal RCT (results of the study were such that the study was overpowered, but this was effectively the sponsor’s choice [CDER 2018]). The results of some other recent antipsychotic trials, based on FDA review packages available on FDA’s website similar to that fully referenced for transdermal asenapine, have suggested that sample sizes above 150-175 per active-treatment arm and per placebo-treatment arm might be needed to have a reasonable chance of demonstrating significant efficacy for a dopamine antagonist in the treatment of schizophrenia. Sponsors diverge widely on whether to apply inferential statistical methods to safety data. 
For sponsors who do not perform inferential analyses of safety data results, the rationale is that sample sizes for the RCTs’ datasets and the combined dataset were not based on the study of any given AE or specific safety parameter in Phase 2 and 3 studies intended to evaluate efficacy. The SAP might specify the nature of non-inferential, descriptive only analyses (i.e., tables, figures, listings) used to describe the safety data collected in a study if these data are not analyzed with inferential methods. Sample sizes and their rationale and development are discussed either in the Protocol or the SAP (sometimes both). For virtually all numerical safety data that are continuous (e.g., systolic blood pressure) or ordinal (e.g., urine glucose assayed by a dipstick), reference limits exist that define an observed value as within limits or outside limits. A value for an analyte within limits is commonly but incorrectly (for most analytes) referred to as “normal.” This reference to “normal” is not correct because these limits are statistical derivations and not based on a clinical assessment of the health status of the subjects providing the set of values from which the limits were derived (this holds for most analytes; for some analytes, clinical data are the bases for the derivation of limits, and for those analytes the “normal” description is more appropriate/correct). Values can, of course, be observed below the lower limit or above the upper limit, where upper and lower limits have been derived. For some analytes, primarily those with only ordinal values, only a single limit is derived, and then the observed values can best be described as below or above the limit. For those analytes for which limits exist, the analyte results can be described/analyzed on a central tendency change basis and/or a categorical change basis (i.e., a change in status for a baseline value relative to the analyte’s reference limits to a different status at one or more on-treatment measurements).

6. Study conduct processes that are often not specified in the Protocol: Several critical components that potentially influence the outcome of an RCT are not described in the RCT’s Protocol, are only partially described, or are only briefly alluded to in the Protocol. I consider three of these especially important in psychiatric RCTs. All three relate to entering into the RCT only those subjects meeting the most important criterion of all the inclusion criteria – suffering from the disorder for which the RCT is studying an experimental treatment. This potential difficulty in conducting a psychiatric RCT is distinct from the validity and reliability of the rating of the signs and symptoms of the disorder that can also affect an RCT’s outcome. Despite the reliability of the DSM “X” criteria, even with the administration of a Structured Clinical Interview for DSM-5 Research Version (SCID-5-RV), investigators can be “inaccurate,” and potential subjects can be “inaccurate.” Several factors can facilitate or discourage such “inaccuracies.”

6.1. The RCT’s sponsor’s direct or indirect emphasis on speed of study completion: I only have familiarity with the interest in the speed of an RCT’s completion for RCTs sponsored by pharmaceutical industry companies. There is a universal interest in expeditious study completion. Depending on the therapeutic area, for every year of delay in completing a set of studies or even a single study, the magnitude of the potential loss of revenue to the sponsoring company can amount to multiple billions to tens of billions of dollars. There are many ways a sponsor can signal to investigators that the study’s completion is the primary interest of the sponsor. Investigators can have their contracts designed such that if they do not meet randomization targets, they can be dropped as investigators. Enrollment can be competitive, and, therefore, investigators who enroll rapidly earn greater compensation. Frequent calls to an investigator from senior medical management within the sponsoring company discussing the investigator’s poor rate of randomization performance in the context of discussing using that investigator in future studies send a clear message. With greater and greater use of Contract Research Organizations (CROs) to conduct RCTs, even by large pharma companies, such pressures can come from CROs, known or unbeknownst to the actual sponsor. An RCT’s Protocol might include an expected start and completion date but does not detail methods designed to enhance the randomization rate.

6.2. “Professional patients”: A relatively new potential complexity in the objective conduct of an RCT for a psychiatric diagnosis is the evolution of the “professional patient.” “Professional patients” do not have the disease or disorder under study. This entity has arisen as commercial sponsors have begun paying modest sums of money for subjects’ participation in Phase 2 and 3 RCTs (something not done during most of my time as a full-time employee in the industry). Professional patients often move from metropolitan area to metropolitan area and sometimes enroll in the same RCT at multiple sites within a metropolitan area. Some readers might view my suggestion of such a population with some suspicion. A Google search on the term “professional patient” will find manuscripts discussing the matter and, more importantly, companies specializing in identifying such patients for corporate sponsors before they are randomized into an RCT. If multiple competing companies make profits in such an exercise, it is a real problem that can lead to false RCT results. “Professional patients” are skilled at manifesting the signs and complaining of the symptoms of psychiatric disorders included in industry-sponsored RCTs – effectively reasonable experts in the contents of Edition X of DSM. During the RCT, these subjects can choose to appear to be improving or not improving. If a sponsor chooses to employ a business entity to identify “professional patients” and prevent their randomization, details of this procedure might or might not be included in the RCT’s Protocol.

6.3. Diagnosis confirmation: Some business entities offer a range of services, from conducting medical record reviews up through reviewing the primary diagnostic interview conducted by a study site investigator (recorded in audio or audio-visual format) or conducting an independent diagnostic interview by an audio-visual link. If such services are employed in an RCT, they are described in the Protocol.

7. Perform the prospective analyses described in the SAP and any analyses for any exploratory hypotheses.

8. Review the results of the analyses from #7.

9. Design and perform any post-hoc analyses.

10. Cognitive interpretation of the comprehensive set of results.

11. Public dissemination of the cognitive interpretation of the results.

11.1. Regulatory report.

11.1.1. Individual Clinical Study Report (CSR).

11.1.2. Meta-analysis of the comprehensive development program (Integrated Summary of Efficacy [ISE] and Integrated Summary of Safety [ISS]).

11.2. Academic manuscript.

Section C – Potential Sources of Bias associated with Several of the Progressive Steps in the Evolution of an RCT and Actions that Might Reduce the Probabilities of these Potential Bias Sources from Influencing an RCT

The discussion of potential sources of bias and actions that might prevent such biases is organized based on the schema outlined above in Section B. Not all steps in the evolution of an RCT have associated biases that I discuss. My list of potential biases is not exhaustive, and I am virtually sure that other authors could describe other potential sources of bias. My thoughts and suggestions are numbered as they are in Section B above. I do not have thoughts or suggestions for every RCT component named in Section B; I offer none for 1-4, 7, and 11 [11.1, 11.1.1, 11.1.2].

5. The Protocol and SAP.

5.1.4. Administration and scoring of the efficacy instruments: These comments apply to all efficacy assessments, primary and secondary, and to safety assessments that depend on scales, with or without formal interviews. Site staff can, intentionally or unintentionally and for several reasons, record scores that would differ from those of a consensus of expert interviewers and raters/scorers. I favor data collection by an expert third party through a live interview with audio and visual interaction when most of what is being scored is the subjects’ subjective experience. While I have concerns about the quality of the rapport between the interviewer and the subject, mainly when the treatment being studied is for a psychotic disorder, I believe the objectivity of this method outweighs my concerns. Although I engaged in developing an automated telephone interview and scoring system for a psychiatric condition-related scale, I do not favor using such methods. In my work for one client that used such a data collection device, it was discovered that some subjects reported in retrospect that they had not understood the questions and that their recorded responses were inconsistent with what their responses would have been if they had correctly understood the questions. However, the initial scoring of this instrument could not be changed, which was a critical, harmful matter for my client. A human audio-visual interviewer can likely ascertain when a subject might be confused by a question and provide clarification, even if the interview that results in a score is intended to be highly structured.

5.1.5. Collection of safety (and efficacy) data, some of which might be highly personal and about which subjects might be hesitant or embarrassed to speak: Within most protocols with which I am familiar, across multiple sponsors, AEs are collected through open-ended, non-specific questions. Responses to those questions can lead to more specific questions to obtain more specific information about the nature of an event. I favor this method of AE collection as it is consistent with the standard method for taking a medical history. Hopefully, for most AEs, the method is sufficiently sensitive and does not miss events. Several instruments exist for the systematic collection of specific AEs, for example, the UKU Side Effect Rating Scale (Lingjaerde, Ahlfors, Bech, et al. 1987). This scale is intended primarily for use with dopamine antagonist antipsychotics. Exclusive use of such an instrument runs the risk of missing events not included in the instrument. If such an instrument is used, it should be used along with the open-ended question collection method. The incidences of events obtained from the two methods should be presented separately in the public data disclosures. In a conversation with Dr. Craig Nelson, who worked extensively with desipramine while at Yale, he told me that the use of a systematic instrument for AE collection, with which he had experience in his desipramine studies, would result in an increased number of events being collected during treatment but also at baseline before treatment began. The net result of this phenomenon was generally fewer events being found to be treatment-emergent. Therefore, the incidence of AEs that were possibly ADRs was reduced when using a systematic collection instrument. For highly personal, potentially embarrassing safety events, such as the various aspects of sexual dysfunction mentioned by Dr. Healy, using a systematic self-report is advisable. The question is, should such instruments be included in the studies of every new investigational drug with central nervous system activity before the drug’s potential approval? My answer to that question is yes if the drug might be used for an extended time (e.g., more than several weeks). If you don’t look, you won’t know.
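The effect Dr. Nelson described can be illustrated with a small sketch. The Python below is illustrative only: the data are invented, the function name `treatment_emergent` is hypothetical, and the rule it applies (an event is treatment-emergent when its worst on-treatment severity exceeds its baseline severity) is one common convention assumed here, not a definition taken from any particular protocol.

```python
def treatment_emergent(baseline, on_treatment):
    """Return the (subject, event) pairs judged treatment-emergent.

    Both arguments map (subject_id, event_term) -> worst severity recorded
    (a missing key means the event was not recorded). Assumed rule: an
    event is treatment-emergent if its on-treatment severity exceeds its
    baseline severity.
    """
    return {
        key
        for key, severity in on_treatment.items()
        if severity > baseline.get(key, 0)
    }


# Invented data for the same four subjects under two collection methods.
# Open-ended questioning: nothing volunteered at baseline, three events later.
open_baseline = {}
open_on_treatment = {
    ("S1", "dry mouth"): 1,
    ("S2", "dry mouth"): 1,
    ("S3", "headache"): 2,
}

# Systematic checklist: more events recorded at baseline AND on treatment.
checklist_baseline = {
    ("S1", "dry mouth"): 1,
    ("S3", "headache"): 2,
    ("S4", "insomnia"): 1,
}
checklist_on_treatment = {
    ("S1", "dry mouth"): 1,
    ("S2", "dry mouth"): 1,
    ("S3", "headache"): 2,
    ("S4", "insomnia"): 1,
}

te_open = treatment_emergent(open_baseline, open_on_treatment)
te_checklist = treatment_emergent(checklist_baseline, checklist_on_treatment)

# Events collected on treatment, then events counted as treatment-emergent.
print(len(open_on_treatment), len(te_open))
print(len(checklist_on_treatment), len(te_checklist))
```

In this invented example, the checklist records four on-treatment events rather than three, yet only one of them (the dry mouth reported by subject S2) is treatment-emergent, compared with three under open-ended collection. Presenting the two methods separately, as suggested above, keeps this difference visible.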

5.2.1. The SAP: My overarching suggestion is that this document must be developed through a close collaboration between a statistician who has the greatest possible knowledge of the clinical matters for which she or he is developing inferential analyses and a clinician with the greatest possible knowledge of inferential statistical methods. I consulted on a numerical safety parameter analyzed multiple times in a rapidly expanding, extensive set of RCTs. The results of each analysis were submitted to a regulatory authority. That authority raised no questions about the findings reported until the reporting of the last results. The authority then requested an explanation for why the reported results were divergent across the received reports (the reported results were previously divergent, but this was the first time divergence was noted by the authority). While the data analyzed differed across the expanding datasets, a substantial contributor to the divergence in the reported results was the use of different complex inferential models across the analyses. When a consistent model was applied across the datasets, the results were reasonably concordant, allowing for the expected, slight differences in raw results.

6. Study conduct processes that might not be explicitly described in the Protocol.

6.1. Emphasis on the speed of study completion versus emphasis on proper subject selection: Multiple experiences suggest to me that an emphasis on speed of study completion is a recipe for a negative or failed RCT, even in studies with subjects suffering from schizophrenia.

6.2. Employing services to identify and exclude the participation of “professional patients” before they are assigned to a study treatment: I advocate using such a service for studies in psychiatric disorders. Once a subject receives a dose of treatment and has data collected following one or more doses, those data must be used in the primary and secondary efficacy analyses and all safety analyses, consistent with the intent-to-treat analysis principle required by regulators.

6.3. Diagnosis confirmation: I advocate using a service to verify the diagnosis, with at the very least a review of all the medical records for the potential subject. If the potential subject does not have medical records at the investigative site or is unwilling to make such records available to the site, these are red flags that warrant closer scrutiny of the diagnosis. With diagnoses other than schizophrenia or bipolar disorder, manic episode, a diagnostic interview by an objective third-party expert should be considered on a case-by-case basis. This decision can be influenced by the sponsor’s knowledge of the investigator.

8, 9, 10. Review the results of the prospectively planned analyses, plan and conduct any post-hoc analyses, and cognitively interpret the results in preparation for public dissemination: There can be very legitimate reasons for planning and conducting post-hoc analyses based on unexpected results in the a-priori, planned analyses. Ideally, these would be described, along with their rationale, in an SAP supplement/addendum. I do not have recent experience with sponsors’ conduct of such analyses and cannot comment on how these might be explained and documented. Any party with access to the RCT’s SAP can easily recognize post-hoc analyses because they are not described in the SAP. I have only a single example of the importance and logic of cognitive interpretation. Assume that yet another D₂-receptor antagonist (perhaps with several other acute pharmacological actions such as D₄-receptor antagonism and 5-HT_a&b antagonism) is studied in the treatment of an acute exacerbation of schizophrenia. The RCT does not show statistically significant improvement in signs and symptoms compared to placebo. I would interpret such an outcome with such a drug as yet one more example of a poorly conducted RCT, with the remote possibility that some unknown pharmacological property of the drug adversely impacted its efficacy. To believe that a D₂-receptor antagonist does not lead to improvement in schizophrenia flies in the face of what might be the best-established mechanism of action for any psychiatric treatment. This interpretation would be meaningless from a regulatory approval perspective but could receive consideration in the discussion section of an academic manuscript. Post-hoc assessment of the study’s conduct (not necessarily post-hoc inferential analyses) might suggest potential reasons for the failed or negative RCT. If it were a novel pharmacological mechanism being studied for the treatment of schizophrenia, I would be skeptical of both the mechanism and the RCT’s conduct and would withhold judgment on which was more likely responsible for the RCT’s observed results.

11.2. Public dissemination of the RCT: I have four suggestions for academic manuscripts when the RCTs that result in the manuscripts have followed the suggestions above. First, journal editors should insist that the a-priori primary efficacy analysis is the primary efficacy analysis presented in the manuscript. Journal editors should require the receipt of the SAP for the RCT for which the journal is considering the publication of the RCT’s manuscript. It might be necessary for the journal to obtain the services of a statistical reviewer to determine the consistency of the analyses presented in a draft manuscript with those described in the SAP. This requirement would undoubtedly increase the burden on journals but constitutes a major safeguard against intentional bias. My following two suggestions are related, and while not protecting against a potential bias, they would increase the utility of an RCT’s results to other study sponsors. Some manuscripts only report efficacy or other numerical results that have been adjusted by the analytical model used for analyses. Additionally, sometimes results are reported for only a subset of subjects with baseline data, who took study treatment, and had data while taking study treatment (e.g., only study completers). It would be most helpful if all data, at least for the primary efficacy parameter, were reported without any model adjustment for all subjects who had baseline data, took at least one dose of study treatment, and had data while taking study treatment. For each treatment group, the number of subjects, the baseline mean and standard deviation, and the mean change from baseline and its standard deviation should be reported along with any model-adjusted results. These raw change data could then be consistently used across studies in a specific disorder to estimate sample size requirements in future studies. 
Online supplementary material made available by many journals could supply these results (without affecting any manuscript length constraints) when they are not the product of the primary, a-priori analysis. I also suggest that for both efficacy and safety parameters, results for both central tendency changes and categorical changes (e.g., for efficacy, proportions of responders/remitters; and for safety, the proportions of subjects with numerical safety values outside ranges such that the results are considered of potential clinical significance) be presented. As a broad, general principle (but with some exceptions), central tendency changes are more sensitive to suggesting minor but real differences between treatment groups than are categorical changes. However, categorical changes are essential for understanding the possible clinical significance of positive and negative changes that might be appreciated by analyzing the central tendency changes or the categorical changes themselves. There are multiple points in the study where central tendency and categorical changes are evaluated and compared. These points include each subject’s last visit (endpoint), each subject’s highest value, each subject’s lowest value, each subject’s value at each visit where the data item was scheduled for collection, and each subject’s value at all collections whether scheduled or not. Which points are analyzed is an important consideration in determining the completeness and objectivity of the data presented. In addition, consideration of the presentation of last-observation-carried-forward data as one approach to dealing objectively with missing data, other approaches to dealing with missing data such as MMRM analyses and multiple other methods, and analyses of data for only those subjects completing an RCT, are important considerations in providing an accurate description of the results of an RCT. 
Therefore, there are many alternative but complementary analyses based on the subjects included and excluded and the nature and time in the study for values that should be carefully considered before deciding the final set of analyses. The analyses performed among these multiple alternatives are an important consideration in optimizing objectivity and reducing the probability of bias in presenting an RCT’s results. I am only alluding to these considerations because a reasonably complete discussion would extend to several hundred pages. There are obvious limits on what can be included in an academic manuscript, even with supplemental online material available as a method of presentation. Based on the considerations briefly outlined above, the optimal set of analyses is likely to differ across RCTs.
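The unadjusted summary suggested above, and its use in planning a future study, can be sketched in a few lines. The following Python is a minimal illustration under stated assumptions, not a recommended analysis: the subject data are invented, the function names (`locf_endpoint`, `summarize`, `n_per_group`) are hypothetical, the endpoint is taken by last observation carried forward, a responder is assumed to be a subject with at least a 50% reduction from baseline, and the sample size uses the simple normal approximation for a two-sided comparison of two means.

```python
import math
from statistics import NormalDist, mean, stdev


def locf_endpoint(visits):
    """Last non-missing post-baseline value (None if there is none)."""
    observed = [v for v in visits if v is not None]
    return observed[-1] if observed else None


def summarize(subjects, responder_fraction=0.5):
    """Unadjusted summary for one treatment group.

    subjects: list of dicts with 'baseline' and 'visits' (post-baseline
    values in visit order, None = missing). Subjects with no post-baseline
    data are excluded. Lower scores are assumed to mean improvement.
    """
    rows = [(s["baseline"], locf_endpoint(s["visits"])) for s in subjects]
    rows = [(b, e) for b, e in rows if e is not None]
    baselines = [b for b, _ in rows]
    changes = [e - b for b, e in rows]
    responders = sum(
        1 for b, e in rows if b > 0 and (b - e) >= responder_fraction * b
    )
    return {
        "n": len(rows),
        "baseline_mean": mean(baselines),
        "baseline_sd": stdev(baselines),
        "change_mean": mean(changes),
        "change_sd": stdev(changes),
        "responder_rate": responders / len(rows),
    }


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation subjects per arm to detect a mean difference
    `delta` given a common standard deviation `sd` (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


# Invented subjects for one treatment group.
drug_group = [
    {"baseline": 40, "visits": [30, 20, None]},      # dropped out after visit 2
    {"baseline": 36, "visits": [34, None, None]},    # dropped out after visit 1
    {"baseline": 44, "visits": [40, 30, 22]},        # completer
    {"baseline": 38, "visits": [None, None, None]},  # no post-baseline data
]

summary = summarize(drug_group)
print(summary)

# Planning example: assumed true difference of 5 points, assumed SD of 10.
print(n_per_group(delta=5, sd=10))
```

With an assumed standard deviation of 10 and an assumed difference of 5 points, the approximation gives 63 subjects per group. The point of the sketch is that the inputs to such a calculation (N, baseline mean and standard deviation, mean change and its standard deviation) are exactly the raw quantities that model-adjusted reports often omit. The same raw data also yield the categorical (responder) result alongside the central tendency result.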

Two organizations have made extensive efforts and advancements to improve the transparency and the quality and “accuracy” of reporting the data from RCTs and their results. The first organization is the Clinical Data Interchange Standards Consortium (CDISC) (https://CDISC.org). This organization has developed an extensive set of standards, in microscopic detail, for how raw and derived data in RCTs should be organized electronically. These standards are mandatory for studies submitted to the FDA. These standards allow regulatory authorities to easily check and confirm results and summaries of results included in regulatory submissions. In addition, regulators can efficiently perform additional analyses of submitted data.

CDISC has also produced Therapeutic Area User Guides (TAUGs) that provide even more explanatory detail on data organization and explain some suggested analyses. These TAUGs have been developed for specific disorders, broader therapeutic areas, and some types of treatments:

1. Psoriasis

2. Rheumatoid Arthritis

3. Cardiovascular (General)

4. Heart Failure

5. QT Studies

6. Traditional Chinese Medicine - Coronary Artery Disease-Angina

7. Acute Kidney Injury

8. Diabetes

9. Diabetes Type 1 - Exercise and Nutrition

10. Diabetes Type 1 - Pediatrics and Devices

11. Diabetes Type 1 - Screening, Staging, and Monitoring of Pre-clinical Type 1 Diabetes

12. Diabetic Kidney Disease

13. Dyslipidemia

14. Kidney Transplant

15. Polycystic Kidney Disease

16. CDAD

17. Crohn’s Disease

18. COVID-19

19. Ebola

20. Hepatitis C

21. HIV

22. Influenza

23. Malaria

24. Tuberculosis

25. Virology

26. Major Depressive Disorder

27. Post Traumatic Stress Disorder

28. Schizophrenia

29. Alzheimer’s

30. Huntington’s Disease

31. Multiple Sclerosis

32. Parkinson’s Disease

33. Traumatic Brain Injury

34. Breast Cancer

35. Colorectal Cancer

36. Lung Cancer

37. Pancreatic Cancer

38. Prostate Cancer

39. Nutrition

40. Traditional Chinese Medicine - Acupuncture

41. Duchenne Muscular Dystrophy

42. Rare Diseases

43. Asthma

44. COPD

45. COVID-19

46. Pain

47. Vaccines

The second organization, the Pharmaceutical Users Software Exchange (PHUSE) (http://phuse.global), develops recommendations for best practices in analyzing various data types collected in RCTs. The data types for which these recommendations have been developed to date are primarily safety-related. Working groups of statisticians, data scientists, and clinicians develop these recommendations. These contributing specialists come from pharmaceutical companies and regulatory agencies, including the FDA, and some are independent consultants (I currently serve on three working groups). While the recommended analyses for both individual RCTs and integrated datasets are non-binding, any commercial sponsor submitting the report for an RCT or an application for approval of an investigational drug to any regulatory authority is well advised to adhere to the recommendations. These recommendations are made in White Papers and other formats available through the PHUSE website (https://phuse.global/Deliverables/2), and they are periodically updated as new analytical methodologies are developed. At this time, five web pages list these recommendations, with some being archived and some not directly related to RCT data analyses.

Hopefully, it is clear that substantial effort from many technical experts is being expended to improve RCT experimentation quality and its reporting. For those who might view the PHUSE effort as akin to the fox watching the hen house, as some American farmers might say, it should be remembered that scientists working for regulatory authorities contribute to the development of these standards.

References:

Healy D. David Healy’s Reply to Jean-François Dreyfus’ Comment. inhn.org.controversies. February 18, 2021.

Lingjaerde O, Ahlfors UG, Bech P, Dencker SJ, Elgen K. The UKU side effect rating scale: a new comprehensive rating scale for psychotropic drugs and a cross-sectional study of side effects in neuroleptic-treated patients. Acta Psychiatr Scand Suppl 1987;334:1-100.

NDA/BLA Multi-disciplinary Review and Evaluation NDA 212268 Secuado (asenapine) transdermal system. CDER October 12, 2018. www.accessdata.fda.gov/drugsatfda_docs/nda/2019/212268Orig1s000MultidisciplineR.pdf

* Words/terms (e.g., accurate) and their linguistic derivatives (e.g., accuracy) that I use in this Comment that have specific meaning as I use them are placed in single quotations (‘WORD’) and defined within the text. While I have endeavored to assure that all instances of these words are enclosed in their single quotation marks, there might be rare instances where the quotation marks are inadvertently missing.

July 22, 2021
