Author Affiliation: Nordic Cochrane Centre, Copenhagen, Denmark.
The design of the classic, parallel-group randomized trial involves formulating a null hypothesis of no difference between 2 interventions and identifying a clinically relevant difference (Δ) that researchers do not wish to overlook. In such trials, commonly referred to as superiority trials, the investigators usually hope to be able to reject the null hypothesis and demonstrate a difference between interventions. In contrast, a noninferiority trial is one-sided in nature1 as it seeks to determine whether a new intervention is no worse than a reference intervention within a prespecified noninferiority interval (−Δ to 0) for the primary outcome. Similarly, an equivalence trial aims to determine whether 2 interventions have a similar effect, within a prespecified interval (−Δ to +Δ).
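As a minimal sketch of how these intervals are read in practice, the Python snippet below classifies a 2-sided confidence interval for the difference (new minus reference) against Δ. It assumes that higher values of the outcome are better, so that negative differences favor the reference; the class and function names are illustrative and do not come from the cited reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Confidence interval for the difference (new minus reference)."""
    lower: float
    upper: float


def is_noninferior(ci: Interval, delta: float) -> bool:
    # Noninferiority: the whole interval lies above -delta.
    return ci.lower > -delta


def is_equivalent(ci: Interval, delta: float) -> bool:
    # Equivalence: the whole interval lies within (-delta, +delta).
    return ci.lower > -delta and ci.upper < delta


def is_superior(ci: Interval) -> bool:
    # Superiority: the whole interval lies above 0.
    return ci.lower > 0


ci = Interval(lower=-1.5, upper=0.5)  # hypothetical result
print(is_noninferior(ci, delta=2.0))  # True
print(is_equivalent(ci, delta=2.0))   # True
print(is_superior(ci))                # False
```

The same interval can therefore support noninferiority and equivalence without supporting superiority, which is why the wording of the conclusion matters.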
Noninferiority and equivalence randomized trials create challenges for researchers and clinicians and are associated with several issues that are controversial and difficult to grasp, even for trialists. Two reports in this issue of JAMA, a survey of 116 noninferiority and 46 equivalence trials by Le Henanff et al,2 and a CONSORT statement for reporting these trials by Piaggio et al,3 highlight the complexity of the field. These trials require specific and careful consideration of a number of issues, including their appropriate application, design, analysis, reporting, interpretation, and above all, usefulness for clinical practice.
First, there are specific uses and indications for noninferiority and equivalence trials. These trials are particularly useful when an untreated control group would be considered unethical, eg, when investigating the long-term outcome of a new prosthesis for hip replacement, a new drug combination against AIDS, or antenatal care models with fewer clinic visits and reduced costs.3 These trials also may be used for risk-benefit assessment when a new intervention is expected to be less harmful than the standard intervention or for comparison of different formulations or doses of the same drug.4 These study designs are not recommended when the standard intervention is not consistently better than placebo, eg, for drugs to treat depression and dementia,5 or when it is doubtful whether the magnitude of the effect over placebo is clinically relevant.
Second, the terminology used for describing all types of trials is not particularly transparent. Trials that are neither noninferiority trials nor equivalence trials are called superiority trials, the idea being that most trials aim to determine whether one intervention is superior to another. However, many superiority trials have an active comparator, and a better name that respects the inherent symmetry in the null hypothesis of no difference would be equivalence trials, but this term means something else. It is also confusing that, compared with the classic trial, the null and alternative hypotheses are reversed in noninferiority and equivalence trials; a type I (false-positive) error becomes the erroneous acceptance of an inferior new treatment, whereas a type II (false-negative) error becomes the erroneous rejection of a truly noninferior treatment.6 In addition, the same trial can assess noninferiority or equivalence for some outcomes, and superiority for others, eg, for harms. It is therefore important that researchers describe exactly what they did in detail and avoid using potentially confusing terms such as a type I error.
Third, the choice of Δ is crucial in noninferiority and equivalence trials for planning the trial, determining sample size,7 and for interpreting results. In one of the examples in the article by Piaggio et al,3 Δ corresponded to half the effect size of the reference over placebo although the outcome was mortality; in this case, Δ should be particularly small to guard against the acceptance of inferior treatments. In another example,3 the estimated event rate was 3.1%, but Δ was 2%, which is arguably too large.1 Nevertheless, in this example, the trial report misleadingly claimed8 that ximelagatran was “at least as effective” as warfarin,9 which is an unwarranted conclusion unless the entire confidence interval lies outside the noninferiority interval, corresponding to P<.05 in a test for superiority, ie, the new drug is actually better than the control.
The regulatory requirement for drugs is that the selection of the noninferiority margin should include clinical judgment,4,10 but in practice, the reasoning almost always is exclusively statistical.1,11 It is considered inappropriate to use effect sizes (treatment difference divided by standard deviation) as justification for the choice of the noninferiority margin,4 but effect sizes can show whether Δ is generally reasonable: A systematic review of 332 noninferiority and equivalence trials found that in about one half of the trials a difference of 0.5 standard deviations, corresponding to an odds ratio of 2.2, was regarded as irrelevant, which is an unreasonably large Δ.11
In some cases, the clinical basis for selection of Δ is uncontroversial. For instance, some pain studies have shown the minimum difference in pain that patients can perceive, and if the 95% confidence interval is narrower than this, it may be concluded that the treatments are equivalent. This occurred in a study comparing acetaminophen (paracetamol) with nonsteroidal, anti-inflammatory drugs for treatment of pain after musculoskeletal injury.12
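The link between Δ and sample size noted above can be illustrated with a rough sketch. The snippet uses the standard normal-approximation formula for a continuous outcome, n per group = 2σ²(z₁₋α + z₁₋β)²/Δ², under the assumptions of equal group sizes, a common standard deviation, a 1-sided α, and a true difference of zero. The formula and the numbers are illustrative and are not taken from the trials discussed here.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sd: float, delta: float,
                one_sided_alpha: float = 0.025,
                power: float = 0.90) -> int:
    """Approximate sample size per group for a noninferiority trial
    with a continuous outcome, assuming a true difference of zero."""
    z_alpha = NormalDist().inv_cdf(1 - one_sided_alpha)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * sd**2 * (z_alpha + z_beta) ** 2 / delta**2)


# Hypothetical outcome with SD 5: halving the margin roughly
# quadruples the required number of patients.
print(n_per_group(sd=5.0, delta=2.0))  # 132
print(n_per_group(sd=5.0, delta=1.0))  # 526
```

Because the required sample size grows with 1/Δ², a generous margin makes a trial much cheaper to run, which may help explain why large margins are common.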
Fourth, noninferiority and equivalence trials involve important and sometimes complex considerations for statistical analyses and post hoc design changes. Stopping rules for noninferiority trials can be asymmetric, allowing a trial to continue longer if the new treatment appears superior. However, this interferes with blinding of data monitoring committees whose decisions should be uninfluenced by which treatment appears to have a better outcome.
In contrast to superiority trials, intention-to-treat analyses and per-protocol analyses are considered to be equally important in noninferiority and equivalence trials.1 Intention-to-treat analyses will generally be biased toward finding no difference, which is usually the desired outcome in noninferiority and equivalence trials and is favored by studies with many dropouts and missing data. The direction of the bias in per-protocol analyses is more unpredictable, and these analyses may lose the value of the balance between the randomized groups and become invalid if rates and reasons for dropout differ between groups.
The flexibility of the designs carries a risk of manipulation. Without having access to the original trial protocol, readers may not know what to believe. For example, the primary outcome (defined in terms of Δ) is crucial for noninferiority and equivalence trials. However, a comparison of mostly classic trial protocols with trial reports showed that in 62% of trials, at least 1 primary outcome was changed, introduced, or omitted.13 The Δ can also be enlarged post hoc to disguise an initial finding that the new treatment was inferior, just as Δ and the sample size calculation have sometimes been changed in classic trials to conceal that the obtained sample size was insufficient.
Even for noninferiority trials, researchers should use a 2-sided 95% confidence interval,4 which will allow the unexpected benefit of also assessing for superiority if the difference observed is in the opposite direction of what was expected. This would not be possible with use of a 1-sided 95% confidence interval. However, it is inappropriate to do the opposite and claim noninferiority from a superiority trial unless the findings are clearly related to a prespecified margin of noninferiority. Le Henanff et al2 suspected that some trials they examined had been planned as superiority trials but were reported as if they had been noninferiority or equivalence trials after failure to demonstrate superiority. A good clue that this could be the case is if the sample size calculation reported in the article does not include a noninferiority or equivalence margin.
Fifth, it appears that noninferiority and equivalence trials are poorly reported and perhaps poorly conducted. Particularly damaging to the reliability of many of these trial reports are the findings reported by Le Henanff et al2 that one third of the reports that included a sample size calculation had omitted elements needed to reproduce it; one third of the reports described a confidence interval whose size was not in accordance with the type I error rate used in the sample size calculation; and half the reports that used statistical tests did not take the margins into account (which therefore corresponded to tests for superiority). In addition, only 20% of the trials that these authors surveyed provided the 4 necessary basic requirements: noninferiority or equivalence margin defined, sample size calculation taking this margin into account, both intention-to-treat and per-protocol analyses, and confidence interval for the result. If justification for the margin is included, which is an important regulatory requirement,4,10 only 4% of these trials complied with reporting requirements.
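One of these reporting problems, a confidence interval that does not match the type I error rate of the sample size calculation, can be checked with simple arithmetic. As a small sketch (the function names are hypothetical), a 1-sided α corresponds to a 2-sided confidence level of 1 − 2α, so a trial planned with a 1-sided α of .025 should report a 2-sided 95% interval, whereas a 1-sided α of .05 matches only a 2-sided 90% interval.

```python
from statistics import NormalDist


def matching_two_sided_level(one_sided_alpha: float) -> float:
    """Two-sided confidence level consistent with a 1-sided alpha."""
    return 1 - 2 * one_sided_alpha


def critical_value(one_sided_alpha: float) -> float:
    """Normal critical value used for the confidence limit."""
    return NormalDist().inv_cdf(1 - one_sided_alpha)


for alpha in (0.025, 0.05):
    level = matching_two_sided_level(alpha)
    z = critical_value(alpha)
    print(f"1-sided alpha {alpha}: 2-sided level {level:.2f}, z = {z:.3f}")
```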
Sixth, clinicians need to interpret any claims regarding efficacy of new treatments based on noninferiority and equivalence trials with caution. When the sample size is large or the Δ is large or the variation in the measurements is smaller than expected, the confusing situation can arise that the new treatment actually is significantly worse than the reference, although the result is either formally inconclusive, ie, the lower confidence limit crosses the line for noninferiority, or the result even shows noninferiority, ie, the confidence interval is within the noninferiority interval (as illustrated in the Figure of the article by Piaggio et al).3 In these situations, clinicians might consider the significant difference and decide not to use the new treatment, for Δ is often much larger than what clinicians and drug agencies would consider a minimum relevant clinical difference.11
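The situation described above can be reproduced with hypothetical numbers. The sketch below assumes a continuous outcome where higher is better, a normal approximation, equal group sizes, and a common standard deviation; none of the values come from the trials cited.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_ci(diff: float, sd: float, n_per_group: int,
                 level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation CI for a difference in means (new minus reference)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sd * sqrt(2 / n_per_group)
    return diff - z * se, diff + z * se


def describe(diff: float, sd: float, n_per_group: int, delta: float) -> dict:
    lower, upper = two_sided_ci(diff, sd, n_per_group)
    return {
        "ci": (round(lower, 2), round(upper, 2)),
        "noninferior": lower > -delta,      # whole CI above -delta
        "significantly_worse": upper < 0,   # whole CI below 0
    }


# Same observed difference (-1) and margin (2), different sample sizes.
print(describe(diff=-1.0, sd=5.0, n_per_group=50, delta=2.0))
# {'ci': (-2.96, 0.96), 'noninferior': False, 'significantly_worse': False}
print(describe(diff=-1.0, sd=5.0, n_per_group=500, delta=2.0))
# {'ci': (-1.62, -0.38), 'noninferior': True, 'significantly_worse': True}
```

With 500 patients per group, the new treatment is formally noninferior and, at the same time, significantly worse than the reference.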
Clinicians must be confident that the new treatment would have been shown to be efficacious if a placebo-controlled trial had been performed. It is a regulatory requirement that an indirect clear superiority to a putative placebo is provided,4,5 calculated from the difference between the new and the standard treatment and the difference between the standard and placebo.1 A systematic review identifying the relevant placebo-controlled studies should be used, but it is not clear whether the point estimate or a lower confidence limit should be used, whether the estimate should refer to all studies or only to more recent ones, and whether allowance should be made for possible publication bias. The assumption of constancy in factors that predict the outcome, compared with the historical placebo-controlled trials that demonstrated superiority, is inevitably questionable and often a major issue.1,4 Improved diagnostic methods can lead to changes in patient populations; ancillary treatments change; entry criteria for patients, timing of assessments, and doses may be different5; appropriate and relevant outcomes may change, eg, from death to surrogate outcomes in AIDS because of better treatments; and disease severity may change, eg, for infectious diseases.
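One simple way to carry out the indirect comparison with a putative placebo is sketched below. It adds the two point estimates and combines their variances, which assumes that the two sources are independent and that the constancy assumption holds. As noted above, it is not settled whether the point estimate or a lower confidence limit of the historical effect should be used, so this is one possible choice rather than the required method. All numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def indirect_vs_placebo(d_new_std: float, se_new_std: float,
                        d_std_pbo: float, se_std_pbo: float,
                        level: float = 0.95) -> tuple[float, float, float]:
    """Estimate and CI for new vs putative placebo.

    d_new_std: difference new minus standard (from the noninferiority trial)
    d_std_pbo: difference standard minus placebo (from historical trials)
    Assumes independent estimates and constancy of the historical effect.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    estimate = d_new_std + d_std_pbo
    se = sqrt(se_new_std**2 + se_std_pbo**2)
    return estimate, estimate - z * se, estimate + z * se


est, lower, upper = indirect_vs_placebo(
    d_new_std=-1.0, se_new_std=0.5,   # hypothetical trial result
    d_std_pbo=4.0, se_std_pbo=1.0,    # hypothetical historical effect
)
print(f"{est:.2f} ({lower:.2f} to {upper:.2f})")  # 3.00 (0.81 to 5.19)
```

In this example the lower limit is above 0, so the new treatment would be judged indirectly superior to placebo, but only if the historical effect of the standard treatment still applies.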
Moreover, conclusions in drug trial reports are often used for marketing, but often may be misleading.14 This problem could be even greater with noninferiority trials. The appropriate conclusion from these types of trials should not be that noninferiority has been demonstrated as only a superiority trial can show this.4 A noninferiority trial can only demonstrate that the new intervention is not worse than the comparator by more than a prespecified, small amount.4 However, drug and device manufacturers may not be willing to state in an advertisement that “our product was not inferior to the standard product with regard to our predefined margin of the smallest clinically meaningful difference.” In one example, noninferiority could not be claimed for voriconazole,15 and when the analysis was in agreement with the analysis plan for the trial, voriconazole was even statistically significantly inferior to the control drug, liposomal amphotericin B.16 Nevertheless, the authors concluded that “Voriconazole is a suitable alternative to amphotericin B preparations.”15
In summary, clinicians should especially bear in mind that noninferiority margins are often far too large to be clinically meaningful11 and that a claim of equivalence may also be misleading if a trial has not been conducted to an appropriately high standard. Furthermore, clinicians should be somewhat skeptical of trials that fail to include the basic reporting requirements described by Le Henanff et al,2 including definition and justification of the noninferiority or equivalence margin, calculation of sample size taking this margin into account, presentation of both intention-to-treat and per-protocol analysis, and providing confidence intervals for the results.
Despite these concerns and cautions, it appears that noninferiority and equivalence trials are here to stay. Adherence to the recommendations suggested by Piaggio et al,3 both when planning and reporting noninferiority trials and equivalence trials, could lead to substantial improvement.
Corresponding Author: Peter C. Gøtzsche, MD, DrMedSci, Nordic Cochrane Centre, Rigshospitalet, Department 7112, Blegdamsvej 9, DK-2100 Copenhagen Ø, Denmark (pcg@cochrane.dk).
Financial Disclosures: None reported.
Disclaimer: Dr Gøtzsche is a member of the CONSORT group and provided comments on earlier drafts of the manuscript by Piaggio et al.
Editorials represent the opinions of the authors and JAMA and not those of the American Medical Association.