Editorial

Lessons From and Cautions About Noninferiority and Equivalence Randomized Trials

Peter C. Gøtzsche, MD, DrMedSci

Author Affiliation: Nordic Cochrane Centre, Copenhagen, Denmark.

JAMA. 2006;295(10):1172-1174. doi:10.1001/jama.295.10.1172

The design of the classic, parallel-group randomized trial involves formulating a null hypothesis of no difference between 2 interventions and identifying a clinically relevant difference (Δ) that researchers do not wish to overlook. In these trials, commonly referred to as superiority trials, the investigators usually hope to be able to reject the null hypothesis and demonstrate a difference between interventions. In contrast, a noninferiority trial is one-sided in nature1 as it seeks to determine whether a new intervention is no worse than a reference intervention within a prespecified noninferiority interval (−Δ to 0) for the primary outcome. Similarly, an equivalence trial aims to determine whether 2 interventions have a similar effect, within a prespecified interval (−Δ to +Δ).
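
The interval definitions above can be illustrated with a minimal sketch. It assumes that the treatment effect is expressed as a difference (new minus reference) in which positive values favor the new intervention, and that a 2-sided confidence interval for this difference is available. The function names and labels are hypothetical and are not taken from the cited reports.

```python
def classify_noninferiority(ci_lower: float, ci_upper: float, delta: float) -> str:
    """Interpret a 2-sided CI for (new - reference) against a margin delta > 0.

    Assumption: positive differences favor the new intervention.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    if ci_lower > 0:
        return "superior"        # whole interval above 0
    if ci_lower > -delta:
        return "noninferior"     # lower limit inside (-delta, 0]
    if ci_upper < -delta:
        return "inferior"        # whole interval below -delta
    return "inconclusive"        # interval straddles -delta


def is_equivalent(ci_lower: float, ci_upper: float, delta: float) -> bool:
    """Equivalence requires the whole CI to lie inside (-delta, +delta)."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return -delta < ci_lower and ci_upper < delta


if __name__ == "__main__":
    # Hypothetical intervals with a hypothetical margin of 2.0 units.
    print(classify_noninferiority(-1.0, 1.5, 2.0))   # noninferior
    print(classify_noninferiority(-3.0, 0.5, 2.0))   # inconclusive
    print(is_equivalent(-1.0, 1.5, 2.0))             # True
```

In this sketch a claim of noninferiority depends only on the lower confidence limit, whereas equivalence constrains both limits.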

Noninferiority and equivalence randomized trials create challenges for researchers and clinicians and are associated with several issues that are controversial and difficult to grasp, even for trialists. Two reports in this issue of JAMA, a survey of 116 noninferiority and 46 equivalence trials by Le Henanff et al,2 and a CONSORT statement for reporting these trials by Piaggio et al,3 highlight the complexity of the field. These trials require specific and careful consideration of a number of issues, including their appropriate application, design, analysis, reporting, interpretation, and above all, usefulness for clinical practice.

First, there are specific uses and indications for noninferiority and equivalence trials. These trials are particularly useful when an untreated control group would be considered unethical, eg, when investigating the long-term outcome of a new prosthesis for hip replacement, a new drug combination against AIDS, or antenatal care models with fewer clinic visits and reduced costs.3 These trials also may be used for risk-benefit assessment when a new intervention is expected to be less harmful than the standard intervention or for comparison of different formulations or doses of the same drug.4 These study designs are not recommended when the standard intervention is not consistently better than placebo, eg, for drugs to treat depression and dementia,5 or when it is doubtful whether the magnitude of the effect over placebo is clinically relevant.

Second, the terminology used for describing all types of trials is not particularly transparent. Trials that are neither noninferiority trials nor equivalence trials are called superiority trials, the idea being that most trials aim to determine whether one intervention is superior to another. However, many superiority trials have an active comparator, and a better name that respects the inherent symmetry in the null hypothesis of no difference would be equivalence trials, but this term means something else. It is also confusing that, compared with the classic trial, the null and alternative hypotheses are reversed in noninferiority and equivalence trials; a type I (false-positive) error becomes the erroneous acceptance of an inferior new treatment, whereas a type II (false-negative) error becomes the erroneous rejection of a truly noninferior treatment.6 In addition, the same trial can assess noninferiority or equivalence for some outcomes, and superiority for others, eg, for harms. It is therefore important that researchers describe exactly what they did in detail and avoid using potentially confusing terms such as a type I error.

Third, the choice of Δ is crucial in noninferiority and equivalence trials for planning the trial, determining sample size,7 and interpreting results. In one of the examples in the article by Piaggio et al,3 Δ corresponded to half the effect size of the reference over placebo although the outcome was mortality; in this case, Δ should be particularly small to guard against the acceptance of inferior treatments. In another example,3 the estimated event rate was 3.1%, but Δ was 2%, which is arguably too large.1 Nevertheless, in this example, the trial report misleadingly claimed8 that ximelagatran was “at least as effective” as warfarin,9 which is an unwarranted conclusion unless the entire confidence interval lies outside the noninferiority interval, corresponding to P<.05 in a test for superiority, ie, the new drug is actually better than the control.
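
The dependence of sample size on Δ can be sketched with the usual normal-approximation formula for a continuous outcome. This is an illustration under stated assumptions, not the calculation used in any of the cited trials: equal group sizes, a common standard deviation, a true difference of zero, a 1-sided α of .025, and 90% power. The function name and default values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta: float, sd: float,
                alpha_one_sided: float = 0.025, power: float = 0.90) -> int:
    """Approximate patients per group for a noninferiority trial.

    Assumptions: continuous outcome, common SD, 1:1 allocation,
    true difference between treatments equal to zero.
    Formula: n = 2 * sd^2 * (z_alpha + z_beta)^2 / delta^2
    """
    if delta <= 0 or sd <= 0:
        raise ValueError("delta and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)


if __name__ == "__main__":
    # Margins expressed as fractions of the standard deviation (sd = 1).
    print(n_per_group(delta=0.50, sd=1.0))   # 85
    print(n_per_group(delta=0.25, sd=1.0))   # 337
```

Under these assumptions, halving Δ roughly quadruples the required sample size, which shows why a generous margin is tempting when a trial is planned.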

The regulatory requirement for drugs is that the selection of the noninferiority margin should include clinical judgment,4,10 but in practice, the reasoning almost always is exclusively statistical.1,11 It is considered inappropriate to use effect sizes (treatment difference divided by standard deviation) as justification for the choice of the noninferiority margin,4 but effect sizes can show whether Δ is generally reasonable: A systematic review of 332 noninferiority and equivalence trials found that in about one half of the trials a difference of 0.5 standard deviations, corresponding to an odds ratio of 2.2, was regarded as irrelevant, which is an unreasonably large Δ.11

In some cases, the clinical basis for selection of Δ is uncontroversial. For instance, some pain studies have shown the minimum difference in pain that patients can perceive, and if the 95% confidence interval is narrower than this, it may be concluded that the treatments are equivalent. This occurred in a study comparing acetaminophen (paracetamol) with nonsteroidal anti-inflammatory drugs for treatment of pain after musculoskeletal injury.12

Fourth, noninferiority and equivalence trials involve important and sometimes complex considerations for statistical analyses and post hoc design changes. Stopping rules for noninferiority trials can be asymmetric, allowing a trial to continue longer if the new treatment appears superior. However, this interferes with blinding of data monitoring committees whose decisions should be uninfluenced by which treatment appears to have a better outcome.

In contrast to superiority trials, intention-to-treat analyses and per-protocol analyses are considered to be equally important in noninferiority and equivalence trials.1 Intention-to-treat analyses will generally be biased toward finding no difference, which is usually the desired outcome in noninferiority and equivalence trials and is favored by studies with many dropouts and missing data. The direction of the bias in per-protocol analyses is more unpredictable, and these analyses may lose the value of the balance between the randomized groups and become invalid if rates and reasons for dropout differ between groups.

The flexibility of the designs carries a risk of manipulation. Without having access to the original trial protocol, readers may not know what to believe. For example, the primary outcome (defined in terms of Δ) is crucial for noninferiority and equivalence trials. However, a comparison of mostly classic trial protocols with trial reports showed that in 62% of trials, at least 1 primary outcome was changed, introduced, or omitted.13 The Δ can also be enlarged post hoc to disguise an initial finding that the new treatment was inferior, just as Δ and the sample size calculation have sometimes been changed in classic trials to conceal that the obtained sample size was insufficient.

Even for noninferiority trials, researchers should use a 2-sided 95% confidence interval,4 which will allow the unexpected benefit of also assessing for superiority if the difference observed is in the opposite direction of what was expected. This would not be possible with use of a 1-sided 95% confidence interval. However, it is inappropriate to do the opposite and claim noninferiority from a superiority trial unless the findings are clearly related to a prespecified margin of noninferiority. Le Henanff et al2 suspected that some trials they examined had been planned as superiority trials but were reported as if they had been noninferiority or equivalence trials after failure to demonstrate superiority. A good clue that this could be the case is if the sample size calculation reported in the article does not include a noninferiority or equivalence margin.

Fifth, it appears that noninferiority and equivalence trials are poorly reported and perhaps poorly conducted. Particularly damaging to the reliability of many of these trial reports are the findings reported by Le Henanff et al2 that one third of the reports that included a sample size calculation had omitted elements needed to reproduce it; one third of the reports described a confidence interval whose size was not in accordance with the type I error rate used in the sample size calculation; and half the reports that used statistical tests did not take the margins into account (which therefore corresponded to tests for superiority). In addition, only 20% of the trials that these authors surveyed provided the 4 necessary basic requirements: noninferiority or equivalence margin defined, sample size calculation taking this margin into account, both intention-to-treat and per-protocol analyses, and confidence interval for the result. If justification for the margin is included, which is an important regulatory requirement,4,10 only 4% of these trials complied with reporting requirements.

Sixth, clinicians need to interpret any claims regarding efficacy of new treatments based on noninferiority and equivalence trials with caution. When the sample size is large or the Δ is large or the variation in the measurements is smaller than expected, the confusing situation can arise that the new treatment actually is significantly worse than the reference, although the result is either formally inconclusive, ie, the lower confidence limit crosses the line for noninferiority, or the result even shows noninferiority, ie, the confidence interval is within the noninferiority interval (as illustrated in the Figure of the article by Piaggio et al).3 In these situations, clinicians might consider the significant difference and decide not to use the new treatment, for Δ is often much larger than what clinicians and drug agencies would consider a minimum relevant clinical difference.11
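
A small numerical sketch shows how this situation can arise. All numbers are hypothetical: success rates of 77% with the new treatment and 80% with the reference, 5000 patients per group, and a margin Δ of 5 percentage points. The confidence interval uses the simple Wald approximation for a difference between 2 proportions, which is an assumption made for illustration rather than a recommended method.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(successes_new: int, n_new: int,
                       successes_ref: int, n_ref: int,
                       level: float = 0.95) -> tuple[float, float]:
    """Wald CI for (new - reference) success proportions."""
    p_new = successes_new / n_new
    p_ref = successes_ref / n_ref
    diff = p_new - p_ref
    se = sqrt(p_new * (1 - p_new) / n_new + p_ref * (1 - p_ref) / n_ref)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


# Hypothetical trial: 3850/5000 successes (new) vs 4000/5000 (reference).
delta = 0.05
lower, upper = risk_difference_ci(3850, 5000, 4000, 5000)

meets_margin = lower > -delta        # formal noninferiority criterion
significantly_worse = upper < 0      # whole interval below zero

if __name__ == "__main__":
    print(f"95% CI: {lower:.3f} to {upper:.3f}")
    print("within noninferiority margin:", meets_margin)
    print("significantly worse than reference:", significantly_worse)
```

With these numbers the interval runs from about −0.046 to −0.014, so the trial formally shows noninferiority although the new treatment is significantly worse than the reference.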

Clinicians must be confident that the new treatment would have been shown to be efficacious if a placebo-controlled trial had been performed. It is a regulatory requirement that an indirect clear superiority to a putative placebo is provided,4,5 calculated from the difference between the new and the standard treatment and the difference between the standard and placebo.1 A systematic review identifying the relevant placebo-controlled studies should be used, but it is not clear whether the point estimate or a lower confidence limit should be used, whether the estimate should refer to all studies or only to more recent ones, and whether allowance should be made for possible publication bias. The assumption of constancy in factors that predict the outcome, compared with the historical placebo-controlled trials that demonstrated superiority, is inevitably questionable and often a major issue.1,4 Improved diagnostic methods can lead to changes in patient populations; ancillary treatments change; entry criteria for patients, timing of assessments, and doses may be different5; appropriate and relevant outcomes may change, eg, from death to surrogate outcomes in AIDS because of better treatments; and disease severity may change, eg, for infectious diseases.
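
One way to carry out the indirect calculation is sketched below. It is only one of the options left open above: it uses the point estimate of the historical effect together with its standard error, and it assumes that both differences are measured on the same scale, that the 2 sources are independent, and that the constancy assumption holds. The function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def indirect_vs_placebo(diff_new_std: float, se_new_std: float,
                        diff_std_placebo: float, se_std_placebo: float,
                        level: float = 0.95) -> tuple[float, float, float]:
    """Indirect estimate of (new - placebo) with a 2-sided CI.

    (new - placebo) = (new - standard) + (standard - placebo);
    the variances add because the two estimates are assumed independent.
    """
    estimate = diff_new_std + diff_std_placebo
    se = sqrt(se_new_std ** 2 + se_std_placebo ** 2)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate, estimate - z * se, estimate + z * se


if __name__ == "__main__":
    # Hypothetical: new is 1.0 unit worse than standard (SE 0.8);
    # standard was 4.0 units better than placebo historically (SE 1.0).
    est, low, high = indirect_vs_placebo(-1.0, 0.8, 4.0, 1.0)
    print(f"new vs putative placebo: {est:.2f} ({low:.2f} to {high:.2f})")
    print("indirect superiority to placebo:", low > 0)
```

Replacing the historical point estimate with its lower confidence limit would be more conservative and could reverse the conclusion, which is why the choice between these approaches matters.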

Moreover, conclusions in drug trial reports are often used for marketing and may often be misleading.14 This problem could be even greater with noninferiority trials. The appropriate conclusion from these types of trials should not be that noninferiority has been demonstrated, as only a superiority trial can show this.4 A noninferiority trial can only demonstrate that the new intervention is not worse than the comparator by more than a prespecified, small amount.4 However, drug and device manufacturers may not be willing to state in an advertisement that “our product was not inferior to the standard product with regard to our predefined margin of the smallest clinically meaningful difference.” In one example, noninferiority could not be claimed for voriconazole,15 and when the analysis was in agreement with the analysis plan for the trial, voriconazole was even statistically significantly inferior to the control drug, liposomal amphotericin B.16 Nevertheless, the authors concluded that “Voriconazole is a suitable alternative to amphotericin B preparations.”15

In summary, clinicians should especially bear in mind that noninferiority margins are often far too large to be clinically meaningful11 and that a claim of equivalence may also be misleading if a trial has not been conducted to an appropriately high standard. Furthermore, clinicians should be somewhat skeptical of trials that fail to include the basic reporting requirements described by Le Henanff et al,2 including definition and justification of the noninferiority or equivalence margin, calculation of sample size taking this margin into account, presentation of both intention-to-treat and per-protocol analyses, and provision of confidence intervals for the results.

Despite these concerns and cautions, it appears that noninferiority and equivalence trials are here to stay. Adherence to the recommendations suggested by Piaggio et al,3 both when planning and reporting noninferiority trials and equivalence trials, could lead to substantial improvement.

AUTHOR INFORMATION

Corresponding Author: Peter C. Gøtzsche, MD, DrMedSci, Nordic Cochrane Centre, Rigshospitalet, Department 7112, Blegdamsvej 9, DK-2100 Copenhagen Ø, Denmark (pcg@cochrane.dk).

Financial Disclosures: None reported.

Disclaimer: Dr Gøtzsche is a member of the CONSORT group and provided comments on earlier drafts of the manuscript by Piaggio et al.

Editorials represent the opinions of the authors and JAMA and not those of the American Medical Association.

REFERENCES

1. D'Agostino RB Sr, Massaro JM, Sullivan LM. Non-inferiority trials: design concepts and issues—the encounters of academic consultants in statistics. Stat Med. 2003;22:169-186.
2. Le Henanff A, Giraudeau B, Baron G, Ravaud P. Quality of reporting of noninferiority and equivalence randomized trials. JAMA. 2006;295:1147-1151.
3. Piaggio G, Elbourne DR, Altman DG, Pocock SJ, Evans SJW; CONSORT Group. Reporting of noninferiority and equivalence randomized trials: an extension of the CONSORT statement. JAMA. 2006;295:1152-1160.
4. Committee for Medicinal Products for Human Use. Guideline on the Choice of the Non-Inferiority Margin. London, England: European Medicines Agency, Pre-authorisation Evaluation of Medicines for Human Use; July 27, 2005. Available at: http://www.emea.eu.int/pdfs/human/ewp/215899en.pdf. Accessed January 27, 2006.
5. International Conference on Harmonization (ICH) of Technical Requirements for Registration of Pharmaceuticals for Human Use. Guideline, Choice of Control Group and Related Issues in Clinical Trials, May 2001. Rockville, Md: US Food and Drug Administration; May 2001. Available at: http://www.fda.gov/cder/guidance/4155fnl.htm. Accessed February 5, 2006.
6. Millar JA, Burke V. Relationship between sample size and the definition of equivalence in non-inferiority drug studies. J Clin Pharm Ther. 2002;27:329-333.
7. Jones B, Jarvis P, Lewis JA, Ebbutt AF. Trials to assess equivalence: the importance of rigorous methods. BMJ. 1996;313:36-39.
8. Kaul S, Diamond GA, Weintraub WS. Trials and tribulations of non-inferiority: the ximelagatran experience. J Am Coll Cardiol. 2005;46:1986-1995.
9. Olsson SB; Executive Steering Committee on behalf of the SPORTIF III Investigators. Stroke prevention with the oral direct thrombin inhibitor ximelagatran compared with warfarin in patients with non-valvular atrial fibrillation (SPORTIF III): randomised controlled trial. Lancet. 2003;362:1691-1698.
10. International Conference on Harmonisation. E9 Statistical principles for clinical trials [notice]. Federal Register. 1998;63:49583-49598. Available at: http://www.fda.gov/cber/gdlns/ichclinical.pdf. Accessed January 27, 2006.
11. Lange S, Freitag G. Choice of delta: requirements and reality—results of a systematic review. Biom J. 2005;47:12-27.
12. Woo WWK, Man S-Y, Lam PKW, Rainer TH. Randomized double-blind trial comparing oral paracetamol and oral nonsteroidal antiinflammatory drugs for treating pain after musculoskeletal injury. Ann Emerg Med. 2005;46:352-361.
13. Chan A-W, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles. JAMA. 2004;291:2457-2465.
14. Als-Nielsen B, Chen W, Gluud C, Kjaergard LL. Association of funding and conclusions in randomized drug trials: a reflection of treatment effect or adverse events? JAMA. 2003;290:921-928.
15. Walsh TJ, Pappas P, Winston DJ, et al. Voriconazole compared with liposomal amphotericin B for empirical antifungal therapy in patients with neutropenia and persistent fever. N Engl J Med. 2002;346:225-234.
16. Powers JH, Dixon CA, Goldberger MJ. Voriconazole versus liposomal amphotericin B in patients with neutropenia and persistent fever. N Engl J Med. 2002;346:289-290.
