In a previous post, the terms ‘multiple endpoints’ and ‘co-primary endpoints’
were discussed. If a study contains two co-primary efficacy endpoints, the
study is claimed to be successful only if both endpoints are statistically
significant at alpha=0.05 (no adjustment for multiplicity is necessary). If a
study contains multiple (two) primary efficacy endpoints, the study is claimed
to be successful if either endpoint is statistically significant. However, in
the latter situation, an adjustment for multiplicity is necessary to maintain
the overall alpha at 0.05. In other words, the significance level used in the
hypothesis test for each individual endpoint is less than 0.05.
The simplest and most straightforward approach is to apply the Bonferroni
correction. The Bonferroni correction compensates for the increase in the
number of hypothesis tests: each individual hypothesis is tested at a
significance level of alpha/m, where alpha is the desired overall alpha level
(usually 0.05) and m is the number of hypotheses. If there are two hypothesis
tests (m=2), each individual hypothesis will be tested at alpha=0.025.
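As a minimal sketch of this rule in Python (the function name `bonferroni_test` is hypothetical, and the four p-values are the ones from the guidance example quoted below), each p-value is simply compared with alpha/m:

```python
def bonferroni_test(p_values, alpha=0.05):
    """Return the per-endpoint threshold alpha/m and one reject flag per p-value."""
    m = len(p_values)
    threshold = alpha / m
    return threshold, [p < threshold for p in p_values]


# Four primary endpoints, overall two-sided alpha of 0.05.
threshold, rejected = bonferroni_test([0.012, 0.026, 0.016, 0.055])
print(threshold)  # 0.0125
print(rejected)   # [True, False, False, False]
```

Only the first endpoint falls below 0.0125, so it is the only one declared significant.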
The Bonferroni method is a single-step procedure that is
commonly used, perhaps because of its simplicity and broad applicability. It is
a conservative test and a finding that survives a Bonferroni adjustment is a
credible trial outcome. The drug is considered to have shown effects for each
endpoint that succeeds on this test. The Holm and Hochberg methods are more
powerful than the Bonferroni method for primary endpoints and are therefore
preferable in many cases. However, for reasons detailed in sections IV.C.2-3,
sponsors may still wish to use the Bonferroni method for primary endpoints in
order to maximize power for secondary endpoints or because the assumptions of
the Hochberg method are not justified.
The most common form of the Bonferroni method divides the available total
alpha (typically 0.05) equally among the chosen endpoints. The method then
concludes that a treatment effect is significant at the alpha level for each
one of the m endpoints for which the endpoint’s p-value is less than α/m. Thus,
with two endpoints, the critical alpha for each endpoint is 0.025, with four
endpoints it is 0.0125, and so on. Therefore, if a trial with four endpoints
produces two-sided p values of 0.012, 0.026, 0.016, and 0.055 for its four
primary endpoints, the Bonferroni method would compare each of these p-values
to the divided alpha of 0.0125. The method would conclude that there was a
significant treatment effect at level 0.05 for only the first endpoint, because
only the first endpoint has a p-value of less than 0.0125 (0.012). If two of
the p-values were below 0.0125, then the drug would be considered to have
demonstrated effectiveness on both of the specific health effects evaluated by
the two endpoints.
The Bonferroni method tends to be conservative for the study overall Type I
error rate if the endpoints are positively correlated, especially when there
are a large number of positively correlated endpoints. Consider a case in which
all of three endpoints give nominal p-values between 0.025 and 0.05, i.e., all
‘significant’ at the 0.05 level but none significant under the Bonferroni
method. Such an outcome seems intuitively to show effectiveness on all three
endpoints, but each would fail the Bonferroni test. When there are more than
two endpoints with, for example, correlation of 0.6 to 0.8 between them, the
true family-wise Type I error rate may decrease from 0.05 to approximately 0.04
to 0.03, respectively, with negative impact on the Type II error rate. Because
it is difficult to know the true correlation structure among different
endpoints (not simply the observed correlations within the dataset of the
particular study), it is generally not possible to statistically adjust (relax)
the Type I error rate for such correlations. When a multiple-arm study design
is used (e.g., with several dose-level groups), there are methods that take
into account the correlation arising from comparing each treatment group to a
common control group.
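The conservativeness under positive correlation can be illustrated with a small Monte Carlo sketch. Several things here are my own assumptions rather than anything stated in the guidance: all null hypotheses are true, the test statistics are standard normal with a common pairwise correlation rho, the tests are two-sided, and there are four endpoints. The function name `bonferroni_fwer` is hypothetical. The estimates depend on the seed and the number of simulated trials, so they are a rough illustration and not a reproduction of the 0.04 and 0.03 figures.

```python
import random
from statistics import NormalDist


def bonferroni_fwer(m, rho, alpha=0.05, n_sim=50_000, seed=12345):
    """Estimate the family-wise Type I error rate of the equally weighted
    Bonferroni test when all m nulls are true and the m test statistics
    are standard normal with common pairwise correlation rho (0 <= rho < 1)."""
    rng = random.Random(seed)
    # Two-sided critical value for a per-endpoint alpha of alpha / m.
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * m))
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    trials_with_false_positive = 0
    for _ in range(n_sim):
        shared = rng.gauss(0, 1)  # common factor that induces the correlation
        # Z_i = a * shared + b * e_i has variance 1 and corr(Z_i, Z_j) = rho.
        if any(abs(a * shared + b * rng.gauss(0, 1)) > z_crit for _ in range(m)):
            trials_with_false_positive += 1
    return trials_with_false_positive / n_sim


estimates = {rho: bonferroni_fwer(m=4, rho=rho) for rho in (0.0, 0.6, 0.8)}
for rho, fwer in estimates.items():
    print(f"rho={rho}: estimated family-wise error rate = {fwer:.3f}")
```

With independent endpoints the estimate should sit just under 0.05, and it should drop as rho increases, which is the unused alpha that makes the Bonferroni test conservative.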
The guidance also discussed the weighted Bonferroni approach:
The Bonferroni test can also be performed with different weights
assigned to endpoints, with the sum of the relative weights equal to 1.0 (e.g.,
0.4, 0.1, 0.3, and 0.2, for four endpoints). These weights are prespecified in
the design of the trial, taking into consideration the clinical importance of
the endpoints, the likelihood of success, or other factors. There are two ways
to perform the weighted Bonferroni test:
- The unequally weighted Bonferroni method is often applied by dividing the overall
alpha (e.g., 0.05) into unequal portions, prospectively assigning a specific
amount of alpha to each endpoint by multiplying the overall alpha by the
assigned weight factor. The sum of the endpoint-specific alphas will always be
the overall alpha, and each endpoint’s calculated p-value is compared to the
assigned endpoint-specific alpha.
- An alternative approach is to adjust the raw calculated p-value for each endpoint
by the fractional weight assigned to it (i.e., divide each raw p-value by the
endpoint’s weight factor), and then compare the adjusted p-values to the overall
alpha of 0.05.
These two approaches are equivalent: for a weight w, the comparison p < w x alpha holds exactly when p / w < alpha.
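The equivalence of the two approaches can be sketched in a few lines of Python. The function names are hypothetical. The weights are the ones from the guidance example, and the p-values are borrowed from the unweighted example purely for illustration.

```python
from math import isclose


def weighted_bonferroni_by_alpha(p_values, weights, alpha=0.05):
    """Approach 1: compare each raw p-value with its own share, weight * alpha."""
    return [p < w * alpha for p, w in zip(p_values, weights)]


def weighted_bonferroni_by_pvalue(p_values, weights, alpha=0.05):
    """Approach 2: divide each raw p-value by its weight, then compare with alpha."""
    return [p / w < alpha for p, w in zip(p_values, weights)]


weights = [0.4, 0.1, 0.3, 0.2]           # prespecified weights, must sum to 1.0
p_values = [0.012, 0.026, 0.016, 0.055]  # illustrative p-values (assumption)
assert isclose(sum(weights), 1.0)

by_alpha = weighted_bonferroni_by_alpha(p_values, weights)
by_pvalue = weighted_bonferroni_by_pvalue(p_values, weights)
print(by_alpha)   # [True, False, False, False]
print(by_pvalue)  # [True, False, False, False]
```

The endpoint-specific alphas here are 0.02, 0.005, 0.015, and 0.01, so only the first endpoint (p = 0.012) is significant under either approach.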
The guidance mentioned that the reasons for using the weighted
Bonferroni test include:
- Clinical importance of the endpoints
- The likelihood of success
- Other factors
Other factors could include:
- With two primary efficacy endpoints, the expectation for
regulatory approval based on one endpoint is greater than for the other
- The sample size calculation indicates that a sample size that is
sufficient for primary efficacy endpoint #1 is larger than needed for primary
efficacy endpoint #2
With the weighted Bonferroni correction, the weights are subjective and essentially arbitrary, which results in unequal significance levels (alphas) being assigned to different endpoints.
Bonferroni and weighted Bonferroni corrections are widely applied in practice. Here are some examples:
The study was to be considered positive if either of the two
coprimary end points, progression-free or overall survival, was significantly
longer with durvalumab than with placebo. Approximately 702 patients were
needed for 2:1 randomization to obtain 458 progression-free survival events for
the primary analysis of progression-free survival and 491 overall survival
events for the primary analysis of overall survival. It was estimated that the
study would have a 95% or greater power to detect a hazard ratio for disease
progression or death of 0.67 and an 85% or greater power to detect a hazard
ratio for death of 0.73, on the basis of a log-rank test with a two-sided
significance level of 2.5% for each coprimary end point.
However, in the original study protocol, the weighted Bonferroni method was used and unequal alpha levels were assigned to OS and PFS.
The two co-primary endpoints of this study are OS and PFS.
To control for type-I error, a significance level of 4.5% will be used for
analysis of OS and a significance level of 0.5% will be used for analysis of
PFS. The study will be considered positive (a success) if either the PFS
analysis results and/or the OS analysis results are statistically significant.
Here, a weight of 0.9 (resulting in an alpha of 0.9 x 0.05 = 0.045) was
given to OS and a weight of 0.1 (resulting in an alpha of 0.1 x 0.05 = 0.005) was
given to PFS.
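The power figures in the published design (a two-sided 2.5% level for each endpoint, 2:1 randomization, 458 and 491 events) can be roughly checked with the Schoenfeld approximation for the log-rank test. This is my assumption: the publication does not say which formula was used, and the sketch ignores interim analyses and other design details. The function name `logrank_power` is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def logrank_power(events, hazard_ratio, alpha_two_sided, alloc_ratio=1.0):
    """Approximate power of a two-sided log-rank test (Schoenfeld approximation).

    events: total number of events at the analysis
    alloc_ratio: randomization ratio treatment:control (2.0 means 2:1)
    """
    std_normal = NormalDist()
    p_treat = alloc_ratio / (1 + alloc_ratio)
    p_ctrl = 1 - p_treat
    z_alpha = std_normal.inv_cdf(1 - alpha_two_sided / 2)
    # Expected value of the standardized log-rank statistic under the alternative.
    noncentrality = sqrt(events * p_treat * p_ctrl) * abs(log(hazard_ratio))
    return std_normal.cdf(noncentrality - z_alpha)


power_pfs = logrank_power(events=458, hazard_ratio=0.67,
                          alpha_two_sided=0.025, alloc_ratio=2.0)
power_os = logrank_power(events=491, hazard_ratio=0.73,
                         alpha_two_sided=0.025, alloc_ratio=2.0)
print(f"PFS power: {power_pfs:.3f}, OS power: {power_os:.3f}")
```

Under these assumptions the approximation lands close to the quoted 95% (PFS) and 85% (OS) figures, which is consistent with the alpha of 0.05 having been split equally between the two endpoints in the published analysis plan.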
In the COMPASS-2 study (Bosentan added to sildenafil therapy in patients with
pulmonary arterial hypertension), the original protocol contained two primary
efficacy endpoints, and the weighted Bonferroni method (even though it was not
explicitly mentioned in the publication) was used for multiplicity adjustment.
A weight of 0.8 (resulting in an alpha of 0.8 x 0.05 = 0.04) was given to time
to first mortality/morbidity event and a weight of 0.2 (resulting in an alpha
of 0.2 x 0.05 = 0.01) was given to the change from baseline to Week 16 in
6-minute walk distance (6MWD).
The initial assumptions for the primary end-point were an
annual rate of 21% on placebo with a risk reduced by 36% (hazard ratio (HR)
0.64) with bosentan and a negligible annual attrition rate. In addition, it was
planned to conduct a single final analysis at 0.04 (two-sided), taking into
account the existence of a co-primary end-point (change in 6MWD at 16 weeks)
planned to be tested at 0.01 (two-sided). Over the course of the study, a
number of amendments were introduced based on the evolution of knowledge in the
field of PAH, as well as the rate of enrolment and blinded evaluation of the
overall event rate. On implementation of an amendment in 2007, the 6MWD
end-point was changed from a co-primary end-point to a secondary end-point and
the Type I error associated with the single remaining primary end-point was
increased to 0.05 (two-sided).
As presented at the Meeting of the Antimicrobial Drugs Advisory Committee
(AMDAC), the sponsor (Bayer) conducted two pivotal studies: RESPIRE 1 and
RESPIRE 2. Each study contained two hypotheses. Interestingly, for multiplicity
adjustment, the Bonferroni method was used for the RESPIRE 1 study and the
weighted Bonferroni method for the RESPIRE 2 study. We can only guess why
weights of 0.02 and 0.98 (resulting in a partition of alpha into 0.001 and
0.049) were chosen in the RESPIRE 2 study.
RESPIRE 1 Study:
- Hypothesis 1: ciprofloxacin DPI for 28 days on/off treatment
regimen versus pooled placebo (alpha=0.025)
- Hypothesis 2: ciprofloxacin DPI for 14 days on/off treatment
regimen versus pooled placebo (alpha=0.025)
RESPIRE 2 Study:
- Hypothesis 1: ciprofloxacin DPI for 28 days on/off treatment
regimen versus pooled placebo (alpha=0.001)
- Hypothesis 2: ciprofloxacin DPI for 14 days on/off treatment
regimen versus pooled placebo (alpha=0.049)
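When a protocol reports only the endpoint-specific alphas, the implied weights can be recovered by dividing each alpha by the overall alpha. The sketch below (the function name `implied_weights` is hypothetical, and an overall two-sided alpha of 0.05 is assumed for every design) does this for the allocations mentioned in this post:

```python
from math import isclose


def implied_weights(alphas, overall_alpha=0.05):
    """Recover Bonferroni weights from endpoint-specific alphas
    (weight_i = alpha_i / overall_alpha)."""
    if not isclose(sum(alphas), overall_alpha):
        raise ValueError("endpoint-specific alphas must sum to the overall alpha")
    return [round(a / overall_alpha, 4) for a in alphas]


# Endpoint-specific alphas as described in the text above.
designs = {
    "OS / PFS (original protocol)": [0.045, 0.005],
    "COMPASS-2 (original protocol)": [0.04, 0.01],
    "RESPIRE 1": [0.025, 0.025],
    "RESPIRE 2": [0.001, 0.049],
}

weights_by_design = {name: implied_weights(a) for name, a in designs.items()}
for name, w in weights_by_design.items():
    print(f"{name}: weights = {w}")
```

Equal weights of 0.5 correspond to the ordinary Bonferroni correction, so RESPIRE 1 is just the special case of the weighted method with equal weights.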