Thursday, December 22, 2016

Control for Type I Error (or Adjustment for Multiplicity) for Secondary Endpoints

In clinical trial protocols, we usually specify one or more primary efficacy endpoints, then a list of secondary efficacy endpoints, and then tertiary or exploratory endpoints. It is standard practice that hypothesis testing (inferential statistics) and sample size estimation are based on the primary efficacy endpoints. If there is more than one primary efficacy endpoint, we need to adjust for multiple tests or multiplicity. It is also clear that the tertiary or exploratory endpoints are for hypothesis generating, or, more bluntly, for a fishing expedition.

Questions we are asked very often are:
  • For secondary efficacy endpoints, do we need to do formal hypothesis testing?
  • If so, do we need to adjust for multiple tests and multiplicity?
  • What is the purpose of controlling the type I error for secondary efficacy endpoints?  

While controlling the family-wise error rate (FWER) for primary efficacy endpoints (when a multiple-test situation exists) is clearly necessary, controlling the FWER for secondary efficacy endpoints is often questioned.

Controlling the FWER for secondary efficacy endpoints is valuable, and indeed necessary, if we plan to include the secondary efficacy endpoints in the product label should the product be approved. In other words, if we would like to claim benefits based on the secondary efficacy endpoints, it is advisable to perform formal hypothesis testing with control of the FWER.

In FDA’s guidance for industry “Clinical Studies Section of Labeling for Human Prescription Drug and Biological Products — Content and Format”, there is a statement about the primary and secondary endpoints:
§   Primary and Secondary Endpoints: The terms primary endpoint and secondary endpoint are used so variably that they are rarely helpful. The appropriate inquiry is whether there is a well-documented, statistically and clinically meaningful effect on a prospectively defined endpoint, not whether the endpoint was identified as primary or secondary.
FDA does not care whether a study endpoint is called primary or secondary. However, if the information from these endpoints is used as supporting evidence and for the product label, the endpoints need to be predefined, and the tests for these endpoints need to be controlled for the overall alpha (overall type I error).

Even though FDA does not care about the terminology of primary/secondary endpoints, it is still very common in practice that the clinical trials (especially the industry-sponsored clinical trials) specify primary, secondary, and exploratory endpoints.

In a presentation by Kathleen Fritsch from FDA, “Multiplicity Issues in FDA-Reviewed Clinical Trials”, the differences between primary, secondary, and exploratory efficacy endpoints are clearly specified, and it is stated that secondary efficacy endpoints may be included in the product label if the multiplicity issue is addressed.

It is rarely, if ever, necessary to control the FWER for exploratory efficacy endpoints.

In EMA’s guidance “Points to Consider on Multiplicity Issues in Clinical Trials”, adjustment for multiplicity is explicitly required if the secondary variables are used for additional claims.

A slide presentation regarding the EMA guidance further explains the use of secondary efficacy endpoints for claims.

In FDA’s guidance on “Clinical Investigations of Devices Indicated for the Treatment of Urinary Incontinence”, the secondary endpoints were called out:

                Secondary Endpoints
      FDA believes secondary endpoint measures, by themselves, are not sufficient to characterize fully the treatment benefit. However, these measures may provide additional characterization of the treatment effect. Specifically, secondary endpoints can:

  • supply background and understanding of the primary endpoints, in terms of overall direction and strength of the treatment effect;
  • be the individual components of a composite primary endpoint, if used;
  • include variables for which the study is underpowered to definitively assess;
  • aid in the understanding of the treatment’s mechanism of action;
  • be associated with relevant sub-hypotheses (separate from the major objective of the treatment); or
  • be used to perform exploratory analyses.

       Assuming that the primary safety and effectiveness endpoints of the study are successfully met, we recommend you analyze the secondary endpoints to provide supportive evidence concerning the safety and effectiveness of the device, as well as to support descriptions of device performance in the labeling. To minimize bias, your protocol should prospectively identify all secondary endpoints, indicating how the data will be analyzed and what success criteria will be applied.
 Secondary Endpoint Analyses
We recommend your protocol prospectively define the statistical plan for performing secondary endpoint analyses in the event that the primary endpoint analysis has been successfully met. If the secondary endpoint analyses are intended purely as exploratory analyses, or are not intended to support the indication for use or device performance, we recommend you submit only simple descriptions of the analyses. If, on the other hand, any of the secondary endpoint analyses are intended to support the indication for use or the performance of your device in the labeling (e.g., comparing treatment and control groups using p-values or confidence intervals), we recommend you pre-specify this intention in your study protocol and describe in detail the statistical methods you plan to follow.
In summary, if we do not perform hypothesis testing for the secondary endpoints, or do not have appropriate control of the overall type I error, we lose the chance to claim benefit on the secondary endpoints and to include them in the product label. Therefore, it is always wise to perform hypothesis testing for secondary endpoints with appropriate control of the type I error.
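One common way to operationalize this is a fixed-sequence (serial gatekeeping) strategy, in which secondary endpoints are tested in a pre-specified order, each at the full alpha, but only for as long as every earlier test in the hierarchy has succeeded. A minimal sketch (the endpoint names and p-values below are hypothetical):

```python
# Fixed-sequence (serial gatekeeping) testing: the primary endpoint is
# tested first at the full alpha; each secondary endpoint is then tested
# at the same alpha, but only while all earlier tests have succeeded.
# This strategy controls the FWER at alpha across the whole family.
# Endpoint names and p-values are hypothetical.

def fixed_sequence_test(ordered_pvalues, alpha=0.05):
    """Return win/no-win decisions; testing stops at the first failure."""
    decisions = []
    for name, p in ordered_pvalues:
        if p <= alpha:
            decisions.append((name, "claim"))
        else:
            decisions.append((name, "no claim"))
            break  # all later endpoints in the hierarchy are never tested
    return decisions

endpoints = [("primary", 0.010), ("secondary 1", 0.030),
             ("secondary 2", 0.080), ("secondary 3", 0.020)]
print(fixed_sequence_test(endpoints))
# "secondary 2" fails, so "secondary 3" is never tested
# even though its p-value (0.02) is below 0.05
```

The price of the pre-specified order is that an endpoint late in the hierarchy cannot be claimed once an earlier endpoint fails, no matter how small its p-value.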

Tuesday, December 20, 2016

The 21st Century Cures Act and Innovations in Clinical Trials

After the bill passed the House and the Senate, President Obama signed it into law on December 13, 2016. The 21st Century Cures Act is now officially in effect.

If you have trouble finding the final version of the Cures Act, here is the version signed into law by President Obama.

H.R.34 - 21st Century Cures Act

NPR has a good summary of "who wins and who loses with the 21st Century Cures Act".
The 21st Century Cures Act will greatly benefit the NIH and NCI. Here is a recent article published in the New England Journal of Medicine on NIH's perspective on the Act.

The 21st Century Cures Act — A View from the NIH


Some sections in this act are very relevant to the design and analysis of clinical trials. Section 3021 calls out adaptive designs and other novel clinical trial designs.

SEC. 3021. Novel clinical trial designs.
(a) Proposals for use of novel clinical trial designs for drugs and biological products.—For purposes of assisting sponsors in incorporating complex adaptive and other novel trial designs into proposed clinical protocols and applications for new drugs under section 505 of the Federal Food, Drug, and Cosmetic Act (21 U.S.C. 355) and biological products under section 351 of the Public Health Service Act (42 U.S.C. 262), the Secretary of Health and Human Services (referred to in this section as the “Secretary”) shall conduct a public meeting and issue guidance in accordance with subsection (b).
(b) Guidance addressing use of novel clinical trial designs.—
(1) IN GENERAL.—The Secretary, acting through the Commissioner of Food and Drugs, shall update or issue guidance addressing the use of complex adaptive and other novel trial design in the development and regulatory review and approval or licensure for drugs and biological products.
(2) CONTENTS.—The guidance under paragraph (1) shall address—
(A) the use of complex adaptive and other novel trial designs, including how such clinical trials proposed or submitted help to satisfy the substantial evidence standard under section 505(d) of the Federal Food, Drug, and Cosmetic Act (21 U.S.C. 355(d));
(B) how sponsors may obtain feedback from the Secretary on technical issues related to modeling and simulations prior to—
(i) completion of such modeling or simulations; or
(ii) the submission of resulting information to the Secretary;
(C) the types of quantitative and qualitative information that should be submitted for review; and
(D) recommended analysis methodologies.
(3) PUBLIC MEETING.—Prior to updating or issuing the guidance required by paragraph (1), the Secretary shall consult with stakeholders, including representatives of regulated industry, academia, patient advocacy organizations, consumer groups, and disease research foundations, through a public meeting to be held not later than 18 months after the date of enactment of this Act.
(4) TIMING.—The Secretary shall update or issue a draft version of the guidance required by paragraph (1) not later than 18 months after the date of the public meeting required by paragraph (3) and finalize such guidance not later than 1 year after the date on which the public comment period for the draft guidance closes.

The act specifies that real-world evidence can be used to support drug approval, where real-world evidence means data regarding the usage, or the potential benefits or risks, of a drug derived from sources other than randomized clinical trials.

The act creates a new pathway for medical devices (the breakthrough medical device designation), similar to the breakthrough therapy pathway for drugs.

The act specifies the importance of biomarkers in the drug approval process, where a biomarker “(A) means a characteristic (such as a physiologic, pathologic, or anatomic characteristic or measurement) that is objectively measured and evaluated as an indicator of normal biologic processes, pathologic processes, or biological responses to a therapeutic intervention; and


“(B) includes a surrogate endpoint.

The act also contains a section on "targeted drugs for rare diseases":

SEC. 3012. Targeted drugs for rare diseases.
Subchapter B of chapter V of the Federal Food, Drug, and Cosmetic Act (21 U.S.C. 360aa et seq.) is amended by inserting after section 529 the following:
“SEC. 529A. Targeted drugs for rare diseases.
“(a) Purpose.—The purpose of this section, through the approach provided for in subsection (b), is to—
“(1) facilitate the development, review, and approval of genetically targeted drugs and variant protein targeted drugs to address an unmet medical need in one or more patient subgroups, including subgroups of patients with different mutations of a gene, with respect to rare diseases or conditions that are serious or life-threatening; and
“(2) maximize the use of scientific tools or methods, including surrogate endpoints and other biomarkers, for such purposes.
“(b) Leveraging of data from previously approved drug application or applications.—The Secretary may, consistent with applicable standards for approval under this Act or section 351(a) of the Public Health Service Act, allow the sponsor of an application under section 505(b)(1) of this Act or section 351(a) of the Public Health Service Act for a genetically targeted drug or a variant protein targeted drug to rely upon data and information—
“(1) previously developed by the same sponsor (or another sponsor that has provided the sponsor with a contractual right of reference to such data and information); and
“(2) submitted by a sponsor described in paragraph (1) in support of one or more previously approved applications that were submitted under section 505(b)(1) of this Act or section 351(a) of the Public Health Service Act,
for a drug that incorporates or utilizes the same or similar genetically targeted technology as the drug or drugs that are the subject of an application or applications described in paragraph (2) or for a variant protein targeted drug that is the same or incorporates or utilizes the same variant protein targeted drug, as the drug or drugs that are the subject of an application or applications described in paragraph (2).
“(c) Definitions.—For purposes of this section—
“(1) the term ‘genetically targeted drug’ means a drug that—
“(A) is the subject of an application under section 505(b)(1) of this Act or section 351(a) of the Public Health Service Act for the treatment of a rare disease or condition (as such term is defined in section 526) that is serious or life-threatening;
“(B) may result in the modulation (including suppression, up-regulation, or activation) of the function of a gene or its associated gene product; and
“(C) incorporates or utilizes a genetically targeted technology;
“(2) the term ‘genetically targeted technology’ means a technology comprising non-replicating nucleic acid or analogous compounds with a common or similar chemistry that is intended to treat one or more patient subgroups, including subgroups of patients with different mutations of a gene, with the same disease or condition, including a disease or condition due to other variants in the same gene; and
“(3) the term ‘variant protein targeted drug’ means a drug that—
“(A) is the subject of an application under section 505(b)(1) of this Act or section 351(a) of the Public Health Service Act for the treatment of a rare disease or condition (as such term is defined in section 526) that is serious or life-threatening;
“(B) modulates the function of a product of a mutated gene where such mutation is responsible in whole or in part for a given disease or condition; and
“(C) is intended to treat one or more patient subgroups, including subgroups of patients with different mutations of a gene, with the same disease or condition.
“(d) Rule of construction.—Nothing in this section shall be construed to—
“(1) alter the authority of the Secretary to approve drugs pursuant to this Act or section 351 of the Public Health Service Act (as authorized prior to the date of enactment of the 21st Century Cures Act), including the standards of evidence, and applicable conditions, for approval under such applicable Act; or
“(2) confer any new rights, beyond those authorized under this Act or the Public Health Service Act prior to enactment of this section, with respect to the permissibility of a sponsor referencing information contained in another application submitted under section 505(b)(1) of this Act or section 351(a) of the Public Health Service Act.”.

Sunday, December 11, 2016

Commonly Used Procedure for Multiplicity Adjustment: Fixed Sequence Procedure, Holm Step-down Procedure, Hochberg Step-up Procedure

In clinical trials, we often face multiple tests or a multiplicity issue when more than one hypothesis test is built into the same study and we want to claim trial success if at least one of the multiple hypothesis tests is significant. For example, in an osteoporosis/breast cancer trial, there may be two endpoints:
  • Endpoint 1: Incidence of vertebral fractures
  • Endpoint 2: Incidence of breast cancer

We would like to claim success if at least one endpoint is significant. Similarly, in a trial with a low-dose group, a high-dose group, and a placebo control, we may want to claim success if either low dose versus placebo or high dose versus placebo is statistically significant. In both of these situations, adjustment for multiplicity must be employed.

On the other hand, not all studies with more than one hypothesis test need an adjustment for multiplicity. Taking Alzheimer's disease trials as an example, FDA guidance requires two endpoints
  • Endpoint 1: Cognition endpoint (ADAS-Cog)
  • Endpoint 2: Clinical global scale (CIBIC plus)

and requires that both endpoints be significant in order to claim success. In this case, both hypotheses are tested at a significance level of 0.05 and no adjustment for multiplicity is needed.

In late-phase clinical trials, if a multiplicity issue exists, the adjustment for multiplicity must be built into the statistical analysis plan to avoid inflation of the family-wise type I error rate above the nominal level (usually 0.05 or 5%).
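A small simulation illustrates the difference between the two situations above, assuming two independent endpoints whose null hypotheses are both true:

```python
# Quick check of why adjustment matters for "win if either test is
# significant" but not for "win only if both are significant".
# Two independent endpoints, both true nulls, each tested at 0.05.
import random

random.seed(1)
n_sim = 100_000
either_wins = both_win = 0
for _ in range(n_sim):
    p1, p2 = random.random(), random.random()  # null p-values ~ Uniform(0, 1)
    if p1 < 0.05 or p2 < 0.05:
        either_wins += 1
    if p1 < 0.05 and p2 < 0.05:
        both_win += 1

print(either_wins / n_sim)  # ~0.0975 = 1 - 0.95**2, inflated above 0.05
print(both_win / n_sim)     # ~0.0025, far below 0.05: no adjustment needed
```

The "either endpoint" rule nearly doubles the false-positive rate, which is exactly the inflation the multiplicity adjustment is designed to remove; the "both endpoints" rule is, if anything, conservative.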

Many different approaches have been proposed for handling the multiplicity issue. In a recent article by Wang et al (2015), “Overview of multiple testing methodology and recent development in clinical trials”, the following procedures were reviewed:


Multiple testing procedures for non-hierarchical hypotheses
  • Non-parametric or semi-parametric procedures: the Bonferroni, Simes, Holm step-down, Hochberg step-up, and Hommel procedures
  • Parametric procedures: the Dunnett procedure

Multiple testing procedures for hierarchical hypotheses
  • Simple procedures for hierarchical hypotheses: the fixed-sequence procedure and the fallback procedure
  • Gatekeeping procedures: serial gatekeeping procedures, parallel gatekeeping procedures, and other extensions of gatekeeping procedures

Graphical approaches

In a presentation by Bretz and Xun, “Introduction to Multiplicity in Clinical Trials”, at an IMPACT meeting, the multiple testing procedures for non-hierarchical hypotheses were organized based on whether the test is single-step or stepwise, and on whether or not correlations are taken into account.
 
They also made the following remarks:
  • Single-step methods are less powerful than stepwise methods and are not often used in practice.
  • Accounting for correlations leads to more powerful procedures, but correlations are not always known.
  • Simes-based methods are more powerful than Bonferroni-based methods, but control the FWER only under certain dependence structures.
  • In practice, we select a procedure that is not only powerful from a statistical perspective, but also appropriate from a clinical perspective.

For a specific clinical trial with a multiplicity issue, the choice of the multiplicity adjustment procedure depends on the study design, on whether there is an ordering of the multiple hypothesis tests by clinical importance, and sometimes on whether there is prior evidence that one hypothesis test is more likely to be significant. For example, for a dose-response study, the Dunnett procedure or step-down Dunnett procedure may be preferred. If a trial has multiple sources of multiplicity (for example, multiple endpoints plus different types of tests (superiority and non-inferiority)), then a gatekeeping procedure may be preferred.
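As a sketch, the Bonferroni, Holm, and Hochberg procedures can all be expressed as adjusted p-values that are compared directly with the overall alpha. This is a minimal illustration, not a validated implementation; for regulatory work a validated package should be used.

```python
# Adjusted p-values for three common non-hierarchical procedures.
# An endpoint is significant if its adjusted p-value is <= overall alpha.

def bonferroni(pvals):
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    # Step-down: sort ascending, multiply by (m - rank), enforce monotonicity.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adj[i] = min(1.0, running_max)
    return adj

def hochberg(pvals):
    # Step-up: sort descending, multiply by (rank + 1), enforce monotonicity.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: -pvals[i])
    adj = [0.0] * m
    running_min = 1.0
    for rank, i in enumerate(order):
        running_min = min(running_min, (rank + 1) * pvals[i])
        adj[i] = running_min
    return adj

pvals = [0.04, 0.03]
print(bonferroni(pvals))  # [0.08, 0.06] -> neither endpoint significant
print(holm(pvals))        # [0.06, 0.06] -> neither endpoint significant
print(hochberg(pvals))    # [0.04, 0.04] -> both endpoints significant at 0.05
```

With p-values of 0.04 and 0.03, only the Hochberg procedure declares both endpoints significant, which is exactly the sense in which it is more powerful than the Bonferroni and Holm procedures.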


In industry clinical trials, some procedures are used more often than others because they are more powerful, that is, more likely to declare statistical significance. It is often the case that the clinical trial sponsor (the pharmaceutical/biotech company) would like to choose a more powerful procedure (such as the Hochberg procedure), while the regulatory side (such as FDA) would prefer a more conservative procedure (such as the Bonferroni or Holm procedure).

We are still waiting for FDA to issue its formal guidance on multiplicity issues. In the meantime, some procedures for handling multiplicity are mentioned in therapeutic-area-specific guidance documents or in presentations by FDA statisticians. For example, in CDRH's guidance “Clinical Investigations of Devices Indicated for the Treatment of Urinary Incontinence”, the following paragraph deals with the multiplicity issue when performing statistical tests for multiple secondary endpoints.

The primary statistical challenge in supporting the indication for use or device performance in the labeling is in making multiple assessments of the secondary endpoint data without increasing the type 1 error above an acceptable level (typically 5%). There are many valid multiplicity adjustment strategies available for use to maintain the type 1 error rate at or below the specified level, three of which are listed below:
  • Bonferroni procedure
  • Hierarchical closed test procedure
  • Holm’s step-down procedure
Because each of these multiplicity adjustment strategies involves balancing different potential advantages and disadvantages, we recommend you prospectively state the strategy that you intend to use. We recommend your protocol prospectively state a statistical hypothesis for each secondary endpoint related to the indication for use or device performance.


EMA has a guideline “Points to Consider on Multiplicity Issues in Clinical Trials”. The document was issued in 2002 and may be due for revision. It mainly focuses on when adjustment for multiplicity is and is not needed; it does not discuss the specific procedures that could be used for the adjustment.

A recent paper by Sakamaki et al (2016), “Current practice on multiplicity adjustment and sample size calculation in multi-arm clinical trials: an industry survey in Japan”, revealed that the fixed-sequence procedure, gatekeeping procedures, and the Hochberg procedure are most commonly used, while the Holm procedure is rarely used.
 
Assume there are two hypothesis tests; in the table below, the left column indicates the p-values for the two tests. Whether statistical significance can be claimed depends on which procedure is used for multiplicity adjustment. In this specific case, the Hochberg step-up procedure is more powerful than the other multiplicity adjustment procedures.




Without any adjustment for multiplicity: compare p1 with 0.05 and compare p2 with 0.05.
Bonferroni correction: claim significance if p1 < 0.025 or if p2 < 0.025.
Fixed sequence (hierarchical): if p1 < 0.05, compare p2 with 0.05; if p1 > 0.05, p2 will not be tested.
Holm step-down procedure: if min(p1, p2) < 0.025, then test whether max(p1, p2) < 0.05.
Hochberg step-up procedure: if max(p1, p2) < 0.05, claim both endpoints successful; otherwise, test whether min(p1, p2) < 0.025.

Example outcomes (x = statistical significance cannot be claimed under that procedure):
  • p1 = 0.04, p2 = 0.03: Bonferroni x; Holm x; the fixed sequence and Hochberg procedures claim both endpoints.
  • p1 > 0.05, p2 = 0.03: Bonferroni x; fixed sequence x; Holm x; Hochberg x.
  • p1 > 0.05, p2 = 0.02: fixed sequence x; the Bonferroni, Holm, and Hochberg procedures claim endpoint 2.
  • p1 = 0.04, p2 > 0.05: Bonferroni x; Holm x; Hochberg x; the fixed sequence procedure claims endpoint 1 only.
  • p1 = 0.02, p2 = 0.02: all procedures claim both endpoints.
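The decision rules for the two-endpoint examples above can be checked programmatically. This sketch encodes each procedure's rule, assuming p1 is tested first in the fixed-sequence procedure and representing "p > 0.05" by the hypothetical value 0.10:

```python
# For each procedure, return [endpoint-1 claimed, endpoint-2 claimed]
# for a pair of p-values, using an overall two-sided alpha of 0.05.
def claims(p1, p2):
    bonf = [p1 <= 0.025, p2 <= 0.025]
    fixed = [p1 <= 0.05, p1 <= 0.05 and p2 <= 0.05]  # p1 tested first
    lo, hi = min(p1, p2), max(p1, p2)
    holm_both = lo <= 0.025 and hi <= 0.05
    holm = [holm_both or (p1 == lo and lo <= 0.025),
            holm_both or (p2 == lo and lo <= 0.025)]
    hoch_both = hi <= 0.05
    hoch = [hoch_both or (p1 == lo and lo <= 0.025),
            hoch_both or (p2 == lo and lo <= 0.025)]
    return {"Bonferroni": bonf, "Fixed sequence": fixed,
            "Holm": holm, "Hochberg": hoch}

# The five scenarios from the examples above ("p > 0.05" taken as 0.10):
for p1, p2 in [(0.04, 0.03), (0.10, 0.03), (0.10, 0.02),
               (0.04, 0.10), (0.02, 0.02)]:
    print((p1, p2), claims(p1, p2))
```

Running this reproduces the pattern in the examples: with p1 = 0.04 and p2 = 0.03, only the fixed-sequence and Hochberg procedures claim both endpoints.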

Thursday, November 24, 2016

Should the Percent of the Confidence Interval Match the Significance Level?

In statistics, a confidence interval (CI) is a type of interval estimate of a population parameter. A confidence interval consists of a range of values that act as good estimates of the unknown population parameter; however, the interval computed from a particular sample does not necessarily include the true value of the parameter. It is normal practice that when a point estimate and a p-value are presented, the confidence interval is also provided. If a corresponding hypothesis test is performed, the confidence level is the complement of the level of significance; i.e., a 95% confidence interval reflects a significance level of 0.05. Confidence intervals are typically stated at the 95% confidence level, corresponding to a significance level of 0.05; they may also be presented as 90% or 99% intervals, corresponding to significance levels of 0.10 or 0.01. It looks odd when the constructed confidence interval is not at 90%, 95%, or 99%; presenting, say, a 97.31% confidence interval seems strange.

The significance level should match the confidence level; that is why we usually say "the corresponding xx% confidence interval". If the significance level is 0.05, the corresponding confidence interval should be 95%; if the significance level is 0.01, the corresponding confidence interval should be 99%.

An issue arises when we present confidence intervals for studies with interim analyses. To maintain the experiment-wise alpha level at 0.05 with an interim analysis, the final analysis will be tested at a significance level less than 0.05, which could be a number not commonly used otherwise. For example, with an interim analysis at 50% of the information, the significance level for the interim analysis would be 0.005 and the significance level for the final analysis would be 0.048 (based on the O'Brien-Fleming method). Now the corresponding confidence interval would be an odd number: 95.2%.
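The correspondence is simple: a two-sided significance level alpha maps to a 100(1 - alpha)% confidence interval built with the normal quantile z(1 - alpha/2). A quick sketch using only the Python standard library:

```python
# The confidence level matching a two-sided alpha, and the corresponding
# normal quantile used as the half-width multiplier of the interval.
from statistics import NormalDist

for alpha in [0.05, 0.048, 0.0269]:
    level = 1 - alpha                         # e.g. alpha=0.0269 -> 97.31% CI
    z = NormalDist().inv_cdf(1 - alpha / 2)   # half-width multiplier
    print(f"alpha={alpha}: {100 * level:.2f}% CI, z={z:.3f}")
```

For alpha = 0.048 this gives a 95.20% interval (z about 1.98), and for alpha = 0.0269 a 97.31% interval (z about 2.21), slightly wider than the familiar 95% interval with z = 1.96.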

This is exactly what was done in COMPASS-2 study “Bosentan added to sildenafil therapy in patients with pulmonary arterial hypertension” (ERJ, 2015). Because of the interim analysis, the alpha level for the final analysis is 0.0269. To match the alpha level, the study presented a 97.31% confidence interval. The article stated the following (note: I believe that it is pre-planned unblinded interim analysis, not blinded interim analysis in the sentence below):
With an overall study-wise Type I error (alpha) set to 0.05 (two-sided) based on the log-rank test, when adjusted for two pre-planned blinded interim analyses at 50% and 75% of the target number of primary end-point events, the alpha for the final analysis of the primary end-point was 0.0269 and, thus, 97.31% confidence intervals were used in reporting the HRs.
However, presenting a 97.31% confidence interval is odd (even though it is correct). In the majority of publications, the confidence intervals were not presented according to the significance level adjusted for the interim analysis. Here are several articles describing studies with interim analyses and alpha spending: the final analyses were tested at a significance level less than 0.05, yet no matter what that significance level was, the confidence interval was always presented as a 95% confidence interval.

In an article by Schwartz et al (JAMA 2001), “Effects of atorvastatin on early recurrent ischemic events in acute coronary syndromes: the MIRACL study: a randomized controlled trial”, “the study protocol specified 3 interim analyses of safety and efficacy by the data safety monitoring board. A significance level of p=0.001 was used for each interim analysis, with a significance level for the final analysis adjusted to P=0.049 to preserve the overall type I error rate at P=0.05”. However, the results were presented with a 95% confidence interval instead of 95.1%.


In an article by Combs et al (AJOG, 2011), “17-hydroxyprogesterone caproate for twin pregnancy: a double-blind, randomized clinical trial”, two interim analyses were planned (only one was performed) and the primary efficacy endpoint was tested at alpha=0.0466; however, the 95% confidence interval was presented anyway.
“Interim analyses of the primary outcome were planned upon completion of 50% and 75% of the case reports. Only the first of these was actually performed. By the time 75% of the patients had delivered and case report forms had been completed, all but 8 of the planned total of 240 subjects had been enrolled and the data and safety monitoring board determined that a second interim analysis would have been moot. To correct for the interim analysis, the alpha level for the primary outcome was adjusted to 0.0466 based on the O’Brien-Fleming spending function. For all other analyses, no adjustments were made and an alpha level of 0.05 was used.”

In an article by Reck et al (NEJM, 2016) “Pembrolizumab versus Chemotherapy for PD-L1-Positive Non-Small-Cell Lung Cancer”, there is a lengthy discussion about the alpha level adjustment for interim analysis (see below), however, all results were presented with 95% confidence interval no matter what the alpha level is.
The overall type I error rate for this trial was strictly controlled at a one-sided alpha level of 2.5%. The full statistical analysis plan is available in the protocol. The protocol specified two interim analyses before the final analysis. The first interim analysis was to be performed after the first 191 patients who underwent randomization had a minimum of 6 months of followup; at this time, the objective response rate would be analyzed at an alpha level of 0.5%. The primary objective of the second interim analysis, which was to be performed after approximately 175 events of progression or death had been observed, was to evaluate the superiority of pembrolizumab over chemotherapy with respect to progression-free survival, at a one-sided alpha level of 2.0%. If pembrolizumab was superior with respect to progression-free survival, the superiority of pembrolizumab over chemotherapy with respect to overall survival would be assessed by means of a group-sequential test with two analyses, to be performed after approximately 110 and 170 deaths had been observed. We calculated that with approximately 175 events of progression or death, the trial would have 97% power to detect a hazard ratio for progression or death with pembrolizumab versus chemotherapy of 0.55. At the time of the second interim analysis, the trial had approximately 40% power to detect a hazard ratio for death with pembrolizumab versus chemotherapy of approximately 0.65 at a one-sided alpha level of 1.18%.
The second interim analysis was performed after 189 events of progression or death and 108 deaths had occurred and was based on a cutoff date of May 9, 2016. The data and safety monitoring committee reviewed the results on June 8, 2016, and June 14, 2016. Because pembrolizumab was superior to chemotherapy with respect to overall survival at the prespecified multiplicity adjusted, one-sided alpha level of 1.18%, the external data and safety monitoring committee recommended that the trial be stopped early to give the patients who were receiving chemotherapy the opportunity to receive pembrolizumab. All data reported herein are based on the second interim analysis.

In an article by Rinke et al (J Clin Oncol 2009), “Placebo-Controlled, Double-Blind, Prospective, …”, the design is described as follows:
On the basis of previous results, a median time to tumor progression of 9 months was assumed for the placebo group. An HR of 0.6 was postulated as a clinically meaningful difference to be detected with a power of 80%. An optimized group sequential design, with one interim analysis after observation of 64 progressions and the final analysis after observation of 124 progressions, with a local type I error level of 0.0122 at interim, was fixed in the protocol. A use function in the sense of DeMets and Lan was set up by reoptimization, resulting in the type I error level of 0.0125 after observation of 67 progressions. According to Schoenfeld and Richter and compensating for a lost to follow-up rate of 10%, recruitment of 162 patients was planned.
For survival time, a fixed-sample test based on 121 observed deaths was defined in the protocol. Controlling the family-wise error rate at the level of 5%, this test was planned as a confirmatory test in the event of a significant result for the primary end point, with the option of a redesign according to Müller and Schäfer.

Conclusion:
  • In clinical trials with an interim analysis where the final analysis is performed at a significance level less than 0.05, the correct way to present the confidence interval is to use the corresponding percentage.
  • In practice, this correct but odd-looking approach is not usually followed. Instead, no matter what the significance level for the final analysis is, the confidence interval is presented as a 95% confidence interval.
  • It seems to be accepted in publications to present a confidence interval that does not match the corresponding significance level or alpha level. If the overall significance level is 0.05, the 95% confidence interval is presented no matter what alpha is left for the final analysis after the adjustment for the interim analyses.

Monday, November 21, 2016

Haybittle–Peto Boundary for Stopping the Trial for Efficacy or Futility at Interim Analysis

In clinical trials with a group sequential design, or trials with a formal interim analysis for efficacy, we need to deal with alpha spending. To maintain the overall experiment-wise alpha level at 0.05, the final analysis will usually be tested at a significance level less than 0.05 because of the alpha spent at the interim analyses. Various approaches for handling the multiplicity due to interim analyses have been proposed; the O'Brien-Fleming approach is the most common.

In an article by Schulz and Grimes (Lancet 2005), "Multiplicity in randomised trials II: subgroup and interim analyses", three approaches were listed, with the stopping boundaries expressed as p-values:


In the middle is the approach proposed by Peto. In practice, the Peto approach is more commonly referred to as the Haybittle-Peto boundary.

In an online course, "Treatment Effects Monitoring; Safety Monitoring", there is the following table comparing the different boundary approaches, and the Haybittle-Peto boundary is one of them.


In a lecture note by Dr Koopmeiners, the Haybittle-Peto boundaries are summarized as below, with the boundaries expressed as critical values:



In short, under the Haybittle–Peto boundary, the trial should be stopped early if an interim analysis yields a p-value smaller than a very small alpha (equivalently, a z-statistic greater than a very large critical value), that is, if a difference between the treatments as extreme as or more extreme than the one observed would be very unlikely under the null hypothesis. The final analysis is still evaluated at almost the usual level of significance (typically 0.05). The main advantage of the Haybittle–Peto boundary is that the same threshold is used at every interim analysis, unlike the O'Brien–Fleming boundary, which changes at every analysis. In addition, with the Haybittle–Peto boundary the final analysis is performed at essentially the usual 0.05 level of significance, which makes the results easier for investigators and readers to understand. The main argument against the Haybittle–Peto boundary is that some investigators consider it too conservative, making it too difficult to stop a trial early.
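The claim that a Haybittle–Peto design barely inflates the overall type I error can be checked with a quick Monte Carlo sketch. The design simulated here is hypothetical: one interim look at half the information with stopping boundary |z| > 3.29 (p < 0.001), and a final analysis at the usual |z| > 1.96; under the null hypothesis, the interim and final z-statistics are bivariate normal with correlation sqrt(t) at information fraction t.

```python
import math
import random

random.seed(2016)
n_sim, t = 500_000, 0.5            # t = information fraction at the interim
rho = math.sqrt(t)                 # correlation of interim and final z under H0
rejections = 0
for _ in range(n_sim):
    z1 = random.gauss(0, 1)                                       # interim z
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)  # final z
    if abs(z1) > 3.29 or abs(z2) > 1.96:
        rejections += 1

fwer = rejections / n_sim
print(f"estimated overall type I error: {fwer:.4f}")
```

The estimate lands very close to 0.05, which is why the final analysis can still be performed at essentially the nominal level without a meaningful penalty.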

I have seen several high-profile clinical trials where the Haybittle-Peto boundary was used. In a recent paper by Finn et al (NEJM 2016), "Palbociclib and Letrozole in Advanced Breast Cancer", the Haybittle-Peto boundary was used for the interim analysis.
We planned for the data and safety monitoring committee to conduct one interim analysis
after approximately 65% of the total number of events of disease progression or death were observed to allow for the study to be stopped early owing either to compelling evidence of efficacy (using a pre-specified Haybittle–Peto efficacy boundary with an alpha level of 0.000013) or to a lack of efficacy.
In a paper by Eikelboom et al (NEJM 2017), "Rivaroxaban with or without Aspirin in Stable Cardiovascular Disease", a modified Haybittle-Peto boundary was used:
Two formal interim analyses of efficacy were planned, when 50% and 75% of primary efficacy events had occurred. A modified Haybittle–Peto rule was used, which required a difference of 4 SD at the first interim analysis that was consistent over a period of 3 months, and a consistent difference of 3 SD at the second interim analysis
In a paper by Sitbon et al (NEJM 2015), "Selexipag for the Treatment of Pulmonary Arterial Hypertension", the Haybittle-Peto boundary was also used for the interim analysis. Notice that the overall alpha for this study was 0.005, not the typical 0.05. While it was not mentioned in the NEJM publication, the alpha level for the interim analysis was 0.0001 according to FDA's statistical review of the NDA.
An independent data and safety monitoring committee performed an interim analysis,
which had been planned after 202 events had occurred, with stopping rules for futility
and efficacy that were based on Haybittle–Peto boundaries. The final analysis used a one-sided significance level of 0.00499.
In all of these cases, we can see that with Haybittle-Peto boundaries, the interim boundaries are set very high, making it almost impossible to stop the trial early for efficacy (or futility).

In choosing the stopping boundaries for interim analyses, Haybittle-Peto boundaries may be chosen when a sponsor has no real intention of stopping the trial early, but wants to give the Data Monitoring Committee a chance to take a peek at the study results in the middle of the study.

Haybittle-Peto boundaries are included in the major sample size calculation software packages, such as Cytel's East (SEQUENTIAL module) and SAS PROC SEQDESIGN.