Thursday, November 24, 2016

Should the Percent of the Confidence Interval Match the Significance Level?

In statistics, a confidence interval (CI) is a type of interval estimate of a population parameter. Confidence intervals consist of a range of values (interval) that act as good estimates of the unknown population parameter; however, the interval computed from a particular sample does not necessarily include the true value of the parameter. It is common practice that when a point estimate is calculated and a p-value is presented, a confidence interval is also provided. If a corresponding hypothesis test is performed, the confidence level is the complement of the level of significance, i.e., a 95% confidence interval reflects a significance level of 0.05. Confidence intervals are typically stated at the 95% confidence level, corresponding to the significance level of 0.05; they may also be presented at the 90% or 99% level, corresponding to a significance level of 0.10 or 0.01. It looks odd if the constructed confidence interval is not at the 90%, 95%, or 99% level - it is odd, for example, to present a 97.31% confidence interval.

The significance levels should match the confidence levels. That is why we usually say “corresponding xx% confidence interval”. If the significance level is 0.05, the corresponding confidence interval should be 95%. If the significance level is 0.01, the corresponding confidence interval should be 99%.

An issue arises when we present the confidence interval for studies with interim analyses. To maintain the experiment-wise alpha level at 0.05 with interim analyses, the final analysis will be tested at a significance level that is less than 0.05 and could be a number not commonly used otherwise. For example, with an interim analysis at 50% of the information available, the significance level for the interim analysis should be 0.005 and the significance level for the final analysis should be 0.048 (based on the O'Brien-Fleming method). Now the corresponding confidence interval will be an odd 95.2%.
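The arithmetic is simple: the corresponding confidence level is just one minus the adjusted alpha. Here is a minimal sketch in Python, using the illustrative 0.048 final-analysis alpha from above:

```python
from statistics import NormalDist

# Illustrative numbers: an adjusted final-analysis alpha after
# O'Brien-Fleming-type spending at a 50% information interim look.
alpha_final = 0.048            # significance level left for the final analysis
ci_level = 1 - alpha_final     # the "corresponding" confidence level: 95.2%

# Two-sided critical value for that level (vs 1.960 for a plain 95% CI)
z = NormalDist().inv_cdf(1 - alpha_final / 2)

print(f"corresponding CI: {ci_level:.1%}")  # 95.2%
print(f"critical z: {z:.3f}")               # about 1.977
```

The slightly larger critical value (1.977 vs 1.960) is what makes the 95.2% interval a bit wider than the usual 95% interval.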

This is exactly what was done in the COMPASS-2 study “Bosentan added to sildenafil therapy in patients with pulmonary arterial hypertension” (ERJ, 2015). Because of the interim analyses, the alpha level for the final analysis was 0.0269. To match the alpha level, the study presented 97.31% confidence intervals. The article stated the following (note: I believe these were pre-planned unblinded interim analyses, not blinded interim analyses as stated in the sentence below):
With an overall study-wise Type I error (alpha) set to 0.05 (two-sided) based on the log-rank test, when adjusted for two pre-planned blinded interim analyses at 50% and 75% of the target number of primary end-point events, the alpha for the final analysis of the primary end-point was 0.0269 and, thus, 97.31% confidence intervals were used in reporting the HRs.
However, presenting a 97.31% confidence interval is odd (even though it is correct). In the majority of publications, the confidence intervals were not presented according to the significance level adjusted for the interim analyses. Here are several articles describing studies with interim analyses and alpha spending. The final analyses were tested at a significance level less than 0.05; however, no matter what the significance level was, the confidence interval was always presented as a 95% confidence interval.

In an article by Schwartz et al (JAMA 2001) “Effects of atorvastatin on early recurrent ischemic events in acute coronary syndromes: the MIRACL study: a randomized controlled trial”, “the study protocol specified 3 interim analyses of safety and efficacy by the data safety monitoring board. A significance level of p=0.001 was used for each interim analysis, with a significance level for the final analysis adjusted to P=0.049 to preserve the overall type I error rate at P=0.05”. However, the results were presented with 95% confidence intervals instead of 95.1%.

In an article by Combs et al (AJOG, 2011) “17-hydroxyprogesterone caproate for twin pregnancy: a double-blind, randomized clinical trial”, two interim analyses were planned and the primary efficacy endpoint was tested at alpha=0.0466; however, the 95% confidence interval was presented anyway.
“Interim analyses of the primary outcome were planned upon completion of 50% and 75% of the case reports. Only the first of these was actually performed. By the time 75% of the patients had delivered and case report forms had been completed, all but 8 of the planned total of 240 subjects had been enrolled and the data and safety monitoring board determined that a second interim analysis would have been moot. To correct for the interim analysis, the alpha level for the primary outcome was adjusted to 0.0466 based on the O’Brien-Fleming spending function. For all other analyses, no adjustments were made and an alpha level of 0.05 was used.”

In an article by Reck et al (NEJM, 2016) “Pembrolizumab versus Chemotherapy for PD-L1-Positive Non-Small-Cell Lung Cancer”, there is a lengthy discussion of the alpha level adjustment for the interim analyses (see below); however, all results were presented with 95% confidence intervals no matter what the alpha level was.
The overall type I error rate for this trial was strictly controlled at a one-sided alpha level of 2.5%. The full statistical analysis plan is available in the protocol. The protocol specified two interim analyses before the final analysis. The first interim analysis was to be performed after the first 191 patients who underwent randomization had a minimum of 6 months of followup; at this time, the objective response rate would be analyzed at an alpha level of 0.5%. The primary objective of the second interim analysis, which was to be performed after approximately 175 events of progression or death had been observed, was to evaluate the superiority of pembrolizumab over chemotherapy with respect to progression-free survival, at a one-sided alpha level of 2.0%. If pembrolizumab was superior with respect to progression-free survival, the superiority of pembrolizumab over chemotherapy with respect to overall survival would be assessed by means of a group-sequential test with two analyses, to be performed after approximately 110 and 170 deaths had been observed. We calculated that with approximately 175 events of progression or death, the trial would have 97% power to detect a hazard ratio for progression or death with pembrolizumab versus chemotherapy of 0.55. At the time of the second interim analysis, the trial had approximately 40% power to detect a hazard ratio for death with pembrolizumab versus chemotherapy of approximately 0.65 at a one-sided alpha level of 1.18%.
The second interim analysis was performed after 189 events of progression or death and 108 deaths had occurred and was based on a cutoff date of May 9, 2016. The data and safety monitoring committee reviewed the results on June 8, 2016, and June 14, 2016. Because pembrolizumab was superior to chemotherapy with respect to overall survival at the prespecified multiplicity adjusted, one-sided alpha level of 1.18%, the external data and safety monitoring committee recommended that the trial be stopped early to give the patients who were receiving chemotherapy the opportunity to receive pembrolizumab. All data reported herein are based on the second interim analysis.

In an article by Rinke et al (J Clin Oncol 2009) “Placebo-Controlled, Double-Blind, Prospective, …”, the design included interim analyses with adjusted local alpha levels, yet the results were still presented with 95% confidence intervals:
On the basis of previous results, a median time to tumor progression of 9 months was assumed for the placebo group. An HR of 0.6 was postulated as a clinically meaningful difference to be detected with a power of 80%. An optimized group sequential design, with one interim analysis after observation of 64 progressions and the final analysis after observation of 124 progressions, with a local type I error level of 0.0122 at interim, was fixed in the protocol. A use function in the sense of DeMets and Lan was set up by reoptimization, resulting in the type I error level of 0.0125 after observation of 67 progressions. According to Schoenfeld and Richter and compensating for a lost to follow-up rate of 10%, recruitment of 162 patients was planned.
For survival time, a fixed-sample test based on 121 observed deaths was defined in the protocol. Controlling the family-wise error rate at the level of 5%, this test was planned as a confirmatory test in the event of a significant result for the primary end point, with the option of a redesign according to Müller and Schäfer.

  • In clinical trials with interim analyses where the final analysis is performed at a significance level less than 0.05, the correct way to present the confidence interval is to use the corresponding percentage.
  • In practice, this correct but odd-looking approach is not usually followed. Instead, no matter what the significance level for the final analysis is, the confidence interval is always presented as a 95% confidence interval.
  • It seems to be acceptable, or has been accepted in publications, to present a confidence interval that does not match the corresponding significance level or alpha level. If the overall significance level is 0.05, the 95% confidence interval can always be presented no matter what alpha level is left for the final analysis after the adjustment for multiplicity due to interim analyses.

Monday, November 21, 2016

Haybittle–Peto Boundary for Stopping the Trial for Efficacy or Futility at Interim Analysis

In clinical trials with a group sequential design, or clinical trials with formal interim analyses for efficacy, we need to deal with the alpha spending issue. To maintain the overall experiment-wise alpha level at 0.05, the final analysis will usually be tested at a significance level less than 0.05 due to the alpha spent at the interim analyses. Various approaches for handling the multiplicity issue due to interim analyses have been proposed. The O'Brien-Fleming approach is the most common.
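As a sketch of how such boundaries relate to nominal significance levels, the Python snippet below back-computes the nominal alpha at each look from the classic O'Brien-Fleming critical values for two equally spaced looks at an overall two-sided alpha of 0.05. The critical values themselves (2.797 and 1.977) are taken from published group-sequential tables; deriving them from scratch requires multivariate normal integration, which is omitted here.

```python
from statistics import NormalDist

# Classic O'Brien-Fleming critical values for two equally spaced looks,
# overall two-sided alpha = 0.05 (values from published group-sequential
# tables; they are not derived here).
obf_z = {"interim (50% information)": 2.797, "final": 1.977}

nd = NormalDist()
for look, z in obf_z.items():
    nominal_p = 2 * (1 - nd.cdf(z))  # two-sided nominal significance level
    print(f"{look}: z = {z}, nominal alpha = {nominal_p:.4f}")
# The nominal alphas come out to roughly 0.005 at the interim look
# and roughly 0.048 at the final analysis.
```

This reproduces the 0.005 / 0.048 split mentioned in the previous post's example.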

In an article by Schulz and Grimes (Lancet 2005) "Multiplicity in randomised trials II: subgroup and interim analyses", three approaches were listed where the stop boundaries were expressed in p-value format:

In the middle is the approach proposed by Peto. In practice, the Peto approach is more commonly referred to as the Haybittle-Peto boundary.

In the online course "Treatment Effects Monitoring; Safety Monitoring", there is the following table comparing the different approaches for boundaries; the Haybittle-Peto boundary is one of them.

In a lecture note by Dr Koopmeiners, the Haybittle-Peto boundaries are summarized as below, with the boundaries expressed as critical values:

In short, the Haybittle–Peto boundary states that if an interim analysis yields a p-value smaller than a very small alpha (or a test statistic greater than a very large critical value) - that is, if a difference between the treatments as extreme or more extreme is found, given that the null hypothesis is true - then the trial should be stopped early. The final analysis is still evaluated at almost the normal level of significance (usually 0.05). The main advantage of the Haybittle–Peto boundary is that the same threshold is used at every interim analysis, unlike the O'Brien–Fleming boundary, which changes at every analysis. Also, using the Haybittle–Peto boundary means that the final analysis is performed at the usual 0.05 level of significance, which makes it easier for investigators and readers to understand. The main argument against the Haybittle–Peto boundary is that some investigators believe it is too conservative and makes it too difficult to stop a trial.
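A minimal sketch of the idea in Python, assuming the common Haybittle-Peto choice of p < 0.001 at every interim look (the trials cited below use even more stringent thresholds):

```python
from statistics import NormalDist

nd = NormalDist()

# Haybittle-Peto: the same very stringent threshold at every interim look,
# leaving the final analysis at (almost) the usual 0.05 level.
interim_alpha = 0.001  # a common Haybittle-Peto choice (assumption here)
z_interim = nd.inv_cdf(1 - interim_alpha / 2)   # two-sided critical value
z_final = nd.inv_cdf(1 - 0.05 / 2)

print(f"interim boundary: z = {z_interim:.2f} (p < {interim_alpha})")  # ~3.29
print(f"final analysis:   z = {z_final:.2f} (p < 0.05)")               # ~1.96
```

The gap between the two critical values (about 3.29 vs 1.96) shows why the interim look spends almost no alpha, and why stopping early under this rule is so hard.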
I have seen several high-profile clinical trials where the Haybittle-Peto boundary was used. In a recent paper by Finn et al (NEJM 2016) "Palbociclib and Letrozole in Advanced Breast Cancer", the Haybittle-Peto boundary was used for the interim analysis.
We planned for the data and safety monitoring committee to conduct one interim analysis
after approximately 65% of the total number of events of disease progression or death were observed to allow for the study to be stopped early owing either to compelling evidence of efficacy (using a pre-specified Haybittle–Peto efficacy boundary with an alpha level of 0.000013) or to a lack of efficacy.
In a paper by Sitbon et al (NEJM 2015) "Selexipag for the Treatment of Pulmonary
Arterial Hypertension", the Haybittle-Peto boundary was also used for the interim analysis. Notice that the overall alpha for this study was 0.005, not the typical 0.05. While it was not mentioned in the NEJM publication, the alpha level for interim analysis was 0.0001 according to FDA's statistical review of the NDA.
An independent data and safety monitoring committee performed an interim analysis,
which had been planned after 202 events had occurred, with stopping rules for futility
and efficacy that were based on Haybittle–Peto boundaries. The final analysis used a one-sided significance level of 0.00499.
In both cases, we can see that with Haybittle-Peto boundaries, the boundaries are set very high - making it almost impossible to stop the trial early for efficacy (or futility).

In choosing the stopping boundaries for interim analyses, Haybittle-Peto boundaries may be chosen when a sponsor has no real intention to stop the trial early but wants to give the Data Monitoring Committee a chance to take a peek at the study results in the middle of the study.

Haybittle-Peto boundaries are included in major sample size calculation software such as Cytel's EAST SEQUENTIAL and SAS Proc SEQDESIGN.

Tuesday, November 15, 2016

Collecting Sex, Gender, or Both in Clinical Trials?

In the latest issue of JAMA, Clayton and Tannenbaum published an article titled “Reporting Sex, Gender, or Both in Clinical Research?” and raised an interesting question: whether we should collect sex, gender, or both in clinical trials.

Coincidently, just last month (Oct 26, 2016), FDA published a guidance for industry and FDA staff “Collection of Race and Ethnicity Data in Clinical Trials”. While the guidance is about the data collection for race and ethnicity, it mentions in several places both the sex and gender. On page 3 of the guidance, it uses the footnote to explain the differences between terms sex and gender.
The terms sex and gender have been used interchangeably in some FDA documents. However, according to a 2001 consensus report from the Institute of Medicine (Institute of Medicine, Committee on Understanding the Biology of Sex and Gender Differences. Exploring the Biological Contributions to Human Health: Does Sex Matter?, National Academy of Sciences, 2001), the terms have distinct definitions which should be used consistently to describe research results. Sex refers to the classification of living things, generally as male or female according to their reproductive organs and functions assigned by chromosomal complement. Gender refers to a person’s self-representation as male or female, or how that person is responded to by social institutions based on the individual’s gender presentation (perhaps Masculine vs. feminine). Gender is rooted in biology, and shaped by environment and experience. Because of underlying differences in the statutes and regulations referenced in this policy, the terms “gender” and “sex” have both been used in this document in accordance with the source material referenced.

The Institute of Medicine report (Committee on Understanding the Biology of Sex and Gender Differences, Exploring the Biological Contributions to Human Health: Does Sex Matter?, National Academy of Sciences, 2001) formally provided different definitions for the terms ‘sex’ and ‘gender’ and considers the two terms to have different meanings.

Prior to the IOM report, definitions for sex and gender had already been proposed in the ICH guideline "Sex-related Considerations in the Conduct of Clinical Trials". The ICH guideline says the following about sex and gender:
The terms sex and gender have been used interchangeably in many of the previously adopted ICH guidelines. In recognition of the currently accepted distinction between these concepts, the term sex will be used in all new and revised ICH guidelines to denote the biogenetic differences that distinguish males and females. While different definitions may exist respecting the term gender, it is understood that gender generally refers to the array of socially constructed roles and relationships, behaviours and values that society ascribes to the two sexes on a differentiated basis.
It makes sense to differentiate the concepts of sex and gender. With the definitions above, whether to collect sex, gender, or both may depend on the indication or disease area. As recommended by Clayton and Tannenbaum in their JAMA article, we should “use the term sex when reporting biological factors and gender when reporting gender identity or psychosocial or cultural factors”. In the psychology/psychiatry field, collecting data on gender may be as important as collecting data on sex.

I believe that the current standard practice in clinical trials is still to use the terms sex and gender interchangeably. The demographics case report form collects sex or gender (with male and female as categories), but not both. Providing different definitions for the terms ‘sex’ and ‘gender’ is driven by the potential participation of transgender subjects in clinical trials. Instead of collecting both sex and gender information, it might be easier just to add two additional categories to the data collection:
Sex/Gender: Male, Female, Transgender (Male to Female), Transgender (Female to Male).

FDA and EMA have both published guidelines to encourage the collection of demographic information and the performance of subgroup analyses based on that information. Sex or gender is a big part of it. It is critical to have women represented in clinical trials, and it is critical to collect and analyze the data to determine whether there is a sex difference in treatment effect. In some therapeutic areas, clinical trials may need to collect both sex and gender information.

Saturday, November 12, 2016

Presidential Election and Statistics

Tuesday’s election results were full of surprises. The results indicated that almost all polls were wrong this time. I view polling as survey statistics, and the election results exposed how survey statistics can go wrong - badly.

Before the election, each poll is considered an independent survey - taking a sample from the overall population and then trying to make a prediction about the population. For a presidential election, we calculate the sample proportion (the proportion of polled subjects who will vote for a candidate) and then try to predict the population proportion. Usually in statistics, when we sample to predict the population, we never know whether our prediction is correct, because the truth about the population will never be revealed or known.
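As a sketch of the sampling arithmetic behind a poll, the following Python snippet computes the usual 95% margin of error for a sample proportion. The poll numbers (1,000 respondents, 48% support) are hypothetical, chosen only for illustration:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical poll for illustration: 1,000 respondents, 48% favoring
# a candidate. The 95% margin of error for the sample proportion is
# z * sqrt(p_hat * (1 - p_hat) / n).
n, p_hat = 1000, 0.48
z = NormalDist().inv_cdf(0.975)          # ~1.96 for a 95% interval
moe = z * sqrt(p_hat * (1 - p_hat) / n)  # standard error times z

print(f"estimate: {p_hat:.1%} +/- {moe:.1%}")  # roughly 48.0% +/- 3.1%
```

A roughly three-point margin of error on any single poll is one reason a race within a few points is so hard to call, even before non-sampling errors (non-response, likely-voter models) enter the picture.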

It is different for a presidential election. After the election, we know the truth, and the truth verifies whether the polls were wrong or correct. Unfortunately, the polls were mostly wrong this time.

We still remember the famous statistician (even though he is actually not a statistician) Nate Silver and his website. He became famous after he correctly predicted 49 out of 50 states in the 2008 presidential election and 50 out of 50 states in the 2012 presidential election. In 2013, he was invited to give a keynote speech at the annual Joint Statistical Meeting (the largest conference in the statistics field).

Predicting 49 out of 50 and 50 out of 50 states correctly sounds like a great feat; however, for the majority of states, anybody who pays a little attention to the presidential election can predict the results correctly. For example, it is pretty safe to put Texas, Indiana, Kentucky,… into the category of red states and New York and California into the category of blue states. In probability terms, I am willing to bet that Hillary had a 100% chance to win California and Trump had a 100% chance to win Mississippi. There are actually fewer than 10 (or maybe even fewer than 5) states - the so-called battleground states - where the polling and prediction are critical. Predicting 49 out of 50 states correctly may essentially be just predicting 4 out of 5 states. For the 2016 election, the predictions came down to several battleground states such as Florida, North Carolina, Ohio, Michigan, Virginia,... - he got many of them wrong, especially Michigan, Wisconsin, and Pennsylvania. In the final poll prediction prior to the November 8 election, Nate Silver predicted the following:
"giving Clinton a 71.4% chance of winning, and predicting the former Secretary of State would end up with 302 electoral votes (270 are required for victory) and a 3.6 percentage point margin–48.5% to 44.9%–in the popular vote." 

Here is a comparison of the final predictions and the final results for all 50 states. The highlights in yellow are states with discordance (i.e., the predicted probability of Trump winning was less than 50%, but Trump won; or the predicted probability of Trump winning was greater than 50%, but Trump lost):

[Table: predicted probability of Trump winning (ranging from less than 0.1% to greater than 99.9%) vs. actual result, by state]

If we just look at the discordance: there are 6 out of 50 (12%) states where the predicted probability of Trump winning was less than 50% but Trump won, and 0 out of 50 states where the predicted probability of Trump winning was greater than 50% but Trump lost.

[Table: actual results of Trump winning vs. predicted probability of Trump winning (greater than 50% / less than 50%)]

Nate Silver is a Democrat, and he indicates that party affiliation does not have any impact on his predictions. However, there may be unconscious biases in the predictions; at least this seems to be true based on the predictions and the actual results for this election: in all six states with discordance, the predictions underestimated the probability of Trump winning.

I am not sure what model Nate Silver used for his predictions. His predictions were still better than the other polls (even though this time they were not as good as in previous elections). I see similarities between his analysis and a meta-analysis - data analysis or modeling based on various sources of polling data. The prediction is made in the form of a probability of winning for each candidate based on the aggregated data from the meta-analysis.

Across three consecutive presidential elections (2008, 2012, and 2016), Nate Silver got the first two right but the third one totally wrong. This reminds us that replication and verification are needed to tell whether a model is robustly correct. Remember what George Box said: “essentially, all models are wrong, but some are useful.”