Monday, January 17, 2022

Paired T-test and McNemar's test for paired data based on the summary data

Sometimes it is necessary to calculate p-values from summary (aggregate) data when the individual subject-level data are not available. In a previous post, the group t-test and chi-square test based on summary data were discussed; those tests apply in the setting of parallel-group comparisons.

In single-arm clinical trials, there is no concurrent control group, and the statistical test is usually based on a pre-post comparison. For continuous measures, the pre-post comparison can be tested with a paired t-test on the change-from-baseline values (i.e., post-baseline measure - baseline measure). For discrete outcomes, the pre-post comparison may be tested with McNemar's test.

Paired t-test:

A paired t-test is used when we are interested in the difference between two measurements on the same subject. Suppose we have the descriptive statistics for the change-from-baseline values: 83 subjects had the outcome measured at both baseline and week 12 (therefore, 83 pairs), with a mean (standard deviation) change of 10.7 (70.7); 68 subjects had the outcome measured at both baseline and week 24 (therefore, 68 pairs), with a mean (standard deviation) change of 20.2 (80.9).

With the mean difference, the standard deviation of the differences, and the sample size (number of pairs), we have all the elements needed to calculate the t statistic and therefore the p-value using the formula below.
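With d̄ denoting the mean of the within-pair differences, s_d their standard deviation, and n the number of pairs:

    t = d̄ / (s_d / √n),  with df = n - 1

For week 12, for example, t = 10.7 / (70.7 / √83) ≈ 1.38 with 82 degrees of freedom; for week 24, t = 20.2 / (80.9 / √68) ≈ 2.06 with 67 degrees of freedom.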

This can be implemented in SAS as follows - the t-statistics and p-values can be calculated for each of weeks 12 and 24 from the aggregate data alone.
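A minimal sketch of such a program (the dataset and variable names here are illustrative):

data paired_t;
  input visit $ n mean_chg sd_chg;
  df = n - 1;                        * degrees of freedom = number of pairs - 1;
  se = sd_chg / sqrt(n);             * standard error of the mean difference;
  t  = mean_chg / se;                * paired t-statistic;
  p  = 2 * (1 - probt(abs(t), df));  * two-sided p-value from the t distribution;
  datalines;
Week12 83 10.7 70.7
Week24 68 20.2 80.9
;
run;

proc print data=paired_t;
run;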
 

McNemar's Test:

McNemar's test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is "marginal homogeneity"). In clinical trials, the aggregate data may not obviously form a 2 × 2 contingency table, but they can often be converted into one.

Suppose we have the following summary data for post-baseline week 12: the number and percentage of subjects in the improved, no-change (stable), and deteriorated categories.

 

 

              All subjects (n=300)
Week 12
  Improved           54 (18%)
  No Change         228 (76%)
  Deteriorated       18 ( 6%)

At Week 12, there are more subjects in the 'Improved' category than in the 'Deteriorated' category, even though the majority of subjects are in the 'No Change' category. Are there significantly more subjects with improvement than with deterioration?

Assuming that a change from category 1 to 0 is 'Improved' and a change from category 0 to 1 is 'Deteriorated', the table above can be converted into a 2 × 2 table:

 

 

                 Baseline
                  0      1
Week 12    0    228     54
           1     18      0

or

                 Baseline
                  0      1
Week 12    0      0     54
           1     18    228

For McNemar's test, only the counts in the off-diagonal discordant cells (in our case, the number improved and the number deteriorated) are relevant.

The concordant cells (in our case, the number with no change) contribute only to the total sample size; they have no impact on the chi-square statistic or the p-value. How the subjects in the 'No Change' category are split between the two concordant cells does not matter for the calculation.
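To make this concrete: with b and c denoting the two discordant counts, McNemar's statistic is χ² = (b - c)² / (b + c), with 1 degree of freedom. Here, χ² = (54 - 18)² / (54 + 18) = 1296 / 72 = 18, giving p < 0.0001 no matter how the 228 'No Change' subjects are distributed across the concordant cells.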

For the 2 × 2 table above, McNemar's test can be performed with SAS code like the following (the WEIGHT statement indicates that the count variable holds the frequency of each observation, and the AGREE option requests McNemar's test). How the 228 subjects in the concordant 'No Change' category are split has no impact on the p-value calculation.
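A minimal sketch (dataset and variable names are illustrative; the counts come from the first 2 × 2 table above):

data mcnemar;
  input baseline week12 count;
  datalines;
0 0 228
0 1 54
1 0 18
1 1 0
;
run;

proc freq data=mcnemar;
  weight count / zeros;            * ZEROS keeps the zero-count cell so the table stays 2 x 2;
  tables baseline*week12 / agree;  * AGREE requests McNemar test for a 2 x 2 table;
run;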



Sunday, January 09, 2022

Overrunning issue at the interim analysis for group sequential design

It is pretty common these days for clinical trials (especially late-phase, adequate and well-controlled studies) to employ interim analyses to determine whether the efficacy results are so good that the study should be stopped early for overwhelming efficacy, whether the efficacy results are poor enough that the study should be stopped early for futility, or both. A study with formal interim analyses of comparative efficacy follows a 'group sequential design', even if the term 'group sequential design' is not formally used in the study protocol. Group sequential design is the most common type of adaptive design described in the FDA guidance "Adaptive Designs for Clinical Trials".

As mentioned in an earlier post "overrunning issues in adaptive design clinical trials", one of the issues with interim analyses in a group sequential design is overrunning. Overrunning consists of extra data collected by investigators while awaiting the results of the interim analysis (IA); it is the phenomenon that data continue to accumulate after it has been decided to stop a trial (Whitehead, 1992). In many cases there will be patients who have already been admitted to the trial but whose responses are not yet known. Some extra patients will also enter the trial because of the delay between the moment the data for the final interim analysis were retrieved and the moment participating clinical centers receive the instruction to stop recruitment.

The EMA "reflection paper on methodological issues in confirmatory clinical trials planned with an adaptive design" has a full paragraph discussing the overrunning issue.


When planning an interim analysis, decisions need to be made about its timing and about what data are to be included. For an event-driven study where the number of events drives the analysis, the timing of the interim analysis can be based on the percentage of events; for example, the interim analysis can be performed once 50% of the total number of events have accrued. For a longitudinal design, the endpoint is measured at various intervals, and at any time during the study there will be patients at different stages (reaching the end of the study, reaching a specific study duration, or just being randomized into the study). It is more difficult to determine a good timing for the interim analysis. Suppose an interim analysis is planned for when 50% of subjects reach the end of the study; by that time, there will be plenty of subjects at various stages of the study who simply have not yet reached the end of the study. If enrollment is fast and the study endpoint is long (for example, 52 weeks), then by the time 50% of subjects reach the endpoint, the majority of subjects (if not all) will already have been randomized into the study.

If the interim analysis results trigger a recommendation to discontinue the study early (either for efficacy or for futility), the debate is whether the interim analysis needs to be re-run with the overrunning subjects included before the recommendation to discontinue the study early is adopted.

This exact issue of handling overrunning subjects arose in Biogen's aducanumab clinical trials in Alzheimer's disease. As I discussed in a previous post "Futility Analysis and Conditional Power When Two Phase 3 Studies are Simultaneously Conducted", Biogen made the wrong decision and discontinued its pivotal studies (EMERGE and ENGAGE), one of which was later found to have a statistically significant treatment effect. The wrong decision was driven by two issues:

  • they calculated the conditional powers based on the pooled data from both studies (instead of calculating the conditional powers separately from the data of each individual study) - this was discussed in the previous post
  • they stopped the studies early without re-running the interim analysis with the overrunning subjects included.

The overrunning issue was mentioned in a recent article in the Wall Street Journal (Jan 4, 2022), "How Biogen Fumbled Its Alzheimer's Drug - Once-promising Aduhelm is pricey and without proven efficacy":

"By evaluating data midstream in approval-seeking trials, companies can try to predict whether a drug will succeed if the trial continues. Stopping trials early for "futility," in industry parlance, can save millions of dollars and prevent patients from investing hope on an ineffective drug.

In a March 2019 meeting, Biogen executives on a small "senior decision team," as the company called it, concluded that the trials were doomed. Biogen pulled the plug and asked researchers around the world to shut down trials. It told more than 3,000 Alzheimer's patients who had volunteered that they would no longer receive treatment.

Biogen stock fell by nearly 30% the day of the announcement.

Biogen executives made errors in shutting down the trials. The trial plan called for analyzing data after half of patients completed the study treatment in late December 2018. By the time Biogen completed the analysis in March 2019, three more months of additional data were available -- but the decision team didn't scrutinize the additional data before the company halted the trials, Biogen has said.

A Biogen consultant in the summer of 2018 recommended to senior Biogen statisticians that they consider all available trial data, according to a person involved in the process. The consultant cautioned them that a plan to leave out consideration of additional trial data after the cutoff date -- and to leave out certain data from patients in the trial before the cutoff -- would open up Biogen to criticism and scrutiny, the person said. The statisticians didn't heed the consultant's advice, and it isn't clear whether the decision team or management considered the recommendation, the person said.

Biogen declined to comment on past discussions with its consultants but said it followed its pre-established statistical-analysis plan.

The decision not to consider the three months of additional data was a misstep, said some clinical-trial experts and statisticians. "Additional data after a study stops is called 'overrunning.' We plan for it," said Scott Emerson, a professor emeritus of biostatistics at the University of Washington who served on an FDA advisory committee that recommended against approving Aduhelm in November 2020. "In this case, the overrunning data was large."

The Biogen spokeswoman said: "Our decision to stop the trials, though clearly incorrect in hindsight, was based on putting patients at the forefront -- as it always should be. Cost was not considered in determining futility."

Only in the weeks after the trials stopped did Biogen scientists complete a preliminary analysis of the overrunning data and recognize their mistake, the Biogen spokeswoman said. The data seemed to show that one of the trials would have produced positive results, despite the likelihood of a negative outcome in the second trial. Initially, Biogen had analyzed combined data from both trials."

Even though aducanumab was finally approved by the FDA through the accelerated approval pathway, the approval was very controversial - it is the first drug, and so far the only drug, approved based on studies that had been stopped early for futility. There has been strong pushback from academia against the use of aducanumab because of its unproven efficacy, and perhaps also because of the drama of resurrecting a drug that had been declared a failure.

We all wonder: had these two pivotal studies not been stopped for futility by the sponsor, what would the situation be now for aducanumab?

Saturday, January 01, 2022

Futility Analysis and Conditional Power When Two Phase 3 Studies are Simultaneously Conducted

In late-phase clinical trials, an independent Data Monitoring Committee (DMC) is usually set up. If the clinical program includes multiple late-phase studies, the same DMC is responsible for the entire program. With a DMC in place, interim analyses can be performed for different purposes:
  • The interim analysis for safety
    • with a pre-specified stopping rule (for example, stop the trial if there is a significant imbalance in the number of serious adverse events or in the number of deaths)
    • without a pre-specified stopping rule (relying on the DMC members to review the overall safety)
  • The interim analysis for efficacy: to see if the new treatment is overwhelmingly better than the control group - then stop the trial for efficacy
  • The interim analysis for futility (futility analysis): to see if the new treatment is unlikely to be better than the control group, or if the study is unlikely to achieve its objective given the data at the interim - then stop the trial for futility.
There seem to be more studies with a built-in futility analysis but without an interim analysis for overwhelming efficacy, mainly because of concerns about the alpha spending for efficacy. A futility analysis affects the beta spending and the statistical power, but not the alpha spending. For decision-making, regulatory agencies are usually more concerned about the alpha level (incorrectly approving a drug that does not work) or alpha inflation; sponsors are more concerned about the statistical power (incorrectly concluding that a drug does not work when it actually does).

A futility analysis usually requires calculating the conditional power (CP), defined as the probability that the final study result will be statistically significant, given the data observed so far at the time of the interim data cut and a specific assumption about the pattern of the data to be observed in the remainder of the study - for example, assuming the original design effect (the alternative hypothesis) or the effect estimated from the interim data.

If there is one single pivotal trial, the stopping rule and the CP are relatively straightforward. However, it is not uncommon for a sponsor to need to conduct two pivotal (phase 3) studies (two adequate and well-controlled (A&WC) trials, in FDA's terms) to demonstrate substantial evidence of effectiveness, as outlined in the FDA guidance "Demonstrating Substantial Evidence of Effectiveness for Human Drug and Biological Products Guidance for Industry".

For a clinical program with two independent A&WC trials (usually with identical designs), the futility analysis and CP calculation are a little more complicated. Two independent A&WC trials may have an identical design but be executed differently (i.e., they may not start at the same time, may be conducted in different geographic regions/countries, and may have different enrollment speeds, ...).

When a futility analysis is performed for two A&WC trials, should the conditional powers be calculated for each study separately, or should they be calculated for both studies together (i.e., from the pooled data of both studies)?

When there are two identical A&WC trials: the interim analysis for safety should be based on the pooled data from both studies, because it gives a more definitive answer to safety issues; the interim analysis for efficacy should be based on individual study data, because the decision about overwhelming efficacy should rest on each individual study, not the integrated data from the two studies; and the interim analysis for futility is a little more complicated - whether to use the data from an individual study or the pooled data seems to depend on how close the observed results from the two A&WC trials are at the time of the interim analysis.

For a futility analysis using the stochastic curtailment procedure, CPs can be calculated for each individual study assuming that the treatment effect in the remaining subjects of that study will follow the treatment effect estimated from that same study's data at the time of the interim data cut.

There is an alternative way to calculate the CP: still calculate a CP for each individual study, but use the observed treatment effect from the pooled interim data of both studies to project the trend and pattern for the remaining subjects.

According to the paper by Lan and Wittes (1988), "The B-Value: A Tool for Monitoring Data", the CP calculation involves decomposing the final B-value (B1, the B-value at the scheduled end of the study) into the sum of two statistically independent increments:
  • Bt, the value of B accumulated up through time t, when the interim analysis is conducted; and
  • (B1 - Bt), the incremental value of B that accumulates from time t through the end of the study. The legitimacy of the decomposition follows from the independence of the distributions of the outcomes for successive study subjects.
At the time t when the interim analysis is conducted, Bt is known, estimated from the observed data up to time t. (B1 - Bt) is a random variable whose distribution must be projected. The conditional power is derived by fixing Bt and calculating the probability that Bt + (B1 - Bt) will exceed the final critical value z(1-α/2).
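Concretely, under the Brownian-motion framework of Lan and Wittes, Bt = Zt·√t and, for a drift parameter θ, the increment (B1 - Bt) is normally distributed with mean θ(1 - t) and variance (1 - t). Fixing Bt = b, the conditional power is

    CP(θ) = 1 - Φ( [z(1-α/2) - b - θ(1 - t)] / √(1 - t) )

where Φ is the standard normal distribution function. Under the 'current trend' assumption, θ is estimated as θ̂ = b / t; under the 'original design' assumption, θ is the drift implied by the protocol-specified effect size.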

To calculate the CPs when there are two identical A&WC studies, t, as a measure of the information fraction, will differ between the studies: at the time of the interim analysis, maybe 60% of subjects have been enrolled in study #1 while only 50% have been enrolled in study #2. In the CP calculations, the Bt part is obtained from the individual study. The (B1 - Bt) part is projected assuming that the remaining data follow the effect observed up to the interim time t - but should that observed effect be based on the data from the individual study or on the pooled data?

It turns out both approaches can be used:
  • estimate the treatment difference for each individual study and calculate the CP assuming that the remaining data follow the trend and pattern of the observed data from that individual study
  • estimate the treatment difference from both studies combined and calculate the CP assuming that the remaining data follow the trend and pattern of the observed data pooled from the two studies.
With either approach, a CP is calculated for each individual study (therefore one CP per study). The difference between the two approaches lies in the (B1 - Bt) part - whether the projection is based on the individual study itself or on the pooled data from both studies, as sketched below.
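A minimal SAS sketch contrasting the two approaches for a single study, using the B-value formula above (all input values are illustrative, not actual trial data):

data cp_two_ways;
  t      = 0.5;                 * information fraction for this study at the interim;
  z_t    = 0.8;                 * illustrative interim z-statistic for this study;
  b_t    = z_t * sqrt(t);       * B-value accumulated up to time t;
  z_crit = probit(0.975);       * final critical value for two-sided alpha = 0.05;

  * Approach 1: project the remaining data using the trend from this study alone;
  theta_own = b_t / t;
  cp_own    = 1 - probnorm((z_crit - b_t - theta_own*(1 - t)) / sqrt(1 - t));

  * Approach 2: project using the drift estimated from the pooled data of both studies;
  theta_pooled = 0.5;           * illustrative pooled drift estimate;
  cp_pooled = 1 - probnorm((z_crit - b_t - theta_pooled*(1 - t)) / sqrt(1 - t));

  put cp_own= cp_pooled=;
run;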

We can take a look at the famous and controversial case of Biogen's aducanumab program in Alzheimer's disease. The aducanumab program consisted of two pivotal phase 3 studies, EMERGE (Study 302) and ENGAGE (Study 301); both studies had the same design and were conducted simultaneously around the globe. Each study had two active arms (low-dose and high-dose aducanumab) versus placebo - therefore two hypothesis tests (low dose vs. placebo and high dose vs. placebo) per study, for a total of four hypothesis tests. The protocol and SAP pre-specified the interim analysis for futility.

An interim analysis was performed after approximately 50% of the subjects had the opportunity to complete the Week 78 visit in both the EMERGE and ENGAGE studies. The interim analysis for futility of the primary endpoint was performed to allow early termination of the studies if it was evident that the efficacy of aducanumab was unlikely to be achieved. The futility criteria were based on conditional power - the chance that the primary efficacy endpoint analysis would be statistically significant in favor of aducanumab at the planned final analysis, given the data at the interim analysis. The CP was calculated assuming that the future unobserved effect was equal to the maximum likelihood estimate from the observed interim data.

For each study, two CPs were calculated (one per dose arm). The pre-specified CP calculation used the pooled interim data from both the EMERGE and ENGAGE studies for the (B1 - Bt) part, assuming that the treatment effect for the remainder of each study would follow the treatment effect observed at the interim analysis. At the interim analysis, the CPs were calculated to be 13% for low dose vs. placebo and 0% for high dose vs. placebo in the EMERGE study, and 11% for low dose vs. placebo and 12% for high dose vs. placebo in the ENGAGE study. Given that all four CPs were below the threshold of 20% (the criterion for futility), the DMC recommended stopping both studies for futility. Biogen followed the DMC recommendation and stopped both the EMERGE and ENGAGE studies.

Only after the two terminated studies were wrapped up did the reanalyses of the final data indicate a statistically significant treatment difference in one of the studies (the EMERGE study). With the help of the FDA, Biogen was able to submit the BLA and obtain approval of aducanumab for Alzheimer's disease. Leading up to the FDA approval, there was an advisory committee meeting to review the aducanumab data. In FDA's presentation, the conditional powers were retrospectively re-calculated - this time, the conditional powers were calculated for each individual study, assuming the future unobserved effect would be similar to that study's own interim data (not the pooled interim data). FDA claimed that CPs calculated this way were more appropriate and would have put one of the four CPs above the threshold of 20% (CP = 59% for high dose vs. placebo in the EMERGE study) - the studies would not have been recommended to stop for futility.


Retrospectively, CPs calculated for each study independently (not using the pooled interim data to project the trend and pattern for the remaining data) seemed to be the better choice in the Biogen aducanumab program with its two A&WC trials.

However, in a paper by Deng et al., "Superiority of combining two independent trials in interim futility analysis", CP calculation using the observed treatment effects from the pooled interim data of the two studies was considered the better approach. The paper concluded: "it is demonstrated that by leveraging data from the other study, the probability of making correct interim decision is increased if the treatment effects are similar between the two studies, and such benefit remains even if there is small to moderate between-study difference."

It is probably true that calculating the CP using the pooled interim data to project the trend and pattern for the remaining data is the better approach if the two studies are conducted in the same way and the results at the time of the interim analysis are similar. However, the CP calculation and the statistical analysis plan for the interim analysis are usually pre-specified before any unblinded data are seen. At the time of the interim analysis, it is usually unknown whether the results (treatment effects) observed in two identically designed studies will be similar. Even though two A&WC studies are designed the same, the operation and execution of the trials can still differ: the two studies may be conducted in different countries, the enrollment speed may differ, and so on. As evidenced by Biogen's EMERGE and ENGAGE trials, two identically designed studies can produce different results - therefore, calculating the CP entirely independently for each study may be more appropriate when two identical A&WC trials are conducted.