Monday, April 26, 2021

Within Patient Benefit-Risk Evaluation? Using Outcomes to Analyze Patients versus Using Patients to Analyze Outcomes?

In our daily life, benefit-risk evaluation is something we always do, whether we realize it or not. Benefit-risk evaluation is especially critical in drug development and in the regulator's decision process. We often hear that a drug is approved because its benefits outweigh its risks. In the recent decision to resume use of the J&J Covid-19 vaccine, the CDC and the FDA cited that the benefits of rolling out the J&J vaccine outweigh the risk of the rare blood clot (cerebral venous sinus thrombosis, or CVST) seen in some young women who received it. 

In a recent New York Times article, "Irrational Covid Fears", the benefit and risk of the Covid-19 vaccine are illustrated with a fable about automobiles: 
A fable for our times
Guido Calabresi, a federal judge and Yale law professor, invented a little fable that he has been telling law students for more than three decades.
He tells the students to imagine a god coming forth to offer society a wondrous invention that would improve everyday life in almost every way. It would allow people to spend more time with friends and family, see new places and do jobs they otherwise could not do. But it would also come with a high cost. In exchange for bestowing this invention on society, the god would choose 1,000 young men and women and strike them dead.
Calabresi then asks: Would you take the deal? Almost invariably, the students say no. The professor then delivers the fable’s lesson: “What’s the difference between this and the automobile?”
In truth, automobiles kill many more than 1,000 young Americans each year; the total U.S. death toll hovers at about 40,000 annually. We accept this toll, almost unthinkingly, because vehicle crashes have always been part of our lives. We can’t fathom a world without them.
It’s a classic example of human irrationality about risk. We often underestimate large, chronic dangers, like car crashes or chemical pollution, and fixate on tiny but salient risks, like plane crashes or shark attacks.
One way for a risk to become salient is for it to be new. That’s a core idea behind Calabresi’s fable. He asks students to consider whether they would accept the cost of vehicle travel if it did not already exist. That they say no underscores the very different ways we treat new risks and enduring ones.
I have been thinking about the fable recently because of Covid-19. Covid certainly presents a salient risk: It’s a global pandemic that has upended daily life for more than a year. It has changed how we live, where we work, even what we wear on our faces. Covid feels ubiquitous.
Fortunately, it is also curable. The vaccines have nearly eliminated death, hospitalization and other serious Covid illness among people who have received shots. The vaccines have also radically reduced the chances that people contract even a mild version of Covid or can pass it on to others.
Yet many vaccinated people continue to obsess over the risks from Covid — because they are so new and salient.
This article reminds me of the seminars presented by Scott Evans. In his seminars (for example, the one posted on YouTube), he starts with a hypothetical question:
If you are given a choice between drug A and drug B - drug A increases your intelligence but decreases your good looks; drug B increases your good looks but decreases your intelligence - which drug would you choose? 
This is a typical question about benefit-risk evaluation or the benefit-risk tradeoff. With this question, he introduced an alternative (arguably optimal) way to perform the benefit-risk evaluation: assessing benefit-risk at the individual patient level before aggregating the data at the group level.  
Currently, in clinical trials, the benefit (efficacy) evaluation and the risk (safety) evaluation are performed independently. The study protocol is designed to show the benefit (efficacy): selecting a sensitive and clinically meaningful efficacy endpoint, ensuring a sufficiently large sample size for statistical power, and specifying sound statistical analysis methods all serve to ensure that the efficacy results can demonstrate the benefit of the new drug. FDA has issued specific guidance only for efficacy: "Demonstrating Substantial Evidence of Effectiveness for Human Drug and Biological Products".

Risk (safety) evaluation is usually performed separately from the efficacy evaluation. While we collect extensive data for the risk (safety) analysis (adverse events, serious adverse events, deaths, clinical laboratory results, ECG results, vital signs, ...), the analyses of safety data are usually based on descriptive summaries (no hypothesis testing) to assess the nature/pattern of the serious adverse events, their relationship to the investigational new drug, and whether there are elevated levels of certain laboratory parameters. Safety analyses involve a lot of subjective judgment, and different reviewers may come to different conclusions. 

There is no separate guidance from FDA specifically about the risk (safety) assessment. Instead, the safety assessment is covered in FDA's Good Review Practice: Clinical Review Template - a checklist for FDA reviewers in evaluating safety. 

Only after the efficacy and safety are separately analyzed and evaluated is a benefit-risk section written as a formal evaluation of benefit-risk - this usually appears in CTD Modules 1 and 2. 

This approach of assessing efficacy and safety separately evaluates the average effect (efficacy or safety) in the entire study population. The benefit or risk cannot easily be translated to the individual patient level. In clinical trials, it is almost impossible to decide whether a drug is good (i.e., the benefit outweighs the risk) for a specific patient. We have to wait for the aggregate data to determine the benefit and risk at the group level. 

With advances in precision medicine and pharmacogenomics, we hope that, in the future, within-patient benefit-risk evaluation can be performed. At present (and probably for the foreseeable future), the benefit-risk evaluation (or efficacy-safety evaluation) will still be primarily based on the population level to assess the average group effect: 
  • Average effect (using patients to analyze outcomes)
  • Subgroup analyses to identify the prognostic factors (phenotypes) that help identify the patients who are more likely to respond to the therapy with fewer side effects
  • Targeted therapies and precision medicine to identify the genetic biomarkers (genes) that help identify the subgroup of patients who are more likely to respond to the therapy with fewer side effects  
  • Individual effect - within-patient benefit-risk evaluation 
Even with targeted therapy, it is still not possible to be certain whether a therapy will be good (i.e., the benefit outweighs the risks) for a specific patient. 

For the J&J Covid-19 vaccine issue, it seems clear that the vaccine does appear to increase the risk of the rare blood clot, CVST. Since CVST is so rare, the benefit of receiving the Covid-19 vaccine outweighs the risk of the rare blood clot - but this assessment is on the population as a whole. When it comes to the individual person, it will be his/her own choice - the risk is small, but may be there.  

Monday, April 19, 2021

Restricted Mean Survival Time (RMST) for Handling the Non-Proportional Hazards Time to Event Data

Time to event analysis (traditionally, survival analysis) is one of the most common analyses in clinical trials. In general, time to event analyses rely on the assumption of proportional hazards. However, quite frequently, we may find that the proportional hazards assumption is violated, especially in many immuno-oncology trials. When the proportional hazards assumption is violated, alternative approaches may be needed to analyze the data and preserve statistical power. As discussed in the previous post "Non-proportional Hazards: how to analyze the time-to-event data?", one of the alternative approaches is the restricted mean survival time (RMST) method. 

RMST is one of the Kaplan-Meier-based methods: it essentially calculates and compares the areas under the Kaplan-Meier curves (AUCs) for the different treatment or comparison groups. It has been said that RMST analysis has the following advantages:
  • Model-free, robust, and easily interpretable treatment effect information
  • Retains power under the patterns of difference (e.g., delayed or crossing survival curves) observed in some recent oncology clinical trials
  • An approach accepted by regulatory agencies and industry leaders
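Since the RMST is just the area under the Kaplan-Meier curve up to the truncation time tau, the calculation is easy to sketch without any packages. Below is a minimal pure-Python illustration (the function names and toy data are mine, not from SAS or survRM2):

```python
def km_curve(times, events):
    """Kaplan-Meier step points (t, S(t)), starting at (0, 1).

    times  : observed times (event or censoring)
    events : 1 = event, 0 = censored
    """
    pts = [(0.0, 1.0)]
    s, at_risk = 1.0, len(times)
    for t in sorted(set(times)):
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        c = sum(1 for x, e in zip(times, events) if x == t and e == 0)
        if d > 0:
            s *= 1.0 - d / at_risk
            pts.append((float(t), s))
        at_risk -= d + c
    return pts

def rmst(times, events, tau):
    """Restricted mean survival time = area under the K-M step curve on [0, tau]."""
    pts = km_curve(times, events)
    area = 0.0
    for i, (t, s) in enumerate(pts):
        if t >= tau:
            break
        t_next = pts[i + 1][0] if i + 1 < len(pts) else tau
        area += s * (min(t_next, tau) - t)
    return area

# Toy example: four subjects with events at months 1, 2, 3, 4.
# S(t) steps down by 1/4 at each event, so RMST up to tau = 4 is
# 1*1 + 0.75*1 + 0.5*1 + 0.25*1 = 2.5 months.
print(rmst([1, 2, 3, 4], [1, 1, 1, 1], tau=4.0))  # -> 2.5
```

The between-group RMST difference is then simply rmst(...) for one arm minus rmst(...) for the other, using the same tau for both.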
RMST was mentioned in the recent FDA guidance for industry (2020), Acute Myeloid Leukemia: Developing Drugs and Biological Products for Treatment, as an alternative approach for analyzing the data when non-proportional hazards occur (e.g., a plateauing effect): 

"Plateauing Effect

Trials designed to cure AML often result in survival contours characterized by an initial drop followed by a plateauing effect after some time point post randomization. This is an example of nonproportional hazards. While the log-rank test is somewhat robust to nonproportionality, it generally results in loss of power. Furthermore, nonproportionality can cause difficulty in describing the treatment effect. FDA is open to discussion about analyses based on other approaches, such as weighted Cox regression or other weighted methods, or summarizing the treatment effect using restricted mean survival time (RMST) or landmark survival analysis. Plans that use these alternative approaches should include:
    • justification for what constitutes clinically meaningful difference,
    • justification of design parameters, such as sample size and follow-up duration, based on this endpoint, and
    • justification for the value of the threshold that will be used to calculate the RMST."
RMST analysis has also been used as a primary analysis approach or for sensitivity analysis in FDA reviews: 

In the NDA of baloxavir marboxil for the treatment of acute, uncomplicated influenza, both the applicant and the FDA reviewer analyzed the data using RMST. The review stated:
Restricted mean survival time (RMST) up to Day 10 was estimated for each treatment group along with the difference between RMST in the two treatment groups. RMST is a measurement of the average survival from time 0 to a specified time point (e.g., 10 days) which is equivalent to the area under the Kaplan-Meier curve from the beginning of the study through that time point.

At an FDA CDRH Medical Devices Advisory Committee Circulatory System Panel meeting in 2019, the independent statistical consultant addressed the analysis issue when the proportional hazards assumption is violated:

The proposal they made was the restricted mean survival time. The restricted mean survival time is area under curve. Please note the word restricted. Mean survival time is over a period of time, according to the rules that have been laid out, so that you're not looking, like with proportional hazards, over all the follow-up that could have possibly happened or in binary where you're only looking at the patients that survive. The restricted mean would say we're going to look between, let's say, 0 and 5 years because we have sufficient information to make that kind of assessment.

The paper showed that the restricted mean has just as much power as proportional hazards when the assumptions are there for proportional hazards, and then has more power when the assumptions are violated.

There's also some advantages in terms for clinicians, in terms of explaining this to the patient. It's hard to talk about hazards or number needed to treat. But if you could say to a patient over a 60-month period the average survival time is 55 months with Device A versus 52 months with Device B, now they can look at what their life is going to look like in the next 60 months and make a decision.

Unfortunately, it was not me who noticed this. This was actually from a presentation by FDA. Several very smart statisticians had talked about the restricted mean and have made recommendations on using it for both proportional violations and for its interpretation.

In the FDA Briefing Document for the Oncologic Drugs Advisory Committee Meeting (December 17, 2019) to review olaparib for the maintenance treatment of adult patients with deleterious or suspected deleterious germline BRCA-mutated (gBRCAm) metastatic adenocarcinoma of the pancreas, FDA statisticians used RMST in sensitivity analyses:

FDA performed a test to evaluate whether the proportional hazard assumption was met. This test failed to detect evidence of non-proportionality; however, such a test may lack power to detect non-proportionality due to the small sample size. The Kaplan-Meier curves of PFS appear to show some degree of nonproportionality. The curves did not show separation until approximately 4 months, after approximately 53% of patients either had events or were censored. FDA performed additional sensitivity analyses by applying the restricted mean survival time (RMST) method using different truncation points (15 months and 18 months). The truncated time was selected (15 or 18 months) such that approximately 8-12% patients remained at risk. Based on the truncation times, the estimated RMST difference in PFS between arms ranged from 2.6 months (95% CI: 0.9, 4.3) to 3.1 months (95% CI: 1.0, 5.2). The range of the RMST differences again demonstrated great variation in the difference in PFS and the lower ends did not suggest that there was a clinically meaningful difference.

Thanks to software support, RMST analyses can be easily implemented in SAS or R. In recent versions of SAS/STAT (15.1 or above), RMST is available in SAS Proc LIFETEST (with the RMST option) and in Proc RMSTREG. See the nice paper by Guo and Liang (2019), "Analyzing Restricted Mean Survival Time Using SAS/STAT®".
With R, the package for RMST analysis is survRM2, developed by Hajime Uno from the Dana-Farber Cancer Institute.

For RMST analysis, it is important to select the cut-off value (tau) for the truncation time. Different choices of tau will give different results, and the selection of tau can sometimes be arbitrary. In the FDA briefing document above, the FDA statistician chose the truncation time such that approximately 8-12% of patients remained at risk.
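The at-risk rule of thumb quoted above is easy to automate. The sketch below (hypothetical helper names; the 8-12% band is simply the range quoted from the FDA briefing document) scans the observed follow-up times for truncation candidates:

```python
def pct_at_risk(times, t):
    """Fraction of subjects still at risk (observed time >= t) just before t."""
    return sum(1 for x in times if x >= t) / len(times)

def candidate_taus(times, lo=0.08, hi=0.12):
    """Observed times where roughly lo-hi of subjects remain at risk.

    Mimics the rule of thumb quoted from the FDA briefing document;
    this is an illustration, not an official algorithm.
    """
    return [t for t in sorted(set(times)) if lo <= pct_at_risk(times, t) <= hi]

# 20 subjects followed for 1..20 months: at month 18, 3/20 = 15% remain
# at risk; at month 19, 2/20 = 10% remain, so 19 qualifies under 8-12%.
follow_up = list(range(1, 21))
print(candidate_taus(follow_up))  # -> [19]
```

In practice one would also report the RMST difference over a small range of such taus, as FDA did (15 and 18 months), to show the sensitivity of the conclusion to the choice.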

There are different ways to calculate the RMST:

  • Non-parametric method
  • Regression Analysis Method
  • Pseudo-value Regression Method
  • Inverse Probability of Censoring Weighting (IPCW) regression
  • Conditional restricted mean survival time (CRMST)

According to the paper by Guo and Liang (2019) "Analyzing Restricted Mean Survival Time Using SAS/STAT®", non-parametric analysis can be implemented using Proc Lifetest; regression analysis, pseudo-value regression, and IPCW regression can be implemented using SAS Proc RMSTREG. 

FDA statisticians have also proposed an approach called 'conditional restricted mean survival time' or CRMST. This approach was described in the paper by Qiu et al (2019) "Estimation on conditional restricted mean survival time with counting process" and also in a presentation by Lawrence and Qiu (2020) Novel Survival Analysis When Hazards Are Nonproportional and/or There Are Multiple Types of Events. CRMST allows the AUC under the K-M curve to be calculated over a time interval (not necessarily starting from time 0). They claim CRMST is better for event-driven studies where the time to the first event is of interest. They concluded the following: 
CRMST possesses all the desirable statistical properties of RMST. In particular, it does not rely on proportional hazard assumption. In addition, CRMST measures an average event-free time in the time range at issue and has straightforward interpretation. In case that two survival curves cross, CRMST can be estimated separately before and after crossing and the CRMST differences can be used to assess benefit versus harm.
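As a rough sketch of the idea, the conditional version can be computed by integrating the Kaplan-Meier curve over an interval [t0, t1] and rescaling by S(t0). Note this is my reading of the concept, not the formal counting-process estimator in Qiu et al (2019):

```python
def km_curve(times, events):
    # Kaplan-Meier step points (t, S(t)), starting at (0, 1); 1 = event, 0 = censored
    pts = [(0.0, 1.0)]
    s, at_risk = 1.0, len(times)
    for t in sorted(set(times)):
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        c = sum(1 for x, e in zip(times, events) if x == t and e == 0)
        if d > 0:
            s *= 1.0 - d / at_risk
            pts.append((float(t), s))
        at_risk -= d + c
    return pts

def crmst(times, events, t0, t1):
    """Area under the K-M curve on [t0, t1], rescaled by S(t0).

    Dividing by S(t0) lets the value read as the expected event-free
    time within [t0, t1] given the subject is event-free at t0 (my
    reading of CRMST; see Qiu et al 2019 for the formal estimator).
    """
    pts = km_curve(times, events)
    area, s_t0 = 0.0, 1.0
    for i, (t, s) in enumerate(pts):
        if t <= t0:
            s_t0 = s                      # S(t0) from the step function
        t_next = pts[i + 1][0] if i + 1 < len(pts) else t1
        lo, hi = max(t, t0), min(t_next, t1)
        if hi > lo:
            area += s * (hi - lo)
    return area / s_t0

# Events at 1, 2, 3, 4; conditional on surviving past month 1,
# expected event-free time in [1, 4] is (0.75 + 0.5 + 0.25) / 0.75 = 2.0.
print(crmst([1, 2, 3, 4], [1, 1, 1, 1], t0=1.0, t1=4.0))  # -> 2.0
```

With t0 = 0 this reduces to the ordinary RMST, which is consistent with CRMST being described as a generalization.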


Monday, April 05, 2021

Non-proportional Hazards: how to analyze the time-to-event data?

Time to event data is one of the most common data types in clinical trials. Traditionally, the log-rank test is used to compare the survival curves of two treatment groups; the Kaplan-Meier survival plot is used to illustrate the totality of the time-to-event kinetics, including the estimated median survival time; and the Cox proportional hazards model is employed to provide the estimated relative effect (i.e., the hazard ratio) between treatment arms. The performance of these analyses largely depends on the proportional hazards (PH) assumption - that the hazard ratio is constant over time - under which the hazard ratio provides an average relative treatment effect over time.

Before the time to event data are analyzed, it is typical for statisticians to check the proportional hazards assumption. Various methods can be used to do so - see a previous post "Visual Inspection and Statistical Tests for Proportional Hazard Assumption".

Recently, we have seen more examples of time to event data not following the proportional hazards assumption, especially in immuno-oncology clinical trials. 

It is not the end of the world if the proportional hazards assumption is violated; various approaches have been proposed to handle time to event data with non-proportional hazards. 

In practice, it is pretty common that in the statistical analysis plan we prespecify the log-rank test to calculate the p-value and then use the Cox proportional hazards regression model to calculate the hazard ratio, its 95% confidence interval, and a p-value - I call this 'splitting the p-value and the estimate of the treatment difference'. Two different p-values will be calculated: one from the log-rank test and one from the Cox regression. If the proportional hazards assumption is met, it is better to use the p-value from the Cox regression since all estimates and the p-value then come from the same model. However, when the proportional hazards assumption is violated, the Cox proportional hazards model may no longer be the optimal approach to determine the treatment effect, and the Kaplan-Meier estimate of median survival may not be the most valid measure to summarize the results. 

In a website post "Testing equality of two survival distributions: log-rank/Cox versus RMST", the author stated:
“One thing to note is that the log-rank test does not assume proportional hazards per se. It is a valid test of the null hypothesis of equality of the survival functions without any assumptions (save assumptions regarding censoring). It is however most powerful for detecting alternative hypotheses in which the hazards are proportional.”
It is true that the log-rank test does not depend on the proportional hazards assumption. It remains a valid test of the null hypothesis of equality of the survival functions without any assumptions, even though it may not be optimal under non-proportional hazards. 

In a public workshop "Oncology Clinical Trials in the Presence of Non-Proportional Hazards" organized by Duke in 2018, Dr. Rajeshwari Sridhara from Division of Biometrics V, CDER/FDA stated (@40:45 of the YouTube video) that in the non-proportional hazards situation, FDA is ok with presenting the p-value from the log-rank test together with the hazard ratio as the measure of the treatment difference. 

At this same workshop, the cross-industry non-proportional hazards (NPH) working group (under the ASA Biopharmaceutical Section) presented their work and proposed the 'max-combo' test as an alternative method to address the non-proportional hazards situation. The max-combo test is based on Fleming-Harrington (FH) weighted log-rank statistics. It tackles some of the challenges posed by non-proportional hazards: it robustly handles a range of non-proportional hazards types, can be pre-specified at the design stage, and chooses the appropriate weight in an adaptive manner while still controlling the family-wise Type I error rate.

Knezevic & Patil have a paper describing a SAS macro to perform the max-combo test (combination weighted log-rank tests): "Combination weighted log-rank tests for survival analysis with non-proportional hazards" (2020 SAS Global Forum). 
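The mechanics of the Fleming-Harrington weighted log-rank statistic, and hence of the max-combo, can be sketched in a few lines of Python. This is an illustration only, not the Knezevic & Patil macro; in particular, a proper max-combo p-value must account for the correlation among the component statistics (e.g., via a multivariate normal approximation), which the simple maximum below ignores:

```python
import math

def fh_logrank_z(times1, events1, times2, events2, rho=0.0, gamma=0.0):
    """Fleming-Harrington G(rho, gamma) weighted log-rank Z statistic.

    rho = gamma = 0 gives the ordinary log-rank test; (0, 1) up-weights
    late differences, (1, 0) up-weights early differences.
    """
    data = sorted([(t, e, 1) for t, e in zip(times1, events1)] +
                  [(t, e, 2) for t, e in zip(times2, events2)])
    n1, n2 = len(times1), len(times2)
    s_pool = 1.0          # left-continuous pooled K-M survival (weight base)
    num = var = 0.0
    for t in sorted(set(x[0] for x in data)):
        tied = [x for x in data if x[0] == t]
        d1 = sum(1 for _, e, g in tied if e == 1 and g == 1)
        d2 = sum(1 for _, e, g in tied if e == 1 and g == 2)
        d, n = d1 + d2, n1 + n2
        if d > 0 and n > 1:
            w = (s_pool ** rho) * ((1.0 - s_pool) ** gamma)
            num += w * (d1 - d * n1 / n)                 # observed minus expected
            var += w * w * d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
            s_pool *= 1.0 - d / n
        n1 -= sum(1 for *_, g in tied if g == 1)         # drop subjects tied at t
        n2 -= sum(1 for *_, g in tied if g == 2)
    return num / math.sqrt(var) if var > 0 else 0.0

def max_combo(times1, events1, times2, events2):
    """Max of |Z| over the four commonly used G(rho, gamma) weights."""
    return max(abs(fh_logrank_z(times1, events1, times2, events2, r, g))
               for (r, g) in [(0, 0), (1, 0), (0, 1), (1, 1)])
```

With identical data in the two groups, every component Z is 0 and the max-combo statistic is 0; under a delayed-effect pattern, the (0, 1) component is typically the largest, which is exactly why the adaptive maximum helps.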

The NPH working group has presented or published their work on numerous occasions. In addition to the max-combo test, there are several other methods for handling the non-proportional hazards situation: 
  • RMST (restricted mean survival time): according to a presentation by Lawrence et al from FDA, the idea of restricted mean survival time goes back to Irwin (1949) and was further developed for survival analysis by Uno et al. (2014). RMST is defined as the area under the survival curve up to t*, which should be pre-specified for a randomized trial. RMST may be loosely described as the event-free expectancy over the restricted period between randomization and a defined, clinically relevant time horizon, called t*. RMST analyses are now built into SAS with Proc LIFETEST and Proc RMSTREG. See the paper by Guo and Liang (2019) "Analyzing Restricted Mean Survival Time Using SAS/STAT®"
  • Piecewise exponential regression allows for separate estimates of an early and a late treatment effect. It is especially useful when the non-proportional hazards pattern involves crossing hazards. Piecewise exponential models can be fitted with SAS Proc MCMC or the R package pch
  • Estimation via the average hazard ratios (AHR) method of Schemper (2009) and the average regression effects (ARE) method of Xu and O’Quigley (2000) - the method can be implemented using the COXPHW package in R. COXPHW package is described as:
This package implements weighted estimation in Cox regression as proposed by Schemper, Wakounig and Heinze (Statistics in Medicine, 2009, doi: 10.1002/sim.3623). Weighted Cox regression provides unbiased average hazard ratio estimates also in case of non-proportional hazards. The package provides options to estimate time-dependent effects conveniently by including interactions of covariates with arbitrary functions of time, with or without making use of the weighting option. For more details we refer to Dunkler, Ploner, Schemper and Heinze (Journal of Statistical Software, 2018, doi: 10.18637/jss.v084.i02).

In a presentation by Kaur et al, "Analytical Methods Under Non-Proportional Hazards: A Dilemma of Choice", a range of methods for handling non-proportional hazards was described and compared. 

Earlier this year, Mehrotra and Marceau West published a paper describing their proposed method (5-STAR) to handle heterogeneity of the patient population and potential non-proportional hazards (Mehrotra and Marceau West (2021) Survival Analysis Using a 5-Step Stratified Testing and Amalgamation Routine (5-STAR) in Randomized Clinical Trials):

"The power of the ubiquitous logrank test for a between-treatment comparison of survival times in randomized clinical trials can be notably less than desired if the treatment hazard functions are non-proportional, and the accompanying hazard ratio estimate from a Cox proportional hazards model can be hard to interpret. Increasingly popular approaches to guard against the statistical adverse effects of non-proportional hazards include the MaxCombo test (based on a versatile combination of weighted logrank statistics) and a test based on a between-treatment comparison of restricted mean survival time (RMST). Unfortunately, neither the logrank test nor the latter two approaches are designed to leverage what we refer to as structured patient heterogeneity in clinical trial populations, and this can contribute to suboptimal power for detecting a between-treatment difference in the distribution of survival times. Stratified versions of the logrank test and the corresponding Cox proportional hazards model based on pre-specified stratification factors represent steps in the right direction. However, they carry unnecessary risks associated with both a potential suboptimal choice of stratification factors and with potentially implausible dual assumptions of proportional hazards within each stratum and a constant hazard ratio across strata.
We have developed and described a novel alternative to the aforementioned current approaches for survival analysis in randomized clinical trials. Our approach envisions the overall patient population as being a finite mixture of subpopulations (risk strata), with higher to lower ordered risk strata comprised of patients having shorter to longer expected survival regardless of treatment assignment. Patients within a given risk stratum are deemed prognostically homogeneous in that they have in common certain pre-treatment characteristics that jointly strongly associate with survival time. Given this conceptualization and motivated by a reasonable expectation that detection of a true treatment difference should get easier as the patient population gets prognostically more homogeneous, our proposed method follows naturally. Starting with a pre-specified set of baseline covariates (Step 1), elastic net Cox regression (Step 2) and a subsequent conditional inference tree algorithm (Step 3) are used to segment the trial patients into ordered risk strata; importantly, both steps are blinded to patient-level treatment assignment. After unblinding, a treatment comparison is done within each formed risk stratum (Step 4) and stratum-level results are combined for overall estimation and inference (Step 5)."
Non-proportional hazards and the NPH pattern are usually identified only after study unblinding, which poses challenges for pre-specifying the best approach to analyze time to event data with non-proportional hazards. The safest way is to prespecify both the log-rank test and the Cox proportional hazards regression; if the proportional hazards assumption turns out to be violated, the p-value from the log-rank test will be used as the measure of significance. One can also pre-specify the max-combo method as the primary method regardless of the PH assumption. 

Monday, March 15, 2021

Drug-related AEs, AE Causality, AE relationship, and SUSAR

In pre-marketing clinical trials or post-marketing drug uses, adverse events (AEs) can always occur. When AEs are reported, assessments need to be made to judge if the reported AEs are caused by the study drug or study treatment. There are different terms to describe this assessment: drug-related, causality, relationship to the study drug, attributable to the study drug,...

In clinical trials, the causality assessment is made by the investigator. In post-marketing experience (surveillance or spontaneous reporting), all reported AEs are considered related to the drug - the so-called 'adverse drug reactions'. 

In an earlier post, I discussed "Causality Assessment, Causality Categories for Reporting Adverse Events or Adverse Reactions" and summarized various ways to categorize AE causality: 

  • Not Related, Unlikely Related, Possibly Related, Related
  • Certain, Probable/Likely, Possible, Unlikely, Conditional/Unclassified
  • Other options: None, Unlikely, Possible, Probable, Not assessable

In the recently published CDISC CDASH eCRF form for AEs, causality is simply assessed as Yes/No.  

For statistical analyses of the AE data, drug-related AEs will usually be summarized and analyzed. Drug-related AEs are usually defined as the AEs with a causality assessment of at least possibly related to the investigational study drug (including the placebo). For example, using ICH E2B or CDISC criteria, AEs with a causality assessment of 'possibly related' or 'related' will be considered 'drug-related'. If the CDISC CDASH eCRF is followed, the drug-related AEs will be those with a 'Yes' answer to the 'Relationship to Study Treatment' question. Some of my European colleagues once argued that the 'unlikely' category should also be considered 'drug-related' - but I have never found any regulatory guidance to support this.  
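When programming the drug-related AE summary, the mapping from the collected causality category to a single 'related' flag is a simple lookup. The sketch below uses hypothetical variable names; the category list mirrors the schemes discussed above:

```python
# Categories counted as drug-related ("at least possibly related");
# 'not related' and 'unlikely related' are treated as not related.
RELATED = {"possibly related", "probably related", "related", "yes"}

def is_drug_related(causality):
    """Map a collected causality category to a drug-related yes/no flag."""
    return causality.strip().lower() in RELATED

aes = [
    {"term": "Headache", "causality": "Possibly Related"},
    {"term": "Nausea",   "causality": "Unlikely Related"},
    {"term": "Rash",     "causality": "Yes"},   # CDASH-style Yes/No collection
]
related_terms = [ae["term"] for ae in aes if is_drug_related(ae["causality"])]
print(related_terms)  # -> ['Headache', 'Rash']
```

Defining this mapping explicitly in the analysis plan avoids ambiguity when different studies collect different numbers of causality categories.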

It is difficult to compare drug-related AEs across studies conducted by different sponsors because the number of causality categories may differ: it can be anywhere from 2 (yes/no) to 5 categories. 

In some studies, additional causality assessments may be performed to judge if the AEs are related to the background therapies (especially in studies with add-on design) or related to the medical device (such as inhalation device and infusion pumps). The same AE could be assessed to be related to the study treatment, background therapy, and/or the medical device. 

For the past year, we have seen plenty of headlines about AE causality assessment in action in Covid-19 vaccine studies and in gene therapy studies. AE causality can be assessed on an individual case (the unexpected events in AZ's and J&J's Covid-19 vaccine trials) or on an aggregate level (the blood clot cases with AZ's Covid-19 vaccine). 

Causality assessment based on the individual case:

"After a thorough evaluation of a serious medical event experienced by one study participant, no clear cause has been identified. There are many possible factors that could have caused the event. Based on the information gathered to date and the input of independent experts, the Company has found no evidence that the vaccine candidate caused the event."

Causality assessment on an aggregate level: 

"A careful review of all available safety data of more than 17 million people vaccinated in the European Union (EU) and UK with COVID-19 Vaccine AstraZeneca has shown no evidence of an increased risk of pulmonary embolism, deep vein thrombosis (DVT) or thrombocytopenia, in any defined age group, gender, batch or in any particular country.

So far across the EU and UK, there have been 15 events of DVT and 22 events of pulmonary embolism reported among those given the vaccine, based on the number of cases the Company has received as of 8 March. This is much lower than would be expected to occur naturally in a general population of this size and is similar across other licensed COVID-19 vaccines. The monthly safety report will be made public on the European Medicines Agency website in the following week, in line with exceptional transparency measures for COVID-19."
One type of AE requires special attention and needs to be reported to the regulatory agencies and local IRBs/ECs expeditiously: the Suspected Unexpected Serious Adverse Reaction (SUSAR). According to the FDA guidance Safety Reporting Requirements for INDs and BA/BE Studies, SUSARs are those AEs meeting the following criteria: 

  • Serious (S)
  • Unexpected (U)
  • Suspected Adverse Reactions (SAR)

Fatal or life-threatening SUSARs should be reported to FDA no later than 7 calendar days after the sponsor's initial awareness; other SUSARs should be reported no later than 15 calendar days.
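The 7-day/15-day clock can be illustrated with a few lines of code. This is a simplified sketch of the rule, counting calendar days from the sponsor's awareness date; consult the guidance itself for the actual requirements:

```python
from datetime import date, timedelta

def susar_fda_due_date(aware_date, fatal_or_life_threatening):
    """Illustrative IND safety report due date: 7 calendar days for
    fatal/life-threatening SUSARs, 15 calendar days otherwise,
    counted from the sponsor's awareness date (simplified sketch)."""
    days = 7 if fatal_or_life_threatening else 15
    return aware_date + timedelta(days=days)

print(susar_fda_due_date(date(2021, 3, 1), True))   # -> 2021-03-08
print(susar_fda_due_date(date(2021, 3, 1), False))  # -> 2021-03-16
```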

We saw the news that Bluebird Bio temporarily suspended their gene therapy clinical trials due to a reported SUSAR of acute myeloid leukemia (AML).

Bluebird bio did their assessment and concluded that the gene therapy for sickle cell disease was not linked to the cancer (the AML event):

"The company released a statement yesterday (March 10) claiming an investigation has found “it is very unlikely” the AML is related to the therapy and the firm is seeking approval from the US Food and Drug Administration (FDA) to resume the trials.

“VAMP4 has no known association with the development of AML nor with processes such as cellular proliferation or genome stability,” Bluebird’s Chief Scientific Officer Philip Gregory says in the press release. Furthermore, the patient’s cells had mutations in other genes, which are related to leukemia."
Some additional comments on AE causality assessment:
  • AEs that occurred prior to the first dose of study treatment (i.e., non-treatment-emergent AEs) should always have causality 'unrelated' to the study treatment - an edit check should be in place to prevent investigators from entering a non-TEAE as drug-related. 
  • In blinded studies, if a SUSAR event is reported, the individual patient's treatment assignment should be unblinded so that the sponsor can assess the causality and report the SUSAR event appropriately. 
  • While drug-related AEs are typically summarized, analyzed, and included in the clinical study report, FDA reviewers will focus their review on all AEs and all SAEs regardless of causality. Drug-related AEs are usually not singled out in the drug label; instead, the label lists the most frequent AEs (whether or not they are drug-related).  
  • Sometimes, causality assessment by the investigator may be subjective and arbitrary to some degree. Important events (such as SUSAR and AESI (AE of special interest)) may be further reviewed by the sponsor, data monitoring committee, clinical event adjudication committee, and regulatory agencies. For example, some oncology drugs may induce pneumonitis/interstitial lung disease and these events of pneumonitis/interstitial lung disease can be reviewed and adjudicated by a committee. 
  • Statistical summaries and analyses of AE causality are always based on the investigator's assessment recorded in the clinical database. Causality assessments by the sponsor and other parties are not part of the clinical database and will be analyzed separately.
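The edit check described in the first bullet above might look like this; the field names and query text are hypothetical, and in a real EDC system this would be a validation rule on the AE CRF:

```python
from datetime import date
from typing import Optional

def causality_edit_check(ae_onset: date,
                         first_dose: Optional[date],
                         causality: str) -> Optional[str]:
    """Flag an AE that started before the first dose but was entered as drug-related.

    Returns a query message when the check fires, otherwise None.
    """
    treatment_emergent = first_dose is not None and ae_onset >= first_dose
    if not treatment_emergent and causality.lower() != "unrelated":
        return ("Query: AE onset precedes first dose (non-TEAE); "
                "causality must be 'unrelated'.")
    return None

# A pre-dose AE entered as 'related' raises a query:
msg = causality_edit_check(date(2021, 1, 5), date(2021, 1, 10), "related")
print(msg is not None)  # -> True
```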
Additional References: 

Wednesday, March 10, 2021

Intention-to-Treat Principle versus Treatment Policy Estimand: Different Names, but Same Meaning?

ICH E9 "Statistical Principles for Clinical Trials" was finalized in February 1998. The E9 guidelines established the intention-to-treat principle for the design and analysis of clinical trials. With the intention-to-treat principle, we are required to include all study participants (the full analysis set) in the analyses. Here are the definitions of 'full analysis set' and 'intention-to-treat principle' from ICH E9. 

In the 1990s, it took a while for people to understand and accept the intention-to-treat principle. We also saw the principle misused, overused, or undercut through the use of 'practical intention-to-treat' and 'modified intention-to-treat' analyses. I gave a presentation (in 2004) about the misuse/overuse of intention-to-treat and modified intention-to-treat. What I said then is still applicable today. 

The strict definition of intention-to-treat can be traced back to the book chapter by Fisher LD et al., "Intention to Treat in Clinical Trials," in Statistical Issues in Drug Research and Development, edited by Peace KE (1990). Intention-to-treat was defined as:

Includes all randomized patients in the groups to which they were randomly assigned, regardless of their adherence with the entry criteria, regardless of the treatment they actually received, and regardless of subsequent withdrawal from treatment or deviation from the protocol

The intention-to-treat principle includes all randomized subjects in the analyses and ignores what happens to the subjects after randomization (whether the subject discontinued the study drug, took prohibited or rescue therapies, crossed over to the alternate treatment, ...), which is obviously not the best option for estimating the treatment effect in some situations. This led to the development of the addendum to ICH E9, "ICH E9 (R1) Estimands and Sensitivity Analysis in Clinical Trials". ICH E9 (R1) explained the issues with the intention-to-treat principle and introduced the new concepts of estimands (including the treatment policy estimand) and intercurrent events. 

This addendum clarifies and extends ICH E9 in respect of the following topics. Firstly, ICH E9 introduced the Intention-To-Treat (ITT) principle in connection with the effect of a treatment policy in a randomised controlled trial, whereby subjects are followed, assessed and analysed irrespective of their compliance to the planned course of treatment, indicating that preservation of randomisation provides a secure foundation for statistical tests. Multiple consequences arising from the ITT principle can be distinguished. Firstly, that the trial analysis should include all subjects relevant for the research question. Secondly, that subjects should be included in the analysis as randomised. Taken directly from the definition of the ITT principle (see ICH E9 Glossary), a third consequence is that subjects should be followed-up and assessed regardless of adherence to the planned course of treatment and that those assessments should be used in the analysis. It remains undisputed that randomisation is a cornerstone of controlled clinical trials and that analysis should aim at exploiting the advantages of randomisation to the greatest extent possible. However, the question remains whether estimating an effect in accordance with the ITT principle always represents the treatment effect of greatest relevance to regulatory and clinical decision making. The framework outlined in this addendum gives a basis for describing different treatment effects and some points to consider for the design and analysis of trials to give estimates of these treatment effects that are reliable for decision making. Secondly, issues considered generally under data handling and “missing data” (see Glossary) are re-visited. Two important distinctions are made. 

With the intention-to-treat principle, subjects who discontinue the study drug prematurely should continue to be followed up, and their data after dose discontinuation should continue to be collected. However, in practice, many studies stopped data collection for subjects who discontinued the study drug, or collected the post-discontinuation data but did not use them in the analyses. To some extent, the intention-to-treat principle was not fully followed. That is why the FDA issued its guidance "Data Retention When Subjects Withdraw from FDA-Regulated Clinical Trials" to encourage data collection after subjects withdraw from the study. As discussed in the guidance:

The validity of a clinical study would also be compromised by the exclusion of data collected during the study. There is long-standing concern with the removal of data, particularly when removal is non-random, a situation called “informative censoring.” FDA has long advised “intent-to-treat” analyses (analyzing data related to all subjects the investigator intended to treat), and a variety of approaches for interpretation and imputation of missing data have been developed to maintain study validity. Complete removal of data, possibly in a non-random or informative way, raises great concerns about the validity of the study. 

The addendum to ICH E9 introduced the concepts of estimands and intercurrent events. Events that occurred after randomization were previously ignored even when the analyses followed the intention-to-treat principle. With the addendum, these events are called 'intercurrent events'. Here is the official definition:

Intercurrent Events:
Events occurring after treatment initiation that affect either the interpretation or the existence of the measurements associated with the clinical question of interest. It is necessary to address intercurrent events when describing the clinical question of interest in order to precisely define the treatment effect that is to be estimated.

Estimands can be classified based on the strategies for handling intercurrent events. One way to handle intercurrent events is the 'treatment policy' strategy - hence the treatment policy estimand. The treatment policy estimand under the addendum is almost identical to the intention-to-treat principle under the original ICH E9. 

Treatment policy strategy
The occurrence of the intercurrent event is considered irrelevant in defining the treatment effect of interest: the value for the variable of interest is used regardless of whether or not the intercurrent event occurs. For example, when specifying how to address use of additional medication as an intercurrent event, the values of the variable of interest are used whether or not the patient takes additional medication.
If applied in relation to whether or not a patient continues treatment, and whether or not a patient experiences changes in other treatments (e.g. background or concomitant treatments), the intercurrent event is considered to be part of the treatments being compared. In that case, this reflects the comparison described in the ICH E9 Glossary (under ITT Principle) as the effect of a treatment policy.

The intention-to-treat principle and the treatment policy estimand are two different names with essentially the same meaning. If we have to differentiate them, we can say that the intention-to-treat principle focuses more on which subjects should be included in the analyses, while the treatment policy estimand focuses more on which data points should be included in the analyses.

We have started to see the ICH E9 addendum and the concept of estimands gradually adopted, especially in EU countries. Adoption of the addendum in the US has been much slower than in the EU. The concepts of estimands and intercurrent events are still seen as words invented by statisticians, and it will take a while for non-statisticians to understand and accept these new terms. A presentation "Regulator’s experience with estimands" by Andreas Brandt from the EMA summarized the challenges in adopting and implementing the ICH E9 addendum. We can anticipate the difficulties ahead for non-statisticians and clinicians in accepting the concepts of estimand and intercurrent events. This is reflected in a paper by Min & Bain, "Estimands in diabetes clinical trials":

During 2019 several type 2 diabetes trials results using the term estimand were published. This word will be unfamiliar to many clinicians (and to spellcheck) but given that regulatory bodies have endorsed its use, this word is likely to become a staple of medical jargon in the future.

The ICH E9 addendum described five different strategies for handling intercurrent events: the treatment policy strategy, hypothetical strategy, composite variable strategy, while-on-treatment strategy, and principal stratum strategy. However, in practice, the treatment policy estimand is used in the vast majority of the studies where the estimand concept is mentioned. A few studies use the principal stratum strategy. The other three strategies (hypothetical, composite variable, while-on-treatment) are rarely used in practice, perhaps because they are relatively new, their regulatory acceptance is uncertain, and for some estimands there is no available method to estimate the treatment difference.  

If the vast majority of estimand applications use the treatment policy strategy, which is almost identical to the traditional intention-to-treat principle, one may question whether it was worth revamping the entire ICH E9 to come up with an addendum for the estimand and intercurrent event concepts.  

Monday, February 22, 2021

Randomization Using Envelopes In Randomized, Controlled, and Blinded Clinical Trials

I read an article by Clark et al., “Envelope use and reporting in randomized controlled trials: A guide for researchers”. The article reminds me of the old times when envelopes were the popular way to implement randomization and blinding (treatment concealment). In the 1990s and 2000s, for randomized, blinded clinical trials, the sealed envelope was the only way for the investigator to perform emergency unblinding (code breaking), and sometimes the way to administer the randomization for single-blinded studies.

In Berende et al (2016, NEJM) “Randomized Trial of Longer-Term Therapy for Symptoms Attributed to Lyme Disease”, the study protocol described the following procedure for "unblinding of randomization" where sealed envelopes were used.  

I used to be the unblinded statistician preparing randomization schedules (including randomization envelopes) for clinical trials. The following procedures need to be followed:

  • Based on the study protocol, develop the randomization specifications describing the randomization ratio, stratification factors, block size, number of randomization codes, and recipients of the randomization schedule or code-break envelopes
  • Generate a dummy randomization schedule for the study team to review and approve
  • Replace the random seed and generate the final randomization schedule (a list of all randomized assignments)
  • Prepare the randomization envelopes (randomization number and stratification factors outside the envelope; treatment assignment inside the envelope)
  • QC the randomization envelopes (to make sure the inside/outside information matches the randomization schedule)
  • Ship and track the envelopes
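The schedule-generation steps above can be sketched in code. This is a minimal illustration, assuming a 1:1 ratio and a block size of 4 (both choices, and all names, are illustrative), of how a permuted-block schedule is generated once with a dummy seed for review and then regenerated with the final seed:

```python
import random

def blocked_randomization(n_subjects: int, block_size: int, seed: int):
    """Generate a permuted-block 1:1 randomization schedule (A vs B).

    Run once with a dummy seed for the team's review, then regenerate
    with the final (confidential) seed for the production schedule.
    """
    assert block_size % 2 == 0, "1:1 ratio needs an even block size"
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_subjects:
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        rng.shuffle(block)          # permute treatments within each block
        schedule.extend(block)
    # pair each randomization number with its concealed treatment assignment
    return [(f"R{i+1:04d}", trt) for i, trt in enumerate(schedule[:n_subjects])]

dummy = blocked_randomization(12, block_size=4, seed=1234)    # for review
final = blocked_randomization(12, block_size=4, seed=987654)  # final seed swapped in
print(sum(trt == "A" for _, trt in final))  # -> 6 (balanced: 6 of 12 on A)
```

Blocking guarantees near-balance at any interim point, which is why the block size itself is kept confidential in the specifications.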

For double-blinded studies, both the investigator and the patient are blinded to the treatment assignment. The randomization schedule is usually sent to a third party (for example, a pharmacist) who is unblinded to the treatment assignment and can prepare the study drug for dispensing or administration. The third party must not be involved in other aspects of clinical trial conduct. Concealed envelopes can be sent to the investigators for emergency unblinding: if a medical emergency requires unblinding an individual subject, the investigator can open the code-break envelope to reveal that subject's treatment assignment.

For single-blinded studies, the investigator is unblinded to the treatment assignment and the patient is blinded to the treatment assignment. The randomization schedule and/or the randomization envelopes can be sent to the investigators.

Nowadays, randomization through envelopes is nearly obsolete. Randomization procedures are integrated into the overall CTM (clinical trial material) management process through IRT (interactive response technology). In the last 20 years, the randomization process has shifted from randomization envelopes -> IVRS (interactive voice response system) -> IWRS (interactive web response system) -> IRT.

With IRT, the randomization schedule is sent to the IRT vendor and uploaded into the IRT system. Study team members can be assigned different levels of access to the IRT system depending on their roles in the study. Investigators and pharmacovigilance personnel can be granted an emergency access code so they can obtain the treatment assignment in IRT when necessary.  

However, in some situations, randomization envelopes may still be the best way to implement the randomization.

In a study by Chetter et al., “A Prospective, Randomized, Multicenter Clinical Trial on the Safety and Efficacy of a Ready-to-Use Fibrin Sealant as an Adjunct to Hemostasis during Vascular Surgery”, randomization occurred in the operating room and only after the target bleeding site (TBS) was identified during the surgical procedure. It would not be ideal for the surgeon (the investigator) to log into the IRT system to obtain the treatment assignment. The better approach was for the surgeon or the surgeon’s assistant to open a randomization envelope in the operating room. The randomization procedure was described in the paper as follows:


In the Primary Study, patients were randomized 2:1 to treatment with FS Grifols or MC after the identification of the TBS during the procedure. Treatment group assignments were generated by the randomization function of the statistics software and communicated using sealed opaque envelopes. Due to the obvious differences between the 2 treatments, blinding of investigators was not possible following randomization

Additional Reads:

Monday, February 01, 2021

BLQs (below limit of quantification) and LLOQ (Lower Limit of Quantification): how to handle them in analyses?

In clinical trial data analyses, one type of data is laboratory data containing results measured by a central or specialty laboratory on specimens (blood, plasma or serum, urine, bronchoalveolar lavage, ...) collected from trial participants. Laboratory results are usually reported as quantitative measures in numeric format. However, sometimes we see results reported as '<xxx' or 'BLQ'.

Laboratory measures rely on an assay, and an assay can only accurately measure a level or concentration down to a certain point - that limit is called the Lower Limit of Quantification (LLOQ), the Limit of Quantification (LOQ), or the Limit of Detection (LOD). 

In its guidance (2018) "Bioanalytical Method Validation", the FDA defined the quantification range, LLOQ, and ULOQ: 

The quantification range is the range of concentrations, including the ULOQ and the LLOQ that can be reliably and reproducibly quantified with accuracy and precision with a concentration-response relationship.

Lower limit of quantification (LLOQ): The LLOQ is the lowest amount of an analyte that can be quantitatively determined with acceptable precision and accuracy.

Upper limit of quantification (ULOQ): The ULOQ is the highest amount of an analyte in a sample that can be quantitatively determined with precision and accuracy.

According to the article by Vashist and Luong, "Bioanalytical Requirements and Regulatory Guidelines for Immunoassays", the LLOQ and LOQ are different; in practice, however, the two terms may be used interchangeably. 

The LOQ is the lowest analyte concentration that can be quantitatively detected with a stated accuracy and precision [24]. However, the determination of LOQ depends on the predefined acceptance criteria and performance requirements set by the IA developers. Although such criteria and performances are not internationally adopted, it is of importance to consider the clinical utility of the IA to define such performance requirements.

The LLOQ is the lowest calibration standard on the calibration curve where the detection response for the analyte should be at least five times over the blank. The detection response should be discrete, identifiable, and reproducible. The precision of the determined concentration should be within 20% of the CV while its accuracy should be within 20% of the nominal concentration.
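The acceptance criteria quoted above (response at least five times the blank, CV within 20%, accuracy within 20% of nominal) can be expressed as a small check. This is a sketch only; the function name and inputs are hypothetical:

```python
def qualifies_as_lloq(cv_percent: float,
                      measured: float,
                      nominal: float,
                      analyte_response: float,
                      blank_response: float) -> bool:
    """Check a calibration standard against the LLOQ criteria quoted above:
    CV <= 20%, accuracy within +/-20% of nominal, and detection response
    at least five times over the blank."""
    precision_ok = cv_percent <= 20.0
    accuracy_ok = abs(measured - nominal) / nominal <= 0.20
    signal_ok = analyte_response >= 5.0 * blank_response
    return precision_ok and accuracy_ok and signal_ok

# A standard at nominal 0.01 ng/mL measuring 0.0092 with 15% CV
# and a 6:1 signal-to-blank ratio passes all three criteria:
print(qualifies_as_lloq(15.0, 0.0092, 0.01, 600.0, 100.0))  # -> True
```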

In FDA's guidance "Studies to Evaluate the Metabolism and Residue Kinetics of Veterinary Drugs in Food-Producing Animals: Validation of Analytical Methods Used in Residue Depletion Studies", the LOD and LOQ are differentiated slightly: 

3.4. Limit of Detection
The limit of detection (LOD) is the smallest measured concentration of an analyte from which it is possible to deduce the presence of the analyte in the test sample with acceptable certainty. There are several scientifically valid ways to determine LOD and any of these could be used as long as a scientific justification is provided for their use. 
3.5. Limit of Quantitation
The LOQ is the smallest measured content of an analyte above which the determination can be made with the specified degree of accuracy and precision. As with the LOD, there are several scientifically valid ways to determine LOQ and any of these could be used as long as scientific justification is provided. 

If the level or concentration is below the range the assay can detect, it will be reported as BLQ (Below the Limit of Quantification), BQL (Below Quantification Level), BLOQ (Below the Limit Of Quantification), or '<xxx' where xxx is the LLOQ. The results are seldom reported as 0 or missing, since the result is merely undetectable by the corresponding assay. It is generally agreed that BLQ values are not missing values - they are measured, but not measurable. 

In clinical laboratory data collected for safety assessment, the BLQ or '<xxx' is reported in a character variable. When converting the character variable to a numeric variable, the BLQ or '<xxx' will be treated as missing unless we do something. The following four approaches may be seen for handling BLQ values, assuming an LLOQ of 0.01 ng/mL and a reported value of '< 0.01 ng/mL': 

  • Missing – the specific measure is set to missing and not included in summary and analysis.
  • 0 – the specific measure is set to 0 in summary and analysis.
  • 0.005 ng/mL – half of the LLOQ; commonly used in clinical pharmacology studies (bioavailability and bioequivalence studies).
  • 0.01 ng/mL – ignore the '<' sign and take the LLOQ as the value for summary and analysis. This approach can also handle values beyond the ULOQ (upper limit of quantification), for example '>1000 ng/mL', by removing the '>' sign.
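The four conversion rules above can be sketched as a small helper. The rule labels ('missing', 'zero', 'half_lloq', 'lloq') are my own, not standard terminology:

```python
from typing import Optional

def convert_blq(reported: str, lloq: float, rule: str) -> Optional[float]:
    """Convert a reported lab value like '< 0.01 ng/mL' to a number.

    Rules: 'missing' -> None, 'zero' -> 0, 'half_lloq' -> LLOQ/2,
    'lloq' -> LLOQ. Ordinary quantifiable values pass through; values
    beyond the ULOQ (e.g. '>1000') are handled by dropping the '>' sign.
    """
    s = reported.replace("ng/mL", "").strip()
    if s.startswith(">"):                 # beyond ULOQ: drop the '>' sign
        return float(s[1:])
    if not s.startswith("<"):
        return float(s)                   # an ordinary quantifiable result
    return {"missing": None, "zero": 0.0,
            "half_lloq": lloq / 2.0, "lloq": lloq}[rule]

print(convert_blq("< 0.01 ng/mL", 0.01, "half_lloq"))  # -> 0.005
```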

In clinical pharmacology studies (bioavailability and bioequivalence studies), serial pharmacokinetic (PK) samples are drawn and analyzed to obtain a PK profile for a specific compound or formulation. The serial samples include a pre-dose sample (drawn before dosing) and samples at multiple time points after dosing. It is entirely possible to have results reported as BLQ, especially for the pre-dose sample and the late time points. BLQ values are also possible in the middle of the PK profile (i.e., between two samples with non-BLQ values). The rules for handling these BLQs differ depending on whether the sample is at pre-dose, in the middle of the profile, or at the end of the PK profile (again assuming an LLOQ of 0.01 ng/mL and a reported value of '< 0.01 ng/mL'):


  • Pre-dose sample for a compound with no endogenous level: converted to 0. BLQ(s) occurring before the first quantifiable concentration are set to zero.
  • Pre-dose sample for a compound with an endogenous level, or pre-dose at steady state: converted to 0.005 ng/mL. The endogenous pre-dose level is set to half of the LLOQ; in the multiple-dose situation, the pre-dose sample (trough or Cmin) is set to half of the LLOQ.
  • Middle of the PK profile (between two non-BLQ time points): converted to missing. BLQ values between two reported concentrations are set to missing in the analysis – essentially, the linear interpolation rule is used in the AUC calculation.
  • Last time point(s) of the PK profile: converted to 0. It is common to set the last BLQ(s) to 0, consistent with the rule for pre-dose BLQ handling. According to FDA's "Bioequivalence Guidance", "For a single dose bioequivalence study, AUC should be calculated from time 0 (predose) to the last sampling time associated with quantifiable drug concentration AUC(0-LOQ)."

In some situations, the BLQ values after the last non-BLQ measure can also be set to half of the LLOQ.
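Here is a sketch of how the position-dependent rules above might be applied before a linear trapezoidal AUC calculation, assuming a single-dose compound with no endogenous level. In this sketch, trailing BLQs are dropped so the AUC runs to the last quantifiable concentration, per the FDA quote; all names are illustrative:

```python
def handle_profile_blqs(times, concs):
    """Apply position-dependent BLQ rules to one PK profile.

    `concs` is a list of floats with None marking BLQ samples.
    Leading BLQs -> 0; embedded BLQs -> dropped (so the trapezoid
    linearly interpolates across them); trailing BLQs -> dropped
    (AUC runs to the last quantifiable concentration).
    """
    quantifiable = [i for i, c in enumerate(concs) if c is not None]
    if not quantifiable:
        return [], []
    first, last = quantifiable[0], quantifiable[-1]
    t_out, c_out = [], []
    for i, (t, c) in enumerate(zip(times, concs)):
        if c is not None:
            t_out.append(t); c_out.append(c)
        elif i < first:
            t_out.append(t); c_out.append(0.0)   # pre-dose/leading BLQ -> 0
        # embedded BLQ (first < i < last) or trailing BLQ (i > last): dropped
    return t_out, c_out

def auc_linear_trapezoid(times, concs):
    """AUC from time 0 to the last quantifiable time, linear trapezoidal rule."""
    return sum((t2 - t1) * (c1 + c2) / 2.0
               for t1, t2, c1, c2 in zip(times, times[1:], concs, concs[1:]))

# Profile with a BLQ pre-dose sample, an embedded BLQ, and a trailing BLQ:
t, c = handle_profile_blqs([0, 1, 2, 4, 8], [None, 4.0, None, 2.0, None])
print(auc_linear_trapezoid(t, c))  # -> 11.0
```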

There have been discussions that these single-imputation methods generate biased estimates. In a presentation by Helen Barnett et al., "Non-compartmental methods for Below Limit of Quantification (BLOQ) responses", they concluded:

It is clear that the method of kernel density imputation is the best performing out of all the methods considered and is hence the preferred method for dealing with BLOQ responses in NCA. 

In a recent paper by Barnett et al. (2021, Statistics in Biopharmaceutical Research), "Methods for Non-Compartmental Pharmacokinetic Analysis With Observations Below the Limit of Quantification", eight different methods were discussed for handling BLQs (or BLOQs). The authors conclude that the kernel-based method performs best in most situations.

  • Method 1: replace BLOQ values with 0
  • Method 2: replace BLOQ values with LOQ/2
  • Method 3: regression on order statistics (ROS) imputation
  • Method 4: maximum likelihood per timepoint (summary)
  • Method 5: maximum likelihood per timepoint (imputation)
  • Method 6: Full Likelihood
  • Method 7: Kernel Density Imputation
  • Method 8: Discarding BLOQ Values
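The direction of bias from the simplest of these methods can be seen in a quick simulation. This is not the authors' simulation, just an illustration on lognormal data with roughly half of the values falling below an assumed LOQ:

```python
import random
import statistics

# Simulate concentrations at one time point, censor below LOQ, and compare
# the mean estimated by three of the simple BLOQ strategies listed above.
rng = random.Random(42)
loq = 1.0
true_concs = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]
observed = [c if c >= loq else None for c in true_concs]   # None = BLOQ

mean_zero    = statistics.mean(0.0 if c is None else c for c in observed)      # Method 1
mean_half    = statistics.mean(loq / 2 if c is None else c for c in observed)  # Method 2
mean_discard = statistics.mean(c for c in observed if c is not None)           # Method 8

# Discarding BLOQs biases the mean upward; zero-replacement biases it downward,
# with LOQ/2 replacement falling in between.
print(mean_zero < mean_half < mean_discard)  # -> True
```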
For a specific study, the rules for handling BLQs may differ depending on the time point in the PK profile, the measured compound (with or without endogenous concentrations), single dose versus multiple doses, and study design (single-dose, parallel, crossover). No matter what the rules are, they need to be specified (preferably pre-specified before study unblinding if it is a pivotal study and the PK analysis results are the basis for regulatory approval) in the statistical analysis plan (SAP) or PK analysis plan (PKAP).   

Here are two examples with descriptions of BLQ handling rules. In a phase I study by Shire, the BLQ handling rules are specified as follows: 

In a phase I study by Emergent Product Development, the BLQ rules are described as follows: