
Thursday, August 18, 2022

Handling of values below or above a threshold (Below the Low Limit of Quantification or Above the Upper Limit of Quantification)?

In clinical trials, samples are often collected and sent to a central laboratory or specialty laboratory to measure certain parameters (drug concentrations, metabolite concentrations, biomarkers, ...). It is not uncommon for results to be reported as "<xxx" or ">xxx", indicating that the measurement falls outside the quantifiable range of the assay's standard curves. We call these values below the lower limit of quantification or above the upper limit of quantification.

How to handle them in the data set and in the analyses?

In the data set, the laboratory results should be stored as is in a character variable. The data listings should use the character variable so that the '<' or '>' signs are kept and displayed.

For statistical summaries and analyses, a separate numeric variable should be derived, with appropriate rules applied to the values below the lower limit of quantification or above the upper limit of quantification.
 
Most of the discussion has been about handling the values below the lower limit of quantification (BLQs). See a previous post "BLQs (below limit of quantification) and LLOQ (Lower Limit of Quantification): how to handle them in analyses?" and the researchgate.net discussion board "How should one treat data with <LOQ values during statistical analysis?".

The options for handling the BLQs are:

  • Treat BLQs as missing
  • Treat BLQs as 0
  • Treat BLQs as 1/2 of the LLQ (lower limit of quantification). For example, if the result was reported as "<10" µg, take 5 µg as the measure - this approach is pretty common in handling pharmacokinetic concentration data.
  • Simply remove the '<' sign and take the face value (i.e., the LLQ value). For example, if the result was reported as "<10" µg, take 10 µg as the measure.
  • More complicated methods using statistical (regression, maximum likelihood, ...) approaches
There are fewer discussions about handling the values above the upper limit of quantification (ULQ). Usually, these values are handled in one of the following ways:

  • Treat values above the ULQ as missing
  • Simply remove the '>' sign and take the face value (i.e., the ULQ value). For example, if the result was reported as ">100" mg, take 100 mg as the numeric value
In an SAP developed by Astellas, a simple rule was specified for handling the values below or above a threshold:

"For continuous variables that are recorded as “< X” or “> X”, the value of “X” will be used in the calculation of summary statistics. The original values will be used for the listings."

In an SAP developed by Galapagos for their phase 3 study of GLPG1690 in subjects with idiopathic pulmonary fibrosis, the following rules were proposed to handle values below or above a threshold. Their approach of adding or deducting a small number from the face value is unconventional.

7.3. Handling of Values Below (or Above) a Threshold 

Values below (above) the detection limit will be imputed by the value one unit smaller or larger than the detection limit itself. In listings, the original value will be presented. Example: if the database contains the value “<0.04”, then for the descriptive statistics the value “0.03” will be used. The value “>1000” will be imputed by “1001”. 
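As an illustration, the rules above can be sketched in Python (the function names are mine, not from any SAP; the "one unit of the last reported decimal place" function mimics the Galapagos example):

```python
from decimal import Decimal

def impute_blq(value, rule="half"):
    """Impute a below-LLQ result reported as '<X'.
    rule: 'missing', 'zero', 'half' (X/2), or 'face' (X itself)."""
    x = float(value.lstrip("<"))
    return {"missing": None, "zero": 0.0, "half": x / 2.0, "face": x}[rule]

def impute_one_unit(value):
    """Galapagos-style rule: impute '<X' as one unit of the last reported
    decimal place below X, and '>X' as one unit above X."""
    sign, d = value[0], Decimal(value[1:])
    unit = Decimal(1).scaleb(d.as_tuple().exponent)  # e.g. 0.01 for "0.04"
    return float(d - unit if sign == "<" else d + unit)
```

For example, impute_blq("<10", "half") gives 5.0, and impute_one_unit("<0.04") gives 0.03 while impute_one_unit(">1000") gives 1001.0, matching the SAP examples above.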

Monday, January 17, 2022

Paired T-test and McNemar's test for paired data based on the summary data

Sometimes, it is necessary for us to calculate the p-values based on the summary (aggregate) data without the individual subject level data. In a previous post, group t-test or Chi-square test based on the summary data was discussed. Group t-test and chi-square test can be used in the setting of parallel-group comparisons. 

In single-arm clinical trials, there is no concurrent control group, and the statistical test is usually based on a pre-post comparison. For continuous measures, the pre-post comparison can be tested using a paired t-test based on the change from baseline values (i.e., post-baseline measure - baseline measure). For discrete outcomes, the pre-post comparison may be tested using McNemar's test.

Paired t-test:

A paired t-test is used when we are interested in the difference between two variables for the same subject. Suppose we have the descriptive statistics for change from baseline values: 83 subjects had the outcome measures at both baseline and week 12 (therefore, 83 pairs), the mean and standard deviation for these 83 pairs are: 10.7 (70.7); 68 subjects had the outcome measures at both baseline and week 24 (therefore 68 pairs), the mean and standard deviation for these 68 pairs are 20.2 (80.9). 

With the mean difference, the standard deviation of the differences, and the sample size (# of pairs), we have all the elements to calculate the t statistic and therefore the p-value: t = (mean difference) / (SD of differences / √n), referred to a t distribution with n - 1 degrees of freedom.

This can be implemented in SAS - t-statistics and p-values can be calculated for each of weeks 12 and 24 based on the aggregate data alone.
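Since the SAS code is not reproduced here, a minimal Python sketch of the same calculation (only the t statistic and degrees of freedom are computed; the p-value would come from a t distribution with n - 1 degrees of freedom):

```python
import math

def paired_t_from_summary(mean_diff, sd_diff, n):
    """Paired t-test from summary data: t = mean difference divided by
    the standard error of the mean difference; df = n - 1 (n = # of pairs)."""
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

t12, df12 = paired_t_from_summary(10.7, 70.7, 83)  # week 12: 83 pairs
t24, df24 = paired_t_from_summary(20.2, 80.9, 68)  # week 24: 68 pairs
```

This gives t ≈ 1.38 with 82 df at week 12 (two-sided p roughly 0.17) and t ≈ 2.06 with 67 df at week 24 (two-sided p roughly 0.04).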
 

McNemar's Test:

McNemar's test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is "marginal homogeneity"). In clinical trials, the aggregate data may not be obvious as a 2 × 2 contingency table but can be converted into a 2 × 2 contingency table. 

Suppose we have the following summary data for post-baseline week 12: the number and percentage of subjects with improvement, stable (no change), and deterioration categories. 

 

 

                All subjects (n=300)
Week 12  Improved         54 (18%)
         No Change       228 (76%)
         Deteriorated     18 ( 6%)

At Week 12, there are more subjects in the 'Improved' category than in the 'Deteriorated' category, even though the majority of subjects are in the 'No Change' category. Are there more subjects with improvement than deterioration?

Assuming that change from category 1 to 0 is 'Improved' and change from category 0 to 1 is 'Deteriorated', the table above can be converted into a 2 × 2 table: 

 

 

               Baseline
                0     1
Week 12   0   228    54
          1    18     0

or

               Baseline
                0     1
Week 12   0     0    54
          1    18   228

For McNemar’s test, only the numbers in the off-diagonal, discordant cells (in our case, the # improved and the # deteriorated) are relevant.

The concordant cells (in our case, the # with no change) contribute only to the total sample size and have no impact on the test statistic. How the # of subjects with ‘No Change’ is split between the two concordant cells does not matter for the calculation of the chi-square statistic and therefore the p-value.

For the 2 × 2 table above, McNemar’s test can be performed in SAS using PROC FREQ with a WEIGHT statement (indicating that the count variable is the frequency of each observation) and the AGREE option (requesting McNemar's test). How the 228 subjects in the concordant ‘No Change’ category are split has no impact on the p-value calculation.
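The statistic is also easy to compute by hand from the two discordant counts; a small Python sketch, using the chi-square (1 df) tail formula erfc(√(x/2)) for the p-value:

```python
import math

def mcnemar(b, c):
    """McNemar chi-square (1 df) from the two discordant cell counts:
    b = # improved (1 -> 0), c = # deteriorated (0 -> 1)."""
    chi2 = (b - c) ** 2 / (b + c)
    p = math.erfc(math.sqrt(chi2 / 2.0))  # P(chi-square with 1 df > chi2)
    return chi2, p

chi2, p = mcnemar(54, 18)  # the 228 'No Change' subjects never enter
```

Here chi-square = 36²/72 = 18.0 with p ≈ 2×10⁻⁵: strong evidence of more improvement than deterioration.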



Sunday, August 29, 2021

Retire Statistical Significance and p-value? - Revisited

In 2019, there was a public debate about the use or misuse of statistical significance and p-values. I wrote a post about it "Retire Statistical Significance and p-value?"

Obviously, hypothesis testing, statistical significance, and p-values are still the cornerstone of our clinical trials, the basis for judging whether a clinical trial is a success, and the basis for regulatory approval decisions by the FDA and other regulatory authorities.

In the latest issue of AMSTATNews and also in the Annals of Applied Statistics, there is an article "The ASA President’s Task Force Statement on Statistical Significance and Replicability". The statement confirmed that significance tests and p-values are here to stay.
"P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results. Indeed, p-values and significance tests are among the most studied and best understood statistical procedures in the statistical literature. They are important tools that have advanced science through their proper application."

"p-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data. Analyzing data and summarizing results are often more complex than is sometimes popularly conveyed. Although all scientific methods have limitations, the proper application of statistical methods is essential for interpreting the results of data analyses and enhancing the applicability of scientific results."

I would say the following about significance tests and p-values:

  • We should embrace them, not abandon them.
  • We should focus on appropriate use and interpretation.
  • We should not become slaves to p-values.

A quote from Alfred Marshall, 1885:

"The most reckless and treacherous of all theorists is he who professes to let facts and figures speak for themselves, who keeps in the background the part he has played, perhaps unconsciously, in selecting and grouping them."

Monday, August 09, 2021

Randomized Withdrawal Design in Practice - Story of the HARMONY Trial

In previous posts, we discussed the "Randomized Withdrawal Design and Randomized Discontinuation Trial" and "Randomized Withdrawal Design - Examples for Defining the Criteria for Run-in and Randomized Withdrawal Periods".

The Randomized Withdrawal Design (RWD) was discussed in the FDA guidance "Enrichment Strategies for Clinical Trials to Support Determination of Effectiveness of Human Drugs and Biological Products" as one of the enrichment strategies; it enables us to minimize the sample size of the study while demonstrating the efficacy of the experimental drug.
"In a randomized withdrawal study, patients who have an apparent response to treatment in an open-label period or in the treatment arm of a randomized trial are randomized to continued drug treatment or to placebo treatment. Because such trials generally involve only patients who appear to have responded, this is a study enriched with apparent responders, an empiric strategy. The study evaluation can be based on signs or symptoms during a specified interval (e.g., BP, angina rate), on recurrence of a condition that had been controlled by the drug (e.g., depression), or on the fraction of patients developing a rate or severity of symptoms that exceeds some specified limit (i.e., a failure criterion). "

Even though the FDA encourages the use of the randomized withdrawal design in clinical trials, the use of RWD is still very limited, especially in pivotal, confirmatory studies. The reluctance to use the RWD is usually due to safety concerns - it may be perceived as unethical to discontinue the experimental treatment in the placebo group when the experimental treatment has just been shown to be effective. Another concern is the size of the safety database - there must be a sufficient number of patients exposed to the experimental drug so that safety can be adequately assessed (see a previous post "The Size of Safety Database In Drug Development Program").

In the recent issue of the New England Journal of Medicine, the results of the HARMONY study (sponsored by Acadia Pharmaceuticals) were published: Tariot et al "Trial of Pimavanserin in Dementia-Related Psychosis". While the term 'randomized withdrawal' was not explicitly used in the study, the HARMONY study was a randomized withdrawal study and was described as the following:  
"a phase 3, double-blind, randomized, placebo-controlled discontinuation trial involving patients with psychosis related to Alzheimer’s disease, Parkinson’s disease dementia, dementia with Lewy bodies, frontotemporal dementia, or vascular dementia. Patients received open-label pimavanserin for 12 weeks. Those who had a reduction from baseline of at least 30% in the score on the Scale for the Assessment of Positive Symptoms–Hallucinations and Delusions (SAPS–H+D, with higher scores indicating greater psychosis) and a Clinical Global Impression–Improvement (CGI-I) score of 1 (very much improved) or 2 (much improved) at weeks 8 and 12 were randomly assigned in a 1:1 ratio to continue receiving pimavanserin or to receive placebo for up to 26 weeks. "
As we can see from the study design diagram below, all patients received pimavanserin during the open-label period. Those subjects who had responses to the pimavanserin (i.e., reduction in SAPS-H+D) were randomized to continue the treatment of pimavanserin or withdraw the treatment of pimavanserin (i.e., receiving the placebo). 


HARMONY Study protocol and SAP were posted on clinicaltrials.gov.  

Study Protocol  [PDF] July 16, 2010
Statistical Analysis Plan  [PDF] August 5, 2014

With the HARMONY study using a randomized withdrawal design, the sponsor was able to demonstrate the efficacy of pimavanserin in treating patients with dementia-related psychosis. The sample size, already reduced by the use of the randomized withdrawal design, was further reduced after the sponsor stopped the study early for positive efficacy. The sponsor subsequently submitted the sNDA for pimavanserin in treating patients with dementia-related psychosis. Unfortunately, they received a complete response letter from the FDA citing reasons not related to the study design but to the insufficient sample size in certain subgroups.

Sunday, June 20, 2021

Early Phase Trial to Find Maximal Tolerated Dose (MTD) - 3+3, CRM, and BOIN Designs

In early-phase clinical trials, determining the dose range and therapeutic window is critical. The purpose of the early-phase studies may just be to identify the maximum tolerated dose or maximum tolerable dose. 

Definition of maximum tolerated dose (MTD)
The highest dose of a drug or treatment that does not cause unacceptable side effects. The maximum tolerated dose is determined in clinical trials by testing increasing doses on different groups of people until the highest dose with acceptable side effects is found. Also called MTD.
The studies for identifying the MTD are usually designed as dose-escalation studies, and a dose-escalation study is defined as:
A study that determines the best dose of a new drug or treatment. In a dose-escalation study, the dose of the test drug is increased a little at a time in different groups of people (also called cohort) until the highest dose that does not cause harmful side effects is found. A dose-escalation study may also measure ways that the drug is used by the body and is often done as part of a phase I clinical trial. These trials usually include a small number of patients and may include healthy volunteers.

In dose-escalation studies, within each dose cohort, a placebo group can be included even though the majority of the dose-escalation studies for MTD are designed without placebo controls.

Identifying the MTD is based on the number of dose-limiting toxicities (DLTs) that are observed in each dose cohort. DLTs are defined as:

side effects of a drug or other treatment that are serious enough to prevent an increase in dose or level of that treatment.

In practice, DLTs are often defined as grade 3 or above adverse events according to the Common Terminology Criteria for Adverse Events (CTCAE), especially in the oncology area, even though other custom-defined criteria for DLTs may be used in non-oncology areas.

Clinical trials to identify the MTD are generally needed for phase I studies conducted directly in patients rather than healthy volunteers. Areas where phase I studies are conducted in patients include oncology drugs, drugs for severe diseases such as AIDS, sepsis, and ARDS, gene and cell therapies, and human plasma-derived products.

There are different types of clinical trial designs for identifying the MTD. The commonly used designs are 3+3 design, Continuous Reassessment Method (CRM), and Bayesian Optimal INterval design (BOIN). 

The 3+3 design was discussed in an earlier post Phase I Dose Escalation Study Design: "3 + 3 Design". It is a straightforward rule-based method and requires no statistical calculations. The 3+3 design is the most frequently used method for identifying the MTD.
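The 3+3 rule is simple enough to write down directly; a sketch (my own encoding of the usual rule, not taken from the cited post):

```python
def three_plus_three(n_dlt, n_treated):
    """Standard 3+3 decision for the current dose cohort:
    0/3 DLTs -> escalate; 1/3 -> expand the cohort to 6 patients;
    <=1/6 -> escalate; anything else -> stop (MTD exceeded)."""
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"
        return "expand to 6" if n_dlt == 1 else "stop"
    if n_treated == 6:
        return "escalate" if n_dlt <= 1 else "stop"
    raise ValueError("3+3 cohorts have 3 or 6 patients")
```

For example, 1 DLT among 3 patients triggers expansion to 6, and 2 or more DLTs at any cohort size stops escalation.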

The CRM is a model-based design for phase I trials, which aims to find the maximum tolerated dose (MTD) of a new therapy. The CRM has been shown to be more accurate in targeting the MTD than traditional rule-based approaches such as the 3 + 3 design. With CRM design, statistical inferences on the model parameter(s) need to be made using likelihood-based or Bayesian approaches and DLT probability at each dose needs to be estimated. The patient is assigned to the next dose level based on the probability of patients with DLTs at the current dose level. The toxicity risk of other dose levels is based on accrued data, which improves trial efficiency. 

Following articles or videos provided a great introduction/reference about the CRM method: 

The BOIN design shares the simplicity of the 3+3 design, which makes dose escalation/de-escalation decisions by comparing the observed DLT rate with 0/3, 1/3, 2/3, 0/6, 1/6, and 2/6. The BOIN design instead makes the decision by comparing the observed DLT rate with two fixed boundaries, λe and λd, which is arguably even simpler.
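The two boundaries have simple closed forms (the Liu and Yuan formulas; here φ is the target DLT rate and φ1, φ2 are rates deemed clearly too low/too high, with the common defaults 0.6φ and 1.4φ). A sketch:

```python
import math

def boin_boundaries(phi, phi1=None, phi2=None):
    """Escalation (lambda_e) and de-escalation (lambda_d) boundaries for BOIN.
    Defaults phi1 = 0.6*phi and phi2 = 1.4*phi are the common choices."""
    phi1 = 0.6 * phi if phi1 is None else phi1
    phi2 = 1.4 * phi if phi2 is None else phi2
    lam_e = math.log((1 - phi1) / (1 - phi)) / math.log(phi * (1 - phi1) / (phi1 * (1 - phi)))
    lam_d = math.log((1 - phi) / (1 - phi2)) / math.log(phi2 * (1 - phi) / (phi * (1 - phi2)))
    return lam_e, lam_d

def boin_decision(n_dlt, n_treated, phi=0.3):
    """Escalate if the observed DLT rate <= lambda_e, de-escalate if >= lambda_d."""
    lam_e, lam_d = boin_boundaries(phi)
    rate = n_dlt / n_treated
    if rate <= lam_e:
        return "escalate"
    return "de-escalate" if rate >= lam_d else "stay"
```

For a target DLT rate of 0.30, this gives λe ≈ 0.236 and λd ≈ 0.358, the familiar published values; 1/3 DLTs falls between the boundaries, so the decision is to stay at the current dose.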



The BOIN design is described and explained in the following article and video:
Software for Sample Size Calculation for Phase I MTD Finding Studies:
  • trialdesign.org is a website developed and maintained by a research team at MD Anderson Cancer Center; it contains the literature and software for phase I designs including CRM and BOIN.
Additional Videos: 

Monday, April 19, 2021

Restricted Mean Survival Time (RMST) for Handling the Non-Proportional Hazards Time to Event Data

Time to event analysis (traditionally, survival analysis) is one of the most common analyses in clinical trials. In general, time to event analysis relies on the assumption of proportional hazards. However, quite frequently, we may find that the proportional hazards assumption is violated, especially in many immuno-oncology trials. When the proportional hazards assumption is violated, alternative approaches may be needed to analyze the data and preserve statistical power. As discussed in the previous post "Non-proportional Hazards: how to analyze the time-to-event data?", one of the alternative approaches is the restricted mean survival time (RMST) method.

RMST is a Kaplan-Meier-based method that essentially calculates and compares the areas under the Kaplan-Meier curves for different treatment or comparison groups. It has been said that RMST analysis has the following advantages:
  • Model-free, robust, and easily interpretable treatment effect information
  • Retains power under the radically different patterns of difference that have been observed in some recent oncology clinical trials
  • An approach accepted by regulatory agencies and industry leaders
RMST has been mentioned in the latest FDA guidance for Industry (2020), Acute Myeloid Leukemia: Developing Drugs and Biological Products for Treatment, as an alternative approach for analyzing the data when non-proportional hazards occur (e.g., a plateauing effect).

"Plateauing Effect

Trials designed to cure AML often result in survival contours characterized by an initial drop followed by a plateauing effect after some time point post randomization. This is an example of nonproportional hazards. While the log-rank test is somewhat robust to nonproportionality, it generally results in loss of power. Furthermore, nonproportionality can cause difficulty in describing the treatment effect. FDA is open to discussion about analyses based on other approaches, such as weighted Cox regression or other weighted methods, or summarizing the treatment effect using restricted mean survival time (RMST) or landmark survival analysis. Plans that use these alternative approaches should include:
    • justification for what constitutes clinically meaningful difference,
    • justification of design parameters, such as sample size and follow-up duration, based on this endpoint, and
    • justification for the value of the threshold that will be used to calculate the RMST.
RMST analysis has also been used as a primary analysis approach or for sensitivity analysis in FDA reviews: 

In the NDA of baloxavir marboxil for the treatment of acute, uncomplicated influenza, both the applicant and the FDA reviewer analyzed the data using RMST. It stated:
Restricted mean survival time (RMST) up to Day 10 was estimated for each treatment group along with the difference between RMST in the two treatment groups. RMST is a measurement of the average survival from time 0 to a specified time point (e.g., 10 days) which is equivalent to the area under the Kaplan-Meier curve from the beginning of the study through that time point.

At an FDA CDRH Medical Devices Advisory Committee Circulatory System Panel meeting in 2019, the independent statistical consultant addressed the analysis issue when the proportional hazards assumption is violated:

The proposal they made was the restricted mean survival time. The restricted mean survival time is area under curve. Please note the word restricted. Mean survival time is over a period of time, according to the rules that have been laid out, so that you're not looking, like with proportional hazards, over all the follow-up that could have possibly happened or in binary where you're only looking at the patients that survive. The restricted mean would say we're going to look between, let's say, 0 and 5 years because we have sufficient information to make that kind of assessment.

The paper showed that the restricted mean has just as much power as proportional hazards when the assumptions are there for proportional hazards, and then has more power when the assumptions are violated.

There's also some advantages in terms for clinicians, in terms of explaining this to the patient. It's hard to talk about hazards or number needed to treat. But if you could say to a patient over a 60-month period the average survival time is 55 months with Device A versus 52 months with Device B, now they can look at what their life is going to look like in the next 60 months and make a decision.

Unfortunately, it was not me who noticed this. This was actually from a presentation by FDA. Several very smart statisticians had talked about the restricted mean and have made recommendations on using it for both proportional violations and for its interpretation.

In FDA Briefing Document for Oncologic Drugs Advisory Committee Meeting (December 17, 2019) to review Olaparib for the maintenance treatment of adult patients with deleterious or suspected deleterious germline BRCA mutated (gBRCAm) metastatic adenocarcinoma of the pancreas

FDA performed a test to evaluate whether the proportional hazard assumption was met. This test failed to detect evidence of non-proportionality; however, such a test may lack power to detect non-proportionality due to the small sample size. The Kaplan-Meier curves of PFS appear to show some degree of nonproportionality. The curves did not show separation until approximately 4 months, after approximately 53% of patients either had events or were censored. FDA performed additional sensitivity analyses by applying the restricted mean survival time (RMST) method using different truncation points (15 months and 18 months). The truncated time was selected (15 or 18 months) such that approximately 8-12% patients remained at risk. Based on the truncation times, the estimated RMST difference in PFS between arms ranged from 2.6 months (95% CI: 0.9, 4.3) to 3.1 months (95% CI: 1.0, 5.2). The range of the RMST differences again demonstrated great variation in the difference in PFS and the lower ends did not suggest that there was a clinically meaningful difference.

Thanks to the software, RMST analyses can be easily implemented in SAS or R. In the latest versions (15.1 or above) of SAS/STAT, RMST is included in SAS Proc LIFETEST with the RMST option and in Proc RMSTREG. See the nice paper by Guo and Liang (2019), "Analyzing Restricted Mean Survival Time Using SAS/STAT®".
With R, the package for RMST analysis is survRM2, developed by Hajime Uno from the Dana-Farber Cancer Institute.

For an RMST analysis, it is important to select the cut-off value (tau) for the truncation time. Different choices of tau will give different results, and the selection of tau can sometimes be arbitrary. In the FDA briefing document above, the FDA statistician chose the truncation time such that approximately 8-12% of patients remained at risk.
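Since RMST(tau) is just the area under the Kaplan-Meier curve from 0 to tau, it can be computed directly from the step function. A minimal Python sketch (point estimate only; the variance and between-group comparison are what survRM2 and Proc RMSTREG add):

```python
from itertools import groupby

def rmst(times, events, tau):
    """Restricted mean survival time: area under the Kaplan-Meier curve
    from 0 to tau. events: 1 = event, 0 = censored."""
    data = sorted(zip(times, events))
    s, area, prev_t, at_risk = 1.0, 0.0, 0.0, len(data)
    for t, grp in groupby(data, key=lambda x: x[0]):
        grp = list(grp)
        if t > tau:
            break
        area += s * (t - prev_t)        # survival is flat between event times
        d = sum(e for _, e in grp)      # events at this time
        if d > 0:
            s *= 1.0 - d / at_risk      # Kaplan-Meier step down
        at_risk -= len(grp)             # events and censorings leave the risk set
        prev_t = t
    return area + s * (tau - prev_t)    # last flat piece up to tau
```

As a sanity check: with four subjects all having events at times 1, 2, 3, and 4, RMST at tau = 4 equals the ordinary mean survival time, 2.5; truncating at tau = 2 gives 1.75.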

There are different ways to calculate the RMST:

  • Non-parametric method
  • Regression Analysis Method
  • Pseudo-value Regression Method
  • IPCW Regression - Inverse Probability of Censoring Weighting (IPCW) regression
  • Conditional restricted mean survival time (CRMST)

According to the paper by Guo and Liang (2019) "Analyzing Restricted Mean Survival Time Using SAS/STAT®", non-parametric analysis can be implemented using Proc Lifetest; regression analysis, pseudo-value regression, and IPCW regression can be implemented using SAS Proc RMSTREG. 

FDA statisticians also proposed an approach called 'conditional restricted mean survival time' or CRMST. This approach was described in the paper by Qiu et al (2019) "Estimation on conditional restricted mean survival time with counting process" and also in a presentation by Lawrence and Qiu (2020), Novel Survival Analysis When Hazards Are Nonproportional and/or There Are Multiple Types of Events. CRMST allows the AUC under the K-M curve to be calculated over an interval (not necessarily starting from time 0). They claim CRMST is better for event-driven studies where the time to the first event is of interest. They concluded the following:
CRMST possesses all the desirable statistical properties of RMST. In particular, it does not rely on proportional hazard assumption. In addition, CRMST measures an average event-free time in the time range at issue and has straightforward interpretation. In case that two survival curves cross, CRMST can be estimated separately before and after crossing and the CRMST differences can be used to assess benefit versus harm.

Further Reading:

Monday, April 05, 2021

Non-proportional Hazards: how to analyze the time-to-event data?

Time to event data is one of the most common data types in clinical trials. Traditionally, the log-rank test is used to compare the survival curves of two treatment groups; the Kaplan-Meier survival plot is used to illustrate the totality of time-to-event kinetics, including the estimated median survival time; and the Cox proportional hazards model is employed to provide the estimated relative effect (i.e., hazard ratio) between treatment arms. The performance of these analyses largely depends on the proportional hazards (PH) assumption - that the hazard ratio is constant over time. In other words, the hazard ratio provides an average relative treatment effect over time.

Before the time to event data is analyzed, it is typical for statisticians to check the proportional hazards assumption. Various methods can be used to check the proportional hazards assumptions - see a previous post "Visual Inspection and Statistical Tests for Proportional Hazard Assumption".

Recently we have seen more examples of time to event data not following the proportional hazards assumption, especially in immuno-oncology clinical trials.

It is not the end of the world if the proportional hazards assumption is violated; various approaches have been proposed to handle time to event data with non-proportional hazards.

In practice, it is pretty common to prespecify in the statistical analysis plan the log-rank test to calculate the p-value and then use the Cox proportional hazards regression model to calculate the hazard ratio, its 95% confidence interval, and p-value - I call this 'splitting the p-value and the estimate of the treatment difference'. Two different p-values will be calculated: one from the log-rank test and one from the Cox regression. If the proportional hazards assumption is met, it is better to use the p-value from the Cox regression since all estimates and the p-value then come from the same model. However, when the proportional hazards assumption is violated, the Cox proportional hazards model may no longer be the optimal approach to determine the treatment effect, and the Kaplan-Meier estimate of median survival may not be the most valid measure to summarize the results.

In a website post "Testing equality of two survival distributions: log-rank/Cox versus RMST", it stated:
“One thing to note is that the log-rank test does not assume proportional hazards per se. It is a valid test of the null hypothesis of equality of the survival functions without any assumptions (save assumptions regarding censoring). It is however most powerful for detecting alternative hypotheses in which the hazards are proportional.”
It is true that the log-rank test does not depend on the proportional hazards assumption. The log-rank test is still a valid test of the null hypothesis of equality of the survival functions without any assumptions, even though it may not be optimal under non-proportional hazards.

In a public workshop "Oncology Clinical Trials in the Presence of Non-Proportional Hazards" organized by Duke in 2018, Dr. Rajeshwari Sridhara from the Division of Biometrics V, CDER/FDA stated (@40:45 of the YouTube video) that in the non-proportional hazards situation, the FDA is OK with presenting the p-value from the log-rank test and the hazard ratio to measure the treatment difference.

At this same workshop "Oncology Clinical Trials in the Presence of Non-Proportional Hazards", a working group from the ASA Biopharmaceutical Section Regulatory-Industry Statistics Workshop presented their work and proposed the 'max-combo' test as an alternative method to address the non-proportional hazards situation. The "max-combo" test is based on Fleming-Harrington (FH) weighted log-rank statistics. The max-combo test tackles some of the challenges due to non-proportional hazards: it is able to robustly handle a range of non-proportional hazard types, can be pre-specified at the design stage, and can choose the appropriate weight in an adaptive manner (i.e., it is able to control the family-wise Type I error). In workshop summaries, the Max-Combo Test Design was described as the following:

Knezevic & Patil have a paper describing a SAS macro to perform the max-combo test (combination weighted log-rank tests): "Combination weighted log-rank tests for survival analysis with non-proportional hazards" (2020 SAS Global Forum).
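To make the idea concrete, here is a rough Python sketch (a brute-force illustration, not the cited SAS macro) of a Fleming-Harrington weighted log-rank statistic and a max-combo over the usual four (ρ, γ) weights; note that the proper max-combo p-value additionally accounts for the correlation among the weighted statistics, which this sketch does not do:

```python
import math

def fh_logrank_z(times, events, groups, rho=0.0, gamma=0.0):
    """Fleming-Harrington(rho, gamma) weighted log-rank z-statistic.
    groups: 0/1 treatment indicator; events: 1 = event, 0 = censored.
    Weight at each event time is S(t-)^rho * (1 - S(t-))^gamma, where
    S(t-) is the pooled Kaplan-Meier left limit."""
    data = sorted(zip(times, events, groups))
    s_prev, u, v = 1.0, 0.0, 0.0
    for t in sorted({t for t, e, g in data if e == 1}):
        risk = [row for row in data if row[0] >= t]     # risk set at t
        nj = len(risk)
        n1j = sum(1 for _, _, g in risk if g == 1)
        dj = sum(1 for tt, e, _ in data if tt == t and e == 1)
        d1j = sum(1 for tt, e, g in data if tt == t and e == 1 and g == 1)
        w = (s_prev ** rho) * ((1.0 - s_prev) ** gamma)
        u += w * (d1j - dj * n1j / nj)                  # observed - expected
        if nj > 1:
            v += w * w * dj * (n1j / nj) * (1 - n1j / nj) * (nj - dj) / (nj - 1)
        s_prev *= 1.0 - dj / nj                         # pooled KM step
    return u / math.sqrt(v) if v > 0 else 0.0

def max_combo(times, events, groups):
    """Max of |z| over FH(0,0), FH(0,1), FH(1,0), FH(1,1)."""
    return max(abs(fh_logrank_z(times, events, groups, r, g))
               for r, g in [(0, 0), (0, 1), (1, 0), (1, 1)])
```

FH(0,0) is the ordinary log-rank test; FH(0,1) up-weights late differences and FH(1,0) early differences, which is what lets the combination adapt to different non-proportional hazard shapes.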

The NPH working group has presented or published their work on numerous occasions; here is a list:
In addition to the Max-Combo test, there are several other methods for handling the non-proportional hazards situation. 
  • RMST (restricted mean survival time): according to a presentation by Lawrence et al from the FDA, the idea of restricted mean survival time (RMST) goes back to Irwin (1949) and was further implemented in survival analysis by Uno et al. (2014). RMST is defined as the area under the survival curve up to t*, which should be pre-specified for a randomized trial. RMST may be loosely described as the event-free expectancy over the restricted period between randomization and a defined, clinically relevant time horizon, called t*. RMST analyses are now built into SAS with Proc LIFETEST and Proc RMSTREG. See the paper by Guo and Liang (2019) "Analyzing Restricted Mean Survival Time Using SAS/STAT®"
  • Piecewise exponential regression allows for separate early and late treatment effects; it is especially useful when the non-proportional hazards pattern is a crossover. Piecewise exponential regression can be fitted with SAS Proc MCMC or the R package pch
  • Estimation via the average hazard ratios (AHR) method of Schemper (2009) and the average regression effects (ARE) method of Xu and O’Quigley (2000) - these methods can be implemented using the coxphw package in R, which is described as:
This package implements weighted estimation in Cox regression as proposed by Schemper, Wakounig and Heinze (Statistics in Medicine, 2009, doi: 10.1002/sim.3623). Weighted Cox regression provides unbiased average hazard ratio estimates also in case of non-proportional hazards. The package provides options to estimate time-dependent effects conveniently by including interactions of covariates with arbitrary functions of time, with or without making use of the weighting option. For more details we refer to Dunkler, Ploner, Schemper and Heinze (Journal of Statistical Software, 2018, doi: 10.18637/jss.v084.i02).
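The RMST idea in the list above - the area under the Kaplan-Meier curve up to t* - can be sketched in a few lines of Python. This is a simplified illustration of the definition, not the Proc RMSTREG implementation:

```python
def km_curve(times, events):
    """Kaplan-Meier estimate: returns (time, survival) step points."""
    data = sorted(zip(times, events))
    n = len(data)
    s = 1.0
    at_risk = n
    steps = []
    i = 0
    while i < n:
        t = data[i][0]
        d = 0                              # events at this distinct time
        j = i
        while j < n and data[j][0] == t:
            d += data[j][1]
            j += 1
        if d > 0:
            s *= 1 - d / at_risk
            steps.append((t, s))
        at_risk -= j - i                   # events and censored leave risk set
        i = j
    return steps

def rmst(times, events, t_star):
    """Restricted mean survival time: area under the KM step curve up to t_star."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in km_curve(times, events):
        if t >= t_star:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    area += prev_s * (t_star - prev_t)     # last piece up to the horizon t*
    return area
```

With no censoring, RMST at t* equals the sample mean of min(T, t*), which is a handy sanity check. A between-treatment comparison is then the difference (or ratio) of the two arms' RMSTs at the same pre-specified t*.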

In a presentation by Kaur et al, "Analytical Methods Under Non-Proportional Hazards: A Dilemma of Choice", the following methods were described: 

Earlier this year, Mehrotra and West published a paper describing their proposed method (5-STAR) to handle the heterogeneity of the patient population and potential non-proportional hazards ("Survival Analysis Using a 5-Step Stratified Testing and Amalgamation Routine (5-STAR) in Randomized Clinical Trials", 2021):

"The power of the ubiquitous logrank test for a between-treatment comparison of survival times in randomized clinical trials can be notably less than desired if the treatment hazard functions are non-proportional, and the accompanying hazard ratio estimate from a Cox proportional hazards model can be hard to interpret. Increasingly popular approaches to guard against the statistical adverse effects of non-proportional hazards include the MaxCombo test (based on a versatile combination of weighted logrank statistics) and a test based on a between-treatment comparison of restricted mean survival time (RMST). Unfortunately, neither the logrank test nor the latter two approaches are designed to leverage what we refer to as structured patient heterogeneity in clinical trial populations, and this can contribute to suboptimal power for detecting a between-treatment difference in the distribution of survival times. Stratified versions of the logrank test and the corresponding Cox proportional hazards model based on pre-specified stratification factors represent steps in the right direction. However, they carry unnecessary risks associated with both a potential suboptimal choice of stratification factors and with potentially implausible dual assumptions of proportional hazards within each stratum and a constant hazard ratio across strata.
We have developed and described a novel alternative to the aforementioned current approaches for survival analysis in randomized clinical trials. Our approach envisions the overall patient population as being a finite mixture of subpopulations (risk strata), with higher to lower ordered risk strata comprised of patients having shorter to longer expected survival regardless of treatment assignment. Patients within a given risk stratum are deemed prognostically homogeneous in that they have in common certain pre-treatment characteristics that jointly strongly associate with survival time. Given this conceptualization and motivated by a reasonable expectation that detection of a true treatment difference should get easier as the patient population gets prognostically more homogeneous, our proposed method follows naturally. Starting with a pre-specified set of baseline covariates (Step 1), elastic net Cox regression (Step 2) and a subsequent conditional inference tree algorithm (Step 3) are used to segment the trial patients into ordered risk strata; importantly, both steps are blinded to patient-level treatment assignment. After unblinding, a treatment comparison is done within each formed risk stratum (Step 4) and stratum-level results are combined for overall estimation and inference (Step 5)."
Non-proportional hazards and the NPH pattern are usually identified only after study unblinding, which poses challenges for pre-specifying the best approach to analyze time-to-event data with non-proportional hazards. The safest way is to pre-specify both the log-rank test and the Cox proportional hazards regression; if the proportional hazards assumption is violated, the p-value from the log-rank test will be used as the measure of significance. One can also pre-specify the Max-Combo method as the primary method regardless of whether the proportional hazards assumption holds.

Monday, February 01, 2021

BLQs (below limit of quantification) and LLOQ (Lower Limit of Quantification): how to handle them in analyses?

In clinical trial data analyses, one common type of data is the laboratory data containing results measured by a central laboratory or specialty laboratory on specimens (blood, plasma or serum, urine, bronchoalveolar lavage, ...) collected from trial participants. The laboratory results are usually reported as quantitative measures in numeric format; however, sometimes the results are reported as '<xxx' or 'BLQ'.

Laboratory measures rely on an assay, and every assay has its limit - it can only accurately measure a level or concentration down to a certain point. This limit is called the Lower Limit of Quantification (LLOQ), the Limit of Quantification (LOQ), or the Limit of Detection (LOD). 

FDA's guidance (2018) "Bioanalytical Method Validation" defines the quantification range, LLOQ, and ULOQ: 

The quantification range is the range of concentrations, including the ULOQ and the LLOQ that can be reliably and reproducibly quantified with accuracy and precision with a concentration-response relationship.

Lower limit of quantification (LLOQ): The LLOQ is the lowest amount of an analyte that can be quantitatively determined with acceptable precision and accuracy.

Upper limit of quantification (ULOQ): The ULOQ is the highest amount of an analyte in a sample that can be quantitatively determined with precision and accuracy.

According to the article by Vashist and Luong, "Bioanalytical Requirements and Regulatory Guidelines for Immunoassays", the LLOQ and LOQ are different, although in practice they may be used interchangeably. 

The LOQ is the lowest analyte concentration that can be quantitatively detected with a stated accuracy and precision [24]. However, the determination of LOQ depends on the predefined acceptance criteria and performance requirements set by the IA developers. Although such criteria and performances are not internationally adopted, it is of importance to consider the clinical utility of the IA to define such performance requirements.

The LLOQ is the lowest calibration standard on the calibration curve where the detection response for the analyte should be at least five times over the blank. The detection response should be discrete, identifiable, and reproducible. The precision of the determined concentration should be within 20% of the CV while its accuracy should be within 20% of the nominal concentration.
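The numeric acceptance criteria just quoted (detection response at least five times the blank, precision within 20% CV, accuracy within 20% of nominal) are straightforward to express as a check. The sketch below is purely illustrative; the function and argument names are our own, not from any standard or library:

```python
def meets_lloq_criteria(replicates, nominal, blank_response, mean_response):
    """Check a candidate LLOQ calibration standard against the quoted criteria:
    signal >= 5x blank, CV <= 20%, mean accuracy within 20% of nominal.
    replicates: back-calculated concentrations of the standard."""
    n = len(replicates)
    mean = sum(replicates) / n
    sd = (sum((x - mean) ** 2 for x in replicates) / (n - 1)) ** 0.5
    cv = 100 * sd / mean                          # precision, % CV
    accuracy_dev = 100 * abs(mean - nominal) / nominal  # accuracy, % of nominal
    return (mean_response >= 5 * blank_response
            and cv <= 20.0
            and accuracy_dev <= 20.0)
```

For example, three replicates of a 0.01 ng/mL standard reading 0.0098, 0.0101, and 0.0103 have a CV of about 2.5% and an accuracy deviation under 1%, so they pass as long as the detection response clears the 5x-blank requirement.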

In FDA's guidance "Studies to Evaluate the Metabolism and Residue Kinetics of Veterinary Drugs in Food-Producing Animals: Validation of Analytical Methods Used in Residue Depletion Studies", the LOD and LOQ are differentiated slightly: 

3.4. Limit of Detection
The limit of detection (LOD) is the smallest measured concentration of an analyte from which it is possible to deduce the presence of the analyte in the test sample with acceptable certainty. There are several scientifically valid ways to determine LOD and any of these could be used as long as a scientific justification is provided for their use. 
3.5. Limit of Quantitation
The LOQ is the smallest measured content of an analyte above which the determination can be made with the specified degree of accuracy and precision. As with the LOD, there are several scientifically valid ways to determine LOQ and any of these could be used as long as scientific justification is provided. 

If the level or concentration is below the range that the assay can detect, it will be reported as BLQ (Below the Limit of Quantification), BQL (Below Quantification Level), BLOQ (Below the Limit Of Quantification), or '<xxx' where xxx is the LLOQ. The result is seldom reported as 0 or missing, since it is merely undetectable by the corresponding assay. It is generally agreed that BLQ values are not missing values - they are measured, but not measurable. 

In clinical laboratory data collected for safety assessment, the BLQ or '<xxx' result is reported in a character variable. When converting the character variable to a numeric variable, the BLQ or '<xxx' will be automatically set to missing unless we intervene. The following four approaches may be seen in handling BLQ values (example assumes an LLOQ of 0.01 ng/mL): 

Reported Value | Converted Value | Explanation
< 0.01 ng/mL | missing | The specific measure will be set to missing and will not be included in summary and analysis.
< 0.01 ng/mL | 0 | The specific measure will be set to 0 in summary and analysis.
< 0.01 ng/mL | 0.005 ng/mL | Half of the LLOQ - commonly used in clinical pharmacology studies (bioavailability and bioequivalence studies).
< 0.01 ng/mL | 0.01 ng/mL | Ignore the '<' sign and take the LLOQ as the value for summary and analysis. This approach can also handle values beyond the ULOQ (upper limit of quantification), for example '>1000 ng/mL', by removing the '>' sign.
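These four conversion rules can be sketched as a small Python helper. The rule names are our own labels for illustration, not any standard vocabulary:

```python
def convert_lab_value(raw, rule="half_lloq"):
    """Convert a reported character lab value such as '< 0.01' or '> 1000'
    (unit already stripped) to a numeric value under one of the four rules:
    'missing', 'zero', 'half_lloq', or 'face_value'."""
    s = raw.replace(" ", "")
    if s.startswith("<"):
        lloq = float(s[1:])
        if rule == "missing":
            return None                 # excluded from summaries
        if rule == "zero":
            return 0.0
        if rule == "half_lloq":
            return lloq / 2             # common in BA/BE studies
        return lloq                     # 'face_value': drop the '<' sign
    if s.startswith(">"):
        return float(s[1:])             # drop the '>' sign, take the ULOQ
    return float(s)                     # ordinary quantifiable result
```

In a SAS-based workflow the same logic would live in a DATA step deriving the numeric analysis variable from the character result, keeping the original character variable intact for listings.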

In clinical pharmacology studies (bioavailability and bioequivalence studies), serial pharmacokinetic (PK) samples are drawn and analyzed to obtain a PK profile for a specific compound or formulation. The serial samples include a pre-dose sample (drawn before dosing) and samples at multiple time points after dosing. It is entirely possible to have results reported as BLQ, especially for the pre-dose sample and the late time points. BLQ values can also occur in the middle of the PK profile (i.e., between two samples with non-BLQ values). The rules for handling these BLQs differ depending on whether the sample is at pre-dose, in the middle of the profile, or at the end of the PK profile (example assumes an LLOQ of 0.01 ng/mL):

Timepoint | Reported Value | Converted Value | Explanation
Pre-dose sample for a compound with no endogenous level | < 0.01 ng/mL | 0 | The BLQ(s) occurring before the first quantifiable concentration will be set to zero.
Pre-dose sample for a compound with an endogenous level, or pre-dose at steady state | < 0.01 ng/mL | 0.005 ng/mL | The endogenous pre-dose level will be set to half of the LLOQ. In the multiple-dose situation, the pre-dose sample (trough or Cmin) is set to half of the LLOQ.
Middle of the PK profile, between two non-BLQ time points | < 0.01 ng/mL | missing | The BLQ values between two reported concentrations will be set to missing in the analysis - essentially the linear interpolation rule will be used in the AUC calculation.
The last time point(s) of the PK profile | < 0.01 ng/mL | 0 or 0.005 ng/mL | It is common to set the last BLQ(s) to 0, consistent with the rule for pre-dose BLQ handling. According to FDA's "Bioequivalence Guidance", "For a single dose bioequivalence study, AUC should be calculated from time 0 (predose) to the last sampling time associated with quantifiable drug concentration AUC(0-LOQ)." In some situations, the BLQ values after the last non-BLQ measure can also be set to half of the LLOQ.
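The timepoint-dependent rules above can be sketched as follows. This is an illustrative Python helper (None marks a BLQ result); dropping embedded BLQs means the linear trapezoidal AUC implicitly applies linear interpolation across them:

```python
def handle_blq_profile(times, concs, lloq, endogenous=False):
    """Apply timepoint-dependent BLQ rules to a single PK profile.
    concs holds concentrations or None for BLQ. Returns cleaned (t, c) pairs:
    leading/trailing BLQs -> 0 (or LLOQ/2 for an endogenous compound),
    embedded BLQs -> dropped (treated as missing)."""
    is_blq = [c is None for c in concs]
    quant = [i for i, b in enumerate(is_blq) if not b]  # quantifiable indices
    out = []
    for i, (t, c) in enumerate(zip(times, concs)):
        if not is_blq[i]:
            out.append((t, c))
        elif not quant or i < quant[0] or i > quant[-1]:
            out.append((t, lloq / 2 if endogenous else 0.0))
        # else: embedded BLQ between two quantifiable values -> skip
    return out

def auc_linear(profile):
    """Linear trapezoidal AUC over the cleaned (time, concentration) pairs."""
    return sum((t2 - t1) * (c1 + c2) / 2
               for (t1, c1), (t2, c2) in zip(profile, profile[1:]))
```

For example, a profile of (0, BLQ), (1, 2.0), (2, BLQ), (3, 1.0), (4, BLQ) ng/mL becomes (0, 0), (1, 2.0), (3, 1.0), (4, 0), and the trapezoids then bridge the dropped point at t=2.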

There are discussions that these single-imputation methods generate biased estimates. In a presentation by Helen Barnett et al, "Non-compartmental methods for Below Limit of Quantification (BLOQ) responses", they concluded:

It is clear that the method of kernel density imputation is the best performing out of all the methods considered and is hence the preferred method for dealing with BLOQ responses in NCA. 

In a recent paper by Barnett et al (2021, Statistics in Biopharmaceutical Research), "Methods for Non-Compartmental Pharmacokinetic Analysis With Observations Below the Limit of Quantification", eight different methods were discussed for handling the BLQs (or BLOQs). The authors conclude that the kernel-based method performs best in most situations.

  • Method 1: replace BLOQ values with 0
  • Method 2: replace BLOQ values with LOQ/2
  • Method 3: regression on order statistics (ROS) imputation
  • Method 4: maximum likelihood per timepoint (summary)
  • Method 5: maximum likelihood per timepoint (imputation)
  • Method 6: Full Likelihood
  • Method 7: Kernel Density Imputation
  • Method 8: Discarding BLOQ Values
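The effect of the simpler rules (Methods 1, 2, and 8 above) on a summary statistic can be shown with a small sketch; kernel density imputation and the likelihood-based methods are beyond this illustration:

```python
def mean_with_blq(values, lloq, method):
    """Mean concentration at one timepoint under three of the simpler rules
    (None marks a BLOQ observation): 'zero' = Method 1, 'half_loq' = Method 2,
    'discard' = Method 8. Illustrates how the choice shifts the estimate."""
    if method == "zero":
        xs = [0.0 if v is None else v for v in values]
    elif method == "half_loq":
        xs = [lloq / 2 if v is None else v for v in values]
    elif method == "discard":
        xs = [v for v in values if v is not None]
    else:
        raise ValueError(f"unknown method: {method}")
    return sum(xs) / len(xs)
```

With one BLOQ out of three observations, Method 1 gives the lowest mean, Method 2 sits slightly higher, and Method 8 (discarding) is biased upward - a simple demonstration of why the choice of rule matters and must be pre-specified.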
For a specific study, the rules for handling BLQs may differ depending on the time point in the PK profile, the measured compound (with or without endogenous concentrations), single dose versus multiple doses, and study design (single dose, parallel, crossover). No matter what the rules are, they need to be specified (preferably pre-specified before study unblinding if it is a pivotal study whose PK analysis results are the basis for regulatory approval) in the statistical analysis plan (SAP) or PK analysis plan (PKAP).   

Here are two examples with descriptions of the BLQ handling rules. In a phase I study by Shire, the BLQ handling rules are specified as the following: 


In a phase I study by Emergent Product Development, the BLQ rules are described as the following:


REFERENCES: