Monday, August 01, 2016

Should hypothesis tests be performed and p-values be provided for safety variables in efficacy evaluation clinical trials?

A p-value is the probability of observing a test statistic at least as extreme as the one calculated, assuming the null hypothesis is true. In many situations, the p-value from hypothesis testing has been over-used, mis-used, or mis-interpreted. The American Statistical Association (ASA) seems to be fed up with the mis-use of p-values and has formally issued a statement about the p-value (see AMERICAN STATISTICAL ASSOCIATION RELEASES STATEMENT ON STATISTICAL SIGNIFICANCE AND P-VALUES). The statement provides the following six principles to improve the conduct and interpretation of quantitative science.
  • P-values can indicate how incompatible the data are with a specified statistical model.
  • P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  • Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  • Proper inference requires full reporting and transparency. 
  • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

However, we continue to see cases where p-values are over-used, mis-used, mis-interpreted, or used for the wrong purpose. One area of p-value misuse is the analysis of safety endpoints, such as adverse events and laboratory parameters, in clinical trials. One of the ASA principles is that “a p-value, or statistical significance, does not measure the size of an effect or the importance of a result”. Ironically, this misconception is exactly why people present tens or hundreds of p-values: they hope that a p-value will measure the size of an effect or the importance of a result.

In a recent article in the New England Journal of Medicine, p-values were provided for each adverse event even though testing hypotheses about the incidence of each adverse event was not the intention of the study. In Marso et al. (2016), Liraglutide and Cardiovascular Outcomes in Type 2 Diabetes, the following summary table was presented with p-values for individual adverse events.

The study protocol did not mention any inferential analysis for adverse events. It is clear that the p-values presented in the article are post-hoc and unplanned. Here is the analysis plan for AEs in the protocol.

“AEs are summarised descriptively. The summaries of AEs are made displaying the number of subjects with at least one event, the percentage of subjects with at least one event, the number of events and the event rate per 100 years. These summaries are done by seriousness, severity, relation to treatment, MESI, withdrawal due to AEs and outcome.”

In the same article, the appendix also presented p-values for cardiovascular and anti-diabetes medications at baseline and during the trial. However, it could be misleading to interpret the results based on these p-values. For example, for statins introduced during the trial, the rates are 379 / 4668 = 8.1% in the liraglutide group and 450 / 4672 = 9.6% in the placebo group, with a p-value of 0.01. While the p-value is statistically significant, the difference in rates (8.1% versus 9.6%, about 1.5 percentage points) is not really meaningful.
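The arithmetic in the statin example can be checked with a short sketch. The article does not state which test produced the p-value, so the Pearson chi-square test without continuity correction used here is an assumption, and the function name is hypothetical. Only the four counts come from the text above.

```python
import math

def two_proportion_summary(events_a, n_a, events_b, n_b):
    """Return (rate_a, rate_b, chi-square statistic, p-value) for a 2x2 table.

    Assumption: Pearson chi-square test without continuity correction;
    the article does not say which test was actually used.
    """
    a, b = events_a, n_a - events_a   # group A: with event, without event
    c, d = events_b, n_b - events_b   # group B: with event, without event
    n = n_a + n_b
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return a / n_a, c / n_b, chi2, p_value

# Statins introduced during the trial: liraglutide vs placebo (counts from the text).
rate_lira, rate_placebo, chi2, p_value = two_proportion_summary(379, 4668, 450, 4672)

print(f"liraglutide rate: {rate_lira:.1%}")
print(f"placebo rate:     {rate_placebo:.1%}")
print(f"difference:       {rate_placebo - rate_lira:.1%}")
print(f"chi-square:       {chi2:.2f}")
print(f"p-value:          {p_value:.3f}")
```

With more than 4,600 subjects per arm, even a difference of about 1.5 percentage points yields a small p-value, which is why the p-value alone says little about whether the difference matters.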

Similarly, in another NEJM article (Goss et al. (2016), Extending Aromatase-Inhibitor Adjuvant Therapy to 10 Years), p-values were provided for individual adverse events.

Usually, clinical trials are designed to assess the treatment effect on efficacy endpoints, not on safety endpoints such as adverse events and laboratory test results. In a clinical trial, many different adverse events may be reported. Providing a p-value for each adverse event could be mis-interpreted as testing for a statistically significant difference between treatment groups for each event. Applying hypothesis tests uniformly and indiscriminately to tens or hundreds of different adverse event terms goes against sound statistical principles.
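To see why testing tens or hundreds of adverse event terms is a problem, here is a minimal sketch of the family-wise error rate. It assumes independent tests, each at a significance level of 0.05. Real adverse event terms are correlated, so the numbers are illustrative only, and the function names are hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests independent
    tests when every null hypothesis is true (independence is an assumption)."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests

for n_tests in (1, 10, 50, 100):
    print(
        f"{n_tests:>3} tests: "
        f"P(at least one false positive) = {familywise_error_rate(n_tests):.3f}, "
        f"Bonferroni threshold = {bonferroni_threshold(n_tests):.4f}"
    )
```

Under these assumptions, a table with 100 adverse event terms is almost certain to show at least one "significant" p-value by chance alone, while a Bonferroni-style correction makes the per-test threshold so strict that few real differences would pass it.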

The FDA Center for Drug Evaluation and Research (CDER) has a reviewer guidance, Conducting a Clinical Safety Review of a New Product Application and Preparing a Report on the Review. It contains the following statements about hypothesis testing for safety endpoints.
Approaches to evaluation of the safety of a drug generally differ substantially from methods used to evaluate effectiveness. Most of the studies in phases 2-3 of a drug development program are directed toward establishing effectiveness. In designing these trials, critical efficacy endpoints are identified in advance, sample sizes are estimated to permit an adequate assessment of effectiveness, and serious efforts are made, in planning interim looks at data or in controlling multiplicity, to preserve the type 1 error (alpha error) for the main end point. It is also common to devote particular attention to examining critical endpoints by defining them with great care and, in many cases, by using blinded committees to adjudicate them. In contrast, with few exceptions, phase 2-3 trials are not designed to test specified hypotheses about safety nor to measure or identify adverse reactions with any pre-specified level of sensitivity. The exceptions occur when a particular concern related to the drug or drug class has arisen and when there is a specific safety advantage being studied. In these cases, there will often be safety studies with primary safety endpoints that have all the features of hypothesis testing, including blinding, control groups, and pre-specified statistical plans.
In the usual case, however, any apparent finding emerges from an assessment of dozens of potential endpoints (adverse events) of interest, making description of the statistical uncertainty of the finding using conventional significance levels very difficult. The approach taken is therefore best described as one of exploration and estimation of event rates, with particular attention to comparing results of individual studies and pooled data. It should be appreciated that exploratory analyses (e.g., subset analyses, to which a great caution is applied in a hypothesis testing setting) are a critical and essential part of a safety evaluation. These analyses can, of course, lead to false conclusions, but need to be carried out nonetheless, with attention to consistency across studies and prior knowledge. The approach typically followed is to screen broadly for adverse events and to expect that this will reveal the common adverse reaction profile of a new drug and will detect some of the less common and more serious adverse reactions associated with drug use.
Identifying Common and Drug-Related Adverse Events
For common adverse events, the reviewer should attempt to identify those events that can reasonably be considered drug related. Although it is tempting to use hypothesis-testing methods, any reasonable correction for multiplicity would make a finding almost impossible, and studies are almost invariably underpowered for statistically valid detection of small differences. The most persuasive evidence for causality is a consistent difference from control across studies, and evidence of dose response.
The reviewer may also consider specifying criteria for the minimum rate and the difference between drug and placebo rate that would be considered sufficient to establish that an event is drug related (e.g., for a given dataset, events occurring at an incidence of at least 5 percent and for which the incidence is at least twice, or some other percentage greater than, the placebo incidence would be considered common and drug related). The reviewer should be mindful that such criteria are inevitably arbitrary and sensitive to sample size.
Standard Analyses and Explorations of Laboratory Data
This review should generally include three standard approaches to the analysis of laboratory data. The first two analyses are based on comparative trial data. The third analysis should focus on all patients in the phase 2 to 3 experience. Analyses are intended to be descriptive and should not be thought of as hypothesis testing. P-values or confidence intervals can provide some evidence of the strength of the finding, but unless the trials are designed for hypothesis testing (rarely the case), these should be thought of as descriptive. Generally, the magnitude of change is more important than the p-value for the difference.
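The example screening criteria quoted from the reviewer guidance (an incidence of at least 5 percent on drug and at least twice the placebo incidence) can be sketched as a simple descriptive filter with no p-values. The function name, the default thresholds as parameters, and all counts below are hypothetical and made up for illustration. They do not come from any trial.

```python
def is_common_and_drug_related(drug_events, drug_n, placebo_events, placebo_n,
                               min_rate=0.05, min_ratio=2.0):
    """Apply the guidance's example rule: drug incidence of at least min_rate
    and at least min_ratio times the placebo incidence."""
    drug_rate = drug_events / drug_n
    placebo_rate = placebo_events / placebo_n
    if drug_rate < min_rate:
        return False
    return drug_rate >= min_ratio * placebo_rate

# Hypothetical counts: (drug events, drug N, placebo events, placebo N).
hypothetical_counts = {
    "Nausea": (60, 500, 20, 500),    # 12% vs 4%: common and at least twice placebo
    "Headache": (55, 500, 50, 500),  # 11% vs 10%: common but similar to placebo
    "Rash": (10, 500, 2, 500),       # 2% vs 0.4%: five-fold ratio but below 5%
}

flagged = [
    term for term, counts in hypothetical_counts.items()
    if is_common_and_drug_related(*counts)
]
print(flagged)
```

As the guidance itself cautions, such thresholds are arbitrary and sensitive to sample size, so a filter like this is a screening aid rather than a test.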
PhUSE is an independent, not-for-profit organisation run by volunteers. Since its inception, PhUSE has expanded from its roots as a conference for European Statistical Programmers, to a global platform for the discussion of topics encompassing the work of Data Managers, Biostatisticians, Statistical Programmers and eClinical IT professionals. PhUSE is run by statistical programmers, but it is attempting to put together guidelines on how statistical tables should be presented. I guess that statisticians may not agree with all of their proposals.

PhUSE has published a draft proposal, “Analyses and Displays Associated with Adverse Events – Focus on Adverse Events in Phase 2-4 Clinical Trials and Integrated Summary Documents”. The proposal has a specific section about the presentation of p-values in adverse event summary tables.
6.2. P-values and Confidence Intervals
There has been ongoing debate on the value or lack of value for the inclusion of p-values and/or confidence intervals in safety assessments (Crowe, et. al. 2009). This white paper does not attempt to resolve this debate. As noted in the Reviewer Guidance, p-values or confidence intervals can provide some evidence of the strength of the finding, but unless the trials are designed for hypothesis testing, these should be thought of as descriptive. Throughout this white paper, p-values and measures of spread are included in several places. Where these are included, they should not be considered as hypothesis testing. If a company or compound team decides that these are not helpful as a tool for reviewing the data, they can be excluded from the display.
Some teams may find p-values and/or confidence intervals useful to facilitate focus, but have concerns that lack of “statistical significance” provides unwarranted dismissal of a potential signal. Conversely, there are concerns that due to multiplicity issues, there could be over-interpretation of p-values adding potential concern for too many outcomes. Similarly, there are concerns that the lower- or upper-bound of confidence intervals will be over-interpreted. It is important for the users of these TFLs to be educated on these issues.
Similarly, PhUSE also has a white paper on “Analyses and Displays Associated with Demographics, Disposition, and Medications in Phase 2-4 Clinical Trials and Integrated Summary Documents”, where p-values in summary tables for demographics and concomitant medications are also discussed.
6.1.1. P-values
There has been ongoing debate on the value or lack of value of the inclusion of p-values in assessments of demographics, disposition, and medications. This white paper does not attempt to resolve this debate. Using p-values for the purpose of describing a population is generally considered to have no added value. The controversy usually pertains to safety assessments. Throughout this white paper, p-values have not been included. If a company or compound team decides that these will be helpful as a tool for reviewing the data, they can be included in the display.
It is very common for p-values to be provided for demographic and baseline characteristics, to check that the treatment groups are balanced (no difference between groups) on key characteristics. These demographic and baseline characteristics are usually the factors used for sub-group analyses.

It is also very common for p-values not to be provided for safety and ancillary variables such as adverse events, laboratory parameters, concomitant medications, and medical histories. The obvious concerns are multiplicity, lack of pre-specification, the interpretation of these p-values, and the mis-interpretation of a p-value as a measure of the importance of a result. Safety analyses are still mainly descriptive summaries unless specific safety variables are pre-specified for hypothesis testing. The safety assessment is sometimes based on qualitative rather than quantitative analysis, which is why the narratives for serious adverse events (SAEs) play a critical role in safety assessment. For example, it is now well known that the drug Tysabri is effective in treating relapsing-remitting multiple sclerosis but increases the risk of progressive multifocal leukoencephalopathy (PML), an opportunistic viral infection of the brain that usually leads to death or severe disability. PML is very rare and is not expected to be seen in clinical trial subjects. If any PML case is reported in the Tysabri treatment group, it will be considered clinically significant even though the p-value may not be statistically significant.