A p-value is the probability of observing a test statistic at least as extreme as the one actually calculated, assuming the null hypothesis is true.
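In notation, for a one-sided test (a generic textbook formulation, with T the test statistic and t_obs its observed value, not tied to any specific test discussed in this post):

```latex
% p-value for a one-sided test: the probability, computed under the
% null hypothesis H_0, of a test statistic T at least as extreme as
% the observed value t_obs
p = \Pr\left(T \geq t_{\mathrm{obs}} \mid H_0\right)
```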
In many situations, p-values from hypothesis testing have been over-used, mis-used, or mis-interpreted. The American Statistical Association (ASA), seemingly fed up with the misuse of p-values, has formally issued a statement about the p-value (see "American Statistical Association Releases Statement on Statistical Significance and P-values"). The statement also provides the following six principles to improve the conduct and interpretation of quantitative science.
- P-values can indicate how incompatible the data are with a specified statistical model.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Proper inference requires full reporting and transparency.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
However, we continue to see cases where p-values are over-used, mis-used, mis-interpreted, or used for the wrong purpose. One area of p-value misuse is the analysis of safety endpoints such as adverse events and laboratory parameters in clinical trials. One of the ASA principles is that "a p-value, or statistical significance, does not measure the size of an effect or the importance of a result"; ironically, this is precisely why people present p-values for tens or hundreds of adverse events, hoping that a p-value will convey the size of an effect or the importance of a result.
In a recent article in the New England Journal of Medicine, p-values were provided for each adverse event even though hypothesis testing to compare the incidence of each adverse event was not an intention of the study. In Marso et al. (2016), "Liraglutide and Cardiovascular Outcomes in Type 2 Diabetes," a summary table was presented with p-values for individual adverse events.
The study protocol did not mention any inferential analysis for adverse events; it is clear that the p-values presented in the article are post hoc and unplanned. Here is the analysis plan for AEs in the protocol:
“AEs are summarised
descriptively. The summaries of AEs are made displaying the number of subjects
with at least one event, the percentage of subjects with at least one event,
the number of events and the event rate per 100 years. These summaries are done
by seriousness, severity, relation to treatment, MESI, withdrawal due to AEs
and outcome.”
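This kind of descriptive summary is straightforward to produce. Below is a minimal sketch with hypothetical data and made-up column names (not the trial's actual programming), reading "event rate per 100 years" as events per 100 patient-years of exposure:

```python
# A minimal sketch (hypothetical records and exposure figures) of the
# descriptive AE summary the protocol describes: subjects with at least
# one event, percentage, number of events, and rate per 100 patient-years.
import pandas as pd

# Hypothetical AE records: one row per reported event
ae = pd.DataFrame({
    "subject":   ["01", "01", "02", "03"],
    "treatment": ["drug", "drug", "drug", "placebo"],
})
# Hypothetical group sizes and total follow-up (patient-years) per group
n_randomized   = {"drug": 100,   "placebo": 100}
exposure_years = {"drug": 150.0, "placebo": 160.0}

for arm, grp in ae.groupby("treatment"):
    n_subj   = grp["subject"].nunique()              # subjects with >= 1 event
    n_events = len(grp)                              # total number of events
    pct      = 100 * n_subj / n_randomized[arm]
    rate     = 100 * n_events / exposure_years[arm]  # per 100 patient-years
    print(f"{arm}: {n_subj} subjects ({pct:.0f}%), {n_events} events, "
          f"{rate:.1f} events per 100 patient-years")
```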
In this same article, the appendix also presented p-values for cardiovascular and anti-diabetes medications at baseline and during the trial. However, it could be misleading to interpret the results based on these p-values. For example, for statins introduced during the trial, the rates were 379/4668 = 8.1% in the Liraglutide group and 450/4672 = 9.6% in the Placebo group, with a p-value of 0.01. While the p-value is statistically significant, the difference in rates (8.1% versus 9.6%) is not clinically meaningful.
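The appendix does not say which test produced these p-values, but a simple two-proportion chi-square test on the reported counts (a sketch, not necessarily the authors' method) reproduces a p-value of about 0.01, despite an absolute difference of only about 1.5 percentage points:

```python
# A minimal sketch: two-proportion comparison of statins introduced
# during the trial, using the counts reported in the NEJM appendix.
from scipy.stats import chi2_contingency

liraglutide = (379, 4668)  # (events, group size)
placebo     = (450, 4672)

table = [
    [liraglutide[0], liraglutide[1] - liraglutide[0]],  # [events, non-events]
    [placebo[0],     placebo[1] - placebo[0]],
]
chi2, p, dof, expected = chi2_contingency(table, correction=False)

print(f"Liraglutide rate: {liraglutide[0] / liraglutide[1]:.1%}")  # 8.1%
print(f"Placebo rate:     {placebo[0] / placebo[1]:.1%}")          # 9.6%
print(f"p-value: {p:.3f}")  # ~0.010: 'significant', yet the difference
                            # is only ~1.5 percentage points
```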
Similarly, in another NEJM article, Goss et al. (2016), "Extending Aromatase-Inhibitor Adjuvant Therapy to 10 Years," p-values were provided for individual adverse events.
Usually, clinical trials are designed to assess the treatment effect for efficacy endpoints, not for safety endpoints such as adverse events and laboratory test results. In a clinical trial, many different adverse events may be reported. Providing a p-value for each adverse event could be mis-interpreted as testing for a statistically significant difference in each event between treatment groups. Uniformly and indiscriminately applying hypothesis tests to tens or hundreds of different adverse event terms violates basic statistical principles of multiplicity control, as the sketch below illustrates.
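Under the simplifying assumption of independent tests, each at alpha = 0.05, the chance of at least one false-positive "significant" adverse event grows quickly with the number of AE terms tested (a minimal sketch):

```python
# Family-wise error rate under independence: with m tests at level alpha,
# P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
for m in (1, 10, 50, 100, 200):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:>3} AE terms tested -> P(>=1 false positive) = {fwer:.0%}")
# ->   1: 5%, 10: 40%, 50: 92%, 100: 99%, 200: 100% (rounded)
```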
The FDA Center for Drug Evaluation and Research (CDER) has a reviewer guidance, Conducting a Clinical Safety Review of a New Product Application and Preparing a Report on the Review. It makes the following statements about hypothesis testing for safety endpoints:
Approaches to evaluation of the safety of a drug generally differ substantially from methods used to evaluate effectiveness. Most of the studies in phases 2-3 of a drug development program are directed toward establishing effectiveness. In designing these trials, critical efficacy endpoints are identified in advance, sample sizes are estimated to permit an adequate assessment of effectiveness, and serious efforts are made, in planning interim looks at data or in controlling multiplicity, to preserve the type 1 error (alpha error) for the main end point. It is also common to devote particular attention to examining critical endpoints by defining them with great care and, in many cases, by using blinded committees to adjudicate them. In contrast, with few exceptions, phase 2-3 trials are not designed to test specified hypotheses about safety nor to measure or identify adverse reactions with any pre-specified level of sensitivity. The exceptions occur when a particular concern related to the drug or drug class has arisen and when there is a specific safety advantage being studied. In these cases, there will often be safety studies with primary safety endpoints that have all the features of hypothesis testing, including blinding, control groups, and pre-specified statistical plans.
In the usual case, however, any apparent finding emerges from an assessment of dozens of potential endpoints (adverse events) of interest, making description of the statistical uncertainty of the finding using conventional significance levels very difficult. The approach taken is therefore best described as one of exploration and estimation of event rates, with particular attention to comparing results of individual studies and pooled data. It should be appreciated that exploratory analyses (e.g., subset analyses, to which a great caution is applied in a hypothesis testing setting) are a critical and essential part of a safety evaluation. These analyses can, of course, lead to false conclusions, but need to be carried out nonetheless, with attention to consistency across studies and prior knowledge. The approach typically followed is to screen broadly for adverse events and to expect that this will reveal the common adverse reaction profile of a new drug and will detect some of the less common and more serious adverse reactions associated with drug use.
7.1.5.5 Identifying Common and Drug-Related Adverse Events
For common adverse events, the reviewer should attempt to identify those events that can reasonably be considered drug related. Although it is tempting to use hypothesis-testing methods, any reasonable correction for multiplicity would make a finding almost impossible, and studies are almost invariably underpowered for statistically valid detection of small differences. The most persuasive evidence for causality is a consistent difference from control across studies, and evidence of dose response. The reviewer may also consider specifying criteria for the minimum rate and the difference between drug and placebo rate that would be considered sufficient to establish that an event is drug related (e.g., for a given dataset, events occurring at an incidence of at least 5 percent and for which the incidence is at least twice, or some other percentage greater than, the placebo incidence would be considered common and drug related). The reviewer should be mindful that such criteria are inevitably arbitrary and sensitive to sample size.
7.1.7.3 Standard Analyses and Explorations of Laboratory Data
This review should generally include three standard approaches to the analysis of laboratory data. The first two analyses are based on comparative trial data. The third analysis should focus on all patients in the phase 2 to 3 experience. Analyses are intended to be descriptive and should not be thought of as hypothesis testing. P-values or confidence intervals can provide some evidence of the strength of the finding, but unless the trials are designed for hypothesis testing (rarely the case), these should be thought of as descriptive. Generally, the magnitude of change is more important than the p-value for the difference.
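As an aside, the screening rule described in section 7.1.5.5 above is easy to express in code. Here is a minimal sketch with hypothetical AE counts and an illustrative helper function (the 5 percent and twice-placebo thresholds are the guidance's example cutoffs, which it stresses are arbitrary):

```python
# A minimal sketch (hypothetical data) of the reviewer guidance's example
# screening rule: flag an AE as 'common and drug related' if its incidence
# is at least 5% and at least twice the placebo incidence.
def flag_common_drug_related(drug_events, drug_n, placebo_events, placebo_n,
                             min_rate=0.05, min_ratio=2.0):
    drug_rate = drug_events / drug_n
    placebo_rate = placebo_events / placebo_n
    return drug_rate >= min_rate and drug_rate >= min_ratio * placebo_rate

# Hypothetical counts: (drug events, drug N, placebo events, placebo N)
aes = {
    "nausea":   (120, 1000, 40, 1000),  # 12% vs 4.0% -> flagged
    "headache": ( 60, 1000, 55, 1000),  #  6% vs 5.5% -> not twice placebo
    "fatigue":  ( 30, 1000, 10, 1000),  #  3% vs 1.0% -> below 5% minimum
}
for name, counts in aes.items():
    print(f"{name}: common and drug related = {flag_common_drug_related(*counts)}")
```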
PhUSE is an independent, not-for-profit organisation run by volunteers. Since its inception, PhUSE has expanded from its roots as a conference for European Statistical Programmers to a global platform for the discussion of topics encompassing the work of Data Managers, Biostatisticians, Statistical Programmers, and eClinical IT professionals. PhUSE is run by statistical programmers, but it is attempting to put together guidelines about how statistical tables should be presented; I guess that statisticians may not agree with all of their proposals.
PhUSE has published a draft proposal, "Analyses and Displays Associated with Adverse Events – Focus on Adverse Events in Phase 2-4 Clinical Trials and Integrated Summary Documents". The proposal has a specific section on the presentation of p-values in adverse event summary tables.
6.2. P-values and Confidence Intervals
There has been ongoing debate on the value or lack of value for the inclusion of p-values and/or confidence intervals in safety assessments (Crowe, et. al. 2009). This white paper does not attempt to resolve this debate. As noted in the Reviewer Guidance, p-values or confidence intervals can provide some evidence of the strength of the finding, but unless the trials are designed for hypothesis testing, these should be thought of as descriptive. Throughout this white paper, p-values and measures of spread are included in several places. Where these are included, they should not be considered as hypothesis testing. If a company or compound team decides that these are not helpful as a tool for reviewing the data, they can be excluded from the display.
Some teams may find p-values and/or confidence intervals useful to facilitate focus, but have concerns that lack of “statistical significance” provides unwarranted dismissal of a potential signal. Conversely, there are concerns that due to multiplicity issues, there could be over-interpretation of p-values adding potential concern for too many outcomes. Similarly, there are concerns that the lower- or upper-bound of confidence intervals will be over-interpreted. It is important for the users of these TFLs to be educated on these issues.
Similarly, PhUSE also has a white paper, "Analyses and Displays Associated with Demographics, Disposition, and Medications in Phase 2-4 Clinical Trials and Integrated Summary Documents," where p-values in summary tables for demographics and concomitant medications are also discussed.
6.1.1. P-values
There has been ongoing debate on the value or lack of value of the inclusion of p-values in assessments of demographics, disposition, and medications. This white paper does not attempt to resolve this debate. Using p-values for the purpose of describing a population is generally considered to have no added value. The controversy usually pertains to safety assessments. Throughout this white paper, p-values have not been included. If a company or compound team decides that these will be helpful as a tool for reviewing the data, they can be included in the display.
It is very common for p-values to be provided for demographic and baseline characteristics to check that there is balance (no difference between treatment groups) in key demographic and baseline characteristics. These characteristics are usually the factors used for sub-group analyses.
It is also very common that p-values are not provided for safety and ancillary variables such as adverse events, laboratory parameters, concomitant medications, and medical histories. The obvious concerns are multiplicity, lack of pre-specification, the interpretation of these p-values, and the mis-interpretation of a p-value as a measure of the importance of a result. Safety analyses are still mainly descriptive summaries unless specific safety variables are pre-specified for hypothesis testing. Safety assessment is sometimes based on qualitative rather than quantitative analysis; this is why narratives for serious adverse events (SAEs) play a critical role in safety assessment. For example, it is now well known that the drug Tysabri is effective in treating relapsing-remitting multiple sclerosis but increases the risk of progressive multifocal leukoencephalopathy (PML), an opportunistic viral infection of the brain that usually leads to death or severe disability. PML is very rare and is not expected to be seen in clinical trial subjects. If any PML case is reported in the Tysabri treatment group, it will be considered significant even though the p-value may not be.
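To see why a qualitative signal can matter more than the p-value, consider a hypothetical sketch (illustrative counts, not the actual Tysabri trial data): two cases of a rare, serious event among 1,000 treated patients versus none among 1,000 on placebo give a Fisher exact p-value nowhere near significance, yet two PML cases would be a major safety signal.

```python
# A minimal sketch (hypothetical counts): a rare, serious adverse event
# can be a major safety signal even when its p-value is unremarkable.
from scipy.stats import fisher_exact

table = [[2,  998],   # treatment: [events, non-events]
         [0, 1000]]   # placebo
odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(f"Fisher exact p-value: {p:.2f}")  # ~0.50, far from 'significant'
```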
Also interesting is the supplemental material for the ASA statement: http://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108