Multiple imputation has become increasingly popular for
handling missing data in clinical trials. Multiple imputation inference
involves three distinct phases:
- The missing data are filled in m times to generate m
complete data sets. This step uses the imputation model and can be
implemented using SAS Proc MI
- The m complete data sets are analyzed using standard
procedures. This step uses the analysis model; depending on the nature
of the outcome variable, the analysis model can be ANCOVA (analysis of
covariance), MMRM (mixed model repeated measures), logistic regression, GEE
(generalized estimating equation), a generalized linear model (e.g., SAS Proc GENMOD), and so on. The
analysis model is also the primary model for analyzing the corresponding
outcome variable.
- The results from the m complete data sets are combined for
the inference. This step uses Rubin’s rules and can be implemented with SAS Proc
MIANALYZE
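The three phases can be sketched in a few lines of Python. This is only an illustration on simulated data, not a substitute for Proc MI and Proc MIANALYZE: the data, the variable names (`trt`, `x`, `y`), the bootstrap-based linear imputation model, and the difference-in-means analysis model are all assumptions made for the example.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for y ~ x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd

def impute_once(rows, rng):
    """Phase 1: fill in missing y once, separately within each treatment arm."""
    completed = []
    for arm in (0, 1):
        arm_rows = [r for r in rows if r["trt"] == arm]
        observed = [r for r in arm_rows if r["y"] is not None]
        # Refit on a bootstrap sample so each imputation reflects parameter uncertainty
        boot = [rng.choice(observed) for _ in observed]
        a, b, sd = fit_line([r["x"] for r in boot], [r["y"] for r in boot])
        for r in arm_rows:
            y = r["y"] if r["y"] is not None else a + b * r["x"] + rng.gauss(0, sd)
            completed.append({**r, "y": y})
    return completed

def analyze(rows):
    """Phase 2: analysis model (difference in mean y between arms) and its variance."""
    y1 = [r["y"] for r in rows if r["trt"] == 1]
    y0 = [r["y"] for r in rows if r["trt"] == 0]
    est = statistics.fmean(y1) - statistics.fmean(y0)
    var = statistics.variance(y1) / len(y1) + statistics.variance(y0) / len(y0)
    return est, var

def pool_rubin(estimates, variances):
    """Phase 3: combine the m results with Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)          # pooled point estimate
    within = statistics.fmean(variances)         # average within-imputation variance
    between = statistics.variance(estimates)     # between-imputation variance
    return q_bar, within + (1 + 1 / m) * between

# Simulated trial (hypothetical): y depends on baseline x and treatment
rng = random.Random(2024)
rows = []
for i in range(200):
    trt = i % 2
    x = rng.gauss(50, 10)
    y = 5 + 0.8 * x + 3 * trt + rng.gauss(0, 5)
    # MAR: the outcome is more likely to be missing when baseline x is high
    if rng.random() < (0.4 if x > 55 else 0.1):
        y = None
    rows.append({"trt": trt, "x": x, "y": y})

m = 20
results = [analyze(impute_once(rows, rng)) for _ in range(m)]
pooled_est, pooled_var = pool_rubin(*zip(*results))
print(f"pooled difference = {pooled_est:.2f}, SE = {pooled_var ** 0.5:.2f}")
```

Here the imputation model (baseline `x`, fitted by arm) is richer than the analysis model (treatment only), which is the direction the comparisons below recommend.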
Both the imputation model and the analysis model need to
include a list of explanatory (independent) variables, but for different purposes.
In the imputation model, the explanatory variables are used to impute
the missing values; in the analysis model, they are covariates that form part of the standard statistical model. Here
are some comparisons of the variables used in the imputation model and the analysis
model:
- The covariates included in the analysis model must also be
included in the imputation model
- The imputation model can include additional auxiliary
variables, including variables that are not used as covariates in the analysis
model
- The number of variables used in the imputation model is greater
than or equal to the number of variables in the analysis model
- The imputation model can include variables measured after randomization (such as secondary outcomes, concomitant medication use, compliance data). However, for the analysis model, “variables measured after randomisation and so potentially affected by the treatment should not be included as covariates in the primary analysis.”
- For longitudinal data or repeated measures, the outcome measures at earlier time points will be included in the imputation model.
- If the variables used in the analysis model are transformed, the transformed variable should also be used in the imputation model
- If the interaction term is used in the analysis model, it should also be included in the imputation model - this can make the imputation model pretty complicated though.
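The first rule in the list (every analysis-model term, including transformed variables and interactions, should also appear in the imputation model) is easy to check mechanically before the analysis is run. A minimal sketch, with purely hypothetical variable names:

```python
def missing_from_imputation(analysis_terms, imputation_terms):
    """Return the analysis-model terms that are absent from the imputation model."""
    available = set(imputation_terms)
    return [term for term in analysis_terms if term not in available]

# Hypothetical pre-specified variable lists
analysis_terms = ["trt", "baseline", "log_crp", "trt:baseline"]
imputation_terms = ["trt", "baseline", "log_crp", "week4", "week8", "compliance"]

print(missing_from_imputation(analysis_terms, imputation_terms))
# → ['trt:baseline']
```

In this example the interaction term used in the analysis model has been left out of the imputation model, so the check flags it.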
In many publications, multiple imputation is stated as the method for handling missing data, but the details of the imputation model (i.e., which variables are included) are usually not described.
While there is no clear guidance on which variables to include in the imputation model, it is important to pre-specify the list of variables, especially when it contains auxiliary variables, that is, variables not included in the analysis model.
Below are some excerpts from the literature about the
imputation model and analysis model.
UCLA Seminar “MULTIPLE IMPUTATION IN STATA”
Imputation Model, Analytic Model and Compatibility:
When developing your imputation model, it is important to
assess if your imputation model is “congenial” or consistent with your analytic
model. Consistency means that your imputation model includes (at the very
least) the same variables that are in your analytic or estimation model. This
includes any transformations to variables that will be needed to assess your
hypothesis of interest. This can include log transformations, interaction
terms, or recodes of a continuous variable into a categorical form, if that is
how it will be used in later analysis. The reason for this relates back to the
earlier comments about the purpose of multiple imputation. Since we are trying
to reproduce the proper variance/covariance matrix for estimation, all
relationships between our analytic variables should be represented and
estimated simultaneously. Otherwise, you are imputing values assuming they have
a correlation of zero with the variables you did not include in your imputation
model. This would result in underestimating the association between parameters
of interest in your analysis and a loss of power to detect properties of your
data that may be of interest such as non-linearities and statistical
interactions.
Auxiliary variables are variables in your data set that are
either correlated with a missing variable(s) (the recommendation is
r > 0.4) or are believed to be associated with missingness. These
are factors that are not of particular interest in your analytic model, but
they are added to the imputation model to increase power and/or to help make
the assumption of MAR more plausible. These variables have been found to
improve the quality of imputed values generated from multiple imputation.
Moreover, research has demonstrated their particular importance when imputing a
dependent variable and/or when you have variables with a high proportion of
missing information (Johnson and Young, 2011; Young and Johnson, 2010; Enders,
2010).
You may a priori know of several variables you believe would
make good auxiliary variables based on your knowledge of the data and subject
matter. Additionally, a good review of the literature can often help identify
them as well. However, if you're not sure what variables in the data would be
potential candidates (this is often the case when conducting secondary data
analysis), you can use some simple methods to help identify potential
candidates.
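One such simple method is to screen each candidate by its correlation with the incomplete variable and with an indicator of missingness. The sketch below is an assumption about how that screening could look, not the seminar's own code: the data are simulated, the candidate names are hypothetical, and the 0.4 cut-off is the rule of thumb quoted above.

```python
import random
import statistics

def pearson(a, b):
    """Pearson correlation between two equal-length numeric lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def screen_auxiliary(target, candidates, r_cut=0.4):
    """For each fully observed candidate, return (correlation with the target among
    observed cases, correlation with the missingness indicator, passes r_cut)."""
    miss = [1.0 if v is None else 0.0 for v in target]
    obs = [i for i, v in enumerate(target) if v is not None]
    report = {}
    for name, values in candidates.items():
        r_value = pearson([target[i] for i in obs], [values[i] for i in obs])
        r_miss = pearson(miss, values)
        report[name] = (r_value, r_miss, abs(r_value) > r_cut)
    return report

# Simulated data (hypothetical): 'related' tracks y and drives missingness, 'noise' does neither
rng = random.Random(11)
y_full = [rng.gauss(0, 1) for _ in range(300)]
related = [v + rng.gauss(0, 0.5) for v in y_full]
noise = [rng.gauss(0, 1) for _ in range(300)]
y = [None if rng.random() < (0.5 if a > 0.5 else 0.1) else v
     for v, a in zip(y_full, related)]

for name, (r_value, r_miss, keep) in screen_auxiliary(
        y, {"related": related, "noise": noise}).items():
    print(f"{name}: r with y = {r_value:.2f}, r with missingness = {r_miss:.2f}, keep = {keep}")
```

A variable that fails the correlation cut-off may still be worth keeping if it is clearly related to missingness, so the second correlation is reported for judgment rather than used as an automatic filter.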
In a presentation of “multiple
imputations” by Adrienne D. Woods
Which variables should you include as predictors in the
imputation model?
- Any variables you plan to use in later analyses (including
controls)
- General advice: use as many as possible (could get unwieldy!)
- Although, some (i.e., Kline, 2005; Hardt, Herke, & Leonhart, 2012) believe
that this introduces more imprecision, especially if the auxiliary variable
explains less than 10% of the variance in missingness on Y… thoughts?
- Know your analysis model beforehand and include
at least all analysis variables in imputation model (including interaction
terms)
FDA’s Statistical
Review for Vantrela (hydrocodone bitartrate) extended-release tablets in
Management of pain severe
Analysis model:
"The primary efficacy endpoint of trial 3103 was change from
baseline to week 12 in the weekly average of worst pain intensity (WPI). The
primary analysis was ANCOVA model with baseline WPI, randomized treatment,
opioid status, and center as covariates. The intent-to-treat analysis
population, defined as all randomized patients, was used for the primary
efficacy analysis."
Imputation model:
"The applicant performed multiple imputation on the week 12
missing data for the primary analysis. The imputation model included randomized
treatment, opioid status, baseline and postbaseline WPI values while subjects
in the active-drug treatment group who discontinued study drug because of an
adverse event, were treated as if they were in the placebo group and their
missing data were imputed based on the observed placebo subjects' data."
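The placebo-based element of this imputation model can be illustrated with a simplified sketch. Everything here is an assumption made for the example: the data are simulated, the field names (`arm`, `baseline`, `week12`, `ae_dropout`) are hypothetical, only baseline is used as a predictor (the applicant's model also used opioid status and post-baseline values), and a single imputation is drawn without propagating parameter uncertainty.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for y ~ x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd

def impute_placebo_based(rows, rng):
    """Impute missing week-12 values; active-arm AE dropouts borrow the placebo model."""
    models = {}
    for arm in ("placebo", "active"):
        obs = [r for r in rows if r["arm"] == arm and r["week12"] is not None]
        models[arm] = fit_line([r["baseline"] for r in obs], [r["week12"] for r in obs])
    completed = []
    for r in rows:
        if r["week12"] is None:
            # Active-arm subjects who stopped for an adverse event are treated as placebo
            source = "placebo" if r["arm"] == "active" and r["ae_dropout"] else r["arm"]
            a, b, sd = models[source]
            r = {**r, "week12": a + b * r["baseline"] + rng.gauss(0, sd),
                 "imputed_from": source}
        completed.append(r)
    return completed

# Simulated pain scores (hypothetical): larger improvement on active drug
rng = random.Random(7)
rows = []
for i in range(120):
    arm = "active" if i % 2 else "placebo"
    baseline = rng.gauss(7, 1)
    week12 = baseline - (2.5 if arm == "active" else 1.0) + rng.gauss(0, 1)
    ae_dropout = arm == "active" and rng.random() < 0.15
    if ae_dropout or rng.random() < 0.1:
        week12 = None
    rows.append({"arm": arm, "baseline": baseline, "week12": week12,
                 "ae_dropout": ae_dropout})

completed = impute_placebo_based(rows, rng)
print(sum(1 for r in completed if r.get("imputed_from") == "placebo"), "values imputed from the placebo model")
```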
FDA's Statistical
Review for EUCRISA™ (crisaborole) topical ointment, 2% for Atopic Dermatitis describes the imputation model for a missing dichotomized outcome variable.
The protocol specified the primary imputation method to be
the multiple imputation (MI) approach. For each treatment arm separately,
missing data was imputed using the Markov Chain Monte Carlo (MCMC) method. The
protocol specified the following two sensitivity analyses for the handling of
missing data:
- Repeated-measures logistic regression model (GEE), with dichotomized ISGA
success as the dependent variable and treatment, analysis center, and visit
(i.e., Days 8, 15, 22, and 29) as independent factors. In this analysis, data
from all post-baseline visits will be included with no imputation for missing
data.
- Model-based multiple imputation method to impute missing data for the
dichotomized ISGA data. The imputation model (i.e., logistic regression) will
include treatment and analysis center.
Kaifeng Lu et al (2010) Multiple
Imputation Approaches for the Analysis of Dichotomized Responses in
Longitudinal Studies with Missing Data pointed out a problem that arises when the analysis model differs from the imputation model.
Despite its conceptual simplicity and flexibility, the above
MI procedure is not valid for the analysis of dichotomized responses because
Rubin’s variance estimator is biased when the analysis model is different from
the imputation model (Meng, 1994; Robins and Wang, 2000). This is true even
when the imputation and analysis models are compatible, e.g. when the treatment
is the only effect in the logistic regression model.
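For reference, the variance estimator discussed in this excerpt is the total variance T in Rubin's rules. The notation below is the standard textbook form rather than that of Lu et al.: the completed-data estimates are written as Q̂_i and their estimated variances as U_i, for i = 1, …, m.

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m} U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i - \bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr) B
```

Here Ū is the average within-imputation variance and B is the between-imputation variance.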
Ian R. White et al
(2012) Including
all individuals is not enough: lessons for intention-to-treat analysis
In some cases, an MI procedure can be improved by including
in the imputation model ‘auxiliary variables’ that are not in the analysis
model [36, Chapter 4]: auxiliary variables in a randomised trial might be
secondary outcomes or compliance summaries. MI then produces estimates of the
treatment effect that are genuinely different from a likelihood-based analysis,
by incorporating information on individuals with missing outcome but observed
values of auxiliary variables. However, in our experience, the contribution to
such an analysis of individuals missing the outcome of interest is moderate
unless correlations between the outcome and one or more auxiliary variables are
substantial [37].
Michael Spratt et al (2010) Strategies for
Multiple Imputation in Longitudinal Studies
Where there are nontrivial amounts of missing data in
covariates, both preliminary analyses and imputation models will become more
complex. An MAR assumption may often become more plausible after the inclusion
in the imputation model of additional variables that are not in our analysis
model (because they are on the causal pathway, for example). Thus, multiple imputation models
should typically be more complex than the analysis model. Including
variables that are not related to the variable being imputed in the imputation
models may slightly decrease efficiency but should not cause bias (29, 31).
Model diagnostics should be used to highlight any implausibility in the imputed
values. For example, the distributions of observed and imputed data should be
compared and the plausibility of any differences examined. Imputation models
should also preserve the structure of the analysis model (32). For example,
where the substantive analysis exploits the hierarchical nature of longitudinal
data (e.g., using a multilevel model), the imputation model should be similarly
structured. Here, the longitudinal nature of the data allowed us to include
variables (previous wheezing) that predicted the values of the variable with
the most missing data (wheeze at 81 months) in imputation models.
Jochen Hardt et al (2012) Auxiliary
variables in multiple imputation in regression with missing X: a warning
against including too many in small sample research
- An additional advantage of MI over CC (complete-case analysis) is the possibility of
including information from auxiliary variables into the imputation model. Auxiliary
variables are variables within the original data that are not included in the
analysis, but are correlated to the variables of interest or help to keep the
missing process random [MAR: 1]. Little [6] has calculated the amount of
decrease in variance of a regression coefficient Y on X1 when a covariate X2 is
added that has no missing data. White and Carlin [7] have extended this proof
to more than one covariate. In practice however, it is likely that auxiliary
variables themselves will have missing data.
EMA Guideline
on Missing Data in Confirmatory Clinical Trials mentions multiple imputation as an approach for handling
missing data under the MAR assumption; however, it does not say anything
about the imputation model.
Panel on Handling Missing Data in Clinical Trials; National
Research Council (2010) The
Prevention and Treatment of Missing Data in Clinical Trials
Multiple imputation methods address concerns about (b)
[“simple imputation is generally not true because the methods do not always
yield conservative effect estimators, and standard errors and confidence
interval widths can be underestimated when uncertainty about the imputation
process is neglected.”] and enable the
use of large amounts of auxiliary information.
An important advantage of multiple imputation in the
clinical trial setting is that auxiliary variables that are not included in the
final analysis model can be used in the imputation model. For example, consider
a longitudinal study of HIV, for which the primary outcome Y is longitudinal
CD4 count and that some CD4 counts are missing. Further, assume the presence of
auxiliary information V in the form of longitudinal viral load. If V is not
included in the model, the MAR condition requires the analysis to assume that, conditional
on observed CD4 history, missing outcome data are unrelated to the CD4 count
that would have been measured; this assumption may be unrealistic. However, if
the investigator can confidently specify the relationship between CD4 count and
viral load (e.g., based on knowledge of disease progression dynamics) and if
viral load values are observed for all cases, then MAR implies that the
predictive distribution of missing CD4 counts given the observed CD4 counts and
viral load values is the same for cases with CD4 missing as for cases with CD4
observed, which may be a much more acceptable assumption.
Meyer et al (2020) Statistical
Issues and Recommendations for Clinical Trials Conducted During the COVID-19
Pandemic
Multiple imputation (MI) methodology (Rubin, 1987) may be
helpful in this respect as it allows inclusion of auxiliary variables (both
pre- and post-randomization) in the imputation model while utilizing the
previously planned analysis model. Multiple imputation with auxiliary variables
may be used for various types of endpoints, including continuous, binary,
count, and time-to-event and coupled with various inferential methods in the
analysis step.
Thomas R Sullivan et al (2018) Should
multiple imputation be the method of choice for handling missing data in
randomized trials?
In the first stage of MI, multiple values (m > 1) for
each missing observation are independently simulated from an imputation model.
For missing data restricted to the outcome, the imputation model would
typically regress observed values of Y on X and T. Additional auxiliary
variables that are not in the analysis model can also be added to the
imputation model to improve the prediction of missing values.
In applying MI, the repeated measurements of the outcome are
usually treated as distinct variables in the imputation model. Where interest
lies in the treatment effect at the final time point, the analysis model need
not include the intermediate outcome measures; following imputation a
comparison of final time point results is sufficient. In this case, the intermediate
measures operate as auxiliary variables, assisting with the prediction of
missing values at the final time point and making the MAR assumption more
plausible. Other auxiliary variables, for instance measures of compliance or
related outcomes, can also be added to the imputation model as required. If
data are collected but more likely to be missing following treatment
discontinuation, an indicator variable for discontinuation may also be valuable
as an auxiliary variable. The ability to incorporate auxiliary variables, both
for univariate and multivariate outcomes, is considered one of the key
strengths of MI.
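Treating repeated measurements as distinct variables usually means reshaping the data from one row per visit to one row per subject before imputation. A minimal sketch, with hypothetical field names and visit labels, might look like this:

```python
def to_wide(long_rows, visits):
    """Reshape one-row-per-visit records into one row per subject.
    Visits with no record are kept as None so they can be imputed."""
    wide = {}
    for r in long_rows:
        subj = wide.setdefault(
            r["id"],
            {"id": r["id"], "trt": r["trt"], **{f"y_{v}": None for v in visits}},
        )
        subj[f"y_{r['visit']}"] = r["y"]
    return list(wide.values())

# Hypothetical long-format data: subject 2 dropped out after week 4
long_rows = [
    {"id": 1, "trt": "A", "visit": "wk4", "y": 10.0},
    {"id": 1, "trt": "A", "visit": "wk8", "y": 9.0},
    {"id": 1, "trt": "A", "visit": "wk12", "y": 8.5},
    {"id": 2, "trt": "B", "visit": "wk4", "y": 11.0},
]
wide = to_wide(long_rows, ["wk4", "wk8", "wk12"])

# The imputation model uses every visit; the analysis model only the final one
imputation_vars = ["trt", "y_wk4", "y_wk8", "y_wk12"]
analysis_vars = ["trt", "y_wk12"]
auxiliary_vars = [v for v in imputation_vars if v not in analysis_vars]
print(auxiliary_vars)  # → ['y_wk4', 'y_wk8']
```

The intermediate visits end up as auxiliary variables in the sense of the excerpt: they inform the imputation of the week-12 value but do not appear in the final-time-point analysis.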
Thus in settings where MI is adopted, we recommend imputing
by randomized group; compared to MI overall, this approach offers greater robustness
at little cost. The approach is also consistent with general recommendations
for over- rather than under-specifying imputation models. It should be
noted that imputing by group only protects against bias in estimating the ATE
if effect modifiers are included in the imputation model.
One of the strengths of MI is its ability to easily
incorporate variables of different types (e.g. continuous, binary) in the
imputation model, whether for univariate or multivariate data. An added benefit
of including all outcomes in a single imputation model is that associations
between related outcomes can aid imputation. Another appealing feature of MI is
its ability to be implemented under an assumption that data are MNAR. This
property makes MI well suited to undertaking sensitivity analyses around a
primary assumption that data are MAR, and as a primary method of analysis in
settings where data are believed to be MNAR. One such setting is RCTs where
participants cannot be followed up after discontinuing treatment. If all observed
data are ‘on-treatment’, a MAR assumption entails estimating the effect of
treatment had all participants remained on their assigned treatment.27 However,
for a de facto type estimand (such as ITT), it may be more appropriate to
assume that data are MNAR. In this situation, reference based sensitivity
analyses have been proposed, which at present require the use of MI.2
Interaction terms are not suggested.
Although the bias of MI overall could be eliminated by
including the interaction term in the imputation model (results not shown),
this may not be an obvious strategy if subgroup analyses are not of interest.
Simon Grund et al (2018) Multiple
Imputation of Missing Data for Multilevel Models: Simulations and
Recommendations
A crucial point in the application of MI to multilevel data is that the imputation model not only includes all relevant variables, but also that it “matches” the model of interest (i.e., the substantive analysis model; see Meng, 1994; Schafer, 2003). In other words, the imputation model must capture the relevant aspects of the analysis model, making the imputation model at least as general as (or more general than) the analysis model. If the imputation model is more restrictive than the analysis
model, then imputations are generated under a simplified set of assumptions, and the results of subsequent analyses may be misleading.
Protocol for: Hatemi G, Mahr A, Ishigatsubo Y, et al. Trial
of apremilast for oral ulcers in Behçet’s syndrome. N Engl J Med
2019;381:1918-28. DOI: 10.1056/NEJMoa1816594
REFERENCES: