On Biostatistics and Clinical Trials: Multiple Imputation: Imputation Model versus Analysis Model

Multiple imputation has become increasingly popular for handling missing data in clinical trials. Multiple imputation inference involves three distinct phases:

The missing data are filled in m times to generate m complete data sets. This step uses the imputation model and can be implemented with SAS Proc MI.
The m complete data sets are analyzed using standard procedures. This step uses the analysis model. Depending on the nature of the outcome variable, the analysis model can be ANCOVA (analysis of covariance), MMRM (mixed model for repeated measures), logistic regression, GEE (generalized estimating equations), a generalized linear model (e.g., SAS Proc GENMOD), and so on. The analysis model is also the primary model for analyzing the corresponding outcome variable.
The results from the m complete data sets are combined for inference. This step uses Rubin’s rules and can be implemented with SAS Proc MIANALYZE.
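
The combining step can be illustrated with a minimal Python sketch of Rubin's rules. It is not the SAS implementation; the function name `pool_rubin` and its inputs (one point estimate and one squared standard error per completed data set) are assumptions made for illustration.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their squared standard errors.

    Hypothetical helper: `estimates` holds the m point estimates (e.g., the
    treatment difference from each analysis), `variances` the matching
    squared standard errors.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b      # total variance
    # Rubin's (1987) large-sample degrees of freedom; infinite when b == 0
    df = math.inf if b == 0 else (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    return {"estimate": q_bar, "se": math.sqrt(t), "df": df}
```

The pooled standard error grows with the spread of the estimates across imputations, which is how the uncertainty due to the missing data enters the final inference.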

Both the imputation model and the analysis model need to include a list of explanatory (independent) variables, but for different purposes. In the imputation model, the variables are used to impute the missing values; in the analysis model, they are covariates in the standard statistical model. Here are some comparisons of the variables used in the two models:

The covariates included in the analysis model must also be included in the imputation model.
The imputation model can include additional auxiliary variables, including variables that are not used as covariates in the analysis model.
The number of variables in the imputation model is therefore greater than or equal to the number of variables in the analysis model.
The imputation model can include variables measured after randomization (such as secondary outcomes, concomitant medication use, compliance data). However, for the analysis model, “variables measured after randomisation and so potentially affected by the treatment should not be included as covariates in the primary analysis.”
For longitudinal data or repeated measures, the outcome measures at earlier time points are included in the imputation model.
If the variables used in the analysis model are transformed, the transformed variables should also be used in the imputation model.
If an interaction term is used in the analysis model, it should also be included in the imputation model, although this can make the imputation model quite complicated.
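
The first few points above amount to a simple set comparison that can be checked when the analysis plan is written. The following Python sketch is only an illustration; the function name and all variable names (`week4`, `compliance`, and so on) are hypothetical.

```python
def check_imputation_variables(analysis_vars, imputation_vars):
    """Compare the term lists of the analysis and imputation models.

    Returns the analysis-model terms that the imputation model lacks
    (ideally none) and the auxiliary variables that appear only in the
    imputation model.
    """
    analysis = set(analysis_vars)
    imputation = set(imputation_vars)
    return {
        "missing_from_imputation": sorted(analysis - imputation),
        "auxiliary": sorted(imputation - analysis),
    }


# Hypothetical example: the interaction term was forgotten in the imputation model.
analysis_model = ["treatment", "baseline", "center", "treatment:baseline"]
imputation_model = ["treatment", "baseline", "center", "week4", "week8", "compliance"]
result = check_imputation_variables(analysis_model, imputation_model)
# result["missing_from_imputation"] == ["treatment:baseline"]
# result["auxiliary"] == ["compliance", "week4", "week8"]
```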

In many publications, multiple imputation is stated as the method for handling missing data, but the details of the imputation model (i.e., which variables are included) are usually not described.

While there is no clear guidance about which variables to include in the imputation model, it is important to pre-specify the list, especially if it contains auxiliary variables or other variables not included in the analysis model.

Below are some excerpts from the literature about the imputation model and analysis model.

UCLA Seminar “MULTIPLE IMPUTATION IN STATA”

Imputation Model, Analytic Model and Compatibility :

When developing your imputation model, it is important to assess if your imputation model is “congenial” or consistent with your analytic model. Consistency means that your imputation model includes (at the very least) the same variables that are in your analytic or estimation model. This includes any transformations to variables that will be needed to assess your hypothesis of interest. This can include log transformations, interaction terms, or recodes of a continuous variable into a categorical form, if that is how it will be used in later analysis. The reason for this relates back to the earlier comments about the purpose of multiple imputation. Since we are trying to reproduce the proper variance/covariance matrix for estimation, all relationships between our analytic variables should be represented and estimated simultaneously. Otherwise, you are imputing values assuming they have a correlation of zero with the variables you did not include in your imputation model. This would result in underestimating the association between parameters of interest in your analysis and a loss of power to detect properties of your data that may be of interest such as non-linearities and statistical interactions.

Auxiliary variables are variables in your data set that are either correlated with a missing variable(s) (the recommendation is r > 0.4) or are believed to be associated with missingness. These are factors that are not of particular interest in your analytic model, but they are added to the imputation model to increase power and/or to help make the assumption of MAR more plausible. These variables have been found to improve the quality of imputed values generate from multiple imputation. Moreover, research has demonstrated their particular importance when imputing a dependent variable and/or when you have variables with a high proportion of missing information (Johnson and Young, 2011; Young and Johnson, 2010; Enders, 2010).

You may a priori know of several variables you believe would make good auxiliary variables based on your knowledge of the data and subject matter. Additionally, a good review of the literature can often help identify them as well. However, if you're not sure what variables in the data would be potential candidates (this is often the case when conducting secondary data analysis), you can use some simple methods to help identify potential candidates.
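
One simple screening method of the kind mentioned above can be sketched in Python: for each candidate, compute its correlation with the observed values of the incomplete variable and with the missingness indicator. The r > 0.4 cutoff for the first correlation comes from the excerpt; applying the same cutoff to the missingness indicator, the function names, and the requirement that candidates be fully observed are assumptions of this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_auxiliary(target, candidates, threshold=0.4):
    """Flag candidate auxiliary variables for an incomplete variable.

    target: list of values with None marking missing entries.
    candidates: dict mapping a name to a fully observed list of values.
    """
    missing = [1.0 if v is None else 0.0 for v in target]
    obs = [i for i, v in enumerate(target) if v is not None]
    report = {}
    for name, values in candidates.items():
        # Correlation with the variable itself, using complete pairs only
        r_value = pearson([target[i] for i in obs], [values[i] for i in obs])
        # Correlation with the missingness indicator (same cutoff: an assumption)
        r_miss = pearson(missing, values)
        report[name] = {
            "r_with_value": r_value,
            "r_with_missingness": r_miss,
            "keep": abs(r_value) > threshold or abs(r_miss) > threshold,
        }
    return report
```

A screen like this is only a starting point; subject-matter knowledge should still drive the pre-specified list.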

In a presentation of “multiple imputations” by Adrienne D. Woods

Which variables should you include as predictors in the imputation model?

Any variables you plan to use in later analyses (including controls)
General advice: use as many as possible (could get unwieldy!)
Although, some (i.e., Kline, 2005; Hardt, Herke, & Leonhart, 2012) believe that this introduces more imprecision, especially if the auxiliary variable explains less than 10% of the variance in missingness on Y… thoughts?
Know your analysis model beforehand and include at least all analysis variables in imputation model (including interaction terms)

FDA’s Statistical Review for Vantrela (hydrocodone bitartrate) extended-release tablets for the management of severe pain

Analysis model:

"The primary efficacy endpoint of trial 3103 was change from baseline to week 12 in the weekly average of worst pain intensity (WPI). The primary analysis was ANCOVA model with baseline WPI, randomized treatment, opioid status, and center as covariates. The intent-to-treat analysis population, defined as all randomized patients, was used for the primary efficacy analysis."

Imputation model:

"The applicant performed multiple imputation on the week 12 missing data for the primary analysis. The imputation model included randomized treatment, opioid status, baseline and postbaseline WPI values while subjects in the active-drug treatment group who discontinued study drug because of an adverse event, were treated as if they were in the placebo group and their missing data were imputed based on the observed placebo subjects' data."

FDA's Statistical Review for EUCRISA™ (crisaborole) topical ointment, 2% for atopic dermatitis describes the imputation model used for a missing dichotomized outcome variable.

The protocol specified the primary imputation method to be the multiple imputation (MI) approach. For each treatment arm separately, missing data was imputed using the Markov Chain Monte Carlo (MCMC) method. The protocol specified the following two sensitivity analyses for the handling of missing data:

· Repeated-measures logistic regression model (GEE), with dichotomized ISGA success as the dependent variable and treatment, analysis center, and visit (i.e., Days 8, 15, 22, and 29) as independent factors. In this analysis, data from all post-baseline visits will be included with no imputation for missing data.

· Model-based multiple imputation method to impute missing data for the dichotomized ISGA data. The imputation model (i.e., logistic regression) will include treatment and analysis center.

Kaifeng Lu et al (2010), in Multiple Imputation Approaches for the Analysis of Dichotomized Responses in Longitudinal Studies with Missing Data, pointed out the problem that arises when the analysis model differs from the imputation model.

Despite its conceptual simplicity and flexibility, the above MI procedure is not valid for the analysis of dichotomized responses because Rubin’s variance estimator is biased when the analysis model is different from the imputation model (Meng, 1994; Robins and Wang, 2000). This is true even when the imputation and analysis models are compatible, e.g. when the treatment is the only effect in the logistic regression model.

Ian R. White et al (2012) Including all individuals is not enough: lessons for intention-to-treat analysis

In some cases, an MI procedure can be improved by including in the imputation model ‘auxiliary variables’ that are not in the analysis model [36, Chapter 4]: auxiliary variables in a randomised trial might be secondary outcomes or compliance summaries. MI then produces estimates of the treatment effect that are genuinely different from a likelihood-based analysis, by incorporating information on individuals with missing outcome but observed values of auxiliary variables. However, in our experience, the contribution to such an analysis of individuals missing the outcome of interest is moderate unless correlations between the outcome and one or more auxiliary variables are substantial [37].

Michael Spratt et al (2010) Strategies for Multiple Imputation in Longitudinal Studies

Where there are nontrivial amounts of missing data in covariates, both preliminary analyses and imputation models will become more complex. An MAR assumption may often become more plausible after the inclusion in the imputation model of additional variables that are not in our analysis model (because they are on the causal pathway, for example). Thus, multiple imputation models should typically be more complex than the analysis model. Including variables that are not related to the variable being imputed in the imputation models may slightly decrease efficiency but should not cause bias (29, 31). Model diagnostics should be used to highlight any implausibility in the imputed values. For example, the distributions of observed and imputed data should be compared and the plausibility of any differences examined. Imputation models should also preserve the structure of the analysis model (32). For example, where the substantive analysis exploits the hierarchical nature of longitudinal data (e.g., using a multilevel model), the imputation model should be similarly structured. Here, the longitudinal nature of the data allowed us to include variables (previous wheezing) that predicted the values of the variable with the most missing data (wheeze at 81 months) in imputation models.

Jochen Hardt et al (2012) Auxiliary variables in multiple imputation in regression with missing X: a warning against including too many in small sample research

An additional advantage of MI over CC (complete-case analysis) is the possibility of including information from auxiliary variables into the imputation model. Auxiliary variables are variables within the original data that are not included in the analysis, but are correlated to the variables of interest or help to keep the missing process random [MAR: 1]. Little [6] has calculated the amount of decrease in variance of a regression coefficient Y on X1 when a covariate X2 is added that has no missing data. White and Carlin [7] have extended this proof to more than one covariate. In practice however, it is likely that auxiliary variables themselves will have missing data.

The EMA Guideline on Missing Data in Confirmatory Clinical Trials mentions multiple imputation as an approach for handling missing data under the MAR assumption, but it says nothing about the imputation model.

Panel on Handling Missing Data in Clinical Trials; National Research Council (2010) The Prevention and Treatment of Missing Data in Clinical Trials

Multiple imputation methods address concerns about (b) “simple imputation is generally not true because the methods do not always yield conservative effect estimators, and standard errors and confidence interval widths can be underestimated when uncertainty about the imputation process is neglected.” and enable the use of large amounts of auxiliary information.

An important advantage of multiple imputation in the clinical trial setting is that auxiliary variables that are not included in the final analysis model can be used in the imputation model. For example, consider a longitudinal study of HIV, for which the primary outcome Y is longitudinal CD4 count and that some CD4 counts are missing. Further, assume the presence of auxiliary information V in the form of longitudinal viral load. If V is not included in the model, the MAR condition requires the analysis to assume that, conditional on observed CD4 history, missing outcome data are unrelated to the CD4 count that would have been measured; this assumption may be unrealistic. However, if the investigator can confidently specify the relationship between CD4 count and viral load (e.g., based on knowledge of disease progression dynamics) and if viral load values are observed for all cases, then MAR implies that the predictive distribution of missing CD4 counts given the observed CD4 counts and viral load values is the same for cases with CD4 missing as for cases with CD4 observed, which may be a much more acceptable assumption.

Meyer et al (2020) Statistical Issues and Recommendations for Clinical Trials Conducted During the COVID-19 Pandemic

Multiple imputation (MI) methodology (Rubin, 1987) may be helpful in this respect as it allows inclusion of auxiliary variables (both pre- and post-randomization) in the imputation model while utilizing the previously planned analysis model. Multiple imputation with auxiliary variables may be used for various types of endpoints, including continuous, binary, count, and time-to-event and coupled with various inferential methods in the analysis step.

Thomas R Sullivan et al (2018) Should multiple imputation be the method of choice for handling missing data in randomized trials?

In the first stage of MI, multiple values (m > 1) for each missing observation are independently simulated from an imputation model. For missing data restricted to the outcome, the imputation model would typically regress observed values of Y on X and T. Additional auxiliary variables that are not in the analysis model can also be added to the imputation model to improve the prediction of missing values.

In applying MI, the repeated measurements of the outcome are usually treated as distinct variables in the imputation model. Where interest lies in the treatment effect at the final time point, the analysis model need not include the intermediate outcome measures; following imputation a comparison of final time point results is sufficient. In this case, the intermediate measures operate as auxiliary variables, assisting with the prediction of missing values at the final time point and making the MAR assumption more plausible. Other auxiliary variables, for instance measures of compliance or related outcomes, can also be added to the imputation model as required. If data are collected but more likely to be missing following treatment discontinuation, an indicator variable for discontinuation may also be valuable as an auxiliary variable. The ability to incorporate auxiliary variables, both for univariate and multivariate outcomes, is considered one of the key strengths of MI.

Thus in settings where MI is adopted, we recommend imputing by randomized group; compared to MI overall, this approach offers greater robustness at little cost. The approach is also consistent with general recommendations for over- rather than under-specifying imputation models. It should be noted that imputing by group only protects against bias in estimating the ATE if effect modifiers are included in the imputation model.
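
Imputing by randomized group can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure used in the paper: a single fully observed covariate `x`, a normal linear regression of the outcome on `x` fitted separately within each arm, and a standard noninformative prior for the parameter draws. The function name and the data layout are hypothetical.

```python
import math
import random


def impute_by_group(rows, m=5, seed=2020):
    """Return m completed copies of the outcome, imputing within each arm.

    rows: list of dicts with keys 'arm', 'x' (always observed) and
    'y' (None when missing). Each arm needs at least 3 observed outcomes
    and some variation in x.
    """
    rng = random.Random(seed)
    arms = sorted({r["arm"] for r in rows})
    completed = []
    for _ in range(m):
        y_new = [r["y"] for r in rows]
        for arm in arms:
            obs = [r for r in rows if r["arm"] == arm and r["y"] is not None]
            n = len(obs)
            x_bar = sum(r["x"] for r in obs) / n
            y_bar = sum(r["y"] for r in obs) / n
            sxx = sum((r["x"] - x_bar) ** 2 for r in obs)
            if n < 3 or sxx == 0:
                raise ValueError(f"too little observed data in arm {arm!r}")
            slope = sum((r["x"] - x_bar) * (r["y"] - y_bar) for r in obs) / sxx
            sse = sum((r["y"] - y_bar - slope * (r["x"] - x_bar)) ** 2 for r in obs)
            # Draw the regression parameters so that the imputations reflect
            # uncertainty in the fitted model: sigma^2 = SSE / chi-square(n - 2);
            # with x centered, intercept and slope are independent normals.
            sigma2 = sse / rng.gammavariate((n - 2) / 2, 2)
            a_star = rng.gauss(y_bar, math.sqrt(sigma2 / n))
            b_star = rng.gauss(slope, math.sqrt(sigma2 / sxx))
            for i, r in enumerate(rows):
                if r["arm"] == arm and r["y"] is None:
                    noise = rng.gauss(0, math.sqrt(sigma2))
                    y_new[i] = a_star + b_star * (r["x"] - x_bar) + noise
        completed.append(y_new)
    return completed
```

Because each arm gets its own regression, the sketch allows the covariate effect to differ by arm without writing an explicit treatment-by-covariate interaction into a single imputation model.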

One of the strengths of MI is its ability to easily incorporate variables of different types (e.g. continuous, binary) in the imputation model, whether for univariate or multivariate data. An added benefit of including all outcomes in a single imputation model is that associations between related outcomes can aid imputation. Another appealing feature of MI is its ability to be implemented under an assumption that data are MNAR. This property makes MI well suited to undertaking sensitivity analyses around a primary assumption that data are MAR, and as a primary method of analysis in settings where data are believed to be MNAR. One such setting is RCTs where participants cannot be followed up after discontinuing treatment. If all observed data are ‘on-treatment’, a MAR assumption entails estimating the effect of treatment had all participants remained on their assigned treatment.27 However, for a de facto type estimand (such as ITT), it may be more appropriate to assume that data are MNAR. In this situation, reference based sensitivity analyses have been proposed, which at present require the use of MI.2

The authors do not suggest adding interaction terms as the default strategy:

Although the bias of MI overall could be eliminated by including the interaction term in the imputation model (results not shown), this may not be an obvious strategy if subgroup analyses are not of interest.

Simon Grund et al (2018) Multiple Imputation of Missing Data for Multilevel Models: Simulations and Recommendations

A crucial point in the application of MI to multilevel data is that the imputation model not only includes all relevant variables, but also that it “matches” the model of interest (i.e., the substantive analysis model; see Meng, 1994; Schafer, 2003). In other words, the imputation model must capture the relevant aspects of the analysis model, making the imputation model at least as general as (or more general than) the analysis model. If the imputation model is more restrictive than the analysis model, then imputations are generated under a simplified set of assumptions, and the results of subsequent analyses may be misleading.

Protocol for: Hatemi G, Mahr A, Ishigatsubo Y, et al. Trial of apremilast for oral ulcers in Behçet’s syndrome. N Engl J Med 2019;381:1918-28. DOI: 10.1056/NEJMoa1816594

REFERENCES:

von Hippel (2009) "How to Impute Interactions, Squares, and Other Transformed Variables"
von Hippel (2013) "Should a Normal Imputation Model be Modified to Impute Skewed Variables?"
White et al (2010) "Avoiding bias due to perfect prediction in multiple imputation of incomplete categorical variables"
Yang Yuan "Multiple Imputation for Missing Data: Concepts and New Development (Version 9.0)"
Cro et al (2020) "Sensitivity analysis for clinical trials with missing continuous outcome data using controlled multiple imputation: A practical guide"
SAS/STAT 15.2 User's Guide: Proc MI and Proc MIANALYZE

Sunday, December 06, 2020
