Sunday, May 20, 2012

Log(x+1) Data Transformation


In data analysis, the data are sometimes skewed and not normally distributed, so a data transformation is needed. We are all familiar with typical approaches such as the log transformation and the square root transformation. As a special case of the logarithmic transformation, log(x+1), also written log(1+x), can be used as well.

The first time I had to use the log(x+1) transformation was for a dose-response data set where the doses were on an exponential scale and the control group had a dose concentration of zero. The data set came from a so-called Whole Effluent Toxicity (WET) test. The WET test, one of the aquatic toxicological experiments, has been used by the US Environmental Protection Agency (USEPA) to identify effluents and receiving waters containing toxic materials, and to estimate the toxicity of wastewater. In WET testing, many different species and several endpoints are used to measure the aggregate toxic effect of an effluent. For many of these biological endpoints, toxicity is manifested as a reduction in the response relative to the control group.

WET testing is often designed with multiple concentrations, and includes a minimum of five concentrations of effluent and one control group. From a dose-response analysis standpoint, the control group dose is therefore zero and the other concentrations are spaced on an exponential scale. Prior to the analysis, a log transformation of the dose, log(x), is usually applied. Since the control group dose is zero and log(0) is undefined, an easy solution is to use log(x+1). For the control group, log(0+1) = 0, which seems to be a perfect approach in this case.
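As a minimal sketch of the dose transformation described above, the snippet below applies log(x+1) to a made-up concentration series. The dose values are hypothetical and are not taken from any actual WET test; only the Python standard library is assumed.

```python
import math

# Hypothetical effluent concentrations; 0.0 is the control group.
# These numbers are illustrative only, not from a real test.
doses = [0.0, 6.25, 12.5, 25.0, 50.0, 100.0]

# Plain log(x) cannot handle the control dose: math.log(0.0) raises ValueError.
try:
    math.log(0.0)
except ValueError as err:
    print("log(0) failed:", err)

# math.log1p(x) computes log(1 + x) and maps the control dose to exactly 0.
log_doses = [math.log1p(d) for d in doses]

for d, ld in zip(doses, log_doses):
    print(f"dose = {d:6.2f}   log(dose + 1) = {ld:.4f}")
```

The order of the doses is preserved, and the control group sits at zero on the transformed scale.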

However, in clinical trials I have seen many applications of the log transformation, but not of the log(x+1) transformation. On the FDA website, I could find only one study where the log(1+x) transformation was used. In an advisory committee meeting document for AstraZeneca’s drug esomeprazole, the statistical analysis for the primary endpoint was described as:
"The primary endpoint, change from baseline in signs and symptoms of GERD observed from video and cardiorespiratory monitoring, was analyzed by ANCOVA. Prior to the analysis, the number of events at baseline and final visit were normalized (to correspond to 8 hours observation time) then log-transformed via a log(1+x) transformation. The ANCOVA of change from baseline on the log-scale was adjusted for treatment and baseline. The least square means (lsmeans) for each treatment group were transformed and expressed as estimated percentage changes from baseline, and the lsmean for the esomeprazole treatment effect was transformed similarly, and expressed as a percentage difference from placebo, which was presented with the associated 2-sided 95% CI and p-value. "

Recently I read an article by Lachin et al., “Sample size requirements for studies of treatment effects on Beta-cell function in newly diagnosed type 1 diabetes”, where various data transformation techniques were compared and log(x+1) and sqrt(x) (square root of x) were suggested for the primary endpoint, the mean C-peptide AUC. According to the paper, “Most C-peptide values will fall between 0 and 1 and the distribution is positively skewed. Thus, scale-contracting transformations were considered. However, the log transformation could introduce negative skewness because log(x) approaches negative infinity as the value x approaches zero. This can be corrected by using log(x+1)”

In a discussion with my friend Dr. Song, from the CDC, we came up with the following Q&A regarding the use of the log(x+1) transformation:

Q: Is log(x+1) a fine approach for data transformation?
A: It’s fine to use ln(x+1) as long as the transformation makes the data approximately normal and the variance relatively constant.

Q: The reason for using the log(x+1) transformation is to avoid log(x) approaching negative infinity as x approaches zero. Could we instead change the measurement unit from pmol/mL to pmol/dL (1 pmol/mL = 100 pmol/dL)?
A: log(100x) = log(100) + log(x). Changing the unit only shifts every transformed value by the constant log(100), which can make the values positive, but it does not change the shape of the distribution or its variability. From a statistical point of view, it is equivalent to the log(x) transformation.
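The identity in this answer can be checked numerically. The sketch below uses hypothetical values (not data from the paper) and shows that switching units moves every log value by the same constant and leaves the standard deviation unchanged.

```python
import math
import statistics

# Hypothetical positive values in pmol/mL (illustrative only).
x_per_ml = [0.02, 0.05, 0.08, 0.10, 0.15, 0.22, 0.35, 0.60, 0.90, 1.40]
# The same measurements expressed in pmol/dL (100 times larger).
x_per_dl = [100 * v for v in x_per_ml]

log_ml = [math.log(v) for v in x_per_ml]
log_dl = [math.log(v) for v in x_per_dl]

# Every value is shifted by the same constant, log(100).
shifts = [b - a for a, b in zip(log_ml, log_dl)]
print("shifts:", [round(s, 6) for s in shifts])
print("log(100):", round(math.log(100), 6))

# A constant shift leaves the spread of the transformed values unchanged.
sd_ml = statistics.stdev(log_ml)
sd_dl = statistics.stdev(log_dl)
print("SD of log values:", sd_ml, sd_dl)
```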

Q: With log(x), square root of x, and similar transformations, we can back-transform the calculated values or estimates to the original scale. With log(x+1), is there a problem converting the calculated values or estimates back to the original scale?
A: According to the paper by Lachin, “For each transformation y=f(x), the mean values and confidence limits are presented using the inverse transformation applied to the mean of the transformed values, and the corresponding confidence limits. Thus, for an analysis using y=log(x), the inverse mean is the geometric mean exp(mean y). For an analysis using y=log(x+1), the inverse mean is the geometric-like mean exp(mean y) - 1. For an analysis using y=sqrt(x), the inverse mean is (mean y)**2.”
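The three inverse means quoted above can be sketched as follows. The function name `back_transformed_means` and the example data are hypothetical, chosen only to illustrate the formulas exp(mean y), exp(mean y) - 1 and (mean y)**2.

```python
import math
import statistics

def back_transformed_means(x):
    """Mean on each transformed scale, mapped back to the original scale.

    Hypothetical helper for illustration; x must be positive for log(x).
    """
    mean_log = statistics.fmean([math.log(v) for v in x])
    mean_log1p = statistics.fmean([math.log1p(v) for v in x])
    mean_sqrt = statistics.fmean([math.sqrt(v) for v in x])
    return {
        "log(x)": math.exp(mean_log),        # geometric mean
        "log(x+1)": math.expm1(mean_log1p),  # exp(mean y) - 1
        "sqrt(x)": mean_sqrt ** 2,           # (mean y)**2
    }

# Made-up example values.
example = [1.0, 3.0]
result = back_transformed_means(example)
for name, value in result.items():
    print(f"{name:9s} inverse mean = {value:.4f}")
print(f"arithmetic mean = {statistics.fmean(example):.4f}")
```

For skewed data, each of these back-transformed means is at most the arithmetic mean, so they should be reported as what they are rather than as estimates of the arithmetic mean.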

Q: Does the choice of the better transformation depend on the range of the values?
A: The log(x+1) transformation is often used for data that are right-skewed but also include zero values. The shape of the resulting distribution depends on how large x is compared to the constant 1, and therefore on the units in which x was measured. In the C-peptide AUC mean situation, all transformations are similar at the higher level of x (mean = 0.04 at month 24), but at the lower level of x (mean = 0.01 at month 12), sqrt(x) is better than log(x+1) and log(x+1) is better than log(x). This can be easily understood from the curvature of the transformations shown in the following graph.
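The dependence on units can also be checked numerically. The sketch below uses hypothetical right-skewed values and a simple moment-based skewness (the `skewness` helper is defined here for illustration); it compares log(x+1) applied to the same data in two different units.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical right-skewed values, mostly well below 1 (illustrative only).
x_small = [0.02, 0.05, 0.08, 0.10, 0.15, 0.22, 0.35, 0.60, 0.90, 1.40]
# The same data in a unit that makes the values 100 times larger.
x_large = [100 * v for v in x_small]

skew_raw = skewness(x_small)
# When x is small relative to 1, log(x+1) is close to x, so little skew is removed.
skew_small = skewness([math.log1p(v) for v in x_small])
# When x is large relative to 1, log(x+1) behaves much like log(x).
skew_large = skewness([math.log1p(v) for v in x_large])

print(f"skewness of raw x:          {skew_raw:.3f}")
print(f"skewness of log(x+1):       {skew_small:.3f}")
print(f"skewness of log(100x + 1):  {skew_large:.3f}")
```

With these made-up values, the transformation removes far more skewness when the data are large relative to the constant 1, which is why the result should always be inspected in the units actually used.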



Some additional notes about the use of log(x+1) transformation:
  • Any base for the logarithm can be used, but base 10 is often used because of its interpretability.
  • In addition to log(x+1), the log(2x+1) or log(x+3/8) transformation may also be used.
  • Remember to re-inspect the data after transformation to confirm its suitability. This applies no matter which data transformation approach is used.
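On the first note, changing the base only rescales the transformed values by a constant, so it does not affect the shape of the distribution. A small sketch with arbitrary example values:

```python
import math

# Arbitrary example values, including zero.
x = [0.0, 0.5, 1.0, 9.0, 99.0]

natural = [math.log1p(v) for v in x]        # ln(x + 1)
base10 = [math.log10(v + 1.0) for v in x]   # log10(x + 1)

# Skip the first element (both are 0) and compare the rest:
# ln(x + 1) / log10(x + 1) is always ln(10).
ratios = [n / b for n, b in zip(natural[1:], base10[1:])]
print("ratios:", [round(r, 6) for r in ratios])
print("ln(10):", round(math.log(10), 6))
```

Base 10 has the convenience that x + 1 = 10 maps to 1 and x + 1 = 100 maps to 2.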

1 comment:

Unknown said...

Thanks for the nice post. I don't understand the choice of log(2x+1) or log(x+3/8).
It seems to me that any transformation that keeps the result from approaching minus infinity would be proper. And for ease of interpretability, it should be log(x+c) where c>=1. Or am I missing something?