Thursday, January 14, 2010

Logistic regression: complete or quasi-complete separation of data points

When fitting a logistic regression, we sometimes run into a problem known as ‘complete or quasi-complete separation of data points’. In this situation, the maximum likelihood estimate does not exist. With SAS Proc Logistic, the SAS log gives the warning message "WARNING: There is possibly a quasi-complete separation of data points. The maximum likelihood estimate may not exist. WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable." SAS still reports the Wald test results and odds ratios, but these tests are no longer valid and the results are unreliable (in fact, not accurate at all).

Data with complete separation look something like this:
Y X
0 1
0 2
0 4
1 5
1 6
1 9

There is complete separation because every case with Y = 0 has an X value of 4 or less, and every case with Y = 1 has an X value of 5 or more. In other words, the maximum value in one group is less than the minimum value in the other group. When the maximum value in one group equals the minimum value in the other group, quasi-complete separation may occur.
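
The max/min rule above can be sketched as a quick check for a single numeric predictor. This is an illustrative Python sketch, not SAS code and not what Proc Logistic does internally; the function name `separation_type` is made up for this example, and it only looks at one predictor at a time.

```python
def separation_type(x, y):
    """Classify separation of a 0/1 outcome y by one numeric predictor x.

    Hypothetical helper for illustration: compares the maximum of one
    outcome group with the minimum of the other, in both directions.
    """
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        raise ValueError("both outcome groups must be non-empty")
    if min(x) == max(x):
        # A constant predictor carries no information; not treated as separation here.
        return "none"
    # Try both orderings: group 0 below group 1, then group 1 below group 0.
    for low, high in ((x0, x1), (x1, x0)):
        if max(low) < min(high):
            return "complete"
        if max(low) == min(high):
            return "quasi-complete"
    return "none"


# The example data from above
print(separation_type([1, 2, 4, 5, 6, 9], [0, 0, 0, 1, 1, 1]))
```

With several predictors, separation can also arise from a linear combination of variables, which a one-variable check like this will not catch.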

If the explanatory variable is categorical, complete separation of data points could be something like this:
             Response
Predictor    Failure    Success
0            25         0
1            0          21

Here, there are no successes when the value of the predictor variable is 0, and there are no failures when the value of the predictor variable is 1.
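
For a categorical predictor, the same problem shows up as empty cells in the cross-tabulation. Below is a small Python sketch (not SAS) that lists the predictor-level/outcome combinations that never occur; the helper name `empty_cells` is hypothetical, and the data are rebuilt from the 25/0/0/21 table above.

```python
from collections import Counter


def empty_cells(predictor, response):
    """Return the (predictor level, outcome) pairs with a zero count."""
    counts = Counter(zip(predictor, response))
    levels = sorted(set(predictor))
    outcomes = sorted(set(response))
    # Counter returns 0 for pairs that were never observed.
    return [(lv, out) for lv in levels for out in outcomes
            if counts[(lv, out)] == 0]


# Rebuild the table above: 25 failures at predictor 0, 21 successes at predictor 1
predictor = [0] * 25 + [1] * 21
response = ["Failure"] * 25 + ["Success"] * 21
print(empty_cells(predictor, response))
```

An empty cell means that level of the predictor predicts the outcome perfectly. In this 2x2 example both levels have an empty cell, which is the complete-separation pattern.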

For maximum likelihood estimates to exist, the two distributions must overlap to some extent. Because logistic regression relies on maximum likelihood estimation, when the data points of the two groups do not overlap, the results from the logistic regression model are unreliable and should not be trusted.
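
To see why the estimate does not exist, here is a small Python sketch (not SAS) using the Y/X example data from earlier. It assumes a model of the form logit P(Y=1) = slope * (x - cut), with the cut point fixed at 4.5 (any value strictly between 4 and 5 would do; this parameterization is chosen only for illustration) and evaluates the log-likelihood for steeper and steeper slopes.

```python
import math

x = [1, 2, 4, 5, 6, 9]
y = [0, 0, 0, 1, 1, 1]


def log_likelihood(slope, cut=4.5):
    """Log-likelihood of logit P(Y=1) = slope * (x - cut) on the example data."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = slope * (xi - cut)
        # log P(observed outcome) = -log(1 + exp(-signed)),
        # where signed = eta for y = 1 and -eta for y = 0.
        signed = eta if yi == 1 else -eta
        total += -math.log1p(math.exp(-signed))
    return total


for slope in (0.5, 1, 2, 5, 10, 20):
    print(slope, log_likelihood(slope))
```

The log-likelihood keeps increasing toward 0 as the slope grows, so no finite slope maximizes it. The iterations simply stop at some large value, which is why the reported estimates and standard errors are not meaningful.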

Starting with SAS version 9.2, Proc Logistic provides Firth estimation (the FIRTH option on the MODEL statement) for dealing with complete or quasi-complete separation of data points.

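/* FIRTH requests Firth's penalized likelihood; with no DATA= option, the most recently created data set is used */
proc logistic;
model y = x / firth;
run;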
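
As a rough illustration of why the Firth approach gives finite estimates, the Python sketch below (not SAS, and not the algorithm Proc Logistic actually runs) evaluates Firth's penalized log-likelihood, the log-likelihood plus one half of the log determinant of the Fisher information, on the Y/X example data. It assumes the same simplified path logit P(Y=1) = slope * (x - cut) with the cut fixed at 4.5, so it is a one-dimensional slice rather than a full two-parameter fit.

```python
import math

x = [1, 2, 4, 5, 6, 9]
y = [0, 0, 0, 1, 1, 1]


def firth_penalized_loglik(slope, cut=4.5):
    """l(beta) + 0.5 * log det I(beta) for logit P(Y=1) = slope * (x - cut).

    I(beta) = X'WX for the design (1, x) with weights w = p * (1 - p).
    Intended for moderate slopes only; for very large slopes the
    determinant underflows toward 0 and the log is no longer reliable.
    """
    loglik = 0.0
    sw = swx = swxx = 0.0
    for xi, yi in zip(x, y):
        eta = slope * (xi - cut)
        p = 1.0 / (1.0 + math.exp(-eta))
        loglik += math.log(p) if yi == 1 else math.log(1.0 - p)
        w = p * (1.0 - p)
        sw += w
        swx += w * xi
        swxx += w * xi * xi
    det = sw * swxx - swx * swx  # determinant of the 2x2 information matrix
    return loglik + 0.5 * math.log(det)


grid = (0.1, 0.25, 0.5, 1, 2, 5, 10, 20)
for slope in grid:
    print(slope, firth_penalized_loglik(slope))
print("best slope on this grid:", max(grid, key=firth_penalized_loglik))
```

Unlike the ordinary log-likelihood, the penalized version eventually drops as the slope grows, because the weights p(1 - p) shrink toward 0 and pull the determinant down with them. That is what gives the maximum a finite location.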

However, even with Firth estimation, the results should still be interpreted with extreme caution. Complete and quasi-complete separation tend to occur when the sample size is small, or when the samples are defined by the outcome (i.e., the response) rather than by the explanatory variables. We see many publications where the analysis is based on responders vs. non-responders.


When complete or quasi-complete separation occurs in a multivariate regression, the explanatory variable causing it should be identified and preferably excluded from the model. For a univariate regression, an alternative statistical test (for example, a group t-test) should be used instead.
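
For the univariate case, a group t-test simply compares the mean of X between the Y = 0 and Y = 1 groups. Here is a minimal Python sketch (not SAS) of the pooled-variance t statistic on the example data from earlier; the function name `pooled_t` is made up for illustration, and in SAS the same comparison would normally be run with Proc Ttest.

```python
import math
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    # statistics.variance is the sample variance (divides by n - 1)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(group_b) - mean(group_a)) / se, df


x_when_y0 = [1, 2, 4]
x_when_y1 = [5, 6, 9]
t_stat, df = pooled_t(x_when_y0, x_when_y1)
print(t_stat, df)
```

Note that this answers a different question (whether mean X differs between the groups) and does not model the probability of Y, so it should be read as a fallback rather than a replacement for the regression.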

4 comments:

Xiao said...

Hi, your post is very informative, as I am analyzing a set of data that also has the problem of complete separation. You suggest a group t-test; I wonder how that can be done. Do you mind giving some suggestions? Also, I am analyzing my data with a logit mixed-effects model, so my concern is that the t-test does not take subject/item random effects into consideration. Thank you in advance!

Anonymous said...

A group t-test in this situation is not an ideal approach and should be the last resort when no other approach works. What I am saying is that when there is complete separation, you can still perform a group t-test to compare the two groups.

wei said...

I think this link from SAS is good as well: http://support.sas.com/kb/22/599.html

Usage Note 22599: Understanding and correcting complete or quasi-complete separation problems

superstats said...

Hi, your post is very informative. I am actually dealing with data that have quasi-separation of data points. My question is how to identify the variables causing this. I am using SAS 9.1.3.

Thanks