A data set with complete separation looks something like this:
Y X
0 1
0 2
0 4
1 5
1 6
1 9
There is complete separation because every case with Y = 0 has an X value of 4 or less, and every case with Y = 1 has an X value of 5 or more. In other words, the maximum value in one group is less than the minimum value in the other group. When the maximum value in one group equals the minimum value in the other group, quasi-complete separation may occur.
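One way to check a continuous predictor for this pattern is to compare its range within each level of the response. The following is a minimal sketch using the six observations listed above; the data set name `sep` is an assumption made for illustration.

```sas
/* Example data from the listing above; the data set name "sep" is hypothetical */
data sep;
  input y x;
  datalines;
0 1
0 2
0 4
1 5
1 6
1 9
;
run;

/* Minimum and maximum of X within each level of Y */
proc means data=sep min max;
  class y;
  var x;
run;
```

For these data the X range is 1 to 4 when Y = 0 and 5 to 9 when Y = 1. The two ranges do not overlap, which is the signature of complete separation.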
If the explanatory variable is categorical, complete separation of data points could be something like this:
            Response
Predictor   Failure   Success
0           25        0
1           0         21
Here there are no successes when the value of the predictor variable is 0, and there are no failures when the value of the predictor variable is 1.
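For a categorical predictor, a cross-tabulation makes the empty cells easy to spot. The sketch below rebuilds the table above from its cell counts; the data set name `cells` and the variable names are assumptions made for illustration.

```sas
/* Cell counts from the table above; names are hypothetical */
data cells;
  input predictor response $ count;
  datalines;
0 Failure 25
0 Success 0
1 Failure 0
1 Success 21
;
run;

/* Cross-tabulate predictor by response, weighting each row by its cell count */
proc freq data=cells;
  weight count;
  tables predictor*response / norow nocol nopercent;
run;
```

A zero in every off-diagonal (or every diagonal) cell of a 2x2 table like this one indicates complete separation.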
For maximum likelihood estimates to exist, the distributions of the two groups must overlap to some extent. Logistic regression models are fitted by maximum likelihood, so when the data points of the two groups do not overlap, the likelihood keeps increasing as the coefficient grows without bound and no finite estimate exists. The results from such a model are unreliable and should not be trusted.
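The reason can be sketched for the six-observation example above. The symbols $\beta_0$, $\beta_1$, and the cutoff $c$ are notation introduced here for illustration, with $c$ taken as any value between 4 and 5.

```latex
% Logistic model and log-likelihood
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}, \qquad
\ell(\beta_0, \beta_1) = \sum_i \bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr]

% Restrict to the line \beta_0 = -c\,\beta_1 with 4 < c < 5 and \beta_1 > 0.
% Every case with y_i = 1 has x_i > c and every case with y_i = 0 has x_i < c, so
\ell(-c\,\beta_1, \beta_1) = -\sum_i \log\!\bigl(1 + e^{-\beta_1 \lvert x_i - c \rvert}\bigr)

% This is strictly increasing in \beta_1 and
\lim_{\beta_1 \to \infty} \ell(-c\,\beta_1, \beta_1) = 0
```

Because the log-likelihood is strictly negative for every finite pair $(\beta_0, \beta_1)$, the upper bound of 0 is approached but never attained, so no finite maximum likelihood estimate exists.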
Starting with SAS 9.2, PROC LOGISTIC provides Firth's penalized likelihood estimation (the FIRTH option on the MODEL statement) for dealing with quasi-complete or complete separation of data points:
proc logistic;
model y = x /firth;
run;
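A fuller, self-contained version of this call might look like the sketch below. It uses the six observations from the first example. The data set name `sep`, the `event='1'` response option, and the `clparm=pl` request for profile-likelihood confidence limits are additions made for illustration and are not part of the original call.

```sas
/* Example data with complete separation; the data set name "sep" is hypothetical */
data sep;
  input y x;
  datalines;
0 1
0 2
0 4
1 5
1 6
1 9
;
run;

/* Firth's penalized likelihood; model the probability that y = 1 */
proc logistic data=sep;
  model y(event='1') = x / firth clparm=pl;
run;
```

Running the same MODEL statement without the FIRTH option is a quick way to see how PROC LOGISTIC reports the separation problem for a given data set.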
However, even after Firth estimation, the results should still be interpreted with extreme caution. Complete and quasi-complete separation tend to occur when the sample size is small, or when the samples are defined by the outcome (i.e., the response) rather than by the explanatory variables. We see many publications in which the analysis is based on responders vs. non-responders.
When complete or quasi-complete separation occurs in a multivariate regression, the explanatory variable causing it should be identified and preferably excluded from the model. For a univariate regression, an alternative statistical test (for example, a group t-test) should be used instead.
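For the univariate case, a group t-test can be sketched as follows, treating the response as the grouping variable and comparing the mean of the explanatory variable between the two groups. The data set name `sep` is an assumption, and the data are the six observations from the first example.

```sas
/* Example data; the data set name "sep" is hypothetical */
data sep;
  input y x;
  datalines;
0 1
0 2
0 4
1 5
1 6
1 9
;
run;

/* Two-group t-test: compare mean X between Y = 0 and Y = 1 */
proc ttest data=sep;
  class y;
  var x;
run;
```

This compares group means only and does not produce an odds ratio, so it answers a different question from the logistic model.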
Further reading:
- Computation of the Odds Ratio with Small or Zero Cell Counts by Dr Robin High
- Convergence Failures in Logistic Regression by Paul Allison
- A tutorial on logistic regression by Ying So
- What is new in SAS 9.2?
Hi, your post is very informative, as I am analyzing a data set that also has the problem of complete separation. You suggest a group t-test; I wonder how that can be done. Would you mind giving some suggestions? Also, I am analyzing my data with a logit mixed-effects model, so my concern is that a t-test does not take subject/item random effects into consideration. Thank you in advance!
A group t-test is not an ideal approach in this situation and should be the last resort when no other approach works. What I am saying is that when there is complete separation, you can still perform a group t-test to compare the two groups.
I think this link from SAS is good as well: http://support.sas.com/kb/22/599.html
Usage Note 22599: Understanding and correcting complete or quasi-complete separation problems
Hi, your post is very informative. I am actually dealing with data that have quasi-complete separation of data points. My question is how to identify the variables causing this. I am using SAS 9.1.3.
Thanks