Saturday, August 08, 2009

Poisson regression and zero-inflated Poisson regression

Poisson regression is a method for modeling event counts or event rates, such as the number of adverse events of a certain type or the frequency of epileptic seizures during a clinical trial, as a function of a set of covariates. The counts are assumed to follow a Poisson distribution whose mean is modeled as a function of the covariates. The Poisson regression model is a special case of a generalized linear model (GLM) with a log link, which is why Poisson regression may also be called a log-linear model. Consequently, it is often presented as an example in the broader context of GLM theory.
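To make the log link concrete, here is a minimal sketch in Python (not SAS) that fits log(mu) = b0 + b1*x + log(exposure) by Newton-Raphson on the Poisson log likelihood. The function name fit_poisson_regression, the single-covariate setup, and the eight counts are all hypothetical and chosen only for illustration; in practice a statistical package would do this step.

```python
import math

def fit_poisson_regression(x, y, exposure=None, n_iter=50, tol=1e-10):
    """Fit log(mu_i) = b0 + b1*x_i + log(exposure_i) by Newton-Raphson.

    Assumes x is not constant and that at least one count is positive.
    """
    n = len(y)
    if exposure is None:
        exposure = [1.0] * n
    # Start from the intercept-only fit: the log of the overall event rate.
    b0 = math.log(sum(y) / sum(exposure))
    b1 = 0.0
    for _ in range(n_iter):
        mu = [t * math.exp(b0 + b1 * xi) for xi, t in zip(x, exposure)]
        # Score vector (gradient of the log likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), 2x2 and symmetric.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system with the explicit inverse.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data: x = 0 for control, x = 1 for treatment.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [2, 3, 1, 4, 6, 5, 8, 5]
b0, b1 = fit_poisson_regression(x, y)
print("intercept:", b0, "slope:", b1, "rate ratio:", math.exp(b1))
```

With a single binary covariate the fitted means reproduce the two group means, so exp(b1) is simply the ratio of the treatment mean to the control mean. That makes the sketch easy to check by hand.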

Poisson regression is the simplest regression model for count data. It assumes that each observed count Yi is drawn from a Poisson distribution with conditional mean μi (the λ of the formula below), given the covariate vector Xi for case i. The number of events follows the Poisson distribution described below:

f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots

where

  • e is the base of the natural logarithm (e = 2.71828...)
  • k is the number of occurrences of an event; f(k; λ) is the probability of observing exactly k occurrences
  • k! is the factorial of k
  • λ is a positive real number equal to the expected number of occurrences during the given interval; the interval may be a time interval or another exposure measure (a denominator) that enters the model as an offset.
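The probability function can be evaluated directly. The following Python sketch (the name poisson_pmf and the value λ = 2.5 are illustrative assumptions) computes f(k; λ) and checks numerically that the probabilities sum to one and that both the mean and the variance equal λ.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 2.5  # hypothetical expected number of events per interval
# Truncate the infinite support at k = 59; the remaining tail is negligible here.
probs = [poisson_pmf(k, lam) for k in range(60)]

total = sum(probs)
pois_mean = sum(k * p for k, p in enumerate(probs))
pois_var = sum((k - pois_mean) ** 2 * p for k, p in enumerate(probs))
print("sum:", total, "mean:", pois_mean, "variance:", pois_var)
```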

The most important feature of Poisson regression is that the parameter λ is both the mean number of occurrences and its variance. In other words, under a Poisson distribution the mean equals the variance. Empirical data do not always satisfy this property, a situation called overdispersion or underdispersion. Overdispersion occurs when the observed variance is higher than the variance of the theoretical model (for the Poisson distribution, when the observed variance exceeds the observed mean). Conversely, underdispersion means that there is less variation in the data than the model predicts.
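A quick, rough check for this is the ratio of the sample variance to the sample mean, which should be near 1 for Poisson data. The sketch below uses Python with made-up counts; dispersion_ratio is a hypothetical helper name. It ignores covariates, so it is only a first look. In a fitted regression the usual diagnostic is the Pearson chi-square divided by its residual degrees of freedom.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    return variance(counts) / mean(counts)

# Hypothetical seizure counts for 12 subjects (not real data).
counts = [0, 0, 0, 1, 0, 7, 0, 2, 0, 12, 0, 3]
ratio = dispersion_ratio(counts)

if ratio > 1:
    print("variance/mean =", ratio, "-> possible overdispersion")
else:
    print("variance/mean =", ratio, "-> no sign of overdispersion")
```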

When overdispersion occurs, an alternative model with additional free parameters may provide a better fit. For count data, an alternative such as the negative binomial distribution may be used.
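As an illustration of what the extra parameter buys, the Python sketch below evaluates a negative binomial distribution in the common mean-dispersion (NB2) parameterization, where the variance is mu + alpha*mu^2. The parameterization, the function name neg_binomial_pmf, and the values of mu and alpha are assumptions made for this example; the post does not specify them.

```python
import math

def neg_binomial_pmf(k, mu, alpha):
    """Negative binomial probability with mean mu and dispersion alpha > 0."""
    r = 1.0 / alpha  # "size" parameter; alpha -> 0 approaches the Poisson
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu))
             + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

mu, alpha = 2.5, 0.8  # hypothetical values
# Truncate the support at k = 499; the tail beyond that is negligible here.
nb_probs = [neg_binomial_pmf(k, mu, alpha) for k in range(500)]

nb_mean = sum(k * p for k, p in enumerate(nb_probs))
nb_var = sum((k - nb_mean) ** 2 * p for k, p in enumerate(nb_probs))
print("mean:", nb_mean, "variance:", nb_var,
      "mu + alpha*mu^2:", mu + alpha * mu ** 2)
```

The mean stays at mu while the variance grows with alpha, which is how the negative binomial accommodates counts that are more variable than a Poisson model allows.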

In practice, count data often contain an excess of zero counts (no events), which may cause the data to deviate from the Poisson distribution through overdispersion or underdispersion. In that case, zero-inflated Poisson regression may be used.
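A zero-inflated Poisson distribution is usually written as a mixture: with probability pi the count is a structural zero, and otherwise it is drawn from a Poisson distribution with mean λ. The Python sketch below follows that standard formulation; the names zip_pmf and pi_zero and the parameter values are assumptions for illustration. It shows that this mixture has a variance larger than its mean.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: extra mass pi_zero at zero, Poisson otherwise."""
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * math.exp(-lam)
    return (1.0 - pi_zero) * poisson_pmf(k, lam)

lam, pi_zero = 2.5, 0.3  # hypothetical values
# Truncate the support at k = 59; the remaining tail is negligible here.
zip_probs = [zip_pmf(k, lam, pi_zero) for k in range(60)]

zip_mean = sum(k * p for k, p in enumerate(zip_probs))
zip_var = sum((k - zip_mean) ** 2 * p for k, p in enumerate(zip_probs))
print("P(0) under plain Poisson:", poisson_pmf(0, lam))
print("P(0) under ZIP:", zip_probs[0])
print("ZIP mean:", zip_mean, "ZIP variance:", zip_var)
```

Under this mixture the mean is (1 - pi)*λ and the variance is (1 - pi)*λ*(1 + pi*λ), so the variance exceeds the mean whenever pi is greater than zero.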

In SAS, several procedures in both the SAS/STAT and SAS/ETS modules can be used to estimate Poisson regression models. GENMOD and GLIMMIX (from SAS/STAT) and COUNTREG (from SAS/ETS) are easy to use with a standard MODEL statement, while NLMIXED, MODEL, and NLIN provide great flexibility for modeling count data by specifying the log-likelihood function explicitly.

2 comments:

  1. Regression models for count data in R
    http://www.jstatsoft.org/v27/i08/

  2. The course notes on the negative binomial distribution require a log-in
