Sunday, April 06, 2014

Simulation Approach to estimate the sample size for Poisson distribution

FDA Guidance for Industry: Safety, Efficacy, and Pharmacokinetic Studies to Support Marketing of Immune Globulin Intravenous (Human) as Replacement Therapy for Primary Humoral Immunodeficiency requires the efficacy endpoint to be the serious bacterial infection rate. The guidance states:

“…Based on our examination of historical data, we believe that a statistical demonstration of a serious infection rate per person-year less than 1.0 is adequate to provide substantial evidence of efficacy. You may test the null hypothesis that the serious infection rate is greater than or equal to 1.0 per person-year at the 0.01 level of significance or, equivalently, the upper one-sided 99% confidence limit would be less
than 1.0.

You should employ a sufficient number of subjects to provide at least 80% power with one-sided hypothesis testing and an alpha = 0.01. Although the responsibility for choosing the sample size rests with you, we anticipate that studies employing a total of approximately 40 to 50 subjects would generally prove adequate to achieve the requisite statistical power, given the design specifications listed in this same section.”

“Statistical considerations

The number of subjects to be included into the study might exceed 40 patients as the study should provide at least 80% power to reject the null-hypothesis of a serious infection rate greater or equal 1 by means of a one-sided test and a Type I error of 0.01. “
If we plan a study to meet the above requirements by regulatory agencies, sample size will need to be estimated and the sample size estimation requires the use of power and sample size calculation using Poisson distribution."

In a webinar  “Design and Analysis of Count Data” by Mani Lakshminarayanan, the following statements were made:

  • If the endpoint is count data then that should be taken into consideration for sample size Calculation
  • Most commonly used sample size software (nQuery Advisor, PASS 2002) do not have the options for discrete data. Poisson regression is available in PASS 2002.
  • If normal approximation is used then the sample size estimates might be too high, which increases cost and time of subject recruitment

The latest version of PASS has a module for poisson regression that allows the sample size calculation when the purpose is to compare two poisson response rates.

Cytel’s EAST version 6.2 offers power analysis and sample size calculations for count data in fixed (not adaptive) sample designs. EAST provides design capabilities for:
  • Test of a single Poisson rate
  • Test for a ratio of Poisson rates
  • Test for a ratio of Negative Binomial rates

However, none of the GUI sample size calculation software can be readily used for calculating the sample size in serious bacterial infection rate situation where the requirement is based on the upper bound of confidence interval for Poisson distribution.

In SAS community, there are some discussions using SAS to calculate the sample size for count / Poisson data. However, there is no easy answer for the question.

For serious bacterial infection rate situation discribed above, the simulation approach can be used. Several years ago, we described that the simulation approach can be used to  estimate the sample size for studies comparing two different slopes using random coefficient model (see Chen, Stock, and Deng (2008)). Similar approach can be used to estimate the sample size to meet the regulatory requirement for the upper bound of the 99% confidence interval meeting a threshold. The SAS macro is attached below.  

options nonotes ;   *suppress the SAS log;
%macro samplesize(n=, lambda=);
  *generating the first data set;
  data one;
         do i = 1 to &n;
            x = RAND('POISSON',&lambda);

   ods output parameterestimates=parmest ;  
   proc genmod data=one ;
       model x = / dist=poisson link=log scale=deviance lrci alpha=0.02*98% CI for two sides is equivalent to 99% for one-sided;
   run ;

   data parmest;
     set parmest;
     est = exp(estimate);
     lower = exp(lowerlrcl) ;
     upper = exp(upperlrcl) ;

  data un;
    set parmest;
    iteration = 1;

*Iteration macro to generate the rest of data sets;
%macro loop(iteration);

%do i = 2 %to &iteration;
   data one;
         do i = 1 to &n;
            x = RAND('POISSON',&lambda);

   ods output parameterestimates=parmest ;  
   proc genmod data=one ;
       model x = / dist=poisson link=log scale=deviance lrci alpha=0.02 ;   *98% CI for two sides ;
   run ;

    data parmest;
     set parmest;
     est = exp(estimate);
     lower = exp(lowerlrcl) ;
     upper = exp(upperlrcl) ;
  data parmest;
    set parmest;
    iteration = &i;

*combined the cumulative data sets;
   data un;
     set un parmest;
%loop(100);    * for real application, it needs to be a much larger number (with more iterations);
*Calculate and print out power;
 Data power;
    set un;
    if parameter = 'Intercept' then do;
      if upper >=1 then flag =1;     *upper bound is above 1;
        else if upper <1 then flag =0;
    if parameter  = 'Intercept';

 proc freq data = power ;
    table flag;
      title "n=&n; lambda=&lambda";

*try different sample size and lambda;
%samplesize(n=10, lambda=0.5);
%samplesize(n=20, lambda=0.5);
%samplesize(n=25, lambda=0.5);
%samplesize(n=30, lambda=0.5);
%samplesize(n=40, lambda=0.5);

%samplesize(n=50, lambda=0.5);

Sunday, March 30, 2014

Computing Confidence Interval for Poisson Mean

For Poisson distribution, there are many different ways for calculating the confidence interval. The paper by Patil and Kulkarni discusses 19 different ways to calculate a confidence interval for the mean of a Poisson distribution.

The most commonly used method is the normal approximation (for large sample size) and the exact method (for small sample size)

Normal Approximation:

For Poisson, the mean and the variance are both lambda (λ).
The standard error is calculated as: sqrt(
λ /n) where λ is Poisson mean and n is sample size or total exposure (total person years, total time observed,…)
The confidence interval can be calculated as:
                  λ ±z(α/2)*sqrt(λ/n).
The 95-percent confidence interval is calculated as: λ ±1.96*sqrt(λ/n).
The 99-percent confidence interval is calculated as: λ ±2.58*sqrt(λ/n).
EXACT method:

Refer to the following paper for the description of this method: 

The confidence interval for event X is calculated as: 

           (qchisq(α/2, 2*x)/2, qchisq(1-α/2, 2*(x+1))/2 )

Where x is the number of events occurred under Poisson distribution.

In order to calculate the exact confidence interval for Poisson mean, the obtained confidence interval for the number of events need to be converted to the confidence interval for Poisson mean.

Here are two examples from the internet:

Example 1:
Would like to know how confident I can be in my λ. Anyone know of a way to set upper and lower confidence levels for a Poisson distribution?
  • Observations (n) = 88
  • Sample mean (λ) = 47.18182
what would the 95% confidence look like for this?
With Normal Approximation, the 95% confidence interval is calculated as:

                47.18182 +/- 1.96* sqrt(47.18182/88)

This gives 45.7467, 48.617

With Exact method, we first need to calculate x (# of events):
                X = n * λ = 88 * 48.18182 = 4152

The compute the 95% confidence interval for X = 4152. This will give the 95% confidence interval for X as (4026.66, 4280.25)

The 95% confidence interval for mean (λ) is therefore:
lower bound = 4026.66 / 88 = 45.7575
upper bound = 4280.25 /88 = 48.6392

Say that 14 events are observed in 200 people studied for 1 year and 100 people studies for 2 years. Calculate the 95% confidence interval for Poisson mean

In this example, the number of events (X) is given, the Poisson rate (λ) or mean needs to be calculated.

First step is to calculate the person year:
The person time at risk is 200 + 100 x 2 = 400 person years
The poisson rate / poisson mean (λ) is :
  • Events observed = 14
  • Time at risk of event = 400
  • Poisson (e.g. incidence) rate estimate = 14/400 = 0.035

Normal Approximation: 95% confidence interval is calculated as: 
                0.035 +/- 1.96* sqrt(0.035/400)
This will give the 95% confidence interval of (0.0167, 0.0533)

Exact approach:  Calculate the 95% confidence interval for the number of events (X) using: 
(qchisq(0.025, 2*x)/2, qchisq(0.975, 2*(x+1))/2 )

and the result is: [7.65, 23.49]
Exact 95% confidence interval for Poisson mean is:
         Lower bound = 7.65 / 400 =0.019135 for lower bound and
         Upper bound = 23.49 / 400 = 0.058724 for upper bound

We will then say the Poisson mean is 0.035 with 95% confidence interval of (0.019, 0.059).

The following SAS programs can illustrate the calculations above: 

data normal;
  input lambda n ;
  lower = lambda - probit(0.975)*sqrt(lambda/n);
  upper = lambda + probit(0.975)*sqrt(lambda/n);
  47.18182 88
  0.035       400
proc print data=normal;
  title 'Normal Approximation for 95% confidence interval for Poisson mean';

data exact;
  input X;
  lower = quantile('CHISQ',.025,2*x)/2;
  upper = quantile('CHISQ',.975,2*(x+1))/2;
proc print data=exact;
  title 'Exact method for 95% confidence interval for Poisson mean';

Thursday, March 06, 2014

Intra-Subject Coefficient of Variation (CV%) for Sample Size Estimation for Crossover Design

To calculate the sample size for a crossover design for bioequivalence study, a key assumption is the intra-subject variation. The intra-subject variation is usually expressed with coefficient of variation (COV).

In sample size calculation software ‘PASS’, “Equivalence Test for Two Means in 2x2 Crossover Design - Specify Using Ratio” procedure requires COV (Coefficient of Variation) as the following:

COV (COEFFICENT OF VARIATION):“...For log-normal data, the following relationship exists: COV(Y) = SQR(Exp(SX*SX)-1) where SX is the square root of the within mean square error in the ANOVA table of the log-transformed values.” 
 In Sample size calculation software Nquery, “T-tests (TOST) of equivalence in ratio of means for crossover design (natural log scale)” procedure specifies that
“the square root of the mean square error is needed, the square root of the mean squared error can be obtained from the crossover ANOVA computed using the natural log scale. sqrt(MSE) is equal to sd/sqrt(2), where sd is the standard deviation of the period difference computed using the natural log scale. To compare results from this table to those in the Diletti, E., et al (1991) paper note that CV = sqrt(exp(s^2-1)”

Suppose we analyze the data using Proc Mixed as following:

proc mixed data=pk ;
    class sequence period treatment subject;
    model logauc = sequence period treatment;
    random subject(treatment);
    lsmeans treatment/pdiff cl alpha=0.1;

Covariance parameter estimates in the SAS outputs will provide the information for calculating intra-subject coeffiicient. 

                                                                The Mixed Procedure
Covariance Parameter Estimates
Cov Parm


Here, Subject(Sequence) is the inter-subject variability s2inter and residual the within-subject (or intra-subject) variability s2within.

If we need to design a new study with crossover design, we will convert the intra-subject variability to CV for sample size calculation. CVintra can be calculated with the formula CV=100*sqrt(exp(S2
within)-1) or CV=100*sqrt(exp(Residual)-1). From the table above, s2within=0.1856, CV can be calculated as 45.16%

If we use Proc GLM (for a balanced study with no missing data) as specified below, the within-subject (or intra-subject) variability s2within is the Mean Square Error (0.1856) – identical to the residual in Proc Mixed. Intra-subject CV is then calculated using CV=100*sqrt(exp(MSE)-1).

PROC GLM data=pk;
 CLASS trtsEQ period trtgrp patno;
 MODEL lauc = trtSEQ patno(trtSEQ) period TRtgrp; * TRtgrp*PERIOD;
 LSMEANS TRTgrp / PDIFF CL alpha=0.1;

Sum of Squares
Mean Square
F Value
Pr > F

Corrected Total

Further Reading:

Sunday, March 02, 2014

Analysis of 2x2x2 crossover design data: Proc Mixed or Proc GLM

In an early article "Cookbook SAS Codes for Bioequivalence Test in 2x2x2 Crossover Design", I provided the analysis using Proc Mixed model. There is a question about the necessity of using Proc Mixed while the Proc GLM can be used for analyzing the data from a 2x2x2 crossover design. As a matter of fact, FDA guidance  "Statistical Approaches to Establishing Bioequivalence” indeed mentioned the use of Proc GLM for analyzing the non-replicated Crossover Design:

b. Nonreplicated Crossover Designs
For nonreplicated crossover designs, this guidance recommends parametric (normal-theory) procedures to analyze log-transformed BA measures. General linear model procedures available in PROC GLM in SAS or equivalent software are preferred, although linear mixed-effects model procedures can also be indicated for analysis of nonreplicated crossover studies. For example, for a conventional two-treatment, two-period, two-sequence (2 x2) randomized crossover design, the statistical model typically includes factors accounting for the following sources of variation: sequence, subjects nested in sequences, period, and treatment. The Estimate statement in SAS PROC GLM, or equivalent statement in other software, should be used to obtain estimates for the adjusted differences between treatment means and the standard error associated with these differences.
Put it into the SAS statements, the SAS codes will be something like the following:

 PROC GLM data=pk;

The SAS output below will provide the mean difference and its 90% confidence interval. Since the data is in log scale, anti-log of the mean difference (1.076) is the ratio of the geometric mean and anti-log of the 90% confidence interval is our 90% confidence interval for the geometric mean ratio (0.860, 1.346). 

Least Squares Means for Effect TREATMENT
i j Difference Between
90% Confidence Limits for LSMean(i)-LSMean(j)
1 2 0.073113 -0.150925 0.297152

If it is balanced design (i.e., there is no missing value) and all subjects have data available for period 1 and period 2, the results from the Proc GLM will be identical to teh results obtained from Proc Mixed as described in my previous article "Cookbook SAS Codes for Bioequivalence Test in 2x2x2 Crossover Design".

However, it is very often, there is possibility of missing data in one of the periods. In this case, if Proc GLM is used, the entire subject will be excluded from the analysis. If Proc Mixed is used, the subject who contributes only the data for one of the two periods will still be included in the analysis. 

Monday, February 10, 2014

Sending emails through SAS

I am not sure if there is any practical use for sending emails through SAS. But this SAS micro was picked up from RTP SAS User Group Email List and it works very well. The credit goes to Mr Ping Ling. 
 %macro sendemail(to=,cc=,subj=,msg1=,msg2=,msg3=,msg4=,msg5=,msg6=,msg7=,msg8=,attach=);
         FILENAME mail EMAIL (&to.)
         EMAILID='Microsoft Outlook'
         %if &cc. ne %then %do;
         %if &attach. ne %then %do;
         %let today=%sysfunc(date(),date9.);

         DATA _NULL_;
           FILE mail;
           PUT "&today.";
           PUT "&msg1.";
           %do i=2 %to 8;
             %if &&msg&i. ne %then %do;
               %if %sysfunc(compress("&&msg&i."))="/n" %then %do; put; %end;
               %else %do; PUT "&&msg&i. "; %end;

/*Example for using the macro
                subj=%str(Sending email with SAS),
                msg1=%str(This is a test.),


Sunday, February 02, 2014


The research triangle area in North Carolina is one of the area with the concentrated talents in statistics and biostatistics. All three big universities (University of North Carolina, North Carolina State University, and Duke University) have great training programs in statistics and biostatistics. There are also many statisticians working in the pharmaceutical companies, biotech companies, CROs, and medical centers. Correspondingly, there should have been more conferences in clinical trial statistics.

The following conference is one that will be held at April 21-23, 2014 in Durham, North Carolina - the heart of the Research Triangle Park. 

Trends and Innovations in Clinical Trial Statistics (2014)
Link to the event homepage:

Link to the conference agenda:

Link to the event registration form: