Wednesday, November 18, 2009

Dealing with paired data

Paired data contain values that fall naturally into pairs and can therefore be expected to vary more between pairs than within pairs. The purpose of pairing is to reduce variability: because the comparison is made within each pair, the between-pair (between-subject) variability is removed from the comparison. If pairing is effective, it reduces variability enough to justify the effort involved in obtaining paired data.

There are many practical examples of pairing. In clinical trials, the crossover design is a special case of pairing in which the same subject receives more than one treatment. Even if all subjects receive treatment A and then treatment B, the design can still be called a crossover design (a single-sequence crossover design). In epidemiology, case-control studies often use matching, described as 1:1 matched or 1:m matched case-control depending on how many controls are matched to each case. In education, we can use pairing to compare scores before and after training, and so on.

When the outcome measure is a continuous variable (such as drug concentration) and no covariates need to be considered, paired data can be analyzed with the paired t-test. This is easily performed with SAS PROC UNIVARIATE (calculate the difference for each pair, then run PROC UNIVARIATE on the differences) or SAS PROC TTEST (which does not require calculating the differences first). Suppose x1 and x2 are the paired variables:
proc ttest;
paired x1*x2;
run;
If the normality assumption for the within-pair differences is questionable, non-parametric tests (the sign test and the Wilcoxon signed-rank test) can be used instead. UCLA's Statistical Consulting Services web site provides examples of these tests.
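The difference-based approach can be sketched as follows. This is only an illustration: the data set name `paired`, the variable `pair_id`, and the values are made up. PROC UNIVARIATE reports Student's t, the sign test, and the signed rank test together in its "Tests for Location" table, and the NORMAL option adds normality tests for the differences.

```sas
/* Hypothetical data: one row per pair; x1 and x2 are the paired measurements. */
data paired;
  input pair_id x1 x2;
  diff = x1 - x2;            /* within-pair difference */
  datalines;
1 12.1 10.4
2  9.8  9.9
3 14.3 12.0
4 11.0 10.2
5 13.5 11.7
6 10.6 10.9
7 12.9 11.1
8 11.7 10.8
;
run;

/* Tests H0: location of diff = 0 (the default for MU0=).           */
/* NORMAL requests normality tests to help judge the t-test result. */
proc univariate data=paired normal;
  var diff;
run;
```

The t statistic from this run should match what PROC TTEST with `paired x1*x2;` reports on the same data, since both test the mean within-pair difference against zero.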

In more complicated situations (such as a crossover design), or when covariates have to be included in the model, a mixed model is needed. SAS PROC MIXED can fit mixed models easily; see the SAS/STAT User's Guide for PROC MIXED. In a research paper titled "Detection of emphysema progression in alpha 1-antitrypsin deficiency using CT densitometry; Methodological advances", I handled paired data using a so-called 'random coefficient model'.
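As a rough sketch of the mixed-model approach for a two-sequence, two-period crossover, assuming the data are in long format (one row per subject per period). The data set `xover`, the variable names, and the values are hypothetical, and this is not the model from the paper mentioned above.

```sas
/* Hypothetical 2x2 crossover data: one row per subject per period. */
data xover;
  input subject sequence $ period treatment $ y;
  datalines;
1 AB 1 A 10.2
1 AB 2 B 11.5
2 AB 1 A  9.7
2 AB 2 B 10.9
3 AB 1 A 11.1
3 AB 2 B 11.8
4 BA 1 B 12.0
4 BA 2 A 10.6
5 BA 1 B 11.4
5 BA 2 A 10.1
6 BA 1 B 12.3
6 BA 2 A 11.2
;
run;

proc mixed data=xover;
  class subject sequence period treatment;
  /* Fixed effects: sequence, period, and treatment */
  model y = sequence period treatment / ddfm=kr;
  /* Random subject effect accounts for the within-subject pairing */
  random subject(sequence);
  /* Estimated treatment difference with confidence limits */
  lsmeans treatment / diff cl;
run;
```

Covariates can be added to the MODEL statement in the usual way. The random subject effect is what makes the treatment comparison a within-subject one.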


When the outcome variable is discrete, the simplest case is McNemar's test. McNemar's test is used when we want to know whether the marginal frequencies of two binary outcomes differ. These binary outcomes may be the same outcome variable measured on matched pairs (as in a matched case-control study) or two outcome variables measured on a single group.
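A minimal sketch of McNemar's test in SAS, assuming the pairs have already been cross-classified into a 2x2 table of counts. The data set `pairs2x2`, the variable names, and the counts are made up for illustration. The AGREE option in PROC FREQ requests McNemar's test for a 2x2 table.

```sas
/* Hypothetical counts for 100 pairs, cross-classified by the two binary outcomes. */
data pairs2x2;
  input before $ after $ count;
  datalines;
yes yes 40
yes no  15
no  yes  5
no  no  40
;
run;

proc freq data=pairs2x2;
  weight count;                     /* each row represents COUNT pairs */
  tables before*after / agree;      /* AGREE gives McNemar's test for a 2x2 table */
  exact mcnem;                      /* optional exact p-value for small discordant counts */
run;
```

Only the discordant cells (15 and 5 here) drive the test: the statistic is (15 - 5)^2 / (15 + 5) = 5 on 1 degree of freedom.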

In more complicated situations, or when covariates need to be included in the model, conditional logistic regression should be used. Conditional logistic regression can be implemented with SAS PROC LOGISTIC or SAS PROC PHREG. See the following links for detailed descriptions.
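Both routes can be sketched as follows for a 1:1 matched case-control study. The data set `matched`, the variables `pair_id`, `case`, `exposure`, and `bmi`, and all values are hypothetical and serve only to show the layout. In PROC LOGISTIC the STRATA statement requests the conditional analysis. In PROC PHREG the usual trick is a dummy time variable, so that each case "fails" before its matched control.

```sas
/* Hypothetical 1:1 matched data: one row per subject, case = 1 for cases, 0 for controls. */
data matched;
  input pair_id case exposure bmi;
  datalines;
1 1 1 27.1
1 0 0 25.3
2 1 1 24.8
2 0 0 26.0
3 1 1 29.4
3 0 0 27.9
4 1 1 23.5
4 0 0 23.9
5 1 0 26.2
5 0 1 24.7
6 1 1 28.0
6 0 1 26.8
7 1 0 22.9
7 0 0 23.7
8 1 0 25.5
8 0 0 24.1
;
run;

/* Route 1: conditional logistic regression via the STRATA statement. */
proc logistic data=matched;
  strata pair_id;
  model case(event='1') = exposure bmi;
run;

/* Route 2: the same conditional likelihood via PROC PHREG. */
data matched_ph;
  set matched;
  time = 2 - case;     /* cases get time 1, controls time 2 (treated as censored) */
run;

proc phreg data=matched_ph;
  strata pair_id;
  model time*case(0) = exposure bmi / ties=discrete;
run;
```

With so few pairs the estimates themselves are not meaningful. The point is the structure: one stratum per matched set, with the covariates entered in the MODEL statement.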


4 comments:

Obama said...

Hello Chunqin!
I am not sure whether this comment is off-topic, but I really need your help.
I am currently in the last three years of a residency program in internal medicine here in Tanzania. As part of the requirements for this course, I prepared a dissertation proposal on the seroprevalence of herpes simplex virus (HSV) among HIV-infected individuals (in the 15-49 age group, the age group most affected by HIV in Tanzania).
Since I decided to compare the prevalence of HSV in this population (in a cross-sectional study) with that in HIV-uninfected individuals, I chose the formula for two proportions to calculate the sample size and got a minimum of 145 HIV-positive participants and the same number of HIV-negative participants.
However, to obtain a large enough sample, I took 10% of the study sites' population as the sample. To make sure I got enough participants from the lower age groups (particularly HIV-positive ones), I stratified the sample by age, then by sex and HIV status, and ended up with 640 participants, who were then tested for HSV-1 and HSV-2.
I used prevalence ratios and Poisson regression with robust variance to look for predictors of HSV infection.
So the questions are: did I use the correct formula for this cross-sectional study? (I think yes.) Does stratification by age, sex, and HIV status affect the analysis? (I think it doesn’t affect the comparison of HIV-infected and uninfected individuals; however, the results can’t be generalized to the general population, where the prevalence of HIV is about 6%, so maybe someone has to extrapolate my findings to get the prevalence in the general population.)
I need your help on whether I was right or wrong, and on what I should do, since I have already collected the data.
Dr. Issakwisa, Tanzania

Anonymous said...

It depends on how you design your study. If you simply compare two populations (HIV-uninfected vs. HIV-infected), you are dealing with two independent groups. If you do matching (for each HIV-infected subject, you select an HIV-uninfected subject of the same age, gender, ...), you are dealing with paired data. Sample size formulas for both situations are available in statistics textbooks and on the web. For example, for comparing two subgroups, you can use OpenEpi (http://www.sph.emory.edu/~cdckms/Sample%20Size%20Calculation%20for%20a%20single%20cross-Sectional%20survey%20comparing%20two%20subgroups.htm).

OpenEpi and Epi Info are both free software (Epi Info is developed by the CDC).

http://www.sph.emory.edu/~cdckms/

Obama said...

Thanks very much!