Sunday, January 02, 2011

Agreement Statistics and Kappa

In clinical trials and medical research, we often have a situation where two different measures/assessments are performed on the same sample, the same patient, the same image, etc., and the agreement between them needs to be calculated as a summary statistic. Depending on whether the measurement is continuous or categorical, the appropriate agreement statistic differs. Lin L gives a very nice overview of agreement statistics.

Specifically for categorical assessments, there are many settings where an agreement statistic is needed. In a clinical trial with imaging assessment, the same image (for example, a CT scan or arteriogram) can be read by different readers. For disease diagnosis, a new diagnostic tool (with the advantage of being less invasive or easier to implement) may be compared to an established diagnostic tool. Typically, the outcome measure is dichotomous (e.g., disease vs. no disease, positive vs. negative).

The choice of comparison method is influenced by the existence and/or practical applicability of a reference standard (gold standard). If a reference standard is available, we can estimate sensitivity and specificity and perform ROC (receiver operating characteristic) analysis. If a reference standard is not available, we cannot perform ROC analysis; instead, we assess agreement and calculate Kappa. This is discussed in detail in FDA's Guidance for Industry and FDA Staff, "Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests". For example, when comparing the assessments from two different readers, we would calculate Kappa, overall percent agreement, positive percent agreement, and negative percent agreement; we would not use ROC statistics and would not calculate sensitivity and specificity.
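As a quick illustration of the percent-agreement statistics above, here is a small sketch computing them from a 2x2 table of two readers' results. The counts and the choice of reader 2 as the comparator are assumptions made up for illustration only:

```python
# Hypothetical 2x2 table comparing two readers (counts are illustrative):
# a = both positive, b = reader1 +/reader2 -,
# c = reader1 -/reader2 +, d = both negative.
a, b, c, d = 40, 5, 8, 47
n = a + b + c + d

overall_agreement = (a + d) / n
# Positive/negative percent agreement are computed relative to one
# reader (here, reader 2) serving as the comparative method.
positive_percent_agreement = a / (a + c)  # agreement among comparator positives
negative_percent_agreement = d / (b + d)  # agreement among comparator negatives
```

With these made-up counts, the overall percent agreement is (40 + 47)/100 = 87%.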
If we would like to assess the agreement between the urine pregnancy test and the serum pregnancy test, we could use ROC analysis and calculate sensitivity, specificity, positive predictive value, and negative predictive value, since the serum pregnancy test can be considered a reference standard (gold standard) for pregnancy testing.
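The reference-standard statistics mentioned above can be sketched the same way. The counts below are hypothetical, chosen only to show the formulas:

```python
# Hypothetical counts comparing a new test against a reference standard
# (e.g., urine vs. serum pregnancy test); numbers are illustrative only.
tp, fp, fn, tn = 30, 2, 3, 65

sensitivity = tp / (tp + fn)  # true positive rate vs. the gold standard
specificity = tn / (tn + fp)  # true negative rate vs. the gold standard
ppv = tp / (tp + fp)          # positive predictive value
npv = tn / (tn + fn)          # negative predictive value
```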

The Kappa statistic (K) is a measure of agreement between two sources on a binary scale (i.e., condition present/absent). K can take values from -1 to 1 (negative values indicate agreement worse than chance), but in practice it is usually interpreted on the 0 to 1 range:
  • Poor agreement: K < 0.20
  • Fair agreement: K = 0.20 to 0.39
  • Moderate agreement: K = 0.40 to 0.59
  • Good agreement: K = 0.60 to 0.79
  • Very good agreement: K = 0.80 to 1.00
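Kappa is the observed agreement corrected for the agreement expected by chance: K = (p_observed - p_expected) / (1 - p_expected). A minimal sketch for a 2x2 table, with made-up counts, looks like this:

```python
# Cohen's kappa for a 2x2 agreement table (counts are illustrative).
# Rows = rater 1 (present/absent), columns = rater 2 (present/absent).
table = [[45, 5],
         [10, 40]]

n = sum(sum(row) for row in table)
p_observed = (table[0][0] + table[1][1]) / n  # fraction of cases both agree on

# Expected agreement by chance, from the marginal totals of each rater.
row_pos = (table[0][0] + table[0][1]) / n  # rater 1's "present" rate
col_pos = (table[0][0] + table[1][0]) / n  # rater 2's "present" rate
p_expected = row_pos * col_pos + (1 - row_pos) * (1 - col_pos)

kappa = (p_observed - p_expected) / (1 - p_expected)
```

For these hypothetical counts, p_observed = 0.85 and p_expected = 0.50, giving K = 0.70, which falls in the "good agreement" band of the scale above.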
A good review article about Kappa statistics is the one by Kraemer et al., "Kappa Statistics in Medical Research".

SAS can calculate Kappa statistics easily (e.g., PROC FREQ with the AGREE option). Here is a list of papers:
