Saturday, August 07, 2010

R-square for regression without intercept?

Sometimes, simple linear regression may not be very simple. One of the issues is deciding whether to fit the regression with or without an intercept. For regression without an intercept, the regression line is forced through the origin. For regression with an intercept, the regression line is not required to go through the origin.

In clinical trials, we may need to fit regression models for drug concentration vs. dose, AUC vs. trough concentration, and so on. Whether to fit the regression with or without an intercept depends on the scientific background, not purely on the statistics. Using drug concentration vs. dose as an example, if there is no endogenous drug concentration, a regression model without an intercept makes sense. If there is an endogenous drug concentration, a regression model with an intercept will be more appropriate - when no dose is given, the drug concentration is not zero.

In some situations, regression models are purely data-driven or empirical, and choosing between a model with and without an intercept may not be easy. We recently had a real experience with this. With the same set of data, we fitted the models with and without an intercept. We thought we could judge which model was better by comparing the R-square values - an indicator of goodness of fit. Surprisingly, judged by R-square, the models without an intercept were always much better than the models with an intercept. However, when we thought twice about this, we realized that in this situation the R-square was no longer a good indicator of goodness of fit.

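The problem is that a regression model without an intercept will usually give a much higher R-square. This is related to how the sums of squares are calculated: without an intercept, the total sum of squares is typically taken about zero (uncorrected) rather than about the mean of the response (corrected), so the two R-squares are not measured against the same baseline and cannot be compared directly. There are two excellent articles discussing this issue.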
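
The effect can be reproduced with a small sketch. The data below are made up purely for illustration (they are not from our study), and the function name r_square_report is hypothetical. It fits both models by ordinary least squares with NumPy and computes the R-square for the no-intercept model two ways: with the uncorrected total sum of squares (the sum of y squared, which many packages use when the intercept is dropped) and with the corrected total sum of squares (about the mean of y).

```python
import numpy as np

def r_square_report(x, y):
    """Fit y on x with and without an intercept and report R-square values.

    Illustrative helper only; the name and return layout are assumptions.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Model with intercept: y = b0 + b1 * x
    X1 = np.column_stack([np.ones_like(x), x])
    beta1, *_ = np.linalg.lstsq(X1, y, rcond=None)
    sse_with = np.sum((y - X1 @ beta1) ** 2)

    # Model without intercept: y = b1 * x (line forced through the origin)
    X0 = x[:, None]
    beta0, *_ = np.linalg.lstsq(X0, y, rcond=None)
    sse_without = np.sum((y - X0 @ beta0) ** 2)

    sst_corrected = np.sum((y - y.mean()) ** 2)  # total SS about the mean
    sst_uncorrected = np.sum(y ** 2)             # total SS about zero

    return {
        "sse_with_intercept": sse_with,
        "sse_no_intercept": sse_without,
        "r2_with_intercept": 1 - sse_with / sst_corrected,
        "r2_no_intercept_uncorrected": 1 - sse_without / sst_uncorrected,
        "r2_no_intercept_corrected": 1 - sse_without / sst_corrected,
    }

# Made-up data: the response sits far from zero and is only weakly related to x.
x = [1, 2, 3, 4, 5]
y = [10, 12, 9, 13, 11]

report = r_square_report(x, y)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

With these made-up numbers, the no-intercept model reports an R-square of roughly 0.83 against roughly 0.09 for the model with an intercept, even though its residual sum of squares is much larger. Measured against the corrected total sum of squares, the no-intercept R-square is negative, which shows that the high value comes from the change of baseline and not from a better fit.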

1 comment:

COLLINS HENYA said...

The R-squared for regression without an intercept is found from the sum of squares due to regression (SSR) divided by the total sum of squares (SST). This square of the multiple correlation coefficient (R) indicates the proportion of the variation in the dependent variable that is explained by the model.
This percentage indicates whether or not the model is a good predictor model or can just be used to show relationships. By Collins Henya, Statistics, Moi University, TNS RMS. +254717082914.