Monday, January 19, 2009

Data dredging vs. Data mining; Post-hoc vs. Ad-hoc

Data dredging (data fishing, data snooping) is the inappropriate (sometimes deliberately so) search for 'statistically significant' relationships in large quantities of data. This activity was formerly known in the statistical community as data mining, but that term is now in widespread use with an essentially positive meaning, so the pejorative term data dredging is now used instead.
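To see why this kind of searching is a problem, here is a minimal simulation sketch. It is only an illustration under assumptions of my own: the sample size, the number of candidate variables, the 0.05 threshold, and the helper names (pearson_r, approx_p_value, dredge) are all hypothetical, and the p-value uses the Fisher z normal approximation rather than an exact test. Every candidate variable is pure noise, yet a handful still come out 'statistically significant'.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r using the Fisher z-transform (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(n_obs=100, n_vars=200, alpha=0.05, seed=0):
    """Count how many pure-noise variables look 'significant' against a noise outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
    hits = 0
    for _ in range(n_vars):
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]
        if approx_p_value(pearson_r(noise, outcome), n_obs) < alpha:
            hits += 1
    return hits


def familywise_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    print("noise variables flagged as significant:", dredge())
    print("chance of at least one false positive:", familywise_rate(0.05, 200))
```

With 200 unrelated variables tested at the 0.05 level, roughly 5% of them (about 10) would be expected to pass by chance alone, which is why a 'significant' relationship found by unplanned searching deserves skepticism.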

Data mining is the process of extracting hidden patterns from data. As more data is gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming this data into information. It is commonly used in a wide range of applications, such as marketing, fraud detection and scientific discovery. Data mining can be applied to data sets of any size. However, it can only uncover patterns that are actually present in the data that has been collected: it cannot find patterns that are absent from the data, nor patterns in data that was never collected.

Post hoc: In or of the form of an argument in which one event is asserted to be the cause of a later event simply by virtue of having happened earlier: coming to conclusions post hoc; post hoc reasoning.
[Latin, short for post hoc, ergō propter hoc, after this, therefore because of this : post, after + hoc, neuter of hic, this.]

Ad hoc: adv. For the specific purpose, case, or situation at hand and for no other: a committee formed ad hoc to address the issue of salaries.
adj. Formed for or concerned with one specific purpose: an ad hoc compensation committee.
Improvised and often impromptu: “On an ad hoc basis, Congress has . . . placed . . . ceilings on military aid to specific countries” (New York Times).
[Latin : ad, to + hoc, neuter accusative of hic, this.]

Both post-hoc and ad-hoc analyses may be prompted by data or results we have already seen. The difference is timing: an ad-hoc analysis is typically performed while the project is still under way, whereas a post-hoc analysis is performed strictly after the project has ended, after the study has been unblinded, or after the results of the pre-specified analyses have been reviewed. In this sense, an ad-hoc analysis is preferable to a post-hoc analysis.

