Data mining

Data mining, also known as knowledge discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.

Data mining has been defined as "The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [1] and "The science of extracting useful information from large data sets or databases" [2]. Although it is usually used in relation to analysis of data, data mining, like artificial intelligence, is an umbrella term and is used with varied meaning in a wide range of contexts.

In the technical context of data warehousing and analysis, the term data mining is neutral. However, it sometimes has a more pejorative usage that implies imposing patterns (and particularly causal relationships) on data where none exist. This imposition of irrelevant, misleading or trivial attribute correlation is more properly criticized as "data dredging" in the statistical literature.

Used in this latter sense, data dredging implies scanning the data for any relationships and then, when one is found, coming up with an interesting explanation. (This is also referred to as "overfitting the model".) The problem is that large data sets invariably happen to have some exciting relationships peculiar to that data. Therefore any conclusions reached are likely to be highly suspect. In spite of this, some exploratory data work is always required in any applied statistical analysis to get a feel for the data, so sometimes the line between good statistical practice and data dredging is less than clear.
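The claim that a large enough data set will contain some striking relationship purely by chance can be illustrated with a short Python sketch. The function names, the number of columns and observations, and the random seed are all assumptions chosen for illustration. Every column is independent random noise, so any correlation found between two of them is spurious by construction.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_vars=50, n_obs=20, seed=0):
    """Generate independent noise columns and return the most correlated pair."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    # Scan every pair of columns, which is exactly what dredging does.
    best = max(
        itertools.combinations(range(n_vars), 2),
        key=lambda pair: abs(pearson(columns[pair[0]], columns[pair[1]])),
    )
    return best, pearson(columns[best[0]], columns[best[1]])


pair, r = strongest_chance_correlation()
print(f"columns {pair} correlate at r = {r:.2f} although they are independent noise")
```

With 50 columns there are 1225 pairs to compare, so the strongest pair will typically look far more correlated than any single pre-specified pair would. That is why a relationship found by scanning needs to be confirmed on data that was not used to find it.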

A more significant danger is finding correlations that do not really exist. Investment analysts appear to be particularly vulnerable to this. In his book Where Are the Customers' Yachts? (1940, ISBN 0471119792), Fred Schwed, Jr., wrote: "There have always been a considerable number of pathetic people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it."

Most data mining efforts are focused on developing a fine-grained, highly detailed model of some large data set. In Data Mining For Very Busy People [3], researchers at West Virginia University and the University of British Columbia discuss an alternate method that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent relevant data.

There are also privacy concerns associated with data mining. For example, if an employer has access to medical records, it may screen out people who have diabetes or have had a heart attack. Screening out such employees will cut costs for insurance, but it creates ethical and legal problems.

Data mining government or commercial data sets for national security or law enforcement purposes has also raised privacy concerns. [4]

There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could be used to find combinations of drugs that cause adverse reactions. Since a given combination may occur in only 1 out of 1000 people, the problem may not be apparent from a single case. A project involving pharmacies could reduce the number of drug reactions and potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.
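As a rough illustration of the prescription example, the following Python sketch counts, for every pair of drugs, how many patients took both and what share of them reported an adverse reaction. The record layout, the drug names "A", "B" and "C", and the function name are hypothetical. A real study would need far more data and a proper statistical test before drawing any conclusion.

```python
from collections import defaultdict
from itertools import combinations


def reaction_rate_by_pair(records):
    """Map each drug pair to (patients taking both, share of them with a reaction)."""
    taking = defaultdict(int)
    reacted = defaultdict(int)
    for drugs, had_reaction in records:
        for pair in combinations(sorted(set(drugs)), 2):
            taking[pair] += 1
            if had_reaction:
                reacted[pair] += 1
    return {pair: (n, reacted[pair] / n) for pair, n in taking.items()}


# Hypothetical records: (drugs taken, whether an adverse reaction was reported).
records = [
    ({"A", "B"}, True),
    ({"A", "B", "C"}, True),
    ({"A", "C"}, False),
    ({"B", "C"}, False),
    ({"A", "C"}, False),
    ({"A", "B"}, False),
]

rates = reaction_rate_by_pair(records)
# List pairs from the highest reaction rate to the lowest.
for pair, (n, rate) in sorted(rates.items(), key=lambda item: -item[1][1]):
    print(pair, n, round(rate, 2))
```

In this toy data the pair ("A", "B") has the highest reaction rate, but with only three patients the figure means little. It shows the kind of aggregation across many records that no single case would reveal.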

Basically, data mining gives information that wouldn't be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.

The Apriori algorithm, which finds sets of items that frequently occur together, is one of the most fundamental algorithms used in data mining.
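The Apriori idea is that an itemset can only be frequent if all of its subsets are frequent, so candidates are built level by level from the itemsets that survived the previous level. The following is a minimal Python sketch under that assumption, not an optimized implementation. The function name, the shopping-basket data, and the support threshold of 3 are illustrative choices.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count the transactions containing each candidate; keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    current = count({frozenset([item]) for item in items})
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = apriori(baskets, min_support=3)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

On this data the sketch finds four frequent single items and four frequent pairs. The triple {bread, milk, diapers} is generated as a candidate but appears in only two baskets, so it is dropped.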