Data mining

Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.
947 Related Articles

Machine learning

Other terms used include data archaeology, information harvesting, information discovery, knowledge extraction, etc. Gregory Piatetsky-Shapiro coined the term "knowledge discovery in databases" for the first workshop on the topic (KDD-1989), and the term became more popular in the AI and machine-learning communities.
Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.

Data pre-processing

Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
Data preprocessing is an important step in the data mining process.

Cluster analysis

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). As data sets have grown in size and complexity, direct "hands-on" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s).
It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
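To make the idea concrete, here is a minimal k-means sketch in pure Python (illustrative only; the points and cluster count are invented, and practical work would use a library such as scikit-learn):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k random data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(vals) / len(vals) for vals in zip(*cl))
                     if cl else centroids[i]  # keep old centroid if cluster empties
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups of 2-D points (made-up data).
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids, clusters = kmeans(points, k=2)
```

Each iteration assigns every point to its nearest centroid and recomputes each centroid as its cluster mean; for well-separated data like this, a handful of iterations suffices.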

Data analysis

Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.
Data mining is a particular data analysis technique that focuses on statistical modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information.

Business intelligence

It is also a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics), as well as to any application of computer decision support systems, including artificial intelligence (e.g., machine learning) and business intelligence.
Common functions of business intelligence technologies include reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics.

Anomaly detection

In data mining, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
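A simple way to see this in code is a z-score rule: flag any value that lies far from the sample mean in standard-deviation units (an illustrative sketch with invented numbers, not a production detector):

```python
def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if std and abs(v - mean) / std > threshold]

# Mostly values near 10, with one suspicious reading (made-up data).
data = [10, 11, 9, 10, 12, 10, 11, 9, 100]
print(zscore_outliers(data, threshold=2.0))  # -> [100]
```

Real detectors are usually more robust (e.g., median-based), since the outlier itself inflates the mean and standard deviation used here.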

Predictive analytics

These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics.
Predictive analytics encompasses a variety of statistical techniques from data mining, predictive modelling, and machine learning that analyze current and historical facts to make predictions about future or otherwise unknown events.
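As a toy illustration of predicting an unknown future value from historical facts, here is ordinary least squares on a single feature in plain Python (the monthly sales figures are invented):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b to historical data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Sales for months 1-5 (hypothetical numbers); predict month 6.
a, b = fit_line([1, 2, 3, 4, 5], [12, 14, 15, 17, 19])
print(round(a * 6 + b, 1))  # -> 20.5
```

The fitted slope (about 1.7 per month here) extrapolates the trend one month ahead; real predictive analytics would validate such a model on held-out data before trusting its forecasts.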

Sequential pattern mining

Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.
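For example, a crude sequence miner can count ordered item pairs (a occurring before b) and keep those supported by enough sequences. This is a toy sketch over invented clickstream logs; real miners such as PrefixSpan handle longer patterns far more efficiently:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(sequences, min_support):
    """Count ordered pairs (a before b) supported by >= min_support sequences."""
    counts = Counter()
    for seq in sequences:
        seen = set()  # count each pair at most once per sequence
        for i, j in combinations(range(len(seq)), 2):
            seen.add((seq[i], seq[j]))
        counts.update(seen)
    return {pair: c for pair, c in counts.items() if c >= min_support}

# Hypothetical user sessions.
logs = [["login", "search", "buy"],
        ["login", "buy"],
        ["search", "login", "buy"]]
print(frequent_pairs(logs, min_support=3))  # -> {('login', 'buy'): 3}
```

Here only the pattern login → buy is supported by all three sequences.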

Artificial intelligence

In the late 1990s and early 21st century, AI began to be used for logistics, data mining, medical diagnosis and other areas.

Gregory Piatetsky-Shapiro

Gregory I. Piatetsky-Shapiro (born 7 April 1958) is a data scientist and the co-founder of the KDD conferences, and co-founder and past chair of the Association for Computing Machinery SIGKDD group for Knowledge Discovery, Data Mining and Data Science.

Data warehouse

The main source of the data is cleansed, transformed, catalogued, and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support.

Usama Fayyad

It was co-chaired by Usama Fayyad and Ramasamy Uthurusamy.
Usama M. Fayyad (born July, 1963) is an American data scientist and co-founder of KDD conferences and ACM SIGKDD association for Knowledge Discovery and Data Mining.

Decision tree learning

It is one of the predictive modeling approaches used in statistics, data mining and machine learning.
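The core of tree induction is choosing a split that best purifies the classes. A minimal sketch over one numeric feature, scoring candidate thresholds by weighted Gini impurity (the data are invented):

```python
def best_split(xs, ys):
    """Find the threshold on one feature with the lowest weighted Gini
    impurity -- the core step of decision tree learning (binary labels)."""
    def gini(labels):
        n = len(labels)
        if n == 0:
            return 0.0
        p = sum(labels) / n           # fraction of class 1
        return 2 * p * (1 - p)        # impurity: 0 means pure

    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # -> (3, 0.0)
```

A full learner applies this search recursively over every feature and each resulting partition; here the threshold 3 separates the two classes perfectly (impurity 0).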

Data management

Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Data Mining and Knowledge Discovery

A year later, in 1996, Usama Fayyad launched the Kluwer journal Data Mining and Knowledge Discovery as its founding editor-in-chief.
Data Mining and Knowledge Discovery is a bimonthly peer-reviewed scientific journal focusing on data mining published by Springer Science+Business Media.

Automatic summarization

Automatic data summarization is part of machine learning and data mining.

Statistics

Inference can extend to forecasting, prediction and estimation of unobserved values either in or associated with the population being studied; it can include extrapolation and interpolation of time series or spatial data, and can also include data mining.

Special Interest Group on Knowledge Discovery and Data Mining

Later, Piatetsky-Shapiro started the SIGKDD newsletter, SIGKDD Explorations.
SIGKDD is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining.

Examples of data mining

Notable examples of data mining can be found throughout business, medicine, science, and surveillance.
Data mining, the process of discovering patterns in large data sets, has been used in many applications.

Statistical inference

The MDL principle has been applied in communication-coding theory in information theory, in linear regression, and in data mining.

Statistical hypothesis testing

Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing.
A related problem is that of multiple testing (sometimes linked to data mining), in which a variety of tests for a variety of possible effects are applied to a single data set and only those yielding a significant result are reported.
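A standard guard against this is the Bonferroni correction, which tightens the significance threshold in proportion to the number of tests. A minimal sketch (the p-values are invented):

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: a result is significant only if its p-value
    is below alpha divided by the number of tests performed."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Five simultaneous tests (hypothetical p-values).
print(bonferroni([0.004, 0.03, 0.2, 0.6, 0.012]))
# -> [True, False, False, False, False]
```

With five tests, only p-values below 0.05 / 5 = 0.01 are declared significant, so the 0.03 result that would pass a single test at α = 0.05 no longer counts.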

Total Information Awareness

In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.
Although the program was formally suspended, its data mining software was later adopted by other government agencies, with only superficial changes being made.

Cross-industry standard process for data mining

Many variations on this theme exist, however, such as the Cross-industry standard process for data mining (CRISP-DM), which defines six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts.

Receiver operating characteristic

Several statistical methods may be used to evaluate the algorithm, such as ROC curves.
ROC analysis since then has been used in medicine, radiology, biometrics, forecasting of natural hazards, meteorology, model performance assessment, and other areas for many decades and is increasingly used in machine learning and data mining research.
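The area under the ROC curve has a useful probabilistic reading: it equals the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting half). A minimal sketch with invented labels and scores:

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count 0.5) -- equal to the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two positives and two negatives (made-up classifier scores).
print(roc_auc([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]))  # -> 0.75
```

Here three of the four positive/negative pairs are ranked correctly, giving an AUC of 0.75; a perfect ranking would score 1.0 and a random one about 0.5.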

Web mining

Under European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal.
Web mining is the application of data mining techniques to discover patterns from the World Wide Web.