A report on Data mining

An example of a spurious correlation produced by data dredging through a bot operated by statistician Tyler Vigen, apparently showing a close link between the number of letters in the winning word of a spelling bee competition and the number of people in the United States killed by venomous spiders. The similarity in trends is obviously a coincidence.
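The effect described in this caption can be reproduced with a minimal sketch, assuming purely random data: when enough unrelated series are compared pairwise, some pair will look strongly correlated by chance. The function names `pearson` and `dredge`, the series count, the series length, and the seed are all illustrative assumptions, not part of Vigen's tool.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def dredge(n_series=200, length=10, seed=0):
    """Generate unrelated random series and return the most correlated pair.

    The series are independent Gaussian noise (an assumption for this
    sketch), so any strong correlation found here is a coincidence.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(length)] for _ in range(n_series)]
    # Exhaustively compare every pair and keep the largest |r|.
    best = max(
        combinations(range(n_series), 2),
        key=lambda pair: abs(pearson(series[pair[0]], series[pair[1]])),
    )
    return best, pearson(series[best[0]], series[best[1]])


if __name__ == "__main__":
    (i, j), r = dredge()
    print(f"series {i} and {j}: r = {r:.3f}")
```

With 200 short series there are 19,900 pairs to search, which is why a striking but meaningless match is easy to find.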

Process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.



Machine learning as subfield of AI

Machine learning

16 links

Field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks.

Part of machine learning as subfield of AI or part of AI as subfield of machine learning
A support-vector machine is a supervised learning model that divides the data into regions separated by a linear boundary. Here, the linear boundary divides the black circles from the white.
An artificial neural network is an interconnected group of nodes, akin to the vast network of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another.
Illustration of linear regression on a data set.
A simple Bayesian network. Rain influences whether the sprinkler is activated, and both rain and the sprinkler influence whether the grass is wet.
The blue line could be an example of overfitting a linear function due to random noise.
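The rain, sprinkler, and wet-grass network mentioned in the captions above can be sketched by direct enumeration of the joint distribution. The probability values below are assumed purely for illustration and do not come from the text; the names `joint` and `prob_rain_given_wet` are likewise hypothetical.

```python
from itertools import product

# Assumed, illustrative probabilities (not taken from the text).
P_RAIN = 0.2
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}  # keyed by rain
P_WET_GIVEN = {  # keyed by (sprinkler, rain)
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """Joint probability P(rain, sprinkler, wet) from the network's factors."""
    p = P_RAIN if rain else 1 - P_RAIN
    p_sprinkler = P_SPRINKLER_GIVEN_RAIN[rain]
    p *= p_sprinkler if sprinkler else 1 - p_sprinkler
    p_wet = P_WET_GIVEN[(sprinkler, rain)]
    p *= p_wet if wet else 1 - p_wet
    return p


def prob_rain_given_wet():
    """P(rain | grass is wet), summing the sprinkler variable out."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(
        joint(r, s, True) for r, s in product((True, False), repeat=2)
    )
    return numerator / denominator


if __name__ == "__main__":
    print(f"P(rain | wet grass) = {prob_rain_given_wet():.4f}")
```

Enumeration is only practical for very small networks like this one; it is used here because it makes the factorization explicit.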

Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.

Data science process flowchart from Doing Data Science, by Schutt & O'Neil (2013)

Data analysis

5 links

Process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

The phases of the intelligence cycle used to convert raw information into actionable intelligence or knowledge are conceptually similar to the phases in data analysis.
Data visualization is used to help understand the results after data is analyzed.
A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over time.
A scatterplot illustrating the correlation between two variables (inflation and unemployment) measured at points in time.
An illustration of the MECE principle used for data analysis.

Data mining is a particular data analysis technique that focuses on statistical modelling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing mainly on business information.

Business intelligence

3 links

Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis and management of business information.

Common functions of business intelligence technologies include reporting, online analytical processing, analytics, dashboard development, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics, and prescriptive analytics.

Gregory Piatetsky-Shapiro in NYC

Gregory Piatetsky-Shapiro

3 links

Gregory I. Piatetsky-Shapiro (born 7 April 1958) is a data scientist and the co-founder of the KDD conferences, and co-founder and past chair of the Association for Computing Machinery SIGKDD group for Knowledge Discovery, Data Mining and Data Science.

The normal distribution, a very common probability density, useful because of the central limit theorem.

Statistics

2 links

Discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.

Scatter plots are used in descriptive statistics to show the observed relationships between different variables, here using the Iris flower data set.
Gerolamo Cardano, a pioneer of the mathematics of probability.
Karl Pearson, a founder of mathematical statistics.
A least squares fit: in red the points to be fitted, in blue the fitted line.
Confidence intervals: the red line is the true value of the mean in this example; the blue lines are random confidence intervals for 100 realizations.
In this graph, the black line is the probability distribution of the test statistic, the critical region is the set of values to the right of the observed data point (the observed value of the test statistic), and the p-value is represented by the green area.
The confounding variable problem: X and Y may be correlated not because there is a causal relationship between them, but because both depend on a third variable Z. Z is called a confounding factor.
gretl, an example of an open source statistical package

Statistical inference can include extrapolation and interpolation of time series or spatial data, as well as data mining.
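The confounding variable problem described in the captions above can be illustrated with a small simulation, offered as a sketch under stated assumptions: X and Y are each generated as Z plus independent noise, so neither causes the other. The names `simulate`, `residuals`, and `pearson`, the noise level, and the sample size are illustrative choices.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=1):
    """Draw X and Y that both depend on Z but not on each other."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 0.5) for zi in z]
    y = [zi + rng.gauss(0, 0.5) for zi in z]
    return x, y, z


def residuals(values, z):
    """Remove the linear effect of z from values with a least-squares fit."""
    n = len(z)
    mean_z = sum(z) / n
    mean_v = sum(values) / n
    slope = sum((zi - mean_z) * (vi - mean_v) for zi, vi in zip(z, values))
    slope /= sum((zi - mean_z) ** 2 for zi in z)
    return [vi - mean_v - slope * (zi - mean_z) for zi, vi in zip(z, values)]


if __name__ == "__main__":
    x, y, z = simulate()
    print(f"corr(X, Y)            = {pearson(x, y):.3f}")
    # Correlation of the residuals is the partial correlation given Z.
    partial = pearson(residuals(x, z), residuals(y, z))
    print(f"corr(X, Y | Z removed) = {partial:.3f}")
```

The raw correlation between X and Y is strong, while the correlation that remains after Z is regressed out of both is close to zero, which is the signature of confounding in this linear setting.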

Usama Fayyad

3 links

Usama M. Fayyad (born July 1963) is an American-Jordanian data scientist and co-founder of the KDD conferences and of the ACM SIGKDD association for Knowledge Discovery and Data Mining.

Data visualization is one of the steps in analyzing data and presenting it to users.

Data and information visualization

3 links

Interdisciplinary field that deals with the graphic representation of data and information.

Partial map of the Internet in early 2005, represented as a graph. Each line connects two IP addresses (nodes) and represents some delay between them.
Charles Joseph Minard's 1869 diagram of Napoleonic France's invasion of Russia, an early example of an information graphic
A time series illustrated with a line chart demonstrating trends in U.S. federal spending and revenue over time
A scatterplot illustrating negative correlation between two variables (inflation and unemployment) measured at points in time
Selected milestones and inventions
Product Space Localization, intended to show the Economic Complexity of a given economy
Tree map of Benin exports (2009) by product category. The Product Exports Treemaps are one of the most recent applications of this kind of visualization, developed by the Harvard-MIT Observatory of Economic Complexity
Planetary movements
Playfair TimeSeries
A data visualization from social media

The field of data and information visualization has emerged "from research in human–computer interaction, computer science, graphics, visual design, psychology, and business methods. It is increasingly applied as a critical component in scientific research, digital libraries, data mining, financial data analysis, market studies, manufacturing production control, and drug discovery".

Special Interest Group on Knowledge Discovery and Data Mining

3 links

SIGKDD, representing the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining, hosts an influential annual conference.

Sequential pattern mining

1 link

Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.
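As a rough illustration of the idea, the sketch below counts ordered item pairs (a occurring somewhere before b) and keeps those that appear in at least `min_support` sequences. This is a deliberately simplified stand-in, not an established algorithm such as GSP or PrefixSpan, and the function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_ordered_pairs(sequences, min_support):
    """Return ordered pairs (a, b), with a before b, and their support.

    Support is the number of sequences containing the pair, so a pair
    that repeats inside one sequence is still counted once for it.
    """
    support = Counter()
    for seq in sequences:
        # combinations() preserves positional order, so a precedes b.
        pairs_in_seq = set(combinations(seq, 2))
        support.update(pairs_in_seq)
    return {pair: n for pair, n in support.items() if n >= min_support}


if __name__ == "__main__":
    # Hypothetical click-stream sessions.
    sessions = [
        ["login", "search", "buy"],
        ["login", "buy"],
        ["search", "login", "buy"],
    ]
    for pair, count in sorted(frequent_ordered_pairs(sessions, 2).items()):
        print(pair, count)
```

Only patterns of length two are considered here; real sequential pattern miners extend frequent patterns step by step to longer subsequences.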

Simple semantic data mining diagram

Data pre-processing

0 links

Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process.
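Two common preprocessing steps, dropping incomplete records and rescaling features, can be sketched as follows. The helper names `drop_incomplete` and `min_max_scale`, the use of `None` to mark missing values, and the sample rows are assumptions made for illustration only.

```python
def drop_incomplete(rows):
    """Drop rows that contain a missing value (represented here as None)."""
    return [row for row in rows if all(value is not None for value in row)]


def min_max_scale(rows):
    """Rescale each column to the range [0, 1].

    A constant column has no spread, so its values are mapped to 0.0.
    """
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append(
            [(value - low) / span if span else 0.0 for value in column]
        )
    # Transpose back from columns to rows.
    return [list(row) for row in zip(*scaled_columns)]


if __name__ == "__main__":
    # Hypothetical records: two numeric features per row.
    raw = [[1.0, 10.0], [None, 20.0], [3.0, 30.0], [2.0, 20.0]]
    clean = drop_incomplete(raw)
    print(min_max_scale(clean))
```

Dropping rows is the simplest way to handle missing data; whether it is appropriate depends on how much data is lost and on why the values are missing.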