Data pre-processing

preprocessing, data preparation, pre-processing, preprocessed
Data preprocessing is an important step in the data mining process.
34 Related Articles

Data mining

data-mining, datamining, knowledge discovery in databases
Data preprocessing is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.
Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.

Instance selection

Selected Instances
Data preprocessing includes cleaning, instance selection, normalization, transformation, and feature extraction and selection. The product of data preprocessing is the final training set.
Instance selection (or dataset reduction, or dataset condensation) is an important data pre-processing step that can be applied in many machine learning (or data mining) tasks.

Garbage in, garbage out

GIGO, meaningless, garbage data
The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects.

Machine learning

machine-learning, learning, statistical learning
The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Often, data preprocessing is the most important phase of a machine learning project, especially in computational biology.

Missing data

missing values, missing at random, incomplete data
Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results.
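The kinds of problems listed above can be screened for with a few explicit checks. The following is a minimal sketch rather than a standard routine: the field names (`income`, `sex`, `pregnant`), the rule that income must be non-negative, and the function name `screen_record` are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical set of fields every record is expected to carry.
REQUIRED_FIELDS = ("income", "sex", "pregnant")


def screen_record(record: Dict) -> List[str]:
    """Return a list of data-quality problems found in one record."""
    problems = []

    # Missing values: a required field is absent or set to None.
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            problems.append(f"missing value: {field}")

    # Out-of-range value: income is assumed to be non-negative.
    income = record.get("income")
    if income is not None and income < 0:
        problems.append("out-of-range value: income")

    # Impossible combination of two otherwise valid fields.
    if record.get("sex") == "male" and record.get("pregnant") is True:
        problems.append("impossible combination: sex=male, pregnant=True")

    return problems


records = [
    {"income": 42000, "sex": "female", "pregnant": False},
    {"income": -100, "sex": "male", "pregnant": True},
    {"income": None, "sex": "female", "pregnant": False},
]

# Map each record index to the problems found in it.
report = {i: screen_record(r) for i, r in enumerate(records)}
```

Records with a non-empty problem list can then be corrected, imputed, or excluded before analysis; which of these is appropriate depends on the project.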

Data quality

quality, Data quality assurance, data quality assessment
Thus, the representation and quality of the data must be addressed first, before running an analysis.

Computational biology

computational biologist, computational, computational biologists
Often, data preprocessing is the most important phase of a machine learning project, especially in computational biology.

Knowledge extraction

knowledge discovery, derivation of knowledge, discovery
If the data contains much irrelevant and redundant information, or is noisy and unreliable, then knowledge discovery during the training phase becomes more difficult.

Canonical form

canonical, normal form, canonically
Data preprocessing includes cleaning, instance selection, normalization, transformation, and feature extraction and selection. The product of data preprocessing is the final training set.

Data transformation

Mediation, Data mediation, transformation
Data preprocessing includes cleaning, instance selection, normalization, transformation, and feature extraction and selection. The product of data preprocessing is the final training set.

Feature extraction

constructing, Early vision, extract features
Data preprocessing includes cleaning, instance selection, normalization, transformation, and feature extraction and selection. The product of data preprocessing is the final training set.

Feature selection

variable selection, features, selecting
Data preprocessing includes cleaning, instance selection, normalization, transformation, and feature extraction and selection. The product of data preprocessing is the final training set.

Training, validation, and test sets

training set, training data, test set
Data preprocessing includes cleaning, instance selection, normalization, transformation, and feature extraction and selection. The product of data preprocessing is the final training set.

Chemometrics

chemometric, chemical research, chemometrician
This aspect should be carefully considered when interpretation of the results is a key point, such as in the multivariate processing of chemical data (chemometrics).

Data cleansing

clean, cleaning, data cleaning
Data preprocessing includes cleaning, instance selection, normalization, transformation, and feature extraction and selection. The product of data preprocessing is the final training set.

Data binning

binning, bin, binned
Data binning (also called discrete binning or bucketing) is a data pre-processing technique used to reduce the effects of minor observation errors.
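One common way to realize this is fixed-width binning, where each observation is replaced by a value representative of its interval. The sketch below assumes equal-width bins and uses the bin center as the representative; the function name `bin_values` and the parameters `bin_width` and `origin` are illustrative choices, not part of any particular library.

```python
import math
from typing import Iterable, List


def bin_values(values: Iterable[float], bin_width: float, origin: float = 0.0) -> List[float]:
    """Replace each value with the center of the fixed-width bin it falls in.

    Bins are half-open intervals [origin + k*w, origin + (k+1)*w).
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")

    binned = []
    for v in values:
        # Index of the bin containing v (floor also handles negative values).
        index = math.floor((v - origin) / bin_width)
        # Representative value: the midpoint of that bin.
        binned.append(origin + (index + 0.5) * bin_width)
    return binned


# Small differences between 1.2 and 1.9 disappear after binning.
smoothed = bin_values([1.2, 1.9, 2.1, 7.8], bin_width=2.0)
```

Here 1.2 and 1.9 land in the same bin and become indistinguishable, while 2.1 falls just over a bin edge and is moved to the next center. The bin width therefore controls how much detail is traded for robustness.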

Range searching

range search, orthogonal range, orthogonal range search
In data structures, the range searching problem most generally consists of preprocessing a set S of objects, in order to determine which objects from S intersect with a query object, called a range.
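The simplest instance of this idea is one-dimensional: the objects are numbers and the range is a closed interval. The sketch below is an illustration under that assumption, with sorting as the preprocessing step and binary search answering each query; the class name `RangeIndex1D` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List


class RangeIndex1D:
    """Preprocess a set of numbers so interval queries avoid a full scan."""

    def __init__(self, points: Iterable[float]) -> None:
        # Preprocessing: sort once, in O(n log n) time.
        self._sorted = sorted(points)

    def query(self, low: float, high: float) -> List[float]:
        """Return all stored points p with low <= p <= high, in sorted order."""
        start = bisect_left(self._sorted, low)    # first index with value >= low
        stop = bisect_right(self._sorted, high)   # first index with value > high
        return self._sorted[start:stop]


index = RangeIndex1D([7, 1, 9, 3, 5])
hits = index.query(3, 7)
```

After the one-time sort, each query needs two binary searches plus time proportional to the number of reported points. Higher-dimensional ranges call for different structures, which this sketch does not cover.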

RapidMiner

RapidMiner is a data science software platform developed by the company of the same name that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics.

Automated machine learning

AutoML
An expert may have to apply appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods that make the dataset amenable to machine learning.

Outline of machine learning

machine learning algorithms, learning algorithms, machine learning

Fault detection and isolation

fault detection, Fault isolation, Machine fault diagnosis
Over the past decades, various classification and preprocessing models have been developed and proposed in this research area.