Outlier

outliersconservative estimateirregularitiesoutlier detectionstatistical outliers
In statistics, an outlier is an observation point that is distant from other observations.wikipedia
214 Related Articles

Robust statistics

robustbreakdown pointrobustness
In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution.
One motivation is to produce statistical methods that are not unduly affected by outliers.

Normal distribution

normally distributednormalGaussian
In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution. In the case of normally distributed data, the three sigma rule means that roughly 1 in 22 observations will differ by twice the standard deviation or more from the mean, and 1 in 370 will deviate by three times the standard deviation.
Therefore, it may not be an appropriate model when one expects a significant fraction of outliers—values that lie many standard deviations away from the mean—and least squares and other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data.

Arithmetic mean

meanaveragearithmetic
In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable.
While the arithmetic mean is often used to report central tendencies, it is not a robust statistic, meaning that it is greatly influenced by outliers (values that are very much larger or smaller than most of the values).

Median

averagesample medianmedian-unbiased estimator
For example, if one is calculating the average temperature of 10 objects in a room, and nine of them are between 20 and 25 degrees Celsius, but an oven is at 175 °C, the median of the data will be between 20 and 25 °C but the mean temperature will be between 35.5 and 40 °C. In this case, the median better reflects the temperature of a randomly sampled object (but not the temperature in the room) than the mean; naively interpreting the mean as "a typical sample", equivalent to the median, is incorrect.
The median is a popular summary statistic used in descriptive statistics, since it is simple to understand and easy to calculate, while also giving a measure that is more robust in the presence of outlier values than is the mean.

Sample maximum and minimum

sample maximumsample minimumMaximum
Outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low.
If the sample has outliers, they necessarily include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low.

Box plot

boxplotbox and whisker plotadjusted boxplots
Box plots are a hybrid.
Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.

Normal probability plot

Some are graphical such as normal probability plots.
This includes identifying outliers, skewness, kurtosis, a need for transformations, and mixtures.

King effect

Additionally, the pathological appearance of outliers of a certain form appears in a variety of datasets, indicating that the causative mechanism for the data might differ at the extreme end (King effect).
In statistics, economics, and econophysics, the King effect refers to the phenomenon where the top one or two members of a ranked set show up as outliers.

68–95–99.7 rule

3-sigma68-95-99.7 rulethree sigma rule
In the case of normally distributed data, the three sigma rule means that roughly 1 in 22 observations will differ by twice the standard deviation or more from the mean, and 1 in 370 will deviate by three times the standard deviation.
It is also as a simple test for outliers if the population is assumed normal, and as a normality test if the population is potentially not normal.

Mixture model

mixture modelsGaussian mixture modelGMM
A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations, or may indicate 'correct trial' versus 'measurement error'; this is modeled by a mixture model.
outliers.

Grubbs' test for outliers

Grubbs' test for outliers
Grubbs's test (named after Frank E. Grubbs, who published the test in 1950 ), also known as the maximum normalized residual test or extreme studentized deviate test, is a statistical test used to detect outliers in a univariate data set assumed to come from a normally distributed population.

Chauvenet's criterion

Chauvenet's criterion
In statistical theory, Chauvenet's criterion (named for William Chauvenet ) is a means of assessing whether one piece of experimental data — an outlier — from a set of observations, is likely to be spurious.

Heavy-tailed distribution

heavy tailsheavy-tailedheavy tail
Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution.
Outlier

Dixon's Q test

Dixon's Q test
In statistics, Dixon's Q test, or simply the Q test, is used for identification and rejection of outliers.

Estimation of covariance matrices

Shrinkage estimationcovariance estimationestimate of the covariance
Some estimators are highly sensitive to outliers, notably estimation of covariance matrices.
Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive.

Interquartile range

inter-quartile rangebelowinterquartile
Other methods flag observations based on measures such as the interquartile range.
The IQR can be used to identify outliers (see below).

Local outlier factor

LOFLocal Outlier Factor (LOF)
In the data mining task of anomaly detection, other approaches are distance-based and density-based such as Local Outlier Factor (LOF), and most of them use the distance to the k-nearest neighbors to label observations as outliers or non-outliers.
These are considered to be outliers.

Quartile

quartileslower quartilelower and upper quartiles
For example, if Q_1 and Q_3 are the lower and upper quartiles respectively, then one could define an outlier to be any observation outside the range:
There are methods by which to check for outliers in the discipline of statistics and statistical analysis.

Cook's distance

Cook's ''D
In regression problems, an alternative approach may be to only exclude points which exhibit a large degree of influence on the estimated coefficients, using a measure such as Cook's distance.
Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.

Robust regression

robust estimationRobustrobust linear model
Robust regression
In particular, least squares estimates for regression models are highly sensitive to (i.e. not robust against) outliers.

Winsorizing

Winsorised estimatorswinsorized
The two common approaches to exclude outliers are truncation (or trimming) and Winsorising.
Winsorizing or winsorization is the transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers.

Mahalanobis distance

Chi distanceMahalanobis metricMahalanobis
Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models.
Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models.

Studentized residual

studentized residualsexternallystudentization of residuals
Studentized residual
This is an important technique in the detection of outliers.

Anomaly (natural sciences)

anomaly time series
Anomaly (natural sciences)
Robust statistics, resistant to the effects of outliers, are sometimes used as the basis of the transformation.

Data analysis

data analyticsanalysisdata analyst
If a data point (or points) is excluded from the data analysis, this should be clearly stated on any subsequent report.
In the case of outliers: should one use robust analysis techniques?