Outlier

outliersstatistical outliersconservative estimateirregularitiesoutlier detectionOutliers in statistics
In statistics, an outlier is a data point that differs significantly from other observations.wikipedia
228 Related Articles

Robust statistics

robustbreakdown pointrobustness
In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution.
One motivation is to produce statistical methods that are not unduly affected by outliers.

Skewness

skewedskewskewed distribution
In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution.
We can transform this sequence into a negatively skewed distribution by adding a value far below the mean, which is probably a negative outlier, e.g. (40, 49, 50, 51).

Normal distribution

normally distributedGaussian distributionnormal
In the former case one wishes to discard them or use statistics that are robust to outliers, while in the latter case they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution. In the case of normally distributed data, the three sigma rule means that roughly 1 in 22 observations will differ by twice the standard deviation or more from the mean, and 1 in 370 will deviate by three times the standard deviation.
Therefore, it may not be an appropriate model when one expects a significant fraction of outliers—values that lie many standard deviations away from the mean—and least squares and other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data.

Arithmetic mean

meanaveragearithmetic
In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable.
While the arithmetic mean is often used to report central tendencies, it is not a robust statistic, meaning that it is greatly influenced by outliers (values that are very much larger or smaller than most of the values).

Sample maximum and minimum

sample maximumsample minimumMaximum
Outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low.
If the sample has outliers, they necessarily include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low.

Median

averagesample medianmedian-unbiased estimator
For example, if one is calculating the average temperature of 10 objects in a room, and nine of them are between 20 and 25 degrees Celsius, but an oven is at 175 °C, the median of the data will be between 20 and 25 °C but the mean temperature will be between 35.5 and 40 °C.
The median is a popular summary statistic used in descriptive statistics, since it is simple to understand and easy to calculate, while also giving a measure that is more robust in the presence of outlier values than is the mean.

Box plot

boxplotbox and whisker plotadjusted boxplots
Box plots are a hybrid.
Outliers may be plotted as individual points.

Normal probability plot

Some are graphical such as normal probability plots.
This includes identifying outliers, skewness, kurtosis, a need for transformations, and mixtures.

King effect

Additionally, the pathological appearance of outliers of a certain form appears in a variety of datasets, indicating that the causative mechanism for the data might differ at the extreme end (King effect).
In statistics, economics, and econophysics, the King effect refers to the phenomenon where the top one or two members of a ranked set show up as outliers.

68–95–99.7 rule

3-sigma68-95-99.7 rulethree sigma rule
In the case of normally distributed data, the three sigma rule means that roughly 1 in 22 observations will differ by twice the standard deviation or more from the mean, and 1 in 370 will deviate by three times the standard deviation.
It is also used as a simple test for outliers if the population is assumed normal, and as a normality test if the population is potentially not normal.

Grubbs's test for outliers

In statistics, Grubbs's test or the Grubbs test (named after Frank E. Grubbs, who published the test in 1950 ), also known as the maximum normalized residual test or extreme studentized deviate test, is a test used to detect outliers in a univariate data set assumed to come from a normally distributed population.

Chauvenet's criterion

In statistical theory, Chauvenet's criterion (named for William Chauvenet ) is a means of assessing whether one piece of experimental data — an outlier — from a set of observations, is likely to be spurious.

Heavy-tailed distribution

heavy tailsheavy-tailedheavy tail
Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution.

Mixture model

Gaussian mixture modelmixture modelsGMM
A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations, or may indicate 'correct trial' versus 'measurement error'; this is modeled by a mixture model.
outliers.

Dixon's Q test

Q testDixon's ''Q'' test
In statistics, Dixon's Q test, or simply the Q test, is used for identification and rejection of outliers.

Estimation of covariance matrices

Covariance estimationShrinkage estimationestimate of the covariance
Some estimators are highly sensitive to outliers, notably estimation of covariance matrices.
Another issue is the robustness to outliers, to which sample covariance matrices are highly sensitive.

Quartile

quartileslower quartilelower and upper quartiles
For example, if Q_1 and Q_3 are the lower and upper quartiles respectively, then one could define an outlier to be any observation outside the range:
There are methods by which to check for outliers in the discipline of statistics and statistical analysis.

Local outlier factor

LOFLocal Outlier Factor (LOF)
In the data mining task of anomaly detection, other approaches are distance-based and density-based such as Local Outlier Factor (LOF), and most of them use the distance to the k-nearest neighbors to label observations as outliers or non-outliers.
These are considered to be outliers.

Interquartile range

inter-quartile rangebelowinterquartile
Other methods flag observations based on measures such as the interquartile range.
The IQR can be used to identify outliers (see below).

Cook's distance

Cook's ''D
In regression problems, an alternative approach may be to only exclude points which exhibit a large degree of influence on the estimated coefficients, using a measure such as Cook's distance.
Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.

Anscombe's quartet

They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties.

Robust regression

robust estimationRobustrobust linear model
In particular, least squares estimates for regression models are highly sensitive to outliers.

Anomaly (natural sciences)

Anomaly time seriesAtmospheric anomalyanomalies
Robust statistics, resistant to the effects of outliers, are sometimes used as the basis of the transformation.

Winsorizing

WinsorisingwinsorizationWinsorised estimators
The two common approaches to exclude outliers are truncation (or trimming) and Winsorising.
Winsorizing or winsorization is the transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers.

Data analysis

data analyticsanalysisdata analyst
If a data point (or points) is excluded from the data analysis, this should be clearly stated on any subsequent report.