A report on Correlation

Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for each set. The correlation reflects the strength (noisiness) and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.
Example scatterplots of various datasets with various correlation coefficients.
Pearson/Spearman correlation coefficients between X and Y are shown when the two variables' ranges are unrestricted, and when the range of X is restricted to the interval (0,1).
Anscombe's quartet: four sets of data with the same correlation of 0.816

Correlation is any statistical relationship, whether causal or not, between two random variables or bivariate data.


Related topics

Pearson correlation coefficient

Examples of scatter diagrams with different values of correlation coefficient (ρ)
This figure gives a sense of how the usefulness of a Pearson correlation for predicting values varies with its magnitude. Given jointly normal X, Y with correlation ρ, √(1 − ρ²) (plotted here as a function of ρ) is the factor by which a given prediction interval for Y may be reduced given the corresponding value of X. For example, if ρ = 0.5, then the 95% prediction interval of Y|X will be about 13% smaller than the 95% prediction interval of Y.
Critical values of Pearson's correlation coefficient that must be exceeded to be considered significantly nonzero at the 0.05 level.
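The interval-shrinkage claim above (for jointly normal variables, a prediction interval for Y narrows by the factor √(1 − ρ²) once X is known, so ρ = 0.5 gives roughly a 13% reduction) is easy to check numerically. A minimal sketch:

```python
import math

def interval_shrinkage(rho):
    """Factor by which a prediction interval for Y narrows given X,
    for jointly normal X, Y with correlation rho."""
    return math.sqrt(1 - rho ** 2)

# rho = 0.5 leaves about 86.6% of the interval, i.e. a ~13.4% reduction
print(round(1 - interval_shrinkage(0.5), 3))
```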

In statistics, the Pearson correlation coefficient (PCC), also known as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply the correlation coefficient, is a measure of linear correlation between two sets of data.

Statistics

Discipline that concerns the collection, organization, analysis, interpretation, and presentation of data.

The normal distribution, a very common probability density, useful because of the central limit theorem.
Scatter plots are used in descriptive statistics to show the observed relationships between different variables, here using the Iris flower data set.
Gerolamo Cardano, a pioneer on the mathematics of probability.
Karl Pearson, a founder of mathematical statistics.
A least squares fit: in red the points to be fitted, in blue the fitted line.
Confidence intervals: the red line is the true value of the mean in this example; the blue lines are random confidence intervals for 100 realizations.
In this graph the black line is the probability distribution for the test statistic, the critical region is the set of values to the right of the observed data point (the observed value of the test statistic), and the p-value is represented by the green area.
The confounding variable problem: X and Y may be correlated, not because there is a causal relationship between them, but because both depend on a third variable Z. Z is called a confounding factor.
gretl, an example of an open source statistical package

These inferences may take the form of answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation), describing associations within the data (correlation), and modeling relationships within the data (for example, using regression analysis).

Causality

Influence by which one event, process, state, or object (a cause) contributes to the production of another event, process, state, or object (an effect), where the cause is partly responsible for the effect, and the effect is partly dependent on the cause.

Why-Because Graph of the capsizing of the Herald of Free Enterprise (click to see in detail).
Whereas a mediator is a factor in the causal chain (1), a confounder is a spurious factor incorrectly suggesting causation (2).
Used in management and engineering, an Ishikawa diagram shows the factors that cause the effect. Smaller arrows connect the sub-causes to major causes.

Alternative methods of structure learning search through the many possible causal structures among the variables, and remove ones which are strongly incompatible with the observed correlations.

Covariance

Measure of the joint variability of two random variables.

The sign of the covariance of two random variables X and Y
Geometric interpretation of the covariance example. Each cuboid is the bounding box of its point (x, y, f(x, y)) and the X and Y means (magenta point). The covariance is the sum of the volumes of the red cuboids minus the volumes of the blue cuboids.

The sign of the covariance therefore shows the tendency in the linear relationship between the variables.

Francis Galton

English Victorian era polymath: a statistician, sociologist, psychologist, anthropologist, tropical explorer, geographer, inventor, meteorologist, proto-geneticist, psychometrician and a proponent of social Darwinism, eugenics, and scientific racism.

Sir Francis Galton by Charles Wellington Furse, given to the National Portrait Gallery, London in 1954
Portrait of Galton by Octavius Oakley, 1840
Galton in the 1850s
Galton in his later years
Sir Francis Galton, 1890s
Galton's 1889 illustration of the quincunx or Galton board.
Galton's correlation diagram 1886
Francis Galton (right), aged 87, on the stoep at Fox Holm, Cobham, with the statistician Karl Pearson.
Louisa Jane Butler

He also created the statistical concept of correlation and widely promoted regression toward the mean.

Karl Pearson

English mathematician and biostatistician.

Pearson in 1912
Pearson with Sir Francis Galton, 1909 or 1910.
Karl Pearson at work, 1910.

These techniques, which are widely used today for statistical analysis, include the chi-squared test, standard deviation, and correlation and regression coefficients.
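Of the techniques listed, the chi-squared test statistic is simple enough to compute directly. A hypothetical goodness-of-fit example (counts invented for illustration): the statistic sums (observed − expected)² / expected over the categories, and is then compared against a chi-squared critical value for k − 1 degrees of freedom.

```python
# Hypothetical observed counts from 120 rolls of a die,
# against the expected uniform count of 20 per face.
observed = [18, 22, 16, 25, 20, 19]
expected = [20] * 6

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)  # 2.5: well below the df=5 critical value, so no evidence of bias
```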

Correlation does not imply causation

Dinosaur illiteracy and extinction may be correlated, but this would not mean the variables had a causal relationship.

The phrase "correlation does not imply causation" refers to the inability to legitimately deduce a cause-and-effect relationship between two events or variables solely on the basis of an observed association or correlation between them.

Correlation coefficient

A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables.

Multivariate normal distribution

Generalization of the one-dimensional normal distribution to higher dimensions.

Bivariate normal joint density
Left: classification of seven multivariate normal classes. Coloured ellipses are 1-standard-deviation error ellipses; black marks the boundaries between the classification regions, computed by the numerical method of ray-tracing.

The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.

Odds ratio

A graph showing how the log odds ratio relates to the underlying probabilities of the outcome X occurring in two groups, denoted A and B. The log odds ratio shown here is based on the odds for the event occurring in group B relative to the odds for the event occurring in group A. Thus, when the probability of X occurring in group B is greater than the probability of X occurring in group A, the odds ratio is greater than 1, and the log odds ratio is greater than 0.
A graph showing the minimum value of the sample log odds ratio statistic that must be observed to be deemed significant at the 0.05 level, for a given sample size. The three lines correspond to different settings of the marginal probabilities in the 2×2 contingency table (the row and column marginal probabilities are equal in this graph).
Risk Ratio vs Odds Ratio

An odds ratio (OR) is a statistic that quantifies the strength of the association between two events, A and B. The odds ratio is defined as the ratio of the odds of A in the presence of B and the odds of A in the absence of B, or equivalently (due to symmetry), the ratio of the odds of B in the presence of A and the odds of B in the absence of A. Two events are independent if and only if the OR equals 1, i.e., the odds of one event are the same in either the presence or absence of the other event.
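The definition translates directly into a few lines of arithmetic; the 2×2 table below is hypothetical:

```python
import math

# Hypothetical 2x2 table: (outcome, no outcome) counts in each group
exposed = (20, 80)
unexposed = (10, 90)

odds_exposed = exposed[0] / exposed[1]        # 20/80 = 0.25
odds_unexposed = unexposed[0] / unexposed[1]  # 10/90 ~ 0.111

odds_ratio = odds_exposed / odds_unexposed
print(round(odds_ratio, 2))            # 2.25: association present (OR != 1)
print(round(math.log(odds_ratio), 2))  # log odds ratio > 0
```

By the symmetry noted above, computing the odds of exposure given the outcome (20/10 vs. 80/90) yields the same ratio.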