Empirical risk minimization

empirical risk, empirical risk functional, minimize empirical risk
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance.
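As an illustration of the principle, the following minimal Python sketch (the data and the tiny hypothesis class are made up for the example) picks, from a finite set of candidate predictors, the one with the lowest average loss on a training sample.

import numpy as np

rng = np.random.default_rng(0)

# Toy training sample: y is roughly 2*x plus noise.
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)

# A small finite hypothesis class: linear predictors h_a(x) = a * x.
candidate_slopes = np.linspace(-3, 3, 61)

def empirical_risk(a):
    # Average squared loss of h_a on the training sample.
    return np.mean((a * x - y) ** 2)

# ERM: choose the hypothesis with the smallest empirical risk.
best_a = min(candidate_slopes, key=empirical_risk)
print(f"ERM choice: slope {best_a:.2f}")
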
39 Related Articles

Supervised learning

supervised, supervised machine learning, supervised classification
Consider the following situation, which is a general setting of many supervised learning problems.
There are two basic approaches to choosing f or g: empirical risk minimization and structural risk minimization.
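The difference can be illustrated with a small, entirely hypothetical sketch: plain ERM picks whichever candidate model has the lowest training error, while structural risk minimization adds a complexity penalty (here an arbitrary penalty proportional to polynomial degree, chosen only for illustration) before choosing.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=30)

def training_error(degree):
    # Empirical risk (mean squared error) of a degree-d polynomial fit.
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

degrees = range(1, 15)
erm_choice = min(degrees, key=training_error)                           # ERM: training fit alone
srm_choice = min(degrees, key=lambda d: training_error(d) + 0.01 * d)   # SRM: fit plus complexity penalty
print(erm_choice, srm_choice)
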

Statistical learning theory

Learning theory (statistics), statistical machine learning
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance.
The learning algorithm that chooses the function which minimizes the empirical risk is called empirical risk minimization.
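Written out explicitly (following the usual presentation, with L the loss function, P(x, y) the underlying joint distribution and \mathcal H the hypothesis class), the true risk, the empirical risk and the ERM rule are:

R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y) \, dP(x, y)

R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i)

\hat{h} = \underset{h \in \mathcal{H}}{\arg\min} \, R_{\mathrm{emp}}(h)
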

Support-vector machine

support vector machine, support vector machines, SVM
In practice, machine learning algorithms cope with this either by employing a convex approximation to the 0-1 loss function (such as the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P(x, y) (thus ceasing to be agnostic learning algorithms, to which the above result applies).
The soft-margin support vector machine is an example of an empirical risk minimization (ERM) algorithm for the hinge loss.
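A minimal sketch of that connection, assuming a linear model, an arbitrary regularization weight lam, and plain subgradient descent rather than any particular SVM solver:

import numpy as np

rng = np.random.default_rng(2)
# Toy binary classification data with labels in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200) > 0, 1.0, -1.0)

# Regularized ERM objective: (1/n) * sum(max(0, 1 - y_i*(w.x_i + b))) + (lam/2) * ||w||^2
w, b, lam, lr = np.zeros(2), 0.0, 0.01, 0.1
for _ in range(500):
    margins = y * (X @ w + b)
    active = margins < 1                       # examples where the hinge loss is nonzero
    grad_w = -(y[active][:, None] * X[active]).sum(axis=0) / len(y) + lam * w
    grad_b = -y[active].sum() / len(y)
    w -= lr * grad_w                           # subgradient step on the objective above
    b -= lr * grad_b

print(f"training accuracy: {np.mean(np.sign(X @ w + b) == y):.2f}")
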

Machine learning

machine-learning, learning, statistical learning
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance.

Joint probability distribution

joint distribution, joint probability, multivariate distribution
To put it more formally, we assume that there is a joint probability distribution P(x, y) over X and Y, and that the training set consists of n instances drawn i.i.d. from P(x, y).
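For instance, a joint distribution P(x, y) in which y is not a deterministic function of x can be simulated as below (a made-up generative model: x uniform, y drawn from a Bernoulli conditional distribution P(y | x)); the training set is then n independent draws from it.

import numpy as np

rng = np.random.default_rng(3)

def draw_sample(n):
    # x ~ Uniform(-1, 1); given x, y is 1 with probability sigmoid(4x), else 0.
    x = rng.uniform(-1, 1, size=n)
    p_y_given_x = 1.0 / (1.0 + np.exp(-4 * x))
    y = rng.binomial(1, p_y_given_x)
    return x, y

# n i.i.d. instances from the joint distribution P(x, y).
x_train, y_train = draw_sample(1000)
print(x_train[:3], y_train[:3])
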

Independent and identically distributed random variables

independent and identically distributed, i.i.d., iid
To put it more formally, we assume that there is a joint probability distribution P(x, y) over X and Y, and that the training set consists of n instances drawn i.i.d. from P(x, y).

Random variable

random variables, random variation, random
Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because y is not a deterministic function of x, but rather a random variable with conditional distribution P(y | x) for a fixed x.

Conditional probability distribution

conditional distribution, conditional density, conditional
Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because y is not a deterministic function of x, but rather a random variable with conditional distribution P(y | x) for a fixed x.

Statistical risk

risk, Risk (statistics), risky
The risk associated with hypothesis h(x) is then defined as the expectation of the loss function: R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y) \, dP(x, y).

Expected value

expectation, expected, mean
The risk associated with hypothesis h(x) is then defined as the expectation of the loss function: R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y) \, dP(x, y).
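Since the true risk is an expectation, the empirical risk is simply its sample-mean (Monte Carlo) estimate; the toy sketch below (distribution and hypothesis chosen arbitrarily for the example) shows the estimate settling toward the expectation as the sample grows.

import numpy as np

rng = np.random.default_rng(4)

def h(x):
    # A fixed hypothesis: predict y = 0.5 * x.
    return 0.5 * x

def loss(y_pred, y):
    return (y_pred - y) ** 2

for n in (10, 100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    y = x + rng.normal(scale=0.5, size=n)      # draws from the joint distribution P(x, y)
    emp_risk = np.mean(loss(h(x), y))          # empirical risk on n i.i.d. draws
    print(n, round(emp_risk, 3))
# True risk here: E[(0.5x - y)^2] = 0.25*Var(x) + Var(noise) = 0.25 + 0.25 = 0.5.
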

Loss function

objective function, cost function, risk function
We also assume that we are given a non-negative real-valued loss function L(\hat{y}, y) which measures how different the prediction \hat{y} of a hypothesis is from the true outcome y. Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem even for such a relatively simple class of functions as linear classifiers.
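To see why the 0-1 loss is awkward to optimize directly, it helps to compare it with the hinge loss used as a convex surrogate; the small sketch below (arbitrary example margins) shows that the 0-1 loss is a non-convex step function of the margin, while the hinge loss is convex and upper-bounds it.

import numpy as np

# margin = y * score; a correct classification has a positive margin.
margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

zero_one = (margins <= 0).astype(float)      # 0-1 loss: 1 if misclassified (step function, non-convex)
hinge = np.maximum(0.0, 1.0 - margins)       # hinge loss: convex upper bound on the 0-1 loss

for m, z, hl in zip(margins, zero_one, hinge):
    print(f"margin {m:+.1f}:  0-1 loss {z:.0f}   hinge loss {hl:.1f}")
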

Mathematical optimization

optimization, mathematical programming, optimal
Thus the learning algorithm defined by the ERM principle consists of solving the above optimization problem.

NP-hardness

NP-hard, NP hard
Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem even for such a relatively simple class of functions as linear classifiers.

Linear classifier

linear, LDC, linear classification
Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem even for such a relatively simple class of functions as linear classifiers.

Linear separability

linearly separable, linear separator, linearly separated
It can, however, be solved efficiently when the minimal empirical risk is zero, i.e. the data are linearly separable.
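For linearly separable data this is easy to see in practice: the classical perceptron algorithm, sketched below on made-up separable data, terminates with a linear classifier whose empirical 0-1 risk is zero.

import numpy as np

rng = np.random.default_rng(5)
# Made-up linearly separable data: keep only points at least 0.1 away from
# the separating line x1 + x2 = 0.5, so a positive margin exists.
X = rng.uniform(-1, 1, size=(300, 2))
score = X[:, 0] + X[:, 1] - 0.5
X, score = X[np.abs(score) > 0.1], score[np.abs(score) > 0.1]
y = np.where(score > 0, 1.0, -1.0)

Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias feature
w = np.zeros(3)

# Perceptron: repeatedly correct a misclassified point until none remain.
while True:
    mistakes = np.where(y * (Xb @ w) <= 0)[0]
    if len(mistakes) == 0:
        break                                   # the empirical 0-1 risk is now zero
    w += y[mistakes[0]] * Xb[mistakes[0]]

print("empirical 0-1 risk:", np.mean(np.sign(Xb @ w) != y))
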

Convex optimization

convex minimization, convex, convex programming
In practice, machine learning algorithms cope with this either by employing a convex approximation to the 0-1 loss function (such as the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P(x, y) (thus ceasing to be agnostic learning algorithms, to which the above result applies).

Hinge loss

margin loss
In practice, machine learning algorithms cope with this either by employing a convex approximation to the 0-1 loss function (such as the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P(x, y) (thus ceasing to be agnostic learning algorithms, to which the above result applies).

Sample complexity

sample-complexity bounds
A learning algorithm over \mathcal H is a computable map from Z^* to \mathcal H. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from X to Y. Typical learning algorithms include empirical risk minimization, with or without Tikhonov regularization.
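A common concrete instance is ridge regression: empirical risk minimization with squared loss plus a Tikhonov (squared L2) penalty, which admits the closed form sketched below (the data and the regularization weight lam are made up for the example).

import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

lam = 0.1
# Minimizer of (1/n) * ||Xw - y||^2 + lam * ||w||^2:
#   w = (X^T X + n * lam * I)^{-1} X^T y
n, d = X.shape
w = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
print(np.round(w, 2))
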

Outline of machine learning

machine learning algorithms, learning algorithms, machine learning

Online machine learning

online learning, on-line learning, online
A common paradigm in this situation is to estimate a function \hat{f} through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization).

Structured sparsity regularization

Consider the linear kernel regularized empirical risk minimization problem with a loss function and the \ell_0 "norm" as the regularization penalty.
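In one standard form (notation assumed here: loss V, weight vector w, penalty weight \lambda), the problem is \min_{w} \frac{1}{n} \sum_{i=1}^{n} V(y_i, \langle w, x_i \rangle) + \lambda \|w\|_0. Because the \ell_0 penalty only counts nonzero coefficients, for a handful of features it can be solved exactly by enumerating support sets; the illustrative sketch below does this with squared loss (the data and penalty weight are made up).

import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, 0.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=60)

lam = 0.05
best = (np.inf, None)
d = X.shape[1]
# Enumerate every support set S and least-squares fit restricted to S.
for k in range(d + 1):
    for S in combinations(range(d), k):
        w = np.zeros(d)
        if S:
            w[list(S)] = np.linalg.lstsq(X[:, list(S)], y, rcond=None)[0]
        objective = np.mean((X @ w - y) ** 2) + lam * k   # empirical risk + lam * ||w||_0
        if objective < best[0]:
            best = (objective, w)

print("selected support:", np.nonzero(best[1])[0])
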

Stability (learning theory)

stability
It was shown that for large classes of learning algorithms, notably empirical risk minimization algorithms, certain types of stability ensure good generalization.

Stochastic gradient descent

AdaGrad, Adaptive Moment Estimation, gradient-based learning methods
The sum-minimization problem also arises for empirical risk minimization.
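Because the empirical risk is a sum (or average) of per-example losses, stochastic gradient descent can minimize it using one randomly chosen example per update, as in this small sketch (squared loss, made-up data, and an assumed constant step size).

import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=500)

w, lr = np.zeros(3), 0.01
for step in range(10_000):
    i = rng.integers(len(y))                     # pick one training example at random
    grad_i = 2 * (X[i] @ w - y[i]) * X[i]        # gradient of that example's squared loss
    w -= lr * grad_i                             # stochastic gradient step

print("empirical risk:", round(np.mean((X @ w - y) ** 2), 4))
print("w:", np.round(w, 2))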