# Empirical risk minimization

**empirical risk, empirical risk functional, minimize empirical risk**

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance.
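As a minimal sketch of the principle (the hypothesis class, loss, and toy data below are illustrative, not from the source): ERM evaluates each candidate hypothesis by its average loss on the training sample and selects the minimizer.

```python
# Minimal sketch of empirical risk minimization (ERM): given a finite
# hypothesis class, pick the hypothesis with the lowest average loss
# on the training sample. Hypotheses and data here are illustrative.

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 for a misclassification, 0 for a correct prediction."""
    return 0 if y_pred == y_true else 1

def empirical_risk(h, sample):
    """Average loss of hypothesis h over the training sample."""
    return sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)

def erm(hypotheses, sample):
    """Return the hypothesis that minimizes the empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

# Toy data: the label is the sign of x; thresholds form the hypothesis class.
sample = [(-2, -1), (-1, -1), (1, 1), (3, 1)]
hypotheses = [lambda x, t=t: 1 if x >= t else -1 for t in (-3, 0, 2)]

best = erm(hypotheses, sample)
print(empirical_risk(best, sample))  # the threshold t=0 fits perfectly: 0.0
```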

## Related articles

### Supervised learning

**supervised, supervised machine learning, supervised classification**

Consider the following situation, which is a general setting of many supervised learning problems.

There are two basic approaches to choosing f or g: empirical risk minimization and structural risk minimization.

### Statistical learning theory

**Learning theory (statistics), statistical machine learning**

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance.

The strategy of choosing the hypothesis that minimizes the empirical risk is called empirical risk minimization.

### Support-vector machine

**support vector machine, support vector machines, SVM**

In practice, machine learning algorithms cope with that either by employing a convex approximation to the 0-1 loss function (like hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P(x, y) (and thus stop being agnostic learning algorithms to which the above result applies).

The soft-margin support vector machine described above is an example of an empirical risk minimization (ERM) algorithm for the hinge loss.
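To make the surrogate-loss idea concrete, here is a small sketch (the scores and labels are illustrative): for a label y in {-1, +1} and a real-valued score s, the hinge loss max(0, 1 - ys) is a convex upper bound on the 0-1 loss 1[ys ≤ 0], which is what makes ERM with it tractable.

```python
# Sketch: hinge loss as a convex surrogate for the 0-1 loss.
# For a label y in {-1, +1} and a real-valued score s = f(x), the 0-1
# loss is 1 if y*s <= 0 and 0 otherwise, while the hinge loss
# max(0, 1 - y*s) upper-bounds it and is convex in s.

def zero_one(y, score):
    return 1.0 if y * score <= 0 else 0.0

def hinge(y, score):
    return max(0.0, 1.0 - y * score)

for y, s in [(1, 2.0), (1, 0.5), (-1, 0.3), (-1, -1.5)]:
    assert hinge(y, s) >= zero_one(y, s)  # hinge upper-bounds 0-1 loss
    print(y, s, zero_one(y, s), hinge(y, s))
```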

### Machine learning

**machine-learning, learning, statistical learning**

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance.

### Joint probability distribution

**joint distribution, joint probability, multivariate distribution**

To put it more formally, we assume that there is a joint probability distribution P(x, y) over X and Y, and that the training set consists of n instances drawn i.i.d. from P(x, y).

### Independent and identically distributed random variables

**independent and identically distributed, i.i.d., iid**

To put it more formally, we assume that there is a joint probability distribution P(x, y) over X and Y, and that the training set consists of n instances drawn i.i.d. from P(x, y).

### Random variable

**random variables, random variation, random**

Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because y is not a deterministic function of x, but rather a random variable with conditional distribution P(y | x) for a fixed x.

### Conditional probability distribution

**conditional distribution, conditional density, conditional**

Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in data) because y is not a deterministic function of x, but rather a random variable with conditional distribution P(y | x) for a fixed x.

### Statistical risk

**risk, Risk (statistics), risky**

The risk associated with hypothesis h(x) is then defined as the expectation of the loss function:
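The formula itself is missing here; in the standard notation of the surrounding text (loss L, joint distribution P(x, y)), the expectation presumably reads:

$$R(h) = \mathbb{E}_{(x,y)\sim P(x,y)}\big[L(h(x), y)\big] = \int L(h(x), y)\, dP(x, y).$$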

### Expected value

**expectation, expected, mean**

The risk associated with hypothesis h(x) is then defined as the expectation of the loss function.

### Loss function

**objective function, cost function, risk function**

We also assume that we are given a non-negative real-valued loss function which measures how different the prediction \hat{y} of a hypothesis is from the true outcome y. Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem even for such a relatively simple class of functions as linear classifiers.

### Mathematical optimization

**optimization, mathematical programming, optimal**

Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem.

### NP-hardness

**NP-hard, NP hard**

Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem even for such a relatively simple class of functions as linear classifiers.

### Linear classifier

**linear, LDC, linear classification**

Empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem even for such a relatively simple class of functions as linear classifiers.

### Linear separability

**linearly separable, linear separator, linearly separated**

It can, however, be solved efficiently when the minimal empirical risk is zero, i.e. when the data are linearly separable.
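As a sketch of the separable case (the toy data below are illustrative): when a zero-error linear classifier exists, simple procedures such as the perceptron find one in finitely many updates, so the minimal empirical risk of zero is attained efficiently.

```python
# Sketch: when the data are linearly separable (minimal empirical risk
# is zero), the perceptron finds a zero-error linear classifier in
# finitely many mistake-driven updates. Toy data below.

def perceptron(sample, epochs=100):
    """Learn weights w and bias b with sign(w.x + b) fitting the labels."""
    dim = len(sample[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in sample:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]  # mistake update
                b += y
                errors += 1
        if errors == 0:  # empirical risk reached zero
            break
    return w, b

# Linearly separable toy set: the label is the sign of the first coordinate.
sample = [((1.0, 0.5), 1), ((2.0, -1.0), 1), ((-1.0, 0.3), -1), ((-2.0, 1.0), -1)]
w, b = perceptron(sample)
errs = sum(1 for x, y in sample
           if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0)
print(errs)  # 0: the perceptron attains zero empirical risk
```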

### Convex optimization

**convex minimization, convex, convex programming**

In practice, machine learning algorithms cope with that either by employing a convex approximation to the 0-1 loss function (like hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P(x, y) (and thus stop being agnostic learning algorithms to which the above result applies).

### Hinge loss

**margin loss**

In practice, machine learning algorithms cope with that either by employing a convex approximation to the 0-1 loss function (like hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution P(x, y) (and thus stop being agnostic learning algorithms to which the above result applies).

### Maximum likelihood estimation

**maximum likelihood, maximum likelihood estimator, maximum likelihood estimate**

### M-estimator

**M-estimation, M-estimators, estimation**

### Sample complexity

**sample-complexity bounds**

A learning algorithm over \mathcal H is a computable map from Z^* to \mathcal H. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from X to Y. Typical learning algorithms include empirical risk minimization, with or without Tikhonov regularization.
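As a sketch of regularized ERM (the data and the regularization strength `lam` are illustrative): for the squared loss with Tikhonov (L2) regularization, the minimizer is ridge regression, which even has a closed form.

```python
# Sketch: ERM with Tikhonov (L2) regularization for the squared loss is
# ridge regression. Setting the gradient of
#   (1/n)||Xw - y||^2 + lam*||w||^2
# to zero gives the closed form w = (X^T X + lam*n*I)^(-1) X^T y.
import numpy as np

def ridge_erm(X, y, lam=0.1):
    """Minimize (1/n)||Xw - y||^2 + lam*||w||^2 over w."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)  # nearly noiseless toy data

w_hat = ridge_erm(X, y, lam=1e-4)
print(np.round(w_hat, 2))  # close to w_true = [1.0, -2.0, 0.5]
```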

### Outline of machine learning

**machine learning algorithms, learning algorithms, machine learning**

### Online machine learning

**online learning, on-line learning, online**

A common paradigm in this situation is to estimate a function \hat{f} through empirical risk minimization or regularized empirical risk minimization (usually Tikhonov regularization).

### Structured sparsity regularization

Consider the linear kernel regularized empirical risk minimization problem with a loss function and the \ell_0 "norm" as the regularization penalty:
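The problem statement itself is missing here; with a loss function L and regularization parameter \lambda, it presumably takes the standard form

$$\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \langle w, x_i\rangle\big) + \lambda \lVert w \rVert_0,$$

where \lVert w \rVert_0 counts the nonzero entries of w, which is why it is a "norm" only in quotes.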

### Stability (learning theory)

**stability**

It was shown that for large classes of learning algorithms, notably empirical risk minimization algorithms, certain types of stability ensure good generalization.

### Stochastic gradient descent

**AdaGrad, Adaptive Moment Estimation, gradient-based learning methods**

The sum-minimization problem also arises for empirical risk minimization.
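As a sketch of that connection (the 1-D model, data, and step size below are illustrative): the empirical risk is a finite sum of per-example losses, so SGD updates the parameter using the gradient of one randomly chosen term per step.

```python
# Sketch: empirical risk is a finite sum (1/n) * sum_i L_i(w), so SGD
# updates w with the gradient of one randomly chosen term per step.
# Here the per-example loss is squared error for a 1-D linear model.
import random

random.seed(0)
# Noise-free toy data generated from y = 3*x.
data = [(x / 10.0, 3.0 * (x / 10.0)) for x in range(-20, 21)]

w, lr = 0.0, 0.1
for step in range(2000):
    x, y = random.choice(data)          # sample one term of the sum
    grad = 2.0 * (w * x - y) * x        # d/dw of (w*x - y)^2
    w -= lr * grad                      # stochastic gradient step

print(round(w, 2))  # converges near the true slope 3.0
```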