# Cluster analysis

**clustering, data clustering, cluster, clusters, clustering algorithm, clustering algorithms, clustered, clustering analyses, data cluster, grouping**

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

### Machine learning

**machine-learning, learning, statistical learning**

It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

Unsupervised learning algorithms are used to find structure in the data, like grouping or clustering of data points.

### Data mining

**data-mining, datamining, knowledge discovery in databases**

The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining).

### Statistical classification

**classification, classifier, classifiers**

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς "grape"), typological analysis, and community detection.

The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.

### DBSCAN

**Density-based spatial clustering of applications with noise**

The most popular density-based clustering method is DBSCAN.

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
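The density-based idea can be illustrated with a minimal pure-Python sketch. It assumes the usual formulation with a neighborhood radius `eps` and a minimum neighbor count `min_pts`; the function name, the brute-force neighbor search, and the sample data are illustrative choices, not taken from the original paper.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Brute-force search: indices of all points within eps of point i (i included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is itself a core point, so expand through it
                queue.extend(nbrs)
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this example the two dense groups each form a cluster, and the isolated point at (50, 50) is labelled as noise.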

### Pattern recognition

**pattern analysis, pattern detection, patterns**

The unsupervised equivalent of classification is normally known as clustering, based on the common perception of the task as involving no training data to speak of, and of grouping the input data into clusters based on some inherent similarity measure (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space), rather than assigning each input instance to one of a set of pre-defined classes.

### Bioinformatics

**bioinformatic, bioinformatician, bio-informatics**

In gene expression analysis, one can apply clustering algorithms to the expression data to determine which genes are co-expressed.

### Biclustering

**Co-clustering**

Biclustering, co-clustering, or two-mode clustering is a data mining technique that allows simultaneous clustering of the rows and columns of a matrix.

### HCS clustering algorithm

The HCS (Highly Connected Subgraphs) clustering algorithm (also known as the HCS algorithm, and by other names such as Highly Connected Clusters/Components/Kernels) is an algorithm based on graph connectivity for cluster analysis.

### K-means clustering

**k-means, k-means clustering, k-means algorithm**

When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign each object to the nearest cluster center, such that the sum of squared distances from the objects to their assigned centers is minimized.

The need to process ever larger data sets led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently. The resulting "clusters", however, are merely a rough pre-partitioning of the data set, whose partitions are then analyzed with existing slower methods such as k-means clustering.

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
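The optimization problem is commonly approached with an alternating heuristic (often called Lloyd's algorithm). The sketch below is a minimal pure-Python version under stated assumptions: the initial centers are passed in explicitly, which is an illustrative simplification, and the function name and sample data are hypothetical.

```python
import math

def kmeans(points, initial_centers, max_iter=100):
    """Alternate assignment and update steps until the assignments stop changing."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)
```

The result depends on the initial centers, which is why practical implementations typically run several initializations and keep the best one.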

### Hierarchical clustering

**agglomerative hierarchical clustering, hierarchical cluster analysis, divisive clustering**

Connectivity-based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. OPTICS is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter ε, and produces a hierarchical result related to that of linkage clustering.
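As an illustration of the connectivity idea, here is a minimal sketch of agglomerative clustering with single linkage, where the distance between two clusters is the smallest distance between any pair of their members. The function name, the stopping rule (a target number of clusters), and the sample data are assumptions made for this example.

```python
import math

def single_linkage(points, num_clusters):
    """Merge the two closest clusters repeatedly until num_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def linkage(a, b):
        # Single linkage: smallest pairwise distance between the two clusters.
        return min(math.dist(points[i], points[j])
                   for i in clusters[a] for j in clusters[b])

    while len(clusters) > num_clusters:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: linkage(*ab))
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return clusters

points = [(0.0,), (0.5,), (1.1,), (5.0,), (5.4,), (9.0,)]
clusters = single_linkage(points, num_clusters=3)
print(clusters)
```

Recording the merge order and merge distances instead of stopping at a fixed count would give the full hierarchy (dendrogram).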

### Unsupervised learning

**unsupervised, unsupervised classification, unsupervised machine learning**

Two of the main methods used in unsupervised learning are principal component analysis and cluster analysis.

### Numerical taxonomy

**numerical taxonomists, taxonometric**

Numerical taxonomy aims to create a taxonomy using numeric algorithms such as cluster analysis rather than subjective evaluation of the properties of the units being classified.

### Anomaly detection

**outlier detection, anomalies, detecting**

Whether evaluating a clustering against known class labels is adequate for real data, or only for synthetic data sets with a factual ground truth, has been questioned, since classes can contain internal structure, the attributes present may not allow separation of clusters, or the classes may contain anomalies.

Instead, a cluster analysis algorithm may be able to detect the micro clusters formed by these patterns.

### K-medoids

**k-medoids, k-medoid, K-medoids clustering (PAM)**

Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means).

The k-medoids or partitioning around medoids (PAM) algorithm is a clustering algorithm reminiscent of the k-means algorithm.
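The difference from k-means, namely that every center must be a member of the data set, can be sketched as follows. Note that this is a simple alternating scheme (assign, then re-pick each medoid), not the swap-based PAM procedure itself; the function name, the explicit initial medoids, and the sample data are assumptions for illustration.

```python
import math

def k_medoids(points, initial_medoids, max_iter=100):
    """Alternating k-medoids; medoids are indices into points."""
    def assign(medoids):
        # Each point is attached to its nearest medoid.
        return [min(range(len(medoids)),
                    key=lambda c: math.dist(p, points[medoids[c]]))
                for p in points]

    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(len(medoids)):
            members = [i for i, l in enumerate(labels) if l == c]
            if not members:  # keep the old medoid if the cluster went empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(min(
                members,
                key=lambda i: sum(math.dist(points[i], points[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)

points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids, labels = k_medoids(points, initial_medoids=[1, 4])
print(medoids, labels)
```

Because only pairwise distances are used, the same sketch works with any dissimilarity function substituted for `math.dist`.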

### Determining the number of clusters in a data set

**appropriate number of clusters in a dataset, Choose a number of clusters, choosing the number of clusters**

Most k-means-type algorithms require the number of clusters – k – to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms.

Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.

### OPTICS algorithm

**OPTICS**

OPTICS is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter ε, and produces a hierarchical result related to that of linkage clustering.

Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data.

### K-medians clustering

**k-medians, k-medians clustering, k-median problem**

In statistics and data mining, k-medians clustering is a cluster analysis algorithm.
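A minimal sketch of the idea, assuming the common formulation in which centers are coordinate-wise medians and points are assigned by Manhattan distance. The function name, the explicit initial centers, and the one-dimensional sample data are illustrative assumptions.

```python
from statistics import median

def k_medians(points, initial_centers, max_iter=100):
    """Like k-means, but with Manhattan distance and per-coordinate medians."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center under the Manhattan (L1) distance.
        new_labels = [
            min(range(len(centers)),
                key=lambda c: sum(abs(a - b) for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: take the median of each coordinate over the members.
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(median(col) for col in zip(*members))
    return centers, labels

points = [(1.0,), (2.0,), (3.0,), (40.0,), (100.0,), (101.0,), (102.0,)]
centers, labels = k_medians(points, initial_centers=[(1.0,), (100.0,)])
print(centers, labels)
```

In this example the stray value 40.0 joins the first cluster, yet that cluster's center stays at 2.5; a mean-based center would have been pulled to 11.5.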

### Fuzzy clustering

**Fuzzy C-Means Clustering, fuzzy, fuzzy c-means**

Clustering or cluster analysis involves assigning data points to clusters such that items in the same cluster are as similar as possible, while items belonging to different clusters are as dissimilar as possible.
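In fuzzy c-means, each point receives a degree of membership in every cluster rather than a single label. The sketch below shows only the membership computation for one point given fixed centers, using the standard update with fuzzifier `m`; the function name and the handling of a point that coincides with a center are assumptions for illustration.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Membership of one point in each cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:
        # Assumed convention: a point lying on a center belongs fully to it
        # (shared equally if several centers coincide with the point).
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1))
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

u = fuzzy_memberships((1.0, 0.0), centers=[(0.0, 0.0), (4.0, 0.0)])
print(u)
```

With distances 1 and 3 and `m = 2`, the memberships come out close to 0.9 and 0.1. A full algorithm would alternate this step with recomputing the centers as membership-weighted means.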

### Clustering high-dimensional data

**subspace clustering, clustering, high-dimensional data**

The problems that distance functions face in high-dimensional spaces led to new clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering, which also looks for arbitrarily rotated ("correlated") subspace clusters that can be modeled by giving a correlation of their attributes.

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions.

### Canopy clustering algorithm

**canopy clustering**

The need to process ever larger data sets led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently. The resulting "clusters", however, are merely a rough pre-partitioning of the data set, whose partitions are then analyzed with existing slower methods such as k-means clustering.

The canopy clustering algorithm is an unsupervised pre-clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000.
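The pre-partitioning step can be sketched with two distance thresholds, a loose one (`t1`) and a tight one (`t2 < t1`), as the algorithm is usually described. The function name, the choice of always taking the first remaining point as the next canopy center, and the sample data are assumptions for illustration.

```python
import math

def canopies(points, t1, t2):
    """Return a list of (center, members) pairs; canopies may overlap."""
    if not t1 > t2:
        raise ValueError("the loose threshold t1 must exceed the tight threshold t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)  # assumed rule: take the first remaining point
        # Every remaining point within the loose distance joins this canopy.
        members = [center] + [p for p in remaining if math.dist(center, p) < t1]
        result.append((center, members))
        # Points within the tight distance can no longer start or join a canopy.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
    return result

points = [(0.0,), (1.0,), (2.5,), (10.0,), (11.0,)]
found = canopies(points, t1=3.0, t2=1.5)
for center, members in found:
    print(center, members)
```

Here the point at 2.5 falls inside the first canopy but outside its tight radius, so it also becomes the center of a second canopy. A slower method such as k-means can then restrict its distance computations to points that share a canopy.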

### BIRCH

**BIRCH algorithm**

Efforts to improve the performance of existing clustering algorithms include CLARANS (Ng and Han, 1994) and BIRCH (Zhang et al., 1996).

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data sets.

### Davies–Bouldin index

The Davies–Bouldin index (DBI) (introduced by David L. Davies and Donald W. Bouldin in 1979) is a metric for evaluating clustering algorithms.
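A minimal sketch of how the index can be computed, assuming the usual Euclidean formulation: the scatter of a cluster is the mean distance of its members to the centroid, and each cluster contributes its worst ratio of summed scatters to centroid separation. Lower values indicate a better clustering. The function name and sample data are illustrative.

```python
import math

def davies_bouldin(points, labels):
    """Mean over clusters of max_j (S_i + S_j) / d(centroid_i, centroid_j)."""
    ids = sorted(set(labels))
    centroids, scatter = {}, {}
    for c in ids:
        members = [p for p, l in zip(points, labels) if l == c]
        centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Scatter S_c: average distance from the members to their centroid.
        scatter[c] = sum(math.dist(p, centroids[c]) for p in members) / len(members)
    total = 0.0
    for i in ids:
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in ids if j != i)
    return total / len(ids)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = davies_bouldin(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = davies_bouldin(points, [0, 1, 0, 1])   # stretched, overlapping clusters
print(good, bad)
```

The sketch needs at least two clusters with distinct centroids; otherwise the ratio is undefined.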

### Mean shift

**Mean-shift**

Mean-shift is a clustering approach where each object is moved to the densest area in its vicinity, based on kernel density estimation.

Application domains of mean shift include cluster analysis in computer vision and image processing.
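The "move each object to the densest area in its vicinity" step can be sketched with a flat (uniform) kernel, which is the simplest choice and an assumption of this example rather than the only option. Each point is repeatedly replaced by the mean of the data points within a `bandwidth` radius; the function name and sample data are hypothetical.

```python
import math

def mean_shift(points, bandwidth, iterations=50):
    """Shift a copy of every point uphill; points ending at the same mode share a cluster."""
    modes = [tuple(p) for p in points]
    for _ in range(iterations):
        shifted = []
        for m in modes:
            # Flat kernel: all data points within the bandwidth count equally.
            window = [p for p in points if math.dist(p, m) <= bandwidth]
            if not window:  # defensive guard against rounding at the window edge
                shifted.append(m)
                continue
            shifted.append(tuple(sum(col) / len(window) for col in zip(*window)))
        modes = shifted
    return modes

points = [(1.0,), (1.2,), (0.8,), (5.0,), (5.2,), (4.8,)]
modes = mean_shift(points, bandwidth=1.0)
print(modes)
```

In this example the first three points settle near 1.0 and the last three near 5.0, so grouping points by (approximately) equal modes yields two clusters without specifying their number in advance.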

### Dunn index

The Dunn index (DI) (introduced by J. C. Dunn in 1974) is a metric for evaluating clustering algorithms.
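Several variants of the index exist. The sketch below assumes one common choice: the smallest distance between two points in different clusters, divided by the largest distance between two points in the same cluster. Higher values indicate compact, well-separated clusters. The function name and sample data are illustrative.

```python
import math
from itertools import combinations

def dunn_index(points, labels):
    """Minimum between-cluster distance divided by maximum within-cluster diameter."""
    pairs = list(combinations(range(len(points)), 2))
    between = [math.dist(points[i], points[j]) for i, j in pairs if labels[i] != labels[j]]
    within = [math.dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]]
    return min(between) / max(within)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = dunn_index(points, [0, 0, 1, 1])
bad = dunn_index(points, [0, 1, 0, 1])
print(good, bad)
```

The sketch assumes at least two clusters and at least one cluster with two or more distinct points, since otherwise the numerator or denominator is undefined.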

### Silhouette (clustering)

**silhouette, Silhouette index, cluster silhouette**

Silhouette refers to a method of interpretation and validation of consistency within clusters of data.
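A minimal sketch of the per-point silhouette value, assuming Euclidean distance: with `a` the mean distance to the other members of the point's own cluster and `b` the smallest mean distance to any other cluster, the value is `(b - a) / max(a, b)`. Scoring singleton clusters (and the degenerate single-cluster case) as 0 is a convention adopted here; the function name and sample data are illustrative.

```python
import math

def silhouette_scores(points, labels):
    """One value in [-1, 1] per point; values near 1 mean the point fits its cluster well."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster, or no other cluster to compare with
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b))
    return scores

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
print(good, bad)
```

Averaging the values over all points gives a single score for the whole clustering, which is one way to compare results for different numbers of clusters.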