Needleman–Wunsch algorithm

Needleman and Wunsch; Needleman-Wunsch; computer algorithm
EMBOSS Needle and EMBOSS Stretcher Global Alignment Tools. Needleman-Wunsch alignment for two nucleotide sequences. MathWorks - Globally align two sequences using Needleman-Wunsch algorithm. BitKeeper – Source Control Management Software. Smith–Waterman algorithm. Sequence mining. Levenshtein distance. Dynamic time warping. Sequence alignment. NW-align: A protein sequence-to-sequence alignment program by Needleman-Wunsch algorithm (online server and source code). Parallel Needleman-Wunsch Algorithm for Grid - Implementation by Tahir Naveed, Imitaz Saeed Siddiqui and Shaftab Ahmed - Bahria University. Needleman-Wunsch Algorithm as Haskell Code.

Most frequent k characters

Damerau–Levenshtein distance. diff. MinHash. Dynamic time warping. Euclidean distance. Fuzzy string searching. Hamming weight. Hirschberg's algorithm. Homology of sequences in genetics. Hunt–McIlroy algorithm. Jaccard index. Jaro–Winkler distance. Levenshtein distance. Longest common subsequence problem. Lucene (an open source search engine that implements edit distance). Manhattan distance. Metric space. Needleman–Wunsch algorithm. Optimal matching algorithm. Sequence alignment. Similarity space on Numerical taxonomy. Smith–Waterman algorithm. Sørensen similarity index. String distance metric. String similarity function. Wagner–Fischer algorithm. Locality-sensitive hashing.

Dynamic programming

dynamic; dynamic contracting problems; dynamic optimization
The Needleman–Wunsch algorithm and other algorithms used in bioinformatics, including sequence alignment, structural alignment, RNA structure prediction. Floyd's all-pairs shortest path algorithm. Optimizing the order for chain matrix multiplication. Pseudo-polynomial time algorithms for the subset sum, knapsack and partition problems. The dynamic time warping algorithm for computing the global distance between two time series. The Selinger (a.k.a. System R) algorithm for relational database query optimization. De Boor algorithm for evaluating B-spline curves. Duckworth–Lewis method for resolving the problem when games of cricket are interrupted.

List of algorithms

graph algorithms; common graph algorithms; Graph
Dynamic time warping: measure similarity between two sequences which may vary in time or speed. Hirschberg's algorithm: finds the least cost sequence alignment between two sequences, as measured by their Levenshtein distance. Needleman–Wunsch algorithm: find global alignment between two sequences. Smith–Waterman algorithm: find local sequence alignment. Exchange sorts. Bubble sort: for each pair of indices, swap the items if out of order. Cocktail shaker sort or bidirectional bubble sort, a bubble sort traversing the list alternately from front to back and back to front. Comb sort. Gnome sort. Odd–even sort.

Edit distance

string edit distance; edit distance cost; family of distance metrics
LCS distance is an upper bound on Levenshtein distance. For strings of the same length, Hamming distance is an upper bound on Levenshtein distance. Hirschberg's algorithm computes the optimal alignment of two strings, where optimality is defined as minimizing edit distance. Approximate string matching can be formulated in terms of edit distance.
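The Hamming bound above can be checked numerically. The following is a minimal sketch, assuming unit costs for insertion, deletion and substitution; the function names `levenshtein` and `hamming` are illustrative and not taken from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; defined only for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


# For equal-length strings the Levenshtein distance never exceeds Hamming.
print(levenshtein("flaw", "lawn"), hamming("flaw", "lawn"))
```

For "flaw" and "lawn" every position differs, so the Hamming distance is 4, while one deletion plus one insertion gives a Levenshtein distance of 2.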

Smith–Waterman algorithm

Smith and Waterman; Smith-Waterman; local alignment
To obtain the second-best local alignment, apply the traceback process starting at the second-highest score outside the trace of the best alignment. Substitution matrix: … Gap penalty: W_k = 2k (a linear gap penalty with W_1 = 2). Bioinformatics. Sequence alignment. Sequence mining. Needleman–Wunsch algorithm. Levenshtein distance. BLAST. FASTA. JAligner: an open source Java implementation of the Smith–Waterman algorithm. B.A.B.A.: an applet (with source) which visually explains the algorithm. FASTA/SSEARCH: services page at the EBI. UGENE Smith–Waterman plugin: an open source SSEARCH compatible implementation of the algorithm with graphical interface written in C++.
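A minimal sketch of the scoring pass with the linear gap penalty W_k = 2k follows. The substitution scores (+3 for a match, -3 for a mismatch) and the two example sequences are assumptions chosen for illustration, since the substitution matrix is not given in the text above. The function name `smith_waterman_score` is hypothetical, and traceback is omitted.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 3,      # assumed substitution score
                         mismatch: int = -3,  # assumed substitution score
                         gap: int = 2) -> int:
    """Best local alignment score under a linear gap penalty W_k = gap * k."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row of the score table is all zeros
    for ca in a:
        curr = [0]  # first column is all zeros
        for j, cb in enumerate(b, start=1):
            score = max(
                0,                                                # start a new local alignment
                prev[j - 1] + (match if ca == cb else mismatch),  # align ca with cb
                prev[j] - gap,                                    # gap in b
                curr[j - 1] - gap,                                # gap in a
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```

Because the penalty is linear, each additional gap position costs the same amount, so the simple three-way recurrence suffices. An affine penalty would need extra state per cell.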

Multiple sequence alignment

MSA; multiple alignment; alignment
Thus, the assumptions used to align protein sequences and DNA coding regions are inherently different from those that hold for TFBS sequences. Although it is meaningful to align DNA coding regions for homologous sequences using mutation operators, alignment of binding site sequences for the same transcription factor cannot rely on evolutionary related mutation operations. Similarly, the evolutionary operator of point mutations can be used to define an edit distance for coding sequences, but this has little meaning for TFBS sequences because any sequence variation has to maintain a certain level of specificity for the binding site to function.

Hidden Markov model

hidden Markov models; HMM; hidden Markov models (HMMs)
Alignment of bio-sequences. Time series analysis. Activity recognition. Protein folding. Sequence classification. Metamorphic virus detection. DNA motif discovery. Chromatin state discovery. Transportation forecasting. Solar irradiance variability. Andrey Markov. Baum–Welch algorithm. Bayesian inference. Bayesian programming. Conditional random field. Estimation theory. HHpred / HHsearch free server and software for protein sequence searching. HMMER, a free hidden Markov model program for protein sequence analysis. Hidden Bernoulli model. Hidden semi-Markov model. Hierarchical hidden Markov model. Layered hidden Markov model. Sequential dynamical system. Stochastic context-free grammar.

Sequence homology

ortholog; paralogs; paralog
Significant similarity is strong evidence that two sequences are related by evolutionary changes from a common ancestral sequence. Alignments of multiple sequences are used to indicate which regions of each sequence are homologous. The term "percent homology" is often used to mean "sequence similarity." The percentage of identical residues (percent identity) or the percentage of residues conserved with similar physicochemical properties (percent similarity), e.g. leucine and isoleucine, is usually used to "quantify the homology." Based on the definition of homology specified above, this terminology is incorrect: sequence similarity is the observation, whereas homology is the conclusion.
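Percent identity is the observable quantity here, and it can be computed directly from a pairwise alignment. The sketch below assumes the two sequences are already aligned to equal length with "-" as the gap character, and that columns containing a gap are excluded from the denominator. Other denominator conventions exist, so this choice is an assumption, as is the name `percent_identity`.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over the gap-free alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue (assumed convention).
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns)


print(percent_identity("ACGT", "ACGA"))
```

A high value from such a function supports, but does not by itself establish, the conclusion that the sequences are homologous.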

Speech recognition

voice recognition; automatic speech recognition; voice command
The loss function is usually the Levenshtein distance, though other distances can be used for specific tasks; the set of possible transcriptions is pruned to maintain tractability. Efficient algorithms have been devised to rescore lattices represented as weighted finite-state transducers, with the edit distances themselves represented as a finite-state transducer that satisfies certain assumptions. Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed.
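Dynamic time warping can be sketched as a small dynamic program. The version below assumes one-dimensional numeric sequences, absolute difference as the local cost, and no window constraint. The name `dtw_distance` is illustrative only.

```python
import math
from typing import Sequence


def dtw_distance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cumulative cost of the cheapest warping path between xs and ys."""
    n, m = len(xs), len(ys)
    # cost[i][j] = best cumulative cost of aligning xs[:i] with ys[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(xs[i - 1] - ys[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # xs advances, ys repeats its element
                cost[i][j - 1],      # ys advances, xs repeats its element
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same shape played at a different speed has zero warping cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because an element may be matched against several consecutive elements of the other sequence, a stretched copy of a signal costs nothing, which is the sense in which the measure tolerates variation in speed.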

Hirschberg's algorithm

In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space efficient version of the Needleman–Wunsch algorithm that uses divide and conquer. Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.
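The space saving rests on the observation that the last row of the Needleman–Wunsch score table can be computed while keeping only two rows in memory. The sketch below shows that building block only, not the full divide-and-conquer recursion. The default scores and the name `nw_last_row` are assumptions made for illustration.

```python
from typing import List


def nw_last_row(a: str, b: str,
                match: int = 1,      # assumed score
                mismatch: int = -1,  # assumed score
                gap: int = -1) -> List[int]:
    """Last row of the global alignment score table, in O(len(b)) space."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" with prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] with ""
        for j, cb in enumerate(b, start=1):
            curr.append(max(
                prev[j - 1] + (match if ca == cb else mismatch),  # align ca with cb
                prev[j] + gap,                                    # gap in b
                curr[j - 1] + gap,                                # gap in a
            ))
        prev = curr
    return prev


# With match=0 and unit penalties, the negated score is the Levenshtein distance.
print(-nw_last_row("kitten", "sitting", match=0, mismatch=-1, gap=-1)[-1])
```

Hirschberg's algorithm evaluates such rows for the first half of one string and the reversed second half, uses them to find a split point in the other string, and recurses on the two halves.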

Jaro–Winkler distance

Jaro-Winkler
There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, where each operation is assigned a cost (possibly infinite).
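The role of per-operation costs can be illustrated with a small sketch in which insertion, deletion and substitution costs are parameters. Setting the substitution cost to infinity forbids that operation, leaving only insertions and deletions. The function name `edit_distance` and its parameter names are hypothetical.

```python
import math


def edit_distance(a: str, b: str,
                  ins: float = 1.0,
                  dele: float = 1.0,
                  sub: float = 1.0) -> float:
    """Edit distance with configurable (possibly infinite) operation costs."""
    prev = [j * ins for j in range(len(b) + 1)]  # build prefixes of b from ""
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]  # delete all of a[:i]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] if ca == cb else prev[j - 1] + sub
            curr.append(min(prev[j] + dele,     # delete ca
                            curr[j - 1] + ins,  # insert cb
                            diag))              # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))                # unit costs
print(edit_distance("kitten", "sitting", sub=math.inf))  # substitution disallowed
```

With unit costs this is the Levenshtein distance (3 for this pair). With substitution disallowed, each replacement must be expressed as a deletion plus an insertion, so the result rises to 5.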

Matrix (mathematics)

matrix; matrices; matrix theory
In mathematics, a matrix (plural: matrices) is a rectangular array of numbers, symbols, or expressions, arranged in rows and columns. For example, the dimensions of the matrix below are 2 × 3 (read "two by three"), because there are two rows and three columns:
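As an illustration of the rows-by-columns convention, here is a small sketch that stores a matrix as a list of rows and reports its dimensions. The entries are arbitrary values chosen for this example, and the helper name `dimensions` is hypothetical.

```python
from typing import List, Tuple

# An illustrative 2 × 3 matrix: two rows, three columns (entries are arbitrary).
matrix: List[List[int]] = [
    [1, 2, 3],
    [4, 5, 6],
]


def dimensions(m: List[List[int]]) -> Tuple[int, int]:
    """Return (rows, columns), checking that the array is rectangular."""
    rows = len(m)
    cols = len(m[0]) if m else 0
    if any(len(row) != cols for row in m):
        raise ValueError("all rows must have the same number of columns")
    return rows, cols


print(dimensions(matrix))
```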

Triangle inequality

triangular inequality; reverse triangle inequality
Figure: three examples of the triangle inequality for triangles with sides of lengths …

Linguistics

linguist; linguistic; linguists
Linguistics is the scientific study of language, and it involves an analysis of language form, language meaning, and language in context. The earliest activities in the documentation and description of language have been attributed to the 6th century BC Indian grammarian Pāṇini, who wrote a formal description of the Sanskrit language in his Aṣṭādhyāyī.

Social sequence analysis

Andrew Abbott argued that sequence alignment methods in biology and information theory and computer science provided useful models. Both fields had developed combinations of sequence alignment operations to facilitate the comparison of whole sequences. Social scientists adapted these methods in the form of optimal matching (OM) analysis, often in conjunction with cluster analysis techniques to aid in the identification of common sequence pattern classes.

Bioinformatics

bioinformatic; bioinformatician; genome browser
Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein–protein interactions, genome-wide association studies, the modeling of evolution and cell division/mitosis. Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.

String metric

similarity; string metrics; distance
The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance). It operates on two input strings, returning the minimum number of single-character insertions, deletions and substitutions needed to transform one string into the other. Beyond simple metrics such as Levenshtein distance, string metrics have expanded to include phonetic, token, grammatical and character-based methods of statistical comparison.

Segregating site

non-conservative replacements; non-conservative mutations; variable
Sequence alignment. Sequence alignment software. ClustalW.

Vladimir Levenshtein

Levenshtein, Vladimir; Vladimir I. Levenshtein
Received the Hamming Medal in 2006, for "contributions to the theory of error-correcting codes and information theory, including the Levenshtein distance". Association scheme. Bose–Mesner algebra. Levenshtein automaton. Levenshtein coding. Levenshtein's personal webpage (in Russian). March 2003 pictures of Levenshtein at a professional reception. Another (better) picture from the same source.

Consensus sequence

consensus; consensus sequences; canonical sequence
In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. It represents the results of multiple sequence alignments in which related sequences are compared to each other and similar sequence motifs are calculated. Such information is important when considering sequence-dependent enzymes such as RNA polymerase. A protein binding site, represented by a consensus sequence, may be a short sequence of nucleotides which is found several times in the genome and is thought to play the same role in its different locations.
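The "most frequent residue at each position" rule can be sketched as follows. The input is assumed to be a list of already-aligned sequences of equal length, and ties are broken alphabetically, which is an arbitrary choice made only for this example. The name `consensus` is illustrative.

```python
from collections import Counter
from typing import List


def consensus(aligned: List[str]) -> str:
    """Most frequent residue in each column of an alignment."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have the same length")
    result = []
    for column in zip(*aligned):  # iterate over alignment columns
        counts = Counter(column)
        top = max(counts.values())
        # Tie-break alphabetically (assumption; real tools use other rules).
        result.append(min(r for r, c in counts.items() if c == top))
    return "".join(result)


print(consensus(["ACGT", "ACGA", "TCGT"]))
```

For the three sequences shown, each column has a clear majority residue and the consensus is ACGT.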

SAM (file format)

Sequence Alignment Map; SAM; SAM format
Sequence Alignment Map (SAM) is a text-based format for storing biological sequences aligned to a reference sequence, developed by Heng Li and Bob Handsaker et al. It is widely used for storing data, such as nucleotide sequences, generated by next generation sequencing technologies. The format supports short and long reads (up to 128 Mbp) produced by different sequencing platforms and is used to hold mapped data within the Genome Analysis Toolkit (GATK) and across the Broad Institute, the Wellcome Sanger Institute, and throughout the 1000 Genomes Project. The Sequence Alignment/Map (SAM) format is used for alignments of nucleotide sequences (e.g. sequencing reads) to one or more reference sequences.

Biopython

Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. Biopython is one of a number of Bio* projects designed to reduce code duplication in computational biology. Biopython development began in 1999 and it was first released in July 2000. It was developed during a similar time frame and with analogous goals to other projects that added bioinformatics capabilities to their respective programming languages, including BioPerl, BioRuby and BioJava. Early developers on the project included Jeff Chang, Andrew Dalke and Brad Chapman, though over 100 people have made contributions to date.

Mutual intelligibility

mutually intelligible; mutually unintelligible; intelligible
The Levenshtein distance between written Dutch and German is 50.4% as opposed to 61.7% between English and Dutch. The spoken languages are much more difficult to understand for both, with studies showing Dutch speakers having slightly less difficulty in understanding German speakers than vice versa, though it remains unclear whether this asymmetry has to do with prior knowledge of the language (Dutch people being more exposed to German than vice versa), better knowledge of another related language (English) or any other non-linguistic reasons.

FASTA format

FASTA; fasta sequences
Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence.
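The rules above can be illustrated with a minimal reader. The sketch assumes the usual convention that each record starts with a ">" description line (not stated in the passage above), maps lower case to upper case, keeps "-" and "*", and silently drops digits and whitespace from sequence lines. Dropping digits is a design assumption, and the function name `parse_fasta` is hypothetical.

```python
from typing import Dict, List, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each description line (without '>') to its cleaned sequence."""
    records: Dict[str, List[str]] = {}
    header: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Drop position digits and spaces (assumption), keep gaps and '*'.
            cleaned = "".join(ch for ch in line
                              if not ch.isdigit() and not ch.isspace())
            records[header].append(cleaned.upper())  # lower case maps to upper
    return {name: "".join(parts) for name, parts in records.items()}


example = ">seq1 demo\nacgt-AC\n10 ggta\n>seq2\nMKU*\n"
print(parse_fasta(example))
```

A production parser would also validate residues against the IUB/IUPAC codes and handle duplicate description lines, both of which this sketch leaves out.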