Needleman–Wunsch algorithm

Needleman-Wunsch algorithm, Needleman-Wunsch, Needleman and Wunsch
EMBOSS Needle and EMBOSS Stretcher Global Alignment Tools. Needleman-Wunsch alignment for two nucleotide sequences. MathWorks - Globally align two sequences using Needleman-Wunsch algorithm. BitKeeper – Source Control Management Software. Smith–Waterman algorithm. Sequence mining. Levenshtein distance. Dynamic time warping. Sequence alignment. NW-align: A protein sequence-to-sequence alignment program by Needleman-Wunsch algorithm (online server and source code). Parallel Needleman-Wunsch Algorithm for Grid - Implementation by Tahir Naveed, Imitaz Saeed Siddiqui and Shaftab Ahmed - Bahria University. Needleman-Wunsch Algorithm as Haskell Code.

Dynamic programming

dynamic, dynamic contracting problems, dynamic programming (DP)
The Needleman–Wunsch algorithm and other algorithms used in bioinformatics, including sequence alignment, structural alignment, RNA structure prediction. Floyd's all-pairs shortest path algorithm. Optimizing the order for chain matrix multiplication. Pseudo-polynomial time algorithms for the subset sum, knapsack and partition problems. The dynamic time warping algorithm for computing the global distance between two time series. The Selinger (a.k.a. System R) algorithm for relational database query optimization. De Boor algorithm for evaluating B-spline curves. Duckworth–Lewis method for resolving the problem when games of cricket are interrupted.

List of algorithms

graph algorithms, common graph algorithms, Graph
Dynamic time warping: measure similarity between two sequences which may vary in time or speed. Hirschberg's algorithm: finds the least cost sequence alignment between two sequences, as measured by their Levenshtein distance. Needleman–Wunsch algorithm: find global alignment between two sequences. Smith–Waterman algorithm: find local sequence alignment. Exchange sorts. Bubble sort: for each pair of indices, swap the items if out of order. Cocktail shaker sort or bidirectional bubble sort, a bubble sort traversing the list alternately from front to back and back to front. Comb sort. Gnome sort. Odd–even sort.

Edit distance

distance cost, family of distance metrics, Levenshtein algorithm
LCS distance is an upper bound on Levenshtein distance. For strings of the same length, Hamming distance is an upper bound on Levenshtein distance. Hirschberg's algorithm computes the optimal alignment of two strings, where optimality is defined as minimizing edit distance. Approximate string matching can be formulated in terms of edit distance.
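The Hamming bound above can be checked with a short sketch. The function names below are illustrative, not taken from any library, and the Levenshtein routine assumes unit cost for every insertion, deletion, and substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute), two rows at a time."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    for s, t in [("karolin", "kathrin"), ("abcd", "bcda")]:
        print(s, t, levenshtein(s, t), hamming(s, t))
```

For "abcd" and "bcda" the Hamming distance is 4 but the Levenshtein distance is only 2 (delete the leading "a", append an "a"), which shows the bound is not always tight.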

Smith–Waterman algorithm

Smith-Waterman algorithm, Smith-Waterman, Smith and Waterman
To obtain the second best local alignment, apply the traceback process starting at the second highest score outside the trace of the best alignment. Substitution matrix:. Gap penalty: W_k=2k (a linear gap penalty of W_1 = 2). Bioinformatics. Sequence alignment. Sequence mining. Needleman–Wunsch algorithm. Levenshtein distance. BLAST. FASTA. JAligner — an open source Java implementation of the Smith–Waterman algorithm. B.A.B.A. — an applet (with source) which visually explains the algorithm. FASTA/SSEARCH — services page at the EBI. UGENE Smith–Waterman plugin — an open source SSEARCH compatible implementation of the algorithm with graphical interface written in C++.
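A minimal sketch of the scoring and traceback steps with the linear gap penalty W_k = 2k mentioned above. The substitution scores (+3 for a match, -3 for a mismatch) are an assumption, since the matrix is missing from the snippet, and the input sequences are illustrative only.

```python
def smith_waterman(a, b, match=3, mismatch=-3, gap=2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    match/mismatch are assumed values; gap is the per-position cost of the
    linear penalty W_k = gap * k.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a fresh local alignment
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] - gap,    # gap in b
                          H[i][j - 1] - gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TGTTACGG", "GGTTGACTA"))
```

Finding a second-best alignment, as described above, would repeat the traceback from the next-highest cell that lies outside the first trace; that step is not included in this sketch.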

Multiple sequence alignment

MSA, multiple alignment, alignment
Thus, the assumptions used to align protein sequences and DNA coding regions are inherently different from those that hold for TFBS sequences. Although it is meaningful to align DNA coding regions for homologous sequences using mutation operators, alignment of binding site sequences for the same transcription factor cannot rely on evolutionary related mutation operations. Similarly, the evolutionary operator of point mutations can be used to define an edit distance for coding sequences, but this has little meaning for TFBS sequences because any sequence variation has to maintain a certain level of specificity for the binding site to function.

Speech recognition

voice recognition, automatic speech recognition, voice command
The loss function is usually the Levenshtein distance, though other distances can be used for specific tasks; the set of possible transcriptions is pruned to maintain tractability. Efficient algorithms have been devised to rescore lattices represented as weighted finite state transducers, with the edit distances themselves represented as a finite state transducer verifying certain assumptions. Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed.
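As a rough illustration of dynamic time warping, the sketch below computes the classic cumulative-cost table for two numeric sequences. The absolute difference as local cost and the absence of any window constraint are assumptions made for brevity.

```python
import math


def dtw_distance(x, y):
    """Dynamic time warping cost between two numeric sequences.

    Local cost is |x[i] - y[j]| (an assumed choice); each step may advance
    in x, in y, or in both, so stretches of one sequence can map onto a
    single point of the other.
    """
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # x advances, y repeats
                                 D[i][j - 1],      # y advances, x repeats
                                 D[i - 1][j - 1])  # both advance
    return D[n][m]


if __name__ == "__main__":
    # The second series is the first with one sample held longer.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because a repeated sample can be absorbed by the warping path, the two series in the usage example have zero cost even though their lengths differ.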

Hirschberg's algorithm

In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space efficient version of the Needleman–Wunsch algorithm that uses divide and conquer. Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.
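The space saving comes from keeping only one previous row of the Needleman–Wunsch score table, and the divide step picks the column where a forward pass over the first half and a backward pass over the second half score best together. The sketch below shows those two pieces only, not the full recursive alignment; the scores (+1 match, -1 mismatch, -1 gap) and the function names are assumptions.

```python
def nw_last_row(a, b, match=1, mismatch=-1, gap=-1):
    """Last row of the Needleman-Wunsch score table, using two rows of memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            s = match if ca == cb else mismatch
            curr.append(max(prev[j - 1] + s,   # align ca with cb
                            prev[j] + gap,     # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev


def best_split(a, b, **scores):
    """Divide step: split a in the middle and find where b should be cut.

    Returns (mid, j, total): a[:mid] aligns with b[:j], a[mid:] with b[j:],
    and total is the combined score of the two halves.
    """
    mid = len(a) // 2
    left = nw_last_row(a[:mid], b, **scores)
    right = nw_last_row(a[mid:][::-1], b[::-1], **scores)
    totals = [left[j] + right[len(b) - j] for j in range(len(b) + 1)]
    j = max(range(len(b) + 1), key=totals.__getitem__)
    return mid, j, totals[j]


if __name__ == "__main__":
    a, b = "GATTACA", "GCATGCU"
    print(nw_last_row(a, b)[-1], best_split(a, b))
```

The combined score at the best split equals the global alignment score, which is why the two halves can be aligned independently in a recursion.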

Jaro–Winkler distance

Jaro distance, Jaro-Winkler
There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. Edit distance is usually defined as a parameterizable metric calculated with a specific set of allowed edit operations, where each operation is assigned a cost (possibly infinite).
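One way to picture the parameterization is a single routine whose operation costs are arguments, with an infinite cost forbidding an operation. The sketch below is an assumption-laden illustration (the function name and defaults are hypothetical): with all costs at 1 it behaves like Levenshtein distance, with substitution forbidden it gives an insert/delete-only distance, and with insertion and deletion forbidden it reduces to Hamming distance on equal-length strings.

```python
import math


def edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Edit distance with configurable costs; math.inf disables an operation."""
    # Build the first row by repeated addition so that an infinite cost
    # never gets multiplied by zero (0 * inf would be NaN).
    prev = [0.0]
    for _ in b:
        prev.append(prev[-1] + ins)
    for ca in a:
        curr = [prev[0] + dele]
        for j, cb in enumerate(b, start=1):
            step = 0.0 if ca == cb else sub
            curr.append(min(prev[j] + dele,       # delete ca
                            curr[j - 1] + ins,    # insert cb
                            prev[j - 1] + step))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))                # all operations
    print(edit_distance("kitten", "sitting", sub=math.inf))  # no substitution
    print(edit_distance("karolin", "kathrin",
                        ins=math.inf, dele=math.inf))        # substitution only
```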

Triangle inequality

triangular inequality, Reverse triangle inequality
[Figure: three examples of the triangle inequality for triangles with sides of given lengths.]


Linguistics

Linguistics is the scientific study of language. It involves analysing language form, language meaning, and language in context. The earliest activities in the documentation and description of language have been attributed to the 6th-century-BC Indian grammarian Pāṇini, who wrote a formal description of the Sanskrit language.

Sequence homology

Significant similarity is strong evidence that two sequences are related by evolutionary changes from a common ancestral sequence. Alignments of multiple sequences are used to indicate which regions of each sequence are homologous. The term "percent homology" is often used to mean "sequence similarity." The percentage of identical residues (percent identity) or the percentage of residues conserved with similar physicochemical properties (percent similarity), e.g. leucine and isoleucine, is usually used to "quantify the homology." Based on the definition of homology specified above, this terminology is incorrect: sequence similarity is the observation, while homology is the conclusion.
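Percent identity and percent similarity can be computed directly from two aligned rows. The sketch below assumes the denominator is the number of alignment columns (gaps included), which is only one of several conventions in use, and the similarity groups are supplied by the caller; the single group "LI" in the example merely mirrors the leucine/isoleucine case mentioned above.

```python
def _columns(aligned_a, aligned_b, gap="-"):
    """Pair up alignment columns, dropping columns that are gaps in both rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    cols = [(x, y) for x, y in zip(aligned_a, aligned_b)
            if not (x == gap and y == gap)]
    if not cols:
        raise ValueError("alignment has no usable columns")
    return cols


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical residues as a percentage of alignment columns (assumed convention)."""
    cols = _columns(aligned_a, aligned_b, gap)
    return 100.0 * sum(x == y for x, y in cols) / len(cols)


def percent_similarity(aligned_a, aligned_b, groups, gap="-"):
    """Identical or same-group residues as a percentage of alignment columns.

    groups is a caller-supplied iterable of strings, each listing residues
    treated as physicochemically similar (hypothetical input).
    """
    cols = _columns(aligned_a, aligned_b, gap)
    similar = sum(x == y or any(x in g and y in g for g in groups)
                  for x, y in cols)
    return 100.0 * similar / len(cols)


if __name__ == "__main__":
    print(percent_identity("ALK", "AIK"))
    print(percent_similarity("ALK", "AIK", groups=("LI",)))
```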

Hidden Markov model

hidden Markov models, HMM, Poisson hidden Markov model
Alignment of bio-sequences. Time series analysis. Activity recognition. Protein folding. Sequence classification. Metamorphic virus detection. DNA motif discovery. Chromatin state discovery. Transportation forecasting. Solar irradiance variability. Andrey Markov. Baum–Welch algorithm. Bayesian inference. Bayesian programming. Conditional random field. Estimation theory. HHpred / HHsearch free server and software for protein sequence searching. HMMER, a free hidden Markov model program for protein sequence analysis. Hidden Bernoulli model. Hidden semi-Markov model. Hierarchical hidden Markov model. Layered hidden Markov model. Sequential dynamical system. Stochastic context-free grammar.

Social sequence analysis

Andrew Abbott argued that sequence alignment methods in biology and information theory and computer science provided useful models. Both fields had developed combinations of sequence alignment operations to facilitate the comparison of whole sequences. Social scientists adapted these methods in the form of optimal matching (OM) analysis, often in conjunction with cluster analysis techniques to aid in the identification of common sequence pattern classes.


Bioinformatics

Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein–protein interactions, genome-wide association studies, the modeling of evolution and cell division/mitosis. Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.

String metric

similarity, string metrics, distance
The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance). It operates between two input strings, returning the minimum number of single-character insertions, deletions, and substitutions needed to transform one input string into the other. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.

Segregating site

non-conservative replacements, non-conservative mutations, variable
Sequence alignment. Sequence alignment software. ClustalW.

Vladimir Levenshtein

Levenshtein, Vladimir; Vladimir I. Levenshtein
Vladimir Iosifovich Levenshtein (March 20, 1935 – September 6, 2017) was a Russian scientist who did research in information theory, error-correcting codes, and combinatorial design. Among other contributions, he is known for the Levenshtein distance and a Levenshtein algorithm, which he developed in 1965. He graduated from the Department of Mathematics and Mechanics of Moscow State University in 1958 and worked at the Keldysh Institute of Applied Mathematics in Moscow ever since. He was a fellow of the IEEE Information Theory Society. He received the IEEE Richard W.

Consensus sequence

consensus sequences, canonical sequence, consensus
In molecular biology and bioinformatics, the consensus sequence (or canonical sequence) is the calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. It represents the results of multiple sequence alignments in which related sequences are compared to each other and similar sequence motifs are calculated. Such information is important when considering sequence-dependent enzymes such as RNA polymerase. A protein binding site, represented by a consensus sequence, may be a short sequence of nucleotides which is found several times in the genome and is thought to play the same role in its different locations.
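Taking the most frequent residue in each alignment column can be sketched as follows. Treating the gap character as an ordinary symbol and breaking ties in favour of the residue seen first in the column are assumptions of this sketch, not part of the definition above.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Most frequent symbol per column of an alignment (rows of equal length).

    Ties go to the symbol that appears first in the column, because
    Counter.most_common keeps first-encountered order among equal counts.
    Gap characters are counted like any other symbol (assumed behaviour).
    """
    if not aligned_sequences:
        raise ValueError("need at least one sequence")
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_sequences))


if __name__ == "__main__":
    print(consensus(["ATGC", "ATGA", "TTGC"]))
```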

SAM (file format)

SAM, Sequence Alignment Map, SAM format
The Sequence Alignment/Map (SAM) format stores alignments of nucleotide sequences (e.g. sequencing reads) to one or more reference sequences. It may contain base-call and alignment qualities and other data. The SAM format consists of a header and an alignment section. The binary equivalent of a SAM file is a Binary Alignment Map (BAM) file, which stores the same data in a compressed binary representation. SAM files can be analysed and edited with the software SAMtools. The header section, if present, must precede the alignment section. Header lines begin with the '@' symbol, which distinguishes them from the alignment section.
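The header/alignment distinction can be sketched with a small splitter that relies only on the leading '@' rule. The function name and the sample records are illustrative, and the sketch does not validate individual fields; real files are better handled with SAMtools or a dedicated library.

```python
def split_sam(lines):
    """Separate SAM header lines from alignment records.

    Header lines start with '@'. Alignment lines are split on tabs into
    their fields. A header line found after an alignment line is rejected,
    since the header must come first.
    """
    header, alignments = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue                      # ignore blank lines
        if line.startswith("@"):
            if alignments:
                raise ValueError("header line after alignment section")
            header.append(line)
        else:
            alignments.append(line.split("\t"))
    return header, alignments


if __name__ == "__main__":
    sample = [
        "@HD\tVN:1.6\tSO:coordinate",
        "@SQ\tSN:ref\tLN:45",
        "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*",
    ]
    header, records = split_sam(sample)
    print(len(header), records[0][0], records[0][5])
```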

Approximate string matching

Fuzzy string searching, fuzzy search, fuzzy matching
Levenshtein distance. Locality-sensitive hashing. Metaphone. Needleman–Wunsch algorithm. Plagiarism detection. Regular expressions for fuzzy and non-fuzzy matching. Smith–Waterman algorithm. Soundex. String metric. Flamingo Project. Efficient Similarity Query Processing Project with recent advances in approximate string matching based on an edit distance threshold. StringMetric project a Scala library of string metrics and phonetic algorithms. Natural project a JavaScript natural language processing library which includes implementations of popular string metrics.


Biopython

Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. Biopython is one of a number of Bio* projects designed to reduce code duplication in computational biology. Biopython development began in 1999 and it was first released in July 2000. It was developed during a similar time frame and with analogous goals to other projects that added bioinformatics capabilities to their respective programming languages, including BioPerl, BioRuby and BioJava. Early developers on the project included Jeff Chang, Andrew Dalke and Brad Chapman, though over 100 people have made contributions to date.

FASTA format

FASTA, fasta sequences
Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence.
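The character rules above can be sketched as a small normalizer for the sequence part of a record. Dropping digits and whitespace silently (rather than rejecting them) is a design choice of this sketch, made because some databases embed position numbers; the function name is hypothetical, and the sketch does not check letters against the IUB/IUPAC code tables.

```python
def normalize_fasta_sequence(raw):
    """Clean one FASTA sequence body according to the rules described above.

    - lower-case letters are mapped to upper case
    - '-' is kept as the gap character and '*' is kept as-is
    - digits and whitespace are dropped (assumed handling of position numbers)
    - any other character raises ValueError
    """
    cleaned = []
    for ch in raw:
        if ch.isspace() or ch.isdigit():
            continue
        if (ch.isascii() and ch.isalpha()) or ch in "-*":
            cleaned.append(ch.upper())
        else:
            raise ValueError(f"unexpected character in sequence: {ch!r}")
    return "".join(cleaned)


if __name__ == "__main__":
    print(normalize_fasta_sequence("1 acgt-acgt 10\n11 ttga"))
```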


BioRuby

See also: BioRubyDoc Community Wiki. 2009: Implementing phyloXML support in BioRuby. 2010: Ruby 1.9.2 support of BioRuby. 2010: Implementation of algorithm to infer gene duplications in BioRuby. 2011: Represent bio-objects and related information with images. 2012: Extend bio-alignment plug-in with Multiple Alignment Format (MAF) parser. 0.7.0 December 18, 2005 (438 KB). 1.0.0 February 26, 2006 (528 KB). May 24, 2013 (1.42 MB). mailing list. contributors. GitHub. Japan Open Bioinformatics Research Group. The Open Bioinformatics Foundation. BioPerl. BioPython. BioJava. BioJS. BioLisp. BioDAS. Saaien Tist -