Arithmetic coding

Arithmetic coding is a form of entropy encoding used in lossless data compression.
Related articles

Lossless compression

The primary encoding algorithms used to produce bit sequences are Huffman coding (also used by DEFLATE) and arithmetic coding.

Asymmetric numeral systems

A recent family of entropy coders called asymmetric numeral systems allows for faster implementations thanks to operating directly on a single natural number representing the current information.
ANS combines the compression ratio of arithmetic coding (which uses a nearly accurate probability distribution) with a processing cost similar to that of Huffman coding.
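As a sketch of the idea, the following Python fragment (a toy model with assumed frequencies, not any particular production coder) shows how a range variant of ANS pushes symbols onto, and pops them from, a single natural number:

    # Toy rANS sketch: assumed frequencies summing to M = 8; the entire
    # coder state is one natural number x that grows as symbols are pushed.
    freq = {"a": 5, "b": 2, "c": 1}
    cum = {"a": 0, "b": 5, "c": 7}   # cumulative frequency of each symbol
    M = 8

    def encode(message, x=1):
        for s in message:                      # push symbols onto x
            x = (x // freq[s]) * M + cum[s] + (x % freq[s])
        return x

    def decode(x, n):
        out = []
        for _ in range(n):                     # pop symbols in reverse order
            slot = x % M
            s = next(t for t in freq if cum[t] <= slot < cum[t] + freq[t])
            x = freq[s] * (x // M) + slot - cum[s]
            out.append(s)
        return "".join(reversed(out))

    print(decode(encode("abac"), 4))           # abac

A probable symbol (large frequency) grows x by only a fraction of a bit, which is where the arithmetic-coding-like compression ratio comes from; production coders additionally renormalize x into a fixed-width word and stream bits out.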

Entropy encoding

Two of the most common entropy encoding techniques are Huffman coding and arithmetic coding.

Huffman coding

Arithmetic coding differs from other forms of entropy encoding, such as Huffman coding, in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, an arbitrary-precision fraction q where 0.0 ≤ q < 1.0.
Other methods such as arithmetic coding often have better compression capability.
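A minimal sketch of that difference, using Python's exact fractions and an assumed two-symbol model (not any particular implementation): each symbol narrows a subinterval of [0, 1), and any single number q in the final interval encodes the whole message.

    from fractions import Fraction

    # Assumed toy model: each symbol owns a fixed slice of [0, 1).
    ranges = {"a": (Fraction(0), Fraction(2, 3)),
              "b": (Fraction(2, 3), Fraction(1))}

    def encode(message):
        low, high = Fraction(0), Fraction(1)
        for s in message:
            span = high - low
            s_lo, s_hi = ranges[s]
            low, high = low + span * s_lo, low + span * s_hi
        return low, high          # any q with low <= q < high will do

    low, high = encode("aab")
    print(low, high)              # interval [8/27, 4/9); any q inside encodes "aab"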

Entropy (information theory)

This 8-bit output is larger than the information content, or entropy, of the message, which is the sum of −log₂(pᵢ) over the message's symbols, approximately 7.381 bits in this example.
Entropy effectively bounds the performance of the strongest lossless compression possible, which can be realized in theory by using the typical set or in practice using Huffman, Lempel–Ziv or arithmetic coding.
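For instance, taking the worked example's symbol probabilities to be 0.6, 0.1 and 0.1 (an assumption about the surrounding article's model), the bound can be checked directly:

    from math import log2

    # Entropy of the three-symbol message, assuming probabilities
    # 0.6, 0.1 and 0.1 as in the article's worked example.
    print(sum(-log2(p) for p in (0.6, 0.1, 0.1)))   # about 7.381 bits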

JPEG

Also, encoders and decoders of the JPEG file format, which has options for both Huffman encoding and arithmetic coding, typically only support the Huffman encoding option; this was originally because of patent concerns. As a result, nearly all JPEG images in use today use Huffman encoding, although JPEG's arithmetic coding patents have expired owing to the age of the JPEG standard (the design of which was approximately completed by 1990).
The JPEG standard also allows, but does not require, decoders to support the use of arithmetic coding, which is mathematically superior to Huffman coding.

Bzip2

At least one significant compression software program, bzip2, deliberately discontinued the use of arithmetic coding in favor of Huffman coding due to the perceived patent situation at the time.
bzip2's ancestor bzip used arithmetic coding instead of Huffman coding.

Data compression

In a further refinement of the direct use of probabilistic modelling, statistical estimates can be coupled to an algorithm called arithmetic coding.

Image compression

The JPEG image compression format's arithmetic coding algorithm is based on a number of patents, all of which have since expired.

Context-adaptive binary arithmetic coding

In CABAC, binary arithmetic coding is finally applied to compress the data.

ASCII

Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character, as in the ASCII code.

Arbitrary-precision arithmetic

Arithmetic coding differs from other forms of entropy encoding, such as Huffman coding, in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, an arbitrary-precision fraction q where 0.0 ≤ q < 1.0.

Block code

For a three-symbol alphabet, simple block encoding would require 2 bits per symbol, which is wasteful: one of the four possible bit patterns is never used.

Ternary numeral system

The next step is to encode this ternary number using a fixed-point binary number of sufficient precision to recover it, such as 0.0010110010₂; this is only 10 bits, so 2 bits are saved in comparison with naïve block encoding.
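The arithmetic checks out; here is the computation in Python (the digit assignments A=0, B=1, C=2 and the sequence ABBCAB follow the example):

    from fractions import Fraction

    digits = {"A": 0, "B": 1, "C": 2}
    seq = "ABBCAB"                         # encodes as 0.011201 in base 3

    value = Fraction(0)
    for i, s in enumerate(seq, start=1):
        value += Fraction(digits[s], 3 ** i)
    print(value)                           # 127/729, about 0.17421

    # The 10-bit binary fraction 0.0010110010 lands inside
    # [127/729, 128/729), so it recovers the sequence exactly.
    approx = sum(int(b) * Fraction(1, 2 ** i)
                 for i, b in enumerate("0010110010", start=1))
    print(value <= approx < value + Fraction(1, 3 ** len(seq)))   # True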

Shannon's source coding theorem

In general, arithmetic coders can produce near-optimal output for any given set of symbols and probabilities (the optimal value is −log₂ P bits for each symbol of probability P; see the source coding theorem).
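Applied to the ternary example above, with each of the three symbols at probability 1/3, the optimum for six symbols is 6 × log₂ 3 ≈ 9.51 bits, so the 10-bit encoding is within half a bit of it:

    from math import log2

    # Optimal cost of six equiprobable ternary symbols (p = 1/3 each).
    print(6 * -log2(1 / 3))    # about 9.51 bits; 10 bits is near-optimal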

Conceptual model

Compression algorithms that use arithmetic coding start by determining a model of the data – basically a prediction of what patterns will be found in the symbols of the message.

Adaptive coding

Models can even be adaptive, so that they continually change their prediction of the data based on what the stream actually contains.
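A minimal sketch of such a model in Python (an assumed interface, not any specific library's): symbol counts start uniform and are bumped after every coded symbol, so the predictions track the stream.

    class AdaptiveModel:
        """Adaptive frequency model: probabilities follow observed counts."""

        def __init__(self, alphabet):
            self.counts = {s: 1 for s in alphabet}   # start out uniform

        def prob(self, symbol):
            return self.counts[symbol] / sum(self.counts.values())

        def update(self, symbol):
            self.counts[symbol] += 1   # both coder and decoder do this

Because the decoder applies the same update after each decoded symbol, encoder and decoder stay synchronized without the model ever being transmitted.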

Prefix code

In particular, it is only necessary to transmit enough digits (in whatever base) of the fraction so that all fractions that begin with those digits fall into the final interval; this will guarantee that the resulting code is a prefix code.
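A sketch of that termination step in Python (exact fractions, assumed interval representation): it returns the fewest binary digits whose every continuation stays inside the final interval [low, high).

    from fractions import Fraction
    from math import ceil

    def code_bits(low, high):
        # Shortest bit string such that all fractions beginning with it
        # fall in [low, high) -- this is what makes the output a prefix code.
        n = 1
        while True:
            v = ceil(low * 2 ** n)                 # smallest candidate at n bits
            if Fraction(v + 1, 2 ** n) <= high:    # whole dyadic interval fits
                return format(v, "0{}b".format(n))
            n += 1

    print(code_bits(Fraction(1, 3), Fraction(1, 2)))   # 011, i.e. 0.011 in binary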

Bit

This is correct; the information content of a three-digit decimal is 3 × log₂ 10 ≈ 9.966 bits; the same message could have been encoded in the binary fraction 0.10001010 (equivalent to 0.5390625 decimal) at a cost of only 8 bits.
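Both figures are quick to verify:

    from math import log2

    print(3 * log2(10))                 # 9.966 bits in three decimal digits
    print(int("10001010", 2) / 2 ** 8)  # 0.5390625, the 8-bit binary fraction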

Significant figures

In particular, they are written as if the encoder first calculated the fractions representing the endpoints of the interval in full, using infinite precision, and only converted the fraction to its final form at the end of encoding.
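Python's exact fractions make that idealized description directly executable, and also show why it is impractical: the endpoints' denominators grow without bound (a toy model with an assumed middle-third symbol range):

    from fractions import Fraction

    # Repeatedly narrow the interval to an assumed middle-third range.
    low, high = Fraction(0), Fraction(1)
    for _ in range(20):
        span = high - low
        low, high = low + span / 3, low + span * Fraction(2, 3)
    print(low.denominator.bit_length())   # ~32 bits after just 20 symbols

Practical coders instead keep fixed-width approximations of the endpoints and emit leading bits as soon as they agree (renormalization).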

Radix

In general, arithmetic (and range) coding may be interpreted as a generalized change of radix.

Numeral system

In a positional numeral system the radix, or base, is numerically equal to the number of different symbols used to express the number.

Independent and identically distributed random variables

Because arithmetic coding doesn't compress one datum at a time, it can get arbitrarily close to entropy when compressing iid strings.

Run-length encoding

An alternative is encoding run lengths via Golomb–Rice codes, a Huffman-like family of prefix codes.
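For reference, a Rice code with parameter k writes a run length n as the quotient n >> k in unary followed by the k low-order bits; a minimal sketch:

    def rice_encode(n, k):
        # Quotient in unary (q ones and a terminating zero), then the
        # k-bit remainder; well suited to roughly geometric run lengths.
        q, r = n >> k, n & ((1 << k) - 1)
        return "1" * q + "0" + format(r, "0{}b".format(k))

    print(rice_encode(9, 2))   # 11001: 9 = 2 * 4 + 1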