Logo of the Unicode Consortium
The first 2^16 Unicode code points. The stripe of solid gray near the bottom is the range of surrogate halves used by UTF-16 (the white region below the stripe is the Private Use Area).
Many modern applications can render a substantial subset of the many scripts in Unicode, as demonstrated by this screenshot from the OpenOffice.org application.
Various Cyrillic characters shown with upright, oblique and italic alternate forms

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid character code points of Unicode (in fact this number of code points is dictated by the design of UTF-16).

- UTF-16

The Unicode standard defines Unicode Transformation Formats (UTF): UTF-8, UTF-16, and UTF-32, and several other encodings.

- Unicode

7 related topics

Declared character set for the 10 million most popular websites since 2010

UTF-8

Variable-width character encoding used for electronic communication.

Use of the main encodings on the web from 2001 to 2012 as recorded by Google, with UTF-8 overtaking all others in 2008 and exceeding 60% of the web in 2012 (since then approaching 100%). The ASCII-only figure includes all web pages that contain only ASCII characters, regardless of the declared header. Other encodings such as GB2312 are counted under "others".

Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.
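The restriction above can be observed with a strict UTF-8 codec. The following is a minimal sketch using Python's built-in codec; the helper names utf8_encodes and utf8_decodes are hypothetical and exist only for this illustration.

```python
def utf8_encodes(code_point: int) -> bool:
    """Return True if the code point can be encoded as strict UTF-8."""
    try:
        # chr() rejects values above U+10FFFF with ValueError;
        # the codec rejects surrogates with UnicodeEncodeError.
        chr(code_point).encode("utf-8")
        return True
    except (ValueError, UnicodeEncodeError):
        return False


def utf8_decodes(data: bytes) -> bool:
    """Return True if the bytes are a valid strict UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Byte patterns that would naively represent U+D800 and U+110000.
SURROGATE_BYTES = b"\xed\xa0\x80"
BEYOND_RANGE_BYTES = b"\xf4\x90\x80\x80"

print(utf8_encodes(0xD800), utf8_decodes(SURROGATE_BYTES))
print(utf8_encodes(0x110000), utf8_decodes(BEYOND_RANGE_BYTES))
```

Code points immediately outside the surrogate range (U+D7FF and U+E000) and the last code point (U+10FFFF) are still accepted.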

Punched tape with the word "Wikipedia" encoded in ASCII. Presence and absence of a hole represents 1 and 0, respectively; for example, "W" is encoded as "1010111".

Character encoding

Process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers.

Hollerith 80-column punch card with EBCDIC character set

The low cost of digital representation of data in modern computer systems allows more elaborate character codes (such as Unicode) which represent most of the characters used in many written languages.

A code unit in UTF-16 consists of 16 bits.
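As an illustration of 16-bit code units, the sketch below splits a string's UTF-16 encoding into its individual units. The helper name utf16_code_units is hypothetical; the little-endian codec is chosen only to avoid a byte order mark in the output.

```python
def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit code units of text in UTF-16."""
    data = text.encode("utf-16-le")  # two bytes per code unit, no BOM
    return [
        int.from_bytes(data[i:i + 2], "little")
        for i in range(0, len(data), 2)
    ]


print([hex(u) for u in utf16_code_units("A")])           # one unit
print([hex(u) for u in utf16_code_units("\u20ac")])      # one unit (BMP)
print([hex(u) for u in utf16_code_units("\U0001F600")])  # two units (surrogate pair)
```

A character in the Basic Multilingual Plane takes one code unit, while a character outside it takes two.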

ASCII chart from a pre-1972 printer manual

ASCII

Character encoding standard for electronic communication.

ASCII (1963). Control pictures of equivalent controls are shown where they exist, or a grey dot otherwise.

Unicode and the ISO/IEC 10646 Universal Character Set (UCS) have a much wider array of characters and their various encoding forms have begun to supplant ISO/IEC 8859 and ASCII rapidly in many environments.

While ASCII is limited to 128 characters, Unicode and the UCS support more characters by separating the concepts of unique identification (using natural numbers called code points) and encoding (to 8-, 16-, or 32-bit binary formats, called UTF-8, UTF-16, and UTF-32, respectively).
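The separation between a code point (the number) and its encoding (the bytes) can be sketched as follows. The function name encoded_forms is hypothetical; big-endian codecs are used here only so that the hex output reads in the same order as the code point.

```python
def encoded_forms(code_point: int) -> dict[str, str]:
    """Return the hex bytes of one code point in UTF-8, UTF-16 and UTF-32."""
    ch = chr(code_point)
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16": ch.encode("utf-16-be").hex(),
        "utf-32": ch.encode("utf-32-be").hex(),
    }


# The same code point, U+00E9, has three different byte representations.
print(encoded_forms(0x00E9))
# A code point outside the BMP, U+1F600.
print(encoded_forms(0x1F600))
```

In every case, decoding the bytes with the matching codec yields the same single code point.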

A human computer, with microscope and calculator, 1952

Universal Coded Character Set

Standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard), which is the basis of many character encodings, improving as characters from previously unrepresented writing systems are added.

The original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP.

In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO/IEC 10646.

Comparison of units of information: bit, trit, nat, ban. Quantity of information is the height of bars. Dark green level is the "nat" unit.

UTF-32

UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero, as there are far fewer than 2^32 Unicode code points; only 21 bits are actually needed).

This makes UTF-32 close to twice the size of UTF-16.
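The size relationship can be checked directly. This is a minimal sketch with a hypothetical helper, encoded_sizes; the sample strings are arbitrary. For text made only of BMP characters UTF-32 is exactly twice as large as UTF-16, and the two are equal for characters outside the BMP, which is why the overall ratio is "close to" two for typical text.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-16 size, UTF-32 size) in bytes, without a BOM."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-32-le"))


bmp_text = "Hello, \u043c\u0438\u0440"  # 10 characters, all in the BMP
supplementary = "\U0001F600"            # 1 character outside the BMP

print(encoded_sizes(bmp_text))
print(encoded_sizes(supplementary))

# The largest code point fits in 21 bits, so the top 11 bits of
# every UTF-32 code unit are zero.
print((0x10FFFF).bit_length())
```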

IBM code page numbers (CPGIDs and CCSIDs) used for CJK encodings. Microsoft use of code page numbers for CJK encodings differs, and is noted in brackets where applicable.

Code page

Character encoding; as such, it is a specific association of a set of printable characters and control characters with unique numbers.

The multitude of character sets leads many vendors to recommend Unicode.

1200: UTF-16BE Unicode (big-endian) with IBM Private Use Area (PUA)

A map of the Basic Multilingual Plane. Each numbered box represents 256 code points.

Plane (Unicode)

A map of the Supplementary Multilingual Plane. Each numbered box represents 256 code points.
A map of the Supplementary Ideographic Plane. Each numbered box represents 256 code points.
A map of the Tertiary Ideographic Plane. Each numbered box represents 256 code points.
A map of the Supplementary Special-purpose Plane. Each numbered box represents 256 code points.

In the Unicode standard, a plane is a continuous group of 65,536 (2^16) code points.
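Because each plane holds exactly 2^16 code points, the plane number of a code point is simply its value divided by 65,536. A minimal sketch, with the hypothetical helper plane_of:

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16  # same as code_point // 65536


print(plane_of(0x0041))    # Basic Multilingual Plane
print(plane_of(0x1F600))   # Supplementary Multilingual Plane
print(plane_of(0x20000))   # Supplementary Ideographic Plane
print(plane_of(0x10FFFF))  # last plane
```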

The limit of 17 planes is due to UTF-16, which can encode 2^20 code points (16 planes) as pairs of words, plus the BMP as a single word.
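The arithmetic behind this limit can be sketched as follows. Each half of a surrogate pair carries 10 bits, so pairs address 2^10 × 2^10 = 2^20 code points above the BMP. The function names to_surrogates and from_surrogates are hypothetical and chosen for this illustration.

```python
HIGH_START, LOW_START = 0xD800, 0xDC00  # first high and low surrogate


def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in planes 1 to 16")
    offset = code_point - 0x10000  # 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Combine a surrogate pair back into a code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


pairs = 1024 * 1024           # 2^20 code points reachable through pairs
total = pairs + 2**16         # plus the BMP: 17 planes
scalar_values = total - 2048  # minus the surrogate code points themselves

print(total // 2**16)         # number of planes
print(scalar_values)          # valid character code points
print([hex(u) for u in to_surrogates(0x10FFFF)])
```

Subtracting the 2,048 surrogate code points from the 17 planes gives the 1,112,064 valid character code points that UTF-16 can encode.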