UTF-16

The first 216 Unicode code points. The stripe of solid gray near the bottom are the surrogate halves used by UTF-16 (the white region below the stripe is the Private Use Area)

Character encoding capable of encoding all 1,112,064 valid character code points of Unicode (in fact this number of code points is dictated by the design of UTF-16).

- UTF-16
The first 216 Unicode code points. The stripe of solid gray near the bottom are the surrogate halves used by UTF-16 (the white region below the stripe is the Private Use Area)

19 related topics

Alpha

Declared character set for 10million most popular websites since 2010

UTF-8

Variable-width character encoding used for electronic communication.

Variable-width character encoding used for electronic communication.

Declared character set for 10million most popular websites since 2010
Use of the main encodings on the web from 2001 to 2012 as recorded by Google, with UTF-8 overtaking all others in 2008 and over 60% of the web in 2012 (since then approaching 100%). The ASCII-only figure includes all web pages that only contain ASCII characters, regardless of the declared header. Other encodings of Unicode such as GB2312 are added to "others".

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.

Punched tape with the word "Wikipedia" encoded in ASCII. Presence and absence of a hole represents 1 and 0, respectively; for example, "W" is encoded as "1010111".

Character encoding

Process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers.

Process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers.

Punched tape with the word "Wikipedia" encoded in ASCII. Presence and absence of a hole represents 1 and 0, respectively; for example, "W" is encoded as "1010111".
Hollerith 80-column punch card with EBCDIC character set
365x365px

A code unit in UTF-16 consists of 16 bits;

Logo of the Unicode Consortium

Unicode

Information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

Information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

Logo of the Unicode Consortium
Many modern applications can render a substantial subset of the many scripts in Unicode, as demonstrated by this screenshot from the OpenOffice.org application.
15px
Various Cyrillic characters shown with upright, oblique and italic alternate forms

The Unicode standard defines Unicode Transformation Formats (UTF): UTF-8, UTF-16, and UTF-32, and several other encodings.

ASCII chart from a pre-1972 printer manual

ASCII

Character encoding standard for electronic communication.

Character encoding standard for electronic communication.

ASCII chart from a pre-1972 printer manual
ASCII (1963). Control pictures of equivalent controls are shown where they exist, or a grey dot otherwise.

While ASCII is limited to 128 characters, Unicode and the UCS support more characters by separating the concepts of unique identification (using natural numbers called code points) and encoding (to 8-, 16-, or 32-bit binary formats, called UTF-8, UTF-16, and UTF-32, respectively).

IBM code page numbers (CPGIDs and CCSIDs) used for CJK encodings. Microsoft use of code page numbers for CJK encodings differs, and is noted in brackets where applicable.

Code page

Character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers.

Character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers.

IBM code page numbers (CPGIDs and CCSIDs) used for CJK encodings. Microsoft use of code page numbers for CJK encodings differs, and is noted in brackets where applicable.

1200UTF-16BE Unicode (big-endian) with IBM Private Use Area (PUA)

A human computer, with microscope and calculator, 1952

Universal Coded Character Set

Standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard), which is the basis of many character encodings, improving as characters from previously unrepresented typing systems are added.

Standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard), which is the basis of many character encodings, improving as characters from previously unrepresented typing systems are added.

A human computer, with microscope and calculator, 1952

The original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP.

Declared character set for 10million most popular websites since 2010

Byte order mark

Optional.

Optional.

Declared character set for 10million most popular websites since 2010

In UTF-16, a BOM may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code unit of the file or stream.

Comparison of units of information: bit, trit, nat, ban. Quantity of information is the height of bars. Dark green level is the "nat" unit.

UTF-32

Fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicode code points, needing actually only 21 bits).

Fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 232 Unicode code points, needing actually only 21 bits).

Comparison of units of information: bit, trit, nat, ban. Quantity of information is the height of bars. Dark green level is the "nat" unit.

This makes UTF-32 close to twice the size of UTF-16.

150px

Universal Character Set characters

The Unicode Consortium and the ISO/IEC JTC 1/SC 2/WG 2 collaborates on the Universal Character Set (UCS).

The Unicode Consortium and the ISO/IEC JTC 1/SC 2/WG 2 collaborates on the Universal Character Set (UCS).

150px
500px
Example of fraction slash use. This typeface (Apple Chancery) shows the synthesized common fraction on the left and the precomposed fraction glyph on the right as a rendering the plain text string "1 1⁄4 1¼". Depending on the text environment, the single string "1 1⁄4" might yield either result, the one on the right through substitution of the fraction sequence with the single precomposed fraction glyph.
A more elaborate example of fraction slash usage: plain text "4 221⁄225" rendered in Apple Chancery. This font supplies the text layout software with instructions to synthesize the fraction according to the Unicode rule described in this section.

It is also not likely to be UTF-16 in little-endian byte order because 0xFE, 0xFF read as a 16-bit little endian word would be U+FFFE, which is meaningless.

A map of the Basic Multilingual Plane. Each numbered box represents 256 code points.

Plane (Unicode)

Continuous group of 65,536 code points.

Continuous group of 65,536 code points.

A map of the Basic Multilingual Plane. Each numbered box represents 256 code points.
A map of the Supplementary Multilingual Plane. Each numbered box represents 256 code points.
A map of the Supplementary Ideographic Plane. Each numbered box represents 256 code points.
A map of the Tertiary Ideographic Plane. Each numbered box represents 256 code points.
A map of the Supplementary Special-purpose Plane. Each numbered box represents 256 code points.

The limit of 17 planes is due to UTF-16, which can encode 220 code points (16 planes) as pairs of words, plus the BMP as a single word.