Punched tape with the word "Wikipedia" encoded in ASCII. Presence and absence of a hole represent 1 and 0, respectively; for example, "W" is encoded as "1010111".
The first 2^16 Unicode code points. The stripe of solid gray near the bottom is the range of surrogate halves used by UTF-16 (the white region below the stripe is the Private Use Area).
Hollerith 80-column punch card with EBCDIC character set

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid character code points of Unicode (in fact this number of code points is dictated by the design of UTF-16).
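The figure of 1,112,064 can be checked with a little arithmetic. The sketch below assumes the usual bounds: a code space running from U+0000 to U+10FFFF, minus the surrogate range U+D800 through U+DFFF, which UTF-16 reserves for itself. The constant names are illustrative only.

```python
# Code space reachable by UTF-16: U+0000 through U+10FFFF (17 planes of 65,536).
TOTAL_CODE_POINTS = 0x10FFFF + 1

# U+D800..U+DFFF are reserved as surrogate halves and encode no character.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1

VALID_CODE_POINTS = TOTAL_CODE_POINTS - SURROGATE_COUNT
print(VALID_CODE_POINTS)  # 1112064
```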

- UTF-16

A code unit in UTF-16 consists of 16 bits.

- Character encoding

9 related topics


Declared character set for the 10 million most popular websites since 2010

UTF-8

Use of the main encodings on the web from 2001 to 2012 as recorded by Google, with UTF-8 overtaking all others in 2008 and over 60% of the web in 2012 (since then approaching 100%). The ASCII-only figure includes all web pages that only contain ASCII characters, regardless of the declared header. Other encodings of Unicode such as GB2312 are added to "others".

UTF-8 is a variable-width character encoding used for electronic communication.

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence.
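As an illustration of that rule, the sketch below relies on Python's built-in strict UTF-8 codec, which rejects both cases. The helper name `is_valid_utf8` and the sample byte strings are hypothetical, chosen only for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Three-byte pattern that would encode the surrogate U+D800.
surrogate_bytes = bytes([0xED, 0xA0, 0x80])

# Four-byte pattern that would encode U+110000, just past U+10FFFF.
beyond_range_bytes = bytes([0xF4, 0x90, 0x80, 0x80])

print(is_valid_utf8("\u20ac".encode("utf-8")))  # True (euro sign, U+20AC)
print(is_valid_utf8(surrogate_bytes))           # False
print(is_valid_utf8(beyond_range_bytes))        # False
```

Encoding fails in the same way: calling `"\ud800".encode("utf-8")` raises `UnicodeEncodeError`.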

Logo of the Unicode Consortium

Unicode

Many modern applications can render a substantial subset of the many scripts in Unicode, as demonstrated by this screenshot from the OpenOffice.org application.
Various Cyrillic characters shown with upright, oblique and italic alternate forms

Unicode, formally the Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

The Unicode standard defines Unicode Transformation Formats (UTF): UTF-8, UTF-16, and UTF-32, and several other encodings.

IBM code page numbers (CPGIDs and CCSIDs) used for CJK encodings. Microsoft use of code page numbers for CJK encodings differs, and is noted in brackets where applicable.

Code page


In computing, a code page is a character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers.

1200: UTF-16BE Unicode (big-endian) with IBM Private Use Area (PUA)

ASCII chart from a pre-1972 printer manual

ASCII

ASCII (1963). Control pictures of equivalent controls are shown where they exist, or a grey dot otherwise.

ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication.

While ASCII is limited to 128 characters, Unicode and the UCS support more characters by separating the concepts of unique identification (using natural numbers called code points) and encoding (to 8-, 16-, or 32-bit binary formats, called UTF-8, UTF-16, and UTF-32, respectively).
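The separation between a code point and its encoded form can be seen by encoding one character three ways. This is a minimal sketch using Python's built-in codecs; the euro sign is an arbitrary sample, and the big-endian variants are chosen so that no byte order mark is emitted.

```python
ch = "\u20ac"  # euro sign, chosen arbitrarily for illustration

# Identification: a single natural number, independent of any byte layout.
code_point = ord(ch)

# Encoding: the same code point serialized with 8-, 16-, and 32-bit code units.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")
utf32_bytes = ch.encode("utf-32-be")

print(hex(code_point))    # 0x20ac
print(utf8_bytes.hex())   # e282ac
print(utf16_bytes.hex())  # 20ac
print(utf32_bytes.hex())  # 000020ac
```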

A human computer, with microscope and calculator, 1952

Universal Coded Character Set


The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard), which is the basis of many character encodings, improving as characters from previously unrepresented writing systems are added.

The original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP.
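How UTF-16 reaches code points outside the BMP can be sketched as follows. The function name `to_utf16_code_units` is hypothetical; the arithmetic is the standard surrogate-pair construction: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
from typing import List

def to_utf16_code_units(code_point: int) -> List[int]:
    """Return the 16-bit code units that represent code_point in UTF-16."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        # BMP code points fit in a single code unit, exactly as in UCS-2.
        return [code_point]
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return [high, low]

print([hex(u) for u in to_utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The result agrees with Python's own codec: `"\U0001F600".encode("utf-16-be")` yields the bytes `d8 3d de 00`.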


Byte order mark



In UTF-16, a BOM may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
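A minimal sketch of BOM sniffing, using the BOM constants from Python's standard `codecs` module. The function name `detect_utf16_byte_order` is hypothetical, and the sketch assumes the input is already known to be UTF-16: the little-endian BOM bytes `FF FE` are also the first two bytes of the UTF-32 little-endian BOM, so the check alone cannot tell the two apart.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Guess UTF-16 byte order from a leading BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown"                            # no BOM; order must come from elsewhere

sample_le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
sample_be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

print(detect_utf16_byte_order(sample_le))  # little-endian
print(detect_utf16_byte_order(sample_be))  # big-endian
print(detect_utf16_byte_order(b"h\x00"))   # unknown
```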


Variable-width encoding


A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation, usually in a computer.

The Unicode standard has two variable-width encodings: UTF-8 and UTF-16 (it also has a fixed-width encoding, UTF-32).
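The difference between the variable-width and fixed-width forms shows up when counting code units per character. The sketch below uses Python's built-in codecs; the helper `code_unit_counts` and the sample characters are chosen only for illustration.

```python
from typing import Tuple

def code_unit_counts(ch: str) -> Tuple[int, int, int]:
    """Return the number of UTF-8, UTF-16, and UTF-32 code units for one character."""
    utf8_units = len(ch.encode("utf-8"))            # 8-bit units
    utf16_units = len(ch.encode("utf-16-be")) // 2  # 16-bit units
    utf32_units = len(ch.encode("utf-32-be")) // 4  # 32-bit units
    return utf8_units, utf16_units, utf32_units

print(code_unit_counts("A"))           # (1, 1, 1)
print(code_unit_counts("\u00e9"))      # (2, 1, 1)  e with acute accent
print(code_unit_counts("\u20ac"))      # (3, 1, 1)  euro sign
print(code_unit_counts("\U0001F600"))  # (4, 2, 1)  outside the BMP
```

UTF-8 ranges from one to four units and UTF-16 from one to two, while UTF-32 always uses exactly one.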


CCSID


A CCSID (coded character set identifier) is a 16-bit number that represents a particular encoding of a specific code page.

For example, Unicode is a code page that has several encoding (so-called "transformation") forms, such as UTF-8, UTF-16, and UTF-32, and it may or may not be accompanied by a CCSID number to indicate which encoding is being used.

A screenshot of Manjaro running the Cinnamon desktop environment, Firefox accessing Wikipedia which uses MediaWiki, LibreOffice Writer, Vim, GNOME Calculator, VLC and Nemo file manager, all of which are open-source software.

International Components for Unicode

Open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization.



ICU provides the following services: Unicode text handling, full character properties, and character set conversions; Unicode regular expressions; full Unicode sets; character, word, and line boundaries; language-sensitive collation and searching; normalization, upper and lowercase conversion, and script transliterations; comprehensive locale data and resource bundle architecture via the Common Locale Data Repository (CLDR); multiple calendars and time zones; and rule-based formatting and parsing of dates, times, numbers, currencies, and messages.

ICU has historically used UTF-16, and the Java version still uses only UTF-16; the C/C++ version also supports UTF-8, including the correct handling of "illegal UTF-8".