A report on UTF-8

Declared character set for the 10 million most popular websites since 2010
Use of the main encodings on the web from 2001 to 2012 as recorded by Google, with UTF-8 overtaking all others in 2008 and exceeding 60% of the web in 2012 (since then approaching 100%). The ASCII-only figure includes all web pages that contain only ASCII characters, regardless of the declared header. Other encodings, such as GB2312, are counted under "others".

Variable-width character encoding used for electronic communication.

41 related topics with Alpha

Logo of the Unicode Consortium

Unicode

15 links

Information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

Many modern applications can render a substantial subset of the many scripts in Unicode, as demonstrated by this screenshot from the OpenOffice.org application.
Various Cyrillic characters shown with upright, oblique and italic alternate forms

The most common encodings are the ASCII-compatible UTF-8, the obsolete UCS-2, the UCS-2-compatible UTF-16, and GB18030, which is not an official Unicode standard but is used in China and implements Unicode fully.

Punched tape with the word "Wikipedia" encoded in ASCII. Presence and absence of a hole represent 1 and 0, respectively; for example, "W" is encoded as "1010111".

Character encoding

12 links

Process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers.
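
As a small illustration of assigning numbers to characters, the sketch below reproduces the punched-tape example, in which "W" is encoded as "1010111". The helper name `to_bits` is an assumption made for this example and is not part of any standard.

```python
def to_bits(ch, width=7):
    """Return the ASCII code of a single character as a bit string."""
    code = ord(ch)  # the number assigned to the character
    if code > 0x7F:
        raise ValueError("not an ASCII character")
    return format(code, f"0{width}b")

# "W" has code 87, which is 1010111 in binary.
print(to_bits("W"))

# One 7-bit pattern per character of the word on the tape.
word_bits = [to_bits(c) for c in "Wikipedia"]
print(word_bits)
```

Reading the bits back with `chr(int(bits, 2))` recovers the original character, which is the decoding half of the same mapping.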

Hollerith 80-column punch card with EBCDIC character set

A code unit in UTF-8, EBCDIC and GB 18030 consists of 8 bits.
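
To make the idea of a code unit concrete, the following sketch counts how many code units a string needs in UTF-8 (8-bit units) and, for comparison, in UTF-16 (16-bit units). The function name `code_units` is hypothetical; the sketch relies only on Python's built-in codecs.

```python
def code_units(s):
    """Return (UTF-8 code units, UTF-16 code units) needed to encode s."""
    utf8_units = len(s.encode("utf-8"))          # one byte per 8-bit unit
    utf16_units = len(s.encode("utf-16-le")) // 2  # two bytes per 16-bit unit
    return utf8_units, utf16_units

for sample in ("A", "\u20ac", "\U0001F600"):
    print(repr(sample), code_units(sample))
```

The explicit `utf-16-le` codec is used so that no byte order mark is added to the count.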

The first 2^16 Unicode code points. The solid gray stripe near the bottom shows the surrogate halves used by UTF-16 (the white region below the stripe is the Private Use Area).

UTF-16

11 links

Character encoding capable of encoding all 1,112,064 valid character code points of Unicode (in fact this number of code points is dictated by the design of UTF-16).
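
The figure of 1,112,064 follows from the surrogate mechanism: the 17 planes hold 0x110000 code points, and the 2,048 surrogate values are reserved for pairing. The sketch below checks that arithmetic and shows how a code point above the Basic Multilingual Plane is split into a surrogate pair. The names `valid_code_point_count` and `to_utf16_units` are assumptions made for this illustration.

```python
SURROGATES = range(0xD800, 0xE000)  # 2,048 values reserved for UTF-16 pairs

def valid_code_point_count():
    """Code points in 17 planes, minus the surrogate range."""
    return 0x110000 - len(SURROGATES)

def to_utf16_units(cp):
    """Return the list of 16-bit code units that encode code point cp."""
    if not 0 <= cp <= 0x10FFFF or cp in SURROGATES:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:
        return [cp]                      # BMP: a single code unit
    v = cp - 0x10000                     # 20-bit offset above the BMP
    high = 0xD800 + (v >> 10)            # top 10 bits
    low = 0xDC00 + (v & 0x3FF)           # bottom 10 bits
    return [high, low]

print(valid_code_point_count())
print([hex(u) for u in to_utf16_units(0x1F600)])
```

Two 10-bit halves give 2^20 supplementary code points, which together with the 2^16 values of the BMP yields the 0x110000 total used above.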

Since May 2019, Microsoft has supported UTF-8 (as well as UTF-16) and encouraged its use.

ASCII chart from a pre-1972 printer manual

ASCII

9 links

Character encoding standard for electronic communication.

ASCII (1963). Control pictures of equivalent controls are shown where they exist, or a grey dot otherwise.

ASCII was the most common character encoding on the World Wide Web until December 2007, when UTF-8 encoding surpassed it; UTF-8 is backward compatible with ASCII.
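
Backward compatibility here means that any ASCII text is already valid UTF-8 with identical bytes, while non-ASCII characters use only bytes with the high bit set. A minimal check, with the hypothetical helper `same_bytes_in_ascii_and_utf8`:

```python
def same_bytes_in_ascii_and_utf8(text):
    """True if text is pure ASCII and encodes to the same bytes in UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        return False  # text contains a non-ASCII character

print(same_bytes_in_ascii_and_utf8("Hello, world"))
print(same_bytes_in_ascii_and_utf8("caf\u00e9"))

# Every byte of a non-ASCII character is 0x80 or above in UTF-8,
# so it can never be mistaken for an ASCII character.
print(all(b >= 0x80 for b in "\u00e9".encode("utf-8")))
```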

Byte order mark

4 links

Optional.

Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.
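
One way software can tolerate a leading BOM is to drop it before further processing. The sketch below shows this with Python's standard `codecs.BOM_UTF8` constant and the `utf-8-sig` codec; the function name `strip_utf8_bom` is an assumption for this example.

```python
import codecs

def strip_utf8_bom(data):
    """Remove a leading UTF-8 byte order mark (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

raw = codecs.BOM_UTF8 + b"#!/bin/sh\n"

# A plain UTF-8 decode keeps the BOM as the character U+FEFF.
print(repr(raw.decode("utf-8")))

# Stripping the bytes, or decoding with "utf-8-sig", removes it.
print(repr(strip_utf8_bom(raw).decode("utf-8")))
print(repr(raw.decode("utf-8-sig")))
```

The shell-script sample illustrates the kind of input where non-ASCII bytes at the start of a file are unexpected.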

ISO/IEC 8859-1 code page layout

ISO/IEC 8859-1

5 links

Part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987.

This is sometimes assumed to be the encoding of text on Microsoft Windows (and Unix) if there is no byte order mark (BOM); this is only gradually being changed to UTF-8.
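
A decoder that has to guess between the two encodings might be sketched as follows. This is only one possible policy and an assumption of this example, not a description of how Windows or Unix tools behave: it honours a UTF-8 BOM, then tries strict UTF-8, and only then falls back to ISO/IEC 8859-1. The name `decode_with_fallback` is hypothetical.

```python
import codecs

def decode_with_fallback(data):
    """Return (text, encoding_used) for a byte string of unknown encoding."""
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig"), "utf-8-sig"
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO/IEC 8859-1 maps every byte to a character, so this never fails.
        return data.decode("iso-8859-1"), "iso-8859-1"

print(decode_with_fallback(b"caf\xe9"))        # not valid UTF-8
print(decode_with_fallback(b"caf\xc3\xa9"))    # valid UTF-8
```

Trying UTF-8 first works reasonably well because text in an 8-bit encoding rarely happens to form valid UTF-8 sequences, whereas the reverse fallback would silently accept any input.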

IBM code page numbers (CPGIDs and CCSIDs) used for CJK encodings. Microsoft use of code page numbers for CJK encodings differs, and is noted in brackets where applicable.

Code page

7 links

A character encoding; as such, it is a specific association of a set of printable characters and control characters with unique numbers.

Vendors that use a code page system allocate their own code page number to a character encoding, even if it is better known by another name; for example, UTF-8 has been assigned page numbers 1208 at IBM, 65001 at Microsoft, and 4110 at SAP.

Windows code page

5 links

Windows code pages are sets of characters or code pages (known as character encodings in other operating systems) used in Microsoft Windows from the 1980s and 1990s.

This method gives every Unicode character in the Basic Multilingual Plane (BMP) a unique code and uses a 32-bit (four-byte) code for the others, but the rest of the industry (Unix-like systems and the web) chose UTF-8, which uses one byte for the 7-bit ASCII character set, two or three bytes for other characters in the BMP, and four bytes for the remainder.
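
The UTF-8 byte counts quoted above follow directly from the code point ranges. The sketch below computes the length from the code point alone and compares it with Python's own encoder; `utf8_length` is a name chosen for this example.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a single code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:
        return 1          # 7-bit ASCII
    if code_point < 0x800:
        return 2          # rest of the BMP, lower part
    if code_point < 0x10000:
        return 3          # rest of the BMP, upper part
    return 4              # supplementary planes

# Cross-check against the built-in encoder for one sample per range.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")
```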

A map of the Basic Multilingual Plane. Each numbered box represents 256 code points.

Plane (Unicode)

3 links

Continuous group of 65,536 code points.

A map of the Supplementary Multilingual Plane. Each numbered box represents 256 code points.
A map of the Supplementary Ideographic Plane. Each numbered box represents 256 code points.
A map of the Tertiary Ideographic Plane. Each numbered box represents 256 code points.
A map of the Supplementary Special-purpose Plane. Each numbered box represents 256 code points.

UTF-8 was designed with a much larger limit of 2^31 (2,147,483,648) code points (32,768 planes), and would still be able to encode 2^21 (2,097,152) code points (32 planes) even under the current limit of 4 bytes.
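
These limits can be derived from the bit layout of a UTF-8 sequence: in a multi-byte sequence of n bytes the lead byte carries 7 - n payload bits and each continuation byte carries 6. The sketch below works through that arithmetic, assuming the original design's maximum of 6 bytes; the function names are hypothetical.

```python
PLANE_SIZE = 2 ** 16  # code points per plane

def payload_bits(n_bytes):
    """Payload bits available in a UTF-8 sequence of n_bytes (1 to 6)."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("sequence length must be between 1 and 6")
    if n_bytes == 1:
        return 7
    return (7 - n_bytes) + 6 * (n_bytes - 1)

def planes_encodable(n_bytes):
    """Number of whole planes addressable with sequences up to n_bytes."""
    return 2 ** payload_bits(n_bytes) // PLANE_SIZE

for n in (4, 6):
    bits = payload_bits(n)
    print(f"{n} bytes: {bits} bits, {2 ** bits:,} code points, "
          f"{planes_encodable(n):,} planes")
```

Unicode itself uses only 17 of the 32 planes that a 4-byte sequence can address.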

Thompson (left) with Dennis Ritchie

Ken Thompson

2 links

American pioneer of computer science.

DEC PDP-7, as used for initial work on Unix
Thompson (sitting) and Ritchie working together at a PDP-11
Version 6 Unix running on the SIMH PDP-11 simulator, with "/usr/ken" still present
Plan 9 from Bell Labs, running the acme text editor, and the rc shell

Other notable contributions included his work on regular expressions and early computer text editors QED and ed, the definition of the UTF-8 encoding, and his work on computer chess that included the creation of endgame tablebases and the chess machine Belle.