Unicode

Unicode Standard, Unicode Transformation Format, The Unicode Standard, Unicode 9.0, U, U+, Unicode 6.0, Unicode 8.0, Bulldog Award, Unicode 5.1
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

Script (Unicode)

Common, Inherited, scripts
The standard is maintained by the Unicode Consortium, and the most recent version, Unicode 12.1, contains a repertoire of 137,994 characters covering 150 modern and historic scripts, as well as multiple symbol sets and emoji.
In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems.

UTF-8

65001, Unicode (UTF-8), AL32UTF8
The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use.
UTF-8 (8-bit Unicode Transformation Format) is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.
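
The one-to-four-byte range can be checked with Python's built-in UTF-8 codec. This is a minimal sketch; the sample characters and the helper name `utf8_length` are illustrative choices, not part of the source. The last lines derive the 1,112,064 figure, assuming (as the Unicode Standard specifies) that the 2,048 surrogate code points U+D800 to U+DFFF are the ones excluded from the full codespace.

```python
# Sample characters chosen to hit each UTF-8 length (illustrative only).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


lengths = [utf8_length(ch) for ch in samples]

# Full codespace minus the surrogate range U+D800..U+DFFF.
codespace_size = 0x10FFFF + 1
surrogate_count = 0xDFFF - 0xD800 + 1
valid_code_points = codespace_size - surrogate_count

print(lengths)            # [1, 2, 3, 4]
print(valid_code_points)  # 1112064
```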

Unicode Consortium

Unicode Technical Committee, The Unicode Consortium, Unicode
The standard is maintained by the Unicode Consortium, and the most recent version, Unicode 12.1, contains a repertoire of 137,994 characters covering 150 modern and historic scripts, as well as multiple symbol sets and emoji.
The Unicode Consortium's primary purpose is to maintain and publish the Unicode Standard, which was developed with the intention of replacing existing character encoding schemes that are limited in size and scope and are incompatible with multilingual environments.

Plane (Unicode)

Basic Multilingual Plane, Supplementary Multilingual Plane, BMP
UCS-2 uses two bytes (16 bits) for each character but can only encode the first 65,536 code points, the so-called Basic Multilingual Plane (BMP).
In the Unicode standard, a plane is a continuous group of 65,536 (2^16) code points.
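
Because each plane holds 65,536 code points, the plane number of a code point is simply its value divided by 65,536, which is a 16-bit right shift. The helper `plane_of` below is a hypothetical name used for illustration only.

```python
PLANE_SIZE = 1 << 16  # 65,536 code points per plane
MAX_CODE_POINT = 0x10FFFF


def plane_of(code_point: int) -> int:
    """Return the plane index (0 = Basic Multilingual Plane) of a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16


plane_count = (MAX_CODE_POINT + 1) // PLANE_SIZE

print(plane_of(ord("A")))    # 0, inside the BMP
print(plane_of(0x1F600))     # 1, the Supplementary Multilingual Plane
print(plane_count)           # 17
```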

UTF-16

UTF-16BE, UTF-16LE, surrogate pair
The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use.
UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode.
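
UTF-16 reaches code points above U+FFFF by splitting them into a pair of 16-bit surrogate code units. The sketch below shows the arithmetic and cross-checks it against Python's `utf-16-be` codec; the function name `to_surrogate_pair` is an illustrative assumption, not an API from the source.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


pair = to_surrogate_pair(0x1F600)

# Cross-check against the standard library codec.
encoded = chr(0x1F600).encode("utf-16-be")
codec_pair = (
    int.from_bytes(encoded[:2], "big"),
    int.from_bytes(encoded[2:], "big"),
)

print([hex(unit) for unit in pair])  # ['0xd83d', '0xde00']
```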

XML

Extensible Markup Language, XML document, XML parser
The standard has been implemented in many recent technologies, including modern operating systems, XML, Java (and other programming languages), and the .NET Framework.
XML is a textual data format with strong support, via Unicode, for different human languages.

Emoji

Animoji, emojis, Emoji characters
The standard is maintained by the Unicode Consortium, and the most recent version, Unicode 12.1, contains a repertoire of 137,994 characters covering 150 modern and historic scripts, as well as multiple symbol sets and emoji.
The set of 90 emoji included many that would later be added to the Unicode Standard, such as Pile of Poo, but as the phone was very expensive they were not widely used at the time.

Han unification

Unihan, Unihan Database, CJK Unified Ideographs
In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).
Han unification is an effort by the authors of Unicode and the Universal Character Set to map multiple character sets of the so-called CJK languages into a single set of unified characters.

UTF-32

UCS-4, UTF-32BE, UTF-32LE
The Unicode standard defines UTF-8, UTF-16, and UTF-32, and several other encodings are in use.
UTF-32 (32-bit Unicode Transformation Format) is a fixed-length encoding used to encode Unicode code points that uses exactly 32 bits (four bytes) per code point (but a number of leading bits must be zero as there are far fewer than 2^32 Unicode code points).
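
The fixed width is easy to observe with Python's `utf-32-be` codec (the big-endian variant is used here so that no byte order mark is added). The sample string is an arbitrary illustration.

```python
text = "A\u20ac\U0001F600"  # three code points: A, euro sign, emoji

encoded = text.encode("utf-32-be")

# Every code point occupies exactly four bytes, whatever its value.
bytes_per_code_point = len(encoded) // len(text)

# The leading byte of each unit is always zero, since no code point exceeds 0x10FFFF.
leading_bytes = [encoded[i] for i in range(0, len(encoded), 4)]

print(len(encoded))           # 12
print(bytes_per_code_point)   # 4
print(leading_bytes)          # [0, 0, 0]
```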

Duplicate characters in Unicode

For other examples, see duplicate characters in Unicode.
Unicode has a certain amount of duplication of characters.
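
One well-known case of duplication is U+212B ANGSTROM SIGN, which is canonically equivalent to U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE. The sketch below uses Unicode normalization (NFC) from Python's `unicodedata` module to fold such duplicates together. Normalization is brought in here as one possible way to handle duplicates, and the chosen examples are illustrative rather than taken from the source.

```python
import unicodedata

angstrom_sign = "\u212b"   # ANGSTROM SIGN
a_with_ring = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
ohm_sign = "\u2126"        # OHM SIGN
greek_omega = "\u03a9"     # GREEK CAPITAL LETTER OMEGA

# The raw code points differ, so a plain comparison fails.
raw_equal = angstrom_sign == a_with_ring

# After NFC normalization the duplicates collapse to one code point.
nfc_angstrom = unicodedata.normalize("NFC", angstrom_sign)
nfc_ohm = unicodedata.normalize("NFC", ohm_sign)

print(raw_equal)                    # False
print(nfc_angstrom == a_with_ring)  # True
print(nfc_ohm == greek_omega)       # True
```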

ISO/IEC 8859

ISO 8859, ECMA-94, ISO-8859
Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO 8859 standard, which find wide usage in various countries of the world but remain largely incompatible with each other.
The ISO/IEC 8859 standard is designed for reliable information exchange, not typography; the standard omits symbols needed for high-quality typography, such as optional ligatures, curly quotation marks, dashes, etc. As a result, high-quality typesetting systems often use proprietary or idiosyncratic extensions on top of the ASCII and ISO/IEC 8859 standards, or use Unicode instead.

GB 18030

GB18030, Code page 54936, 1392
The most commonly used encodings are UTF-8, UTF-16, and UCS-2, a precursor of UTF-16 that lacks full support for Unicode; GB18030 is standardized in China and implements Unicode fully, although it is not an official Unicode standard.
As a Unicode Transformation Format (i.e. an encoding of all Unicode code points), GB18030 supports both simplified and traditional Chinese characters.
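
Python ships a `gb18030` codec, so the claim that GB18030 can carry any code point can be spot-checked with a round trip. This is a rough sketch with arbitrarily chosen sample characters, and a few samples do not prove full coverage.

```python
samples = {
    "ascii": "A",
    "cjk": "\u4e2d",           # a common Chinese character
    "emoji": "\U0001F600",     # a supplementary-plane code point
}

# Byte length of each sample in GB18030 (the encoding is variable width).
byte_lengths = {name: len(ch.encode("gb18030")) for name, ch in samples.items()}

# Encoding and decoding again should return the original text unchanged.
round_trips = all(
    ch.encode("gb18030").decode("gb18030") == ch for ch in samples.values()
)

print(byte_lengths)  # {'ascii': 1, 'cjk': 2, 'emoji': 4}
print(round_trips)   # True
```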

List of Egyptian hieroglyphs

designated symbol, justified, Owl
For code points outside the BMP, five or six digits are used as required, e.g. U+13254 for the Egyptian hieroglyph designating a reed shelter or a winding wall.
The Unicode Egyptian Hieroglyphs block (Unicode version 5.2, 2009) includes 1071 signs, with organisation based on Gardiner's list.
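
The "U+" notation pads the hexadecimal value to at least four digits and uses five or six only when needed. A small formatter makes the rule concrete; `u_plus` is a hypothetical helper name.

```python
def u_plus(code_point: int) -> str:
    """Format a code point in U+ notation: at least four uppercase hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return f"U+{code_point:04X}"


print(u_plus(ord("A")))   # U+0041
print(u_plus(0x13254))    # U+13254
print(u_plus(0x10FFFF))   # U+10FFFF
```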

Joe Becker (Unicode)

Joe Becker
Based on experiences with the Xerox Character Code Standard (XCCS) since 1980, the origins of Unicode date to 1987, when Joe Becker from Xerox, with Lee Collins and Mark Davis from Apple, started investigating the practicalities of creating a universal character set.
Joseph D. Becker is one of the co-founders of the Unicode project, and a Technical Vice President Emeritus of the Unicode Consortium.

Mark Davis (Unicode)

Mark Davis
Based on experiences with the Xerox Character Code Standard (XCCS) since 1980, the origins of Unicode date to 1987, when Joe Becker from Xerox, with Lee Collins and Mark Davis from Apple, started investigating the practicalities of creating a universal character set.
Mark Davis is one of the key technical contributors to the Unicode specifications, being the primary author or co-author of the Bidirectional Algorithm (used worldwide to display Arabic and Hebrew text), Collation (used for sorting and searching), Normalization, Scripts, Text segmentation, Identifiers, Regular Expressions, Compression, Character Conversion, and Security.

Lee Collins (Unicode)

Lee Collins
Based on experiences with the Xerox Character Code Standard (XCCS) since 1980, the origins of Unicode date to 1987, when Joe Becker from Xerox, with Lee Collins and Mark Davis from Apple, started investigating the practicalities of creating a universal character set.
Lee Collins is one of the three software engineers who created Unicode in late 1987, the other two being Joe Becker and Mark Davis.

Internationalization and localization

localization, localized, internationalization
Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software.

Mojibake

displayed incorrectly, erroneously doubly-encoded UTF-8, garbage characters
In practice, the C1 code points are often improperly translated (Mojibake) legacy CP-1252 characters used by some English and Western European texts with Windows technologies.
The differing default settings between computers are in part due to differing deployments of Unicode among operating system families, and partly the legacy encodings' specializations for different writing systems of human languages.
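
A common form of mojibake appears when UTF-8 bytes are decoded as CP-1252. The sketch below reproduces the effect and then reverses it. The sample word is arbitrary, and the repair step only works when the wrong decoding happened to be lossless, which is an assumption that does not hold for every input.

```python
original = "caf\u00e9"  # "café"

# Mis-decoding: the two UTF-8 bytes of the accented letter become two characters.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the mistake: re-encode with the wrong codec, decode with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)               # cafÃ©
print(repaired == original)  # True
```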

Unicode character property

General Category, attested names in Unicode, bidirectional writing
Each code point has a single General Category property.
The Unicode Standard assigns character properties to each code point.
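
Python's `unicodedata` module exposes the General Category as a two-letter code, which gives a quick way to inspect the property. The sample characters are illustrative.

```python
import unicodedata

samples = ["A", "a", "1", " ", "\u20ac"]  # last one is the euro sign

# Map each character to its two-letter General Category code.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} -> {cat}")
# U+0041 -> Lu  (Letter, uppercase)
# U+0061 -> Ll  (Letter, lowercase)
# U+0031 -> Nd  (Number, decimal digit)
# U+0020 -> Zs  (Separator, space)
# U+20AC -> Sc  (Symbol, currency)
```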

Unicode block

block, blocks, Unicode
Within each plane, characters are allocated within named blocks of related characters.
A Unicode block is one of several contiguous ranges of numeric character codes (code points) of the Unicode character set that are defined by the Unicode Consortium for administrative and documentation purposes.

Latin script

Latin, Latin alphabet, Roman script
Many traditional character encodings share a common problem in that they allow bilingual computer processing (usually using Latin characters and the local script), but not multilingual computer processing (computer processing of arbitrary scripts mixed with each other).
Unicode uses the term "Latin" as does the International Organization for Standardization (ISO).

Hexadecimal

hex, 0x, 16
Unicode's codespace is a range of integers from 0 to hexadecimal 10FFFF, which amounts to 1,114,112 numbers called "code points" available for assignment to the repertoire of abstract characters.

Bopomofo

Zhuyin, Zhuyin Fuhao, Mandarin Phonetic Symbols
Zhuyin Fuhao and Zhuyin are traditional terms, whereas Bopomofo is the colloquial term, also used by the ISO and Unicode.

Cyrillic script

Cyrillic, Cyrillic alphabet, Uzbek Cyrillic
In accordance with Unicode policy, the standard does not include letterform variations or ligatures found in manuscript sources unless they can be shown to conform to the Unicode definition of a character.

Code point

codepoint, code points, character codes
Unicode's codespace is a range of integers from 0 to hexadecimal 10FFFF, which amounts to 1,114,112 numbers called "code points" available for assignment to the repertoire of abstract characters. UTF-8, the dominant encoding on the World Wide Web (used in over 94% of websites), uses one byte for the first 128 code points, and up to four bytes for other characters.
For example, the character encoding scheme ASCII comprises 128 code points in the range 0 hex to 7F hex, Extended ASCII comprises 256 code points in the range 0 hex to FF hex, and Unicode comprises 1,114,112 code points in the range 0 hex to 10FFFF hex.
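
The sizes quoted above follow directly from the hexadecimal upper bounds: a range from 0 to N contains N + 1 values. The short check below also shows that Python's `chr` rejects values beyond the codespace. The variable names are illustrative.

```python
ascii_size = 0x7F + 1             # 128
extended_ascii_size = 0xFF + 1    # 256
unicode_size = 0x10FFFF + 1       # 1,114,112

# The Unicode codespace is exactly 17 planes of 65,536 code points each.
planes_times_size = 17 * 65536

# chr() accepts the last code point but raises ValueError one past it.
last_ok = chr(0x10FFFF)
try:
    chr(0x110000)
    beyond_rejected = False
except ValueError:
    beyond_rejected = True

print(ascii_size, extended_ascii_size, unicode_size)  # 128 256 1114112
print(beyond_rejected)                                # True
```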