Character encoding

character set, Computer encodings, encoding, character sets, text encoding, charset, code unit, character encodings, coded character set, encoded
Character encoding is used to represent a repertoire of characters by some kind of encoding system.
Code point

codepoint, code points, character codes
Depending on the abstraction level and context, corresponding code points and the resulting code space may be regarded as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage, and transmission of textual data.
In character encoding terminology, a code point or code position is any of the numerical values that make up the code space.
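As a small illustration of code points as natural numbers, the sketch below uses Python's built-in `ord` and `chr`, which work with Unicode code points. The sample string "AB" is chosen only for illustration.

```python
# Look up the numeric code point of each character, then map back.
sample = "AB"  # illustrative input, not taken from the source

code_points = [ord(ch) for ch in sample]            # characters -> integers
restored = "".join(chr(cp) for cp in code_points)   # integers -> characters

for ch, cp in zip(sample, code_points):
    # Show the decimal value and the conventional U+XXXX notation.
    print(ch, cp, f"U+{cp:04X}")
```

In this view the code space is simply the range of integers that `chr` accepts.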

Code page

codepage, code pages, OEM character set
"Character set", "character map", "codeset" and "code page" are related, but not identical, terms.
In computing, a code page is a character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers.

Unicode

Unicode Standard, U, U+
The low cost of digital representation of data in modern computer systems allows more elaborate character codes (such as Unicode) which represent most of the characters used in many written languages. Common examples of character encoding systems include Morse code, the Baudot code, the American Standard Code for Information Interchange (ASCII) and Unicode. Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a modern, unified character encoding.
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

Computer data storage

main memory, storage, memory
Depending on the abstraction level and context, corresponding code points and the resulting code space may be regarded as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage, and transmission of textual data.
Many standards exist for encoding (e.g., character encodings like ASCII, image encodings like JPEG, video encodings like MPEG-4).

Chinese telegraph code

Chinese Commercial Code, Chinese telegraph dictionary, code
The earliest codes were based upon manual and hand-written encoding and cyphering systems, such as Bacon's cipher, Braille, International maritime signal flags, and the 4-digit encoding of Chinese characters for a Chinese telegraph code (Hans Schjellerup, 1869).
The Chinese telegraph code, Chinese telegraphic code, or Chinese commercial code is a four-digit decimal code (character encoding) for electrically telegraphing messages written with Chinese characters.

EBCDIC

Extended Binary Coded Decimal Interchange Code
IBM's Extended Binary Coded Decimal Interchange Code (usually abbreviated as EBCDIC) is an eight-bit encoding scheme developed in 1963. A code unit in UTF-8, EBCDIC and GB18030 consists of 8 bits.
Extended Binary Coded Decimal Interchange Code (EBCDIC) is an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems.

Variable-width encoding

MBCS, multi-byte, multi-byte character set
To encode code points beyond the range of a single code unit, such as above 255 for 8-bit units, the solution was to implement variable-width encodings where an escape sequence would signal that subsequent bits should be parsed as a higher code point.
A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation in a computer.

Character (computing)

character, characters, text
Character encoding is used to represent a repertoire of characters by some kind of encoding system.
Computers and communication equipment represent characters using a character encoding that assigns each character to something — an integer quantity represented by a sequence of digits, typically — that can be stored or transmitted through a network.

String (computer science)

string, strings, character string
Example of a code unit: Consider a string of the letters "abc" followed by a character outside the Basic Multilingual Plane (represented with 1 char32_t, 2 char16_t or 4 char8_t).
A string is generally considered a data type and is often implemented as an array data structure of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding.
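The code unit counts in the example can be checked with a short sketch. It assumes U+10400 as the trailing character (any character above U+FFFF would behave the same way), and `code_unit_counts` is a hypothetical helper name introduced here.

```python
def code_unit_counts(text: str) -> tuple[int, int, int]:
    """Return the number of UTF-32, UTF-16 and UTF-8 code units in text."""
    # The "-le" codec variants are used so that no byte order mark is added.
    utf32_units = len(text.encode("utf-32-le")) // 4  # 32-bit units
    utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit units
    utf8_units = len(text.encode("utf-8"))            # 8-bit units
    return utf32_units, utf16_units, utf8_units

trailing = "\U00010400"  # assumed example character above U+FFFF
print(code_unit_counts(trailing))          # counts for the single character
print(code_unit_counts("abc" + trailing))  # counts for the whole string
```

Each ASCII letter takes one code unit in all three forms, so only the trailing character changes the totals between encodings.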

UTF-8

65001, Unicode (UTF-8), code page 65001
A code unit in UTF-8, EBCDIC and GB18030 consists of 8 bits. Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE or UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using byte order marks or escape sequences; compressing schemes try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.
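The one-to-four-byte range and the 1,112,064 figure can both be checked with a brief sketch. The four sample characters are arbitrary picks, one for each encoded length, and are not taken from the source.

```python
# One sample character per UTF-8 length (illustrative choices).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Number of 8-bit code units that UTF-8 needs for each sample.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)

# Valid code points: the full range U+0000..U+10FFFF minus the
# 2,048 surrogate code points U+D800..U+DFFF, which UTF-8 cannot encode.
total_code_points = 0x10FFFF + 1
surrogates = 0xDFFF - 0xD800 + 1
valid_code_points = total_code_points - surrogates
print(valid_code_points)
```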

ASCII

7-bit ASCII, American Standard Code for Information Interchange, ASCII printable characters
Common examples of character encoding systems include Morse code, the Baudot code, the American Standard Code for Information Interchange (ASCII) and Unicode.
ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication.

Universal Coded Character Set

ISO 10646, UCS, IEC standard 10646
Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a modern, unified character encoding.
The Universal Coded Character Set (UCS) is a standard set of characters defined by the International Standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard), which is the basis of many character encodings.

Fieldata

Field data
In 1959 the U.S. military defined its Fieldata code, a six- or seven-bit code, introduced by the U.S. Army Signal Corps.
Much of the FIELDATA system was the specifications for the format the data would take, leading to a character set that would be a huge influence on ASCII a few years later.

ISO/IEC 8859-1

Latin-1-charset, Latin-1, ISO-8859-1
For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same repertoire; for example ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map them to different code points.
ISO/IEC 8859-1:1998, Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1, is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987.
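The shared-repertoire point can be illustrated with the codecs that ship with Python, on the assumption that its `iso-8859-1`, `cp037` and `cp500` tables are a fair stand-in for the three coded character sets named above. `repertoire` is a hypothetical helper name.

```python
CODECS = ("iso-8859-1", "cp037", "cp500")

def repertoire(codec: str) -> set[str]:
    """Return the set of characters reachable from the 256 byte values."""
    return set(bytes(range(256)).decode(codec))

# The same letter is mapped to different byte values by each codec.
for codec in CODECS:
    print(codec, "A".encode(codec).hex(), "!".encode(codec).hex())

# Compare the character sets themselves, ignoring the byte assignments.
same_repertoire = (
    repertoire("iso-8859-1") == repertoire("cp037") == repertoire("cp500")
)
print(same_repertoire)
```

The two EBCDIC code pages agree on the letters and differ only in a handful of punctuation positions, which is why "!" is printed alongside "A".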

UTF-16

1200, 1201, surrogate pair
A code unit in UTF-16 consists of 16 bits. Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE or UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using byte order marks or escape sequences; compressing schemes try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode.
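Code points above U+FFFF do not fit in one 16-bit code unit, so UTF-16 splits them into a surrogate pair. The sketch below shows that arithmetic; `to_surrogate_pair` is a hypothetical helper name, and the result is cross-checked against Python's own `utf-16-be` codec.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x10400)
print(f"{high:04X} {low:04X}")

# Cross-check against the built-in codec (big-endian, no byte order mark).
encoded = "\U00010400".encode("utf-16-be")
builtin_pair = (
    int.from_bytes(encoded[:2], "big"),
    int.from_bytes(encoded[2:], "big"),
)
```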

ISO/IEC 2022

ISO 2022, G0 set, ISO 2022-JP
Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE or UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using byte order marks or escape sequences; compressing schemes try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
a technique for including multiple character sets in a single character encoding system, and

Braille

braille alphabet, braille code, braille script
The earliest codes were based upon manual and hand-written encoding and cyphering systems, such as Bacon's cipher, Braille, International maritime signal flags, and the 4-digit encoding of Chinese characters for a Chinese telegraph code (Hans Schjellerup, 1869).
Braille was the first writing system with binary encoding.

Code page 437

437, codepage 437, CP437
Well-known code page suites are "Windows" (based on Windows-1252) and "IBM"/"DOS" (based on code page 437), see Windows code page for details.
Code page 437 is the character set of the original IBM PC (personal computer).

CJK characters

CJK, CJK encoding, Chinese, Japanese and Korean
The need to support more writing systems for different languages, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding rather than the previous ad hoc approaches.
The number of characters required for complete coverage of all these languages' needs cannot fit in the 256-character code space of 8-bit character encodings, requiring at least a 16-bit fixed width encoding or multi-byte variable-length encodings.

CCSID

IBM's Character Data Representation Architecture (CDRA) designates entities with coded character set identifiers (CCSIDs), each of which is variously called a "charset", "character set", "code page", or "CHARMAP".
A CCSID (coded character set identifier) is a 16-bit number that represents a particular encoding of a specific code page.

Binary Ordered Compression for Unicode

BOCU-1, BOCU
Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE or UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using byte order marks or escape sequences; compressing schemes try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
This Unicode encoding is designed to be useful for compressing short strings, and maintains code point order.

Transcoding

transcode, transcoder, transcodes
As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs have been developed to translate data between encoding schemes as a form of data transcoding.
Transcoding is the direct digital-to-digital conversion of one encoding to another, such as for movie data files (e.g., PAL, SECAM, NTSC), audio files (e.g., MP3, WAV), or character encoding (e.g., UTF-8, ISO/IEC 8859).
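For character data, transcoding usually means decoding bytes to abstract characters and then re-encoding them. A minimal sketch follows, with `transcode` as a hypothetical helper name and "café" as an illustrative sample.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded text from one character encoding to another."""
    # Decode to abstract characters first, then encode in the target scheme.
    return data.decode(source).encode(target)

latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # "café", one byte per character
utf8_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-8")

print(latin1_bytes, len(latin1_bytes))
print(utf8_bytes, len(utf8_bytes))
```

The conversion is lossless only when every character in the input exists in the target encoding. Otherwise `encode` raises `UnicodeEncodeError` unless an `errors` policy is supplied.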

Luit

luit – program that converts encoding of input and output to programs running interactively
luit is a utility program used to translate the character set of a computer program so that its output can be displayed correctly on a terminal emulator that uses a different character set.

Mojibake

displayed incorrectly, erroneously doubly-encoded UTF-8, garbage characters
Mojibake – character set mismap.
Mojibake is the garbled text that is the result of text being decoded using an unintended character encoding.
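A common case can be reproduced by decoding UTF-8 bytes as Windows-1252. This sketch uses "café" as an arbitrary sample. Reversing the mistake works here only because every byte involved is defined in Windows-1252.

```python
original = "caf\u00e9"  # "café"

# Encode correctly as UTF-8, then decode with the wrong encoding.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Undo the mistake by reversing the two steps.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```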

MIME

multipart/form-data, media type, MIME encoded-word
A "character set" in HTTP (and MIME) parlance is the same as a character encoding (but not the same as CCS).
Text in character sets other than ASCII