Character encoding

character set, Computer encodings, encoding, character sets, charset, code unit, character encodings, coded character set, encodings, encoded
Character encoding is used to represent a repertoire of characters by some kind of encoding system.

Code point

codepoint, code points, character codes
Depending on the abstraction level and context, corresponding code points and the resulting code space may be regarded as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage, and transmission of textual data.
In character encoding terminology, a code point or code position is any of the numerical values that make up the code space.
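As a rough illustration, assuming Python 3 (whose built-in ord and chr convert between a character and its Unicode code point), the sketch below prints code points in the conventional U+ notation. The helper name code_point_label is invented for this example.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the code point as a plain integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


print(code_point_label("A"))       # U+0041
print(code_point_label("\u20ac"))  # U+20AC (the euro sign)
print(chr(0x41))                   # A, going from the number back to the character
```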

Code page

codepage, code pages, OEM character set
"Character set", "character map", "codeset" and "code page" are related, but not identical, terms.
In computing, a code page is a character encoding and as such it is a specific association of a set of printable characters and control characters with unique numbers.

Unicode

Unicode Standard, Unicode Transformation Format, The Unicode Standard
The low cost of digital representation of data in modern computer systems allows more elaborate character codes (such as Unicode) which represent most of the characters used in many written languages. Common examples of character encoding systems include Morse code, the Baudot code, the American Standard Code for Information Interchange (ASCII) and Unicode. Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a modern, unified character encoding.
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

Chinese telegraph code

Chinese commercial code, Chinese telegraph dictionary, code
The earliest codes were based upon manual and hand-written encoding and cyphering systems, such as Bacon's cipher, Braille, International maritime signal flags, and the 4-digit encoding of Chinese characters for a Chinese telegraph code (Hans Schjellerup, 1869).
The Chinese telegraph code, Chinese telegraphic code, or Chinese commercial code is a four-digit decimal code (character encoding) for electrically telegraphing messages written with Chinese characters.

Computer data storage

main memory, storage, memory
Depending on the abstraction level and context, corresponding code points and the resulting code space may be regarded as bit patterns, octets, natural numbers, electrical pulses, etc. A character encoding is used in computation, data storage, and transmission of textual data.
Many standards exist for encoding (e.g., character encodings like ASCII, image encodings like JPEG, video encodings like MPEG-4).

Morse code

Morse, International Morse Code, Morse-code
The earliest well-known electrically transmitted character code, Morse code, introduced in the 1840s, used a system of four "symbols" (short signal, long signal, short space, long space) to generate codes of variable length. Common examples of character encoding systems include Morse code, the Baudot code, the American Standard Code for Information Interchange (ASCII) and Unicode.
Morse code is a character encoding scheme used in telecommunication that encodes text characters as standardized sequences of two different signal durations called dots and dashes or dits and dahs.
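The mapping from letters to dot-and-dash sequences can be sketched in Python as below. This is only an illustration: the table covers the 26 basic Latin letters of International Morse Code, the function name to_morse is hypothetical, and the " / " word separator is a common plain-text convention rather than part of the code itself.

```python
# International Morse Code for the 26 basic Latin letters.
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def to_morse(text: str) -> str:
    """Encode letters as dots and dashes (hypothetical helper).

    Letters are separated by one space and words by " / ".
    Characters outside A-Z raise KeyError in this sketch.
    """
    words = text.upper().split()
    return " / ".join(" ".join(MORSE[ch] for ch in word) for word in words)


print(to_morse("SOS"))  # ... --- ...
```

Note that the codes have different lengths (one symbol for E, four for H), which is the variable-length property mentioned above.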

EBCDIC

Extended Binary Coded Decimal Interchange Code
IBM's Extended Binary Coded Decimal Interchange Code (usually abbreviated as EBCDIC) is an eight-bit encoding scheme developed in 1963.
Extended Binary Coded Decimal Interchange Code (EBCDIC) is an eight-bit character encoding used mainly on IBM mainframe and IBM midrange computer operating systems.

Character (computing)

character, characters, text
Character encoding is used to represent a repertoire of characters by some kind of encoding system.
Computers and communication equipment represent characters using a character encoding that assigns each character to something — an integer quantity represented by a sequence of digits, typically — that can be stored or transmitted through a network.

Variable-width encoding

MBCS, multi-byte, multi-byte character set
To encode code points whose values exceed what a single code unit can hold, such as values above 255 for 8-bit units, the solution was to implement variable-width encodings, where an escape sequence would signal that subsequent bits should be parsed as a higher code point.
A variable-width encoding is a type of character encoding scheme in which codes of differing lengths are used to encode a character set (a repertoire of symbols) for representation in a computer.
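A small Python sketch can make the "differing lengths" concrete. It assumes the shift_jis codec that ships with CPython as a representative multi-byte character set; the helper name encoded_lengths is made up for the example.

```python
def encoded_lengths(text: str, encoding: str) -> list[tuple[str, int]]:
    """Return (character, number of bytes) pairs for text in the given encoding."""
    return [(ch, len(ch.encode(encoding))) for ch in text]


# "A" is an ASCII letter; "\u3042" is the hiragana letter A.
for ch, size in encoded_lengths("A\u3042", "shift_jis"):
    print(f"U+{ord(ch):04X} takes {size} byte(s) in Shift JIS")
```

The ASCII letter occupies one byte and the hiragana letter two, so a decoder has to inspect each lead byte to know how many bytes belong to the current character.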

String (computer science)

string, strings, character string
A string is generally considered as a data type and is often implemented as an array data structure of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding.
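The distinction between a sequence of characters and the bytes that store it can be seen in a short Python sketch. Python's str is a sequence of code points rather than bytes, so this is an assumption about one language's model, and the helper name describe is hypothetical.

```python
def describe(s: str, encoding: str) -> tuple[int, int]:
    """Return (number of code points, number of bytes once encoded)."""
    return len(s), len(s.encode(encoding))


word = "na\u00efve"  # "naive" spelled with U+00EF, i with diaeresis

print(describe(word, "utf-8"))    # (5, 6): U+00EF needs two bytes in UTF-8
print(describe(word, "latin-1"))  # (5, 5): one byte per character
```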

UTF-8

65001, Unicode (UTF-8), AL32UTF8
Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE or UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using byte order marks or escape sequences; compressing schemes try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.
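The one-to-four-byte layout can be sketched by hand in Python and checked against the built-in codec. This is an illustrative re-implementation, not how any particular library does it; the name utf8_encode is invented here.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative sketch)."""
    # Surrogates (U+D800..U+DFFF) and values past U+10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid code point for UTF-8")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_encode(ord(ch)).hex(' ')}")

# 1,114,112 code points in the code space minus 2,048 surrogates.
print(0x110000 - 0x800)  # 1112064
```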

ASCII

US-ASCII, American Standard Code for Information Interchange, ASCII code
Common examples of character encoding systems include Morse code, the Baudot code, the American Standard Code for Information Interchange (ASCII) and Unicode.
ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication.

Universal Coded Character Set

ISO 10646, Universal Character Set, ISO/IEC 10646
Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a modern, unified character encoding.
The Universal Coded Character Set (UCS) is a standard set of characters defined by the International Standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard), which is the basis of many character encodings.

Fieldata

Field data, S.A.C. (control code)
In 1959 the U.S. military defined its Fieldata code, a six- or seven-bit code, introduced by the U.S. Army Signal Corps.
Much of the FIELDATA system was the specifications for the format the data would take, leading to a character set that would be a huge influence on ASCII a few years later.

ISO/IEC 8859-1

ISO 8859-1, ISO-8859-1, Latin-1
Multiple coded character sets may share the same repertoire; for example ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map them to different code points.
ISO/IEC 8859-1 is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in 1987.
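The point about a shared repertoire with different code points can be checked with a short Python sketch. It relies on the latin-1, cp037 and cp500 codecs bundled with CPython, so the result reflects those codec tables rather than the standards documents themselves; the helper name repertoire is hypothetical.

```python
def repertoire(encoding: str) -> set[str]:
    """Decode every possible byte value and collect the resulting characters."""
    return set(bytes(range(256)).decode(encoding))


# Same set of 256 characters in all three coded character sets...
print(repertoire("latin-1") == repertoire("cp037") == repertoire("cp500"))

# ...but the same character sits at different code points.
for encoding in ("latin-1", "cp037", "cp500"):
    print(encoding, "A ->", "A".encode(encoding).hex(), " [ ->", "[".encode(encoding).hex())
```

The letter A is 0x41 in ISO/IEC 8859-1 but 0xC1 in both EBCDIC code pages, while the left square bracket differs even between code pages 037 and 500.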

UTF-16

UTF-16BE, UTF-16LE, surrogate pair
Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE or UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using byte order marks or escape sequences; compressing schemes try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode). Characters in the range U+10000 to U+10FFFF in the other planes are called supplementary characters.
UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode.
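For supplementary characters (U+10000 to U+10FFFF), UTF-16 uses a pair of 16-bit code units known as a surrogate pair. The arithmetic can be sketched in Python as follows; the function name to_surrogates is invented for this example, and the result is cross-checked against Python's own utf-16-be codec.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = cp - 0x10000               # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check with the built-in big-endian UTF-16 codec.
assert "\U0001F600".encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```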

ISO/IEC 2022

ISO 2022, ISO-2022-JP, ISO-2022
Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE or UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using byte order marks or escape sequences; compressing schemes try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
Extended Unix Code (EUC) is an 8-bit variable-width character encoding system used primarily for Japanese, Korean, and simplified Chinese.

Braille

braille alphabet, braille typewriter, Braille code
The earliest codes were based upon manual and hand-written encoding and cyphering systems, such as Bacon's cipher, Braille, International maritime signal flags, and the 4-digit encoding of Chinese characters for a Chinese telegraph code (Hans Schjellerup, 1869).

Code page 437

437, CP437, codepage 437
Well-known code page suites are "Windows" (based on Windows-1252) and "IBM"/"DOS" (based on code page 437), see Windows code page for details.
Code page 437 is the character set of the original IBM PC (personal computer).
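Because each code page assigns its own characters to the byte values above 127, the same byte can display differently depending on which code page is assumed. A minimal Python sketch, using the cp437 and cp1252 codecs bundled with CPython (the helper name decode_both is hypothetical):

```python
def decode_both(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes as code page 437 and as Windows-1252."""
    return raw.decode("cp437"), raw.decode("cp1252")


dos, windows = decode_both(b"\xe9")
print(f"cp437:  U+{ord(dos):04X}")      # U+0398, Greek capital theta
print(f"cp1252: U+{ord(windows):04X}")  # U+00E9, e with acute accent
```

Text written under one code page and read under another therefore shows the wrong characters even though no byte has changed.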

CJK characters

CJK, CJK encoding, CJK character
The need to support more writing systems for different languages, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding rather than the previous ad hoc approaches.
The number of characters required for complete coverage of all these languages' needs cannot fit in the 256-character code space of 8-bit character encodings, requiring at least a 16-bit fixed width encoding or multi-byte variable-length encodings.

CCSID

IBM's Character Data Representation Architecture (CDRA) designates entities with coded character set identifiers (CCSIDs), each of which is variously called a "charset", "character set", "code page", or "CHARMAP".
A CCSID (coded character set identifier) is a 16-bit number that represents a particular encoding of a specific code page.

MIME

Multipurpose Internet Mail Extensions, multipart/form-data, media type
A "character set" in HTTP (and MIME) parlance is the same as a character encoding (but not the same as CCS).
Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs.
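In practice the MIME "character set" travels as the charset parameter of the Content-Type header. The sketch below, which assumes Python's standard email package, reads that parameter and uses it to decode a body; the header value and the body bytes are made-up sample data.

```python
from email.message import Message

# Build a message with a sample Content-Type header (illustrative values).
msg = Message()
msg["Content-Type"] = 'text/plain; charset="ISO-8859-1"'

# get_content_charset() returns the charset parameter in lower case.
charset = msg.get_content_charset()
print(charset)  # iso-8859-1

# The charset names the character encoding needed to decode the body bytes.
body = b"caf\xe9"
print(body.decode(charset))
```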

Binary Ordered Compression for Unicode

BOCU-1, BOCU
Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE or UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using byte order marks or escape sequences; compressing schemes try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
This Unicode encoding is designed to be useful for compressing short strings, and maintains code point order.

Transcoding

transcode, transcoder, transcodes
As a result of having many character encoding methods in use (and the need for backward compatibility with archived data), many computer programs have been developed to translate data between encoding schemes as a form of data transcoding.
Transcoding is the direct digital-to-digital conversion of one encoding to another, such as for movie data files, audio files (e.g., MP3, WAV), or character encoding (e.g., UTF-8, ISO/IEC 8859).
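For character data, transcoding usually means decoding the bytes to abstract characters under the source encoding and re-encoding them under the target one. A minimal Python sketch follows; the function name transcode is hypothetical, and it assumes every character in the input exists in the target encoding (otherwise encode raises UnicodeEncodeError).

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert encoded text from one character encoding to another."""
    text = data.decode(source)   # bytes -> characters
    return text.encode(target)   # characters -> bytes


latin1_bytes = b"caf\xe9"        # "cafe" with an acute accent, in ISO/IEC 8859-1
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
print(utf8_bytes)                # b'caf\xc3\xa9'
```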

Luit

luit is a utility program used to translate the character set of a computer program so that its output can be displayed correctly on a terminal emulator that uses a different character set.