Because many uses in computing require units of bytes (octets), there are three related encoding schemes that map to octet sequences instead of 16-bit words: UTF-16, UTF-16BE, and UTF-16LE. They differ only in the byte order chosen to represent each 16-bit unit and in whether they make use of a Byte Order Mark (BOM). All three schemes produce either a 2-byte or a 4-byte sequence for any given character.
UTF-16 is officially defined in Annex Q of the international standard ISO/IEC 10646-1. It is also described in The Unicode Standard version 3.0 and higher, as well as in the IETF's RFC 2781.
UCS-2 (2-byte Universal Character Set) is an obsolete character encoding which is a predecessor to UTF-16. The UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value. As with UTF-16, there are three related encoding schemes (UCS-2, UCS-2BE, UCS-2LE) that map characters to a specific byte sequence.
Because of the technical similarities and upwards compatibility from UCS-2 to UTF-16, the two encodings are often erroneously conflated and used as if interchangeable, so that strings encoded in UTF-16 are sometimes misidentified as being encoded in UCS-2.
For both UTF-16 and UCS-2, all 65,536 code points contained within the BMP (Plane 0), excluding the 2,048 special surrogate code points, are assigned to code units in a one-to-one correspondence with the 16-bit non-negative integers of the same values. Thus code point U+0000 is encoded as the number 0, and U+FFFF is encoded as 65535 (hexadecimal FFFF).
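This identity between BMP code points and code units can be sketched directly (a Python illustration; the helper name `bmp_code_unit` is invented here for demonstration):

```python
# Sketch: within the BMP, the UTF-16 code unit equals the code point value,
# so each character occupies exactly one 16-bit unit.
def bmp_code_unit(ch: str) -> int:
    """Return the single 16-bit code unit for a BMP character."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("not a BMP character")
    return cp

# U+007A ("z") is encoded as the 16-bit value 0x007A.
unit = bmp_code_unit("z")
```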
UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit code units, known as a surrogate pair. First 0x10000 is subtracted from the code point to give a 20-bit value. This value is then split into two 10-bit halves, each of which is represented as a surrogate, with the most significant half placed in the first surrogate. To allow safe use of simple word-oriented string processing, separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first, most significant surrogate and 0xDC00–0xDFFF for the second, least significant surrogate.
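The subtract-and-split procedure can be sketched as a small function (a Python illustration; the helper name `to_surrogate_pair` is invented here):

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a non-BMP code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    v = cp - 0x10000               # 20-bit value
    high = 0xD800 | (v >> 10)      # most significant 10 bits -> first surrogate
    low = 0xDC00 | (v & 0x3FF)     # least significant 10 bits -> second surrogate
    return high, low
```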
For example, the character at code point U+10000 becomes the code unit sequence 0xD800 0xDC00, and the character at U+10FFFD, the upper limit of Unicode, becomes the sequence 0xDBFF 0xDFFD. Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair does not ever represent a character.
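Decoding reverses the construction: subtract each surrogate's range base, recombine the two 10-bit halves, and add 0x10000 back. A sketch of the inverse (the helper name is again hypothetical):

```python
def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+10000 round-trips through the pair 0xD800 0xDC00.
cp = from_surrogate_pair(0xD800, 0xDC00)
```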
The UTF-16 (and UCS-2) encoding scheme allows either endian representation to be used, but mandates that the byte order should be explicitly indicated by prepending a Byte Order Mark before the first serialized character. This BOM is the encoded version of the Zero-Width No-Break Space (ZWNBSP) character, code point U+FEFF, chosen because it should never legitimately appear at the beginning of any character data. This results in the byte sequence FE FF (in hexadecimal) for big-endian architectures, or FF FE for little-endian. The BOM at the beginning of UTF-16- or UCS-2-encoded data is considered a signature separate from the text itself; it is for the benefit of the decoder. Technically, with the UTF-16 scheme the BOM prefix is optional, but omitting it is not recommended; UTF-16LE or UTF-16BE should be used instead. If the BOM is missing, and barring any indication of byte order from a higher-level protocol, big-endian is to be assumed. The BOM is not optional in the UCS-2 scheme.
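Python's built-in `utf-16` codec behaves as described: it prepends a BOM in the platform's native byte order on encoding and consumes it on decoding (shown purely as an illustration):

```python
import codecs

data = "z".encode("utf-16")  # 2-byte BOM followed by one 2-byte code unit
# The first two bytes are FF FE (little-endian) or FE FF (big-endian).
assert data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
# On decoding, the BOM is treated as a signature, not as part of the text.
text = data.decode("utf-16")
```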
The UTF-16BE and UTF-16LE encoding schemes (and correspondingly UCS-2BE and UCS-2LE) are similar to the UTF-16 (or UCS-2) encoding scheme. However rather than using a BOM prepended to the data, the byte order used is implicit in the name of the encoding scheme (LE for little-endian, BE for big-endian). Since a BOM is specifically not to be prepended in these schemes, if an encoded ZWNBSP character is found at the beginning of any data encoded by these schemes it is not to be considered to be a BOM, but instead is considered part of the text itself. In practice most software will ignore these "accidental" BOMs.
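The difference is visible with Python's `utf-16-le`/`utf-16-be` codecs, which emit no BOM and treat a leading encoded U+FEFF as ordinary text (an illustrative sketch):

```python
# A string that happens to begin with ZWNBSP (U+FEFF).
s = "\ufeffz"
be = s.encode("utf-16-be")  # no BOM is prepended
# The leading FE FF bytes here encode a ZWNBSP character, not a signature,
# so decoding preserves it as part of the text rather than stripping it.
decoded = be.decode("utf-16-be")
```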
The IANA has approved UTF-16, UTF-16BE, and UTF-16LE for use on the Internet, by those exact names (case insensitively). The aliases UTF_16 or UTF16 may be meaningful in some programming languages or software applications, but they are not standard names in Internet protocols.
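For example, Python's codec registry (used here purely as an illustration of such alias handling) normalizes these spellings to the canonical IANA name:

```python
import codecs

# "UTF_16" and "utf16" are accepted as aliases and normalized to "utf-16".
canonical = codecs.lookup("UTF_16").name
```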
Symbian OS used in Nokia S60 handsets and Sony Ericsson UIQ handsets uses UCS-2.
Older Windows NT systems (prior to Windows 2000) only support UCS-2. In Windows XP, no code point above U+FFFF is included in any font delivered with the European-language versions of Windows; Chinese versions of Windows may differ.
Java used UCS-2 initially, and added UTF-16 supplementary character support in J2SE 5.0. The Python language environment has used UCS-2 internally since version 2.1, although newer versions can use UCS-4 (UTF-32) to store supplementary characters (instead of UTF-16).
Many application programs still only support UCS-2, that is, code points up to U+FFFF.
| code point         | character       | UTF-16 code value(s) | glyph* |
| 122 (hex 7A)       | small z (Latin) | 007A                 | z      |
| 27700 (hex 6C34)   | water (Chinese) | 6C34                 | 水     |
| 119070 (hex 1D11E) | musical G clef  | D834 DD1E            | 𝄞      |
"水z𝄞" (water, z, G clef), UTF-16 encoded:

| labeled encoding | byte order              | byte sequence                    |
| UTF-16LE         | little-endian           | 34 6C, 7A 00, 34 D8 1E DD        |
| UTF-16BE         | big-endian              | 6C 34, 00 7A, D8 34 DD 1E        |
| UTF-16           | little-endian, with BOM | FF FE, 34 6C, 7A 00, 34 D8 1E DD |
| UTF-16           | big-endian, with BOM    | FE FF, 6C 34, 00 7A, D8 34 DD 1E |
* Appropriate font and software are required to see the correct glyphs.
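The byte sequences in the table can be verified with Python's built-in codecs (a sketch; the string literal below spells out 水, z, and the G clef):

```python
s = "\u6C34z\U0001D11E"        # water, small z, musical G clef
le = s.encode("utf-16-le")     # 34 6C, 7A 00, 34 D8 1E DD
be = s.encode("utf-16-be")     # 6C 34, 00 7A, D8 34 DD 1E
with_bom = s.encode("utf-16")  # BOM first, then one of the two above
```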
v = 0x64321
v′ = v - 0x10000
= 0101 0100 0011 0010 0001
vh = 0101010000 // higher 10 bits of v′
vl = 1100100001 // lower 10 bits of v′
w1 = 0xD800 // the resulting 1st word is initialized with the start of the high-surrogate range
w2 = 0xDC00 // the resulting 2nd word is initialized with the start of the low-surrogate range
w1 = w1 | vh
= 1101 1000 0000 0000 |
01 0101 0000
= 1101 1001 0101 0000
w2 = w2 | vl
= 1101 1100 0000 0000 |
11 0010 0001
= 1101 1111 0010 0001
The correct UTF-16 encoding for this character is thus the word sequence 0xD950 0xDF21.
Since the character is above U+FFFF, the character cannot be encoded in UCS-2.
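The worked example can be checked against Python's UTF-16 codec (an illustrative verification, not part of the derivation):

```python
ch = "\U00064321"
encoded = ch.encode("utf-16-be")
# Split the four big-endian bytes back into two 16-bit words.
units = [int.from_bytes(encoded[i:i + 2], "big")
         for i in range(0, len(encoded), 2)]
# units matches the pair w1 = 0xD950, w2 = 0xDF21 derived above.
```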