UTF-8

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet it is backwards compatible with ASCII: the first 128 code points are encoded with the same single bytes as in ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.

UTF-8 encodes each character (code point) in one to four octets (8-bit bytes), with the 1-byte encoding used for the 128 US-ASCII characters. See the Description section below for details.

Four bytes may seem like a lot for one character; however, UTF-16 (the main alternative to UTF-8) also uses four bytes for these same code points. Whether UTF-8 or UTF-16 is more efficient depends on the range of code points being used. For English text, which is mostly ASCII, UTF-8 is roughly half the size of UTF-16; for other Western languages it ranges from half the size to only trivially smaller, depending on how many ASCII characters (spaces, digits, punctuation, and unaccented Latin letters) appear in the text. For Chinese, UTF-8 can be larger than UTF-16 if fewer than about half of the characters are spaces, digits, newlines, tabs, and other ASCII characters. It is important to realize that none of the standard Unicode encoding schemes is intended for compression; modern algorithms such as the one used by gzip compress any of them to approximately the same size, typically much smaller than one byte per character. For short pieces of text where such algorithms do not perform well and size is important, the Standard Compression Scheme for Unicode could be considered instead.

The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8. The Internet Mail Consortium (IMC) recommends that all email programs be able to display and create mail using UTF-8.

History

By early 1992 the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF that provided a byte-stream encoding of its 32-bit characters. This encoding was not satisfactory on performance grounds, but did introduce the notion that bytes in the ASCII range of 0-127 represent themselves in UTF, thereby providing backward compatibility.

In July 1992 the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only 8-bit characters, i.e. those where the high bit was set.

In August 1992 this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Laboratories then made a crucial modification to the encoding to allow it to be self-synchronizing, meaning that it is not necessary to read from the beginning of the string in order to find character boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. Over the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, then communicated their success back to X/Open.

UTF-8 was first officially presented at the USENIX conference in San Diego, January 25 to 29, 1993.

Description

The bits of a Unicode character are distributed into the lower bit positions inside the UTF-8 bytes, with the lowest bit going into the last bit of the last byte:

Code point range    Byte 1   Byte 2   Byte 3   Byte 4
U+000000-U+00007F   0xxxxxxx
U+000080-U+0007FF   110xxxxx 10xxxxxx
U+000800-U+00FFFF   1110xxxx 10xxxxxx 10xxxxxx
U+010000-U+10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode; these include Latin letters with diacritics and characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice.

By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes, covering numbers up to 31 bits (the original limit of the Universal Character Set). However, in November 2003 UTF-8 was restricted by RFC 3629 to the range covered by the formal Unicode definition, U+0000 to U+10FFFF.
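
The bit-distribution pattern above maps directly onto a few shift and mask operations. The following is a minimal sketch in C of an encoder for one code point, written against the RFC 3629 range; the function name and interface are illustrative rather than taken from any particular library.

    #include <stdint.h>
    #include <stddef.h>

    /* Encode one Unicode scalar value (U+0000..U+10FFFF, excluding surrogates)
     * into buf, which must hold at least 4 bytes. Returns the number of bytes
     * written, or 0 if cp is not a valid scalar value. */
    size_t utf8_encode(uint32_t cp, unsigned char buf[4]) {
        if (cp <= 0x7F) {                       /* 0xxxxxxx */
            buf[0] = (unsigned char)cp;
            return 1;
        } else if (cp <= 0x7FF) {               /* 110xxxxx 10xxxxxx */
            buf[0] = 0xC0 | (cp >> 6);
            buf[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else if (cp <= 0xFFFF) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
            if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not scalar values */
                return 0;
            buf[0] = 0xE0 | (cp >> 12);
            buf[1] = 0x80 | ((cp >> 6) & 0x3F);
            buf[2] = 0x80 | (cp & 0x3F);
            return 3;
        } else if (cp <= 0x10FFFF) {            /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            buf[0] = 0xF0 | (cp >> 18);
            buf[1] = 0x80 | ((cp >> 12) & 0x3F);
            buf[2] = 0x80 | ((cp >> 6) & 0x3F);
            buf[3] = 0x80 | (cp & 0x3F);
            return 4;
        }
        return 0;                               /* above U+10FFFF: not encodable */
    }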

With these restrictions, bytes in a UTF-8 sequence have the following meanings. Some values can never appear in a legal UTF-8 sequence; some can only appear as the first byte of a sequence; and some can only appear as the second or later byte of a multi-byte sequence:

Binary            Hex   Decimal  Notes
00000000-01111111 00-7F 0-127    US-ASCII (single byte)
10000000-10111111 80-BF 128-191  Second, third, or fourth byte of a multi-byte sequence
11000000-11000001 C0-C1 192-193  Overlong encoding: start of a 2-byte sequence, but the code point would be <= 127
11000010-11011111 C2-DF 194-223  Start of a 2-byte sequence
11100000-11101111 E0-EF 224-239  Start of a 3-byte sequence
11110000-11110100 F0-F4 240-244  Start of a 4-byte sequence
11110101-11110111 F5-F7 245-247  Restricted by RFC 3629: start of a 4-byte sequence for a code point above U+10FFFF
11111000-11111011 F8-FB 248-251  Restricted by RFC 3629: start of a 5-byte sequence
11111100-11111101 FC-FD 252-253  Restricted by RFC 3629: start of a 6-byte sequence
11111110-11111111 FE-FF 254-255  Invalid: not defined by the original UTF-8 specification

Unicode also disallows the 2,048 code points U+D800..U+DFFF (the surrogate code points used by UTF-16), as well as the 32 code points U+FDD0..U+FDEF (noncharacters) and all 34 code points of the form U+xxFFFE and U+xxFFFF (more noncharacters); see Table 3-7 in the Unicode 5.0 standard. UTF-8 can mechanically transform these values into byte sequences, but they are either not scalar values (the surrogates) or are designated noncharacters, and thus their UTF-8 encodings may be treated as invalid sequences.
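
The table can also be read as a small classification routine. The sketch below (the function name is illustrative) reports how many bytes a well-formed sequence beginning with a given byte must have, returning 0 for bytes that cannot begin a sequence under RFC 3629.

    /* Length of the sequence that may legally start with byte b under RFC 3629,
     * or 0 if b is a continuation byte (80-BF) or can never appear (C0-C1, F5-FF). */
    int utf8_sequence_length(unsigned char b) {
        if (b <= 0x7F) return 1;                 /* 00-7F: US-ASCII      */
        if (b >= 0xC2 && b <= 0xDF) return 2;    /* C2-DF: 2-byte lead   */
        if (b >= 0xE0 && b <= 0xEF) return 3;    /* E0-EF: 3-byte lead   */
        if (b >= 0xF0 && b <= 0xF4) return 4;    /* F0-F4: 4-byte lead   */
        return 0;                                /* 80-BF, C0-C1, F5-FF  */
    }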

There are several current definitions of UTF-8 in various standards documents:

  • RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element
  • The Unicode Standard, Version 5.0, §3.9 D92, §3.10 D95 (2007)
  • The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
  • ISO/IEC 10646:2003 Annex D (2003)

They supersede the definitions given in the following obsolete works:

  • ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
  • The Unicode Standard, Version 2.0, Appendix A (1996)
  • RFC 2044 (1996)
  • RFC 2279 (1998)
  • The Unicode Standard, Version 3.0, §2.3 (2000) plus Corrigendum #1: UTF-8 Shortest Form (2000)
  • Unicode Standard Annex #27: Unicode 3.1 (2001)

They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.

Naming

The official name is "UTF-8", and this spelling is used in all the documents relating to the encoding. In many cases, particularly for documents to be transmitted across the Internet, the character set in which a document is encoded is declared by name near the start of the document; the correct name to use there is "UTF-8". In addition, all standards conforming to the Internet Assigned Numbers Authority (IANA) list, which include CSS, HTML, XML, and HTTP headers, may also use the name "utf-8", as the declaration is case-insensitive. Despite this, alternative forms, usually "utf8" or "UTF8", are seen; while these are incorrect and should be avoided, most agents such as browsers can understand them.

Examples

The Dollar Sign ($), which is Unicode U+0024 or binary 10 0100:

  • This falls into the range U+0000 through U+007F, covered by the first line of the table.
  • The first line of the table shows it will be encoded using one byte, 0xxxxxxx
  • Putting the binary right-justified into the 'x' bits results in 00100100
  • This byte in hexadecimal is 0x24. Thus the ASCII dollar sign is encoded unchanged.

The Cent Sign (¢), which is Unicode U+00A2 or binary 1010 0010:

  • This falls into the range U+0080 through U+07FF, covered by the second line of the table.
  • The second line of the table shows it will be encoded using two bytes, 110xxxxx,10xxxxxx.
  • Putting the binary right-justified into the 'x' bits results in 11000010,10100010
  • These bytes in hexadecimal are 0xC2,0xA2. That is the encoding of the character Cent Sign (¢) in UTF-8.

The character aleph (א), which is Unicode U+05D0 or binary 101 1101 0000:

  • This falls into the range U+0080 through U+07FF, covered by the second line of the table.
  • The second line of the table shows it will be encoded using two bytes, 110xxxxx,10xxxxxx.
  • Putting the binary right-justified into the 'x' bits results in 11010111,10010000
  • These bytes in hexadecimal are 0xD7,0x90. That is the encoding of the character aleph (א) in UTF-8.

The Euro symbol (€), which is Unicode U+20AC or binary 10 0000 1010 1100:

  • This falls into the range U+0800 through U+FFFF, covered by the third line of the table.
  • The third line of the table shows it will be encoded using three bytes, 1110xxxx,10xxxxxx,10xxxxxx.
  • Putting the binary right-justified into the 'x' bits results in 11100010,10000010,10101100
  • These bytes in hexadecimal are 0xE2,0x82,0xAC. That is the encoding of the Euro symbol (€) in UTF-8.
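
These worked examples are easy to verify mechanically. Assuming a C11 compiler, u8"" string literals are required to be UTF-8 encoded, so printing their bytes reproduces the sequences derived above; this is only a verification sketch.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Dollar sign, cent sign, aleph, and euro sign as C11 UTF-8 string literals. */
        const char *samples[] = { u8"\u0024", u8"\u00A2", u8"\u05D0", u8"\u20AC" };
        for (int i = 0; i < 4; i++) {
            for (size_t j = 0; j < strlen(samples[i]); j++)
                printf("0x%02X ", (unsigned char)samples[i][j]);
            printf("\n");   /* 0x24 / 0xC2 0xA2 / 0xD7 0x90 / 0xE2 0x82 0xAC */
        }
        return 0;
    }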

Rationale behind UTF-8's design

As a consequence of the design of UTF-8, the following properties of multi-byte sequences hold:

  • The most significant bit of a single-byte character is always 0.
  • The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
  • The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
  • A UTF-8 stream contains neither the byte FE nor FF. This ensures that a UTF-8 stream never looks like a UTF-16 stream starting with U+FEFF (the byte order mark).

UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although this property adds redundancy to UTF-8–encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently. This also means that if one or more complete bytes are lost due to error or corruption, one can resynchronize at the beginning of the next character and thus limit the damage.

Also, due to the design of the byte sequences, if a sequence of bytes supposed to represent text validates as UTF-8 then it is fairly safe to assume it is UTF-8. The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII is 3.1% for a 2-byte sequence, 0.39% for a 3-byte sequence, and even lower for longer sequences.

While natural languages encoded in traditional encodings are not random byte sequences, they are also unlikely to produce byte sequences that would pass a UTF-8 validity test and then be misinterpreted. For example, for ISO-8859-1 text to be misrecognized as UTF-8, the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol. Pure ASCII text would pass a UTF-8 validity test and it would be interpreted correctly because the UTF-8 encoding for the same text is the same as the ASCII encoding.

The bit patterns can be used to identify UTF-8 sequences at a glance. If the first hex digit of a byte is 0 through 7, it is an ASCII character encoded in one byte. If it is C or D, the byte starts a two-byte sequence (encoding up to 11 bits of code point); if it is E, a three-byte sequence (up to 16 bits); and if it is F, a four-byte sequence (up to 21 bits). A byte whose first hex digit is 8 through B can never start a character, but every continuation byte must begin with 8 through B. Thus, at a glance, "0xA9" on its own is not a valid UTF-8 character, but "0x54" and "0xE3 0xB4 0xB1" are.

There is no comparable validity test for most traditional 8-bit encodings such as ISO-8859-1: the encoding must be known from some other source, or garbled text (commonly called mojibake) will be shown. The fact that a working validity test exists for UTF-8-encoded text is a significant advantage.
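
That validity test can be written as a single pass over the bytes. The sketch below is a minimal illustration in C, following the well-formed byte ranges of RFC 3629 and Unicode Table 3-7 (the tighter second-byte ranges exclude overlong forms, surrogates, and code points above U+10FFFF); the function name is illustrative and it checks well-formedness only.

    #include <stddef.h>
    #include <stdbool.h>

    /* Returns true if s[0..len-1] is well-formed UTF-8 under RFC 3629. */
    bool utf8_valid(const unsigned char *s, size_t len) {
        size_t i = 0;
        while (i < len) {
            unsigned char b = s[i];
            size_t need = 0;
            unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the second byte */

            if (b <= 0x7F)      { i++; continue; }            /* ASCII              */
            else if (b >= 0xC2 && b <= 0xDF) need = 2;
            else if (b == 0xE0) { need = 3; lo = 0xA0; }       /* no overlong forms  */
            else if (b >= 0xE1 && b <= 0xEC) need = 3;
            else if (b == 0xED) { need = 3; hi = 0x9F; }       /* no surrogates      */
            else if (b >= 0xEE && b <= 0xEF) need = 3;
            else if (b == 0xF0) { need = 4; lo = 0x90; }       /* no overlong forms  */
            else if (b >= 0xF1 && b <= 0xF3) need = 4;
            else if (b == 0xF4) { need = 4; hi = 0x8F; }       /* nothing > U+10FFFF */
            else return false;                    /* 80-BF, C0-C1, F5-FF             */

            if (len - i < need) return false;     /* truncated sequence              */
            if (s[i + 1] < lo || s[i + 1] > hi) return false;
            for (size_t k = 2; k < need; k++)
                if (s[i + k] < 0x80 || s[i + k] > 0xBF) return false;
            i += need;
        }
        return true;
    }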

UTF-8 derivations

The following encodings differ slightly from the UTF-8 specification and are therefore incompatible with it.

CESU-8

Many pieces of software added UTF-8 conversions for UCS-2 data and did not alter their conversion when UCS-2 was replaced with the surrogate-pair-supporting UTF-16. The result is that each half of a UTF-16 surrogate pair is encoded as its own 3-byte UTF-8 sequence, resulting in 6 bytes rather than 4 for characters outside the Basic Multilingual Plane. Oracle databases use this encoding, as do Java and Tcl as described below, and probably a great deal of other software whose programmers were unaware of the complexities of UTF-16. Although most usage is by accident, a supposed benefit is that it preserves UTF-16 binary sorting order when CESU-8 is binary sorted.

Because CESU-8 and its derivations are not UTF-8, care must be taken to avoid mislabelling such data as UTF-8 when interchanging information over the Internet.

Modified UTF-8

Modified UTF-8 treats surrogate pairs as in CESU-8, described above. In addition, the null character (U+0000) is encoded as 0xC0 0x80 rather than 0x00. (0xC0 0x80 is not legal standard UTF-8 because it is not the shortest possible representation.) This means that the encoding of a Unicode string containing U+0000 will not have a null byte in it, and thus will not be truncated if processed in a language such as C using traditional ASCIIZ string functions.
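
The effect can be seen in a few lines of C. The byte values below are fixed by the Modified UTF-8 scheme itself, but the demonstration is only an illustrative sketch: the overlong pair keeps the encoded text free of 0x00 bytes, so null-terminated string handling still sees the whole string.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "A<NUL>B" in Modified UTF-8: U+0000 becomes the overlong pair C0 80,
         * so the encoded string contains no 0x00 byte of its own. */
        const char modified[] = { 0x41, (char)0xC0, (char)0x80, 0x42, 0x00 };
        printf("%zu\n", strlen(modified));   /* 4: the embedded U+0000 is not a terminator */

        /* The same text in standard UTF-8 contains a real 0x00 byte,
         * so strlen() stops after the first character. */
        const char standard[] = { 0x41, 0x00, 0x42, 0x00 };
        printf("%zu\n", strlen(standard));   /* 1 */
        return 0;
    }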

In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter. However, it uses Modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constants in class files.

Tcl also uses the same modified UTF-8 as Java for internal representation of Unicode data.

UTF-8 nuances

The following are minor variations in how UTF-8 is used in practice; all of them conform to the UTF-8 specification.

Windows

Many Windows programs (including Windows Notepad) add the byte sequence EF BB BF—the UTF-8 encoding of the Unicode Byte Order Mark—to the beginning of any document saved as UTF-8. This causes interoperability problems with software that does not expect the BOM. In particular:

  • It removes the desirable feature that UTF-8 is identical to ASCII for ASCII-only text.
  • Older text editors that expect ISO-8859-1 or CP1252 text will display "" at the start of the document, even if all the text is ASCII characters and thus would otherwise display correctly.
  • Programs that identify text files by special leading characters may fail to identify the UTF-8 files; a notable example is the Unix shebang syntax.
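
A common workaround on the consuming side is simply to skip a leading EF BB BF when it is present. A minimal sketch in C (the function name is illustrative, not part of any standard API):

    #include <stddef.h>

    /* If the buffer starts with the UTF-8 encoded byte order mark (EF BB BF),
     * return a pointer just past it; otherwise return the buffer unchanged. */
    const char *skip_utf8_bom(const char *s, size_t len) {
        if (len >= 3 &&
            (unsigned char)s[0] == 0xEF &&
            (unsigned char)s[1] == 0xBB &&
            (unsigned char)s[2] == 0xBF)
            return s + 3;
        return s;
    }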

Some Windows software (including Notepad) will sometimes misidentify plain ASCII documents as UTF-16LE if there is no BOM, a bug commonly known as "Bush hid the facts" after a particular phrase that can trigger it.

Mac OS X

The Mac OS X operating system encodes file names in its filesystem as UTF-8 in canonically decomposed form (NFD, Normalization Form Canonical Decomposition): precomposed characters are not used, and combining diacritics replace them. UTF-8 text is valid in either NFC or NFD form, but most other platforms, including Windows and Linux, expect the shorter NFC (Normalization Form Canonical Composition) form, which is also used by W3C standards, so NFD data must typically be converted to NFC for use on other platforms or the Web. This difference in normalization forms is not specific to the Mac, but it can be confusing for software built around the assumption that precomposed characters are the norm and combining diacritics are only used to form unusual combinations.

A common argument for decomposition is that it makes sorting far simpler, but this argument is easily refuted: for one, sorting is language-dependent (in German, the ä character sorts just after a, while in Swedish ä sorts after z).

This is discussed in Apple Q&A 1173.

Overlong forms, invalid input, and security considerations

The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

  1. Not notice and decode as if the bytes were some similar bit of UTF-8.
  2. Replace the bytes with a replacement character (usually '?' or '�' (U+FFFD)).
  3. Ignore the bytes.
  4. Interpret the bytes according to another encoding (often ISO-8859-1 or CP1252).
  5. Act as if the string ends at that point and report an error.
  6. Undo (or avoid) any result of the already-decoded part and report an error.

Decoders may also differ in what bytes are part of the error. The sequence 0xF0,0x20,0x20,0x20 might be considered a single 4-byte error, or a 1-byte error followed by 3 space characters.

It is possible for a decoder to behave in different ways for different types of invalid input.

RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." The Unicode Standard requires a Unicode-compliant decoder to "…treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Overlong forms are one of the most troublesome types of UTF-8 data. The current RFC says they must not be decoded, but older specifications only gave a warning, and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high-profile products, including Microsoft's IIS web server. Therefore, great care must be taken to avoid security issues if validation is performed before conversion from UTF-8; it is generally much simpler to reject overlong forms before any other input validation is done.

To maintain security in the case of invalid input, there are a few options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, either returns an error or returns text that the application knows to be harmless. A third possibility is never to decode the UTF-8 at all and only look for the byte patterns to be matched; however, this requires knowing that no other part of the system will attempt a decoding, a catch-22 that makes this simple solution difficult to use in many systems.

Advantages and disadvantages

A note on string length and character indexes

A common criticism of variable-length encodings such as UTF-8, particularly from beginners, is that the algorithms for finding the number of characters between two points, or for advancing a pointer by n characters, are not O(1) (constant time), causing programs that use them to be slower. However, the use of these algorithms by actual working software is often vastly over-estimated:

  • In almost any case, the value provided to one of these algorithms was calculated by calling the other one previously. Common examples are malloc(strlen(s)+1) or pointer+=length_of_word(*pointer). Changing the functions to return byte counts in place of character counts yields the exact same program with O(1) performance.
  • It is wrong to assume the number of characters will assist in getting the visual space needed to display a string. Combining characters, double width characters, proportional fonts, non-printing characters and right-to-left characters all contribute to the impossibility of accurately calculating the layout space without analyzing the actual characters being used.
  • The self-synchronizing nature of UTF-8 can be used to find the code point boundary nearest to a pointer, because lead and trail bytes occupy disjoint ranges of byte values (a short sketch after this section illustrates this).

So while the number of octets in a UTF-8 string or substring is related to the number of code points in a more complex way than for UTF-32, it is very rare to encounter a situation where this makes a difference in practice, and it can hardly be counted as either an advantage or a disadvantage of UTF-8.
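
Because lead bytes and continuation bytes occupy disjoint ranges, finding the nearest character boundary from an arbitrary byte offset requires scanning backwards over at most three bytes. A minimal illustrative sketch in C:

    #include <stddef.h>

    /* Given an arbitrary byte offset i into a well-formed UTF-8 buffer, move it
     * back (at most 3 bytes) to the start of the code point it falls inside.
     * Continuation bytes are exactly those of the form 10xxxxxx. */
    size_t utf8_align_to_boundary(const unsigned char *s, size_t i) {
        while (i > 0 && (s[i] & 0xC0) == 0x80)   /* 0x80..0xBF: continuation byte */
            i--;
        return i;
    }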

General

Advantages

  • UTF-8 is a superset of ASCII. Since a plain ASCII string is also a valid UTF-8 string, no conversion needs to be done for existing ASCII text. Software designed for traditional code page specific character sets can generally be used with UTF-8 with few or no changes.
  • Sorting of UTF-8 strings using standard byte-oriented sorting routines will produce the same results as sorting them based on Unicode code points. (This has limited usefulness, though, since it is unlikely to represent the culturally acceptable sort order of any particular language or locale.) For the sorting to work correctly, the bytes must be treated as unsigned values.
  • UTF-8 and UTF-16 are the standard encodings for XML documents. All other encodings must be specified explicitly either externally or through a text declaration.
  • Any byte oriented string search algorithm can be used with UTF-8 data (as long as one ensures that the inputs only consist of complete UTF-8 characters). Care must be taken with regular expressions and other constructs that count characters, however.
  • UTF-8 strings can be fairly reliably recognized as such by a simple algorithm. That is, the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length. For instance, the octet values C0, C1, and F5 to FF never appear. For better reliability, regular expressions can be used to take into account illegal overlong and surrogate values (see the W3 FAQ: Multilingual Forms for a Perl regular expression to validate a UTF-8 string).

Disadvantages

  • A badly-written (and not compliant with current versions of the standard) UTF-8 parser could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.

Compared to single-byte encodings

Advantages

  • UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time. It needs only to indicate to the operating system that text is in UTF-8. For Windows, this involves setting code page 65001.

Disadvantages

  • UTF-8 encoded text is larger than the appropriate single-byte encoding for everything except plain ASCII characters. In the case of languages which commonly use 8-bit character sets with non-Latin alphabets encoded in the upper half (such as most Cyrillic and Greek alphabet code pages), UTF-8 text will be almost double the size of the same text in a single-byte encoding.
  • Encodings that use a single byte per character make string cutting easy even with simple-minded APIs.

Compared to other multi-byte encodings

Advantages

  • UTF-8 can encode any Unicode character. In most cases, multi-byte encodings can be converted to Unicode and back with no loss and — as UTF-8 is an encoding of Unicode — this applies to it too.
  • Character boundaries are easily found from anywhere in an octet stream (scanning either forwards or backwards). This implies that if a stream of bytes is scanned starting in the middle of a multi-byte sequence, only the information represented by the partial sequence is lost and decoding can begin correctly on the next character. Similarly, if a number of bytes are corrupted or dropped, then correct decoding can resume on the next character boundary. Many multi-byte encodings are much harder to resynchronise.
  • A byte sequence for one character never occurs as part of a longer sequence for another character as it did in older variable-length encodings like Shift-JIS (see the previous section on this). For instance, US-ASCII octet values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility with file systems or other software (e.g., the printf() function in C libraries) that parse based on US-ASCII values but are transparent to other values.
  • The first byte of a multi-byte sequence is enough to determine the length of the multi-byte sequence. This makes it extremely simple to extract a sub-string from a given string without elaborate parsing. This was often not the case in multi-byte encodings.
  • Efficient to encode using simple bit operations. UTF-8 does not require slower mathematical operations such as multiplication or division (unlike the obsolete UTF-1 encoding).

Disadvantages

  • UTF-8 often takes more space than an encoding made for one or a few languages. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.

Compared to UTF-7

Advantages

  • UTF-8 uses significantly fewer bytes per character for all non-ASCII characters.
  • UTF-8 encodes "+" as itself whereas UTF-7 encodes it as "+-".

Disadvantages

  • UTF-8 requires the transmission system to be eight-bit clean. In the case of e-mail this means it has to be further encoded using quoted printable or base64 in some cases. This extra stage of encoding carries a significant size penalty. However, this disadvantage is not so important an issue any more because most mail transfer agents in modern use are eight-bit clean and support the 8BITMIME SMTP extension as specified in RFC 1869.

Compared to UTF-16

Advantages

  • Unlike UTF-16, byte values of 0 (the ASCII NUL character) do not appear in the encoding unless U+0000 (the Unicode NUL character) is represented. This means that string functions from the standard C library (such as strcpy()) which use a null terminator will correctly handle UTF-8 strings, whereas many UTF-16 strings will be prematurely truncated by those same functions; a short sketch after this list illustrates the effect.
  • Other characters below U+0080 are likewise handled correctly by a parser that assumes the text is ASCII or another ASCII-compatible charset. Most existing computer programs (including operating systems) were not written with Unicode in mind. Using UTF-16 with them while maintaining compatibility with existing programs (as was done with Windows) requires every API and data structure that takes a string to be duplicated. UTF-8 would only require duplication if an API treats bytes with the high bit set in a special way, and such APIs form a very small or empty set in practice.
    • For example, by using UTF-8, Unicode can be implemented in most programming languages even for compilers that don't support it, by adding some library functions (especially for I/O) and by using a separate text editor when editing text strings.
  • In UTF-8, characters outside the basic multilingual plane are not a special case. UTF-16 is often mistaken to be constant-length, leading to code that works for most text but suddenly fails for non-BMP characters. Retrofitting code tends to be hard, so it's better to implement support for the entire range of Unicode from the start.
  • Characters below U+0080 take only one byte in UTF-8 and take two bytes in UTF-16. Text consisting of mostly diacritic-free Latin letters will be around half the size in UTF-8 than it would be in UTF-16. Text in many other alphabets will be slightly smaller in UTF-8 than it would be in UTF-16 because of the presence of spaces, newlines, numbers, and punctuation.
  • UTF-8 uses a byte as its atomic unit while UTF-16 uses a 16-bit word which is generally represented by a pair of bytes. This representation raises a couple of potential problems of its own.
    • When representing a word in UTF-16 as two bytes instead of one 16-bit word, the order of those two bytes becomes an issue. A variety of mechanisms can be used to deal with this issue (for example, the Byte Order Mark), but they still present an added complication for software and protocol design.
    • If a byte is missing from a character in UTF-16, software that then tries to read the UTF-16 string from that point will be mis-indexed. The software will think that the first byte it reads is the start of a new character when in reality it is in the middle of a character that lost its beginning byte(s). The result will be either invalid UTF-16 or completely meaningless text. In UTF-8, if part of a multi-byte character is removed, only that character is affected and not the rest of the text. i.e. UTF-8 was made to be self-synchronizing, whereas UTF-16 was not.
  • Conversion of a string of random 16-bit values that is assumed to be UTF-16 to UTF-8 is lossless. But conversion of a string of random bytes that is assumed to be UTF-8 to UTF-16 will lose or mangle invalid byte sequences. This makes UTF-8 a safe way to hold data that might be text; this is surprisingly important in some software.
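
The point about embedded zero bytes can be demonstrated directly: even for pure ASCII text, UTF-16 interleaves 0x00 bytes, while UTF-8 only produces a 0x00 byte when U+0000 itself is encoded. A small illustrative sketch in C:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "cafe" in UTF-8 is just the ASCII bytes, so strlen() sees all of them. */
        const char utf8[] = "cafe";
        printf("%zu\n", strlen(utf8));       /* 4 */

        /* The same text in UTF-16LE: every ASCII character carries a 0x00 byte,
         * so byte-oriented C string functions stop after the first one. */
        const char utf16le[] = { 0x63, 0x00, 0x61, 0x00, 0x66, 0x00, 0x65, 0x00, 0x00, 0x00 };
        printf("%zu\n", strlen(utf16le));    /* 1 */
        return 0;
    }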

Disadvantages

  • Characters above U+0800 in the BMP use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi takes up more space when represented in UTF-8. For Chinese, with its several thousand characters, it is understandable that more bytes are needed, but in India there have been complaints, in part because the Arabic script used by India's western neighbours needs only two bytes per character. This disadvantage can be more than offset by the fact that characters below U+0080 (Latin letters, numbers and punctuation marks, space, carriage return and line feed), which frequently appear in those texts, take only one byte in UTF-8 while they take two bytes in UTF-16.
  • Although both UTF-8 and UTF-16 suffer from the need to handle invalid sequences as described above under general disadvantages, a simplistic parser for UTF-16 is unlikely to convert invalid sequences to ASCII. Since the dangerous characters in most situations are generally ASCII, a simplistic UTF-16 parser is much less dangerous than a simplistic UTF-8 parser.

