In discussing Unicode and the UCS, one frequently encounters the term compatibility characters: graphical characters whose use the Unicode Consortium discourages. The Unicode glossary defines a compatibility character as:
A character that would not have been encoded except for compatibility and round-trip convertibility with other standards
However, the definition is more complicated than the glossary entry reveals. One of the properties Unicode assigns to characters is a decomposition, or compatibility decomposition. Most characters have no value for this property, but over 5,000 characters have a compatibility decomposition mapping them to one or more other characters. By setting a character's decomposition property, Unicode establishes that character as a compatibility character. The reasons for these compatibility designations vary and are discussed in further detail below. The term decomposition can sometimes be confusing, because a character's decomposition may in some cases be a singleton: the decomposition of one character is simply another equivalent or approximately equivalent character.
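The decomposition property can be inspected directly; here is a minimal sketch using Python's standard unicodedata module (the specific example characters are illustrative choices):

```python
import unicodedata

# Most characters have no decomposition mapping at all:
assert unicodedata.decomposition('A') == ''

# A compatibility decomposition carries a keyword tag; here U+FB01
# LATIN SMALL LIGATURE FI maps to the two characters 'f' and 'i':
print(unicodedata.decomposition('\ufb01'))  # '<compat> 0066 0069'

# A singleton decomposition: U+2126 OHM SIGN maps to the single
# character U+03A9 GREEK CAPITAL LETTER OMEGA:
print(unicodedata.decomposition('\u2126'))  # '03A9'
```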
Because these semantically distinct characters may be displayed with glyphs similar to those of other characters, text processing software should try to address the possible confusion for end users. When comparing and collating (sorting) text strings, different forms and rich text variants of characters should not alter the results. For example, users may be confused when a search within a page for the capital Latin letter ‘I’ fails to find the visually similar Roman numeral ‘Ⅰ’.
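A search routine can avoid this confusion by normalizing both strings first; a sketch in Python using the standard unicodedata module (the helper name nfkc_find is hypothetical, not a standard API):

```python
import unicodedata

def nfkc_find(haystack: str, needle: str) -> bool:
    """Substring search that treats compatibility variants as equal."""
    nfkc = lambda s: unicodedata.normalize('NFKC', s)
    return nfkc(needle) in nfkc(haystack)

# U+2160 ROMAN NUMERAL ONE is found by a search for the letter 'I':
assert nfkc_find('Chapter \u2160', 'I')
# A naive substring search misses it:
assert 'I' not in 'Chapter \u2160'
```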
The UCS, the Unicode character properties and the Unicode algorithms provide software implementations with everything needed to display these characters properly from their decomposition equivalents. The decomposable compatibility characters are therefore redundant and unnecessary. Their existence in the character set requires extra text processing to ensure that text is properly compared and collated (see Unicode normalization). Moreover, these compatibility characters provide no additional or distinct semantics, nor do they provide any visually distinct rendering, provided the text layout and fonts are Unicode conforming. Nor are any of these characters required for round-trip convertibility with other character sets, since a transliteration can simply map decomposed characters to their precomposed counterparts in the other character set. Similarly, contextual forms, such as a final-form Arabic letter, can be mapped based on their position within a word to the appropriate legacy character set form.
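The contextual-form mapping mentioned above is recorded in the decomposition property itself; a brief Python illustration (the Arabic final form chosen here is just one example):

```python
import unicodedata

# U+FE94 ARABIC LETTER TEH MARBUTA FINAL FORM is a positional
# (contextual) compatibility form; its decomposition tag says so:
print(unicodedata.decomposition('\ufe94'))  # '<final> 0629'

# Compatibility normalization folds it back to the plain letter
# U+0629 ARABIC LETTER TEH MARBUTA:
assert unicodedata.normalize('NFKC', '\ufe94') == '\u0629'
```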
In order to dispense with these compatibility characters, text software must conform to several Unicode protocols. The software must be able to:
Altogether, the compatibility characters included for incomplete Unicode implementations total 3,779 of the 5,402 designated compatibility characters. These include all of the compatibility characters marked with the keywords <initial>, <medial>, <final>, <isolated>, <fraction>, <wide>, <narrow>, <small>, <vertical> and <square>. They also include nearly all of the canonical singletons and most of the <compat> keyword compatibility characters (the exceptions include the <compat> characters for enclosed alphanumerics and enclosed ideographs, and those discussed in the subsequent sections).
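These keyword tallies can be reproduced, approximately, by scanning the decomposition property across the whole code space; a sketch in Python (exact totals depend on the Unicode version bundled with the interpreter, so no fixed counts are assumed):

```python
import sys
import unicodedata
from collections import Counter

# Tally the compatibility keyword tags across every code point.
tags = Counter()
for cp in range(sys.maxunicode + 1):
    d = unicodedata.decomposition(chr(cp))
    if d.startswith('<'):                  # keyword-tagged decompositions only
        tags[d.split('>', 1)[0] + '>'] += 1

# Every keyword named in the text appears in the tally:
for kw in ('<initial>', '<medial>', '<final>', '<isolated>',
           '<fraction>', '<wide>', '<narrow>', '<small>',
           '<vertical>', '<square>', '<compat>'):
    assert tags[kw] > 0
```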
Many other compatibility characters constitute what Unicode considers rich text and therefore fall outside the goals of Unicode and the UCS. In some sense, even the compatibility characters discussed in the previous section (those that aid legacy software in displaying ligatures and vertical text) constitute a form of rich text, since rich text protocols determine whether text is displayed one way or another. However, displaying text with or without ligatures, or vertically versus horizontally, are both non-semantic rich text distinctions; they are simply style differences. This is in contrast to other rich text such as italics, superscripts and subscripts, or list markers, where the styling implies certain semantics.
For comparing, collating, handling and storing plain text, rich text variants are semantically redundant. For example, a superscript character for the numeral 4 is likely indistinguishable from the standard numeral 4 rendered as a superscript through rich text protocols. Such alternate characters therefore create ambiguity, because they appear visually identical to their plain text counterparts with rich text formatting applied. These rich text compatibility characters include:
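The superscript-4 example can be made concrete with a short Python sketch:

```python
import unicodedata

# U+2074 SUPERSCRIPT FOUR is a rich text variant of the digit '4'.
# Compatibility normalization (NFKC) discards the styling:
assert unicodedata.normalize('NFKC', 'x\u2074') == 'x4'

# Canonical-only normalization (NFC) leaves the variant intact:
assert unicodedata.normalize('NFC', 'x\u2074') == 'x\u2074'
```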
For all of these rich text compatibility characters, the displayed glyphs are typically distinct from those of their compatibility decomposition (related) characters. Nevertheless, the Unicode Consortium considers them compatibility characters and discourages their use, because they are not plain text characters, which is what Unicode seeks to support with its UCS and associated protocols. Rich text should be handled through non-Unicode protocols such as HTML, CSS and RTF.
The rich text compatibility characters comprise 1,451 of the 5,402 compatibility characters. These include all of the compatibility characters marked with the keywords <circle> and <font> (except three listed among the semantically distinct characters below); 11 space variants from the <compat> and canonical characters; and some of the <super> and <sub> keyword characters from the "Superscripts and Subscripts" block.
Many compatibility characters are semantically distinct characters, though they may share representational glyphs with other characters. Some of these characters may have been included because most other character sets focused on a single script or writing system. For example, the ISO and other Latin character sets likely included a character for π (pi) because, focusing primarily on one writing system, they would not otherwise have had a character for this common mathematical symbol. With Unicode, however, mathematicians are free to use letters from any known script in the world, or to select a Unihan ideograph, to stand for a mathematical set or mathematical constant. To date, Unicode has added specific semantic support for only a few such mathematical constants (for example the Euler constant: ℇ U+2107 and the Planck constant: ℎ U+210E, both of which Unicode considers compatibility characters). Unicode therefore designates several mathematical symbols based on Greek and Hebrew letters as compatibility characters. These include:
While these compatibility characters are distinguished from their compatibility decomposition characters only by the word “symbol” added to their names, they do represent long-standing distinct meanings in written mathematics. For most practical purposes, however, they share the same semantics as their compatibility-equivalent Greek or Hebrew letters. These may be considered borderline semantically distinguishable characters, so they are not included in the total.
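The folding these designations imply can be observed with one of the Greek letter symbols; a brief Python illustration (the theta symbol is an arbitrary choice from the list):

```python
import unicodedata

# U+03D1 GREEK THETA SYMBOL has a compatibility mapping to the
# plain letter U+03B8 GREEK SMALL LETTER THETA:
print(unicodedata.decomposition('\u03d1'))  # '<compat> 03B8'

# NFKC therefore erases the mathematical-symbol distinction:
assert unicodedata.normalize('NFKC', '\u03d1') == '\u03b8'
```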
Unicode also designates twenty-eight other letter-like symbols as compatibility characters.
In addition, several scripts use glyph position, such as superscript and subscript placement, to differentiate semantics. In these cases subscripts and superscripts are not merely rich text but constitute distinct characters in the writing system, similar to a hybrid between a diacritic and a letter (130 total).
Finally, Unicode designates Roman numerals as compatibility equivalents of the Latin letters that share their glyphs. Here the Unicode Standard makes the very mistake of confusing glyph and character that it so often seeks to prevent. There is certainly a need to address the visual ambiguity that arises when these characters share the same glyphs; however, a sign-value numeral for one is a semantically distinct character from the Latin capital or small letter ‘i’. A similar visual ambiguity exists between characters such as the Latin capital letter A (U+0041) and the Greek capital letter Alpha (Α U+0391), yet Unicode does not unify those characters.
Roman numeral One Thousand actually has a third character representing a third form, or glyph, for the same semantic unit: One Thousand C D (ↀ U+2180). From this glyph one can see how the practice of using the Latin letter M may have arisen. Strangely, though Unicode unifies the sign-value Roman numerals with the very different (though visually similar) Latin letters, the Hindu-Arabic place-value (positional) decimal digit numerals are repeated 24 times (a total of 240 code points for 10 numerals) throughout the UCS without any relational or decomposition mapping between them.
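This asymmetry is visible in the character properties themselves; a Python sketch contrasting a Roman numeral with one of the repeated decimal digits (the specific characters are illustrative):

```python
import unicodedata

# Roman numerals carry compatibility mappings to Latin letters:
assert unicodedata.normalize('NFKC', '\u216b') == 'XII'  # Ⅻ U+216B

# Repeated decimal digits, by contrast, get no decomposition
# mapping; only the shared numeric-value property relates them:
assert unicodedata.decomposition('\u0664') == ''   # ٤ ARABIC-INDIC DIGIT FOUR
assert unicodedata.digit('\u0664') == unicodedata.digit('4') == 4
```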
The presence of these 146 semantically distinct though visually similar characters (plus the additional 18 precomposed Roman numerals and 11 Hebrew- and Greek-letter-based symbols) among the compatibility characters complicates the topic. Some suggest that content authors avoid compatibility characters altogether. However, in certain specialized fields these characters are important and quite similar to other characters that have not been designated compatibility characters. For example, in certain academic circles the use of Roman numerals as distinct from the Latin letters sharing their glyphs would be no different from the use of Cuneiform numerals or ancient Greek numerals. Collapsing the Roman numeral characters into Latin letter characters eliminates a semantic distinction. A similar situation exists for phonetic alphabet characters that use subscript- or superscript-positioned glyphs. In the specialized circles that use phonetic alphabets, authors should be able to do so without resorting to rich text protocols.
Unfortunately, a small number of characters even within the compatibility blocks are not themselves compatibility characters and may therefore confuse authors. The “Enclosed CJK Letters and Months” block contains a single non-compatibility character: the ‘Korean Standard Symbol’ (㉿ U+327F). This symbol and 12 other characters have been included in these blocks for no known reason. The “CJK Compatibility Ideographs” block contains these non-compatibility unified Han ideographs:
These thirteen characters are neither compatibility characters nor is their use discouraged in any way.
Several other characters in these blocks have no compatibility mapping but are clearly intended for legacy support:
Alphabetic Presentation Forms (1)
Arabic Presentation Forms (4)
CJK Compatibility Forms (2 that are both related to CJK Unified Ideograph: U+4E36 丶)
Enclosed Alphanumerics (21 rich text variants)
Normalization is the process by which Unicode conforming software performs canonical, and optionally compatibility, decomposition before comparing or collating text strings. This is similar to other operations needed when, for example, a user performs a case- or diacritic-insensitive search within some text: the software must equate or ignore characters it would not otherwise equate or ignore. Typically, normalization is performed without altering the underlying stored text data (lossless). However, some software may make permanent changes to text, eliminating the canonical or even compatibility differences from stored text (lossy).
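The lossless/lossy distinction corresponds to which normalization form is written back to storage; a brief Python sketch:

```python
import unicodedata

s = '\ufb01ve \u00bd'  # LATIN SMALL LIGATURE FI + 've ' + VULGAR FRACTION ONE HALF

# Canonical normalization (NFC) preserves compatibility characters,
# so storing its result is lossless with respect to them:
assert unicodedata.normalize('NFC', s) == s

# Compatibility normalization (NFKC) rewrites them; storing its
# result permanently discards the original code points (lossy):
assert unicodedata.normalize('NFKC', s) == 'five 1\u20442'
```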