These languages share one characteristic: their writing systems use Chinese characters, either wholly or in part (hànzì in Chinese, kanji in Japanese, and hanja in Korean). Chinese is written in Chinese characters only and requires about 4,000 characters for general literacy, although up to 40,000 characters are needed for reasonably complete coverage. Japanese uses fewer characters, together with two syllabaries; general literacy in Japan can be expected with about 2,000 characters. Chinese characters are increasingly rare in Korean, although their idiosyncratic use in proper names requires knowledge (and therefore availability) of many more characters.
The number of characters required for complete coverage of all these languages' needs cannot fit in the 256-character code space of 8-bit character encodings. It requires at least a 16-bit fixed-width encoding or a multi-byte variable-length encoding. The 16-bit fixed-width encodings, such as Unicode up to and including version 2.0, are now deprecated for two reasons: more characters must be encoded than a 16-bit encoding can accommodate (Unicode 5.0 has some 70,000 Han characters), and the Chinese government requires that software in China support the GB18030 character set.
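The code-space argument above can be checked with a short sketch. It assumes Python 3 and its built-in codecs; the helper name `encoded_lengths` and the two sample characters are illustrative choices, not taken from the text. It counts the bytes one Han character occupies in several encodings, once for a character inside the original 16-bit range and once for one beyond it.

```python
# Illustrative sketch: byte lengths of Han characters in several encodings.
# The helper name and the sample characters are assumptions for this example.

ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be", "gb18030"]

BMP_IDEOGRAPH = "\u6f22"          # U+6F22, fits in 16 bits
SUPPLEMENTARY = "\U00020000"      # U+20000, outside the 16-bit range


def encoded_lengths(ch):
    """Return {encoding name: number of bytes used to encode ch}."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


def fits_in_8_bit(ch):
    """Report whether ch can be encoded in the 8-bit Latin-1 encoding."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for label, ch in [("U+6F22", BMP_IDEOGRAPH), ("U+20000", SUPPLEMENTARY)]:
        print(label, encoded_lengths(ch), "8-bit:", fits_in_8_bit(ch))
```

A character beyond the 16-bit range needs two 16-bit units (a surrogate pair) in UTF-16, which is why a fixed-width 16-bit encoding is no longer sufficient.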
Although CJK character sets have many characters in common, the encodings used to represent them were developed separately by different East Asian governments and software companies, and are mutually incompatible. Unicode has attempted, with some controversy, to unify the character sets in a process known as Han unification.
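The incompatibility can be illustrated with a minimal sketch, again assuming Python 3 and its built-in codecs; the helper name `cross_decode` and the sample strings are hypothetical. It encodes a string with each of several legacy encodings and then decodes the bytes with every encoding in the list. Only the matching pair recovers the original text; any other pairing yields garbled output or replacement characters.

```python
# Illustrative sketch: the same text encoded with one legacy CJK encoding
# does not survive decoding with another. Names here are assumptions.


def cross_decode(text, encodings):
    """Encode text with each encoding, then decode with every encoding.

    Returns {(source_encoding, decoding_encoding): decoded string}.
    Undecodable bytes become U+FFFD instead of raising an error.
    """
    table = {}
    for src in encodings:
        data = text.encode(src)
        for dst in encodings:
            table[(src, dst)] = data.decode(dst, errors="replace")
    return table


JAPANESE = cross_decode("日本語", ["shift_jis", "euc_jp"])
CHINESE = cross_decode("中文", ["gb2312", "big5"])

if __name__ == "__main__":
    for (src, dst), decoded in {**JAPANESE, **CHINESE}.items():
        print(f"{src:>9} -> {dst:<9} {decoded!r}")
```

Unicode avoids this problem by assigning each character a single code point that does not depend on the regional encoding.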
CJK character encodings include:
The CJK character sets take up the bulk of the assigned Unicode code space. There is much controversy among Japanese experts on Chinese characters about the desirability and technical merit of the Han unification process used to map multiple Chinese and Japanese character sets into a single set of unified characters.
Chinese and Japanese can be written both left-to-right and top-to-bottom, but they are usually considered left-to-right scripts when discussing encoding issues.