Collation is the assembly of written information into a standard order. This is commonly called alphabetisation, though collation is not limited to ordering letters of the alphabet. Collating lists of words or names into alphabetical order is the basis of most office filing systems, library catalogs and reference books.
Collation differs from classification in that classification is concerned with arranging information into logical categories, while collation is concerned with the ordering of those categories.
Advantages of sorted lists include:
A collation algorithm, e.g. the "Unicode collation algorithm", differs from a sorting algorithm: the first is a process to define the order, which corresponds to the process of just comparing two values, while a sorting algorithm is a procedure to put a list of items in this order.
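The distinction can be sketched in Python. The example below is illustrative only: `compare` is a stand-in collation rule (case-insensitive comparison, chosen here as an assumption and much simpler than the Unicode collation algorithm), and `insertion_sort` is a hypothetical helper showing that a sorting procedure needs nothing more than the comparison.

```python
from functools import cmp_to_key


def compare(a: str, b: str) -> int:
    """Collation step: decide the relative order of two items.

    Returns a negative number, zero, or a positive number. The rule used
    here (case-insensitive comparison) is only a placeholder.
    """
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)


def insertion_sort(items, cmp):
    """Sorting step: arrange a list using only the comparison function."""
    result = []
    for item in items:
        pos = len(result)
        # Walk left past every element that should come after `item`.
        while pos > 0 and cmp(result[pos - 1], item) > 0:
            pos -= 1
        result.insert(pos, item)
    return result


words = ["pear", "Apple", "banana"]
by_hand = insertion_sort(words, compare)
built_in = sorted(words, key=cmp_to_key(compare))
print(by_hand)   # ['Apple', 'banana', 'pear']
print(built_in)  # ['Apple', 'banana', 'pear']
```

Swapping in a different sorting procedure leaves the resulting order unchanged, whereas changing `compare` changes the order itself.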
Collation defines a total preorder on the set of possible items, typically by defining a total order on a sort key. Note, however, that in numerical sorting of strings representing numbers, the strings themselves are only preordered rather than totally ordered, because distinct strings such as 2e3 and 2000 have the same ranking, as do 2 and 2.0. The numbers represented by the strings are totally ordered.
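A minimal sketch of the numeric case, assuming Python's built-in `float` is an acceptable key function for the strings involved (the sample list is made up for illustration):

```python
numerals = ["2000", "10", "2e3", "2.0", "2"]

# Sort by the number each string represents rather than by its characters.
ordered = sorted(numerals, key=float)
print(ordered)  # ['2.0', '2', '10', '2000', '2e3']

# Distinct strings can share a rank, so the strings are not totally ordered.
print(float("2e3") == float("2000"))  # True
print(float("2") == float("2.0"))     # True
```

Because Python's `sorted` is stable, strings with equal rank keep their original relative order, which is why "2.0" stays ahead of "2" here.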
While this might appear to work only for numbers, computers can use this method for any textual information since computers internally use character sets which assign a numeric code point to each letter or glyph. For example, a computer using ASCII code (or any of its supersets such as Unicode) and numerical sorting would collate the list of characters a · b · C · d · $ to $ · C · a · b · d.
The numerical values that ASCII uses are $ = 36, a = 97, b = 98, C = 67, and d = 100, resulting in what is called "ASCIIbetical order".
This style of collation is commonly used, often with the refinement of converting uppercase letters to lowercase before comparing ASCII values, since most people do not expect capitalised words to jump to the head of the list.
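The example can be reproduced with a short Python sketch. Python compares strings by code point, so a plain `sorted` call gives the numerical order; `str.lower` is used here as one possible way to apply the lowercase refinement.

```python
chars = ["a", "b", "C", "d", "$"]

# Code points assigned by ASCII (and by Unicode, which extends it).
code_points = {c: ord(c) for c in chars}
print(code_points)  # {'a': 97, 'b': 98, 'C': 67, 'd': 100, '$': 36}

# Plain sorting compares code points, so uppercase C precedes lowercase a.
by_code_point = sorted(chars)
print(by_code_point)  # ['$', 'C', 'a', 'b', 'd']

# Refinement: convert to lowercase before comparing.
case_folded = sorted(chars, key=str.lower)
print(case_folded)  # ['$', 'a', 'b', 'C', 'd']
```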
The order of the basic Latin alphabet is A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z.
The principle behind extending alphabetical order to words (lexicographical order) is that all words in a list beginning with the same letter should be grouped together; within a grouping starting with a single letter, all words beginning with the same two letters should be grouped together; and so on, maximizing the number of common initial letters between adjacent words. The ordering principle is applied at the point where the letters differ. For instance, in the sequence Astrolabe, Astronomy, Astrophysics, the order of the words is determined by the first letter at which they differ (l, n, and p respectively). Since n follows l in the alphabet but precedes p, Astronomy comes after Astrolabe but before Astrophysics.
There has historically been some variation in the application of these rules. For instance, the prefixes Mc and M' in Irish and Scottish surnames were taken to be abbreviations for Mac, and alphabetized as if they were spelled out as Mac in full. Thus one might find in a catalog a sequence with McKinley preceding Mackintosh, as if it had been spelled "MacKinley". Since the advent of computer-sorted lists, this type of alphabetization is less frequently encountered, though it is still used in British phone books.
A variation in alphabetical principles applies to names consisting of two words. In some cases, names with identical first words are all alphabetized together under the first word, e.g., grouping together all names beginning with San, then all those beginning with Santa, and then those beginning with Santo.
But in another system, the names are alphabetized as if they had no spaces.
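The two systems can be contrasted with a small Python sketch. The place names are illustrative sample data rather than a list taken from a particular catalog, and the key functions `word_by_word` and `letter_by_letter` are hypothetical helpers.

```python
names = ["Santa Cruz", "San Tomas", "Santo Domingo", "San Juan", "Santa Barbara"]


def word_by_word(name: str) -> list[str]:
    """Compare whole words first, so every 'San ...' precedes 'Santa ...'."""
    return name.lower().split()


def letter_by_letter(name: str) -> str:
    """Compare as if the name were written without spaces."""
    return name.lower().replace(" ", "")


first_word_grouped = sorted(names, key=word_by_word)
spaces_ignored = sorted(names, key=letter_by_letter)

print(first_word_grouped)
# ['San Juan', 'San Tomas', 'Santa Barbara', 'Santa Cruz', 'Santo Domingo']
print(spaces_ignored)
# ['San Juan', 'Santa Barbara', 'Santa Cruz', 'Santo Domingo', 'San Tomas']
```

In this sample only San Tomas moves: read without the space it becomes "santomas", which falls after "santodomingo".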
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
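A minimal sketch of the ñ problem in Python, assuming a hand-written alphabet string as the ordering table (`SPANISH_ALPHABET`, `RANK`, and `spanish_key` are hypothetical names, and the sketch ignores accented vowels and the pre-1994 treatment of ch and ll):

```python
import unicodedata

# Assumed ordering table: the basic Latin letters with ñ placed after n.
SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
RANK = {letter: i for i, letter in enumerate(SPANISH_ALPHABET)}


def spanish_key(word: str) -> list[int]:
    """Map each letter to its alphabet position; unknown characters sort last."""
    # NFC makes sure ñ is a single code point rather than n + combining tilde.
    normalized = unicodedata.normalize("NFC", word).lower()
    return [RANK.get(ch, len(RANK)) for ch in normalized]


words = ["zorro", "ñu", "nube", "oso"]

by_code_point = sorted(words)
by_alphabet = sorted(words, key=spanish_key)

print(by_code_point)  # ['nube', 'oso', 'zorro', 'ñu']
print(by_alphabet)    # ['nube', 'ñu', 'oso', 'zorro']
```

In practice a locale-aware collation library would normally be preferred over a hand-built table; the table is used here only to make the mechanism visible.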
Similar differences between computer numeric sorting and alphabetic sorting occur in Danish and Norwegian (aa is ordered at the end of the alphabet when it is pronounced like å, and at the start of the alphabet when it is pronounced like a), German (ß is ordered as s + s; ä, ö, ü are ordered as a + e, o + e, u + e in phone books, but as a, o, u elsewhere, and after a, o, u in Austria), Icelandic (ð follows d), Dutch (ij is sometimes ordered as y), English (æ is ordered as a + e), and many other languages.
Usually the spaces or hyphens between words are ignored.
Languages that use a syllabary or abugida instead of an alphabet (for example, Cherokee) can use approximately the same system if there is a set ordering for the symbols.
Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as Chinese hanzi and Japanese kanji, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character for "mother" (媽) is sorted as a thirteen-stroke character under the three-stroke primary radical (女).
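The grouping can be sketched in Python with a small hand-entered table. Everything in the table is an assumption made for illustration: real implementations draw radical and stroke data from a character database, and ordering the radicals by their Kangxi radical number is one common convention rather than something the description above prescribes.

```python
# Hypothetical lookup table: character -> (Kangxi radical number, total strokes).
# Radicals used: 人 (no. 9), 女 (no. 38), 日 (no. 72), 木 (no. 75).
RADICAL_STROKE = {
    "媽": (38, 13),
    "好": (38, 6),
    "明": (72, 8),
    "林": (75, 8),
    "人": (9, 2),
}


def radical_stroke_key(char: str) -> tuple[int, int]:
    """Group by primary radical first, then order by stroke count."""
    return RADICAL_STROKE[char]


characters = ["林", "媽", "明", "人", "好"]
ordered = sorted(characters, key=radical_stroke_key)
print(ordered)  # ['人', '好', '媽', '明', '林']
```

Both 好 and 媽 fall under the radical 女, so they sit together, with the six-stroke character ahead of the thirteen-stroke one.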
The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are few characters, all unambiguous. The choice of which components of a logograph comprise separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō (東京), the Japanese name of Tokyo, can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u" (とうきょう), using the conventional sorting order for these characters.
Nevertheless, the radical-and-stroke system is the only practical method for constructing dictionaries that someone may use to look up a logograph whose pronunciation is unknown.
In addition, in Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.
A similar complication arises when special characters such as hyphens or apostrophes appear in words or names. Any of the same rules as above can be used in this case as well; however, the strict ASCII sorting no longer corresponds exactly to any of the rules.
In certain contexts, very common words (such as articles) at the beginning of a sequence of words are not considered for ordering, or are moved to the end. So "The Shining" is considered "Shining" or "Shining, The" when alphabetizing and therefore is ordered before "Summer of Sam". This rule is fairly easy to capture in an algorithm, but many programs rely instead on simple lexicographic ordering. One fairly quaint exception to this rule is the flying of the flag of The Former Yugoslav Republic of Macedonia at the United Nations between those of Thailand and Timor Leste.
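A minimal sketch of such a rule in Python, assuming English titles and a fixed, hand-picked list of articles (`ARTICLES`, `title_key`, and `filing_form` are hypothetical names, and "Taxi Driver" is added to the sample only to make the difference visible):

```python
ARTICLES = ("the ", "a ", "an ")


def title_key(title: str) -> str:
    """Sort key that ignores a leading article."""
    for article in ARTICLES:
        if title[:len(article)].casefold() == article:
            return title[len(article):].casefold()
    return title.casefold()


def filing_form(title: str) -> str:
    """Display form with the leading article moved to the end."""
    for article in ARTICLES:
        n = len(article)
        if title[:n].casefold() == article:
            return f"{title[n:]}, {title[:n].strip()}"
    return title


titles = ["Summer of Sam", "The Shining", "Taxi Driver"]

plain = sorted(titles)
article_aware = sorted(titles, key=title_key)

print(plain)          # ['Summer of Sam', 'Taxi Driver', 'The Shining']
print(article_aware)  # ['The Shining', 'Summer of Sam', 'Taxi Driver']
print(filing_form("The Shining"))  # Shining, The
```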
Also, -13 comes alphabetically after -12 although it is the smaller number. With negative numbers, to make ascending order correspond with alphabetical sorting, more drastic measures are needed, such as adding a constant to all numbers to make them all positive.
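The offset technique can be sketched as follows. The constant `OFFSET` and the field width `WIDTH` are assumptions that only hold for values between -1000 and 8999; zero-padding to a fixed width is added so that the shifted numbers also compare correctly as text.

```python
values = [-13, -12, 5, 40, -2]

# Plain text sorting puts -12 before -13 and 40 before 5.
plain = sorted(str(v) for v in values)
print(plain)  # ['-12', '-13', '-2', '40', '5']

OFFSET = 1000  # assumed to exceed the magnitude of the most negative value
WIDTH = 4      # assumed wide enough for the largest shifted value


def encode(value: int) -> str:
    """Shift the value to be non-negative and pad it to a fixed width."""
    return f"{value + OFFSET:0{WIDTH}d}"


def decode(text: str) -> int:
    """Recover the original value from its encoded form."""
    return int(text) - OFFSET


encoded = sorted(encode(v) for v in values)
print(encoded)                       # ['0987', '0988', '0998', '1005', '1040']
print([decode(t) for t in encoded])  # [-13, -12, -2, 5, 40]
```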
Sorting decimals properly is a bit more difficult, due to the fact that different locales use different symbols for a decimal point, and sometimes the same character used as a decimal point is also used as a separator, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.