# Mass lexical comparison

Mass lexical comparison, or mass comparison, is a highly controversial method developed by the linguist Joseph Greenberg to find genetic relationships among languages in cases that he considered unsuitable for the mainstream comparative method: relationships lying in the remote past, or situations where there are too many languages to apply that method practically without many generations of work. He later called his method "multilateral comparison".

### The comparative method

Since the development of comparative linguistics in the 19th century, a linguist who claims that two languages are related, whether or not there exists historical evidence, is expected to back up that claim by presenting general rules that describe the differences between their lexicons, morphologies, and grammars. The procedure is described in detail in the comparative method article.

For instance, one could prove that Spanish is related to Italian by showing that many words of the former can be mapped to corresponding words of the latter by a relatively small set of replacement rules — such as the correspondence of initial es- and s-, final -os and -i, etc. Many similar correspondences exist between the grammars of the two languages. Since those systematic correspondences are extremely unlikely to be random coincidences, the most likely explanation by far is that the two languages have evolved from a single ancestral tongue (Latin, in this case).
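The idea of a small set of replacement rules can be made concrete with a minimal sketch. It assumes spelling-based rules for only the two correspondences named above; the rule list, the function names, and the restriction of es- to positions before c, p, or t are illustrative assumptions, and real correspondence sets are much larger and are stated over sounds rather than spellings.

```python
import re

# Hypothetical, deliberately tiny rule set: the two correspondences named above.
# Restricting es- to positions before c/p/t is an assumption made for this sketch.
RULES = [
    (r"^es(?=[cpt])", "s"),  # initial es- before c/p/t  ->  s-
    (r"os$", "i"),           # final -os (masculine plural)  ->  -i
]

def spanish_to_italian(word: str) -> str:
    """Apply each replacement rule once, in order, to a Spanish spelling."""
    for pattern, replacement in RULES:
        word = re.sub(pattern, replacement, word)
    return word

# Illustrative pairs, chosen because these two rules alone suffice for them.
PAIRS = [("escudos", "scudi"), ("estilos", "stili"),
         ("estadio", "stadio"), ("libros", "libri")]

def rule_coverage(pairs) -> float:
    """Fraction of pairs whose Italian form is predicted exactly by the rules."""
    hits = sum(spanish_to_italian(es) == it for es, it in pairs)
    return hits / len(pairs)

if __name__ == "__main__":
    for es, it in PAIRS:
        print(es, "->", spanish_to_italian(es), "(attested:", it + ")")
    print("coverage:", rule_coverage(PAIRS))
```

The point of the sketch is that the same few rules account for many word pairs at once, which is what makes coincidence an unlikely explanation.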

Most pre-historical language groupings that are unanimously accepted today — such as the Indo-European, Uralic, Algonquian, and Bantu families — have been proved in this way, although many — such as Niger-Congo, and until quite recently Afro-Asiatic and Sino-Tibetan — have not, and some families whose proponents claim to have proved them in this way (e.g. Nostratic) have not been widely accepted.

### Limitations of the comparative method

However, besides systematic changes, languages are also subject to random mutations (such as borrowings from other languages, irregular inflections, compounding, and abbreviation) that affect one word at a time, or small subsets of words. For example, Spanish perro (dog), which does not come from Latin, cannot be rule-mapped to its Italian equivalent cane (the Spanish word can would be the Latin-derived equivalent but is much less used in everyday conversations, being reserved for more formal purposes).

As those sporadic changes accumulate, they increasingly obscure the systematic ones, just as enough dirt and scratches on a photograph will eventually make the face unrecognizable. Presumably for this reason, the comparative method has not been able to provide reliable evidence of genetic relationship between languages that split more than 10,000 years ago. Considering that humans have probably been speaking fully developed languages for at least 60,000 years (since Australia was first populated), it is hardly surprising that many languages and language families still have no known relationship with other groups.

## Mass lexical comparison

### Lexical similarity

In an effort to extend comparative linguistics beyond its present limits, and arrive at his broad super-family groupings, Greenberg invented a new statistical method, mass lexical comparison. In this method, one simply compares a large sample of words from one language $A$ with their equivalents in the other language $B$, looking for similar sound patterns. Thus, for example, Spanish cabeza and Italian capo are similar to the extent that both contain the same consonant sound [k], the same vowel sound [a], and the similar consonants [b] and [p], in the same sequence.

Departing from the traditional criterion, Greenberg did not look for any systematic trend in these similarities, trusting that a sufficiently large percentage $S\left(A,B\right)$ of sufficiently similar pairs among the samples would be enough to prove a common origin for the two languages. This assumption is valid in principle, because $S$ is expected to be higher for languages that have split off more recently, and to decrease as the split recedes into the past. The chief difficulty lies in deciding what constitutes "sufficient" similarity, particularly bearing in mind that many similarities are due to borrowing between languages and, far more commonly than is often realised, to coincidence.
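As a rough illustration of what a percentage $S\left(A,B\right)$ of similar pairs might look like in practice, the sketch below scores a short Spanish and Italian word list. The similarity criterion (agreement of the first two consonant classes, read off the spelling), the class table, and the word list are all assumptions made for this example; they are not taken from Greenberg, who did not publish a fixed formula of this kind.

```python
# Assumption: consonant letters are grouped into coarse classes, and two words
# count as "similar" when their first two consonant classes agree.
CLASS_OF = {}
for cls, letters in {"P": "pbfv", "T": "tdsz", "K": "kgcq", "M": "mn", "R": "rl"}.items():
    for ch in letters:
        CLASS_OF[ch] = cls

def skeleton(word: str, length: int = 2) -> str:
    """Consonant-class skeleton; vowels and unclassified letters are skipped."""
    classes = [CLASS_OF[ch] for ch in word.lower() if ch in CLASS_OF]
    return "".join(classes[:length])

def similar(a: str, b: str) -> bool:
    """True when both words have a non-empty, identical skeleton."""
    sa, sb = skeleton(a), skeleton(b)
    return bool(sa) and sa == sb

def similarity(pairs) -> float:
    """S(A, B): fraction of meaning-aligned word pairs judged similar."""
    return sum(similar(a, b) for a, b in pairs) / len(pairs)

# Tiny illustrative sample (head, dog, hand, fire); real samples are far larger.
SAMPLE = [("cabeza", "capo"), ("perro", "cane"),
          ("mano", "mano"), ("fuego", "fuoco")]

if __name__ == "__main__":
    print(similarity(SAMPLE))  # 3 of the 4 pairs match under this criterion
```

Loosening or tightening the `similar` predicate changes $S$ directly, which is one way to see why the choice of what counts as "sufficient" similarity matters so much.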

### From similarity to phylogeny

Assuming that the similarity measures are statistically significant, they can be used to decide the branching order of the languages on their presumed genetic tree. That is, if the computed similarity $S\left(A,B\right)$ is greater than both $S\left(A,C\right)$ and $S\left(B,C\right)$, one can take it as an indication that $A$ and $B$ separated from $C$ before separating from each other. In other words, there is a single branch of the tree that includes $A$ and $B$ but not $C$.
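The branching rule for three languages can be sketched in a few lines. The similarity values below are made up for illustration, and the function names are hypothetical; the sketch only shows the decision rule, not how the values would be obtained or tested for significance.

```python
from itertools import combinations

def closest_pair(languages, s):
    """Return the pair with the highest similarity.

    `s` maps frozenset({X, Y}) to the similarity S(X, Y).
    """
    return max(combinations(languages, 2), key=lambda pair: s[frozenset(pair)])

def first_split(languages, s):
    """For exactly three languages, return (grouped_pair, outlier)."""
    a, b = closest_pair(languages, s)
    (outlier,) = [lang for lang in languages if lang not in (a, b)]
    return (a, b), outlier

# Made-up similarity values, for illustration only.
S = {
    frozenset({"A", "B"}): 0.40,
    frozenset({"A", "C"}): 0.15,
    frozenset({"B", "C"}): 0.12,
}

if __name__ == "__main__":
    pair, outlier = first_split(["A", "B", "C"], S)
    print(pair, "share a branch that excludes", outlier)
```

With more than three languages the same idea is usually applied repeatedly, merging the closest pair at each step; that extension is not shown here.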

### Mass comparison

Greenberg also observed that, just from statistical principles, the computed similarity between the lexicons of two sets of closely related languages would be more reliable than that computed from two languages alone. (This justifies the "mass" in the method's name.)

Thus, paradoxically, Greenberg claims that the lexical comparison method should become more accurate as the investigation recedes into the past, because deeper groupings involve larger sets of languages; this offsets to some extent the increased level of statistical noise in the measurements. This stands in contrast to the traditional comparative method, which becomes less reliable as it is applied to broader language groups, since the structural comparisons must be applied to increasingly dubious, inaccurately and incompletely reconstructed proto-languages. Most historical linguists would reply that the time depth involved in mass lexical comparison magnifies the potential for error to such a degree that the results obtained cannot be distinguished from the effects of borrowing or chance coincidence.

The mass lexical comparison method also has the purported advantage that it can reconstruct the broad phylogeny for a large set of languages directly from raw lexical samples, without consulting detailed morphological studies of the languages, or the reconstructions of proto-languages for each branch. Most historical linguists, however, would argue that deep comparison that is not informed by such morphological or historical studies is inherently unreliable, because, once again, actual genetic relationship cannot be distinguished from borrowing or coincidence.

### Choosing the sample lexicon

Ideally, the sample lexicons should contain only words that are likely to have survived in either language since the time of their hypothetical common origin, and are unlikely to be replaced by borrowed or reinvented words. For studies that extend more than 5,000 years into the past, that criterion leaves only a few hundred concepts, such as body parts, close family relations, common animals and plants, water, fire, sky, stone, spear, etc.

Words for "modern" concepts — such as "wine", "horse", and "steel" — may show spurious similarities between unrelated languages, due to the name being imported by a culture together with the thing; e.g. Portuguese pão and Japanese pan ("bread"). Alternatively, the names of recently imported concepts may get invented separately in related languages, such as computadora ("computer") in Latin American Spanish and ordinateur in French. Either way, such words would only add noise and bias to the comparison.

## Weaknesses of the method

### Significance of the similarity

In theory, the reliability of Greenberg's method could be settled by statistical analysis; namely, by computing the probability that a given similarity level $S$ could have arisen by chance coincidences between totally unrelated languages. Two languages should then be considered related only if the observed value of $S$ is significantly greater than this "baseline" level.

Unfortunately, this computation is very difficult to do. For one thing, the similarity level $S$ is expected to depend on the phonetic repertoires of the two languages; thus, for instance, one expects more chance resemblances between two languages that have few vowels and many consonants than between a vowel-rich and a vowel-poor language. Similar biases can be expected when comparing languages that allow consonant clusters with those that do not, or polysyllabic languages with monosyllabic ones. It follows that deciding what would be a significant level of similarity would require a stochastic model for a "random lexicon" that took into account phoneme frequencies, syllable structure, and many other similar statistics.
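One way to approximate the "baseline" without writing an explicit random-lexicon model is a permutation test: keep both word lists as they are, shuffle which word is paired with which meaning, and see how often the shuffled pairings score as high as the real ones. Because the shuffled lists contain the same words, each language's own sound frequencies and syllable shapes are preserved. The sketch below is an illustration under stated assumptions, not Greenberg's procedure; the similarity criterion (identical first letter), the function names, and the word lists are hypothetical stand-ins.

```python
import random

def same_initial(a: str, b: str) -> bool:
    """Stand-in similarity criterion (hypothetical): identical first letter."""
    return a[0] == b[0]

def score(words_a, words_b, similar=same_initial) -> float:
    """S: fraction of meaning-aligned pairs judged similar."""
    return sum(similar(a, b) for a, b in zip(words_a, words_b)) / len(words_a)

def permutation_p_value(words_a, words_b, trials=2000, seed=0, similar=same_initial):
    """Estimate how often chance pairings score at least as high as the real ones."""
    rng = random.Random(seed)
    observed = score(words_a, words_b, similar)
    shuffled = list(words_b)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break the link between word and meaning
        if score(words_a, shuffled, similar) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (trials + 1)

# Tiny illustrative lists, aligned by meaning (hand, foot, sun, moon, house, finger).
SPANISH = ["mano", "pie", "sol", "luna", "casa", "dedo"]
ITALIAN = ["mano", "piede", "sole", "luna", "casa", "dito"]

if __name__ == "__main__":
    observed, p = permutation_p_value(SPANISH, ITALIAN)
    print("observed S:", observed, "estimated p-value:", p)
```

A small p-value only says that the observed $S$ is unlikely under random pairing; it does not by itself separate common origin from borrowing, and a six-word list is far too short for a real study.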

At the same time, the correspondences used in the method are often tenuous, to say the least, requiring at times a correspondence of only one phoneme, or even only one feature (labial, dental, etc.). A wide semantic range is also allowed; for example, in his book on the languages of the Americas, Greenberg compared words meaning arm, shoulder, armpit, forearm, elbow, etc. Using this method, Lyle Campbell, a linguist specializing in the languages of the Americas and author of a review of Greenberg's book, was able to establish a correspondence between the proposed Amerind language and Finnish, and others were able to do the same with Latin and with many other languages obviously not related to those of the Americas.

### Onomatopoeic forms

Also, some of the "ancient" concepts that are most suitable for inclusion in the sample lexicons may have been originally denoted by onomatopoeic words that imitate a natural sound associated with the concept. (Examples of originally onomatopoeic words in English include "crack", "crow", "cough", and "gurgle".) The independent use of this principle in two languages will tend to create similar word pairs that contribute to the similarity measure $S$ but are not due to common origin.

Ideally, such words ought to be excluded from the sample lexicon; but the onomatopoeic origin of a word may be hard to recognize in its present form. Even basic words like "milk" or "wind" have been claimed to reflect the corresponding sounds (those of sucking and blowing, respectively). Unfortunately, the impact of these "natural false cognates" in the similarity measure is hard to estimate.

### Semantic drift and subjectivity

Finally, in every language the same concept can often be expressed by two or more different words; and the meanings of words are known to drift over centuries just as much as their forms. The semantic change can be dramatic — as an example, English black and Russian белый "white" are cognates. They both derive from Proto-Indo-European *bhel, which refers to fire, and which became Germanic *blakaz "blazed" and then English black.

As a consequence of these semantic shifts and synonymies, the construction of the representative lexicon for a language typically involves many choices that must often be made on subjective criteria. These choices may be unconsciously biased towards words that are similar to those previously chosen for other languages, thus artificially inflating the similarity measure $S$. Unfortunately, the impact of this factor, too, is hard to quantify.

## Criticism

Although mass lexical comparison has met with enthusiastic acceptance by some non-linguists, it is rejected by most historical linguists, who view the comparative method as the only legitimate way to establish pre-historical common ancestry for languages. Their essential complaint against mass lexical comparison is that it fails to differentiate between genuine genetic relatedness and borrowing or simple coincidental resemblance between languages.

Proponents of mass lexical comparison claim that Greenberg used it successfully in his classification of the languages of Africa. Critics respond that Greenberg's other claims have not been accepted, including his Amerind family, his Indo-Pacific hypothesis, and his claim that all of the aboriginal languages of mainland Australia are related. They also assert that even his success in Africa is less than it appears. First, much of the previous work was very bad, involving such gross errors as classifying languages on the basis of whether their speakers herded cattle, which set a low bar for improvement. Second, his classification was derivative of other work, particularly that of Westermann. Finally, some specialists have grave doubts about the unity of both Nilo-Saharan and Khoisan, two of Greenberg's four families.

A further consideration is that, insofar as mass lexical comparison is a legitimate scientific method, it must work when applied by others, not just by Joseph Greenberg. In fact, it has been used by many others, with no discernible difference in application, and has produced results that are either not accepted or are considered to be clearly wrong. Examples include the cases of languages wrongly classified as Indo-European that are discussed by Poser and Campbell (1992).

## References

• Campbell, Lyle (2004). Historical Linguistics: An Introduction. 2nd ed. Cambridge, MA: MIT Press. ISBN 0-262-53267-0.
• Greenberg, Joseph H. (1953) Historical linguistics and unwritten languages, in Alfred L. Kroeber (ed.) Anthropology Today. Chicago: University of Chicago Press. pp. 265-286.
• Greenberg, Joseph H. (1971) The Indo-Pacific hypothesis, in Thomas F. Sebeok (ed.) Current Trends in Linguistics. vol. 8: Linguistics in Oceania. The Hague: Mouton. pp. 807-871.
• Greenberg, Joseph H. (1987) Language in the Americas. Stanford: Stanford University Press.
• Hock, Hans Henrich & Joseph, Brian D. (1996). Language History, Language Change, and Language Relationship: An Introduction to Historical and Comparative Linguistics. Berlin: Mouton de Gruyter.
• Kessler, Brett (2001). The Significance of Word Lists: Statistical Tests for Investigating Historical Connections Between Languages. 2nd edition, Stanford, CA: CSLI Publications. ISBN 1-57586-299-9. LINGUIST List 13.491: Review by John Clifton
• Kessler, B., & Lehtonen, A. (2004, July) Paper presented at Phylogenetic Methods and the Prehistory of Languages, McDonald Institute for Archaeological Research, Cambridge.
• Kessler, B. (2003). Review of the book Time Depth in Historical Linguistics Diachronica, 20, 373-377. SICI 0176-4225(20030101)20:2L.375;1-
• Kessler, B., & Lehtonen, A. (2006) Multilateral Comparison and Significance Testing of the Indo-Uralic Question, in Forster and Renfrew (eds.) Phylogenetic Methods and the Prehistory of Languages. ISBN 1902937333.
• Matisoff, James. (1990) On Megalocomparison. Language, 66. 109-20
• Poser, William J. and Lyle Campbell (1992). Proceedings of the Eighteenth Annual Meeting of the Berkeley Linguistics Society, pp. 214-236.
• Ringe, Donald. (1992). On calculating the factor of chance in language comparison. American Philosophical Society, Transactions, 82 (1), 1-110.
• Ringe, Donald. (1993). A reply to Professor Greenberg. American Philosophical Society, Proceedings, 137, 91-109.