The starting point of cladistic analysis is a group of species and molecular, morphological, or other data characterizing those species. The end result is a tree-like relationship diagram called a cladogram. The cladogram graphically represents a hypothetical evolutionary process. Cladograms are subject to revision as additional data become available.
| Number of Species | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | N |
| Number of Cladograms | 1 | 3 | 15 | 105 | 945 | 10,395 | 135,135 | 2,027,025 | 34,459,425 | 1*3*5*7*...*(2N-3) |
This faster-than-exponential growth in the number of possible cladograms explains why constructing cladograms by hand becomes very difficult when the number of species is large.
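The product in the last column of the table can be evaluated directly. The following is a minimal sketch in Python; the function name `count_rooted_cladograms` is chosen here for illustration and does not come from any particular cladistics package.

```python
def count_rooted_cladograms(n_species: int) -> int:
    """Evaluate 1*3*5*...*(2N-3), the count given in the table above."""
    if n_species < 2:
        raise ValueError("need at least two species")
    total = 1
    # Multiply the odd numbers 3, 5, ..., 2N-3 (the leading factor 1 is implicit).
    for odd in range(3, 2 * n_species - 2, 2):
        total *= odd
    return total


# Reproduce a few table entries and go beyond them.
for n in (2, 3, 4, 5, 10, 20):
    print(n, count_rooted_cladograms(n))
```

For 10 species this returns 34,459,425, matching the table.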
Prior to the advent of cladistics, most taxonomists used Linnaean taxonomy and, later, evolutionary taxonomy to organize life forms. These traditional approaches, still in use by some researchers (especially in works intended for a more general audience), use several fixed levels of a hierarchy, such as kingdom, phylum, class, order, and family. Cladistics does not use those terms, because one of its fundamental premises is that the evolutionary tree is so deep and so complex that it is inadvisable to set a fixed number of levels.
Evolutionary taxonomy insists that groups reflect phylogenies. In contrast, Linnaean taxonomy allows both monophyletic and paraphyletic groups as taxa. Since the early 20th century, Linnaean taxonomists have generally attempted to make genus-level and lower-level taxa monophyletic. Ernst Mayr drew a distinction between the terms cladistics and phylogeny, using the term cladistics to refer to classifications which only take into account genealogy, as opposed to phylogeny, which had previously been used in a broader sense to refer to the combination of genealogy and amount of divergence from an ancestor (i.e. evolutionary taxonomy). Mayr wrote, in 1985:
Willi Hennig's pioneering work provoked a spirited debate about the relative merits of cladistics versus traditional taxonomy which has continued down to the present. Some of the debates that the cladists engaged in had been running since the 19th century, but they entered these debates with a new fervor, as can be seen from the Foreword to Hennig (1979) by Rosen, Nelson, and Patterson:
Cladistics strictly and exclusively follows phylogeny, and has arbitrarily deep trees with binary branching: each taxon is a clade. Linnaean taxonomy, while following phylogeny, also subjectively considers morphology, and has a fixed hierarchy, whose taxa are not always clades.
Cladistics also contains the taxon Tetrapoda, whose living members can be classified phylogenetically as "the clade defined by the common ancestor of amphibians and mammals", or more precisely the clade defined by the common ancestor of a specific amphibian and mammal (or bird or reptile), but whose tree is still being worked out (there are a number of extinct branches). The taxon does not have a rank, and its subtaxa are subclades: these can be contained within one another, but one does not divide the clade into several non-overlapping taxa (as in traditional taxonomy); one can split it into two clades at the first branching, but that is all. With regard to the traditional classes, Aves and Mammalia are subclades, contained in the subclade Amniota, but Reptilia* is a paraphyletic taxon, not a clade ("At best, the cladists suggest, we could say that the traditional Reptilia are 'non-avian, non-mammalian amniotes'"). Instead, one divides Amniota into the two clades Sauropsida (which contains birds and all living amniotes other than mammals, including all living traditional reptiles) and Theropsida (mammals and the extinct "mammal-like reptiles"). Similarly, Amphibia* is a paraphyletic taxon.
| Cladistics | Linnaean Taxonomy |
| Treats all levels of the tree as equivalent. | Treats each tree level uniquely. Uses special names (such as Family, Class, Order) for each level. |
| Handles arbitrarily deep trees. | Often must invent new level names (such as superorder, suborder, infraorder, parvorder, magnorder) to accommodate new discoveries. Biased towards trees about 4 to 12 levels deep. |
| Discourages naming or use of groups that are not monophyletic | Acceptable to name and use paraphyletic groups |
| Primary goal is to reflect actual process of evolution | Primary goal is to group species based on morphological similarities |
| Assumes that the shape of the tree will change frequently, with new discoveries | New discoveries often require renaming or releveling of Classes, Orders, and Kingdoms |
| Definitions of taxa are objective, hence free from personal interpretation | Definitions of taxa require individuals to make subjective decisions. For example, various taxonomists suggest that the number of Kingdoms in Biology is two, three, four, five, or six. |
| Taxa, once defined, are permanent (e.g. "taxon X comprises the most recent common ancestor of species A and B along with its descendants") | Taxa can be renamed and eliminated (e.g. Aschelminthes and Insectivora are some of many taxa in the Linnaean system that have been eliminated). |
Proponents of Linnaean taxonomy contend that it has some advantages over cladistics, such as:
| Cladistics | Linnaean Taxonomy |
| Limited to entities related by evolution or ancestry | Supports groupings without reference to evolution or ancestry |
| Does not include a process for naming species | Includes a process for giving unique names to species |
| Difficult to understand the essence of a clade, because clade definitions emphasize ancestry at the expense of meaningful characteristics | Taxa definitions based on tangible characteristics |
| Ignores sensible, clearly defined paraphyletic groups such as reptiles | Permits clearly defined groups such as reptiles |
| Difficult to determine if a given species is in a clade or not (e.g. if clade X is defined as "most recent common ancestor of A and B along with its descendants", then the only way to determine if species Y is in the clade is to perform a complex evolutionary analysis) | Straightforward process to determine if a given species is in a taxon or not |
| Limited to organisms that evolved by inherited traits; not applicable to organisms that evolved via complex gene sharing or lateral transfer | Applicable to all organisms, regardless of evolutionary mechanism |
For some decades in the mid to late 20th century, a commonly used methodology was phenetics ("numerical taxonomy"). This can be seen as a predecessor to some methods of today's cladistics (namely distance matrix methods like neighbor-joining), but made no attempt to resolve phylogeny, only similarities.
Phenetic methods were considered cutting edge in their time, as they were among the first bioinformatics applications, but they have since been superseded by cladistic analyses because phenetics cannot provide an evolutionary hypothesis except by chance. Because phenetics does not distinguish between plesiomorphies (ancient common retained characters) and apomorphies (novel characters that arose after the last common ancestor), it will consider groups "natural" even if they are united only by "primitive" (i.e., retained) characters.
Consider, for example, a cow, a whale, and a human. A cladistic analysis would recognize the whale's lack of legs as an apomorphy, whereas the presence of legs in cows and humans is plesiomorphic. The presence of legs thus provides no information on their relationships in a cladistic analysis, except that their last common ancestor had legs too. In a phenetic analysis, the presence of legs in cows and humans could be taken to indicate that they are closer relatives of each other than either is to whales. In fact, whales and cows are more closely related to each other than either is to humans.
In short, phenetic analyses tend to resolve evolutionary grades as presumably monophyletic groups.
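The cow, whale, and human example can be made concrete with a toy character matrix. The sketch below is illustrative only: apart from "legs", the characters (`retained_trait`, `derived_trait`) are hypothetical placeholders, and the ancestral states are simply assumed rather than inferred from an outgroup. It contrasts a phenetic measure, which counts every matching state, with a cladistic one, which counts only shared derived states.

```python
from itertools import combinations

# Toy character matrix (1 = present, 0 = absent). Only "legs" comes from the
# text; the other two characters are hypothetical placeholders.
ANCESTRAL = {"legs": 1, "retained_trait": 1, "derived_trait": 0}  # assumed polarity
TAXA = {
    "cow":   {"legs": 1, "retained_trait": 1, "derived_trait": 1},
    "whale": {"legs": 0, "retained_trait": 0, "derived_trait": 1},
    "human": {"legs": 1, "retained_trait": 1, "derived_trait": 0},
}


def phenetic_similarity(a, b):
    """Count every matching state, whether ancestral or derived."""
    return sum(TAXA[a][c] == TAXA[b][c] for c in ANCESTRAL)


def shared_derived(a, b):
    """Count only matching states that differ from the ancestral state."""
    return sum(
        TAXA[a][c] == TAXA[b][c] and TAXA[a][c] != ANCESTRAL[c]
        for c in ANCESTRAL
    )


def closest_pair(measure):
    """Return the pair of taxa that the given measure rates as most similar."""
    return max(combinations(TAXA, 2), key=lambda pair: measure(*pair))


print("phenetic grouping: ", closest_pair(phenetic_similarity))
print("cladistic grouping:", closest_pair(shared_derived))
```

With this matrix the phenetic measure pairs cow with human, because they share two retained states, while the shared-derived count pairs cow with whale. The whale's loss of legs is derived but unique to it, so it groups the whale with nothing.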
Many cladists discourage the use of paraphyletic groups because they detract from cladistics' emphasis on clades (monophyletic groups). In contrast, proponents of the use of paraphyletic groups argue that any dividing line in a cladogram creates both a monophyletic section above and a paraphyletic section below. They also contend that paraphyletic taxa are necessary for classifying earlier sections of the tree – for instance, the early vertebrates that would someday evolve into the family Hominidae cannot be placed in any other monophyletic family. They also argue that paraphyletic taxa provide information about significant changes in organisms' morphology, ecology, or life history – in short, that both paraphyletic groups and clades are valuable notions with separate purposes.
A simplified procedure for generating a cladogram is:
For example, if analyzing 20 species of birds, the data might be:
The characteristics used to create a cladogram can be roughly categorized as either morphological (synapsid skull, warm-blooded, notochord, unicellular, etc.) or molecular (DNA, RNA, or other genetic information). Prior to the advent of DNA sequencing, all cladistic analysis used morphological data.
As DNA sequencing has become cheaper and easier, molecular systematics has become a more and more popular way to reconstruct phylogenies. Using a parsimony criterion is only one of several methods to infer a phylogeny from molecular data; maximum likelihood and Bayesian inference, which incorporate explicit models of sequence evolution, are non-Hennigian ways to evaluate sequence data. Another powerful method of reconstructing phylogenies is the use of genomic retrotransposon markers, which are thought to be less prone to the problem of reversion that plagues sequence data. They are also generally assumed to have a low incidence of homoplasies because it was once thought that their integration into the genome was entirely random; this seems at least sometimes not to be the case, however.
Ideally, morphological, molecular, and possibly other phylogenies should be combined into an analysis of total evidence, since each has different intrinsic sources of error. For example, character convergence (homoplasy) is much more common in morphological data than in molecular sequence data, but character reversions that are unrecognizable as such are more common in the latter (see long branch attraction). Morphological homoplasies can usually be recognized as such if character states are defined with enough attention to detail.
The researcher decides which character states were present before the last common ancestor of the species group (plesiomorphies) and which were present in the last common ancestor (synapomorphies) by considering one or more outgroups. This makes the choice of an outgroup an important task, since this choice can profoundly change the topology of a tree. Note that only synapomorphies are of use in characterising clades.
A homoplasy is a character that is shared by multiple species due to some cause other than common ancestry. Typically, homoplasies occur due to convergent evolution. Use of homoplasies when building a cladogram is sometimes unavoidable but is to be avoided when possible.
A well-known example of homoplasy due to convergent evolution is the character "presence of wings". Though the wings of birds, bats, and insects serve the same function, each evolved independently, as can be seen from their anatomy. If a bird, a bat, and a winged insect were scored for the character "presence of wings", a homoplasy would be introduced into the dataset, and this would confound the analysis, possibly resulting in a false evolutionary scenario.
Homoplasies can often be avoided outright in morphological datasets by defining characters more precisely and increasing their number. When analyzing "supertrees" (datasets incorporating as many taxa of a suspected clade as possible), it may become unavoidable to introduce character definitions that are imprecise, as otherwise the characters might not apply at all to a large number of taxa; to continue with the "wings" example, the presence of wings would hardly be a useful character if attempting a phylogeny of all Metazoa, as most of these do not have wings at all. Cautious choice and definition of characters is thus another important element in cladistic analyses. With a faulty outgroup or character set, no method of evaluation is likely to produce a phylogeny representing the evolutionary reality.
When there are just a few species being organized, it is possible to do this step manually, but most cases require a computer program. There are scores of computer programs available to support cladistics. See phylogenetic tree for more information about tree-generating computer programs.
Because the total number of possible cladograms grows faster than exponentially with the number of species, it is impractical for a computer program to evaluate every individual cladogram. A typical cladistic program begins by using heuristic techniques to identify a small number of candidate cladograms. Many cladistic programs then continue the search with the following repetitive steps:
Computer programs that generate cladograms use algorithms that are very computationally intensive, because the underlying optimization problems (for example, finding the most parsimonious cladogram) are NP-hard.
There are several algorithms available to identify the "best" cladogram. Most algorithms use a metric to measure how consistent a candidate cladogram is with the data. Most cladogram algorithms use the mathematical techniques of optimization and minimization.
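One common metric of this kind is parsimony length: the minimum number of character-state changes a candidate cladogram requires to explain the data. The sketch below scores candidate trees with Fitch's small-parsimony procedure, a standard technique that the text does not name explicitly. The tree encoding (nested 2-tuples), the function names, and the four-taxon matrix are all assumptions made for illustration.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree:   a leaf name (str) or a 2-tuple (left_subtree, right_subtree)
    states: dict mapping each leaf name to its observed character state
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # Children agree on at least one state: no extra change needed.
            return common, left_cost + right_cost
        # Children disagree: one change is required on this part of the tree.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def tree_length(tree, matrix):
    """Sum the Fitch scores over all characters in the matrix."""
    return sum(fitch_score(tree, column) for column in matrix.values())


# Hypothetical data: four taxa scored for four binary characters.
MATRIX = {
    "char1": {"A": 1, "B": 1, "C": 0, "D": 0},
    "char2": {"A": 1, "B": 1, "C": 1, "D": 0},
    "char3": {"A": 0, "B": 1, "C": 0, "D": 1},
    "char4": {"A": 1, "B": 1, "C": 0, "D": 0},
}
CANDIDATES = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
}

for label, tree in CANDIDATES.items():
    print(label, tree_length(tree, MATRIX))
```

On this toy matrix the first tree needs 5 changes and the second needs 6, so a parsimony criterion would prefer the first. Other criteria, such as likelihood, replace this count with a different score but fit the same pattern of scoring candidates and keeping the best.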
In general, cladogram generation algorithms must be implemented as computer programs, although some algorithms can be performed manually when the data sets are trivial (for example, just a few species and a couple of characteristics).
Some algorithms are useful only when the characteristic data is molecular (DNA, RNA) data. Other algorithms are useful only when the characteristic data is morphological data. Other algorithms can be used when the characteristic data includes both molecular and morphological data.
Algorithms for cladograms include least squares, neighbor-joining, parsimony, maximum likelihood, and Bayesian inference.
Biologists sometimes use the term parsimony for a specific kind of cladogram generation algorithm and sometimes as an umbrella term for all cladogram algorithms.
Algorithms that perform optimization tasks (such as building cladograms) can be sensitive to the order in which the input data (the list of species and their characteristics) is presented. Inputting the data in various orders can cause the same algorithm to produce different "best" cladograms. In these situations, the user should input the data in various orders and compare the results.
Using different algorithms on a single data set can sometimes yield different "best" cladograms, because each algorithm may have a unique definition of what is "best".
Because of the astronomical number of possible cladograms, algorithms cannot guarantee that the solution is the overall best solution. A nonoptimal cladogram will be selected if the program settles on a local minimum rather than the desired global minimum. To help solve this problem, many cladogram algorithms use a simulated annealing approach to increase the likelihood that the selected cladogram is the optimal one.
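The difference between settling on a local minimum and escaping it can be sketched on a toy problem. In the example below a short list of numbers stands in for the scores of neighboring cladograms; in a real program the neighbors would be tree rearrangements rather than adjacent list positions. The landscape, the cooling schedule, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

# Hypothetical 1-D "score landscape" standing in for tree space: lower is better.
# Index 1 is a local minimum (score 3); index 7 is the global minimum (score 1).
LANDSCAPE = [5, 3, 4, 6, 7, 5, 2, 1, 3, 6]


def score(i):
    return LANDSCAPE[i]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def hill_climb(start):
    """Greedy search: move to the best neighbor until none improves the score."""
    current = start
    while True:
        better = [j for j in neighbors(current) if score(j) < score(current)]
        if not better:
            return current
        current = min(better, key=score)


def anneal(start, steps=2000, t_start=2.0, t_end=0.01, seed=0):
    """Simulated annealing: sometimes accept worse moves, less often over time."""
    rng = random.Random(seed)
    current = best = start
    for step in range(steps):
        # Geometric cooling from t_start towards t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        candidate = rng.choice(neighbors(current))
        delta = score(candidate) - score(current)
        # Always accept improvements; accept a worse move with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
            if score(current) < score(best):
                best = current
    return best


print("hill climbing from 1:", hill_climb(1))
print("annealing from 1:    ", anneal(1))
```

Started at index 1, the greedy search stays there because both neighbors score worse. The annealing search can step uphill while the temperature is high, which gives it a chance of reaching the lower minimum, though it offers no guarantee of doing so.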
One argument in favor of cladistics is that it supports arbitrarily complex, arbitrarily deep trees. Especially when extinct species are considered (both known and unknown), the complexity and depth of the tree can be very large. Every single speciation event, including all the species that are now extinct, represents an additional fork on the hypothetical, complete cladogram representing the full tree of life. Fractals can be used to represent this notion of increasing detail: as a viewpoint zooms into the tree of life, the complexity remains virtually constant. This great complexity of the tree, and the uncertainty associated with the complexity, are among the reasons that cladists cite for the attractiveness of cladistics over traditional taxonomy. Proponents of noncladistic approaches to taxonomy point to punctuated equilibrium to bolster the case that the tree of life has a finite depth and finite complexity. If the number of species currently alive is finite, and the number of extinct species that we will ever know about is finite, then the depth and complexity of the tree of life is bounded, and there is no need to handle arbitrarily deep trees.
A formal code of phylogenetic nomenclature, the PhyloCode, is currently under development for cladistic taxonomy. It is intended for use by both those who would like to abandon Linnaean taxonomy and those who would like to use taxa and clades side by side. In several instances (see for example Hesperornithes) it has been employed to clarify uncertainties in Linnaean systematics, so that in combination they yield a taxonomy that unambiguously places the group in the evolutionary tree in a way that is consistent with current knowledge.
Hennig's major book, even the 1979 version, does not contain the term cladistics in the index. He referred to his own approach as phylogenetic systematics, as implied by the book's title. A review paper by Dupuis observes that the term clade was introduced in 1958 by Julian Huxley, cladistic by Cain and Harrison in 1960, and cladist (for an adherent of Hennig's school) by Mayr in 1965.
There are three ways to define a clade for use in a cladistic taxonomy.
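A definition of the form quoted earlier ("the most recent common ancestor of species A and B along with its descendants") can be checked mechanically once a tree is given. The sketch below assumes a small hypothetical tree stored as child-to-parent links; the node names and helper functions are placeholders for illustration. In practice the hard part is inferring the tree, not performing this lookup.

```python
# Hypothetical toy tree as child -> parent links; all names are placeholders.
PARENT = {
    "A": "n1", "B": "n1",
    "n1": "n2", "C": "n2",
    "n2": "root", "D": "root",
}


def ancestors(node):
    """Path from a node up to the root, including the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def mrca(a, b):
    """Most recent common ancestor of a and b."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)


def in_clade(species, a, b):
    """True if species descends from (or is) the MRCA of a and b."""
    return mrca(a, b) in ancestors(species)


print(mrca("A", "C"))           # n2
print(in_clade("B", "A", "C"))  # True
print(in_clade("D", "A", "C"))  # False
```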
The processes used to generate cladograms are not limited to the field of biology. The generic nature of cladistics means that cladistics can be used to organize groups of items in many different academic realms. The only requirement is that the items have characteristics that can be identified and measured.
Recent attempts in the use of cladistic methods outside of biology attack problems in: