Part 1: The Cytochrome-c tree, anomalies, and why anomalies exist
(Disclaimer: I’m not in the field of bioinformatics.)
Cytochrome-C is a protein involved in turning food and oxygen into energy. It’s found in eukaryotes, which include all multicellular life (plants, animals, and fungi) and some single-celled life (such as yeast). The fact that it’s so ubiquitous gives us the opportunity to compare evolution over wide sections of life on earth. After compiling the protein sequences of nearly 100 species, I ran some genetic analysis on them. Here’s how the results look:

The basic pattern of descent is shown pretty clearly with this data. Animals you’d expect to be related are clustered into groups. For example, primates are a subset of mammals, and apes (including humans) are a subset of primates. Humans, Chimpanzees, Gorillas, and Orangutans all have an identical protein sequence of cytochrome-c (and the DNA sequence varies slightly among them). Birds are a branch out of the reptiles group. Whales are clearly part of the mammal group – not the fish group.
It also shows how ridiculous it is when creationists make statements like:
“There is not evidence yet to claim how the Earth was created and no evidence to connect the family of apes with the family of man.” – Utah state Superintendent of Public Instruction Patti Harrington (Source)
However, there are a few anomalies in the series. They are:
– Frog appears inside the “Fish” group. It also doesn’t appear next to bullfrog.
– Horsfield’s Tarsier appears with rat, mouse, and guinea pig. Tarsiers are related to monkeys, so it should actually appear roughly where kangaroo does.
– The kangaroo (a marsupial) appears inside the placental-mammal group.
– Honey-bee appears outside the ‘insect’ group and near starfish, earthworm, and snail.
– Bat appears near seal and dog.
– Why don’t mammals appear as a subset of reptiles (since mammalian ancestors were reptiles)?
– Why don’t reptiles/amphibians appear as a subset of fish (since terrestrial vertebrates evolved from fish)?
First of all, genetic studies of individual genes have certain limitations. While the general pattern of descent can often be shown from a single gene, the details can be confused due to inherent problems of small datasets. Creationists sometimes use genetic studies on a single gene as if it’s perfect truth, and if anything varies from accepted evolutionary theory, they’ll argue that those problems are evidence that evolutionary theory disagrees with the facts. The problem is this: a genetic study of an individual gene is a little bit like a public poll. Even if you perfectly randomize the people answering your poll, it’s still susceptible to inaccuracies. For example, if you randomly call phone numbers, you might discover that 9 out of 10 respondents support a particular candidate, even when the reality is that it’s a 50-50 split among the public. Studies of single genes have the same problem, and, in both cases, this is a problem that is more likely to occur with a small dataset.
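To put a rough number on the poll analogy, here is a minimal Python sketch. It assumes respondents answer independently and that true support is exactly 50%; the function name `prob_at_least` and the comparison against a 1000-person poll are my own illustration, not part of the original analysis.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more supporters among n independent respondents,
    when the true level of support is p (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that a poll shows 90% or more support when the truth is 50-50.
small_poll = prob_at_least(9, 10)       # 9 or more out of 10
large_poll = prob_at_least(900, 1000)   # 900 or more out of 1000

print(f"10 respondents:   {small_poll:.4f}")
print(f"1000 respondents: {large_poll:.3e}")
```

With 10 respondents the misleading result turns up about 1% of the time (11 chances in 1024). With 1000 respondents it is vanishingly unlikely, which is the sense in which small datasets are the problem.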
How do these problems arise with genetic data? It has to do with mathematics of mutation, and limited information.
When genetic data is analyzed, we look at a sequence, compare differences, and create a tree which describes the relationship pattern. So, for example, if we have four species with the following protein sequence:
Species1: DAAAAA
Species2: AAAAEA
Species3: ACAAEA
Species4: ACAAEA
We could construct a few different trees to describe the situation. If we assume “AAAAAA” is the ancestral sequence, then the tree looks like this:

We would then infer that this pattern represents the splitting of species and mutations over time. In this case, Species2, Species3, and Species4 probably inherited the E mutation while they were all one species. Species3 and Species4 acquired the C mutation while they were one species. However, it’s possible that all of these mutations happened independently, like this:

Statistically, it’s unlikely situation #2 would happen. It requires that Species4 happens to get exactly two mutations, and those two mutations exactly match the mutations in other species. However, it’s not statistically impossible. And since it’s not impossible, it will happen with a frequency equal to its likelihood. It’s also possible that a mixture of the two situations occurs.
So: when two species have the same mutation, it might be that they gained it through common ancestry, or it might simply be coincidence. When dealing with large numbers of mutations, you can quickly sort out which is which, but with fewer mutations, the correct interpretation is less certain.
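The two readings of the four-species example can be compared by counting how many separate mutation events each one needs. This is only a sketch: it assumes “AAAAAA” is the ancestral sequence (as in the text), and the helper `mutations` is a name I made up for illustration. Counting distinct changes equals the number of events on the shared-ancestry tree only because the changes in this example nest cleanly.

```python
ANCESTOR = "AAAAAA"  # assumed ancestral sequence, as in the text
SPECIES = {
    "Species1": "DAAAAA",
    "Species2": "AAAAEA",
    "Species3": "ACAAEA",
    "Species4": "ACAAEA",
}

def mutations(seq, ancestor=ANCESTOR):
    """Return the set of (position, new_residue) changes relative to the ancestor."""
    return {(i, b) for i, (a, b) in enumerate(zip(ancestor, seq)) if a != b}

per_species = {name: mutations(seq) for name, seq in SPECIES.items()}

# Shared-ancestry reading: each distinct change happened once and was inherited.
shared_events = len(set().union(*per_species.values()))

# Fully independent reading: every species acquired each of its changes separately.
independent_events = sum(len(m) for m in per_species.values())

print("events needed with shared ancestry:", shared_events)
print("events needed if all independent:  ", independent_events)
```

The shared-ancestry tree needs 3 mutation events (D, E, and C once each), while the fully independent reading needs 6. Tree-building methods that prefer the explanation with fewer events are generally called parsimony methods; I don’t know that the software used here works that way, so treat this as an illustration of the reasoning only.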
These are some situations which can make the ancestry ambiguous, and lead to erroneous phylogenetic trees:
First, let’s pretend we have a 100 amino-acid sequence. Let’s also say that each location can contain two different possibilities (the other 18 amino acids disrupt the protein’s function, killing the organism).
(1) The more species there are, the more likely two of them will have an identical mutation by coincidence. If we have two species, and each of them has one independent mutation, then the odds that they will be the same mutation are 1 in 100. However, if we expand our example to contain 15 species, each with one independent mutation, the odds that some pair of species will share a mutation become high (roughly two in three in this toy model). In fact, on average, there will be about one matching pair. (The fifteenth species alone has up to a 14% chance of ‘hitting’ an existing mutation, because there can already be fourteen separate mutations in the group.) The situation gets worse and worse as more species are added to the group. That common mutation might be interpreted as “a common mutation acquired through common ancestry”, but that’s an incorrect conclusion.
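The arithmetic in (1) is the same as the classic “birthday problem”. Here is a small sketch under the toy model’s assumptions: 100 positions, one independent mutation per species, and every position equally likely. The function names are mine.

```python
from math import comb, prod

SITES = 100  # toy model from the text: a 100 amino-acid sequence

def expected_matching_pairs(n_species, sites=SITES):
    """Average number of species pairs whose single mutation hits the same position."""
    return comb(n_species, 2) / sites

def prob_any_match(n_species, sites=SITES):
    """Probability that at least two species share a mutated position."""
    p_all_distinct = prod((sites - i) / sites for i in range(n_species))
    return 1 - p_all_distinct

for n in (2, 15, 30):
    print(n, "species:",
          f"expected matching pairs = {expected_matching_pairs(n):.2f},",
          f"P(at least one match) = {prob_any_match(n):.2f}")
```

For two species this gives the 1-in-100 figure. For 15 species there are 105 possible pairs, so the average number of matching pairs is 1.05, and the chance of at least one coincidental match is roughly two in three.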
(2) The more independent mutations a species has, the more likely it is that one will overlap an existing mutation in another species. Imagine that our two species have each acquired 20 independent mutations. What are the odds that one of the mutations in Species1 will match a mutation in Species2? Statistically, we can expect that around 4 mutations will match ( 0.20 * 0.20 * 100 locations = 4 ). Again, the situation becomes more likely with more mutations. None of those mutations were actually acquired through common descent, but it will be interpreted as commonly acquired mutations.
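The “around 4” figure in (2) can be checked, along with how rarely two such species would show no overlap at all. As before, this is a sketch of the toy model only: it assumes each species’ mutations fall on distinct positions chosen uniformly at random, and the function names are my own.

```python
from math import comb

def expected_overlap(m1, m2, sites=100):
    """Average number of positions mutated in both species."""
    return m1 * m2 / sites

def prob_no_overlap(m1, m2, sites=100):
    """Chance that none of the m2 mutated positions in the second species
    lands on any of the m1 mutated positions in the first."""
    return comb(sites - m1, m2) / comb(sites, m2)

print("expected overlapping positions:", expected_overlap(20, 20))
print(f"chance of zero overlap: {prob_no_overlap(20, 20):.4f}")
```

With 20 mutations each, the expected overlap is 4 positions, and the chance of no overlap at all is under 2%. In other words, in this model some coincidental agreement is close to guaranteed.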
(3) Back mutations also make the situation ambiguous. Let’s say we begin with four species with this sequence of mutations. Species 4 has a back mutation (changing “E” back to “A”).

The resulting sequences are ambiguous. What should the interpretation be from the sequences alone?

Based on the resulting sequences, it’s not quite clear what the correct interpretation should be – at least not without some outside information (from other genes, etc). And if you construct a tree with the wrong interpretation (2 or 3), creationists might jump on it and say, “The genetics says that Species2 and Species3 are more closely related than Species3 and Species4. But, evolutionists claim Species3 and Species4 are more closely related. Evolution contradicts the facts.”
The problem of back mutations increases as the number of independent mutations increases. This is because the chance of a back mutation grows with the total number of mutations.
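Here is a rough sketch of how back mutations hide history in the toy model (100 positions, two possible amino acids per position). It assumes every mutation event picks a position uniformly at random and flips it, so a position only looks changed if it was hit an odd number of times. The function name is hypothetical.

```python
def expected_visible_differences(events, sites=100):
    """Average number of positions that still differ from the ancestor after
    `events` random flips, when each position has only two possible states."""
    # A position differs only if it was hit an odd number of times.
    p_odd = (1 - (1 - 2 / sites) ** events) / 2
    return sites * p_odd

for events in (1, 20, 50, 100):
    visible = expected_visible_differences(events)
    print(f"{events:3d} mutation events -> about {visible:.1f} visible differences "
          f"({events - visible:.1f} erased by repeat hits)")
```

After 20 events only about 16 to 17 differences remain visible on average, and the gap widens as the number of events grows. The more a lineage has mutated, the more of its real history has been overwritten.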
Explaining the Anomalies:
In most of the anomalies shown above, the problem involves a single species which has no close relatives on the chart, and has acquired a large number of mutations. This increases the incidence of situations #2 (large numbers of independent mutations coincidentally overlapping existing mutations) and #3 (back mutations erasing actual descent information). And large numbers of species (#1) give lots of possibilities to find matches. Take the honeybee for example:

There is a small area of commonality (section A), and a large area of independent mutations (section B). Cytochrome-C contains 104 amino-acids, and the honey-bee and snail versions differ at 26 locations. What probably happened is that a few mutations overlapped by coincidence, so the honey-bee matched the snail slightly better than it matched other species on the chart (perhaps due to back mutations), and the software erroneously placed it next to ‘snail’.
The frog and kangaroo follow this same pattern. While one would expect ‘bullfrog’ to be a close relative of the frog (i.e. Western Clawed Frog), they actually differ at 15 locations. The large number of differences shows that their common ancestor lived a long time ago – which shows just how ancient the ‘frog’ group is. And Kangaroos are the only marsupial on the chart. The Kangaroo protein sequence should be equidistant from all placental mammals, except for some random coincidental mutations. It just happens that those coincidental mutations placed it near the primate group where it clearly doesn’t belong. In fact, different analysis algorithms place the kangaroo in different locations, indicating how tentative its current placement is. Including some other marsupials in the list should stabilize its location outside the placental mammal group.
Bats appear near seals and dogs. That seems odd. However, bats are actually a pretty ancient group as far as mammals go, so there might be some coincidental mutations. (And as for the hippo being close to the same group – well, based on the length of the line, that’s a pretty thin conclusion.)
None of these four species has any close relatives on the chart, and each has a large number of independent mutations, so the software probably found the best match based on coincidental mutations.
Horsfield’s Tarsier is also an anomaly. It should appear at the base of the primate group. Either this is just a case of an odd coincidental mutation placing it elsewhere, or perhaps tarsiers shouldn’t be classified as primates. (Some people have suggested that.) In the end, a larger genetic analysis should clear up what’s going on.
The other anomalies involve mammals not appearing inside the reptile group, and terrestrial vertebrates not appearing as a subbranch of fish. In fact, mammals evolved from a branch of ancient reptiles that was separate from the ancient animals that gave rise to modern reptiles. From the Tree of Life website (1,2):

And terrestrial vertebrates descended from lobe-finned fish, a lineage separate from the ray-finned lineage of the four fish shown (tuna, carp, zebrafish, and pufferfish). From the Tree of Life website (1,2):

(Another interesting thing to notice from the Tree-Of-Life charts is the large number of animal groups that have gone extinct. All the little yellow crosses indicate extinct families of animals. It looks like nearly 90% of all animal groups have gone extinct. I guess those were the projects God started and then scrapped.)
Up Next: How Creationists use and abuse Cytochrome-C data