Sunday, December 13, 2015

A Proposal for a New Lexicon for Ancient DNA "Components" Like WHG, EHG, EEF, ANE, and CHG

A few years back, some of us started to decry the ever-ongoing ISOGG renaming process, which, coupled with the discovery of new subclades, meant that one year someone might be deemed R1b1b1a2bab2ba11babd12ba2b1c, and the next year R1b1b2bab2f1faf1fafaf1f1f1a.

People started saying that it would probably be better to give the first couple of letters and the major terminal SNP. For example, R1b-U106 or I2-M26. This was logical and good. Unlike the terminology, the SNPs never change. And they're shorter to write.

Here I humbly propose a new terminology for ancient autosomal samples. I think picking terms like "WHG" was a mistake, and now that I am reading about EHG and CHG, I really think so. For the uninitiated, these acronyms stand for "Western Hunter Gatherer," "Caucasus Hunter Gatherer," etc.

People compare their modern genomes, or the genomes of modern populations or ethnic groups, to these ancient samples. And then they use the shorthand, like, "Scottish average 19% CHG." This is highly misleading.

Let me give the reasons why I think the current terminology is deficient, and tell me if you disagree.

1. As we get more samples over time, it will be hard to keep coming up with names for new samples that form a different component. We just saw this with the recent CHG finds. Imagine if we find a detectable signal of ancient genes from Iberia. What will we call that component? "Really Western Hunter Gatherer?"

2. The shorthand is deeply misleading (e.g., "Scottish are 19% CHG"). This, to me, is the most important point. Most people reading this are experts. But on so many other boards I see people who seem to think that some scientist somewhere took a survey of a bunch of ancient samples, "averaged" them, and that we are comparing populations to populations.

We're not. We are not comparing Scots to Western Hunter Gatherers. We are comparing Scots (or any other modern individual or group) to ONE SAMPLE. For WHG, it's Loschbour. For EEF, it's Stuttgart. For ANE, it's Mal'ta. Etc.

3. We don't know that that one sample will turn out to be representative of "Western Hunter Gatherers" any more than we know that Danny DeVito or the Harlequin model Fabio is representative of modern Italians. Indeed, as the number of samples grows, we are learning that the situation is far more complex.

We all remember, for example, when the first farmers sampled had very unique mtDNA. For a while, people tried to read too much into it. "OMG, what if all farmers bore this odd mtDNA?" was the refrain. But it turned out to be a one-off. This can and will happen again and again as we get more samples over time.

4. The acronyms will get repetitive real fast. We are talking about aDNA, remember? Before farming, the whole world were hunter gatherers. So, many (most) aDNA samples will eventually have -HG after them, if we follow the current convention.

I imagine a world where we have found 26 slightly different hunter gatherer samples, and thus we have one different -HG for every letter in the alphabet! That'd be just silly.


For these reasons, but primarily numbers 2 and 3, I think the current practice is misleading and doomed to failure. Europe is a very complicated place. We will find ancient samples with very unique genomes, which are detectable in modern populations. They will all be slightly different from one another, because one sample is, well, one sample... It is highly misleading to say that "John Smith..." or "Estonians are more Western Hunter Gatherer than..." because we have not sampled all, most, or even many Western Hunter Gatherers. (I don't mean to pick on WHG. This applies equally, indeed MORE, with EEF and ANE!)

So, what is the solution?

I think if we purport to be scientific, we need to speak with scientific precision.

If an individual or a modern population bears resemblance to an ancient genome, we should state that it has a percentage similarity to that one sample. We should not try to make it more than it is with a very official and expansive term like "Eastern Hunter Gatherers."

As for the sample, we should also include the year discovered, the site of the discovery, and the years Before Present (BP).

Remember, many of these sites are caves where there have been and will be more discoveries. In other words, I expect there will be many more Loschbours, more Stuttgarts, etc., and it will get quite confusing unless we speak with specificity about when a sample was discovered and what period it dates from.

Let's avoid a situation like the one we had with terms like R1b1b1b1a2a1b2bc3d, which lose all meaning. Let's refer to things with scientific precision.


Instead of, "Scots are 19% Ancient North Eurasian."

SAY: "On average 19% of the genes of the modern Scottish population match 2013Mal'ta-24,000BP."

Instead of, "Southern European populations have a lot more CHG blood than I expected."

SAY: "Southern European populations bear many genes matching 2015Kotias-10,000BP."

Instead of, "Sardinians are 45% WHG."

SAY: "Approximately 45% of the genes in the modern Sardinian population resemble 2013Loschbour-8,000BP."
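
For anyone who keeps such comparisons in a spreadsheet or script, the label pattern used in the examples above (year, site, a hyphen, then years BP) can be sketched in a few lines of Python. This is only an illustration of the proposed format, not an established standard; the names AncientSample, label, and describe are made up for this sketch, and the percentage is simply the figure from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AncientSample:
    """One ancient genome, identified by year, site, and age (hypothetical type)."""
    year: int       # year attached to the sample in the label
    site: str       # site of the discovery
    years_bp: int   # approximate age in years Before Present

    def label(self) -> str:
        # Year + site + "-" + age with a thousands separator + "BP"
        return f"{self.year}{self.site}-{self.years_bp:,}BP"


def describe(population: str, percent: float, sample: AncientSample) -> str:
    """Phrase a comparison against ONE sample, not against a whole 'component'."""
    return (f"On average {percent:g}% of the genes of the modern "
            f"{population} population match {sample.label()}.")


malta = AncientSample(2013, "Mal'ta", 24000)
kotias = AncientSample(2015, "Kotias", 10000)

print(malta.label())
print(kotias.label())
print(describe("Scottish", 19, malta))
```

Because the year and the age are part of the label, a second find at the same site gets a different name automatically, without anyone having to invent a new acronym.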

This convention is much more accurate.

