Genetics, History, DNA, and Genealogy Information: calculators

Saturday, January 30, 2016

In Praise of Roberta Estes and DNAeXplained.com

In a world of pseudo-science and echo chambers, a few blogs stick out for being mostly in touch with reality. In the world of ancient DNA, Dienekes' blog, although less active than before, pioneered much in the field and still draws comments from many serious scientists.

In the world of DNA for Genealogy, one blog sticks out. It is Roberta Estes' DNAeXplained.com. Of all the blogs and websites dedicated to disseminating information about DNA, hers is consistently factual, science-based, and yet easy to understand.

I came across a few of her posts and, as a scientist, I daresay they are mandatory reading for anyone seeking a better understanding of their DNA. Below are links and highlights:

"Determining Ethnicity Percentages"

Step 1: Creation of the underlying population data base.
Don’t we wish this was as simple as it sounds. It isn’t. In fact, this step is the underpinnings of the accuracy of the ethnicity predictions. The old GIGO (garbage in, garbage out) concept applies here. . . .

The third way to obtain this type of information is by inference. Both Ancestry.com and 23andMe do some of this. Ancestry released its V2 ethnicity updates this week, and as a part of that update, they included a white paper available to DNA participants. In that paper, Ancestry discusses their process for utilizing contributed pedigree charts and states that, aside from immigrant locations, such as the United States and Canada, a common location for 4 grandparents is sufficient information to include that individuals DNA as “native” to that location. Ancestry used 3000 samples in their new ethnicity predictions to cover 26 geographic locations. That’s only 115 samples, on average, per location to represent all of that population. That’s pretty slim pickins. Their most highly represented area is Eastern Europe with 432 samples and the least represented is Mali with 16. The regions they cover are shown below. . .

No matter which calculations you use relative to acceptable Margin of Error and Confidence Level, Ancestry’s sample size is extremely light. . . .
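
To put rough numbers on how thin those samples are, here is a minimal Python sketch. It is my own illustration, not anything taken from Ancestry's white paper or from Roberta's post. It assumes each region's reference panel behaves like a simple random sample, and it uses the textbook worst-case margin of error for a proportion (p = 0.5) at roughly 95% confidence (z = 1.96). Real ethnicity estimation is far more involved, so treat this only as a feel for scale.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Textbook margin of error for a proportion estimated from n samples.

    p=0.5 is the worst case; z=1.96 corresponds to roughly 95% confidence.
    Both are assumptions made for illustration only.
    """
    return z * math.sqrt(p * (1 - p) / n)

total_samples = 3000   # figure quoted in the excerpt
locations = 26         # figure quoted in the excerpt
average_per_location = total_samples / locations  # about 115

panels = [
    ("average region", round(average_per_location)),
    ("Eastern Europe", 432),  # most represented, per the excerpt
    ("Mali", 16),             # least represented, per the excerpt
]

for label, n in panels:
    print(f"{label}: n={n}, margin of error ~ {margin_of_error(n):.1%}")
```

Under those assumptions, the margin of error comes out to roughly 9 percentage points for an average region, about 5 for Eastern Europe, and about 25 for Mali.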

"Are You Native American?"

"having Haplogroup Origins and Ancestral Origins indicating Native American ancestry does not necessarily mean you are Native American or have Native American heritage. This is a very pervasive myth that needs to be dispelled. . . .

The good news is that more and more people are DNA testing. The bad news is that errors in the system are tending to become more problematic, or said another way, GIGO – Garbage in, Garbage Out.

....

There are a very limited number of major haplogroups that include Native American results. For mitochondrial DNA, they are A, B, C, D, X and possibly M. I maintain a research list of the subgroups which are Native. Each of these base haplogroups also have subgroups which are European and/or Asian. The same holds true for Native American Y haplogroups Q and C.
In the Haplogroup Origins and Ancestral Origins, there are many examples where Non-Native haplogroups are assigned as Native American, such as haplogroup H1a below. Haplogroup H is European...

One of the problems we have today is that because there are so many people who carry the oral history of grandmother being “Cherokee,” it has become common to “self-assign” oneself as Native. That’s all fine and good, until one begins to “self-assign” those haplogroups as Native as well – by virtue of that “Native” assignment in the Family Tree DNA data base. That’s a horse of a different color.

Sunday, December 13, 2015

A Proposal for a New Lexicon for Ancient DNA "Components" Like WHG, EHG, EEF, ANE, and CHG

A few years back, some of us started to decry the ever-ongoing ISOGG renaming process, which, coupled with the discovery of new subclades, meant that one year someone might be deemed R1b1b1a2bab2ba11babd12ba2b1c, and the next year R1b1b2bab2f1faf1fafaf1f1f1a.

People started saying that it would probably be better to give the first couple of letters and the major terminal SNP. For example, R1b-U106 or I2-M26. This was logical and good. Unlike the long-form names, the SNP names never change. And they're shorter to write.

Here I humbly propose a new terminology for ancient autosomal samples. I think picking terms like "WHG" was a mistake, and now that I read about EHG and CHG, I really think so. For the uninitiated, these acronyms stand for "Western Hunter Gatherer," "Eastern Hunter Gatherer," "Caucasus Hunter Gatherer," etc.

People compare their modern genomes, or the genomes of modern populations or ethnic groups, to these ancient samples. Then they use shorthand such as, "Scots average 19% CHG." This is highly misleading.

Let me give the reasons why I think this practice is deficient; tell me if you disagree.

1. As we get more samples over time, it will be hard to keep coining new names every time a sample forms a different component. We just saw this with the recent CHG finds. Imagine if we find a detectable signal of ancient genes from Iberia. What will we call that component? "Really Western Hunter Gatherer?"

2. The shorthand is deeply misleading (e.g., "Scottish are 19% CHG"). This to me is the most important point. Most people reading this are experts. But on so many other boards I see people who seem to think that some scientist somewhere took a survey of a bunch of ancient samples, "averaged" them, and that we are comparing populations to populations.

We're not. We are not comparing Scots to Western Hunter Gatherers. We are comparing Scots (or any other modern individual or group) to ONE SAMPLE. For WHG, it's Loschbour. For EEF, it's Stuttgart. For ANE, it's Mal'ta. Etc.

3. We don't know that that one sample will turn out to be representative of "Western Hunter Gatherers" any more than we know that Danny DeVito or the Harlequin model Fabio is representative of modern Italians. Indeed, as the number of samples grows, we are learning that the situation is far more complex.

We all remember, for example, when the first farmers sampled had highly unusual mtDNA. For a while, people tried to read too much into it. "OMG, what if all farmers bore this odd mtDNA?" was the refrain. But it turned out to be a one-off. This can and will happen again and again as we get more samples over time.

4. The acronyms will get repetitive real fast. We are talking about aDNA, remember? Before farming, everyone in the world was a hunter gatherer. So many (most) aDNA samples will eventually have -HG after them, if we follow the current convention.

I imagine a world where we have found 26 slightly different hunter gatherer samples, and thus we have a different -HG for every letter of the alphabet! That'd be just silly.

For these reasons, but primarily numbers 2 and 3, I think the current practice is misleading and doomed to failure. Europe is a very complicated place. We will find ancient samples with highly distinctive genomes that are detectable in modern populations. They will all be slightly different from one another, because one sample is, well, one sample... It is highly misleading to say that "John Smith..." or "Estonians are more Western Hunter Gatherer than..." because we have not sampled all, most, or even many Western Hunter Gatherers. (I don't mean to pick on WHG. This applies equally, indeed MORE, to EEF and ANE!)

So, what is the solution?

I think if we purport to be scientific, we need to speak with scientific precision.

If an individual or a modern population bears resemblance to an ancient genome, we should state that it has a percentage similarity to that one sample. We should not dress it up as more than it is with a very official-sounding and expansive term like "Eastern Hunter Gatherers."

As for the sample, we should also include the year it was discovered, the site of the discovery, and its age in years Before Present (BP).

Remember, many of these sites are caves where there have been and will be more discoveries. In other words, I expect there will be many more Loschbours, more Stuttgarts, etc., and it will get quite confusing unless we speak with specificity about when a sample was discovered and what period it dates from.

Let's avoid a situation like we had with terms like R1b1b1b1a2a1b2bc3d, which lose meaning. Let's refer to things with scientific precision.

Examples:

Instead of, "Scots are 19% Ancient North Eurasian."

SAY: "On average 19% of the genes of the modern Scottish population match 2013Mal'ta-24,000BP."

Instead of, "Southern European populations have a lot more CHG blood than I expected."

SAY: "Southern European populations bear many genes matching 2015Kotias-10,000BP."

Instead of, "Sardinians are 45% WHG."

SAY: "Approximately 45% of the genes in the modern Sardinian population resemble 2013Loschbour-6,000BP."

This convention is much more accurate.
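
For anyone who keeps their comparisons in a spreadsheet or script, the naming pattern in the examples above (year discovered, then site, then years BP) can be generated mechanically. Here is a minimal Python sketch. The function names `sample_label` and `similarity_statement` are hypothetical names of my own, and the thousands separator in the BP figure is simply my formatting choice.

```python
def sample_label(year_discovered, site, years_bp):
    """Build a label such as 2013Mal'ta-24,000BP from its three parts."""
    return f"{year_discovered}{site}-{years_bp:,}BP"

def similarity_statement(population, percent, label):
    """Phrase a comparison against one named sample, not a whole 'component'."""
    return (f"On average {percent}% of the genes of the modern "
            f"{population} population match {label}.")

# Values below are the ones used in the examples in this post.
label = sample_label(2013, "Mal'ta", 24000)
print(similarity_statement("Scottish", 19, label))
```

If a site later yields a second or third genome, only the arguments change; the statement still points at one specific sample.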