Let's apply logic and broad historical knowledge to DNA studies and genetic genealogy, so that we can help scientists brainstorm new ideas, and so that we can rid the field of charlatans, hucksters, pop-sci, and pseudoscience.
Wednesday, October 28, 2015
What Is the Best and Most Accurate Ancestry Calculator (DNA Testing)?
What Is the Best and Most Accurate Ancestry or Admixture Calculator from DNA Testing?
We Review 23andme, AncestryDNA, Family Tree DNA (FTDNA), DNA.Land, Dodecad, Eurogenes, etc.
Judging from discussions in online forums, "admixture" tests are all the rage. In these tests, a company or other entity takes your raw DNA data, runs it through a calculator, and then purports to tell you where your ancestors came from. It is not rare for seemingly educated individuals to post sheer and utter nonsense about their results on the Internet, for example, assuming that a calculator identified their ancestry with something close to 100% accuracy.
In the online world, there is no such thing as perfect privacy. And in DNA, there is no such thing as 100% accuracy for ancestry calculators.
This is because all people are admixed, but not all ethnic groups form part of the samples. Put another way, if your ancestors come from a valley in Switzerland where no one has ever been tested, you might show up in a test as French, German, Italian, Austrian, but not Swiss.
You might tell yourself that your documented ancestry places you in Switzerland back to the dawn of time. You may match other Swiss people exactly. But because the Swiss are indeed mixes of the groups above, and because there are no specific, micro-targeted Swiss samples in the hypothetical database that match you more closely than those other nationalities, the test would be woefully inaccurate for YOU. After all, you don't want a test to tell you that you might be Northern Italian if you are Swiss. (For that matter, do you NEED a test to tell you that? See below.)
In the online privacy world, they've named protections that are scientifically the best (and do their job pretty darn well) "Pretty Good Privacy." In the DNA world, all we can hope for is "Pretty Good Accuracy" -- ancestry calculators that are scientifically grounded, don't make claims beyond what they can really do, and get the broad regions correct at the very least.
The coolest benefit of living in a college town (Berkeley for this blogger) is that there are a ton of people from all over the world with pretty well-defined ancestry. For example, that Danish exchange student with 500 years of documented ancestors in Denmark? That's a good candidate for testing some of these calculators. Enough friends of mine have taken DNA tests, and we've plugged the results into the calculators across several paysites (testing companies) like 23andme and AncestryDNA, and into free calculators, like the ones available on Gedmatch. Who came out on top?
By far, the best and most accurate ancestry calculator is on 23andme. Like all good scientists, they are humble instead of full of hubris. They don't profess to give you one set of results and say, "this is it." Instead, they give you three different results: standard, conservative, and speculative. Each is pretty darn accurate for most of the people we know who have tested there and at other sites. Bottom line: 23andme's "Ancestry Composition" feature is outstanding, and the best, most accurate one online we could find.
It is our opinion that the least accurate ancestry calculator is at the new site DNA.land, and the one on FTDNA is a close second. Both are terrible. Almost everyone who used the feature on DNA.land reported that the calculator is way off; it is just not ready for prime time at the time of writing.
How do these calculators work? Well, remember, the data that comes out is only as good as the data that goes in. It is always worth remembering the concept that computer programmers call "GIGO: Garbage In, Garbage Out." What this means is that if the data on which a conclusion is based is faulty, the answer will also be faulty. With calculators, this manifests itself in two ways: a shifted focus, or faulty or incomplete baseline data.
By a shifted focus, we mean that many calculators are built around one particular set of populations. Take, for example, the MDLP Ethnicity Calculator, offered (along with Eurogenes, Dodecad, and Gedrosia) at Gedmatch. MDLP stands for Magnus Ducatus Lituaniae Project. As you might have guessed from its name, it focuses on the people from lands that used to form the Grand Duchy of Lithuania: places in Northeast Europe, including Poland, Estonia, etc. MDLP seeks to be very good at calculating ethnic tidbits of interest to those populations. But is it good for determining the difference between, say, a Catalonian Spaniard and a Northern Italian? No, it's actually quite bad on that front. That's simply not its focus.
Similarly, there are other calculators on Gedmatch that exist to focus on and cater to Asians, Africans, even mixed-race folks. And within European populations, you have other focuses, like Dodecad, which seems Grecocentric, for lack of a better word. None of these will do that great outside their focus areas. So take the results from those with a grain of salt, unless you happen to hail from their regions of focus.
Don't believe that? Think I'm being extreme? If you are European, try putting your data in a calculator that is focused on another population, like the East Asian-focused calculators. It won't tell you that you are NOT East Asian. It will tell you which East Asian population you resemble the most. To be clear: if all a calculator has is East Asian samples, a European will be told he or she is Japanese or Chinese.
This same concept applies within European focused calculators at the regional level.
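To make the point concrete, here is a minimal sketch of a "closest population" calculator. It is not how any real calculator on Gedmatch or 23andme is implemented; the two-number "genetic coordinates," the population values, and the function name are all invented for illustration. The only thing it is meant to show is that a calculator can only answer with a population that is in its reference panel.

```python
import math

# Hypothetical 2-D "genetic coordinates" for each reference population.
# Every number here is invented purely for illustration.
EAST_ASIAN_PANEL = {
    "Japanese": (9.0, 1.0),
    "Han Chinese": (8.0, 2.0),
    "Korean": (8.5, 1.2),
}

def closest_population(sample, panel):
    """Return the name of the panel population nearest to the sample."""
    return min(panel, key=lambda name: math.dist(sample, panel[name]))

# A made-up European sample, far away from every East Asian reference.
european_sample = (1.0, 7.0)

# The calculator cannot say "none of the above"; it picks the least-bad fit.
print(closest_population(european_sample, EAST_ASIAN_PANEL))

# Add a (hypothetical) French reference and the answer changes completely.
wider_panel = dict(EAST_ASIAN_PANEL, French=(1.2, 6.8))
print(closest_population(european_sample, wider_panel))
```

With these made-up numbers, the first call answers "Han Chinese" and the second answers "French." The sample never changed; only the panel did.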
In terms of bad baselines, recall the Swiss example above. Europe is filled with micropopulations that exhibit a high degree of population homogeneity (a little inbred, to use the pejorative term). If a calculator does not have a sample from your micropopulation (the narrow region where your ancestor lived for millennia), then you will get a faulty reading.
Take France as an example: it's a big country. Normans are not Basques, Provençals are not Bretons, etc. That is why the best calculators are HONEST. 23andme discloses quite readily that for the huge populations in the middle of Europe (French, Germans, but also Benelux countries, etc.), it cannot spot the DNA with certainty 92% of the time.
Does the 23andme website have any drawbacks? Sure it does. But they are minor compared to the others.
First, its "Countries of Ancestry" feature is not what it could be. But it's important to understand three things: (1) This is NOT their ancestry calculator, but another feature entirely, so perhaps it's unfair for us to even review it in this space. (2) It's experimental, and they state that. (3) They are wisely phasing it out. What was the problem with that feature? Well, it gave you the list of countries reported by the people who have the most matches with you. Let's say, for example, you are half Italian, half Polish (a common mix in Chicago). In other parts of Chicago, another common mix is half Polish, half Irish. For whatever reason, people of Irish heritage have tested themselves in far greater numbers than the others. Your Polish DNA would overlap (match) with the people who reported they were half Polish, half Irish. And this feature would then tell you that "a high percentage of the people who have DNA similar to yours are from Ireland." Do you understand? It's a huge problem, especially for smaller populations, and especially because so many Americans are now half this, half that. The result is just not that edifying.
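The Chicago example can be worked through with a toy tally. This is a sketch of the reasoning only, not 23andme's actual method: the tester counts, the rule that a "match" is anyone sharing at least one reported country with you, and the function name are all assumptions made up for illustration.

```python
from collections import Counter

# You: half Italian, half Polish.
my_ancestry = {"Italy", "Poland"}

# Hypothetical database of testers and the countries they reported.
# Irish-heritage testers are deliberately over-represented.
testers = (
    [{"Poland", "Ireland"}] * 6   # half Polish, half Irish
    + [{"Poland"}] * 1            # fully Polish
    + [{"Italy"}] * 1             # fully Italian
    + [{"Ireland"}] * 4           # fully Irish (will not match you)
)

def country_shares(me, database):
    """Share of your matches that report each country.

    A tester counts as a match if they share any reported country with you.
    """
    matches = [t for t in database if me & t]
    if not matches:
        return {}
    counts = Counter(country for t in matches for country in t)
    return {country: n / len(matches) for country, n in counts.items()}

shares = country_shares(my_ancestry, testers)
for country, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {share:.0%} of your matches")
```

With these invented counts, 75% of your matches report Ireland and only about 13% report Italy, even though you have no Irish ancestry at all and are half Italian. The list reflects who tested, not who you are.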
23andme also suffers from the same sample issues as many of the other ancestry calculators. For example, 85% of Italian Americans (TRANSLATION: potential customers, since most people who test are from Britain or the US) hail from just 3 regions in the deep south of Italy: Campania (Naples), Calabria, and Sicily. Yet the population samples that most of these websites use are from Tuscany. Even though Dante tried to meld them, Tuscans are not Sicilians and vice-versa.
Often, when these calculators see Sicilian or rural Southern Italian genes, they in effect say: we don't know what you are! You are kind of Italian, but you also resemble, a little bit, people from Cyprus or Jews. So they give an odd result. And then someone who has tested says, "I might be Jewish." No. The answer is that your people were not included in the dataset from which the baseline was developed. If they were, the calculator would recognize you as a run-of-the-mill Sicilian.
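Here is a rough sketch of why a missing baseline produces that kind of odd result. It assumes a toy calculator that explains a sample as a blend of two reference populations and picks the blend with the smallest squared error. The three "marker frequencies" per population, the population values, and the function name are all invented; real calculators use far more markers and more elaborate models.

```python
# Hypothetical frequencies at three markers. All values are invented.
TUSCAN = (0.2, 0.6, 0.5)
CYPRIOT = (0.6, 0.2, 0.3)

# A made-up Sicilian sample. No Sicilian reference exists in this panel.
SICILIAN_SAMPLE = (0.32, 0.48, 0.44)

def best_two_way_mix(sample, pop_a, pop_b):
    """Grid-search the percentage of pop_a (0..100) that best fits sample."""
    def squared_error(pct_a):
        w = pct_a / 100
        return sum(
            (w * a + (1 - w) * b - s) ** 2
            for a, b, s in zip(pop_a, pop_b, sample)
        )
    return min(range(101), key=squared_error)

pct_tuscan = best_two_way_mix(SICILIAN_SAMPLE, TUSCAN, CYPRIOT)
print(f"{pct_tuscan}% Tuscan, {100 - pct_tuscan}% Cypriot")
```

With these invented numbers the toy calculator reports 70% Tuscan and 30% Cypriot. Nothing in the sample is actually Cypriot; the blend is simply the closest description available when the panel has no Sicilian baseline.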
All online ancestry calculators also suffer from a lack of interoperability and from non-standardized terms. For example, among the calculators on Gedmatch, some use the term "Caucasian" to mean "generalized European" (which is how it is used in common parlance, of course). Others use it in the specific sense: people from the Caucasus, such as the country of Georgia, Armenia, etc.
Here's the bottom line: don't expect any ethnicity or ethnic-origins calculator to be 100% correct. Don't expect new insights if you have confirmed records. In other words, if you look just like your dad (you're not a bastard), and you're not adopted, and you have records going back centuries -- why do you need an ethnicity calculator to begin with?
These admixture tests can help if you were adopted and want to have a sense of where to start. But keep in mind, the largest share of Americans come from German heritage, and yet the best calculator currently cannot identify German DNA 92% of the time.
Avoid the mythology and those who oversimplify. There are reliable sources out there in genetic genealogy, like Debbie Kennett -- and there are a lot of charlatans. Be careful whenever someone oversimplifies to the point of exaggeration, falls into stereotypes, or tells you what you want to hear. With DNA as with everything, the most parsimonious answer is often the best. The exotic is often wrong.
As the science improves, you can't go wrong using the Standard or Conservative setting on the 23andme Ancestry Composition test.