Traitwell's Future of Genomics Series: Ancestry
We’re continuing our analysis of the future of genomics. For those who are curious, please visit Traitwell.com. Be sure to check out our free apps.
The simplest direct extension of genetic identity is its application to ancestry. Some facts can be deduced from a single genome with no information beyond the general collections of human population DNA data gathered by the 1000 Genomes Project and similar initiatives. Any person’s DNA can be scanned for long stretches in which the two inherited copies, one from each parent, agree at every SNP position more often than those reference collections would lead us to expect. These stretches are known as runs of homozygosity. The scrambling of DNA by sexual reproduction makes such agreement increasingly unlikely as the length of the ‘run’ of agreeing positions grows. More runs than usual, or longer ones, imply inbreeding. The usual population-genetic coefficient of inbreeding can therefore be estimated directly from the genome, without requiring error-prone genealogical information from the individual. (We will cover other deductions based on single genomes elsewhere in this discussion.)
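To make the idea concrete, here is a minimal sketch in Python of how runs might be found and turned into an inbreeding estimate. It assumes genotypes are coded 0, 1 or 2 (copies of the alternate allele, so 1 means the two inherited copies differ). The function names find_roh and f_roh, the min_snps threshold and the toy data are all our own illustrative choices, not any service's method. Production tools are more forgiving of genotyping errors and usually set thresholds in physical length rather than SNP count.

```python
from typing import List, Tuple


def find_roh(genotypes: List[int], positions: List[int],
             min_snps: int = 5) -> List[Tuple[int, int]]:
    """Return (start_bp, end_bp) for each run of at least min_snps
    consecutive homozygous SNPs (genotype 0 or 2)."""
    if len(genotypes) != len(positions):
        raise ValueError("genotypes and positions must have equal length")
    runs = []
    start = None  # index of the first SNP in the current run, if any
    for i, g in enumerate(genotypes):
        if g != 1:  # homozygous call: extend or open a run
            if start is None:
                start = i
        else:  # heterozygous call: close the current run
            if start is not None and i - start >= min_snps:
                runs.append((positions[start], positions[i - 1]))
            start = None
    # Close a run that reaches the end of the data.
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((positions[start], positions[-1]))
    return runs


def f_roh(runs: List[Tuple[int, int]], covered_bp: int) -> float:
    """Fraction of the covered genome that lies inside runs of homozygosity."""
    if covered_bp <= 0:
        raise ValueError("covered_bp must be positive")
    return sum(end - start for start, end in runs) / covered_bp


# Toy data (hypothetical): 14 SNPs spaced 1,000 bp apart.
genotypes = [0, 2, 2, 0, 2, 1, 0, 1, 2, 2, 2, 0, 0, 2]
positions = [i * 1000 for i in range(len(genotypes))]

runs = find_roh(genotypes, positions, min_snps=5)
print(runs)                                    # [(0, 4000), (8000, 13000)]
print(round(f_roh(runs, positions[-1]), 3))    # 0.692
```

The toy value is far higher than anything seen in a real genome. The point is only that the estimate comes from the DNA itself and needs no pedigree.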
Given two genome sequences, it is possible to calculate a genetic distance between them. This expresses the degree to which the two genomes are similar. It is true that the vast bulk of the genome does not vary between individuals of any one species, including humans. Nevertheless, there is an enormous amount of information in the parts that do vary. Let us confine our attention to those parts only. Because of the sheer amount of information involved, if two individuals do not differ by much then they are almost certainly closely related by descent. Coincidences are possible for some small percentage of the information, but not for large amounts. Many different measures of this distance are possible, according to taste and technology, but the essential idea remains the same. (Abstractly, we can think of genomes as points in a multi-dimensional space, with the distance measured between those points.)
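One of the simplest such measures can be sketched in a few lines of Python. The version below is an allele-sharing distance: at each variable site it counts how many allele copies differ between two people (0, 1 or 2) and averages over sites. The 0/1/2 genotype coding, the use of None for a missing call, the function name and the toy genomes are assumptions made for illustration. Real services use more elaborate measures.

```python
from typing import List, Optional


def allele_sharing_distance(a: List[Optional[int]],
                            b: List[Optional[int]]) -> float:
    """Mean fraction of allele copies that differ between two genotype
    vectors coded 0/1/2. Sites with a missing call (None) are skipped.
    The result is 0.0 for identical genotypes and 1.0 for opposite
    homozygotes at every site."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must cover the same sites")
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no sites with calls in both genomes")
    return sum(abs(x - y) for x, y in pairs) / (2 * len(pairs))


# Hypothetical genotypes at six variable sites.
person = [0, 1, 2, 2, 0, 1]
sibling_like = [0, 1, 2, 1, 0, 1]   # differs by one allele copy
stranger = [2, 0, 0, 1, 2, 0]       # differs at every site

print(allele_sharing_distance(person, sibling_like))  # 1/12, about 0.083
print(allele_sharing_distance(person, stranger))      # 0.75
```

With six sites the numbers mean little. With hundreds of thousands of sites, a small distance is very hard to explain by coincidence, which is the argument made above.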
Interest in genealogy is a human universal, and genealogical research has been one of the most common uses of the internet since its inception. We have a wired-in curiosity about who we are related to, where we came from, and where our descendants are likely to go. Here knowledge is an end in its own right, quite independent of whatever uses it may be put to.
To date, the practical utility of genealogical knowledge, for other purposes, has been modest but we expect that to change. If behavior, intelligence, personality are all heritable it follows that genealogy can be useful in counterintelligence. Is this why Ancestry.com was purchased by a consortium which included Chinese capital? On December 4, 2020 The Blackstone Group acquired the company in a deal valued at $4.7 billion. They installed Deborah Liu as CEO. Liu is an acolyte of Sheryl Sandberg whose family is deeply connected with Israeli intelligence.
Aside from intelligence services, most of us want to know for the sake of knowledge itself, to slake our native curiosity. Genomic inference about ancestry is founded on the idea of genetic distance described above. Given a group of genomes we can calculate the distance between an individual and that group, typically by finding a (hypothetical) centre point for the group and finding the distance from the individual to that centre. Although there are myriad alternative ways of doing this calculation, the idea is the same: how distant is the individual from the group?
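The centre-point calculation just described can be sketched as follows. This is a bare-bones illustration, assuming genotypes coded 0/1/2 at a shared set of variable sites and plain Euclidean distance. The group labels, helper names and data are hypothetical, and real services generally work in a reduced space (for example after principal component analysis) rather than on raw genotypes.

```python
import math
from typing import Dict, List, Tuple


def centroid(group: List[List[int]]) -> List[float]:
    """Hypothetical centre point of a group: the mean genotype at each site."""
    n = len(group)
    return [sum(site) / n for site in zip(*group)]


def euclidean(u: List[float], v: List[float]) -> float:
    """Straight-line distance between two points in genotype space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def nearest_group(individual: List[int],
                  groups: Dict[str, List[List[int]]]
                  ) -> Tuple[str, Dict[str, float]]:
    """Return the label of the closest group and all individual-to-centre distances."""
    distances = {name: euclidean(individual, centroid(members))
                 for name, members in groups.items()}
    return min(distances, key=distances.get), distances


# Two made-up reference groups, two members each, four sites.
reference = {
    "group_A": [[0, 0, 1, 2], [0, 1, 1, 2]],
    "group_B": [[2, 2, 0, 0], [2, 1, 0, 1]],
}
label, distances = nearest_group([0, 1, 1, 2], reference)
print(label)                                # group_A
print(round(distances["group_A"], 3))       # 0.5
```

Swapping euclidean for another measure changes the numbers and sometimes the answer, which is one reason different services can disagree about the same genome.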
The groups involved may be living individuals, or they may have been sequenced forensically, even from fossil DNA. In the former case, with enough individuals, who may have joined the same online service and deliberately uploaded their DNA sequences, genealogical trees can be constructed to reflect the likely pattern of descent that fits the data best. As the size of the repository increases, the tree becomes finer and more accurate. Many such services already exist, with a substantial number of users. The addition of other genealogical tools, such as census records, newspaper archives, land registries, voter rolls, tax records and similar historical information gives users more to do on such sites, so that they can construct family trees with partial DNA information and partial information from those other sources. DNA should be thought of as one element of the genealogy toolkit, and those who operate in this realm have to enrich their offerings accordingly.
In the near future, genealogy tools will expand to provide greater (partially automated) assistance to users researching their ancestry. Manually combing through large databases is error-prone and time-consuming. Surnames are often ambiguous. Other information must be used, including DNA where it is available, to help disambiguate through consideration of context. Humans are typically not good at the sort of probability calculations needed to narrow down candidates for ancestors. Here the weight of evidence is needed, and machines are better at trawling through large volumes of data to form composite opinions and generate leads for further research.
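A rough sketch of what ‘weight of evidence’ means in practice: each clue is expressed as a likelihood ratio (how much more likely the clue is if the candidate really is the ancestor than if not), and the ratios are combined with a prior on the log-odds scale. The function name, the prior and every ratio below are invented for illustration, and the calculation assumes the clues are independent, which real records often are not.

```python
import math
from typing import Iterable


def posterior_probability(prior: float,
                          likelihood_ratios: Iterable[float]) -> float:
    """Combine a prior probability with independent likelihood ratios
    by adding log-odds, then convert back to a probability."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for lr in likelihood_ratios:
        if lr <= 0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(lr)  # each clue shifts the weight of evidence
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical candidate: one of roughly 100 people with the right surname.
prior = 0.01
clues = [
    20.0,  # same parish in the census record (assumed ratio)
    5.0,   # occupation matches family tradition (assumed ratio)
    50.0,  # descendants share a DNA segment with the user (assumed ratio)
]
print(round(posterior_probability(prior, clues), 3))  # 0.981
```

No single clue is decisive here, yet together they move a 1% guess to about 98%. This is the kind of bookkeeping that people do badly by intuition and machines do well.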
Interest in ancestry also includes far-reaching curiosity about group origins, which one may call race or ‘geographic ancestry’ or, more misleadingly, ethnicity or some other equivalent. Here a single genome again suffices, given general reference sets of DNA. The Y chromosome and mitochondrial DNA are important here because, barring random mutations, the former is handed down unchanged from father to son, while the latter is transmitted unchanged from mother to child. It is already possible to infer much about group ancestry in this way, though some parts of the world have much better coverage in reference collections, and therefore better resolution, than others. However, the confidence claimed by existing services is overstated, and different services may deliver inconsistent results, depending on the reference collections they use and the genetic distance measures they choose to employ.
Disillusionment among users of ancestry services is inevitable until the problem of reliable and consistent group ancestry is resolved. Larger reference collections with much better coverage are on the way and will keep improving over time. Greater transparency about the impact of distance measures will also come, and the choice of measure will be placed under the control of (informed) users of ancestry services. This transparency will help them to pose the right questions in the first place. Existing services also confuse their users with charts which contain attractive colours in direct proportion to their lack of any clear interpretation, e.g. differences mapped onto chromosomes. These displays will be replaced in the future by genuinely informative tools explaining the functional significance of what is inherited or differs from others.
The use of historical and even ancient DNA, recovered from fossil sources and burials, expands the ‘group’ considered further back, even to the species level. Breeding between adjacent human species is now known to have been widespread, even if rarely successful. A phenomenon known as ‘introgression’ captures a small percentage of useful variants from another species and propagates them. The rest of the introduced DNA, where it is functional, is relentlessly weeded out by natural selection, since its effects are usually bad, even catastrophic. Offering information of this kind feeds natural curiosity. The information is not actionable in any sensible way (nor is genealogy in general, with rare exceptions like discovering long-lost living relatives), but people are still intrigued, as they are by history in general. This gives them a personal tie-in to that history. As fossil finds expand, and recognized hominid species multiply, the scope of this information expands accordingly.
More recent burials will become important sources of information for the purposes of ancestry. In Europe, for instance, neglected graveyards are frequently cleared and the remains disposed of to make room for new development. Commercial analysis of these remains at a far greater scale than archaeology affords will unlock a trove of information about ancestry, greatly reinforcing official records and pushing the frame of inference back by centuries. If only ancestry information is desired, the work may be done inexpensively using SNP genotyping microarrays (‘chips’) rather than full sequencing. But the course of technology suggests that falling costs will make whole-genome sequencing applicable here, given its much greater potential for inference.
Another direction of future expansion will come from tracing traits back through ancestral lines. This can be pushed back, in principle, to ancient DNA. We will deal with such traits in much greater detail below at the individual level. Many such group traits are already known, for disease resistance (e.g. to malaria) and for other adaptations to geographical features, like tolerance for high altitude (in Andeans and Himalayans, the latter group’s trait having been introgressed from the now-extinct Denisovans). The general form of this inference is to identify a grouping through genetic means, which we might call ‘tagging’.
There is no genetic causation implied by the tag itself, but other knowledge of the tagged group’s characteristics allows deductions to be made in the absence of better evidence (and note that conditional probability suggests incorporating such knowledge even when it is based only on more general information). Genes will almost certainly be implicated causally in the traits of interest, but this is not immediately important for prediction using the tag, and the causal details may take many years to determine. Moreover, the size of the group identified may vary from small to large. This is how all scientific induction works: inferring from membership of a class to a best estimate of a trait, ceteris paribus. This is important enough to be called out as a general principle for future developments. The impact of tagged groups on traits, including disease resistance, is known to be pervasive, and so far we have only scratched the surface of the specific traits that will eventually be uncovered. People want to know, on a scientific basis, what family traits they have. That desire will fuel future interest in ancestry.
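The inference from tag to trait can be illustrated with a small sketch. It estimates the chance of a trait given membership of a tagged group, and blends the group’s observed rate with the rate in the general population so that small groups lean more heavily on the general knowledge. The blending rule (a simple pseudo-count average), the function name, the prior_weight parameter and all the counts are our own assumptions for illustration, not a description of any existing service.

```python
def tagged_trait_estimate(carriers: int, group_size: int,
                          population_rate: float,
                          prior_weight: float = 10.0) -> float:
    """Best estimate of P(trait | member of tagged group).

    The observed group rate is averaged with the general population
    rate, which counts as prior_weight imaginary extra members. Small
    groups therefore stay close to the population rate, while large
    groups are dominated by their own data."""
    if group_size < 0 or not 0 <= carriers <= group_size:
        raise ValueError("need 0 <= carriers <= group_size")
    if not 0.0 <= population_rate <= 1.0 or prior_weight <= 0:
        raise ValueError("invalid population_rate or prior_weight")
    return (carriers + prior_weight * population_rate) / (group_size + prior_weight)


# Hypothetical trait seen in 10% of people generally and 40% of a tagged group.
small_group = tagged_trait_estimate(carriers=8, group_size=20, population_rate=0.10)
large_group = tagged_trait_estimate(carriers=800, group_size=2000, population_rate=0.10)

print(round(small_group, 3))  # 0.3
print(round(large_group, 3))  # 0.399
```

Nothing in the calculation asks which genes cause the trait. The tag alone shifts the estimate, and the shift is trusted more as the tagged group grows.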