Beyond Curiosity: Industrial-Strength Genomics for Organizations
Or, what Palantir should do when it wins the NHS contract
Abstract: DNA analysis has been widely adopted for scattered applications without a unifying framework. This document argues for a reimagining of the field in order to create an industrial-strength tool built on behavior genetics. It is shown that ‘who’ and ‘what’ carve the problem at its joints. The opportunities available to a substantial organization capable of collecting large amounts of DNA data are explored.
Apropos of the 75th anniversary of the National Health Service (NHS) it’s worth thinking about what the next 75 years might look like, in light of all the advances being made and yet to be made in genomics. Britain’s achievement of the NHS stands in stark contrast to America’s mobbed up, patchwork health care system. It is a model for the world and the ideas tried out in the United Kingdom serve as a sort of laboratory for the rest of the world, especially the Commonwealth countries. It’s my hope that that wider world might even include America.
Anyway, my South African-born colleague and friend Gavan Tredoux has written (and I have edited) this piece about industrial-strength genomics, which we think could help shape the discussion. As time goes on I’ll be putting more of this material out. For those who are unaware, Traitwell is currently raising its Series A. I’ve personally invested about $600,000+ into the company and will likely invest more as it becomes available.
The reason I’ve made this investment is that I believe genomics is going to be a key part of health care, justice and policymaking. As Chinese capital draws down from 23andMe and Ancestry.com, we’re going to need American genetics companies. Hence Traitwell.
We are concerned here specifically with applications of genomics by a large organization or set of organizations, or government agencies.
Everything is, to a greater or lesser degree, heritable. Which is to say that all traits are under some degree of genetic influence. Traits here range from bodily properties and diseases, to personality and behaviour, since evolution does not stop at the neck. Well studied behavioural traits include Educational Attainment, IQ, Schizophrenia, Autism, Anti-social Disorder and Attention Deficit Disorder.
This phenomenon means that differences in traits are associated with differences in genes. This reality of nature first became apparent in 1869. Subsequently, detailed evidence amassed from twin and other family studies with a multitude of designs repeatedly confirmed this nineteenth-century insight and allowed estimation of the degree of heritability within the usual range of variation. Specific genes have long been known for special, simpler, ‘monogenic’ cases.
With the advent of affordable DNA sequencing, modern Genome-wide Association Studies (GWAS), conducted since 2007 on very large samples of individuals, have extended this work to identify large numbers of specific single-nucleotide polymorphisms (SNPs) associated with even complex ‘polygenic’ traits. Composite polygenic risk scores (PRS) may be calculated from samples of SNPs to predict these complex outcomes, and incorporated into more general models. Knowledge is expanding daily. Explanatory power is increasing. For many traits it is still modest, for others much stronger. We don’t know how deep the well goes exactly but we dare to plumb its depths.
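As a minimal sketch of what calculating a composite score from SNPs can mean, the block below uses the common additive form: each SNP contributes its per-allele effect weight multiplied by the number of effect alleles carried. The SNP identifiers, the weights and the function name `polygenic_score` are hypothetical and chosen only for illustration; real weights would come from published GWAS summary statistics.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, int], weights: Dict[str, float]) -> float:
    """Additive polygenic score: sum over SNPs of (effect weight * allele dosage).

    dosages: SNP id -> number of effect alleles carried (0, 1 or 2).
    weights: SNP id -> per-allele effect size (hypothetical values here).
    SNPs that are missing from the genotype are skipped.
    """
    return sum(w * dosages[snp] for snp, w in weights.items() if snp in dosages)


# Hypothetical SNP ids and effect sizes, for illustration only.
example_weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.125}
example_person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
example_score = polygenic_score(example_person, example_weights)
```

More general models can treat such a score as one input among others rather than as a prediction in its own right.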
To date practical applications of genomics have been scattershot. Abiding interest in ancestry drives a consumer-facing set of applications intended mostly to satisfy curiosity and to identify long lost relatives. Trait information is often offered but also mostly for curiosity and entertainment. Interest in identity, say forensically at crime scenes, drives another set of applications. Real-world medical applications to date have mostly concerned themselves with simple traits like monogenic diseases of large effect (Mendelian disorders). All this exploration leaves a lot of opportunity for applications of the latest knowledge on the table, to go well beyond mere curiosity. That opportunity is explored here in terms of an important distinction between ‘who’ and ‘what’.
1 Complexity
First, let’s clear up some common confusions, which may be called “arguments from complexity”. Nobody claims that DNA explains everything. It does not. There is at the very least random variation that cannot be explained (by definition).
Other factors may also be involved. These will vary by trait, and are far less certain for any particular trait than might be supposed. But if some model can be devised which explains some of the variation in traits in non-genetic terms, it will be strengthened by adding the missing genetic terms. That is what concerns us here: adding explanatory power. In fact, many factors which on the surface appear to be non-genetic are themselves proxies for genes. Alas we have found that many excuses against genomic science are, in fact, ruses to underinvest in one of the most promising technologies to date.
2 Curiosity
The limitation of many applications so far to curiosity is not an accident; it is dictated by the nature of the genomic information and the granularity of its use. At an individual level, PRS are weak predictors, and likely to remain so for most traits. Their effect size is usually modest. A consumer inspecting his or her score will not discover much that is directly actionable. A doctor who obtains the same information will likely not find it much more useful. The score will have little diagnostic value for that individual. Noise overwhelms signal.
Note though that this is not true for large organizations, corporations and governments, who get to use that information repeatedly over very many cases, so that even a small advantage gained in each case adds up to a valuable gain overall. An entire health system may gain in ways that an individual doctor may not appreciate at his or her level.
3 Who / What
There is a useful distinction between two kinds of applications of genomics. Either 1. you are trying to do something and need to know who best to do it to or with, or 2. you already have subjects given to you, and you need to figure out what it is best to do to or with them.
Who. Let’s just assume you have something you need to do. It may be your mission as a regulatory agency, or follow from your job description, be set by the expectations of others, say those who pay you. Being familiar with behavior genetics you know that humans vary genetically in their responsiveness to or suitability for what it is you have to do. To increase your chances of being effective, you are interested in finding the right people. Examples include clinical trials, personnel selection of all kinds, groups to study, groups to prioritize with interventions. By filtering the broader population using DNA, an enriched set of candidates may be found. For example, in conducting a clinical trial, finding people more susceptible to a disease is useful. Or when searching for people with IQ 175, testing from an enriched set filtered using DNA will save a lot of time and money, even with a modest predictor. It may well be the predictor from which all other predictors are derived.
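The enrichment argument can be illustrated with a small simulation. This is a sketch under assumptions that are not in the text: the DNA-based score and the trait are both modelled as standard normal variables with a modest correlation `r`, and the function name `enrichment`, the cutoff and the kept fraction are all hypothetical. It compares how often a high trait value occurs in the whole population with how often it occurs among people kept by the filter.

```python
import random
from typing import Tuple


def enrichment(r: float, n: int = 50_000, keep_frac: float = 0.1,
               trait_cutoff: float = 2.0, seed: int = 0) -> Tuple[float, float]:
    """Return (base_rate, enriched_rate) of people whose trait exceeds the cutoff.

    Assumed toy model: score and trait are standard normal with correlation r.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        score = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Constructed so that the correlation between score and trait is r.
        trait = r * score + (1.0 - r * r) ** 0.5 * noise
        people.append((score, trait))

    # Share of the unfiltered population above the trait cutoff.
    base_rate = sum(t > trait_cutoff for _, t in people) / n

    # Keep only the top keep_frac of people ranked by the DNA-based score.
    people.sort(key=lambda p: p[0], reverse=True)
    kept = people[: max(1, int(n * keep_frac))]
    enriched_rate = sum(t > trait_cutoff for _, t in kept) / len(kept)
    return base_rate, enriched_rate


# A modest predictor: correlation of 0.3 between score and trait (assumed value).
base_rate, enriched_rate = enrichment(r=0.3)
```

Under these assumptions the filtered pool holds a higher share of high-trait individuals than the unfiltered population, so fewer direct tests are needed to find each one.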
What. Here you know who your subjects are. You cannot change them, for whatever reason. They have already been entrusted to your care or supervision. You may know more broadly what your responsibilities to them are, but the question is: what is the best way to treat or more generally deal with them? That question may be answered by starting from their known traits, obtained by sequencing them, and adopting the optimal set of actions based on that, assuming that certain traits favor particular actions, e.g. susceptibility to disease. For individuals this is straightforward: apply known results for the traits that are candidates. As noted above the resulting predictions are typically weak from the point of view of an individual and a single application, but very useful nonetheless to an organization applying them over and over again to a set of individuals, perhaps an entire population.
Personalized medicine is an example of an emerging field that utilizes genomic information to target drug treatments. The efficacy of drugs, their proper dosages and their interactions with other drugs depend on genes. The effects may be very large. Health outcomes can be markedly improved, and patients made more comfortable, by taking genes into account. Increasingly, drugs will target specific gene variants, which may be distributed differently in fine-grained sub-populations, to counter their effects. Genes provide an objective uncontroversial way here to identify the population granularity that is most informative. We even offer a “pharmacogenomics” app.
Traits vary in their spatial distribution, in ways that can be detected using DNA. In this way geospatial trait maps can be formed. Policy can take this distribution of traits into account to improve targeting of investment, spending and planning, all of which are done in practice at the geographical level. The map aggregates known DNA, which may be obtained from a biobank or collected directly from the environment. The latter is important enough to call out.
4 Found DNA Mixtures
To date DNA sequencing has been concerned mainly with individuals. Either samples are collected from known individuals and sequenced, or they are collected forensically, using ‘found DNA’ from unknown individuals, and then identified by matching against databases. In the forensic case a mixture of DNA from different people may be present, in which case they need to be distinguished by matching.
There is another set of important applications in which particular individuals are not the object of interest, but rather the aggregate traits represented in a mixture of found DNA. This is where industrial-strength genomics emerges. Rather than thinking of the mixture as a collection of DNA from individuals, it is treated as a collection of SNPs. Those SNPs are implicated in traits by GWAS studies. An aggregate PRS can be calculated by combining them all. Very many individuals may contribute to that aggregate PRS, but their individuality is not of interest. The traits are the focus, and for that only the SNPs matter.
Of course one may get a similar result by taking a collection of known individuals, sequencing them all and aggregating the results. But that is time-consuming and expensive. Consent may be necessary. Far more sequencing must be done than is actually needed: much of the information is discarded in the aggregate score, so the effort of obtaining it is wasted. A group must somehow be formed in advance, and so on. It is far easier to obtain found DNA from environmental sources (railway stations, subways, border crossings and other public spaces). This DNA mixture comes from unknown individuals. All that may be known are more general characteristics, such as where the mixture was collected, what broad types of people are present there, and other non-specific facts. For this application no more is needed.
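The equivalence between the two routes can be sketched in code. For an additive score, the mean of the individual scores equals the score computed from pooled allele frequencies alone, because the score is linear in the dosages. The sketch assumes that every contributor adds the same amount of DNA to the mixture and that only diploid autosomal SNPs are used; the function names, SNP ids and weights are hypothetical.

```python
from typing import Dict, List


def allele_frequencies(genotypes: List[Dict[str, int]], snps: List[str]) -> Dict[str, float]:
    """Effect-allele frequency per SNP, as a pooled sample would report it."""
    n_alleles = 2 * len(genotypes)  # two copies per person (diploid assumption)
    return {snp: sum(g[snp] for g in genotypes) / n_alleles for snp in snps}


def pooled_score(allele_freq: Dict[str, float], weights: Dict[str, float]) -> float:
    """Aggregate score of a mixture: weight * expected dosage (2 * frequency)."""
    return sum(w * 2.0 * allele_freq[snp] for snp, w in weights.items())


def mean_individual_score(genotypes: List[Dict[str, int]], weights: Dict[str, float]) -> float:
    """The expensive route: score every known individual, then average."""
    scores = [sum(w * g[snp] for snp, w in weights.items()) for g in genotypes]
    return sum(scores) / len(scores)


# Hypothetical weights and genotypes, for illustration only.
weights = {"rs_a": 0.5, "rs_b": -0.25}
genotypes = [
    {"rs_a": 2, "rs_b": 1},
    {"rs_a": 0, "rs_b": 1},
    {"rs_a": 1, "rs_b": 0},
    {"rs_a": 1, "rs_b": 2},
]
freqs = allele_frequencies(genotypes, list(weights))
from_mixture = pooled_score(freqs, weights)
from_individuals = mean_individual_score(genotypes, weights)
```

Both routes give the same aggregate here, but the pooled route never needs to know which alleles belong to which person. In a real mixture contributors are unlikely to be equally represented, so the measured frequencies would be a weighted version of this.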
In such a mixture, it may be necessary to distinguish human DNA from that of other animals, but this is not in principle difficult. The technical challenge, and opportunity, is to enable rapid determination and aggregation of traits, cheaply. This requires changing the sequencing approach, orienting it away from entire sequences where the individual matters toward fragments where the collection of SNPs matters. Innovation is possible here.
There are definite strengths derived from considering aggregate traits rather than individuals. Collection is vastly cheaper. Consent is not required. Privacy is inherent, since no individual is identified. Much better accuracy is achieved. Modest PRS trait predictors are transformed into robust ones by cancelling out random errors, as they are in the filtering techniques described above.
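The cancelling of random errors can be made concrete with a short derivation. The assumptions are ours, not part of the argument above: each individual's score $s_i$ is a true value $t_i$ plus a random error $\varepsilon_i$, and the errors are independent with mean zero and common variance $\sigma^2$.

```latex
\[
\bar{s} \;=\; \frac{1}{n}\sum_{i=1}^{n} s_i
        \;=\; \frac{1}{n}\sum_{i=1}^{n} t_i \;+\; \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i ,
\qquad
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} \varepsilon_i\right)
        \;=\; \frac{1}{n^{2}}\, n\,\sigma^{2}
        \;=\; \frac{\sigma^{2}}{n}.
\]
```

The standard error of the aggregate therefore shrinks as $\sigma/\sqrt{n}$: with 10,000 contributors it is one hundredth of the individual-level error. Only independent errors cancel in this way; an error shared by all contributors would remain in the aggregate.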
Anything that may be said about the traits of an individual may be said about the traits of an aggregate. That aggregate may be more or less prone to schizophrenia, anti-social behavior, cognitive decline, heart disease, Covid, autism and so on. This is also true for generalized outcomes, as detailed below.
5 Use vs. Discovery
An organization may (and should and perhaps must) use the growing body of knowledge derived from GWAS studies in its application of genomics. This is already substantial and available, usually at no cost. However, there are drawbacks to that approach. Much of the knowledge generated so far has been focused on causality and understanding the mechanisms of traits. For activities like drug development that is necessary to find appropriate gene targets. For many applications, causality may not matter. The traits and outcomes that most interest the organization may not have been studied yet. The traits may be only indirectly linked to the outcomes of interest. Discovering knowledge about outcomes, given a large collection of DNA not available to others, and measurements of outcomes, will go much further to address the organization’s concerns. It may even be a major revenue centre.
Consider pharmaceutical development. Effective drugs are exceptionally hard to find. Entire companies may rely on one or two patented cash cows, fortuitously discovered. Target genes informing drug discovery are also hard to find. Few private companies possess very large collections of DNA data, which are essential. Existing biobanks, which make DNA collections available, where they permit private use, are comparatively small and suffer from selection bias because they rely on volunteers. Nationwide health services have the advantage of comprehending very large populations. They can include everyone, or a random sample of them, in ways that a voluntary biobank cannot. Sheer scale matters a great deal here. The gene variants of interest are usually very rare. Finding associations involving them requires many subjects. The nationwide health service can leverage its enormous pool of patients to create the largest biobank that it can afford to invest in. Drug discovery based on this can be turned into a revenue centre through global marketing of the resulting drugs, which are otherwise very hard to discover.
6 Generalized Outcomes and Extended Phenotypes
To date DNA has been used to predict a limited range of outcomes: physical characters, diseases, psychopathology, and behavioural traits including cognitive ability. This is for historical and convenience reasons. GWAS studies are typically composed of multiple smaller studies to achieve the large sample sizes needed to find associations. Common traits are required for the studies to be commensurable. The standard toolkit for each domain, developed over a long period, is typically used here. This is least straightforward for psychological and behavioural traits, where measurement and classification are disputable, and predictive power questionable, leading to weak phenotypes and study results that are not as strong as they might be. The express hope in these studies is to find causal DNA variants (important for drug development).
There is no intrinsic need for this limitation. Any outcome may be predicted using DNA. Diagnostic categories are not necessary. Outcomes need not be profound and may be as mundane as ‘responds to emails but not text messages’ or ‘prefers wine to cola’. Whatever outcome can be measured or recorded is fair game here, provided the required data is available. In fact few organizations to date have had the required scale or data. However, for a modern organization at a national scale of millions, possessing DNA data, this is particularly feasible. The results may be more directly applicable and less noisy than first predicting an uncertain trait, then inferring from the trait to the final outcome.
Nor is it necessary to confine oneself to causal variants. Where DNA is associated with an outcome, but does not directly cause it, this is known as ‘tagging’. The tagged DNA is acting as a proxy for the real mechanisms, which may (or may not) lie entirely outside DNA. For example, certain DNA patterns may be associated with tea-drinking versus coffee-drinking, just because they happen to tag the people involved, but need not have a direct biological effect on that outcome. But if what you want to do is predict who likes tea, that doesn’t matter. The idea need not be to establish profound truths for all mankind but rather to make practical inferences for a particular population at hand. Given sufficient DNA data about that population, along with the outcomes of interest, objective associations can be learned using standard statistical techniques. By embracing tagging rather than deprecating it, the associations discovered may be continuously updated as the population changes.
No learning technique is off the table here either. Derived features may be used, rather than just an additive model of SNPs. This is because the causal path is not required. What works best in practice will vary by outcome and population.
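As a minimal sketch of learning associations for a recorded outcome, the block below computes, for each SNP, the difference in effect-allele frequency between people with and without the outcome. It is a deliberately simple stand-in for the standard statistical techniques mentioned above, not a recommended method. The function name, SNP ids and data are hypothetical; a real analysis would add significance testing and corrections for multiple comparisons and population structure.

```python
from typing import Dict, List


def allele_freq_difference(genotypes: List[Dict[str, int]],
                           outcomes: List[bool]) -> Dict[str, float]:
    """Per SNP: effect-allele frequency with the outcome minus frequency without it."""
    with_outcome = [g for g, y in zip(genotypes, outcomes) if y]
    without_outcome = [g for g, y in zip(genotypes, outcomes) if not y]
    if not with_outcome or not without_outcome:
        raise ValueError("need at least one person with and one without the outcome")

    def freq(group: List[Dict[str, int]], snp: str) -> float:
        # Two allele copies per person (diploid assumption).
        return sum(g[snp] for g in group) / (2 * len(group))

    return {snp: freq(with_outcome, snp) - freq(without_outcome, snp)
            for snp in genotypes[0]}


# Hypothetical data: the outcome is 'prefers tea', recorded for four people.
genotypes = [
    {"rs_tag": 2, "rs_null": 1},
    {"rs_tag": 2, "rs_null": 1},
    {"rs_tag": 0, "rs_null": 1},
    {"rs_tag": 0, "rs_null": 1},
]
prefers_tea = [True, True, False, False]
differences = allele_freq_difference(genotypes, prefers_tea)
```

A SNP with a large difference is a candidate tag for the outcome in this population, whether or not it has any causal role. Rerunning the calculation as new data arrives keeps the associations current.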
7 DNA Tokens
In some contexts, like Emergency Rooms, it is useful to have rapid access to an individual’s DNA sequence. One could readily imagine an emergency system where the fully sequenced were given priority care as an enticement to get others to get sequenced either through the government or their own initiative.
Although each human cell has copies of that DNA, sequencing it is slow even on the fastest devices. By saving the DNA sequence onto a token that the individual carries around, somewhat like a Medical ID bracelet, instant access can be had to the sequence by those who need it to make quick and convenient decisions about personalized medicine (see above) or whatever problem is at hand.
We might even consider using facial recognition for willing participants.
This quick retrieval removes ambiguity and promotes better outcomes for patients. People may voluntarily choose to carry this in some form, or even have it embedded. There are several options for doing this:
A hardware token with the sequence stored directly on it, possibly in multiple forms. This may be on a keychain fob or worn on the body. It may also be embedded under the skin. Low cost of modern storage makes this feasible. Fingerprint or other authentication may be applied to the wearable device. This guarantees access at the point of care, unlike the options which follow.
Some people may not care to have their sequence directly readable on a device, for whatever reason. Instead of storing the sequence, a unique identifier may be stored and used to look up the sequence in an online database operated by a trusted authority, to which only those who need it have access. The key may be protected as before by a fingerprint or other security. The condition is that the online service is accessible.
A mobile app may be used instead of dedicated hardware. Mobile devices are ubiquitous now, always carried by most people. The DNA may be stored directly on the mobile or referenced as above by a key and looked up online from secure storage.
In variants of the above two options, the DNA may be stored in an online database but may not be downloadable. Instead it may be queried to answer questions. This more strictly limits exposure of information if that is desired.
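The query-only variant in the last option can be sketched as a small service interface. Everything here is hypothetical: the class name `QueryOnlyGenomeStore`, the token identifiers, the SNP ids and the allow-list policy are assumptions made for illustration, and authentication of the person asking is left out entirely.

```python
from typing import Dict, Set


class QueryOnlyGenomeStore:
    """Holds genotypes server-side and answers allow-listed questions only.

    There is deliberately no method that returns a stored genotype.
    """

    def __init__(self, allowed_snps: Set[str]) -> None:
        self._genotypes: Dict[str, Dict[str, int]] = {}
        self._allowed = set(allowed_snps)

    def enroll(self, token_id: str, genotype: Dict[str, int]) -> None:
        """Store a copy of the genotype under the identifier held on the token."""
        self._genotypes[token_id] = dict(genotype)

    def carries_variant(self, token_id: str, snp: str) -> bool:
        """Answer one yes/no question: does this person carry the effect allele?"""
        if snp not in self._allowed:
            raise PermissionError(f"query on {snp} is not permitted")
        return self._genotypes[token_id].get(snp, 0) > 0


# Hypothetical usage: only one drug-relevant SNP may be queried.
store = QueryOnlyGenomeStore(allowed_snps={"rs_drug"})
store.enroll("token-123", {"rs_drug": 1, "rs_private": 2})
answer = store.carries_variant("token-123", "rs_drug")
```

The caller learns the answer to the permitted question and nothing else. A production service would also need authentication, audit logging and rate limits, which this sketch omits.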
We are better when we work together.
References
Galton, Francis (1869). Hereditary Genius. An Inquiry into its Laws and Consequences. 1st ed. London: Macmillan, 1869.
Plomin, Robert (2018). Blueprint. Cambridge, MA: MIT Press, 2018.
Plomin, Robert et al. (2018). “The new genetics of intelligence”. In: Nature Reviews Genetics 19 (2018), pp. 148–159.
Polderman, Tinca J C et al. (2015). “Meta-analysis of the heritability of human traits based on fifty years of twin studies”. In: Nature Genetics 47 (May 18, 2015), p. 702.