The successful sequencing of the human genome[1,2] in 2001 is considered by many to be one of the greatest achievements in biology. The published sequences were generated from the DNA of a few anonymous volunteers of differing ethnic backgrounds. However, a single genome (even one generated from many individuals) can provide only so much information. It was immediately clear that we would need to generate and compare more sequences from different people if we were to harness the information encoded in genomes to better understand our health and heritage. So far, we have genomes for hundreds of thousands of individuals, more than was imaginable 20 years ago. Even so, we are just beginning to sequence diverse populations in the numbers needed to realize the promise of genomics.
Although human genomes are 99.9% similar, they also contain millions of single nucleotide polymorphisms (SNPs), which are single bases at which there is genetic variation between individuals. A map of about 1.42 million SNPs was published alongside the draft genome[3], generated in part from differences found between the individuals who contributed their DNA for the draft. Thus, the Human Genome Project provided a framework for larger-scale projects to analyse human variation.
In 2003, a consortium of researchers set out to generate a genetic map of SNPs from diverse individuals, an endeavour known as the International HapMap Project[4]. The first iteration of the map, published in 2007, was a major milestone that documented more than 3 million SNPs discovered in 270 individuals from Japan, China, the United States and Nigeria[5]. The work shed light on how the genome is organized, revealing how segments of our DNA are inherited together as blocks, and highlighting how these blocks vary within and between populations. The HapMap was eventually expanded to include 11 population groups[6], emphasizing differences in how common human genetic variants (HGV) are distributed worldwide.
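The idea that neighbouring variants are inherited together as blocks is commonly quantified with linkage disequilibrium. The sketch below is an illustration only, not the HapMap consortium's method: it computes the standard r² statistic between two SNPs from a list of phased haplotypes. The function name `r_squared`, the 0/1 allele coding and the example haplotypes are all assumptions made for this example.

```python
def r_squared(haplotypes):
    """Linkage disequilibrium (r^2) between two biallelic SNPs.

    haplotypes: sequence of (allele_at_snp1, allele_at_snp2) pairs,
    with alleles coded 0 or 1 (a hypothetical coding for this sketch).
    """
    n = len(haplotypes)
    if n == 0:
        return 0.0
    # Frequency of allele 1 at each SNP, and of the 1-1 haplotype.
    p_a = sum(h[0] for h in haplotypes) / n
    p_b = sum(h[1] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    # D measures how far the observed haplotype frequency departs
    # from what independent inheritance would predict.
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # If either SNP does not vary in the sample, r^2 is undefined; return 0.
    return d * d / denom if denom > 0 else 0.0


# Made-up haplotypes: the first pair of SNPs always travels together,
# the second pair is inherited independently.
linked = [(1, 1)] * 5 + [(0, 0)] * 5
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
r2_linked = r_squared(linked)
r2_unlinked = r_squared(unlinked)
```

Values of r² near 1 indicate that two SNPs sit in the same inherited block, so genotyping one largely predicts the other; values near 0 indicate that they are inherited independently.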
The HapMap project also aided the development of biotechnological and computational approaches such as genome-wide association studies (GWAS), which allow scientists to search thousands of individual genomes to discover genetic variants that are linked to specific traits. GWAS have successfully identified genomic regions that increase the risks of common conditions such as diabetes, coronary artery disease and Crohn's disease[7]. But GWAS have been performed mainly in people of European ancestry[7]: as of December 2020, 78% of individuals in all GWAS were of such ancestry (go.nature.com/3ocyhql). Several factors account for this bias, including a reliance on existing cohorts, a preference for homogeneous population groups, limited funding for enrolling under-represented groups and early perceptions that findings from Europeans should be generalizable to other groups. The lingering lack of diversity in GWAS has been highlighted as one of the main roadblocks to the scientific and equitable realization of the promise of genomics[8,9].
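At its simplest, a GWAS tests each variant in turn for a difference in allele frequency between people with and without a trait. The sketch below is a minimal illustration under stated assumptions, not the pipeline used in the studies cited: it applies a one-degree-of-freedom allelic chi-square test to two hypothetical SNPs. The SNP names, the allele counts and the function name are invented for this example, and the 5e-8 cutoff is the conventional genome-wide significance threshold rather than a value taken from the text. Real analyses also adjust for ancestry, relatedness and other covariates.

```python
from math import erfc, sqrt


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Return (chi2, p_value) for a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so no association can be tested.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom, the chi-square upper tail probability
    # equals erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical counts: (case_alt, case_ref, control_alt, control_ref).
snps = {
    "snp_A": (350, 650, 200, 800),  # alternate allele enriched in cases
    "snp_B": (250, 750, 248, 752),  # nearly identical frequencies
}
GENOME_WIDE_THRESHOLD = 5e-8  # conventional cutoff, assumed here

results = {name: allelic_chi_square(*counts) for name, counts in snps.items()}
hits = [name for name, (_, p) in results.items() if p < GENOME_WIDE_THRESHOLD]
```

The stringent threshold reflects the very large number of variants tested in a single study. It is also one reason why large cohorts are needed, and why findings from cohorts of mostly European ancestry cannot simply be assumed to transfer to other groups.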
The 1000 Genomes Project was created in 2008 to generate a more comprehensive catalogue of HGV by systematically sequencing the genomes of thousands of individuals from diverse geographical locations, to identify both common and rare genetic variants[10]. Helped by the ever-diminishing cost of sequencing, the project had, by its completion, sequenced the genomes of 2,504 individuals from 26 population groups on 5 continents (including several groups with mixed ancestries), providing a detailed catalogue of genetic variants on a scale previously unimaginable.