How does population representation impact genomic findings?

Cancer classification, presymptomatic diagnosis of autism spectrum disorder, and improved care for Alzheimer’s disease have all been made possible through genetic research. The Human Genome Project, a massive undertaking, sequenced the entire human genome and led to scientific breakthroughs such as the discovery of over 1,800 disease genes (Health). These data contributed immensely to our scientific understanding of disease inheritance, diagnosis, and treatment. However, approximately 80% of participants in the Human Genome Project and other genomic studies are of European descent, a proportion far from representative of the globe's diversity (Manolio). Though these data transformed healthcare, more diverse populations must be studied before genomic data can accurately represent people around the globe.

Human life is governed by the genome, which contains between 20,000 and 25,000 genes. Genes are written as sequences of four nucleotides, and single base pair changes that occur in more than 1% of the world’s population are known as single nucleotide polymorphisms (SNPs) (Huang et al.). Researchers study these differences between genomes to identify what correlates with, and what causes, differences in human health. Scientists have found more than 100 million human SNPs (U.S. National Library of Medicine), which help determine unique physical characteristics and reveal population-level trends, such as in disease progression. SNPs are therefore also a valuable tool for analyzing community health using geographically distributed alleles, or gene variants.
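The 1% threshold mentioned above can be made concrete with a small Python sketch. It is only an illustration: the function names and the allele counts are hypothetical, not drawn from any of the cited studies, and the variant frequency is approximated by the frequency of the second most common allele at a site.

```python
def minor_allele_frequency(allele_counts):
    """Return the frequency of the second most common allele at one site."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles observed")
    freqs = sorted((count / total for count in allele_counts.values()), reverse=True)
    # A site with a single observed allele has no variation at all.
    return freqs[1] if len(freqs) > 1 else 0.0


def meets_snp_threshold(allele_counts, threshold=0.01):
    """True if the less common allele reaches the 1% population threshold."""
    return minor_allele_frequency(allele_counts) >= threshold


# Hypothetical allele counts at two sites (made-up numbers for illustration).
site_a = {"A": 9700, "G": 300}  # less common allele at 3%
site_b = {"C": 9995, "T": 5}    # less common allele at 0.05%

print(meets_snp_threshold(site_a))
print(meets_snp_threshold(site_b))
```

Note that the answer depends on who was sampled: a variant that falls below the threshold in one cohort can sit well above it in another.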

Because current genomic data come mostly from individuals of European descent, our ability to understand health impacts on people of other ancestries is limited (Bentley et al.). Inadequate data on high-risk allele frequencies in non-European populations make predictions from current data less accurate and limit how well they apply to other ancestry groups (Manolio). For instance, sickle cell anemia is a single-gene disease originating in Africa that is almost 9 times more common among people of African descent than among those of European ancestry (CDC), indicating the need to study disease etiology among people of African descent. Despite the general consensus that including diverse populations in genomic research can advance science, there have yet to be major efforts to collect data representative of non-European individuals. Because 99.9% of human DNA is shared (Huang et al.), current genomic data can still support informative inferences about underrepresented populations, but these inferences are limited by the wide-ranging effects that differences in even a single gene can have. Representation has improved in some areas, such as for Asian individuals in comparative genomics, but Hispanic and African groups remain underrepresented (Lindsey).

A limited picture of population health leads to misdiagnoses and other harms. Our genomes harbor valuable information for differentiating human physiology and disease pathology. Many of these differences can be elucidated by genome-wide association studies (GWAS), so a lack of diversity in GWAS leaves genetic profiles missing and restricts our understanding of these associations. Genetically profiling ethnic groups is crucial in a pharmacological setting to ensure safe personalized medicine (Bentley et al.). For instance, a recent analysis found that because African Americans were underrepresented in genomic databases, many were misdiagnosed with hypertrophic cardiomyopathy (HCM), an inherited heart disease (Lindsey). Disregarding an entire population risks catastrophic consequences such as large-scale misdiagnosis and unsafe medicine.

Further analysis of genomic data trends shows that including more diverse populations increases the efficacy of genetics-based diagnosis (Mei and Wang). Cystic fibrosis (CF), a lethal disease affecting the airways and nutrient absorption, is a classic case. Through genome sequencing, scientists have found that thousands of distinct mutations can cause CF. Because these mutations have varying genetic origins, basing CF diagnoses on a biased sample can leave cases entirely undetected. For example, while a single phenylalanine mutation causes up to 70% of cases in Europeans, it accounts for only a quarter of cases in Africans (Sirugo et al.). This is one of many examples that strengthen the argument for more diverse sequencing, as recognizing and testing for the pathogenic variants specific to each ancestry group is critical for appropriate clinical intervention.
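To put the CF figures above in perspective, here is a rough Python sketch of how many cases a test that looks only for that one mutation would miss. The single-variant test and the cohort of 1,000 cases per group are hypothetical simplifications; only the two percentages come from the text (Sirugo et al.), with "up to 70%" taken as 70% and "a quarter" as 25%.

```python
# Share of CF cases attributed to the single mutation, as stated in the text.
case_share = {"European": 0.70, "African": 0.25}


def missed_cases(share_detected, n_cases):
    """Cases a single-variant test would fail to flag, out of n_cases."""
    return round((1 - share_detected) * n_cases)


# Hypothetical cohort: 1,000 CF cases in each group.
missed = {group: missed_cases(share, 1000) for group, share in case_share.items()}

for group, count in missed.items():
    print(f"{group}: {count} of 1000 cases missed")
```

Under these assumptions the same test misses 300 of 1,000 cases in the European group and 750 of 1,000 in the African group, which is the gap that ancestry-specific variant data are meant to close.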

While existing genomic data are helpful, we should strive for more representative data that reflect the diversity of all groups of people. Discovering disease origins, creating treatment options, and modeling population health will not be fully effective until genomic data better represent the world's population. One cannot deny the overall contributions of genomics to our current understanding of human health, but we can strive for better.



Bentley, Amy R., et al. “Diversity and Inclusion in Genomic Research: Why the Uneven Progress?” Journal of Community Genetics, vol. 8, no. 4, Springer Verlag, Oct. 2017, pp. 255–66, doi:10.1007/s12687-017-0316-6.

CDC. “Data & Statistics on Sickle Cell Disease | CDC.” CDC, 16 Dec. 2020,

Health, Slingshot. “Top 10 Breakthroughs of the Human Genome Project - Slingshot Health Blog.” Slingshot Health, 25 Apr. 2019,

Huang, Tao, et al. “Genetic Differences among Ethnic Groups.” BMC Genomics, vol. 16, no. 1, BioMed Central Ltd., Dec. 2015, doi:10.1186/s12864-015-2328-0.

Lindsey, Heather. “Bringing Diversity to Genomic Data | AACC.Org.” AACC, 1 June 2017,

Manolio, Teri A. “Using the Data We Have: Improving Diversity in Genomic Research.” American Journal of Human Genetics, vol. 105, no. 2, Cell Press, 1 Aug. 2019, pp. 233–36, doi:10.1016/j.ajhg.2019.07.008.

Mei, [, and B. Wang. An Efficient Method to Handle the “large p, Small n” Problem for Genomewide Association Studies Using Haseman-Elston Regression. 2016, doi:10.1007/s12041-016-0705-3.

Sirugo, Giorgio, et al. “The Missing Diversity in Human Genetic Studies.” Cell, vol. 177, no. 1, Cell Press, 21 Mar. 2019, pp. 26–31, doi:10.1016/j.cell.2019.02.048.

U.S. National Library of Medicine. “What Are Single Nucleotide Polymorphisms (SNPs)?” Medline Plus, 18 Sept. 2020,

