The Story of The Human Genome Project (HGP) …as Told by a Front-Line Participant
Eric Green, M.D., Ph.D., Director, National Human Genome Research Institute
Molecular Biology Revolution Set the Stage for the Human Genome Project (HGP)
1970s: DNA Cloning
1977: DNA Sequencing
1983: Polymerase Chain Reaction (PCR)
Drumbeat of Discussions Leading Up to HGP: 1984, 1986, 1987, 1988, 1988, 1989
“For the newly developing discipline of [genome] mapping/sequencing (including the analysis of the information), we have adopted the term GENOMICS…”
Genomics
Initially Envisioned Plan for HGP
1. Expect to be a 15-year initiative
2. Gain experience with model (i.e., well-studied, experimental) organisms with smaller genomes before giving full attention to human genome
3. In each case, map (i.e., organize) DNA first and then sequence (i.e., read) DNA
4. Wait to sequence human genome until a new ‘revolutionary’ DNA sequencing method(s) becomes available, replacing Sanger DNA sequencing
5. Make generating the first sequence of the human genome the signature accomplishment of the HGP
Spoiler Alerts: #1 An overestimate; #2 Maintained; #3 Mostly maintained; #4 Abandoned; #5 Absolutely true!
Organisms of the HGP: Evolutionary Separation
Organisms shown: Fruit Fly & Nematode, Yeast, Human, Mouse
Separation times shown: 500M years, 80M years, 1,000M years
Books to Conceptualize Genomes (bp = base pairs)
Genomes shown: Human Genome, Mouse Genome, Fruit Fly Genome, Nematode Genome, Yeast Genome
Genome sizes shown: ~3,000,000,000 bp, ~160,000,000 bp, ~100,000,000 bp, ~15,000,000 bp
Realities of 1990 (@ HGP Launch)
Scientific community had mixed opinions about HGP
No detailed start-to-finish plan for executing HGP (i.e., overt expectation to ‘figure it out along the way’)
Genomics was a ‘toddler’ field, growing up as a melting pot of scientific immigrants from other disciplines
Painfully early days of a functional internet
Implementation of HGP
International in terms of funding and participants
U.S. Funders: National Institutes of Health (NIH); Department of Energy (DOE)
Other Countries: Some government funders; some private funders
Distributed consortium-based (‘team science’) effort
For studying the human genome: ‘divide & conquer’ strategy
HGP Guided by Periodically Updated Plans: 1991-1995, 1993-1998, 1998-2003
Genomes Organized by Chromosomes Fruit Fly Yeast Human Nematode
Sequence-Ready Clone Contig Map
Clones highlighted by red rectangles selected for DNA sequencing
Caveat: The book metaphor is imperfect, since adjacent clones (i.e., pages) actually overlap slightly rather than having precise ‘page breaks’ between them.
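Choosing a small set of overlapping clones that still spans a mapped region can be viewed as an interval-cover problem. The sketch below is illustrative only: the clone names, the map coordinates, and the greedy selection rule are assumptions made for this example, not the selection procedure the HGP centers actually used.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover: pick few clones that together span the region.

    clones: list of (name, start, end) tuples in map coordinates (hypothetical data).
    Returns the chosen clone names in map order; raises ValueError if the map has a gap.
    """
    remaining = sorted(clones, key=lambda c: c[1])  # sort by start coordinate
    chosen = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones that start at or before the covered point,
        # keep the one that reaches furthest to the right.
        while i < len(remaining) and remaining[i][1] <= covered_to:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone map at position {covered_to}")
        chosen.append(best[0])
        covered_to = best[2]
    return chosen


# Hypothetical clones that overlap their neighbours, as in the caveat above.
example_clones = [
    ("A", 0, 100),
    ("B", 40, 150),
    ("C", 90, 220),
    ("D", 140, 230),
    ("E", 200, 300),
]
print(minimal_tiling_path(example_clones, 0, 300))  # ['A', 'C', 'E']
```

Clones B and D are skipped because their neighbours already cover the same stretch of the map, which is the sense in which only some clones need to be sequenced.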
Shotgun Sequencing
Shotgun sequencing is a laboratory technique for determining the DNA sequence of an organism’s genome (or part of the genome). The method involves randomly breaking up the DNA into small fragments that are then sequenced individually. A computer program looks for overlaps in the DNA sequences, using them to reassemble the fragments in their correct order to determine the sequence of the starting DNA.
From NHGRI’s ‘Talking Glossary’: genome.gov/genetics-glossary
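The "look for overlaps and reassemble" step in the definition above can be sketched with a toy greedy assembler. This is a teaching sketch under simplifying assumptions (error-free reads, a single strand, no repeats, and a made-up 16-base sequence); real assemblers are far more sophisticated, and the function names here are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (>= min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        # Drop reads that are fully contained in another read.
        reads = [r for r in reads if not any(r != s and r in s for s in reads)]
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads from a made-up sequence, given in scrambled order.
toy_reads = ["CCTTAGCA", "ATGGCGTA", "CGTACCTT"]
print(greedy_assemble(toy_reads))  # ['ATGGCGTACCTTAGCA']
```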
Shotgun Sequencing Strategy
Figure labels: Clone DNA; Subclones; Generate Shotgun Sequence Reads; Assemble Sequence Reads into Sequence Contigs; ‘Working Draft’ Sequence; Deduce Sequence; Sequence Finishing; Final Sequence
Example reads shown: GATCGTCTAGAATCTC GAGATCTCTGAGAGTC GTGGGAAACTGTGTGA TGTGACTAGCCACAGT GTGGGAAACTGTGTGA TACGTGTGAGAGATGT ATGATGCACCTGACCC GGGTTTCACTCTCAAC GACTCACTCCACCTCA GTGGGAAACTGTGTGA GAGGCCCACCGCCGCT GTGCACGTCCACCACC
Sequence Reads Like Raindrops on a Sidewalk
Because raindrops fall in random locations, it takes many extra drops in certain areas to ensure that every portion of the sidewalk gets wet.
The additional consideration for DNA sequencing is that the final accuracy of the sequence depends on reading every DNA base multiple times (e.g., 30-50 times; called ‘coverage’).
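The raindrop picture can be made quantitative with a standard idealization that is not stated on the slide: if reads land uniformly at random, the depth at any base is roughly Poisson-distributed, so at average coverage c the expected fraction of bases never read is about e^(-c). The sketch below compares that estimate with a small simulation on a circular toy genome; the genome size, read length, and function names are assumptions chosen for illustration.

```python
import math
import random


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases with zero reads at mean coverage c."""
    return math.exp(-coverage)


def simulate_uncovered_fraction(genome_len, read_len, coverage, seed=0):
    """Drop reads at random positions on a circular toy genome; return the unread fraction."""
    rng = random.Random(seed)
    n_reads = int(coverage * genome_len / read_len)
    depth = [0] * genome_len
    for _ in range(n_reads):
        start = rng.randrange(genome_len)
        for offset in range(read_len):
            depth[(start + offset) % genome_len] += 1  # wrap around to avoid edge effects
    return sum(1 for d in depth if d == 0) / genome_len


for c in (1, 2, 5):
    sim = simulate_uncovered_fraction(genome_len=200_000, read_len=100, coverage=c)
    print(f"coverage {c}x: simulated {sim:.4f}, Poisson estimate {expected_uncovered_fraction(c):.4f}")
```

Under this model 1x coverage still leaves roughly a third of the bases unread, which is why many "extra drops" are needed; the higher coverage figures quoted above serve accuracy as well as completeness.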
Sequence Assembly Challenges: Repeated Sequences
Imagine you did not know this text. If you were to sequence this text the way we sequence DNA, you would copy the text, fragment it, and sequence the fragments many times over (see right).
These fragments are like DNA sequence reads.
As the fragments are aligned to reconstruct the text, notice that there are ambiguities.
Branches and loops represent alternative assemblies in a complex and often repetitive genome.
An actual genome sequence is way more repetitive and complex than a Dickens novel.
Requires sophisticated computational tools to assemble sequence correctly.
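The ambiguity caused by repeats can be shown on a line of text. The sketch below uses the opening of Dickens's A Tale of Two Cities as a stand-in (the choice of text is an assumption, since the slide's image is not reproduced here) and reports every k-word context that is followed by more than one different word. Each such context is a branch point where an assembler cannot tell which continuation is correct.

```python
from collections import defaultdict


def branching_contexts(words, k=2):
    """Map each k-word context to its possible next words, keeping only ambiguous ones."""
    successors = defaultdict(set)
    for i in range(len(words) - k):
        successors[tuple(words[i:i + k])].add(words[i + k])
    return {ctx: sorted(nxt) for ctx, nxt in successors.items() if len(nxt) > 1}


text = "it was the best of times it was the worst of times".split()

print(branching_contexts(text, k=2))  # {('was', 'the'): ['best', 'worst']}
print(branching_contexts(text, k=4))  # {}
```

With two-word contexts the repeated phrase "it was the" creates a branch; with four-word contexts the branch disappears, which mirrors why reads that span a repeat make assembly easier.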
Example of Genome Sequence Assembly
First Eukaryotic Genomes Sequenced by HGP
Dividing Up Human Genome During HGP For example…
Challenges of Sequencing the Human Genome
Human Genome: ~3,000,000,000 nucleotides (bases or base pairs)
Sanger DNA sequencing circa 1990: ~500-800 bases per read
‘Coverage’ (i.e., number of times each base is read) needed to be high (e.g., >30-fold) to attain high accuracy
Roughly half of human genome consists of repetitive DNA, much of it reflecting remnants of transposable elements
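The numbers above imply a very large read count. As a rough back-of-the-envelope sketch (assuming every read is full length and usable, which real data never are), the number of reads is genome size times coverage divided by read length:

```python
def reads_needed(genome_bp, coverage, read_len_bp):
    """Ceiling of genome_bp * coverage / read_len_bp, using integer arithmetic."""
    total_bases = genome_bp * coverage
    return (total_bases + read_len_bp - 1) // read_len_bp


GENOME_BP = 3_000_000_000  # ~3 billion bases, from the slide
COVERAGE = 30              # 30-fold, the lower bound quoted on the slide

for read_len in (500, 800):
    print(f"{read_len}-base reads: {reads_needed(GENOME_BP, COVERAGE, read_len):,} reads")
# 500-base reads: 180,000,000 reads
# 800-base reads: 112,500,000 reads
```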
Generating the First Human Genome Sequence
Initial HGP Plan; Automation & Scale; Computational Power
6 Countries, 20 Centers, 1,000s of researchers
~1,000 bases/second, 24 hours/day, & 7 days/week for ~6 years
Brute force using Sanger DNA sequencing and massive computational help
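The throughput figure above can be turned into a total with simple arithmetic. This is a rough sketch that assumes 365-day years and nonstop operation at exactly the quoted rate; the slide does not say how much of that output was human sequence, so the fold-coverage figure is only indicative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def total_bases(rate_per_second, years, days_per_year=365):
    """Raw bases produced at a constant rate, running around the clock."""
    return rate_per_second * SECONDS_PER_DAY * days_per_year * years


raw = total_bases(rate_per_second=1_000, years=6)
fold = raw / 3_000_000_000  # relative to a ~3-billion-base genome

print(f"{raw:,} raw bases, about {fold:.0f}-fold of a 3-billion-base genome")
# 189,216,000,000 raw bases, about 63-fold of a 3-billion-base genome
```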
Whose Genome Was Sequenced by HGP?
Buffalo, NY blood donors
93% of HGP’s human genome sequence from 11 donors
70% of HGP’s human genome sequence from 1 donor
HGP human genome sequence was a ‘mosaic’ representation of multiple people (a ‘reference’)
Humorous Aside: Advocacy by some HGP researchers to select a ‘normal’ person and sequence their genome first, as if anyone knows what ‘normal’ means!
Bermuda Principles for Data Sharing
Significant attention to release and sharing of HGP genome sequence data
Two seminal meetings in Bermuda in 1996 and 1997
Landmark agreement for rapid data release and public access to HGP genome sequence data
Became known as ‘Bermuda Principles’
Among the most important legacies of HGP
Two Major Protagonists of HGP Drama Francis Collins (UVA Alumnus) Craig Venter (UCSD Alumnus)
Two Major Protagonists of HGP Drama
Craig Venter (UCSD Alumnus)
At NIH at beginning of HGP, pioneered use of expressed-sequence tags (ESTs) as shortcut for studying genes (sequencing RNA instead of DNA)
Began patenting human genes at furious pace, arousing controversy
Left NIH, founded private research institute, and became HGP participant
Grew impatient about pace of HGP; left HGP and joined forces with company that commercialized new automated instrument for very high-throughput Sanger DNA sequencing to create Celera Genomics
Celera Genomics aimed to compete with the HGP in generating the first human genome sequence and sell subscriptions for accessing their genomic data
Francis Collins (UVA Alumnus)
Physician (medical geneticist) and scientist
HGP participant at U. of Michigan before becoming Director of NIH’s ‘genome institute’ (succeeding Jim Watson)
Became de facto leader of international consortium of HGP centers sequencing human genome
Later appointed NIH Director by President Obama (and recently Acting Science Advisor to President Biden)
Purported ‘Race’ to Sequence Human Genome
Initial HGP Plan: ‘Clone-by-Clone Shotgun Sequencing’
VS
Venter/Celera Plan: ‘Whole-Genome Shotgun Sequencing’
Editorial Aside: Not really a fair ‘race’ since Celera had access to HGP data (but not vice versa)!!!
June 2000: Draft Sequence of Human Genome
Vanity Fair (December 2000) Press Coverage of the ‘Race’
February 2001: Papers Reporting Draft Sequence of Human Genome HGP Paper Venter/Celera Paper
After June 2000 Announcement & February 2001 Publications
Venter/Celera could not fully assemble the human genome sequence and relied on the publicly available data to resolve many of the difficult regions; had little interest in improving (i.e., ‘finishing’) sequence beyond ‘working draft’ quality
HGP focused on improving the human genome sequence from a ‘working draft’ to high-quality ‘finished’ sequence
Celera’s business plan to sell subscription access to the human genome sequence eventually failed
Venter moved on to various other endeavors
Generating the First Human Genome Sequence
Initial HGP Plan + Venter/Celera Plan = Ultimate HGP Plan
April 25, 2003: HGP Completion
National DNA Day established to mark HGP completion & 50th anniversary of discovery of DNA’s double-helical structure
Highlight Features of HGP
Completed ahead of schedule (13 years) and under budget
Signature accomplishment was generation of an extremely high-quality sequence for >90% (‘near-complete’ or ‘essentially complete’) of human genome
Cost of generating first human genome sequence by HGP: ~$1 billion
The ‘race’ between HGP and Venter/Celera melted away after announcement of draft human genome sequence in 2000
Similarly, the initial concerns about the HGP from some parts of the scientific community largely melted away
HGP set the field of genomics onto a trajectory of widespread dissemination across biology, medicine, and society
To Learn More about HGP… genome.gov/HGP
Epilogue: A Truly Complete Human Genome Sequence
HGP produced a high-quality human genome sequence, but it only accounted for 92% of the human genome
Remaining 8% was not ‘readable’ using the then-available methods for DNA sequencing, but those regions are important for structural (centromeres and telomeres) and medical reasons
Several new ‘revolutionary’ methods for DNA sequencing have been developed over the last ~20 years
These new methods plus better computational approaches set the stage for a new group of researchers to (finally) generate a truly complete sequence of the human genome in 2022
2022: A Truly Complete Human Genome Sequence
Recent Time Video
Take-Home Messages
HGP: 1990-2003
HGP used a map-first, sequence-second strategy to study the human genome
HGP used Sanger DNA sequencing, not a revolutionary new DNA sequencing method
Sequencing the human genome was particularly difficult because of its large size, complexity, and extensive amounts of repetitive regions
Genome sequence assembly was (and remains) a major challenge; repetitive regions present a particular obstacle to accurately assembling genome sequences
Venter/Celera pursued a whole-genome shotgun sequencing strategy and tried to build a business selling access to their data; both efforts fell short of expectations
Ultimately, the HGP completed the task of generating the first high-quality ‘essentially complete’ sequence of the human genome; 19 years later (in 2022), a truly complete (‘telomere-to-telomere’) human genome sequence was finally generated
Take-Home Messages Beyond the HGP
In reality, HGP was the end of one journey, but the beginning of another. For example, HGP determined the sequence of most of the ~3 billion bases in the human genome, with the next phase focused on INTERPRETING the information encoded in that sequence, something that continues to the present time.
Scale of the Human Genome Sequence www.youtube.com/watch?v=eLmElwzOCdU