Human Genome Project Background
The idea of sequencing the entire human genome
was first proposed in discussions at scientific
meetings organized by the US Department of
Energy and others from 1984 to 1986.
A subsequent committee appointed by the US National
Research Council recommended a broader programme,
to include:
The creation of genetic, physical and sequence
maps of the human genome;
Parallel efforts in key model organisms such as
bacteria, yeast, worms, flies and mice;
Development of technology in support of these
objectives;
Research into the ethical, legal and social issues
raised by human genome research.
HGP BACKGROUND (continued)
The Human Genome Organization (HUGO) and the
International Human Genome Sequencing Consortium
(IHGSC) were founded to provide a forum for
international coordination of genomic research.
At the NIH, the HGP is constituted as the National Human
Genome Research Institute (NHGRI).
The collaboration was coordinated through periodic
international meetings (referred to as ‘Bermuda
meetings’)
Work was shared flexibly among the centres, with
some groups focusing on particular chromosomes and
others contributing in a genome-wide fashion.
A second principle of the collaboration was rapid and
unrestricted data release. The centres adopted a policy that
all genomic sequence data should be made publicly available
without restriction within 24 hours of assembly (the Bermuda
Principle).
Human Genome Project
Begun formally in 1990, the U.S. Human Genome
Project was a 13-year effort coordinated by the U.S.
Department of Energy and the National Institutes of
Health. The project originally was planned to last 15
years, but rapid technological advances accelerated the
completion date to 2003.
Project goals were to:
Identify all the approximately 20,000-25,000 genes in
human DNA,
Determine the sequences of the 3 billion chemical base
pairs that make up human DNA,
Store this information in databases,
Improve tools for data analysis,
Transfer related technologies to the private sector, and
Address the ethical, legal, and social issues (ELSI) that
may arise from the project.
Milestones:
June 2000: Completion of a working draft of
the entire human genome
February 2001: Analyses of the working
draft are published
April 2003: HGP sequencing is completed
and Project is declared finished two years
ahead of schedule
Timeline of large-scale genomic analyses.
HUMAN GENOME
The human genome contains 3 billion chemical
nucleotide bases (A, C, T, and G).
The average gene consists of 3000 bases, but sizes
vary greatly, with the largest known human gene
being dystrophin at 2.4 million bases.
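The total number of genes was estimated at around
30,000 in the draft analyses, much lower than previous
estimates of 80,000 to 140,000 (the figure was later
revised to the 20,000-25,000 quoted in the project goals).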
Almost all (99.9%) nucleotide bases are exactly
the same in all people.
The functions are unknown for over 50% of
discovered genes.
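As a rough worked example, the sketch below turns the figures on this slide into absolute numbers. It assumes the rounded values quoted above (3 billion bases, 99.9% identity, an average gene of 3000 bases, dystrophin at 2.4 million bases) and treats them as exact purely for illustration; the constant and function names are hypothetical.

```python
# Rounded figures quoted on the slide; treated as exact only for illustration.
GENOME_BASES = 3_000_000_000      # total nucleotide bases
SHARED_FRACTION = 0.999           # fraction of bases identical in all people
AVERAGE_GENE_BASES = 3_000        # average gene size in bases
DYSTROPHIN_BASES = 2_400_000      # largest known human gene


def differing_bases(genome_bases: int, shared_fraction: float) -> int:
    """Approximate number of bases that are not shared by all people."""
    return round(genome_bases * (1 - shared_fraction))


def size_ratio(gene_bases: int, average_bases: int) -> float:
    """How many times larger a gene is than the average gene."""
    return gene_bases / average_bases


if __name__ == "__main__":
    print(differing_bases(GENOME_BASES, SHARED_FRACTION))
    print(size_ratio(DYSTROPHIN_BASES, AVERAGE_GENE_BASES))
```

Under these assumptions, the 0.1% of bases that vary corresponds to roughly 3 million positions, and dystrophin is several hundred times the size of an average gene.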
HUMAN GENOME PROJECT:
PUBLIC AND PRIVATE SECTOR
Two Different Groups Worked to Obtain
the DNA Sequence of the Human Genome
The HGP was a publicly funded multinational
consortium established by government research
agencies and led in the US by Francis Collins.
Celera Genomics is a private company whose
former CEO, J. Craig Venter, ran an
independent sequencing project.
Differences arose regarding who should receive
the credit for this scientific milestone.
On June 26, 2000, the HGP and Celera Genomics
held a joint press conference to announce that
TOGETHER they had completed ~97% of the
human genome.
PUBLISHED
The International Human Genome Sequencing
Consortium published their results in Nature,
409 (6822): 860-921, 2001.
“Initial Sequencing and Analysis of the
Human Genome”
Celera Genomics published their results in
Science, Vol 291(5507): 1304-1351, 2001.
“The Sequence of the Human Genome”
HGP SEQUENCING STRATEGIES
LARGE SCALE SEQUENCING TECHNOLOGY
Genome Glossary
HGP SEQUENCING STRATEGIES
The HGP had three stages:
Genetic (or linkage) mapping
Physical mapping
DNA sequencing
Three-Stage Approach to Genome Sequencing
Strategic Issues
There are two approaches for sequencing
large repeat-rich genomes.
The first is the whole-genome shotgun sequencing
approach, as has been used for the repeat-poor
genomes of viruses, bacteria and flies,
using linking information and computational
analysis to avoid misassemblies.
The second is the ‘hierarchical shotgun
sequencing’ approach, also referred to as
‘map-based’, ‘BAC-based’ or ‘clone-by-clone’.
‘HIERARCHICAL SHOTGUN SEQUENCING’
‘MAP-BASED’, ‘BAC-BASED’ OR
‘CLONE-BY-CLONE’
Technology for large-scale sequencing
US HGP
Clone-by-clone or hierarchical sequencing strategy
Advantages:
Ability to fill gaps and re-sequence
uncertain regions.
Ability to distribute the clones to other labs.
Ability to check the produced sequence with
restriction enzymes.
Disadvantages:
Construction of the physical map is expensive
and time-consuming.
Experienced personnel are required.
HIERARCHICAL ASSEMBLY OF SEQUENCE:
CONTIG, SCAFFOLD
Assembly of the draft genome sequence
The key steps in assembling individual sequenced clones into the draft genome
sequence.
Levels of clone and sequence coverage.
WHOLE-GENOME SHOTGUN
Developed by J. Craig Venter
Whole-Genome Shotgun Approach to Genome
Sequencing
The whole-genome shotgun approach was
developed by J. Craig Venter in 1992.
This approach skips genetic and physical
mapping and sequences random DNA
fragments directly.
Powerful computer programs are used to
order fragments into a continuous
sequence.
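To make the idea of ordering random fragments by computer concrete, here is a minimal sketch of greedy overlap merging on toy reads. It is an illustration only, not the assembler used by Celera or by the HGP; the function names and the `min_len` parameter are hypothetical, and the sketch ignores sequencing errors, reverse complements and reads contained inside other reads.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

A greedy merge like this is easily misled when the same sequence occurs in several places, which is one reason repeat-rich genomes are harder to assemble from shotgun reads alone.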
Whole-Genome Shotgun Sequencing
Shotgun Sequencing Strategy
Advantages:
No physical map construction,
Less risk of recombinant clones,
Cost effective and fast,
Ideal for small genome sequencing.
Disadvantages:
Difficult to fill gaps and to
re-track all the sequenced plasmids,
Data less useful for positional cloning.
Whole-Genome Assembly
Hierarchical vs. Shotgun Sequencing
Assembly of a mapped scaffold
Generating the draft genome sequence
Generating a draft sequence of the human
genome involved three steps:
Selecting the BAC clones to be sequenced,
Sequencing them, and
Assembling the individual sequenced clones
into an overall draft genome sequence.
Assembly of the draft genome sequence
This process involved three steps:
Filtering,
Layout and
Merging.
The entire data set was filtered uniformly
to eliminate contamination from nonhuman
sequences and other artefacts that had not
already been removed by the individual
centres.
Assembly of the draft genome sequence
The sequenced clones were then associated
with specific clones on the physical map to
produce a `layout'.
The fingerprint clone contigs were then
mapped to chromosomal locations, using
sequence matches to mapped STSs from
four human radiation hybrid maps,
one YAC map and two genetic maps, together
with data from FISH.
The human genome assembly and annotation process
• BUILD CYCLE
• DATA FREEZE
• RELEASE
The human genome assembly and annotation
process: INPUTS
Genome Annotation
Feature Annotation
◦Clone Features
◦STS Features
◦SNP Features
◦Gene, mRNA (transcript)
◦misc_RNA (pseudogenes and non-coding
transcripts)
◦Protein Features
◦Repeat Features
Genome Annotation
Products
◦Sequence Data
◦Resource Support (dbSNP, Entrez Gene, Map
Viewer, UniSTS)
Data Access
◦BLAST
◦Entrez Retrieval (accession number, gene
symbol, or protein name)
◦FTP (genomes FTP site)
Links from Map Viewer objects to other
NCBI resources
UCSC put the human genome
sequence on the web July 7, 2000
UCSC put the human genome sequence
on CD in October 2000, with varying
results
HGP ON WEB
Genome browsers have been developed and are maintained
by the University of California at Santa Cruz (UCSC) and
by the EnsEMBL project of the European Bioinformatics
Institute and the Sanger Centre. Additional browsers
have been created;
URLs are listed at www.nhgri.nih.gov/genome_hub.
These web-based computer tools allow users to view
an annotated display of the draft genome sequence,
with the ability to scroll along the chromosomes and
zoom in or out to different scales.
In addition to using the Genome Browsers, one can
download from these sites the entire draft genome
sequence together with the annotations in a computer-
readable format.
UCSC GENOME BROWSER
Broad genomic landscape
The distribution of GC content,
CpG islands,
Recombination rates,
Repeat content and
Gene content of the human genome.
Long-range variation in GC content
GC-rich and GC-poor regions may have
different biological properties:
Gene density,
Composition of repeat sequences,
Correspondence with cytogenetic bands,
Recombination rate.
CpG islands are of particular interest
because they are associated with the
5’ ends of genes.
Repeat content of the human genome
INTERSPERSED REPEATS
Gene content of the human genome
RNA genes and
protein-coding genes in the human genome.
Noncoding RNAs
There are several major classes of ncRNA:
tRNAs
rRNAs
small nucleolar RNAs (snoRNAs), which are required for
rRNA processing and base modification in the nucleolus
small nuclear RNAs (snRNAs), which are critical components
of spliceosomes, the large ribonucleoprotein (RNP)
complexes that splice introns out of pre-mRNAs in the
nucleus.
ncRNAs do not have translated ORFs, are often small
and are not polyadenylated.
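Because ncRNAs lack translated ORFs, a search based on open reading frames will not find them. The sketch below shows what such a search looks like in its simplest form: a toy forward-strand ORF scanner. It is not one of the ab initio gene prediction tools, which use far richer statistical models; the function name and the `min_codons` parameter are hypothetical, and introns and the reverse strand are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `start` is inclusive, `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a new reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close it and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"
    for start, end in find_orfs(toy):
        print(start, end, toy[start:end])
```

A sequence with no ATG followed by an in-frame stop yields an empty list, which is the situation for a typical ncRNA gene under this simplified model.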
Software tools for ab initio gene prediction
Distribution of the homologues of
the predicted human proteins.
Conserved segments in the human and mouse genome.
* Each colour corresponds to a particular mouse
chromosome.
DISEASE GENES
DRUG TARGETS
Research challenges in genetics: what we still don't know, even with
the full human DNA sequence in hand.
Gene number, exact locations, and functions
Gene regulation
DNA sequence organization
Chromosomal structure and organization
Noncoding DNA types, amount, distribution, information content, and
functions
Coordination of gene expression, protein synthesis, and post-translational
events
Interaction of proteins in complex molecular machines
Predicted vs. experimentally determined gene function
Evolutionary conservation among organisms
Protein conservation (structure and function)
Proteomes in organisms
Correlation of SNPs with health and disease
Disease-susceptibility prediction based on gene sequence variation
Genes involved in complex traits and multigene diseases
Complex systems biology, including microbial consortia useful for
environmental restoration
Developmental genetics, genomics
“The more we learn about the human genome,
the more there is to explore”
“We shall not cease from exploration. And the end of all
our exploring will be to arrive where we started, and
know the place for the first time.” T. S. Eliot