Genome-wide association study with Whole Genome Sequencing
Esther481825
48 slides
Oct 02, 2024
About This Presentation
GWAS
NGS
comparative genomics
LEC 3 AND 4
- GWAS using Whole Genome Sequencing
- Comparative Genomics
Scanning markers across the complete sets of DNA or
genomes of many people to find genetic variations
associated with a particular disease.
Determines alleles that correlate to different diseases
and traits.
GWAS focuses on SNPs that differ between individuals.
Reduces the number of nucleotides for analysis
Genome wide association studies
(GWAS)
GWAS
Participants are divided
into two groups:
People with disease /
Trait of interest
People without disease /
Trait of interest
GWAS
Identified SNPs are only ‘associated’ with the target
disease / trait
GWAS results are reported as odds ratios and do not give a
definitive answer. WHY?
– Traits and diseases are also impacted by the
environment and lifestyle decisions, not strictly genetics.
– The risk reported is an individual's risk of developing the
disease over an entire lifetime, and it may change during
that lifetime.
– Most diseases and traits are controlled by many genes,
so the impact of one particular gene may be minimal
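As a rough illustration of how an odds ratio is obtained from the two groups (people with and without the disease), the sketch below uses a 2x2 table of allele carriers. The counts are hypothetical and the function names are made up for this example; the confidence interval uses the common log odds ratio approximation, which is an assumption and not something stated in the slides.

```python
import math


def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Odds of carrying the allele among cases divided by the odds among controls."""
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


def odds_ratio_ci95(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Approximate 95% confidence interval computed on the log odds ratio scale."""
    log_or = math.log(
        odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers)
    )
    standard_error = math.sqrt(
        1 / case_carriers
        + 1 / case_noncarriers
        + 1 / control_carriers
        + 1 / control_noncarriers
    )
    return math.exp(log_or - 1.96 * standard_error), math.exp(log_or + 1.96 * standard_error)


# Hypothetical counts: 300 of 1000 cases and 200 of 1000 controls carry the allele.
ratio = odds_ratio(300, 700, 200, 800)
low, high = odds_ratio_ci95(300, 700, 200, 800)
print(f"odds ratio = {ratio:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

An odds ratio above 1 only says the allele is more frequent among cases; it is an association, not proof that the allele causes the disease.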
GWAS
Diseases and traits which have been investigated
by GWAS include:
– Diabetes
– Pigmentation
– Epilepsy
– Alzheimer’s disease
– Autism
– Crohn’s disease
– Cancer
– Bipolar disorder
– Asthma
– High cholesterol
– Height and BMI
Chronic disorder affecting
the digestive system
Causes inflammation of the
digestive tract lining
Can lead to abdominal pain,
severe diarrhea, fatigue,
weight loss and
malnutrition.
Crohn’s disease
Occurs in 1/1000 people
Mainly appears in late teens / early twenties
More often occurs in white people and in people of Eastern and
Central European Jewish descent
Genetic and environmental factors play a role
Genetic variations in certain regions of chromosomes 5
and 10 are associated with the disease
Variations in specific genes (ATG16L1, IL23R, IRGM, and
NOD2) influence the risk of developing the disease
CROHN’S DISEASE
71 confirmed Crohn's disease susceptibility loci
ATG16L1 gene is located on chromosome 2 from bp
233,251,571 – bp 233,295,674.
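A small sketch of how the coordinates above can be used: storing the gene as an interval and asking whether a variant position falls inside it. The class name and the query positions are hypothetical; only the ATG16L1 coordinates come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneInterval:
    name: str
    chromosome: str
    start: int  # first base pair, inclusive
    end: int    # last base pair, inclusive

    def length(self):
        """Number of base pairs spanned by the gene."""
        return self.end - self.start + 1

    def contains(self, chromosome, position):
        """True if the position lies on the same chromosome and inside the interval."""
        return chromosome == self.chromosome and self.start <= position <= self.end


# Coordinates as given on the slide.
ATG16L1 = GeneInterval("ATG16L1", "2", 233_251_571, 233_295_674)

print(ATG16L1.length())                      # span of the gene in bp
print(ATG16L1.contains("2", 233_260_000))    # hypothetical position inside the gene
print(ATG16L1.contains("5", 233_260_000))    # same coordinate, different chromosome
```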
CROHN’S DISEASE
WGS informs analysis of
oncogenes, tumor suppressors
and other risk factors
Provides base by base view of
unique mutations present in
cancer tissue
Enables discovery of novel
cancer-associated variants –
SNVs, CNVs and structural
variants (indels)
CANCER
CNVs - COPY NUMBER VARIANTS
Copy number variation is a phenomenon in
which sections of the genome are repeated and
the number of repeats in the genome varies
between individuals.
Copy number variation is a type of structural
variation: specifically, it is a type of duplication
or deletion event that affects a considerable
number of base pairs
One example is the CYP2D6 gene, which encodes a
cytochrome P450 enzyme in humans. Cytochrome P450
enzymes are important in breaking down substances
not produced by the body, as in drug metabolism.
These structural differences may have come
about through duplications, deletions or other
changes and can affect long stretches of DNA.
Such regions may or may not contain a gene(s).
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4472309/pdf/nihms-696610.pdf
Copy Number Variation in Human Health,
Disease, and Evolution Zhang et al., 2009
SNVs - SINGLE NUCLEOTIDE VARIANTS
A single nucleotide variant (SNV) is a variation of a single
nucleotide in a population's genome. Like an SNV, a single
nucleotide polymorphism (SNP) is a single base
substitution, but it is limited to germline DNA
and must be present in at least 1% of the
population.
An SNV can be rare in one population but common
in a different population. SNVs are sometimes
referred to as single nucleotide polymorphisms
(SNPs), although the two terms are not
interchangeable: only a variant present in at least
1% of the population qualifies as an SNP.
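The 1% rule can be written down as a short check. This is a minimal sketch assuming a diploid sample (two alleles per person); the function names and the counts are hypothetical.

```python
SNP_MIN_FREQUENCY = 0.01  # the 1% population threshold stated above


def allele_frequency(alt_allele_count, individuals_genotyped):
    """Alternate-allele frequency, assuming two alleles per individual (diploid)."""
    return alt_allele_count / (2 * individuals_genotyped)


def classify_single_base_variant(frequency, germline=True):
    """Return "SNP" for a germline variant at >= 1% frequency, otherwise "SNV".

    Every SNP is also an SNV; "SNV" here means the variant does not
    meet the extra conditions required to be called an SNP.
    """
    if germline and frequency >= SNP_MIN_FREQUENCY:
        return "SNP"
    return "SNV"


# Hypothetical counts in a sample of 1000 people (2000 alleles).
common = allele_frequency(30, 1000)   # 1.5% of alleles
rare = allele_frequency(10, 1000)     # 0.5% of alleles
print(classify_single_base_variant(common))                 # germline, >= 1%
print(classify_single_base_variant(rare))                   # germline, < 1%
print(classify_single_base_variant(common, germline=False)) # somatic variant
```

Because the frequency is measured per population, the same variant can be labelled differently depending on which population was sampled.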
INDELS
An insertion/deletion polymorphism, commonly
abbreviated “indel,” is a type of genetic variation
in which a specific nucleotide sequence is present
(insertion) or absent (deletion). While not as
common as SNPs, indels are widely spread across
the genome.
An SNP changes a single nucleotide in the DNA
sequence, whereas an indel incorporates or
removes one or more nucleotides (Loewe, 2008).
SNPs in coding and noncoding regions have been
implicated in both Mendelian and complex
disease, and the same is true for indels.
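The difference between a single-base change and an indel can be sketched by comparing a reference allele with an alternate allele. Representing variants as a pair of allele strings is an assumption borrowed from common variant file conventions, not something the slides specify, and the function name is hypothetical.

```python
def classify_variant(ref, alt):
    """Classify a variant from its reference and alternate allele strings."""
    if not ref or not alt:
        raise ValueError("both alleles must contain at least one base")
    if len(ref) == 1 and len(alt) == 1:
        # One base replaced by another: a single nucleotide change.
        return "SNV" if ref != alt else "no change"
    if len(alt) > len(ref):
        # Extra bases are present relative to the reference.
        return "insertion"
    if len(alt) < len(ref):
        # Bases present in the reference are absent.
        return "deletion"
    # Same length but longer than one base.
    return "multi-nucleotide substitution"


for ref, alt in [("A", "G"), ("A", "ATG"), ("ATG", "A"), ("AT", "GC")]:
    print(ref, alt, classify_variant(ref, alt))
```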
Allows tumor-normal comparisons to give a
comprehensive view on changes in a specific
tumor sample
– Can identify somatic variants which act as
driver mutations in cancer progression
The Illumina Genome Network (IGN) offers library prep,
sequencing and data analysis for cancer
CANCER
WGS continued….
Groups working towards categorizing and
characterizing mutations in cancer:
– The Cancer Genome Atlas (TCGA)
– The International Cancer Genome Consortium (ICGC)
– Catalogue of Somatic Mutations in Cancer (COSMIC)
CANCER
CANCER
Stephen Chanock (2012). Genome-wide Association Studies in Cancer:
A Step in the Right Direction
Methods of cancer DNA sequencing
– Cancer whole genome sequencing
– Targeted cancer sequencing
– Non-Invasive cancer biomarkers
– Cancer exome sequencing
CANCER
Breast Cancer:
BRCA1/2 gene
Hotspots of tandem duplication might drive breast
cancer
CANCER
ADVANTAGES OF WHOLE GENOME
SEQUENCING
Creating personalized treatment plans based not
only on the mutant genes causing a disease, but
also other genes in the patient’s genome.
Genotyping cancer cells and understanding what
genes are mis-regulated allows doctors to select
the best chemotherapy
– Patients exposed to fewer toxins
Previously unknown genes may be identified as
contributing to a disease state. Traditional
genetic testing looks only at the common
“troublemaker” genes.
Lifestyle or environmental changes that can
mediate the effects of genetic predisposition may
be identified and then moderated.
ADVANTAGES OF WHOLE GENOME
SEQUENCING
DISADVANTAGES OF WHOLE
GENOME SEQUENCING
Most of the “information” found in a human genome
sequence is unusable at present.
Most doctors are not trained on how to interpret
genomic data.
An individual’s genome may contain information
that they DON’T want to know.
Policies and security measures to maintain the
privacy and safety of the vast information are still
new.
GWAS EXAMPLES
Hakonarson H and Grant SF. 2011. GWAS and its
impact on elucidating the etiology of diabetes.
Diabetes Metab Res Rev. Jun 1.
Mick E. et al. 2010. Family-based genome-wide
association scan of
attention-deficit/hyperactivity disorder. J Am
Acad Child Adolesc Psychiatry, 49(9): 898-905.
Treutlein J & Rietschel M. 2011. Genome-wide
association studies of alcohol dependence and
substance use disorders. Curr Psychiatry Rep,
13(2): 147-55.
Comparative Genomics
COMPARATIVE GENOMICS
Def:
A field of biological science where DNA
sequences, genes and gene function of different
organisms are compared.
Allows insight into what has / hasn’t changed over
millions of years within the genomes
5% of human genome sequence is evolutionarily
conserved across mammals – Functional
1.5% encodes proteins
The remaining 3.5%?
Promoters, enhancers, silencers, gene regulatory
elements, chromosomal functional elements,
undiscovered functional elements.
COMPARATIVE GENOMICS
SEQUENCE SIMILARITY SEARCHES
Why?
• To identify and annotate sequences with:
− incomplete (or no) annotations
(GenBank)
− incorrect annotations
• To assemble genomes
• To explore evolutionary relationships by:
− finding homologous molecules
− developing phylogenetic trees
NB: Similar sequences and homologous
molecules may NOT have similar function.
HOMOLOGY AND SIMILARITY
Homology is an evolutionary term used to describe
relationship via descent from a common ancestor
Homologous things are not always similar (whale
flipper and the human arm)
Similarity can be expressed as a percentage and does
not imply any reasons for the observed sameness
Homology is NEVER expressed as a percentage
HOMOLOGY AND SIMILARITY
Sequences are homologous when they share a common
ancestry
The ancestry is reflected in strong sequence similarity
Threshold limits for sequence similarity can be defined
by :
– length of the stretch of similar sequence
– percentage of identity between the sequence
– statistical measurements, like E-value, P-value,
Bit-score, etc.
As a rule of thumb, sequences are considered homologous at
> 70% similarity for nucleotides and > 25% for proteins
Higher similarity scores suggest that proteins share the same
structure and a common ancestor.
These thresholds apply only to sequences > 100 aa or nt in length
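The thresholds above can be tried out on a pair of already aligned sequences. This sketch uses percent identity as a simple stand-in for similarity and skips gapped columns; both choices are assumptions made for the example, as are the function names.

```python
RULE_OF_THUMB = {"nucleotide": 70.0, "protein": 25.0}  # percentages from the slide
MIN_LENGTH = 100  # minimum sequence length (aa or nt) from the slide


def percent_identity(aligned_a, aligned_b):
    """Percentage of ungapped alignment columns where both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: columns containing a gap are not counted
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def meets_rule_of_thumb(aligned_a, aligned_b, molecule):
    """Apply the length and percentage thresholds for the given molecule type."""
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    identity = percent_identity(aligned_a, aligned_b)
    return shorter > MIN_LENGTH and identity > RULE_OF_THUMB[molecule]


print(percent_identity("ACGT-A", "ACCTTA"))  # 4 identical out of 5 compared columns
print(meets_rule_of_thumb("ACGT" * 30, "ACGT" * 30, "nucleotide"))
print(meets_rule_of_thumb("ACGT", "ACGT", "nucleotide"))  # too short to judge
```

The percentage describes similarity only; whether two sequences are homologous remains a yes-or-no inference drawn from it.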
HOMOLOGY AND SIMILARITY
HOMOLOGY AND SIMILARITY
Source: www.nature.com/nrg/journal/v5/n6/box/nrg1350_BX2.html
SEQUENCE ALIGNMENT
Pairwise Sequence Alignment:
Used to identify regions of similarity that may
indicate functional, structural and/or evolutionary
relationships between two sequences
Multiple Sequence Alignment:
Alignment of three or more biological sequences of
similar length from which homology can be inferred
and the evolutionary relationship between the
sequences studied
(ebi.ac.uk)
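To make pairwise alignment concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach). The slides do not name an algorithm, so this is one common choice and not necessarily what any particular tool uses; the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = global_alignment("ACGT", "ACT")
print(total)
print(top)
print(bottom)
```

A multiple sequence alignment extends the same idea to three or more sequences, usually with heuristics, since exact dynamic programming becomes too expensive.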
SEQUENCE ALIGNMENT
Colour key for alignment scores:
Red bar – most similar sequence
Pink – almost as similar
Green – even less similar
Blue/Black – worst scores
E-VALUE (EXPECTATION VALUES)
The number of times a database match this good is expected to
occur by chance. A match that is unlikely to occur by chance
is a good match
Determines how much you can trust your conclusion on
homology
Ranges from 0 upward; values near or above 1 mean the match could easily occur by chance
Best E-values are the lowest (as close to 0 as possible),
most significant
Trusted enough to infer homology
For certainty, must be below 10^-4 (0.0001)
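The link between bit score and E-value can be sketched with the standard relationship E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the database length. This is a simplified version: real search tools apply corrected "effective" lengths, and the lengths and scores below are hypothetical.

```python
import math

SIGNIFICANCE_CUTOFF = 1e-4  # the 10^-4 cutoff quoted above


def e_value(bit_score, query_length, database_length):
    """Expected number of chance matches scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(expect):
    """Probability of at least one chance match, given the E-value."""
    return -math.expm1(-expect)  # equals 1 - exp(-E), accurate for small E


# Hypothetical search: a 100-residue query against a 1,000,000-residue database.
for bits in (10, 30, 50):
    expect = e_value(bits, 100, 1_000_000)
    verdict = "significant" if expect < SIGNIFICANCE_CUTOFF else "not significant"
    print(f"bit score {bits}: E = {expect:.3g}, P = {p_value(expect):.3g} ({verdict})")
```

For very small E-values, E and P are almost equal, which is why the two are often quoted interchangeably for strong matches.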
ORTHOLOGS VS PARALOGS
Homologs can be separated into two classes:
orthologs and paralogs.
• Orthologs are homologous genes in different species,
separated by speciation, that usually perform the same function.
• Paralogs are homologous genes within a species,
created by gene duplication, that may perform different functions.
* Synteny: Partial or complete conservation of
gene order
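One simple way to put a number on conservation of gene order is to count how many neighbouring gene pairs in one genome are also neighbours in the other. This is only an illustrative measure, assuming the genes in both lists are already matched homologs; the gene names and function names are hypothetical.

```python
def conserved_adjacencies(order_a, order_b):
    """Adjacent gene pairs in order_a that are also adjacent in order_b.

    Orientation is ignored, so (g1, g2) matches (g2, g1).
    """
    adjacent_in_b = {frozenset(pair) for pair in zip(order_b, order_b[1:])}
    return [pair for pair in zip(order_a, order_a[1:]) if frozenset(pair) in adjacent_in_b]


def synteny_fraction(order_a, order_b):
    """Fraction of adjacent pairs in order_a that are conserved in order_b."""
    total_pairs = len(order_a) - 1
    if total_pairs <= 0:
        return 0.0
    return len(conserved_adjacencies(order_a, order_b)) / total_pairs


# Hypothetical gene orders along a chromosome segment in two species.
species_a = ["g1", "g2", "g3", "g4", "g5"]
species_b = ["g1", "g2", "g3", "g5", "g4"]
print(conserved_adjacencies(species_a, species_b))
print(synteny_fraction(species_a, species_b))
```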
APPLICATION OF COMPARATIVE
GENOMICS
Locating unmapped genes and essential sequences
such as promoters and regulatory elements.
Eg. The puffer fish genome (400M bp) and human
genome (3B bp) have almost the same number of
genes and have shown similarities in their gene
order over short distances
Rice genome (400M bp) is about 40 times smaller than the
wheat genome (about 16B bp)
Studying human disease genes
Eg. The mode of action of human disease-causing genes can be
inferred from their homologs in the Drosophila genome,
whose phenotypic effects have been studied.
Several human disease genes have homologs in yeast
Eg. Yeast gene SGS1 is homologous to a human gene
involved in the premature ageing disease – Werner’s
syndrome.
In many cases the biochemical activity is known in
the yeast homologs
APPLICATION OF COMPARATIVE
GENOMICS
BIOINFORMATICS
The acquisition, storage, access, analysis,
modeling, and distribution of the many types of
information embedded in DNA sequences.
The rapid proliferation of genome sequences has
been the major factor in the creation of the field
of bioinformatics.
DNA BARCODING
A short gene sequence taken
from a standardized portion of
the genome (the same gene region
in every specimen) for species
identification
Creating an inventory of life
BARCODING GENES
CO1 (Cytochrome c Oxidase subunit 1) – Animals
Rubisco (Ribulose-1,5-bisphosphate Carboxylase /
Oxygenase), MatK – Plants
18S for protozoans (Malaria)
Barcoding genes are:
– Ideal
– Present in all species
– Have minimal variations
– Standardized among scientists
DNA BARCODING
Source: councilforresponsiblegenetics.org
Evolved differences in the nucleotide sequence of a 648 bp
region of CO1 are used to identify animal species
Applications:
– Identify life stages in organisms
– Differentiate between look-alike species
– Food, customs and invasive species control
– Disease vector control
– Agriculture, forestry, conservation, education
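Identification by barcode comes down to comparing a query sequence with reference barcodes and picking the closest one. The sketch below uses the proportion of differing sites (p-distance) on equal-length, pre-aligned sequences; the short sequences and species labels are hypothetical, and real barcodes would be several hundred bp long.

```python
def p_distance(seq_a, seq_b):
    """Proportion of compared sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def closest_reference(query, references):
    """Return (name, distance) of the reference barcode nearest to the query."""
    distances = ((name, p_distance(query, seq)) for name, seq in references.items())
    return min(distances, key=lambda item: item[1])


# Hypothetical 10 bp fragments standing in for full-length barcodes.
references = {
    "species_a": "ACGTACGTAC",
    "species_b": "ACGTTCGTTC",
}
print(closest_reference("ACGTACGTAA", references))
```

A nearest match is only as trustworthy as the reference library, so a large distance to every reference may indicate a species that is not yet in the database.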
DNA BARCODING
IMPORTANCE
Biodiversity crisis
Faster than traditional taxonomy
Discovery of unknowns
BOLD – Barcode of Life Data System for species
identification
By the University of Guelph
Public workbench for barcoding projects
Researchers assemble, test, and analyze their data
records in BOLD before uploading to GenBank and
EMBL
DNA BARCODING
DNA barcode standards:
Creation of a reserved key word (BARCODE)
Required data elements should be reliable,
retrievable and verifiable
Barcode sequence, specimen and species name
applied by submitter
Info on taxonomy, GIS data and country code
DNA BARCODING
Barcode quality control
• Include at least 500 contiguous unambiguous bp
• <1% ambiguity (bases whose identity is not clearly determined)
• Include name of gene region
• Submit trace file including both primers
• Name, collection date and location
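The two numeric criteria (at least 500 contiguous unambiguous bp and less than 1% ambiguity) can be checked automatically. This is a minimal sketch, assuming any character other than A, C, G or T counts as ambiguous; the function names and test sequences are hypothetical, and the remaining criteria (gene region name, trace files, collection details) still need to be checked against the submission record.

```python
import re


def longest_unambiguous_run(sequence):
    """Length of the longest stretch made up only of A, C, G and T."""
    runs = re.findall(r"[ACGT]+", sequence.upper())
    return max((len(run) for run in runs), default=0)


def ambiguous_fraction(sequence):
    """Fraction of bases that are not A, C, G or T (for example N)."""
    if not sequence:
        return 1.0
    ambiguous = sum(1 for base in sequence.upper() if base not in "ACGT")
    return ambiguous / len(sequence)


def passes_barcode_qc(sequence, min_run=500, max_ambiguous=0.01):
    """Apply the length and ambiguity thresholds from the slide."""
    return (
        longest_unambiguous_run(sequence) >= min_run
        and ambiguous_fraction(sequence) < max_ambiguous
    )


clean = "ACGT" * 150                            # 600 unambiguous bp
short = "ACGT" * 100                            # only 400 bp
noisy = "ACGT" * 125 + "N" * 10 + "ACGT" * 25   # 10 ambiguous bases in 610
for label, seq in [("clean", clean), ("short", short), ("noisy", noisy)]:
    print(label, passes_barcode_qc(seq))
```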
DNA BARCODING
Inter- and intra-specific divergence can
lead to non-identification
Not used as sole criterion for identification of
new species
www.boldsystems.org
www.dnasubway.iplantcollaborative.org
DNA BARCODING
NEXT…..
Genome mapping and sequence analysis
DNA BARCODING