Genome wide association study with Whole Genome Sequencing

Esther481825 19 views 48 slides Oct 02, 2024

About This Presentation

GWAS
NGS
comparative genomics


Slide Content

LEC 3 AND 4
- GWAS using Whole
Genome Sequencing &
-Comparative Genomics

Scanning markers across the complete sets of DNA or
genomes of many people to find genetic variations
associated with a particular disease.
Determines alleles that correlate to different diseases
and traits.

GWAS focuses on SNPs that differ between individuals.
This reduces the number of nucleotides that must be analysed
Genome wide association studies
(GWAS)

GWAS
Participants are divided
into two groups:
People with disease /
Trait of interest
People without disease /
Trait of interest

GWAS
Identified SNPs are only ‘associated’ with the target
disease / trait
GWAS results are expressed as an odds ratio and do not give a
definitive answer. WHY?
– Traits and diseases are also affected by the
environment and lifestyle decisions, not strictly genetics.
– The risk is that of an individual developing the disease over an
entire lifetime, and it may change during that
lifetime.
– Most diseases and traits are controlled by many genes,
so the impact of one particular gene may be minimal
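
As a rough illustration of the odds ratio mentioned above, the sketch below computes it from a 2x2 case/control table. The function name and all counts are hypothetical and chosen only for the example; real GWAS pipelines also apply statistical tests and multiple-testing correction, which are not shown here.

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    case_odds = case_with / case_without            # odds within the disease group
    control_odds = control_with / control_without   # odds within the healthy group
    return case_odds / control_odds


# Hypothetical counts of allele carriers / non-carriers in each group.
or_value = odds_ratio(case_with=300, case_without=700,
                      control_with=200, control_without=800)
print(round(or_value, 3))
```

An odds ratio above 1 suggests the allele is more common in people with the disease; a value near 1 suggests no association.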

GWAS
Diseases and traits which have been investigated
by GWAS include:
– Diabetes
– Pigmentation
– Epilepsy
– Alzheimer’s disease
– Autism
– Crohn’s disease
– Cancer
– Bipolar disorder
– Asthma
– High cholesterol
– Height and BMI

Chronic disorder affecting
the digestive system
Causes inflammation of the
digestive tract lining
Can lead to abdominal pain,
severe diarrhea, fatigue,
weight loss and
malnutrition.
Crohn’s disease

Occurs in 1/1000 people
Mainly appears in late teens / early twenties
Occurs more often in white people and in people of Eastern and
Central European Jewish descent
Genetic and environmental factors play a role
Genetic variations in certain regions of chromosomes 5
and 10
Variations in specific genes (ATG16L1, IL23R, IRGM, and
NOD2) influence the risk of developing the disease
CROHN’S DISEASE

71 confirmed Crohn's disease susceptibility loci
ATG16L1 gene is located on chromosome 2 from bp
233,251,571 – bp 233,295,674.
CROHN’S DISEASE

WGS informs analysis of
oncogenes, tumor suppressors
and other risk factors
Provides base by base view of
unique mutations present in
cancer tissue
Enables discovery of novel
cancer-associated variants –
SNVs, CNVs and structural
variants (indels)
CANCER

CNVS- COPY NUMBER VARIANTS
Copy number variation is a phenomenon in
which sections of the genome are repeated and
the number of repeats in the genome varies
between individuals.
Copy number variation is a type of structural
variation: specifically, it is a type of duplication
or deletion event that affects a considerable
number of base pairs
One example is the CYP2D6 gene, which codes
for a cytochrome P450 enzyme in humans. This
enzyme is important in breaking down substances
not produced by the body, such as drugs (drug metabolism).

These structural differences may have come
about through duplications, deletions or other
changes and can affect long stretches of DNA.
Such regions may or may not contain a gene(s).
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4472309/pdf/nihms-696610.pdf
Copy Number Variation in Human Health,
Disease, and Evolution Zhang et al., 2009

SNVS- SINGLE NUCLEOTIDE VARIANTS
A variation of a single nucleotide in a population's
genome. Like an SNV, a single nucleotide
polymorphism (SNP) is also a single base
substitution, but it is limited to germline DNA
and must be present in at least 1% of the
population.
A SNV can be rare in one population but common
in a different population. Sometimes SNVs are
known as single nucleotide polymorphisms
(SNPs), although SNV and SNPs are not
interchangeable. To qualify as a SNP, the variant
must be present in at least 1% of the population.
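
The 1% rule above can be sketched as a small frequency check. This is a minimal illustration only: the function names and allele counts are hypothetical, and the check assumes the variant has already been confirmed as germline.

```python
def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the less common allele at a biallelic site."""
    freq = alt_count / total_alleles
    return min(freq, 1 - freq)


def is_snp(alt_count, total_alleles, threshold=0.01):
    """True when the variant reaches the 1% population frequency cut-off."""
    return minor_allele_frequency(alt_count, total_alleles) >= threshold


# Hypothetical example: the same variant counted in two populations
# of 1000 people each (2000 alleles per population).
print(is_snp(alt_count=30, total_alleles=2000))  # 1.5% in population A
print(is_snp(alt_count=5, total_alleles=2000))   # 0.25% in population B
```

The two calls show how one variant can qualify as a SNP in one population and remain a rare SNV in another.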

INDELS-
An insertion/deletion polymorphism, commonly
abbreviated “indel,” is a type of genetic variation
in which a specific nucleotide sequence is present
(insertion) or absent (deletion). While not as
common as SNPs, indels are widely spread across
the genome.
An SNP changes a single nucleotide in the DNA
sequence, whereas an indel incorporates or
removes one or more nucleotides (Loewe, 2008).
SNPs in coding and noncoding regions have been
implicated in both Mendelian and complex
disease, and the same is true for indels.
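
The distinction between a single-base change and an indel can be sketched by comparing the reference and alternate alleles of a variant. The function below is a hypothetical helper, assuming alleles are written VCF-style as plain strings of bases; it is not taken from any particular tool.

```python
def classify_variant(ref, alt):
    """Label a variant from its reference and alternate allele strings."""
    if len(ref) == 1 and len(alt) == 1:
        # One base replaced by another base.
        return "SNV" if ref != alt else "no change"
    if len(alt) > len(ref):
        return "insertion"   # nucleotides added relative to the reference
    if len(alt) < len(ref):
        return "deletion"    # nucleotides removed relative to the reference
    return "multi-nucleotide substitution"


print(classify_variant("A", "G"))
print(classify_variant("A", "ATG"))
print(classify_variant("ATG", "A"))
```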

Allows tumor-normal comparisons to give a
comprehensive view on changes in a specific
tumor sample
– Can identify somatic variants which act as
driver mutations in cancer progression
Illumina sequencing (IGN) offers library prep,
sequencing and data analysis for cancer
CANCER
WGS continued….

Groups working towards categorizing and
characterizing mutations in cancer:
– The cancer genome atlas (TCGA)
– The international cancer genome consortium (ICGC)
– Catalogue of somatic mutations in cancer (COSMIC)
CANCER

CANCER
Stephen Chanock (2012). Genome-wide Association Studies in Cancer:
A Step in the Right Direction

Methods of cancer DNA sequencing
– Cancer whole genome sequencing
– Targeted cancer sequencing
– Non-Invasive cancer biomarkers
– Cancer exome sequencing
CANCER

Breast Cancer:
BRCA1/2 gene
Hotspots of tandem duplication might drive breast
cancer
CANCER

ADVANTAGES OF WHOLE GENOME
SEQUENCING
Creating personalized treatment plans based not
only on the mutant genes causing a disease, but
also other genes in the patient’s genome.
Genotyping cancer cells and understanding what
genes are mis-regulated allows doctors to select
the best chemotherapy
– Patients are exposed to fewer toxins

Previously unknown genes may be identified as
contributing to a disease state. Traditional
genetic testing looks only at the common
“troublemaker” genes.
Lifestyle or environmental changes that can
mediate the effects of genetic predisposition may
be identified and then moderated.
ADVANTAGES OF WHOLE GENOME
SEQUENCING

DISADVANTAGES OF WHOLE
GENOME SEQUENCING
Most of the information found in a human genome
sequence is unusable at present.
Most doctors are not trained on how to interpret
genomic data.
An individual’s genome may contain information
that they DON’T want to know.
Policies and security measures to maintain the
privacy and safety of the vast information are still
new.

GWAS EXAMPLES
Hakonarson H and Grant SF. 2011. GWAS and its
impact on elucidating the etiology of diabetes.
Diabetes Metab Res Rev. Jun 1.
Mick E. et al. 2010. Family-based genome-wide
association scan of
attention-deficit/hyperactivity disorder. J Am
Acad Child Adolesc Psychiatry, 49(9): 898-905.
Treutlein J & Rietschel M. 2011. Genome-wide
association studies of alcohol dependence and
substance use disorders. Curr Psychiatry Rep,
13(2): 147-55.

Comparative Genomics

COMPARATIVE GENOMICS
Def:
A field of biological science where DNA
sequences, genes and gene function of different
organisms are compared.
Allows insight into what has / hasn’t changed over
millions of years within the genomes

5% of human genome sequence is evolutionarily
conserved across mammals – Functional
1.5% encodes for proteins
3.5%???
Promoters, enhancers, silencers, gene regulatory
elements, chromosomal functional elements,
undiscovered functional elements.
COMPARATIVE GENOMICS

SEQUENCE SIMILARITY SEARCHES
Why?
• To identify and annotate sequences with:
− incomplete (or no) annotations
(GenBank)
− incorrect annotations
• To assemble genomes
• To explore evolutionary relationships by:
− finding homologous molecules
− developing phylogenetic trees
NB: Similar sequences and homologous
molecules may NOT have similar function.

HOMOLOGY AND SIMILARITY
Homology is an evolutionary term used to describe
relationship via descent from a common ancestor
Homologous things are not always similar (whale
flipper and the human arm)
Similarity can be expressed as a percentage and does
not imply any reasons for the observed sameness
Homology is NEVER expressed as a percentage

HOMOLOGY AND SIMILARITY
Sequences are homologous when they share a common
ancestry
The ancestry is reflected in strong sequence similarity
Threshold limits for sequence similarity can be defined
by :
– length of the stretch of similar sequence
– percentage of identity between the sequences
– statistical measurements, like E-value, P-value,
Bit-score, etc.
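
As a minimal sketch of the "percentage of identity" measure listed above, the function below compares two sequences that have already been aligned (gaps written as "-"). Counting identity only over columns without gaps is an assumption made here; other tools use different denominators. The sequences are made up for the example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns where both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGTACGT", "ACGTTCGT"))
```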

For sequences to be considered homologous, a common rule of thumb is > 70%
nucleotide similarity, or > 25% for proteins
Higher similarity scores suggest that the proteins have the same
structure and a common ancestor.
This can only be applied to sequences > 100 aa or nt in length
HOMOLOGY AND SIMILARITY

HOMOLOGY AND SIMILARITY
Source: www.nature.com/nrg/journal/v5/n6/box/nrg1350_BX2.html

SEQUENCE ALIGNMENT
Pairwise Sequence Alignment:
Used to identify regions of similarity that may
indicate functional, structural and/or evolutionary
relationships between two sequences
Multiple Sequence Alignment:
Alignment of three or more biological sequences of
similar length from which homology can be inferred
and the evolutionary relationship between the
sequences studied
(ebi.ac.uk)
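
To make pairwise alignment concrete, here is a small sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch approach). The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions for illustration, and the function returns only the best score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b."""
    # Row 0: aligning an empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "ACT"))
```

Only two rows of the scoring table are kept in memory at a time, which is enough when the score alone is needed.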

SEQUENCE ALIGNMENT
Colour key for alignment scores:
Red bar – most similar sequence
Pink – almost as similar
Green – even less similar
Blue/Black – worst scores

E-VALUE (EXPECTATION VALUES)
Number of times a database match as good as yours is expected to
occur by chance. A match unlikely to occur by chance
is a good match
Determines how much you can trust your conclusion on
homology
Ranges from 0 upwards (it is an expected count, not a probability, so values above 1 are possible)
Best E-values are the lowest (as close to 0 as possible),
most significant
Trusted enough to infer homology
For certainty, must be below 10^-4 (0.0001)
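
The E-value can be related to the bit score of a hit with the standard formula E = m * n * 2^(-S'), where m is the query length, n is the database length and S' is the bit score. The sketch below applies that formula; the query length, database size and bit scores are hypothetical values chosen for illustration, and the helper names are not from any BLAST library.

```python
def expected_hits(query_length, database_length, bit_score):
    """E = m * n * 2^(-S'): expected number of chance hits with this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def trusted_for_homology(e_value, threshold=1e-4):
    """Apply the 10^-4 cut-off quoted on the slide."""
    return e_value < threshold


# Hypothetical search: a 300-residue query against a database of 10^9 residues.
weak = expected_hits(300, 10**9, bit_score=30)
strong = expected_hits(300, 10**9, bit_score=60)
print(weak, trusted_for_homology(weak))
print(strong, trusted_for_homology(strong))
```

Because E is an expected count, a low-scoring hit in a large database can have an E-value well above 1, while a high-scoring hit falls far below the cut-off.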

ORTHOLOGS VS PARALOGS
Homologs can be separated into two classes:
orthologs and paralogs.
• Orthologs are homologous genes in different species that
arose by speciation and typically perform the same function.
• Paralogs are homologous genes within a species that
arose by gene duplication and may perform different functions.
* Synteny: Partial or complete conservation of
gene order

APPLICATION OF COMPARATIVE
GENOMICS
Locating unmapped genes and essential sequences
such as promoters and regulatory elements.
Eg. The puffer fish genome (400M bp) and human
genome (3B bp) have almost the same number of
genes and have shown similarities in their gene
order over short distances
Rice genome (400M bp) is about 40 times smaller than the
bread wheat genome (about 16B bp)

Studying human disease genes
Eg. The mode of action of human disease-causing genes
can be studied through their homologs in the Drosophila genome,
where the phenotypic effects of those genes have been studied.
Several human disease genes have homologs in yeast

Eg. Yeast gene SGS1 is homologous to a human gene
involved in the premature ageing disease – Werner’s
syndrome.
In many cases the biochemical activity is known in
the yeast homologs
APPLICATION OF COMPARATIVE
GENOMICS

BIOINFORMATICS
The acquisition, storage, access, analysis,
modeling, and distribution of the many types of
information embedded in DNA sequences.
The rapid proliferation of genome sequences has
been the major factor in the creation of the field
of bioinformatics.

DNA BARCODING
A short gene sequence taken
from a standardized portion of
the same gene region, used for species
identification
Creating an inventory of life

BARCODING GENES
CO1 (Cytochrome Oxidase subunit 1) – Animals
Rubisco (Ribulose-1,5-bisphosphate Carboxylase /
Oxygenase), MatK – Plants
18S for protozoans (Malaria)
Barcoding genes are:
– Ideal
– Present in all species
– Have minimal variations
– Standardized among scientists

DNA BARCODING
Source: councilforresponsiblegenetics.org
Evolved differences in nucleotide sequence of 648bp
CO1 used to identify animal species
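
The idea of identifying a species from nucleotide differences in a barcode can be sketched as a nearest-reference lookup. This is a toy illustration only: the 10-base sequences and species names are invented (real CO1 barcodes are about 648 bp), the sequences are assumed to be pre-aligned and of equal length, and real identification uses curated reference libraries such as BOLD.

```python
def p_distance(a, b):
    """Proportion of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def closest_species(query, references):
    """Name of the reference barcode with the smallest p-distance to the query."""
    return min(references, key=lambda name: p_distance(query, references[name]))


# Invented reference barcodes, far shorter than a real CO1 barcode.
references = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTTCGAAC",
}
print(closest_species("ACGTACGTAA", references))
```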

Applications:
– Identify life stages in organisms
– Differentiate between look-alike species
– Food, customs and invasive species control
– Disease vector control
– Agriculture, forestry, conservation, education
DNA BARCODING

IMPORTANCE
Biodiversity crisis
Faster than traditional taxonomy
Discovery of unknowns

BOLD – Barcode of life database for species
identification
By the University of Guelph
Public workbench for barcoding projects
 Researchers assemble, test, and analyze their data
records in BOLD before uploading to GenBank,
EMBL
DNA BARCODING

DNA barcode standards:
Creation of a reserved key word (BARCODE)
Required data elements should be reliable,
retrievable and verifiable
Barcode sequence, specimen and species name
applied by submitter
Info on taxonomy, GIS data and country code
DNA BARCODING

Barcode quality control
•Include at least 500 contiguous unambiguous bp
•<1% ambiguity (base calls that are not clear or decided)
•Include name of gene region
•Submit trace file including both primers
•Name, collection date and location
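
The two sequence-level checks in the list above (at least 500 contiguous unambiguous bp, and under 1% ambiguity) could be automated along these lines. This is a sketch under assumptions: "ambiguity" is taken here to mean the fraction of bases other than A, C, G or T over the whole sequence, and the function name and test sequences are hypothetical.

```python
import re


def passes_barcode_qc(sequence, min_run=500, max_ambiguous_fraction=0.01):
    """Check run length and ambiguity thresholds for a barcode sequence."""
    seq = sequence.upper()
    if not seq:
        return False
    # Longest stretch made up only of unambiguous bases.
    runs = re.findall(r"[ACGT]+", seq)
    longest_run = max((len(run) for run in runs), default=0)
    # Anything that is not A, C, G or T (for example N) counts as ambiguous.
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    return (longest_run >= min_run
            and ambiguous / len(seq) < max_ambiguous_fraction)


print(passes_barcode_qc("ACGT" * 150))              # 600 clean bases
print(passes_barcode_qc("ACGT" * 130 + "N" * 10))   # too many ambiguous bases
```

The remaining requirements (gene region name, trace files, collection metadata) are record-level fields and are not covered by this check.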
DNA BARCODING

Inter- and intra-specific divergence can
lead to non-identification
Not used as sole criterion for identification of
new species
www.boldsystems.org
www.dnasubway.iplantcollaborative.org
DNA BARCODING

NEXT…..
Genome mapping and sequence analysis
DNA BARCODING