Association mapping

18,175 views 44 slides Aug 15, 2012

About This Presentation

Association mapping, also known as "linkage disequilibrium mapping", is a method of mapping quantitative trait loci (QTLs) that takes advantage of linkage disequilibrium to link phenotypes to genotypes. Various strategies involved in association mapping are discussed in this presentation.


Slide Content

Association Mapping for improvement of agronomic traits in Rice Hifzur Rahman

Methods in Crop Improvement: To meet the food needs of the human population, plant breeders select for agronomically important traits such as yield. Determining the genetic basis of economically important complex traits is a major goal. Linkage mapping has been a key tool for identifying the genetic basis of quantitative traits in plants. Identification of QTLs or genes associated with a particular trait has accelerated the pace of crop improvement, either by introgressing the identified QTLs/genes into a desired genotype by MAB or by transgenic technology.

QTL approach: Uses standard bi-parental mapping populations (F2 or RILs). These have a limited number of recombination events, resulting in low map resolution, i.e. the QTL covers many cM. Additional steps are required to narrow the QTL or clone the gene. It is difficult to discover markers closely linked to the causative gene.

Association mapping (AM): Association mapping, also known as "linkage disequilibrium mapping", is a method of mapping quantitative trait loci (QTLs) that takes advantage of linkage disequilibrium to link phenotypes to genotypes. It uses diverse lines from natural populations or germplasm collections and discovers markers associated with (i.e. linked to) the gene controlling the trait.

Association mapping (AM): How does it work? Association studies are based on the assumption that a marker locus is ‘sufficiently close’ to a trait locus so that some marker allele would be ‘travelling’ along with the trait allele through many generations during recombination (Murillo and Greenberg, 2008). Major goal: to identify inter-individual genetic variants, mostly single nucleotide polymorphisms (SNPs), which show the strongest association with the phenotype of interest, either because they are causal or, more likely, because they are statistically correlated, i.e. in linkage disequilibrium (LD), with one or more unobserved causal variants.

Advantages of AM over linkage mapping: much higher mapping resolution; greater allele number and broader reference population; possibility of exploiting historically measured trait data; less research time in establishing an association (Flint-Garcia et al., 2003; Yu and Buckler, 2006).

Association analysis: Two approaches: (1) on the basis of the distance between two loci; (2) by analyzing linkage disequilibrium between the marker and the target gene in a natural population. LD refers to the nonrandom association of alleles at different loci. LD can also occur between more distant sites or between sites located on different chromosomes.

LD Quantification: LD is the difference between the observed gametic frequencies of haplotypes and the gametic haplotype frequencies expected under linkage equilibrium: $D = P_{AB} - P_A P_B = P_{AB} P_{ab} - P_{Ab} P_{aB}$. D′, the normalized form of D, is informative for comparisons of different allele frequencies across loci but is strongly inflated by small sample sizes and low allele frequencies. It should therefore be verified with r² (range 0 to 1) before being used to quantify the extent of LD when allele frequencies are low.
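The LD measures on this slide can be computed directly from haplotype frequencies. The following is a minimal sketch, assuming two biallelic loci and made-up haplotype frequencies; the function name `ld_stats` and the example values are illustrative and do not come from the presentation. D′ is taken as |D| divided by its maximum possible value for the given allele frequencies, and r² as D² divided by the product of the four allele frequencies.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of haplotype AB
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    d = p_ab - p_a * p_b
    # D' scales |D| by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical haplotype frequencies (must sum to 1).
freqs = {"AB": 0.5, "Ab": 0.1, "aB": 0.1, "ab": 0.3}
p_a = freqs["AB"] + freqs["Ab"]
p_b = freqs["AB"] + freqs["aB"]

d, d_prime, r2 = ld_stats(freqs["AB"], p_a, p_b)
# The two forms of D given on the slide should agree.
d_alt = freqs["AB"] * freqs["ab"] - freqs["Ab"] * freqs["aB"]
print(round(d, 4), round(d_alt, 4), round(d_prime, 4), round(r2, 4))
```

In practice, haplotype frequencies usually have to be estimated from genotype data first, which is what tools such as TASSEL or PowerMarker handle.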

Calculation and visualization of LD: LD can be calculated using available haplotyping algorithms, such as maximum likelihood estimation (MLE). Pairwise LD can be depicted as a color-coded triangle plot based on significant pairwise LD levels (r² and D). Computer software: “Graphical Overview of Linkage Disequilibrium” (GOLD), “Trait Analysis by aSSociation, Evolution and Linkage” (TASSEL), PowerMarker.

Factors affecting LD: LD increases due to mating system (self-pollination), genetic isolation, population structure, relatedness (kinship), small founder population size or genetic drift, admixture, selection (natural, artificial, and balancing), epistasis, and genomic rearrangements, while factors like outcrossing, high recombination rate, high mutation rate, and gene conversion lead to a decrease/disruption in LD.

LD Decay: LD tends to decay with genetic distance between the loci under consideration. Loci eventually attain linkage equilibrium (LE), i.e. alleles are no longer preferentially paired. For unlinked loci, LD decays by one-half with each generation of random mating. Thus, LD declines as the number of generations increases, so that in old populations LD is limited to small distances (Raveendran et al., 2008).
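The decay described here follows the standard population-genetics relation D_t = D_0 (1 − c)^t, where c is the recombination fraction between the two loci and t is the number of generations of random mating. A minimal sketch, assuming that relation; the function names and starting values are illustrative only. With c = 0.5 (unlinked loci) LD halves each generation, while tightly linked loci keep their LD far longer.

```python
import math


def ld_after(d0, c, generations):
    """LD remaining after a number of generations of random mating."""
    return d0 * (1 - c) ** generations


def generations_to_fraction(c, fraction):
    """Generations needed for LD to fall to `fraction` of its starting value."""
    return math.log(fraction) / math.log(1 - c)


d0 = 0.2  # hypothetical starting LD
for c in (0.5, 0.1, 0.01):
    halving_time = generations_to_fraction(c, 0.5)
    print(f"c={c}: D after 10 generations = {ld_after(d0, c, 10):.5f}, "
          f"half-life = {halving_time:.1f} generations")
```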

Types of association mapping: Genome-wide association mapping searches the whole genome for causal genetic variation. A large number of markers are tested for association with various complex traits, and no prior information on candidate genes is required. Candidate-gene association mapping dissects the genetic control of complex traits based on available results from genetic, biochemical, or physiological studies in model and non-model plant species (Mackay, 2001). It requires identification of SNPs between lines within specific genes.

Zhu et al., 2008

Steps in Association Mapping (Abdurakhmonov & Abdukarimov, 2008)

Power to detect associations depends on: sample size and experimental design; accurate phenotypic evaluations; genotyping; genetic architecture.

Phenotyping and Germplasm selection. Phenotyping: Replications across multiple years in randomized plots and across multiple locations and environments. Consider the influence of flowering time on other correlated traits, photoperiod sensitivity, lodging, and susceptibility to prevalent pathogens, because these traits affect the measurement of other morphological or agronomic traits under field conditions (Raveendran et al., 2008). Field design: incomplete block design (lattice) (Eskridge, 2003). Germplasm selection: should be done on the basis of diversity (phenotypic and genotypic) and population structure.

Germplasm selection and Population structure: Germplasm may be randomly or non-randomly mated. Randomly mated populations represent a rather narrow group of germplasm, which is likely to lower resolution and to harbor only a narrow range of alleles. If nonrandomly mated germplasm is used, population structure needs to be controlled in the statistical analyses (Yu et al., 2006).

A set of unlinked, selectively neutral background markers is used to achieve genome-wide coverage and broadly characterize the genetic composition of individuals. Cluster analysis and bootstrapping are done. On the basis of the cluster analysis, the most diverse individuals are selected from each cluster to represent that cluster. This helps prevent spurious associations when population structure and relatedness exist.
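A toy illustration of why population structure can create spurious associations: when a marker allele and a trait value both differ between subpopulations, pooling the samples suggests a marker effect that disappears within each subpopulation. The data, group labels, and the helper `allele_effect` below are entirely made up for illustration; real analyses use model-based corrections such as those cited in the slides.

```python
from statistics import mean

# Hypothetical records: (subpopulation, marker allele, trait value).
# Within each subpopulation the marker has no effect on the trait,
# but allele 1 is common in the high-trait subpopulation "B".
records = [
    ("A", 0, 10), ("A", 0, 11), ("A", 0, 9), ("A", 1, 10),
    ("B", 1, 20), ("B", 1, 21), ("B", 1, 19), ("B", 0, 20),
]


def allele_effect(rows):
    """Difference in mean trait value between allele 1 and allele 0 carriers."""
    with_1 = [trait for _, allele, trait in rows if allele == 1]
    with_0 = [trait for _, allele, trait in rows if allele == 0]
    return mean(with_1) - mean(with_0)


pooled = allele_effect(records)
within = {
    group: allele_effect([r for r in records if r[0] == group])
    for group in ("A", "B")
}
print("pooled effect:", pooled)         # apparent effect caused by structure
print("within-group effects:", within)  # no effect once groups are separated
```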

Rafalski et al 2010

Estimation of population structure: Low-dimensional projection: PCA-based methods (Patterson et al., 2006). Clustering: distance-based (Bowcock et al., 1994); model-based: STRUCTURE (Pritchard et al., 2000), mStruct (Shringarpure & Xing, 2008).

Evaluation of linkage disequilibrium and associating genotype and phenotype: The structure of linkage disequilibrium (LD) at a specific locus reveals the association resolution possible at that locus. TASSEL (http://www.maizegenetics.net) is used to measure the extent of LD as squared allele frequency correlation estimates (R², Weir, 1996) and to measure the significance of R². E.g., if LD decays within 1000 bp, then 1 or 2 markers per 1000 bp will be needed to identify associations. Besides TASSEL, other software such as DnaSP and Arlequin is used to calculate D′ and R².
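The marker-density rule of thumb above can be turned into rough arithmetic: divide the genome into blocks the size of the LD decay distance and place one or two markers in each. A minimal sketch; the function `markers_needed` and the 400 Mb genome size are assumptions chosen for illustration, not values from the presentation, and real LD varies widely along the genome.

```python
import math


def markers_needed(genome_bp, ld_extent_bp, markers_per_block=1):
    """Rough marker count for genome-wide coverage at a given LD extent."""
    blocks = math.ceil(genome_bp / ld_extent_bp)
    return blocks * markers_per_block


GENOME_BP = 400_000_000  # hypothetical genome size, for illustration only

for ld_extent in (1_000, 100_000):
    low = markers_needed(GENOME_BP, ld_extent, 1)
    high = markers_needed(GENOME_BP, ld_extent, 2)
    print(f"LD extent {ld_extent} bp: about {low:,} to {high:,} markers")
```

The comparison shows why slow LD decay permits sparse marker panels at the cost of resolution, while fast decay demands dense genotyping.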

Software used in AM
Software | Focus | Description
Haploview 4.2 | Haplotype analysis and LD | LD and haplotype block analysis, haplotype population frequency estimation, single SNP and haplotype association tests, permutation testing for association significance
SVS 7 | Stratification, LD and AM | Estimates stratification, LD, haplotype blocks and multiple AM approaches for up to 1.8 million SNPs and 10,000 samples
TASSEL | Stratification, LD and AM | SSR markers, GLM and MLM methods
GenStat | Stratification, LD and AM | SSR markers, GLM and MLM-PCA methods
JMP Genomics | Stratification, LD and structured AM | SNPs, CG and GWAS, analysis of common and rare variants
GenAMap | Stratification, LD and structured AM | SNPs, tree of functional branches, multiple visualization tools
PLINK | Stratification, LD and structured AM | SNPs, multiple AM approaches, IBD and IBS analyses
STRUCTURE | Population structure | Computes an MCMC Bayesian analysis to estimate the proportion of the genome of an individual originating from the different inferred populations
SPAGeDi | Relative kinship | Genetic relationship analysis
BAPS 5.0 | Population structure | Computes a Bayesian analysis to estimate the proportion of the genome of an individual and assigns individuals to genetic clusters by considering them either as immigrants or as descendants of immigrants
mStruct | Population structure | Detection of population structure in the presence of admixing and mutations from multi-locus genotype data. It is an admixture model which incorporates a mutation process on the observed genetic markers
LDheatmap | LD | LD estimation (r²) displayed as heatmap plots using SNPs
Arlequin 3.5 | Genetic analysis and LD | Hierarchical analysis of genetic structure (AMOVA), LD for D′ and r². Version 3.5 incorporates an R function to parse XML output files and produce publication-quality graphics

Examples of association mapping studies: Much of the association mapping in crop plants is just emerging from the research phase and is beginning to be applied, especially in commercial breeding settings. The first candidate-gene association mapping study in plants (maize) identified DNA sequence polymorphisms within the D8 locus associated with flowering time (Thornsberry et al., 2001). Using the same population, the candidate gene su1 was associated with sweet taste (Whitt et al., 2002), and bt2, sh1 and sh2 with kernel composition and ae1 and sh2 with starch pasting properties (Wilson et al., 2004).

Association mapping studies in plant species.

Association mapping studies in Rice
Population | Sample size | BG markers | Trait | Reference
Diverse landraces | 577 | 577 | Starch quality | Bao et al. (2006)
Diverse accessions | 103 | 123 SSRs | Yield and its components | Agrama et al. (2007)
Landraces | - | SSRs | Heading date, plant height and panicle length | Wen et al. (2009)
Landraces | - | SNPs | Multiple agronomic traits | Huang et al. (2010)
Diverse accessions | 203 | 154 SSRs, 1 indel | Harvest index | Li et al. (2012)
Diverse accessions | 210 | 86 SSRs | Yield and grain quality | Borba et al. (2010)
Diverse rice accessions | 383 | 44,000 SNPs | Aluminum tolerance | Famoso et al. (2011)
Mini core collection | 90 | 108 SSR + indel | Stigma and spikelet characteristics | Yan et al. (2009)
Diverse accessions | 950 | Sequence based | Flowering time and grain yield | Huang et al. (2011)
Diverse accessions | 127 | Sequence based | Aroma | Singh et al. (2010)
Diverse accessions | 413 | 44K SNP chip | Agronomical traits | Zhao et al. (2011)

Out of 18,000 accessions of global origin, a USDA rice mini core collection of 203 accessions was used for phenotyping 14 agronomic traits. Of the 14 agronomic traits, 5 were correlated with grain yield per plant: plant height, plant weight, tillers, panicle length, and kernels/branch. The accessions were genotyped with 155 SSRs, and model-based clustering using STRUCTURE separated them into 5 main clusters: ARO, AUS, TRJ, TEJ and IND.

The 4 main groups (AUS, IND, TEJ and TRJ) were analyzed separately for LD measured by R². Mean R² ranged from 0.04 for IND to 0.10 for TEJ and TRJ. IND had the most linked marker pairs with significant LD (9.53%), while TRJ had the least (5.57%). LD decayed over about 20 cM within both AUS and IND, and over about 30 and 40 cM within TRJ and TEJ.

Association analysis on candidate genes: An association study employs techniques from molecular biology, field sampling/breeding, bioinformatics and statistics. Select candidate genes using existing QTL and positional cloning. Choose diverse germplasm for the trait. Score phenotypic traits in replicated trials. Amplify and sequence candidate genes. Manipulate sequences into valid alignments and identify polymorphisms. Obtain diversity estimates and evaluate patterns of selection. Statistically evaluate associations between genotypes and phenotypes, taking population structure into account.

The BADH gene was isolated from all 16 varieties and sequenced. Sequence trace files from each variety were assembled into contigs using the combined Phred/Phrap/Consed software. Polymorphism tags were generated automatically by PolyPhred software integrated with Consed. High-quality SNPs from the transcribed region were then identified manually, and screenshots of the SNP trace files were captured for the two alleles.

MassARRAY Assay Design 3.1 software was further used to detect more SNPs. 127 diverse rice varieties and landraces were used to analyse polymorphism for the identified SNPs. A phylogenetic tree of the BADH1 gene sequences obtained by resequencing 16 rice varieties, together with the Nipponbare reference gene sequence, was constructed using MEGA 4.0. Analysis of BADH1 sequence variation among the 127 rice varieties was based on the scores of 15 validated SNPs, identified by resequencing the BADH1 gene from 16 varieties and Nipponbare, using Sequenom MassARRAY assays.

Two common BADH1 protein haplotypes (corresponding to four BADH1 SNP haplotypes) were analyzed in all 127 rice varieties and also separately in the aromatic and salt-tolerant subgroups of varieties. 54 SNPs giving more than 95% success rates were used for the population structure analysis using STRUCTURE software. Two haplotypes of the BADH1 protein, PH1 and PH2, were modeled and docked.

The three exonic SNPs were (1) S6 in exon 4, with a T/A polymorphism resulting in an asparagine to lysine substitution at amino acid position 144; (2) S18 in exon 11, with a C/A polymorphism resulting in a glutamine to lysine substitution at amino acid position 345; and (3) S19 in exon 11, with a T/C polymorphism resulting in an isoleucine to threonine substitution at amino acid position 347. PH1 has 15 active GABald binding sites whereas PH2 has 8.

517 landraces were phenotyped and genotyped by sequencing at up to one-fold coverage using the Illumina Genome Analyzer II. Sequence reads were aligned to the rice reference genome for SNP identification. Discrepancies with the rice reference genome were called as candidate SNPs.

A total of 3,625,200 nonredundant SNPs were identified, resulting in an average of 9.32 SNPs per kb, with 87.9% of the SNPs located within 0.2 kb of the nearest SNP. A total of 167,514 SNPs were found in the coding regions of 25,409 annotated genes. 3,625 large-effect SNPs (representing mutations predicted to cause large effects) were identified. A neighbor-joining tree as well as principal-component analysis separated the rice germplasm into two groups, i.e. indica and japonica. Both indica and japonica further had three subgroups.

Because of strong population differentiation between the two subspecies of cultivated rice, GWAS was conducted only for the 373 indica lines using a mixed linear model (MLM). 80 associations for the 14 agronomic traits were identified. Heading date was strongly correlated with both population structure and geographic distribution. Genome-wide LD decay rates of indica and japonica were estimated at ~123 kb and ~167 kb, where r² drops to 0.25 and 0.28, respectively.
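An LD decay distance of this kind is typically read off a curve of mean r² against physical distance. The sketch below assumes r² has already been averaged within distance bins and simply reports the first bin at or below a chosen threshold; the bin values and the function `ld_decay_distance` are hypothetical and are not the study's data or method.

```python
def ld_decay_distance(binned_r2, threshold):
    """Return the first distance (kb) at which mean r^2 is at or below threshold.

    binned_r2: iterable of (distance_kb, mean_r2) pairs.
    Returns None if r^2 never reaches the threshold in the given range.
    """
    for distance_kb, mean_r2 in sorted(binned_r2):
        if mean_r2 <= threshold:
            return distance_kb
    return None


# Hypothetical binned LD values, for illustration only.
hypothetical_bins = [(10, 0.60), (50, 0.40), (100, 0.30), (150, 0.24), (200, 0.20)]

decay_kb = ld_decay_distance(hypothetical_bins, threshold=0.25)
print(decay_kb)
```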

413 diverse accessions of O. sativa were phenotyped for 34 traits and genotyped using a 44K SNP array. Probes were prepared from DNA, labelled and hybridized to the array. Genotype calling was done using the ALCHEMY program. 36,901 high-performing SNPs (call rate > 70%) were used for all analyses.
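The call-rate filter mentioned above keeps only SNPs that were successfully genotyped in a sufficient share of samples. A minimal sketch, assuming genotypes are stored as strings with "NN" marking a missing call; the data layout, the `MISSING` code, and the function names are illustrative assumptions and do not reflect the ALCHEMY output format.

```python
MISSING = "NN"  # assumed code for a failed genotype call


def call_rate(calls):
    """Fraction of samples with a non-missing genotype call."""
    return sum(1 for call in calls if call != MISSING) / len(calls)


def filter_by_call_rate(genotypes, min_call_rate=0.70):
    """Keep SNPs whose call rate is strictly above the threshold."""
    return {
        snp: calls
        for snp, calls in genotypes.items()
        if call_rate(calls) > min_call_rate
    }


# Hypothetical genotype calls for three SNPs across ten samples.
genotypes = {
    "snp_1": ["AA", "AG", "GG", "AA", "AG", "AA", "GG", "AA", "AA", "AG"],
    "snp_2": ["CC", "NN", "CT", "NN", "CC", "NN", "TT", "CC", "NN", "CC"],
    "snp_3": ["GG", "GG", "NN", "GT", "GG", "NN", "GG", "GT", "NN", "GG"],
}

kept = filter_by_call_rate(genotypes)
print(sorted(kept))
```

Here `snp_3` has a call rate of exactly 0.70 and is dropped because the threshold is applied as a strict "greater than", matching the "> 70%" wording on the slide.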

PCA was done to determine population structure and separated the accessions into 5 clusters. A mixed-model approach was implemented to correct for population structure. LD among the 44K common SNPs was measured as r² using PLINK software. LD decay was observed at ~100 kb in indica, 200 kb in aus and temperate japonica, and 300 kb in tropical japonica, with an average marker distance of about 10 kb.

GWAS for various traits: plant height, panicle length, flowering time, photoperiod sensitivity.

Comparison
Candidate gene approach: The choice of candidate genes and of markers within them often involves some guesswork, so many previously unreported genes are likely to go undetected.
Genome-wide association mapping:
GWA using markers: Requires discovery of a large number of markers. In species like A. thaliana (125 Mb) ~140,000 markers, and in maize (475 Mb) ~10-15 million markers, will be required to give complete coverage.
SNP genotyping using microarray: Good and robust; can process a large number of samples and identify a large number of SNPs in one shot. However, a polymorphism that is not present in the initial discovery panel remains undetected in the larger sample.
Whole genome sequencing: Detects all polymorphisms in the population and thus avoids the erosion of power due to ascertainment bias.