Estimation of Heritability from GWAS Summary Statistics
Genetics & Genomics Winter School
A/Prof Loic Yengo ([email protected])
Institute for Molecular Bioscience, The University of Queensland
Outline
- Overview of Genome-Wide Association Studies (GWAS)
- Linkage Disequilibrium Score Regression
- Other methods
- Partitioned heritability
Genome-Wide Association Studies (Quantitative Traits)
- Association = mean differences in the trait between genotypes
- Genome-Wide = tests across a large number of variants
[Figure from Chailurkit et al. PeerJ 2022; alleles A, G and T]
Genome-Wide Association Studies (Binary/Disease Traits)
- Association = allele frequency differences between cases and controls
- Example: more T alleles in cases, fewer T alleles in controls
- Large sample sizes are required…
[Figure: alleles A, G and T in cases and controls]
Manhattan plot
Mapping the human genetic architecture of COVID-19 (COVID-19 host genetics consortium 2021, Nature)
- 125,584 cases vs 2.5 M controls
- 60 studies from 25 countries
- 26 associations detected
- Detected genetic associations implicate interferon genes
Another popular plot in GWAS = QQ-plot
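As a rough illustration of what a QQ-plot shows, the sketch below pairs observed -log10(p) values with the values expected if all p-values were uniform on (0, 1), i.e. under the null. The function name `qq_points` and the simulated null p-values are hypothetical; plotting the returned pairs is left to any plotting tool.

```python
import math
import random


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, sorted from least to most significant."""
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    # Under the null, the i-th smallest of n p-values is expected near (i - 0.5) / n.
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


random.seed(1)
# Hypothetical null GWAS: p-values drawn uniformly in (0, 1] (avoids log10(0)).
null_p = [1.0 - random.random() for _ in range(10_000)]
points = qq_points(null_p)
```

With null p-values the points fall close to the diagonal; inflation or true associations show up as observed values exceeding the expected ones in the upper tail.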
GWAS summary statistics
Come in different flavours.
Minimum available:
- SNP ID (e.g., rs number or chromosome:position:genome build)
- Alleles tested (e.g., effect allele / non-effect allele)
- Allele frequency
- Marginal SNP effect (a.k.a. "BETA")
- Standard error
- Per-SNP sample size
- P-value
More data sometimes:
- Imputation accuracy
- Genotype frequencies
- Frequencies in cases and controls
- Hardy-Weinberg equilibrium test statistics
Most GWAS are conducted using regression methods: linear / logistic (mixed) models.
GWAS summary statistics
It has become standard to share and make publicly available the summary-level data when publishing a GWAS. (Nat Genet editorial, July 2012)
Challenges with GWAS summary statistics
- The test used is not always specified
- Sample size may vary (substantially) across SNPs (consortium / imputation)
- Imputation accuracy is not always available (effective sample size)
- Summary statistics may be truncated (identifiability issue), which creates noise
- Allele frequencies may not always match those of the individuals in the GWAS
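Notations and Nomenclature
- Estimated effect of SNP j: $\hat{\beta}_j$
- Standard error of $\hat{\beta}_j$: $SE(\hat{\beta}_j)$
- Z-score of SNP j: $Z_j = \hat{\beta}_j / SE(\hat{\beta}_j)$
- Chi-square of SNP j: $\chi^2_j = Z_j^2$. Under the null hypothesis of no association, this statistic is expected to follow asymptotically (i.e., as the sample size grows large) a $\chi^2$ distribution with 1 degree of freedom.
- Genomic Control: $\lambda_{GC} = \mathrm{median}(\chi^2_j) / 0.455$, where 0.455 is the median of a $\chi^2$ distribution with 1 degree of freedom (0.456 is also commonly used). Large values (i.e. > 1.1) may indicate confounding due to population stratification.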
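The quantities above can be computed directly from the BETA and SE columns of a summary statistics file. The sketch below is a minimal illustration on simulated null data; the function names, the common standard error of 0.01 and the number of SNPs are hypothetical choices, not part of the lecture.

```python
import random
import statistics

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
MEDIAN_CHI2_1DF = 0.4549


def z_score(beta_hat, se):
    """Z-score of a SNP: estimated effect divided by its standard error."""
    return beta_hat / se


def lambda_gc(chi2_values):
    """Genomic control factor: median chi-square over its expected null median."""
    return statistics.median(chi2_values) / MEDIAN_CHI2_1DF


random.seed(42)
# Hypothetical null GWAS: no SNP has an effect, so beta_hat ~ N(0, SE^2).
se = 0.01
beta_hats = [random.gauss(0.0, se) for _ in range(50_000)]
chi2 = [z_score(b, se) ** 2 for b in beta_hats]
lam = lambda_gc(chi2)
```

In this null simulation `lam` should be close to 1. Note that a polygenic trait in a large sample can also give a value above 1 without any confounding, which is the ambiguity LD score regression was designed to resolve.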
Linkage Disequilibrium (LD) Score Regression
LD score regression
- Initial motivation: distinguish polygenicity from confounding (e.g., due to population stratification)
- Extension(s)
  - Estimation of SNP-based heritability and genetic correlations (Bulik-Sullivan 2014, 2015)
  - Functional enrichment (Finucane 2015)
  - Estimation of polygenicity
  - Etc.
Credit to Bulik-Sullivan (online lecture)
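LD scores
LD score of SNP j: $\ell_j = \sum_k r^2_{jk}$, where the sum runs over SNPs k in a window around SNP j and $r_{jk}$ is the correlation between the genotypes of SNPs j and k.
[Figure: window of 100 Kb on each side of SNP j, Population 1]
Credit to Bulik-Sullivan (online lecture)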
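To make the LD score concrete, the sketch below computes it as the sum of squared genotype correlations between a SNP and every SNP within a fixed window, the SNP itself included. Everything here is an illustrative assumption: the toy genotypes, the positions, the 100 Kb window and the function names. It uses raw sample r² with no bias adjustment, so it is not a substitute for dedicated software.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two genotype vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def ld_scores(genotypes, positions, window_bp=100_000):
    """LD score of each SNP: sum of r^2 with all SNPs within window_bp (itself included)."""
    scores = []
    for j, pos_j in enumerate(positions):
        score = 0.0
        for k, pos_k in enumerate(positions):
            if abs(pos_k - pos_j) <= window_bp:
                score += pearson_r(genotypes[j], genotypes[k]) ** 2
        scores.append(score)
    return scores


random.seed(7)
n_ind, n_snp = 200, 20
genotypes = []  # one list of 0/1/2 genotype counts per SNP
for j in range(n_snp):
    if j % 5 != 0:
        # Mimic LD: copy the previous SNP, redrawing about 10% of the genotypes.
        snp = [g if random.random() < 0.9 else random.choice([0, 1, 2])
               for g in genotypes[-1]]
    else:
        # Start a new, independent block of SNPs.
        snp = [random.choice([0, 1, 2]) for _ in range(n_ind)]
    genotypes.append(snp)
positions = [10_000 * j for j in range(n_snp)]  # hypothetical base-pair positions
scores = ld_scores(genotypes, positions)
```

SNPs sitting in the middle of a correlated block get larger scores than SNPs at block edges, which is the "tagging" that LD score regression exploits.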
Under genetic drift… Credit to Bulik-Sullivan (online lecture)
…the more you tag, the more likely you are to tag a causal variant! Key assumption: each SNP explains the same amount of trait variance. Credit to Bulik-Sullivan (online lecture)
Simulated polygenicity. Credit to Bulik-Sullivan (online lecture)
Simulated population stratification (UK vs Sweden)
LD score regression theory
$E[\chi^2_j \mid \ell_j] = \frac{N h^2}{M}\,\ell_j + N a + 1$
- $N$ is the GWAS sample size.
- $h^2/M$ is the average heritability explained per SNP.
- $N a + 1$ is the LD score regression intercept. Deviations from 1 indicate confounding.
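As a quick numerical illustration of the LD score regression expectation, take the hypothetical values $N = 100{,}000$, $h^2 = 0.5$, $M = 1{,}000{,}000$, no confounding ($a = 0$) and a SNP with LD score $\ell_j = 100$. These numbers are made up for the example.

```latex
\begin{align*}
E[\chi^2_j \mid \ell_j]
  &= \frac{N h^2}{M}\,\ell_j + N a + 1 \\
  &= \frac{100{,}000 \times 0.5}{1{,}000{,}000} \times 100 + 0 + 1 \\
  &= 0.05 \times 100 + 1 = 6
\end{align*}
```

A SNP with $\ell_j = 10$ would have an expected chi-square of 1.5 under the same values, so the expected test statistic grows linearly with the LD score even though no SNP is individually special.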
Proof (more details in the Supplementary Note of Bulik-Sullivan et al. 2014)
Key ideas behind the proof: 1) Population stratification model 2) Heritability model
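Stratification Model
$F_{ST}$ model: Balding-Nichols
- Ancestral population with allele frequency $p$
- Derived populations 1 and 2 with allele frequencies $p_1$ and $p_2$
- Phenotype mean shifted by +S/2 in one derived population and by -S/2 in the other: mean difference = S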
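A minimal simulation sketch of this stratification model follows. It assumes the usual Balding-Nichols parameterisation, in which the allele frequency in each derived population is drawn from a Beta distribution with mean $p$ and variance $F_{ST}\,p(1-p)$, and a phenotype whose mean is shifted by +S/2 or -S/2 depending on the population. The function names and all numerical values are hypothetical.

```python
import random


def balding_nichols_freqs(p_ancestral, fst, n_pops=2):
    """Draw derived-population allele frequencies around an ancestral frequency."""
    a = p_ancestral * (1.0 - fst) / fst
    b = (1.0 - p_ancestral) * (1.0 - fst) / fst
    # Beta(a, b) has mean p_ancestral and variance fst * p * (1 - p).
    return [random.betavariate(a, b) for _ in range(n_pops)]


def stratified_phenotype(population, s):
    """Phenotype with unit residual variance and a mean shift of +s/2 or -s/2."""
    shift = s / 2.0 if population == 1 else -s / 2.0
    return shift + random.gauss(0.0, 1.0)


random.seed(3)
P, FST, S = 0.3, 0.01, 0.5  # hypothetical ancestral frequency, F_ST and mean difference
draws = [balding_nichols_freqs(P, FST) for _ in range(20_000)]
p1 = [d[0] for d in draws]
mean_p1 = sum(p1) / len(p1)
var_p1 = sum((x - mean_p1) ** 2 for x in p1) / len(p1)

y1 = [stratified_phenotype(1, S) for _ in range(20_000)]
y2 = [stratified_phenotype(2, S) for _ in range(20_000)]
mean_diff = sum(y1) / len(y1) - sum(y2) / len(y2)
```

Because both allele frequencies and the phenotype mean differ between the two populations, a GWAS that pools them picks up spurious associations at SNPs whose frequencies happened to drift apart.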
What does it look like? PC1 explains most of the variance.
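Heritability Model
$y_i = \sum_{j=1}^{M} x_{ij}\beta_j + e_i$ with $E[\beta_j] = 0$, $\mathrm{Var}(\beta_j) = h^2/M$ and $\mathrm{Var}(e_i) = 1 - h^2$
All SNPs contribute equally to the trait heritability.
Under this assumption (+ genotypes centred and scaled): $E[\chi^2_j] \approx \frac{N h^2}{M}\,\ell_j + 1$
You can complete the proof…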
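One way to complete the proof is sketched below. It relies on simplifying assumptions that the slide does not spell out: the phenotype is scaled to variance 1, there is no confounding, effects $\beta_k$ have mean 0 and variance $h^2/M$ and are independent of genotypes, $SE(\hat{\beta}_j) \approx 1/\sqrt{N}$, and $\hat{r}_{jk}$ denotes the in-sample correlation between standardised genotypes of SNPs $j$ and $k$. The full argument is in the Supplementary Note cited in the lecture.

```latex
\begin{align*}
y &= \sum_{k=1}^{M} x_k \beta_k + e,
  \qquad \mathrm{Var}(\beta_k) = \frac{h^2}{M},
  \quad \mathrm{Var}(e_i) = 1 - h^2 \\
\hat{\beta}_j &= \frac{1}{N}\, x_j^\top y
  = \sum_{k=1}^{M} \hat{r}_{jk}\,\beta_k + \frac{1}{N}\, x_j^\top e \\
E[\chi^2_j] &\approx N\,\mathrm{Var}(\hat{\beta}_j)
  = N \left( \frac{h^2}{M} \sum_{k=1}^{M} E[\hat{r}_{jk}^2] + \frac{1 - h^2}{N} \right) \\
E[\hat{r}_{jk}^2] &\approx r_{jk}^2 + \frac{1 - r_{jk}^2}{N}
  \;\Longrightarrow\;
  \sum_{k=1}^{M} E[\hat{r}_{jk}^2] \approx \ell_j + \frac{M - \ell_j}{N} \\
E[\chi^2_j] &\approx \frac{N h^2}{M}\,\ell_j + 1 - \frac{h^2 \ell_j}{M}
  \approx \frac{N h^2}{M}\,\ell_j + 1
\end{align*}
```

The last approximation drops $h^2 \ell_j / M$, which is negligible because $\ell_j \ll M$.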
LD score regression in practice
- Regress the $\chi^2_j$'s on the LD scores $\ell_j$'s
- Use weights to account for
  - high LD score SNPs contributing too much
  - heteroskedasticity (i.e., residuals don't have the same variance)
- Block jackknife to assess standard errors (300 blocks)
- Practical 4 will use the LDSC software
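The steps above can be sketched on simulated data as follows. This is a simplified illustration, not the LDSC software: SNP test statistics are simulated as independent draws with mean $1 + N h^2 \ell_j / M$, the weights are a two-pass approximation of the form $1 / [\ell_j (1 + N h^2 \ell_j / M)^2]$, and the jackknife uses 50 contiguous blocks rather than 300 to keep the example fast. All names and numerical values are hypothetical.

```python
import math
import random


def wls_line(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    return (sy - slope * sx) / sw, slope


def ldsc_weights(ld, n, m, h2):
    """Down-weight high-LD SNPs (1 / ld) and high-variance SNPs (heteroskedasticity)."""
    h2 = min(max(h2, 0.0), 1.0)
    return [1.0 / (max(l, 1.0) * (1.0 + n * h2 * l / m) ** 2) for l in ld]


def ldsc_fit(chi2, ld, n, m):
    """Two passes: first with h2 = 0 in the weights, then re-weighted with the estimate."""
    h2 = 0.0
    for _ in range(2):
        w = ldsc_weights(ld, n, m, h2)
        intercept, slope = wls_line(ld, chi2, w)
        h2 = slope * m / n
    return intercept, h2, w


def drop_block(values, start, stop):
    return values[:start] + values[stop:]


def block_jackknife_se(chi2, ld, w, n, m, n_blocks):
    """Delete-one-block jackknife SE of h2 (assumes len(chi2) is a multiple of n_blocks)."""
    size = len(chi2) // n_blocks
    est = []
    for b in range(n_blocks):
        lo, hi = b * size, (b + 1) * size
        _, slope = wls_line(drop_block(ld, lo, hi), drop_block(chi2, lo, hi),
                            drop_block(w, lo, hi))
        est.append(slope * m / n)
    mean = sum(est) / n_blocks
    return math.sqrt((n_blocks - 1) / n_blocks * sum((e - mean) ** 2 for e in est))


random.seed(2024)
N, M, H2_TRUE = 100_000, 1_000_000, 0.5  # hypothetical sample size, SNP count, heritability
ld = [random.uniform(1.0, 200.0) for _ in range(10_000)]  # hypothetical LD scores
# Z-scores with variance 1 + N * h2 * ld / M, so that E[chi2] follows the LDSC line.
chi2 = [random.gauss(0.0, math.sqrt(1.0 + N * H2_TRUE * l / M)) ** 2 for l in ld]

intercept, h2_hat, weights = ldsc_fit(chi2, ld, N, M)
h2_se = block_jackknife_se(chi2, ld, weights, N, M, n_blocks=50)
```

In real data neighbouring SNPs have correlated test statistics, which is why the jackknife deletes contiguous genomic blocks instead of single SNPs.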
Caveat: what is the correct "M"?
LDSC estimates of heritability are biased (yet still useful)!
[Figure: window of 100 Kb on each side of the SNP, Population 1]
Estimation of genetic correlations (more details in the Supplementary Note of Bulik-Sullivan et al. 2014)
Formally (Bulik-Sullivan et al. 2015)
$E[Z_{1j} Z_{2j}] = \sqrt{N_1 N_2}\; r_g \sqrt{\frac{h_1^2}{M}\,\frac{h_2^2}{M}}\;\ell_j + \frac{\rho\, N_s}{\sqrt{N_1 N_2}}$
- $N_1$ and $N_2$ are the GWAS sample sizes of study/trait 1 and 2.
- $N_s$ is the number of participants overlapping between study 1 and 2.
- $h_k^2/M$ is the average heritability explained per SNP for trait $k$.
- $\rho$ is the phenotypic correlation between trait 1 and 2.
- $r_g$ is the genetic correlation (i.e. the correlation between true effects of scaled genotypes).
Estimation uses weighted least-squares
- Step 1: two univariate analyses
- Step 2: estimation of the genetic covariance
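The two-step procedure can be sketched on simulated data as below. The sketch assumes no sample overlap (so the cross-trait intercept is 0), independent SNPs, and that the expected product of Z-scores has slope $\sqrt{N_1 N_2}\,\rho_g / M$ on the LD score, where $\rho_g = r_g \sqrt{h_1^2 h_2^2}$ is the genetic covariance. For brevity it uses ordinary rather than weighted least squares, which is a simplification of the method described in the lecture. All names and values are hypothetical.

```python
import math
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


random.seed(11)
N1, N2, M = 100_000, 80_000, 1_000_000    # hypothetical sample sizes and SNP count
H2_1, H2_2, RG_TRUE = 0.5, 0.4, 0.6       # hypothetical heritabilities and r_g
rho_g = RG_TRUE * math.sqrt(H2_1 * H2_2)  # genetic covariance

ld = [random.uniform(1.0, 200.0) for _ in range(20_000)]  # hypothetical LD scores
z1, z2 = [], []
for l in ld:
    v1 = 1.0 + N1 * H2_1 * l / M               # Var(Z1)
    v2 = 1.0 + N2 * H2_2 * l / M               # Var(Z2)
    c12 = math.sqrt(N1 * N2) * rho_g * l / M   # Cov(Z1, Z2), no sample overlap
    a, b = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    z1.append(math.sqrt(v1) * a)
    z2.append(c12 / math.sqrt(v1) * a + math.sqrt(v2 - c12 ** 2 / v1) * b)

# Step 1: two univariate analyses (chi-square on LD score).
h2_1_hat = ols_slope(ld, [z * z for z in z1]) * M / N1
h2_2_hat = ols_slope(ld, [z * z for z in z2]) * M / N2
# Step 2: genetic covariance (product of Z-scores on LD score).
rho_g_hat = ols_slope(ld, [a * b for a, b in zip(z1, z2)]) * M / math.sqrt(N1 * N2)
rg_hat = rho_g_hat / math.sqrt(h2_1_hat * h2_2_hat)
```

The genetic correlation is a ratio of the step 2 estimate to the geometric mean of the two step 1 estimates, so it becomes unstable when either heritability estimate is close to zero.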
Formally Yengo, Yang & Visscher (2018) Additional term only matters when N is large!
Browsing genetic correlations in UK Biobank: https://ukbb-rg.hail.is/
Other methods
HDL method
- Same heritability model as LD score regression
- Lower standard errors
- Splits the genome into 1700 independent LD blocks
- Standard errors (SE) are still estimated using block jackknife
- SE are reduced by ~3-fold!
Generalized Random Effect (GRE)
- Minimal assumptions about the distribution of SNP effects
- The estimator combines the marginal effects with the inverse of the in-sample LD matrix
Bayesian Models (e.g., SBayesC)
Prior on SNP effects: $\beta_j \sim \pi\, N(0, \sigma_\beta^2) + (1 - \pi)\, \delta_0$
- $\pi$: proportion of SNPs with non-zero effects
- $\delta_0$: Dirac point mass at 0
Prior distribution + Data (GWAS summary statistics) = Posterior distribution (Bayes' rule)
- Inference is based on Markov chain Monte Carlo (MCMC) sampling (Lecture 6)
- Standard errors are obtained from sampling the posterior distribution
- A bit more computationally intensive
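To get a feel for this kind of prior, the sketch below draws SNP effects from a point-normal mixture: each effect is non-zero with probability `pi` and exactly zero otherwise. It only samples from the prior and does not perform any posterior inference. It assumes standardised, independent SNPs, so that the heritability is approximately the sum of squared effects. The function name and parameter values are hypothetical.

```python
import random


def sample_effects(m, pi, sigma2):
    """Draw m SNP effects: N(0, sigma2) with probability pi, exactly 0 otherwise."""
    sd = sigma2 ** 0.5
    return [random.gauss(0.0, sd) if random.random() < pi else 0.0 for _ in range(m)]


random.seed(5)
M, PI, H2 = 100_000, 0.01, 0.5  # hypothetical SNP count, polygenicity and heritability
sigma2 = H2 / (M * PI)          # slab variance chosen so that E[sum of beta^2] = H2
beta = sample_effects(M, PI, sigma2)

prop_nonzero = sum(1 for b in beta if b != 0.0) / M
h2_from_effects = sum(b * b for b in beta)
```

Lowering `pi` while keeping the heritability fixed concentrates the same variance in fewer SNPs with larger effects, which is the sense in which this parameter describes polygenicity.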
Bayesian Model (SBayesS)
SumHer method
A different heritability model… where the expected heritability contributed by each SNP depends on its allele frequency and on the local LD level.
Extensions of LD score regression
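Partitioned Heritability
Model: $E[\chi^2_j] = N \sum_c \tau_c\, \ell(j, c) + N a + 1$
- $\tau_c$: effect of annotation $c$ on heritability
- $\ell(j, c) = \sum_{k \in c} r^2_{jk}$: LD between SNP j and SNPs with that annotation
Applications
- Quantify enrichment of heritability in certain annotations
- Prioritize tissues/cell-types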
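Once per-annotation coefficients are available, enrichment is commonly summarised as the proportion of heritability in an annotation divided by the proportion of SNPs in it. The sketch below assumes that the per-SNP heritability is the sum of the coefficients of all annotations the SNP belongs to. The annotation names, coefficient values and function name are hypothetical.

```python
def enrichment(annotations, tau):
    """Enrichment per category: (share of heritability) / (share of SNPs).

    annotations: one set of category names per SNP.
    tau: dict mapping category name to its per-SNP heritability contribution.
    """
    per_snp_h2 = [sum(tau[c] for c in cats) for cats in annotations]
    total_h2 = sum(per_snp_h2)
    n_snps = len(annotations)
    result = {}
    for category in tau:
        in_cat = [h for h, cats in zip(per_snp_h2, annotations) if category in cats]
        prop_h2 = sum(in_cat) / total_h2
        prop_snps = len(in_cat) / n_snps
        result[category] = prop_h2 / prop_snps
    return result


# Hypothetical example: 1,000 SNPs, all in "base", the first 100 also in "enhancer".
annotations = [{"base", "enhancer"} if j < 100 else {"base"} for j in range(1000)]
tau = {"base": 4e-4, "enhancer": 1.6e-3}
enr = enrichment(annotations, tau)
```

An enrichment of 1 means the annotation carries heritability in proportion to its size; values well above 1 flag annotations, and by extension tissues or cell types, that matter disproportionately for the trait.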
A few more extensions
- Estimation of genetic correlation between populations (POPCORN method, Brown et al. AJHG 2016)
- Estimation of the % of heritability mediated by gene expression (Yao et al. Nat Genet 2020)
- Estimation of the polygenicity of traits (O'Connor et al. AJHG 2019)
- Causal inference (O'Connor & Price, Nat Genet 2018)
Summary and conclusions
- The strength of association varies across SNPs, and that variation depends on local LD and heritability
- That variation can be leveraged to estimate heritability using a method like LD score regression (although biased)
- Genetic correlations can be estimated similarly (no bias)
- Other methods exist (different heritability model, Bayesian, etc.)