2. Agenda for Presentation
01 INTRO TO GWAS
02 PROS/CONS OF GWAS
03 GRAIL
04 SIGNAL PRIORITIZATION WITH GENOWAP
05 CONCLUSION
06 FUTURE RESEARCH
3. What is GWAS (Genome-Wide Association Study)?
❖ A statistical study
❖ Associates genetic variation (genotype) with observable traits (phenotype, e.g. high blood pressure)
❖ Tags a region of genetic variants that includes the causal ones
❖ About 99.9 percent of the genetic letters (3B) are identical in every human
❖ The remaining 0.1 percent accounts for the diversity of humankind
Ref: Five Years of GWAS Discovery, Peter M. Visscher et al, 2012
4. HTRA1 Promoter Polymorphism in Wet Age-Related Macular Degeneration (2005)
Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls (2007)
Common polygenic variation contributes to risk of schizophrenia that overlaps with bipolar disorder (2009)
Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data (2015)
LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis (2017)
Post GWAS: ML and whole-genome sequencing (WGS) studies (Beyond)
5. Real Life Example
DNA alphabet: A T C G
Codons: ATC, CTG
Gene (figure labels): codons, exon, intron, stop codon
Variations:
ATG CTC GTT AAG TAA
ATG CTC GTT TAG TAA
Identifying the genetic association with observable behaviors (e.g. traits/diseases)
6. What is a SNP (Single Nucleotide Polymorphism) in GWAS?
❖ The order of genetic letters in human genomes varies at specific locations
❖ A single-letter variation of this kind is called a SNP (pronounced ‘snip’)
❖ Not all SNPs have effects on traits
❖ GWAS tests SNPs to find associations between genes and traits
Image Source: https://www.nutrigeneticsspecialists.com/single-post/2017/03/27/what-is-a-snpc
7. High Level: GWAS Mechanism (Case/Control Study)
Case: population with disease
Control: population without disease
Genotyping the genome: finding SNPs (SNP with disease vs. SNP w/o disease)
for each SNP:
    compute its frequency with and w/o disease
    compute the odds ratio
Ref: Designing candidate gene and genome-wide case-control association studies, Krina T. Zondervan et al, 2007
8. High Level: GWAS Mechanism (Case/Control Study)
SNP      A    T    Total
Case     50   150  200
Control  100  100  200
Total    150  250  400
At a given position, the variation observed is A/T
T is more strongly associated with the disease than A
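The comparison in the table can be checked with a few lines of Python. This is a minimal sketch, assuming the table holds allele counts and that the odds ratio is taken for allele T relative to allele A; the function name `odds_ratio` is illustrative and does not come from the cited reference.

```python
def odds_ratio(case_t, case_a, control_t, control_a):
    """Odds of carrying T (vs. A) among cases divided by the same odds among controls."""
    return (case_t / case_a) / (control_t / control_a)

# Counts from the table: cases A=50, T=150; controls A=100, T=100
or_t = odds_ratio(case_t=150, case_a=50, control_t=100, control_a=100)
freq_t_cases = 150 / 200     # frequency of T among cases
freq_t_controls = 100 / 200  # frequency of T among controls

print(or_t, freq_t_cases, freq_t_controls)  # 3.0 0.75 0.5
```

An odds ratio above 1 for T is what the slide summarizes as T being more associated with the disease than A.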
9. Manhattan Plot based on GWAS Study
https://en.wikipedia.org/wiki/Genome-wide_association_study
10. Where to find GWAS data
National Center for Biotechnology Information (NCBI)
11. Curve of learning speed (GWAS Study and Publication)
Image Source: https://mobile.twitter.com/GWASCatalog/status/1360288750132150272/photo/1
12. GWAS
Pros
01. GWAS can lead to the discovery of novel biological mechanisms: GWAS implicate genes of unknown function, and experimental follow-up on loci can lead to the discovery of novel biological mechanisms underlying disease.
02. GWAS are relevant to the study of low-frequency and rare variants: informed by data from large reference panels, enabling many low-frequency and rare variants to be directly genotyped.
03. GWAS based on SNP arrays use reliable genotyping technology: contemporary genome-wide SNP arrays achieve call rates, HapMap concordance, Mendelian consistency and reproducibility of >99.7%.
04. GWAS can provide insight into ethnic variation of complex traits: GWAS in diverse ethnic groups can reveal heterogeneity in genetic susceptibility to disease.
05. GWAS data are easily shared, and publicly available data facilitate novel discoveries: the availability of GWAS summary statistics has increased dramatically in recent years (UK Biobank; Kaiser Permanente’s Research Pro.).
06. GWAS based on SNP arrays are cost-effective for identifying risk loci: genome-wide SNP arrays, like the Illumina Infinium Global Screening Array or the Thermo Fisher Axiom Precision Medicine Research Array, cost approximately US$40 per sample.
Tam et al, 2019
13. GWAS
Cons
01. Lack of diversity: large-scale GWAS efforts have disproportionately focused on European-ancestry populations, with only ~10% of all GWAS participants being of non-European descent (Loos, R. 2020).
02. GWAS are penalized by an important multiple testing burden: significance is assessed using a Bonferroni correction to maintain the genome-wide false-positive rate at 5% (assuming 1 million independent tests for common genetic variation).
03. GWAS have limited clinical predictive value: only a modest proportion of heritability is explained.
04. GWAS based on SNP arrays rely on pre-existing genetic variant reference panels: SNP-array-based GWAS depend on the completeness of the sequencing studies and resulting reference panels that inform genotyping array design.
05. GWAS signals may be due to cryptic population stratification: this can result in spurious associations if not properly accounted for.
06. GWAS explain only a modest fraction of the missing heritability: the variants that GWAS identify as associated with a trait/disease account for only a modest proportion of the estimated heritability of most complex traits.
Tam et al, 2019
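The multiple testing burden mentioned among the cons comes down to simple arithmetic. The sketch below assumes the figures stated on the slide (a 5% genome-wide false-positive rate and 1 million independent tests); the helper names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def is_genome_wide_significant(p_value, alpha=0.05, n_tests=1_000_000):
    """True if a SNP's p-value passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)

threshold = bonferroni_threshold(alpha=0.05, n_tests=1_000_000)
print(f"{threshold:.1e}")                # 5.0e-08
print(is_genome_wide_significant(3e-9))  # True
print(is_genome_wide_significant(1e-6))  # False
```

A SNP with p = 1e-6 would look convincing in a single test but does not pass this threshold, which is why the correction is often described as conservative.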
14. The more we learn the more we realize how little we know - R. Buckminster Fuller
● The literature provides a number of disease regions
● Each region may have a number of SNPs
● Not all SNPs may be responsible
● How can we find the causal SNPs from GWAS studies?
● Is it possible to leverage the existing studies to find the causal genes?
15. GRAIL (Gene Relationships Across Implicated Loci)
Given a collection of disease regions, GRAIL identifies a subset of genes that are more highly related than expected by chance
Input: a list of disease regions identified by GWAS and a list of their publications
Output: degree of relatedness of the genes with the disease
Motivation
● Association does not imply causation
● Identifying the causal genes helps us better understand the disease
Ref: Identifying relationships among genomic disease regions: Predicting genes at pathogenic SNP associations and rare deletions, Raychaudhuri et al, 2009
16. GRAIL works in 4 steps!
identify the overlapping genes from the list of disease regions
for each overlapping gene:
    rank all other genes based on their relatedness to it
    count the regions having at least one highly related gene
    assign a p-value to the count
select the most connected gene in each region
17. Step 1: Define the overlapping region
Image Source: GRAIL, Raychaudhuri et al, 2009
Let's look into gene 1 next
18. Background of next step
Published Article / Document
Doc1: GWAS study is important because of its ability to identify associations between disease and related genes. GWAS result can be used for inventing treatment and medication of the disease
Doc2: GWAS of BMI helps researchers know about more information regarding obesity.
Word Frequency in Doc 1
word1 GWAS 2
word2 study 1
……..
wordN disease 2
Document Frequency of words
word1 GWAS 2
word2 BMI 1
……..
wordN obesity 1
19. Background of next step
Weight of word i in document j, built from two factors:
Word frequency weight: frequency of word i in document j (more frequent, more weight)
Inverse document frequency weight: # of documents relative to the document frequency of word i (fewer documents, more weight)
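The word-weighting idea can be sketched in Python using the two example documents from the previous slide. The exact formula shown on the original slide is not recoverable from this transcript, so the sketch assumes the plain form weight = tf × log(N / df); GRAIL's published weighting may differ in detail. The tokenizer and function names are illustrative.

```python
import math
from collections import Counter

def tokenize(text):
    # Deliberately simple tokenizer: lowercase words, trailing punctuation stripped.
    return [w.strip(".,").lower() for w in text.split() if w.strip(".,")]

docs = {
    "Doc1": ("GWAS study is important because of its ability to identify "
             "associations between disease and related genes. GWAS result can be "
             "used for inventing treatment and medication of the disease"),
    "Doc2": "GWAS of BMI helps researchers know about more information regarding obesity.",
}

# Word frequency per document, and number of documents containing each word.
term_freq = {name: Counter(tokenize(text)) for name, text in docs.items()}
doc_freq = Counter(word for counts in term_freq.values() for word in counts)
n_docs = len(docs)

def weight(word, doc):
    """Assumed TF-IDF weight of `word` in `doc`: tf * log(N / df)."""
    tf = term_freq[doc][word]
    if tf == 0:
        return 0.0
    return tf * math.log(n_docs / doc_freq[word])

print(term_freq["Doc1"]["gwas"], doc_freq["gwas"])  # 2 2
print(weight("gwas", "Doc1"))     # 0.0, because the word occurs in every document
print(weight("disease", "Doc1"))  # 2 * log(2), frequent in Doc1 and absent from Doc2
```

A word that appears in every document carries no weight, while a word that is frequent in one document and rare elsewhere is weighted up.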
20. Background of next step
g: weighted count of word i for gene k
Computed over all documents/publications referring to gene k, using the # of genes referred to in each document j
For a gene k, calculating g for every word i in the vocabulary provides a gene vector for gene k
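A gene vector of this kind can be sketched as follows. The formula on the original slide is not preserved in the transcript, so this assumes that each document's word weight is divided by the number of genes the document refers to and then summed over the documents referring to gene k. The word weights, document names and gene names are hypothetical toy data.

```python
from collections import defaultdict

# Hypothetical toy inputs: word weights per document, and the genes each document refers to.
word_weights = {
    "doc1": {"insulin": 2.0, "obesity": 1.0},
    "doc2": {"obesity": 3.0},
}
genes_in_doc = {
    "doc1": ["GENE_A", "GENE_B"],
    "doc2": ["GENE_A"],
}

def gene_vector(gene):
    """Assumed form: sum over documents referring to `gene` of weight / (# genes in document)."""
    vector = defaultdict(float)
    for doc, genes in genes_in_doc.items():
        if gene not in genes:
            continue  # only documents referring to this gene contribute
        for word, w in word_weights[doc].items():
            vector[word] += w / len(genes)
    return dict(vector)

print(gene_vector("GENE_A"))  # {'insulin': 1.0, 'obesity': 3.5}
print(gene_vector("GENE_B"))  # {'insulin': 1.0, 'obesity': 0.5}
```

Under this assumption, a document that mentions many genes contributes less to each of them than a document dedicated to a single gene.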
21. Step 2: Ranking the gene against other genes
Image Source: GRAIL, Raychaudhuri et al, 2009
22. Step 3: Counting regions with related genes
Image Source: GRAIL, Raychaudhuri et al, 2009
23. Step 4: Assign p-value of key gene to the region
Image Source: GRAIL, Raychaudhuri et al, 2009
24. Some SNPs obtain higher p-values than others
Image Source: GRAIL, Raychaudhuri et al, 2009
25. Specific Problems with GWAS
● Linkage disequilibrium (LD) sometimes leads to misinterpretation of association results.
● The Bonferroni-corrected significance threshold is too conservative and leads to missing heritability.
● Coding-region-based tools are not sufficient for GWAS signal prioritization.
...This brings us to GenoWAP!
26. GenoWAP: GWAS signal prioritization through integrated
analysis of genomic functional annotation
● The goal of GWAS signal prioritization is to assign each SNP a new score that measures its importance.
● GenoWAP is a GWAS signal prioritization method that integrates genomic functional annotation and GWAS test statistics:
○ GenoCanyon functional prediction
○ GWAS P-values
27. GenoWAP: GenoCanyon (Lu et al, 2015)
● What: an unsupervised statistical framework
● Why: to predict functional non-coding regions in the human genome
● How: through integrated analysis of multiple biochemical signals and genomic conservation measures
● For each SNP in a GWAS dataset, the mean GenoCanyon functional score of its surrounding 10,000 base pairs is used as the prior probability P(Z = 1)
● All SNPs are partitioned into functional (Z = 1) and nonfunctional (Z = 0) subgroups based on the calculated mean
28. GenoWAP: Statistical Model
● For every SNP we define Z to be the indicator of general functionality, and Z_D to be the indicator of disease-specific functionality.
● If a SNP or its surrounding region is active in any genomic functional pathway, then Z equals 1. If the SNP or its surrounding region is involved in the disease pathway, then Z_D equals 1.
● Each SNP has an associated p denoting its P-value obtained from the standard GWAS analysis.
● The quantity of interest is the conditional probability of being disease-specific functional given the P-value, i.e. P(Z_D = 1 | p)
29. GenoWAP: Statistical Model
● To calculate a marker’s conditional probability given its P-value, we must know a few things:
○ the prior probability of being functional
○ the P-value density for disease-specific functional markers
○ the P-value density for markers that are not related to the disease
○ the conditional probability of being disease-specific functional given that the marker is functional in the general sense
● All SNPs are partitioned into functional (Z = 1) and nonfunctional (Z = 0) subgroups based on the calculated mean
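The four ingredients listed above combine through Bayes' rule. The sketch below is illustrative only. It assumes that a disease-specific functional SNP is also functional in the general sense, so P(Z_D = 1) = P(Z_D = 1 | Z = 1) × P(Z = 1). It further assumes a Beta(alpha, 1) density for P-values of disease-specific functional markers and a uniform density for unrelated markers. The densities, parameter values and function names are stand-ins, not GenoWAP's fitted model.

```python
def beta_density(p, alpha):
    """Density of Beta(alpha, 1) at p; for alpha < 1 it concentrates near small p."""
    return alpha * p ** (alpha - 1)

def posterior_disease_functional(p, prior_functional, cond_disease, alpha=0.5):
    """P(Z_D = 1 | p) by Bayes' rule under the assumptions stated in the lead-in."""
    joint = prior_functional * cond_disease  # P(Z_D = 1), assuming Z_D = 1 implies Z = 1
    f1 = beta_density(p, alpha)              # assumed density for disease-specific functional SNPs
    f0 = 1.0                                 # assumed uniform density for unrelated SNPs
    return f1 * joint / (f1 * joint + f0 * (1 - joint))

# Same P-value, different functional priors (e.g. different mean GenoCanyon scores).
high_prior = posterior_disease_functional(p=1e-4, prior_functional=0.9, cond_disease=0.1)
low_prior = posterior_disease_functional(p=1e-4, prior_functional=0.1, cond_disease=0.1)
print(high_prior > low_prior)  # True
```

Two SNPs with the same GWAS P-value end up with different scores when their functional priors differ, which is the sense in which the annotation reprioritizes signals.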
30. GenoWAP: Contribution
● Compared to the top loci ranked on P-values only, top ranked
loci after prioritization tend to show substantially stronger
signals in large GWAS studies.
● Within each locus, we are able to distinguish true signals
among highly correlated SNPs.
31. Next Frontier: Machine Learning
Future Areas of Research
GENERALIZED LINEAR MODELS: generate a line of best “fit” through input data in the form of a classification line/boundary
DECISION TREES: trees built around yes/no rules developed from specific features
NEURAL NETWORKS: interconnected neurons evaluate and weigh input data based on features produced from the previous connected neuron