My M.Sc. dissertation defense. The title is "Algorithm Implementation of Genetic Association Analysis for Rheumatoid Arthritis Data Based on Haplotype Blocks"
This document provides an overview of genome editing techniques such as CRISPR/Cas9 and rAAV and considerations for their use. It discusses how CRISPR/Cas9 and rAAV work to edit genomes and compares their advantages. Key factors for CRISPR gene editing are discussed such as gRNA design, donor design, and screening/validation approaches. The document also summarizes research optimizing CRISPR gene editing through improvements like testing different donor lengths and modifications. The goal is to translate genetic information into personalized medicines by leveraging tools like CRISPR and rAAV.
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic... (Elia Brodsky)
This workshop will address critical issues related to Transcriptomics data:
Processing raw Next Generation Sequencing (NGS) data:
1. Next Generation Sequencing data preprocessing:
Trimming technical sequences
Removing PCR duplicates
2. RNA-seq based quantification of expression levels:
Conventional pipelines (looking at known transcripts)
Identification of novel isoforms
Analysis of Expression Data Using Machine Learning:
3. Unsupervised analysis of expression data:
Principal Component Analysis
Clustering
4. Supervised analysis:
Differential expression analysis
Classification, gene signature construction
5. Gene set enrichment analysis
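The unsupervised steps in the outline above (PCA, then clustering) can be sketched in a few lines. This is a toy illustration, not workshop material: the expression matrix is simulated rather than one of the public datasets below, and a sign-of-PC1 split stands in for a real clustering algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 1000))   # 20 samples x 1000 genes (simulated)
expr[10:, :50] += 5.0                # second group over-expresses 50 genes

# PCA via SVD of the mean-centred matrix
centred = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
pcs = u[:, :2] * s[:2]               # sample coordinates on PC1 and PC2

# Crude two-group clustering: split samples on the sign of PC1
labels = (pcs[:, 0] > 0).astype(int)
print(pcs.shape)                     # (20, 2)
```

With a shift this strong, PC1 captures the group difference and the sign split recovers the two sample groups.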
The workshop will include hands-on exercises utilizing public domain datasets:
breast cancer cell lines transcriptomic profiles (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-10-r110),
patient-derived xenograft (PDX) mouse model of tumor and stroma transcriptomic profiles (http://www.oncotarget.com/index.php?journal=oncotarget&page=article&op=view&path[]=8014&path[]=23533), and
processed data from The Cancer Genome Atlas samples (https://cancergenome.nih.gov/).
Team: The workshops are designed by the researchers at the Tauber Bioinformatics Research Center at University of Haifa, Israel in collaboration with academic centers across the US. Technical support for the workshops is provided by the Pine Biotech team. https://edu.t-bio.info/a-critical-approach-to-transcriptomic-data-analysis/
Critical Reading Biomedical Research Papers-2022.pptx (MingdergLai)
1. The study investigates whether the ATAC complex, which contains the histone acetyltransferase Gcn5, regulates mitotic progression.
2. Experiments using siRNA to knock down subunits of ATAC and SAGA complexes in NIH-3T3 cells show that ATAC knockdown, but not SAGA knockdown, leads to mitotic defects including delayed or asymmetric cell divisions.
3. Further experiments localize ATAC subunits to mitotic cells and show that the ATAC complex remains intact during mitosis.
Moving Towards a Validated High Throughput Sequencing Solution for Human Iden... (Thermo Fisher Scientific)
Presented by Jennifer D. Churchill, PhD, during a special Lunch and Learn session at the American Academy of Forensic Sciences (AAFS) 67th annual conference, February 2015. / Conclusions
• Robust panels of identity and ancestry SNPs
• Robust STR panel
• Whole genome mtDNA sequencing
• Highly informative
• Sensitive
• Quantitative – scaling comparison
• Low density chip is not necessarily a bad chip
• Wide range of density can still yield high quality data
• Based on these results, continue development and validation
Partitioning Heritability using GWAS Summary Statistics with LD Score Regression (bbuliksullivan)
1) The document describes a new method for partitioning heritability of complex traits using summary statistics from large GWAS. It uses LD Score Regression to estimate the proportion of heritability associated with different functional annotations of the genome.
2) The method was validated in simulations, where it accurately estimated null and enriched heritability proportions.
3) The method was applied to real GWAS data for 10 complex traits, finding many functional elements enriched including conserved regions, enhancers, and cell-type specific H3K27ac regions, providing new insights into genetic architecture and disease biology.
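The core regression behind this summary can be illustrated on simulated data. This is a toy single-annotation version, not the authors' ldsc software: under a polygenic model, E[chi^2_j] = (N*h2/M)*l_j + 1, so regressing per-SNP chi-square statistics on LD scores recovers the heritability h2.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, h2 = 10_000, 50_000, 0.3          # SNPs, GWAS sample size, true h2
ld = rng.uniform(1, 100, size=M)        # simulated per-SNP LD scores l_j

# Simulate chi-square statistics around the LDSC expectation
expected = N * h2 / M * ld + 1
chi2 = rng.chisquare(df=1, size=M) * expected   # crude noise model

# Ordinary least squares: chi2 ~ a + b*ld, then h2_hat = b * M / N
A = np.column_stack([np.ones(M), ld])
coef, *_ = np.linalg.lstsq(A, chi2, rcond=None)
h2_hat = coef[1] * M / N
print(round(h2_hat, 2))
```

The real method adds regression weights, block-jackknife standard errors, and per-annotation LD scores to partition h2 across functional categories.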
Single Nucleotide Polymorphism Analysis
Predictive Analytics and Data Science Conference May 27-28
Asst. Prof. Vitara Pungpapong, Ph.D.
Department of Statistics
Faculty of Commerce and Accountancy
Chulalongkorn University
T-BioInfo is a platform for processing, analyzing, and integrating multi-omics data. It is used by multiple research groups to extract meaningful insights from large multi-omics datasets. The platform is expanding its educational capabilities to enable more people to extract meaningful, data-driven insights from omics datasets with biomedical applications. The document provides links to learn more about the platform's research and educational features.
DNA microarrays, also known as DNA chips, allow simultaneous measurement of gene expression levels for every gene in a genome. They detect mRNA levels by hybridizing cDNA to arrays of gene probes spotted on glass slides or other surfaces. Differences in gene expression between cell types or conditions can be measured and analyzed to answer biological questions.
The document discusses the Genome in a Bottle Consortium (GIAB) which aims to provide reference materials and data for benchmarking and assessing sequencing technologies and bioinformatics pipelines. The GIAB analyzed multiple sequencing datasets for the NA12878 genome and established a high confidence call set for variants through integration. Quality assessment found the call set to have near 100% sensitivity and specificity compared to other datasets in high confidence regions. The NA12878 data serves as an important reference for validation studies.
Introduction and key considerations around gene-editing using CRISPR and rAAV.
With an overview of our knock-out library using the haploid cell line HAP1
This document discusses copy number variation analysis and qBiomarker Copy Number PCR Arrays. It begins with defining copy number variation and describing current methods to analyze copy number, including array CGH, SNP chips, NGS, qPCR and FISH. It then discusses issues with using single gene references and introduces the concept of a multicopy reference assay as a better reference. The remainder focuses on qBiomarker Copy Number PCR Arrays, which allow profiling copy number variation across curated gene sets or custom arrays. The arrays utilize a multicopy reference assay and are compatible with most qPCR instruments. Data analysis is performed using an online portal.
Advances and Applications Enabled by Single Cell Technology (QIAGEN)
Over the past 5 years, single-cell genomics has become a powerful technology for studying small samples and rare cells, and for dissecting complex populations such as heterogeneous tumors. Single-cell technology is enabling many new insights into diverse research areas from oncology, immunology and microbiology to neuroscience, stem cell and developmental biology. This webinar introduces single-cell technology and summarizes the newest scientific applications in various research areas, all in the context of current literature.
Presentation by Justin Zook at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on benchmarks for indels and structural variants.
Course: Bioinformatics for Biomedical Research (2014).
Session: 3.2- Basic Aspects of Microarray Technology and Data Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
As increasing numbers of people choose to have their genomes sequenced and made available for research, more genomic data is available for analysis by machine learning approaches. Single Nucleotide Polymorphisms (SNPs) are known to be a major factor influencing many physical traits, diseases and other phenotypes. Using publicly available data and tools we predict phenotype from genotype using SNP data (1 to 2 million SNPs). We utilize data analysis and machine learning approaches only, no domain knowledge, so that our automated approach may be generally used to predict different phenotypes from genotype. In the first application of our method we predicted eye color with 87% accuracy.
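A minimal sketch of that genotype-to-phenotype idea, on simulated data rather than the 1 to 2 million-SNP profiles the abstract uses: rank SNPs by association with the trait and classify from the top-ranked ones, with no domain knowledge. The effect sizes and feature count here are illustrative choices, not the study's.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 400, 200
X = rng.integers(0, 3, size=(n, p)).astype(float)   # minor-allele counts
liability = 1.5 * X[:, :5].sum(axis=1) + rng.normal(size=n)  # 5 causal SNPs
y = (liability > np.median(liability)).astype(int)  # balanced binary trait

# Domain-knowledge-free pipeline: rank SNPs by |correlation| with the
# trait, keep the top 10, and threshold their summed allele counts
scores = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)]))
top = np.argsort(scores)[-10:]
risk = X[:, top].sum(axis=1)
pred = (risk > np.median(risk)).astype(int)

acc = (pred == y).mean()
print(round(acc, 2))
```

The feature-selection step reliably recovers the causal SNPs, so even this crude classifier beats chance comfortably; the study's 87% eye-color accuracy comes from a more elaborate machine-learning pipeline on real genomes.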
The document discusses various applications and techniques of DNA microarrays, including summarizing key points about Affymetrix GeneChips, spotted microarrays, experimental design, data analysis, and several case studies on various topics like ovarian cancer, Sjogren's syndrome, wine yeast genomics, and norovirus genotyping. Microarrays allow analysis of gene expression patterns and copy number variations across genomes through comparative hybridization experiments. The document provides an overview of microarray technology and applications in genomic and biomedical research.
Gene Expression - Microarrays discusses analyzing gene expression data from microarray experiments. It describes the basic workflow including experimental design, sample preparation, hybridization, image analysis, preprocessing, normalization, and statistical analysis. Key points are that microarrays allow measuring expression of thousands of genes simultaneously, and proper experimental design and data analysis are important to draw meaningful biological conclusions from microarray data.
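The normalization step in that workflow is often quantile normalization, which forces every array (column) to share the same empirical distribution. A minimal numpy sketch on a toy genes-by-arrays matrix; ties are broken by sort order here, a simplification over implementations that average tied ranks.

```python
import numpy as np

data = np.array([[5., 4., 3.],
                 [2., 1., 4.],
                 [3., 4., 6.],
                 [4., 2., 8.]])   # rows: genes, columns: arrays (toy values)

ranks = data.argsort(axis=0).argsort(axis=0)     # rank of each value per array
row_means = np.sort(data, axis=0).mean(axis=1)   # mean value at each rank
normalized = row_means[ranks]                    # substitute rank means back
print(normalized)
```

After this step every column is a permutation of the same four values, so all arrays have identical distributions and between-array technical differences in scale are removed.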
Genome in a Bottle is working to characterize difficult variants in human genomes to enable benchmarking of sequencing technologies and bioinformatics methods. They have extensively characterized five human genomes and are now focusing on large insertions, deletions, and structural variants over 20 base pairs. This work presents many challenges due to limitations in detection and representation of large variants. Genome in a Bottle is integrating calls from multiple technologies and approaches to refine sequence-resolved variants and provide benchmark variant call files.
Using accurate long reads to improve Genome in a Bottle Benchmarks 220923 (GenomeInABottle)
The Genome in a Bottle Consortium has used accurate long reads to characterize variants in difficult genomic regions for 7 human genomes. Long and linked reads improved the small variant benchmark by expanding reference coverage and the number of called variants. Accurate long reads were also essential for generating benchmarks for medically relevant genes and for improving benchmarks on chromosomes X and Y. Ongoing work includes developing RNA sequencing benchmarks from long reads and generating the first tumor/normal cell line benchmark.
This document summarizes Christopher Mason's presentation on epigenetics quality control and single-cell RNA-seq variant calling using samples from the Genome in a Bottle project. It discusses generating reference epigenetics datasets, including whole genome bisulfite sequencing data, Illumina 450K methylation array data, and targeted bisulfite sequencing data for several GIAB samples. Parameters for variant calling from single-cell RNA-seq data are evaluated, finding best sensitivity and specificity at 97% and 80% respectively using certain settings. The work aims to establish high quality epigenetics and variant calling references to help benchmark computational methods for personalized medicine.
Genome Editing Comes of Age; CRISPR, rAAV and the new landscape of molecular ... (Candy Smellie)
Information is no longer a bottleneck; the emphasis is shifting to 'what does it all mean?'
In a translational context, we hope that by answering that question we will be able to characterise the genetics that drive disease, and to develop drugs and diagnostics that are personalised to patients.
Genome editing provides the link between that information and that outcome, by allowing scientists to recapitulate specific genetic alterations in any gene in any living tissue to probe function, develop disease models and identify therapeutic strategies. So, not only do we now have unparalleled access to genetic information, but we also have the tools to accurately understand what this genetic information means, with genome editing allowing us to explore the genetic drivers of disease in physiological models.
AAV is a single-stranded, linear DNA virus with a 4.7 kb genome which, for the purpose of genome editing, is replaced almost in its entirety with the targeting vector sequence (except for the ITRs).
It is, in effect, a highly efficient DNA delivery mechanism.
After entry of the vector into the cell, the target-specific homologous DNA is believed to activate and recruit HR-dependent repair factors, and can induce HR at rates approximately 1,000 times greater than plasmid-based double-stranded DNA vectors, though the mechanism by which it achieves this is still largely unknown.
By including a selection cassette, one can select for cells that have integrated the targeting vector, and then screen for clones which have undergone targeted insertion rather than random integration; the targeted fraction will generally be around 1%.
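The ~1% targeting rate quoted above implies a concrete screening burden. A back-of-envelope calculation (the 1% figure is from the slides; the 95% confidence target is our illustrative choice):

```python
import math

p_targeted = 0.01   # on-target fraction among integrants (from the slides)
confidence = 0.95   # illustrative confidence target, not from the slides

# P(at least one targeted clone among n) = 1 - (1 - p)^n >= confidence
n_clones = math.ceil(math.log(1 - confidence) / math.log(1 - p_targeted))
print(n_clones)   # 299
```

So at a 1% rate, roughly 300 drug-resistant clones must be screened to be 95% sure of recovering at least one correctly targeted clone.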
Here are the steps to prove the claim:
1) By Lagrange's theorem, the order of any subgroup H of G divides the order of G.
2) Since the order of G is p^n, the only possible orders for subgroups are 1, p, p^2, ..., p^n.
3) Each factor group in a composition series for G is then a simple group of prime-power order; such a group has a nontrivial centre and hence is abelian (cyclic of order p), so G is solvable.
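The jump from prime-power subgroup orders to solvability can be written as a chain of standard implications (a sketch, using the well-known fact that every nontrivial p-group has a nontrivial centre):

```latex
\begin{align*}
|G| = p^n
  &\Rightarrow \text{each composition factor } H_{i+1}/H_i
      \text{ is simple of order } p^k,\ k \ge 1 \\
  &\Rightarrow Z(H_{i+1}/H_i) \neq 1 \text{ is normal, so }
      Z(H_{i+1}/H_i) = H_{i+1}/H_i \\
  &\Rightarrow H_{i+1}/H_i \text{ is abelian, hence }
      H_{i+1}/H_i \cong \mathbb{Z}/p\mathbb{Z} \\
  &\Rightarrow G \text{ is solvable (all composition factors abelian).}
\end{align*}
```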
Microarray-based comparative genomic hybridisation (Dr. Yogesh D)
This is a brief introduction to the technique and principle of Array Comparative Genomic Hybridization. Array CGH is a powerful tool for genetic testing and has been enormously useful in cancer cytogenetics, prenatal genetic testing etc.
Objectives are an understanding of:
▶ Homology search tools
▶ E-values
▶ how BLAST works
▶ how profile HMMs (hmmer) work
▶ which is the right tool for different questions
This document summarizes the Genome in a Bottle (GIAB) Consortium's efforts to characterize structural variants in human genomes to serve as benchmarks. The GIAB Consortium has generated structural variant calls for 7 human genomes using diverse data types and analysis methods. The document describes the GIAB Consortium's process for integrating these data to identify high-confidence structural variant calls to include in version 0.6 of the structural variant benchmark set. It provides examples of different types of structural variants characterized and evaluates the trustworthiness of the benchmark calls based on independent validation. The document also discusses ongoing efforts to further improve structural variant characterization using emerging long-read technologies.
The document provides an introduction to epistasis detection in genome-wide association studies (GWAS). It defines epistasis as the detection of causal SNPs for a disease through their interactions, rather than their individual effects. It outlines the problem of epistasis detection as analyzing large genotype datasets to find combinations of SNPs that maximize an association measure with binary disease status. Popular measures discussed are chi-squared and mutual information statistics. The document reviews computational methods for epistasis detection, including Multifactor Dimensionality Reduction, SNPHarvester, and SNPRuler. It notes the challenges of reducing computational burden and detecting higher-order epistatic interactions.
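The exhaustive pairwise search that review describes can be sketched directly: cross-tabulate each SNP pair's 3x3 joint genotypes against case/control status and score the pair with a chi-square statistic. The data here are simulated with a hypothetical interaction effect; the named tools (MDR, SNPHarvester, SNPRuler) add heuristics and pruning to tame the O(p^2) cost on real datasets.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, p = 600, 20
G = rng.integers(0, 3, size=(n, p))          # genotypes coded 0/1/2
# Simulated interaction: elevated disease risk only when SNPs 0 AND 1
# both carry a minor allele
risk = (G[:, 0] > 0) & (G[:, 1] > 0)
y = (rng.random(n) < np.where(risk, 0.8, 0.2)).astype(int)

def chi2_pair(a, b, y):
    """Chi-square of the 3x3 joint-genotype table against binary status."""
    stat = 0.0
    for ga in range(3):
        for gb in range(3):
            cell = (a == ga) & (b == gb)
            m = cell.sum()
            if m == 0:
                continue
            for cls in (0, 1):
                observed = ((y == cls) & cell).sum()
                expected = m * (y == cls).mean()
                stat += (observed - expected) ** 2 / expected
    return stat

scores = {(i, j): chi2_pair(G[:, i], G[:, j], y)
          for i, j in itertools.combinations(range(p), 2)}
best = max(scores, key=scores.get)
print(best)   # the simulated causal pair (0, 1)
```

Even at p = 20 this evaluates 190 pairs; at GWAS scale (p in the hundreds of thousands) the quadratic blow-up is exactly the computational burden the document highlights.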
This document provides an introduction to embedded computer architecture. It defines embedded computing systems as devices that include programmable computers but are not general-purpose. Examples include cell phones, printers, vehicles, and appliances. Characteristics of embedded systems include sophisticated functionality, real-time operation, low cost, low power usage, and design by small teams. The document discusses microprocessors, memory, instruction sets, and programming models used in embedded systems. It also covers topics like digital signal processors, endianness, assembly language, and bus-based computer architectures.
This document discusses single nucleotide polymorphism (SNP) data analysis. It defines key terms like SNPs, genotypes, haplotypes, and linkage disequilibrium. It describes techniques for SNP genotyping like PCR and challenges like phasing ungenotyped SNPs and inferring haplotypes. International projects like HapMap are referenced that aim to construct a haplotype map of the human genome to reveal patterns of genetic variation.
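The linkage disequilibrium concept defined there reduces to two quantities on phased haplotypes: D = p_AB - p_A * p_B and the normalised r^2 statistic. A toy computation (the eight haplotypes below are made up for illustration):

```python
import numpy as np

# Each row is a phased haplotype; columns are two biallelic SNPs
# (0 = major allele, 1 = minor allele)
hap = np.array([[0, 0], [0, 0], [0, 0], [1, 1],
                [1, 1], [1, 1], [0, 1], [1, 0]])

p_a = hap[:, 0].mean()                 # minor-allele frequency at SNP A
p_b = hap[:, 1].mean()                 # minor-allele frequency at SNP B
p_ab = (hap[:, 0] & hap[:, 1]).mean()  # frequency of the 1-1 haplotype

D = p_ab - p_a * p_b                   # deviation from independence
r2 = D ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
print(round(D, 4), round(r2, 4))       # 0.125 0.25
```

Note that D and r^2 require phased haplotypes (or inferred phase); estimating them from unphased genotypes is precisely the phasing/haplotype-inference challenge the document describes.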
This document summarizes a study that used the BigLD algorithm to partition haplotype blocks in chromosome 21 of the NARAC genomic dataset. The researchers:
1) Applied the BigLD algorithm and three other methods (FGT, CIT, SSLD) to detect haplotype blocks in a portion of chromosome 21.
2) Analyzed and compared the blocks detected by each method based on parameters like block size, number of blocks, and genomic coverage.
3) Found that BigLD produced the fewest and largest blocks, indicating more robust partitioning compared to the other methods.
The document outlines the steps taken to prepare genetic data for analysis in R, including reading in the data, removing underscores, imputing missing values, recoding alleles, and measuring linkage disequilibrium (LD). Key steps include converting the raw data to a matrix without underscores, imputing missing values using codeGeno, and implementing BigLD to partition SNPs into LD blocks and generate a heatmap of LD on chromosome 21. The entire process is timed, taking approximately 11 minutes to complete.
3. Minia University
Faculty of Engineering
Biomedical Engineering Department
Fatma Sayed Ibrahim
Master of Science thesis defense
Wednesday, January 27, 2021
Algorithm Implementation of Genetic Association Analysis
for Rheumatoid Arthritis Data Based on Haplotype Blocks
5. Supervisors
Prof. Dr. Hesham Fathy A. Hamed
Former Dean of the Faculty of Engineering, Minia University
Professor at the Egyptian-Russian University
Dr. Ashraf Mahroos Said
Associate Professor
Biomedical Engineering Department, Minia University
Dr. Mohamed Nagy Saad
Assistant Professor
Biomedical Engineering Department, Minia University
6. Thesis committee members
Dr. Muhammad Ali M. Rushdi
Biomedical Engineering and Systems Department
Faculty of Engineering, Cairo University
Dr. Essam Halim Houssein
Vice-Dean for Postgraduate Studies and Research Affairs
Faculty of Computers and Information, Minia University
Prof. Dr. Hesham Fathy A. Hamed
Former Dean of the Faculty of Engineering, Minia University
Professor at the Egyptian-Russian University
Dr. Ashraf Mahroos Said
Associate Professor
Biomedical Engineering Department, Minia University
7. Outline
1. Introduction
2. Literature review
3. Data description
4. Pre-processing
5. Methods
6. Results
7. Conclusion
20. The minor allele frequency (MAF)
...ATGTCACACGTACTT...
...ATGTCACACGTACTT...
...ATGACACAGGTACTT...
...ATGTCACAGGTACTT...
...ATGTCACAGGTACTT...
...ATGACACAGGTACTT...
...ATGTCACAGGTACTT...
...ATGTCACAGGTACTT...
...ATGACACACGTACTT...
...ATGACACAGGTACTT...
(SNP1 is the T/A site, SNP2 the C/G site)

                     SNP1   SNP2
Allele 1             T      C
Allele 2             A      G
Allele 1 count       6      3
Allele 2 count       4      7
Allele 1 frequency   60%    30%
Allele 2 frequency   40%    70%
Major allele         T      G
Minor allele         A      C
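The allele counts on this slide can be reproduced programmatically. A minimal Python sketch (the thesis pipeline itself works in R; the sequences and SNP positions below are taken from the slide):

```python
from collections import Counter

# The ten phased sequences from the slide; within this 15-bp window,
# SNP1 is the T/A site at index 3 and SNP2 the C/G site at index 8.
seqs = [
    "ATGTCACACGTACTT", "ATGTCACACGTACTT", "ATGACACAGGTACTT",
    "ATGTCACAGGTACTT", "ATGTCACAGGTACTT", "ATGACACAGGTACTT",
    "ATGTCACAGGTACTT", "ATGTCACAGGTACTT", "ATGACACACGTACTT",
    "ATGACACAGGTACTT",
]

def maf(alleles):
    """Frequency of the least common (minor) allele at one site."""
    counts = Counter(alleles)
    return min(counts.values()) / len(alleles)

snp1 = [s[3] for s in seqs]  # T appears 6 times, A appears 4 times
snp2 = [s[8] for s in seqs]  # C appears 3 times, G appears 7 times
print(maf(snp1), maf(snp2))  # 0.4 0.3
```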
46. Why this point?
• Genetic variations influence our predisposition to disease; every disease
has a genetic component, even infectious diseases.
• Complex diseases are very common in societies. In particular, chronic
conditions hugely affect a person's productivity and quality of life.
• Haplotype blocks are much more effective and powerful in such cases.
46
Motivations
Introduction Literature review Data description pre-processing Methods Results Conclusion
47. Why this point?
• The gap in knowledge in this field (especially regarding MAF): many
questions remain unanswered.
• Few Arab researchers work in this field, especially in Minia.
47
Introduction Literature review Data description pre-processing Methods Results Conclusion
48. 48
Research Objectives
Introduction Literature review Data description pre-processing Methods Results Conclusion
• Practically implement computational algorithms to partition
genotyped data based on the haplotype blocks.
• Find the best haplotype partitioning method applied for the
whole-genome case-control dataset to reduce the number
of SNPs in the association study.
55. The output

Start index   End index   Start rsID    End rsID
                          rs22572xxx    rs225722xxx
                          rs74307xxx    rs198574xxx
56. Major findings
Introduction Literature review Data description pre-processing Methods Results Conclusion
• Practical data exploration, uncovering interesting findings
• Investigation of partitioning methods from the literature review and an
empirical comparative study (biomarker reduction with high SNP correlation)
• A sequence of data preprocessing steps in R (hoped to become an R package)
• The role of MAF in haplotype block partitioning
62. Literature review
62
Introduction Literature review Data description pre-processing Methods Results Conclusion
The main projects in genomics and haplotype block partitioning methods
63. Timeline of the main projects in genomics
Human Genome Project (HGP):
1990  The announcement of the HGP
2001  The initial HGP sequencing
2003  The completion of the HGP
International HapMap Project:
2002  The announcement of the International HapMap Project
2005  HapMap Phase I completion
2007  HapMap Phase II completion
2009  HapMap Phase III completion
The 1000 Genomes Project (1KGP):
2008  The announcement of the 1KGP
2010  The completion of the pilot phase
2012  The 1KGP fulfills its goal
2015  The 1KGP's completion
The 100,000 Genomes Project:
2012  The 100,000 Genomes Project's announcement
2015  Northern Ireland and Scotland join the project
2019  Beginning of the initiative to involve the public in genomic research
68. Haplotype block partitioning methods, from 2001 to 2003:
• Hidden Markov model (HMM), 2001
• Greedy algorithm (GA), 2002
• Dynamic programming (DP), 2002
• Confidence interval (CI), 2002
• Four-gamete test (FGT), 2002
• The minimum description length (MDL), 2003
Key contributors: Ning Wang, Kui Zhang, Stacey Gabriel, Nila Patil, Mark Daly, Mikko Koivisto
69. Haplotype block partitioning methods, from 2005 to 2013:
• Solid spine of LD (SSLD), 2005
• Markov chain Monte Carlo (MCMC) algorithm, 2008
• Xor-genotypes, 2009
• Wavelet transforms, 2011
• GA-SVM algorithm, 2013
Key contributors: Jeffrey Barrett, Pattaro
70. Haplotype block partitioning methods, from 2014 to 2020:
• MIG++, 2014
• S-MIG++, 2015
• Big-LD, 2018
• Neutrosophic c-means (NCM) algorithm, 2020
• LDBlockShow, 2020
Key contributors: Daniel Taliun, Sunah Kim
73. Introduction Literature review Data description pre-processing Methods Results Conclusion
Data description and exploration
NARAC dataset description:
• Map file
• SNP array data file
• Missing data
• Participants (% cases, % controls, % male, % female)
• SNP annotation
• Alleles distribution
• Genotype distribution
• Rare SNPs
• Very rare SNPs
Visualization from the MAF viewpoint:
• MAF for each SNP
• Low-frequency SNPs
• Common SNPs
• % male, % female
74. Sample of the SNPs' array data

ID         Affection  Sex  DRB1_1  DRB1_2  SENum  SEStatus  Anti-CCP  RFUW  rs10439884
D0024949   0          F    0101    0401    SS     yes       ?         ?     G_G
D0024302   0          F    0101    7       SN     yes       ?         ?     G_G
D0023151   0          F    0101    11      SN     yes       ?         ?     G_G
D0022042   0          F    0101    2       SN     yes       ?         ?     G_G
D0021275   0          F    0101    7       SN     yes       ?         ?     G_G
D0021163   0          F    0101    0403    SN     yes       ?         ?     G_G
D0020795   0          F    0101    3       SN     yes       ?         ?     G_G
6045201    1          F    0101    7       SN     yes       31.3      142   G_G
D0023027   0          M    0101    3       SN     yes       ?         ?     G_G
1015200    1          M    0101    0403    SN     yes       112.9     405   ?_?
D0015941   0          F    2       7       NN     no        ?         ?     A_G
D0016405   0          F    0101    7       SN     yes       ?         ?     ?_?
KNH763243  1          M    0404    0301    SN     yes       99        ?     G_G
75. Sample of the map file

Chromosome  rsID        Position
1           rs3094315   792429
1           rs12562034  808311
11          rs3802985   188510
11          rs3741411   189256
21          rs2821850   13693682
21          rs2257226   13695103
76. North American Rheumatoid Arthritis Consortium (NARAC) dataset

         Cases (RA)  Controls  Total
Male     227         342       569
Female   641         852       1,493
Total    868         1,194     2,062
97. Our data consist of about 545,080 SNPs for
about 2,062 individuals (cases and controls).
Matrix size = 2,062 × 545,080 = 1,123,954,960 cells,
about 5,619,774,800 characters
(taking into account that each genotype, homozygous or heterozygous,
is stored as a string).
Introduction Literature review Data description pre-processing Methods Results Conclusion
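The sizes quoted on this slide can be checked with quick arithmetic; the factor of 5 characters per stored genotype string is inferred from the slide's own totals:

```python
n_snps, n_individuals = 545_080, 2_062

cells = n_individuals * n_snps   # one genotype per individual per SNP
chars = cells * 5                # assumed ~5 characters per stored genotype string
print(cells)   # 1123954960
print(chars)   # 5619774800
```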
98. SNPs from chromosome 1 to chromosome 22:
531,689 SNPs × 2,062 participants = 1,096,342,718 cells
Introduction Literature review Data description pre-processing Methods Results Conclusion
99. Reading and cropping
Starting with reading the genotyped data and removing the first 9 columns.
Introduction Literature review Data description pre-processing Methods Results Conclusion
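A minimal sketch of this step in Python (the thesis does it in R; the column layout follows the SNP-array sample slide, with 9 leading phenotype columns before the first SNP, and a tiny in-memory stand-in replaces the real NARAC file):

```python
# Stand-in for the whitespace-delimited NARAC file:
# 9 leading columns (ID .. RFUW), then the SNP genotypes.
raw = """\
ID Affection Sex DRB1_1 DRB1_2 SENum SEStatus AntiCCP RFUW rs10439884 rs2260810
D0024949 0 F 0101 0401 SS yes ? ? G_G A_A
6045201 1 F 0101 7 SN yes 31.3 142 A_G A_G
"""

rows = [line.split() for line in raw.splitlines()]
header, records = rows[0], rows[1:]

snp_names = header[9:]            # crop: drop the first 9 columns
genotypes = [r[9:] for r in records]
print(snp_names)   # ['rs10439884', 'rs2260810']
print(genotypes)   # [['G_G', 'A_A'], ['A_G', 'A_G']]
```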
101. 101
rs10439884 rs2260810 rs1296971 rs2257224
GG AA AA GG
GG AA AA GG
GG AA AA GG
GG AA AA AA
GG AA AA GG
GG AA AA GG
GG AA AA GG
GG GG CC AG
AG AG AC AG
AG AG AC GG
NA AG AC AG
GG AA AA GG
Introduction Literature review Data description pre-processing Methods Results Conclusion
102. Convert the genotype matrices and their map file into gpData form in preparation
for the codeGeno function.
Introduction Literature review Data description pre-processing Methods Results Conclusion
103. Imputation
using marginal allele distribution
Introduction Literature review Data description pre-processing Methods Results Conclusion
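The idea of marginal-distribution imputation can be sketched as follows. This is a simplified Python stand-in for what synbreed's codeGeno does in the thesis pipeline; the sampling weights come from the observed genotypes in the same SNP column:

```python
import random
from collections import Counter

def impute_column(column, rng=None):
    """Replace missing entries ('NA') by draws from the marginal
    distribution of the observed genotypes in the same column."""
    rng = rng or random.Random(0)
    freqs = Counter(g for g in column if g != "NA")
    pool, weights = zip(*freqs.items())
    return [g if g != "NA" else rng.choices(pool, weights=weights)[0]
            for g in column]

col = ["GG", "GG", "AG", "NA", "GG", "AA", "NA"]
print(impute_column(col))  # every 'NA' replaced by GG, AG, or AA
```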
105. Imputed data in bi-allelic format
Introduction Literature review Data description pre-processing Methods Results Conclusion
106. After imputation and recoding using the synbreed R package
Introduction Literature review Data description pre-processing Methods Results Conclusion
0 == homozygous for the reference (major) allele
1 == heterozygous
2 == homozygous for the minor allele
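In the usual synbreed-style coding, the value counts copies of the minor allele (0 = homozygous major, 1 = heterozygous, 2 = homozygous minor). A few-line Python stand-in for the recoding:

```python
from collections import Counter

def recode_012(genotypes):
    """Minor-allele dosage coding: 0 = homozygous major,
    1 = heterozygous, 2 = homozygous minor."""
    allele_counts = Counter(a for g in genotypes for a in g)
    minor = min(allele_counts, key=allele_counts.get)
    return [g.count(minor) for g in genotypes]

print(recode_012(["GG", "GG", "AG", "GG", "AA"]))  # [0, 0, 1, 0, 2]
```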
107. The output of imputation and
recoding
Pre-processed dataset
Introduction Literature review Data description pre-processing Methods Results Conclusion
108. From bi-allelic format to 1,2,3,4 format
A🡪1
C🡪2
G🡪3
T🡪4
Introduction Literature review Data description pre-processing Methods Results Conclusion
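This letter-to-number conversion is a direct mapping; a minimal sketch:

```python
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}

def to_numeric(genotype):
    """Map a bi-allelic genotype string such as 'AG' to its 1,2,3,4 form."""
    return tuple(CODE[allele] for allele in genotype)

print(to_numeric("AG"), to_numeric("TT"))  # (1, 3) (4, 4)
```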
109. Preparing data for Haploview (linkage format)
Columns: Family ID, Individual ID, Paternal ID, Maternal ID, sex, affection status,
followed by the SNP genotypes (SNP 1, SNP 2, ...)
Introduction Literature review Data description pre-processing Methods Results Conclusion
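Assembling one row of a Haploview linkage-format file can be sketched like this; the six pedigree columns are standard for the format, and the identifiers below are made up for illustration:

```python
def ped_row(family_id, individual_id, paternal_id, maternal_id,
            sex, affection, genotypes):
    """One linkage-format row: six pedigree columns, then each
    SNP's two alleles (already in 1,2,3,4 coding)."""
    fields = [family_id, individual_id, paternal_id, maternal_id,
              str(sex), str(affection)]
    for a1, a2 in genotypes:
        fields += [str(a1), str(a2)]
    return " ".join(fields)

# Hypothetical individual: female (2), affected (2), two SNPs
print(ped_row("F1", "IND1", "0", "0", 2, 2, [(3, 3), (1, 3)]))
# F1 IND1 0 0 2 2 3 3 1 3
```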
110. Introduction Literature review Data description pre-processing Methods Results Conclusion
Minor allele frequency (MAF) quality control
111. Why study the effect of MAF?
In 2019, Saad et al. showed that different MAF thresholds
discard significant SNPs while the block size stays the same:
the same-sized LD block contains 8 SNPs in the CIT
and 12 SNPs in SSLD.
114. 114
Introduction Literature review Data description pre-processing Methods Results Conclusion
Methods and workflow
The proposed methods for haplotype block partitioning
116. Flowchart and system description
1. Start: the NARAC chromosome-21 genotype dataset and its map file (chromosome-21 positions).
2. Reformat the dataset.
3. Imputation using ImputR.
4. Biomarker check.
5. Apply five MAF thresholds: 0.001, 0.01, 0.02, 0.05, and 0.1.
6. For each threshold, partition haplotype blocks with Haploview and with R (BigLD).
7. Comparison and calculations.
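The fan-out over the five thresholds amounts to filtering the SNP set before each partitioning run. A schematic Python loop (the MAF values below are invented for illustration; the real runs feed Haploview and BigLD):

```python
MAF_THRESHOLDS = [0.001, 0.01, 0.02, 0.05, 0.1]

def passing_snps(mafs, threshold):
    """Indices of SNPs whose MAF meets the quality-control threshold."""
    return [i for i, m in enumerate(mafs) if m >= threshold]

mafs = [0.005, 0.03, 0.12, 0.0005, 0.08]   # toy MAF values
for t in MAF_THRESHOLDS:
    kept = passing_snps(mafs, t)
    print(t, kept)   # each threshold feeds a separate partitioning run
```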
117. Flowchart and system description: NARAC 22-chromosome input files
1. Input: NARAC genomic data (2,062 individuals) and the NARAC map file (545,080 SNPs).
2. Perl: split into per-chromosome data and map files (ch1 to ch22).
3. R pre-processing: imputation and recoding.
4. Reformatting for Haploview and reformatting for BigLD
(chromosome-separated data and map files).
5. Apply five MAF thresholds: 0.001, 0.01, 0.02, 0.05, and 0.1.
6. Haplotype block partitioning with FGT, CIT, SSLD, and BigLD.
7. Output: haplotype blocks for the 22 chromosomes using 4 methods and 5 MAF thresholds.
118. The same workflow in brief: the ch1 to ch22 data and map files are pre-processed
(imputation and recoding), reformatted for Haploview and for BigLD, filtered at the
five MAF thresholds (0.001, 0.01, 0.02, 0.05, 0.1), and partitioned with FGT, CIT,
SSLD, and BigLD, yielding haplotype blocks for the 22 chromosomes.
124. Heatmap for the haplotype blocks detected by interval
graph modeling of clusters for a portion of chromosome 21,
from 9,993,822 bp to 14,137,685 bp.
125. • Confidence interval test (CIT)
• Four-gamete test (FGT)
• Solid spine of linkage disequilibrium (SSLD)
Haploview
133. Introduction Literature review Data description pre-processing Methods Results Conclusion
1) The total number of haplotype blocks
(chart comparing FGT, CIT, SSLD, and BigLD)
134. Introduction Literature review Data description pre-processing Methods Results Conclusion
1) The total number of haplotype blocks
The smaller the number of haplotype blocks, the greater the reduction rate.
135. 135
Introduction Literature review Data description pre-processing Methods Results Conclusion
The MAF and total number of haplotype blocks
136. Introduction Literature review Data description pre-processing Methods Results Conclusion
1) The total number of haplotype blocks
(chart over MAF thresholds 0.1, 0.05, 0.02, 0.01, and 0.001)
137. Introduction Literature review Data description pre-processing Methods Results Conclusion
2) Total number of blocks, considering the singletons
(chart comparing CIT, FGT, BigLD, and SSLD)
138. Introduction Literature review Data description pre-processing Methods Results Conclusion
2) Total number of blocks, considering the singletons
139. Introduction Literature review Data description pre-processing Methods Results Conclusion
3) Total number of SNPs in all blocks
(chart comparing SSLD, BigLD, FGT, and CIT)
141. Introduction Literature review Data description pre-processing Methods Results Conclusion
4) The total length of all blocks (bp)
(chart; SSLD)
142. Introduction Literature review Data description pre-processing Methods Results Conclusion
4) The total length of all blocks (bp)
(chart over MAF thresholds 0.1, 0.05, 0.02, 0.01, and 0.001)
143. Introduction Literature review Data description pre-processing Methods Results Conclusion
5) Mean number of SNPs in blocks
The BigLD and SSLD have a higher mean number of SNPs in blocks than FGT and CIT.
144. Introduction Literature review Data description pre-processing Methods Results Conclusion
5) Mean number of SNPs in blocks
• MAF does not affect the mean number of SNPs in blocks much.
• The highest mean number of SNPs in blocks is in chromosome 6,
using SSLD with MAF=0.1, which equals 6.637.
145. Introduction Literature review Data description pre-processing Methods Results Conclusion
5) Mean number of SNPs in blocks
BigLD has almost the same mean number of SNPs in blocks across the range
from 0.001 to 0.05 and a higher mean number at MAF=0.1.
SSLD's mean number of SNPs in blocks increases as the MAF threshold increases.
146. Introduction Literature review Data description pre-processing Methods Results Conclusion
6) The mean block length in base pairs
(chart comparing SSLD, BigLD, CIT, and FGT)
147. Introduction Literature review Data description pre-processing Methods Results Conclusion
6) The mean block length in base pairs
(chart over MAF thresholds 0.1, 0.05, 0.02, 0.01, and 0.001)
148. Introduction Literature review Data description pre-processing Methods Results Conclusion
7) The mean r2 within blocks
The mean correlation r2 within the blocks is generally higher in BigLD.
(chart comparing BigLD, CIT, FGT, and SSLD)
149. Introduction Literature review Data description pre-processing Methods Results Conclusion
7) The mean r2 within blocks
(chart over MAF thresholds 0.1, 0.05, 0.02, 0.01, and 0.001)
In contrast, in the BigLD method, the mean r2 within a block
decreases as the MAF threshold increases.
150. Introduction Literature review Data description pre-processing Methods Results Conclusion
8) The mean r2 between consecutive blocks
(without considering the singleton blocks)
(chart comparing FGT, CIT, BigLD, and SSLD over MAF thresholds
0.1, 0.05, 0.02, 0.01, and 0.001)
151. 151
Introduction Literature review Data description pre-processing Methods Results Conclusion
8) The mean r2 between consecutive blocks
(without considering the singleton blocks)
152. Introduction Literature review Data description pre-processing Methods Results Conclusion
8) The mean r2 between consecutive blocks
(without considering the singleton blocks)
(chart comparing FGT, CIT, BigLD, and SSLD)
153. Introduction Literature review Data description pre-processing Methods Results Conclusion
9) The mean r2 between all consecutive blocks
(including singleton blocks)
(chart comparing FGT, CIT, SSLD, and BigLD)
154. Introduction Literature review Data description pre-processing Methods Results Conclusion
9) The mean r2 between all consecutive blocks
(including singleton blocks)
156. The agreement (matching percentage) of haplotype blocks
produced by the compared methods

Methods                       Matching percentage
FGT, CIT, and SSLD            67%
FGT, CIT, SSLD, and Big-LD    57.45%
FGT and Big-LD                78.6%
CIT and Big-LD                76.7%
SSLD and Big-LD               71.92%
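One simple way such an agreement percentage can be computed is by comparing block boundaries; the thesis's exact matching criterion may differ, so treat this as an illustrative sketch with toy partitions:

```python
def agreement(blocks_a, blocks_b):
    """Fraction of blocks in A whose (start, end) boundaries
    also occur as a block in B."""
    b_set = set(blocks_b)
    return sum(block in b_set for block in blocks_a) / len(blocks_a)

# Toy partitions given as (start index, end index) pairs
a = [(0, 4), (5, 9), (10, 12)]
b = [(0, 4), (5, 8), (10, 12)]
print(round(agreement(a, b) * 100, 1))  # 66.7
```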
158. Plot of a sample of chromosome 21 haplotype blocks produced by
FGT, CIT, SSLD, and Big-LD.
159.
160. The comparison between Big-LD, FGT, CIT, and SSLD haplotype block
partitioning methods in chromosome 21

Compared parameters                    Big-LD      FGT         CIT         SSLD
Max. No. of SNPs in each block         26          17          20          27
Total No. of blocks                    1,182       1,562       1,464       1,378
Max. block size (in bp)                190,708     140,491     178,064     218,644
Min. block size (in bp)                34          4           2           12
Percentage of uncovered SNPs           14.5%       12.8%       22.1%       4.9%
Median No. of SNPs within each block   4           4           3           4
Median block size (in bp)              9,830       7,551       6,783       10,870
Total block size (in bp)               23,932,662  23,452,817  23,696,256  23,696,256
164. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• The alleles distribution and description.
• The percentage of SNPs appearing at each physical location in the
chromosomes is affected by the SNP's MAF.
• Genotype imputation and preprocessing are crucial steps in HBP, and we
produced a preprocessing sequence that facilitates many kinds of genetic analysis.
165. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• HBP reduces the biomarkers to about 13%.
• The Big-LD method provided robust block partitioning in terms of
block size and genomic coverage.
166. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• There is about a 70% intersection agreement among most HBP methods;
Big-LD matched most with FGT.
• FGT produces modest results in terms of correlation and biomarker reduction.
• BigLD produced large haplotype blocks and showed high r2 between blocks,
and the lowest r2 between blocks when the singleton blocks are considered.
• In terms of computation, BigLD takes less than half the computational time
of the Haploview methods.
167. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• MAF quality control has a strong effect on haplotype block partitioning.
• We recommend taking the MAF into consideration when applying a
haplotype block partition. However, it is a tradeoff: a higher MAF
produces a higher correlation within the blocks, but it truncates a
portion of the data that could be significant.
• In terms of correlation, we recommend using a high MAF with the
Haploview methods, and a low or moderate MAF with the BigLD method.
168. Introduction Literature review Data description pre-processing Methods Results Conclusion
Conclusion
• We could answer the question related to MAF: the number of blocks
does not necessarily affect the number of SNPs within blocks.
• At the same MAF, the number of SNPs within blocks is highest in SSLD
(due to its block size) and lowest in CIT.
• At the same block size, the number of SNPs within blocks decreases
as the MAF increases.