Genotype imputation study in Gir dairy cattle of Gujarat

POSTGRADUATE INSTITUTE OF VETERINARY EDUCATION & RESEARCH,
KAMDHENU UNIVERSITY, GANDHINAGAR, GUJARAT
Young Scientist Award – SOCDAB 2019
“Selection of low density SNP panel and assessment of its efficiency for genotype
imputation to high density SNP panel in Gir cattle of Gujarat”
Dr Rajeshkumar Thakkar
M.V.Sc
Animal Genetics and Breeding
Dr. P. H. VATALIYA
MAJOR ADVISOR
Director of Extension Education
Dr. Nilesh Nayee
Research Mentor
Senior Manager AB group, NDDB

 Gujarat: two prominent cattle breeds, Gir and Kankrej
 Gir: one of the best Indian milch breeds
 Average milk yield 2,276±171.32 kg in the Gir herd of Junagadh (Dangar and Vataliya, 2015)
 Productivity is low compared to exotic breeds >> genetic improvement needed
 Breeding programs are mainly based on phenotypic recording
 This requires a large performance-recorded breeding population with pedigree, family and
progeny information, and complex statistical analysis
 Faster genetic improvement >>> DNA information
 Genomic era >>> whole genome sequence
INTRODUCTION
SNPs have opened up the prospect of large scale genotyping and GS

Selection using genomic predictions of economic merit early in life or selection based on
the estimation of the genetic value of candidates using information on dense markers (SNPs)
covering the genome
 SNP markers - track inheritance of chromosomal segments
by calculating GEBV using genomic (x) matrix
 Benefits of implementing GS
 ↑ Accuracy of selection
 ↓ Generation interval
 Constraints of implementing GS
 Requires a large reference population = ↑ cost
 Factors affecting GS: heritability of the trait, pedigree information, statistical methodology used,
linkage disequilibrium between SNP markers and QTL
Genomic Selection (GS)
(Source: Boichard et al. 2016)

 SNP genotyping technique
 Hybridization based methods e.g. SNP microarrays (DNA chips)
 Enzyme based e.g. RFLP and other PCR based methods
 Post amplification based e.g. HPLC and SNPlex
 SNP microarrays
Among these techniques, SNP microarrays are best suited to scoring many
SNPs in a multiplexed fashion
 Constraints with SNP chip
 Costly
 Need high density chip
↑ High Density = ↑ Reliability

 GS theory was proposed in 2001, before the actual technology was available
 In 2008 Illumina released its first 50K SNP chip for bovines
 Three main technology providers: Illumina, Affymetrix and GeneSeek
Illumina
Chip SNPs
3K 2,900
LD (7K) 6,909
LD2 (7K) 9,912
50KV1 54,001
50KV2 54,609
50KV3 53,714
HD (777K) 777,962
GeneSeek
Chip SNPs
G 7K 7,083
GGP 9K 8,762
GP2 20K 19,809
GP3 27K 26,151
GP4 30K 30,112
GHD 75K 77,068
GH2 140K 139,480
Affymetrix
Chip SNPs
Affy 10K 9,713
Affy 15K 15,036
Affy 25K 25,068
Affy 700K 648,875
Available SNP chips

 i_p_ta_io_ c_nsi_t_ i_ pr_di_t_n_ t_e m__s__g l_t_er_ _i_h__a w__d
o_ a s__t__c_
( CLUE / DICTIONARY)
 imputation consists in predicting the missing letters within a word or
a sentence
T A G T G A T
A T C A C T A
10-15K
54K T G A C A G C A G T C A G C T T A C G T A C A G A T C
A C T G T C G T C A G T C G A A T G C A T G T C T A G
Imputation methods determine whether a chromosome segment is IBD
Core concept behind imputation
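As a toy illustration of this idea (not the algorithm used by BEAGLE or by any other program named in this presentation), the sketch below fills the untyped positions of a low density haplotype by copying from the reference haplotype that agrees best at the typed positions. The sequences and the function name `impute_from_best_match` are made up for the example.

```python
def impute_from_best_match(target, reference_haplotypes, missing="_"):
    """Fill missing alleles in `target` from the best-matching reference haplotype."""
    def matches(ref):
        # Count agreement only at positions that are typed in the target.
        return sum(1 for t, r in zip(target, ref) if t != missing and t == r)

    best = max(reference_haplotypes, key=matches)
    # Keep typed alleles as they are; copy the reference allele where untyped.
    return "".join(r if t == missing else t for t, r in zip(target, best))


reference = ["TGACAGCAGT", "ACTGTCGTCA"]  # hypothetical dense reference haplotypes
low_density = "T__C__C__T"                # hypothetical target typed at every third marker
print(impute_from_best_match(low_density, reference))  # prints TGACAGCAGT
```

Real imputation programs work on short chromosome segments and on probabilities rather than on a single whole-sequence match, so this only conveys the intuition of shared (IBD) segments.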

 GS >> Changing breeding programs around the world
 SNP array technology >> 98-99% SNP call rates
 e.g. with 50,000 SNPs and a 99% call rate, about 500 genotypes are missing per animal; for larger arrays the
number of missing genotypes is even higher
 Missing genotypes complicate the implementation of GS and GWAS
 X matrix will be incomplete,
 Imputation can be used to infer these missing genotypes
Genotype Imputation
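The arithmetic behind the missing-genotype figure can be checked with a short sketch. The helper name `expected_missing` is hypothetical, and the result is the expected count per animal under the simplifying assumption that every SNP has the same call rate.

```python
def expected_missing(n_snps, call_rate):
    """Expected number of uncalled genotypes per animal for a given call rate."""
    return round(n_snps * (1.0 - call_rate))


for call_rate in (0.99, 0.98):
    print(call_rate, expected_missing(50_000, call_rate))
```

At a 99% call rate a 50,000 SNP array leaves about 500 genotypes uncalled, and at 98% about 1,000.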

 The cost of genotyping may be decreased by combining low and high density SNP panels,
with low density genotypes imputed up to high density (Habier et al., 2009)
 The limited effective population sizes and population structures in livestock allow the
possibility of imputation of high-density genotypes from quite low-density genotypes
(Boichard et al., 2012)
 Imputation from low density to 50K SNP panels is common practice in genomic breeding
programs for dairy cattle (Wiggans et al., 2012), pigs (e.g. Huang et al., 2012a) and poultry
(e.g. Fulton, 2012), and has been investigated for sheep (Hayes et al., 2012)
History of Imputation In Animal Breeding

Imputation methods
1. Family based
-Use linkage, Parent offspring trios and Mendelian segregation rules
2. Population based
-Use linkage disequilibrium information between missing SNPs and the observed flanking
SNPs

Columns: imputation program; uses family information (Y/N); reference
Merlin Y Abecasis et al. (2002)
fastPHASE N Scheet and Stephens (2006)
Beagle N Browning and Browning (2007)
IMPUTE N Howie et al. (2009)
PHASEBOOK Y Druet and Georges (2010)
DAGPHASE Y Druet and Georges (2010)
Multivariate BLUP Y Calus et al. (2011)
Findhap Y VanRaden et al. (2011)
FImpute Y Sargolzaei et al. (2011)
CHROMIBD Y Druet and Farnir (2011)
AlphaImpute Y Hickey et al. (2011)
PedImpute Y Nicolazzi et al. (2013)
Minimac N Howie et al. (2012)
Available Imputation Software

[Figure: reference population and test population. Source: Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11:499-511.]
Population Based Imputation

 Accuracy = correlation of real and imputed genotypes
 Concordance = percentage (%) of genotypes called correctly
 Depends on
 Imputation method/software used
 Size of reference set (bigger the better)
 Density of markers
 Frequency of SNP alleles (MAF)
 Genetic relationship to reference
 Species, the genetic structure and history of the population
Imputation Accuracy
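The two measures defined on this slide can be sketched as follows, assuming genotypes are coded as allele counts 0, 1 or 2. The function names and the example calls are illustrative only; the study itself computed concordance with an R script.

```python
from math import sqrt


def concordance(true_genotypes, imputed_genotypes):
    """Percentage of genotypes whose imputed call equals the true call."""
    same = sum(1 for t, i in zip(true_genotypes, imputed_genotypes) if t == i)
    return 100.0 * same / len(true_genotypes)


def accuracy(true_genotypes, imputed_genotypes):
    """Pearson correlation between true and imputed genotypes (coded 0/1/2)."""
    n = len(true_genotypes)
    mean_t = sum(true_genotypes) / n
    mean_i = sum(imputed_genotypes) / n
    cov = sum((t - mean_t) * (i - mean_i)
              for t, i in zip(true_genotypes, imputed_genotypes))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_i = sum((i - mean_i) ** 2 for i in imputed_genotypes)
    return cov / sqrt(var_t * var_i)


# Hypothetical calls for ten SNPs in one animal; two calls disagree.
true_calls = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
imputed_calls = [0, 1, 2, 1, 1, 2, 1, 0, 0, 2]
print(concordance(true_calls, imputed_calls))  # 80.0
print(accuracy(true_calls, imputed_calls))
```

Concordance treats every wrong call equally, while the correlation penalises a 0 imputed as 2 more than a 0 imputed as 1, so the two measures need not rank panels identically.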

 Fill in missing genotypes from the lab
 Merge data sets with genotypes on different arrays
 E.g. Illumina, Affymetrix and GeneSeek data
 Impute from low density to high density (save cost of breeding programme)
 7K > 50K > 700K > up to whole genome sequence level
 Capture power of higher density
 Better accuracy
Application of Imputation

 To study the performance of the 50K chip (INDUSCHIP-1) in the Gir cattle population of
Gujarat
 To study the efficiency of imputation from 50K (INDUSCHIP-1) to 777K (HD) level
in the Gir population of Gujarat
 To design a custom LD chip (7-15K) for the Gir cattle population
 To evaluate the efficiency of imputation from the custom LD chip (7-15K, INDUSLD) to
50K chip (INDUSCHIP-1) level
Objectives of Research Work

Step I: INDUSCHIP-1 (50K) >> Bovine HD (777K)
Step II: Selected LD (10-15K) >> INDUSCHIP-1 (50K)
Imputation Methodology

 Time >> January to September (2018)
 Location of work >> Kamdhenu University, Gandhinagar
National Dairy Development Board (NDDB), Anand
 Collaboration >> Kamdhenu University and NDDB
 Sources of genotype data/animals for genotyping
>> Genotype data of a total of 1,019 Gir cows (117 HD and 902 INDUSCHIP-1) were used for the
present study; these data were made available by NDDB
MATERIALS AND METHODS

 PLINK [1.9b5.2] (Shaun Purcell, 2017)
 Data QC
 Test, validation and reference files
 BEAGLE [3.3.2] (Browning and Browning, 2011)
 Phasing of reference file, imputation
 R statistical software [3.5.1] (R Core Team, 2017)
 Imputation concordance
 Graphical representation of data
Software used

The genotype data were checked for quality using PLINK. SNPs and animals meeting any of the
following criteria were removed:
 SNPs with a MAF below 1%
 SNPs with a call rate less than 0.90
 Animals with a call rate over all SNPs less than 0.90
 SNPs with a p-value below 10^-5 for the Hardy-Weinberg equilibrium test
 SNPs located in non-autosomal regions
 SNPs that had the same genomic coordinates, i.e. mapped to the same position (only the
replicates were removed)
Quality control criteria
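The per-SNP MAF and call rate filters can be illustrated with a small sketch. In the study these filters were applied with PLINK; the code below is only a plain Python rendering of the two thresholds, with hypothetical function names, genotypes coded as allele counts 0/1/2 and `None` for a missing call. The HWE, non-autosomal and duplicate-position filters are not covered.

```python
def call_rate(genotypes):
    """Fraction of genotypes that were called (not None)."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    return min(freq, 1.0 - freq)


def passes_snp_qc(genotypes, min_maf=0.01, min_call_rate=0.90):
    """Keep a SNP only if call rate >= 0.90 and MAF > 1%."""
    # The call rate is checked first, so a SNP with no calls never reaches the MAF step.
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) > min_maf)


# Hypothetical genotypes of ten animals at three SNPs.
polymorphic = [0, 1, 2, 1, 0, 1, 2, 0, 1, 1]
monomorphic = [0] * 10
poorly_called = [0, 1, None, None, 2, 1, 0, 1, 1, 0]
for snp in (polymorphic, monomorphic, poorly_called):
    print(passes_snp_qc(snp))
```

Only the first SNP passes: the second has MAF 0 and the third has a call rate of 0.80.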

 A total of 117 Gir animals and 567,020 SNPs remained in the HD panel after QC
 A subset of the data with all 117 animals and only the INDUSCHIP-1 SNPs was
extracted using PLINK
 A total of 902 individuals and 41,428 SNPs remained in
INDUSCHIP-1 after QC
 Using PLINK, the data of the 902 genotyped animals were merged with the data of the 117 genotyped animals,
resulting in INDUSCHIP-1 data for 1,019 animals
 Data sets were created using PLINK: test sets, validation sets and reference sets
Data set description after QC

IMPUTATION-I
Step I, checking efficiency of imputation for INDUSCHIP-1 to HD level
HD genotype data, divided into:
 Reference data set: 105 animals (777K SNPs)
 Test data set: 12 animals (subset of INDUSCHIP-1 SNPs) >> impute at HD level
 Validation data set: same 12 animals (777K SNPs)
 Check concordance of imputed 777K SNPs
Preparing data files

Columns: chromosome no.; no. of individuals in reference file; no. of SNPs in reference (validation) file; no. of individuals in test (validation) file; no. of SNPs in test file
Set-1
Chr. 1 105 35,185 12 2,435
Chr. 17 105 16,460 12 1,173
Chr. 23 105 11,528 12 869
The same scheme was followed to create a total of five data sets
Data set description (IMPUTATION-I)
Step I, checking efficiency of imputation for INDUSCHIP-1 to HD level

IMPUTATION-II
Step II, checking efficiency of imputation for selected LD panel to INDUSCHIP-1 level
INDUSCHIP-1 genotype data, divided into:
 Reference data set
 Test data set: 15 animals (subset of selected LD SNPs) >> impute at INDUSCHIP-1 level
 Validation data set: same 15 animals (50K SNPs)
 Check concordance of imputed 50K SNPs
Preparing data files

Columns: chromosome no.; no. of individuals in reference file; no. of SNPs in reference (validation) file; no. of individuals in test (validation) file; no. of SNPs in test file
Set-1
Chr. 1 1,004 2,420 15 798
Chr. 17 1,004 1,178 15 360
Chr. 23 1,004 867 15 303
Data set description (IMPUTATION-II) (S-1)
Step II, checking efficiency of imputation for selected LD to INDUSCHIP-1 level

Columns: chromosome no.; no. of individuals in reference file; no. of SNPs in reference (validation) file; no. of individuals in test (validation) file; no. of SNPs in test file
Set-1
Chr. 1 105 2,435 12 813
Chr. 17 105 1,173 12 346
Chr. 23 105 869 12 316
Data set description (IMPUTATION-II) (S-2)
Step II, checking efficiency of imputation for selected LD to INDUSCHIP-1 level

 Input data files required for BEAGLE were prepared using PLINK
 Step 1: The reference file was first phased, and this phased file was used for
imputation of missing SNP data in the test file
 Step 2: The imputed, phased output file was compared against the genotypes in the
validation data set
 Step 3: The concordance % (% of SNPs having the same genotype call in both the imputed and
validation data files) was calculated using an R script
 Step 4: Chromosome region wise % concordance was presented in graphical format
 Results of all the validation rounds were averaged to arrive at the overall imputation
accuracy in the form of concordance %
Imputation procedure

 SNPs having MAF > 0.3 were used for the first selection. The first SNP encountered
at the beginning of chromosome 1 was selected. The next SNP
encountered at a distance of at least 50 kb was then selected, and this was
continued till the end of the chromosome. The same process was repeated
for all other chromosomes.
 A second selection set was prepared from SNPs having MAF < 0.1. SNPs at a
minimum distance of 50 kb from already selected SNPs were selected.
 Regions of the chromosomes where gaps remained were filled with SNPs
having MAF > 0.1 and < 0.3.
 On completion of the above 3 cycles, a total of 12,851 SNPs were selected.
Selection of LD panel from INDUSCHIP-1 SNPs
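The three selection cycles can be sketched for a single chromosome as below. This is a minimal reading of the slide, not the script used in the study: the function name `select_ld_panel` and the SNP positions are hypothetical, and the gap-filling cycle is assumed to apply the same 50 kb minimum distance, since the slide does not define what counts as a gap.

```python
def select_ld_panel(snps, min_gap=50_000):
    """Select SNP positions on one chromosome.

    snps: list of (position_bp, maf) tuples.
    Returns the sorted positions of the selected SNPs.
    """
    selected = []

    def far_enough(pos):
        # A candidate must lie at least min_gap away from every SNP kept so far.
        return all(abs(pos - s) >= min_gap for s in selected)

    def add_pass(keep):
        # Walk along the chromosome and keep each eligible, well-spaced SNP.
        for pos, maf in sorted(snps):
            if keep(maf) and far_enough(pos):
                selected.append(pos)

    add_pass(lambda maf: maf > 0.3)        # cycle 1: highly polymorphic SNPs
    add_pass(lambda maf: maf < 0.1)        # cycle 2: low MAF SNPs
    add_pass(lambda maf: 0.1 < maf < 0.3)  # cycle 3: fill remaining gaps (assumed rule)
    return sorted(selected)


# Hypothetical (position in bp, MAF) pairs.
snps = [(10_000, 0.35), (30_000, 0.40), (70_000, 0.32), (100_000, 0.05),
        (125_000, 0.08), (200_000, 0.20), (230_000, 0.45), (300_000, 0.15)]
print(select_ld_panel(snps))  # [10000, 70000, 125000, 230000, 300000]
```

Running the high MAF cycle first gives the most informative SNPs priority for each 50 kb slot; the later cycles only use positions that are still free.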

 Performance of INDUSCHIP-1 in the Gir cattle population
 Chromosome wise number of SNPs in INDUSCHIP-1 as compared to Illumina
BovineHD
 Chromosome wise number of SNPs per Mb
 Minor Allele Frequency (MAF)
 Hardy Weinberg Equilibrium (HWE)
 Linkage Disequilibrium (LD)
 Effectiveness of INDUSCHIP-1 for imputation of missing SNPs at HD level in the Gir cattle
breed
- Five-fold cross-validation table
 Effectiveness of selected LD panel for imputation at INDUSCHIP-1 level in the Gir cattle breed
using different numbers of animals in the reference population
- Five-fold cross-validation table for both scenarios
RESULTS AND DISCUSSION

Columns: chromosome no.; no. of SNPs in Illumina BovineHD; no. of SNPs in INDUSCHIP-1; average distance between SNPs in INDUSCHIP-1 (bp); % of total INDUSCHIP-1 SNPs; INDUSCHIP-1 SNPs as % of BovineHD SNPs
1 45,720 2717 58228.0 6.05 5.94
2 39,407 2333 58419.9 5.20 5.92
3 34,964 2096 57761.7 4.67 5.99
4 34366 2076 57929.4 4.63 6.04
5 34199 2034 59278.4 4.53 5.94
6 34971 2092 57913.2 4.66 5.98
7 32575 1870 60088.4 4.17 5.74
8 33021 2000 56430.9 4.46 6.05
9 30560 1951 53786.5 4.35 6.38
10 29955 1721 60077.6 3.84 5.74
11 31509 1826 58610.1 4.07 5.79
12 25461 1468 61435.5 3.27 5.76
13 23218 1400 59882.6 3.12 6.02
14 24393 1440 57744.8 3.21 5.90
Chromosome wise number of SNPs in INDUSCHIP-1 as
compared to Illumina BovineHD

Columns: chromosome no.; no. of SNPs in Illumina BovineHD; no. of SNPs in INDUSCHIP-1; average distance between SNPs in INDUSCHIP-1 (bp); % of total INDUSCHIP-1 SNPs; INDUSCHIP-1 SNPs as % of BovineHD SNPs
15 24210 1465 57780.4 3.26 6.05
16 23743 1497 53879.7 3.34 6.30
17 21883 1297 57713.6 2.89 5.92
18 18987 1152 57137 2.57 6.06
19 18576 1136 55965.9 2.53 6.11
20 21127 1339 53297.7 2.98 6.33
21 20788 1260 54331.1 2.81 6.06
22 17754 1024 58636.5 2.28 5.76
23 14888 973 53161.5 2.17 6.53
24 18350 1074 57738.3 2.39 5.85
25 12701 755 56463.7 1.68 5.94
26 14953 884 57639.5 1.97 5.91
27 12904 828 54762.2 1.85 6.41
28 12769 830 55624.2 1.85 6.50
29 14392 889 56752.4 1.98 6.17

Columns: chromosome no.; total length of chromosome (Mb); no. of SNPs selected; SNPs per Mb
1 158.09 2717 17.186
2 136.51 2333 17.090
3 121.14 2096 17.302
4 120.39 2076 17.244
5 121.08 2034 16.799
6 119.19 2092 17.552
7 112.36 1870 16.643
8 113.01 2000 17.698
9 105.46 1951 18.500
10 103.25 1721 16.668
11 107.18 1826 17.037
12 90.94 1468 16.143
13 83.84 1400 16.698
14 83.15 1440 17.318
15 85.15 1465 17.205
16 81.17 1497 18.443
17 74.89 1297 17.319
18 65.29 1152 17.644
19 63.47 1136 17.898
20 71.59 1339 18.704
21 71.10 1260 17.722
22 61.22 1024 16.727
23 52.07 973 18.686
24 62.04 1074 17.311
25 42.71 755 17.677
26 50.95 884 17.350
27 45.27 828 18.290
28 46.15 830 17.985
29 51.10 889 17.397
Chromosome wise SNP number (per Mb)
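The density column is simply the number of SNPs divided by the chromosome length. A short sketch, using three rows copied from the table and a hypothetical helper name, reproduces the tabulated values:

```python
# chromosome: (length in Mb, number of INDUSCHIP-1 SNPs), copied from the table
chromosomes = {1: (158.09, 2717), 17: (74.89, 1297), 23: (52.07, 973)}


def snps_per_mb(length_mb, n_snps):
    """Average number of SNPs per megabase."""
    return n_snps / length_mb


for chrom, (length_mb, n_snps) in chromosomes.items():
    print(chrom, round(snps_per_mb(length_mb, n_snps), 3))
```

The three results round to 17.186, 17.319 and 18.686, matching the table.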

Minor allele frequency
The INDUSCHIP-1 showed high proportion of polymorphic SNPs in Gir breed

[Figure: average MAF (y-axis, 0 to 0.45) against chromosomal physical position in bp (x-axis, 0 to 180,000,000) for chromosome 1]
MAF in various chromosomal regions for Chromosome No 1

[Figure: number of SNPs (y-axis, 0 to 900) per chromosome (x-axis, 1 to 30), grouped by MAF range: 0.05-0.15, 0.15-0.25, 0.25-0.35, 0.35-0.45, 0.45-0.55]
SNP distribution according to MAF across all chromosomes

Source: Bovine 50k Chip Illumina Datasheet
 Only 40 animals of two Bos indicus breeds were used
to construct and validate the chip
 Much less polymorphic
(only ~50% of SNPs are polymorphic in Gir)
 Much less informative
(50% of SNPs have a MAF > 0.02 in Gir)
 INDUSCHIP-1 has mean MAF = 0.281 and
median MAF = 0.339
 INDUSCHIP-1 will be more suitable for
selection in indigenous breeds
INDUSCHIP-1 performance compared to Illumina SNP panel
(SNP distribution and MAF)

 Lowest number of SNP away from HWE on Chromosome 22
 Highest number of SNP away from HWE on chromosome 1
[Figure: number of SNPs deviating from HWE (y-axis, 0 to 70) per chromosome (x-axis, 1 to 30). Bar labels, in order: 62, 53, 54, 41, 40, 45, 45, 40, 45, 34, 17, 45, 24, 31, 33, 23, 27, 32, 15, 29, 15, 11, 23, 20, 16, 21, 25, 21, 21, 14]
Distribution of SNPs deviating from HWE in different chromosomes

Loss of linkage when subsequent SNPs are located farther from each other
[Figure: average r2 (y-axis, 0 to 0.25) against inter-marker distance (x-axis, labelled in kb, ticks 0 to 850,000)]
LD decay in Gir cattle using INDUSCHIP-1
Linkage Disequilibrium among selected SNPs in INDUSCHIP-1
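An LD decay curve of this kind is usually drawn by averaging pairwise r2 within bins of inter-marker distance. The sketch below shows only that averaging step, assuming the pairwise r2 values have already been computed; the function name `ld_decay`, the 50 kb bin size and the example pairs are hypothetical, and the slides do not say which bin width was used.

```python
from collections import defaultdict


def ld_decay(pairs, bin_size=50_000):
    """Average r2 per distance bin.

    pairs: iterable of (distance_bp, r2) for SNP pairs on the same chromosome.
    Returns {bin_start_bp: mean r2}, ordered by distance.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance_bp, r2 in pairs:
        bin_start = (distance_bp // bin_size) * bin_size
        sums[bin_start] += r2
        counts[bin_start] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Hypothetical (distance in bp, r2) values for five SNP pairs.
pairs = [(20_000, 0.30), (40_000, 0.20), (60_000, 0.15), (90_000, 0.05), (120_000, 0.04)]
for bin_start, mean_r2 in ld_decay(pairs).items():
    print(bin_start, round(mean_r2, 3))
```

A falling mean r2 across successive bins is the pattern the slide describes as loss of linkage with increasing distance.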

• Fivefold cross-validation was performed using the 117 animals having HD
genotype data
• In each round, the INDUSCHIP-1 genotypes of 12 of these animals were used as test
genotypes to predict their HD genotypes, using the HD genotypes of the other 105 animals
as reference genotypes
117 animals, 777K HD >> QC >> 117 animals, 567,020 SNPs >> Reference: 105 animals; Test/Validation: 12 animals
Effectiveness of INDUSCHIP-1 for imputation at HD level in Gir
cattle breed

Columns: test set no.; chromosome 1; chromosome 17; chromosome 23; overall concordance
1 96.430% 96.250% 95.410% 96.030%
2 93.220% 93.290% 92.690% 93.0667%
3 89.770% 88.560% 89.790% 89.373%
4 93.520% 92.710% 93.680% 93.303%
5 89.910% 90.250% 89.000% 89.720%
Median concordance 93.220% 92.710% 92.690% 93.070%
Average concordance 92.570% 92.212% 92.114% 92.299%
Fivefold cross-validation table for INDUSCHIP-1 to HD level
imputation
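The summary rows of this table can be reproduced from the five test sets with a few lines of Python. The variable names are illustrative; the numbers are copied from the table.

```python
from statistics import mean, median

# Concordance (%) of the five test sets, copied from the table above.
chr1 = [96.43, 93.22, 89.77, 93.52, 89.91]
chr17 = [96.25, 93.29, 88.56, 92.71, 90.25]
chr23 = [95.41, 92.69, 89.79, 93.68, 89.00]
overall = [96.030, 93.0667, 89.373, 93.303, 89.720]

for name, values in (("chr 1", chr1), ("chr 17", chr17),
                     ("chr 23", chr23), ("overall", overall)):
    print(name, round(median(values), 3), round(mean(values), 3))
```

The averages come out at 92.570, 92.212, 92.114 and 92.299, in agreement with the tabulated values.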

[Figure: average MAF and average concordance (y-axis, 0 to 1.2) against chromosomal physical position in bp (x-axis, 0 to 180,000,000) for chromosome 1]
Chromosomal region wise Average MAF and Average concordance level

• Fivefold cross-validation was performed using the 1,019 animals having
INDUSCHIP-1 genotype data
• In each round, the LD panel genotypes of 15 of these animals were used as test genotypes
to predict their INDUSCHIP-1 genotypes, using the INDUSCHIP-1 genotypes of the other
1,004 animals as reference genotypes
1,019 animals, 41,428 SNPs >> QC >> 1,019 animals, 39,243 SNPs >> Reference: 1,004 animals; Test/Validation: 15 animals
Effectiveness of selected LD for imputation at INDUSCHIP-1 level in
Gir cattle breed

Columns: test set no.; chromosome 1; chromosome 17; chromosome 23; overall concordance
1 88.64% 87.90% 88.48% 88.34%
2 89.95% 89.94% 88.89% 89.59%
3 88.99% 86.10% 87.19% 87.43%
4 90.75% 87.81% 90.16% 89.57%
5 89.22% 87.77% 88.23% 88.41%
Median concordance 89.22% 87.81% 88.48% 88.41%
Average concordance 89.510% 87.904% 88.590% 88.668%
Fivefold cross-validation table for Selected LD to INDUSCHIP-1 level
imputation using 1,004 animals in reference (S-1)

Columns: test set no.; chromosome 1; chromosome 17; chromosome 23; overall concordance
1 90.18% 89.35% 89.55% 89.69%
2 85.41% 85.76% 86.60% 85.92%
3 83.57% 80.25% 84.29% 82.70%
4 84.59% 83.50% 86.42% 84.84%
5 82.89% 81.74% 82.31% 82.31%
Median concordance 84.59% 83.50% 86.42% 84.84%
Average concordance 85.328% 84.120% 85.834% 85.094%
Fivefold cross-validation table for Selected LD to INDUSCHIP-1 level
imputation using 105 animals in reference (S-2)

• INDUSCHIP-1 shows high variability across all chromosomes in Gir cows
• The distribution of MAF across chromosomes and along the length of
each chromosome is uniform
• INDUSCHIP-1 SNPs are far more polymorphic (MAF 0.28) than those of the
Illumina 50K SNP panel (MAF 0.11), so the customized
INDUSCHIP-1 is useful for indigenous breeds
• The imputation accuracy for imputing SNPs at HD level using the
INDUSCHIP-1 panel was high (92.3%), considering that only 105 individuals with HD genotypes
were used as reference
CONCLUSIONS

• For the LD panel in the present study, a total of 12,851 SNPs were selected based
on MAF and equal spacing of SNPs
• The imputation accuracy for imputing SNPs at INDUSCHIP-1 level using
the selected LD panel was high (88.66%), with 1,004 individuals used as
reference
• This is a reduction of only 3.63 percentage points in imputation accuracy compared to imputation
from INDUSCHIP-1 to HD, indicating that the selected 13K LD panel is a promising
option for developing an LD genotyping chip for Gir cattle
• The study thus provides evidence that adopting a relatively cheaper SNP chip is
feasible and would help to reduce the cost of implementing GS at ground level

• Using genotype imputation with a larger number of animals, in combination
with pedigree information, will further increase imputation accuracy
• In designing cost-effective genomic breeding programmes for the future, imputation
will decrease the cost of genotyping through the development of a low-cost
low density chip, and so favor large scale use at ground level
• Novel efforts are needed to develop fast, advanced and efficient population based
imputation software designed exclusively for animal populations
FUTURE PROSPECTS

• Dr. Nilesh Nayee
• Senior Manager, AB group,
• NDDB, Anand, Gujarat
• Dr. P. H. Vataliya
• Director of Extension Education,
• Kamdhenu University, Gandhinagar, Gujarat
Acknowledgement
