The present study was conducted with the aim of reducing the cost of implementing Genomic Selection (GS) in Gir cattle by using genotype imputation. The application of GS depends mainly on the cost of genotyping, and imputation approaches have been used to reduce this cost. Imputation strategies and GS have been studied comprehensively in several taurine dairy cattle populations, but very limited information is available for indigenous populations. Factors that affect the efficiency of imputation and GS include population structure, linkage disequilibrium between markers, and differing marker density between indigenous and taurine breeds. The objectives of the study were to evaluate the performance of INDUSCHIP-1, a customized Illumina bovine microarray chip for indigenous cattle breeds designed by the National Dairy Development Board, Anand; to design one low-density (LD) panel of 7-15K SNPs; and to evaluate the imputation accuracy of INDUSCHIP-1 and of a 13K subset of it to a higher density (777K HD or INDUSCHIP-1 level, respectively). The LD panel for genotype imputation to INDUSCHIP-1 level was thus designed with a strategy to maximize the accuracy of imputation in Gir cattle.
Genotype imputation study in Gir dairy cattle of Gujarat
1. POSTGRADUATE INSTITUTE OF VETERINARY EDUCATION & RESEARCH,
KAMDHENU UNIVERSITY, GANDHINAGAR, GUJARAT
Young Scientist Award – SOCDAB 2019
"Selection of low density SNP panel and assessment of its efficiency for genotype imputation to high density SNP panel in Gir cattle of Gujarat"
Dr. Rajeshkumar Thakkar
M.V.Sc.
Animal Genetics and Breeding
Dr. P. H. VATALIYA
MAJOR ADVISOR
Director of Extension Education
Dr. Nilesh Nayee
Research Mentor
Senior Manager, AB group, NDDB
2. Gujarat: two prominent cattle breeds, Gir and Kankrej
Gir: one of the best Indian milch breeds
Average milk yield 2,276 ± 171.32 kg in the Gir herd of Junagadh (Dangar and Vataliya, 2015)
Productivity is low compared with exotic breeds >> genetic improvement needed
Breeding programs are mainly based on phenotypic recording
This requires a large performance-recorded breeding population with pedigree, family and progeny information, plus complex statistical analysis
Faster genetic improvement >>> DNA information
Genomic era >>> whole genome sequence
INTRODUCTION
SNPs have opened up the prospect of large-scale genotyping and GS
3. Selection using genomic predictions of economic merit early in life, or selection based on the estimated genetic value of candidates using information on dense markers (SNPs) covering the genome
SNP markers track the inheritance of chromosomal segments; GEBV is calculated using the genomic (X) matrix
Benefits of implementing GS
↑ Accuracy of selection
↓ Generation interval
Constraints of implementing GS
Requires a large reference population = ↑ cost
Factors affecting GS: heritability of the trait, pedigree information, statistical methodology used, linkage disequilibrium between SNP markers and QTL
Genomic Selection (GS)
(Source: Boichard et al. 2016)
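As a rough illustration of how a GEBV can be obtained from the genotype (X) matrix, the sketch below multiplies centred genotypes (animals by SNPs, coded 0/1/2) by a vector of SNP effects. The genotypes and effect values are invented for illustration; in practice the effects would be estimated from a reference population, and the function name `gebv` is hypothetical.

```python
import numpy as np

def gebv(genotypes, snp_effects):
    """Sum of SNP effects weighted by centred genotype codes (0/1/2)."""
    X = np.asarray(genotypes, dtype=float)
    # Centre each SNP column so GEBVs are expressed relative to the group mean.
    X_centred = X - X.mean(axis=0)
    return X_centred @ np.asarray(snp_effects, dtype=float)

# Three animals genotyped at three SNPs (illustrative values only).
X = np.array([[0, 1, 2],
              [2, 1, 0],
              [1, 1, 1]])
effects = np.array([0.5, -0.2, 0.1])  # assumed SNP effect estimates
values = gebv(X, effects)
print(values)
```

Any missing entry in X would break this product, which is one reason complete (or imputed) genotypes are needed.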
4. SNP genotyping techniques
Hybridization-based methods, e.g. SNP microarrays (DNA chips)
Enzyme-based methods, e.g. RFLP and other PCR-based methods
Post-amplification methods, e.g. HPLC and SNPlex
SNP microarrays
Among these techniques, SNP microarrays are suited to scoring many SNPs in a multiplexed fashion
Constraints with SNP chips
Costly
Need a high-density chip
↑ Density = ↑ Reliability
5. GS theory was proposed in 2001, before the actual technology was available
In 2008, Illumina first released a 50K SNP chip for bovines
Three main technology providers: Illumina, Affymetrix and GeneSeek

Chip        SNPs
3K          2,900
LD (7K)     6,909
LD2 (7K)    9,912
50KV1       54,001
50KV2       54,609
50KV3       53,714
HD (777K)   777,962

Chip        SNPs
G 7K        7,083
GGP 9K      8,762
GP2 20K     19,809
GP3 27K     26,151
GP4 30K     30,112
GHD 75K     77,068
GH2 140K    139,480

Chip        SNPs
Affy 10K    9,713
Affy 15K    15,036
Affy 25K    25,068
Affy 700K   648,875

Available SNP chips
6. i_p_ta_io_ c_nsi_t_ i_ pr_di_t_n_ t_e m__s__g l_t_er_ _i_h__a w__d o_ a s__t__c_
(CLUE / DICTIONARY)
Imputation consists of predicting the missing letters within a word or a sentence
10-15K:
T A G T G A T
A T C A C T A
54K:
T G A C A G C A G T C A G C T T A C G T A C A G A T C
A C T G T C G T C A G T C G A A T G C A T G T C T A G
Imputation methods determine whether a chromosome segment is IBD (identical by descent)
Core concept behind imputation
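The "missing letters" idea can be sketched in a few lines: given a haplotype typed only at low-density positions, pick the reference haplotype that agrees best at the typed positions and copy its alleles into the gaps. This is a deliberately simplified toy, not how Beagle, FImpute or any other program listed in this deck works internally; the function name and the choice of typed positions are assumptions made for illustration.

```python
def impute_haplotype(low_density, reference_haplotypes):
    """Fill untyped positions (None) from the best-matching reference haplotype."""
    def mismatches(ref):
        # Count disagreements only at positions typed on the low-density panel.
        return sum(1 for obs, r in zip(low_density, ref)
                   if obs is not None and obs != r)

    best = min(reference_haplotypes, key=mismatches)
    return [obs if obs is not None else r for obs, r in zip(low_density, best)]

# Reference haplotypes: the first nine letters of the two 54K rows on the slide.
reference = [list("TGACAGCAG"), list("ACTGTCGTC")]
# Target animal typed at every third position only (assumed LD panel layout).
target = ["T", None, None, "C", None, None, "C", None, None]

imputed = impute_haplotype(target, reference)
print("".join(imputed))  # TGACAGCAG
```

Real imputation software additionally models recombination, so the best-matching reference segment can change along the chromosome.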
7. GS >> changing breeding programs around the world
SNP array technology >> 98-99% SNP call rates
e.g. with 50,000 SNPs, this would result in 500 to 1,000 missing genotypes per sample; for larger arrays, the number of missing genotypes will be even higher
Missing genotypes complicate the implementation of GS and GWAS: the X matrix will be incomplete
Imputation can be used to infer these missing genotypes
Genotype Imputation
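The arithmetic behind the missing-genotype estimate is simply the number of SNPs multiplied by one minus the call rate. A minimal sketch follows (the function name is hypothetical); the 777,962 figure is the HD chip size quoted earlier in this deck.

```python
def expected_missing(n_snps, call_rate):
    """Expected number of missing genotype calls per sample."""
    return round(n_snps * (1.0 - call_rate))

missing_99 = expected_missing(50_000, 0.99)   # 99% call rate
missing_98 = expected_missing(50_000, 0.98)   # 98% call rate
missing_hd = expected_missing(777_962, 0.99)  # HD (777K) chip at 99%
print(missing_99, missing_98, missing_hd)
```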
8. The cost of genotyping may be decreased by genotyping with a mix of low- and high-density SNP panels and imputing the low-density genotypes up to high density (Habier et al., 2009)
The limited effective population sizes and population structures in livestock allow imputation of high-density genotypes from quite low-density genotypes (Boichard et al., 2012)
Imputation from low density to 50K SNP panels is common practice in genomic breeding programs for dairy cattle (Wiggans et al., 2012), pigs (e.g. Huang et al., 2012a) and poultry (e.g. Fulton, 2012), and has been investigated for sheep (Hayes et al., 2012)
History of Imputation in Animal Breeding
9. Imputation methods
1. Family based
- Use linkage, parent-offspring trios and Mendelian segregation rules
2. Population based
- Use linkage disequilibrium information between missing SNPs and the observed flanking SNPs
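A minimal sketch of the family-based idea: when both parents are homozygous at a SNP, Mendelian segregation fixes the offspring genotype, so a missing call can be filled in with certainty. Genotypes are coded as the count (0/1/2) of one allele; the function name is hypothetical, and real family-based programs go much further by tracing haplotypes through the pedigree.

```python
def impute_from_parents(sire, dam):
    """Return the offspring genotype (0/1/2) if the parents determine it, else None."""
    if sire in (0, 2) and dam in (0, 2):
        # A homozygous parent always transmits the same allele:
        # genotype 0 passes on 0 copies, genotype 2 passes on 1 copy.
        return sire // 2 + dam // 2
    return None  # at least one heterozygous or missing parent: ambiguous

print(impute_from_parents(0, 2))  # 1
print(impute_from_parents(2, 2))  # 2
print(impute_from_parents(1, 2))  # None
```

The ambiguous cases are where population-based methods, which use linkage disequilibrium with flanking SNPs, can contribute extra information.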
10. Imputation program   Family information   Reference
Merlin                   Y                    Abecasis et al. (2002)
fastPHASE                N                    Scheet and Stephens (2006)
Beagle                   N                    Browning and Browning (2007)
IMPUTE                   N                    Howie et al. (2009)
PHASEBOOK                Y                    Druet and Georges (2010)
DAGPHASE                 Y                    Druet and Georges (2010)
Multivariate BLUP        Y                    Calus et al. (2011)
Findhap                  Y                    VanRaden et al. (2011)
FImpute                  Y                    Sargolzaei et al. (2011)
CHROMIBD                 Y                    Druet and Farnir (2011)
AlphaImpute              Y                    Hickey et al. (2011)
PedImpute                Y                    Nicolazzi et al. (2013)
Minimac                  N                    Howie et al. (2012)
Available Imputation Software
13. Accuracy = correlation of real and imputed genotypes
Concordance = percentage (%) of genotypes called correctly
Depends on
Imputation method/software used
Size of reference set (bigger the better)
Density of markers
Frequency of SNP alleles (MAF)
Genetic relationship to reference
Species, the genetic structure and history of the population
Imputation Accuracy
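The two measures defined on this slide can be sketched as follows. This is a minimal illustration, assuming genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call; the function names are hypothetical and this is not the script used in the study.

```python
from math import sqrt
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[int]  # 0, 1 or 2 allele copies; None = missing


def _paired(true: Sequence[Genotype], imputed: Sequence[Genotype]) -> List[Tuple[int, int]]:
    """Keep only positions where both the real and imputed call are present."""
    return [(t, i) for t, i in zip(true, imputed) if t is not None and i is not None]


def concordance(true: Sequence[Genotype], imputed: Sequence[Genotype]) -> float:
    """Percentage of genotypes called correctly."""
    pairs = _paired(true, imputed)
    if not pairs:
        raise ValueError("no comparable genotypes")
    return 100.0 * sum(t == i for t, i in pairs) / len(pairs)


def accuracy(true: Sequence[Genotype], imputed: Sequence[Genotype]) -> float:
    """Pearson correlation between real and imputed genotypes."""
    pairs = _paired(true, imputed)
    n = len(pairs)
    if n == 0:
        raise ValueError("no comparable genotypes")
    mean_t = sum(t for t, _ in pairs) / n
    mean_i = sum(i for _, i in pairs) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_i = sum((i - mean_i) ** 2 for _, i in pairs)
    if var_t == 0 or var_i == 0:
        raise ValueError("correlation undefined for a monomorphic SNP")
    return cov / sqrt(var_t * var_i)


if __name__ == "__main__":
    real = [0, 1, 2, 1, 0, 2, 1, 1]
    imputed = [0, 1, 2, 2, 0, 2, 1, 0]
    print(concordance(real, imputed), accuracy(real, imputed))
```

Concordance treats every mismatch equally, while the correlation penalises a 0-versus-2 error more than a 0-versus-1 error and is less flattered by low-MAF SNPs, which is why the two are often reported side by side.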
14. Fill in missing genotypes from the lab
Merge data sets with genotypes from different arrays
E.g. Illumina, Affymetrix and GeneSeek data
Impute from low density to high density (saves cost in the breeding programme)
7K > 50K > 700K > up to whole-genome sequence level
Capture the power of higher density
Better accuracy
Application of Imputation
15. To study the performance of the 50K chip (INDUSCHIP-1) in the Gir cattle population of
Gujarat
To study the efficiency of imputation from 50K (INDUSCHIP-1) to 777K (HD) level
in the Gir population of Gujarat
To design a custom 7-15K LD chip for the Gir cattle population
To evaluate the efficiency of imputation from the custom 7-15K LD chip (INDUSLD) to
50K chip (INDUSCHIP-1) level
Objectives of Research Work
17. Time >> January to September (2018)
Location of work >> Kamdhenu University, Gandhinagar
National Dairy Development Board (NDDB), Anand
Collaboration >> Kamdhenu University and NDDB
Sources of Genotype Data/Animals for Genotyping
>> Genotype data of a total of 1,019 Gir cows (117 HD and 902 INDUSCHIP-1) were used for
the present study; these data were made available by NDDB
MATERIALS AND METHODS
18. PLINK [1.9b5.2] (Shaun Purcell, 2017)
Data QC
Creating test, validation and reference files
BEAGLE [3.3.2] (Browning and Browning, 2011)
Phasing of reference files, imputation
R statistical software [3.5.1] (R Core Team, 2017)
Imputation concordance
Graphical representation of data
Software used
19. Genotype data were checked and cleaned using PLINK. SNPs or animals were removed
under the following quality control criteria:
SNPs with a MAF below 1% (only SNPs with MAF > 1% were retained)
SNPs with a call rate less than 0.90
Animals with an overall SNP call rate less than 0.90
SNPs deviating from Hardy-Weinberg equilibrium with a p-value < 10^-5
SNPs located in non-autosomal regions
SNPs that had the same genomic coordinates, i.e. mapped to the same position (only the
replicates were removed)
Quality control criteria
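The call-rate and MAF criteria above can be sketched in plain Python to show what each threshold does. This is an illustration only, not the PLINK procedure used in the study: the function names are hypothetical, genotypes are assumed to be coded 0/1/2 with `None` for a missing call, animals are filtered before SNPs (an assumed order), and the HWE, non-autosomal and duplicate-position filters are left out.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[int]  # 0, 1 or 2 allele copies; None = missing


def call_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of non-missing genotype calls."""
    if not calls:
        return 0.0
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls: Sequence[Genotype]) -> float:
    """MAF computed from the observed (non-missing) calls."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1.0 - freq)


def qc_filter(
    genotypes: List[List[Genotype]],  # genotypes[animal][snp]
    min_call_rate: float = 0.90,
    min_maf: float = 0.01,
) -> Tuple[List[int], List[int]]:
    """Return indices of animals and SNPs that pass the call-rate and MAF filters."""
    keep_animals = [a for a, row in enumerate(genotypes) if call_rate(row) >= min_call_rate]
    keep_snps = []
    n_snps = len(genotypes[0]) if genotypes else 0
    for s in range(n_snps):
        column = [genotypes[a][s] for a in keep_animals]
        if call_rate(column) >= min_call_rate and minor_allele_frequency(column) > min_maf:
            keep_snps.append(s)
    return keep_animals, keep_snps


if __name__ == "__main__":
    toy = [[0, 1, 2], [1, 1, 2], [2, 0, 2], [None, None, 2]]
    print(qc_filter(toy))
```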
20. A total of 117 Gir animals and 567,020 SNPs remained in the HD panel after QC
A subset of the data containing all 117 animals and only the INDUSCHIP-1 SNPs was
extracted using PLINK
A total of 902 individuals and 41,428 INDUSCHIP-1 SNPs remained in the
INDUSCHIP-1 data after QC
Using PLINK, the 902 INDUSCHIP-1 genotypes were merged with the INDUSCHIP-1 subset of the
117 HD animals, giving 1,019 animals with INDUSCHIP-1 data
Data sets created using PLINK: reference sets, test sets, validation sets
Data set description after QC
21. IMPUTATION-I
HD genotype data: 117 animals (777K SNPs)
Reference data set: 105 animals (777K SNPs)
Validation data set: the remaining 12 animals (777K SNPs)
Test data set: INDUSCHIP-1 subset of the same 12 animals (50K SNPs)
Impute to HD level: 12 animals (777K SNPs)
Check concordance of imputed 777K SNPs: 12 animals
Step I, checking efficiency of imputation for INDUSCHIP-1 to HD level
Preparing data files
22. Chr. No. | No. of individuals in reference file | No. of SNPs in reference (validation) file | No. of individuals in test (validation) file | No. of SNPs in test file
Set-1
Chr. 1 | 105 | 35,185 | 12 | 2,435
Chr. 17 | 105 | 16,460 | 12 | 1,173
Chr. 23 | 105 | 11,528 | 12 | 869
Same scheme followed for creating total five data sets
Data set description (IMPUTATION-1)
Step I, checking efficiency of imputation for INDUSCHIP-1 to HD level
23. IMPUTATION-II
INDUSCHIP-1 genotype data: 1,019 animals (50K SNPs)
Reference data set: 1,004 animals (50K SNPs)
Validation data set: the remaining 15 animals (50K SNPs)
Test data set: selected LD panel subset of the same 15 animals (13K SNPs)
Impute to INDUSCHIP-1 level: 15 animals (50K SNPs)
Check concordance of imputed 50K SNPs: 15 animals
Step II, checking efficiency of imputation for selected LD panel to INDUSCHIP-1 level
Preparing data files
24. Chr. No. | No. of individuals in reference file | No. of SNPs in reference (validation) file | No. of individuals in test (validation) file | No. of SNPs in test file
Set-1
Chr. 1 | 1,004 | 2,420 | 15 | 798
Chr. 17 | 1,004 | 1,178 | 15 | 360
Chr. 23 | 1,004 | 867 | 15 | 303
Same scheme followed for creating total five data sets
Data set description (IMPUTATION-II)(S-1)
Step II, checking efficiency of imputation for Selected LD to INDUSCHIP-1 level
25. Chr. No. | No. of individuals in reference file | No. of SNPs in reference (validation) file | No. of individuals in test (validation) file | No. of SNPs in test file
Set-1
Chr. 1 | 105 | 2,435 | 12 | 813
Chr. 17 | 105 | 1,173 | 12 | 346
Chr. 23 | 105 | 869 | 12 | 316
Same scheme followed for creating total five data sets
Data set description (IMPUTATION-II)(S-2)
Step II, checking efficiency of imputation for Selected LD to INDUSCHIP-1 level
26. Input data files required for BEAGLE were prepared using PLINK
Step-1 The reference file was first phased, and the phased file was used to
impute the missing SNP data in the test file.
Step-2 The imputed, phased output file was compared against the genotypes in the
validation data set.
Step-3 The concordance % (% of SNPs with the same genotype call in both the imputed and
validation data files) was calculated using an R script.
Step-4 Chromosome region-wise % concordance was presented graphically.
Results of all validation rounds were averaged to arrive at the overall imputation
accuracy, expressed as concordance %.
IMPUTATION procedure
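Steps 3 and 4 of the procedure can be sketched as below. The study used an R script for this; the Python here is a stand-in written for illustration, and the function name, the 0/1/2 genotype coding and the 10 Mb window size are all assumptions rather than details from the slides.

```python
from collections import defaultdict
from typing import Dict, Optional, Sequence


def regional_concordance(
    positions: Sequence[int],            # physical position (bp) of each SNP
    true_calls: Sequence[Optional[int]],     # validation genotypes (0/1/2, None = missing)
    imputed_calls: Sequence[Optional[int]],  # imputed genotypes for the same SNPs
    window_bp: int = 10_000_000,         # assumed window size
) -> Dict[int, float]:
    """Concordance % per chromosomal window, keyed by window start position."""
    totals: Dict[int, int] = defaultdict(int)
    matches: Dict[int, int] = defaultdict(int)
    for pos, t, i in zip(positions, true_calls, imputed_calls):
        if t is None or i is None:
            continue  # skip SNPs that cannot be compared
        start = (pos // window_bp) * window_bp
        totals[start] += 1
        matches[start] += int(t == i)
    return {start: 100.0 * matches[start] / totals[start] for start in sorted(totals)}


if __name__ == "__main__":
    pos = [1_000_000, 5_000_000, 12_000_000, 18_000_000]
    print(regional_concordance(pos, [0, 1, 2, 1], [0, 2, 2, 1]))
```

Each validation round would produce one such dictionary per chromosome; averaging the per-round values gives the overall concordance described in the last line of the slide.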
27. SNPs with MAF > 0.3 were used for the first selection. The first SNP encountered
at the beginning of chromosome 1 was selected. The next SNP
encountered at a distance of at least 50 kb was then selected, and this was
continued to the end of the chromosome. The same process was repeated
for all other chromosomes.
A second selection set was prepared from SNPs with MAF < 0.1. SNPs at a
minimum distance of 50 kb from already selected SNPs were selected.
Regions of the chromosomes where gaps remained were filled with SNPs
having MAF > 0.1 and < 0.3.
On completion of these three cycles, a total of 12,851 SNPs had been selected.
Selection of LD panel from INDUSCHIP-1 SNPs
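The three selection cycles can be read as a greedy, spacing-constrained pick repeated over three MAF tiers. The sketch below follows that reading; it is one interpretation, not the authors' code. The `Snp` record, the function names, and the assumption that gap filling uses the same 50 kb minimum distance as the first two cycles are all introduced here for illustration.

```python
from bisect import bisect_left, insort
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Snp(NamedTuple):
    name: str
    chrom: int
    pos: int    # physical position in bp
    maf: float


def _far_enough(sorted_positions: List[int], pos: int, min_gap: int) -> bool:
    """True if pos is at least min_gap away from its nearest selected neighbours."""
    i = bisect_left(sorted_positions, pos)
    left_ok = i == 0 or pos - sorted_positions[i - 1] >= min_gap
    right_ok = i == len(sorted_positions) or sorted_positions[i] - pos >= min_gap
    return left_ok and right_ok


def select_ld_panel(snps: Iterable[Snp], min_gap: int = 50_000) -> List[Snp]:
    """Greedy three-cycle selection following the MAF tiers on the slide."""
    ordered = sorted(snps, key=lambda s: (s.chrom, s.pos))
    # Tier boundaries are taken as written; a MAF of exactly 0.1 or 0.3 falls in no tier.
    tiers = [
        lambda maf: maf > 0.3,        # cycle 1: highly polymorphic SNPs
        lambda maf: maf < 0.1,        # cycle 2: as stated on the slide
        lambda maf: 0.1 < maf < 0.3,  # cycle 3: fill the remaining gaps
    ]
    chosen: Dict[int, List[int]] = defaultdict(list)  # chrom -> sorted positions
    panel: List[Snp] = []
    for in_tier in tiers:
        for snp in ordered:
            if in_tier(snp.maf) and _far_enough(chosen[snp.chrom], snp.pos, min_gap):
                insort(chosen[snp.chrom], snp.pos)
                panel.append(snp)
    return sorted(panel, key=lambda s: (s.chrom, s.pos))


if __name__ == "__main__":
    toy = [
        Snp("a", 1, 10_000, 0.40), Snp("b", 1, 30_000, 0.45),
        Snp("c", 1, 70_000, 0.35), Snp("d", 1, 130_000, 0.05),
        Snp("e", 1, 150_000, 0.08), Snp("f", 1, 200_000, 0.20),
        Snp("g", 2, 5_000, 0.20),
    ]
    print([s.name for s in select_ld_panel(toy)])
```

Walking each chromosome in position order makes cycle 1 match the "first SNP, then the next one at least 50 kb away" description, while later cycles only add SNPs that keep the spacing against everything already chosen.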
28. Performance of INDUSCHIP-1 in Gir cattle population
Chromosome-wise number of SNPs in INDUSCHIP-1 as compared to Illumina
BovineHD
Chromosome-wise number of SNPs per Mb
Minor Allele Frequency (MAF)
Hardy-Weinberg Equilibrium (HWE)
Linkage Disequilibrium (LD)
Effectiveness of INDUSCHIP-1 for imputation of missing SNPs at HD level in Gir cattle
breed
-Five-fold cross-validation table
Effectiveness of selected LD panel for imputation at INDUSCHIP-1 level in Gir cattle breed
using different numbers of animals in the reference population
-Five-fold cross-validation table for both scenarios
RESULTS AND DISCUSSION
29. Chromosome No. | No. of SNPs in Illumina BovineHD | No. of SNPs in INDUSCHIP-1 | Avg. distance among SNPs in INDUSCHIP-1 (base pairs) | % out of INDUSCHIP-1 SNPs | % of BovineHD SNPs
1 | 45,720 | 2,717 | 58,228.0 | 6.05 | 5.94
2 | 39,407 | 2,333 | 58,419.9 | 5.20 | 5.92
3 | 34,964 | 2,096 | 57,761.7 | 4.67 | 5.99
4 | 34,366 | 2,076 | 57,929.4 | 4.63 | 6.04
5 | 34,199 | 2,034 | 59,278.4 | 4.53 | 5.94
6 | 34,971 | 2,092 | 57,913.2 | 4.66 | 5.98
7 | 32,575 | 1,870 | 60,088.4 | 4.17 | 5.74
8 | 33,021 | 2,000 | 56,430.9 | 4.46 | 6.05
9 | 30,560 | 1,951 | 53,786.5 | 4.35 | 6.38
10 | 29,955 | 1,721 | 60,077.6 | 3.84 | 5.74
11 | 31,509 | 1,826 | 58,610.1 | 4.07 | 5.79
12 | 25,461 | 1,468 | 61,435.5 | 3.27 | 5.76
13 | 23,218 | 1,400 | 59,882.6 | 3.12 | 6.02
14 | 24,393 | 1,440 | 57,744.8 | 3.21 | 5.90
Chromosome wise number of SNPs in INDUSCHIP-1 as
compared to Illumina BovineHD
34. [Figure: number of SNPs (y-axis, 0 to 900) per chromosome (x-axis, chromosomes 1 to 30), grouped by MAF range: 0.05-0.15, 0.15-0.25, 0.25-0.35, 0.35-0.45, 0.45-0.55]
SNP distribution according to MAF across all chromosome
35. Illumina Bovine 50K chip (source: Illumina Bovine 50K chip datasheet)
Only 40 animals of two Bos indicus breeds were used
to construct and validate the chip
Much less polymorphic
(only ~50% of SNPs are polymorphic in Gir)
Much less informative
(50% of SNPs have a MAF > 0.02 in Gir)
INDUSCHIP-1 has mean MAF = 0.281 and
median MAF = 0.339
INDUSCHIP-1 will be more suitable for
selection in indigenous breeds
INDUSCHIP-1 performance comparison to illumina SNP panel
(SNP distribution and MAF)
36. Lowest number of SNPs deviating from HWE: chromosome 22
Highest number of SNPs deviating from HWE: chromosome 1
[Bar chart: number of SNPs deviating from HWE (y-axis, 0 to 70) by chromosome (x-axis, 1 to 30). Bar labels as extracted, in chromosome order: 62, 53, 54, 41, 40, 45, 45, 40, 45, 34, 17, 45, 24, 31, 33, 23, 27, 32, 15, 29, 15, 11, 23, 20, 16, 21, 25, 21, 21, 14]
Distribution of SNPs deviating from HWE in different chromosomes
37. Loss of linkage when subsequent SNPs are located farther from each other
[Figure: average r2 (y-axis, 0 to 0.25) against inter-marker distance (x-axis, 0 to 850,000; axis labelled kb)]
LD decay in Gir cattle using INDUSCHIP-1
Linkage Disequilibrium among selected SNPs in INDUSCHIP-1
38. • Fivefold cross-validation was performed using the 117 animals with HD
genotype data
• In each fold, the INDUSCHIP-1 genotypes of 12 animals were used as test
genotypes to predict their HD genotypes, with the HD genotypes of the other 105 animals
as reference genotypes
117 animals, 777K HD >> QC >> 117 animals, 567,020 SNPs
Reference: 105 animals
Test/Validation: 12 animals
Effectiveness of INDUSCHIP-1 for imputation at HD level in Gir
cattle breed
39. Test Set No. | Chromosome 1 | Chromosome 17 | Chromosome 23 | Overall Concordance
1 | 96.430% | 96.250% | 95.410% | 96.030%
2 | 93.220% | 93.290% | 92.690% | 93.0667%
3 | 89.770% | 88.560% | 89.790% | 89.373%
4 | 93.520% | 92.710% | 93.680% | 93.303%
5 | 89.910% | 90.250% | 89.000% | 89.720%
Median Concordance | 93.220% | 92.710% | 92.690% | 93.070%
Average Concordance | 92.570% | 92.212% | 92.114% | 92.299%
Fivefold cross-validation table for INDUSCHIP-1 to HD level
Imputation
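The median and average rows of this table can be reproduced from the five per-fold values. The short sketch below does so with Python's standard `statistics` module; the dictionary name is hypothetical and the numbers are copied from the table.

```python
from statistics import mean, median

# Per-fold concordance (%) for INDUSCHIP-1 to HD imputation, from the table.
fold_concordance = {
    "Chromosome 1": [96.43, 93.22, 89.77, 93.52, 89.91],
    "Chromosome 17": [96.25, 93.29, 88.56, 92.71, 90.25],
    "Chromosome 23": [95.41, 92.69, 89.79, 93.68, 89.00],
}

# (average, median) per chromosome, rounded to the table's three decimals.
summary = {
    chrom: (round(mean(values), 3), median(values))
    for chrom, values in fold_concordance.items()
}

if __name__ == "__main__":
    for chrom, (avg, med) in summary.items():
        print(f"{chrom}: average {avg:.3f}%, median {med:.3f}%")
```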
40. [Figure: MAF and average concordance (y-axis, 0 to 1.2) against chromosomal physical position in bp (x-axis, 0 to 180,000,000), chromosome 1]
Chromosomal region wise Average MAF and Average concordance level
41. [Figure: MAF and average concordance (y-axis, 0 to 1.2) against chromosomal physical position in bp (x-axis, 0 to 180,000,000), chromosome 1]
Chromosomal region wise Average MAF and Average concordance level
42. • Fivefold cross-validation was performed using the 1,019 animals with
INDUSCHIP-1 genotype data
• In each fold, the LD panel genotypes of 15 animals were used as test genotypes
to predict their INDUSCHIP-1 genotypes, with the INDUSCHIP-1 genotypes of the other
1,004 animals as reference genotypes
1,019 animals, 41,428 SNPs >> QC >> 1,019 animals, 39,243 SNPs
Reference: 1,004 animals
Test/Validation: 15 animals
Effectiveness of selected LD panel for imputation at INDUSCHIP-1 level in
Gir cattle breed
43. Test Set No. | Chromosome 1 | Chromosome 17 | Chromosome 23 | Overall Concordance
1 | 88.64% | 87.90% | 88.48% | 88.34%
2 | 89.95% | 89.94% | 88.89% | 89.59%
3 | 88.99% | 86.10% | 87.19% | 87.43%
4 | 90.75% | 87.81% | 90.16% | 89.57%
5 | 89.22% | 87.77% | 88.23% | 88.41%
Median Concordance | 89.22% | 87.81% | 88.48% | 88.41%
Average Concordance | 89.510% | 87.904% | 88.590% | 88.668%
Fivefold cross-validation table for Selected LD to INDUSCHIP-1 level
Imputation using 1,004 animals in reference (S-1)
44. Test Set No. | Chromosome 1 | Chromosome 17 | Chromosome 23 | Overall Concordance
1 | 90.18% | 89.35% | 89.55% | 89.69%
2 | 85.41% | 85.76% | 86.60% | 85.92%
3 | 83.57% | 80.25% | 84.29% | 82.70%
4 | 84.59% | 83.50% | 86.42% | 84.84%
5 | 82.89% | 81.74% | 82.31% | 82.31%
Median Concordance | 84.59% | 83.50% | 86.42% | 84.84%
Average Concordance | 85.328% | 84.120% | 85.834% | 85.094%
Fivefold cross-validation table for Selected LD to INDUSCHIP-1 level
Imputation using 105 animals in reference (S-2)
45. • INDUSCHIP-1 shows high variability across all chromosomes in Gir cows
• The distribution of MAF across chromosomes and along the length of each
chromosome is uniform
• INDUSCHIP-1 contains far more highly polymorphic SNPs (mean MAF 0.28) than
the Illumina 50K SNP panel (MAF 0.11), so the customized
INDUSCHIP-1 is useful for indigenous breeds
• The accuracy of imputing SNPs to HD level from the
INDUSCHIP-1 panel was high (92.3% concordance), considering that only 105 individuals with
HD genotypes were used as reference
CONCLUSIONS
46. • For the LD panel in the present study, a total of 12,851 SNPs were selected based
on MAF and even spacing between SNPs
• The accuracy of imputing SNPs to INDUSCHIP-1 level from the
selected LD panel was high (88.67% concordance), with 1,004 individuals used as
reference
• This is only 3.63 percentage points lower than the accuracy of imputation
from INDUSCHIP-1 to HD, indicating that the selected 13K LD panel is a promising
option for developing an LD genotyping chip for Gir cattle
• The study thus provides evidence that adopting a relatively cheaper SNP chip is
feasible and would help to reduce the cost of implementing GS at ground level
47. • Using genotype imputation with a larger number of animals, in combination
with pedigree information, should further increase imputation accuracy
• In designing cost-effective genomic breeding programmes for the future, imputation
can decrease the cost of genotyping through the development of a low-cost
low-density chip and so favor large-scale use at ground level
• New efforts are needed to develop fast and efficient population-based
imputation software designed specifically for animal populations
FUTURE PROSPECTS
48. • Dr. Nilesh Nayee
• Senior Manager, AB group,
• NDDB, Anand, Gujarat
• Dr. P. H. Vataliya
• Director of Extension Education,
• Kamdhenu University, Gandhinagar, Gujarat
Acknowledgement