Association mapping


Published on

Association mapping, also known as "linkage disequilibrium mapping", is a method of mapping quantitative trait loci (QTLs) that takes advantage of linkage disequilibrium to link phenotypes to genotypes. The various strategies involved in association mapping are discussed in this presentation.

Published in: Education, Technology

  • In F6 or higher-generation lines derived by continual generations of outcrossing the F2 (Darvasi and Soller, 1995), sufficient meioses have occurred to reduce disequilibrium between moderately linked markers. When these advanced-generation lines are created by selfing, the reduction in disequilibrium is not nearly as great as that under random mating. Assuming many generations, and therefore meioses, have elapsed since these events, recombination will have removed association between a QTL and any marker not tightly linked to it. Association mapping thus allows for much finer mapping than standard bi-parental cross approaches.
  • (1) Availability of broader genetic variation, with a wider background for marker-trait correlations (i.e., many alleles evaluated simultaneously); (2) likelihood of higher-resolution mapping, because the approach exploits the many recombination events from the large number of meioses in the germplasm's development history; (3) possibility of exploiting historically measured trait data for association; and (4) no need to develop expensive and tedious biparental populations, which makes the approach time-saving and cost-effective.
  • Association analysis is done by LD
  • Linkage equilibrium (LE) is a random association of alleles at different loci: each haplotype (combination of alleles) occurs at a frequency equal to the product of its allele frequencies. In contrast, LD is a nonrandom association of alleles at different loci, in which haplotype frequencies in the population are higher or lower than expected from random combination of the alleles. LD is not the same as linkage, although tight linkage may generate high levels of LD between alleles. Significant LD can also occur between more distant sites, or between sites located on different chromosomes, caused by some specific genetic factors. r², the square of the correlation coefficient between the two loci, has more reliable sampling properties than D′ when allele frequencies are low.
  • The most important consideration when deciding between a candidate gene approach and a whole-genome study is the extent of LD in the organism of interest, because the extent of LD determines not only the mapping resolution that can be achieved, but also the number of markers needed for adequate coverage of the genome in a genome-wide study.
  • In practice, a sample of 100 diverse inbred lines has been found to have enough statistical power to identify associations that explain 10% of the phenotypic variation. Larger samples and/or more replications of phenotypic evaluation could be used to identify associations with smaller effects. Randomly mated populations represent a rather narrow group of germplasm, and are likely to give lower resolution and harbor only a narrow range of alleles. However, if nonrandomly mated germplasm is used, population structure needs to be controlled in the statistical analyses. Because association mapping often involves a relatively large number of diverse accessions, phenotypic data collection with adequate replications across multiple years and multiple locations is challenging. Efficient field design with incomplete block design (e.g., α-lattice), appropriate statistical methods (e.g., nearest neighbor analysis and spatial models), and consideration of QTL × environment interaction should be explored to increase the mapping power, particularly if the field conditions are not homogeneous (Eskridge, 2003). The increase in power of detecting QTLs with repeated measurements is well known and has also been demonstrated by simulation studies in mapping with pedigree-based breeding germplasm (Arbelbide et al., 2006).
  • The final consideration in selecting the sample population is whether to use randomly or non-randomly mated germplasm.
  • The dendrogram represents population structure in a subset of maize breeding lines. If a certain trait, such as disease resistance (red dots), is common in subgroup A but rare in subgroup B, any markers with significantly uneven allele distribution between the two subgroups will be positive in an association test, irrespective of their genomic locations.
  • If the samples are not randomly mated, it is critical that population structure be included in the association analysis.
  • The significance of an association (p value) may be obtained from linear regression, ANOVA or one of several non-parametric statistical methods. One option is the TASSEL General Linear Model (Yu and Buckler, 2006), a multiple regression model, combined with the estimates for the false discovery rate suggested by Kraakman et al. (2006). D′ is the standardized disequilibrium coefficient, which mainly measures recombinational history and is therefore useful to assess the probability of historical recombination in a given population. r² is essentially the correlation between the alleles at two loci; it summarizes both recombinational and mutational history and is useful in the context of association studies. Both r² and the absolute value of D′ range from 0 to 1.
  • The ARO group had seven accessions, too few to be analyzed separately. The distance at which LD dropped substantially (LD decay) varied from about 20 to 40 cM among the four main groups in this study, suggesting that a mapping resolution of between 20 and 40 cM could be achieved, with variation among genetic groups and chromosomes.
  • 30 marker loci were identified as having significant marker-trait associations for yield and its correlated traits. RM7003, co-associated with grain yield and plant weight, is reported to flank a major yield QTL (yld12.1); it is also near the QTL gpp12.1 for grains per panicle (Thomson et al. 2003), the QTL pss12.1 for seed set (Fu et al. 2010) and another QTL, qFG12-2, for filled grain number. RM431, RM340, and RM245 were found to be associated with the yield QTLs yld1.1 (Fu et al. 2010), qYI-6-1 and qYI-9 (Suh et al. 2005), respectively. Rid12, co-associated with tillers and plant weight, was found to be very close to the well-known QTL 'Ghd7', which has major effects on grain yield, plant height and heading date (Xue et al. 2008), in addition to its function in rice pericarp color (Sweeney et al. 2006; Brooks et al. 2008). RM125, associated with tillers, was also identified as having a strong association with yield (Borba et al. 2010; Jiang et al. 2004). RM431, co-associated with plant height and tillers in this study, has been reported to be closely linked with the QTL 'sd1', which decreases plant height and increases yield (Peng et al. 1999; Fu et al. 2010).
  • 127 diverse rice varieties and landraces were used to analyse genetic diversity and on the basis of that 16 diverse rice varieties were used for detecting polymorphism in BADH gene.
  • The BADH2 gene is a major locus responsible for rice aroma: loss-of-function mutations in the gene lead to accumulation of gamma-aminobutyraldehyde (GABald), a precursor of the aroma compound 2-AP, in the rice grains (Bradbury et al. 2008). Because of their similar biochemical function, it is anticipated that loss-of-function mutations in the BADH1 gene could also control rice aroma in a similar way to the BADH2 gene, particularly under salt and water stress conditions (Bradbury et al. 2008). It is important to note that a loss-of-function mutation in the BADH2 gene is a primary requirement for aroma development, because the BADH2 gene is constitutively expressed (Bradbury et al. 2005). However, loss of function of the BADH2 gene alone is not enough; it may need to be complemented with the BADH1 protein haplotype PH2 (SNP haplotypes SH1 and SH2) for full aroma expression. Thus, a combination of a loss-of-function mutation in the BADH2 gene and a reduction in the substrate binding capacity of the BADH1 enzyme for the aroma precursor compound GABald could be important for full aroma development in rice. For example, the popular crossbred basmati variety Pusa Basmati 1 has the badh2 exon 7 deletion mutation but has BADH1 haplotype PH1/SH1, which could be the reason for its mild aroma, whereas another popular crossbred basmati variety, Pusa 1121, has a rare allele of the BADH1 gene (haplotype PH1/SH14) that might lead to better aroma development than in Pusa Basmati 1.
  • About 78% of all SNPs were found in intergenic regions; of the remaining SNPs, the largest number were in introns of annotated genes, followed by coding regions and untranslated regions of annotated genes
  • It has previously been suggested that the photoperiod and temperature clines along latitudes may have been the primary factors driving differentiation of cultivated rice in China.
  • Genome-wide LD decay rates of indica and japonica were estimated at ~123 kb and ~167 kb, where r² drops to 0.25 and 0.28, respectively. This is in agreement with the previous estimation that cultivated rice has long-range LD, from close to 100 kb to over 200 kb, which might be a result of self-fertilization coupled with a relatively small effective population size. GWAS was carried out on 14 agronomic traits, which can be divided into five categories: morphological characteristics (tiller number and leaf angle), yield components (grain width, grain length, grain weight and spikelet number), grain quality (gelatinization temperature and amylose content), coloration (apiculus color, pericarp color and hull color) and physiological features (heading date, drought tolerance and degree of seed shattering).
  • Numbers of loci used to assign contributions to phenotypic variance are indicated at ends of bars.
  • All accessions were purified for two generations (single seed descent) before DNA extraction. The 44,100 SNPs came from 2 data sources: SNPs from the OryzaSNP project, an oligomer array-based re-sequencing effort using Perlegen Sciences technology, and BAC clone Sanger sequencing of wild species from the OMAP project. The phenotypes examined were classified broadly into six categories: plant morphology related traits; yield-related traits; seed and grain morphology related traits; stress-related phenotypes; cooking, eating and nutritional-quality-related traits; and plant development, represented by flowering time.
  • SNPs within 200 kb of known genes are in red; other significant SNPs are in blue. Candidate gene locations are shown as red vertical dashed lines with names on top. Panicle length in the aromatic subpopulation was not included in the GWAS because of the small sample size.

    1. 1. • Hifzur Rahman
    2. 2. Methods in Crop Improvement• To meet the food needs of the human population, plant breeders select for agronomically important traits like yield.• Determining the genetic basis of economically important complex traits is a major goal.• Linkage mapping has been a key tool for identifying the genetic basis of quantitative traits in plants.• Identification of QTLs or genes associated with a particular trait has accelerated the pace of crop improvement, by introgressing the identified QTLs/genes into the desired genotype either by MAB or by transgenic technology.
    3. 3. QTL approach• Uses standard bi-parental mapping populations (F2 or RILs).• These have a limited number of recombination events,• resulting in low map resolution, i.e. the QTL covers many cM.• Additional steps are required to narrow the QTL or clone the gene.• It is difficult to discover markers closely linked to the causative gene.
    4. 4. Association mapping (AM)• Association mapping, also known as "linkage disequilibrium mapping", is a method of mapping quantitative trait loci (QTLs) that takes advantage of linkage disequilibrium to link phenotypes to genotypes.• Uses diverse lines from natural populations or germplasm collections.• Discovers markers associated with (i.e. in LD with) the gene controlling the trait.
    5. 5. • Association studies are based on the assumption that a marker locus is 'sufficiently close' to a trait locus so that some marker allele would be 'travelling' along with the trait allele through many generations during recombination (Murillo and Greenberg, 2008).• Major goal: to identify inter-individual genetic variants, mostly single nucleotide polymorphisms (SNPs), which show the strongest association with the phenotype of interest, either because they are causal or, more likely, because they are statistically correlated or in linkage disequilibrium (LD) with one or more unobserved causal variants.
    6. 6. Advantages of AM over linkage mapping• 1. Much higher mapping resolution• 2. Greater allele number and broader reference population (Yu and Buckler, 2006)• 3. Possibility of exploiting historically measured trait data• 4. Less research time in establishing an association (Flint-Garcia et al., 2003)
    7. 7. Association analysis• Two approaches: • On the basis of distance between two loci • By analyzing linkage disequilibrium between marker and target gene in a natural population• LD refers to nonrandom association of alleles at different loci.• LD can also occur between more distant sites or sites located on different chromosomes.
    8. 8. LD Quantification• D is the difference between the observed gametic frequency of a haplotype and the frequency expected under linkage equilibrium.• D = P_AB − P_A·P_B = P_AB·P_ab − P_Ab·P_aB• The standardized coefficient D′ is informative for comparing loci with different allele frequencies, but it is strongly inflated by small sample size and low allele frequencies.• D′ should therefore be verified with r² (0 to 1) before being used to quantify the extent of LD when allele frequencies are low.
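
The LD statistics on this slide can be computed directly from the four haplotype frequencies at two biallelic loci. The following is a minimal sketch, not the routine used by TASSEL or any other package; the function name `ld_measures` and the example frequencies are illustrative assumptions, and D′ is reported as an absolute value so that it lies between 0 and 1.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, |D'|, r^2) for two biallelic loci from haplotype frequencies.

    Hypothetical helper for illustration; inputs are the population
    frequencies of the four haplotypes AB, Ab, aB and ab.
    """
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    # Allele frequencies are the marginal sums of the haplotype frequencies.
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    p_a = 1.0 - p_A
    p_b = 1.0 - p_B

    # D = P_AB - P_A * P_B (equivalently P_AB * P_ab - P_Ab * P_aB).
    D = p_AB - p_A * p_B

    # D' rescales D by the largest value it could take given the allele
    # frequencies; the bound depends on the sign of D.
    if D >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = abs(D) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the alleles at the two loci.
    denom = p_A * p_a * p_B * p_b
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2


if __name__ == "__main__":
    # Made-up frequencies: AB and ab are over-represented, so D > 0.
    D, d_prime, r2 = ld_measures(0.4, 0.1, 0.1, 0.4)
    print(f"D = {D:.3f}, |D'| = {d_prime:.3f}, r2 = {r2:.3f}")
```

With real data the haplotype frequencies would first have to be estimated (for example by a maximum likelihood haplotyping algorithm), which this sketch does not attempt.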
    9. 9. Calculation and visualization of LD• LD can be calculated using available haplotyping algorithms, e.g. a maximum likelihood estimate (MLE).• Pairwise LD can be depicted as a color-coded triangle plot based on the significant pairwise LD level (r² and D′).• Software: • "Graphical Overview of Linkage Disequilibrium" (GOLD) • "Trait Analysis by aSSociation, Evolution and Linkage" (TASSEL) • PowerMarker
    10. 10. Factors affecting LD• LD is increased by the mating system (self-pollination), genetic isolation, population structure, relatedness (kinship), small founder population size or genetic drift, admixture, selection (natural, artificial, and balancing), epistasis, and genomic rearrangements.• LD is decreased or disrupted by factors such as outcrossing, high recombination rate, high mutation rate and gene conversion.
    11. 11. LD Decay• LD tends to decay with genetic distance between the loci under consideration.• Over time the loci attain linkage equilibrium (LE), i.e. alleles are no longer preferentially paired.• Under random mating, LD between unlinked loci (recombination fraction c = 0.5) decays by one-half each generation; for linked loci it decays by a factor of (1 − c) per generation, so D_t = D_0·(1 − c)^t.• Thus, LD declines as the number of generations increases, so that in old populations LD is limited to small distances. (Raveendran et al., 2008)
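
The decay of LD under random mating follows the standard relation D_t = D_0·(1 − c)^t, where c is the recombination fraction between the two loci. The sketch below shows why unlinked loci lose half their LD per generation while tightly linked loci keep it for many generations. The function names and the starting value of D are illustrative assumptions, and the model ignores drift, selection and selfing.

```python
import math


def ld_after(d0, c, generations):
    """Expected D after `generations` of random mating.

    d0 is the initial disequilibrium and c the recombination fraction
    (0.5 for unlinked loci). Illustrative helper, not from the slides.
    """
    return d0 * (1.0 - c) ** generations


def generations_to_fraction(c, fraction):
    """Generations needed for D to fall to `fraction` of its initial value."""
    if not 0.0 < c <= 0.5:
        raise ValueError("recombination fraction must be in (0, 0.5]")
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be in (0, 1)")
    return math.log(fraction) / math.log(1.0 - c)


if __name__ == "__main__":
    d0 = 0.25  # made-up starting value (the maximum possible D)
    for c in (0.5, 0.1, 0.01):
        half_life = generations_to_fraction(c, 0.5)
        print(f"c = {c}: D after 10 generations = {ld_after(d0, c, 10):.4f}, "
              f"generations to halve LD = {half_life:.1f}")
```

Under this simplified model, markers about 1 cM from a QTL (c ≈ 0.01) retain half of their LD for roughly 70 generations, which is why only tightly linked markers stay associated in old populations.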
    12. 12. Types of association mapping• 1. Genome-wide association mapping: searches the whole genome for causal genetic variation. A large number of markers are tested for association with various complex traits, and it doesn't require any prior information on candidate genes.• 2. Candidate gene association mapping: dissects the genetic control of complex traits based on the available results from genetic, biochemical, or physiological studies in model and non-model plant species (Mackay, 2001). Requires identification of SNPs between lines within specific genes.
    13. 13. Zhu et al., 2008
    14. 14. Steps in Association Mapping (Abdurakhmonov & Abdukarimov, 2008)
    15. 15. Power to detect associations depends on:• Sample size and experimental design• Accurate phenotypic evaluations• Genotyping• Genetic architecture
    16. 16. Phenotyping and Germplasm selection• Phenotyping: • Replications across multiple years in randomized plots and across multiple locations and environments • Account for the influence of flowering time on other correlated traits, photoperiod sensitivity, lodging, and susceptibility to prevalent pathogens, because these traits affect the measurement of other morphological or agronomic traits under field conditions (Raveendran et al. 2008) • Field design: incomplete block design (lattice) (Eskridge, 2003)• Germplasm selection should be done on the basis of: • Diversity, on the basis of phenotype and genotype • Population structure
    17. 17. Germplasm selection and Population structure• Randomly or non-randomly mated germplasm can be used.• Randomly mated populations represent a rather narrow group of germplasm, and are likely to give lower resolution and harbor only a narrow range of alleles.• If nonrandomly mated germplasm is used, population structure needs to be controlled in the statistical analyses (Yu et al., 2006).
    18. 18. • A set of unlinked, selectively neutral background markers giving genome-wide coverage is used to broadly characterize the genetic composition of individuals.• Cluster analysis with bootstrapping is performed.• On the basis of the cluster analysis, the most diverse individuals are selected from each cluster to represent that cluster.• This helps to prevent spurious associations where population structure and relatedness exist.
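
A toy calculation can show how population structure produces a spurious association. In the sketch below, the counts are invented for illustration: within each of two subgroups the marker allele and the resistant phenotype are statistically independent, but both are common in subgroup A and rare in subgroup B, so a naive test on the pooled lines is strongly significant. This is a plain Pearson chi-square on 2×2 tables, used here only as a stand-in for the structure-aware models (such as GLM or MLM with structure covariates) that real analyses use.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]].

    Rows: marker allele present / absent.
    Columns: resistant / susceptible.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def pool(*tables):
    """Add 2x2 tables cell by cell, ignoring subgroup membership."""
    return tuple(sum(cells) for cells in zip(*tables))


# Invented counts for 100 lines per subgroup. In each subgroup the marker and
# the trait are independent (the cell counts equal their expected values).
subgroup_a = (64, 16, 16, 4)   # marker and resistance both at 80%
subgroup_b = (4, 16, 16, 64)   # marker and resistance both at 20%

if __name__ == "__main__":
    print("subgroup A chi-square:", chi_square_2x2(*subgroup_a))
    print("subgroup B chi-square:", chi_square_2x2(*subgroup_b))
    # Pooling the subgroups creates an association that exists in neither one.
    print("pooled chi-square:", chi_square_2x2(*pool(subgroup_a, subgroup_b)))
```

The pooled statistic is well above 3.84, the 5% critical value for one degree of freedom, even though the marker carries no information about the trait inside either subgroup.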
    19. 19. Rafalski et al 2010
    20. 20. Estimation of population structure• Low-dimensional projection: PCA-based methods (Patterson et al., 2006)• Clustering: • Distance-based (Bowcock et al., 1994) • Model-based: STRUCTURE (Pritchard et al., 2000), mStruct (Shringarpure & Xing, 2008)
    21. 21. Evaluation of linkage disequilibrium and associating genotype with phenotype• The structure of linkage disequilibrium (LD) at a specific locus reveals the association resolution possible at that locus.• TASSEL is used to measure the extent of LD as squared allele frequency correlation estimates (r², Weir, 1996) and to measure the significance of r².• E.g. if LD decays within 1000 bp, then 1 or 2 markers per 1000 bp will be needed to identify associations.• Besides TASSEL, many other software packages, such as DnaSP and Arlequin, can be used to calculate D′ and r².
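
The rule of thumb on this slide (one or two markers per LD-decay interval) turns into a rough marker budget with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the function name is hypothetical, the 400 Mb genome size in the second example is an assumed round number rather than a figure from the slides, and evenly spaced markers with uniform LD decay are assumed.

```python
import math


def markers_for_coverage(region_bp, ld_decay_bp, markers_per_ld_block=1):
    """Rough number of evenly spaced markers needed to cover a region.

    region_bp: length of the genome or target region in base pairs.
    ld_decay_bp: distance over which LD decays to background.
    markers_per_ld_block: markers wanted within each LD-decay interval.
    """
    if region_bp <= 0 or ld_decay_bp <= 0 or markers_per_ld_block < 1:
        raise ValueError("all arguments must be positive")
    return math.ceil(region_bp / ld_decay_bp) * markers_per_ld_block


if __name__ == "__main__":
    # Slide example: LD decays within 1000 bp, so a 1 Mb region needs
    # about 1000 markers (or 2000 at two markers per interval).
    print(markers_for_coverage(1_000_000, 1_000))       # 1000
    print(markers_for_coverage(1_000_000, 1_000, 2))    # 2000
    # Assumed 400 Mb genome with LD decaying over about 100 kb.
    print(markers_for_coverage(400_000_000, 100_000))   # 4000
```

Slow LD decay therefore lowers the number of markers required but also limits the mapping resolution, which is the trade-off between candidate-gene and genome-wide designs.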
    22. 22. Software used in AM
        Software | Focus | Description
        Haploview 4.2 | Haplotype analysis and LD | LD and haplotype block analysis, haplotype population frequency estimation, single SNP and haplotype association tests, permutation testing for association significance
        SVS 7 | Stratification, LD and AM | Estimates stratification, LD, haplotype blocks and multiple AM approaches for up to 1.8 million SNPs and 10,000 samples
        TASSEL | Stratification, LD and AM | SSR markers, GLM and MLM methods
        GenStat | Stratification, LD and AM | SSR markers, GLM and MLM-PCA methods
        JMP Genomics | Stratification, LD and structured AM | SNPs, CG and GWAS, analysis of common and rare variants
        GenAMap | Stratification, LD and structured AM | SNPs, tree of functional branches, multiple visualization tools
        PLINK | Stratification, LD and structured AM | SNPs, multiple AM approaches, IBD and IBS analyses
        STRUCTURE | Population structure | Computes a MCMC Bayesian analysis to estimate the proportion of the genome of an individual originating from the different inferred populations
        SPAGeDi | Relative kinship | Genetic relationship analysis
        BAPS 5.0 | Population structure | Computes a Bayesian analysis to estimate the proportion of the genome of an individual and assigns individuals to genetic clusters by considering them either as immigrants or as descendants of immigrants
        mStruct | Population structure | Detection of population structure in the presence of admixing and mutations from multi-locus genotype data; an admixture model which incorporates a mutation process on the observed genetic markers
        LDheatmap | LD | LD estimation (r²) displayed as heatmap plots using SNPs
    23. 23. Examples of association mapping studies• Much of the association mapping in crop plants is just emerging from the research phase and is beginning to be applied, especially in commercial breeding settings.• The first candidate-gene association mapping study in plants (maize) identified DNA sequence polymorphisms within the D8 locus associated with flowering time (Thornsberry et al., 2001).• Using the same population, Whitt et al. (2002) associated the candidate gene su1 with sweet taste and bt2, sh1 and sh2 with kernel composition, and Wilson et al. (2004) associated ae1 and sh2 with starch pasting properties.
    24. 24. Association mapping studies in plant species
    25. 25. Association mapping studies in Rice
        Population | Sample size | BG markers | Trait | Reference
        Diverse land races | 577 | 577 | Starch quality | Bao et al. (2006)
        Diverse accessions | 103 | 123 SSRs | Yield and its components | Agrama et al. (2007)
        Landraces | | SSRs | Heading date, plant height and panicle length | Wen et al. (2009)
        Landraces | | SNPs | Multiple agronomic traits | Huang et al. (2010)
        Diverse accessions | 203 | 154 SSRs, 1 indel | Harvest index | Li et al. (2012)
        Diverse accessions | 210 | 86 SSRs | Yield and grain quality | Borba et al. (2010)
        Diverse rice accessions | 383 | 44,000 SNPs | Aluminum tolerance | Famoso et al. (2011)
        Mini core collection | 90 | 108 SSR + indel | Stigma and spikelet characteristics | Yan et al. (2009)
        Diverse accessions | 950 | Sequence based | Flowering time and grain yield | Huang et al. (2011)
        Diverse accessions | 127 | Sequence based | Aroma | Singh et al. (2010)
        Diverse accessions | 413 | 44K SNP chip | Agronomic traits | Zhao et al. (2011)
    26. 26. • Out of 18,000 accessions of global origin, a USDA rice mini-core collection of 203 accessions was phenotyped for 14 agronomic traits.• Of the 14 agronomic traits, 5 were correlated with grain yield per plant: plant height, plant weight, tillers, panicle length, and kernels/branch.• The accessions were genotyped with 155 SSRs, and model-based clustering using STRUCTURE separated them into 5 main clusters: ARO, AUS, TRJ, TEJ and IND.
    27. 27. • The 4 main groups (AUS, IND, TEJ and TRJ) were analyzed separately for LD, measured by r².• Mean r² ranged from 0.04 for IND to 0.10 for TEJ and TRJ.• IND had the most linked marker pairs with significant LD (9.53%), while TRJ had the least (5.57%).• LD decayed over about 20 cM within both AUS and IND, and over about 30 and 40 cM within TRJ and TEJ, respectively.
    28. 28. Association analysis on candidate genes• An association study employs techniques from molecular biology, field sampling/breeding, bioinformatics and statistics.• 1. Select candidate genes using existing QTL and positional cloning.• 2. Choose diverse germplasm for the trait.• 3. Score phenotypic traits in replicated trials.• 4. Amplify and sequence candidate genes.• 5. Manipulate sequences into valid alignments and identify polymorphisms.• 6. Obtain diversity estimates and evaluate patterns of selection.• 7. Statistically evaluate associations between genotypes and phenotypes, taking population structure into account.
    29. 29. • The BADH gene was isolated from all 16 varieties and sequenced.• Sequence trace files from each variety were assembled into contigs using the combined Phred/Phrap/Consed software.• Polymorphism tags were generated automatically by the PolyPhred software integrated with Consed.• High-quality SNPs from the transcribed region were then identified manually, and screenshots were taken of the SNP trace files for the two alleles.
    30. 30. • MassARRAY Assay Design 3.1 software was further used to detect more SNPs.• 127 diverse rice varieties and landraces were used to analyse polymorphism for the identified SNPs.• A phylogenetic tree of the BADH1 gene sequences, obtained by resequencing 16 rice varieties plus the Nipponbare reference gene sequence, was constructed using MEGA 4.0.• BADH1 sequence variation among the 127 rice varieties was analysed using Sequenom MassARRAY assays, based on the scores of 15 validated SNPs identified by resequencing the BADH1 gene from the 16 varieties and Nipponbare.
    31. 31. • Two common BADH1 protein haplotypes (corresponding to four BADH1 SNP haplotypes) were analyzed in all 127 rice varieties and also separately in the aromatic and salt-tolerant subgroups of varieties.• 54 SNPs giving more than 95% success rates were used for the population structure analysis with the STRUCTURE software.• Two haplotypes of the BADH1 protein, PH1 and PH2, were modeled and docked.
    32. 32. • The three exonic SNPs were:• (1) S6 in exon 4, with a T/A polymorphism resulting in an asparagine to lysine substitution at amino acid position 144;• (2) S18 in exon 11, with a C/A polymorphism resulting in a glutamine to lysine substitution at amino acid position 345; and• (3) S19 in exon 11, with a T/C polymorphism resulting in an isoleucine to threonine substitution at amino acid position 347.• PH1 has 15 active GABald binding sites, whereas PH2 has 8.
    33. 33. • 517 landraces were phenotyped and genotyped by sequencing to about one-fold coverage using the Illumina Genome Analyzer II.• Sequence reads were aligned to the rice reference genome for SNP identification.• Discrepancies with the rice reference genome were called as candidate SNPs.
    34. 34. • A total of 3,625,200 nonredundant SNPs were identified, an average of 9.32 SNPs per kb, with 87.9% of the SNPs located within 0.2 kb of the nearest SNP.• A total of 167,514 SNPs were found in the coding regions of 25,409 annotated genes.• 3,625 large-effect SNPs (representing mutations predicted to cause large effects) were identified.• A neighbor-joining tree as well as principal-component analysis separated the rice germplasm into two groups, indica and japonica.• Both indica and japonica were further divided into three subgroups.
    35. 35. • Genome-wide LD decay rates of indica and japonica were estimated at ~123 kb and ~167 kb, where r² drops to 0.25 and 0.28, respectively.• Because of strong population differentiation between the two subspecies of cultivated rice, GWAS was conducted only for 373 indica lines, using a mixed linear model (MLM).• 80 associations for the 14 agronomic traits were identified.• Heading date was strongly correlated with both population structure and geographic distribution.
    36. 36. • 413 diverse accessions of O. sativa were phenotyped for 34 traits and genotyped using a 44K SNP array.• Probe was prepared from DNA, labelled and hybridized against the array.• Genotype calling was done using the ALCHEMY program.• 36,901 high-performing SNPs (call rate > 70%) were used for all analyses.
    37. 37. • PCA was used to determine population structure and separated the accessions into 5 clusters.• A mixed model approach was implemented to correct for population structure.• LD among the 44K common SNPs was measured as r² using the PLINK software.• LD decay was observed at ~100 kb in indica, 200 kb in aus and temperate japonica, and 300 kb in tropical japonica, against an average marker distance of about 10 kb.
    38. 38. Figure panels: Plant height; Panicle length; Flowering time; Photoperiod sensitivity
    39. 39. Comparison
        Candidate gene approach• Choice of candidate genes, and of markers within them, often involves some guesswork, so many previously unreported genes may go undetected.
        Genome-wide association mapping (GWA) using markers• Requires discovery of a large number of markers.• In species like A. thaliana (125 Mb) ~140,000 markers, and in maize (475 Mb) ~10-15 million markers, will be required to give complete coverage.
        GWA using SNP genotyping by microarray• Good and robust; can process a large number of samples and identify a large number of SNPs in one shot.• But a polymorphism that is not present in the initial discovery panel remains undetected in the large sample.
        GWA using whole-genome sequencing• Detects all polymorphisms in the population and thus avoids the erosion of power due to ascertainment bias.