• Hifzur Rahman
Methods in Crop Improvement
• To meet the food needs of the human population, plant breeders
  select for agronomically important traits such as yield.

• Determining the genetic basis of economically important complex
  traits is a major goal.

• Linkage mapping has been a key tool for identifying the genetic
  basis of quantitative traits in plants.

• Identification of QTLs or genes associated with a particular trait
  has accelerated crop improvement, either by introgressing the
  identified QTLs/genes into a desired genotype through marker-assisted
  breeding (MAB) or by transgenic technology.
QTL approach

• Uses standard bi-parental mapping populations (F2 or RILs).

• These have a limited number of recombination events,
  resulting in low map resolution, i.e. the QTL covers many cM.

• Additional steps are required to narrow the QTL or clone the gene.

• It is difficult to discover markers closely linked to the causative gene.
Association mapping (AM)

• Association mapping, also known as "linkage disequilibrium
  mapping", is a method of mapping quantitative trait loci (QTLs) that
  takes advantage of linkage disequilibrium to link phenotypes to
  genotypes.

• Uses the diverse lines from the natural populations or germplasm
  collections.

• Discovers markers associated with (i.e. linked to) the gene
  controlling the trait.
• Association studies are based on the assumption that a marker locus
  is "sufficiently close" to a trait locus so that some marker allele
  would be "travelling" along with the trait allele through many
  generations during recombination (Murillo and Greenberg, 2008).
Major goal
• To identify inter-individual genetic variants, mostly single
  nucleotide polymorphisms (SNPs), which show the strongest
  association with the phenotype of interest, either because they are
  causal or, more likely, statistically correlated or in linkage
  disequilibrium (LD) with an unobserved causal variant(s).
Advantages of AM over linkage mapping

1. Much higher mapping resolution
2. Greater allele number and broader reference population
   (Yu and Buckler, 2006)
3. Possibility of exploiting historically measured trait data
4. Less research time in establishing an association
   (Flint-Garcia et al., 2003)
Association analysis

Two approaches:

• On the basis of the distance between two loci

• By analyzing linkage disequilibrium between the marker and the target
  gene in a natural population.

  • LD refers to the nonrandom association of alleles at different loci.

  • LD can also occur between distant sites or sites located on
    different chromosomes.
LD Quantification

• LD is the difference between the observed gametic frequencies of
  haplotypes and the gametic haplotype frequencies expected under
  linkage equilibrium.

• D = P_{AB} − P_A P_B = P_{AB} P_{ab} − P_{Ab} P_{aB}

• D′ (D scaled by its maximum possible value) is informative for
  comparing loci with different allele frequencies, but it is strongly
  inflated by small sample sizes and low allele frequencies.
• D′ should therefore be verified with r² (range 0 to 1) before it is
  used to quantify the extent of LD when allele frequencies are low.
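
The quantities above can be computed directly from haplotype counts. The following is a minimal sketch, assuming phased two-locus haplotype counts are already available; the function name `ld_stats` and the example counts are illustrative and not taken from any of the cited studies.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r2) from counts of the four two-locus haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB, p_Ab, p_aB, p_ab = (c / n for c in (n_AB, n_Ab, n_aB, n_ab))
    p_A = p_AB + p_Ab          # frequency of allele A at locus 1
    p_B = p_AB + p_aB          # frequency of allele B at locus 2

    # D = P_AB - P_A * P_B (equivalently P_AB * P_ab - P_Ab * P_aB)
    d = p_AB - p_A * p_B

    # D' scales D by the largest value it could take at these allele frequencies
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between alleles at the two loci
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Balanced allele frequencies: D' and r2 tell a similar story.
print(ld_stats(40, 10, 10, 40))
# Rare alleles: D' reaches its maximum while r2 stays moderate.
print(ld_stats(98, 0, 1, 1))
```

The second call illustrates why D′ is cross-checked against r² at low allele frequencies: one haplotype class is simply absent from the sample, which is enough to push D′ to its maximum.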
Calculation and visualization of LD:

• LD can be calculated using available haplotyping algorithms
  • Maximum likelihood estimate (MLE).

  • Pairwise LD can be depicted as a color-coded triangle plot based on
    significant pairwise LD levels (r² and D′)

Software:

• “Graphical Overview of Linkage Disequilibrium” (GOLD)

• “Trait Analysis by aSSociation, Evolution and Linkage”
  (TASSEL)

• PowerMarker
Factors affecting LD

• LD increases due to mating system (self-pollination), genetic
  isolation, population structure, relatedness (kinship), small
  founder population size or genetic drift, admixture, selection
  (natural, artificial, and balancing), epistasis, and genomic
  rearrangements.

• Factors such as outcrossing, high recombination rate, high
  mutation rate, and gene conversion decrease or disrupt LD.
LD Decay:
• LD tends to decay with genetic distance between the loci under
  consideration.

• Loci eventually attain linkage equilibrium (LE), i.e. alleles are no
  longer preferentially paired.
• For unlinked loci, LD decays by one-half with each generation of
  random mating; for linked loci it decays more slowly.
• Thus, LD declines as the number of generations increases, so that in
  old populations LD is limited to small distances
  (Raveendran et al., 2008).
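
The decay described above follows the standard random-mating relation D_t = D_0 (1 − c)^t, where c is the recombination fraction between the two loci; halving per generation corresponds to c = 0.5 (unlinked loci). A minimal sketch, with hypothetical function names and example values:

```python
import math


def ld_after(d0, c, generations):
    """LD remaining after the given number of random-mating generations."""
    return d0 * (1.0 - c) ** generations


def generations_to_fraction(c, fraction):
    """Smallest number of generations after which LD is at most `fraction` of D0."""
    if not (0.0 < c < 1.0 and 0.0 < fraction < 1.0):
        raise ValueError("c and fraction must both lie strictly between 0 and 1")
    return math.ceil(math.log(fraction) / math.log(1.0 - c))


# Unlinked loci (c = 0.5): LD halves every generation.
print(ld_after(0.25, 0.5, 1))
# Tightly linked loci (c = 0.01): halving takes many generations.
print(generations_to_fraction(0.01, 0.5))
```

This is why only closely linked loci are still in LD in old populations: the smaller c is, the more generations are needed to erode the association.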
Types of association mapping

1. Genome-wide association mapping: searches the whole genome for
   causal genetic variation. A large number of markers are tested for
   association with various complex traits, and no prior information
   on candidate genes is required.

2. Candidate gene association mapping: dissects the genetic
   control of complex traits, based on the available results from
   genetic, biochemical, or physiological studies in model and
   non-model plant species (Mackay, 2001). Requires identification of
   SNPs between lines within specific genes.
(Zhu et al., 2008)
Steps in Association Mapping

(Abdurakhmonov & Abdukarimov, 2008)
• Power to detect associations depends on:
  • sample size and experimental design
  • accurate phenotypic evaluation
  • genotyping
  • genetic architecture
Phenotyping and Germplasm selection
Phenotyping
  • Replicate across multiple years, locations and environments in
    randomized plots.
  • Account for the influence of flowering time, photoperiod
    sensitivity, lodging, and susceptibility to prevalent pathogens,
    because these traits affect the measurement of other morphological
    or agronomic traits under field conditions (Raveendran et al., 2008).
  • Field design: incomplete block design (lattice) (Eskridge, 2003).

Germplasm selection should be done on the basis of
  • Diversity, both phenotypic and genotypic
  • Population structure
Germplasm selection and Population structure

• Germplasm may be randomly or non-randomly mated.

• Randomly mated populations represent a rather narrow group of
  germplasm, which is likely to lower resolution and to harbor only a
  narrow range of alleles.

• When non-randomly mated germplasm is used, population structure needs
  to be controlled in the statistical analyses (Yu et al., 2006).
• A set of unlinked, selectively neutral background markers are used
  to achieve genome-wide coverage to broadly characterize the
  genetic composition of individuals.

• Cluster analysis and bootstrapping are done.

• On the basis of the cluster analysis, the most diverse individuals are
  selected from each cluster to represent that cluster.

• This helps prevent spurious associations when population structure
  and relatedness exist.
Rafalski et al 2010
Estimation of population structure

Low-dimensional projection
  • PCA-based methods (Patterson et al., 2006)

Clustering
  • Distance-based (Bowcock et al., 1994)

  • Model-based

      • STRUCTURE (Pritchard et al., 2000)

      • mStruct (Shringarpure & Xing, 2008)
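
As an illustration of the distance-based route, the sketch below builds a pairwise allele-sharing distance matrix that a clustering step could then use. It assumes biallelic markers coded 0/1/2 (copies of one allele) with `None` for missing calls; this coding, the function names and the toy genotypes are assumptions for illustration, not the exact procedure of the cited papers.

```python
def ibs_distance(g1, g2):
    """Allele-sharing distance between two individuals (0 = identical, 1 = no shared alleles)."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no markers scored in both individuals")
    # Each marker contributes |a - b| / 2, the fraction of alleles not shared.
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))


def distance_matrix(genotypes):
    """Symmetric matrix of pairwise distances for a dict {line name: genotype list}."""
    names = list(genotypes)
    return {
        (m, n): ibs_distance(genotypes[m], genotypes[n])
        for m in names
        for n in names
    }


# Hypothetical lines scored at four background markers.
lines = {
    "line1": [0, 0, 2, 2],
    "line2": [0, 0, 2, 1],
    "line3": [2, 2, 0, None],
}
matrix = distance_matrix(lines)
print(matrix[("line1", "line2")], matrix[("line1", "line3")])
```

Lines that sit close together in such a matrix would fall into the same cluster; model-based tools such as STRUCTURE estimate membership proportions instead of hard distances.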
Evaluation of linkage disequilibrium and
         associating genotype with phenotype
• The structure of linkage disequilibrium (LD) at a specific locus
  reveals the association resolution possible at that locus.

• TASSEL (http://www.maizegenetics.net) is used to measure the
  extent of LD as squared allele frequency correlation estimates (r²;
  Weir, 1996) and to test the significance of r².

• E.g., if LD decays within 1000 bp, then 1 or 2 markers per 1000 bp
  will be needed to identify associations.

• Besides TASSEL, many other software packages, such as DnaSP and
  Arlequin, are used to calculate D′ and r².
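
The marker-density rule of thumb above is simple arithmetic: the region to be covered is divided into windows the size of the LD decay distance, and each window needs at least one marker. A minimal sketch, where the function name and the 100 kb region are hypothetical:

```python
import math


def markers_needed(region_bp, ld_decay_bp, markers_per_window=1):
    """Rough number of markers needed to tag a region, given the LD decay distance."""
    if region_bp <= 0 or ld_decay_bp <= 0:
        raise ValueError("region_bp and ld_decay_bp must be positive")
    windows = math.ceil(region_bp / ld_decay_bp)
    return windows * markers_per_window


# LD decaying within 1000 bp, hypothetical 100 kb candidate region:
print(markers_needed(100_000, 1_000, 1), "to", markers_needed(100_000, 1_000, 2))
```

The same calculation explains why species with slow LD decay need far fewer markers for a genome scan, at the cost of lower mapping resolution.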
Software used in AM
Software        Focus                                 Description
Haploview 4.2   Haplotype analysis and LD             LD and haplotype block analysis, haplotype population
                                                      frequency estimation, single SNP and haplotype
                                                      association tests, permutation testing for
                                                      association significance
SVS 7           Stratification, LD and AM             Estimates stratification, LD, haplotype blocks and
                                                      multiple AM approaches for up to 1.8 million SNPs
                                                      and 10,000 samples
TASSEL          Stratification, LD and AM             SSR markers, GLM and MLM methods
GenStat         Stratification, LD and AM             SSR markers, GLM and MLM-PCA methods
JMP Genomics    Stratification, LD and structured AM  SNPs, CG and GWAS, analysis of common and rare
                                                      variants
GenAMap         Stratification, LD and structured AM  SNPs, tree of functional branches, multiple
                                                      visualization tools
PLINK           Stratification, LD and structured AM  SNPs, multiple AM approaches, IBD and IBS analyses
STRUCTURE       Population structure                  MCMC Bayesian analysis to estimate the proportion of
                                                      the genome of an individual originating from the
                                                      different inferred populations
SPAGeDi         Relative kinship                      Genetic relationship analysis
BAPS 5.0        Population structure                  Bayesian analysis to estimate the proportion of the
                                                      genome of an individual and assign individuals to
                                                      genetic clusters by considering them either as
                                                      immigrants or as descendants of immigrants
mStruct         Population structure                  Detection of population structure in the presence of
                                                      admixing and mutations from multi-locus genotype
                                                      data; an admixture model that incorporates a
                                                      mutation process on the observed genetic markers
LDheatmap       LD                                    LD estimation (r²) displayed as heatmap plots using
                                                      SNPs
Examples of association mapping studies
• Much of the association mapping in crop plants is just emerging from the
  research phase and is beginning to be applied, especially in commercial
  breeding settings.

• The first candidate-gene association mapping study in plants
  (maize) identified DNA sequence polymorphisms within the D8 locus
  associated with flowering time (Thornsberry et al., 2001).

• Using the same population, Whitt et al. (2002) associated the candidate
  gene su1 with sweetness, and bt2, sh1 and sh2 with kernel composition;
  Wilson et al. (2004) associated ae1 and sh2 with starch pasting
  properties.
Association mapping studies in plant species
Association mapping studies in Rice
Population               Sample size  BG markers          Trait                                           Reference
Diverse landraces        577          577                 Starch quality                                  Bao et al. (2006)
Diverse accessions       103          123 SSRs            Yield and its components                        Agrama et al. (2007)
Landraces                -            SSRs                Heading date, plant height and panicle length   Wen et al. (2009)
Landraces                -            SNPs                Multiple agronomic traits                       Huang et al. (2010)
Diverse accessions       203          154 SSRs, 1 indel   Harvest index                                   Li et al. (2012)
Diverse accessions       210          86 SSRs             Yield and grain quality                         Borba et al. (2010)
Diverse rice accessions  383          44,000 SNPs         Aluminum tolerance                              Famoso et al. (2011)
Mini core collection     90           108 SSR + indel     Stigma and spikelet characteristics             Yan et al. (2009)
Diverse accessions       950          Sequence based      Flowering time and grain yield                  Huang et al. (2011)
Diverse accessions       127          Sequence based      Aroma                                           Singh et al. (2010)
Diverse accessions       413          44K SNP chip        Agronomic traits                                Zhao et al. (2011)
• Out of 18,000 accessions of global origin, a USDA rice mini core
  collection of 203 accessions was phenotyped for 14 agronomic traits.
• Of the 14 agronomic traits, 5 were correlated with grain yield
  per plant: plant height, plant weight, tillers, panicle length, and
  kernels/branch.
• The accessions were genotyped with 155 SSRs, and model-based
  clustering using STRUCTURE separated them into 5 main clusters:
  ARO, AUS, TRJ, TEJ and IND.
• The 4 main groups (AUS, IND, TEJ and TRJ) were analyzed separately for
  LD, measured by r².
• Mean r² ranged from 0.04 for IND to 0.10 for TEJ and TRJ.
• IND had the most linked marker pairs with significant LD (9.53%), while
  TRJ had the least (5.57%).
• LD decayed over about 20 cM within both AUS and IND, and over about
  30 and 40 cM within TRJ and TEJ, respectively.
Association analysis on candidate genes
An association study employs techniques from molecular biology, field
sampling/breeding, bioinformatics and statistics.
1. Select candidate genes using existing QTL and positional cloning.
2. Choose diverse germplasm for the trait.
3. Score phenotypic traits in replicated trials.
4. Amplify and sequence candidate genes.
5. Arrange sequences into valid alignments and identify polymorphisms.
6. Obtain diversity estimates and evaluate patterns of selection
7. Statistically evaluate associations between genotypes and
    phenotypes taking population structure into account.
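
Step 7 can be illustrated with a single-marker test that treats subpopulation membership as a fixed effect. The sketch below is a simplified stand-in for the GLM approach: it assumes each line belongs to exactly one discrete group (rather than having membership proportions), does not model kinship as an MLM would, and uses made-up data and hypothetical function names.

```python
import math
from collections import defaultdict


def center_within_groups(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def marker_test(genotype, phenotype, groups=None):
    """Return (effect, t statistic, residual df) for one marker.

    With groups=None the model is phenotype ~ marker; otherwise it is
    phenotype ~ group + marker, fitted by centering within groups.
    """
    n = len(genotype)
    if groups is None:
        groups = [0] * n
    k = len(set(groups))
    x = center_within_groups(genotype, groups)
    y = center_within_groups(phenotype, groups)
    sxx = sum(a * a for a in x)
    if sxx == 0:
        raise ValueError("marker does not vary within groups")
    beta = sum(a * b for a, b in zip(x, y)) / sxx
    rss = sum((b - beta * a) ** 2 for a, b in zip(x, y))
    df = n - k - 1
    se = math.sqrt(rss / df / sxx)
    t = beta / se if se > 0 else float("inf")  # se == 0 only for a perfect fit
    return beta, t, df


# Toy data: group B has a higher trait mean and a higher allele frequency,
# but within each group the marker has no effect on the trait.
groups = ["A"] * 6 + ["B"] * 6
genotype = [0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0]
phenotype = [10, 11, 9, 10, 10, 10, 20, 21, 19, 20, 20, 20]

naive = marker_test(genotype, phenotype)
corrected = marker_test(genotype, phenotype, groups)
print("ignoring structure:", naive)
print("with structure:    ", corrected)
```

In the toy data the apparent marker effect disappears once group membership is included, which is the spurious-association problem that structure correction is meant to address. The t statistic would be compared against a t distribution with the returned degrees of freedom.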
• The BADH1 gene was isolated from all 16 varieties and sequenced.
• Sequence trace files from each variety were assembled into contigs
  using the combined Phred/Phrap/Consed software.
• Polymorphism tags were generated automatically by PolyPhred
  software integrated with Consed.
• High-quality SNPs from the transcribed region were then identified
  manually, with screenshots taken of the SNP trace files for the two
  alleles.
• MassARRAY Assay Design 3.1 software was further used to
  detect more SNPs.
• 127 diverse rice varieties and landraces were used to analyse
  polymorphism at the identified SNPs.
• A phylogenetic tree of the BADH1 gene sequences, obtained by
  resequencing 16 rice varieties, together with the Nipponbare
  reference gene sequence, was constructed using MEGA 4.0.
• BADH1 sequence variation among the 127 rice varieties was analysed
  from the scores of 15 validated SNPs, identified by resequencing the
  BADH1 gene from 16 varieties and Nipponbare, using Sequenom
  MassARRAY assays.
• Two common BADH1 protein haplotypes (corresponding to four
  BADH1 SNP haplotypes) were analyzed in all 127 rice varieties
  and also separately in the aromatic and salt-tolerant subgroups of
  varieties.

• 54 SNPs giving success rates above 95% were used for the
  population structure analysis using STRUCTURE software.

• Two haplotypes of the BADH1 protein, PH1 and PH2, were
  modeled and docked.
• The three exonic SNPs were:
• (1) S6 in exon 4, with a T/A polymorphism resulting in an asparagine
  to lysine substitution at amino acid position 144;
• (2) S18 in exon 11, with a C/A polymorphism resulting in a glutamine
  to lysine substitution at amino acid position 345; and
• (3) S19 in exon 11, with a T/C polymorphism resulting in an isoleucine
  to threonine substitution at amino acid position 347.
• PH1 has 15 active GABald binding sites, whereas PH2 has 8.
• 517 landraces were phenotyped and genotyped by sequencing at up to
  one-fold coverage using the Illumina Genome Analyzer II.

• Sequence reads were aligned to the rice reference genome for SNP
  identification.

• Discrepancies with the rice reference genome were called as candidate
  SNPs.
• A total of 3,625,200 nonredundant SNPs were identified, resulting
  in an average of 9.32 SNPs per kb, with 87.9% of the SNPs located
  within 0.2 kb of the nearest SNP.
• A total of 167,514 SNPs were found in the coding regions of
  25,409 annotated genes.
• 3,625 large-effect SNPs (representing mutations predicted to cause
  large effects) were identified.
• The neighbor-joining tree as well as the principal-component analysis
  separated the rice germplasm into two groups, i.e. indica and japonica.
• Further, both indica and japonica had three subgroups.
• Genome-wide LD decay rates of indica and japonica were estimated at
  ~123 kb and ~167 kb, where r² drops to 0.25 and 0.28, respectively.

• Because of strong population differentiation between the two
  subspecies of cultivated rice, GWAS was conducted only for 373
  indica lines using a mixed linear model (MLM).

• 80 associations for the 14 agronomic traits were identified.

• Heading date was strongly correlated with both population structure
  and geographic distribution.
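
An LD decay distance of the kind quoted above can be read off by averaging r² in distance bins and reporting where the average first falls to a chosen threshold. The sketch below shows one simple way to do this; the binning scheme, function name and toy values are assumptions and not the procedure used in the cited study.

```python
from collections import defaultdict


def ld_decay_distance(pairs, threshold, bin_kb=50):
    """Lower edge (kb) of the first distance bin whose mean r2 is <= threshold.

    `pairs` is an iterable of (distance_kb, r2) for marker pairs.
    Returns None if no bin reaches the threshold.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for dist_kb, r2 in pairs:
        b = int(dist_kb // bin_kb)
        sums[b] += r2
        counts[b] += 1
    for b in sorted(sums):
        if sums[b] / counts[b] <= threshold:
            return b * bin_kb
    return None


# Toy marker pairs: (physical distance in kb, pairwise r2)
toy_pairs = [(10, 0.8), (30, 0.6), (60, 0.45), (90, 0.35),
             (110, 0.3), (140, 0.2), (160, 0.15)]
print(ld_decay_distance(toy_pairs, threshold=0.28))
```

The reported distance depends on both the threshold and the bin width, so decay estimates are only comparable between studies that use similar settings.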
• 413 diverse accessions of O. sativa were phenotyped for 34
  traits and genotyped using a 44K SNP array.
• Probes were prepared from DNA, labelled and hybridized
  to the array.
• Genotype calling was done using the ALCHEMY program.
• 36,901 high-performing SNPs (call rate > 70%) were used for all
  analyses.
• PCA was done to determine population structure and
  separated the accessions into 5 clusters.

• A mixed model approach was implemented to correct for population
  structure.

• LD among the 44K common SNPs was measured as r²
  using PLINK software.

• LD decay was observed at ~100 kb in indica,
  200 kb in aus and temperate japonica, and 300
  kb in tropical japonica, with an average
  marker distance of about 10 kb.
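
The call-rate filter mentioned above (call rate > 70%) keeps only SNPs that were successfully genotyped in a sufficient share of the accessions. A minimal sketch, assuming genotype calls are stored per SNP with `None` marking a failed call; the data layout, names and values are hypothetical:

```python
def call_rate(calls):
    """Fraction of accessions with a non-missing genotype call at one SNP."""
    return sum(1 for c in calls if c is not None) / len(calls)


def filter_by_call_rate(snp_calls, min_rate=0.70):
    """Keep SNPs whose call rate is strictly above min_rate."""
    return {snp: calls for snp, calls in snp_calls.items()
            if call_rate(calls) > min_rate}


# Hypothetical calls for three SNPs across ten accessions.
snp_calls = {
    "snp1": [0, 2, None, 2, 0, 0, 2, 2, 0, 2],           # 9 of 10 called
    "snp2": [None, None, None, None, 0, 0, 2, 2, 0, 2],  # 6 of 10 called
    "snp3": [None, None, None, 0, 0, 0, 2, 2, 0, 2],     # 7 of 10 called
}
kept = filter_by_call_rate(snp_calls)
print(sorted(kept))
```

Because the comparison is strict, a SNP sitting exactly at the threshold is dropped; whether the boundary is inclusive is a design choice that should match the stated criterion.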
[Figure panels: plant height, panicle length, flowering time, photoperiod sensitivity]
Comparison

Candidate gene approach
• Choice of candidate genes and of markers within them often involves
  some guesswork, so many previously unreported genes are likely to go
  undetected.

Genome-wide association mapping

GWA using markers
• Requires discovery of a large number of markers.
• In species such as A. thaliana (125 Mb) ~140,000 markers, and in maize
  (475 Mb) ~10-15 million markers, will be required to give complete
  coverage.

SNP genotyping using microarray
• Good and robust; can process a large number of samples and identify a
  large number of SNPs in one shot.
• But a polymorphism that is not present in the initial discovery panel
  remains undetected in the large sample.

Whole genome sequencing
• Detects all polymorphisms in the population and thus avoids the
  erosion of power due to ascertainment bias.

Association mapping

  • 1.
  • 2.
    Methods in CropImprovement • To meet the food needs of the human population, plant breeders select for agronomically important trais like yield. • Determining the genetic basis of economically important complex traits is a major goal. • Linkage mapping has been a key tool for identifying the genetic basis of quantitative traits in plants. • Identification of QTLs or genes associated to particular trait accelerated the pace of crop improvement either by introgressing the identified QTLs/genes in desired genotype by MAB or by transgenic technology.
  • 3.
    QTL approach  Usesstandard bi-parental mapping populations  F2 or RILs  These have a limited number of recombination events.  Resulting in low resolution of map i.e. the QTL covers many cM.  Additional steps required to narrow QTL or clone gene.  Difficult to discover closely linked markers for the causative gene
  • 4.
    Association mapping (AM) •Association mapping, also known as "linkage disequilibrium mapping", is a method of mapping quantitative trait loci (QTLs) that takes advantage of linkage disequilibrium to link phenotypes to genotypes. • Uses the diverse lines from the natural populations or germplasm collections. • Discovers linked markers associated (=linked) to gene controlling the trait.
  • 5.
    • Association studiesare based on the assumption that a marker locus is „sufficiently close‟ to a trait locus so that some marker allele would be „travelling‟ along with the trait allele through many generations during recombination. Murillo and Greenberg, 2008. Major goal • To identify inter-individual genetic variants, mostly single nucleotide polymorphisms (SNPs), which show the strongest association with the phenotype of interest, either because they are causal or, more likely, statistically correlated or in linkage disequilibrium (LD) with an unobserved causal variant(s).
  • 7.
    Advantages of AMover linkage mapping 1. Much higher mapping resolution, 2. Greater allele number and broader reference population (Yu and Buckler, 2006) 3. Possibility of exploiting historically measured trait data 4. Less research time in establishing an association (Flint-Garcia et al., 2003)
  • 8.
    Association analysis Two approaches:- On the basis of distance between two loci  By analyzing linkage disequilibrium between marker and target gene in natural population. • LD refers to nonrandom association of alleles at different loci. • LD can occur between more distant sites or sites located in different chromosomes
  • 9.
    LD Quantification • LDis difference between the observed gametic frequencies of haplotypes and the expected gametic haplotype frequencies under linkage equilibrium . • D = PAB − PAPB = (PABPab − PAbPaB) • D is informative for comparisons of different allele frequencies across loci and strongly inflated in a small sample size and low- allele frequencies • Verified with the r2 (0 to 1) before using for quantification of extent of LD in case of low allele frequency.
  • 10.
    Calculation and visualizationof LD: • LD can be calculated using available haplotyping algorithms • Maximum likelihood estimate (MLE). • Pairwise LD can be depicted as a color-code triangle plot based on significant pairwise LD level (r2, and D) Computer softwares: • “Graphical Overview of Linkage Disequilibrium” (GOLD ) • “Trait Analysis by aSSociation, Evolution and Linkage” (TASSEL) • PowerMarker
  • 11.
    Factors affecting LD LD increases due to mating system (self-pollination), genetic isolation, population structure, relatedness (kinship), small founder population size or genetic drift, admixture, selection (natural, artificial, and balancing), epistasis, and genomic rearrangements.  While factors like outcrossing, high recombination rate, high mutation rate, gene conversion, etc., lead to a decrease/disruption in LD.
  • 12.
    LD Decay:  LDwill tend to decay with genetic distance between the loci under consideration.  Loci attains linkage equilibrium (LE), i.e. alleles are not preferentially paired anymore.  LD decays by one-half with each generation of random mating.  Thus, LD declines as the number of generations increases, so that in old populations LD is limited to small distances. Raveendran et. al., 2008
  • 13.
    Types of associationmapping 1. Genome wide association mapping: search whole genome for causal genetic variation. A large number of markers are tested for association with various complex traits and it doesn‟t require any prior information on the candidate genes. 2. Candidate gene association mapping: dissect out the genetic control of complex traits, based on the available results from genetic, biochemical, or physiology studies in model and non- model plant species (Mackay, 2001). Requires identification of SNPs between lines within specific genes.
  • 14.
  • 15.
    Steps in Association Mapping Abdurakhmonov & Abdukarimov, 2008
  • 16.
     Power todetect associations depends on Sample size and experimental design accurate phenotypic evaluations. genotyping, genetic architecture.
  • 17.
    Phenotyping and Germplasmselection Phenotyping • Replications across multiple years in randomized plots and multiple locations and environments • influence of flowering time on other correlated traits, photoperiod sensitivity, lodging, and susceptibility to prevalent pathogens because these traits affect the measurement of other morphological or agronomic traits at field condition. (Raveendran et al. 2008) • Field Design:- incomplete block design (Lattice) (Eskridge, 2003). Should be done on the basis of • Diversity:- on the basis of phenotype and genotype • Population structure
  • 18.
    Germplasm selection andPopulation structure • Randomly or non-randomly mated germplasm • Randomly mated populations represent a rather narrow group of germplasm, likely to lower resolution and harbor only a narrow range of alleles • Nonrandomly mated germplasm is used, population structure needs to be controlled in the statistical analyses (Yu et al., 2006)
  • 19.
    • A setof unlinked, selectively neutral background markers are used to achieve genome-wide coverage to broadly characterize the genetic composition of individuals. • Cluster analysis and boot strapping is done. • On the basis of cluster analysis most diverse individuals are selected from each cluster to represent the individuals of that cluster. • Helps in preventing spurious associations if population structure and relatedness exist.
  • 20.
  • 21.
    Estimation of populationstructure Low--dimensional projection PCA based methods (Patterson et al., 2006) Clustering  Distance--based (Bowcock et al., 1994)  Model--based  STRUCTURE (Pritchard et al., 2000)  mStruct (Shringarpure & Xing, 2008)
  • 22.
    Evaluation of linkagedisequilibrium and associating genotype- phenotype • Structure of linkage disequilibrium (LD) for a specific locus will, reveal the association resolution possible at that locus. • TASSEL (http://www.maizegenetics.net) is used to measure the extent of LD as squared allele frequency correlation estimates (R2, Weir, 1996) and measure the significance of R2. • Eg. if LD decays within 1000 bp, then 1 or 2 markers per 1000 bp will be needed to identify associations. • Besides TASSEL there are many other softwares like DnaSP, Arlequin etc. used to calculate D‟ and R2.
  • 23.
    Softwares used inAM Software Focus Description Haploview 4.2 Haplotype LD and haplotype block analysis, haplotype population frequency estimation, analysis and single SNP and haplotype association tests, permutation testing for LD association significance SVS 7 Stratification, Estimate stratification, LD, haplotypes blocks and multiple AM approaches for LD and AM up to 1.8 million SNPs and 10,000 sample TASSEL Stratification, LD and AM SSR markers, GLM and MLM methods GenStat Stratification, LD and AM SSR markers, GLM and MLM-PCA methods JMP genomics Stratification, LD and structured SNPs, CG and GWAS, analysis of common and rare Variants AM GenAMap Stratification, LD and structured SNPs, tree of functional branches, multiple visualization tools AM PLINK Stratification, LD and structured SNPs, multiple AM approaches, IBD and IBS Analyses AM STRUCTURE Population Compute a MCMC Bayesian analysis to estimate the proportion of the structure genome of an individual originating from the different inferred Populations SPAGeDi Relative kinship genetic relationship analysis BAPS 5.0 Population Compute Bayesian analysis to estimate the proportion of the genome of an structure individual and assign individuals to genetic clusters by either considering them as immigrants or as descendents from immigrants mStruct Population Structure Detection of population structure in the presence of admixing and mutations from multi-locus genotype data. It is an admixture model which incorporates a mutation process on the observed genetic markers LDheatmap LD LD estimation (r2) displayed as heatmap plots using SNPs
  • 24.
    Examples of associationmapping studies • Much of the association mapping in crop plants is just emerging from the research phase and is beginning to be applied, especially in commercial breeding setting. • First attempt on candidate-gene association mapping study in plants (maize) resulted in the identification of DNA sequence polymorphisms within the D8 locus associated with flowering time (Thornsberry et al., 2001). • Using same population, Whitt et al., 2002 associated the candidate gene su1 with sweetness taste , bt2, sh1 and sh2 with kernel composition, and Wilson et al., 2004 ae1 and sh2 with starch pasting properties.
  • 25.
  • 26.
    Association mapping studiesin Rice Population Sampl BG markers Trait Reference e Size Diverse land races 577 577 Starch quality (Bao et al., 2006) Diverse accessions 103 123 SSRs Yield and its components (Agrama et al., 2007) Landraces SSRs Heading date, plant height and Wen et al. (2009) panicle length Landraces SNPs Multiple agronomic traits Huang et al. (2010) Diverse accessions 203 154 Trait of Harvest Index Li et al. (2012) SSRs,1indel Diverse accessions 210 86 SSRs yield and grain quality Borba et al. (2010) diverse rice 383 44,000SNPs Aluminum Tolerance Famoso et al (2011) accessions Mini core 90 108 SSR+indel stigma and spikelet Yan et al. (2009) collection characteristics Diverse accessions 950 Sequence based Flowering time and grain yield Huang et al. (2011) Diverse accessions 127 Sequence based Aroma Singh et al. (2010) Diverse accessions 413 44K SNP chip Agronoical traits Zhao et al. (2011)
  • 27.
    • Out of18,000 accession of global origin, a USDA rice mini core collection of 203 accession were used for phenotyping 14 agronomic traits. • Out of 14 agronomic trait 5 traits were correlated with grain yield per plant: plant height, plant weight, tillers, panicle length, and kernels/ branch. • Genotyped with 155 SSRs and Model based clustering using STRUCTURE seperated the accessions into 5 main clusters namely in ARO, AUS, TRJ,TEJ, IND.
  • 28.
     4 maingroups (AUS, IND, TEJ and TRJ) were separately analyzed for the LD measured by R2  mean R2 ranged from 0.04 for IND to 0.10 for TEJ and TRJ.  IND had the most linked marker pairs with significant LD (9.53%), while TRJ had the least (5.57%).  LD decay in distances was about 20 cM within both AUS and IND, while it decayed about 30 and 40 cM within TRJ and TEJ
  • 30.
    Association analysis oncandidate genes Association study employs techniques from molecular biology, field sampling/breeding, bioinformatics and statistics. 1. Select candidate genes using existing QTL and positional cloning 2. Choose diverse germplasm for the trait. 3. Score phenotypic traits in replicated trials. 4. Amplify and sequence candidate genes. 5. Manipulate sequence into valid alignments and identify. 6. Obtain diversity estimates and evaluate patterns of selection 7. Statistically evaluate associations between genotypes and phenotypes taking population structure into account.
  • 32.
     BADH genewas isolated from all 16 varieties and sequenced  Sequence trace files from each variety were assembled into contigs using combined Phred/Pharp/Consed software.  Polymorphism tags were generated automatically by Polyphred software integrated with the Consed.  High quality SNPs from transcribed region were then identified manually and screen shots of the SNP trace files for the two alleles.
  • 33.
     MassARRAY AssayDesign 3.1software was further used to detect more SNPs  127 diverse rice varieties and landraces were used to analyse polymorphism for the identified SNPs  Phylogenetic tree of the BADH1 gene sequence obtained by resequencing of 16 rice varieties and Nipponbare reference gene sequence was constructed using MEGA 4.0.  Analysis of the BADH1 sequence variation among 127 rice varieties was done based on the scores of 15 validated SNPs identified by resequencing of the BADH1 gene from 16 varieties and Nipponbare using the Sequenom MassARRAY assays.
  • 34.
    • Two common BADH1 protein haplotypes (corresponding to four BADH1 SNP haplotypes) were analyzed in all 127 rice varieties and also separately in the aromatic and salt-tolerant subgroups of varieties.
    • 54 SNPs giving more than 95% success rates were used for the population structure analysis using STRUCTURE software.
    • Two haplotypes of the BADH1 protein, PH1 and PH2, were modeled and docked.
  • 35.
    • The three exonic SNPs were:
    • (1) S6 in exon 4, with a T/A polymorphism resulting in an asparagine to lysine substitution at amino acid position 144;
    • (2) S18 in exon 11, with a C/A polymorphism resulting in a glutamine to lysine substitution at amino acid position 345; and
    • (3) S19 in exon 11, with a T/C polymorphism resulting in an isoleucine to threonine substitution at amino acid position 347.
    • PH1 has 15 active GABald binding sites, whereas PH2 has 8.
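    A minimal sketch of how a single-base change produces the three amino acid substitutions listed above. The codons used here are hypothetical: the slide gives only the base change and the resulting residues, not the actual BADH1 codons or the strand, so `EXAMPLES` simply picks one standard-code codon that is consistent with each substitution.

```python
# Partial standard genetic code: only the amino acids named on the slide.
CODON_TABLE = {
    "AAT": "Asn", "AAC": "Asn",
    "AAA": "Lys", "AAG": "Lys",
    "CAA": "Gln", "CAG": "Gln",
    "ATT": "Ile", "ATC": "Ile", "ATA": "Ile",
    "ACT": "Thr", "ACC": "Thr", "ACA": "Thr", "ACG": "Thr",
}


def substitute(codon, position, new_base):
    """Return the codon with the base at `position` (0-based) replaced."""
    return codon[:position] + new_base + codon[position + 1:]


def effect(codon, position, new_base):
    """Return (reference amino acid, variant amino acid) for a one-base change."""
    variant = substitute(codon, position, new_base)
    return CODON_TABLE[codon], CODON_TABLE[variant]


# Hypothetical codons (assumption, not taken from the BADH1 sequence).
EXAMPLES = {
    "S6": ("AAT", 2, "A"),   # T/A: asparagine -> lysine
    "S18": ("CAA", 0, "A"),  # C/A: glutamine -> lysine
    "S19": ("ATT", 1, "C"),  # T/C: isoleucine -> threonine
}

for name, (codon, position, base) in EXAMPLES.items():
    ref_aa, var_aa = effect(codon, position, base)
    print(f"{name}: {codon} ({ref_aa}) -> {substitute(codon, position, base)} ({var_aa})")
```

    All three changes are non-synonymous, which is why they define distinct protein haplotypes rather than only SNP haplotypes.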
  • 36.
    • 517 landraces were phenotyped and genotyped by sequencing at up to one-fold coverage using the Illumina Genome Analyzer II.
    • Sequence reads were aligned to the rice reference genome for SNP identification.
    • Discrepancies with the rice reference genome were called as candidate SNPs.
  • 37.
    • A total of 3,625,200 nonredundant SNPs were identified, an average of 9.32 SNPs per kb, with 87.9% of the SNPs located within 0.2 kb of the nearest SNP.
    • A total of 167,514 SNPs were found in the coding regions of 25,409 annotated genes.
    • 3,625 large-effect SNPs (representing mutations predicted to cause large effects) were identified.
    • A neighbor-joining tree as well as principal-component analysis separated the rice germplasm into two groups, indica and japonica.
    • Both indica and japonica were further divided into three subgroups.
  • 38.
    • Genome-wide LD decay rates of indica and japonica were estimated at ~123 kb and ~167 kb, where r² drops to 0.25 and 0.28, respectively.
    • Because of strong population differentiation between the two subspecies of cultivated rice, GWAS was conducted only for 373 indica lines using a mixed linear model (MLM).
    • 80 associations for the 14 agronomic traits were identified.
    • Heading date was strongly correlated with both population structure and geographic distribution.
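    A minimal sketch of one way to read an LD decay distance off pairwise data: average r² within distance bins and report the first bin whose mean falls below a chosen threshold. The slide does not say how the study derived its ~123 kb and ~167 kb figures, so the binning scheme, the function name `ld_decay_distance` and the numbers in `pairs` are illustrative assumptions only.

```python
from collections import defaultdict


def ld_decay_distance(pairs, threshold, bin_size_kb=50):
    """Return the start (in kb) of the first distance bin whose mean r2 is
    below `threshold`, or None if no bin drops that low.

    pairs: iterable of (distance_kb, r2) for marker pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance_kb, r2 in pairs:
        bin_index = int(distance_kb // bin_size_kb)
        sums[bin_index] += r2
        counts[bin_index] += 1
    for bin_index in sorted(sums):
        if sums[bin_index] / counts[bin_index] < threshold:
            return bin_index * bin_size_kb
    return None


# Made-up marker pairs: (physical distance in kb, r2).
pairs = [
    (10, 0.60), (30, 0.50),    # 0-50 kb bin
    (60, 0.40), (90, 0.30),    # 50-100 kb bin
    (110, 0.26), (140, 0.22),  # 100-150 kb bin
    (160, 0.20), (190, 0.18),  # 150-200 kb bin
]

print(ld_decay_distance(pairs, threshold=0.25))
```

    The result depends on the threshold and the bin width, so decay distances from different studies are comparable only when those choices match.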
  • 40.
    • 413 diverse accessions of O. sativa were phenotyped for 34 traits and genotyped using a 44K SNP array.
    • Probes were prepared from DNA, labelled and hybridized against the array.
    • Genotype calling was done using the ALCHEMY program.
    • 36,901 high-performing SNPs (call rate > 70%) were used for all analyses.
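    A minimal sketch of the call-rate filter mentioned above: a SNP is kept only if more than 70% of accessions received a genotype call. The data layout, the `"N"` missing-call symbol and the function names are assumptions for illustration; the slide does not describe the ALCHEMY output format.

```python
def call_rate(genotypes, missing="N"):
    """Fraction of accessions with a non-missing genotype call."""
    called = sum(1 for g in genotypes if g != missing)
    return called / len(genotypes)


def filter_by_call_rate(snp_calls, min_rate=0.70):
    """Keep SNPs whose call rate is strictly greater than `min_rate`."""
    return {
        snp: calls
        for snp, calls in snp_calls.items()
        if call_rate(calls) > min_rate
    }


# Made-up calls for three SNPs across ten accessions ("N" = no call).
snp_calls = {
    "snp_a": ["A", "A", "G", "G", "A", "A", "G", "A", "A", "G"],  # 10/10 called
    "snp_b": ["C", "N", "C", "T", "N", "C", "C", "T", "C", "C"],  # 8/10 called
    "snp_c": ["N", "N", "T", "N", "T", "N", "T", "T", "N", "T"],  # 5/10 called
}

kept = filter_by_call_rate(snp_calls)
print(sorted(kept))
```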
  • 41.
    • PCA was done to determine population structure and separated the accessions into 5 clusters.
    • A mixed model approach was implemented to correct for population structure.
    • LD among the 44K common SNPs was measured as r² using PLINK software.
    • LD decay was observed at ~100 kb in indica, 200 kb in aus and temperate japonica, and 300 kb in tropical japonica, with an average marker distance of about 10 kb.
  • 42.
    Figure panels: plant height, panicle length, flowering time, photoperiod sensitivity.
  • 44.
    Comparison
    Candidate gene approach
    • Choice of candidate genes and of markers within them often involves some guesswork, so many previously unreported genes are likely to go undetected.
    Genome-wide association mapping: GWA using markers
    • Requires discovery of a large number of markers.
    • In crops like A. thaliana (125 Mb) ~140,000 markers, and in maize (475 Mb) ~10-15 million markers, will be required to give complete coverage.
    Genome-wide association mapping: SNP genotyping using microarray
    • Good and robust; can process a large number of samples and identify a large number of SNPs in one shot.
    • But a polymorphism that is not present in the initial discovery panel remains undetected in the large sample.
    Genome-wide association mapping: whole-genome sequencing
    • Detects all polymorphisms in the population and thus avoids the erosion of power due to ascertainment bias.

Editor's Notes

  • #6 In F6 or higher-generation lines derived by continual generations of outcrossing the F2 (Darvasi and Soller, 1995), sufficient meioses have occurred to reduce disequilibrium between moderately linked markers. When these advanced-generation lines are created by selfing, the reduction in disequilibrium is not nearly as great as that under random mating. Assuming many generations, and therefore meioses, have elapsed since these events, recombination will have removed association between a QTL and any marker not tightly linked to it. Association mapping thus allows for much finer mapping than standard bi-parental cross approaches.
  • #8 (1) Availability of broader genetic variation with a wider background for marker-trait correlations (i.e., many alleles evaluated simultaneously), (2) likelihood of higher-resolution mapping because it uses the majority of recombination events from a large number of meioses throughout the germplasm development history, (3) possibility of exploiting historically measured trait data for association, and (4) no need to develop expensive and tedious biparental populations, which makes the approach timesaving and cost-effective.
  • #9 Association analysis is based on LD.
  • #10 Linkage equilibrium (LE) is a random association of alleles at different loci: each haplotype (combination of alleles) has a frequency equal to the product of its allele frequencies. In contrast, LD is a nonrandom association of alleles at different loci, in which haplotype frequencies in the population differ from (are higher or lower than) those expected under random combination of alleles. LD is not the same as linkage, although tight linkage may generate high levels of LD between alleles. Usually, there is significant LD between more distant sites or sites located in different chromosomes, caused by some specific genetic factors. r², the square of the correlation coefficient between the two loci, has more reliable sampling properties than D when allele frequencies are low.
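    The definitions in this note can be made concrete with a short calculation. This is a minimal sketch for two biallelic loci (alleles A/a and B/b) that takes the four haplotype counts and returns D, the standardized D′ and r². The function name and the example counts are illustrative and are not taken from any of the studies in the slides.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r2) for two biallelic loci from haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n              # observed frequency of haplotype AB
    p_A = (n_AB + n_Ab) / n      # frequency of allele A
    p_B = (n_AB + n_aB) / n      # frequency of allele B
    if p_A in (0.0, 1.0) or p_B in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic")

    # Under linkage equilibrium p_AB equals p_A * p_B, so D is zero.
    D = p_AB - p_A * p_B

    # D' rescales |D| by the largest value it could take at these frequencies.
    if D > 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    D_prime = abs(D) / d_max if D != 0 else 0.0

    # r2 is the squared correlation between alleles at the two loci.
    r2 = D * D / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D, D_prime, r2


# Made-up counts: the AB and ab haplotypes are over-represented.
D, D_prime, r2 = ld_statistics(40, 10, 10, 40)
print(round(D, 4), round(D_prime, 4), round(r2, 4))
```

    With equal counts for all four haplotypes (for example 25 each) the same function returns zero for all three statistics, which is the linkage equilibrium case.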
  • #14 The most important consideration when deciding between a candidate gene approach and a whole-genome study is the extent of LD in the organism of interest, because the extent of LD determines not only the mapping resolution that can be achieved, but also the number of markers needed for adequate coverage of the genome in a genome-wide study.
  • #18 At a practical level, a sample of 100 diverse inbred lines has been found to have enough statistical power to identify associations that control 10% of the phenotypic variation. Larger samples and/or more replications of phenotypic evaluation could be used to identify associations with smaller effects. Randomly mated populations represent a rather narrow group of germplasm, which is likely to lower resolution and harbor only a narrow range of alleles. However, if nonrandomly mated germplasm is used, population structure needs to be controlled in the statistical analyses. Because association mapping often involves a relatively large number of diverse accessions, phenotypic data collection with adequate replications across multiple years and multiple locations is challenging. Efficient field design with incomplete block design (e.g., α-lattice), appropriate statistical methods (e.g., nearest neighbor analysis and spatial models), and consideration of QTL × environment interaction should be explored to increase the mapping power, particularly if the field conditions are not homogeneous (Eskridge, 2003). The increase in power of detecting QTLs with repeated measurements is well known and has also been demonstrated by simulation studies in mapping with pedigree-based breeding germplasm (Arbelbide et al., 2006).
  • #19 The final consideration in selecting the sample population is whether to use randomly or non-randomly mated germplasm.
  • #21 The dendrogram represents population structure in a subset of maize breeding lines. If a certain trait, such as disease resistance (red dots), is common in subgroup A but rare in subgroup B, any markers with significantly uneven allele distribution between the two subgroups will be positive in an association test, irrespective of their genomic locations.
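    A minimal sketch of the effect described in this note, using made-up data rather than the maize lines in the figure. Two subgroups differ in both marker allele frequency and trait mean, while inside each subgroup the marker is unrelated to the trait. Pooling the subgroups still produces a clear marker-trait correlation. All names and values are illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Subgroup A: marker allele 1 is common (8 of 10 lines), trait mean is 10.
marker_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
trait_a = [9, 10, 11, 10, 9, 11, 10, 10, 10, 10]

# Subgroup B: marker allele 1 is rare (2 of 10 lines), trait mean is 5.
marker_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
trait_b = [5, 5, 4, 5, 6, 5, 4, 6, 5, 5]

within_a = pearson(marker_a, trait_a)
within_b = pearson(marker_b, trait_b)
pooled = pearson(marker_a + marker_b, trait_a + trait_b)

print("within A:", round(within_a, 3))
print("within B:", round(within_b, 3))
print("pooled:  ", round(pooled, 3))
```

    The pooled correlation arises only because subgroup membership predicts both the marker and the trait, which is why structure terms are included in the association model.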
  • #22 If the samples are not randomly mated, it is critical that population structure be included in the association analysis.
  • #23 The significance of an association (p value) may be obtained from linear regression, ANOVA or one of several non-parametric statistical methods. An example is the TASSEL General Linear Model (Yu and Buckler, 2006; TASSEL: http://www.maizegenetics.net), a multiple regression model, combined with the estimates for the false discovery rate suggested by Kraakman et al. (2006). D′ is the standardized disequilibrium coefficient, which mainly measures recombinational history and is therefore useful to assess the probability of historical recombination in a given population. r² is essentially the correlation between the alleles at two loci; it summarizes both recombinational and mutational history and is useful in the context of association studies. Both parameters range from 0 to 1.
  • #29 The ARO group had seven accessions, too few to be analyzed. The distance over which LD dropped substantially (LD decay) varied from about 20 to 40 cM among the four main groups in our study, suggesting that a mapping resolution of between 20 and 40 cM could be achieved, with variation among different genetic groups and different chromosomes.
  • #30 30 marker loci were identified as having significant marker-trait associations for yield and its correlated traits. RM7003, co-associated with grain yield and plant weight, is reported to flank a major yield QTL (yld12.1); it is also near the QTL (gpp12.1) for grains per panicle (Thomson et al. 2003), the QTL (pss12.1) for seed set (Fu et al. 2010) and another QTL (qFG12-2) for filled grain number. RM431, RM340, and RM245 were found to be associated with the yield QTLs yld1.1 (Fu et al. 2010), qYI-6-1 and qYI-9 (Suh et al. 2005), respectively. Rid12, co-associated with tillers and plant weight, was found to be very close to the well-known QTL "Ghd7", which has major effects on grain yield, plant height and heading date (Xue et al. 2008), in addition to its function in rice pericarp color (Sweeney et al. 2006; Brooks et al. 2008). RM125, associated with tillers, was also identified as having a strong association with yield (Borba et al. 2010; Jiang et al. 2004). RM431, co-associated with plant height and tillers in this study, has been reported to be closely linked with the QTL "sd1", which decreases plant height and increases yield (Peng et al. 1999; Fu et al. 2010).
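    Testing many markers yields many p values, so some form of false discovery rate control is normally applied to them. The note does not describe the procedure of Kraakman et al. (2006), so the sketch below substitutes the widely used Benjamini-Hochberg step-up procedure as a stand-in. The function name and the p values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which tests are declared
    significant at false discovery rate `alpha` (Benjamini-Hochberg)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            max_rank = rank

    # Every test ranked at or below that k is declared significant.
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            significant[index] = True
    return significant


# Made-up p values for five marker-trait tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
print(benjamini_hochberg(p_values))
```

    In this example the third and fourth markers would pass an unadjusted 0.05 cut-off but are not declared significant once the number of tests is taken into account.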
  • #34 127 diverse rice varieties and landraces were used to analyse genetic diversity, and on that basis 16 diverse rice varieties were used for detecting polymorphism in the BADH gene.
  • #36 The BADH2 gene is a major locus responsible for rice aroma, where loss-of-function mutations in the gene lead to accumulation of gamma-aminobutyraldehyde (GABald), a precursor of the aroma compound 2-AP in the rice grains (Bradbury et al. 2008). Due to their similar biochemical function, it is anticipated that loss-of-function mutations in the BADH1 gene could also control rice aroma similarly to the BADH2 gene, particularly in salt and water stress conditions (Bradbury et al. 2008). It is important to note here that the loss-of-function mutation in the BADH2 gene is a primary requirement for aroma development due to constitutive expression of the BADH2 gene (Bradbury et al. 2005). However, loss of function of the BADH2 gene alone is not enough; it may be complemented with the BADH1 protein haplotype PH2 (SNP haplotypes SH1 and SH2) for full aroma expression. Thus, a combination of a loss-of-function mutation in the BADH2 gene and a reduction in the substrate binding capacity of the BADH1 enzyme for the aroma precursor compound GABald could be important for full aroma development in rice. For example, the popular crossbred basmati variety Pusa Basmati 1 has the badh2-exon 7 deletion mutation but has BADH1 haplotype PH1/SH1, which could be the reason for its mild aroma, whereas another popular crossbred basmati variety, Pusa 1121, has a rare allele of the BADH1 gene (haplotype PH1/SH14) that might lead to better aroma development than in Pusa Basmati 1.
  • #37 About 78% of all SNPs were found in intergenic regions; of the remaining SNPs, the largest number were in introns of annotated genes, followed by coding regions and untranslated regions of annotated genes
  • #38 It has previously been suggested that the photoperiod and temperature clines along latitudes may have been the primary factors driving differentiation of cultivated rice in China.
  • #39 Genome-wide LD decay rates of indica and japonica were estimated at ~123 kb and ~167 kb, where r² drops to 0.25 and 0.28, respectively. This is in agreement with the previous estimate that cultivated rice has long-range LD, from close to 100 kb to over 200 kb, which might be a result of self-fertilization coupled with a relatively small effective population size. GWAS was carried out on 14 agronomic traits, which can be divided into five categories: morphological characteristics (tiller number and leaf angle), yield components (grain width, grain length, grain weight and spikelet number), grain quality (gelatinization temperature and amylose content), coloration (apiculus color, pericarp color and hull color) and physiological features (heading date, drought tolerance and degree of seed shattering).
  • #40 Numbers of loci used to assign contributions to phenotypic variance are indicated at ends of bars.
  • #41 All accessions were purified for two generations (single seed descent) before DNA extraction. The 44,100 SNPs came from 2 data sources: SNPs from the Oryza SNP project, an oligomer array-based re-sequencing effort using Perlegen Sciences technology, and BAC clone Sanger sequencing of wild species from the OMAP project. The phenotypes examined were classified broadly into six categories: plant morphology related traits; yield-related traits; seed and grain morphology related traits; stress-related phenotypes; cooking, eating and nutritional-quality-related traits; and plant development, represented by flowering time.
  • #43 SNPs within a 200 kb range of known genes are in red; other significant SNPs are in blue. Candidate gene locations are shown as red vertical dashed lines with names on top. Panicle length in the aromatic subpopulation was not included in the GWAS because of the small sample size.