Data analysis tools and associated scientific developments ...



  1. 1. WP - C Data analysis tools Marco Bink & Gerrit Gort
  2. 2. Outline <ul><li>Overview Work Package C </li></ul><ul><ul><li>C1:Upgrade standard tools </li></ul></ul><ul><ul><ul><li>Partly presented by M. Frisch, HOH </li></ul></ul></ul><ul><ul><li>C2: Novel map-based tools </li></ul></ul><ul><ul><li>C3: Genome-wide and locus specific tools </li></ul></ul><ul><ul><li>C4: Large-data mining tools </li></ul></ul><ul><ul><li>C5: Germplasm Simulator </li></ul></ul><ul><ul><ul><li>Presented by M. Frisch, HOH </li></ul></ul></ul><ul><li>Concluding remarks </li></ul>
  3. 3. WP-C I Upgrading statistical analysis tools <ul><li>Objective: Upgrade standard cluster and correlation tools, able to handle large data sets </li></ul><ul><li>Case: cluster analysis in S-Plus </li></ul><ul><ul><li>clustering based on (genetic) distance matrix </li></ul></ul><ul><ul><li>S-Plus functions not sufficient for large data sets </li></ul></ul><ul><ul><ul><li>May depend on computer capacity </li></ul></ul></ul><ul><ul><li>BigClus algorithm (Gerrit Gort PRI) </li></ul></ul><ul><ul><ul><li>Written in C-code, accessible in S-Plus via dynamic link library (DLL) </li></ul></ul></ul>
  4. 4. WP-C I BigClus algorithm characteristics <ul><li>Methods of Clustering </li></ul><ul><ul><li>Single link </li></ul></ul><ul><ul><li>Complete link </li></ul></ul><ul><ul><li>Average link </li></ul></ul><ul><ul><li>McQuitty’s </li></ul></ul><ul><ul><li>Ward’s </li></ul></ul><ul><li>Distance measures </li></ul><ul><ul><li>Euclidean </li></ul></ul><ul><ul><li>Jaccard </li></ul></ul><ul><li>Allow missing values </li></ul><ul><ul><li>Jaccard </li></ul></ul>
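The Jaccard distance with missing marker scores can be sketched as below. This is a minimal illustration, not the BigClus C code: the function name `jaccard_distance`, the use of `None` for a missing score, and the choice to skip any marker that is missing in either accession (pairwise deletion) are all assumptions, since the slides do not say how missing values are treated.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two 0/1 marker profiles.

    None marks a missing score. Assumption: a marker that is missing in
    either profile is left out of the comparison (pairwise deletion).
    """
    both = 0    # markers present (1) in both profiles
    either = 0  # markers present (1) in at least one profile
    for x, y in zip(a, b):
        if x is None or y is None:
            continue
        if x == 1 and y == 1:
            both += 1
        if x == 1 or y == 1:
            either += 1
    if either == 0:
        # Convention chosen here: no band observed in either profile
        # means no evidence of a difference.
        return 0.0
    return 1.0 - both / either


# Two accessions scored on four markers; the fourth score is missing in the first.
print(jaccard_distance([1, 1, 0, None], [1, 0, 0, 1]))
```

Shared absences (0, 0) do not count towards similarity, which is the usual reason for preferring Jaccard over simple matching for dominant markers such as AFLP.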
  5. 5. WP-C I Dendrograms (from BigClus) <ul><li>Large datasets </li></ul><ul><ul><li>Ordinary dendrograms will not suffice (e.g., 5000 plants, 100 markers, Jaccard distance, UPGMA) </li></ul></ul><ul><li>Ability to look at part of the dendrogram </li></ul><ul><ul><li>e.g. show the first 25 clusters from the top and the number of observations below each leaf </li></ul></ul><ul><li>S-PLUS functions </li></ul><ul><ul><li>to plot the top of the tree and summary information on the tree, such as frequencies and cluster averages of covariates </li></ul></ul>
  6. 6. WP-C II Novel map-based tools <ul><li>Two important issues </li></ul><ul><li>Account for genetic linkage map information </li></ul><ul><ul><li>Consider molecular markers to be dependent variables </li></ul></ul><ul><li>Combine information from (a) trait characteristics (b) passport data and (c) molecular markers </li></ul><ul><ul><li>Map-based diversity tools, cluster & correlation analysis software </li></ul></ul><ul><li>Core - selection </li></ul>
  7. 7. WP-C II Account for genetic linkage maps <ul><li>Genetic distances </li></ul><ul><ul><li>Rationale: Data on genetic markers are likely correlated due to the underlying genetic map </li></ul></ul><ul><ul><li>Utilise correlation structure? </li></ul></ul><ul><ul><li>Account for correlation! </li></ul></ul><ul><ul><ul><li>Allow different weights for markers </li></ul></ul></ul><ul><ul><li>Unequal distribution of markers across genome </li></ul></ul>Figure panels: unlinked markers; loosely linked markers; closely linked markers
  8. 8. WP-C II Account for genetic linkage maps <ul><li>Correlation among linked markers: </li></ul><ul><ul><li>erodes with increasing number of meioses separating two individuals due to recombination </li></ul></ul><ul><ul><li>increases due to linkage disequilibrium (non-random mating / selection pressure) </li></ul></ul><ul><li>Use all available markers </li></ul><ul><ul><li>calculate weights for every marker locus </li></ul></ul><ul><ul><ul><li>Partial regression coefficients (Zeng, PNAS ’93) </li></ul></ul></ul><ul><ul><ul><li>Meioses factor (M_f) = expected average number of meioses separating two individuals </li></ul></ul></ul>
  9. 9. WP-C II Account for genetic linkage maps <ul><ul><li>set arbitrary value for M_f, the meioses factor </li></ul></ul><ul><ul><ul><li>= expected average number of meioses separating two individuals </li></ul></ul></ul><ul><ul><li>calculate r_e = min(0.5, r^(1/M_f)) </li></ul></ul><ul><ul><li>construct correlation matrix M with elements </li></ul></ul><ul><ul><ul><li>M(k,k) = 1.0 </li></ul></ul></ul><ul><ul><ul><li>M(k,l) = 1.0 - 2*r_e </li></ul></ul></ul><ul><ul><ul><li>unlinked markers: matrix M is diagonal </li></ul></ul></ul><ul><ul><li>calculate w = 1^T M^(-1) </li></ul></ul><ul><ul><ul><li>effective number of loci = sum(w) </li></ul></ul></ul><ul><ul><ul><li>unlinked markers: vector w is unity => sum(w) = N_loci </li></ul></ul></ul>
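The weight calculation on this slide can be sketched in a few lines. This is a hedged illustration, not the S-Plus function WeightMap(): the names `solve` and `marker_weights` are hypothetical, the input is assumed to be a symmetric matrix of pairwise recombination fractions r (0.5 for unlinked markers), and a small Gaussian elimination routine stands in for a matrix library. Because M is symmetric, the row vector 1^T M^(-1) equals the solution of M w = 1.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def marker_weights(recomb, meioses_factor):
    """Marker weights w = 1^T M^(-1) from pairwise recombination fractions.

    recomb[k][l] is the recombination fraction between markers k and l
    (assumed input format); the diagonal is ignored.
    """
    n = len(recomb)
    m = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(n):
            if k == l:
                m[k][l] = 1.0
            else:
                r_e = min(0.5, recomb[k][l] ** (1.0 / meioses_factor))
                m[k][l] = 1.0 - 2.0 * r_e
    return solve(m, [1.0] * n)


# Five unlinked markers: M is the identity, so every weight is 1.
unlinked = [[0.0 if k == l else 0.5 for l in range(5)] for k in range(5)]
w = marker_weights(unlinked, meioses_factor=2.0)
print(w, sum(w))
```

Two completely linked markers (r = 0) give a singular M, which the sketch reports as an error rather than resolving.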
  10. 10. WP-C II Account for genetic linkage maps Example! Unlinked markers: W = 1.0 and 1.0, M_eff = 5.0. Loosely linked markers: W = 0.5 and 0.7, M_eff = 2.9. Closely linked markers: W = 0.2 and 0.3, M_eff = 1.2.
  11. 11. WP-C II Combine passport, trait & marker info <ul><li>S-Plus software offers a very limited possibility to combine different types of data </li></ul><ul><ul><li>Function “Daisy()” applies normalization to all data variables, no specification of weights across variables </li></ul></ul><ul><li>Improve/extend function “Daisy()” </li></ul><ul><ul><li>Allow user-defined weights for every variable </li></ul></ul><ul><ul><li>S-Plus function WeightedDaisy() </li></ul></ul><ul><ul><ul><li>E.g., use weights for markers (from S-Plus function WeightMap() ) </li></ul></ul></ul>
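The idea of user-defined weights per variable can be illustrated with a weighted Gower-type dissimilarity for mixed data. This is only a sketch under stated assumptions, not the code of WeightedDaisy(): the function name `weighted_gower`, the `kinds`/`ranges` arguments and the handling of missing values (`None` is skipped) are hypothetical choices made for the example.

```python
def weighted_gower(x, y, kinds, ranges, weights):
    """Weighted dissimilarity between two records with mixed variable types.

    kinds[k]   : "numeric" or "nominal"
    ranges[k]  : range of variable k over the data set (numeric only)
    weights[k] : user-defined weight of variable k
    A variable that is None in either record is skipped (assumption).
    """
    num = 0.0
    den = 0.0
    for xv, yv, kind, rng, w in zip(x, y, kinds, ranges, weights):
        if xv is None or yv is None:
            continue
        if kind == "numeric":
            # Range-normalised absolute difference, between 0 and 1.
            d = abs(xv - yv) / rng if rng else 0.0
        else:
            # Simple mismatch for nominal variables and 0/1 markers.
            d = 0.0 if xv == yv else 1.0
        num += w * d
        den += w
    if den == 0.0:
        raise ValueError("no comparable variables")
    return num / den


# One trait, one passport descriptor, one marker; the marker is down-weighted.
a = [10.0, "black", 1]
b = [20.0, "white", 1]
kinds = ["numeric", "nominal", "nominal"]
ranges = [40.0, None, None]
print(weighted_gower(a, b, kinds, ranges, [1.0, 1.0, 1.0]))
print(weighted_gower(a, b, kinds, ranges, [1.0, 1.0, 0.5]))
```

With equal weights every variable contributes equally; lowering the weight of a marker (for example using a map-based weight) shifts the dissimilarity towards the trait and passport data.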
  12. 12. WP-C II Multiple sources of data for cluster analysis Figure panels: phenotypes (poor distinction); AFLP markers (poor distinction); MS markers (fair distinction)
  13. 13. WP-C II Combining multiple sources of data: standard weights ( daisy() ) vs. user-defined weights ( weighteddaisy() )
  14. 14. WP-C II Example marker weights 1.00 0.36 0.54 0.52 0.13 0.18 0.43 1.00
  15. 15. WP-C II Results of cluster analyses with and without weights Dendrogram panels: map-based weights; standard weights. Accessions labelled in both panels: 1, 63, 95, 10, 11, 57, 55, 74
  16. 16. WP-C II ( next step ) Core Selection <ul><li>Form N (e.g., 6) distinct groups </li></ul><ul><ul><li>cluster analysis tree </li></ul></ul><ul><ul><li>Cut tree at arbitrary level </li></ul></ul><ul><ul><li>Our example: group sizes </li></ul></ul><ul><ul><ul><li>No weights: 81, 7, 6, 2, 2, and 2 </li></ul></ul></ul><ul><ul><ul><li>Map-based weights: 4, 7, 81, 2, 4, and 2 </li></ul></ul></ul><ul><li>Sample/select a given number from each group </li></ul><ul><ul><li>Define core selection size, e.g., 12 </li></ul></ul><ul><ul><li>Sampling strategy: number per group (standard | map-based) </li></ul></ul><ul><ul><ul><li>Constant: [2 2 2 2 2 2] | [2 2 2 2 2 2] </li></ul></ul></ul><ul><ul><ul><li>Proportional: [7 1 1 1 1 1] | [1 1 7 1 1 1] </li></ul></ul></ul><ul><ul><ul><li>Logproportional: [5 2 2 1 1 1] | [1 2 5 1 2 1] </li></ul></ul></ul>
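The three sampling strategies can be sketched as an allocation routine. The slide does not state the rounding rule, so the one used here is an assumption: each group gets at least one accession, quotas are rounded down, and any surplus or shortfall is corrected from the largest allocation or the largest remainder. The function name `allocate` is hypothetical. Under this rule the sketch reproduces the slide's numbers for the unweighted group sizes (81, 7, 6, 2, 2, 2).

```python
import math


def allocate(sizes, n_core, strategy):
    """Split n_core selections over groups; every group gets at least one."""
    if n_core < len(sizes):
        raise ValueError("core size must be at least the number of groups")
    if strategy == "constant":
        weights = [1.0] * len(sizes)
    elif strategy == "proportional":
        weights = [float(s) for s in sizes]
    elif strategy == "logproportional":
        weights = [math.log(s) for s in sizes]
    else:
        raise ValueError("unknown strategy: " + strategy)
    total = sum(weights)
    if total == 0.0:
        # All groups have size 1 under log weighting: fall back to equal shares.
        weights = [1.0] * len(sizes)
        total = float(len(sizes))
    quotas = [n_core * w / total for w in weights]
    alloc = [max(1, math.floor(q)) for q in quotas]
    # Too many after enforcing the minimum: take from the largest allocation.
    while sum(alloc) > n_core:
        i = max(range(len(alloc)), key=lambda k: alloc[k])
        alloc[i] -= 1
    # Too few after rounding down: give to the largest remainder.
    while sum(alloc) < n_core:
        i = max(range(len(alloc)), key=lambda k: quotas[k] - alloc[k])
        alloc[i] += 1
    return alloc


sizes = [81, 7, 6, 2, 2, 2]  # group sizes without weights, from the slide
for strategy in ("constant", "proportional", "logproportional"):
    print(strategy, allocate(sizes, 12, strategy))
```

For the map-based group sizes the log-proportional case contains a tie between the two groups of size 4, so which of them receives the extra accession depends on tie-breaking. The sketch also does not cap an allocation at the group size.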
  17. 17. WP-C II Core selection (logproportional sampling) <ul><li>12 accessions selected from 6 clusters </li></ul>Cut off line 3 42 24,77 4,23 5 14,25,68,94,95 Tree from map-based clustering
  18. 18. WP-C III Genome-wide and Locus-specific mapping <ul><li>Objective was to develop novel map-based tools for searching systematically for useful genes and alleles in germplasm collections </li></ul><ul><li>Genome-wide search </li></ul><ul><li>Tagged loci search (fine mapping) </li></ul>
  19. 19. WP-C III Genome-wide mapping <ul><li>Marker-marker association </li></ul><ul><ul><li>Assemble genome-wide map of AFLP markers (no map available) </li></ul></ul><ul><ul><li>Only a few markers could be mapped last summer (KeyGene) </li></ul></ul><ul><ul><li>Are high associations indicative of the distance between markers on the genome? </li></ul></ul><ul><li>Marker-trait association </li></ul><ul><ul><li>More interesting to associate markers with traits, e.g. Bremia resistance, in order to map the genes coding for the trait </li></ul></ul><ul><ul><li>But: if high associations between markers are not indicative of the distance between them, does it make sense to associate markers with traits? </li></ul></ul>
  20. 20. WP-C III Retrieval of linkage map from genome wide pair-wise marker associations <ul><li>Multi Dimensional Scaling (MDS) </li></ul><ul><ul><li>One-dimensional representation of markers from pair-wise distances is achieved, corresponding to a marker map. </li></ul></ul><ul><ul><li>Correction for population structure is very important </li></ul></ul><ul><ul><ul><li>Logistic regression correction by stratification </li></ul></ul></ul><ul><li>Three types of MDS (S-PLUS) evaluated </li></ul><ul><ul><ul><li>Classical ( = PCO = Principal Coordinate Analysis) </li></ul></ul></ul><ul><ul><ul><li>Kruskal's ( = non-metric MDS ) </li></ul></ul></ul><ul><ul><ul><li>Sammon’s MDS ( minimizes weighted “stress”) (performs best) </li></ul></ul></ul>
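How a one-dimensional marker ordering falls out of pairwise distances can be shown with classical MDS (principal coordinate analysis), the simplest of the three variants listed. This is a minimal sketch, not the S-PLUS routines used in the project and not Sammon's method, which the slide reports as performing best. The name `mds_1d` is hypothetical, and power iteration is used here only to avoid a linear-algebra dependency; it assumes the dominant eigenvalue of the centred matrix is positive.

```python
import math
import random


def mds_1d(dist, iterations=500, seed=0):
    """One-dimensional classical MDS coordinates from a distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration for the dominant eigenvector of B.
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = math.sqrt(sum(t * t for t in bv))
        if eigenvalue == 0.0:
            break
        v = [t / eigenvalue for t in bv]
    # Coordinates are the eigenvector scaled by sqrt(eigenvalue);
    # the sign (orientation of the map) is arbitrary.
    return [math.sqrt(eigenvalue) * t for t in v]


# Toy example: four markers at made-up positions, listed in shuffled order.
positions = [3.0, 0.0, 6.0, 1.0]
dist = [[abs(p - q) for q in positions] for p in positions]
coords = mds_1d(dist)
order = sorted(range(len(coords)), key=lambda i: coords[i])
print(order)  # marker order along the recovered map, up to reversal
```

Real association-based distances are noisy and not exactly additive, so the recovered coordinates only approximate a map; that is where the correction for population structure and the non-classical MDS variants come in.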
  21. 21. WP-C III Example MDS to form linkage map
  22. 22. WP-C III Resolution of QTL (fine) mapping <ul><li>Experiments of linkage analysis </li></ul><ul><ul><li>2 or 3 generations of individuals </li></ul></ul><ul><ul><li>limited number of meioses in experiment </li></ul></ul><ul><ul><li>dense marker maps hardly improve QTL map resolution </li></ul></ul><ul><ul><ul><li>Even with RIL populations: 5 - 10 centiMorgan </li></ul></ul></ul><ul><li>Higher resolution desired to </li></ul><ul><ul><li>allow better (molecular) study of gene involved </li></ul></ul><ul><ul><ul><li>cloning, comparative mapping, etc. </li></ul></ul></ul><ul><ul><li>identify tightly linked markers </li></ul></ul><ul><ul><ul><li>more efficient marker-assisted breeding </li></ul></ul></ul>
  23. 23. WP-C III Locus specific (Fine) mapping Linkage disequilibrium mapping has been successful in mapping genetic disorders: identify a chromosomal region that is identical by descent (IBD) among diseased individuals (the region may contain the disease gene). The IBD region is detected by closely linked marker loci that carry identical alleles at this region in the diseased individuals. The size of the IBD region decreases with the number of meioses since the disease mutation occurred and may be small. <ul><li>This leads to the detection of a small region containing the disease gene. </li></ul>Key paper: Meuwissen & Goddard (2000) Genetics 155:421-430
  24. 24. WP-C III Methodology LD fine-mapping of QTL <ul><li>QTL position known up to 5 - 20 cM precision </li></ul><ul><ul><li>effective population size for many discrete generations </li></ul></ul><ul><ul><li>phenotypes available for last generation of individuals </li></ul></ul><ul><ul><li>Fully inbred individuals (selfed by single seed descent) </li></ul></ul><ul><li>(1) Expected correlation matrix among marker haplotypes </li></ul><ul><ul><li>Whether two marker haplotypes have identical alleles in a region depends on the position of the QTL . Hence, the covariance between haplotype effects depends on the position of the QTL . </li></ul></ul><ul><ul><li>Identity By Descent (IBD) probability </li></ul></ul><ul><li>(2) Maximum Likelihood estimation of QTL position </li></ul><ul><ul><li>Linear model (phenotypes normally distributed) </li></ul></ul><ul><ul><li>ML estimates for each marker interval </li></ul></ul>
  25. 26. WP-C III Calculate power of QTL fine-mapping <ul><ul><li>20 markers = 19 intervals </li></ul></ul><ul><ul><ul><li>Simulation: QTL between marker 10 and 11 </li></ul></ul></ul><ul><ul><ul><li>Estimated interval: interval with highest ML estimate </li></ul></ul></ul><ul><ul><ul><ul><li>ML_base = ML without QTL </li></ul></ul></ul></ul><ul><ul><ul><ul><li>ML_QTL,I = ML with QTL in interval I </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Test statistic: ML_QTL,I - ML_base </li></ul></ul></ul></ul><ul><ul><li>Deviations of estimated interval from true interval </li></ul></ul><ul><ul><ul><li>-9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 </li></ul></ul></ul><ul><li>Power </li></ul><ul><ul><li>% of replicates in the true interval or next to it (deviation -1, 0, or 1) </li></ul></ul><ul><li>1000 replicates/scenario </li></ul>Power = 0.91
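The power definition above amounts to a simple count over replicates, sketched below. The function names `estimated_interval` and `power` are hypothetical, and the deviations in the usage example are made-up numbers for illustration, not simulation results from the project.

```python
def estimated_interval(ml_qtl, ml_base):
    """Index of the interval with the highest test statistic ML_QTL,I - ML_base."""
    stats = [ml - ml_base for ml in ml_qtl]
    return max(range(len(stats)), key=lambda i: stats[i])


def power(deviations, tolerance=1):
    """Fraction of replicates whose estimated interval lies within
    `tolerance` intervals of the true interval."""
    if not deviations:
        raise ValueError("no replicates")
    hits = sum(1 for d in deviations if abs(d) <= tolerance)
    return hits / len(deviations)


# One replicate: likelihood values for three intervals (illustrative numbers).
print(estimated_interval([-12.0, -9.5, -11.0], ml_base=-13.0))

# Ten replicates: deviation of the estimated interval from the true one.
deviations = [0, 1, -1, 3, 0, -2, 0, 0, 1, -4]
print(power(deviations))
```

With 19 intervals and the QTL in the tenth interval, the deviation for a replicate is the estimated interval index minus the true interval index, which is why it ranges from -9 to 9.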
  26. 27. WP-C III Results of power from simulations
  27. 28. WP-C III Locus-specific search <ul><li>Separate modules (C–language) for </li></ul><ul><ul><li>Calculation of IBD probabilities (= expected correlation matrix) </li></ul></ul><ul><ul><li>Simulation of data sets & Max. Likelihood estimation </li></ul></ul><ul><li>Paper on methodology and simulation results </li></ul><ul><ul><li>Bink & Meuwissen (2004) Euphytica, in press </li></ul></ul>
  28. 29. WP-C IV Large-data mining tools <ul><li>Objective was to find important patterns within the germplasm data set, which are not apparent from visual analysis and to compare and contrast these patterns with those found from the classical statistical analyses </li></ul>
  29. 30. WP-C IV Large-data mining tools <ul><li>Methods </li></ul><ul><ul><li>Data Mining methods (JIC) </li></ul></ul><ul><ul><ul><li>Decision Trees, Built with C4.5 (Quinlan) [ DAM ] </li></ul></ul></ul><ul><ul><ul><li>Rule induction, Simulated Annealing: Witness Miner 2001 </li></ul></ul></ul><ul><ul><li>Artificial Neural Networks (PRI) </li></ul></ul><ul><ul><ul><li>Learning Vector Quantisation [ LVQ ] </li></ul></ul></ul><ul><ul><ul><li>Support Vector Machines [ SVM ] </li></ul></ul></ul><ul><ul><li>(Classical) Statistical analysis (PRI) </li></ul></ul><ul><ul><ul><li>LDA/Linear Regression [ CS ] </li></ul></ul></ul>
  30. 31. WP-C IV Large-data mining ‘data set’ <ul><li>Data: 1423 Lactuca sativa accessions, CGN </li></ul><ul><ul><li>X1: 167 AFLP markers </li></ul></ul><ul><ul><li>X2: 20 (2x10) STMS markers </li></ul></ul><ul><ul><li>Y: 5 traits, all treated as categorical variables </li></ul></ul><ul><ul><ul><li>Y1:[n=1413] seed colour (black, white, varied) </li></ul></ul></ul><ul><ul><ul><li>Y2:[n= 761] flowering time (< 41 d., 41-60, …. 101-120 d.) </li></ul></ul></ul><ul><ul><ul><li>Y3:[n=1208] leaf colour (yellow, green, grey, blue, red) </li></ul></ul></ul><ul><ul><ul><li>Y4:[n= 927] resistance to Bremia 1 (resistant, susceptible) </li></ul></ul></ul><ul><ul><ul><li>Y5:[n= 919] resistance to Bremia 3 (resistant, susceptible) </li></ul></ul></ul><ul><li>Data split into training and test samples (50 - 50) </li></ul><ul><li>Objective: use X to predict Y </li></ul>
  31. 32. Criteria: coverage, accuracy, applicability
  32. 33. Results: Resistance to Bremia 1 <ul><li>Performance across all traits: </li></ul><ul><ul><li>LVQ lowest </li></ul></ul><ul><ul><li>CS good </li></ul></ul><ul><ul><li>SVM & DAM best </li></ul></ul><ul><li>Note: differences not very large! </li></ul>Trade-off between increasing accuracy and maintaining coverage/applicability!
  33. 34. Concluding remarks: Novel statistical tools available. Answer questions you could not answer before? Applicability & integration with other WP–tools. © Wageningen UR