Results – stem rust on wheat. Classifier method, PPV, LR+ ...
Results of stem rust (Ug99) on wheat. 4563 wheat landraces screened for Ug99. 10.2 % ...
Conclusion ... Results – Raw data vs Transformed data – PLS vs PCA ...

Searching for traits in PGR collections using Focused Identification of Germplasm Strategy

  • Landrace samples (genebank seed accessions). Trait observations (experimental design): high-cost data. Climate data (for the landrace location of origin): low-cost data. The accession identifier (accession number) provides the bridge to the crop trait observations. The longitude and latitude coordinates of the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
  • GRIN database (USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs) USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049
  • Photo: USDA ARS Image k1192-1, http://www.ars.usda.gov/is/graphics/photos/mar09/k11192-1.htm
  • USDA ARS Image Archive, http://www.ars.usda.gov/is/graphics/photos/
  • Photo: Wheat infected by stem rust (Ug99) at the Kenya Agricultural Research Station in Njoro northwest of Nairobi.
  • Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec 2011.

    1. Searching for traits in PGR collections using Focused Identification of Germplasm Strategy (FIGS). Abdallah Bari, Kenneth Street, Michael Mackay, Eddy De Pauw, Dag Endresen, Ahmed Amri, Kumarse Nazari and Ammor Yahiaoui. CIAT, Palmira, Colombia, 14 March 2012. Grain Research & Development Corporation
    2. Content • Background – PGR - traits – FIGS - traits • Objective – Develop a priori information – Develop best-bet subset of accessions with traits • Datasets – Trait data – Environmental data • Methodologies – Data preparation – Modeling techniques • Results/Discussion – Sub-setting (accessions/variables) – “Hot spots” • Conclusion
    3. ICARDA. ICARDA’s worldwide presence. International Center for Agricultural Research in the Dry Areas (ICARDA)
    4. ICARDA. PGR centers of origin and diversity
    5. PGR contribution. Traits of importance to agriculture: – phenological adaptation (short growth duration), – efficient use of water, – resistance to biotic stresses (diseases and insects), – tolerance to abiotic stresses (such as drought and salinity), and – superior grain quality. Plant pre-evaluation
    6. PGR Challenges • 50,000 to 60,000 traits (loci) • 7 million accessions • 1,400 genebanks. Seed samples
    7. PGR Challenges. A needle in a haystack. PGR users want variation for specific traits and a hundred germplasm accessions to evaluate.
    8. PGR Challenges and Concerns • Size of collections – Addressed by Brown et al. 1999 • Cost of evaluating accessions lacking the desired trait – Addressed by Gollin et al. 2000
    9. Content • Background – PGR traits – FIGS • Objective – Develop a priori information – Develop best-bet subset of accessions with traits • Datasets – Trait data – Environmental data • Methodologies – Data preparation – Modeling techniques • Results/Discussion – Sub-setting (accessions/variables) – “Hot spots” • Conclusion
    10. Objective. FIGS searches germplasm collection data to detect trait-environment patterns or relationships (as a priori information). This a priori information is then used to develop predictive models that find novel genetic variation for the traits of interest and show where it is most likely to occur. Workflow: quantification of trait-environment relationship → a priori information → develop trait subsets → utilization of genetic resources
    11. Origin of FIGS approach. Boron toxicity of wheat and barley: early FIGS examples. [Map: Mediterranean Sea] Wheat landraces from soils of marine origin in the Mediterranean region provided all the genetic variation needed to produce boron-tolerant varieties. M.C. Mackay, 1995
    12. FIGS approach. “FIGS applies to plant genetic resources (stored collections) the same selection pressure exerted on plants by evolution.” [Diagram labels: PGR collection, sampling, core (biodiversity); PGR, sampling, trait, user (biodiversity)]
    13. FIGS approach. FIGS has helped breeders identify long sought-after plant traits such as resistance to: – Net blotch (barley), – Powdery mildew, – Russian wheat aphid (RWA) and – Sunn pest. Braidotti, G. 2009. Keys to the gene bank, Biotechnology. Partners in Research for Development 16-17.
    14. Sunn pest resistance trait. 8 landrace accessions from Afghanistan and 2 from Tajikistan identified as resistant at the juvenile stage. Mapping populations are now being developed.
    15. FIGS approach to Pm. 16,000 wheat landraces → FIGS applied → 1,300 selected → phenotyping (40% yielded accessions that were resistant to the isolates used) → 211 accessions between R and IR → genotyping → 7 new alleles. At least 2 have a new race specificity. 100 years of classical genetics = 7 alleles. Kaur K; Street K; Mackay M; Yahiaoui N; Keller B (2008). Allele mining and sequence diversity at the wheat powdery mildew resistance locus Pm3. 11th IWGS, 24-29 Aug., Brisbane.
    16. Locating new Pm3 alleles. The distribution of the seven new functional alleles of Pm3. Out of 96.2% of the total set screened. [Map labels: Turkey, Afghanistan, Iran, Pakistan and Armenia]
    17. The FIGS picture. Genotypes x Environments x Time¹ = Genetic Variation. Can we use the same evolutionary principles in reverse to identify the environments that ‘engender’ trait-specific genetic variation? Environments x Traits x Time = Trait variation (ExT)? (¹ plus some selection)
    18. Examples of eco-geographic variation of traits linked to environmental influences (after M. C. Mackay)
        Environment influence | Trait | Species | Reference
        Low altitudes, high winter temp., low summer rain, spring cloudiness | Cyanogenesis | Trifolium repens | Pederson, Fairbrother et al. 1996
        Aridity | Seed dormancy, early flowering, high seed to pod ratio | Annual legumes | Ehrman and Cocks 1996
        Soil type | Tolerance to boron toxicity | Bread wheat | Mackay (1990)
        Altitude, winter temp, RWA distribution | Russian Wheat Aphid (RWA) resistance | Bread wheat | Bohssini, et al accepted for publication 2008
        Temperature, aridity | Drought resistance | Triticum dicoccoides | Peleg, Fahima et al. 2005
        Altitude | Glume colour and beak length | Durum wheat | Bechere, Belay et al. 1996
        Climate, soil and water availability | Heading date, culm length, biomass, grain yield and its components | Triticum dicoccoides | Beharav and Nevo 2004
        Precipitation, minimum January temperature, altitude | Glutenin diversity | Durum wheat | Vanhintum and Elings 1991
        Temperature, aridity | More efficient RUBISCO activity | Woody perennials | Galmes et al, 2005
        Water relations, temperature | Hordatine accumulation (disease defence) | Barley | Batchu, Zimmermann et al. 2006
    19. FIGS system. [Diagram labels: PGR collections; user-defined needs; database; interface; filters (type of material, evaluation data, collection site, other information); size limit (250, 500, 750, 1500); new subset] See www.figstraitmine.com. After M. C. Mackay 1995
    20. Mining natural variation. By linking traits, environments (and associated selection pressures) with genebank accessions (e.g. landraces and crop relatives) we can ‘focus’ in on those accessions most likely to possess trait-specific genetic variation. [Figure: map of collecting sites, longitude vs latitude; labels: Environment, Trait, FIGS subset]
    21. FIGS approach, summarized. Focused Identification of Germplasm Strategy. [Diagram labels: Environment (E), geo-referencing of collecting places; Trait (T), evaluation (phenotyping); Accession (G)]
    22. Content • Background – PGR traits – FIGS • Objective – Develop a priori information – Develop best-bet subset of accessions with traits • Datasets – Trait data – Environmental data • Methodologies – Data preparation – Modeling techniques • Results/Discussion – Sub-setting (accessions/variables) – “Hot spots” • Conclusion
    23. Eco-climate data (X). ICARDA eco-climatic database, average: annual temperature (front), annual precipitation (middle), and winter precipitation (back) (De Pauw 2008). Climate data (X as independent variables). Source: International Center for Agricultural Research in the Dry Areas (ICARDA)
        site_code1 | prec01 | prec02 | prec03 | prec04 | prec05 | ….. | ari01 | ari02 | ari03 | ari04 | ari05
        ETH-S893 | 25 | 36 | 72 | 154.22 | 148.88 | ….. | 0.167 | 0.246 | 0.439 | 1.098 | 1.169
        ETH-S1222 | 29 | 44 | 92 | 167.46 | 168 | ….. | 0.223 | 0.344 | 0.646 | 1.354 | 1.612
        NS_339 | 44 | 67 | 130.43 | 177.96 | 185.74 | ….. | 0.351 | 0.552 | 0.949 | 1.457 | 1.751
        ETH-S1153 | 36 | 48 | 86 | 140.92 | 131.94 | ….. | 0.28 | 0.39 | 0.609 | 1.108 | 1.078
        NS_415 | 32 | 46.61 | 95.42 | 150.3 | 157 | ….. | 0.271 | 0.419 | 0.732 | 1.289 | 1.437
        NS_424 | 31.94 | 45 | 90 | 143.62 | 150 | ….. | 0.257 | 0.38 | 0.641 | 1.146 | 1.272
        ETH64:55 | 28 | 38.26 | 57 | 97.57 | 81 | ….. | 0.247 | 0.344 | 0.45 | 0.834 | 0.662
        NS_525 | 28 | 39 | 57 | 97.13 | 80.78 | ….. | 0.248 | 0.352 | 0.452 | 0.836 | 0.669
        NS_526 | 27 | 39 | 57 | 97.01 | 80.77 | ….. | 0.241 | 0.354 | 0.455 | 0.842 | 0.68
        NS_559 | 23 | 40 | 61.89 | 129.04 | 102 | ….. | 0.226 | 0.397 | 0.511 | 1.206 | 0.998
        ...
    24. Eco-climate data (X). Layers used in the stem rust studies: • Precipitation (rainfall) • Maximum temperatures • Minimum temperatures. Plus derived GIS layers such as: • Potential evapotranspiration (water loss) • Agro-climatic zone (UNESCO classification) • Moisture/aridity index (mean values for month and year)
    25. Trait data set (Y). Trait data (Y as dependent variable). Photo: http://www.news.cornell.edu/. Source: (USDA) National Genetic Resources Program (NGRP) GRIN database
        site_code1 | R_state0 | R_state1 | R_state2 | R_state3 | R_state4 | R_state5 | R_state6 | R_state7 | R_state8 | R_state9
        ETH-S893 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
        ETH-S1222 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
        NS_339 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0
        ETH-S1153 | 0 | 0 | 0 | 0 | 2 | 1 | 3 | 0 | 0 | 0
        NS_415 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
        NS_424 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
        ETH64:55 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
        NS_525 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
        NS_526 | 0 | 1 | 2 | 1 | 2 | 0 | 3 | 0 | 0 | 0
        NS_559 | 2 | 5 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0
        ETH64:53 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
        ...
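The climate table on slide 23 and the trait table on slide 25 share the site_code1 column, so the two can be linked row by row before modeling. The following is a minimal sketch of that join in Python, not the authors' code (the deck names R as its platform). The function name `join_by_site` and the choice of an inner join are assumptions; the values are a few rows copied from the two slides.

```python
# A few rows copied from the climate (X) table on slide 23; only two columns kept.
climate = {
    "ETH-S893": {"prec01": 25.0, "ari02": 0.246},
    "ETH-S1222": {"prec01": 29.0, "ari02": 0.344},
    "NS_339": {"prec01": 44.0, "ari02": 0.552},
}

# A few rows copied from the trait (Y) table on slide 25: counts for R_state0..R_state9.
trait = {
    "ETH-S893": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    "NS_339": [0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
    "ETH64:53": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
}


def join_by_site(climate_rows, trait_rows):
    """Inner join on site code (hypothetical helper).

    Sites that are missing from either table are dropped, because a model
    needs both the predictors (X) and the response (Y) for every row.
    """
    shared = sorted(set(climate_rows) & set(trait_rows))
    return [(site, climate_rows[site], trait_rows[site]) for site in shared]


joined = join_by_site(climate, trait)
for site, x_row, y_row in joined:
    print(site, x_row, y_row)
```

ETH-S1222 has climate data but no trait row in this toy subset, and ETH64:53 has the reverse, so both drop out of the joined result.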
    26. Searching for the stem rust resistance trait: concerns. Stem rust is spreading to wheat production areas. Photo: http://www.news.cornell.edu/
    27. Stem rust on wheat landraces: trait data. Green dots indicate collecting sites for resistant wheat landraces and red dots collecting sites for susceptible landraces. Field experiments made in Minnesota by Don McVey. USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049
    28. Content • Background – PGR traits – FIGS • Objective – Develop a priori information – Develop best-bet subset of accessions with traits • Datasets – Trait data – Environmental data • Methodologies – Data preparation – Modeling techniques • Results/Discussion – Sub-setting (accessions/variables) – “Hot spots” • Conclusion
    29. Data preparation. Climate data (X as independent variables). Power relationship ~ 2(p) (spread). Example column ari02 by site_code: ETH-S893 0.246, ETH-S1222 0.344, NS_339 0.552, ETH-S1153 0.390, NS_415 0.419, NS_424 0.380, ETH64:55 0.344, NS_525 0.352, NS_526 0.354, NS_559 0.397. [Figure: two frequency histograms of the aridity or moisture index during February; x-axis 0 to 15 (left) and -4 to 4 (right)]
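Slide 29 contrasts a skewed histogram of the February aridity index with a roughly symmetric one. The slide does not say which transformation produced the second panel, so the sketch below is only an assumption: a Box-Cox-style power transform (the log when p is 0) followed by standardization, applied to the ten ari02 values listed on the slide. The function names are hypothetical.

```python
import math
import statistics


def power_transform(values, p):
    """Box-Cox-style transform: log(x) when p == 0, else (x**p - 1) / p.

    Assumed for illustration only; it requires strictly positive inputs.
    """
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    if p == 0:
        return [math.log(v) for v in values]
    return [(v ** p - 1.0) / p for v in values]


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# ari02 values for the ten sites listed on slide 29.
ari02 = [0.246, 0.344, 0.552, 0.390, 0.419, 0.380, 0.344, 0.352, 0.354, 0.397]
z = standardize(power_transform(ari02, p=0))
print([round(v, 2) for v in z])
```

Standardizing after the transform puts every climate variable on the same scale, which matters for methods such as PCA and PLS that are sensitive to the spread of each predictor.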
    30. Platform. R language (development of algorithms), for modeling: > Data transformation ( ) > Model <- model(trait ~ climate) > Measuring accuracy metrics > …. Geographical Information System (GIS), ArcGIS: environmental data/layers (surfaces), for generation of environmental data
    31. Modeling framework. Trait data (Y), environmental data (X): Y ~ f(X). First, a linear approach, irrespective of the underlying distributions describing the data. X is the set of explanatory variables or predictors (climate data), where X ∈ R^m, and Y is either a categorical (label) or a numerical response (trait descriptor states).
    32. Modeling framework • Principal component analysis (PCA) • Partial Least Square (PLS) • Random Forest (RF) • Support Vector Machines (SVM) • Neural Networks (NN). Bari A., Street K., Mackay M., Endresen D.T.F., De Pauw E. & Amri A. (2011) Focused identification of germplasm strategy (FIGS) detects wheat stem rust resistance linked to environmental variables. Genetic Resources and Crop Evolution http://www.springerlink.com/content/m7140x68v2065113/fulltext.pdf
    33. Principal Component Analysis (PCA). B is a matrix of coefficients. The prediction was initially carried out using the number of components (PCs) that accounts for 95% of explained variance, followed by adding one component at a time until the error reached a minimum.
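The starting point described on slide 33, the number of PCs that accounts for 95% of explained variance, can be read off the eigenvalues of the predictor covariance matrix. The sketch below assumes the eigenvalues are already available; the function name and the eigenvalues themselves are hypothetical.

```python
def n_components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose cumulative share of
    the total variance reaches the threshold (hypothetical helper)."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues, not taken from the stem rust study.
eigenvalues = [5.0, 2.5, 1.5, 0.6, 0.3, 0.1]
k = n_components_for_variance(eigenvalues)
print(k)
```

With these made-up values the first three components explain 90% of the variance and the first four explain 96%, so four components are the starting point before components are added one at a time.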
    34. Partial Least Square (PLS). PLS: a product of factors and their loadings (regression coefficients), where both the environmental dataset and the trait dataset are considered simultaneously. The prediction was initially carried out using the number of components (PCs) that accounts for 95% of explained variance, followed by adding one component at a time until the error reached a minimum.
    35. Random Forest (RF). [Diagram: data → bootstrapping (with replacement) → training set and out-of-bag (OOB) set → ntree 1, ntree 2, ..., ntree 1000]
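The bootstrapping step in the Random Forest diagram on slide 35 can be sketched as follows: each tree draws its training rows with replacement, and the rows never drawn form that tree's out-of-bag (OOB) set. This shows the sampling step only, not a full random forest; the sample size of 100 and the seed are arbitrary assumptions, while the 1000 trees follow the slide.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples row indices with replacement (the in-bag training set)
    and return them together with the indices that were never drawn (OOB)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


rng = random.Random(42)  # fixed seed so the sketch is reproducible
splits = [bootstrap_split(100, rng) for _ in range(1000)]  # one split per tree
mean_oob = sum(len(oob) for _, oob in splits) / len(splits)
print(round(mean_oob, 1))
```

On average a little over a third of the rows end up out of bag for each tree, which is what lets the OOB rows act as a built-in test set.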
    36. Support Vector Machines (SVM). SVM is a learning-based technique that maps input data to a high-dimensional space and optimally separates the mapped input into the respective classes. The mapping goes from an l-dimensional space (input variable space) into a k-dimensional space, where k is higher than l.
    37. Neural Networks (NN). Neural Networks (RBF). [Figure: network with inputs x1, x2, ..., xp and output F(x); error versus number of epochs for the test set and the training set]
    38. Optimization/tuning. [Figure: error for the test set and the training set versus number of PCs, LVs or epochs] Trend of output error versus the number of components (PCs/LVs) or epochs (NN)
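The tuning rule on slide 38 amounts to picking the number of components (or epochs) at which the test-set error is lowest, since the training error keeps falling as the model grows. A minimal sketch with hypothetical error curves:

```python
def best_n_components(test_errors):
    """Return the component count with the lowest test-set error.

    test_errors[i] is the error obtained with i + 1 components; on ties the
    smallest count wins (hypothetical helper).
    """
    best_index = min(range(len(test_errors)), key=test_errors.__getitem__)
    return best_index + 1


# Hypothetical RMSE curves: training error keeps falling, test error turns up.
train_rmse = [0.46, 0.44, 0.43, 0.42, 0.41, 0.40]
test_rmse = [0.47, 0.45, 0.44, 0.43, 0.44, 0.46]
chosen = best_n_components(test_rmse)
print(chosen)
```

Selecting on the training curve instead would always favor the largest model, which is why the test curve is the one that decides.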
    39. Accuracy metrics. Parameters that provide information on the specificity (“trait agro-climate”). Confusion matrix (2-by-2 contingency table):
        | Observed resistant | Observed susceptible
        Predicted resistant | a | b
        Predicted susceptible | c | d
        Sensitivity = a / (a + c) and specificity = d / (b + d) are indicators of the model's ability to correctly classify observations.
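The two formulas on slide 39 translate directly into code. The cell names a, b, c and d follow the confusion matrix on the slide; the counts used here are hypothetical.

```python
def sensitivity(a, b, c, d):
    """Share of observed resistant accessions that were predicted resistant."""
    return a / (a + c)


def specificity(a, b, c, d):
    """Share of observed susceptible accessions that were predicted susceptible."""
    return d / (b + d)


# Hypothetical counts: a = predicted and observed resistant, b = predicted
# resistant but observed susceptible, c = predicted susceptible but observed
# resistant, d = predicted and observed susceptible.
a, b, c, d = 40, 20, 10, 130
print(sensitivity(a, b, c, d), specificity(a, b, c, d))
```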
    40. Accuracy metrics. Parameters that provide information on the specificity (“trait agro-climate”). High AUC (area) values are an indication of a potential trait-environment relationship. [Figure: the ROC curve and the resulting pdfs of trait distribution (trait states)]
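The AUC on slide 40 can be computed without drawing the ROC curve: it equals the probability that a randomly chosen resistant accession receives a higher model score than a randomly chosen susceptible one, with ties counted as half. The sketch below uses that pairwise definition; the scores are hypothetical.

```python
def auc(scores_resistant, scores_susceptible):
    """Area under the ROC curve from pairwise comparisons of model scores.

    Counts the share of (resistant, susceptible) pairs in which the resistant
    accession scores higher; ties count as half a win.
    """
    wins = 0.0
    for r in scores_resistant:
        for s in scores_susceptible:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(scores_resistant) * len(scores_susceptible))


# Hypothetical predicted scores for observed resistant and susceptible accessions.
resistant = [0.9, 0.7, 0.4]
susceptible = [0.8, 0.3, 0.2, 0.1]
print(round(auc(resistant, susceptible), 3))
```

A value near 0.5 means the scores separate the two groups no better than chance, which is the "randomness" case shown on slide 41.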
    41. Accuracy metrics. Randomness. [Figure: ROC curve and pdfs of trait distribution]
    42. Content • Background – PGR traits – FIGS • Objective – Develop a priori information – Develop best-bet subset of accessions with traits • Datasets – Trait data – Environmental data • Methodologies – Data preparation – Modeling techniques • Results/Discussion – Sub-setting (accessions/variables) – “Hot spots” • Conclusion
    43. Data preparation - Raw data. PCs = 42; AUC = 0.67; Kappa = 0.40. [Figure: RMSE versus number of components; ROC curve (true positive rate versus false positive rate); density distribution by trait]
    44. Data preparation – Transformed data. PCs = 42; AUC = 0.71; Kappa = 0.45. [Figure: RMSE versus number of components; ROC curve (true positive rate versus false positive rate); density distribution by trait]
    45. Data preparation - Raw data (PLS). LVs = 30; AUC = 0.70; Kappa = 0.43. [Figure: RMSE versus number of components; ROC curve (true positive rate versus false positive rate); density distribution by trait]
    46. Data preparation – Transformed data. LVs = 22; AUC = 0.71; Kappa = 0.44. [Figure: RMSE versus number of components; ROC curve (true positive rate versus false positive rate); density distribution by trait]
    47. Optimization process. [Figure: RMSEP for R_CALC versus number of components, two panels] Root mean square error of prediction (RMSEP) for PCA (left) and PLS (right) models. Arrows indicate the minimum errors, where the number of components (PCs and LVs) was selected for prediction (red/discontinuous line = test data, continuous line = training set)
    48. PCA, PC2. Few components ~ random. [Figure: ROC curve (true positive rate versus false positive rate); density distribution per R_CALC for resistant and susceptible]
    49. PCA, PC5. [Figure: ROC curve (true positive rate versus false positive rate); density distribution per R_CALC for resistant and susceptible]
    50. PLS, LV2. 2 latent variables of PLS are better than 2 PCs of PCA. [Figure: ROC curve (true positive rate versus false positive rate); density distribution per R_CALC for resistant and susceptible]
    51. PLS, LV10. [Figure: ROC curve (true positive rate versus false positive rate); density distribution per R_CALC for resistant and susceptible]
    52. PCA (optimized). [Figure: ROC curve (true positive rate versus false positive rate); density of prediction]
    53. PLS (optimized). [Figure: ROC curve (true positive rate versus false positive rate); density of prediction]
    54. RF. [Figure: ROC curve (true positive rate versus false positive rate); density of prediction]
    55. SVM. [Figure: ROC curve (true positive rate versus false positive rate); density of prediction]
    56. NN. [Figure: ROC curve (true positive rate versus false positive rate); density of prediction]
    57. Random (PCA). Top: complete random distribution of the stem rust resistance trait, AUC ~ 0.5. Bottom: partially random distribution of the stem rust resistance trait. [Figure: ROC curves (true positive rate versus false positive rate) and RMSEP/RMSE for R_CALC versus number of components]
    58. Stem rust hot spots. [Figure: map, longitude versus latitude]
    59. Stem rust hot spots. Areas where resistance is likely to occur (longitude-wise). [Figure: two map panels, longitude versus latitude]
    60. PLS (optimized). Areas where resistance is likely to occur (dark red). [Figure: contour map, longitude versus latitude; semivariance versus distance]
    61. Random Forest (RF). Areas where resistance is likely to occur (dark red). [Figure: contour map, longitude versus latitude; semivariance versus distance]
    62. SVM. Areas where resistance is likely to occur (dark red). [Figure: contour map, longitude versus latitude]
    63. Content • Background – PGR traits – FIGS • Objective – Develop a priori information – Develop best bet subset of accs with traits • Datasets – Trait data – Environmental data • Methodologies – Data preparation – Modeling techniques • Results/Discussion – Sub-setting (accessions/variables) – “Hot spots” • Conclusion
    64. Results – stem rust on wheat
        Dataset (unit)                     PPV                 LR+                 Estimated gain
        Stem rust (accession)              0.54 (0.50-0.59)    3.07 (2.66-3.54)    1.95 (1.79-2.09)
        Random (28 % resistant samples)    0.29 (0.26-0.33)    1.04 (0.90-1.20)    1.03 (0.91-1.16)
        Stem rust (site)                   0.50 (0.40-0.60)    4.00 (2.85-5.66)    2.51 (2.02-2.98)
        Random (20 % resistant samples)    0.19 (0.13-0.26)    0.94 (0.63-1.39)    0.95 (0.66-1.33)
        PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
        Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717
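    As a hedged illustration of how the table's metrics are derived, the sketch below computes PPV and LR+ from confusion-matrix counts. The counts are hypothetical, and the "estimated gain" is assumed here to be PPV divided by the share of resistant samples; the slide does not define it, although that reading agrees with the table (0.54 / 0.28 ≈ 1.9).

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute PPV, LR+ and gain from confusion-matrix counts.

    tp: predicted resistant, truly resistant
    fp: predicted resistant, truly susceptible
    fn: predicted susceptible, truly resistant
    tn: predicted susceptible, truly susceptible
    """
    total = tp + fp + fn + tn
    ppv = tp / (tp + fp)                       # positive predictive value
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    lr_plus = sensitivity / (1 - specificity)  # positive diagnostic likelihood ratio
    prevalence = (tp + fn) / total             # share of resistant samples
    gain = ppv / prevalence                    # ASSUMPTION: gain = PPV / prevalence
    return {"ppv": ppv, "lr_plus": lr_plus, "gain": gain}

# Hypothetical counts for 200 accessions, 50 of them resistant (25 %).
metrics = diagnostic_metrics(tp=30, fp=20, fn=20, tn=130)
print(metrics)
```

    An LR+ near 1 and a gain near 1, as in the "Random" rows, mean the selection does no better than chance.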
    65. Results – stem rust on wheat
        AUC = Area Under the ROC Curve (ROC, Receiver Operating Characteristic)
        Classifier method                       AUC                 Cohen’s Kappa
        Principal Component Regression (PCR)    0.69 (0.68-0.70)    0.40 (0.37-0.42)
        Partial Least Squares (PLS)             0.69 (0.68-0.70)    0.41 (0.39-0.43)
        Random Forest (RF)                      0.70 (0.69-0.71)    0.42 (0.40-0.44)
        Support Vector Machines (SVM)           0.71 (0.70-0.72)    0.44 (0.42-0.45)
        Artificial Neural Networks (ANN)        0.71 (0.70-0.72)    0.44 (0.42-0.46)
        Bari, A., K. Street, M. Mackay, D.T.F. Endresen, E. De Pauw, and A. Amri (2011). Focused Identification of Germplasm Strategy (FIGS) detects wheat stem rust resistance linked to environment variables. Genetic Resources and Crop Evolution [online first]. doi:10.1007/s10722-011-9775-5; Published online 3 Dec 2011.
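    For readers unfamiliar with the two columns, here is a minimal sketch of how AUC and Cohen's Kappa can be computed for a binary resistant/susceptible classifier. The six scores and labels are hypothetical, the 0.5 threshold is an assumption, and the helper functions are illustrative rather than the authors' implementation.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohens_kappa(predicted, observed):
    """Agreement between predicted and observed classes, corrected for chance."""
    n = len(observed)
    p_observed = sum(p == o for p, o in zip(predicted, observed)) / n
    classes = set(predicted) | set(observed)
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    p_chance = sum((predicted.count(c) / n) * (observed.count(c) / n) for c in classes)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical predicted probabilities of resistance and observed trait scores.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
observed = [1, 1, 0, 1, 0, 0]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

auc_value = auc(scores, observed)
kappa_value = cohens_kappa(predicted, observed)
print(round(auc_value, 3), round(kappa_value, 3))
```

    AUC is threshold-free, while Kappa depends on the cut-off used to turn scores into classes, so the two columns need not rank the methods identically.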
    66. Results – stem rust on wheat
        Classifier method                                    PPV                 LR+                  Estimated gain
        kNN (pre-study)                                      0.29 (0.13-0.53)    5.61 (2.21-14.28)    4.14 (1.86-7.57)
        SIMCA                                                0.28 (0.14-0.48)    5.26 (2.51-11.01)    4.00 (2.00-6.86)
        Ensemble classifier                                  0.33 (0.12-0.65)    8.09 (2.23-29.42)    6.47 (2.05-11.06)
        Random (pre-study, 550 + 275 accessions)             0.06 (0.01-0.27)    0.95 (0.13-6.73)     0.97 (0.16-4.35)
        Ensemble                                             0.26 (0.22-0.30)    2.78 (2.34-3.31)     2.32 (2.00-2.68)
        Random (blind study, 825 + 3738 accessions)          0.11 (0.09-0.15)    1.02 (0.77-1.36)     0.95 (0.77-1.32)
        PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
        Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec 2011.
    67. Results of stem rust (Ug99) on wheat • 4563 wheat landraces were screened for Ug99; 10.2 % of the accessions were resistant. • The true trait scores were revealed for 20% of the accessions (825 samples). • The 500 accessions predicted as most likely to be resistant were selected from the 3738 accessions whose true scores were hidden. • 25.8 % of the selected samples were resistant, 2.3 times higher than expected by chance.
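    The blind-study selection step can be sketched as follows: rank the hidden accessions by predicted probability of resistance, keep the top k, and compare the hit rate in that subset with the hit rate in the whole hidden set. The ten scores and trait values below are hypothetical, and `top_k_hit_rate` is an illustrative helper, not part of the published method.

```python
def top_k_hit_rate(scores, resistant, k):
    """Share of truly resistant accessions among the k highest-scoring ones."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = ranked[:k]
    return sum(resistant[i] for i in selected) / k

# Hypothetical predicted probabilities for ten accessions with hidden scores.
scores = [0.91, 0.15, 0.78, 0.40, 0.66, 0.05, 0.83, 0.30, 0.52, 0.22]
# True trait scores, revealed only after the selection (1 = resistant).
resistant = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]

hit_rate = top_k_hit_rate(scores, resistant, k=4)  # resistant share in the subset
baseline = sum(resistant) / len(resistant)         # resistant share in the whole set
gain = hit_rate / baseline                         # improvement over random selection
print(hit_rate, baseline, round(gain, 2))
```

    In this toy case the subset holds 50 % resistant accessions against 30 % overall, the same kind of comparison the slide reports for the 500 selected accessions.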
    68. Content • Background – PGR traits – FIGS • Objective – Develop a priori information – Develop best bet subset of accs with traits • Datasets – Trait data – Environmental data • Methodologies – Data preparation – Modeling techniques • Results/Discussion – Sub-setting (accessions/variables) – “Hot spots” • Conclusion
    69. Conclusion ... • Results – Raw data vs Transformed data – PLS vs PCA – Non-linear vs linear – FIGS vs random (selection) • Issues – Extent of variables (trait/agro-climate) – Phenology (adaptation) – Fuzzy approach (trait variation capture)