Searching for traits in PGR collections using Focused Identification of Germplasm Strategy
1. Searching for traits in PGR collections using Focused Identification of Germplasm Strategy (FIGS)
Abdallah Bari, Kenneth Street, Michael Mackay, Eddy De Pauw,
Dag Endresen, Ahmed Amri, Kumarse Nazari and Ammor Yahiaoui
CIAT
Palmira, Colombia
14 March 2012
Grain
Research &
Development
Corporation
2. Content
• Background
– PGR - traits
– FIGS - traits
• Objective
– Develop a priori information
– Develop best-bet subset of accessions with traits
• Datasets
– Trait data
– Environmental data
• Methodologies
– Data preparation
– Modeling techniques
• Results/Discussion
– Sub-setting (accessions/variables)
– “Hot spots”
• Conclusion
3. ICARDA
ICARDA’s worldwide presence
International Center for Agricultural Research in the Dry Areas (ICARDA)
4. ICARDA
PGR centers of origin and diversity
5. PGR contribution
Traits of importance to agriculture
– phenological adaptation (short growth duration),
– efficient use of water,
– resistance to biotic stresses (diseases and insects),
– tolerance to abiotic stresses (such as drought and salinity), and
– superior grain quality
plant pre-evaluation
6. PGR Challenges
• 50 000 - 60 000 traits (loci)
• 7 million accessions
• 1400 genebanks
Seed samples
7. PGR Challenges
A needle in a haystack
PGR users want variation for specific traits and a hundred germplasm accessions to evaluate.
8. PGR Challenges and Concerns
• Size of collections
– Addressed by Brown et al. 1999
• Cost of evaluating accessions lacking the desired trait
– Addressed by Gollin et al. 2000
10. Objective
FIGS searches genetic resources data (germplasm collections) to detect trait-environment patterns or relationships, which serve as a priori information.
This a priori information is then used to develop predictive models that find novel genetic variation for the traits of interest and indicate where it is most likely to occur.
Quantification of trait-environment relationship → A priori information → Develop trait subsets → Utilization of genetic resources
11. Origin of FIGS approach
Boron toxicity of wheat and barley – early FIGS examples
Mediterranean Sea
Wheat landraces from marine-origin soils in the Mediterranean region provided all the genetic variation needed to produce boron-tolerant varieties (M.C. Mackay, 1995).
12. FIGS approach
“FIGS applies to plant genetic resources (stored collections)
the same selection pressure exerted on plants by evolution.”
PGR Collection (Biodiversity) → sampling → core
PGR (Biodiversity) → sampling → trait → user
13. FIGS approach
FIGS has helped breeders identify long sought-after plant traits such as resistance to:
– Net blotch (barley),
– Powdery mildew,
– Russian wheat aphid (RWA) and
– Sunn pest.
Braidotti, G. 2009. Keys to the gene bank, Biotechnology. Partners in Research for Development 16-17.
14. Sunn pest trait of resistance
8 landrace accessions from Afghanistan and 2 from Tajikistan identified as resistant at juvenile stage
Now developing mapping populations
15. FIGS approach to Pm
16,000 wheat landraces
FIGS applied
1,300 selected
Phenotyping: 211 accessions rated between R and IR (40% yielded accessions that were resistant to the isolates used)
Genotyping: 7 new alleles
At least 2 have a new race specificity
100 years of classical genetics = 7 alleles
Kaur K; Street K; Mackay M; Yahiaoui N; Keller B (2008). Allele mining and sequence diversity at the wheat powdery mildew resistance locus Pm3. 11th IWGS, 24-29 Aug., Brisbane.
16. Locating new Pm3 alleles
Distribution of the seven new functional alleles of Pm3
Out of 96.2% of the total set screened
Turkey, Afghanistan, Iran, Pakistan and Armenia
17. The FIGS picture
Genotypes x Environments x Time¹ = Genetic Variation
Can we use the same evolutionary principles in reverse to identify the environments that ‘engender’ trait-specific genetic variation?
Environments x Traits x Time = Trait variation (ExT)?
¹ plus some selection
18. Examples of eco-geographic variation of traits linked to environmental influences
Environment influence | Trait | Species | Reference
Low altitudes, high winter temp., low summer rain, spring cloudiness | Cyanogenesis | Trifolium repens | Pederson, Fairbrother et al. 1996
Aridity | Seed dormancy, early flowering, high seed to pod ratio | Annual legumes | Ehrman and Cocks 1996
Soil type | Tolerance to boron toxicity | Bread wheat | Mackay (1990)
Altitude, winter temp, RWA distribution | Russian Wheat Aphid (RWA) resistance | Bread wheat | Bohssini et al., accepted for publication 2008
Temperature, aridity | Drought resistance | Triticum dicoccoides | Peleg, Fahima et al. 2005
Altitude | Glume colour and beak length | Durum wheat | Bechere, Belay et al. 1996
Climate, soil and water availability | Heading date, culm length, biomass, grain yield and its components | Triticum dicoccoides | Beharav and Nevo 2004
Precipitation, minimum January temperature, altitude | Glutenin diversity | Durum wheat | Vanhintum and Elings 1991
Temperature, aridity | More efficient RUBISCO activity | Woody perennials | Galmes et al. 2005
Water relations, temperature | Hordatine accumulation (disease defence) | Barley | Batchu, Zimmermann et al. 2006
After M. C. Mackay
19. FIGS system PGR collections
User-defined needs → Interface → Database → Filters → Size limit → New subset
Filters: type of material, evaluation data, collection site, other information
Size limit: 250, 500, 750, 1500
See www.figstraitmine.com
After M. C. Mackay 1995
20. Mining natural variation
By linking traits and environments (and their associated selection pressures) with genebank accessions (e.g. landraces and crop relatives), we can ‘focus’ in on those accessions most likely to possess trait-specific genetic variation.
[Figure: maps (longitude 0 to 150, latitude 0 to 60) with panels Environment, Trait and FIGS subset]
21. FIGS approach – summarized
Focused Identification of
Germplasm Strategy
Environment (E): geo-referencing of collecting places
Trait (T): evaluation (phenotyping)
Accession (G)
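The linkage summarized above can be sketched as a simple join, in which the accession connects a trait score to the environment of its collecting site. This is only an illustrative sketch in Python: the accession identifiers, coordinates, the `tmin01` variable and all values are hypothetical, and `link_records` is a helper invented for this example, not part of any FIGS software.

```python
# Hypothetical accession IDs, coordinates and climate values, invented for illustration.
trait_by_accession = {
    "ACC-001": "resistant",
    "ACC-002": "susceptible",
    "ACC-003": "resistant",
}
site_by_accession = {  # accession -> (latitude, longitude) of the collecting site
    "ACC-001": (36.2, 37.1),
    "ACC-002": (34.5, 69.2),
    # ACC-003 has no geo-reference, so it cannot be linked
}
climate_by_site = {  # collecting site -> environmental variables
    (36.2, 37.1): {"ari02": 0.552, "tmin01": 2.4},
    (34.5, 69.2): {"ari02": 0.246, "tmin01": -6.8},
}


def link_records(traits, sites, climate):
    """Join trait (T) and environment (E) records through the accession (G)."""
    rows = []
    for accession, trait in sorted(traits.items()):
        site = sites.get(accession)
        env = climate.get(site) if site is not None else None
        if env is None:
            continue  # skip accessions without coordinates or climate data
        rows.append({"accession": accession, "trait": trait, **env})
    return rows


table = link_records(trait_by_accession, site_by_accession, climate_by_site)
for row in table:
    print(row)
```

Accessions without a geo-referenced collecting site drop out of the table, which is why geo-referencing is a precondition for this kind of analysis.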
24. Eco-climate data (X)
Layers used in the stem rust studies:
• Precipitation (rainfall)
• Maximum temperatures
• Minimum temperatures
+ Derived GIS layers such as:
• Potential evapotranspiration (water-loss)
• Agro-climatic Zone (UNESCO classification)
• Moisture/Aridity index
(mean values for month and year)
26. Searching for stem rust trait of resistance - concerns
Stem rust spreading to wheat production areas
http://www.news.cornell.edu/
27. Stem rust on wheat landraces – trait data
Green dots indicate collecting sites for resistant wheat landraces and red dots collecting sites for susceptible landraces.
Field experiments made in Minnesota by Don McVey.
USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049
29. Data preparation
Climate data (X as independent variables)
Power relationship ~ 2(p) (spread)
site_code | ….. | ari02 | …..
ETH-S893 | ….. | 0.246 | …..
ETH-S1222 | ….. | 0.344 | …..
NS_339 | ….. | 0.552 | …..
ETH-S1153 | ….. | 0.390 | …..
NS_415 | ….. | 0.419 | …..
NS_424 | ….. | 0.380 | …..
ETH64:55 | ….. | 0.344 | …..
NS_525 | ….. | 0.352 | …..
NS_526 | ….. | 0.354 | …..
NS_559 | ….. | 0.397 | …..
[Figure: two histograms (frequency) of the Aridity or Moisture Index during February, with x-axis ranges 0 to 15 and -4 to 4]
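The exact transformation applied to the climate variables is not recoverable from this slide. As a hedged illustration of what a power-type transformation followed by standardization can look like, the sketch below applies a Box-Cox transform to the ten `ari02` values listed above. The Box-Cox family and the exponent `lam=0.5` are assumptions made for this example only, not the authors' stated method.

```python
import math
import statistics

# ari02 values for the ten sites listed on the slide
ari02 = [0.246, 0.344, 0.552, 0.390, 0.419, 0.380, 0.344, 0.352, 0.354, 0.397]


def box_cox(values, lam):
    """Box-Cox power transform; lam is an assumed, untuned exponent."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def standardize(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


transformed = standardize(box_cox(ari02, lam=0.5))
print([round(v, 2) for v in transformed])
```

The transform is monotonic, so the ranking of sites by aridity is unchanged; only the spread and shape of the distribution are altered.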
30. Platform
R language (development of algorithms): modeling purpose
> Data transformation ( )
> Model <- model(trait ~ climate)
> Measuring accuracy metrics
> ….
Geographical Information System (GIS), ArcGIS: generation of environmental data
Environmental data/layers (surfaces)
31. Modeling framework
Trait data (Y) Environmental data (X)
Y ~ f(X)
First, a linear approach, irrespective of the underlying distributions describing the data.
[Equations: two expressions of the form Yi ~ shown on the slide]
X is the set of explanatory variables or predictors (climate data), where X ∈ R^m.
Y is either a categorical (label) or a numerical response (trait descriptor states).
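A minimal sketch of the linear case Y ~ f(X), assuming a single climate predictor and a trait coded 0 (susceptible) or 1 (resistant). The data, the 0.5 cut-off and the function names `fit_line` and `classify` are hypothetical and chosen only to make the idea concrete; the studies cited in these slides used many predictors.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def classify(value, intercept, slope, cutoff=0.5):
    """Turn the continuous linear score into a trait label."""
    score = intercept + slope * value
    return "resistant" if score >= cutoff else "susceptible"


# Invented example: aridity index at the collecting site and the 0/1 trait score
aridity = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
resistant = [0, 0, 0, 1, 1, 1]

intercept, slope = fit_line(aridity, resistant)
print(classify(0.65, intercept, slope), classify(0.25, intercept, slope))
```

Because the response is coded 0/1 but fitted linearly, predicted scores can fall below 0 or above 1, so a cut-off is needed to turn them back into labels.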
32. Modeling framework
• Principal component analysis (PCA)
• Partial Least Square (PLS)
• Random Forest (RF)
• Support Vector Machines (SVM)
• Neural Networks (NN)
Bari A., Street K., Mackay M., Endresen D.T.F., De Pauw E. & Amri A.
(2011) Focused identification of germplasm strategy (FIGS) detects wheat
stem rust resistance linked to environmental variables.
Genetic Resources and Crop Evolution
http://www.springerlink.com/content/m7140x68v2065113/fulltext.pdf
33. Principal Component Analysis (PCA)
B is a matrix of coefficients.
The prediction was initially carried out using the number of principal components (PCs) that account for 95% of the explained variance, followed by adding one component at a time until the error reached a minimum.
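The component-selection rule described here can be sketched as two small steps: start from the smallest number of components reaching 95% of the variance, then keep adding components while the error still drops. The variances and error values below are invented for illustration, and both function names are hypothetical.

```python
def components_for_variance(variances, threshold=0.95):
    """Smallest number of components whose cumulative variance share reaches threshold."""
    total = sum(variances)
    cumulative = 0.0
    for k, variance in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return k
    return len(variances)


def extend_until_minimum(start_k, error_by_k):
    """Add one component at a time while the error keeps decreasing."""
    k = start_k
    while k + 1 in error_by_k and error_by_k[k + 1] < error_by_k[k]:
        k += 1
    return k


# Invented component variances and test errors (e.g. RMSE) per number of components
variances = [5.0, 2.5, 1.5, 0.6, 0.3, 0.1]
test_error = {4: 0.46, 5: 0.44, 6: 0.43, 7: 0.45}

start = components_for_variance(variances)
best = extend_until_minimum(start, test_error)
print(start, best)
```

This stepwise search stops at the first local minimum of the error curve; a full scan over all candidate numbers of components would be needed to guarantee the global minimum.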
34. Partial Least Square (PLS)
PLS: a product of factors and their loadings (regression coefficients), in which the environmental dataset and the trait dataset are both decomposed simultaneously.
The prediction was initially carried out using the number of components (latent variables, LVs) that account for 95% of the explained variance, followed by adding one component at a time until the error reached a minimum.
35. Random Forest (RF)
Data → Bootstrapping (with replacement) → Training (set) and Out-of-bag (OOB) set
ntree 1, ntree 2, ... ntree 1000
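The bootstrapping step behind each tree can be sketched as follows: every tree draws its training set with replacement, and the samples never drawn form that tree's out-of-bag (OOB) set. This sketch covers only the resampling, not tree building; the sample count of 20 and the function name `bootstrap_split` are assumptions for illustration, while the 1000 trees follow the slide.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; undrawn indices are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples, n_trees = 20, 1000
splits = [bootstrap_split(n_samples, rng) for _ in range(n_trees)]

# Average share of samples left out of a tree's training set
oob_fraction = sum(len(oob) for _, oob in splits) / (n_trees * n_samples)
print(round(oob_fraction, 3))
```

On average roughly a third of the samples are out-of-bag for any one tree, which gives each tree an internal test set for error estimation without a separate hold-out.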
36. Support Vector Machines (SVM)
SVM is a learning-based technique that maps input data to a high-dimensional space and optimally separates the mapped input into the respective classes.
The mapping goes from the l-dimensional space (input variable space) into a k-dimensional space, where k is higher than l.
[Figure: mapping of input points (x) into the higher-dimensional space]
37. Neural Networks (NN)
Neural Networks (RBF)
[Figure: network with inputs x1, x2, ..., xp and output F(x); error versus epochs number for the training set and the test set]
38. Optimization/tuning
[Figure: error versus PCs, LVs or epochs number, with curves for the test set and the training set]
Trend of output error versus the number of components (PCs/LVs) or epochs (NN)
39. Accuracy metrics
Parameters that provide information on the specificity (“trait agro-climate”)
Confusion matrix (2-by-2 contingency table)
 | Observed resistant | Observed susceptible
Predicted resistant | a | b
Predicted susceptible | c | d
Sensitivity = a / (a + c)
Specificity = d / (b + d)
Sensitivity and specificity are indicators of the model's ability to correctly classify observations.
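The two ratios can be computed directly from the four cells of the confusion matrix. The sketch below also computes Cohen's kappa, which is reported later in the results, using its standard definition. The cell counts are invented for illustration.

```python
def sensitivity(a, b, c, d):
    """Share of observed resistant accessions that were predicted resistant."""
    return a / (a + c)


def specificity(a, b, c, d):
    """Share of observed susceptible accessions that were predicted susceptible."""
    return d / (b + d)


def cohen_kappa(a, b, c, d):
    """Agreement between prediction and observation, corrected for chance."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Invented counts: a = true positives, b = false positives,
# c = false negatives, d = true negatives
cells = (40, 20, 10, 130)
print(sensitivity(*cells), specificity(*cells), cohen_kappa(*cells))
```

With these counts, sensitivity is 0.8, specificity is about 0.87 and kappa is 0.625.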
40. Accuracy metrics
Parameters that provide information on the specificity (“trait agro-climate”)
High AUC (area under the ROC curve) values are an indication of a potential trait-environment relationship.
[Figure: the ROC curve and the resulting pdf’s of trait distribution (trait states)]
41. Accuracy metrics
Randomness
[Figure: ROC curve and pdf’s of trait distribution]
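One way to see why an AUC near 0.5 indicates randomness: the AUC equals the probability that a randomly chosen resistant accession receives a higher prediction score than a randomly chosen susceptible one. The sketch below computes it by comparing all pairs; the scores are invented and the function name `auc` is chosen for this example.

```python
def auc(scores_resistant, scores_susceptible):
    """AUC as the share of resistant/susceptible pairs ranked correctly (ties count half)."""
    wins = 0.0
    for r in scores_resistant:
        for s in scores_susceptible:
            if r > s:
                wins += 1.0
            elif r == s:
                wins += 0.5
    return wins / (len(scores_resistant) * len(scores_susceptible))


# Invented prediction scores for observed resistant and susceptible accessions
separated = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])
uninformative = auc([0.5, 0.5], [0.5, 0.5, 0.5])
print(round(separated, 3), uninformative)
```

When the two score distributions overlap completely, every pair is a coin toss and the AUC falls to 0.5; the more the distributions separate, the closer the AUC gets to 1.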
43. Data preparation - Raw data
PCs = 42; AUC = 0.67; Kappa = 0.40
[Figure: RMSE versus number of components; ROC curve (true positive rate versus false positive rate); density distribution by trait]
44. Data preparation – Transformed data
PCs = 42; AUC = 0.71; Kappa = 0.45
[Figure: RMSE versus number of components; ROC curve (true positive rate versus false positive rate); density distribution by trait]
45. Data preparation - Raw data (PLS)
LVs = 30; AUC = 0.70; Kappa = 0.43
[Figure: RMSE versus number of components; ROC curve (true positive rate versus false positive rate); density distribution by trait]
46. Data preparation – Transformed data
LVs = 22; AUC = 0.71; Kappa = 0.44
[Figure: RMSE versus number of components; ROC curve (true positive rate versus false positive rate); density distribution by trait]
47. Optimization process
[Figure: RMSEP versus number of components (0 to 60) for R_CALC, two panels]
Root mean square error of prediction (RMSEP) for PCA (left) and PLS (right) models. Arrows indicate the minimum errors, where the number of components (PCs and LVs) was selected for prediction (red/discontinuous line = test data, continuous line = training set).
48. PCA
PC2
Few components ~ random
[Figure: ROC curve (true positive rate versus false positive rate) and density distribution per R_CALC for resistant and susceptible accessions]
49. PCA
PC5
[Figure: ROC curve (true positive rate versus false positive rate) and density distribution per R_CALC for resistant and susceptible accessions]
50. PLS
LV2
2 latent variables of PLS are better than 2 PCs of PCA
[Figure: ROC curve (true positive rate versus false positive rate) and density distribution per R_CALC for resistant and susceptible accessions]
51. PLS
LV10
[Figure: ROC curve (true positive rate versus false positive rate) and density distribution per R_CALC for resistant and susceptible accessions]
52. PCA (optimized)
[Figure: ROC curve (true positive rate versus false positive rate) and density of the prediction]
53. PLS (optimized)
[Figure: ROC curve (true positive rate versus false positive rate) and density of the prediction]
54. RF
[Figure: ROC curve (true positive rate versus false positive rate) and density of the prediction]
55. SVM
[Figure: ROC curve (true positive rate versus false positive rate) and density of the prediction]
56. NN
[Figure: ROC curve (true positive rate versus false positive rate) and density of the prediction]
57. Random (PCA)
Complete random distribution of trait of stem rust resistance: AUC ~ 0.5
Partially random distribution of trait of stem rust resistance
[Figure: for each case, ROC curve (true positive rate versus false positive rate) and RMSEP versus number of components for R_CALC]
58. Stem rust hot spots
[Figure: map of stem rust hot spots, longitude 0 to 150, latitude 0 to 60]
59. Stem rust hot spots
Areas where resistance is likely to occur (longitude wise)
[Figure: two map panels, longitude (0 to 150) versus latitude (0 to 60)]
60. PLS (optimized)
Areas where resistance is likely to occur (dark red)
[Figure: contour map of predicted values, longitude (X, 0 to 120) versus latitude (Y, 10 to 60); inset: semivariance versus distance]
61. Random Forest (RF)
Areas where resistance is likely to occur (dark red)
[Figure: contour map of predicted values, longitude (X, 0 to 120) versus latitude (Y, 10 to 60); inset: semivariance versus distance]
62. SVM
Areas where resistance is likely to occur (dark red)
[Figure: contour map of predicted values, longitude (X, 0 to 120) versus latitude (Y, 10 to 60)]
64. Results – stem rust on wheat
Dataset (unit) | PPV | LR+ | Estimated gain
Stem rust (accession) | 0.54 (0.50-0.59) | 3.07 (2.66-3.54) | 1.95 (1.79-2.09)
Random (28 % resistant samples) | 0.29 (0.26-0.33) | 1.04 (0.90-1.20) | 1.03 (0.91-1.16)
Stem rust (site) | 0.50 (0.40-0.60) | 4.00 (2.85-5.66) | 2.51 (2.02-2.98)
Random (20 % resistant samples) | 0.19 (0.13-0.26) | 0.94 (0.63-1.39) | 0.95 (0.66-1.33)
PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717
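The reported metrics can be related to the confusion matrix cells (a, b, c, d as in the accuracy metrics slide). PPV and LR+ below follow their standard definitions. Treating the estimated gain as PPV divided by the share of resistant samples is an assumption of this sketch; it is roughly consistent with the table (0.54 / 0.28 ≈ 1.9 and 0.50 / 0.20 = 2.5), but the cited paper should be consulted for the exact estimator. The counts are invented.

```python
def ppv(a, b, c, d):
    """Positive predictive value: share of predicted resistant that are resistant."""
    return a / (a + b)


def lr_plus(a, b, c, d):
    """Positive likelihood ratio: sensitivity / (1 - specificity)."""
    sens = a / (a + c)
    false_positive_rate = b / (b + d)
    return sens / false_positive_rate


def estimated_gain(a, b, c, d):
    """Assumed definition: PPV relative to the overall share of resistant samples."""
    prevalence = (a + c) / (a + b + c + d)
    return ppv(a, b, c, d) / prevalence


# Invented counts: a = true positives, b = false positives,
# c = false negatives, d = true negatives
cells = (40, 20, 10, 130)
print(ppv(*cells), lr_plus(*cells), estimated_gain(*cells))
```

A random selection has an expected PPV equal to the prevalence, so its gain and LR+ are both close to 1, which matches the "Random" rows of the table.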
65. Results – stem rust on wheat
AUC = Area Under the ROC Curve (ROC, Receiver Operating Characteristic)
Classifier method | AUC | Cohen’s Kappa
Principal Component Regression (PCR) | 0.69 (0.68-0.70) | 0.40 (0.37-0.42)
Partial Least Squares (PLS) | 0.69 (0.68-0.70) | 0.41 (0.39-0.43)
Random Forest (RF) | 0.70 (0.69-0.71) | 0.42 (0.40-0.44)
Support Vector Machines (SVM) | 0.71 (0.70-0.72) | 0.44 (0.42-0.45)
Artificial Neural Networks (ANN) | 0.71 (0.70-0.72) | 0.44 (0.42-0.46)
Bari, A., K. Street, M. Mackay, D.T.F. Endresen, E. De Pauw, and A. Amri (2011). Focused Identification of Germplasm Strategy (FIGS) detects wheat stem rust resistance linked to environment variables. Genetic Resources and Crop Evolution [online first]. doi:10.1007/s10722-011-9775-5; Published online 3 Dec 2011.
66. Results – stem rust on wheat
Classifier method | PPV | LR+ | Estimated gain
Pre-study (550 + 275 accessions):
kNN | 0.29 (0.13-0.53) | 5.61 (2.21-14.28) | 4.14 (1.86-7.57)
SIMCA | 0.28 (0.14-0.48) | 5.26 (2.51-11.01) | 4.00 (2.00-6.86)
Ensemble classifier | 0.33 (0.12-0.65) | 8.09 (2.23-29.42) | 6.47 (2.05-11.06)
Random | 0.06 (0.01-0.27) | 0.95 (0.13-6.73) | 0.97 (0.16-4.35)
Blind study (825 + 3738 accessions):
Ensemble | 0.26 (0.22-0.30) | 2.78 (2.34-3.31) | 2.32 (2.00-2.68)
Random | 0.11 (0.09-0.15) | 1.02 (0.77-1.36) | 0.95 (0.77-1.32)
PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec 2011.
67. Results of stem rust (Ug99) on wheat
4563 wheat landraces screened for Ug99; 10.2 % were resistant accessions.
The true trait scores were revealed for 20% of the accessions (825 samples).
FIGS selected the 500 accessions most likely to be resistant from the 3738 accessions whose true scores were hidden.
The selected subset contained 25.8 % resistant samples, 2.3 times higher than expected by chance.
69. Conclusion ...
Results
– Raw data vs Transformed data
– PLS vs PCA
– Non-linear vs linear
– FIGS vs random (selection)
Issues
– Extent of variables (trait/agro-climate)
– Phenology (adaptation)
– Fuzzy approach (trait variation capture)
Landrace samples (genebank seed accessions).
Trait observations (experimental design) - high-cost data.
Climate data (for the landrace location of origin) - low-cost data.
The accession identifier (accession number) provides the bridge to the crop trait observations.
The longitude and latitude coordinates of the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
GRIN database (USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network), online: http://www.ars-grin.gov/npgs
USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049
Photo: USDA ARS Image k1192-1, http://www.ars.usda.gov/is/graphics/photos/mar09/k11192-1.htm
USDA ARS Image Archive, http://www.ars.usda.gov/is/graphics/photos/
Photo: Wheat infected by stem rust (Ug99) at the Kenya Agricultural Research Station in Njoro northwest of Nairobi.