FIGS workshop in Madrid, PGR Secure (9 to 13 January 2012)
 


Last week (9 to 13 January 2012) we organized a workshop in Madrid, Spain, on predictive characterization of crop wild relatives (the wild relatives of cultivated plants) using the Focused Identification of Germplasm Strategy (FIGS). The workshop was part of the EU-funded PGR Secure project [1] (EU 7th Framework Programme). Its objective was to use predictive computer modeling with R [2] for data mining (trait mining): identifying genebank accessions and populations of crop wild relatives with a higher density of genetic variation for a target trait property (the response, or dependent, variable), using climate data and other environmental data layers as the explanatory (independent) multivariate variables. We have previously validated the FIGS approach for landraces of wheat and barley [3]. This study was one of the first attempts to validate the approach for other crops as well as for crop wild relatives (CWR). The crop landraces and crop wild relatives included in the study were oats (Avena sp.), beet (Beta sp.), cabbage and mustard (Brassica sp.), and medick, including alfalfa or lucerne (Medicago sp.). We made good progress on the methodology but also faced some major obstacles related to data availability.

Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Sci. 50(6):2418-2430. doi: 10.2135/cropsci2010.03.0174

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, and E. De Pauw (2011). Predictive Association between Biotic Stress Traits and Eco-Geographic Data for Wheat and Barley Landraces. Crop Science 51 (5): 2036-2055. doi: 10.2135/cropsci2010.12.0717

Endresen, D.T.F. (2011). Utilization of Plant Genetic Resources: A Lifeboat to the Gene Pool [PhD Thesis]. Copenhagen University, Faculty for Life Sciences, Department of Agriculture and Ecology. Printed at Media-Tryck, Lund University Press, April 2011. Available at: http://goo.gl/pYa9x (PDF 37 MB). ISBN: 978-91-628-8268-6.

Bari, A., K. Street, M. Mackay, D.T.F. Endresen, E. De Pauw, and A. Amri (2012). Focused identification of germplasm strategy (FIGS) detects wheat stem rust resistance linked to environmental variables. Genetic Resources and Crop Evolution (in press). doi:10.1007/s10722-011-9775-5

Endresen, D.T.F., K. Street, M. Mackay, A. Bari, A. Amri, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science 52, in press. doi: 10.2135/cropsci2011.08.0427

  • PGR Secure Workshop, 9-13 January, Madrid, Spain. Focused Identification of Germplasm Strategy (FIGS) analysis for the wild relatives of the cultivated plants. Predictive characterization of crop wild relatives. http://pgrsecure.org Photo: Oat (Avena sativa L.) at Alnarp on 3 Aug 2010, by Dag Endresen.
  • Photo: Wheat, Triticum aestivum L., at Nöbbelöv in Lund Sweden, June 2010 by Dag Endresen. URL: http://www.flickr.com/photos/dag_endresen/4826175873/, https://picasaweb.google.com/dag.endresen/GermplasmCrops#5497796034327520578
  • Genetic resources from the wild relatives of cultivated plants contribute the raw material required for domesticated forms and for the further development of these food crops. Genebanks preserve plant genetic resources and provide them for use by plant breeders and for other bona fide uses.
  • Photo from the USDA Photo archive. Slide text by Ken Street, ICARDA FIGS team (2009).
  • Photo: Dag Endresen. Field of sugar beet (Beta vulgaris L.) at Alnarp (June 2005). URL: http://www.flickr.com/photos/dag_endresen/4189812241/
  • Photo: Bread wheat (Triticum aestivum L.) at Nöbbelöv in Lund July 2010 by Dag Endresen. URL: http://www.flickr.com/photos/dag_endresen/4826565058/
  • Photo: European mountain ash (Sorbus aucuparia L.) July 2004 by Dag Endresen. https://picasaweb.google.com/115547050550954466285/GermplasmCrops#5362091001327096818
  • Illustration of trait mining with ecoclimatic GIS layers. GIS layers included in the illustration are from the ICARDA ecoclimatic database, average: annual temperature (front), annual precipitation (middle), and winter precipitation (back) (De Pauw, 2003).
  • The assumption for trait mining using the FIGS approach is that there is a link between the trait properties of crop landraces and crop wild relatives and the eco-climatic environment at the source location (collecting site), and that this link can be captured and described using a computer modeling approach.
  • Landrace samples (genebank seed accessions). Trait observations (experimental design): high-cost data. Climate data (for the landrace location of origin): low-cost data. The accession identifier (accession number) provides the bridge to the crop trait observations. The longitude and latitude coordinates of the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
  • Modern agriculture uses advanced plant varieties based on the most productive genetics. The original land races and wild forms produce lower yields, but their greater genetic variation contains a higher diversity in e.g. resistance to disease. High-yielding modern crops are therefore vulnerable when a new disease arises.
  • Illustration traditional cattle farming: http://commons.wikimedia.org/wiki/File:Traditional_farming_Guinea.jpg (USAID, Public Domain).
  • The WorldClim dataset is described in: Hijmans, R.J., S.E. Cameron, J.L. Parra, P.G. Jones and A. Jarvis, 2005. Very high resolution interpolated climate surfaces for global land areas. International Journal of Climatology 25: 1965-1978. NOAA GHCN-Monthly version 2: http://www.ncdc.noaa.gov/oa/climate/ghcn-monthly/index.php Weather stations, precipitation: 20 590 stations; temperature: 7 280 stations.
  • Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Science 50: 2418-2430. DOI: 10.2135/cropsci2010.03.0174
  • GRIN database (USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs). USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?1041. Dr. Harold Bockelman extracted the trait data (C&E).
  • Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717
  • GRIN database (USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs). USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049. Dr. Harold Bockelman extracted the trait data (C&E).
  • Photo: USDA ARS Image k1192-1, http://www.ars.usda.gov/is/graphics/photos/mar09/k11192-1.htm
  • GRIN database (USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs). USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049. Dr. Harold Bockelman extracted the trait data (C&E). Photo: USDA ARS Image Archive, http://www.ars.usda.gov/is/graphics/photos/
  • Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi: 10.2135/cropsci2011.08.0427; Published online 8 Dec 2011.
  • Photo: Wheat infected by stem rust (Ug99) at the Kenya Agricultural Research Station in Njoro northwest of Nairobi. This study is in press and will soon be available from Crop Science.
  • Endresen, D.T.F. (2011). Utilization of plant genetic resources, a lifeboat to the gene pool. PhD thesis. Department of Agriculture and Ecology, Faculty of Life Sciences, University of Copenhagen. ISBN: 978-91-628-8268-6. Available at http://goo.gl/pYa9x, verified 7 Jan 2011.
  • Mackay, M.C. (2011). Surfing the genepool, the effective and efficient use of plant genetic resources. PhD thesis. Department of Plant Breeding and Biotechnology, Faculty of landscape planning, horticulture and agricultural science, Swedish University of Agricultural Sciences (SLU), Alnarp. ISBN: 978-91-576-7634-4. Available at http://pub.epsilon.slu.se/8439/, verified 7 Jan 2011.
  • More than 7.4 million genebank accessions and more than 1400 genebanks, including approximately 140 large genebanks each holding more than 10,000 accessions: Second Report on the State of the World’s Plant Genetic Resources for Food and Agriculture (2010), Food and Agriculture Organization of the United Nations (FAO). Photo: Genebank Accession at IPK Gatersleben by Dag Endresen, 2010.
  • For more information on Gap Analysis see: http://gisweb.ciat.cgiar.org/GapAnalysis/ Photo: South of Tunisia, http://www.flickr.com/photos/dag_endresen/4221301525/
  • NordGen study in June 2010, wormwood (Artemisia absinthium L.). Species distribution model made with the Maxent desktop ecological niche modeling software. Only the niche model study, identifying suitable locations for the presence of the species, was completed. The next step, combining this result with a FIGS study to identify locations where a target trait is likely to be found in the predicted populations, was planned but not carried out in this NordGen experiment.
  • A data exchange protocol, format, infrastructure and a network for sharing datasets are important elements.
  • PGR Secure Workshop, 9-13 January, Madrid, Spain. Focused Identification of Germplasm Strategy (FIGS) analysis for the wild relatives of the cultivated plants. Photo: Oat (Avena sativa L.) at Alnarp on 3 Aug 2010, by Dag Endresen. The previous slides were for the first day of the workshop (9 Jan 2012). The following slides were for the second day of the workshop (10 Jan 2012).
  • These are the steps to follow in a trait mining experiment using the FIGS approach.
  • KRAK: http://www.krak.dk/query?mop=aq&mapstate=7%3B9.305588071850734%3B56.61105751259899%3Bh%3B9.282591620463698%3B56.61775781407488%3B9.328584523237769%3B56.60435721112311%3B853%3B469&what=map_adr# Google Maps: http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=107144586665622662057.00045ff98921bd0418037&ll=56.606941,9.297695&spn=0.055554,0.150204&t=h&z=13
  • NGB6300 (accide 9039, FRO) observed at Priekuli, Latvia in 2003, replicate 2, is highlighted in red (with high leverage, as indicated by the high value for Hotelling's T²). NGB776 (accide 8510, SWE) observed at Landskrona, Sweden in 2002 (replicate 2, LYR312) is highlighted in green.
  • Box-plot of the trait scores to illustrate the effect of the preprocessing. First row is no preprocessing; row 2 is mean-centering (centering across mode 1, samples); last row is auto-scale (centering across mode 1 and scaling across mode 2, traits). Mean centering removes the absolute intensity information (the mean for each variable is subtracted from the individual data values). This pre-processing strategy is applied so that the model does not focus on the variables with the highest numerical values (intensity). Scaling: In general, scaling a variable in the data can be viewed as a multiplication of the corresponding column vector entries with some number. If the significances of the variables to the model are known prior to modeling, then it might be a good idea to upscale the highly relevant variables. In contrast, if a variable is supposed to bear merely noise, then its significance must be downscaled. However this is a rare case in reality. Therefore, unit-variance scaling (UV-scaling) is most often used. Moreover scaling itself is sometimes associated with UV-scaling. (Johann Gasteiger and Thomas Engel (editors). 2003. Chemoinformatics: a textbook. Wiley-VCH, Weinheim. ISBN 9783527306817. Page 214)
  • Box-plot of the trait scores to illustrate the effect of the preprocessing. First row is no preprocessing; row 2 is mean-centering (centering across mode 1, samples); last row is auto-scale (centering across mode 1 and scaling across mode 2, traits).
  • We often divide the data for a simulation model project into three equal parts: one set for initial model calibration or training; one set for further calibration or fine tuning; and one test set for validation of the model.
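The three-way division described in the note above can be sketched in a few lines. This is a minimal illustration in Python (standard library only), not the workshop's R code; the function name `three_way_split`, the seed, and the accession identifiers are hypothetical.

```python
import random


def three_way_split(items, seed=0):
    """Shuffle the items and split them into three near-equal parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    third = len(shuffled) // 3
    training = shuffled[:third]            # initial calibration (training)
    tuning = shuffled[third:2 * third]     # further calibration (fine tuning)
    test = shuffled[2 * third:]            # held back for validation
    return training, tuning, test


# Hypothetical accession identifiers, for illustration only.
accessions = [f"ACC{i:03d}" for i in range(1, 31)]
training, tuning, test = three_way_split(accessions)
print(len(training), len(tuning), len(test))
```

When the number of samples is not divisible by three, the remainder ends up in the test set in this sketch; other conventions are equally reasonable.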
  • Multi-way analysis using a multi-linear data cube representation of the dataset preserves much more information for the model to learn from compared to a more common 2-way data table. The multi-way model can calibrate against systematic patterns in all dimensions (ways) of the data cube including the dimensions for the different types and months for the climate variables - while the reduced 2-way data table will only present systematic variation across the accessions/samples (1st way) and across all of the climate variables (2nd way).
  • The PARAFAC algorithm provides a very powerful and compact model that consumes very few degrees of freedom. However, one must be particularly careful to validate the PARAFAC model solution, because PARAFAC sometimes calibrates to bad solutions. The split-half approach is a suitable method: the dataset is split at random into two parts, and each part is used to calibrate an independent PARAFAC model. The profiles from a plot of the scores and loadings for the latent factors (synthetic variables) give a good indication of whether the two halves of the dataset calibrate to the same PARAFAC model solution. The calculated core consistency is another useful indicator for these PARAFAC models.
  • Map illustrating the first successful split-half subsets. Set 1: NGB6300, NGB27, NGB469, NGB776, NGB4701, NGB2072, NGB4641 are indicated with blue placemarks. Set 2: NGB792, NGB13458, NGB9529, NGB468, NGB775, NGB456, NGB2565 are indicated with red placemarks. Map of the second good split-half. Set 1: NGB456, NGB9529, NGB469, NGB2072, NGB468, NGB4641, NGB776 are indicated with blue placemarks. Set 2: NGB4701, NGB27, NGB2565, NGB792, NGB13458, NGB6300, NGB775 are indicated with red placemarks.
  • Residuals can tell much about the model. If the residuals do not follow a normal distribution, this often indicates that systematic information remains in the dataset that the model has not captured.
  • Photo: Wheat field at Nöbbelöv, July 2010, by Dag Endresen.
  • Left side illustration is modified from Wise et al., 2006:201 (PLS Toolbox software manual). The right side illustration is made by the PLS Toolbox software in MATLAB.
  • Photo: Wheat field at Nöbbelöv, July 2010, by Dag Endresen.
  • The 2 by 2 confusion matrix provides the starting point for the calculation of many useful performance indicators – for classification of ordinal and categorical trait variables.
  • Notice that the Root Mean Square Error from Cross-Validation (RMSECV) reaches a minimum at 6 latent factors (LVs), while the Root Mean Square Error from Calibration (RMSEC) continues to fall. Models with a higher complexity than 5 to 6 LVs will thus most likely be overfitted to the training data.
  • Notice that the Root Mean Square Error from Cross-Validation (RMSECV) reaches a minimum at 9 latent factors (LVs), while the Root Mean Square Error from Calibration (RMSEC) continues to fall. Notice also that the RMSECV levels out around 4 LVs, after which the cross-validation performance shows very minor gains before a final drop from 8 to 9 LVs. Models with a higher complexity than 4 LVs will here most likely be overfitted to the training data.
  • http://en.wikipedia.org/wiki/Correlation, http://en.wikipedia.org/wiki/Coefficient_of_determination, http://en.wikipedia.org/wiki/Statistical_model_validation Table of critical values for r: http://www.runet.edu/~jaspelme/statsbook/Chapter%20files/Table_of_Critical_Values_for_r.pdf Table of critical values for r: http://www.gifted.uconn.edu/siegle/research/Correlation/corrchrt.htm Table of critical values for r: http://www.jeremymiles.co.uk/misc/tables/pearson.html

Presentation Transcript

  • Trait mining with FIGS
    – Predictive link between climate data and trait data
    – Case studies:
      • Morphological traits, Nordic barley
      • Net blotch on barley
      • Stem rust on wheat
      • Stem rust, Ug99 on bread wheat landraces
    Wheat at Alnarp, June 2010 2
  • [Figure: wild tomato and tomato; teosinte and corn (maize); cultivation] 3
  • Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait.
    • How does the scientist select a small subset likely to have the useful trait? 4
  • 5
  • What is Focused Identification of Germplasm Strategy? Origin of concept: boron toxicity of wheat and barley, example of the late 1980s. [Map labels: Mediterranean Sea, South Australia.] Slide made by M.C. Mackay, 1995
  • FOCUSED IDENTIFICATION OF GERMPLASM STRATEGY. Origin of FIGS: Michael Mackay (1986, 1990, 1995). Data layers sieve accessions based on latitude & longitude: temperature, salinity score, elevation, rainfall, agro-climatic zone, disease distribution. Illustration by Mackay (1995) 7
  • 8
  • – Identification of plant germplasm with a higher likelihood of having desired genetic diversity for a target trait property.
    – Using climate data for prediction of crop traits a priori, BEFORE the field trials.
    Bread wheat at Nöbbelöv in Lund 9
  • Focused Identification of Germplasm Strategy (FIGS).
    • Identify new and useful genetic diversity for crop improvement.
    • Based on eco-geographic data analysis using climate data.
    European mountain ash (Sorbus aucuparia L.) at Alnarp, July 2004 10
  • Climate layers from the ICARDA ecoclimatic database (De Pauw, 2003) 11
  • The climate at the original source location, where the plant germplasm was developed, is correlated to the trait property. The aim is to build a computer model explaining the crop trait score from the climate data. 12
  • Genebank accessions (landraces & CWR). Field trials (€€€): trait data, high cost data. Climate data: low cost data. 13
  • Wild relatives are shaped by the environment. Primitive cultivated crops are shaped by local climate and humans. Traditional cultivated crops (landraces) are shaped by climate and humans. Modern cultivated crops are mostly shaped by humans (plant breeders). Perhaps future crops are shaped in the molecular laboratory…? 14
  • It is possible that the human mediated selection of landraces will contribute to the link between ecogeography and traits. During traditional cultivation the farmer will select for and introduce germplasm for improved suitability of the landrace to the local conditions. 15
  • The climate data can be extracted from the WorldClim dataset, http://www.worldclim.org/ (Hijmans et al., 2005). Data from weather stations worldwide are combined to a continuous surface layer. Climate data for each landrace is extracted from this surface layer. Precipitation: 20 590 stations. Temperature: 7 280 stations. 16
  • Layers used in these early FIGS studies (mean values for month and year):
    • Precipitation (rainfall)
    • Maximum temperatures
    • Minimum temperatures
    Some of the other layers available:
    • Potential evapotranspiration (water-loss)
    • Agro-climatic Zone (UNESCO classification)
    • Soil classification (FAO Soil map)
    • Aridity (dryness)
    Eddy De Pauw (ICARDA, 2008) 17
  • Landraces and wild relatives – The link between climate data and the trait data is required for trait mining with FIGS. Modern cultivars are not expected to show this predictive link (complex pedigree).
    • Georeferenced accessions – Trait mining with FIGS is based on multivariate models using climate data from the source location of the germplasm. To extract climate data the accessions need to be accurately georeferenced. 18
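Extracting a climate value for a georeferenced accession from a gridded surface layer amounts to finding the grid cell that contains the collecting site. The sketch below, in Python (standard library only), assumes a regular longitude/latitude grid stored as a list of rows with the northernmost row first; the function names, the toy grid and its values are hypothetical and do not reproduce the WorldClim file format.

```python
import math


def grid_index(lon, lat, west=-180.0, north=90.0, cell_size=0.5):
    """Return (row, col) of the grid cell containing a lon/lat point.

    Row 0 is the northernmost row and column 0 the westernmost column.
    """
    col = math.floor((lon - west) / cell_size)
    row = math.floor((north - lat) / cell_size)
    return row, col


def extract_value(grid, lon, lat, **grid_spec):
    """Look up the layer value at a collecting site (no bounds checking)."""
    row, col = grid_index(lon, lat, **grid_spec)
    return grid[row][col]


# Toy 2 x 2 layer covering longitude 0 to 2 and latitude 0 to 2 (made-up values).
toy_grid = [[10, 11],
            [12, 13]]
value = extract_value(toy_grid, lon=1.5, lat=0.5, west=0.0, north=2.0, cell_size=1.0)
print(value)
```

A real workflow would normally read the raster with a GIS library and handle points that fall outside the grid or on cells with missing data.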
  • Field observations by Agnese Kolodinska Brantestam (2002-2003). Multi-way N-PLS data analysis, Dag Endresen (2009-2010). Priekuli (LVA), Bjørke (NOR), Landskrona (SWE) 19
  • Experiment site  Year   Heading  Ripening  Length    Harvest  Volumetric  Thousand
                            days     days      of plant  index    weight      grain weight
    LVA              2002¹  n.s.     n.s.      n.s.      n.s.     ***         n.s.
    LVA              2003   ***      n.s.      **        **       ***         n.s.
    NOR              2002   -        *         **        ***      **          n.s.
    NOR              2003   **       ***       ***       *        *           n.s.
    SWE              2002   **       ***       n.s.      **       *           n.s.
    SWE              2003²  n.s.     **        n.s.      n.s.     **          n.s.
    *** Significant at the 0.001 level (p-value); ** Significant at the 0.01 level; * Significant at the 0.05 level; n.s. Not significant (at the above levels)
    ¹ LVA 2002: Germination on spikes (very wet June)
    ² SWE 2003: Incomplete grain filling (very dry June)
    Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Science 50: 2418-2430. DOI: 10.2135/cropsci2010.03.0174 20
  • Positive predictive value (PPV)
      • PPV = True positives / (True positives + False positives)
      • Classification performance for the identification of resistant samples (positives)
    • Positive diagnostic likelihood ratio (LR+)
      • LR+ = sensitivity / (1 – specificity)
      • Less sensitive to prevalence than PPV 21
  • Green dots indicate collecting sites for resistant wheat landraces and red dots collecting sites for susceptible landraces. Field experiments made in Minnesota, North Dakota and Georgia in the USA. USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?1041 22
  • Dataset (unit)          PPV               LR+               Estimated gain
    Net blotch (accession)  0.54 (0.48-0.60)  1.75 (1.42-2.17)  1.35 (1.19-1.50)
    Random                  0.40 (0.35-0.45)  0.99 (0.84-1.17)  0.99 (0.87-1.12)
    (40 % resistant samples)
    PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
    Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717 23
  • Green dots indicate collecting sites for resistant wheat landraces and red dots collecting sites for susceptible landraces. Field experiments made in Minnesota by Don McVey. USDA GRIN, trait data online: http://www.ars-grin.gov/cgi-bin/npgs/html/desc.pl?65049 24
  • Dataset (unit)          PPV               LR+               Estimated gain
    Stem rust (accession)   0.54 (0.50-0.59)  3.07 (2.66-3.54)  1.95 (1.79-2.09)
    Random                  0.29 (0.26-0.33)  1.04 (0.90-1.20)  1.03 (0.91-1.16)
    (28 % resistant samples)
    Stem rust (site)        0.50 (0.40-0.60)  4.00 (2.85-5.66)  2.51 (2.02-2.98)
    Random                  0.19 (0.13-0.26)  0.94 (0.63-1.39)  0.95 (0.66-1.33)
    (20 % resistant samples)
    PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
    Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw (2011). Predictive association between biotic stress traits and ecogeographic data for wheat and barley landraces. Crop Science 51: 2036-2055. DOI: 10.2135/cropsci2010.12.0717 25
  • Classifier method                     AUC               Cohen’s Kappa
    Principal Component Regression (PCR)  0.69 (0.68-0.70)  0.40 (0.37-0.42)
    Partial Least Squares (PLS)           0.69 (0.68-0.70)  0.41 (0.39-0.43)
    Random Forest (RF)                    0.70 (0.69-0.71)  0.42 (0.40-0.44)
    Support Vector Machines (SVM)         0.71 (0.70-0.72)  0.44 (0.42-0.45)
    Artificial Neural Networks (ANN)      0.71 (0.70-0.72)  0.44 (0.42-0.46)
    AUC = Area Under the ROC Curve (ROC, Receiver Operating Curve)
    Bari, A., K. Street, M. Mackay, D.T.F. Endresen, E. De Pauw, and A. Amri (2011). Focused Identification of Germplasm Strategy (FIGS) detects wheat stem rust resistance linked to environment variables. Genetic Resources and Crop Evolution [online first]. doi:10.1007/s10722-011-9775-5; Published online 3 Dec 2011. Abdallah Bari (ICARDA) 26
  • Ug99 set with 4563 wheat landraces screened for Ug99 in Yemen 2007, 10.2 % resistant accessions. The true trait scores for 20% of the accessions (825 samples) were revealed. We used trait mining with SIMCA to select 500 accessions more likely to be resistant from 3738 accessions with true scores hidden (to the person making the analysis). The FIGS set was observed to hold 25.8 % resistant samples and thus 2.3 times higher than expected by chance. 27
  • Classifier method    PPV               LR+                Estimated gain
    kNN (pre-study)      0.29 (0.13-0.53)  5.61 (2.21-14.28)  4.14 (1.86-7.57)
    SIMCA                0.28 (0.14-0.48)  5.26 (2.51-11.01)  4.00 (2.00-6.86)
    Ensemble classifier  0.33 (0.12-0.65)  8.09 (2.23-29.42)  6.47 (2.05-11.06)
    Random               0.06 (0.01-0.27)  0.95 (0.13-6.73)   0.97 (0.16-4.35)
    (pre-study, 550 + 275 accessions)
    Ensemble             0.26 (0.22-0.30)  2.78 (2.34-3.31)   2.32 (2.00-2.68)
    Random               0.11 (0.09-0.15)  1.02 (0.77-1.36)   0.95 (0.77-1.32)
    (blind study, 825 + 3738 accessions)
    PPV = Positive Predictive Value; LR+ = Positive Diagnostic Likelihood Ratio
    Endresen, D.T.F., K. Street, M. Mackay, A. Bari, E. De Pauw, K. Nazari, and A. Yahyaoui (2012). Sources of Resistance to Stem Rust (Ug99) in Bread Wheat and Durum Wheat Identified Using Focused Identification of Germplasm Strategy (FIGS). Crop Science [online first]. doi:10.2135/cropsci2011.08.0427; Published online 8 Dec 2011. 28
  • PDF available from:
      – http://db.tt/lZMpwgJ
    • Available from Libris (Sweden)
      – ISBN: 978-91-628-8268-6 29
  • Michael Clemet Mackay (2011)
    • PDF available from:
      – http://pub.epsilon.slu.se/8439/
    • Available from Libris (Sweden)
      – ISBN: 978-91-576-7634-4 30
  • 31
  • Advise the planning of new collecting/gathering expeditions
    – Identification of relevant areas where the crop species is predicted to be present (using GBIF data and niche models).
    – Focus on areas least well represented in the genebank collection (maximize diversity).
    – Focus on areas with a higher likelihood for a desired target trait (FIGS).
    See http://gisweb.ciat.cgiar.org/GapAnalysis/ for more information. 32
  • Species distribution model for wormwood (Artemisia absinthium L.), 7 364 records, using the Maxent desktop software. 33
  • http://data.gbif.org/datasets/network/2 Using GBIF/TDWG technology (and contributing to its development), the PGR community can more easily establish specific PGR networks without duplicating GBIF's work. The compatibility of data standards between PGR and biodiversity collections made it possible to integrate the worldwide germplasm collections into the biodiversity community (TDWG, GBIF). 34
  • Data collection and preparation
    • Geo-referencing of collecting locations
    • Initial data exploration
    • Pre-processing of dataset
    • Choose modeling method
    • Calibration of model
    • Validation of model
    • Validation of prediction results 36
  • Example of georeferencing for NGB9529, a barley landrace reported as originating from Lyderupgaard using KRAK.dk and maps.google.com 37
  • The influence plot (residuals against leverage) shows a sample (FRO) observed at Priekuli in 2003 (replicate 2) with a very high leverage, well separated from the “data cloud”. After looking into the raw data, this observation point was removed (set to NaN). 38
  • Mean centering removes the absolute intensity so that the model does not focus on the variables with the highest numerical values (intensity). Scaling makes the relative distribution of values (range spread) more equal between variables. Auto-scaling is a combination of mean centering and variance scaling. After auto-scaling all variables have a mean of zero and a standard deviation of one. The objective is to help the model to separate the relevant information from the noise. 39
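Mean centering and auto-scaling as described above can be sketched per variable (column) as follows. This is an illustrative Python example using only the standard library, not the workshop's code; the variable names and numbers are made up, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption.

```python
from statistics import mean, stdev


def mean_center(column):
    """Subtract the column mean from every value."""
    m = mean(column)
    return [x - m for x in column]


def auto_scale(column):
    """Mean-center, then divide by the standard deviation (unit variance)."""
    s = stdev(column)  # sample standard deviation; assumed convention
    return [x / s for x in mean_center(column)]


# Made-up climate variables on very different numerical scales.
precipitation = [620.0, 540.0, 710.0, 480.0, 650.0]   # mm
min_temperature = [-4.0, -1.5, -6.0, 0.5, -3.0]       # degrees C

scaled_precipitation = auto_scale(precipitation)
scaled_min_temperature = auto_scale(min_temperature)
print(round(mean(scaled_precipitation), 6), round(stdev(scaled_precipitation), 6))
```

After auto-scaling both variables have a mean of zero and a standard deviation of one, so neither dominates the model merely because of its units. A constant column has zero standard deviation and would need to be dropped before scaling.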
  • [Figure: box plots of trait scores with no preprocessing, mean centering and auto scale, for Priekuli, Bjørke and Landskrona] 40
  • – No model can ever be absolutely correct
    – A simulation model can only be an approximation
    – A model is always created for a specific purpose
    – The simulation model is applied to make predictions based on new fresh data
    – Be aware to avoid extrapolation problems 41
  • – For the initial calibration or training step.
    – Further calibration, tuning step
    – Often cross-validation on the training set is used to reduce the consumption of raw data.
    – For the model validation or goodness of fit testing.
    – External data, not used in the model calibration. 42
  • [Figure: two-way table with 14 samples (mode 1) by 36 variables (min. temperature, max. temperature and precipitation, each for Jan, Feb, Mar, …), compared with a three-way data cube with 14 samples (mode 1), 12 months (mode 2), and min. temperature, max. temperature and precipitation as the 1st, 2nd and 3rd levels of mode 3] 43
  • The two PARAFAC models, each calibrated from two independent split-half subsets, both converge to a very similar solution as the model calibrated from the complete dataset. The PARAFAC model is thus a general and stable model for the scope of Scandinavia. Example used here is the trait data model (mode 1) from Endresen (2010). 44
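The relation between the three-way data cube and the two-way table on this slide can be shown by unfolding the cube along the sample mode. The sketch below is plain Python with nested lists and made-up values; the `unfold` function name and the column order (all months for the first climate variable, then the second, then the third) are assumptions for illustration.

```python
N_SAMPLES, N_MONTHS, N_VARIABLES = 14, 12, 3

# cube[sample][month][variable]; each made-up value encodes its own indices.
cube = [[[s * 1000 + m * 10 + v for v in range(N_VARIABLES)]
         for m in range(N_MONTHS)]
        for s in range(N_SAMPLES)]


def unfold(cube):
    """Unfold a samples x months x variables cube into a samples x (variables * months) table."""
    table = []
    for sample in cube:
        n_months = len(sample)
        n_variables = len(sample[0])
        # One block of monthly values per climate variable.
        row = [sample[m][v] for v in range(n_variables) for m in range(n_months)]
        table.append(row)
    return table


table = unfold(cube)
print(len(table), len(table[0]))
```

The unfolded table holds the same numbers, but a two-way model no longer knows that column 0 and column 12 both refer to January; the multi-way model keeps that structure.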
  • 45
  • The distance between the model (predictions) and the reference values (validation) is the residuals. Example of a bad model calibration. Calibration step. Cross-validation indicates the appropriate model complexity. Be aware of over-fitting! NB! Model validation! 46
  • 47
  • Parallel Factor Analysis (PARAFAC) (Multi-way)
    • Multi-linear Partial Least Squares (N-PLS) (Multi-way)
    • Soft Independent Modeling of Class Analogy (SIMCA)
    • k-Nearest Neighbor (kNN)
    • Partial Least Squares Discriminant Analysis (PLS-DA)
    • Linear Discriminant Analysis (LDA)
    • Principal component logistic regression (PCLR)
    • Generalized Partial Least Squares (GPLS)
    • Random Forests (RF)
    • Neural Networks (NN)
    • Support Vector Machines (SVM)
    The methods above are the modeling methods used by Endresen (2010), Endresen et al. (2011, 2012), and Bari et al. (2012).
    • Boosted Regression Trees (BRT)
    • Multivariate Regression Trees (MRT)
    • Bayesian Regression Trees 48
  • Example from the stem rust set. [Figure labels: 1 PC, 2 PCs, 3 PCs; principal component 3; resistant samples, intermediate, susceptible.] Illustration modified from Wise et al., 2006:201 (PLS Toolbox software manual) 49
  • Pearson product-moment correlation (R) (-1 to 1)
    • Coefficient of determination (R2) (0 to 1)
    • Cohen’s Kappa (K) (-1 to 1)
    • Proportion observed agreement (PO) (0 to 1)
    • Proportion positive agreement (PA) (0 to 1)
    • Positive predictive value (PPV) (0 to 1)
    • Positive diagnostic likelihood ratio (LR+) (from 0)
    • Sensitivity and specificity
    • Area under the curve (AUC)
      – Receiver operating characteristics (ROC)
    • Root mean square error (RMSE)
      – RMSE of calibration (RMSEC)
      – RMSE of cross-validation (RMSECV)
      – RMSE of prediction (RMSEP)
    • Predicted residual sum of squares (PRESS) 50
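For continuous trait scores, three of the indicators listed above can be computed directly from paired actual and predicted values. The following is a minimal Python sketch (standard library only) with made-up numbers; it is not taken from the workshop material. Whether the result counts as RMSEC, RMSECV or RMSEP depends only on where the predictions come from (calibration, cross-validation or an external test set), and the sum of squared residuals corresponds to PRESS when the predictions are cross-validated.

```python
import math


def pearson_r(actual, predicted):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


def sum_squared_residuals(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    return math.sqrt(sum_squared_residuals(actual, predicted) / len(actual))


# Made-up trait scores, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(round(pearson_r(actual, predicted), 3),
      round(sum_squared_residuals(actual, predicted), 3),
      round(rmse(actual, predicted), 3))
```

Some software divides the calibration error by the remaining degrees of freedom rather than by the number of samples; this sketch uses the plain mean.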
  •                                    Predicted
                                       Resistant (positive)  Susceptible (negative)
    Observed  Resistant (positive)     True positive (TP)    False negative (FN)
    (Actual)  Susceptible (negative)   False positive (FP)   True negative (TN)
    Proportion observed agreement (PO) = (TP + TN) / N
    Proportion positive agreement (PA) = (2 * TP) / (2 * TP + FP + FN)
    Positive predictive value (PPV) = TP / (TP + FP)
    Sensitivity = TP / (TP + FN)
    Specificity = TN / (TN + FP)
    Positive diagnostic likelihood ratio (LR+) = Sensitivity / (1 – specificity)
    Positive diagnostic likelihood ratio (LR+) = (TP / [TP + FN]) / (FP / [FP + TN])
    Odds ratio (OR) = (TP * TN) / (FN * FP)
    Yule’s Q = (OR - 1) / (OR + 1)
    Positive predictive gain (Gain) = PPV / prevalence 51
  • Predictions for the cross-validated (leave-one-out) samples for an N-PLS model (Endresen, 2010) for trait 5 (volumetric weight), observations from Priekuli, 2002 (mean of the replications), and 6 principal components. [Plot: predicted trait scores against actual trait scores, annotated with the correlation coefficient and the sum of squared residuals.] 52
  • Predictions for the cross-validated (leave-one-out) samples with an N-PLS model (Endresen, 2010) for trait 5 (volumetric weight), observations from Bjørke, 2002 (mean of the replications), and 4 principal components. [Plot: predicted trait scores against actual trait scores, annotated with the correlation coefficient and the sum of squared residuals.] 53
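The indicators on this slide all follow from the four cells of the 2 by 2 confusion matrix. The sketch below computes them in Python exactly as the formulas are written on the slide; the function name and the counts are hypothetical, and prevalence is taken here as the observed proportion of resistant samples, (TP + FN) / N, which is an assumption about how the slide defines it.

```python
def classification_metrics(tp, fn, fp, tn):
    """Performance indicators from a 2 x 2 confusion matrix (counts)."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    prevalence = (tp + fn) / n          # assumed definition: observed positives / N
    odds_ratio = (tp * tn) / (fn * fp)
    return {
        "po": (tp + tn) / n,                      # proportion observed agreement
        "pa": (2 * tp) / (2 * tp + fp + fn),      # proportion positive agreement
        "ppv": ppv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_plus": sensitivity / (1 - specificity),
        "odds_ratio": odds_ratio,
        "yules_q": (odds_ratio - 1) / (odds_ratio + 1),
        "gain": ppv / prevalence,
    }


# Hypothetical counts: 40 resistant and 60 susceptible samples.
metrics = classification_metrics(tp=30, fn=10, fp=20, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the classifier picks out resistant samples at 1.5 times the rate expected by chance (PPV 0.60 against a prevalence of 0.40). The function divides by zero when a row or column of the matrix is empty, so a real implementation would need to guard those cases.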
  • Often the critical levels (α) for p-value significance are set at 0.05, 0.01 and 0.001 (5 %, 1 %, 0.1 %).
    • The significance level is often marked with one star (*) for the 0.05 level, two stars (**) for the 0.01 level and three stars (***) for the 0.001 level.
      – 5 %: if an experiment is repeated 20 times, a purely random effect is likely to be observed one time.
      – 1 %: if an experiment is repeated 100 times, a purely random effect is likely to be observed one time.
      – 0.1 %: if an experiment is repeated 1000 times, a purely random effect is likely to be observed one time. 54
  • PGR Secure (EU 7th Framework) Workshop, FIGS approach, 9-13 Jan 2012, Madrid. Dag Endresen, dag.endresen@gmail.com; Abdallah Bari (ICARDA), abdallah.bari@gmail.com 55
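The star notation described above maps a p-value onto the three conventional levels. A small Python sketch, where the function name is hypothetical and the use of strict inequalities at the boundaries is an assumption (conventions differ on whether p = 0.05 itself counts as significant):

```python
def significance_stars(p_value):
    """Return the conventional star label for a p-value."""
    if p_value < 0.001:
        return "***"   # significant at the 0.001 level
    if p_value < 0.01:
        return "**"    # significant at the 0.01 level
    if p_value < 0.05:
        return "*"     # significant at the 0.05 level
    return "n.s."      # not significant at the above levels


labels = [significance_stars(p) for p in (0.0005, 0.005, 0.03, 0.2)]
print(labels)
```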