Trait data mining using FIGS, seminar at Copenhagen University (27 May 2009)


A presentation I made for a masters student training course at Copenhagen University (KU) Faculty for Life Sciences (LIFE) in May 2009.

Endresen, D.T.F. (2010). Predictive association between trait data and ecogeographic data for Nordic barley landraces. Crop Sci. 50(6):2418-2430. doi: 10.2135/cropsci2010.03.0174




Usage Rights

CC Attribution License

Presentation Transcript

  • Overall goal:
      – User-friendly access to relevant information on plant genetic resources.
      – Increased utilization of germplasm for genetic diversity in food crops.
    Strategies to improve the utilization of germplasm in seedbank collections to increase the genetic diversity of food crops for enhanced food security.
    2
  • • Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait.
    • How does the scientist select a small subset likely to have the useful trait?
    • More than 560 000 wheat accessions in genebanks worldwide.
    Slide adapted from a slide by Ken Street, ICARDA (FIGS team)
    3
  • “I am screening for variations in powdery mildew resistance genes can you send me 1200 landrace accessions of bread wheat”…
    “I am screening for drought – could you send me some landraces from Afghanistan and some other dry countries”…
    “I am screening for rust can you send me 9000 bread wheat samples”…
    “I am looking for new salt tolerance genes can you send me some wild relatives from salty areas”…
    “I want about 500 bread durum acc to screen for RWA”…
    “I am screening for Sunn Pest and can handle about 200 acc – can you send me a selection of Triticum species”…
    Slide adapted from a slide by Ken Street, ICARDA (FIGS team)
    4
  • • The scientist or the breeder needs a smaller subset to cope with the field screening experiments.
    • A common approach is to create a so-called core collection.
    Sir Otto H. Frankel (1900-1998) proposed that a limited or "core collection" could be established from an existing collection. With minimum similarity between its entries, the core collection is of limited size and chosen to represent the genetic diversity of a large collection, a crop, a wild species or group of species (1984).
    5
  • • Given that the trait property you are looking for is relatively rare:
    • Perhaps as rare as a unique allele for one single landrace cultivar...
    • Getting what you want is largely a question of LUCK!
    Slide adapted from a slide by Ken Street, ICARDA (FIGS team)
    6
  • 7  
  • Objective of this study:
      – Explore climate data as a prediction model for “pre-screening” of crop traits BEFORE full scale field trials.
      – Identification of landraces with a higher probability of holding an interesting trait property.
    8
  • • Primitive crops and traditional landraces are the source of exotic traits and crop properties.
    • Traits from landraces are an interesting source of novel traits for improvement of modern crops.
    • Landraces are often not described for the economically valuable trait in question.
    • Identification of crop traits is often the result of a larger field trial screening project (thousands of individual plants).
    • Large scale field trials are very costly (land area and human working hours).
    9
  • The underlying assumption is that the climate at the original source location, where the landrace was developed during long-term traditional cultivation, is correlated with the trait.
    The aim is to build a computer model explaining the crop trait score (dependent variables) from the climate data (independent variables).
    10
  • Wild relatives are shaped by climate.
    Primitive cultivated crops are shaped by climate and humans.
    Traditional cultivated crops (landraces) are shaped by climate and humans.
    Modern cultivated crops (cultivars) are mostly shaped by humans (plant breeders).
    Perhaps future crops are shaped in the molecular laboratory…?
    11
  • 1) Landrace samples (genebank seed accessions)
    2) Trait observations (experimental design)
    3) Climate data (for the landrace origin locations)
    • The accession identifier (accession number) provides the bridge to the crop trait observations.
    • The longitude, latitude coordinates for the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
    12
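The two "bridges" described on this slide can be sketched as a pair of dictionary lookups. This is only an illustration: the field names and the trait and climate values are invented for the example, and only the two accession numbers and their coordinates are taken from the accession table later in the presentation.

```python
# Illustrative records. Trait and climate values are made up for this sketch.
accessions = [
    {"acc_num": "NGB9529", "lat": 56.5667, "lon": 9.35},
    {"acc_num": "NGB456", "lat": 66.1167, "lon": 12.5},
    {"acc_num": "NGB0000", "lat": 60.0, "lon": 10.0},  # hypothetical, no trait data
]
trait_scores = {  # keyed by accession number
    "NGB9529": {"heading_days": 60},
    "NGB456": {"heading_days": 55},
}
climate_by_site = {  # keyed by (latitude, longitude) of the collecting site
    (56.5667, 9.35): {"prec_jun_mm": 50},
    (66.1167, 12.5): {"prec_jun_mm": 60},
    (60.0, 10.0): {"prec_jun_mm": 70},
}


def link_records(accessions, trait_scores, climate_by_site):
    """Join traits (via accession number) and climate (via coordinates)."""
    linked = []
    for acc in accessions:
        traits = trait_scores.get(acc["acc_num"])
        climate = climate_by_site.get((acc["lat"], acc["lon"]))
        if traits is None or climate is None:
            continue  # skip accessions missing either bridge
        linked.append({"acc_num": acc["acc_num"], **traits, **climate})
    return linked


linked = link_records(accessions, trait_scores, climate_by_site)
```

Accessions that lack either a trait observation or a usable collecting site drop out of the linked table, which is why georeferencing matters for this approach.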
  • More  than  6  million  genebank  accessions,  more  than  1  400  genebanks,  worldwide.   13  
  • Photo captions: Faba bean, Finland; Field trials, Gatersleben, Germany; Cauliflower (S. Jeppson); Forage crops, Dotnuva, Lithuania; Radish (S. Jeppson); Linnés äpple.
    Disease photos: Powdery mildew (Blumeria graminis); Leaf spots (Ascochyta sp.); Yellow rust (Puccinia striiformis); Black stem rust (Puccinia graminis).
    http://barley.ipk-
    14
  • The climate data is extracted from the WorldClim dataset. http://
    Data from weather stations worldwide are combined into a continuous surface layer.
    Climate data for each landrace is extracted from this surface layer.
    Precipitation: 20 590 stations
    Temperature: 7 280 stations
    15
  • This study is part of a new method to predict crop traits of primitive cultivated material from climate variables by using multivariate statistical methods.
    16
  • FIGS
    The FIGS technology takes much of the guess work out of choosing which accessions are most likely to contain the specific characteristics being sought by plant breeders to improve plant productivity across numerous challenging environments.
    http://
    17
  • What is http://
    Origin of Concept (1980s): Wheat and barley landraces from marine soils in the Mediterranean region provided genetic variation for boron toxicity.
    Map labels: Mediterranean region; Queensland, Australia.
    Slide made by M C Mackay 1995
    18
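Extracting a climate value for a landrace from a gridded surface layer amounts to finding the grid cell that contains the collecting site. The sketch below assumes a regular latitude/longitude grid stored as a list of rows with its north-west corner and cell size known. The tiny grid and its values are invented, and this is not the WorldClim file format or API.

```python
import math

# Toy surface layer: 2 rows x 3 columns of 10-degree cells (invented values).
# Row 0 covers latitudes 60..70, row 1 covers 50..60; columns start at longitude 0.
GRID = [
    [40, 45, 50],
    [55, 60, 65],
]
NORTH, WEST, CELL_SIZE = 70.0, 0.0, 10.0


def extract_value(grid, lat, lon, north=NORTH, west=WEST, cell_size=CELL_SIZE):
    """Return the value of the cell containing (lat, lon), or None if outside."""
    row = math.floor((north - lat) / cell_size)
    col = math.floor((lon - west) / cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None  # the site falls outside the layer


value_denmark = extract_value(GRID, 56.5667, 9.35)  # inside the toy grid
value_faroe = extract_value(GRID, 62.0167, -6.7667)  # west of the toy grid
```

`math.floor` is used rather than `int` so that sites just outside the western or northern edge are rejected instead of being silently assigned to the first row or column.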
  • Slide made by M C Mackay 1995 19  
  • • No sources of Sunn pest resistance previously found in hexaploid wheat.
    • 2000 accessions screened at ICARDA without result.
    • A FIGS set of 534 accessions was developed and screened.
    • 10 resistant accessions were found!
    • The FIGS selection started from 16 000 landraces from VIR, ICARDA and AWCC.
    • Exclude origin CHN, PAK, IND, where Sunn pest was only recently reported (6 328 acc).
    • Only one accession per collecting site (2 830 acc).
    • Excluding dry environments below 280 mm/year.
    • Excluding sites of low winter temperature below 10 degrees Celsius (1 502 acc).
    Slide adapted from Ken Street, ICARDA (FIGS team)
    20
  • • The fundamental ecological niche of an organism was formalized by Hutchinson[1] in 1957 as a multidimensional hypercube defining the ecological conditions that allow a species to exist.
    • Full understanding of all the environmental conditions for any organism is a monumental task[2].
    • A computer model of the occurrence localities together with associated environmental conditions such as rainfall, temperature, day length etc., provides an approximation of the fundamental niche.
    • Popular software implementations for modeling the ecological niche include openModeller, MaxEnt, BioCLIM, DesktopGARP, etc.
    George Evelyn Hutchinson (1903 – 1991)
    21
  • A flexible, user-friendly, cross-platform environment where the entire process of a fundamental niche modeling experiment can be carried out.
    Input: species occurrence and environmental data.
    Output: a fundamental niche model and projection of the model into an environmental scenario.
    http://
    22
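The selection steps listed on this slide can be read as a chain of filters. The sketch below applies them in the order given, using the thresholds from the slide. The record layout, field names and the five sample records are hypothetical, and the real FIGS selection may have used further criteria not shown here.

```python
EXCLUDED_COUNTRIES = {"CHN", "PAK", "IND"}  # Sunn pest only recently reported
MIN_ANNUAL_PREC_MM = 280                    # exclude drier environments
MIN_WINTER_TEMP_C = 10                      # exclude colder winter sites


def figs_filter(accessions):
    """Apply the slide's exclusion rules in order and return the survivors."""
    seen_sites = set()
    selected = []
    for acc in accessions:
        if acc["country"] in EXCLUDED_COUNTRIES:
            continue
        site = (acc["lat"], acc["lon"])
        if site in seen_sites:
            continue  # keep only one accession per collecting site
        seen_sites.add(site)
        if acc["prec_annual_mm"] < MIN_ANNUAL_PREC_MM:
            continue
        if acc["winter_temp_c"] < MIN_WINTER_TEMP_C:
            continue
        selected.append(acc)
    return selected


# Hypothetical records, one per rule.
sample = [
    {"id": "A", "country": "SYR", "lat": 36.0, "lon": 37.0, "prec_annual_mm": 350, "winter_temp_c": 12},
    {"id": "B", "country": "CHN", "lat": 35.0, "lon": 110.0, "prec_annual_mm": 500, "winter_temp_c": 12},
    {"id": "C", "country": "SYR", "lat": 36.0, "lon": 37.0, "prec_annual_mm": 350, "winter_temp_c": 12},
    {"id": "D", "country": "TUR", "lat": 38.0, "lon": 40.0, "prec_annual_mm": 200, "winter_temp_c": 12},
    {"id": "E", "country": "IRN", "lat": 34.0, "lon": 48.0, "prec_annual_mm": 400, "winter_temp_c": 5},
]
selected_ids = [acc["id"] for acc in figs_filter(sample)]
```

Each rule removes accessions unlikely to have been exposed to the pest, so the surviving set is enriched for sites where resistance could have been selected for.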
  • 23  
  • – The initial model is developed from the training set
    – Fine tuning of model parameters and settings
    – No model can ever be absolutely correct!
    – A simulation model can only be an approximation
    – A model is always created for a specific purpose!
    – The simulation model is applied to make predictions based on new fresh data
    – Be aware of extrapolation
    24
  • – For the initial calibration or training step.
    – Further calibration, tuning step
    – Often cross-validation on the training set is used to reduce the consumption of raw data.
    – For the model validation or goodness of fit testing.
    – External data, not used in the model calibration.
    25
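Cross-validation on the training set can be illustrated with a leave-one-out loop: each sample is held out once, the model is refitted on the rest, and the held-out sample is predicted. The sketch below uses a simple one-variable least-squares line as a stand-in model, which is an assumption made for brevity (the presentation itself works with multi-way models), and the data are invented.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmsecv_leave_one_out(xs, ys):
    """Root mean square error of leave-one-out cross-validated predictions."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        prediction = slope * xs[i] + intercept
        squared_errors.append((ys[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented data: an exactly linear series and a noisy one.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
exact = rmsecv_leave_one_out(x_values, [3.0, 5.0, 7.0, 9.0, 11.0])
noisy = rmsecv_leave_one_out(x_values, [3.0, 5.5, 6.5, 9.5, 10.5])
```

Because every sample serves once as a validation point, no data need to be set aside permanently, which matters when only 14 landraces are available.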
  • 26  
  • Name of the statistic              Symbol   Range
    Correlation coefficient            r        -1 to 1
    Coefficient of determination       r2       0 to 1
    • A number of different coefficients have been developed to measure correlation in different situations.
    • The best known is the Pearson product-moment correlation coefficient.
    • The correlation coefficient indicates the strength and direction of a linear relationship between two random variables.
    • The coefficient of determination indicates how well future outcomes are likely to be predicted by a statistical model.
    The covariance of the two variables is divided by the product of their standard deviations.
    27
  • The distances between the model (predictions) and the reference values (validation) are the residuals.
    Figure labels: Example of a bad model calibration. Cross-validation indicates the appropriate model complexity.
    Be aware of over-fitting! NB! Model validation!
    28
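The definition on this slide (covariance divided by the product of the standard deviations) translates directly into a few lines of code. The sketch below is a plain-Python illustration with invented data. The function name is not from the presentation.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return covariance / (sd_x * sd_y)


# Invented example: a perfectly increasing and a perfectly decreasing series.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
r_squared = r_up ** 2  # coefficient of determination
```

Squaring r discards the sign, which is why the coefficient of determination ranges from 0 to 1 while r ranges from -1 to 1.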
  • 29  
  • 30  
  • Station                      Altitude   Latitude   Longitude
    Priekuli, Latvia             83 m       57.3167    25.3667
    Bjørke forsøksgård, Norway   149 m      60.7667    11.2167
    Landskrona, Sweden           3 m        55.8667    12.8333
    31
  • accide   AccNum     Country         Locality                Elevation   Latitude   Longitude   Coordinate
    7436     NGB27      Finland         Sarkalahti, Luumäki     95 m        61.0333    27.3333     SESTO
    9717     NGB456     Norway          Dønna, Nordland         71 m        66.1167    12.5        Georeferenced
    9601     NGB468     Norway          Trysil                  400 m       61.2833    12.2833     Georeferenced
    9600     NGB469     Norway          BJØRNEBY                400 m       61.2833    12.2833     Georeferenced
    7966     NGB775     Sweden          Överkalix, Allsån       45 m        66.4       22.9333     SESTO
    8510     NGB776     Sweden          Överkalix               100 m       66.4       22.7667     SESTO
    7810     NGB792     Finland         Luusua, Kemijärvi       145 m       66.4833    27.35       SESTO
    9538     NGB2072    Norway          Finset                  1220 m      60.6       7.5         Georeferenced
    8482     NGB2565    Sweden          Öland                   11 m        56.7333    16.6667     Georeferenced
    9102     NGB4641    Denmark         Støvring, Jylland       55 m        56.8833    9.8333      Georeferenced
    9015     NGB4701    Faroe Islands   Faroe Islands           81 m        62.0167    -6.7667     Georeferenced
    9039     NGB6300    Faroe Islands   Faroe Islands           81 m        62.0167    -6.7667     Georeferenced
    8531     NGB9529    Denmark         Lyderupgaard            9 m         56.5667    9.35        Georeferenced
    7344     NGB13458   Finland         Koskenkylä, Rovaniemi   91 m        66.5167    25.8667     Georeferenced
    32
  • From a total of 19 landrace accessions included in the dataset, only 4 had geo-referenced coordinates in the NordGen SESTO database.
    10 accessions were geo-referenced from the reported place name and descriptions of the original gathering site included in SESTO and other sources.
    For 5 accessions there was not enough information available to locate the original gathering location.
    Right side illustration: example of georeferencing for NGB9529, landrace reported as originating from Lyderupgaard using … and …
    33
  • 34  
  • Score plots
    The observations made at Priekuli (Latvia) are separated from the observations made at Bjørke (Norway) and Landskrona (Sweden) in PC1 and PC2.
    The combined observations from each year (2002 and 2003) are less separated.
    The two replicate series are NOT separated.
    35
  • The bi-plot shows heading days and ripening days as the most influential trait variables for the separation of the observations from the different observation locations.
    Length of plant participates in spreading out the scores (in PC1 and PC2), but is less active in the separation of the groups.
    The influence plot (residuals against leverage) shows the sample observed at Priekuli in 2003 (replicate 2) with a very high leverage, well separated from the “data cloud”.
    After looking into the raw data (see next slide), this data point was removed as an outlier (set to NaN).
    36
  • Sample (FRO) observed at Priekuli in 2003 (replicate 2) has the lowest score for harvest index in the entire dataset.
    After looking into the raw data (see the table above), this observation point was removed as an outlier (set to NaN).
    37
  • The initial PCA analysis of the climate data showed a nice spread of the scores. No surprises.
    The influence plot identified sample (NOR) as a mild outlier. I decided to keep this sample, but to keep an eye out for it in the multi-way analysis.
    38
  • 39  
  • • Plot of the trait scores (max – min) from each observation location and year.
    • The different experimental conditions have a significant effect on the trait observations.
    40
  • 41  
  • Mode 3 (climate variables: tmin, tmax, and prec) has very different ranges of numerical values. Scaling across mode 3 is thus applied to the multi-way models.
    Left: box-plot for the 3-way data, unfolded so as to keep the dimensions of mode 3 (axis labels: tmin, tmax, prec; panel title: Scaling across mode 3).
    The 3-way climate data was reasonably well described by a PARAFAC model of two components.
    42
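One plausible reading of the scaling step is that every climate variable (each slab of mode 3) is divided by a single scale factor computed over all landraces and months, so that tmin, tmax and prec end up on comparable numerical ranges. The sketch below implements that reading with the root mean square as the scale factor. This choice, the array layout `[landrace][month][variable]` and the toy numbers are assumptions, not details confirmed by the slides.

```python
import math


def scale_mode3(data):
    """Divide each mode-3 variable by its root mean square over modes 1 and 2."""
    n_vars = len(data[0][0])
    scales = []
    for k in range(n_vars):
        values = [cell[k] for sample in data for cell in sample]
        rms = math.sqrt(sum(v * v for v in values) / len(values))
        scales.append(rms if rms > 0 else 1.0)  # avoid dividing by zero
    scaled = [
        [[cell[k] / scales[k] for k in range(n_vars)] for cell in sample]
        for sample in data
    ]
    return scaled, scales


# Toy array: 2 landraces x 2 months x 3 variables (tmin, tmax, prec), invented values.
toy = [
    [[-5.0, 2.0, 40.0], [10.0, 20.0, 60.0]],
    [[-10.0, 0.0, 30.0], [8.0, 18.0, 80.0]],
]
scaled, scales = scale_mode3(toy)
```

After scaling, precipitation in millimetres no longer dominates temperatures in degrees simply because its numbers are larger.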
  • 43  
  • Figure: structure of the trait dataset. 28 records (14 landraces × 2) by 6 traits by 6 location-year combinations.
    Mode 2 (traits), 6 traits: Heading days, Ripening days, Length of plant, Harvest index, Volumetric weight, Grain weight.
    Mode 3, 6 levels: LVA 2002, LVA 2003, NOR 2002, NOR 2003, SWE 2002, SWE 2003.
    Unfolded layout: Bjørke (N) 2002, Bjørke (N) 2003, Landskrona (S) 2002, Landskrona (S) 2003, Priekuli (Lv) 2002, Priekuli (Lv) 2003, each with 6 traits, over 28 records.
    44
  • Figure: structure of the climate dataset (location of origin). 14 landraces (samples) by 12 monthly means by 3 climate variables.
    Climate data (mode 3):
      • Minimum temperature
      • Maximum temperature
      • Precipitation
      • … (many more can be added)
    Unfolded layout: Min. temperature (Jan, Feb, Mar, …), Max. temperature (Jan, Feb, Mar, …), Precipitation (Jan, Feb, Mar, …), over 14 samples.
    45
  • • The initial PARAFAC models calibrated from the 4-way trait dataset failed to converge to any good models. The core-consistency remained very low.
    • The problem proved to be a lack of systematic independent variation between instances of mode 3 (observation years) and mode 4 (observation locations).
    • A two-component PARAFAC model was chosen for the new 3-way trait dataset.
    Sample (NOR) was identified as a mild outlier from the influence plot. Notice that both replications are located in the same part of the plot, and that they (together) are not isolated from the “data cloud”.
    46
  • PARAFAC split-half (mode 1) analysis:
    The two PARAFAC models, each calibrated from one of two independent split-half subsets, both converge to a very similar solution as the model calibrated from the complete dataset.
    The PARAFAC model is thus a general and stable model for the scope of Scandinavia.
    47
  • Further search for any good PARAFAC split-half for the climate dataset:
    A systematic recording of results from 10 different split-half alternatives resulted in two good split-halves.
    The PARAFAC model for the climate data is thus reasonably general (for Scandinavia), but less stable than the model for the 3-way trait data.
    48
  • 49  
  • 50  
  • • Often the critical levels (α) for the p-value are set at 0.05, 0.01 and 0.001.
    • Modeling 14 samples (landraces) gives:
      – 12 degrees of freedom for the correlation tests
      – One-tailed test (looking only at positive correlation of predictions versus the reference values).
      – A coefficient of determination (r2) larger than 0.56 is significant at the 0.001 (0.1%) level for 14 values/samples.
    Many introductory text books on statistics include a table of Critical Values for Pearson’s r.
    51
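The 0.56 threshold can be checked from a Student's t table, since the test statistic for a correlation is t = r·sqrt(df / (1 − r²)), which rearranges to r = t / sqrt(t² + df). The sketch below assumes the one-tailed critical t value for α = 0.001 and 12 degrees of freedom is 3.930, as listed in standard tables. That constant is the only input not taken from the slide.

```python
import math

# Assumed table value: one-tailed critical t for alpha = 0.001, df = 12.
T_CRIT_ONE_TAILED_0001_DF12 = 3.930


def critical_r(t_crit, df):
    """Smallest correlation that is significant, given a critical t value."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


n_samples = 14
df = n_samples - 2  # two parameters are estimated in a simple correlation
r_crit = critical_r(T_CRIT_ONE_TAILED_0001_DF12, df)
r2_crit = r_crit ** 2
```

With these inputs the critical r² comes out close to the 0.56 quoted on the slide, so a model must explain a little over half of the variance before it counts as significant at this level with only 14 samples.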
  • 52  
  • • Latvia 2002 (LY11)
      – May 2002 was extremely dry in Priekuli.
      – June 2002 was extremely wet in Priekuli.
      – The wet June caused germination on the spikes for many of the early varieties.
    • Landskrona 2003 (LY32)
      – June 2003 was extremely dry in Landskrona.
      – June was the time for grain filling here.
    • Too extreme for the genotype to be “normally” expressed?
    • Too large an effect from “G by E” interaction?
    53
  • Rainfall (mm)
    Station                      Year   Sowing week   May    June    July    August
    Bjørke forsøksgård, Norway   2002   17            82.9   67.4    128.5   136.5
    Bjørke forsøksgård, Norway   2003   21            75.1   85.7    67.1    53.2
    Landskrona, Sweden           2002   13            53.5   75.3    76.4    68.9
    Landskrona, Sweden           2003   15            70.7   40.4    76.0    45.7
    Priekuli, Latvia             2002   17            38.2   111.1   67.0    11.3
    Priekuli, Latvia             2003   19            88.0   59.2    87.8    175.8
    54
  •   55  
  •         56  
  • Exploring why some of the subsets (LY) give very bad N-PLS regressions...
    57
  •   58  
  • All  samples   RMSECV=3.72   Without  NGB456   RMSECV=3.18     Expl.  X  =  96%   r2  cal  =  0.64   Expl.  X  =  98%   r2  cal  =  0.54   r2  cv  =  0.16   Expl.  y  =  54%   r2  cv  =  0.33   Expl.  y  =  64%   59  
  • 60  
  • 61  
  • 65  
  • • The first dataset I started to work with is a “FIGS” dataset with genebank accessions of barley (Hordeum vulgare ssp. vulgare) collected from different countries worldwide and tested for susceptibility to net blotch infection. Net blotch is a common disease of barley caused by the fungus Pyrenophora teres.
    • The barley plants were inoculated with the fungus and the percentage of the leaves infected with the disease was normalized to an interval scale (1 to 9).
    • 1-3 are basically resistant (group 1)
    • 4-6 are intermediate (group 2)
    • 7-9 are susceptible (group 3)
    66
  • • Field locations (USA)
      – Athens, Georgia (273 observations)
      – Fargo, North Dakota (3381 observations)
      – Langdon, North Dakota (858 observations)
      – Stephen, Minnesota (139 observations)
    • Observation years (1987 – 2004)
      – 9 distinct years
    • Greenhouse versus field trials
      – Greenhouse (1676 observations)
      – Field trial (2975 observations)
    67
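The grouping of the 1 to 9 scale into three classes is simple enough to state as a function. This is a minimal sketch. The function name and the choice to reject scores outside 1 to 9 are assumptions made for the example.

```python
def net_blotch_group(score):
    """Map a net blotch score (1-9) to group 1 (resistant), 2 (intermediate) or 3 (susceptible)."""
    if not 1 <= score <= 9:
        raise ValueError(f"score must be between 1 and 9, got {score}")
    if score <= 3:
        return 1
    if score <= 6:
        return 2
    return 3


groups_for_all_scores = [net_blotch_group(s) for s in range(1, 10)]
```

Collapsing nine scores into three groups turns the prediction task into a three-class classification problem, which is the form used in the discriminant analysis later in the presentation.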
  • 68  
  • Individual 95% CIs For Mean Based on Pooled StDev
    Level       N     Mean   StDev  -----+---------+---------+---------+-
    ATHENS    262   2,0840  0,6555                           (---*---)
    FARGO     789   1,6793  0,6023         (-*-)
    LANGDON  1558   1,6727  0,6466         (-*)
    STEPHEN   136   1,6103  0,7810  (-----*----)
                                    -----+---------+---------+---------+-
                                      1,60      1,80      2,00      2,20
    • One-way ANOVA test for difference between the observation locations. The p-value of 0.000 rejects the null hypothesis of no difference.
    • The Tukey pair-wise comparison test gave the same result.
    70
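The F statistic behind a one-way ANOVA can be recomputed from the group sizes, means and standard deviations shown in the output above. The sketch below does this from the summary statistics alone, with decimal commas converted to points. It is an illustration of the calculation, not a reproduction of the statistics package used for the slide, and it stops short of computing the p-value.

```python
# Summary statistics transcribed from the slide: (N, mean, standard deviation).
groups = {
    "ATHENS": (262, 2.0840, 0.6555),
    "FARGO": (789, 1.6793, 0.6023),
    "LANGDON": (1558, 1.6727, 0.6466),
    "STEPHEN": (136, 1.6103, 0.7810),
}


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) from per-group N, mean and SD."""
    total_n = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * mean for n, mean, _ in groups.values()) / total_n
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = one_way_anova_f(groups)
```

The resulting F statistic is far above 1, which is consistent with the very small p-value reported on the slide and is driven mainly by the higher mean score at Athens.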
  • 71  
  • • Agro-climatic Zone (UNESCO classification)
    • Soil classification (FAO Soil map)
    • Aridity (dryness)
    • Precipitation
    • Potential evapotranspiration (water loss)
    • Temperature
    • Maximum temperatures
    • Minimum temperatures
    (mean values for month and year)
    72
  • Discriminant Analysis: obs_nb versus acz_moisture; ...
    Quadratic Method for Response: obs_nb
    Predictors: acz_moisture; acz_winter_temp; acz_summer_temp; arid_annual; pet_annual; prec_annual; temp_annual; tmax_annual; tmin_annual

    Group       1      2      3
    Count    1049   1190    234

    Summary of classification
    Put into Group       1       2       3
    1                  523     427      48
    2                  287     451      25
    3                  238     314     163
    Total N           1048    1192     236
    N correct          523     451     163
    Proportion       0,499   0,378   0,691

    N = 2476   N Correct = 1137   Proportion Correct = 0,459

    • The proportion correctly classified for the training dataset was 45.9%, and we would expect a similar success rate for the prediction of the “blinded” values.
    • Remember that random classification into three groups gives 33.3%.
    • A test set of 9 samples showed a proportion of correct classifications of 44.4%.
    73
  • Michael Mackay: FIGS coordinator
    Ken Street: FIGS project leader
    Harold Bockelman: Net blotch data
    Eddy De Pauw: Climate data
    Dag Endresen: Data analysis
    74
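The proportions in the classification summary follow directly from the counts in the table. The sketch below recomputes them. It assumes the rows are the predicted ("put into") groups and the columns are the true groups, which is what the "Total N" row of column sums suggests.

```python
# Rows: group the sample was put into. Columns: assumed true group (1, 2, 3).
confusion = [
    [523, 427, 48],
    [287, 451, 25],
    [238, 314, 163],
]

n_groups = len(confusion)
n_correct = sum(confusion[i][i] for i in range(n_groups))
n_total = sum(sum(row) for row in confusion)
proportion_correct = n_correct / n_total

# Per-group proportion: correct count divided by the column (true group) total.
column_totals = [sum(confusion[i][j] for i in range(n_groups)) for j in range(n_groups)]
per_group = [confusion[j][j] / column_totals[j] for j in range(n_groups)]

chance_level = 1 / n_groups  # the slide's 33.3% baseline for random classification
```

The overall rate sits above the one-in-three baseline, and the per-group rates show that the susceptible group (group 3) is recognized far more reliably than the intermediate group (group 2).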
  • 75