The Biodiversity Informatics Landscape


Presentation given at the Biodiversity Informatics Horizons meeting in Rome, Italy, 3-6 September 2013.


  1. 1. Vince Smith. The biodiversity informatics landscape: a systematics perspective. Biodiversity Informatics Horizons, Rome, 3-6 Sept 2013
  2. 2. Overview 1. Background: the biodiversity informatics domain • The problem (i.e. why are we here) • Representations of the domain (data, infrastructures, projects…) • Toward an integrated view (strategy) 2. Social challenges • Openness • Collaboration and communities • Standards, identifiers & protocols 3. (Big) data challenges • Mobilizing existing data (metadata, literature, collections) • New forms of data ([meta]genomics & observatories) 4. Synthetic challenges • Data aggregation & linking • Visualisation • Modeling 5. Next steps (data infrastructures & funding) • Lessons learned: new informatics opportunities in H2020
  3. 3. 1. Background
  4. 4. The problem: integrating biodiversity research. How do we join up these activities? How do we use this as a tool? • Species conservation & protected areas • Impacts of human development • Biodiversity & human health • Impacts of climate change • Food, farming & biofuels • Invasive alien species. What infrastructures do we need? (technologies, tools, standards…) What processes do we need? (modelling, workflows…) What data do we need? (genes, localities…)
  5. 5. Natural History: the foundation. "It is interesting to contemplate a tangled bank, clothed with many plants of many kinds, …, so different from each other, and dependent upon each other in so complex a manner, have all been produced by laws acting around us." C. Darwin, "On the Origin of Species", 1859. Darwin's "tangled bank"… Systematics, a foundational "law"
  6. 6. Ecological interactions
  7. 7. A granular understanding of biodiversity [Figure: nested levels of biodiversity data: genes (GenBank), individuals, populations (local populations), species (global biodiversity) and interactions (biological networks)]
  8. 8. An informatician's view of biodiversity. Key problems • Landscape is complex, fragmented & hard to navigate • Many audiences (policy makers, scientists, amateurs, citizen scientists) • Many scales (global solutions to local problems) [Figure adapted from Peterson et al. 2010, with bands labelled Data, Products and Systems. Labels shown: Genotype, Phenotype, Biotic Interactions, Environment, Human Effects, Niche & Pop. Ecology, Biodiversity Loss, Phylogenetic Trees, Taxonomy, Geographic Distributions, Range Maps, Forecasts of Change, Conservation & Management, GenBank, MorphBank, Interactions, Geospatial, Census, IUCN, TreeBase, IPNI, ZooBank, Pop. data, GBIF, Extent of Occurrence, AquaMaps]
  9. 9. A project-centric view of biodiversity. A snapshot from 2009, "the dance of the initiatives". [Figure labels: Nomenclators Index Fungorum ZooBank IPNI (Kew/AUS/Harvard) ING AFD/APC/APUI NZOR CoL (Sp2000 & ITIS) ZooRecord PESI: ERMS Fauna Europaea Euro+Med Plantbase ORBIS WORMS Flora Europaea Checklists Phylogenetic Tree of Life TreeBase CIPRES Molecular Databases NCBI/EMBL/DDBJ CBoL Barcode of Life Initiative Biodiversity ALA CONABIO CRIA (Brazil) IUCN SEEK OPAL DAISIE iNaturalist uBio PLAZI Inotaxa BHL eFloras Scan / Mark-up Identification Key2Nature IdentifyLife Inter-Institutional Synthesis BCI BioCASE GeoCASE MaNIS Institutional EMu (=MOA) Recorder TDWG LifeWatch GBIF CDM GNA (NameBank) IPNI Google Scholar Connotea ViTaL ISI Bibliographic Descriptive / classification EoL Scratchpads CATE MorphoBank Wikipedia]
  10. 10. The strategic view: community informatics challenges • GBIF GBIC Report (coming soon) • EU Biodiversity Strategy (2011) • Biodiv. Inf. Challenges (2013). Grand Challenges for Biodiversity Informatics (integrating activities for H2020)
  11. 11. 2. Social challenges - Openness - Collaboration and communities - Standards, identifiers & links
  12. 12. Openness in biodiversity informatics. "One-half of all papers are now freely available within a year or two of publication" (E. Archambault et al., Proportion of Open Access Peer-Reviewed Papers at the European and World Levels--2004-2011, June 2013, Science-Metrix Inc.). "A piece of data or content is open if anyone is free to use, reuse, and redistribute it - subject, at most, to the requirement to attribute and/or share-alike." http:// Many kinds of openness: • Open Access • Open Data • Open Science • Open Source. • Sharing data is a foundation for our activities • Normal practice in some communities (molecular) • Mandated by some funders & governments
  13. 13. Openness in biodiversity informatics. Need to continue to incentivise openness. Many kinds of openness: • Open Access • Open Data • Open Science • Open Source. "A piece of data or content is open if anyone is free to use, reuse, and redistribute it - subject, at most, to the requirement to attribute and/or share-alike." http:// • Sharing data is a foundation for our activities • Normal practice in some communities (molecular) • Mandated by some funders & governments. Incentivise through credit via citation (e.g. BDJ)
  14. 14. Collaboration & communities: making taxonomy a team sport. What are Scratchpads? (http:// Taxa • Projects • Regions • Societies. 544 Scratchpad communities by 6,644 active registered users covering 91,631 taxa in 535,317 pages. 81 paper citations in 2012. In total more than 1,300,000 visitors. e.g., Scratchpad Virtual Research Communities. Our infrastructures need to facilitate collaboration
  15. 15. Standards, identifiers & protocols: a foundation for integration, facilitating data sharing across communities. Standards can't be developed in isolation; they must be used. Key requirements: • Need to be inclusive, practical & extensible • Readable by humans & machines • Widely used. Good examples: • Darwin Core • CrossRef & DataCite DOIs • ORCID author identifiers. Gaps / problems: • Reuse & persistence of identifiers • Vocabularies & ontologies (time consuming / little reward). Potential solutions: • Build them into our credit systems • Show semantic reasoning potential (LOD & RDF demonstrators)
  16. 16. 3. (Big) data challenges - Mobilising existing data - New forms of data
  17. 17. Mobilising existing data: collections, literature & metadata. How can we quickly, efficiently and cost effectively mobilise biological data at scale? Collections • 1.5-3B specimens in collections worldwide • Fragmented efforts / heterogeneity of process • Needs ambition (NHM: 20M in 5 yrs.) & coordination. Literature • >300M pages of biodiversity literature • BHL (41M pp.) an example of what can be done • Needs sustainability & article metadata. Metadata registries • Data about data (cheaper & scalable) • e.g. bibliographic data, dataset portals. Informatics challenges • Storage & persistence • Automation & annotation • Incentives to digitise & fitness for use. [Images: Bibliography of Life (RefFinder & RefBank); BHL literature; NHM digitisation]
  18. 18. Mobilising & managing new forms of data: metagenomics & ecological observatories. These new data types do not depend on traditional taxonomy & systematics. New molecular approaches • Molecular detection & monitoring of organisms is routine • Metagenomics (env. sequencing) commonplace • Becoming the primary route to understanding biodiversity. Ecological observatories • Automated biodiversity detection • Remote sensing (e.g. satellite & acoustic data, drones, camera traps) • Monitoring conspicuous, rare or invasive spp. (algal blooms, palms) • Monitoring human activity. Informatics challenges • Very large quantities of data (2.5-10TB per researcher per yr.) • Doesn't map well to existing data infrastructures • Challenges current networking & storage capacity • Digital and physical collections become equally important? [Image captions: 3-4 June 2013, NHM; 22 July, 2013]
  19. 19. 4. Synthetic challenges - Data aggregation & linking - Visualisation - Modeling
  20. 20. Aggregation & linking: portals bringing together distributed & diverse forms of data, giving consistent and comprehensive access to all biological data. Several approaches, with different advantages • Tightly coupled to a few data sources (e.g. eMonocot, CDM) • Loosely coupled to many sources (e.g. BioNames, Wikipedia) • Hybrid forms (e.g. Canadensys, EOL, GBIF). Informatics challenges • Portals are hard to sustain • New methods of data discovery & access • Create new windows (views) on content • New data structures, new types of database. BioNames: scalable but less accurate (3M taxon names, 93k phylogenies & 28k articles). eMonocot: selective & accurate but hard to scale (276k taxa, 8k images, 13 keys & 3 phylogenies)
  21. 21. Visualisation: visually synthesizing large, linked biodiversity datasets, making biodiversity data accessible & understandable. NHM specimen records http:// Research opportunities • Tools integration (e.g. GeoCat, CartoDB) • Span multiple audiences. Outreach opportunities • Visually compelling story telling • Crowdsourcing tools (e.g. Notes From Nature). Exploiting new technologies • Touch screens • Mobile • Location awareness. Informatics challenges • Very specific to individual use cases • Sustainability issues
  22. 22. Modeling the biosphere: a (the) 30 year goal? Reasoning across large, linked biodiversity datasets. A clear, singular, long-term vision, which biodiversity data can contribute to. Conceptually has many potential uses • Identifying trends • Explaining patterns • Making predictions • Real time alerts (when data contradicts current knowledge) • The ultimate policy tool. Major informatics challenges • Technically very difficult (many years off) • Needs effective prototypes & platforms • Some first steps e.g. OBOE, LEFT. Nature 2013, doi:10.1038/493295a
  23. 23. 5. Next steps
  24. 24. Lessons learned: new opportunities in H2020. How do we join up these activities? PATHWAYS TO INTEGRATION (by addressing these social, data & synthetic challenges) • Break out of the discipline, technical & project centric activities (it is unsustainable, inefficient & bad for science) • Integrate & build on existing programmes where possible (LifeWatch is a potential umbrella for these activities) • Bridge the disconnect between informaticians & users (make the users informaticians & the informaticians users) • Our products are well suited to address these challenges • Use H2020 as a mechanism to achieve integration
  25. 25. QUESTIONS
  26. 26. Possible biodiversity informatics design principles* 1. Start with needs - focus on real user needs (not just the 'official process') 2. Do less - if someone else is doing it, link to it or use it 3. Design with data - prototype and test with real users on the live website 4. Do the hard work to make it simple - let the computer take the strain 5. Iterate. Then iterate again. - iteration reduces risk & is more sustainable 6. Build for inclusion - it's easier in the long run 7. Understand context - we are designing for people, not a screen or a brand 8. Build digital services, not websites - there is life beyond the website 9. Be consistent, not uniform - every circumstance is different 10. Make things open: it makes things better - it's more sustainable. = experience from 7 years with the Scratchpads = lessons for infrastructures in H2020? *https://
  27. 27. Mobilising existing data: how to prioritise (Nick Poole, UK Collections Trust). [Figure axes: CONTENT, METADATA; scale: A LITTLE to A LOT.] Digitise a few things & invest in depth, description & promotion, or digitise lots of things & put little effort into description & promotion. Uses shown: FUN, OUTREACH, LEARNING, RESEARCH, AGGREGATION, DATA MINING, COLLECTIONS MANAGEMENT
  28. 28. Collaboration & communities: making taxonomy a team sport • Very few recent single author papers • Most (fundable) science is cross-disciplinary • Need to incentivise data curation & annotation • Need mechanisms to share annotations. Our infrastructures need to facilitate collaboration. [Figure, Joppa et al. 2011: average dates when increasing numbers of taxonomists were involved in describing species, for cone snails, birds, mammals, amphibians, spiders and plants]
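
The digitisation figures quoted on slide 17 (1.5-3B specimens worldwide, the NHM target of 20M in 5 years, BHL's 41M of more than 300M pages) can be put side by side with some back-of-envelope arithmetic. The sketch below is only an illustration of scale: the function names and the 250 working days per year are assumptions introduced here, not figures from the presentation, and a constant digitisation rate is assumed throughout.

```python
# Back-of-envelope digitisation arithmetic using the figures quoted on slide 17.
# Function names and WORKING_DAYS_PER_YEAR are illustrative assumptions.

NHM_TARGET_SPECIMENS = 20_000_000                  # "NHM: 20M in 5 yrs."
NHM_TARGET_YEARS = 5
WORLD_SPECIMENS = (1_500_000_000, 3_000_000_000)   # "1.5-3B specimens"
BHL_PAGES = 41_000_000                             # "BHL (41M pp.)"
LITERATURE_PAGES_MIN = 300_000_000                 # ">300M pages": a lower bound
WORKING_DAYS_PER_YEAR = 250                        # assumption, not from the slides


def annual_rate(target: int, years: int) -> float:
    """Items that must be digitised per year to reach the target."""
    return target / years


def years_needed(total: int, rate_per_year: float) -> float:
    """Years to digitise `total` items at a constant annual rate."""
    return total / rate_per_year


def share(done: int, total: int) -> float:
    """Fraction of `total` already covered by `done`."""
    return done / total


if __name__ == "__main__":
    rate = annual_rate(NHM_TARGET_SPECIMENS, NHM_TARGET_YEARS)
    print(f"NHM target rate: {rate:,.0f} specimens per year")
    print(f"  about {rate / WORKING_DAYS_PER_YEAR:,.0f} per working day (assumed)")

    low, high = (years_needed(n, rate) for n in WORLD_SPECIMENS)
    print(f"World collections at that rate: {low:,.0f} to {high:,.0f} years")

    # The literature total is a lower bound, so this share is an upper bound.
    print(f"BHL share of literature: at most {share(BHL_PAGES, LITERATURE_PAGES_MIN):.1%}")
```

Under these assumptions the NHM target works out at 4M specimens a year, and a single programme running at that rate would need roughly 375 to 750 years to cover the worldwide estimate, which is one way to read the slide's call for both ambition and coordination.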