Biodiversity informatics: why aren't we there yet?

Talk given at CISA 2013, Barcelona, 26 September 2013

  1. Biodiversity informatics: why aren’t we there yet? @rdmpage http://iphylo.blogspot.com
  2. I’ve often said I want a Google for biodiversity data…
  3. …turns out what I should have asked for was an NSA for biodiversity
  4. • There are known knowns, things we know that we know • There are known unknowns, things we now know we don’t know • But there are also unknown unknowns, things we do not know we don’t know
  5. [diagram labels: known, unknown, knowns, unknowns]
  6. What do these diagrams tell us?
  7. Implications • Sequencing is cheap • The flood of sequences is only going to increase • How much of this is relevant to biodiversity?
  8. Numbers of new animal names [chart labels: 1923, WWI, WWII]
  9. Implications • Rate of new taxa being described is relatively constant • Suggests taxonomists are working at capacity • Most taxonomic work is in the past • Compare this to exponential growth of sequencing
  10. Mammals in GenBank [chart labels: Proper Linnaean names, Aus sp.]
  11. Mammals [chart labels: Proper Linnaean names, Aus sp.]
  12. “Invertebrates” BOLD
  13. Dark taxa • Disconnect between taxonomy and genomics • How much of this comprises taxa we already know about versus new diversity? • Do we need taxonomic names?
  14. 100,000 articles from http://biostor.org (BHL) [chart labels: 1923, today]
  15. Scanned legacy • BHL is more than pre-1923 literature • The real gap is post-1923 to pre-open access (2003) • Most of the 20th century taxonomic literature is “dark”
  16. Size of Wikipedia articles on mammals [chart labels: few, large articles; many, small articles (“long tail”)]
  17. Power law • We know a lot about a few species • For most species we know very little (even in well-known groups) • For poorly known species we need to go to legacy literature
  18. PanTHERIA (2009) [chart labels: 1923, 2003]
  19. Legacy literature • Legacy literature matters (even for well-studied taxa) • Much of this will be in the digitally “dark” period
  20. Publishers of taxonomy (# articles) http://bionames.org
  21. Publishers • BioStor (BHL) is the single largest source of taxonomic literature • Lots of tiny publishers (long tail) • Commercial publishers are important (Magnolia Press, Springer, Informa, Wiley, Elsevier, BioOne) • Who do we talk to about data mining?
  22. Taxonomic journals (articles/decade)
  23. Implications • Zootaxa is indeed a “mega journal” • If we had to pick one journal to data mine, it is Zootaxa
  24. GBIF • The Global Biodiversity Information Facility is not evenly “global” • Tells us as much about sampling as about the distribution of diversity
  25. Flickr EOL group
  26. Crowd sourcing • Where is the “crowd”? • It’s where the iPhones are…
  27. GenBank animal sequences
  28. GenBank host records
  29. Implications • GenBank is about more than genes • GenBank has a wealth of information on location and ecological interactions
  30. Implications • Phylogenetic data is not being archived (why not?) • Makes it hard to reproduce studies • Does data matter? • What level of granularity should be citable?
  31. What do these diagrams tell us?