Biodiversity informatics: why aren’t we there yet?
@rdmpage
http://iphylo.blogspot.com
Talk given at CISA 2013, Barcelona, 26 September 2013

  • http://www.technologyreview.com/graphiti/427720/bases-to-bytes/ Moore’s law: the number of transistors doubles every two years
  • http://www.nature.com/scitable/content/growth-in-nucleotide-sequences-submitted-to-genbank-45068
  • http://www.organismnames.com/metrics.htm?page=graphs
  • http://biostor.org
  • http://iphylo.blogspot.co.uk/2013/02/does-legacy-biodiversity-literature.html
  • http://uat.gbif.org/developer/maps#preview
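The slides contrast exponential growth in sequencing with a roughly constant rate of new taxon descriptions. A minimal sketch of that arithmetic, assuming only the two-year doubling time quoted for Moore’s law in the notes above; the function names, starting values and yearly rate are hypothetical placeholders, not figures from the talk:

```python
def doubling_growth(start: float, years: float, doubling_time: float = 2.0) -> float:
    """Quantity after `years` if it doubles every `doubling_time` years."""
    return start * 2 ** (years / doubling_time)


def constant_growth(start: float, years: float, per_year: float) -> float:
    """Cumulative total after `years` when a fixed amount is added each year."""
    return start + per_year * years


if __name__ == "__main__":
    # Hypothetical starting values: both quantities start at 1 unit.
    for years in (2, 10, 20):
        exponential = doubling_growth(1.0, years)
        linear = constant_growth(1.0, years, per_year=1.0)
        print(f"{years:>2} years: doubling x{exponential:g}, constant rate x{linear:g}")
```

With a two-year doubling time a quantity grows 32-fold in ten years, while a constant yearly rate only adds ten more years' worth of output, which is the gap the talk points to between sequencing and taxonomic description.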

    1. Biodiversity informatics: why aren’t we there yet? @rdmpage http://iphylo.blogspot.com
    2. I’ve often said I want a Google for biodiversity data…
    3. …turns out what I should have asked for was an NSA for biodiversity
    4. • There are known knowns, things we know that we know
       • There are known unknowns, things we now know we don’t know
       • But there are also unknown unknowns, things we do not know we don’t know
    5. Diagram labels: known, unknown, knowns, unknowns
    6. What do these diagrams tell us?
    7. Implications
        • Sequencing is cheap
        • The flood of sequences is only going to increase
        • How much of this is relevant to biodiversity?
    8. Numbers of new animal names (chart markers: 1923, WWI, WWII)
    9. Implications
        • Rate of new taxa being described is relatively constant
        • Suggests taxonomists are working at capacity
        • Most taxonomic work is in the past
        • Compare this to exponential growth of sequencing
    10. Mammals in GenBank (chart labels: proper Linnaean names, Aus sp.)
    11. Mammals (chart labels: proper Linnaean names, Aus sp.)
    12. “Invertebrates” (chart label: BOLD)
    13. Dark taxa
        • Disconnect between taxonomy and genomics
        • How much of this comprises taxa we already know about versus new diversity?
        • Do we need taxonomic names?
    14. 100,000 articles from http://biostor.org (BHL) (chart markers: 1923, today)
    15. Scanned legacy
        • BHL is more than pre-1923 literature
        • The real gap is post-1923 to pre-open access (2003)
        • Most of the 20th century taxonomic literature is “dark”
    16. Size of Wikipedia articles on mammals (chart labels: few, large articles; many, small articles; “long tail”)
    17. Power law
        • We know a lot about a few species
        • For most species we know very little (even in well-known groups)
        • For poorly known species we need to go to the legacy literature
    18. PanTHERIA (2009) (chart markers: 1923, 2003)
    19. Legacy literature
        • Legacy literature matters (even for well-studied taxa)
        • Much of this will be in the digitally “dark” period
    20. Publishers of taxonomy (# articles) http://bionames.org
    21. Publishers
        • BioStor (BHL) is the single largest source of taxonomic literature
        • Lots of tiny publishers (long tail)
        • Commercial publishers important (Magnolia Press, Springer, Informa, Wiley, Elsevier, BioOne)
        • Who do we talk to about data mining?
    22. Taxonomic journals (articles/decade)
    23. Implications
        • Zootaxa is indeed a “mega journal”
        • If we had to pick one journal to data mine it is Zootaxa
    24. GBIF
        • The Global Biodiversity Information Facility is not evenly “global”
        • Tells us as much about sampling as distribution of diversity
    25. Flickr EOL group
    26. Crowd sourcing
        • Where is the “crowd”?
        • It’s where the iPhones are…
    27. GenBank animal sequences
    28. GenBank host records
    29. Implications
        • GenBank is about more than genes
        • GenBank has a wealth of information on location and ecological interactions
    30. Implications
        • Phylogenetic data is not being archived (why not?)
        • Makes it hard to reproduce studies
        • Does data matter?
        • What level of granularity should be citable?
    31. What do these diagrams tell us?
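
The split between “proper Linnaean names” and “Aus sp.” names on the GenBank slides could be approximated by pattern matching on the name string. A rough sketch, assuming a name counts as “dark” when it carries an open-nomenclature marker such as sp. or cf.; the regular expressions, the function name `classify_name` and the category labels are illustrative assumptions, not the method used to produce the charts in the talk:

```python
import re

# Assumption: a "proper" Linnaean name is a capitalised genus followed by a
# lowercase specific epithet, with an optional lowercase subspecies epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z][a-z-]+( [a-z][a-z-]+)?$")

# Assumption: these open-nomenclature markers flag a name as a "dark taxon".
DARK_MARKERS = re.compile(r"\b(sp|spp|cf|aff|nr)\.")


def classify_name(name: str) -> str:
    """Return 'linnaean', 'dark', or 'other' for an organism name string."""
    name = " ".join(name.split())  # collapse stray whitespace
    if DARK_MARKERS.search(name):
        return "dark"
    if BINOMIAL.match(name):
        return "linnaean"
    return "other"


if __name__ == "__main__":
    # "Aus sp. ABC-2013" is a made-up example of a provisional name.
    for example in ["Mus musculus", "Aus sp.", "Aus sp. ABC-2013"]:
        print(example, "->", classify_name(example))
```

Counting the resulting labels per year or per taxonomic group would give a crude measure of how much sequence data is attached to names outside formal taxonomy. Real name strings are messier than these patterns allow for, so the output should be treated as a first pass only.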