Biodiversity informatics: why aren't we there yet?

Roderic Page

Talk given at CISA 2013, Barcelona, 26 September 2013


1. Biodiversity informatics: why aren’t we there yet?
@rdmpage
http://iphylo.blogspot.com

2. I’ve often said I want a Google for biodiversity data…

3. …turns out what I should have asked for was an NSA for biodiversity

5.
• There are known knowns, things we know that we know
• There are known unknowns, things we now know we don’t know
• But there are also unknown unknowns, things we do not know we don’t know

6. [Diagram: known / unknown × knowns / unknowns]

8. What do these diagrams tell us?

11. Implications
• Sequencing is cheap
• The flood of sequences is only going to increase
• How much of this is relevant to biodiversity?

12. Numbers of new animal names [chart; marked: 1923, WWI, WWII]

13. Implications
• Rate of new taxa being described is relatively constant
• Suggests taxonomists are working at capacity
• Most taxonomic work is in the past
• Compare this to exponential growth of sequencing

14. Mammals in GenBank [chart; series: proper Linnaean names, “Aus sp.”]

15. Mammals [chart; series: proper Linnaean names, “Aus sp.”]

16. “Invertebrates” (BOLD)

17. Dark taxa
• Disconnect between taxonomy and genomics
• How much of this comprises taxa we already know about versus new diversity?
• Do we need taxonomic names?
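One way to put a rough number on the "proper Linnaean names" versus "Aus sp." split is to classify organism names with a simple pattern match. The following is a minimal sketch, not the method used for the slides: the regular expressions, the token list (`sp`, `spp`, `cf`, `aff`, `nr`) and the function names `is_dark_taxon` and `dark_fraction` are all assumptions made for illustration, and the names in the usage note are placeholders.

```python
import re

# Heuristic for a formal name: capitalised genus followed by one or two
# lower-case epithets (binomial or trinomial). This is an assumption, not
# a full implementation of the nomenclatural codes.
FORMAL_NAME = re.compile(r"^[A-Z][a-z]+( [a-z][a-z-]+){1,2}$")

# Tokens that usually mark an informal name, plus any digit
# (voucher codes, specimen numbers). Assumed list, not exhaustive.
INFORMAL_MARK = re.compile(r"\b(?:sp|spp|cf|aff|nr)\b|\d")


def is_dark_taxon(name: str) -> bool:
    """Return True if the name does not look like a formal Linnaean name."""
    name = name.strip()
    if INFORMAL_MARK.search(name):
        return True
    return FORMAL_NAME.match(name) is None


def dark_fraction(names) -> float:
    """Fraction of names classified as dark taxa (0.0 for an empty input)."""
    names = list(names)
    if not names:
        return 0.0
    return sum(is_dark_taxon(n) for n in names) / len(names)
```

For example, `is_dark_taxon("Mus musculus")` is false while `is_dark_taxon("Aus sp. ABC-2013")` is true. A heuristic like this will misclassify hybrids, names with authorship strings and similar edge cases, so any fraction it reports is only a first approximation.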

18. 100,000 articles from http://biostor.org (BHL) [chart; marked: 1923, today]

19. Scanned legacy
• BHL is more than pre-1923 literature
• The real gap is post-1923 to pre-open access (2003)
• Most of the 20th-century taxonomic literature is “dark”

20. Size of Wikipedia articles on mammals [chart; labels: few, large articles; many, small articles (“long tail”)]

21. Power law
• We know a lot about a few species
• For most species we know very little (even in well-known groups)
• For poorly known species we need to go to the legacy literature
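The "few large, many small" pattern can be checked on any list of sizes by ranking them and looking at the slope on log-log axes. The sketch below is illustrative only: `rank_size_slope` and `top_share` are hypothetical helper names, and a least-squares fit on log-log axes is a crude way to estimate a power-law exponent.

```python
import math


def rank_size_slope(sizes):
    """Least-squares slope of log(size) against log(rank).

    A slope near -1 indicates a Zipf-like long tail.
    """
    ranked = sorted((s for s in sizes if s > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive sizes")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def top_share(sizes, fraction=0.1):
    """Share of the total size held by the largest `fraction` of items."""
    ranked = sorted(sizes, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


# Synthetic sizes that follow size = 1000 / rank exactly.
# These are NOT the Wikipedia article sizes shown in the talk.
synthetic_sizes = [1000 / rank for rank in range(1, 101)]
```

On the synthetic data the slope comes out at about -1, and `top_share(synthetic_sizes, 0.1)` shows the top tenth of items holding more than half of the total, which is the "we know a lot about a few species" pattern in miniature.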

22. PanTHERIA (2009) [chart; marked: 1923, 2003]

23. Legacy literature
• Legacy literature matters (even for well-studied taxa)
• Much of this will be in the digitally “dark” period

24. Publishers of taxonomy (# articles)
http://bionames.org

25. Publishers
• BioStor (BHL) is the single largest source of taxonomic literature
• Lots of tiny publishers (long tail)
• Commercial publishers important (Magnolia Press, Springer, Informa, Wiley, Elsevier, BioOne)
• Who do we talk to about data mining?

26. Taxonomic journals (articles/decade)

27. Implications
• Zootaxa is indeed a “mega journal”
• If we had to pick one journal to data mine, it would be Zootaxa

29. GBIF
• The Global Biodiversity Information Facility is not evenly “global”
• Tells us as much about sampling as distribution of diversity

30. Flickr EOL group

31. Crowd sourcing
• Where is the “crowd”?
• It’s where the iPhones are…

32. GenBank animal sequences

33. GenBank host records

34. Implications
• GenBank is about more than genes
• GenBank has a wealth of information on location and ecological interactions
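Location and host information in GenBank sits in qualifiers on the "source" feature of each record, such as `/host`, `/country`, `/lat_lon` and `/isolation_source`. The sketch below pulls those qualifiers out of flat-file text. It assumes each qualifier value fits on a single line, `extract_context` is a hypothetical helper name, and the sample fragment is invented for illustration (it is not a real GenBank record).

```python
import re

# Source-feature qualifiers that carry non-sequence context.
WANTED = ("host", "country", "lat_lon", "isolation_source")

# Matches one qualifier line of the form   /name="value"
QUALIFIER = re.compile(r'^\s*/(\w+)="([^"]*)"\s*$')


def extract_context(flatfile_text):
    """Collect selected qualifiers from GenBank flat-file text.

    Assumes single-line values; only the first occurrence of each
    qualifier is kept.
    """
    found = {}
    for line in flatfile_text.splitlines():
        match = QUALIFIER.match(line)
        if match and match.group(1) in WANTED:
            found.setdefault(match.group(1), match.group(2))
    return found


# Invented fragment for illustration only; not a real GenBank record.
SAMPLE = '''\
     source          1..658
                     /organism="Aus sp."
                     /mol_type="genomic DNA"
                     /host="Bus bus"
                     /country="Exampleland"
                     /lat_lon="1.00 S 36.00 E"
'''
```

Calling `extract_context(SAMPLE)` returns the host, country and coordinates as a dictionary. Real records often wrap long values across lines, so a production parser would need to handle continuation lines or use an established library instead.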

36. Implications
• Phylogenetic data is not being archived (why not?)
• Makes it hard to reproduce studies
• Does data matter?
• What level of granularity should be citable?

37. What do these diagrams tell us?

Editor's Notes

http://www.technologyreview.com/graphiti/427720/bases-to-bytes/ Moore’s law: the number of transistors doubles every two years
http://www.nature.com/scitable/content/growth-in-nucleotide-sequences-submitted-to-genbank-45068
http://www.organismnames.com/metrics.htm?page=graphs
http://biostor.org
http://iphylo.blogspot.co.uk/2013/02/does-legacy-biodiversity-literature.html
http://uat.gbif.org/developer/maps#preview
