Text Mining Biodiversity 20160127

Mining Biodiversity Project presentation for Digging Round Three Conference, January 27-28 2016. http://diggingintodata.org/awards/2016/news/digging-round-three-conference

Published in: Science

  1. Text Mining Biodiversity. S. Ananiadou, E. Milios, W. Ulate
  2. Partners (slide footer: 4/15/2016, Mining Biodiversity)
  3. Outline: 1. Introduction; 2. Creating a Term Inventory of Biodiversity; 3. Interactive Visualization of Inventory; 4. Creating a Text Mining Infrastructure for Biodiversity; 5. Interactive Clustering of Search Engine Results; 6. OCR Error Correction; 7. Social Media Platform; 8. Impact
  4. What do we want to do? Help transform BHL into a next-generation social digital library through a multi-disciplinary approach that includes: • Text Mining • Machine learning • History of Science • Environmental History & Studies • Library and Information Science • Social Media. Diagram labels: Social Media, Visualisation, Semantic Metadata. http://miningbiodiversity.org
  5. Creating the Term Inventory: why we need it • A species name can often be expressed in multiple ways, e.g., using scientific names or vernacular names – Balaena mysticetus: bowhead whale, bowhead – Spizella passerina: chipping sparrows • Identify synonymous terms in biodiversity text • Why? To go beyond keyword-based search!
  6. Search Results Using Vernacular Names • Query: a vernacular name of “Balaena mysticetus” • Different results!
  7. Keyword-based Search: Ambiguity • Boxwood: historic place in Alabama? • North American term for plants in the Buxaceae family? • Box container? • Boxwood for other English-speaking countries?
  8. Methods: Distributional Semantics • Determine the meaning of terms and phrases by looking at the context and the meaning of individual words • [Figure: weighted context vectors for “bowhead whale”. Context words: balaena, mysticetus, alaska, seals, distribution, ringed, catch, quota, …, murray. Weights: 43.99, 39.99, 25.06, 23.92, 20.84, 19.86, 19.52, 17.91, …, 5.62.]
  9. Distributional semantics methods: ranked terms for the query “balaena mysticetus”
       Term                   Score
       balaena glacialis      0.7896
       bowhead whale          0.7392
       bowhead                0.7074
       bowhead whales         0.6999
       eubalaena glacialis    0.6905
       minke whale            0.6864
       humpback whale         0.6490
       sperm whale            0.6440
       finback whale          0.6322
       sei whale              0.6287
       eubalaena japonica     0.6065
       brydes whale           0.6052
       humpback whales        0.6000
       finback whales         0.5998
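A minimal sketch of how a ranked list like the one on slide 9 could be produced, assuming each term is represented by a sparse vector of context-word weights and that terms are compared with cosine similarity. The slides do not say which weighting scheme or similarity measure was used, so both are assumptions here, and the toy vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_terms(query, vectors, top_k=3):
    """Rank every other term by its similarity to the query term."""
    scores = [
        (term, cosine(vectors[query], vec))
        for term, vec in vectors.items()
        if term != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical context vectors (not the weights shown on the slides).
VECTORS = {
    "balaena mysticetus": {"alaska": 4.0, "seals": 2.0, "catch": 3.0, "quota": 1.0},
    "bowhead whale": {"alaska": 3.5, "seals": 1.5, "catch": 3.0, "ice": 1.0},
    "spizella passerina": {"nest": 4.0, "song": 3.0, "alaska": 0.5},
}

RANKED = nearest_terms("balaena mysticetus", VECTORS)
```

With these toy vectors, the vernacular name that shares most of its contexts with the scientific name ranks first, which is the behaviour a synonym inventory relies on.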
  10. Experiments • Training data: all English texts from the BHL (about 26 million pages, 49 GB) • Evaluation data: synonymous terms from the Catalogue of Life (CoL) • Select 500 scientific names and their synonyms from the CoL • Results at top-20
       Category   Class      #terms in CoL   #terms in BHL   Avg. synonyms in CoL
       Birds      Aves       1140            818             2.28
       Mammals    Mammalia   1131            726             2.26
       Plants     Plantae    1141            826             2.28

       Category   Pre@20    Re@20
       Birds      69.41%    63%
       Mammals    62.12%    53.84%
       Plants     56.17%    21.43%
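The slides do not define Pre@20 or Re@20. As a rough illustration only, the sketch below computes recall at k under one common reading: the share of a name's gold synonyms (here, from the CoL) that appear among the top k ranked candidates, averaged over all query names. Precision is left out because the slides give no indication of how it was computed. The function names and the example data are hypothetical.

```python
def recall_at_k(ranked, gold, k=20):
    """Fraction of gold synonyms found among the top-k ranked candidates."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def mean_recall_at_k(runs, k=20):
    """Average recall@k over (ranked, gold) pairs, one pair per query name."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Hypothetical runs: made-up candidate lists and gold synonym sets.
RUNS = [
    (["bowhead whale", "minke whale", "bowhead"], {"bowhead whale", "bowhead"}),
    (["song sparrow", "chipping sparrow"], {"chipping sparrow", "field sparrow"}),
]

MEAN_RECALL = mean_recall_at_k(RUNS, k=20)
```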
  11. Section 3: Interactive visualization of term inventory
  12. Term inventory visualization (video)
  13. Section 4: Creating a text mining infrastructure for biodiversity • Web-based, graphical text mining (TM) workbench • Straightforward integration of tools into modular, extensible, reconfigurable and reusable workflows • http://argo.nactem.ac.uk • Image source: LEGO DUPLO
  14. Annotation Workflow for Biodiversity: Pre-processing → Dictionary lookup → Machine learning-based recognition → Relation extraction → Saving
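To make the stages on slide 14 concrete, here is a small sketch of a workflow built from composable steps. It is not Argo's API: every function name, the dictionary, and the `occurs_in` relation label are hypothetical. Only pre-processing, dictionary lookup and a naive relation-extraction stand-in are shown; the machine learning-based recognition and saving steps are omitted.

```python
import re


def preprocess(doc):
    """Split raw text into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [p.strip() for p in parts if p.strip()]
    return doc


def dictionary_lookup(doc, dictionary):
    """Tag every dictionary term found in each sentence (case-insensitive)."""
    doc["entities"] = [
        [(term, label) for term, label in dictionary.items() if term in s.lower()]
        for s in doc["sentences"]
    ]
    return doc


def extract_relations(doc):
    """Naive stand-in: link a species to a location in the same sentence."""
    doc["relations"] = [
        (a, "occurs_in", b)
        for ents in doc["entities"]
        for a, label_a in ents if label_a == "species"
        for b, label_b in ents if label_b == "location"
    ]
    return doc


def run_workflow(text, dictionary):
    """Run the steps in order, passing one document dict through all of them."""
    doc = {"text": text}
    steps = (preprocess, lambda d: dictionary_lookup(d, dictionary), extract_relations)
    for step in steps:
        doc = step(doc)
    return doc


# Hypothetical dictionary and input text.
DICTIONARY = {"bowhead whale": "species", "alaska": "location"}
RESULT = run_workflow("The bowhead whale is hunted in Alaska. Quotas apply.", DICTIONARY)
```

Because each step takes and returns the same document structure, steps can be reordered, replaced or extended without touching the others, which is the property the slide attributes to modular, reconfigurable workflows.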
  15. Annotation workflows (video)
  16. Section 5: Interactive clustering of search engine results • Goal: to cluster BHL search engine results • Input dataset: output of an “OR” query over the following terms: 1. Kangaroo 2. Lion 3. Rabbit 4. Shark • Only titles of books or articles are considered in clustering • Interactive clustering based on the key terms of the titles
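The slides do not describe the clustering algorithm itself. The sketch below shows one simple way that clustering on title key terms could be made interactive: each cluster is described by a set of key terms, titles are assigned to the cluster they overlap most, and a user would edit the key-term sets and re-run the assignment. The stopword list, function names, titles and cluster definitions are all hypothetical.

```python
STOPWORDS = {"a", "an", "and", "in", "its", "of", "on", "the"}


def key_terms(title):
    """Lowercase the title and keep its non-stopword tokens as key terms."""
    tokens = (word.strip(".,:;()").lower() for word in title.split())
    return {t for t in tokens if t and t not in STOPWORDS}


def cluster_by_key_terms(titles, cluster_terms):
    """Assign each title to the cluster whose key terms it overlaps most.

    cluster_terms maps a cluster name to a set of key terms. Titles that
    share no key term with any cluster are collected under "unassigned".
    """
    clusters = {name: [] for name in cluster_terms}
    clusters["unassigned"] = []
    for title in titles:
        terms = key_terms(title)
        best = max(
            cluster_terms,
            key=lambda name: len(terms & cluster_terms[name]),
            default=None,
        )
        if best is not None and terms & cluster_terms[best]:
            clusters[best].append(title)
        else:
            clusters["unassigned"].append(title)
    return clusters


# Made-up titles and cluster definitions for illustration.
TITLES = [
    "The kangaroo and its habitat",
    "Sharks of the Pacific",
    "Notes on the lion",
    "A monograph of ferns",
]
CLUSTER_TERMS = {
    "mammals": {"kangaroo", "lion", "rabbit"},
    "fish": {"shark", "sharks"},
}
CLUSTERS = cluster_by_key_terms(TITLES, CLUSTER_TERMS)
```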
  17. Interactive clustering (video)
  18. Section 6: OCR error correction • Correct errors in natural language texts • Spelling errors (e.g. “the” rendered as “teh”) • Grammar errors (e.g. “this is” rendered as “this are”) • Outline
  19. OCR error correction • Input – Document – Component selection (select components to use for processing) • Correction candidates – A list of candidates, each with a confidence score, for each error • Component structure
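As an illustration of the "correction candidates with confidence" output described on slide 19, the sketch below flags tokens that are missing from a vocabulary and proposes close matches using Python's standard `difflib` module. This is a stand-in technique, not the project's correction components: the vocabulary, the cutoff, and the use of a string-similarity ratio as the confidence score are all assumptions.

```python
import difflib


def correction_candidates(token, vocabulary, max_candidates=3, cutoff=0.6):
    """Return (candidate, confidence) pairs for a token missing from the vocabulary."""
    word = token.lower()
    if word in vocabulary:
        return []
    matches = difflib.get_close_matches(word, vocabulary, n=max_candidates, cutoff=cutoff)
    # Confidence here is simply the similarity ratio between token and candidate.
    return [(m, difflib.SequenceMatcher(None, word, m).ratio()) for m in matches]


def correct_document(text, vocabulary):
    """Map each suspicious token in the text to its list of correction candidates."""
    suggestions = {}
    for token in text.split():
        candidates = correction_candidates(token, vocabulary)
        if candidates:
            suggestions[token] = candidates
    return suggestions


# Hypothetical vocabulary and input text.
VOCABULARY = ["bowhead", "the", "whale"]
SUGGESTIONS = correct_document("teh bowhead whale", VOCABULARY)
```

Tokens already in the vocabulary produce no entry, so the result lists only the suspected errors, each with its ranked candidates.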
  20. OCR error correction (video)
  21. Section 7: Social media platform
  22. Making Biodiversity Digital Objects More Social and Shareable • Follow us on Twitter: @SMLabTO
  23. “My Tweeps” app (mytweeps.com) • Helping BHL (and other organizations) get daily insights about their Twitter followers (or Tweeps) and what they are interested in. We call it a "reverse" Twitter because, instead of seeing tweets from people whom you follow, the app shows you tweets from people who follow you.
  24. We also partnered with Altmetric to better understand who shares BHL content across various social media platforms, and why.
  25. My Tweeps (video)
  26. Section 8: Impact
  27. Enhanced Searching of BHL Content • Faceted search • Automatically generated questions • Time-sensitive search
  28. Enhanced Document Viewing • Page in PDF/image format • OCR-corrected text with colour-coded annotations
  29. The Team • NaCTeM • Ryerson • Dalhousie • Missouri Botanical Garden • Smithsonian Libraries (contract)
  30. Thanks to the sponsors.
