Shortcuts for fast-forwarding VLC videos: http://www.shortcutworld.com/en/win/VLC-Media-Player.html Before starting, go to display settings and make the projector screen the main screen, so that videos pop up there and not on the laptop screen.
BHL is the data source. IMLS is the funding agency. Missouri Botanical Garden is the US partner. Smithsonian Libraries is a contractor (not sure if we should include it).
Sophia Sophia Sophia Sophia Evangelos Evangelos William (Anatoliy’s video has voice, so it is self-explanatory) William
The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” The BHL consortium works with the international taxonomic community, rights holders, and other interested parties to ensure that this biodiversity heritage is made available to a global audience through open access principles. In partnership with the Internet Archive and through local digitization efforts, the BHL has digitized more than 48 million pages of taxonomic literature, representing over 100,000 titles and over 170,000 volumes.
MiBIO will integrate TM tools within an interoperable platform to provide a semantic search system for the BHL, enhanced through clustering and visualisation capabilities. MiBIO will also provide a social media environment, which will enable BHL users to discuss, link and share digital artifacts posted to social media sites linked to the BHL search portal. The outcome will be the transformation of the BHL from a Digital Library (DL) into a Social Digital Library (SDL). This will be achieved through the enrichment of its historical digital archives with semantic metadata generated by TM. Furthermore, by leveraging existing social media sites and providing facilities for their integration with the BHL, we will engage a community of users to exploit the BHL as a forum for the exchange of ideas. In a nutshell, we have incorporated into BHL three elements, as part of the Mining Biodiversity project: Visualisation, Social Media and Semantic Metadata.
Such variants degrade the performance of a keyword-based search engine and, moreover, cause difficulties for non-expert users (users who are not familiar with scientific names). To alleviate the problem of searching for variants, we have compiled a terminological inventory containing semantic variants of biodiversity terms (e.g., mammals, birds, plants) using distributional semantic methods:
• Learn the representation vector of each term
• Calculate the cosine similarity between two terms
• Extract the top-20 synonym candidates
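The cosine similarity and top-20 extraction steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sparse dict representation, the function names `cosine` and `top_k_candidates`, and the toy vectors are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_candidates(term, vectors, k=20):
    """Rank every other term by cosine similarity to `term`; keep the top k."""
    target = vectors[term]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != term]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, invented for illustration only (not taken from the BHL corpus).
toy_vectors = {
    "balaena mysticetus": {"alaska": 3.0, "catch": 2.0, "quota": 1.0},
    "bowhead whale": {"alaska": 2.0, "catch": 2.0, "seals": 1.0},
    "spizella passerina": {"nest": 3.0, "song": 2.0},
}
ranked = top_k_candidates("balaena mysticetus", toy_vectors, k=2)
```

In this toy setting, terms that share context words with the query rank first, while terms with no shared context get a score of zero.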
And here is the search result when we use a common name for the previous term; it consists of only one document related to “bowhead whale”. Evidently, the search engine returns a different result from the previous one …
Another problem with keyword-based search, as mentioned above, is ambiguity. If someone searches for “Boxwood”, a keyword-based system cannot know whether they mean a place in Alabama or the North American term for plants in the Buxaceae family. It will simply return all documents pertaining to both. Nor will it know whether the query “Box” refers to the same plant family (apparently this is what other English-speaking countries call it) or to a container.
We then implemented two distributional semantic models. The first one is a count-based model that determines the …
For example, within a 7-word window, this is the context vector of “bowhead whale” -- SA rubbish frequency
In this manner, for each name, we generate a list of names ranked by similarity. For “balaena mysticetus”, for example, we obtained the following list. Determine the meaning of a term by considering all lexical units occurring within an N-word window.
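A count-based context vector over an N-word window could be built along these lines. This is only a sketch under stated assumptions: it uses raw co-occurrence counts (the weighting scheme behind the scores on the slides is not given here), it takes the window as N tokens on each side of the term, and the function name `context_vector` and the sample sentence are invented for illustration.

```python
from collections import Counter

def context_vector(tokens, term_tokens, window=7):
    """Count words occurring within `window` tokens of each mention of the term."""
    n = len(term_tokens)
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term_tokens:
            left = tokens[max(0, i - window):i]     # words before the mention
            right = tokens[i + n:i + n + window]    # words after the mention
            counts.update(left + right)
    return counts

# Invented sample text, for illustration only.
text = ("the bowhead whale is hunted in alaska under a catch quota "
        "and the bowhead whale feeds near alaska")
vec = context_vector(text.split(), ["bowhead", "whale"], window=3)
```

Vectors built this way for two terms can then be compared with cosine similarity to produce the ranked list of names.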
We have conducted our experiments on the Biodiversity Heritage Library (BHL) corpus. The corpus size is about 49 GB.
We have created a gold-standard dataset of synonymous terms based on the Catalogue of Life (CoL). For each scientific name, we extracted the corresponding common names and synonyms. We then randomly picked 500 species of the class Aves. As a result, we obtained about 1,100 bird names (both vernacular and scientific), of which about 800 occur in the BHL corpus. According to the CoL, the average number of synonyms per scientific name is about 2. We followed the same process for mammal and plant names.
The precision and recall scores at top-20 are as follows. Among the three categories, performance is best for bird names. The lower performance for plant names can be explained by the fact that, unlike for mammals and birds, most synonyms of plant names are themselves scientific names, which are more difficult to detect than vernacular names.
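For reference, precision and recall at a cut-off k can be computed per query term as sketched below. The notes do not say exactly how Pre@20 and Re@20 were computed, so this uses the textbook per-query definitions, which may differ from the variant behind the reported scores; the function name and the toy data are assumptions for the example.

```python
def precision_recall_at_k(ranked, gold, k=20):
    """Precision and recall of the top-k ranked candidates against a gold synonym set."""
    top = ranked[:k]
    hits = sum(1 for candidate in top if candidate in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Toy ranking and gold set, invented for illustration only.
ranked_names = ["bowhead whale", "ringed seal", "bowhead", "narwhal"]
gold_synonyms = {"bowhead whale", "bowhead", "greenland whale"}
precision, recall = precision_recall_at_k(ranked_names, gold_synonyms, k=4)
```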
- Frequency of species names can be visually explored, or queried through a search interface
- Clicking on a species name acts as a query to retrieve its top-20 semantically related species
-- Their semantic relatedness scores can be inspected
-- A blue color denotes that the species name appears as a synonym in the CoL
- Interactive visualizations were constructed for mammals, plants and birds
[and in case somebody asks:]
- Images, which were crawled from external open sources, may help to visually assess species' relatedness based on their visible features.
• Species names are shown in bubbles
• Larger bubbles denote species more frequently mentioned in the biodiversity literature
• Upon interaction, (semantically) related species can be inspected
• Color opacity indicates degree of relatedness
• Blue color indicates that species also appear as synonyms in CoL
• Images are retrieved from open data collections (e.g. Wikipedia)
• Web-based application: no installation; access with a web browser
• Multi-user system: remote collaborative annotation
• Supports the Unstructured Information Management Architecture (UIMA), cloud and high-performance computing
This is the workflow that we put together using Argo. Without going too much into detail, I will just point out the general types of processing it tries to do: pre-processing (sentence splitting, tokenisation and part-of-speech tagging), matching against dictionaries or controlled vocabularies such as the ENVO and PATO ontologies, machine learning-based recognition of entities, extraction of relations based on the results of dependency parsing, and serialisation of the generated annotations.
2. Creating a Term Inventory of Biodiversity
3. Interactive Visualization of Inventory
4. Creating a Text Mining Infrastructure for Biodiversity
5. Interactive Clustering of Search Engine results
6. OCR Error correction
7. Social media platform
What do we want to do?
4/15/2016 Mining Biodiversity
Help transform BHL into a next-generation social digital
library through a multi-disciplinary approach that includes:
• Text Mining
• Machine learning
• History of Science
• Environmental History & Studies
• Library and Information Science
• Social Media
Creating the Term Inventory: why we need it
• A species name can usually be expressed in multiple ways, e.g., using
scientific names or vernacular names
– Balaena mysticetus: bowhead whale, bowhead
– Spizella passerina: chipping sparrows
• Identify synonymous terms in biodiversity text
• Why? To go beyond keyword-based search!
Search Results Using Vernacular Names
Vernacular name of “Balaena
Keyword-based Search: Ambiguity
historic place in
North American term for plants in the
Boxwood for other English-speaking
Methods: Distributional Semantics
• Determine the meaning of terms and phrases by looking at the context
and the meaning of individual words
[Figure: context vectors, shown as context words with their weights]
balaena 43.99, mysticetus 39.99, alaska 25.06, seals 23.92, distribu 20.84, ringed 19.86, catch 19.52, quota 17.91, …, murray 5.62
balaena 43.99, alaska 25.06, catch 19.52, quota 17.91, …
mysticetus 39.99, seals 23.92, distribut 20.84, ringed 19.52, …, murray 5.62
• Training data: all English texts from the BHL
• about 26 million pages with a size of 49GB
• Evaluation data: synonymous terms from the Catalogue of Life
• Select 500 scientific names and their synonyms from the CoL
• Results at top-20
Category  Class     #terms in CoL  #terms in BHL  Avg. #synonyms
Birds     Aves      1140           818            2.28
Mammals   Mammalia  1131           726            2.26
Plants    Plantae   1141           826            2.28
Category  Pre@20  Re@20
Birds     69.41%  63%
Mammals   62.12%  53.84%
Plants    56.17%  21.43%
3. Interactive visualization of term inventory
5. Interactive clustering of search engine results
• Goal: to cluster BHL search engine results
• Input dataset: output of an “Or” query based on the following terms:
• Only titles of books or articles are considered in clustering
• Interactive clustering based on the keyterms of the titles
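As a rough illustration of grouping titles by their key terms, consider the sketch below. The slides do not name the clustering algorithm, so this simply groups titles that share a non-stopword term; the stopword list, the function names and the sample titles are all assumptions for the example.

```python
from collections import defaultdict

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

def key_terms(title):
    """Lower-cased words of a title, minus stopwords."""
    return {word for word in title.lower().split() if word not in STOPWORDS}

def cluster_by_key_term(titles, min_size=2):
    """Group titles under each key term shared by at least `min_size` titles."""
    clusters = defaultdict(list)
    for title in titles:
        for term in sorted(key_terms(title)):
            clusters[term].append(title)
    return {term: group for term, group in clusters.items()
            if len(group) >= min_size}

# Invented titles, for illustration only.
titles = ["Birds of Alaska", "The whales of Alaska", "Notes on whales"]
clusters = cluster_by_key_term(titles)
```

A title can fall into more than one group here, which suits an interactive setting where the user picks the key term to explore.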
6. OCR error correction
• Correct errors in natural language texts
• Spelling errors (e.g. “teh” for “the”)
• Grammar errors (e.g. “this are” for “this is”)
OCR error correction
• Component selection (select components to use for processing)
• Correction candidates
• A list of candidates with confidence for each error
• Component structure
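The idea of a candidate list with a confidence for each error can be sketched with Python's standard `difflib` module. This is an assumption-laden illustration, not the project's component: it uses a plain string-similarity ratio as the confidence, and the vocabulary, the function name `correction_candidates` and the cut-off value are invented for the example.

```python
import difflib

def correction_candidates(word, vocabulary, n=3, cutoff=0.6):
    """Return up to n (candidate, confidence) pairs for a suspect word.

    Confidence here is difflib's similarity ratio in [0, 1], an assumed
    stand-in for whatever score a real correction component would emit.
    """
    matches = difflib.get_close_matches(word, vocabulary, n=n, cutoff=cutoff)
    return [(match, difflib.SequenceMatcher(None, word, match).ratio())
            for match in matches]

# Tiny invented vocabulary, for illustration only.
vocabulary = ["the", "then", "whale", "while"]
candidates = correction_candidates("teh", vocabulary)
```

Candidates come back ordered from most to least similar, so the first entry is the preferred correction under this simple score.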
Digital Objects More
Social and Shareable
Follow us on Twitter: @SMLabTO
“My Tweeps” app
Helping BHL (and other organizations)
to get daily insights about their Twitter
followers (or Tweeps) and what they
are interested in.
We call it a "reverse" Twitter because
instead of seeing tweets from people
whom you follow, the app shows you
tweets from people who follow you.
We also partnered with Altmetric to better understand who shares BHL content
across various social media platforms, and why