
BHL Markup Efforts and Plans

Presentation about past BHL Markup Efforts and present Plans for the pro-iBiosphere Markup Workshop.

Published in: Education, Technology

  1. pro-iBiosphere Markup Workshop: Efforts and plans towards Markup of the BHL Content. William Ulate R., BHL Technical Director, Missouri Botanical Garden. Berlin, Feb. 10, 2014
  2. BHL Mission and Vision
  3. More Online Content. [Chart: Pages (millions) and Volumes (thousands) included in BHL, Oct-08 through Oct-13]
  4. Subjects
  5. New Types of Content
  6. New Types of Content
  7. Scientific Name Extraction
      • TaxonFinder algorithm in production since 2008
        – More than 100 million candidate name strings
        – More than 1.5 million unique, verified names
        – Available through UI, APIs, Data Exports & Internet Archive
      • New collaboration with Global Names project
        – Improved algorithm, better precision & recall
        – More data with TaxonFinder and Neti Neti!
        – http://gnrd.globalnames.org/
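The distinction on this slide between candidate name strings and verified names can be illustrated with a toy two-step sketch. This is not the TaxonFinder or Neti Neti algorithm: the regular expression, the `VERIFIED` set, and both function names are assumptions made up for illustration.

```python
import re

# Hypothetical pattern: a capitalized word followed by a lowercase word,
# i.e. something shaped like "Genus species". It over-matches on purpose.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

# Hypothetical stand-in for a names index used for verification.
VERIFIED = {"Amandava formosa", "Picea abies"}


def candidate_names(text):
    """Return every string in the text that is shaped like a binomial."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(text)]


def verified_names(text, known=VERIFIED):
    """Keep only the candidates that appear in the set of known names."""
    return [name for name in candidate_names(text) if name in known]


if __name__ == "__main__":
    page = "The finch Amandava formosa rests near Picea abies."
    print(candidate_names(page))
    print(verified_names(page))
```

The pattern also picks up ordinary phrases such as "The finch", which is why a verification step against a list of known names is needed before a candidate counts as a name.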
  8. Taxon Names

      BEFORE
      Name Instances   Unique Names   Verified Names   EOL Names    EOL Pages
      101,591,803      7,498,554      1,905,507        63,130,350   13,579,868
      101,288,804      7,464,924      1,902,803        62,963,582   13,532,684

      AFTER
      Name Instances   Unique Names   Verified Names   EOL Names    EOL Pages
      151,222,182      29,246,382     10,153,165       87,791,695   15,466,713
      150,066,425      29,091,767     10,109,540       87,135,089   15,342,867
  9. Article-level metadata, Chapter-level metadata, Treatment-level metadata, Part-level metadata
  10. Articles in the BHL UI
  11. See also:
  12. Related Titles
  13. Global Replication & Serving Replicated Data Center Portal Application
  14. BHL-Europe Term Expansion
  15. Taxonomic Literature II (TL-2)
  16. BioStor articles marked up with JATS
  17. Art of Life
  18. Art of Life
  19. Art of Life
  20. Art of Life
  21. Macaw https://github.com/cajunjoel/macaw-book-metadata-tool
  22. Reviewing Metadata
  23. Reviewing Metadata
  24. Manually built: 1,693 sets, 87,879 images
  25. The Art of Life schema: describing and providing access to natural history illustrations from the Biodiversity Heritage Library (BHL), by William Ulate, Trish Rose-Sandler, Gaurav Vaidya, Robert Guralnick

      Example of illustration described using Art of Life schema:
      • Title: Stictospiza formosa
      • Type: Illustrations
      • Date: Publication: 1898
      • Agent: Author: Arthur G. Butler (1844-1925); Illustrator: F.W. Frohawk (1861-1946)
      • Description: A pair of finches with green and yellow bodies resting on reeds
      • Subjects: Scientific name: Amandava formosa (Latham, 1790); Vernacular Name: Green Avadavat or Green Munia; Accepted Name: Amandava formosa (Latham, 1790); Birds, finches
      • Inscriptions: bottom center: Green Amaduvade Waxbill (Stictospiza formosa)
      • Source: Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited,1889 (2nd edition). This image comes from the Biodiversity Heritage Library, and is available online at biodiversitylibrary.org/page/17195895
      • Rights: Public domain

      Art of Life schema elements (required in red). Columns: Element, Definition, Examples, Repeat.
      • Agents. Definition: person or corporate entity involved in the creation, design, production, or publication of a visual resource. Repeat: Y. Example:
          <vra:agent>
            <vra:name type="personal" vocab="LCNAF" refid="89015596">Curtis, John</vra:name>
            <vra:dates type="life">
              <vra:earliestDate>1791</vra:earliestDate>
              <vra:latestDate>1862</vra:latestDate>
            </vra:dates>
            <vra:role vocab="AAT" refid="300025574">publisher</vra:role>
          </vra:agent>
      • Copyright. Definition: The copyright status of the visual resource. Repeat: N. Example:
          <vra:rights refid="http://creativecommons.org/licenses/by-nc/2.0/deed.en">Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0)</vra:rights>
      • Date. Definition: Date or range of dates associated with the creation or publication of the visual resource. Repeat: Y. Example:
          <vra:date type="creation">
            <vra:earliestDate>1945</vra:earliestDate>
            <vra:latestDate>1955</vra:latestDate>
          </vra:date>
      • Description. Definition: A free-text note about content of the image, including comments, description, or interpretation, that gives additional information not recorded in other categories. Repeat: Y. Example:
          <vra:description>This illustration shows a scale, coloured illustration of Sepsis annulipes (now known as Encita annulipes) beside the Trifolium ochroleucum plant. Several dissections from Sepsis cylindrica Fab. (all these details are provided on the next page of this book and the subsequent page).</vra:description>
      • Inscriptions. Definition: All marks, caption, or written words added to the object at the time of production or in its subsequent history, including signatures, dates, dedications, texts, and colophons, as well as marks, such as the stamps of silversmiths, publishers, or printers. Repeat: Y. Example:
          <vra:inscription>
            <vra:position>bottom</vra:position>
            <vra:text>Radula of L. souleyetianum on a more reduced scale</vra:text>
          </vra:inscription>
      • Source. Definition: A citation for the book, journal or resource that hosts the visual resource. Repeat: N. Example:
          <vra:source>
            <vra:name type="book">Butler, Arthur Gardiner. Foreign finches in captivity. Hull and London: Brumby and Clarke, limited,1889 (2nd edition).</vra:name>
            <vra:refid type="URI">http://biodiversitylibrary.org/page/17195895</vra:refid>
          </vra:source>
      • Subject. Definition: Terms or phrases that describe, identify, or interpret the visual resource. Repeat: Y. Examples:
          <vra:subject><vra:term type="personalName">Carl Linnaeus</vra:term></vra:subject>
          <dwc:scientificName>Plant: Picea abies</dwc:scientificName>
          <dwc:acceptedName>Plant: Picea abies</dwc:acceptedName>
          <dwc:vernacularName>Plant: Norway spruce</dwc:vernacularName>
      • Title. Definition: The title or identifying phrase given to an Image. Repeat: Y. Examples:
          <vra:title xml:lang="la">Sepsis annulipes</vra:title>
          <vra:title type="alternate">Orangutan</vra:title>

      We welcome your feedback on the schema! http://tinyurl.com/9hm7nsb
  26. [Slide content is an image; its extracted text is unreadable]
  27. OCR Improvements
      • Gaming
      • Transcription
  28. OCR Improvements
      • Transcription
      • Purposeful Gaming
      • Looking at…
        – Crowdsource Markup
  29. Purposeful Gaming: DIGITALKOOT
      • Joint project run by the National Library of Finland and Microtask to index the library's enormous archives so that they are searchable on the Internet, giving easier access to the Finnish cultural heritage.
  30. Purposeful Gaming: DIGITALKOOT
      • Launched on Feb 8, 2011; by Nov 29, 2012, nearly 110,000 participants had completed over 8 million word-fixing tasks.
      • DigiTalkoot enabled volunteers to participate in this fixing work by playing games.
  31. Purposeful gaming and BHL: engaging the public in improving and enhancing access to digital texts
      • IMLS Grant Program: National Leadership Grants for Libraries
      • Partners:
        – Missouri Botanical Garden
        – Harvard University
        – Cornell University
        – New York Botanical Garden
      • P.I.: Trish Rose-Sandler, Missouri Botanical Garden
      • Dates: Dec. 2013 – Nov. 2015
  32. Project objectives and benefits
      • Test new means of crowdsourcing to support the enhancement of content in BHL
      • Demonstrate whether digital games are an effective tool for analyzing and improving digital outputs from OCR and transcription
      • Benefits of gaming include:
        – improved access to content by providing richer and more accurate data;
        – an extension of limited staff resources; and
        – exposure of library content to communities who may not know about the collections otherwise.
  33. OCR Improvements: German text interpreted by the OCR process as: “unb auf ben ©elnrgen be6 fublic{)en”
  34. OCR Improvements

         IA OCR            OCR 2          Transcription 1   Transcription 2
      1  unb               und            und               und               Ok
      2  den               ben            den               den               Ok
      3  ©elnrgen          ©ebirgen       Bebirgen          Gebirgen          X
      4  be6               des            de5               des               Chk
      5  fublic{)en        fublichen      Füdlichen         Südlichen         X
      6  £)eittfc{)(anb6   Deutfchlanbs   Deutfchlands      Deutschlands      X

      Different resulting texts from parsing the phrase: “und auf den Gebirgen des südlichen Deutschlands” (“and on the mountains of southern Germany”)
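The Ok / Chk / X marks in the table above can be reproduced with a simple agreement count across the four readings of each word. The thresholds below (three or more matching readings gives Ok, exactly two gives Chk, none gives X) are an assumption inferred from the six rows shown, not a rule stated on the slide, and the function name is hypothetical.

```python
from collections import Counter

# The four readings of each word, copied from the table:
# (IA OCR, OCR 2, Transcription 1, Transcription 2)
ROWS = [
    ("unb", "und", "und", "und"),
    ("den", "ben", "den", "den"),
    ("©elnrgen", "©ebirgen", "Bebirgen", "Gebirgen"),
    ("be6", "des", "de5", "des"),
    ("fublic{)en", "fublichen", "Füdlichen", "Südlichen"),
    ("£)eittfc{)(anb6", "Deutfchlanbs", "Deutfchlands", "Deutschlands"),
]


def agreement_status(readings):
    """Classify a word by how many of its readings agree (assumed thresholds)."""
    # Size of the largest group of identical readings.
    top_count = Counter(readings).most_common(1)[0][1]
    if top_count >= 3:
        return "Ok"   # a clear majority agrees
    if top_count == 2:
        return "Chk"  # only two agree: worth a human check
    return "X"        # no two readings agree


if __name__ == "__main__":
    for number, readings in enumerate(ROWS, start=1):
        print(number, agreement_status(readings))
```

Under this rule, only the words marked Chk or X would need to go to players or transcribers for another look.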
  35. Purposeful Gaming
  36. iDigBio’s aOCR Hackathon
      • Improve OCR parsing of labels with clear metrics (datasets, output formats, scoring algorithm)
      • Libraries of regular expressions to clean up each field (latitude/longitude coordinates need different error correction than personal names or herbarium catalog numbers)
      • Tool for classifying segments of the image before submitting to OCR
      • Do a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR
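A field-specific clean-up of the kind described on this slide might look like the sketch below for a decimal-degree coordinate. The accepted format, the table of look-alike characters, and the function name are all assumptions for illustration; none of them come from the hackathon's actual regular-expression libraries.

```python
import re

# Assumed OCR confusions that are safe to undo inside a numeric field only.
DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# Assumed field shape: up to three "digits", a decimal mark, more "digits",
# then a hemisphere letter, e.g. "38.615 N".
COORDINATE = re.compile(
    r"\s*([0-9OolISB]{1,3})[.,]([0-9OolISB]+)\s*([NSEW])\s*"
)


def clean_coordinate(raw):
    """Return a normalized coordinate string, or None if the field does not fit."""
    match = COORDINATE.fullmatch(raw)
    if match is None:
        return None
    whole, fraction, hemisphere = match.groups()
    # Look-alike letters are converted only in the numeric parts,
    # so the hemisphere letter (which may be "S") is left alone.
    return (
        f"{whole.translate(DIGIT_FIXES)}."
        f"{fraction.translate(DIGIT_FIXES)} {hemisphere}"
    )


if __name__ == "__main__":
    print(clean_coordinate("3B.6l5 N"))
    print(clean_coordinate("Butler"))
```

The same substitutions would damage a personal name (turning "Butler" into "8ut1er"), which is the reason for keeping a separate set of corrections per field.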
  37. iDigBio’s CITScribe Hackathon
      1. Interoperability between public participation tools and biodiversity data systems
      2. Transcription quality assessment/quality control (QA/QC) and the reconciliation of replicate transcriptions
      3. Integration of optical character recognition (OCR) into the transcription workflow
      4. User engagement
  38. NfN & iDigBio’s CITScribe Hackathon
      • Jason Best’s DarwinScore
      • Ben Brumfield’s Handwriting Gibberish Detector
      • Dictionaries to improve crowdsourcing consensus (e.g., names of collectors, scientific names)
      • Word Clouds created using n-gram scoring, faceting, and Solr for indexing + Carrot2 for specimen selection (visualize and explore the use of a word of interest from the word cloud) and a data cleaning step (highlight words that the system finds to be infrequent).
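One way a dictionary could help crowdsourced transcriptions converge is to snap near-miss spellings onto a known term before comparing them. The sketch below uses Python's standard `difflib` for the fuzzy match; the collector list, the 0.8 cutoff, and the function name are assumptions for illustration, not tools from the hackathon.

```python
import difflib

# Hypothetical dictionary of collector names.
COLLECTORS = ["Butler", "Frohawk", "Curtis", "Linnaeus"]


def snap_to_dictionary(word, vocabulary, cutoff=0.8):
    """Return the closest dictionary entry, or the word unchanged if none is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


if __name__ == "__main__":
    # Two volunteers typed slightly different spellings of the same name.
    readings = ["Linnreus", "Linnaeus"]
    print([snap_to_dictionary(r, COLLECTORS) for r in readings])
```

After snapping, both readings compare equal, so a consensus check would no longer flag them as a disagreement. A cutoff set too low would merge names that are genuinely different, so it would need tuning against real data.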
  39. NESCent EOL-BHL Research Sprint
      There is no place like home: Defining “habitat” for biodiversity science
      Robert D. Stevenson (UMass Boston, Dept. of Biology, 100 Morrissey Blvd., Boston, MA 02125-3393), Carl Nordman (Natureserve), and Evangelos Pafilis (Hellenic Centre for Marine Research, P.O. Box 2214, Heraklion, 71003, Crete, Greece)
  40. NESCent EOL-BHL Research Sprint
      Assessing Risk Status of Mexican Amphibians Through Data Mining
      Esther Quintero and Bárbara Ayala (National Commission for Knowledge and Use of Biodiversity, CONABIO) and Anne Thessen (Marine Biological Laboratory and Arizona State University)
  41. NESCent EOL-BHL Research Sprint
      Evolution in the usage of anatomical concepts in the biodiversity literature
      Todd Vision (tjv@bio.unc.edu), Prashanti Manda (manda.prashanti@gmail.com), and Dongye Meng (University of North Carolina at Chapel Hill)
  42. MiBIO: Mining Biodiversity
      • Mining Biodiversity: Enriching Biodiversity Heritage with Text Mining and Social Media
      • One of the international projects that won in the third round of the 2013 Digging Into Data Challenge
      • Promote the development of innovative computational techniques to apply to big data in the humanities and social sciences
        – The National Centre for Text Mining (UK)
        – Missouri Botanical Garden (US)
        – Dalhousie University's Big Data Analytics Institute (Canada)
        – Social Media Lab (Canada)
  43. MiBIO: Mining Biodiversity
      1. Automatic correction of OCR text errors.
      2. Crowdsource annotation of legacy texts with semantic metadata.
      3. Adapt text mining techniques to extract terminology, entities and significant events automatically and to track terminology evolution over time.
      4. Use interactive visualization techniques to help users manage search results through next generation browsing capabilities, assisted by a semantic similarity network of important terms and entities.
      5. Design a social media layer, serving as an environment for diverse users to interact and collaborate on science, public education, awareness and outreach.
  44. MiBIO: Mining Biodiversity
  45. Crowdsource Markup

      Display text                                  Species Profile Model category
      General/summary                               TaxonBiology
      Geographic range                              Distribution
      Habitat                                       Habitat
      Food sources and feeding behavior             TrophicStrategy
      Physical description (general)                Description
      Physical description (detailed morphology)    DiagnosticDescription
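In a markup tool, the pairing on this slide between the label shown to volunteers and the Species Profile Model category stored with the annotation could be held as a plain lookup table. The label and category pairs below are copied from the slide; the function name and the case-insensitive matching are assumptions for illustration.

```python
# Display text shown to the volunteer -> Species Profile Model category.
SPM_CATEGORY = {
    "General/summary": "TaxonBiology",
    "Geographic range": "Distribution",
    "Habitat": "Habitat",
    "Food sources and feeding behavior": "TrophicStrategy",
    "Physical description (general)": "Description",
    "Physical description (detailed morphology)": "DiagnosticDescription",
}


def spm_category(display_text):
    """Return the category for a display label, or None if it is not listed."""
    wanted = display_text.strip().casefold()
    for label, category in SPM_CATEGORY.items():
        if label.casefold() == wanted:
            return category
    return None


if __name__ == "__main__":
    print(spm_category("Geographic range"))
```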
  46. Thank you. William Ulate, Global BHL Project Manager / Technical Director, Missouri Botanical Garden. william.ulate@mobot.org, Skype: william_ulate_r
