Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing


22 Feb 2011. BioSystematics Berlin 2011.


Transcript

  • 1. Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing
    Chris Freeland
    Technical Director,
    Biodiversity Heritage Library
    BioSystematics Berlin 2011
    22 Feb 2011
    http://biodiversitylibrary.org/page/33061402
  • 2. Digitization
    http://biodiversitylibrary.org/page/6165462
  • 3. Workflow
    Conservation
    Digitization
    Selection
    Preparation
    Post Production
    (Re)publication
  • 4. Scanning Derivatives
    Files are stored & sync’d across BHL clusters
    Master
    Derivatives
    XML
    JP2
    PDF
    JPG
    TXT
    DjVu
    Storage
    PDF
    OCR
    JP2
    XML
  • 5. Optical Character Recognition (OCR)
    http://biodiversitylibrary.org/page/2836705
  • 6. OCR is a *BIG* challenge
    All book / literature digitization projects affected, not just BHL
    Especially problematic in BHL
    More than 50 languages represented in BHL
    Dates of publication from the 1400s to the 2000s
    Irregular typeface / typesetting
    Multiple languages on one page
    Botanical descriptions in Latin
  • 7. Abbildungenund Beschreibungen
    der
    FischeSyriens,
    nebst
    einerneuen Classification und Characteristik
    sämmtlicherGattungen
    der
    i
    JOH. JAKOB HECKEL,
    Inipectoiam k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.
    STUTTGART.
    E. Schweizerbart' seheVerlagshandlung,
    1843.
  • 8. *E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cixbIa� S &3rn~ 41X a�mcv(f b1air�'o�et ertoiensr�; �', :�hlrfc�cwa ff�4am.diug bist a
    6aiw~s ff oJrJtwtnof bL4ecImt& blfaframembt wag `wr 4 cnwiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tifvrmrWaff C * t6rmnli an `tn�ciblatGteaMw ?ffoaifrn w4wmeu nu weibe , wpiteI voE5teiri ct cobergtUcr cit cm` 91 cLibiar J ' >bSciatl�Oiff ;Bruetwacfttcnqmcx b1a bl: bt5c lttmtt bb9 lkrw.llr#eitincnxoa ff cu :rtrtuft *et� B Rn "�trv W1Rt' ?Cm cblaswaIwutrOber�citi 1V Ces ' wt gbtiemwwajfutpctt, afferain 9 c: b�titbfof�rferanmrs bra wlg auig4;f aer�m *mc vrtblatcabtfmwfruan'deg~mrtblasIaumbwWt� run fncmai b14ianf tJobrrfan ebrut4net vnberBrwtOberawawi*m.crriiibtafwfmuwwc on$ 'it ttuwttkc 5,10 $ m~Cfcatrc* cxu W�e�&mcyfbq4 Mabttmmwrc a iiubcJcnncI.end.*, blat s. a u:�rprd3 rw4ftf wm c ii,+ ttCCtnwa frr9fr orfabfcfbtenbcoptitibt -r9 ceDattDcn i34M snSemi
  • 9. 2007 Name Finding Study
    35.16%
    >35% OCR error rate for names only
    Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.
    Top OCR errors
    Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008.
    http://www.tdwg.org/proceedings/article/view/380
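The headline percentage on this slide follows directly from the two counts it quotes. A quick check of the arithmetic, with variable names chosen here purely for illustration:

```python
# Counts quoted on the slide (Wei et al., 2008).
total_names = 3003
ocr_errors = 1056

# Share of names that OCR transcribed incorrectly.
error_rate = ocr_errors / total_names
print(f"{error_rate:.2%}")  # 35.16%
```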
  • 10. WikiSource
    Trove - National Library of Australia
    Manual techniques for text correction
  • 11. WikiSource Example
    http://biostor.org/wiki/Page:Spixiana1999zool.djvu/293
  • 12. Goal: Semi-automated text correction
    OCR + Machine Learning + Users
    Let machines do raw processing
    Develop algorithms for natural language processing & machine learning
    Build a community of (human) users to help
    reCAPTCHA as an example
    Why not just use reCAPTCHA?
    Google bought it
    *More work needed here*
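One way the "machines plus users" idea could work is a simple agreement rule, loosely in the spirit of reCAPTCHA: keep the OCR reading unless enough users independently type the same correction. The sketch below is only an assumption about how such a rule might look, not BHL's implementation; the function name `consensus`, the `min_votes` threshold, and the sample answers are all hypothetical.

```python
from collections import Counter


def consensus(ocr_word: str, user_answers: list[str], min_votes: int = 2) -> str:
    """Return the reading most users agree on, else fall back to the OCR word.

    Hypothetical rule: a correction is accepted only when at least
    `min_votes` users typed exactly the same text.
    """
    if not user_answers:
        return ocr_word
    # Strip surrounding whitespace only; case and diacritics are kept,
    # since they carry meaning in historical texts.
    votes = Counter(answer.strip() for answer in user_answers)
    best, count = votes.most_common(1)[0]
    return best if count >= min_votes else ocr_word


# Made-up example: OCR read "sehe" where two of three users typed "sche".
print(consensus("sehe", ["sche", "sche", "seine"]))  # sche
```

A threshold like this trades speed for accuracy: a higher `min_votes` needs more volunteers per word but is less likely to accept a single user's mistake.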
  • 13. Scientific names mapping
    http://biodiversitylibrary.org/page/27782237
  • 14. TaxonFinder API response
    Name finding via TaxonFinder
    Extract names
    Submit to NameBank
    Image from Scanner
    Converted to text via OCR
    Name Finding in action with uBio’s TaxonFinder…
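The "extract names" step can be illustrated with a deliberately naive stand-in. This is not TaxonFinder's algorithm or API; it is a regular-expression sketch that assumes names appear as plain Latin binomials (capitalised genus, lowercase epithet). The pattern, the function name, and the sample sentence are all hypothetical.

```python
import re

# Capitalised genus followed by a lowercase epithet, e.g. "Homo sapiens".
# Assumption: ASCII binomials only; no authorities, subspecies, or abbreviations.
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")


def find_candidate_names(ocr_text: str) -> list[str]:
    """Return unique 'Genus epithet' candidates in order of first appearance."""
    seen: dict[str, None] = {}  # dict keeps insertion order and drops repeats
    for genus, epithet in BINOMIAL.findall(ocr_text):
        seen.setdefault(f"{genus} {epithet}", None)
    return list(seen)


sample = "Specimens of Cyprinus carpio and Barbus barbus were collected."
print(find_candidate_names(sample))  # ['Cyprinus carpio', 'Barbus barbus']
```

A pattern this simple also matches ordinary sentence openings such as "The fish", which is why candidates need to be checked against a names service such as NameBank, as the slide describes, before being treated as scientific names.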
  • 18. Crowdsourcing
    http://biodiversitylibrary.org/page/20965795
  • 28. CiteBank: http://citebank.org
    New search index to BHL content
    Platform for journals/publishers/societies in need of tools to store & share their digitized content
    Access to “crowdsourced” articles from BHL scans
  • 34. Crowdsourcing Statistics & Analysis
    Analysis
    http://biodiversitylibrary.blogspot.com/2009/04/pdf-article-metadata-analysis.html
    At that time, more than 80% of the PDFs created had metadata attached by users
    More than 50% contributed accurate article-level information
    New analysis over more data this summer / fall
    Now have more than 58,000 PDFs to analyze
  • 35. Open Data = More Use
    Scholars
    Rod Page
    iPhylo
    BioGUID
    BioStor
    Ryan Schenk
    Other Apps
    EarthCape
    ZipcodeZoo
  • 36. Conclusion
    BHL is a massive dataset useful for multidisciplinary research
    Systematics
    Natural Language Processing
    Humanities
    BHL is open
    Free to use at http://biodiversitylibrary.org
    Open access data for scholarly use & reuse
    BHL has APIs and data exports to enable reuse
    BHL data can be incorporated into other virtual research environments (EOL, Scratchpads, BioStor, others)
  • 37. Questions?
    Chris Freeland
    Technical Director, Biodiversity Heritage Library
    Director, Center for Biodiversity Informatics, Missouri Botanical Garden
    Missouri Botanical Garden
    4344 Shaw Blvd.
    St. Louis, MO 63110 USA
    Email: chris.freeland@mobot.org
    Twitter: @chrisfreeland
    Blog / info: chrisfreeland.com
    BioSystematics Berlin 2011
    22 Feb 2011