Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing


Published on 22 Feb 2011. BioSystematics Berlin 2011.

Published in: Technology, Education

  1. 1. Digitization and enhancement of biodiversity literature through OCR, scientific names mapping and crowdsourcing<br />Chris Freeland<br />Technical Director,<br />Biodiversity Heritage Library<br />BioSystematics Berlin 2011<br />22 Feb 2011<br />http://biodiversitylibrary.org/page/33061402<br />
  2. 2. Digitization<br />http://biodiversitylibrary.org/page/6165462<br />
  3. 3. Workflow<br />Conservation<br />Digitization<br />Selection<br />Preparation<br />Post Production<br />(Re)publication<br />
  4. 4. Scanning Derivatives<br />Files are stored & sync’d across BHL clusters<br />Master<br />Derivatives<br />XML<br />JP2<br />PDF<br />JPG<br />TXT<br />DjVu<br />Storage<br />PDF<br />OCR<br />JP2<br />XML<br />
  5. 5. Optical Character Recognition (OCR)<br />http://biodiversitylibrary.org/page/2836705<br />
  6. 6. OCR is a *BIG* challenge<br />All book / literature digitization projects are affected, not just BHL<br />Especially problematic in BHL<br />More than 50 languages represented in BHL<br />Dates of publication from the 1400s to the 2000s<br />Irregular typeface / typesetting<br />Multiple languages on one page<br />Botanical descriptions in Latin<br />
  7. 7. Abbildungenund Beschreibungen<br />der<br />FischeSyriens,<br />nebst<br />einerneuen Classification und Characteristik<br />sämmtlicherGattungen<br />der<br />i<br />JOH. JAKOB HECKEL, <br />Inipectoiam k. k. Hof-Natur.-iUenkabinete in Wien, mehr, yelelirt. UeHtllMeii. MIfglivd.<br />STUTTGART.<br />E. Schweizerbart' seheVerlagshandlung,<br />1843. <br />
  8. 8. *E.xvi�c�piteI von c. cXx.WptdvonfnrWmn bu�fbe;bcn.5 am cixbIa� S &3rn~ 41X a�mcv(f b1air�'o�et ertoiensr�; �', :�hlrfc�cwa ff�4am.diug bist a<br />6aiw~s ff oJrJtwtnof bL4ecImt& blfaframembt wag `wr 4 cnwiu 4 e8t5m.ed bvUratflb ck wuo, ma144'*4I bttE5rmbebt =rt3'kn am4ra tifvrmrWaff C * t6rmnli an `tn�ciblatGteaMw ?ffoaifrn w4wmeu nu weibe , wpiteI voE5teiri ct cobergtUcr cit cm` 91 cLibiar J ' >bSciatl�Oiff ;Bruetwacfttcnqmcx b1a bl: bt5c lttmtt bb9 lkrw.llr#eitincnxoa ff cu :rtrtuft *et� B Rn "�trv W1Rt' ?Cm cblaswaIwutrOber�citi 1V Ces ' wt gbtiemwwajfutpctt, afferain 9 c: b�titbfof�rferanmrs bra wlg auig4;f aer�m *mc vrtblatcabtfmwfruan'deg~mrtblasIaumbwWt� run fncmai b14ianf tJobrrfan ebrut4net vnberBrwtOberawawi*m.crriiibtafwfmuwwc on$ 'it ttuwttkc 5,10 $ m~Cfcatrc* cxu W�e�&mcyfbq4 Mabttmmwrc a iiubcJcnncI.end.*, blat s. a u:�rprd3 rw4ftf wm c ii,+ ttCCtnwa frr9fr orfabfcfbtenbcoptitibt -r9 ceDattDcn i34M snSemi<br />
  9. 9. 2007 Name Finding Study<br />35.16%<br />>35% OCR error rate for names only<br />Of the 3,003 names, 1,056 were incorrectly transcribed by OCR.<br />Top OCR errors<br />Wei, et al. An Evaluation of Taxonomic Name Recognition (TNR) in the Biodiversity Heritage Library. Proceedings of TDWG. 2008.<br />http://www.tdwg.org/proceedings/article/view/380<br />
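
The headline figure on this slide follows directly from the two counts it quotes. A minimal Python sketch of that arithmetic is given here; the function name `ocr_error_rate` is only illustrative and does not come from the cited study.

```python
def ocr_error_rate(total_names: int, wrong_names: int) -> float:
    """Return the share of names mis-transcribed by OCR, as a percentage."""
    if total_names <= 0:
        raise ValueError("total_names must be positive")
    return 100.0 * wrong_names / total_names


# Counts quoted on the slide: 3,003 names, 1,056 incorrectly transcribed.
rate = ocr_error_rate(3003, 1056)
print(f"{rate:.2f}%")  # 35.16%
```
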
  10. 10. WikiSource<br />Trove - National Library of Australia<br />Manual techniques for text correction<br />
  11. 11. WikiSource Example<br />http://biostor.org/wiki/Page:Spixiana1999zool.djvu/293<br />
  12. 12. Goal: Semi-automated text correction<br />OCR + Machine Learning + Users<br />Let machines do raw processing <br />Develop algorithms for natural language processing & machine learning<br />Build a community of (human) users to help<br />reCAPTCHA as an example<br />Why not just use reCAPTCHA?<br />Google bought it<br />*More work needed here*<br />
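
One way to picture the "OCR + Machine Learning + Users" split is a triage step: the machine accepts words it knows, auto-corrects near misses, and queues everything else for human review. The sketch below is an assumption-laden illustration, not BHL's method. The lexicon, the 0.8 similarity cutoff, and the name `triage` are all hypothetical, and simple string similarity from Python's standard `difflib` stands in for a real machine-learning model.

```python
from difflib import get_close_matches

# Hypothetical lexicon of known-good words; a real system would use
# language dictionaries and taxonomic name lists.
LEXICON = ["Abbildungen", "Beschreibungen", "Fische", "Syriens", "Classification"]


def triage(tokens, lexicon=LEXICON, cutoff=0.8):
    """Split OCR tokens into accepted, machine-corrected, and human-review sets."""
    accepted, auto_fixed, needs_review = [], [], []
    for tok in tokens:
        if tok in lexicon:
            accepted.append(tok)  # exact match: nothing to do
            continue
        match = get_close_matches(tok, lexicon, n=1, cutoff=cutoff)
        if match:
            auto_fixed.append((tok, match[0]))  # near miss: machine proposes a fix
        else:
            needs_review.append(tok)  # too garbled: send to the crowd
    return accepted, auto_fixed, needs_review


accepted, auto_fixed, needs_review = triage(["Fische", "Beschreibnngen", "UeHtllMeii"])
print(accepted, auto_fixed, needs_review)
```

With a cutoff this strict, only small character-level slips are corrected automatically; heavily damaged tokens such as those on slide 8 would all land in the human-review queue.
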
  13. 13. Scientific names mapping <br />http://biodiversitylibrary.org/page/27782237<br />
  14. 14. TaxonFinder API response<br />Name finding via TaxonFinder<br />Extract names<br />Submit to NameBank<br />Image from Scanner<br />Converted to text via OCR<br />Name Finding in action<br />with uBio’s TaxonFinder…<br />
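
The name-finding step (OCR text in, candidate scientific names out) can be illustrated with a deliberately simple stand-in. The sketch below is not TaxonFinder's algorithm and does not call uBio or NameBank. It assumes a hypothetical set of known genera and uses a regular expression to pick out "Genus species" pairs from plain text.

```python
import re

# Hypothetical genus list; a real service would consult a large name index.
KNOWN_GENERA = {"Cyprinus", "Salmo"}

# Capitalized word followed by a lowercase word of three or more letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def find_candidate_names(text, genera=KNOWN_GENERA):
    """Return 'Genus species' strings whose genus appears in the known list."""
    return [f"{genus} {species}"
            for genus, species in BINOMIAL.findall(text)
            if genus in genera]


ocr_text = "Specimens of Cyprinus carpio and Salmo trutta were collected near Wien."
print(find_candidate_names(ocr_text))
# ['Cyprinus carpio', 'Salmo trutta']
```

A matcher like this only works when OCR has transcribed the genus correctly, which is why the error rates reported on slide 9 matter for name finding.
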
  15. 15.
  16. 16.
  17. 17.
  18. 18. Crowdsourcing<br />http://biodiversitylibrary.org/page/20965795<br />
  19. 19.
  20. 20.
  21. 21.
  22. 22.
  23. 23.
  24. 24.
  25. 25.
  26. 26.
  27. 27.
  28. 28. CiteBank: http://citebank.org<br />New search index to BHL content<br />Platform for journals/publishers/societies in need of tools to store & share their digitized content<br />Access to “crowdsourced” articles from BHL scans<br />
  29. 29.
  30. 30.
  31. 31.
  32. 32.
  33. 33.
  34. 34. Crowdsourcing Statistics & Analysis<br />Analysis<br />http://biodiversitylibrary.blogspot.com/2009/04/pdf-article-metadata-analysis.html<br />At that time, more than 80% of the PDFs created had metadata attached by users<br />More than 50% contributed accurate article-level information<br />New analysis over more data this summer / fall<br />Now have more than 58,000 PDFs to analyze<br />
  35. 35. Open Data = More Use<br />Scholars<br />Rod Page<br />iPhylo<br />BioGUID<br />BioStor<br />Ryan Schenk<br />Other Apps<br />EarthCape<br />ZipcodeZoo<br />
  36. 36. Conclusion<br />BHL is a massive dataset useful for multidisciplinary research<br />Systematics<br />Natural Language Processing<br />Humanities<br />BHL is open<br />Free to use at http://biodiversitylibrary.org<br />Open access data for scholarly use & reuse<br />BHL has APIs and data exports to enable reuse<br />BHL data can be incorporated into other virtual research environments (EOL, Scratchpads, BioStor, others)<br />
  37. 37. Questions?<br />Chris Freeland<br />Technical Director, Biodiversity Heritage Library<br />Director, Center for Biodiversity Informatics, Missouri Botanical Garden<br />Missouri Botanical Garden<br />4344 Shaw Blvd.<br />St. Louis, MO 63110 USA<br />Email: chris.freeland@mobot.org<br />Twitter: @chrisfreeland<br />Blog / info: chrisfreeland.com<br />BioSystematics Berlin 2011<br />22 Feb 2011<br />