Digitizing California Arthropod Collections (CalBug), ECN 2012

  • The tool first prompts the user to highlight where the record text lies within the image. This lets us store a spatial annotation (in MongoDB) recording where on the image each piece of data was transcribed.

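The speaker note above can be sketched as a small example: the document a transcription tool might write to MongoDB after the user highlights a label region. This is a hypothetical sketch; the field names are illustrative, not CalBug's actual schema.

```python
# Hypothetical sketch of a spatial transcription annotation of the kind the
# note describes. Field names are assumptions, not CalBug's real schema.

def make_annotation(image_file, field, value, box):
    """Build a MongoDB-style document linking a transcribed value to the
    pixel rectangle (x, y, width, height) the user highlighted."""
    x, y, w, h = box
    return {
        "image": image_file,   # source image file name
        "field": field,        # which database field was transcribed
        "value": value,        # the transcribed text
        "bbox": {"x": x, "y": y, "w": w, "h": h},  # highlighted region, pixels
    }

doc = make_annotation("EMEC218958 Paracotalpa ursina.jpg",
                      "locality", "Alameda Co., CA", (120, 340, 260, 40))
# With pymongo, such a document could be saved via collection.insert_one(doc).
```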
    1. Digitizing California Arthropod Collections
       Peter Oboyski, Gordon Nishida, Kipling Will, Rosemary Gillespie
       Essig Museum of Entomology, University of California, Berkeley, California, USA
    2. What is CalBug?
       • Essig Museum of Entomology
       • California Academy of Sciences
       • California State Collection of Arthropods
       • Bohart Museum, UC Davis
       • Entomology Research Museum, UC Riverside
       • San Diego Natural History Museum
       • LA County Museum
       • Santa Barbara Museum of Natural History
    3. Digitization workflow
       Handling & Imaging:
       • (Optional) Sort by locality, date, sex, etc.
       • Remove labels, add unique identifier
       • Take digital image, name and save file
       • Replace labels, return to collection
       Data Capture:
       • Manually enter data into MySQL database
       • Online crowd-sourcing of manual data entry
       • Optical Character Recognition (OCR) & automated data parsing
       Data Manipulation:
       • Error checking
       • Geographic referencing
       • Aggregate data in online cache
       • Temporospatial analyses
    4. Why image specimens/labels?
       • Data capture can be done remotely
       • Magnify difficult-to-read labels
       • Verbatim archive of label data
    5. Handling & Imaging
       • (Optional) Sort by locality, date, sex, etc.: presorting allows faster databasing
       • Remove labels, add unique identifier: removing labels is quick; adding unique identifiers is slow
       • Take digital image, name and save file: requires an efficient work station, file-naming conventions, and batch processing
       • Replace labels, return to collection: replacing labels takes time
    6. 1st generation: Dino-Lite digital microscope
    7. 2nd generation: digital camera (Canon G9)
    8. Digital camera, tethered to computer; labels removed
       • High resolution: magnify hard-to-read labels
       • Labels flat and unobscured: better for OCR
       • Scale bar, controlled lighting
       • Important to add the species name to the image or file name, e.g. "EMEC218958 Paracotalpa ursina.jpg"
    9. Scanning slides: flatbed scanner & Photoshop
    10. Save for Web & Devices
    11. IrfanView software for batch processing of image files, e.g. "EMEC218958 Paracotalpa ursina.jpg"
    12. Data Capture
        • Manual entry into our own MySQL database (EssigDB)
          o Built-in error checking
          o Data carry-over from one record to the next
          o Taxonomy automatically added
        • "Notes from Nature": collaboration with Zooniverse for citizen-scientist transcription of labels
        • OCR: collaboration with UC San Diego on improved OCR and "word spotting"; automatic data parsing (not yet!); iDigBio OCR "hackathon" in February
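The "data carry-over" idea above can be sketched in a few lines: pre-fill a new record from the previous one, so the keyboarder only retypes the fields that changed between consecutive labels. This is a hypothetical helper, not EssigDB code, and the carried field set is an assumption.

```python
# Sketch (hypothetical, not EssigDB code) of data carry-over between records:
# a new record inherits the previous record's values, then overrides only
# the fields that actually changed on the next label.

CARRY_FIELDS = ["country", "state", "county", "collector", "date"]  # assumed set

def new_record(previous, **overrides):
    """Copy the carry-over fields from the previous record, then apply changes."""
    record = {f: previous.get(f, "") for f in CARRY_FIELDS}
    record.update(overrides)
    return record

r1 = {"country": "USA", "state": "CA", "county": "Alameda",
      "collector": "P. Oboyski", "date": "1998-07-12"}
r2 = new_record(r1, date="1998-07-13")  # only the date changed on this label
```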
    13. Genus and species parsed from the file name; higher taxonomy auto-filled from a database authority file
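Parsing genus and species from the file-naming convention shown earlier ("EMEC218958 Paracotalpa ursina.jpg") could look like the sketch below. The regex encodes my reading of that convention (specimen code, then Genus, then species), not CalBug's actual parser.

```python
# Sketch of extracting the specimen code, genus, and species from the
# naming convention on the slides. The pattern is an assumption.
import re

PATTERN = re.compile(
    r"^(?P<code>[A-Z]+\d+) (?P<genus>[A-Z][a-z]+) (?P<species>[a-z]+)\.jpg$"
)

def parse_filename(name):
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    return m.group("code"), m.group("genus"), m.group("species")

code, genus, species = parse_filename("EMEC218958 Paracotalpa ursina.jpg")
# code == "EMEC218958", genus == "Paracotalpa", species == "ursina"
```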
    14. Notes from Nature: citizen-science data transcription
    15. Integrating OCR with crowd-sourcing
        o Spotting words within images
        o Copy-paste and highlight-drag fields
        o Auto-detecting repeated "words" (e.g. species, states, counties)
        o Providing an additional "vote" for transcription consensus
    16. The OCR challenge for specimen labels
        DETECTION: finding text in a complex matrix
        • Machine-typed vs. hand-written labels
        • Sliding-window classifier creates text bounding boxes
        • >95% detection and localization using pixel-overlap measures
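A "pixel-overlap measure" for judging whether a predicted text bounding box matches a ground-truth box is typically intersection-over-union, sketched below. The 0.5 acceptance threshold is a common convention and an assumption here; the slide does not state the exact measure used.

```python
# Sketch of a pixel-overlap score (intersection over union) for scoring
# detected text bounding boxes against ground truth. Threshold is assumed.

def iou(a, b):
    """Boxes as (x1, y1, x2, y2). Returns intersection-over-union in [0, 1]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlapping pixel area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# A detection shifted 10 px still overlaps heavily, so it counts as a hit.
hit = iou((0, 0, 100, 40), (10, 0, 110, 40)) >= 0.5
```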
    17. Current progress in OCR recognition
        RECOGNITION: using the Tesseract OCR engine
        • Machine type: 74% word-level accuracy, 82% character-level accuracy
        • Handwriting: 5.4% word-level accuracy, 9.2% character-level accuracy
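Word-level and character-level accuracies like those on the slide are standard OCR metrics: character accuracy is usually derived from edit distance, word accuracy from exact word matches. The sketch below shows one common way to compute them; it is illustrative, not the project's evaluation code.

```python
# Sketch of standard OCR accuracy metrics (not CalBug's evaluation code):
# character accuracy from Levenshtein edit distance, word accuracy as the
# fraction of exactly matching words.

def edit_distance(a, b):
    """Wagner-Fischer edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def char_accuracy(truth, ocr):
    return 1 - edit_distance(truth, ocr) / max(len(truth), 1)

def word_accuracy(truth, ocr):
    t, o = truth.split(), ocr.split()
    return sum(a == b for a, b in zip(t, o)) / max(len(t), 1)

wa = word_accuracy("Alameda Co CA", "Alaneda Co CA")  # 2 of 3 words exact
ca = char_accuracy("Alameda Co CA", "Alaneda Co CA")  # one substituted character
```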
    18. Data Manipulation
        • Error checking: just starting this phase; no report on error rates yet
        • Geographic referencing: very slow, even with semi-automation via GeoLocate and other services
        • Aggregating data: following Darwin Core standards; merging of data is straightforward
        • Temporospatial analyses: pending
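Following Darwin Core, as the slide describes, means mapping each transcribed record onto the standard's term vocabulary so data from different collections merge cleanly. The sketch below uses real Darwin Core term names (occurrenceID, scientificName, locality, eventDate, basisOfRecord); the helper and the locality/date values are invented for illustration.

```python
# Sketch of mapping a transcribed record onto Darwin Core terms. The term
# names are genuine Darwin Core; the helper and sample values are invented.

def to_darwin_core(code, genus, species, locality, date):
    return {
        "occurrenceID": code,
        "scientificName": f"{genus} {species}",
        "locality": locality,
        "eventDate": date,                      # ISO 8601, per Darwin Core usage
        "basisOfRecord": "PreservedSpecimen",   # standard value for pinned material
    }

rec = to_darwin_core("EMEC218958", "Paracotalpa", "ursina",
                     "Alameda County, California, USA", "1998-07-12")
```

Because every contributing museum emits the same flat term set, merging is a simple concatenation of records, which is why the slide calls it straightforward.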
    19. Progress
        • After 2 years, with an undergraduate student work force:
        • Pinned specimens: imaging 20-65 specimens per hour (avg. 40)
        • Microscope slides: imaging 100-170 specimens per hour (avg. 140)
        • Approximately 40,000 records databased, plus 115,000 previously databased insect records
        • 150,000+ images waiting to be databased
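As a back-of-the-envelope check of the numbers above: at the stated average pinned-specimen rate, producing the 150,000-image backlog represents thousands of person-hours of imaging. Simple arithmetic, assuming the pinned-specimen average applies throughout.

```python
# Rough arithmetic on the slide's figures: person-hours of imaging behind
# the 150,000-image backlog, assuming the pinned-specimen average rate.

BACKLOG = 150_000        # images waiting to be databased (slide figure)
AVG_PINNED_RATE = 40     # specimens imaged per hour (slide average)

hours = BACKLOG / AVG_PINNED_RATE   # 3,750 person-hours of imaging
```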
    20. Thank you. http://calbug.berkeley.edu
