Oboyski ecn2013


  • Notes from Nature is a collaboration between Zooniverse, a citizen science portal that hosts a number of citizen science projects with a very large following, and CalBug, SERNEC (SouthEast Regional Network of Expertise and Collections), and the Ornithology Collection of the Natural History Museum, London.
  • The site went live while I was at the iDigBio meeting at the Field Museum in April. Since that time we have surpassed a quarter million transcriptions by over 3,500 citizen scientists.
  • CalBug is an NSF-ADBC collaborative project among the eight major arthropod collections in California to digitize over one million specimens from our combined collections. Although we are collecting all the data together in a single cache and sharing techniques and workflows, each museum has developed its own approach based on the people and resources it has available. Therefore, what I am presenting is the approach we use at the Essig Museum, which may be somewhat different from that of the other institutions.
  • The goal is to make California arthropod diversity data available online through our own web service as well as through aggregators such as GBIF.
  • Our workflow for digitization can be broken down into three general categories. First is specimen handling and imaging, where we remove the labels (from pinned specimens), add unique identifiers (we use Data Matrix barcodes), and image the labels placed next to the specimens. Next we capture data from the images, either with our own people entering it directly into our own MySQL database or through our citizen science project, Notes from Nature. We are also looking into ways to incorporate OCR into data capture. Finally, the data are proofed, georeferenced, aggregated, and analyzed.
  • During the iDigBio meeting at the Field Museum in Chicago in April I learned that although many institutions are doing some form of imaging, hardly any were using the images as part of their databasing workflow! Personally I see an overwhelming benefit to imaging the individual specimens with their labels.
  • Here is an example of one of our pinned specimens. We use a digital camera tethered to a computer. Using IrfanView software to batch process the image files, we rename each file to include the unique identifier, genus, and species name. Although the genus and species name may change for this specimen over time, it is critical that these elements are in the filename for fast and efficient management of image files.
  • And now … slide scanning
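The file-naming convention described in the notes above (unique identifier, genus, and species in the filename) can be sketched in Python. This is only an illustration: the talk describes doing the batch renaming with IrfanView, not with a script, and the function names, the identifier pattern (letters followed by digits), and the `rename_image` helper are assumptions made for this example.

```python
import re
from pathlib import Path

# Assumed identifier shape: catalog prefix letters followed by digits, e.g. "EMEC218958".
ID_PATTERN = re.compile(r"^[A-Z]+\d+$")


def build_filename(specimen_id: str, genus: str, species: str, ext: str = ".jpg") -> str:
    """Return '<ID> <Genus> <species><ext>', e.g. 'EMEC218958 Paracotalpa ursina.jpg'."""
    if not ID_PATTERN.match(specimen_id):
        raise ValueError(f"unexpected specimen identifier: {specimen_id!r}")
    return f"{specimen_id} {genus.strip().capitalize()} {species.strip().lower()}{ext}"


def parse_filename(name: str) -> dict:
    """Split a filename produced by build_filename back into its three parts."""
    stem = Path(name).stem  # drop the extension
    specimen_id, genus, species = stem.split(" ", 2)
    return {"id": specimen_id, "genus": genus, "species": species}


def rename_image(path: Path, specimen_id: str, genus: str, species: str) -> Path:
    """Rename one image file on disk, keeping its original extension (hypothetical helper)."""
    target = path.with_name(build_filename(specimen_id, genus, species, path.suffix.lower()))
    path.rename(target)
    return target
```

Because the identifier comes first and is separated by a space, files sort by catalog number and the identifier can be recovered even if the genus or species name is later revised.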

    1. 1. Notes from Nature: Citizen Science data transcription • Peter Oboyski, Jun Ying Lim, Joyce Gross, Chris Snyder*, Arfon Smith*, Joanie Ball, Kip Will, Rosemary Gillespie • Essig Museum of Entomology • * Zooniverse Citizen Science Alliance
    2. 2. How does it work? • Introduction to CalBug • What is Zooniverse? • What do we provide? • What happens online? • What do we get back? • Technical issues • Maintaining interest • How can you get involved?
    3. 3. What is CalBug? • NSF-ADBC grant • Collaboration among the eight major entomology collections in California • Digitize 1.2 million specimens • Participating collections – Essig Museum of Entomology – California Academy of Sciences – California State Collection of Arthropods – Bohart Museum, UC Davis – Entomology Research Museum, UC Riverside – San Diego Natural History Museum – Santa Barbara Museum of Natural History – LA County Museum
    4. 4. Stephen Dowlan CalPhotos MySQL database Berkeley Mapper http://calbug.berkeley.edu
    5. 5. Berkeley Natural History Museums • In development – Integrating point data (specimen records) with Habitat, Range maps, Elevation, Climate, etc. – Historical recreation of the environment – Predict potential impacts of environmental change – Facilitate land use/management decisions
    6. 6. Digitization workflow • Handling & Imaging – (Optional) Sort by locality, date, sex, etc. – Remove labels, add unique identifier – Take digital image, name and save file – Replace labels, return to collection • Data Capture – Manually enter data into MySQL database – Online crowd-sourcing of manual data entry – Optical Character Recognition (OCR) & automated data parsing • Data Manipulation – Error checking – Geographic referencing – Aggregate data in online cache – Temporospatial analyses
    7. 7. Why Image Labels? • Magnify difficult to read labels • Verbatim archive of label data – Essential for proofing data – Useful for taxonomists interested in label data • Data capture can be done remotely
    8. 8. Digital camera tethered to computer • Average 50-55 images per hour, including imaging, file renaming, and upload • Filename = EMEC218958 Paracotalpa ursina.jpg
    9. 9. Slide scanning • Average 150 slides per hour, including scan, file renaming, and upload
    10. 10. 400 DPI seems to provide high enough resolution for difficult-to-read labels while keeping file size relatively small
    11. 11. But not high enough resolution for taxonomic work
    12. 12. Using citizen scientists to transcribe label data
    13. 13. http://www.notesfromnature.org/ Launched April 22, 2013
    14. 14. Images in, transcriptions out • We supply JPEG images – 400 DPI (300 DPI good) – Deposited as zip file – Stored in Amazon Cloud • In development – Automated service to upload images to Amazon Cloud – Be able to prioritize image set • Zooniverse provides – MongoDB data dump – 1 record = 1 transcription – 4 transcriptions / image • In development – Automated daily dump
    15. 15. Reconciling transcriptions • Drop-down lists (Country, State, County, Date) are compared for exact match – Occasionally missing, sometimes wrong – Majority rule • Free-form text fields (Locality, Collectors) are much more problematic – Transcribers asked to record label data verbatim – Punctuation, capitalization, spacing between words – Misspelling, expanding abbreviations, interpretations
    16. 16. Reconciling transcriptions • Developing scripts in R to reconcile free-form text • Text matching for maximum correspondence among multiple transcriptions (cf. DNA alignment methods) • Final result = 1 transcription in our database with links to the 4 original transcriptions marked as Citizen Science transcribed record • Vetting by CalBug personnel still necessary, but we can prioritize based on record-matching confidence scores
    17. 17. Generating & Maintaining Interest Number of Notes from Nature transcriptions for CalBug
    18. 18. Generating & Maintaining Interest
    19. 19. Generating & Maintaining Interest • Popular media, social media, and press releases – Only so many occasions for a press release • Campaigns – Highlight particular taxa, habitats, geographic regions • Education – High quality, high resolution photo of species transcribed – Create links to other services to learn more about species • Competitions – Prizes are worth more than badges – However, need to watch for bad data in pursuit of prize
    20. 20. How can you get involved? • Right now you cannot • iDigBio is interested in getting involved • iDigBio is hosting a hackathon in December • Begin building up collections of images
    21. 21. Thank you • And a HUGE thank you to the CalBug Army who image our specimens: Chris Amy, Maritess Aristorenas, Jazmin Calderon, Alex Carolina, Sonia Castillo, Matthew Chan, Sabina Cook, Alex Darwish, John Davie, Jesson Go, Nick Grady-Grote, Ginger Haight, Laura Hayes, Dennis Ho, Aubrey Huey, Leah Humphreys, Veronica Hurd, Hanna Huynh, Eseosa Igbinedion, Ilona Istenes, Emma Kohlsmith, Asia Kwan, Tiffany Kyo, Jerry Lee, Ken Lee, Christina Lew, Maggie Lewis, Alex Lim, Derick Matano, Christian Munevar, Frank Ngo, Kent Nguyen, Minh Nguyen, Riley O'Brien, Marielle Pinheiro, Rammonhan Reddy, Jessica Rothery, Stacey Rutherford, Anna Szendrenyi, Anni Sheh, Hannah Shin, Erika So, Mee Thao, Cindy Truong, Darleen Tu, Skyler Valle, Daug Vaughn, Hayden Wong, Yiu Kei Wong, Keane Yang, Kevin Yao, Frances Zhang
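The two reconciliation steps from slides 15 and 16 (majority rule for drop-down fields, and text matching with a confidence score for free-form fields) can be sketched as follows. The slides say the actual scripts are being developed in R and use alignment-style matching; this Python version is a stand-in that uses the standard library's `difflib.SequenceMatcher` for pairwise similarity, and every function name here is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def majority_value(values):
    """Majority rule for drop-down fields.

    Returns the most common non-empty value and the share of all
    transcriptions (including missing ones) that agree with it.
    """
    present = [v for v in values if v]
    if not present:
        return None, 0.0
    value, count = Counter(present).most_common(1)[0]
    return value, count / len(values)


def normalize(text):
    """Ignore case, punctuation, and spacing differences when comparing."""
    kept = "".join(ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(kept.split())


def best_free_text(values):
    """Pick the transcription that agrees best with all the others.

    The score is the mean pairwise similarity (0 to 1) between the chosen
    transcription and the rest; low scores flag records for manual vetting.
    """
    present = [v for v in values if v and v.strip()]
    if not present:
        return None, 0.0
    if len(present) == 1:
        return present[0], 0.0  # nothing to compare against
    totals = [0.0] * len(present)
    for i, j in combinations(range(len(present)), 2):
        ratio = SequenceMatcher(None, normalize(present[i]), normalize(present[j])).ratio()
        totals[i] += ratio
        totals[j] += ratio
    best = max(range(len(present)), key=totals.__getitem__)
    return present[best], totals[best] / (len(present) - 1)
```

For example, `majority_value(["Alameda", "Alameda", "", "Contra Costa"])` returns `("Alameda", 0.5)`. Sorting records by the free-text score in ascending order gives one possible way to prioritize which records staff review first.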