VRA 2012, Cataloging Case Studies, ROBOCATALOGING
Presented by Joshua Polansky at the Annual Conference of the Visual Resources Association, April 18th - April 21st, 2012, in Albuquerque, New Mexico.

The Cataloging Case Studies session will explore metadata migration, workflows, cloud computing, and tagging, and how they can be applied to digital collections. Mary Alexander of the University of Alabama will present on the second of two migrations that have taken place at the University of Alabama Libraries and the importance of metadata schema and workflows in that process. Joshua Polansky of the University of Washington will describe his automated workflow using optical character recognition (OCR), Apple Automator, and Microsoft Excel to speed the process of collecting metadata for 75,000 digital assets. Elizabeth Berenz of ARTstor will look at the advantages of cloud-based software for image management, using Shared Shelf as a working example. Finally, Ian McDermott of ARTstor will demonstrate the advantages of expert tagging and annotation in improving metadata. His presentation will focus on two ARTstor collections that could benefit from the knowledge of the larger ARTstor community: the Gernsheim Photographic Corpus of Drawings and the Larry Qualls Archive of contemporary art exhibitions.

MODERATOR:
Jeannine Keefer, University of Richmond, VA

PRESENTERS:
Mary Alexander, University of Alabama
Elizabeth Berenz, ARTstor
Ian McDermott, ARTstor
Joshua Polansky, University of Washington


  1. ROBOCATALOGING: Accelerated workflows using OCR and automation
     Joshua Polansky, University of Washington College of Built Environments, Visual Resources Collection
     Cataloging Case Studies, April 21, 2012
  2. University of Washington College of Built Environments Visual Resources Collection
     Serves the departments of Architecture, Construction Management, Landscape Architecture and Urban Design & Planning
     Analog collection:
     • 130,000 35mm slides accessioned and cataloged since the 1950s
     • Typewritten records; no digital database or online component until 2002
  3. Visual Resources Collection
     Digital components:
     • MS Access database catalog
     • MDID2 for faculty / students
  4. The big question: Automated processes exist for batch digitizing analog photos.
  5. The big question: Automated processes exist for batch digitizing analog photos. Is it possible to batch digitize old cataloging data, too?
     • Good cataloging information here, researched and typed years ago.
     • More good data, including source and a unique accession number.
  6. Paper records to the rescue
     • Binders and binders of accession records
     • Pristine label photocopies
  7. A closer look at the slide label
     • Architect
     • Building name
     • Location / Year
     • View
     • Source
     • Photocopied label edge that will interfere with OCR later
     • Collection ID that appears on every label in this form
     • Accession number
  8. The big challenge:
     • Digitize these typewritten pages
     • Sort slide label text into distinct columns in Excel
     • Identify each record with its accession number
     • Do it all with common or affordable tools
  9. Photo: Alvaro Farfán via Flickr. 3392225359
  10. Hardware: Apple iMac
      • 2010 model
      • OS 10.6
      Any recent Mac will do (OS 10.4 or higher)
      Photo: Alvaro Farfán via Flickr. 3392225359
  11. Hardware: Epson Perfection V500 scanner
      • With optional Automatic Document Feeder for stacks of 30 sheets at a time
      • Standard transparency unit makes it useful for other scanning projects
      • Retails for less than $300 with ADF
      Photo: Alvaro Farfán via Flickr. 3392225359
  12. Photo: Zak Moreira via Flickr. 3425393424
  13. Software
      Photo: Zak Moreira via Flickr. 3425393424
  14. Adobe Photoshop CS4
      • Resize and realign scanned page into a single-column tif with Actions
      Adobe Acrobat Pro
      • Create a pdf of each tif
      • Analyze pdf with optical character recognition (OCR) and make pdf text selectable
  15. Microsoft Excel 2008
      • Receive text from Acrobat in columns
      • After text manipulation and sorting, output in a cross-platform format like csv
      Apple Automator / Automator Virtual Input
      • Execute workflows to control multiple applications. Launch, copy, paste, manipulate, save, repeat.
      • Create Folder Actions for Finder automation
      • Virtual Input: Extend the functionality of Automator for even more control over apps, mouse, keyboard
  16. Automator
      • Comes standard with Mac OS X 10.4+
      • Allows scripting and workflow creation via GUI
      • Can perform operations within an application or across multiple applications
  17. Document scanning: Automator, Folder Actions, Photoshop [video here in original presentation]
  18. Text processing: Automator + Automator Virtual Input, Folder Actions, Acrobat, Excel [video here in original presentation]
  19. Processed output in Excel
  20. Sometimes it looks good...
  21. Sometimes it looks good... Sometimes it doesn’t.
  22. Final result after text sorting and cleanup
  23. Goal
      • Produce nearly perfect metadata, clean enough to import into existing database
  24. Goal
      • Produce nearly perfect metadata, clean enough to import into existing database
      Actual outcome
      • Produced pretty good metadata
      • Spent lots of time on data cleanup to get there
  25. Goal
      • Use tools on hand; any new tools should be cheap or useful for other projects
  26. Goal
      • Use tools on hand; any new tools should be cheap or useful for other projects
      Actual outcome
      • Used standard software, plus one new application ($25)
      • iMac is a student workstation
      • Epson scanner is in use for print and film scanning plus pdf creation
  27. Goal
      • Have 75,000 new records ready to pair with images and publish to MDID
  28. Goal
      • Have 75,000 new records ready to pair with images and publish to MDID
      Actual outcome
      • Got 75,000 records!
      • Created a searchable shelf list and archival finding aid
      • With further data cleanup, the original goal of MDID use can be achieved
  29. Photo: JF Sebastian via Flickr. 412874324
  30. • Every Mac comes with Automator and it is easy to learn
      • You probably have OCR tools on your computer right now
      • Experimenting can produce great results
      Photo: JF Sebastian via Flickr. 412874324
  31. • Every Mac comes with Automator and it is easy to learn
      • You probably have OCR tools on your computer right now
      • Experimenting can produce great results
      Photo credits
      • Software icons and screenshots by Adobe, Apple, Microsoft and Singed Labcoat
      • Kraftwerk images by Flickr users Zak Moreira, Alvaro Farfán and JF Sebastian
      • Other photo and video by UW CBE VRC
      Thank you
      Rainer Metzger, University of Washington
      Photo: JF Sebastian via Flickr. 412874324
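
The sorting step from slide 8 (split the label text into columns and key each record by its accession number) can be sketched in a few lines of Python. This is not the presenter's method, which used Automator, Acrobat and Excel. It is a minimal illustration under stated assumptions: each label field sits on its own line in the order shown on slide 7, each label ends with its accession number, and accession numbers look like `78-1234`. The pattern, the field names, the `COLLECTION_ID` value and the sample text are all hypothetical. Labels with an unexpected number of lines are flagged for manual cleanup rather than guessed at.

```python
import csv
import io
import re

# Hypothetical column names, following the label anatomy on slide 7.
FIELDS = ["architect", "building", "location_year", "view", "source"]

# Assumption: accession numbers look like "78-1234". The real pattern is not
# given in the slides and would need to be adjusted.
ACCESSION_RE = re.compile(r"^\d{2}-\d{3,5}$")

# Assumption: the collection ID repeated on every label is this literal string.
COLLECTION_ID = "UW CBE VRC"


def parse_labels(ocr_text):
    """Group OCR lines into records, closing a record at each accession number."""
    records, pending = [], []
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line or line == COLLECTION_ID:
            continue  # skip blank lines and the repeated collection ID
        if ACCESSION_RE.match(line):
            record = dict.fromkeys(FIELDS, "")
            record.update(zip(FIELDS, pending))
            record["accession"] = line
            # Flag labels whose line count does not match the assumed layout.
            record["needs_review"] = len(pending) != len(FIELDS)
            records.append(record)
            pending = []
        else:
            pending.append(line)
    return records


def to_csv(records):
    """Serialize records to CSV text, a cross-platform format as on slide 15."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS + ["accession", "needs_review"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented sample text standing in for OCR output; the second label is
# deliberately incomplete.
SAMPLE = """Doe, Jane
Example House
Somewhere, WA / 1970
Exterior from street
Staff photo
UW CBE VRC
78-1234
Roe, Richard
Example Pavilion
UW CBE VRC
79-0042
"""

records = parse_labels(SAMPLE)
print(to_csv(records))
```

Closing each record at the accession number keeps a badly recognized label from shifting the columns of every label after it: the damage stays inside one flagged row.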
