Crowdsourcing Digitization: Harnessing Workflows to Increase Output - Presentation Transcript
Crowdsourcing Digitization: Harnessing Workflows to Increase Output
Gretchen Gueguen, East Carolina University
Ann Hanlon, Marquette University
LITA National Forum, 2008, Cincinnati, Ohio
What is crowdsourcing?
Jeff Howe, Wired Magazine, 2006
“distributed labor networks are using the Internet to exploit the spare processing power of millions of human brains” – best example, Wikipedia…
Any end achieved by harnessing the wisdom and labor of crowds
Distributing the burden of a large endeavor
Howe, Jeff. “The Rise of Crowdsourcing”, Wired Magazine, Issue 14.06, June 2006
Crowdsourcing Digitization
Crowd?
Patrons and Co-workers
Capturing digitization for patron request
Selection is driven by patron request
Centralized and Decentralized staffing for digitization
Object: Build robust digital collections
Online collections dense enough for systematic research (not just showcases)
Crowdsourcing Digitization
The Wisdom of Crowds
How the project was conceived and developed: success story
The Madness of Crowds
How the project failed, why: bringing it back from the brink
Crowd Control
Methods used and lessons learned
Attracting a Crowd
Critical mass for the masses: why we digitize
The Wisdom of Crowds
Project Background: Archives and Special Collections
Digital image management for archives and special collections
Reducing redundancy – many items requested for digitization more than once, why not track them?
Digital Collections and Research (DCR)
New office to coordinate digitization efforts established
Establishing a digital repository
More ambitious than just image management
Image management = capturing patron scanning workflow to populate the new repository
The Wisdom of Crowds
Coordination between Archives and Digital Collections:
New metadata schema
New best practice guidelines
Developing Repository
Fedora required development
Meanwhile, patron scanning continues to grow…
The Wisdom of Crowds
Answer: Scanning Database
Microsoft Access database: “stop-gap measure” while digital repository was being built
Corresponded to newly created XML schema and metadata requirements for repository
The Wisdom of Crowds
Biggest beneficiary: University Archives
Receives the most scanning requests from patrons
Capture patron requests, as well as items scanned prior to implementation of Scanning Database
University celebrating 150th anniversary
Documentary
“Coffee table” book
Departmental histories
Nostalgic alumnae
The Wisdom of Crowds
Collections created by crowdsourcing digitization:
University AlbUM
National Trust for Historic Preservation Postcard Collection
The Madness of Crowds
Evolution
Evolving standards for both metadata and imaging
Training/Quality
(dis)Organization
Backlog
The Madness of Crowds
Evolution
Quality of legacy scans
file types
spatial resolutions
Color profiles
Clipping, noise, and other “problems”
Flawed equipment
Training/Procedures
(dis)Organization
Backlog
The Madness of Crowds
Rotated 90º
Rotated 180º
24-bit color 300 dpi tif
8-bit 600 dpi tif
48-bit color 600 dpi tif
Bitonal EPS
16-bit 300 dpi JPEG
indexed color 72 dpi gif
PDF???
The Madness of Crowds
Evolution
Metadata Quality
Lack of experience with controlled vocabularies and input standards
Changing metadata requirements
Training/Procedures
(dis)Organization
Backlog
The Madness of Crowds
It’s not quite wrong… but, it’s not quite right
Evolution
Training/Procedures
No standard guidelines for scanning procedures
No quality control procedures for images or metadata
No one to set them up anyway…
(dis)Organization
Backlog
The Madness of Crowds
Evolution
Training/Procedures
(dis)Organization
Does everything fit in a “collection”?
Backlog
The Madness of Crowds
Evolution
Training/Procedures
(dis)Organization
Backlog
Robust metadata standard to enable repurposing and “sharability”
Could take 10x more time to do metadata than scanning
Volume of scanning didn’t leave much time for metadata
The Madness of Crowds
Crowd Control
Create Documentation
“Teachable” standard
Responsibility
Quality
Divide and Conquer?!?
Crowd Control
Create Documentation
TEACH it
Responsibility
Quality: Live it, Learn it, Love it
Divide and Conquer
1. resolution
2. color
3. straightness and placement
4. reference points (targets)
5. noise
6. file format
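Three of the six criteria above (resolution, color, and file format) can be read from a file's technical metadata, so they lend themselves to an automated first pass. The sketch below is only an illustration of that idea: the target values in `TARGET` and the dictionary keys are assumptions made for the example, not the standard the presenters adopted. Straightness, reference targets, and noise still need a visual check.

```python
# Hypothetical target specification; the slides do not state the actual values.
TARGET = {"format": "tif", "dpi": 600, "bit_depth": 24}


def check_scan(scan, target=TARGET):
    """Return the names of the criteria a scan fails, based on its metadata.

    `scan` is a dict with the (assumed) keys "format", "dpi" and "bit_depth".
    Only the criteria that can be read from metadata are checked here.
    """
    problems = []
    if scan["format"].lower() != target["format"]:
        problems.append("file format")
    if scan["dpi"] < target["dpi"]:
        problems.append("resolution")
    if scan["bit_depth"] < target["bit_depth"]:
        problems.append("color")
    return problems


# Two made-up legacy scans: one web-quality GIF and one that meets the target.
legacy_gif = {"format": "gif", "dpi": 72, "bit_depth": 8}
good_tif = {"format": "TIF", "dpi": 600, "bit_depth": 24}

for name, scan in [("legacy_gif", legacy_gif), ("good_tif", good_tif)]:
    print(name, check_scan(scan))
```

A report like this could help decide which legacy scans to rescan and which to accept as they are.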
Crowd Control
Puglia, 2007
Imaging Environment
Defined Image State
RAW
Prepped for a specific output
Output Referred - looks towards output
Input Referred - looks towards sensor
Original Referred - defined relationship between original and digital version
Current Practice
Emerging Practice
More technical metadata is needed
Should be able to get by with less technical metadata
Create documentation
TEACH it!
Quality: Live it, Learn it, Love it
Have curatorial staff check for accuracy and completeness
DCR staff follow up with a check of a statistically significant portion for style and consistency
Eventually, give curatorial staff the ability to make corrections as they find them using the web-based administrative form
Responsibility
Divide and conquer?!?
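The follow-up check of a portion of the records could be drawn at random from each batch. The sketch below shows one way to do that. The 10% fraction, the minimum of five records, and the record identifiers are all assumptions for illustration; the slides do not say how large a "statistically significant portion" was in practice.

```python
import random


def sample_for_review(record_ids, fraction=0.1, minimum=5, seed=None):
    """Pick a random subset of record IDs for a style and consistency check.

    `fraction` and `minimum` are illustrative defaults, not values from the
    presentation. A `seed` can be passed to make the selection repeatable.
    """
    ids = list(record_ids)
    if not ids:
        return []
    # Review at least `minimum` records, but never more than the batch holds.
    size = min(len(ids), max(minimum, round(len(ids) * fraction)))
    rng = random.Random(seed)
    return sorted(rng.sample(ids, size))


# A made-up batch of 200 record identifiers.
batch = [f"rec-{n:04d}" for n in range(1, 201)]
review = sample_for_review(batch, fraction=0.1, seed=42)
print(len(review), "records selected for review")
```

Sampling after curatorial staff have checked accuracy keeps the second pass small while still catching systematic style problems.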
Crowd Control
Documentation
“Teachable” standard
Quality: Live it, Learn it, Love it
Responsibility
Someone has to have some
But it doesn’t have to be an entire job
Divide and Conquer
Crowd Control
Create documentation
TEACH it!
Quality: Live it, Learn it, Love it
Responsibility
Divide and conquer?!?
Stub record created at request time; Cataloging enhances
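The stub-record idea can be pictured as a two-step life cycle: a minimal record is created when the patron request comes in, and cataloging fills in the rest later. The sketch below is a hypothetical model of that division of labor. The class, field names, and status values are invented for the example and do not reflect the actual Scanning Database or repository schema.

```python
from dataclasses import dataclass, field


@dataclass
class ScanRecord:
    """A hypothetical record; all field names are assumptions."""
    identifier: str
    requested_item: str
    requester_note: str = ""
    title: str = ""
    subjects: list = field(default_factory=list)
    status: str = "stub"


def create_stub(identifier, requested_item, requester_note=""):
    """Step 1: staff handling the patron request capture only the basics."""
    return ScanRecord(identifier, requested_item, requester_note)


def enhance(record, title, subjects):
    """Step 2: cataloging adds descriptive metadata to the same record."""
    record.title = title
    record.subjects = list(subjects)
    record.status = "cataloged"
    return record


# Made-up example values.
stub = create_stub("req-0001", "Photograph, box 12, folder 3", "for a class project")
print(stub.status)
enhance(stub, "Campus view, looking north", ["Campus buildings"])
print(stub.status, "-", stub.title)
```

Splitting the work this way means the person scanning never has to apply controlled vocabularies, and the cataloger never has to handle the original item.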
Crowd Control
Create documentation
TEACH it!
Quality: Live it, Learn it, Love it
Responsibility
Divide and conquer
Give up
Less control, more power
Crowd Control
Would you want to try this?
Give yourself room to evolve and change through the project
Don’t feel like every image is a precious snowflake
More than any single technique, it’s the philosophy of crowdsourcing that’s important
Crowd Control
Would you want to try this?
Don’t feel like every image is a precious snowflake
Access to a low-quality scan… is still better than no access at all.
Would you want to try this?
More than any single technique, it’s the philosophy of crowdsourcing that’s important
Crowd Control
Attracting a Crowd
Letting Go
“Letting go” creates efficiencies
Looking at expertise across the Libraries
Distribute the burden
Move away from “trophy” collections
toward online Research Collections
Attracting a Crowd
Distributed Problem-solving
Ideas from Archives:
Organizing repository by subject rather than by collection
Dabbling in folder-level description (and digitization) rather than just item-level
Neutral Collection-building
Erway, Ricky and Jennifer Schaffner. 2007, “Gearing Up to Get Into the Flow.” Report produced by OCLC Programs and Research (formerly RLG)
Attracting a Crowd
Distributed Problem-solving
Ideas from Archives:
Using “stub records” from patron request forms
Dabbling in folder-level description (and digitization) rather than just item-level
“Neutral” Collection-building
Wikipedia-style collection-building
Building a collection with wide range
Attracting a Crowd
Mass digitization
Google projects:
Books
Newspapers
Mass decision-making
Instead of item-level decision-making
Attracting a Crowd
Making Digitization a Core Function of the Library
Mission Statements come to life!
Organizing around digitization – very little has really been done yet
Why? For researchers
“Fringe activities” need to become core investments
Metadata creation
Digitization
Council on Library and Information Resources (CLIR). No Brief Candle: Reconceiving Research Libraries for the 21st Century, 2008.
Gretchen Gueguen [email_address] East Carolina University, Greenville, North Carolina
Ann Hanlon [email_address] Marquette University, Milwaukee, Wisconsin
Are the highly selective models of digital content creation satisfying user demands for
increasing access to our vast collection holdings? In this era of decreasing library
budgets and increasing responsibilities, is such a level of staffing possible at any but the
well-funded libraries? As a recent article in the New York Times estimated, it would take
1,800 years for the National Archives to digitize its text holdings at the current rate of
digitization [1]. Since November 2005, the University of Maryland Libraries have engaged in
another model for digitization: a workflow model that harnesses the digitization already
being done by archivists and other staff for requests by patrons. By “crowdsourcing”
selection decisions in this way the libraries have built a collection of over 5,000 objects
from the holdings of the University Archives and Historical Manuscripts. This model is
based on two main principles:
· Selection: As one part of a programmatic approach to digitization, selections are
based on user request and added to the publicly accessible digital repository
· Image capture: Digitization itself proceeds on the premise that creating useful
surrogates is more important than digital reformatting.
The path to a successful workflow is fraught with perils, though.
The presenters will discuss the issues that have proven most effective and most difficult
in the large-scale digitization workflow in place at UM. They will highlight the technical
requirements chosen for images, metadata, and quality control and speak about how
they were, or in some cases were not, able to achieve them. In bringing to light these
issues we hope to continue an ongoing conversation (most recently articulated at
OCLC's "Digitization Matters" forum) about the purpose of digital collections and
standards of digital surrogate creation, especially in the age of mass digitization projects.
We hope to explore the need to harness all of the library’s expertise and resources where
they can best be deployed.