CROWDSOURCING IN THE DIGITALKOOT PROJECT
Majlis Bremer-Laamanen
IMPACT, 24 October 2011
Microtask.com: "Digitalkoot: Making Old Archives Accessible Using Crowdsourcing" by Otto Chrons and Sami Sundell
Discussions: Managing Director Harri Holopainen, email@example.com

The Centre for Preservation and Digitisation: statistics
• Established in 1990
• Digitisation started in 1998
• Over 50 employees
• Yearly average (past three years):
– Microfilm production: 1.3 million exposures
– Digitisation: 1.3 million pages
– Audio digitisation and cataloguing: music, 1,300 unique cassettes and their sleeves
– Conservation: 10,000-15,000 units

ENRICHING CONTENT
(http://digi.nationallibrary.fi, http://www.doria.fi/handle/10024/4194)
• Newspapers: 2 million pages, the Historical Newspaper Library
• Journals: 2.7 million pages, free to 1910, in all legal deposit libraries to 1944
• Books: travel, novels, dissertations (17th century), Save the Book
• Ephemera: industrial price lists
• Sound: national sound archive, C-cassettes
• Interest groups: the creators, users and contributors of the material

Context for mass digitisation and crowdsourcing
[Diagram: workflow between the client and the Centre for Preservation and Digitisation. Labels: Client, Accessibility, Retrieval, Physical objects, Preparation for digitisation, Digitisation, Post-processing, Temporary storage for digitised objects, Transferring]
Mass digitisation activities in the most cost-effective manner: newspapers, books, journals, ephemera, audio:
• Logistics for physical items
• Process for digital objects: network services and long-term preservation
• METS/ALTO metadata: capturing through the process
• Metadata development: user experience and crowdsourcing
• Customizing of the tracking systems (CCS, Item Tracking, Scan Client)
• Operational environment: scaling architecture and implementation

DIGITALKOOT
DIGI = TO DIGITISE
TALKOOT = PEOPLE GATHERING TO WORK TOGETHER VOLUNTARILY (WITHOUT PAYMENT)
FIRST EXPERIENCE 2011:
DIGITALKOOT: correction of OCR by gamification, turning useful activities into games: “THE MOLE HUNT” by Microtask.com.
– People can spend hours on games
– Turning useful activities into games
– Activities can be rewarded with scores, achievements and social benefits
From February 8th to September 15th, 2011: about 80,000 visitors, 4,000 hours of effective game time. More than 5 million tasks.

CHALLENGES
Meaningful tasks without breaking the flow of the game
Real-time feedback: many simultaneous players doing the same task
Build a bridge to save the moles from falling down =>
– Correct typing gives you a block for the bridge
– Incorrect typing is punished by an explosion

GAMIFICATION CHALLENGES
Balancing game play elements with task completion speed and accuracy
Keeping people motivated and enlarging the audience
Introducing meaningful tasks into the game without breaking game play mechanisms
Instant feedback on players' actions (simultaneous players)
• pressure to adapt to varying feedback situations/latencies

POSITIVE EFFECT OF VERIFICATION
“The wisdom of the crowds”
• includes answers from possible spammers
Game start: verification tasks only
Accurate work shown => verification lowered in phases, never to zero
Verification tasks are created automatically:
• A randomly selected task is sent to several players; if all agree on the result, it becomes a verification task

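The two rules on this slide can be sketched in Python. This is an illustrative sketch only, not Microtask's implementation: the function names, the phase thresholds, the shares and the minimum number of players are all assumptions. The slide states only that a player starts with verification tasks only, that the share is lowered in phases once accurate work is shown, that it never reaches zero, and that a task becomes a verification task when all players agree on the result.

```python
from typing import List, Optional

# Hypothetical phases: (verified-correct answers needed, share of verification tasks).
# The numbers are illustrative; the slides give no thresholds.
PHASES = [(0, 1.0), (20, 0.5), (100, 0.2), (500, 0.05)]


def verification_share(correct_verifications: int) -> float:
    """Share of a player's task stream that consists of verification tasks."""
    share = PHASES[0][1]  # game start: verification tasks only
    for threshold, phase_share in PHASES:
        if correct_verifications >= threshold:
            share = phase_share  # lowered in phases, last phase is still above zero
    return share


def promote_to_verification(answers: List[str], min_players: int = 3) -> Optional[str]:
    """Return the agreed answer if every player typed the same word, else None."""
    cleaned = {answer.strip() for answer in answers}
    if len(answers) >= min_players and len(cleaned) == 1:
        return cleaned.pop()
    return None
```

With these assumed phases, `verification_share(0)` is 1.0 and the share never drops below 0.05, however much accurate work a player has shown.
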
VERIFICATION OF THE OCR
Players and their pace cannot be synchronized.
Verification tasks are added to the task stream:
• The number fed to players varies according to the number of active players
• The system knows the answer: game play is improved by fast feedback
• Downside: no new information is produced

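Why a verification task allows fast feedback can be sketched as follows. This is a hedged Python sketch, not the Digitalkoot code: the `Task` class, the `handle_answer` function and the feedback strings are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    image_id: str                       # hypothetical identifier of the word image
    known_answer: Optional[str] = None  # set only for verification tasks
    collected: List[str] = field(default_factory=list)  # answers to real tasks


def handle_answer(task: Task, typed: str) -> str:
    """Return the feedback that can be given to the player right away."""
    typed = typed.strip()
    if task.known_answer is not None:
        # Verification task: the system already knows the answer, so the
        # player gets instant feedback, but nothing new is learned.
        return "correct" if typed == task.known_answer else "incorrect"
    # Real task: the answer is stored and can only be judged later,
    # once other players have answered the same task.
    task.collected.append(typed)
    return "pending"
```

In this sketch only real tasks add to `collected`, which mirrors the downside named on the slide: a verification task yields feedback for the player but no new information for the archive.
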
USERS: February 8th to March 31st, 2011
31,816 visitors, 4,768 players, 2,740 hours of game time, 2.5 million tasks.
1 % via Internet, 99 % via Facebook
Half of the users were men.
Game time: from seconds to over 100 hours (altogether).
Median time: => 9 minutes.
Women: >13 minutes and 54 % of the tasks
The hardest-working top 4 were all men

ACCURACY
OCR system: confidence threshold of 0.8 for accuracy => human correction in 30 %
Random selection of 2 articles:
• 1,467 words: only 14 mistakes in the Digitalkoot result, against 228 in the OCR
• 516 words: 1 mistake in the Digitalkoot result, against 118 in the OCR
• >> well over 99 % possible by gamification
Spammer play:
• One player (1.5 hours, 5,692 tasks) was detected by the verification system, and only 4 tasks were accepted

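The accuracy figures follow from the word and mistake counts on this slide. The short Python sketch below does the arithmetic, assuming each mistake affects exactly one word (the slide does not say how mistakes were counted); the function name `word_accuracy` is introduced here for illustration.

```python
def word_accuracy(total_words: int, mistakes: int) -> float:
    """Percentage of words without a mistake."""
    return 100.0 * (total_words - mistakes) / total_words


# (words, mistakes after Digitalkoot, mistakes in the raw OCR), from the slide
samples = [
    (1467, 14, 228),
    (516, 1, 118),
]

for words, crowd, ocr in samples:
    print(
        f"{words} words: OCR {word_accuracy(words, ocr):.1f} %, "
        f"Digitalkoot {word_accuracy(words, crowd):.1f} %"
    )
```

Under that assumption the first article goes from about 84.5 % to 99.0 % and the second from about 77.1 % to 99.8 %.
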
Enriching Digitisation Production Processes, METS Profiles: a new development platform
[Diagram: production flow from physical collections to digital collections. Recoverable labels:]
• Source material (physical collections): newspapers, serials, books, parchments, notes, maps, audio
• Cataloguing: two bibliographic records
• Scanning: JPEG2000, JPEG (150)
• OCR: TXT as ALTO XML, PDF
• Post-processing and mark-up: articles, illustrations, poems
• METS export: METS SIP packages; formats METSXML, MARCXML
• Packages include: structural metadata (METS, ALTO compliant), administrative/technical metadata (MIX/PREMIS), descriptive metadata (MARC21/MODS)
• Standards & OAI-PMH
• Digital resource: comprehensive level of digital collections

IN THE MEDIA
- Until March 31st, over 30 articles all around the world: New York Times…
- Television appearances ongoing
- Helsingin Sanomat: HS talkoot, using the National Library's digitised newspaper material (Historical Newspaper Library) > advertising Digitalkoot, e.g. September 15th
- Influenced user interest => stabilisation to 300 individual users per week

KUVATALKOOT
Goal: sophisticated user experience
Collections discovery and reuse of digital content by researchers and people at large: researchers will get better systematic coverage of images and articles in published printed material.
[Image: Luonnon-kirja ala-alkeiskouluin tarpeeksi / Z. Topelius, 1868]