Datech2014 - Session 5 - Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts

Presentation of the paper "Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts" by Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades and Anna Cabré at DATeCH 2014. #digidays


Presentation Transcript

  • A Bimodal Crowdsourcing Platform for Demographic Historical Manuscripts
    Alicia Fornés, Josep Lladós, Joan Mas, Joana Maria Pujades, Anna Cabré
    Computer Vision Center and Centre for Demographic Studies, Universitat Autònoma de Barcelona
  • Index
    – Introduction
    – 5CofM project: The Barcelona Marriage Licenses
    – Bi-modal Crowdsourcing Platform
    – Contents view
    – Labeling view
    – Running experience
    – Generalization to other kinds of documents
    – Conclusions
  • 5CofM: Barcelona Marriage Licenses
    5CofM project: Five Centuries of Marriages
    – Advanced Grant, European Research Council
    – 2011 – 2016
    – Partners: Universitat Autònoma de Barcelona (UAB), Centre for Demographic Studies (CED), Computer Vision Center (CVC)
    – Aim: the project is based on data mining of the Llibres d'Esposalles conserved at the Archive of the Barcelona Cathedral. This extraordinary data source comprises 291 books of marriage license records, with information on approximately 610,000 unions celebrated in over 250 parishes of the Diocese between 1451 and 1905.
  • The Barcelona Marriage Licenses
    The marriage licenses contain information about:
    – The couple (groom and bride)
    – Their parents
    – Their occupation (job)
    – The place of origin
    – The parish (church) where they married
    – The fee that was paid (depending on their social class)
    [Figure: a license record annotated with the fields NAME, DATE, JOB, PLACE and FEE]
  • The Barcelona Marriage Licenses
    [Figure: an index page and a marriage licenses page]
  • The Barcelona Marriage Licenses
    "Llibres d'esposalles" from the Archives of the Barcelona Cathedral:
    – 244 books
    – From 1451 to 1905
    – Approximately 550,000 marriage licenses
    Ground truth:
    – From volume 69
    – 50 documents
    – 20 classes
    [Figure: an index page and a licenses page, with the husband's surname, the marriage license and the fee marked]
  • The Barcelona Marriage Licenses: Continuity
    [Figure: sample records from 1481 (volume 3), 1601 (volume 61), 1729 (volume 127) and 1860 (volume 200), each with the husband's surname, the marriage license and the fee marked]
  • The Barcelona Marriage Licenses: Fees
    Marriage license fees for the two-year period that starts on the first of May, 1627 and ends on the last day of April, 1629:
    – Dukes, Marquises, Counts and Viscounts: 12 ll
    – Noble knights and Lords of vassals: 2 ll 6 s
    – Knights, Honored Citizens and Bourgeois: 1 ll 4 s
    – Merchants, Notaries of Barcelona, Shopkeepers of distinguished materials, Chemists and Druggists: 12 s
    – Shopkeepers of materials, Royal Notaries, Surgeons, Traders, Solicitors, Middlemen and Artists: 6 s
    – The rest: 4 s
    – The poor ones, for the love of God: -
  • The Barcelona Marriage Licenses
    CED objectives (scholars):
    – Genealogical tree: ancestors / descendants
    – Immigration / emigration: family names appear / disappear; French surnames (descendants)
    – Population (by number of marriages): plagues, epidemics, baby boom
    – Parish churches: neighborhood is / becomes rich / poor
    – Evolution of a family name: jobs, fees (higher or lower)
    – Relationships between families: strategic, commercial reasons
    CVC objectives (computer scientists):
    – Layout analysis: text-line segmentation
    – Word spotting: query by example, query by string
    – Handwriting recognition
    – Syntactic analysis
  • Document Image Analysis: Tasks
    – Layout analysis: detect (crop) records, lines and words for subsequent recognition.
    – Full transcription: convert images to editable text.
    – Word spotting: given a query word, locate visually similar word snippets at image level.
    [Figure: a record segmented into BLOCKS, LINES and WORDS, with the sample transcription "dit dia rebere$ de Hieronym Ponsich corder de Bar^(a) fill de Jua$ Pon="]
  • Technical architecture
    [Figure: three spaces, the Image Space, the Transcription Space and the Contextual Knowledge Space. Diagram labels: Scanning; HW recognition; Crowdsourcing; Data mining (harmonization, record linkage); Exploitation]
  • Crowdsourcing platform
    – Manual transcription → tedious and time-consuming task
    – Crowdsourcing platform (divide and conquer): split and distribute a large number of small, simple tasks
    – Crowdsourcing architecture:
      – Image space (digitized documents)
      – Transcription space (extraction of information)
      – Contextual space (semantic meaning)
  • Crowdsourcing platform
    Web-based application that integrates two points of view:
    – Contents view: semantic information → demographic research
    – Labeling view: ground-truthing → document analysis research
    http://www.cvc.uab.es/5cofm/
  • Crowdsourcing platform: Administration. Managing documents and users
  • Crowdsourcing platform: User login
  • Contents view (semantics): Form filling
  • Contents view (semantics): Form filling (indices)
  • Contents view (semantics): Checking and correction. Check for possible spelling errors (words that appear only once?)
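The spelling check mentioned on this slide (flagging words that appear only once) can be sketched in a few lines. This is only an illustration under assumptions: the function name `rare_values`, the `max_count` threshold and the sample surnames are hypothetical, and the slides do not describe how the platform implements the check.

```python
from collections import Counter


def rare_values(values, max_count=1):
    """Return transcribed values seen at most `max_count` times, sorted.

    Values are compared case-insensitively after trimming whitespace.
    A rare value is only a candidate for review, not a confirmed error.
    """
    counts = Counter(v.strip().lower() for v in values if v.strip())
    return sorted(v for v, n in counts.items() if n <= max_count)


# Hypothetical column of transcribed surnames.
surnames = ["Teixidor", "Ferrer", "Teixidor", "Ferrera", "Ferrer", "Ponsich", "Ponsich"]
print(rare_values(surnames))  # ['ferrera']
```

A value flagged this way may still be correct (a genuinely rare surname), so the result is best treated as a review list for the transcriber.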
  • Contents view (semantics): Record Linkage
    – Record linkage → genealogical tree
    – A batch process searches for links between individuals: parents' marriage, brothers' and sisters' marriages
    – The search allows spelling variations:
      – String edit distance (Levenshtein), with different costs for substitutions
      – Useful for harmonization of names, surnames...
    – The expert decides the correct linkage from the candidates

    Year  Bride     Father          Mother  |  Year  Groom            Bride   Similarity
    1638  Jeronima  Lluis Teixidor  Paula   |  1606  Lluis Teixidor   Paula   1
    1638  Joana     Nicolau Ferrer  Antiga  |  1613  Nicolau Ferrera  Antiga  0.95
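A minimal sketch of the kind of matching this slide describes: a Levenshtein distance in which some substitutions cost less than others, turned into a similarity score. The substitution costs in `CHEAP_SUBS`, the function names, and the normalisation by the longer string's length are all assumptions made for illustration; the slides do not give the platform's actual costs or formula, so the scores here will not necessarily match the table on the slide.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Levenshtein distance with per-pair substitution costs.

    `sub_cost` maps (char, char) pairs to a cost; pairs that are not
    listed cost 1.0. Insertions and deletions cost `indel_cost`.
    """
    sub_cost = sub_cost or {}
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cost = 0.0
            else:
                cost = sub_cost.get((ca, cb), sub_cost.get((cb, ca), 1.0))
            curr.append(min(prev[j] + indel_cost,      # delete ca
                            curr[j - 1] + indel_cost,  # insert cb
                            prev[j - 1] + cost))       # substitute
        prev = curr
    return prev[-1]


def similarity(a, b, sub_cost=None):
    """Similarity in [0, 1]: 1 minus the distance over the longer length."""
    if not a and not b:
        return 1.0
    dist = weighted_edit_distance(a.lower(), b.lower(), sub_cost)
    return 1.0 - dist / max(len(a), len(b))


# Hypothetical cheap substitutions for common spelling variants.
CHEAP_SUBS = {("i", "y"): 0.25, ("s", "z"): 0.25}

print(similarity("Lluis Teixidor", "Lluis Teixidor"))  # 1.0
print(weighted_edit_distance("lluis", "lluys", CHEAP_SUBS))  # 0.25
```

Candidates above a chosen similarity threshold would then be shown to the expert, who makes the final linkage decision as the slide states.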
  • Labeling view (annotation): Transcription (lines)
    Literal transcription → ground truth for handwriting recognition methods
  • Labeling view (annotation): Word Labeling
    Word metadata:
    – Bounding box (coordinates)
    – Category (e.g. groom's name, occupation...)
    – The system computes the correspondence automatically → the user validates!
    Integrated platform: puts the contents view and the labeling view into correspondence
  • Running Experience: advantages
    – Digital source: no need to go to the Archive; no timetable limitations
    – Parallelization: many users work simultaneously
    – Centralization: easier management of images, users, database...; easy to see "who works on what"
    – Automatic control: the system forces users to fill in some fields and raises warnings; useful for detecting spelling errors (auto-correction)
  • Running Experience: advantages (continued) and disadvantages
    Advantages:
    – Security: frequent backups; users can view the documents assigned to them, but not download them
    – Monitoring: the administrator can monitor the users' work and provide feedback
    – Visualization and comfort: drag (move), zoom in/out
    Disadvantages:
    – An Internet connection is always needed
    – If the system is down (e.g. for maintenance), no one can work
  • Generalization to other demographic manuscripts
    – The platform has been adapted for census documents
  • Conclusions
    – Web-based crowdsourcing platform for demographic manuscripts
    – Integrates the needs of demographers and computer scientists
    Future directions:
    – Improve validation: combine the output of several users; compare with the output of document analysis techniques
    – Mobile-based applications: for crowdsourcing → faster ground-truth generation; for browsing and searching → user-friendly interfaces
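Combining the output of several users is listed above only as a future direction, and the slides do not say which method is planned. One common baseline is a majority vote with an agreement threshold, sketched here as an assumption; the function name `combine_transcriptions` and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter


def combine_transcriptions(answers, min_agreement=0.5):
    """Majority vote over several users' transcriptions of one field.

    Returns (winner, agreement). The winner is None when the share of
    users giving the most frequent answer does not exceed
    `min_agreement`, so the field can be sent to an expert instead.
    """
    if not answers:
        return None, 0.0
    counts = Counter(a.strip() for a in answers)
    winner, votes = counts.most_common(1)[0]
    agreement = votes / len(answers)
    return (winner if agreement > min_agreement else None), agreement


# Three hypothetical users transcribing the same surname.
print(combine_transcriptions(["Ferrer", "Ferrer", "Ferrera"]))
```

With the default threshold, a two-way tie yields no winner, which keeps ambiguous fields out of the accepted data until someone reviews them.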
  • Crowdsourcing on mobile devices
    Initial (29 pages). Redundancy: each task is solved by different people, R = 5. s/T = seconds per task; T/P = tasks per page; P = pages.
    – Task 1, page layout: R · 30 s/T · 1 T/P · 29 P
    – Task 2, bounding box: R · 30 s/T · 18 T/P · 29 P
    – Task 3, word segmentation: R · 10 s/T · 360 T/P · 29 P
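The products on this slide can be evaluated directly. The sketch below reads each product R · s/T · T/P · P as the total worker time for that task, which is an interpretation rather than something the slide states; the dictionary and function names are hypothetical.

```python
# (seconds per task, tasks per page) for each task, as given on the slide.
TASKS = {
    "page layout": (30, 1),
    "bounding box": (30, 18),
    "word segmentation": (10, 360),
}


def total_seconds(seconds_per_task, tasks_per_page, pages=29, redundancy=5):
    """Total worker time: R * (s/T) * (T/P) * P."""
    return redundancy * seconds_per_task * tasks_per_page * pages


for name, (s_per_t, t_per_p) in TASKS.items():
    secs = total_seconds(s_per_t, t_per_p)
    print(f"{name}: {secs} s = {secs / 3600:.2f} h")
# page layout: 4350 s = 1.21 h
# bounding box: 78300 s = 21.75 h
# word segmentation: 522000 s = 145.00 h
```

Under this reading, word segmentation dominates the effort, because a page yields 360 such tasks against 18 bounding-box tasks and a single layout task.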
  • Browsing the marriage licenses on a mobile device
  • Thank you!!