Overview of Lincoln Paper Design

This set of slides was presented to the Illinois Program for Research in the Humanities at the University of Illinois at Urbana-Champaign on 02-27-2009.

Published in: Technology, Education

  1. 1. Emancipating Digital Data: The Lincoln Digitization Project Peter Bajcsy, PhD - Research Scientist, NCSA - Adjunct Assistant Professor, ECE & CS at UIUC - Associate Director, Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
  2. 2. Outline • Introduction to Lincoln project • Emancipating Digital Data: The Lincoln Digitization Project • From Large Volumes Of Scanned Lincoln Papers To Virtual Observatories • Image Cropping • Georeferencing and Re-Projections of Historical Maps • System Architecture for Web-based Delivery of Information and Services • Delivering Layered Information and Providing Services • Summary
  3. 3. Acknowledgement • Funding Agencies: • NASA, NARA, NSF, NIH, NAVY, DARPA, ONR, NCSA Industrial Partners, NCSA Internal, COM UIUC, State of Illinois, UIUC Provost, NCSA International Partners, Google Summer Code Partners • Full Time Employees: • Peter Bajcsy, Rob Kooper, Michal Ondrejcek, Kenton McHenry, Jason Kastner and Luigi Marini • Students: • Andrew Spencer, Hye Jung Na, Suk Kyu Lee, Rahul Malik, William McFadden, Chandra Ramachandran, Ben Raichel, Maryam Moslemi Naeini • Collaborators on Lincoln Project: • Daniel Stowell & Stacy McDermott, Lincoln Library in Springfield, IL; Vernon Burton & Kevin Franklin, I-CHASS, UIUC; Melvin Casares and Jose Castro from Instituto Tecnológico de Costa Rica (ITCR); Piotr Wendykier and James Nagy from Emory University, Atlanta
  4. 4. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES INTRODUCTION Imaginations unbound
  5. 5. Background • The Papers of Abraham Lincoln is a research initiative with the ultimate goal of making all writings by America's 16th president available on-line. • Complex workflow process from paper documents to on-line virtual observatories • Multiple end users • Researchers • General public
  6. 6. Input: Paper Copies of Docs & Metadata
  7. 7. Output: Multi-dimensional Views The Lincoln Log, A Chronology compiled by the Lincoln Sesquicentennial Commission: http://www.thelincolnlog.org/ DIMENSIONS Hyperlink to the Lincoln Log (temporal representation) Hyperlink to the Markers (spatial representation) Hyperlinks to the Image scans (document content).
  8. 8. Output: Hyperlinked Multi-media Views • Audio (e.g., music of Lincoln’s time) • Images and maps • Video • 3D objects (e.g., musical instruments) IMAGES SONGS
  9. 9. Output: Services to Search, Display and Transcribe Digital Data • Google service • Transcription service • Search service The Lincoln Log, A Chronology compiled by the Lincoln Sesquicentennial Commission: http://www.thelincolnlog.org/
  10. 10. Output: On-line Virtual Observatory • Digital Information Organization • Multi-dimensional views in time, space and document dimensions • Hyperlinked multi-media views including all existing n-dimensional data • Computational Services to Operate on Digital Data • Search • Layered display with third party data • Transcription of documents • Educational Services to Enable Learning • Simple demonstrations • Homework exercises • Support of forensic studies
  11. 11. From Input to Output: A Few Key Components 1. Cropping of scanned documents (algorithm, accuracy & robustness, scalability, computational resources). 2. Cleaning and parsing of metadata obtained from The Lincoln Log and The Papers of Abraham Lincoln in Springfield (Lat, Lng, places, ASCII characters, populating MySQL Database etc.) 3. Designing an underlying architecture of information storage and retrieval 4. Geo-referencing and re-projection of historical maps. 5. Building web-based interfaces and providing services (programming against Google Maps API, Database API, Ajax/JavaScript requests using PHP and MySQL). http://isda.ncsa.uiuc.edu/lpapers/index.html
  12. 12. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES IMAGE CROPPING
  13. 13. Image Cropping: Understanding Variability of Document Scans • Background paper color and intensity • Ink color and intensity • Density of writing • Color scale bar position • Task: Automatically classify images for pre-processing and remove the Kodak color scale bar if needed.
  14. 14. Image Cropping Approach Training Classify Crop Output
  15. 15. Humanities & High Performance Computing • Assuming that the world is perfect … • Image cropping: 300,000 files times 60 seconds per file = 5,000 hours = 208.3 days • Other operations such as file format conversions (TIFF->PDF), pyramid construction for web deployment • Storage requirements for original (100K-300K images ~ 45 Terabytes), cropped (?) and pyramid representation for fast retrieval over the Internet (?) • Need to join forces and form interdisciplinary teams • The storage requirements and preservation: NCSA mass storage • The CPU requirements: parallel codes to utilize HPC
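The cropping estimate on this slide can be checked with a few lines of arithmetic. The sketch below reproduces the 300,000 files at 60 seconds per file figure; the `workers` parameter and the assumption of perfect parallel scaling are illustrative additions, not figures from the slides.

```python
def cropping_time(num_files, seconds_per_file, workers=1):
    """Return (hours, days) needed to crop all files.

    `workers` is a hypothetical parameter: it assumes the work divides
    evenly across processors with no overhead.
    """
    total_seconds = num_files * seconds_per_file / workers
    hours = total_seconds / 3600
    days = hours / 24
    return hours, days


# Sequential estimate from the slide.
hours, days = cropping_time(300_000, 60)
print(f"{hours:,.0f} hours = {days:.1f} days")

# Hypothetical: the same job spread over 64 workers.
hours_64, days_64 = cropping_time(300_000, 60, workers=64)
print(f"{hours_64:.1f} hours = {days_64:.2f} days on 64 workers")
```

Real speedups would be lower than this ideal because of I/O and scheduling overhead, which is one reason the slide pairs the CPU requirement with mass storage.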
  16. 16. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES GEO-REFERENCING AND RE-PROJECTION OF HISTORICAL MAPS
  17. 17. Georeferencing Historical Maps • Goal: to overlay historical maps on top of Google Maps • Challenges: Geodetic information is not always available. • The geodetic coordinate system consists of a datum, a projection, an origin, a unit system and two axes. • Target Projection: Google Maps uses WGS84, Mercator projection and a pixel unit system. • Most of the maps of the United States are in a conical projection (Lambert Conformal Conic or Albers Equal Area) or in the Mollweide pseudocylindrical projection.
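To make the target projection concrete, the following sketch converts a WGS84 latitude and longitude into Mercator pixel coordinates. It assumes the commonly documented convention of a 256-pixel square world at zoom level 0 that doubles in size with each zoom level; the function name and the clamping constant are illustrative choices, not taken from the slides.

```python
import math

TILE_SIZE = 256  # assumed world size in pixels at zoom 0


def latlng_to_pixel(lat_deg, lng_deg, zoom):
    """Project WGS84 degrees to Mercator pixel coordinates at a zoom level.

    The origin is the top-left corner of the world map, x grows eastward
    and y grows southward.
    """
    scale = TILE_SIZE * (2 ** zoom)
    x = (lng_deg + 180.0) / 360.0 * scale
    # Clamp to avoid infinities at the poles, where Mercator diverges.
    siny = min(max(math.sin(math.radians(lat_deg)), -0.9999), 0.9999)
    y = (0.5 - math.log((1 + siny) / (1 - siny)) / (4 * math.pi)) * scale
    return x, y


# The equator and prime meridian land at the center of the zoom-0 world.
print(latlng_to_pixel(0.0, 0.0, 0))
```

A historical map drawn in a conic projection has to be warped into this pixel grid before it can sit on top of the base map as an overlay.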
  18. 18. Layered Geospatial Information: Google Map Example
  19. 19. Geospatial Characteristics: Neighborhoods
  20. 20. Example of Map Georeferencing • Software: Used Global Mapper • Projections shown: Albers equal-area conic, Lambert's conformal conic, Mercator cylindrical, Mollweide pseudocylindrical • In our case the projection does not have to be exact. For small areas in the Mollweide projection, for example, a simple perspective correction can be sufficient for the map of the US 1861-1865.
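A simple perspective correction of the kind mentioned on this slide can be sketched as fitting an eight-parameter perspective transform from four control points. The slides state that Global Mapper was used, so this is only an illustration of the idea, not the project's implementation; the function names and all coordinates below are invented for the example.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("control points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_perspective(src, dst):
    """Fit h0..h7 so that each src point (x, y) maps to its dst point (u, v).

    u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1)
    v = (h3*x + h4*y + h5) / (h6*x + h7*y + 1)
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def apply_perspective(h, x, y):
    """Map one point of the scanned map into the target coordinate system."""
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical control points: corners of a scanned map (pixels) matched
# to made-up target pixel positions on the base map.
scan_corners = [(0, 0), (1000, 0), (1000, 800), (0, 800)]
base_corners = [(210, 95), (1180, 120), (1150, 905), (190, 870)]
h = fit_perspective(scan_corners, base_corners)
print(apply_perspective(h, 500, 400))
```

Four control points determine the transform exactly; with more points, a least-squares fit would be the usual extension.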
  21. 21. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES DESIGNING AN UNDERLYING ARCHITECTURE OF VIRTUAL OBSERVATORIES
  22. 22. Software Architecture Design The front-end consists of an HTML file with Google Map loaded, a JavaScript script, and a search form with pre-defined data sets. The client-side HTML and JavaScript files make requests to the server. The server side consists of a PHP file which bridges the gap between the Ajax requests and the MySQL database. The result is returned as an XML response to the Ajax engine.
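The request flow described on this slide (search request, database query, XML response) can be sketched in a few lines. The project used PHP and MySQL; the sketch below swaps in Python with an in-memory SQLite database purely so that it runs on its own. The table name, column names, XML element names and the sample row are all hypothetical placeholders.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_marker_xml(conn, year):
    """Query documents for one year and serialize them as an XML string.

    This plays the role of the server-side script: the client would parse
    the returned XML and place one map marker per <marker> element.
    """
    rows = conn.execute(
        "SELECT title, lat, lng FROM documents WHERE year = ? ORDER BY title",
        (year,),
    ).fetchall()
    root = ET.Element("markers")
    for title, lat, lng in rows:
        ET.SubElement(root, "marker", title=title, lat=str(lat), lng=str(lng))
    return ET.tostring(root, encoding="unicode")


# Hypothetical schema and placeholder data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (title TEXT, lat REAL, lng REAL, year INTEGER)"
)
conn.execute("INSERT INTO documents VALUES ('Sample letter', 40.0, -89.0, 1862)")
print(build_marker_xml(conn, 1862))
```

Passing the year as a bound parameter rather than concatenating it into the SQL string keeps user input from the search form out of the query text.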
  23. 23. Data Storage and Organization
  24. 24. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES DELIVERING LAYERED INFORMATION AND PROVIDING SERVICES
  25. 25. Multi Dimensional View of Lincoln Papers • Delivering Layers of Information (geospatial: historical maps and current maps; temporal: Lincoln Log; relational: source & destination links; content: document scans)
  26. 26. User Interface Information in time, space and document dimensions. Time Space Search
  27. 27. Providing Search Services
  28. 28. Providing Transcription Services
  29. 29. Safety Guards for Transcription Services Standards for email addresses: RFC822 (published in 1982) defines, amongst other things, the format for internet text message (email) addresses. RFC Valid e-mail addresses • abc@example.com • Abc@example.com • aBC@example.com • abc.123@example.com • "abc@def"@example.com • "Abc@def"@example.com • 1234567890@example.com • _______@example.com • abc+mailbox/department=shipping@example.com • !#$%&'*+-/=?^_`.{|}~@example.com • "Fred "quota" Bloggs"@example.com • "Fred Bloggs"@example.com • "JoeBlow"@example.com • customer/department=shipping@example.com • $A12345@example.com • !def!xyz%abc@example.com • _somename@example.com RFC Invalid e-mail addresses • Abc.example.com (character @ is missing) • Abc.@example.com (character dot(.) is last in local part) • Abc..123@example.com (character dot(.) is double) • A@b@c@example.com (only one @ is allowed outside quotation marks) • ()[];:,<>@example.com (none of the characters before the @ is allowed outside quotation marks)
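The rules illustrated by the valid and invalid examples above can be turned into a small checker. The sketch below is a simplified approximation, not a full RFC parser and not the project's actual validation code: it checks only the local part, accepts any quoted local part without inspecting its contents, and does not validate the domain. The function names are illustrative.

```python
import string

# Characters allowed in an unquoted local part, besides the dot.
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def split_address(address):
    """Split at the single '@' outside double quotes, or return None."""
    in_quotes = False
    at_positions = []
    for i, ch in enumerate(address):
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "@" and not in_quotes:
            at_positions.append(i)
    if in_quotes or len(at_positions) != 1:
        return None
    i = at_positions[0]
    return address[:i], address[i + 1:]


def is_plausible_email(address):
    """Apply the local-part rules shown on the slide (simplified)."""
    parts = split_address(address)
    if parts is None:
        return False
    local, domain = parts
    if not local or not domain:
        return False
    if len(local) >= 2 and local.startswith('"') and local.endswith('"'):
        return True  # quoted local part: contents not checked in this sketch
    if local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    return all(ch in ATEXT or ch == "." for ch in local)


for candidate in ["abc.123@example.com", '"abc@def"@example.com',
                  "Abc..123@example.com", "A@b@c@example.com"]:
    print(candidate, is_plausible_email(candidate))
```

A check like this rejects obviously malformed addresses at the form level before a transcription is accepted; it does not confirm that the mailbox exists.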
  30. 30. What Would You Learn? In this example a letter was sent from Fort Randall to President Abraham Lincoln on October 26, 1862. The bits of information about the document (metadata), namely the time, the location of the sender and the location of President Lincoln, are known. The letter path is visualized in Google Maps, and the document can be retrieved from the database and edited. Additionally, the user can overlay one of the historical maps. The markers are positioned with high accuracy based on the latitude and longitude of historical sites.
  31. 31. Summary • Design and implementation of automated document cropping. • Integration of spatial, temporal and document information. • Design and prototype of a web-based user interface to heterogeneous data. • The system is available at http://isda.ncsa.uiuc.edu/lpapers/search.html • We would be excited if you would find the system useful in your research or education!
