SlideShare a Scribd company logo
1 of 31
Download to read offline
Emancipating Digital
 Data: The Lincoln
 Digitization Project
 Di iti ti    P j t
  Peter Bajcsy, PhD
  -RResearch S i ti t NCSA
             h Scientist,
  - Adjunct Assistant Professor ECE
  & CS at UIUC
  - Associate Director Center for
  Humanities, Social Sciences and
  Arts (CHASS), Illinois Informatics
  Institute (I3), UIUC

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Outline
• Introduction to Lincoln project
   • Emancipating Digital Data: The Lincoln Digitization Project
   • From Large Volumes Of Scanned Lincoln Papers To Virtual
     Observatories
• Image Cropping
      g      pp g
• Georeferencing and Re-Projections of Historical Maps
• System Architecture for Web-based Delivery of
  Information and Services
  I f     ti    dS     i
• Delivering Layered Information and Providing Services
• Summary
Acknowledgement
•   Funding Agencies:
     • NASA, NARA, NSF, NIH, NAVY, DARPA, ONR, NCSA Industrial
       Partners, NCSA Internal, COM UIUC, State of Illinois, UIUC
       Provost,
       Provost NCSA International Partners Google Summer Code
                                   Partners,
•   Full Time Employees:
     • Peter Bajcsy, Rob Kooper, Michal Ondrejcek, Kenton McHenry,
       Jason Kastner and Luigi Marini
•   Students:
     • Andrew Spencer, Hye Jung Na, Suk Kyu Lee, Rahul Malik, William
       McFadden,
       McFadden Chandra Ramachandran, Ben Raichel, Maryam
                            Ramachandran         Raichel
       Moslemi Naeini
•   Collaborators on Lincoln Project:
     • Daniel Stowell & Stacy McDermott Lincoln Library in Springfield
                              McDermott,                    Springfield,
       IL; Vernon Burton & Kevin Franklin, I-CHASS, UIUC; Melvin
       Casares, and Jose Castro from Instituto Tecnológico de Costa
       Rica (ITCR); Piotr Wendykier and James Nagy from Emory
            (     );           y                  gy            y
       University Atlanta
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  INTRODUCTION




Imaginations unbound
Background

• The Papers of Abraham Lincoln is a research
  initiative with the ultimate goal of making all
  writings by America's 16th president available
  on-line.

• Complex workflow process from paper
  documents to on-line virtual observatories
• Multiple end users
   • R
     Researchers
             h
   • General public
Input: Paper Copies of Docs & Metadata
Output: Multi-dimensional Views




The Lincoln Log, A Chronology compiled by
               g,             gy p      y
the Lincoln Sesquicentennial Commission:
http://www.thelincolnlog.org/


DIMENSIONS
Hyperlink to the Lincoln Log (temporal
representation)
Hyperlink to the Markers (spatial
representation)
Hyperlinks to the Image scans
(document content).
Output: Hyperlinked Multi-media Views

   •   Audio (e.g., music of Lincoln’s time)
   •   Images and maps
       I         d
   •   Video
   •   3D objects (e g musical instruments)
                   (e.g.,

                                               IMAGES




                       SONGS

Imaginations unbound
Output: Services to Search, Display and
  Transcribe Digital Data
                                            Google service




The Lincoln Log, A Chronology compiled by
               g,             gy p      y                        Transcription service
the Lincoln Sesquicentennial Commission:
http://www.thelincolnlog.org/




                                                Search service
Output: On-line Virtual Observatory
  • Digital Information Organization
       • Multi-dimensional views in time, space and document
         dimensions
       • Hyperlinked multi-media views including all existing n-
         dimensional data
  • Computational Services to Operate on Digital Data
       • Search
       • Layered display with third party data
       • Transcription of documents
  • Educational Services to Enable Learning
       • Simple demonstrations
       • Homework exercises
       • Support of forensic studies


Imaginations unbound
From Input to Output: A Few Key Components
1. Cropping of scanned documents (algorithm, accuracy &
   robustness, scalability, computational resources).
2.
2 Cleaning and parsing of metadata obtained from The
   Lincoln Log and The Papers of Abraham Lincoln in
   Springfield (Lat, Lng, places, ASCII characters, populating
   MySQL Database etc.)
3. Designing an underlying architecture of information
   storage and retrieval
         g
4. Geo-referencing and re-projection of historical maps.
5. Building web-based interfaces and providing services
   (Programming against Google Maps API Database
                                         API,
   Ajax/Javascript requests using PHP and mySQL).
   http://isda.ncsa.uiuc.edu/lpapers/index.html
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  IMAGE CROPPING




Imaginations unbound
Image Cropping: Understanding Variability
    of Document Scans

•   Background paper color and
    intensity
•   Ink color and intensity
•   Density of writing
•   Color scale bar position


• Task: Automatically
  classify images for pre-
  p
  processing and
             g
  remove the Kodak
  color scale bar if
  needed.
  needed
Image Cropping Approach


         Training




         Classify            Crop




                    Output
Humanities & High Performance Computing

   • Assuming that the world is perfect ….
        • Image cropping 300 000 files times 60 seconds per file = 5 000
                 cropping: 300,000                                 5,000
          hours = 208.3 days
        • Other operations such as file format conversions (TIFF->PDF),
          pyramid construction for web deployment
        • Storage requirements for original (100K-300K images ~ 45
          Terabytes), cropped (?) and pyramid representation for fast
          retrieval over the Internet (?)
   • Need to joint forces and form interdisciplinary teams
        • The storage requirements and p
                   g    q              preservation – NCSA mass
          storage
        • The CPU requirements – parallel codes to utilize HPC



Imaginations unbound
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  GEO-REFERENCING AND RE-PROJECTION
  OF HISTORICAL MAPS




Imaginations unbound
Georeferencing Historical Maps

 • Goal: to overlay historical maps on top of Google
   Maps
 • Challenges: Geodetic information is not always
   available.
   available
      • The geodetic coordinate system consists of a datum, a
        projection, an origin, a unit system and two axis.
 • T
   Target Projection: G
        t P j ti      Google M
                          l Maps uses WGS84
                                          WGS84,
   Mercator projection and a pixel unit system.
      • Most of the maps of the United States are in conical projection
                                                             projection,
        Lambert Conformal Conic and Albers Equal Area or in Molweide
        Pseudocylindrical Projection.



Imaginations unbound
Layered Geospatial Information: Google Map
Example
Geospatial Characteristics: Neighborhoods
Example of Map Georeferencing
   • Software: Used Global Mapper




        Albers equal-area conic
        Lambert's conformal conic                Mercator cylindrical
        Mollweide pseudocylindrical


In our case the projection does not have to be exact. For
small areas in Molweide projection, for example a simple
perspective correction can be sufficient for the map of the
US 1861-1865
Imaginations unbound
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  DESIGNING AN UNDERLYING
  ARCHITECTURE OF VIRTUAL
  OBSERVATORIES




Imaginations unbound
Software Architecture Design




The front-end consists of a HTML file with Google Map loaded, a JavaScript
  script, and a search form with pre-defined data sets. The client-side HTML
  and JavaScript files make requests to the server. The server-side consists
  of a PHP file which bridges the gap between Ajax and connects to MySQL
  database. The result is returned as an XML response to the Ajax engine.
 Imaginations unbound
Data Storage and Organization




Imaginations unbound
FROM LARGE VOLUMES OF SCANNED
  LINCOLN PAPERS TO VIRTUAL
  OBSERVATORIES

  DELIVERING LAYERED INFORMATION
  AND PROVIDING SERVICES




Imaginations unbound
Multi Dimensional View of Lincoln Papers

• Delivering Layers of Information (geospatial – historical
  maps and current maps, temporal – Lincoln log, relational –
     p                  p ,     p                g,
  source & destination links, content – document scans
User Interface
                 Information in
                 time, space and
                 document
                 dimensions.

                      Time
                      Ti

                      Space




                      Search
Providing Search Services
Providing Transcription Services
Safety Guards for Transcription Services


          RFC Valid e-mail addresses
               abc@example.com
               Abc@example.com
               aBC@example.com
             abc.123@example.com
             abc 123@example com
            "abc@def"@example.com
            "Abc@def"@example.com
           1234567890@example.com                               Standards for email addresses: RFC822
             _______@example.com
abc+mailbox/department=shipping@example.com
abc mailbox/department shipping@example.com
                                                                (published in 1982) defines, amongst other
       !#$%&'*+-/=?^_`.{|}~@example.com                         things, the f format for internet text message
                                                                                       f
     "Fred "quota" Bloggs"@example.com                        (email) addresses.
           "Abc@def"@example.com
          "Fred Bloggs"@example.com
           "JoeBlow"@example.com
 customer/department=shipping@example.com
             $A12345@example.com
                                                       RFC Invalid e-mail addresses
          !def!xyz%abc@example.com
                                                 Abc.example.com (character @ is missing)
           _somename@example.com
                                           Abc.@example.com (character dot(.) is last in local part)
                                             Abc..123@example.com (
                                                     @       p        (character dot(.) is double)
                                                                                    ()           )
                                   A@b@c@example.com (only one @ is allowed outside quotations marks)
                       ()[];:,<>@example.com (none of the characters before the @ is allowed outside quotation marks)
What Would You Learn ?




In this example a letter was sent from Fort Randall to President Abraham
Lincoln on October 26, 1862. The bits of information about the document
(metadata) namely the time, the location of a sender and the location of
President Lincoln are known. The letter path is visualized in Google
Maps, the document can be retrieved from the database and edited.
Additionally, user can overlay one of the historical maps. The markers are
positioned with hi h accuracy b
    iti   d ith high             based on th l tit d and l
                                       d     the latitude   d longitude of
                                                                  it d    f
historical sites.
Summary
  • Design and implementation of automated document
    cropping.
  • Integration of spatial, temporal and document
    information.
  • Design and prototype a web-based user interface to
    heterogeneous data.
  • ---------------------------------------------------------------------
  • The system is available at
    http://isda.ncsa.uiuc.edu/lpapers/search.html
  • W would b excited if you would fi d th system useful
    We         ld be       it d               ld find the       t         f l
    in your research or education!



Imaginations unbound

More Related Content

Similar to Overview of Lincoln Paper Design

g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking FunctionalityNicholas Loulloudes
 
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Chris Freeland
 
Graeme earl introduction
Graeme earl introductionGraeme earl introduction
Graeme earl introductionGraeme Earl
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoopcneudecker
 
How the Web of Data Will be Won
How the Web of Data Will be WonHow the Web of Data Will be Won
How the Web of Data Will be WonJeni Tennison
 
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAUNye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAUInfinIT - Innovationsnetværket for it
 
Digital Medieval Data Curation
Digital Medieval Data CurationDigital Medieval Data Curation
Digital Medieval Data Curationblalbritton
 
Digitizing Spectator - Libraries Digital Program
Digitizing Spectator - Libraries Digital ProgramDigitizing Spectator - Libraries Digital Program
Digitizing Spectator - Libraries Digital ProgramRobert Frech
 
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web AppsGIS in the Rockies
 
Core Metadata Project: Harmonizing Legacy Metadata for the Future
Core Metadata Project: Harmonizing Legacy Metadata for the FutureCore Metadata Project: Harmonizing Legacy Metadata for the Future
Core Metadata Project: Harmonizing Legacy Metadata for the FutureWiLS
 
IIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership MeetingIIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership MeetingTom-Cramer
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspaperscneudecker
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyGuy Lansley
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dlmadhuvardhan
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dlmadhuvardhan
 
Network Mapping - Esri UK Annual Conference 2016
Network Mapping - Esri UK Annual Conference 2016Network Mapping - Esri UK Annual Conference 2016
Network Mapping - Esri UK Annual Conference 2016Esri UK
 

Similar to Overview of Lincoln Paper Design (20)

g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionality
 
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
Mainstreaming Digital Imaging: Missouri Botanical Garden Archives
 
Graeme earl introduction
Graeme earl introductionGraeme earl introduction
Graeme earl introduction
 
The Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating HadoopThe Elephant in the Library - Integrating Hadoop
The Elephant in the Library - Integrating Hadoop
 
How the Web of Data Will be Won
How the Web of Data Will be WonHow the Web of Data Will be Won
How the Web of Data Will be Won
 
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAUNye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
Nye forskninsgresultater inden for geo-spatiale data af Christian S. Jensen, AAU
 
IMIA Chiang Spatial Computing - 2016
IMIA Chiang Spatial Computing - 2016IMIA Chiang Spatial Computing - 2016
IMIA Chiang Spatial Computing - 2016
 
Digital Medieval Data Curation
Digital Medieval Data CurationDigital Medieval Data Curation
Digital Medieval Data Curation
 
Digitizing Spectator - Libraries Digital Program
Digitizing Spectator - Libraries Digital ProgramDigitizing Spectator - Libraries Digital Program
Digitizing Spectator - Libraries Digital Program
 
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
2017 GIS in Education Track: Sharing Historical Maps and Atlases in Web Apps
 
Core Metadata Project: Harmonizing Legacy Metadata for the Future
Core Metadata Project: Harmonizing Legacy Metadata for the FutureCore Metadata Project: Harmonizing Legacy Metadata for the Future
Core Metadata Project: Harmonizing Legacy Metadata for the Future
 
IIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership MeetingIIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership Meeting
 
Statistical data in RDF
Statistical data in RDFStatistical data in RDF
Statistical data in RDF
 
Multimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical NewspapersMultimodal Perspectives for Digitised Historical Newspapers
Multimodal Perspectives for Digitised Historical Newspapers
 
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy LansleyUsing R to Visualize Spatial Data: R as GIS - Guy Lansley
Using R to Visualize Spatial Data: R as GIS - Guy Lansley
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dl
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dl
 
Network Mapping - Esri UK Annual Conference 2016
Network Mapping - Esri UK Annual Conference 2016Network Mapping - Esri UK Annual Conference 2016
Network Mapping - Esri UK Annual Conference 2016
 
Our World is Socio-technical
Our World is Socio-technicalOur World is Socio-technical
Our World is Socio-technical
 
Data Science At Zillow
Data Science At ZillowData Science At Zillow
Data Science At Zillow
 

Recently uploaded

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Recently uploaded (20)

Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Overview of Lincoln Paper Design

  • 1. Emancipating Digital Data: The Lincoln Digitization Project Di iti ti P j t Peter Bajcsy, PhD -RResearch S i ti t NCSA h Scientist, - Adjunct Assistant Professor ECE & CS at UIUC - Associate Director Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC National Center for Supercomputing Applications University of Illinois at Urbana-Champaign
  • 2. Outline • Introduction to Lincoln project • Emancipating Digital Data: The Lincoln Digitization Project • From Large Volumes Of Scanned Lincoln Papers To Virtual Observatories • Image Cropping g pp g • Georeferencing and Re-Projections of Historical Maps • System Architecture for Web-based Delivery of Information and Services I f ti dS i • Delivering Layered Information and Providing Services • Summary
  • 3. Acknowledgement • Funding Agencies: • NASA, NARA, NSF, NIH, NAVY, DARPA, ONR, NCSA Industrial Partners, NCSA Internal, COM UIUC, State of Illinois, UIUC Provost, Provost NCSA International Partners Google Summer Code Partners, • Full Time Employees: • Peter Bajcsy, Rob Kooper, Michal Ondrejcek, Kenton McHenry, Jason Kastner and Luigi Marini • Students: • Andrew Spencer, Hye Jung Na, Suk Kyu Lee, Rahul Malik, William McFadden, McFadden Chandra Ramachandran, Ben Raichel, Maryam Ramachandran Raichel Moslemi Naeini • Collaborators on Lincoln Project: • Daniel Stowell & Stacy McDermott Lincoln Library in Springfield McDermott, Springfield, IL; Vernon Burton & Kevin Franklin, I-CHASS, UIUC; Melvin Casares, and Jose Castro from Instituto Tecnológico de Costa Rica (ITCR); Piotr Wendykier and James Nagy from Emory ( ); y gy y University Atlanta
  • 4. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES INTRODUCTION Imaginations unbound
  • 5. Background • The Papers of Abraham Lincoln is a research initiative with the ultimate goal of making all writings by America's 16th president available on-line. • Complex workflow process from paper documents to on-line virtual observatories • Multiple end users • R Researchers h • General public
  • 6. Input: Paper Copies of Docs & Metadata
  • 7. Output: Multi-dimensional Views The Lincoln Log, A Chronology compiled by g, gy p y the Lincoln Sesquicentennial Commission: http://www.thelincolnlog.org/ DIMENSIONS Hyperlink to the Lincoln Log (temporal representation) Hyperlink to the Markers (spatial representation) Hyperlinks to the Image scans (document content).
  • 8. Output: Hyperlinked Multi-media Views • Audio (e.g., music of Lincoln’s time) • Images and maps I d • Video • 3D objects (e g musical instruments) (e.g., IMAGES SONGS Imaginations unbound
  • 9. Output: Services to Search, Display and Transcribe Digital Data Google service The Lincoln Log, A Chronology compiled by g, gy p y Transcription service the Lincoln Sesquicentennial Commission: http://www.thelincolnlog.org/ Search service
  • 10. Output: On-line Virtual Observatory • Digital Information Organization • Multi-dimensional views in time, space and document dimensions • Hyperlinked multi-media views including all existing n- dimensional data • Computational Services to Operate on Digital Data • Search • Layered display with third party data • Transcription of documents • Educational Services to Enable Learning • Simple demonstrations • Homework exercises • Support of forensic studies Imaginations unbound
  • 11. From Input to Output: A Few Key Components 1. Cropping of scanned documents (algorithm, accuracy & robustness, scalability, computational resources). 2. 2 Cleaning and parsing of metadata obtained from The Lincoln Log and The Papers of Abraham Lincoln in Springfield (Lat, Lng, places, ASCII characters, populating MySQL Database etc.) 3. Designing an underlying architecture of information storage and retrieval g 4. Geo-referencing and re-projection of historical maps. 5. Building web-based interfaces and providing services (Programming against Google Maps API Database API, Ajax/Javascript requests using PHP and mySQL). http://isda.ncsa.uiuc.edu/lpapers/index.html
  • 12. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES IMAGE CROPPING Imaginations unbound
  • 13. Image Cropping: Understanding Variability of Document Scans • Background paper color and intensity • Ink color and intensity • Density of writing • Color scale bar position • Task: Automatically classify images for pre- p processing and g remove the Kodak color scale bar if needed. needed
  • 14. Image Cropping Approach Training Classify Crop Output
  • 15. Humanities & High Performance Computing • Assuming that the world is perfect …. • Image cropping 300 000 files times 60 seconds per file = 5 000 cropping: 300,000 5,000 hours = 208.3 days • Other operations such as file format conversions (TIFF->PDF), pyramid construction for web deployment • Storage requirements for original (100K-300K images ~ 45 Terabytes), cropped (?) and pyramid representation for fast retrieval over the Internet (?) • Need to joint forces and form interdisciplinary teams • The storage requirements and p g q preservation – NCSA mass storage • The CPU requirements – parallel codes to utilize HPC Imaginations unbound
  • 16. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES GEO-REFERENCING AND RE-PROJECTION OF HISTORICAL MAPS Imaginations unbound
  • 17. Georeferencing Historical Maps • Goal: to overlay historical maps on top of Google Maps • Challenges: Geodetic information is not always available. available • The geodetic coordinate system consists of a datum, a projection, an origin, a unit system and two axis. • T Target Projection: G t P j ti Google M l Maps uses WGS84 WGS84, Mercator projection and a pixel unit system. • Most of the maps of the United States are in conical projection projection, Lambert Conformal Conic and Albers Equal Area or in Molweide Pseudocylindrical Projection. Imaginations unbound
  • 18. Layered Geospatial Information: Google Map Example
  • 20. Example of Map Georeferencing • Software: Used Global Mapper Albers equal-area conic Lambert's conformal conic Mercator cylindrical Mollweide pseudocylindrical In our case the projection does not have to be exact. For small areas in Molweide projection, for example a simple perspective correction can be sufficient for the map of the US 1861-1865 Imaginations unbound
  • 21. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES DESIGNING AN UNDERLYING ARCHITECTURE OF VIRTUAL OBSERVATORIES Imaginations unbound
  • 22. Software Architecture Design The front-end consists of a HTML file with Google Map loaded, a JavaScript script, and a search form with pre-defined data sets. The client-side HTML and JavaScript files make requests to the server. The server-side consists of a PHP file which bridges the gap between Ajax and connects to MySQL database. The result is returned as an XML response to the Ajax engine. Imaginations unbound
  • 23. Data Storage and Organization Imaginations unbound
  • 24. FROM LARGE VOLUMES OF SCANNED LINCOLN PAPERS TO VIRTUAL OBSERVATORIES DELIVERING LAYERED INFORMATION AND PROVIDING SERVICES Imaginations unbound
  • 25. Multi Dimensional View of Lincoln Papers • Delivering Layers of Information (geospatial – historical maps and current maps, temporal – Lincoln log, relational – p p , p g, source & destination links, content – document scans
  • 26. User Interface Information in time, space and document dimensions. Time Ti Space Search
  • 29. Safety Guards for Transcription Services RFC Valid e-mail addresses abc@example.com Abc@example.com aBC@example.com abc.123@example.com abc 123@example com "abc@def"@example.com "Abc@def"@example.com 1234567890@example.com Standards for email addresses: RFC822 _______@example.com abc+mailbox/department=shipping@example.com abc mailbox/department shipping@example.com (published in 1982) defines, amongst other !#$%&'*+-/=?^_`.{|}~@example.com things, the f format for internet text message f "Fred "quota" Bloggs"@example.com (email) addresses. "Abc@def"@example.com "Fred Bloggs"@example.com "JoeBlow"@example.com customer/department=shipping@example.com $A12345@example.com RFC Invalid e-mail addresses !def!xyz%abc@example.com Abc.example.com (character @ is missing) _somename@example.com Abc.@example.com (character dot(.) is last in local part) Abc..123@example.com ( @ p (character dot(.) is double) () ) A@b@c@example.com (only one @ is allowed outside quotations marks) ()[];:,<>@example.com (none of the characters before the @ is allowed outside quotation marks)
  • 30. What Would You Learn ? In this example a letter was sent from Fort Randall to President Abraham Lincoln on October 26, 1862. The bits of information about the document (metadata) namely the time, the location of a sender and the location of President Lincoln are known. The letter path is visualized in Google Maps, the document can be retrieved from the database and edited. Additionally, user can overlay one of the historical maps. The markers are positioned with hi h accuracy b iti d ith high based on th l tit d and l d the latitude d longitude of it d f historical sites.
  • 31. Summary • Design and implementation of automated document cropping. • Integration of spatial, temporal and document information. • Design and prototype a web-based user interface to heterogeneous data. • --------------------------------------------------------------------- • The system is available at http://isda.ncsa.uiuc.edu/lpapers/search.html • W would b excited if you would fi d th system useful We ld be it d ld find the t f l in your research or education! Imaginations unbound