From the
Printed Page
to
Discoverable Content
the open source way
Steven Miles
@stevermiles stevenmiles.com.au
Tuesday, 18 January 2011
About Me
Web Application Developer @ State Library of Western Australia
S.L.U.R.P. - Digital Content Ingestion & Integration with LMS
PC Reservation - PC Reservations and Booking System
PLO - Public Libraries Online
Venues Bookings - Venues Booking & Reservation System
P.URL - Permanent URL
WARNING !!!!
Lots of technical stuff!
How can I make scanned content more discoverable?
presentation
Digitisation
Indexing
Capture DIY Scanner
Existing Documents
Dual Camera Setup
Single Camera Setup
Commercial Scanners
Image Processing
OCR
Document Scanners
MFDs
Rotation
Cropping
Normalisation
Levels Correction
Multi page
Tagging
Open source
Commercial
Cuneiform
Tesseract
OCRopus
GOCR
Page Layout Analysis
ABBYY FineReader
Acrobat
Leptonica
Metadata
Manual
Automatic
Persons
Dates
Organisations
Locations
Formats
hOCR
Text
XML
Manual
Import
Z39.50
SRU/SRW
Engine
Zebra
XML
Z39.50
RDBMS
Postgres
MySQL
Search
Pull from LMS
Search Multiple Databases
Results
Expose Web APIs
Other Library Systems
Z39.50
SRU/SRW
Facets
Page Previews
Ranked
Sortable
Filters
Web Accessible
Simple Keyword Searching
Encourage Exploration
Tagging
Advanced Search
Saved Searches
Social Sharing, Integration
Web Browser Accessible
Auto Updating
Downloadable PDFs
User Correctable Text
In Document Searching
Highlight Search Results
Potential Conversion to Other Formats
Most common process of digitisation for public consumption
Scan / Capture
Generate PDF
OCR
Indexed by Content Management System
Link to Downloadable PDF (Uncorrected OCR)
(Links only to Document)
How can we do this better?
Inspirational Resources
National Library of Australia - Australian Newspapers
http://newspapers.nla.gov.au/
Google Docs
http://docs.google.com
Informit - Text Searchable Content
Proposed Digitisation Process
Scan / Capture
Semi Auto Cropping and Rotation Correction
Optimise Each Page for OCR
OCR Pages - Retain Positional Information (hOCR)
Post OCR Processing - Spell checking & correction of common OCR errors
Natural Language Processing - Auto Extract Names, Organisations, Locations & Dates from Text and Use for tagging
Store as XML - Generate Page Level XML Index Files
Add/Update XML Indexing Server
Fully Automated Process
Generate Searchable PDF
Generate Web Friendly Versions of each page
Full Text Search - Web Services & Z39.50
Downloadable PDF
Google Docs Style Interface - Individual Line Highlighting to Show search results
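As one illustration of the "Post OCR Processing" step, here is a minimal Python sketch that rejoins words hyphenated across line breaks, using two lines from the Eastern Reporter sample that appears later in this deck. The function name `join_hyphenated` and the rule itself (a trailing hyphen always marks a split word) are assumptions made for illustration, not the prototype's actual code.

```python
def join_hyphenated(lines):
    """Join OCR text lines into one string, merging words split by a trailing hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if text.endswith("-"):
            # Assumption: a trailing hyphen marks a word split across two lines.
            text = text[:-1] + line
        elif text:
            text = text + " " + line
        else:
            text = line
    return text


sample = [
    "IN a display of unity, Muslims and Chris-",
    "tians gathered at Dianella Uniting Church",
]
print(join_hyphenated(sample))
```

A real correction pass would probably check the joined word against a dictionary first, since genuinely hyphenated compounds can also end a line.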
Available Open Source Projects
OCRopus - Page Layout Analysis
http://code.google.com/p/ocropus/
Tesseract OCR - OCR
http://code.google.com/p/tesseract-ocr/
ImageMagick - Image Processing
http://www.imagemagick.org/
Index Data Zebra - XML Indexing
http://www.indexdata.com/zebra
Index Data Pazpar2 - Federated Search
http://www.indexdata.com/pazpar2
Existing Web Technologies - PHP, HTML, CSS etc.
DIY Book Scanner Project
www.diybookscanner.org
Basic Architecture
Discovery Layer (PHP, HTML, CSS)
Federated Search - Using Pazpar2 (Z39.50, SRU, SRW)
Full Text Search - Zebra XML Indexer via Z39.50
LMS & External Databases - Existing via Z39.50
XML Data Files - MARC, Dublin Core, OAI-PMH
Document Viewer / Editor (PHP, HTML, CSS)
Ingest / Digitisation (PHP, HTML, CSS)
OCR & NLP (Document Processing, OCR & Natural Language Processing)
Downloadable Version - Automatic Generation of Searchable PDF, Text Files etc. (Updated from User Alterations)
External Resources - Crowdsourcing OCR Corrections & Possible translation of handwritten documents
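To make the search layer a little more concrete, the sketch below builds an SRU `searchRetrieve` request URL of the kind a discovery layer could send to an SRU-capable index. The parameter names (`version`, `operation`, `query`, `maximumRecords`) come from the SRU 1.1 protocol; the host, port, database name and the helper name `sru_search_url` are hypothetical placeholders, and the deck itself does not show how the prototype issues its queries.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_search_url(base_url, cql_query, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint: host, port and database name are placeholders.
url = sru_search_url("http://localhost:9999/digindex", '"Dianella Uniting Church"')
print(url)

# Round trip: the CQL query survives URL encoding unchanged.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Keeping the URL construction in one small function means the same query could later be pointed at a federated search front end instead of a single index.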
Converting Images for OCR
Convert to Grayscale
Generate Text Image Mask
Clean Up Background Noise
OCR Version
OCRopus Page Layout Analysis and ImageMagick Image Manipulation Combined
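A rough sketch of how the ImageMagick part of this step could be driven from a script. It only covers rotation, cropping and grayscale conversion; the text mask and noise clean-up are not attempted. The `-rotate`, `-crop`, `+repage` and `-colorspace` options are standard ImageMagick options, while the helper name `build_convert_args`, the output path and the rotate-then-crop ordering are assumptions. The rotate and crop values are borrowed from the page XML sample shown later in this deck.

```python
def build_convert_args(src, dst, rotate, crop):
    """Return an ImageMagick `convert` argument list for one scanned page."""
    return [
        "convert", src,
        "-rotate", str(rotate),   # straighten the camera image
        "+repage",                # reset the canvas offset after rotating
        "-crop", crop,            # WxH+X+Y geometry of the page area
        "+repage",
        "-colorspace", "Gray",    # grayscale version for OCR
        dst,
    ]


# Values from the sample page XML; the output path is a placeholder.
args = build_convert_args(
    "odd/IMG_0946.JPG", "ocr/1-gray.png", rotate=91, crop="2161x3247+374+274"
)
print(" ".join(args))

# To actually run it (requires ImageMagick on the PATH):
# import subprocess
# subprocess.run(args, check=True)
```

Building the argument list separately from running it makes it easy to reuse the same rotation and crop settings for other pages.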
Images to Text
Image for OCR Processing
Tesseract OCR to hOCR File
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head><title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
<meta name='ocr-system' content='tesseract'></head>
<body><div class='ocr_page' id='page_1' title='image "/var/digindex/repository/eastern_reporter/2010/10/5/ocr/1-masked.png"; bbox 0 0 2161 3247'>
<div class='ocr_carea' id='block_1_1' title="bbox 200 46 1858 233">
<p class='ocr_par'><span class='ocr_line' id='line_1_1' title="bbox 201 50 1858 230">
<span class='ocr_word' id='word_1_1' title="bbox 1058 50 1211 196"><span class='xocr_word' id='xword_1_1' title="x_wconf -6">R</span></span>
<span class='ocr_word' id='word_1_2' title="bbox 1319 88 1858 230"><span class='xocr_word' id='xword_1_2' title="x_wconf -4"> r </span></span>
</span></p>
</div>
<div class='ocr_carea' id='block_1_2' title="bbox 47 1855 241 1883">
<p class='ocr_par'><span class='ocr_line' id='line_1_2' title="bbox 47 1855 241 1882">
<span class='ocr_word' id='word_1_3' title="bbox 47 1855 77 1882"><span class='xocr_word' id='xword_1_3' title="x_wconf -2">By</span></span>
<span class='ocr_word' id='word_1_4' title="bbox 87 1855 153 1877"><span class='xocr_word' id='xword_1_4' title="x_wconf -3">LIAM</span></span>
<span class='ocr_word' id='word_1_5' title="bbox 163 1856 241 1878"><span class='xocr_word' id='xword_1_5' title="x_wconf -2">CROY</span></span>
</span></p></div>
<div class='ocr_carea' id='block_1_3' title="bbox 43 1909 533 2404"><p class='ocr_par'>
<span class='ocr_line' id='line_1_3' title="bbox 46 1910 531 1934">
<span class='ocr_word' id='word_1_6' title="bbox 46 1910 72 1928"><span class='xocr_word' id='xword_1_6' title="x_wconf -3">IN</span></span>
<span class='ocr_word' id='word_1_7' title="bbox 83 1914 94 1928"><span class='xocr_word' id='xword_1_7' title="x_wconf -2">a</span></span>
<span class='ocr_word' id='word_1_8' title="bbox 105 1910 185 1933"><span
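The hOCR `bbox` values are corner coordinates (x0 y0 x1 y1), whereas the storage XML sample in this deck records each line as top, left, width and height. The small Python sketch below shows that arithmetic; the helper name `bbox_to_box` is an assumption, and the deck does not show the actual conversion code.

```python
import re

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def bbox_to_box(title):
    """Convert an hOCR title such as 'bbox 201 50 1858 230' to left/top/width/height."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError("no bbox in title: %r" % title)
    x0, y0, x1, y1 = (int(value) for value in match.groups())
    return {"left": x0, "top": y0, "width": x1 - x0, "height": y1 - y0}


# line_1_1 from the hOCR sample
print(bbox_to_box("bbox 201 50 1858 230"))
```

For `line_1_1` this gives a width of 1657 and a height of 180, the same values that appear on the corresponding `<line>` element in the XML sample.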
Convert hOCR to XML for Storage
<document><metadata><title>Eastern Reporter Tuesday, October 5, 2010</title><id>eastern_reporter/2010/10/5</id></metadata>
<pages><page id="0" origWidth="3648" origHeight="2736" rotate="-90.5" crop="2199x3321+147+147"/>
<page id="1" origWidth="3648" origHeight="2736" rotate="91" path="odd/IMG_0946.JPG" crop="2161x3247+374+274" width="2161" height="3247">
<paragraph><line id="line_1_1" top="50" left="201" width="1657" height="180">R r</line></paragraph>
<paragraph><line id="line_1_2" top="1855" left="47" width="194" height="27">By LIAM CROY</line></paragraph>
<paragraph><line id="line_1_3" top="1910" left="46" width="485" height="24">IN a display of unity, Muslims and Chris-</line>
<line id="line_1_4" top="1937" left="45" width="486" height="26">tians gathered at Dianella Uniting Church</line>
<line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line>
<line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line></paragraph>
<paragraph><line id="line_1_7" top="2020" left="79" width="451" height="25">Sheikh Muhammad Agherdien of the</line></paragraph>
<paragraph><line id="line_1_8" top="2048" left="46" width="484" height="25">Mirrabooka mosque opened the service</line>
<line id="line_1_9" top="2076" left="46" width="484" height="26">with a verse of the Islamic religious text,</line>
<line id="line_1_10" top="2103" left="45" width="117" height="20">the Koran:</line></paragraph>
<paragraph><line id="line_1_11" top="2131" left="79" width="451" height="27">&#x201C;Oh People! Behold, we have created you</line></paragraph>
<paragraph><line id="line_1_12" top="2158" left="46" width="331" height="22">all out ofa male and a female.</line></paragraph>
<paragraph><line id="line_1_13" top="2187" left="79" width="451" height="25">&#x201C;And we have made you into nations</line></paragraph>
<paragraph><line id="line_1_14" top="2214" left="46"
Sample Auto Generated Tags
IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians
gathered at [ORG Dianella Uniting Church ] , last Thursday to share
thei.r experiences , and pray for peace.
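To show how tagged text like this could feed search facets, here is a minimal Python sketch that pulls the bracketed `[LABEL text ]` spans out of the sample and groups them by label. The regular expression and the helper name `extract_tags` are assumptions based only on the sample above; the exact output format of the tagger used in the prototype may differ.

```python
import re
from collections import defaultdict

# Matches spans such as "[ORG Dianella Uniting Church ]".
TAG_RE = re.compile(r"\[(\w+) (.+?) \]")


def extract_tags(tagged_text):
    """Group tagged entities by label, keeping first-seen order and dropping repeats."""
    tags = defaultdict(list)
    for label, entity in TAG_RE.findall(tagged_text):
        if entity not in tags[label]:
            tags[label].append(entity)
    return dict(tags)


tagged = (
    "IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians "
    "gathered at [ORG Dianella Uniting Church ] , last Thursday to share "
    "thei.r experiences , and pray for peace."
)
print(extract_tags(tagged))
```

The `Chris-` entry shows why the post-OCR clean-up matters: a word split across lines ends up as a broken tag unless it is rejoined before tagging.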
Demo
Prototype Interface for Ingesting Pages from Book Scanner
Perform Basic Image Rotation and Cropping
Rotation and cropping can be replicated to other pages
Prototype Search Pages
The results on the left are the auto-generated facets, based on the natural language processing tags
Viewing Document Pages
Viewing Document Pages with Highlighted Results
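One way line-level highlighting could work is to scale each stored line box from OCR image coordinates to the size of the page image shown in the browser, then draw an overlay at that position. The sketch below shows only the scaling step; the helper name `scale_box` and the 800 pixel display width are assumptions, and the deck does not describe how the prototype positions its highlights.

```python
def scale_box(box, ocr_width, display_width):
    """Scale a top/left/width/height box from OCR pixels to display pixels."""
    scale = display_width / ocr_width
    return {key: round(value * scale) for key, value in box.items()}


# line_1_2 ("By LIAM CROY") from the XML sample; the page is 2161 px wide.
line_1_2 = {"top": 1855, "left": 47, "width": 194, "height": 27}
print(scale_box(line_1_2, ocr_width=2161, display_width=800))
```

The scaled values can be used directly as CSS `top`, `left`, `width` and `height` for an absolutely positioned highlight element over the page image.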
Editing Document with Auto Updating of Indexer
Pazpar2 can be used to build alternative interfaces for searching multiple existing catalogs
Questions?
More Info & Credits
Tesseract-OCR
http://code.google.com/p/tesseract-ocr/
OCRopus
http://code.google.com/p/ocropus/
Do-It-Yourself Book Scanning
http://www.diybookscanner.org/
CHDK - Canon Hack Development Kit
http://chdk.wikia.com/wiki/CHDK
Zebra - XML Indexing
http://www.indexdata.com/zebra
Pazpar2 - Federated Search
http://www.indexdata.com/pazpar2
Cuneiform
http://en.wikipedia.org/wiki/CuneiForm_(software)
EyeFi Python Server
http://returnbooleantrue.blogspot.com/2009/01/eye-fi-standalone-server.html/
hOCR - HTML OCR
http://en.wikipedia.org/wiki/HOCR
OpenNLP
http://opennlp.apache.org/
Illinois Named Entity Tagger
http://cogcomp.cs.illinois.edu/page/software_view/4

From the printed page to discoverable content library camp perth 2010

  • 1. From the Printed Page to Discoverable Content the open source way Steven Miles @stevermiles stevenmiles.com.au Tuesday, 18 January 2011
  • 2. About Me Tuesday, 18 January 2011
  • 3. About Me Web Application Developer State Library of Western Australia @ Tuesday, 18 January 2011
  • 4. About Me Web Application Developer State Library of Western Australia @ S.L.U.R.P. Digital Content Ingestion & Integration with LMS PC Reservation PC Reservations and Booking System PLO Public Libraries Online Venues Bookings Venues Booking & Reservation System P.URL Permanent URL Tuesday, 18 January 2011
  • 5. WARNING !!!! Lots of technical stuff! Tuesday, 18 January 2011
  • 6. How can I make scanned content more discoverable? presentation Digitisation Indexing Capture DIY Scanner Existing Documents Dual Camera Setup Single Camera Setup Commercial Scanners Image Processing OCR Document Scanners MFD’s Rotation Cropping Normalisation Levels Correction Multi page Tagging Open source Commercial Cuneiform Tesseract Ocropus GOCR Page Layout Analysis Abby Fine Reader Acrobat leptonica Metadata ManualAutomatic PersonsLocations Dates Organisations Locations Formats hOCR Text XML Manual Import Z39.50 SRU/SRW Engine Zebra XML Z39.50 RBMS Postgres MySQL Search Pull from LMS Search Multiple Databases Results Expose Web API’s Other Library Systems Z39.50 SRU/SRW Facets Page Previews Ranked Sortable Filters Web Accessible Simple Keyword Searching Encourage Exploration Tagging Advanced Search Saved Searches Social Sharing, Intergration Web Browser Accessible Auto Updating Downloadable PDF’s User Correctable Text In Document Searching Highlight Search Results Potential Conversion to Other Formats Tuesday, 18 January 2011
  • 7. Most common process of digitisation for public consumption Scan / Capture Generate PDF OCR Indexed by Content Management System Link to Downloadable PDF(Uncorrected OCR) (Links only to Document) How can we do this better? Tuesday, 18 January 2011
  • 8. Inspirational Resources National Libraries Australia - Australian Newspapers http://newspapers.nla.gov.au/ Google Docs http://docs.google.com Informit -Text Searchable Content Tuesday, 18 January 2011
  • 9. Scan / Capture Semi Auto Cropping and Rotation Correction Optimise Each Page for OCR OCR Pages Retain Positional Information (hocr) Post OCR Processing Spell checking & correction of common OCR errors Natural Language Processing Auto Extract Names, Organisations, Locations & Dates from Text and Use for tagging Store as XML Generate Page Level XML Index Files Add/Update XML Indexing Server Fully Automated Process Generate Searchable PDF Generate Web FriendlyVersions of each page Full Text Search Web Services & Z39.50 Downloadable PDF Google Docs Style Interface Individual Line Highlighting to Show search results Proposed Digitisation Process Tuesday, 18 January 2011
  • 10. Available Open Source Projects: Ocropus - Page Layout Analysis (http://code.google.com/p/ocropus/); Tesseract OCR - OCR (http://code.google.com/p/tesseract-ocr/); ImageMagick - Image Processing (http://www.imagemagick.org/); Index Data Zebra - XML Indexing (http://www.indexdata.com/zebra); Index Data Pazpar2 - Federated Search (http://www.indexdata.com/pazpar2); Existing Web Technologies - PHP, HTML, CSS etc
  • 12. Basic Architecture: Discovery Layer (PHP, HTML, CSS); Federated Search Using PazPar2 - Z39.50, SRU, SRW; Full Text Search: Zebra - XML Indexer via Z39.50; LMS & External Databases: Existing via Z39.50; XML Data Files: MARC, Dublin Core, OAI-PMH; Document Viewer / Editor (PHP, HTML, CSS); Ingest / Digitisation (PHP, HTML, CSS); OCR & NLP (Document Processing, OCR & Natural Language Processing); Downloadable Version: Automatic Generation of Searchable PDF, Text Files etc (Updated from User Alterations); External Resources; Crowdsourcing OCR Corrections & Possible translation of handwritten documents
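The architecture exposes search over Z39.50, SRU and SRW. As a rough illustration of what an SRU request from the discovery layer could look like, the sketch below builds a standard SRU 1.1 `searchRetrieve` URL. The host, port, database name and the `dc.title` index are assumptions for illustration only; the slides do not show the actual endpoint or index configuration.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start=1, maximum=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,          # CQL, e.g. an index name and a term
        "startRecord": start,        # SRU record positions are 1-based
        "maximumRecords": maximum,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name.
url = sru_search_url("http://localhost:9999/digindex", "dc.title=peace")
print(url)
```

The same query could be sent over Z39.50 instead; SRU is used here only because it needs nothing beyond an HTTP GET.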
  • 13. Converting Images for OCR: Convert to Grayscale; Generate Text Image Mask; Clean Up Background Noise; OCR Version. OCRopus Page Layout Analysis + ImageMagick Image Manipulation Combined
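The ImageMagick part of this step could be driven from a script along the following lines. This is a sketch only: the exact options and their order are not given in the slides, so the rotate-then-crop order, the use of `-normalize` and `-despeckle`, and the output file name are assumptions. The rotate and crop values are the ones recorded for page 1 in the XML sample on slide 14. The text mask generated with OCRopus is not covered here.

```python
def build_convert_command(src, dst, rotate, crop):
    """Return an ImageMagick `convert` command (as an argument list).

    The command rotates and crops the capture, then produces a cleaned
    grayscale image. Nothing is executed here; pass the list to
    subprocess.run() on a machine with ImageMagick installed.
    """
    return [
        "convert", src,
        "-rotate", str(rotate),   # straighten the camera capture
        "-crop", crop,            # geometry in WxH+X+Y form
        "+repage",                # reset the canvas after cropping
        "-colorspace", "Gray",    # convert to grayscale
        "-normalize",             # stretch contrast (levels correction)
        "-despeckle",             # reduce background noise
        dst,
    ]


# "1-gray.png" is a hypothetical output name.
command = build_convert_command(
    "odd/IMG_0946.JPG", "1-gray.png", 91, "2161x3247+374+274"
)
print(" ".join(command))
```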
  • 14. Images to Text: Image for OCR Processing. Tesseract OCR to HOCR File: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title></title> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" ><meta name='ocr-system' content='tesseract'></head> <body><div class='ocr_page' id='page_1' title='image "/var/digindex/repository/eastern_reporter/2010/10/5/ocr/1-masked.png"; bbox 0 0 2161 3247'> <div class='ocr_carea' id='block_1_1' title="bbox 200 46 1858 233"> <p class='ocr_par'><span class='ocr_line' id='line_1_1' title="bbox 201 50 1858 230"><span class='ocr_word' id='word_1_1' title="bbox 1058 50 1211 196"><span class='xocr_word' id='xword_1_1' title="x_wconf -6">R</span></span> <span class='ocr_word' id='word_1_2' title="bbox 1319 88 1858 230"><span class='xocr_word' id='xword_1_2' title="x_wconf -4"> r </span></span></span></p> </div><div class='ocr_carea' id='block_1_2' title="bbox 47 1855 241 1883"> <p class='ocr_par'><span class='ocr_line' id='line_1_2' title="bbox 47 1855 241 1882"><span class='ocr_word' id='word_1_3' title="bbox 47 1855 77 1882"><span class='xocr_word' id='xword_1_3' title="x_wconf -2">By</span></span> <span class='ocr_word' id='word_1_4' title="bbox 87 1855 153 1877"><span class='xocr_word' id='xword_1_4' title="x_wconf -3">LIAM</span></span> <span class='ocr_word' id='word_1_5' title="bbox 163 1856 241 1878"><span class='xocr_word' id='xword_1_5' title="x_wconf -2">CROY</span></span></span></p></div><div class='ocr_carea' id='block_1_3' title="bbox 43 1909 533 2404"><p class='ocr_par'> <span class='ocr_line' id='line_1_3' title="bbox 46 1910 531 1934"><span class='ocr_word' id='word_1_6' title="bbox 46 1910 72 1928"><span class='xocr_word' id='xword_1_6' title="x_wconf -3">IN</span></span> <span class='ocr_word' id='word_1_7' title="bbox 83 1914 94 1928"><span class='xocr_word' id='xword_1_7' title="x_wconf -2">a</span></span> <span class='ocr_word' id='word_1_8' title="bbox 105 1910 185 1933"><span
Convert HOCR to XML for Storage: <document><metadata><title>Eastern Reporter Tuesday, October 5, 2010</title><id>eastern_reporter/2010/10/5</id></metadata> <pages><page id="0" origWidth="3648" origHeight="2736" rotate="-90.5" crop="2199x3321+147+147"/><page id="1" origWidth="3648" origHeight="2736" rotate="91" path="odd/IMG_0946.JPG" crop="2161x3247+374+274" width="2161" height="3247"><paragraph><line id="line_1_1" top="50" left="201" width="1657" height="180">R r</line></paragraph><paragraph><line id="line_1_2" top="1855" left="47" width="194" height="27">By LIAM CROY</line></paragraph><paragraph><line id="line_1_3" top="1910" left="46" width="485" height="24">IN a display of unity, Muslims and Chris-</line><line id="line_1_4" top="1937" left="45" width="486" height="26">tians gathered at Dianella Uniting Church</line><line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line><line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line></paragraph><paragraph><line id="line_1_7" top="2020" left="79" width="451" height="25">Sheikh Muhammad Agherdien of the</line></paragraph><paragraph><line id="line_1_8" top="2048" left="46" width="484" height="25">Mirrabooka mosque opened the service</line><line id="line_1_9" top="2076" left="46" width="484" height="26">with a verse of the Islamic religious text,</line><line id="line_1_10" top="2103" left="45" width="117" height="20">the Koran:</line></paragraph><paragraph><line id="line_1_11" top="2131" left="79" width="451" height="27">&#x201C;Oh People! Behold, we have created you</line></paragraph><paragraph><line id="line_1_12" top="2158" left="46" width="331" height="22">all out ofa male and a female.</line></paragraph><paragraph><line id="line_1_13" top="2187" left="79" width="451" height="25">&#x201C;And we have made you into nations</line></paragraph><paragraph><line id="line_1_14" top="2214" left="46"
Sample Auto Generate Tags: IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians gathered at [ORG Dianella Uniting Church ] , last Thursday to share thei.r experiences , and pray for peace.
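The two samples on this slide suggest how each hOCR line maps to a stored `<line>` element: `left` and `top` come from the first two `bbox` values, and `width` and `height` are the differences between the bounding box corners (for `line_1_2`, 241 - 47 = 194 and 1882 - 1855 = 27). The sketch below reproduces that mapping in Python with the standard-library HTML parser. It is an illustration under that reading of the samples, not the prototype's actual converter, and the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

_BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLines(HTMLParser):
    """Collect (id, left, top, width, height, text) for every ocr_line span."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0        # span nesting depth inside the current ocr_line
        self._current = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1   # a word span nested inside the line
        elif attrs.get("class") == "ocr_line":
            match = _BBOX.search(attrs.get("title", ""))
            if match:
                x0, y0, x1, y1 = map(int, match.groups())
                self._current = (attrs.get("id", ""), x0, y0, x1 - x0, y1 - y0)
                self._chunks = []
                self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace between the word spans.
                text = " ".join("".join(self._chunks).split())
                self.lines.append(self._current + (text,))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def hocr_to_line_xml(hocr):
    """Return one <line> element string per ocr_line found in the hOCR markup."""
    parser = HocrLines()
    parser.feed(hocr)
    parser.close()
    return [
        '<line id=%s top="%d" left="%d" width="%d" height="%d">%s</line>'
        % (quoteattr(line_id), top, left, width, height, escape(text))
        for line_id, left, top, width, height, text in parser.lines
    ]


sample = (
    "<span class='ocr_line' id='line_1_2' title=\"bbox 47 1855 241 1882\">"
    "<span class='ocr_word' id='word_1_3' title=\"bbox 47 1855 77 1882\">"
    "<span class='xocr_word' id='xword_1_3' title=\"x_wconf -2\">By</span></span> "
    "<span class='ocr_word' id='word_1_4' title=\"bbox 87 1855 153 1877\">"
    "<span class='xocr_word' id='xword_1_4' title=\"x_wconf -3\">LIAM</span></span> "
    "<span class='ocr_word' id='word_1_5' title=\"bbox 163 1856 241 1878\">"
    "<span class='xocr_word' id='xword_1_5' title=\"x_wconf -2\">CROY</span></span>"
    "</span>"
)
line_elements = hocr_to_line_xml(sample)
print(line_elements[0])
```

Keeping the positional attributes on every line is what later allows individual lines to be highlighted on the page image when they match a search.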
  • 16. Prototype Interface for Ingesting Pages from Book Scanner
  • 17. Perform Basic Image Rotation and Cropping. Rotation and Cropping can be replicated to other pages
  • 18. Prototype Search Pages. Results on the left are the auto-generated facets based on the natural language processing tags
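One way to turn tagger output into facets is to group the tagged entities by type. The sketch below assumes the bracketed `[TYPE entity ]` format shown under "Sample Auto Generate Tags" on slide 14; the function name is hypothetical, and how the prototype actually builds its facets is not shown in the slides.

```python
import re
from collections import defaultdict

# Matches "[ORG Dianella Uniting Church ]": a type label, then the entity text.
_TAG = re.compile(r"\[([A-Z]+) ([^\]]+?) \]")


def facets_from_tagged_text(tagged):
    """Group tagged entities by type, keeping first-seen order without duplicates."""
    facets = defaultdict(list)
    for label, entity in _TAG.findall(tagged):
        if entity not in facets[label]:
            facets[label].append(entity)
    return dict(facets)


tagged = (
    "IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , "
    "tians gathered at [ORG Dianella Uniting Church ] , "
    "last Thursday to share thei.r experiences , and pray for peace."
)
facets = facets_from_tagged_text(tagged)
print(facets)
```

The `Chris-` entry shows why rejoining hyphenated words before tagging matters: the tagger only sees half of "Christians" when the line break is left in place.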
  • 20. Viewing Document Pages with Highlighted Results
  • 21. Editing Document with Auto Updating of Indexer
  • 22. PazPar2 can be used to build alternative interfaces for searching multiple existing catalogs
  • 24. More Info & Credits: Tesseract-OCR (http://code.google.com/p/tesseract-ocr/); OCRopus (http://code.google.com/p/ocropus/); Do-It-Yourself Book Scanning (http://www.diybookscanner.org/); CHDK - Canon Hack Development Kit (http://chdk.wikia.com/wiki/CHDK); Zebra - XML Indexing (http://www.indexdata.com/zebra); PazPar2 - Federated Search (http://www.indexdata.com/pazpar2); Cuneiform (http://en.wikipedia.org/wiki/HOCR); EyeFi Python Server (http://returnbooleantrue.blogspot.com/2009/01/eye-fi-standalone-server.html/); hOCR - HTML OCR (http://en.wikipedia.org/wiki/HOCR); OpenNLP (http://www.indexdata.com/pazpar2); Illinois Named Entity Tagger (http://cogcomp.cs.illinois.edu/page/software_view/4)