A Ground-Truthing Engine for Proofsetting, Publishing, Re-Purposing and Quality Assurance
Steven Simske and Margaret Sturgill
Imaging Systems Laboratory, HP Labs
November 21, 2003
Overview of the Digital Library Creation and Fulfillment Process
[Diagram labels: Paper-based Corpus; POD Server; Workflow Server; Web-Enabled Device; Computer / Browser; Customized Application; Mill Servers]
Digital Library Generation: Automated conversion of TIFF to PDF and XML
MIT Press Digital Library Creation Specifics
• 4000 out-of-print books: 1.2 million+ pages
• 260+ books placed in the CogNet (cognitive network) web community
• The rest available for purchase via the MIT Press Classics site
Processing pipeline: Scan (400 ppi, 8- or 24-bit) → Zoning Analysis → OCR → PDF and XML Output
[Flowchart: iterative zoning analysis. From START, each automated attempt is followed by a "Passes AutoQA?" decision; YES leads to STOP, NO leads to another attempt. The attempts shown are:]
• AnalysisNew() / AnalysisAlterRegionPresentationAfterZoning(): mitprocess -o outdir indir
• AnalysisForceManhattanRegions(): mitprocess -rects -o outdir indir
• AnalysisForceRectangularPageRegion(): mitprocess -1bw -o outdir indir
• AnalysisForceManhattanTextAndGrayRegions(): mitprocess -rectgray -o outdir indir
• AnalysisForceGrayRectangularPageRegion(): mitprocess -1gray -o outdir indir
• Use the Ground Truthing application on the file to input the desired regions via an XML file: mitprocess -gtxml -o outdir indir
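The retry logic in this flowchart can be sketched roughly as follows. This is only an illustration: the order of the middle attempts is an assumption (the slide text does not preserve it), and `run_zoning` and `passes_auto_qa` are hypothetical callables standing in for the real mitprocess run and AutoQA check.

```python
from typing import Callable, Optional, Sequence

# Mode flags as they appear on the slide; "" is the default run.
# The order of the middle attempts is an assumption.
AUTOMATED_MODES: Sequence[str] = ("", "-rects", "-rectgray", "-1bw", "-1gray")


def first_passing_mode(
    page: str,
    run_zoning: Callable[[str, str], object],
    passes_auto_qa: Callable[[object], bool],
) -> Optional[str]:
    """Try each automated mode in turn and return the first that passes AutoQA.

    Returns None when every automated attempt fails, meaning the page
    has to go to the ground-truthing application (the -gtxml path).
    """
    for mode in AUTOMATED_MODES:
        result = run_zoning(page, mode)  # hypothetical wrapper around mitprocess
        if passes_auto_qa(result):
            return mode
    return None


# Usage with stand-in callables: only the "-rectgray" attempt passes here.
chosen = first_passing_mode(
    "page0001.tif",
    run_zoning=lambda page, mode: mode,
    passes_auto_qa=lambda result: result == "-rectgray",
)
```

A None result corresponds to the last branch of the flowchart, where a person supplies the regions through an XML file.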
OCR Multiple-Engine Combination
[Diagram: OCR Engine 1, OCR Engine 2 and OCR Engine 3 each feed a "Reorder on BBox" step. The three reordered text streams pass through text alignment and voting to produce the combined text. Other labels in the figure: Text, Image, Grayscale, Color.]
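As a rough illustration of the voting stage, the sketch below takes a per-character majority vote over strings that have already been aligned to the same length. The alignment step itself is not shown, and the function name and tie-breaking rule are assumptions rather than the combination method used in the talk.

```python
from collections import Counter
from typing import List


def vote(aligned: List[str]) -> str:
    """Per-position majority vote over pre-aligned, equal-length strings.

    On a tie, the character from the earliest engine in the list wins,
    because Counter.most_common keeps first-seen order among equal counts.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("inputs must be aligned to the same length")
    return "".join(Counter(chars).most_common(1)[0][0] for chars in zip(*aligned))


# Three hypothetical engine outputs for the same word, each with a different error.
combined = vote(["zoning", "zon1ng", "zoninq"])
```

With three engines, a single engine's mistake at any position is outvoted by the other two, which is the intuition behind combining engines.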
OCR Results
Statistics on 20 CogNet book pages

                               Scansoft    Iris    Abbyy   Combination
Error rate (×10^-3)              11.29     4.50    2.86    1.67 (41.6% drop against Abbyy)
Standard deviation (×10^-3)      16.40     4.56    4.62    3.22 (30.3% drop against Abbyy)
Iris results due to SLP (statistical language processing) techniques to
improve output
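The quoted percentage drops can be checked with a few lines of arithmetic, taking 2.86 and 4.62 as the Abbyy figures and 1.67 and 3.22 as the combination figures. The function name is illustrative only.

```python
def relative_drop(baseline: float, combined: float) -> float:
    """Fractional reduction of `combined` relative to `baseline`."""
    return (baseline - combined) / baseline


# Error rates and standard deviations are in units of 10^-3, as on the slide.
error_drop = relative_drop(2.86, 1.67)  # about 0.416
sd_drop = relative_drop(4.62, 3.22)     # about 0.303
```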
Overall Process: PDF Generation
Zoning-Based PDF: 135 kB
• Zoning Analysis: 1-4 sec
• PDF creation: 1-2 sec
• Compression: 63X
Zoning-Based PDF w/ searchable text (in blue box): 140 kB
• Zoning Analysis: 1-4 sec
• OCR: 2-10 sec
• PDF creation: 1-2 sec
• Compression: 61X
Rationale for the Ground Truth Engine (GTE)
After the automated quality assurance (QA) step, 0.8% of the documents still need some editorial changes. This amounts to nearly 10,000 documents.
Additionally, visual QA is performed through rapid PDF output screening; another 1-2% of pages fail Visual QA.
Nearly 30,000 pages were “imperfect” after QA and needed some editing.
Here is the GTE…!
Ground-Truthing Engine Features
- Simple UI tools for specifying zoning & re-purposing
- Imposed region manager
- Integrated analysis engine matching that in the production (POD) system
- XML schema for zoning & layout description
Java version / .NET version
We use this one to describe the design!
File Support
Standard Java JAI and ImageIO tools
Standard .NET System::Drawing::Bitmap tools
Raster auto-scaled upon opening to fit into screen
Fast resize via command menu
Zoning Engine-Based Region Generation
Regions can be auto-generated with the same zoning engine as the primary POD application to allow direct referencing to the error case reports
Regions can also be generated “on the fly” using HP “Click and Select” technologies
Zoning Engine-Based Region Generation:
Click and Select Capture Model
Statistical Model for Zoning Analysis
If the set of all region classification types is C, where “text” = C_1, “drawing” = C_2, “photo” = C_3, “table” = C_4, etc., then for all C_1 … C_N (N = number of region types possible), each region R_j ∈ ℜ, where ℜ is the set of all M regions formed during segmentation, is assigned probabilities p_j(C_i), such that:

$$\forall R_j \in \Re,\ j = 1 \ldots M: \quad \sum_{i=1}^{N} p_j(C_i) = 1.0$$

That is, the given region R_j has a summed probability of 1.0, which represents its relative probabilities over all region types. The differential statistic, DS, for region x with respect to classification y is given by:

$$DS(x \mid y) = p_x(C_y) - \max_{i \neq y} p_x(C_i)$$
Click and Select UI Tool is
Based on the Statistical Model
Regions can be generated with a “Drag and Click” UI
tool that allows successive clicks to define the region
vertices
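The differential statistic from this slide can be computed directly from a region's class probabilities. The sketch below is a minimal illustration; the probability values are invented for the example, and the dictionary representation is an assumption.

```python
from typing import Dict


def differential_statistic(probs: Dict[str, float], label: str) -> float:
    """DS(x|y): probability of class y minus the largest probability of any other class."""
    others = [p for name, p in probs.items() if name != label]
    return probs[label] - max(others)


# Hypothetical probabilities for one region; they sum to 1.0 as the model requires.
region = {"text": 0.70, "drawing": 0.15, "photo": 0.10, "table": 0.05}
ds_text = differential_statistic(region, "text")    # 0.70 - 0.15, positive
ds_photo = differential_statistic(region, "photo")  # 0.10 - 0.70, negative
```

A positive DS means the class in question is the top-ranked one, and its size shows how far ahead of the runner-up it is; a negative DS means some other class is more probable.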
UI-Based Polygonal Region Generation
Regions can be generated with a “Drag and Click” UI tool that allows successive clicks to define the region vertices
With advanced region management capabilities, this can be used to generate outlining regions
UI-Based Polygon Region Example
After definition of two polygonal text regions
These regions are now individual objects, represented as such in both Java/C# and XML
Objects could be “dragged” with an extension to auto-layout
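To make the idea of a region as an object with a matching XML form concrete, here is a small sketch. The element and attribute names (`region`, `type`, `vertex`, `x`, `y`) are hypothetical; the actual XML schema from the talk is not given in the slides, and the original tool was written in Java and C#, not Python.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """A polygonal region: a classification plus an ordered list of vertices."""
    region_type: str
    vertices: List[Tuple[int, int]]

    def to_xml(self) -> ET.Element:
        # Element and attribute names are illustrative, not the real schema.
        elem = ET.Element("region", {"type": self.region_type})
        for x, y in self.vertices:
            ET.SubElement(elem, "vertex", {"x": str(x), "y": str(y)})
        return elem


region = Region("text", [(10, 10), (200, 10), (200, 80), (10, 80)])
xml_text = ET.tostring(region.to_xml(), encoding="unicode")
```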
Rectangular Region Definition (Rubber-banding)
Regions can be generated with a “Drag and Release” UI tool that allows the user to click to define a corner of the rectangle and then to drag to the far corner of the rectangle and release
Usually set to the default mouse mode
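One detail of rubber-banding is that the user may drag in any direction, so the press and release points have to be normalized into a rectangle. A minimal sketch, with illustrative names:

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def rubber_band(press: Tuple[int, int], release: Tuple[int, int]) -> Rect:
    """Build a normalized rectangle from the mouse-press and mouse-release points."""
    (x0, y0), (x1, y1) = press, release
    return Rect(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# Dragging up and to the left yields the same rectangle as dragging down and right.
rect_a = rubber_band((300, 400), (100, 150))
rect_b = rubber_band((100, 150), (300, 400))
```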
Re-Classification of Regions
Regions can be re-classified with the intuitive “Right Click” + “Pop-Up” motif
For this project, the region types were {Text, Drawing, Photo, Table and Equation}
Rendering of the Classification Types:
• Text at 400 ppi, 1-bit
• Drawing at 400 ppi, 8- or 24-bit
• Table and Equation at 400 ppi, 1-bit
• Photo at 200 ppi, 8- or 24-bit
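The rendering rules above amount to a small lookup table. The sketch below encodes them and derives the downsampling factor relative to the 400 ppi scan; the table layout and function name are assumptions for illustration.

```python
SCAN_PPI = 400  # scan resolution stated earlier in the deck

# region type -> (output ppi, allowed bit depths), as listed on the slide
RENDERING = {
    "Text": (400, (1,)),
    "Drawing": (400, (8, 24)),
    "Table": (400, (1,)),
    "Equation": (400, (1,)),
    "Photo": (200, (8, 24)),
}


def downsample_factor(region_type: str) -> int:
    """Integer factor by which the 400 ppi scan is reduced for this region type."""
    output_ppi, _ = RENDERING[region_type]
    return SCAN_PPI // output_ppi


photo_factor = downsample_factor("Photo")  # 400 // 200
text_factor = downsample_factor("Text")    # 400 // 400
```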
Changing Region Bit Depth
Bit-Depths of the “Region Modalities”:
• BW = Black and White, 1-bit per pixel
• Gray = 8-bit per pixel
• Color = 24-bit per pixel, sRGB
Context-Sensitive Right-Click
Several Context-Sensitive Delete Options for Ease of Editing:
• “Delete All”: Start Over (Often After Rescale)
• “Delete Previous”: Undo Delete
• “Delete Nearest”: For Tricky Layouts (Often Before Rescale)
Region Manager: No Overlap
Regions are Described by BBox, Polygon and Scan Line Segment Representation, Allowing Efficacious “Scale-Up” in Thoroughness of Comparison
Any Overlap Triggered an Error Flag (to Prevent Later Print Job Errors)
Overlap May Be Allowed in Other Publishing Applications, such as VDP
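One way to read the "scale-up in thoroughness" idea is a coarse-to-fine test: reject cheaply on bounding boxes, and only compare scan-line segments when the boxes overlap. The sketch below follows that reading. The data layout (row index mapped to inclusive x-spans) and all names are assumptions, and the intermediate polygon stage is omitted.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]             # inclusive x-range covered on one scan line
ScanLines = Dict[int, List[Span]]  # row index -> spans covered on that row
BBox = Tuple[int, int, int, int]   # left, top, right, bottom (inclusive)


def bbox_of(region: ScanLines) -> BBox:
    xs = [x for spans in region.values() for span in spans for x in span]
    return min(xs), min(region), max(xs), max(region)


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def regions_overlap(a: ScanLines, b: ScanLines) -> bool:
    """Coarse bounding-box rejection first, then an exact scan-line comparison."""
    if not bboxes_overlap(bbox_of(a), bbox_of(b)):
        return False
    for row in a.keys() & b.keys():
        for s0, e0 in a[row]:
            for s1, e1 in b[row]:
                if s0 <= e1 and s1 <= e0:
                    return True
    return False


# An L-shaped region and a block tucked into its corner: boxes overlap, pixels do not.
l_shape = {0: [(0, 5)], 1: [(0, 1)], 2: [(0, 1)]}
corner_block = {1: [(3, 5)], 2: [(3, 5)]}
touching = {0: [(4, 6)]}
```

The L-shape case shows why the bounding box alone is not enough: it would raise an error flag for two regions that share no pixels.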
Region Manager: No Crossing of Line Segments
During the Creation of Vertices
Sequential Vertices are Compared to all Previous Vertices for Intersection
Any Intersecting Vertices Disallow the Region Formation
This Prevents Potential “Region Fragmentation” Upon Scaling to Original Resolution
This Region Management Rule is Generally Applicable to Prevent Ill-Shaped Regions
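The no-crossing rule can be sketched as a check run each time the user clicks a new vertex: the edge from the last vertex to the new one is tested against every existing edge. The sketch uses the standard orientation test and, as a simplification, only detects proper crossings (touching endpoints and collinear overlaps are ignored); the closing edge back to the first vertex is also not checked. Names are illustrative.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def _orient(a: Point, b: Point, c: Point) -> int:
    """Sign of the cross product (b - a) x (c - a): 1, 0 or -1."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (cross > 0) - (cross < 0)


def _properly_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True when segments p1-p2 and q1-q2 cross at a point interior to both."""
    return (_orient(p1, p2, q1) * _orient(p1, p2, q2) < 0
            and _orient(q1, q2, p1) * _orient(q1, q2, p2) < 0)


def can_add_vertex(vertices: List[Point], new: Point) -> bool:
    """Reject a new vertex if the edge leading to it crosses any existing edge."""
    if not vertices:
        return True
    last = vertices[-1]
    return not any(
        _properly_cross(last, new, vertices[i], vertices[i + 1])
        for i in range(len(vertices) - 1)
    )


outline = [(0, 0), (4, 0), (4, 4)]
ok = can_add_vertex(outline, (0, 4))        # continues the square
crossing = can_add_vertex(outline, (2, -2))  # would cut across the first edge
```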
Design Principles
[Diagram labels: SPEED OF ZONING ENGINE; ACCURACY OF ZONING ANALYSIS ENGINE; SIMPLICITY OF UI; GROUND TRUTHING]
The zoning engine was further tuned to run on low-resolution data so that the analysis would run at the same nominal ppi (75) as the later QA proofing
The Ground Truthing Engine is the last step in the otherwise automatic generation of 1.2 million PDFs from TIFF
Smart scaling (integral) is performed on file load, allowing the ready maintenance in most cases of boundary information after scaling
Zoning engine for document analysis matches that of the Ground Truthing Engine for maximum throughput
Redundancy of keys, menus, tool tips for fast training
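Assuming "integral" scaling means shrinking by a whole-number factor, the sketch below picks the smallest integer factor that fits the image on screen and maps coordinates both ways. The function names, image size and screen size are illustrative assumptions, not values from the talk.

```python
import math


def integral_scale_factor(width: int, height: int, max_w: int, max_h: int) -> int:
    """Smallest whole-number shrink factor that fits the image inside max_w x max_h."""
    return max(1, math.ceil(width / max_w), math.ceil(height / max_h))


def to_display(coord: int, factor: int) -> int:
    return coord // factor


def to_original(coord: int, factor: int) -> int:
    return coord * factor


# A hypothetical 3400 x 4400 pixel scan shown on a 1024 x 768 display area.
factor = integral_scale_factor(3400, 4400, 1024, 768)
round_trip = to_original(to_display(1200, factor), factor)
```

With an integer factor, a boundary drawn on screen maps back to the original by plain multiplication, and any original coordinate that is a multiple of the factor survives the round trip exactly; others are off by less than one factor step.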
Throughput Results
• 94.6% of pages passed AutoQA after the original zoning analysis
• 99.2% of pages passed AutoQA after the iterative attempts at logical zoning analysis
• 97.7% of pages passed both AutoQA and VisualQA. Failures in the latter were generally due to bit depth mismatch for poorly-exposed originals
• Thus, 2.3%, or 27,500 pages, needed to be Ground Truthed for full re-purposing (all text OCR’d, all drawings and photos saved in best format, etc.)
• After 30 minutes of training, 2 pages/minute is an easy rate to proof, even while multi-tasking (on phone, email): this means less than 6 human-weeks for the entire set of 1.2 × 10^6 pages
• This is 2 × 10^5 pages “cleaned” per human-week of work
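The throughput figures can be reproduced with simple arithmetic. The 40-hour working week is an assumption; the slides do not define a human-week.

```python
PAGES_TO_GROUND_TRUTH = 27_500   # pages needing manual ground truthing
PAGES_PER_MINUTE = 2             # proofing rate after training
TOTAL_PAGES = 1_200_000          # size of the whole corpus
HOURS_PER_WEEK = 40              # assumption: one human-week is 40 hours

minutes = PAGES_TO_GROUND_TRUTH / PAGES_PER_MINUTE
human_weeks = minutes / 60 / HOURS_PER_WEEK       # about 5.7, under 6
pages_cleaned_per_week = TOTAL_PAGES / 6          # using the 6-week upper bound
```

Under that assumption the manual step takes a little under 6 human-weeks, and dividing the full corpus by 6 weeks gives the quoted 2 × 10^5 pages per human-week.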
Questions…?
