Ground-Truthing Engine for Document Processing Quality Assurance
1. A Ground-Truthing Engine for Proofsetting, Publishing, Re-Purposing and Quality Assurance
Steven Simske and Margaret Sturgill
Imaging Systems Laboratory, HP Labs
November 21, 2003
2. Overview of the Digital Library Creation and Fulfillment Process
Paper-based Corpus
POD Server
Workflow Server
Web-Enabled Device
Computer / Browser
Customized Application
Mill Servers
Digital Library Generation: automated conversion of TIFF to PDF and XML
3. MIT Press Digital Library Creation Specifics
• 4000 out-of-print books: 1.2 million+ pages
• 260+ books placed in the CogNet (cognitive network) web community
• The rest available for purchase via the MIT Press Classics site
Scan (400 ppi, 8- or 24-bit)
Zoning Analysis
OCR
PDF and XML Output
[Flowchart: from START, each automated zoning attempt listed below is followed by a "Passes AutoQA?" decision; YES leads to STOP, NO leads to another attempt.]
AnalysisNew(), AnalysisAlterRegionPresentationAfterZoning(): mitprocess -0 outdir indir
AnalysisForceManhattanRegions(): mitprocess -rects -o outdir indir
AnalysisForceManhattanTextAndGrayRegions(): mitprocess -rectgray -0 outdir indir
AnalysisForceRectangularPageRegion(): mitprocess -1bw -0 outdir indir
AnalysisForceGrayRectangularPageRegion(): mitprocess -1gray -0 outdir indir
Use GroundTruthing Application on File to input desired regions via an XML file: mitprocess -gtxml -0 outdir indir
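The retry logic in this flowchart can be sketched as a simple fallback loop. This is only an illustration: the mode flags are copied from the slide, but the order of the attempts is an assumption, and `run_zoning` and `passes_auto_qa` are hypothetical callables standing in for the real zoning engine and AutoQA step.

```python
# Mode flags as they appear on the slide ("" is the default AnalysisNew() run).
# ASSUMPTION: the order of the attempts is a guess; the slide text does not fix it.
MODES = ("", "-rects", "-rectgray", "-1bw", "-1gray")


def zone_with_fallback(page, run_zoning, passes_auto_qa, modes=MODES):
    """Try each zoning mode in turn and stop at the first one that passes AutoQA.

    run_zoning(page, mode) and passes_auto_qa(result) are hypothetical callables.
    Returns (mode, result) on success, or None when every attempt fails and the
    page has to go to the ground-truthing application (the -gtxml path).
    """
    for mode in modes:
        result = run_zoning(page, mode)
        if passes_auto_qa(result):
            return mode, result
    return None
```

A page for which the function returns None is one that needs manual ground truthing.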
4. OCR Multiple-Engine Combination
[Diagram: OCR Engine 1, OCR Engine 2 and OCR Engine 3 each produce Text that is reordered on BBox; the three outputs pass through Text Alignment and Voting to produce the combined Text. Other labels: Image, Grayscale, Color.]
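The voting stage of this diagram can be illustrated with a character-level majority vote. This is a minimal sketch, not the authors' method: it assumes the three engine outputs have already been aligned to equal length, and the `fallback` engine index used when all engines disagree is a hypothetical parameter.

```python
from collections import Counter


def vote(aligned, fallback=0):
    """Majority vote per character position over pre-aligned OCR outputs.

    ASSUMPTION: `aligned` holds equal-length strings (alignment already done).
    When no character wins more than one vote, the character from the engine
    at index `fallback` is used.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("inputs must be aligned to the same length")
    out = []
    for chars in zip(*aligned):
        best, count = Counter(chars).most_common(1)[0]
        out.append(best if count > 1 else chars[fallback])
    return "".join(out)
```

For example, `vote(["zoning", "z0ning", "zonlng"])` recovers "zoning" because each single-engine error is outvoted by the other two engines.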
5. OCR Results
Statistics on 20 CogNet book pages
Statistic (10**(-3)) | Scansoft | Iris | Abbyy | Combination
Error Rate | 11.29 | 4.50 | 2.86 | 1.67 (41.6% drop against Abbyy)
Standard Deviation | 16.40 | 4.56 | 4.62 | 3.22 (30.3% drop against Abbyy)
Iris results are due to SLP (statistical language processing) techniques used to improve output
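The quoted percentage drops can be checked with a short calculation. The sketch below assumes the reading in which Abbyy's error rate is 2.86 and its standard deviation is 4.62 (both in units of 10**(-3)), with the combination at 1.67 and 3.22; the arithmetic supports that reading. The function name `relative_drop` is illustrative only.

```python
def relative_drop(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Values in units of 10**(-3), read from the slide (column assignment assumed).
error_rate_drop = relative_drop(2.86, 1.67)  # combination vs. Abbyy error rate
std_dev_drop = relative_drop(4.62, 3.22)     # combination vs. Abbyy std. deviation
print(f"{error_rate_drop:.1%}, {std_dev_drop:.1%}")
```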
6. Overall Process: PDF Generation
Zoning-Based PDF: 135 kB. Compression: 63X. Zoning Analysis: 1-4 sec; PDF creation: 1-2 sec
Zoning-Based PDF w/ searchable text (in blue box): 140 kB. Compression: 61X. Zoning Analysis: 1-4 sec; OCR: 2-10 sec; PDF creation: 1-2 sec
7. Rationale for the Ground Truth Engine (GTE)
After the automated quality assurance (QA) step, 0.8% of the documents still need some editorial changes
This amounts to nearly 10,000 documents
Additionally, visual QA is performed through rapid PDF output screening; another 1-2% of pages fail Visual QA
Nearly 30,000 pages were “imperfect” after QA, and needed some editing
Here is the GTE…!
8. Ground-Truthing Engine Features
- Simple UI tools for specifying zoning & re-purposing
- Imposed region manager
- Integrated analysis engine matching that in the production (POD) system
- XML schema for zoning & layout description
[Screenshots: Java version, .NET version]
We use this one to describe the design!
9. File Support
Standard Java JAI and ImageIO tools
Standard .NET System::Drawing::Bitmap tools
Raster auto-scaled upon opening to fit into screen
Fast resize via command menu
10. Zoning Engine-Based Region Generation
Regions can be auto-generated with the same zoning engine as the primary POD application to allow direct referencing to the error case reports
Regions can also be generated “on the fly” using HP “Click and Select” technologies
11. Zoning Engine-Based Region Generation: Click and Select Capture Model
Statistical Model for Zoning Analysis
If the set of all region classification types is C, where “text” = $C_1$, “drawing” = $C_2$, “photo” = $C_3$, “table” = $C_4$, etc., then for all $C_1 \ldots C_N$ (N = number of region types possible), each region $R_j \in \Re$, where $\Re$ is the set of all M regions formed during segmentation, is assigned probabilities $p_j(C_i)$, such that:
$$\forall R_j \in \Re,\ j = 1 \ldots M: \quad \sum_{i=1}^{N} p_j(C_i) = 1.0$$
That is, the given region $R_j$ has a summed probability of 1.0, which represents its relative probabilities over all region types. The differential statistic, DS, for region x with respect to classification y is given by:
$$DS(x|y) = p_x(C_y) - \max_{i \neq y} p_x(C_i)$$
Click and Select UI Tool is Based on the Statistical Model
Regions can be generated with a “Drag and Click” UI tool that allows successive clicks to define the region vertices
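The differential statistic defined on this slide can be computed directly from a region's class probabilities. The sketch below is an illustration under stated assumptions: the dictionary representation and the function name `differential_statistic` are hypothetical, and the probabilities in the usage note are made-up example values, not data from the project.

```python
def differential_statistic(probs, y, tol=1e-9):
    """DS(x|y) = p_x(C_y) - max over i != y of p_x(C_i).

    `probs` maps each region type to its probability for one region x and is
    expected to sum to 1.0, as the statistical model requires.
    """
    if abs(sum(probs.values()) - 1.0) > tol:
        raise ValueError("class probabilities must sum to 1.0")
    others = [p for label, p in probs.items() if label != y]
    if not others:
        raise ValueError("need at least two region types")
    return probs[y] - max(others)
```

With example probabilities of 0.7 for text, 0.2 for drawing and 0.1 for photo, DS for "text" is 0.5 and DS for "drawing" is -0.5: a positive value means the class is the top candidate, and its magnitude is the margin over the runner-up.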
12. UI-Based Polygonal Region Generation
Regions can be generated with a “Drag and Click” UI tool that allows successive clicks to define the region vertices
With advanced region management capabilities, this can be used to generate outlining regions
13. UI-Based Polygon Region Example
After definition of two polygonal text regions
These regions are now individual objects, represented as such in both Java/C# and XML
Objects could be “dragged” with an extension to auto-layout
14. Rectangular Region Definition (Rubber-banding)
Regions can be generated with a “Drag and Release” UI tool that allows the user to click to define a corner of the rectangle and then to drag to the far corner of the rectangle and release
Usually set to the default mouse mode
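Rubber-banding reduces to normalizing the press and release points into a bounding box, since the user may drag in any direction. The sketch below is illustrative only; the function name `rubber_band` and the (left, top, right, bottom) tuple layout are assumptions, not the tool's actual data model.

```python
def rubber_band(press, release):
    """Return (left, top, right, bottom) for a drag from `press` to `release`.

    Both arguments are (x, y) points; the drag direction does not matter.
    """
    (x0, y0), (x1, y1) = press, release
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))
```

Dragging from the lower-right corner to the upper-left corner yields the same rectangle as the opposite drag.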
15. Re-Classification of Regions
Regions can be re-classified with the intuitive “Right Click” + “Pop-Up” motif
For this project, the region types were {Text, Drawing, Photo, Table and Equation}
Rendering of the Classification Types:
Text at 400 ppi, 1-bit
Drawing at 400 ppi, 8- or 24-bit
Table and Equation at 400 ppi, 1-bit
Photo at 200 ppi, 8- or 24-bit
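The rendering rules listed here can be captured as a small lookup table. This is a sketch under assumptions: the resolutions and bit depths are those on the slide, while the `RENDERING` dictionary, the `rendering_for` function and the choice of the first listed depth as the default are hypothetical.

```python
# Region type -> (resolution in ppi, allowed bit depths), as listed on the slide.
RENDERING = {
    "Text": (400, (1,)),
    "Drawing": (400, (8, 24)),
    "Table": (400, (1,)),
    "Equation": (400, (1,)),
    "Photo": (200, (8, 24)),
}


def rendering_for(region_type, bit_depth=None):
    """Return (ppi, bit_depth) for a region type.

    ASSUMPTION: when no bit depth is given, the first allowed depth is used.
    """
    ppi, depths = RENDERING[region_type]
    if bit_depth is None:
        bit_depth = depths[0]
    if bit_depth not in depths:
        raise ValueError(f"{region_type} cannot be rendered at {bit_depth}-bit")
    return ppi, bit_depth
```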
16. Changing Region Bit Depth
Bit-Depths of the “Region Modalities”:
BW = Black and White, 1-bit per pixel
Gray = 8-bit per pixel
Color = 24-bit per pixel, sRGB
17. Context-Sensitive Right-Click
Several Context-Sensitive Delete Options for Ease of Editing
“Delete All”: Start Over (Often After Rescale)
“Delete Previous”: Undo Delete
“Delete Nearest”: For Tricky Layouts (Often Before Rescale)
18. Region Manager: No Overlap
Regions are Described by BBox, Polygon and Scan Line Segment Representation, Allowing Efficacious “Scale-Up” in Thoroughness of Comparison
Any Overlap Triggered an Error Flag (to Prevent Later Print Job Errors)
Overlap May Be Allowed in Other Publishing Applications, such as VDP
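One way to read the "scale-up in thoroughness" idea is a cheap bounding-box test followed, only when needed, by an exact scan-line comparison. The sketch below assumes that reading; the region layout (a dict with a `bbox` tuple and `runs` mapping each scan line y to half-open x intervals) and all function names are hypothetical, and the polygon stage is omitted.

```python
def bboxes_overlap(a, b):
    """Half-open (left, top, right, bottom) boxes overlap if they share area."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def runs_overlap(runs_a, runs_b):
    """Exact test: do any scan-line segments on the same row intersect?"""
    for y, segs_a in runs_a.items():
        for a0, a1 in segs_a:
            for b0, b1 in runs_b.get(y, []):
                if a0 < b1 and b0 < a1:
                    return True
    return False


def regions_overlap(a, b):
    """Cheap bounding-box rejection first, then the thorough scan-line test."""
    if not bboxes_overlap(a["bbox"], b["bbox"]):
        return False
    return runs_overlap(a["runs"], b["runs"])
```

An L-shaped region and a rectangle sitting in its notch have overlapping bounding boxes but no overlapping scan-line segments, so only the second stage correctly reports no overlap.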
19. Region Manager: No Crossing of Line Segments During the Creation of Vertices
Sequential Vertices are Compared to all Previous Vertices for Intersection
Any Intersecting Vertices Disallow the Region Formation
This Prevents Potential “Region Fragmentation” Upon Scaling to Original Resolution
This Region Management Rule is Generally Applicable to Prevent Ill-Shaped Regions
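The no-crossing rule can be sketched with a standard orientation-based segment intersection test applied each time a vertex is added. This is an illustration, not the engine's code: the function names are hypothetical, only proper crossings are detected (collinear touching is ignored), and the closing edge back to the first vertex is not checked.

```python
def _orient(p, q, r):
    """Cross product sign: >0 if r is left of the line p->q, <0 if right."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(p1, p2, p3, p4):
    """True if segment p1-p2 properly crosses segment p3-p4."""
    d1 = _orient(p3, p4, p1)
    d2 = _orient(p3, p4, p2)
    d3 = _orient(p1, p2, p3)
    d4 = _orient(p1, p2, p4)
    return d1 * d2 < 0 and d3 * d4 < 0


def can_add_vertex(vertices, new):
    """Check the edge from the last vertex to `new` against all earlier edges.

    The edge that ends at the last vertex is skipped because it necessarily
    shares an endpoint with the new edge.
    """
    if len(vertices) < 2:
        return True
    last = vertices[-1]
    for a, b in zip(vertices[:-2], vertices[1:-1]):
        if segments_cross(a, b, last, new):
            return False
    return True
```

A click that would make the outline cross itself is rejected before the region is formed.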
20. Design Principles
SPEED OF ZONING ENGINE
ACCURACY OF ZONING ANALYSIS ENGINE
SIMPLICITY OF UI
GROUND TRUTHING
The zoning engine was further tuned to run on low-resolution data so that the analysis would run at the same nominal ppi (75) as the later QA proofing
The Ground Truthing Engine is the last step in the otherwise automatic generation of 1.2 million PDFs from TIFF
Smart scaling (integral) is performed on file load, allowing the ready maintenance in most cases of boundary information after scaling
Zoning engine for document analysis matches that of the Ground Truthing Engine for maximum throughput
Redundancy of keys, menus, tool tips for fast training
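The benefit of integral scaling can be shown with a small sketch. It assumes that "integral" means the raster is reduced by a whole-number factor on load, so that a region boundary drawn on screen maps back to the original resolution by plain multiplication; the function names and the screen size in the usage note are hypothetical.

```python
import math


def integral_scale_factor(img_w, img_h, screen_w, screen_h):
    """Smallest whole-number reduction factor that makes the image fit the screen."""
    return max(1, math.ceil(img_w / screen_w), math.ceil(img_h / screen_h))


def to_original(point, factor):
    """Map a screen coordinate back to the original raster exactly."""
    x, y = point
    return (x * factor, y * factor)


def to_screen(point, factor):
    """Map an original-raster coordinate to the scaled view (floor division)."""
    x, y = point
    return (x // factor, y // factor)
```

For a 3400 x 4400 pixel scan shown on a 1024 x 768 display, the factor is 6, and a vertex placed at (100, 50) on screen maps to (600, 300) in the original with no rounding.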
21. Throughput Results
• 94.6% of pages passed AutoQA after the original zoning analysis
• 99.2% of pages passed AutoQA after the iterative attempts at logical zoning analysis
• 97.7% of pages passed both AutoQA and VisualQA. Failures in the latter were generally due to bit depth mismatch for poorly-exposed originals
• Thus, 2.3%, or 27,500 pages needed to be Ground Truthed for full re-purposing (All text OCR’d, all drawings and photos saved in best format, etc.)
• After 30 minutes of training, 2 pages/minute is an easy rate to proof, even while multi-tasking (on phone, email): this means less than 6 human-weeks for the entire set of 1.2 × 10^6 pages
• This is 2 × 10^5 pages “cleaned” per human-week of work
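The effort estimate can be reproduced with a short calculation. The page counts and proofing rate come from the slide; the 40-hour working week is an assumption, and the function name `human_weeks` is illustrative.

```python
def human_weeks(pages, pages_per_minute=2.0, hours_per_week=40.0):
    """Proofing effort in human-weeks (ASSUMPTION: a 40-hour working week)."""
    return pages / pages_per_minute / 60.0 / hours_per_week


weeks = human_weeks(27_500)          # pages that need ground truthing
pages_per_week = 1.2e6 / weeks       # whole corpus "cleaned" per human-week
print(f"{weeks:.2f} human-weeks, {pages_per_week:.3g} pages per human-week")
```

Under these assumptions the result is a little under 6 human-weeks and roughly 2 × 10^5 pages per human-week, consistent with the figures on the slide.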