Discrete Point Based Signatures and Applications to Document Matching
Nemanja Spasojevic, Guillaume Poncin, and Dan Bloomberg
September 15th, 2011, Ravenna, Italy
Overview
● Background
● Algorithm
● Applications
○ duplicate page detection
○ text image lookup
● Conclusion
Background
(Duplicate Page Detection)
● Find duplicate pages in a given set of scans
of the same physical book. Assumptions:
○ has to handle a corpus of ~10k text pages at a time
○ pages rich in text
○ < 4° of rotation from image to image
○ some translation
○ needs to be quick
○ simple to use (discrete signatures, for easy indexing /
lookup)
Background (example)
Background
(Image Lookup)
● Evaluate performance in image lookup mode. Test how
robust the algorithm is for a task it was not designed for:
○ index clean images
○ look up by an image taken with a cell phone camera
■ skew
■ rotation
■ blur
■ part of original page
Other Approaches
● Image matching is a well-studied problem
○ SURF, SIFT, FIT work well at point matching across
images and image lookup
○ do not work as well for repetitive patterns such as text
documents
● Document page matching
○ Locally Likely Arrangement Hashing (LLAH), Nakai et al.
■ affine invariant
■ produces thousands of signatures per page
■ precise
■ handles 10k image corpus
Algorithm
Possible inputs:
● raw image (operate on word centroids)
● OCR-ed text with word bounding boxes (operate on word
bounding box center)
● PDF with word bounding box info (operate on word
bounding box center)
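For the OCR and PDF inputs, each word is reduced to a single point. A minimal sketch of that reduction follows; the `(left, top, right, bottom)` box layout and the name `box_centers` are assumptions made for illustration, not taken from the slides.

```python
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom) in page coordinates.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def box_centers(boxes: List[Box]) -> List[Point]:
    """Reduce each word bounding box to its center point."""
    return [((left + right) / 2.0, (top + bottom) / 2.0)
            for (left, top, right, bottom) in boxes]


words = [(10, 20, 50, 32), (60, 20, 120, 32)]
centers = box_centers(words)  # one point per word
```

The resulting point cloud is what the signature generation step would operate on.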
Algorithm (Image Processing)
Signature Generation Algorithm
Signature Instability
Signature is composed of N sub-signatures:
S = [s(0)][s(1)]...[s(N-1)]
Instability of signatures comes from:
● Small shifts may change a discretized angle value
(e.g. s(0) flipping from 13 to 14 due to small word position
shifts)
● The order of sub-signatures may change (s(0) and s(1) swap
because they had almost the same radial distance)
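Both failure modes can be reproduced with a toy model. The sketch below assumes each sub-signature is the discretized angle from an anchor point to one of its nearest neighbours, with neighbours ordered by radial distance. The bin count, the neighbour count, and the name `sub_signatures` are illustrative assumptions, not the authors' exact procedure.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def sub_signatures(anchor: Point, points: List[Point],
                   n: int = 2, bins: int = 16) -> List[int]:
    """Discretized angles to the n nearest neighbours, ordered by distance."""
    ax, ay = anchor
    neighbours = sorted(
        (p for p in points if p != anchor),
        key=lambda p: math.hypot(p[0] - ax, p[1] - ay),
    )[:n]
    sig = []
    for px, py in neighbours:
        angle = math.atan2(py - ay, px - ax) % (2 * math.pi)  # in [0, 2*pi)
        sig.append(int(angle / (2 * math.pi) * bins) % bins)  # angle bin
    return sig


origin = (0.0, 0.0)
# A one-unit vertical shift moves the first neighbour across a bin boundary.
before = sub_signatures(origin, [(10.0, 0.5), (5.0, 20.0)])
after = sub_signatures(origin, [(10.0, -0.5), (5.0, 20.0)])
# A 0.1-unit shift changes which of two near-equidistant neighbours comes first.
order_a = sub_signatures(origin, [(10.0, 0.5), (3.0, 9.6)])
order_b = sub_signatures(origin, [(10.0, 0.5), (3.0, 9.5)])
```

In this toy setup `before` and `after` differ in their first sub-signature, while `order_a` and `order_b` contain the same values in swapped order.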
Signature Filtering Based on Estimated Risk
Superposition of Ambiguous Signature
[s1][{s2, s'2}][s3][s4] => { [s1][s2][s3][s4],
                             [s1][s'2][s3][s4] }
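The expansion above is a Cartesian product over the alternatives at each position. A minimal sketch, assuming each position is given as a list of candidate sub-signature values (the name `expand` is hypothetical):

```python
from itertools import product
from typing import List, Sequence


def expand(ambiguous: Sequence[Sequence[int]]) -> List[List[int]]:
    """Enumerate every concrete signature from per-position alternatives."""
    return [list(combo) for combo in product(*ambiguous)]


# Position 1 is ambiguous between 13 and 14; the others are stable.
signatures = expand([[7], [13, 14], [2], [9]])
# signatures == [[7, 13, 2, 9], [7, 14, 2, 9]]
```

The number of emitted signatures is the product of the alternative counts, so this only stays cheap when few positions are ambiguous.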
Duplicate Page Detection
(metrics)
● The similarity based on signature sets is calculated as:
● The similarity based on matched (aligned) word bounding
boxes:
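As an illustration of a set-based similarity, the sketch below assumes the signature-set measure is a Jaccard index (intersection size over union size), which the J notation used for the example scores suggests. That reading is an assumption, as is the function name `jaccard`.

```python
from typing import Iterable


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard index |A & B| / |A | B| of two signature sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty pages
    return len(set_a & set_b) / len(set_a | set_b)


page_1 = {101, 102, 103}
page_2 = {102, 103, 104}
score = jaccard(page_1, page_2)  # 2 shared out of 4 distinct -> 0.5
```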
Duplicate Page Detection
(example)
Js = 19%, Jb = 93%
Duplicate Page Detection
(example)
Js = 5%, Jb = 37%
Image Lookup
Image Lookup Examples
Image Lookup Result
1M-page index (32-bit signatures) stats:
● 386M signatures total
● stored as a sorted array of <signature, book_id, page_pid, x, y>
entries; fits in ~4 GB of memory
● 0.8% of all signatures filtered out (those repeated on 1k or more
pages)
● each query returns 2000 candidates on average
Index Size [pages]   Accuracy   Signature Size [bits]
4.1k                 0.966      16
4.1k                 0.949      32
1M                   0.871      32
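The sorted-array index described above can be sketched with binary search. This is a minimal illustration, not the production implementation: the helper names and the vote-counting ranking in `rank_pages` are assumptions, while the entry layout and the filter on signatures repeated across many pages follow the stats above.

```python
from bisect import bisect_left
from collections import Counter
from typing import Dict, List, Set, Tuple

# Entry layout from the slide: (signature, book_id, page_pid, x, y).
Entry = Tuple[int, int, int, int, int]


def build_index(entries: List[Entry], max_pages: int = 1000) -> List[Entry]:
    """Sort by signature, dropping signatures seen on max_pages or more pages."""
    pages: Dict[int, Set[Tuple[int, int]]] = {}
    for sig, book, page, _x, _y in entries:
        pages.setdefault(sig, set()).add((book, page))
    return sorted(e for e in entries if len(pages[e[0]]) < max_pages)


def lookup(index: List[Entry], signature: int) -> List[Entry]:
    """Return all entries carrying the given signature via binary search."""
    lo = bisect_left(index, (signature,))
    hi = bisect_left(index, (signature + 1,))
    return index[lo:hi]


def rank_pages(index: List[Entry], query: List[int]):
    """Assumed ranking: count matching signatures per (book_id, page_pid)."""
    votes: Counter = Counter()
    for sig in query:
        for _sig, book, page, _x, _y in lookup(index, sig):
            votes[(book, page)] += 1
    return votes.most_common()


entries = [(5, 1, 1, 0, 0), (5, 1, 2, 0, 0), (9, 1, 1, 3, 4), (7, 2, 1, 1, 1)]
index = build_index(entries, max_pages=2)  # signature 5 is on 2 pages: dropped
ranking = rank_pages(index, [9, 7, 9])
```

For scale, 386M entries in ~4 GB works out to roughly 10 to 11 bytes per entry, so a real index would need a packed binary layout rather than Python tuples.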
Conclusion
● Simple scheme for point cloud based discrete signature
generation
● Filtering based on signature stability estimate
● Superposing signatures
● Duplicate page detection
● Image lookup by cell phone camera image query (87.1%
accuracy for 1M pages indexed)
Q & A
Thank You!
Synthetic Data Evaluation
