Page Layout Analysis of 19th Century Siamese Newspapers using Python and OpenCV
Mark Hollow
PyCon APAC, 2017
about me...
graduated in classical music · self-taught in computing
programming python since 2002 · 20 years working in IT
IT infrastructure · UNIX sysadmin · project management
software engineering · data systems · product management
2
once upon a time...
Dr Dan Beach Bradley - หมอ บรัดเลย
Born 18th July 1804, New York; died 23rd June 1873, Bangkok
Graduated as Doctor of Medicine from New York University
American Protestant missionary in Siam
Arrives in Bangkok on 18th July 1835 from Boston via Singapore
Brings with him the first printing press to Siam
Many notable achievements & firsts in Siam
first surgery, first vaccination, first printed Royal edict, first newspaper, first classified & commercial
advertising, first copyright transaction, first printing of Siamese law, first Siamese translation of the
Old Testament, first monolingual Siamese dictionary
3
the first siamese newspaper
The Bangkok Recorder - หนังสือจดหมายเหตุ
1844–1845 magazine-like, fact-based, introduces western
ideas, knowledge, science and Christianity
1865–1867 more social commentary and introduction of
western liberalism -- rather controversial
a lot of historical information...
thai society (seen from a western perspective)
regional and global news/information
prices of goods, services, imports and exports
4
there is no online
searchable database
of this historical
information.
5
DigitalBangkokRecorder
markhollow.com
digital bangkok recorder project
objectives
scan all the surviving editions
transcribe all text
make all text available online
learn how to do all of this
in this presentation
cleaning scanned images
detecting the page layout
extract all text lines
prepare for transcription
6
page layout
2 column layout
front page:
title & date lines
last page:
tabular data
some illustrations
some full-width
tables
7
a closer look...
large header on cover
dual-language headings
column separator line
topic separators
unique typeface
the first ever thai typeface
now-obsolete characters
not supported by modern ocr
8
basic workflow
1. SCAN
2. CLEAN
3. STRUCTURAL ANALYSIS
4. EXTRACT TEXT
5. TRANSCRIPTION
9
getting started
with opencv
10
what is opencv?
“OpenCV (Open Source Computer Vision Library) is an
open source computer vision and machine learning
software library.”
- opencv.org
Written in C++; bindings for Python and others
v3.2 used here; v3.1 probably works
v2.x won’t work - different API structure
many v2.x blogs/articles still online - beware!
11
opencv basics: installation
$ pip install opencv-python
or
$ pip install opencv-contrib-python
No FFmpeg, GTK or carbon support - limits some features.
Works well in jupyter/ipython.
Non-free patented stuff!!
12
opencv basics: loading/saving images
loading images…
>>> import cv2
>>> img = cv2.imread('image001.jpg')
>>> type(img)
<type 'numpy.ndarray'>
saving images...
>>> cv2.imwrite('newfile.png', img)
OpenCV images are numpy arrays!
All common formats supported.
Extra args supported for image formats.
13
document cleaning.
14
removing background noise (1)
- binarization: set pixel value based on threshold
- types: basic, adaptive
- both need experimentation with threshold value
# cv2.threshold returns (threshold_used, image), in that order
th, bin_image = cv2.threshold(image, 192, 255,
                              cv2.THRESH_BINARY)
bin_image = cv2.adaptiveThreshold(image, 255,
                                  cv2.ADAPTIVE_THRESH_MEAN_C,
                                  cv2.THRESH_BINARY, 101, 2)
15
removing background noise (2)
# th receives the threshold value that Otsu's method selected
th, bin_image = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
otsu binarization tries to find best threshold value
example:
manual threshold guessed at v=192
otsu selects v=177
improvement in number of artifacts
16
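To make "tries to find best threshold value" concrete, here is a NumPy-only sketch of the idea behind Otsu's method: pick the grey level that maximises the variance between the two resulting classes. The function name and the synthetic page are assumptions for illustration; this is not OpenCV's implementation, and its result may differ slightly from cv2.threshold on real scans.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the level t that maximises between-class variance (pixels <= t vs > t)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)                 # weight of the dark class at each t
    w1 = 1.0 - w0                        # weight of the bright class
    cum_mean = np.cumsum(prob * levels)  # unnormalised mean of the dark class
    total_mean = cum_mean[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * w0 - cum_mean) ** 2 / (w0 * w1)
    between[~np.isfinite(between)] = 0.0  # levels where one class is empty
    return int(np.argmax(between))

# Hypothetical page: paper around 190-210, a block of ink around 30-50.
gray = (190 + np.arange(100 * 100) % 21).astype(np.uint8).reshape(100, 100)
gray[40:60, 10:90] = (30 + np.arange(20 * 80) % 21).astype(np.uint8).reshape(20, 80)

t = otsu_threshold(gray)
binary = np.where(gray > t, 255, 0).astype(np.uint8)  # same rule as THRESH_BINARY
```

On a clean two-cluster histogram like this one, any level between the clusters gives the same split; real scans with stains and show-through are where the automatic choice helps.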
removing background noise (3)
* Contrast emphasized for display purposes.
17
structural analysis:
page margins
18
morphological transforms (1)
- erosion: erodes away the
boundaries of foreground object
- dilation: dilates/thickens
boundaries
NOTE: black = background
white = foreground
import numpy as np
kernel = np.ones((5, 5), np.uint8)
# invert so text becomes white foreground, erode, then invert back
new_image = ~cv2.erode(~original_image, kernel, iterations=1)
19
morphological transforms (2)
- opening: erosion+dilation
used for removing noise
- closing: dilation+erosion
closes small holes in objects
import numpy as np
kernel = np.ones((5, 5), np.uint8)
# invert so text becomes white foreground, open, then invert back
img2 = ~cv2.morphologyEx(~img1, cv2.MORPH_OPEN, kernel, iterations=1)
20
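A small NumPy-only sketch of why opening (erosion followed by dilation) removes noise while keeping larger shapes. The helper names and the 12x12 test image are assumptions; the border is zero-padded for simplicity, which differs from OpenCV's default border handling, so the shapes are kept away from the edge.

```python
import numpy as np

def _windows(img, k):
    """Stack the k*k shifted copies of img that make up each pixel's neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="constant", constant_values=0)
    h, w = img.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(k) for dx in range(k)])

def erode(img, k=3):
    return _windows(img, k).min(axis=0)   # minimum over the neighbourhood

def dilate(img, k=3):
    return _windows(img, k).max(axis=0)   # maximum over the neighbourhood

# White (255) foreground on black, i.e. an already inverted scan.
img = np.zeros((12, 12), dtype=np.uint8)
img[3:9, 3:9] = 255      # a 6x6 blob standing in for text
img[10, 10] = 255        # a one-pixel speck of noise

opened = dilate(erode(img))   # opening = erosion, then dilation
```

The speck disappears during erosion and has nothing left to grow back from, while the blob shrinks and is then restored to its original size.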
contours
“a curve joining all the
continuous points having same
color or intensity”
# OpenCV 3.x returns three values; 4.x returns only (contours, hierarchy)
_, contours, hierarchy = cv2.findContours(
    ~img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
col_img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
cv2.drawContours(col_img, contours, -1, (255,0,0), 5)
findContours return values:
contours: list of contours
hierarchy: contour structure
21
finding page margins
1. “open” removes artifacts; “dilate” emphasizes text (result: opened & dilated image)
2. findContours() to group blocks; filter out small contours
3. get margin from contour edges
22
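As a simplified stand-in for the contour-based approach described above, the margins can also be estimated from row and column ink counts. This NumPy sketch swaps in that projection technique rather than findContours(); the function name, the min_ink parameter and the synthetic page are assumptions for illustration.

```python
import numpy as np

def content_box(binary, min_ink=3):
    """Return (top, bottom, left, right) of the rows/columns with at least min_ink foreground pixels."""
    ink = binary > 0
    rows = np.flatnonzero(ink.sum(axis=1) >= min_ink)
    cols = np.flatnonzero(ink.sum(axis=0) >= min_ink)
    if rows.size == 0 or cols.size == 0:
        return None  # blank page
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

# White-on-black page, as after inversion, opening and dilation.
page = np.zeros((100, 80), dtype=np.uint8)
page[20:85, 10:70] = 255   # merged text block
page[2, 3] = 255           # stray speck in the margin

box = content_box(page)    # the speck is ignored because its row and column hold one pixel each
```

The min_ink cut-off plays the role of "filter out small contours": isolated specks in the margin do not widen the detected content area.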
structural analysis:
identify page
sections
23
morphological transforms (revisited)
structuring element (kernel): an array of 1’s & 0’s
it is centred on each pixel in turn; the 1’s select that pixel’s neighbourhood
erode: takes minimum value of the neighbourhood
dilate: takes maximum value of the neighbourhood
a linear structuring element will operate on linear patterns
Dilation example (diagram: kernel, input image, output image)
Diagram: http://docs.opencv.org/3.1.0/d1/dee/tutorial_moprh_lines_detection.html
>>> cv2.getStructuringElement(cv2.MORPH_RECT, (10, 1))
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=uint8)
24
page segmentation
- erode & dilate with a long linear structuring element: extracts lines to a mask (once for horizontal, then for vertical lines)
- findContours() on the mask gets contour coordinates
- draw contour to remove line
- centre line of contours used as page section boundaries
25
page segmentation (full page)
section boundaries
page margin
blank areas from average values of multiple adjacent lines
26
structural analysis:
topic separators
27
template matching: finding objects in an image
result = cv2.matchTemplate(image,
template, cv2.TM_CCOEFF_NORMED)
_, maxval, _, maxloc = cv2.minMaxLoc(result)
a template is a small image segment:
cv2.matchTemplate() returns match scores
28
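To show what the "match scores" are, here is a brute-force NumPy sketch of the normalised correlation coefficient that TM_CCOEFF_NORMED is based on. The function name, the plus-shaped template and the tiny image are assumptions for illustration; cv2.matchTemplate() is far faster and should be used on real pages.

```python
import numpy as np

def match_template_ccoeff_normed(image, template):
    """Score every template-sized window of image against template (1.0 = perfect match)."""
    image = image.astype(np.float64)
    t = template.astype(np.float64)
    t = t - t.mean()
    t_norm = np.sqrt((t ** 2).sum())
    th, tw = t.shape
    out_h = image.shape[0] - th + 1
    out_w = image.shape[1] - tw + 1
    scores = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            win = image[y:y + th, x:x + tw]
            win = win - win.mean()
            denom = t_norm * np.sqrt((win ** 2).sum())
            if denom > 0:                      # flat windows keep a score of 0
                scores[y, x] = (win * t).sum() / denom
    return scores

# Hypothetical topic separator: a small plus-shaped ornament.
template = np.full((5, 7), 255, dtype=np.uint8)
template[2, 1:6] = 0
template[1:4, 3] = 0

image = np.full((30, 40), 255, dtype=np.uint8)
image[5, 2:12] = 0                 # an unrelated dash elsewhere on the page
image[12:17, 20:27] = template     # the ornament, top-left corner at row 12, column 20

scores = match_template_ccoeff_normed(image, template)
best_row, best_col = np.unravel_index(np.argmax(scores), scores.shape)
best_score = scores[best_row, best_col]
```

Note the coordinate order: this sketch reports (row, column), whereas cv2.minMaxLoc() returns locations as (x, y).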
structural analysis complete
- margins identified
- horizontal and vertical lines detected
- original lines removed
- blank areas identified
- removed decorative markers with templates
- use template matching to identify titles
- and therefore page style (e.g. first or other page)
29
structural analysis: first edition
30
extract
text lines
31
extract text lines
THRESHOLD = 248
thresholds = cv2.reduce(
    image,
    1,  # dim=1 reduces each row to one value (a single column); dim=0 gives a single row
    cv2.REDUCE_AVG
) >= THRESHOLD  # True where a row is (nearly) blank
32
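One way the row averages above might be turned into text-line positions is to group consecutive non-blank rows into (start, stop) spans. This is a NumPy sketch under that assumption, using gray.mean(axis=1) in place of cv2.reduce(); the function name and the synthetic column are hypothetical.

```python
import numpy as np

THRESHOLD = 248  # value from the slide: rows at least this bright on average count as blank

def text_line_spans(gray, threshold=THRESHOLD):
    """Return (start, stop) row ranges, stop exclusive, of consecutive non-blank rows."""
    blank = gray.mean(axis=1) >= threshold              # one boolean per row
    is_text = np.concatenate(([False], ~blank, [False]))
    edges = np.flatnonzero(is_text[1:] != is_text[:-1])  # alternating starts and stops
    return [(int(a), int(b)) for a, b in zip(edges[::2], edges[1::2])]

# Synthetic column: white paper with two dark text lines.
column = np.full((60, 100), 255, dtype=np.uint8)
column[10:18, 5:95] = 0
column[30:38, 5:95] = 0

spans = text_line_spans(column)
line_images = [column[a:b] for a, b in spans]   # one cropped image per text line
```

Each cropped strip can then be saved separately as input for transcription.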
workflow: page layout analysis all done!
1. SCAN
2. CLEAN
3. STRUCTURAL ANALYSIS
4. EXTRACT TEXT
5. TRANSCRIPTION
✓ ✓ ✓
33
what’s next?
34
transcription
- transcribe enough text for developing an OCR model
- regular ocr is very inaccurate due to the unique font
- hire typists or amazon mechanical turk
- there are a few problems to solve:
- transcription cost, guidelines needed due to archaic text & unique typeface
- how to develop an OCR system?
- retrain tesseract 4.x’s LSTM (Long Short Term Memory) neural network?
- use tensorflow or similar?
- perhaps that’s my next PyCon presentation!
35
appendix
Not enough time to cover these
topics… :-(
- Removing page frames *
- Skew correction *
- Detecting tables †
- Detecting pictures
* See https://markhollow.com/
† Coming soon
Other resources:
- ocropus / ocropy: python document analysis tools
- scantailor: GUI for cleaning scanned documents
- CE316 / CE866: Computer Vision, University of Essex, UK
http://orb.essex.ac.uk/ce/ce316/
36
in summary...
opencv basics · thresholds · morphological
transformations · contours · masks · template
matching and a little bit of numpy
...plus a practical application
to document layout analysis
37
thank you for listening.
questions?
38
Mark Hollow
markhollow.com
DigitalBangkokRecorder
