The document discusses machine learning and document analysis using neural networks. It begins with an overview of the nearest neighbor method and how neural networks perform similarity-based classification and feature extraction. It then explains how neural networks work by calculating inner products between input and weight vectors. The document outlines how repeating these feature extraction layers allows the network to learn more complex patterns and separate classes. It provides examples of convolutional neural networks for tasks like document image analysis and discusses techniques for training networks and visualizing their representations.
Machine learning for document analysis and understanding
1.
Machine learning for document analysis and understanding
TC10/TC11 Summer School on Document Analysis: Traditional Approaches and New Trends
@La Rochelle, France. 8:30-10:30, 4th July 2018
Seiichi Uchida, Kyushu University, Japan
2.
The Nearest Neighbor Method
The simplest ML for pattern recognition; everything starts from it!

3.
The nearest neighbor method: learning = memorizing
Reference patterns: Pork, Beef, Orange, Watermelon, Pineapple, Fish
Which reference pattern is the most similar to the input?
4.
Each pattern is represented as a feature vector
Example: Pork = (10, 2.5, 4.3) in a space spanned by a color feature, a texture feature, and so on. (Those numbers are just a random example.)
Note: In the classical nearest neighbor method, those features are designed by humans.

5.
A different pattern becomes a different feature vector
Pork = (10, 2.5, 4.3), Beef = (8, 2.6, 0.9). (Those numbers are just a random example.)
7.
An input pattern in the feature vector space
We want to recognize this input x, plotted on the same color and texture feature axes.

8.
Nearest neighbor method in the feature vector space
The reference pattern nearest to the input x is the orange, so input = orange.
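The memorize-and-compare procedure above can be sketched in a few lines of NumPy; the reference patterns and feature values below are invented for illustration, just like the slide's numbers:

```python
import numpy as np

# Hypothetical reference patterns in a 3-D feature space
# (color feature, texture feature, ...), in the spirit of the slide's examples.
references = {
    "pork":   np.array([10.0, 2.5, 4.3]),
    "beef":   np.array([8.0, 2.6, 0.9]),
    "orange": np.array([1.0, 9.0, 5.0]),
}

def nearest_neighbor(x):
    # Learning = memorizing: classify by the closest stored reference pattern.
    return min(references, key=lambda k: np.linalg.norm(x - references[k]))

print(nearest_neighbor(np.array([1.2, 8.5, 5.1])))  # prints "orange"
```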
9.
How do you define "the nearest neighbor"?
Distance-based: the smallest distance gives the nearest neighbor.
Ex. • Euclidean distance ||x − y||
Similarity-based: the largest similarity gives the nearest neighbor.
Ex. • Inner product • Cosine similarity
10.
Do you remember an important property of "inner product"?
If x and y point in similar directions, their inner product becomes larger.
The inner product evaluates the similarity between x and y.
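This property is easy to check numerically; the vectors here are arbitrary examples:

```python
import numpy as np

x = np.array([2.0, 1.0])
y_similar = np.array([4.0, 2.0])      # same direction as x
y_dissimilar = np.array([-1.0, 2.0])  # orthogonal to x

# Vectors pointing in similar directions give a larger inner product.
print(x @ y_similar)     # 10.0
print(x @ y_dissimilar)  # 0.0

def cos_sim(a, b):
    # Cosine similarity normalizes away vector lengths, keeping only direction.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```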
11.
Well, two different types of features (note: important to understand deep learning)
Features defined by the pattern itself:
orange pixels → many; blue pixels → rare; roundness → high; symmetry → high; texture → fine; …
Features defined by the similarity to others:
similarity to "car" → low; similarity to "apple" → high; similarity to "monkey" → low; similarity to "Kaki" (persimmon) → very high; …

12.
The nearest neighbor method with similarity-based feature vectors
Axes: similarity to "Kaki", similarity to "car".
Important note: similarity is used not only for feature extraction but also for classification.
13.
A shallow explanation of neural networks
Don't think it is a black box. If you know "inner product", it becomes easy to understand.
15.
From reality to computational model
(Figure: a biological neuron, and its computational model with inputs x_1 … x_j … x_d, weights w₁ … w_j … w_d, a non-linear function f, and output g(x).)
https://commons.wikimedia.org/
16.
The neuron by computer
g(x) = f( Σ_{j=1..d} w_j x_j + b ) = f( wᵀx + b ),   f: non-linear function
The inputs x_1 … x_d are weighted by w₁ … w_d, summed together with a bias b, and passed through f.
17.
The neuron by computer
Let's forget the non-linear function f for a moment.

18.
The neuron by computer
Without f, the neuron computes g(x) = Σ_{j=1..d} w_j x_j + b. Let's also forget the bias b.
19.
The neuron by computer
g(x) = Σ_{j=1..d} w_j x_j = wᵀx: just the "inner product" of the two vectors w and x.
20.
So, a neuron calculates wᵀx: a similarity between w and x.
For example, wᵀx = 0.9 if they are similar, and wᵀx = 0.02 if they are dissimilar.
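As a sketch, with made-up weights and tanh standing in for the unspecified non-linearity f:

```python
import numpy as np

def neuron(w, b, x, f=np.tanh):
    # g(x) = f(w^T x + b); tanh is just a stand-in for the non-linear function f.
    return f(w @ x + b)

w = np.array([0.6, 0.8])
# Stripped of f and b, the core computation is the inner product w^T x,
# a similarity score between the weight vector w and the input x:
print(w @ np.array([0.6, 0.8]))   # aligned input: large score (1.0)
print(w @ np.array([0.8, -0.6]))  # orthogonal input: score 0.0
```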
21.
So, if we have K neurons, we have a K-dimensional similarity-based feature vector
(w₁ᵀx, w₂ᵀx, …, w_Kᵀx), e.g., (0.9, 0.05, …, 0.75).
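Stacking the K inner products is a single matrix-vector product; the weights below are arbitrary illustrative values:

```python
import numpy as np

# K = 3 neurons, each with a d = 4 dimensional weight vector w_k (one row each).
W = np.array([[0.5, 0.1, 0.0, 0.2],
              [0.0, 0.9, 0.1, 0.0],
              [0.3, 0.0, 0.7, 0.1]])
x = np.ones(4)

# The K inner products w_k^T x, computed at once: a K-dimensional
# similarity-based feature vector.
features = W @ x
print(features)  # [0.8 1.  1.1]
```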
23.
Another function of the inner product: similarity-based classification!
(Yes, the nearest neighbor method!)
Here the weight vector is the reference pattern of class k, and the inner product scores how similar the input x is to that class.
24.
Note: multiple functions are realized just by combining neurons!
Just by layering the neuron elements, we can have a complete recognition system:
feature extraction with weights w₁, w₂, …, w_K, then classification with weights V_A, V_B, V_C giving the similarity to class A, B, and C, then choose the max.
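A minimal sketch of this layered system, with random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n_classes = 8, 5, 3

W = rng.normal(size=(K, d))          # feature-extraction weights w_1 ... w_K
V = rng.normal(size=(n_classes, K))  # class weights V_A, V_B, V_C

def recognize(x):
    features = W @ x               # one inner product (similarity) per neuron
    scores = V @ features          # similarity to each class
    return int(np.argmax(scores))  # choose max

print(recognize(rng.normal(size=d)))
```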
25.
Now the time for deep neural networks
The inputs x_1 … x_d pass through repeated feature extraction layers, each with a non-linear function f, followed by classification.
26.
An example: AlexNet
A "deep" neural network called AlexNet [A. Krizhevsky+, NIPS2012]: feature extraction layers followed by classification layers.
27.
Now the time for deep neural networks
Why do we need to repeat feature extraction?
28.
Why do we need to repeat feature extraction?
(Figure: patterns from six classes A–F, arranged as a difficult classification task.)

29.
Why do we need to repeat feature extraction?
(Figure: the same six classes A–F with two weight vectors w₁ and w₂.)
30.
Why do we need to repeat feature extraction?
Mapping each pattern to its similarities to w₁ and w₂ rearranges the classes: for example, F has a large similarity to w₁ while A has a small similarity to w₁.
Note: The lower picture is not very accurate, because it uses a distance-based rather than an inner-product-based space transformation. However, I believe it does not seriously damage the explanation here.
31.
Why do we need to repeat feature extraction?
In the similarity space (similarity to w₁ vs. similarity to w₂), the classes become more separable, but still not very separable.
32.
Why do we need to repeat feature extraction?
Apply a second feature extraction, with new weight vectors w₃ and w₄, in the similarity space.

33.
Why do we need to repeat feature extraction?
After the second mapping (similarity to w₃ vs. similarity to w₄), the classes A–F are rearranged again.
35.
Why do we need to repeat feature extraction?
After the repeated mappings, the two classes become totally separable by a classifier with weights v₁ and v₂.
37.
The typical non-linear function: rectified linear function (ReLU)
The neuron computes g(x) = f(wᵀx + b) with f(a) = max(0, a).
38.
How does ReLU affect the similarity-based feature?
Negative elements in the feature vector (w_kᵀx < 0) are forced to be zero; the other elements are unchanged.
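A one-line sketch of this effect, on invented similarity scores:

```python
import numpy as np

def relu(a):
    # Rectified linear function: negatives forced to zero, positives unchanged.
    return np.maximum(0.0, a)

similarities = np.array([0.9, -0.3, 0.05, -1.2])  # w_k^T x for K = 4 neurons
print(relu(similarities))  # [0.9  0.   0.05 0.  ]
```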
39.
How to train neural networks: a super-superficial explanation

40.
In order to realize a DNN with an expected "input-output" relation
All the parameters (the feature extraction weights w₁, w₂, …, w_K and the classification weights V_A, V_B, V_C) should be tuned.
41.
Training DNN; the goal
Turn the "knobs" of the DNN until it realizes a perfect classification boundary between class A and class B.
Note: the actual number of knobs (= #parameters) is far larger than in this illustration.
43.
Advanced topic: Why does (SGD-based) back-propagation work?
Much theoretical research has been done [Choromanska+, PMLR2015][Wu+, arXiv2017]. Under several assumptions, local minima are close to the global minimum, and the loss surface has a flat basin.
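As a toy illustration of the knob-turning idea, here is gradient descent on a single logistic neuron for a synthetic two-class problem (full-batch for brevity, where true SGD would use mini-batches; the data, learning rate, and step count are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated 2-D clusters: class 0 around (-2,-2), class 1 around (2,2).
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(200):                          # each step nudges every "knob" downhill
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted probability of class 1
    grad_w = X.T @ (p - y) / len(y)           # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5) == y)
print(accuracy)
```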
44.
Knob = weight = a pattern for similarity-based feature extraction
The pattern that the input is compared to is derived automatically through training.

45.
Optimal features are extracted automatically through training (representation learning)
Example: Google's cat (https://googleblog.blogspot.jp/2012/06/). The patterns that the similarities are computed against are determined automatically.
46.
DNN for image classification: convolutional neural networks (CNN)

47.
How to deal with images by DNN?
Treating the image as one huge vector x (e.g., 400-million-dimensional) and computing w_kᵀx with an equally huge w_k causes ① intractable computations and ② enormous numbers of parameters.
48.
Convolution = repeating "local inner product" operations = linear filtering
Each output value is w_kᵀx_{i,j}: the inner product of a low-dimensional weight vector w_k with the local patch x_{i,j}. This yields ① tractable computations and ② a trainable number of parameters.
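A minimal sketch of convolution as repeated local inner products (strictly cross-correlation, as in most deep learning frameworks; the image and kernel are toy values):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the small weight kernel w_k over the image and take an inner
    # product with the local patch x_{i,j} at each position.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local patch x_{i,j}
            out[i, j] = np.sum(patch * kernel)  # inner product w_k^T x_{i,j}
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.ones((2, 2)) / 4.0   # a tiny averaging filter: very few parameters
result = conv2d_valid(image, kernel)
print(result)
```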
53.
Application to DAR: detecting a component in a character image
Q: Can CNN detect complex multi-part components accurately? [Iwana+, ICDAR2017]
56.
CNN can be used as a feature extractor
Keep the feature extraction layers and discard the classification layers; the extracted features work great with another classifier (e.g., SVM or LSTM), an anomaly detector, or clustering.
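A sketch of the idea with a tiny fully-connected stand-in for a trained CNN: run the forward pass, discard the classification output, and keep the intermediate features (all weights here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 8))  # feature-extraction layer (random placeholder)
W2 = rng.normal(size=(3, 16))  # classification layer (to be discarded)

def forward(x):
    h = np.maximum(0.0, W1 @ x)  # feature extraction with ReLU
    return h, W2 @ h             # features, class scores

x = rng.normal(size=8)
features, class_scores = forward(x)

# Discard class_scores and reuse the 16-D features for another task,
# e.g. nearest-neighbor retrieval, anomaly detection, or clustering.
print(features.shape)  # (16,)
```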
57.
The current CNN does not "understand" characters yet
Adversarial examples [Abe+, unpublished], motivated by [Nguyen+, CVPR2015].
(Figure: likelihood values for classes "A" and "B".)

58.
On the other hand, CNN can learn "math operation" through images
Given input images, the network outputs an "image" showing their sum [Hoshen+, AAAI2016].
59. 5959
Visualization for deep learning:
DeCAF [Donahue+, arXiv 2013]
Visualizing the pattern distribution at each
layer
Near to the input layer Near to the output layer
60. 6060
Visualization for deep learning:
DeepDream and its relations
Finding an input image that excites a neuron
at a certain layer
https://distill.pub/2017/feature-visualization/
61. 61
Visualization for deep learning:
Layer-wise Relevance Propagation (LRP)
Finding the pixels that contribute to the final decision by a backward process
http://www.explain-ai.org/
62. 62
Visualization for deep learning:
Local sensitivity analysis by making a hole
[Ide+, unpublished], motivated by [Zeiler+, arXiv, 2013]
The likelihood of class “0” degrades a lot when a hole is made around an important pixel
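The hole-making analysis can be sketched as an occlusion loop; `score` below is a stand-in for a CNN’s class-likelihood output, not the authors’ actual model.

```python
def occlusion_map(image, score, hole=1, fill=0.0):
    """Local sensitivity analysis: cover each region with a small
    'hole' and record how much the class likelihood drops.
    `score` stands in for the CNN's class-likelihood output."""
    H, W = len(image), len(image[0])
    base = score(image)
    drops = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            patched = [row[:] for row in image]
            for a in range(max(0, i - hole), min(H, i + hole + 1)):
                for b in range(max(0, j - hole), min(W, j + hole + 1)):
                    patched[a][b] = fill  # make the hole
            drops[i][j] = base - score(patched)  # large drop = important pixel
    return drops

# Toy "classifier": likelihood proportional to the brightness of the centre pixel.
score = lambda im: im[1][1]
img = [[0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0]]
drops = occlusion_map(img, score)
print(drops[1][1])  # occluding the centre destroys the score: 1.0
```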
63. 63
Visualization for deep learning:
Grad-CAM [Selvaraju+, arXiv2016]
Finding the pixels that contribute to the final decision by a backward process
http://gradcam.cloudcv.org/
66. 66
Autoencoder (= nonlinear principal component analysis)
Training the network to output its own input; the bottleneck gives a compact representation of the input
Application: denoising by a convolutional autoencoder
(images: Wikipedia; https://blog.sicara.com/keras-tutorial-content-based-image-retrieval-convolutional-denoising-autoencoder-dc91450cc511)
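A minimal sketch of the “train the network to output the input” idea, assuming a linear 2-D → 1-D → 2-D model trained by plain SGD (with a linear code this reduces to PCA; the slide’s convolutional denoising version adds nonlinearity and convolution).

```python
import random

# Data near a line: a 1-D code should suffice to reconstruct it.
random.seed(0)
data = [[t, 2.0 * t] for t in [random.uniform(-1, 1) for _ in range(50)]]

we = [0.1, 0.2]  # encoder weights (2-D input -> 1-D code)
wd = [0.1, 0.2]  # decoder weights (1-D code -> 2-D output)
lr = 0.05

def loss(xs):
    total = 0.0
    for x in xs:
        z = we[0] * x[0] + we[1] * x[1]   # compact 1-D representation
        xhat = [wd[0] * z, wd[1] * z]     # reconstruction of the input
        total += (x[0] - xhat[0]) ** 2 + (x[1] - xhat[1]) ** 2
    return total / len(xs)

before = loss(data)
for _ in range(200):
    for x in data:
        z = we[0] * x[0] + we[1] * x[1]
        xhat = [wd[0] * z, wd[1] * z]
        err = [xhat[0] - x[0], xhat[1] - x[1]]
        # SGD on the squared reconstruction error
        gz = err[0] * wd[0] + err[1] * wd[1]
        wd = [wd[0] - lr * err[0] * z, wd[1] - lr * err[1] * z]
        we = [we[0] - lr * gz * x[0], we[1] - lr * gz * x[1]]
after = loss(data)
print(before > after)  # training shrinks the reconstruction error: True
```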
71. 71
Note: Deep Image Prior [Ulyanov+, CVPR2018]
The conv-deconv structure has an inherent characteristic that suits image completion and other “low-pass” operations
[Figure: a conv-deconv net is trained just to generate the left image, but it results in the right image.]
72. 72
Generative Adversarial Networks (GAN)
The battle of two neural networks:
the Generator generates “fake bills” VS the Discriminator discriminates fake bills from real ones
The fake bills become more and more realistic
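The generator-vs-discriminator battle can be sketched through the standard GAN losses; the one-number “bill” setup below is purely illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Toy setup: real "bills" are numbers near 1, the generator forges a
# number `fake`, and the discriminator is a logistic score sigmoid(d * x).
def d_loss(d, real, fake):
    # discriminator: classify the real sample as 1 and the fake as 0
    return -(math.log(sigmoid(d * real)) + math.log(1.0 - sigmoid(d * fake)))

def g_loss(d, fake):
    # generator: fool the discriminator into scoring the fake as real
    return -math.log(sigmoid(d * fake))

d, real = 2.0, 1.0
# A crude forgery is easy to detect; a realistic one fools D, so the
# generator's loss shrinks as its output approaches the real data.
print(g_loss(d, fake=0.1) > g_loss(d, fake=0.9))  # True
```

In a full GAN the two losses are minimized alternately (D’s parameters on `d_loss`, G’s on `g_loss`), which is what drives the fakes to become more realistic.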
75. 75
Huge variety of GANs — just several examples:
standard GAN (DCGAN), StackGAN, CycleGAN, and conditional GAN (conditioned on the class)
https://www.slideshare.net/YunjeyChoi/generative-adversarial-networks-75916964
79. 79
SSD (Single Shot MultiBox Detector)
A fully-convolutional net that outputs bounding boxes
[Liu+, ECCV2016]
80. 80
Application to DAR:
EAST: An Efficient and Accurate Scene Text Detector [Zhou+, CVPR2017]
Evaluating bounding box shape
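Evaluating a predicted box against a ground-truth box is commonly scored by intersection-over-union; this is a generic sketch of that score, not EAST’s exact geometry loss.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2),
    the standard score for judging whether a predicted text box
    matches a ground-truth box."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih                                  # overlap area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # half-overlapping boxes: 1/3
```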
82. 82
LSTM (Long short-term memory):
A recurrent neural network
Recurrent structure → information from all the past
Gate structure → active selection of information
[Figure: input vectors enter a recurrent layer that produces output vectors at every time step.]
Also very effective for solving the vanishing gradient problem in the t-direction
[Graves+, TPAMI2009]
83. 83
Recurrent NN: a recurrent structure carries information from all the past
LSTM NN: adds a gate structure (input gate, forget gate, output gate) for active selection of information
[Figure: both nets map an input to an output; the LSTM cell inserts the three gates on the recurrent path.]
[Graves+, TPAMI2009]
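One time step of the gate structure can be sketched for a scalar cell; the weights below are arbitrary, untrained values for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h, c, W):
    """One step of a scalar LSTM cell. The three gates decide what to
    write (input gate), what to keep (forget gate) and what to expose
    (output gate)."""
    i = sigmoid(W["wi"] * x + W["ui"] * h + W["bi"])    # input gate
    f = sigmoid(W["wf"] * x + W["uf"] * h + W["bf"])    # forget gate
    o = sigmoid(W["wo"] * x + W["uo"] * h + W["bo"])    # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h + W["bg"])  # candidate value
    c = f * c + i * g     # additive memory path eases vanishing gradients
    h = o * math.tanh(c)  # actively selected output
    return h, c

W = {k: 0.5 for k in ["wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg"]}
h = c = 0.0
for x in [1.0, -1.0, 1.0]:   # the cell state carries info from all the past
    h, c = lstm_step(x, h, c, W)
print(round(h, 3))
```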
87. 87
Application to DAR:
Convolutional Recurrent Neural Network (CRNN)
[Shi+, “An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition”, IEEE TPAMI, 2017]
99. 99
How can we get it? Minimize the slope under constraints
[Figure: samples of classes A and B on the x-axis and a linear function w^T x + b; for us, its value should be more than 1 on one class and less than −1 on the other.]
100. 100
How can we get it? Minimize the slope under constraints
[Figure: candidate slopes — the ones violating the constraints are NG, the one satisfying them is OK.]
101. 101
How can we get it? Minimize the slope under constraints
[Figure: the constraints act like “nails” that the function must clear.]
102. 102
How can we get it? Minimize the slope under constraints
[Figure: the minimum slope satisfying the constraints.]
103. 103
How can we get it? Minimize the slope under constraints
It also gives the maximum-margin classification!
104. 104
Support vectors
[Figure: the samples lying exactly on w^T x + b = ±1 are the support vectors (SVs).]
Only those SVs contribute to determining the discriminant function
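“Minimize the slope under constraints” can be sketched as subgradient descent on the soft-margin SVM objective; the 1-D data and Pegasos-style decreasing step sizes below are illustrative choices, not the slides’ algorithm.

```python
# Soft-margin objective for 1-D data:
#   (lam / 2) * w^2 + mean_i max(0, 1 - y_i * (w * x_i + b))
data = [(-2.0, -1), (-1.5, -1), (1.5, 1), (2.0, 1)]  # (x, label)
lam = 0.1
w = b = 0.0
for t in range(1, 5001):
    eta = 1.0 / (lam * t)               # decreasing step size
    gw, gb = lam * w, 0.0               # gradient of the slope penalty
    for x, y in data:
        if y * (w * x + b) < 1.0:       # sample violates the margin
            gw -= y * x / len(data)
            gb -= y / len(data)
    w, b = w - eta * gw, b - eta * gb

# Samples with y * (w * x + b) close to 1 sit on the margin: the support
# vectors, which alone determine the discriminant function.
svs = [x for x, y in data if abs(y * (w * x + b) - 1.0) < 0.05]
print(svs)  # the inner points: [-1.5, 1.5]
```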
109. 109
Mapping the feature vector space to a higher-dimensional space
[Figure: XOR-type data on (x_1, x_2) with values 0 and 1 is not linearly separable, but after the mapping φ: (x_1, x_2) → (x_1, x_2, x_1 x_2) it becomes linearly separable!]
111. 111
What happens in the original space
In the mapped space (y_1, y_2, y_3), the boundary a y_1 + b y_2 + c y_3 + d = 0 is a plane in 3D space. Rewrite it with y_1 = x_1, y_2 = x_2, y_3 = x_1 x_2:
112. 112
What happens in the original space
a x_1 + b x_2 + c x_1 x_2 + d = 0 — what is this? Revert to the (x_1, x_2) plane:
113. 113
What happens in the original space
Classification boundary: a x_1 + b x_2 + c x_1 x_2 + d = 0, i.e., x_2 = −(a x_1 + d) / (b + c x_1)
[Figure: on the (x_1, x_2) plane with values 0 and 1, this curve separates the XOR-type data.]
Linear classification in the higher-dimensional space corresponds to a non-linear classification in the original space
115. 115
What happens in the original space
[Figure: another example using the squared feature x_1^2; the linear boundary in the mapped space becomes a quadratic curve in the original (x_1, x_2) plane, separating the interleaved A and B regions.]
116. 116
Notes about the φ-machine
Combination with SVM is popular: the φ-function leads to the “kernel”
In the SVM dual, the data appear only through the inner product in Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j; replacing it with the mapped product gives Σ_i Σ_j α_i α_j y_i y_j φ(x_i)^T φ(x_j) = Σ_i Σ_j α_i α_j y_i y_j k(x_i, x_j)
Choosing a good mapping φ is not trivial
In the past, the choice was done by trial and error
Recently…
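The φ-to-kernel step can be sketched with the mapping from the earlier XOR slides: the kernel evaluates φ(x)^T φ(z) without ever constructing φ(x).

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit mapping from the earlier slides: (x1, x2) -> (x1, x2, x1 x2)."""
    return [x[0], x[1], x[0] * x[1]]

def kernel(x, z):
    """The same inner product, evaluated WITHOUT constructing phi:
    k(x, z) = phi(x) . phi(z)."""
    return x[0] * z[0] + x[1] * z[1] + x[0] * x[1] * z[0] * z[1]

x, z = [2.0, 3.0], [-1.0, 0.5]
print(dot(phi(x), phi(z)), kernel(x, z))  # identical values: -3.5 -3.5
```

For this small φ the saving is negligible, but for mappings into very high-dimensional (even infinite-dimensional) spaces the kernel is the only tractable route — which is why the dual form above matters.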
117. 117
Deep neural networks can find a good mapping automatically
The feature extraction layers form a mapping φ
The mapping is specified by the weights
The weights (i.e., φ) are optimized via training
This is the so-called “representation learning”
121. 121
AdaBoost:
A set of complementary classifiers
1. Train a weak classifier g_1 on the training patterns
2. Compute its reliability (e.g., 0.7)
Final decision: if the weighted sum > 0 then A; else B
122. 122
AdaBoost:
A set of complementary classifiers
3. Give a large (small) weight to each sample that is misrecognized (correctly recognized) by g_1 (reliability 0.7)
Final decision: if the weighted sum > 0 then A; else B
124. 124
AdaBoost:
A set of complementary classifiers
6. Give a large (small) weight to each sample that is misrecognized (correctly recognized) by the newly trained classifier (reliabilities so far: 0.7, 0.43)
Repeat until convergence of the training accuracy
Final decision: if the weighted sum > 0 then A; else B
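The three steps on these slides (train, weight by reliability, re-weight the samples) can be sketched with threshold “stumps” on toy 1-D data; the data and classifier family are illustrative.

```python
import math

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
Y = [1, 1, -1, -1, 1, 1]            # not separable by any single threshold

def stump(theta, sign):
    return lambda x: sign if x < theta else -sign

candidates = [stump(t + 0.5, s) for t in range(6) for s in (1, -1)]

weights = [1.0 / len(X)] * len(X)   # uniform sample weights at the start
ensemble = []                       # (reliability alpha, classifier g)
for _ in range(3):
    # 1. Train: pick the stump with the lowest weighted error
    g = min(candidates,
            key=lambda g: sum(w for w, x, y in zip(weights, X, Y) if g(x) != y))
    err = sum(w for w, x, y in zip(weights, X, Y) if g(x) != y)
    # 2. Reliability of this classifier
    alpha = 0.5 * math.log((1.0 - err) / max(err, 1e-12))
    ensemble.append((alpha, g))
    # 3. Enlarge the weights of misrecognized samples, shrink the others
    weights = [w * math.exp(-alpha * y * g(x))
               for w, x, y in zip(weights, X, Y)]
    z = sum(weights)
    weights = [w / z for w in weights]

def predict(x):                     # if weighted sum > 0 then A (+1) else B (-1)
    return 1 if sum(a * g(x) for a, g in ensemble) > 0 else -1

print([predict(x) for x in X])  # matches Y: [1, 1, -1, -1, 1, 1]
```

Each round’s stump is weak on its own; the complementary re-weighting forces later stumps to fix the earlier ones’ mistakes.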
127. 127
Near-human performance has been achieved by big data and neural networks
Character recognition [Uchida+, ICFHR2016]: accuracies of 95.49%–99.99% across machine-printed, handwritten, and designed-font characters
Scene text detection [Zhou+, CVPR2017]: F-value = 0.8 on the ICDAR2015 Incidental Scene Text benchmark
Scene text recognition, CRNN [Shi+, TPAMI, 2017]: 89.6% word recognition rate on ICDAR2013
129. 129
Beyond 100% = the computer can detect, read, and collect all text information perfectly:
texts on notebooks, object labels, digital displays, book pages, signboards, and posters / ads
So, what do you want to do with the perfect recognition results?
130. 130
In fact, our real goal should NOT be perfect recognition results
Poor recognition results → perfect recognition results are only a tentative goal
The real goals: the ultimate application using perfect recognition results, and scientific discovery by analyzing perfect recognition results
131. 131
What will you do in the world beyond 100%?
Ultimate applications:
Education — “total recall” for perfect information search
Welfare — alarms, translation, information complement
“Life-log” apps — summary, log compression, captioning, question answering, behavior prediction, reminders
Scientific discovery:
With social science — interaction between scene text and humans; text statistics
With design science — font shape and impression; discovering typographic knowledge
With humanities — historical knowledge; semiology
132. 132
Another direction:
Use characters to understand ML
Simple binary, stroke-structured patterns
Less background clutter
Small size (e.g., 32x32)
Big data (e.g., 80,000 samples / class)
Predefined classes (e.g., 10 classes for digits)
ML has achieved near-human performance on them
→ a very good “testbed” for not only evaluating but also understanding ML