2. Motivation
• Even in this digital age, there are still many documents that are handwritten and need to be scanned, e.g. legal documents, forms, receipts, bond papers, etc.
• Harder than OCR (Optical Character Recognition), because typed letters can be easily recognized with a fixed set of rules.
• A single word can be written in N different ways by the same person, so there are many variations, which makes this a difficult problem to crack.
• High value for companies that design and develop scanners and printers, e.g. Kodak, Hewlett-Packard, Canon, etc.
• Different ways of writing ‘of’ by the same person
3. Why not use OCR tool?
● Tesseract is a famous open-source OCR engine.
● Widely used for OCR and maintained by Google.
● Works for around 30-40 different languages.
● Good for OCR, but not good for Intelligent Character Recognition (ICR).
● For example: an original handwritten image alongside its OCR output (figure).
4. Previous approaches
• Sophisticated preprocessing techniques
• Extracting handcrafted features
• A combination of a classifier and a sequential model,
i.e. a hybrid ANN/DNN + Hidden Markov Model
• Sequential models like HMMs were good at providing transcriptions.
6. Recurrent Neural Networks
• RNNs help model local as well as global context.
• No alphabet-specific preprocessing and no handcrafted features required.
• Usable for any other language, and have shown promising results
(in machine translation, NLP, speech & handwriting recognition, etc.)
• Work on raw inputs (pixels).
• Globally trainable model.
• Good at handling long-term dependencies.
7. Why LSTM cell?
General idea about LSTMs
• Solves the vanishing gradient problem and thus works better
for long-term dependencies.
• Activation is controlled by 3 multiplicative gates:
o Input gate
o Forget gate
o Output gate
• The gates allow the cell to store or retrieve information over time.
• Showed state-of-the-art results for speech recognition.
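The gate mechanism above can be sketched in a few lines of NumPy. This is a minimal one-dimensional LSTM step without peephole connections; the weight layout, names, and toy sizes are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One standard 1-D LSTM step (no peepholes).
    W: (4*H, X+H) stacked weights for the input, forget, output
    gates and the cell input; b: (4*H,) stacked biases."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate
    o = sigmoid(z[2*H:3*H])    # output gate
    g = np.tanh(z[3*H:4*H])    # cell input
    c = f * c_prev + i * g     # gates store/forget information over time
    h = o * np.tanh(c)         # hidden output
    return h, c

# toy usage: input size 3, hidden size 4, 5 time steps
rng = np.random.default_rng(0)
X, H = 3, 4
W = rng.standard_normal((4 * H, X + H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    h, c = lstm_step(rng.standard_normal(X), h, c, W, b)
```

Because the forget gate multiplies the previous cell state rather than squashing it through an activation, gradients can flow across many time steps, which is the intuition behind the vanishing-gradient claim above.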
8. Bidirectional and Multidimensional
RNN(LSTM)
• A normal recurrent neural network can look back at previous time steps (to the left)
and get contextual information.
• This works well, but it has been seen that context from the right also helps, so we have
BLSTMs (Bidirectional LSTMs).
• RNNs are usually structured for 1-D sequences, so the input is always converted to
a 1-D vector and fed to the RNN.
• So, any d-dimensional data needs to be brought down to 1-D before it can be
processed by an RNN.
• To overcome this shortcoming, [1] proposed Multidimensional RNNs.
[1] Graves, Alex, and Jürgen Schmidhuber. "Offline handwriting recognition with multidimensional recurrent neural networks." Advances in Neural Information Processing Systems (NIPS), 2008.
9. Multidimensional RNNs
• The standard LSTM is explicitly one-dimensional, with one recurrent
connection, and whether to use the information from that recurrent connection
is controlled by just one forget gate.
• With multiple dimensions, we extend this idea to n dimensions, with n recurrent
connections, and the information controlled by n forget gates.
• The network starts scanning from the top left.
1. The thick lines show connections to the current
point (i, j).
2. The connections within the hidden plane are
recurrent.
3. The dashed lines mark previous points already
scanned by the network.
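The scan order described above can be sketched as follows. This is a hypothetical helper, not from the slides, that lists each grid point (i, j) together with the already-visited neighbours its recurrent connections come from:

```python
def scan_order(height, width):
    """Top-left-to-bottom-right scan of a 2-D grid, so that at (i, j)
    the hidden states of (i-1, j) and (i, j-1) are already computed."""
    order = []
    for i in range(height):
        for j in range(width):
            prev = [(i - 1, j)] if i > 0 else []   # recurrent connection from above
            if j > 0:
                prev.append((i, j - 1))            # recurrent connection from the left
            order.append(((i, j), prev))
    return order

# a 2x3 grid: 6 points, each with 0, 1, or 2 predecessors
for point, predecessors in scan_order(2, 3):
    print(point, "<-", predecessors)
```

Generalising to n dimensions simply adds one predecessor (and one forget gate) per dimension.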
11. Calculation for input gate
• i_t = σ( W_xi · x_t + Σ_{d=1..D} ( W_hi^d · h_{t-1}^d + w_ci^d ⊙ c_{t-1}^d ) + b_i )
where
o i_t — input gate at the current time step
o σ — sigmoid activation
o W_xi — weights from the input layer to the hidden layer
o x_t — input to the LSTM block
o W_hi^d — weights from the previous hidden layer to the current
hidden layer's input gate
o h_{t-1}^d — previous hidden-layer outputs across the D dimensions
o w_ci^d — peephole weights
o c_{t-1}^d — previous cell-state outputs across the D dimensions
o b_i — input gate bias
12. Calculation for forget gate
• The forget gate value is calculated for every dimension separately,
because it lets the cell store or forget previous information depending on
which dimension is useful.
• f_t^d = σ( W_xf · x_t + Σ_{d'=1..D} W_hf^{d'} · h_{t-1}^{d'} + w_cf^d ⊙ c_{t-1}^d + b_f^d )
where
o σ — sigmoid activation
o W_xf — weights from the input layer to the forget gate
o x_t — input to the forget gate from the input layer
o W_hf^{d'} — weights from the previous hidden layer, across the D
dimensions, to the current hidden layer
o h_{t-1}^{d'} — hidden-layer output from the previous time step
o w_cf^d — peephole weights
o c_{t-1}^d — cell-state output of the previous time step
o b_f^d — forget gate bias, one per dimension
13. Calculation for input
• g_t = tanh( W_xg · x_t + Σ_{d=1..D} W_hg^d · h_{t-1}^d + b_g )
where
o g_t — output after the tanh activation. This is not the output of any of
the 3 gates; it is the same as the output of a fully connected layer
with tanh activation
o tanh — tanh activation
o W_xg — weights from the input layer to the hidden layer
o x_t — input from the input layer to the hidden layer
o W_hg^d — weights from the previous hidden layer, across the D
dimensions, to the current hidden layer
o h_{t-1}^d — hidden-layer outputs from previous time steps across the D dimensions
o b_g — bias for the input
14. Calculation for cell state
cell state of that particular LSTM block,
a single lstm block can have multiple cell states,
usually one cell state works well in practice.
this expression calculates whether the input from
this particular time step is useful, if not then input
gate value will be close to zero, else close to 1
this expressions calculates, which dimensions
are useful, so suppose if information from ‘X’
dimension is not useful then forget gate value
calculated for that dimension will be 0 and it is
multiplied with the previous time step cell state of X
dimension, so that no information is carried forward from X dimension.
15. Calculation for output gate
• o_t = σ( W_xo · x_t + Σ_{d=1..D} W_ho^d · h_{t-1}^d + w_co ⊙ c_t + b_o )
where
o o_t — output gate value at the current time step
o σ — sigmoid activation
o W_xo — weights from the input layer to the output gate
o x_t — input from the input layer to the hidden layer
o W_ho^d — weights from the previous hidden layer to the current hidden
layer, across the D dimensions
o h_{t-1}^d — hidden-layer outputs of previous time steps across the D
dimensions
o w_co — peephole weights; remember these connect to the cell state c_t
at the current time step
o b_o — output gate bias
16. Calculation for output of hidden neuron
• h_t = o_t ⊙ tanh( c_t )
where
o h_t — output of the LSTM block (i.e. the hidden neuron) at the current
time step
o o_t — output gate value; this decides whether this neuron's output
should be given as input to the hidden layer of future time steps: if
not, the value will be close to 0, otherwise close to 1
o tanh( c_t ) — the neuron's cell state passed through a tanh activation
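Putting the gate, input, cell-state, and output calculations together, one MDLSTM step at a single grid point can be sketched in NumPy. This is a sketch following the slide equations; the parameter names and toy sizes are assumptions, not from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mdlstm_step(x, h_prev, c_prev, p):
    """One MDLSTM step at a single point.
    h_prev, c_prev: lists of the D previous hidden/cell states, one per dimension.
    p: parameter dict (names like "Wxi" are illustrative)."""
    D = len(h_prev)
    i = sigmoid(p["Wxi"] @ x
                + sum(p["Whi"][d] @ h_prev[d] + p["wci"][d] * c_prev[d]
                      for d in range(D)) + p["bi"])            # input gate
    f = [sigmoid(p["Wxf"] @ x
                 + sum(p["Whf"][d2] @ h_prev[d2] for d2 in range(D))
                 + p["wcf"][d] * c_prev[d] + p["bf"][d])
         for d in range(D)]                                    # one forget gate per dimension
    g = np.tanh(p["Wxg"] @ x
                + sum(p["Whg"][d] @ h_prev[d] for d in range(D)) + p["bg"])
    c = i * g + sum(f[d] * c_prev[d] for d in range(D))        # cell state
    o = sigmoid(p["Wxo"] @ x
                + sum(p["Who"][d] @ h_prev[d] for d in range(D))
                + p["wco"] * c + p["bo"])                      # output gate, peephole on current c
    return o * np.tanh(c), c                                   # hidden output, cell state

# toy sizes: D = 2 dimensions, input size 3, hidden size 4
rng = np.random.default_rng(0)
D, X, H = 2, 3, 4
p = {"Wxi": rng.standard_normal((H, X)) * 0.1,
     "Wxf": rng.standard_normal((H, X)) * 0.1,
     "Wxg": rng.standard_normal((H, X)) * 0.1,
     "Wxo": rng.standard_normal((H, X)) * 0.1,
     "Whi": [rng.standard_normal((H, H)) * 0.1 for _ in range(D)],
     "Whf": [rng.standard_normal((H, H)) * 0.1 for _ in range(D)],
     "Whg": [rng.standard_normal((H, H)) * 0.1 for _ in range(D)],
     "Who": [rng.standard_normal((H, H)) * 0.1 for _ in range(D)],
     "wci": [rng.standard_normal(H) * 0.1 for _ in range(D)],
     "wcf": [rng.standard_normal(H) * 0.1 for _ in range(D)],
     "wco": rng.standard_normal(H) * 0.1,
     "bi": np.zeros(H), "bf": [np.zeros(H) for _ in range(D)],
     "bg": np.zeros(H), "bo": np.zeros(H)}
h, c = mdlstm_step(rng.standard_normal(X),
                   [np.zeros(H) for _ in range(D)],
                   [np.zeros(H) for _ in range(D)], p)
```

Note how the per-dimension forget gates f[d] separately decide how much of each dimension's previous cell state flows into c, exactly as described on the cell-state slide.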
17. CTC (Connectionist Temporal Classification)
• Previous approaches to training end-to-end systems for handwriting recognition
involved segmenting the input and aligning it with the ground truth.
• As a result, we had to do forced alignment, which is prone to errors,
and those errors propagate into the training system.
• To overcome this issue, we use Connectionist Temporal
Classification (CTC).
• It provides two advantages:
1. We can have variable-length input; no need for forced alignment.
2. The CTC loss function is differentiable, hence the system is end-to-end trainable.
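At the heart of CTC is a many-to-one collapsing map (usually written B) that merges repeated labels and then removes blanks, which is why no forced alignment is needed: any path that collapses to the target counts. A small sketch (the function name and blank symbol are illustrative):

```python
def ctc_collapse(path, blank="-"):
    """CTC's many-to-one map B: merge repeated labels, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

print(ctc_collapse("--hhe-ll-llo--"))  # -> "hello"
print(ctc_collapse("aa-a"))           # -> "aa" (blank separates repeated labels)
```

Many different framewise paths map to the same transcription, so the CTC loss sums the probabilities of all of them, as the next slides work out.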
19. CTC Cost Function and Intuition
• The objective function is the negative log likelihood of correctly labelling
the entire training data:
O = − Σ_{(x,z)∈S} ln p(z|x)
• x — training sample
S — training data
z — target label sequence for x
20. CTC Cost Function and Intuition
• To train the network we need ∂O/∂y_k^t, the derivative of the objective
function with respect to the output y_k^t at time step t for the k-th label.
• p(z|x) is the total probability of all the paths that can be formed for a particular input. A speech-recognition
analogy: suppose you are saying the word 'Robocop'. There are many ways of saying it, like 'Roooooooooobooocop'
or 'Robocoooooop' or 'Robo <pause> cop'. p(z|x) is the total probability of all the sequences that can be formed for
a given word. Since the number of possible paths (words/sequences) can be exponential, we use the
forward-backward algorithm to find the total probability of the paths. More on this in the next slides.
22. CTC Cost Function and Intuition
• The figure on the right represents the total number of paths that pass through
every node at time step t = 2. For example, if alpha for C is 3 and beta for C
is 1, the total number of paths going through C at t = 2 is 3. This is how we
count all the paths going through every node at each time step.
• y_k^t is the output at time step t for label k.
• We sum across repeated labels: if the ground-truth word is
KITKAT, then K appears twice, so we sum across both
occurrences of label 'K'.
• α_t(s) · β_t(s) is the probability of all the paths that pass through
label position s at time t. So, if you have 10 labels and your softmax
outputs equal probabilities, i.e. 0.1 at each time step, then the
probability of all paths through 'A' is 16 · (0.1^5), for 5 time steps.
23. CTC Cost Function and Intuition
• Now to backpropagate the gradients, we need to find gradients with respect to output,
i.e before activation is applied
• Here k’ refers all the labels and k is the kth label
• Finally, we arrive to the following equation for gradients with respect to output before
activation is applied.
=
• For eg, if we take gradient with respect to activation at ‘A’ , considering that there are 10
labels, and all output 0.1 probability initially. Then gradient propagation for label ‘A’
• P (z|x) = 0 + (3* (0.1^5)) + (3* (0.1^5)) + (16 * (0.1^5)) + (3* (0.1^5)) + (3* (0.1^5)) + 0
alphas* betas = (16 * (0.1^5)) and output activation of label ‘A’ at time t is 0.1. Gradient value
is -0.4714
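The numeric example above can be checked directly. The path counts per node are taken from the slide; everything else follows the gradient formula:

```python
# 10 labels, uniform softmax output y = 0.1, sequence of 5 time steps
y = 0.1
path_prob = y ** 5                         # probability of any single length-5 path
path_counts = [0, 3, 3, 16, 3, 3, 0]       # paths through each node at this time step

p_z_given_x = sum(n * path_prob for n in path_counts)   # 28 * 0.1^5
alpha_beta_A = 16 * path_prob                           # total path probability through 'A'

# gradient wrt pre-softmax output:  y_k^t - (1/p(z|x)) * sum(alpha * beta)
grad_A = y - alpha_beta_A / p_z_given_x
print(round(grad_A, 4))  # -> -0.4714
```

The common factor (0.1^5) cancels, so the gradient reduces to 0.1 − 16/28, matching the slide's value.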
25. Architecture
Louradour, Jérôme, and Christopher Kermorvant. "Curriculum learning for handwritten text line recognition." 11th IAPR International Workshop on Document Analysis Systems (DAS), IEEE, 2014.
26. Results
• Trained and tested on the IAM Handwriting Database using
a) Python code written from scratch
b) the RNNLIB library
1. Training data: 80K
2. Validation data: 20K
3. Testing data: 15K
• NCER (Normalized Character Error Rate)
1. Training NCER: 15.5 %
2. Testing NCER: 15 %
3. Testing NCER with a lexicon: 12.60 %
• Some examples from the database