Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Natural Language Processing is an area of Artificial Intelligence and, in particular, of Pattern Recognition. It is a multidisciplinary field that studies human language, both spoken and written, and deals with the development and study of computational mechanisms for communication between people and computers using natural languages. Natural Language Processing is a constantly evolving research area; this work focuses on the part related to language modeling and its application to several tasks: recognition/understanding of sequences and statistical machine translation.

Specifically, this thesis focuses on the so-called connectionist language models (or continuous-space language models), i.e., language models based on neural networks. Their excellent performance in several Natural Language Processing areas motivated this study.

Because of certain computational problems of connectionist language models, the most widespread approach followed by systems that currently use them is based on two totally decoupled stages. In a first stage, a standard and cheaper language model generates a set of feasible hypotheses, under the assumption that this set is representative of the search space in which the best hypothesis lies. In a second stage, a connectionist language model is applied to this set and the list of hypotheses is rescored.

This scenario motivates the scientific goals of this thesis:

- Developing techniques to drastically reduce the computational cost while degrading quality as little as possible.

- Studying the effect of a totally coupled approach that integrates neural network language models into the decoding stage.

- Developing extensions of the original model in order to improve its quality and to support context/domain adaptation.

- Empirically applying neural network language models to sequence recognition and machine translation tasks.

All developed algorithms were implemented in C++, using Lua as a scripting language. The implementations are compared with those considered standard for each of the addressed tasks. Neural network language models achieve notable quality improvements over the reference baseline systems:

- competitive results in automatic speech recognition and spoken language understanding;

- an improvement over the state of the art in handwritten text recognition;

- state-of-the-art results in statistical machine translation, as confirmed by participation in international evaluation campaigns.

For sequence recognition tasks, the integration of neural network language models into the first decoding stage achieves very competitive computational costs. However, their integration into machine translation tasks requires further development, since the computational cost of the system is still somewhat high.

Transcript

  • 1. Contributions to connectionist language modeling and its application to sequence recognition and machine translation. PhD Thesis defense. Francisco Zamora Martínez, supervised by María José Castro Bleda. Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València. 2012 November 30.
  • 2. Index 1. Introduction 2. Connectionist language modeling 3. Sequence recognition applications 4. Machine translation applications 5. Conclusions
  • 3. Index 1. Introduction 2. Connectionist language modeling 3. Sequence recognition applications 4. Machine translation applications 5. Conclusions
  • 4. Motivation. Important role of Language Models (LMs).
    N-grams: + learned automatically; + simple and effective; – problems with unseen patterns, smoothing heuristics.
    Neural Network Language Models (NN LMs), based on N-grams: + automatic smoothing of unseen patterns; – big computational cost; – decoupled integration ⇒ N-best list rescoring.
    Open questions on NN LMs: a totally coupled integration scheme; NN LM capability to improve hypothesis pruning; quality of fast evaluation of NN LMs; modeling of long-distance dependencies with NN LMs.
    Tasks: Spoken Language Understanding, Handwritten Text Recognition, Statistical Machine Translation.
  • 5. Pattern Recognition.
    Fundamental equation: $\hat{\bar{y}} = \operatorname*{arg\,max}_{\bar{y} \in \Omega^+} p(\bar{y}|\bar{x}) = \operatorname*{arg\,max}_{\bar{y} \in \Omega^+} p(\bar{x}|\bar{y})\, p(\bar{y})$.
    Machine translation example: $\bar{x} = x_1 \ldots x_5$ = "Traduciremos esta frase al inglés", $\bar{y} = y_1 \ldots y_7$ = "We will translate this sentence into English".
    Handwritten Text Recognition example: $\bar{x}$ = a handwritten text line image, $\bar{y} = y_1 \ldots y_4$ = "must be points .".
  • 6. Pattern Recognition.
    Fundamental equation and its generalization:
    $\hat{\bar{y}} = \operatorname*{arg\,max}_{\bar{y} \in \Omega^+} p(\bar{y}|\bar{x}) = \operatorname*{arg\,max}_{\bar{y} \in \Omega^+} p(\bar{x}|\bar{y})\, p(\bar{y})$
    $\hat{\bar{y}} = \operatorname*{arg\,max}_{\bar{y} \in \Omega^+} p(\bar{y}|\bar{x}) = \operatorname*{arg\,max}_{\bar{y} \in \Omega^+} \prod_{m=1}^{M} H_m(\bar{x}, \bar{y})^{\lambda_m}$
    Language Models (LMs) estimate the a-priori probability $p(\bar{y})$, that is, they compute a measure of how much $\bar{y}$ belongs to the task language. This is generalized under the maximum entropy framework as a log-linear combination. For Spoken Language Understanding (SLU) and Handwritten Text Recognition (HTR) it models the Grammar Scale Factor (GSF) and Word Insertion Penalty (WIP) weights (an illustrative form of that combination is written out below).
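As an illustrative instance of the log-linear combination for SLU/HTR (a hedged sketch of the common formulation, not the slide's exact notation): with an acoustic/optical model $p(\bar{x}\mid\bar{y})$, the LM weighted by the Grammar Scale Factor $\gamma$, and a Word Insertion Penalty $\mu$ on the hypothesis length, the decoding rule becomes

```latex
\hat{\bar{y}} \;=\; \operatorname*{arg\,max}_{\bar{y}\in\Omega^{+}}
  \Bigl[\, \log p(\bar{x}\mid\bar{y})
        \;+\; \gamma \,\log p(\bar{y})
        \;+\; \mu\,|\bar{y}| \,\Bigr]
```

i.e., GSF and WIP are simply two of the $\lambda_m$ weights of the general product above, estimated here via MERT.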
  • 7. Dataflow architecture. The search procedure computes $\hat{\bar{y}}$ using the previous equations. Algorithms based on graphs were implemented, breaking the search into little building blocks. Each block is a module in a dataflow architecture. Modules exchange different information types: feature vectors, graph protocol messages, probabilities, …
    Speech recognition dataflow: $\bar{x}$ → GenProbs Module → OSE Module → WGen Module → NGramParser Module → word graph → $\bar{y}$, with pruning by $\max_{e \in \text{active vertex}} p(e)$.
  • 8. Graph protocol messages. The most important dataflow data type. Normally, graph messages are produced/consumed left-to-right. A specialization for multi-stage graphs is possible (as for Statistical Machine Translation). General graph example, message stream (a sketch of the message types follows below):
    1: begin_dag (multistage=false)
    2: vertex (0)
    3: is_initial (0)
    4: no_more_in_edges (0)
    5: vertex (1)
    6: edge (0, data={⟨a, 1.0⟩, ⟨b, 0.7⟩})
    7: no_more_in_edges (1)
    8: vertex (2)
    9: edge (0, data={⟨b, 0.5⟩})
    10: no_more_in_edges (2)
    11: no_more_out_edges (0)
    12: vertex (3)
    13: edge (1, data={⟨b, 0.1⟩})
    14: edge (2, data={⟨a, 1.0⟩})
    15: no_more_in_edges (3)
    16: no_more_out_edges (2)
    17: vertex (4)
    18: is_final (4)
    19: edge (1, data={⟨a, 1.0⟩, ⟨c, 0.2⟩})
    20: no_more_out_edges (1)
    21: edge (3, data={⟨d, 0.4⟩})
    22: no_more_out_edges (3)
    23: no_more_in_edges (4)
    24: no_more_out_edges (4)
    25: end_dag ()
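A minimal sketch of how such a message stream could be represented and consumed; the message kinds come from the slide (begin_dag, vertex, is_initial, edge, no_more_in_edges, no_more_out_edges, is_final, end_dag), while the struct layout, field names and the `consume` function are assumptions for illustration, not the toolkit's code:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical tagged message for the graph protocol.
struct GraphMessage {
  enum Kind { BEGIN_DAG, VERTEX, IS_INITIAL, EDGE,
              NO_MORE_IN_EDGES, NO_MORE_OUT_EDGES, IS_FINAL, END_DAG } kind;
  int id = -1;                                      // vertex id, or source vertex for EDGE
  bool multistage = false;                          // only meaningful for BEGIN_DAG
  std::vector<std::pair<std::string, float>> data;  // edge payload, e.g. {("a",1.0),("b",0.7)}
};

// A module consumes messages left-to-right, reacting to each message kind.
void consume(const std::vector<GraphMessage>& stream) {
  for (const GraphMessage& m : stream) {
    switch (m.kind) {
      case GraphMessage::VERTEX:  /* allocate state for vertex m.id */        break;
      case GraphMessage::EDGE:    /* combine scores arriving from vertex m.id */ break;
      case GraphMessage::END_DAG: /* flush results downstream */              break;
      default: break;
    }
  }
}
```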
  • 9. Optimization and evaluation.
    Log-linear combination: all tasks are formalized as a log-linear combination, estimated via Minimum Error Rate Training (MERT).
    Confidence intervals and comparison: bootstrapping technique; pairwise comparison.
    Evaluation measures: Perplexity (PPL), Word Error Rate (WER), Character Error Rate (CER), Sentence Error Rate (SER), Concept Error Rate (CER), Bilingual Evaluation Understudy (BLEU), Translation Edit Rate (TER).
  • 10. Goals.
    Scientific aims: formalization of NN LMs as general N-grams; specification of a method for totally coupled integration of NN LMs; evaluation of the totally coupled approach; extension of NN LMs to Cache NN LMs, inspired by cache LMs.
    Technological aims: efficient implementation of training and evaluation of NN LMs; efficient implementation of algorithms for coupling NN LMs in SLU, HTR and Statistical Machine Translation (SMT); a MERT algorithm to estimate GSF and WIP in SLU and HTR; April toolkit development in collaboration with the research group.
  • 11. Index 1. Introduction 2. Connectionist language modeling 3. Sequence recognition applications 4. Machine translation applications 5. Conclusions
  • 12. Language modeling. Statistical LMs follow the pattern recognition fundamental equation: they estimate the probability that a sentence belongs to a certain language. This is simplified by using N-gram LMs over sequences of order N:
    $p(\bar{y}) \approx \prod_{j=1}^{|\bar{y}|} p(y_j \mid \bar{h}_j)$, with $\bar{h}_j = y_{j-1} y_{j-2} \ldots y_{j-N+1}$.
    N-gram probability computation, 3-gram example (sketched in code below):
    $p(\omega_1 \omega_2 \omega_3) = p(\omega_1 \mid \text{bcc}) \cdot p(\omega_2 \mid \text{bcc}\,\omega_1) \cdot p(\omega_3 \mid \omega_1 \omega_2) \cdot p(\text{ecc} \mid \omega_2 \omega_3)$
    bcc is the begin context cue (start of sentence); ecc is the end context cue (end of sentence).
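A hedged sketch of the factorization above; the `prob` callback stands for whatever backend supplies $p(y_j \mid \bar{h}_j)$ (an SRI-style N-gram, an NN LM, …) and is not part of the thesis code:

```cpp
#include <cmath>
#include <functional>
#include <string>
#include <vector>

// p(word | context), supplied by any LM backend.
using NgramProb =
    std::function<double(const std::string&, const std::vector<std::string>&)>;

// Log-probability of a sentence under an order-N model, padding the initial
// history with the begin context cue (bcc) and closing with the end context
// cue (ecc), analogous to the 3-gram example on the slide.
double sentence_logprob(const std::vector<std::string>& words, int N,
                        const NgramProb& prob) {
  std::vector<std::string> history(N - 1, "<bcc>");
  double logp = 0.0;
  auto push = [&](const std::string& w) {
    logp += std::log(prob(w, history));
    history.erase(history.begin());   // drop the oldest context word
    history.push_back(w);             // append the newest word
  };
  for (const std::string& w : words) push(w);
  push("<ecc>");                      // end-of-sentence term
  return logp;
}
```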
  • 13. Connectionist Language Models. Based on the idea of word projections onto a continuous space. Interpolation of unseen N-grams given the word projections. Joint training of word projections and LM probability computation. Word projections are position independent: shared weights.
    Word projection: the input is a local encoding, the word as a category (size = |Ω|), e.g. 0, 0, …, 0, 1, 0, …, 0; the projection is a distributed encoding, a feature vector of size much smaller than |Ω|, e.g. 0.1, …, −0.4, 0.2, …, 1.1.
  • 14. Connectionist Language Models. [Figure: NN LM architecture with locally encoded input words.]
  • 15. Connectionist Language Models.
    Training issues: stochastic backpropagation algorithm with weight decay regularization. Stochastic training selects with replacement a random set of patterns every epoch. For large datasets, training converges before the training partition has been completely traversed. Fast training using matrix-matrix multiplications and fine-tuned BLAS implementations (bunch mode), as sketched below.
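A minimal sketch of the bunch-mode idea: forwarding a whole bunch of B patterns through one fully connected layer as a single matrix-matrix product. It assumes a CBLAS implementation is available; sizes, names and the tanh activation are illustrative, not the toolkit's actual code:

```cpp
#include <cblas.h>   // any CBLAS implementation (ATLAS, OpenBLAS, ...)
#include <cmath>
#include <vector>

// H (B x n_out) = tanh( X (B x n_in) * W (n_in x n_out) + bias ), row-major.
void forward_bunch(const std::vector<float>& X, const std::vector<float>& W,
                   const std::vector<float>& bias, std::vector<float>& H,
                   int B, int n_in, int n_out) {
  H.assign(static_cast<size_t>(B) * n_out, 0.0f);
  // One matrix-matrix product for the whole bunch instead of B matrix-vector products.
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              B, n_out, n_in,
              1.0f, X.data(), n_in,
              W.data(), n_out,
              0.0f, H.data(), n_out);
  // Add bias and apply the hidden-layer activation (tanh here, as an example).
  for (int b = 0; b < B; ++b)
    for (int o = 0; o < n_out; ++o)
      H[static_cast<size_t>(b) * n_out + o] =
          std::tanh(H[static_cast<size_t>(b) * n_out + o] + bias[o]);
}
```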
  • 16. NN LM deficiencies and our solutions I.
    Word projections – projection layer.
    Deficiencies: random initialization of the projection layer means low-frequency ("rare") words have very different encodings; large vocabularies + stochastic training ⇒ poor training of "rare" words.
    Solutions: restrict the NN LM input vocabulary to words with frequency > θ (experiments); use bias and weight decay terms in the projection layer.
  • 17. NN LM deficiencies and our solutions II.
    Computational problems – output layer.
    Deficiencies: the softmax activation forces the computation of all outputs, $o_i = \exp(a_i) / \sum_{k=1}^{|\bar{A}|} \exp(a_k)$. The training problems were partially solved using fast math operations and stochastic backpropagation. At decoding, the softmax bottleneck forces the development of decoupled systems: N-best list rescoring.
    Solutions: a shortlist output vocabulary Ω′ restricted to the most frequent words (sketched below),
    $p(y_j \mid \bar{h}_j) = p_{NN}(y_j \mid \bar{h}_j)$ if $y_j \in \Omega'$,
    $p(y_j \mid \bar{h}_j) = p_{NN}(\text{OOS} \mid \bar{h}_j) \cdot C_{OOS}(y_j) / \sum_{y' \notin \Omega'} C_{OOS}(y')$ if $y_j \notin \Omega'$;
    and precomputed softmax normalization constants: Fast NN LM.
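A sketch of the shortlist case analysis above; names such as `p_nn_word`, `p_nn_oos` and `c_oos` are illustrative placeholders (the NN LM probabilities for the current context and the out-of-shortlist unigram counts), not the thesis implementation:

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Short-list probability with out-of-shortlist (OOS) mass redistribution.
//   p_nn_word   : NN LM probability of the word, valid only if it is in the shortlist.
//   p_nn_oos    : NN LM probability of the OOS class for the same context.
//   c_oos       : unigram counts of words outside the shortlist.
//   c_oos_total : sum of those counts.
double shortlist_prob(const std::string& word,
                      const std::unordered_set<std::string>& shortlist,
                      double p_nn_word, double p_nn_oos,
                      const std::unordered_map<std::string, double>& c_oos,
                      double c_oos_total) {
  if (shortlist.count(word)) return p_nn_word;   // y_j in Omega'
  auto it = c_oos.find(word);
  double count = (it != c_oos.end()) ? it->second : 0.0;
  return p_nn_oos * count / c_oos_total;         // redistribute the OOS mass
}
```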
  • 18. NN LM and decoding.
    N-best list rescoring: widely used in the literature; encouraging improvements in ASR and SMT.
    Totally integrated decoding: not tried before; major contribution and focus of this work.
  • 19. Integration of NN LM into decoding. A generic framework will be presented. The softmax computation problems need to be solved.
    Generic LM interface (sketched below): an LMkey is an automaton state number, used by the decoder as the N-gram context identifier (e.g. a state of an N-gram stochastic finite-state automaton). Methods: prepareLM(LMkey), getLMprob(LMkey, word), getNextLMKey(LMkey, word), getInitialLMKey(), getFinalLMKey(), restartLM().
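The interface above can be sketched as an abstract class; the method names are taken from the slide, while the signatures, the integer word type and the `LMkey` alias are assumptions:

```cpp
#include <cstdint>

// LMkey identifies an LM state: an automaton state for N-gram automata,
// a TrieLM node for NN LMs. uint32_t is an assumption.
using LMkey = uint32_t;

// Generic LM interface used by the decoder, following the slide's method list.
class LanguageModel {
 public:
  virtual ~LanguageModel() = default;
  virtual void prepareLM(LMkey key) = 0;                // precompute what the state needs
  virtual double getLMprob(LMkey key, int word) = 0;    // p(word | state)
  virtual LMkey getNextLMKey(LMkey key, int word) = 0;  // state transition
  virtual LMkey getInitialLMKey() = 0;
  virtual LMkey getFinalLMKey() = 0;
  virtual void restartLM() = 0;                         // reset between sentences
};
```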
  • 20. Generic framework.
    TrieLM: represents the whole N-gram space, $|\Omega|^N$ (NN LMs). It is built on-the-fly, enumerating only the states needed by decoding. A path in the TrieLM is a sequence of the N − 1 context words of an N-gram. Two kinds of node: persistent and dynamic.
    TrieLM node (a possible layout is sketched below): parent back-pointer, time-stamp, word transition. [Figure: TrieLM example.]
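A possible node layout for the TrieLM described above; the three fields named on the slide are parent back-pointer, time-stamp and word transition, while the `persistent` flag and the stored softmax constant (motivated by slide 23) are assumptions of this sketch:

```cpp
#include <cstdint>

// One TrieLM node: the path from the root spells the N-1 context words of an
// N-gram. Persistent nodes survive decoding and can carry precomputed data;
// dynamic nodes are created on-the-fly and reclaimed when their time-stamp
// becomes stale.
struct TrieLMNode {
  uint32_t parent;            // back-pointer to the parent node (the shorter context)
  int32_t  word;              // word transition that led from the parent to this node
  uint32_t timestamp;         // last decoding step that touched the node
  bool     persistent;        // persistent vs. dynamic node (assumption)
  double   softmax_constant;  // precomputed normalization constant, if any (assumption)
};
```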
  • 21. Fast evaluation of NN LMs. The softmax normalization constant forces the computation of all output neurons, $o_i = \exp(a_i) / \sum_{k=1}^{|\bar{A}|} \exp(a_k)$, even when only a few are needed. Softmax has several advantages: it ensures true probability computations with ANNs and improves training convergence.
    Number of weights at the output layer: |Ω′| = 100 → 25 700; |Ω′| = 1 000 → 257 000; |Ω′| = 10 000 → 2 570 000.
    Our solution: precompute the most important constants needed during decoding. When a constant is not found, two possibilities are feasible: compute the constant on-the-fly, or use some kind of smoothing.
  • 22. Fast evaluation of NN LMs.
    Preliminary notes: for an NN LM of order N there exist $|\Omega_I|^{N-1}$ softmax normalization constants; note that a bigram only needs $|\Omega_I|$. $\Omega_I$ is the restricted NN LM input vocabulary. [Figure: a 4-gram NN LM.]
  • 23. Fast evaluation of NN LMs.
    Precomputation procedure (training): N-gram contexts are extracted from a training corpus, counting their frequencies. For the most frequent contexts, the softmax normalization constants are computed. The TrieLM stores as persistent nodes the N-gram context words related to the softmax normalization constants and their associated constant values. A sketch of this step follows below.
    Softmax normalization constant computation, 3-gram example for the sentence "A MOVE to stop Mr. Gaitskell from …":
    N − 1 context words "A MOVE" → softmax constant 43 418
    N − 1 context words "MOVE to" → softmax constant 78 184
    N − 1 context words "to stop" → softmax constant 88 931
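A compact, hedged sketch of the precomputation step (counting contexts, then storing one constant per frequent context). The `nn_output_activations` callback stands for a forward pass of the NN LM up to the pre-softmax output activations and is an assumption, as is storing the result in a `std::map` rather than in the TrieLM's persistent nodes:

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Context = std::vector<std::string>;   // the N-1 context words
// Forward pass returning the pre-softmax output activations for a context.
using NNForward = std::function<std::vector<double>(const Context&)>;

// Precompute softmax normalization constants for the K most frequent
// N-gram contexts seen in training.
std::map<Context, double> precompute_constants(
    const std::vector<Context>& training_contexts, size_t K,
    const NNForward& nn_output_activations) {
  std::map<Context, size_t> counts;
  for (const Context& c : training_contexts) ++counts[c];

  std::vector<std::pair<size_t, Context>> by_freq;
  for (const auto& kv : counts) by_freq.push_back({kv.second, kv.first});
  std::sort(by_freq.rbegin(), by_freq.rend());   // most frequent first

  std::map<Context, double> constants;
  for (size_t i = 0; i < by_freq.size() && i < K; ++i) {
    const Context& c = by_freq[i].second;
    double Z = 0.0;
    for (double a : nn_output_activations(c)) Z += std::exp(a);
    constants[c] = Z;   // in the thesis, stored in persistent TrieLM nodes
  }
  return constants;
}
```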
  • 24. Fast evaluation of NN LMs.
    Smoothed approach (evaluation): use a simpler model if a constant is not found. The simplest model is a bigram NN LM or a standard N-gram.
    [Figure: fallback flow for computing P(stop | a move to) in the sentence "a move to stop Mr. Gaitskell …": search the context "a move to" among the precomputed 3-gram softmax normalization constants; if found, use the 4-gram NN LM; otherwise search "move to" among the 2-gram constants and use the 3-gram NN LM; otherwise fall back to the 2-gram NN LM with the 1-gram constants.]
  • 25. Fast evaluation of NN LMs.
    Smoothed approach pros/cons: – model quality reduction; + constants do not need to be computed. There is a trade-off between quality and speed: more precomputed constants means higher quality, but slower speed.
    On-the-fly approach pros/cons: + full model quality; – a lot of constants need to be computed on-the-fly (more precomputed constants means faster speed, and computed constants can be stored for future use); – always slower than the smoothed approach. Both modes are sketched below.
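A sketch contrasting the two evaluation modes: look the constant up and, when it is missing, either compute it on-the-fly or back off to a simpler model. All names are illustrative; `act_one`/`act_all` stand for the pre-softmax activation of a single output unit and of the whole output layer, and `simpler_lm` for the backoff model:

```cpp
#include <cmath>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Context = std::vector<std::string>;
using NNOne = std::function<double(const Context&, int)>;            // one output activation
using NNAll = std::function<std::vector<double>(const Context&)>;    // all output activations

// On-the-fly mode: full model quality; a missing constant is computed from
// all output activations and stored for future reuse.
double prob_on_the_fly(const Context& ctx, int word,
                       std::map<Context, double>& constants,
                       const NNOne& act_one, const NNAll& act_all) {
  auto it = constants.find(ctx);
  if (it == constants.end()) {
    double Z = 0.0;
    for (double a : act_all(ctx)) Z += std::exp(a);
    it = constants.emplace(ctx, Z).first;   // cache the new constant
  }
  return std::exp(act_one(ctx, word)) / it->second;
}

// Smoothed mode: if the constant is missing, back off to a simpler model
// (a lower-order NN LM or a standard N-gram), trading quality for speed.
double prob_smoothed(const Context& ctx, int word,
                     const std::map<Context, double>& constants,
                     const NNOne& act_one,
                     const std::function<double(const Context&, int)>& simpler_lm) {
  auto it = constants.find(ctx);
  if (it == constants.end()) return simpler_lm(ctx, word);
  return std::exp(act_one(ctx, word)) / it->second;
}
```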
  • 26. Experiments with PPL on text: NN LMs and speed-up.
    Fast evaluation approach. LOB-ale corpus: random sentences from LOB with a closed vocabulary. Three configurations are evaluated: On-the-fly Fast NN LM (computing constants when needed), Smoothed Fast NN LM (using a simpler model), and Smoothed-SRI Fast NN LM (the simplest model is an SRI model).
    PPL and speed-up results:
    Mixed NN LM (reference): 6.43 ms/word, test PPL 79.90
    On-the-fly Fast NN LM: 1.82 ms/word, speed-up 3, test PPL 79.90
    Smoothed Fast NN LM: 0.19 ms/word, speed-up 33, test PPL 80.78
    Smoothed-SRI Fast NN LM: 0.19 ms/word, speed-up 33, test PPL 79.02
  • 27. Experiments with PPL on text: NN LMs and speed-up. [Figure: Smoothed Fast NN LM results.]
  • 28. Experiments on HTR: NN LMs and integrated decoding. Totally integrated decoding. Experiments on the influence of pruning, WER, and time. IAM-DB task, presented in depth in the sequence recognition part. A 4-gram NN LM linearly combined with an SRI bigram, following the on-the-fly (standard) approach and the smoothed approach.
  • 29. Experiments on HTR: NN LMs and integrated decoding.
    [Figures: WER vs. histogram pruning size (×1000), decoding time (sec/word) vs. WER, and relative time ratio vs. histogram pruning size, for the Smoothed Fast NN LM, the standard NN LM and the SRI bigram.]
    Conclusions: the Smoothed Fast NN LM gives better time with the same WER; it improves WER by 8% with an additional 10% of time; the standard NN LM approach gives the same WER as the smoothed one, but is two times slower.
  • 30. Index 1. Introduction 2. Connectionist language modeling 3. Sequence recognition applications 4. Machine translation applications 5. Conclusions
  • 31. Hidden Markov Model decoding. Based on HMM/ANN models. Two-step decoding with pruning synchronization. N-gram Viterbi decoder with integrated NN LMs.
    [Figure: dataflow $\bar{x}$ → GenProbs Module → OSE Module → WGen Module → NGramParser Module → word graph → $\bar{y}$, with pruning by $\max_{e \in \text{active vertex}} p(e)$.]
    Two tasks: Spoken Language Understanding (SLU) and Handwritten Text Recognition (HTR).
  • 32. Spoken Language Understanding.
    Cache NN LM for long-distance dependencies.
    Cache-based LM: $p(y_j \mid y_{j-1} \ldots y_1) = \alpha\, p(y_j \mid \bar{h}_j) + (1 - \alpha)\, p_{cache}(y_j \mid \bar{h}_1^{u-1})$ (sketched below).
    Cache NN LM: $p(\bar{y}) \approx p(\bar{y} \mid \bar{h}_1^{u-1}) \approx \prod_{j=1}^{|\bar{y}|} p(y_j \mid \bar{h}_j, \bar{h}_1^{u-1})$,
    where $\bar{h}_j$ is the N-gram context and $\bar{h}_1^{u-1}$ is the cache part.
    The Cache NN LM receives a summary of all previous machine/user interactions. The cache part remains the same during the decoding of one sentence.
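For reference, the classical cache-based interpolation on the slide can be sketched as follows; the Cache NN LM itself instead feeds a summary of the previous interactions into the network input, so this snippet only illustrates the interpolation formula, with illustrative class and member names:

```cpp
#include <string>
#include <unordered_map>

// Classical cache-based LM: interpolate the static N-gram probability with a
// unigram estimate taken from the recent history h_1^{u-1} (the cache).
class CacheLM {
 public:
  explicit CacheLM(double alpha) : alpha_(alpha), total_(0) {}

  void add_to_cache(const std::string& word) {   // called after each decoded word
    ++counts_[word];
    ++total_;
  }

  double prob(double p_ngram, const std::string& word) const {
    double p_cache = 0.0;
    if (total_ > 0) {
      auto it = counts_.find(word);
      p_cache = (it != counts_.end()) ? double(it->second) / total_ : 0.0;
    }
    return alpha_ * p_ngram + (1.0 - alpha_) * p_cache;
  }

 private:
  double alpha_;                                 // interpolation weight
  std::unordered_map<std::string, long> counts_; // cache word counts
  long total_;                                   // total words in the cache
};
```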
  • 33. Spoken Language Understanding SLU using language models of pairs: concept/word sequences. Using Cache NN LM, different summary combinations were tested.
  • 34. Spoken Language Understanding.
    Experiments: MEDIA French corpus. Using the Cache NN LM, different summary combinations: A) a cache of concepts only; B) only words in the cache; C) concepts+words in the cache; D) concepts+words+Wizard-of-Oz in the cache.
    MEDIA statistics:
    Training: 12 811 sentences, 87 297 running words, 42 251 running concepts
    Validation: 1 241 sentences, 9 996 running words, 4 652 running concepts
    Test: 3 468 sentences, 24 598 running words, 11 790 running concepts
  • 35. Spoken Language Understanding. CER results (%) for validation and test sets (Val. / Test):
    baseline-a: 2-grams 33.6 / 30.1, 3-grams 32.9 / 29.3, 4-grams 33.5 / 29.3
    baseline-b: 2-grams 33.1 / 28.3, 3-grams 30.7 / 27.4, 4-grams 30.2 / 28.1
    cacheNNLM-A: 2-grams 31.7 / 28.2, 3-grams 29.7 / 27.0, 4-grams 29.7 / 27.0
    cacheNNLM-B: 2-grams 30.5 / 27.3, 3-grams 29.7 / 27.0, 4-grams 30.0 / 26.1
    cacheNNLM-C: 2-grams 32.2 / 28.3, 3-grams 30.5 / 27.0, 4-grams 30.8 / 27.4
    cacheNNLM-D: 2-grams 31.2 / 28.2, 3-grams 29.9 / 26.2, 4-grams 30.3 / 27.1
    baseline-a) standard N-gram of pairs, without cache; baseline-b) standard NN LM of pairs, without cache; A) a cache of concepts only; B) only words in the cache; C) concepts+words in the cache; D) concepts+words+Wizard-of-Oz in the cache.
    Conclusions: significant CER reduction using a rather simple SLU model; the best Cache NN LM suggests there is plenty of room for improvement; the use of a cache with long-distance dependencies systematically improves the baselines. Best CER in the literature: 23.8%.
  • 36. Handwritten Text Recognition. Based on HMM/ANN models: 80 characters (26 lowercase and 26 uppercase letters, 10 digits, 16 punctuation marks, white space, and a crossing-out mark). Two-step decoding, total integration of NN LMs. IAM-DB task: off-line text line recognition. Language models trained using LOB+WELLINGTON+BROWN: 103K vocabulary size (|Ω|). Two types of experiments: word-based and character-based. [Figure: IAM-DB text line images.]
  • 37. Handwritten Text Recognition: word-based experiments.
    Validation results (% WER / % CER / % SER for the bigram, 3-gram and 4-gram):
    Bigram mKN, |ΩI| = 103K: 17.3 / 6.2 / 69.8; 17.8 / 6.3 / 70.3; 17.9 / 6.3 / 69.3
    Rescoring with NN LMs:
    Θ = 21 NN, 10K: 16.0 / 5.8 / 67.4; 16.0 / 5.9 / 67.2; 16.2 / 5.8 / 67.2
    Θ = 10 NN, 16K: 15.9 / 5.9 / 67.5; 16.0 / 5.9 / 67.2; 16.4 / 5.9 / 66.5
    Θ = 8 NN, 19K: 16.0 / 5.8 / 66.9; 16.3 / 5.9 / 67.8; 16.9 / 6.1 / 68.8
    Θ = 1 NN, 56K: 16.0 / 5.8 / 66.6; 16.3 / 6.0 / 67.9; 16.9 / 6.2 / 69.2
    Θ = 21 FNN, 10K: 16.0 / 5.8 / 67.4; 15.8 / 5.7 / 66.0; 15.8 / 5.7 / 65.1
    Θ = 10 FNN, 16K: 15.9 / 5.9 / 67.5; 15.9 / 5.7 / 65.0; 15.9 / 5.8 / 66.0
    Θ = 8 FNN, 19K: 16.0 / 5.8 / 66.9; 15.8 / 5.8 / 65.8; 15.8 / 5.7 / 65.8
    Θ = 1 FNN, 56K: 16.0 / 5.8 / 66.6; 15.9 / 5.7 / 66.1; 15.7 / 5.8 / 66.3
    NN LMs integrated during decoding:
    Θ = 21 NN, 10K: 16.0 / 5.8 / 67.0; 16.0 / 5.8 / 67.2; 16.1 / 5.8 / 67.2
    Θ = 10 NN, 16K: 16.1 / 5.8 / 66.9; 16.1 / 5.8 / 67.3; 16.4 / 5.8 / 66.7
    Θ = 8 NN, 19K: 16.1 / 5.9 / 67.0; 16.3 / 5.9 / 67.9; 16.9 / 6.0 / 69.2
    Θ = 1 NN, 56K: 16.1 / 5.8 / 66.7; 16.4 / 6.0 / 67.5; 16.8 / 6.2 / 69.3
    Θ = 21 FNN, 10K: 16.0 / 5.8 / 67.0; 15.8 / 5.8 / 66.5; 15.8 / 5.6 / 65.2
    Θ = 10 FNN, 16K: 16.1 / 5.8 / 66.9; 15.9 / 5.7 / 65.0; 16.0 / 5.8 / 65.6
    Θ = 8 FNN, 19K: 16.1 / 5.9 / 67.0; 15.8 / 5.8 / 65.7; 15.8 / 5.7 / 65.6
    Θ = 1 FNN, 56K: 16.1 / 5.8 / 66.7; 15.9 / 5.7 / 66.1; 15.8 / 5.7 / 65.6
    NN are standard NN LMs; FNN are Smoothed Fast NN LMs; words with frequency > Θ form ΩI.
  • 38. Handwritten Text Recognition: word-based experiments.
    Test results (% WER / % CER / % SER):
    [Graves et al.]: 25.9 ± 0.8 / – / –
    Bigram 20K mKN: 23.4 ± 0.8 / 9.6 ± 0.4 / 78.1 ± 1.6
    Bigram mKN: 21.9 ± 0.7 / 8.8 ± 0.4 / 76.0 ± 1.6
    Rescoring with NN LMs, 4-gram Θ = 21 FNN: 20.2 ± 0.7 / 8.3 ± 0.4 / 72.9 ± 1.6
    NN LMs integrated during decoding, 4-gram Θ = 21 FNN: 20.2 ± 0.7 / 8.3 ± 0.4 / 73.0 ± 1.6
    The 20K mKN setup is from [Graves et al.]; words with frequency > Θ form ΩI.
    Conclusions: differences between input vocabulary sizes are not significant; rescoring and the integrated approach give similar results; this is the first HTR system using NN LMs, with a significant improvement over state-of-the-art results.
  • 39. Handwritten Text Recognition: character-based experiments. LMs using characters instead of words. Useful for tasks with a lack of data. [Figure: validation results.]
  • 40. Handwritten Text Recognition: character-based experiments.
    Test results:
    SRI: 30.9% WER, 13.8% CER, 29.8% OOV accuracy
    NN LM: 24.2% WER, 10.1% CER, 33.8% OOV accuracy
    % improvement: 21.7 (WER), 26.8 (CER), 13.4 (OOV accuracy)
    Conclusions: large improvement over the N-gram baseline (SRI); almost 33% of the words that are OOV for a word-based LM were recovered. These results suggest trying a mixed approach combining a word-based LM with a character-based LM. Ancient document transcription could profit from this approach.
  • 41. Index 1. Introduction 2. Connectionist language modeling 3. Sequence recognition applications 4. Machine translation applications 5. Conclusions
  • 42. Statistical Machine Translation . Overview . Introduction to SMT following phrase-based and N-gram-based approaches. Decoding process algorithm details. Experiments for decoder quality and integration of NN LMs. .
  • 43. Statistical Machine Translation. Following the maximum entropy approach: a log-linear combination of several models. This work focuses on the phrase-based and N-gram-based approaches. Both follow a similar basis: word alignments, segmentation into tuples (bilingual phrases), log-linear modeling. Phrase-based: all consistent phrases (multiple segmentations) are extracted. N-gram-based: the source language sentence is reordered to follow the target language order; a unique segmentation into tuples is extracted for training, while in decoding multiple paths are possible.
    Fundamental equation:
    $\hat{\bar{y}} = \text{target\_part}(\hat{T}, \hat{\phi})$, with $(\hat{T}, \hat{\phi}) = \operatorname*{arg\,max}_{(T, \phi)} \prod_{m=1}^{M} H_m(T, \phi)^{\lambda_m}$,
    where $T = T_1, T_2, \ldots, T_{|T|}$ is a sequence of bilingual tuples and $\phi : \{1, 2, \ldots, |\bar{x}|\} \to \{1, 2, \ldots, |T|\}$ is a mapping between source words and tuples.
  • 44. Statistical Machine Translation.
    Source: María no daba una bofetada a la bruja verde
    Reordered source: María no daba una bofetada a la verde bruja
    Target: Mary did not slap the green witch
    [Figure: word alignment between the reordered source and the target sentence.]
    Phrase-based (all consistent phrases): (María, Mary), (no, did not), (María no, Mary did not), (daba una bofetada, slap), (verde, green), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (a la, the), (bruja, witch), (bruja verde, green witch), (a la bruja verde, the green witch), …
    N-gram-based (unique tuple segmentation): (María, Mary) ⇒ T1, (no, did not) ⇒ T2, (daba una bofetada, slap) ⇒ T3, (a, <NULL>) ⇒ T4, (la, the) ⇒ T5, (verde, green) ⇒ T6, (bruja, witch) ⇒ T7.
  • 45. Machine Translation decoding. Phrase-based and N-gram-based approaches follow a similar decoding algorithm. Major difference: the N-gram-based approach uses an LM for the computation of the joint probability; the phrase-based approach does not need the joint probability, it uses a larger phrase table with conditional probabilities.
    General decoding overview: Reordering Module → Word2Tuple Module → Viterbi Module, with pruning by $\max_{v \in \text{current stage}} p(v)$ and $\max_{e \in \text{current active vertex}} p(e)$.
  • 46. Machine Translation decoding. [Figure: search graph example for translating "tengo un coche rojo" into "I have a red car", with coverage bit-vectors (0000, 1000, 1010, …, 1111) and phrase hypotheses such as tengo|||I have, un|||a, coche rojo|||red car.]
  • 47. Machine Translation experiments . N-best list rescoring with NN LMs . IWSLT’06 Italian-English task. WMT’10 and WMT’11 English-Spanish task. . . Totally integrated NN LMs decoding . IWSLT’10 French-English task. News-Commentary 2010 Spanish-English task. . Participation in international evaluation campaigns IWSLT’10, WMT’10 and WMT’11, achieving very well positioned systems (second position at IWSLT’10 and WMT’11).
  • 48. Machine Translation experiments. News-Commentary 2010 Spanish-English task. Total integration of NN LMs in decoding vs. N-best list rescoring. Comparative evaluation of the N-gram-based and phrase-based approaches. Phrase-based models are trained using the Giza++ and Moses toolkits, decoding with April. N-gram-based models are trained with the Giza++ and April toolkits, decoding with April.
  • 49. Machine Translation experiments.
    News-Commentary 2010 statistics (Spanish / English):
    News-Commentary 2010: 80.9K lines, 1.8M words / 81.0K lines, 1.6M words; vocabulary size 38 781
    News2008: 2.0K lines, 52.6K words / 2.0K lines, 49.7K words
    News2009: 2.5K lines, 68.0K words / 2.5K lines, 65.6K words
    News2010: 2.5K lines, 65.5K words / 2.5K lines, 61.9K words
    N-gram-based bilingual tuple translation corpus: News-Commentary 2010, 80.9K lines, 1.5M tuples, vocabulary size 231 981. The tuple vocabulary size is too large for direct NN LM training ⇒ NN LMs of statistical classes.
    English data for LM training: News-Commentary 2010, 125.9K lines, 2.97M words.
  • 50. Machine Translation experiments.
    NN LMs of statistical classes, training procedure:
    1. Using Giza++, a non-ambiguous mapping between tuples and classes is built (CLS classes).
    2. The conditional probability of a tuple given its class is computed by counting: $p(z \in \Delta \mid c \in CLS) = C(z|c) / \sum_{z' \in \Delta} C(z'|c)$, where $C(z|c)$ is the count of tuple $z$ in class $c$.
    3. Tuples are substituted by their corresponding class.
    4. The standard training algorithm is used over the previous dataset to estimate the NN LMs.
    5. The joint probability is computed following (see the sketch below): $p(\bar{x}, \bar{y}) \approx \prod_i p(T_i \mid T_{i-1} \ldots T_{i-N+1}) \approx \prod_i p(T_i \mid c_i) \cdot p(c_i \mid c_{i-1} \ldots c_{i-N+1})$.
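Step 5 above can be sketched as follows (a hedged illustration, not the thesis implementation): `p_tuple_in_class` stands for the counted conditional of step 2 and `p_class_nn` for the class N-gram NN LM of step 4.

```cpp
#include <cmath>
#include <functional>
#include <vector>

// Joint log-probability of a tuple sequence under the class-based NN LM:
//   p(x, y) ~= prod_i p(T_i | c_i) * p(c_i | c_{i-1} ... c_{i-N+1})
double joint_logprob(
    const std::vector<int>& tuples,    // tuple ids T_1 ... T_|T|
    const std::vector<int>& classes,   // their classes c_1 ... c_|T|
    int N,
    const std::function<double(int, int)>& p_tuple_in_class,        // p(T_i | c_i)
    const std::function<double(int, const std::vector<int>&)>& p_class_nn) {
  double logp = 0.0;
  for (size_t i = 0; i < tuples.size(); ++i) {
    // Class N-gram context c_{i-1} ... c_{i-N+1}, truncated at the sentence start.
    std::vector<int> ctx;
    for (int k = 1; k < N && i >= static_cast<size_t>(k); ++k)
      ctx.push_back(classes[i - k]);
    logp += std::log(p_tuple_in_class(tuples[i], classes[i]));
    logp += std::log(p_class_nn(classes[i], ctx));
  }
  return logp;
}
```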
  • 51. Machine Translation experiments.
    [Figure: BLEU on News2009 as a function of the N-gram order (2 to 5) for several NN translation models.]
    News2009 results:
    April-NB baseline: BLEU 20.2, TER 60.4
    + NNTLM-4gr: BLEU 20.9, TER 59.9
    + NNTM-300-4gr: BLEU 21.1, TER 59.7
    + NNTM-500-4gr: BLEU 21.2, TER 59.7
  • 52. Machine Translation experiments. Results (News2009 BLEU / TER; News2010 BLEU / TER; time in s/sentence):
    Moses: 20.4 / 60.3; 22.6 / 57.8; 0.6
    April-PB: 20.6 / 60.3; 22.7 / 57.8; 0.4
    Moses⋆: – / –; 22.6 / 57.9; 0.6
    April-NB: 20.2 / 60.4; 22.7 / 58.0; 0.8
    Integrating smoothed Fast NN LMs in the decoder:
    April-PB + NNTLM: 21.2 / 59.8; 23.2 / 57.5; 1.8
    April-NB + NNTLM: 20.9 / 59.9; 23.2 / 57.4; 1.8
    April-NB + NNTM: 20.7 / 60.0; 23.3 / 57.6; 1.6
    April-NB + NNTLM + NNTM: 21.2 / 59.7; 23.6 / 57.1; 2.5
    Integrating on-the-fly Fast NN LMs (standard NN LMs) in the decoder:
    April-PB + NNTLM: – / –; 23.3 / 57.3; 384.3
    April-NB + NNTLM + NNTM: – / –; 23.7 / 57.1; 177.3
    Rescoring 2000-uniq-best lists with standard NN LMs:
    April-PB + NNTLM: 21.1 / 59.9; 23.4 / 57.3
    April-NB + NNTLM: 20.9 / 60.0; – / –
    April-NB + NNTM: 20.6 / 60.2; – / –
    April-NB + NNTLM + NNTM: 21.1 / 59.8; 23.5 / 57.4
  • 53. Machine Translation experiments.
    [Figures: BLEU, TER and time (s/sentence) for the N-gram-based (NG) and phrase-based (PB) systems as a function of the number of pre-calculated softmax normalization constants (log-scaled).]
    Conclusions: the Smoothed Fast NN LM loses ≈ 0.2 BLEU/TER points. The phrase-based system is a bit worse when integrated; the N-gram-based system is a bit better when integrated, with statistical significance at 95% confidence using a pairwise test. Adding the class-based NNTM improves 0.6 BLEU points. The Smoothed Fast NN LM system is two/three times slower than the baseline decoder, but achieves a speed-up of 70 compared with the standard NN LM.
  • 54. Index 1. Introduction 2. Connectionist language modeling 3. Sequence recognition applications 4. Machine translation applications 5. Conclusions
  • 55. Final conclusions I. Contributions to connectionist language modeling: a speed-up technique based on the precomputation of softmax normalization constants; formalization and development of a totally coupled Viterbi decoding, with comparable computational cost for sequence recognition tasks and a computational cost two/three times higher for Machine Translation tasks, improving the baseline quality in every case; extension to dynamic domain adaptation with cache-based NN LMs.
  • 56. Final conclusions II.
    Contributions to sequence recognition: encouraging results using Cache NN LMs for an SLU task; state-of-the-art improvement using NN LMs in HTR tasks; character-based NN LMs to deal with the lack of data.
    Contributions to SMT: implementation of a DP decoding algorithm for SMT; improving N-gram-based SMT by using class-based NN LMs; well-positioned systems at international Machine Translation evaluations.
  • 57. Future work.
    Connectionist language modeling: a new projection layer initialization method based on POS tags; improved combination of NN LMs and standard N-grams: GLI-CS.
    Sequence recognition: improve SLU using deep learning with continuous space techniques; an HTR system combining character-based and word-based LMs (OOV).
    Statistical Machine Translation: application of the continuous space idea to reordering models; study of Cache NN LMs for document translation; new NN LMs for the vocabulary dispersion problem in N-gram-based SMT; integration of the SMT decoder for human-assisted transcription.
  • 58. Publications related to this PhD I.
    Speed-up technique for NN LMs: F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera. Fast Evaluation of Connectionist Language Models. Pages 33–40 of IWANN 2009 proceedings, Salamanca. S. España-Boquera, F. Zamora-Martínez, M.J. Castro-Bleda, J. Gorbe-Moya. Efficient BP Algorithms for General Feedforward Neural Networks. Pages 327–336 of IWINAC 2007, Murcia.
    Spoken Language Understanding and Cache NN LMs: F. Zamora-Martínez, Salvador España-Boquera, M.J. Castro-Bleda, Renato de-Mori. Cache Neural Network Language Models based on Long-Distance Dependencies for a Spoken Dialog System. Pages 4993–4996 of the IEEE ICASSP 2012 proceedings, Kyoto (Japan).
  • 59. Publications related to this PhD II.
    Handwritten Text Recognition: F. Zamora-Martínez, V. Frinken, S. España-Boquera, M.J. Castro-Bleda, A. Fischer, H. Bunke. Neural Network Language Models in Off-Line Handwriting Recognition. Pattern Recognition. SUBMITTED. F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera, J. Gorbe-Moya. Unconstrained Offline Handwriting Recognition using Connectionist Character N-grams. Pages 18–23 of the IJCNN, Barcelona, 2010.
    Statistical Machine Translation: F. Zamora-Martínez, M.J. Castro-Bleda. CEU-UPV English-Spanish system for WMT11. Pages 490–495 of the WMT 2011 proceedings, Edinburgh (Scotland). F. Zamora-Martínez, M.J. Castro-Bleda, H. Schwenk. Ngram-based Machine Translation enhanced with Neural Networks for the French-English BTEC-IWSLT'10 task. Pages 45–52 of the IWSLT 2010 proceedings, Paris (France).
  • 60. Publications related to this PhD III.
    F. Zamora-Martínez, Germán Sanchis-Trilles. UCH-UPV English-Spanish system for WMT10. Pages 207–211 of the WMT 2010 proceedings, Uppsala (Sweden). F. Zamora-Martínez, M.J. Castro-Bleda. Traducción Automática Estadística basada en Ngramas Conexionistas. Pages 221–228 of the SEPLN journal, volume 45, number 45, 2010, Valencia (Spain). Maxim Khalilov, José A. R. Fonollosa, F. Zamora-Martínez, María J. Castro-Bleda, S. España-Boquera. Neural Network Language Models for Translation with Limited Data. Pages 445–451 of the ICTAI 2008 proceedings, Dayton (USA). Maxim Khalilov, José A. R. Fonollosa, F. Zamora-Martínez, María J. Castro-Bleda, S. España-Boquera. Arabic-English translation improvement by target-side neural network language modeling. In proceedings of the HLT & NLP workshop at LREC 2008, Marrakech (Morocco).
  • 61. Publications in collaboration I.
    Recurrent NN LM based on Long-Short Term Memories: Volkmar Frinken, F. Zamora-Martínez, Salvador España-Boquera, María J. Castro-Bleda, Andreas Fischer, Horst Bunke. Long-Short Term Memory Neural Networks Language Modeling for Handwriting Recognition. ICPR 2012 proceedings, Tsukuba (Japan).
    Handwritten Text Recognition: S. España-Boquera, M.J. Castro-Bleda, J. Gorbe-Moya, F. Zamora-Martínez. Improving Offline Handwritten Text Recognition with Hybrid HMM/ANN Models. Pages 767–779 of the IEEE TPAMI journal, volume 33, number 4, 2011. F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera, J. Gorbe-Moya. Improving Isolated Handwritten Word Recognition Using a Specialized Classifier for Short Words. Pages 61–70 of CAEPIA 2009 proceedings, Sevilla. J. Gorbe-Moya, S. España-Boquera, F. Zamora-Martínez, M.J. Castro-Bleda. Handwritten Text Normalization by using Local Extrema Classification. Pages 164–172 of the PRIS 2008 workshop, Barcelona.
  • 62. Publications in collaboration II Decoding: S. España-Boquera, M.J. Castro-Bleda, F. Zamora-Martínez, J. Gorbe-Moya. Efficient Viterbi Algorithms for Lexical Tree Based Models. Pages 179–187 of NOLISP 2007, Paris. S. España-Boquera, J. Gorbe-Moya, F. Zamora-Martínez. Semiring Lattice Parsing Applied to CYK. Pages 603–610 of IbPRIA 2007 conference, Girona.
  • 63. Thanks for your attention! Questions?
