
An Efficient Language Model Using Double-Array Structures


Presentation slides from EMNLP 2013.

Paper : http://aclweb.org/anthology/D/D13/
Direct Link : http://aclweb.org/anthology/D/D13/D13-1023.pdf
Source Code : https://github.com/jnory/DALM



  1. EMNLP 2013: An Efficient Language Model Using Double-Array Structures. Makoto Yasuhara, Toru Tanaka, Jun-ya Norimatsu, Mikio Yamamoto. University of Tsukuba, Japan
  2. Introduction (1): Bigger and Bigger LMs. Have you ever encountered these problems? LMs cannot be loaded into memory because of their size; the query speed of LMs becomes a bottleneck in your system. Our goal: store compactly, query fast!
  3. Our System Overview • LM implementation based on double-array structures • Modified double-array structure to store backward suffix trees • Two optimization methods to improve efficiency. We call our LM “DALM”.
  4. Double-Array Structures (Aoe, 1989). What is a double-array structure? A fast and compact representation of a trie: the trie is represented by two arrays, BASE and CHECK. [Figure: a small example trie (ROOT → A → B) and its double-array representation]
  5. 2D Array Implementation of a Trie. [Figure: a seven-node example trie and its 2D transition table, rows indexed by node number and columns by symbol] A sparse array: simple and fast, but it consumes a lot of memory.
  6. Compact Representation of a Sparse 2D Array. [Figure: each row of the sparse table is shifted and the rows are merged into a single NEXT array; this naive merge loses the information of which row each slot came from] The double-array structure is a modified merge that keeps all the information about the original trie; a sketch of the row-shifting step follows.
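The core of the merge is finding, for each trie node, a shift such that all of its occupied columns land in still-free slots of the merged array. A minimal sketch, assuming a simple linear scan over candidate shifts; the helper name FindShift is illustrative and not part of DALM:

```cpp
#include <vector>

// Find a shift for one trie node's outgoing symbols such that every
// target slot is still free in the merged NEXT array. `used` marks
// occupied slots; `symbols` are the node's edge labels (as integers).
int FindShift(const std::vector<bool>& used, const std::vector<int>& symbols) {
    for (int shift = 0; ; ++shift) {  // terminates: slots past used.size() are free
        bool ok = true;
        for (int c : symbols) {
            int pos = shift + c;
            if (pos < static_cast<int>(used.size()) && used[pos]) { ok = false; break; }
        }
        if (ok) return shift;  // this shift becomes the node's BASE entry
    }
}
```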
  7. Details of Double-Array Structures (Aoe, 1989). Definition: node s has a child t reached via symbol c if and only if t = BASE[s] + c and CHECK[t] = s. [Figure: an example trie and its BASE and CHECK arrays]
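This definition translates directly into a constant-time transition check. A minimal sketch in C++; the struct layout and method name are illustrative, not taken from the DALM source:

```cpp
#include <cstdint>
#include <vector>

// Aoe's double-array transition rule: the child t of node s via symbol c
// exists iff t = BASE[s] + c and CHECK[t] == s.
struct DoubleArray {
    std::vector<int32_t> base;
    std::vector<int32_t> check;

    // Returns the child of `s` reached via symbol `c`, or -1 if absent.
    int32_t Traverse(int32_t s, int32_t c) const {
        int32_t t = base[s] + c;  // candidate child index
        if (t < 0 || t >= static_cast<int32_t>(check.size())) return -1;
        return (check[t] == s) ? t : -1;  // valid iff CHECK confirms the parent
    }
};
```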
  8. Efficient Trie Representations for N-gram Models: backward suffix trees (Bell et al., 1990; Stolcke, 2002; Germann et al., 2009). History words are stored in reverse order, and target words are stored in separate lists. This makes back-off efficient: when the node for the next history word is not found (e.g., the B node), the nodes already visited are exactly those of the shortened histories. [Figure: a backward suffix tree over the words A, B, C, X, Y, Z]
  9. Endmarker Symbols for Backward Suffix Trees. Endmarker symbols (Aoe, 1989) are placed after the history words; the target word follows the endmarker symbol. [Figure: backward suffix trees extended with endmarker (#) nodes that separate the histories from the target words]
  10. Double-array Representation of Backward Suffix Trees. Endmarker symbols are treated as words: a word ID is assigned to the endmarker symbol, so the whole tree fits the standard double-array scheme. A lookup sketch follows. [Figure: the backward suffix tree and its BASE and CHECK arrays]
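Putting the pieces together, an n-gram lookup walks the reversed history, crosses the endmarker, then follows the target word. A minimal sketch assuming the BASE/CHECK transition rule above; the endmarker ID value and the function names are illustrative:

```cpp
#include <cstdint>
#include <vector>

constexpr int32_t kEndmarkerId = 1;  // assumption: the word ID given to the endmarker

// One transition: child of `s` via symbol `c`, or -1 if absent.
static int32_t Step(const std::vector<int32_t>& base,
                    const std::vector<int32_t>& check, int32_t s, int32_t c) {
    int32_t t = base[s] + c;
    if (t < 0 || t >= static_cast<int32_t>(check.size())) return -1;
    return (check[t] == s) ? t : -1;
}

// Locate the node of the n-gram "history -> target". The history arrives
// in sentence order (w1 ... w_{n-1}) and is walked most recent word first.
int32_t FindNgramNode(const std::vector<int32_t>& base,
                      const std::vector<int32_t>& check,
                      const std::vector<int32_t>& history, int32_t target) {
    int32_t node = 0;  // root
    for (auto it = history.rbegin(); it != history.rend(); ++it) {
        node = Step(base, check, node, *it);
        if (node < 0) return -1;  // a full LM would back off to a shorter history
    }
    node = Step(base, check, node, kEndmarkerId);  // cross the endmarker...
    if (node < 0) return -1;
    return Step(base, check, node, target);        // ...then follow the target word
}
```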
  11. Double-array Language Model: Simple Structure. Introducing a VALUE array: it holds the corresponding probabilities and back-off weights (BOWs). [Figure: BASE, CHECK, and VALUE arrays for an example tree]
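In this simple structure, the node index found by the trie walk doubles as an index into VALUE. A minimal sketch; the record layout is an assumption for illustration, not DALM's actual format:

```cpp
#include <cstdint>
#include <vector>

// One VALUE entry per node: the n-gram probability and the back-off
// weight of the history ending at that node (layout assumed).
struct Value {
    float prob;
    float bow;
};

// After the trie walk returns a node index, scoring is a direct read.
inline float ProbAt(const std::vector<Value>& value, int32_t node) {
    return value[node].prob;
}
```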
  12. Double-array Language Model: Embedding Structures (1). Filling unused slots with values: the empty slots of the double array are used to store values. [Figure: unused slots scattered through the BASE and CHECK arrays]
  13. Double-array Language Model: Embedding Structures (2). Using the BASE and CHECK arrays to store values: an otherwise-unused slot holds an index into the VALUE array with a negative sign, so the quantization is lossless. [Figure: BASE, CHECK, and VALUE arrays with an embedded negative index (-2)]
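The sign is what separates structural entries from embedded values. A minimal sketch of the decoding side, assuming the slot stores -(i + 1) for VALUE index i so that index 0 remains representable; the exact encoding in DALM may differ:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct EmbeddedArray {
    std::vector<int32_t> base;   // slot >= 0: trie structure; slot < 0: value index
    std::vector<float>   value;  // probabilities and back-off weights

    // Reads a value embedded in an otherwise-unused BASE slot.
    // Encoding assumed here: the slot stores -(i + 1) for VALUE index i.
    float LoadValue(int32_t slot) const {
        int32_t b = base[slot];
        assert(b < 0 && "slot must hold an embedded value index");
        return value[-b - 1];
    }
};
```

The same trick works for CHECK slots: a negative CHECK entry can never equal a valid parent index, so trie transitions that happen to probe such a slot fail safely.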
  14. Double-array Language Model: Ordering Method (1). Tuning word IDs: sort the words in order of descending unigram probability P(word) and assign word IDs in that order.

         Word   P(word)   ID
         #      -         1
         B      0.0413    2
         X      0.0300    3
         A      0.0284    4
         Y      0.0201    5
         C      0.0101    6
         Z      0.0050    7
         D      0.0020    8
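A minimal sketch of the ID assignment; reserving ID 1 for the endmarker matches the example table above, and the function name is illustrative:

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assign word IDs in order of descending unigram probability.
// Input: (word, P(word)) pairs; output: word -> ID.
std::unordered_map<std::string, int> AssignWordIds(
    std::vector<std::pair<std::string, double>> unigrams) {
    std::sort(unigrams.begin(), unigrams.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    std::unordered_map<std::string, int> ids;
    ids["#"] = 1;  // the endmarker symbol keeps the smallest ID
    int next = 2;
    for (const auto& entry : unigrams) ids[entry.first] = next++;
    return ids;
}
```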
  15. Double-array Language Model: Ordering Method (2). Modifying the 2D array: the columns are renumbered with the new word IDs. [Figure: the node-by-word transition table before and after ordering; frequent words move to low column indices, packing the occupied slots together]
  16. Experiments: Datasets

                                         100 Mwords   5 Gwords   Test set
         Corpus size [words]             100 M        5 G        100 M
         Unique types [words]            195 K        2,140 K    198 K
         N-grams (unigrams to 5-grams)   31 M         936 M      -

     Data source: publications of unexamined Japanese patent applications, distributed with the NTCIR 3, 4, 5, and 6 patent retrieval tasks (Iwayama et al., 2003; Fujii et al., 2004; 2005; 2007).
  17. Comparison: Proposed Methods. [Figure: results for the 100-Mword corpus]
  18. Division Method. Building a large double-array structure takes a lot of time (Nakamura and Mochizuki, 2006), and it is impractical to wait for the 5-Gword model to be built. We therefore divide the trie into several parts and build a double-array for each part, as sketched below. [Figure: one trie divided into several smaller tries]
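A minimal sketch of the partitioning step. Routing each reversed n-gram by its first symbol, and the modulo rule in particular, are assumptions for illustration; the paper's actual division criterion may differ:

```cpp
#include <cstdint>
#include <vector>

// Route each reversed n-gram (most recent history word first) to one of
// `num_parts` buckets; each bucket is then built into its own, smaller
// double-array. Modulo routing is an assumed placeholder rule.
std::vector<std::vector<std::vector<int32_t>>> Divide(
    const std::vector<std::vector<int32_t>>& reversed_ngrams, int num_parts) {
    std::vector<std::vector<std::vector<int32_t>>> parts(num_parts);
    for (const auto& ngram : reversed_ngrams)
        parts[ngram.front() % num_parts].push_back(ngram);
    return parts;
}
```

Each part is much smaller than the full trie, which is what makes building the 5-Gword model practical.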
  19. Experiments: Division Methods. [Figure: results for the 100-Mword corpus]
  20. Experiments: Other Methods. [Figure: results for the 100-Mword and 5-Gword corpora]
  21. Discussion. DALM is smaller and faster than KenLM Probing; the smallest LM is KenLM Trie. The differences between KenLM Probing and DALM are smaller for the 5-Gword model than for the 100-Mword model: large language models require less back-off time.
  22. Conclusion. We proposed an efficient language model using double-array structures. • Double-array structures are a fast and compact representation of tries • We use double-array structures to represent backward suffix trees. We proposed two optimization methods, embedding and ordering. • Embedding: using empty slots in the double array to store values • Ordering: tuning word IDs to make LMs smaller and faster. In experiments, DALM achieved the best speed among the compared LMs while keeping a modest model size.
  23. Questions? My English skills are limited, so please speak slowly if you have any questions.
