Seq2seq Model to Tokenize the Chinese Language
Catherine Xiao, Hang Jiang, Jinho D. Choi, PhD
Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA 30322
What is the seq2seq model?
• Based on an improved Recurrent Neural Network (RNN) architecture called LSTM (Long Short-Term Memory)
• Conditions the generated sequence of words on some input and produces a meaningful sequence in response
• Example application: translating English to French
• Learns not only vocabulary but also how to use it
RNN vs LSTM
• Simple RNN captures short-range context: I __ a student.
• LSTM also captures long-range context: I once lived in France. … Therefore, I speak ___.
Project Goal
• Test whether the seq2seq model is robust enough to perform Chinese tokenization without much optimization
Seq2seq Model
Language Translation vs Syntactic Constituency Parsing
• The Google team reached a 90.5% F-1 score, the state of the art (Vinyals et al., 2015)
Fig.2 An English sentence as input and a tokenized English sentence as output
Language Translation vs Tokenization
• Modified the seq2seq model to fit the Chinese language
• Treated tokenized Chinese as a foreign language to be translated from raw text
• Expected to achieve reasonable results
Grammar as a Foreign Language
Three Steps
Data Preparation
F-1 Score
• Ranges over [0,1]; the harmonic mean of precision and recall
• Used to measure tokenization accuracy (a scoring sketch follows below)
• State of the art for tokenization: 90%
• We reached around 30% in the end
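For reference, a minimal sketch (not the project's actual scoring code) of how a token-level F-1 score can be computed: each spaced sentence is converted into character-offset word spans, and F-1 is the harmonic mean of precision and recall over the predicted spans.

    # Minimal sketch: token-level F-1 for tokenization output.
    def spans(tokenized):
        """'我们 是 谁' -> {(0, 2), (2, 3), (3, 4)} character-offset spans."""
        out, start = set(), 0
        for word in tokenized.split():
            out.add((start, start + len(word)))
            start += len(word)
        return out

    def f1(gold, pred):
        g, p = spans(gold), spans(pred)
        overlap = len(g & p)  # word spans predicted exactly right
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p), overlap / len(g)
        return 2 * precision * recall / (precision + recall)

    print(f1(u"我们 是 谁", u"我们 是谁"))  # 0.4: one of three gold spans matched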
Fig.4 A glance at our typical output
Analysis
Applying the seq2seq model directly did not give us ideal results. Possible reasons:
• Beam search is needed (Vinyals et al., 2015); see the sketch after this list
• Prunes candidates with low probabilities
• Speeds up decoding
• More optimization is needed to boost the score
• Early stopping
• Dropout
• Multiprocessing is needed to speed up testing
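The beam search mentioned above can be sketched as follows; step(prefix) stands in for a hypothetical decoder call returning (token, log-probability) pairs for the next position. At every step only the k most probable partial outputs survive, which prunes low-probability candidates and keeps decoding fast.

    import math

    def beam_search(step, k=5, max_len=50, eos=u"</s>"):
        """step(prefix) -> list of (token, log_prob) pairs; returns best sequence."""
        beams = [([], 0.0)]  # (partial output, cumulative log-probability)
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                if prefix and prefix[-1] == eos:
                    candidates.append((prefix, score))  # finished hypothesis survives
                    continue
                for token, logp in step(prefix):
                    candidates.append((prefix + [token], score + logp))
            # Prune: keep only the k highest-scoring hypotheses.
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
            if all(p and p[-1] == eos for p, _ in beams):
                break
        return beams[0][0]

    # Toy usage: a fake "model" that emits two tokens and then stops.
    def toy_step(prefix):
        if len(prefix) < 2:
            return [(u"我们", math.log(0.6)), (u"是", math.log(0.4))]
        return [(u"</s>", 0.0)]

    print(beam_search(toy_step, k=2))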
F-1 Score & Analysis
Contributions
• Show that the seq2seq model has its limits in differentiating semantics/syntax
• Show that the seq2seq model by itself needs beam search to reach good F-1 scores
• Provide a viable method for future researchers to tokenize Chinese
Future Work
• Use beam search together with the seq2seq model to improve the F-1 score
• Try other algorithms for Chinese tokenization, since seq2seq was not designed for this task
• Apply seq2seq to other tasks such as disfluency removal or POS tagging
Contributions and Future Work
References
• Abadi, Martín, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
• Vinyals, Oriol, et al. "Grammar as a foreign language." Advances in Neural
Information Processing Systems. 2015.
• Xue, Nianwen, et al. Chinese Treebank 8.0 LDC2013T21. Web Download.
Philadelphia: Linguistic Data Consortium, 2013.
Acknowledgement
• This research was supported by Emory NLP, which assisted with the Emory NLP demo. See http://nlp.mathcs.emory.edu/.
• This research was funded and directed under the Emory Summer Research Partner Program (RPP).
Reference & Acknowledgement
Fig.1 TensorFlow seq2seq model translates English to French
Much scholarship has addressed Indo-European language processing with the goal of accurate machine translation. However, little work has been done so far on the Chinese language, because it is character-based and lacks spaces between words. Our project explores whether the Recurrent Neural Network (RNN), a neural-network-based algorithm, can outperform previous approaches that rely heavily on hand-chosen features.
Specifically, we examine whether the sequence-to-sequence (seq2seq) model can cluster Chinese characters into words and separate the words with single spaces, a process called tokenization. (For example, converting the raw text '我们是谁', which literally means 'we are who', into '我们 是 谁' is tokenization.)
Introduction
Chinese Encoding
• Unicode issues in Python 2
• åæäçè¼ï¸€Œºéš»œ„½ˆ -- gibberish produced by the wrong encoding
• Encoding Chinese with UTF-8 in the tokenizer function (see the sketch below)
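A minimal sketch of the Python 2-era fix (file names and the character-level tokenizer are hypothetical, not the project's actual code): decode bytes to unicode when reading and encode back to UTF-8 when writing, so Chinese characters survive the round trip instead of turning into gibberish.

    # -*- coding: utf-8 -*-
    import io

    def tokenize_line(line):
        # Hypothetical character-level tokenizer: treats each Chinese
        # character as one symbol for the seq2seq vocabulary.
        return list(line.strip().replace(u" ", u""))

    # io.open decodes bytes -> unicode on read, encodes unicode -> UTF-8 on write.
    with io.open("raw.txt", encoding="utf-8") as src, \
         io.open("chars.txt", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(u" ".join(tokenize_line(line)) + u"\n")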
Hyper-parameter Tuning
• Batch size
• Number of neurons per layer
• Number of layers
• Bucket size
• Learning rate (an illustrative configuration follows below)
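The configuration below is purely illustrative (placeholder values, not the settings the project reports), in the style of the TensorFlow seq2seq tutorial, where buckets pad (input length, output length) pairs to a few fixed sizes.

    # Hypothetical hyper-parameter configuration; every value is a placeholder.
    config = {
        "batch_size": 64,       # sentence pairs per gradient update
        "size": 256,            # neurons per LSTM layer
        "num_layers": 2,        # stacked LSTM layers
        "learning_rate": 0.5,   # initial learning rate
        # Each (raw length, tokenized length) pair is padded into the
        # smallest bucket it fits, so batches share one fixed shape.
        "buckets": [(10, 15), (20, 25), (40, 50)],
    }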
Encoding Issue & Hyperparameter Tuning
Data pipeline: Chinese Treebank 8.0 (Xue, 2013) → tokenized and untokenized files → 80% train / 10% dev / 10% eval split (a preparation sketch follows below)
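A minimal sketch of that preparation step, with hypothetical file names: stripping the gold spaces yields the untokenized "source" side, the original line is the tokenized "target" side, and the pairs are then split 80/10/10.

    import random

    # Each gold line from the treebank yields an (untokenized, tokenized) pair.
    with open("ctb8_tokenized.txt", encoding="utf-8") as f:
        gold = [line.strip() for line in f if line.strip()]
    pairs = [(line.replace(" ", ""), line) for line in gold]

    random.shuffle(pairs)
    n = len(pairs)
    train = pairs[:int(0.8 * n)]             # 80% train
    dev = pairs[int(0.8 * n):int(0.9 * n)]   # 10% dev
    evaluation = pairs[int(0.9 * n):]        # 10% eval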
What is tokenization?
• Raw: 我说过如果Ivanka不是我的女儿,也许我会约她。
• Tokenized: 我 说过 如果 Ivanka 不是 我 的 女儿 , 也许 我 会 约 她 。
• Translation: “I’ve said if Ivanka weren’t my daughter, perhaps I’d be dating her.”
Complexities of tokenization
• Word boundaries are highly context-dependent: 热 (hot) 狗 (dog) 屋 (house)
• Example 1: 热狗屋对狗的身体不好 (read as 热 / 狗屋)
• Translation: A hot doghouse is not good for a dog’s health.
• Example 2: 热狗屋是我们经常吃东西的地方。 (read as 热狗 / 屋)
• Translation: The hotdog house is where we usually eat.
Chinese Tokenization
Fig.3 The structure of the neural network in the seq2seq model
