The Chatbot That PTT Netizens Taught Me to Build (鄉民教我做的聊天機器人)
STUDY GROUP MEMBERS
▸ Edhsu bassed1984@gmail.com
▸ Ryan ryanchao2012@gmail.com
▸ 國父 joseph.yen@gmail.com
▸ 儀峰 timlmj@gmail.com
▸ 福德 zuxfoucault@yahoo.com.tw
▸ Hans costonaut@gmail.com
WHERE TO FIND CONVERSATION DATA?
▸ Conversation corpora in English are easy to find, for example:
▸ Cornell Movie Dialogs Corpus
▸ Santa Barbara Corpus of Spoken American English
▸ Ubuntu Dialogue Corpus
▸ Microsoft Research Social Media Conversation Corpus
▸ But conversation corpora in Traditional Chinese are hard to find. Candidates:
▸ Movie subtitles (?)
▸ PTT (?)
MOVIE SUBTITLES
▸ ~220k Chinese movie subtitles (~9M pairs) with LM filtering
▸ Seq2seq results
▸ Unnatural
▸ Poor conversation ability
SO…WE TRY TO EXTRACT DATA FROM PTT
▸ After looking through PTT articles, we found that on some boards an article's title and its comments can be paired.
BACKEND & PLATFORM
▸ Backend: hosted on AWS, with Django managing several crawlers.
▸ ~480k posts crawled so far.
▸ Pairing “article titles” with “comments.”
"I know a little Django too, just a little."
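The title/comment pairing could look roughly like the sketch below. The `Post` structure, the `make_pairs` helper, and the `min_len` filter are all hypothetical names introduced for illustration; the slides do not describe the actual crawler output or filtering rules.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Post:
    """Hypothetical crawled post: one title plus its comments."""
    title: str
    comments: List[str] = field(default_factory=list)


def make_pairs(posts: List[Post], min_len: int = 2) -> List[Tuple[str, str]]:
    """Pair each post title (as the query) with each of its comments (as a reply).

    Comments shorter than min_len characters after trimming are skipped;
    this threshold is an assumption, not something stated in the slides.
    """
    pairs = []
    for post in posts:
        title = post.title.strip()
        if not title:
            continue
        for comment in post.comments:
            comment = comment.strip()
            if len(comment) >= min_len:
                pairs.append((title, comment))
    return pairs
```

Each resulting `(title, comment)` tuple can then serve as one query/response example.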
SOME IMPROVEMENTS WE TRIED
‣ Tokenizer Improvement
‣ Emoji icon pre-processing
(e.g. ◢▆▅▄▃ 崩╰(〒皿〒)╯潰 ▃▄▅▆◣)
‣ Improve tokenizer accuracy
‣ Keyword Extraction & Association
‣ Word2Vec (query an associated term if the original one doesn’t exist)
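One possible shape for the emoji pre-processing step is sketched below: strip decorative box and block characters and drop short parenthesised groups that contain symbol characters, so that only the words reach the tokenizer. The function name, the character ranges, and the 10-character limit are assumptions made for this sketch, not the rules the team actually used.

```python
import re
import unicodedata

# Box Drawing, Block Elements and Geometric Shapes (U+2500..U+25FF),
# which covers decorations such as "◢▆▅▄▃" and "╰ ╯".
DECORATION = re.compile(r"[\u2500-\u25FF]+")

# A short parenthesised group, e.g. the face in "(〒皿〒)".
PAREN_GROUP = re.compile(r"[(（]([^()（）]{1,10})[)）]")


def _looks_like_face(inner: str) -> bool:
    """Treat the group as an emoticon if it contains any Unicode symbol character."""
    return any(unicodedata.category(ch).startswith("S") for ch in inner)


def strip_decorations(text: str) -> str:
    """Remove emoticon faces and decorative characters, then tidy whitespace."""
    text = PAREN_GROUP.sub(
        lambda m: "" if _looks_like_face(m.group(1)) else m.group(0), text
    )
    text = DECORATION.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


example = "◢▆▅▄▃ 崩╰(〒皿〒)╯潰 ▃▄▅▆◣"
cleaned = strip_decorations(example)  # the two words 崩 and 潰 are rejoined
```

Ordinary parenthesised text without symbol characters, such as `今天(週五)放假`, is left unchanged by this heuristic.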
MAP (MEAN AVERAGE PRECISION)
▸ “k” is the rank in the sequence of retrieved documents
▸ “P(k)” is the precision at cut-off “k” in the list.
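As a minimal sketch of how these definitions combine, the code below computes average precision for one ranked list and then averages over queries. It assumes binary relevance flags and normalises by the number of relevant items that appear in the list; some definitions divide by the total number of relevant documents in the collection instead. The function names are illustrative.

```python
from typing import List, Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """Average precision for one ranked list.

    relevance[k-1] is 1 if the document at rank k is relevant, else 0.
    P(k) is accumulated only at ranks where a relevant document appears.
    """
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # P(k): precision at cut-off k
    return total / hits if hits else 0.0


def mean_average_precision(runs: List[Sequence[int]]) -> float:
    """MAP: the mean of average precision over all queries."""
    if not runs:
        return 0.0
    return sum(average_precision(r) for r in runs) / len(runs)
```

For example, a list with relevant documents at ranks 1 and 3 gives (1/1 + 2/3) / 2 = 5/6.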
VECTOR REPRESENTATION FOR DOCUMENT
▸ There are two ways to represent a document as a vector:
▸ Doc2Vec (gensim)
▸ RNN-encoder (arXiv:1506.08909v3)
[Figure: two panels, Doc2Vec and RNN-encoder]
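Whichever encoder produces the vectors, one common way to use them for retrieval is to rank candidates by cosine similarity to the query vector. The sketch below assumes that approach and uses plain Python lists as vectors; the slides do not state which similarity measure was used, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```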
NDCG (NORMALIZED DISCOUNTED CUMULATIVE GAIN)
▸ Cumulative Gain
▸ Discounted Cumulative Gain
▸ Highly relevant documents are more useful when appearing earlier in a search engine result list.
▸ Normalized DCG
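The three quantities above can be sketched as follows, assuming graded relevance scores listed in rank order and the common log2(rank + 1) discount. Other DCG variants exist (for example with a 2^rel - 1 gain), and the slides do not say which one was used.

```python
import math
from typing import Sequence


def cumulative_gain(gains: Sequence[float]) -> float:
    """CG: the plain sum of relevance scores, ignoring rank."""
    return float(sum(gains))


def dcg(gains: Sequence[float]) -> float:
    """DCG: each score is divided by log2(rank + 1), so early ranks count more."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains: Sequence[float]) -> float:
    """NDCG: DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A list already sorted from most to least relevant scores 1.0, and any other ordering of the same scores gives a value below 1.0.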
COMPARE RANKING METHODS WITH NDCG
▸ The Doc2Vec trick works.
▸ There is still room for improvement.
CHALLENGE
▸ For now, it is still very difficult to make a chatbot that converses like a real person, so we will continue to improve in the future …
▸ Resource
▸ Model
▸ Functionality
"The revolution has not yet succeeded; fellow otaku must keep working hard."