
Building a Chinese Natural Language Understanding Engine for Financial Scenarios

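Natural language understanding (NLU) is the core of a question-answering system. For an intelligent agent to help people accomplish a wide range of goals through conversation, it needs an NLU engine capable of intent and entity recognition.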

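In this talk, the speaker, a backend engineer stepping into natural language processing for the first time, introduces the NLP techniques and corresponding machine learning methods needed to implement an NLU module, then shares how the goal was reached with the open-source Rasa NLU project and what advantages that approach offers.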

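Finally, beyond validating the model statistically, the talk pairs it with Rasa Core to build an intelligent conversational agent and compares it with intelligent customer-service assistants already deployed in financial scenarios on the market.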


  1. Building a Chinese Natural Language Understanding Engine for Financial Scenarios. Data R&D Center, 陳皓遠
  2. About me
      • Member of the AI group, CTBC Data R&D Center
      • Past experience in the cyber security and defense industry and the smartphone industry
      • Familiar with machine learning, natural language processing, software development, and cloud native architecture design
  3. Team
      • The CTBC Data R&D Center AI group was founded in 2018
      • The AI group is composed of data scientists and software developers
      • Our mission is to realize AI-based solutions in banking scenarios
      • We currently focus on Computer Vision (CV) and Natural Language Processing (NLP)
      • Retrieved from https://www.ithome.com.tw/news/131697
  4. Achievement
      • NLP
        • Pluto: A Deep Learning based Watchdog for Anti Money Laundering
        • First vertical AI paradigm in the RegTech field at CTBC globally
        • Reduces daily human effort on adverse media screening by 67%
        • Publication: https://www.aclweb.org/anthology/W19-5515
      • CV
        • NIST Face Recognition Verification Test (FRVT): ranked 35th globally and 2nd in Taiwan industry
        • X-ATM for fraud avoidance
      • FRVT ranking excerpt:
        Rank  Company                          Country  FRR
        10    SenseTime (商湯)                 China    0.0092
        18    Face++ (曠視)                    China    0.0145
        26    CyberLink (訊連)                 Taiwan   0.0195
        29    Tencent Deepsea (騰訊)           China    0.0215
        35    CTBC BANK (中國信託)             Taiwan   0.0250
        39    Gorilla Technology (大猩猩)      Taiwan   0.0291
        55    Kneron Inc. (耐能)               Taiwan   0.0902
  5. Outline: Background, Proposed Solution, Evaluation, Prototype, Conclusion
  6. Digital channels play an important role. Source: Global Views Monthly (遠見雜誌), 2018 Digital Finance Survey. Retrieved from https://www.gvm.com.tw/article.html?id=54981
  7. Abundant platforms for conversational assistants: messaging platforms, Google Home, Amazon Echo
  8. Eno, your Capital One dialogue assistant
      • A task-oriented dialogue system
      • Chats in natural language
      • Realized on Amazon Alexa
  9. Motivation
      • Realize a task-oriented dialogue system in Mandarin on heterogeneous conversational platforms to serve customers in banking scenarios
      Prerequisite
      • A natural language understanding (NLU) module, consisting of intent recognition (IR) and named entity recognition (NER)
      Example: 美元定存六個月期的利率是多少 ("What is the interest rate on a six-month USD time deposit?")
      • Intent: 查詢利率 (interest rate inquiry)
      • Entities: 幣別 (currency) = 美元 (USD); 帳戶類型 (account type) = 定存 (time deposit); 期數 (term) = 六個月 (six months)
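
To make the NLU target on slide 9 concrete, here is a minimal sketch of how the annotated example could be represented in code. The class names, the `annotate` helper, and the character-offset fields are assumptions made for illustration; the slides do not describe the team's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str   # entity type, e.g. "幣別" (currency)
    value: str  # surface text found in the utterance
    start: int  # character offset where the value begins
    end: int    # character offset just past the value


@dataclass
class NLUResult:
    text: str
    intent: str
    entities: list = field(default_factory=list)


def annotate(text, intent, spans):
    """Build an NLUResult by locating each (name, value) pair in the text."""
    entities = []
    for name, value in spans:
        start = text.index(value)  # raises ValueError if the value is absent
        entities.append(Entity(name, value, start, start + len(value)))
    return NLUResult(text, intent, entities)


# The example utterance, intent, and entities from slide 9.
result = annotate(
    "美元定存六個月期的利率是多少",
    "查詢利率",
    [("幣別", "美元"), ("帳戶類型", "定存"), ("期數", "六個月")],
)
for entity in result.entities:
    print(entity.name, entity.value, entity.start, entity.end)
```

Intent recognition fills the single `intent` field for the whole utterance, while named entity recognition fills the list of spans, which is why the talk treats them as two separate learning problems.
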
  10. Outline: Background, Proposed Solution, Evaluation, Prototype, Conclusion
  11. Key Components in NLU
      Approach
      • Intent Recognizer: a classification problem
      • Named Entity Extractor: a sequence labeling problem
      Components
      • Preprocessing: tokenizer, POS tagger
      • Embeddings: vectorization
      • Modeling: supervised learning methods, namely Deep Neural Networks (DNN), Conditional Random Field (CRF), and Recurrent Neural Network (RNN)
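  12. Data Preparation
      • Intent dataset: 1016 samples over 3 distinct classes
        • 試算匯兌 (currency exchange calculation), 查詢存款利率 (deposit interest rate inquiry), 查詢台外幣餘額 (TWD/foreign-currency balance inquiry)
      • Named entity dataset: 977 samples over 6 distinct entities
        • amount, money, duration, currency, acnt_type, timestamp
      Special thanks to 數位金融處 (Digital Finance Division) and 個金數位營運處 (Retail Banking Digital Operations Division)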
  13. Intent Classification Techniques
      Feature engineering
      • Preprocessing: tokenization (ckiptagger)
      • Feature extraction: Bag of Words (scikit-learn), a word count encoding
      Model training
      • Model: Deep Neural Network (DNN) (tensorflow)
      Example
      • Text: 現在美金一年期定存是多少
      • Tokens: 現在 美金 一年期 定存 是 多少
      • Vocabulary: [ "現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少" ]
      • Feature vector: [ 1, 0, 1, 0, 1, 1 ]
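
The word count encoding on slide 13 can be sketched in a few lines of plain Python. The slide names scikit-learn for this step; the version below uses only the standard library so the mechanics stay visible, and the function name `bag_of_words` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Vocabulary and tokens copied from slide 13.
VOCABULARY = ["現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少"]
tokens = ["現在", "美金", "一年期", "定存", "是", "多少"]


def bag_of_words(tokens, vocabulary=VOCABULARY):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vector = bag_of_words(tokens)
print(vector)
```

With the eight-word vocabulary the vector has eight entries, and its first six agree with the six values shown on the slide. Word order is discarded by this encoding, which is acceptable for intent classification because the label applies to the utterance as a whole.
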
  14. Named Entity Recognition Techniques. Model I: CRF for Word-Level Feature
      Feature engineering
      • Preprocessing: tokenization (ckiptagger), POS tagging (ckiptagger)
      • Feature extraction: text and POS tags within context; context window: 3 tokens
      Model training
      • Model: Conditional Random Field (CRF) (scikit-learn)
      Example
      • Text: 現在美金一年期定存是多少
      • Tokens: 現在(Nd) 美金(Na) 一年期(Na) 定存(Na) 是(SHI) 多少(Neqa)
      • Feature vector: …, ( -1:現在, -1:Nd, 0:美金, 0:Na, 1:一年期, 1:Na ), …
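
The context-window features on slide 14 can be sketched as follows. The key format (`"-1:word"`, `"-1:pos"`) and the function name `token_features` are assumptions chosen to mirror the slide's notation; a list of dictionaries like this is the kind of input CRF toolkits such as sklearn-crfsuite accept, although the slides do not say which CRF implementation was used.

```python
def token_features(tagged, i):
    """Collect word and POS features for token i and its two neighbours."""
    features = {}
    for offset in (-1, 0, 1):  # a context window of 3 tokens
        j = i + offset
        if 0 <= j < len(tagged):  # skip positions outside the sentence
            word, pos = tagged[j]
            features[f"{offset}:word"] = word
            features[f"{offset}:pos"] = pos
    return features


# Tokens and POS tags copied from slide 14.
tagged = [
    ("現在", "Nd"), ("美金", "Na"), ("一年期", "Na"),
    ("定存", "Na"), ("是", "SHI"), ("多少", "Neqa"),
]
sentence_features = [token_features(tagged, i) for i in range(len(tagged))]
print(sentence_features[1])
```

The features for 美金 (index 1) reproduce the tuple shown on the slide. Tokens at the sentence boundary simply have fewer keys, which a CRF handles without special casing.
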
  15. Named Entity Recognition Techniques. Model II: Bi-LSTM-CRF for Word-Level Embedding
      • Preprocessing: tokenization (ckiptagger)
      • Model
        • Embedding layer (keras): embedding learning
        • Long Short-Term Memory (LSTM) layer (keras): feature learning
        • CRF layer (keras): model training
      Example
      • Text: 現在美金一年期定存是多少
      • Tokens: 現在 美金 一年期 定存 是 多少
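
In a Bi-LSTM-CRF, the LSTM layers produce a score per tag for every token, and the CRF layer adds tag-to-tag transition scores so that the best whole sequence is chosen rather than the best tag per token. The sketch below shows that decoding step (Viterbi) in plain Python. The three tags and every score are made-up values for illustration only; they are not taken from the talk or from a trained model.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][k]: score of tag k at token t (what the Bi-LSTM would output).
    transitions[i][j]: score of moving from tag i to tag j (learned by the CRF).
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(range(num_tags), key=lambda j: scores[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B", "I"]  # hypothetical outside/begin/inside tags
emissions = [[2.0, 0.5, 0.1], [0.2, 1.0, 1.1], [1.5, 0.1, 0.3]]
transitions = [
    [0.0, 0.0, -10.0],  # from O: moving straight to I is heavily penalised
    [0.0, 0.0, 1.0],    # from B
    [0.0, 0.0, 0.5],    # from I
]
path = viterbi_decode(emissions, transitions)
greedy = [max(range(len(TAGS)), key=lambda k: row[k]) for row in emissions]
print([TAGS[k] for k in path], [TAGS[k] for k in greedy])
```

Picking the top tag per token independently yields O, I, O, where an inside tag follows an outside tag. The transition scores steer the decoder to O, B, O instead, which is the kind of sequence-level consistency the CRF layer contributes.
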
  16. Outline: Background, Proposed Solution, Evaluation, Prototype, Conclusion
  17. Evaluation Methodology
      Confusion matrix (reference: https://en.wikipedia.org/wiki/Confusion_matrix)
                       Actual Yes             Actual No
        Predicted Yes  True Positive (TP)     False Positive (FP)
        Predicted No   False Negative (FN)    True Negative (TN)
      Metrics
      • Precision: $\text{Precision} = \frac{TP}{TP + FP}$
      • Recall: $\text{Recall} = \frac{TP}{TP + FN}$
      • F1-Score: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
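
The three metrics on slide 17 translate directly into code. This is a minimal sketch with hypothetical counts; returning 0.0 when a denominator is zero is a convention assumed here, not something the slides specify.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical confusion-matrix counts for one class.
tp, fp, fn = 45, 5, 15
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
print(round(p, 2), round(r, 2), round(f1, 2))
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, so a model cannot score well by maximising precision at the expense of recall or the reverse.
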
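  18. Evaluation: intent classification, precision and recall
      • 查詢台外幣餘額 (TWD/foreign-currency balance inquiry): Precision 0.91, Recall 0.94, F1-Score 0.93
      • 查詢存款利率 (deposit interest rate inquiry): Precision 0.98, Recall 0.95, F1-Score 0.96
      • 試算匯兌 (currency exchange calculation): Precision 0.97, Recall 0.96, F1-Score 0.96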
  19. Evaluation: Named Entity Recognition, precision
        Entity                    CRF    BiLSTM+CRF
        幣別 (currency)           0.79   0.98
        期數 (duration)           0.75   0.93
        時間點 (timestamp)        0.85   0.80
        帳戶類型 (account type)   0.74   0.89
        錢 (money)                0.55   0.81
        金額 (amount)             0.90   0.96
  20. Evaluation: Named Entity Recognition, recall
        Entity                    CRF    BiLSTM+CRF
        幣別 (currency)           0.82   0.95
        期數 (duration)           0.55   0.67
        時間點 (timestamp)        0.78   0.79
        帳戶類型 (account type)   0.67   0.80
        錢 (money)                0.52   0.89
        金額 (amount)             0.94   0.72
  21. Evaluation: Named Entity Recognition, F1-Score
        Entity                    CRF    BiLSTM+CRF
        幣別 (currency)           0.81   0.97
        期數 (duration)           0.64   0.71
        時間點 (timestamp)        0.82   0.72
        帳戶類型 (account type)   0.68   0.84
        錢 (money)                0.52   0.88
        金額 (amount)             0.92   0.82
  22. Outline: Background, Proposed Solution, Evaluation, Prototype, Conclusion
  23. Prototype: conversational AI with the Rasa framework (https://github.com/RasaHQ/rasa), NLU component
  24. Prototype: Why Rasa?
      Rasa characteristics
      • Own our data: preserve privacy; do not hand data over to big tech companies
      • Open source: transparency; community support
      • Extensible architecture: task-oriented dialogue architecture; customizable components
      CTBC strategy
      • Customize Mandarin-based components
      • Integration with core technology
      • Compliance with security and regulation
      • Customized scenarios
      • Ownership of core technology
  25. Prototype
      • Intent recognition
        • CKIP Tokenizer (customized)
        • EmbeddingIntentClassifier (built-in)
      • Named Entity Recognition
        • CKIP Tokenizer (customized)
        • Bi-LSTM-CRF for Word-Level Embedding (customized)
  26. Prototype: Demo
  27. Outline: Background, Proposed Solution, Evaluation, Prototype, Conclusion
  28. Conclusion: Summary
      • NLU is a key module in task-oriented dialogue systems
      • An intent recognizer and an entity extractor are the key components for realizing NLU with machine learning techniques and annotated data
      • DNNs generally perform better than traditional methods, but not on every task
      • Rasa, powered by open source, offers a framework for developing conversational assistants from scratch
  29. Conclusion: What's next
      • Transfer learning based on initialization with pre-trained word embeddings
      • Word-based embeddings vs. char-based embeddings
      • Model engineering
  30. Q&A
