Building a Chinese Natural Language Understanding Engine for Financial Scenarios
Data R&D Center (數據研究發展中心)
陳皓遠
About me
• Member of AI group, CTBC Data R&D Center
• Past experience in
• Cyber security and defense industry
• Smartphone industry
• Familiar with
• Machine learning
• Natural language processing
• Software development
• Cloud native architecture design
Team
• CTBC Data R&D Center AI group was founded in 2018
• The AI group is composed of data scientists and software developers
• Our mission is to deliver AI-based solutions for banking scenarios
• We currently focus on
• Computer Vision (CV)
• Natural Language Processing (NLP)
Retrieved from https://www.ithome.com.tw/news/131697
Achievement
NLP
• Pluto: A Deep Learning based Watchdog for Anti Money Laundering
• First vertical AI paradigm in the RegTech field across CTBC globally
• Reduces daily human effort on adverse media screening by 67%
• Publication
• https://www.aclweb.org/anthology/W19-5515
CV
• NIST Face Recognition Vendor Test (FRVT)
• Rank 35th globally
• Rank 2nd in Taiwan industry
• X-ATM for fraud avoidance
FRVT ranking excerpt:
Rank  Company                      Country  FRR
10    SenseTime (商湯)             China    0.0092
18    Face++ (曠視)                China    0.0145
26    CyberLink (訊連)             Taiwan   0.0195
29    Tencent Deepsea (騰訊)       China    0.0215
35    CTBC BANK (中國信託)         Taiwan   0.0250
39    Gorilla Technology (大猩猩)  Taiwan   0.0291
55    Kneron Inc. (耐能)           Taiwan   0.0902
Outline
• Background
• Proposed Solution
• Evaluation
• Prototype
• Conclusion
Digital channels play an important role
Global Views Monthly (遠見雜誌), 2018 Digital Finance Survey (2018數位金融力調查)
Retrieved from https://www.gvm.com.tw/article.html?id=54981
Abundant Platforms for Conversational Assistants
• Messaging platforms
• Google Home
• Amazon Echo
Eno, your Capital One dialogue assistant
• A task-oriented dialogue system
• Chats in natural language
• Available on Amazon Alexa
Motivation
• Build a task-oriented dialogue system in Mandarin, deployable across heterogeneous conversational platforms, to serve customers in banking scenarios
Prerequisite
• A natural language understanding (NLU) module, made up of
• Intent recognition (IR)
• Named entity recognition (NER)
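Example: 美元定存六個月期的利率是多少 (What is the interest rate on a six-month USD time deposit?)
• Intent
• 查詢利率 (query interest rate)
• Entity
• 幣別 (currency): 美元 (USD)
• 帳戶類型 (account type): 定存 (time deposit)
• 期數 (duration): 六個月 (six months)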
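The target output of the NLU module for the example utterance can be sketched as a small data structure. This is only an illustration: the class name `NluResult` and its fields are assumptions made here, not something the slides or any library define.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NluResult:
    """Hypothetical container for what an NLU module returns."""
    text: str                      # the raw user utterance
    intent: str                    # output of intent recognition (IR)
    entities: Dict[str, str] = field(default_factory=dict)  # output of NER


# The example utterance from the slide, annotated by hand.
example = NluResult(
    text="美元定存六個月期的利率是多少",
    intent="查詢利率",
    entities={"幣別": "美元", "帳戶類型": "定存", "期數": "六個月"},
)

# Sanity check: every entity value is a span of the original utterance.
spans_ok = all(value in example.text for value in example.entities.values())
```

A dialogue manager would then branch on `intent` and fill its slots from `entities`.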
Key Components in NLU
Approach
• Intent Recognizer: a classification problem
• Named Entity Extractor: a sequence labeling problem
Preprocessing
• Tokenizer
• POS tagger
Vectorization
• Embeddings
Modeling (supervised learning methods)
• Deep Neural Networks (DNN)
• Conditional Random Field (CRF)
• Recurrent Neural Network (RNN)
Data Preparation
• Intent dataset
• 1016 samples over 3 distinct classes
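• 試算匯兌 (currency exchange calculation), 查詢存款利率 (deposit interest rate inquiry), 查詢台外幣餘額 (TWD and foreign-currency balance inquiry)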
• Named entity dataset
• 977 samples over 6 distinct entities
• amount, money, duration, currency, acnt_type, timestamp
Special thanks to 數位金融處 (Digital Finance Division) and 個金數位營運處 (Retail Banking Digital Operations Division)
Intent Classification Techniques
Feature engineering
• Preprocessing
• Tokenization (ckiptagger)
• Feature extraction
• Bag of Words (scikit-learn), word count encoding
Model training
• Model
• Deep Neural Network (DNN) (tensorflow)
Example
Text: 現在美金一年期定存是多少 (What is the current rate on a one-year USD time deposit?)
Tokens: 現在 美金 一年期 定存 是 多少
Vocabulary: [ "現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少" ]
Feature vector: [ 1, 0, 1, 0, 1, 1, 1, 1 ]
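The word count encoding can be sketched in a few lines. The slide names scikit-learn for this step; the version below uses only the standard library so that it stays self-contained, and the function name `bag_of_words` is an assumption made for illustration.

```python
from collections import Counter
from typing import List


def bag_of_words(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear in the tokens.
    return [counts[word] for word in vocabulary]


# Vocabulary and tokens taken from the slide's example.
vocabulary = ["現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少"]
tokens = ["現在", "美金", "一年期", "定存", "是", "多少"]

features = bag_of_words(tokens, vocabulary)
print(features)  # [1, 0, 1, 0, 1, 1, 1, 1]
```

The vector has one position per vocabulary word, so "台幣" and "日圓", which do not occur in the utterance, contribute zeros. Tokens outside the vocabulary are simply ignored.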
Named Entity Recognition Techniques
Model I: CRF for Word-Level Features
Feature engineering
• Preprocessing
• Tokenization (ckiptagger)
• POS tagging (ckiptagger)
• Feature extraction
• Text and POS tags within context
Model training
• Model
• Conditional Random Field (CRF) (scikit-learn)
Example
Text: 現在美金一年期定存是多少
Tokens: 現在(Nd) 美金(Na) 一年期(Na) 定存(Na) 是(SHI) 多少(Neqa)
Feature vector: …, ( -1:現在, -1:Nd, 0:美金, 0:Na, 1:一年期, 1:Na ), …
Context window: 3 tokens
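A minimal sketch of the context-window feature extraction follows. The function name `token_features` and the key format (`"-1:word"`, `"-1:pos"`) are assumptions chosen to mirror the slide's notation; a list of per-token feature dictionaries is a common input format for CRF toolkits, but the exact format used in the talk is not stated.

```python
from typing import Dict, List, Tuple


def token_features(
    sentence: List[Tuple[str, str]], i: int, window: int = 1
) -> Dict[str, str]:
    """Collect word and POS tag for token i and its neighbours.

    window=1 means one token on each side, i.e. a 3-token context window.
    Neighbours that fall outside the sentence are skipped.
    """
    features: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(sentence):
            word, pos = sentence[j]
            features[f"{offset}:word"] = word
            features[f"{offset}:pos"] = pos
    return features


# (token, POS tag) pairs from the slide's example.
tagged = [
    ("現在", "Nd"), ("美金", "Na"), ("一年期", "Na"),
    ("定存", "Na"), ("是", "SHI"), ("多少", "Neqa"),
]

# One feature dict per token; this list is what a CRF would be trained on.
sentence_features = [token_features(tagged, i) for i in range(len(tagged))]
```

For the token 美金 (index 1) this yields the previous word 現在/Nd, the token itself with tag Na, and the next word 一年期/Na, matching the feature tuple shown on the slide.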
Named Entity Recognition Techniques
Model II: Bi-LSTM-CRF for Word-Level Embedding
• Preprocessing
• Tokenization (ckiptagger)
• Model
• Embedding layer (keras)
• Long Short-Term Memory (LSTM) layer (keras)
• CRF layer (keras)
• Stages: embedding learning, feature learning, model training
Example
Text: 現在美金一年期定存是多少
Tokens: 現在 美金 一年期 定存 是 多少
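Before tokens reach an embedding layer they have to be turned into integer indices of a fixed length, since that is what a Keras-style embedding layer consumes. The slides do not show this step, so the sketch below is an assumption about how it might look; the helper names `build_vocab` and `encode`, and the `<PAD>`/`<UNK>` markers, are hypothetical.

```python
from typing import Dict, List

PAD, UNK = "<PAD>", "<UNK>"  # reserved entries: padding and unknown words


def build_vocab(sentences: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an index, in order of first appearance."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in sentences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to indices, truncate to max_len, then right-pad with PAD."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


train = [["現在", "美金", "一年期", "定存", "是", "多少"]]
vocab = build_vocab(train)

encoded = encode(train[0], vocab, max_len=8)
unseen = encode(["現在", "歐元"], vocab, max_len=4)  # 歐元 is not in the vocabulary
```

Here `encoded` is a length-8 index sequence padded with zeros, and the out-of-vocabulary token in `unseen` falls back to the `<UNK>` index.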
Evaluation
Methodology
Confusion Matrix
                 Actual Yes            Actual No
Predicted Yes    True Positive (TP)    False Positive (FP)
Predicted No     False Negative (FN)   True Negative (TN)
Reference: https://en.wikipedia.org/wiki/Confusion_matrix
Metrics
• $\text{Precision} = \frac{TP}{TP + FP}$
• $\text{Recall} = \frac{TP}{TP + FN}$
• $\text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
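The three metrics follow directly from the confusion-matrix counts. The sketch below computes them for a single class; the counts in the usage lines are made-up numbers for illustration only and are not results from the talk.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that are found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one class.
tp, fp, fn = 90, 10, 30

p = precision(tp, fp)   # 90 / 100
r = recall(tp, fn)      # 90 / 120
f1 = f1_score(p, r)
```

The zero-denominator guards return 0.0 by convention, which is one common choice when a class is never predicted or never occurs.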
Evaluation
Intent classification: Precision, Recall and F1-Score
Intent            Precision  Recall  F1-Score
查詢台外幣餘額    0.91       0.94    0.93
查詢存款利率      0.98       0.95    0.96
試算匯兌          0.97       0.96    0.96
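Evaluation
Named Entity Recognition: Precision
[Bar chart: precision per entity type, 幣別 (currency), 期數 (duration), 時間點 (timestamp), 帳戶類型 (account type), 錢 (money), 金額 (amount), for CRF vs. BiLSTM+CRF. Data labels as extracted: 0.79, 0.75, 0.85, 0.74, 0.55, 0.90, 0.98, 0.93, 0.80, 0.89, 0.81, 0.96]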
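Evaluation
Named Entity Recognition: Recall
[Bar chart: recall per entity type, 幣別 (currency), 期數 (duration), 時間點 (timestamp), 帳戶類型 (account type), 錢 (money), 金額 (amount), for CRF vs. BiLSTM+CRF. Data labels as extracted: 0.82, 0.55, 0.78, 0.67, 0.52, 0.94, 0.95, 0.67, 0.79, 0.80, 0.89, 0.72]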
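Evaluation
Named Entity Recognition: F1-Score
[Bar chart: F1-score per entity type, 幣別 (currency), 期數 (duration), 時間點 (timestamp), 帳戶類型 (account type), 錢 (money), 金額 (amount), for CRF vs. BiLSTM+CRF. Data labels as extracted: 0.81, 0.64, 0.82, 0.68, 0.52, 0.92, 0.97, 0.71, 0.72, 0.84, 0.88, 0.82]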
Prototype
Conversational AI built on the Rasa framework: https://github.com/RasaHQ/rasa
Prototype
Why Rasa?
Rasa characteristics
• Own our data: preserve privacy; do not hand data over to big tech companies
• Open source: transparency; community support
• Extensible architecture: task-oriented dialogue architecture; customizable components
CTBC strategy
• Customize Mandarin-based components
• Integration with core technology
• Compliance with security and regulation
• Customized scenarios
• Ownership of core technology
Prototype
• Intent recognition
• CKIP Tokenizer (customized)
• EmbeddingIntentClassifier (built-in)
• Named Entity Recognition
• CKIP Tokenizer (customized)
• Bi-LSTM-CRF for Word-Level Embedding (customized)
Prototype
Demo
Conclusion
Summary
• NLU is a key module in task-oriented dialogue systems
• The intent recognizer and the entity extractor are the key components of NLU, built with machine learning techniques and annotated data
• DNN-based models generally perform better than traditional methods, but not on all tasks
• Rasa, an open-source framework, supports developing conversational assistants from scratch
Conclusion
What's next
• Transfer learning: initialize with pre-trained word embeddings
• Word-based embeddings vs. char-based embeddings
• Model engineering
Q&A
