Building a Chinese Natural Language Understanding Engine for Financial Scenarios

Data R&D Center
陳皓遠

About me
• Member of the AI group, CTBC Data R&D Center
• Past experience in
  • Cyber security and defense industry
  • Smartphone industry
• Familiar with
  • Machine learning
  • Natural language processing
  • Software development
  • Cloud native architecture design

Team
• The CTBC Data R&D Center AI group was founded in 2018
• The AI group is composed of data scientists and software developers
• Our mission is to deliver AI-based solutions for banking scenarios
• We currently focus on
  • Computer Vision (CV)
  • Natural Language Processing (NLP)
Retrieved from https://www.ithome.com.tw/news/131697

Achievement
NLP
• Pluto: A Deep Learning based Watchdog for Anti Money Laundering
  • First vertical AI paradigm in the RegTech field across CTBC globally
  • Reduces daily human effort on adverse media screening by 67%
  • Publication: https://www.aclweb.org/anthology/W19-5515
CV
• NIST Face Recognition Vendor Test (FRVT)
• Rank 35th globally
• Rank 2nd in Taiwan industry
• X-ATM for fraud avoidance
Rank  Company                        Country  FRR
10    SenseTime (商湯)                China    0.0092
18    Face++ (曠視)                   China    0.0145
26    CyberLink (訊連)                Taiwan   0.0195
29    Tencent Deepsea (騰訊)          China    0.0215
35    CTBC BANK (中國信託)            Taiwan   0.0250
39    Gorilla Technology (大猩猩)     Taiwan   0.0291
55    Kneron Inc. (耐能)              Taiwan   0.0902

Outline
• Background
• Proposed Solution
• Evaluation
• Prototype
• Conclusion

Digital channels play an important role
Global Views Monthly (遠見雜誌) - 2018 Digital Finance Survey (數位金融力調查)
Retrieved from https://www.gvm.com.tw/article.html?id=54981

Abundant Platforms for Conversational Assistants
• Messaging platforms
• Google Home
• Amazon Echo

Eno, your Capital One dialogue assistant
• A task-oriented dialogue system
• Chats in natural language
• Available on Amazon Alexa

Motivation
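• Build a task-oriented dialogue system in Mandarin that runs on heterogeneous conversational platforms and serves customers in banking scenarios
Prerequisite
• A natural language understanding (NLU) module, made up of
  • Intent recognition (IR)
  • Named entity recognition (NER)
Example: 美元定存六個月期的利率是多少 ("What is the interest rate on a six-month USD time deposit?")
• Intent
  • 查詢利率 (interest rate inquiry)
• Entity
  • 幣別 (currency): 美元 (USD)
  • 帳戶類型 (account type): 定存 (time deposit)
  • 期數 (term): 六個月 (six months)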
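The example above can be written down as a small data structure. This is a minimal sketch with the result filled in by hand: `NLUResult` and `entity_spans` are hypothetical names for illustration, not part of the engine described in these slides, and no model is run.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class NLUResult:
    """Hypothetical container for one parsed utterance: one intent, many entities."""
    text: str
    intent: str
    entities: Dict[str, str] = field(default_factory=dict)


# The slide's example, written out by hand.
result = NLUResult(
    text="美元定存六個月期的利率是多少",
    intent="查詢利率",
    entities={"幣別": "美元", "帳戶類型": "定存", "期數": "六個月"},
)


def entity_spans(res: NLUResult) -> Dict[str, Optional[Tuple[int, int]]]:
    """Locate each entity value in the text as (start, end) character offsets."""
    spans = {}
    for name, value in res.entities.items():
        start = res.text.find(value)
        spans[name] = (start, start + len(value)) if start >= 0 else None
    return spans


print(result.intent)
print(entity_spans(result))
```

The intent is a single label for the whole utterance, while each entity is tied to a span of characters. That difference is why the two sub-tasks are modeled differently.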

Key Components in NLU
Approach
• Preprocessing
  • Tokenizer
  • POS tagger
• Vectorization
  • Embeddings
• Modeling (supervised learning method)
  • Deep Neural Networks (DNN)
  • Conditional Random Field (CRF)
  • Recurrent Neural Network (RNN)
• Intent Recognizer
  • Classification problem
• Named Entity Extractor
  • Sequence labeling problem
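Sequence labeling assigns one label per token, whereas classification assigns one label per utterance. The sketch below assumes the common BIO tagging scheme; the slides do not state which scheme was used, and the tokenization and the `to_bio` helper are illustrative only.

```python
def to_bio(tokens, entities):
    """Turn entity spans into one BIO tag per token.

    entities: list of (start_token, end_token_exclusive, label) tuples.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags


# Hypothetical tokenization of one utterance (not produced by a real tokenizer).
tokens = ["美元", "定存", "六", "個", "月", "期", "的", "利率", "是", "多少"]
entities = [(0, 1, "currency"), (1, 2, "acnt_type"), (2, 5, "duration")]

tags = to_bio(tokens, entities)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

A sequence labeler is trained to predict `tags` from `tokens`. An intent classifier would predict a single class for the same token list.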

Data Preparation
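• Intent dataset
  • 1016 samples over 3 distinct classes
  • 試算匯兌 (currency exchange calculation), 查詢存款利率 (deposit interest rate inquiry), 查詢台外幣餘額 (TWD and foreign-currency balance inquiry)
• Named entity dataset
  • 977 samples over 6 distinct entities
  • amount, money, duration, currency, acnt_type, timestamp
Special thanks to 數位金融處 (Digital Finance Division) and 個金數位營運處 (Retail Banking Digital Operations Division)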

Intent Classification Techniques
• Preprocessing
  • Tokenization (ckiptagger)
• Feature extraction (feature engineering)
  • Bag of Words with word-count encoding (scikit-learn)
• Model (model training)
  • Deep Neural Network (DNN) (tensorflow)
Example
• Text: 現在美金一年期定存是多少 ("How much is the one-year USD time deposit now?")
• Tokens: 現在 / 美金 / 一年期 / 定存 / 是 / 多少
• Vocabulary: [“現在”, “台幣”, “美金”, “日圓”, “一年期”, “定存”, “是”, “多少”]
• Feature vector: [1, 0, 1, 0, 1, 1, 1, 1]
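Word-count encoding produces one count per vocabulary entry, in vocabulary order. The slide names scikit-learn for this step; the sketch below uses plain Python instead so that it stays self-contained, and `count_encode` is a hypothetical helper name.

```python
from collections import Counter

# Vocabulary and tokens taken from the slide's example.
vocabulary = ["現在", "台幣", "美金", "日圓", "一年期", "定存", "是", "多少"]
tokens = ["現在", "美金", "一年期", "定存", "是", "多少"]


def count_encode(tokens, vocabulary):
    """Return how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


feature_vector = count_encode(tokens, vocabulary)
print(feature_vector)  # [1, 0, 1, 0, 1, 1, 1, 1]
```

Words outside the vocabulary are dropped and word order is lost. That is acceptable for classifying a whole utterance but not for locating entities.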

Named Entity Recognition Techniques
Model I: CRF for Word-Level Features
• Preprocessing
  • POS tagging (ckiptagger)
• Feature extraction (feature engineering)
  • Text and POS tags within context
  • Context window: 3 tokens
• Model (model training)
  • Conditional Random Field (CRF) (scikit-learn)
Example
• Tokens: 現在(Nd) 美金(Na) 一年期(Na) 定存(Na) 是(SHI) 多少(Neqa)
• Feature vector: …, ( -1:現在, -1:Nd, 0:美金, 0:Na, 1:一年期, 1:Na ), …
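The context-window features can be sketched as one dictionary per token, keyed by relative position. The function name, the key format, and the padding marker are assumptions for illustration; the slides only show the window contents.

```python
def window_features(tokens, pos_tags, i, half_window=1):
    """Collect word and POS features for token i and its neighbours.

    half_window=1 gives a 3-token window: previous, current, next.
    """
    feats = {}
    for offset in range(-half_window, half_window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"{offset}:word"] = tokens[j]
            feats[f"{offset}:pos"] = pos_tags[j]
        else:
            # Position falls outside the sentence (assumed padding marker).
            feats[f"{offset}:pad"] = True
    return feats


# Tokens and POS tags from the slide's example.
tokens = ["現在", "美金", "一年期", "定存", "是", "多少"]
pos_tags = ["Nd", "Na", "Na", "Na", "SHI", "Neqa"]

sentence_features = [window_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(sentence_features[1])
```

A list of such dictionaries, one per token, is a common input shape for CRF toolkits, paired with one entity label per token as the training target.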

Named Entity Recognition Techniques
Model II: Bi-LSTM-CRF for Word-Level Embedding
• Preprocessing
  • Text to tokens
• Model
  • Embedding layer (keras)
  • Long Short-Term Memory (LSTM) layer (keras)
  • CRF layer (keras)
• Stages: embedding learning, feature learning, model training
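At inference time a CRF layer picks the highest-scoring tag sequence as a whole, rather than the best tag for each token on its own. A minimal Viterbi sketch in plain Python follows. All scores are made up for illustration; in a Bi-LSTM-CRF the emission scores would come from the Bi-LSTM and the transition scores would be learned by the CRF layer.

```python
def viterbi(emissions, transitions):
    """Return the best tag-index path and its score.

    emissions[t][j]:   score of tag j at token t
    transitions[i][j]: score of moving from tag i to tag j
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    best_last = max(range(n_tags), key=lambda j: score[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[best_last]


tags = ["O", "B", "I"]
# Made-up scores for a 3-token sentence.
emissions = [[2.0, 1.0, 0.0],
             [0.0, 1.0, 1.5],
             [0.5, 0.0, 1.0]]
# O -> I is heavily penalized because an entity cannot start with "I".
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 0.0]]

path, best_score = viterbi(emissions, transitions)
greedy = [max(range(len(tags)), key=lambda j: e[j]) for e in emissions]
print([tags[j] for j in path])    # ['O', 'B', 'I']
print([tags[j] for j in greedy])  # ['O', 'I', 'I']
```

The per-token choice yields the invalid sequence O, I, I, while the sequence-level decoder returns O, B, I.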

Evaluation
Methodology
Confusion Matrix
               Actual Yes            Actual No
Predicted Yes  True Positive (TP)    False Positive (FP)
Predicted No   False Negative (FN)   True Negative (TN)
Reference: https://en.wikipedia.org/wiki/Confusion_matrix
Metrics
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
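The three metrics follow directly from the confusion-matrix counts. A minimal sketch with made-up counts (not taken from the evaluation in these slides); the zero-denominator handling is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts: 3 true positives, 1 false positive, 3 false negatives.
precision, recall, f1 = precision_recall_f1(tp=3, fp=1, fn=3)
print(precision, recall, round(f1, 2))  # 0.75 0.5 0.6
```

True negatives do not appear in any of the three formulas, which makes these metrics suitable when most tokens carry no entity.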

Evaluation
Precision and Recall
Intent classification
Intent                                                Precision  Recall  F1-Score
查詢台外幣餘額 (TWD/foreign-currency balance inquiry)  0.91       0.94    0.93
查詢存款利率 (deposit interest rate inquiry)           0.98       0.95    0.96
試算匯兌 (currency exchange calculation)               0.97       0.96    0.96

Evaluation
Precision
Named Entity Recognition
Entity               CRF   BiLSTM+CRF
currency (幣別)      0.79  0.98
duration (期數)      0.75  0.93
timestamp (時間點)   0.85  0.80
acnt_type (帳戶類型) 0.74  0.89
money (錢)           0.55  0.81
amount (金額)        0.90  0.96

Evaluation
Recall
Entity      CRF   BiLSTM+CRF
currency    0.82  0.95
duration    0.55  0.67
timestamp   0.78  0.79
acnt_type   0.67  0.80
money       0.52  0.89
amount      0.94  0.72

Evaluation
F1-Score
Entity      CRF   BiLSTM+CRF
currency    0.81  0.97
duration    0.64  0.71
timestamp   0.82  0.72
acnt_type   0.68  0.84
money       0.52  0.88
amount      0.92  0.82

Prototype
Conversational AI with Rasa framework: https://github.com/RasaHQ/rasa
NLU

Prototype
Why Rasa?
Rasa characteristics
• Own our data
  • Preserve privacy
  • Do not hand data over to big tech companies
• Open source
  • Transparency
  • Community support
• Extensible architecture
  • Task-oriented dialogue architecture
  • Customizable components
CTBC strategy
• Customize Mandarin-based components
• Integration with core technology
• Compliance with security and regulation
• Customized scenarios
• Ownership of core technology

Prototype
• Intent recognition
• CKIP Tokenizer (customized)
• EmbeddingIntentClassifier (built-in)
• Named Entity Recognition
• CKIP Tokenizer (customized)
• Bi-LSTM-CRF for Word-Level Embedding (customized)

Conclusion
Summary
• NLU is a key module in task-oriented dialogue systems
• Intent recognizers and entity extractors are the key components of NLU, built with machine learning techniques and annotated data
• DNN-based models generally outperform the traditional method, but not on every task
• Rasa, powered by open source, offers a framework for developing conversational assistants from scratch

Conclusion
What’s next
• Transfer learning by initializing with pre-trained word embeddings
• Word-based embeddings vs. char-based embeddings
• Model engineering
