[2018 Taiwan AI Academy Alumni Annual Meeting] Textual Data Analytics in Finance / Chuan-Ju Wang (王釧茹)

Talk @ Taiwan AI Academy, November 17, 2018
Textual Data Analytics in Finance
Dr. Chuan-Ju Wang (王釧茹)
Research Center for Information Technology
Innovation, Academia Sinica
Computational Finance and Data Analytics
Laboratory (CFDA Lab)
http://cfda.csie.org

Quant — Data Scientist
Source: http://www.indeed.com/jobtrends
Source: http://www.computerweekly.com/blogs/Data-Matters/2014/06/data-scientist-the-new-quant.html

Data Science in Finance

Text Analytics
❖ Big Data
❖ Structured Data
❖ user logs, sensor logs, click through logs, …
❖ Unstructured Data
❖ web texts, user conversations, public opinions, reports, …
❖ Big Data for Unstructured Text – Text Analytics
❖ Goal — Turn text into data for analysis, via application of
natural language processing (NLP) and analytical methods
https://insidebigdata.com/2015/06/05/text-analytics-the-next-generation-of-big-data/
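
As a minimal illustration of turning text into data, the sketch below counts how often each word occurs in a document. The tokenizer (lowercasing and keeping alphabetic runs) and the sample sentence are assumptions made for this example, not part of the talk.

```python
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Hypothetical sentence, not taken from any real report.
doc = "The company may default on its debt. A default would hurt the company."
counts = term_counts(doc)
print(counts["default"], counts["company"])
```

The resulting counts form a simple bag-of-words representation that analytical methods can take as numeric input.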

Textual Sentiment Analysis for
Financial Risk Prediction
On the Risk Prediction and Analysis of Soft
Information in Finance Reports. European Journal of
Operational Research (EJOR), 257(1), 243-250, 2017.

Soft and Hard Information in Finance
❖ The growing amount of financial data makes it increasingly important
to learn how to discover valuable information for various financial
applications.
❖ In finance, there are typically two kinds of information:
❖ Soft information: text, including opinions, ideas, and market
commentary.
❖ Hard information: numerical values, such as financial measures and
historical prices.
❖ Our work aims to exploit soft information for financial risk prediction.

Risk Proxy: Stock Return Volatility
❖ Stock return
$R_t = \frac{S_t - S_{t-1}}{S_{t-1}}$
❖ Stock return volatility
❖ A common risk metric measured by the standard
deviation of returns over a period of time.
$v_{[t-n,t]} = \sqrt{\frac{\sum_{i=t-n}^{t} (R_i - \bar{R})^2}{n}}$, where $\bar{R} = \frac{\sum_{i=t-n}^{t} R_i}{n+1}$.
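
The two definitions on this slide can be sketched in a few lines of Python. The function names and the price series are hypothetical. The window is assumed to hold n + 1 returns, so the sum of squared deviations is divided by n, and the square root reflects the slide's description of volatility as a standard deviation.

```python
import math

def simple_returns(prices):
    """R_t = (S_t - S_{t-1}) / S_{t-1} for each consecutive pair of prices."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]

def volatility(returns):
    """Standard deviation of n + 1 returns, dividing by n."""
    n = len(returns) - 1
    mean = sum(returns) / (n + 1)
    return math.sqrt(sum((r - mean) ** 2 for r in returns) / n)

# Hypothetical prices: up 2%, down 2%, up 2%.
prices = [100.0, 102.0, 99.96, 101.9592]
returns = simple_returns(prices)
vol = volatility(returns)
print(returns, vol)
```

In practice the window would cover daily returns over a longer period, such as the twelve months before or after a report.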

Financial Sentiment Analysis
❖ In this work, we apply sentiment analysis to the
risk prediction task.
❖ A finance-specific sentiment lexicon is adopted for analysis.
❖ Two machine learning techniques are adopted for the task:
❖ Regression approach: Predict the stock return volatilities.
❖ Ranking approach: Rank companies according to
their relative risk levels.

Financial Sentiment Lexicon
❖ Words often have different meanings in finance than in general
usage, such as
❖ vice: immoral or wicked behavior (general usage)
❖ vice: secondary (in finance context)
❖ Almost three-fourths of the words in 10-K financial reports
from 1994 to 2008 that are identified as negative by the
widely used Harvard Psychosociological Dictionary are
typically not considered negative in financial contexts.

Six Finance-Specific Lexicons
❖ Loughran and McDonald (2011)
❖ When Is a Liability Not a Liability? Textual Analysis, Dictionaries,
and 10-Ks. Journal of Finance.

Problem Formulation
❖ Prediction target: Future stock return volatility (regression) and
future relative risk levels (ranking)
❖ Features
❖ Soft textual information: All words or financial sentiment words
❖ Hard numerical information: Each company's volatility over the
twelve months before the report
[Timeline: v(-12) covers 2005/3/22 to 2006/3/22 (the report filing date); v(+12) covers 2006/3/22 to 2007/3/22.]
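
A minimal sketch of how one report might be encoded under this formulation: one feature per lexicon word plus the past volatility v(-12) as the single hard feature. The function name, the tiny lexicon, the sample tokens, and the use of raw term frequency as the word weight are all assumptions made for illustration.

```python
from collections import Counter

def build_features(tokens, lexicon, past_volatility):
    """Return [tf(w) for each lexicon word] + [v(-12)] for one report."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against an empty report
    return [counts[w] / total for w in lexicon] + [past_volatility]

# Hypothetical lexicon and report text.
lexicon = ["default", "deficit", "profit"]
tokens = "default risk rose while profit fell and default loomed".split()
x = build_features(tokens, lexicon, past_volatility=0.03)
print(x)
```

A regression model would then be trained to map such vectors to v(+12), and a ranking model to order companies by their future risk levels.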

Corpora: The 10-K Corpus
❖ A Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC)
❖ Only Section 7, "Management's Discussion and Analysis of Financial Conditions and Results of Operations" (MD&A), is used
❖ The Sarbanes-Oxley Act of 2002 explains the drastic increase in report length during the 2002-2003 period

Experimental Results

Financial Sentiment Terms Analysis
[Figure: stemmed terms shown in two panels labeled SEN and ORG; legend: Fin-Neg, Fin-Pos, Fin-Lit, Fin-Unc, Non]
Terms shown: amend, deficit, forbear, delist, default, sureti, discontinu, wherebi, unabl, disput, concern, profit, violat, regain, uncomplet, accid, abl, integr, grantor, ceg, nasdaq, gnb, coven, waiver, excelsior, rais, ebix, shelbour, nplacement, syndic, pfc, stage, same, driver, small-cap, seri, hearth, awg, libert, special, benefici, sever, breach, doubt
Stems and the word forms they cover:
❖ deficit: deficit, deficits
❖ default: default, defaulted, defaulting, defaults
❖ delist: delist, delisted, delisting, delists
❖ amend: amend, amendable, amendatory, amended, amending, amendment, amendments, amends
❖ forbear: forbear, forbearance, forbearances, forbearing, forbears

FIN10K Prototype Demo
https://cfda.csie.org/10K/
FIN10K: A Web-based Information System for
Financial Report Analysis and Visualization.
ACM CIKM (Demo paper), 2016.

Financial Keyword Expansion via
Continuous Word Vector Representations
Discovering Finance Keywords via Continuous
Space Language Models. ACM Transactions on
Management Information Systems, 7(3), 7:1-7:17, 2016.

Sentiment Analysis — the Lexicon
❖ For sentiment analysis, the lexicon is one of the most
important and common resources.
❖ It usually has a great impact on the results and the
corresponding analyses.
❖ In finance, the lexicon is usually semi-manually generated.
❖ This results in limited word coverage.
❖ In this work, we use continuous space language models
to expand finance keywords automatically.

Continuous Space Language Models
❖ “You shall know a word by the company it keeps” 
(J. R. Firth 1957)
❖ One of the most successful ideas of modern statistical NLP!

Continuous Space Language Models
❖ Continuous space language models
❖ a.k.a. Continuous word embeddings
❖ Words are represented as low-dimensional dense vectors.
❖ Recent studies show their superiority in capturing
syntactic and contextual regularities in language.

Keyword Expansion
❖ Our Proposed Keyword Expansion Method
❖ Adapt this technique to incorporate syntactic
information and capture more keywords with similar
meanings.
❖ Learn vector representations of words from a large
collection of financial reports (domain-specific)
❖ Words in the financial sentiment lexicon are used as seed
words to retrieve the top N words by cosine similarity.
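
The seed-word expansion step can be sketched as a nearest-neighbor search under cosine similarity. The three-dimensional vectors below are made up for illustration. In the setting described above they would be embeddings learned from financial reports, and the syntactic information is not modeled in this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def expand(seed, vectors, top_n):
    """Return the top_n words most similar to seed, excluding seed itself."""
    scored = [(cosine(vectors[seed], vec), word)
              for word, vec in vectors.items() if word != seed]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

# Hypothetical embeddings, not learned from any corpus.
vectors = {
    "default": [0.9, 0.1, 0.0],
    "breach": [0.8, 0.2, 0.1],
    "delist": [0.7, 0.3, 0.0],
    "profit": [-0.1, 0.9, 0.2],
}
expanded = expand("default", vectors, top_n=2)
print(expanded)
```

Repeating this for every seed word in the lexicon and merging the results gives the expanded keyword set.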

Keyword Expansion
❖ Keyword Expansion with Syntactic Information

The New 10-K Corpus

Four Prediction Tasks
❖ Four prediction tasks are conducted
to demonstrate that our approach is effective at
discovering keywords with predictive power:
1) Post-event volatility
2) Stock volatility
3) Abnormal trading volume
4) Excess returns

Post-event Volatility Prediction

Beyond Word-Level Analysis
❖ Multi-word expression detection and analysis
❖ Beyond Word-Level to Sentence-Level Sentiment Analysis for
Financial Reports
❖ RiskFinder: A Sentence-level Risk Detector for Financial Reports,
NAACL’18
❖ https://cfda.csie.org/RiskFinder/
❖ FRIDAYS: A Financial Risk Information Detecting and Analyzing
System, AAAI’18
❖ https://cfda.csie.org/FRIDAYS/

Summary
❖ If structured data is big, then unstructured data is huge.
❖ 20% (structured) vs. 80% (unstructured)
❖ There is massive potential waiting to be leveraged in
the analysis of unstructured data in the field of finance.

Thank You for Listening!
