1. Sentiment Analysis of Financial Articles Using Neural Network on Apache Spark
Advisor: Dr. Mohammad Zubair
Computer Science Department
Old Dominion University
2. What is sentiment analysis?
• In a nutshell: extracting attitudes toward something from human language
• Sentiment analysis aims to map qualitative data to a quantitative output
• Example: "This movie was actually neither that funny, nor super witty"
• A human can easily understand this in context
• How do we convert human language into a form a machine can understand?
3. Previous vs. Current Approach for Sentiment Analysis
• Previous approach: keyword lookup / lexicon [1]
• Assign a sentiment score to each word ("bad": -1, "good": +1)
• The overall +/- total determines the sentiment
• Drawbacks: ignores word context; cannot implicitly capture negation ("not good" = 0?)
• Current approach: word prediction / Word2vec [2]
• Maps words to continuous vector representations (i.e., points in an N-dimensional space)
• Learns the vectors from training data (generalizable!)
• Advantages: captures context, and, more importantly, supports relations like
vector("king") – vector("man") + vector("woman") ≈ vector("queen")
[1] https://www.aclweb.org/anthology/J/J11/J11-2001.pdf
[2] http://arxiv.org/pdf/1301.3781.pdf
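This vector-arithmetic property is easy to reproduce with pretrained embeddings. Below is a minimal sketch, not part of this project, using the gensim library and its bundled glove-wiki-gigaword-50 vectors:

```python
# Minimal sketch: reproduce king - man + woman ~= queen with
# pretrained vectors via gensim (not the vectors trained in this work).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained set

result = vectors.most_similar(positive=["king", "woman"],
                              negative=["man"], topn=1)
print(result)  # expected top match: 'queen'
```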
8. Project Overview
• Harvest articles and market data (structured/unstructured) from the Internet
• Label the articles (positive / negative / unknown) using insights from the market data
• Convert the labeled articles to vectors with Doc2Vec
• Use the labeled vectors to build a binary classifier
• Predict the polarity of unknown articles
11. Data Extraction
• Financial news websites:
• http://www.moneycontrol.com/
• http://www.thehindubusinessline.com/
• Companies tracked:
Vedanta/SESGOA
Steel Authority of India
National Aluminium Company
Hindalco Industries
Welspun Corp
Jindal Steel & Power
Usha Martin
Adhunik Metaliks
PSL
Visa Steel
Bhushan Steel
Gujarat NRE Coke
18. Labeling Articles
• 2,352 positive articles
• 3,688 negative articles
• Positive article example:
• "Abu Dhabi has awarded an order of aggregate value of US $460 million for pipe supply to Jindal SAW Limited (approx. USD 95 million), besides Japan's Sumitomo and Germany's Salzgitter for the balance portion. Jindal SAW Limited is the only Indian company which has been considered for and awarded this order."
• Negative article example:
• "Adhunik informed the stock exchanges that the company's and its subsidiary's businesses were impacted due to the closure of iron and manganese ore mines and the scarcity of coal. Hence, the lenders of the company at their joint lenders' forum meeting decided on a corrective action plan to restructure its debt."
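The deck does not spell out the exact labeling rule, so the following is only a hypothetical sketch of one way market data can label an article: mark it positive if the company's stock closed higher over the trading days after publication. The `prices` mapping, the horizon, and the function name are all illustrative assumptions.

```python
# Hypothetical labeling rule (the exact criterion used in the project
# is not shown): compare closing prices before and after publication.
from datetime import timedelta

def label_article(article_date, prices, horizon_days=3):
    """prices: dict mapping date -> closing price for the company."""
    start = prices.get(article_date)
    end = prices.get(article_date + timedelta(days=horizon_days))
    if start is None or end is None:
        return "unknown"          # no usable market data
    return "positive" if end > start else "negative"
```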
19. Feature Extraction
• Implemented the neural-network approach proposed by Le and Mikolov [5]
• This extends the Word2Vec tool to documents [6]
• Implementing this model involves three steps [7]:
• Building the vocabulary
• Building the unigram table
• Updating the word and document vectors
[5] https://cs.stanford.edu/~quocle/paragraph_vector.pdf
[6] https://code.google.com/archive/p/word2vec/
[7] http://arxiv.org/pdf/1402.3722v1.pdf
20. Model Implementation
• Sample input:
• "Adhunik Metaliks Ltd has informed BSE that the Company is operating its captive Kulum iron ore mine in Orissa and its wholly owned subsidiary, Orissa Manganese & Minerals Limited (OMML), is operating two (2) iron ore mines in the states of Jharkhand and Orissa."
• "Abu Dhabi has awarded an order of aggregate value of US $460 million for pipe supply to Jindal SAW Limited (approx. USD 95 million), besides Japan's Sumitomo and Germany's Salzgitter for the balance portion. Jindal SAW Limited is the only Indian company which has been considered for and awarded this order."
21. Model Implementation
• Build vocab:
Read the input files
Create a dictionary of words: dictionary[word] = word count
Sort the dictionary by word count
• Resulting vocabulary for the sample input:
INDEX  WORD       COUNT
0      the        4
1      and        4
2      has        3
3      is         3
4      Limited    3
5      of         3
6      for        3
7      operating  2
8      its        2
9      iron       2
10     ore        2
11     in         2
12     Orissa     2
13     awarded    2
14     Jindal     2
...
69     order.     1
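A runnable Python sketch of this step (the slide shows only pseudocode; reading plain-text files and splitting on whitespace are simplifying assumptions):

```python
# Build the vocabulary: count word frequencies across all input files
# and assign indexes in descending-count order, as in the table above.
import os
from collections import Counter

def build_vocab(input_dir):
    counts = Counter()
    for name in os.listdir(input_dir):
        with open(os.path.join(input_dir, name), encoding="utf-8") as f:
            counts.update(f.read().split())
    # Most frequent word gets index 0 ("the" in the sample input).
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common())}
    return vocab, counts
```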
22. Model Implementation
• Build unigram table:
Initialize a unigram table ut of size greater than the number of words in the file
p = 0
for each word w in the vocabulary:
    p += count(w)^(3/4)
    while (index i of ut) / table_size < p / total:
        ut[i] = index of w
(total = Σ count(w)^(3/4) over the vocabulary; the 3/4 power damps the frequency of common words)
• Resulting table for the sample vocabulary, repeating each word index in proportion to its damped frequency:
INDEX: 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 6, ..., 69
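A Python sketch of the same construction, assuming the `vocab` and `counts` produced by the `build_vocab` sketch above; the table size is illustrative:

```python
# Build the unigram table for negative sampling: each word fills a
# share of the table proportional to count(w) ** 0.75, so frequent
# words are sampled more often, but sub-linearly.
def build_unigram_table(counts, vocab, table_size=10_000):
    total = sum(c ** 0.75 for c in counts.values())
    table = []
    cumulative = 0.0
    for word, count in counts.most_common():
        cumulative += count ** 0.75
        # Fill slots until this word's cumulative share is covered.
        while len(table) < table_size * cumulative / total:
            table.append(vocab[word])
    return table
```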
24. Implementation (contd.)
Read the input directory d:
    for each document di in d do
        for each word wi in di do
            input I = index of the word wi in the vocabulary
            set a context window cw of size 3
• Document 1 as vocabulary indices:
[16, 17, 18, 2, 19, 20, 21, 0, 22, 3, 7, 8, 23, 24, 9, 10, 25, 11, 12, 1, 8, 26, 27, 28, 12, 29, 30, 31, 4, 32, 3, 7, 33, 34, 9, 10, 35, 11, 0, 36, 5, 37, 1, 38]
• First step: input I = 16 (the first word), context window cw = [17, 18, 2]
25. Implementation (contd.)
            for each word cwi in the context window cw do
                negative = 2 random samples from the unigram table ut
                classifier = [(input I, 1)] + [(ui, 0) for ui in negative]
                e = 0
                for each (pwi, label) in classifier do
                    dot = sigmoid(syn0[cwi] · syn1[pwi])
                    gradient = alpha * (label - dot)
                    e += gradient * syn1[pwi]
                    syn1[pwi] += gradient * syn0[cwi]
                end for
• Example: with input I = 16 and cwi = 17, sampling the negatives 56 and 8 from ut gives classifier = [(16, 1), (56, 0), (8, 0)]; the update then reads syn0[17] and updates syn1[16], syn1[56], syn1[8]
29. Implementation (contd.)
Read the input directory d:
    for each document di in d do
        for each word wi in di do
            input I = index of the word wi in the vocabulary
            set a context window cw of size 3
            for each word cwi in cw do
                negative = 2 random samples from the unigram table ut
                classifier = [(input I, 1)] + [(ui, 0) for ui in negative]
                e = 0
                for each (pwi, label) in classifier do
                    dot = sigmoid(syn0[cwi] · syn1[pwi])
                    gradient = alpha * (label - dot)
                    e += gradient * syn1[pwi]
                    syn1[pwi] += gradient * syn0[cwi]
                end for
                syn0[cwi] += e
                doc_vec[di] += e
            end for
        end for
    end for
• Document 2 as vocabulary indices:
[39, 40, 2, 13, 41, 42, 5, 43, 44, 5, 45, 46, 47, 6, 48, 49, 50, 14, 15, 4, 51, 52, 53, 54, 55, 56, 57, 1, 58, 59, 6, 0, 60, 61, 14, 15, 4, 3, 0, 62, 63, 64, 65, 2, 66, 67, 6, 1, 13, 68, 69]
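Putting the pseudocode together, here is a NumPy sketch of one training pass over a document. The matrix shapes, the learning rate `alpha`, and the window/negative-sample counts are illustrative assumptions matching the walkthrough above.

```python
# One training pass of the doc2vec-style update described above.
# syn0: input word vectors, syn1: output word vectors,
# doc_vec: document vectors (all NumPy arrays of shape [n, dim]).
import random
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_document(doc, di, syn0, syn1, doc_vec, table,
                   window=3, negatives=2, alpha=0.025):
    """doc: one document as a list of vocabulary indices."""
    for pos, I in enumerate(doc):
        cw = doc[pos + 1 : pos + 1 + window]   # context window of size 3
        for cwi in cw:
            # One positive pair (the input word) plus sampled negatives.
            classifier = [(I, 1)] + [(random.choice(table), 0)
                                     for _ in range(negatives)]
            e = np.zeros(syn0.shape[1])
            for pwi, label in classifier:
                dot = sigmoid(np.dot(syn0[cwi], syn1[pwi]))
                gradient = alpha * (label - dot)
                e += gradient * syn1[pwi]
                syn1[pwi] += gradient * syn0[cwi]
            # Propagate the accumulated error to the context word
            # vector and to the document vector.
            syn0[cwi] += e
            doc_vec[di] += e
```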
32. Distributed Implementation on Spark [8]
• Master node: build the vocab and unigram table; initialize the syn0, syn1, and doc_vec matrices
• Broadcast syn0, syn1, and doc_vec to the worker partitions
• Each partition trains on its share of the documents in parallel
• Master node: add the updated vectors back at the respective indexes
[8] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
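The deck does not include the project's Spark code; the following PySpark sketch only illustrates the broadcast-train-aggregate scheme in the diagram, reusing the `train_document` sketch above. `docs_rdd` (an RDD of (document id, token indices) pairs), `unigram_table`, and the matrix variables are assumptions.

```python
# Sketch of the diagram's scheme: the master broadcasts the matrices,
# each partition trains locally, and the master sums the per-partition
# deltas index by index.
from pyspark import SparkContext

sc = SparkContext(appName="doc2vec-sketch")

bc_syn0 = sc.broadcast(syn0)          # matrices initialized on the master
bc_syn1 = sc.broadcast(syn1)
bc_doc_vec = sc.broadcast(doc_vec)
bc_table = sc.broadcast(unigram_table)

def train_partition(docs):
    s0 = bc_syn0.value.copy()
    s1 = bc_syn1.value.copy()
    dv = bc_doc_vec.value.copy()
    for di, doc in docs:              # doc: list of vocabulary indices
        train_document(doc, di, s0, s1, dv, bc_table.value)
    # Ship back the deltas so the master can add them by index.
    yield (s0 - bc_syn0.value, s1 - bc_syn1.value, dv - bc_doc_vec.value)

d0, d1, ddv = docs_rdd.mapPartitions(train_partition).reduce(
    lambda a, b: tuple(x + y for x, y in zip(a, b)))
syn0 += d0
syn1 += d1
doc_vec += ddv
```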
34. Model Validation
• We validated the model using the IMDB movie review dataset [9]
• Used 1,500 positive and 1,500 negative reviews
[9] http://ai.stanford.edu/~amaas/data/sentiment/
35. Binary Classification
• We used logistic regression to train a binary classifier
• Trained on 2,000 positive and 2,000 negative labeled document vectors
• Positive document vectors are labeled 1 and negative vectors 0
• Implemented k-fold cross-validation
• Plotted an ROC curve to assess classifier performance (see the sketch below)
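A scikit-learn sketch of this step (the deck does not name a library); `doc_vec`, `labeled_ids`, and `labels` are assumed to come from the feature-extraction stage:

```python
# Logistic regression over document vectors with k-fold
# cross-validation and an ROC/AUC check.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_curve, auc

# X: document vectors for labeled articles; y: 1 = positive, 0 = negative.
X = np.vstack([doc_vec[i] for i in labeled_ids])
y = np.array(labels)

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=10))   # k-fold accuracy scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
clf.fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, _ = roc_curve(y_te, scores)      # points for the ROC curve
print("AUC:", auc(fpr, tpr))
```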
36. Experiments and Results
• Experiments on feature extraction using different numbers of documents:

                                              Experiment 1   Experiment 2
Doc2Vec (feature extraction)
  Number of documents                         8,000          16,000
  Total number of words                       1,450,364      7,171,249
  Vocab size                                  14,939         56,321
Logistic regression (binary classification)
  Number of positive documents                2,000          2,000
  Number of negative documents                2,000          2,000
  Accuracy                                    0.77           0.89
  Area under curve                            0.84           0.92
39. Conclusion and Future Work
• Conclusion:
• Financial news articles have a measurable impact on market trends
• A neural-network approach can automatically extract meaningful information from text documents
• The more data provided, the better the model performs
• Future work:
• Extending the model to online training with streaming input
• Using different features of the market data to label the documents
Editor's Notes
• Decades of research have gone into extracting information from text documents.
• Earlier approaches required manually building massive dictionaries of positive, negative, strong, weak, active, and passive words and phrases across multiple categories for every new story.