3. Problem & High Level Conclusions
General Problem: How do we predict the stock market using historical stock prices and news articles?

01. Is it nowadays possible to anticipate the market's behavior based on its evolution during the past years? NO.
02. Do news articles reveal more insight about future DJIA indices than the historical DJIA indices? NO.
4. Overview of Approach
Step 1: Pre-process (Clean up)
● Convert to lowercase
● Lemmatize
● Remove rare words
● Remove non-alphanumeric characters
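The cleanup steps above can be sketched in plain Python. This is a minimal, dependency-free illustration: the corpus and the `min_count` threshold are made up for the example, and lemmatization (which would normally use e.g. NLTK's WordNetLemmatizer) is omitted to keep the sketch self-contained.

```python
import re
from collections import Counter

def preprocess(docs, min_count=2):
    """Clean raw documents: lowercase, keep only alphanumeric
    tokens, and drop rare words. (Lemmatization, e.g. with NLTK's
    WordNetLemmatizer, is left out to avoid extra dependencies.)"""
    # Lowercase and keep only alphanumeric tokens.
    tokenized = [re.findall(r"[a-z0-9]+", doc.lower()) for doc in docs]
    # Count word occurrences across the whole corpus.
    counts = Counter(tok for doc in tokenized for tok in doc)
    # Remove rare words (appearing fewer than min_count times).
    return [[tok for tok in doc if counts[tok] >= min_count]
            for doc in tokenized]

docs = ["Stocks RALLY as DJIA climbs!", "Stocks slide; DJIA falls 2%."]
print(preprocess(docs))  # only "stocks" and "djia" occur twice
```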
Step 2: Text Vectorizations & Word Embeddings

Text vectorizations:
● Word count
● Binary
● TF-IDF
● Frequency

Word embeddings:
● GloVe vectors
● Co-occurrence-based word embeddings
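A toy version of the co-occurrence-based embeddings can be built with NumPy: count word co-occurrences within a context window, then reduce the matrix with a truncated SVD. The corpus, window size, and dimension here are illustrative only; GloVe vectors, the other option above, are pre-trained embeddings that would simply be loaded from file.

```python
import numpy as np

def cooccurrence_embeddings(docs, dim=2, window=2):
    """Build a word-word co-occurrence matrix over a context
    window, then keep the top `dim` SVD directions as embeddings."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        for i, tok in enumerate(doc):
            # Count neighbours within the window (excluding the word itself).
            for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                if i != j:
                    M[index[tok], index[doc[j]]] += 1
    # Truncated SVD: project onto the leading singular directions.
    U, S, _ = np.linalg.svd(M)
    return vocab, U[:, :dim] * S[:dim]

docs = [["stocks", "rally", "djia"], ["stocks", "fall", "djia"]]
vocab, emb = cooccurrence_embeddings(docs)
print(vocab, emb.shape)  # one dim-dimensional vector per vocabulary word
```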
Step 3: Design Experiments and Train Models
● Time Series Analysis
● SVM Classifiers
● Neural Networks
● LSTM
Step 4: Evaluate
Baselines:
● B1: the predictor that always predicts up
● B2: the predictor that always predicts the previous day's trend
Report a confidence interval of the true error with respect to the baselines.
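The two baselines and the confidence interval can be sketched in a few lines. The daily moves below are made up for the example; the interval uses the standard normal approximation for a binomial error rate.

```python
import math

def baseline_errors(moves):
    """moves: list of +1 (up) / -1 (down) daily moves.
    B1 always predicts up; B2 predicts the previous day's trend."""
    b1_err = sum(m != 1 for m in moves) / len(moves)
    pairs = zip(moves[:-1], moves[1:])
    b2_err = sum(prev != cur for prev, cur in pairs) / (len(moves) - 1)
    return b1_err, b2_err

def error_ci(err, n, z=1.96):
    """95% normal-approximation confidence interval for the true error."""
    half = z * math.sqrt(err * (1 - err) / n)
    return err - half, err + half

moves = [1, 1, -1, 1, -1, -1, 1, 1]     # illustrative up/down sequence
b1, b2 = baseline_errors(moves)
print(b1, b2, error_ci(b1, len(moves)))
```

A trained model is worth reporting only if its error falls below the baselines' confidence intervals.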
Pipeline: Pre-process → Text Vectorizations & Word Embeddings → Design Experiments and Train Models → Evaluate
5. SVM - Approach
SVM parameters:
● C value
● Gamma value
● Kernels: radial basis function (rbf), polynomial, sigmoid, linear

Text vectorization techniques:
● Binary occurrence vectorization (binary)
● Word count vectorization (count)
● Term frequency–inverse document frequency vectorization (tfidf)
● Frequency vectorization (freq)

Dimensionality reduction techniques:
● Truncated singular value decomposition (SVD)
● Principal component analysis (PCA)
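A search over these choices can be sketched with a scikit-learn pipeline: tfidf vectorization, truncated SVD, then an SVM whose C, gamma, and kernel are tuned by grid search. The headlines, labels, and grid values below are illustrative placeholders, not the project's actual data or settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy headlines with up (1) / down (0) labels, for illustration only.
headlines = ["stocks rally on earnings", "market falls on fears",
             "shares rally again", "index falls sharply"] * 5
labels = [1, 0, 1, 0] * 5

# tfidf features -> truncated SVD -> SVM, with a small grid over
# the SVM parameters listed above (C, gamma, kernel).
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("svd", TruncatedSVD(n_components=2)),
                 ("svm", SVC())])
grid = GridSearchCV(pipe,
                    {"svm__C": [0.1, 1, 10],
                     "svm__gamma": ["scale", 0.1],
                     "svm__kernel": ["rbf", "linear", "poly", "sigmoid"]},
                    cv=2)
grid.fit(headlines, labels)
print(grid.best_params_, grid.score(headlines, labels))
```

Swapping `TfidfVectorizer` for `CountVectorizer` (with `binary=True` for binary occurrence) covers the other vectorizations, and `PCA` can replace `TruncatedSVD` on dense features.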
7. Neural Networks - Approach
Text vectorization techniques:
● Binary occurrence vectorization (binary)
● Word count vectorization (count)
● Term frequency–inverse document frequency vectorization (tfidf)
● Frequency vectorization (freq)

Fine-tuned neural network parameters:
● Hidden layers: varied from 1 to 3
● Number of neurons per hidden layer: varied
● Loss function: binary cross-entropy
● Optimizer: Adam
● Hidden-layer activation: ReLU
● Output-layer activation: sigmoid
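This configuration can be sketched with scikit-learn's `MLPClassifier`, which matches the choices above closely: ReLU hidden activations, the Adam optimizer, log-loss (equal to binary cross-entropy for binary targets), and a logistic (sigmoid) output unit. The random features and layer widths below are stand-ins, not the project's actual setup.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))           # stand-in for vectorized headlines
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in up/down labels

# Vary the depth from 1 to 3 hidden layers, as on the slide.
for layers in [(32,), (32, 16), (32, 16, 8)]:
    clf = MLPClassifier(hidden_layer_sizes=layers,
                        activation="relu",   # hidden-layer activation
                        solver="adam",       # optimizer
                        max_iter=500, random_state=0)
    # For binary targets MLPClassifier minimizes log-loss (binary
    # cross-entropy) and uses a logistic (sigmoid) output unit.
    clf.fit(X, y)
    print(layers, round(clf.score(X, y), 2))
```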
11. LSTM - Discussion
Hyper-parameter          | Parameter Value #1                            | Parameter Value #2
Word embedding           | Traditional co-occurrence-based word embeddings | GloVe vectors
Word embedding dimension | 50                                            | 75
Direction                | Bi-directional                                | Uni-directional
Optimization algorithm   | SGD                                           | Adam
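The table above defines a 2×2×2×2 search space; enumerating it, as one would when tuning, yields the 16 candidate LSTM configurations. A small sketch (the dictionary keys are paraphrased from the table):

```python
from itertools import product

# Hyper-parameter grid taken from the LSTM table above.
grid = {
    "word_embedding": ["co-occurrence", "glove"],
    "embedding_dim": [50, 75],
    "direction": ["bi-directional", "uni-directional"],
    "optimizer": ["sgd", "adam"],
}

# Cartesian product: every combination of the two values per parameter.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs), configs[0])  # 16 candidate configurations
```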
12. Time Series - Approach
Two time-series approaches:
● AutoRegressive Integrated Moving-Average (ARIMA): regression, evaluated with both in-sample and out-of-sample forecasting.
● SVM on all-binary features: classification.
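In practice an ARIMA model would come from a library such as statsmodels; to keep the sketch self-contained, the snippet below fits only the AR(1) component by least squares on a synthetic series, and contrasts in-sample (one-step, using observed values) with out-of-sample (recursive, using its own forecasts) prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=120))   # stand-in for the DJIA index
train, test = series[:100], series[100:]

# Least-squares fit of x[t] = phi * x[t-1] + c (the AR(1) part only).
x, y = train[:-1], train[1:]
A = np.column_stack([x, np.ones_like(x)])
(phi, c), *_ = np.linalg.lstsq(A, y, rcond=None)

in_sample = phi * x + c                    # one-step fits on training data
out_sample = [phi * train[-1] + c]         # recursive multi-step forecast
for _ in range(len(test) - 1):
    out_sample.append(phi * out_sample[-1] + c)

print(round(phi, 2), len(in_sample), len(out_sample))
```

The out-of-sample forecast degrades with the horizon because each step feeds on the previous forecast, which is why the two evaluation modes are reported separately.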
16. Conclusion - Research Questions
01. Is it nowadays possible to anticipate the market's behavior based on its evolution during the past years? NO.
02. Do news articles reveal more information on the short-term (day-ahead) evolution of the DJIA index than the past evolution of this market index itself? NO.
20. Text Vectorizations
1. Word count vectorization: a text encoding technique that converts a collection of texts or documents into a matrix of token counts.
2. Frequency vectorization: a text vectorization technique that uses the frequency of occurrence of a particular word within a text (its count relative to the document length).
3. Term frequency–inverse document frequency (TF-IDF) vectorization: a text encoding technique that converts a collection of texts or documents into a matrix of TF-IDF features. TF-IDF measures how relevant a particular word is to a particular document, down-weighting words that appear in many documents, which makes it more informative than plain frequency vectorization.
4. Binary occurrence vectorization: a variant of word count vectorization in which only the presence or absence of a word is recorded instead of its count. In some cases binary occurrence may offer better features.
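The four schemes can be contrasted on a toy corpus with a hand-rolled vectorizer (in practice one would use scikit-learn's CountVectorizer/TfidfVectorizer; the tf-idf variant here uses the simple log(N/df) weighting):

```python
import math
from collections import Counter

def vectorize(docs, mode="count"):
    """Encode tokenized docs over a shared vocabulary using one of
    the four schemes above: count, binary, freq, or tfidf."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for w in vocab:
            if mode == "count":
                row.append(counts[w])
            elif mode == "binary":
                row.append(int(counts[w] > 0))
            elif mode == "freq":
                row.append(counts[w] / len(doc))
            elif mode == "tfidf":
                # Term frequency times log(N / document frequency):
                # words present in every document get weight 0.
                row.append((counts[w] / len(doc)) * math.log(n / df[w]))
        rows.append(row)
    return vocab, rows

docs = [["djia", "rises", "djia"], ["djia", "falls"]]
vocab, counts = vectorize(docs, "count")
print(vocab, counts)  # "djia" appears twice in the first document
```

Note how "djia", present in both documents, receives a tf-idf weight of zero, illustrating the down-weighting of uninformative common words.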