@ODSC
OPEN DATA SCIENCE CONFERENCE
London | September 19 - 22, 2018
Olaf de Leeuw
Can we predict the Bitcoin Price with LSTM Sentiment Analysis?
Olaf de Leeuw
Data Scientist at Dataworkz
This is where the idea arose, after a full day of talks at ODSC London 2017
Collecting the data
Market Data Collector | Twitter Data Collector
Not only tweets with #Bitcoin but all kinds of news-related tweets were collected
Exploring the Twitter data
• All the Twitter data is stored in Elasticsearch…,
• We don't know exactly what it looks like yet…,
• We want to create a Recurrent Neural Network with LSTM in TensorFlow…,
• So it's a good thing Python has Elasticsearch and TensorFlow modules! (A minimal query sketch follows below.)
Short demo
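The live demo is not captured in the slides. As a stand-in, here is a minimal sketch of the kind of exploratory query it covered, assuming the Python elasticsearch module and a local index named "tweets" (the host, index and field names are assumptions, not the talk's actual setup):

from elasticsearch import Elasticsearch

# Connect to a local cluster (host and index name are assumptions).
es = Elasticsearch(["http://localhost:9200"])

# Fetch a handful of tweets to see what the documents look like.
response = es.search(index="tweets", body={"query": {"match_all": {}}}, size=5)

for hit in response["hits"]["hits"]:
    print(hit["_source"].get("text"))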
Predicting the sentiment of a tweet: positive or negative?
1 million tweets… How do we analyze these tweets, and how do we put them into a deep learning algorithm?
Deep learning needs scalars or matrices of scalars as input.
For example, a convolutional neural network uses the pixels of images for object recognition.
Likewise, text/speech needs to be vectorized before analyzing it.
"Only words or word encodings provide no useful information regarding the relationships that may exist between the individual symbols" (tensorflow.org).
So: vectorization of our tweets…
Word2Vec is the answer
The basic ideas behind a Word2Vec model
A Word2Vec model is a neural network with one hidden layer.
This hidden layer is a matrix with dimension N x D, where D is the length of a vector representing a word.
The input is a one-hot vector of a word and has dimension N x 1, where N is the number of words in your dictionary.
The output layer is a vector with the probabilities that the input word is a neighbour of each word in the dictionary.
This hidden layer is exactly what we are looking for!
Pre-trained Word2Vec models
• Available on the Stanford NLP website (https://nlp.stanford.edu/projects/glove/) as GloVe vectors.
• Data is available with different numbers of words and several vector dimensions.
• In this project a set of 400k words is used, with vectors of dimension 50 x 1 (a loading sketch follows below).
• The data consists of a word list and a matrix:
❖ The word list contains 400k words, each represented by a number.
❖ The matrix has dimension 400k x 50: for each word, a vector representation of length 50.
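As an illustration, a minimal Python sketch of loading such a set, assuming the standard 400k-word, 50-dimensional GloVe file glove.6B.50d.txt in the working directory:

import numpy as np

words, vectors = [], []
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        words.append(parts[0])                         # the word itself
        vectors.append([float(x) for x in parts[1:]])  # its 50 numbers

word_to_index = {w: i for i, w in enumerate(words)}     # word list: word -> number
embedding_matrix = np.array(vectors, dtype=np.float32)  # the 400k x 50 matrix

print(embedding_matrix.shape)  # (400000, 50)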
Long Short-Term Memories: why should we use them?
source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks are sufficient if you want to predict, for instance, the sentiment of:
“The movie was really bad”
The problem arises when the relevant information is much further away or spread out over multiple sentences:
“This is the best day ever. The weather is beautiful and I got a new job. However the movie I just saw
was really bad”
In a simpler recurrent network this may be predicted as negative. Long Short-Term Memories can deal with the information in the whole text.
First, an intuitive interpretation
• The complete network consists of n such layers.
• At each layer you put in the next word of your text, Xt, and add it to the already stored information.
• A number of updates and calculations are done, and finally there is some output, ht, and we move on to the next layer.
And now step by step…
Step by Step: the main information line
• On this line all the information is stored, and it loops through all the cells until all words are processed.
• Within each cell, information is added, removed and updated.
Step by Step: the forget gate
• The next word, Xt, is added to the cell, along with the information from the previous cell, ht−1.
• The sigmoid function determines which information from ht−1 is kept, e.g.:
When Xt is a new subject, you may want to forget the old one, which is stored in the cell state on the main line.
• The outcome is multiplied with information from the cell state Ct−1.
Step by Step: the forget gate - example
• Assume the word at Xt is “bitcoin”. As stated earlier, we use word vectors.
• The vector is multiplied by a weight matrix W_{x,f} with dimension 50 x (num LSTM units), and after that a bias is added. In formula notation: X_t·W_{x,f} + b_{x,f}
• We work with 50-d vectors and 64 LSTM units, so the formula gives us a vector of length 64.
• Finally this is put into the sigmoid function, σ(X_t·W_{x,f} + b_{x,f}), and the outcome goes to the cell state Ct.
• Together with the previous state ht−1 (the slide lists its term σ(h_{t−1}·W_{h,f} + b_{h,f}) separately), the complete forget gate in the standard formulation becomes:

f_t = σ(X_t·W_{x,f} + h_{t−1}·W_{h,f} + b_f)
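As a toy illustration of the equation above (not the talk's code): a numpy sketch with the talk's 50-d vectors and 64 LSTM units, where the weights are random stand-ins for what the optimizer would learn:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, H = 50, 64                        # vector dimension, number of LSTM units
W_xf = np.random.randn(D, H) * 0.1   # input-to-forget weights (random start)
W_hf = np.random.randn(H, H) * 0.1   # hidden-to-forget weights
b_f = np.zeros(H)                    # forget-gate bias

x_t = np.random.randn(D)             # stand-in word vector, e.g. for "bitcoin"
h_prev = np.zeros(H)                 # previous output h_{t-1}
C_prev = np.random.randn(H)          # previous cell state C_{t-1}

f_t = sigmoid(x_t @ W_xf + h_prev @ W_hf + b_f)  # each entry in (0, 1)
C_after_forget = f_t * C_prev                    # scale down what is "forgotten"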
Step by Step: the input gate
• The input gate consists of two functions:
1. A sigmoid function is used to determine what kind of information we would like to store, e.g. the new subject.
2. A tanh function is used to determine the content of the information, e.g. is the new subject male or female?
• The output of these functions together is added to the current cell state Ct (see the equations below).
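Not spelled out on the slide; in the same notation as the forget-gate example, the standard input-gate equations (cf. colah's blog) are:

i_t = σ(X_t·W_{x,i} + h_{t−1}·W_{h,i} + b_i)      (sigmoid: which items to update)
C̃_t = tanh(X_t·W_{x,C} + h_{t−1}·W_{h,C} + b_C)   (tanh: the candidate content)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t                   (the updated cell state)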
Step by Step: the output gate
• The output gate filters some information from the current cell state.
• A sigmoid decides what we are going to output, and the tanh function makes sure the values are between -1 and 1:
If we saw a new subject, the output will be whether the subject is male or female, singular or plural.
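Again not on the slide; in the same notation, the standard output-gate equations are:

o_t = σ(X_t·W_{x,o} + h_{t−1}·W_{h,o} + b_o)
h_t = o_t ⊙ tanh(C_t)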
The full model:
Tweets (word indices) → GloVe word vectors → RNN with LSTM → Labels ([0,1] or [1,0])
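The slide shows this pipeline as a diagram only. Below is a minimal TensorFlow 1.x sketch of such a graph; the sizes (sequence length 60, 64 LSTM units, the 400k x 50 GloVe matrix) come from earlier slides, everything else is an assumption, not the talk's actual code:

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

max_seq_len, lstm_units, num_classes = 60, 64, 2
glove_matrix = np.zeros((400000, 50), dtype=np.float32)  # stand-in for the real GloVe matrix

input_ids = tf.placeholder(tf.int32, [None, max_seq_len])  # tweets as word indices
labels = tf.placeholder(tf.float32, [None, num_classes])   # one-hot: [0,1] or [1,0]

# Look up the word vector for every index in every tweet.
embedded = tf.nn.embedding_lookup(tf.constant(glove_matrix), input_ids)

# One LSTM layer, with dropout on its outputs.
cell = tf.nn.rnn_cell.BasicLSTMCell(lstm_units)
cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=0.75)
outputs, _ = tf.nn.dynamic_rnn(cell, embedded, dtype=tf.float32)

# Classify from the last LSTM output; train with softmax cross entropy and Adam.
logits = tf.layers.dense(outputs[:, -1, :], num_classes)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)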
Hyperparameters:
There are a lot of choices you have to make before training the RNN with LSTM (a sample configuration follows the list).
• Length of the sequence: the number of LSTM cells.
• Number of LSTM units: comparable to the number of units in a layer of a regular NN.
• Iterations: how often you run the model during training. Each iteration runs one batch.
• Batch size: the number of tweets you run per iteration.
• Optimizer: the function that tries to minimize the loss. Often-used optimizers are Gradient Descent and Adam.
• DropoutWrapper and its keep probability: the probability of keeping information; it helps prevent overfitting.
• Learning rate: too big and your model may not converge, too small and training may take ages.
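To make these choices concrete, a hypothetical configuration: sequence length and unit count come from earlier slides, batch size and iteration count from the speaker's example run; the remaining values are common defaults, not the talk's actual settings:

hyperparams = {
    "max_seq_length": 60,    # number of LSTM cells (max words per tweet)
    "lstm_units": 64,        # units per LSTM cell
    "iterations": 100000,    # training steps, one batch each
    "batch_size": 64,        # tweets per batch
    "optimizer": "adam",     # or "sgd"
    "keep_prob": 0.75,       # DropoutWrapper keep probability
    "learning_rate": 0.001,
}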
Loss function:
The loss function we use is softmax cross entropy.
• Softmax function: it squashes an output vector of real numbers to a vector of real numbers between 0 and 1 that add up to 1:

S(v)_i = e^{v_i} / Σ_{k=1..N} e^{v_k}

• Cross entropy is an often-used alternative to the well-known squared error and is defined by:

H(y, S) = −Σ_i y_i·log(S_i)

where S_i is the output of the softmax function. Cross entropy is only useful when the input is a probability distribution, hence the softmax function.
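A small numpy sketch of the two definitions above (a sketch of the math, not the TensorFlow op itself):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())        # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, s):
    return -np.sum(y * np.log(s))  # y: one-hot label, s: softmax output

logits = np.array([1.2, -0.3])     # raw model output for one tweet
probs = softmax(logits)            # roughly [0.82, 0.18], sums to 1
print(cross_entropy(np.array([1.0, 0.0]), probs))  # small loss: label matches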
Optimization of the loss function:
The optimization functions used in this model are Gradient Descent and the Adam optimizer. The Adam optimizer is an extension of Stochastic Gradient Descent (SGD). The SGD update is defined as

W_{t+1} = W_t − α·∇_t

SGD maintains a single learning rate for all parameter updates. Adam has a learning rate for each network weight, and they are separately adapted.
• Adam: Adaptive Moment Estimation.
• Adam stores the first and second moments (mean and variance) of the decaying average of the past gradients:

m_t = β_1·m_{t−1} + (1 − β_1)·∇_t
v_t = β_2·v_{t−1} + (1 − β_2)·∇_t²

These variables are used to update the parameters/weights of the model:

W_{t+1} = W_t − α·m_t / (√v_t + ε)

http://ruder.io/optimizing-gradient-descent/index.html#adam
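A toy numpy version of the update above (the bias correction of m_t and v_t that full Adam also applies is omitted here, as on the slide):

import numpy as np

def adam_step(W, grad, m, v, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad      # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad**2   # second moment (variance)
    W = W - alpha * m / (np.sqrt(v) + eps)  # per-weight adapted step
    return W, m, v

W = np.array([0.5, -0.2])                   # toy weights
m, v = np.zeros_like(W), np.zeros_like(W)
W, m, v = adam_step(W, grad=np.array([0.1, -0.4]), m=m, v=v)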
Demo: model training and TensorBoard
The result:
sentiment of our 1 million tweets and the Bitcoin rate
How about the ‘derivative’ of the sentiment?
• If the sentiment is getting better, the derivative is positive,
• If the sentiment is getting worse, the derivative is negative,
• If the sentiment is stable, the derivative is zero.
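In code this is just a first difference of the (assumed daily) sentiment series, e.g.:

import numpy as np

daily_sentiment = np.array([0.52, 0.55, 0.55, 0.48])  # made-up values
derivative = np.diff(daily_sentiment)                 # [ 0.03  0.   -0.07]
# positive -> improving, zero -> stable, negative -> worsening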
Discussion and conclusion
• Recurrent Neural Networks with LSTM are powerful tools to work with,
• The mathematics behind them is complicated; the code, however, is not that hard to understand,
• There are many parameters to tune,
• The Bitcoin price and tweet sentiment are not related according to this model.
Some possible improvements:
• Use a training set with the same kind of tweets as the actual set,
• Use other keywords than only news- and finance-related topics,
• Put a higher weight on tweets that were retweeted more than others.
Thank you all for coming
★Questions: https://www.linkedin.com/in/olaf-de-leeuw-6a2b073b/
★Code/Notebooks: https://github.com/olafdeleeuw/ODSC-London-2018
Editor's Notes
  1. Leicester Square. We wanted to learn about RNNs with LSTM and sentiment analysis. Needed a cool topic, so: Bitcoin.
  2. We built an application in Java that collects Twitter data and stores it in ES. We ran the collector for a couple of weeks. We collected tweets with finance- and news-related items. The Bitcoin data is stored in MySQL.
  3. Opportunity to learn some new things: ES. I wanted to learn about LSTM and TensorFlow. So I needed ES, TensorFlow and recurrent NNs —> Python.
  4. We collected 1 million tweets, but an RNN needs vectors, not strings. Example about image recognition. Strings provide no useful info to an RNN. How to convert the data? Vectorization.
  5. Words related in semantics, meaning and context are closer to each other.
  6. Word2Vec is a neural network with 1 hidden layer. Input is a 1-hot vector: see picture on the next slide. Its length is the number of words in your dictionary, in my case 400k. Output of the NN is a vector with probabilities, for all words in the dict, that your input word is the neighbour. The hidden layer is the vector matrix we want. It has dim 400k x 50 and is the vector representation of all the words in our dictionary.
  7. The hidden layer is the word vector matrix. We don't need the output layer here.
  8. You can train a Word2Vec yourself, but you need a lot of text and it is not the purpose of this talk, so I used a pre-trained model. There are sets available with vector dims from 50 to 300. So we split our tweets into words, and each word in the tweet is converted to a word vector.
  9. Use an RNN with LSTM when a regular RNN is not good enough, i.e. when there is too much information and it is spread out. Ref: Colah's blog.
  10. N cells, usually about the number of words; in our case the max length of tweets is about 60. At each cell you put in a new word of your tweet. In the cell, the input of the new word and the output of the previous cell are used to update your information about the sentiment, about your prediction.
  11. Main line, stores all relevant information. This goes from beginning to end, to the output. The information is updated in each cell based on new words, via multiplication and addition.
  12. Next word added, just like information about the previous state. Sigmoid determines what to throw away from this —> 0: all, 1: nothing. Example: a new subject may be interesting and you may want to throw away the old subject. The output of the sigmoid is multiplied with the current cell state to throw away this irrelevant data.
  13. Bitcoin as a vector —> Xt via GloVe. Multiply by a weight matrix and add a bias. In the model we start with a random normally distributed weight matrix and a constant bias. Via optimization algorithms such as SGD or Adam these weights and biases are updated. The outcomes are multiplied with Ct to throw away the information you don't need anymore.
  14. At the input gate you do 2 things: determine which items you want to update, e.g. the new subject; determine what information you want to update, e.g. plural or singular, male or female. This is added (not multiplied) to Ct because you want to add information.
  15. In the last gate, the information we would like to output is filtered. This information is also sent to the forget gate of the next cell. A sigmoid function determines which items are output, such as the new subject in the previous example. A tanh function on the cell state determines what information the model outputs at this timestep.
  16. Start with all the tweets. Split them into lists. Create indices of words. Create vectors with the GloVe dictionary/dataset. Run the RNN model with LSTM —> check the loss, optimize with for example Adam. Evaluate the output labels.
  17. Explain hyperparameters. Batch size and number of iterations may influence the overfitting of your model. My example subset: batch size 64 and 100k iterations.
  18. For each item you want the chances to sum up to one —> softmax, e.g. 0.4 for pos and 0.6 for neg. So in fact it creates a probability distribution. The normal squared error causes non-convex loss functions for classification, therefore cross entropy: this makes sure we have a convex problem.
  19. Adam is better suited because it has a learning rate for each parameter, where SGD has one for all. Changing epsilon can help prevent fluctuations; in my model it didn't.
  20. One period without predictions, because I had no data. Skiing :)