1. 1
By: Eslam Nader Desokey
2018
An Intelligent Framework using Hybrid Social Media and Market Data, for Stock Prediction Analysis
2. 2
Prof. Abdel Fatah Hegazy
Professor at College of Computing and Information Technology at Arab
Academy for Science and Technology and Maritime Transport
Supervised by
Prof. Amr Badr
Professor at Dept. of Computer Science - Faculty of Computers and
Information - Cairo University
3. 3
Motivation
Problem Statement & Objectives
Research methodology
Background & Data Sources
Related work
Proposed Framework
Implementation
Experimental study and evaluation
Contribution
Conclusion and Future work
Publications
Agenda
4. 4
• Attractive places for investment.
• Among the hottest topics for research studies and commercial applications.
• The huge amount of information available on the Internet calls for an intelligent financial web-mining and stock prediction system.
• Helping the Egyptian market to increase market liquidity, especially after the recession and revolutions.
Motivation
5. 5
“Can stock market prices be predicted?” This has been an ongoing debate. Earlier research argued that stock market prices are random and cannot be predicted; the Efficient Market Hypothesis, however, suggests that many variables besides the stock market itself are involved.
Securities transactions have been an important activity among investors over the past decade. How can we understand the relationships between a security’s name, price, trading quantity, other technical indicators, and human sentiment? How do these factors affect buy or sell timing? Is understanding them a condition for being a successful investor?
Problem Statement
6. • Our objective is to build a robust predictive framework that is a hybrid of social media behavior and market data.
Objectives
Increase Egyptian stock market liquidity using our proposed prediction framework.
Model the impact of social media behavior on the stock market.
6
7. 7
Building our rule system that will provide decisions for stock market investors.
Building an aggregator system to retrieve data from social media.
Building our new ontology based on our methodology.
Discussing the proposed framework with stock market experts.
Conducting a survey on algorithms used for stock market prediction in stock exchanges using market data.
Studying how to retrieve data from social media.
Studying social media data and its relation to stock market prediction.
Presenting a proposed methodology for ontology building.
Studying ontology classification, which involves annotation tools and ontology building methodology.
Studying the stock market field and its implementation in the Egyptian stock exchange.
Methodology
12. 13
# | Year | Data Source | Used Algorithms | Final Result
1 | 2010 | Twitter, market data | Self-organizing fuzzy neural network, Granger causality | Impact of the “Calm” mood on DJIA values appears between 2 and 7 days, with different correlations
2 | 2013 | Twitter, market data | Feed-forward neural network | Results need enhancement; semantic data sets are available for English but one is needed for French, and the NN algorithm needs improvement
3 | 2014 | Twitter, market data | Single regression | High positive correlation between Twitter and the S&P 500
4 | 2009 | Twitter, market data, news | Noun phrasing, loose n-gram model, ε-insensitive support vector regression | Failed to predict
Survey on Related Work
• Sorted according to research value
18. 19
Identify experts and our technical analysis sources
Apply knowledge-capture techniques
Identify concepts
Extract concept constraints and the relationships between concepts
Implement our ontology using “Protégé”
Ontology Building Methodology
27. 28
• We used TweetSharp for .NET to build our engine.
• We collected 129,012 tweets from August 25, 2015 to October 23, 2015.
• These tweets came from 67,238 unique users around the world.
Twitter Feeder
28. 29
• We used the Custom Search Engine feature from Google to build our Google Feeder.
• We used the Google Custom Search API v1 to build our engine.
• Our application ran for about one month, from September 20, 2015 to October 23, 2015, and collected 890,342 results.
Google Feeder
33. 34
• Detecting Arabic words in our dataset and deleting rows that contain Arabic.
• Detecting duplicated data in our dataset, using generic comparison in the database (‘LIKE’ statement).
• Deleting text that matches the regular expressions “HTTP” and “WWW”.
• After these stages, we had deleted 9,951 items from our dataset, leaving 1,025,945 rows across all textual datasets, ready for the next step.
Social Media data
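A minimal Python sketch of the three cleaning stages (the regexes and helper names are illustrative assumptions; the thesis ran the equivalent checks against the database):

```python
import re

# Illustrative patterns; the thesis used database 'LIKE' comparisons
# and "HTTP"/"WWW" matches over the collected rows.
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")       # Arabic Unicode block
URL_RE = re.compile(r"(http|www)", re.IGNORECASE)

def clean_rows(rows):
    """Drop rows containing Arabic text or URL fragments, then dedupe."""
    seen = set()
    kept = []
    for text in rows:
        if ARABIC_RE.search(text):     # stage 1: Arabic detection
            continue
        if URL_RE.search(text):        # stage 3: 'HTTP'/'WWW' match
            continue
        key = text.strip().lower()     # stage 2: crude duplicate check
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept
```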
34. 35
• We calculated the mean of all calculated distances for each row, to create one value for the distance between our ontology and the collected dataset.
Levenshtein Distance Algorithm
35. 36
• We used the Levenshtein distance to re-clean our dataset of duplicates by applying:
• X = first row from the dataset
• Y = second row from the dataset
• N = number of Levenshtein distances; in our case N = 8
• a = minimum distance between two rows
• a = (x1-y1) + (x2-y2) + (x3-y3) + (x4-y4) + (x5-y5) + (x6-y6) + (x7-y7) + (x8-y8)
• If a = 0, the two rows are identical and we must keep one and delete the other.
Levenshtein Distance for cleaning
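The per-field edit distance and its row-level sum can be sketched in Python (the function names are illustrative; the thesis computed the same quantity over its N = 8 text columns):

```python
def levenshtein(x, y):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        curr = [i]
        for j, cy in enumerate(y, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cx != cy)))   # substitution
        prev = curr
    return prev[-1]

def row_distance(row_x, row_y):
    """Sum of per-field Levenshtein distances over the N=8 columns.
    A total of 0 means the two rows are identical duplicates."""
    assert len(row_x) == len(row_y)
    return sum(levenshtein(a, b) for a, b in zip(row_x, row_y))
```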
36. The textual dataset divisions
Source | Before Levenshtein | After Levenshtein
Twitter | 129,012 | 106,583
Google | 890,342 | 53,870
News | 4,401 | 4,401
37
37. 38
• Algorithms combined in our model:
– Moving Average Convergence Divergence – MACD
– Relative Strength Index – RSI
– Bollinger Band
Stock Market trading
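The three indicators can be sketched in Python (a minimal stand-in; the 12/26, 14, and 20-period defaults are the conventional settings, which the slides do not specify):

```python
import statistics

def ema(prices, span):
    """Exponential moving average with smoothing factor 2/(span+1)."""
    alpha = 2 / (span + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def macd(prices, fast=12, slow=26):
    """MACD line: EMA(fast) minus EMA(slow)."""
    return [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]

def rsi(prices, period=14):
    """Relative Strength Index over the last `period` price changes."""
    deltas = [b - a for a, b in zip(prices, prices[1:])][-period:]
    gains = sum(d for d in deltas if d > 0)
    losses = -sum(d for d in deltas if d < 0)
    if losses == 0:
        return 100.0
    rs = gains / losses
    return 100 - 100 / (1 + rs)

def bollinger(prices, period=20, k=2):
    """Middle band (SMA) with upper/lower bands k std-devs away."""
    window = prices[-period:]
    mid = sum(window) / len(window)
    sd = statistics.pstdev(window)
    return mid - k * sd, mid, mid + k * sd
```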
42. 43
▪ Select all Levenshtein-cleaned data as features for clustering: Symbol Code, Symbol Name, Reuters Code, Sector Name, Index, CEO, Stock Market
▪ Number of clusters: 3
▪ Number of examples: 164,854
▪ Number of feature columns: 9
▪ Training method: elkan
▪ Number of training iterations: 7
▪ Batch size: 164,854
▪ Total training time (seconds): 1.8036
Apply K-Means algorithm for clustering
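A minimal pure-Python Lloyd's k-means conveys the idea (the thesis used the Dato tool with the "elkan" training method; this sketch, with illustrative toy data, is a stand-in rather than that implementation):

```python
import random

def kmeans(points, k, iters=7, seed=0):
    """Plain Lloyd's k-means on numeric feature vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assignment step
            i = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for c, members in enumerate(clusters):   # update step
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    labels = [min(range(k), key=lambda c: sum(
        (a - b) ** 2 for a, b in zip(p, centroids[c]))) for p in points]
    return centroids, labels
```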
43. 44
Case 1: Applying the K-means Algorithm to Cluster
Hold: 127,479 | Buy: 34,772 | Sell: 2,603
Accuracy: 83.3% (Total: 164,854)
44. 45
• We built our GA model as follows:
• Population size: 20
• Iterations: 100
• Selection method: elite selection
• Chromosome length: 164,854
• Chromosome encoding: permutation chromosome
Apply Genetic Algorithm for Cluster
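A toy elite-selection GA sketches the setup (population 20 and 100 iterations as above; the short label chromosome and the fitness function here are illustrative assumptions, not the thesis's 164,854-gene permutation model):

```python
import random

def elite_ga(fitness, length, pop_size=20, iters=100, n_elite=4, seed=1):
    """Minimal elite-selection GA over label chromosomes (0=Hold, 1=Buy, 2=Sell)."""
    rng = random.Random(seed)
    pop = [[rng.randrange(3) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(iters):
        pop.sort(key=fitness, reverse=True)
        elites = pop[:n_elite]                 # elite selection: best survive
        children = []
        while len(children) < pop_size - n_elite:
            a, b = rng.sample(elites, 2)       # parents drawn from the elite
            cut = rng.randrange(1, length)     # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(length)] = rng.randrange(3)  # point mutation
            children.append(child)
        pop = elites + children
    return max(pop, key=fitness)

# Toy fitness: agreement with a known target labelling (an assumption).
target = [0, 0, 1, 2, 0, 1, 0, 2]
best = elite_ga(lambda c: sum(g == t for g, t in zip(c, target)), len(target))
```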
45. 46
Case 2: Applying GA to Cluster
Hold: 146,591 | Buy: 12,001 | Sell: 6,262
Accuracy: 82.5% (Total: 164,854)
46. 47
• In the final clustering case, we combine k-means with initial centroid selection optimization and GA. We take the first 20% of our dataset and, with the help of stock market experts, divide it into our three target classes (Buy, Hold, Sell). We then guide the k-means implementation in the Dato machine learning tool to use these centroid points, so that every starting point comes from a different pre-divided class.
• After the vector is designed so that rows of the same category are grouped together in regions, as mentioned before, the remaining 80% of the dataset is represented in the VRM using the same vector.
Applying k-means with Initial Centroid
Selection Optimization and GA
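The centroid-seeding idea can be sketched as follows: class means computed from the expert-labelled 20% sample become the initial centroids, and any remaining point is assigned to the nearest one (the names and toy data are illustrative):

```python
def seeded_centroids(sample, labels):
    """Initial centroids: the mean vector of each expert-labelled class
    (Buy/Hold/Sell) in the 20% seed sample."""
    groups = {}
    for point, lab in zip(sample, labels):
        groups.setdefault(lab, []).append(point)
    return {lab: tuple(sum(col) / len(pts) for col in zip(*pts))
            for lab, pts in groups.items()}

def assign(point, centroids):
    """Label a point by its nearest seeded centroid (squared distance)."""
    return min(centroids, key=lambda lab: sum(
        (a - b) ** 2 for a, b in zip(point, centroids[lab])))
```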
47. 48
Case 3: Applying k-means with Initial Centroid Selection Optimization and GA
Hold: 155,540 | Buy: 6,562 | Sell: 2,752
Accuracy: 89.31% (Total: 164,854)
49. 50
• We used DEEP v2.0 by Timothy Masters.
• This is a real-domain model.
• There are two unsupervised layers, not including the input:
– The first hidden layer has five neurons
– The last hidden layer has five neurons
• There are two supervised layers, including the output:
– The hidden layer has six neurons
Apply Deep Learning
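A forward pass through the layer sizes just described can be sketched in pure Python (the 9-wide input is an assumption taken from the feature-column count; DEEP v2.0 itself trains these layers, which this sketch does not):

```python
import math
import random

def dense(x, w, b):
    """One fully connected layer with tanh activation."""
    return [math.tanh(sum(xi * wij for xi, wij in zip(x, row)) + bj)
            for row, bj in zip(w, b)]

def make_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] (cf. the initial random range of 1.0)."""
    return ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [rng.uniform(-1, 1) for _ in range(n_out)])

rng = random.Random(0)
# input, two unsupervised 5-neuron layers, supervised 6-neuron hidden
# layer, single output (close-price prediction)
sizes = [9, 5, 5, 6, 1]
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(x):
    for w, b in layers:
        x = dense(x, w, b)
    return x
```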
50. 51
• Autoencoder greedy training parameters:
• Initial annealing iterations for starting weights = 100
• Initial random range for starting weights = 1.00000
• Gradient reduction max iterations = 500
• Gradient reduction convergence tolerance = 0.0000500
• Greedily trained layers are fine-tuned after each addition.
• Unsupervised-section weights remain fixed, not fine-tuned by supervised training.
Training model
51. 52
• Initial annealing iterations for starting weights = 100
• Initial random range for starting weights = 1.00000
• Supervised optimization max iterations = 1000
• Supervised optimization convergence tolerance = 0.0000500
Supervised layer(s) training parameters
52. 53
Rule template 1:
If point X in C1 then
Close Price = <Predict Deep>
and Decision = Hold, Buy, Sell
with CF: Hold = 0.7, Buy = 0.2, Sell = 0.1
Create Rule-Base Model
53. 54
Rule template 2:
If point X in C2 then
Close Price = <Predict Deep>
and Decision = Hold, Buy, Sell
with CF: Hold = 0.6, Buy = 0.2, Sell = 0.2
Create Rule-Base Model
54. 55
Rule template 3:
If point X in C3 then
Close Price = <Predict Deep>
and Decision = Hold, Buy, Sell
with CF: Hold = 0.5, Buy = 0.3, Sell = 0.2
Create Rule-Base Model
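The three templates can be collapsed into one lookup table (the CF values are copied from the slides; the third template is read as cluster C3, since the slide appears to repeat C2):

```python
# Certainty factors from the three rule templates, keyed by cluster.
RULES = {
    "C1": {"Hold": 0.7, "Buy": 0.2, "Sell": 0.1},
    "C2": {"Hold": 0.6, "Buy": 0.2, "Sell": 0.2},
    "C3": {"Hold": 0.5, "Buy": 0.3, "Sell": 0.2},
}

def fire(cluster, predicted_close):
    """Fire the template matching the point's cluster: return the deep
    model's close-price prediction with the CFs over the three decisions."""
    return predicted_close, RULES[cluster]
```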
56. 57
• In order to evaluate and test our proposed framework in the real world, we ran it for 10 consecutive working days to predict EGS02211C018 [Egypt for Poultry] and recorded all results from both the market and our proposed framework.
Evaluation method
62. Testing accuracy
Market Indication vs Model Results
Market Indication: Hold: 4 | Buy: 3 | Sell: 3
Model Results: C1 Hold: 5 | C2 Buy: 3 | C3 Sell: 2
63
• C1: Recommend Hold
• C2: Recommend Buy
• C3: Recommend Sell
63. 64
The Cluster Average Accuracy
Cluster case: 1 | 2 | 3
Avg accuracy: 83.30% | 82.50% | 89.31%
[Bar chart: average accuracy per cluster case]
64. 65
The Cluster Average Sums of Squared Distances
Cluster case: 1 | 2 | 3
Avg SSD: 419.2805488 | 610.4761191 | 31.0731798
[Bar chart: average SSD per cluster case]
65. 66
The Cluster Average Number of K-means Iterations
Cluster case: 1 | 2 | 3
Avg iterations: 3.97 | 3.57 | 1.78
[Bar chart: average number of iterations per cluster case]
67. 68
• Built the first Egyptian stock market ontology, based on our customized methodology.
• Built a robust framework for stock market prediction based on deep learning technology and data mining.
• Combined the most-used prediction algorithms into one result, used nowadays at HSBC Securities and other Egyptian brokers.
Contribution
69. 70
• Our newly proposed framework positively affected Egyptian market liquidity, as the 10-day live run shows.
• Results show that social media data, especially Twitter, has a positive effect on trading on the same day and the next trading day; this relation would strengthen if we added the Arabic language.
• Clustering using k-means guided by GA gave more accurate results than GA or k-means separately.
Conclusion
71. 72
Add the Arabic language to the prediction data.
Automatically extract data from the ontology.
Increase the proposed model's accuracy by adding more sources.
The thesis work could be extended in these directions
73. 74
Published Work
• Enhancing Stock Prediction Clustering Using K-Means with Genetic Algorithm @ ICENCO 2017, 13th International Computer Engineering Conference, December 2017
• An Intelligent Framework using Hybrid Social Media and Market Data, for Stock Prediction Analysis @ International Journal of Computer and Information Technology (ISSN: 2279-0764), Volume 05, Issue 03, May 2016
74. 75
• This research was supported by Egypt for Information Dissemination (EGID) and the Egyptian Stock Exchange (EGX), with constant consulting and support.
Acknowledgements