1. 1
By: Eslam Nader Desokey
2018
An Intelligent Framework using Hybrid Social Media and Market Data, for Stock Prediction Analysis
2. 2
Prof. Abdel Fatah Hegazy
Professor at College of Computing and Information Technology at Arab
Academy for Science and Technology and Maritime Transport
Supervised by
Prof. Amr Badr
Professor at Dept. of Computer Science - Faculty of Computers and
Information - Cairo University
3. 3
Motivation
Problem Statement & Objectives
Research methodology
Background & Data Sources
Related work
Proposed Framework
Implementation
Experimental study and evaluation
Contribution
Conclusion and Future work
Publications
Agenda
4. 4
• Attractive places for investment.
• Among the hottest topics for research studies and commercial applications.
• The huge amount of information available on the Internet calls for an intelligent financial web-mining and stock prediction system.
• Helping the Egyptian market to increase market liquidity, especially after the recession and revolutions.
Motivation
5. 5
“Can stock market prices be predicted?” This has been an ongoing debate. Earlier research argued that stock market prices are random and cannot be predicted; the Efficient Market Hypothesis, however, suggests that many variables besides the stock market itself are involved.
Securities transactions have been an important activity among investors over the past decade. How can we understand the relationships between a security’s name, price, trading quantity, other technical indicators, and human sentiment? How do these factors affect buy or sell timing? Is understanding them a condition for being a successful investor?
Problem Statement
6. • Our objective is to build a robust predictive framework that is a hybrid of social media behavior and market data.
Objectives
Increase Egyptian stock market liquidity using our proposed prediction framework.
Model the impact of social media behavior on the stock market.
6
7. 7
Building our rule system that will provide decisions for stock market investors.
Building an aggregator system to retrieve data from social media.
Building our new ontology based on our methodology.
Discussing the proposed framework with stock market experts.
Conducting a survey on algorithms used for stock market prediction in stock exchanges using market data.
Studying how to retrieve data from social media.
Studying social media data and its relation to stock market prediction.
Presenting a proposed methodology for ontology building.
Studying ontology classification, which involves annotation tools and ontology building methodology.
Studying the stock market field and its implementation in the Egyptian stock exchange.
Methodology
12. 13
# | Year | Data Source | Used Algorithms | Final Result
1 | 2010 | Twitter, market data | Self-organizing fuzzy neural network, Granger causality | Impact of the “Calm” mood on DJIA values appears between 2 and 7 days, with different correlations
2 | 2013 | Twitter, market data | Feed-forward neural network | Results need enhancement; semantic data sets are available for English but one is needed for French, and the NN algorithm needs improvement
3 | 2014 | Twitter, market data | Single regression | High positive correlation between Twitter and the S&P 500
4 | 2009 | Twitter, market data, news | Noun phrasing, loose n-gram model, ε-insensitive support vector regression | Failed to predict
Survey on Related Work
• Sorted according to research value
18. 19
Identify experts and our technical analysis sources
Apply knowledge-capture techniques
Identify concepts
Extract concept constraints and the relationships between concepts
Implement our ontology using “Protégé”
Ontology Building Methodology
27. 28
• We used TweetSharp for .NET to build our engine.
• We collected 129,012 tweets from August 25, 2015 to October 23, 2015.
• These tweets came from 67,238 unique users around the world.
Twitter Feeder
28. 29
• We used the Custom Search Engine feature from Google to build our Google Feeder.
• We used the Google Custom Search API v1 to build our engine.
• Our application ran for about one month, from September 20, 2015 to October 23, 2015, and collected 890,342 results.
Google Feeder
33. 34
• Detecting Arabic words in our dataset and deleting rows that contain Arabic.
• Detecting duplicated data in our dataset, using generic comparison in the database (‘LIKE’ statement).
• Deleting text that matches the regular expressions “HTTP” and “WWW”.
• After these stages, we had deleted 9,951 items from our dataset, leaving 1,025,945 rows across all textual datasets, ready for the next step.
Social Media data
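A minimal Python sketch of the three cleaning stages (the regexes and helper names are illustrative assumptions; the thesis ran the equivalent checks against the database):

```python
import re

# Illustrative patterns; the thesis used database 'LIKE' comparisons
# and "HTTP"/"WWW" matches over the collected rows.
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")       # Arabic Unicode block
URL_RE = re.compile(r"(http|www)", re.IGNORECASE)

def clean_rows(rows):
    """Drop rows containing Arabic text or URL fragments, then dedupe."""
    seen = set()
    kept = []
    for text in rows:
        if ARABIC_RE.search(text):     # stage 1: Arabic detection
            continue
        if URL_RE.search(text):        # stage 3: 'HTTP'/'WWW' match
            continue
        key = text.strip().lower()     # stage 2: crude duplicate check
        if key in seen:
            continue
        seen.add(key)
        kept.append(text)
    return kept
```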
34. 35
• We calculated the mean of all calculated distances for each row, to create one value for the distance between our ontology and the collected dataset.
Levenshtein Distance Algorithm
35. 36
• We used the Levenshtein distance to re-clean our dataset of duplicates by applying:
• X = first row from the dataset
• Y = second row from the dataset
• N = number of Levenshtein distances; in our case N = 8
• a = minimum distance between two rows
• a = (x1-y1) + (x2-y2) + (x3-y3) + (x4-y4) + (x5-y5) + (x6-y6) + (x7-y7) + (x8-y8)
• If a = 0, the two rows are identical and we must keep one and delete the other.
Levenshtein Distance for cleaning
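The per-field edit distance and its row-level sum can be sketched in Python (the function names are illustrative; the thesis computed the same quantity over its N = 8 text columns):

```python
def levenshtein(x, y):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        curr = [i]
        for j, cy in enumerate(y, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cx != cy)))   # substitution
        prev = curr
    return prev[-1]

def row_distance(row_x, row_y):
    """Sum of per-field Levenshtein distances over the N=8 columns.
    A total of 0 means the two rows are identical duplicates."""
    assert len(row_x) == len(row_y)
    return sum(levenshtein(a, b) for a, b in zip(row_x, row_y))
```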
36. The textual dataset divisions
Source | Before Levenshtein | After Levenshtein
Twitter | 129,012 | 106,583
Google | 890,342 | 53,870
News | 4,401 | 4,401
37
37. 38
• Algorithms combined in our model:
– Moving Average Convergence Divergence – MACD
– Relative Strength Index – RSI
– Bollinger Band
Stock Market trading
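The three indicators can be sketched in Python (a minimal stand-in; the 12/26, 14, and 20-period defaults are the conventional settings, which the slides do not specify):

```python
import statistics

def ema(prices, span):
    """Exponential moving average with smoothing factor 2/(span+1)."""
    alpha = 2 / (span + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def macd(prices, fast=12, slow=26):
    """MACD line: EMA(fast) minus EMA(slow)."""
    return [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]

def rsi(prices, period=14):
    """Relative Strength Index over the last `period` price changes."""
    deltas = [b - a for a, b in zip(prices, prices[1:])][-period:]
    gains = sum(d for d in deltas if d > 0)
    losses = -sum(d for d in deltas if d < 0)
    if losses == 0:
        return 100.0
    rs = gains / losses
    return 100 - 100 / (1 + rs)

def bollinger(prices, period=20, k=2):
    """Middle band (SMA) with upper/lower bands k std-devs away."""
    window = prices[-period:]
    mid = sum(window) / len(window)
    sd = statistics.pstdev(window)
    return mid - k * sd, mid, mid + k * sd
```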
42. 43
▪ Select all Levenshtein-cleaned data as features for clustering: Symbol Code, Symbol Name, Reuters Code, Sector Name, Index, CEO, Stock Market
▪ Number of clusters: 3
▪ Number of examples: 164,854
▪ Number of feature columns: 9
▪ Training method: elkan
▪ Number of training iterations: 7
▪ Batch size: 164,854
▪ Total training time (seconds): 1.8036
Apply K-Means algorithm for clustering
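A minimal pure-Python Lloyd's k-means conveys the idea (the thesis used the Dato tool with the "elkan" training method; this sketch, with illustrative toy data, is a stand-in rather than that implementation):

```python
import random

def kmeans(points, k, iters=7, seed=0):
    """Plain Lloyd's k-means on numeric feature vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assignment step
            i = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for c, members in enumerate(clusters):   # update step
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    labels = [min(range(k), key=lambda c: sum(
        (a - b) ** 2 for a, b in zip(p, centroids[c]))) for p in points]
    return centroids, labels
```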
43. 44
Case 1: Applying the K-means Algorithm to Cluster
Hold: 127,479 | Buy: 34,772 | Sell: 2,603
Accuracy: 83.3% (Total: 164,854)
44. 45
• We built our GA model as follows:
• Population size: 20
• Iterations: 100
• Selection method: elite selection
• Chromosome length: 164,854
• Chromosome encoding: permutation chromosome
Apply Genetic Algorithm for Cluster
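A toy elite-selection GA sketches the setup (population 20 and 100 iterations as above; the short label chromosome and the fitness function here are illustrative assumptions, not the thesis's 164,854-gene permutation model):

```python
import random

def elite_ga(fitness, length, pop_size=20, iters=100, n_elite=4, seed=1):
    """Minimal elite-selection GA over label chromosomes (0=Hold, 1=Buy, 2=Sell)."""
    rng = random.Random(seed)
    pop = [[rng.randrange(3) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(iters):
        pop.sort(key=fitness, reverse=True)
        elites = pop[:n_elite]                 # elite selection: best survive
        children = []
        while len(children) < pop_size - n_elite:
            a, b = rng.sample(elites, 2)       # parents drawn from the elite
            cut = rng.randrange(1, length)     # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(length)] = rng.randrange(3)  # point mutation
            children.append(child)
        pop = elites + children
    return max(pop, key=fitness)

# Toy fitness: agreement with a known target labelling (an assumption).
target = [0, 0, 1, 2, 0, 1, 0, 2]
best = elite_ga(lambda c: sum(g == t for g, t in zip(c, target)), len(target))
```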
45. 46
Case 2: Applying GA to Cluster
Hold: 146,591 | Buy: 12,001 | Sell: 6,262
Accuracy: 82.5% (Total: 164,854)
46. 47
• In the final clustering case, we combine k-means with initial centroid selection optimization and GA. We take the first 20% of our dataset and, with the help of stock market experts, divide it into our three target classes (Buy, Hold, Sell). We then guide the k-means implementation in the Dato machine learning tool to use these centroid points, so that every starting point comes from a different pre-divided class.
• After the vector is designed so that rows of the same category are grouped together in regions, as mentioned before, the remaining 80% of the dataset is represented in the VRM using the same vector.
Applying k-means with Initial Centroid
Selection Optimization and GA
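The centroid-seeding idea can be sketched as follows: class means computed from the expert-labelled 20% sample become the initial centroids, and any remaining point is assigned to the nearest one (the names and toy data are illustrative):

```python
def seeded_centroids(sample, labels):
    """Initial centroids: the mean vector of each expert-labelled class
    (Buy/Hold/Sell) in the 20% seed sample."""
    groups = {}
    for point, lab in zip(sample, labels):
        groups.setdefault(lab, []).append(point)
    return {lab: tuple(sum(col) / len(pts) for col in zip(*pts))
            for lab, pts in groups.items()}

def assign(point, centroids):
    """Label a point by its nearest seeded centroid (squared distance)."""
    return min(centroids, key=lambda lab: sum(
        (a - b) ** 2 for a, b in zip(point, centroids[lab])))
```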
47. 48
Case 3: Applying k-means with Initial Centroid Selection Optimization and GA
Hold: 155,540 | Buy: 6,562 | Sell: 2,752
Accuracy: 89.31% (Total: 164,854)
49. 50
• We used DEEP v2.0 by Timothy Masters.
• This is a real-domain model.
• There are two unsupervised layers, not including the input:
– The first hidden layer has five neurons
– The last hidden layer has five neurons
• There are two supervised layers, including the output:
– The hidden layer has six neurons
Apply Deep Learning
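A forward pass through the layer sizes just described can be sketched in pure Python (the 9-wide input is an assumption taken from the feature-column count; DEEP v2.0 itself trains these layers, which this sketch does not):

```python
import math
import random

def dense(x, w, b):
    """One fully connected layer with tanh activation."""
    return [math.tanh(sum(xi * wij for xi, wij in zip(x, row)) + bj)
            for row, bj in zip(w, b)]

def make_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] (cf. the initial random range of 1.0)."""
    return ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [rng.uniform(-1, 1) for _ in range(n_out)])

rng = random.Random(0)
# input, two unsupervised 5-neuron layers, supervised 6-neuron hidden
# layer, single output (close-price prediction)
sizes = [9, 5, 5, 6, 1]
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(x):
    for w, b in layers:
        x = dense(x, w, b)
    return x
```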
50. 51
• Autoencoder greedy training parameters:
• Initial annealing iterations for starting weights = 100
• Initial random range for starting weights = 1.00000
• Gradient reduction max iterations = 500
• Gradient reduction convergence tolerance = 0.0000500
• Greedily trained layers are fine-tuned after each addition.
• Unsupervised-section weights remain fixed, not fine-tuned by supervised training.
Training model
51. 52
• Initial annealing iterations for starting weights = 100
• Initial random range for starting weights = 1.00000
• Supervised optimization max iterations = 1000
• Supervised optimization convergence tolerance = 0.0000500
Supervised layer(s) training parameters
52. 53
Rule template 1:
If point X in C1 then
Close Price = <Predict Deep>
and Decision = Hold, Buy, Sell
with CF: Hold = 0.7, Buy = 0.2, Sell = 0.1
Create Rule-Base Model
53. 54
Rule template 2:
If point X in C2 then
Close Price = <Predict Deep>
and Decision = Hold, Buy, Sell
with CF: Hold = 0.6, Buy = 0.2, Sell = 0.2
Create Rule-Base Model
54. 55
Rule template 3:
If point X in C3 then
Close Price = <Predict Deep>
and Decision = Hold, Buy, Sell
with CF: Hold = 0.5, Buy = 0.3, Sell = 0.2
Create Rule-Base Model
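The three templates can be collapsed into one lookup table (the CF values are copied from the slides; the third template is read as cluster C3, since the slide appears to repeat C2):

```python
# Certainty factors from the three rule templates, keyed by cluster.
RULES = {
    "C1": {"Hold": 0.7, "Buy": 0.2, "Sell": 0.1},
    "C2": {"Hold": 0.6, "Buy": 0.2, "Sell": 0.2},
    "C3": {"Hold": 0.5, "Buy": 0.3, "Sell": 0.2},
}

def fire(cluster, predicted_close):
    """Fire the template matching the point's cluster: return the deep
    model's close-price prediction with the CFs over the three decisions."""
    return predicted_close, RULES[cluster]
```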
56. 57
• In order to evaluate and test our proposed framework in the real world, we ran it for 10 consecutive working days to predict EGS02211C018 [Egypt for Poultry] and recorded all results from both the market and our proposed framework.
Evaluation method
62. Testing accuracy
Market Indication vs Model Results
Market Indication: Hold: 4 | Buy: 3 | Sell: 3
Model Results: C1 Hold: 5 | C2 Buy: 3 | C3 Sell: 2
63
• C1: Recommend Hold
• C2: Recommend Buy
• C3: Recommend Sell
63. 64
The Cluster Average Accuracy
Cluster case: 1 | 2 | 3
Avg accuracy: 83.30% | 82.50% | 89.31%
[Bar chart: average accuracy per cluster case]
64. 65
The Cluster Average Sums of Squared Distances
Cluster case: 1 | 2 | 3
Avg SSD: 419.2805488 | 610.4761191 | 31.0731798
[Bar chart: average SSD per cluster case]
65. 66
The Cluster Average Number of K-means Iterations
Cluster case: 1 | 2 | 3
Avg iterations: 3.97 | 3.57 | 1.78
[Bar chart: average number of iterations per cluster case]
67. 68
• Built the first Egyptian stock market ontology, based on our customized methodology.
• Built a robust framework for stock market prediction based on deep learning technology and data mining.
• Combined the most-used prediction algorithms into one result, used nowadays at HSBC Securities and other Egyptian brokers.
Contribution
69. 70
• Our newly proposed framework positively affected Egyptian market liquidity, as the 10-day live run shows.
• Results show that social media data, especially Twitter, has a positive effect on trading on the same day and the next trading day; this relation would strengthen if we added the Arabic language.
• Clustering using k-means guided by GA gave more accurate results than GA or k-means separately.
Conclusion
71. 72
Add the Arabic language to the prediction data.
Automatically extract data from the ontology.
Increase the proposed model's accuracy by adding more sources.
The thesis work could be extended in these directions
73. 74
Published Work
• Enhancing Stock Prediction Clustering Using K-Means with Genetic Algorithm @ ICENCO 2017, 13th International Computer Engineering Conference, December 2017
• An Intelligent Framework using Hybrid Social Media and Market Data, for Stock Prediction Analysis @ International Journal of Computer and Information Technology (ISSN: 2279-0764), Volume 05, Issue 03, May 2016
74. 75
• This research was supported by Egypt for Information Dissemination (EGID) and the Egyptian Stock Exchange (EGX), with constant consulting and support.
Acknowledgements