Algorithmic Trading using the Deutsche Börse Public Dataset
Marjan Ahmed
Dept of Computer Science
University of Illinois
at Urbana-Champaign
Vancouver, BC, Canada
marjana2@illinois.edu
Fan Yang
Dept of Computer Science
University of Illinois
at Urbana-Champaign
Palo Alto, CA, United States
fanyang3@illinois.edu
Dilruba Hawk
Dept of Computer Science
University of Illinois
at Urbana-Champaign
New York, NY, United States
dilruba2@illinois.edu
Wang Chun Wei
Dept of Computer Science
University of Illinois
at Urbana-Champaign
Brisbane, QLD, Australia
wcwei2@illinois.edu
Abstract—This report details the findings, implementation and research method of the team's algorithmic trading model, developed using machine learning frameworks on AWS cloud computing infrastructure. GitHub project website: https://github.com/weiwangchun/cs498cca
I. INTRODUCTION
We intend to build a cloud hosted backtesting framework for
algorithmic trading. The objective is to analyze a given data set
of equities containing price-volume data and develop medium
to high frequency strategies for trading equities quickly via
cloud computing. A simple web interface will be constructed
allowing users to control and tweak trading parameter settings,
and see in-sample and out-of-sample trading performance (i.e., hypothetical profit and loss, trading costs, turnover, portfolio risk and Sharpe ratio).
Given the limitations of our dataset, we focus on using a
feature set that includes (i) stock returns, (ii) implied stock
volatility, (iii) order imbalance, (iv) trading volumes and (v)
number of trades to predict future stock prices. This information is all publicly available, and hence if the German market were at least weakly efficient, trading profitably using
this subset of information would be difficult. For a given
stock at a given time, we compute its most recent feature
set, its long run average feature set and also its feature set
percentile compared to its historical feature sets (i.e., how
does it compare to history) and lastly compute its feature set
percentile compared to its peers’ feature sets (i.e., how does it
compare to its peers). We attempt to use both a simple logistic regression model and a neural network to predict future stock prices.
Our results are mildly successful, but after accounting for transaction costs, we conclude that we are unable to construct a commercially viable trading strategy.
II. DATASET
The chosen dataset is the Deutsche Börse public dataset available on Amazon Web Services (AWS). It contains trading data in 2-minute intervals for every tradeable security
listed on the Eurex and Xetra trading platforms located
in Frankfurt, Germany. The dataset is circa 3 GB in size with over 18,000 files and is located at https://s3.eu-central-1.amazonaws.com/deutsche-boerse-xetra-pds/. In aggregate,
we collected data from the 2nd of January 2018 to the 1st
of April 2019 (314 unique trading days). In total, this had
18,854,026 rows. Each row represented a 2 minute trading
interval for a particular stock. There were 2,837 unique stocks
in the dataset. A snapshot of the dataset is shown in Figure 3.
III. TECHNOLOGIES AND TOOLS
To implement the objective of this project, we have chosen
to leverage the vast computing and storage resources offered
by AWS. The technologies used in this project are as follows:
• Simple Storage Service (S3) for storing the dataset and
outputs.
• Elastic Cloud Compute (EC2) for initial exploratory
analysis and model development.
• Elastic MapReduce (EMR) for deployment in production.
• Route 53 for the front-end web interface.
A step-by-step, methodical approach was implemented: creating an AWS account; creating users (team members); creating roles and permissions for team members and applications using Identity and Access Management (IAM); downloading the data into the created S3 bucket using Python helper functions; and then using PySpark, built on AWS EMR clusters, for exploratory analysis of the data. The EMR cluster creation process involved creating a 3-node cluster. Selecting PySpark for model development led to the inclusion of the relevant applications, in this case PySpark and JupyterHub. JupyterHub supports line- and block-level execution of code with results displayed as we go, helping us explore the data and keep track of every outcome. This process lets us test different predictive models and algorithms and fine-tune the process of developing the final model before production deployment.
After the model is finalized, we intend to develop a front-end web interface using Amazon Route 53. A domain named datainsight.guru was chosen, where users can choose tickers and control and tweak trading parameters to see the outcome of the predictive model.
After running the dataset through the EMR cluster on AWS, we met to discuss the limitations imposed on us by the cost of using AWS. The free tier account and the credit issued to us were exhausted. As a result, we looked into other platforms that would help us complete the project without incurring further cost.
Progressing through the Cloud Computing course helped us gain more knowledge of PySpark, and after research we found Databricks Community Edition to be a viable platform for performing the additional distributed computing required to complete the project. The Databricks platform provides Jupyter-style notebooks with Python, Scala and Java environments, in addition to the convenience of running SQL queries on uploaded data and creating Spark DataFrames directly from SQL queries.
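As a minimal sketch of that workflow (assuming the 2-minute records have been registered as a table named xetra; the column names follow the public dataset's CSV headers):

# Databricks notebooks expose a ready-made `spark` session; the table
# name "xetra" is our assumption for illustration.
df = spark.sql("""
    SELECT ISIN, Date, Time, StartPrice, EndPrice,
           MinPrice, MaxPrice, TradedVolume, NumberOfTrades
    FROM xetra
    WHERE Date >= '2018-01-02'
""")
df.createOrReplaceTempView("xetra_clean")   # reusable in later SQL queries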
IV. METHOD DESIGN
In terms of the overall methodology, we choose S3 to store the stock price data and PySpark, built on AWS EMR clusters (we created a 3-node cluster following the AWS cluster creation process), to analyze the data. We use the boto3 package in Python to upload stock price data from the Deutsche Börse public dataset to S3 (see our code on GitHub). Lastly, we use EMR Notebook to run PySpark jobs. Announced in November 2018, EMR Notebook is essentially a Python Jupyter notebook pre-configured to connect to EMR and S3.
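A hedged sketch of the upload step (not our exact helper; it assumes the public objects are keyed by date prefix, and the destination bucket name is hypothetical):

import boto3

SRC_BUCKET = "deutsche-boerse-xetra-pds"   # public Xetra dataset bucket
DEST_BUCKET = "cs498cca-xetra"             # hypothetical destination bucket

s3 = boto3.client("s3", region_name="eu-central-1")

def copy_day(day: str) -> None:
    """Copy all 2-minute files for one trading day (e.g. '2018-01-02')."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=day):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"].read()
            s3.put_object(Bucket=DEST_BUCKET, Key=obj["Key"], Body=body)

copy_day("2018-01-02")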
The general process of the project is as follows:
1) Stock price-volume data is uploaded on S3
2) User (via web interface) selects the time horizon and
trading parameters (i.e., trading costs, trading frequen-
cies, lookback period, size of trading portfolio, ...)
3) We iterate through the time period provided by the user,
at each point t:
• A training set is constructed based on a lookback
period (k), i.e. from t − k to t − 1
• A ‘trading model’ (described in the next section)
will be fitted / applied to the training set
• Stocks will be bought and sold in the test set (time t)
• Profitability will be recorded.
4) Results will be displayed back to the user via the web interface (a skeleton of this loop is sketched below)
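The loop in step 3, in plain Python; fit_model, trade and pnl are placeholders for the trading model described in the next section:

def backtest(dates, k, fit_model, trade, pnl):
    """dates: user-selected horizon; k: lookback length in days."""
    results = []
    for t in range(k, len(dates)):
        window = dates[t - k : t]                 # lookback from t-k to t-1
        model = fit_model(window)                 # fit the 'trading model'
        positions = trade(model, dates[t])        # buy and sell in the test set
        results.append(pnl(positions, dates[t]))  # record profitability
    return results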
A. Trading Model
Initially, we opted to examine momentum strategies (Je-
gadeesh and Titman, 1993) on the German market. However,
given that the Deutsche Börse dataset has a relatively short history (circa 2 years), we decided instead to focus on higher fre-
quency trading strategies. In particular, we will focus on
implementing pairs trading (Vidyamurthy, 2004; Zhang and
Zang, 2008) on relatively high frequency equities data.
Over user-selected trading frequencies (highest frequency = 2 minutes, lowest frequency = 1 day), we systematically
search for cointegrated pairs of stocks on the Xetra and Eurex
stock exchanges using both the Engle-Granger and Johansen
methods. Our trading strategy involves simultaneously buying
and selling top cointegrated pairs whose spread has diverged,
i.e., betting on convergence.
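A hedged sketch of the Engle-Granger pair scan (the Johansen test is available separately, e.g. via statsmodels' coint_johansen); prices is assumed to be a pandas DataFrame of price series, one column per stock, sampled at the chosen frequency:

import itertools
from statsmodels.tsa.stattools import coint

def find_pairs(prices, alpha=0.05):
    """Return candidate cointegrated pairs, strongest (lowest p-value) first."""
    pairs = []
    for a, b in itertools.combinations(prices.columns, 2):
        t_stat, p_value, _ = coint(prices[a], prices[b])  # Engle-Granger test
        if p_value < alpha:
            pairs.append((a, b, p_value))
    return sorted(pairs, key=lambda x: x[2])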
Our trading model incorporated both time-series and cross-
sectional data. We decided to focus on a daily frequency
trading model. Since our data was at 2-minute frequency, we convert the raw data into the following key features:
1) Total Volume: an aggregation of 2-minute frequency
volume for each stock on a daily basis
2) Total Trades: the aggregate number of trades made on
a given stock per day
3) VWAP: daily volume weighted average price (granular-
ity at the 2-minute frequency)
4) Daily Return: the total return made on the stock per
day
5) Realized Volatility: the absolute difference between the
maximum price and the minimum price on a stock on a
given day
6) Order Imbalance: a daily measure of trade imbalance, denoted as

OI = |B − S| / (B + S)

where B is the total number of buy initiated trades
and S is the total number of sell initiated trades. We
approximate buy/sell initiation as follows: if in a given
2 minute interval, the end price is greater than the start
price, the trades within that interval are classified as buy
(and vice versa).
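A PySpark sketch of the daily aggregation; the column names assume the dataset's CSV headers, and F.first/F.last assume the 2-minute bars are ordered within each day:

from pyspark.sql import functions as F

# Classify each 2-minute bar's trades as buy- or sell-initiated using the
# end-price vs start-price heuristic described above.
buy = F.when(F.col("EndPrice") > F.col("StartPrice"),
             F.col("NumberOfTrades")).otherwise(F.lit(0))
sell = F.when(F.col("EndPrice") < F.col("StartPrice"),
              F.col("NumberOfTrades")).otherwise(F.lit(0))

daily = df.groupBy("ISIN", "Date").agg(
    F.sum("TradedVolume").alias("total_volume"),              # 1) Total Volume
    F.sum("NumberOfTrades").alias("total_trades"),            # 2) Total Trades
    (F.sum(F.col("EndPrice") * F.col("TradedVolume"))
        / F.sum("TradedVolume")).alias("vwap"),               # 3) VWAP
    (F.last("EndPrice") / F.first("StartPrice") - 1)
        .alias("daily_return"),                               # 4) Daily Return
    (F.max("MaxPrice") - F.min("MinPrice"))
        .alias("realized_vol"),                               # 5) Realized Volatility
    (F.abs(F.sum(buy) - F.sum(sell))
        / (F.sum(buy) + F.sum(sell)))
        .alias("order_imbalance"))                            # 6) OI = |B-S|/(B+S)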
The features above alone are not enough to build a predictive model. We extend them by looking at four dimensions of
these features. To predict today’s returns of a given stock,
we analyze the following:
1) Yesterday’s feature set of given stock
2) Historical long run average feature set of given stock
3) Yesterday's feature set as a percentile rank compared to the given stock's historical feature sets (a gauge for anything abnormal)
4) Yesterday’s feature set as a percentile rank compared
to yesterday’s peer feature set (compare this stock’s
features against all other stocks)
We excluded VWAP as a predictive feature, leaving us with 5
× 4 (dimensions) = 20 features for classification modelling.
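A pandas sketch (for clarity) of the four dimensions for a single feature column; it assumes daily is the per-stock daily feature table built above, with ISIN and Date columns, and requires pandas >= 1.4 for expanding rank:

import pandas as pd

def expand_feature(daily: pd.DataFrame, col: str) -> pd.DataFrame:
    daily = daily.sort_values(["ISIN", "Date"]).copy()
    g = daily.groupby("ISIN")[col]
    daily[f"{col}_lag1"] = g.shift(1)                        # 1) yesterday's value
    daily[f"{col}_longrun"] = g.transform(
        lambda s: s.shift(1).expanding().mean())             # 2) long-run average
    daily[f"{col}_hist_pct"] = g.transform(
        lambda s: s.shift(1).expanding().rank(pct=True))     # 3) vs own history
    peer = daily.groupby("Date")[col].rank(pct=True)         # cross-sectional rank
    daily[f"{col}_peer_pct"] = peer.groupby(daily["ISIN"]).shift(1)  # 4) vs peers
    return daily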
We classify stock returns into three categories for classifi-
cation:
1) 0 Negative: returns < -0.001
2) 1 Neutral: -0.001 ≤ returns ≤ +0.001
3) 2 Positive: returns > +0.001
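These thresholds map directly onto a label column; a sketch, assuming daily is the aggregated feature table from the earlier PySpark sketch:

from pyspark.sql import functions as F

labeled = daily.withColumn(
    "label",
    F.when(F.col("daily_return") < -0.001, 0)   # 0: Negative
     .when(F.col("daily_return") > 0.001, 2)    # 2: Positive
     .otherwise(1))                             # 1: Neutral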
We run a logistic regression model to train the features on the classification labels. We also tried implementing a neural network model (and a simpler 2-layer fully connected model); however, these did not outperform the simpler logistic model.
Our trading strategy is simple. At the beginning of a given
trading day, we use historical features to try and predict the
total returns for all the stocks at the end of the day. For stocks
where we predict a negative outcome, we short (short sell) them with a k percent weight. For stocks where we predict a positive outcome, we long (buy) them with a k percent weight. For stocks where we predict a neutral outcome, we refrain from investing. We define k = 1 / (number of stocks); hence each position is equally weighted. For our benchmark strategy, we invest (long) a k percent weight in all of the stocks, i.e., a classic equal-weighted portfolio. We will attempt to beat this equal-weighted portfolio with our algorithmic trading strategy.
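A sketch of the position sizing for one day, with preds the predicted classes and returns the realized end-of-day returns for all stocks:

import numpy as np

def strategy_return(preds: np.ndarray, returns: np.ndarray) -> float:
    k = 1.0 / len(preds)                              # equal weight per stock
    weights = np.where(preds == 2, k,                 # long predicted positives
                       np.where(preds == 0, -k, 0.0)) # short predicted negatives
    return float(weights @ returns)

def benchmark_return(returns: np.ndarray) -> float:
    return float(returns.mean())                      # equal-weighted long-only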
V. EVALUATION
A. EMR PySpark Setup
Since the main objective of CS 498 Cloud Computing Applications is to understand cloud computing applications and their successful deployment, our preliminary evaluation emphasizes demonstrating our ability to successfully create and deploy the cloud computing technologies listed in the Method Design section of this report.
The stages of deployment that we have completed so far:
1) Using IAM to create users and apply roles and permissions for accessing the applications
2) Creating roles to enable application and user access to applications
3) Creating an S3 bucket for data storage and access
4) Creating an IAM role for S3 bucket access by the other applications used in this project
5) Creating the EMR cluster (see Figure 1)
6) Configuring the Python Jupyter notebook to connect to EMR and S3
7) Running a PySpark job on the Jupyter notebook (see Figure 2)
B. DataBricks Model Setup
Following team discussion on using DataBricks in the
Technologies and Tools section of this report, a DataBricks
community edition account was created to build a prediction
model (logistic regression) for the stock data. Databricks
provided a fully loaded environment of Python and SQL along
with its east to use internal DBFS (Data Bricks File System)
where the data was uploded from S3 using secure access keys.
The predictive logistic regression model was set up in the following steps:
1) We import the high-frequency trade data from S3. The display and initial analysis of the data are illustrated in Figure 3.
2) Data analysis and wrangling are done using Spark SQL.
3) Variables for the logistic model are selected, and the features and labels are vectorized using Spark's VectorAssembler. This is illustrated in Figure 4.
4) We split the data into training and testing sets and fit the training data with Spark's LinearRegression framework, as illustrated in Figure 5. (For our final results, we implement a logistic model rather than a linear regression model.)
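A condensed, hedged sketch of steps 3 and 4, assuming labeled carries the 20 engineered features plus the three-way label column; only two feature names are spelled out (as in Figure 4), and the split ratio is our assumption:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(
    inputCols=["total_volume_lag1", "order_imbalance_lag1"],  # ...plus 18 more
    outputCol="features")
data = assembler.transform(labeled).select("features", "label")

train, test = data.randomSplit([0.8, 0.2], seed=42)
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)
print(model.evaluate(test).accuracy)   # out-of-sample accuracy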
VI. RESULTS
A. Random Batch Backtests
We randomly sample 100 stocks from our database to run our backtest across the full history. We found our logistic regression model to be lacklustre in performance. In our training sample, it yielded an accuracy of 52.5%. The confusion matrix is as follows:
TABLE I
IN-SAMPLE CONFUSION MATRIX (TRAINING) - RANDOM BATCH

                          Actual
Predicted     Negative   Neutral   Positive
Negative           603       105        551
Neutral            137      9622        164
Positive          4884      7641       4693
This degraded to 43.2% in the test set. Overall, we did not see a significant difference in performance between our model and the benchmark long-only trading strategy. Training the logistic model on a small random sample of 100 stocks is therefore suboptimal. In the next section, we train our model on all 2,837 stocks.

Fig. 1. EMR Cluster

Fig. 2. Running PySpark on Jupyter Notebooks
TABLE II
OUT-OF-SAMPLE CONFUSION MATRIX (TEST) - RANDOM BATCH

                          Actual
Predicted     Negative   Neutral   Positive
Negative            35         7         34
Neutral             14       420         10
Positive           401       670        409
B. Full Sample Backtests
We then ran our prediction model across all 2,837 stocks over all the dates (2 January 2018 to 1 April 2019) in the public dataset. Our logistic regression model performed marginally better: in our training sample it yielded an accuracy of 67.5%, and in our test sample an accuracy of 60.0%. We find that whilst it is better at predicting neutral states and reasonable at predicting negative states, it has a poor record at predicting positive states.
We plot the performance of our model against the benchmark, which is an equal-weighted portfolio of all the stocks listed on the Deutsche Börse. We find reasonably good performance of our model in-sample (on the training set), despite it not being very good at predicting positive movements. This is in part due to the fact that the German DAX had a poor run during 2018. Our trading strategy also outperformed in the test set, yielding a portfolio with lower volatility and higher returns than the benchmark equal-weighted portfolio. However, these results do not include trading costs. Since we are trading on a daily basis, it is most likely that these results would become negative after applying broker transaction costs.

Fig. 3. Data Import from Amazon S3 storage

TABLE III
IN-SAMPLE CONFUSION MATRIX (TRAINING) - FULL BACKTEST

                          Actual
Predicted     Negative   Neutral   Positive
Negative         26315      4557      23716
Neutral         115279    516430     116790
Positive          1220       254       1147

TABLE IV
OUT-OF-SAMPLE CONFUSION MATRIX (TEST) - FULL BACKTEST

                          Actual
Predicted     Negative   Neutral   Positive
Negative          1680       322       1545
Neutral           9783     32312      10951
Positive            78        12         57
Fig. 4. An example of Spark's VectorAssembler generating features. Here we construct only two features.
Fig. 5. An illustration of fitting a model in PySpark
Fig. 6. Training and Test - Profit and Loss (Random Batch of 100 stocks)
TABLE V
PORTFOLIO PERFORMANCE

                            Training Set              Test Set
                       Portfolio  Benchmark    Portfolio  Benchmark
Mean                      1.79%     -2.47%        1.47%      0.08%
Standard Deviation        0.73%      2.62%        0.56%      2.01%
Sharpe Ratio               2.45      -0.94         2.63       0.04
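The Sharpe ratios in Table V are consistent with the mean period return divided by its standard deviation, i.e., taking the risk-free rate as zero:

# Sharpe ratio as reported in Table V (risk-free rate taken as zero).
sharpe = lambda mean, std: mean / std
print(round(sharpe(1.79, 0.73), 2))    # 2.45: training-set portfolio
print(round(sharpe(-2.47, 2.62), 2))   # -0.94: training-set benchmark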
Fig. 7. Training and Test - Profit and Loss (Full Sample)
VII. DISCUSSION
We find moderate success in predicting stock price move-
ments using a feature set that included returns, volatility,
volume and order imbalance. However, our backtests are only
profitable before transaction costs, and are unlikely to work
in real-life trading. Generally speaking, the Deutsche Börse is a relatively efficient market, and it is therefore unreasonable to expect to trade profitably by simply examining past trading statistics as features.
Much of the team discussion focused on selecting the appropriate technologies to help us achieve our objective of developing a successful application that is ready for real-world usage. Cost and the computing power of the technologies were also items of discussion. AWS can be quite expensive if we don't select the proper technologies to execute the required tasks efficiently. For example, the PySpark framework built on EMR was chosen for efficiency and speed, which not only helps us reduce cost but also delivers fast results to users. As the project progressed, the team discussions focused on coming up with new strategies to build an effective model for the stock dataset and figuring out what other technologies or platforms could be used to keep the project cost in check. The team collaboratively agreed on using the Databricks Spark platform, which would help avoid further AWS costs. Subsequent discussions and meetings also resulted in dropping Amazon's Route 53 for cost-prohibitive reasons.
VIII. FUTURE WORK
The project's initial setup, and testing of the applications' seamless integration with each other, has been completed. Now that the framework and infrastructure are in place, the next steps include creating or selecting the appropriate predictive algorithms to help us build a successful model. After the model is successfully created, the project will move into the production phase, with the interconnected applications checked for seamless integration and the front-end user interface developed using Amazon's Route 53. As mentioned in the Technologies and Tools and Discussion sections of this report, selecting new technologies and additional models led to building linear regression and logistic regression models and developing PySpark code in Databricks PySpark notebooks.
IX. DIVISION OF WORK
Fan Yang and Wang Chun Wei contributed to developing the Python helper code for connecting to S3 and downloading the data into it. Drawing on their prior machine learning knowledge, and in collaboration with Marjan Ahmed and Dilruba Hawk, the dataset was chosen and the initial model development strategy, such as selecting the appropriate machine learning model (for example, regression), was decided. Marjan Ahmed and Dilruba, with input from Fan Yang and Wang Chun Wei, created the Amazon account; created users, roles and permissions using IAM; and set up EC2 and S3. Fan Yang also tested the Jupyter notebook and ran a computing job using the EMR cluster. Dilruba will help with developing the front-end web application using Route 53 for this project.
Selecting new technologies and additional models led to redistributing the workload based on team members' familiarity with each platform. Wang Chun and Fan Yang contributed significantly to developing and testing the machine learning models (logistic regression and neural networks). Marjan Ahmed, with the help of Fan Yang and Dilruba Hawk, developed a linear regression and a logistic regression model on the Databricks platform. In the end we chose the logistic model for the final backtests. Fan Yang's expert knowledge of AWS helped Marjan Ahmed connect the Databricks platform to S3 using secure credentials. Marjan's prior experience in Apache Spark helped build the predictive model.
REFERENCES
[1] Jegadeesh, N., and S. Titman (1993) Returns to buying winners and selling losers: Implications for stock market efficiency, Journal of Finance, 48, 65-91.
[2] Vidyamurthy, G. (2004) Pairs Trading: Quantitative Methods and Analysis, New Jersey: John Wiley & Sons.
[3] Zhang, H. and Zang, Q. (2008) Trading a mean-reverting asset: Buy low and sell high, Automatica, 44, 1511-1518.