Real Time Processing of Trade Data with Kafka, Spark Streaming and Aerospike on Prem and in Cloud
Architecture Design Series
Mich Talebzadeh
October 2019
Author
Mich Talebzadeh, Big Data, Cloud and Financial Fraud IT Specialist
Ph.D. in Particle Physics, Imperial College of Science and Technology, University of London
Past clients: Investment Banks, Barclaycard, Credit Suisse, Mizuho International Plc, HSBC, Fidelity, Bank of America, Deutsche Bank, etc.
E-mail: mich@Peridale.co.uk
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2?
33 LinkedIn articles on Big Data
Introduction
• I have been publishing articles adhering to the Lambda Architecture with several different databases; an example is given here, from back in 2016.
• With the advent of Digital Transformation, things have moved on. With better and improving performance from newer offerings, organisations are looking at using data warehouses in real time. Periodic loading of data into data warehouses has given way to continuous loading in real time. This has been made possible through new additions to Big Data and Cloud platforms, delivering enhanced capability and better throughput.
Introduction
The notion of what constitutes real time plays a role here. The engineering definition of real time is roughly fast enough to be interactive. However, I use a stronger definition:
• In a real time application, there is no such thing as an answer that is late yet correct. Timeliness is part of the application.
• If we get the right answer too slowly, it becomes useless or wrong. We also need to be aware that latency trades off against throughput.
Introduction
• Within a larger architecture, latency is often dictated by the lowest common denominator, which frequently does not adhere to our definition of low latency. For example, Kafka as widely deployed today in Big Data architectures is micro-batch. A moderate-latency message queue (Kafka) plus a low-latency processor still equals a moderate-latency architecture. Hence, the low latency architecture must be treated within that context.
Introduction
• The challenge in such an architecture is that everything must not only work, but work seamlessly. The more links in the chain from source to the engagement of data, the more points of vulnerability.
• That is the reason that redundancy and fault tolerance need to be part of the solution design and built in from the start.
Data Sources
• To stream trade data we rely on historical trade data available from Google Finance.
• For each day, the average of the high and low prices was taken; these averages were stored as arrays and sampled at random at run time to create random prices for a given ticker (a producer sketch follows the ticker list below).
• A ticker is used to uniquely identify publicly traded shares of a particular stock.
• In this case we chose 10 different tickers, shown below, to generate random prices as JSON data:
IBM MRW MSFT ORCL SAP SBRY TSCO VOD BP MKS
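As an illustration, a minimal Scala sketch of such a generator publishing JSON prices to Kafka is shown below. This is a sketch under assumptions: the topic name ("md"), broker hosts, array contents and JSON layout are hypothetical, not the exact code used in this work.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PriceGenerator {
  // Hypothetical per-ticker arrays of daily (high + low) / 2 averages
  val prices: Map[String, Array[Double]] = Map(
    "IBM"  -> Array(184.54, 189.20, 193.54),
    "MSFT" -> Array(53.69, 53.78, 54.10)
    // ... remaining tickers
  )
  val rnd = new scala.util.Random

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka1:9092,kafka2:9092") // assumed broker hosts
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    while (true) {
      for ((ticker, arr) <- prices) {
        val price = arr(rnd.nextInt(arr.length)) // sample a stored average at random
        val json = s"""{"ticker":"$ticker","timeissued":"${java.time.LocalDateTime.now}","price":$price}"""
        producer.send(new ProducerRecord[String, String]("md", ticker, json))
      }
      Thread.sleep(1000) // one batch of prices per second
    }
  }
}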
Zookeeper setup
1. Zookeeper's role is the coordination service for distributed systems.
2. Three physical nodes were deployed to create a Zookeeper ensemble; a collection of Zookeeper servers forms an ensemble.
3. We used three Zookeeper servers. In a Zookeeper ensemble, all servers must know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. A sample ensemble configuration is sketched below.
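A minimal sketch of such an ensemble configuration (zoo.cfg) for three servers follows; the hostnames and the data directory are hypothetical. Each server additionally needs a myid file in its data directory containing its id (1, 2 or 3).

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper/data
clientPort=2181
# The three ensemble members (hypothetical hostnames); ports 2888/3888
# carry follower connections and leader election traffic respectively
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888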
Kafka setup
Apache Kafka is a publish/subscribe streaming platform that has three key capabilities, namely:
1. Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system
2. Store streams of records in a fault-tolerant, durable way
3. Process streams of records as they occur.
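For example, a topic for the streaming prices could be created against the Zookeeper ensemble as below; the topic name, partition count and replication factor are assumptions for illustration.

kafka-topics.sh --create \
  --zookeeper zk1:2181,zk2:2181,zk3:2181 \
  --replication-factor 2 \
  --partitions 3 \
  --topic md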
Kafka setup
Kafka architecture consists of the following
components:
• A stream of messages of a particular type is
defined as a topic.
• A Message is defined as a payload of bytes
and a Topic is a category or feed name to
which messages are published.
• A Producer can be anything that can publish
messages to a Topic.
Kafka setup
• The published messages are then stored on a set of servers called Brokers (the Kafka cluster).
• A Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers. The Zookeeper and Kafka broker ensemble is shown below.
Zookeeper and Kafka ensemble
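On the consuming side, the Spark Streaming job subscribes to the Kafka topic served by this ensemble. A minimal Scala sketch using the spark-streaming-kafka-0-10 integration follows; the broker addresses, group id and topic name are assumptions.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object MdStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MdStreaming")
    val ssc = new StreamingContext(conf, Seconds(2)) // 2-second batch interval, as used in this work

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka1:9092,kafka2:9092", // assumed broker hosts
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "md_streaming",            // assumed group id
      "auto.offset.reset"  -> "latest"
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Array("md"), kafkaParams)
    )
    // Each record value is one JSON price message for a ticker
    stream.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}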
Spark Deployment Modes
Spark can be deployed either locally or in a distributed manner on a cluster.
• In Local mode, Spark runs in a single JVM. This is the simplest run mode and provides no performance gains.
• In a distributed environment, Spark runs in a master-slave architecture. There are multiple worker nodes in each cluster, and the cluster manager allocates resources to each worker node.
Spark Deployment Modes
Spark can be deployed on a distributed cluster in three ways:
• Standalone - meaning Spark manages its own cluster
• YARN - using Hadoop's YARN resource manager
• Mesos - Apache's dedicated resource manager project
YARN Deployment Modes
• The deployment mode of Spark simply determines where the driver program runs. There are two modes, namely Client mode and Cluster mode (example submissions follow this list).
• In Client mode, the driver runs on the node from which you submit the Spark job to your cluster, often the Edge Node. This mode is valuable when you want to use Spark interactively, as in our case where we would like to display high value prices on the dashboard. In Client mode you do not need to reserve any resources in your cluster for the driver.
• In Cluster mode, you submit the Spark job to your cluster and the driver runs inside your cluster, within the application master. In this mode you do not get to use the Spark job interactively, as the client through which you submit the job is gone as soon as it successfully submits the job to the cluster. You will have to reserve some resources for the driver process, as it will be running in your cluster.
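For illustration, the same application could be submitted to YARN in either mode; the class name, JAR and resource figures below are hypothetical.

# Client mode: driver runs on the edge node, suitable for the interactive dashboard
spark-submit --master yarn --deploy-mode client \
  --num-executors 2 --executor-memory 4G \
  --class md.MdStreaming md_streaming.jar

# Cluster mode: driver runs inside the cluster within the application master
spark-submit --master yarn --deploy-mode cluster \
  --num-executors 2 --executor-memory 4G --driver-memory 2G \
  --class md.MdStreaming md_streaming.jar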
The architectural approach
• In this design we collect trades in real time in storage (Aerospike) and process them in real time through Spark. This contrasts with the traditional Lambda architecture, where the trades are collected in batch mode and processed in real time through Spark.
• The diagram below explains this approach:
The architectural approach
The Process Flow and the Decision tree
1. Live prices are streamed in through Kafka topics, identified as (1), in JSON format
2. The Kafka cluster (2) processes these trades and delivers them to Spark Streaming (3)
3. Kafka Connect for Aerospike passes these prices to the Aerospike batch set as well (10)
4. These prices are put into the Aerospike batch set in real time (11)
5. Back in Spark, prices are processed individually through the Decision Engine (5) and, depending on the ticker, a decision to buy or sell is made
6. To enable this decision process, current prices for each ticker (security) are obtained from the Aerospike batch set. The definition of current depends on how far back the stats (i.e. the closing prices) are needed. For example, a simple moving average is calculated by adding the closing price of the ticker for several time periods (for example 14) and then dividing this total by that same number of periods. In general, short-term averages respond quickly to changes in the price of the underlying ticker, while long-term averages are slow to react. Other statistical parameters like the MIN, MAX and Standard Deviation (STDDEV) of prices can also be calculated for the same time periods, as sketched below.
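As a sketch of the statistics in step 6: given a DataFrame holding the last 14 stored prices for one ticker from the batch set (batchDF here is a hypothetical name, with a price column), the figures could be computed with Spark SQL functions as below.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// batchDF: the last 14 closing prices for one ticker (hypothetical)
def priceStats(batchDF: DataFrame): DataFrame =
  batchDF.agg(
    avg("price").as("sma14"),        // simple moving average: sum of 14 closes / 14
    min("price").as("minPrice"),
    max("price").as("maxPrice"),
    stddev("price").as("stddevPrice")
  )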
The Process Flow and the Decision tree
7. The Decision Engine can literally be a few lines of code embedded within the Scala program, or a pre-defined package in the form of one or more JAR files provided to Spark (a minimal sketch follows this list). Far more sophisticated models can be made available to the Decision Engine if needed.
8. The streaming topic in Spark consists of a JSON record for each ticker. We need to loop through the records and work out if the ticker price satisfies the condition to buy or sell.
9. Those prices that do not satisfy the condition are simply ignored.
10. Those prices that satisfy the condition (high value prices) are notified via the Real Time Dashboard (6-8).
11. These high value prices are then saved to the operational database, the Aerospike real time set (7), for future reference.
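A minimal sketch of such a Decision Engine (steps 7-10) follows. The buy/sell rule, comparing the streamed price against hypothetical per-ticker thresholds, illustrates the idea rather than reproducing the exact model used in this work.

// Hypothetical thresholds per ticker, e.g. derived from the batch-set statistics
case class Thresholds(buyBelow: Double, sellAbove: Double)

def decide(ticker: String, price: Double, t: Thresholds): Option[String] =
  if (price <= t.buyBelow)
    Some(f"*** price for ticker $ticker is GBP$price%.2f <= GBP${t.buyBelow}%.2f BUY ***")
  else if (price >= t.sellAbove)
    Some(f"*** price for ticker $ticker is GBP$price%.2f >= GBP${t.sellAbove}%.2f SELL ***")
  else
    None // prices satisfying neither condition are simply ignored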
Aerospike as a High Performance Database
Aerospike is a hybrid flash (index in RAM, data on disk), operational, strongly consistent NoSQL database.
This translates to:
• no data loss
• no stale reads
• no dirty reads, despite the presence of replicas for resilience purposes.
Aerospike supports complex operations at high performance. From my own experience, Aerospike's performance is better than that of the other clustered NoSQL solutions I have tested so far.
To put this in perspective:
1. Higher per-node performance means a smaller cluster, which results in lower TCO (Total Cost of Ownership) and maintenance. Specifically:
• Use of flash gives higher data density per node => lower TCO + reduced maintenance
• Use of flash gives lower cost/TB => lower TCO
Aerospike as a High Performance Database
2. Aerospike does auto-clustering, auto-sharding and auto-rebalancing.
3. Aerospike is designed to utilise flash memory as the predominant storage medium.
4. Unlike in-memory-only equivalents, Aerospike can be configured to use SSDs to store information with no practical speed loss.
5. An Aerospike cluster is typically sized to ensure 95% of transactions complete in under 1 millisecond, using commodity flash. Adopting flash to achieve high performance gives an order of magnitude better data density.
6. Aerospike latency is very low, as will be shown later in our tests, even at high throughput.
Aerospike set-up for this deployment
• Minimal configuration or special tuning
• 2-node Aerospike, each on different physical hosts (on prem and on GCP Dataproc)
• Aerospike Enterprise Edition with Security enabled
• Aerospike Connect for Kafka (Inbound Connector) as an add-on feature
– The inbound connector supports streaming data from one or more Kafka topics and persisting the data in an Aerospike database. It leverages the open source Kafka Connect framework, which is part of the Apache Kafka project. In terms of Kafka Connect, the inbound connector implements a "sink" connector.
– Requires the connector software to be installed on every Aerospike node.
• Aerospike Connect for Spark as an add-on feature
– Aerospike Connect for Spark enables companies to directly integrate the Aerospike Database with their existing Spark infrastructure. In addition, it allows companies to combine transactional and historical data stored in the Aerospike Database with streaming event data, for consumption by machine learning and artificial intelligence engines using Apache Spark.
– Requires the Aerospike license file to be installed on every node on which Spark executors will run against Aerospike as the high performance database.
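For orientation, a sketch of reading the batch set back into Spark through the connector is shown below. The data source name and option keys follow the general shape of the Aerospike Connect for Spark documentation but should be checked against the installed version; the host, namespace and set names are assumptions.

val batchDF = spark.read
  .format("com.aerospike.spark.sql")       // connector data source (verify against installed version)
  .option("aerospike.seedhost", "aero1")   // assumed seed host
  .option("aerospike.namespace", "trades") // assumed namespace
  .option("aerospike.set", "mdBatch")      // assumed set
  .load()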
Why Fast components matter
• We need high speed components and sophisticated programs for executing BUY and SELL.
• A system to handle multiple data feeds at scale.
• Very short time-frames for establishing and liquidating positions.
• Submission of numerous orders that are cancelled shortly after submission.
Why Fast components matter
• Ending the trading day in as close to a flat position as possible (that is, not carrying significant, unhedged positions overnight).
• We require an operational database that can handle massive volumes of reads and writes simultaneously.
• As for scaling, the chosen database should manage the addition and removal of capacity seamlessly, with no downtime.
• Aerospike ticks all the above boxes.
• High value prices collected in Aerospike sets provide valuable insight into real time trends in the market.
Understanding Spark results through Visualization
• The all-important indicator is the Processing Time, defined by the Spark GUI as the time taken to process all jobs of a batch, which on average is 152 ms, far smaller than the batch interval of 2 seconds. The Scheduling Delay and the Total Delay are additional indicators of health.
The real time dashboard
The Real Time Dashboard shows the following
lines of output:
*** price for ticker BP is GBP360.65 <= GBP659.86 BUY ***
*** price for ticker MSFT is GBP53.78 >= GBP53.69 SELL ***
*** price for ticker SAP is GBP67.06 >= GBP66.31 SELL ***
*** price for ticker SAP is GBP73.10 >= GBP66.31 SELL ***
*** price for ticker MKS is GBP618.50 >= GBP582.53 SELL ***
*** price for ticker MRW is GBP188.80 <= GBP338.77 BUY ***
*** price for ticker IBM is GBP193.54 >= GBP184.54 SELL ***
Looking at the Aerospike Management Console for various metrics
• The management console provides multiple metrics, from writes per second and reads per second to latency stats. It can therefore be used, in conjunction with Spark instrumentation, to identify performance issues among other parameters.
Deploying the architecture in Cloud
• For this work, three Dataproc compute servers were deployed in Google Cloud.
• Cloud Dataproc is a fully managed service that lets you run the Apache Spark and Apache Hadoop ecosystem on Google Cloud Platform.
• Start with the PoC architecture built on one physical host and scale up as needed, reducing development, integration and deployment time, and time to market.
• Machine type: c2-standard-4 (4 vCPUs, 16 GB memory)
• 60 GB of hard disk + 10 GB of SSD for Aerospike.
• Node 1 containers: Zookeeper, Kafka, Aerospike
• Node 2 containers: Zookeeper, Kafka, Aerospike
• Node 3 containers: Zookeeper, Kafka (container start-up sketches follow)
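The Zookeeper and Kafka containers can be started along the same lines as the Aerospike container shown on the next slide. A sketch using commonly available images follows; the image names and environment settings are assumptions to be adapted to the images actually used (a real ensemble also needs per-node server ids and listener settings).

docker run -d --net=host --name zookeeper \
  -e ZOOKEEPER_CLIENT_PORT=2181 \
  confluentinc/cp-zookeeper

docker run -d --net=host --name kafka \
  -e KAFKA_ZOOKEEPER_CONNECT=node1:2181,node2:2181,node3:2181 \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://node1:9092 \
  confluentinc/cp-kafka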
Aerospike set-up in Docker
• Docker EE in GCP
docker run -tid --net=host \
  -v /etc/aerospike:/opt/aerospike/etc \
  -v /etc/aerospike:/etc/aerospike \
  -v /mnt/disks/ssd/aerospike/data:/opt/aerospike/data \
  -v /var/log:/var/log \
  -e "FEATURE_KEY_FILE=/etc/aerospike/features.conf" \
  -e "LOGFILE=/var/log/aerospike.log" \
  --name aerospike \
  -p 3000:3000 -p 3001:3001 -p 3002:3002 -p 3003:3003 \
  aerospike/aerospike-server-enterprise /usr/bin/asd \
  --foreground \
  --config-file /etc/aerospike/aerospike.conf
Deploying the architecture in Cloud
• Comparable results were obtained in Cloud, with a processing time of 200 ms. This is not as good as on prem, but somewhat expected given the less powerful Dataproc compute servers.
Conclusions
• In this presentation we covered the real time processing of trade data relying on Kafka, Spark Streaming, Spark SQL and Aerospike. We streamed data in real time into Aerospike from Kafka and at the same time used Spark Streaming to look for high value tickers (securities) in real time, displayed the results in the dashboard and posted them as records to Aerospike as well. We performed this work both on prem and with Docker containers in Cloud.
Conclusions
• Earlier studies were done using HBase as the batch storage and MongoDB as the real time storage.
• The average processing time was a staggering 951 ms, compared with 153 ms for Aerospike!
• I published the findings on LinkedIn:
• https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/
• This poor timing was the main reason that I decided to move away from both HBase as batch storage and MongoDB as high value real time storage, in favour of Aerospike for both batch and real time storage.
Conclusions
• Reduced latency for calculations based on raw prices in (11)
• Reduced the end-to-end processing time further
• Used the Aerospike licensed products (Kafka connector and Spark connector)
• Reduced the TCO by simplifying component usage and deploying Docker containers on managed services in Cloud