1. Real Time Processing of Trade Data with Kafka, Spark Streaming and Aerospike on Prem and in Cloud
Architecture Design Series
Mich Talebzadeh
October 2019
2. Author
Mich Talebzadeh, Big Data, Cloud and Financial Fraud IT Specialist
Ph.D. in Particle Physics, Imperial College of Science and Technology, University of London
Past clients: investment banks, Barclaycard, Credit Suisse, Mizuho International Plc, HSBC, Fidelity, Bank of America, Deutsche Bank etc.
E-mail: mich@Peridale.co.uk
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2?
33 LinkedIn articles on Big Data
3. Introduction
• I have been publishing articles adhering to the Lambda Architecture with several different databases; an example from 2016 is given here.
• With the advent of Digital Transformation, things have moved on. With better and improving performance from newer offerings, organisations are looking at using data warehouses in real time. The idea of periodically loading data into data warehouses has given way to continuous loading of data in real time, made possible through new capabilities in Big Data and Cloud that deliver better throughput.
4. Introduction
The notion of what counts as real time plays a role here. The engineering definition of real time is roughly "fast enough to be interactive". However, I propose a stronger definition:
• In a real time application, there is no such thing as an answer that is late and correct. Timeliness is part of the application.
• If we get the right answer too slowly, it becomes useless or wrong. We also need to be aware that latency trades off against throughput.
5. Introduction
• Within a larger architecture, latency is often dictated by the lowest common denominator, which frequently does not adhere to our definition of low latency. For example, Kafka as widely deployed today in Big Data architectures is micro-batch. A moderate-latency message queue (Kafka) plus a low-latency processor equals a moderate-latency architecture. Hence, a low-latency architecture must be treated within that context.
6. Introduction
• The challenge in such an architecture is that everything must not only work, but work seamlessly. The more links in the chain from the source to the consumption of data, the more points of vulnerability.
• That is the reason redundancy and fault tolerance need to be part of the solution design and built in from the start.
8. Data Sources
• To stream trade data we rely on historical trade data available from Google Finance.
• For each day, the average of the high and low prices was taken; these averages were stored in arrays and selected at random at run time to create random prices for a given ticker.
• A ticker is used to uniquely identify publicly traded shares of a particular stock.
• In this case we chose 10 different tickers, shown below, to generate random prices as JSON data:
IBM MRW MSFT ORCL SAP SBRY TSCO VOD BP MKS
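As a sketch of how such random price JSON messages could be generated (the field names and the price arrays below are illustrative assumptions, not the original code):

```python
import json
import random
import time
import uuid

# Illustrative daily average prices per ticker; the real values came from
# Google Finance as (high + low) / 2 per day.
PRICES = {
    "IBM": [184.5, 186.2, 183.9],
    "MSFT": [53.7, 54.1, 52.9],
    "BP": [360.6, 362.1, 359.8],
}

def random_price_event(ticker):
    """Pick a stored average price at random and wrap it as a JSON trade message."""
    price = random.choice(PRICES[ticker])
    return json.dumps({
        "rowkey": str(uuid.uuid4()),       # unique key per message
        "ticker": ticker,
        "timeissued": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "price": price,
    })

print(random_price_event("IBM"))
```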
9. Zookeeper setup
1. Zookeeper's role is to act as the coordination service for distributed systems.
2. Three physical nodes were deployed to create a Zookeeper ensemble.
3. A collection of Zookeeper servers forms a Zookeeper ensemble.
4. We used three Zookeeper servers. In a Zookeeper ensemble, all servers must know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store.
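A minimal zoo.cfg for such a three-node ensemble might look like this (host names and paths are placeholders; each server also needs a matching myid file in dataDir):

```
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# the three ensemble members; ports 2888/3888 are for quorum traffic and leader election
server.1=zk-host1:2888:3888
server.2=zk-host2:2888:3888
server.3=zk-host3:2888:3888
```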
10. Kafka setup
Apache Kafka is a publish/subscribe streaming platform that has three key capabilities, namely:
1. Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system
2. Store streams of records in a fault-tolerant, durable way
3. Process streams of records as they occur.
11. Kafka setup
Kafka architecture consists of the following components:
• A stream of messages of a particular type is defined as a Topic.
• A Message is defined as a payload of bytes, and a Topic is a category or feed name to which messages are published.
• A Producer can be anything that can publish messages to a Topic.
12. Kafka setup
• The published messages are then stored on a set of servers called Brokers, or a Kafka Cluster.
• A Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers. The Zookeeper/Broker ensemble is shown below.
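To make the Topic/Producer/Consumer roles concrete, here is a minimal in-memory sketch of the publish/subscribe model (an illustration of the concepts only, not the Kafka API):

```python
from collections import defaultdict, deque

class Broker:
    """Toy broker: stores streams of records per topic, consumers pull at their own pace."""
    def __init__(self):
        self.topics = defaultdict(deque)   # topic name -> ordered messages
        self.offsets = defaultdict(int)    # (topic, consumer) -> next unread offset

    def publish(self, topic, message):
        """Producer side: append a message to a topic's stream."""
        self.topics[topic].append(message)

    def poll(self, topic, consumer):
        """Consumer side: pull every message on a topic not yet seen by this consumer."""
        key = (topic, consumer)
        records = list(self.topics[topic])[self.offsets[key]:]
        self.offsets[key] = len(self.topics[topic])
        return records

broker = Broker()
broker.publish("prices", '{"ticker": "IBM", "price": 184.5}')
broker.publish("prices", '{"ticker": "BP", "price": 360.6}')
print(broker.poll("prices", "spark-streaming"))  # both messages, in publish order
print(broker.poll("prices", "spark-streaming"))  # [] - already consumed
```

Note that, as in Kafka, messages are retained by the broker and each consumer tracks its own position, so a second consumer polling the same topic would still receive all messages.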
14. Spark Deployment Modes
Spark can be deployed either locally or in a distributed manner on a cluster.
• In Local mode Spark runs in a single JVM. This is the simplest run mode and provides no performance gains.
• In a distributed environment, Spark runs with a master-slave architecture. There will be multiple worker nodes in each cluster, and the cluster manager allocates resources to each worker node.
15. Spark Deployment Modes
Spark can be deployed on a distributed cluster in three ways:
• Standalone - meaning Spark will manage its own cluster
• YARN - using Hadoop's YARN resource manager
• Mesos - Apache's dedicated resource manager project
16. YARN Deployment Modes
• The term deployment mode in Spark simply describes where the driver program will run. There are two options, namely Spark Client Mode and Spark Cluster Mode.
• In Client mode, the driver runs on the node from which you submit the Spark job to the cluster, often an Edge Node. This mode is valuable when you want to use Spark interactively, as in our case where we would like to display high value prices in the dashboard. In Client mode you do not need to reserve any resources in your cluster for the driver.
• In Cluster mode, you submit the Spark job to your cluster and the driver runs inside the cluster, within the Application Master. In this mode you cannot use the Spark job interactively, as the client through which you submit the job is gone as soon as it successfully submits the job to the cluster. You will have to reserve some resources for the driver process, as it will be running in your cluster.
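As an illustration, the two modes differ only in the --deploy-mode flag passed to spark-submit (the class and JAR names below are placeholders, not the original project's):

```shell
# Client mode: driver runs on the submitting (edge) node - interactive, dashboard visible
spark-submit --master yarn --deploy-mode client \
  --class example.TradeStreamer trade-streamer.jar

# Cluster mode: driver runs inside the cluster's Application Master - non-interactive,
# so resources must be reserved for the driver in the cluster itself
spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 2g --driver-cores 1 \
  --class example.TradeStreamer trade-streamer.jar
```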
17. The architectural approach
• In this design we collect trades in real time in storage (Aerospike) and process them in real time through Spark. This contrasts with the traditional Lambda architecture, where the trades are collected in batch mode and processed in real time through Spark.
• The diagram below explains this approach:
19. The Process Flow and the Decision tree
1. Live prices are streamed in through Kafka topics, identified as (1), in JSON format
2. The Kafka cluster (2) processes these trades and delivers them to Spark Streaming (3)
3. Kafka Connect for Aerospike passes these prices to the Aerospike batch set as well (10)
4. These prices are put into the Aerospike batch set in real time (11)
5. Back in Spark, prices are processed individually through the Decision Engine (5), and depending on the ticker a decision to buy or sell is made
6. To enable this decision process, current prices for each ticker (security) are obtained from the Aerospike batch set. The definition of current depends on how far back the stats (i.e. the closing prices) are needed. For example, a simple moving average is calculated by adding the closing prices of the ticker for several time periods (for example 14) and then dividing the total by that same number of periods. In general, short-term averages respond quickly to changes in the price of the underlying ticker, while long-term averages are slow to react. Other statistical parameters like the MIN, MAX and Standard Deviation (STDDEV) of prices can also be calculated for the same time periods.
20. The Process Flow and the Decision tree
7. The Decision Engine can be literally a few lines of code embedded within the Scala program, or a pre-defined package in the form of one or more JAR files provided to Spark. Generally, far more sophisticated models can be made available to the Decision Engine if needed.
8. The streaming topic in Spark consists of a JSON record for each ticker. We need to loop through the records and work out whether the ticker price satisfies the condition to buy or sell.
9. Those prices that do not satisfy the condition are simply ignored.
10. Those prices that satisfy the condition (high value prices) are notified via the Real Time Dashboard (6-8)
11. These high value prices are then saved to the operational database, the Aerospike real time set (7), for future reference.
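The moving-average statistics in step 6 and the buy/sell test in step 8 can be sketched as follows (pure Python for illustration; the actual Decision Engine was Scala code in Spark, and the decision rule here is a simplifying assumption):

```python
import statistics

def sma(closing_prices, periods=14):
    """Simple moving average over the most recent `periods` closing prices."""
    window = closing_prices[-periods:]
    return sum(window) / len(window)

def price_stats(closing_prices, periods=14):
    """MIN, MAX, STDDEV and SMA over the same window, as described in step 6."""
    window = closing_prices[-periods:]
    return {
        "sma": sum(window) / len(window),
        "min": min(window),
        "max": max(window),
        "stddev": statistics.stdev(window),
    }

def decide(current_price, closing_prices, periods=14):
    """Toy decision rule: SELL at or above the moving average, BUY below it."""
    return "SELL" if current_price >= sma(closing_prices, periods) else "BUY"

history = [52.0, 53.0, 52.5, 54.0, 53.5, 53.0, 54.5, 53.69]
print(decide(53.78, history, periods=8))  # prints "SELL"
```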
21. Aerospike as a High Performance Database
Aerospike is a hybrid-flash (index in RAM, data on disk), operational, strongly consistent NoSQL database.
This translates to:
• no data loss
• no stale reads
• no dirty reads, despite the presence of replicas for resilience purposes.
Aerospike supports complex operations at high performance. From my own experience, Aerospike's performance is better than that of other clustered NoSQL solutions I have tested so far.
To put this in perspective:
1. Higher per-node performance of Aerospike means a smaller cluster, which results in lower TCO (Total Cost of Ownership) and maintenance. Specifically:
• Use of flash gives higher data density per node => lower TCO + reduced maintenance
• Use of flash gives lower cost / TB => lower TCO
22. Aerospike as a High Performance Database
2. Aerospike does auto-clustering, auto-sharding and auto-rebalancing.
3. Aerospike is designed to utilise flash memory as the predominant storage medium.
4. Unlike in-memory-only equivalents, Aerospike can be configured to use SSDs to store information with no practical speed loss.
5. An Aerospike cluster is typically sized to ensure 95% of transactions are sub-1-millisecond using commodity flash. Adoption of flash to achieve high performance gives an order of magnitude better data density.
6. Aerospike latency is very low, as will be shown later in our tests, even with high throughput.
23. Aerospike set-up for this deployment
• Minimal configuration or special tuning
• 2-node Aerospike cluster, each node on a different physical host (on prem and on GCP Dataproc)
• Aerospike Enterprise Edition with Security enabled
• Aerospike Connect for Kafka (Inbound Connector) as an add-on feature
– The inbound connector supports streaming data from one or more Kafka topics and persisting the data in an Aerospike database. It leverages the open source Kafka Connect framework that is part of the Apache Kafka project. In Kafka Connect terms, the inbound connector implements a "sink" connector.
– Requires the connector software to be installed on every Aerospike node.
• Aerospike Connect for Spark as an add-on feature
– Aerospike Connect for Spark enables companies to directly integrate the Aerospike Database with their existing Spark infrastructure. In addition, it allows companies to combine transactional and historical data stored in the Aerospike Database with streaming event data for consumption by machine learning and artificial intelligence engines using Apache Spark.
– Requires the Aerospike license file to be installed on every node where Spark executors will be running, with Aerospike as the high performance database.
24. Why Fast components matter
• Need high speed components and sophisticated programs for executing BUY and SELL orders.
• A system that handles multiple data feeds at scale.
• Very short time-frames for establishing and liquidating positions.
• Submission of numerous orders that are cancelled shortly after submission.
25. Why Fast components matter
• Ending the trading day in as close to a flat position as possible (that is, not carrying significant unhedged positions overnight).
• We require an operational database that can handle massive volumes of simultaneous reads and writes.
• As for scaling, the chosen database should manage the addition and reduction of capacity seamlessly, with no downtime.
• Aerospike ticks all the above boxes.
• High value prices collected in Aerospike sets provide valuable insight into real time market trends.
27. Understanding Spark results through Visualization
• The all-important indicator is the Processing Time, defined in the Spark GUI as the time taken to process all jobs of a batch. On average this was 152ms, far smaller than the batch interval of 2 seconds. The Scheduling Delay and the Total Delay are additional indicators of health.
28. The real time dashboard
The Real Time Dashboard shows the following
lines of output:
*** price for ticker BP is GBP360.65 <= GBP659.86 BUY ***
*** price for ticker MSFT is GBP53.78 >= GBP53.69 SELL ***
*** price for ticker SAP is GBP67.06 >= GBP66.31 SELL ***
*** price for ticker SAP is GBP73.10 >= GBP66.31 SELL ***
*** price for ticker MKS is GBP618.50 >= GBP582.53 SELL ***
*** price for ticker MRW is GBP188.80 <= GBP338.77 BUY ***
*** price for ticker IBM is GBP193.54 >= GBP184.54 SELL ***
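The dashboard lines above could be produced by a formatting helper along these lines (a sketch only; the original was Scala code in Spark, and here the benchmark value stands in for the statistic read from the Aerospike batch set):

```python
def dashboard_line(ticker, price, benchmark):
    """Format a high value price alert in the dashboard's style:
    SELL when the price is at or above the benchmark, BUY when below."""
    if price >= benchmark:
        op, signal = ">=", "SELL"
    else:
        op, signal = "<=", "BUY"
    return (f"*** price for ticker {ticker} is GBP{price:.2f} "
            f"{op} GBP{benchmark:.2f} {signal} ***")

print(dashboard_line("MSFT", 53.78, 53.69))
# prints: *** price for ticker MSFT is GBP53.78 >= GBP53.69 SELL ***
print(dashboard_line("BP", 360.65, 659.86))
# prints: *** price for ticker BP is GBP360.65 <= GBP659.86 BUY ***
```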
30. Looking at the Aerospike Management Console for various metrics
• The Management Console provides multiple metrics, from writes per second and reads per second to latency stats. It can therefore be used to identify performance issues, among other things, in conjunction with Spark instrumentation.
31. Deploying the architecture in Cloud
• For this work, we deployed three Dataproc compute servers in Google Cloud.
• Cloud Dataproc is a fully managed service that lets you run the Apache Spark and Apache Hadoop ecosystem on Google Cloud Platform.
• Start with the PoC architecture built on one physical host and scale up as needed. This reduces the development, integration and deployment time, and time to market.
• Machine type: c2-standard-4 (4 vCPUs, 16 GB memory)
• 60GB of hard disk + 10GB of SSD for Aerospike.
• Node 1 containers: Zookeeper, Kafka, Aerospike
• Node 2 containers: Zookeeper, Kafka, Aerospike
• Node 3 containers: Zookeeper, Kafka
32. Aerospike set-up on Docker
• Docker EE in GCP
docker run -tid --net=host \
-v /etc/aerospike:/opt/aerospike/etc \
-v /etc/aerospike:/etc/aerospike \
-v /mnt/disks/ssd/aerospike/data:/opt/aerospike/data \
-v /var/log:/var/log \
-e "FEATURE_KEY_FILE=/etc/aerospike/features.conf" \
-e "LOGFILE=/var/log/aerospike.log" \
--name aerospike \
-p 3000:3000 \
-p 3001:3001 \
-p 3002:3002 \
-p 3003:3003 \
aerospike/aerospike-server-enterprise /usr/bin/asd \
--foreground \
--config-file /etc/aerospike/aerospike.conf
33. Deploying the architecture in Cloud
• Comparable results were obtained in Cloud, with a processing time of 200ms. Not as good as on prem, but somewhat expected given the less powerful Dataproc compute servers.
34. Conclusions
• In this presentation we covered the real time processing of trade data relying on Kafka, Spark Streaming, Spark SQL and Aerospike. We streamed data in real time from Kafka into Aerospike and at the same time used Spark Streaming to look for high value tickers (securities) in real time, displayed the results in the dashboard and posted them as records to Aerospike as well. We performed this work both on prem and with Docker containers in Cloud.
35. Conclusions
• Earlier studies were done using HBase as the batch storage and MongoDB as the real time storage.
• The average processing time was a staggering 951ms, compared to 153ms with Aerospike!
• I published the findings on LinkedIn:
• https://www.linkedin.com/pulse/real-time-processing-trade-data-kafka-flume-spark-talebzadeh-ph-d-/
• This poor timing was the main reason I decided to move away from both HBase as batch storage and MongoDB as high value real time storage, in favour of Aerospike for both batch and real time storage.
37. Conclusions
• Reduced latency for calculations based on raw prices in (11)
• Reduced the end-to-end processing time further
• Used the Aerospike licensed products (Kafka connector and Spark connector)
• Reduced the TCO by simplifying component usage and deploying Docker containers on managed services in Cloud