Real-time Data-Pipeline from inception to production
Shreya Mukhopadhyay
Intuit
Bengaluru, India
shreya_mukhopadhyay3@intuit.com
Ashwini Vadivel
Intuit
Bengaluru, India
ashwini_vadivel@intuit.com
Basavaraj M
Intuit
Bengaluru, India
basavaraj_m@intuit.com
Abstract
Big Data is the buzzword of the industry, and organizations want to derive actionable insights from their data quickly. Both historic and incoming data need to be ingested through data pipelines into a single data lake to support real-time analytics. Building real-time streaming pipelines requires attention to data veracity, reliability of the system, out-of-order events, complex transformations and ease of integration for future use cases.
This paper covers our experience of building such real-time pipelines for financial data, the various open source libraries we experimented with and the impact we saw in a very short time.
1. Introduction
Intuit offers a plethora of financial products that help small and medium businesses with bookkeeping, financial management and tax filing. These products can have multiple data sources: customer-entered data, bank feeds, payments, payroll and tax information from federal agencies, among others. To build insights for our customers, auditors, accountants and internal customer care executives, a unified data lake is needed.
This data lake needs to be fed with both real-time and historical data, from internal and external sources. We wanted to build a framework for ETL (Extract, Transform, Load) data pipelines that can be used across the organization to stream data and populate the data lake. Raw data from multiple sources had to be transformed into efficient formats before streaming and storage. The main guiding principles for such a framework were near real-time stateful transformation, data streaming with integrity, high availability, scalability and minimal latency.
2. Architecture
In order to meet the above standards, the framework should be able to handle complex tasks like ingestion, persistence, processing and transformation. After considering multiple distributed application architectures, we narrowed down to the Unix pipes-and-filters architecture. It was best suited to the above requirements, as it is a simple yet powerful and robust system architecture. It can have any number of components (filters) that transform or filter data before passing it via connectors (pipes) to other components (Figure 1).
Figure 1 Pipe and Filter Architecture
A filter can have any number of input pipes and any number of output pipes. The pipe is the connector that passes data from one filter to the next. It is a directional stream of data, usually implemented by a data buffer that stores data until the next filter has time to process it. The source and sink are the producer and consumer respectively, and can be static files, any database or user input (refer to [8]).
cat sample.txt | grep -v a | sort -r is a simple Unix command representative of the architecture. Here sample.txt is the source and the console is the sink. The commands cat, grep -v a and sort -r are filters, and | is the pipe which passes unidirectional data between these filters.
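For readers who prefer code to shell, the same idea can be sketched in Java, where each filter is a small function and the pipe is function composition over a stream; the names and data below are purely illustrative and not part of our framework.

import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy illustration of pipes and filters: each stage consumes a stream and produces a new one. */
public class PipesAndFiltersDemo {

    public static void main(String[] args) {
        // Filters analogous to "grep -v a" and "sort -r" in the shell example above
        Function<Stream<String>, Stream<String>> dropLinesWithA =
                s -> s.filter(line -> !line.contains("a"));
        Function<Stream<String>, Stream<String>> sortReverse =
                s -> s.sorted(java.util.Comparator.reverseOrder());

        Stream<String> source = Stream.of("alpha", "beta", "echo", "delta"); // stands in for cat sample.txt

        // The pipe is function composition: the output of one filter feeds the next
        List<String> sink = dropLinesWithA.andThen(sortReverse)
                                          .apply(source)
                                          .collect(Collectors.toList());

        System.out.println(sink); // [echo] -- only lines without 'a', reverse sorted
    }
}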
Our real-time streaming architecture was designed
using the same logic of pipes. It ensured the
following:
- Support for multiple sources and sinks
- Easier future enhancements by rearrangement of filters
- Smaller processing steps ensuring easy reusability
- Explicit storage of intermediate results for further processing
- Scalability support
3. Pipeline Components
Our first use-case had a relational Microsoft SQL database (Windows 2012 R2 Server) as the source and Apache ActiveMQ as the sink. The sink application contributes to the data lake, where real-time data can help generate in-product recommendations and identify fraudsters, among other use cases, through Machine Learning (ML) models. The source database had over 100 tables in a single schema. The sink, being a JMS queue, accepted only text messages in certain formats.
Figure 2 Pipeline Component
Working with the product teams, we were able to create many-to-many input/output transformation maps. Overall, the inputs came from 10 dynamic and 6 static tables, which had to be transformed into 3 types of events.
Figure 2 gives the general idea of data flow in the ETL pipeline. The choice of Kafka (Confluent 2.0.1 [9]) for the pipes was a simple one, as it is well known for its ability to process high-volume data. It supports publish-subscribe messaging and streams and is meant to be durable, fast, and scalable. The grey arrows before/after each component represent Kafka topics.
As we delve deeper into the individual components, we will elaborate on the technologies and open source libraries that were used to get this pipeline to production.
4. Data Ingestion and Delivery
The first step in every pipeline is ingestion, wherein data can be ingested in real time or in batches. In our case, we needed real-time streaming and therefore real-time ingestion. The first use case of our data pipeline had Microsoft SQL Server (MSSQL) as the data source, which supports Change Data Capture (CDC) technology to capture changes at the source in real time. The Oracle GoldenGate (GG) solution for capturing changes in real time had an issue: after each database switchover, GG started to read from the very beginning, creating huge data loads on the pipeline. So we chose the MSSQL CDC mechanism to capture insert, update and delete events. Kafka source connectors pulled these CDC events, converted them to Avro [12] messages and published them to Kafka topics. GG was later used for Oracle and MySQL database sources, where the above issue was not seen.
4.1 Connectors
The source and sink connectors are the entry and exit points of our data pipeline. The source connectors are responsible for bringing all change data in and streaming it to the pipeline, while the sink connectors are responsible for passing the transformed output data to the sink/data lake.
In the following sections we will discuss our first use case: a source connector for MSSQL and a sink connector for ActiveMQ.
4.1.1 Source Connector
For our MSSQL use-case, we wanted to capture all data manipulation operations on the database tables, and MSSQL Server CDC provides this capability. The source of change data is the server transaction log, and capture can be enabled on an individual table, on chosen fields or on the entire schema [11].
A new schema and capture tables for the tracked columns get created once CDC is enabled. Five additional columns (__$start_lsn, __$end_lsn, __$seqval, __$operation, __$update_mask) are added per table. These columns allow us to uniquely identify a transaction within a commit and to replay it.
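As an illustration of how these columns drive ordering and replay, the hedged sketch below polls a CDC change table over plain JDBC; the connection string, capture-instance name and bookmark handling are assumptions, and in our pipeline this work is actually done by Kafka Connect, described next.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/**
 * Minimal sketch (not our Kafka Connect implementation): reads rows from a
 * SQL Server CDC capture table in commit order. The connection string, capture
 * instance name and the lastLsn bookmark are illustrative placeholders.
 */
public class CdcTablePoller {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=sampledb";
        byte[] lastLsn = new byte[10]; // bookmark of the last processed __$start_lsn

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     // cdc.<capture_instance>_CT is the change table created when CDC is enabled
                     "SELECT __$start_lsn, __$seqval, __$operation, * "
                   + "FROM cdc.dbo_TABLENAME_CT WHERE __$start_lsn > ? "
                   + "ORDER BY __$start_lsn, __$seqval")) {

            ps.setBytes(1, lastLsn);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // __$operation: 1=delete, 2=insert, 3=update (before image), 4=update (after image)
                    int operation = rs.getInt("__$operation");
                    byte[] startLsn = rs.getBytes("__$start_lsn");
                    // ...convert the row to an Avro record and publish it to the raw CDC topic
                    lastLsn = startLsn; // advance the bookmark so replay can resume from here
                }
            }
        }
    }
}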
Once CDC is set up, Kafka Connect is used to pull data from these tables over shared JDBC connections, massage it and publish it onto the CDC Kafka topics. Each table has a single data source, i.e. its CDC table. The final output onto the topic is an Avro message whose schema is stored in the Confluent Schema Registry. Below is a sample Avro message for an INSERT:
{
  "header": {
    "source": "MSSQLServer",
    "seqno": "00127A53000034E00110",
    "fragno": "00127A53000034C80007",
    "schema": "mssql_database_name",
    "table": "TABLENAME",
    "timestamp": 1494839698233,
    "eventtype": "INSERT",
    "shardid": "SHARD0",
    "eventid": "00127A53000034E00110",
    "primarykey": "ID"
  },
  "payload": {
    "beforerecord": null,
    "afterrecord": {
      "afterrecord": {
        "ID": {
          "long": 245983
        }
      }
    }
  }
}
Figure 3 Real Time Data Pipeline
Another implementation of the connector can use Oracle GoldenGate to publish MySQL events to Kafka topics [7].
Figure 3 gives a detailed view of the pipeline with all the open source libraries used. The components Sink Connector, Joiner and Transformer will be discussed in detail in the following sections.
4.1.2 Sink Connector
The JMS sink connector allows us to extract entries from a Kafka topic with the Connect Query Language (CQL) driver and pass them to a JMS topic/queue. The connectors, one for each type of event, de-dupe and take the latest event, which can be identified by the combination of fragment and sequence numbers added by the source connector. These messages are then converted to text messages using the JMS API and written onto the queue. The input format in our case is Avro from Kafka and the output is a text message. Details of the configuration and the Kafka Connect JMS sink are well explained in [2].
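As a rough illustration of that final conversion step (not the connector's actual implementation), the sketch below turns an already de-duped, transformed payload into a JMS text message and sends it to an ActiveMQ queue; the broker URL, queue name and payload are placeholders.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

/** Minimal sketch: write a transformed event as a JMS text message to ActiveMQ. */
public class JmsTextMessageWriter {

    public static void main(String[] args) throws Exception {
        // Broker URL and queue name are illustrative placeholders
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://mq-host:61616");
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("events.queue");
            MessageProducer producer = session.createProducer(queue);

            // The payload would come from the de-duped Kafka record, already
            // transformed into the text format the sink application expects.
            String payload = "{\"eventtype\":\"INSERT\",\"ID\":245983}";
            TextMessage message = session.createTextMessage(payload);
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}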
4.2 Bootstrap
Data from the source connectors is delta data, and to support stateful transformations complete data is needed. We bootstrap historic data into Cassandra so that the outgoing events are complete. It also aids replay and schema evolution. The next two components in the pipeline, Joiner and Transformer, use this Cassandra data to construct a complete and stateful event.
Bootstrap has 3 stages. Each stage is a standalone Java program which is run separately and serially in the following order before onboarding any new source to the pipeline:
➢ Populate Kafka topics- Using a JDBC connection, SQL queries for historic data are run for each table and the data is populated onto the raw CDC topics in the same Avro format discussed in the Source Connector section; all these events are inserts (see the sketch after this list).
➢ Populate Datomic tables- The bootstrap Java program reads from the input Kafka topics and populates the corresponding Datomic tables.
➢ Populate Datomic references- In this stage, the program populates the references between records in the different Datomic tables.
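A hedged sketch of the first stage above, assuming Confluent's Avro serializer and Schema Registry are on the classpath; the JDBC URL, query, topic name and record schema are simplified placeholders, not the production code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Minimal sketch of bootstrap stage 1: replay historic rows as INSERT events onto a raw CDC topic. */
public class BootstrapTopicPopulator {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        // Assumes Confluent's Avro serializer and Schema Registry are available
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");

        // Simplified schema; the real pipeline reuses the source connector's schema
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"RowEvent\",\"fields\":["
              + "{\"name\":\"eventtype\",\"type\":\"string\"},"
              + "{\"name\":\"ID\",\"type\":\"long\"}]}");

        try (Producer<String, GenericRecord> producer = new KafkaProducer<>(props);
             Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://dbhost:1433;databaseName=sampledb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT ID FROM dbo.TABLENAME")) {

            while (rs.next()) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("eventtype", "INSERT"); // bootstrap events are always inserts
                record.put("ID", rs.getLong("ID"));
                producer.send(new ProducerRecord<>("raw.cdc.TABLENAME",
                        String.valueOf(rs.getLong("ID")), record));
            }
        }
    }
}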
5. Data Joiner- Joiner
The next component in our pipeline does the task of joining events from multiple Kafka streams to form a de-normalised view, ready to be transformed. The Joiners are Spark jobs that process events from their respective Kafka streams.
The joins are performed based on a joiner configuration that is provided to the job at startup. This config is used to create the joiner output events and to define the joins in the DB, i.e. Datomic in our case.
5.1 Datomic
Datomic is a fully transactional, distributed database that avoids the compromises and losses of many NoSQL solutions. In addition, it offers flexibility and power over the traditional RDBMS model.
➢ Datomic stores a record of immutable facts which are never updated in place, and all data is retained by default, giving you built-in auditing and the ability to query history.
➢ Caching is built-in and can be maintained on the client side, which makes reads faster.
➢ Datomic provides rich schema and query capabilities on top of a storage of your choice. A storage 'service' can be anything from a SQL database, to a key/value store, to a true service like Amazon's DynamoDB.
➢ Schema evolution can be handled easily with Datomic as it follows an EAVT (Entity, Attribute, Value, Transaction) structure.
➢ Joins are handled inherently: references to joined rows are always maintained.
➢ ACID-compliant transactions.
We used Datomic 0.9.5561 (refer to [10]) on top of a Cassandra cluster for storage.
But before joining can be performed, we needed to ensure that the incoming events are complete rows (since we want to support both CDC and GoldenGate events).
5.2 Reconciliation
Every event processed by the Joiner is persisted at our end in a Datomic DB, using which we can construct the complete row even when only partial data comes in through the CDC events. When the Joiner receives an event, it reads the previous state of the same row from our database, applies the change set on it to construct the latest, complete row, and pushes the result back in.
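The reconciliation step boils down to overlaying the non-null fields of the incoming change set onto the previously persisted row. The sketch below shows that merge in plain Java, with the Datomic read and write omitted and the field names hypothetical.

import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of reconciliation: overlay a partial CDC change set onto the previously persisted row. */
public class RowReconciler {

    /** Returns the latest complete row; previousState comes from Datomic, changeSet from the CDC event. */
    public static Map<String, Object> reconcile(Map<String, Object> previousState,
                                                Map<String, Object> changeSet) {
        Map<String, Object> latest = new HashMap<>(previousState);
        for (Map.Entry<String, Object> change : changeSet.entrySet()) {
            if (change.getValue() != null) {          // only columns present in the CDC event overwrite state
                latest.put(change.getKey(), change.getValue());
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        Map<String, Object> previous = new HashMap<>();
        previous.put("ID", 245983L);
        previous.put("STATUS", "OPEN");
        previous.put("AMOUNT", 100.0);

        Map<String, Object> change = new HashMap<>();
        change.put("ID", 245983L);
        change.put("STATUS", "PAID");                 // partial update from the CDC stream

        // Merged row contains ID=245983, STATUS=PAID, AMOUNT=100.0 and is written back to Datomic
        System.out.println(reconcile(previous, change));
    }
}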
5.3 Joining
Only master table entries can translate into output events. If the incoming event belongs to a master table, then on fetching its value from Datomic we get the complete output entity along with all the referenced child entries (thanks to Datomic!). If it is from a child table, then we fetch its corresponding master values to form the output event. A single table could be a master and/or a child, and based on this the number of output events formed may vary (each corresponding to a different entity at the destination).
The output events are now a denormalized view of all the tables that are required to form the destination entities.
6. Data Transformation- Transformer
Once the denormalized event is generated by the Joiner, it is pushed into the next set of Kafka topics. These topics are then consumed by the Transformer, another Spark job, whose sole responsibility is data transformation. The most common operations include:
➢ Mapping between the source and destination fields
➢ Deriving new field values based on business logic
➢ Validating mandatory fields and other business rules
The transformation logic is handled through an open source framework called Morphline, which lets us define a series of commands (transformations) that are applied sequentially to the event being processed (Figure 4).
Figure 4 Morphline Illustration
The transformations are defined in an external configuration, in the format expected by the Morphline SDK, as shown in the sample transformation sheet below:

morphlines : [
  {
    id : morphline
    importCommands : ["org.kitesdk.**"]
    commands : [
      { command1 {
          attr1 : value
          attr2 : value
      } }
      { command2 {
          attr1 : value
      } }
    ]
  }
]
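To show how such a configuration could be wired into a job, the hedged sketch below compiles a morphline file with the Kite SDK and runs a single record through it; the file name and record fields are assumptions, and error handling is kept minimal.

import java.io.File;

import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Notifications;

/** Minimal sketch: compile a morphline config and run a single denormalized event through it. */
public class MorphlineRunner {

    public static void main(String[] args) {
        MorphlineContext context = new MorphlineContext.Builder().build();
        // "morphline" matches the id used in the configuration sample above;
        // no final child command is attached in this sketch
        Command morphline = new Compiler().compile(
                new File("transformations.conf"), "morphline", context, null);

        Record event = new Record();
        event.put("ID", 245983L);            // fields from the Joiner's denormalized output
        event.put("STATUS", "PAID");

        Notifications.notifyStartSession(morphline);
        boolean success = morphline.process(event);   // applies the configured commands in order
        if (!success) {
            System.err.println("Morphline failed to process record: " + event);
        }
        Notifications.notifyShutdown(morphline);
    }
}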
6.1 Checkpointing
We chose to do our own checkpointing rather than relying on Spark, for two main reasons:
➢ The default Spark checkpointing requires us to clear the checkpointing directory on HDFS whenever new code is deployed. This is an operational overhead and is prone to errors.
➢ Saving checkpoint data in Datomic also lets us replay messages on demand.
The Joiner and Transformer both pick up the latest offset from the metadata table at startup and process the Kafka streams from that point onwards. They save the offset in Datomic after processing a batch, along with the metadata information which the event contains.
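The sketch below illustrates the approach, assuming the Spark 1.6 direct Kafka stream API; the CheckpointStore interface stands in for our Datomic-backed metadata table, whose access code is omitted.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

/** Minimal sketch of the custom checkpointing idea used by the Joiner and Transformer. */
public class OffsetCheckpointing {

    /** Stand-in for our Datomic-backed metadata table. */
    interface CheckpointStore {
        void saveOffset(String topic, int partition, long untilOffset);
        long latestOffset(String topic, int partition);
    }

    /** Attach offset bookkeeping to a direct Kafka stream created elsewhere. */
    static void checkpointAfterEachBatch(JavaDStream<String> stream, CheckpointStore store) {
        stream.foreachRDD((JavaRDD<String> rdd) -> {
            // The direct stream exposes the exact offset ranges consumed in this batch
            OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

            // ...process the batch (join/transform) here, then persist the offsets so a
            // restart, or a replay with a rewound offset, resumes from a known point
            for (OffsetRange range : ranges) {
                store.saveOffset(range.topic(), range.partition(), range.untilOffset());
            }
        });
    }
}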
6.2 Features- Replay, Out-of-Order Handling, Schema Evolution
Replaying messages from the Kafka streams is required whenever we encounter a technical or logical issue. For the Spark jobs, the checkpointing data is available in Datomic; based on the time from which replay is required, the corresponding offset is fetched. The latest offset value in Datomic is then set to this value and the components are restarted. Once a message is replayed, it flows through all the downstream components and into the sink.
Every event from the source has a fragment number and a sequence number. This combination is unique per event, and the Joiner uses it to detect out-of-order events. The Transformer cannot use this value because multiple streams are merged in the Joiner output. Instead, it uses a transaction id which is stamped into the event by the Joiner. This transaction id is generated by Datomic for every insert/update operation and is sequential in nature.
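A hedged illustration of the ordering check: compare the incoming event's fragment/sequence pair with the last pair applied for that row. The string comparison relies on the fixed-width hex encoding seen in the sample header earlier and is an assumption, not the Joiner's exact logic.

/** Minimal sketch: decide whether a CDC event is stale relative to the last one applied for the same row. */
public class OutOfOrderDetector {

    /**
     * fragno/seqno are the hex-encoded LSN-style strings from the event header
     * (e.g. "00127A53000034C80007"); fixed-width hex compares correctly as text.
     */
    public static boolean isOutOfOrder(String lastFragno, String lastSeqno,
                                       String newFragno, String newSeqno) {
        int fragCompare = newFragno.compareTo(lastFragno);
        if (fragCompare != 0) {
            return fragCompare < 0;                 // older fragment => out of order
        }
        return newSeqno.compareTo(lastSeqno) <= 0;  // same fragment: sequence must strictly advance
    }

    public static void main(String[] args) {
        // The second event carries an older fragment number, so it is flagged as out of order
        System.out.println(isOutOfOrder("00127A53000034C80007", "00127A53000034E00110",
                                        "00127A53000034C80001", "00127A53000034E00105")); // true
    }
}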
The schema for the pipeline is maintained only in the Schema Registry and Datomic, both of which can be updated at runtime without any application-level changes.
7. Data Pipeline Testing and Monitoring
The primary concern for any data processing pipeline is the health of the data flowing through the system. The overall health of a pipeline can be evaluated as a combination of multiple attributes such as data loss, throughput, latency and error rates. These metrics are helpful only when we are able to isolate the root cause to a component, process, job or configuration. We will discuss two major areas: how we gained confidence in the pipeline before we went live, and how we maintained that confidence after deployment.
7.1 Pre-deployment Testing
We can never underestimate the importance of unit and component integration tests; however, writing an end-to-end (E2E) test for a real-time streaming pipeline is a different ball game altogether. Points of failure increase with dependencies: source, sink, components and environments. We followed a few guiding principles for automating E2E tests:
➢ Addition of new tests should be easy
➢ Support for multiple sources
➢ Independent tests, parallel runs, minimum run time
➢ Easy and fast post-run analysis
➢ High configurability, granular control
Figure 5 Data Pipeline Automated Testing
We used Java 1.8 with TestNG as the basic test framework for automation, and contributed to open source DolphinNG ([4], [5]) for advanced reporting and analysis. To isolate errors and enable swift debugging, the outputs after each filter had to be verified. Figure 5 gives a detailed view of the interactions between the test automation framework and the data pipeline.
Below is the anatomy of an E2E automation test:
1. Start the Kafka message aggregator, which listens to all messages from this point on:

@BeforeClass
public void startKafkaAggregatorListening() {
    aggregator = new KafkaMessageAggregator(configuration);
}
2. Create events, actual or simulated, to populate the raw CDC topics and collect a unique id:

@Test
public void joinerTest(String param1, String param2) throws Exception {
    String uniqueId = createEvents(param1, param2, configuration);
}

3. Filter the aggregator messages by uniqueId and create a list:

List<GenericRecord> joinerMessagesForUniqueId =
    KafkaMessageConsumer.filterRecords(
        aggregator.getMessagesForTopic(KAFKA_JOINER_KEY), uniqueId);
4. Verifications:

debugAtSource
checkForDuplicatesOnAllTopics
verifyValidityOfMessagesCollectedForSizeAndData
verifyOutOfOrderEvents
verifyUniqueIdAtSplunk
verifyUniqueIdAtDatomic
verifyDataParity
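As an example of what one of these verifications might look like (the real helpers live in our internal framework, so this is a hedged stand-in), a duplicate check over the collected event ids can be as simple as:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.testng.Assert;

/** Hedged stand-in for a verification step: fail if the same eventid was seen twice on a topic. */
public class DuplicateCheck {

    public static void checkForDuplicates(List<String> eventIdsOnTopic, String topicName) {
        Set<String> seen = new HashSet<>();
        for (String eventId : eventIdsOnTopic) {
            // Set.add returns false when the id was already present, i.e. a duplicate
            Assert.assertTrue(seen.add(eventId),
                    "Duplicate event " + eventId + " found on topic " + topicName);
        }
    }
}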
With the volume of data flowing in our pipeline, we were adding tests daily and the complexity kept increasing. Tests were run periodically and we were getting over 100 reports per day. TestNG reports were not efficient for analysis; we wanted to quickly analyze errors and log issues for them. DolphinNG, a TestNG add-on, was integrated with the test automation suite to free ourselves from all manual intervention. It clubs failures, reports root causes and automatically creates JIRA tickets.
7.2 Post-deployment Monitoring
For post-deployment monitoring, it was essential to instrument, annotate, and organize our telemetry, or else it would become very difficult to separate primary concerns from other infrastructure metrics such as CPU utilization, disk space, and so forth. The standard metrics that we wanted to capture were latency, input/output throughput, data integrity and errors. The front runners for such dashboarding and alerting were Splunk and Wavefront. Splunk concentrates on application metrics, while Wavefront allows both system and application metrics. As we wanted application metrics and solid debugging capabilities, we went with Splunk 6.2.1 [6].
Figure 6 Monitoring framework
In order to isolate issues and find their root causes, we needed to capture metrics at all stages. Each stage of the pipeline logged an audit entry to Splunk with event_code, stage_timestamp, output_checksum, stage_number and a few other values. Splunk forwarders and log4j appenders were used in the pipeline components to log the auditing metrics to a dedicated Splunk index. For the Joiner and Transformer components, we used appenders to avoid installing forwarders on all data nodes (Figure 6).
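A minimal sketch of such an audit entry, assuming a plain log4j logger whose output reaches Splunk via a forwarder or appender; the logger name and key=value layout are assumptions chosen to keep the entries easy to search.

import org.apache.log4j.Logger;

/** Minimal sketch: emit one audit entry per event per pipeline stage in a Splunk-friendly key=value format. */
public class PipelineAuditLogger {

    // A dedicated logger name lets the appender route these entries to the audit index
    private static final Logger AUDIT = Logger.getLogger("pipeline.audit");

    public static void logStage(String eventCode, int stageNumber, String outputChecksum) {
        long stageTimestamp = System.currentTimeMillis();
        AUDIT.info(String.format(
                "event_code=%s stage_number=%d stage_timestamp=%d output_checksum=%s",
                eventCode, stageNumber, stageTimestamp, outputChecksum));
    }

    public static void main(String[] args) {
        // e.g. the Joiner (stage 2) records that it has emitted event 00127A53000034E00110
        logStage("00127A53000034E00110", 2, "9f2c4d7a");
    }
}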
The details captured in Splunk also allowed us to perform data integrity monitoring. With the help of event codes and stage numbers, data loss could be detected even though the input-to-output event ratio is not 1:1 throughout the pipeline. A Splunk dashboard was created to capture data loss at each and every stage of the pipeline.
For latency, 95th percentile numbers were used to derive insights at each stage. For throughput (TPS), absolute throughput was measured and plotted in Splunk dashboards (Figure 7). Splunk alerts were created on top of these dashboards for input TPS, data loss occurrences and latency breaches.
Figure 7 Splunk Dashboards
8. Outcomes
We were able to take multiple pipelines to production using the above framework, maintaining the following KPIs:
➢ Bootstrap populated 10 million records into Datomic in under 15 minutes
➢ E2E latency remains < 60 sec, with exceptions during high-volume inputs
➢ A pipeline with a setup of 3 Kafka brokers, 5 Cassandra instances and 20 input tables (avg. 25 columns) processes 100 TPS with sub-minute latency
➢ With DolphinNG smart reporting and Splunk alerting, there is no manual intervention for pipeline monitoring
➢ Onboarding a new table only needs config changes
9. Learnings
1. Race conditions, data corruption- As we had different Joiners processing events from different tables, we started running into race conditions resulting in data loss or stale data. To fix this issue we wrote transaction functions in Datomic that ensured atomicity over a set of commands. This, along with the handling of out-of-order events, prevented the data from being corrupted.
2. Data loss at bootstrap- The retention period for the CDC tables was 24 hours, which meant events had to be consumed within that time-frame or there would be data loss. The first bootstrap design failed to clear the performance markers and was redesigned to execute in steps, as explained earlier.
3. Zero-batch processing time in Spark- Spark 1.6.1 performance degrades over time: the size of the metadata passed to the executors keeps increasing, and as a result batches with 0 events take 2-3 s to complete. This issue is reported to have been fixed in the latest version.
10. Conclusion
In this paper, we have tried to consolidate our implementation and learnings from building a real-time ETL pipeline which allows replay, data persistence, automated monitoring, testing and schema evolution. It gives a glimpse into the latest stream processing technologies like Kafka and Spark, a distributed database like Datomic, rich configurations using Morphline, and DolphinNG, a TestNG add-on for smart reporting. For future work, we want to make onboarding self-serve; open source the logical components and the Kafka message aggregator; optimize the KPIs; and experiment with Spark Structured Streaming.
References
[1] Track Data Changes (SQL Server). https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/track-data-changes-sql-server
[2] Kafka Connect JMS Sink. http://docs.datamountaineer.com/en/latest/jms.html
[3] Splunk: Distributed Deployment Manual. https://docs.splunk.com/Documentation/Splunk/7.0.1/Deploy/Componentsofadistributedenvironment
[4] DolphinNG. https://github.com/basavaraj1985/DolphinNG
[5] DolphinNG Sample Project. https://github.com/basavaraj1985/UseDolphinNG
[6] Splunk Logging for Java. http://dev.splunk.com/view/splunk-logging-java/SP-CAAAE2K
[7] Oracle GoldenGate for MySQL. https://docs.oracle.com/goldengate/1212/gg-winux/GIMYS/toc.htm
[8] Pipe and Filter Architectures. http://community.wvu.edu/~hhammar/CU/swarch/lecture%20slides/slides%204%20sw%20arch%20styles/supporting%20slides/SWArch-4-PipesandFilter.pdf
[9] Confluent 2.0.1 Documentation. https://docs.confluent.io/2.0.1/platform.html
[10] Datomic. http://docs.datomic.com/index.html
[11] Enabling CDC on Microsoft SQL Server. https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/enable-and-disable-change-data-capture-sql-server
[12] Avro Messages. https://avro.apache.org/docs/1.7.7/gettingstartedjava.html
Real-time​ ​Data-Pipeline​ ​from​ ​inception​ ​to​ ​production

More Related Content

What's hot

Visual Basic.Net & Ado.Net
Visual Basic.Net & Ado.NetVisual Basic.Net & Ado.Net
Visual Basic.Net & Ado.Net
FaRid Adwa
 
COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
 COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV... COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
Nexgen Technology
 
Vb.net session 05
Vb.net session 05Vb.net session 05
Vb.net session 05
Niit Care
 
S18 das
S18 dasS18 das
Combining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesCombining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information services
CloudTechnologies
 
Towards secure and dependable storage service in cloud
Towards secure and dependable storage service in cloudTowards secure and dependable storage service in cloud
Towards secure and dependable storage service in cloud
sibidlegend
 
Introduction to ado
Introduction to adoIntroduction to ado
Introduction to ado
Harman Bajwa
 
Rdbms Practical file diploma
Rdbms Practical file diploma Rdbms Practical file diploma
Rdbms Practical file diploma
mustkeem khan
 
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
lyn kurian
 
Updating and Scheduling of Streaming Web Services in Data Warehouses
Updating and Scheduling of Streaming Web Services in Data WarehousesUpdating and Scheduling of Streaming Web Services in Data Warehouses
Updating and Scheduling of Streaming Web Services in Data Warehouses
International Journal of Science and Research (IJSR)
 
Ado.net session10
Ado.net session10Ado.net session10
Ado.net session10
Niit Care
 
Cloud Technology: Virtualization
Cloud Technology: VirtualizationCloud Technology: Virtualization
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
IJECEIAES
 
Architecture of integration services
Architecture of integration servicesArchitecture of integration services
Architecture of integration services
Slava Kokaev
 
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
1crore projects
 
ANG-GridWay-Poster-Final-Colorful-Bright-Final0
ANG-GridWay-Poster-Final-Colorful-Bright-Final0ANG-GridWay-Poster-Final-Colorful-Bright-Final0
ANG-GridWay-Poster-Final-Colorful-Bright-Final0
Jingjing Sun
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with Cloud
IJAAS Team
 
Graphical display of statistical data on Android
Graphical display of statistical data on AndroidGraphical display of statistical data on Android
Graphical display of statistical data on Android
Didac Montero
 
Discover Database
Discover DatabaseDiscover Database
Discover Database
Wayne Weixin
 
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET Journal
 

What's hot (20)

Visual Basic.Net & Ado.Net
Visual Basic.Net & Ado.NetVisual Basic.Net & Ado.Net
Visual Basic.Net & Ado.Net
 
COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
 COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV... COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
 
Vb.net session 05
Vb.net session 05Vb.net session 05
Vb.net session 05
 
S18 das
S18 dasS18 das
S18 das
 
Combining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesCombining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information services
 
Towards secure and dependable storage service in cloud
Towards secure and dependable storage service in cloudTowards secure and dependable storage service in cloud
Towards secure and dependable storage service in cloud
 
Introduction to ado
Introduction to adoIntroduction to ado
Introduction to ado
 
Rdbms Practical file diploma
Rdbms Practical file diploma Rdbms Practical file diploma
Rdbms Practical file diploma
 
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
 
Updating and Scheduling of Streaming Web Services in Data Warehouses
Updating and Scheduling of Streaming Web Services in Data WarehousesUpdating and Scheduling of Streaming Web Services in Data Warehouses
Updating and Scheduling of Streaming Web Services in Data Warehouses
 
Ado.net session10
Ado.net session10Ado.net session10
Ado.net session10
 
Cloud Technology: Virtualization
Cloud Technology: VirtualizationCloud Technology: Virtualization
Cloud Technology: Virtualization
 
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
 
Architecture of integration services
Architecture of integration servicesArchitecture of integration services
Architecture of integration services
 
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
 
ANG-GridWay-Poster-Final-Colorful-Bright-Final0
ANG-GridWay-Poster-Final-Colorful-Bright-Final0ANG-GridWay-Poster-Final-Colorful-Bright-Final0
ANG-GridWay-Poster-Final-Colorful-Bright-Final0
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with Cloud
 
Graphical display of statistical data on Android
Graphical display of statistical data on AndroidGraphical display of statistical data on Android
Graphical display of statistical data on Android
 
Discover Database
Discover DatabaseDiscover Database
Discover Database
 
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
 

Similar to Real time data-pipeline from inception to production

Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice Way
Josef Adersberger
 
Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_
Tina Zhang
 
Keynote 1 the rise of stream processing for data management &amp; micro serv...
Keynote 1  the rise of stream processing for data management &amp; micro serv...Keynote 1  the rise of stream processing for data management &amp; micro serv...
Keynote 1 the rise of stream processing for data management &amp; micro serv...
Sabri Skhiri
 
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
DataStax
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
Databricks
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
confluent
 
Confluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern AnalyticsConfluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern Analytics
confluent
 
Engineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
Engineering Wunderlist for Android - Ceasr Valiente, 6WunderkinderEngineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
Engineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
DroidConTLV
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
Michael Häusler
 
BSA 385 Week 3 Individual Assignment Essay
BSA 385 Week 3 Individual Assignment EssayBSA 385 Week 3 Individual Assignment Essay
BSA 385 Week 3 Individual Assignment Essay
Tara Smith
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
DataWorks Summit/Hadoop Summit
 
Enabling SQL Access to Data Lakes
Enabling SQL Access to Data LakesEnabling SQL Access to Data Lakes
Enabling SQL Access to Data Lakes
Vasu S
 
Osb Bam Integration
Osb Bam IntegrationOsb Bam Integration
Osb Bam Integration
guest6070853
 
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data AnalyticsStrata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
SingleStore
 
Tpl dataflow
Tpl dataflowTpl dataflow
Tpl dataflow
Alex Kursov
 
A cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring dataA cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring data
redpel dot com
 
Materialize: a platform for changing data
Materialize: a platform for changing dataMaterialize: a platform for changing data
Materialize: a platform for changing data
Altinity Ltd
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Cloudera, Inc.
 
Test Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful ToolsTest Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful Tools
mcthedog
 
Disadvantages Of Robotium
Disadvantages Of RobotiumDisadvantages Of Robotium
Disadvantages Of Robotium
Susan Tullis
 

Similar to Real time data-pipeline from inception to production (20)

Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice Way
 
Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_
 
Keynote 1 the rise of stream processing for data management &amp; micro serv...
Keynote 1  the rise of stream processing for data management &amp; micro serv...Keynote 1  the rise of stream processing for data management &amp; micro serv...
Keynote 1 the rise of stream processing for data management &amp; micro serv...
 
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Confluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern AnalyticsConfluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern Analytics
 
Engineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
Engineering Wunderlist for Android - Ceasr Valiente, 6WunderkinderEngineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
Engineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
BSA 385 Week 3 Individual Assignment Essay
BSA 385 Week 3 Individual Assignment EssayBSA 385 Week 3 Individual Assignment Essay
BSA 385 Week 3 Individual Assignment Essay
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Enabling SQL Access to Data Lakes
Enabling SQL Access to Data LakesEnabling SQL Access to Data Lakes
Enabling SQL Access to Data Lakes
 
Osb Bam Integration
Osb Bam IntegrationOsb Bam Integration
Osb Bam Integration
 
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data AnalyticsStrata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
 
Tpl dataflow
Tpl dataflowTpl dataflow
Tpl dataflow
 
A cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring dataA cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring data
 
Materialize: a platform for changing data
Materialize: a platform for changing dataMaterialize: a platform for changing data
Materialize: a platform for changing data
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
 
Test Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful ToolsTest Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful Tools
 
Disadvantages Of Robotium
Disadvantages Of RobotiumDisadvantages Of Robotium
Disadvantages Of Robotium
 

Recently uploaded

Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
2. Operations Strategy in a Global Environment.ppt
2. Operations Strategy in a Global Environment.ppt2. Operations Strategy in a Global Environment.ppt
2. Operations Strategy in a Global Environment.ppt
PuktoonEngr
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
IJECEIAES
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
RadiNasr
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
mamunhossenbd75
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
NidhalKahouli2
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
bank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdfbank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdf
Divyam548318
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
JamalHussainArman
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
yokeleetan1
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
SyedAbiiAzazi1
 

Recently uploaded (20)

Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
 
2. Operations Strategy in a Global Environment.ppt
2. Operations Strategy in a Global Environment.ppt2. Operations Strategy in a Global Environment.ppt
2. Operations Strategy in a Global Environment.ppt
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
bank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdfbank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdf
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 

Real time data-pipeline from inception to production

  • 1. Real-time​ ​Data-Pipeline​ ​from​ ​inception​ ​to​ ​production Shreya​ ​Mukhopadhyay Intuit Bengaluru,​ ​India shreya_mukhopadhyay3@intuit.com Ashwini​ ​Vadivel Intuit Bengaluru,​ ​India ashwini_vadivel@intuit.com ​ ​​ ​​ ​​ ​Basavaraj​ ​M Intuit Bengaluru,​ ​India basavaraj_m@intuit.com Abstract Big Data being the buzzword of the industry, organizations want to arrive at actionable insights from their data quickly. Both historic and incoming data needs to be ingested through data pipelines onto a single data lake to help derive real-time analytics. To build real time streaming pipelines, we need to take care of data veracity, reliability of the system, out of order events, complex transformations and easier integration for future purposes. This paper will cover our experience of building such real-time pipelines for financial data, the various open source libraries we experimented with and​ ​the​ ​impacts​ ​we​ ​saw​ ​in​ ​a​ ​very​ ​brief​ ​time. 1.​ ​Introduction Intuit offers a plethora of financial products, which helps small and medium businesses in bookkeeping, financial management and tax filing. These products can have multiple data sources- customer entered data, bank feeds, payment, payroll and tax information from Federal Agencies among others. To build insights for our customers, auditors, accountants and internal customer care executives,​ ​a​ ​unified​ ​data​ ​lake​ ​is​ ​needed. This data lake needs to be fed with both real time and historical data, from internal and external sources. We wanted to build a framework for ETL(Extract, Transform, Load) data pipelines, which can be used across the organization to stream data and populate the data lake. Raw data from multiple sources had to be transformed into efficient formats before streaming and storage. The main guiding principles for such a framework were near real-time stateful transformation, data streaming with integrity, high availability, scalability​ ​and​ ​minimal​ ​latency. 2.​ ​Architecture In order to meet the above standards, the framework should be able to handle complex tasks like ingestion, persistence, processing and transformation. After considering multiple distributed application architectures, we narrowed down to Unix pipes and filter architecture. It was best suited to solve the above requirements as it is a simple yet powerful and robust system architecture. It can have any number of components (filters) to transform or filter data before passing it via connector(pipes)​ ​to​ ​other​ ​components​ ​(​Figure​ ​1​) Figure​ ​1​ ​Pipe​ ​and​ ​Filter​ ​Architecture A filter can have any number of input pipes and any number of output pipes. The pipe is the connector that passes data from one filter to the next. It is a directional stream of data, and is usually implemented by a data buffer to store all data, until the next filter has time to process it. The source and sink are the producers and consumers respectively and can be static files, any database or user​ ​input.​ ​(Refer​ ​to​ ​​[8]​). cat sample.txt | ​grep -v a | ​sort -r is a simple unix command representational of the architecture. Here sample.txt is the source and console is the sink. Commands ​cat, grep –v a and sort –r are filters and | is the pipe which passes unidirectional data between​ ​these​ ​filters. Our real-time streaming architecture was designed using the same logic of pipes. 
It ensured the following: - Support​ ​for​ ​multiple​ ​sources​ ​and​ ​sinks - Easier future enhancements by rearrangement​ ​of​ ​filters - Smaller processing steps ensuring easy reusability - Explicit storage of intermediate results for further​ ​processing - Scalability​ ​support 3.​ ​Pipeline​ ​Components Our first use-case had a relational Microsoft SQL database (Windows 2012 R2 Server) as source and Apache ActiveMQ as sink. The sink application contributes to the data lake, where real-time data can help generate in-product recommendations, help identify fraudsters among others through Machine Learning(ML) models. The source database had over 100+ tables in a single schema. The sink being a JMS queue accepted only text messages​ ​in​ ​certain​ ​formats.
  • 2. Shreya​ ​Mukhopadhyay,​ ​Ashwini​ ​Vadivel,​ ​Basavaraj​ ​M Figure​ ​2​ ​Pipeline​ ​Component Working with the product teams, we were able to create many-to-many input/output transformation maps. Overall the inputs came from 10 dynamic and 6 static tables which had to be transformed to 3 types​ ​of​ ​events. Figure 2 ​gives the general idea of data flow in the ETL pipeline. The choice of Kafka (Confluent 2.0.1 ​[9]​) for the pipes was a simple one as it well known for its ability to process high volume data. It can publish-subscribe messages and streams and is meant to be durable, fast, and scalable. The grey arrows before/after each component represents Kafka​ ​topics. As we delve deeper into individual components, we will elaborate the technologies and open source libraries that were used to get this pipeline to production. 4.​ ​Data​ ​Ingestion​ ​and​ ​delivery The first step in every pipeline is ingestion, wherein data can be ingested in real time or in batches. In our case, we needed real time streaming and therefore real time ingestion. The first use case of our data pipeline had Microsoft SQL(MSSQL) as data source. and it supports Change Data Capture (CDC) technology to capture the changes in the source at real time. The Oracle Golden Gate (GG) solution to capture the changes at real time had an issue. The issue was, after each database switchover GG started to read from the very beginning, thereby creating huge data loads on the pipeline. So, we chose MSSQL CDC mechanism to capture insert, update and delete events. Kafka source connectors pulled these CDC events, converted them to avro[​12​] messages and published them to Kafka topics. GG was later used for Oracle and MySQL database sources where the above issues​ ​were​ ​not​ ​seen. 4.1​ ​Connectors The source and sink connectors are the entry and exit points of our data pipeline. The source connectors are responsible for bringing all change data in, streaming to the pipeline and sink connectors are responsible to pass the output transformed​ ​data​ ​to​ ​sink/data​ ​lake. In the following sections we will discuss our first use case- source connector for MSSQL and sink connector​ ​for​ ​ActiveMQ. 4.1.1​ ​Source​ ​Connector For our use-case of MSSQL, we wanted to capture all data manipulation operations on the database tables and MSSQL server CDC provided this technology. The source of change data is the server transaction log and they can be enabled on an individual table, chosen fields or on the entire schema​ ​​[​11​]​. A new schema and captured columns get created once we enable CDC’s. 5 additional columns- __$start_lsn, __$end_lsn, __$seqval, __$operation, __$update_mask are added per table. These columns will allow us to uniquely identify a transaction​ ​within​ ​a​ ​commit​ ​and​ ​replay. Once the CDC is set up, Kafka connect is used to pull data from these tables using shared JDBC connections, massage it and then publish onto the CDC Kafka topics. Each table has a single data source i.e. the CDC table. The final output onto the topic is in the form of an avro whose schemas are stored in Confluent Schema Registry. 
Below is a sample Avro message for an INSERT:

{
  "header": {
    "source": "MSSQLServer",
    "seqno": "00127A53000034E00110",
    "fragno": "00127A53000034C80007",
    "schema": "mssql_database_name",
    "table": "TABLENAME",
    "timestamp": 1494839698233,
    "eventtype": "INSERT",
    "shardid": "SHARD0",
    "eventid": "00127A53000034E00110",
    "primarykey": "ID"
  },
  "payload": {
    "beforerecord": null,
    "afterrecord": {
      "afterrecord": {
        "ID": {
          "long": 245983
        }
      }
    }
  }
}

Another implementation of the connector can use Oracle Golden Gate to publish MySQL events to Kafka topics [7].

Figure 3 Real Time Data Pipeline

Figure 3 gives a detailed view of the pipeline with all the open source libraries used. The remaining components, the Sink connector, Joiner and Transformer, are discussed in detail in the following sections.

4.1.2 Sink Connector

The JMS sink connector allows us to extract entries from a Kafka topic with the Connect Query Language (CQL) driver and pass them to a JMS topic/queue. The connectors, one for each type of event, de-duplicate and take the latest event, which is identified by the combination of fragment and sequence numbers added by the source connector. These messages are then converted to text messages using the JMS API and written onto the queue. The input format in our case is Avro from Kafka and the output is a text message. Details of the configuration and the Kafka Connect JMS sink are well explained in [2].

4.2 Bootstrap

Data from the source connectors is delta only; to support stateful transformations, complete data is needed. We bootstrap historic data to Cassandra so that the outgoing events are complete. It also aids in replay and schema evolution. The next two components in the pipeline, Joiner and Transformer, use this Cassandra data to construct a complete and stateful event.

Bootstrap has three stages. Each stage is a standalone Java program which is run separately and serially in the following order before onboarding any new source to the pipeline:

➢ Populate Kafka topics - Using JDBC connections, SQL queries for historic data are run for each table and the data is populated onto the raw CDC topics in the same Avro format discussed in the Source Connector section; all these events are inserts.
➢ Populate Datomic tables - The bootstrap Java program reads from the input Kafka topics and populates the corresponding Datomic tables.
➢ Populate Datomic references - In this stage, the program populates references for the various records in the different Datomic tables.

5. Data Joiner - Joiner

The next component in our pipeline does the task of joining the events from multiple Kafka streams to form a de-normalised view, ready to be transformed. The Joiner is a set of Spark jobs that process events from their respective Kafka streams.

The joins are performed based on a joiner configuration that is provided to the job at startup. This config is used to create the joiner output events and to define the joins in the DB, i.e. Datomic in our case.

5.1 Datomic

Datomic is a fully transactional, distributed database that avoids the compromises and losses of many NoSQL solutions. In addition, it offers flexibility and power over the traditional RDBMS model.

➢ Datomic stores a record of immutable facts which are never updated in place, and all data is retained by default, giving built-in auditing and the ability to query history.
➢ Caching is built-in and can be maintained at the client side, which makes reads faster.
➢ Datomic provides rich schema and query capabilities on top of a storage of your choice. A storage 'service' can be anything from a SQL database, to a key/value store, to a true service like Amazon's DynamoDB.
➢ Schema evolution can be handled easily with Datomic as it follows an EAVT (Entity, Attribute, Value, Transaction) structure.
➢ Joins are handled inherently, as references to joined rows are always maintained.
➢ ACID-compliant transactions.

We used Datomic 0.9.5561 (refer to [10]) on top of a Cassandra cluster for storage.

Before the joining can be performed, we needed to ensure that the incoming events are complete rows (since we want to support both CDC and Golden Gate events).

5.2 Reconciliation

Every event processed by the Joiner is persisted at our end in a Datomic DB, using which we can construct the complete row even when partial data comes in through the CDC events. When the Joiner receives an event, it reads the previous state for the same record from our database. It then applies the change set on it to construct the latest, complete row and pushes it back in.

5.3 Joining

Only master table entries can translate into output events. If the incoming event belongs to a master table, then on fetching its value from Datomic we get the complete output entity along with all the referenced child entries (thanks to Datomic!). If it is from a child table, then we fetch its corresponding master values to form the output event. A single table can be a master and/or a child, based on which the number of output events formed may vary (each corresponding to a different entity at the destination).

The output events are now a denormalized view of all the tables that are required to form the destination entities.

6. Data Transformation - Transformer

Once the denormalized event is generated by the Joiner, it is pushed into the next set of Kafka topics. These topics are then consumed by the Transformer, another Spark job, whose sole responsibility is data transformation. The most common operations include:

➢ Mapping between the source and destination fields
➢ Deriving new field values based on business logic
➢ Validating mandatory fields and other business rules
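As a conceptual illustration of these three operations, the sketch below applies a mapping, a derivation and a validation to a single denormalized event represented as a map. The field names and the business rule are hypothetical; the actual pipeline expresses such steps as Morphline commands, described next.

import java.util.HashMap;
import java.util.Map;

public class TransformStep {
    // Applies a hypothetical mapping, derivation and validation to one denormalized event
    public static Map<String, Object> transform(Map<String, Object> source) {
        Map<String, Object> out = new HashMap<>();

        // 1. Mapping between source and destination fields (names are illustrative)
        out.put("customerId", source.get("CUST_ID"));
        out.put("invoiceAmount", source.get("AMOUNT"));

        // 2. Deriving a new field value based on business logic (rule is illustrative)
        double amount = ((Number) source.getOrDefault("AMOUNT", 0)).doubleValue();
        out.put("highValue", amount > 10_000);

        // 3. Validating mandatory fields and other business rules
        if (out.get("customerId") == null) {
            throw new IllegalArgumentException("Mandatory field customerId is missing");
        }
        return out;
    }
}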
The transformation logic is handled through an open source framework called Morphline. It lets us define a series of commands, each a transformation, which are applied sequentially to the event being processed (Figure 4).
Figure 4 Morphline Illustration

The transformations are defined in an external configuration, in the format expected by the Morphline SDK, as shown in the sample transformation sheet below:

morphlines : [
  {
    id : morphline
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        command1 {
          attr1 : value
          attr2 : value
        }
      }
      {
        command2 {
          attr1 : value
        }
      }
    ]
  }
]

6.1 Checkpointing

We chose to do our own checkpointing rather than relying on Spark for two main reasons:

➢ Default Spark checkpointing requires us to clear the checkpoint directory on HDFS whenever new code is deployed. This is an operational overhead and is prone to errors.
➢ Saving checkpoint data in Datomic also helps us replay messages on demand.

The Joiner and Transformer both pick up the latest offset from the metadata table at startup and process the Kafka streams from that point onwards. They save the offset in Datomic after processing a batch, along with the metadata information which the event contains.

6.2 Features - Replay, Out-of-Order Handling, Schema Evolution

Replaying messages from the Kafka streams is required whenever we encounter a technical or logical issue. For the Spark jobs, the checkpoint data is available in Datomic; based on the time from which replay is required, the corresponding offset is fetched. The latest offset value in Datomic is then set to this value and the components are restarted. Once a message is replayed, it flows through all the downstream components and into the sink.

Every event from the source has a fragment number and a sequence number. This combination is unique to every event, and the Joiner uses it to detect out-of-order events. The Transformer cannot use this value because multiple streams are merged in the Joiner output. Instead, it uses a transaction id which is punched into the event by the Joiner. This transaction id is generated by Datomic for every insert/update operation and is sequential in nature.

The schema for the pipeline is maintained only in the Schema Registry and Datomic, both of which can be updated at runtime without any application-level changes.

7. Data Pipeline Testing and Monitoring

The primary concern for any data processing pipeline is the health of the data flowing through the system. The overall health of a pipeline can be evaluated as a combination of multiple attributes such as data loss, throughput, latency and error rates. These metrics are helpful only when we are able to isolate the root cause: a component, process, job or configuration. We will discuss two major areas: how we gained confidence in the pipeline before we went live, and how we kept that confidence after deployment.

7.1 Pre-deployment Testing

We can never undermine the importance of unit and component integration tests; however, writing an end-to-end (E2E) test for a real-time streaming pipeline is a different ball game altogether. The points of failure increase with the number of dependencies: sources, sinks, components and environments.
We followed a few guiding principles for automating E2E tests:

➢ Addition of new tests should be easy
➢ Support for multiple sources
➢ Independent tests, parallel runs, minimum run time
➢ Easy and fast post-run analysis
➢ High configurability, granular control
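One way to honour the first and third principles (easy addition of tests, parallel runs) is to keep the E2E scenarios data-driven, so that a new scenario is just a new data row. The sketch below is a hedged TestNG example with hypothetical table names and helper methods, not our actual test suite.

import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

public class PipelineE2ETest {
    // Each row is one E2E scenario: a hypothetical source table and event type.
    // Adding a new test case is just adding a row; parallel = true runs rows concurrently.
    @DataProvider(name = "scenarios", parallel = true)
    public Object[][] scenarios() {
        return new Object[][] {
            {"TABLENAME", "INSERT"},
            {"TABLENAME", "UPDATE"},
        };
    }

    @Test(dataProvider = "scenarios")
    public void endToEndFlow(String table, String eventType) {
        // Hypothetical helpers: push a synthetic CDC event, then verify it at each stage
        // e.g. String uniqueId = createEvents(table, eventType, configuration);
        //      verifyAtSink(uniqueId);
    }
}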
Figure 5 Data Pipeline Automated Testing

We used Java 1.8 with TestNG as the basic test framework for automation and contributed DolphinNG ([4], [5]) to open source for advanced reporting and analysis. To isolate errors and enable swift debugging, the output after each filter had to be verified. Figure 5 gives a detailed view of the interactions between the test automation framework and the data pipeline.

Below is the anatomy of an E2E automation test:

1. Start the Kafka Message Aggregator - listen to all messages henceforth

@BeforeClass
public void startKafkaAggregatorListening() {
  aggregator = new KafkaMessageAggregator(configuration);
}

2. Create events - actual or simulated - to populate the raw CDC topics, and collect the unique id

@Test
public void joinerTest(String param1, String param2) throws Exception {
  String uniqueId = createEvents(param1, param2, configuration);
}

3. Filter the aggregator messages by uniqueId and create a list

List<GenericRecord> joinerMessagesForUniqueId =
  KafkaMessageConsumer.filterRecords(
    aggregator.getMessagesForTopic(KAFKA_JOINER_KEY), uniqueId);

4. Verifications

debugAtSource
checkForDuplicatesOnAllTopics
verifyValidityOfMessagesCollectedForSizeAndData
verifyOutOfOrderEvents
verifyUniqueIdAtSplunk
verifyUniqueIdAtDatomic
verifyDataParity

With the volume of data flowing in our pipeline, we were adding tests daily and the complexity kept increasing. Tests were run periodically and we were getting over 100 reports per day. TestNG reports were not efficient for analysis; we wanted to analyze errors and log issues quickly. DolphinNG, a TestNG add-on, was integrated with the test automation suite to free ourselves from all manual intervention. It clubs failures, reports root causes and automatically creates JIRA tickets.

7.2 Post-deployment Monitoring

For post-deployment monitoring, it was essential to instrument, annotate and organize our telemetry, or else it would become very difficult to separate primary concerns from other infrastructure metrics such as CPU utilization, disk space and so forth. The standard metrics we wanted to capture were latency, input/output throughput, data integrity and error rates. The front runners for such dashboarding and alerting were Splunk and Wavefront. Splunk concentrates on application metrics, while Wavefront supports both system and application metrics. As we wanted application metrics and solid debugging capabilities, we went with Splunk 6.2.1 [6].
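To give a feel for this instrumentation, the sketch below emits a structured audit entry through a log4j logger, which a Splunk forwarder or appender can then ship to a dedicated index. The logger name and example values are illustrative; the field names follow the audit entry described in the next paragraphs.

import org.apache.log4j.Logger;

public class PipelineAudit {
    // Dedicated audit logger; a log4j appender or Splunk forwarder ships these lines to Splunk
    private static final Logger AUDIT = Logger.getLogger("pipeline.audit");

    // Logs one audit entry as key=value pairs, which Splunk can extract automatically
    public static void audit(String eventCode, int stageNumber, String outputChecksum) {
        AUDIT.info("event_code=" + eventCode
                + " stage_number=" + stageNumber
                + " stage_timestamp=" + System.currentTimeMillis()
                + " output_checksum=" + outputChecksum);
    }

    public static void main(String[] args) {
        // Hypothetical event code, stage number and checksum
        audit("JOINER_OUT", 3, "d41d8cd98f00b204e9800998ecf8427e");
    }
}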
Figure 6 Monitoring framework

In order to isolate issues and find their root causes, we needed to capture metrics at all stages. Each stage of the pipeline logged an audit entry with event_code, stage_timestamp, output_checksum, stage_number and a few other values to Splunk. Splunk forwarders and log4j appenders were used in the pipeline components to log the audit metrics to Splunk with a dedicated Splunk index. For the Joiner and Transformer components, we used appenders to avoid installing forwarders on all data nodes (Figure 6).

The details captured in Splunk also allowed us to perform data integrity monitoring. With the help of event codes and stage numbers, data loss could be detected even though the input-to-output event ratio is not 1:1 throughout the pipeline. A Splunk dashboard was created to capture data loss at each and every stage of the pipeline.

For latency, 95th percentile numbers were used to derive insights at each stage. For throughput, absolute transactions per second (TPS) were measured and plotted in Splunk dashboards (Figure 7). Splunk alerts were created on top of the dashboards to alert on input TPS, data loss occurrences and latency breaches.

Figure 7 Splunk Dashboards

8. Outcomes

We were able to take multiple pipelines to production using the above framework, maintaining the following KPIs:

➢ Bootstrap populated 10 million records to Datomic in under 15 minutes
➢ E2E latency remains under 60 seconds, with exceptions during high-volume inputs
➢ A pipeline with 3 Kafka brokers, 5 Cassandra instances and 20 input tables (average 25 columns) processes 100 TPS with sub-minute latency
➢ With DolphinNG smart reporting and Splunk alerting, there is no manual intervention for pipeline monitoring
➢ Onboarding a new table needs only configuration changes
9. Learnings

1. Race conditions, data corruption - As we had different Joiners processing events from different tables, we started running into race conditions resulting in data loss or stale data. To fix this issue we wrote transaction functions in Datomic that ensured atomicity over a set of commands. This, along with the handling of out-of-order events, prevented the data from being corrupted.

2. Data loss at bootstrap - The retention period for CDC was 24 hours, which meant events had to be consumed within that time frame to avoid data loss. The first bootstrap design failed to clear the performance markers and was redesigned to execute in steps, as explained earlier.

3. Zero-batch processing time in Spark - Spark 1.6.1 performance degrades over time: the size of the metadata passed to the executors keeps increasing, and as a result even batches with 0 events take 2-3 s to complete. This issue is reported to have been fixed in the latest version.

10. Conclusion

In this paper, we have tried to consolidate our implementation and learnings from building a real-time ETL pipeline which allows replay, data persistence, automated monitoring, testing and schema evolution. It gives a glimpse into stream processing technologies like Kafka and Spark, a distributed database like Datomic, rich configuration using Morphline, and DolphinNG, a TestNG add-on for smart reporting. For future work, we want to make onboarding self-serve; open source the logical components and the Kafka Message Aggregator; optimize the KPIs; and experiment with Spark Structured Streaming.

References

[1] Track Data Changes (SQL Server). https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/track-data-changes-sql-server
[2] Kafka Connect JMS Sink. http://docs.datamountaineer.com/en/latest/jms.html
[3] Splunk: Distributed Deployment Manual. https://docs.splunk.com/Documentation/Splunk/7.0.1/Deploy/Componentsofadistributedenvironment
[4] DolphinNG. https://github.com/basavaraj1985/DolphinNG
[5] DolphinNG Sample Project. https://github.com/basavaraj1985/UseDolphinNG
[6] Splunk Logging for Java. http://dev.splunk.com/view/splunk-logging-java/SP-CAAAE2K
[7] Oracle GG for MySQL. https://docs.oracle.com/goldengate/1212/gg-winux/GIMYS/toc.htm
[8] Pipe and Filter Architectures. http://community.wvu.edu/~hhammar/CU/swarch/lecture%20slides/slides%204%20sw%20arch%20styles/supporting%20slides/SWArch-4-PipesandFilter.pdf
[9] Confluent 2.0.1 Documentation. https://docs.confluent.io/2.0.1/platform.html
[10] Datomic. http://docs.datomic.com/index.html
[11] Enabling CDC on Microsoft SQL Server. https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/enable-and-disable-change-data-capture-sql-server
[12] Avro Messages. https://avro.apache.org/docs/1.7.7/gettingstartedjava.html