Real-time Data-Pipeline from inception to production
Shreya Mukhopadhyay
Intuit
Bengaluru, India
shreya_mukhopadhyay3@intuit.com
Ashwini Vadivel
Intuit
Bengaluru, India
ashwini_vadivel@intuit.com
Basavaraj M
Intuit
Bengaluru, India
basavaraj_m@intuit.com
Abstract
Big Data is the buzzword of the industry, and organizations want to derive actionable insights from their data quickly. Both historic and incoming data need to be ingested through data pipelines into a single data lake to support real-time analytics. Building real-time streaming pipelines requires attention to data veracity, reliability of the system, out-of-order events, complex transformations and ease of integration for future use cases.
This paper covers our experience of building such real-time pipelines for financial data, the various open source libraries we experimented with and the impact we saw in a very short time.
1. Introduction
Intuit offers a plethora of financial products that help small and medium businesses with bookkeeping, financial management and tax filing. These products can have multiple data sources: customer-entered data, bank feeds, payments, payroll and tax information from federal agencies, among others. To build insights for our customers, auditors, accountants and internal customer care executives, a unified data lake is needed.
This data lake needs to be fed with both real-time and historical data, from internal and external sources. We wanted to build a framework for ETL (Extract, Transform, Load) data pipelines that can be used across the organization to stream data and populate the data lake. Raw data from multiple sources had to be transformed into efficient formats before streaming and storage. The main guiding principles for such a framework were near real-time stateful transformation, data streaming with integrity, high availability, scalability and minimal latency.
2. Architecture
In order to meet the above standards, the framework should be able to handle complex tasks like ingestion, persistence, processing and transformation. After considering multiple distributed application architectures, we narrowed down to the Unix pipes-and-filters architecture. It was best suited to the above requirements, as it is a simple yet powerful and robust system architecture. It can have any number of components (filters) that transform or filter data before passing it via connectors (pipes) to other components (Figure 1).
Figure 1 Pipe and Filter Architecture
A filter can have any number of input pipes and any number of output pipes. The pipe is the connector that passes data from one filter to the next. It is a directional stream of data, usually implemented by a data buffer that stores data until the next filter has time to process it. The source and sink are the producer and consumer respectively, and can be static files, any database or user input (refer to [8]).
cat sample.txt | grep -v a | sort -r is a simple Unix command representative of the architecture. Here sample.txt is the source and the console is the sink. The commands cat, grep -v a and sort -r are filters, and | is the pipe which passes unidirectional data between these filters.
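For readers who prefer code to shell, the same idea can be sketched in Java, where each filter is a small function and the pipe is function composition over a stream; the names and data below are purely illustrative and not part of our framework.

import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy illustration of pipes and filters: each stage consumes a stream and produces a new one. */
public class PipesAndFiltersDemo {

    public static void main(String[] args) {
        // Filters analogous to "grep -v a" and "sort -r" in the shell example above
        Function<Stream<String>, Stream<String>> dropLinesWithA =
                s -> s.filter(line -> !line.contains("a"));
        Function<Stream<String>, Stream<String>> sortReverse =
                s -> s.sorted(java.util.Comparator.reverseOrder());

        Stream<String> source = Stream.of("alpha", "beta", "echo", "delta"); // stands in for cat sample.txt

        // The pipe is function composition: the output of one filter feeds the next
        List<String> sink = dropLinesWithA.andThen(sortReverse)
                                          .apply(source)
                                          .collect(Collectors.toList());

        System.out.println(sink); // [echo] -- only lines without 'a', reverse sorted
    }
}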
Our real-time streaming architecture was designed
using the same logic of pipes. It ensured the
following:
- Support for multiple sources and sinks
- Easier future enhancements by rearrangement of filters
- Smaller processing steps ensuring easy reusability
- Explicit storage of intermediate results for further processing
- Scalability support
3. Pipeline Components
Our first use-case had a relational Microsoft SQL database (Windows 2012 R2 Server) as the source and Apache ActiveMQ as the sink. The sink application contributes to the data lake, where real-time data can help generate in-product recommendations and identify fraudsters, among other use cases, through Machine Learning (ML) models. The source database had over 100 tables in a single schema. The sink, being a JMS queue, accepted only text messages in certain formats.
Figure 2 Pipeline Component
Working with the product teams, we were able to create many-to-many input/output transformation maps. Overall, the inputs came from 10 dynamic and 6 static tables, which had to be transformed into 3 types of events.
Figure 2 gives the general idea of data flow in the ETL pipeline. The choice of Kafka (Confluent 2.0.1 [9]) for the pipes was a simple one, as it is well known for its ability to process high-volume data. It supports publish-subscribe messaging and streams and is meant to be durable, fast, and scalable. The grey arrows before/after each component represent Kafka topics.
As we delve deeper into the individual components, we will elaborate on the technologies and open source libraries that were used to get this pipeline to production.
4. Data Ingestion and Delivery
The first step in every pipeline is ingestion, wherein data can be ingested in real time or in batches. In our case, we needed real-time streaming and therefore real-time ingestion. The first use case of our data pipeline had Microsoft SQL Server (MSSQL) as the data source, which supports Change Data Capture (CDC) technology to capture changes at the source in real time. The Oracle GoldenGate (GG) solution for capturing changes in real time had an issue: after each database switchover, GG started to read from the very beginning, creating huge data loads on the pipeline. So we chose the MSSQL CDC mechanism to capture insert, update and delete events. Kafka source connectors pulled these CDC events, converted them to Avro [12] messages and published them to Kafka topics. GG was later used for Oracle and MySQL database sources, where the above issue was not seen.
4.1 Connectors
The source and sink connectors are the entry and exit points of our data pipeline. The source connectors are responsible for bringing all change data in and streaming it to the pipeline, while the sink connectors are responsible for passing the transformed output data to the sink/data lake.
In the following sections we will discuss our first use case: a source connector for MSSQL and a sink connector for ActiveMQ.
4.1.1 Source Connector
For our MSSQL use-case, we wanted to capture all data manipulation operations on the database tables, and MSSQL Server CDC provides this capability. The source of change data is the server transaction log, and capture can be enabled on an individual table, on chosen fields or on the entire schema [11].
A new schema and capture tables for the tracked columns get created once CDC is enabled. Five additional columns (__$start_lsn, __$end_lsn, __$seqval, __$operation, __$update_mask) are added per table. These columns allow us to uniquely identify a transaction within a commit and to replay it.
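As an illustration of how these columns drive ordering and replay, the hedged sketch below polls a CDC change table over plain JDBC; the connection string, capture-instance name and bookmark handling are assumptions, and in our pipeline this work is actually done by Kafka Connect, described next.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/**
 * Minimal sketch (not our Kafka Connect implementation): reads rows from a
 * SQL Server CDC capture table in commit order. The connection string, capture
 * instance name and the lastLsn bookmark are illustrative placeholders.
 */
public class CdcTablePoller {

    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://dbhost:1433;databaseName=sampledb";
        byte[] lastLsn = new byte[10]; // bookmark of the last processed __$start_lsn

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     // cdc.<capture_instance>_CT is the change table created when CDC is enabled
                     "SELECT __$start_lsn, __$seqval, __$operation, * "
                   + "FROM cdc.dbo_TABLENAME_CT WHERE __$start_lsn > ? "
                   + "ORDER BY __$start_lsn, __$seqval")) {

            ps.setBytes(1, lastLsn);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // __$operation: 1=delete, 2=insert, 3=update (before image), 4=update (after image)
                    int operation = rs.getInt("__$operation");
                    byte[] startLsn = rs.getBytes("__$start_lsn");
                    // ...convert the row to an Avro record and publish it to the raw CDC topic
                    lastLsn = startLsn; // advance the bookmark so replay can resume from here
                }
            }
        }
    }
}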
Once CDC is set up, Kafka Connect is used to pull data from these tables over shared JDBC connections, massage it and publish it onto the CDC Kafka topics. Each table has a single data source, i.e. its CDC table. The final output onto the topic is an Avro message whose schema is stored in the Confluent Schema Registry. Below is a sample Avro message for an INSERT:
{
  "header": {
    "source": "MSSQLServer",
    "seqno": "00127A53000034E00110",
    "fragno": "00127A53000034C80007",
    "schema": "mssql_database_name",
    "table": "TABLENAME",
    "timestamp": 1494839698233,
    "eventtype": "INSERT",
    "shardid": "SHARD0",
    "eventid": "00127A53000034E00110",
    "primarykey": "ID"
  },
  "payload": {
    "beforerecord": null,
    "afterrecord": {
      "afterrecord": {
        "ID": {
          "long": 245983
        }
      }
    }
  }
}
Figure 3 Real Time Data Pipeline
Another implementation of the connector can use Oracle GoldenGate to publish MySQL events to Kafka topics [7].
Figure 3 gives a detailed view of the pipeline with all the open source libraries used. The components Sink Connector, Joiner and Transformer will be discussed in detail in the following sections.
4.1.2 Sink Connector
The JMS sink connector allows us to extract entries from a Kafka topic with the Connect Query Language (CQL) driver and pass them to a JMS topic/queue. The connectors, one for each type of event, de-dupe and take the latest event, which can be identified by the combination of fragment and sequence numbers added by the source connector. These messages are then converted to text messages using the JMS API and written onto the queue. The input format in our case is Avro from Kafka and the output is a text message. Details of the configuration and the Kafka Connect JMS sink are well explained in [2].
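As a rough illustration of that final conversion step (not the connector's actual implementation), the sketch below turns an already de-duped, transformed payload into a JMS text message and sends it to an ActiveMQ queue; the broker URL, queue name and payload are placeholders.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

/** Minimal sketch: write a transformed event as a JMS text message to ActiveMQ. */
public class JmsTextMessageWriter {

    public static void main(String[] args) throws Exception {
        // Broker URL and queue name are illustrative placeholders
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://mq-host:61616");
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("events.queue");
            MessageProducer producer = session.createProducer(queue);

            // The payload would come from the de-duped Kafka record, already
            // transformed into the text format the sink application expects.
            String payload = "{\"eventtype\":\"INSERT\",\"ID\":245983}";
            TextMessage message = session.createTextMessage(payload);
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}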
4.2 Bootstrap
Data from the source connectors is delta data, and to support stateful transformations complete data is needed. We bootstrap historic data into Cassandra so that the outgoing events are complete. It also aids replay and schema evolution. The next two components in the pipeline, Joiner and Transformer, use this Cassandra data to construct a complete and stateful event.
Bootstrap has 3 stages. Each stage is a standalone Java program which is run separately and serially in the following order before onboarding any new source to the pipeline:
➢ Populate Kafka topics- Using a JDBC connection, SQL queries for historic data are run for each table and the data is populated onto the raw CDC topics in the same Avro format discussed in the Source Connector section; all these events are inserts (see the sketch after this list).
➢ Populate Datomic tables- The bootstrap Java program reads from the input Kafka topics and populates the corresponding Datomic tables.
➢ Populate Datomic references- In this stage, the program populates the references between records in the different Datomic tables.
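A hedged sketch of the first stage above, assuming Confluent's Avro serializer and Schema Registry are on the classpath; the JDBC URL, query, topic name and record schema are simplified placeholders, not the production code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Minimal sketch of bootstrap stage 1: replay historic rows as INSERT events onto a raw CDC topic. */
public class BootstrapTopicPopulator {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        // Assumes Confluent's Avro serializer and Schema Registry are available
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081");

        // Simplified schema; the real pipeline reuses the source connector's schema
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"RowEvent\",\"fields\":["
              + "{\"name\":\"eventtype\",\"type\":\"string\"},"
              + "{\"name\":\"ID\",\"type\":\"long\"}]}");

        try (Producer<String, GenericRecord> producer = new KafkaProducer<>(props);
             Connection conn = DriverManager.getConnection(
                     "jdbc:sqlserver://dbhost:1433;databaseName=sampledb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT ID FROM dbo.TABLENAME")) {

            while (rs.next()) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("eventtype", "INSERT"); // bootstrap events are always inserts
                record.put("ID", rs.getLong("ID"));
                producer.send(new ProducerRecord<>("raw.cdc.TABLENAME",
                        String.valueOf(rs.getLong("ID")), record));
            }
        }
    }
}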
5. Data Joiner- Joiner
The next component in our pipeline does the task of joining events from multiple Kafka streams to form a de-normalised view, ready to be transformed. The Joiners are Spark jobs that process events from their respective Kafka streams.
The joins are performed based on a joiner configuration that is provided to the job at startup. This config is used to create the joiner output events and to define the joins in the DB, i.e. Datomic in our case.
5.1 Datomic
Datomic is a fully transactional, distributed database that avoids the compromises and losses of many NoSQL solutions. In addition, it offers flexibility and power over the traditional RDBMS model.
➢ Datomic stores a record of immutable facts which are never updated in place, and all data is retained by default, giving you built-in auditing and the ability to query history.
➢ Caching is built-in and can be maintained on the client side, which makes reads faster.
➢ Datomic provides rich schema and query capabilities on top of a storage of your choice. A storage 'service' can be anything from a SQL database, to a key/value store, to a true service like Amazon's DynamoDB.
➢ Schema evolution can be handled easily with Datomic as it follows an EAVT (Entity, Attribute, Value, Transaction) structure.
➢ Joins are handled inherently: references to joined rows are always maintained.
➢ ACID-compliant transactions.
We used Datomic 0.9.5561 (refer to [10]) on top of a Cassandra cluster for storage.
But before joining can be performed, we needed to ensure that the incoming events are complete rows (since we want to support both CDC and GoldenGate events).
5.2 Reconciliation
Every event processed by the Joiner is persisted at our end in a Datomic DB, using which we can construct the complete row even when only partial data comes in through the CDC events. When the Joiner receives an event, it reads the previous state of the same row from our database, applies the change set on it to construct the latest, complete row, and pushes the result back in.
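The reconciliation step boils down to overlaying the non-null fields of the incoming change set onto the previously persisted row. The sketch below shows that merge in plain Java, with the Datomic read and write omitted and the field names hypothetical.

import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of reconciliation: overlay a partial CDC change set onto the previously persisted row. */
public class RowReconciler {

    /** Returns the latest complete row; previousState comes from Datomic, changeSet from the CDC event. */
    public static Map<String, Object> reconcile(Map<String, Object> previousState,
                                                Map<String, Object> changeSet) {
        Map<String, Object> latest = new HashMap<>(previousState);
        for (Map.Entry<String, Object> change : changeSet.entrySet()) {
            if (change.getValue() != null) {          // only columns present in the CDC event overwrite state
                latest.put(change.getKey(), change.getValue());
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        Map<String, Object> previous = new HashMap<>();
        previous.put("ID", 245983L);
        previous.put("STATUS", "OPEN");
        previous.put("AMOUNT", 100.0);

        Map<String, Object> change = new HashMap<>();
        change.put("ID", 245983L);
        change.put("STATUS", "PAID");                 // partial update from the CDC stream

        // Merged row contains ID=245983, STATUS=PAID, AMOUNT=100.0 and is written back to Datomic
        System.out.println(reconcile(previous, change));
    }
}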
5.3 Joining
Only master table entries can translate into output events. If the incoming event belongs to a master table, then on fetching its value from Datomic we get the complete output entity along with all the referenced child entries (thanks to Datomic!). If it is from a child table, then we fetch its corresponding master values to form the output event. A single table could be a master and/or a child, and based on this the number of output events formed may vary (each corresponding to a different entity at the destination).
The output events are now a denormalized view of all the tables that are required to form the destination entities.
6. Data Transformation- Transformer
Once the denormalized event is generated by the Joiner, it is pushed into the next set of Kafka topics. These topics are then consumed by the Transformer, another Spark job, whose sole responsibility is data transformation. The most common operations include:
➢ Mapping between the source and destination fields
➢ Deriving new field values based on business logic
➢ Validating mandatory fields and other business rules
The transformation logic is handled through an open source framework called Morphline, which lets us define a series of commands (transformations) that are applied sequentially to the event being processed (Figure 4).
Figure 4 Morphline Illustration
The transformations are defined in an external configuration, in the format expected by the Morphline SDK, as shown in the sample transformation sheet below:

morphlines : [
  {
    id : morphline
    importCommands : ["org.kitesdk.**"]
    commands : [
      { command1 {
          attr1 : value
          attr2 : value
      } }
      { command2 {
          attr1 : value
      } }
    ]
  }
]
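To show how such a configuration could be wired into a job, the hedged sketch below compiles a morphline file with the Kite SDK and runs a single record through it; the file name and record fields are assumptions, and error handling is kept minimal.

import java.io.File;

import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Notifications;

/** Minimal sketch: compile a morphline config and run a single denormalized event through it. */
public class MorphlineRunner {

    public static void main(String[] args) {
        MorphlineContext context = new MorphlineContext.Builder().build();
        // "morphline" matches the id used in the configuration sample above;
        // no final child command is attached in this sketch
        Command morphline = new Compiler().compile(
                new File("transformations.conf"), "morphline", context, null);

        Record event = new Record();
        event.put("ID", 245983L);            // fields from the Joiner's denormalized output
        event.put("STATUS", "PAID");

        Notifications.notifyStartSession(morphline);
        boolean success = morphline.process(event);   // applies the configured commands in order
        if (!success) {
            System.err.println("Morphline failed to process record: " + event);
        }
        Notifications.notifyShutdown(morphline);
    }
}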
6.1 Checkpointing
We chose to do our own checkpointing rather than relying on Spark, for two main reasons:
➢ The default Spark checkpointing requires us to clear the checkpointing directory on HDFS whenever new code is deployed. This is an operational overhead and is prone to errors.
➢ Saving checkpoint data in Datomic also lets us replay messages on demand.
The Joiner and Transformer both pick up the latest offset from the metadata table at startup and process the Kafka streams from that point onwards. They save the offset in Datomic after processing a batch, along with the metadata information which the event contains.
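The sketch below illustrates the approach, assuming the Spark 1.6 direct Kafka stream API; the CheckpointStore interface stands in for our Datomic-backed metadata table, whose access code is omitted.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.OffsetRange;

/** Minimal sketch of the custom checkpointing idea used by the Joiner and Transformer. */
public class OffsetCheckpointing {

    /** Stand-in for our Datomic-backed metadata table. */
    interface CheckpointStore {
        void saveOffset(String topic, int partition, long untilOffset);
        long latestOffset(String topic, int partition);
    }

    /** Attach offset bookkeeping to a direct Kafka stream created elsewhere. */
    static void checkpointAfterEachBatch(JavaDStream<String> stream, CheckpointStore store) {
        stream.foreachRDD((JavaRDD<String> rdd) -> {
            // The direct stream exposes the exact offset ranges consumed in this batch
            OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

            // ...process the batch (join/transform) here, then persist the offsets so a
            // restart, or a replay with a rewound offset, resumes from a known point
            for (OffsetRange range : ranges) {
                store.saveOffset(range.topic(), range.partition(), range.untilOffset());
            }
        });
    }
}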
6.2 Features- Replay, Out-of-Order Handling, Schema Evolution
Replaying messages from the Kafka streams is required whenever we encounter a technical or logical issue. For the Spark jobs, the checkpointing data is available in Datomic; based on the time from which replay is required, the corresponding offset is fetched. The latest offset value in Datomic is then set to this value and the components are restarted. Once a message is replayed, it flows through all the downstream components and into the sink.
Every event from the source has a fragment number and a sequence number. This combination is unique per event, and the Joiner uses it to detect out-of-order events. The Transformer cannot use this value because multiple streams are merged in the Joiner output. Instead, it uses a transaction id which is stamped into the event by the Joiner. This transaction id is generated by Datomic for every insert/update operation and is sequential in nature.
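A hedged illustration of the ordering check: compare the incoming event's fragment/sequence pair with the last pair applied for that row. The string comparison relies on the fixed-width hex encoding seen in the sample header earlier and is an assumption, not the Joiner's exact logic.

/** Minimal sketch: decide whether a CDC event is stale relative to the last one applied for the same row. */
public class OutOfOrderDetector {

    /**
     * fragno/seqno are the hex-encoded LSN-style strings from the event header
     * (e.g. "00127A53000034C80007"); fixed-width hex compares correctly as text.
     */
    public static boolean isOutOfOrder(String lastFragno, String lastSeqno,
                                       String newFragno, String newSeqno) {
        int fragCompare = newFragno.compareTo(lastFragno);
        if (fragCompare != 0) {
            return fragCompare < 0;                 // older fragment => out of order
        }
        return newSeqno.compareTo(lastSeqno) <= 0;  // same fragment: sequence must strictly advance
    }

    public static void main(String[] args) {
        // The second event carries an older fragment number, so it is flagged as out of order
        System.out.println(isOutOfOrder("00127A53000034C80007", "00127A53000034E00110",
                                        "00127A53000034C80001", "00127A53000034E00105")); // true
    }
}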
The schema for the pipeline is maintained only in the Schema Registry and Datomic, both of which can be updated at runtime without any application-level changes.
7. Data Pipeline Testing and Monitoring
The primary concern for any data processing pipeline is the health of the data flowing through the system. The overall health of a pipeline can be evaluated as a combination of multiple attributes such as data loss, throughput, latency and error rates. These metrics are helpful only when we are able to isolate the root cause to a component, process, job or configuration. We will discuss two major areas: how we gained confidence in the pipeline before we went live, and how we maintained that confidence after deployment.
7.1 Pre-deployment Testing
We can never underestimate the importance of unit and component integration tests; however, writing an end-to-end (E2E) test for a real-time streaming pipeline is a different ball game altogether. Points of failure increase with dependencies: source, sink, components and environments. We followed a few guiding principles for automating E2E tests:
➢ Addition of new tests should be easy
➢ Support for multiple sources
➢ Independent tests, parallel runs, minimum run time
➢ Easy and fast post-run analysis
➢ High configurability, granular control
Figure 5 Data Pipeline Automated Testing
We used Java 1.8 with TestNG as the basic test framework for automation, and contributed to open source DolphinNG ([4], [5]) for advanced reporting and analysis. To isolate errors and enable swift debugging, the outputs after each filter had to be verified. Figure 5 gives a detailed view of the interactions between the test automation framework and the data pipeline.
Below is the anatomy of an E2E automation test:
1. Start the Kafka message aggregator, which listens to all messages from this point on:

@BeforeClass
public void startKafkaAggregatorListening() {
    aggregator = new KafkaMessageAggregator(configuration);
}
2. Create events, actual or simulated, to populate the raw CDC topics and collect a unique id:

@Test
public void joinerTest(String param1, String param2) throws Exception {
    String uniqueId = createEvents(param1, param2, configuration);
}

3. Filter the aggregator messages by uniqueId and create a list:

List<GenericRecord> joinerMessagesForUniqueId =
    KafkaMessageConsumer.filterRecords(
        aggregator.getMessagesForTopic(KAFKA_JOINER_KEY), uniqueId);
4. Verifications:

debugAtSource
checkForDuplicatesOnAllTopics
verifyValidityOfMessagesCollectedForSizeAndData
verifyOutOfOrderEvents
verifyUniqueIdAtSplunk
verifyUniqueIdAtDatomic
verifyDataParity
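As an example of what one of these verifications might look like (the real helpers live in our internal framework, so this is a hedged stand-in), a duplicate check over the collected event ids can be as simple as:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.testng.Assert;

/** Hedged stand-in for a verification step: fail if the same eventid was seen twice on a topic. */
public class DuplicateCheck {

    public static void checkForDuplicates(List<String> eventIdsOnTopic, String topicName) {
        Set<String> seen = new HashSet<>();
        for (String eventId : eventIdsOnTopic) {
            // Set.add returns false when the id was already present, i.e. a duplicate
            Assert.assertTrue(seen.add(eventId),
                    "Duplicate event " + eventId + " found on topic " + topicName);
        }
    }
}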
With the volume of data flowing in our pipeline, we were adding tests daily and the complexity kept increasing. Tests were run periodically and we were getting over 100 reports per day. TestNG reports were not efficient for analysis; we wanted to quickly analyze errors and log issues for them. DolphinNG, a TestNG add-on, was integrated with the test automation suite to free ourselves from all manual intervention. It clubs failures, reports root causes and automatically creates JIRA tickets.
7.2 Post-deployment Monitoring
For post-deployment monitoring, it was essential to instrument, annotate, and organize our telemetry, or else it would become very difficult to separate primary concerns from other infrastructure metrics such as CPU utilization, disk space, and so forth. The standard metrics that we wanted to capture were latency, input/output throughput, data integrity and errors. The front runners for such dashboarding and alerting were Splunk and Wavefront. Splunk concentrates on application metrics, while Wavefront allows both system and application metrics. As we wanted application metrics and solid debugging capabilities, we went with Splunk 6.2.1 [6].
Figure 6 Monitoring framework
In order to isolate issues and find their root causes, we needed to capture metrics at all stages. Each stage of the pipeline logged an audit entry to Splunk with event_code, stage_timestamp, output_checksum, stage_number and a few other values. Splunk forwarders and log4j appenders were used in the pipeline components to log the auditing metrics to a dedicated Splunk index. For the Joiner and Transformer components, we used appenders to avoid installing forwarders on all data nodes (Figure 6).
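A minimal sketch of such an audit entry, assuming a plain log4j logger whose output reaches Splunk via a forwarder or appender; the logger name and key=value layout are assumptions chosen to keep the entries easy to search.

import org.apache.log4j.Logger;

/** Minimal sketch: emit one audit entry per event per pipeline stage in a Splunk-friendly key=value format. */
public class PipelineAuditLogger {

    // A dedicated logger name lets the appender route these entries to the audit index
    private static final Logger AUDIT = Logger.getLogger("pipeline.audit");

    public static void logStage(String eventCode, int stageNumber, String outputChecksum) {
        long stageTimestamp = System.currentTimeMillis();
        AUDIT.info(String.format(
                "event_code=%s stage_number=%d stage_timestamp=%d output_checksum=%s",
                eventCode, stageNumber, stageTimestamp, outputChecksum));
    }

    public static void main(String[] args) {
        // e.g. the Joiner (stage 2) records that it has emitted event 00127A53000034E00110
        logStage("00127A53000034E00110", 2, "9f2c4d7a");
    }
}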
The details captured in Splunk also allowed us to perform data integrity monitoring. With the help of event codes and stage numbers, data loss could be detected even though the input-to-output event ratio is not 1:1 throughout the pipeline. A Splunk dashboard was created to capture data loss at each and every stage of the pipeline.
For latency, 95th percentile numbers were used to derive insights at each stage. For throughput (TPS), absolute throughput was measured and plotted in Splunk dashboards (Figure 7). Splunk alerts were created on top of these dashboards for input TPS, data loss occurrences and latency breaches.
Figure 7 Splunk Dashboards
8. Outcomes
We were able to take multiple pipelines to production using the above framework, maintaining the following KPIs:
➢ Bootstrap populated 10 million records into Datomic in under 15 minutes
➢ E2E latency remains < 60 sec, with exceptions during high-volume inputs
➢ A pipeline with a setup of 3 Kafka brokers, 5 Cassandra instances and 20 input tables (avg. 25 columns) processes 100 TPS with sub-minute latency
➢ With DolphinNG smart reporting and Splunk alerting, there is no manual intervention for pipeline monitoring
➢ Onboarding a new table only needs config changes
9. Learnings
1. Race conditions, data corruption- As we had different Joiners processing events from different tables, we started running into race conditions resulting in data loss or stale data. To fix this issue we wrote transaction functions in Datomic that ensured atomicity over a set of commands. This, along with the handling of out-of-order events, prevented the data from being corrupted.
2. Data loss at bootstrap- The retention period for the CDC tables was 24 hours, which meant events had to be consumed within that time-frame or there would be data loss. The first bootstrap design failed to clear the performance markers and was redesigned to execute in steps, as explained earlier.
3. Zero-batch processing time in Spark- Spark 1.6.1 performance degrades over time: the size of the metadata passed to the executors keeps increasing, and as a result batches with 0 events take 2-3 s to complete. This issue is reported to have been fixed in the latest version.
10. Conclusion
In this paper, we have tried to consolidate our implementation and learnings from building a real-time ETL pipeline which allows replay, data persistence, automated monitoring, testing and schema evolution. It gives a glimpse into the latest stream processing technologies like Kafka and Spark, a distributed database like Datomic, rich configurations using Morphline, and DolphinNG, a TestNG add-on for smart reporting. For future work, we want to make onboarding self-serve; open source the logical components and the Kafka message aggregator; optimize the KPIs; and experiment with Spark Structured Streaming.
References
[1] Track Data Changes (SQL Server). https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/track-data-changes-sql-server
[2] Kafka Connect JMS Sink. http://docs.datamountaineer.com/en/latest/jms.html
[3] Splunk: Distributed Deployment Manual. https://docs.splunk.com/Documentation/Splunk/7.0.1/Deploy/Componentsofadistributedenvironment
[4] DolphinNG. https://github.com/basavaraj1985/DolphinNG
[5] DolphinNG Sample Project. https://github.com/basavaraj1985/UseDolphinNG
[6] Splunk Logging for Java. http://dev.splunk.com/view/splunk-logging-java/SP-CAAAE2K
[7] Oracle GoldenGate for MySQL. https://docs.oracle.com/goldengate/1212/gg-winux/GIMYS/toc.htm
[8] Pipe and Filter Architectures. http://community.wvu.edu/~hhammar/CU/swarch/lecture%20slides/slides%204%20sw%20arch%20styles/supporting%20slides/SWArch-4-PipesandFilter.pdf
[9] Confluent 2.0.1 Documentation. https://docs.confluent.io/2.0.1/platform.html
[10] Datomic. http://docs.datomic.com/index.html
[11] Enabling CDC on Microsoft SQL Server. https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/enable-and-disable-change-data-capture-sql-server
[12] Avro Messages. https://avro.apache.org/docs/1.7.7/gettingstartedjava.html
Real-time​ ​Data-Pipeline​ ​from​ ​inception​ ​to​ ​production

More Related Content

What's hot

Visual Basic.Net & Ado.Net
Visual Basic.Net & Ado.NetVisual Basic.Net & Ado.Net
Visual Basic.Net & Ado.Net
FaRid Adwa
 
COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
 COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV... COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
Nexgen Technology
 
Vb.net session 05
Vb.net session 05Vb.net session 05
Vb.net session 05
Niit Care
 
S18 das
S18 dasS18 das
Combining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesCombining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information services
CloudTechnologies
 
Towards secure and dependable storage service in cloud
Towards secure and dependable storage service in cloudTowards secure and dependable storage service in cloud
Towards secure and dependable storage service in cloud
sibidlegend
 
Introduction to ado
Introduction to adoIntroduction to ado
Introduction to ado
Harman Bajwa
 
Rdbms Practical file diploma
Rdbms Practical file diploma Rdbms Practical file diploma
Rdbms Practical file diploma
mustkeem khan
 
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
lyn kurian
 
Updating and Scheduling of Streaming Web Services in Data Warehouses
Updating and Scheduling of Streaming Web Services in Data WarehousesUpdating and Scheduling of Streaming Web Services in Data Warehouses
Updating and Scheduling of Streaming Web Services in Data Warehouses
International Journal of Science and Research (IJSR)
 
Ado.net session10
Ado.net session10Ado.net session10
Ado.net session10
Niit Care
 
Cloud Technology: Virtualization
Cloud Technology: VirtualizationCloud Technology: Virtualization
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
IJECEIAES
 
Architecture of integration services
Architecture of integration servicesArchitecture of integration services
Architecture of integration services
Slava Kokaev
 
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
1crore projects
 
ANG-GridWay-Poster-Final-Colorful-Bright-Final0
ANG-GridWay-Poster-Final-Colorful-Bright-Final0ANG-GridWay-Poster-Final-Colorful-Bright-Final0
ANG-GridWay-Poster-Final-Colorful-Bright-Final0
Jingjing Sun
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with Cloud
IJAAS Team
 
Graphical display of statistical data on Android
Graphical display of statistical data on AndroidGraphical display of statistical data on Android
Graphical display of statistical data on Android
Didac Montero
 
Discover Database
Discover DatabaseDiscover Database
Discover Database
Wayne Weixin
 
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET Journal
 

What's hot (20)

Visual Basic.Net & Ado.Net
Visual Basic.Net & Ado.NetVisual Basic.Net & Ado.Net
Visual Basic.Net & Ado.Net
 
COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
 COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV... COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
COMBINING EFFICIENCY, FIDELITY, AND FLEXIBILITY IN RESOURCE INFORMATION SERV...
 
Vb.net session 05
Vb.net session 05Vb.net session 05
Vb.net session 05
 
S18 das
S18 dasS18 das
S18 das
 
Combining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information servicesCombining efficiency, fidelity, and flexibility in resource information services
Combining efficiency, fidelity, and flexibility in resource information services
 
Towards secure and dependable storage service in cloud
Towards secure and dependable storage service in cloudTowards secure and dependable storage service in cloud
Towards secure and dependable storage service in cloud
 
Introduction to ado
Introduction to adoIntroduction to ado
Introduction to ado
 
Rdbms Practical file diploma
Rdbms Practical file diploma Rdbms Practical file diploma
Rdbms Practical file diploma
 
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...Management of Bi-Temporal Properties of  Sql/Nosql Based Architectures – A Re...
Management of Bi-Temporal Properties of Sql/Nosql Based Architectures – A Re...
 
Updating and Scheduling of Streaming Web Services in Data Warehouses
Updating and Scheduling of Streaming Web Services in Data WarehousesUpdating and Scheduling of Streaming Web Services in Data Warehouses
Updating and Scheduling of Streaming Web Services in Data Warehouses
 
Ado.net session10
Ado.net session10Ado.net session10
Ado.net session10
 
Cloud Technology: Virtualization
Cloud Technology: VirtualizationCloud Technology: Virtualization
Cloud Technology: Virtualization
 
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
Workflow Scheduling Techniques and Algorithms in IaaS Cloud: A Survey
 
Architecture of integration services
Architecture of integration servicesArchitecture of integration services
Architecture of integration services
 
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
IEEE 2015 - 2016 | Combining Efficiency, Fidelity, and Flexibility in Resource...
 
ANG-GridWay-Poster-Final-Colorful-Bright-Final0
ANG-GridWay-Poster-Final-Colorful-Bright-Final0ANG-GridWay-Poster-Final-Colorful-Bright-Final0
ANG-GridWay-Poster-Final-Colorful-Bright-Final0
 
Data Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with CloudData Partitioning in Mongo DB with Cloud
Data Partitioning in Mongo DB with Cloud
 
Graphical display of statistical data on Android
Graphical display of statistical data on AndroidGraphical display of statistical data on Android
Graphical display of statistical data on Android
 
Discover Database
Discover DatabaseDiscover Database
Discover Database
 
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
IRJET- Providing In-Database Analytic Functionalities to Mysql : A Proposed S...
 

Similar to Real time data-pipeline from inception to production

Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice Way
Josef Adersberger
 
Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_
Tina Zhang
 
Keynote 1 the rise of stream processing for data management &amp; micro serv...
Keynote 1  the rise of stream processing for data management &amp; micro serv...Keynote 1  the rise of stream processing for data management &amp; micro serv...
Keynote 1 the rise of stream processing for data management &amp; micro serv...
Sabri Skhiri
 
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
DataStax
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
Databricks
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
confluent
 
Confluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern AnalyticsConfluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern Analytics
confluent
 
Engineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
Engineering Wunderlist for Android - Ceasr Valiente, 6WunderkinderEngineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
Engineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
DroidConTLV
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
Michael Häusler
 
BSA 385 Week 3 Individual Assignment Essay
BSA 385 Week 3 Individual Assignment EssayBSA 385 Week 3 Individual Assignment Essay
BSA 385 Week 3 Individual Assignment Essay
Tara Smith
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
DataWorks Summit/Hadoop Summit
 
Enabling SQL Access to Data Lakes
Enabling SQL Access to Data LakesEnabling SQL Access to Data Lakes
Enabling SQL Access to Data Lakes
Vasu S
 
Osb Bam Integration
Osb Bam IntegrationOsb Bam Integration
Osb Bam Integration
guest6070853
 
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data AnalyticsStrata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
SingleStore
 
Tpl dataflow
Tpl dataflowTpl dataflow
Tpl dataflow
Alex Kursov
 
A cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring dataA cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring data
redpel dot com
 
Materialize: a platform for changing data
Materialize: a platform for changing dataMaterialize: a platform for changing data
Materialize: a platform for changing data
Altinity Ltd
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Cloudera, Inc.
 
Test Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful ToolsTest Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful Tools
mcthedog
 
Disadvantages Of Robotium
Disadvantages Of RobotiumDisadvantages Of Robotium
Disadvantages Of Robotium
Susan Tullis
 

Similar to Real time data-pipeline from inception to production (20)

Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice Way
 
Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_Predictive maintenance withsensors_in_utilities_
Predictive maintenance withsensors_in_utilities_
 
Keynote 1 the rise of stream processing for data management &amp; micro serv...
Keynote 1  the rise of stream processing for data management &amp; micro serv...Keynote 1  the rise of stream processing for data management &amp; micro serv...
Keynote 1 the rise of stream processing for data management &amp; micro serv...
 
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,...
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Confluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern AnalyticsConfluent & Attunity: Mainframe Data Modern Analytics
Confluent & Attunity: Mainframe Data Modern Analytics
 
Engineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
Engineering Wunderlist for Android - Ceasr Valiente, 6WunderkinderEngineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
Engineering Wunderlist for Android - Ceasr Valiente, 6Wunderkinder
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
BSA 385 Week 3 Individual Assignment Essay
BSA 385 Week 3 Individual Assignment EssayBSA 385 Week 3 Individual Assignment Essay
BSA 385 Week 3 Individual Assignment Essay
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Enabling SQL Access to Data Lakes
Enabling SQL Access to Data LakesEnabling SQL Access to Data Lakes
Enabling SQL Access to Data Lakes
 
Osb Bam Integration
Osb Bam IntegrationOsb Bam Integration
Osb Bam Integration
 
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data AnalyticsStrata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
Strata+Hadoop 2015 NYC End User Panel on Real-Time Data Analytics
 
Tpl dataflow
Tpl dataflowTpl dataflow
Tpl dataflow
 
A cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring dataA cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring data
 
Materialize: a platform for changing data
Materialize: a platform for changing dataMaterialize: a platform for changing data
Materialize: a platform for changing data
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
 
Test Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful ToolsTest Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful Tools
 
Disadvantages Of Robotium
Disadvantages Of RobotiumDisadvantages Of Robotium
Disadvantages Of Robotium
 

Recently uploaded

Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
ihlasbinance2003
 
2. Operations Strategy in a Global Environment.ppt
2. Operations Strategy in a Global Environment.ppt2. Operations Strategy in a Global Environment.ppt
2. Operations Strategy in a Global Environment.ppt
PuktoonEngr
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
IJECEIAES
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
Hitesh Mohapatra
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
Victor Morales
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
RadiNasr
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
mamunhossenbd75
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
NidhalKahouli2
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
gerogepatton
 
bank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdfbank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdf
Divyam548318
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
camseq
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
Rahul
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
IJECEIAES
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
JamalHussainArman
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
yokeleetan1
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
SyedAbiiAzazi1
 

Recently uploaded (20)

Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
5214-1693458878915-Unit 6 2023 to 2024 academic year assignment (AutoRecovere...
 
2. Operations Strategy in a Global Environment.ppt
2. Operations Strategy in a Global Environment.ppt2. Operations Strategy in a Global Environment.ppt
2. Operations Strategy in a Global Environment.ppt
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Embedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoringEmbedded machine learning-based road conditions and driving behavior monitoring
Embedded machine learning-based road conditions and driving behavior monitoring
 
Generative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of contentGenerative AI leverages algorithms to create various forms of content
Generative AI leverages algorithms to create various forms of content
 
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsKuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions
 
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdfIron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
Iron and Steel Technology Roadmap - Towards more sustainable steelmaking.pdf
 
Heat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation pptHeat Resistant Concrete Presentation ppt
Heat Resistant Concrete Presentation ppt
 
basic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdfbasic-wireline-operations-course-mahmoud-f-radwan.pdf
basic-wireline-operations-course-mahmoud-f-radwan.pdf
 
International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...International Conference on NLP, Artificial Intelligence, Machine Learning an...
International Conference on NLP, Artificial Intelligence, Machine Learning an...
 
bank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdfbank management system in java and mysql report1.pdf
bank management system in java and mysql report1.pdf
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
Modelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdfModelagem de um CSTR com reação endotermica.pdf
Modelagem de um CSTR com reação endotermica.pdf
 
ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024ACEP Magazine edition 4th launched on 05.06.2024
ACEP Magazine edition 4th launched on 05.06.2024
 
Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...Advanced control scheme of doubly fed induction generator for wind turbine us...
Advanced control scheme of doubly fed induction generator for wind turbine us...
 
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptxML Based Model for NIDS MSc Updated Presentation.v2.pptx
ML Based Model for NIDS MSc Updated Presentation.v2.pptx
 
Swimming pool mechanical components design.pptx
Swimming pool  mechanical components design.pptxSwimming pool  mechanical components design.pptx
Swimming pool mechanical components design.pptx
 
14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application14 Template Contractual Notice - EOT Application
14 Template Contractual Notice - EOT Application
 

Real time data-pipeline from inception to production

  • 1. Real-time​ ​Data-Pipeline​ ​from​ ​inception​ ​to​ ​production Shreya​ ​Mukhopadhyay Intuit Bengaluru,​ ​India shreya_mukhopadhyay3@intuit.com Ashwini​ ​Vadivel Intuit Bengaluru,​ ​India ashwini_vadivel@intuit.com ​ ​​ ​​ ​​ ​Basavaraj​ ​M Intuit Bengaluru,​ ​India basavaraj_m@intuit.com Abstract Big Data being the buzzword of the industry, organizations want to arrive at actionable insights from their data quickly. Both historic and incoming data needs to be ingested through data pipelines onto a single data lake to help derive real-time analytics. To build real time streaming pipelines, we need to take care of data veracity, reliability of the system, out of order events, complex transformations and easier integration for future purposes. This paper will cover our experience of building such real-time pipelines for financial data, the various open source libraries we experimented with and​ ​the​ ​impacts​ ​we​ ​saw​ ​in​ ​a​ ​very​ ​brief​ ​time. 1.​ ​Introduction Intuit offers a plethora of financial products, which helps small and medium businesses in bookkeeping, financial management and tax filing. These products can have multiple data sources- customer entered data, bank feeds, payment, payroll and tax information from Federal Agencies among others. To build insights for our customers, auditors, accountants and internal customer care executives,​ ​a​ ​unified​ ​data​ ​lake​ ​is​ ​needed. This data lake needs to be fed with both real time and historical data, from internal and external sources. We wanted to build a framework for ETL(Extract, Transform, Load) data pipelines, which can be used across the organization to stream data and populate the data lake. Raw data from multiple sources had to be transformed into efficient formats before streaming and storage. The main guiding principles for such a framework were near real-time stateful transformation, data streaming with integrity, high availability, scalability​ ​and​ ​minimal​ ​latency. 2.​ ​Architecture In order to meet the above standards, the framework should be able to handle complex tasks like ingestion, persistence, processing and transformation. After considering multiple distributed application architectures, we narrowed down to Unix pipes and filter architecture. It was best suited to solve the above requirements as it is a simple yet powerful and robust system architecture. It can have any number of components (filters) to transform or filter data before passing it via connector(pipes)​ ​to​ ​other​ ​components​ ​(​Figure​ ​1​) Figure​ ​1​ ​Pipe​ ​and​ ​Filter​ ​Architecture A filter can have any number of input pipes and any number of output pipes. The pipe is the connector that passes data from one filter to the next. It is a directional stream of data, and is usually implemented by a data buffer to store all data, until the next filter has time to process it. The source and sink are the producers and consumers respectively and can be static files, any database or user​ ​input.​ ​(Refer​ ​to​ ​​[8]​). cat sample.txt | ​grep -v a | ​sort -r is a simple unix command representational of the architecture. Here sample.txt is the source and console is the sink. Commands ​cat, grep –v a and sort –r are filters and | is the pipe which passes unidirectional data between​ ​these​ ​filters. Our real-time streaming architecture was designed using the same logic of pipes. 
It ensured the following: - Support​ ​for​ ​multiple​ ​sources​ ​and​ ​sinks - Easier future enhancements by rearrangement​ ​of​ ​filters - Smaller processing steps ensuring easy reusability - Explicit storage of intermediate results for further​ ​processing - Scalability​ ​support 3.​ ​Pipeline​ ​Components Our first use-case had a relational Microsoft SQL database (Windows 2012 R2 Server) as source and Apache ActiveMQ as sink. The sink application contributes to the data lake, where real-time data can help generate in-product recommendations, help identify fraudsters among others through Machine Learning(ML) models. The source database had over 100+ tables in a single schema. The sink being a JMS queue accepted only text messages​ ​in​ ​certain​ ​formats.
  • 2. Shreya​ ​Mukhopadhyay,​ ​Ashwini​ ​Vadivel,​ ​Basavaraj​ ​M Figure​ ​2​ ​Pipeline​ ​Component Working with the product teams, we were able to create many-to-many input/output transformation maps. Overall the inputs came from 10 dynamic and 6 static tables which had to be transformed to 3 types​ ​of​ ​events. Figure 2 ​gives the general idea of data flow in the ETL pipeline. The choice of Kafka (Confluent 2.0.1 ​[9]​) for the pipes was a simple one as it well known for its ability to process high volume data. It can publish-subscribe messages and streams and is meant to be durable, fast, and scalable. The grey arrows before/after each component represents Kafka​ ​topics. As we delve deeper into individual components, we will elaborate the technologies and open source libraries that were used to get this pipeline to production. 4.​ ​Data​ ​Ingestion​ ​and​ ​delivery The first step in every pipeline is ingestion, wherein data can be ingested in real time or in batches. In our case, we needed real time streaming and therefore real time ingestion. The first use case of our data pipeline had Microsoft SQL(MSSQL) as data source. and it supports Change Data Capture (CDC) technology to capture the changes in the source at real time. The Oracle Golden Gate (GG) solution to capture the changes at real time had an issue. The issue was, after each database switchover GG started to read from the very beginning, thereby creating huge data loads on the pipeline. So, we chose MSSQL CDC mechanism to capture insert, update and delete events. Kafka source connectors pulled these CDC events, converted them to avro[​12​] messages and published them to Kafka topics. GG was later used for Oracle and MySQL database sources where the above issues​ ​were​ ​not​ ​seen. 4.1​ ​Connectors The source and sink connectors are the entry and exit points of our data pipeline. The source connectors are responsible for bringing all change data in, streaming to the pipeline and sink connectors are responsible to pass the output transformed​ ​data​ ​to​ ​sink/data​ ​lake. In the following sections we will discuss our first use case- source connector for MSSQL and sink connector​ ​for​ ​ActiveMQ. 4.1.1​ ​Source​ ​Connector For our use-case of MSSQL, we wanted to capture all data manipulation operations on the database tables and MSSQL server CDC provided this technology. The source of change data is the server transaction log and they can be enabled on an individual table, chosen fields or on the entire schema​ ​​[​11​]​. A new schema and captured columns get created once we enable CDC’s. 5 additional columns- __$start_lsn, __$end_lsn, __$seqval, __$operation, __$update_mask are added per table. These columns will allow us to uniquely identify a transaction​ ​within​ ​a​ ​commit​ ​and​ ​replay. Once the CDC is set up, Kafka connect is used to pull data from these tables using shared JDBC connections, massage it and then publish onto the CDC Kafka topics. Each table has a single data source i.e. the CDC table. The final output onto the topic is in the form of an avro whose schemas are stored in Confluent Schema Registry. 
Below is a sample Avro message for an INSERT:

{
  "header": {
    "source": "MSSQLServer",
    "seqno": "00127A53000034E00110",
    "fragno": "00127A53000034C80007",
    "schema": "mssql_database_name",
    "table": "TABLENAME",
    "timestamp": 1494839698233,
    "eventtype": "INSERT",
    "shardid": "SHARD0",
    "eventid": "00127A53000034E00110",
    "primarykey": "ID"
  },
  "payload": {
    "beforerecord": null,
    "afterrecord": {
      "afterrecord": {
        "ID": {
          "long": 245983
        }
      }
    }
  }
}

Another implementation of the connector can use Oracle Golden Gate to publish MySQL events to Kafka topics [7].

Figure 3 Real Time Data Pipeline

Figure 3 gives a detailed view of the pipeline with all the open source libraries used. The remaining components, the Sink connector, Joiner and Transformer, are discussed in detail in the following sections.

4.1.2 Sink Connector

The JMS sink connector allows us to extract entries from a Kafka topic with the Connect Query Language (CQL) driver and pass them to a JMS topic/queue. The connectors, one for each type of event, de-duplicate and take the latest event, which is identified by the combination of fragment and sequence numbers added by the source connector. These messages are then converted to text messages using the JMS API and written onto the queue. The input format in our case is Avro from Kafka and the output is a text message. Details of the configuration and the Kafka Connect JMS sink are well explained in [2].

4.2 Bootstrap

Data from the source connectors is delta only; to support stateful transformations, complete data is needed. We bootstrap historic data to Cassandra so that the outgoing events are complete. It also aids in replay and schema evolution. The next two components in the pipeline, Joiner and Transformer, use this Cassandra data to construct a complete and stateful event.

Bootstrap has three stages. Each stage is a standalone Java program which is run separately and serially in the following order before onboarding any new source to the pipeline:

➢ Populate Kafka topics - Using JDBC connections, SQL queries for historic data are run for each table and the data is populated onto the raw CDC topics in the same Avro format discussed in the Source Connector section; all these events are inserts.
➢ Populate Datomic tables - The bootstrap Java program reads from the input Kafka topics and populates the corresponding Datomic tables.
➢ Populate Datomic references - In this stage, the program populates references for the various records in the different Datomic tables.

5. Data Joiner - Joiner

The next component in our pipeline does the task of joining the events from multiple Kafka streams to form a de-normalised view, ready to be transformed. The Joiner is a set of Spark jobs that process events from their respective Kafka streams.

The joins are performed based on a joiner configuration that is provided to the job at startup. This config is used to create the joiner output events and to define the joins in the DB, i.e. Datomic in our case.

5.1 Datomic

Datomic is a fully transactional, distributed database that avoids the compromises and losses of many NoSQL solutions. In addition, it offers flexibility and power over the traditional RDBMS model.

➢ Datomic stores a record of immutable facts which are never updated in place, and all data is retained by default, giving built-in auditing and the ability to query history.
➢ Caching is built-in and can be maintained at the client side, which makes reads faster.
➢ Datomic provides rich schema and query capabilities on top of a storage of your choice. A storage 'service' can be anything from a SQL database, to a key/value store, to a true service like Amazon's DynamoDB.
➢ Schema evolution can be handled easily with Datomic as it follows an EAVT (Entity, Attribute, Value, Transaction) structure.
➢ Joins are handled inherently, as references to joined rows are always maintained.
➢ ACID-compliant transactions.

We used Datomic 0.9.5561 (refer to [10]) on top of a Cassandra cluster for storage.

Before the joining can be performed, we needed to ensure that the incoming events are complete rows (since we want to support both CDC and Golden Gate events).

5.2 Reconciliation

Every event processed by the Joiner is persisted at our end in a Datomic DB, using which we can construct the complete row even when partial data comes in through the CDC events. When the Joiner receives an event, it reads the previous state for the same record from our database. It then applies the change set on it to construct the latest, complete row and pushes it back in.

5.3 Joining

Only master table entries can translate into output events. If the incoming event belongs to a master table, then on fetching its value from Datomic we get the complete output entity along with all the referenced child entries (thanks to Datomic!). If it is from a child table, then we fetch its corresponding master values to form the output event. A single table can be a master and/or a child, based on which the number of output events formed may vary (each corresponding to a different entity at the destination).

The output events are now a denormalized view of all the tables that are required to form the destination entities.

6. Data Transformation - Transformer

Once the denormalized event is generated by the Joiner, it is pushed into the next set of Kafka topics. These topics are then consumed by the Transformer, another Spark job, whose sole responsibility is data transformation. The most common operations include:

➢ Mapping between the source and destination fields
➢ Deriving new field values based on business logic
➢ Validating mandatory fields and other business rules
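As a conceptual illustration of these three operations, the sketch below applies a mapping, a derivation and a validation to a single denormalized event represented as a map. The field names and the business rule are hypothetical; the actual pipeline expresses such steps as Morphline commands, described next.

import java.util.HashMap;
import java.util.Map;

public class TransformStep {
    // Applies a hypothetical mapping, derivation and validation to one denormalized event
    public static Map<String, Object> transform(Map<String, Object> source) {
        Map<String, Object> out = new HashMap<>();

        // 1. Mapping between source and destination fields (names are illustrative)
        out.put("customerId", source.get("CUST_ID"));
        out.put("invoiceAmount", source.get("AMOUNT"));

        // 2. Deriving a new field value based on business logic (rule is illustrative)
        double amount = ((Number) source.getOrDefault("AMOUNT", 0)).doubleValue();
        out.put("highValue", amount > 10_000);

        // 3. Validating mandatory fields and other business rules
        if (out.get("customerId") == null) {
            throw new IllegalArgumentException("Mandatory field customerId is missing");
        }
        return out;
    }
}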
The transformation logic is handled through an open source framework called Morphline. It lets us define a series of commands, each a transformation, which are applied sequentially to the event being processed (Figure 4).
Figure 4 Morphline Illustration

The transformations are defined in an external configuration, in the format expected by the Morphline SDK, as shown in the sample transformation sheet below:

morphlines : [
  {
    id : morphline
    importCommands : ["org.kitesdk.**"]
    commands : [
      {
        command1 {
          attr1 : value
          attr2 : value
        }
      }
      {
        command2 {
          attr1 : value
        }
      }
    ]
  }
]

6.1 Checkpointing

We chose to do our own checkpointing rather than relying on Spark for two main reasons:

➢ Default Spark checkpointing requires us to clear the checkpoint directory on HDFS whenever new code is deployed. This is an operational overhead and is prone to errors.
➢ Saving checkpoint data in Datomic also helps us replay messages on demand.

The Joiner and Transformer both pick up the latest offset from the metadata table at startup and process the Kafka streams from that point onwards. They save the offset in Datomic after processing a batch, along with the metadata information which the event contains.

6.2 Features - Replay, Out-of-Order Handling, Schema Evolution

Replaying messages from the Kafka streams is required whenever we encounter a technical or logical issue. For the Spark jobs, the checkpoint data is available in Datomic; based on the time from which replay is required, the corresponding offset is fetched. The latest offset value in Datomic is then set to this value and the components are restarted. Once a message is replayed, it flows through all the downstream components and into the sink.

Every event from the source has a fragment number and a sequence number. This combination is unique to every event, and the Joiner uses it to detect out-of-order events. The Transformer cannot use this value because multiple streams are merged in the Joiner output. Instead, it uses a transaction id which is punched into the event by the Joiner. This transaction id is generated by Datomic for every insert/update operation and is sequential in nature.

The schema for the pipeline is maintained only in the Schema Registry and Datomic, both of which can be updated at runtime without any application-level changes.

7. Data Pipeline Testing and Monitoring

The primary concern for any data processing pipeline is the health of the data flowing through the system. The overall health of a pipeline can be evaluated as a combination of multiple attributes such as data loss, throughput, latency and error rates. These metrics are helpful only when we are able to isolate the root cause: a component, process, job or configuration. We will discuss two major areas: how we gained confidence in the pipeline before we went live, and how we kept that confidence after deployment.

7.1 Pre-deployment Testing

We can never undermine the importance of unit and component integration tests; however, writing an end-to-end (E2E) test for a real-time streaming pipeline is a different ball game altogether. The points of failure increase with the number of dependencies: sources, sinks, components and environments.
We followed a few guiding principles for automating E2E tests:

➢ Addition of new tests should be easy
➢ Support for multiple sources
➢ Independent tests, parallel runs, minimum run time
➢ Easy and fast post-run analysis
➢ High configurability, granular control
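One way to honour the first and third principles (easy addition of tests, parallel runs) is to keep the E2E scenarios data-driven, so that a new scenario is just a new data row. The sketch below is a hedged TestNG example with hypothetical table names and helper methods, not our actual test suite.

import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

public class PipelineE2ETest {
    // Each row is one E2E scenario: a hypothetical source table and event type.
    // Adding a new test case is just adding a row; parallel = true runs rows concurrently.
    @DataProvider(name = "scenarios", parallel = true)
    public Object[][] scenarios() {
        return new Object[][] {
            {"TABLENAME", "INSERT"},
            {"TABLENAME", "UPDATE"},
        };
    }

    @Test(dataProvider = "scenarios")
    public void endToEndFlow(String table, String eventType) {
        // Hypothetical helpers: push a synthetic CDC event, then verify it at each stage
        // e.g. String uniqueId = createEvents(table, eventType, configuration);
        //      verifyAtSink(uniqueId);
    }
}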
Figure 5 Data Pipeline Automated Testing

We used Java 1.8 with TestNG as the basic test framework for automation and contributed DolphinNG ([4], [5]) to open source for advanced reporting and analysis. To isolate errors and enable swift debugging, the output after each filter had to be verified. Figure 5 gives a detailed view of the interactions between the test automation framework and the data pipeline.

Below is the anatomy of an E2E automation test:

1. Start the Kafka Message Aggregator - listen to all messages henceforth

@BeforeClass
public void startKafkaAggregatorListening() {
  aggregator = new KafkaMessageAggregator(configuration);
}

2. Create events - actual or simulated - to populate the raw CDC topics, and collect the unique id

@Test
public void joinerTest(String param1, String param2) throws Exception {
  String uniqueId = createEvents(param1, param2, configuration);
}

3. Filter the aggregator messages by uniqueId and create a list

List<GenericRecord> joinerMessagesForUniqueId =
  KafkaMessageConsumer.filterRecords(
    aggregator.getMessagesForTopic(KAFKA_JOINER_KEY), uniqueId);

4. Verifications

debugAtSource
checkForDuplicatesOnAllTopics
verifyValidityOfMessagesCollectedForSizeAndData
verifyOutOfOrderEvents
verifyUniqueIdAtSplunk
verifyUniqueIdAtDatomic
verifyDataParity

With the volume of data flowing in our pipeline, we were adding tests daily and the complexity kept increasing. Tests were run periodically and we were getting over 100 reports per day. TestNG reports were not efficient for analysis; we wanted to analyze errors and log issues quickly. DolphinNG, a TestNG add-on, was integrated with the test automation suite to free ourselves from all manual intervention. It clubs failures, reports root causes and automatically creates JIRA tickets.

7.2 Post-deployment Monitoring

For post-deployment monitoring, it was essential to instrument, annotate and organize our telemetry, or else it would become very difficult to separate primary concerns from other infrastructure metrics such as CPU utilization, disk space and so forth. The standard metrics we wanted to capture were latency, input/output throughput, data integrity and error rates. The front runners for such dashboarding and alerting were Splunk and Wavefront. Splunk concentrates on application metrics, while Wavefront supports both system and application metrics. As we wanted application metrics and solid debugging capabilities, we went with Splunk 6.2.1 [6].
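To give a feel for this instrumentation, the sketch below emits a structured audit entry through a log4j logger, which a Splunk forwarder or appender can then ship to a dedicated index. The logger name and example values are illustrative; the field names follow the audit entry described in the next paragraphs.

import org.apache.log4j.Logger;

public class PipelineAudit {
    // Dedicated audit logger; a log4j appender or Splunk forwarder ships these lines to Splunk
    private static final Logger AUDIT = Logger.getLogger("pipeline.audit");

    // Logs one audit entry as key=value pairs, which Splunk can extract automatically
    public static void audit(String eventCode, int stageNumber, String outputChecksum) {
        AUDIT.info("event_code=" + eventCode
                + " stage_number=" + stageNumber
                + " stage_timestamp=" + System.currentTimeMillis()
                + " output_checksum=" + outputChecksum);
    }

    public static void main(String[] args) {
        // Hypothetical event code, stage number and checksum
        audit("JOINER_OUT", 3, "d41d8cd98f00b204e9800998ecf8427e");
    }
}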
Figure 6 Monitoring framework

In order to isolate issues and find their root causes, we needed to capture metrics at all stages. Each stage of the pipeline logged an audit entry with event_code, stage_timestamp, output_checksum, stage_number and a few other values to Splunk. Splunk forwarders and log4j appenders were used in the pipeline components to log the audit metrics to Splunk with a dedicated Splunk index. For the Joiner and Transformer components, we used appenders to avoid installing forwarders on all data nodes (Figure 6).

The details captured in Splunk also allowed us to perform data integrity monitoring. With the help of event codes and stage numbers, data loss could be detected even though the input-to-output event ratio is not 1:1 throughout the pipeline. A Splunk dashboard was created to capture data loss at each and every stage of the pipeline.

For latency, 95th percentile numbers were used to derive insights at each stage. For throughput, absolute transactions per second (TPS) were measured and plotted in Splunk dashboards (Figure 7). Splunk alerts were created on top of the dashboards to alert on input TPS, data loss occurrences and latency breaches.

Figure 7 Splunk Dashboards

8. Outcomes

We were able to take multiple pipelines to production using the above framework, maintaining the following KPIs:

➢ Bootstrap populated 10 million records to Datomic in under 15 minutes
➢ E2E latency remains under 60 seconds, with exceptions during high-volume inputs
➢ A pipeline with 3 Kafka brokers, 5 Cassandra instances and 20 input tables (average 25 columns) processes 100 TPS with sub-minute latency
➢ With DolphinNG smart reporting and Splunk alerting, there is no manual intervention for pipeline monitoring
➢ Onboarding a new table needs only configuration changes
9. Learnings

1. Race conditions, data corruption - As we had different Joiners processing events from different tables, we started running into race conditions resulting in data loss or stale data. To fix this issue we wrote transaction functions in Datomic that ensured atomicity over a set of commands. This, along with the handling of out-of-order events, prevented the data from being corrupted.

2. Data loss at bootstrap - The retention period for CDC was 24 hours, which meant events had to be consumed within that time frame to avoid data loss. The first bootstrap design failed to clear the performance markers and was redesigned to execute in steps, as explained earlier.

3. Zero-batch processing time in Spark - Spark 1.6.1 performance degrades over time: the size of the metadata passed to the executors keeps increasing, and as a result even batches with 0 events take 2-3 s to complete. This issue is reported to have been fixed in the latest version.

10. Conclusion

In this paper, we have tried to consolidate our implementation and learnings from building a real-time ETL pipeline which allows replay, data persistence, automated monitoring, testing and schema evolution. It gives a glimpse into stream processing technologies like Kafka and Spark, a distributed database like Datomic, rich configuration using Morphline, and DolphinNG, a TestNG add-on for smart reporting. For future work, we want to make onboarding self-serve; open source the logical components and the Kafka Message Aggregator; optimize the KPIs; and experiment with Spark Structured Streaming.

References

[1] Track Data Changes (SQL Server). https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/track-data-changes-sql-server
[2] Kafka Connect JMS Sink. http://docs.datamountaineer.com/en/latest/jms.html
[3] Splunk: Distributed Deployment Manual. https://docs.splunk.com/Documentation/Splunk/7.0.1/Deploy/Componentsofadistributedenvironment
[4] DolphinNG. https://github.com/basavaraj1985/DolphinNG
[5] DolphinNG Sample Project. https://github.com/basavaraj1985/UseDolphinNG
[6] Splunk Logging for Java. http://dev.splunk.com/view/splunk-logging-java/SP-CAAAE2K
[7] Oracle GG for MySQL. https://docs.oracle.com/goldengate/1212/gg-winux/GIMYS/toc.htm
[8] Pipe and Filter Architectures. http://community.wvu.edu/~hhammar/CU/swarch/lecture%20slides/slides%204%20sw%20arch%20styles/supporting%20slides/SWArch-4-PipesandFilter.pdf
[9] Confluent 2.0.1 Documentation. https://docs.confluent.io/2.0.1/platform.html
[10] Datomic. http://docs.datomic.com/index.html
[11] Enabling CDC on Microsoft SQL Server. https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/enable-and-disable-change-data-capture-sql-server
[12] Avro Messages. https://avro.apache.org/docs/1.7.7/gettingstartedjava.html