www.infobip.com
REAL-TIME BIG DATA INGESTION
AND QUERYING OF
AGGREGATED DATA
Davor Poldrugo
software engineer
Davor Poldrugo @ Infobip
Software engineer with interest in backend development,
high availability and distributed systems.
https://about.me/davor.poldrugo
●
MOBILE SERVICES: Professional SMS, number validation, voice, USSD,
mobile payments; deeply integrated into the telecoms world
●
ENTERPRISE PRODUCTS for businesses of any scale and need (mGate,
fully-featured web apps, SMS authentication solutions, reseller solutions...)
●
APP ENGAGEMENT PLATFORM based on advanced push notifications
●
APIs and protocols for EASY INTEGRATION: xml, soap/rest, smpp, http,
json
●
Full 24/7 TECHNICAL SUPPORT regardless of location
●
QUALITY guaranteed by a strict SLA
Our services
Presentation overview
●
Dictionary
●
The real-time use case and the challenges
(because there are no problems ;)
●
The platform and how we got here
●
Our path towards real-time data
●
Architecture and component overview
●
Numbers and conclusion
Dictionary
REAL-TIME noun
“the actual time during which something takes place <the computer may
partly analyze the data in real time (as it comes in) — R. H. March>
<chatted online in real time>
– real-time adjective”
http://www.merriam-webster.com/dictionary/real%20time
BIG DATA noun
“an accumulation of data that is too large and complex for processing by
traditional database management tools”
http://www.merriam-webster.com/dictionary/big%20data
Dictionary
INGEST verb
“to take (something, such as food) into your body : to swallow
(something)
— sometimes used figuratively
She ingested [=absorbed] large amounts of information very quickly.”
http://www.learnersdictionary.com/definition/ingest
I'll use this figurative meaning... in the context of data ingestion.
The real-time use case and the challenges
●
Our new web requirement: provide real-time data and graphs of
traffic
●
SMS Campaigns Web application
Near real-time
● But we wanted real-time!
The platform and how we got here
●
There was only one node – a monolith
●
One transactional database (OLTP)
●
Traffic increased
●
After a while the database became a bottleneck
●
Then we introduced multiple transactional databases
●
Then multiple monolith nodes were introduced – one per database
●
Then load balancers were needed
The platform and how we got here
●
After that, querying became complex:
– when one or more databases were down for maintenance, data from
those DBs was missing
– queries had to span multiple databases and the results then had
to be joined
– aggregate reports became a problem (complexity, availability)
– aggregation databases (ETL) were introduced that pulled from the
transactional databases
●
In the meantime we decoupled our monolithic node into lots of
microservice nodes (IpCore, Billing, Contacts, Campaigns, ...)
●
As traffic increased, non-transactional (apps, reports) queries
became a problem – throughput decreased
The platform and how we got here
●
Our Database Team introduced GREEN – our ODS/DWH
– named after the color of the pencil used to draw on the board ;)
– Near real-time ETL (for traffic tables with 150+ columns)
– Centralized reporting
– Decreased workload from transactional databases
– Throughput increase of our core nodes (IpCore)
– Specialized indexes
– Specialized aggregations
– But still... near real-time...
– 1 to 60 minutes out of sync with the transactional databases
(depending on the load)
Our path towards real-time data
●
GREEN ODS/DWH provided an abstract solution for all our traffic
data, but it was not REAL-TIME
●
GREEN consists of big hardware – scales vertically
●
This approach solves particular REAL-TIME use cases – one by
one – it is not a silver bullet!
●
Because REAL-TIME isn't always needed
●
Resources are limited
●
The path towards horizontal scalability
Our path towards real-time data
1. All data entering the system is dispatched
to both the batch layer and the speed
layer for processing.
2. The batch layer has two functions: (i)
managing the master dataset (an
immutable, append-only set of raw data),
and (ii) pre-computing the batch views.
3. The serving layer indexes the batch views
so that they can be queried in a
low-latency, ad-hoc way.
4. The speed layer compensates for the high
latency of updates to the serving layer
and deals with recent data only.
5. Any incoming query can be answered by
merging results from batch views and
real-time views.
Lambda architecture ( http://lambda-architecture.net/ )
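Point 5 above, as a minimal Java sketch: a query is answered by adding the
real-time view (recent data only) on top of the pre-computed batch view. The
map-based views, class name and campaign-count example are illustrative
assumptions, not our actual platform code.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of lambda-architecture point 5: merge a batch view
// (complete but stale) with a real-time view (recent data only).
public class LambdaQueryMerge {

    public static Map<Long, Long> mergeViews(Map<Long, Long> batchView,
                                             Map<Long, Long> realtimeView) {
        Map<Long, Long> merged = new HashMap<>(batchView);
        // Counts from the speed layer are added on top of the batch totals.
        realtimeView.forEach((campaignId, count) ->
                merged.merge(campaignId, count, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        Map<Long, Long> batch = new HashMap<>();
        batch.put(29680L, 999_000L);      // totals up to the last batch run
        Map<Long, Long> realtime = new HashMap<>();
        realtime.put(29680L, 1_000L);     // messages seen since that run
        System.out.println(mergeViews(batch, realtime)); // {29680=1000000}
    }
}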
Our path towards real-time data
Know thyself! Adapt lambda architecture to fit your needs!
[Diagram: Messaging Cloud apps send Message/Event data to IpCore (Core Message
Processing) nodes backed by the transactional databases (OLTP). Ingest points
feed a REAL-TIME LAYER, the GREEN DB ODS/DWH becomes the newly proclaimed
BATCH/SERVING LAYER, and a QUERY LAYER queries either REAL-TIME or BATCH.]
Architecture and component overview
[Diagram: Messaging Cloud apps emit Message and Event data through the Billing
and IpCore ingest points into the REAL-TIME LAYER, where the Data Ingestion
Service processes each Message and Delta, pairing and composing a new message
before publishing it to the Kafka cluster, which feeds the Druid cluster. A
sketch of the pairing step follows.]
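The "Process Message / Process Delta / Pairing and composing a new message"
boxes can be read as a join on a shared key. A minimal, hypothetical sketch of
that pairing step follows; the MessageEvent/DeltaEvent types, field names and
in-memory pairing maps are assumptions for illustration, not the actual Data
Ingestion Service code (which would also need timeouts and eviction).

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: an IpCore message event and a Billing delta event that
// share a messageId are paired and composed into one record for Kafka.
public class MessageDeltaPairer {

    public static final class MessageEvent {
        final String messageId; final long campaignId; final String sendDateTime;
        MessageEvent(String messageId, long campaignId, String sendDateTime) {
            this.messageId = messageId; this.campaignId = campaignId; this.sendDateTime = sendDateTime;
        }
    }

    public static final class DeltaEvent {
        final String messageId; final int countDelta; final double priceDelta;
        DeltaEvent(String messageId, int countDelta, double priceDelta) {
            this.messageId = messageId; this.countDelta = countDelta; this.priceDelta = priceDelta;
        }
    }

    private final Map<String, MessageEvent> pendingMessages = new ConcurrentHashMap<>();
    private final Map<String, DeltaEvent> pendingDeltas = new ConcurrentHashMap<>();

    // Called for each IpCore message event; returns a composed record once both halves are present.
    public String onMessage(MessageEvent m) {
        DeltaEvent d = pendingDeltas.remove(m.messageId);
        if (d == null) { pendingMessages.put(m.messageId, m); return null; }
        return compose(m, d);
    }

    // Called for each Billing delta event.
    public String onDelta(DeltaEvent d) {
        MessageEvent m = pendingMessages.remove(d.messageId);
        if (m == null) { pendingDeltas.put(d.messageId, d); return null; }
        return compose(m, d);
    }

    private static String compose(MessageEvent m, DeltaEvent d) {
        // Composed record in the shape shown on the next slide (currency fields omitted here).
        return String.format("{\"sendDateTime\":\"%s\",\"campaignId\":%d,\"countDelta\":%d,\"priceDelta\":%s}",
                m.sendDateTime, m.campaignId, d.countDelta, d.priceDelta);
    }
}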
Architecture and component overview
[Diagram: REAL-TIME LAYER – Data Ingestion Service → Kafka cluster → Druid
cluster. Example of an aggregated delta event flowing through the layer:]
{
"sendDateTime":"2016-02-19T12:07:47Z",
"campaignId":29680,
"currencyId":2,
"currencyHNBCode":"EUR",
"currencySymbol":"€",
"countDelta":1,
"priceDelta":0.02
}
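The Data Ingestion Service publishes records like the one above to the Kafka
cluster. A minimal sketch using the standard Kafka producer API is shown below;
the topic name "campaign-deltas" and the broker list are assumptions, not our
actual configuration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch: publish the aggregated delta event to a Kafka topic.
public class DeltaProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092"); // assumed brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String delta = "{\"sendDateTime\":\"2016-02-19T12:07:47Z\",\"campaignId\":29680,"
                + "\"currencyId\":2,\"currencyHNBCode\":\"EUR\",\"currencySymbol\":\"\u20ac\","
                + "\"countDelta\":1,\"priceDelta\":0.02}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by campaignId keeps all deltas of one campaign in the same partition.
            producer.send(new ProducerRecord<>("campaign-deltas", "29680", delta));
        }
    }
}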
Architecture and component overview
[Diagram: Messaging Cloud apps feed both the REAL-TIME LAYER (Data Ingestion
Service → Kafka cluster → Druid cluster) and the BATCH LAYER (GREEN DB
ODS/DWH). The QUERY LAYER's Data Query Service asks "Is realtime?" – TRUE
routes the query to Druid, FALSE routes it to GREEN. A routing sketch follows.]
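A hypothetical sketch of the "Is realtime?" routing in the Data Query Service;
the CampaignStore interface, CampaignTotals type and method names are
illustrative assumptions, not the actual service.

// Illustrative sketch: real-time queries go to Druid, the rest go to GREEN.
public class DataQueryService {

    interface CampaignStore {
        CampaignTotals totalsFor(long campaignId);
    }

    static final class CampaignTotals {
        final long totalCount; final double totalPrice;
        CampaignTotals(long totalCount, double totalPrice) {
            this.totalCount = totalCount; this.totalPrice = totalPrice;
        }
    }

    private final CampaignStore druidStore; // REAL-TIME LAYER (Druid broker)
    private final CampaignStore greenStore; // BATCH LAYER (GREEN ODS/DWH)

    DataQueryService(CampaignStore druidStore, CampaignStore greenStore) {
        this.druidStore = druidStore;
        this.greenStore = greenStore;
    }

    // "Is realtime?" – TRUE routes to Druid, FALSE routes to GREEN.
    public CampaignTotals totals(long campaignId, boolean realtime) {
        return realtime ? druidStore.totalsFor(campaignId)
                        : greenStore.totalsFor(campaignId);
    }
}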
Architecture and component overview
[Diagram: QUERY LAYER – Data Query Service → REAL-TIME LAYER (Druid cluster)]
POST /druid/v2 HTTP/1.1
Host: druid-broker-node:8080
Content-Type: application/json
{
"queryType": "groupBy",
"dataSource": "campaign-totals-v2",
"granularity": "all",
"intervals": [ "2012-01-01T00:00:00.000/2100-01-01T00:00:00.000" ],
"dimensions": ["campaignId", "currencyId", "currencySymbol", "currencyHNBCode"],
"filter": { "type": "selector", "dimension": "campaignId", "value": 29680 },
"aggregations": [
{ "type": "longSum", "name": "totalCountSum", "fieldName": "totalCount" },
{ "type": "doubleSum", "name": "totalPriceSum", "fieldName": "price" }
]
}
Request to Druid
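One way the Data Query Service could issue this groupBy query from Java – a
minimal sketch with plain HttpURLConnection, using the broker host from the
slide; the actual service may use a different HTTP client or a Druid client
library.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Minimal sketch: POST the groupBy query above to the Druid broker.
public class DruidGroupByClient {

    public static String query(String jsonQuery) throws Exception {
        URL url = new URL("http://druid-broker-node:8080/druid/v2");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(jsonQuery.getBytes(StandardCharsets.UTF_8));
        }
        // Read the whole response body as a single string.
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}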
Architecture and component overview
[Diagram: QUERY LAYER – Data Query Service → REAL-TIME LAYER (Druid cluster)]
Response from Druid
[
{
"version": "v1",
"timestamp": "2012-01-01T00:00:00.000Z",
"event": {
"totalCountSum": 1000000,
"currencyid": "2",
"totalPriceSum": 20000,
"currencysymbol": "€",
"currencyhnbcode": "EUR",
"campaignid": "29680"
}
}
]
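The response is a JSON array of groupBy rows; a hypothetical sketch of pulling
the sums out with Jackson (assumes jackson-databind on the classpath; the
parsing code below is illustrative, not the actual Data Query Service).

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Minimal sketch: read totalCountSum and totalPriceSum from the response above.
public class DruidResponseParser {

    public static void main(String[] args) throws Exception {
        String json = "[{\"version\":\"v1\",\"timestamp\":\"2012-01-01T00:00:00.000Z\","
                + "\"event\":{\"totalCountSum\":1000000,\"totalPriceSum\":20000,"
                + "\"campaignid\":\"29680\"}}]";

        JsonNode rows = new ObjectMapper().readTree(json);
        for (JsonNode row : rows) {
            JsonNode event = row.get("event");
            System.out.println("campaign " + event.get("campaignid").asText()
                    + ": count=" + event.get("totalCountSum").asLong()
                    + ", price=" + event.get("totalPriceSum").asDouble());
        }
    }
}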
Architecture and component overview
KAFKA - https://kafka.apache.org/
●
Kafka maintains feeds of messages in categories called
topics
●
A distributed, partitioned, replicated commit log service. It
provides the functionality of a messaging system, but with
a unique design.
FEATURES
●
two messaging models incorporated in an abstraction
called the consumer group (group.id) – queue and
publish-subscribe (see the consumer sketch after this list)
– queue - a pool of consumers may read from a server
and each message goes to one of them
– publish-subscribe - the message is broadcast to all
consumers
●
constant performance with respect to data size
●
replay – all messages are stored and can be accessed
with a sequential id number called the offset
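The consumer sketch: instances that share a group.id split the topic's
partitions between them (queue semantics); instances with different group.ids
each receive every message (publish-subscribe). Topic and group names are
illustrative assumptions.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Minimal sketch of the consumer-group abstraction. Start two copies with the
// same group.id to load-balance (queue); start a copy with a different
// group.id to get a full broadcast (publish-subscribe).
public class DeltaConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092"); // assumed brokers
        props.put("group.id", args.length > 0 ? args[0] : "druid-feed");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("campaign-deltas"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    // record.offset() is the sequential id that makes replay possible.
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}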
[Diagram: REAL-TIME LAYER – Data Ingestion Service → Kafka cluster → Druid cluster]
Architecture and component overview
DRUID - http://druid.io/
Druid is a fast column-oriented distributed data store.
Real-time Streams
Druid supports streaming data ingestion and offers insights on
events immediately after they occur. Retain events indefinitely and
unify real-time and historical views.
Sub-Second Queries
Druid supports fast aggregations and sub-second OLAP queries.
Scalable to Petabytes
Existing Druid clusters have scaled to petabytes of data and trillions
of events, ingesting millions of events every second. Druid is
extremely cost effective, even at scale.
Deploy Anywhere
Druid runs on commodity hardware. Deploy it in the cloud or
on-premise. Integrate with existing data systems such as Hadoop,
Spark, Kafka, Storm, and Samza.
[Diagram: REAL-TIME LAYER – Data Ingestion Service → Kafka cluster → Druid cluster]
Numbers and conclusion
Data pipeline                                  | Max. throughput (msg/s)
Ingest points → Data Ingestion Service         | 7700
Billing ingest point → Data Ingestion Service  | 5500
IpCore ingest point → Data Ingestion Service   | 2200
Data Ingestion Service → Kafka                 | 2130
Druid firehose pull and aggregate from Kafka   | 29000
Real-time!
<2 sec delay
Numbers and conclusion
PROBLEMS / CHALLENGES ;)
●
Added complexity to the flow
– Maintenance of “ingest point” code
– Maintenance of Data Ingestion Service
– Operational knowledge of Kafka / Druid
●
Scaling Druid – problems with “Druid realtime nodes” and Kafka
topics with multiple partitions
●
Druid – exactly-once semantics are not guaranteed with real-time
ingestion in Druid, but we didn't have problems with our
configuration; the definitive solution is Druid batch ingestion
(using Tranquility)
www.infobip.com
Q/A
Davor Poldrugo
software engineer
davor.poldrugo@infobip.com
dpoldrugo@gmail.com
Editor's Notes

  • #8 I'll use only this meaning, although there are many meanings for the noun big data. Real time – can anything really be real time? Even the world as we experience it has latency... eyes, ears, smell and touch have to be processed by our brain. This processing takes time, maybe as little as milliseconds.
  • #11 OLTP - ON LINE TRANSACTION PROCESSING
  • #13, #14, #16, #19, #20, #21, #24, #25 ODS – operational data store
  • #22 queue – a pool of consumers may read from a server and each message goes to one of them; publish-subscribe – the message is broadcast to all consumers. If all the consumer instances have the same consumer group, this works just like a traditional queue balancing load over the consumers. If all the consumer instances have different consumer groups, this works like publish-subscribe and all messages are broadcast to all consumers. Kafka's performance means retaining lots of data is not a problem – you can forget about Kafka, it just works!