Sparkta is an open source real-time analytics platform based on Apache Spark. It lets users define aggregation policies in JSON documents, with no coding required, and processes streaming data in real time. The platform uses technologies such as Apache Kite, Spark Streaming, and Kafka to ingest data from various sources and store aggregated outputs. Stratio is developing Sparkta to be a fully featured, distributed, high-volume, and pluggable analytics framework.
Apache Spark & Cassandra use case at Telefónica CBS, by Antonio Alcocer (Stratio)
Spark & Cassandra Use Case at Telefónica CyberSecurity (CBS). Antonio Alcocer (antonio@stratio.com), Oscar Mendez (oscar@stratio.com, @omendezsoto). #CassandraSummit 2014
An efficient data mining solution by integrating Spark and Cassandra (Stratio)
Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain better results than using Spark over HDFS, because Cassandra's philosophy is much closer to the RDD philosophy than HDFS's is. The aim is a system that mines all the information stored in C* far more efficiently than if that information were stored in HDFS. Cassandra's data storage and Spark's data mining power: an unrivalled mix.
Multiplatform Solution for Graph Datasources (Stratio)
One of the top banks in Europe needed a system that would provide better performance, scale almost linearly with the amount of information to be analyzed, and allow the processes then running on the Host to be moved to a Big Data infrastructure. Over the course of a year we built a system that gives the user greater agility, flexibility and simplicity when viewing profiling information, and that can now analyze the structure of profile data. It is a powerful way to run online queries against a graph database, integrated with Apache Spark and different graph libraries. Essentially, we obtain all the necessary information through Cypher queries sent to a Neo4j database.
Using the latest Big Data technologies, such as Spark DataFrames, HDFS, Stratio Intelligence and Stratio Crossdata, we have developed a solution that can obtain critical information from multiple data sources, such as text files or graph databases.
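To make the Cypher-over-Neo4j piece of the abstract above concrete, here is a minimal sketch of building such a query. The node label, relationship type, and property names are invented for illustration; a real deployment would send the query through a Neo4j driver and feed results into Spark.

```python
# Hedged sketch: constructing a parameterized Cypher query for profile graphs.
# Labels (Customer), relationship types (RELATED_TO), and parameters are
# hypothetical examples, not taken from the actual bank project.

def build_profile_query(depth=2):
    """Build a Cypher query fetching a customer's profile subgraph."""
    return (
        "MATCH (c:Customer {id: $customer_id})"
        f"-[r:RELATED_TO*1..{depth}]->(n) "
        "RETURN c, r, n"
    )

query = build_profile_query()
params = {"customer_id": "42"}  # passed to the driver alongside the query
print(query)
```

Parameterizing `$customer_id` rather than interpolating it keeps the query plan cacheable and avoids injection issues.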
Realtime streaming architecture in INFINARIO, by Jozo Kovac
About our experience with real-time analyses on a never-ending stream of user events. We discuss the Lambda architecture, the Kappa architecture, Apache Kafka, and our own approach.
Lambda architecture for real-time big data, by Trieu Nguyen
Lambda Architecture in Real-time Big Data Project
Concepts & Techniques “Thinking with Lambda”
Case study in some real projects
Why is the Lambda architecture the correct solution for big data?
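The three layers named in the Lambda architecture can be sketched in a few lines of plain Python. This is an illustrative toy, not the talk's implementation: the page-view events, view names, and merge logic are invented to show the batch/speed/serving split.

```python
# Library-free sketch of the Lambda architecture's three layers.
from collections import Counter

def batch_layer(master_dataset):
    """Recompute the batch view from the full, immutable master dataset."""
    return Counter(event["page"] for event in master_dataset)

def speed_layer(realtime_view, event):
    """Incrementally update the real-time view as each new event arrives."""
    realtime_view[event["page"]] += 1

def serving_layer(batch_view, realtime_view, page):
    """Answer a query by merging the batch and real-time views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

master = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
batch_view = batch_layer(master)               # periodic full recompute
realtime_view = Counter()
speed_layer(realtime_view, {"page": "/home"})  # event not yet in the batch view
print(serving_layer(batch_view, realtime_view, "/home"))  # 3
```

The key property shown: the query stays correct even though the newest event has not yet been absorbed by a batch recompute.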
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark (Databricks)
Around the world, businesses are turning to AI to transform the way they operate and serve their customers. But before they can implement these technologies, companies must address the roadblock of moving from batch analytics to making real-time decisions by rapidly accessing and analyzing the relevant information amidst a sea of data. Yaron will explain how to make Spark handle multivariate real-time, historical and event data simultaneously to provide immediate and intelligent responses. He will present several time sensitive use-cases including fraud detection, prevention of outages and customer recommendations to demonstrate how to perform predictive analytics and real-time actions with Spark.
Speaker: Yaron Ekshtein
Time Series Analysis Using an Event Streaming Platform, by Dr. Mirko Kämpf
Advanced time series analysis (TSA) requires very special data preparation procedures to convert raw data into useful and compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and for protecting safety.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation we will look at typical data aggregation patterns. We investigate how to apply analysis algorithms in the cloud. Finally we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent cloud.
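The "reconstruction of correlation networks" mentioned above can be illustrated with a small pure-Python example: compute pairwise Pearson correlations between time series and keep an edge wherever the correlation is strong. The sensor names, data, and 0.9 threshold are invented for the sketch.

```python
# Illustrative sketch: building a correlation network from time series.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

series = {
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "sensor_b": [2.1, 4.0, 6.2, 8.1],  # tracks sensor_a closely
    "sensor_c": [5.0, 1.0, 4.0, 2.0],  # unrelated
}

# Keep an edge whenever |correlation| exceeds the chosen threshold.
threshold = 0.9
names = sorted(series)
edges = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if abs(pearson(series[a], series[b])) > threshold]
print(edges)  # [('sensor_a', 'sensor_b')]
```

The resulting edge list is the reconstructed graph; at scale this all-pairs step is what the streaming platform would parallelize.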
Data Warehousing with Spark Streaming at Zalando (Databricks)
Zalando's AI-driven products and distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate already-outdated data. Modern data integration pipelines need to deliver fast, easy-to-consume data sets of high quality. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely used master data as S3 or Kafka streams and as snapshots at the same time.
The talk will cover challenges in our fashion data platform and provide a detailed architectural deep dive into separating integration from enrichment, providing streams as well as snapshots, and feeding the data to distributed data marts. Finally, lessons learned and best practices around Delta's MERGE command, the Scala API vs. Spark SQL, and schema evolution give more insights and guidance for similar use cases.
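For readers unfamiliar with the MERGE command mentioned above, here is the general shape of a Delta Lake upsert. The table and column names are invented for illustration; only the MERGE INTO / WHEN MATCHED / WHEN NOT MATCHED structure is standard Delta SQL.

```python
# Hedged sketch of the Delta Lake MERGE (upsert) pattern.
# Schema and table names (warehouse.articles, staging.article_updates)
# are hypothetical, not from Zalando's platform.
merge_sql = """
MERGE INTO warehouse.articles AS target
USING staging.article_updates AS source
ON target.article_id = source.article_id
WHEN MATCHED THEN
  UPDATE SET target.price = source.price, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (article_id, price, updated_at)
  VALUES (source.article_id, source.price, source.updated_at)
"""
# With PySpark and Delta available, this would run as: spark.sql(merge_sql)
print(merge_sql.strip().splitlines()[0])
```

Matched rows are updated in place and unmatched rows inserted, which is what lets a streaming pipeline maintain a continuously current snapshot table.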
Simplify and Scale Data Engineering Pipelines with Delta Lake (Databricks)
We’re always told to ‘Go for the Gold!’, but how do we get there? This talk will walk you through the process of moving your data to the finish line to get that gold medal! A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (‘Bronze’ tables), transformation/feature engineering (‘Silver’ tables), and machine learning training or prediction (‘Gold’ tables). Combined, we refer to these tables as a ‘multi-hop’ architecture. It allows data engineers to build a pipeline that begins with raw data as a ‘single source of truth’ from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake, so you can be the champion in your organization.
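The Bronze/Silver/Gold hops can be sketched without any Spark or Delta dependency; here each hop is just a Python list standing in for a table, with invented example records, to show what "progressively adding structure" means.

```python
# Library-free sketch of the 'multi-hop' (Bronze/Silver/Gold) pattern.
# In a real pipeline each hop would be a Delta table; here they are lists.
import json

raw = ['{"id": 1, "amount": "10.5"}', '{"id": 2, "amount": "bad"}',
       '{"id": 3, "amount": "4.5"}']

# Bronze: ingest raw records as-is, preserving the original payload.
bronze = [{"raw": line} for line in raw]

# Silver: parse, type-cast, and drop records that fail validation.
def parse(rec):
    try:
        d = json.loads(rec["raw"])
        return {"id": d["id"], "amount": float(d["amount"])}
    except (ValueError, KeyError):
        return None  # quarantine/skip malformed input

silver = [r for r in map(parse, bronze) if r is not None]

# Gold: business-level aggregate ready for ML or BI.
gold = {"total_amount": sum(r["amount"] for r in silver)}
print(gold)  # {'total_amount': 15.0}
```

Because Bronze keeps the raw payload, Silver and Gold can always be rebuilt from the single source of truth when parsing rules change.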
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli (Spark Summit)
In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis... (Spark Summit)
Redis accelerates Apache Spark execution by 45 times when used as a shared, distributed in-memory datastore for Spark in analyses such as time-series range queries. With the Redis module for machine learning, redis-ml, Spark ML models gain a new real-time serving layer that offloads model processing directly to Redis, lets multiple applications reuse the same models, and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs connector for Apache Spark that enhances production implementations of real-time big data processing.
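The time-series range query mentioned above maps naturally onto a score-ordered structure like a Redis sorted set. As a stand-in that needs no Redis server, this sketch does the same range lookup with `bisect` over timestamp-sorted points; the sample data is invented, and real code would go through the spark-redis connector.

```python
# Sketch of the time-range query idea behind Redis sorted sets (ZRANGEBYSCORE),
# simulated with bisect over (timestamp, value) pairs kept sorted by timestamp.
import bisect

points = sorted([(1000, 10.0), (1005, 10.2), (1010, 9.8), (1020, 10.5)])
timestamps = [t for t, _ in points]

def range_query(t_start, t_end):
    """Return all points with t_start <= timestamp <= t_end."""
    lo = bisect.bisect_left(timestamps, t_start)
    hi = bisect.bisect_right(timestamps, t_end)
    return points[lo:hi]

print(range_query(1005, 1015))  # [(1005, 10.2), (1010, 9.8)]
```

Keeping data ordered by score is what lets such range scans avoid touching the full dataset, which is the property the talk attributes to the Redis-backed setup.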
Slides from my presentation on Lambda Architecture at Indix, presented at Fifth Elephant 2014.
It talks about our experience in using the Lambda Architecture at Indix to build a large-scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and... (Databricks)
Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. Workday is a “pure SaaS” company, providing a suite of Financial and HCM (Human Capital Management) apps to about 2,000 companies around the world, including more than 30% of the Fortune 500. There are significant business and technical challenges in supporting millions of concurrent users and hundreds of millions of daily transactions. A memory-centric, graph-based architecture allowed us to overcome most of these problems.
As Workday grew, transactions from existing and new customers generated vast amounts of valuable and highly sensitive data. The next big challenge was to provide an in-app analytics platform that could handle the multiple types of accumulated data and also allow blending in external datasets. Workday users wanted it to be super-fast, but also intuitive and easy to use, both for financial and HR analysts and for regular, less technical users. Existing backend technologies were not a good fit, so we turned to Apache Spark.
In this presentation, we will share the lessons we learned when building highly scalable multi-tenant analytics service for transactional data. We will start with the big picture and business requirements. Then describe the architecture with batch and interactive modules for data preparation, publishing, and query engine, noting the relevant Spark technologies. Then we will dive into the internals of Prism’s Query Engine, focusing on Spark SQL, DataFrames and Catalyst compiler features used. We will describe the issues we encountered while compiling and executing complex pipelines and queries, and how we use caching, sampling, and query compilation techniques to support interactive user experience.
Finally, we will share the future challenges for 2018 and beyond.
Optimizing industrial operations using the big data ecosystem (DataWorks Summit)
GE Digital is undertaking a journey to optimize the reliability, availability, and efficiency of assets in the industrial sector and to converge IT and OT. To do so, GE Digital is building cloud-based products that enable customers to analyze asset data, detect anomalies, and receive recommendations for operating plants efficiently while increasing productivity. In energy sectors such as oil and gas, power, or renewables, a single plant comprises multiple complex assets, such as steam turbines, gas turbines, and compressors, to generate power. Each system contains various sensors that detect the operating conditions of the assets, generating a large volume and variety of data. A highly scalable distributed environment is required to analyze such volumes of data and provide operating insights in near real time.
In this session I will share the challenges encountered when analyzing large volumes of data and doing in-stream analysis, how we standardized the industrial data using data frames, and how we approached performance tuning.
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz... (Databricks)
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today’s DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial and error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers in debugging big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides data provenance capability, which can help understand how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a realtime code fix feature, and selectively replay the execution from that step. Finally, BigDebug proposes an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only affected data and code.
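The "on-demand watchpoint with a guard predicate" described above can be pictured as a tap on a data pipeline that captures only matching intermediate records while letting everything flow through unchanged. This is a plain-Python analogy with made-up data, not BigDebug's actual Spark API.

```python
# Toy illustration of a guard-predicate watchpoint on a pipeline.
captured = []

def watchpoint(records, guard):
    """Yield records unchanged; capture those matching the guard predicate."""
    for r in records:
        if guard(r):
            captured.append(r)  # ship only selected records back for inspection
        yield r                 # downstream computation is unaffected

data = [3, -1, 7, -5, 2]
# Downstream transformation keeps running at full throughput...
doubled = [x * 2 for x in watchpoint(data, lambda x: x < 0)]
# ...while the watchpoint captured only the suspicious negatives.
print(doubled, captured)  # [6, -2, 14, -10, 4] [-1, -5]
```

The point of the pattern, as in BigDebug, is that inspection is selective: only records satisfying the predicate are transferred, so the distributed computation is never paused.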
The BigDebug system should contribute to improving Spark developer productivity and the correctness of their Big Data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and has produced several research papers in top Software Engineering and Database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
Big Data Meets Learning Science: Keynote by Al Essa (Spark Summit)
How do we learn and how can we learn better? Educational technology is undergoing a revolution fueled by learning science and data science. The promise is to make a high-quality personalized education accessible and affordable by all. In this presentation Alfred will describe how Apache Spark and Databricks are at the center of the innovation pipeline at McGraw Hill for developing next-generation learner models and algorithms in support of millions of learners and instructors worldwide.
2024 February 28 - NYC - Meetup: Unlocking Financial Data with Real-Time Pipelines, by Timothy Spann
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipelines
https://www.meetup.com/futureofdata-newyork/events/298660453/
Unlocking Financial Data with Real-Time Pipelines
(Flink Analytics on Stocks with SQL)
By Timothy Spann
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data. I will be utilizing NiFi 2.0 with Python and Vector Databases.
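To give a feel for the "Flink Analytics on Stocks with SQL" part of the talk, here is what a tumbling-window average over stock ticks expresses, sketched in plain Python. The tick data and the 60-second window size are invented; in Flink SQL this would be a `GROUP BY TUMBLE(ts, INTERVAL '60' SECOND), symbol` aggregation.

```python
# Plain-Python sketch of a tumbling-window average price per symbol.
from collections import defaultdict

ticks = [  # (epoch_seconds, symbol, price) - invented sample data
    (0, "ACME", 100.0), (30, "ACME", 102.0), (70, "ACME", 104.0),
]

WINDOW = 60  # seconds per tumbling window

sums = defaultdict(lambda: [0.0, 0])
for ts, symbol, price in ticks:
    key = (symbol, ts // WINDOW)  # assign each tick to exactly one window
    sums[key][0] += price
    sums[key][1] += 1

averages = {k: total / n for k, (total, n) in sums.items()}
print(averages)  # {('ACME', 0): 101.0, ('ACME', 1): 104.0}
```

Tumbling windows partition time into non-overlapping buckets, which is why each tick contributes to exactly one average.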
Timothy Spann
Principal Developer Advocate, Cloudera
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
https://twitter.com/PaaSDev
https://www.linkedin.com/in/timothyspann/
https://medium.com/@tspann
https://github.com/tspannhw/FLiPStackWeekly/
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S... (Grid Dynamics)
This presentation outlines key business drivers for real-time analytics applications in retail and describes the emerging architectures based on In-Stream Processing (ISP) technologies. The slides present a complete open blueprint for an ISP platform - including a demo application for real-time Twitter Sentiment Analytics - designed with 100% open source components and deployable to any cloud.
To learn more, read an adjoining blog series on this topic here : https://blog.griddynamics.com/in-stream-processing-service-blueprint
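As a taste of the demo application's real-time Twitter sentiment scoring, here is a minimal lexicon-based scorer applied per message. The word lists and messages are invented; the actual blueprint runs this kind of per-record scoring inside a full in-stream processing stack.

```python
# Hedged sketch: tiny lexicon-based sentiment scoring over a message stream.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "terrible", "awful"}

def sentiment(tweet):
    """Classify a message by counting positive vs. negative lexicon hits."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

stream = ["I love this store", "terrible checkout experience", "just browsing"]
print([sentiment(t) for t in stream])  # ['positive', 'negative', 'neutral']
```

Because the function is stateless per message, it parallelizes trivially across stream partitions, which is what makes it a good fit for an in-stream processing platform.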
Fast Data – Fast Cars: How Apache Kafka is revolutionizing the data world (Confluent)
For the automotive industry, as for every other sector, the digital transformation is at the same time a digital revolution: new market players, new technologies, and ever-larger volumes of data create new opportunities but also new challenges, and call for entirely new ways of thinking as well as new IT architectures.
60% of Fortune 500 companies rely on the comprehensive distributed streaming platform Apache Kafka® for their data streaming projects, among them AUDI AG.
In this webinar you will learn:
How Kafka serves as the foundation both for data pipelines and for applications that consume and process real-time data streams.
How Kafka Connect and Kafka Streams support business-critical applications
How Audi, with the help of Kafka and Confluent, built a Fast Data IoT platform that is revolutionizing the “Connected Car” space
Speakers:
David Schmitz, Principal Architect, Audi Electronics Venture GmbH
Kai Waehner, Technology Evangelist, Confluent
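The consume-transform-produce pattern that Kafka Streams provides for such applications can be pictured with a plain-Python analogy. Topics are simulated with lists and the sensor records, keys, and 80-degree threshold are invented; real code would use the Streams DSL (or a Kafka client library) against live topics.

```python
# Plain-Python analogy of a stateless Kafka Streams transform:
# read from an input topic, process each record, write to an output topic.

input_topic = [("car-1", "21.5"), ("car-2", "22.3"), ("car-1", "95.0")]
output_topic = []

def process(record):
    """Per-record transform: parse the value and flag implausible readings."""
    key, value = record
    temp = float(value)
    status = "alert" if temp > 80.0 else "ok"
    return (key, {"temp": temp, "status": status})

for rec in input_topic:            # the 'stream' loop a framework would run
    output_topic.append(process(rec))

print([v["status"] for _, v in output_topic])  # ['ok', 'ok', 'alert']
```

Keeping the transform stateless and keyed, as here, is what lets a streaming platform scale it out by partition.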
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms (Anant Corporation)
During this lunch, we’ll review open-source reverse ETL tools to uncover how to send data back to SaaS systems.
Developing Enterprise Consciousness: Building Modern Open Data Platforms (ScyllaDB)
ScyllaDB, alongside some of the other major distributed real-time technologies, gives businesses a unique opportunity to achieve enterprise consciousness: a business platform that delivers data to the people who need it, when they need it, anytime, anywhere.
This talk covers how modern tools in the open data platform can help companies synchronize data across their applications, using open source tools and technologies as well as more modern low-code ETL/reverse ETL tools.
Topics:
- Business Platform Challenges
- What Enterprise Consciousness Solves
- How ScyllaDB Empowers Enterprise Consciousness
- What can ScyllaDB do for big companies
- What can ScyllaDB do for smaller companies
EDA Meets Data Engineering – What's the Big Deal? (confluent)
Presenter: Guru Sattanathan, Systems Engineer, Confluent
Event-driven architectures have been around for many years, much like Apache Kafka®, which was first open sourced in 2011. The reality is that the true potential of Kafka is only being realised now. Kafka is becoming the central nervous system of many of today’s enterprises, bringing a profound paradigm shift to the way we think about enterprise IT. What has changed in Kafka to enable this paradigm shift? How is it more than just a message broker, and how are enterprises using it today? This session will explore these key questions.
Sydney: https://content.deloitte.com.au/20200221-tel-event-tech-community-syd-registration
Melbourne: https://content.deloitte.com.au/20200221-tel-event-tech-community-mel-registration
Oracle Digital Business Transformation and Internet of Things by Ermin Prašović (Bosnia Agile)
This session discusses solutions and Oracle's strategy to support digital transformation for companies interested in their business transformation path, and how to align with the modern trends brought by digitalization. The second part of the session covers what is new in Oracle's offering for Internet of Things (IoT) services, including solutions based on IoT.
How do you reinvent your organization in an iterative and pragmatic way? This is the result of using our digital toolbox. It allows you to transform your business model and expand your ecosystem by setting up your digital platform. This reinvention is also supported by the adaptation of your governance, allowing you to innovate while guaranteeing the performance of your organization. For any information / suggestion / collaboration - william.poos@nrb.be
Processing Real-Time Data at Scale: A streaming platform as a central nervous... (confluent)
(Marcus Urbatschek, Confluent)
Presentation during Confluent’s streaming event in Munich. This three-day hands-on course focused on how to build, manage, and monitor clusters using industry best-practices developed by the world’s foremost Apache Kafka™ experts. The sessions focused on how Kafka and the Confluent Platform work, how their main subsystems interact, and how to set up, manage, monitor, and tune your cluster.
SnapLogic has been gaining traction in big-data integration. It recently announced the Fall 2015 release of its Elastic Integration Platform, which adds capabilities for big-data integration that now include Spark (an open source in-memory data-processing framework), a new Snap (preconfigured connector) for Cassandra (an open source distributed ‘big’ database) and support for Microsoft Cortana Analytics. SnapLogic is positioning this release as a self-service hybrid cloud integration offering, and it is intended to strengthen its position among Microsoft customers and others seeking cloud-based big-data analytics.
Technical Deep Dive: Using Apache Kafka to Optimize Real-Time Analytics in Fi... (confluent)
Watch this talk here: https://www.confluent.io/online-talks/using-apache-kafka-to-optimize-real-time-analytics-financial-services-iot-applications
When it comes to the fast-paced nature of capital markets and IoT, the ability to analyze data in real time is critical to gaining an edge. It’s not just about the quantity of data you can analyze at once, it’s about the speed, scale, and quality of the data you have at your fingertips.
Modern streaming data technologies like Apache Kafka and the broader Confluent platform can help detect opportunities and threats in real time. They can improve profitability, yield, and performance. Combining Kafka with Panopticon visual analytics provides a powerful foundation for optimizing your operations.
Use cases in capital markets include transaction cost analysis (TCA), risk monitoring, surveillance of trading and trader activity, compliance, and optimizing profitability of electronic trading operations. Use cases in IoT include monitoring manufacturing processes, logistics, and connected vehicle telemetry and geospatial data.
This online talk will include in-depth practical demonstrations of how Confluent and Panopticon together support several key applications. You will learn:
-Why Apache Kafka is widely used to improve performance of complex operational systems
-How Confluent and Panopticon open new opportunities to analyze operational data in real time
-How to quickly identify and react immediately to fast-emerging trends, clusters, and anomalies
-How to scale data ingestion and data processing
-How to build new analytics dashboards in minutes
Azure Data Explorer deep dive - review 04.2020 (Riccardo Zamana)
Full review (04.2020) of the Azure Data Explorer service. The slide deck is a review of Kusto in terms of usage, ingestion techniques, querying and exporting data, and using anomaly detection and clustering methods.
Getting insights from IoT data with Apache Spark and Apache Bahir (Luciano Resende)
The Internet of Things (IoT) is all about connected devices that produce and exchange data, and producing insights from these high volumes of data is challenging. In this session, we will start by providing a quick introduction to the MQTT protocol, and then focus on using AI and machine learning techniques to provide insights from data collected from IoT devices. We will present some common AI concepts and techniques used by the industry to deploy state-of-the-art smart IoT systems. These techniques allow systems to determine patterns from the data, predict and prevent failures, and suggest actions that can minimize or avoid IoT device breakdowns in an intelligent way, beyond rule-based and database search approaches. We will finish with a demo that puts together all the techniques discussed in an application that uses Apache Spark and Apache Bahir support for MQTT.
Mesos Meetup - Building an enterprise-ready analytics and operational ecosyst... (Stratio)
On November 6th, we got together at Google Campus to talk about Mesos and DC/OS.
Ignacio Mulas, Sparta & Spark Product Owner at Stratio, explained how to build an environment that can secure and govern its data for operational and analytical applications on top of the DC/OS platform. He showed that analytical and machine learning pipelines can be combined with operational processes while maintaining security, with governance tools provided to manage our data. He focused on the architecture and tools needed to achieve such an ecosystem and showed a demo of it. He also explained how pipelines can be developed interactively with auto-discovered data catalogs, and how to explore the results.
Find out more: https://www.stratio.com/events/discover-how-to-deploy-a-secure-big-data-pipeline-with-dcos/
Can an intelligent system exist without awareness? BDS18 (Stratio)
Marco Baena, Head of AI at Stratio, presented at Big Data Spain 2018 "Can an intelligent system exist without awareness?"
Description:
Is a system without awareness a truly intelligent system? Here he showed why pursuing awareness is a mandatory approach and how an Awareness-Centric model is the key to any competitive modern business. In addition, he also showed the technical benefits of such systems and how Stratio's proposal is a reference when following this model.
On September 6th, we got together at Campus Madrid to learn about Kafka and KSQL. Discover with Antonio Abril, Software Architect at Stratio, how we can use Kafka to process real-time social media data.
Find out more about the event: https://www.stratio.com/blog/events/apache-kafka-and-ksql-in-action/
On July 19th, we got together at Google Campus to talk about how to increase and complete existing data to improve Machine Learning Models.
Fernando Velasco, Data Scientist at Stratio, and Raúl de la Fuente, Presales at Stratio, talked about techniques of image processing like Data Augmentation and other more modern techniques that involve the use of Deep Learning models.
More info: http://www.stratio.com/blog/events/planet-data-scientist-live-meet-the-wild-data/
Using Kafka on Event-driven Microservices Architectures - Apache Kafka Meetup (Stratio)
On July 18th, we got together at Campus Madrid to discover all about Kafka. Discover with Óscar Gómez, Software Architect at Stratio, how Kafka can help us with our event-driven Microservices Architectures.
Find out more: http://www.stratio.com/blog/events/all-about-kafka-origins-ecosystem-and-future-directions/
Within the use of Machine Learning models for prediction, one set of techniques that stands out is model combination. We will study these combinations with Fernando Velasco, Data Scientist at Stratio, who will explain what they are, and why and when to use them. Two of the main general techniques will be explained, boosting and bagging, and finally, how to perform feature selection through ensembles.
Looking to build Kappa architectures or SMACK applications, or to use distributed technologies such as Spark, Kafka, Elasticsearch and Hadoop? Do you want to build your own data-centric platform? Are you lacking a development team with Scala and Spark skills? Are you trying to move towards full Digital Transformation?
Have our questions stressed you out enough? Then you are ready for our latest release!
Our next product release will include Sparta 2.0, ready to solve the above-mentioned issues. Stratio clients will be able to build Big Data processes and data pipeline workflows in minutes with an amazing UI, integrated with Spark, Kafka and several Big Data technologies.
In this session, we will show how Sparta 2.0 and its workflow jobs are a key piece in a data-centric platform and how it is integrated with a PaaS (DC/OS). We will also show how Stratio has made sure all pieces within the platform, including Sparta 2.0 are secured.
By: José Carlos García and Javier Yuste
A data-centric platform integrates multiple Big Data open source technologies. For example, at Stratio we use Spark, Kafka, Elasticsearch and many more. Most of these technologies do not offer native security. This lack of security not only leaves companies open to critical risks like data leakage, insecure communications or DoS attacks, but is also a major barrier to complying with different regulations such as LOPD, PCI-DSS or the upcoming GDPR. This talk gives a technical and innovative overview of how companies can face the challenge of protecting the data and services in their data-centric platform, focusing on three main aspects: implementing network segmentation, managing AAA and securing data processing.
By: Carlos Gómez
Our data lake is full of data, our business intelligence is squeezing every byte of information and our operational applications are just great… why do I still feel I can do better? Having big data gives you a competitive advantage, but using big data in your daily operations will give you much more. Taking the best of both worlds, we aim for systems in which big data analysis is performed on operational data in real-time and our applications embed the extracted intelligence in their everyday operations. The good news is that combining both is perfectly possible using a data-centric approach together with well-known industry patterns and a few good practices.
By: Nacho Mulas
Artificial Intelligence on Data Centric Platform (Stratio)
Digital Transformation starts with data. What if a solution existed that put data at the center, in a single place, serving all applications around it? This training will include a demonstration in a distributed data-centric platform which provides a data intelligence layer, composed of artificial intelligence models able to make use of a whole company’s data.
Nowadays, one of the most innovative techniques in the realm of artificial intelligence is Deep Neural Nets. Among the many applications, language modelling, machine translation and image generation are receiving particular attention. Deep nets are also powerful in predictive modelling domains such as stock pricing and the energy industry. We will address a few case studies modeled with TensorFlow, running on Stratio’s data-centric product in a distributed cluster.
By: Fernando Velasco
Opening of our Deep Learning Lunch & Learn series. First session: introduction to Neural Networks, Gradient descent and backpropagation, by Pablo J. Villacorta, with a prologue by Fernando Velasco
“A Distributed Operational and Informational Technological Stack” Stratio
By Loreto Fernández Costas and Adrián Doncel Gabaldó.
Digital Transformation starts with data. What if a solution existed that put data at the center, in a single place, serving all applications around it – A distributed data centric solution that combined the operational and the informational, managed by a single data center operating system?
This session will provide a detailed explanation of such a solution, bringing the concept of data centricity to life. We will cover the details of the array of open source technologies that come together to create a transformational solution to the historic problem of physical companies: from multiple data stores, distributed run-time engines and SQL engines based on Spark, to microservices, Machine Learning and Deep Learning algorithms. Big Data 3.0 is just around the corner.
Meetup: How to monitor and optimize Spark processes using the Spark Web UI (Stratio)
Apache Spark is already a reality in the computing world, and now we need not only to know what the technology is capable of, but also to make it productive. To do so, we need to know how to audit each of its processes in a simple way, without requiring deep knowledge of the technology.
Jorge López-Malla shows which tools the Apache Spark framework itself provides to monitor the performance of algorithms, how to take advantage of them to improve Apache Spark jobs, both streaming and batch, and what magic SparkSQL performs when you want to do a simple join.
Nowadays, combining Machine Learning models is very popular: in most Kaggle competitions, the top entries usually employ some variant of these techniques. We talk about what leads us to use them, when and how. We focus on resampling techniques such as bagging and boosting, work through the implementation of a simple AdaBoost over binary trees, and finally see an application of Random Forest for feature selection.
More information:
www.stratio.com
https://www.youtube.com/StratioBD
Big Data projects have moved past their POC phases, where security was, at best, a secondary concern. Big Data tools, and more specifically those used to process data, therefore need to catch up on security.
Tools like Spark were not designed with security in mind. That is why Abel and Jorge want to share the hacks they had to apply to Spark in order to use Kerberos to authenticate against secured services.
Classification algorithms play an important role in different business areas, such as fraud detection, cross-selling or customer behavior. In the business context, interpretability is a very desirable property, sometimes even a hard requirement. However, interpretable algorithms are usually outperformed by other non-interpretable algorithms such as Random Forest. In this talk Antonio Soriano and Mateo Alvarez presented a distributed implementation in Spark of the Logistic Model Tree (LMT) algorithm (Landwehr, et al. (2005). Machine Learning, 59(1-2), 161-205.), which consists of a decision tree with logistic classifiers in the leaves. While being highly interpretable, the LMT consistently performs equally or better than other popular algorithms in several performance metrics such as accuracy, precision/recall or area under the ROC curve.
Stratio's Cassandra Lucene index: Geospatial use cases - Big Data Spain 2016 (Stratio)
Stratio’s Cassandra Lucene Index, derived from Stratio Cassandra, is an open-source plugin for Apache Cassandra that extends its index functionality to provide near real-time search, as in Elasticsearch or Solr, including full-text search capabilities and free multivariable, geospatial and bitemporal search. It is achieved through an Apache Lucene-based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio’s Cassandra indexes are one of the core modules on which Stratio’s Big Data platform is based.
Andres de la Peña discusses the recently added geospatial search features in Stratio's Cassandra Lucene index using some Nephila Capital use cases. These new features include indexing complex polygons, nearest neighbour search, and the application of chained geometrical transformations such as bounding box, convex hull, centroid, union, intersection, exclusion and distance buffer.
5. [Architecture diagram] The Stratio platform around a customer lake: Stratio Ingestion (ingests and transforms data from CRM, ERP, call center, BI, and internal and external data sources), Stratio Streaming (analyzes and processes real-time streaming), Stratio Quantum, Stratio Deep (processes and combines with Spark), Stratio Crossdata (a unified SQL interface with Machine Learning and algorithms, exposed through ODBC, JDBC and a REST API) and Stratio Datavis (creates and designs dashboards and reports for BI, ad hoc and application consumers). Storage backends: HDFS, S3, Elasticsearch, MongoDB, Cassandra, Redis, Oracle, DB2 and other databases.
6. [Architecture diagram] The same platform with the ingestion path detailed: Stratio Ingestion pulls streaming data through Apache Kite and Apache Flume from CRM, ERP, call center, BI, and internal and external data sources; Stratio Streaming analyzes and processes it; Stratio Crossdata combines data from any source with Spark through its unified SQL interface (ODBC, JDBC, REST API), with MLlib-based Machine Learning and algorithms; Stratio Deep processes and combines with Spark over the customer lake (HDFS, S3, Elasticsearch, MongoDB, Cassandra, Redis, Oracle, DB2 and other databases); Stratio Datavis creates and designs dashboards and reports.
7. [Architecture diagram] Data combination through time on the same platform: real-time data lives in ephemeral tables, past data in stored tables and future data in Quantum tables. Stratio Crossdata lets users consult and analyze all of them through its SQL interface (ODBC, JDBC, REST API) with MLlib-based Machine Learning and algorithms.
8. [Architecture diagram] Informational + operational without the need to replicate data: alongside the analytical backends (HDFS, S3, Elasticsearch, MongoDB, Cassandra, Redis), the same SQL interface queries the operational stores (Oracle, DB2, MongoDB, Teradata and other databases).
10. The time is NOW
We all know this story already. Social media and networking sites are a part of the fabric of everyday life, changing the way the world shares and accesses information. The overwhelming amount of information gathered not only from messages, updates and images but also from readings from sensors, GPS signals and many other sources was the origin of a (big) technological revolution. Remember? VOLUME, VARIETY & VELOCITY
CONFERENCE10
11. Look at these sexy infographics!
We all love data visualization. Insights from this vast amount of data allow us to learn from the users and explore our own world. We can follow in real-time the evolution of a topic, an event or even an incident just by exploring aggregated data.
12. Delivering real-time business on the Internet
But beyond cool visualizations, there are some core services delivered in real-time, using aggregated data to answer common questions in the fastest way. These services are the heart of the business behind their nice logos. Site traffic, user engagement monitoring, service health, APIs, internal monitoring platforms, real-time dashboards… Aggregated data feeds directly to end users, publishers, and advertisers, among others.
13. Pushing business processes to perform faster
Digital companies, born to develop their services in real-time, have changed the expectations of many other businesses. Real-time information makes it possible for a company to be much more agile than its competitors, improving business answers and gaining insights on their performance…
14. Listen to your data…
CLIENT, TPV, accounts, loans and credits, insurances, broker, mortgages, cards, deposits, ATM, online gateway, application logs, social networks, transactions, geolocation, CRM.
Whereas business intelligence is data gathered for the purpose of analyzing trends over time, operational intelligence provides a picture of what is currently happening within a process. And we can listen to almost everything! Orders, transactions, clicks, calls, bookings, internal services...
15. …and start delivering real-time services
Real-time monitoring could be really nice, but your company needs to work in the same way as digital companies:
• Rethinking existing processes to deliver them faster, better.
• Creating new opportunities for competitive advantages.
17. Real-time fraud monitoring
[Pipeline diagram: DATA RECEIVER → REAL-TIME AGGREGATION → CONSOLIDATION (dashboarding, reporting) → FRAUD DETECTION]
Leveraging the power of Spark Streaming, we have developed some fraud detection solutions, aggregating data in real-time to work better with machine learning algorithms.
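The slides don't include code, but the core of the real-time aggregation step described above can be sketched in plain Python: bucket incoming transaction events into fixed time windows per card, so a downstream rule or ML model sees per-window features rather than raw events. Everything here (field names, the 60-second window, the velocity threshold) is an illustrative assumption, not Stratio's implementation; in Spark Streaming this would be a windowed reduce over a DStream.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed aggregation window, not from the slides

def window_key(ts: float) -> int:
    """Assign an event timestamp to a fixed 60-second window."""
    return int(ts // WINDOW_SECONDS)

def aggregate(events):
    """Aggregate raw transaction events into per-card, per-window features.

    Each event is a dict: {"card": str, "ts": float, "amount": float}.
    Returns {(card, window): {"count": n, "total": sum, "max": max_amount}}.
    """
    agg = defaultdict(lambda: {"count": 0, "total": 0.0, "max": 0.0})
    for e in events:
        key = (e["card"], window_key(e["ts"]))
        a = agg[key]
        a["count"] += 1
        a["total"] += e["amount"]
        a["max"] = max(a["max"], e["amount"])
    return dict(agg)

def suspicious(agg, max_tx_per_window=5):
    """Toy velocity rule: too many transactions on one card in one window."""
    return [key for key, a in agg.items() if a["count"] > max_tx_per_window]

events = [{"card": "c1", "ts": i, "amount": 10.0} for i in range(7)] + \
         [{"card": "c2", "ts": 10.0, "amount": 99.0}]
print(suspicious(aggregate(events)))  # card c1 has 7 transactions in window 0
```

A real deployment would feed these per-window features to the fraud-detection model instead of a hard-coded threshold.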
18. Extract, Transform and Aggregate
By combining Apache Flume and Spark Streaming we have deployed complex topologies to deal with data coming from heterogeneous sources. The full solution allows us to transform and aggregate data on-the-fly (data cleaning, normalization and enrichment).
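The on-the-fly cleaning, normalization and enrichment mentioned above can be sketched as a chain of pure functions applied to each record before aggregation; in Spark Streaming each step would be a map over a micro-batch. The field names and the country lookup table are invented for illustration.

```python
COUNTRY_BY_PREFIX = {"34": "ES", "44": "GB"}  # assumed enrichment lookup

def clean(record):
    """Drop obviously broken records (missing phone, negative amount)."""
    if not record.get("phone") or record.get("amount", -1) < 0:
        return None
    return record

def normalize(record):
    """Canonical phone format and lower-cased channel name."""
    record["phone"] = record["phone"].replace(" ", "").lstrip("+")
    record["channel"] = record.get("channel", "unknown").lower()
    return record

def enrich(record):
    """Attach the country inferred from the phone prefix."""
    record["country"] = COUNTRY_BY_PREFIX.get(record["phone"][:2], "??")
    return record

def transform(stream):
    """clean -> normalize -> enrich, skipping rejected records."""
    for raw in stream:
        rec = clean(raw)
        if rec is None:
            continue  # cleaning rejected the record
        yield enrich(normalize(rec))

raw = [
    {"phone": "+34 600 111 222", "amount": 12.5, "channel": "WEB"},
    {"phone": "", "amount": 3.0},               # rejected by clean()
    {"phone": "44 7700 900123", "amount": 8.0}, # enriched as GB
]
for rec in transform(raw):
    print(rec["phone"], rec["channel"], rec["country"])
```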
19. Custom data sources and storage
Each project requires specific inputs and data storages, dealing with different kinds of events. From click-stream activity to bank transactions...
[Pipeline diagram: DATA STREAM → LOADING → TRANSFORM, fed by custom logs]
20. Towards a generic real-time aggregation platform
At Stratio, we have implemented several real-time analytic projects based
on Apache Spark, Kafka, Flume, Cassandra, or MongoDB.
These technologies were always a perfect fit, but soon we found ourselves
writing the same pieces of integration code over and over again.
This is how SPARKTA was born.
22. #1 RainBird from Twitter
Some folks from Twitter shared their thoughts about real-time needs at Strata (2011). They worked on a "generic" platform in order to deal with pre-calculated data from a huge number of events.
It allows them to deal with:
• Data structures
• Hierarchical aggregation
• Temporal aggregation
• Multiple formulas
CURRENT STATE: still not open source
http://goo.gl/ykvQa
23. #2 Countandra
Countandra is a hierarchical distributed counting engine exploiting the excellent write & read performance of Cassandra.
It supports:
• Geographically distributed counting.
• An easy HTTP-based interface to insert counts.
• Hierarchical counting, such as com.mywebsite.music.
• Retrieving counts, sums and squares in near real-time.
• Simple HTTP queries providing the desired output in JSON format.
• Queries sliced by period, such as lasthour or lastyear, for minutely, hourly, daily and monthly values.
https://github.com/milindparikh/Countandra
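Countandra-style hierarchical counting is easy to sketch: each increment of a dotted key also bumps every one of its prefixes. This is a toy illustration, not Countandra's actual implementation:

```python
from collections import Counter

def hierarchical_increment(counts, key, n=1):
    # Increment every prefix of a dotted key, so a count for
    # "com.mywebsite.music" also bumps "com.mywebsite" and "com"
    parts = key.split(".")
    for i in range(1, len(parts) + 1):
        counts[".".join(parts[:i])] += n

counts = Counter()
hierarchical_increment(counts, "com.mywebsite.music")
hierarchical_increment(counts, "com.mywebsite.video")
# counts["com.mywebsite"] == 2, counts["com.mywebsite.music"] == 1
```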
CURRENT STATE: rather deprecated
24. #3 ThunderRain from Intel
ThunderRain is a Real-Time Analytical Processing
(RTAP) example using Spark and Shark, which
can be best characterized by the following four
salient properties:
• Data continuously streamed in & processed in near real-time
• Real-time data queried and presented in an online fashion
• Real-time and history data combined and mined interactively
• Predominant RAM-based processing
https://github.com/thunderain-project/thunderain
CURRENT STATE: rather deprecated
25. #4 TSAR from Twitter
TSAR (the TimeSeries AggregatoR) is a
flexible, reusable, end-to-end service
architecture on top of Summingbird.
Twitter really needs a truly robust real-time aggregation service considering their scaling and evolving needs.
They realized that many time-series
applications call for essentially the same
architecture, with only slight variations in
the data model.
https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
CURRENT STATE: still not open source
26. Towards a generic real-time aggregation platform
Some initiatives have tried to solve this problem, but until now most of them
were complex or obsolete while others were not open source.
For this reason, Stratio created SPARKTA: an open source and full-featured
platform for real-time analytics, based on Apache Spark.
This is why SPARKTA was conceived
28. Distributed, high-volume & pluggable analytics framework
Our goals:
Since Aryabhata invented zero, mathematicians such as John von Neumann have been in pursuit of efficient counting, and architects have constantly built systems that compute counts quicker. In this age of social media, where hundreds of thousands of events take place every second, we designed an aggregation engine to deliver real-time services.
• Pure Spark!
• No need to code: only declarative aggregation workflows
• Data continuously streamed in & processed in near real-time
• Ready to use out of the box
• Plug & play: flexible workflows (inputs, outputs, parsers, etc.)
• High performance
• Scalable and fault-tolerant
29. Sparkta: A first look
DRIVER - SUPERVISOR
[Diagram: aggregation policy → driver/supervisor → aggregation workflow → query services and other outputs]
The aggregation policy definition is sent to the engine. Multiple applications can be defined, each bound to a context that executes the aggregation workflow.
30. Sparkta: Deploy any number of real-time aggregation policies
DRIVER - SUPERVISOR
You can start several workflows at any time, and also stop or monitor them.
32. Sparkta: Define your real-time needs
AGGREGATION POLICY
Remember: no need to code anything.
Define your workflow in a JSON document, including:
INPUT: Where is the data coming from?
OUTPUT(s): Where should aggregated data be stored?
DIMENSION(s): Which fields will you need for your real-time analytics?
ROLLUP(s): How do you want to aggregate the dimensions?
TRANSFORMATION(s): Which functions should be applied before aggregation?
SAVE RAW DATA: Do you want to save raw events?
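Such a policy document might look roughly like the sketch below. The field names here are illustrative only and do not reflect Sparkta's actual JSON schema:

```json
{
  "input": { "type": "kafka", "topic": "transactions" },
  "outputs": [ { "type": "mongodb", "collection": "aggregates" } ],
  "dimensions": [ "city", "product" ],
  "rollups": [ { "dimensions": ["city"], "timeGranularity": "minute" } ],
  "transformations": [ { "type": "dateParser", "field": "timestamp" } ],
  "saveRawData": true
}
```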
33. Sparkta: Key Technologies
ROLLUPS
• Pass-through
• Time-based: secondly, minutely, hourly, daily, monthly, yearly...
• Hierarchical
• GeoRange: areas of different sizes (rectangles)
OPERATORS
• Max, min, count, sum
• Average, median
• Stdev, variance, count distinct
• Last value
• Full-text search
KiteSDK
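A time-based rollup applying a few of these operators can be sketched in Python. This is a simplified, single-machine stand-in for what Sparkta computes over Spark; the row layout and the hard-coded "amount" field are assumptions:

```python
from collections import defaultdict

def rollup(rows, dimensions, time_field, granularity_seconds):
    # Group rows by (dimension values, time bucket) and apply a few of
    # the operators listed above (count, sum, max, average) to "amount"
    out = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
    for row in rows:
        bucket = row[time_field] - row[time_field] % granularity_seconds
        key = tuple(row[d] for d in dimensions) + (bucket,)
        agg = out[key]
        agg["count"] += 1
        agg["sum"] += row["amount"]
        agg["max"] = max(agg["max"], row["amount"])
    # Derive the average from the accumulated count and sum
    return {k: dict(v, avg=v["sum"] / v["count"]) for k, v in out.items()}
```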
34. Sparkta SDK
INPUT
OUTPUT(s)
DIMENSION(s)
OPERATORS
TRANSFORMATION(s)
Sparkta has been conceived as an SDK. You can extend several points of the platform to fulfill your needs, such as adding new inputs, outputs, operators and dimension types.
You can also add new functions to Apache Kite in order to extend the data cleaning, enrichment and normalization capabilities.
36. Next steps in our roadmap (1)
Sparkta is a work in progress, so we still have some nice features to
develop…
QUERY
SERVICES
ALARMS
Creating a REST services layer in order to query the aggregated data allows us to isolate the final consumer from the specific data storage.
Features:
- Time ranges
- Aggregation on time ranges
- Best rollup selection
For example, I want to know if we have earned over $3000 in
London in the last hour...
Remember operational intelligence!
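Best-rollup selection can be illustrated with a small sketch: pick the coarsest pre-aggregated granularity whose buckets align exactly with the query range, so the query reads the fewest rows. The granularity names and the alignment rule here are assumptions, not Sparkta's actual logic:

```python
# Hypothetical set of pre-computed rollup granularities (in seconds)
GRANULARITIES = {"minute": 60, "hour": 3600, "day": 86400}

def best_rollup(start, end):
    # Walk granularities from finest to coarsest, keeping the coarsest
    # one whose buckets align exactly with the query range; fall back
    # to the finest granularity when nothing coarser aligns
    best = "minute"
    for name, size in sorted(GRANULARITIES.items(), key=lambda kv: kv[1]):
        if start % size == 0 and end % size == 0:
            best = name
    return best
```

For the "$3000 in London in the last hour" example, a query aligned to hour boundaries would be served from the hourly rollup rather than summing sixty minutely buckets.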
37. Next steps in our roadmap (II)
WEB
APPLICATION
DEPLOYING &
MONITORING
How about a nice web interface to create and manage policies? Forget the JSON file and use your mouse to define the workflow :)
We have been working with Spark JobServer & YARN, but it would be nice to support Mesos, for example.
Hey, did you miss something? Do you have a great idea?
Let us know!
MORE AWESOMENESS
39. OPEN TO YOUR IDEAS
www.stratio.com
@StratioBD
https://github.com/stratio/sparkta
SPARKTA is fully open source, under the Apache 2 license.
We are open to contributors & ideas
41. Do you want to try SPARKTA?
Use a full-featured sandbox to start trying SPARKTA. Just open a shell and type:
vagrant init "stratio/sparkta"
vagrant up
42. Do you want to try SPARKTA?
Getting some real-time stats from #StrataHadoop.
Our real-time policy defines some rollups in order to identify chatty users, hot hashtags and heatmaps from StrataConf tweets.
We are using the standard Twitter input from Spark Streaming, the ElasticSearch output and Kibana to display the results.