Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
2. Real Time Analytics
Stream Processing and Beyond
Mahesh Madushanka
(Associate Technical Lead)
Colombo Big Data Meetup
3. Outline
• Real Time Analytics Overview
• Stream Processing Technologies
• Apache Storm as an ETL Tool
• Cake ETL
• Apache Storm best practices
• Limitations and challenges with stream processing
• Alternatives for stream processing
14. Topology
A topology is a graph of spouts and bolts that are connected with stream groupings.
APACHE STORM
15. Stream Grouping
A stream grouping defines how a stream should be partitioned among the bolt's tasks.
16. Stream Grouping
1. Shuffle grouping: Tuples are randomly distributed across the bolt's
tasks
2. Fields grouping: The stream is partitioned by the fields specified in
the grouping.
3. Partial Key grouping: Equivalent to Fields grouping (provides better
utilization of resources when the incoming data is skewed)
4. All grouping: The stream is replicated across all the bolt's tasks.
5. Global grouping: The entire stream goes to a single one of the bolt's
tasks. Specifically, it goes to the task with the lowest id.
6. None grouping: Equivalent to shuffle grouping.
7. Direct grouping: Producer of the tuple decides which task of the
consumer will receive this tuple.
8. Local or shuffle grouping: If the target bolt has one or more tasks in
the same worker process, tuples will be shuffled to just those in-process
tasks.
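The routing behaviour of the two most common groupings can be sketched in plain Python. This is an illustrative model only, not Storm's implementation: task indices, the `rng` seed, and hash-based routing are assumptions.

```python
import random

def shuffle_grouping(tuples, num_tasks, rng=random.Random(42)):
    """Shuffle grouping: each tuple goes to a randomly chosen task,
    giving an even load on average."""
    return {i: rng.randrange(num_tasks) for i, _ in enumerate(tuples)}

def fields_grouping(tuples, num_tasks, field):
    """Fields grouping: tuples with the same value for `field` always
    route to the same task (hash partitioning), so per-key state such
    as a running count can live on one task."""
    return {i: hash(t[field]) % num_tasks for i, t in enumerate(tuples)}

tuples = [{"word": w} for w in ["storm", "spout", "storm", "bolt"]]
routes = fields_grouping(tuples, num_tasks=3, field="word")
# Both "storm" tuples (indices 0 and 2) land on the same task.
assert routes[0] == routes[2]
```

The same word-count example shows why fields grouping matters: a counting bolt only produces correct totals if every occurrence of a word reaches the same task.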
17. Reliability - ACK
Storm guarantees that every spout tuple will be fully
processed by the topology.
[Diagram: a spout tuple triggers a chain of downstream tuples; each one is ACKed back until the whole tuple tree is acknowledged]
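Storm's documented trick for tracking the tuple tree is a 64-bit XOR: every tuple id is XORed into an "ack val" once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been acknowledged. A minimal sketch for a single spout tuple (class and method names are illustrative, not Storm's API):

```python
import os

class AckTracker:
    """Tracks one spout tuple's tree via XOR of random 64-bit tuple ids."""

    def __init__(self):
        self.ack_val = 0

    def emit(self):
        """Register a new tuple in the tree; returns its random id."""
        tuple_id = int.from_bytes(os.urandom(8), "big")
        self.ack_val ^= tuple_id          # XOR in on emit
        return tuple_id

    def ack(self, tuple_id):
        self.ack_val ^= tuple_id          # XOR out on ack

    def fully_processed(self):
        # x ^ x == 0, so the value is zero iff every emitted id was acked.
        return self.ack_val == 0

tracker = AckTracker()
ids = [tracker.emit() for _ in range(4)]  # spout tuple fans out to 4 tuples
for tid in ids[:3]:
    tracker.ack(tid)
assert not tracker.fully_processed()      # one tuple still pending
tracker.ack(ids[3])
assert tracker.fully_processed()          # tree complete
```

The XOR keeps the bookkeeping constant-size regardless of how large the tuple tree grows, which is why Storm can afford to track every spout tuple.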
60. References
● Data Lake: https://martinfowler.com/bliki/DataLake.html
● Batch vs. real-time data processing: http://www.datasciencecentral.com/profiles/blogs/batch-vs-real-time-data-processing
● Storm Concepts: http://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html
● Why ColumnStore: https://mariadb.com/resources/blog/why-columnstore-important
● Spark Streaming: http://sqlstream.com/2015/03/5-reasons-why-spark-streamings-batch-processing-of-data-streams-is-not-stream-processing/
● Spark Streaming programming guide: http://spark.apache.org/docs/latest/streaming-programming-guide.html
61. Spark Streaming receives live input data streams and divides the
data into batches, which are then processed by the Spark engine to
generate the final stream of results in batches.
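The micro-batch model described above can be sketched as splitting an unbounded stream into small batches and running an ordinary batch computation on each one. A simplified model: Spark Streaming actually batches by time interval, so the fixed batch size here stands in for the interval, and the per-batch sum is just a placeholder operation.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Split a (potentially unbounded) stream into fixed-size batches."""
    it = iter(stream)
    while batch := list(islice(it, batch_size)):
        yield batch

# Each batch is handed to the engine as a small batch job;
# the results themselves form a stream of batches.
events = range(7)
results = [sum(batch) for batch in micro_batches(events, batch_size=3)]
assert results == [3, 12, 6]   # sums of [0,1,2], [3,4,5], [6]
```

This is the crux of the "Spark Streaming is not true stream processing" argument cited in the references: latency can never drop below the batch interval, whereas Storm processes each tuple as it arrives.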
Editor's Notes
Generally spouts will read tuples from an external source and emit them into the topology.
Filtering/Aggregation/Join/Transform
It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed.
An ETL developer needs to code and implement it; it is not like dragging and dropping components.
Nimbus - coordination
ZooKeeper - distributed coordination
Supervisor - on each node
Workers
Spout = 1
Bolts (1*3 + 1 + 1 + 1) = 6
Total = 7
Executors (assume one task per executor) = 7
Worker processes = 16
Servers = 2
Storm random distribution
Netty - network communication delay
Huge drop
LMAX - network communication delay
If you want, you can go with 2 servers (4 workers each).