Real Time Analytics - Stream Processing (Colombo big data meetup 18/05/2017)
2. Real Time Analytics
Stream Processing and Beyond
Mahesh Madushanka
(Associate Technical Lead)
Colombo Big Data Meetup
3. Outline
• Real Time Analytics Overview
• Stream Processing Technologies
• Apache Storm as an ETL Tool
• Cake ETL
• Apache Storm best practices
• Limitations and challenges with stream processing
• Alternatives for stream processing
14. Topology
A topology is a graph of spouts and bolts that are connected with stream groupings.
APACHE STORM
15. Stream Grouping
A stream grouping defines how a stream should be partitioned among the bolt's tasks.
16. Stream Grouping
1. Shuffle grouping: Tuples are randomly distributed across the bolt's
tasks
2. Fields grouping: The stream is partitioned by the fields specified in
the grouping.
3. Partial Key grouping: Equivalent to Fields grouping (provides better
utilization of resources when the incoming data is skewed)
4. All grouping: The stream is replicated across all the bolt's tasks.
5. Global grouping: The entire stream goes to a single one of the bolt's
tasks. Specifically, it goes to the task with the lowest id.
6. None grouping: Equivalent to shuffle grouping.
7. Direct grouping: Producer of the tuple decides which task of the
consumer will receive this tuple.
8. Local or shuffle grouping: If the target bolt has one or more tasks in
the same worker process, tuples will be shuffled to just those in-process
tasks.
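The routing behaviour of the two most common groupings can be sketched in plain Python. This is an illustrative model only, not Storm's implementation: task indices, the `rng` seed, and hash-based routing are assumptions.

```python
import random

def shuffle_grouping(tuples, num_tasks, rng=random.Random(42)):
    """Shuffle grouping: each tuple goes to a randomly chosen task,
    giving an even load on average."""
    return {i: rng.randrange(num_tasks) for i, _ in enumerate(tuples)}

def fields_grouping(tuples, num_tasks, field):
    """Fields grouping: tuples with the same value for `field` always
    route to the same task (hash partitioning), so per-key state such
    as a running count can live on one task."""
    return {i: hash(t[field]) % num_tasks for i, t in enumerate(tuples)}

tuples = [{"word": w} for w in ["storm", "spout", "storm", "bolt"]]
routes = fields_grouping(tuples, num_tasks=3, field="word")
# Both "storm" tuples (indices 0 and 2) land on the same task.
assert routes[0] == routes[2]
```

The same word-count example shows why fields grouping matters: a counting bolt only produces correct totals if every occurrence of a word reaches the same task.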
17. Reliability - ACK
Storm guarantees that every spout tuple will be fully
processed by the topology.
[Diagram: a spout tuple triggers a chain of downstream tuples; each one is ACKed back until the whole tuple tree is acknowledged]
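Storm's documented trick for tracking the tuple tree is a 64-bit XOR: every tuple id is XORed into an "ack val" once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been acknowledged. A minimal sketch for a single spout tuple (class and method names are illustrative, not Storm's API):

```python
import os

class AckTracker:
    """Tracks one spout tuple's tree via XOR of random 64-bit tuple ids."""

    def __init__(self):
        self.ack_val = 0

    def emit(self):
        """Register a new tuple in the tree; returns its random id."""
        tuple_id = int.from_bytes(os.urandom(8), "big")
        self.ack_val ^= tuple_id          # XOR in on emit
        return tuple_id

    def ack(self, tuple_id):
        self.ack_val ^= tuple_id          # XOR out on ack

    def fully_processed(self):
        # x ^ x == 0, so the value is zero iff every emitted id was acked.
        return self.ack_val == 0

tracker = AckTracker()
ids = [tracker.emit() for _ in range(4)]  # spout tuple fans out to 4 tuples
for tid in ids[:3]:
    tracker.ack(tid)
assert not tracker.fully_processed()      # one tuple still pending
tracker.ack(ids[3])
assert tracker.fully_processed()          # tree complete
```

The XOR keeps the bookkeeping constant-size regardless of how large the tuple tree grows, which is why Storm can afford to track every spout tuple.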
60. References
● Data Lake: https://martinfowler.com/bliki/DataLake.html
● Batch vs. real-time data processing: http://www.datasciencecentral.com/profiles/blogs/batch-vs-real-time-data-processing
● Storm Concepts: http://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html
● Why ColumnStore: https://mariadb.com/resources/blog/why-columnstore-important
● Spark Streaming: http://sqlstream.com/2015/03/5-reasons-why-spark-streamings-batch-processing-of-data-streams-is-not-stream-processing/
● Spark Streaming programming guide: http://spark.apache.org/docs/latest/streaming-programming-guide.html
61. Spark Streaming receives live input data streams and divides the
data into batches, which are then processed by the Spark engine to
generate the final stream of results in batches.
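The micro-batch model described above can be sketched as splitting an unbounded stream into small batches and running an ordinary batch computation on each one. A simplified model: Spark Streaming actually batches by time interval, so the fixed batch size here stands in for the interval, and the per-batch sum is just a placeholder operation.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Split a (potentially unbounded) stream into fixed-size batches."""
    it = iter(stream)
    while batch := list(islice(it, batch_size)):
        yield batch

# Each batch is handed to the engine as a small batch job;
# the results themselves form a stream of batches.
events = range(7)
results = [sum(batch) for batch in micro_batches(events, batch_size=3)]
assert results == [3, 12, 6]   # sums of [0,1,2], [3,4,5], [6]
```

This is the crux of the "Spark Streaming is not true stream processing" argument cited in the references: latency can never drop below the batch interval, whereas Storm processes each tuple as it arrives.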
Editor's Notes
Generally spouts will read tuples from an external source and emit them into the topology.
Filtering/Aggregation/Join/Transform
It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed.
An ETL developer needs to code and implement it; it is not like dragging and dropping components.
Nimbus - coordination
ZooKeeper - distributed coordination
Supervisor - on each node
Workers
Spout = 1
Bolts (1*3 + 1 + 1 + 1) = 6
Total = 7
Executors (assume one task per executor) = 7
Worker processes = 16
Servers = 2
Storm random distribution
Netty - network communication delay
Huge drop
LMAX - network communication delay
If you want, you can go with 2 servers (4 workers each).