Big Data - Meetup
Big Data Stream Processing using Apache Storm
Athens - May 2016
Who are we?
● Adrianos Dadis (@qiozas)
● Patroclos Christou (@christoupat)
● Eleftheria Chavelia
● Sofia Nomikou
Agenda
● Apache Storm
● Apache Kafka
● Streaming application demo
Why stream processing?
● Growing volume of available real-time data
● Extract actionable intelligence in real time
● Act in real time
Use case examples
● Fraud detection
● Network monitoring
● Smart order routing
● E-commerce
● Bandwidth allocation optimization
● Algorithmic trading
End-to-End Deployment
[Diagram components: Real-Time Data Stream, Stream Processing Solution, Dashboards, Data Store, Applications, Alerts, Batch Processing]
Apache Storm
● Creator: Nathan Marz (2011)
● Distributed real-time computation system for processing
large volumes of high-velocity data
● Characteristics:
– Fast
– Scalable
– Fault-tolerant
– Reliable
– Easy to operate
– Easy to develop
Storm core concepts
● Tuple : Storm uses tuples as its data model
● Stream : An unbounded sequence of tuples
● Spout : A source of streams in a topology
● Bolt : All processing in topologies is done in bolts
● Topology : DAG (directed acyclic graph) of spouts and bolts
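The concepts above can be sketched in a few lines of plain Python. This is only a toy model and not the Storm API: `sentence_spout`, `split_bolt`, `count_bolt` and `run_topology` are hypothetical names, and the stream here is a short list, whereas a real stream is unbounded.

```python
from collections import Counter

def sentence_spout():
    """Spout: a source of tuples (a real stream would be unbounded)."""
    for sentence in ["storm processes streams", "kafka feeds storm"]:
        yield (sentence,)          # a tuple with a single field

def split_bolt(tup):
    """Bolt: consumes one tuple and emits zero or more new tuples."""
    (sentence,) = tup
    for word in sentence.split():
        yield (word,)

def count_bolt(counts, tup):
    """Bolt with local state: keeps a running count per word."""
    (word,) = tup
    counts[word] += 1

def run_topology():
    """Topology: spout -> split_bolt -> count_bolt, wired as a small DAG."""
    counts = Counter()
    for sentence_tuple in sentence_spout():
        for word_tuple in split_bolt(sentence_tuple):
            count_bolt(counts, word_tuple)
    return counts

word_counts = run_topology()
```

In a real cluster each spout and bolt would run as many parallel tasks on different workers; the nested loops here only show how tuples flow along the edges of the graph.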
Storm topology
Storm Architecture
[Diagram: Nimbus master nodes; a ZooKeeper cluster for cluster coordination; Supervisor nodes for node coordination, each running Worker processes that do the processing]
Storm topology parallelism
Worker Internal Messaging
[Diagram components inside a Worker: Worker Port, Worker Receiver Thread, Receiver Buffer (List<Tuple>), Router, Inbound Queue (Disruptor), Executor Thread running a Task, Outbound Queue (Disruptor), Send Thread, Transfer Buffer (List<Tuple>), Worker Transfer Thread, Worker Port]
Stream Grouping
● Shuffle
● LocalOrShuffle
● All
● Global
● Fields
● Partial Key
● Direct
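A grouping decides which tasks of the downstream bolt receive each tuple. The sketch below illustrates four of the groupings listed above in plain Python; it is not Storm's implementation. `NUM_TASKS` and the function names are hypothetical, round-robin stands in for Storm's random shuffle, and CRC32 is an assumed stand-in for whatever stable hash the runtime uses.

```python
import itertools
import zlib

NUM_TASKS = 4  # hypothetical number of tasks of the downstream bolt

_round_robin = itertools.cycle(range(NUM_TASKS))

def shuffle_grouping(tup):
    """Spread tuples evenly over tasks (round-robin stands in for random)."""
    return [next(_round_robin)]

def fields_grouping(tup, field_index=0):
    """Same field value -> same task, via a stable hash modulo task count."""
    key = str(tup[field_index]).encode("utf-8")
    return [zlib.crc32(key) % NUM_TASKS]

def all_grouping(tup):
    """Replicate the tuple to every task."""
    return list(range(NUM_TASKS))

def global_grouping(tup):
    """Send the whole stream to a single task (the lowest task id here)."""
    return [0]

# Two tuples with the same key always land on the same task.
same_task = fields_grouping(("user-1", 10)) == fields_grouping(("user-1", 99))
```

Fields grouping is what makes per-key state (counters, sessions) safe to keep inside a bolt task, since every tuple for a given key reaches the same task.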
Reliable Processing
[Diagram: tuple tree of tuples {A}, {B}, {C}, {D}, {E}, {F}, {G}, {H} and {X}, with ACK and FAIL outcomes]
● Acking
● Anchoring
● Failures
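Storm tracks a whole tuple tree with one 64-bit value per spout tuple: every tuple id is XORed in when the tuple is anchored and again when it is acked, so the value returns to zero exactly when every tuple in the tree has been acked. A minimal sketch of that idea follows; the `AckTracker` class and its method names are hypothetical and do not mirror the real acker code.

```python
import random

class AckTracker:
    """Tracks one spout tuple's tree with a single XOR checksum."""

    def __init__(self):
        self.ack_val = 0
        self.failed = False

    def new_id(self):
        # Random 64-bit id; 'or 1' avoids the degenerate id 0.
        return random.getrandbits(64) or 1

    def anchor(self, tuple_id):
        """A new tuple was emitted as part of this tree."""
        self.ack_val ^= tuple_id

    def ack(self, tuple_id):
        """A bolt finished processing the tuple."""
        self.ack_val ^= tuple_id

    def fail(self, tuple_id):
        """A bolt failed the tuple: the spout tuple must be replayed."""
        self.failed = True

    def fully_processed(self):
        return not self.failed and self.ack_val == 0

tracker = AckTracker()
a = tracker.new_id(); tracker.anchor(a)   # spout emits {A}
b = tracker.new_id(); tracker.anchor(b)   # bolt emits {B}, anchored to {A}
tracker.ack(a)
pending = tracker.fully_processed()       # {B} is still outstanding
tracker.ack(b)
done = tracker.fully_processed()          # every tuple anchored and acked
```

The memory cost stays constant per spout tuple no matter how large the tree grows, which is why anchoring every emitted tuple is cheap.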
Streaming Windows
● Sliding Windows
● Tumbling Windows
[Diagrams: a stream of tuples {...} along a Time axis, one diagram per window type]
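The difference between the two window types is only whether consecutive windows overlap. The sketch below uses count-based windows for brevity (the slides show a time axis, and windows can equally be defined by duration); the function names are hypothetical and this is not Storm's windowing API.

```python
def sliding_windows(events, length, slide):
    """Yield windows of `length` events, advancing by `slide` events."""
    for start in range(0, len(events) - length + 1, slide):
        yield events[start:start + length]

def tumbling_windows(events, length):
    """Tumbling windows are sliding windows whose slide equals their length."""
    return sliding_windows(events, length, slide=length)

events = list(range(1, 9))                      # eight events, in arrival order
tumbling = list(tumbling_windows(events, 4))    # no event appears twice
sliding = list(sliding_windows(events, 4, 2))   # neighbours share two events
```

With these inputs the tumbling windows are [1, 2, 3, 4] and [5, 6, 7, 8], while the sliding windows are [1, 2, 3, 4], [3, 4, 5, 6] and [5, 6, 7, 8].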
Storm topology example
Storm Trident
● High level abstraction on top of Storm
● Micro Batching
● Stateful
● Built-in support:
– Functions
– Filters
– Merges and Joins
– Aggregations
– Grouping
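Micro-batching means the stream is cut into small consecutive batches, and operations such as filters, grouping and aggregations run once per batch while state is carried across batches. A rough plain-Python sketch of that processing style, with hypothetical names (`micro_batches`, `process`) and no relation to the actual Trident API:

```python
from collections import Counter
from itertools import islice

def micro_batches(stream, batch_size):
    """Cut a stream into consecutive batches, each tagged with a batch id."""
    iterator = iter(stream)
    batch_id = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        batch_id += 1
        yield batch_id, batch

def process(stream, batch_size=3):
    totals = Counter()                      # state carried across batches
    for batch_id, batch in micro_batches(stream, batch_size):
        kept = [w for w in batch if w]      # filter: drop empty words
        totals.update(Counter(kept))        # group by word, then count
    return totals

totals = process(["storm", "", "kafka", "storm", "storm", ""])
```

Working per batch amortises the cost of state updates over many tuples, at the price of each tuple waiting for its batch to fill.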
Trident Example
Partitioning
Trident execution analysis
Storm 1.x Features
● HA Nimbus
● Distributed Cache API
● Pacemaker - Heartbeat Server
● Automatic Backpressure
● Resource Aware Scheduler
● State Management
● Native Streaming Windows
Storm Integrations
● Kafka
● Redis
● Hive, HDFS
● HBase, Cassandra
● MongoDB
● Elasticsearch, Solr
● JDBC
● MQTT
Storm modes
● One-at-a-time processing (pure Storm)
– Very low latency
– Very simple development model
– At-Most-Once and At-Least-Once semantics
● Micro batch processing (Storm Trident)
– Higher per-event latency
– Better throughput at high event rates
– More complex development model
– Exactly-Once semantics
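The gap between at-least-once and exactly-once shows up when a batch is replayed after a failure. One way to get exactly-once state updates, and the idea behind Trident's transactional state, is to store the id of the last applied batch next to the value and skip a batch whose id is already stored. The sketch assumes a replayed batch keeps its id and content and that batches are applied in order; `TransactionalCounter` is a hypothetical name.

```python
class TransactionalCounter:
    """Stores the count together with the id of the last applied batch."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def apply_batch(self, txid, batch):
        if txid == self.last_txid:
            return self.count      # replayed batch: already applied, skip it
        self.count += len(batch)
        self.last_txid = txid
        return self.count

at_least_once_count = 0
state = TransactionalCounter()
deliveries = [(1, ["a", "b"]), (2, ["c"]), (2, ["c"])]  # batch 2 is replayed
for txid, batch in deliveries:
    at_least_once_count += len(batch)   # naive update counts the replay twice
    state.apply_batch(txid, batch)
```

Three distinct events were sent: the naive counter ends at 4 because of the replay, while the transactional counter ends at 3.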
Messaging Systems
● Core needs:
– Decouple processing from data producers
– Buffer unprocessed messages
● Models:
– Queuing
– Publish-Subscribe
● Frameworks:
– Kafka
– RabbitMQ
– ActiveMQ
Apache Kafka
● Distributed, partitioned, replicated commit log service
● Publish-Subscribe model
● Maintains feeds of messages in Topics
● Automatic Replication and Retention
● Brokers
Apache Kafka
● Offset uniquely identifies each message within the partition
● Consumers coordinate what to read
● Consumer & Consumer Group
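Offsets and consumer groups can be illustrated with a small in-memory model. This is a sketch only, not the Kafka client API: `Topic` and `ConsumerGroup` are hypothetical classes, CRC32 is an assumed stand-in for Kafka's partitioner, and replication, retention and brokers are left out.

```python
import zlib
from collections import defaultdict

class Topic:
    """A topic as a set of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Append to the partition chosen by key; return (partition, offset)."""
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1

class ConsumerGroup:
    """Remembers the next offset to read, per partition, for one group."""

    def __init__(self, topic):
        self.topic = topic
        self.next_offset = defaultdict(int)

    def poll(self, partition):
        log = self.topic.partitions[partition]
        offset = self.next_offset[partition]
        if offset >= len(log):
            return None                      # nothing new in this partition
        self.next_offset[partition] = offset + 1
        return offset, log[offset]

topic = Topic(num_partitions=2)
p, first = topic.publish("user-1", "login")
_, second = topic.publish("user-1", "click")   # same key, same partition
dashboards = ConsumerGroup(topic)
alerts = ConsumerGroup(topic)                  # independent read position
```

Because the log keeps the messages and each group keeps only its own offsets, both groups read the full feed independently, which is the publish-subscribe behaviour described above.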
Implementing Big Data Apps
● Design for scalability from day one
● Queries drive schema design
● Failure (hardware or data) is the normal case
● Continuous Integration
● Metrics & Monitoring from day one
● Appropriate people
Sentiment Analysis Demo
[Diagram components: Random Sentence Spout, Kafka Topic, Kafka Spout, Stemming Bolt, Positive Scoring Bolt, Negative Scoring Bolt, Final Scoring Bolt, Persistence Bolt, Kafka Topic, NoSQL]
Source: https://github.com/qiozas/sentiment-analysis-storm
Athens Big Data - Meetup - 2016
THANK YOU :-)
[ Updates / Questions / Comments ]
@qiozas
@christoupat

4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using Apache Storm