FOSSCOMM 2016
Big Data Streaming processing using Apache Storm
Agenda
● Big Data concepts
● Batch & Streaming processing
● NoSQL persistence
● Apache Storm and Apache Kafka
● Streaming application demo
● Considerations for Big Data applications
Who are we?
● Adrianos Dadis (@qiozas)
● Patroclos Christou
● Eleftheria Chavelia
● Sofia Nomikou
Big Data characteristics (3V+)
● Volume
– Terabytes
– Records
– Transactions
– Files
● Velocity
– Batch
– Real-time
– Streams
– Near-real-time
● Variety
– Structured
– Unstructured
– Semi-structured
– All the above
Use cases
● 360-degree customer view
● Internet of Things
● Data warehouse optimization
● Information security
● Sentiment analysis
● Urban Planning (MIT)
Big Data Pros
● Better user experience
● Predictive analytics
● Answer complex business questions
Big Data Cons
● Privacy
● False truths
● Strictly personalized content
Databases
● Persistence & Concurrency
● Relational Databases provide strong benefits
● Relational Databases pitfalls:
– Application development productivity
– Large-scale data
● Capture more data
● Process data faster
● Costs
NoSQL Databases
● Schema agnostic
● Nonrelational
● Commodity hardware
● Highly distributable
● NOT a silver bullet
NoSQL & CAP theorem
● Consistency
– all nodes see the same data at the same time
● Availability
– a guarantee that every request receives a response
about whether it succeeded or failed
● Partition tolerance
– the system continues to operate despite arbitrary
partitioning due to network failures
● CP or AP
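The CP versus AP choice can be illustrated with a toy model. The sketch below assumes a two-replica register and a hypothetical `TwoNodeStore` class with a `mode` flag; it is not how any particular NoSQL database is implemented (real systems use quorums, leader election and so on), it only shows what each choice gives up during a network partition.

```python
class Replica:
    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy two-replica register; `mode` is "CP" or "AP" (hypothetical names)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Give up availability: refuse the write so replicas never diverge.
                return False
            # AP: stay available, accept the write locally; replicas may disagree.
            replica.value = value
            return True
        # Healthy network: the write reaches both replicas.
        replica.value = other.value = value
        return True


cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write(store.a, "v1")      # both replicas hold "v1"
    store.partitioned = True        # simulate a network partition
cp_accepted = cp.write(cp.a, "v2")  # rejected, replicas stay consistent
ap_accepted = ap.write(ap.a, "v2")  # accepted, replicas now disagree
```

In the CP store both replicas still hold "v1" after the partition; in the AP store replica `a` holds "v2" while replica `b` holds "v1" until the partition heals and the values are reconciled.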
NoSQL Data Models
● Key Value
– Riak, Redis, Aerospike
● Document
– MongoDB, Elasticsearch
● Column Family
– Cassandra, HBase
● Graph
– Neo4j, ArangoDB, OrientDB
Big Data processing modes
● Batch
– Long running jobs
● Streaming processing
– One-at-a-time
– Micro batch
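The difference between the two streaming modes can be sketched in a few lines. This is an illustration only: the function names are made up for the example, and batches are cut by size here, whereas real micro-batch engines typically cut them by time interval.

```python
def one_at_a_time(events, handle):
    """Hand each event to the handler as soon as it arrives."""
    for event in events:
        handle([event])


def micro_batch(events, handle, batch_size):
    """Buffer events and hand them to the handler in small groups."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        handle(batch)


events = list(range(7))
calls_single, calls_batched = [], []
one_at_a_time(events, calls_single.append)
micro_batch(events, calls_batched.append, batch_size=3)
```

One-at-a-time makes seven handler calls, so each event is handled with minimal delay. Micro batch makes three calls, which amortizes per-call overhead but makes early events in a batch wait for later ones.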
Apache Hadoop - MapReduce
● Programming/Processing model
● HDFS
● Steps:
– Map
– Shuffle
– Reduce
● Apache Hadoop ecosystem
– Pig
– Hive
– more
YARN – Resource Manager
Map Reduce – word count
[Diagram: INPUT -> SPLIT -> MAP -> SHUFFLE -> REDUCE -> RESULT]
● Input: Deer Bear River / Car Car River / Deer Car Bear
● Split: one line per mapper
● Map: each word emitted as (word, 1)
– Deer, 1 / Bear, 1 / River, 1
– Car, 1 / Car, 1 / River, 1
– Deer, 1 / Car, 1 / Bear, 1
● Shuffle: pairs grouped by word
– Bear (1, 1) / Car (1, 1, 1) / Deer (1, 1) / River (1, 1)
● Reduce: Bear, 2 / Car, 3 / Deer, 2 / River, 2
● Result: Bear, 2 / Car, 3 / Deer, 2 / River, 2
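The word count stages can be mimicked on a single machine. The sketch below is plain Python, not the Hadoop API; the function names `map_phase`, `shuffle_phase` and `reduce_phase` are chosen for the example, and in a real cluster each phase would run in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped values for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
mapped = [pair for line in splits for pair in map_phase(line)]
result = reduce_phase(shuffle_phase(mapped))
print(sorted(result.items()))
# [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```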
Apache Spark (batch processing)
● Similar to MapReduce but faster (in-memory processing)
● Resilient Distributed Dataset
● Ecosystem
– Spark SQL
– MLlib
– GraphX
– Streaming
– Scala / R / Python / Java / SQL
Streaming processing
[Diagram: Real-time data stream -> Streaming processing solution -> Dashboards, Data store, Applications, Alerts, Batch processing]
Streaming processing Use Cases
● Prevent security fraud
● Optimize order routing
● Predictive maintenance
● Optimize personalized content
● Optimize prices or offers
● Prevent stock outs
● Optimize bandwidth allocation
● Prevent application failures
Streaming processing frameworks
● Apache Storm
● Apache Spark Streaming
● Apache Flink
● Apache Samza
Apache Storm
● Creator: Nathan Marz (2011)
● Distributed real-time computation system for processing large
volumes of high-velocity data
● One of the most widely used streaming engines in industry (as of 2016)
● Characteristics:
– Fast
– Scalable
– Fault-tolerant
– Reliable
– Easy to operate
– Easy to develop
Storm modes
● One-at-a-time processing (pure Storm)
– Very low latency
– Very simple development model
– At-Most-Once and At-Least-Once semantics
● Micro batch processing (Storm Trident)
– Increased per-event latency
– Better throughput at high event rates (workload dependent)
– More complex development
– Exactly-Once semantics (per batch)
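The practical difference between At-Most-Once and At-Least-Once can be shown with a toy delivery loop. This is a simplified model with made-up names (`deliver`, `make_flaky_processor`), not Storm's ack mechanism; it assumes a processor that fails once on one message.

```python
def deliver(messages, process, at_least_once):
    """Send each message to `process`; replay failures only if at_least_once."""
    for msg in messages:
        while True:
            try:
                process(msg)
                break  # processed and acknowledged
            except RuntimeError:
                if not at_least_once:
                    break  # at-most-once: drop the message, never retry
                # at-least-once: loop again and replay the same message


def make_flaky_processor(attempts, completed):
    """Processor that fails the first time it sees message "b"."""
    failed_once = set()

    def process(msg):
        attempts.append(msg)
        if msg == "b" and msg not in failed_once:
            failed_once.add(msg)
            raise RuntimeError("simulated failure")
        completed.append(msg)

    return process


amo_attempts, amo_done = [], []
alo_attempts, alo_done = [], []
deliver("abc", make_flaky_processor(amo_attempts, amo_done), at_least_once=False)
deliver("abc", make_flaky_processor(alo_attempts, alo_done), at_least_once=True)
```

With At-Most-Once, message "b" is attempted once and lost. With At-Least-Once, "b" is attempted twice and completes, so any side effect of the failed attempt may be applied twice; avoiding that duplication is what Exactly-Once semantics add.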
Storm Architecture
[Diagram: Storm cluster layout]
● Nimbus: master node
● ZooKeeper (3 nodes): cluster coordination
● Supervisor (4 nodes): node coordination
● Worker (2 per Supervisor): processing worker
Storm concepts
● Topology : A graph of computation
● Tuple : Storm uses tuples as its data model
● Stream : An unbounded sequence of tuples
● Spout : A source of streams in a topology
● Bolt : All processing in topologies is done in bolts
● Stream Grouping : Defines how a stream should be
partitioned among a bolt's tasks.
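How a stream grouping partitions tuples among a bolt's tasks can be sketched without Storm itself. The code below is plain Python with illustrative function names, not the Storm API. It assumes tuples are dictionaries, uses round-robin as a stand-in for Storm's randomized shuffle grouping, and uses a CRC32 hash as one possible way to implement a fields grouping.

```python
import itertools
import zlib


def shuffle_grouping(num_tasks):
    """Spread tuples evenly over tasks regardless of content (round-robin here)."""
    counter = itertools.count()
    return lambda tup: next(counter) % num_tasks


def fields_grouping(num_tasks, field):
    """Send tuples with the same value of `field` to the same task."""
    return lambda tup: zlib.crc32(str(tup[field]).encode()) % num_tasks


def route(stream, grouping, num_tasks):
    """Deliver each tuple to the task index chosen by the grouping."""
    tasks = [[] for _ in range(num_tasks)]
    for tup in stream:
        tasks[grouping(tup)].append(tup)
    return tasks


words = "Deer Bear River Car Car River Deer Car Bear".split()
stream = [{"word": w} for w in words]
by_field = route(stream, fields_grouping(2, "word"), 2)
by_shuffle = route(stream, shuffle_grouping(2), 2)
```

With the fields grouping every occurrence of a given word lands on the same task, which is what a counting bolt needs in order to keep a correct per-word total. With the shuffle grouping the load is balanced, but the same word can reach different tasks.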
Storm topology
Storm topology example
Storm topology parallelism
Storm Trident
Storm Trident partitioning
Storm Trident
Messaging Systems
● Decouple processing from data producers
● Buffer unprocessed messages
● Models: queuing & publish-subscribe
● Frameworks
– Kafka
– RabbitMQ
– ActiveMQ
Apache Kafka
● Distributed, partitioned, replicated commit log service
● Publish-Subscribe model
● Maintains feeds of messages in Topics
● Automatic Replication and Retention
● Brokers
Apache Kafka
● Offset uniquely identifies each message within the partition
● Consumers coordinate what to read
● Consumer & Consumer Group
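The roles of partitions, offsets and consumer groups can be modelled in memory. The sketch below is not the Kafka client API: `Partition`, `Topic`, `publish` and `poll` are hypothetical names, key hashing with CRC32 is an assumption, and replication and retention are left out.

```python
import zlib


class Partition:
    """Append-only log; a message's offset is its position in the log."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the new message


class Topic:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]
        self.committed = {}  # (group, partition index) -> next offset to read

    def publish(self, key, message):
        """Pick a partition from the key and append; return (partition, offset)."""
        index = zlib.crc32(key.encode()) % len(self.partitions)
        return index, self.partitions[index].append(message)

    def poll(self, group, index):
        """Return unread messages for a consumer group and advance its offset."""
        start = self.committed.get((group, index), 0)
        messages = self.partitions[index].log[start:]
        self.committed[(group, index)] = start + len(messages)
        return messages


topic = Topic(num_partitions=2)
index, first = topic.publish("sensor-1", "m0")
_, second = topic.publish("sensor-1", "m1")
first_read = topic.poll("dashboard", index)   # both messages
second_read = topic.poll("dashboard", index)  # nothing new
other_group = topic.poll("alerts", index)     # both messages again
```

Messages stay in the log after being read. Each consumer group only tracks its own position, so the "alerts" group sees the same messages that "dashboard" has already consumed.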
Big Data Apps hints
● Design for scalability from day one
● Queries drive schema design
● Failure (HW or data) is a normal case
● Continuous Integration
● Metrics from day one
● Monitoring
● Appropriate people
Streaming Demo – word count
[Diagram: Kafka Topic -> Kafka Spout -> Split Bolt -> Count Bolt -> Kafka Bolt -> Kafka Topic]
● Input sentences: Deer Bear River / Car Car River / Deer Car Bear
● Split Bolt output: one tuple per word
● Count Bolt output: <Bear,2> <Car,3> <Deer,2> <River,2>
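The demo pipeline can be approximated with Python generators. This is a sketch of the data flow only: Python lists stand in for the Kafka topics, the function names are chosen for the example, and no Storm or Kafka API is used.

```python
from collections import Counter


def sentence_spout(source):
    """Stand-in for the Kafka spout: emit sentences one at a time."""
    yield from source


def split_bolt(sentences):
    """Split each sentence into one tuple per word."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words):
    """Keep a running count and emit the updated (word, count) for every word."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


def sink_bolt(updates, out_topic):
    """Stand-in for the Kafka bolt: append each update to the output list."""
    for update in updates:
        out_topic.append(update)


input_topic = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
output_topic = []
sink_bolt(count_bolt(split_bolt(sentence_spout(input_topic))), output_topic)
latest = dict(output_topic)  # the last update per word wins
```

Unlike the batch version, the count bolt emits an update for every incoming word, so the output topic receives nine updates. The latest value per word gives Bear 2, Car 3, Deer 2 and River 2.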
FOSSCOMM 2016
THANK YOU :-)
[ Updates / Questions / Comments ]
@qiozas
