SlideShare a Scribd company logo
FOSSCOM 2016
Big Data
Streaming processing
using
Apache Storm
Agenda
● Big Data concepts
● Batch & Streaming processing
● NoSQL persistence
● Apache Storm and Apache Kafka
● Streaming application demo
● Considerations for Big Data applications
Who we are?
● Adrianos Dadis (@qiozas )
● Patroclos Christou
● Eleftheria Chavelia
● Sofia Nomikou
Big Data characteristics (3V+)
Terabytes
Records
Transactions
Files
Batch
Real-time
Streams
Near-real-time
Structured
Unstructured
Semi-structured
All the above
Volume Velocity
Variety
Use cases
● 360-degree customer view
● Internet of Things
● Data warehouse optimization
● Information security
● Sentiment analysis
● Urban Planning (MIT)
Big Data Pros
● Better user experience
● Predictive analytics
● Answer complex business questions
Big Data Cons
● Privacy
● False truths
● Strict personalized content
Databases
● Persistency & Concurrency
● Relational Databases provide strong benefits
● Relational Databases pitfalls:
– Application development productivity
– Large-scale data
● Capture more data
● Process data faster
● Costs
NoSQL Databases
● Schema agnostic
● Nonrelational
● Commodity hardware
● Highly distributable
● NOT a silver bullet
NoSQL & CAP theorem
● Consistency
– all nodes see the same data at the same time
● Availability
– a guarantee that every request receives a response
about whether it succeeded or failed
● Partition tolerance
– the system continues to operate despite arbitrary
partitioning due to network failures
● CP or AP
NoSQL Data Models
● Key Value
– Riak, Redis, Aerospike
● Document
– MongoDB, Elasticsearch
● Column Family
– Cassandra, HBase
● Graph
– Neo4J, ArangoDB, OrientDB
Big Data processing modes
● Batch
– Long running jobs
● Streaming processing
– One-at-a-time
– Micro batch
Apache Hadoop - MapReduce
● Programming/Processing model
● HDFS
●
Steps:
– Map
– Shuffle
– Reduce
●
Apache Hadoop ecosystem
– Pig
– Hive
– more
YARN – Resource Manager
Map Reduce – word count
Deer Bear River
Car Car River
Deer Car Bear
Deer Bear River
Deer Bear River
Deer Bear River
Deer, 1
Bear, 1
River, 1
Bear, 1
Bear, 1
Bear, 2
Bear, 2
Car, 3
Deer, 2
River, 2
Car, 1
Car, 1
River, 1
Car, 1
Car, 1
Car, 1
Car, 1
Car, 3
Deer, 1
Car, 1
Bear, 1
Deer, 1
Deer, 1
Deer, 2
River, 1
River, 1
River, 2
INPUT SPLIT MAP SHUFFLE REDUCE RESULT
Apache Spark (batch processing)
● Similar to MapReduce but faster
● Resilient Distributed Dataset
● Ecosystem
– Spark SQL
– MLlib
– GraphX
– Streaming
– Scala / R / Python / Java / SQL
Streaming processing
Real Time
Data Stream
Streaming Processing
Solution
Dashboards
Data Store
Applications
Alerts
Batch
Processing
Streaming processing Use Cases
● Prevent security fraud
● Optimize order routing
● Predictive maintenance
● Optimize personalized content
● Optimize prices or offers
● Prevent stock outs
● Optimize bandwith allocation
● Prevent application failures
Streaming processing frameworks
● Apache Storm
● Apache Spark Streaming
● Apache Flink
● Apache Samza
Apache Storm
● Creator: Nathan Marz (2011)
● Distributed real-time computation system for processing large
volumes of high-velocity data
● Most popular streaming engine in industry
● Characteristics:
– Fast
– Scalable
– Fault-tolerant
– Reliable
– Easy to operate
– Easy to develop
Storm modes
● One-at-a-time processing (pure Storm)
– Very low latency
– Very simple development model
– At-Most-Once and At-Least-Once semantics
● Micro batch processing (Storm Trident)
– Increased latency on event
– Better throughput for large rates (it depends)
– More complex development
– Exactly-Once semantics (per batch)
Storm Architecture
Nimbus
Zookeper
Supervisor
Worker
Worker
Zookeper
Zookeper
Supervisor
Worker
Worker
Supervisor
Worker
Worker
Supervisor
Worker
Worker
Master
Node
Cluster
Coordination
Node
Coordination
Processing
Worker
Storm concepts
● Topology : A graph of computation
● Tuple : Storm uses tuples as its data model
● Stream : An unbounded sequence of tuples
● Spout : A source of streams in a topology
● Bolt : All processing in topologies is done in bolts
● Stream Grouping : Defines how that stream should be
partitioned among the bolt's tasks.
Storm topology
Storm topology example
Storm topology parallelism
Storm Trident
Storm Trident partitioning
Storm Trident
Messaging Systems
● Decouple processing from data producers
● Buffer unprocessed messages
● Models: queuing & publish-subscribe
● Frameworks
– Kafka
– RabbitMQ
– ActiveMQ
Apache Kafka
● Distributed, partitioned, replicated commit log service
● Publish-Subscribe model
● Maintains feeds of messages in Topics
● Automatic Replication and Retention
● Brokers
Apache Kafka
● Offset uniquely identifies each message within the partition
● Consumers coordinate what to read
● Consumer & Consumer Group
Big Data Apps hints
● Design for scalability from day one
● Queries drive schema design
● Failure (HW or data) is a normal case
● Continuous Integration
● Metrics from day one
● Monitoring
● Appropriate people
Streaming Demo – word count
Car Car River
Deer Car Bear
<River,2>
<Deer,2>
<River,2>
<Deer,2>
Deer Bear River
Car Car River
Deer Car Bear
Deer Bear River
Car
Car
River
Deer
Bear
River
Deer
Car
Bear
<Bear,2>
<Car ,3>
<Bear,2>
<Car ,3>
<Deer,2>
<River,2>
<Bear,2>
<Car,3>
Kafka
Spout
Split
Bolt
Count
Bolt
Kafka
Bolt
Kafka
Topic
Kafka
Topic
FOSSCOMM 2016
THANK YOU :-)
[ Updates / Questions / Comments ]
@qiozas

More Related Content

What's hot

Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & Storm
Ran Silberman
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOP
DataWorks Summit
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Spark Summit
 
Apache Storm and Oracle Event Processing for Real-time Analytics
Apache Storm and Oracle Event Processing for Real-time AnalyticsApache Storm and Oracle Event Processing for Real-time Analytics
Apache Storm and Oracle Event Processing for Real-time Analytics
Prabhu Thukkaram
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Amazon Web Services
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
P. Taylor Goetz
 
Real time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreamingReal time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreaming
Ashish Singh
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Alexey Kharlamov
 
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedApache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Guido Schmutz
 
Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1
Hugo Louro
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Brian O'Neill
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
DataWorks Summit
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
Monal Daxini
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
Oh Chan Kwon
 
Streaming SQL
Streaming SQLStreaming SQL
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
confluent
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
DataStax Academy
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Ryan Bosshart
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
Peerapat Asoktummarungsri
 

What's hot (20)

Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & Storm
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOP
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
 
Apache Storm and Oracle Event Processing for Real-time Analytics
Apache Storm and Oracle Event Processing for Real-time AnalyticsApache Storm and Oracle Event Processing for Real-time Analytics
Apache Storm and Oracle Event Processing for Real-time Analytics
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Real time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreamingReal time analytics with Kafka and SparkStreaming
Real time analytics with Kafka and SparkStreaming
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
 
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedApache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms compared
 
Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1Building Streaming Applications with Apache Storm 1.1
Building Streaming Applications with Apache Storm 1.1
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
Tsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in ChinaTsinghua University: Two Exemplary Applications in China
Tsinghua University: Two Exemplary Applications in China
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
 

Viewers also liked

You've Made It - Pitching to TechCrunch & Using AngelList
You've Made It - Pitching to TechCrunch & Using AngelListYou've Made It - Pitching to TechCrunch & Using AngelList
You've Made It - Pitching to TechCrunch & Using AngelList
Randy Ksar
 
Youth programs provide a future for the sport
Youth programs provide a future for the sportYouth programs provide a future for the sport
Youth programs provide a future for the sport
United States Bowling Congress
 
I Barómetro de Redes Sociales y Destinos Turísticos de la Comunitat Valencian...
I Barómetro de Redes Sociales y Destinos Turísticos de la Comunitat Valencian...I Barómetro de Redes Sociales y Destinos Turísticos de la Comunitat Valencian...
I Barómetro de Redes Sociales y Destinos Turísticos de la Comunitat Valencian...
Invattur
 
工業用産業用メモリーUDINFO
工業用産業用メモリーUDINFO工業用産業用メモリーUDINFO
工業用産業用メモリーUDINFO
工業用産業用メモリーUDINFOJP
 
Real-Time Analytics with Apache Storm
Real-Time Analytics with Apache StormReal-Time Analytics with Apache Storm
Real-Time Analytics with Apache Storm
Taewoo Kim
 
StormWars - when the data stream shrinks
StormWars - when the data stream shrinksStormWars - when the data stream shrinks
StormWars - when the data stream shrinks
vishnu rao
 
Twitter Stream Processing
Twitter Stream ProcessingTwitter Stream Processing
Twitter Stream Processing
Colin Surprenant
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
Eugene Dvorkin
 
Lessons From Copyright in Action for Copyright Reform
Lessons From Copyright in Action for Copyright ReformLessons From Copyright in Action for Copyright Reform
Lessons From Copyright in Action for Copyright Reform
Dobusch Leonhard
 
Embedding ORCID across researcher career paths
Embedding ORCID across researcher career pathsEmbedding ORCID across researcher career paths
Embedding ORCID across researcher career paths
ORCID, Inc
 
Apache Storm
Apache StormApache Storm
Apache Storm
Edureka!
 
How The Open Data Community Died - A Warning From The Future
How The Open Data Community Died - A Warning From The FutureHow The Open Data Community Died - A Warning From The Future
How The Open Data Community Died - A Warning From The Future
Chris Taggart
 
Stateless Hypervisors at Scale
Stateless Hypervisors at ScaleStateless Hypervisors at Scale
Stateless Hypervisors at Scale
Antony Messerl
 
Zelforganisatie en high performance teams presentatie Hiswa 2015
Zelforganisatie en high performance teams presentatie Hiswa 2015Zelforganisatie en high performance teams presentatie Hiswa 2015
Zelforganisatie en high performance teams presentatie Hiswa 2015
Frank Willems
 
That Conference Date and Time
That Conference Date and TimeThat Conference Date and Time
That Conference Date and Time
Maggie Pint
 
Date and Time MomentJS Edition
Date and Time MomentJS EditionDate and Time MomentJS Edition
Date and Time MomentJS Edition
Maggie Pint
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache Storm
Andrea Iacono
 
Intro to Startups and VC
Intro to Startups and VCIntro to Startups and VC
Intro to Startups and VC
Guy Turner
 
ストリーミングのげんざい
ストリーミングのげんざいストリーミングのげんざい
ストリーミングのげんざい
Tetsuya Morimoto
 
DevOps with Visual studio Release Management (Pieter Gheysens)
DevOps with Visual studio Release Management (Pieter Gheysens)DevOps with Visual studio Release Management (Pieter Gheysens)
DevOps with Visual studio Release Management (Pieter Gheysens)
Visug
 

Viewers also liked (20)

You've Made It - Pitching to TechCrunch & Using AngelList
You've Made It - Pitching to TechCrunch & Using AngelListYou've Made It - Pitching to TechCrunch & Using AngelList
You've Made It - Pitching to TechCrunch & Using AngelList
 
Youth programs provide a future for the sport
Youth programs provide a future for the sportYouth programs provide a future for the sport
Youth programs provide a future for the sport
 
I Barómetro de Redes Sociales y Destinos Turísticos de la Comunitat Valencian...
I Barómetro de Redes Sociales y Destinos Turísticos de la Comunitat Valencian...I Barómetro de Redes Sociales y Destinos Turísticos de la Comunitat Valencian...
I Barómetro de Redes Sociales y Destinos Turísticos de la Comunitat Valencian...
 
工業用産業用メモリーUDINFO
工業用産業用メモリーUDINFO工業用産業用メモリーUDINFO
工業用産業用メモリーUDINFO
 
Real-Time Analytics with Apache Storm
Real-Time Analytics with Apache StormReal-Time Analytics with Apache Storm
Real-Time Analytics with Apache Storm
 
StormWars - when the data stream shrinks
StormWars - when the data stream shrinksStormWars - when the data stream shrinks
StormWars - when the data stream shrinks
 
Twitter Stream Processing
Twitter Stream ProcessingTwitter Stream Processing
Twitter Stream Processing
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Lessons From Copyright in Action for Copyright Reform
Lessons From Copyright in Action for Copyright ReformLessons From Copyright in Action for Copyright Reform
Lessons From Copyright in Action for Copyright Reform
 
Embedding ORCID across researcher career paths
Embedding ORCID across researcher career pathsEmbedding ORCID across researcher career paths
Embedding ORCID across researcher career paths
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
How The Open Data Community Died - A Warning From The Future
How The Open Data Community Died - A Warning From The FutureHow The Open Data Community Died - A Warning From The Future
How The Open Data Community Died - A Warning From The Future
 
Stateless Hypervisors at Scale
Stateless Hypervisors at ScaleStateless Hypervisors at Scale
Stateless Hypervisors at Scale
 
Zelforganisatie en high performance teams presentatie Hiswa 2015
Zelforganisatie en high performance teams presentatie Hiswa 2015Zelforganisatie en high performance teams presentatie Hiswa 2015
Zelforganisatie en high performance teams presentatie Hiswa 2015
 
That Conference Date and Time
That Conference Date and TimeThat Conference Date and Time
That Conference Date and Time
 
Date and Time MomentJS Edition
Date and Time MomentJS EditionDate and Time MomentJS Edition
Date and Time MomentJS Edition
 
Real time and reliable processing with Apache Storm
Real time and reliable processing with Apache StormReal time and reliable processing with Apache Storm
Real time and reliable processing with Apache Storm
 
Intro to Startups and VC
Intro to Startups and VCIntro to Startups and VC
Intro to Startups and VC
 
ストリーミングのげんざい
ストリーミングのげんざいストリーミングのげんざい
ストリーミングのげんざい
 
DevOps with Visual studio Release Management (Pieter Gheysens)
DevOps with Visual studio Release Management (Pieter Gheysens)DevOps with Visual studio Release Management (Pieter Gheysens)
DevOps with Visual studio Release Management (Pieter Gheysens)
 

Similar to Big Data Streaming processing using Apache Storm - FOSSCOMM 2016

4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...
4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...
4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...
Athens Big Data
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
André Dias
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
George Long
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
datamantra
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
SnappyData
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
Data Con LA
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
Omid Vahdaty
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
Linaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
Anton Nazaruk
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
Demi Ben-Ari
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
Federico Palladoro
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
DataWorks Summit
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
Girish Khanzode
 

Similar to Big Data Streaming processing using Apache Storm - FOSSCOMM 2016 (20)

4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...
4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...
4th Athens Big Data Meetup - 1st Talk - Big Data Streaming Processing Using A...
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Introduction to AWS Big Data
Introduction to AWS Big Data Introduction to AWS Big Data
Introduction to AWS Big Data
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 

Recently uploaded

SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Julian Hyde
 
What is Master Data Management by PiLog Group
What is Master Data Management by PiLog GroupWhat is Master Data Management by PiLog Group
What is Master Data Management by PiLog Group
aymanquadri279
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
Philip Schwarz
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
brainerhub1
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
TheSMSPoint
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 

Recently uploaded (20)

SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)Measures in SQL (SIGMOD 2024, Santiago, Chile)
Measures in SQL (SIGMOD 2024, Santiago, Chile)
 
What is Master Data Management by PiLog Group
What is Master Data Management by PiLog GroupWhat is Master Data Management by PiLog Group
What is Master Data Management by PiLog Group
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
Hand Rolled Applicative User Validation Code Kata
Hand Rolled Applicative User ValidationCode KataHand Rolled Applicative User ValidationCode Kata
Hand Rolled Applicative User Validation Code Kata
 
Unveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdfUnveiling the Advantages of Agile Software Development.pdf
Unveiling the Advantages of Agile Software Development.pdf
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 
Transform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR SolutionsTransform Your Communication with Cloud-Based IVR Solutions
Transform Your Communication with Cloud-Based IVR Solutions
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
 

Big Data Streaming processing using Apache Storm - FOSSCOMM 2016

  • 1. FOSSCOM 2016 Big Data Streaming processing using Apache Storm
  • 2. Agenda ● Big Data concepts ● Batch & Streaming processing ● NoSQL persistence ● Apache Storm and Apache Kafka ● Streaming application demo ● Considerations for Big Data applications
  • 3. Who we are? ● Adrianos Dadis (@qiozas ) ● Patroclos Christou ● Eleftheria Chavelia ● Sofia Nomikou
  • 4. Big Data characteristics (3V+) Terabytes Records Transactions Files Batch Real-time Streams Near-real-time Structured Unstructured Semi-structured All the above Volume Velocity Variety
  • 5. Use cases ● 360-degree customer view ● Internet of Things ● Data warehouse optimization ● Information security ● Sentiment analysis ● Urban Planning (MIT)
  • 6. Big Data Pros ● Better user experience ● Predictive analytics ● Answer complex business questions
  • 7. Big Data Cons ● Privacy ● False truths ● Strict personalized content
  • 8. Databases ● Persistency & Concurrency ● Relational Databases provide strong benefits ● Relational Databases pitfalls: – Application development productivity – Large-scale data ● Capture more data ● Process data faster ● Costs
  • 9. NoSQL Databases ● Schema agnostic ● Nonrelational ● Commodity hardware ● Highly distributable ● NOT a silver bullet
  • 10. NoSQL & CAP theorem ● Consistency – all nodes see the same data at the same time ● Availability – a guarantee that every request receives a response about whether it succeeded or failed ● Partition tolerance – the system continues to operate despite arbitrary partitioning due to network failures ● CP or AP
  • 11. NoSQL Data Models ● Key Value – Riak, Redis, Aerospike ● Document – MongoDB, Elasticsearch ● Column Family – Cassandra, HBase ● Graph – Neo4J, ArangoDB, OrientDB
  • 12. Big Data processing modes ● Batch – Long running jobs ● Streaming processing – One-at-a-time – Micro batch
  • 13. Apache Hadoop - MapReduce ● Programming/Processing model ● HDFS ● Steps: – Map – Shuffle – Reduce ● Apache Hadoop ecosystem – Pig – Hive – more
  • 14. YARN – Resource Manager
  • 15. Map Reduce – word count Deer Bear River Car Car River Deer Car Bear Deer Bear River Deer Bear River Deer Bear River Deer, 1 Bear, 1 River, 1 Bear, 1 Bear, 1 Bear, 2 Bear, 2 Car, 3 Deer, 2 River, 2 Car, 1 Car, 1 River, 1 Car, 1 Car, 1 Car, 1 Car, 1 Car, 3 Deer, 1 Car, 1 Bear, 1 Deer, 1 Deer, 1 Deer, 2 River, 1 River, 1 River, 2 INPUT SPLIT MAP SHUFFLE REDUCE RESULT
  • 16. Apache Spark (batch processing) ● Similar to MapReduce but faster ● Resilient Distributed Dataset ● Ecosystem – Spark SQL – MLlib – GraphX – Streaming – Scala / R / Python / Java / SQL
  • 17. Streaming processing Real Time Data Stream Streaming Processing Solution Dashboards Data Store Applications Alerts Batch Processing
  • 18. Streaming processing Use Cases ● Prevent security fraud ● Optimize order routing ● Predictive maintenance ● Optimize personalized content ● Optimize prices or offers ● Prevent stock outs ● Optimize bandwith allocation ● Prevent application failures
  • 19. Streaming processing frameworks ● Apache Storm ● Apache Spark Streaming ● Apache Flink ● Apache Samza
  • 20. Apache Storm ● Creator: Nathan Marz (2011) ● Distributed real-time computation system for processing large volumes of high-velocity data ● Most popular streaming engine in industry ● Characteristics: – Fast – Scalable – Fault-tolerant – Reliable – Easy to operate – Easy to develop
  • 21. Storm modes ● One-at-a-time processing (pure Storm) – Very low latency – Very simple development model – At-Most-Once and At-Least-Once semantics ● Micro batch processing (Storm Trident) – Increased latency on event – Better throughput for large rates (it depends) – More complex development – Exactly-Once semantics (per batch)
  • 23. Storm concepts ● Topology : A graph of computation ● Tuple : Storm uses tuples as its data model ● Stream : An unbounded sequence of tuples ● Spout : A source of streams in a topology ● Bolt : All processing in topologies is done in bolts ● Stream Grouping : Defines how that stream should be partitioned among the bolt's tasks.
  • 30. Messaging Systems ● Decouple processing from data producers ● Buffer unprocessed messages ● Models: queuing & publish-subscribe ● Frameworks – Kafka – RabbitMQ – ActiveMQ
  • 31. Apache Kafka ● Distributed, partitioned, replicated commit log service ● Publish-Subscribe model ● Maintains feeds of messages in Topics ● Automatic Replication and Retention ● Brokers
  • 32. Apache Kafka ● Offset uniquely identifies each message within the partition ● Consumers coordinate what to read ● Consumer & Consumer Group
  • 33. Big Data Apps hints ● Design for scalability from day one ● Queries drive schema design ● Failure (HW or data) is a normal case ● Continuous Integration ● Metrics from day one ● Monitoring ● Appropriate people
  • 34. Streaming Demo – word count Car Car River Deer Car Bear <River,2> <Deer,2> <River,2> <Deer,2> Deer Bear River Car Car River Deer Car Bear Deer Bear River Car Car River Deer Bear River Deer Car Bear <Bear,2> <Car ,3> <Bear,2> <Car ,3> <Deer,2> <River,2> <Bear,2> <Car,3> Kafka Spout Split Bolt Count Bolt Kafka Bolt Kafka Topic Kafka Topic
  • 35. FOSSCOMM 2016 THANK YOU :-) [ Updates / Questions / Comments ] @qiozas