Stream Processing
with Apache Storm
Spring 2014
Version 1.0
Kit Menke, Lead Software Engineer, EHI
Scott Shaw, Solutions Engineer, Hortonworks
© Hortonworks Inc. 2013
Stream Processing in Hadoop
Driven by new types of data
– Sensor/Machine
– Server logs
– Clickstream
Storm with Hadoop enables new business opportunities
– Low-latency dashboards
– Quality, Security, Safety, Operations Alerts
– Improved operations
– Real-time data integration
[Diagram: HADOOP 2.1 as a Multi Use Data Platform (Batch, Interactive, Online, Streaming, …). HDFS2 (redundant, reliable storage) sits under YARN (cluster resource management), which runs MapReduce (batch), Tez (interactive) and Apache Storm (streaming).]
Stream processing has emerged as a key use case
Typical stream processing workflow
[Diagram components: Real-time data feeds; Stream processing solution; Persist data; Relational or non-relational data store; Batch feeds; Batch processing; Update event models (pattern templates, KPIs & alerts); Dashboards & Applications]
Stream processing very different from batch
Factors | Real-time | Batch
Data: Freshness | Real-time (usually < 15 min) | Historical – usually more than 15 min old
Data: Location | Primarily memory (moved to disk after processing) | Primarily on disk, moved to memory for processing
Processing: Speed | Sub-second to a few seconds | Few seconds to hours
Processing: Frequency | Always running | Sporadic to periodic
Clients: Who? | Automated systems only | Human & automated systems
Clients: Type | Primarily operational systems | Primarily analytical applications
Key requirements of a streaming solution
Data Ingest
• Extremely high ingest rates – millions of events/second
Processing
• Ability to easily plug in different processing frameworks
• Guaranteed processing – at-least-once processing semantics
Persistence
• Ability to persist data to multiple relational and non-relational data stores
Operations
• Security, HA, fault tolerance & management support
Apache Storm Leading for Stream Processing
Open source real-time event stream processing platform that provides fixed, continuous & low latency processing for very high frequency streaming data
Highly scalable
• Horizontally scalable like Hadoop
• E.g.: a 10-node cluster can process 1M tuples per second per node
Fault-tolerant
• Automatically reassigns tasks on failed nodes
Guarantees processing
• Supports at-least-once & exactly-once processing semantics
Language agnostic
• Processing logic can be defined in any language
Apache project
• Brand, governance & a large active community
Pattern driving MOST streaming use cases
Monitor real-time data to.. | Prevent | Optimize
Finance | Securities fraud; Compliance violations | Order routing; Pricing
Telco | Security breaches; Network outages | Bandwidth allocation; Customer service
Retail | | Offers; Pricing
Manufacturing | Machine failures | Supply chain
Transportation | Driver & fleet issues | Routes; Pricing
Web | Application failures; Operational issues | Site content
Data sources: Sentiment, Clickstream, Machine/Sensor, Logs, Geo-location
Storm use cases – IT operations view
Continuous processing
• Continuously ingest high-rate messages, process them and update data stores
High speed data aggregation
• Aggregate multiple data streams that emit data at extremely high rates into one central data store
Data filtering
• Filter out unwanted data on the fly before it is persisted to a data store
Distributed RPC response time reduction
• Extremely resource-intensive (CPU, memory or I/O) processing that would take a long time on a single machine can be parallelized with Storm to reduce response times to seconds
Key Constructs in Apache Storm
• Tuples, Streams, Spouts, Bolts
• Topology
• Field Grouping
• Components and Topology Submission
• Parallelism
• Processing Guarantee
Tuples and Streams
• What is a Tuple?
–Fundamental data structure in Storm: a named list of values that can be of any data type.
• What is a Stream?
–An unbounded sequence of tuples.
–The core abstraction in Storm; streams are what you “process” in Storm.
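A minimal Python sketch of these two ideas, not Storm's API: a tuple is modeled as field names paired with values, and a stream as a generator that never ends. The names make_tuple and truck_events and the field names are hypothetical.

```python
from itertools import count, islice


def make_tuple(fields, values):
    """Model a tuple: a named list of values of any type."""
    if len(fields) != len(values):
        raise ValueError("each field needs exactly one value")
    return dict(zip(fields, values))


def truck_events():
    """Model a stream: an unbounded sequence of tuples."""
    for n in count(1):
        yield make_tuple(["truck_id", "speed_mph"], [n % 3, 50 + n])


# A stream never ends, so a consumer only ever takes the next tuples.
first_three = list(islice(truck_events(), 3))
```

Because the generator is unbounded, the consumer decides how much to read; here islice takes three tuples and stops.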
Spouts
• What is a Spout?
–Generates streams; it is the source of the streams in a topology
–E.g.: JMS, Twitter, Log, Kafka Spout
–Can spin up multiple instances of a Spout and dynamically adjust as needed
Bolts
• What is a Bolt?
–Processes any number of input streams and produces output streams
–Common processing in bolts: functions, aggregations, joins, reads/writes to data stores, alerting logic
–Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Examples of Bolts:
1. HBaseBolt: persists the stream in HBase
2. HDFSBolt: persists the stream into HDFS as Avro files using Flume
3. MonitoringBolt: reads from HBase and creates alerts via email and messaging queues if given thresholds are exceeded
Topology
• What is a Topology?
–A network of spouts and bolts wired together into a workflow
[Diagram: Truck-Event-Processor Topology, in which a Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt and WebSocket Bolt are connected by streams]
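To make "spouts and bolts wired together" concrete, here is a small in-process Python model. It is a sketch and not Storm's topology builder: the Topology class, its method names and the bolt names are all hypothetical, and everything runs in one thread.

```python
from collections import defaultdict


class Topology:
    """Toy model of a topology: spouts and bolts wired into a graph."""

    def __init__(self):
        self.spouts = {}                      # name -> iterable of tuples
        self.bolts = {}                       # name -> function(tuple) -> list of tuples
        self.subscribers = defaultdict(list)  # source name -> downstream bolt names

    def set_spout(self, name, source):
        self.spouts[name] = source

    def set_bolt(self, name, func, inputs):
        self.bolts[name] = func
        for source in inputs:
            self.subscribers[source].append(name)

    def run(self):
        """Push every spout tuple through all downstream bolts."""
        seen = defaultdict(list)
        pending = [(name, t) for name, src in self.spouts.items() for t in src]
        while pending:
            source, tup = pending.pop(0)
            for bolt in self.subscribers[source]:
                seen[bolt].append(tup)
                pending.extend((bolt, out) for out in self.bolts[bolt](tup))
        return dict(seen)


topology = Topology()
topology.set_spout("kafka-spout", [{"truck": 7, "speed": 82}, {"truck": 9, "speed": 55}])
topology.set_bolt("hdfs-bolt", lambda t: [], inputs=["kafka-spout"])
topology.set_bolt("speeding-filter", lambda t: [t] if t["speed"] > 65 else [], inputs=["kafka-spout"])
topology.set_bolt("alert-bolt", lambda t: [], inputs=["speeding-filter"])
received = topology.run()
```

Both bolts subscribed to the spout see every tuple, while the alert bolt only sees what the filter emits.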
Storm Components and Topology Submission
[Diagram: the storm-event-processor topology is submitted to Nimbus (YARN App Master node). Three ZooKeeper nodes sit between Nimbus and five Supervisors (slave nodes), which run the Kafka Spout, HDFS Bolt, HBase Bolt and Monitor Bolt instances.]
Nimbus (Management server)
• Similar to job tracker
• Distributes code around cluster
• Assigns tasks
• Handles failures
Supervisor (Worker nodes)
• Similar to task tracker
• Run bolts and spouts as ‘tasks’
Zookeeper:
• Cluster co-ordination
• Nimbus HA
• Stores cluster metrics
Processing Guarantees in Storm
Processing guarantee | How is it achieved? | Applicable use cases
At least once | Replay tuples on failure | Processing does not need to be ordered; need extremely low latency processing
Exactly once | Transactional topologies (now implemented using Trident) | Need ordered processing; global counts; context-aware processing; causality based; latency not important
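The difference between the two guarantees shows up as soon as a batch is replayed. The Python sketch below is a simplified model, not Trident code: it assumes batches arrive in transaction-id order and that a replayed batch carries the same id and contents as the first attempt. The function names and sample data are hypothetical.

```python
def apply_at_least_once(state, batches):
    """Replayed batches are applied again, so counts can be inflated."""
    for _txid, events in batches:
        state["count"] = state.get("count", 0) + len(events)
    return state


def apply_exactly_once(state, batches):
    """Store the last transaction id with the count and skip replays."""
    for txid, events in batches:
        if state.get("txid") is not None and txid <= state["txid"]:
            continue  # this batch was already applied; ignore the replay
        state["count"] = state.get("count", 0) + len(events)
        state["txid"] = txid
    return state


# Batch 2 is delivered twice, as happens when a failure triggers a replay.
delivered = [(1, ["a", "b"]), (2, ["c"]), (2, ["c"]), (3, ["d", "e"])]
at_least_once = apply_at_least_once({}, delivered)
exactly_once = apply_exactly_once({}, delivered)
```

With five distinct events, the at-least-once count comes out as 6 and the transaction-id version as 5, which is why global counts appear in the exactly-once row.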
Implementing Storm
Kit Menke, Lead Software Engineer at Enterprise Holdings, Inc.
May 2014
Real World Scenarios
Overview
• Storm Terminology
• Creating a Topology
• Persisting data from Storm
• Topology Performance
• Custom Metrics
• Workers, Executors, and Tasks
• Caching within a Bolt
• Environment Setup
Storm Terminology
• Topologies run on your Hadoop cluster
– Uber-jar with spouts and bolts
– Runs forever
• Spouts generate streams of tuples
• Tuples are lists of values
• Bolts process tuples (and emit tuples)
[Diagram: a Topology in which a Spout sends tuples to Bolt A, Bolt B and Bolt 1]
Storm Topology Example
[Code listing not captured: Counting Topology, annotated "spout", "unreliable output" and "Guaranteed message processing"]
Storm Spout Example
Storm Bolt Example
Failing a Tuple
1. Spout emits tuple
2. Bolt fails tuple
3. Spout receives failed message ID
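The three steps above can be sketched as a toy replay loop in Python. This is a model under stated assumptions, not Storm code: the method names next_tuple, ack and fail only loosely mirror a spout's callbacks, and ReplayingSpout, run and flaky_bolt are hypothetical.

```python
from collections import deque


class ReplayingSpout:
    """Toy spout that remembers emitted messages until they are acked."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (message id, payload)
        self.pending = {}                        # message id -> payload

    def next_tuple(self):
        # Step 1: emit a tuple tagged with a message id.
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        self.pending.pop(msg_id)  # fully processed, so forget it

    def fail(self, msg_id):
        # Step 3: the failed message id comes back and is queued for replay.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(spout, bolt):
    """Drive the spout until every message has been acked."""
    processed = []
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            return processed
        msg_id, payload = emitted
        if bolt(payload):
            spout.ack(msg_id)
            processed.append(payload)
        else:
            spout.fail(msg_id)  # Step 2: the bolt fails the tuple


attempts = {}


def flaky_bolt(payload):
    """Fail payload "b" on its first attempt only."""
    attempts[payload] = attempts.get(payload, 0) + 1
    return not (payload == "b" and attempts[payload] == 1)


spout = ReplayingSpout(["a", "b", "c"])
processed = run(spout, flaky_bolt)
```

The replayed message is processed after "c", which illustrates why at-least-once processing gives no ordering guarantee.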
Persisting Data
• Write to HDFS using storm-hdfs for long term storage
Files in Hue written by storm-hdfs
Persisting Data
• Index data in ElasticSearch or Solr for real-time dashboards
ElasticSearch + Kibana
Persisting Data
• Insert messages into a Database
• Message Queue
• HBase reads/writes to influence topology in real-time
Topology Performance
• Storm UI shows capacity
– Break out your bolts to find bottlenecks!
Custom Metrics
• New in Storm 0.9.0
• Out-of-the-box metrics, e.g. CountMetric
• Custom metric by implementing IMetric
• Register the metric on spout/bolt startup
• Set topology to consume metrics stream
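The register-then-report pattern behind these bullets can be modeled in a few lines of Python. This is a sketch, not Storm's metrics API: the classes only imitate the idea of a counter that is read and reset once per reporting interval, and FilterBolt, report and the 65 mph threshold are hypothetical.

```python
class CountMetric:
    """Toy counter that is read and reset once per reporting interval."""

    def __init__(self):
        self._value = 0

    def incr(self, by=1):
        self._value += by

    def get_value_and_reset(self):
        value, self._value = self._value, 0
        return value


class FilterBolt:
    def __init__(self):
        # Register the metrics once, when the bolt starts up.
        self.metrics = {"dropped": CountMetric(), "passed": CountMetric()}

    def execute(self, speed):
        name = "passed" if speed > 65 else "dropped"
        self.metrics[name].incr()


def report(bolt):
    """What a metrics consumer would receive for one interval."""
    return {name: m.get_value_and_reset() for name, m in bolt.metrics.items()}


bolt = FilterBolt()
for speed in [50, 70, 80, 40, 66]:
    bolt.execute(speed)
first_interval = report(bolt)
second_interval = report(bolt)
```

Resetting on read means each report covers only its own interval, so the second report here is all zeros.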
Topology Performance
• Filter bolt is our bottleneck!
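As a rough guide to reading the capacity column, the Storm UI describes capacity as (number executed × average execute latency) / measurement time, so a value near 1.0 means the bolt is busy for almost the whole window. The Python sketch below applies that formula to made-up numbers; the bolt names, counts, latencies and the 10-minute window are assumptions for illustration only.

```python
def capacity(executed, execute_latency_ms, window_ms):
    """Fraction of the measurement window a bolt spent executing tuples."""
    return executed * execute_latency_ms / window_ms


WINDOW_MS = 10 * 60 * 1000  # assumed 10-minute measurement window

# Hypothetical per-bolt stats: (tuples executed, average execute latency in ms)
stats = {
    "parse-bolt": (1_200_000, 0.05),
    "filter-bolt": (1_200_000, 0.45),
    "hdfs-bolt": (300_000, 0.40),
}

capacities = {name: capacity(n, lat, WINDOW_MS) for name, (n, lat) in stats.items()}
bottleneck = max(capacities, key=capacities.get)
```

With these numbers the filter bolt sits at about 0.9, so it is the first candidate for more parallelism or for being split into smaller bolts.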
Workers, Executors, and Tasks
• Workers
– Separate JVM
– Workers run Executors
• Executors
– Separate threads
– Executors run Tasks
• Tasks
– Your spout or bolt code
• Running more than one task per executor does not increase the level of parallelism!!!
Workers <= Executors <= Tasks
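A small Python sketch of how the three levels nest. The round-robin placement is an assumption made for illustration; Storm's scheduler decides the real placement, and the assign function is hypothetical.

```python
def assign(num_workers, num_executors, num_tasks):
    """Spread tasks over executors and executors over workers, round-robin."""
    if not (1 <= num_workers <= num_executors <= num_tasks):
        raise ValueError("expected workers <= executors <= tasks")
    executors = [[] for _ in range(num_executors)]
    for task in range(num_tasks):
        executors[task % num_executors].append(task)
    workers = [[] for _ in range(num_workers)]
    for index, tasks in enumerate(executors):
        workers[index % num_workers].append(tasks)
    return workers


# 2 JVMs, 4 threads, 8 task instances
layout = assign(num_workers=2, num_executors=4, num_tasks=8)
threads = sum(len(worker) for worker in layout)
```

Each executor here holds two tasks, but only four threads exist, so at most four tasks run at the same moment. Extra tasks mainly leave room to raise the executor count later.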
Caching inside a Bolt
• RotatingMap with Tick Tuples
• Use fieldsGrouping to ensure cache hits
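Both bullets can be illustrated with a plain-Python model. It is a sketch rather than Storm's RotatingMap or grouping code: RotatingCache imitates a bucketed map whose oldest bucket is dropped on each tick, and task_for uses crc32 as a stand-in hash to show why grouping on a field sends equal keys to the same task. All names and keys are hypothetical.

```python
from collections import deque
from zlib import crc32


class RotatingCache:
    """Toy time-bucketed cache: rotate() drops the oldest bucket."""

    def __init__(self, num_buckets=3):
        self.buckets = deque({} for _ in range(num_buckets))

    def put(self, key, value):
        for bucket in self.buckets:
            bucket.pop(key, None)  # keep each key in one bucket only
        self.buckets[0][key] = value

    def get(self, key):
        for bucket in self.buckets:
            if key in bucket:
                return bucket[key]
        return None

    def rotate(self):
        """Call on every tick tuple; returns the entries that expired."""
        expired = self.buckets.pop()
        self.buckets.appendleft({})
        return expired


def task_for(key, num_tasks):
    """Route equal keys to the same task, as a fields grouping would."""
    return crc32(str(key).encode()) % num_tasks


cache = RotatingCache(num_buckets=2)
cache.put("truck-7", "driver A")
cache.rotate()                    # first tick: the entry ages but survives
hit = cache.get("truck-7")
expired = cache.rotate()          # second tick: the entry is evicted
miss = cache.get("truck-7")
```

Because task_for is deterministic, every tuple for "truck-7" reaches the same bolt task, which is the task whose cache already holds that key.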
Environment Setup
• Storm-starter project on GitHub
• Git, Eclipse, Maven
• Unit test!
• Develop locally or on a single-node Hadoop machine
• Read the source code
Questions?

Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 

Recently uploaded

dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
Pravash Chandra Das
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
Intelisync
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 

Recently uploaded (20)

dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024A Comprehensive Guide to DeFi Development Services in 2024
A Comprehensive Guide to DeFi Development Services in 2024
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 

Storm – Streaming Data Analytics at Scale - StampedeCon 2014

  • 1. Stream Processing with Apache Storm Spring 2014 Version 1.0 Kit Menke, Lead Software Engineer, EHI Scott Shaw, Solutions Engineer, Hortonworks
  • 2. © Hortonworks Inc. 2013 Stream Processing in Hadoop Driven by new types of data – Sensor/Machine – Server logs – Clickstream Storm with Hadoop enables new business opportunities – Low-latency dashboards – Quality, Security, Safety, Operations Alerts – Improved operations – Real-time data integration HDFS2 (redundant, reliable storage) YARN (cluster resource management) MapReduce (batch) Apache STORM (streaming) HADOOP 2.1 Tez (interactive) Multi Use Data Platform Batch, Interactive, Online, Streaming, … Stream processing has emerged as a key use case 2
  • 3. © Hortonworks Inc. 2013 Typical stream processing workflow Real-time data feeds Stream processing solution Persist data Relational or non relational data store Batch processing Batch FeedsUpdate event models (Pattern templates, KPIs & alerts) Dashboards & Applications 3
  • 4. © Hortonworks Inc. 2013 Stream processing very different from batch (real-time vs. batch, by factor). Data freshness: real-time (usually < 15 min) vs. historical (usually more than 15 min old). Data location: primarily in memory (moved to disk after processing) vs. primarily on disk (moved to memory for processing). Processing speed: sub-second to a few seconds vs. a few seconds to hours. Processing frequency: always running vs. sporadic to periodic. Clients (who): automated systems only vs. human and automated systems. Clients (type): primarily operational systems vs. primarily analytical applications.
  • 5. © Hortonworks Inc. 2013 Key requirements of a streaming solution • Data ingest: extremely high ingest rates (millions of events/second) • Processing: ability to easily plug in different processing frameworks; guaranteed processing with at-least-once semantics • Persistence: ability to persist data to multiple relational and non-relational data stores • Operations: security, HA, fault tolerance & management support
  • 6. © Hortonworks Inc. 2013 Apache Storm Leading for Stream Processing Open source real-time event stream processing platform that provides fixed, continuous & low latency processing for very high frequency streaming data • Highly scalable: horizontally scalable like Hadoop; e.g., a 10 node cluster can process 1M tuples per second per node • Fault-tolerant: automatically reassigns tasks on failed nodes • Guarantees processing: supports at-least-once & exactly-once processing semantics • Language agnostic: processing logic can be defined in any language • Apache project: brand, governance & a large active community
  • 7. © Hortonworks Inc. 2013 Pattern driving MOST streaming use cases Monitor real-time data to.. Prevent Optimize Finance - Securities Fraud - Compliance violations - Order routing - Pricing Telco - Security breaches - Network Outages - Bandwidth allocation - Customer service Retail - Offers - Pricing Manufacturing - Machine failures - Supply chain Transportation - Driver & fleet issues - Routes - Pricing Web - Application failures - Operational issues - Site content Sentiment Clickstream Machine/Sensor Logs Geo-location
  • 8. © Hortonworks Inc. 2013 Storm use cases – IT operations view • Continuous processing: continuously ingest high-rate messages, process them and update data stores • High speed data aggregation: aggregate multiple data streams that emit data at extremely high rates into one central data store • Data filtering: filter out unwanted data on the fly before it is persisted to a data store • Distributed RPC response time reduction: extremely resource-intensive (CPU, memory or I/O) processing that would take a long time on a single machine can be parallelized with Storm to reduce response times to seconds
  • 9. © Hortonworks Inc. 2013 Key Constructs in Apache Storm • Tuples, Streams, Spouts, Bolts • Topology • Field Grouping • Components and Topology Submission • Parallelism • Processing Guarantee
  • 10. © Hortonworks Inc. 2013 Tuples and Streams • What is a Tuple? –Fundamental data structure in Storm: a named list of values that can be of any data type. • What is a Stream? –An unbounded sequence of tuples. –Streams are the core abstraction in Storm and are what you “process” in Storm
  • 11. © Hortonworks Inc. 2013 Spouts • What is a Spout? –Generates streams or acts as a source of streams – E.g.: JMS, Twitter, Log, Kafka Spout –Can spin up multiple instances of a Spout and dynamically adjust as needed
  • 12. © Hortonworks Inc. 2013 Bolts • What is a Bolt? –Processes any number of input streams and produces output streams –Common processing in bolts: functions, aggregations, joins, reads/writes to data stores, alerting logic –Can spin up multiple instances of a Bolt and dynamically adjust as needed • Examples of Bolts: 1. HBaseBolt: persist stream in HBase 2. HDFSBolt: persist stream into HDFS as Avro files using Flume 3. MonitoringBolt: read from HBase and create alerts via email and messaging queues if given thresholds are exceeded.
  • 13. © Hortonworks Inc. 2013 Topology • What is a Topology? –A network of spouts and bolts wired together into a workflow Truck-Event-Processor Topology Kafka Spout HBase Bolt Monitoring Bolt HDFS Bolt WebSocket Bolt Stream Stream Stream Stream 13
  • 14. © Hortonworks Inc. 2013 Storm Components and Topology Submission [Diagram: the storm-event-processor topology is submitted to Nimbus (YARN App Master Node), coordinated through three ZooKeeper nodes, and runs on five Supervisors (slave nodes) hosting 5 Kafka Spouts, 3 HDFS Bolts, 2 HBase Bolts and 2 Monitor Bolts] Nimbus (management server): • Similar to job tracker • Distributes code around cluster • Assigns tasks • Handles failures Supervisor (worker nodes): • Similar to task tracker • Runs bolts and spouts as ‘tasks’ ZooKeeper: • Cluster co-ordination • Nimbus HA • Stores cluster metrics
  • 15. © Hortonworks Inc. 2013 Processing Guarantees in Storm (processing guarantee; how it is achieved; applicable use cases) • At least once; achieved by replaying tuples on failure; use when processing does not need to be ordered and extremely low latency processing is needed • Exactly once; achieved with transactional topologies (now implemented using Trident); use when ordered processing, global counts, context-aware processing or causality-based processing is needed and latency is not important
  • 16. Implementing Storm Kit Menke, Lead Software Engineer at Enterprise Holdings, Inc. May 2014
  • 17. Implementing Storm Spring 2014 Version 1.0 Real World Scenarios
  • 18. Overview • Storm Terminology • Creating a Topology • Persisting data from Storm • Topology Performance • Custom Metrics • Workers, Executors, and Tasks • Caching within a Bolt • Environment Setup 18
  • 19. Storm Terminology • Topologies run on your Hadoop cluster – Uber-jar with spouts and bolts – Runs forever • Spouts generate streams of tuples • Tuples are lists of values • Bolts process tuples (and emit tuples) Topology Spout Bolt A Bolt B Bolt 1 Tuples 19
  • 20. Storm Topology Example: Counting Topology [Diagram: spout → unreliable → output] Guaranteed message processing
  • 23. Failing a Tuple 1. Spout emits tuple 2. Bolt fails tuple 3. Spout receives failed message ID 23
  • 24. Persisting Data • Write to HDFS using storm-hdfs for long term storage 24
  • 25. Files in Hue written by storm-hdfs 25
  • 26. Persisting Data • Write to HDFS using storm-hdfs for long term storage • Index data in ElasticSearch or Solr for real-time dashboards 26
  • 28. Persisting Data • Write to HDFS using storm-hdfs for long term storage • Index data in ElasticSearch or Solr for real-time dashboards • Insert messages into a Database • Message Queue • HBase reads/writes to influence topology in real-time 28
  • 29. Topology Performance • Storm UI shows capacity – Break out your bolts to find bottlenecks! 29
  • 30. Custom Metrics • New in Storm 0.9.0 • Out of the box metrics, ex: CountMetric • Custom metric by implementing IMetric • Register the metric on spout/bolt startup • Set topology to consume metrics stream 30
  • 31. Topology Performance • Filter bolt is our bottleneck! 31
  • 32. Workers, Executors, and Tasks • Workers – Separate JVM – Workers run Executors • Executors – Separate threads – Executors run Tasks • Tasks – Your spout or bolt code • Running more than one task per executor does not increase the level of parallelism!!! Workers <= Executors <= Tasks
  • 33. Caching inside a Bolt • RotatingMap with Tick Tuples • Use fieldsGrouping to ensure cache hits 33
  • 34. Environment Setup • Storm-starter project on GitHub • Git, Eclipse, Maven • Unit test! • Develop locally or on a single node hadoop machine • Read the source code 34
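
The "Failing a Tuple" flow on slide 23 (spout emits, bolt fails, spout receives the failed message ID) can be sketched without a cluster. The following is a plain-Java simulation, not the Storm API: the class name `ReplaySketch`, its `run` method, and the deterministic failure rule are all assumptions made for illustration. The comments only indicate which spout callback each step stands in for.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.BiPredicate;

/** Plain-Java stand-in for a reliable spout feeding an unreliable bolt (hypothetical, not Storm code). */
final class ReplaySketch {
    final List<Integer> acked = new ArrayList<>(); // values in the order they were acked
    int emitted = 0;                                // total emits, including replays

    /**
     * @param count        how many numbers the spout queues (0 .. count-1)
     * @param boltSucceeds (value, attempt) -> true if the bolt acks this attempt
     */
    void run(int count, BiPredicate<Integer, Integer> boltSucceeds) {
        Queue<Integer> queue = new ArrayDeque<>();
        Map<Integer, Integer> attempts = new HashMap<>();
        for (int i = 0; i < count; i++) {
            queue.add(i); // stands in for open(): load the in-memory queue
        }
        while (!queue.isEmpty()) {
            int messageId = queue.remove(); // stands in for nextTuple(): the value doubles as message ID
            emitted++;
            int attempt = attempts.merge(messageId, 1, Integer::sum);
            if (boltSucceeds.test(messageId, attempt)) {
                acked.add(messageId); // stands in for ack(messageId): nothing left to do
            } else {
                queue.add(messageId); // stands in for fail(messageId): re-queue for replay
            }
        }
    }

    public static void main(String[] args) {
        ReplaySketch sketch = new ReplaySketch();
        // Assumed failure rule: odd numbers fail on their first attempt only.
        sketch.run(10, (value, attempt) -> value % 2 == 0 || attempt > 1);
        System.out.println("emitted=" + sketch.emitted + " acked=" + sketch.acked);
    }
}
```

With the assumed failure rule, the five odd numbers are each emitted twice, so the run prints `emitted=15 acked=[0, 2, 4, 6, 8, 1, 3, 5, 7, 9]`. Every value is eventually processed, but replayed values arrive out of order, which is why at-least-once processing suits use cases that do not need ordering.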

Editor's Notes

  1. Real-time data integration: analyze, clean, and normalize data with low latency. Low-latency dashboards: summing/aggregations for operational monitors, gauges and counters (orders, revenue, call volumes, infrastructure load); geographic location of fleets. Alerts: Quality: detection of “never seen before” entities (customers, ads, etc.); Security: detection of trespass / fraud / illegal activities; Safety: patient monitoring, automotive telematics; Operations: detection of system / network overload. Improved operations: advertising optimization, personalization, fleet rerouting.
  2. The stream processing solution needs to consume explicit or implicit event models from the batch processing platform. These event models define the schemas of incoming event data, such as records of calls into the customer contact center, copies of customer order transactions or exogenous market data. Event models also specify: relationships (such as causation) among the event types; calculations (for example, formulas to compute KPIs); alert thresholds (for example, "if average caller wait time exceeds 45 seconds, send a yellow warning by email"); and responses (for example, "trigger an exception process if the result of a customer credit check has not been received within two hours").
  3. Storm was benchmarked at processing one million 100 byte messages per second per node on hardware with the following specs: Processor: 2x Intel E5645@2.4Ghz Memory: 24 GB
  4. Add types of data, and add prevent and optimize use cases.
  5. Getting started with Storm: reading the source code is most helpful. Create a simple hello world topology and run it locally.
  6. Topologies are the applications you write and deploy to your cluster, where they run forever working on streams of data. Each topology contains spouts and bolts. Spouts bring data into your topology by generating streams of tuples; the data comes from an external source like a queue or something on the internet (like Twitter). Tuples are lists of values (string, int, boolean, or custom objects, which require serializers). Bolts process the tuples emitted by the spouts and also emit tuples themselves.
  7. Creating a simple Storm topology which demonstrates guaranteed message processing: a counting spout connected to an unreliable bolt connected to an output bolt. There are many different options for connecting things together: shuffle grouping means tuples are randomly distributed; you can also group by a field or broadcast tuples. The unreliable bolt is used to demonstrate an error scenario.
  8. Simple example of a spout which counts from 0 to 9. open is called once for each instance of your spout; here it adds the numbers 0-9 to an in-memory queue (typically you will be reading from a real message queue). nextTuple is called repeatedly to get each tuple; here we are emitting one int: number. The second parameter is used for reprocessing in the event of a failure. declareOutputFields specifies which fields you are emitting in nextTuple.
  9. An example implementation of an Unreliable Bolt (because it should fail 50% of the time) Bolts also have a prepare and declareOutputFields method. Execute is the main method where your processing will take place. The input tuple was generated by our spout. 50% of the time, the tuple will fail.
  10. Calling _collector.fail on a tuple will cause it to go back to the spout’s fail method. In this simple example, I made number the same value as the tuple but in reality this might be a queued message ID. We ended up not really needing tuple reprocessing but I believe storm-jms has this built in if you need it.
  11. We have talked about bringing data into your topology and processing it. Most likely you will want to persist it somewhere as well for additional processing. We are using storm-hdfs to write all messages we receive straight into HDFS. We are also indexing our data in ElasticSearch in order to have a real-time dashboard for executives. You can influence the topology in “real-time” by reading from or writing to HBase. Be warned: these calls will be SLOW compared to how fast you need to process messages in Storm. If an HBase read takes 20 ms, one bolt instance making that call for every tuple can process only 1000 / 20 = 50 tuples/s.
  12. Using storm-hdfs to stream data to HDFS for more analytics and storage. Put Hive tables over the top, run trends, etc.
  13. We have talked about bringing data into your topology and processing it. Most likely you will want to persist it somewhere as well for additional processing. We are using storm-hdfs to write all messages we receive straight into HDFS. We are also indexing our data in ElasticSearch in order to have a real-time dashboard for executives. You can influence the topology in “real-time” by reading from or writing to HBase. Be warned: these calls will be SLOW compared to how fast you need to process messages in Storm. If an HBase read takes 20 ms, one bolt instance making that call for every tuple can process only 1000 / 20 = 50 tuples/s.
  14. Time-based indexes (one per day). Kibana dashboard on top of the Elasticsearch indexes. Size: 14.3G (28.7G); docs: 42,051,720 (42,051,720).
  15. We have talked about bringing data into your topology and processing it. Most likely you will want to persist it somewhere as well for additional processing. We are using storm-hdfs to write all messages we receive straight into HDFS. We are also indexing our data in ElasticSearch in order to have a real-time dashboard for executives. You can influence the topology in “real-time” by reading from or writing to HBase. Be warned: these calls will be SLOW compared to how fast you need to process messages in Storm. If an HBase read takes 20 ms, one bolt instance making that call for every tuple can process only 1000 / 20 = 50 tuples/s.
  16. It is hard to optimize! The Storm UI will help you a lot with determining where the bottleneck is in your topology, but you will need to break out your bolts. Capacity: if this is around 1.0, the bolt is running as fast as it can and you probably need to increase your parallelism. Here I’ve prefixed my bolts with a number so they sort nicely in the Storm UI.
  17. Custom Metrics were added in Storm 0.9.0 and allow you to collect a lot more information than what is displayed in the Storm UI. Comes with some metrics out of the box like the CountMetric (cache hits? # of tuples processed?) Can create custom metrics by implementing the IMetric interface. Register your metric in your spout’s open method or bolt’s prepare method. When creating your topology, configure a consumer. LoggingMetricsConsumer comes out of the box and just logs to the metrics.log on one of the machines. Can create your own consumers to stream to third party monitoring apps.
  18. We’ve identified a bottleneck in our topology (the filter bolt) using the Storm UI and Storm’s metrics. Increasing the parallelism of the bolt might help with our throughput. If it takes twice as long as our categorize bolt, we probably need to double the number of executors.
  19. Configure workers, executors, and tasks when creating the topology. Worker process: a separate JVM that runs executors, with one send/receive thread per worker. Rule of thumb: a multiple of the number of machines in your cluster. Executor: a thread spawned by a worker that runs its tasks serially. Rule of thumb: a multiple of the number of workers. Task: runs your spout or bolt code. The number of tasks cannot be changed after the topology has been started. Rule of thumb: a multiple of the number of executors; typically just have 1 per executor unless you plan on adding more nodes while the topology is running. Running more than one task per executor does not increase the level of parallelism! The number of workers and executors can change; the number of tasks cannot. http://stackoverflow.com/questions/17257448/what-is-the-task-in-twitter-storm-parallelism Example: Storm running on 3 nodes with three workers, six executors, six tasks. Workers <= Executors <= Tasks
  20. If HBase calls take 20 ms, we’re going to have a bottleneck in our topology, so we need caching: fieldsGrouping plus caching within bolts. Group by something that will be used as the key (or part of the key) to your cache. Tuples with the same key will be sent to the same bolt instance, which increases the number of cache hits. Create a RotatingMap (used here as an LRU-style cache) in your bolt and configure your bolt to receive tick tuples. Tick tuples are sent to your bolt in addition to normal tuples. Check whether each tuple you receive is a tick tuple; when it is, rotate the cache (every 300 s).
  21. It is possible to develop in multiple languages, but Java makes the most sense for getting started. Check out the storm-starter project on GitHub for a great working example. Use git to clone the repository, set it up in your favorite IDE (Eclipse haha yea right!), and set up Maven. Use maven-shade-plugin to build your uber-jar. Use separate projects for major functionality, and try to keep as little as possible in your Storm project. Use unit testing everywhere; it will save you time when you find bugs in the topology. You can develop locally with just Eclipse and Storm. However, you will most likely also be using a lot of other Hadoop components (HDFS via storm-hdfs, HBase, etc.), so it might be helpful to get a single-node machine with everything installed.
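
The caching approach in note 20 (group by the cache key, then expire entries whenever a tick arrives) can be sketched in plain Java. This is a minimal stand-in written for illustration, not Storm's own RotatingMap class: the name `RotatingCacheSketch`, the `taskFor` helper, and the sample keys are assumptions, and the hash-modulo routing only approximates the idea behind fieldsGrouping.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

/** Bucketed cache whose oldest bucket is dropped on every rotation (hypothetical sketch). */
final class RotatingCacheSketch<K, V> {
    private final LinkedList<Map<K, V>> buckets = new LinkedList<>();

    RotatingCacheSketch(int numBuckets) {
        if (numBuckets < 2) {
            throw new IllegalArgumentException("numBuckets must be >= 2");
        }
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new HashMap<>());
        }
    }

    /** Call when a tick arrives: drop the oldest bucket and start a fresh newest one. */
    Map<K, V> rotate() {
        Map<K, V> expired = buckets.removeLast();
        buckets.addFirst(new HashMap<>());
        return expired;
    }

    /** Looks through all live buckets, newest first; returns null on a cache miss. */
    V get(K key) {
        for (Map<K, V> bucket : buckets) {
            if (bucket.containsKey(key)) {
                return bucket.get(key);
            }
        }
        return null;
    }

    /** Stores the entry in the newest bucket, removing any older copy. */
    void put(K key, V value) {
        for (Map<K, V> bucket : buckets) {
            bucket.remove(key);
        }
        buckets.getFirst().put(key, value);
    }

    /** Grouping idea: the same key always maps to the same task index, so its cache stays warm. */
    static int taskFor(Object key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        RotatingCacheSketch<String, String> cache = new RotatingCacheSketch<>(3);
        cache.put("truck-42", "driver-7");
        cache.rotate();
        cache.rotate();
        System.out.println(cache.get("truck-42")); // still cached after two ticks
        cache.rotate();
        System.out.println(cache.get("truck-42")); // expired on the third tick
        System.out.println(taskFor("truck-42", 4) == taskFor("truck-42", 4)); // same key, same task
    }
}
```

In this sketch an entry survives `numBuckets - 1` rotations unless it is put again, so with three buckets and a 300 s tick an untouched entry would live between 600 and 900 seconds. Reads do not refresh an entry here; a bolt that wants hot keys to stay cached would call `put` again after a hit.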