Stream Processing
with Apache Storm
Spring 2014
Version 1.0
Kit Menke, Lead Software Engineer, EHI
Scott Shaw, Solutions Engineer, Hortonworks
© Hortonworks Inc. 2013
Stream Processing in Hadoop
Driven by new types of data
– Sensor/Machine
– Server logs
– Clickstream
Storm with Hadoop enables new business opportunities
– Low-latency dashboards
– Quality, Security, Safety, Operations Alerts
– Improved operations
– Real-time data integration
[Figure: Hadoop 2.1 as a multi-use data platform (batch, interactive, online, streaming, …). Components shown: HDFS2 (redundant, reliable storage), YARN (cluster resource management), MapReduce (batch), Tez (interactive), Apache Storm (streaming).]
Stream processing has emerged as a key use case
Typical stream processing workflow
[Figure: workflow diagram. Components shown: real-time data feeds; stream processing solution; persist data; relational or non-relational data store; batch processing; batch feeds; update event models (pattern templates, KPIs & alerts); dashboards & applications.]
Stream processing very different from batch
Factors: Real-time vs. Batch
• Data: Freshness
– Real-time: real-time (usually < 15 min)
– Batch: historical (usually more than 15 min old)
• Data: Location
– Real-time: primarily in memory (moved to disk after processing)
– Batch: primarily on disk (moved to memory for processing)
• Processing: Speed
– Real-time: sub-second to a few seconds
– Batch: a few seconds to hours
• Processing: Frequency
– Real-time: always running
– Batch: sporadic to periodic
• Clients: Who?
– Real-time: automated systems only
– Batch: human & automated systems
• Clients: Type
– Real-time: primarily operational systems
– Batch: primarily analytical applications
Key requirements of a streaming solution
• Data Ingest
– Extremely high ingest rates: millions of events/second
• Processing
– Ability to easily plug in different processing frameworks
– Guaranteed processing: at-least-once processing semantics
• Persistence
– Ability to persist data to multiple relational and non-relational data stores
• Operations
– Security, HA, fault tolerance & management support
Apache Storm Leading for Stream Processing
Open source real-time event stream processing platform that
provides fixed, continuous & low latency processing for very high
frequency streaming data
• Highly scalable
– Horizontally scalable like Hadoop
– E.g.: a 10-node cluster can process 1M tuples per second per node
• Fault-tolerant
– Automatically reassigns tasks on failed nodes
• Guarantees processing
– Supports at-least-once & exactly-once processing semantics
• Language agnostic
– Processing logic can be defined in any language
• Apache project
– Brand, governance & a large active community
Pattern driving MOST streaming use cases
Monitor real-time data to prevent or optimize:
• Finance
– Prevent: securities fraud, compliance violations
– Optimize: order routing, pricing
• Telco
– Prevent: security breaches, network outages
– Optimize: bandwidth allocation, customer service
• Retail
– Optimize: offers, pricing
• Manufacturing
– Prevent: machine failures
– Optimize: supply chain
• Transportation
– Prevent: driver & fleet issues
– Optimize: routes, pricing
• Web
– Prevent: application failures, operational issues
– Optimize: site content
Data types: Sentiment, Clickstream, Machine/Sensor, Logs, Geo-location
Storm use cases – IT operations view
• Continuous processing
– Continuously ingest high-rate messages, process them and update data stores
• High-speed data aggregation
– Aggregate multiple data streams that emit data at extremely high rates into one central data store
• Data filtering
– Filter out unwanted data on the fly before it is persisted to a data store
• Distributed RPC response time reduction
– Extremely resource-intensive (CPU, memory or I/O) processing that would take a long time on a single machine can be parallelized with Storm to reduce response times to seconds
Key Constructs in Apache Storm
• Tuples, Streams, Spouts, Bolts
• Topology
• Field Grouping
• Components and Topology Submission
• Parallelism
• Processing Guarantee
Tuples and Streams
• What is a Tuple?
– Fundamental data structure in Storm: a named list of values that can be of any data type.
• What is a Stream?
– An unbounded sequence of tuples.
– The core abstraction in Storm; streams are what you “process” in Storm.
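The two definitions above can be sketched in a few lines of plain Python. This is only an illustration and assumes nothing about Storm's own classes: a tuple is modeled as field names paired with values, and a stream as a generator that never ends. The names `make_tuple` and `sensor_stream` are hypothetical.

```python
import itertools
from typing import Any, Dict, Iterator, List


def make_tuple(fields: List[str], values: List[Any]) -> Dict[str, Any]:
    """Model a tuple as a named list of values (field name -> value)."""
    if len(fields) != len(values):
        raise ValueError("each value needs exactly one field name")
    return dict(zip(fields, values))


def sensor_stream() -> Iterator[Dict[str, Any]]:
    """Model a stream as an unbounded sequence of tuples (hypothetical sensor feed)."""
    for seq in itertools.count():
        # Values may be of any type: here a string, an int, and a float.
        yield make_tuple(["sensor_id", "seq", "reading"], ["s-1", seq, 20.0 + seq * 0.5])


# A stream never ends, so a consumer takes only as many tuples as it needs.
first_three = list(itertools.islice(sensor_stream(), 3))
```

Because the generator is unbounded, any code that consumes it has to decide how much to read, which is the main practical difference from a batch input.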
Spouts
• What is a Spout?
– Generates streams; a spout is the source of streams in a topology
– E.g.: JMS, Twitter, Log, Kafka Spout
– Can spin up multiple instances of a Spout and dynamically adjust as needed
Bolts
• What is a Bolt?
– Processes any number of input streams and produces output streams
– Common processing in bolts: functions, aggregations, joins, reads/writes to data stores, alerting logic
– Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Examples of Bolts:
1. HBaseBolt: persist stream in HBase
2. HDFSBolt: persist stream into HDFS as Avro files using Flume
3. MonitoringBolt: read from HBase and create alerts via email and messaging queues if given thresholds are exceeded.
Topology
• What is a Topology?
–A network of spouts and bolts wired together into a workflow
[Figure: Truck-Event-Processor topology. Components shown: Kafka Spout, HBase Bolt, Monitoring Bolt, HDFS Bolt, WebSocket Bolt, connected by streams.]
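To make "spouts and bolts wired together" concrete, here is a minimal in-process sketch in Python. It is a toy model, not Storm's API: `ToyTopology`, the bolt functions, the event fields, and the speed threshold of 80 are all hypothetical, and everything runs in one thread.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


class ToyTopology:
    """In-process stand-in for a topology: one spout feeding a graph of bolts."""

    def __init__(self, spout: Iterable[Record]) -> None:
        self.spout = spout
        self.bolts: Dict[str, Callable[[Record], List[Record]]] = {}
        self.subscribers: Dict[str, List[str]] = defaultdict(list)

    def add_bolt(self, name: str, process: Callable[[Record], List[Record]], source: str) -> None:
        """Register a bolt and subscribe it to the stream produced by `source`."""
        self.bolts[name] = process
        self.subscribers[source].append(name)

    def run(self) -> None:
        """Push every spout tuple through all downstream bolts, depth first."""
        for tup in self.spout:
            self._deliver("spout", tup)

    def _deliver(self, source: str, tup: Record) -> None:
        for name in self.subscribers[source]:
            for out in self.bolts[name](tup):
                self._deliver(name, out)


events = [{"truck": "t1", "speed": 62}, {"truck": "t2", "speed": 91}]
stored, alerts = [], []


def store(tup: Record) -> List[Record]:
    stored.append(tup)   # stand-in for a persistence bolt
    return []            # emits nothing downstream


def monitor(tup: Record) -> List[Record]:
    # Emit an alert tuple only when the (assumed) speed threshold is exceeded.
    return [{"truck": tup["truck"], "alert": "speeding"}] if tup["speed"] > 80 else []


def notify(tup: Record) -> List[Record]:
    alerts.append(tup)
    return []


topology = ToyTopology(events)
topology.add_bolt("store", store, source="spout")
topology.add_bolt("monitor", monitor, source="spout")
topology.add_bolt("notify", notify, source="monitor")
topology.run()
```

The point of the sketch is the wiring: each bolt subscribes to a named upstream component, and a bolt's output becomes another bolt's input stream.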
Storm Components and Topology Submission
[Figure: the storm-event-processor topology is submitted to Nimbus (YARN App Master node). The diagram also shows three ZooKeeper nodes and five Supervisors (slave nodes) running the Kafka Spout, HDFS Bolt, HBase Bolt, and Monitor Bolt instances.]
Nimbus (Management server)
• Similar to job tracker
• Distributes code around cluster
• Assigns tasks
• Handles failures
Supervisor (Worker nodes)
• Similar to task tracker
• Run bolts and spouts as ‘tasks’
Zookeeper:
• Cluster co-ordination
• Nimbus HA
• Stores cluster metrics
Processing Guarantees in Storm
• At least once
– How is it achieved? Replay tuples on failure
– Applicable use cases: processing does not need to be ordered; need extremely low latency processing
• Exactly once
– How is it achieved? Transactional topologies (now implemented using Trident)
– Applicable use cases: need ordered processing; global counts; context-aware processing; causality-based; latency not important
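One common way to turn replay into exactly-once results for something like a global count is to store the ID of the last applied batch next to the state and skip a batch whose ID was already applied. The sketch below is a plain Python model of that idea with hypothetical names, not Trident's API. It assumes batches arrive in order and that a replayed batch carries the same ID and the same contents.

```python
from typing import List, Optional


class TransactionalCount:
    """Global count that tolerates replayed batches by remembering the last txid."""

    def __init__(self) -> None:
        self.count = 0
        self.last_txid: Optional[int] = None

    def apply_batch(self, txid: int, batch: List[str]) -> None:
        # A replay re-delivers a batch under the same txid; applying it twice
        # would double count, so an already-applied txid is skipped.
        if txid == self.last_txid:
            return
        self.count += len(batch)
        self.last_txid = txid


state = TransactionalCount()
state.apply_batch(1, ["a", "b", "c"])
state.apply_batch(2, ["d"])
state.apply_batch(2, ["d"])       # replay after a simulated failure
at_least_once_count = 3 + 1 + 1   # what a naive counter would report
```

The ordering assumption is why this style of guarantee is listed with "need ordered processing" and "latency not important": batches have to be committed one after another.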
Implementing Storm
Kit Menke, Lead Software Engineer at Enterprise Holdings, Inc.
May 2014
Implementing Storm
Spring 2014
Version 1.0
Real World Scenarios
Overview
• Storm Terminology
• Creating a Topology
• Persisting data from Storm
• Topology Performance
• Custom Metrics
• Workers, Executors, and Tasks
• Caching within a Bolt
• Environment Setup
Storm Terminology
• Topologies run on your Hadoop cluster
– Uber-jar with spouts and bolts
– Runs forever
• Spouts generate streams of tuples
• Tuples are lists of values
• Bolts process tuples (and emit tuples)
[Figure: example topology with a Spout and Bolts A, B, and 1 exchanging tuples.]
Storm Topology Example
Counting Topology
spout unreliable output
Guaranteed message processing
Storm Spout Example
Storm Bolt Example
Failing a Tuple
1. Spout emits tuple
2. Bolt fails tuple
3. Spout receives failed message ID
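The three steps above can be sketched as a small Python simulation. It is a toy model rather than Storm's spout interface: `ReplayingSpout`, `flaky_bolt`, and the rule that the first delivery of "b" fails are all hypothetical, and the spout simply re-queues a failed message for another attempt.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set, Tuple


class ReplayingSpout:
    """Toy spout that keeps emitted messages until they are acked."""

    def __init__(self, messages: List[str]) -> None:
        self.queue: Deque[Tuple[int, str]] = deque(enumerate(messages))
        self.pending: Dict[int, str] = {}

    def next_tuple(self) -> Optional[Tuple[int, str]]:
        if not self.queue:
            return None
        msg_id, value = self.queue.popleft()
        self.pending[msg_id] = value   # remember until acked or failed
        return msg_id, value

    def ack(self, msg_id: int) -> None:
        del self.pending[msg_id]       # fully processed, safe to forget

    def fail(self, msg_id: int) -> None:
        # Step 3: the spout receives the failed message ID and re-queues it.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def flaky_bolt(value: str, seen: Set[str]) -> bool:
    """Fail the first delivery of 'b' (simulated transient error), succeed otherwise."""
    if value == "b" and value not in seen:
        seen.add(value)
        return False
    return True


spout = ReplayingSpout(["a", "b", "c"])
seen: Set[str] = set()
processed: List[str] = []
while True:
    emitted = spout.next_tuple()       # step 1: spout emits tuple
    if emitted is None:
        break
    msg_id, value = emitted
    if flaky_bolt(value, seen):
        processed.append(value)
        spout.ack(msg_id)
    else:
        spout.fail(msg_id)             # step 2: bolt fails tuple
```

Note that the replayed message is processed after "c", which is why at-least-once processing suits workloads that do not need ordering.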
Persisting Data
• Write to HDFS using storm-hdfs for
long term storage
Files in Hue written by storm-hdfs
• Index data in ElasticSearch or Solr
for real-time dashboards
ElasticSearch + Kibana
• Insert messages into a Database
• Message Queue
• HBase reads/writes to influence
topology in real-time
Topology Performance
• Storm UI shows capacity
– Break out your bolts to find bottlenecks!
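As commonly described for the Storm UI, a bolt's capacity is roughly (tuples executed × average execute latency) divided by the measurement window, so a value near 1.0 means the bolt is busy almost all the time. Treat that formula as an assumption to verify against your Storm version. The sketch below applies it to made-up numbers; the bolt names and figures are hypothetical.

```python
def capacity(executed: int, execute_latency_ms: float, window_ms: float) -> float:
    """Fraction of the window a bolt executor spent executing tuples."""
    return (executed * execute_latency_ms) / window_ms


WINDOW_MS = 10 * 60 * 1000  # assumed 10-minute measurement window

# Hypothetical numbers for three bolts in one topology.
bolts = {
    "parse": capacity(executed=300_000, execute_latency_ms=0.2, window_ms=WINDOW_MS),
    "filter": capacity(executed=300_000, execute_latency_ms=1.9, window_ms=WINDOW_MS),
    "persist": capacity(executed=120_000, execute_latency_ms=1.0, window_ms=WINDOW_MS),
}

# The bolt with the highest capacity is the first candidate for more parallelism.
bottleneck = max(bolts, key=bolts.get)
```

Splitting work into separate bolts is what makes this comparison possible: one large bolt would report a single number and hide which step is slow.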
Custom Metrics
• New in Storm 0.9.0
• Out-of-the-box metrics, e.g. CountMetric
• Custom metric by implementing IMetric
• Register the metric on spout/bolt startup
• Set topology to consume metrics stream
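The register-then-report pattern in the bullets above can be sketched as follows. This is a Python model of the idea, not Storm's Java metrics API: the `CountMetric` class here only imitates a counter that is read and reset once per reporting interval, and `FilterBolt`, the registry dictionary, and the metric name are hypothetical.

```python
from typing import Dict


class CountMetric:
    """Counter that reports its value and resets, once per reporting interval."""

    def __init__(self) -> None:
        self._value = 0

    def incr(self, by: int = 1) -> None:
        self._value += by

    def get_value_and_reset(self) -> int:
        value, self._value = self._value, 0
        return value


class FilterBolt:
    """Toy bolt that registers a metric at startup and updates it per tuple."""

    def prepare(self, registry: Dict[str, CountMetric]) -> None:
        self.dropped = CountMetric()
        registry["dropped-tuples"] = self.dropped   # register on startup

    def execute(self, value: int) -> bool:
        if value < 0:                               # hypothetical filter rule
            self.dropped.incr()
            return False
        return True


registry: Dict[str, CountMetric] = {}
bolt = FilterBolt()
bolt.prepare(registry)
kept = [v for v in [5, -1, 7, -3] if bolt.execute(v)]

# A metrics consumer would poll each registered metric on a fixed interval.
report = {name: metric.get_value_and_reset() for name, metric in registry.items()}
```

Resetting on read means each report covers only the interval since the previous one, which keeps per-interval rates easy to chart.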
Topology Performance
• Filter bolt is our bottleneck!
Workers, Executors, and Tasks
• Workers
– Separate JVM
– Workers run Executors
• Executors
– Separate threads
– Executors run Tasks
• Tasks
– Your spout or bolt code
• Running more than one task per executor does not increase
the level of parallelism!!!
Workers <= Executors <= Tasks
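The relationship above can be illustrated with a small layout calculation. This is a simplified round-robin model in Python, not Storm's scheduler, and the function names are hypothetical. It shows that the number of threads able to run at once equals the number of executors, regardless of how many tasks each one holds.

```python
from typing import Dict, List

Layout = Dict[int, Dict[int, List[int]]]  # worker -> executor -> task ids


def assign(num_workers: int, num_executors: int, num_tasks: int) -> Layout:
    """Spread tasks over executors and executors over workers, round robin."""
    if not (1 <= num_workers <= num_executors <= num_tasks):
        raise ValueError("expected workers <= executors <= tasks")
    layout: Layout = {w: {} for w in range(num_workers)}
    for executor in range(num_executors):
        layout[executor % num_workers][executor] = []
    for task in range(num_tasks):
        executor = task % num_executors
        layout[executor % num_workers][executor].append(task)
    return layout


def parallelism(layout: Layout) -> int:
    """Threads that can run at once: one per executor, however many tasks it holds."""
    return sum(len(executors) for executors in layout.values())


layout = assign(num_workers=2, num_executors=4, num_tasks=8)
same_threads_fewer_tasks = assign(num_workers=2, num_executors=4, num_tasks=4)
```

Doubling the tasks from 4 to 8 leaves `parallelism` at 4 in this model, because the extra tasks share the existing executor threads.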
Caching inside a Bolt
• RotatingMap with Tick Tuples
• Use fieldsGrouping to ensure cache hits
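A rough Python sketch of both bullets, under stated assumptions: `RotatingCache` imitates the idea of a bucketed map whose oldest bucket is dropped on every tick, and `task_for` imitates a fields grouping by sending equal keys to the same task. Neither is Storm's implementation; the class, the toy hash, and the key names are hypothetical.

```python
from typing import Any, Dict, Hashable, List, Optional


class RotatingCache:
    """Time-bucketed cache: each rotation drops the oldest bucket of entries."""

    def __init__(self, num_buckets: int = 3) -> None:
        if num_buckets < 2:
            raise ValueError("need at least two buckets")
        self._buckets: List[Dict[Hashable, Any]] = [{} for _ in range(num_buckets)]

    def put(self, key: Hashable, value: Any) -> None:
        # Keep a key in the newest bucket only, so a write refreshes its lifetime.
        for bucket in self._buckets:
            bucket.pop(key, None)
        self._buckets[0][key] = value

    def get(self, key: Hashable) -> Optional[Any]:
        for bucket in self._buckets:
            if key in bucket:
                return bucket[key]
        return None

    def rotate(self) -> Dict[Hashable, Any]:
        """Called on every tick: expire the oldest bucket, open a fresh one."""
        expired = self._buckets.pop()
        self._buckets.insert(0, {})
        return expired


def task_for(key: str, num_tasks: int) -> int:
    """Fields-grouping stand-in: equal keys always map to the same task index."""
    return sum(key.encode()) % num_tasks   # deterministic toy hash


cache = RotatingCache(num_buckets=2)
cache.put("customer-42", {"tier": "gold"})
hit_before = cache.get("customer-42")
cache.rotate()                   # first tick: entry is now in the oldest bucket
hit_after_one_tick = cache.get("customer-42")
expired = cache.rotate()         # second tick: entry expires
hit_after_two_ticks = cache.get("customer-42")
```

Routing by key matters because each bolt instance holds its own cache: if the same key could land on any instance, most lookups would miss.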
Environment Setup
• Storm-starter project on GitHub
• Git, Eclipse, Maven
• Unit test!
• Develop locally or on a single-node Hadoop machine
• Read the source code
Questions?

AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 

Recently uploaded

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Recently uploaded (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Storm – Streaming Data Analytics at Scale - StampedeCon 2014

  • 1. Stream Processing with Apache Storm Spring 2014 Version 1.0 Kit Menke, Lead Software Engineer, EHI Scott Shaw, Solutions Engineer, Hortonworks
  • 2. © Hortonworks Inc. 2013 Stream Processing in Hadoop Driven by new types of data – Sensor/Machine – Server logs – Clickstream Storm with Hadoop enables new business opportunities – Low-latency dashboards – Quality, Security, Safety, Operations Alerts – Improved operations – Real-time data integration HDFS2 (redundant, reliable storage) YARN (cluster resource management) MapReduce (batch) Apache STORM (streaming) HADOOP 2.1 Tez (interactive) Multi Use Data Platform Batch, Interactive, Online, Streaming, … Stream processing has emerged as a key use case 2
  • 3. © Hortonworks Inc. 2013 Typical stream processing workflow Real-time data feeds Stream processing solution Persist data Relational or non relational data store Batch processing Batch FeedsUpdate event models (Pattern templates, KPIs & alerts) Dashboards & Applications 3
  • 4. © Hortonworks Inc. 2013 Stream processing very different from batch Factors Real-time Batch Data Freshness Real-time ( usually < 15 min) Historical – usually more than 15 min old Location Primarily memory ( moved to disk after processing) Primarily in disk moved to memory for processing Processing Speed Sub second to few seconds Few seconds to hours Frequency Always running Sporadic to periodic Clients Who? Automated systems only Human & automated systems Type Primarily operational systems Primarily analytical applications 4
  • 5. © Hortonworks Inc. 2013 Key requirements of a streaming solution • Data Ingest: Extremely high ingest rates – millions of events/second • Processing: Ability to easily plug in different processing frameworks; guaranteed processing – at least once processing semantics • Persistence: Ability to persist data to multiple relational and non-relational data stores • Operations: Security, HA, fault tolerance & management support 5
  • 6. © Hortonworks Inc. 2013 Apache Storm Leading for Stream Processing Open source real-time event stream processing platform that provides fixed, continuous & low latency processing for very high frequency streaming data • Highly scalable: Horizontally scalable like Hadoop, e.g. a 10 node cluster can process 1M tuples per second per node • Fault-tolerant: Automatically reassigns tasks on failed nodes • Guarantees processing: Supports at least once & exactly once processing semantics • Language agnostic: Processing logic can be defined in any language • Apache project: Brand, governance & a large active community 6
  • 7. © Hortonworks Inc. 2013 Pattern driving MOST streaming use cases: monitor real-time data to prevent and optimize • Finance: securities fraud, compliance violations, order routing, pricing • Telco: security breaches, network outages, bandwidth allocation, customer service • Retail: offers, pricing • Manufacturing: machine failures, supply chain • Transportation: driver & fleet issues, routes, pricing • Web: application failures, operational issues, site content • Data types: sentiment, clickstream, machine/sensor, logs, geo-location 7
  • 8. © Hortonworks Inc. 2013 Storm use cases – IT operations view • Continuous processing: Continuously ingest high rate messages, process them and update data stores • High speed data aggregation: Aggregate multiple data streams that emit data at extremely high rates into one central data store • Data filtering: Filter out unwanted data on the fly before it is persisted to a data store • Distributed RPC response time reduction: Extremely resource (CPU, memory or I/O) intensive processing that would take a long time on a single machine can be parallelized with Storm to reduce response times to seconds 8
  • 9. © Hortonworks Inc. 2013 Key Constructs in Apache Storm • Tuples, Streams, Spouts, Bolts • Topology • Field Grouping • Components and Topology Submission • Parallelism • Processing Guarantee 9
  • 10. © Hortonworks Inc. 2013 Tuples and Streams • What is a Tuple? – The fundamental data structure in Storm: a named list of values that can be of any data type. • What is a Stream? – An unbounded sequence of tuples. – The core abstraction in Storm; streams are what you “process” in Storm 10
  • 11. © Hortonworks Inc. 2013 Spouts • What is a Spout? – Generates streams, or acts as a source of streams – E.g.: JMS, Twitter, Log, Kafka Spout – Can spin up multiple instances of a Spout and dynamically adjust as needed 11
  • 12. © Hortonworks Inc. 2013 Bolts • What is a Bolt? – Processes any number of input streams and produces output streams – Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic – Can spin up multiple instances of a Bolt and dynamically adjust as needed • Example of Bolts: 1. HBaseBolt: persist stream in HBase 2. HDFSBolt: persist stream into HDFS as Avro files using Flume 3. MonitoringBolt: read from HBase and create alerts via email and messaging queues if given thresholds are exceeded. 12
  • 13. © Hortonworks Inc. 2013 Topology • What is a Topology? –A network of spouts and bolts wired together into a workflow Truck-Event-Processor Topology Kafka Spout HBase Bolt Monitoring Bolt HDFS Bolt WebSocket Bolt Stream Stream Stream Stream 13
  • 14. © Hortonworks Inc. 2013 Storm Components and Topology Submission [Diagram: the storm-event-processor topology is submitted to Nimbus (YARN App Master node), shown alongside three Zookeeper nodes and five Supervisors (slave nodes) running Kafka Spout, HDFS Bolt, HBase Bolt and Monitor Bolt instances] Nimbus (Management server) • Similar to job tracker • Distributes code around cluster • Assigns tasks • Handles failures Supervisor (Worker nodes) • Similar to task tracker • Run bolts and spouts as ‘tasks’ Zookeeper: • Cluster co-ordination • Nimbus HA • Stores cluster metrics 14
  • 15. © Hortonworks Inc. 2013 Processing Guarantees in Storm • At least once: achieved by replaying tuples on failure. Applicable use cases: processing does not need to be ordered; need extremely low latency processing • Exactly once: achieved with transactional topologies (now implemented using Trident). Applicable use cases: need ordered processing; global counts; context aware processing; causality based; latency not important 15
  • 16. Implementing Storm Kit Menke, Lead Software Engineer at Enterprise Holdings, Inc. May 2014
  • 17. Implementing Storm Spring 2014 Version 1.0 Real World Scenarios
  • 18. Overview • Storm Terminology • Creating a Topology • Persisting data from Storm • Topology Performance • Custom Metrics • Workers, Executors, and Tasks • Caching within a Bolt • Environment Setup 18
  • 19. Storm Terminology • Topologies run on your Hadoop cluster – Uber-jar with spouts and bolts – Runs forever • Spouts generate streams of tuples • Tuples are lists of values • Bolts process tuples (and emit tuples) Topology Spout Bolt A Bolt B Bolt 1 Tuples 19
  • 20. Storm Topology Example Counting Topology spout unreliable output Guaranteed message processing 20
  • 23. Failing a Tuple 1. Spout emits tuple 2. Bolt fails tuple 3. Spout receives failed message ID 23
  • 24. Persisting Data • Write to HDFS using storm-hdfs for long term storage 24
  • 25. Files in Hue written by storm-hdfs 25
  • 26. Persisting Data • Write to HDFS using storm-hdfs for long term storage • Index data in ElasticSearch or Solr for real-time dashboards 26
  • 28. Persisting Data • Write to HDFS using storm-hdfs for long term storage • Index data in ElasticSearch or Solr for real-time dashboards • Insert messages into a Database • Message Queue • HBase reads/writes to influence topology in real-time 28
  • 29. Topology Performance • Storm UI shows capacity – Break out your bolts to find bottlenecks! 29
  • 30. Custom Metrics • New in Storm 0.9.0 • Out of the box metrics, ex: CountMetric • Custom metric by implementing IMetric • Register the metric on spout/bolt startup • Set topology to consume metrics stream 30
  • 31. Topology Performance • Filter bolt is our bottleneck! 31
  • 32. Workers, Executors, and Tasks • Workers – Separate JVM – Workers run Executors • Executors – Separate threads – Executors run Tasks • Tasks – Your spout or bolt code • Running more than one task per executor does not increase the level of parallelism!!! Workers <= Executors <= Tasks
  • 33. Caching inside a Bolt • RotatingMap with Tick Tuples • Use fieldsGrouping to ensure cache hits 33
  • 34. Environment Setup • Storm-starter project on GitHub • Git, Eclipse, Maven • Unit test! • Develop locally or on a single node hadoop machine • Read the source code 34
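
The "Use fieldsGrouping to ensure cache hits" point on slide 33 can be illustrated with a small stand-alone sketch. It is not Storm's implementation: it only assumes that a fields grouping picks the target task by hashing the grouping field values and taking the result modulo the number of tasks. The class name `FieldsGroupingSketch`, the method `chooseTask`, and the `truck-N` keys are hypothetical names used for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of hash-based routing; not part of the Storm API.
class FieldsGroupingSketch {

    // Pick a target task index from the values of the grouping fields.
    // Equal values always produce the same hash, hence the same task.
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 6; // assumed task count for the downstream bolt
        String[] truckIds = {"truck-1", "truck-2", "truck-1"};
        for (String truckId : truckIds) {
            int task = chooseTask(Arrays.<Object>asList(truckId), numTasks);
            // Both "truck-1" tuples print the same task index.
            System.out.println(truckId + " -> task " + task);
        }
    }
}
```

Because every tuple with the same key lands on the same task, a cache held inside that task sees all lookups for the key, which is what makes per-bolt caching effective.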

Editor's Notes

  1. Real-time data integration Analyze, clean, normalize data with low latency Low-latency dashboards Summing/aggregations for operational monitors, gauges and counters Orders, revenue, call volumes, infrastructure load Geographic location of fleets Alerts Quality: Detection of “never seen before” entities (customers, ads, etc) Security: Detection of trespass / fraud / illegal activities Safety: patient monitoring, automotive telematics Operations: Detection of system / network overload Improved operations Advertising optimization Personalization Fleet rerouting
  2. Stream processing solution needs to consume explicit or implicit event models from batch processing platform. These event models define the schemas of incoming event data, such as records of calls into the customer contact center, copies of customer order transactions or exogenous market data. Event models also specify: Relationships (such as causation) among the event types Calculations (for example, formulas to compute KPIs) Alert thresholds (for example, "if average caller wait time exceeds 45 seconds, send a yellow warning by email") Responses (for example, "trigger an exception process if the result of a customer credit check has not been received within two hours")
  3. Storm was benchmarked at processing one million 100 byte messages per second per node on hardware with the following specs: Processor: 2x Intel E5645@2.4Ghz Memory: 24 GB
  4. Add types of data and add prevent and optimize use cases
  5. Getting started with Storm: reading the source code is most helpful. Create a simple hello world topology and run it locally.
  6. Topologies are the application you will write and deploy to your cluster where it will run forever working on streams of data. Each topology contains spouts and bolts Spouts bring data into your topology by generating streams of tuples. This is an external source like a queue or something on the internet (like twitter). Tuples are lists of values (string, int, boolean, or custom objects which require serializers) Bolts process the tuples emitted by the spouts and also emit tuples themselves
  7. Creating a simple storm topology which demonstrates guaranteed message processing. Create a counting spout connected to an unreliable bolt connected to an output bolt Many different options for connecting things together: shuffle grouping means tuples are randomly distributed. Can also group by a field, broadcast tuple Demonstrate an error scenario by using an unreliable bolt
  8. Simple example of a spout which counts from 0 to 9 Open is called once for each instance of your spout. Adding numbers 0-9 to an in-memory queue Typically you will be reading from a real message queue nextTuple is called repeatedly to get each tuple. Here we are emitting one int: number The second parameter is used for reprocessing in the event of a failure declareOutputFields for specifying which fields you are emitting in nextTuple.
  9. An example implementation of an Unreliable Bolt (because it should fail 50% of the time) Bolts also have a prepare and declareOutputFields method. Execute is the main method where your processing will take place. The input tuple was generated by our spout. 50% of the time, the tuple will fail.
  10. Calling _collector.fail on a tuple will cause it to go back to the spout’s fail method. In this simple example, I made number the same value as the tuple but in reality this might be a queued message ID. We ended up not really needing tuple reprocessing but I believe storm-jms has this built in if you need it.
  11. Talked about bringing data into your topology and processing it. Most likely you will want to persist it somewhere as well for additional processing. We are using storm-hdfs to write all messages we receive straight into HDFS. Also indexing our data in ElasticSearch in order to have a real-time dashboard for executives. Influence the topology in “real-time” by reading from or writing to HBase !!!! This stuff will be SLOW compared to how fast you need to process messages in Storm. HBase read takes 20ms, that is only 50 tuples/s!!!!
  12. Using storm-hdfs to stream data to HDFS for more analytics and storage Put hive tables over top, run trends, etc.
  13. Talked about bringing data into your topology and processing it. Most likely you will want to persist it somewhere as well for additional processing. We are using storm-hdfs to write all messages we receive straight into HDFS. Also indexing our data in ElasticSearch in order to have a real-time dashboard for executives. Influence the topology in “real-time” by reading from or writing to HBase !!!! This stuff will be SLOW compared to how fast you need to process messages in Storm. HBase read takes 20ms, that is only 50 tuples/s!!!!
  14. Time based indexes (one per day) Kibana dashboard on top of elasticsearch indexes size: 14.3G (28.7G) docs: 42,051,720 (42,051,720)
  15. Talked about bringing data into your topology and processing it. Most likely you will want to persist it somewhere as well for additional processing. We are using storm-hdfs to write all messages we receive straight into HDFS. Also indexing our data in ElasticSearch in order to have a real-time dashboard for executives. Influence the topology in “real-time” by reading from or writing to HBase !!!! This stuff will be SLOW compared to how fast you need to process messages in Storm. HBase read takes 20ms, that is only 50 tuples/s!!!!
  16. It is hard to optimize! The Storm UI will help you a lot with determining where the bottleneck is in your topology, but you will need to break out your bolts. Capacity: if this is around 1.0, the bolt is running as fast as it can and you probably need to increase your parallelism. Here I’ve prefixed my bolts with a number so they sort nicely in the Storm UI.
  17. Custom Metrics were added in Storm 0.9.0 and allow you to collect a lot more information than what is displayed in the Storm UI. Comes with some metrics out of the box like the CountMetric (cache hits? # of tuples processed?) Can create custom metrics by implementing the IMetric interface. Register your metric in your spout’s open method or bolt’s prepare method. When creating your topology, configure a consumer. LoggingMetricsConsumer comes out of the box and just logs to the metrics.log on one of the machines. Can create your own consumers to stream to third party monitoring apps.
  18. We’ve identified a bottleneck in our topology (filter bolt) using the Storm UI and Storm’s metrics. Increasing the parallelism of the bolt might help with our throughput. If it takes twice as long as our categorize bolt, we probably need to DOUBLE the number of executors.
  19. Configure workers, executors, and tasks when creating the topology. Worker process: a separate JVM that runs executors, with one send/receive thread per worker. Rule of thumb: a multiple of the number of machines in your cluster. Executor: a thread spawned by a worker that runs its tasks serially. Rule of thumb: a multiple of the number of workers. Task: runs your spout or bolt code; the number of tasks cannot be changed after the topology has been started. Rule of thumb: a multiple of the number of executors. Typically just have 1 per executor unless you plan on adding more nodes while the topology is running. Running more than one task per executor does not increase the level of parallelism!!! The number of workers and executors can change; the number of tasks cannot. http://stackoverflow.com/questions/17257448/what-is-the-task-in-twitter-storm-parallelism Example: Storm running on 3 nodes. Three workers, six executors, six tasks. Workers <= Executors <= Tasks
  20. If HBase calls take 20 ms, we’re going to have a bottleneck in our topology, so we need caching: fieldsGrouping + caching within bolts. Group by something that will be used as the key (or part of the key) to your cache. Tuples with the same key will be sent to the same bolt task, which increases the number of cache hits. Create a RotatingMap (an LRU-style cache) in your bolt. Configure your bolt to receive tick tuples, which are sent to your bolt in addition to normal tuples. Check whether the tuple you received was a tick tuple and, if so, rotate the cache (here every 300 s).
  21. It is possible to develop in multiple languages, but Java makes the most sense for getting started. Check out the storm-starter project on GitHub for a great working example. Use git to clone the repository, set it up in your favorite IDE (Eclipse haha yea right!), and set up Maven. Use maven-shade-plugin to build your uber-jar. Use separate projects for major functionality, and try to keep as little as possible in your Storm project. Use unit testing everywhere; it will save you time when you find bugs in the topology. You can develop locally with just Eclipse and Storm. However, you will most likely also be using a lot of other Hadoop stuff (HDFS via storm-hdfs, HBase, etc.), so it might be helpful to get a single node machine with everything installed.
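
The caching advice in notes 11 and 20 can be sketched in plain Java. This is a simplified stand-in written for illustration, not Storm's `RotatingMap` class: entries live in a fixed number of buckets, and each call to `rotate()` (which a bolt would make when it receives a tick tuple) discards the oldest bucket. The class name `RotatingCacheSketch`, the helper `tuplesPerSecond`, and the sample keys are hypothetical. The throughput helper assumes one blocking lookup per tuple and treats cache hits as free, which is a deliberate simplification.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical bucketed expiry cache; a stand-in, not Storm's RotatingMap.
class RotatingCacheSketch<K, V> {
    // Newest bucket at the head, oldest at the tail.
    private final Deque<Map<K, V>> buckets = new ArrayDeque<>();

    RotatingCacheSketch(int numBuckets) {
        if (numBuckets < 2) {
            throw new IllegalArgumentException("need at least 2 buckets");
        }
        for (int i = 0; i < numBuckets; i++) {
            buckets.addLast(new HashMap<>());
        }
    }

    // Writes go to the newest bucket; stale copies in older buckets are removed.
    void put(K key, V value) {
        for (Map<K, V> bucket : buckets) {
            bucket.remove(key);
        }
        buckets.peekFirst().put(key, value);
    }

    // Look through the buckets from newest to oldest; null means a cache miss.
    V get(K key) {
        for (Map<K, V> bucket : buckets) {
            if (bucket.containsKey(key)) {
                return bucket.get(key);
            }
        }
        return null;
    }

    // Call once per tick: drop the oldest bucket and start a fresh newest one.
    Map<K, V> rotate() {
        Map<K, V> expired = buckets.removeLast();
        buckets.addFirst(new HashMap<>());
        return expired;
    }

    // Tuples per second one executor can sustain if only cache misses
    // pay the lookup cost (cache hits are assumed to cost nothing).
    static double tuplesPerSecond(double lookupMillis, double hitRatio) {
        double avgMillis = (1.0 - hitRatio) * lookupMillis;
        return avgMillis == 0.0 ? Double.POSITIVE_INFINITY : 1000.0 / avgMillis;
    }

    public static void main(String[] args) {
        RotatingCacheSketch<String, String> cache = new RotatingCacheSketch<>(3);
        cache.put("truck-1", "driver-42");
        cache.rotate();
        System.out.println(cache.get("truck-1")); // still cached after one rotation
        cache.rotate();
        cache.rotate();
        System.out.println(cache.get("truck-1")); // expired after three rotations
        // No cache: 1000 ms / 20 ms per HBase read = 50 tuples/s, as in note 11.
        System.out.println(tuplesPerSecond(20.0, 0.0));
        // With 90% cache hits the same executor sustains roughly ten times more.
        System.out.println(tuplesPerSecond(20.0, 0.9));
    }
}
```

In this sketch, with N buckets and one rotation per tick, an entry that is never rewritten survives between N-1 and N tick intervals, so the bucket count and the tick frequency together set how stale a cached value can get.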