SlideShare a Scribd company logo
1 of 38
STORM as an ETL Engine to
HADOOP
Apr 15, 2015
Yash Ranadive
Lookout Mobile Security
@yashranadive
etl.svbtle.com
Friday, April 24, 15
ABOUT
• Data Engineer at Lookout, San Francisco
• Work on
• Analytics Infrastructure (Internal)
• Data Ingestion in Hadoop
• Blog all things ETL
• etl.svbtle.com
Friday, April 24, 15
AGENDA
• When to use Storm?
• Architecture Alternatives
• Monitoring
• Questions
Friday, April 24, 15
Over 60 million registered users
Friday, April 24, 15
DEFINITION OF ETL
Moving and Transforming data so it can
be stored and analyzed
Friday, April 24, 15
DEFINITION OF ETL
Moving and Transforming data so it can
be stored and analyzed
Friday, April 24, 15
DEFINITION OF ETL
Moving and Transforming data so it can
be stored and analyzed
Friday, April 24, 15
General Framework for Event Processing
Pipelines
Need
Friday, April 24, 15
General Framework for Event Processing
Pipelines
Need
that can make processed data available for
analytics as fast as possible
Friday, April 24, 15
THE PROBLEM
Message Bus
ETL Logic
HADOOPWhat to use?
Friday, April 24, 15
Depends on Latency Requirements
What to use?
Friday, April 24, 15
ANALYTICS LATENCY
Latency
Batch Hourly/Daily
Frequent Batch 10-15 mins
Near Real-time <1 minute
Sub Second < 1s
Flow interrupt
Friday, April 24, 15
Depends on Complexity of Reports
What to use?
Friday, April 24, 15
THE PROBLEM
Kafka
ETL Logic
HADOOPWhat to use?
Friday, April 24, 15
OFFLOADING AND PROCESSING
Batch
Offload
Batch
Process
Real-time
Offload
Real-time
Process
Camus X
Storm X X
Scalding X
Spark X X X X
...
Friday, April 24, 15
For Real-time Analytics
We use Storm
Friday, April 24, 15
HOW WE SOLVED 2 PROBLEMS
1. User Gratifications
2. Device Connections
Friday, April 24, 15
an event that adds value to the user
“Gratification” is
USER GRATIFICATIONS
Friday, April 24, 15
1. USER GRATIFICATIONS
• Need Analytics on performance of “Scream”, “Lock”,
“Locate”
• Events in Protobuf format
Kafka
“Scream”, “Lock”, “Locate”
protobuf events
Monitor
Throughput
Join
Cohorts Table
Complex Reports
Friday, April 24, 15
PIPELINE - LANDING DATA DIRECTLY
Kafka Storm HDFS
Kafka
Spout
Deserializ
e Protobuf
storm-
hdfs bolt
Landing
Directory
Hive
Directory
Bolt deserializes protobuf to a TSV Data lands on hdfs
Files rotated to
HIVE external
table folder
Friday, April 24, 15
TUNING OPTIONS
• Change storm-hdfs hsync count based policy
• Change Parallelism of storm-hdfs bolt
• Possibly Change storm-hdfs hsync time based
policy
Friday, April 24, 15
THE GOOD
• Plain Text of Protobufs by tailing landing file
• Real-time view of throughput via. StatsD
• Data available in HIVE for downstream
analysis
#####Insert diag here
Friday, April 24, 15
CHALLENGES
• Possible duplicates if not “exactly-once”
• storm-hdfs bolt has limitations
• can’t rotate when topology
shutdown
• parameter tweaking depending
throughput
Friday, April 24, 15
BURSTY TRAFFIC
dd
Bursty Traffic can cause frequent hsync (hadoop file system sync)
and slow down throughput
Friday, April 24, 15
DEVICE CONNECTIONS
Friday, April 24, 15
2. DEVICE CONNECTIONS
• Report on counts of devices connecting
• JSON format
• Analyze all connecting devices to backend
servers to measure engagement after new
product feature rollouts
Device Connection JSON
events
Join
Cohorts Table
Complex Reports
Friday, April 24, 15
LANDING DATA ON HBASE
Storm HBase
HIVE
Bolt writes to HBase
Daily job copies data
from
HBase to Hive table
Hive table backed by HBase
TTL => 3 days
Hive table backed by HBase - last 3
days of data
Friday, April 24, 15
THE GOOD
• Can query in real-time HBase or Hive
• Better Stability than writing directly to HDFS
Friday, April 24, 15
ANALYTICS
Kafka Storm
StatsD
HIVE Tableau
AdHoc
Friday, April 24, 15
OPERATIONAL
STUFF
Friday, April 24, 15
TOPOLOGY DEPLOYMENT
• Manually push Storm Jars
• After Code Review
• JAR uploaded to Artifactory w/ version
• JAR deployed to Storm Box
• To start topology
• Kill previous
• Start new
Friday, April 24, 15
CONFIGURATION MANAGEMENT
$> cat run_topology.sh
storm jar data-storm-0.0.6.jar com.lookout.data.topology.MyTopoClass 
-topologymaxtaskparallelism 8 
-D hdfs.sync.tuple.count=3000 
...
-D statsd.host=statsd.flexd-sf0.local
• Simple
• Config parameters in shell scripts
Friday, April 24, 15
TRACKING METRICS
• Use StatsD and Graphite
• Storm Consumer Offsets in DataDog
Friday, April 24, 15
OPERATIONAL MONITORING &
ALERTING
• Ruby script hits Storm’s thrift API
• Alert if topology is inactive
• No monitoring on bolt-level failures
• Alert on high-level metrics to
prevent alert fatigue
Friday, April 24, 15
ENVIRONMENT
• Independent Storm Cluster for Data Warehouse
Tasks
• 2 worker nodes
• 24 Cores
• 48GB Memory
Friday, April 24, 15
LESSONS LEARNED
• Use Storm only for real-time metrics
• Streaming data directly to HDFS has its challenges
• Better stability with ingesting first in HBase
Friday, April 24, 15
Questions
Friday, April 24, 15
OFFICIAL DESCRIPTION
Lookout`s data team ingests several terabytes of data from various sources every day
using many techniques such as binlog parsing, ruby daemons, and storm topologies.
With an increasing use of distributed messaging like Kafka from upstream services,
ingestion needs to happen on a distributed ETL infrastructure that can horizontally
scale with the data.
This talk will be on storm topology pipelines for data ingestion, transformation,
processing and ultimately consumption for interactive queries.
In addition, the talk will focus on
1. storm topology deployment,
2. configuration management,
3. metric monitoring,
4. and finally storage on Hadoop.
Keep in Mind: 1. Planning – Structure your presentation, define what the most
important messages are and clearly make your point 2. Plan on approximately 30
minutes of presentation and 10 minutes of Q&A 3. Use standard fonts no smaller
than 24 pts.
Friday, April 24, 15

More Related Content

What's hot

Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormNati Shalom
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationnathanmarz
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkFlink Forward
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormLester Martin
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm ConceptsAndré Dias
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014StampedeCon
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignMichael Noll
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridDataWorks Summit
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JFlink Forward
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RRadek Maciaszek
 
Storm presentation
Storm presentationStorm presentation
Storm presentationShyam Raj
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedGuido Schmutz
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark StreamingP. Taylor Goetz
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Spark Summit
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormJohn Georgiadis
 

What's hot (20)

Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
 
Hadoop at Lookout
Hadoop at LookoutHadoop at Lookout
Hadoop at Lookout
 
Storm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computationStorm: distributed and fault-tolerant realtime computation
Storm: distributed and fault-tolerant realtime computation
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Developing Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache StormDeveloping Java Streaming Applications with Apache Storm
Developing Java Streaming Applications with Apache Storm
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
Apache Storm Concepts
Apache Storm ConceptsApache Storm Concepts
Apache Storm Concepts
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 
Apache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - VerisignApache Storm 0.9 basic training - Verisign
Apache Storm 0.9 basic training - Verisign
 
Multi-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop GridMulti-Tenant Storm Service on Hadoop Grid
Multi-Tenant Storm Service on Hadoop Grid
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
 
Introduction to Storm
Introduction to StormIntroduction to Storm
Introduction to Storm
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
 
Real-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and StormReal-Time Analytics with Kafka, Cassandra and Storm
Real-Time Analytics with Kafka, Cassandra and Storm
 

Viewers also liked

Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingMichael Rainey
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
美团技术沙龙03 - 实时数据仓库解决方案
美团技术沙龙03 - 实时数据仓库解决方案美团技术沙龙03 - 实时数据仓库解决方案
美团技术沙龙03 - 实时数据仓库解决方案美团点评技术团队
 
美团点评技术沙龙09 - 一个用户行为分析产品的设计与实现
美团点评技术沙龙09 - 一个用户行为分析产品的设计与实现美团点评技术沙龙09 - 一个用户行为分析产品的设计与实现
美团点评技术沙龙09 - 一个用户行为分析产品的设计与实现美团点评技术团队
 
中等创业公司后端技术选型
中等创业公司后端技术选型中等创业公司后端技术选型
中等创业公司后端技术选型wang hongjiang
 
Big Data & the importance of Data Science
Big Data & the importance of Data ScienceBig Data & the importance of Data Science
Big Data & the importance of Data ScienceWim Van Leuven
 
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)Roland Bouman
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Sonal Raj
 
Kettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolKettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolAlex Rayón Jerez
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafkaconfluent
 
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedApache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedGuido Schmutz
 
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormRan Silberman
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeGuido Schmutz
 

Viewers also liked (16)

Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka A Deep Dive Into Real-Time Data Streaming
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
美团技术沙龙03 - 实时数据仓库解决方案
美团技术沙龙03 - 实时数据仓库解决方案美团技术沙龙03 - 实时数据仓库解决方案
美团技术沙龙03 - 实时数据仓库解决方案
 
美团点评技术沙龙09 - 一个用户行为分析产品的设计与实现
美团点评技术沙龙09 - 一个用户行为分析产品的设计与实现美团点评技术沙龙09 - 一个用户行为分析产品的设计与实现
美团点评技术沙龙09 - 一个用户行为分析产品的设计与实现
 
中等创业公司后端技术选型
中等创业公司后端技术选型中等创业公司后端技术选型
中等创业公司后端技术选型
 
Big Data & the importance of Data Science
Big Data & the importance of Data ScienceBig Data & the importance of Data Science
Big Data & the importance of Data Science
 
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
Moving and Transforming Data with Pentaho Data Integration 5.0 CE (aka Kettle)
 
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013
 
Kettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolKettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration tool
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Data integration with Apache Kafka
Data integration with Apache KafkaData integration with Apache Kafka
Data integration with Apache Kafka
 
Apache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms comparedApache Storm vs. Spark Streaming - two stream processing platforms compared
Apache Storm vs. Spark Streaming - two stream processing platforms compared
 
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & Storm
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 

Similar to STORM as an ETL Engine to HADOOP

Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Dataconomy Media
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Mats Uddenfeldt
 
Not Your Father’s Data Warehouse: Breaking Tradition with Innovation
Not Your Father’s Data Warehouse: Breaking Tradition with InnovationNot Your Father’s Data Warehouse: Breaking Tradition with Innovation
Not Your Father’s Data Warehouse: Breaking Tradition with InnovationInside Analysis
 
Hdf5 current future
Hdf5 current futureHdf5 current future
Hdf5 current futuremfolk
 
Hadoop Summit - Sanoma self service on hadoop
Hadoop Summit - Sanoma self service on hadoopHadoop Summit - Sanoma self service on hadoop
Hadoop Summit - Sanoma self service on hadoopSander Kieft
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadooplarsgeorge
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
 
Scaling self service on Hadoop
Scaling self service on HadoopScaling self service on Hadoop
Scaling self service on HadoopDataWorks Summit
 
Hdf5 parallel
Hdf5 parallelHdf5 parallel
Hdf5 parallelmfolk
 
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera, Inc.
 
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?Mark Rittman
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopMukund Babbar
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTurkish Testing Board
 
Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Gyula Fóra
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 

Similar to STORM as an ETL Engine to HADOOP (20)

Realtime analytics with_hadoop
Realtime analytics with_hadoopRealtime analytics with_hadoop
Realtime analytics with_hadoop
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Not Your Father’s Data Warehouse: Breaking Tradition with Innovation
Not Your Father’s Data Warehouse: Breaking Tradition with InnovationNot Your Father’s Data Warehouse: Breaking Tradition with Innovation
Not Your Father’s Data Warehouse: Breaking Tradition with Innovation
 
Hdf5 current future
Hdf5 current futureHdf5 current future
Hdf5 current future
 
Hadoop Summit - Sanoma self service on hadoop
Hadoop Summit - Sanoma self service on hadoopHadoop Summit - Sanoma self service on hadoop
Hadoop Summit - Sanoma self service on hadoop
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Scaling self service on Hadoop
Scaling self service on HadoopScaling self service on Hadoop
Scaling self service on Hadoop
 
Hdf5 parallel
Hdf5 parallelHdf5 parallel
Hdf5 parallel
 
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
 
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
SQL-on-Hadoop for Analytics + BI: What Are My Options, What's the Future?
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Presto
PrestoPresto
Presto
 
SQL and Machine Learning on Hadoop
SQL and Machine Learning on HadoopSQL and Machine Learning on Hadoop
SQL and Machine Learning on Hadoop
 
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland LeusdenTestistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden
 
Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache Flink
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

STORM as an ETL Engine to HADOOP

  • 1. STORM as an ETL Engine to HADOOP Apr 15, 2015 Yash Ranadive Lookout Mobile Security @yashranadive etl.svbtle.com Friday, April 24, 15
  • 2. ABOUT • Data Engineer at Lookout, San Francisco • Work on • Analytics Infrastructure (Internal) • Data Ingestion in Hadoop • Blog all things ETL • etl.svbtle.com Friday, April 24, 15
  • 3. AGENDA • When to use Storm? • Architecture Alternatives • Monitoring • Questions Friday, April 24, 15
  • 4. Over 60 million registered users Friday, April 24, 15
  • 5. DEFINITION OF ETL Moving and Transforming data so it can be stored and analyzed Friday, April 24, 15
  • 6. DEFINITION OF ETL Moving and Transforming data so it can be stored and analyzed Friday, April 24, 15
  • 7. DEFINITION OF ETL Moving and Transforming data so it can be stored and analyzed Friday, April 24, 15
  • 8. General Framework for Event Processing Pipelines Need Friday, April 24, 15
  • 9. General Framework for Event Processing Pipelines Need that can make processed data available for analytics as fast as possible Friday, April 24, 15
  • 10. THE PROBLEM Message Bus ETL Logic HADOOPWhat to use? Friday, April 24, 15
  • 11. Depends on Latency Requirements What to use? Friday, April 24, 15
  • 12. ANALYTICS LATENCY Latency Batch Hourly/Daily Frequent Batch 10-15 mins Near Real-time <1 minute Sub Second < 1s Flow interrupt Friday, April 24, 15
  • 13. Depends on Complexity of Reports What to use? Friday, April 24, 15
  • 14. THE PROBLEM Kafka ETL Logic HADOOPWhat to use? Friday, April 24, 15
  • 15. OFFLOADING AND PROCESSING Batch Offload Batch Process Real-time Offload Real-time Process Camus X Storm X X Scalding X Spark X X X X ... Friday, April 24, 15
  • 16. For Real-time Analytics We use Storm Friday, April 24, 15
  • 17. HOW WE SOLVED 2 PROBLEMS 1. User Gratifications 2. Device Connections Friday, April 24, 15
  • 18. an event that adds value to the user “Gratification” is USER GRATIFICATIONS Friday, April 24, 15
  • 19. 1. USER GRATIFICATIONS • Need Analytics on performance of “Scream”, “Lock”, “Locate” • Events in Protobuf format Kafka “Scream”, “Lock”, “Locate” protobuf events Monitor Throughput Join Cohorts Table Complex Reports Friday, April 24, 15
  • 20. PIPELINE - LANDING DATA DIRECTLY Kafka Storm HDFS Kafka Spout Deserializ e Protobuf storm- hdfs bolt Landing Directory Hive Directory Bolt deserializes protobuf to a TSV Data lands on hdfs Files rotated to HIVE external table folder Friday, April 24, 15
  • 21. TUNING OPTIONS • Change storm-hdfs hsync count based policy • Change Parallelism of storm-hdfs bolt • Possibly Change storm-hdfs hsync time based policy Friday, April 24, 15
  • 22. THE GOOD • Plain Text of Protobufs by tailing landing file • Real-time view of throughput via. StatsD • Data available in HIVE for downstream analysis #####Insert diag here Friday, April 24, 15
  • 23. CHALLENGES • Possible duplicates if not “exactly-once” • storm-hdfs bolt has limitations • can’t rotate when topology shutdown • parameter tweaking depending throughput Friday, April 24, 15
  • 24. BURSTY TRAFFIC dd Bursty Traffic can cause frequent hsync (hadoop file system sync) and slow down throughput Friday, April 24, 15
  • 26. 2. DEVICE CONNECTIONS • Report on counts of devices connecting • JSON format • Analyze all connecting devices to backend servers to measure engagement after new product feature rollouts Device Connection JSON events Join Cohorts Table Complex Reports Friday, April 24, 15
  • 27. LANDING DATA ON HBASE Storm HBase HIVE Bolt writes to HBase Daily job copies data from HBase to Hive table Hive table backed by HBase TTL => 3 days Hive table backed by HBase - last 3 days of data Friday, April 24, 15
  • 28. THE GOOD • Can query in real-time HBase or Hive • Better Stability than writing directly to HDFS Friday, April 24, 15
  • 31. TOPOLOGY DEPLOYMENT • Manually push Storm Jars • After Code Review • JAR uploaded to Artifactory w/ version • JAR deployed to Storm Box • To start topology • Kill previous • Start new Friday, April 24, 15
  • 32. CONFIGURATION MANAGEMENT $> cat run_topology.sh storm jar data-storm-0.0.6.jar com.lookout.data.topology.MyTopoClass -topologymaxtaskparallelism 8 -D hdfs.sync.tuple.count=3000 ... -D statsd.host=statsd.flexd-sf0.local • Simple • Config parameters in shell scripts Friday, April 24, 15
  • 33. TRACKING METRICS • Use StatsD and Graphite • Storm Consumer Offsets in DataDog Friday, April 24, 15
  • 34. OPERATIONAL MONITORING & ALERTING • Ruby script hits Storm’s thrift API • Alert if topology is inactive • No monitoring on bolt-level failures • Alert on high-level metrics to prevent alert fatigue Friday, April 24, 15
  • 35. ENVIRONMENT • Independent Storm Cluster for Data Warehouse Tasks • 2 worker nodes • 24 Cores • 48GB Memory Friday, April 24, 15
  • 36. LESSONS LEARNED • Use Storm only for real-time metrics • Streaming data directly to HDFS has its challenges • Better stability with ingesting first in HBase Friday, April 24, 15
  • 38. OFFICIAL DESCRIPTION Lookout`s data team ingests several terabytes of data from various sources every day using many techniques such as binlog parsing, ruby daemons, and storm topologies. With an increasing use of distributed messaging like Kafka from upstream services, ingestion needs to happen on a distributed ETL infrastructure that can horizontally scale with the data. This talk will be on storm topology pipelines for data ingestion, transformation, processing and ultimately consumption for interactive queries. In addition, the talk will focus on 1. storm topology deployment, 2. configuration management, 3. metric monitoring, 4. and finally storage on Hadoop. Keep in Mind: 1. Planning – Structure your presentation, define what the most important messages are and clearly make your point 2. Plan on approximately 30 minutes of presentation and 10 minutes of Q&A 3. Use standard fonts no smaller than 24 pts. Friday, April 24, 15