STORM as an ETL Engine to HADOOP
1. STORM as an ETL Engine to HADOOP
Apr 15, 2015
Yash Ranadive
Lookout Mobile Security
@yashranadive
etl.svbtle.com
2. ABOUT
• Data Engineer at Lookout, San Francisco
• Work on
• Analytics Infrastructure (Internal)
• Data Ingestion in Hadoop
• Blog about all things ETL at etl.svbtle.com
3. AGENDA
• When to use Storm?
• Architecture Alternatives
• Monitoring
• Questions
17. HOW WE SOLVED 2 PROBLEMS
1. User Gratifications
2. Device Connections
18. USER GRATIFICATIONS
A "gratification" is an event that adds value to the user.
19. 1. USER GRATIFICATIONS
• Need Analytics on performance of "Scream", "Lock", "Locate"
• Events in Protobuf format (deserialization sketch below)
[Diagram: "Scream"/"Lock"/"Locate" protobuf events → Kafka → monitor throughput → join with Cohorts table → complex reports]
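A minimal sketch of the deserialization step named above, assuming a hypothetical GratificationEvent protobuf class and illustrative field names (the deck does not show Lookout's actual schema); package names are from Storm 0.9.x, current at the time of the talk:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt: decodes a protobuf-encoded gratification event coming off
// the Kafka spout and re-emits it as a single TSV line for the storm-hdfs bolt.
public class DeserializeProtobufBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        try {
            // GratificationEvent is a placeholder for the generated protobuf class.
            GratificationEvent event = GratificationEvent.parseFrom(tuple.getBinary(0));
            String tsv = event.getDeviceId() + "\t" + event.getAction()
                    + "\t" + event.getTimestamp();
            collector.emit(new Values(tsv));
        } catch (Exception e) {
            // Malformed records are dropped here; production code would count them.
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}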
20. PIPELINE - LANDING DATA DIRECTLY
Kafka → Storm → HDFS
[Diagram: Kafka spout → deserialize-protobuf bolt → storm-hdfs bolt → landing directory → Hive directory]
The bolt deserializes protobuf to a TSV; data lands on HDFS; files are rotated to the HIVE external table folder (wiring sketch below).
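A sketch of how the stages above might be wired together with the Storm 0.9.x storm-kafka API, assuming the DeserializeProtobufBolt sketched earlier; the ZooKeeper host, topic name, and parallelism values are placeholders, not numbers from the talk:

import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

// Kafka spout -> deserialize bolt -> storm-hdfs bolt, as on the slide.
TopologyBuilder builder = new TopologyBuilder();

SpoutConfig spoutConfig = new SpoutConfig(
        new ZkHosts("zookeeper.local:2181"),   // placeholder ZooKeeper ensemble
        "gratifications",                      // placeholder Kafka topic
        "/kafka-offsets", "gratifications-spout");
builder.setSpout("kafka", new KafkaSpout(spoutConfig), 4);
builder.setBolt("deserialize", new DeserializeProtobufBolt(), 4)
        .shuffleGrouping("kafka");
builder.setBolt("hdfs", hdfsBolt, 4)  // storm-hdfs bolt, configured as in the next sketch
        .shuffleGrouping("deserialize");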
21. TUNING OPTIONS
• Change the storm-hdfs hsync count-based policy (bolt configuration sketch below)
• Change the parallelism of the storm-hdfs bolt
• Possibly change storm-hdfs hsync to a time-based policy
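For reference, a sketch of the storm-hdfs bolt these knobs apply to, using the storm-hdfs builder API; the paths and NameNode URL are placeholders, and the sync count of 3000 is taken from the hdfs.sync.tuple.count flag on slide 32:

import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy.Units;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

// TSV records, matching the deserialize bolt's single-field output.
RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("\t");

// hsync after every N tuples: the count-based policy the first bullet tunes.
SyncPolicy syncPolicy = new CountSyncPolicy(3000);

// Rotate files at a size threshold; rotated files are then moved into the
// Hive external table folder. 64 MB is an illustrative value.
FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(64.0f, Units.MB);

FileNameFormat fileNameFormat = new DefaultFileNameFormat()
        .withPath("/landing/gratifications/");    // placeholder landing directory

HdfsBolt hdfsBolt = new HdfsBolt()
        .withFsUrl("hdfs://namenode.local:8020")  // placeholder NameNode URL
        .withFileNameFormat(fileNameFormat)
        .withRecordFormat(format)
        .withRotationPolicy(rotationPolicy)
        .withSyncPolicy(syncPolicy);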
22. THE GOOD
• Plain text of protobufs available by tailing the landing file
• Real-time view of throughput via StatsD
• Data available in HIVE for downstream analysis
23. CHALLENGES
• Possible duplicates, since delivery is not "exactly-once"
• storm-hdfs bolt has limitations
• can't rotate files when the topology shuts down
• parameters need tweaking depending on throughput
24. BURSTY TRAFFIC
Bursty traffic can cause frequent hsyncs (Hadoop file system syncs) and slow down throughput (a time-based SyncPolicy sketch follows).
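The time-based hsync policy suggested on slide 21 was not among the stock storm-hdfs sync policies at the time, but the SyncPolicy interface makes one easy to sketch; this is an illustration, not code from the talk:

import backtype.storm.tuple.Tuple;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

// Sync on a wall-clock interval instead of a tuple count, so a traffic burst
// does not translate into a burst of hsyncs.
public class TimedSyncPolicy implements SyncPolicy {
    private final long intervalMs;
    private long lastSync = System.currentTimeMillis();

    public TimedSyncPolicy(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    @Override
    public boolean mark(Tuple tuple, long offset) {
        // Returning true tells the HdfsBolt to hsync now.
        return System.currentTimeMillis() - lastSync >= intervalMs;
    }

    @Override
    public void reset() {
        lastSync = System.currentTimeMillis();
    }
}

It would be swapped in on the HdfsBolt with .withSyncPolicy(new TimedSyncPolicy(10000)).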
26. 2. DEVICE CONNECTIONS
• Report on counts of connecting devices
• Events in JSON format
• Analyze all devices connecting to backend servers to measure engagement after new product feature rollouts
[Diagram: device-connection JSON events → join with Cohorts table → complex reports]
27. LANDING DATA ON HBASE
Storm → HBase → HIVE
The bolt writes to HBase (sketch below); a daily job copies data from HBase to a Hive table. A Hive table backed by HBase (TTL => 3 days) exposes the last 3 days of data.
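A sketch of the HBase landing step with the storm-hbase module's bolt API; the table name, row key field, and column layout are placeholders rather than Lookout's schema:

import backtype.storm.tuple.Fields;
import org.apache.storm.hbase.bolt.HBaseBolt;
import org.apache.storm.hbase.bolt.mapper.SimpleHBaseMapper;

// Map each device-connection tuple to an HBase row: device id as the row key,
// the raw JSON payload in one column family. The TTL of 3 days mentioned on
// the slide is set on the HBase table itself, not in the topology.
SimpleHBaseMapper mapper = new SimpleHBaseMapper()
        .withRowKeyField("device_id")             // placeholder field name
        .withColumnFields(new Fields("payload"))  // placeholder field name
        .withColumnFamily("d");

HBaseBolt hbaseBolt = new HBaseBolt("device_connections", mapper)  // placeholder table
        .withConfigKey("hbase.conf");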
28. THE GOOD
• Can query HBase or Hive in real time
• Better stability than writing directly to HDFS
31. TOPOLOGY DEPLOYMENT
• Manually push Storm JARs
• After code review
• JAR uploaded to Artifactory w/ version
• JAR deployed to Storm box
• To start a topology
• Kill the previous one
• Start the new one
32. CONFIGURATION MANAGEMENT
$> cat run_topology.sh
storm jar data-storm-0.0.6.jar com.lookout.data.topology.MyTopoClass
-topologymaxtaskparallelism 8
-D hdfs.sync.tuple.count=3000
...
-D statsd.host=statsd.flexd-sf0.local
• Simple
• Config parameters in shell scripts
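Roughly what those flags would translate to inside the topology's main(); how MyTopoClass actually parses its -D arguments is not shown in the deck, so this mapping is an assumption:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;

// Mirror the shell flags in Storm's Config.
Config conf = new Config();
conf.setMaxTaskParallelism(8);                      // -topologymaxtaskparallelism 8
conf.put("hdfs.sync.tuple.count", 3000);            // -D hdfs.sync.tuple.count=3000
conf.put("statsd.host", "statsd.flexd-sf0.local");  // -D statsd.host=...

StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());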
33. TRACKING METRICS
• Use StatsD and Graphite
• Storm Consumer Offsets in DataDog
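A sketch of how a bolt might push throughput counters to StatsD; the talk confirms StatsD and Graphite, but the client library (here the timgroup Java client) and metric names are assumptions:

import com.timgroup.statsd.NonBlockingStatsDClient;
import com.timgroup.statsd.StatsDClient;

// One counter increment per processed tuple gives the real-time throughput
// view mentioned on slide 22.
StatsDClient statsd = new NonBlockingStatsDClient(
        "storm.gratifications",            // metric prefix (placeholder)
        "statsd.flexd-sf0.local", 8125);   // StatsD host from slide 32

statsd.incrementCounter("events.processed");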
34. OPERATIONAL MONITORING & ALERTING
• Ruby script hits Storm's Thrift API (Java equivalent sketched below)
• Alert if a topology is inactive
• No monitoring of bolt-level failures
• Alert on high-level metrics to prevent alert fatigue
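The talk's check is a Ruby script; a Java equivalent against the same Nimbus Thrift API (Storm 0.9.x client classes) might look like this:

import backtype.storm.generated.ClusterSummary;
import backtype.storm.generated.TopologySummary;
import backtype.storm.utils.NimbusClient;
import backtype.storm.utils.Utils;

// Ask Nimbus for a cluster summary and flag any topology that is not ACTIVE,
// mirroring the "alert if topology is inactive" check.
NimbusClient nimbus = NimbusClient.getConfiguredClient(Utils.readStormConfig());
ClusterSummary summary = nimbus.getClient().getClusterSummary();
for (TopologySummary topology : summary.get_topologies()) {
    if (!"ACTIVE".equals(topology.get_status())) {
        System.err.println("ALERT: topology " + topology.get_name()
                + " is " + topology.get_status());
    }
}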
35. ENVIRONMENT
• Independent Storm Cluster for Data Warehouse Tasks
• 2 worker nodes
• 24 Cores
• 48GB Memory
36. LESSONS LEARNED
• Use Storm only for real-time metrics
• Streaming data directly to HDFS has its challenges
• Better stability when ingesting into HBase first
38. OFFICIAL DESCRIPTION
Lookout's data team ingests several terabytes of data from various sources every day using techniques such as binlog parsing, Ruby daemons, and Storm topologies. With the increasing use of distributed messaging like Kafka by upstream services, ingestion needs to happen on a distributed ETL infrastructure that can scale horizontally with the data.
This talk covers Storm topology pipelines for data ingestion, transformation, processing, and ultimately consumption for interactive queries. In addition, the talk focuses on:
1. Storm topology deployment,
2. configuration management,
3. metric monitoring,
4. and finally storage on Hadoop.