The document discusses real-time processing in Hadoop using the Hortonworks Data Platform (HDP). It provides an overview of using HDP for real-time streaming analytics in a logistics scenario. Example applications and architectures are presented, including using Kafka for ingesting sensor data, Storm for stream processing, and HBase for real-time querying. Demos will also illustrate integrating predictive analytics into streaming scenarios.
Storm Demo Talk - Colorado Springs May 2015 (Mac Moore)
The document discusses real-time processing capabilities in Hadoop and the Hortonworks Data Platform (HDP). It begins with an introduction to Hortonworks and an overview of real-time streaming architectures on HDP, then demonstrates streaming capabilities through two demos relevant to logistics companies: a basic streaming scenario, and the same scenario with predictive analytics added. It highlights how HDP's centralized architecture and open data platform enable real-time and batch processing of any type of data for analytics applications.
Trucking demo w Spark ML - Paul Hargis - Hortonworks (Kelly Kohlleffel)
A trucking company generates millions of event logs from its fleet of trucks that are monitored in real-time. These include normal driving events as well as violation events like speeding. The company analyzes these event logs using Hadoop to understand routes, trucks, and drivers that are more prone to violations. Streaming data is processed using Storm on Hadoop and violations are detected. Machine learning models are trained using Spark on the historical enriched event data to predict violations in real-time and provide recommendations to reduce violations.
Enabling the Real Time Analytical Enterprise (Hortonworks)
This document discusses enabling real-time analytics in the enterprise. It begins with an overview of the challenges of real-time analytics due to non-integrated systems, varied data types and volumes, and data management complexity. A case study on real-time quality analytics in automotive is presented, highlighting the need to analyze varied data sources quickly to address issues. The Hortonworks/Attunity solution is then introduced using Attunity Replicate to integrate data from various sources in real-time into Hortonworks Data Platform for analysis. A brief demonstration of data streaming from a database into Kafka and then Hortonworks Data Platform is shown.
Introduction to Hortonworks Data Platform (Hortonworks)
This document introduces the Hortonworks Data Platform. It summarizes the key features of the platform, including its ability to simplify deployment, monitor and manage large clusters, integrate with any data source, and provide metadata services. The document demonstrates the Hortonworks Management Center and features for high availability, data integration, and metadata services. It concludes by discussing training, support, and certification services available from Hortonworks.
Powering Fast Data and the Hadoop Ecosystem with VoltDB and Hortonworks (Hortonworks)
Developers increasingly are building dynamic, interactive real-time applications on fast streaming data to extract maximum value from data in the moment. To do so requires a data pipeline, the ability to make transactional decisions against state, and an export functionality that pushes data at high speeds to long-term Hadoop analytics stores like Hortonworks Data Platform (HDP). This enables data to arrive in your analytic store sooner, and allows these analytics to be leveraged with radically lower latency.
But successfully writing fast data applications that manage, process, and export streams of data generated by mobile and smart devices, sensors, and social interactions is a big challenge.
Join Hortonworks and VoltDB, an in-memory scale-out relational database that simplifies fast data application development, to learn how you can ingest large volumes of fast-moving, streaming data and process it in real time. We will also cover how developing fast data applications becomes simpler and faster, and delivers more value, when built on a fast in-memory, scale-out SQL database.
Schlumberger is the world's largest oilfield services company, helping customers find and produce oil and gas. It faces big data challenges in its upstream operations, which span everything from subsurface activity to the wellhead. Schlumberger uses Hadoop to analyze vast amounts of sensor data to improve operations, and has seen positive results such as reduced costs and improved recovery rates.
Deep learning with Hortonworks and Apache Spark - Hortonworks technical workshop (Hortonworks)
Rich media is exploding all around us. From our personal usage to retailers monitoring store traffic for optimized associate placement, there is wide and growing application of rich media. Despite this pervasive usage, enterprises have had a limited choice of generally available tools to analyze rich media. In this session we will look into leveraging deep learning algorithms for rich media analysis and provide a practical, hands-on example of image recognition using Apache Hadoop and Spark.
Analytics Modernization: Configuring SAS® Grid Manager for Hadoop (Hortonworks)
Improve the efficiency and accelerate job execution by moving traditional SAS workloads into Hadoop to modernize and optimize SAS analytics. How can we run traditional SAS® jobs, including SAS® Workspace Servers, on Hadoop worker nodes? The answer is SAS® Grid Manager for Hadoop, which is integrated with the Hadoop ecosystem to provide resource management, high availability and enterprise scheduling for SAS customers. By moving SAS workloads inside the Hadoop cluster, efficiency is improved and job execution is accelerated. We will also cover the role of Hadoop YARN, Hadoop Distributed File System (HDFS) storage, and Hadoop client services. We review SAS metadata definitions for SAS Grid Manager, SAS® Object Spawner, and SAS® Workspace Servers. Audio broadcast: https://hortonworks.com/webinar/configuring-sas-grid-manager-hadoop/
Hortonworks Data In Motion Series Part 4 (Hortonworks)
How real-world enterprises leverage Hortonworks DataFlow/Apache NiFi to create real-time data flows in record time, enabling new business opportunities, improving customer retention, and accelerating big data projects from months to minutes through increased efficiency and reduced costs.
On-Demand webinar: http://hortonworks.com/webinar/paradigm-shift-business-usual-real-time-dataflows-record-time/
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features (Hortonworks)
Hortonworks DataFlow (HDF) is the complete solution that addresses the most complex streaming architectures of today’s enterprises. More than 20 billion IoT devices are active on the planet today, and thousands of use cases across IIoT, healthcare, and manufacturing warrant capturing data-in-motion and delivering actionable intelligence right NOW. “Data decay” happens in a matter of seconds in today’s digital enterprises.
To meet all the needs of such fast-moving businesses, we have made significant enhancements and new streaming features in HDF 3.1.
https://hortonworks.com/webinar/series-hdf-3-1-technical-deep-dive-new-streaming-features/
Hortonworks for Financial Analysts Presentation (Hortonworks)
Hortonworks was founded in 2011 by former Yahoo engineers to support the growth of Apache Hadoop. Their strategy is to overcome technology gaps by making Hadoop easier to install and use, enable an ecosystem of partners by defining open APIs, and overcome knowledge gaps by expanding technical content and training. This will help drive wider adoption of Apache Hadoop as the platform for managing big data in the enterprise.
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ... (Hortonworks)
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
Streamline Apache Hadoop Operations with Apache Ambari and SmartSense (Hortonworks)
Apache Ambari 2.5 helps customers simplify the experience of provisioning, managing, monitoring, securing and troubleshooting Hadoop deployments. Find out how the combination of Ambari and SmartSense delivers a path to success to help IT get Hadoop up and running effectively. The end result: you get the full business impact and benefits of Big Data for your organization.
https://hortonworks.com/webinar/streamline-apache-hadoop-operations-apache-ambari-smartsense/
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca... (Hortonworks)
The document discusses a Big Data Meetup organized by C-BAG (Chennai Big Data Analytic Group) on October 29, 2014 in Chennai. It provides details about two speakers, Dhruv Kumar from Concurrent Inc. and Vinay Shukla from Hortonworks, who will discuss reducing development time for production-grade Hadoop applications and Hortonworks' Hadoop platform respectively. The remainder of the document consists of presentation slides that cover topics including the modern data architecture with Hadoop, enterprise goals for data architecture, unlocking applications from new data types, and case studies.
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next (Hortonworks)
The document discusses new features in Apache Hive 0.14 that improve SQL query performance. It introduces a cost-based optimizer that can optimize join orders, enabling faster query times. An example TPC-DS query is shown to demonstrate how the optimizer selects an efficient join order based on statistics about table and column sizes. Faster SQL queries are now possible in Hive through this query optimization capability.
Powering Big Data Success On-Prem and in the Cloud (Hortonworks)
How do you optimize Apache Spark workloads in the cloud? How do you tune your resources for maximum performance and efficiency? Find out how the new Hortonworks Flex support subscription enables IT agility and success in the cloud. We will cover:
* Options for running Data Science, Analytics and ETL workloads in the cloud
* Hortonworks support offerings including new Flex Support Subscription
* How to run Cloud workloads more efficiently with SmartSense
* Case study on the impact of SmartSense
https://hortonworks.com/webinar/powering-big-data-success-cloud/
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data... (Hortonworks)
1. Hortonworks Data Platform 1.2 focuses on continued innovation with Apache Ambari and enhanced security and performance for Hive and HCatalog.
2. Key features include root cause analysis, usage heat maps, and improved ecosystem integration in Ambari, as well as enhanced security models and concurrency improvements.
3. Hortonworks ensures tight alignment with open source Apache projects by certifying the latest stable components and contributing leadership and code back to projects.
This document provides an agenda and overview of topics for a Hortonworks data movement and management meetup. The agenda includes networking, introductions, discussions on Falcon use cases and releases, Hive disaster recovery, server-side extensions, ADF/instance search, Hive-based ingestion/export, Spark integration, and Sqoop 2 features. An overview of Falcon describes its high-level abstraction of Hadoop data processing services. Usage scenarios focus on dataset replication, lifecycle management, and lineage/traceability. The document also discusses Falcon examples for replication, retention, and late data handling.
Predicting Customer Experience through Hadoop and Customer Behavior Graphs (Hortonworks)
Enhancing the customer experience has become essential for communication service providers to effectively manage customer churn and build a strong, long-lasting relationship with their customers. This has become increasingly challenging as customer interactions occur across multiple channels. Understanding customer behavior and how it applies across channels is the key to ensuring the best level of experience is achieved by each customer.
In this webinar Hortonworks and Apigee discuss how service providers can capture and visualize customer behavior across customer interaction points like call center events (IVR and chat) and combine it with network data, to predict customer calls and patterns of digital channel abandonment using Hadoop and predictive analysis and visualization tools.
We will identify ways to develop a 360 degree view across a customer’s household through an HDP Data Lake and visualize customer interaction patterns and predict expected behavior using Apigee Insights to identify and initiate the Next-Best-Action for a customer to ensure a superior level of customer experience.
Hortonworks Data In Motion Webinar Series Pt. 2 (Hortonworks)
This document discusses Hortonworks' HDF 2.0 platform for managing data in motion and at rest. The platform includes tools for data ingestion, streaming, and storage. It also allows partners to integrate their solutions and get certified. Use cases highlighted include log analytics, IoT, and connected vehicles. The ecosystem supports ingesting data from various sources and processing it using tools like NiFi, Kafka, and Storm.
This document discusses the author's 10 year journey with Hadoop, from 2006 to 2016. It describes the evolution of key Hadoop technologies like HDFS, MapReduce, YARN and the addition of engines for SQL, NoSQL, streaming and in-memory processing. The document also addresses trends around growth of data from devices, users and the internet of things. It presents a vision of the future where Hadoop (YARN.next) will assemble and securely operate a flexible menu of data access applications and engines.
Your Self-Driving Car - How Did it Get So Smart? (Hortonworks)
This document summarizes a presentation given by Michael Ger, Dr. Andreas Pawlik, and Dr. Seunghan Han of NorCom and Hortonworks about their DaSense data science platform. DaSense is designed to help researchers developing autonomous vehicle systems by allowing them to more efficiently run simulations and test algorithms on large datasets using distributed high performance computing resources. It aims to accelerate the development process by enabling experiments that previously took days to be completed within hours or minutes by leveraging large compute clusters. DaSense provides tools for building end-to-end data science pipelines for tasks like data filtering, model training, evaluation and analysis.
Hortonworks provides an overview of their Tez framework for improving Hadoop query processing. Tez aims to accelerate queries by expressing them as dataflow graphs that can be optimized, rather than relying solely on MapReduce. It also aims to empower users by allowing flexible definition of data pipelines and composition of inputs, processors, and outputs. Early results show a 100x speedup on benchmark queries compared to traditional MapReduce.
Eric Baldeschwieler Keynote from Storage Developers Conference (Hortonworks)
- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable storage of petabytes of data and large-scale computations across commodity hardware.
- Apache Hadoop is used widely by internet companies to analyze web server logs, power search engines, and gain insights from large amounts of social and user data. It is also used for machine learning, data mining, and processing audio, video, and text data.
- The future of Apache Hadoop includes making it more accessible and easy to use for enterprises, addressing gaps like high availability and management, and enabling partners and the community to build on it through open APIs and a modular architecture.
1) The webinar covered Apache Hadoop on the open cloud, focusing on key drivers for Hadoop adoption like new types of data and business applications.
2) Requirements for enterprise Hadoop include core services, interoperability, enterprise readiness, and leveraging existing skills in development, operations, and analytics.
3) The webinar demonstrated Hortonworks Apache Hadoop running on Rackspace's Cloud Big Data Platform, which is built on OpenStack for security, optimization, and an open platform.
Apache Flink: Real-World Use Cases for Streaming Analytics (Slim Baltagi)
This face-to-face talk about Apache Flink in Sao Paulo, Brazil is the first event of its kind in Latin America! It explains how Apache Flink 1.0, announced on March 8th, 2016 by the Apache Software Foundation, marks a new era of Big Data analytics and in particular real-time streaming analytics. The talk maps Flink's capabilities to real-world use cases that span multiple verticals such as: Financial Services, Healthcare, Advertisement, Oil and Gas, Retail and Telecommunications.
In this talk, you learn more about:
1. What is Apache Flink Stack?
2. Batch vs. Streaming Analytics
3. Key Differentiators of Apache Flink for Streaming Analytics
4. Real-World Use Cases with Flink for Streaming Analytics
5. Who is using Flink?
6. Where do you go from here?
This document provides an overview of resource aware scheduling in Apache Storm. It discusses the challenges of scheduling Storm topologies at Yahoo scale, including increasing heterogeneous clusters, low cluster utilization, and unbalanced resource usage. It then introduces the Resource Aware Scheduler (RAS) built for Storm, which allows fine-grained resource control and isolation for topologies through APIs and cgroups. Key features of RAS include pluggable scheduling strategies, per user resource guarantees, and topology priorities. Experimental results from Yahoo Storm clusters show significant improvements to throughput and resource utilization with RAS. The talk concludes with future work on improved scheduling strategies and real-time resource monitoring.
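As a rough illustration of the fine-grained resource control described above, here is a minimal sketch using the RAS hints in Storm's 1.x Java API; EventSpout and ParseBolt are hypothetical components, and the memory/CPU figures are arbitrary placeholders.

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class RasTopologySketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // EventSpout and ParseBolt are hypothetical components for this sketch.
        builder.setSpout("events", new EventSpout(), 2)
               .setMemoryLoad(512.0)  // on-heap memory request, in MB
               .setCPULoad(50.0);     // CPU request: 100 roughly equals one core

        builder.setBolt("parse", new ParseBolt(), 4)
               .shuffleGrouping("events")
               .setMemoryLoad(256.0)
               .setCPULoad(25.0);

        Config conf = new Config();
        conf.setTopologyPriority(10);               // lower number = higher priority
        conf.setTopologyWorkerMaxHeapSize(1024.0);  // cap per-worker heap, in MB
        StormSubmitter.submitTopology("ras-demo", conf, builder.createTopology());
    }
}
```

With declarations like these, the scheduler can pack components onto heterogeneous nodes by actual need rather than by a flat slot count, which is the utilization problem the talk describes.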
This document discusses how to use Storm and Hadoop together to enable real-time and batch processing of large datasets. It describes using Hadoop to precompute batch views of data, and Storm to incrementally update real-time views as new data streams in. This allows for low-latency queries by combining precomputed batch views with real-time views that compensate for recent data not yet absorbed into the batch views.
Storm: distributed and fault-tolerant realtime computation (nathanmarz)
Storm is a distributed real-time computation system that provides guaranteed message processing, horizontal scalability, and fault tolerance. It allows users to define data processing topologies and submit them to a Storm cluster for distributed execution. Spouts emit streams of tuples that are processed by bolts. Storm tracks processing to ensure reliability and replays failed tasks. It provides tools for deployment, monitoring, and optimization of real-time data processing.
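To make the spout/bolt/ack model above concrete, here is a minimal bolt sketch against Storm's 1.x Java API; the component and field names are made up for the example. Anchoring each emitted tuple to its input is what lets Storm track the tuple tree and replay from the spout when processing fails.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A minimal bolt illustrating Storm's ack-based reliability model.
public class UppercaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            String word = input.getStringByField("word");
            // Anchoring to 'input' ties this emit into the input's tuple tree.
            collector.emit(input, new Values(word.toUpperCase()));
            collector.ack(input);   // mark the input as fully processed
        } catch (Exception e) {
            collector.fail(input);  // ask Storm to replay the tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper"));
    }
}
```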
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low-latency data processing architect at Yahoo. He is a PMC member on many Apache projects, including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on YARN for Yahoo (although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on YARN for Yahoo.
Apache Storm 0.9 basic training - Verisign (Michael Noll)
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Hadoop Summit Europe 2014: Apache Storm Architecture (P. Taylor Goetz)
Storm is an open-source distributed real-time computation system. It uses a distributed messaging system to reliably process streams of data. The core abstractions in Storm are spouts, which are sources of streams, and bolts, which are basic processing elements. Spouts and bolts are organized into topologies which represent the flow of data. Storm provides fault tolerance through message acknowledgments, which guarantee at-least-once processing. Trident is a high-level abstraction built on Storm that supports operations like aggregations, joins, and state management, and adds exactly-once semantics through its micro-batch oriented, stream-based API.
This document provides an overview of real-time processing capabilities on Hortonworks Data Platform (HDP). It discusses how a trucking company uses HDP to analyze sensor data from trucks in real-time to monitor for violations and integrate predictive analytics. The company collects data using Kafka and analyzes it using Storm, HBase and Hive on Tez. This provides real-time dashboards as well as querying of historical data to identify issues with routes, trucks or drivers. The document explains components like Kafka, Storm and HBase and how they enable a unified YARN-based architecture for multiple workloads on a single HDP cluster.
Internet of Things Crash Course Workshop at Hadoop Summit (DataWorks Summit)
This document provides an overview of how a trucking company can use Hortonworks Data Platform (HDP) to gain insights from real-time streaming data generated by sensors in its trucks. The company wants to monitor trucks for locations, violations, and other events. HDP allows the company to ingest streaming data from trucks using Kafka and analyze it in real-time with Storm for alerts or serve it to applications with HBase. The company can also run interactive queries on historical data with Hive and Tez. All of this is run on a single HDP cluster for consistent governance, security, and operations across batch and real-time workloads.
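As a hedged sketch of the "serve it to applications with HBase" step, the snippet below writes one truck event and scans it back with the standard HBase 1.x client API. The table name, column family, and row-key scheme are assumptions for illustration, not the workshop's actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class TruckEventStoreSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("truck_events"))) {

            // Hypothetical row key: driverId plus a reversed timestamp,
            // so the newest events for a driver sort first.
            String rowKey = "driver42|" + (Long.MAX_VALUE - System.currentTimeMillis());
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("type"), Bytes.toBytes("speeding"));
            put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("route"), Bytes.toBytes("I-10"));
            table.put(put);  // a Storm bolt could do this per event

            // Real-time lookup: scan recent events for one driver.
            Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("driver42|"));
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```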
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop (Hortonworks)
How can you simplify the management and monitoring of your Hadoop environment, and ensure IT can focus on the right business priorities supported by Hadoop? Take a look at this presentation to find out.
Hortonworks - What's Possible with a Modern Data Architecture? (Hortonworks)
This is Mark Ledbetter's presentation from the September 22, 2014 Hortonworks webinar “What’s Possible with a Modern Data Architecture?” Mark is vice president for industry solutions at Hortonworks. He has more than twenty-five years' experience in the software industry, with a focus on retail and supply chain.
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Baltagi has worked in various architecture, design, development, and consulting roles at Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, and Deutsche Bahn.
Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, YSlow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax, XStream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
Supporting Financial Services with a More Flexible Approach to Big Data (Hortonworks)
The document discusses how Hortonworks Data Platform (HDP) enables a modern data architecture with Apache Hadoop. HDP provides a common data set stored in HDFS that can be accessed through various applications for batch, interactive, and real-time processing. This allows organizations to store all their data in one place and access it simultaneously through multiple means. YARN is the architectural center of HDP and enables this modern data architecture. HDP also provides enterprise capabilities like security, governance, and operations to make Hadoop suitable for business use.
Apache Ambari is a single framework for IT administrators to provision, manage and monitor a Hadoop cluster. Apache Ambari 1.7.0 is included with Hortonworks Data Platform 2.2.
In this 30-minute webinar, Hortonworks Product Manager Jeff Sposetti and Apache Ambari committer Mahadev Konar discussed new capabilities including:
Improvements to Ambari core - such as support for ResourceManager HA
Extensions to Ambari platform - introducing Ambari Administration and Ambari Views
Enhancements to Ambari Stacks - dynamic configuration recommendations and validations via a "Stack Advisor"
Introduction to the Hortonworks YARN Ready Program (Hortonworks)
The recently launched YARN Ready Program will accelerate multi-workload Hadoop in the Enterprise. The program enables developers to integrate new and existing applications with YARN-based Hadoop. We will cover:
--the program and its benefits
--why it is important to customers
--tools and guides to help you get started
--technical resources to support you
--marketing recognition you can leverage
Mrinal Devadas, Hortonworks - Making Sense Of Big Data (PatrickCrompton)
This document provides an overview of Hortonworks and its Hortonworks Data Platform (HDP). Hortonworks develops, distributes and supports HDP, which is the only 100% open source Apache Hadoop distribution. Hortonworks focuses on innovation in Apache Hadoop projects, addressing enterprise requirements, enabling ecosystem interoperability, and ensuring no vendor lock-in through its open source approach. The document discusses Hortonworks' contributions to Apache Hadoop and other projects, as well as how HDP can be used for operational data refinery, big data exploration, and application enrichment.
Trafodion – an enterprise class SQL based on Hadoop (Krishna-Kumar)
Trafodion is a joint HP Labs and HP-IT research project to develop an enterprise-class SQL on Hadoop DBMS engine that specifically targets operational workloads as opposed to analytic workloads. Operational SQL describes workloads previously described as OLTP (online transaction processing) and Operational Data Store (ODS) workloads, but expands that definition from the broad range of enterprise-level transactional applications (ERP, CRM, etc.) to include the new transactions generated from social and mobile data interactions and observations, and the new mixing of structured and semi-structured data.
Pivotal deep dive on Pivotal HD world class HDFS platform (EMC)
The document discusses Pivotal HD, a Hadoop distribution from Pivotal. It provides an overview of key features of Pivotal HD 2.0 including improved support for real-time analytics using Gemfire XD, enhanced machine learning and SQL capabilities, and integration with the Isilon storage platform. The presentation highlights how Pivotal HD can help customers build a "data lake" to store all of their data and gain insights to create new data-driven services and applications.
Hortonworks provides an open source Apache Hadoop distribution called Hortonworks Data Platform (HDP). Their mission is to enable modern data architectures through delivering enterprise Apache Hadoop. They have over 300 employees and are headquartered in Palo Alto, CA. Hortonworks focuses on driving innovation through the open source Apache community process, integrating Hadoop with existing technologies, and engineering Hadoop for enterprise reliability and support.
Supporting Financial Services with a More Flexible Approach to Big Data (WANdisco Plc)
In this webinar, WANdisco and Hortonworks look at three examples of using 'Big Data' to get a more comprehensive view of customer behavior and activity in the banking and insurance industries. Then we'll pull out the common threads from these examples, and see how a flexible next-generation Hadoop architecture lets you get a step up on improving your business performance. Join us to learn:
- How to leverage data from across an entire global enterprise
- How to analyze a wide variety of structured and unstructured data to get quick, meaningful answers to critical questions
- What industry leaders have put in place
The document discusses how Hadoop can be used for interactive and real-time data analysis. It notes that the amount of digital data is growing exponentially and will reach 40 zettabytes by 2020. Traditional data systems are struggling to manage this new data. Hadoop provides a solution by tying together inexpensive servers to act as one large computer for processing big data using various Apache projects for data access, governance, security and operations. Examples show how Hadoop can be used to analyze real-time streaming data from sensors on trucks to monitor routes, vehicles and drivers.
Hortonworks and Red Hat Webinar - Part 2 (Hortonworks)
Learn more about creating reference architectures that optimize the delivery of the Hortonworks Data Platform. You will hear more about Hive and JBoss Data Virtualization security, and you will also see in action how to combine sentiment data from Hadoop with data from traditional relational sources.
Hortonworks Hadoop @ Oslo Hadoop User Group (Mats Johansson)
This document provides an overview of Hortonworks and Hadoop. It covers Hortonworks' customer momentum, the Hortonworks Data Platform (HDP) as a multi-tenant platform for any application and data, and Hortonworks' role as a partner for customer success through its open source community leadership and support. It also summarizes the challenges of traditional data systems, how Hadoop emerged as the foundation of a modern data architecture that unifies data processing and analytics for both traditional and new data sources, and how HDP delivers a comprehensive data management platform that drives business value.
Hortonworks and Platfora in Financial Services - Webinar (Hortonworks)
Big Data Analytics is transforming how banks and financial institutions unlock insights, make more meaningful decisions, and manage risk. Join this webinar to see how you can gain a clear understanding of the customer journey by leveraging Platfora to interactively analyze the mass of raw data that is stored in your Hortonworks Data Platform. Our experts will highlight use cases, including customer analytics and security analytics.
Speakers: Mark Lochbihler, Partner Solutions Engineer at Hortonworks, and Bob Welshmer, Technical Director at Platfora
Similar to Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Why do we now have electric cars at production scale, quadcopters and drones cheap enough for the home hobbyist, and VR displays being bought by companies like Facebook?
Because the technology is there now, thanks to advances made in other industries solving problems at scale in a big marketplace.
At Scale (in this case):
$270 billion smartphone market in 2014
$120 billion internet advertising (projected 2015)
Before we dive into Hadoop and its role within the modern data architecture, let’s set the context for why Hadoop has become important.
Existing approaches to data management have become both technically and commercially impractical.
Technically, these systems were never designed to store or process vast quantities of data.
Commercially, the licensing structures of the traditional approach are no longer feasible.
These two challenges, combined with the rate at which data is being produced, created the need for a new approach to data systems. If we fast-forward another 3 to 5 years, more than half of the data under management within the enterprise will be from these new data sources.
Our goal since our inception has been very simple: to enable a Modern Data Architecture with Enterprise Hadoop. Everything we do is with this architectural goal in mind.
Single focus - enabling Apache Hadoop as an enterprise data platform for any app and any data type
In the open, partner for success.
Everything in the open.
Joint deep engineering with Microsoft (HD Insight), HP, SAP, and Teradata
In 2011, Hortonworks was founded with the 24 original Hadoop architects and engineers from Yahoo!
This original team had been working on a technology called YARN (Yet Another Resource Negotiator) that enables multiple applications to access all your enterprise data through an efficient centralized platform. It is the data operating system for Hadoop, providing the versatility to handle any application and dataset no matter the size or type.
Moreover, YARN provided the centralized architecture around which the critical enterprise services of Security, Operations, and Governance could be centrally addressed and integrate with existing enterprise policies.
This work allowed a new approach to data to emerge: the modern data architecture. At the heart of this approach is the capability for Hadoop to unify data and processing in an efficient data platform.
Our product, the Hortonworks Data Platform (or HDP for short) is a completely open source, enterprise-grade data platform that’s comprised of dozens of Apache open source projects including Apache Hadoop and YARN at its center.
We have a comprehensive engineering, testing, and certification process that integrates and packages all of these components into a cohesive platform that the enterprise can consume and deploy at scale. And our model enables us to proactively manage new innovations and new open source projects into HDP as they emerge.
To ensure the highest quality, we have a test suite, unique to Hortonworks, comprising tens of thousands of system and integration tests that we run at scale on a regular basis, including on the world’s largest Hadoop clusters at Yahoo! as part of our co-development relationship.
While our pure-play competitors focus on proprietary components for security, operations, and governance, we invest in new open source projects that address these areas.
For example, earlier in 2014, we acquired a small company called XA Secure that provided a comprehensive security and administration product. We then contributed the technology wholesale to open source as Apache Ranger.
Since our security, operations and governance technologies are open source projects, our partners are able to work with us on those projects to ensure deep integration within our joint solution architectures.
An Elasticsearch Flume sink does exist.
The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers subscribe to one or more topics.
A Kafka topic is just a sharded write-ahead log.
Messages are not deleted when they are read but retained with some configurable SLA (say a few days or a week). This allows usage in situations where the consumer of data may need to reload data.
It also makes it possible to support space-efficient publish-subscribe as there is a single shared log no matter how many consumers; in traditional messaging systems there is usually a queue per consumer, so adding a consumer doubles your data size. This makes Kafka a good fit for things outside the bounds of normal messaging systems such as acting as a pipeline for offline data systems such as Hadoop. These offline systems may load only at intervals as part of a periodic ETL cycle, or may go down for several hours for maintenance, during which time Kafka is able to buffer even TBs of unconsumed data if needed
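A minimal producer sketch for the topic abstraction described above, using the standard Kafka Java client; the broker address, topic name, key, and payload are placeholders for this example. (The retention window mentioned above is broker-side configuration, e.g. log.retention.hours.)

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SensorEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by truck ID sends all of a truck's events to one partition
            // of the topic's log, so they stay in order within that partition.
            producer.send(new ProducerRecord<>("truck-events", "truck-17", "speeding|I-10|88mph"));
        }
    }
}
```

Because the broker keeps the shared log regardless of how many consumers attach, a slow offline consumer such as a periodic Hadoop ETL job can fall behind and catch up later without any producer-side changes.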
Replication for HA/fault tolerance is built in.
Pull-based system for consumers instead of push-based.
Crude benchmark:
Basically, single-threaded synchronous messages reach about 400k per second when using 6 "datanode-ish" servers. This goes up to 2+ million per second when using partitions and asynchronous messages. Server specs in the benchmark:
Intel Xeon 2.5 GHz processor with six cores
Six 7200 RPM SATA drives
32GB of RAM
1Gb Ethernet
A traditional queue retains messages in-order on the server, and if multiple consumers consume from the queue then the server hands out messages in the order they are stored. However, although the server hands out messages in order, the messages are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the messages is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.
Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order.
http://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
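A minimal consumer sketch of the consumer-group model just described, using the Kafka Java client (poll(Duration) assumes client 2.0 or newer); the broker address, group id, and topic are placeholders. Every consumer started with the same group.id divides the topic's partitions among the group, so each partition is read in order by exactly one group member.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SensorEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // placeholder broker
        props.put("group.id", "dashboard");  // same group.id => partitions are load-balanced
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("truck-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    // Records from a single partition always arrive in order.
                    System.out.printf("partition=%d key=%s value=%s%n",
                            r.partition(), r.key(), r.value());
                }
            }
        }
    }
}
```

Starting a second copy of this process with the same group.id triggers a rebalance: the partitions split between the two consumers, giving parallelism without losing per-partition ordering.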
This all changed with the introduction of Hadoop 2 and YARN in October 2013.
YARN was introduced in MR-279 by Arun Murthy in 2009; Arun and the team at Hortonworks architected and led its development as the core change in Hadoop 2. Our view was that to truly enable Hadoop as a component of a broad data architecture, YARN was the fundamental requirement, as it turns Hadoop from a single-application data system into a multi-application data system. This is foundational to our approach of innovating from the core outwards to build Enterprise Hadoop.
With YARN it is now possible to land all data in one cluster and then access it in multiple ways: from batch to interactive to real-time.
Today, YARN, at the core of Hadoop is the center of our focus on innovation in and around Hadoop. It is clearly the enabling technology that has started a transition to a data lake within organizations.
Simply stated: Hortonworks architected and led the development of YARN in order to enable the Modern Data Architecture.
Data is ingested, it’s on the dashboard, and it’s in HDFS.
We’re going to explore a SUBSET of the data: <1M records.
BinaryClassification example from Spark
LogisticRegression model
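The notes reference Spark's BinaryClassification example. As a hedged sketch in the same spirit, here is a logistic regression trained with the DataFrame-based spark.ml API (Spark 2.x); the HDFS input path is a placeholder, and the data is assumed to be labeled (label, features) rows in LIBSVM format.

```java
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ViolationModelSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ViolationModel")
                .getOrCreate();

        // Placeholder path: labeled historical driving events in LIBSVM format.
        Dataset<Row> training = spark.read().format("libsvm")
                .load("hdfs:///demo/truck_events_libsvm");

        // Binary classifier: does this event pattern predict a violation?
        LogisticRegression lr = new LogisticRegression()
                .setMaxIter(10)      // gradient iterations
                .setRegParam(0.01);  // L2 regularization strength
        LogisticRegressionModel model = lr.fit(training);

        System.out.println("Coefficients: " + model.coefficients());
        spark.stop();
    }
}
```

The trained model can then be loaded inside a Storm bolt to score incoming events, which is the "predict violations in real-time" pattern the trucking demo describes.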