Internet of things Crash Course Workshop

© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Real-Time Processing in Hadoop
Hadoop Summit 2015
Ali Bajwa
Partner Solutions Engineer
June 2015

© Hortonworks Inc. 2012
Professional Services
Agenda
 Introduction & about Hortonworks HDP
 Overview of logistics industry scenario
 Overview of streaming architecture on HDP
 Streaming Demo #1
 Integrating Predictive Analytics in streaming scenarios
 Streaming Demo with Predictive additions
 Q & A
Page 2

Preface: Enabling Technologies
Page 5
• Problems solved at scale, via fundamentally new approaches…
• Make it possible, even simple, to produce new products/applications that would
have been too cost prohibitive – or simply impossible - beforehand.
• Where foundation tech like Li-Ion batteries, retina displays, GPS & tiny HD cameras
(from smartphones) have enabled Electric cars, quad-copters, VR displays, & more…
• Hadoop has similarly led to breakthroughs in big data scale & capability, and enables
new real-time advanced analytic applications.

Why did Hadoop emerge?
April 2015

Traditional systems under pressure
Challenges
• Constrains data to app
• Can’t manage new data
• Costly to Scale
Business Value
Clickstream
Geolocation
Web Data
Internet of Things
Docs, emails
Server logs
2012
2.8 Zettabytes
2020
40 Zettabytes
LAGGARDS
INDUSTRY
LEADERS
1
2 New Data
ERP CRM SCM
New
Traditional

Hadoop for the Enterprise: Implement a
Modern Data Architecture with HDP
Spring 2015
Hortonworks. We do Hadoop.

Hadoop for the Enterprise:
Implement a Modern Data Architecture with HDP
Customer Momentum
• 330+ customers (as of year-end 2014)
Hortonworks Data Platform
• Completely open multi-tenant platform for any app & any data.
• A centralized architecture of consistent enterprise services for
resource management, security, operations, and governance.
Partner for Customer Success
• Open source community leadership focus on enterprise needs
• Unrivaled world class support
• Founded in 2011
• Original 24 architects, developers,
operators of Hadoop from Yahoo!
• 600+ Employees
• 1000+ Ecosystem Partners

Customer Partnerships matter
Driving our innovation through
Apache Software Foundation Projects
Apache Project Committers
PMC
Members
Hadoop 27 21
Pig 5 5
Hive 18 6
Tez 16 15
HBase 6 4
Phoenix 4 4
Accumulo 2 2
Storm 3 2
Slider 11 11
Falcon 5 3
Flume 1 1
Sqoop 1 1
Ambari 34 27
Oozie 3 2
Zookeeper 2 1
Knox 13 3
Ranger 10 n/a
TOTAL 161 108
Source: Apache Software Foundation. As of 11/7/2014.
Hortonworkers are the architects and
engineers that lead development of open
source Apache Hadoop at the ASF
• Expertise
Uniquely capable to solve the most complex issues &
ensure success with latest features
• Connection
Provide customers & partners direct input into
the community roadmap
• Partnership
We partner with customers with subscription offering.
Our success is predicated on yours.
27
Cloudera: 11
Facebook: 5
LinkedIn: 2
IBM: 2
Others: 23
Yahoo
10

Technology Partnerships matter
Apache Project Hortonworks
Relationship
Named
Partner
Certified
Solution
Resells
Joint
Engr
Microsoft    
HP    
SAS   
SAP    
IBM   
Pivotal   
Redhat   
Teradata    
Informatica   
Oracle  
It is not just about
packaging and certifying
software…
Our joint engineering
with our partners drives
open source standards
for Apache Hadoop
HDP is
Apache Hadoop

HDP delivers a Centralized Architecture
Modern Data Architecture
• Unifies data and processing.
• Enables applications to have access to
all your enterprise data through an
efficient centralized platform
• Supported with a centralized approach
governance, security and operations
• Versatile to handle any applications
and datasets no matter the size or type
Clickstream Web
& Social
Geolocation Sensor
& Machine
Server
Logs
Unstructured
SOURCES
Existing Systems
ERP CRM SCM
ANALYTICS
Data
Marts
Business
Analytics
Visualization
& Dashboards
ANALYTICS
Applications
Business
Analytics
Visualization
& Dashboards
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
HDFS
(Hadoop Distributed File System)
YARN: Data Operating System
Interactive Real-TimeBatch Partner ISVBatch BatchMP
P
EDW

HDP delivers a completely open data platform
Hortonworks Data Platform 2.2
Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core
enterprise services, for any application and any data.
Completely Open
• HDP incorporates every element
required of an enterprise data
platform: data storage, data access,
governance, security, operations
• All components are developed in
open source and then rigorously
tested, certified, and delivered as an
integrated open source platform that’s
easy to consume and use by the
enterprise and ecosystem.
(Cluster Resource Management)
1 ° ° ° ° ° ° °
° ° ° ° ° ° ° °
ApachePig
° °
° °
° ° °
° ° °
HDFS
GOVERNANCE BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
Apache Falcon
ApacheHive
Cascading
ApacheHBase
ApacheAccumulo
ApacheSolr
ApacheSpark
ApacheStorm
Apache Sqoop
Apache Flume
Apache Kafka
SECURITY
Apache Ranger
Apache Knox
Apache Falcon
OPERATIONS
Apache Ambari
Apache
Zookeeper
Apache Oozie

Real World Use Case:
Trucking Company
Spring 2015
Hortonworks. We do Hadoop.

Scenario Overview
.

Trucking company w/ large fleet of trucks in Midwest
A truck generates millions of events for a
given route; an event could be:
 'Normal' events: starting / stopping of the
vehicle
 ‘Violation’ events: speeding, excessive
acceleration and breaking, unsafe tail distance
Company uses an application that monitors
truck locations and violations from the
truck/driver in real-time
Route?
Truck?
Driver?
Analysts query a broad
history to understand if
today’s violations are
part of a larger problem
with specific routes,
trucks, or drivers

Distributed Storage: HDFS
Many Workloads: YARN
Trucking Company’s YARN-enabled Architecture
Stream Processing
(Storm)
Inbound Messaging
(Kafka)
Real-time Serving
(HBase)
Alerts & Events
(ActiveMQ)
Real-Time
User Interface
One cluster with consistent
security, governance &
operations
SQL
Interactive Query
(Hive on Tez)
Truck Sensors

What is Kafka? APACHE KAFKA
 High throughput distributed messaging
system
 Publish-Subscribe semantics but re-
imagined at the implementation level to
operate at speed with big data volumes
 Kafka @LinkedIn:
 800 billion messages per day
 175 terabytes of data written per day
 650 terabytes of data read per day
 Over 13 million messages/2.75GB of data
per second
Kafka
Cluster
producer
producer
producer
consumer
consumer
consumer

Kafka: Anatomy of a Topic
Partition 0 Partition 1 Partition 2
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10
11 11
12
Writes
Old
New
APACHE KAFKA
 Partitioning allows topics to
scale beyond a single
machine/node
 Topics can also be replicated,
for high availability.

Apache Storm
• Distributed, real time, fault tolerant Stream Processing platform.
• Provides processing guarantees.
• Key concepts include:
•Tuples
•Streams
•Spouts
•Bolts
•Topology
Page 22

Tuples and Streams
• What is a Tuple?
–Fundamental data structure in Storm. Is a named list of values that can be of any data type.
Page 23
• What is a Stream?
–An unbounded sequences of tuples.
–Core abstraction in Storm and are what you “process” in Storm

Spouts
• What is a Spout?
–Generates or a source of Streams
–E.g.: JMS, Twitter, Log, Kafka Spout
–Can spin up multiple instances of a Spout and dynamically adjust as needed
Page 24

Bolts
• What is a Bolt?
–Processes any number of input streams and produces output streams
–Common processing in bolts are functions, aggregations, joins, read/write to data stores, alerting
logic
–Can spin up multiple instances of a Bolt and dynamically adjust as needed
• Bolts used in the Use Case:
1. HBaseBolt: persisting and counting in Hbase
2. HDFSBolt: persisting into HFDS as Avro Files using Flume
3. MonitoringBolt: Read from Hbase and create alerts via email and a message to ActiveMQ if the
number of illegal driver incidents exceed a given threshhold.
Page 25

Topology
• What is a Topology?
–A network of spouts and bolts wired together into a workflow
Page 26
Truck-Event-Processor Topology
Kafka Spout
HBase
Bolt
Monitoring
Bolt
HDFS
Bolt
WebSocket
Bolt
Stream Stream
Stream
Stream

Key Constructs in Apache HBase
• HBase = Key / Value store
• Designed for petabyte scale
• Supports low latency reads, writes and updates
• Key features
– Updateable records
– Versioned Records
– Distributed across a cluster of machines
– Low Latency
– Caching
• Popular use cases:
– User profiles and session state
– Object store
– Sensor apps
Page 28

Data Assignment
Page 29
HBase Table
Keys within HBase
Divided among
different RegionServers

Data Access
• Get
–Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with
a matching rowkey
• Put
–Inserts a new version of a cell.
• Scan
–The whole table, row by row, or a section of that table starting at a particular start key and
ending at a particular end key
• Delete
–It is actually a version of put(Add a new version with put with a deletion marker)
• SQL via Apache Phoenix
–Unique capability in the NoSQL market
Page 30

20092006
1 ° ° ° ° °
° ° ° ° ° N
HDFS
MapReduce
Largely Batch Processing
Hadoop w/
MapReduce
1 ° ° ° ° ° ° ° ° °
° ° ° ° ° ° ° ° °
°
°N
HDFS
Hadoop2 & YARN based Architecture
Silo’d clusters
Largely batch system
Difficult to integrate
MR-279: YARN
Hadoop 2 & YARN
Interactive Real-TimeBatch
Architected &
led development
of YARN to enable
the Modern Data
Architecture
October 23, 2013

Benefits of YARN as the Data Operating System
• The container based model allows for running nearly any workload.
–Enables the centralized architecture.
–No longer is MapReduce the only data processing engine.
–Docker containers managed by YARN. Yes Please!
• Decouples resource scheduling from application lifecycle.
–Improved scalability and fault tolerence
• Dynamically allocated resources, resulting in HUGE utilization gains
–Versus static allocation of “slots” in Hadoop 1.0
Page 33
Yahoo has over 30000 nodes running YARN across over 365PB of data.
They calculate running about 400,000 jobs per day for about 10 million hours of compute time.
They also have estimated a 60% – 150% improvement on node usage per day since moving to YARN.

Apache HDFS – Hadoop Distributed File
System
• Very large scale distributed file system
• 10K nodes, tens of millions files and PBs of data
• Supports large files
• Designed to run on commodity hardware, assumes hardware failures
• Files are replicated to handle hardware failure
• Detect failures and recovers from them automatically
• Optimized for Large Scale Processing
• Data locations are exposed so that the computations can move to where data resides
• Data Coherency
• Write once and read many times access pattern
• Files are broken up in chunks called ‘blocks’
• Blocks are distributed over nodes
Page 35

Streaming Demo - High Level Architecture
Distributed Storage: HDFS
YARN
Storm Stream Processing
Kakfa Spout
HBase
Dangerous
Events Table
Hbase
Bolt
HDFS
Bolt
Truck Events
Active
MQ
Monitoring
Bolt
Web App
Truck Streaming Data
T(1) T(2) T(N)
Inbound Messaging
(Kafka)
Truck Events Topic

Demo – Streaming Dashboard
.

Lab #1: bit.ly/1L3RLMo
Lab #2: bit.ly/1FW7ENl (<-lower case L)
Lab #3: bit.ly/1L3S0ah
Shell cheatsheet: bit.ly/1JN8EsO
Slides: bit.ly/1MtVoIL (<-capital I)
Twitter demo: github.com/abajwa-hw/hdp22-twitter-demo
Custom services: github.com/hortonworks-gallery
webinars: hortonworks.com/partners/learn email: abajwa@
IoT demo: youtube.com/watch?v=FHMMcMYhmNI

Internet of things Crash Course Workshop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Internet of things Crash Course Workshop

Similar to Internet of things Crash Course Workshop (20)

More from DataWorks Summit

More from DataWorks Summit (20)

Recently uploaded

Recently uploaded (20)

Internet of things Crash Course Workshop

Editor's Notes