Simplifying Real-Time Architectures for IoT with Apache Kudu

© Cloudera, Inc. All rights reserved.
Vijay Raja | Solutions Marketing Lead, IoT
Ryan Lippert | Product Marketing, Operational DB

IoT – Key Drivers & Objectives
Drive Internal Efficiencies
• Predictive Maintenance
• Real-time monitoring
• Ops optimization
• Reduced equipment down-times
Improve Product & Customer Exp.
• Product Usage Analytics
• Personalized products & offerings
• Improved Product Development
New Services & Business Models
• New usage based business models
• New service offerings
• E.g. On Command Connect
• Remote Monitoring
Who are my customers?
How are they using my products?
How can I lower downtime?
How can I drive efficiencies?
How do we implement a usage-based model?
How can I launch new revenue streams?

2 PB of data / car / year
1 – 2 TB of data / day
1 – 5 TB of data / day

IoT Data Characteristics
- The Foundation of Hadoop’s Potential
IoT data comes from a variety of different sources
• Massive volumes of intermittent data streams
• Generated from a variety of data sources
• Predominantly time-series
• Can come in streams (real-time) or batches
• Diverse data structures and schemas
• Some of it may be perishable
Combining sensor data with contextual data is the key to
value creation from IoT
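As a minimal sketch of what combining sensor data with contextual data can look like, the Python below joins raw readings with a per-device context record. The field names, device IDs and temperature limits are hypothetical and chosen only for illustration; they do not come from the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    device_id: str        # hypothetical sensor identifier
    timestamp: int        # seconds since epoch
    temperature_c: float


# Hypothetical contextual data, e.g. exported from an asset-management system.
DEVICE_CONTEXT = {
    "pump-1": {"site": "plant-a", "max_temperature_c": 80.0},
    "pump-2": {"site": "plant-b", "max_temperature_c": 95.0},
}


def enrich(reading, context):
    """Join one sensor reading with the contextual record for its device."""
    info = context.get(reading.device_id, {})
    limit = info.get("max_temperature_c")
    return {
        "device_id": reading.device_id,
        "timestamp": reading.timestamp,
        "temperature_c": reading.temperature_c,
        "site": info.get("site"),
        # The raw value only becomes actionable once the device's limit is known.
        "over_limit": limit is not None and reading.temperature_c > limit,
    }


readings = [
    Reading("pump-1", 1_700_000_000, 85.0),
    Reading("pump-2", 1_700_000_000, 85.0),
]
enriched = [enrich(r, DEVICE_CONTEXT) for r in readings]
for row in enriched:
    print(row["device_id"], row["site"], row["over_limit"])
```

The same 85.0 reading is a problem for one device and normal for the other, which is the point of the enrichment step.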

Polling Question - 1
Where is your organization in your IoT journey?
A. Not sure where to start
B. Currently exploring use cases
C. Implementing our first IoT use case
D. Already deployed first IoT use case
E. Multiple IoT use cases in production
(Single Choice)

The IoT Ecosystem & Architecture
Sensors/Things
• Analytics at the edge
• For immediate response
IoT Gateway
• Data Routing
• Edge-Processing
• Edge-Storage
Data Center / Cloud: IoT Data Storage, Processing & Analytics
Distributed Data Processing & Analytics
• Cloud & On-Premise
Centralized IoT Data Analytics
• Time Series Data, Trends
• Machine Learning
• Context Enrichment
• Deeper business insights
IoT Analytics
Enterprise Data Sources

What Happens at the Edge & What Happens in the Cloud?
Edge Analytics
• Analytics that need to be acted upon immediately
• Low-latency requirements: hazard detection, collision avoidance, etc.
Cloud Analytics
• Human response times
• Context enrichment
• Time series analysis
• Comparative / trend analysis
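One way to picture the edge/cloud split is a routing rule applied to each event as it is produced. The sketch below is an illustration only: the flag names and the rule itself are assumptions, not something the deck specifies.

```python
# Hypothetical event flags that need an immediate, local reaction.
EDGE_FLAGS = ("hazard_detected", "collision_detected")


def route(event):
    """Return "edge" for events that need an immediate response, else "cloud"."""
    if any(event.get(flag) for flag in EDGE_FLAGS):
        return "edge"
    return "cloud"


events = [
    {"vin": "VIN-1", "speed_kmh": 62, "hazard_detected": True},
    {"vin": "VIN-1", "speed_kmh": 64},
]
for e in events:
    print(route(e))
```

Events routed to "cloud" would still be the input for enrichment and trend analysis; the rule only decides where the first reaction happens.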

Cloudera Enterprise – Hadoop as a Data Platform for IoT
Sensors/IoT Data Sources, Internal Systems, External Sources
Sensor/IoT Data, IoT Gateway, Data Center, Cloud
• Data Storage
• Data Processing
• Real-time Analytics
BI Solutions, Real-Time Apps, Search, Data Science Workbench, SQL, Machine Learning
OPERATIONS: Cloudera Manager, Cloudera Director
DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
INTEGRATE: BATCH (Sqoop), REAL-TIME (Kafka, Flume)
PROCESS, ANALYZE, SERVE: BATCH (Spark, Hive, Pig, MapReduce), STREAM (Spark), SQL (Impala), SEARCH (Solr), SDK (Partners)
UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN), SECURITY (Sentry, RecordService)
STORE: FILESYSTEM (HDFS), RELATIONAL (Kudu), NoSQL (HBase)

IoT: Lots of Buzz, but what is the core concept?
And critically, what do we need from our infrastructure?
IoT promises prediction and optimization, but often delivers monitoring.
The right solution allows you to analyze data and serve information in time to change business outcomes.
That means the right solution is built on real-time analytics.

IoT: Driven by Data

Polling Question - 2
What area of the real-time data chain does your organization need the
most help with?
A. Data ingest
B. Data processing
C. Data serving
D. All of the above
(Single Choice)

Traditional Hadoop Databases Leave a Gap
Use cases that fall between HDFS and HBase were difficult to manage
[Chart: Pace of Data plotted against Pace of Analysis; axis labels: Unchanging; Fast Changing, Frequent Updates; Append-Only; Real-Time]
HDFS
• Fast Scans, Analytics and Processing of Stored Data
• Arbitrary Storage (Active Archive)
HBase
• Fast On-Line Updates & Data Serving
Analytic Gap
• Fast Analytics (on fast-changing or frequently-updated data)
• Complex Hybrid Architectures
The Trouble with Lambda
[Diagram: Lambda architecture]
New Data
Batch Layer (Hadoop): Data Lake (HDFS), Batch Recompute, Precompute Views
Speed Layer (Storm/Spark): Stream or Micro Batch, “Real-time” Increment, Increment Views
Serving Layer (HBase, Impala): Merge
Data
Application
• Code must be kept in sync
• Restatement is difficult
Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu
Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic applications
• Designed for next-generation hardware for faster analytic performance across frameworks
• Native Hadoop storage engine
Flexibility for the right tools for the right use case in one platform
• Only analytic database for Hadoop with Kudu + Impala
• Simple real-time applications with Kudu + Spark
Use cases
• Time series data
• Machine data analytics
• Online reporting
INTEGRATE: STRUCTURED (Sqoop), UNSTRUCTURED (Kafka, Flume)
BATCH (Spark, Hive, Pig, MapReduce), STREAM (Spark), SQL (Impala), SEARCH (Solr), OTHER (Kite)
UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN), SECURITY
STORE: NoSQL (HBase), FILESYSTEM (HDFS), RELATIONAL (Kudu), OBJECT (Cloud)

Kudu: Fast Analytics on Fast-Changing Data
New storage engine enables new Hadoop use cases
[Chart: Pace of Data plotted against Pace of Analysis; axis labels: Unchanging; Fast Changing, Frequent Updates; Append-Only; Real-Time]
HDFS
• Fast Scans, Analytics and Processing of Stored Data
• Arbitrary Storage (Active Archive)
HBase
• Fast On-Line Updates & Data Serving
Kudu: Kudu fills the gap
• Fast Analytics (on fast-changing or frequently-updated data)
Analytic Gap
• Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS

Better Together
Kudu Benefits from Integration with the Apache Ecosystem
Spark – Stream Processing for Kudu
• Open standard for real-time stream processing
• Effective for automating decision processes and machine learning
• Use Cases include: Time Series Data & Machine Data Analytics
Impala – High-Performance BI & SQL for Kudu
• Open standard for interactive SQL queries
• Powers analytic database workloads with flexibility, scale, and open architecture
• Use Cases include: Online Reporting

Why Kudu, Why Cloudera?
A simultaneous combination of sequential and random reads and writes
Time Series Data
Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes?
Machine Data Analytics
Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning?

Kudu Increases the Value of Time Series Data
Time Series
Workload: Inserts, updates, scans, lookups
Examples: Stream market data; IoT; fraud detection & prevention; risk monitoring; connected cars
Time series data is most valuable if you can analyze it to change outcomes in real time.
Kudu simultaneously enables:
• Time series data inserted/updated as it arrives
• Analytic scans to find trends on fresh time series data
• Lookups to quickly visit the point in time where an event occurred
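The three access patterns listed above (insert/update on arrival, range scans, point lookups) can be illustrated with a small in-memory table keyed by device and timestamp. This is a plain-Python stand-in to show the workload shape, not Kudu's client API; the class and method names are hypothetical.

```python
import bisect


class TimeSeriesTable:
    """In-memory stand-in for a table keyed by (device_id, timestamp)."""

    def __init__(self):
        self._keys = []   # sorted list of (device_id, timestamp)
        self._rows = {}   # key -> value

    def upsert(self, device_id, timestamp, value):
        """Insert a new row, or overwrite the row that has the same key."""
        key = (device_id, timestamp)
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def lookup(self, device_id, timestamp):
        """Point lookup: the value at one exact point in time, or None."""
        return self._rows.get((device_id, timestamp))

    def scan(self, device_id, start, end):
        """Range scan: yield (timestamp, value) for start <= timestamp < end."""
        lo = bisect.bisect_left(self._keys, (device_id, start))
        hi = bisect.bisect_left(self._keys, (device_id, end))
        for key in self._keys[lo:hi]:
            yield key[1], self._rows[key]


table = TimeSeriesTable()
table.upsert("car-1", 100, 50.0)
table.upsert("car-1", 101, 52.0)
table.upsert("car-2", 100, 30.0)
table.upsert("car-1", 101, 55.0)  # a late correction overwrites the earlier value

print(table.lookup("car-1", 101))
print(list(table.scan("car-1", 100, 102)))
```

Keeping the keys sorted by device and then time is what lets the scan touch only one device's contiguous range, while the update still lands in place.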

Kudu Keeps Your Business Operational
Machine Data Analytics
Workload: Inserts, scans, lookups
Examples: Network threat detection; network health monitoring; application performance monitoring
Kudu can help spot problems before they happen. Real-time data inserts, combined with the ability to analyze trends, identify potential problems.
Kudu identifies trouble through:
• Unlimited storage, yielding better historic trend analysis
• Fast inserts to enable an up-to-date network view
• Fast scans to identify and flag undesired states for remedy

Operational DB: Real-Time Architecture
Driving the Model Through Machine Learning
[Diagram components: IoT Data Sources, Other Data Sources, Kafka, Spark Streaming, Spark MLlib, IoT Analytics; labels: Individual Session, Full Model/Learning, Genesis, Spark]
1 Event Occurs
2 Messaging
3 Stream Processing
4 Land in Relational Store
5 Apply ML Libraries

MLlib & K-Means: Defining Microsegments via Machine Learning
[Figure: four panels (1 to 4) plotting Weight against Height; clusters labeled S, M, L, XL and XS, S, M, L; final label "Near Custom ?"]
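K-Means, named in the slide title, groups points around centroids and moves each centroid to the mean of its members until the assignment settles. The sketch below is a plain-Python version of that loop on made-up height/weight pairs; it is not Spark MLlib code, and the data, starting centroids and size labels are all hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; `centroids` supplies the starting positions."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else c
            for members, c in zip(clusters, centroids)
        ]
    return centroids, clusters


SIZES = ["S", "M", "L"]  # hypothetical labels, ordered like the starting centroids


def segment(point, centroids):
    """Assign a new (height, weight) point to the size of its nearest centroid."""
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(point, centroids[i]))
    return SIZES[nearest]


# Made-up (height in cm, weight in kg) samples.
points = [(150, 50), (152, 53), (165, 65), (167, 68), (180, 85), (183, 88)]
centroids, clusters = kmeans(points, [(150, 50), (165, 65), (180, 85)])
print(centroids)
print(segment((168, 70), centroids))
```

Raising the number of starting centroids produces finer segments, which is the idea behind moving from a few sizes toward near-custom ones. A real model would also scale the features so that one dimension does not dominate the distance.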

Driving Prediction and Optimization
[Diagram components: IoT Data Sources, Other Data Sources, Kafka, Spark Streaming, Spark MLlib, IoT Analytics; labels: Individual Session, Full Model/Learning, Genesis, Spark]
1 Data Processed
2 Request Processed/Kudu Queried
3 Results Returned
4 Results Processed
5 Processed Data Returned

Driving Prediction and Optimization
Step 1: Data Processed
Apache Spark processes the data from the event (car sensors, manufacturing, wearables, etc.), which potentially involves keeping a running list of the last X events
Step 2: Request Processed/Kudu Queried
A Spark application uses the data gathered in step 1 to query Kudu in a predefined manner, looking for similar patterns defined via machine learning
Step 3: Kudu Results Returned
Kudu returns the results of the query in step 2 to Spark, which determines what needs to be returned to the application
Step 4: Results Processed
Spark associates the results from Kudu with the information stored from the current event to determine the next step to feed back to the application
Step 5: Processed Data Returned
The machine-generated, best possible outcome is prescribed and served to the application
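The five steps can be traced in a few lines of plain Python. In this sketch a dictionary stands in for the patterns stored in Kudu and a bounded deque stands in for the running list of the last X events. The feature choice, pattern values and action strings are hypothetical, and no Spark or Kudu API is used.

```python
import math
from collections import deque

# Hypothetical learned patterns: (mean speed, mean brake rate) -> recommended action.
# In the deck's architecture these would be stored in Kudu and learned via Spark MLlib.
PATTERNS = {
    (30.0, 0.1): "no action",
    (90.0, 0.6): "suggest service check",
}


def handle_event(window, speed, braking):
    """Process one event for a session whose recent events live in `window`."""
    # Step 1: process the event and keep a running list of the last X events.
    window.append((speed, braking))
    features = (
        sum(s for s, _ in window) / len(window),
        sum(b for _, b in window) / len(window),
    )
    # Steps 2 and 3: query the stored patterns and get the closest match back.
    match = min(PATTERNS, key=lambda p: math.dist(p, features))
    # Step 4: associate the match with the current session's data.
    result = {"features": features, "action": PATTERNS[match]}
    # Step 5: return the prescribed outcome to the application.
    return result


session = deque(maxlen=3)  # X = 3 in this sketch
for speed, braking in [(30, 0), (95, 1), (100, 1)]:
    print(handle_event(session, speed, braking)["action"])
```

The bounded deque drops the oldest event automatically, so the features always describe the most recent X events of the session.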

Operational DB: IoT Use Case
Prediction and Optimization
[Diagram components: IoT Data Sources, Other Data Sources, Sensor Data, Kafka, Spark Streaming, Spark MLlib, Spark, Kudu, Application; labels: Individual Session, Full Model/Learning]
• Data Request Sent For Stream Processing
• Data Cleaned/Ordered/Processed, Then Delivered to Kudu for Modelling
Automated processes based on machine learning enable prediction and optimization at a new level.
Illustrative; models will likely have >2 dimensions

Key IoT Use Cases

Using Predictive Maintenance to Improve Performance and Reduce Fleet Downtime
• Real-time visibility of 300,000+ trucks in order to improve uptime and vehicle performance
• OnCommand Connection is collecting telematics and geolocation data across the fleet
• Reduced maintenance costs to $.03 per mile from $.12-$.15 per mile
• Centralizing data from 13 systems with varying frequency and semantic definitions
TRANSPORTATION
» PREDICTIVE MAINTENANCE
» IMPROVED SERVICE
» DATA DRIVEN PRODUCTS
DATA-DRIVEN PRODUCTS
CASE STUDY

Predictive Maintenance on industrial-grade turbines for hydro power stations
Challenge:
• Gather, store and analyze noise levels from turbines for anomaly detection
Solution:
• Cloudera platform used to gather and analyze acoustic data/audio files coming from the turbines in real-time
• Using diagnostic solution to monitor the health of turbines and predict failures in advance
PREDICTIVE MAINTENANCE
» INDUSTRIAL IoT
» LOWERED DOWNTIME
» LOWERED COSTS
Predictive Maintenance - Turbines
DATA-DRIVEN PROCESS
CASE STUDY
DATA-DRIVEN PRODUCTS

#1 Telematics provider with 130 billion miles of driving data collected from black boxes in connected cars
Challenge:
• Drive analytics on 12 million miles of driving data collected every hour
Solution:
• Telematics solution based on Cloudera to process data from black boxes
• Analytics around driving behavior, risks, location, braking patterns, contextual elements and crash information
TELEMATICS
» CONNECTED VEHICLES
» INSURANCE TELEMATICS
» PREDICTIVE ANALYTICS
Connected Car Telematics for Insurance
CASE STUDY
DATA-DRIVEN PROCESS
DATA-DRIVEN PRODUCTS

Powering a Variety of IoT Use Cases…
Connected Vehicles
Usage Based Insurance
Industrial IoT
Predictive Maintenance
Smart Cities/Ports
Oil & Gas
Aerospace & Aviation
Smart Healthcare

Connected Car Demo

Connected Car – Demo Architecture
Cloudera Enterprise Data Hub
OPERATIONS: Cloudera Manager, Cloudera Director
DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
INTEGRATE: BATCH (Sqoop), REAL-TIME (Kafka, Flume)
BATCH (Spark, Hive, Pig, MapReduce), STREAM (Spark), SQL (Impala), SEARCH (Solr), SDK (Partners)
UNIFIED SERVICES: RESOURCE MANAGEMENT (YARN), SECURITY
STORE: FILESYSTEM (HDFS), RELATIONAL (Kudu), NoSQL (HBase)
Connected Car Simulator
MQTT - Kafka Bridge
Data Ingest & Pipeline
StreamSets Data Collector
Enterprise Data Hub
BI & Visualization
Streaming Data:
• Time
• VIN
• Location
• Mileage
• Speed
• Acceleration
• Brakes applied?
• Turn signal on?
• Lane departed?
• Collision detected?
• Hazard detected?

Connected Car – Demo Architecture
Cloudera Enterprise Data Hub
Connected Car Simulator
MQTT - Kafka Bridge
Data Ingest & Pipeline
StreamSets Data Collector
Enterprise Data Hub
BI & Visualization
Streaming Data:
• Time
• VIN
• Location
• Mileage
• Acceleration
• Speed
• Brakes applied?
• Turn signal on?
• Lane departed?
• Collision detected?
• Hazard detected?
Data Storage Layer
Search
#2
#1
Pub-Sub Messaging System
Real-Time Processing Engine
Interactive SQL Engine
Thank You
