HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB

HBaseCon
Imagination at work
Industrial Internet Case Study
using HBase and TSDB
Shyam Varan Nath
Arnab Guin
May, 2015
Agenda
• Introduction to IoT and Industrial Internet
• Industrial Use Case
• Technology Details
• Wrap up
About Shyam
• Architect – Industrial Analytics @GE
• Worked at IBM, Deloitte, Oracle and Halliburton prior to GE
• Regular speaker at technology conferences such as Oracle OpenWorld, IoT Summit, Collaborate and BIWA Summit on IoT, Big Data and Analytics related topics
• Education: IIT Kanpur (B Tech EE), MBA and MS
Computer Science from Florida Atlantic University
About Arnab
• Staff Software Engineer, Big Data – Predix
Platform @GE
• Working on big data, analytics (Past HBaseCon
attendee)
• Past work in several domains – audience
research, genomics, compilers, EE
• Education: BS Computer Science, Masters
(Quantitative Finance), CS/EE graduate studies
@Stanford
GE – Guiding Principle
Industrial Internet Makes Machines Better!
Overview – Industrial Internet as a Service
Industrial Data Lake
Industrial Internet Application
Industrial
Machines
https://www.gesoftware.com/news-events/featured-stories/controls%E2%80%94-brilliant-edge-ge%E2%80%99s-industrial-internet
Connectivity
Analytics
Value to Business User
Aviation Use Case - Jet Engines
Speed Sensor
Exhaust Gas Temperature
(EGT) sensors
Temperature Sensors
Temperature Sensors
Pressure sensors
* Simplified view of some of the sensors
Making “Sense” of the “Sensors”
EGT = Exhaust Gas Temperature
The temperature of the exhaust gases as they enter the tail pipe, after passing through the turbine
A good indicator of the health of the engine (just like human body temperature)
Recording & interpreting the EGT can help to detect several jet engine problems.
http://www.geaviation.com/press/cf34/cf34_20140513b.html
Business Problem
• The aircraft collects data about its operations, including the jet engine, using the Quick Access Recorder (QAR)
• Engine analytics applied to such data in near real-time can be used to proactively diagnose problems and reduce unplanned downtime
• Continuous Engine Operations Data (CEOD) can be up to 500GB per flight. With 300K flights per day, it soon becomes a Big Data problem
• Data Lake is a fertile ground for Big Data Analytics to understand jet engine behavior and problems over its life of ~30 yrs
• Analytics developed with the full data can be deployed against summary information in near real-time
http://en.wikipedia.org/wiki/Quick_access_recorder
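As a rough illustration of the scale claim above, the sketch below multiplies the two figures quoted on the slide. It assumes every one of the 300K daily flights produces the 500GB maximum and uses decimal units (1 PB = 1,000,000 GB), so the result is an upper bound and not a measured volume. The class and method names are illustrative only.

```java
// Back-of-envelope estimate of daily CEOD volume.
// Assumption: every flight produces the per-flight maximum quoted on the slide.
class CeodVolumeEstimate {
    static final long GB_PER_PB = 1_000_000L; // decimal units

    /** Upper-bound daily volume in petabytes. */
    static double dailyPetabytes(long gbPerFlight, long flightsPerDay) {
        return (double) (gbPerFlight * flightsPerDay) / GB_PER_PB;
    }

    public static void main(String[] args) {
        // 500 GB x 300,000 flights = 150,000,000 GB
        System.out.println(dailyPetabytes(500, 300_000) + " PB/day"); // prints 150.0 PB/day
    }
}
```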
Rapid access to all data for analytics
1. All data: access to real-time data and historical data, not limited to a snapshot of data
2. Any data: handling of all data types, including documents, images, machine data and sensor data
3. One place: access to all data in one place to quickly respond to the speed of business change
• How long will it last without failures or maintenance?
• Is my asset ready when there is market opportunity?
• Is my asset performing optimally?
• How to configure for best operational results?
New approach – Industrial Data Lake architecture
FLEXIBLE DATA MODELS
INDUSTRIAL DATA LAKE
Users: Data scientist, Field operations, Business analyst
Data sources: Sensor data; Content (images, videos, manuals, etc.); Machine data; Historian data; CRM, ERP, etc.; Logs, click streams; Geo-location data; Social network data
Industrial Data Lake
• 50B machines will be connected on the internet by 2020
• 2X industrial data growth within next 10 years
• In practice only 3% of potentially useful data is tagged and even less is analyzed*
• 9M data points per hour for each locomotive
• 500GB data per blade by gas turbines
• 35GB data per day from each Smart Meter
• 50X data growth in healthcare (2012 – 2020)
• 1TB data per flight
Data sources: Sensor data; Content (images, videos, manuals, etc.); Historian data; Machine data; CRM, ERP, etc.; Logs; Social network data; Geo-location data
*Source: IDC
© General Electric Company, 2014. All Rights Reserved.
Data Flow - Ingestion, Storage, Analytics
Diagram inputs: Machine data, Sensor data
System Components
Ingestion
- High speed real-time data input in the time domain
(streams)
- Batch processing (files)
Transport Layer
- RabbitMQ
Security
Fault tolerance
- Multiple containers (ingestion)
- Highly available queues (transport)
- Multiple masters, replication (storage)
- Multi-node ZooKeeper quorum (coordination)
Diagram components: Ingestion, Storage, Transport, Security, ZooKeeper
Read-Write Tracks
(Diagram: separate WRITE and READ tracks over HBase / TSDB)
• WRITE, HBase / TSDB: TimeRange, Tag/Metric, Tag Value, Attributes
• READ, HBase / TSDB: Timestamp, Tag/Metric, Tag Value, Attributes
• TimeMessage: Block Write, Atomic Write (Async, Parallel)
• TimeQuery: Block Read, Atomic Read (Async, Parallel)
• Time axis: s, e0, now
Block reads can be synchronized (futures)
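The block read idea above (split a time range, read the pieces asynchronously in parallel, then synchronize on futures) can be sketched roughly as follows. This is a minimal illustration and not the presenters' implementation: `BlockReadSketch`, `splitRange`, `blockRead` and `fakeAtomicRead` are hypothetical names, and the storage read is replaced by a stand-in that fabricates one timestamp per second.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiFunction;

class BlockReadSketch {
    /** Divide: split [start, end) into consecutive blocks of at most blockMillis. */
    static List<long[]> splitRange(long start, long end, long blockMillis) {
        List<long[]> blocks = new ArrayList<>();
        for (long s = start; s < end; s += blockMillis) {
            blocks.add(new long[] {s, Math.min(s + blockMillis, end)});
        }
        return blocks;
    }

    /** Conquer: submit one atomic read per block, then join the futures in time order. */
    static List<Long> blockRead(long start, long end, long blockMillis,
                                BiFunction<Long, Long, List<Long>> atomicRead,
                                ExecutorService pool) {
        List<CompletableFuture<List<Long>>> futures = new ArrayList<>();
        for (long[] block : splitRange(start, end, blockMillis)) {
            futures.add(CompletableFuture.supplyAsync(
                    () -> atomicRead.apply(block[0], block[1]), pool));
        }
        List<Long> result = new ArrayList<>();
        for (CompletableFuture<List<Long>> f : futures) {
            result.addAll(f.join()); // synchronization point: wait for this block
        }
        return result;
    }

    /** Hypothetical stand-in for a storage read: one timestamp per second in [from, to). */
    static List<Long> fakeAtomicRead(long from, long to) {
        List<Long> points = new ArrayList<>();
        for (long t = from; t < to; t += 1000) {
            points.add(t);
        }
        return points;
    }

    /** Runs the sketch over 10 seconds of fake data in 4-second blocks. */
    static List<Long> demo() {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            return blockRead(0L, 10_000L, 4_000L, BlockReadSketch::fakeAtomicRead, pool);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo().size() + " points read");
    }
}
```

Joining the futures in submission order keeps the merged result sorted by time even though the blocks may complete out of order.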
● Abstractions over TSDB for simplification
● Higher throughput reads
o Multi-threading
o Multi-processing (coprocessors) - wip
o Divide and conquer
o On-demand loading (yield to wield)
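(Diagram: Read/Write APIs over KairosDB and OpenTSDB, stored in HBase across Region Servers)
• Stack: Read/Write APIs, KairosDB, OpenTSDB, HBase
• Row key fields shown: Metric, TimeStamp, …, Salt
• Other labels: APIs (Read + Write), HBase DataStore Plugin, Region Servers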
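The diagram above shows a row key built from a metric, a timestamp and a salt, spread over region servers. A minimal sketch of what a salted row key could look like is given below. The byte layout, the bucket count and all names are assumptions made for illustration; this is not OpenTSDB's or KairosDB's actual key format, and the order of fields in the presenters' key is not stated on the slide.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SaltedRowKeySketch {
    static final int SALT_BUCKETS = 16;          // hypothetical bucket count
    static final long HOUR_MILLIS = 3_600_000L;  // rows are grouped per hour (assumption)

    /** Salt derived from the metric name, so different metrics land in different key ranges. */
    static byte saltFor(String metric) {
        return (byte) Math.floorMod(metric.hashCode(), SALT_BUCKETS);
    }

    /** Assumed layout: [salt: 1 byte][metric: UTF-8 bytes][hour-aligned timestamp: 8 bytes]. */
    static byte[] rowKey(String metric, long timestampMillis) {
        byte[] metricBytes = metric.getBytes(StandardCharsets.UTF_8);
        long baseHour = timestampMillis - Math.floorMod(timestampMillis, HOUR_MILLIS);
        return ByteBuffer.allocate(1 + metricBytes.length + Long.BYTES)
                .put(saltFor(metric))
                .put(metricBytes)
                .putLong(baseHour)
                .array();
    }

    public static void main(String[] args) {
        byte[] key = rowKey("cabin-temperature", 1_239_867L);
        System.out.println("salt=" + key[0] + ", key length=" + key.length);
    }
}
```

Because the salt here depends only on the metric, all rows for one metric stay contiguous for range scans while different metrics are spread across key ranges. Salting on more fields would spread a single hot metric further at the cost of fan-out on reads.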
Performance
Region
Servers
Higher throughput reads
● Multi-threading (client side)
● Multi-processing (coprocessors)
● Divide and conquer
● On-demand loading (read ahead and iterate)
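On-demand loading, as listed above, can be pictured as an iterator that fetches the next block from storage only when the caller has consumed the previous one. The sketch below is a simplified illustration under that assumption: `OnDemandReader` and `fakeRead` are hypothetical names, the storage call is a stand-in, and no background prefetching is done (a true read-ahead would fetch the next block before the buffer runs dry).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

class OnDemandReader implements Iterator<Long> {
    private final BiFunction<Long, Long, List<Long>> readBlock;
    private final long end;
    private final long blockMillis;
    private final Deque<Long> buffer = new ArrayDeque<>();
    private long nextStart;
    int blocksLoaded = 0; // exposed so callers can observe the lazy loading

    OnDemandReader(long start, long end, long blockMillis,
                   BiFunction<Long, Long, List<Long>> readBlock) {
        this.nextStart = start;
        this.end = end;
        this.blockMillis = blockMillis;
        this.readBlock = readBlock;
    }

    @Override
    public boolean hasNext() {
        // Load further blocks only when the buffer is empty and range remains.
        while (buffer.isEmpty() && nextStart < end) {
            long blockEnd = Math.min(nextStart + blockMillis, end);
            buffer.addAll(readBlock.apply(nextStart, blockEnd));
            nextStart = blockEnd;
            blocksLoaded++;
        }
        return !buffer.isEmpty();
    }

    @Override
    public Long next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffer.poll();
    }

    /** Hypothetical stand-in for a storage read: one timestamp per second in [from, to). */
    static List<Long> fakeRead(long from, long to) {
        List<Long> points = new ArrayList<>();
        for (long t = from; t < to; t += 1000) {
            points.add(t);
        }
        return points;
    }

    public static void main(String[] args) {
        OnDemandReader reader = new OnDemandReader(0L, 10_000L, 4_000L, OnDemandReader::fakeRead);
        reader.next(); // the first element forces only the first block to load
        System.out.println("blocks loaded so far: " + reader.blocksLoaded);
    }
}
```

A caller that stops early, for example after plotting the first few points, never pays for the blocks it did not reach.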
Aggregation
Use cases
● Exploratory analysis (random/stratified sampling)
● Graphing/plotting data
● Trend analysis (regression)
dbObject.readAggregate(dbQuery, new MeanAggregator(2000, new TimeUnits().asMilliseconds()))
  dbObject: new Database(<data table>, <uid table>, zkQuorum, zkBasePath)
  dbQuery: new TimeQuery(0L, 1239867L, "cabin-temperature")
  MeanAggregator: an implementation of the IAggregator interface
  2000: interval duration
  new TimeUnits().asMilliseconds(): unit type
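To make the aggregation call above more concrete, here is a minimal sketch of what a mean aggregator over fixed intervals might compute. It does not reproduce the presenters' `IAggregator` or `MeanAggregator` classes, whose internals are not shown in the slides; `MeanAggregationSketch` and `meanByInterval` are hypothetical names, and the sample readings are invented for the example.

```java
import java.util.Map;
import java.util.TreeMap;

class MeanAggregationSketch {
    /** Returns interval start (ms) mapped to the mean of the values falling in that interval. */
    static Map<Long, Double> meanByInterval(long[] timestamps, double[] values, long intervalMillis) {
        Map<Long, double[]> sums = new TreeMap<>(); // interval start -> {sum, count}
        for (int i = 0; i < timestamps.length; i++) {
            long bucket = timestamps[i] - Math.floorMod(timestamps[i], intervalMillis);
            double[] sumAndCount = sums.computeIfAbsent(bucket, k -> new double[2]);
            sumAndCount[0] += values[i];
            sumAndCount[1] += 1;
        }
        Map<Long, Double> means = new TreeMap<>();
        sums.forEach((bucket, sc) -> means.put(bucket, sc[0] / sc[1]));
        return means;
    }

    public static void main(String[] args) {
        long[] ts = {0L, 500L, 1999L, 2000L, 3500L};       // invented sample timestamps (ms)
        double[] temps = {20.0, 22.0, 24.0, 30.0, 32.0};    // invented cabin temperatures
        System.out.println(meanByInterval(ts, temps, 2000L)); // prints {0=22.0, 2000=31.0}
    }
}
```

Reducing each 2000 ms interval to one value is what makes graphing and trend analysis over long time ranges affordable.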
SQL based access
Diagram components: RStudio, SQL, Pivotal HAWQ, UDF [PL-Java], Java Read APIs, HBase
● Dual advantages – MPP scaling + underlying columnar storage
● Security (Kerberos), JVM specific
● PostgreSQL package in R (access by data scientists)
● Scaled out in terms of compute and storage
SQL access for RDBMS based flows
CRAN Project RPostgreSQL: R interface to the PostgreSQL database system
Database interface and PostgreSQL driver for R. This package provides a Database Interface (DBI) compliant driver for R to access PostgreSQL database systems …….
SQL based access – User Scenarios
SELECT * FROM getTags('tsdb','tsdb-uid') WHERE gettags LIKE 'CabAttribute%'
SELECT * FROM getTimeSeriesData('tsdb','tsdb-uid',CAST('2012-10-01 01:00:00' AS TIMESTAMP),'temperature5') LIMIT 10
SELECT * FROM getTimeSeriesData('tsdb','tsdb-uid',CAST('2012-09-20 01:00:00' AS TIMESTAMP),CAST('2012-09-21 23:59:59' AS TIMESTAMP),'temperature5')
SELECT * FROM getTimeSeriesData('tsdb','tsdb-uid',CAST('2012-09-20 01:00:00' AS TIMESTAMP),CAST('2012-09-21 23:59:59' AS TIMESTAMP),'temperature5','attrib1=*,attrib2=W*')
Time Range             | Tag Input        | Attributes                     | Return Attributes
Start Time, [End Time] | Tag Only         | None                           | None
Start Time, [End Time] | Tag + Attributes | All(*), key=value, key=<regex> | All
                       |                  | key=value, [p=q,x=y,…]         | key=value1,key=value2, [p=q,x=y, …]
p,q,x,y,… are other attributes in the time series data point
TimeStamp          | Tag          | Attribute Name | Attribute Value
2012-10-01 1:00:00 | Temperature5 | Attrib1        | v1
2012-10-01 1:00:00 | Temperature5 | Attrib1        | v2
2012-10-01 1:00:00 | Temperature5 | Attrib2        | W1
2012-10-01 1:00:00 | Temperature5 | Attrib2        | W2
2012-11-01         | Temperature5 | …              | …
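The attribute filter string passed as the last argument in the final query ('attrib1=*,attrib2=W*') could be interpreted along the following lines. This is a hedged sketch and not the UDF's actual logic: it assumes `*` means "any characters", that all terms must match (AND semantics), and that attribute names are case-sensitive. `AttributeFilterSketch`, `parse` and `matches` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class AttributeFilterSketch {
    /** Converts a value pattern to a regex, treating '*' as "any characters" (assumption). */
    static Pattern toPattern(String valuePattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : valuePattern.toCharArray()) {
            regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString());
    }

    /** Parses "key1=pattern1,key2=pattern2" into key -> compiled pattern. */
    static Map<String, Pattern> parse(String filter) {
        Map<String, Pattern> parsed = new LinkedHashMap<>();
        for (String term : filter.split(",")) {
            String[] keyValue = term.split("=", 2); // well-formed "key=value" terms assumed
            parsed.put(keyValue[0].trim(), toPattern(keyValue[1].trim()));
        }
        return parsed;
    }

    /** True when every filtered attribute is present and its value matches (AND semantics assumed). */
    static boolean matches(Map<String, Pattern> filter, Map<String, String> attributes) {
        for (Map.Entry<String, Pattern> term : filter.entrySet()) {
            String value = attributes.get(term.getKey());
            if (value == null || !term.getValue().matcher(value).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Pattern> filter = parse("attrib1=*,attrib2=W*");
        Map<String, String> point = new LinkedHashMap<>();
        point.put("attrib1", "v1");
        point.put("attrib2", "W1");
        System.out.println(matches(filter, point)); // prints true
    }
}
```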
Data Models – HBase PoC
• Nature of sensor data from engines: sparse
• Horizontal vs. vertical data model
• Vertical – 1 flight parameter per column (1-2K values)
• Horizontal – parameters converted to rows; needs transposition during ingestion
• Performance on HBase
• Horizontal model did much better for retrieval
• HAWQ, and HAWQ as an external table over HBase, were slower
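The transposition step mentioned for the horizontal model can be sketched as turning one wide record (one column per flight parameter) into one row per parameter that actually has a value. The code below is an illustration only: `TransposeSketch`, `Point` and the parameter names are hypothetical, and skipping missing values is an assumption about how the sparseness of the sensor data would be handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TransposeSketch {
    /** One row of the horizontal model: (timestamp, parameter, value). */
    static final class Point {
        final long timestamp;
        final String parameter;
        final double value;

        Point(long timestamp, String parameter, double value) {
            this.timestamp = timestamp;
            this.parameter = parameter;
            this.value = value;
        }
    }

    /** Converts one wide record into rows, skipping parameters with no reading. */
    static List<Point> toRows(long timestamp, Map<String, Double> wideRecord) {
        List<Point> rows = new ArrayList<>();
        for (Map.Entry<String, Double> column : wideRecord.entrySet()) {
            if (column.getValue() != null) { // sparse data: most columns may be empty
                rows.add(new Point(timestamp, column.getKey(), column.getValue()));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Double> record = new LinkedHashMap<>();
        record.put("egt", 650.0);        // hypothetical parameter names and values
        record.put("fanSpeed", null);    // no reading at this timestamp
        record.put("oilPressure", 45.5);
        System.out.println(toRows(1_000L, record).size() + " rows"); // prints 2 rows
    }
}
```

Storing only the parameters that have readings means empty columns cost nothing, which suits sparse engine data.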
Recap / Summary
• Industrial Data – nature of the business problem
• Industrial Data Lake
• Technical Solution
• Wrap up