
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB


This case study involves analysis of high-volume, continuous time-series aviation data from jet engines, consisting of temperature, pressure, vibration and related parameters from the on-board sensors, joined with well-characterized, slowly changing engine asset configuration data and other enterprise data for continuous engine diagnostics and analytics. The data is ingested via a distributed fabric comprising transient containers, message queues and columnar, compressed storage leveraging OpenTSDB over Apache HBase.

Published in: Software

  1. 1. Imagination at work Industrial Internet Case Study using HBase and TSDB Shyam Varan Nath Arnab Guin May, 2015
  2. 2. Agenda • Introduction to IoT and Industrial Internet • Industrial Use Case • Technology Details • Wrap up
  3. 3. About Shyam • Architect – Industrial Analytics @GE • Worked at IBM, Deloitte, Oracle and Halliburton prior to GE • Regular speaker at technology conferences such as Oracle OpenWorld, IoT Summit, Collaborate and BIWA Summit on IoT, Big Data and Analytics topics • Education: IIT Kanpur (B Tech EE), MBA and MS Computer Science from Florida Atlantic University
  4. 4. About Arnab • Staff Software Engineer, Big Data – Predix Platform @GE • Working on big data, analytics (Past HBaseCon attendee) • Past work in several domains – audience research, genomics, compilers, EE • Education: BS Computer Science, Masters (Quantitative Finance), CS/EE graduate studies @Stanford
  5. 5. GE – Guiding Principle Industrial Internet Makes Machines Better!
  6. 6. Overview – Industrial Internet as a Service Industrial Data Lake Industrial Internet Application Industrial Machines Connectivity Analytics
  7. 7. Value to Business User
  8. 8. Aviation Use Case - Jet Engines Speed Sensor Exhaust Gas Temperature (EGT) sensors Temperature Sensors Temperature Sensors Pressure sensors * Simplified view of some of the sensors
  9. 9. Making “Sense” of the “Sensors” • EGT = Exhaust Gas Temperature • The temperature of the exhaust gases as they enter the tail pipe, after passing through the turbine • A good indicator of the health of the engine (just like human body temperature) • Recording and interpreting the EGT can help detect several jet engine problems
  10. 10. Business Problem • The aircraft collects data about its operations, including the jet engines, using the Quick Access Recorder (QAR) • Engine analytics applied to such data in near real-time can be used to proactively diagnose problems and reduce unplanned downtime • Continuous Engine Operations Data (CEOD) can be up to 500GB per flight. With 300K flights per day, it soon becomes a Big Data problem • The Data Lake is fertile ground for Big Data analytics to understand jet engine behavior and problems over a service life of ~30 yrs • Analytics developed on the full data can be deployed against summary information in near real-time
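
The volume claim on slide 10 can be worked through as simple arithmetic. This is only a sketch: it takes the slide's two figures at face value, treats 500GB as an upper bound per flight, and assumes decimal units (1 PB = 1,000,000 GB), so the result is a ceiling rather than a measured rate.

```python
GB_PER_FLIGHT_MAX = 500      # "up to 500GB per flight" (slide 10)
FLIGHTS_PER_DAY = 300_000    # "300K flights per day" (slide 10)
GB_PER_PB = 1_000_000        # assumption: decimal units


def daily_volume_pb(gb_per_flight, flights_per_day):
    """Return the daily data volume in petabytes."""
    return gb_per_flight * flights_per_day / GB_PER_PB


upper_bound_pb = daily_volume_pb(GB_PER_FLIGHT_MAX, FLIGHTS_PER_DAY)
print(f"Up to {upper_bound_pb:.0f} PB per day")
```

Even if only a fraction of flights record full CEOD, the order of magnitude explains why the deck moves to a distributed store.
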
  11. 11. New approach – Industrial Data Lake architecture • Rapid access to all data for analytics: (1) All data – access to real-time data and historical data, not limited to a snapshot of data (2) Any data – handling of all data types including documents, images, machine data, sensor data (3) One place – access to all data in one place to quickly respond to the speed of business change • How long will it last without failures or maintenance? Is my asset ready when there is market opportunity? Is my asset performing optimally? How to configure for best operational results? • Flexible data models • Users: data scientist, field operations, business analyst • Data sources: sensor data; content (images, videos, manuals, etc.); machine data; historian data; CRM, ERP, etc.; logs, click streams; geo-location data; social network data
  12. 12. Industrial Data Lake • 50B machines will be connected on the internet by 2020 • 2X industrial data growth within next 10 years • In practice only 3% of potentially useful data is tagged and even less is analyzed* • 9M data points per hour for each locomotive • 500GB data per blade by gas turbines • 35GB data per day from each Smart Meter • 50X data growth in healthcare (2012 – 2020) • 1TB data per flight • Data sources: sensor data; content (images, videos, manuals, etc.); historian data; machine data; CRM, ERP, etc.; logs; social network data; geo-location data • *Source: IDC • © General Electric Company, 2014. All Rights Reserved.
  13. 13. Machine data Data Flow - Ingestion, Storage, Analytics Sensor data
  14. 14. System Components • Ingestion: high-speed real-time data input in the time domain (streams); batch processing (files) • Transport layer: RabbitMQ • Security • Fault tolerance: multiple containers (ingestion); highly available queues (transport); multiple masters, replication (storage); multi-node ZooKeeper quorum (coordination) • Diagram labels: Ingestion, Storage, Transport, Security, Zookeeper
  15. 15. Read-Write Tracks • WRITE (HBase, TSDB): TimeRange, Tag/Metric, Tag Value, Attributes • TimeMessage: Block Write, Atomic Write; Async, Parallel • READ (HBase, TSDB): Timestamp, Tag/Metric, Tag Value, Attributes • TimeQuery: Block Read, Atomic Read; Async, Parallel • Time axis: s, e0, now • Block reads can be synchronized (futures)
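
Slide 15's note that block reads can be synchronized with futures can be illustrated with a minimal sketch. It assumes a query's time range is split into fixed-width blocks that are read in parallel and joined in order. The names `split_time_range`, `read_block` and `parallel_read` are hypothetical, and an in-memory dict stands in for the HBase/TSDB store; none of this is the authors' API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_time_range(start_ms, end_ms, block_ms):
    """Split [start_ms, end_ms) into consecutive blocks of at most block_ms."""
    blocks = []
    t = start_ms
    while t < end_ms:
        blocks.append((t, min(t + block_ms, end_ms)))
        t += block_ms
    return blocks


def read_block(store, metric, start_ms, end_ms):
    """Hypothetical block read: scan one time block of the in-memory stand-in."""
    return [(ts, v) for ts, v in store.get(metric, []) if start_ms <= ts < end_ms]


def parallel_read(store, metric, start_ms, end_ms, block_ms, workers=4):
    """Read all blocks in parallel and return the points in time order."""
    blocks = split_time_range(start_ms, end_ms, block_ms)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_block, store, metric, s, e) for s, e in blocks]
        points = []
        # Waiting on the futures in submission order synchronizes the blocks
        # and keeps the combined result ordered by time.
        for future in futures:
            points.extend(future.result())
    return points


store = {"cabin-temperature": [(t, 20.0 + t / 1000) for t in range(0, 10_000, 500)]}
result = parallel_read(store, "cabin-temperature", 0, 10_000, 2_500)
print(len(result), "points read")
```
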
  16. 16. ● Abstractions over TSDB for simplification ● Higher throughput reads: multi-threading; multi-processing (coprocessors), work in progress; divide and conquer; on-demand loading (yield to wield) ● Diagram labels: Read/Write APIs; KairosDB; OpenTSDB; HBase; Metric, TimeStamp, …….., Salt; Region Servers; APIs (Read + Write); HBase DataStore Plugin
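
The Metric, TimeStamp and Salt labels on slide 16 suggest a salted row key, which spreads writes across region servers. The sketch below is a simplified illustration modeled on OpenTSDB's layout (optional salt, 3-byte metric UID, 4-byte base timestamp aligned to the hour, then tag key/value UID pairs). The bucket count, the CRC32 hash and the UID values are assumptions for illustration, not necessarily what the authors deployed.

```python
import struct
import zlib

SALT_BUCKETS = 20   # assumption: number of salt buckets
UID_WIDTH = 3       # OpenTSDB's default UID width in bytes


def uid(n):
    """Encode an integer UID as a fixed-width big-endian byte string."""
    return n.to_bytes(UID_WIDTH, "big")


def row_key(metric_uid, timestamp_s, tag_uids):
    """Build a salted row key for one data point (simplified sketch)."""
    base_time = timestamp_s - (timestamp_s % 3600)   # align to the hour
    tags = b"".join(uid(k) + uid(v) for k, v in sorted(tag_uids))
    # Derive the salt from the series identity (metric + tags), so all rows
    # of one series land in the same bucket. CRC32 is a stand-in hash.
    salt = zlib.crc32(uid(metric_uid) + tags) % SALT_BUCKETS
    return bytes([salt]) + uid(metric_uid) + struct.pack(">I", base_time) + tags


key = row_key(metric_uid=1, timestamp_s=1349053261, tag_uids=[(1, 1)])
print(key.hex())
```

Because the salt is the leading byte, a read for one series has to either recompute the salt or scan every bucket. That is one reason the deck's divide-and-conquer reads pair naturally with salting.
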
  17. 17. Performance Region Servers Higher throughput reads ● Multi-threading (client side) ● Multi-processing (coprocessors) ● Divide and conquer ● On-demand loading (read ahead and iterate)
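
On-demand loading ("read ahead and iterate") on slide 17 can be sketched as a generator that keeps a small number of block fetches in flight and hands results to the caller one block at a time. `iter_blocks` and `fetch_block` are hypothetical names, and `fetch_block` merely fabricates integers in place of a real scan.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def iter_blocks(fetch, blocks, read_ahead=2):
    """Yield fetch(block) results in order with up to read_ahead fetches outstanding."""
    with ThreadPoolExecutor(max_workers=read_ahead) as pool:
        pending = deque()
        for block in blocks:
            pending.append(pool.submit(fetch, block))
            if len(pending) >= read_ahead:
                # Hand the oldest block to the caller; nothing more is
                # submitted until the caller asks for the next block.
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()


def fetch_block(block):
    """Hypothetical stand-in for a block read: returns the integers in the block."""
    start, end = block
    return list(range(start, end))


blocks = [(0, 3), (3, 6), (6, 9)]
loaded = list(iter_blocks(fetch_block, blocks))
print(loaded)
```

The caller never holds more than a few blocks in memory, which matters when one query spans far more data than a client can buffer.
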
  18. 18. Aggregation • Use cases: exploratory analysis (random/stratified sampling); graphing/plotting data; trend analysis (regression) • Database: new Database(<data table>, <uid table>, zkQuorum, zkBasePath) • Query: new TimeQuery(0L, 1239867L, "cabin-temperature") • Aggregator (IAggregator interface): new MeanAggregator(2000, new TimeUnits().asMilliseconds()), where 2000 is the interval duration and the second argument is the unit type • Call: dbObject.readAggregate(dbQuery, new MeanAggregator(2000, new TimeUnits().asMilliseconds()))
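
What a mean aggregator with a 2000 ms interval presumably computes can be sketched as follows. This is not the deck's Java API: `mean_aggregate` is a hypothetical function, and aligning buckets to multiples of the interval is an assumption, since the slide does not say how intervals are anchored.

```python
from collections import defaultdict


def mean_aggregate(points, interval_ms):
    """Average (timestamp_ms, value) points over fixed-width intervals."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Assumption: buckets are aligned to multiples of the interval.
        buckets[ts - ts % interval_ms].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


points = [(0, 20.0), (500, 22.0), (1999, 24.0), (2000, 30.0), (4100, 18.0)]
aggregated = mean_aggregate(points, 2000)
print(aggregated)  # [(0, 22.0), (2000, 30.0), (4000, 18.0)]
```

Downsampling of this kind serves the listed use cases (plotting, trend analysis) by returning one value per interval instead of every raw sample.
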
  19. 19. SQL based access • Stack (diagram labels): SQL, Pivotal HAWQ, HBase UDF [PL-Java], Java Read APIs, RStudio • Dual advantages – MPP scaling + underlying columnar storage • Security (Kerberos) JVM specific • PostgreSQL package in R (access by data scientists) • Scaled out in terms of compute and storage • SQL access for RDBMS based flows • CRAN Project RPostgreSQL: R interface to the PostgreSQL database system. Database interface and PostgreSQL driver for R. This package provides a Database Interface (DBI) compliant driver for R to access PostgreSQL database systems …….
  20. 20. SQL based access – User Scenarios
      Queries:
      SELECT * FROM getTags('tsdb','tsdb-uid') WHERE gettags LIKE 'CabAttribute%'
      SELECT * FROM getTimeSeriesData('tsdb','tsdb-uid',CAST('2012-10-01 01:00:00' AS TIMESTAMP),'temperature5') LIMIT 10
      SELECT * FROM getTimeSeriesData('tsdb','tsdb-uid',CAST('2012-09-20 01:00:00' AS TIMESTAMP),CAST('2012-09-21 23:59:59' AS TIMESTAMP),'temperature5')
      SELECT * FROM getTimeSeriesData('tsdb','tsdb-uid',CAST('2012-09-20 01:00:00' AS TIMESTAMP),CAST('2012-09-21 23:59:59' AS TIMESTAMP),'temperature5','attrib1=*,attrib2=W*')
      Supported arguments:
      | Time Range | Tag Input | Attributes | Return Attributes |
      |---|---|---|---|
      | Start Time, [End Time] | Tag Only | None | None |
      | Start Time, [End Time] | Tag + Attributes | All(*), key=value, key=<regex> | All; key=value, [p=q,x=y,…]; key=value1,key=value2, [p=q,x=y,…] |
      p,q,x,y,… are other attributes in the time series data point.
      Sample result:
      | TimeStamp | Tag | Attribute Name | Attribute Value |
      |---|---|---|---|
      | 2012-10-01 1:00:00 | Temperature5 | Attrib1 | v1 |
      | 2012-10-01 1:00:00 | Temperature5 | Attrib1 | v2 |
      | 2012-10-01 1:00:00 | Temperature5 | Attrib2 | W1 |
      | 2012-10-01 1:00:00 | Temperature5 | Attrib2 | W2 |
      | 2012-11-01 | Temperature5 | … | … |
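
The attribute filter string in the last query on slide 20 (`attrib1=*,attrib2=W*`) can be illustrated with a small matcher. The slide does not define the matching rules, so this sketch makes two assumptions: patterns are shell-style wildcards (the slide also mentions `key=<regex>`), and all listed attributes must match. `parse_filter` and `matches` are hypothetical helpers, not part of the UDF.

```python
from fnmatch import fnmatchcase


def parse_filter(spec):
    """Parse 'k1=p1,k2=p2' into a dict of attribute name -> pattern."""
    return dict(item.split("=", 1) for item in spec.split(",") if item)


def matches(attributes, filters):
    """True if every filtered attribute is present and matches its pattern."""
    # Assumption: wildcard semantics and AND-combination of filters.
    return all(
        key in attributes and fnmatchcase(attributes[key], pattern)
        for key, pattern in filters.items()
    )


filters = parse_filter("attrib1=*,attrib2=W*")
data_points = [
    {"attrib1": "v1", "attrib2": "W1"},
    {"attrib1": "v2", "attrib2": "X9"},
    {"attrib2": "W2"},
]
selected = [p for p in data_points if matches(p, filters)]
print(selected)
```
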
  21. 21. Data Models – HBase PoC • Nature of sensor data from engines: sparse • Horizontal vs. vertical data model • Vertical – 1 flight parameter per column (1-2K values) • Horizontal – parameters converted to rows; needs transposition during ingestion • Performance on HBase • Horizontal model did much better for retrieval • HAWQ, and HAWQ as an external table over HBase, were slower
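
The transposition that slide 21 says the horizontal model needs during ingestion can be sketched as turning one wide record (one column per flight parameter) into one row per parameter, dropping missing values because the sensor data is sparse. The function `to_rows` and the parameter names are hypothetical.

```python
def to_rows(record, timestamp_key="timestamp"):
    """Transpose one wide record into (timestamp, parameter, value) rows."""
    ts = record[timestamp_key]
    return [
        (ts, name, value)
        for name, value in record.items()
        if name != timestamp_key and value is not None   # skip sparse gaps
    ]


# Hypothetical wide record: one column per flight parameter.
wide = {"timestamp": 1349053200, "egt": 612.5, "n1_speed": None, "oil_pressure": 48.2}
rows = to_rows(wide)
print(rows)
```

Each output row maps naturally onto a time-series data point (metric, timestamp, value), which may be why this layout retrieved faster in the PoC.
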
  22. 22. Recap / Summary • Industrial Data – nature of the business problem • Industrial Data Lake • Technical Solution • Wrap up