Optimizing Industrial Operations
in Real Time
Using the Big Data Ecosystem
Kishore Reddipalli
Director - Software Engineering
GE Digital
Agenda
• Use Case
• Spark as Analytic Runtime
• Optimization Framework
• Streaming and Batch Analysis
• Challenges
• Q&A
GE Mission
• Improve Asset Reliability and Availability
• Monitor Mission Critical Events
• Optimize the Manufacturing process
• Optimize Fleet Operations
• Reduce Unplanned Downtime
Use Case
Power Plant Efficiency:
• Heat rate, in the context of power plants, can be thought of as
the input needed to produce one unit of output. It generally
indicates the amount of fuel required to generate one unit of
electricity.
• Performance parameters tracked for any thermal power
plant (efficiency, fuel costs, plant load factor, emissions
levels, etc.) are a function of the station heat rate and can
be linked to it directly.
Source : https://en.wikipedia.org/wiki/Heat_rate_(efficiency)
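To make the definition concrete, here is a minimal illustrative calculation. The numbers and function names are hypothetical, not taken from the talk; 3,412 BTU per kWh is the standard thermal equivalent of a kilowatt-hour.

```python
# Heat rate = energy input (fuel) per unit of electricity produced.
# Illustrative numbers only.

def heat_rate(fuel_energy_btu, electricity_kwh):
    """Return heat rate in BTU per kWh."""
    return fuel_energy_btu / electricity_kwh

# A plant burning 10.2 billion BTU of fuel to generate 1 GWh:
hr = heat_rate(10_200_000_000, 1_000_000)   # 10,200 BTU/kWh

# One kWh is 3,412 BTU, so thermal efficiency = 3412 / heat rate.
efficiency = 3412 / hr                       # ~0.33, i.e. ~33%
```

A lower heat rate means less fuel per unit of electricity, which is why the performance parameters on this slide all track it.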
Data Volume
• In Aviation, a GE jet engine produces 5,000 data points per
second that can be analyzed to optimize flight times
• In Power, ~500,000 data points, generated from ~1,000
sensors, need to be analyzed to generate the outcomes
• Data generated by thousands of pieces of GE equipment at
high volume and rate needs to be stored and analyzed at
petabyte scale
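A back-of-envelope check on these figures, reading the 500,000 data points as a per-second rate and assuming 8 bytes per raw value (both assumptions for illustration; timestamps and quality flags would add overhead):

```python
# Rough sizing of the Power use case described above.
points_per_sec = 500_000
bytes_per_point = 8                 # assumed raw value size
seconds_per_day = 86_400

per_day_bytes = points_per_sec * bytes_per_point * seconds_per_day
per_day_tb = per_day_bytes / 1e12   # ~0.35 TB/day for one site

# Multiplied across thousands of machines and years of retention,
# this is how the fleet-wide footprint reaches petabyte scale.
```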
Predix – an Industrial Internet platform that can be
leveraged to build industrial applications
www.predix.io
Architecture
Spark as an Analytic Runtime
• REST API (Spark Job Server)
• Security
• Multi-tenancy
• Optimization Framework
• Spark SQL
• Spark Streaming
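The REST API bullet refers to Spark Job Server, which exposes job submission over HTTP. As an illustrative sketch (the endpoint shape follows the open-source spark-jobserver project; the host, app name, and job class here are hypothetical), building a submission URL looks like:

```python
from urllib.parse import urlencode

def submit_job_url(base, app_name, class_path, context=None):
    """Build the POST URL for running a job via Spark Job Server's
    REST API: POST /jobs?appName=...&classPath=..."""
    params = {"appName": app_name, "classPath": class_path}
    if context:
        # Optionally pin the job to a long-lived, pre-created context.
        params["context"] = context
    return f"{base}/jobs?{urlencode(params)}"

url = submit_job_url("http://jobserver:8090", "optimizer",
                     "com.example.HeatRateJob")
```

A client would POST the job configuration to this URL and then poll `GET /jobs/<id>` for the result; consult the spark-jobserver documentation for the exact contract of your version.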
Optimization Framework
Need for a framework – to simplify and bring consistency to
the development of analytics, and to abstract away the complexity
of data connectivity and of processing large volumes of data
• API
• Schema
• Data Providers (Input / Output)
• Data Frames (Variety of Data – Timeseries, Asset,
Configuration)
• Parallelism (Partitioning of data for processing)
• Multi-Mode (Stream vs Batch)
• Multi-Stream Source
• UDF (Aggregation, Interpolation, Unit of Measure)
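One of the UDFs listed above is interpolation. A minimal, framework-independent sketch of linear interpolation over gaps in a time series (the function name and data shape are illustrative, not the framework's actual API):

```python
def interpolate(points):
    """Fill None values in a list of (ts, value) samples by linear
    interpolation between the nearest known neighbours.
    Assumes the first and last samples are present."""
    out = list(points)
    for i, (ts, v) in enumerate(out):
        if v is None:
            # Nearest known sample on each side of the gap.
            lo = next(j for j in range(i - 1, -1, -1) if out[j][1] is not None)
            hi = next(j for j in range(i + 1, len(out)) if out[j][1] is not None)
            t0, v0 = out[lo]
            t1, v1 = out[hi]
            out[i] = (ts, v0 + (v1 - v0) * (ts - t0) / (t1 - t0))
    return out

filled = interpolate([(0, 10.0), (1, None), (2, 30.0)])
# the gap at ts=1 is filled with the midpoint, 20.0
```

In the framework, a UDF like this would run per tag and per partition so that large fleets of sensors can be repaired in parallel.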
Optimization Framework -
Architecture
Data Providers
The data connectors that fetch data from a
variety of data sources.
Examples:
1. File (HDFS)
2. HTTP – RESTful services (Asset, Timeseries,
any business service)
3. Database (Cassandra, Postgres)
4. Messaging (Kafka, Kinesis, EventHub)
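The point of the provider abstraction is one interface with many backends. A hypothetical sketch of what that interface could look like (the class and method names are invented for illustration; the real framework's API is not shown in the deck):

```python
from abc import ABC, abstractmethod

class DataProvider(ABC):
    """One contract for every backend: HDFS files, REST services,
    databases, message queues."""

    @abstractmethod
    def read(self, query):
        """Fetch records matching a query; returns an iterable of dicts."""

    @abstractmethod
    def write(self, records):
        """Persist processed records to the backing store."""

class InMemoryProvider(DataProvider):
    """Trivial backend used here purely for illustration and testing."""
    def __init__(self):
        self.store = []
    def read(self, query):
        return [r for r in self.store if query(r)]
    def write(self, records):
        self.store.extend(records)

p = InMemoryProvider()
p.write([{"tagId": "temperature", "v": 425.07935}])
hits = p.read(lambda r: r["tagId"] == "temperature")
```

An analytic written against `DataProvider` never needs to know whether its input came from HDFS, Cassandra, or Kafka; swapping backends is a configuration change.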
Timeseries – DataFrame Schema
{
  "tags": [
    {
      "tagId": "temperature",
      "data": [
        {
          "q": "3",
          "ts": "2015-07-23T12:25:00.000-0000",
          "v": "425.07935"
        }
      ]
    }
  ]
}
Timeseries DataFrame
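Before this nested document can become a DataFrame, it has to be flattened into rows. A plain-Python sketch of that flattening, with no Spark dependency (in practice the framework would do this per partition):

```python
import json

def flatten_tags(doc):
    """Turn the nested tags/data structure into flat
    (tagId, ts, value, quality) rows."""
    rows = []
    for tag in doc["tags"]:
        for sample in tag["data"]:
            rows.append((tag["tagId"], sample["ts"],
                         float(sample["v"]), int(sample["q"])))
    return rows

doc = json.loads("""
{"tags": [{"tagId": "temperature",
           "data": [{"q": "3",
                     "ts": "2015-07-23T12:25:00.000-0000",
                     "v": "425.07935"}]}]}
""")
rows = flatten_tags(doc)
```

Each row maps directly onto a DataFrame column (`tagId`, `ts`, `v`, `q`), which is what Spark SQL then aggregates and joins against the asset data.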
Asset DataFrame – Schema
"tagClassifications": [
  {
    "id": "OO-BL000472_Tag_Temperature_Classification_ID",
    "name": "OO-BL000472_Tag_Temperature_Classification_name",
    "description": "This is tag Temperature Classification description",
    "unitGroup": "temperature",
    "properties": [
      {
        "id": "low",
        "value": [80],
        "type": "double"
      },
      {
        "id": "high",
        "value": [120],
        "type": "double"
      },
      {
        "id": "threshold",
        "value": [100],
        "type": "double"
      }
    ]
  }
]
Asset DataFrame
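The `low`/`high`/`threshold` properties in this classification are what an analytic compares readings against. A sketch of such a check in plain Python (the property ids match the schema above; the alerting rule itself is invented for illustration):

```python
def classify(value, props):
    """Compare a sensor reading against the low/high/threshold
    properties of an asset tag classification."""
    p = {item["id"]: item["value"][0] for item in props}
    if value < p["low"]:
        return "low"
    if value > p["high"]:
        return "high"
    return "alert" if value >= p["threshold"] else "normal"

# Properties as they appear in the asset schema above.
props = [{"id": "low", "value": [80], "type": "double"},
         {"id": "high", "value": [120], "type": "double"},
         {"id": "threshold", "value": [100], "type": "double"}]
```

In the framework this is the role of the asset DataFrame: joined against the timeseries DataFrame, it supplies the per-tag limits that turn raw values into outcomes.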
Sample Analytic
Stream Processing – Data Flow
Stream Processing
• Micro-batch Interval
• Continuous Application
• Multi-Stream Sources
• Tenant-aware Data Pipeline
• Context-based Data Pipeline
• Window-based Slicing – Moving Average
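The last bullet, window-based slicing for a moving average, can be sketched without any streaming framework. A minimal generator over a sliding window (illustrative only; in Spark Streaming this would be a windowed operation over micro-batches):

```python
from collections import deque

def moving_average(stream, window):
    """Window-based slicing: for each incoming value, emit the mean
    of the last `window` values seen so far."""
    buf = deque(maxlen=window)   # old values fall off automatically
    for v in stream:
        buf.append(v)
        yield sum(buf) / len(buf)

out = list(moving_average([10, 20, 30, 40], 2))
# the first emission averages a partial window
```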
Stream Processing – Pointers
• Micro-batch Interval – "depends on the
use case"
• Data Congestion – ingest rate vs. processing rate
• Delayed Data – data quality in the absence of data
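The speaker notes mention tuning the batch interval so that late-arriving data can be ignored. A pure-Python sketch of that idea as a watermark cutoff (the function and its inputs are illustrative; Spark 2.x Structured Streaming provides this natively):

```python
def drop_late(batch, watermark_ts):
    """Discard samples whose timestamp falls behind the watermark,
    and report how many were dropped (a data-quality signal)."""
    kept = [(ts, v) for ts, v in batch if ts >= watermark_ts]
    dropped = len(batch) - len(kept)
    return kept, dropped

kept, dropped = drop_late([(100, 1), (95, 2), (101, 3)], watermark_ts=100)
```

Tracking the dropped count matters in industrial pipelines: a sudden spike in late data usually means a connectivity problem upstream, not a computation problem downstream.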
Batch Processing
Batch Processing
• Time range of data
• Aggregations
• Parallel Collections
• Partitioning of Data
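The batch-side bullets combine naturally: a time range of data is partitioned so each slice can be fetched and aggregated in parallel. A minimal sketch of the slicing step (illustrative; the framework's actual partitioning API is not shown in the deck):

```python
def partition_range(start, end, partitions):
    """Split the half-open time range [start, end) into equal slices
    so each can be read and processed by a separate task."""
    step = (end - start) / partitions
    return [(start + i * step, start + (i + 1) * step)
            for i in range(partitions)]

slices = partition_range(0, 100, 4)
```

Each slice becomes one unit of a parallel collection: Spark schedules the slices across executors, and the per-slice aggregates are combined at the end.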
Challenges
Stream Processing:
- Data Arrival – delays (Spark 2.x)
- State Persistence (Spark 2.x)
Data Providers:
- gRPC Connector (shading)
Performance Tuning:
- Parallel Collections of Data (Read/Write)
YARN Client-Mode Limitations (vs. Cluster Mode):
- Latency (distribution of JARs)
- Loading from HDFS
Shading
Performance Metrics (Batch)
Performance Metrics (Stream)
Monitoring (Grafana)
Next Steps
• Spark 2.x – Structured Streaming
• Machine Learning Pipelines
• Zeppelin as a Service – Interactive
Analysis
• Data Providers – Registration as a
Service
Q&A

Editor's Notes

  • #14 Timeseries DataFrame, Configuration DataFrame
  • #16 Timeseries DataFrame, Configuration DataFrame
  • #17 Timeseries DataFrame, Configuration DataFrame
  • #23 Some of the challenges in industrial use cases: late arrival of data – the batch interval needs to be tuned to the needs of the use case so that late data can be ignored. We also have use cases that persist intermediate state. We developed a Spark custom receiver to stream data from an in-house messaging layer, EventHub (gRPC). Some of the challenges while building it were class conflicts – the typical Java class-loading issues caused by different versions of third-party libraries. For that reason we used the approach of shading, which resolved them.
  • #27 Graphite and Grafana – the ability to monitor and visualize Spark performance and to create dashboards in a consolidated UI.
  • #28 The ability to author, test, and productionize analytics; support for machine learning pipelines; support for registering custom data providers; unit-of-measure conversions; Spark 2.0 adoption – Structured Streaming.