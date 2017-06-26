Optimizing Industrial Operations in Real time using the Bigdata Ecosystem Kishore Reddipalli Director - Software Engineeri...
Optimizing industrial operations using the big data ecosystem

GE Digital is undertaking a journey to optimize the reliability, availability, and efficiency of assets in the industrial sector and to converge IT and OT. To do so, GE Digital is building cloud-based products that enable customers to analyze asset data, detect anomalies, and receive recommendations for operating plants efficiently while increasing productivity. In an energy sector such as oil and gas, power, or renewables, a single plant comprises multiple complex assets, such as steam turbines, gas turbines, and compressors, to generate power. Each system contains various sensors that detect the operating conditions of the assets, generating large volumes of varied data. A highly scalable distributed environment is required to analyze such a large volume of data and provide operating insights in near real time.
In this session I will share the challenges encountered when analyzing large volumes of data and performing in-stream analysis, how we standardized industrial data on data frames, and how we tuned performance.

Published in: Technology

  • Timeseries Dataframe
    Configuration Dataframe
  • One of the challenges in industrial use cases is the late arrival of data.
    The batch interval needs to be tuned for the use case, and late data needs to be ignored.
    We also have use cases that need to persist intermediate state.

    We developed a custom Spark receiver to stream data from an in-house messaging layer, EventHub (gRPC). One of the challenges while building it was class conflicts: the typical Java class-loading issues that arise when different versions of third-party libraries end up on the same classpath. For this reason we used shading.

  • Graphite and Grafana: monitor and visualize Spark performance and create dashboards in a consolidated UI
  • Ability to author, test and productionize the analytics
    Support for machine learning pipelines
    Support for registering custom data providers
    Unit of Measure conversions
    Spark 2.0 Adoption – Structured Streaming
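
The note above about ignoring late data can be illustrated with a minimal sketch in plain Python (not Spark). It assumes each reading carries an event timestamp and that a reading is "late" when it is older than the start of the current micro-batch minus an allowed lateness. The function name `split_late`, the 30-second lateness, and the sample readings are hypothetical and do not come from the talk.

```python
from datetime import datetime, timedelta, timezone


def split_late(readings, batch_start, allowed_lateness):
    """Partition one micro-batch into on-time and late readings.

    A reading counts as late when its event timestamp is older than
    batch_start - allowed_lateness. Late readings are set aside rather
    than folded into the current result.
    """
    cutoff = batch_start - allowed_lateness
    on_time, late = [], []
    for reading in readings:
        (on_time if reading["ts"] >= cutoff else late).append(reading)
    return on_time, late


# Hypothetical micro-batch that starts at 12:25:00 UTC.
batch_start = datetime(2015, 7, 23, 12, 25, tzinfo=timezone.utc)
readings = [
    {"tagId": "temperature", "ts": batch_start + timedelta(seconds=2), "v": 425.1},
    {"tagId": "temperature", "ts": batch_start - timedelta(seconds=5), "v": 424.8},
    {"tagId": "temperature", "ts": batch_start - timedelta(minutes=3), "v": 419.0},
]

# Assumed tolerance: readings up to 30 seconds old are still accepted.
on_time, late = split_late(readings, batch_start, timedelta(seconds=30))
print(len(on_time), "on time,", len(late), "ignored as late")
```

How large the allowed lateness should be depends on the use case, in the same way the notes say the batch interval does: a wider tolerance keeps more data but delays or complicates the result for each interval.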

    • Optimizing industrial operations using the big data ecosystem

    1. Optimizing Industrial Operations in Real Time Using the Big Data Ecosystem
        Kishore Reddipalli, Director - Software Engineering, GE Digital
    2. Agenda
        • Use case
        • Spark as Analytic Runtime
        • Optimization Framework
        • Streaming and Batch Analysis
        • Challenges
        • QA
    3. GE Mission
        • Improve Asset Reliability and Availability
        • Monitor Mission Critical Events
        • Optimize the Manufacturing Process
        • Optimize Fleet Operations
        • Reduce Unplanned Downtime
    4. Use case: Power Plant Efficiency
        • Heat rate in the context of power plants can be thought of as the input needed to produce one unit of output. It generally indicates the amount of fuel required to generate one unit of electricity.
        • Performance parameters tracked for any thermal power plant, like efficiency, fuel costs, plant load factor, emissions level, etc., are a function of the station heat rate and can be linked directly to it.
        Source: https://en.wikipedia.org/wiki/Heat_rate_(efficiency)
    5. Data Volume
        • In aviation, a GE jet engine produces 5000 data points that can be analyzed per second to optimize flight times.
        • In power, 500000 data points need to be analyzed to generate the outcomes. The data points are generated from ~1000 sensors.
        • Data generated at high volume and rate by thousands of pieces of GE equipment needs to be stored and analyzed at petabyte scale.
    6. Predix – Industrial Internet platform that can be leveraged to build industrial applications
        www.predix.io
    7. Architecture
    8. Spark as an Analytic Runtime
        • REST API (Spark Job Server)
        • Security
        • Multi-tenancy
        • Optimization Framework
        • Spark SQL
        • Spark Streaming
    9. Optimization Framework
        Need for the framework: to simplify and bring consistency to the development of analytics, and to abstract the complexity of data connectivity and of processing large volumes of data.
        • API
        • Schema
        • Data Providers (Input / Output)
        • Data Frames (Variety of Data – Timeseries, Asset, Configuration)
        • Parallelism (Partitioning of data for processing)
        • Multi-Mode (Stream vs Batch)
        • Multi-Stream Source
        • UDF (Aggregation, Interpolation, Unit of Measure)
    10. Optimization Framework - Architecture
    11. Data Providers
        Data connectors that fetch data from a variety of data sources. Examples:
        1. File (HDFS)
        2. HTTP – RESTful services (Asset, Timeseries, any business services)
        3. Database (Cassandra, Postgres)
        4. Messaging (Kafka, Kinesis, EventHub)
    12. Timeseries – Dataframe Schema
        { "tags": [ { "tagId": "temperature", "data": [ { "q": "3", "ts": "2015-07-23T12:25:00.000-0000", "v": "425.07935"
    13. Timeseries DataFrame
    14. Asset Dataframe - Schema
        "tagClassifications": [ { "id": "OO-BL000472_Tag_Temperature_Classification_ID", "name": "OO-BL000472_Tag_Temperature_Classification_name", "description": "This is tag Temperature Classification description", "unitGroup": "temperature", "properties": [ { "id": "low", "value": [ 80 ], "type": "double" }, { "id": "high", "value": [ 120 ], "type": "double" }, { "id": "threshold", "value": [ 100
    15. Asset Dataframe
    16. Sample Analytic
    17. Stream Processing – Data Flow
    18. Stream Processing
        • Micro Batch Interval
        • Continuous Application
        • Multi Stream Sources
        • Tenant-aware data pipeline
        • Context-based data pipeline
        • Window-based Slicing – Moving Average
    19. Stream Processing - Pointers
        • Micro Batch Interval – "Depends on use case"
        • Data Congestion – Instream vs Processing
        • Delayed Data – Quality in absence of data
    20. Batch Processing
    21. Batch Processing
        • Time range of data
        • Aggregations
        • Parallel Collections
        • Partitioning of Data
    22. Challenges
        Stream Processing:
        - Data Arrival – Delays (Spark 2.x)
        - State Persistence (Spark 2.x)
        Data Providers:
        - gRPC Connector (Shading)
        Performance Tuning:
        - Parallel Collections of Data (Read/Write)
        YARN-Client Mode Limitations: (Cluster Mode)
        - Latency (Distribution of Jars)
        - Loading from HDFS
    23. Shading
    24. Performance Metrics (Batch)
    25. Performance Metrics (Stream)
    26. Monitoring (Grafana)
    27. Future Next Steps
        • Spark 2.x – Structured Streaming
        • Machine Learning Pipelines
        • Zeppelin as a Service – Interactive Analysis
        • Data Providers – Registration as a Service
    28. QA
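
The timeseries schema on slide 12 and the window-based moving average on slide 18 can be illustrated with a minimal sketch in plain Python rather than Spark. Only the first data point below comes from the slide; the other two points, the function names `flatten` and `moving_average`, and the window size of 2 are assumptions made for illustration. The talk describes this kind of processing being done on Spark data frames, so the sketch only shows the shape of the computation.

```python
import json
from collections import deque
from datetime import datetime

# Timestamp layout used in the slide's sample, e.g. 2015-07-23T12:25:00.000-0000
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

# First point is from the slide; the remaining two are hypothetical.
payload = json.loads("""
{"tags": [{"tagId": "temperature", "data": [
    {"q": "3", "ts": "2015-07-23T12:25:00.000-0000", "v": "425.07935"},
    {"q": "3", "ts": "2015-07-23T12:26:00.000-0000", "v": "426.5"},
    {"q": "3", "ts": "2015-07-23T12:27:00.000-0000", "v": "428.0"}
]}]}
""")


def flatten(doc):
    """Turn the nested tags/data document into one flat row per data point."""
    rows = []
    for tag in doc["tags"]:
        for point in tag["data"]:
            rows.append({
                "tagId": tag["tagId"],
                "ts": datetime.strptime(point["ts"], TS_FORMAT),
                "v": float(point["v"]),   # value arrives as a string
                "q": int(point["q"]),     # quality code arrives as a string
            })
    return rows


def moving_average(values, window):
    """Yield the mean of the most recent `window` values at each step."""
    recent = deque(maxlen=window)
    for value in values:
        recent.append(value)
        yield sum(recent) / len(recent)


rows = sorted(flatten(payload), key=lambda row: row["ts"])
averages = list(moving_average((row["v"] for row in rows), window=2))
for row, avg in zip(rows, averages):
    print(row["tagId"], row["ts"].isoformat(), row["v"], round(avg, 3))
```

Flattening to one row per (tag, timestamp) pair is one plausible tabular layout for such a document; the layout actually used in the talk's Timeseries DataFrame slide is not recoverable from the transcript.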
