
GE Aviation Spark Application Experience: Porting Analytics into PySpark ML Pipelines


GE is a world leader in the manufacture of commercial jet engines, offering products for many of the best-selling commercial airframes. With more than 33,000 engines in service, GE Aviation has a long history of developing analytics for monitoring its commercial engine fleets. In recent years, GE Aviation Digital has developed advanced analytic solutions for engine monitoring, with the target of improving detection and reducing false alerts compared to conventional analytic approaches. The advanced analytics are implemented in a real-time monitoring system which notifies GE’s Fleet Support team on a per-flight basis. These analytics are developed and validated using large historical datasets.

Analytic tools such as SQL Server and MATLAB were used until recently, when GE’s data was moved to an Apache Spark environment. Consequently, our advanced analytics are now being migrated to Spark, where there should also be performance gains with bigger data sets. In this talk we will share experiences of converting our advanced algorithms to custom Spark ML pipelines, as well as outlining various case studies.

With Honor Powrie and Peter Knight



  1. GE Aviation Digital: Experience Porting Analytics into PySpark ML Pipelines. Prof Honor Powrie, Dr Peter Knight. 4 Oct 2018. Session hashtag: #SAISExp12
  2. Outline
     • GE Aviation - commercial engines, data and analytics overview
     • Historic analytic development process
     • Converting analytics to PySpark ML Pipelines
     • Plotting ML Pipelines
     • The full analytic lifecycle in Spark
     • Conclusions
  3. General Electric - Aviation
     • 40k employees
     • $27.4B revenue (2017)
     • >33k commercial engines
     "Every two seconds, an aircraft powered by GE technology takes off somewhere in the world"
  4. Fleet data volumes - historical, today, tomorrow
     • Historical: 1 KB/flight (30 parameters, 3 snapshots per flight); 100K flights per day; < 50 GB per year
     • Today: 200 KB/flight (1000 parameters, 20 snapshots per flight); 100K flights per day; ~10 TB per year
     • Tomorrow: 1 GB/flight (1000 parameters @ 10 Hz over a 3.5 hr flight); 100K flights per day; ~100 TB per day, ~50 PB per year
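     A quick arithmetic check of how the per-flight sizes scale to the totals quoted, assuming the figures pair with the eras as listed above:

     $$10^5\ \tfrac{\text{flights}}{\text{day}} \times 1\ \text{KB/flight} \approx 100\ \text{MB/day} \approx 36\ \text{GB/year} \quad (< 50\ \text{GB/year})$$
     $$10^5 \times 200\ \text{KB} \approx 20\ \text{GB/day} \approx 7\ \text{TB/year} \quad (\sim 10\ \text{TB/year})$$
     $$10^5 \times 1\ \text{GB} = 100\ \text{TB/day} \approx 36\ \text{PB/year} \quad (\sim 50\ \text{PB/year})$$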
  5. ML and Analytic Applications - GE Commercial Engines Operational Lifecycle
     • Fleet Monitoring
     • Fleet Management
     • Time on Wing
     • Enterprise Workscope
     • Material Prediction
     • Shop Visit Forecast
     • Supply Chain Management
     • Digital Twin Models
     • Optimisation
     • Borescope Imaging
  6. ProDAPS - GE Aviation's custom ML library
     • Based on Probabilistic Graphical Models
     • Developed to tackle some of the key challenges of real-world data
     • Used extensively in ML applications for GE commercial engines
     • Being recoded in C++ and Python and integrated with Spark
     Examples: Fleet Segmentation; Multivariate Models and Anomaly Detection; Diagnostic Reasoning About Functional Elements
  7. Historic Analytic Development and Deployment Process
     Develop (Data Scientists): data moved from Greenplum to SQL Server; MATLAB data exploration; MATLAB analytics; MATLAB-generated metrics; toll gate reviews; Windows Server environment.
     Hand-off package: configuration file, test cases, model files, functional spec.
     Deploy (Software Engineers): analytic recoded in Java; XML configuration; test in QA environment; deploy to pre-production, then production; Oracle database; Predix run-time; monitor in production (Spotfire).
     Aim: convert the entire model building & deployment workflow to ML Pipelines.
  8. Why we like Spark ML Pipelines
     • Easy to string together pre-processing, ML and alerting stages (see the sketch below)
     • Same pipeline - and code - for analytic development (model building), evaluation and deployment (run time)
     • Extensive library of existing analytics and data manipulation tools
     • Extensible for our own analytics using Python, in a standard framework
     • Self-describing - explainParams() shows you how to use it
     • Scales to big data
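     For readers new to the API, a minimal sketch of stringing stages together and inspecting them. The feature columns and the LogisticRegression stage are illustrative stand-ins, not GE's actual analytics.

     from pyspark.ml import Pipeline
     from pyspark.ml.feature import VectorAssembler, StandardScaler
     from pyspark.ml.classification import LogisticRegression

     # Hypothetical feature columns standing in for real engine parameters
     assembler = VectorAssembler(inputCols=["egt_margin", "fuel_flow"], outputCol="features_raw")
     scaler = StandardScaler(inputCol="features_raw", outputCol="features")   # pre-processing stage
     lr = LogisticRegression(featuresCol="features", labelCol="label")        # ML stage

     pipeline = Pipeline(stages=[assembler, scaler, lr])

     # Each stage is self-describing
     print(lr.explainParams())

     # The same pipeline object covers model building and run-time scoring:
     # model = pipeline.fit(training_df)
     # scored = model.transform(new_flights_df)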
  9. Converting our Workflow to Spark ML Pipelines
     • The pipeline includes various custom modules before & after ML, e.g. normalisation & alerting
     • Some analytics are more complex than row-by-row processing, e.g. Median Trend: group the data by engine (and other configurable conditions) and apply a sliding window (by date or records) sorted by date (see the sketch below)
     • For our ML algorithms we have overcome a number of challenges...
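     As a concrete illustration of the Median Trend pattern described above, a sketch using Spark window functions. The column names (engine_id, flight_date, egt_margin) and the 21-record window are assumptions, and percentile_approx stands in for whatever median the production analytic actually computes; depending on the Spark version an exact percentile or a UDF may be needed instead.

     from pyspark.sql import functions as F
     from pyspark.sql.window import Window

     # Sliding window per engine, sorted by date, covering the previous 20 records plus the current one
     w = (Window.partitionBy("engine_id")          # group by engine (plus other conditions if configured)
                .orderBy("flight_date")
                .rowsBetween(-20, 0))

     # Rolling (approximate) median of a monitored parameter
     trended = df.withColumn(
         "egt_margin_median_trend",
         F.expr("percentile_approx(egt_margin, 0.5)").over(w))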
  10. Converting ProDAPS to Spark ML
      Our ProDAPS ML analytic seems more complex than many already in Spark. For example:
      • Once the model is built, many different types of inference can be configured
      • Most existing ML analytics use Dense Vectors, but these have limitations, e.g. they can't handle null records
      • We want to be able to filter the data between pipeline stages
      • We are still on Spark 2.2, so had to write our own method for saving and loading custom stages (a sketch of this kind of workaround follows below)
      • No export/import commands for porting pipelines between Spark clusters
      • Tedious to add params - see next slide
      • Our full wish-list is on JIRA at SPARK-19498
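      On the save/load point: Spark 2.3 added DefaultParamsWritable / DefaultParamsReadable for Python stages, but on 2.2 something hand-rolled is needed. Below is a minimal sketch of that kind of workaround, persisting a custom stage's Params as local JSON; it is an assumption about the approach, not GE's actual implementation, and a real version would also need to handle model state and distributed storage.

      import json

      def save_stage_params(stage, path):
          # Persist the stage's current Param values as plain JSON (values must be JSON-serialisable)
          params = {p.name: v for p, v in stage.extractParamMap().items()}
          with open(path, "w") as f:
              json.dump(params, f)

      def load_stage_params(stage_cls, path):
          # Rebuild the stage from the saved Param values
          with open(path) as f:
              params = json.load(f)
          return stage_cls(**params)

      # e.g. save_stage_params(straight_line, "straight_line.json")
      #      stage = load_stage_params(StraightLine, "straight_line.json")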
  11. Example Custom Transformer Showing Verbose Param Code
      Straight line: y = m x + c
      Current code required (~50 lines):

      from pyspark import keyword_only
      from pyspark.ml.param.shared import Param, Params, TypeConverters
      from pyspark.ml import Transformer

      class StraightLine(Transformer):

          @keyword_only
          def __init__(self, inputCol=None, outputCol=None, m=1.0, c=0.0):
              super(StraightLine, self).__init__()
              self._setDefault(inputCol=None, outputCol=None, m=1.0, c=0.0)
              kwargs = self._input_kwargs
              self.setParams(**kwargs)

          @keyword_only
          def setParams(self, inputCol=None, outputCol=None, m=1.0, c=0.0):
              kwargs = self._input_kwargs
              return self._set(**kwargs)

          inputCol = Param(Params._dummy(), "inputCol", "the input column name (your X). (string)",
                           typeConverter=TypeConverters.toString)

          def setInputCol(self, value):
              return self._set(inputCol=value)

          def getInputCol(self):
              return self.getOrDefault(self.inputCol)

          outputCol = Param(Params._dummy(), "outputCol", "the output column name (your Y). (string)",
                            typeConverter=TypeConverters.toString)

          def setOutputCol(self, value):
              return self._set(outputCol=value)

          def getOutputCol(self):
              return self.getOrDefault(self.outputCol)

          m = Param(Params._dummy(), "m", "the slope of the line. (float)",
                    typeConverter=TypeConverters.toFloat)

          def setM(self, value):
              return self._set(m=value)

          def getM(self):
              return self.getOrDefault(self.m)

          c = Param(Params._dummy(), "c", "the y offset when x = 0. (float)",
                    typeConverter=TypeConverters.toFloat)

          def setC(self, value):
              return self._set(c=value)

          def getC(self):
              return self.getOrDefault(self.c)

          def _transform(self, dataset):
              input_col = self.getInputCol()
              if not input_col:
                  raise Exception("inputCol not supplied")
              output_col = self.getOutputCol()
              if not output_col:
                  raise Exception("outputCol not supplied")
              return dataset.selectExpr("*", str(self.getM()) + " * " + input_col +
                                        " + " + str(self.getC()) + " AS " + output_col)

      For each parameter, the default value is set 3 times and the parameter name is entered 9 times!
  12. Example Custom Transformer Showing Verbose Param Code (continued)
      Straight line: y = m x + c
      Current code required (~50 lines): as on the previous slide, with the default value set 3 times and the parameter name entered 9 times per parameter.
      Proposed code (~10 lines, clearer & easier to maintain):

      from pyspark import keyword_only
      from pyspark.ml.param.shared import Param, Params, TypeConverters, addParam
      from pyspark.ml import Transformer

      class StraightLine(Transformer):
          addParam("inputCol", "specify the input column name (your X).", String, None)
          addParam("outputCol", "specify the output column name (your Y).", String, None)
          addParam("m", "specify m - the slope of the line.", Float, 1.0)
          addParam("c", "specify c - the y offset when x = 0.", Float, 0.0)

          def _transform(self, dataset):
              return dataset.selectExpr("*", str(self.getM()) + " * " + self.getInputCol() +
                                        " + " + str(self.getC()) + " AS " + self.getOutputCol())

      • We propose adding a method that generates all this boilerplate in one call - specify the param name, description, datatype, default value and required flag
      • Ideally explainParams() should also show the data types
  13. Display Pipeline
      • We created code to plot any Spark ML pipeline using bokeh, showing Params on hover, both before and after training
      • Available on GitHub at: https://github.com/GeneralElectric/SparkMLPipelineDisplay
      • The example shown is based on the example pipeline in the Spark documentation: https://spark.apache.org/docs/2.2.0/ml-pipeline.html
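      For context, the referenced example in the Spark 2.2 ML Pipeline documentation is, roughly, the tokenizer / hashing-TF / logistic-regression text pipeline; a condensed version is sketched below so the "before training" and "after training" displays have something concrete behind them.

      from pyspark.ml import Pipeline
      from pyspark.ml.feature import Tokenizer, HashingTF
      from pyspark.ml.classification import LogisticRegression

      tokenizer = Tokenizer(inputCol="text", outputCol="words")
      hashingTF = HashingTF(inputCol="words", outputCol="features")
      lr = LogisticRegression(maxIter=10, regParam=0.001)
      pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

      # "Before training": the Pipeline and its stages/Params can be displayed as-is.
      # "After training": fitting yields a PipelineModel whose fitted stages can be displayed too.
      # model = pipeline.fit(training_df)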
  14. We can now do the whole data science workflow in Spark, in a notebook environment
      1. Start from an existing configuration file
      2. Explore the data, e.g. with an interactive bokeh plot
      3. Generate the ML Pipeline from config & display it
      4. Build the model(s) and display them
      5. Calculate metrics, e.g. an ROC curve (see the evaluator sketch below)
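      Step 5 can use the built-in evaluator. A minimal sketch, assuming a binary "label" column, a held-out test_df, and a fitted pipeline model named model (all names are illustrative).

      from pyspark.ml.evaluation import BinaryClassificationEvaluator

      # Score the held-out set with the fitted pipeline, then compute area under the ROC curve
      predictions = model.transform(test_df)
      evaluator = BinaryClassificationEvaluator(labelCol="label",
                                                rawPredictionCol="rawPrediction",
                                                metricName="areaUnderROC")
      print("AUROC:", evaluator.evaluate(predictions))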
  15. Conclusions
      • GE Aviation uses analytics across the entire commercial engine operational lifecycle
      • Over the past few decades we have seen an explosion in data; the trend is set to continue, and today we have discussed solutions to manage it
      • Overcame the real-life challenges of implementing custom Python ML Pipeline analytics
      • Provided feedback on ways to make adding custom Python ML libraries easier
      • Developed an ML Pipeline display utility, shared with the community
      • Completed the entire analytic development and deployment lifecycle in Spark
      • Still working to port all analytics to Spark
      • Production deployment environment not yet in Spark
  16. Questions?
