
Operationalize Apache Spark Analytics

Apache Spark is a unified analytics engine for large-scale, distributed data processing. Spark MLlib (Machine Learning library) is a scalable Spark implementation of common machine learning (ML) functionality, as well as associated tests and data generators.


Operationalize Apache Spark Analytics

  1. Operationalize Apache Spark Analytics. Ivan Nardini, Sr. Associate Customer Advisor, SAS Institute | CI & Analytics | ModelOps | Decisioning. Artem Glazkov, Sr. Consultant, SAS Institute | Decisioning | ModelOps | Customer Advisory
  2. SAS Governance options with Apache® Spark Analytics (Ivan Nardini): ▪ Govern Spark Models – PMML ▪ Orchestrate Spark Models – Livy. Managing the Spark ML model lifecycle, demo scenario (Artem Glazkov): ▪ Code-agnostic model repository ▪ BPM tool for model governance ▪ Capturing model performance over time
  3. Model Ops Challenges
  4. Model Ops Challenges. Model performance decay: ▪ Change in customer behavior ▪ Internal and external environment changes ▪ Track performance for models with long and short target actualization. Decisioning: ▪ Role-based approach ▪ Elaborate a clear action plan for the model ▪ Combine business rules, scripts, and user expertise in the governance process. Retrain automation: ▪ Orchestrate repetitive procedures ▪ Reduce the time gap between model development and deployment stages ▪ Identify the right model at the right moment for retraining
  5. How we meet ModelOps challenges using SAS Model Manager and SAS Workflow Manager. Repository: one place to store all models. Openness: govern SAS and open-source models. Integration with engines: including two built-in scoring engines (CAS and MAS) and external engines. Orchestration: GUI + code to automate all repetitive model management tasks. Reporting: built-in and customized model quality assessment.
  6. Why we should track model performance decay. [Chart: predictive power of the model over time (t1 to t4), showing the deployed model decaying to an alerting trigger, then a retrained and redeployed model recovering additional value.]
  7. How do you operationalize Spark Models?
  8. SAS Governance options with Apache Spark Analytics
  9. Govern Spark Models using SAS – PMML
  10. Govern Spark models – Spark PMML. PMML is one of the leading standards for statistical and data mining models. PMML enables you to develop a model on one system using one application and deploy it on another system using another application, simply by transmitting an XML configuration file.
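To make the "XML configuration file" idea concrete, here is a minimal sketch that builds a tiny PMML document with Python's standard library. The single-feature regression model is hypothetical (not one of the talk's models), but the element names follow the DMG PMML 4.4 schema:

```python
import xml.etree.ElementTree as ET

# PMML 4.4 namespace as published by the Data Mining Group (DMG).
NS = "http://www.dmg.org/PMML-4_4"

def minimal_pmml():
    """Build a tiny PMML document: one numeric input, one linear regression (y = 2x + 1)."""
    pmml = ET.Element("PMML", {"version": "4.4", "xmlns": NS})
    ET.SubElement(ET.SubElement(pmml, "Header"), "Application", {"name": "sketch"})
    dd = ET.SubElement(pmml, "DataDictionary", {"numberOfFields": "2"})
    ET.SubElement(dd, "DataField", {"name": "x", "optype": "continuous", "dataType": "double"})
    ET.SubElement(dd, "DataField", {"name": "y", "optype": "continuous", "dataType": "double"})
    model = ET.SubElement(pmml, "RegressionModel", {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", {"name": "x"})
    ET.SubElement(schema, "MiningField", {"name": "y", "usageType": "target"})
    table = ET.SubElement(model, "RegressionTable", {"intercept": "1.0"})
    ET.SubElement(table, "NumericPredictor", {"name": "x", "coefficient": "2.0"})
    return ET.tostring(pmml, encoding="unicode")

doc = minimal_pmml()
```

Any consumer that understands PMML can score such a file without ever seeing the training code; that portability between producer and consumer applications is the whole point of the standard.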
  11. Govern Spark models – Spark PMML. The JPMML-SparkML library converts Apache Spark ML pipelines to the PMML data format. It is written in Java, but the JPMML family includes Python (and R) wrapper libraries for JPMML-SparkML. For Python, the pyspark2pmml package works with the official PySpark interface: • The pyspark2pmml.PMMLBuilder Python class is an API clone of the org.jpmml.sparkml.PMMLBuilder Java class. • The Apache Spark connection is typically available in a PySpark session as the sc variable. The SparkContext class has an _jvm attribute, which gives Python users direct access to JPMML-SparkML functionality via the Py4J gateway. Then, in your Spark session, you fit your pipeline and use PMMLBuilder to create its PMML file.
  12. Govern Spark models: SAS Model Manager and PMML. [Architecture diagram: a PySpark MLlib model is registered from the Spark development environment into SAS Model Manager (GUI/REST API) in the SAS Viya governance environment; SAS Workflow Manager and the SAS Data Connector drive the In-DB Process for Spark by SAS, which scores new data in the Spark production environment via REST API.]
  13. Govern Spark models: The «PMML» workflow. In this scenario we translate the open-source model score code to SAS and utilize the Embedded Process for Hadoop. We use built-in SAS Viya capabilities to create SAS Model Manager reports, based on the scored data produced by running the Embedded Process.
  14. Govern Spark models (PMML): PMML approach Pros and Cons. PROs: • SAS in-database technology (Scoring Accelerator). CONs: • Technology bottlenecks (PMML supports a limited set of algorithms).
  15. Orchestrate Spark Models – Apache Livy
  16. Orchestrate Spark models – What's Apache Livy? Apache Livy is a service that enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, and Spark Context management, all via a simple REST interface or an RPC client library. A SAS Viya client can drive Livy through that REST interface.
  17. Govern Spark models – Apache Livy. As with Python scikit-learn models, we register the Parquet version of the Spark MLlib model and (optionally) the scoring code: • The Parquet model contains the model metadata needed to score new data in the Hadoop/Spark ecosystem. • The scoring code is a REST API recipe that the Livy Server submits to the Spark cluster to load the model and return scores. Then we use SAS Workflow Manager capabilities (job execution and REST API service tasks) to: 1. Submit the scoring REST API call 2. Get back the scoring data 3. Generate performance monitoring
  18. Govern Spark models: SAS Model Manager and Apache Livy. [Architecture diagram: a PySpark MLlib model is registered from the Spark development environment into SAS Model Manager (GUI/REST API) in the SAS Viya governance environment; SAS Workflow Manager calls Apache Livy over REST API, and Livy scores new data in the Spark production environment.]
  19. Govern Spark models: the «Apache Livy» workflow. In this scenario SAS Model Manager and SAS Workflow Manager act more like orchestrators of service tasks and user reviews. We utilize built-in SAS Viya capabilities to create Model Manager reports, based on the scored data provided by native Spark.
  20. PMML and Livy approaches: Pros and Cons. Govern Spark models (PMML): PROs: • SAS in-database technology (Scoring Accelerator). CONs: • Technology bottlenecks (PMML supports a limited set of algorithms). Orchestrate Spark Models (Livy): PROs: • Native integration (no score code manipulation or conversion). CONs: • Configuration needed (Livy server).
  21. Demo
  22. Feedback. Your feedback is important to us. Don't forget to rate and review the sessions.
