Apache Spark is a unified analytics engine for large-scale, distributed data processing. Spark MLlib (Machine Learning library) is a scalable Spark implementation of common machine learning (ML) functionality, as well as associated tests and data generators.
Operationalize Apache Spark Analytics
1.
2. Operationalize Apache Spark Analytics
Ivan Nardini
Sr. Associate Customer Advisor, SAS Institute | CI & Analytics | ModelOps | Decisioning
Artem Glazkov
Sr. Consultant, SAS Institute | Decisioning | ModelOps | Customer Advisory
3. Operationalize Apache Spark Analytics
Ivan Nardini: SAS governance options with Apache® Spark Analytics
▪ Govern Spark Models – PMML
▪ Orchestrate Spark Models – Livy
Artem Glazkov: Managing Spark ML model lifecycle demo scenario:
▪ Code-agnostic model repository
▪ BPM tool for model governance
▪ Capturing model performance over time
5. ModelOps Challenges
Model performance decay:
▪ Change in customer behavior
▪ Internal and external environment changes
▪ Track performance for models with long and short target actualization
Decisioning:
▪ Role-based approach
▪ Elaborate a clear action plan for the model
▪ Combine business rules, scripts, and user expertise in the governance process
Retrain automation:
▪ Orchestrate repetitive procedures
▪ Reduce the time gap between model development and deployment stages
▪ Identify the right model for retraining at the right moment
6. How we meet ModelOps challenges using SAS Model Manager and SAS Workflow Manager
▪ Integration with engines: two built-in scoring engines (CAS and MAS), plus external engines
▪ Openness: GUI + code to govern SAS and open-source models
▪ Repository: one place to store all models
▪ Reporting: built-in and customized model quality assessment
▪ Orchestration: automate all repetitive model management tasks
7. Why we should track model performance decay
[Chart: predictive power of the model over time (t1–t4). The deployed model's predictive power decays until an alerting trigger fires; the retrained and redeployed model recovers the lost predictive power (additional value).]
8. How do you operationalize Spark Models?
11. PMML is one of the leading standards for statistical and data mining models.
PMML enables developing a model on one system using one application and deploying it on another system using a different application, simply by transmitting an XML configuration file.
Govern Spark models – Spark PMML
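To make the interchange concrete, here is a hand-written, minimal PMML fragment for a one-variable linear regression, parsed with Python's standard library. The field names and coefficients are invented for illustration; a real file would be produced by an exporter such as JPMML-SparkML.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written PMML document (illustrative only):
PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary numberOfFields="2">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <RegressionTable intercept="0.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

ns = {"p": "http://www.dmg.org/PMML-4_4"}
root = ET.fromstring(PMML)
table = root.find(".//p:RegressionTable", ns)
intercept = float(table.get("intercept"))
coef = float(table.find("p:NumericPredictor", ns).get("coefficient"))

def score(x1):
    # The consuming system needs only the XML to reproduce the model:
    # y = intercept + coef * x1
    return intercept + coef * x1
```

The scoring application never sees the training system: everything it needs travels in the XML file.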
12. Govern Spark models – Spark PMML
The JPMML-SparkML library converts Apache Spark ML pipelines to the PMML data format. It is written in Java, but the JPMML family includes Python (and R) wrapper libraries for JPMML-SparkML.
For Python, we have the pyspark2pmml package, which works with the official PySpark interface:
• The pyspark2pmml.PMMLBuilder Python class is an API clone of the org.jpmml.sparkml.PMMLBuilder Java class.
• The Apache Spark connection is typically available in a PySpark session as the sc variable. The SparkContext class has a _jvm attribute, which gives Python users direct access to JPMML-SparkML functionality via the Py4J gateway.
Then, in your Spark session, you fit your pipeline and use PMMLBuilder to create its PMML file.
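A minimal sketch of that fit-then-export flow, assuming a SparkSession and the pyspark2pmml package are available. The feature and label column names are hypothetical; imports are kept inside the function so the sketch can be loaded without a Spark install.

```python
def export_pipeline_to_pmml(spark, train_df, pmml_path="model.pmml"):
    """Fit a simple Spark ML pipeline and export it to a PMML file.

    Sketch only: assumes pyspark and pyspark2pmml are installed and that
    train_df has hypothetical columns "x1", "x2", and "label".
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark2pmml import PMMLBuilder

    # Assemble features and fit a small pipeline.
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    lr = LogisticRegression(labelCol="label", featuresCol="features")
    model = Pipeline(stages=[assembler, lr]).fit(train_df)

    # PMMLBuilder mirrors org.jpmml.sparkml.PMMLBuilder via the Py4J gateway.
    PMMLBuilder(spark.sparkContext, train_df, model).buildFile(pmml_path)
    return pmml_path
```

The resulting .pmml file is what gets registered into the model repository in the scenario below.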
13. Govern Spark models: SAS Model Manager and PMML
[Architecture diagram: a PySpark MLlib model is registered from the Spark development environment into SAS Model Manager (GUI/REST API) in the SAS Viya governance environment. SAS Workflow Manager and the SAS Data Connector drive the scoring of new data in the Spark production environment through the SAS In-DB process for Spark, with results flowing back over the REST API.]
14. In this scenario we translate the open-source model score code to SAS and utilize the Embedded Process for Hadoop.
We use built-in SAS Viya capabilities for creating SAS Model Manager reports, based on the scored data produced by running the Embedded Process.
Govern Spark models: The «PMML» workflow
15. PMML approach: Pros and Cons
Govern Spark models (PMML)
PROs:
• SAS In-Database technology (Scoring Accelerator)
CONs:
• Technology bottlenecks (PMML supports a limited set of algorithms)
17. Orchestrate Spark models – What’s Apache Livy?
Apache Livy is a service that enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, and Spark Context management, all via a simple REST interface or an RPC client library.
[Diagram: SAS Viya acts as the client of the Livy REST interface.]
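As a sketch of that REST interface, the snippet below builds the JSON body Livy expects for running a code snippet in an existing interactive session (POST /sessions/{id}/statements) and submits it with the standard library. The host, port, and session id are hypothetical.

```python
import json
from urllib import request

LIVY_URL = "http://livy-server:8998"  # hypothetical Livy host and default port

def statement_payload(code):
    """JSON body for POST /sessions/{id}/statements: the code to run on the cluster."""
    return {"code": code}

def submit_statement(session_id, code):
    """Submit a code snippet to a running Livy session and return Livy's response."""
    body = json.dumps(statement_payload(code)).encode("utf-8")
    req = request.Request(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)  # includes the statement id, used to poll for results
```

Because the interface is plain HTTP + JSON, any client that can issue REST calls (such as a workflow service task) can drive Spark this way.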
18. Govern Spark models – Apache Livy
As with Python scikit-learn models, we register the Parquet version of the Spark MLlib model and (optionally) the scoring code:
• The Parquet model contains the model metainfo needed to score new data in the Hadoop/Spark ecosystem.
• The scoring code is a REST API recipe that the Livy server submits to the Spark cluster to load the model and return scores.
Then we use SAS Workflow Manager capabilities (job execution and the REST API service) to:
1. Submit the scoring REST API call
2. Get back the scored data
3. Generate performance monitoring
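The hand-off between steps 1 and 2 can be sketched as a polling loop against Livy's batch API (GET /batches/{id}/state): the workflow submits the scoring job, waits for a terminal state, and only then moves on to report generation. Host and port are hypothetical.

```python
import json
import time
from urllib import request

LIVY_URL = "http://livy-server:8998"  # hypothetical Livy host and default port

# Terminal Livy batch states; non-terminal states include "starting" and "running".
TERMINAL_STATES = {"success", "dead", "killed"}

def is_finished(state):
    """True once a Livy batch has reached a terminal state."""
    return state in TERMINAL_STATES

def wait_for_batch(batch_id, poll_secs=5):
    """Poll GET /batches/{id}/state until the scoring job finishes, then return the state."""
    while True:
        with request.urlopen(f"{LIVY_URL}/batches/{batch_id}/state") as resp:
            state = json.load(resp)["state"]
        if is_finished(state):
            return state
        time.sleep(poll_secs)
```

A workflow service task would branch on the returned state: "success" proceeds to fetch the scored data and build monitoring reports, while "dead" or "killed" routes to an error review.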
19. Govern Spark models: SAS Model Manager and Apache Livy
[Architecture diagram: the PySpark MLlib model is registered from the Spark development environment into SAS Model Manager (GUI/REST API) in the SAS Viya governance environment. SAS Workflow Manager calls Apache Livy over the REST API to score new data in the Spark production environment.]
20. In this scenario SAS Model Manager and SAS Workflow Manager act more as orchestrators of service tasks and user reviews.
We utilize built-in SAS Viya capabilities for creating Model Manager reports, based on the scored data provided by native Spark.
Govern Spark models: The «Apache Livy» workflow
21. PMML and Livy approaches: Pros and Cons
Govern Spark models (PMML)
PROs:
• SAS In-Database technology (Scoring Accelerator)
CONs:
• Technology bottlenecks (PMML supports a limited set of algorithms)
Orchestrate Spark Models (Livy)
PROs:
• Native integration (no score code manipulation or conversion)
CONs:
• Configuration needed (Livy server)