
Apache Spark MLlib 2.0 Preview: Data Science and Production


This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science—both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.

(1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models—and entire Pipelines including feature transformations—can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.

(2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.

(3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.

Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.



  1. Apache Spark MLlib 2.0 Preview: Data Science and Production
     Joseph K. Bradley, June 8, 2016
  2. Who am I?
     • Apache Spark committer & PMC member
     • Software Engineer @ Databricks
     • Ph.D. in Machine Learning from Carnegie Mellon University
  3. MLlib trajectory
     [Chart: commits per release, v0.8 through v2.0, with milestones: the Scala/Java API as the primary API for MLlib, then the Python API, the R API, and the DataFrame-based API for MLlib]
  4. DataFrame-based API for MLlib
     a.k.a. the "Pipelines" API, with utilities for constructing ML Pipelines.
     In 2.0, the DataFrame-based API will become the primary API for MLlib.
     • Voted by the community
     The RDD-based API will enter maintenance mode.
     • Still maintained with bug fixes, but no new features
     • org.apache.spark.mllib, pyspark.mllib
  5. Goals for MLlib in 2.0
     Major initiatives:
     • Generalized Linear Models (exploratory analysis)
     • Python & R API expansion (exploratory analysis)
     • ML persistence: saving & loading models & Pipelines (production)
     Also in 2.0:
     • Sketching algorithms
     For details, see the SPARK-12626 roadmap JIRA + mailing list discussions.
  6. MLlib for exploratory analysis
     • Generalized Linear Models (GLMs)
     • Python & R APIs
  7. Generalized Linear Models (GLMs)
     Arguably the most important class of models for ML:
     • Logistic regression
     • Linear regression
     In Spark 1.6 & earlier: logistic and linear regression.
     New: many other types of models; model summary statistics.
     [Figure: binary classification example with +1/-1 labels, spam vs. ham]
  8. GLMs in 2.0
     GeneralizedLinearRegression
     • Max 4096 features
     • Solved using Iteratively Reweighted Least Squares (IRLS)
     LinearRegression & LogisticRegression
     • Millions of features
     • Solved using L-BFGS / OWL-QN
     Fixes for corner cases, e.g., handling invalid labels gracefully.
     Supported link functions by model family:
     • Gaussian: Identity, Log, Inverse
     • Binomial: Logit, Probit, CLogLog
     • Poisson: Log, Identity, Sqrt
     • Gamma: Inverse, Identity, Log
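As a rough illustration of the family/link machinery above, here is a plain-Python toy (all names are invented for this sketch; this is not Spark's API): each link g maps the mean μ to the linear predictor η, and a one-feature Poisson GLM with log link can be fit with Newton steps, a scalar stand-in for the IRLS solver the slide mentions.

```python
import math

# Toy versions of some links from the table above; names are ours, not Spark's.
# Each entry: (mean function g^-1: eta -> mu, link function g: mu -> eta)
LINKS = {
    "identity": (lambda eta: eta, lambda mu: mu),
    "log":      (math.exp, math.log),
    "logit":    (lambda eta: 1.0 / (1.0 + math.exp(-eta)),
                 lambda mu: math.log(mu / (1.0 - mu))),
    "inverse":  (lambda eta: 1.0 / eta, lambda mu: 1.0 / mu),
}

def fit_poisson_log(xs, ys, iters=25):
    """Fit a one-feature, no-intercept Poisson GLM with log link by
    Newton's method -- the scalar analogue of an IRLS solver."""
    beta = 0.0
    for _ in range(iters):
        mus = [math.exp(beta * x) for x in xs]                       # mu = g^-1(eta)
        score = sum((y - mu) * x for x, y, mu in zip(xs, ys, mus))   # gradient
        info = sum(mu * x * x for x, mu in zip(xs, mus))             # Fisher information
        beta += score / info                                         # Newton step
    return beta
```

For example, if `ys` are generated exactly as `exp(0.7 * x)`, the fit recovers the coefficient 0.7. Spark's `GeneralizedLinearRegression` does the multivariate version of this, with the family and link chosen from the table.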
  9. Python & R APIs for MLlib
     Goal: expand ML APIs for languages critical to data science.
     Python:
     • Clustering algorithms: Bisecting K-Means, Gaussian Mixtures, LDA
     • Meta-algorithms: OneVsRest, TrainValidationSplit
     • GeneralizedLinearRegression
     • Feature transformers: ChiSqSelector, MaxAbsScaler, QuantileDiscretizer
     • Model inspection: summaries for Logistic Regression, Linear Regression, GLMs
     R:
     • Regression & classification: Generalized Linear Regression, AFT survival regression
     • Clustering: K-Means
  10. MLlib in production
      • ML persistence
      • Customizing Pipelines
  11. Why ML persistence?
      Data science: prototype (Python/R); create model.
      Software engineering: re-implement model for production (Java); deploy model.
  12. Why ML persistence?
      Data science: prototype (Python/R); create Pipeline:
      • Extract raw features
      • Transform features
      • Select key features
      • Fit multiple models
      • Combine results to make prediction
      Software engineering: re-implement Pipeline for production (Java); deploy Pipeline.
      Costs of re-implementation:
      • Extra implementation work
      • Different code paths
      • Synchronization overhead
  13. With ML persistence...
      Data science: prototype (Python/R); create Pipeline; persist the model or Pipeline, e.g. model.save("s3n://...")
      Software engineering: load the Pipeline (Scala/Java) with Model.load("s3n://...") and deploy in production.
  14. ML persistence status
      [Diagram: a Pipeline with stages for text preprocessing, feature generation, and Generalized Linear Regression, plus model tuning; the unfitted Pipeline is the "recipe", the fitted Model is the "result"; annotations mark what was supported in MLlib's RDD-based API]
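To make the "recipe" vs. "result" distinction concrete, here is a plain-Python sketch (toy classes of our own invention, not the Spark API): an unfitted Pipeline is just an ordered list of stages; calling fit() runs each Estimator stage and returns a fitted PipelineModel that only transforms.

```python
class Tokenizer:
    """A Transformer: transforms data directly, nothing to fit."""
    def transform(self, docs):
        return [d.lower().split() for d in docs]

class CountFeaturizer:
    """An Estimator: fit() learns a vocabulary and returns a Model."""
    def fit(self, token_lists):
        vocab = sorted({t for toks in token_lists for t in toks})
        return CountModel(vocab)

class CountModel:
    """The fitted counterpart of CountFeaturizer."""
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, token_lists):
        return [[toks.count(v) for v in self.vocab] for toks in token_lists]

class Pipeline:
    """The unfitted 'recipe': an ordered list of stages."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator stage: fit it first
                stage = stage.fit(data)
            fitted.append(stage)
            data = stage.transform(data)   # feed output to the next stage
        return PipelineModel(fitted)

class PipelineModel:
    """The fitted 'result': only transforms, ready for serving."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data
```

In real MLlib, both the recipe (Pipeline) and the result (PipelineModel) can be persisted, which is what the status diagram is tracking.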
  15. ML persistence status
      Near-complete coverage in all Spark language APIs:
      • Scala & Java: complete
      • Python: complete except for 2 algorithms
      • R: complete for existing APIs
      Single underlying implementation of models.
      Exchangeable data format:
      • JSON for metadata
      • Parquet for model data (coefficients, etc.)
  16. Customizing ML Pipelines
      MLlib 2.0 will include:
      • 29 feature transformers (Tokenizer, Word2Vec, ...)
      • 21 models (for classification, regression, clustering, ...)
      • Model tuning & evaluation
      But some applications require customized Transformers & Models.
  17. Options for customization
      Extend the abstractions:
      • Transformer
      • Estimator & Model
      • Evaluator
      UnaryTransformer: input → output, e.g.:
      • Tokenizer: text → sequence of words
      • Normalizer: vector → normalized vector
      A simple API which provides:
      • DataFrame-based API
      • Built-in parameter getters & setters
      • Distributed row-wise transformation
      Existing use cases:
      • Natural language processing (Snowball, Stanford NLP)
      • Featurization libraries
      • Many others!
  18. Persistence for customized algorithms
      Two traits provide persistence for simple Transformers:
      • DefaultParamsWritable
      • DefaultParamsReadable
      Simply mix in these traits to provide persistence in Scala/Java.
      → Saves & loads algorithm parameters (val myParam: Param[Double])
      → Does not save member data (val myCoefficients: Vector)
      For an example, see UnaryTransformerExample.scala in spark/examples/
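A plain-Python toy (not the Scala traits themselves; the class and field names are invented) showing the behavior the slide describes: persistence covers declared parameters but not member data, so a loaded instance comes back with its parameters set and its member data empty.

```python
import json

class ShiftTransformer:
    """Toy stand-in for a custom UnaryTransformer with default persistence.
    Mirrors the slide's point: Params are persisted, member data is not."""
    def __init__(self, shift=0.0):
        self.shift = shift          # a declared Param -- saved & loaded
        self.coefficients = None    # member data -- NOT saved

    def transform(self, xs):
        return [x + self.shift for x in xs]

    def save_params(self):
        # analogous to DefaultParamsWritable: only Params go into the output
        return json.dumps({"shift": self.shift})

    @classmethod
    def load_params(cls, serialized):
        # analogous to DefaultParamsReadable: rebuild from Params alone
        return cls(**json.loads(serialized))
```

This is why the traits suit simple Transformers: anything beyond parameters (like a fitted coefficient vector) needs a custom writer and reader.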
  19. Recap
      Exploratory analysis:
      • Generalized Linear Regression
      • Python & R APIs
      Production:
      • ML persistence
      • Customizing Pipelines
      Many thanks to the community for contributions & support!
  20. What's next?
      Prioritized items on the 2.1 roadmap JIRA (SPARK-15581):
      • Critical feature completeness for the DataFrame-based API: multiclass logistic regression, statistics
      • Python API parity & R API expansion
      • Scaling & speed tuning for key algorithms: trees & ensembles
      GraphFrames:
      • Release for Spark 2.0
      • Speed improvements (join elimination, connected components)
  21. Get started
      • Try out the Apache Spark 2.0 preview release
      • Get involved via the roadmap JIRA (SPARK-15581) + mailing lists
      • Download the notebook for this talk
      • ML persistence blog post
  22. Thank you!
      Office hour @ 2:45pm today (Expo Hall)
      Twitter: @jkbatcmu