This presentation covers Apache Spark's MLlib library for distributed ML, focusing on how building MLlib on top of Spark's distributed DataFrame API simplified elements of production-grade ML.
2. Who am I?
Apache Spark committer & PMC member
Software Engineer @ Databricks
Ph.D. in Machine Learning from Carnegie Mellon
3. Apache Spark
• General engine for big data computing
• Fast
• Easy to use
• APIs in Python, Scala, Java & R
Components: Spark SQL, Streaming, MLlib, GraphX
Largest cluster: 8000 nodes (Tencent)
Open source
• Apache Software Foundation
• 1000+ contributors
• 200+ companies & universities
4. Notable users that presented at Spark Summit 2015 San Francisco
Source: slide 5 of Spark Community Update
5. Databricks
Founded by the creators of Apache Spark
Offers hosted service
• Spark on EC2
• Notebooks
• Visualizations
• Cluster management
• Scheduled jobs
We’re hiring!
6. This talk: DataFrames in MLlib
Common issues within big ML projects
• Custom, strict data formats
• Libraries encourage developing via scripts
• Lots of work on low-level optimizations
• Hard to bridge the R&D-to-production gap
• Single-language APIs
7. MLlib: Spark’s ML library
[Chart: commits per release, v0.8 to v2.0 (y-axis: 0 to 1000)]
Learning tasks
Classification
Regression
Recommendation
Clustering
Frequent itemsets
Data utilities
Featurization
Statistics
Linear algebra
Workflow utilities
Model import/export
Pipelines
DataFrames
Cross validation
Goals
Scale-out ML
Standard library
Extensible API
8. Spark DataFrames & Datasets
dept age name
Bio 48 H Smith
CS 34 A Turing
Bio 43 B Jones
Chem 61 M Kennedy
Data grouped into named columns
DSL for common tasks
• Project, filter, aggregate, join, …
• 100+ functions available
• User-Defined Functions (UDFs)
data.groupBy("dept").avg("age")
Datasets: strongly typed DataFrames
9. This talk: DataFrames in MLlib
Data sources & ETL
ML Pipelines
Under the hood: optimizations
Model persistence
Multiple language support
10. Data sources & ETL
DataFrames support easy manipulation of big data
• Standard DataFrame/SQL ops
• Methods for null/NaN values
• Statistical methods
• Conversions: R data.frame, Python pandas
Many data sources, built-in and external: { JSON }, JDBC, and more …
Data scientists spend 50-80% of their time on data munging.*
* Lohr. "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights." NYTimes, 8/18/2014.
14. Fit a model
[Diagram: Original dataset → Feature extraction → Predictive model → Evaluation]
Text | Label | Words | Features | Prediction | Probability
I bought the game... | 4 | "i", "bought", ... | [1, 0, 3, 9, ...] | 4 | 0.8
Do NOT bother try... | 1 | "do", "not", ... | [0, 0, 11, 0, ...] | 2 | 0.6
this shirt is aweso... | 5 | "this", "shirt" | [0, 2, 3, 1, ...] | 5 | 0.9
never got it. Seller... | 1 | "never", "got" | [1, 2, 0, 0, ...] | 1 | 0.7
15. Evaluate
[Same diagram and example table as slide 14, now used to evaluate the fitted model's predictions.]
16. ML Pipelines
DataFrames: unified ML dataset API
• Flexible types
• Add & remove columns during Pipeline execution
• Materialize columns lazily
• Inspect intermediate results
18. Under the hood: optimizations
Current use of DataFrames
• API
• Transformations & predictions
Feature transformation & model prediction are phrased as User-Defined Functions (UDFs)
→ Catalyst query optimizer
→ Tungsten memory management + code generation
19. Implementations on DataFrames
Prototypes
• Belief propagation
• Connected components
Current challenge: DataFrame query plans do not have iteration as a top-level concept
Eventual goal: port all ML algorithms to run on top of DataFrames → speed & scalability
20. ML persistence
Data Science | Software Engineering
Prototype (Python/R)
Create model
Re-implement model for production (Java)
Deploy model
21. ML persistence
Data Science | Software Engineering
Prototype (Python/R)
Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction
Costs of re-implementation:
• Extra implementation work
• Different code paths
• Synchronization overhead
Re-implement Pipeline for production (Java)
Deploy Pipeline
22. With ML persistence...
Data Science | Software Engineering
Prototype (Python/R)
Create Pipeline
Persist model or Pipeline:
model.save("s3n://...")
Load Pipeline (Scala/Java):
Model.load("s3n://...")
Deploy in production
23. ML persistence status
[Diagram: Data preprocessing → Feature generation → Predictive modeling → Model tuning]
Unfitted Pipeline ("recipe") → Fitted Model ("result")
Supported in MLlib's RDD-based API
Single implementation for all Spark language APIs: Scala, Java, Python, R
24. ML persistence status
Near-complete coverage in all Spark language APIs
• Scala & Java: complete
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
25. Multiple language support
APIs in Scala, Java, Python, R
• Scala (& Java): implementation
• Python & R: wrappers for Scala
DataFrames provide:
• Uniform API across languages
• Data serialization
• Store data off-heap, accessible from JVM
• Transfer to & from Python & R handled by DataFrames, not MLlib
26. Summary: DataFrames in MLlib
Data sources & ETL
ML Pipelines
Under the hood: optimizations
Model persistence
Multiple language support
27. Research & development topics
• Query optimization for ML/Graph algorithms
• Caching, communication, serialization, compression
• Iteration as a first-class concept in DataFrames
• Optimized model tuning
• Spark + GPUs
• Asynchronous communication within Spark
28. What's next?
Prioritized items on the 2.1 roadmap JIRA (SPARK-15581):
• Critical feature completeness for the DataFrame-based API
  – Multiclass logistic regression
  – Frequent pattern mining
• Python API parity & R API expansion
• Scaling & speed for key algorithms: trees, forests, and boosting
GraphFrames
• Release for Spark 2.0
• Speed improvements (join elimination, connected components)
29. Get started
Get involved
• JIRA http://issues.apache.org
• mailing lists http://spark.apache.org
• Github http://github.com/apache/spark
• Spark Packages http://spark-packages.org
Learn more
• What’s coming in Apache Spark 2.0
http://databricks.com/blog/2016/06/01
• MOOCs on EdX http://databricks.com/spark/training
Try out Apache Spark 2.0 preview in Databricks Community Edition
http://databricks.com/ce
Many thanks to the community for contributions & support!