
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library for Apache Spark with Miruna Oprescu


With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for at-scale data processing and machine learning experiments, but more often than not, data scientists find themselves struggling with issues such as low-level data manipulation; lack of support for image processing, text analytics, and deep learning; and the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released the Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, data scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner, and how to efficiently deploy a Spark library across multiple platforms.
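
The composition model mentioned in the abstract can be illustrated without Spark. The following is a minimal plain-Python sketch, not MMLSpark or SparkML code: the class names mimic SparkML naming, but `Tokenizer`, `VocabularyEstimator`, and the list-of-dicts "rows" are hypothetical stand-ins for real stages and DataFrames. It shows why the model composes: an estimator's `fit` returns a transformer, and a pipeline is itself an estimator, so pipelines can be nested as stages of other pipelines.

```python
from abc import ABC, abstractmethod


class Transformer(ABC):
    """Maps a list of row dicts to a new list of row dicts."""

    @abstractmethod
    def transform(self, rows):
        ...


class Estimator(ABC):
    """Learns from rows and returns a fitted Transformer (a "model")."""

    @abstractmethod
    def fit(self, rows):
        ...


class Tokenizer(Transformer):
    """Hypothetical stage: lower-cases and splits a text column."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [{**r, self.output_col: r[self.input_col].lower().split()} for r in rows]


class VocabularyModel(Transformer):
    """Fitted stage: maps known tokens to integer ids, dropping unknown ones."""

    def __init__(self, vocab, input_col, output_col):
        self.vocab = vocab
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [
            {**r, self.output_col: [self.vocab[t] for t in r[self.input_col] if t in self.vocab]}
            for r in rows
        ]


class VocabularyEstimator(Estimator):
    """Hypothetical stage: learns a token-to-id mapping from the training rows."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        tokens = sorted({t for r in rows for t in r[self.input_col]})
        vocab = {t: i for i, t in enumerate(tokens)}
        return VocabularyModel(vocab, self.input_col, self.output_col)


class PipelineModel(Transformer):
    """A fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Fits estimators in order, feeding each stage the output of the previous one."""

    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)  # estimator becomes a transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


reviews = [{"text": "Great book"}, {"text": "Bad book"}]
pipeline = Pipeline([Tokenizer("text", "tokens"), VocabularyEstimator("tokens", "ids")])
model = pipeline.fit(reviews)
scored = model.transform([{"text": "great bad unknown"}])
```

Because `Pipeline` subclasses `Estimator`, the `pipeline` object above could itself be passed as a stage to another `Pipeline`, which is the property the session attributes to following the SparkML Pipeline model.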

Published in: Data & Analytics


  1. Miruna Oprescu (moprescu@microsoft.com), Microsoft. MMLSPARK: Lessons From Building A SparkML-Compatible Machine Learning Library For Apache Spark. #EUai7
  2. MMLSpark: Microsoft Machine Learning Library for Apache Spark
     GitHub: https://github.com/Azure/mmlspark
     Spark Package: pyspark/spark-shell/spark-submit --packages Azure:mmlspark:0.9
     Docker: docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark
     Navigate to http://localhost:8888 to view example Jupyter notebooks
  3. Why MMLSpark?
     • The (typical) data science workflow with Spark:
     [Diagram labels: ML algorithm, Data transforms, Data, Model]
  4. Why MMLSpark?
     • Doesn’t look familiar?…maybe this does?
     [Diagram labels: ML algorithms, Data transforms, Data, Pipeline, Python/R UDFs, Data massaging, External libraries (CNTK, OpenCV)]
  5. Why MMLSpark?
     • Workflow can be:
       – Slow & time-consuming
       – Intractable
       – Difficult to debug, reproduce
       – Hard to put in production
  6. MMLSpark Goals
     • Stay in the Spark ecosystem as much as possible by integrating domain-specific libraries (vision, text analytics, etc.)
     • Have better model management support
     • Bring cutting edge ML algorithms to Spark
     • Reduce the overhead from UDFs and other custom functions
     • Run on every platform & language supported by Spark
  7. MMLSpark Lesson #1: Follow the SparkML Pipeline model for composability.
     • MMLSpark consists of Transforms, Estimators and Models that can be combined with existing SparkML components into pipelines.
     • These abstractions ensure composability, reusability via serialization, logging, ease of use across languages.
  8. MMLSpark: Before and After. Example: Book Reviews
  9. MMLSpark: Before
  10. MMLSpark: After
  11. MMLSpark Lesson #2: Leverage SparkML abstractions to auto-generate Python and R interfaces.
     • Decreases development time, ensures feature parity, reduces errors and improves testing.
  12. MMLSpark Architecture
     [Diagram labels: Pre-trained DNN models, OpenCV Java Bindings, Spark core, CNTK Java Bindings, Scala API, PySpark/R wrappers, Wrapper generation]
  13. Python Wrappers: Example
     [Diagram labels: Scala source, Python wrapper, Generator]
  14. MMLSpark Lesson #3: Turn parallelizable algorithms from external libraries (e.g. OpenCV and CNTK) into Scala Pipeline Stages.
     • No data transfer overhead, all operations happen at the JVM level.
     • Enables cutting edge techniques such as Transfer Learning.
  15. MMLSpark Architecture
     [Diagram labels: Pre-trained DNN models, OpenCV Java Bindings, Spark core, CNTK Java Bindings, Scala API, PySpark/R wrappers, Wrapper generation]
  16. MMLSpark: Example
     • DNN Featurization with OpenCV and CNTK
  17. MMLSpark Lesson #4: Run on every platform supported by Spark. Test at the highest possible level (Jupyter Notebooks).
     • Publish self-contained packages that can be used from a variety of targets.
     • Avoid common integration issues by testing directly with Jupyter Notebooks.
  18. MMLSpark: Bonus
     • Image Schema unification https://github.com/apache/spark/pull/19439
     • Uses DataFrames as common format for reading images.
     • Standardizes handling of images as a datatype used by different algorithms.
  19. Use case: Snow Leopards
  20. Use case: Snow Leopards
     • 3,900-6,500 individuals left in the wild
     • Little known about their behavior, movement patterns, survival rates
     • Camera trapping since 2009 (~1.3 mil images)
  21. Use case: Snow Leopards
     • Automatic Image Classification with MMLSpark:
       − Thousands of hours of researcher and volunteer time saved
       − Resources redeployed to science and conservation vs image sorting
       − Much more accurate data on range and population
  22. Thank You!
     • Star our repo: https://github.com/Azure/mmlspark
     • Contact us:
       Myself: Miruna Oprescu (moprescu@microsoft.com)
       Dev Lead: Sudarshan Raghunathan (susudars@microsoft.com)
       PM: Roope Astala (roastala@microsoft.com)
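
Lesson #2 in the transcript above describes generating Python and R interfaces from Scala stages, but the slides do not show the generator itself. As a rough illustration of the idea only, the sketch below emits Python wrapper source from a declarative description of a stage. Everything in it is an assumption made for the example: `ParamSpec`, `StageSpec`, `generate_wrapper`, and the `com.example.stages.ImageResizer` class path are hypothetical, and the emitted class is a plain Python object rather than a PySpark wrapper that talks to the JVM.

```python
from dataclasses import dataclass, field


@dataclass
class ParamSpec:
    """Hypothetical description of one stage parameter."""
    name: str
    default: object
    doc: str


@dataclass
class StageSpec:
    """Hypothetical description of one Scala stage to wrap."""
    class_name: str
    jvm_class: str
    params: list = field(default_factory=list)


def generate_wrapper(spec):
    """Return Python source for a class with one setter/getter pair per param."""
    signature = ", ".join(["self"] + [f"{p.name}={p.default!r}" for p in spec.params])
    lines = [
        f"class {spec.class_name}:",
        f"    # Generated wrapper for {spec.jvm_class}",
        f"    _jvm_class = {spec.jvm_class!r}",
        "",
        f"    def __init__({signature}):",
    ]
    if spec.params:
        lines += [f"        self._{p.name} = {p.name}" for p in spec.params]
    else:
        lines.append("        pass")
    for p in spec.params:
        cap = p.name[0].upper() + p.name[1:]
        lines += [
            "",
            f"    def set{cap}(self, value):",
            f"        # {p.doc}",
            f"        self._{p.name} = value",
            "        return self",
            "",
            f"    def get{cap}(self):",
            f"        return self._{p.name}",
        ]
    return "\n".join(lines) + "\n"


resizer_spec = StageSpec(
    class_name="ImageResizer",
    jvm_class="com.example.stages.ImageResizer",
    params=[
        ParamSpec("height", 224, "target height in pixels"),
        ParamSpec("width", 224, "target width in pixels"),
    ],
)
wrapper_source = generate_wrapper(resizer_spec)
```

Because every wrapper comes from the same template, a parameter added to the stage description shows up in the generated interface automatically, which is one way to read the slide's claim about feature parity and fewer errors.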

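
Lesson #3 in the transcript talks about turning parallelizable algorithms from external libraries into pipeline stages. The slides do not say how MMLSpark does this internally, so the following plain-Python sketch shows only one common pattern, under stated assumptions: a stage that applies an external row-wise function partition by partition and loads the expensive resource once per partition instead of once per row. `ExternalModelStage`, `load_fake_featurizer`, and the nested lists used as "partitions" are hypothetical stand-ins, not MMLSpark, CNTK, or OpenCV APIs.

```python
class ExternalModelStage:
    """Applies an externally loaded model to one column, partition by partition."""

    def __init__(self, load_model, input_col, output_col):
        self.load_model = load_model  # zero-argument function returning a callable
        self.input_col = input_col
        self.output_col = output_col

    def _transform_partition(self, rows):
        model = self.load_model()  # loaded once for the whole partition
        for row in rows:
            yield {**row, self.output_col: model(row[self.input_col])}

    def transform(self, partitions):
        # Each partition is independent, so a cluster could process them in parallel.
        return [list(self._transform_partition(part)) for part in partitions]


load_log = []


def load_fake_featurizer():
    """Stand-in for loading a pre-trained model; records every load."""
    load_log.append("loaded")
    return lambda pixels: sum(pixels) / len(pixels)


partitions = [
    [{"image": [0, 10, 20]}, {"image": [30, 30, 30]}],
    [{"image": [5, 15]}],
]
stage = ExternalModelStage(load_fake_featurizer, "image", "features")
result = stage.transform(partitions)
```

In this toy run the loader is called twice, once per partition, even though three rows are processed. That is the kind of per-row overhead the lesson is concerned with avoiding.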