With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for at-scale data processing and machine learning experiments, but data scientists often find themselves struggling with issues such as low-level data manipulation; lack of support for image processing, text analytics, and deep learning; and the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released the Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, data scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner, and how to efficiently deploy a Spark library across multiple platforms.
2. MMLSpark
Microsoft Machine Learning Library for Apache Spark
GitHub: https://github.com/Azure/mmlspark
Spark Package:
pyspark/spark-shell/spark-submit --packages Azure:mmlspark:0.9
Docker:
docker run -it -p 8888:8888 -e ACCEPT_EULA=yes microsoft/mmlspark
Navigate to http://localhost:8888 to view example Jupyter notebooks
2#EUai7
3. Why MMLSpark?
• The (typical) data science workflow with Spark:
[Diagram: Data, Data transforms, ML algorithm, Model]
4. Why MMLSpark?
• Doesn’t look familiar? Maybe this does:
[Diagram: Data, Data massaging, Data transforms, Python/R UDFs, External libraries (CNTK, OpenCV), ML algorithms, Pipeline]
5. Why MMLSpark?
• Workflow can be:
– Slow & time-consuming
– Intractable
– Difficult to debug and reproduce
– Hard to put in production
6. MMLSpark Goals
• Stay in the Spark ecosystem as much as possible by
integrating domain-specific libraries (vision, text
analytics, etc.)
• Have better model management support
• Bring cutting edge ML algorithms to Spark
• Reduce the overhead from UDFs and other custom
functions
• Run on every platform & language supported by Spark
7. MMLSpark
Lesson #1: Follow the SparkML Pipeline model for
composability.
• MMLSpark consists of Transformers, Estimators
and Models that can be combined with existing
SparkML components into pipelines.
• These abstractions ensure composability,
reusability via serialization, logging, and ease of use
across languages.
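The fit/transform contract behind this lesson can be sketched in a few lines of plain Python. This is a toy illustration of the pattern only, not the SparkML or MMLSpark API: the classes Transformer, Estimator, Pipeline, PipelineModel, TextLength and MeanCenter below are hypothetical stand-ins that operate on lists of dicts rather than DataFrames.

```python
from typing import Dict, List

Row = Dict[str, object]


class Transformer:
    """Stage that maps rows to rows without learning anything."""

    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """Stage that learns from data; fit() returns a Transformer (the model)."""

    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class TextLength(Transformer):
    """Hypothetical stage: writes the length of a text column to a new column."""

    def __init__(self, input_col: str, output_col: str):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.output_col: len(str(r[self.input_col]))} for r in rows]


class MeanCenterModel(Transformer):
    """Fitted model: subtracts a previously learned mean from a column."""

    def __init__(self, col: str, mean: float):
        self.col = col
        self.mean = mean

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.col: r[self.col] - self.mean} for r in rows]


class MeanCenter(Estimator):
    """Hypothetical estimator: learns the mean of a numeric column."""

    def __init__(self, col: str):
        self.col = col

    def fit(self, rows: List[Row]) -> Transformer:
        mean = sum(r[self.col] for r in rows) / len(rows)
        return MeanCenterModel(self.col, mean)


class PipelineModel(Transformer):
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """A pipeline is itself an Estimator, so pipelines can be nested."""

    def __init__(self, stages):
        self.stages = list(stages)

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted, current = [], rows
        for stage in self.stages:
            # Estimators are fitted on the data produced so far;
            # Transformers are used as they are.
            model = stage.fit(current) if isinstance(stage, Estimator) else stage
            current = model.transform(current)
            fitted.append(model)
        return PipelineModel(fitted)
```

Because every stage honors the same two methods, a new stage composes with existing ones without either side knowing about the other, which is the property the lesson relies on.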
11. MMLSpark
Lesson #2: Leverage SparkML abstractions to
auto-generate Python and R interfaces.
• Decreases development time, ensures feature
parity, reduces errors and improves testing.
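The slides do not show the generator itself, so the following is only a minimal sketch of the idea: if every stage declares its parameters in a uniform way, wrapper source code can be produced mechanically from that metadata. The function generate_wrapper, the stage name ResizeStage and its parameters are all hypothetical; a real generator would obtain the metadata by inspecting the Scala classes rather than from a hand-written dict.

```python
def generate_wrapper(class_name: str, params: dict) -> str:
    """Return Python source for a wrapper class built from parameter metadata.

    `params` maps each parameter name to its default value.
    """
    signature = "".join(f", {name}={default!r}" for name, default in params.items())
    lines = [
        f"class {class_name}:",
        f'    """Auto-generated wrapper for {class_name}."""',
        f"    def __init__(self{signature}):",
        "        self._params = {}",
    ]
    for name in params:
        lines.append(f"        self._params[{name!r}] = {name}")
    for name in params:
        cap = name[0].upper() + name[1:]
        lines += [
            f"    def set{cap}(self, value):",
            f"        self._params[{name!r}] = value",
            "        return self",
            f"    def get{cap}(self):",
            f"        return self._params[{name!r}]",
        ]
    return "\n".join(lines) + "\n"


def load_wrapper(class_name: str, params: dict):
    """Compile the generated source and return the resulting class."""
    namespace = {}
    exec(generate_wrapper(class_name, params), namespace)
    return namespace[class_name]
```

For example, load_wrapper("ResizeStage", {"inputCol": "image", "height": 224}) yields a class with setInputCol, getInputCol, setHeight and getHeight. Since the wrappers are derived rather than hand-written, a parameter added on the Scala side shows up in every language binding on the next build.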
14. MMLSpark
Lesson #3: Turn parallelizable algorithms from
external libraries (e.g. OpenCV and
CNTK) into Scala Pipeline Stages.
• No data transfer overhead, all operations
happen at the JVM level.
• Enables cutting edge techniques such as
Transfer Learning.
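The slides do not describe how the external libraries are wrapped. One common pattern for running a previously non-distributed library over partitioned data, assumed here purely for illustration, is to load the native model once per partition and then stream that partition's rows through it. The sketch below is plain Python with hypothetical names (score_partition, score, load_model); it shows the pattern only, not the JVM-level implementation the slide refers to.

```python
from typing import Callable, Iterable, Iterator, List


def score_partition(
    rows: Iterable[float],
    load_model: Callable[[], Callable[[float], float]],
) -> Iterator[float]:
    """Load the model once, then evaluate it on every row of one partition."""
    model = load_model()  # paid once per partition, not once per row
    for row in rows:
        yield model(row)


def score(
    rows: List[float],
    num_partitions: int,
    load_model: Callable[[], Callable[[float], float]],
) -> List[List[float]]:
    """Split rows round-robin into partitions and score each independently.

    In a cluster each partition would be handled by a different worker;
    here they are processed one after another for simplicity.
    """
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    return [list(score_partition(p, load_model)) for p in partitions]
```

Each partition is self-contained, so the work parallelizes without the library itself needing to know anything about distribution.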
17. MMLSpark
Lesson #4: Run on every platform supported by
Spark. Test at the highest possible
level (Jupyter Notebooks).
• Publish self-contained packages that can be
used from a variety of targets.
• Avoid common integration issues by testing
directly with Jupyter Notebooks.
18. MMLSpark: Bonus
• Image Schema unification
https://github.com/apache/spark/pull/19439
• Uses DataFrames as common format for
reading images.
• Standardizes handling of images as a datatype
used by different algorithms.
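To make "images as a datatype" concrete, here is a plain-Python sketch of a single image record. The field names (origin, height, width, nChannels, mode, data) follow Spark's image schema as I understand it; the ImageRow class, its length check and the pixel helper are hypothetical, and the sketch assumes 8-bit channels stored in row-major order.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ImageRow:
    origin: str     # where the image was read from, e.g. a file URI
    height: int     # number of pixel rows
    width: int      # number of pixel columns
    nChannels: int  # channels per pixel, e.g. 3 for color
    mode: int       # OpenCV-style type code (assumed: 16 for 8-bit, 3 channels)
    data: bytes     # raw pixel bytes, row-major (assumption)

    def __post_init__(self):
        # Assumes one byte per channel; other modes would need a different size.
        expected = self.height * self.width * self.nChannels
        if len(self.data) != expected:
            raise ValueError(f"expected {expected} bytes, got {len(self.data)}")

    def pixel(self, row: int, col: int) -> Tuple[int, ...]:
        """Return the channel values of the pixel at (row, col)."""
        start = (row * self.width + col) * self.nChannels
        return tuple(self.data[start:start + self.nChannels])
```

With one agreed record layout, a reader, a resize stage and a model-scoring stage can all exchange images without per-library conversion code.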
20. Use case: Snow Leopards
• 3,900-6,500 individuals left
in the wild
• Little known about their
behavior, movement
patterns, survival rates
• Camera trapping since
2009 (~1.3 million images)
21. Use case: Snow Leopards
• Automatic Image
Classification with
MMLSpark:
− Thousands of hours of
researcher and volunteer time
saved
− Resources redeployed to
science and conservation vs
image sorting
− Much more accurate data on
range and population
22. Thank You!
• Star our repo:
https://github.com/Azure/mmlspark
• Contact us:
Myself: Miruna Oprescu (moprescu@microsoft.com)
Dev Lead: Sudarshan Raghunathan (susudars@microsoft.com)
PM: Roope Astala (roastala@microsoft.com)