2. Introduction
Studied some Math
and Economics
- Predictive modeling
of at risk patients @
UHG
- R/SAS batch
pipelines
-- Joins TrueCar to do
new car price
modeling
- SQL-based batch
pipelins + python for
Linear Regressions in
API layer
- Continues to train
models in SAS/R
- SQL-based batch
pipelins + python for
Linear Regressions in
API layer
- Massive pre-
computation in
Hadoop/ElasticSearch
- Portions of models still
translated, now to Java
- Spark enables data
processing and training
of models in same
environment
MLeap
Web Dev @ Cornell
Studied some
General Biology
Rails Consulting for
TrueCar and other
companies
Implement ML
model for
ClearBook in Ruby,
Python and Java
Erlang for highly-
concurrent game
servers
C/C# Game Engines
Join TrueCar for
Rails/mobile
development
Begin work on
machine learning
platform based on
Spark/Hadoop
How do we build
real-time APIs
based on Spark
Models?
SparkContext
prohibitive
Try several ideas,
fail, learn a ton,
begin collaboration
with Hortonworks
3. Everyone wants to do better! The winning technology will be the one that enables Engineers and Data
Scientists to collaborate and work across a single platform.
Action Reaction
Problem Statement: Deploying machine learning algorithms to a production environment is
a lot more difficult than it has to be and is a common problem for data-driven organizations.
- Data scientists write data pipelines to
construct research datasets
- Engineers re-write the data pipelines for a
production-ready system
- Engineers write scalable libraries for
computing features and algorithms
- Data scientists largely don’t use those
libraries and maintain/re-write their own
copy of the code
- Data scientists largely focus on
linear/logistic regressions due to
engineering constraints
- Talented engineers get largely tired of
coding up linear regressions and updating
coefficients
4. HDFS: Shared Data Spark: Shared ML APIs ?: Shared API Environment
Existing Solutions: You won’t believe how many companies are still deploying algorithms in
a SQL environment! And these are billion dollar operations.
Hard-Coded Models
(SQL, Java, Ruby)
PMML Emerging Solutions
(yHat, DataRobot)
Enterprise Solutions
(Microsoft, IBM, SAS)
MLeap
Quick to
Implement
Open Sourced
Committed to
Spark/Hadoop
API Server
5. MLeap Solution
• Born out of need to deploy models to a real time API
server at TrueCar
• Leverage Hadoop/Spark ecosystem for training, get rid
of Spark dependency for execution
• Easily reuse models with serialization and executing
without Spark
6. MLeap Components
• core - provides linear algebra system,
regression models, and feature builders
• runtime - provides DataFrame-like
“LeapFrame” and transformers for it
• spark - provides easy conversion from Spark
transformers to MLeap transformers
mleap-spark
mleap-runtime
mleap-core
8. MLeap Runtime
• Provides LeapFrame, which stores data for transformations by MLeap
transformers
• MLeap transformers user mleap-core building blocks to transform
LeapFrame
• MLeap transformers correspond one-to-one with Spark transformers
• No dependencies on Spark
9. Feature Pipeline Diagram
VectorAssembler
Continuous
Feature Vector
StandardScaler
StringIndexer
StringIndexer
StringIndexer
OneHotEncoder
OneHotEncoder
VectorAssembler
LinearRegression
Categorical
Feature
Categorical
Feature
Index
Categorical
Feature
One Hot Vector
Categorical
Feature Vector
VectorAssembler
Scaled Continuous
Feature Vector
Final Feature Vector
Continuous
Feature
Legend
Final Feature Vector Prediction
Regression Pipeline
OneHotEncoder
10. Categorical Pipeline
LeapFrame LeapFrame LeapFrame
Categorical
Feature
StringIndexer OneHotEncoder
Categorical
Feature Index
Categorical
Feature One
Hot Vector
StringIndexer OneHotEncoder
12. MLeap Spark
• Train an ML pipeline with Spark then export it to MLeap
• Execute an MLeap pipeline against a Spark DataFrame
Spark Estimator Spark Model MLeap Model
MLeap
Spark
Spark DataFrame Spark LeapFrame Spark LeapFrame
MLeap
Spark
Spark DataFrame
MLeap
Transformer
MLeap
Spark
13. Typical MLeap Workflow
Data
Frames
HDFS:
Raw Data
Spark/MapReduce/Pig/Hive
Combine + Enrich
Transactional Data
Compute the feature
vector
Train, Test, Repeat
(MLeap-Spark)
Export models to …?
Spark MLeap Runtime
Serialize to HDFS
Serialize to ElasticSearch
Model 1 Model 2
Deploy to API Servers
Scala/Java/Python/Zeppelin/Jupyter/DataBricks
14. Demo Usage of MLeap
• Train a sample listing price model using linear
regression and random forest
• Deploy both models to an API server
• Get real-time results
• IN UNDER 5 MINUTES!
15. Future of MLeap
• Have MLeap Core be a dependency of Spark to ensure
parity between transformers
• Expand supported transformers
• Python/R interface