
Fantastic ML apps and how to build them


Building efficient machine learning applications is not a simple task. The typical engineering process iterates over data wrangling, feature generation, model selection, hyperparameter tuning and evaluation. The number of possible combinations of input features, algorithms and parameters makes the process too complex for even experts to perform efficiently. Automating it is especially important when building machine learning applications for thousands of customers. In this talk I demonstrate how we build effective ML models using the AutoML capabilities we develop at Salesforce. These include techniques for automatic data processing, feature generation, model selection, hyperparameter tuning and evaluation. I present several of the solutions, implemented with Scala and Spark.

Published in: Software

  1. Matthew Tovbin, Principal Engineer, Salesforce Einstein (mtovbin@salesforce.com, @tovbinm): Fantastic ML apps and how to build them
  2. “This lonely scene – the galaxies like dust, is what most of space looks like. This emptiness is normal. The richness of our own neighborhood is the exception.” – Powers of Ten (1977), by Charles and Ray Eames
  3. Powers of Ten (1977): a journey from a quark to the observable universe, [10^-17, 10^24]
  4. Powers of Ten for Machine Learning
     • Data collection
     • Data preparation
     • Feature engineering
     • Feature selection
     • Sampling
     • Algorithm implementation
     • Hyperparameter tuning
     • Model selection
     • Model serving (scoring)
     • Prediction insights
     • Metrics
  5. How long does it take to build a machine learning application? a) Hours b) Days c) Weeks d) Months e) More
  6. How to cope with this complexity? E = mc^2; Free[F[_], A]; M[A]; Functor[F[_]]; Cofree[S[_], A]. Months -> Hours
  7. “The task of the software development team is to engineer the illusion of simplicity.” – Grady Booch
  8. Complexity vs. Abstraction
  9. Appropriate Level of Abstraction
     [Diagram labels: Language, Syntax & Semantics, Degrees of Freedom, define, Lower Abstraction, Higher Abstraction, ???]
     • Less flexible
     • Simpler syntax
     • Reuse
     • Suitable for complex problems
     • Difficult to use
     • More complex
     • Error prone
  10. “FP removes one important dimension of complexity: To understand a program part (a function) you need no longer account for the possible histories of executions that can lead to that program part.” – Martin Odersky
  11. Functional Approach
      • Type-safe
      • No side effects
      • Composability
      • Concise
      • Fine-grained control

          // Extracting URL features
          def urlFeatures(s: String): (Text, Text) = {
            val url = Url(s)
            url.protocol -> url.domain
          }
          Seq("http://einstein.com", "").map(urlFeatures)

          > Seq((Text("http"), Text("einstein.com")),
                (Text(), Text()))
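
The `Url` and `Text` types used on this slide belong to the library and are not defined in the transcript. As a rough illustration of the same idea, here is a minimal sketch in plain Scala. It assumes a hypothetical `Text` wrapper around `Option[String]` and uses `java.net.URI` for parsing; neither choice is confirmed by the talk.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical stand-in for the library's nullable Text type (an assumption).
final case class Text(value: Option[String])
object Text {
  val empty: Text = Text(None)
  def apply(s: String): Text = Text(Some(s))
}

// Pure function: the same input always yields the same pair of features.
// Empty or malformed input yields empty features instead of throwing.
def urlFeatures(s: String): (Text, Text) = {
  val uri = Try(new URI(s)).toOption
  val protocol = uri.flatMap(u => Option(u.getScheme))
  val domain = uri.flatMap(u => Option(u.getHost))
  (Text(protocol), Text(domain))
}

val urlFeaturePairs: Seq[(Text, Text)] =
  Seq("http://einstein.com", "").map(urlFeatures)
```

Because the function has no side effects, it composes with `map` over any collection, and the missing-value case is visible in the return type rather than hidden behind an exception.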
  12. Object-oriented Approach
      • Modularity
      • Code reuse
      • Polymorphism

          // Extracting text features
          val txt = Seq(
            Url("http://einstein.com"),
            Base64("b25lIHR3byB0aHJlZQ=="),
            Text("Hello world!"),
            Phone("650-123-4567"),
            Text.empty
          )
          txt.map(_.tokenize)

          > Seq(
              TextList("http", "einstein.com"),
              TextList("one", "two", "three"),
              TextList("Hello", "world"),
              TextList("+1", "650", "1234567"),
              TextList()
            )
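
The polymorphic `tokenize` call above can be sketched without the library. The hierarchy below is hypothetical and much smaller than the real one: `TextLike`, `PlainText` and `Base64Text` are illustrative names, and the tokenization rules are assumptions chosen to reproduce two of the outputs shown on the slide.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64

// Hypothetical, simplified hierarchy; the library's real types are richer.
sealed trait TextLike {
  def value: Option[String]
  // Default behaviour: split on whitespace, strip punctuation, drop empties.
  def tokenize: List[String] =
    value.toList
      .flatMap(_.split("\\s+").toList)
      .map(_.replaceAll("[^\\p{L}\\p{N}]", ""))
      .filter(_.nonEmpty)
}

final case class PlainText(value: Option[String]) extends TextLike

final case class Base64Text(value: Option[String]) extends TextLike {
  // Subtype override: decode the payload first, then split on whitespace.
  override def tokenize: List[String] =
    value.toList
      .flatMap(v => new String(Base64.getDecoder.decode(v), UTF_8).split("\\s+").toList)
      .filter(_.nonEmpty)
}

val mixedTexts: Seq[TextLike] = Seq(
  Base64Text(Some("b25lIHR3byB0aHJlZQ==")),
  PlainText(Some("Hello world!")),
  PlainText(None)
)
val mixedTokens: Seq[List[String]] = mixedTexts.map(_.tokenize)
```

The caller only sees `TextLike`, so new text subtypes can supply their own tokenization without changing the code that maps over the collection.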
  13. Why Scala?
      • Combines FP & OOP
      • Strongly-typed
      • Expressive
      • Concise
      • Fun (mostly)
      • Default for Spark
  14. Optimus Prime: an AutoML library for building modular, reusable, strongly typed ML workflows on Spark
      • Declarative & intuitive syntax
      • Proper level of abstraction
      • Aimed at simplicity & reuse
      • >90% accuracy with 100X reduction in time
  15. Types Hide the Complexity
      [Diagram: the FeatureType class hierarchy. Legend: bold - abstract type, normal - concrete type, italic - trait, solid line - inheritance, dashed line - trait mixin.]
      Types shown: FeatureType, OPNumeric, OPCollection, OPSet, OPList, NonNullable, Text, Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea, OPVector, OPMap, BinaryMap, IntegralMap, RealMap, DateList, DateTimeList, Integral, Real, Binary, Percent, Currency, Date, DateTime, MultiPickList, TextMap, …, TextList, City, Street, Country, PostalCode, Location, State, Geolocation, StateMap, SingleResponse, RealNN, Categorical, MultiResponse
  16. Type Safety Everywhere
      • Value Operations
      • Feature Operations
      • Transformation Pipelines (aka Workflow)

          // Typed value operations
          def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList

          // Typed feature operations
          val title: Feature[Text] = FeatureBuilder.Text[Book].extract(_.title).asPredictor
          val tokens: Feature[TextList] = title.map(tokenize)

          // Transformation pipelines
          new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
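
To make the "typed feature operations" point concrete, here is a minimal sketch of a typed feature in plain Scala. The `Feature[R, A]` class, its `map` signature and the `Book` record are hypothetical simplifications for illustration, not the library's actual definitions.

```scala
// Hypothetical minimal feature: a named, typed extractor over records of type R.
final case class Feature[R, A](name: String, extract: R => A) {
  // The compiler checks that f accepts exactly the feature's value type A.
  def map[B](suffix: String)(f: A => B): Feature[R, B] =
    Feature(s"${name}_$suffix", extract.andThen(f))
}

final case class Book(title: String, price: Double)

// Typed value operation.
def tokenize(t: String): List[String] = t.split(" ").toList.filter(_.nonEmpty)

// Typed feature operations.
val title: Feature[Book, String] = Feature[Book, String]("title", _.title)
val tokens: Feature[Book, List[String]] = title.map("tokens")(tokenize)

// The line below would be rejected at compile time, because tokenize expects
// a String and the price feature carries a Double:
// Feature[Book, Double]("price", _.price).map("tokens")(tokenize)
```

The mistake of tokenizing a numeric feature is caught before any data is read, which is the benefit the slide attributes to type safety.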
  17. Book Price Prediction

          // Raw feature definitions
          val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
          val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
          val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
          val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

          // Feature engineering: tokenize, tfidf etc.
          val tokns = (title + descr).tokenize(removePunctuation = true)
          val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
          val feats = Seq(tfidf, authr).vectorize()

          // Model training
          implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
          val books = spark.read.csv("books.csv").as[Book]
          val preds = RegressionModelSelector().setInput(price, feats).getOutput
          new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
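
The `.tf(numTerms = 1024).idf(minFreq = 0)` step on this slide computes TF-IDF weights. The sketch below shows the arithmetic in plain Scala under two stated assumptions: tokens are bucketed with `String.hashCode` (a simplification of the hashing trick), and the inverse document frequency uses the smoothed form log((N + 1) / (df + 1)). The function names are illustrative only.

```scala
// Hashing trick: map each token to one of numTerms buckets and count occurrences.
def termFrequencies(tokens: Seq[String], numTerms: Int): Map[Int, Double] =
  tokens
    .groupBy(t => Math.floorMod(t.hashCode, numTerms))
    .map { case (bucket, ts) => bucket -> ts.size.toDouble }

// Smoothed inverse document frequency per bucket: log((N + 1) / (df + 1)).
def inverseDocFrequencies(docs: Seq[Map[Int, Double]]): Map[Int, Double] = {
  val n = docs.size.toDouble
  docs
    .flatMap(_.keys)
    .groupBy(identity)
    .map { case (bucket, hits) => bucket -> math.log((n + 1) / (hits.size + 1)) }
}

// TF-IDF weight: term frequency scaled by how rare the term is across documents.
def tfidfWeights(tf: Map[Int, Double], idf: Map[Int, Double]): Map[Int, Double] =
  tf.map { case (bucket, f) => bucket -> f * idf.getOrElse(bucket, 0.0) }

val docTfs: Seq[Map[Int, Double]] =
  Seq(Seq("a", "b", "a"), Seq("a", "c")).map(termFrequencies(_, 1024))
val docIdf: Map[Int, Double] = inverseDocFrequencies(docTfs)
val firstDocWeights: Map[Int, Double] = tfidfWeights(docTfs.head, docIdf)
```

A term that appears in every document, such as "a" here, gets an IDF of zero and therefore contributes nothing to the vector, while rarer terms keep a positive weight.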
  18. Magic Behind “vectorize()”

          // Raw feature definitions
          val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
          val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
          val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
          val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

          // Feature engineering: tokenize, tfidf etc.
          val tokns = (title + descr).tokenize(removePunctuation = true)
          val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
          val feats = Seq(tfidf, authr).vectorize() // <- magic here

          // Model training
          implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
          val books = spark.read.csv("books.csv").as[Book]
          val preds = RegressionModelSelector().setInput(price, feats).getOutput
          new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
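  19. Automatic Feature Engineering
      [Diagram: raw features are expanded into derived features and combined into one vector.]
      • Raw features: Zipcode, Subject, Phone, Email, Age
      • Derived features: Age [0-15], Age [15-35], Age [>35], Email Is Spammy, Top Email Domains, Country Code, Phone Is Valid, Top TF-IDF Terms, Average Income
      • Output: Vector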
  20. Automatic Feature Engineering
      • Numeric: Imputation; Track null value; Log transformation for large range; Scaling - zNormalize; Smart Binning
      • Categorical: Imputation; Track null value; One Hot Encoding; Dynamic Top K pivot; Smart Binning; LabelCount Encoding; Category Embedding
      • Text: Tokenization; Hash Encoding; TF-IDF; Word2Vec; Sentiment Analysis; Language Detection
      • Temporal: Time difference; Time Binning; Time extraction (day, week, month, year); Closeness to major events
      • Spatial: Augment with external data, e.g. avg income; Spatial fraudulent behavior, e.g. impossible travel speed; Geo-encoding
      • More…
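
A few of the techniques named on this slide (imputation, null tracking, one-hot encoding with a top-K pivot) can be sketched as type-driven vectorization, where each column type carries its own default transformation. The `Column`, `RealColumn` and `PickListColumn` types below are hypothetical, and the specific rules (mean imputation, ties broken alphabetically) are assumptions made for the example.

```scala
// Hypothetical column types; each one knows how to turn itself into numbers.
sealed trait Column { def vectorize: Seq[Vector[Double]] }

// Numeric: impute missing values with the mean and track nulls in a second slot.
final case class RealColumn(values: Seq[Option[Double]]) extends Column {
  def vectorize: Seq[Vector[Double]] = {
    val present = values.flatten
    val mean = if (present.isEmpty) 0.0 else present.sum / present.size
    values.map {
      case Some(v) => Vector(v, 0.0)
      case None    => Vector(mean, 1.0)
    }
  }
}

// Categorical: one-hot encode the topK most frequent values,
// followed by an "other" slot and a "null" slot.
final case class PickListColumn(values: Seq[Option[String]], topK: Int) extends Column {
  def vectorize: Seq[Vector[Double]] = {
    val top = values.flatten
      .groupBy(identity).toSeq
      .sortBy { case (v, vs) => (-vs.size, v) }
      .take(topK)
      .map(_._1)
    values.map { v =>
      val slot = v match {
        case Some(s) if top.contains(s) => top.indexOf(s)
        case Some(_)                    => topK
        case None                       => topK + 1
      }
      Vector.tabulate(topK + 2)(i => if (i == slot) 1.0 else 0.0)
    }
  }
}

// Concatenate the per-column vectors row by row into one feature vector per row.
def vectorizeAll(columns: Seq[Column]): Seq[Vector[Double]] =
  columns.map(_.vectorize).transpose.map(_.flatten.toVector)

val exampleRows: Seq[Vector[Double]] = vectorizeAll(Seq(
  RealColumn(Seq(Some(10.0), None, Some(20.0))),
  PickListColumn(Seq(Some("a"), Some("b"), None), topK = 1)
))
```

The caller never chooses a transformation; the column's type selects it, which is the sense in which a single `vectorize()` call can stand in for a lot of manual feature engineering.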
  21. Automatic Feature Selection
      • Analyze features & calculate statistics
      • Ensure features have acceptable ranges
      • Is this feature a leaker?
      • Does this feature help our model? Is it predictive?
  22. Automatic Feature Selection

          // Sanity check your features against the label
          val checked = price.check(
            featureVector = feats,
            checkSample = 0.3,
            sampleSeed = 1L,
            sampleLimit = 100000L,
            maxCorrelation = 0.95,
            minCorrelation = 0.0,
            correlationType = Pearson,
            minVariance = 0.00001,
            removeBadFeatures = true
          )

          new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
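
The parameter names passed to `check` suggest that features are filtered by their variance and by their correlation with the label. The transcript does not show the implementation, so the following is only a sketch of that reading: the function `sanityCheck` and its exact rules are assumptions, with defaults copied from the slide.

```scala
def mean(xs: Seq[Double]): Double = xs.sum / xs.size

def variance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / xs.size
}

// Pearson correlation; returns 0.0 when either series is constant.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  val (mx, my) = (mean(xs), mean(ys))
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val denom = math.sqrt(
    xs.map(x => (x - mx) * (x - mx)).sum * ys.map(y => (y - my) * (y - my)).sum)
  if (denom == 0.0) 0.0 else cov / denom
}

// Keep a feature only if it varies enough and its absolute correlation with the
// label lies in [minCorrelation, maxCorrelation]. Very high correlation is
// treated as a sign of label leakage.
def sanityCheck(
    label: Seq[Double],
    features: Map[String, Seq[Double]],
    maxCorrelation: Double = 0.95,
    minCorrelation: Double = 0.0,
    minVariance: Double = 0.00001): Set[String] =
  features.filter { case (_, xs) =>
    val c = math.abs(pearson(xs, label))
    variance(xs) >= minVariance && c >= minCorrelation && c <= maxCorrelation
  }.keySet

val keptFeatures: Set[String] = sanityCheck(
  label = Seq(1.0, 2.0, 3.0, 4.0),
  features = Map(
    "leaker"   -> Seq(2.0, 4.0, 6.0, 8.0), // perfectly correlated with the label
    "constant" -> Seq(5.0, 5.0, 5.0, 5.0), // zero variance
    "useful"   -> Seq(1.0, 3.0, 2.0, 4.0)  // correlation 0.8
  )
)
```

In this toy data the leaker and the constant column are dropped and only the moderately correlated feature survives.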
  23. Automatic Model Selection
      • Multiple algorithms to pick from
      • Many hyperparameters for each algorithm
      • Automated hyperparameter tuning
        – Faster model creation with improved metrics
        – Search algorithms to find the optimal hyperparameters, e.g. grid search, random search, bandit methods
  24. Automatic Model Selection

          // Model selection and hyperparameter tuning
          val preds =
            RegressionModelSelector
              .withCrossValidation(
                dataSplitter = DataSplitter(reserveTestFraction = 0.1),
                numFolds = 3,
                validationMetric = Evaluators.Regression.rmse(),
                trainTestEvaluators = Seq.empty,
                seed = 1L)
              .setModelsToTry(LinearRegression, RandomForestRegression)
              .setLinearRegressionElasticNetParam(0, 0.5, 1)
              .setLinearRegressionMaxIter(10, 100)
              .setLinearRegressionSolver(Solver.LBFGS)
              .setRandomForestMaxDepth(2, 10)
              .setRandomForestNumTrees(10)
              .setInput(price, checked).getOutput

          new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
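
The selector above combines cross-validation with a search over hyperparameter values. A minimal sketch of that combination in plain Scala follows: grid search scored by mean validation RMSE over k folds. All names here (`Row`, `Model`, `kFolds`, `selectModel`, `ridge`) are hypothetical, and the toy learner is a one-dimensional ridge regression without intercept, chosen only to keep the example short.

```scala
type Row = (Double, Double)   // (feature, label)
type Model = Double => Double // a fitted predictor

def rmse(model: Model, data: Seq[Row]): Double =
  math.sqrt(data.map { case (x, y) => math.pow(model(x) - y, 2) }.sum / data.size)

// Split the data into numFolds (train, validation) pairs by row index.
def kFolds(data: Seq[Row], numFolds: Int): Seq[(Seq[Row], Seq[Row])] = {
  val indexed = data.zipWithIndex
  (0 until numFolds).map { k =>
    val (valid, train) = indexed.partition { case (_, i) => i % numFolds == k }
    (train.map(_._1), valid.map(_._1))
  }
}

// Grid search: fit every candidate on each fold's training part, score it by
// mean validation RMSE, and return the best candidate with its score.
def selectModel[P](grid: Seq[P], data: Seq[Row], numFolds: Int)(
    fit: (P, Seq[Row]) => Model): (P, Double) =
  grid.map { p =>
    val scores = kFolds(data, numFolds).map { case (train, valid) =>
      rmse(fit(p, train), valid)
    }
    (p, scores.sum / scores.size)
  }.minBy(_._2)

// Toy learner (assumption): one-dimensional ridge regression without intercept.
def ridge(lambda: Double, train: Seq[Row]): Model = {
  val w = train.map { case (x, y) => x * y }.sum /
    (train.map { case (x, _) => x * x }.sum + lambda)
  x => w * x
}

val toyData: Seq[Row] = (1 to 6).map(i => (i.toDouble, 2.0 * i))
val bestCandidate: (Double, Double) =
  selectModel(Seq(0.0, 1.0, 100.0), toyData, numFolds = 3)(ridge)
```

On this noise-free data the unregularized candidate fits exactly, so the search returns it. Random search or bandit methods would replace only the `grid.map` step while keeping the same scoring.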
  25. Automatic Model Selection
  26. Demo
  27. How well does it work?
      • Most of our models deployed in production are completely hands free
      • We serve 475,000,000+ predictions per day
  28. Fantastic ML apps HOWTO
      • Define appropriate level of abstraction
      • Use types to express it
      • Automate everything:
        – feature engineering & selection
        – model selection
        – hyperparameter tuning
        – etc.
      Months -> Hours
  29. Further exploration. Talks @ Scale By The Bay 2017:
      • “Real Time ML Pipelines in Multi-Tenant Environments” by Karl Skucha and Yan Yang
      • “Fireworks - lighting up the sky with millions of Sparks” by Thomas Gerber
      • “Functional Linear Algebra in Scala” by Vlad Patryshev
      • “Complex Machine Learning Pipelines Made Easy” by Chris Rupley and Till Bergmann
      • “Just enough DevOps for data scientists” by Anya Bida
  30. We are hiring! einstein-recruiting@salesforce.com
  31. Thank You
