Matthew Tovbin
Principal Engineer, Salesforce Einstein
mtovbin@salesforce.com
@tovbinm
Fantastic ML apps and how to
build them
“This lonely scene – the galaxies like
dust, is what most of space looks like.
This emptiness is normal. The richness
of our own neighborhood is the
exception.”
– Powers of Ten (1977), by Charles and Ray Eames
Powers of Ten (1977)
A journey between a quark and the observable universe
[10^-17, 10^24]
Powers of Ten for Machine Learning
•  Data collection
•  Data preparation
•  Feature engineering
•  Feature selection
•  Sampling
•  Algorithm implementation
•  Hyperparameter tuning
•  Model selection
•  Model serving (scoring)
•  Prediction insights
•  Metrics
How long does it take to build a machine learning application?
a)  Hours
b)  Days
c)  Weeks
d)  Months
e)  More
How to cope with this complexity?
E = mc^2


Free[F[_], A]

M[A]

Functor[F[_]]

Cofree[S[_], A]

Months -> Hours
“The task of the software development
team is to engineer the illusion of
simplicity.”
– Grady Booch
Complexity vs. Abstraction
Appropriate Level of Abstraction
[Diagram: a language defines syntax & semantics and degrees of freedom, on a scale from lower abstraction to higher abstraction]
•  Less flexible
•  Simpler syntax
•  Reuse
•  Suitable for complex problems
•  Difficult to use
•  More complex
•  Error prone
???
“FP removes one important dimension
of complexity:
To understand a program part (a
function) you need no longer account
for the possible histories of executions
that can lead to that program part.”
– Martin Odersky
Functional Approach
•  Type-safe
•  No side effects
•  Composability
•  Concise
•  Fine-grained control
// Extracting URL features
def urlFeatures(s: String): (Text, Text) = {
  val url = Url(s)
  url.protocol -> url.domain
}
Seq("http://einstein.com", "").map(urlFeatures)

> Seq((Text("http"), Text("einstein.com")),
      (Text(), Text()))
Object-oriented Approach
•  Modularity
•  Code reuse
•  Polymorphism
// Extracting text features
val txt = Seq(
  Url("http://einstein.com"),
  Base64("b25lIHR3byB0aHJlZQ=="),
  Text("Hello world!"),
  Phone("650-123-4567"),
  Text.empty
)
txt.map(_.tokenize)

> Seq(
    TextList("http", "einstein.com"),
    TextList("one", "two", "three"),
    TextList("Hello", "world"),
    TextList("+1", "650", "1234567"),
    TextList()
  )
Why Scala?
•  Combines FP & OOP
•  Strongly-typed
•  Expressive
•  Concise
•  Fun (mostly)
•  Default for Spark
Optimus Prime
An AutoML library for building modular,
reusable, strongly typed ML workflows on Spark
•  Declarative & intuitive syntax
•  Proper level of abstraction
•  Aimed for simplicity & reuse
•  >90% accuracy with 100X reduction in time
Types Hide the Complexity
[Type hierarchy diagram rooted at FeatureType. Types shown: OPNumeric, OPCollection, OPSet, OPList, NonNullable, Text, Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea, OPVector, OPMap, BinaryMap, IntegralMap, RealMap, DateList, DateTimeList, Integral, Real, Binary, Percent, Currency, Date, DateTime, MultiPickList, TextMap, …, TextList, City, Street, Country, PostalCode, Location, State, Geolocation, StateMap, SingleResponse, RealNN, Categorical, MultiResponse]
Legend: bold - abstract type, normal - concrete type, italic - trait, solid line - inheritance, dashed line - trait mixin
Type Safety Everywhere
•  Value Operations
•  Feature Operations
•  Transformation Pipelines (aka Workflow)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList

// Typed feature operations
val title: Feature[Text] = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)

// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
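To make the idea of typed value operations concrete, here is a minimal sketch in plain Scala. The `Text` and `TextList` case classes below are hypothetical stand-ins written for this illustration; they are not the library's actual types, and the whitespace tokenizer is an assumption.

```scala
object TypedValuesSketch {
  // Hypothetical stand-ins for the library's feature types (illustration only).
  sealed trait FeatureType
  final case class Text(value: Option[String]) extends FeatureType
  final case class TextList(value: List[String]) extends FeatureType

  // Typed value operation: a Text goes in, a TextList comes out.
  // An empty Text yields an empty TextList instead of a null or an exception.
  def tokenize(t: Text): TextList =
    TextList(t.value.toList.flatMap(_.split("\\s+").toList).filter(_.nonEmpty))

  def main(args: Array[String]): Unit = {
    println(tokenize(Text(Some("Hello world"))))
    println(tokenize(Text(None)))
  }
}
```

Because the signature is `Text => TextList`, passing a numeric feature to `tokenize` would be rejected at compile time rather than failing inside a running job.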
Book Price Prediction
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tfidf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize()

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
Magic Behind “vectorize()”
// Raw feature definitions
val authr = FeatureBuilder.PickList[Book].extract(_.author).asPredictor
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val descr = FeatureBuilder.Text[Book].extract(_.description).asPredictor
val price = FeatureBuilder.RealNN[Book].extract(_.price).asResponse

// Feature engineering: tokenize, tfidf etc.
val tokns = (title + descr).tokenize(removePunctuation = true)
val tfidf = tokns.tf(numTerms = 1024).idf(minFreq = 0)
val feats = Seq(tfidf, authr).vectorize() // <- magic here

// Model training
implicit val spark = SparkSession.builder.config(new SparkConf).getOrCreate
val books = spark.read.csv("books.csv").as[Book]
val preds = RegressionModelSelector().setInput(price, feats).getOutput
new OpWorkflow().setInput(books).setResultFeatures(feats, preds).train()
Automatic Feature Engineering
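[Diagram: raw features are expanded into derived features, which are combined into a single vector]
Raw features: Zipcode, Subject, Phone, Email, Age
Derived features: Age [0-15], Age [15-35], Age [>35], Email Is Spammy, Top Email Domains, Country Code, Phone Is Valid, Top TF-IDF Terms, Average Income
Output: Vector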
Automatic Feature Engineering
Numeric
•  Imputation
•  Track null value
•  Log transformation for large range
•  Scaling - zNormalize
•  Smart Binning
Categorical
•  Imputation
•  Track null value
•  One Hot Encoding
•  Dynamic Top K pivot
•  Smart Binning
•  LabelCount Encoding
•  Category Embedding
Text
•  Tokenization
•  Hash Encoding
•  TF-IDF
•  Word2Vec
•  Sentiment Analysis
•  Language Detection
Temporal
•  Time difference
•  Time Binning
•  Time extraction (day, week, month, year)
•  Closeness to major events
Spatial
•  Augment with external data, e.g. avg income
•  Spatial fraudulent behavior, e.g. impossible travel speed
•  Geo-encoding
More…
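Two of the transformations listed above, imputation with null tracking and a top-K pivot, can be sketched in plain Scala as follows. This is an illustration under stated assumptions, not the library's implementation: the function names, mean imputation, and the extra "other" and "null" slots are choices made for this sketch.

```scala
object FeatureEngineeringSketch {
  // Numeric: impute missing values with the mean (an assumed strategy) and
  // track nulls in a second indicator column: Array(value, wasNull).
  def imputeWithNullIndicator(xs: Seq[Option[Double]]): Seq[Array[Double]] = {
    val present = xs.flatten
    val mean = if (present.isEmpty) 0.0 else present.sum / present.size
    xs.map {
      case Some(v) => Array(v, 0.0)
      case None    => Array(mean, 1.0)
    }
  }

  // Categorical: one-hot encode the K most frequent values, with one extra
  // slot for all other values and one slot that tracks nulls.
  def topKPivot(xs: Seq[Option[String]], k: Int): Seq[Array[Double]] = {
    val topK: Seq[String] = xs.flatten
      .groupBy(identity)
      .toSeq
      .map { case (value, occurrences) => (value, occurrences.size) }
      .sortBy { case (value, count) => (-count, value) } // ties broken by name
      .take(k)
      .map(_._1)
    val width = topK.size + 2
    xs.map { x =>
      val row = Array.fill(width)(0.0)
      val idx = x match {
        case Some(v) if topK.contains(v) => topK.indexOf(v)
        case Some(_)                     => topK.size     // "other" slot
        case None                        => topK.size + 1 // null slot
      }
      row(idx) = 1.0
      row
    }
  }
}
```

Keeping the null indicator as its own column lets a downstream model learn from the fact that a value was missing, which plain imputation would hide.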
Automatic Feature Selection
•  Analyze features & calculate statistics
•  Ensure features have acceptable ranges
•  Is this feature a leaker?
•  Does this feature help our model? Is it predictive?
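The checks above can be sketched in plain Scala. This is a simplified illustration, not the library's sanity checker: it assumes that a feature whose absolute Pearson correlation with the label exceeds `maxCorrelation` is treated as a likely leaker, that one below `minCorrelation` is treated as not predictive, and that near-constant features are dropped by a variance threshold. All names are hypothetical.

```scala
object SanityCheckSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population variance.
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Pearson correlation; evaluates to NaN when either input is constant.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    val (mx, my) = (mean(xs), mean(ys))
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }

  // Returns the names of the features that pass all checks.
  def keep(
      features: Map[String, Seq[Double]],
      label: Seq[Double],
      maxCorrelation: Double = 0.95,
      minCorrelation: Double = 0.0,
      minVariance: Double = 0.00001): Set[String] =
    features.filter { case (_, xs) =>
      val c = math.abs(pearson(xs, label))
      // Comparisons with NaN are false, so constant features are dropped here too.
      variance(xs) >= minVariance && c >= minCorrelation && c <= maxCorrelation
    }.keySet
}
```

For example, with a label of `Seq(1.0, 2.0, 3.0, 4.0)`, a feature that is exactly twice the label is flagged as a leaker, while a constant feature fails the variance check.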
Automatic Feature Selection
// Sanity check your features against the label
val checked = price.check(
  featureVector = feats,
  checkSample = 0.3,
  sampleSeed = 1L,
  sampleLimit = 100000L,
  maxCorrelation = 0.95,
  minCorrelation = 0.0,
  correlationType = Pearson,
  minVariance = 0.00001,
  removeBadFeatures = true
)

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
Automatic Model Selection
•  Multiple algorithms to pick from
•  Many hyperparameters for each algorithm
•  Automated hyperparameter tuning
–  Faster model creation with improved metrics
–  Search algorithms to find the optimal hyperparameters, e.g. grid search, random search, bandit methods
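Grid search, the simplest of the search strategies named above, can be sketched in plain Scala. This is a minimal illustration, not the library's model selector: the `grid`, `rmse`, and `search` helpers are hypothetical, and the caller supplies the function that trains a model for one hyperparameter combination and returns its validation error.

```scala
object GridSearchSketch {
  // Cartesian product of the named hyperparameter values.
  def grid(params: Map[String, Seq[Double]]): Seq[Map[String, Double]] =
    params.foldLeft(Seq(Map.empty[String, Double])) { case (acc, (name, values)) =>
      for (combo <- acc; v <- values) yield combo + (name -> v)
    }

  // Root mean squared error between predictions and actual values.
  def rmse(predicted: Seq[Double], actual: Seq[Double]): Double =
    math.sqrt(predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }.sum / actual.size)

  // Evaluates every combination and returns the one with the lowest error.
  def search(params: Map[String, Seq[Double]])(
      validationError: Map[String, Double] => Double): (Map[String, Double], Double) =
    grid(params)
      .map(combo => combo -> validationError(combo))
      .reduceLeft((best, next) => if (next._2 < best._2) next else best)

  def main(args: Array[String]): Unit = {
    // Toy "model": y = slope * x + intercept, scored on held-out points.
    val xs = Seq(1.0, 2.0, 3.0)
    val ys = Seq(2.0, 4.0, 6.0)
    val (best, error) = search(Map("slope" -> Seq(1.0, 2.0, 3.0), "intercept" -> Seq(0.0, 1.0))) { p =>
      rmse(xs.map(x => p("slope") * x + p("intercept")), ys)
    }
    println(s"best = $best, rmse = $error")
  }
}
```

The grid grows multiplicatively with each added hyperparameter, which is one reason random search and bandit methods are listed as alternatives.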
Automatic Model Selection
// Model selection and hyperparameter tuning
val preds =
  RegressionModelSelector
    .withCrossValidation(
      dataSplitter = DataSplitter(reserveTestFraction = 0.1),
      numFolds = 3,
      validationMetric = Evaluators.Regression.rmse(),
      trainTestEvaluators = Seq.empty,
      seed = 1L)
    .setModelsToTry(LinearRegression, RandomForestRegression)
    .setLinearRegressionElasticNetParam(0, 0.5, 1)
    .setLinearRegressionMaxIter(10, 100)
    .setLinearRegressionSolver(Solver.LBFGS)
    .setRandomForestMaxDepth(2, 10)
    .setRandomForestNumTrees(10)
    .setInput(price, checked).getOutput

new OpWorkflow().setInput(books).setResultFeatures(checked, preds).train()
Automatic Model Selection
Demo
How well does it work?
•  Most of our models deployed in production are completely hands-free
•  We serve 475,000,000+ predictions per day
Fantastic ML apps HOWTO
•  Define appropriate level of abstraction
•  Use types to express it
•  Automate everything:
–  feature engineering & selection
–  model selection
–  hyperparameter tuning
–  Etc.
Months -> Hours
Further exploration
Talks @ Scale By The Bay 2017:
•  “Real Time ML Pipelines in Multi-Tenant Environments” by Karl Skucha and Yan Yang
•  “Fireworks - lighting up the sky with millions of Sparks” by Thomas Gerber
•  “Functional Linear Algebra in Scala” by Vlad Patryshev
•  “Complex Machine Learning Pipelines Made Easy” by Chris Rupley and Till Bergmann
•  “Just enough DevOps for data scientists” by Anya Bida
We are hiring!
einstein-recruiting@salesforce.com
Thank You
