
Embracing a Taxonomy of Types to Simplify Machine Learning with Leah McGuire


Salesforce has created a machine learning framework on top of Spark ML that builds personalized models for businesses across a range of applications. Hear how expanding the type information attached to features has allowed them to handle each customer's custom datasets and still produce good results.

By building a platform that automatically does feature engineering on rich types (e.g. Currency and Percentages rather than Doubles; Phone Numbers and Email Addresses rather than Strings), they have automated much of the work that consumes most data scientists’ time. Learn how you can do the same by building a single model outline based on the application, and then having the framework customize it for each customer.



  1. Leah McGuire, Principal Member of Technical Staff, Salesforce Einstein – lmcguire@salesforce.com – @leahmcguire. Types, Types, Types! Embracing a hierarchy of types to simplify machine learning
  2. Let's make sure you are in the right talk. What I am going to talk about:
      • What machine learning means at Salesforce
      • Problems in machine learning for business-to-business (B2B) companies
      • Automating machine learning and how our AutoML library (Optimus Prime) works
      • The utility of having strongly typed features in AutoML
      • What we have learned and what we are planning
  3. Salesforce and Machine Learning
  4. Salesforce
  5. The Problem: for the majority of businesses, data science is out of reach
  6. Building the World's Smartest CRM
      • Sales Cloud Einstein: Predictive Lead Scoring, Opportunity Insights, Automated Activity Capture
      • Commerce Cloud Einstein: Product Recommendations, Predictive Sort, Commerce Insights
      • App Cloud Einstein: Heroku + PredictionIO, Predictive Vision Services, Predictive Sentiment Services, Predictive Modeling Services
      • Service Cloud Einstein: Recommended Case Classification, Recommended Responses, Predictive Close Time
      • Marketing Cloud Einstein: Predictive Scoring, Predictive Audiences, Automated Send-time Optimization
      • Community Cloud Einstein: Recommended Experts, Articles & Topics, Automated Service Escalation, Newsfeed Insights
      • Analytics Cloud Einstein: Predictive Wave Apps, Smart Data Discovery, Automated Analytics & Storytelling
      • IoT Cloud Einstein: Predictive Device Scoring, Recommend Best Next Action, Automated IoT Rules Optimization
  7. Machine learning workflows – and how much more complicated they get for B2B
  8. Building a machine learning model – what Kaggle would lead us to believe
      [Diagram: Feature Engineering → Model Training (Models A, B, C) → Model Evaluation]
  9. Real-life ML – building a ML model workflow
      [Diagram: ETL → Feature Engineering → Model Training (Models A, B, C) → Model Evaluation → Deployment → Scoring]
  10. Building a machine learning model – over and over again
      [Diagram: the same full workflow repeated three times]
  11. We can't build one global model
      • Privacy concerns: customers don't want data cross-pollinated
      • Business use cases: industries are very different; processes are different
      • Platform customization: ability to create custom fields and objects
      • Scale, automation: ability to create …
  12. Building a machine learning model – over and over again
      [Diagram: the same ETL → Feature Engineering → Model Training (Models A, B, C) → Model Evaluation → Deployment → Scoring workflow tiled dozens of times, shrinking toward illegibility]
  13. Automating machine learning – enter Einstein (and Optimus Prime)
  14. Turning a black art into a paint-by-number kit
      • ML is not magic, just statistics – generalizing from examples
      • But there is a 'black art' to producing good models
      • Input data needs to be combined, filtered, cleaned, etc.
      • Producing the best features for your model takes time
      • You can't just throw an ML algorithm at your raw data and expect good results
  15. Optimus Prime – a library to develop reusable, modular and typed ML workflows
      Keep it DRY (don't repeat yourself) and DRO (don't repeat others)
      • The Spark ML pipeline (estimator, transformer) model is nice
      • The lack of types in Spark is not
      • Want to use more than Spark ML
      • Declarative and intuitive syntax – for both workflow generation and developers
      • Typed, reusable operations
      • Multitenant application support
      • All built in Scala
  16. Simple interchangeable parts – in a declarative, type-safe syntax
      val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
      val (pred, raw, prob) = featureVector.check(survived).classify(survived)
      val workflow = new OpWorkflow().setResultFeatures(pred).setDataReader(titanicReader)
      [Diagram: Features are transformed with Stages (Transformers, and Estimators that are fitted into Transformers) and produce new Features; Workflows are materialized by Readers' input data]
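A hedged sketch of how such a workflow might then be run end to end; the train() and score() method names here are assumptions for illustration rather than confirmed Optimus Prime API:

      val model = workflow.train()   // fits every estimator in the DAG on the reader's data
      val scores = model.score()     // materializes the result features (the predictions)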
  17. Automating typed feature engineering and modeling (with Optimus Prime)
  18. Features are given a type on creation – death to runtime errors!
      val gender = FeatureBuilder.Categorical[Titanic]
        .extract(d => Option(d.getGender).toSet[String]).asPredictor
      • Features are strongly typed
      • Each stage takes specific input type(s) and returns specific output type(s)
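To make "death to runtime errors" concrete, here is a minimal, self-contained sketch (simplified stand-ins, not the actual Optimus Prime classes) of how a stage can declare the feature types it consumes and produces, so a mismatched wiring fails at compile time rather than at runtime:

      sealed trait FeatureType
      case class Email(value: Option[String]) extends FeatureType
      case class Real(value: Option[Double]) extends FeatureType
      case class Binary(value: Option[Boolean]) extends FeatureType

      // A stage declares its input and output feature types
      abstract class UnaryTransformer[I <: FeatureType, O <: FeatureType] {
        def transformFn: I => O
      }

      // Hypothetical stage: accepts only Email features, produces a Binary flag
      class IsFreeMailDomain extends UnaryTransformer[Email, Binary] {
        def transformFn: Email => Binary =
          e => Binary(e.value.map(_.endsWith("@gmail.com")))
      }

      // new IsFreeMailDomain().transformFn(Real(Some(1.0)))  // does not compile: Real is not Email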
  19. Creating a workflow DAG with features
      • Features point to a column of data
      • The type of the feature determines which stages can act on it
      [Diagram: DAG with raw features gender, age, name; derived features genderPivot, title, featureVector; label survived; outputs pred, prob, rawPred]
  20. Creating a workflow DAG with features
      • When a stage acts on a feature it produces a new feature (or features)
      • Keep manipulating features until you reach your goal
      [Diagram: gender –Pivot→ genderPivot; name –Title Regex→ title; genderPivot, title, age –Combine→ featureVector; featureVector + survived –Model→ pred, prob, rawPred]
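A hedged sketch of what building this DAG could look like in code; pivot() and extractWithRegex() are hypothetical stand-ins for the stage methods named in the diagram, while vectorize() and classify() appear elsewhere in the deck:

      val genderPivot = gender.pivot()                          // "Pivot" stage: Categorical → vector (hypothetical name)
      val title = name.extractWithRegex("(Mr|Mrs|Miss|Dr)\\.")  // "Title Regex" stage (hypothetical name)
      val featureVector = Seq(genderPivot, title, age).vectorize()  // "Combine"
      val (pred, raw, prob) = featureVector.classify(survived)      // "Model"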
  21. Done manipulating your features? Make them.
      • Once you make your final feature you have the full DAG
      • Features are materialized by the workflow
      • Initial data for the workflow is provided by the reader
      [Diagram: Features are transformed with Stages (Transformers, and Estimators that are fitted into Transformers) and produce new Features; Workflows are materialized by Readers' input data]
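A hypothetical sketch of the titanicReader referenced earlier – the DataReaders factory and its parameters are assumptions, shown only to illustrate that a reader pairs a data source with a record key:

      val titanicReader = DataReaders.Simple.csv[Titanic](   // assumed factory name
        path = Some("/data/titanic.csv"),
        key  = _.getPassengerId.toString                     // each record keyed by passenger id
      )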
  22. The power of types!
  23. Using types to automate feature engineering
      val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
      • Each feature is mapped to an appropriate .vectorize() stage based on its type
        – gender (a Categorical) and age (a Real) are automatically assigned to different stages
      • You also have the option to do the exact type-safe manipulations you want
        – age can undergo special transformations if desired:
          val ageBuckets = age.bucketize(buckets(0, 10, 20, 40, 100))
          val featureVector = Seq(pClass, name, gender, ageBuckets, sibSp, parch, ticket, cabin, embarked).vectorize()
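A toy, self-contained sketch (assumed and heavily simplified, not the real implementation) of the kind of per-type dispatch .vectorize() performs – each feature type gets a sensible default encoding before everything is assembled into one numeric vector:

      sealed trait FeatureType
      case class Categorical(values: Set[String]) extends FeatureType
      case class Real(value: Option[Double]) extends FeatureType

      def defaultEncoding(ft: FeatureType): Vector[Double] = ft match {
        case Categorical(values) =>
          // toy one-hot encoding: hash each category into 8 buckets
          val v = Array.fill(8)(0.0)
          values.foreach { s => v(((s.hashCode % 8) + 8) % 8) = 1.0 }
          v.toVector
        case Real(value) =>
          // impute missing values with 0.0 and pass the number through
          Vector(value.getOrElse(0.0))
      }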
  24. Show me the types! The Optimus Prime type hierarchy
      [Diagram: the full type hierarchy, rooted at FeatureType. Legend: bold – abstract class, italic – trait, normal – concrete class; arrows – inheritance]
      • Numerics (OPNumeric): Integral, Real, Binary, Percent, Currency, Date, DateTime
      • Text: Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea
      • Categoricals: Categorical, Ordinal, MultiPickList
      • Collections (OPCollection): OPSet, OPSortedSet, OPList (TextList, DateList, DateTimeList), OPVector
      • Maps (OPMap): BinaryMap, IntegralMap, RealMap, TextMap, CategoricalMap, OrdinalMap, StateMap, …
      • Locations: City, Street, Country, State, Postal Code, Geolocation
      Note: all types are assumed to be nullable unless the NonNullable trait is mixed in.
      Field types reference: https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/field_types.htm
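A compressed Scala sketch of one slice of this hierarchy; the constructors and some parent–child relations are assumptions read off the slide, with abstract members elided:

      abstract class FeatureType          // root of the hierarchy: nullable by default
      trait NonNullable                   // mixed in where a value is guaranteed

      abstract class OPNumeric extends FeatureType
      class Real(val value: Option[Double]) extends OPNumeric
      class Currency(value: Option[Double]) extends Real(value)   // assumed: Currency refines Real
      class Percent(value: Option[Double]) extends Real(value)    // assumed: Percent refines Real
      class Integral(val value: Option[Long]) extends OPNumeric
      class Date(value: Option[Long]) extends Integral(value)
      class DateTime(value: Option[Long]) extends Date(value)

      class Text(val value: Option[String]) extends FeatureType
      class Email(value: Option[String]) extends Text(value)
      class Phone(value: Option[String]) extends Text(value)

A stage written against Real then automatically works on Currency and Percent, while a stage written against Text works on Email, Phone, and the other text subtypes.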
  25. Take the types away!! Why would we make this monstrosity?
      • Sometimes a type is all you have
      • The hierarchy allows both very specific and very general stages
      • Type safety in production saves a lot of headaches
      [Diagram: the same Optimus Prime type hierarchy as on the previous slide]
  26. Sanity Checking – the stage that checks your features
      • Check data quality before modeling
        – Label leakage
        – Features have acceptable ranges
      • The feature types allow much better checks
      val checkedVector = featureVector.check(survived)
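As one concrete illustration of such a check (a hedged sketch, not the actual sanity checker; the function name and threshold are assumptions), label leakage can be screened by flagging numeric features whose correlation with the label is implausibly high:

      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.functions.corr

      // Flag numeric columns that correlate almost perfectly with the label –
      // a classic symptom of label leakage.
      def leakageSuspects(df: DataFrame, label: String, features: Seq[String],
                          threshold: Double = 0.95): Seq[String] =
        features.filter { f =>
          val row = df.select(corr(f, label)).first()
          !row.isNullAt(0) && math.abs(row.getDouble(0)) > threshold
        }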
  27. Model Selection stage – resampling, hyperparameter tuning, comparing models
      • Many possible models for each class of problem
      • Many hyperparameters for each type of model
      • Finding the right model for THIS dataset makes a huge difference
      val (pred, raw, prob) = checkedVector.classify(survived)
  28. Types can save us – and if you don't believe me take a look at the code
      val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
      val (pred, raw, prob) = featureVector.check(survived).classify(survived)
      val workflow = new OpWorkflow().setResultFeatures(pred).setDataReader(titanicReader)
  29. Types can save us – and if you don't believe me take a look at the code
      val featureVector = Seq(pClass, name, gender, age, sibSp, parch, ticket, cabin, embarked).vectorize()
      val (pred, raw, prob) = featureVector.check(survived).classify(survived)
      val workflow = new OpWorkflow().setResultFeatures(pred).setDataReader(titanicReader)

      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.functions.{avg, col, udf}

      def addFeatures(df: DataFrame): DataFrame = {
        // Create a new family size field := siblings + spouses + parents + children + self
        val familySizeUDF = udf { (sibsp: Double, parch: Double) => sibsp + parch + 1 }
        df.withColumn("fsize", familySizeUDF(col("sibsp"), col("parch"))) // <-- full freedom to overwrite
      }

      def fillMissing(df: DataFrame): DataFrame = {
        // Fill missing age values with the average age
        val avgAge = df.select(avg("age")).first().getDouble(0)
        // Fill missing embarked values with the default "S" (i.e. Southampton)
        val embarkedUDF = udf { (e: String) => if (e == null || e.isEmpty) "S" else e }
        df.na.fill(Map("age" -> avgAge)).withColumn("embarked", embarkedUDF(col("embarked")))
      }
  30. Types can save us – and if you don't believe me take a look at the code
      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.classification.LogisticRegression
      import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

      // Modify the dataframe
      val allData = fillMissing(addFeatures(rawData)).cache() // <-- need to remember about caching!
      // Split the data and cache it
      val Array(trainSet, testSet) = allData.randomSplit(Array(0.75, 0.25)).map(_.cache())

      // Prepare the categorical columns
      val categoricalFeatures = Array("pclass", "sex", "embarked")
      val stringIndexers = categoricalFeatures.map(colName =>
        new StringIndexer().setInputCol(colName).setOutputCol(colName + "_index").fit(allData)
      )

      // Concatenate all the features into a numeric feature vector
      val allFeatures = Array("age", "sibsp", "parch", "fsize") ++ stringIndexers.map(_.getOutputCol)
      val vectorAssembler = new VectorAssembler().setInputCols(allFeatures).setOutputCol("feature_vector")

      // Prepare the Logistic Regression estimator
      val logReg = new LogisticRegression().setFeaturesCol("feature_vector").setLabelCol("survived")

      // Finally build the pipeline with the stages above
      val pipeline = new Pipeline().setStages(stringIndexers ++ Array(vectorAssembler, logReg))
  31. Types can save us – and if you don't believe me take a look at the code
      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
      import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}

      // Cross-validate our pipeline with various parameters
      val paramGrid =
        new ParamGridBuilder()
          .addGrid(logReg.regParam, Array(1.0, 0.1, 0.01))
          .addGrid(logReg.maxIter, Array(10, 50, 100))
          .build()

      val crossValidator =
        new CrossValidator()
          .setEstimator(pipeline) // <-- set our pipeline here
          .setEstimatorParamMaps(paramGrid)
          .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("survived"))
          .setNumFolds(3)

      // Train the model & compute scores
      val model: CrossValidatorModel = crossValidator.fit(trainSet)
      val scores: DataFrame = model.transform(testSet)

      // Save the model for later use
      model.save("/models/titanic-model.ml")
  32. Where are we going and what have we learned
  33. Key takeaways
      • ML for B2B is a whole other beast
      • Spark ML is great, but it needs type safety
      • A simple and intuitive syntax saves you trouble down the road
      • Types in ML are incredibly useful
      • Scala has all the relevant facilities to provide the above
      • Modularity and reusability are key
  34. Going forward with Optimus Prime
      • Going beyond Spark ML for algorithms and small-scale data
      • Making everything smarter (feature engineering, sanity checking, model selection)
      • Template generation
      • Improvements to the developer interface
  35. If You're Curious … einstein-recruiting@salesforce.com
  36. Thank You
