Doubt Truth to be a Liar: Non Triviality of Type Safety for Machine Learning (at Salesforce)


Feature vectors – sequences of heterogeneous types – are the basic unit of any machine learning algorithm. Further, feature engineering involves manipulations of these feature vectors and is a fundamental step in optimizing the accuracy of machine learning models. These manipulations may take the form of regular Scala sequence operations, which can also be distributed using frameworks such as Spark or Flink. When building a general-purpose machine learning framework, the types of the engineered features are not known in advance, which is a problem for statically typed languages. In this talk, I will walk through possible solutions for designing type-safe feature vectors in Scala that provide compile-time type safety for feature engineering and other machine learning use cases. The solutions will demonstrate applications of Shapeless, Scala Macros, and Quasiquotes.

Presentation Recording - SBTB2016 https://www.youtube.com/watch?v=FfpSyXTx0uo

  1. Matthew Tovbin, Principal Engineer, Salesforce Einstein
     mtovbin@salesforce.com | @tovbinm
     Doubt Truth to be a Liar: Non Triviality of Type Safety for Machine Learning
  2. “Doubt thou the stars are fire,
     Doubt that the sun doth move,
     Doubt truth to be a liar,
     But never doubt I love.”
     – William Shakespeare, Hamlet
  3. A glimpse into the future. What I am going to talk about:
     • Machine Learning (ML) 101
     • Real-life ML
     • Building ML application with Spark ML
     • Typed Feature Engineering with Optimus Prime
     • Behind the scenes
     • Going forward
  4. Machine Learning 101. What is Machine Learning?
     • “The capacity of a computer to learn from experience, i.e. to modify its processing on the basis of newly acquired information” – 1950s, IBM Journal
     What is “experience” in computer terms?
     • It’s just data.
     What tasks does Machine Learning solve?
     • Recognition, diagnosis, prediction, forecasting, planning, data mining, etc.
  5. Machine Learning 101 – [1/2] Terms
     Feature
     • An individual measurable property of a phenomenon being observed.
     • Choosing informative, discriminating and independent features is a crucial step for building effective ML algorithms.
     Feature Vector
     • An n-dimensional vector that represents a set of features corresponding to a single observation, e.g. an email open/click, a product purchase, etc.
     Model
     • A structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction.
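     To make these terms concrete, here is a minimal sketch of mine (not from the slides; all names are illustrative) of turning a single observation into a feature vector:

     // One observation, e.g. a single email event
     case class EmailEvent(opened: Boolean, clicks: Int, hourOfDay: Int)

     // Each field is a feature; together they form the observation's feature vector
     def toFeatureVector(e: EmailEvent): Array[Double] =
       Array(if (e.opened) 1.0 else 0.0, e.clicks.toDouble, e.hourOfDay.toDouble)

     toFeatureVector(EmailEvent(opened = true, clicks = 3, hourOfDay = 14))
     // => Array(1.0, 3.0, 14.0) – a 3-dimensional feature vector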
  6. Machine Learning 101 – [2/2] Terms
     Training
     • The process of training an ML model involves providing an ML algorithm with training data to learn from.
     • During training, the data is evaluated by the ML algorithm, which analyzes its distribution and type, looking for rules and patterns that are later used for prediction.
     Scoring
     • The process of applying a trained model to new data to generate predictions and other values.
     • Examples: a list of recommended items, a forecast for time-series models, estimates of projected demand/volume/etc., probability scores.
  7. Machine Learning 101 – Building a ML model (diagram):
     Feature Engineering -> Model Training -> Models A, B, C -> Model Evaluation
  8. Real-life ML – Building a ML model pipeline (diagram):
     ETL -> Feature Engineering -> Model Training -> Models A, B, C -> Model Evaluation -> Deployment -> Scoring
  9. Real-life ML. Just a few problems to mention:
     • ETL is tough
     • Feature Engineering is even tougher
     • Our prototype in R/Python/Octave works great, but …
     • Copy/pasting code across projects doesn’t scale
     • Model Training fails exactly two hours after you go to sleep:
       • Data is not there
       • Data is there, but in a wrong format
       • Data is there, but is insufficient
       • OOM, Insufficient Space, Serialization…
     • So we have the models/scores, but can we trust them?!
  10. Real-life ML. Specifically for Salesforce: multi-tenancy
      • Multiple customers: Square, Fanatics, etc.
      • Multiple customer environments: Live, Staging, QA, Dev.
      • Multiple data sources: SFDC, Marketing Cloud, Service Cloud, IoT Cloud, etc.
      • Multiple data entities: Leads, Opportunities, Email Campaigns, etc.
      • Multiple applications: Lead Scoring, Predictive Journeys, or custom.
      Security, Scale, Automation, Transparency, Cost Efficiency and so on.
  11. Salesforce Einstein – AI for everyone
      • ML Platform for customers, engineers, data scientists
      • No need for ETL or PhD
      PredictionIO
      • Most starred Scala project on GitHub
      • Now part of Apache’s incubator program
      Optimus Prime
      • In-house transformation framework
      • Declarative, collaborative, reusable, typed
      Services, Microservices, Nanoservices…
  12. Building ML application with Spark ML
  13. Predict survival on the Titanic
      Sources: https://www.kaggle.com/c/titanic/data, https://github.com/BenFradet/spark-kaggle
      PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
      1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
      2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
      3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
      4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
      5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S
      6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q
      7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S
      8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | | S
      ...
      890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26 | 0 | 0 | 111369 | 30 | C148 | C
      891 | 0 | 3 | Dooley, Mr. Patrick | male | 32 | 0 | 0 | 370376 | 7.75 | | Q
  14. Building ML application with Spark ML – [1/5] Spinning up Spark
      import org.apache.spark._
      import org.apache.spark.sql._
      import org.apache.spark.sql.types._
      import org.apache.spark.sql.functions._
      import org.apache.spark.ml._
      import org.apache.spark.ml.classification._
      import org.apache.spark.ml.evaluation._
      import org.apache.spark.ml.feature._
      import org.apache.spark.ml.tuning._

      // Spinning up Spark
      val conf = new SparkConf().setMaster("local[2]")
      val session = SparkSession.builder.config(conf).getOrCreate
      val (sc, sqlc) = (session.sparkContext, session.sqlContext)

      import sqlc.implicits._
  15. Building ML application with Spark ML – [2/5] Reading the data
      def readData(file: String): DataFrame = {
        val schema = StructType(Array(
          StructField("PassengerId", IntegerType, nullable = true), // <-- field names are easy to misspell
          StructField("Survived", DoubleType, nullable = true),     // <-- we can set any type here
          StructField("Pclass", DoubleType, nullable = true),
          StructField("Name", StringType, nullable = true),
          StructField("Sex", StringType, nullable = true),
          StructField("Age", DoubleType, nullable = true),
          StructField("SibSp", DoubleType, nullable = true),
          StructField("Parch", DoubleType, nullable = true),
          StructField("Ticket", StringType, nullable = true),
          StructField("Fare", DoubleType, nullable = true),
          StructField("Cabin", StringType, nullable = true),
          StructField("Embarked", StringType, nullable = true)
        ))
        val df: DataFrame = sqlc.read.format("csv").option("header", "true").schema(schema).load(file)
        // Select and rename necessary fields
        val select = Array($"Survived".as("survived"), $"Sex".as("sex"), $"Age".as("age"),
          $"Pclass".as("pclass"), $"SibSp".as("sibsp"), $"Parch".as("parch"), $"Embarked".as("embarked"))
        df.select(select: _*)
      }
      val rawData: DataFrame = readData(file = "titanic.csv") // <-- runtime exceptions...
  16. Building ML application with Spark ML – [3/5] Feature Engineering
      def addFeatures(df: DataFrame): DataFrame = {
        // Create a new family size field := siblings + spouses + parents + children + self
        val familySizeUDF = udf { (sibsp: Double, parch: Double) => sibsp + parch + 1 }

        df.withColumn("fsize", familySizeUDF(col("sibsp"), col("parch"))) // <-- full freedom to overwrite
      }

      def fillMissing(df: DataFrame): DataFrame = {
        // Fill missing age values with average age
        val avgAge = df.select(avg("age")).first().getDouble(0)

        // Fill missing embarked values with default "S" (i.e. Southampton)
        val embarkedUDF = udf { (e: String) => e match { case x if x == null || x.isEmpty => "S"; case x => x } }

        df.na.fill(Map("age" -> avgAge)).withColumn("embarked", embarkedUDF(col("embarked")))
      }
      // Modify the dataframe
      val allData = fillMissing(addFeatures(rawData)).cache() // <-- need to remember about caching
      // Split the data and cache it
      val Array(trainSet, testSet) = allData.randomSplit(Array(0.75, 0.25)).map(_.cache())
  17. Building ML application with Spark ML – [4/5] Building the pipeline
      // Prepare categorical columns
      val categoricalFeatures = Array("pclass", "sex", "embarked")
      val stringIndexers = categoricalFeatures.map(colName =>
        new StringIndexer().setInputCol(colName).setOutputCol(colName + "_index").fit(allData)
      )
      // Concat all the features into a numeric feature vector
      val allFeatures = Array("age", "sibsp", "parch", "fsize") ++ stringIndexers.map(_.getOutputCol)

      val vectorAssembler = new VectorAssembler().setInputCols(allFeatures).setOutputCol("feature_vector")
      // Prepare Logistic Regression estimator
      val logReg = new LogisticRegression().setFeaturesCol("feature_vector").setLabelCol("survived")
      // Finally build the pipeline with the stages above
      val pipeline = new Pipeline().setStages(stringIndexers ++ Array(vectorAssembler, logReg))
  18. Building ML application with Spark ML – [5/5] Model training
      // Cross validate our pipeline with various parameters
      val paramGrid =
        new ParamGridBuilder()
          .addGrid(logReg.regParam, Array(1, 0.1, 0.01))
          .addGrid(logReg.maxIter, Array(10, 50, 100))
          .build()

      val crossValidator =
        new CrossValidator()
          .setEstimator(pipeline) // <-- set our pipeline here
          .setEstimatorParamMaps(paramGrid)
          .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("survived"))
          .setNumFolds(3)

      // Train the model & compute scores
      val model: CrossValidatorModel = crossValidator.fit(trainSet)
      val scores: DataFrame = model.transform(testSet)

      // Save the model for later use
      model.save("/models/titanic-model.ml")
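      A natural follow-up, not shown on the slide but using the standard Spark ML API already imported above, is to evaluate the held-out scores:

      // BinaryClassificationEvaluator defaults to areaUnderROC and reads the
      // "rawPrediction" column that LogisticRegression appends to the scores
      val evaluator = new BinaryClassificationEvaluator().setLabelCol("survived")
      val auc: Double = evaluator.evaluate(scores)
      println(s"Test AUC = $auc")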
  19. Building ML application with Spark ML – The good parts
      • Simple abstraction: Transformers (.map), Estimators (.reduce) and Pipelines
      • Serialization allows reusability of models
      • Good implementations for various estimators: Word2Vec, LogReg, etc.
      • All-in-one: data exploration, prototyping, productionization
      • Multi-language support: Java/Scala/Python
      • Healthy ecosystem
  20. Building ML application with Spark ML – The not so good parts
      • No type checking (especially painful for Transformers, Estimators and Pipelines)
      • Transformer and Estimator interfaces are too open: Dataset => DataFrame
      • DataFrames are everywhere:
        • No type checking
        • Easy to misspell column names (see the sketch below)
        • No integration with ML Vector
        • Missing a lot of RDD functionality
      • Lack of support for common data I/O operations
      • Schema and algorithm definitions are interleaved with data manipulations
      Can we do better?!
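      A quick sketch of the column-name problem from the list above (assuming the allData DataFrame from slide 16): the typo compiles fine, and Spark only rejects it when the query is analyzed at runtime:

      // Compiles – column names are plain strings, so the typo goes unnoticed...
      val survivors = allData.filter(col("surviv3d") === 1.0)
      // ...and throws at runtime:
      // org.apache.spark.sql.AnalysisException: cannot resolve '`surviv3d`' given input columns ...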
  21. Typed Feature Engineering with Optimus Prime
  22. Optimus Prime. What is Optimus Prime?
      • A transformation framework to develop reusable, modular and typed ML pipelines
      Why are we building it?
      • Declarative and intuitive syntax
      • Typed operations with Spark ML
      • Reusability of I/O operations, features, transformations, pipelines
      • Separation of features and transformations from data operations
      • Multitenant applications
  23. Optimus Prime
  24. Building ML application with Optimus Prime – [1/3] Defining a reader & raw features
      import com.salesforce.op._
      import com.salesforce.op.test.avro.Passenger

      // Define the reader from CSV to Avro
      val trainReader: DataReader[Passenger] = DataReaders.Simple.csv[Passenger](
        path = Some("titanic.csv"),
        schema = Passenger.getClassSchema.toString
      )
      // Define the response feature (feature names are inferred from val names)
      val survived = FeatureBuilder.Binary[Passenger].extract(Option(_.getSurvived).map(_ != 0)).asResponse
      // Define the predictor features
      val age = FeatureBuilder.NullableNumeric[Passenger].extract(Option(_.getAge)).asPredictor
      val sex = FeatureBuilder.Categorical[Passenger].extract(Set(_.getSex)).asPredictor
      val pclass = FeatureBuilder.Numeric[Passenger].extract(_.getPClass).asPredictor
      val sibsp = FeatureBuilder.Numeric[Passenger].extract(_.getSibSp).asPredictor
      val parch = FeatureBuilder.Numeric[Passenger].extract(_.getParCh).asPredictor
      val embarked = FeatureBuilder.Text[Passenger].extract(Option(_.getEmbarked)).asPredictor
  25. Building ML application with Optimus Prime – [2/3] Feature engineering
      // Create a new family size feature – no annoying UDFs here!
      val fsize: FeatureLike[Numeric] = sibsp + parch + 1
      // Fill missing age values with average age (i.e. from NullableNumeric we get Numeric)
      val ageFilled: FeatureLike[Numeric] = age.fillMissingWithMean
      // Fill missing embarked values with default "S" (i.e. Southampton)
      val embarkedFilled: FeatureLike[Text] = embarked.fillMissingWith("S")
      // Create a feature vector using default vectorizers
      val featureVector: FeatureLike[Vector] =
        Seq(sex, ageFilled, pclass, sibsp, parch, fsize, embarkedFilled).vectorize()
  26. Building ML application with Optimus Prime – [3/3] Model training
      val modelSelector = { // Create a model selector with two algorithms
        new ModelSelector[Passenger]().setInput(survived, featureVector)
          .setParams(
            LogisticRegression.RegParam -> Array(1, 0.1, 0.01),
            LogisticRegression.MaxIter -> Array(10, 50, 100),
            RandomForest.NumTrees -> Array(3, 5, 10)
          ).setModels(Algs.LogisticRegression, Algs.RandomForest) // <-- multiple algorithms here
          .setEvaluator(Evals.BinaryClassification)
      }
      // Build the pipeline with the model selector
      val pipeline = new OpPipeline[Passenger]().setInput(modelSelector)

      // And only now we are spinning up Spark
      val conf = new SparkConf().setMaster("local[2]")
      implicit val session = SparkSession.builder.config(conf).getOrCreate

      // Train the model & compute scores
      val model: OpPipelineModel[Passenger] = pipeline.setReader(trainReader).train()
      val scores: DataFrame = model.setReader(testReader).score()

      model.save("/models/titanic-model.op") // Save the model for later use
  27. Building ML application with Optimus Prime – Summary
      • Everything is typed: readers, features, pipeline, model
      • Declarative and intuitive syntax with code completion
      • Feature names are inferred from val names -> no misspelled names
      • Features are always unique in a code block, compilation error otherwise
      • Common data I/O operations provided with DataReaders (joins, aggregations, etc.)
      • DataFrames are abstracted away – no direct interaction
      • Features and transformations are separate from data operations
  28. Behind the scenes of Optimus Prime
  29. Types and interactions (diagram): Features (Feature[T <: FeatureType]: Numeric, Text, Categorical, Binary, …) are transformed with Transformers (.map; Unary, Binary, …) and Estimators (.reduce; Average, Word2Vec, …), which are fitted into models (e.g. My Model). DataReaders (CSV, Avro, …) read, materialize, join and aggregate the data, and Pipelines (Titanic, Lead Scoring, …) are trained to produce the models.
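      The central analogy of the diagram – Transformers as .map, Estimators as .reduce – can be sketched with plain Scala collections (my illustration, not OP code):

      val xs = Seq(1.0, 4.0, 7.0)
      val shifted = xs.map(_ + 1)            // Transformer: a stateless row-wise map
      val mean = xs.reduce(_ + _) / xs.size  // Estimator: learn state via a reduce...
      val centered = xs.map(_ - mean)        // ...and the fitted model maps using that state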
  30. Feature Types V1
      case object FeatureTypes {
        type Numeric = Double
        type NullableNumeric = Option[Double]
        type Categorical = scala.collection.mutable.WrappedArray[String]
        type Text = Option[String]
        type Binary = Option[Boolean]
        type DateList = scala.collection.mutable.WrappedArray[Long]
        type KeyString = scala.collection.Map[String, String]
        type KeyNumeric = scala.collection.Map[String, Double]
        type KeyBinary = scala.collection.Map[String, Boolean]
        type Vector = org.apache.spark.ml.linalg.Vector

        // TBD: Specific/rich types: Email, Phone, URL, etc.
      }
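      One way to see why V1 was not enough (my illustration, grounded only in how Scala type aliases behave): aliases are fully transparent, so they buy readability but no compile-time safety:

      import FeatureTypes._
      val age: Numeric = 29.0
      val fare: Numeric = 7.25
      val nonsense: Double = age + fare // compiles – Numeric IS Double, nothing distinguishes the features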
  31. Feature Types V2 (class hierarchy diagram). FeatureType sits at the root, with numeric types under OPNumeric (Integral, Real, Binary, Percent, Currency, Date, DateTime), rich text types under Text (Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea), collections under OPCollection (OPSet, OPSortedSet, OPList, TextList, DateList, DateTimeList, MultiPickList), maps under OPMap (BinaryMap, IntegralMap, RealMap, TextMap, CategoricalMap, OrdinalMap), plus OPVector and traits such as NonNullable, Categorical and Ordinal. Legend: bold – abstract class, italic – trait, normal – concrete class.
  32. Typed Features
      // Typed value container
      trait FeatureType extends Serializable {

        type Value // feature value type

        def value: Value // actual value

        def isEmpty: Boolean // true if value is empty

        def isNullable: Boolean // true if value is nullable

        // ...
      }
      // For example, a text feature value type
      class Text(val value: Option[String]) extends FeatureType {
        type Value = Option[String]
        def this(value: String) = this(Option(value))
        final def isEmpty: Boolean = value.isEmpty
      }
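      Usage of the Text class above follows directly from its constructors; wrapping the value in Option makes null handling explicit:

      new Text("S").value            // Some("S")
      new Text(null: String).value   // None – Option(null) absorbs the null
      new Text(null: String).isEmpty // true, and no NullPointerException anywhere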
  33. Typed Features
      // Represents a single feature (dimension)
      trait FeatureLike[O <: FeatureType] extends Serializable {

        implicit def wtt: WeakTypeTag[O] // Overcoming type erasure

        def name: String // name of the feature

        def defaultValue: O // feature default value

        def originStage: OpPipelineStage[O] // the stage which generated this feature

        def parents: Seq[FeatureLike[_ <: FeatureType]] // the input features for the origin stage

        // feature transformation function (i.e. map). We have more like this...
        final def transformWith[U <: FeatureType](stage: OpPipelineStage1[O, U]): FeatureLike[U]

        // ...
      }
  34. Feature names from vals using Macros magic
      val sibsp = FeatureBuilder.Numeric[Passenger](name = "sibsp")

      val sibsp = FeatureBuilder.Numeric[Passenger] // <-- can we just do this?!

      object FeatureBuilder {
        def Numeric[I]: FeatureBuilder[I, Numeric] = macro FeatureBuilderMacros.apply[I, Numeric]
      }

      // HAHA! So we meet again!
      private[op] object FeatureBuilderMacros {
        def apply[I: c.WeakTypeTag, O: c.WeakTypeTag](c: Context): c.Expr[FeatureBuilder[I, O]] = {
          import c.universe._
          val enclosingValName = MacrosHelper.definingValName(c)
          val featureName = c.Expr[String](Literal(Constant(enclosingValName)))
          val fbApply = Select(reify(FeatureBuilder).tree, TermName("apply"))
          val fbExpr = c.Expr[FeatureBuilder[I, O]](Apply(fbApply, featureName.tree :: Nil))

          reify(fbExpr.splice)
        }
      }
      // Read more in the sbt codebase - https://goo.gl/OdPvry
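      In short, the macro reads the name of the enclosing val at compile time (the same trick sbt uses for its setting keys, hence the link above) and splices it in as the feature name, so val sibsp = FeatureBuilder.Numeric[Passenger] expands to the explicit name = "sibsp" call from the first line.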
  35. Feature transformations with Implicit Classes
      // Create a new family size feature := siblings + spouses + parents + children + self
      val fsize = sibsp.transformWith(new BinaryNullableAndNumeric(_ + _), parch)
        .transformWith(new BinaryNumeric(_ + _), 1)

      val fsize = sibsp + parch + 1 // <-- can we just do this?
      // Sure thing!
      implicit class RichNullableNumericFeature[I <: NullableNumeric : TypeTag](val f: FeatureLike[I]) {

        def +[I2 <: Numeric : TypeTag](that: FeatureLike[I2]): FeatureLike[NullableNumeric] = {
          val plus = (a: Numeric, b: Numeric) => a + b
          val stage = new BinaryNullableAndNumeric(plus)

          f.transformWith[I2, NullableNumeric](f = that, stage)
        }

      }

      // Note: we use another Macro here as well to infer the name of the feature, i.e. "fsize"
  36. Features and Transformers
      val fsize: FeatureLike[Numeric] = sibsp + parch + 1
      (diagram) sibsp: Feature[Numeric] and parch: Feature[Numeric] are combined by a Binary Transformer (_ + _) into an intermediate _: Feature[Numeric], which a Binary Transformer (_ + 1) turns into fsize: Feature[Numeric].
      Sample data (survived, pclass, sex, age, sibsp, parch, embarked, fsize):
      0 | 3 | male   | 22 | 1 | 0 | S | 2
      1 | 1 | female | 38 | 1 | 0 | C | 2
      1 | 3 | female | 26 | 0 | 0 | S | 1
      The resulting transformation DAG of pipeline stages: sibsp, parch -> _ ; _, 1 -> fsize.
  37. Typed pipeline stages
      import org.apache.spark.ml.util.MLWritable
      import org.apache.spark.ml.PipelineStage

      // OP pipeline stages represent a feature transformation,
      // and also carry around the input and output features
      trait OpPipelineStageBase extends OpPipelineStageParams with MLWritable { self: PipelineStage =>
        type InputFeatures
        type OutputFeatures

        def setInput(features: InputFeatures): this.type
        def getOutput(): OutputFeatures

        // This method allows us to modify the DataFrame schema accordingly
        final override def transformSchema(schema: StructType): StructType = { ... }
      }

      // E.g. sibsp + parch: the Binary Transformer (_ + _) extends OpPipelineStage2,
      // with inputs (Feature[Numeric], Feature[Numeric]) and output Feature[Numeric]
  38. Typed pipeline stages
      // Stage providing a single feature[O]
      trait OpPipelineStage[O <: FeatureType] extends OpPipelineStageBase {
        type InputFeatures
        final override type OutputFeatures = FeatureLike[O]
      }
      // Stage from feature[I] to a feature[O]
      trait OpPipelineStage1[I <: FeatureType, O <: FeatureType] extends OpPipelineStage[O] {
        final override type InputFeatures = FeatureLike[I]
      }
      // Stage from a tuple of features to a feature[O]
      trait OpPipelineStage2[I1 <: FeatureType, I2 <: FeatureType, O <: FeatureType]
        extends OpPipelineStage[O] {
        final override type InputFeatures = (FeatureLike[I1], FeatureLike[I2])
      }
      // ...
      // And so on for various combinations: 1to2, ..., 1toN, 2to2, ..., Nto1, Nto2, ...
      // See Scala Product types - https://goo.gl/J3V5DP
  39. 1-to-1 Transformer example
      import org.apache.spark.ml.Transformer

      // A simple 1 to 1 transformer
      trait OpTransformer1[I <: FeatureType, O <: FeatureType]
        extends Transformer with OpPipelineStage1[I, O] {
        implicit def tti: TypeTag[I]
        implicit def tto: TypeTag[O]

        // User provided transform function that operates on input feature value I and produces O
        def transformFn: I => O

        // We wrap the transform function above into a UDF
        final override def transform(dataset: Dataset[_]): DataFrame = {
          val functionUDF = udf { (in: Any) => transformFn(FeatureTypes.as[I](in)).value }

          // Return a dataset with a new column
          dataset.withColumn(outputName, functionUDF(col(in1.name)))
        }
      }
      // ...
      // And so on for various combinations: 1to2, 1to3, ..., 1toN, ..., Nto1, Nto2, ...
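      A hypothetical concrete subclass of the trait above, to show how little a user has to write (a sketch of mine; it omits the uid/copy plumbing that Spark's Transformer also requires):

      // Lowercase a Text feature – transformFn is ordinary, unit-testable Scala
      class LowerCaseTransformer(implicit val tti: TypeTag[Text], val tto: TypeTag[Text])
        extends OpTransformer1[Text, Text] {
        def transformFn: Text => Text = t => new Text(t.value.map(_.toLowerCase))
      }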
  40. 1-to-1 Estimator example. Last code slide. Really.
      import org.apache.spark.ml.{Estimator, Model}

      // A simple 1 to 1 estimator which is trained into a model (transformer)
      class UnaryEstimator[I <: FeatureType : TypeTag, O <: FeatureType : TypeTag](
        val fitFn: Dataset[I#Value] => I => O
      ) extends Estimator[UnaryModel[I, O]] with OpPipelineStage1[I, O] {
        implicit val iEncoder: Encoder[I#Value] = ExpressionEncoder()

        final override def fit(dataset: Dataset[_]): UnaryModel[I, O] = {
          val df: DataFrame = dataset.select(in1.name)
          val ds: Dataset[I#Value] = df.map(r => FeatureTypes.as[I](r.get(0)).value) // needs encoder

          val transformFn: I => O = fitFn(ds) // fit function returns a transform function

          new UnaryModel[I, O](transformFn).setParent(this).setInput(in1)
        }
      }

      // Represents a trained model (transformer) from feature[I] to feature[O]
      class UnaryModel[I <: FeatureType, O <: FeatureType](val transformFn: I => O)
        (implicit val tti: TypeTag[I], val tto: TypeTag[O])
        extends Model[UnaryModel[I, O]] with OpTransformer1[I, O]
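      For illustration, a hypothetical mean-imputation estimator expressed through the fitFn signature above – roughly what age.fillMissingWithMean from slide 25 could desugar to (a sketch assuming Real wraps Option[Double] with a matching constructor, i.e. Real#Value = Option[Double], and the implicit Spark encoders in scope):

      val fillWithMean = new UnaryEstimator[Real, Real](
        fitFn = (ds: Dataset[Option[Double]]) => {
          val present = ds.flatMap(_.toSeq)                   // drop the missing values
          val mean = present.reduce(_ + _) / present.count()  // learn the state once
          (in: Real) => new Real(in.value.orElse(Some(mean))) // reuse it on every row
        }
      )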
  41. Model training (.reduce)
      val model = new OpPipeline[Passenger]().setInput(modelSelector).setReader(trainReader).train()
      Flow (diagram): the DataReader reads the data into RDD[Passenger]; Feature Extraction produces a DataFrame (age, sibsp, parch, embarked, …); a topological sort of the Transformation DAG (sibsp, parch -> _ ; _, 1 -> fsize) yields the stages for a Spark Pipeline (.setStages(stages).fit(trainData)), which is fitted into the Titanic Model (.transform(…)).
  42. Going forward with Optimus Prime. What is still missing?
      • Wrapping existing Spark ML transformers/estimators
      • Codegen for stage in-out combinations (1to1, …, 1toN, ..., Nto1, Nto2, ...)
      • Codegen for Macros (scary!)
      • Automatic feature engineering
      • Abstracting away from Spark with Apache Beam
      • More…
  43. Key takeaways
      • Real-life Machine Learning is hard
      • Spark ML is great, but it needs type safety
      • Simple and intuitive syntax saves you trouble down the road
      • Scala has all the relevant facilities to provide the above – know how to use them
      • Modularity and reusability are key
  44. Further exploration
      • Salesforce Einstein – http://einstein.com
      • “Democratizing AI to solve human bottlenecks” by Sarah Aerni, PhD – Scala Days, CPH 2017
      • PredictionIO – http://predictionio.incubator.apache.org
      • Optimus Prime – to be open-sourced (no ETA yet)
      • “Optimus Prime: declarative, collaborative, type-safe machine learning” by Shubha Nabar, PhD – https://goo.gl/hgWxJb
      • “The Lego Model for Machine Learning” by Leah McGuire, PhD – https://goo.gl/hmct4R
  45. If You’re Curious … einstein-recruiting@salesforce.com
  46. Thank You
