11. The Personalization Rainbow
[Architecture diagram: the end-to-end personalization stack, organized into Offline, Online, and Device layers, with Notebooks and Orchestration running alongside. Components include: Fact Store, Label Generation, Training Data Preparation, Feature Engineering, Training, Hyperparameter Optimization, Model Quality, Inference & Logging, Intent To Treat (Serving), Treatment & Action, Caching, Dynamic param management, A/B Testing Platform, Online & Precompute Framework, Personalization Aggregation, Fact Logging, Device Logging, Online Services API, and a Control Plane. Boson and Algo Commons sit beneath these as shared personalization systems & infrastructure.]
13. The Context for AlgoCommons & Boson
● Machine Learning via ‘loosely-coupled, highly-aligned’ Scala/Java Libraries
● Historical context
○ Siloed machine learning infrastructure
○ Few opportunities for sharing
■ Incompatibility
■ Dependency concerns
■ Improvements in one pipeline not shared across others
14. Design Principles
● Composability
○ Ability to put pieces together in novel ways
○ Enable construction of generic tools
● Portability
○ Easily share code online/offline and between applications
○ Models, Feature encoders, Common data manipulation
● Avoiding Training-Serving Skew
○ Serving/online systems are Java-based, which drives the choice of offline software
○ Share code & data between offline/online worlds
15. Boson & AlgoCommons across the training stack
[Diagram mapping the offline stages (Training, Feature Engineering, Model Quality, Inference & Logging) to library components.
Boson: Training API, Model Tuning, Delorean Time Travel, Feature Generation, Feature Transformers, Label Joins, Feature Schema, Stratification & Sampling, Data Fetchers & utilities, Spot Checks (human-in-the-loop), Visualization, Feature Importance, Validation Runs, Training Metrics.
AlgoCommons: Abstractions, Feature Sharing, Component Sets, Data Maps, Feature Encoders, Specification, Common Model Format (JSON), Metrics Framework, Predictions, Inferencing Metrics, Scoring, Model Loading.
Inferencing uses AlgoCommons & Boson; batch training runs over distributed Spark or Dockerized containers.]
17. AlgoCommons Overview
● Common abstractions and building blocks for ML
● Integrated in Java microservices for online or pre-computed inferencing
● Library > framework (user focus)
● Program to interfaces (composability)
● Aggressive modularization to avoid Jar Hell (portability)
● Data Access Abstraction (portability, testability)
18. Common abstractions and Building Blocks
● Data
○ Data Keys
○ Data Maps
● Modeling
○ Component Sets
○ Feature Encoders, Predictor, Scorer
○ Model Format
● Metrics
19. Data Access Abstractions
● DataKey<T>
○ Identifies a data value by name/type, e.g. "ViewingHistory"
● Data Value
○ Preferably an immutable data structure
● DataMap
○ Map from DataKey<T> to T, plus metadata
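A minimal sketch of how these abstractions fit together in plain Java. This is a hypothetical illustration of the idea, not the real AlgoCommons types; all names and signatures here are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the real AlgoCommons API.
// A DataKey identifies a data value by name and type.
final class DataKey<T> {
    final String name;
    final Class<T> type;
    DataKey(String name, Class<T> type) { this.name = name; this.type = type; }
}

// A DataMap maps DataKey<T> to T; the type parameter lets get() return
// the value with its static type, so consumers need no casting.
final class DataMap {
    private final Map<DataKey<?>, Object> values = new HashMap<>();
    <T> DataMap put(DataKey<T> key, T value) { values.put(key, value); return this; }
    <T> T get(DataKey<T> key) { return key.type.cast(values.get(key)); }
}
```

With this shape, `new DataMap().put(country, "US")` stores a value that a later `get(country)` recovers as a `String` without any cast at the call site.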
20. Data Access - Lifecycle
[Sequence diagram between Application, Component Factory, and Component:
1. Data Retrieval - the factory is asked "What DataKeys do you need?", answers "I need X, Y, and Z", and the application makes a DataMap with X, Y, and Z.
2. Component Instantiation / Data Prep - f.create(dataMap) runs new Component(X, Y, Z) and returns the component.
3. Component Application - comp.do(someInput), repeated as needed.]
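The lifecycle above can be sketched as follows, with `Map<String, ?>` standing in for a DataMap. All names here ("weight", "bias", the factory class) are illustrative assumptions, not AlgoCommons APIs.

```java
import java.util.List;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch of the factory lifecycle: declare keys, receive a
// DataMap holding those values, build the component once, then apply it.
final class LinearComponentFactory {
    // Step 1 (Data Retrieval): "What DataKeys do you need?" -> "weight" and "bias"
    static List<String> requiredKeys() { return List.of("weight", "bias"); }

    // Step 2 (Component Instantiation / Data Prep): f.create(dataMap)
    static DoubleUnaryOperator create(Map<String, ?> dataMap) {
        double w = (Double) dataMap.get("weight");
        double b = (Double) dataMap.get("bias");
        // Step 3 (Component Application): comp.do(someInput), repeat as needed
        return x -> w * x + b;
    }
}
```

The point of the split is that data fetching happens once, up front, and the constructed component is then a pure function that can be applied repeatedly.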
21. DataTransform
● DataMap ⇒ K/V
● Given zero or more key/values, produce a new key/value
● Consumable by other data transforms, feature encoders, and components
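A DataTransform of this shape might look like the sketch below (hypothetical names; `Map<String, ?>` stands in for a DataMap):

```java
import java.util.Map;

// Hypothetical sketch: a DataTransform derives a new key/value from zero
// or more existing key/values, for consumption by later transforms,
// feature encoders, and components.
interface DataTransform {
    String outputKey();
    Object apply(Map<String, ?> dataMap);
}

// Illustrative example: derive member tenure in days from two existing values.
final class TenureDaysTransform implements DataTransform {
    public String outputKey() { return "tenureDays"; }
    public Object apply(Map<String, ?> dataMap) {
        long signup = (Long) dataMap.get("signupEpochDay");
        long today = (Long) dataMap.get("todayEpochDay");
        return today - signup;
    }
}
```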
22. Feature Encoder
● DataMap ⇒ (T ⇒ FeatureSet)
● FeatureEncoder<T> create(DataMap)
○ Given a DataMap, initialize a new encoder doing any required data prep
● void encode(T, FeatureSet)
○ Given an item (say, a Video), encode features for it into the feature set
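The two-phase contract above, sketched as a one-hot genre encoder. This is a hypothetical illustration, not the real API: the class name, the "genreIndex" key, and the use of `Map<String, Double>` as a stand-in FeatureSet are all assumptions.

```java
import java.util.Map;

// Hypothetical sketch: create(DataMap) does any one-time data prep;
// encode(item, featureSet) is then called once per item.
final class GenreOneHotEncoder {
    private final Map<String, Integer> genreIndex;  // prepared once at create()

    private GenreOneHotEncoder(Map<String, Integer> genreIndex) {
        this.genreIndex = genreIndex;
    }

    @SuppressWarnings("unchecked")
    static GenreOneHotEncoder create(Map<String, ?> dataMap) {
        return new GenreOneHotEncoder((Map<String, Integer>) dataMap.get("genreIndex"));
    }

    // Given an item (say, a video's genre), encode features for it
    // into the feature set.
    void encode(String genre, Map<String, Double> featureSet) {
        Integer i = genreIndex.get(genre);
        if (i != null) featureSet.put("genre_" + i, 1.0);
    }
}
```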
23. Feature Transform
● Expression “language” for transforming features to produce new features
○ aka Feature Interactions
● Many operators available
○ log, outer/inner product, arithmetic, logic
● Expressions can be arbitrarily “stacked”
● Expressions are automatically DeDuped
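One way to picture stacked expressions with automatic de-duplication (a hypothetical sketch, not the AlgoCommons expression language): each node has a structural key, and evaluation caches by key, so structurally identical sub-expressions are computed once.

```java
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch: expression nodes with structural keys; a shared
// cache de-dupes evaluation of identical sub-expressions.
abstract class FeatureExpr {
    abstract String key();   // structural identity used for de-duping
    abstract double eval(Map<String, Double> features, Map<String, Double> cache);
}

final class Raw extends FeatureExpr {
    final String name;
    Raw(String name) { this.name = name; }
    String key() { return name; }
    double eval(Map<String, Double> f, Map<String, Double> cache) { return f.get(name); }
}

final class BinOp extends FeatureExpr {
    final String op; final FeatureExpr l, r; final DoubleBinaryOperator fn;
    BinOp(String op, FeatureExpr l, FeatureExpr r, DoubleBinaryOperator fn) {
        this.op = op; this.l = l; this.r = r; this.fn = fn;
    }
    String key() { return op + "(" + l.key() + "," + r.key() + ")"; }
    double eval(Map<String, Double> f, Map<String, Double> cache) {
        Double hit = cache.get(key());   // de-dupe: reuse a prior result
        if (hit != null) return hit;
        double v = fn.applyAsDouble(l.eval(f, cache), r.eval(f, cache));
        cache.put(key(), v);
        return v;
    }
}
```

With this shape, an outer product like (a+b)*(a+b) evaluates a+b only once, however deeply the expressions are stacked.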
24. Predictor
● Compute a score for a feature vector
● DataMap ⇒ (Vector ⇒ Double)
○ Predictor create(DataMap)
■ Given a data map, construct a new predictor
○ double predict(Vector)
■ Given a feature vector, compute a prediction/score
● Supports many Predictors:
○ LR, RegressionTree, TensorFlow, XGBoost, WeightedAdditiveEnsemble, FeatureWeighted, MultivariatePredictors, BanditPredictor, Sequence-to-sequence, ...
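A sketch of the Predictor contract with a logistic-regression-style implementation as the example. This is hypothetical: `double[]` stands in for the feature Vector, and the class and key names are assumptions, not the real API.

```java
import java.util.Map;

// Hypothetical sketch of DataMap => (Vector => Double).
interface Predictor {
    double predict(double[] vector);
}

final class LinearPredictor implements Predictor {
    private final double[] weights;
    private final double bias;

    LinearPredictor(double[] weights, double bias) {
        this.weights = weights; this.bias = bias;
    }

    // Predictor create(DataMap): given a data map, construct a new predictor.
    static Predictor create(Map<String, ?> dataMap) {
        return new LinearPredictor((double[]) dataMap.get("weights"),
                                   (Double) dataMap.get("bias"));
    }

    // double predict(Vector): given a feature vector, compute a score.
    public double predict(double[] v) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) z += weights[i] * v[i];
        return 1.0 / (1.0 + Math.exp(-z));   // sigmoid, for an LR-style score
    }
}
```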
25. Scorer
● Compute a score for business objects
● DataMap ⇒ (T ⇒ Double)
● Scorer<T> create(DataMap)
○ Given a data map, construct a new Scorer<T>.
● double score(T)
○ Given an item, compute a score
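Conceptually, a Scorer<T> is feature encoding composed with prediction, so callers score business objects directly. The sketch below is a hypothetical illustration of that composition, with plain functional interfaces standing in for the encoder and predictor.

```java
import java.util.function.DoubleUnaryOperator;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of DataMap => (T => Double) as a composition:
// encode the item to a feature value, then predict on it.
final class ComposedScorer<T> {
    private final ToDoubleFunction<T> encode;     // stand-in for a FeatureEncoder
    private final DoubleUnaryOperator predict;    // stand-in for a Predictor

    ComposedScorer(ToDoubleFunction<T> encode, DoubleUnaryOperator predict) {
        this.encode = encode; this.predict = predict;
    }

    // double score(T): given an item, compute a score.
    double score(T item) { return predict.applyAsDouble(encode.applyAsDouble(item)); }
}
```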
26. Extensible Model Definition
● Component abstraction
● JSON model serialization
● Various “views” of the Model
○ Feature gen
○ Prediction
○ Scoring
{
"@id" : "my-model",
"@schema" : "SimpleFeatureScoringModel",
"dataTransforms" : [ ... data transforms ...],
"featureEncoders" : [ ... feature defs ...],
"featureTransform" : { ... feature interactions ... },
"predictor" : { ... ML model (weights, etc.) ... }
}
31. Boson Overview
● A high-level Scala API for ML exploration
● Focuses on Offline Training for both
○ Ad-hoc exploration
○ Production Training
● Think "a subset of scikit-learn" for the Scala/JVM ecosystem
● Spark's DataFrame is a core data abstraction
32. Data Utilities
● Utilities for data transfer between heterogeneous systems
● Leverage Spark for data munging, but need a bridge to Docker trainers
○ Use standalone s3 downloader and parquet reader
○ S3 + s3fs-fuse
○ HDFS + hdfs-fuse
● On-the-wire formats
○ Parquet
○ Protobuf
33. Feature Schema
● Context: the setting for evaluating a set of items (member profiles, country, etc.)
● Items: the elements to be trained on and scored (videos, rows, etc.)
34. Stratification
dataframe.stratify(samplingRules =
  $("column_foo") == 'US' maxPercent 8.0,
  $("column_bar") > 10 && $("column_qux") > 1 minPercent 0.5,
  ...
)
A generalized API on Spark DataFrames
Native SparkSQL expressions
Emphasis on type-safety
Many stratification attributes: Country, Devices, Searches, ...
35. Feature Transformers
The feature generation pipeline is a sequence of Transformers. A Transformer takes a DataFrame and, based on the contexts, performs computations and returns a new DataFrame.
Dataset Type Tagger → Country Tenure Stratified Sampler → Negative Generator → ...
36. Feature Generation - Putting it together
[Pipeline diagram: structured data from the Fact Store lands in a Spark DataFrame; Catalyst expressions extract the required data and label data into DataMaps (maps of data POJOs); AlgoCommons Feature Encoders consume the required data to produce features; features and labels are assembled into structured labeled features, which feed model training together with the feature model.]
37. Training
● Need flexibility and access to trainers in all languages/environments
● A simple unified Training API for
○ Synchronous & Asynchronous
○ Single Docker or Distributed (Spark)
● Inputs: a training set as a Spark Dataset, plus model params
● Returns: a Model abstraction wrapping an AlgoCommons PredictorConfig
● Can support many popular trainers [image: logos of learning tools]
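A unified sync/async training API of the kind described above might look like this sketch. Everything here is a hypothetical stand-in: `TrainingSet`, `Model`, and the trivial mean-label trainer are illustrative, not the real Boson/AlgoCommons types.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins for the real abstractions.
record TrainingSet(double[][] features, double[] labels) {}
record Model(double[] params) {}

// One interface covers both synchronous and asynchronous training;
// async wraps the sync path, so trainers only implement train().
interface Trainer<P> {
    Model train(TrainingSet data, P params);                              // synchronous

    default CompletableFuture<Model> trainAsync(TrainingSet data, P params) {
        return CompletableFuture.supplyAsync(() -> train(data, params));  // asynchronous
    }
}

// Trivial example trainer: a constant model predicting the mean label.
final class MeanTrainer implements Trainer<Void> {
    public Model train(TrainingSet data, Void params) {
        double sum = 0;
        for (double y : data.labels()) sum += y;
        return new Model(new double[]{ sum / data.labels().length });
    }
}
```

The design point being sketched: whether training runs in a single Docker container or distributed on Spark, callers see the same `train`/`trainAsync` surface and get back a Model abstraction.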
38. Metrics
● Leverages AlgoCommons Metrics framework
● Context Level Metrics
○ Supports ranking metrics: nMRR, Recall, nDCG, etc.
○ Supports algo-commons models or custom scoring functions
○ Users can slice and dice the metrics
○ Users can aggregate them using SQL
■ Performant implementation using Spark SQL catalyst expressions
● Item Level Metrics
○ E.g. row popularity
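As a concrete instance of a context-level ranking metric, here is a sketch of Recall@k. This is a generic textbook formulation for illustration, not the AlgoCommons metrics framework.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of one ranking metric from the family above.
final class RankingMetrics {
    // Recall@k: the fraction of relevant items that appear in the
    // top k positions of a ranked list.
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) return 0.0;
        long hits = ranked.stream().limit(k).filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }
}
```

A per-context value like this is what would then be sliced, diced, and aggregated with SQL across contexts.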
40. Lessons learnt
● Machine learning is an iterative and data sensitive process
○ Make exploration easy, and productionizing robust
○ Make it easy to switch between the two
● Design components with a general flexible interface
○ Specialize interfaces when you need to
● Testing can be hard, but worthwhile
○ Unit, Integration, Data Checks, Continuous Integration, @ScaleTesting
○ Metric driven system validations
41. The Personalization Rainbow (recap)
[Repeat of the architecture diagram from slide 11: the Offline/Online/Device personalization stack with Notebooks and Orchestration alongside, and Boson and Algo Commons as the shared systems & infrastructure layer.]