TransmogrifAI - Automate Machine Learning Workflow with the power of Scala and Spark at massive scale.

By: Chetan Khatri (@khatri_chetan)
Scala.IO Conference, École Supérieure de Chimie Physique Électronique de Lyon, France

About me
Lead - Data Science @ Accion labs India Pvt. Ltd.
Open Source Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors.
Co-Authored University Curriculum @ University of Kachchh, India.
Data Engineering @ Nazara Games, Eccella Corporation.
Advisor - Data Science Lab, University of Kachchh, India.
M.Sc. - Computer Science from University of Kachchh, India.

Agenda
● What is TransmogrifAI ?
● Why you need TransmogrifAI ?
● Automation of the Machine Learning life cycle - from development to deployment.
○ Feature Inference
○ Transformation
○ Automated Feature validation
○ Automated Model Selection
○ Hyperparameter Optimization
● Type Safety in Spark and TransmogrifAI.
● Example: Code - Titanic Kaggle problem.

What is TransmogrifAI ?
● TransmogrifAI was open sourced by Salesforce.com, Inc. in June 2018.
● An end-to-end automated machine learning workflow library for structured data, built on top of Scala and SparkML.

What is TransmogrifAI ?
● TransmogrifAI automates much of the machine learning model life cycle: Feature Selection, Transformation, Automated Feature Validation, Automated Model Selection, and Hyperparameter Optimization.
● It enforces compile-time type safety, modularity, and reuse.
● Through automation, it achieves accuracies close to hand-tuned models with an almost 100x reduction in time.

Why you need TransmogrifAI ?
Features
● AUTOMATION: Numerous Transformers and Estimators.
● MODULARITY AND REUSE: Enforces a strict separation between ML workflow definitions and data manipulation.
● COMPILE TIME TYPE SAFETY: Workflows are strongly typed, with code completion during development and fewer runtime errors.
● TRANSPARENCY: Model insights leverage stored feature metadata and lineage to help debug models.

Why you need TransmogrifAI ?
Use TransmogrifAI if you need a machine learning library to:
● Build production ready machine learning applications in hours, not months
● Build machine learning models without getting a Ph.D. in machine learning
● Build modular, reusable, strongly typed machine learning workflows
Read more in the documentation: https://transmogrif.ai/

Why Machine Learning is hard ?! Really! ...
For example, this may be using a linear classifier when your true decision boundaries are non-linear.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
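
The linear-classifier pitfall can be illustrated with a small sketch. It is not taken from the talk or the referenced blog post: it assumes a hand-written perceptron and the XOR truth table as toy data, and every function name is hypothetical. A purely linear model cannot separate XOR, while adding one non-linear feature (the product of the two inputs) makes the problem separable.

```python
# Sketch (assumption: toy perceptron, not a TransmogrifAI API).

def predict(weights, bias, x):
    """Linear decision rule: 1 if w.x + b > 0, else 0."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0

def train_perceptron(rows, labels, epochs=1000):
    """Classic perceptron updates; returns (weights, bias)."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = y - predict(weights, bias, x)
            weights = [w + err * v for w, v in zip(weights, x)]
            bias += err
    return weights, bias

def accuracy(model, rows, labels):
    weights, bias = model
    hits = sum(predict(weights, bias, x) == y for x, y in zip(rows, labels))
    return hits / len(rows)

xor_rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]

# Linear model on the raw inputs: the true boundary is non-linear.
linear_acc = accuracy(train_perceptron(xor_rows, xor_labels), xor_rows, xor_labels)

# Same model after adding the interaction feature a * b.
expanded_rows = [(a, b, a * b) for a, b in xor_rows]
expanded_acc = accuracy(train_perceptron(expanded_rows, xor_labels), expanded_rows, xor_labels)
```

The first model is stuck below perfect accuracy no matter how long it trains, which is the kind of failure that looks like a bug but is really a modeling choice.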

Why Machine Learning is hard ?! Really! ...
Fast and effective debugging is the skill that is most required for implementing modern day machine learning pipelines.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html

Real time Machine Learning takes time to Productionize
TransmogrifAI automates the entire ML life cycle to accelerate developers' productivity.

Under the Hood
Automated Feature Engineering
Automated Feature Selection
Automated Model Selection

Automatic derivation of new features based on existing features.
Input features: Email, Phone, Age, Subject, Zip Code, DOB, Gender
● Email -> Email is Spam
● Phone -> Country Code
● Age -> bins [0-20], [21-30], [> 30]
● Subject -> Stop words, Top terms (TF-IDF), Detect Language
● Zip Code -> Average Income, House Price, School Quality, Shopping, Transportation
● Gender -> To Binary
● DOB -> Age, Day of Week, Week of Year, Quarter, Month, Year, Hour
All derived features are combined into a single Feature Vector.
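
As a rough illustration of deriving new features from an existing one, the sketch below expands a date of birth into calendar features and an age bucket. It uses only the Python standard library; the function names, the sample date, and the reference date are assumptions made for the example and do not reflect how TransmogrifAI implements this internally.

```python
from datetime import date

def age_bucket(age):
    """Bin an age into the ranges shown on the slide."""
    if age <= 20:
        return "[0-20]"
    if age <= 30:
        return "[21-30]"
    return "[> 30]"

def date_features(dob, today):
    """Derive several features from a single date-of-birth value."""
    # Subtract one year if the birthday has not happened yet this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return {
        "age": age,
        "age_bucket": age_bucket(age),
        "day_of_week": dob.isoweekday(),        # 1 = Monday ... 7 = Sunday
        "week_of_year": dob.isocalendar()[1],   # ISO week number
        "quarter": (dob.month - 1) // 3 + 1,
        "month": dob.month,
        "year": dob.year,
    }

# Hypothetical record and reference date.
features = date_features(date(1990, 5, 17), today=date(2018, 10, 30))
```

One raw column becomes seven model inputs here, which is the kind of expansion an AutoML library can apply per feature type without hand-written code.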

● Analyze every feature column and compute descriptive statistics.
○ Number of nulls, number of empty strings, mean, max, min, standard deviation.
● Handle missing values / noisy values.
○ Ex. fill missing values (fillna) with the mean, median, or nearby values.
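import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

# fill every missing value with a sentinel
patient_details = patient_details.fillna(-1)
# fill a missing category with a placeholder level
data['City_Type'] = data['City_Type'].fillna('Z')
# impute missing values with the column median
imp = SimpleImputer(missing_values=np.nan, strategy='median', copy=False)
data_total_imputed = imp.fit_transform(data_total)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.nan)
# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)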
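
The manual steps above (profile each column, then impute) can be sketched without any library to show what an automated pass has to compute. This is a minimal illustration under stated assumptions: the column values are made up, missing values are represented as None or an empty string, and the standard deviation is the population form.

```python
import math

def describe(values):
    """Descriptive statistics for one feature column."""
    present = [float(v) for v in values if v is not None and v != ""]
    mean = sum(present) / len(present)
    variance = sum((v - mean) ** 2 for v in present) / len(present)
    return {
        "nulls": sum(v is None for v in values),
        "empty_strings": sum(v == "" for v in values),
        "mean": mean,
        "min": min(present),
        "max": max(present),
        "std": math.sqrt(variance),
    }

def impute_mean(values):
    """Replace missing entries with the mean of the present entries."""
    mean = describe(values)["mean"]
    return [mean if v is None or v == "" else float(v) for v in values]

# Hypothetical age column with one null and one empty string.
ages = [22, None, 38, "", 26, 34]
stats = describe(ages)
filled = impute_mean(ages)
```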

● Do the features have acceptable ranges / do they contain valid values ?
● Could the feature be a leaker ?
○ Is it usually filled out after the predicted field is ?
○ Is it highly correlated with the predicted field ?
● Does the feature contain outliers ?
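
One of these checks, flagging features that are suspiciously correlated with the predicted field, can be sketched as follows. The 0.95 threshold, the feature names, and the data are assumptions chosen for illustration; a real validator would combine several signals rather than rely on Pearson correlation alone.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0

def flag_leakers(features, label, threshold=0.95):
    """Return names of features whose |correlation| with the label is too high."""
    return [name for name, column in features.items()
            if abs(pearson(column, label)) >= threshold]

label = [0, 1, 0, 1, 1, 0]
features = {
    "age": [22, 38, 26, 35, 54, 2],
    # Hypothetical field that is only filled in after the outcome is known.
    "refund_issued": [0, 1, 0, 1, 1, 0],
}
suspects = flag_leakers(features, label)
```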

Automated Feature Selection / Data Pre-processing
● Data Type of Features, Automatic Data Pre-processing.
○ MinMaxScaler
○ Normalizer
○ Binarizer
○ Label Encoding
○ One Hot Encoding
● Auto Data Pre-Processing based on chosen ML Model.
● Algorithms like XGBoost specifically require dummy-encoded data, while algorithms like decision trees don't seem to care at all (sometimes)!
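
Two of the pre-processing steps listed above, min-max scaling and one-hot (dummy) encoding, are small enough to sketch directly. The helper names and sample values are hypothetical; libraries such as Spark ML and scikit-learn ship their own implementations.

```python
def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Return the sorted category list and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

scaled = min_max_scale([10, 20, 40])
categories, encoded = one_hot(["S", "C", "S", "Q"])
```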

Auto Data Pre-processing
● Numeric - Imputation, Track Null Value, Log Transformation for large range,
Scaling, Smart Binning.
● Categorical - Imputation, Track Null Value, One Hot Encoding / Dummy
Encoding, Dynamic Top K Pivot, Smart Binning, Label Encoding, Category
Embedding.
● Text - Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis,
Language Detection.
● Temporal - Time Difference, Circular Statistics, Time Extraction (Day, Week,
Month, Year).
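
For the temporal case, a common form of circular encoding maps a cyclic value such as the hour of day onto the unit circle, so that 23:00 and 00:00 end up close together. The sketch below is an illustration of that idea with hypothetical function names; it is not claimed to be the exact transformation TransmogrifAI applies.

```python
import math

def circular_encode(value, period):
    """Map a cyclic value (e.g. hour 0-23) to a point on the unit circle."""
    angle = 2 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def circular_distance(a, b, period):
    """Euclidean distance between the encoded points of a and b."""
    sin_a, cos_a = circular_encode(a, period)
    sin_b, cos_b = circular_encode(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

late_vs_midnight = circular_distance(23, 0, 24)
late_vs_noon = circular_distance(23, 12, 24)
```

With the raw integer hour, 23 and 0 look maximally far apart; after encoding they are as close as any other pair of adjacent hours.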

Auto Selection of Best Model with Hyper Parameter Tuning
● Machine Learning Model
○ Learning Rate
○ Epochs
○ Batch Size
○ Optimizer
○ Activation Function
○ Loss Function
● Search Algorithms to find best model and optimal hyper parameters.
○ Ex. Grid Search, Random Search, Bandit Methods
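
Grid search and random search can be sketched in a few lines. The scoring function below is a made-up stand-in for "train a model and evaluate it on validation data" (its peak at learning rate 0.1 and batch size 64 is an assumption), and random search is simplified to sample from the same discrete lists rather than from continuous distributions.

```python
import itertools
import random

def validation_score(learning_rate, batch_size):
    """Hypothetical stand-in for training plus validation; higher is better."""
    return 1.0 - (learning_rate - 0.1) ** 2 - ((batch_size - 64) / 256) ** 2

grid = {"learning_rate": [0.01, 0.1, 0.5], "batch_size": [32, 64, 128]}

def grid_search(grid):
    """Evaluate every combination of hyperparameter values."""
    names = sorted(grid)
    candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*(grid[n] for n in names))]
    return max(candidates, key=lambda params: validation_score(**params))

def random_search(grid, trials, seed=0):
    """Evaluate a fixed budget of randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in grid.items()}
                  for _ in range(trials)]
    return max(candidates, key=lambda params: validation_score(**params))

best_grid = grid_search(grid)
best_random = random_search(grid, trials=5)
```

Grid search costs one evaluation per combination (9 here), while random search trades completeness for a fixed budget.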

Examples - Hyper parameter tuning
XGBoost:
import numpy as np
import xgboost as xgb

# hand-picked hyperparameters; 'verbosity' replaces the deprecated 'silent' flag
params = {'booster':'gbtree', 'objective':'binary:logistic', 'max_depth':9, 'eval_metric':'logloss',
          'eta':0.02, 'verbosity':0, 'nthread':4, 'subsample': 0.9, 'colsample_bytree':0.9,
          'scale_pos_weight':1, 'min_child_weight':3, 'max_delta_step':3}
num_rounds = 400
params['seed'] = 523264626346 # 0.85533
# train on the labelled data, treating NaN as missing
dtrain = xgb.DMatrix(train, labels, missing=np.nan)
clf = xgb.train(params, dtrain, num_rounds)
# predict on the test set
dtest = xgb.DMatrix(test, missing=np.nan)
test_preds = clf.predict(dtest)

Examples - Hyper parameter tuning
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

rf = RandomForestClassifier(n_estimators=int(n_tree), max_features=int(mtry),
                            criterion=criterion, max_depth=max_depth, n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)
# GradientBoostingClassifier has no n_jobs parameter
gbm = GradientBoostingClassifier(n_estimators=n_tree, max_features=max_features,
                                 subsample=subsample)
ext = ExtraTreesClassifier(n_estimators=n_tree, criterion=criterion,
                           max_features=max_features, n_jobs=-1)
# 10-fold stratified cross-validation, scored by F1
cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(rf, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')
# exhaustive grid search over the random forest hyperparameters
param_grid = {'n_estimators': [500, 1000, 1500, 2000, 2500, 3000],
              'criterion': ['gini', 'entropy'],
              'max_features': [15, 20, 25, 30],
              'max_depth': [4, 5, 6]}
gs_cv = GridSearchCV(rf, param_grid, scoring='f1', n_jobs=-1, verbose=2).fit(X_train[subspace], y_train)
gs_cv.best_params_

Ensemble Modeling
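from scipy.stats import rankdata

# collect each model's predictions for the 'Disbursed' target
ens['XGB1'] = xgb1_pred['Disbursed']  # assumed: this assignment is missing from the slide
ens['XGB2'] = xgb2_pred['Disbursed']
ens['RF'] = rf_pred['Disbursed']
ens['FTRL'] = ftrl_pred['Disbursed']
# convert predictions to ranks so models with different scales can be blended
ens['XGB1_Rank'] = rankdata(ens['XGB1'], method='min')
ens['XGB2_Rank'] = rankdata(ens['XGB2'], method='min')
ens['XGB_Rank'] = 0.5 * ens['XGB1_Rank'] + 0.5 * ens['XGB2_Rank']
ens['RF_Rank'] = rankdata(ens['RF'], method='min')
ens['FTRL_Rank'] = rankdata(ens['FTRL'], method='min')
# weighted blend of ranks (FTRL_Rank used here so that ranks are blended with ranks)
ens['Final'] = (0.75*ens['XGB_Rank'] + 0.25*ens['RF_Rank']) * 0.75 + 0.25 * ens['FTRL_Rank']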
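
The rank-averaging idea behind this ensemble can be shown on a tiny example without pandas or SciPy. The predictions and weights below are made up, and `rank_min` is a hypothetical helper that mimics ranking where tied values share the lowest rank.

```python
def rank_min(values):
    """Rank values from 1 upward; tied values share the lowest rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def blend(rank_lists, weights):
    """Weighted average of several rank lists, row by row."""
    n_rows = len(rank_lists[0])
    return [sum(w * ranks[i] for w, ranks in zip(weights, rank_lists))
            for i in range(n_rows)]

# Hypothetical predictions from two models for four rows.
xgb_pred = [0.10, 0.80, 0.80, 0.30]
rf_pred = [0.20, 0.60, 0.90, 0.40]

xgb_rank = rank_min(xgb_pred)
rf_rank = rank_min(rf_pred)
final = blend([xgb_rank, rf_rank], [0.75, 0.25])
```

Blending ranks instead of raw scores keeps one model with a wider output range from dominating the ensemble.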

Type Safety: Integration with Apache Spark and Scala
● Modular, reusable, strongly typed machine learning workflows on top of Apache Spark.
● Type Safety in Apache Spark with the Dataset API.

Structured Data in Apache Spark
Structured in Spark:
● DataFrames
● Datasets

Unification of APIs in Apache Spark 2.0
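● Before Spark 2.0: DataFrame (Untyped API) and Dataset (Typed API) were separate APIs.
● Spark 2.0 (2016): both unified under Dataset.
○ Untyped API: DataFrame = Dataset[Row] (alias)
○ Typed API: Dataset[T]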

Why Dataset ?
● Strong typing.
● Ability to use powerful lambda functions.
● Spark SQL's optimized execution engine (Catalyst, Tungsten).
● Can be constructed from JVM objects and manipulated using functional transformations (map, filter, flatMap, etc.).
● A DataFrame is a Dataset organized into named columns.
● DataFrame is simply a type alias of Dataset[Row].

DataFrame API Code
import org.apache.spark.sql.functions.sum
import spark.implicits._ // enables toDF and the $"column" syntax

// convert RDD -> DF with column names
val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
// filter, groupBy, sum, and then agg()
parsedDF.filter($"project" === "finance")
  .groupBy($"sprint")
  .agg(sum($"numStories").as("count"))
  .limit(100)
  .show(100)
project sprint numStories
finance 3 20
finance 4 22

DataFrame -> SQL View -> SQL Query
parsedDF.createOrReplaceTempView("audits")
val results = spark.sql(
"""SELECT sprint, sum(numStories)
AS count FROM audits WHERE project = 'finance' GROUP BY sprint
LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22

Why Structured APIs ?
// DataFrame
data.groupBy("dept").avg("age")
// SQL
select dept, avg(age) from data group by 1
// RDD
data.map { case (dept, age) => dept -> (age, 1) }
.reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
.map { case (dept, (age, c)) => dept -> age / c }
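
The RDD version has to spell out how the average is computed: carry a (sum, count) pair per key, merge the pairs, then divide. A plain Python sketch of the same logic, with made-up data and a hypothetical function name, shows the bookkeeping that the one-line DataFrame and SQL versions leave to the engine.

```python
def average_by_key(pairs):
    """Average the values per key by accumulating (sum, count) pairs."""
    totals = {}
    for dept, age in pairs:
        running_sum, count = totals.get(dept, (0, 0))
        totals[dept] = (running_sum + age, count + 1)
    return {dept: running_sum / count
            for dept, (running_sum, count) in totals.items()}

# Hypothetical (dept, age) records.
data = [("eng", 30), ("eng", 40), ("ops", 50)]
avg_age = average_by_key(data)
```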

Catalyst in Spark
SQL AST / DataFrame / Datasets -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDD
Dataset API in Spark 2.x
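import org.apache.spark.sql.Dataset
import spark.implicits._ // provides the Encoder needed by as[Employee]

val employeesDF = spark.read.json("employees.json")
// Convert data to domain objects.
// JSON integers are inferred as Long, so age is declared as Long rather than Int.
case class Employee(name: String, age: Long)
val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
val filterDS = employeesDS.filter(p => p.age > 3)
Type-safe: operate on domain objects with compiled lambda functions.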

Structured APIs in Apache Spark
                  SQL       DataFrames     Datasets
Syntax Errors     Runtime   Compile Time   Compile Time
Analysis Errors   Runtime   Runtime        Compile Time
Analysis errors are caught before a job runs on the cluster.

Spark SQL API - Analysis Error example.

TransmogrifAI - Type Safety is Everywhere!
● Value operations
● Feature operations
● Transformation Pipelines (aka Workflows)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList
// Typed feature operations
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)
// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())

Example Code
Ref. https://github.com/fosscoder/transmogrifai-demo

A Case Story - Functional Flow - Spark as a SaaS
● User Interface
● Build workflow
○ Source
○ Target
○ Transformations
○ filter
○ aggregation
○ Joins
○ Expressions
○ Machine Learning Algorithms
● Store metadata of the workflow in a document-based NoSQL store, ex. MongoDB (ReactiveMongo)
● Scala / Spark job reads the metadata from NoSQL, ex. MongoDB
● Run on the cluster
● Schedule using Airflow SparkSubmit Operator

A Case Story - High Level Technical Architecture - Spark as a SaaS
● User Interface
● Middleware: Akka HTTP Web Services

Questions ?
Thank you!
Big Thanks to Scala.IO Organizers and Scala France Community!
@khatri_chetan
chetan.khatri@live.com

References
[1] TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark from Salesforce Engineering
[online] https://transmogrif.ai
[2] A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator
[online] https://github.com/rssanders3/airflow-spark-operator-plugin
[3] Apache Spark - Unified Analytics Engine for Big Data
[online] https://spark.apache.org/
[4] Apache Livy
[online] https://livy.incubator.apache.org/
[5] Zayd's Blog - Why is machine learning 'hard'?
[online] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
[6] Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
[online] https://www.youtube.com/watch?v=uMapcWtzwyA&t=106s
[7] Auto-Machine Learning: The Magic Behind Einstein
[online] https://www.youtube.com/watch?v=YDw1GieW4cw&t=564s
