TransmogrifAI
Automate Machine Learning Workflow with the power of Scala and
Spark at massive scale.
By: Chetan Khatri (@khatri_chetan)
Scala.IO Conference, École Supérieure de Chimie Physique Électronique de Lyon, France
About me
Lead - Data Science @ Accion labs India Pvt. Ltd.
Open Source Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark
HBase Connectors.
Co-Authored University Curriculum @ University of Kachchh, India.
Data Engineering @: Nazara Games, Eccella Corporation.
Advisor - Data Science Lab, University of Kachchh, India.
M.Sc. - Computer Science from University of Kachchh, India.
Agenda
● What is TransmogrifAI?
● Why do you need TransmogrifAI?
● Automation of the machine learning life cycle, from development to deployment.
○ Feature Inference
○ Transformation
○ Automated Feature Validation
○ Automated Model Selection
○ Hyperparameter Optimization
● Type safety in Spark and TransmogrifAI.
● Example code: Titanic Kaggle problem.
What is TransmogrifAI?
● TransmogrifAI was open sourced by Salesforce.com, Inc. in June 2018.
● An end-to-end automated machine learning workflow library for structured data, built on top of Scala and SparkML.
What is TransmogrifAI?
● TransmogrifAI helps automate much of the machine learning model life cycle: feature selection, transformation, automated feature validation, automated model selection, and hyperparameter optimization.
● It enforces compile-time type safety, modularity, and reuse.
● Through automation, it achieves accuracies close to hand-tuned models with an almost 100x reduction in time.
Why do you need TransmogrifAI?
Features
● AUTOMATION: Numerous Transformers and Estimators.
● MODULARITY AND REUSE: Enforces a strict separation between ML workflow definitions and data manipulation.
● COMPILE-TIME TYPE SAFETY: Workflows are strongly typed, which gives code completion during development and fewer runtime errors.
● TRANSPARENCY: Model insights leverage stored feature metadata and lineage to help debug models.
Why do you need TransmogrifAI?
Use TransmogrifAI if you need a machine learning library to:
● Build production-ready machine learning applications in hours, not months
● Build machine learning models without getting a Ph.D. in machine learning
● Build modular, reusable, strongly typed machine learning workflows
Read more in the documentation: https://transmogrif.ai/
Why is Machine Learning hard?! Really! ...
For example, this may be using a linear classifier when your true decision boundaries are non-linear.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
Why is Machine Learning hard?! Really! ...
Fast and effective debugging is the skill that is most required for implementing modern-day machine learning pipelines.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
Real-time Machine Learning takes time to productionize
TransmogrifAI automates the entire ML life cycle to accelerate developer productivity.
Under the Hood
Automated Feature Engineering
Automated Feature Selection
Automated Model Selection
Automated Feature Engineering
Automatic Derivation of new features based on existing features.
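[Diagram: raw features are expanded into derived features and combined into a Feature Vector.]
Raw features: Email, Phone, Age, Subject, Zip Code, DOB, Gender
Derived features shown: Email is Spam; Country Code; age bins [0-20], [21-30], [> 30]; Stop words; Top terms (TF-IDF); Detect Language; Average Income; House Price; School Quality; Shopping; Transportation; To Binary; Age; Day of Week; Week of Year; Quarter; Month; Year; Hour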
Automated Feature Engineering
● Analyze every feature column and compute descriptive statistics.
○ Number of nulls, number of empty strings, mean, max, min, standard deviation.
● Handle missing values / noisy values.
○ Ex. fillna by mean (average) / nearby values.
# Typical manual imputation snippets (pandas / scikit-learn)
import numpy as np
from sklearn.impute import SimpleImputer
# fill missing values with a sentinel value
patient_details = patient_details.fillna(-1)
data['City_Type'] = data['City_Type'].fillna('Z')
# replace missing values with the column median
imp = SimpleImputer(missing_values=np.nan, strategy='median', copy=False)
data_total_imputed = imp.fit_transform(data_total)
# mark zero values as missing (NaN)
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.nan)
# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)
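The two steps above (profile each column, then fill the gaps) can be sketched without any library. This is a minimal plain-Python illustration, not TransmogrifAI's implementation; the function names `describe` and `impute_mean` and the sample `ages` column are assumptions made for the example.

```python
import math

def describe(column):
    """Summarise one raw column: null/empty counts plus numeric statistics."""
    nulls = sum(1 for v in column if v is None)
    empties = sum(1 for v in column if v == "")
    numbers = [v for v in column if isinstance(v, (int, float))]
    mean = sum(numbers) / len(numbers) if numbers else None
    # Population standard deviation; a library may use the sample version instead.
    std = (math.sqrt(sum((v - mean) ** 2 for v in numbers) / len(numbers))
           if numbers else None)
    return {
        "nulls": nulls,
        "empty_strings": empties,
        "mean": mean,
        "min": min(numbers) if numbers else None,
        "max": max(numbers) if numbers else None,
        "std": std,
    }

def impute_mean(column):
    """Replace None with the column mean and record where values were missing."""
    mean = describe(column)["mean"]
    filled = [mean if v is None else v for v in column]
    was_null = [v is None for v in column]  # "track null value" indicator
    return filled, was_null

# Hypothetical sample column with two missing entries.
ages = [22, None, 38, 26, None, 34]
stats = describe(ages)
filled, was_null = impute_mean(ages)
print(stats)
print(filled)
```

Keeping the `was_null` indicator next to the filled values lets a model still learn from the fact that a value was missing.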
Automated Feature Engineering
● Do features have acceptable ranges? Do they contain valid values?
● Could the feature be a leaker?
○ Is it usually filled out after the predicted field is?
○ Is it highly correlated with the predicted field?
● Does the feature contain outliers?
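One of the checks above, "is it highly correlated with the predicted field?", can be sketched as a simple Pearson-correlation screen. This is an illustrative plain-Python sketch only; the 0.95 threshold, the function names, and the sample columns are assumptions, and real leakage detection typically uses more signals than a single correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return cov / math.sqrt(vx * vy)

def flag_leakers(features, label, max_abs_corr=0.95):
    """Return names of features whose |correlation| with the label is suspiciously high."""
    return sorted(name for name, values in features.items()
                  if abs(pearson(values, label)) > max_abs_corr)

# Hypothetical data: "refund_issued" is only filled in after the outcome is known.
label = [0, 1, 0, 1, 1, 0]
features = {
    "age": [22, 38, 26, 35, 54, 2],
    "refund_issued": [0, 1, 0, 1, 1, 0],
}
suspects = flag_leakers(features, label)
print(suspects)
```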
Automated Feature Selection / Data Pre-processing
● Data Type of Features, Automatic Data Pre-processing.
○ MinMaxScaler
○ Normalizer
○ Binarizer
○ Label Encoding
○ One Hot Encoding
● Auto data pre-processing based on the chosen ML model.
● An algorithm like XGBoost specifically requires dummy-encoded data, while an algorithm like a decision tree doesn't seem to care at all (sometimes)!
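Two of the pre-processing steps listed above, min-max scaling and one-hot (dummy) encoding, are small enough to sketch by hand. The plain-Python functions below are illustrative stand-ins with assumed names and sample data, not the scikit-learn or TransmogrifAI implementations.

```python
def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Turn a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical numeric and categorical columns.
fares = [10.0, 20.0, 40.0]
ports = ["S", "C", "S"]
scaled = min_max_scale(fares)
categories, encoded = one_hot(ports)
print(scaled)
print(categories, encoded)
```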
Auto Data Pre-processing
● Numeric - Imputation, Track Null Value, Log Transformation for large range,
Scaling, Smart Binning.
● Categorical - Imputation, Track Null Value, One Hot Encoding / Dummy
Encoding, Dynamic Top K Pivot, Smart Binning, Label Encoding, Category
Embedding.
● Text - Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis,
Language Detection.
● Temporal - Time Difference, Circular Statistics, Time Extraction (Day, Week, Month, Year).
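For the temporal case, time extraction and a circular encoding can be sketched as follows. This is an assumption-laden illustration in plain Python: the function names and the sample timestamp are made up for the example, and the sin/cos encoding is one common way to represent cyclic values such as the hour of day.

```python
import math
from datetime import datetime

def extract_time_parts(ts):
    """Pull calendar components out of a timestamp."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_week": ts.isoweekday(),       # 1 = Monday ... 7 = Sunday
        "week_of_year": ts.isocalendar()[1],  # ISO week number
        "hour": ts.hour,
    }

def circular_encode(value, period):
    """Map a cyclic value onto the unit circle so that period-1 sits next to 0."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Hypothetical event timestamp.
ts = datetime(2018, 10, 29, 23, 0)
parts = extract_time_parts(ts)
late = circular_encode(23, 24)
midnight = circular_encode(0, 24)
noon = circular_encode(12, 24)
print(parts)
print(math.dist(late, midnight), math.dist(noon, midnight))
```

With the raw hour, 23:00 and 00:00 look 23 units apart; on the circle they are neighbours, which is the point of the encoding.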
Auto Selection of Best Model with Hyperparameter Tuning
● Machine Learning Model
○ Learning Rate
○ Epochs
○ Batch Size
○ Optimizer
○ Activation Function
○ Loss Function
● Search algorithms to find the best model and optimal hyperparameters.
○ Ex. Grid Search, Random Search, Bandit Methods
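Grid search and random search differ only in how candidate settings are generated, which a short sketch makes concrete. The code below is a plain-Python illustration under stated assumptions: `toy_score` is a made-up stand-in for a cross-validated model score, and the parameter names and values are hypothetical.

```python
import itertools
import random

def grid_search(param_grid, score_fn):
    """Evaluate every combination in the grid and keep the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def random_search(param_grid, score_fn, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {n: rng.choice(v) for n, v in param_grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Stand-in for model quality; peaks at learning_rate=0.1, max_depth=5.
    return -abs(params["learning_rate"] - 0.1) - abs(params["max_depth"] - 5)

grid = {"learning_rate": [0.01, 0.1, 0.5], "max_depth": [3, 5, 7]}
best, score = grid_search(grid, toy_score)
rs_best, rs_score = random_search(grid, toy_score, n_iter=5, seed=42)
print(best, score)
print(rs_best, rs_score)
```

Grid search costs one evaluation per combination (9 here), while random search caps the cost at `n_iter` and may miss the optimum.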
Examples - Hyper parameter tuning
XGBoost:
import numpy as np
import xgboost as xgb
# hand-tuned booster parameters
params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'max_depth': 9, 'eval_metric': 'logloss',
          'eta': 0.02, 'silent': 1, 'nthread': 4, 'subsample': 0.9, 'colsample_bytree': 0.9,
          'scale_pos_weight': 1, 'min_child_weight': 3, 'max_delta_step': 3}
num_rounds = 400
params['seed'] = 523264626346 # 0.85533
# train on the labelled data, treating NaN as missing
dtrain = xgb.DMatrix(train, labels, missing=np.nan)
clf = xgb.train(params, dtrain, num_rounds)
# predict on the test set
dtest = xgb.DMatrix(test, missing=np.nan)
test_preds = clf.predict(dtest)
Examples - Hyper parameter tuning
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
# random forest
rf = RandomForestClassifier(n_estimators=int(n_tree), max_features=int(mtry),
                            criterion=criterion, max_depth=max_depth, n_jobs=-1, oob_score=True)
rf.fit(X_train, y_train)
# gradient boosting (GradientBoostingClassifier has no n_jobs parameter)
gbm = GradientBoostingClassifier(n_estimators=n_tree, max_features=max_features,
                                 subsample=subsample)
# 10-fold stratified cross-validation, scored by F1
cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(gbm, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')
# extremely randomized trees
ext = ExtraTreesClassifier(n_estimators=n_tree, criterion=criterion,
                           max_features=max_features, n_jobs=-1)
cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(ext, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')
# exhaustive grid search over random forest hyperparameters
param_grid = {'n_estimators': [500, 1000, 1500, 2000, 2500, 3000],
              'criterion': ['gini', 'entropy'],
              'max_features': [15, 20, 25, 30],
              'max_depth': [4, 5, 6]}
gs_cv = GridSearchCV(rf, param_grid, scoring='f1', n_jobs=-1, verbose=2).fit(X_train[subspace], y_train)
gs_cv.best_params_
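Ensemble Modeling
from scipy.stats import rankdata
# per-model predicted probabilities
ens['XGB1'] = xgb1_pred['Disbursed']  # assumed: this assignment is missing from the slide
ens['XGB2'] = xgb2_pred['Disbursed']
ens['RF'] = rf_pred['Disbursed']
ens['FTRL'] = ftrl_pred['Disbursed']
# convert each model's scores to ranks so they are on a common scale
ens['XGB1_Rank'] = rankdata(ens['XGB1'], method='min')
ens['XGB2_Rank'] = rankdata(ens['XGB2'], method='min')
ens['XGB_Rank'] = 0.5 * ens['XGB1_Rank'] + 0.5 * ens['XGB2_Rank']
ens['RF_Rank'] = rankdata(ens['RF'], method='min')
ens['FTRL_Rank'] = rankdata(ens['FTRL'], method='min')
# weighted blend of ranks (uses FTRL_Rank so every term is on the rank scale)
ens['Final'] = (0.75*ens['XGB_Rank'] + 0.25*ens['RF_Rank']) * 0.75 + 0.25 * ens['FTRL_Rank']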
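The rank-averaging idea behind this ensemble can be shown on a tiny example. The sketch below is plain Python with assumed names and made-up scores; `rank_min` imitates the tie handling of `rankdata(..., method='min')` and is not the SciPy implementation.

```python
def rank_min(scores):
    """Rank scores ascending from 1; tied values share the lowest rank."""
    first_rank = {}
    for position, value in enumerate(sorted(scores), start=1):
        first_rank.setdefault(value, position)
    return [first_rank[s] for s in scores]

def blend_ranks(model_scores, weights):
    """Weighted average of per-model ranks for each row."""
    ranked = {name: rank_min(s) for name, s in model_scores.items()}
    n_rows = len(next(iter(model_scores.values())))
    return [sum(weights[name] * ranked[name][i] for name in model_scores)
            for i in range(n_rows)]

# Hypothetical predicted probabilities from two models for four rows.
preds = {
    "xgb": [0.10, 0.80, 0.80, 0.30],
    "rf":  [0.20, 0.90, 0.60, 0.40],
}
blended = blend_ranks(preds, {"xgb": 0.75, "rf": 0.25})
print(rank_min(preds["xgb"]))
print(blended)
```

Ranking first makes models with differently calibrated probabilities comparable before they are averaged.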
Type Safety: Integration with Apache Spark and Scala
● Modular, Reusable, Strongly typed Machine learning workflow on top of
Apache Spark.
● Type safety in Apache Spark with the Dataset API.
Structured Data in Apache Spark
Structured in Spark: DataFrames, Datasets
Unification of APIs in Apache Spark 2.0
● DataFrame and Dataset are unified as Dataset (2016).
● Untyped API: DataFrame = Dataset[Row] (alias)
● Typed API: Dataset[T]
Why Dataset?
● Strong typing.
● Ability to use powerful lambda functions.
● Spark SQL’s optimized execution engine (Catalyst, Tungsten).
● Can be constructed from JVM objects and manipulated using functional transformations (map, filter, flatMap, etc.).
● A DataFrame is a Dataset organized into named columns.
● DataFrame is simply a type alias of Dataset[Row].
DataFrame API Code
import spark.implicits._
import org.apache.spark.sql.functions.sum
// convert RDD -> DF with column names
val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
// filter, groupBy, sum, and then agg()
parsedDF.filter($"project" === "finance").
  groupBy($"sprint").
  agg(sum($"numStories").as("count")).
  limit(100).
  show(100)
Sample rows of parsedDF:
project   sprint   numStories
finance   3        20
finance   4        22
DataFrame -> SQL View -> SQL Query
// register the DataFrame as a temporary view, then query it with SQL
parsedDF.createOrReplaceTempView("audits")
val results = spark.sql(
  """SELECT sprint, sum(numStories)
     AS count FROM audits WHERE project = 'finance' GROUP BY sprint
     LIMIT 100""")
results.show(100)
Sample rows of the audits view:
project   sprint   numStories
finance   3        20
finance   4        22
Why Structured APIs?
// DataFrame
data.groupBy("dept").avg("age")
// SQL
select dept, avg(age) from data group by 1
// RDD
data.map { case (dept, age) => dept -> (age, 1) }
.reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
.map { case (dept, (age, c)) => dept -> age / c }
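The RDD version spells out by hand what the DataFrame and SQL versions declare in one line: carry a (sum, count) pair per key, combine pairs, then divide. A plain-Python stand-in for that pattern (not Spark code; the function name and sample data are assumptions) looks like this:

```python
from collections import defaultdict

def average_by_key(pairs):
    """Average the values per key by accumulating (sum, count) pairs."""
    totals = defaultdict(lambda: (0, 0))
    for dept, age in pairs:
        running_sum, count = totals[dept]
        # Same combine step as reduceByKey: add the sums and add the counts.
        totals[dept] = (running_sum + age, count + 1)
    # Final map step: divide each sum by its count.
    return {dept: s / c for dept, (s, c) in totals.items()}

# Hypothetical (dept, age) records.
data = [("eng", 30), ("eng", 40), ("ops", 50)]
averages = average_by_key(data)
print(averages)
```

Note that Python's `/` returns a float here, whereas `age / c` on two Scala `Int` values would truncate.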
Catalyst in Spark
SQL AST / DataFrame / Datasets → Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plans → Cost Model → Selected Physical Plan → RDD
Dataset API in Spark 2.x
import org.apache.spark.sql.Dataset
import spark.implicits._
// Domain object; Spark infers JSON integers as Long, so age is a Long.
case class Employee(name: String, age: Long)
val employeesDF = spark.read.json("employees.json")
// Convert data to domain objects.
val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
// Type-safe: operate on domain objects with compiled lambda functions.
val filterDS = employeesDS.filter(p => p.age > 3)
Structured APIs in Apache Spark
                  SQL       DataFrames     Datasets
Syntax Errors     Runtime   Compile Time   Compile Time
Analysis Errors   Runtime   Runtime        Compile Time
Analysis errors are caught before a job runs on the cluster.
Spark SQL API - Analysis Error example.
TransmogrifAI - Type Safety is Everywhere!
● Value operations
● Feature operations
● Transformation Pipelines (aka Workflows)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList
// Typed feature operations
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)
// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
Example Code
Ref. https://github.com/fosscoder/transmogrifai-demo
A Case Story - Functional Flow - Spark as a SaaS
● User Interface
● Build workflow
○ Source
○ Target
○ Transformations
○ Filter
○ Aggregation
○ Joins
○ Expressions
○ Machine Learning Algorithms
● Store metadata of the workflow in a document-based NoSQL store (ex. MongoDB)
● ReactiveMongo
● Scala / Spark job reads metadata from NoSQL (ex. MongoDB)
● Run on the cluster
● Schedule using Airflow SparkSubmitOperator
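The flow above is metadata-driven: the UI stores a workflow description, and a generic job later reads it and applies the listed steps. A minimal sketch of that idea follows, in plain Python rather than Scala/Spark. The JSON schema, handler names, and sample rows are all hypothetical and only illustrate the pattern; they are not the format used in the case story.

```python
import json

# Hypothetical workflow document, as it might be stored in a document store.
WORKFLOW_JSON = """
{
  "source": "sprints",
  "transformations": [
    {"type": "filter", "column": "project", "equals": "finance"},
    {"type": "aggregation", "group_by": "sprint", "sum": "numStories", "as": "count"}
  ],
  "target": "finance_summary"
}
"""

def apply_filter(rows, step):
    """Keep rows whose column equals the configured value."""
    return [r for r in rows if r[step["column"]] == step["equals"]]

def apply_aggregation(rows, step):
    """Group rows by one column and sum another."""
    totals = {}
    for r in rows:
        key = r[step["group_by"]]
        totals[key] = totals.get(key, 0) + r[step["sum"]]
    return [{step["group_by"]: k, step["as"]: v} for k, v in sorted(totals.items())]

# Dispatch table: one handler per transformation type named in the metadata.
HANDLERS = {"filter": apply_filter, "aggregation": apply_aggregation}

def run_workflow(workflow, tables):
    """Read the source, apply each transformation in order, write to the target."""
    rows = tables[workflow["source"]]
    for step in workflow["transformations"]:
        rows = HANDLERS[step["type"]](rows, step)
    return {workflow["target"]: rows}

tables = {"sprints": [
    {"project": "finance", "sprint": 3, "numStories": 20},
    {"project": "finance", "sprint": 4, "numStories": 22},
    {"project": "hr", "sprint": 3, "numStories": 5},
]}
result = run_workflow(json.loads(WORKFLOW_JSON), tables)
print(result)
```

Because the job only interprets metadata, new workflows can be added from the UI without redeploying code; supporting a new transformation type means registering one more handler.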
A Case Story - High Level Technical Architecture - Spark as a SaaS
● User Interface
● Middleware: Akka HTTP Web Services
Apache Livy Configuration
Apache Livy Integration
Questions ?
Thank you!
Big Thanks to Scala.IO Organizers and Scala France Community!
@khatri_chetan
chetan.khatri@live.com
References
[1] TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows
on Spark from Salesforce Engineering
[online] https://transmogrif.ai
[2] A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator
[online] https://github.com/rssanders3/airflow-spark-operator-plugin
[3] Apache Spark - Unified Analytics Engine for Big Data
[online] https://spark.apache.org/
[4] Apache Livy
[online] https://livy.incubator.apache.org/
[5] Zayd's Blog - Why is machine learning 'hard'?
[online] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
[6] Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
[online] https://www.youtube.com/watch?v=uMapcWtzwyA&t=106s
[7] Auto-Machine Learning: The Magic Behind Einstein
[online] https://www.youtube.com/watch?v=YDw1GieW4cw&t=564s

More Related Content

Similar to TransmogrifAI - Automate Machine Learning Workflow with the power of Scala and Spark at massive scale.

Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
Karthik Murugesan
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
Faisal Siddiqi
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
Databricks
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
Databricks
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
Modern Data Stack France
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
ProgrammingPrimerAndOOPS
ProgrammingPrimerAndOOPSProgrammingPrimerAndOOPS
ProgrammingPrimerAndOOPS
sunmitraeducation
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
Bill Liu
 
Dynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteDynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web site
Sriram Natarajan
 
Streaming Inference with Apache Beam and TFX
Streaming Inference with Apache Beam and TFXStreaming Inference with Apache Beam and TFX
Streaming Inference with Apache Beam and TFX
Databricks
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Jim Dowling
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Holden Karau
 
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
GeeksLab Odessa
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML Pipelines
Stitch Fix Algorithms
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache Spark
Databricks
 
Learning to rank search results
Learning to rank search resultsLearning to rank search results
Learning to rank search results
Jettro Coenradie
 
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Amazon Web Services
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
François Garillot
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Databricks
 

Similar to TransmogrifAI - Automate Machine Learning Workflow with the power of Scala and Spark at massive scale. (20)

Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
ProgrammingPrimerAndOOPS
ProgrammingPrimerAndOOPSProgrammingPrimerAndOOPS
ProgrammingPrimerAndOOPS
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
 
Dynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web siteDynamic Tracing of your AMP web site
Dynamic Tracing of your AMP web site
 
Streaming Inference with Apache Beam and TFX
Streaming Inference with Apache Beam and TFXStreaming Inference with Apache Beam and TFX
Streaming Inference with Apache Beam and TFX
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML Pipelines
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache Spark
 
Learning to rank search results
Learning to rank search resultsLearning to rank search results
Learning to rank search results
 
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
Build Deep Learning Applications Using Apache MXNet, Featuring Workday (AIM40...
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 

More from Chetan Khatri

Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Chetan Khatri
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Chetan Khatri
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
Chetan Khatri
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Chetan Khatri
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
Chetan Khatri
 
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
Chetan Khatri
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
Chetan Khatri
 
HBase with Apache Spark POC Demo
HBase with Apache Spark POC DemoHBase with Apache Spark POC Demo
HBase with Apache Spark POC Demo
Chetan Khatri
 
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
Chetan Khatri
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
Chetan Khatri
 
Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...
Chetan Khatri
 
An Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learningAn Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learning
Chetan Khatri
 
Introduction to Computer Science
Introduction to Computer ScienceIntroduction to Computer Science
Introduction to Computer Science
Chetan Khatri
 
An introduction to Git with Atlassian Suite
An introduction to Git with Atlassian SuiteAn introduction to Git with Atlassian Suite
An introduction to Git with Atlassian Suite
Chetan Khatri
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
Chetan Khatri
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
Chetan Khatri
 
Voltage measurement using arduino
Voltage measurement using arduinoVoltage measurement using arduino
Voltage measurement using arduino
Chetan Khatri
 
Design & Building Smart Energy Meter
Design & Building Smart Energy MeterDesign & Building Smart Energy Meter
Design & Building Smart Energy Meter
Chetan Khatri
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - Python
Chetan Khatri
 

More from Chetan Khatri (20)

Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_productionPyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
PyConLT19-No_more_struggles_with_Apache_Spark_(PySpark)_workloads_in_production
 
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
HBaseConAsia 2018 - Scaling 30 TB's of Data lake with Apache HBase and Scala ...
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
 
HBase with Apache Spark POC Demo
HBase with Apache Spark POC DemoHBase with Apache Spark POC Demo
HBase with Apache Spark POC Demo
 
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
HKOSCon18 - Chetan Khatri - Open Source AI / ML Technologies and Application ...
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...Fossasia ai-ml technologies and application for product development-chetan kh...
Fossasia ai-ml technologies and application for product development-chetan kh...
 
An Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learningAn Introduction Linear Algebra for Neural Networks and Deep learning
An Introduction Linear Algebra for Neural Networks and Deep learning
 
Introduction to Computer Science
Introduction to Computer ScienceIntroduction to Computer Science
Introduction to Computer Science
 
An introduction to Git with Atlassian Suite
An introduction to Git with Atlassian SuiteAn introduction to Git with Atlassian Suite
An introduction to Git with Atlassian Suite
 
Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
A step towards machine learning at accionlabs
A step towards machine learning at accionlabsA step towards machine learning at accionlabs
A step towards machine learning at accionlabs
 
Voltage measurement using arduino
Voltage measurement using arduinoVoltage measurement using arduino
Voltage measurement using arduino
 
Design & Building Smart Energy Meter
Design & Building Smart Energy MeterDesign & Building Smart Energy Meter
Design & Building Smart Energy Meter
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - Python
 



TransmogrifAI - Automate Machine Learning Workflow with the power of Scala and Spark at massive scale.

  • 1. TransmogrifAI: Automate Machine Learning Workflow with the power of Scala and Spark at massive scale. By: Chetan Khatri (@khatri_chetan). Scala.IO Conference, École Supérieure de Chimie Physique Électronique de Lyon, France
  • 2. About me Lead - Data Science @ Accion labs India Pvt. Ltd. Open Source Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. Advisor - Data Science Lab, University of Kachchh, India. M.Sc. - Computer Science from University of Kachchh, India.
  • 3. Agenda ● What is TransmogrifAI ? ● Why you need TransmogrifAI ? ● Automation of Machine learning life Cycle - from development to deployment. ○ Feature Inference ○ Transformation ○ Automated Feature validation ○ Automated Model Selection ○ Hyperparameter Optimization ● Type Safety in Spark, TransmogrifAI. ● Example: Code - Titanic kaggle problem.
  • 4. What is TransmogrifAI ? ● TransmogrifAI was open sourced by Salesforce.com, Inc. in June 2018. ● An end-to-end automated machine learning workflow library for structured data, built on top of Scala and SparkML.
  • 5. What is TransmogrifAI ? ● TransmogrifAI helps extensively to automate the machine learning model life cycle: Feature Selection, Transformation, Automated Feature Validation, Automated Model Selection, and Hyperparameter Optimization. ● It enforces compile-time type safety, modularity, and reuse. ● Through automation, it achieves accuracies close to hand-tuned models with an almost 100x reduction in time.
  • 6. Why you need TransmogrifAI ? Features: ● AUTOMATION: Numerous Transformers and Estimators. ● MODULARITY AND REUSE: Enforces a strict separation between ML workflow definitions and data manipulation. ● COMPILE TIME TYPE SAFETY: Workflows are strongly typed, with code completion during development and fewer runtime errors. ● TRANSPARENCY: Model insights leverage stored feature metadata and lineage to help debug models.
  • 7. Why you need TransmogrifAI ? Use TransmogrifAI if you need a machine learning library to: ● Build production-ready machine learning applications in hours, not months ● Build machine learning models without getting a Ph.D. in machine learning ● Build modular, reusable, strongly typed machine learning workflows. Read more in the documentation: https://transmogrif.ai/
  • 8. Why Machine Learning is hard ?! Really! ... For example, this may be using a linear classifier when your true decision boundaries are non-linear. Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
  • 9. Why Machine Learning is hard ?! Really! ... fast and effective debugging is the skill that is most required for implementing modern day machine learning pipelines. Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
  • 10. Real time Machine Learning takes time to Productionize TransmogrifAI Automates entire ML Life Cycle to accelerate developer’s productivity.
  • 11. Under the Hood Automated Feature Engineering Automated Feature Selection Automated Model Selection
  • 12. Automated Feature Engineering: Automatic derivation of new features based on existing features. Diagram: input features (Email, Phone, Age, Subject, Zip Code, DOB, Gender) are expanded into derived features (Email is Spam, Country Code, age buckets [0-20], [21-30], [> 30], Stop words, Top terms (TF-IDF), Detect Language, Average Income, House Price, School Quality, Shopping, Transportation, To Binary, Age, Day of Week, Week of Year, Quarter, Month, Year, Hour), which are combined into a Feature Vector.
  • 13. Automated Feature Engineering ● Analyze every feature column and compute descriptive statistics. ○ Number of nulls, number of empty strings, mean, max, min, standard deviation. ● Handle missing values / noisy values. ○ Ex. fillna by mean / median / nearby values:
      import numpy
      from sklearn.impute import SimpleImputer  # replaces the older sklearn Imputer class

      # fill missing values with a sentinel
      patient_details = patient_details.fillna(-1)
      data['City_Type'] = data['City_Type'].fillna('Z')
      # impute missing values with the column median
      imp = SimpleImputer(missing_values=numpy.nan, strategy='median', copy=False)
      data_total_imputed = imp.fit_transform(data_total)
      # mark zero values as missing or NaN
      dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
      # fill missing values with mean column values
      dataset.fillna(dataset.mean(), inplace=True)
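The descriptive statistics and mean imputation listed on this slide can be sketched without any library. This is only an illustration of the idea, not how TransmogrifAI computes them; the helper names `describe_column` and `fill_with_mean` and the sample `ages` column are hypothetical.

```python
import math
from statistics import mean, pstdev


def describe_column(values):
    """Summarize one feature column; None marks a missing value."""
    nulls = sum(1 for v in values if v is None)
    empty = sum(1 for v in values if isinstance(v, str) and v.strip() == "")
    numeric = [v for v in values
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    stats = {"nulls": nulls, "empty_strings": empty}
    if numeric:
        stats.update(mean=mean(numeric), min=min(numeric),
                     max=max(numeric), std=pstdev(numeric))
    return stats


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed_mean = mean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]


# Hypothetical sample column with two missing entries.
ages = [22, None, 38, 26, None, 34]
summary = describe_column(ages)
filled = fill_with_mean(ages)
```

Here the population standard deviation (`pstdev`) is used; a sample standard deviation would be an equally reasonable choice.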
  • 14. Automated Feature Engineering ● Do features have acceptable ranges and contain valid values ? ● Could the feature be a leaker ? ○ Is it usually filled out after the predicted field is ? ○ Is it highly correlated with the predicted field ? ● Does the feature contain outliers ?
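One of the leakage checks above, "is it highly correlated with the predicted field", can be sketched as a plain Pearson correlation test. The 0.95 threshold, the function names, and the sample columns are assumptions made for illustration; a real check would also look at when the field gets filled in.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def flag_leakers(features, label, threshold=0.95):
    """Return names of features whose |correlation| with the label is suspiciously high."""
    return sorted(name for name, column in features.items()
                  if abs(pearson(column, label)) >= threshold)


# Hypothetical data: 'refund_issued' is only recorded once the outcome is known.
label = [0, 1, 0, 1, 1, 0]
features = {
    "age": [22, 38, 26, 35, 54, 2],
    "refund_issued": [0, 1, 0, 1, 1, 0],
}
suspects = flag_leakers(features, label)
```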
  • 15. Automated Feature Selection / Data Pre-processing ● Data Type of Features, Automatic Data Pre-processing. ○ MinMaxScaler ○ Normalizer ○ Binarizer ○ Label Encoding ○ One Hot Encoding ● Auto Data Pre-Processing based on chosen ML Model. ● Algorithms like XGBoost specifically require dummy-encoded data, while algorithms like decision trees don't seem to care at all (sometimes)!
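Two of the pre-processing steps named on this slide, min-max scaling and one-hot encoding, are small enough to sketch by hand. The functions below are illustrative stand-ins written for this note, not the scikit-learn or SparkML implementations, and the sample values are made up.

```python
def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


fares = [10.0, 20.0, 40.0]
scaled = min_max_scale(fares)

ports = ["S", "C", "S", "Q"]
categories, encoded = one_hot(ports)
```

Sorting the categories keeps the column order deterministic, which matters when the same encoding has to be reapplied at scoring time.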
  • 16. Auto Data Pre-processing ● Numeric - Imputation, Track Null Value, Log Transformation for large range, Scaling, Smart Binning. ● Categorical - Imputation, Track Null Value, One Hot Encoding / Dummy Encoding, Dynamic Top K Pivot, Smart Binning, Label Encoding, Category Embedding. ● Text - Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis, Language Detection. ● Temporal - Time Difference, Circular Statistics, Time Extraction (Day, Week, Month, Year).
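For the temporal bullet, time extraction and a circular encoding can be sketched as follows. The circular encoding maps a cyclic value such as the hour onto a unit circle so that 23:00 and 00:00 end up close together. The function names and the sample timestamp are assumptions for illustration.

```python
import math
from datetime import datetime


def time_parts(ts):
    """Extract calendar fields from a timestamp."""
    return {
        "year": ts.year,
        "month": ts.month,
        "week": ts.isocalendar()[1],   # ISO week of year
        "day_of_week": ts.weekday(),   # Monday == 0
        "hour": ts.hour,
    }


def circular_encode(value, period):
    """Map a cyclic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


ts = datetime(2018, 10, 29, 23, 0)
parts = time_parts(ts)

midnight = circular_encode(0, 24)
late_evening = circular_encode(parts["hour"], 24)
noon = circular_encode(12, 24)
```

With the raw hour, 23 and 0 look 23 units apart; after the encoding, late evening is much nearer to midnight than noon is.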
  • 17. Auto Selection of Best Model with Hyperparameter Tuning ● Machine Learning Model ○ Learning Rate ○ Epochs ○ Batch Size ○ Optimizer ○ Activation Function ○ Loss Function ● Search algorithms to find the best model and optimal hyperparameters. ○ Ex. Grid Search, Random Search, Bandit Methods
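Grid search and random search, the first two search algorithms named above, differ only in how candidates are drawn from the hyperparameter space. In this minimal sketch the `validation_loss` function is a toy stand-in for "train a model and score it on validation data", and the search space values are hypothetical.

```python
import itertools
import random


def validation_loss(params):
    """Toy objective standing in for a real train-and-validate run (lower is better)."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 6) ** 2 / 100


def grid_search(space, score):
    """Evaluate every combination in the space and return the best one."""
    names = list(space)
    candidates = [dict(zip(names, combo))
                  for combo in itertools.product(*(space[n] for n in names))]
    return min(candidates, key=score)


def random_search(space, score, n_trials, seed=0):
    """Evaluate n_trials randomly sampled combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(options) for name, options in space.items()}
                  for _ in range(n_trials)]
    return min(candidates, key=score)


space = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [4, 6, 9]}
best_grid = grid_search(space, validation_loss)
best_random = random_search(space, validation_loss, n_trials=5)
```

Grid search is exhaustive, so its cost grows with the product of the option counts; random search trades that guarantee for a fixed budget of trials.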
  • 18. Examples - Hyperparameter tuning XGBoost:
      import numpy as np
      import xgboost as xgb

      # booster, objective, and regularization settings chosen by hand
      params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'max_depth': 9,
                'eval_metric': 'logloss', 'eta': 0.02, 'silent': 1, 'nthread': 4,
                'subsample': 0.9, 'colsample_bytree': 0.9, 'scale_pos_weight': 1,
                'min_child_weight': 3, 'max_delta_step': 3}
      num_rounds = 400
      params['seed'] = 523264626346  # 0.85533
      # train on the labeled data, treating NaN as missing
      dtrain = xgb.DMatrix(train, labels, missing=np.nan)
      clf = xgb.train(params, dtrain, num_rounds)
      # predict on the test set
      dtest = xgb.DMatrix(test, missing=np.nan)
      test_preds = clf.predict(dtest)
  • 19. Examples - Hyperparameter tuning
      from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                                    ExtraTreesClassifier)
      from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV

      # random forest with out-of-bag scoring
      rf = RandomForestClassifier(n_estimators=int(n_tree), max_features=int(mtry),
                                  criterion=criterion, max_depth=max_depth,
                                  n_jobs=-1, oob_score=True)
      rf.fit(X_train, y_train)

      # gradient boosting (GradientBoostingClassifier has no n_jobs parameter)
      gbm = GradientBoostingClassifier(n_estimators=n_tree, max_features=max_features,
                                       subsample=subsample)
      cv = StratifiedKFold(n_splits=10)
      scores = cross_val_score(gbm, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')

      # extremely randomized trees
      ext = ExtraTreesClassifier(n_estimators=n_tree, criterion=criterion,
                                 max_features=max_features, n_jobs=-1)
      cv = StratifiedKFold(n_splits=10)
      scores = cross_val_score(ext, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')

      # exhaustive grid search over the random forest hyperparameters
      param_grid = {'n_estimators': [500, 1000, 1500, 2000, 2500, 3000],
                    'criterion': ['gini', 'entropy'],
                    'max_features': [15, 20, 25, 30],
                    'max_depth': [4, 5, 6]}
      gs_cv = GridSearchCV(rf, param_grid, scoring='f1', n_jobs=-1,
                           verbose=2).fit(X_train[subspace], y_train)
      gs_cv.best_params_
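  • 20. Ensemble Modeling
      from scipy.stats import rankdata

      # collect each model's predictions (ens['XGB1'] is assumed to be filled the same way)
      ens['XGB2'] = xgb2_pred['Disbursed']
      ens['RF'] = rf_pred['Disbursed']
      ens['FTRL'] = ftrl_pred['Disbursed']
      # convert predictions to ranks so the models are on a comparable scale
      ens['XGB1_Rank'] = rankdata(ens['XGB1'], method='min')
      ens['XGB2_Rank'] = rankdata(ens['XGB2'], method='min')
      ens['XGB_Rank'] = 0.5 * ens['XGB1_Rank'] + 0.5 * ens['XGB2_Rank']
      ens['RF_Rank'] = rankdata(ens['RF'], method='min')
      ens['FTRL_Rank'] = rankdata(ens['FTRL'], method='min')
      # weighted blend of the rank scores (FTRL_Rank keeps every term on the rank scale)
      ens['Final'] = (0.75 * ens['XGB_Rank'] + 0.25 * ens['RF_Rank']) * 0.75 + 0.25 * ens['FTRL_Rank']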
  • 21. Type Safety: Integration with Apache Spark and Scala ● Modular, reusable, strongly typed machine learning workflows on top of Apache Spark. ● Type safety in Apache Spark with the Dataset API.
  • 22. Structured Data in Apache Spark: the structured APIs in Spark are DataFrames and Datasets.
  • 23. Unification of APIs in Apache Spark 2.0 (2016): the DataFrame (untyped API) and the Dataset (typed API) are unified under Dataset. DataFrame = Dataset[Row] (alias); typed API = Dataset[T].
  • 24. Why Dataset ? ● Strong typing. ● Ability to use powerful lambda functions. ● Spark SQL's optimized execution engine (Catalyst, Tungsten). ● Can be constructed from JVM objects and manipulated using functional transformations (map, filter, flatMap, etc.). ● A DataFrame is a Dataset organized into named columns. ● DataFrame is simply a type alias of Dataset[Row].
  • 25. DataFrame API Code
      // convert RDD -> DF with column names
      val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
      // filter, groupBy, sum, and then agg()
      parsedDF.filter($"project" === "finance").
        groupBy($"sprint").
        agg(sum($"numStories").as("count")).
        limit(100).
        show(100)

      project  sprint  numStories
      finance  3       20
      finance  4       22
  • 26. DataFrame -> SQL View -> SQL Query
      parsedDF.createOrReplaceTempView("audits")
      val results = spark.sql(
        """SELECT sprint, sum(numStories) AS count
           FROM audits
           WHERE project = 'finance'
           GROUP BY sprint
           LIMIT 100""")
      results.show(100)

      project  sprint  numStories
      finance  3       20
      finance  4       22
  • 27. Why Structured APIs ?
      // DataFrame
      data.groupBy("dept").avg("age")
      // SQL
      select dept, avg(age) from data group by 1
      // RDD
      data.map { case (dept, age) => dept -> (age, 1) }
        .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
        .map { case (dept, (age, c)) => dept -> age / c }
  • 28. Catalyst in Spark: SQL AST / DataFrame / Datasets -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDD
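  • 29. Dataset API in Spark 2.x
      import org.apache.spark.sql.Dataset
      import spark.implicits._  // provides the encoder used by .as[Employee]

      val employeesDF = spark.read.json("employees.json")
      // Convert data to domain objects.
      // JSON integers are inferred as Long, so age is declared as Long for .as[Employee] to resolve.
      case class Employee(name: String, age: Long)
      val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
      val filterDS = employeesDS.filter(p => p.age > 3)
      Type-safe: operate on domain objects with compiled lambda functions.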
  • 30. Structured APIs in Apache Spark
                        SQL       DataFrames     Datasets
      Syntax Errors     Runtime   Compile Time   Compile Time
      Analysis Errors   Runtime   Runtime        Compile Time
      Analysis errors are caught before a job runs on the cluster.
  • 31. Spark SQL API - Analysis Error example.
  • 32. Spark SQL API - Analysis Error example.
  • 33. TransmogrifAI - Type Safety is Everywhere! ● Value operations ● Feature operations ● Transformation Pipelines (aka Workflows)
      // Typed value operations
      def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList
      // Typed feature operations
      val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
      val tokens: Feature[TextList] = title.map(tokenize)
      // Transformation pipelines
      new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
  • 35. A Case Story - Functional Flow - Spark as a SaaS: User Interface: build workflow (Source, Target, Transformations: filter, aggregation, Joins, Expressions, Machine Learning Algorithms) -> Store metadata of the workflow in a document-based NoSQL store (e.g. MongoDB, via ReactiveMongo) -> Scala / Spark job reads the metadata from NoSQL (e.g. MongoDB) -> Run on the cluster -> Schedule using the Airflow SparkSubmit Operator
  • 36. A Case Story - High Level Technical Architecture - Spark as a SaaS: User Interface -> Middleware: Akka HTTP Web Services
  • 43. Questions ? Thank you! Big Thanks to Scala.IO Organizers and Scala France Community! @khatri_chetan chetan.khatri@live.com
  • 44. References
      [1] TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark from Salesforce Engineering [online] https://transmogrif.ai
      [2] A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator [online] https://github.com/rssanders3/airflow-spark-operator-plugin
      [3] Apache Spark - Unified Analytics Engine for Big Data [online] https://spark.apache.org/
      [4] Apache Livy [online] https://livy.incubator.apache.org/
      [5] Zayd's Blog - Why is machine learning 'hard'? [online] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
      [6] Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions [online] https://www.youtube.com/watch?v=uMapcWtzwyA&t=106s
      [7] Auto-Machine Learning: The Magic Behind Einstein [online] https://www.youtube.com/watch?v=YDw1GieW4cw&t=564s