
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala and Spark at massive scale.


Scala.IO, CPE - France

Published in: Data & Analytics


  1. 1. TransmogrifAI: Automate Machine Learning Workflow with the power of Scala and Spark at massive scale. By: Chetan Khatri (@khatri_chetan). Scala.IO Conference, École Supérieure de Chimie Physique Électronique de Lyon, France
  2. 2. About me Lead - Data Science @ Accion labs India Pvt. Ltd. Open Source Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors. Co-Authored University Curriculum @ University of Kachchh, India. Data Engineering @: Nazara Games, Eccella Corporation. Advisor - Data Science Lab, University of Kachchh, India. M.Sc. - Computer Science from University of Kachchh, India.
  3. 3. Agenda ● What is TransmogrifAI? ● Why you need TransmogrifAI? ● Automation of the machine learning life cycle, from development to deployment. ○ Feature Inference ○ Transformation ○ Automated Feature Validation ○ Automated Model Selection ○ Hyperparameter Optimization ● Type safety in Spark and TransmogrifAI. ● Example code: the Titanic Kaggle problem.
  4. 4. What is TransmogrifAI? ● TransmogrifAI was open sourced by Salesforce.com, Inc. in June 2018. ● An end-to-end automated machine learning workflow library for structured data, built on top of Scala and SparkML.
  5. 5. What is TransmogrifAI? ● TransmogrifAI automates much of the machine learning model life cycle: feature selection, transformation, automated feature validation, automated model selection, and hyperparameter optimization. ● It enforces compile-time type safety, modularity, and reuse. ● Through automation, it achieves accuracies close to hand-tuned models with an almost 100x reduction in time.
  6. 6. Why you need TransmogrifAI? Features: ● AUTOMATION: numerous Transformers and Estimators. ● MODULARITY AND REUSE: enforces a strict separation between ML workflow definitions and data manipulation. ● COMPILE-TIME TYPE SAFETY: workflows are strongly typed, with code completion during development and fewer runtime errors. ● TRANSPARENCY: model insights leverage stored feature metadata and lineage to help debug models.
  7. 7. Why you need TransmogrifAI? Use TransmogrifAI if you need a machine learning library to: ● Build production-ready machine learning applications in hours, not months ● Build machine learning models without getting a Ph.D. in machine learning ● Build modular, reusable, strongly typed machine learning workflows. Read more in the documentation: https://transmogrif.ai/
  8. 8. Why Machine Learning is hard ?! Really! ... For example, this may be using a linear classifier when your true decision boundaries are non-linear. Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
  9. 9. Why Machine Learning is hard ?! Really! ... fast and effective debugging is the skill that is most required for implementing modern day machine learning pipelines. Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
  10. 10. Real-time machine learning takes time to productionize. TransmogrifAI automates the entire ML life cycle to accelerate developer productivity.
  11. 11. Under the Hood Automated Feature Engineering Automated Feature Selection Automated Model Selection
  12. 12. Automated Feature Engineering: automatic derivation of new features based on existing features. [Figure: raw fields (Email, Phone, Age, Subject, Zip Code, DOB, Gender) are expanded into derived features (Email is Spam; Country Code; age buckets [0-20], [21-30], [>30]; Stop words, Top terms (TF-IDF), Detect Language; Average Income, House Price, School Quality, Shopping, Transportation; To Binary; Age, Day of Week, Week of Year, Quarter, Month, Year, Hour) and combined into a Feature Vector.]
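
The kind of derivation shown on this slide can be sketched in a few lines of plain Python. This is only an illustration, not how TransmogrifAI implements it: the helper names `age_bucket` and `date_parts` and the sample record are assumptions made for the example.

```python
from datetime import date


def age_bucket(age):
    """Map an age to one of the three buckets named on the slide."""
    if age <= 20:
        return "[0-20]"
    if age <= 30:
        return "[21-30]"
    return "[>30]"


def date_parts(d):
    """Expand a date into several calendar features."""
    return {
        "day_of_week": d.isoweekday(),       # 1 = Monday ... 7 = Sunday
        "week_of_year": d.isocalendar()[1],  # ISO week number
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "year": d.year,
    }


# Hypothetical record: one age and one date of birth.
features = {"age_bucket": age_bucket(27), **date_parts(date(1990, 5, 17))}
print(features)
```

One raw column becomes several model inputs, which is the point of the slide: the library does this expansion per feature type so the user does not have to write it by hand.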
  13. 13. Automated Feature Engineering ● Analyze every feature column and compute descriptive statistics. ○ Number of nulls, number of empty strings, mean, max, min, standard deviation. ● Handle missing values / noisy values. ○ Ex. fillna by mean / median / nearby values.

          import numpy as np
          from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

          # fill missing values with a sentinel value
          patient_details = patient_details.fillna(-1)
          data['City_Type'] = data['City_Type'].fillna('Z')

          # impute missing values with the column median
          imp = SimpleImputer(missing_values=np.nan, strategy='median', copy=False)
          data_total_imputed = imp.fit_transform(data_total)

          # mark zero values as missing or NaN
          dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.nan)
          # fill missing values with mean column values
          dataset.fillna(dataset.mean(), inplace=True)
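
As a dependency-free illustration of the two steps on this slide (profile a column, then fill its gaps), here is a minimal sketch. The function names `describe` and `impute_mean` and the sample `ages` column are assumptions for the example; `None` stands in for a missing value.

```python
import statistics


def describe(column):
    """Descriptive statistics for a numeric column that may contain None."""
    present = [v for v in column if v is not None]
    return {
        "nulls": len(column) - len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
    }


def impute_mean(column):
    """Replace every None with the mean of the observed values."""
    mean = statistics.mean([v for v in column if v is not None])
    return [mean if v is None else v for v in column]


ages = [20, None, 40, 30, None]  # hypothetical column
print(describe(ages))
print(impute_mean(ages))
```

Mean imputation is only one of the strategies the slide lists; swapping `statistics.mean` for `statistics.median` gives the median variant.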
  14. 14. Automated Feature Engineering ● Does the feature have an acceptable range? Does it contain valid values? ● Could the feature be a leaker? ○ Is it usually filled out after the predicted field is? ○ Is it highly correlated with the predicted field? ● Does the feature contain outliers?
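
The "highly correlated with the predicted field" check can be illustrated with a plain Pearson correlation. This is a sketch under assumptions: the 0.95 threshold, the function names, and the toy columns are all made up for the example, and a real validator would look at more than linear correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def flag_leakers(features, label, threshold=0.95):
    """Names of features whose |correlation| with the label is suspiciously high."""
    return sorted(
        name for name, values in features.items()
        if abs(pearson(values, label)) >= threshold
    )


label = [0, 1, 0, 1, 1, 0]
features = {
    "refund_issued": [0, 1, 0, 1, 1, 0],  # hypothetical field filled in after the outcome
    "age": [25, 31, 47, 52, 38, 29],
}
print(flag_leakers(features, label))
```

A feature that mirrors the label this closely is usually recorded after the outcome is known, so it should be dropped before training.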
  15. 15. Automated Feature Selection / Data Pre-processing ● Data Type of Features, Automatic Data Pre-processing. ○ MinMaxScaler ○ Normalizer ○ Binarizer ○ Label Encoding ○ One Hot Encoding ● Auto Data Pre-Processing based on chosen ML Model. ● Algorithms like XGBoost specifically require dummy-encoded data, while algorithms like decision trees don't seem to care at all (sometimes)!
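
Two of the pre-processing steps named above, min-max scaling and one-hot encoding, are simple enough to show without any library. The functions below are illustrative stand-ins written for this example, not the scikit-learn or Spark implementations.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


print(min_max_scale([10, 20, 30]))
print(one_hot(["S", "C", "S", "Q"]))
```

Which of these steps is needed depends on the model, which is why the slide ties pre-processing to the chosen algorithm.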
  16. 16. Auto Data Pre-processing ● Numeric - Imputation, Track Null Value, Log Transformation for large range, Scaling, Smart Binning. ● Categorical - Imputation, Track Null Value, One Hot Encoding / Dummy Encoding, Dynamic Top K Pivot, Smart Binning, Label Encoding, Category Embedding. ● Text - Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis, Language Detection. ● Temporal - Time Difference, Circular Statistics, Time Extraction(Day, Week, Month, Year).
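
Two of the less familiar items in this list can be sketched briefly: a top-K pivot (one-hot encode only the K most frequent categories and pool the rest) and a circular encoding of a time-of-day value. The function names, the `"OTHER"` column label, and the sample data are assumptions for illustration only.

```python
import math
from collections import Counter


def top_k_pivot(values, k):
    """One-hot encode the k most frequent categories; everything else goes to OTHER."""
    top = [category for category, _ in Counter(values).most_common(k)]
    columns = top + ["OTHER"]
    rows = []
    for v in values:
        key = v if v in top else "OTHER"
        rows.append([1 if key == c else 0 for c in columns])
    return columns, rows


def hour_to_circle(hour):
    """Project an hour (0-23) onto the unit circle so 23:00 and 00:00 end up close."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


columns, rows = top_k_pivot(["a", "a", "a", "b", "b", "c", "d"], k=2)
print(columns, rows)
print(hour_to_circle(23), hour_to_circle(0))
```

The pivot keeps the feature vector small for high-cardinality columns, and the circular form avoids the artificial jump between hour 23 and hour 0 that a plain integer encoding would introduce.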
  17. 17. Auto Selection of Best Model with Hyper Parameter Tuning ● Machine Learning Model ○ Learning Rate ○ Epochs ○ Batch Size ○ Optimizer ○ Activation Function ○ Loss Function ● Search Algorithms to find best model and optimal hyper parameters. ○ Ex. Grid Search, Random Search, Bandit Methods
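
To make the difference between grid search and random search concrete, here is a minimal sketch over a toy search space. The `toy_score` function is a made-up stand-in for "train a model and return its validation metric"; the parameter names and values are assumptions for the example.

```python
import itertools
import random


def grid_search(space, score):
    """Evaluate every combination in the space and return (best_score, best_params)."""
    best = None
    for combo in itertools.product(*space.values()):
        params = dict(zip(space.keys(), combo))
        s = score(params)
        if best is None or s > best[0]:
            best = (s, params)
    return best


def random_search(space, score, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and return the best one seen."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in space.items()}
        s = score(params)
        if best is None or s > best[0]:
            best = (s, params)
    return best


def toy_score(params):
    # Hypothetical objective that peaks at learning_rate=0.1, max_depth=6.
    return -((params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 6) ** 2)


space = {"learning_rate": [0.01, 0.1, 0.5], "max_depth": [4, 6, 8]}
grid_best = grid_search(space, toy_score)
random_best = random_search(space, toy_score, n_iter=4)
print(grid_best, random_best)
```

Grid search costs one evaluation per combination (nine here), while random search caps the budget at `n_iter` and may miss the optimum; that trade-off is what the bandit-style methods mentioned on the slide try to improve on.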
  18. 18. Examples - Hyper parameter tuning XGBoost:

          import numpy as np
          import xgboost as xgb

          params = {'booster': 'gbtree', 'objective': 'binary:logistic', 'max_depth': 9,
                    'eval_metric': 'logloss', 'eta': 0.02, 'silent': 1, 'nthread': 4,
                    'subsample': 0.9, 'colsample_bytree': 0.9, 'scale_pos_weight': 1,
                    'min_child_weight': 3, 'max_delta_step': 3}
          num_rounds = 400
          params['seed'] = 523264626346  # 0.85533

          # train on the labelled data, treating NaN as missing
          dtrain = xgb.DMatrix(train, labels, missing=np.nan)
          clf = xgb.train(params, dtrain, num_rounds)

          # score the test set
          dtest = xgb.DMatrix(test, missing=np.nan)
          test_preds = clf.predict(dtest)
  19. 19. Examples - Hyper parameter tuning

          from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                                        RandomForestClassifier)
          from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

          rf = RandomForestClassifier(n_estimators=int(n_tree), max_features=int(mtry),
                                      criterion=criterion, max_depth=max_depth,
                                      n_jobs=-1, oob_score=True)
          rf.fit(X_train, y_train)

          # GradientBoostingClassifier has no n_jobs parameter
          gbm = GradientBoostingClassifier(n_estimators=n_tree, max_features=max_features,
                                           subsample=subsample)
          cv = StratifiedKFold(n_splits=10)
          scores = cross_val_score(gbm, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')

          ext = ExtraTreesClassifier(n_estimators=n_tree, criterion=criterion,
                                     max_features=max_features, n_jobs=-1)
          cv = StratifiedKFold(n_splits=10)
          scores = cross_val_score(ext, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')

          # exhaustive search over the random forest's hyperparameters
          param_grid = {'n_estimators': [500, 1000, 1500, 2000, 2500, 3000],
                        'criterion': ['gini', 'entropy'],
                        'max_features': [15, 20, 25, 30],
                        'max_depth': [4, 5, 6]}
          gs_cv = GridSearchCV(rf, param_grid, scoring='f1', n_jobs=-1,
                               verbose=2).fit(X_train[subspace], y_train)
          gs_cv.best_params_
  20. 20. Ensemble Modeling

          from scipy.stats import rankdata

          # collect each model's predictions
          # (ens['XGB1'] is used below but its assignment is not shown on the slide)
          ens['XGB2'] = xgb2_pred['Disbursed']
          ens['RF'] = rf_pred['Disbursed']
          ens['FTRL'] = ftrl_pred['Disbursed']

          # convert predictions to ranks so the models are on a common scale
          ens['XGB1_Rank'] = rankdata(ens['XGB1'], method='min')
          ens['XGB2_Rank'] = rankdata(ens['XGB2'], method='min')
          ens['XGB_Rank'] = 0.5 * ens['XGB1_Rank'] + 0.5 * ens['XGB2_Rank']
          ens['RF_Rank'] = rankdata(ens['RF'], method='min')
          ens['FTRL_Rank'] = rankdata(ens['FTRL'], method='min')

          # weighted blend of the models
          ens['Final'] = (0.75*ens['XGB_Rank'] + 0.25*ens['RF_Rank']) * 0.75 + 0.25 * ens['FTRL']
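
The rank-averaging idea behind this slide can be shown without pandas or SciPy. The sketch below re-implements "min" ranking in plain Python and blends two hypothetical models; the function names, scores, and weights are assumptions for the example.

```python
def min_rank(scores):
    """Rank scores in ascending order; tied scores share the lowest rank."""
    ordered = sorted(scores)
    return [ordered.index(s) + 1 for s in scores]


def blend(rank_lists, weights):
    """Weighted sum of several rank lists, row by row."""
    n_rows = len(rank_lists[0])
    return [
        sum(w * ranks[i] for w, ranks in zip(weights, rank_lists))
        for i in range(n_rows)
    ]


model_a = [0.2, 0.9, 0.2, 0.5]  # hypothetical predicted probabilities
model_b = [0.1, 0.8, 0.3, 0.7]

ranks_a = min_rank(model_a)
ranks_b = min_rank(model_b)
final = blend([ranks_a, ranks_b], weights=[0.75, 0.25])
print(ranks_a, ranks_b, final)
```

Ranking before blending puts models with differently calibrated outputs on a common scale, which matters when the evaluation metric depends only on ordering (AUC, for example).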
  21. 21. Type Safety: Integration with Apache Spark and Scala ● Modular, reusable, strongly typed machine learning workflows on top of Apache Spark. ● Type safety in Apache Spark with the Dataset API.
  22. 22. Structured Data in Apache Spark: DataFrames and Datasets
  23. 23. Unification of APIs in Apache Spark 2.0 (2016) ● DataFrame (untyped API) and Dataset (typed API) are unified under Dataset. ● Untyped API: DataFrame = Dataset[Row] (alias). ● Typed API: Dataset[T].
  24. 24. Why Dataset? ● Strong typing. ● Ability to use powerful lambda functions. ● Spark SQL's optimized execution engine (Catalyst, Tungsten). ● Can be constructed from JVM objects and manipulated using functional transformations (map, filter, flatMap, etc.). ● A DataFrame is a Dataset organized into named columns. ● DataFrame is simply a type alias of Dataset[Row].
  25. 25. DataFrame API Code

          // convert RDD -> DF with column names
          val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")

          // filter, groupBy, sum, and then agg()
          parsedDF.filter($"project" === "finance").
            groupBy($"sprint").
            agg(sum($"numStories").as("count")).
            limit(100).
            show(100)

          project  sprint  numStories
          finance  3       20
          finance  4       22
  26. 26. DataFrame -> SQL View -> SQL Query

          parsedDF.createOrReplaceTempView("audits")
          val results = spark.sql(
            """SELECT sprint, sum(numStories) AS count
               FROM audits
               WHERE project = 'finance'
               GROUP BY sprint
               LIMIT 100""")
          results.show(100)

          project  sprint  numStories
          finance  3       20
          finance  4       22
  27. 27. Why Structured APIs?

          // DataFrame
          data.groupBy("dept").avg("age")

          // SQL
          select dept, avg(age) from data group by 1

          // RDD
          data.map { case (dept, age) => dept -> (age, 1) }
            .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
            .map { case (dept, (age, c)) => dept -> age / c }
  28. 28. Catalyst in Spark: SQL AST / DataFrame / Datasets -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDD
  29. 29. Dataset API in Spark 2.x

          import org.apache.spark.sql.Dataset
          import spark.implicits._  // encoders needed by .as[Employee]

          val employeesDF = spark.read.json("employees.json")

          // Convert data to domain objects.
          // age is Long because Spark infers JSON integers as bigint.
          case class Employee(name: String, age: Long)
          val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
          val filterDS = employeesDS.filter(p => p.age > 3)

      Type-safe: operate on domain objects with compiled lambda functions.
  30. 30. Structured APIs in Apache Spark

                           SQL      DataFrames    Datasets
          Syntax Errors    Runtime  Compile Time  Compile Time
          Analysis Errors  Runtime  Runtime       Compile Time

      Analysis errors are caught before a job runs on the cluster.
  31. 31. Spark SQL API - Analysis Error example.
  32. 32. Spark SQL API - Analysis Error example.
  33. 33. TransmogrifAI - Type Safety is Everywhere! ● Value operations ● Feature operations ● Transformation Pipelines (aka Workflows)

          // Typed value operations
          def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList

          // Typed feature operations
          val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
          val tokens: Feature[TextList] = title.map(tokenize)

          // Transformation pipelines
          new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
  34. 34. Example Code Ref. https://github.com/fosscoder/transmogrifai-demo
  35. 35. A Case Story - Functional Flow - Spark as a SaaS ● User Interface: build workflow (source, target, transformations: filter, aggregation, joins, expressions, machine learning algorithms). ● Store metadata of the workflow in a document-based NoSQL store, e.g. MongoDB (ReactiveMongo). ● Scala / Spark job reads the metadata from the NoSQL store, e.g. MongoDB. ● Run on the cluster. ● Schedule using the Airflow SparkSubmit Operator.
  36. 36. A Case Story - High Level Technical Architecture - Spark as a SaaS: User Interface, Middleware, Akka HTTP Web Services
  37. 37. Apache Livy Configuration
  38. 38. Apache Livy Configuration
  39. 39. Apache Livy Configuration ...
  40. 40. Apache Livy Integration
  41. 41. Apache Livy Integration ...
  42. 42. Apache Livy Integration ...
  43. 43. Questions ? Thank you! Big Thanks to Scala.IO Organizers and Scala France Community! @khatri_chetan chetan.khatri@live.com
  44. 44. References
      [1] TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark from Salesforce Engineering [online] https://transmogrif.ai
      [2] A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator [online] https://github.com/rssanders3/airflow-spark-operator-plugin
      [3] Apache Spark - Unified Analytics Engine for Big Data [online] https://spark.apache.org/
      [4] Apache Livy [online] https://livy.incubator.apache.org/
      [5] Zayd's Blog - Why is machine learning 'hard'? [online] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
      [6] Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions [online] https://www.youtube.com/watch?v=uMapcWtzwyA&t=106s
      [7] Auto-Machine Learning: The Magic Behind Einstein [online] https://www.youtube.com/watch?v=YDw1GieW4cw&t=564s
