Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow

The data science lifecycle consists of multiple iterative steps: data collection, data cleaning/exploration, feature engineering, model training, model deployment, and scoring, among others. The process is often tedious and error-prone and requires considerable human effort. Beyond these challenges, when it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability is very high. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates the adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock uses MLflow for model and experiment tracking but extends and complements it by providing automatic logging, deeper integration with relational databases that often store confidential data, model optimizations, and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models, which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.


  1. Flock: E2E platform to democratize Data Science
  2. Agenda
     Chapter 1:
     • GSL’s vantage point
     • Why are we building (yet) another Data Science platform?
     • Flock platform
     • A technology showcase demo
     Chapter 2:
     • EGML applications in Microsoft: MLflow + ONNX + SQL Server
     • Capturing provenance
     • Future work
  3. GSL’s vantage point
     Applied research lab, part of the Office of the CTO, Azure Data
     • 640 patents
     • Features GAed or in Public Preview just this year
     • 0.5M LoC in OSS
     • 130+ publications in top-tier conferences/journals
     • 1.1M LoC in products
     • 600k servers running our code in Azure/Hydra
  4. Systems considered thus far: cloud providers, private services, OSS
  5. Systems comparison
     Dimensions evaluated, grouped by area:
     • Training: experiment tracking, managed notebooks, pipelines/projects, multi-framework, proprietary algos, distributed training, AutoML
     • Serving: batch prediction, on-prem deployment, model monitoring, model validation
     • Data management: data provenance, data testing, feature store, featurization DSL, labelling
     Legend: good support / OK support / no support / unknown
  6. Insights
     • Data Science is all about data
     • There is an emerging class of applications: Enterprise Grade Machine Learning (EGML – CIDR’20)
       – Dichotomy of “smarts” with a rudimentary process
       – Challenging on account of the dual nature of models: software & data
     • A couple of key pillars to enable EGML:
       • Tools for automating the DS lifecycle
         – Only O(100) IPython notebooks on GitHub import mlflow, out of 1M+ analyzed
         – O(1000) for sklearn pipelines
       • Data governance
       • (Unified data access)
  7. Flock: Data-driven development
     [Architecture diagram spanning an offline and an online phase: application telemetry feeds model training (LightGBM); the model is transformed into a NN, exported to ONNX, and optimized to ONNX'; deployment pushes the optimized model and policies (Dhalion) online, where job telemetry is tracked by job-id and incidents are closed/updated.]
  8. DEMO: Python code before and after Flock instrumentation

     User code:

        import pandas as pd
        import lightgbm as lgb
        from sklearn import metrics

        data_train = pd.read_csv("global_train_x_label_with_mapping.csv")
        data_test = pd.read_csv("global_test_x_label_with_mapping.csv")
        train_x = data_train.iloc[:, :-1].values
        train_y = data_train.iloc[:, -1].values
        test_x = data_test.iloc[:, :-1].values
        test_y = data_test.iloc[:, -1].values

        n_leaves = 8
        n_trees = 100
        clf = lgb.LGBMClassifier(num_leaves=n_leaves, n_estimators=n_trees)
        clf.fit(train_x, train_y)
        score = metrics.precision_score(test_y, clf.predict(test_x), average='macro')
        print("Precision Score on Test Data: " + str(score))

     Code automatically instrumented by Flock:

        import multiprocessing
        from functools import partial

        import pandas as pd
        import lightgbm as lgb
        import torch
        import onnx
        from onnx import optimizer
        from sklearn import metrics

        import mlflow
        import mlflow.onnx
        import mlflow.sklearn
        from flock import get_tree_parameters, LightGBMBinaryClassifier_Batched

        data_train = pd.read_csv('global_train_x_label_with_mapping.csv')
        data_test = pd.read_csv('global_test_x_label_with_mapping.csv')
        train_x = data_train.iloc[:, :-1].values
        train_y = data_train.iloc[:, -1].values
        test_x = data_test.iloc[:, :-1].values
        test_y = data_test.iloc[:, -1].values

        # Train the LightGBM classifier; Flock injects MLflow parameter and model logging.
        n_leaves = 8
        n_trees = 100
        clf = lgb.LGBMClassifier(num_leaves=n_leaves, n_estimators=n_trees)
        mlflow.log_param('clf_init_n_estimators', n_trees)
        mlflow.log_param('clf_init_num_leaves', n_leaves)
        clf.fit(train_x, train_y)
        mlflow.sklearn.log_model(clf, 'clf_model')
        score = metrics.precision_score(test_y, clf.predict(test_x), average='macro')
        mlflow.log_param('precision_score_average', 'macro')
        mlflow.log_param('score', score)
        print('Precision Score on Test Data: ' + str(score))

        # Convert the trained LightGBM model into a batched NN and export it to ONNX.
        n_features = 100
        activation = 'sigmoid'
        torch.set_num_threads(1)
        device = torch.device('cpu')
        model_name = 'griffon'
        model = clf.booster_.dump_model()
        n_features = clf.n_features_
        tree_infos = model['tree_info']
        pool = multiprocessing.Pool(8)
        parameters = pool.map(partial(get_tree_parameters, n_features=n_features), tree_infos)
        lgb_nn = LightGBMBinaryClassifier_Batched(parameters, n_features, activation).to(device)
        torch.onnx.export(lgb_nn, torch.randn(1, n_features).to(device), model_name + '_nn.onnx',
                          export_params=True,
                          operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK)

        # Optimize the ONNX graph, log it with MLflow, and score through the pyfunc flavor.
        passes = ['eliminate_deadend', 'eliminate_identity', 'eliminate_nop_monotone_argmax',
                  'eliminate_nop_transpose', 'eliminate_unused_initializer',
                  'extract_constant_to_initializer', 'fuse_consecutive_concats',
                  'fuse_consecutive_reduce_unsqueeze', 'fuse_consecutive_squeezes',
                  'fuse_consecutive_transposes', 'fuse_matmul_add_bias_into_gemm',
                  'fuse_transpose_into_gemm', 'lift_lexical_references']
        model = onnx.load(model_name + '_nn.onnx')
        opt_model = optimizer.optimize(model, passes)
        mlflow.onnx.log_model(opt_model, 'opt_model')
        pyfunc_loaded = mlflow.pyfunc.load_pyfunc('opt_model',
                                                  run_id=mlflow.active_run().info.run_uuid)
        scoring = pyfunc_loaded.predict(pd.DataFrame(test_x[:1].astype('float32'))).values
        print('Scoring through mlflow pyfunc: ', scoring)
        mlflow.log_param('pyfunc_scoring', scoring[0][0])
  9. Griffon: why is my job slow today? (ACM SoCC 2019)
     Current on-call workflow:
     • A job goes out of SLA and Support is alerted
     • A support engineer (SE) spends hours of manual labor looking through hundreds of metrics
     • After 5-6 hours of investigation, the reason for the job slowdown is found
     Revised on-call workflow with Griffon:
     • A job goes out of SLA and the SE is alerted
     • The job ID is fed through Griffon and the top reasons for job slowdown are generated automatically
     • The reason is found among the top five generated by Griffon; all the metrics Griffon has looked at can be ruled out and the SE can direct their efforts to a smaller set of metrics
  10. EGML applications in Microsoft
      (1) Model training with TensorFlow, Spark, PyTorch, H2O, Keras, Scikit-learn, …
      (2) Model generation and conversion to ONNX
      (3) MLflow tracks the runs (parameters, code versions, metrics, output files) and visualizes the output, producing an MLflow model (ONNX flavor) with SQL Server as the artifact/backend store
      (4) Serving in SQL Server (Model.v1 … Model.vn)
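      A minimal sketch of the train → convert-to-ONNX → track-with-MLflow portion of this pipeline. The toy dataset and model, and the use of the skl2onnx converter, are illustrative assumptions rather than Flock's actual implementation:

         import mlflow
         import mlflow.onnx
         from sklearn.datasets import load_iris
         from sklearn.linear_model import LogisticRegression
         from skl2onnx import convert_sklearn
         from skl2onnx.common.data_types import FloatTensorType

         # (1)-(2) Train a model and convert it to ONNX.
         X, y = load_iris(return_X_y=True)
         clf = LogisticRegression(max_iter=200).fit(X, y)
         onnx_model = convert_sklearn(
             clf, initial_types=[("input", FloatTensorType([None, X.shape[1]]))])

         # (3) Track parameters and the ONNX-flavored model with MLflow.
         with mlflow.start_run():
             mlflow.log_param("model_type", "LogisticRegression")
             mlflow.onnx.log_model(onnx_model, "model")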
  11. ONNX: interoperability across ML frameworks
      • Open format to represent ML models
      • Backed by Microsoft, Amazon, Facebook, and several hardware vendors
  12. ONNX exchange format
      • Open format
      • Enables interoperability across frameworks
      • Many supported frameworks to import/export: Caffe2, PyTorch, CNTK, MXNet, TensorFlow, CoreML
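      For example, exporting a toy PyTorch model to the ONNX format might look like the following sketch; the model and file name are illustrative:

         import torch
         import torch.nn as nn

         model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
         dummy_input = torch.randn(1, 4)  # example input used to trace the graph
         torch.onnx.export(model, dummy_input, "model.onnx",
                           input_names=["input"], output_names=["output"])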
  13. ONNX Runtime
      • Cross-platform, high-performance scoring engine for ONNX models
      • Open-sourced at the end of 2018
      • Used in millions of Windows devices and powers core models across Office, Bing, and Azure
      Workflow: train a model using a popular framework such as TensorFlow → convert the model to ONNX format → perform inference efficiently across multiple platforms and hardware using ONNX Runtime
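      A minimal scoring sketch with the onnxruntime Python package; the file name and input shape are illustrative:

         import numpy as np
         import onnxruntime as ort

         sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
         input_name = sess.get_inputs()[0].name
         x = np.random.rand(1, 4).astype(np.float32)
         outputs = sess.run(None, {input_name: x})  # list of output arrays
         print(outputs[0])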
  14. ONNX Runtime and optimizations
      Key design points:
      • Graph IR
      • Support for multiple backends (e.g., CPU, GPU, FPGA)
      • Graph optimizations: rule-based optimizer inspired by DB optimizers
      • Improved inference time and memory consumption (examples: 117 ms → 34 ms; 250 MB → 200 MB)
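      The rule-based optimizer itself is internal to the runtime, but its graph optimizations can be controlled through the public SessionOptions API; a sketch with illustrative file names:

         import onnxruntime as ort

         so = ort.SessionOptions()
         so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
         so.optimized_model_filepath = "model_optimized.onnx"  # persist the optimized graph
         sess = ort.InferenceSession("model.onnx", sess_options=so,
                                     providers=["CPUExecutionProvider"])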
  15. ONNX Runtime in production
      • ~40 ONNX models in production
      • >10 orgs are migrating models to ONNX Runtime
      • Average speedup: 2.7x
  16. ONNX Runtime in production: Office grammar-checking model – 14.6x reduction in latency
  17. MLflow + ONNX
      • MLflow (1.0.0) now has built-in support for ONNX models
      • ONNX model flavor for saving, loading, and evaluating ONNX models
      • Example: train a sklearn model (see the sketch below)
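      A minimal sketch of the ONNX flavor; the toy data, model, and skl2onnx conversion are assumptions for illustration:

         import mlflow
         import mlflow.onnx
         import mlflow.pyfunc
         import numpy as np
         import pandas as pd
         from sklearn.linear_model import LinearRegression
         from skl2onnx import convert_sklearn
         from skl2onnx.common.data_types import FloatTensorType

         # Train a sklearn model and convert it to ONNX.
         X = np.random.rand(50, 3).astype(np.float32)
         y = X @ np.array([1.0, 2.0, 3.0], dtype=np.float32)
         onnx_model = convert_sklearn(LinearRegression().fit(X, y),
                                      initial_types=[("input", FloatTensorType([None, 3]))])

         # Save with the ONNX flavor, then load and evaluate via pyfunc (backed by ONNX Runtime).
         with mlflow.start_run() as run:
             mlflow.onnx.log_model(onnx_model, "model")
             model_uri = "runs:/{}/model".format(run.info.run_id)

         loaded = mlflow.pyfunc.load_model(model_uri)
         print(loaded.predict(pd.DataFrame(X[:2])))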
  18. Serving the ONNX model
      Deploy the server:
         mlflow models serve -m /artifacts/model -p 1234
      Perform inference (ONNX Runtime is automatically invoked):
         curl -X POST -H "Content-Type: application/json; format=pandas-split" --data '{"columns": ["alcohol", "chlorides", "citric acid", "density", "fixed acidity", "free sulfur dioxide", "pH", "residual sugar", "sulphates", "total sulfur dioxide", "volatile acidity"], "data": [[12.8, 0.029, 0.48, 0.98, 6.2, 29, 3.33, 1.2, 0.39, 75, 0.66]]}' http://127.0.0.1:1234/invocations
      Response: [6.379428821398614]
  19. MLflow + SQL Server
      • MLflow can use SQL Server as an artifact store (and other RDBMSs as well) (PR)
      • The models are stored in binary format in the database along with other metadata such as model name, size, run_id, etc.

         import mlflow
         import mlflow.onnx
         from mlflow.tracking import MlflowClient

         client = MlflowClient()
         exp_name = "test"
         client.create_experiment(
             exp_name,
             artifact_location="mssql+pyodbc://sa:password@ipAddress:port/dbName?driver=ODBC+Driver+17+for+SQL+Server")
         mlflow.set_experiment(exp_name)
         mlflow.onnx.log_model(onnx, "model")
  20. Provenance in EGML applications
      • Need for end-to-end provenance tracking
      • Multiple systems are involved in each pipeline: SQL data pre-processing → Python script → model training
      • Why it matters: compliance; keeping ML models up-to-date
  21. Tracking provenance in Python scripts
      Pipeline: Python script → Python AST generation → dependencies between variables and functions → semantic annotation through a knowledge base of common ML libraries
      • Automatically identify models, metrics, and hyperparameters in Python scripts
      • Answer questions such as: “Which columns in a dataset were used for model training?”
      Coverage results:
         Dataset     #Scripts   %ML models covered   %Training datasets covered
         Kaggle      49         95%                   61%
         Microsoft   37         100%                  100%
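      A toy sketch of the AST idea using Python's ast module (ast.unparse requires Python 3.9+). The hard-coded “knowledge base” and the script snippet are illustrative; Flock's actual annotation relies on a much richer knowledge base of common ML libraries:

         import ast

         SOURCE = ("clf = LGBMClassifier(num_leaves=8, n_estimators=100)\n"
                   "clf.fit(train_x, train_y)\n")

         # Tiny illustrative "knowledge base" of constructors we recognize as models.
         KNOWN_MODELS = {"LGBMClassifier", "LogisticRegression", "RandomForestClassifier"}

         class ModelFinder(ast.NodeVisitor):
             def __init__(self):
                 self.models = {}      # variable name -> (class name, hyperparameters)
                 self.trained = set()  # variables on which .fit() was called

             def visit_Assign(self, node):
                 call = node.value
                 if (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
                         and call.func.id in KNOWN_MODELS):
                     params = {kw.arg: ast.unparse(kw.value) for kw in call.keywords}
                     for target in node.targets:
                         if isinstance(target, ast.Name):
                             self.models[target.id] = (call.func.id, params)
                 self.generic_visit(node)

             def visit_Call(self, node):
                 if (isinstance(node.func, ast.Attribute) and node.func.attr == "fit"
                         and isinstance(node.func.value, ast.Name)):
                     self.trained.add(node.func.value.id)
                 self.generic_visit(node)

         finder = ModelFinder()
         finder.visit(ast.parse(SOURCE))
         print(finder.models)   # {'clf': ('LGBMClassifier', {'num_leaves': '8', 'n_estimators': '100'})}
         print(finder.trained)  # {'clf'}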
  22. Future work
      • MLflow: integration with metadata management systems such as Apache Atlas
      • Flock:
        – Data governance
        – Generalize and extend coverage of auto-tracking and ML → NN conversion
        – Provenance of end-to-end pipelines; combine with other systems (e.g., SQL, Spark)
