Managing and Versioning Machine Learning Models in Python – Simon Frid
Practical machine learning is becoming messy: there are plenty of algorithms, but a lot of infrastructure is still needed to manage and organize the models and datasets. Estimators and Django-Estimators are two Python packages that help version datasets and models for deployment and an effective workflow.
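Whatever the two packages expose at the API level, the underlying idea is compact: tie each model artifact to a hash of the exact data it was trained on. A minimal sketch in plain Python (this is not the estimators/django-estimators API; save_versioned and the registry layout are hypothetical):

```python
import hashlib
import json
import pickle
from pathlib import Path

def save_versioned(model, X, y, registry: Path = Path("model_registry")) -> str:
    """Persist a model keyed by a hash of the data it was trained on."""
    registry.mkdir(exist_ok=True)
    data_hash = hashlib.sha256(pickle.dumps((X, y))).hexdigest()[:12]
    model_path = registry / f"model_{data_hash}.pkl"
    model_path.write_bytes(pickle.dumps(model))
    # keep an index mapping data versions to model artifacts
    index_path = registry / "index.json"
    index = json.loads(index_path.read_text()) if index_path.exists() else {}
    index[data_hash] = str(model_path)
    index_path.write_text(json.dumps(index, indent=2))
    return data_hash
```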
Version Control in Machine Learning + AI (Stanford) – Anand Sampat
The talk starts by outlining the history of conventional version control before explaining QoDs (Quantitative Oriented Developers) and the unique problems their ML systems pose from an operations perspective (MLOps). The only status quo solutions are proprietary in-house pipelines (exclusive to Uber, Google, Facebook) and, for everyone else, manual tracking and fragile "glue" code.
Datmo works to solve this by empowering QoDs in two ways: making MLOps manageable and simple (rather than completely abstracted away), and reducing the amount of glue code so as to ensure more robust end-to-end pipelines.
The talk walks through a simple example of using Datmo with the Iris classification dataset. Later workshops will expand on how Datmo can work with other data pipelining tools.
Provenance in Production-Grade Machine Learning – Anand Sampat
Over the next few years, every company must develop a strategy to leverage artificial intelligence and machine learning to stay relevant and beat out competitors. This requires hiring talented data scientists as well as DevOps and data engineers who can put these models into production. Today, finding that perfect combination of talent can be difficult, but a focus on retraining and productivity tools can increase a small team's impact on business ROI by over 10x. In this technical talk, we discuss how enterprises can better prepare their employees to deploy artificial intelligence and machine learning into production by using the same techniques used in software to add provenance, reliability, and efficiency to these processes. Specifically, we describe the benefits of adding provenance, including reliable deployments and builds, A/B testing, continuous deployment, and automation, and show how they can decrease the time to business ROI by over 10x.
Building A Production-Level Machine Learning Pipeline – Robert Dempsey
With so many options to choose from, how do you select the right technologies for your machine learning pipeline? Do you purchase bare metal and hire a DevOps team, install Spark on EC2 instances, use EMR and other AWS services, combine Spark and Elasticsearch?! View this talk to get a first-hand account of building ML pipelines: which options were considered, how the final solution was selected, the tradeoffs made, and the final results.
Monitoring AI applications with AI
The best-performing offline algorithm can lose in production. The most accurate model does not always improve business metrics. Environment misconfiguration or upstream data-pipeline inconsistency can silently kill model performance. Neither prod-ops, data science, nor engineering teams are equipped to detect, monitor, and debug these kinds of incidents.
Could Microsoft have tested the Tay chatbot in advance and then monitored and adjusted it continuously in production to prevent its unexpected behaviour? Real mission-critical AI systems require an advanced monitoring and testing ecosystem that enables continuous, reliable delivery of machine learning models and data pipelines into production. Common production incidents include (a minimal drift-check sketch follows the list):
Data drift, new data, wrong features
Vulnerability issues, malicious users
Concept drift
Model degradation
Biased training set / training issues
Performance issues
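As an illustration of catching the first incident type, a drift check can be as simple as comparing each live feature's distribution against a training-time reference. A minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy (the function name and alpha threshold are illustrative, not the talk's tooling):

```python
import numpy as np
from scipy import stats

def detect_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01):
    """Return indices of features whose live distribution diverges from training."""
    drifted = []
    for col in range(reference.shape[1]):
        # two-sample Kolmogorov-Smirnov test per feature
        _, p_value = stats.ks_2samp(reference[:, col], live[:, col])
        if p_value < alpha:
            drifted.append(col)
    return drifted
```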
In this demo-based talk we discuss a solution, tooling, and architecture that allow machine learning engineers to be involved in the delivery phase and take ownership of the deployment and monitoring of machine learning pipelines.
It allows data scientists to safely deploy early results as end-to-end AI applications in a self-serve mode, without assistance from the engineering and operations teams. It shifts experimentation and even training from offline datasets to live production and closes the feedback loop between research and production.
The technical part of the talk covers the following topics (a small anomaly-detection sketch follows the list):
Automatic Data Profiling
Anomaly Detection
Clustering of inputs and outputs of the model
A/B Testing
Service Mesh, Envoy Proxy, traffic shadowing
Stateless and stateful models
Monitoring of regression, classification and prediction models
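For a concrete flavor of the anomaly-detection item, one common pattern is to fit an outlier detector on the training inputs and gate live requests with it. A minimal sketch with scikit-learn's IsolationForest (not the talk's actual stack; gate_request is a hypothetical helper):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# fit an outlier detector on the inputs the model was trained on
X_train = np.random.RandomState(0).normal(size=(1000, 4))
detector = IsolationForest(contamination=0.01, random_state=0).fit(X_train)

def gate_request(x: np.ndarray) -> bool:
    """Return True if an incoming feature vector looks like training data."""
    return detector.predict(x.reshape(1, -1))[0] == 1  # -1 marks an outlier
```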
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF... – Bill Liu
This document discusses modern machine learning pipelines and popular open source tools to build them. It defines key characteristics of ML pipelines like experiment tracking, hyperparameter optimization, distributed execution, and metadata/data versioning. Popular tools covered are KubeFlow for Kubernetes+TensorFlow, Airflow for data and feature engineering, MLflow for experiment tracking, and TensorFlow Extended (TFX) libraries. The document demonstrates these tools and argues that while the field is emerging, simplicity is important and one should only use necessary components of different tools.
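To give a sense of the orchestration piece, a minimal Airflow DAG wiring a feature step into a training step might look like this (a sketch assuming Airflow 2.x import paths; the task bodies are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def prepare_features():
    print("building features")  # placeholder feature-engineering step

def train_model():
    print("training model")     # placeholder training step

with DAG(
    dag_id="ml_pipeline",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    features = PythonOperator(task_id="prepare_features", python_callable=prepare_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    features >> train  # training only runs after features succeed
```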
DevOps and Machine Learning (Geekwire Cloud Tech Summit) – Jasjeet Thind
DevOps and Machine Learning: how do you test and deploy real-time machine learning services, given that machine learning algorithms produce nondeterministic behavior even for the same input?
This document discusses challenges in running machine learning applications in production environments. It notes that while Kaggle competitions focus on accuracy, real-world applications require balancing accuracy with interpretability, speed and infrastructure constraints. It also emphasizes that machine learning in production is as much a software and systems problem as a modeling problem. Key aspects that are discussed include flexible and scalable deployment architectures, model versioning, packaging and serving, online evaluation and experiments, and ensuring reproducibility of results.
End-to-End Machine learning pipelines for Python driven organizations - Nick ... – PyData
The recent advances in machine learning and artificial intelligence are amazing! Yet, in order to have real value within a company, data scientists must be able to get their models off of their laptops and deployed within a company’s data pipelines and infrastructure. In this session, I'll demonstrate how one-off experiments can be transformed into scalable ML pipelines with minimal effort.
The document discusses challenges that arise when trying to scale analytics teams by building business intelligence (BI) tools, and proposes an alternative approach of hiring "Analysis Developers" to help analysts scale their work using R. Some key points made include:
- Building BI tools often leads to dysfunction as product and engineering teams compete to build the simplest tools.
- Analysis Developers would develop reusable R packages and help all analysts work more efficiently through skills training.
- This avoids issues like static tools becoming unstable as requirements change, and allows for flexible, reproducible analyses.
- Promoting skills acquisition rather than deliverables helps analysts progress in their careers.
The document provides tips for building maintainable and scalable projects. It discusses the importance of following best practices like writing tests, using version control, and avoiding premature optimization. It also warns against technical debt and recommends focusing on simplicity over complexity when starting a new project.
The document provides an overview of seamless MLOps using Seldon and MLflow. It discusses how MLOps is challenging due to the wide range of requirements across the ML lifecycle. MLflow helps with training by allowing experiment tracking and model versioning. Seldon Core helps with deployment by providing servers to containerize models and infrastructure for monitoring, A/B testing, and feedback. The demo shows training models with MLflow, deploying them to Seldon for A/B testing, and collecting feedback to optimize models.
Deploying ML models to production (frequently and safely) - PYCON 2018 – David Tan
1. The document discusses principles and practices for reliably and repeatedly deploying machine learning models from development to production.
2. It recommends adopting continuous delivery practices like automating environment setup, implementing a testing pyramid, and setting up continuous integration and delivery pipelines to enable frequent, safe model iterations.
3. The talk provides demonstrations of these techniques and emphasizes the importance of cross-functional teams, starting simply, and continuously improving data and processes.
SplunkLive! Seattle - Splunk for Developers – Grigori Melnik
This document discusses Splunk's developer platform and resources for application development. It provides an overview of empowering developers to gain application intelligence, build Splunk apps, and integrate and extend Splunk. The document discusses building Splunk apps and provides resources for developers including tutorials, code samples, downloads, developer guidance, Splunk Base, GitHub, Twitter, and blogs. It also promotes Splunk's developer license and platform approach with search, analytics, and an open ecosystem to build solutions.
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production... – Robert Grossman
The document discusses lessons learned from moving machine learning algorithms to production environments, referred to as "AnalyticOps". It introduces AnalyticOps as establishing an environment where building, validating, deploying, and running analytic models happens rapidly, frequently, and reliably. A key challenge is deploying analytic models into operations, products, and services. The document discusses strategies for deploying models, including scoring engines that integrate analytic models into operational workflows using a model interchange format. It provides two case studies as examples.
My talk at the Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
One option is to export the model into PMML and then import it into a separate scoring engine. The idea of interoperability is great, but it has multiple challenges: code duplication, limited extensibility, inconsistency, and extra moving parts. In this talk we discuss an alternative solution that does not introduce custom model formats or new standards, is not based on an export/import workflow, and shares the Apache Spark API.
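A sketch of what sharing the Spark API buys: the serving side reloads the exact PipelineModel the training side saved, so no export/import step or custom format is involved (the paths and feature columns here are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("train").getOrCreate()
df = spark.read.parquet("s3://bucket/training.parquet")  # hypothetical path

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = Pipeline(stages=[assembler, lr]).fit(df)

# training side persists the whole pipeline...
model.write().overwrite().save("s3://bucket/models/churn-v1")

# ...and the serving side reloads the identical PipelineModel
served = PipelineModel.load("s3://bucket/models/churn-v1")
```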
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... – Databricks
The explosion of data volume in the years to come challenges the idea of a centralized cloud infrastructure that handles all business needs. Edge computing comes to the rescue by pushing computation and data analysis to the edge of the network, avoiding data exchange where that makes sense. One area where data exchange can impose a big overhead is scoring ML models, especially where the data to score are files like images, e.g. in a computer vision application.
Another concern in some applications is keeping data as private as possible, which is another case where keeping things local makes sense. In this talk we will discuss current needs and recent advances in model serving, such as newly introduced formats for pushing models to edge nodes, e.g. mobile phones, and how a unified model-serving architecture could cover current and future needs for both data scientists and data engineers. This architecture is based, among other things, on training models in a distributed fashion with TensorFlow and leveraging Spark for cleaning data before training (e.g. using the TensorFlow connector).
Finally we will describe a microservice-based approach for scoring models back on the cloud infrastructure side (where bandwidth can be high), e.g. using TensorFlow Serving, and for updating models remotely with a pull approach for edge devices. We will also talk about implementing the proposed architecture and how it might look on a modern deployment environment, e.g. Kubernetes.
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ... – PyData
To productionize data science work (and have it taken seriously by software engineers, CTOs, clients, or the open source community), you need to write tests! Except… how can you test code that performs nondeterministic tasks like natural language parsing and modeling? This talk presents an approach to testing probabilistic functions in code, illustrated with concrete examples written for Pytest.
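Two patterns that recur in this style of testing, sketched with Pytest: pin the seed so a stochastic function becomes deterministic, and assert a statistical property within a tolerance band rather than an exact value (noisy_mean_estimator stands in for real model code):

```python
import numpy as np
import pytest

def noisy_mean_estimator(samples: np.ndarray) -> float:
    return float(np.mean(samples))

def test_deterministic_with_fixed_seed():
    a = noisy_mean_estimator(np.random.default_rng(42).normal(5.0, 1.0, 10_000))
    b = noisy_mean_estimator(np.random.default_rng(42).normal(5.0, 1.0, 10_000))
    assert a == b  # same seed, same result

def test_statistical_property():
    estimate = noisy_mean_estimator(np.random.default_rng(0).normal(5.0, 1.0, 10_000))
    # assert a tolerance band instead of an exact value
    assert estimate == pytest.approx(5.0, abs=0.1)
```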
mlflow: Accelerating the End-to-End ML lifecycle – Databricks
Building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what’s running where, and to redeploy and rollback updated models is much harder.
In this talk, I’ll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with over 50 contributors and new features including language APIs, integrations with popular ML libraries, and storage backends. I’ll show how MLflow works and explain how to get started with MLflow.
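For a taste of the tracking API, a run that logs parameters, a metric, and the model artifact itself might look like this (the scikit-learn model and metric names are illustrative, not from the talk):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=0
)

with mlflow.start_run():
    params = {"C": 0.5, "max_iter": 200}
    model = LogisticRegression(**params).fit(X_train, y_train)
    mlflow.log_params(params)                                   # hyperparameters
    mlflow.log_metric("accuracy", model.score(X_test, y_test))  # evaluation
    mlflow.sklearn.log_model(model, "model")                    # versioned artifact
```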
The Quest for an Open Source Data Science Platform – QAware GmbH
Cloud Native Night July 2019, Munich: Talk by Jörg Schad (@joerg_schad, Head of Engineering & ML at ArangoDB)
Abstract: With the rapid and recent rise of data science, the machine learning platforms being built are becoming more complex. For example, consider the various Kubeflow components: Distributed Training, Jupyter Notebooks, CI/CD, Hyperparameter Optimization, Feature Store, and more. Each of these components produces metadata: different versions of datasets, different versions of a Jupyter notebook, different training parameters, test/training accuracy, different features, model-serving statistics, and many more.
For production use it is critical to have a common view across all this metadata, as we have to ask questions such as: Which Jupyter notebook was used to build model xyz currently running in production? If there is new data for a given dataset, which models (currently serving in production) have to be updated?
In this talk, we look at existing implementations, in particular MLMD, part of the TensorFlow ecosystem. Further, we propose a first draft of an (MLMD-compatible) universal Metadata API. We demo the first implementation of this API using ArangoDB.
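For orientation, recording a versioned dataset artifact with the ml-metadata (MLMD) Python package looks roughly like the sketch below; this is my reading of the MLMD API rather than the talk's code, and the SQLite filename and property names are hypothetical:

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# back the store with a local SQLite file (hypothetical filename)
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "mlmd.sqlite"
config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE
)
store = metadata_store.MetadataStore(config)

# declare an artifact type for datasets with a version property
dataset_type = metadata_store_pb2.ArtifactType()
dataset_type.name = "DataSet"
dataset_type.properties["version"] = metadata_store_pb2.STRING
type_id = store.put_artifact_type(dataset_type)

# record one concrete, versioned dataset artifact
artifact = metadata_store_pb2.Artifact()
artifact.type_id = type_id
artifact.uri = "s3://bucket/data/v3"
artifact.properties["version"].string_value = "v3"
[artifact_id] = store.put_artifacts([artifact])
```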
Reproducible AI using MLflow and PyTorch – Databricks
Model reproducibility is becoming the next frontier for successful AI model building and deployment, in both research and production scenarios. In this talk, we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, adding traceability and speeding up collaboration on AI projects.
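One low-level ingredient of reproducibility is pinning every random number generator a training run touches. A common PyTorch seeding helper, sketched here as one piece of the puzzle rather than the talk's full recipe:

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Pin every RNG a training run touches so results can be reproduced."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # trade a little speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```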
Automated Hyperparameter Tuning, Scaling and Tracking – Databricks
Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning.
For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters.
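To make the Bayesian-style search concrete, a small Hyperopt example tuning a scikit-learn model with the TPE algorithm (the search space and model are illustrative, not the talk's demo):

```python
from hyperopt import Trials, fmin, hp, tpe
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(params):
    clf = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        random_state=0,
    )
    # hyperopt minimizes, so negate the cross-validated accuracy
    return -cross_val_score(clf, X, y, cv=3).mean()

space = {
    "n_estimators": hp.quniform("n_estimators", 10, 200, 10),
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
}
best = fmin(objective, space, algo=tpe.suggest, max_evals=25, trials=Trials())
print(best)  # best hyperparameters found
```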
Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
Apache PySpark MLlib integration with MLflow for automatically tracking tuning
Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking
Recording and notebooks will be provided after the webinar so that you can practice at your own pace.
Presenters
Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and the Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two machine learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from the ground up to multi-million dollars in ARR. Yifan started his career as a researcher in quantum computing. He received his B.S. from UC Berkeley and his Master's from MIT.
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio... – Databricks
At Avast we complete over 17 million phishing detections a day, providing crucial online protection against this type of attack.
In this talk Joao Da Silva and Yury Kasimov will present the MATS stack for productionisation of Machine Learning and their journey into integrating model tracking, storage, cross-system orchestration and model deployments for a complete and modern machine learning pipeline.
Importance of ML Reproducibility & Applications with MLflow – Databricks
With data as a valuable currency and the architecture of reliable, scalable Data Lakes and Lakehouses continuing to mature, it is crucial that machine learning training and deployment techniques keep up to realize value. Reproducibility, efficiency, and governance in training and production environments rest on the shoulders of both point in time snapshots of the data and a governing mechanism to regulate, track, and make best use of associated metadata.
This talk will outline the challenges and importance of building and maintaining reproducible, efficient, and governed machine learning solutions as well as posing solutions built on open source technologies – namely Delta Lake for data versioning and MLflow for efficiency and governance.
This document discusses MLOps at OLX, including:
- The main areas of data science work at OLX like search, recommendations, fraud detection, and content moderation.
- How OLX uses teams structured by both feature areas and roles to collaborate on projects.
- A maturity model for MLOps with levels from no MLOps to fully automated processes.
- How OLX has improved from siloed work to cross-functional teams and adding more automation to model creation, release, and application integration over time.
Scaling Ride-Hailing with Machine Learning on MLflow – Databricks
GOJEK, the Southeast Asian super-app, has seen explosive growth in both users and data over the past three years. Today the technology startup uses big-data-powered machine learning to inform decision-making in its ride-hailing, lifestyle, logistics, food delivery, and payment products, from selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning.
Building production grade machine learning systems at GOJEK wasn't always easy. Data processing and machine learning pipelines were brittle, long running, and had low reproducibility. Models and experiments were difficult to track, which led to downstream problems in production during serving and model evaluation. In this talk we will cover these and other challenges that we faced while trying to scale end-to-end machine learning systems at GOJEK. We will then introduce MLflow and explore the key features that make it useful as part of an ML platform. Finally, we will show how introducing MLflow into the ML life cycle has helped to solve many of the problems we faced while scaling machine learning at GOJEK.
"
Python is generally slower than C but offers great performance through libraries and C interop. To optimize: profile code to find bottlenecks, simplify algorithms, vectorize with NumPy, use map/filter/reduce instead of loops, and move array allocation outside loops. For critical sections, use Cython or C extensions, but avoid premature optimization and have objective benchmarks.
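A quick self-contained benchmark of that vectorization advice: the same sum of squares as an interpreted Python loop versus a single NumPy call (absolute timings will vary by machine):

```python
import timeit

import numpy as np

data = np.random.default_rng(0).random(1_000_000)

def loop_sum_squares(xs) -> float:
    total = 0.0
    for x in xs:  # interpreted Python loop: slow
        total += x * x
    return total

def vectorized_sum_squares(xs) -> float:
    return float(np.dot(xs, xs))  # one C-level operation

print("loop:      ", timeit.timeit(lambda: loop_sum_squares(data), number=3))
print("vectorized:", timeit.timeit(lambda: vectorized_sum_squares(data), number=3))
```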
Scalable Automatic Machine Learning with H2O – Sri Ambati
In this presentation, Parul Pandey will provide a history and overview of the field of "Automatic Machine Learning" (AutoML), followed by a detailed look inside H2O's open source AutoML algorithm. H2O AutoML provides an easy-to-use interface which automates data pre-processing, training, and tuning of a large selection of candidate models (including multiple stacked ensemble models for superior performance). The result of the AutoML run is a "leaderboard" of H2O models which can be easily exported for use in production. AutoML is available in all H2O interfaces (R, Python, Scala, web GUI) and, due to the distributed nature of the H2O platform, can scale to very large datasets. The presentation will end with a demo of H2O AutoML in R and Python, including a handful of code examples to get you started using automatic machine learning on your own projects.
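To give a sense of the interface, a minimal H2O AutoML run in Python might look like this (the CSV path and column layout are hypothetical):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")   # hypothetical dataset
x = train.columns[:-1]                 # features
y = train.columns[-1]                  # target
train[y] = train[y].asfactor()         # mark target as categorical

aml = H2OAutoML(max_models=10, seed=1)
aml.train(x=x, y=y, training_frame=train)

print(aml.leaderboard)  # ranked candidate models
best = aml.leader       # top model, exportable for production
```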
Parul's Bio:
Parul is a Data Science Evangelist at H2O.ai. She combines data science, evangelism, and community in her work. Her emphasis is on spreading information about H2O and Driverless AI to as many people as possible. She is also an active writer and has contributed to various national and international publications.
While the adoption of machine learning and deep learning techniques continues to grow, many organizations find it difficult to actually deploy these sophisticated models into production. It is common to see data scientists build powerful models, yet these models are not deployed because of the complexity of the technology used or a lack of understanding of the process of pushing them into production.
As part of this talk, I will review several deployment design patterns for both real-time and batch use cases. I'll show how these models can be deployed as scalable, distributed deployments within the cloud, scaled across Hadoop clusters, served as APIs, and deployed within streaming analytics pipelines. I will also touch on topics related to security, end-to-end governance, pitfalls, challenges, and useful tools across a variety of platforms. This presentation will involve demos and sample code for the deployment design patterns.
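As the simplest of those patterns, serving a model behind an HTTP API can be sketched in a few lines of Flask; this is a generic illustration rather than the talk's sample code, and model.pkl is a hypothetical artifact:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:  # hypothetical serialized model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    prediction = model.predict([features]).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```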
Reproducibility and experiments management in Machine Learning – Mikhail Rozhkov
Machine learning is becoming more and more common practice in many companies. ML teams are growing, and collaboration extends beyond the office and personal laptops. The complexity of ML projects leads to adopting distributed team collaboration, cloud-based infrastructure, and distributed machine learning. A well-defined, manageable process for ML experiments becomes a central issue. Practices such as automated pipelines and versioning of models and datasets help establish a manageable process and provide reproducible results.
This talk helps you get started with model and dataset versioning using open source tools: DVC, MLflow, Luigi, etc.
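As a taste of the DVC side of that toolchain, its Python API can read one pinned version of a dataset straight out of a Git history (the repo URL and tag below are hypothetical):

```python
import dvc.api
import pandas as pd

# open a DVC-tracked file at an exact data version; the git
# tag/commit passed as rev pins which version you get back
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",
    rev="v1.2",
) as f:
    train = pd.read_csv(f)
```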
Abhishank Gaba has a BASc in Mechatronics Engineering from the University of Waterloo with a GPA of 4.0. He has experience leading projects involving machine learning and computer vision to detect critical points in pipes and identify tissue patterns. His relevant work experience includes product management and software development roles at startups focused on ignition interlock devices and smart underwear. He also has experience in quality assurance and software development.
Agile Data Science on Greenplum Using Airflow - Greenplum Summit 2019 – VMware Tanzu
This document discusses using Apache Airflow to build agile data science pipelines on Greenplum Database. It outlines the typical data science phases of discovery and operationalization. In the discovery phase, rapid iteration and experimentation is used. The operationalization phase focuses on building automated, testable pipelines for data preparation, model training/scoring, monitoring, and APIs. It provides examples of directed acyclic graphs (DAGs) for end-to-end data processing, model training, model scoring, and model re-training. It emphasizes the importance of testing, monitoring failures, and fixing errors for responsive pipelines. Greenplum and Jupyter notebooks enable agile data science in discovery, while Greenplum, Airflow,
Data kitchen 7 agile steps - big data fest 9-18-2015 – DataKitchen
This document discusses applying agile principles and practices to data and analytics teams to address the complexity they face. It outlines seven steps to doing agile data work: 1) adding tests, 2) modularizing and containerizing work, 3) using branching and merging, 4) employing multiple environments, 5) giving analysts tools to experiment, 6) using simple storage, and 7) supporting small team, feature branch, and data governance workflows. The goal is to enable rapid experimentation and integration of new data sources through these agile practices adapted for analytics teams and their unique needs.
Data Science as a Service: Intersection of Cloud Computing and Data Science – Pouria Amirian
Dr. Pouria Amirian explains data science and the steps in a data science workflow, and shows some experiments in AzureML. He also discusses big data issues in a data science project and solutions to them.
Data Science as a Service: Intersection of Cloud Computing and Data Science – Pouria Amirian
Dr. Pouria Amirian from the University of Oxford explains Data Science and its relationship with Big Data and Cloud Computing. He then illustrates using AzureML to perform a simple data science analysis.
Data scientists and machine learning practitioners nowadays seem to be churning out models by the dozen, and they continuously experiment to find ways to improve their accuracy. They also use a variety of ML and DL frameworks and languages, and a typical organization may find that this results in a heterogeneous, complicated bunch of assets that require different types of runtimes, resources, and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production"? How does an organization scale inference engines out and make them available for real-time applications without significant latencies? Different techniques are needed for batch (offline) inference and for instant, online scoring. Data needs to be accessed from various sources, and cleansing and transformation of the data need to be enabled prior to any predictions. In many cases, there may be no substitute for customized data handling with scripting either.
Enterprises also require additional auditing and authorization built in, along with approval processes, while still supporting a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are the consumers of a model, so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes-based offering for the private cloud, optimized for the Hortonworks Hadoop Data Platform. DSX essentially brings typical software engineering development practices to data science, organizing the dev->test->production flow for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracy, and even roll back models and custom scorers, as well as how API-based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
Do compilers look anything like a data pipeline? How do you do data testing to ensure end-to-end provenance and enforce engineering guarantees for your data products? What baby steps should you consider when assembling your team?
Anaël Beaugnon gave a presentation called "From Machine Learning Scientist to Full Stack Data Scientist: Lessons learned for ML in production" at a joint meetup between WiMLDS Paris and MLOps Paris in June 2024.
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks... – Rodney Joyce
Number 2 in the Data Science for Dummies series: we'll predict Titanic survival with Databricks, Python, and Spark ML.
These are the slides only (excuse the Powerpoint animation issues) - check out the actual tech talk on YouTube: https://rodneyjoyce.home.blog/2019/05/03/data-science-for-dummies-machine-learning-with-databricks-python-sparkml-tech-talk-1-of-7/)
If you have not used Databricks before check out the first talk - Databricks for Dummies.
Here's the rest of the series: https://rodneyjoyce.home.blog/tag/data-science-for-dummies/
1) Data Science overview with Databricks
2) Titanic survival prediction with Azure Machine Learning Studio + Kaggle
3) Data Engineering with Titanic dataset + Databricks + Python
4) Titanic with Databricks + Spark ML
5) Titanic with Databricks + Azure Machine Learning Service
6) Titanic with Databricks + MLS + AutoML
7) Titanic with Databricks + MLFlow
8) Titanic with .NET Core + ML.NET
9) Deployment, DevOps/MLOps and Productionisation
Rational Unified Process (RUP) and Rapid Application Development (RAD) – RaunakMalkani3
The document discusses the Rational Unified Process (RUP) and Rapid Application Development (RAD) methodologies. RUP follows five phases - Inception, Elaboration, Construction, Transition, and Production. It aims to reduce unexpected costs and prevent waste. RAD is used for urgent projects and emphasizes prototyping. It develops components in parallel like mini-projects then assembles them. Business modeling, data modeling, process modeling, application generation and testing are key activities in RAD.
District Data Labs Workshop
Current Workshop: August 23, 2014
Previous Workshops:
- April 5, 2014
Data products are usually software applications that derive their value from data by leveraging the data science pipeline, and that generate data through their operation. They aren't apps with data, nor are they one-time analyses that produce insights - they are operational and interactive. The rise of these types of applications has directly contributed to the rise of the data scientist and the idea that data scientists are professionals "who are better at statistics than any software engineer and better at software engineering than any statistician."
These applications have been largely built with Python. Python is flexible enough to develop extremely quickly on many different types of servers and has a rich tradition in web applications. Python contributes to every stage of the data science pipeline including real time ingestion and the production of APIs, and it is powerful enough to perform machine learning computations. In this class we’ll produce a data product with Python, leveraging every stage of the data science pipeline to produce a book recommender.
Agile development of data science projects | Part 1 – Anubhav Dhiman
This document discusses agile development of data science projects. It begins by defining data science as focusing on predicting, prescribing, or explaining something, distinct from business intelligence which focuses on reporting past events. It notes data science encompasses quantitative research, advanced analytics, predictive modeling, and machine learning. It then discusses how reliably data science teams can deliver value, showing a data science readiness level chart ranging from algorithm design to proven systems. The rest of the document discusses collaborating across teams and organizations to move from initial concepts to specific, integrated predictive systems.
IRJET- Deep Learning Model to Predict Hardware Performance – IRJET Journal
This document discusses using deep learning models to predict hardware performance. Specifically, it aims to predict benchmark scores from hardware configurations, or predict configurations from scores. It explores various machine learning algorithms like linear regression, logistic regression, and multi-linear regression on hardware performance data. The best results were from backward elimination and linear regression, achieving over 80% accuracy. Data preprocessing like encoding was important. The model can help analyze hardware performance more quickly than manual methods.
Challenges of Operationalising Data Science in Productioniguazio
The meetup covered its topic in two sections, presented back to back.
Section 1: Business Aspects (20 mins)
Speaker: Rasmi Mohapatra, Product Owner, Experian
https://www.linkedin.com/in/rasmi-m-428b3a46/
Once a data science application is in production, teams across business domains face typical operational challenges; this section covers a few of them with example scenarios.
Section 2: Tech Aspects (40 mins, slides & demo, Q&A )
Speaker: Santanu Dey, Solution Architect, Iguazio
https://www.linkedin.com/in/santanu/
In this part of the talk, we cover how these operational challenges can be overcome, e.g. automating data collection and preparation, making ML models portable and deploying them in production, and monitoring and scaling, with relevant demos.
Using dataset versioning in data science
1. Using Dataset Versioning in Data Science
Dr. Venkata Pingali
Founder, Scribble Data
pingali@scribbledata.io
https://github.com/pingali
2. Agenda
1. Why dataset versioning
2. Revised process using data versioning
3. Tool summary and demo
4. Roadmap
5. Feedback
a. Overall direction
b. dgit features
c. Suggestions
d. Actionables/next steps if any
3. About Me
Dr. Venkata Pingali
Founder, Scribble Data
Former-VP Analytics, FourthLion
Founder, eLuminos Energy Analytics
IIT(B), PhD (USC)
http://linkedin.com/in/pingali
6. Only the Beginning
To Manager: Ready to process CC Marriott's numbers on scanned invoices! (or some high-risk activity based on this)
7. Then some questions
1. Where did the numbers come from? (Correctness, Lineage; see the sketch after this list)
a. Assumptions, models, datasets
2. Is this an accident? Does it hold now? (Reproducibility, Impact assessment)
a. Model, dataset, and question revisions
b. Performance in deployment
3. Can you get the results faster? (Efficiency)
a. Time, effort, cost
4. Can you also analyze X? (Extensibility)
a. Different dataset, question
5. Could we try X? (DoE, Synthetic data)
a. What if scenarios, field experiments
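A minimal sketch of the kind of lineage record that would answer question 1; every field name here is an assumption for illustration, not part of the talk's tooling.
```python
# Minimal lineage-record sketch: enough metadata to trace a reported
# number back to the exact datasets, model, and code that produced it.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    result_id: str                    # which reported number this covers
    dataset_versions: dict[str, str]  # dataset name -> version identifier
    model_version: str                # trained-model identifier
    code_commit: str                  # git SHA of the pipeline code
    assumptions: list[str] = field(default_factory=list)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = LineageRecord(
    result_id="q3-revenue-forecast",
    dataset_versions={"invoices": "v12", "customers": "v4"},
    model_version="forecast-2016-04-01",
    code_commit="a1b2c3d",
    assumptions=["outliers above p99 dropped"],
)
```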
9. Business Complexity is Discovered Over Time
Incomplete context (history, semantics)
Questions not thought through
Continuous revisions
[Diagram: Biz ↔ Analytics Team ↔ Data Engg, exchanging questions/context, data requests, datasets, model results, and storytelling]
10. Imperfect Data Queries due to Limited Understanding
Dependencies not specified
Wrong filters
Known outliers
Narrow specification (cubes)
[Diagram: Biz ↔ Analytics Team ↔ Data Engg, exchanging questions/context, data requests, datasets, model results, and storytelling]
11. Weak process
Lack of protocol (email/files)
Missing validation checks (example below)
No lineage
No revisions
[Diagram: Biz ↔ Analytics Team ↔ Data Engg, exchanging questions/context, data requests, datasets, model results, and storytelling]
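As one concrete example of the missing validation checks, here is a minimal sketch of a pre-modeling dataset check in Python; the column names and rules are illustrative assumptions, not the talk's actual checks.
```python
# Minimal dataset-validation sketch: fail fast before any modeling runs.
import pandas as pd

def validate_invoices(df: pd.DataFrame) -> list[str]:
    """Return a list of rule violations; empty means the dataset passes."""
    problems = []
    if df["invoice_id"].duplicated().any():
        problems.append("duplicate invoice_id values")
    if (df["amount"] < 0).any():
        problems.append("negative amounts present")
    if df["invoice_date"].isna().any():
        problems.append("missing invoice dates")
    return problems

df = pd.read_csv("invoices.csv", parse_dates=["invoice_date"])
violations = validate_invoices(df)
if violations:
    raise ValueError("validation failed: " + "; ".join(violations))
```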
12. Eagerness to Present Great Narratives
Wrong input dataset
Mistakes in pipeline
Excel/ad hoc transformations
Model evolution
Continuous revision of narratives
Missing interpretation integrity checks (e.g. other time windows)
Better methodology
[Diagram: Biz ↔ Analytics Team ↔ Data Engg, exchanging questions/context, data requests, datasets, model results, and storytelling]
13. Underlying Issue: Messy Analytics Process
Floating data
Ad hoc
Iterative
Laborious
Fast paced
Story telling
[Diagram: Biz ↔ Analytics Team ↔ Data Engg, exchanging questions/context, data requests, datasets, and modeling]
14. Desired State
1. Trusted
a. Every model should be auditable to the last record and step
b. Every model should be reproducible with zero human intervention
c. All models should be evaluated independently for quality
d. No data should change without leaving an audit trail (sketched below)
e. All applications (presentation, configuration etc) should be hyperlinked
2. Scalable
a. All models should be searchable and usable easily
b. All data and model components should be reusable
c. Process should enable observation of data science process
3. Robust
a. Process should cope with younger, inexperienced staff
b. Process should cope with staff churn
Similar to https://medium.com/airbnb-engineering/scaling-knowledge-at-airbnb-875d73eff091
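Requirement 1(d), that no data should change without leaving an audit trail, can be sketched in a few lines; the log format and file layout below are assumptions for illustration, not a prescribed implementation.
```python
# Append-only audit-trail sketch: record a content hash every time a
# dataset file is (re)written, so silent changes become detectable.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def record_change(dataset: Path, log: Path, note: str) -> str:
    digest = hashlib.sha256(dataset.read_bytes()).hexdigest()
    entry = {
        "dataset": str(dataset),
        "sha256": digest,
        "note": note,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    with log.open("a") as f:          # append-only: never rewrite history
        f.write(json.dumps(entry) + "\n")
    return digest

record_change(Path("invoices.csv"), Path("audit.log"),
              note="dropped p99 outliers")
```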
15. Core Process with Dataset Versioning
Dataset as mutable object with memory
No emails/Google Docs
Continuous validation by a third party (server)
Separate model development and evaluation
[Diagram: Biz ↔ Analytics Team ↔ Data Engg around server-side CI: dataset rules, evaluation rules, and dependencies drive materialization of versioned datasets (v1-v6) through the model pipeline, pipeline execution, quality checks, and interpretation; context, questions, and slide content are referenced by URN]
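A minimal sketch of the materialize → validate → version loop the diagram describes, assuming a simple directory-per-version layout; the rule checks and version scheme are illustrative, not the talk's actual tooling.
```python
# Materialize -> validate -> version sketch: each accepted materialization
# becomes an immutable, numbered snapshot that models can pin to.
import shutil
from pathlib import Path

import pandas as pd

STORE = Path("dataset_store/invoices")

def next_version() -> int:
    existing = [int(p.name[1:]) for p in STORE.glob("v*") if p.is_dir()]
    return max(existing, default=0) + 1

def materialize_and_version(source_csv: Path) -> Path:
    df = pd.read_csv(source_csv)
    # Dataset rules: reject the materialization if any rule fails.
    assert not df["invoice_id"].duplicated().any(), "duplicate invoice_id"
    assert (df["amount"] >= 0).all(), "negative amount"
    version_dir = STORE / f"v{next_version()}"
    version_dir.mkdir(parents=True)
    shutil.copy(source_csv, version_dir / "data.csv")
    return version_dir  # e.g. dataset_store/invoices/v3

snapshot = materialize_and_version(Path("invoices.csv"))
print("pinned dataset:", snapshot)
```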
17. dgit - git wrapper for datasets
1. Python package, MIT license
2. Application of git
3. Beyond git - “Understands” data
a. Metadata generation and management
b. Automatic scanning of working directory for changes
c. Automatic validation and materialization
d. Dependency tracking across repos
e. Automatic audit trails with execution
f. Pipeline support
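To make the "git wrapper" idea concrete, here is a minimal Python sketch that commits a dataset together with auto-generated metadata. This is illustrative only and is not dgit's actual interface; see https://github.com/pingali for the real package.
```python
# Git-wrapper sketch: commit a dataset alongside auto-generated metadata
# so every revision of the data carries a machine-readable description.
# Illustrative only -- this is not dgit's actual interface.
import json
import subprocess
from pathlib import Path

import pandas as pd

def commit_dataset(csv_path: Path, message: str) -> None:
    df = pd.read_csv(csv_path)
    meta = {
        "rows": len(df),
        "columns": {c: str(t) for c, t in df.dtypes.items()},
    }
    meta_path = csv_path.with_suffix(".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    subprocess.run(["git", "add", str(csv_path), str(meta_path)], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

commit_dataset(Path("invoices.csv"), "Update invoices after dedup")
```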
19. Roadmap to Reduce Cost and Complexity
● Standardize processes around versioned data
○ April 2016 - git for data (open source)
● Simplify data access
○ May 2016 - EasyQuery (SaaS product)
● Increase security of data science services
○ July 2016 - Ethereum integration (SaaS product)
20. Upvote if you like this talk….
https://fifthelephant.talkfunnel.com/2016