This document discusses best practices for deploying analytic models from development environments into operational systems. It describes how modeling environments often use different languages than deployment environments, requiring significant effort to move models. The document outlines the life cycle of analytic models, from exploratory data analysis to model deployment and monitoring. It also discusses standards like PMML and PFA that can be used to export models between different applications and analytic engines that integrate models into operational workflows.
How to design your ML application to be production ready from the day one
How to switch from notebooks to deployable and maintainable software
How to deploy, serve and monitor prediction pipelines
How to re-train models in production
How to shift machine learning experimentation phase to production
TensorFlow Extension (TFX) and Apache Beammarkgrover
Talk on TFX and Beam by Robert Crowe, developer advocate at Google, focussed on TensorFlow.
Learn how the TensorFlow Extended (TFX) project is utilizing Apache Beam to simplify pre- and post-processing for ML pipelines. TFX provides a framework for managing all of necessary pieces of a real-world machine learning project beyond simply training and utilizing models. Robert will provide an overview of TFX, and talk in a little more detail about the pieces of the framework (tf.Transform and tf.ModelAnalysis) which are powered by Apache Beam.
TPCx-HS is the first vendor-neutral benchmark focused on big data systems – which have become a critical part of the enterprise IT ecosystem.
Watch the video presentation: http://wp.me/p3RLHQ-cLY
Learn more: http://www.tpc.org/tpcx-hs
Integration Patterns for Big Data ApplicationsMichael Häusler
Big Data technologies like distributed databases, queues, batch processors, and stream processors are fun and exciting to play with. Making them play nicely together can be challenging. Keeping it fun for engineers to continuously improve and operate them is hard. At ResearchGate, we run thousands of YARN applications every day to gain insights and to power user facing features. Of course, there are numerous integration challenges on the way:
* integrating batch and stream processors with operational systems
* ingesting data and playing back results while controlling performance crosstalk
* rolling out new versions of synchronous, stream, and batch applications and their respective data schemas
* controlling the amount of glue and adapter code between different technologies
* modeling cross-flow dependencies while handling failures gracefully and limiting their repercussions
We describe our ongoing journey in identifying patterns and principles to make our big data stack integrate well. Technologies to be covered will include MongoDB, Kafka, Hadoop (YARN), Hive (TEZ), Flink Batch, and Flink Streaming.
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
A long time ago, there was Caffe and Theano, then came Torch and CNTK and Tensorflow, Keras and MXNet and Pytorch and Caffe2….a sea of Deep learning tools but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of Big Data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy to use tools are now available with the ‘Analytics Zoo’. We’ll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do
How to design your ML application to be production ready from the day one
How to switch from notebooks to deployable and maintainable software
How to deploy, serve and monitor prediction pipelines
How to re-train models in production
How to shift machine learning experimentation phase to production
TensorFlow Extension (TFX) and Apache Beammarkgrover
Talk on TFX and Beam by Robert Crowe, developer advocate at Google, focussed on TensorFlow.
Learn how the TensorFlow Extended (TFX) project is utilizing Apache Beam to simplify pre- and post-processing for ML pipelines. TFX provides a framework for managing all of necessary pieces of a real-world machine learning project beyond simply training and utilizing models. Robert will provide an overview of TFX, and talk in a little more detail about the pieces of the framework (tf.Transform and tf.ModelAnalysis) which are powered by Apache Beam.
TPCx-HS is the first vendor-neutral benchmark focused on big data systems – which have become a critical part of the enterprise IT ecosystem.
Watch the video presentation: http://wp.me/p3RLHQ-cLY
Learn more: http://www.tpc.org/tpcx-hs
Integration Patterns for Big Data ApplicationsMichael Häusler
Big Data technologies like distributed databases, queues, batch processors, and stream processors are fun and exciting to play with. Making them play nicely together can be challenging. Keeping it fun for engineers to continuously improve and operate them is hard. At ResearchGate, we run thousands of YARN applications every day to gain insights and to power user facing features. Of course, there are numerous integration challenges on the way:
* integrating batch and stream processors with operational systems
* ingesting data and playing back results while controlling performance crosstalk
* rolling out new versions of synchronous, stream, and batch applications and their respective data schemas
* controlling the amount of glue and adapter code between different technologies
* modeling cross-flow dependencies while handling failures gracefully and limiting their repercussions
We describe our ongoing journey in identifying patterns and principles to make our big data stack integrate well. Technologies to be covered will include MongoDB, Kafka, Hadoop (YARN), Hive (TEZ), Flink Batch, and Flink Streaming.
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
A long time ago, there was Caffe and Theano, then came Torch and CNTK and Tensorflow, Keras and MXNet and Pytorch and Caffe2….a sea of Deep learning tools but none for Spark developers to dip into. Finally, there was BigDL, a deep learning library for Apache Spark. While BigDL is integrated into Spark and extends its capabilities to address the challenges of Big Data developers, will a library alone be enough to simplify and accelerate the deployment of ML/DL workloads on production clusters? From high level pipeline API support to feature transformers to pre-defined models and reference use cases, a rich repository of easy to use tools are now available with the ‘Analytics Zoo’. We’ll unpack the production challenges and opportunities with ML/DL on Spark and what the Zoo can do
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationRevolution Analytics
Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Spark Summit
The detection and analysis of rare genomic events requires integrative analysis across large cohorts with terabytes to petabytes of genomic data. Contemporary genomic analysis tools have not been designed for this scale of data-intensive computing. This talk presents ADAM, an Apache 2 licensed library built on top of the popular Apache Spark distributed computing framework. ADAM is designed to allow genomic analyses to be seamlessly distributed across large clusters, and presents a clean API for writing parallel genomic analysis algorithms. In this talk, we’ll look at how we’ve used ADAM to achieve a 3.5× improvement in end-to-end variant calling latency and a 66% cost improvement over current toolkits, without sacrificing accuracy. We will talk about a recent recompute effort where we have used ADAM to recall the Simons Genome Diversity Dataset against GRCh38. We will also talk about using ADAM alongside Apache Hbase to interactively explore large variant datasets.
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
Tensors Are All You Need: Faster Inference with HummingbirdDatabricks
The ever-increasing interest around deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g. GPUs).Traditional machine learning (ML) such as linear regressions and decision trees in scikit-learn cannot currently be run on GPUs, missing out on the potential accelerations that deep learning and neural networks enjoy.
In this talk, we’ll show how you can use Hummingbird to achieve 1000x speedup in inferencing on GPUs by converting your traditional ML models to tensor-based models (PyTorch andTVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to speedup the time it takes to perform inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models to be able to try them out on GPUs.
Outline:
Introduction of what ML inference is (and why it’s different than training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbirddoes and main benefits
Deep dive on how traditional ML models are built
Brief intro onhow Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
Presented by: Joseph Rickert, Data Scientist Community Manager, Revolution Analytics, Sep 25 2014.
Whenever data scientists are asked about what software they use R always comes up at the top of the list. In one recent survey, only SQL was rated higher than R. In this webinar we will explore what makes R so popular and useful. Starting with the big picture, we describe how R is organized and how to find your way around the R world. Then we will work through some examples highlighting features of R that make it attractive for data science work including:
Acquiring data
Data manipulation
Exploratory data analysis
Model building
Machine learning
High Performance Predictive Analytics in R and HadoopDataWorks Summit
Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMA’s). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.
In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open source nature. We provide an overview of the library’s architecture and design patterns, and provide a description of various statistical methods in that context.
27 Aug 2013 Webinar High Performance Predictive Analytics in Hadoop and R presented by Mario E. Inchiosa, PhD., US Data Scientist and Kathleen Rohrecker, Director of Product Marketing
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Revolution Analytics
[Presentation by Skylar Lyon at DataWeek 2014, September 17 2014.]
I recently faced the task of how to scale out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in database. What did I do? I offered my favorite type of solution - quick and dirty.
At the outset, I wasn't sure how easy it would be. Nor was I certain of realized performance gains. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata.
This presentation outlines my approach in leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionalize the workflow?
Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and ...Yuanyuan Tian
With the advent of big data, the enterprise analytics landscape has dramatically changed. The HDFS has become an important data repository for all business analytics. Enterprises are using various big data technologies to process data and drive actionable insights. HDFS serves as the storage where other distributed processing frameworks, such as Hadoop and Spark, access and operate on large volumes of data. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. EDWs are usually shared-nothing parallel databases that support complex SQL processing, updates, and transactions. As a result, they manage up-to-date data and support various business analytics tools, such as reporting and dashboards. A new generation of applications have emerged, requiring access and correlation of data stored in HDFS and EDWs. This has created the need for a new generation of a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. In this talk, we identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables.
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
Overview and extended description: AI is expected to be the engine of technological advancements in the healthcare industry, especially in the areas of radiology and image processing. The purpose of this session is to demonstrate how we can build a AI-based Radiologist system using Apache Spark and Analytics Zoo to detect pneumonia and other diseases from chest x-ray images. The dataset, released by the NIH, contains around 110,00 X-ray images of around 30,000 unique patients, annotated with up to 14 different thoracic pathology labels. Stanford University developed a state-of-the-art model using CNN and exceeds average radiologist performance on the F1 metric. This talk focuses on how we can build a multi-label image classification model in a distributed Apache Spark infrastructure, and demonstrate how to build complex image transformations and deep learning pipelines using BigDL and Analytics Zoo with scalability and ease of use. Some practical image pre-processing procedures and evaluation metrics are introduced. We will also discuss runtime configuration, near-linear scalability for training and model serving, and other general performance topics.
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationRevolution Analytics
Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Spark Summit
The detection and analysis of rare genomic events requires integrative analysis across large cohorts with terabytes to petabytes of genomic data. Contemporary genomic analysis tools have not been designed for this scale of data-intensive computing. This talk presents ADAM, an Apache 2 licensed library built on top of the popular Apache Spark distributed computing framework. ADAM is designed to allow genomic analyses to be seamlessly distributed across large clusters, and presents a clean API for writing parallel genomic analysis algorithms. In this talk, we’ll look at how we’ve used ADAM to achieve a 3.5× improvement in end-to-end variant calling latency and a 66% cost improvement over current toolkits, without sacrificing accuracy. We will talk about a recent recompute effort where we have used ADAM to recall the Simons Genome Diversity Dataset against GRCh38. We will also talk about using ADAM alongside Apache Hbase to interactively explore large variant datasets.
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
Tensors Are All You Need: Faster Inference with HummingbirdDatabricks
The ever-increasing interest around deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g. GPUs).Traditional machine learning (ML) such as linear regressions and decision trees in scikit-learn cannot currently be run on GPUs, missing out on the potential accelerations that deep learning and neural networks enjoy.
In this talk, we’ll show how you can use Hummingbird to achieve 1000x speedup in inferencing on GPUs by converting your traditional ML models to tensor-based models (PyTorch andTVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to speedup the time it takes to perform inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models to be able to try them out on GPUs.
Outline:
Introduction of what ML inference is (and why it’s different than training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbirddoes and main benefits
Deep dive on how traditional ML models are built
Brief intro onhow Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
Presented by: Joseph Rickert, Data Scientist Community Manager, Revolution Analytics, Sep 25 2014.
Whenever data scientists are asked about what software they use R always comes up at the top of the list. In one recent survey, only SQL was rated higher than R. In this webinar we will explore what makes R so popular and useful. Starting with the big picture, we describe how R is organized and how to find your way around the R world. Then we will work through some examples highlighting features of R that make it attractive for data science work including:
Acquiring data
Data manipulation
Exploratory data analysis
Model building
Machine learning
High Performance Predictive Analytics in R and HadoopDataWorks Summit
Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMA’s). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.
In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open source nature. We provide an overview of the library’s architecture and design patterns, and provide a description of various statistical methods in that context.
27 Aug 2013 Webinar High Performance Predictive Analytics in Hadoop and R presented by Mario E. Inchiosa, PhD., US Data Scientist and Kathleen Rohrecker, Director of Product Marketing
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Revolution Analytics
[Presentation by Skylar Lyon at DataWeek 2014, September 17 2014.]
I recently faced the task of how to scale out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in database. What did I do? I offered my favorite type of solution - quick and dirty.
At the outset, I wasn't sure how easy it would be. Nor was I certain of realized performance gains. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata.
This presentation outlines my approach in leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionalize the workflow?
Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and ...Yuanyuan Tian
With the advent of big data, the enterprise analytics landscape has dramatically changed. The HDFS has become an important data repository for all business analytics. Enterprises are using various big data technologies to process data and drive actionable insights. HDFS serves as the storage where other distributed processing frameworks, such as Hadoop and Spark, access and operate on large volumes of data. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. EDWs are usually shared-nothing parallel databases that support complex SQL processing, updates, and transactions. As a result, they manage up-to-date data and support various business analytics tools, such as reporting and dashboards. A new generation of applications have emerged, requiring access and correlation of data stored in HDFS and EDWs. This has created the need for a new generation of a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. In this talk, we identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables.
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
Overview and extended description: AI is expected to be the engine of technological advancements in the healthcare industry, especially in the areas of radiology and image processing. The purpose of this session is to demonstrate how we can build a AI-based Radiologist system using Apache Spark and Analytics Zoo to detect pneumonia and other diseases from chest x-ray images. The dataset, released by the NIH, contains around 110,00 X-ray images of around 30,000 unique patients, annotated with up to 14 different thoracic pathology labels. Stanford University developed a state-of-the-art model using CNN and exceeds average radiologist performance on the F1 metric. This talk focuses on how we can build a multi-label image classification model in a distributed Apache Spark infrastructure, and demonstrate how to build complex image transformations and deep learning pipelines using BigDL and Analytics Zoo with scalability and ease of use. Some practical image pre-processing procedures and evaluation metrics are introduced. We will also discuss runtime configuration, near-linear scalability for training and model serving, and other general performance topics.
Architectures for Data Commons (XLDB 15 Lightning Talk)Robert Grossman
These are the slides from a 5 minute Lightning Talk that I gave at XLDB 2015 on May 19, 2015 at Stanford. It is based in part on our experiences developing the NCI Genomic Data Commons (GDC).
Practical Methods for Identifying Anomalies That Matter in Large DatasetsRobert Grossman
Robert L. Grossman, Practical Methods for Identifying Anomalies That Matter in Large Datasets, O’Reilly, Strata + Hadoop World, San Jose, California, February 20, 2015.
Adversarial Analytics - 2013 Strata & Hadoop World TalkRobert Grossman
This is a talk I gave at the Strata Conference and Hadoop World in New York City on October 28, 2013. It describes predictive modeling in the context of modeling an adversary's behavior.
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopPaco Nathan
ACM: Hands-On Workshop for Predictive Modeling and Enterprise Data Workflows with PMML and Cascading
2013-10-12
http://www.sfbayacm.org/event/hands-workshop-predictive-modeling-and-enterprise-data-workflows-pmml-and-cascading
PMML (Predictive Model Markup Language) provides a standard way to represent data mining models so that these can be shared between different statistical applications. Not only PMML can represent a wide range of statistical techniques, but it can also be used to represent the data transformations necessary to transform raw data into meaningful feature detectors. In this way, PMML offers a standard to represent data manipulation and modeling in a single and concise way.
The Matsu Project - Open Source Software for Processing Satellite Imagery DataRobert Grossman
The Matsu Project is an Open Cloud Consortium project that is developing open source software for processing satellite imagery data using Hadoop, OpenStack and R.
Using the Open Science Data Cloud for Data Science ResearchRobert Grossman
The Open Science Data Cloud is a petabyte scale science cloud for managing, analyzing, and sharing large datasets. We give an overview of the Open Science Data Cloud and how it can be used for data science research.
Use of standards and related issues in predictive analyticsPaco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
The quality of data is very important for various downstream analyses, such as sequence assembly, single nucleotide polymorphisms identification this ppt show parameters for
NGS Data quality check and Dataformat of top sequencing machine
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do you deploy these ML model to a production environment? How do you embed what you’ve learned into customer facing data applications?
In this talk I will discuss best practices on how data scientists productionize machine learning models, do a deep dive with actual case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
There are two cultures in data science and analytics - those that develop analytic models and those that deploy analytic models into operational systems. In this talk, we review the life cycle of analytic models and provide an overview of some of the approaches that have been developed for managing analytic models and workflows and for deploying them, including using analytic engines and analytic containers . We give a quick overview of languages for analytic models (PMML) and analytic workflows (PFA). We also describe the emerging discipline of AnalyticOps that has borrowed some of the techniques of DevOps.
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
When making machine learning applications in Uber, we identified a sequence of common practices and painful procedures, and thus built a machine learning platform as a service. We here present the key components to build such a scalable and reliable machine learning service which serves both our online and offline data processing needs.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
Reproducible AI using MLflow and PyTorchDatabricks
Model reproducibility is becoming the next frontier for successful AI models building and deployments for both Research and Production scenarios. In this talk, we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, with traceability and speed up collaboration for AI projects.
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do I deploy these model to a production environment? How do I embed what I have learned into customer facing data applications?
In this webinar, we will discuss best practices from Databricks on
how our customers productionize machine learning models
do a deep dive with actual customer case studies,
show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.
Lessons learnt and system built while solving the last mile problem in machine learning - taking models to production. Used for the talk at - http://sched.co/BLvf
Utilisation de MLflow pour le cycle de vie des projet Machine learningParis Data Engineers !
Mlflow est un projet opensource pour administrer le cycle de vie des projets machine learning (de l’expérimentation jusqu’au déploiement) afin de mieux les intégrer dans l’écosystème qui les entoure.
Durant cette présentation nous montrerons les différentes composantes de MLflow et ferons une démonstration de son utilisation à la fois dans le contexte d’une plateforme Databricks et d’un IDE local.
Presented at IDEAS SoCal on Oct 20, 2018. I discuss main approaches of deploying data science engines to production and provide sample code for the comprehensive approach of real time scoring with MLeap and Spark ML.
Hydrosphere.io Platform for AI/ML Operations AutomationRustem Zakiev
Simple and robust ML models deployment
Automated versioning
Easy models and versions management
Score the model from your app or microservice via REST, gRPC or Kafka stream API.
A/B and Canary testing on production traffic.
Hot-wing bumpless model replacement in production pipeline
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"Fwdays
Here in DS team in WIX we want to help to create stunning sites by applying recent achievement of AI research to production. Since Data Science engineering practices are still not fully shaped we found out that it is crucial to bring the best practices from software engineering - give Data Scientist ability to deliver models fast without loss in quality and computation efficiency to stay competitive in this overhyped market. To achieve this we are developing our own infrastructure for creating pipelines and deploying them to production with minimum (to none) engineer involvement.
This talk will cover initial motivation, solved technical issues and lessons learned while building such ML delivery system.
Website: https://fwdays.com/en/event/data-science-fwdays-2019/review/continuous-delivery-of-ml-pipelines-to-production
Operationalizing Machine Learning: Serving ML ModelsLightbend
Join O’Reilly author and Lightbend Principal Architect, Boris Lublinsky, as he discusses one of the hottest topics in software engineering today: serving machine learning models.
Typically with machine learning, different groups are responsible for model training and model serving. Data scientists often introduce their own machine-learning tools, causing software engineers to create complementary model-serving frameworks to keep pace. It’s not a very efficient system. In this webinar, Boris demonstrates a more standardized approach to model serving and model scoring:
* How to develop an architecture for serving models in real time as part of input stream processing
* How this approach enables data science teams to update models without restarting existing applications
* Different ways to build this model-scoring solution, using several popular stream processing engines and frameworks
Monitoring AI applications with AI
The best performing offline algorithm can lose in production. The most accurate model does not always improve business metrics. Environment misconfiguration or upstream data pipeline inconsistency can silently kill the model performance. Neither prodops, data science or engineering teams are skilled to detect, monitor and debug such types of incidents.
Was it possible for Microsoft to test Tay chatbot in advance and then monitor and adjust it continuously in production to prevent its unexpected behaviour? Real mission critical AI systems require advanced monitoring and testing ecosystem which enables continuous and reliable delivery of machine learning models and data pipelines into production. Common production incidents include:
Data drifts, new data, wrong features
Vulnerability issues, malicious users
Concept drifts
Model Degradation
Biased Training set / training issue
Performance issue
In this demo based talk we discuss a solution, tooling and architecture that allows machine learning engineer to be involved in delivery phase and take ownership over deployment and monitoring of machine learning pipelines.
It allows data scientists to safely deploy early results as end-to-end AI applications in a self serve mode without assistance from engineering and operations teams. It shifts experimentation and even training phases from offline datasets to live production and closes a feedback loop between research and production.
Technical part of the talk will cover the following topics:
Automatic Data Profiling
Anomaly Detection
Clustering of inputs and outputs of the model
A/B Testing
Service Mesh, Envoy Proxy, trafic shadowing
Stateless and stateful models
Monitoring of regression, classification and prediction models
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Provectus
In this demo based talk we discuss a solution, tooling and architecture that allows machine learning engineer to be involved in delivery phase and take ownership over deployment and monitoring of machine learning pipelines. It allows data scientists to safely deploy early results as end-to-end AI applications in a self serve mode without assistance from engineering and operations teams. It shifts experimentation and even training phases from offline datasets to live production and closes a feedback loop between research and production.
Scaling AI in production using PyTorchgeetachauhan
Slides from my talk at MLOps World' 21
Deploying AI models in production and scaling the ML services is still a big challenge. In this talk we will cover details of how to deploy your AI models, best practices for the deployment scenarios, and techniques for performance optimization and scaling the ML services. Come join us to learn how you can jumpstart the journey of taking your PyTorch models from Research to production.
A seminar in advanced Software Engineering concerning using models to guide the development process, and QVT to transfer a model into another model automatically
Data Scientists and Machine Learning practitioners, nowadays, seem to be churning out models by the dozen and they continuously experiment to find ways to improve their accuracies. They also use a variety of ML and DL frameworks & languages , and a typical organization may find that this results in a heterogenous, complicated bunch of assets that require different types of runtimes, resources and sometimes even specialized compute to operate efficiently.
But what does it mean for an enterprise to actually take these models to "production" ? How does an organization scale inference engines out & make them available for real-time applications without significant latencies ? There needs to be different techniques for batch (offline) inferences and instant, online scoring. Data needs to be accessed from various sources and cleansing, transformations of data needs to be enabled prior to any predictions. In many cases, there maybe no substitute for customized data handling with scripting either.
Enterprises also require additional auditing and authorizations built in, approval processes and still support a "continuous delivery" paradigm whereby a data scientist can enable insights faster. Not all models are created equal, nor are consumers of a model - so enterprises require both metering and allocation of compute resources for SLAs.
In this session, we will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes based offering for the Private Cloud and optimized for the HortonWorks Hadoop Data Platform. DSX essentially brings in typical software engineering development practices to Data Science, organizing the dev->test->production for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor accuracies and even rollback models & custom scorers as well as how API based techniques enable consuming business processes and applications to remain relatively stable amidst all the chaos.
Speaker
Piotr Mierzejewski, Program Director Development IBM DSX Local, IBM
Some Frameworks for Improving Analytic Operations at Your CompanyRobert Grossman
I review three frameworks for analytic operations that are designed to improve the value obtained when deploying analytic models into products, services and internal operations.
This a talk that I gave at BioIT World West on March 12, 2019. The talk was called: A Gen3 Perspective of Disparate Data:From Pipelines in Data Commons to AI in Data Ecosystems.
This is an overview of the Data Biosphere Project, its goals, its architecture, and the three core projects that form its foundation. We also discuss data commons.
What is Data Commons and How Can Your Organization Build One?Robert Grossman
This is a talk that I gave at the Molecular Medicine Tri Conference on data commons and data sharing to accelerate research discoveries and improve patient outcomes. It also covers how your organization can build a data commons using the Open Commons Consortium's Data Commons Framework and the University of Chicago's Gen3 data commons platform.
These are the slides from a plenary panel that I participated in at IEEE Cloud 2011 on July 5, 2011 in Washington, D.C. I discussed the Open Science Data Cloud and concluded the talk by three research questions
This is a talk I gave at a Northwestern University - Complete Genomics Workshop on April 21, 2011 about using clouds to support research in genomics and related areas.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
1. Best
Prac*ces
for
Deploying
Analy*c
Models
into
Opera*ons
Robert
L.
Grossman
Open
Data
Group
and
University
of
Chicago
Predic*ve
Analy*c
World
Chicago
June
21,
2016
rgrossman.com
@bobgrossman
2. Exploratory
Data
Analysis
Get
and
clean
the
data
Build
model
in
dev/
modeling
environment
Deploy
model
in
opera*onal
systems
with
scoring
applica*on
Monitor
performance
and
employ
champion-‐challenger
methodology
Analy*c
modeling
Analy*c
opera*ons
Deploy
model
Re*re
model
and
deploy
improved
model
Select
analy*c
problem
&
approach
Scale
up
deployment
Model Env
Deployment Env
Perf.
data
Life
Cycle
of
an
Analy*c
Model
3. Differences
Between
the
Modeling
and
Deployment
Environments
• Typically
modelers
use
specialized
languages
such
as
SAS,
SPSS
or
R.
• Usually,
developers
responsible
for
products
and
services
use
languages
such
as
Java,
JavaScript,
Python,
C++,
etc.
• This
can
result
in
significant
effort
and
significant
delays
moving
the
model
from
the
modeling
environment
to
the
deployment
environment.
4. Would you minding writing
all your models in Java?
Alice,
Data
Scien*st
Bob,
Data
Scien*st
Joe,
IT
I write all my models
in R, why don’t you
do the same?
I write all my
models in scikit-
learn, why don’t you
do the same?
5. Ways
to
Deploy
Models
into
Products/Services/Opera*ons
• Push
code.
• Embed
a
model
into
a
product
or
service.
• Export
and
import
tables
of
scores
• Export
and
import
tables
of
parameters
• Have
the
product/service
interact
with
the
model
as
a
web
or
message
service.
• Import
the
models
into
a
database
How
quickly
can
the
model
be
updated?
• Model
parameters?
• New
features?
• New
pre-‐
&
post-‐
processing?
7. What
is
an
Analy*c
Engine?
• An
analy*c
engine
is
a
component
that
is
integrated
into
products
or
enterprise
IT
that
deploys
analy*c
models
in
opera*onal
workflows
for
products
and
services.
• A
Model
Interchange
Format
is
a
format
that
supports
the
expor*ng
of
a
model
by
one
applica*on
and
the
impor*ng
of
a
model
by
another
applica*on.
• Model
Interchange
Formats
include
the
Predic*ve
Model
Markup
Language
(PMML),
the
Portable
Format
for
Analy*cs
(PFA),
and
various
in-‐house
or
custom
formats.
• Analy*c
engines
are
integrated
once,
but
allow
applica*ons
to
update
models
as
quickly
as
reading
a
a
model
interchange
format
file.
7
9. PMML
Philosophy
• PMML
is
a
XML
specifica/on
of
a
model,
not
an
implementa/on
of
a
model
• PMML
provides
a
simple
means
of
binding
parameters
to
values
for
an
agreed
upon
set
of
data
mining
models
&
transforma*ons
in
a
safe
way.
9
11. PFA
Philosophy
• Define
primi*ves
for
data
transforma*ons,
data
aggrega*ons,
and
sta*s*cal
and
analy*c
models.
• Support
composi*on
of
data
mining
primi*ves
(which
makes
it
easy
to
specify
machine
learning
algorithms
and
pre-‐/post-‐
processing
of
data).
• Be
extensible.
• Designed
to
be
“safe”
to
deploy
in
enterprise
IT
opera*onal
environments.
• This
is
a
philosophy
that
is
different
and
complementary
to
Predic*ve
Model
Markup
Language
(PMML).
11
12. PFA
Case
Study
1
• 20+
person
data
science
group
developing
models
in
R,
Python,
Scikit-‐learn,
MATLAB,
...
• All
the
data
scien*sts
export
their
model
in
PFA.
• The
company’s
product
imports
models
in
PFA
and
runs
on
their
customers
data
as
required.
Export
PFA
Import
PFA
Widget
records
Widget
scores
13. PFA
Func*onality
• PFA
codes
arbitrary
mathema*cal
algorithms
in
a
*ghtly
controlled
environment.
• PFA
has
all
the
standard
flow
control
of
a
programming
language:
if/then/else
&
for/while
loops.
• PFA
has
func*on
calls
and
func*on
call
backs
• PFA
has
algebraic
data
types.
• PFA
is
encoded
as
func*on
calls
in
JSON
{func*on:
[arg
1,
arg
2,
…,
arg
n]
}
13
14. Benefits
of
PFA
• PFA
is
based
upon
JSON
and
Avro
and
integrates
easily
into
modern
big
data
environments.
• PFA
allows
models
to
be
easily
chained
and
composed.
• PFA
allows
developers
and
users
users
of
analy*c
systems
to
pre-‐process
inputs
and
to
post-‐process
outputs
to
models.
• PFA
is
easily
integrated
with
Hadoop,
Spark,
etc.
• PFA
is
easily
integrated
with
Kaoa,
Storm,
Akka
and
other
streaming
environments.
• PFA
can
be
used
to
integrate
mul*ple
tools
applica*ons
within
an
analy*c
ecosystem.
17. PFA
Case
Study
2
• Two
teams
of
data
scien*sts
develop
analy*c
models
for
an
adversarial
analy*cs
project.
• Models
developed
in
Hadoop
and
exported
in
PFA
every
4
weeks.
• Models
updated
in
client
systems
every
2
weeks.
• It’s
not
quite
this
simple,
but
that’s
the
general
idea.
Export
PFA
Import
PFA
Event
records
Event
scores
Weeks
1-‐4,
5-‐8,
…
Weeks
3-‐6,
7-‐10,
…
Weeks
4,
6,
8,
10,
…
20. Gaussian
Process
Model
(3
of
5)
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
type
(also
Avro)
and
value
(as
JSON,
truncated)
Gaussian
Process
model
parameters
Source:
dmg.org/pfa
21. Gaussian
Process
Model
(4
of
5)
input: {type: array, items: double}
output: {type: array, items: double}
cells:
table:
type:
{type: array, items: {type: record, name: GP, fields: [
- {name: x, type: {type: array, items: double}}
- {name: to, type: {type: array, items: double}}
- {name: sigma, type: {type: array, items: double}}]}}
init:
- {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
- {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
- {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
...
- {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
calling
method:
parameters
expressed
as
JSON
input:
get
interpola*on
point
from
input
{cell:
table}:
get
parameters
from
table
null:
no
explicit
Kriging
weight
(universal)
{fcn:
…}:
kernel
func*on
Source:
dmg.org/pfa
22. Gaussian
Process
Model
(5
of
5)
• Appears
declara*ve,
but
this
is
a
func*on
call.
– Fourth
parameter
is
another
func*on:
m.kernel.rbf
(radial
basis
kernel,
a.k.a.
squared
exponen*al).
–
m.kernel.rbf
was
intended
for
SVM,
but
is
reusable
anywhere.
– One
argument
(gamma)
preapplied
so
that
it
fits
the
signature
for
model.reg.gaussianProcess.
• Any
kernel
func*on
could
be
used,
including
user-‐defined
func*ons
wriuen
with
PFA
“code.”
• The
Gaussian
Process
could
be
used
anywhere,
even
as
a
pre-‐
processing
or
post-‐processing
step.
model.reg.gaussianProcess:
- input
- {cell: table}
- null
- {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Source:
dmg.org/pfa
23. Genomics
dataset:
2.5+
PB
consis*ng
of
577,878
files
about
14,052
cases
(pa*ents),
in
42
cancer
types,
across
29
primary
sites.
2.5+
PB
of
cancer
genomics
data
+
Bionimbus
data
commons
technology
running
mul*ple
community
developed
variant
calling
pipelines.
Over
12,000
cores
and
10
PB
of
raw
storage
in
18+
racks
running
for
months.
Case
Study
4:
Analy*cOps
for
the
Genomic
Data
Commons*
Source:
Based
in
part
on
“The
Genomic
Data
Commons”,
the
GDC
team,
ms
in
prepara*on.
24. Dev Ops
• Virtualiza*on
and
the
requirement
for
massive
scale
out
spawned
infrastructure
automa*on
(“infrastructure
as
code”).
• Requirement
for
reducing
the
*me
to
deploying
code
created
tools
for
con*nuous
integra*on
and
tes*ng.
25. ModelDev AnalyticOps
• Use
virtualiza*on
/
containers,
infrastructure
automa*on
and
scale
out
to
support
large
scale
analy*cs.
• Requirement:
reduce
the
*me
and
cost
to
do
high
quality
analy*cs
over
large
amounts
of
data.
26. Sowware
Development
Quality
Assurance
Opera*ons
DevOps
The
goal
of
DevOps
is
to
establish
a
culture
and
an
environment
where
building,
tes*ng,
releasing,
and
opera*ng
sowware
can
happen
rapidly,
frequently,
and
more
reliably.*
*Adapted
from
Wikipedia,
en.wikipedia.org/wiki/DevOps.
27. Analy*c
Workflows
Quality
Assurance
Analy*c
Opera*ons
Analy*cOps
The
goal
of
Analy*cOps
is
to
establish
a
culture
and
an
environment
where
building,
valida*ng,
deploying,
and
running
analy*c
models
happen
rapidly,
frequently,
and
reliably.
• Sowware
• Model
• Data
Building
the
right
analy*c
model.
Is
the
analy*c
model
running
right?
Source:
Robert
L.
Grossman,
A
Quick
Introduc*on
to
Analy*cOps.
28. • Model
quality
(confusion
matrix)
• Data
quality
(six
dimensions)
• Lack
of
ground
truth
• Sowware
errors
• Monitoring
system
• Quality
of
workflow
and
scheduling
• Boulenecks,
stragglers,
hot
spots,
etc.
• Analy*c
configura*ons
problems
• System
failures
• Human
errors
Ten
Factors
Effec*ng
Analy*cOps
Source:
Based
in
part
on
“The
Genomic
Data
Commons”,
the
GDC
team,
ms
in
prepara*on.
29. Summary
• Deploying
analy*c
models
is
core
technical
competency.
• The
Portable
Format
for
Analy*cs
(PFA)
is
a
model
interchange
format
for
building
analy*c
models
in
one
environment
and
deploying
them
in
another
one.
• PFA
is
based
upon
data
mining
primi*ves
&
supports
pre-‐processing,
common
analy*c
models,
post-‐
processing,
&
composi*on
of
primi*ves
and
models.
• It
is
easy
to
add
your
own
PFA
func*ons
and
models.
• There
is
reference
implementa*on
&
compliance
tests.
• PFA
is
being
developed
by
the
not-‐for-‐profit
DMG.
• A
discipline
of
analy*cOps
is
emerging
that
supports
more
complex
analy*c
flows
at
greater
scale.
30. Ques*ons?
30
For
more
informa*on
about
PFA,
see:
dmg.org/pfa
rgrossman.com
@BobGrossman