The document provides an overview of end-to-end AI workflows using Skymind. It includes an agenda for a workshop covering topics like workflow scoping, data collection/preprocessing, model building, deployment considerations, and monitoring models in production. Challenges of applying machine learning in enterprises are discussed, such as different tool preferences between teams. The document also outlines model deployment scenarios including single node, multi-node clusters, hybrid/multi-cloud, and edge deployments.
Production-Ready BIG ML Workflows - from zero to hero (Daniel Marcous)
Data science isn't an easy task to pull off.
You start with exploring data and experimenting with models.
Finally, you find some amazing insight!
What now?
How do you transform a little experiment to a production ready workflow? Better yet, how do you scale it from a small sample in R/Python to TBs of production data?
"Building a BIG ML Workflow - from zero to hero" is about the work process you need to follow in order to have a production-ready workflow up and running.
Covering :
* Small - Medium experimentation (R)
* Big data implementation (Spark MLlib + pipelines)
* Setting Metrics and checks in place
* Ad hoc querying and exploring your results (Zeppelin)
* Pain points & Lessons learned the hard way (is there any other way?)
Efficiently Building Machine Learning Models for Predictive Maintenance in th... (Databricks)
Each drilling site has thousands of pieces of equipment operating simultaneously 24/7. In the oil & gas industry, downtime can cost millions of dollars daily. As current standard practice, most equipment is on scheduled maintenance with standby units to reduce downtime.
[DSC Europe 22] Engineers guide for shepherding models in to production - Mar... (DataScienceConferenc1)
Sometimes just creating a good model is not enough; we need to enable people to use it, and that often means making it part of a bigger system or somehow deploying it. This talk takes an engineering point of view on how we work with a data scientist, or a team of them, to make sure the model is production ready. A short checklist of things we do for each model: 1. understand what the model is trying to do/predict; 2. define all of the model's inputs and outputs; 3. define the point (both in time and as an integration point) in the wider system where the model is called; 4. define how we want to host the model. The engineering team usually helps make sure we can gather all of the model inputs and process all of the model outputs; we also make sure models are fast and reliable to call in a production environment and help optimize them for that. We also help enforce good engineering practices that rub off on data scientists and make them more efficient. In this talk we will see a few examples of how we do things and what to look for.
Training and deploying machine learning models with Google Clo... (Sotrender)
Okay, I already have my great model in a notebook; what now? Most machine learning courses and resources prepare us well for implementing ML algorithms and building more or less complicated models. In most cases, however, the model is only a small piece of a larger system, and deploying and maintaining it turns out in practice to be a time-consuming, error-prone process. The problem compounds when we have not one but several models to productize. Although more and more tools and platforms appear every year to streamline this process, it is still a topic that receives relatively little attention.
In my presentation I will show which approaches, good practices, and Google Cloud Platform tools and services we use at Sotrender to efficiently train and productize our ML models for analyzing social media data. I will discuss which DevOps aspects we pay attention to when building products based on ML models (MLOps), and how, using Google Cloud Platform, they can easily be adopted in your startup or company.
Presentation by Maciej Pieńkosz of Sotrender during Data Science Summit 2020.
In Data Engineer’s Lunch #89: Machine Learning Orchestration with Airflow, we discussed using Apache Airflow to manage and schedule machine learning tasks. By following the best practices of MLOps, teams can streamline their ML workflows and build scalable, efficient, and accurate models that deliver real-world business value. Properly implemented MLOps can help organizations stay ahead of the curve and achieve their goals in the fast-paced world of machine learning. Apache Airflow is an open-source tool for scheduling and automating workflows. Airflow allows you to define workflows in Python, with tasks defined as Python functions that can include Operators for all sorts of external tools. This makes it easy to automate repeated processes and define dependencies between tasks, creating directed acyclic graphs of tasks that can be scheduled with cron syntax or at fixed intervals. Airflow also features a user-friendly UI for monitoring task progress and viewing logs, giving you greater control over your data pipeline.
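Airflow itself expresses this with `DAG` objects and Operators; as a library-free sketch of the underlying idea (tasks as Python functions, dependencies as edges of a directed acyclic graph, execution in topological order), consider the following. The task names and `deps` map are illustrative, not from any real pipeline.

```python
# Minimal sketch of the DAG idea behind Airflow, in plain Python:
# tasks are functions, dependencies form a directed acyclic graph,
# and tasks run only after all of their upstream tasks have run.

def extract():   return "raw data"
def train(x):    return f"model({x})"
def serve(m):    return f"serving {m}"

# Dependency map: task -> list of upstream tasks (the DAG edges).
deps = {"extract": [], "train": ["extract"], "serve": ["train"]}

def topo_order(deps):
    """Return tasks ordered so every task follows its upstreams."""
    order, seen = [], set()
    def visit(t):
        if t in seen:
            return
        seen.add(t)
        for up in deps[t]:
            visit(up)
        order.append(t)
    for t in deps:
        visit(t)
    return order

print(topo_order(deps))  # ['extract', 'train', 'serve']
```

In real Airflow the same shape is written as `extract >> train >> serve` between Operator instances, and the scheduler handles ordering, retries, and cron-based triggering.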
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure (Fei Chen)
ML platform meetups are quarterly meetups, where we discuss and share advanced technology on machine learning infrastructure. Companies involved include Airbnb, Databricks, Facebook, Google, LinkedIn, Netflix, Pinterest, Twitter, and Uber.
Data Engineer's Lunch 89: Machine Learning Orchestration with Airflow (Anant Corporation)
In Data Engineer's Lunch 89, Obioma Anomnachi will discuss how to manage and schedule Machine Learning operations via Airflow. Learn how you can write complete end-to-end pipelines starting with retrieving raw data to serving ML predictions to end-users, entirely in Airflow.
A short summary describing the major guiding principles of each of the five pillars and key actions that can be taken based on the key points mentioned
Building machine learning muscle in your team & transitioning to make them do machine learning at scale. We also discuss about Spark & other relevant technologies.
In this talk we'll look at simple building-block techniques for predicting metrics over time based on past data, taking into account trend, seasonality and noise, using Python with Tensorflow.
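The talk uses TensorFlow; as a pure-Python sketch of the same building blocks (fit a linear trend, average the detrended residuals per season slot, extrapolate trend plus seasonality), with an illustrative synthetic series:

```python
# Building-block forecast: least-squares trend + per-slot seasonal
# means, then extrapolate. Noise handling is omitted for brevity.

def forecast(series, period, steps):
    n = len(series)
    # Least-squares line y = a*t + b for the trend.
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    a = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series)) \
        / sum((t - t_mean) ** 2 for t in range(n))
    b = y_mean - a * t_mean
    # Seasonal component: mean detrended value per slot in the cycle.
    slots = [[] for _ in range(period)]
    for t, y in enumerate(series):
        slots[t % period].append(y - (a * t + b))
    season = [sum(s) / len(s) for s in slots]
    # Forecast = trend extrapolation + repeating seasonal pattern.
    return [a * t + b + season[t % period] for t in range(n, n + steps)]

# Synthetic series: trend 0.5 per step plus a period-4 pattern.
history = [10 + 0.5 * t + [1, -1, -1, 1][t % 4] for t in range(12)]
print([round(v, 2) for v in forecast(history, period=4, steps=4)])
# [17.0, 15.5, 16.0, 18.5]
```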
Lessons learnt and system built while solving the last mile problem in machine learning - taking models to production. Used for the talk at - http://sched.co/BLvf
As cloud adoption has grown rapidly in the last decade, how can DBAs add more value to systems and bring more scalability to the DB server? This talk was presented at the Open Source India 2018 conference by Kabilesh and Manosh of Mydbops. They share experiences and the value added for customers during their consulting work.
NLP Text Recommendation System Journey to Automated Training (Databricks)
This talk will cover how we built and productionized automated machine learning pipelines at Salesforce. Starting with heuristics to automated retraining using technologies including but not limited to Scala, Python, Apache Spark, Docker, Sagemaker for training, and serving. We will walk through the generally applicable data prep, feature engineering, training, evaluation/comparisons, and continuous model training including data feedback loops in containerized environments with Sagemaker. We will talk about our deployment and validation approach. Finally, we’ll draw lessons from iteratively building an enterprise ML product. Attendees will learn about the mental models for building end to end prod ML pipelines and GA ready products.
Using Machine Learning & Artificial Intelligence to Create Impactful Customer... (Costanoa Ventures)
Jeremy Hermann discusses Uber’s ML-as-a-service platform (Michelangelo) and how they designed it to cover the end-to-end ML workflow: manage data, train, evaluate, and deploy models, make predictions, monitor predictions and support traditional ML models, time series forecasting, and deep learning.
Deploying signature verification with deep learning (Adam Gibson)
Presentation covered building a signature verification system and deploying it to production, including resource usage as well as how the model was picked.
Meetup held in Tokyo with Deep learning Otemachi.
Self driving computers active learning workflows with human interpretable ve... (Adam Gibson)
Human in the loop learning workflows leveraging deep learning to group and cluster data. Also, techniques for accounting for machine learning failures.
Anomaly Detection and Automatic Labeling with Deep Learning (Adam Gibson)
Adam Gibson demonstrates how to use variational autoencoders to automatically label time series location data. You'll explore the challenge of imbalanced classes and anomaly detection, learn how to leverage deep learning for automatically labeling (and the pitfalls of this), and discover how you can deploy these techniques in your organization.
Recent presentation on Deeplearning4j's new features as well as some underused features of the AI framework, like Arbiter, DataVec's transform process, and libnd4j.
This talk was on deep learning use cases outside of computer vision. It also covered larger scale patterns of what good deep learning use cases typically look like. We end up on an explanation of anomaly detection and various kinds of anomaly use cases.
Distributed deep RL on Spark - Strata Singapore (Adam Gibson)
This talk briefly covers deep reinforcement learning on Spark and the benefits of using large-scale commodity compute with GPUs for ease of running simulations, as well as distributed training for use cases that aren't games, such as network intrusion and risk. The talk also briefly mentions RL4J and our work with OpenAI Gym.
Deep learning in production with the best (Adam Gibson)
Getting deep learning adopted at your company. The current landscape of academia vs industry. Presentation at AI with the best (online conference):
http://ai.withthebest.com/
Strata Beijing - Deep Learning in Production on Spark (Adam Gibson)
Recent talk at Strata Beijing, half English and half Chinese, covering use cases of deep learning, deep learning in production, and the different components of Deeplearning4j.
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
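All of these optimizations start from the baseline power-iteration loop. A minimal sketch, with an illustrative three-node graph and the usual damping factor of 0.85:

```python
# Baseline PageRank by power iteration; the optimizations described
# above (skipping converged vertices, short-circuiting chains,
# component ordering) refine this loop.

def pagerank(out_links, d=0.85, tol=1e-10):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    while True:
        # Dangling nodes (no out-links) spread their rank uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new[w] += d * rank[v] / len(out_links[v])
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:
            return new
        rank = new

graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
ranks = pagerank(graph)
print(round(sum(ranks.values()), 6))  # ranks always sum to 1.0
```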
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
3. Schedule
● 9:30-10:00: Doors Open
● 10:00-11:20: Workflow Overview
● 11:20-11:30: Break
● 11:30-12:00: Hands-on walkthrough of training models
● 12:00-13:00: Lunch
● 13:00-13:50: Deployment Scenarios/Considerations
● 13:50-14:00: Break
● 14:00-14:50: Serving Models Hands-on
● 15:00-15:50: Different Ways of Serving Models Overview
● 16:00-16:15: Wrap
4. Workflow Overview
● Organize and define issues and problems
● Organize the expected effects
● Feasibility study
● Experimental design
● Data collection
● Data organization / analysis / pre-processing
● Build a baseline model
● Redo preprocessing, improve the model
● Tuning
● Evaluation
● Environment construction & deployment
● Post-deployment monitoring and model relearning
※ We will provide training that covers all of these in detail.
5. | Challenges of ML in Enterprise
● Different teams have different technology preferences
● A deep learning framework good for experimentation does not necessarily make a good production framework
● Data scientists are great at experimentation, less so at implementing production-ready models and the DevOps that requires
● The engineering/DevOps team ends up rewriting models for the production environment

Data Science Team:
● Write deep learning workflows and models in notebooks or another development IDE

Engineering/DevOps Team:
● Refactor or rewrite models for production environments
● Automate training and optimization jobs
● Deploy models
6. Tools Overview
| Ecosystem
● ML/DL frameworks: Keras, TensorFlow, Deeplearning4J, Scikit-Learn, and others
● Model development workflows: feature extraction, model import, model training, hyperparameter optimization, retraining, model performance monitoring, model versioning, job scheduling, ...
● Runtime
7. Managing AI models over time
● More than a REST API
● Model calibration and input outlier detection
● Monitor inputs and adapt to changes in evolving data
Learning loop in production: AI model → decisions → retraining → hot swap, supported by A/B testing, performance monitoring, and human-in-the-loop feedback.
8. Scoping a Project
● Organize and define issues and problems
● Organize the expected effects
● Feasibility study
● Experimental design
9. | Feasibility study, experimental design
● It usually takes just a few weeks, and it's quick.
● Isn't someone already doing similar tasks somewhere?
● How much is possible?
● What data do you prepare, what techniques do you use, and how do you evaluate it?
● What kind of team configuration do you implement? What skills do you need?
● What kind of infrastructure do you need?
10. | Data
● Data Collection
● Data organization / analysis / pre-processing
● EDA
● Data Quality Assessment
11. | ETL/Data Collection
● Understand your data sources
● Do you have enough of the right kind of data?
● Can you access the data? (regulations)
● What is your vectorization pipeline?
● What are the expected data volumes we accumulate over time? Per day? Per month?
12. | Modeling
● Build a baseline model
● Refine and tune
● Try different architectures after the baseline
● Rinse and repeat as necessary, evaluating your model based on at minimum a train/test/validation split
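A baseline can be as simple as predicting the most common class; any real model must beat it. A minimal sketch with illustrative labels:

```python
# Trivial baseline: always predict the most common training class.
# Its accuracy on held-out data is the floor any real model must beat.

from collections import Counter

def majority_baseline(train_labels, test_labels):
    guess = Counter(train_labels).most_common(1)[0][0]
    acc = sum(y == guess for y in test_labels) / len(test_labels)
    return guess, acc

train = ["spam", "ham", "ham", "ham", "spam"]
test = ["ham", "ham", "spam", "ham"]
print(majority_baseline(train, test))  # ('ham', 0.75)
```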
13. | About model evaluation
● Is it okay to always measure with the same accuracy metric?
○ A diagnostic model in the medical field vs. automated driving vs. customer demand forecasting, etc.
● It is necessary to change the evaluation method according to the problem to be solved
○ About data splits
○ About the distribution of the data
14. | Difference between learning and production use (inference)
● Basically, in a production environment, ML models do not update
● Models are only updated during training
● In a production environment, AI models are frozen
○ Load the learned model from disk in advance and expand it in memory so that it can respond to requests at any time
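A sketch of this "frozen model" pattern: the model is serialized at training time, loaded from disk once at startup, and held in memory while requests only run inference. The file name and toy model class are illustrative, not from the deck.

```python
# Frozen-model pattern sketch: train once, serialize to disk, then
# load once at serving startup and answer many inference requests.

import os, pickle, tempfile

class Model:
    def __init__(self, weight):
        self.weight = weight
    def predict(self, x):
        return self.weight * x  # inference only; no updates in prod

# Training side: serialize the learned model to disk.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(Model(weight=2.0), f)

# Serving side: load once at startup, keep in memory.
with open(path, "rb") as f:
    frozen = pickle.load(f)

print([frozen.predict(x) for x in [1, 2, 3]])  # [2.0, 4.0, 6.0]
```

Real frameworks use their own formats (TensorFlow SavedModel, ONNX, DL4J zips), but the lifecycle is the same: write at training time, read once at serving time.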
15. | Evaluation Data
● Always understand your training data
● What kind of evaluation data do you need?
● Data actually used by the company (production data)
● How much do you need?
● The more the better
● Pay particular attention to seasonality when collecting evaluation data
16. | Preparation of Evaluation Data
● What is generalization performance?
● Why is generalization performance important?
● How do you measure it?
Train / Dev / Test
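The train/dev/test split above can be sketched in a few lines: shuffle once with a fixed seed, then cut into three parts. The fractions and seed are illustrative choices.

```python
# Minimal train/dev/test split: dev guides tuning, test is touched
# only once at the end to estimate generalization performance.

import random

def split(data, dev_frac=0.15, test_frac=0.15, seed=42):
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed: reproducible split
    n = len(items)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    return (items[: n - n_dev - n_test],             # train
            items[n - n_dev - n_test : n - n_test],  # dev
            items[n - n_test :])                     # test

train, dev, test = split(range(100))
print(len(train), len(dev), len(test))  # 70 15 15
```

For data with seasonality (as the previous slide warns), a time-based split is usually safer than a random shuffle.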
17. | Deployment
● Environment Construction and Setup
● Intended usage scenarios (Batch, Real time,..)
● Resource usage understanding/provisioning
● Post-deployment monitoring and model relearning
19. Deploy ML
● Model learning (training) produces two artifacts on disk: a model definition file and a weight file.
● The AI server loads the model definition, weights, and config to serve predictions.
※ Weight files are small: dozens of megabytes, with large ones close to a GB.
※ For real-time processing, deploy the model in memory in advance so requests do not have to wait for loading.
● Cloud APIs work the same way in principle.
20. | About AI execution environment
● Local machine
● Embedded (Raspberry Pi, phone, ...)
● Python script
● On-prem server as part of an application
● AI workflow platform (SKIL, Michelangelo, FBLearner, SageMaker, ...)
21. Model Deployment & Maintenance
● What does it mean to deploy a learned model?
● Where does the need for model maintenance come from?
● This covers the last step left after getting a good model.
22. | Monitoring After Deployment
● Why do we need monitoring?
● Concept drift
● Examples:
○ Marketing based on customers' online shopping behavior
○ Fraud detection
○ Data with seasonality
※ As a premise, understand that data is ever changing
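One simple way to monitor changing inputs is to compare a live window of a feature against its training distribution. This is a minimal sketch; the feature values, window size, and threshold of 2 standard deviations are illustrative choices, and real systems use richer tests (PSI, KS tests) over many features.

```python
# Simple drift alarm on one input feature: score the live window's
# mean shift in units of the training standard deviation.

def drift_score(train_values, live_window):
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((v - mean) ** 2 for v in train_values) / n
    std = var ** 0.5
    live_mean = sum(live_window) / len(live_window)
    return abs(live_mean - mean) / std  # z-score of the live mean

train_values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
print(drift_score(train_values, [10, 11, 9, 10]) < 2)   # True: stable
print(drift_score(train_values, [17, 18, 16, 17]) > 2)  # True: drifted
```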
23. Human in the loop
Work on problem solving under the assumption that the AI model will never be 100% accurate.
Even if the accuracy is not 100%, the desired effect can still be achieved.
One method for doing this is the idea of human in the loop.
24. | Human-in-the-loop concept
● Instead of aiming for 100% accuracy (generally not possible)
● Have a recovery plan for when models fail. Prefer feedback.
Flow: input → AI model → decision. Right? Good. Wrong? Collect feedback.
25. | Calibrate your Models
● A requirement for realizing human in the loop
● Many recent deep learning models are not correctly calibrated
● Proper confidence is needed for model predictions
● Mandatory for determining whether to intervene in the workflow
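Once confidences are calibrated, the intervention decision can be a simple threshold: trust the model above it, route to a human below it. A minimal sketch; the threshold of 0.9 and the example cases are illustrative.

```python
# Human-in-the-loop routing sketch: auto-accept the model's decision
# only when its (calibrated) confidence clears a threshold; otherwise
# a human reviews the case and feedback is stored for retraining.

def route(prediction, confidence, threshold=0.9):
    if confidence >= threshold:
        return ("auto", prediction)   # model decides
    return ("human", prediction)      # human reviews, feedback stored

cases = [("approve", 0.97), ("reject", 0.62), ("approve", 0.91)]
print([route(p, c)[0] for p, c in cases])  # ['auto', 'human', 'auto']
```

This is exactly why calibration matters: if the model reports 0.95 confidence while being right only 70% of the time, the threshold routes the wrong cases to humans.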
27. | A bit more on Concept Drift
A phenomenon in which the characteristics of the data continue to change over time, so that the characteristics of the data used for learning and of the latest data differ.
Examples of data with seasonality:
● Customer movement in an online shop
○ Affects online advertising and promotion decisions.
● Product design document data
○ The format of the design documents, the contents of the descriptions, etc. keep changing, which affects reading accuracy.
28. | Countermeasures for Concept Drift
● In day-to-day operation, store feedback from the operators.
● Perform continuous / periodic model performance checks and relearning with newly stored data.
● Continuously monitor a key metric for the frequency of feedback.
● Sometimes batch retraining or tuning the model might be better.
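The stored operator feedback can drive the retraining decision directly: keep a sliding window of "model was wrong" signals and trigger retraining when the recent error rate exceeds a threshold. A minimal sketch; the window size, threshold, and feedback stream are illustrative.

```python
# Retraining trigger sketch: sliding window of operator feedback
# (1 = model was wrong), retrain when the error rate degrades.

from collections import deque

class RetrainMonitor:
    def __init__(self, window=100, threshold=0.2):
        self.errors = deque(maxlen=window)  # only the recent window
        self.threshold = threshold

    def record(self, model_was_wrong):
        self.errors.append(1 if model_was_wrong else 0)

    def should_retrain(self):
        if not self.errors:
            return False
        return sum(self.errors) / len(self.errors) > self.threshold

mon = RetrainMonitor(window=10, threshold=0.2)
for wrong in [0, 0, 1, 0, 1, 1, 0, 1]:   # recent feedback stream
    mon.record(wrong)
print(mon.should_retrain())  # True: error rate 4/8 = 0.5 > 0.2
```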
30. | Different Types of Hosts
● Bare metal
● Virtual machine (VM): virtualization infrastructure creates, manages, and runs one or more virtual machines on a host machine via a hypervisor (example: Xen, KVM)
● Container: an orchestration platform manages, provisions, and scales containerized applications via an orchestrator (example: Kubernetes, Marathon)
● Configuration manager / automation engine (example: Ansible): an automation platform performs configuration management, software provisioning, and application deployment

31. | Server Virtualization
● A server is comprised of a set of hosts.
● Each server can be a combination of bare metal servers, virtual machines, and containers.
● Each host has a cluster membership indicating which cluster it belongs to.
32. | Configuration
Single server, multiple servers, or hybrid cloud.
● High availability using ZooKeeper
● Distributed training and inference on multi-GPUs via Spark
● Integration with the JVM ecosystem (Hadoop)
● Many SKIL server components are also embeddable in Java applications
33. | Single Node
Architecture:
● Simple architecture for getting started
● Cost-effective solution for customers with less than 100GB of data
● Perfect for a DGX-1 type system
Components on one machine (CPU/GPU, OS): ZooKeeper, SKIL training workspaces, SKIL deployments, data exploration / training, and SKIL data connectors.
34. | Multi-Node Training Cluster
Scaled-out training cluster architecture:
● Any midrange VM or dedicated machine for ZooKeeper
● 1 or more multi-GPU systems (DGX class or similar) for SKIL
● Gluster/HDFS provides a global file system for data
35. | Hybrid Cloud
Hybrid cloud cluster (cloud data storage: Amazon S3, Azure Blob) architecture:
● DGX-1 servers for SKIL with 8 P100/V100 GPUs
● An existing Hadoop cluster is used by SKIL for:
○ ETL (preparing data for training on GPU), or
○ Batch inference for distributed scoring with trained models
36. | Hybrid Cloud
Hybrid cloud cluster (cloud servers: AWS EC2, GCP) architecture:
● DGX-1 servers for SKIL with 8 P100/V100 GPUs
● Private / public clouds such as AWS EC2 or Azure VMs to serve models
37. | Multi-Cluster
Architecture:
● GPU training cluster: powerful GPU servers or a Spark cluster for training models
● CPU inference cluster: separate (multiple) deployment-only clusters for production deployments of ML models as REST APIs
39. | Edge Deployment
Edge deployment cluster (training, inference) on embedded systems. Architecture:
● SKIL deployed on edge devices (robots, tablets, cars) for:
○ training of lightweight models
○ retraining of models / transfer learning
● Alternatively, models can be trained on a central server; agents on end devices swap models in and out of the devices
● A central SKIL server navigates and coordinates between the edge devices
40. Various deployment methods

Local machine (execution on the RPA robot execution machine):
・Low introduction cost; no need for additional infrastructure investment
・The period until introduction is short
・Flexible response to individual cases; flexibly customizable
・Individual environment construction is troublesome
・Scalability is low; maintenance cost is high; machine specification is low

On-prem server:
・Flexible response to individual cases; flexibly customizable
・Scalability is high; machine specs are high
・Individual environment construction is relatively easy
・Requires infrastructure investment; maintenance cost is high

Cloud API:
・Low introduction cost; the period until introduction is short
・No need for model preparation; environment construction is easy
・Pay-per-use
・Resolvable issues are limited; customization is difficult
41. | Other considerations
● Precision/memory trade-off for models
● Model compression for large models / constrained environments
● Hardware support (the TPU does not support all ops; certain things only run on ARM/Intel, ...)
● Model quantization is useful; optimization is becoming more common
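The precision/memory trade-off behind quantization can be made concrete: map float32 weights to 8-bit integers with a scale and zero point, cutting the footprint roughly 4x at the cost of bounded rounding error. A minimal sketch of affine quantization; the weight values are illustrative.

```python
# Affine int8 quantization sketch: w ≈ q * scale + lo, with q in
# [0, 255]. Reconstruction error is bounded by half a quantization step.

def quantize(weights, bits=8):
    lo, hi = min(weights), max(weights)
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax or 1.0   # guard against zero range
    q = [round((w - lo) / scale) for w in weights]  # ints in [0, 255]
    return q, scale, lo

def dequantize(q, scale, lo):
    return [v * scale + lo for v in q]

w = [-0.51, 0.13, 0.98, -0.27, 0.44]
q, scale, lo = quantize(w)
restored = dequantize(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(w, restored))
print(max_err <= scale / 2)  # True: error bounded by half a step
```

Production toolchains (e.g. TensorFlow Lite) apply the same idea per tensor or per channel, often with calibration data to pick ranges.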