AutoDC + AutoML = Your AI Dev Superpower
Zac Yung-Chun Liu, Andromeda 360 AI, zac@a360.ai
Scott Tarlow, Hypergiant, scott.tarlow@hypergiant.com
Data Con LA 2022, AI / ML / Data Science Track
Talk outline
● Introduction: model-centric vs data-centric
● AutoDC (Automated data-centric processing)
● AutoDC + AutoML
● Data-centric + AutoML
● Discussions
Introduction
AI = Code + Data
● Model-centric approach: model generation and hyperparameter tuning (Optuna, Hyperopt, Bayesian Optimization, etc.)
● Data-centric approach: systematically engineering the data used to build an AI system (Array, Featuretools, SMOTE, D3M, etc.)
Improvement (accuracy):
● Model-centric approach: 85% → 87%
● Data-centric approach: 85% → 95%
● Model-centric approach → AutoML
● Data-centric approach → AutoDC
Ideation of AutoDC framework
Model-centric AI: input data → AutoML → output prediction
AutoML:
1. Data preprocessing
2. Feature engineering
3. Model generation
4. Hyperparameter tuning
Data-centric AI: labeled dataset → AutoDC → improved dataset
AutoDC:
1. Label correction
2. Edge case selection
3. Data augmentation
Presented at NeurIPS 2021
AutoDC workflow
Labeled dataset → AutoDC → improved dataset → ML model or AutoML
AutoDC steps:
1. Embedding creation: ResNet50, t-SNE
2. Outlier detection: Isolation Forest
3. Label correction: human in the loop
4. Edge case selection: optimized ratio
5. Data augmentation: Keras data generator
Currently supports only computer vision
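The outlier detection step can be sketched as follows. This is a minimal illustration, not AutoDC's actual code: synthetic 2-D points stand in for the ResNet50 + t-SNE embeddings, and fitting one Isolation Forest per class, the `contamination` value, and the helper name `flag_outliers_per_class` are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def flag_outliers_per_class(embeddings, labels, contamination=0.05, seed=0):
    """Return a boolean mask of samples that look atypical for their own label."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    flagged = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        forest = IsolationForest(contamination=contamination, random_state=seed)
        # fit_predict returns -1 for outliers and 1 for inliers
        flagged[idx] = forest.fit_predict(embeddings[idx]) == -1
    return flagged


# Synthetic stand-in for image embeddings: two tight clusters, one per class.
rng = np.random.default_rng(0)
class_a = rng.normal(loc=(-1.0, -1.0), scale=0.05, size=(100, 2))
class_b = rng.normal(loc=(1.0, 1.0), scale=0.05, size=(100, 2))
embeddings = np.vstack([class_a, class_b])
labels = np.array([0] * 100 + [1] * 100)

# Simulate two label errors: each point sits in the other class's cluster.
embeddings[0] = (1.0, 1.0)
embeddings[100] = (-1.0, -1.0)

flagged = flag_outliers_per_class(embeddings, labels)
review_queue = np.flatnonzero(flagged)  # indices to hand to a human reviewer
```

Samples in `review_queue` would then go to the human-in-the-loop step, where each one is either relabeled or kept as an edge case.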
AutoDC example
[Figure: three scatter plots of image embeddings (panels A, B, C); both axes span -1.5 to 2.0.]
Embeddings → outlier detection → identify incorrect labels and edge cases
Data: Roman numerals
github.com/gohypergiant/AutoDC
AutoDC on 3 example datasets
● Roman numerals: 3,000 images / 10 classes
● Asirra dog vs. cat: 25,000 images / 2 classes
● Stanford parasitic snail: 5,000 images / 4 classes
AutoDC improvement: image classification
Fixed model: ResNet50
Accuracy, unmodified data → improved data:
● Roman numerals (3,000 images / 10 classes): 65% → 80%
● Stanford parasitic snail (5,000 images / 4 classes): 72% → 82%
● Asirra dog vs. cat (25,000 images / 2 classes): 81% → 95%
→ 10-15 percentage-point improvement
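The improvement range can be checked with a few lines of arithmetic. The sketch below assumes the six chart values pair up per dataset in the order shown on the slide (unmodified first, improved second); the function names are illustrative.

```python
# Accuracy (%) as (unmodified data, improved data) pairs.
# Assumption: values are paired with datasets in the order they appear on the slide.
accuracy = {
    "Roman numerals": (65, 80),
    "Stanford parasitic snail": (72, 82),
    "Asirra dog vs. cat": (81, 95),
}


def absolute_gain(before, after):
    """Gain in percentage points."""
    return after - before


def relative_gain(before, after):
    """Gain as a percentage of the starting accuracy."""
    return 100.0 * (after - before) / before


gains = {name: absolute_gain(*pair) for name, pair in accuracy.items()}

for name, (before, after) in accuracy.items():
    print(f"{name}: +{absolute_gain(before, after)} points "
          f"({relative_gain(before, after):.1f}% relative)")
```

Under that pairing the gains are 15, 10, and 14 percentage points, which matches the 10-15 range quoted on the slide when it is read as an absolute gain.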
AutoDC: 80% time saved
Processing time, AutoDC vs. manual process:
● Roman numerals (3,000 images / 10 classes): 1h vs. 4h
● Stanford parasitic snail (5,000 images / 4 classes): 2h vs. 9h
● Asirra dog vs. cat (25,000 images / 2 classes): 1h vs. 6h
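The headline figure can be reproduced from the chart values. This sketch assumes each dataset's two bars read as (AutoDC hours, manual hours) in the order shown; the helper name is illustrative.

```python
# Hours as (AutoDC, manual process) pairs.
# Assumption: values are paired with datasets in the order they appear on the slide.
hours = {
    "Roman numerals": (1, 4),
    "Stanford parasitic snail": (2, 9),
    "Asirra dog vs. cat": (1, 6),
}


def fraction_saved(auto_hours, manual_hours):
    """Share of the manual effort that AutoDC avoids."""
    return 1 - auto_hours / manual_hours


per_dataset = {name: fraction_saved(*pair) for name, pair in hours.items()}

total_auto = sum(auto for auto, _ in hours.values())
total_manual = sum(manual for _, manual in hours.values())
overall = fraction_saved(total_auto, total_manual)

for name, saved in per_dataset.items():
    print(f"{name}: {saved:.0%} saved")
print(f"Overall: {overall:.0%} saved")
```

With these pairs the savings are 75%, about 78%, and about 83% per dataset, and about 79% over all three (4 hours instead of 19), consistent with the rounded 80% in the slide title.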
AutoDC limitations
● The parameters in AutoDC still require fine-tuning
● Users need to identify a training model in advance
+ AutoML → fill the gap
(1) AutoDC + AutoML
Input data → AutoDC → improved data → AutoML
● Use AutoDC and AutoML as separate components
(2) AutoDC + AutoML
Inside AutoML: input data → AutoDC → improved data → ML model (hyperparameter tuning), with a fine-tune loop back to AutoDC
● Include AutoDC as one of the search components in AutoML
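Option (2) can be pictured as a single search loop whose search space mixes AutoDC settings with model hyperparameters. The sketch below is purely illustrative: the parameter names, their candidate values, the random-search strategy, and the `toy_evaluate` objective are all hypothetical, since the talk describes this mode as still in development.

```python
import random

# Hypothetical joint search space: AutoDC settings next to model hyperparameters.
SEARCH_SPACE = {
    "outlier_contamination": [0.01, 0.05, 0.10],  # AutoDC: share of samples flagged
    "edge_case_ratio": [0.0, 0.25, 0.5],          # AutoDC: share of flagged samples kept
    "augmentation_factor": [1, 2, 4],             # AutoDC: augmented copies per image
    "learning_rate": [1e-4, 1e-3, 1e-2],          # model
    "batch_size": [16, 32, 64],                   # model
}


def sample_config(rng):
    """Draw one value per parameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def random_search(evaluate, n_trials=20, seed=0):
    """Return the best (config, score) found over n_trials random draws."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Stand-in for: run AutoDC with these settings, train, return validation accuracy."""
    return (-abs(config["outlier_contamination"] - 0.05)
            - abs(config["learning_rate"] - 1e-3))


best_config, best_score = random_search(toy_evaluate, n_trials=50)
```

In a real setup `toy_evaluate` would be replaced by the expensive step (data improvement plus training), which is why the choice of search strategy matters for the time savings the slides anticipate.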
AutoDC + AutoML (1): separate components
Accuracy, unmodified data → improved data → improved data + Google AutoML:
● Roman numerals (3,000 images): 65% → 80% → 85%
● Stanford parasitic snail (5,000 images): 72% → 82% → 85%
● Asirra dog vs. cat (25,000 images): 81% → 95% → 97%
15-20% improvement (additional 2-5% ↑)
AutoML run: 8-20 hours
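The additional gain from AutoML can be separated from the AutoDC gain with simple differences. The sketch assumes the nine chart values read as (unmodified, improved, improved + Google AutoML) per dataset, in slide order.

```python
# Accuracy (%) as (unmodified, improved data, improved data + Google AutoML).
# Assumption: values are grouped per dataset in the order they appear on the slide.
accuracy = {
    "Roman numerals": (65, 80, 85),
    "Stanford parasitic snail": (72, 82, 85),
    "Asirra dog vs. cat": (81, 95, 97),
}

# Points added by AutoML on top of the AutoDC-improved data.
automl_gain = {name: final - improved
               for name, (_, improved, final) in accuracy.items()}

# Points added by the whole AutoDC + AutoML pipeline.
total_gain = {name: final - unmodified
              for name, (unmodified, _, final) in accuracy.items()}
```

Under this reading AutoML adds 5, 3, and 2 points, matching the "additional 2-5%" note, and the end-to-end gains are 20, 13, and 16 points.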
AutoDC + AutoML (2) - fine-tune AutoDC in AutoML
XX% improvement ?
More time saved
● Still in development (future work)
● Expected to be more time efficient
Data-centric approach + AutoML
Data-centric approach + AutoML : Preventive Maintenance
on Aircraft Engines - Imbalanced Classification
We treat the encoding of categorical variables as a tunable parameter, similar to the dropout rate (a neural network regularization tool). This plot shows that reducing dropout (lowering bias) works better than encoding categorical variables (increasing variance). This suggests that the categorical variables may not contribute much to a good model.
Further evidence: the random_state parameter ranks as more important than encoding, which indicates that the encoded variables did not contribute meaningfully to a strong model.
Dropping them lets us build less complex models, which are less susceptible to drift in production.
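The idea of treating encoding as one more search parameter, then ranking parameters by importance, can be sketched as below. Everything here is hypothetical: the search space, the `toy_score` function (constructed so that it mirrors the ranking described on the slide, not taken from the aircraft-engine data), and the simple importance measure (the spread of mean scores across a parameter's values), which is a stand-in for whatever importance metric the AutoML tool reports.

```python
import itertools
import statistics

# Hypothetical search space: a data decision (encoding) sits next to model settings.
SEARCH_SPACE = {
    "encode_categoricals": [False, True],
    "dropout_rate": [0.1, 0.3, 0.5],
    "random_state": [0, 1, 2],
}


def toy_score(config):
    """Made-up validation score; a real run would train and evaluate a model."""
    score = 0.90 - 0.2 * config["dropout_rate"]   # less dropout helps in this toy
    score += 0.01 * (config["random_state"] - 1)  # seed-to-seed variation
    score += 0.001 if config["encode_categoricals"] else 0.0  # encoding barely matters
    return score


def marginal_importance(space, score_fn):
    """Importance = range of the mean score across each parameter's values."""
    names = list(space)
    trials = []
    for values in itertools.product(*space.values()):
        config = dict(zip(names, values))
        trials.append((config, score_fn(config)))
    importance = {}
    for name in names:
        means = [
            statistics.mean(s for c, s in trials if c[name] == value)
            for value in space[name]
        ]
        importance[name] = max(means) - min(means)
    return importance


importance = marginal_importance(SEARCH_SPACE, toy_score)
ranking = sorted(importance, key=importance.get, reverse=True)
```

When a nuisance parameter such as `random_state` outranks a data-preparation choice, as it does in this toy setup, that is a signal the preparation step can be dropped to simplify the model.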
Discussions
● The AutoDC framework is modular and flexible, and can be updated with newly developed ML techniques
● (1) AutoDC + AutoML, (2) data-centric approach + AutoML
→ automate most of the manual processes in ML development
● Low-code/no-code ML solutions for domain experts
● Only hard requirement: compute resources
Call for open source contributions
github.com/gohypergiant/AutoDC
Project sponsors
● Hypergiant (hypergiant.com): ML/DS services; focus on space, defense, and critical infrastructure
● A360 AI (a360.ai): open and modular ML platform (A360); focus on single-touch ML deployment (Starpack)

More Related Content

Similar to Data Con LA 2022 - AutoDC + AutoML = your AI development superpower

2020 09-16-ai-engineering challanges
2020 09-16-ai-engineering challanges2020 09-16-ai-engineering challanges
2020 09-16-ai-engineering challangesIvica Crnkovic
 
Machine learning on streams of data
Machine learning on streams of dataMachine learning on streams of data
Machine learning on streams of dataTomasz Sosiński
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AINing Jiang
 
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...Amazon Web Services
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesStitch Fix Algorithms
 
LLMOps for Your Data: Best Practices to Ensure Safety, Quality, and Cost
LLMOps for Your Data: Best Practices to Ensure Safety, Quality, and CostLLMOps for Your Data: Best Practices to Ensure Safety, Quality, and Cost
LLMOps for Your Data: Best Practices to Ensure Safety, Quality, and CostAggregage
 
AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...Institute of Contemporary Sciences
 
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsBill Liu
 
Scalable, Fast Analytics with Graph - Why and How
Scalable, Fast Analytics with Graph - Why and HowScalable, Fast Analytics with Graph - Why and How
Scalable, Fast Analytics with Graph - Why and HowCambridge Semantics
 
2021 06 19 ms student ambassadors nigeria ml net 01 slide-share
2021 06 19 ms student ambassadors nigeria ml net 01   slide-share2021 06 19 ms student ambassadors nigeria ml net 01   slide-share
2021 06 19 ms student ambassadors nigeria ml net 01 slide-shareBruno Capuano
 
2021 02 23 MVP Fusion Getting Started with Machine Learning.Net and AutoML
2021 02 23 MVP Fusion Getting Started with Machine Learning.Net and AutoML2021 02 23 MVP Fusion Getting Started with Machine Learning.Net and AutoML
2021 02 23 MVP Fusion Getting Started with Machine Learning.Net and AutoMLBruno Capuano
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaData Science Milan
 
201909 Automated ML for Developers
201909 Automated ML for Developers201909 Automated ML for Developers
201909 Automated ML for DevelopersMark Tabladillo
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingJan Wiegelmann
 
Mtc strategy-briefing-houston-pd m-05212018-3
Mtc strategy-briefing-houston-pd m-05212018-3Mtc strategy-briefing-houston-pd m-05212018-3
Mtc strategy-briefing-houston-pd m-05212018-3Dania Kodeih
 
Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Sri Ambati
 
Leverage the power of machine learning on windows
Leverage the power of machine learning on windowsLeverage the power of machine learning on windows
Leverage the power of machine learning on windowsJosé António Silva
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...All Things Open
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)dtz001
 

Similar to Data Con LA 2022 - AutoDC + AutoML = your AI development superpower (20)

2020 09-16-ai-engineering challanges
2020 09-16-ai-engineering challanges2020 09-16-ai-engineering challanges
2020 09-16-ai-engineering challanges
 
Machine learning on streams of data
Machine learning on streams of dataMachine learning on streams of data
Machine learning on streams of data
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AI
 
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
Accelerate Machine Learning Workloads using Amazon EC2 P3 Instances - SRV201 ...
 
When We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML PipelinesWhen We Spark and When We Don’t: Developing Data and ML Pipelines
When We Spark and When We Don’t: Developing Data and ML Pipelines
 
LLMOps for Your Data: Best Practices to Ensure Safety, Quality, and Cost
LLMOps for Your Data: Best Practices to Ensure Safety, Quality, and CostLLMOps for Your Data: Best Practices to Ensure Safety, Quality, and Cost
LLMOps for Your Data: Best Practices to Ensure Safety, Quality, and Cost
 
AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...
 
Deep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps WorkflowsDeep AutoViML For Tensorflow Models and MLOps Workflows
Deep AutoViML For Tensorflow Models and MLOps Workflows
 
Scalable, Fast Analytics with Graph - Why and How
Scalable, Fast Analytics with Graph - Why and HowScalable, Fast Analytics with Graph - Why and How
Scalable, Fast Analytics with Graph - Why and How
 
2021 06 19 ms student ambassadors nigeria ml net 01 slide-share
2021 06 19 ms student ambassadors nigeria ml net 01   slide-share2021 06 19 ms student ambassadors nigeria ml net 01   slide-share
2021 06 19 ms student ambassadors nigeria ml net 01 slide-share
 
2021 02 23 MVP Fusion Getting Started with Machine Learning.Net and AutoML
2021 02 23 MVP Fusion Getting Started with Machine Learning.Net and AutoML2021 02 23 MVP Fusion Getting Started with Machine Learning.Net and AutoML
2021 02 23 MVP Fusion Getting Started with Machine Learning.Net and AutoML
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
201909 Automated ML for Developers
201909 Automated ML for Developers201909 Automated ML for Developers
201909 Automated ML for Developers
 
NoSQL
NoSQLNoSQL
NoSQL
 
Deep Learning for Autonomous Driving
Deep Learning for Autonomous DrivingDeep Learning for Autonomous Driving
Deep Learning for Autonomous Driving
 
Mtc strategy-briefing-houston-pd m-05212018-3
Mtc strategy-briefing-houston-pd m-05212018-3Mtc strategy-briefing-houston-pd m-05212018-3
Mtc strategy-briefing-houston-pd m-05212018-3
 
Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session Bring Your Own Recipes Hands-On Session
Bring Your Own Recipes Hands-On Session
 
Leverage the power of machine learning on windows
Leverage the power of machine learning on windowsLeverage the power of machine learning on windows
Leverage the power of machine learning on windows
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
 

More from Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

More from Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Recently uploaded

CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 

Recently uploaded (20)

CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 

Data Con LA 2022 - AutoDC + AutoML = your AI development superpower

  • 1. AutoDC + AutoML = Your AI Dev Superpower Zac Yung-Chun Liu + Andromeda 360 AI zac@a360.ai Scott Tarlow Hypergiant scott.tarlow@hypergiant.com 2022 Data Con LA- AI / ML / Data Science Track
  • 2.
  • 3. Talk outline ● Introduction: model-centric vs data-centric ● AutoDC (Automated data-centric processing) ● AutoDC + AutoML ● Data-centric + AutoML ● Discussions
  • 4. Introduction Model-centric approach AI = Code + Data Data-centric approach AI = Code + Data Systematically engineering the data used to build an AI system Model generation Hyperparameter tuning Optuna, Hyperopt, Bayesian Optimization etc. Array, Featuretools, SMOTE, D3M etc.
  • 5. Introduction Model-centric approach AI = Code + Data Data-centric approach AI = Code + Data Systematically engineering the data used to build an AI system Model generation Hyperparameter tuning Improvement (accuracy) 85% → 87% Improvement (accuracy) 85% → 95%
  • 6. Introduction Model-centric approach AI = Code + Data Data-centric approach AI = Code + Data Systematically engineering the data used to build an AI system Model generation Hyperparameter tuning AutoML AutoDC
  • 7. Ideation of AutoDC framework. Model-centric AI: (1) input data → (2) AutoML: 1. data preprocessing, 2. feature engineering, 3. model generation, 4. hyperparameter tuning → (3) output prediction. Data-centric AI: (1) labeled dataset → (2) AutoDC: 1. label correction, 2. edge case selection, 3. data augmentation → (3) improved dataset. Presented at NeurIPS 2021
  • 8. AutoDC workflow: (1) labeled dataset → (2) AutoDC → (3) improved dataset → (4) ML model or AutoML. AutoDC steps: embedding creation (ResNet50, t-SNE); outlier creation (Isolation Forest); label correction (human in the loop); edge case selection (optimized ratio); data augmentation (Keras data generator). Currently supports computer vision only
  • 9. AutoDC example [Figure: three scatter plots of the embeddings, panels A, B, C] Embeddings → outlier detection → identify incorrect labels, edge cases. Data: roman numerals
  • 11. AutoDC on 3 example datasets ROMAN NUMERICAL 3,000 IMAGES / 10 classes ASIRRA DOG VS CAT 25,000 IMAGES / 2 classes STANFORD PARASITIC SNAIL 5,000 IMAGES / 4 classes
  • 12. AutoDC improvement - image classification (unmodified data → improved data): ROMAN NUMERICAL, 3,000 images / 10 classes: 65% → 80%; STANFORD PARASITIC SNAIL, 5,000 images / 4 classes: 72% → 82%; ASIRRA DOG VS CAT, 25,000 images / 2 classes: 81% → 95%. Fixed model (ResNet50) → 10-15% improvement
  • 13. AutoDC: 80% time saved (AutoDC vs manual process): ROMAN NUMERICAL, 3,000 images / 10 classes: 1h vs 4h; STANFORD PARASITIC SNAIL, 5,000 images / 4 classes: 2h vs 9h; ASIRRA DOG VS CAT, 25,000 images / 2 classes: 1h vs 6h
  • 14. AutoDC limitations ● The parameters in AutoDC still require fine-tuning ● Users need to identify a training model in advance + AutoML → fill the gap
  • 15. (1) AutoDC + AutoML: Input Data → AutoDC → Improved Data → AutoML ● Use AutoDC and AutoML as separate components
  • 16. (2) AutoDC + AutoML AutoDC Hyperparameter tuning Input Data Improved Data ML model AutoML ● Include AutoDC as one of the search components in AutoML Fine-tune
  • 17. AutoDC + AutoML (1) - separate components (unmodified data → improved data → improved data + Google AutoML): ROMAN NUMERICAL, 3,000 images: 65% → 80% → 85%; STANFORD PARASITIC SNAIL, 5,000 images: 72% → 82% → 85%; ASIRRA DOG VS CAT, 25,000 images: 81% → 95% → 97%. 15-20% improvement (additional 2-5% ↑). AutoML run: 8-20 hours
  • 18. AutoDC + AutoML (2) - fine-tune AutoDC in AutoML XX% improvement ? More time saved ● Still in development (future works) ● Expect to be more time efficient
  • 20. Data-centric approach + AutoML : Preventive Maintenance on Aircraft Engines - Imbalanced Classification
  • 21. Data-centric approach + AutoML : Preventive Maintenance on Aircraft Engines We treat the encoding of variables as a parameter, similar to the dropout rate (a neural network regularization tool). This plot shows that reducing dropout (lowering bias) works better than encoding categorical variables (increasing variance). This suggests that the categorical variables may not contribute much to a good model.
  • 22. Data-centric approach + AutoML : Preventive Maintenance on Aircraft Engines Further evidence: the random_state parameter is more important than encoding, which shows that the encoded variables did not make a meaningful contribution to a strong model. Knowing this lets us build less complex models that are less susceptible to drift in production.
  • 23. Discussions ● AutoDC framework is modular and flexible, can be updated with newly developed ML techniques ● (1) AutoDC + AutoML, (2) data-centric approach + AutoML → automate most of the manual processes in ML development ● Low-code/ no-code ML solutions for domain experts ● Only hard requirement: compute resources
  • 24. Call for open source contributions github.com/gohypergiant/AutoDC
  • 25. Project sponsors ML/ DS service Focus on space, defense, and critical infrastructure hypergiant.com Open and modular ML platform (A360) Focus on single touch ML deployment (Starpack) a360.ai
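
The headline figures on slides 12, 13 and 17 can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the flattened numbers on each slide are listed in the same order as the dataset names (unmodified, then AutoDC-improved, then AutoDC + Google AutoML; AutoDC hours, then manual hours). The dictionary keys and the `summarize` helper are hypothetical names and are not taken from the AutoDC repository.

```python
# Figures transcribed from slides 12, 13 and 17.
# ASSUMPTION: numbers are paired with datasets in the order they appear on each slide.
RESULTS = {
    "Roman numerals": {
        "unmodified": 65, "autodc": 80, "autodc_automl": 85,
        "autodc_hours": 1, "manual_hours": 4,
    },
    "Stanford parasitic snail": {
        "unmodified": 72, "autodc": 82, "autodc_automl": 85,
        "autodc_hours": 2, "manual_hours": 9,
    },
    "Asirra dog vs cat": {
        "unmodified": 81, "autodc": 95, "autodc_automl": 97,
        "autodc_hours": 1, "manual_hours": 6,
    },
}


def summarize(results):
    """Return per-dataset gains (percentage points) and fraction of time saved."""
    rows = {}
    for name, r in results.items():
        rows[name] = {
            # Gain from improving the data alone (fixed ResNet50 model).
            "autodc_gain_pts": r["autodc"] - r["unmodified"],
            # Extra gain from running AutoML on the improved data.
            "automl_extra_pts": r["autodc_automl"] - r["autodc"],
            # Combined gain over the unmodified baseline.
            "total_gain_pts": r["autodc_automl"] - r["unmodified"],
            # Share of the manual processing time that AutoDC avoids.
            "time_saved": 1 - r["autodc_hours"] / r["manual_hours"],
        }
    return rows


summary = summarize(RESULTS)
mean_time_saved = sum(row["time_saved"] for row in summary.values()) / len(summary)

for name, row in summary.items():
    print(
        f"{name}: +{row['autodc_gain_pts']} pts (AutoDC), "
        f"+{row['automl_extra_pts']} pts (AutoML on top), "
        f"{row['time_saved']:.0%} time saved"
    )
print(f"Mean time saved: {mean_time_saved:.0%}")
```

Under this reading, the AutoDC-only gains fall within the 10-15 point range quoted on slide 12, and the additional AutoML gains fall within the 2-5 point range on slide 17. The combined gain for the snail dataset works out to 13 points, slightly below the stated 15-20% range; the editor's notes do mention that the final numbers still need to be updated.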

Editor's Notes

  1. [3 mins] Model-centric vs data-centric: fixed data, improve model vs fixed model, improve data (Andrew Ng’s flagship talk in 2021). Kaggle competitions → model-centric. More data-centric competitions are appearing; the first one was initiated by Andrew Ng in 2021. Model-centric → incremental improvement. Data-centric → better approach, builds a better model. Commonly used packages and techniques.
  2. [1 min] Model-centric: Incremental improvement (1-2%) Data-centric: significant improvement (> 10%)
  3. [2 mins] The availability of AutoML (automated machine learning) with publicly accessible pre-trained models enables domain experts to automatically build high-quality custom ML applications without much knowledge of ML model construction, which greatly speeds up ML model development. AutoML has been an essential piece of the model-centric approach in the data science community. Similar to AutoML, we’ve created AutoDC as open source tooling.
  4. [3 mins] AutoDC (automated data-centric processing), similar in purpose to AutoML, is a newly developed open source tool that enables domain experts to automatically and systematically improve datasets by fixing incorrect labels, adding examples that represent edge cases, and applying data augmentation, without much coding or manual processing. Note: there is some overlap; for example, data preprocessing and feature engineering could also be part of AutoDC. There are no strict boundaries.
  5. [2 mins] AutoDC workflow. Input: labeled dataset. Output: improved dataset. Still requires an ML model to measure the improvement.
  6. [2 mins] AutoDC example Roman numerals data Embeddings → help identify incorrect labels and edge cases
  7. [2 mins] Quick walkthrough Github repo
  8. [2 mins] AutoDC example Roman numerals data Embeddings → help identify incorrect labels and edge cases
  9. [2 mins] AutoDC improves the ML model (fixed ResNet50 model). Tested on 3 image datasets: 10-15% improvement. Shows that AutoDC can be a powerful tool to improve data and label quality. What if we combine it with AutoML?
  10. [2 mins] AutoDC improves the ML model (fixed ResNet50 model). Tested on 3 image datasets: 10-15% improvement. Shows that AutoDC can be a powerful tool to improve data and label quality. What if we combine it with AutoML?
  11. [2 mins] AutoDC example Roman numerals data Embeddings → help identify incorrect labels and edge cases
  12. [2 mins] The final numbers need to be updated
  13. [2 mins]
  14. Traditional AutoML presents only the best method, but AutoML with a data-centric approach 1) includes preprocessing in the hyperparameter search and 2) presents the other options to show how significant each parameter is.
  15. A mix of continuous variables (sensors) and categorical variables with missing values is used to predict whether an aircraft engine should be replaced before its next flight. We use our AutoML library to build a strong model, impute the data, and build new features.
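
The idea in notes 14 and 15, searching over preprocessing choices alongside model hyperparameters and then reporting how much each parameter matters, can be sketched in a few lines. This is a minimal illustration and not the speakers' AutoML library: the search space, the `toy_score` function (a made-up stand-in for training and validating a model) and the simple "range of mean scores" effect measure are all assumptions chosen for the example. The toy coefficients are deliberately picked so the ranking mirrors the qualitative claim on slides 21-22; they are not measured results.

```python
from itertools import product
from statistics import mean

# Joint search space: a data-side choice sits next to model-side choices.
SEARCH_SPACE = {
    "encode_categoricals": [False, True],  # preprocessing treated as a hyperparameter
    "dropout": [0.1, 0.3, 0.5],
    "random_state": [0, 1, 2],
}


def toy_score(config):
    """Hypothetical validation score; a real run would train a model here."""
    score = 0.80
    score -= 0.2 * config["dropout"]                 # less dropout scores higher
    score += 0.002 * config["encode_categoricals"]   # encoding barely matters
    score += 0.01 * (config["random_state"] - 1)     # seed-to-seed noise
    return score


def grid_search(space, score_fn):
    """Evaluate every combination and return a list of (config, score) trials."""
    names = list(space)
    trials = []
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        trials.append((config, score_fn(config)))
    return trials


def parameter_effects(space, trials):
    """For each parameter, the spread between its best and worst mean score."""
    effects = {}
    for name, values in space.items():
        means = [mean(s for c, s in trials if c[name] == v) for v in values]
        effects[name] = max(means) - min(means)
    return effects


trials = grid_search(SEARCH_SPACE, toy_score)
best_config, best_score = max(trials, key=lambda t: t[1])
effects = parameter_effects(SEARCH_SPACE, trials)

print("best:", best_config)
for name in sorted(effects, key=effects.get, reverse=True):
    print(f"{name}: effect {effects[name]:.3f}")
```

Reporting the effect of every parameter, instead of only the winning configuration, is what lets a preprocessing step such as categorical encoding be dropped when its effect is smaller than the noise from the random seed.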