Zac Yung-Chun Liu, Head of AI Research, Andromeda 360 AI
Scott Tarlow, Principal Applied Scientist, Hypergiant
The availability of AutoML (automated machine learning) with publicly accessible pre-trained models enables domain experts to automatically build high-quality custom ML applications without much knowledge of ML model construction, which greatly speeds up ML model development. AutoML has been an essential piece of the model-centric approach in the data science community. AutoDC (automated data-centric processing), similar in purpose to AutoML, is a newly developed open source tool that enables domain experts to automatically and systematically improve datasets by fixing incorrect labels, adding examples that represent edge cases, and applying data augmentation, without much coding or manual work. Combining these two frameworks enables domain experts to improve both dataset and model concurrently and iteratively. In this talk, we will showcase 3 data science use cases and examples that demonstrate the effectiveness of the two frameworks combined and how they empower domain experts who don't know ML coding to do AI development.
Data Con LA 2022 - AutoDC + AutoML = your AI development superpower
1. AutoDC + AutoML
= Your AI Dev Superpower
Zac Yung-Chun Liu
+
Andromeda 360 AI
zac@a360.ai
Scott Tarlow
Hypergiant
scott.tarlow@hypergiant.com
2022 Data Con LA- AI / ML / Data Science Track
4. Introduction
Model-centric approach
AI = Code + Data
Data-centric approach
AI = Code + Data
Systematically engineering the
data used to build an AI system
Model generation
Hyperparameter tuning
Optuna, Hyperopt,
Bayesian Optimization etc.
Array, Featuretools, SMOTE,
D3M etc.
5. Introduction
Model-centric approach
AI = Code + Data
Data-centric approach
AI = Code + Data
Systematically engineering the
data used to build an AI system
Model generation
Hyperparameter tuning
Improvement (accuracy)
85% → 87%
Improvement (accuracy)
85% → 95%
6. Introduction
Model-centric approach
AI = Code + Data
Data-centric approach
AI = Code + Data
Systematically engineering the
data used to build an AI system
Model generation
Hyperparameter tuning
AutoML AutoDC
7. Ideation of AutoDC framework
MODEL-CENTRIC AI: (1) INPUT DATA → (2) AUTOML → (3) OUTPUT PREDICTION
AUTOML:
1. Data preprocessing
2. Feature engineering
3. Model generation
4. Hyperparameter tuning
DATA-CENTRIC AI: (1) LABELED DATASET → (2) AUTODC → (3) IMPROVED DATASET
AUTODC:
1. Label correction
2. Edge case selection
3. Data augmentation
Presented at NeurIPS 2021
8. AutoDC workflow
(1) LABELED DATASET → (2) AUTODC → (3) IMPROVED DATASET → (4) ML MODEL OR AUTOML
AUTODC steps:
EMBEDDING CREATION: RESNET 50, T-SNE
OUTLIER DETECTION: ISOLATION FOREST
LABEL CORRECTION: HUMAN IN THE LOOP
EDGE CASE SELECTION: OPTIMIZED RATIO
DATA AUGMENTATION: KERAS DATA GENERATOR
Currently supports only computer vision
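The outlier step of the workflow can be sketched roughly as follows. This is a minimal illustration, not AutoDC's own code: the embeddings are synthetic random vectors standing in for ResNet50 features, and `flag_outliers`, the 16-dimensional size, and the `contamination` value are assumptions made for the example. Only scikit-learn's `IsolationForest` is a real API here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def flag_outliers(embeddings, contamination=0.05, seed=0):
    """Return indices of samples that Isolation Forest marks as outliers."""
    forest = IsolationForest(contamination=contamination, random_state=seed)
    labels = forest.fit_predict(embeddings)  # -1 = outlier, 1 = inlier
    return np.flatnonzero(labels == -1)


# Synthetic stand-in for image embeddings (hypothetical, not real features).
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 16))
planted = rng.normal(8.0, 1.0, size=(5, 16))  # far from the main cluster
embeddings = np.vstack([inliers, planted])

# Flagged samples would go to a human reviewer for label correction.
review_queue = flag_outliers(embeddings)
print(len(review_queue), "samples queued for human review")
```

In a human-in-the-loop setup, the flagged indices would be shown to a reviewer who decides whether each one is a mislabeled image or a genuine edge case worth keeping.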
11. AutoDC on 3 example datasets
ROMAN NUMERICAL
3,000 IMAGES / 10 classes
ASIRRA DOG VS CAT
25,000 IMAGES / 2 classes
STANFORD PARASITIC SNAIL
5,000 IMAGES / 4 classes
12. AutoDC improvement- Image Classification
ROMAN NUMERICAL (3,000 IMAGES / 10 classes): 65% (UNMODIFIED DATA) → 80% (IMPROVED DATA)
STANFORD PARASITIC SNAIL (5,000 IMAGES / 4 classes): 72% (UNMODIFIED DATA) → 82% (IMPROVED DATA)
ASIRRA DOG VS CAT (25,000 IMAGES / 2 classes): 81% (UNMODIFIED DATA) → 95% (IMPROVED DATA)
Fixed model: ResNet50
→ 10-15% improvement
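The "10-15% improvement" figure can be checked with a few lines of arithmetic. This sketch assumes the six percentages on the slide pair up with the datasets in the order they are listed (unmodified data first, improved data second); that pairing is read off the slide text, not confirmed by the authors.

```python
# Accuracy (%) per dataset: (unmodified data, improved data).
# Pairing of numbers to datasets is an assumption based on slide order.
results = {
    "Roman Numerical": (65, 80),
    "Stanford Parasitic Snail": (72, 82),
    "Asirra Dog vs Cat": (81, 95),
}

# Gain in percentage points with the model held fixed (ResNet50).
gains = {name: after - before for name, (before, after) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} percentage points")
```

Under that pairing the gains are absolute percentage points, not relative improvements.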
13. AutoDC: 80% time saved
ROMAN NUMERICAL (3,000 IMAGES / 10 classes): 1h (AutoDC) vs 4h (Manual process)
STANFORD PARASITIC SNAIL (5,000 IMAGES / 4 classes): 2h (AutoDC) vs 9h (Manual process)
ASIRRA DOG VS CAT (25,000 IMAGES / 2 classes): 1h (AutoDC) vs 6h (Manual process)
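As a rough check of the "80% time saved" headline, the hours on this slide can be turned into a saving per dataset. The sketch assumes the smaller number in each pair is the AutoDC time and the larger one the manual time, following the legend order; that reading is an assumption.

```python
# Hours per dataset: (AutoDC, manual process). Pairing is assumed from
# the legend order on the slide.
hours = {
    "Roman Numerical": (1, 4),
    "Stanford Parasitic Snail": (2, 9),
    "Asirra Dog vs Cat": (1, 6),
}

# Fraction of manual time saved on each dataset.
saved = {name: 1 - auto / manual for name, (auto, manual) in hours.items()}

# Saving over all three datasets taken together.
total_auto = sum(auto for auto, _ in hours.values())
total_manual = sum(manual for _, manual in hours.values())
overall = 1 - total_auto / total_manual

for name, fraction in saved.items():
    print(f"{name}: {fraction:.0%} saved")
print(f"Overall: {overall:.0%} saved")
```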
14. AutoDC limitations
● The parameters in AutoDC still require fine-tuning
● Users need to identify a training model in advance
+ AutoML → fill the gap
15. (1) AutoDC + AutoML
AutoDC AutoML
Input Data Improved Data
● Use AutoDC and AutoML as separate components
16. (2) AutoDC + AutoML
AutoDC Hyperparameter
tuning
Input Data Improved Data ML model
AutoML
● Include AutoDC as one of the search components in AutoML
Fine-tune
17. AutoDC + AutoML (1) - separate components
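ROMAN NUMERICAL (3,000 IMAGES): 65% (UNMODIFIED DATA) → 80% (IMPROVED DATA) → 85% (IMPROVED DATA + Google AutoML)
STANFORD PARASITIC SNAIL (5,000 IMAGES): 72% (UNMODIFIED DATA) → 82% (IMPROVED DATA) → 85% (IMPROVED DATA + Google AutoML)
ASIRRA DOG VS CAT (25,000 IMAGES): 81% (UNMODIFIED DATA) → 95% (IMPROVED DATA) → 97% (IMPROVED DATA + Google AutoML)
15-20% improvement
(Additional 2-5% ↑)
AutoML run: 8-20 hours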
18. AutoDC + AutoML (2) - fine-tune AutoDC in AutoML
XX% improvement ?
More time saved
● Still in development (future work)
● Expected to be more time-efficient
21. Data-centric approach + AutoML :
Preventive Maintenance on Aircraft Engines
We treat encoding the variables as a
parameter, similar to dropout rate (a
neural network regularization tool).
This plot shows that reducing dropout
(lowering bias) works better than
encoding categorical variables
(increasing variance). This means that
the categorical variables may not
contribute much to a good model.
22. Data-centric approach + AutoML :
Preventive Maintenance on Aircraft Engines
Further evidence: the random_state
parameter is more important than
encoding, showing that the encoded
variables did not contribute
meaningfully to a strong model.
Knowing this allows us to create less
complex models, which are less
susceptible to drift in production.
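The idea of treating a preprocessing choice as one more search parameter, then comparing how much each parameter matters, can be sketched as below. Everything here is illustrative: `evaluate` is a made-up placeholder score, not the aircraft-engine model or the authors' AutoML library, and the spread-of-means importance measure is a simple stand-in for whatever importance metric the plots used.

```python
import itertools
import random
from collections import defaultdict
from statistics import mean

# Search space: a preprocessing switch sits next to ordinary hyperparameters.
SEARCH_SPACE = {
    "encode_categoricals": [False, True],
    "dropout": [0.0, 0.2, 0.4],
    "random_state": [0, 1, 2],
}


def evaluate(config):
    """Placeholder score, NOT a real model: replace with train + validate."""
    noise = random.Random(config["random_state"]).uniform(-0.01, 0.01)
    bonus = 0.002 if config["encode_categoricals"] else 0.0
    return 0.80 - 0.3 * config["dropout"] + bonus + noise


def run_grid(space):
    """Score every combination and return (config, score) pairs."""
    names = list(space)
    trials = []
    for values in itertools.product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        trials.append((config, evaluate(config)))
    return trials


def importance(trials, space):
    """Spread of the mean score across each parameter's values."""
    result = {}
    for name in space:
        by_value = defaultdict(list)
        for config, score in trials:
            by_value[config[name]].append(score)
        means = [mean(scores) for scores in by_value.values()]
        result[name] = max(means) - min(means)
    return result


trials = run_grid(SEARCH_SPACE)
best_config, best_score = max(trials, key=lambda t: t[1])
scores = importance(trials, SEARCH_SPACE)
for name, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.4f}")
print("best:", best_config)
```

Reporting the importance of every parameter, not only the best configuration, is what lets a low-importance preprocessing step be dropped in favor of a simpler model.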
23. Discussions
● The AutoDC framework is modular and flexible, and can be updated with newly developed
ML techniques
● (1) AutoDC + AutoML, (2) data-centric approach + AutoML
→ automate most of the manual processes in ML development
● Low-code/ no-code ML solutions for domain experts
● Only hard requirement: compute resources
24. Call for open source contributions
github.com/gohypergiant/AutoDC
25. Project sponsors
ML/ DS service
Focus on space, defense,
and critical infrastructure
hypergiant.com
Open and modular ML platform (A360)
Focus on single touch ML deployment (Starpack)
a360.ai
Editor's Notes
[3 mins]
Model-centric vs data-centric: fixed data, improve model vs fixed model, improve data (Andrew Ng’s flagship talk in 2021)
Kaggle competition → model centric
More data-centric competition, first one in 2021 initiated by Andrew Ng
Model-centric → incremental improvement
Data-centric → better approach, build better model
Commonly used packages and techniques
[2 mins]
The availability of AutoML (automated machine learning) with publicly accessible pre-trained models enables domain experts to automatically build high-quality custom ML applications without much knowledge of ML model construction, which greatly speeds up ML model development.
AutoML has been an essential piece of the model-centric approach in the data science community.
Similar to AutoML, we've created AutoDC as open source tooling.
[3 mins]
AutoDC (automated data-centric processing), similar in purpose to AutoML, is a newly developed open source tool that enables domain experts to automatically and systematically improve datasets by fixing incorrect labels, adding examples that represent edge cases, and applying data augmentation, without much coding or manual work.
Note: there is some overlap. For example, data preprocessing and feature engineering could also belong in AutoDC; there are no strict boundaries.
[2 mins]
AutoDC workflow
Input: labeled dataset
Output: improved dataset
Still requires an ML model to measure the improvement
[2 mins]
AutoDC example
Roman numerals data
Embeddings → help identify incorrect labels and edge cases
[2 mins]
Quick walkthrough of the GitHub repo
[2 mins]
AutoDC improves the ML model (fixed ResNet50 model)
Tested on 3 image datasets
10-15% improvement
Shows that AutoDC can be a powerful tool for improving data and label quality
What if we combine it with AutoML?
[2 mins]
The final numbers need to be updated
[2 mins]
Traditional AutoML presents only the best method. AutoML with a data-centric approach can 1) include preprocessing in the hyperparameter search and 2) present the other options to show how significant each parameter is.
A mix of continuous variables (sensors) and categorical variables with missing values is used to predict whether an aircraft engine should be replaced before its next flight. We use our AutoML library to build a strong model, impute the data, and build new features.