Data Con LA 2022 - AutoDC + AutoML = your AI development superpower

AutoDC + AutoML = Your AI Dev Superpower
Zac Yung-Chun Liu, Andromeda 360 AI, zac@a360.ai
Scott Tarlow, Hypergiant, scott.tarlow@hypergiant.com
2022 Data Con LA - AI / ML / Data Science Track

Talk outline
● Introduction: model-centric vs data-centric
● AutoDC (Automated data-centric processing)
● AutoDC + AutoML
● Data-centric + AutoML
● Discussions

Introduction
Model-centric approach: AI = Code + Data (focus on the code)
● Model generation
● Hyperparameter tuning
● Tools: Optuna, Hyperopt, Bayesian Optimization etc.
Data-centric approach: AI = Code + Data (focus on the data)
● Systematically engineering the data used to build an AI system
● Tools: Array, Featuretools, SMOTE, D3M etc.

Introduction
Model-centric approach: AI = Code + Data
● Model generation
● Improvement (accuracy): 85% → 87%
Data-centric approach: AI = Code + Data
● Improvement (accuracy): 85% → 95%

Introduction
Model-centric approach: AI = Code + Data
● Model generation
● AutoML
Data-centric approach: AI = Code + Data
● AutoDC

Ideation of AutoDC framework
Model-centric AI: (1) Input data → (2) AutoML → (3) Output prediction
AutoML steps:
1. Data preprocessing
2. Feature engineering
3. Model generation
4. Hyperparameter tuning
Data-centric AI: (1) Labeled dataset → (2) AutoDC → (3) Improved dataset
AutoDC steps:
1. Label correction
2. Edge case selection
3. Data augmentation
Presented at NeurIPS 2021

AutoDC workflow
(1) Labeled dataset → (2) AutoDC → (3) Improved dataset → (4) ML model or AutoML
AutoDC steps:
● Embedding creation: ResNet50, t-SNE
● Outlier detection: Isolation Forest
● Label correction: human in the loop
● Edge case selection: optimized ratio
● Data augmentation: Keras data generator
Currently supports computer vision only
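The outlier step of the workflow can be illustrated with a small, self-contained isolation forest. This is a sketch under assumptions, not the AutoDC implementation: the 2D points are synthetic stand-ins for the ResNet50 + t-SNE embeddings, and the function names and default parameters (anomaly_scores, build_tree, n_trees, sample_size) are chosen here for illustration only.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    low = min(p[dim] for p in points)
    high = max(p[dim] for p in points)
    if low == high:
        return ("leaf", len(points))
    threshold = rng.uniform(low, high)
    left = [p for p in points if p[dim] < threshold]
    right = [p for p in points if p[dim] >= threshold]
    return ("node", dim, threshold,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, point, depth=0):
    """Depth at which the point lands, plus a correction for unsplit leaves."""
    if tree[0] == "leaf":
        return depth + average_path_length(tree[1])
    _, dim, threshold, left, right = tree
    child = left if point[dim] < threshold else right
    return path_length(child, point, depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Score in (0, 1]; points that are isolated quickly score closer to 1."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = average_path_length(sample_size)
    return [2 ** (-sum(path_length(t, p) for t in trees) / (n_trees * norm))
            for p in points]

# Synthetic stand-in for 2D embeddings: one tight cluster plus two far-away points.
data_rng = random.Random(42)
embeddings = [(data_rng.gauss(0, 0.3), data_rng.gauss(0, 0.3)) for _ in range(60)]
embeddings += [(5.0, 5.0), (-4.0, 6.0)]

scores = anomaly_scores(embeddings)
flagged = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:2]
print("indices to send for review:", sorted(flagged))
```

In a workflow like the one on the slide, the highest-scoring samples would be the ones passed on to label correction and edge case selection; how many to flag is one of the parameters that still needs tuning.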

AutoDC example
[Figure: three scatter plots of embeddings (panels A, B, C); both axes range from -1.5 to 2.0]
Embeddings → outlier detection → identify incorrect labels, edge cases
Data: roman numerals
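One way to picture how flagged points might be split into suspected label errors and suspected edge cases is a simple class-centroid heuristic. This is an illustrative stand-in, not the method used in AutoDC: the triage function, the edge_threshold parameter, and the toy 2D embeddings are all assumptions made for this sketch.

```python
import math

def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def triage(embeddings, labels, edge_threshold):
    """Split samples into suspected label errors and suspected edge cases.

    A sample closer to another class's centroid than to its own is a suspected
    label error. A sample that is nearest to its own class but farther than
    edge_threshold from that centroid is a suspected edge case.
    """
    by_label = {}
    for emb, lab in zip(embeddings, labels):
        by_label.setdefault(lab, []).append(emb)
    centroids = {lab: centroid(pts) for lab, pts in by_label.items()}

    suspected_label_errors, suspected_edge_cases = [], []
    for idx, (emb, lab) in enumerate(zip(embeddings, labels)):
        nearest = min(centroids, key=lambda c: math.dist(emb, centroids[c]))
        if nearest != lab:
            suspected_label_errors.append((idx, nearest))
        elif math.dist(emb, centroids[lab]) > edge_threshold:
            suspected_edge_cases.append(idx)
    return suspected_label_errors, suspected_edge_cases

# Toy embeddings for two roman-numeral classes (values invented for illustration).
embeddings = [
    (0.0, 0.0), (0.1, 0.1), (-0.1, 0.05), (0.05, -0.1),
    (2.0, 2.1),                                  # labeled "I" but sits in the "V" cluster
    (2.0, 2.0), (2.1, 1.9), (1.9, 2.1), (2.05, 2.0),
    (3.2, 2.0),                                  # labeled "V", unusually far from the cluster
]
labels = ["I"] * 5 + ["V"] * 5

label_errors, edge_cases = triage(embeddings, labels, edge_threshold=0.8)
print("suspected label errors (index, suggested label):", label_errors)
print("suspected edge cases (index):", edge_cases)
```

Because mislabeled points also pull their class centroid toward the wrong cluster, a heuristic like this only produces candidates; the slide's workflow keeps a human in the loop for the actual correction.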

github.com/gohypergiant/AutoDC

AutoDC on 3 example datasets
● Roman numerals: 3,000 images / 10 classes
● Asirra dog vs. cat
● Stanford parasitic snail

AutoDC improvement - Image Classification
Fixed model: ResNet50
Dataset labels on the chart: Roman numerals, Asirra dog vs. cat
Accuracy, unmodified data → improved data (in chart order):
● 65% → 80%
● 72% → 82%
● 81% → 95%
→ 10-15% improvement

AutoDC: 80% time saved
Dataset labels on the chart: Roman numerals, Asirra dog vs. cat
Time spent, AutoDC vs. manual process (in chart order):
● 1h vs. 4h
● 2h vs. 9h
● 1h vs. 6h
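The headline figure can be checked against the chart values with a few lines of arithmetic. The sketch below assumes the six values pair up in the order they appear (AutoDC hours first, manual hours second); the function name fraction_saved is made up for this example.

```python
# (AutoDC hours, manual hours) per dataset, in the order shown on the chart.
timings = [(1, 4), (2, 9), (1, 6)]

def fraction_saved(auto_hours, manual_hours):
    """Share of the manual time that is no longer needed."""
    return 1 - auto_hours / manual_hours

per_dataset = [fraction_saved(a, m) for a, m in timings]
overall = fraction_saved(sum(a for a, _ in timings), sum(m for _, m in timings))

for (a, m), saved in zip(timings, per_dataset):
    print(f"{a}h vs. {m}h manual: {saved:.0%} saved")
print(f"overall: {overall:.0%} saved")
```

Under that pairing the per-dataset savings come out at about 75%, 78%, and 83%, and about 79% over all three, which is in line with the rounded 80% in the slide title.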

AutoDC limitations
● The parameters in AutoDC still require fine-tuning
● Users need to identify a training model in advance
+ AutoML → fill the gap

(1) AutoDC + AutoML
Input data → AutoDC → Improved data → AutoML
● Use AutoDC and AutoML as separate components

(2) AutoDC + AutoML
AutoML: Input data → AutoDC → Improved data → Hyperparameter tuning → ML model, with fine-tuning of AutoDC
● Include AutoDC as one of the search components in AutoML
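The idea of making AutoDC a search component can be sketched as a single search space that mixes data-side and model-side parameters. Everything below is hypothetical: the parameter names (outlier_fraction, augmentation_ratio, learning_rate), their values, and the toy objective are placeholders, and a plain grid search stands in for whatever search strategy an AutoML system would use.

```python
import itertools

# Hypothetical joint search space: two AutoDC knobs plus one model hyperparameter.
SEARCH_SPACE = {
    "outlier_fraction": [0.01, 0.05, 0.10],    # data side: share of samples flagged for review
    "augmentation_ratio": [0.0, 0.5, 1.0],     # data side: augmented samples per original sample
    "learning_rate": [1e-4, 1e-3, 1e-2],       # model side
}

def toy_validation_score(config):
    """Placeholder objective with a known optimum.

    A real run would improve the dataset with the data-side settings, train a
    model with the model-side settings, and return its validation accuracy.
    """
    return -(
        (config["outlier_fraction"] - 0.05) ** 2
        + (config["augmentation_ratio"] - 0.5) ** 2
        + (config["learning_rate"] - 1e-3) ** 2
    )

def grid_search(space, objective):
    """Evaluate every combination and keep the best-scoring configuration."""
    names = list(space)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = grid_search(SEARCH_SPACE, toy_validation_score)
print("best configuration:", best_config)
```

The point of the design is that the data-improvement settings are tuned by the same loop that tunes the model, instead of being fixed by hand before AutoML starts.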

AutoDC + AutoML (1) - separate components
Dataset labels on the chart:
● Roman numerals: 3,000 images
● 5,000 images
● Asirra dog vs. cat: 25,000 images
Accuracy, unmodified data → improved data → improved data + Google AutoML (in chart order):
● 65% → 80% → 85%
● 72% → 82% → 85%
● 81% → 95% → 97%
15-20% improvement (additional 2-5% ↑)
AutoML run: 8-20 hours
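The gains at each stage follow directly from the chart values. The sketch below assumes the nine values group into three triples in chart order (unmodified, improved, improved + Google AutoML); the variable names are chosen for this example.

```python
# (unmodified, improved, improved + AutoML) accuracy in percent, in chart order.
results = [(65, 80, 85), (72, 82, 85), (81, 95, 97)]

# Gains in percentage points at each stage.
autodc_gain = [improved - base for base, improved, _ in results]
automl_gain = [full - improved for _, improved, full in results]
total_gain = [full - base for base, _, full in results]

print("AutoDC alone:      ", autodc_gain)
print("AutoML on top:     ", automl_gain)
print("AutoDC plus AutoML:", total_gain)
```

With this grouping, AutoML adds 5, 3, and 2 points on top of the improved data, which matches the "additional 2-5%" on the slide.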

AutoDC + AutoML (2) - fine-tune AutoDC in AutoML
XX% improvement?
More time saved
● Still in development (future work)
● Expected to be more time efficient

Data-centric approach + AutoML

Data-centric approach + AutoML: Preventive Maintenance on Aircraft Engines - Imbalanced Classification

Data-centric approach + AutoML:
Preventive Maintenance on Aircraft Engines
We treat the encoding of the variables as a parameter, similar to the dropout rate (a neural network regularization tool). This plot shows that reducing dropout (lowering bias) works better than encoding categorical variables (increasing variance). This means that the categorical variables may not contribute much to a good model.
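Treating encoding as a parameter can be pictured as a boolean switch in the feature-building step, searched over alongside the dropout rate. This is a minimal sketch under assumptions: the column names, the example rows, and the build_features helper are invented for illustration and do not come from the aircraft-engine dataset.

```python
def one_hot(value, categories):
    """Encode a single categorical value as a 0/1 vector."""
    return [1.0 if value == category else 0.0 for category in categories]

def build_features(rows, numeric_cols, categorical_cols, encode_categoricals):
    """Build feature vectors; categorical columns are used only when the flag is on."""
    categories = {col: sorted({row[col] for row in rows}) for col in categorical_cols}
    features = []
    for row in rows:
        vector = [float(row[col]) for col in numeric_cols]
        if encode_categoricals:
            for col in categorical_cols:
                vector.extend(one_hot(row[col], categories[col]))
        features.append(vector)
    return features

# Invented example records; the real dataset's columns are not given in the slides.
rows = [
    {"cycles": 120, "temperature": 641.8, "engine_type": "A"},
    {"cycles": 305, "temperature": 643.1, "engine_type": "B"},
    {"cycles": 87, "temperature": 642.2, "engine_type": "A"},
]

# The data-side switch sits in the same search space as a model-side parameter.
search_space = {"encode_categoricals": [False, True], "dropout_rate": [0.1, 0.3, 0.5]}

without_encoding = build_features(rows, ["cycles", "temperature"], ["engine_type"], False)
with_encoding = build_features(rows, ["cycles", "temperature"], ["engine_type"], True)
print(len(without_encoding[0]), "features without encoding")
print(len(with_encoding[0]), "features with encoding")
```

If the search rarely benefits from turning the switch on, that is a hint that the categorical columns add model complexity without adding much signal.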

Data-centric approach + AutoML:
Preventive Maintenance on Aircraft Engines
Further evidence for this: the random_state parameter is more important than encoding, which shows that the encoded variables did not contribute meaningfully to a strong model.
Using this, we can create less complex models, which are less susceptible to drift in production.

Discussions
● The AutoDC framework is modular and flexible, and can be updated with newly developed ML techniques
● (1) AutoDC + AutoML, (2) data-centric approach + AutoML
→ automate most of the manual processes in ML development
● Low-code / no-code ML solutions for domain experts
● Only hard requirement: compute resources

Call for open source contributions
github.com/gohypergiant/AutoDC

Project sponsors
● Hypergiant (hypergiant.com): ML / DS service; focus on space, defense, and critical infrastructure
● Andromeda 360 AI (a360.ai): open and modular ML platform (A360); focus on single touch ML deployment (Starpack)
