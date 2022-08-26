Successfully reported this slideshow.

Data Con LA 2022 - AutoDC + AutoML = your AI development superpower

Aug. 26, 2022
Data & Analytics

Zac Yung-Chun Liu, Head of AI Research, Andromeda 360 AI
Scott Tarlow, Principal Applied Scientist, Hypergiant
The availability of AutoML (automated machine learning) with publicly accessible pre-trained models enables domain experts to build high-quality custom ML applications automatically, without much knowledge of ML model construction, which greatly speeds up ML model development. AutoML has been an essential piece of the model-centric approach in the data science community. AutoDC (automated data-centric processing), similar in purpose to AutoML, is a newly developed open source tool that enables domain experts to improve datasets automatically and systematically by fixing incorrect labels, adding examples that represent edge cases, and applying data augmentation, without much coding or manual processing. Combining these two frameworks enables domain experts to improve both dataset and model concurrently and iteratively. In this talk, we will showcase 3 data science use cases and examples that demonstrate the effectiveness of the two frameworks combined and how they empower domain experts who don't know ML coding to do AI development.


  1. AutoDC + AutoML = Your AI Dev Superpower
     Zac Yung-Chun Liu, Andromeda 360 AI, zac@a360.ai
     Scott Tarlow, Hypergiant, scott.tarlow@hypergiant.com
     2022 Data Con LA, AI / ML / Data Science Track
  2. Talk outline
     ● Introduction: model-centric vs data-centric
     ● AutoDC (Automated data-centric processing)
     ● AutoDC + AutoML
     ● Data-centric + AutoML
     ● Discussions
  3. Introduction
     Model-centric approach (AI = Code + Data): model generation, hyperparameter tuning. Tools: Optuna, Hyperopt, Bayesian Optimization, etc.
     Data-centric approach (AI = Code + Data): systematically engineering the data used to build an AI system. Tools: Array, Featuretools, SMOTE, D3M, etc.
  4. Introduction
     Model-centric approach (AI = Code + Data): model generation, hyperparameter tuning. Improvement (accuracy): 85% → 87%
     Data-centric approach (AI = Code + Data): systematically engineering the data used to build an AI system. Improvement (accuracy): 85% → 95%
  5. Introduction
     Model-centric approach (AI = Code + Data): model generation, hyperparameter tuning. Automated by AutoML
     Data-centric approach (AI = Code + Data): systematically engineering the data used to build an AI system. Automated by AutoDC
  6. Ideation of AutoDC framework
     Model-centric AI: (1) input data → (2) AutoML: data preprocessing, feature engineering, model generation, hyperparameter tuning → (3) output prediction
     Data-centric AI: (1) labeled dataset → (2) AutoDC: label correction, edge case selection, data augmentation → (3) improved dataset
     Presented at NeurIPS 2021
  7. AutoDC workflow
     (1) Labeled dataset → (2) AutoDC → (3) improved dataset → (4) ML model or AutoML
     AutoDC steps:
     ● Embedding creation: ResNet50, t-SNE
     ● Outlier detection: Isolation Forest
     ● Label correction: human in the loop
     ● Edge case selection: optimized ratio
     ● Data augmentation: Keras data generator
     Currently supports computer vision only
  8. AutoDC example
     [Figure: three embedding scatter plots, panels A, B, and C; both axes run from -1.5 to 2.0]
     Embeddings → outlier detection → identify incorrect labels, edge cases
     Data: Roman numerals
  9. github.com/gohypergiant/AutoDC
  10. AutoDC on 3 example datasets
      ● Roman numerals: 3,000 images / 10 classes
      ● Asirra dog vs cat: 25,000 images / 2 classes
      ● Stanford parasitic snail: 5,000 images / 4 classes
  11. AutoDC improvement: image classification (unmodified data → improved data)
      ● Roman numerals (3,000 images / 10 classes): 65% → 80%
      ● Stanford parasitic snail (5,000 images / 4 classes): 72% → 82%
      ● Asirra dog vs cat (25,000 images / 2 classes): 81% → 95%
      Fixed model (ResNet50) → 10-15% improvement
  12. AutoDC: 80% time saved (AutoDC vs manual process)
      ● Roman numerals (3,000 images / 10 classes): 1h vs 4h
      ● Stanford parasitic snail (5,000 images / 4 classes): 2h vs 9h
      ● Asirra dog vs cat (25,000 images / 2 classes): 1h vs 6h
  13. AutoDC limitations
      ● The parameters in AutoDC still require fine-tuning
      ● Users need to identify a training model in advance
      + AutoML → fill the gap
  14. (1) AutoDC + AutoML
      Input data → AutoDC → improved data → AutoML
      ● Use AutoDC and AutoML as separate components
  15. (2) AutoDC + AutoML
      [Diagram labels: input data, AutoDC, improved data, ML model, hyperparameter tuning, AutoML, fine-tune]
      ● Include AutoDC as one of the search components in AutoML
  16. AutoDC + AutoML (1): separate components (unmodified data → improved data → improved data + Google AutoML)
      ● Roman numerals (3,000 images): 65% → 80% → 85%
      ● Stanford parasitic snail (5,000 images): 72% → 82% → 85%
      ● Asirra dog vs cat (25,000 images): 81% → 95% → 97%
      15-20% improvement (additional 2-5% ↑)
      AutoML run: 8-20 hours
  17. AutoDC + AutoML (2): fine-tune AutoDC in AutoML
      XX% improvement? More time saved
      ● Still in development (future work)
      ● Expected to be more time-efficient
  18. Data-centric approach + AutoML
  19. Data-centric approach + AutoML: Preventive Maintenance on Aircraft Engines (imbalanced classification)
  20. Data-centric approach + AutoML: Preventive Maintenance on Aircraft Engines
      We treat encoding the variables as a parameter, similar to the dropout rate (a neural network regularization tool). This plot shows that reducing dropout (lowering bias) works better than encoding categorical variables (increasing variance). This means that the categorical variables may not contribute much to a good model.
  21. Data-centric approach + AutoML: Preventive Maintenance on Aircraft Engines
      Further evidence: the random_state parameter is more important than encoding, which shows that the encoded variables did not contribute meaningfully to a strong model. Knowing this lets us build less complex models, which are less susceptible to drift in production.
  22. Discussions
      ● The AutoDC framework is modular and flexible, and can be updated with newly developed ML techniques
      ● (1) AutoDC + AutoML, (2) data-centric approach + AutoML → automate most of the manual processes in ML development
      ● Low-code / no-code ML solutions for domain experts
      ● Only hard requirement: compute resources
  23. Call for open source contributions: github.com/gohypergiant/AutoDC
  24. Project sponsors
      ● hypergiant.com: ML/DS service; focus on space, defense, and critical infrastructure
      ● a360.ai: open and modular ML platform (A360); focus on single-touch ML deployment (Starpack)
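
The workflow on slides 7 and 8 (embeddings → outlier detection → human review of suspect labels and edge cases) can be illustrated with a small sketch. This is not AutoDC's code: the slides name ResNet50 embeddings, t-SNE, and an Isolation Forest, whereas the sketch below assumes 2-D embeddings already exist and swaps in a simpler per-class centroid-distance score so that it runs with the standard library alone. The function name `flag_outliers`, the `threshold` parameter, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def flag_outliers(embeddings, labels, threshold=1.5):
    """Return indices whose embedding lies unusually far from its class centroid.

    Simplified stand-in for the Isolation Forest step named on the slides.
    Flagged items would be passed to a human reviewer as possible label
    errors or edge cases.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    flagged = []
    for indices in by_class.values():
        dims = len(embeddings[indices[0]])
        centroid = [mean([embeddings[i][d] for i in indices]) for d in range(dims)]
        dists = [math.dist(embeddings[i], centroid) for i in indices]
        mu, sigma = mean(dists), pstdev(dists)
        if sigma == 0:
            continue  # every point is equally far from the centroid
        for i, dist in zip(indices, dists):
            if (dist - mu) / sigma > threshold:  # z-score of the distance
                flagged.append(i)
    return sorted(flagged)


# Toy 2-D embeddings (hypothetical): index 5 is labeled "I" but sits in the "V" cluster.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1), (5.0, 5.0),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0), (5.0, 4.9),
]
labels = ["I"] * 6 + ["V"] * 5

flagged = flag_outliers(embeddings, labels)
print(flagged)  # candidates for human-in-the-loop label correction
```

A distance score like this only catches points far from their own class; an Isolation Forest, as named on slide 7, makes no assumption that each class forms a single compact cluster.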

Editor's Notes

  • [3 mins]
    Model-centric vs data-centric: fixed data, improve model vs fixed model, improve data (Andrew Ng’s flagship talk in 2021)
    Kaggle competition → model centric
    More data-centric competition, first one in 2021 initiated by Andrew Ng
    Model-centric → incremental improvement
    Data-centric → better approach, build better model
    Commonly used packages and techniques


  • [1 min]
    Model-centric: Incremental improvement (1-2%)
    Data-centric: significant improvement (> 10%)


  • [2 mins]
    The availability of AutoML (automated machine learning) with publicly accessible pre-trained models enables domain experts to build high-quality custom ML applications automatically, without much knowledge of ML model construction, which greatly speeds up ML model development.

    AutoML has been an essential piece of the model-centric approach in the data science community.

    Similar to AutoML, we've created AutoDC as open source tooling.



  • [3 mins]
    AutoDC (automated data-centric processing), similar to the purpose of AutoML, is a newly developed open source tool that enables domain experts to automatically and systematically improve datasets by fixing incorrect labels, adding examples that represent edge cases, and applying data augmentation, without much coding requirement and manual process.

    Note: there are some overlaps, for example, data preprocessing and feature engineering could also be in AutoDC, no strict boundaries
  • [2 mins]
    AutoDC workflow
    Input: labeled dataset
    Output: improved dataset
    Still requires an ML model to measure the improvement
  • [2 mins]
    AutoDC example
    Roman numerals data
    Embeddings → help identify incorrect labels and edge cases
  • [2 mins]
    Quick walkthrough of the GitHub repo
  • [2 mins]
    AutoDC improves ML model (fixed ResNet50 model)
    Tested on 3 image datasets
    10-15% improvement
    Shows that AutoDC can be a powerful tool for improving data and label quality
    What if we combine it with AutoML?
  • [2 mins]
    The final numbers need to be updated
  • [2 mins]
  • Traditional AutoML presents only the best method, but AutoML with a data-centric approach can (1) include preprocessing in the hyperparameter search and (2) present the other options to show how significant each parameter is.
  • A mix of continuous variables (sensors) and categorical variables with missing values is used to predict whether an aircraft engine should be replaced before its next flight. We use our AutoML library to build a strong model, impute the data, and build new features.
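
The idea in the notes above, treating a preprocessing choice as one more search parameter and then reporting how much each parameter matters, can be sketched as follows. This is an illustration under stated assumptions, not the speakers' AutoML library: `toy_score` is a synthetic stand-in for "train a model and return its validation score", its numbers are invented purely to make the example run, and `marginal_effect` is a crude importance measure chosen for brevity. All names here are hypothetical.

```python
from itertools import product
from statistics import mean

# Preprocessing (encode_categoricals) sits in the same search space as
# model hyperparameters (dropout) and the seed (random_state).
SEARCH_SPACE = {
    "encode_categoricals": [False, True],
    "dropout": [0.1, 0.3, 0.5],
    "random_state": [0, 1, 2],
}


def toy_score(config):
    """Synthetic score (assumption): NOT a result from the talk."""
    seed_effect = {0: 0.0, 1: 0.01, 2: -0.01}[config["random_state"]]
    encode_effect = 0.001 if config["encode_categoricals"] else 0.0
    return 0.80 - 0.2 * config["dropout"] + encode_effect + seed_effect


def run_search(space, score_fn):
    """Exhaustive grid search; returns every (config, score) pair, not just the best."""
    names = list(space)
    results = []
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        results.append((config, score_fn(config)))
    return results


def marginal_effect(results, name):
    """Gap between the best and worst mean score across one parameter's values."""
    by_value = {}
    for config, score in results:
        by_value.setdefault(config[name], []).append(score)
    means = [mean(scores) for scores in by_value.values()]
    return max(means) - min(means)


results = run_search(SEARCH_SPACE, toy_score)
best_config, best_score = max(results, key=lambda pair: pair[1])
effects = {name: marginal_effect(results, name) for name in SEARCH_SPACE}
ranking = sorted(effects, key=effects.get, reverse=True)

print(best_config)
print(ranking)  # parameters ordered from most to least influential
```

Keeping every (config, score) pair rather than only the winner is what makes the second step possible: a parameter whose values barely move the mean score is a candidate for removal, which is the argument slides 20 and 21 make about categorical encoding.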
