Successfully reported this slideshow.
Your SlideShare is downloading. ×

MLBox 0.8.2

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Loading in …3
×

Check these out next

1 of 22 Ad

More Related Content

Slideshows for you (20)

Similar to MLBox 0.8.2 (20)

Advertisement

Recently uploaded (20)

MLBox 0.8.2

  1. 1. BY AXEL DE ROMBLAY & HENRI GÉRARD
  2. 2. HTTPS://GITHUB.COM/AXELDEROMBLAY/MLBOX
  3. 3. • INTRODUCTION : WHY AUTO-ML IS GAINING MOMENTUM ? • FOCUS ON THE AUTOMATION PROCESS • MLBOX COMPARED WITH OTHER AUTOMATED TOOLS • CONCLUSION • LIVE DEMO • Q&A
  4. 4. Data ScientistData Computation means Data pre-processing Model tuning Machine Learning Almost an automated process…
  5. 5. Auto Machine Learning A fully automated process Data Computation meansRobot • Supervised tasks - classification - regression • Structured data - csv files - json files - … • Unsupervised tasks - outlier detection - clustering - … • Unstructured data - images - texts - …
  6. 6. What is auto-ML ? We want to automate… …the maximum number of steps in a ML pipeline… …with minimum human intervention… …while conserving a high performance !
  7. 7. Data cleaning (duplicates, ids, correlations, leaks, … ) Data encoding (NA, dates, text, categorical features, … ) STEP 2 : Preprocessing STEP 1 : Reading / merging STEP 3 : Optimisation Feature selection Feature engineering Model selection Prediction Model interpretation STEP 4 : Application Focus on the automation process Diagram of a standard ML pipeline
  8. 8. Step 1: reading and merging Maybe the hardest step. Not a priority for auto-ML at the moment. Some inputs : paths to the data + target name Difficult to auto-merge different sources
  9. 9. Step 2: preprocessing Not all packages tackle preprocessing No inputs : heuristics to detect feature types are easy Naive encoding for most auto-ML packages
  10. 10. Step 3: optimization Top priority for auto-ML community Some inputs : scoring function + a hyper-parameter space Computation time can be long !
  11. 11. Step 4: application Prediction is also a priority for auto-ML No auto-monitoring after model deployment Fast and accurate step
  12. 12. Maintenance Ease of setup Automated steps - Reading - Encoding - Optimisation - Encoding - Optimisation - Reading/merging - Cleaning - Encoding - Optimisation - Encoding - Optimisation - Optimisation Quality of automation Open source ? Overview of various solutions
  13. 13. MLBox: a fully automated package
  14. 14. STEP 1 : Reading / merging From several raw datasets to one structured dataset. • List of paths to all the datasets • Target name • Reading of several files (csv, xls, json and hdf5) • Auto-merging of all the sources – information crunching • Task detection (binary/multiclass classification or regression) • Split between train and test sets • Basic information display • Dense structured train and test files (without duplicates) Features what is Automated by MLBox ?
  15. 15. • A dataset • Auto-cleaning/dropping : duplicates, drifts / covariate shifts (*), high correlations, highly sparse features / samples, constants, … • Feature encoding : missing values, lists, dates, categorical features, text, … - SEVERAL STRATEGIES • A cleaned dataset with numerical features STEP 2 : Preprocessing From a dirty dataset to a cleaned numerical one Features what is Automated by MLBox ?
  16. 16. STEP 3 : Optimisation A wide range of models are tested and cross-validated • A metric (a wide choice, otherwise can be implemented) - OPTIONAL • A hyper-parameter space – OPTIONAL • The train set • Feature engineering : using neural networks (*) • Feature selection : filter methods, wrapper methods, embedded methods • Model selection : a wide range of accurate models (LightGBM, XGBoost, Random Forest, Linear, …) • Hyper-parameter optimisation : TPE (Bayesian optimisation method) – dumping of fitted pipelines • Ensembling : multi-layer stacking, boosting, bagging, … • The optimal pipeline configuration Features what is Automated by MLBox ?
  17. 17. STEP 4 : Prediction The best model is fitted and predicts on the test set • Train and test datasets • The optimal pipeline configuration - OPTIONAL • Target prediction : classification and regression - dumping of predictions + optimal fitted pipeline • Model interpretation : feature importances (saved) • Leak detection : warning • The predictions on the test set Features what is Automated by MLBox ?
  18. 18.  Quality: functional code : tested on Kaggle & codecov  Performance: fully distributed and optimised  AI: dumping and automatic reading of computations  Updates: latest algorithms MLBox: about the package  Compatibility: Python 3.5-3.7 / Windows, Linux & Mac OS  Quick setup: $ pip install mlbox  User friendly: tutorials, docs, examples…
  19. 19. Conclusion: the benefits of auto-ML Democratizes Machine Learning Machine Learning for everybody (no coding) Avoids errors A robot never makes mistakes… Increases productivity Repetitive tasks are automated and accelerated ! A Data Scientist can focus more on non-traditional issues !
  20. 20. Live demo https://colab.research.google.com/drive/1EwYZg_4yayagfQfOrXUUFidUuYbj0fEi
  21. 21. Next releases • FEATURES TEXT & TIME SERIES ENCODING AUTO OPTIMISATION (USING TUNE) MODEL INTERPRETATION (USING LIME) • PERFORMANCES PIPELINE PARALLELISM PER VARIABLE TYPE (USING COLUMNTRANSFORMER IN SKLEARN) PIPELINE MEMORY IMPROVEMENTS • ERGONOMY FIT AND PREDICT TASKS SPLIT VERBOSITY IMPROVEMENTS (USING MLFLOW…) ERGONOMY IMPROVEMENTS (PARAMETERS NAMES, …)
  22. 22. Thank you ! Questions ? Please star or share the repo !

Editor's Notes

  • 1min
  • 1min
  • 1min
  • 2min
    Data preprocessing and model tuning are both repetitive tasks that take a lot of time…
    A Data Scientist is expensive !
  • 2min
    So why don’t we replace the DS by a robot ??? We would save time and money !
    Let’s see what can be automated !
  • 1min
    Performance = computation time + accuracy
  • 2min
    2012-2017 : Machine Learning (understand the benefit of Machine Learning | identify use cases)
    2018- ? : Auto ML (model deployment issue !)
  • 2min
    Companies will benefit from Auto-ml to deploy ML models !
    Solution : identify supervised tasks that can be automated with a reusable/generic pipeline ! Other tasks are specifics (so difficult to be automated with a single tool/method: DS will focus on these tasks !).

    Gap between IT companies that have already automated pipelines and other companies that are late !
  • 2min
    90% of machine learning tasks follow this pipeline
  • 1min
    Reading and creating the datasets has always been manual ! Here is the key point for automation !!!
  • 1min
    Preprocessing issues : drifts (ids), leaks, duplicates ??
    Heuristics : str -> categorical features, long str -> text, …
    Naive encoding : no NN !
  • 1min
    First step automated : no needs to automate more !! (some solution pretend to test « thousands » of models but it is useless !)
    Optional inputs
    Optimization algo : a lot of ways to tackle the high dimension issue
    Meta learning !
  • 1min
    NO monitoring !
    Feature interpretation : ok
  • 3min
    MLBox tackles merging !!
  • 1min
    Available on PyPI
    Github with tutos, examples
    Docs with articles, kaggle kernels, …

    Performance : tested on Kaggle !
    Features : drifts, embeddings, stacking, leak, feature importances,…
  • 2min
  • 2min
  • 2min
  • 2min
  • 2min
  • 2min
    Democratize ML !!
    Google CEO Sundar Pichai said that, “We hope AutoML will take an ability that a few Ph.D.s have today and will make it possible in three to five years for hundreds of thousands of developers to design new neural nets for their particular needs.”
  • 10min
    THANKS ! Q&A ?
  • 10min
    THANKS ! Q&A ?
  • 10min
    THANKS ! Q&A ?

×