http://ainyc19.xnextcon.com
I will survey what is available in terms of Open Source and Proprietary tools for automating Data Science tasks and introduce two new tools: one to visualize a data set of any size with one click, and another to try multiple ML models and techniques with a single call. I will share the free Github Repos for both in the talk.
1. AI Next Conference: 7/24/2019
Automated Data Visualization and Machine Learning
Ram Seshadri
July 2019 Slide 1
2. AI Next Conference: 7/24/2019
“Why is my Data Science team taking sooo long to complete a simple project?”
-- A Frustrated CIO
“Machine learning teams are still struggling to take advantage of ML due to challenges with inflexible frameworks, lack of reproducibility, collaboration issues, and immature software tools”
-- Cecelia Shao, Comet.ml
The Answer?
3. AI Next Conference: 7/24/2019
How can we make DATA SCIENTISTS more productive?
HOWEVER, CURRENT TOOLS ARE LIMITED BECAUSE...
● They are proprietary and expensive (lock-in)
● They are Black Boxes which are too complex to interpret
● They offer very little reproducibility outside of the tool
INTRODUCING A SIMPLER APPROACH TO AUTO-ML:
● Faster Visualization: AutoViz
● Automatic Feature Selection: Auto_ViML
● Automatic Model Selection and Tuning: Auto_ViML
● One Click Model Serving and Production
Auto_ViML was designed along with AutoViz to Build Variant Interpretable Machine Learning Models Fast!
4. AI Next Conference: 7/24/2019
What is AutoViz and Auto_ViML?
●Open Source Tools for Faster Time to Insights, with these Design Goals:
○Simple: Invoke each with a single Line of Code
○Flexible: Suited to any kind of structured data set, with no Prep required
○Incremental: Usable by anyone, from beginners to experts
○Experimental: Compare multiple visualization methods and models step by step
○Interpretable: Get a clear explanation of the steps taken, with validation graphs
○Reproducible: No Black Box: reproducible model pipelines and outputs
○Extensible: Open Source, with contributions from the Python and DS community
I Built AutoViz and Auto_ViML to make my own life easier.
Hope they will do the same for you.
5. AI Next Conference: 7/24/2019
What is AutoViz?
OVERVIEW
AutoViz enables you to automatically visualize any data set with a Single Line of Code. It automatically:
1. Selects a Random Sample from the Data Set (if the Data Set is very large)
2. Selects the most important features using ML (if the Number of Variables is very large)
3. Selects the Best Methods to Visualize the Data for a given problem
4. Provides Charts that can be saved in PNG, JPG, and SVG Formats
6. AI Next Conference: 7/24/2019
Why AutoViz?
BENEFITS
Systematic: Look for insights systematically rather than through “gut instinct” or domain knowledge
Simple: Reduce features to the most important ones to deliver simple yet powerful insights
Explainable: Helps explain your hypotheses and variable selection better to others
7. AI Next Conference: 7/24/2019
How AutoViz Works
AutoViz PROCESS
Design Goals:
● Variable Classification: AutoViz classifies features into highly granular data types to determine how best to represent them in Charts
● Problem Identification: AutoViz can visualize any dataset for a given target: Regression, Classification, Time Series, Clustering, and more
● Complex Interactions: Most charts involve more than one variable, helping to deliver powerful insights with minimal effort
Implementation:
● Select the Most Important Features: AutoViz uses the powerful ML algorithm XGBoost to select important features given the target variable
● Select the Best Charts: AutoViz selects the best ways to visualize your data to extract insights from it
● Deliver them Fast!: AutoViz selects a statistically valid sample of the data to visualize (in case the data set is very large)
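The feature-selection step above can be sketched in a few lines: fit a tree ensemble against the target, rank the columns by importance, and keep the strongest ones. The deck names XGBoost; a scikit-learn RandomForestRegressor stands in here so the sketch is self-contained, and the synthetic data, column names, and 0.05 cutoff are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor  # stand-in for XGBoost

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "signal_1": rng.normal(size=n),
    "signal_2": rng.normal(size=n),
    "noise_1": rng.normal(size=n),   # unrelated to the target
    "noise_2": rng.normal(size=n),
})
# The target depends only on the two signal columns.
y = 3 * df["signal_1"] - 2 * df["signal_2"] + rng.normal(scale=0.1, size=n)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(df, y)
importances = pd.Series(model.feature_importances_, index=df.columns)
# Keep only features above an (illustrative) importance threshold.
selected = importances[importances > 0.05].index.tolist()
print(sorted(selected))  # keeps only the two signal columns
```

The same rank-and-threshold idea scales to wide data sets, where only the surviving columns are then charted.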
8. AI Next Conference: 7/24/2019
Github: https://github.com/AutoViML/AutoViz
AutoViz: Boston Housing*
Just Import...
from autoviz.AutoViz_Class import AutoViz_Class
AVC = AutoViz_Class()
And Run AutoViz.
# df: your pandas DataFrame; target: name of the dependent variable
dft = AVC.AutoViz('', sep=',', depVar=target, dfte=df, lowess=True)
Results...
Thanks to UCI Machine Learning Repository for all data sets in this presentation:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA:
University of California, School of Information and Computer Science.
9. AI Next Conference: 7/24/2019
AutoViz Example: Boston Housing*
INSIGHTS
● Number of Rooms and Median Value of Homes seem to be highly correlated
● As the Age of a Building increases, Median Value decreases, albeit slowly
● NOX and DIS seem to be highly correlated, though they appear to have a polynomial or non-linear relationship
* Thanks to UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
10. AI Next Conference: 7/24/2019
AutoViz Example: Housing
INSIGHTS
● Both CRIM and ZN are highly skewed; both may require a transformation
● PTRATIO and DIS seem somewhat skewed as well, but don’t require transformations
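The skewness insight above can be checked numerically with pandas: measure each column's skew and log-transform only the heavily skewed ones. The synthetic data, column names, and the 1.0 skew cutoff below are illustrative assumptions, not AutoViz internals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "crim_like": rng.lognormal(mean=0.0, sigma=1.5, size=1000),  # heavily right-skewed
    "rm_like": rng.normal(loc=6.0, scale=0.7, size=1000),        # roughly symmetric
})

skew_before = df.skew()
# Flag columns whose absolute skew exceeds an (illustrative) cutoff of 1.0.
skewed_cols = skew_before.index[skew_before.abs() > 1.0]
# log1p tames non-negative, right-skewed data.
df[skewed_cols] = np.log1p(df[skewed_cols])
print(list(skewed_cols))  # only the skewed column is transformed
```

This is the kind of transformation the CRIM and ZN insight suggests applying before modeling.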
11. AI Next Conference: 7/24/2019
AutoViz Example: Boston Housing
INSIGHTS
● RM, LSTAT, TAX, INDUS, AGE, and CRIM seem to be decently correlated with the Target; may be worth exploring whether they come up as Important Features
● Average Median Value of homes varies widely by CHAS and RAD; hence these would be important features in any model
12. AI Next Conference: 7/24/2019
How to Build a Better Model?
BUILD A ViML Model!
(VARIANT INTERPRETABLE MACHINE LEARNING MODEL, Step by Step)
PROCESS
1. Remove Low Information and Redundant Features
2. Add Polynomial, Interaction, and Other Features
3. Select Models from Simple to Complex and Perform Tuning
4. Add Entropy Binning, Stacking, and K-Means Featurizers to the model
5. Add Imbalanced sampling and training
6. Perform Ensembling of Multiple Types of models
13. AI Next Conference: 7/24/2019
Why Auto_ViML?
BENEFITS
SYSTEMATIC: Auto_ViML was designed from the ground up to mimic how a Data Scientist would approach a Modeling Problem
MULTIPLE MODELS: Build and test multiple models through Hyper Tuning and Cross Validation
FEATURE ENGINEERING: Enables selective model complexity by adding features and complexity step by step
AUTOMATIC FEATURE SELECTION: Models with Fewer Features are Simpler Models; Auto_ViML produces models with 10-90% Fewer Features than Regular Models without Significant Loss of Predictive Power*
TRANSPARENCY: Provides Deep Insights into the Data Set with Full Transparency
* Based on my experience. Your results may vary.
14. AI Next Conference: 7/24/2019
Auto_ViML LETS YOU TRY MULTIPLE APPROACHES
You can access all of Auto_ViML's powerful features with one line of Python Code after you import it.
You can turn features and flags on and off to see how they impact the Model.
TRY MULTIPLE APPROACHES TO GET THE BEST MODEL:
● INTERACTIONS vs. NO INTERACTIONS
● BOOSTING vs. BAGGING
● ENSEMBLING vs. STACKING
● IMBALANCED vs. BALANCED
● GRIDSEARCH vs. RANDOM
● SHAP vs. FEATURE IMPORTANCES
Just like a Data Scientist would...
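Trying multiple approaches amounts to sweeping combinations of on/off flags and re-running the same call for each. A minimal sketch with the standard library, using flag names borrowed from the Auto_ViML call shown later in the deck (the sweep itself, and the chosen value lists, are illustrative):

```python
from itertools import product

# Flag names mirror Auto_ViML's keyword arguments; the value lists are ours.
flags = {
    "Boosting_Flag": [True, False],    # boosting vs. bagging
    "Add_Poly": [0, 2],                # no interactions vs. interactions
    "Imbalanced_Flag": [True, False],  # imbalanced vs. balanced sampling
}

# One dict of keyword arguments per experiment to run.
experiments = [dict(zip(flags, combo)) for combo in product(*flags.values())]
print(len(experiments))  # → 8 flag settings to compare
```

Each dict could then be splatted into the modeling call (e.g. `Auto_ViML(train, target, test, **exp)`) and the resulting scores compared.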
15. AI Next Conference: 7/24/2019
Github: https://github.com/AutoViML/Auto_ViML
Auto_ViML: Boston Housing
Just Import...
from autoviml.Auto_ViML import Auto_ViML
And Run Auto_ViML.
# train/test: pandas DataFrames; target: name of the dependent variable
model, features, trainm, testm = Auto_ViML(train, target, test,
                                           sample_submission='',
                                           hyper_param='GS',
                                           scoring_parameter='f1',
                                           Boosting_Flag=None,
                                           KMeans_Featurizer=False,
                                           Add_Poly=0,
                                           Stacking_Flag=False,
                                           Binning_Flag=False,
                                           Imbalanced_Flag=True,
                                           verbose=0)
Get the Model, Features, and transformed Train and Test data...
16. AI Next Conference: 7/24/2019
Auto_ViML: Boston Housing*
Results: Start with a Linear Model
Here is an example of a Regression data set: Boston Housing*. There are 13 predictors in the dataset, but Auto_ViML finds that only 10 variables are needed to get the job done. Also watch the Feature Importances.
DATA SET SIZE: 506 x 14
TIME TAKEN: 6 secs
VARIABLES SELECTED: 10
FEATURE REDUCTION: 24%
17. AI Next Conference: 7/24/2019
Auto_ViML: Boston Housing
Results: Move to Random Forests
Time Taken = 30 seconds
18. AI Next Conference: 7/24/2019
Auto_ViML: Boston Housing
Results: Close with XGBoost
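The progression across the last three slides (linear model, then random forests, then XGBoost) can be sketched as a simple ladder: cross-validate each model from simple to complex and keep the best. GradientBoostingRegressor stands in for XGBoost so the sketch needs only scikit-learn, and the synthetic data is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# A mildly non-linear target, so tree ensembles have room to beat the linear fit.
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

models = [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("boosting", GradientBoostingRegressor(random_state=0)),  # XGBoost stand-in
]
# Mean cross-validated R^2 per model, simple to complex.
scores = {name: cross_val_score(m, X, y, cv=3).mean() for name, m in models}
best = max(scores, key=scores.get)
print(best)  # a tree ensemble wins on this non-linear target
```

Stopping at the simplest model within tolerance of the best score keeps the final model interpretable.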
20. AI Next Conference: 7/24/2019
Auto_ViML: Boston Housing
Multiple Models:
● Linear Model with Interaction Variables
● Ensemble Model with Binning
● Forests Model with Binning Numerics
● XGBoost Model with Stacking
22. AI Next Conference: 7/24/2019
Auto_ViML: Wisconsin Breast Cancer
The Wisconsin Breast Cancer* data set is a classic Data Set: Auto_ViML took 12 Seconds to find the best features and the best model, with a Weighted F1 score of 100% on the validation set using a Linear model.
DATA SET SIZE: 512 x 32
TIME TAKEN: 12 Secs
FEATURE REDUCTION: 52%
MACRO AVERAGE ROC AUC: 100%
Results: Compare the results to another model using Deep Learning and Keras (Link: “Hyperparameter Optimization with Keras” by Mikko)
* Thanks to UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
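The two metrics reported above (Weighted F1 and Macro Average ROC AUC) can be reproduced on any validation set with scikit-learn; the labels and scores below are toy values, not the deck's actual predictions.

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                # toy validation labels
y_pred = [0, 0, 1, 1, 1, 0]                # a perfect classifier's predictions
y_score = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3]   # predicted probability of class 1

# Weighted F1 averages per-class F1, weighted by class support.
print(f1_score(y_true, y_pred, average="weighted"))  # → 1.0
# ROC AUC is computed from the probability scores, not the hard labels.
print(roc_auc_score(y_true, y_score))                # → 1.0
```

On a real run, `y_pred` and `y_score` would come from the model returned by Auto_ViML, evaluated on held-out data.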
23. AI Next Conference: 7/24/2019
Next Steps for AutoViz and Auto_ViML...
●Auto_ViML: What’s Missing / Could be Improved:
○No Feature Engineering: You can create your own or use kits like featuretools, etc.
○No Image/Video/NLP Support: At the moment, it removes these features from model consideration
○No Time Series modeling: Auto_TimeSeries is in the works. Stay Tuned.
○No Neural Networks or Deep Learning: You can add your own modules or use tools like Ludwig
○Model serving: A module for test data transformation needs to be added
●AutoViz: What’s Missing / Could be Improved:
○Build it into Existing Tools so that structured data can be Visualized Fast!
○Build it into Educational tools to make it easy for Students and Colleges (where small, structured datasets are the Norm) to help Visualize data (as writing code is still very hard for Students)
○Add additional Visualizations such as Pie Charts, Mosaic Charts, etc.
○Build it into Industrial Instruments such as IoT tools so that large data sets can be visualized