Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning

AutoViz and Auto_ViML
Slide 1
Faster Time to Insights using Automated Visualization and Machine Learning
Ram Seshadri
March 2020

“Machine learning teams are still struggling to take advantage of ML
due to challenges with inflexible frameworks, lack of reproducibility,
collaboration issues, and immature software tools”
Cecelia Shao
Comet.ml
“Why is my Data Science team taking sooo
long to complete a simple project?”
-- A Frustrated CIO
Slide 2
The Answer?

How can we make DATA SCIENTISTS more productive?
Slide 3
HOWEVER CURRENT TOOLS ARE LIMITED BECAUSE...
● They are proprietary and expensive (lock-in)
● Black Boxes which are too complex to interpret
● Very little reproducibility outside of tool
INTRODUCING A SIMPLER APPROACH TO AUTO-ML
● Faster Visualization: AutoViz
● Automatic Feature Selection: Auto_ViML
● Automatic Model Selection and Tuning: Auto_ViML
● One Click Model Serving and Production
Auto_ViML was designed along with AutoViz to Build Variant Interpretable Machine Learning Models Fast!

What are AutoViz and Auto_ViML?
Slide 4
● Open Source Tools for Faster Time to Insights, with these Design Goals:
○ Simple: Invoke them with a single Line of Code (each)
○ Flexible: Suited to any kind of structured data set with no Prep required
○ Incremental: Can be used by anyone from beginners to experts
○ Experimental: Compare multiple visualization methods and models step by step
○ Interpretable: Get a clear explanation of the steps taken, with validation graphs
○ Reproducible: No Black Box. Reproducible model pipelines and outputs
○ Extensible: Open Source with contributions from the Python and DS community
I built AutoViz and Auto_ViML to make my own life easier.
I hope they will do the same for you.

Where do they fit in the ML Workflow?
Slide 5
AutoViz and Auto_ViML do not completely eliminate the need for data scientists. But they speed up some
steps in the ML workflow, which makes data scientists more productive!
(Workflow diagram labels: AutoViz, Auto_ViML)

What is AutoViz?
Slide 6
OVERVIEW
AutoViz enables you to automatically visualize any data set with a Single Line of Code. It automatically:
1. Selects a Random Sample from the Data Set (if the Data Set is very large)
2. Selects the most important features using ML (if the Number of Variables is very large)
3. Selects the Best Methods to Visualize Data for a given problem
4. Provides Charts to be saved in PNG, JPG, and SVG Formats

Why AutoViz?
Slide 7
BENEFITS
● Systematic: Look for insights systematically rather than through “gut instinct” or domain knowledge
● Simple: Reduce features to the most important ones to deliver simple yet powerful insights
● Explainable: Help explain your hypotheses and variable selection better to others

How AutoViz Works
Slide 8
AutoViz PROCESS
Design Goals
● Variable Classification: AutoViz classifies features into highly granular data types to determine how best to represent them in Charts
● Problem Identification: AutoViz can visualize any dataset for a given target: Regression, Classification, Time Series, Clustering and more
● Complex Interactions: Most charts involve more than one variable, helping to deliver powerful insights with minimal effort
Implementation
● Select the Most Important Features: AutoViz uses the powerful ML algorithm, XGBoost, to select important features given the target variable
● Select the Best Charts: AutoViz selects the best ways to visualize your data to extract insights from it
● Deliver them Fast!: AutoViz selects statistically valid sample data to visualize (in case the data set is very large)
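
The sampling idea on this slide (visualize a sample when the data set is very large) can be pictured with a small sketch. This is not AutoViz's own code: the helper name `sample_rows`, the `max_rows` threshold and the fixed seed are assumptions made purely for illustration, and only Python's standard library is used.

```python
import random

def sample_rows(rows, max_rows=1000, seed=42):
    """Return all rows if the data is small, else a random sample of max_rows.

    Hypothetical helper: max_rows and seed are illustrative values,
    not AutoViz settings.
    """
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, max_rows)  # sampling without replacement

# Toy data sets standing in for a small and a very large table
small = [{"id": i} for i in range(10)]
large = [{"id": i} for i in range(5000)]

small_sample = sample_rows(small)
large_sample = sample_rows(large)
print(len(small_sample), len(large_sample))
```

Sampling without replacement keeps every plotted row a real, distinct row of the original data, which is one simple way to keep charts representative while bounding plotting time.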

Installing and Running AutoViz is as easy as 1, 2, 3!
Slide 9
1. First, Install...
pip install autoviz
2. Next, Import...
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
3. Run AutoViz.
# Pass '' as the filename when supplying a DataFrame (df) instead of a file;
# sep is the separator in the data and target is the target variable name
dft = AV.AutoViz('', sep, target, df)
See Results...
Source Code: https://github.com/AutoViML/AutoViz
Thanks to UCI Machine Learning Repository for all data sets in this presentation:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA:
University of California, School of Information and Computer Science.

AutoViz Downloads
Slide 10
pip install autoviz
AutoViz has now been downloaded more than 17K times+
* As of August 23, 2019
Chart Source Courtesy: https://packaging.python.org/guides/analyzing-pypi-package-downloads/
+ Stats Source Courtesy: PePy org

AutoViz: Housing*
Slide 11
INSIGHTS
● Number of Rooms and Median Value of Homes seem to be highly correlated
● As Age of Building increases, Median Value decreases, albeit slowly
● NOX and DIS seem to be highly correlated, though they seem to have a polynomial or non-linear relationship
* Thanks to UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

AutoViz: Housing*
INSIGHTS
● Both CRIM and ZN are highly skewed
● Both may require a transformation
● PTRATIO and DIS seem to be somewhat skewed as well but don’t require transformations
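
To make the skewness insight concrete, here is a minimal sketch of measuring skew and applying a transformation. The slide does not say which transformation to use; `log1p` is assumed here as one common choice for right-skewed, non-negative values. The `skewness` helper and the numbers are made up for illustration and are not taken from the housing data.

```python
import math

def skewness(values):
    """Population skewness: mean cubed deviation divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0  # a constant column has no skew
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5

# Made-up right-skewed values (many small, a few very large)
raw = [0.01, 0.02, 0.03, 0.05, 0.08, 0.1, 0.3, 1.2, 4.5, 25.0]

# Assumed transformation: log(1 + x), which compresses the long right tail
logged = [math.log1p(v) for v in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)
print(round(raw_skew, 2), round(log_skew, 2))
```

On this toy column the transformed values are noticeably less skewed than the raw ones, which is the effect one would look for before feeding a feature like CRIM or ZN to a linear model.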

AutoViz: Housing*
INSIGHTS
● RM, LSTAT, TAX, INDUS, AGE, and CRIM seem to be decently correlated with Target. May be worth exploring if they come up as Important Features.
● Average Median Value of homes varies widely by CHAS and RAD. Hence they would be important features in any model.
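
The "correlated with Target" reading of these charts can be approximated numerically. The sketch below ranks features by the absolute Pearson correlation with the target. It is an illustration only: the column names are borrowed from the slide, the values are invented, and the `pearson` helper is a hypothetical stand-in rather than anything AutoViz exposes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / (sx * sy)

# Invented toy values; only the column names come from the slide
target = [10.0, 12.0, 15.0, 19.0, 24.0]
features = {
    "RM": [4.0, 5.0, 6.0, 7.0, 8.0],         # rises with the target
    "LSTAT": [30.0, 25.0, 18.0, 12.0, 5.0],  # falls as the target rises
    "NOISE": [1.0, -1.0, 1.0, -1.0, 1.0],    # hypothetical unrelated column
}

# Rank by strength of the relationship, ignoring its direction
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], target)),
    reverse=True,
)
print(ranked)
```

Pearson correlation only captures linear relationships, so a feature with a non-linear link to the target can rank low here and still matter in a tree-based model.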

How AutoViz enables better Modeling
Slide 14
1. Call AutoViz with your Filename and target
Call AutoViz from Python Jupyter Notebooks:
○ Give the directory path+filename, the separator in the data, and the target variable name (can be an empty string)
○ If a DataFrame is available instead of a filename, give the name of the DataFrame in the call above
○ Run AutoViz
2. Look at Charts to derive Insights into Data
Look at the Charts and Graphs generated by AutoViz:
○ Sometimes, AV can generate over 1000 charts!
○ Save the Charts you like best by right click and save
○ You also have the option of getting charts in SVG or PNG or JPEG formats
3. Add/Remove Variables and re-run AutoViz
Select Variables:
○ That appear most promising for the Modeling goal
○ Rerun AutoViz by removing variables or adding new variables that provide more insights
4. Build your baseline model with AutoViz features
Build your first baseline model using:
○ Features selected by AutoViz
○ Iterate through models and visualization until satisfied with the result

How to Build a better model?
Slide 15
PROCESS
BUILD A ViML Model!
(VARIANT INTERPRETABLE MACHINE LEARNING MODEL, Step by Step)
● Remove Low Information and Redundant Features
● Add Polynomial and Interaction, Other Features
● Select Models from Simple to Complex and Perform Tuning
● Add Entropy Binning, Stacking to K-Means Featurizers to model
● Add Imbalanced sampling and training
● Perform Ensembling of Multiple Types of models

What is Auto_ViML?
Slide 16
VARIANT INTERPRETABLE MACHINE LEARNING MODELS with one API
● VARIANT: ViML helps you try as many as 15 different models with one API
● INTERPRETABLE: ViML reduces features to the bare minimum (as much as 10-90% reduction in features)
● REPRODUCIBLE: ViML is fully reproducible by explaining its steps (full transparency)
● INSIGHTFUL: Delivers insights into which models and techniques will work best with your data
● ITERATIVE: ViML helps you build more complex models after trying out simpler options

Why Auto_ViML?
Slide 17
BENEFITS
● SYSTEMATIC: Auto_ViML was designed from the ground up to mimic how a Data Scientist would approach a Modeling Problem.
● FEATURE ENGINEERING: Enables selective model complexity by adding features and complexity step by step
● TRANSPARENCY: Provides Deep Insights into the Data Set with Full Transparency
● AUTOMATIC FEATURE SELECTION: Models with Fewer Features result in Simpler Models. Auto_ViML Produces models with 10-99% Fewer Features than Regular Models without Significant Loss of Predictive Power*
● MULTIPLE MODELS: Build and test multiple models through Hyper Tuning and Cross Validation
* Based on my experience. Your results may vary.

Auto_ViML LETS YOU TRY MULTIPLE APPROACHES
Slide 18
You can access all the powerful features of Auto_ViML with one line of Python Code after you import.
You can turn features and flags on and off to see how they impact the Model.
TRY MULTIPLE APPROACHES TO GET THE BEST MODEL
● INTERACTIONS vs. NO INTERACTIONS (0 = No Interactions, 1 = Pairwise Interactions, 2 = Squared Variables)
● BOOSTING vs. BAGGING (Now with CatBoost!!)
● ENSEMBLING vs. STACKING (Predictions from 12 models)
● IMBALANCED vs. BALANCED (Downsampling supported)
● GRIDSEARCH vs. RANDOM (HyperOpt coming)
● FEATURE IMPORTANCES (SHAP included)
Keep Upgrading your Auto_ViML version since it is updated monthly!
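
The interaction settings on this slide (0 = none, 1 = pairwise interactions, 2 = squared variables) can be illustrated with a small standard-library sketch. This follows the slide's wording only; it is not Auto_ViML's implementation, and the library's actual behaviour for each setting may differ. The function name `add_poly_features` and the generated column names are hypothetical.

```python
from itertools import combinations

def add_poly_features(row, level):
    """Return a copy of row with extra features, following the slide's legend.

    level 0: no new features
    level 1: pairwise products of the original features
    level 2: squares of the original features (assumption based on the slide)
    """
    out = dict(row)
    names = sorted(row)  # sorted so generated column names are deterministic
    if level == 1:
        for a, b in combinations(names, 2):
            out[f"{a}*{b}"] = row[a] * row[b]
    elif level == 2:
        for a in names:
            out[f"{a}^2"] = row[a] ** 2
    return out

# One made-up record with three numeric features
row = {"RM": 6.0, "LSTAT": 5.0, "TAX": 300.0}

no_intxns = add_poly_features(row, 0)
pairwise = add_poly_features(row, 1)
squared = add_poly_features(row, 2)
print(sorted(pairwise))
print(sorted(squared))
```

With n original features, the pairwise setting adds n(n-1)/2 columns and the squared setting adds n, which is why such options are usually paired with feature reduction afterwards.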

Installing and Running Auto_ViML is as easy as 1, 2, 3!
Slide 19
1. First, Install...
pip install autoviml
2. Next, Import...
from autoviml.Auto_ViML import Auto_ViML
3. Run Auto_ViML.
model, features, trainm, testm = Auto_ViML(train, target, test, sample_submission='',
    hyper_param='GS', scoring_parameter='rmse',
    feature_reduction=True,
    Boosting_Flag=None, KMeans_Featurizer=False, Add_Poly=0,
    Stacking_Flag=False, Binning_Flag=False, Imbalanced_Flag=True,
    verbose=0)
Get a fully trained Model, best Features and transformed Train and Test data...
Github: https://github.com/AutoViML/Auto_ViML

Auto_ViML Downloads
Slide 20
pip install autoviml
Auto_ViML has now nearly 50K downloads+
* As of March 20, 2020
Chart Source Courtesy: https://packaging.python.org/guides/analyzing-pypi-package-downloads/
+ Stats Source Courtesy: PePy org

Auto_ViML: Boston Housing*
Slide 21
Here is an example of a Regression data set: Boston Housing*. There are 13 predictors in the dataset.
But Auto_ViML finds that only 10 variables are needed to get the job done. For example:
['RM', 'LSTAT', 'NOX', 'PTRATIO', 'CRIM', 'TAX', 'CHAS', 'B', 'RAD', 'ZN']
DATA SET SIZE: 506 x 14
TIME TAKEN: 6 secs
Variables Selected: 10
FEATURE REDUCTION: 24%
Results: Start with Linear Model
* Thanks to UCI Machine Learning Repository

Auto_ViML: Boston Housing
Slide 22
Results:
Move to Random Forests
Time Taken = 30 seconds

Slide 23
Results:
Close with XGBoost

Multiple Models
Slide 25
● Linear Model with Interaction Variables
● Ensemble Model with Binning
● Forests Model with Binning Numerics
● XGBoost Model with Stacking

Auto_ViML: Wisconsin Breast Cancer
Slide 26
The Wisconsin Breast Cancer* data set is a classic Data Set: Auto_ViML took 12 Seconds to find the
best features and best model, with a Weighted F1 score of 100% on the validation set using a Linear model
Wisconsin Breast Cancer Data Set
DATA SET SIZE: 512 x 32
TIME TAKEN: 12 Secs
FEATURE REDUCTION: 52%
Macro Average ROC AUC: 100%
Results: Compare the results to another model using Deep Learning and Keras
Link: “Hyperparameter Optimization with Keras” by Mikko
* https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
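
For readers unfamiliar with the Weighted F1 score quoted on this slide, here is a minimal standard-library sketch of the usual definition: the F1 score of each class, averaged with weights proportional to how often each class appears in the true labels. The helper `weighted_f1` and the toy labels are illustrative assumptions and are not the slide's data or Auto_ViML's code.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = count - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is > 0 because count > 0
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += f1 * count / total
    return score

# Toy labels: "M" = malignant, "B" = benign (made-up predictions)
y_true = ["M", "M", "B", "B", "B", "B"]
y_pred = ["M", "B", "B", "B", "B", "B"]

imperfect = weighted_f1(y_true, y_pred)
perfect = weighted_f1(y_true, y_true)
print(round(imperfect, 4), perfect)
```

A score of 100% therefore means every validation example was classified correctly, since any single mistake lowers the F1 of at least one class.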

Next Steps for AutoViz and Auto_ViML...
Slide 27
Auto_ViML
● What’s Missing / Could be Improved:
○ No Feature Engineering: You can create your own or use kits like featuretools, etc.
○ No Image/Video/NLP Support: At the moment, it removes these features from model considerations
○ No Time Series modeling: Auto_TimeSeries is in the works. Stay Tuned.
○ No Neural Networks or Deep Learning: You can add your own modules or use tools like Ludwig
○ Model serving: Adding a module for test data transformation is necessary
AutoViz
● What’s Missing / Could be Improved:
○ Build it into Existing Tools such that structured data can be Visualized Fast!
○ Build it into Educational tools to make it easy for Students and Colleges (where small, structured
datasets are the Norm) to Visualize data (as writing code is still very hard for Students)
○ Add additional Visualizations such as Pie Charts, Mosaic Charts, etc.
○ Build it into Industrial Instruments such as IoT tools so that large data sets can be visualized

AutoViz + Auto_ViML = POWERFUL INSIGHTS
Slide 28
LOAD DATA SET
RUN AUTOVIZ TO VISUALIZE
ENGINEER FEATURES
RUN AUTO_ViML AGAIN
ADD / REMOVE FEATURES
RUN AUTO_ViML
BEST MODEL and FEATURES SELECTED
BUILD PIPELINE
SERVE MODEL
Not in Scope

THANK YOU
Slide 29
Ram Seshadri rsesha2001@yahoo.com
