Automatic Visualization
Leland Wilkinson
Chief Scientist, H2O
leland@h2o.ai
www.cs.uic.edu/~wilkinson
Visualizing Big Data
• Complexity: Many functions are polynomial or exponential
• Curse of Dimensionality: distances tend toward a constant as the number of dimensions p grows
• Chokepoint: Cannot send big data over the wire
• Real Estate: Cannot plot big data on the client
• Cheesy solutions in 2D
• Pixelate (too complex for higher dimensions)
• Project (usually violates triangle inequality for )
• Image maps (OK for popups and simple links, not for EDA)
• Viable solutions
• Aggregate (big n) to a few thousand rows
• Project (big p) to a few dozen columns
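The Curse of Dimensionality bullet above can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly from the unit cube; the function name `relative_contrast` and the sample sizes are illustrative choices, not part of the talk. It measures how much the farthest neighbor of a point differs from the nearest one as p increases.

```python
import math
import random

def relative_contrast(n_points, p, seed=0):
    """Return (max - min) / min of the distances from the first point to all others.

    Points are drawn uniformly from the p-dimensional unit cube (an assumption
    made only for this illustration).
    """
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(p)] for _ in range(n_points)]
    d = [math.dist(pts[0], q) for q in pts[1:]]
    return (max(d) - min(d)) / min(d)

# The contrast shrinks as p grows: nearest and farthest neighbors look alike.
for p in (2, 10, 100, 1000):
    print(p, round(relative_contrast(200, p), 3))
```

When the contrast is small, "nearest" carries little information, which is one reason to reduce big p to a few dozen columns before computing distances.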
Big Data
set cover (core sets)
Outliers
• An anomaly is an observation inconsistent with a set of beliefs.
• The anomaly depends on these beliefs
• An outlier is an observation inconsistent with a set of points.
• The points are presumed generated by a probabilistic process in a vector space.
• All outliers are anomalies but not all anomalies are outliers
• Some anomalies are logical or mathematical
• Outliers are probabilistic
• Outlier detection has a history of more than 200 years.
• The original goal was to reduce bias in models
• The goal today is to learn interesting stuff from examining outliers
• Statisticians no longer delete outliers. They use robust methods.
Outliers
• Barnett & Lewis (1994). Outliers in Statistical Data.
• Rousseeuw & Leroy (1987). Robust Regression & Outlier Detection.
• Hartigan (1975). Clustering Algorithms.
Beauty is truth, truth beauty,—that is all
Ye know on earth, and all ye need to know.
Outliers
• Univariate outliers
• Distance from Center Rule
• Gaps Rule
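The slide names the two univariate rules without spelling them out. The sketch below is one possible reading, not the speaker's definition: the distance-from-center rule is implemented with a median/MAD robust z-score, and the gaps rule flags everything beyond the first unusually wide gap between sorted values. The function names and the `cutoff` and `factor` defaults are assumptions chosen for illustration.

```python
import statistics

def distance_from_center(values, cutoff=3.5):
    """Flag values whose robust z-score (median and MAD based) exceeds cutoff.

    Assumes the MAD is non-zero, i.e. fewer than half the values are identical.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # rescales MAD to match a standard deviation for normal data
    return [v for v in values if abs(v - med) / scale > cutoff]

def gaps_rule(values, factor=5.0):
    """Flag values cut off from the main body by a gap wider than factor * median gap."""
    xs = sorted(values)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    wide = factor * statistics.median(gaps)
    mid = len(xs) // 2
    # Walk outward from the middle; everything past the first wide gap is flagged.
    hi = next((i for i in range(mid, len(gaps)) if gaps[i] > wide), None)
    lo = next((i for i in range(mid - 1, -1, -1) if gaps[i] > wide), None)
    low_side = xs[:lo + 1] if lo is not None else []
    high_side = xs[hi + 1:] if hi is not None else []
    return low_side + high_side

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 25.0]
print(distance_from_center(data), gaps_rule(data))
```

Both rules agree on this toy sample; they can disagree when a small cluster sits far from the center but has no internal gap.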
Outliers
• Multivariate outliers
• Distance from Center Rule
• Gaps Rule
Outliers
1. Map categorical variables to continuous values (SVD).
2. If p is large, use random projections to reduce dimensionality.
3. Normalize columns to [0, 1].
4. If n is large, aggregate.
• If p = 2, you could use gridding or hex binning.
• But the general solution is based on Hartigan’s Leader algorithm.
5. Compute nearest-neighbor distances between points.
6. Fit an exponential distribution to the largest distances.
7. Reject points in the upper tail of this distribution.
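Steps 3 to 7 above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the speaker's implementation: it skips steps 1 and 2, estimates the exponential scale from the median of the upper half of the distances (a robustness choice made here), and uses brute-force O(n²) neighbor search. The names `normalize`, `leader`, `nn_distances`, `flag_outliers` and the defaults `alpha` and `tail_fraction` are hypothetical.

```python
import math

def normalize(rows):
    """Step 3: rescale every column to [0, 1] (constant columns map to 0)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - l) / s for v, l, s in zip(r, lo, span)] for r in rows]

def leader(rows, radius):
    """Step 4: one-pass leader clustering. A row within radius of an existing
    leader is absorbed by it; otherwise the row becomes a new leader."""
    leaders = []
    for r in rows:
        if not any(math.dist(r, l) <= radius for l in leaders):
            leaders.append(r)
    return leaders

def nn_distances(rows):
    """Step 5: distance from each row to its nearest other row (brute force)."""
    return [min(math.dist(r, q) for j, q in enumerate(rows) if j != i)
            for i, r in enumerate(rows)]

def flag_outliers(rows, alpha=0.001, tail_fraction=0.5):
    """Steps 6 and 7: fit a shifted exponential to the largest distances and
    return the indices of rows beyond its (1 - alpha) quantile."""
    d = nn_distances(normalize(rows))
    tail = sorted(d)[int(len(d) * (1 - tail_fraction)):]
    location = tail[0]
    excess = sorted(t - location for t in tail)
    # For an exponential distribution, median = scale * ln 2.
    scale = excess[len(excess) // 2] / math.log(2)
    threshold = location + scale * math.log(1 / alpha)
    return [i for i, di in enumerate(d) if di > threshold]
```

For large n, one would run `leader` on the normalized rows first and apply `flag_outliers` to the leaders only, then treat the members of a flagged leader as outliers; that mapping is omitted here to keep the sketch short.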
Outliers
• Low-dimensional projections are not reliable ways to discover
high-dimensional outliers.
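A small constructed example illustrates the point. The data below are synthetic and the helper names are hypothetical: two strongly correlated variables, plus a suspect point that looks ordinary on each axis separately but lies far from the joint cloud, as measured by its Mahalanobis distance.

```python
import math
import random

rng = random.Random(0)
# Synthetic data: y follows x closely, so the cloud hugs the diagonal.
pts = []
for _ in range(500):
    x = rng.gauss(0, 1)
    pts.append((x, x + rng.gauss(0, 0.1)))
suspect = (1.5, -1.5)

def mean(v):
    return sum(v) / len(v)

def marginal_z(point, data):
    """Absolute z-score of the point on each axis taken separately."""
    out = []
    for k, col in enumerate(zip(*data)):
        m = mean(col)
        sd = math.sqrt(mean([(c - m) ** 2 for c in col]))
        out.append(abs(point[k] - m) / sd)
    return out

def mahalanobis_2d(point, data):
    """Distance from the data mean in units of the joint 2x2 covariance."""
    xs, ys = zip(*data)
    mx, my = mean(xs), mean(ys)
    sxx = mean([(x - mx) ** 2 for x in xs])
    syy = mean([(y - my) ** 2 for y in ys])
    sxy = mean([(x - mx) * (y - my) for x, y in data])
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

print(marginal_z(suspect, pts))      # unremarkable on each axis alone
print(mahalanobis_2d(suspect, pts))  # far from the joint distribution
```

The same effect gets worse in higher dimensions, where an outlier may be visible only in a direction that no axis-aligned projection shows.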
Outliers
• Parallel coordinates, SPLOMs, and other multivariate visualizations
are not reliable ways to discover high-dimensional outliers.
[Figure: six panels labeled A to F, each a set of plots with axes running from -4 to 4 and points labeled 1 to 6]
Outliers
• Popular ML algorithms are not reliable ways to identify outliers.
Scagnostics
• We characterize a scatterplot (2D point set) with nine measures.
• We base our measures on three geometric graphs.
• Convex Hull
• Alpha Shape
• Minimum Spanning Tree
Scagnostics
• Each geometric graph is a subset of the Delaunay triangulation
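Two of the three graphs are easy to compute directly. The sketch below builds the convex hull with Andrew's monotone chain and the minimum spanning tree with Prim's algorithm on the complete Euclidean graph. It is a brute-force illustration with hypothetical function names; because the MST is a subset of the Delaunay triangulation, a production version could restrict Prim's search to Delaunay edges instead of all pairs. The alpha shape is omitted.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def minimum_spanning_tree(points):
    """Prim's algorithm over all point pairs; returns (i, j, length) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

square = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(convex_hull(square))
print(minimum_spanning_tree(square))
```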
Scagnostics
Shape
• Convex: ratio of the area of the alpha shape to the area of the convex hull
• Skinny: ratio of perimeter to area of the alpha shape
• Stringy: ratio of the number of degree-2 vertices in the MST to the number of vertices of degree greater than 1
• Stringy (alternative, diameter-based form): ratio of the diameter of the MST to the length of the MST. Similar to Skinny. The diameter of a graph is the longest shortest path between a pair of its vertices.
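The degree-based Stringy measure needs only the MST. The following is a minimal sketch, assuming at least three distinct points and a brute-force Prim MST; `mst_edges` and `stringy` are hypothetical names. Convex and Skinny are not shown because they require an alpha shape.

```python
import math

def mst_edges(points):
    """Prim's algorithm over all point pairs; returns (i, j, length) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def stringy(points):
    """Degree-2 MST vertices divided by the vertices of degree greater than 1."""
    degree = [0] * len(points)
    for i, j, _ in mst_edges(points):
        degree[i] += 1
        degree[j] += 1
    deg1 = sum(1 for d in degree if d == 1)
    deg2 = sum(1 for d in degree if d == 2)
    return deg2 / (len(points) - deg1)

line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
star = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(stringy(line), stringy(star))
```

A point set strung along a path scores high because its MST has no branches, while a star-shaped MST scores zero.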
Scagnostics
Density
• Skewed: ratio (Q90 - Q50) / (Q90 - Q10), where the quantiles are taken from the MST edge lengths
• Clumpy: 1 minus the ratio of the longest edge in the largest runt (blue) to the length of the runt-cutting edge (red)
• The Hartigan RUNT statistic for a node of a hierarchical clustering tree is the smaller of the number of leaves owned by each of its two children. We derive this for each vertex in the MST using an edge-cutting algorithm.
[Figure: MST with the largest runt and the longest edge in the runt highlighted]
• Outlying: proportion of total MST length due to edges adjacent to outliers
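Skewed and Clumpy can be sketched from the MST edge list. This is an illustration under several assumptions, not the reference implementation: quantiles use linear interpolation, the runt of a cut edge is taken to be the smaller of the two subtrees left after removing it, runts with no internal edges are skipped, and Clumpy is the maximum of the per-edge values. All function names are hypothetical. Outlying is omitted because the slide does not say how outliers are identified.

```python
import math

def mst_edges(points):
    """Prim's algorithm over all point pairs; returns (i, j, length) edges."""
    n = len(points)
    in_tree, best, parent = [False] * n, [math.inf] * n, [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            d = math.dist(points[u], points[v])
            if not in_tree[v] and d < best[v]:
                best[v], parent[v] = d, u
    return edges

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def skewed(points):
    lengths = [w for _, _, w in mst_edges(points)]
    q10, q50, q90 = (quantile(lengths, q) for q in (0.1, 0.5, 0.9))
    return (q90 - q50) / (q90 - q10) if q90 > q10 else 0.0

def component(adj, start):
    """Vertices reachable from start in the adjacency list adj."""
    seen, stack = {start}, [start]
    while stack:
        for v, _ in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def clumpy(points):
    edges = mst_edges(points)
    best = 0.0
    for cut, (a, b, cut_len) in enumerate(edges):
        adj = [[] for _ in points]          # the tree with the cut edge removed
        for k, (i, j, w) in enumerate(edges):
            if k != cut:
                adj[i].append((j, w))
                adj[j].append((i, w))
        runt = min(component(adj, a), component(adj, b), key=len)
        longest = max((w for i in runt for _, w in adj[i]), default=0.0)
        if longest > 0:
            best = max(best, 1 - longest / cut_len)
    return best

two_clusters = [(0, 0), (1, 0), (0, 1), (10, 0), (11, 0), (10, 1)]
print(skewed(two_clusters), clumpy(two_clusters))
```

In the two-cluster example the bridge edge is long relative to the edges inside either cluster, so cutting it yields a Clumpy value close to 1.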
Scagnostics
Density
• Sparse: 90th percentile of the distribution of edge lengths in the MST
• Striated: proportion of all vertices in the MST that are degree-2 and have a cosine between adjacent edges less than -0.75
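Sparse and Striated follow directly from the definitions above. The sketch assumes distinct points (so no MST edge has zero length), a brute-force Prim MST, and a linear-interpolation quantile; the function names are hypothetical.

```python
import math

def mst_edges(points):
    """Prim's algorithm over all point pairs; returns (i, j, length) edges."""
    n = len(points)
    in_tree, best, parent = [False] * n, [math.inf] * n, [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            d = math.dist(points[u], points[v])
            if not in_tree[v] and d < best[v]:
                best[v], parent[v] = d, u
    return edges

def sparse(points):
    """90th percentile of the MST edge lengths (linear interpolation)."""
    xs = sorted(w for _, _, w in mst_edges(points))
    pos = 0.9 * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def striated(points):
    """Share of all vertices that have degree 2 and nearly opposite MST edges."""
    nbrs = [[] for _ in points]
    for i, j, _ in mst_edges(points):
        nbrs[i].append(j)
        nbrs[j].append(i)
    count = 0
    for v, nb in enumerate(nbrs):
        if len(nb) == 2:
            ax, ay = points[nb[0]][0] - points[v][0], points[nb[0]][1] - points[v][1]
            bx, by = points[nb[1]][0] - points[v][0], points[nb[1]][1] - points[v][1]
            cos = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
            if cos < -0.75:
                count += 1
    return count / len(points)

line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
print(sparse(line), striated(line))
```

A cosine below -0.75 means the two edges meeting at a vertex point in nearly opposite directions, so the measure rewards points that line up along smooth strands.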
Scagnostics
AutoVis
Graham Wills and Leland Wilkinson. 2010. AutoVis: automatic visualization.
Information Visualization 9, 1 (March 2010), 47-69.
H2O AutoViz
Future Plans
1. Add brushing to graphics
2. Create case-weight vector for DAI (0 = exclude)
3. Suggest additional features to pass to DAI
4. Animate visualizations
5. Add natural language explanations to graphics.
Thank You!

More Related Content

What's hot

Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
Venkata Reddy Konasani
 
Ppt shuai
Ppt shuaiPpt shuai
Ppt shuai
Xiang Zhang
 
Ml7 bagging
Ml7 baggingMl7 bagging
Ml7 bagging
ankit_ppt
 
Ot regularization and_gradient_descent
Ot regularization and_gradient_descentOt regularization and_gradient_descent
Ot regularization and_gradient_descent
ankit_ppt
 
Intro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data ScientistsIntro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data Scientists
Parinaz Ameri
 
Linear Regression, Machine learning term
Linear Regression, Machine learning termLinear Regression, Machine learning term
Linear Regression, Machine learning term
S Rulez
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlp
ankit_ppt
 
Meetup_Consumer_Credit_Default_Vers_2_All
Meetup_Consumer_Credit_Default_Vers_2_AllMeetup_Consumer_Credit_Default_Vers_2_All
Meetup_Consumer_Credit_Default_Vers_2_AllBernard Ong
 
Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data萍華 楊
 
Feature Reduction Techniques
Feature Reduction TechniquesFeature Reduction Techniques
Feature Reduction Techniques
Vishal Patel
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Girish Khanzode
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
Sri Ambati
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
HJ van Veen
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
zekeLabs Technologies
 
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelH2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
Sri Ambati
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
HJ van Veen
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
mark_landry
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programming
Soumya Mukherjee
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
Venkata Reddy Konasani
 
Machine Learning using Support Vector Machine
Machine Learning using Support Vector MachineMachine Learning using Support Vector Machine
Machine Learning using Support Vector Machine
Mohsin Ul Haq
 

What's hot (20)

Neural Networks made easy
Neural Networks made easyNeural Networks made easy
Neural Networks made easy
 
Ppt shuai
Ppt shuaiPpt shuai
Ppt shuai
 
Ml7 bagging
Ml7 baggingMl7 bagging
Ml7 bagging
 
Ot regularization and_gradient_descent
Ot regularization and_gradient_descentOt regularization and_gradient_descent
Ot regularization and_gradient_descent
 
Intro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data ScientistsIntro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data Scientists
 
Linear Regression, Machine learning term
Linear Regression, Machine learning termLinear Regression, Machine learning term
Linear Regression, Machine learning term
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlp
 
Meetup_Consumer_Credit_Default_Vers_2_All
Meetup_Consumer_Credit_Default_Vers_2_AllMeetup_Consumer_Credit_Default_Vers_2_All
Meetup_Consumer_Credit_Default_Vers_2_All
 
Learning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification DataLearning On The Border:Active Learning in Imbalanced classification Data
Learning On The Border:Active Learning in Imbalanced classification Data
 
Feature Reduction Techniques
Feature Reduction TechniquesFeature Reduction Techniques
Feature Reduction Techniques
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013Jan vitek distributedrandomforest_5-2-2013
Jan vitek distributedrandomforest_5-2-2013
 
Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018Hacking Predictive Modeling - RoadSec 2018
Hacking Predictive Modeling - RoadSec 2018
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno CandelH2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
H2O World - Top 10 Deep Learning Tips & Tricks - Arno Candel
 
Kaggle presentation
Kaggle presentationKaggle presentation
Kaggle presentation
 
GBM package in r
GBM package in rGBM package in r
GBM package in r
 
Machine learning and linear regression programming
Machine learning and linear regression programmingMachine learning and linear regression programming
Machine learning and linear regression programming
 
Data exploration validation and sanitization
Data exploration validation and sanitizationData exploration validation and sanitization
Data exploration validation and sanitization
 
Machine Learning using Support Vector Machine
Machine Learning using Support Vector MachineMachine Learning using Support Vector Machine
Machine Learning using Support Vector Machine
 

Similar to Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai

Automatic Visualization
Automatic VisualizationAutomatic Visualization
Automatic Visualization
Sri Ambati
 
ODSC India 2018: Topological space creation & Clustering at BigData scale
ODSC India 2018: Topological space creation & Clustering at BigData scaleODSC India 2018: Topological space creation & Clustering at BigData scale
ODSC India 2018: Topological space creation & Clustering at BigData scale
Kuldeep Jiwani
 
Introduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesIntroduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resources
Adi Handarbeni
 
NS-CUK Seminar:H.B.Kim, Review on "Asymmetric transitivity preserving graph ...
NS-CUK Seminar:H.B.Kim,  Review on "Asymmetric transitivity preserving graph ...NS-CUK Seminar:H.B.Kim,  Review on "Asymmetric transitivity preserving graph ...
NS-CUK Seminar:H.B.Kim, Review on "Asymmetric transitivity preserving graph ...
ssuser4b1f48
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysis
kompellark
 
Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering
Yan Xu
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
Ted Dunning
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNS
LiemNguyenDuy
 
09 placement
09 placement09 placement
09 placement
yogiramesh89
 
Minicourse on Network Science
Minicourse on Network ScienceMinicourse on Network Science
Minicourse on Network Science
Pavel Loskot
 
Module-5-1_230523_171754 (1).pdf
Module-5-1_230523_171754 (1).pdfModule-5-1_230523_171754 (1).pdf
Module-5-1_230523_171754 (1).pdf
vikasmittal92
 
unit 4 nearest neighbor.ppt
unit 4 nearest neighbor.pptunit 4 nearest neighbor.ppt
unit 4 nearest neighbor.ppt
PRANAVKUMAR699137
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
Albert Y. C. Chen
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
Ted Dunning
 
Transition-based Dependency Parsing with Selectional Branching
Transition-based Dependency Parsing with Selectional BranchingTransition-based Dependency Parsing with Selectional Branching
Transition-based Dependency Parsing with Selectional Branching
Jinho Choi
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
hktripathy
 
image segmentation image segmentation.pptx
image segmentation image segmentation.pptximage segmentation image segmentation.pptx
image segmentation image segmentation.pptx
NaveenKumar5162
 
Clustering of graphs and search of assemblages
Clustering of graphs and search of assemblagesClustering of graphs and search of assemblages
Clustering of graphs and search of assemblages
Data-Centric_Alliance
 
Cluster Validation
Cluster ValidationCluster Validation
Cluster Validation
Udaya Arangala
 
1516 contouring
1516 contouring1516 contouring
1516 contouring
Dr Fereidoun Dejahang
 

Similar to Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai (20)

Automatic Visualization
Automatic VisualizationAutomatic Visualization
Automatic Visualization
 
ODSC India 2018: Topological space creation & Clustering at BigData scale
ODSC India 2018: Topological space creation & Clustering at BigData scaleODSC India 2018: Topological space creation & Clustering at BigData scale
ODSC India 2018: Topological space creation & Clustering at BigData scale
 
Introduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesIntroduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resources
 
NS-CUK Seminar:H.B.Kim, Review on "Asymmetric transitivity preserving graph ...
NS-CUK Seminar:H.B.Kim,  Review on "Asymmetric transitivity preserving graph ...NS-CUK Seminar:H.B.Kim,  Review on "Asymmetric transitivity preserving graph ...
NS-CUK Seminar:H.B.Kim, Review on "Asymmetric transitivity preserving graph ...
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysis
 
Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNS
 
09 placement
09 placement09 placement
09 placement
 
Minicourse on Network Science
Minicourse on Network ScienceMinicourse on Network Science
Minicourse on Network Science
 
Module-5-1_230523_171754 (1).pdf
Module-5-1_230523_171754 (1).pdfModule-5-1_230523_171754 (1).pdf
Module-5-1_230523_171754 (1).pdf
 
unit 4 nearest neighbor.ppt
unit 4 nearest neighbor.pptunit 4 nearest neighbor.ppt
unit 4 nearest neighbor.ppt
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Transition-based Dependency Parsing with Selectional Branching
Transition-based Dependency Parsing with Selectional BranchingTransition-based Dependency Parsing with Selectional Branching
Transition-based Dependency Parsing with Selectional Branching
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
image segmentation image segmentation.pptx
image segmentation image segmentation.pptximage segmentation image segmentation.pptx
image segmentation image segmentation.pptx
 
Clustering of graphs and search of assemblages
Clustering of graphs and search of assemblagesClustering of graphs and search of assemblages
Clustering of graphs and search of assemblages
 
Cluster Validation
Cluster ValidationCluster Validation
Cluster Validation
 
1516 contouring
1516 contouring1516 contouring
1516 contouring
 

More from Sri Ambati

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
Sri Ambati
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
Sri Ambati
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
Sri Ambati
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
Sri Ambati
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
Sri Ambati
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Sri Ambati
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
Sri Ambati
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
Sri Ambati
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
Sri Ambati
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
Sri Ambati
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
Sri Ambati
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Sri Ambati
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Sri Ambati
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
Sri Ambati
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
Sri Ambati
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
Sri Ambati
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
Sri Ambati
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
Sri Ambati
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
Sri Ambati
 

More from Sri Ambati (20)

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 

Recently uploaded

The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Automatic Visualization - Leland Wilkinson, Chief Scientist, H2O.ai

  • 2. Visualizing Big Data
      • Complexity: Many functions are polynomial or exponential
      • Curse of Dimensionality: distances tend toward a constant as the number of dimensions p grows
      • Chokepoint: Cannot send big data over the wire
      • Real Estate: Cannot plot big data on the client
      • Cheesy solutions in 2D
          • Pixelate (too complex for higher dimensions)
          • Project (usually violates triangle inequality)
          • Image maps (OK for popups and simple links, not for EDA)
      • Viable solutions
          • Aggregate (big n) to a few thousand rows
          • Project (big p) to a few dozen columns
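
The curse-of-dimensionality point on slide 2 can be illustrated numerically. The following is a minimal sketch, not taken from the slides: it assumes points drawn uniformly from the unit cube, and the helper name `relative_contrast` and the sample sizes are hypothetical choices made for illustration. It measures how much the farthest and nearest distances from a query point differ, relative to the nearest one.

```python
import math
import random


def relative_contrast(n_points, p, seed=0):
    """Return (max - min) / min over the distances from one random query
    point to n_points random points, all drawn uniformly from [0, 1]^p."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(p)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(p)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast shrinks as p grows: all points look roughly equally far away.
    for p in (2, 10, 100, 1000):
        print(p, round(relative_contrast(200, p), 3))
```

When the contrast is close to zero, nearest-neighbor methods lose discriminating power, which is one reason the slide recommends projecting big p down to a few dozen columns.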
  • 3. Big Data set cover (core sets)
  • 8. Outliers
      • An anomaly is an observation inconsistent with a set of beliefs.
      • The anomaly depends on these beliefs.
      • An outlier is an observation inconsistent with a set of points.
      • The points are presumed generated by a probabilistic process in a vector space.
      • All outliers are anomalies, but not all anomalies are outliers.
      • Some anomalies are logical or mathematical.
      • Outliers are probabilistic.
      • Outlier detection has more than a 200-year history.
      • The goal was to reduce bias in models.
      • The goal today is to learn interesting stuff from examining outliers.
      • Statisticians no longer delete outliers. They use robust methods.
  • 9. Outliers
      • Barnett & Lewis (1994). Outliers in Statistical Data.
      • Rousseeuw & Leroy (1987). Robust Regression & Outlier Detection.
      • Hartigan (1975). Clustering Algorithms.
      • "Beauty is truth, truth beauty,—that is all / Ye know on earth, and all ye need to know."
  • 10. Outliers
      • Univariate outliers
          • Distance from Center Rule
          • Gaps Rule
  • 11. Outliers
      • Multivariate outliers
          • Distance from Center Rule
          • Gaps Rule
  • 12. Outliers
      1. Map categorical variables to continuous values (SVD).
      2. If p large, use random projections to reduce dimensionality.
      3. Normalize columns on [0, 1].
      4. If n large, aggregate.
          • If p = 2, you could use gridding or hex binning.
          • But general solution is based on Hartigan’s Leader algorithm.
      5. Compute nearest neighbor distances between points.
      6. Fit exponential distribution to largest distances.
      7. Reject points in upper tail of this distribution.
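
Steps 3 and 5 through 7 of the procedure on slide 12 can be sketched in a few lines. This is an illustrative simplification, not the author's implementation: the function names, the `tail_fraction` parameter (which distances count as "largest"), the `alpha` level, and the mean-based exponential fit are all assumptions. Steps 1, 2, and 4 (categorical mapping, random projection, aggregation) are omitted, and the input is assumed to hold at least two numeric rows.

```python
import math


def normalize_columns(rows):
    """Step 3: rescale every column to [0, 1] (constant columns map to 0)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def nearest_neighbor_distances(rows):
    """Step 5: distance from each point to its closest other point (brute force)."""
    return [
        min(math.dist(a, b) for j, b in enumerate(rows) if j != i)
        for i, a in enumerate(rows)
    ]


def flag_outliers(rows, tail_fraction=0.5, alpha=0.05):
    """Steps 6-7. tail_fraction and alpha are assumed values, not from the slides."""
    nn = nearest_neighbor_distances(normalize_columns(rows))
    ordered = sorted(nn)
    k = min(len(ordered) - 1, int(len(ordered) * (1 - tail_fraction)))
    start = ordered[k]  # lower edge of the "largest distances"
    excess = [d - start for d in nn if d > start]
    if not excess:
        return [False] * len(nn)
    # Step 6: the maximum-likelihood fit of an exponential is its sample mean.
    scale = sum(excess) / len(excess)
    # Step 7: reject points whose excess lies in the upper alpha tail of the fit.
    cutoff = start - scale * math.log(alpha)
    return [d > cutoff for d in nn]
```

For example, a cloud of Gaussian points plus one far-away row such as `[10.0, 10.0]` yields a flag list whose entry for that row is `True`. Because the brute-force neighbor search is quadratic in n, the aggregation in step 4 matters for large data.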
  • 13. Outliers
      • Low-dimensional projections are not reliable ways to discover high-dimensional outliers.
  • 14. Outliers
      • Parallel coordinates, SPLOMs, and other multivariate visualizations are not reliable ways to discover high-dimensional outliers.
      • [Figure: panels A through F]
  • 15. Outliers
      • Popular ML algorithms are not reliable ways to identify outliers.
  • 16. Scagnostics
      • We characterize a scatterplot (2D point set) with nine measures.
      • We base our measures on three geometric graphs.
          • Convex Hull
          • Alpha Shape
          • Minimum Spanning Tree
  • 17. Scagnostics
      • Each geometric graph is a subset of the Delaunay triangulation.
  • 18. Scagnostics: Shape
      • Convex: ratio of the area of the alpha shape to the area of the convex hull.
      • Skinny: ratio of perimeter to area of the alpha shape.
      • Stringy: ratio of 2-degree vertices in the MST to the number of vertices with degree greater than 1.
      • The slide also shows an alternative Stringy definition: ratio of the diameter of the MST to the length of the MST (similar to Skinny). The diameter of a graph is the longest shortest path between a pair of its vertices.
  • 19. Scagnostics: Density
      • Skewed: the ratio (Q90 - Q50) / (Q90 - Q10), where the quantiles are taken from the MST edge lengths.
      • Clumpy: 1 minus the ratio of the longest edge in the largest runt (blue) to the length of the runt-cutting edge (red). The Hartigan RUNT statistic for a node of a hierarchical clustering tree is the smaller of the number of leaves owned by each of its two children. We derive this for each vertex in the MST using an edge-cutting algorithm.
      • Outlying: proportion of total MST length due to edges adjacent to outliers.
  • 20. Scagnostics: Density
      • Sparse: 90th percentile of the distribution of edge lengths in the MST.
      • Striated: proportion of all vertices in the MST that are degree-2 and have a cosine between adjacent edges less than -.75.
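
Several of the measures on slides 18 through 20 depend only on the minimum spanning tree, so they can be sketched together. The code below is a rough illustration that follows the one-line slide definitions of Stringy (degree-based version), Skewed, Sparse, and Striated. The function names, the linear-interpolation quantile, and the zero fallback for a degenerate Skewed denominator are assumptions, and any binning, outlier removal, or rescaling a full implementation may apply is left out. Points are assumed to be distinct 2D coordinates, at least two of them.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns a list of (i, j, length) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known distance from each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def quantile(values, q):
    """Linear-interpolation quantile (an assumed convention)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def mst_measures(points):
    edges = minimum_spanning_tree(points)
    lengths = [length for _, _, length in edges]
    neighbors = {i: [] for i in range(len(points))}
    for i, j, _ in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    q10, q50, q90 = (quantile(lengths, q) for q in (0.1, 0.5, 0.9))
    skewed = (q90 - q50) / (q90 - q10) if q90 > q10 else 0.0

    degree2 = [v for v, nb in neighbors.items() if len(nb) == 2]
    inner = [v for v, nb in neighbors.items() if len(nb) > 1]
    stringy = len(degree2) / len(inner) if inner else 0.0

    def cosine(v):
        # Cosine of the angle between the two MST edges meeting at vertex v.
        a, b = neighbors[v]
        ax, ay = points[a][0] - points[v][0], points[a][1] - points[v][1]
        bx, by = points[b][0] - points[v][0], points[b][1] - points[v][1]
        return (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))

    striated = sum(1 for v in degree2 if cosine(v) < -0.75) / len(points)
    return {"skewed": skewed, "sparse": q90, "stringy": stringy, "striated": striated}
```

Calling `mst_measures` on evenly spaced collinear points gives a path-shaped tree, so every interior vertex is degree-2 with its two edges pointing in opposite directions. Convex, Skinny, Clumpy, and Outlying are not covered here because they need the alpha shape, runt computation, or an outlier rule that the slides do not spell out.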
  • 24. AutoVis
      • Graham Wills and Leland Wilkinson. 2010. AutoVis: automatic visualization. Information Visualization 9, 1 (March 2010), 47-69.
  • 26. Future Plans
      1. Add brushing to graphics.
      2. Create case-weight vector for DAI (0 = exclude).
      3. Suggest additional features to pass to DAI.
      4. Animate visualizations.
      5. Add natural language explanations to graphics.