Presented at #H2OWorld 2017 in Mountain View, CA. Enjoy the video: https://youtu.be/bas3-Ue2qxc. Learn more about H2O.ai: https://www.h2o.ai/. Follow @h2oai: https://twitter.com/h2oai.

Abstract: Auto Visualization involves the problem of producing meaningful graphics when presented with data. Relevant to this task are the strategies that expert statisticians and data analysts use to gain insights through visualization, as well as the portfolio of diagnostic methods devised by statisticians in the last 50 years. While some researchers and companies may claim to do automatic visualization, the problem is much deeper than simply producing collections of histograms, bar charts, and scatterplots. The deeper problem is deciding which subset of these graphics is critical to recognizing anomalies, outliers, unusual distributions, missing values, and so on. This talk will cover aspects of this deeper problem and will introduce H2O software that implements some of these algorithms.

Leland Wilkinson is Chief Scientist at H2O.ai and Adjunct Professor of Computer Science at the University of Illinois Chicago. He received an A.B. degree from Harvard in 1966, an S.T.B. degree from Harvard Divinity School in 1969, and a Ph.D. from Yale in 1975. Wilkinson wrote the SYSTAT statistical package and founded SYSTAT Inc. in 1984. After the company grew to 50 employees, he sold SYSTAT to SPSS in 1994 and worked there for ten years on research and development of visualization systems. Wilkinson subsequently worked at Skytree and Tableau before joining H2O.ai.

Wilkinson is a Fellow of the American Statistical Association, an elected member of the International Statistical Institute, and a Fellow of the American Association for the Advancement of Science. He has won the best speaker award at the National Computer Graphics Association and the Youden Prize for best expository paper in the statistics journal Technometrics. He has served on the Committee on Applied and Theoretical Statistics of the National Research Council and is a member of the Boards of the National Institute of Statistical Sciences (NISS) and the Institute for Pure and Applied Mathematics (IPAM).

In addition to authoring journal articles, the original SYSTAT computer program and manuals, and patents in visualization and distributed analytic computing, Wilkinson is the author (with Grant Blank and Chris Gruber) of Desktop Data Analysis with SYSTAT. He is also the author of The Grammar of Graphics, the foundation for several commercial and open-source visualization systems (IBM RAVE, Tableau, R ggplot2, and Python Bokeh).

- 1. Automatic Visualization • Leland Wilkinson, Chief Scientist, H2O • leland@h2o.ai • www.cs.uic.edu/~wilkinson
- 2. Visualizing Big Data • Complexity: Many functions are polynomial or exponential • Curse of Dimensionality: distances tend toward constant as p → ∞ • Chokepoint: Cannot send big data over the wire • Real Estate: Cannot plot big data on the client • Cheesy solutions in 2D • Pixelate (too complex for higher dimensions) • Project (usually violates triangle inequality for ) • Image maps (OK for popups and simple links, not for EDA) • Viable solutions • Aggregate (big n) to a few thousand rows • Project (big p) to a few dozen columns
- 3. Big Data set cover (core sets)
- 4. Outliers
- 5. Outliers
- 6. Outliers
- 7. Outliers
- 8. Outliers • An anomaly is an observation inconsistent with a set of beliefs. • The anomaly depends on these beliefs. • An outlier is an observation inconsistent with a set of points. • The points are presumed generated by a probabilistic process in a vector space. • All outliers are anomalies, but not all anomalies are outliers. • Some anomalies are logical or mathematical. • Outliers are probabilistic. • Outlier detection has a history of more than 200 years. • The goal was to reduce bias in models. • The goal today is to learn interesting stuff from examining outliers. • Statisticians no longer delete outliers. They use robust methods.
- 9. Outliers • Barnett & Lewis (1994), Outliers in Statistical Data. • Rousseeuw & Leroy (1987). Robust Regression & Outlier Detection. • Hartigan (1975) Clustering Algorithms. Beauty is truth, truth beauty,—that is all Ye know on earth, and all ye need to know.
- 10. Outliers • Univariate outliers • Distance from Center Rule • Gaps Rule
- 11. Outliers • Multivariate outliers • Distance from Center Rule • Gaps Rule
- 12. Outliers 1. Map categorical variables to continuous values (SVD). 2. If p is large, use random projections to reduce dimensionality. 3. Normalize columns to [0, 1]. 4. If n is large, aggregate. • If p = 2, you could use gridding or hex binning. • But the general solution is based on Hartigan’s Leader algorithm. 5. Compute nearest neighbor distances between points. 6. Fit an exponential distribution to the largest distances. 7. Reject points in the upper tail of this distribution.
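
The steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the H2O implementation: it skips steps 1 and 2 (categorical mapping and random projection), and the function names, the `radius`, `alpha`, and `tail_fraction` parameters, and the choice to fit the exponential to the upper half of the distances are all assumptions made for the example.

```python
import math
import random


def normalize(rows):
    """Step 3: rescale every column to [0, 1] (constant columns map to 0)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(x - l) / s for x, l, s in zip(r, lo, span)] for r in rows]


def leader_aggregate(z, radius):
    """Step 4, one-pass Leader algorithm: a row joins the first exemplar
    within `radius`; otherwise it becomes a new exemplar."""
    leaders, members = [], []
    for i, row in enumerate(z):
        for k, lead in enumerate(leaders):
            if math.dist(row, lead) <= radius:
                members[k].append(i)
                break
        else:
            leaders.append(row)
            members.append([i])
    return leaders, members


def flag_outliers(rows, radius=0.0, alpha=0.01, tail_fraction=0.5):
    """Return one boolean per input row (True = flagged as outlier)."""
    z = normalize(rows)
    leaders, members = leader_aggregate(z, radius)
    flags = [False] * len(rows)
    if len(leaders) < 2:
        return flags
    # Step 5: brute-force nearest-neighbor distance of each exemplar, O(m^2).
    nn = [min(math.dist(a, b) for k, b in enumerate(leaders) if k != i)
          for i, a in enumerate(leaders)]
    # Step 6: fit an exponential to the largest distances. The maximum
    # likelihood scale is the mean excess over the cut point (assumed choice).
    ordered = sorted(nn)
    cut = ordered[min(len(nn) - 1, int(len(nn) * (1 - tail_fraction)))]
    excess = [d - cut for d in nn if d > cut]
    if not excess:
        return flags
    scale = sum(excess) / len(excess)
    # Step 7: reject exemplars (and their members) in the far upper tail.
    for d, group in zip(nn, members):
        if d > cut and math.exp(-(d - cut) / scale) < alpha:
            for i in group:
                flags[i] = True
    return flags


# Usage: a Gaussian cloud with one planted point far from the rest.
random.seed(0)
cloud = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
flags = flag_outliers(cloud + [[10.0, 10.0]])
print(flags[-1], sum(flags))
```

With `radius=0.0` only exact duplicates are merged, so the aggregation step is effectively off; a larger radius trades resolution for speed when n is large. Because the planted point is itself part of the fitted tail, it inflates the estimated scale somewhat, which a more careful implementation would want to guard against.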
- 13. Outliers • Low-dimensional projections are not reliable ways to discover high-dimensional outliers.
- 14. Outliers • Parallel coordinates, SPLOMs, and other multivariate visualizations are not reliable ways to discover high-dimensional outliers. (Figure: panels A to F, each plotting points labeled 1 to 6.)
- 15. Outliers • Popular ML algorithms are not reliable ways to identify outliers.
- 16. Scagnostics • We characterize a scatterplot (2D point set) with nine measures. • We base our measures on three geometric graphs. • Convex Hull • Alpha Shape • Minimum Spanning Tree
- 17. Scagnostics • Each geometric graph is a subset of the Delaunay triangulation
- 18. Scagnostics: Shape • Convex: ratio of the area of the alpha shape to the area of the convex hull. • Skinny: ratio of perimeter to area of the alpha shape. • Stringy: ratio of 2-degree vertices in the MST to the number of vertices of degree greater than 1. The embedded figure gives an alternative definition: ratio of the diameter of the MST to the length of the MST, similar to Skinny. The diameter of a graph is the longest shortest path between a pair of its vertices.
- 19. Scagnostics: Density • Skewed: the ratio (Q90 - Q50) / (Q90 - Q10), where the quantiles are taken from the MST edge lengths. • Clumpy: 1 minus the ratio of the longest edge in the largest runt (blue) to the length of the runt-cutting edge (red). The Hartigan RUNT statistic for a node of a hierarchical clustering tree is the smaller of the number of leaves owned by each of its two children. We derive this for each vertex in the MST using an edge-cutting algorithm. • Outlying: proportion of total MST length due to edges adjacent to outliers. (Figure labels: largest runt, longest edge in runt.)
- 20. Scagnostics: Density • Sparse: 90th percentile of the distribution of edge lengths in the MST. • Striated: proportion of all vertices in the MST that are degree-2 and have a cosine between adjacent edges less than -0.75.
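
Several of the Scagnostics measures depend only on the minimum spanning tree, so they are easy to sketch. The code below is an illustrative reading of the slide definitions of Skewed, Sparse, Stringy (the degree-based version), and Striated, not the published Scagnostics implementation: it does no binning, rescaling, or outlier removal, it assumes distinct 2D points, and the function names are made up for this example.

```python
import math
import statistics


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns a list of (i, j, length) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known link from each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def mst_measures(points):
    """Skewed, Sparse, Stringy, and Striated as worded on the slides.
    Needs at least three distinct points."""
    edges = minimum_spanning_tree(points)
    lengths = [w for _, _, w in edges]
    deciles = statistics.quantiles(lengths, n=10, method="inclusive")
    q10, q50, q90 = deciles[0], deciles[4], deciles[8]

    neighbors = {i: [] for i in range(len(points))}
    for i, j, _ in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    degree2 = [v for v, nb in neighbors.items() if len(nb) == 2]
    interior = [v for v, nb in neighbors.items() if len(nb) > 1]

    def cosine(v):
        # Cosine of the angle between the two MST edges meeting at v.
        a, b = neighbors[v]
        ax, ay = points[a][0] - points[v][0], points[a][1] - points[v][1]
        bx, by = points[b][0] - points[v][0], points[b][1] - points[v][1]
        return (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))

    return {
        "skewed": (q90 - q50) / (q90 - q10) if q90 > q10 else 0.0,
        "sparse": q90,
        "stringy": len(degree2) / len(interior) if interior else 0.0,
        "striated": sum(1 for v in degree2 if cosine(v) < -0.75) / len(points),
    }


# Usage: five evenly spaced collinear points form a single path in the MST.
line = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
measures = mst_measures(line)
print(measures)
```

Prim's algorithm on the full distance matrix costs O(n²), which is acceptable for the few thousand rows left after aggregation. Building the MST from the Delaunay triangulation, as the slides suggest, would be the faster route for larger point sets.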
- 21. Scagnostics
- 22. Scagnostics
- 23. Scagnostics
- 24. AutoVis Graham Wills and Leland Wilkinson. 2010. AutoVis: automatic visualization. Information Visualization 9, 1 (March 2010), 47-69.
- 25. H2O AutoViz
- 26. Future Plans 1. Add brushing to graphics 2. Create case-weight vector for DAI (0 = exclude) 3. Suggest additional features to pass to DAI 4. Animate visualizations 5. Add natural language explanations to graphics.
- 27. Thank You!