Data science is a process of abstraction. In order to explain or to predict a real phenomena, the process starts with acquiring and refining the data. It then moves between the three layers of abstraction: transformations (data abstraction), visualizations (visual abstraction), and modeling (symbolic abstraction). All three layers of abstraction together build a truer (or closer) representation of the real phenomena.
Data visualization (data-vis) helps us to understand the portrait and the shape of the data. The science of data-vis for exploratory data analysis is well developed for both static graphics (scatter-plot matrices, glyph-based approaches, geometric transforms like parallel coordinates) and interactive graphics (layering, brushing and linking, projections and tours). (For more information, see Amit Kapoor’s Strata + Hadoop World Singapore talk, Visualizing Multidimensional Data.) Though visualization is used in data science to understand the shape of the data, it’s not widely used for statistical models, which are evaluated based on numerical summaries.
Amit Kapoor demonstrates extending visualization to the statistical model (model-vis), which aids in understanding the shape of the model, the impact of parameters and input data on the model, the fit of the model, and where it can be improved. Model visualization can help us to understand the shape of the model and compare it to the shape of the data. It allows us to see the fit of the model and understand where the fit can be improved. It also allows us to better understand the parameters in the model and how the model changes when the parameters change as well as how the parameters changes when the input data changes.
The science and tools for model-vis are still very underdeveloped. Amit looks at practical examples of doing model-vis in regression (linear, lasso), classification (logistic, trees, LDA), and clustering (hierarchical) problems that can help us better understand the model. This includes exploring model-vis approaches that:
Visualize the model in data space as opposed to data in model space
- Visualize the entire space of models
- Visualize the same model with varying tuning parameters
- Visualize the same model with different input datasets
- Visualize the process of model fitting as opposed to final result
Integrating these approaches for model-vis as a part of model evaluation strengthens a data scientist’s understanding of the model and leads to better model building, complementing data-vis for fitting better models as well as communicating the insight from the data science process.
3. The Blind Men & the Elephant
___
“And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong.”
— John Godfrey Saxe
12. ML Pipeline++
___
Data Transformation — — — — Model Building
(Tidy Data) (Tidy Model)
! !
! !
! !
Visual Exploration Model Exploration
(Data-Vis) (Model-Vis)
13. Model-Vis Key Concept
___
Use visualisation to aid the
transition of implicit knowledge
in the data and your head to
explicit knowledge in the model.
14. Model-Vis Approach
___
[0] Visualise the data space
[1] Visualise the predictions in the data space
[2] Visualise the errors in model fitting
[3] Visualise with different model parameters
[4] Visualise with different input datasets
[5] Visualise the entire model space
[6] Visualise the entire feature space
[7] Visualise the many models together
17. Regression: Small
___
Cars dataset - price vs kmpl
Scraped from comparison website
Refined & tidied up
Base version for petrol cars
Price < 1,000K, n = 42
27. Model-Vis Approach
___
[0] Visualise the data space
[1] Visualise the predictions in the data space
[2] Visualise the errors in model fitting
[3] Visualise with different model parameters
[4] Visualise with different input datasets
[5] Visualise the entire model space
[6] Visualise the entire feature space
[7] Visualise the many models together
28. Model-Vis & ML Approach
___
[0] DATA VIS: the data space
[1] PREDICTION: the predictions in the data space
[2] VALIDATION: the errors in model fitting
[3] TUNING: with different model parameters
[4] BOOTSTRAP: with different input datasets
[5] ENSEMBLE: the entire model space
[6] FEATURE ENGG: the entire feature space
[7] N-MODELS: the many models together
58. Model-Vis
___
Similar challenges to Data-Vis
More an Art, than a Science
Essential in ML Model Pipeline
Both to Explain or to Predict
Scope for easier tooling