Data Visualisation for Data Science

Data Visualization for
Data Science
Principles in action
Christophe Bontemps
Toulouse School of Economics, INRA

MY JOB

WHY AM I HERE ?
From Huff (1993)

BEFORE WE START
Let’s do a simple exercise (from Buja et al. (2009))

THE “VISUAL PERCEPTION” OF A GRAPHIC
(source : Buja et al. (2009))

“VISUAL PERCEPTION” AS A STATISTICAL TEST

“The human eye acts as a broad feature detector and general
statistical test”. Buja et al. (2009)

Test : H0 : {There is "nothing" } = {No relation}
H1 : { There is "something" } = {There is some relation
(Correlation, linearity, heterogeneity, groups..) }
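
Reading a plot as a test can be made concrete with the "lineup" idea discussed by Buja et al. (2009): the real data are hidden among data sets generated under H0, and the viewer tries to pick them out. The sketch below is only an illustration, not code from the talk. The function name `make_lineup`, the 20 panels, the toy data and the use of a permutation of y to simulate "no relation" are all assumptions made for the example.

```python
import random

def make_lineup(x, y, n_panels=20, seed=0):
    """Hide the real (x, y) data among permuted copies that satisfy H0.

    Returns (panels, true_position); each panel is a list of (x, y) pairs.
    Shuffling y breaks any x-y relation but keeps both marginal distributions.
    """
    rng = random.Random(seed)
    true_position = rng.randrange(n_panels)
    panels = []
    for k in range(n_panels):
        if k == true_position:
            panels.append(list(zip(x, y)))          # the real data
        else:
            shuffled = list(y)
            rng.shuffle(shuffled)                   # a draw under "no relation"
            panels.append(list(zip(x, shuffled)))
    return panels, true_position

# Hypothetical data with an obvious linear relation.
x = list(range(30))
y = [2.0 * xi + 1.0 for xi in x]
panels, true_position = make_lineup(x, y)
print(len(panels), "panels; the real data are in panel", true_position)
```

If each panel is drawn as a scatter plot and a viewer singles out the real one among 20, the chance of doing so by luck under H0 is 1/20 = 0.05.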

“VISUAL PERCEPTION” AS A COMPARISON

What do you see here ?

Difficult to see the maximum/minimum of each curve...
Idea shared by Gelman (2004) and Munzner (2014)

WHAT IS DATA VISUALIZATION ?
It is a representation, a function of the data

A statistic too, is a function or a summary of the data

So, it is a sort of statistic

It can be descriptive or inferential

Two or multi-dimensional

Static or dynamic
Informative or not
Misleading or accurately representing the data
Beautiful or ugly....

For Tukey (1977) “The greatest value of a picture is when it
forces us to notice what we never expected to see”

Cleveland (1994) says that “graphical methods and
techniques are powerful tools for showing the structure of
data. The material is relevant for data analysis, when the
analyst wants to study data, and for data communication,
when the analyst wants to communicate data to others”

Bertin (2005) (translated in Bertin (1983)) defines it as a
"visual language" and, as such, with a semiology, i.e. with
a theory of the functions of signs and symbols.
Tufte (2001) “ Graphics are instruments for reasoning
about quantitative information. Often the most effective
way to describe , explore and summarize a set of numbers
- even a large set - is to look at pictures of those numbers”

SO WHAT ?
Data visualisation serves different purposes :
Exploratory data analysis
Statistical questioning of data patterns
Visual display of information for communication
Tool for interacting with data

2 TYPES OF GRAPHICS :
THOSE UNDERSTOOD IMMEDIATELY
FIGURE – Seen on HK-TV

FIGURE – Where do people run in Paris (N. Yau)
source :
http://flowingdata.com/2014/02/05/where-people-run/

FIGURE – Climate forecast uncertainty (S. Planton)

... AND THOSE NOT UNDERSTOOD IMMEDIATELY :
FIGURE – (Dynamic) Parallel Coordinates Plot comparing 5 indicators
for 3 countries (Sweden, Nigeria and Germany).
source :
http://ncva.itn.liu.se/education-geovisual-analytics/parallel-c

FIGURE – Pagerank Algorithm Reveals World’s All-Time Top Soccer
Team (MIT Review, March 2015)

FIGURE – How people spend their days (NYT).

“GOOD” OR “BAD” GRAPHICS ?
“There are no “good” nor “bad” graphics (...), there are graphics
answering legitimate questions and graphics that do not answer
question at all ”
Bertin (1981)

FAMOUS EXAMPLES OF “GOOD” VISUALIZATIONS
FIGURE – Charles Minard’s (1869) chart showing the number of men
in Napoleon’s 1812 Russian campaign army, their movements, as
well as the temperature they encountered on the return path.

FIGURE – London Cholera Map - John Snow (1854)

FIGURE – War Mortality - Florence Nightingale (1855) found that
deaths from zymotic diseases (blue) exceeded deaths from wounds and injuries.

Same data with “modern” visualisation tools. Gelman and
Unwin (2011)
FIGURE – War Mortality - Florence Nightingale (1855) redrawn by
Gelman and Unwin (2011).

FIGURE – Visualizing 5 dimensions : Gapminder (Hans Rosling)

SO WHAT ARE THE RULES ?
Can you name some rules for a good (resp. bad) graphic ?
Your turn !
Axis and scale (starting at zero !) ?
Context ?
No multiple scales ?
Colors ?

YOUR TURN : WHAT’S WRONG WITH THIS GRAPHIC ?

BANANA SALES HAVE INCREASED !
FIGURE – from A. Dix example of interactive bar chart

WHAT’S WRONG WITH THIS GRAPHIC ?
FIGURE – Government spending "Skyrocketing". Tufte (2001) from
Playfair (1786).

SCALES ARE MISLEADING !
FIGURE – Government spending "Skyrocketing" (revisited). Tufte
(2001) from Playfair (1786).

WHAT’S WRONG WITH THIS GRAPHIC ? (HARDER)
FIGURE – Major Cause of Disability - 1975-2010 (J. Schwabish, 2014).

Do you remember a damn thing of this graph ?

(SMALL) MULTIPLE GRAPHS ARE OFTEN BETTER
FIGURE – Major Cause of Disability - 1975-2010 (J. Schwabish).
Cf. "brushing" (e.g. for parallel coordinates plots)

KEEP ALL YOUR AUDIENCE
Normal →
Color-blind →

WHICH MEANS THAT FOR 5 % OF MEN :
See also the ggplot option + scale_colour_colorblind() (from the ggthemes package)

DATA VISUALISATION IS USED FOR TWO MAIN
PURPOSES
Data exploration
Graphs as visual tests, comparisons (short time to build
and to read)
Data representation
Summaries, storytelling (long time to build, short time to
read)
The problem is that :
“Communicating implies simplification,
data exploration implies exhaustivity”

TABLES VS GRAPHICS ?
Several papers have discussed the issue : Gelman et al. (2002),
Gelman (2011) and Friendly and Kwan (2012).
Here, descriptive statistics of continuous variables.

TABLES VS GRAPHICS ?
Graph version of the table. From Gelman (2011)

GRAPHICS reveal DATA : ANSCOMBE (1973) QUARTET
We use here four pairs of variables : (X1, Y1), (X2, Y2),
(X3, Y3) and (X4, Y4). All four data sets have the same
descriptive statistics.
Xs   Mean   Std. Dev.   Ys   Mean   Std. Dev.   corr(Xi, Yi)   N
X1   9      3.32        Y1   7.5    2.03        0.8164         11
X2   9      3.32        Y2   7.5    2.03        0.8162         11
X3   9      3.32        Y3   7.5    2.03        0.8163         11
X4   9      3.32        Y4   7.5    2.03        0.8165         11
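
The table can be checked with a few lines of code. The sketch below is an illustration added here, not the author's code: it uses the values published by Anscombe (1973), as commonly reproduced in statistical software, and the helper names `pearson` and `summarize` are made up for the example.

```python
from statistics import mean, stdev

# Anscombe (1973) quartet: data set number -> (x values, y values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def summarize(x, y):
    """Descriptive statistics matching the columns of the table."""
    return {
        "mean_x": mean(x), "sd_x": stdev(x),
        "mean_y": mean(y), "sd_y": stdev(y),
        "corr": pearson(x, y), "n": len(x),
    }

summary = {k: summarize(x, y) for k, (x, y) in ANSCOMBE.items()}
for k, s in summary.items():
    print(k, {name: round(value, 3) for name, value in s.items()})
```

The four printed rows agree to two decimals, which is exactly why the summary table alone cannot distinguish the data sets.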

ANSCOMBE (1973) QUARTET
All four data sets are described by the same linear model
(Yi = α + βXi + εi), apparently revealing the same
relationships :
                     Dependent variable :
                     Y1          Y2          Y3          Y4
Regressed on :
Xi, i=1,...,4        0.500∗∗∗    0.500∗∗∗    0.500∗∗∗    0.500∗∗∗
Constant             3.000∗∗     3.001∗∗     3.002∗∗     3.002∗∗
R2                   0.667       0.666       0.666       0.667
Resid Std. Error     1.237       1.237       1.236       1.236
F Statistic          17.990∗∗∗   17.966∗∗∗   17.972∗∗∗   18.003∗∗∗
Note : Data from Anscombe (1973). ∗ p < 0.1 ; ∗∗ p < 0.05 ; ∗∗∗ p < 0.01
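
The near-identical fits can be reproduced with the closed-form least-squares estimates. This is a minimal sketch under the same assumption as before (Anscombe's published values), and the function name `ols` is hypothetical. It only computes the slope, the constant and R2, not the standard errors or F statistics of the table.

```python
from statistics import mean

# Anscombe (1973) quartet: data set number -> (x values, y values).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def ols(x, y):
    """Simple linear regression of y on x with a constant.

    Returns (slope, constant, r_squared) from the usual closed-form formulas.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    constant = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)   # equals corr(x, y) squared
    return slope, constant, r_squared

fits = {k: ols(x, y) for k, (x, y) in ANSCOMBE.items()}
for k, (slope, constant, r2) in fits.items():
    print(f"Y{k}: slope={slope:.3f} constant={constant:.3f} R2={r2:.3f}")
```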

A simple scatter plot (regression overlaid) shows something
very different.
FIGURE – Regression of Yi on Xi (with constant), one scatter plot
per data set : (x1, y1), (x2, y2), (x3, y3) and (x4, y4).

NB : Plots of the residuals also show the same differences
FIGURE – Residual vs Fitted plots (residuals against fitted values),
one panel per data set.

TABLES AND MATRICES
Data with many 0/1 variables (indicators for towns)
Bertin (1981)

AND IN MANY DIMENSIONS ?

TABLES AND MATRICES
From Munzner (2014)

REGRESSION TABLES ARE GRAPHICS !
(Mod. 1) (Mod. 2)
Special Special
i_under18 -0.0692∗ -0.119∗∗∗
(-2.25) (-3.57)
log_income 0.116∗∗∗ 0.102∗∗∗
(4.31) (3.51)
i_car 0.00131 -0.112∗
(0.03) (-2.00)
b08_locenv_water 0.0624∗∗∗ 0.0583∗∗
(4.99) (4.28)
i_can 0.710∗∗∗
(23.27)
Constant -1.467∗∗∗ -0.961∗∗
(-5.38) (-3.24)
Classical "visualisation" of regressions

REGRESSION TABLES ARE GRAPHICS !
(Mod. 1) (Mod. 2)
Special Special
i_under18 -0.0692 -0.119
(-2.25) (-3.57)
log_income 0.116 0.102
(4.31) (3.51)
i_car 0.00131 -0.112
(0.03) (-2.00)
b08_locenv_water 0.0624 0.0583
(4.99) (4.28)
i_can 0.710
(23.27)
Constant -1.467 -0.961
(-5.38) (-3.24)
Stars are used as preattentive visual variables !

REGRESSION AS A GRAPHIC

GOOD GRAPHICS ?
In the excellent Handbook of Data Visualization (Chen et al.,
2007), we find some good questions :
What to Whom, How and Why ?
A graphic may be linked to three pieces of text : its caption, a
headline and an article it accompanies. Ideally, all three should
be consistent and complement each other.

GOOD GRAPHICS ?
Present or explore data ?
Different purpose, different requirements !

GOOD GRAPHICS ?
Choice of Graphical form ?
Choice depends on the type of data to be displayed (e.g.
univariate continuous data, bivariate categorical data, etc..) and
on what is to be shown.
Unique solution ?
There is not always a unique optimal choice and alternatives can
be equally good or good in different ways, emphasizing different
aspects of the same data.

EDWARD R. TUFTE’S RULES
In his seminal book, Tufte (2001) proposes some principles for
displaying quantitative information.
Data : Above all, show the data
Question : Induce the viewer to think about the substance
rather than about methodology or graphic design. Encourage the
eye to compare different pieces of data.
Data-ink ratio : Maximize the data-ink ratio. Erase all
non-data ink and all redundant information
Integrity : Avoid distorting what the data have to say
General to specific : Reveal the data at different levels of
detail (from broad picture to fine structure)
Context : Graphical display should be closely integrated with
the statistical and verbal descriptions of the data set.

PRACTICAL EXAMPLE : DATA-INK RATIO
Let’s start with a classical graph (R default boxplot)
FIGURE – Distribution of a continuous variable across 5 groups
(x-axis : Group, g1 to g5 ; y-axis : Response, 98 to 112)

ERASE ALL NON DATA INK
FIGURE – Boxplot of Response by Group (groups 1 to 5)

ERASE ALL REDUNDANT !
FIGURE – Boxplot of Response by Group (groups 1 to 5)

GOING FURTHER...
FIGURE – Boxplot of Response by Group (groups 1 to 5)

AND SHOW THE DATA...
FIGURE – Boxplot of Response by Group, with the median of each
group labelled : 101.0, 100.0, 101.0, 103.8, 109.1 (groups 1 to 5)

HAVE WE LOST SOMETHING ?
FIGURE – The original boxplot compared with the minimal version
with labelled medians (101.0, 100.0, 101.0, 103.8, 109.1)
Did you notice that group 1 and group 3 have the same median
(101.0) ? See the ggplot theme + theme_tufte() (ggthemes package)

INTEGRITY : THE LIE FACTOR
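$$\text{Lie Factor} = \frac{\text{Size of effect shown in graphic}}{\text{Size of effect in data}} \qquad (1)$$
A Lie Factor different from 1 indicates a distortion ; the further
from 1, the more substantial the distortion.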
FIGURE – Fuel economy standards. (E. Tufte - from NY Times 1978)

FIGURE – Fuel economy standards (revisited)
The "18 mpg" line measures 1.5 cm (in 1978) ; the "27.5 mpg"
line measures 13 cm (in 1985)
−→ The graphic shows a 767% increase for a 53% increase in the
data : Lie factor ≈ 14.5 ! ! !
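
The arithmetic behind this figure can be written out explicitly. The sketch below assumes, as in Tufte's own treatment of this example, that the "size of effect" is the relative change between the first and last value; the function name `lie_factor` is made up for the illustration.

```python
def lie_factor(shown_start, shown_end, data_start, data_end):
    """Relative change drawn in the graphic divided by relative change in the data."""
    effect_in_graphic = (shown_end - shown_start) / shown_start
    effect_in_data = (data_end - data_start) / data_start
    return effect_in_graphic / effect_in_data

# Line lengths in cm (1.5 and 13) against the standards in mpg (18 and 27.5).
factor = lie_factor(1.5, 13.0, 18.0, 27.5)
print(round(factor, 1))  # 14.5
```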

BERTIN’S APPROACH : A VISUAL LANGUAGE
If graphs are used to communicate, they are a form of language.
Any language has a grammar, “words” and logic. Let us study
the science that deals with signs and sign systems :
“Semiology”.
TABLE – Bertin’s deﬁnition of 8 visual variables
Position (x, y)
Size
Value
Texture
Colour
Orientation
Shape

THESE VARIABLES SERVE DIFFERENT GOALS
Visual variable syntactics, designating each visual variable as
suited or not for levels of measurement :
Equivalence, differences, order, proportions.
Variable suited for :
Position (x, y) = O ∝
Size = O ∝
Value = O ∝
Texture = O
Colour =
Orientation =
Shape ≡
≡ : Equivalence, = : Differences, O : Order, ∝ : Proportions

EXAMPLE : SHAPE IS NOT SUITABLE FOR
PROPORTIONALITY
Price of land in the East of France Bertin (1970)

EXAMPLE : SIZE IS SUITABLE FOR PROPORTIONALITY
Price of land in the East of France Bertin (1970)

A NOTE ON COLORS
“Colors” are not suited for ordering !
Try putting the following hues in order from low to high.

A NOTE ON COLORS
These colors are easy to order from low to high.
Few (2008) provides meaningful solutions for choosing palettes
of colours, for example for heatmaps.
See also the ggplot theme theme_few()
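
One simple way to obtain colours that are easy to order is to keep a single hue and vary only the lightness. The sketch below illustrates that principle; it is not Few's palette, and the function name `sequential_palette`, the hue and the lightness range are arbitrary assumptions.

```python
import colorsys

def sequential_palette(n, hue=0.6, saturation=0.7, lightest=0.9, darkest=0.25):
    """Return n hex colours of one hue, ordered from light (low) to dark (high)."""
    if n < 2:
        raise ValueError("need at least two colours")
    colours = []
    for k in range(n):
        # Lightness decreases linearly from `lightest` to `darkest`.
        lightness = lightest + (darkest - lightest) * k / (n - 1)
        r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
        colours.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colours

palette = sequential_palette(5)
print(palette)
```

Because only one perceptual dimension changes, the viewer does not need the legend to know which end of the scale is "high".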

A NOTE ON COLORS (FINAL)
Colors are sometimes a graphic puzzle Tufte (2001).
Your eyes will go back and forth from the graph to the legend...
(source : http://viz.wtf/image/135265269618)

CONJUNCTION OF COLOURS AND PROPORTIONALITY
Productivity of Airlines
(Demo with googleVis)

FLASH QUIZ :
If 100% of the US prisoners are represented by the big
square...what is the percentage for each group ?
FIGURE – Ethnic composition of prisoners in jail in 2008 in the USA.
(Le Monde 5/12/2014)

NOT SO SIMPLE...
FIGURE – Ethnic composition of prisoners in jail in 2008 in the USA.
(Le Monde 5/12/2014)
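
Part of the difficulty is geometric: when a share is encoded by the area of a square, the side grows only with the square root of the share. The sketch below assumes an area encoding (the original graphic may differ) and uses hypothetical numbers rather than the values of the Le Monde figure; the function names are made up.

```python
import math

def side_for_share(share, full_side):
    """Side of a square whose area is `share` times the area of the full square."""
    return full_side * math.sqrt(share)

def share_from_side(side, full_side):
    """Share of the full square's area covered by a square of the given side."""
    return (side / full_side) ** 2

full_side = 10.0                      # hypothetical side of the "100%" square, in cm
quarter = side_for_share(0.25, full_side)
print(quarter)                        # 5.0: half the side, but only a quarter of the area
print(share_from_side(5.0, full_side))
```

A viewer who compares sides instead of areas would read this hypothetical group as 50% rather than 25%.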

VERIFICATION
→

OR...

IT MATTERS BECAUSE MANY HIGH-DIMENSIONAL
VISUALISATIONS USE AREA...
Spinograms
A spinogram is area-proportional just like the histogram, but
allows a non-linear x-axis and thus can make all boxes of equal
height. Theus and Urbanek (2009)

MOSAIC PLOTS
Step 1 of the construction of a mosaic plot (Similar to spineplot
here). All surviving passengers are highlighted in all plots.
Theus and Urbanek (2009)

MOSAIC PLOTS
Step 2 of the construction of a mosaic plot. Conditioning on
Age. Theus and Urbanek (2009)

MOSAIC PLOTS
Step 3 of the construction of a mosaic plot. Conditioning on Age
and Gender. Theus and Urbanek (2009)

MOSAIC PLOTS
Final step of the construction of a mosaic plot. Explicit mention
of Survived as highlighted. Theus and Urbanek (2009)

SCHWABISH (JEP, 2014) BEFORE-AFTER
FIGURE – An Unbalanced Chart - Original

FIGURE – An Unbalanced Chart - Revised

FIGURE – A Clutterplot Example - Original

FIGURE – A Clutterplot Example - Revised

“GOOD” OR “BAD” GRAPHICS ?
“There are no “good” nor “bad” graphics (...), there are graphics
answering legitimate questions and graphics that do not answer
question at all ”
Bertin (1981)
It is easy to criticize ... but are there some rules ?

A NOTE ON PERCEPTION
A bird (Duck, Toucan ?) on the X axis, a rabbit on the Y axis !
Source :
http://flowingdata.com/2014/06/25/duck-vs-rabbit-plot/

“PREATTENTIVE” VARIABLES
How many "3" in that sequence ? (from Ware (2012))

AND NOW...
Find the red dot !

TEST : FIND THE RED DOT !

HARDER : IS THERE A "STRANGER" ?

THAT WASN’T EASY
Preattentive concept, Treisman (1985) and Healey (2007)
Some visual elements or patterns are detected immediately
But there may be interferences (colour and form)
Very useful (detection, explanatory and presentation)
Helpful to highlight a message !

TOO MUCH VARIATION DOESN’T HELP
From Ware (2012)

MOST PREATTENTIVE VISUAL VARIABLES
From Ware (2012)

VISUAL PERCEPTION AND PIE CHARTS
https://twitter.com/freakonometrics/status/6127423301609512

VISUAL PERCEPTION AND LINES
From Cairo (2012)

When was the biggest negative (positive) difference ?
From Cairo (2012)

THE CLEVELAND-MCGILL EFFECT
From Cleveland and McGill (1984)

WEBER’S LAW AND FRAMED BOXES
From Cleveland and McGill (1984)

THE CLEVELAND-MCGILL SCALE
http://hcil2.cs.umd.edu/trs/99-20/99-20.html

PARTIAL CONCLUSION
Gordon and Finch (2015) give some nice principles :
1. Show the data clearly
2. Use simplicity in design
3. Use good alignment on a common scale for quantities to be
compared
4. Keep visual encoding transparent
5. Use graphical forms consistent with those principles
We may add some others (use preattentive elements,
integrity, ...)
Do not forget the big picture

CASE STUDY : VISUALIZING THE WHOLE AND THE
DETAILS !
2588 dairy farmers over 11 years.
One variable is estimated : risk aversion (AR)
6 regions of study
We don’t know the results
https://xtophedataviz.shinyapps.io/ShinyParallel/

CASE STUDY : RISK AVERSION
Simple plot : Median value over time.

Simple plot : Median value with dispersion visualized.

Classical BoxPlot : There are changes over time.

CASE STUDY : HOW TO VISUALIZE FARMS ?
Points over time : Too much overlapping

Points over time : Jitter helps !

Farms over time : Jitter helps !
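
Jittering simply adds a small random offset to the discrete coordinate (here the year) so that points sharing the same year no longer sit on top of each other. The sketch below is a generic illustration with made-up values, not the code behind the case study; the function name `jitter` and the width of 0.2 are assumptions.

```python
import random

def jitter(values, width=0.2, seed=0):
    """Shift each value by a uniform random offset in [-width, width]."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical observations: several farms observed in the same years.
years = [2001, 2001, 2001, 2002, 2002, 2002]
risk_aversion = [0.8, 0.8, 1.1, 0.9, 0.9, 1.3]   # plotted unchanged on the y-axis

jittered_years = jitter(years)
for x, y in zip(jittered_years, risk_aversion):
    print(round(x, 2), y)
```

Only the year is perturbed; the measured variable keeps its exact value, so the plot stays truthful on the axis that carries the information.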

Farms over time : Spaghetti plots !

Farms over time : Spaghetti plots with some Brushing !

Farms over time by region : Multiple Spaghetti plots !

Farms over time by region : Highlighting Spaghetti plots !

WHAT TO REMEMBER
Data visualisation serves at least two main purposes
Data exploration
Graphs as visual tests, comparisons (short time to build
and to read)
Data representation
Summaries, storytelling (long time to build, short time to
read)
The problem is that :
“Communicating implies simplification,
data exploration implies exhaustivity”

WHAT TO REMEMBER
For the viewer, “data visualisations” are implicitly or explicitly
comparisons, or even tests (in the statistical sense)
Graphics should help questioning
They should provide elements to answer (data at least)
If the question implies comparison, they should truthfully
show the comparison

WHAT TO REMEMBER
Many “data visualisations” are useless, meaningless or stupid !
Some are simply poor :
Some are funny :
Many are ridiculous :

CHALLENGES : NETWORKS
Relationships between all the characters of Victor Hugo’s "Les
Misérables".
http://bl.ocks.org/mbostock/4062045_

NETWORKS : ADJACENCY MATRIX PLOT
An adjacency matrix, where each cell ij represents an edge from
vertex i to vertex j. Here, vertices represent characters in a
book, while edges represent co-occurrence in a chapter.
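
Such a matrix can be built directly from the list of characters appearing in each chapter. The sketch below is an illustration only: the three "chapters" are invented, the function name `cooccurrence_matrix` is hypothetical, and each cell stores the number of shared chapters (a weighted variant of the 0/1 edge described above).

```python
from itertools import combinations

def cooccurrence_matrix(chapters):
    """Build a symmetric matrix counting the chapters in which two characters co-occur.

    `chapters` is a list of sets of character names.
    Returns (names, matrix), with rows and columns ordered like `names`.
    """
    names = sorted(set().union(*chapters))
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for cast in chapters:
        for a, b in combinations(sorted(cast), 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1      # edge from a to b
            matrix[j][i] += 1      # co-occurrence is symmetric
    return names, matrix

# Invented chapter casts, for illustration only.
chapters = [
    {"Valjean", "Javert"},
    {"Valjean", "Cosette"},
    {"Valjean", "Cosette", "Marius"},
]
names, matrix = cooccurrence_matrix(chapters)
print(names)
for row in matrix:
    print(row)
```

Reordering `names` (for instance by cluster or by number of links) permutes rows and columns together, which is what makes blocks appear in the plot.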

NETWORKS : ADJACENCY MATRIX PLOT
Here again, sorting is very useful !

WHAT TO REMEMBER : THERE ARE RULES
Data visualisation is a visual language, so there are :
Elements of language
Rules of use (spelling)
Grammar

WHAT TO REMEMBER : A GOOD TECHNIQUE DOES
NOT PRECLUDE GOOD COMMON SENSE !
let’s...
KISS
Keep It Simple Stupid !
Keep It Statistical Stupid !
Keep It Statistical and Simple !

REFERENCES I
Anscombe, F. J. (1973). Graphs in statistical analysis. The American
Statistician, 27(1) :17–21.
Bertin, J. (1970). La graphique. Communications, 15(1) :169–185.
Bertin, J. (1981). Théorie matricielle de la graphique. Communication et
langages, 48(1) :62–74.
Bertin, J. (1983). Semiology of graphics, translation from Sémiologie graphique
(1967).
Bertin, J. (2005). Sémiologie graphique : Les diagrammes, les réseaux, les cartes. Les
Réimpressions des Éditions de l’École des Hautes Études en Sciences
Sociales. Éditions de l’École des Hautes Études en Sciences Sociales.
Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.-K., Swayne, D. F., and
Wickham, H. (2009). Statistical inference for exploratory data analysis and
model diagnostics. Philosophical Transactions of the Royal Society of London
A : Mathematical, Physical and Engineering Sciences, 367(1906) :4361–4383.
Cairo, A. (2012). The Functional Art : An introduction to information graphics and
visualization. Voices That Matter. Pearson Education.

REFERENCES II
Chen, C.-h., Härdle, W. K., and Unwin, A. (2007). Handbook of data
visualization. Springer Science & Business Media.
Cleveland, W. S. (1994). The Elements of Graphing Data. Hobart Press,
Summit : NJ, 2 edition.
Cleveland, W. S. and McGill, R. (1984). Graphical perception : Theory,
experimentation, and application to the development of graphical
methods. Journal of the American Statistical Association, 79(387) :531–554.
Few, S. (2008). Practical rules for using color in charts. Visual Business
Intelligence Newsletter, (11).
Friendly, M. and Kwan, E. (2012). Comment. Journal of Computational and
Graphical Statistics.
Gelman, A. (2004). Exploratory data analysis for complex models. Journal of
Computational and Graphical Statistics, 13(4).
Gelman, A. (2011). Why tables are really much better than graphs. Journal of
Computational and Graphical Statistics, 20(1) :3–7.
Gelman, A., Pasarica, C., and Dodhia, R. (2002). Let’s practice what we
preach : turning tables into graphs. The American Statistician,
56(2) :121–130.

REFERENCES III
Gelman, A. and Unwin, A. (2011). Visualization, graphics, and statistics.
Statistical Computing and graphics, 22(1) :9–12.
Gordon, I. and Finch, S. (2015). Statistician heal thyself : Have we lost the
plot ? Journal of Computational and Graphical Statistics, 24(4) :1210–1229.
Healey, C. (2007). Perception in visualization.
Huff, D. (1993). How to Lie with Statistics. W. W. Norton & Company.
Munzner, T. (2014). Visualization Analysis and Design. AK Peters Visualization
Series. A K Peters/CRC Press, 1 edition.
Theus, M. and Urbanek, S. (2009). Interactive graphics for data analysis :
principles and examples. Series in computer science and data analysis. CRC
Press.
Treisman, A. (1985). Preattentive processing in vision. Computer Vision,
Graphics, and Image Processing, 31(2) :156–177.
Tufte, E. R. (2001). The Visual Display of Quantitative Information. Graphics
Press, 2 edition.
Tukey, J. W. (1977). Exploratory data analysis. Reading, Mass.
Ware, C. (2012). Information visualization : perception for design. Elsevier.
