I am sharing the slides I used for teaching my "Data Science by R" class. You can sign up for a class at http://www.nycdatascience.com/ (NYC Data Science Academy). We offer classes in R, Python, Processing, D3.js, Hadoop, and more.
2. Data Visualization
Data visualization
We will study the use of basic and advanced plotting functions in R, focusing on methods of exploring data through visualization.
· The related functions in R
· The properties of a single variable
· Displaying compositions
· The relationship between variables
· Exhibiting change over time
· Geographic information
Case study and exercise: Analyzing the NBA data with graphics
4. Data Visualization
Data visualization
A figure is worth a thousand words.
data <- read.table('data/anscombe.txt', header=TRUE)
data <- data[,-1]
head(data)
  x1 x2 x3 x4   y1   y2    y3   y4
1 10 10 10  8 8.04 9.14  7.46 6.58
2  8  8  8  8 6.95 8.14  6.77 5.76
3 13 13 13  8 7.58 8.74 12.74 7.71
4  9  9  9  8 8.81 8.77  7.11 8.84
5 11 11 11  8 8.33 9.26  7.81 8.47
6 14 14 14  8 9.96 8.10  8.84 7.04
5. Data Visualization
Data visualization
Let's calculate some statistical summaries: first the means of these datasets, then the correlation coefficients of the four (x, y) pairs.
colMeans(data)
x1 x2 x3 x4 y1 y2 y3 y4
9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5
sapply(1:4,function(x) cor(data[,x],data[,x+4]))
[1] 0.816 0.816 0.816 0.817
7. Data Visualization
Some basic principles
1. Determine the goal of the visualization from the beginning
· Exploratory visualization
· Explanatory visualization
2. Understand the characteristics of the data and the audience
· Which variables are important and interesting
· Consider the role and background of the audience
· Select a proper mapping
3. Keep it concise, but give enough information
35. Data Visualization
ggplot package
Polishing your plots for publication
p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
  geom_point(aes(colour=class, size=displ),
             alpha=0.5, position="jitter") +
  geom_smooth() +
  scale_size_continuous(range=c(4, 10)) +
  facet_wrap(~ year, ncol=1) +
  labs(title='Vehicle model and fuel consumption', # opts() was removed from ggplot2; labs(title=) replaces it
       y='Highway miles per gallon',
       x='Urban miles per gallon',
       size='Displacement',
       colour='Model')
40. Data Visualization
Histogram
We can customize the histogram as follows:
p <- ggplot(iris, aes(x=Sepal.Length)) +
  geom_histogram(binwidth=0.1,   # Set the bin width
                 fill='skyblue', # Set the fill color
                 colour='black') # Set the border color
42. Data Visualization
Histograms plus density curve
The main role of the histogram is to show counts by group and the shape of the distribution. The distribution of a sample is of central importance in traditional statistics. Another method that can show the distribution of the data is the kernel density estimate: from the data we can estimate a density curve that represents the distribution, and we can display the histogram and the density curve at the same time.
p <- ggplot(iris,aes(x=Sepal.Length)) +
geom_histogram(aes(y=..density..),
fill='skyblue',
color='black') +
geom_density(color='black',
linetype=2,adjust=2)
44. Data Visualization
Density curve
Like the bandwidth parameter, the adjust parameter controls the appearance of the density curve. We can try different values and draw multiple density curves: the smaller the value, the more volatile and sensitive the curve.
p <- ggplot(iris,aes(x=Sepal.Length)) +
geom_histogram(aes(y=..density..), # Note: map y to the estimated density rather than counts
fill='gray60',
color='gray') +
geom_density(color='black',linetype=1,adjust=0.5) +
geom_density(color='black',linetype=2,adjust=1) +
geom_density(color='black',linetype=3,adjust=2)
46. Data Visualization
Density curve
Density curves are also convenient for comparing groups. For example, to compare the Sepal.Length distributions of the three iris species:
p <- ggplot(iris,aes(x=Sepal.Length,fill=Species)) + geom_density(alpha=0.5,color='gray')
print(p)
47. Data Visualization
Boxplot
In addition to histograms and density curves, we can use boxplots to show the distribution of one-dimensional data. Boxplots also make it easy to compare groups.
p <- ggplot(iris,aes(x=Species,y=Sepal.Length,fill=Species)) + geom_boxplot()
print(p)
48. Data Visualization
Violin plot
A violin plot contains more information than a boxplot about the (sub-)distributions of the data:
p <- ggplot(iris,aes(x=Species,y=Sepal.Length,fill=Species)) + geom_violin()
print(p)
52. Data Visualization
Stacked bar chart
The number of vehicles of each class in the mpg dataset, with each bar subdivided by year:
mpg$year <- factor(mpg$year)
p <- ggplot(mpg,aes(x=class,fill=year)) +
geom_bar(color='black')
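Note that geom_bar() stacks raw counts by default. If true proportions are wanted, position='fill' rescales every bar to a total of 1; a minimal sketch, not from the original slides:
p <- ggplot(mpg, aes(x=class, fill=year)) +
  geom_bar(color='black', position='fill') + # rescale each bar so its segments sum to 1
  labs(y='proportion')
print(p)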
58. Data Visualization
Rose diagram
The wind rose, a graphic commonly used by meteorologists, describes the distribution of wind speed and direction at a specific place.
set.seed(1)
# Randomly generate 100 wind directions and divide them into 16 intervals.
dir <- cut_interval(runif(100, 0, 360), n=16)
# Randomly generate 100 wind speeds and divide them into 4 intensities.
mag <- cut_interval(rgamma(100, 15), 4)
sample <- data.frame(dir=dir, mag=mag)
# Map wind direction to the x-axis, frequency to the y-axis and speed to fill color,
# then transform the coordinates to polar.
p <- ggplot(sample, aes(x=dir, fill=mag)) +
  geom_bar() + coord_polar()
62. Data Visualization
The proportion structure of continuous data
data <- read.csv('data/soft_impact.csv', header=TRUE)
library(reshape2)
data.melt <- melt(data, id='Year')
p <- ggplot(data.melt, aes(x=Year, y=value,
                           group=variable, fill=variable)) +
  geom_area(color='black', size=0.3,
            position=position_fill()) +
  scale_fill_brewer()
67. Data Visualization
Scatter plot of multidimensional data
Represent different years with different shapes
mpg$year <- factor(mpg$year)
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) + geom_point(aes(color=year,shape=year))
print(p)
68. Data Visualization
Scatter plot of multidimensional data
With large data sets, the points in a scatter plot may obscure each other due to overplotting; adding a small random jitter alleviates this.
p <- ggplot(data=mpg, aes(x=cty, y=hwy)) + geom_point(aes(color=year), alpha=0.5, position="jitter")
print(p)
69. Data Visualization
Scatter plot of multidimensional data
To show the trend in the scatter plot, we can add a regression line.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(color=year),alpha=0.5,position = "jitter") +
geom_smooth(method='lm')
print(p)
70. Data Visualization
Scatter plot of multidimensional data
In addition to color, we can also use the size of the points to reflect another variable, such as engine displacement. Plots like this are sometimes called "bubble charts".
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(color=year,size=displ),alpha=0.5,position = "jitter") +
geom_smooth(method='lm') +
scale_size_continuous(range = c(4, 10))
72. Data Visualization
Scatter plot of multidimensional data
Although we can show all the variables in one picture, we can also split it into multiple panels to show the characteristics of different subsets. This method is called grouping, conditioning, or faceting.
p <- ggplot(data=mpg,aes(x=cty,y=hwy)) +
geom_point(aes(colour=class,size=displ),
alpha=0.5,position = "jitter") +
geom_smooth() +
scale_size_continuous(range = c(4, 10)) +
facet_wrap(~ year,ncol=1)
74. Data Visualization
ggplot exercise II
· Make a scatter plot of the diamonds data
· Use transparency and small points; look into the size and alpha options of geom_point()
· Use a 2-D bin chart to observe the intensity of points; look into stat_bin2d()
· Estimate the data density; look into stat_density2d() and use + coord_cartesian(xlim=c(0,1.5), ylim=c(0,6000))
(One possible solution is sketched below.)
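A sketch of one possible solution, assuming the classic carat-vs-price pairing (the xlim/ylim in the last bullet suggest it; the slides do not show their own answer):
library(ggplot2)
# Small, transparent points to fight overplotting
p1 <- ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point(size=0.5, alpha=0.1)
# 2-D bin counts show where points concentrate
p2 <- ggplot(diamonds, aes(x=carat, y=price)) +
  stat_bin2d(bins=60)
# 2-D density contours, zoomed in without dropping data
p3 <- ggplot(diamonds, aes(x=carat, y=price)) +
  stat_density2d() +
  coord_cartesian(xlim=c(0, 1.5), ylim=c(0, 6000))
print(p3)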
78. Data Visualization
Scatter plot of multidimensional data
A typical scatter plot shows the relationship between two variables. When you want to look at many bivariate relationships at once, you can use a scatter plot matrix, as sketched below.
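A minimal sketch of a scatter plot matrix using base R's pairs() (the slide's own figure is not reproduced here):
# All pairwise scatter plots of the four iris measurements, colored by species
pairs(iris[, 1:4], col=as.numeric(iris$Species), pch=19)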
81. Data Visualization
Change over time
For visualization of time-series data, the first step is to look at how the variable changes over time. As an example, we'll look at US unemployment figures from the economics dataset.
fillcolor <- ifelse(economics[440:470,'unemploy']<8000,'steelblue','red4')
p <- ggplot(economics[440:470,],aes(x=date,y=unemploy)) +
geom_bar(stat='identity',
fill=fillcolor)
83. Data Visualization
Change over time
For a short time series, a bar graph works well, and positive and negative values can be displayed in different colors. For a long time series the bars become crowded, and lines and points can be used in place of the bars.
p <- ggplot(economics[300:470,],aes(x=date,ymax=psavert,ymin=0)) +
geom_linerange(color='grey20',size=0.5) +
geom_point(aes(y=psavert),color='red4') +
theme_bw()
85. Data Visualization
Change over time
When the data are denser, we can use a line graph or an area chart to show the trend. Important time points or intervals can also be marked on the time-series graph, such as highlighting the 1980s.
fill.color <- ifelse(economics$date > '1980-01-01' &
                     economics$date < '1990-01-01',
                     'steelblue', 'red4')
p <- ggplot(economics, aes(x=date, ymax=psavert, ymin=0)) +
  geom_linerange(color=fill.color, size=0.9) +
  geom_text(aes(x=as.Date("1985-01-01", '%Y-%m-%d'), y=13), label="1980s") +
  theme_bw()
89. Data Visualization
Map
Two ways of drawing a map:
· Download geographic boundary data, then draw the geographical boundaries and identify areas and locations as needed (a minimal sketch follows)
· Download bitmap data from Google Maps, then mark location and path information on top of the map
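A minimal sketch of the first approach, drawing national boundaries from the maps package's polygon data (an illustration, not the slides' own example):
library(ggplot2)
library(maps) # supplies the boundary polygons behind map_data()
china <- map_data('world', region='China')
p <- ggplot(china, aes(x=long, y=lat, group=group)) +
  geom_polygon(fill='grey90', colour='black') +
  coord_quickmap() # keep an approximately correct aspect ratio
print(p)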
94. Data Visualization
Drawing a map of China based on a bitmap
Another way to draw a map of China is to download a bitmap from Google Maps or OpenStreetMap and then overlay point and line elements on it with ggplot2. The bitmap itself carries no latitude/longitude information; it is just a simple image, which makes for fast mapping.
library(ggmap)
library(XML)
webpage <- 'http://data.earthquake.cn/datashare/globeEarthquake_csn.html'
tables <- readHTMLTable(webpage, stringsAsFactors=FALSE)
raw <- tables[[6]]
data <- raw[, c(1, 3, 4)]
names(data) <- c('date', 'lat', 'lon')
data$lat <- as.numeric(data$lat)
data$lon <- as.numeric(data$lon)
data$date <- as.Date(data$date, "%Y-%m-%d")
# Read the map tiles from Google via the ggmap package, and mark the earthquake data on the map.
earthquake <- ggmap(get_googlemap(center='china', zoom=4, maptype='terrain'), extent='device') +
  geom_point(data=data, aes(x=lon, y=lat), colour='red', alpha=0.7) +
  theme(legend.position="none")
96. Data Visualization
R and interactive visualization
googleVis is an R package providing an interface between R and the Google Visualization API. It allows the user to call the Google Visualization API for data visualization without needing to upload the data. We want to compare the development trajectories of a group of 20 countries over the past several years. To obtain the data, we selected three variables from the World Bank database, reflecting the change in GDP, CO2 emissions and life expectancy between 2001 and 2009.
library(googleVis)
library(WDI)
# The country list was truncated on the original slide; the indicator codes and
# year range below are filled in from the slide text (GDP, CO2 emissions and
# life expectancy, 2001-2009).
DF <- WDI(country=c("CN","RU","BR","ZA","IN","DE","AU","CA","FR","IT","JP","MX","GB","US"),
          indicator=c('NY.GDP.MKTP.CD','EN.ATM.CO2E.KT','SP.DYN.LE00.IN'),
          start=2001, end=2009)
M <- gvisMotionChart(DF, idvar="country", timevar="year",
                     xvar='EN.ATM.CO2E.KT',
                     yvar='NY.GDP.MKTP.CD')
plot(M)
98. Data Visualization
Exercise III: Analyzing NBA data
· Calculate the seasonal winning rate, and draw a bar chart
· Calculate the seasonal winning rate at home and on the road, and draw a bar chart
· Draw a set of four histograms of the home side's seasonal scores
· Draw boxplots of the home side's scores across the five seasons
· Draw boxplots of the scores of all games for the home side and the opposing side
· Calculate the average score and winning percentage against each opponent, and make a scatter plot to identify the strong and the weak teams
(A sketch for the first item follows.)
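A minimal sketch for the first item, assuming a data frame nba with hypothetical columns season and home_win (1 if the home side won); the actual column names of the NBA file are not shown in the slides:
library(ggplot2)
# Winning rate per season: mean of the 0/1 win indicator (column names are assumed)
win.rate <- aggregate(home_win ~ season, data=nba, FUN=mean)
p <- ggplot(win.rate, aes(x=season, y=home_win)) +
  geom_bar(stat='identity') +
  labs(y='winning rate')
print(p)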