
Data visualization

I am sharing the slides I used for teaching my "Data Science by R" class. You can sign up for a class at http://www.nycdatascience.com/ (NYC Data Science Academy). We offer classes in R, Python, Processing, D3.js, Hadoop, and more.

Published in: Education

1. Data Visualization (class 5). Vivian Zhang, CTO @Supstat Inc | Scott Kostyshak, Data Scientist @Supstat Inc. http://nycdatascience.com/part4_en/ (98 slides; 2/4/14, 7:31 AM)
2. Data visualization
We will study the basic and advanced plotting functions in R, focusing on methods of exploring data through visualization.
· The related functions in R
· The properties of a single variable
· Displaying compositions
· The relationship between variables
· Exhibiting change over time
· Geographic information
Case study and exercise: Analyzing the NBA data with graphics
3. Why use visualization?
4. Data visualization
A figure is worth a thousand words.
    data <- read.table('data/anscombe.txt', header=TRUE)
    data <- data[,-1]   # drop the first column
    head(data)
      x1 x2 x3 x4   y1   y2    y3   y4
    1 10 10 10  8 8.04 9.14  7.46 6.58
    2  8  8  8  8 6.95 8.14  6.77 5.76
    3 13 13 13  8 7.58 8.74 12.74 7.71
    4  9  9  9  8 8.81 8.77  7.11 8.84
    5 11 11 11  8 8.33 9.26  7.81 8.47
    6 14 14 14  8 9.96 8.10  8.84 7.04
5. Data visualization
Try calculating some summary statistics: first the mean of each column, then the correlation coefficient of each of the four x/y pairs.
    colMeans(data)
     x1  x2  x3  x4  y1  y2  y3  y4
    9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5
    sapply(1:4, function(x) cor(data[,x], data[,x+4]))
    [1] 0.816 0.816 0.816 0.817
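The point of these two slides is that near-identical summary statistics can hide very different data. A minimal sketch of that comparison follows; it assumes R's built-in `anscombe` data frame holds the same values as the `data/anscombe.txt` file used in the class, and it uses base graphics rather than whatever produced the original figure.

```r
# Built-in copy of Anscombe's quartet (assumed to match data/anscombe.txt).
data(anscombe)

# Correlation of each x/y pair, computed as on the slide.
cors <- sapply(1:4, function(i) cor(anscombe[, i], anscombe[, i + 4]))
print(round(cors, 3))

# One scatterplot per pair, each with its least-squares line.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i), xlim = c(3, 20), ylim = c(2, 14))
  abline(lm(y ~ x), col = "red4")
}
par(op)
```

The four fitted lines are almost the same, while the point clouds are clearly not, which is the argument for plotting before trusting the statistics.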
7. Some basic principles
1. Determine the goal of the visualization from the beginning
   · Exploratory visualization
   · Explanatory visualization
2. Understand the characteristics of the data and the audience
   · Which variables are important and interesting
   · Consider the role and background of the audience
   · Select a proper mapping
3. Keep it concise but give enough information
8. Mapping elements of a graph:
1. Coordinate position
2. Line
3. Size
4. Color
5. Shape
6. Text
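As a rough illustration of the six mapping elements listed above, the sketch below uses each of them once in a single ggplot2 plot. The choice of variables from the `mpg` dataset, and the position and wording of the text label, are assumptions made for the example only.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +          # 1. coordinate position
  geom_point(aes(colour = drv,                       # 4. color
                 size = cyl,                         # 3. size
                 shape = factor(year)),              # 5. shape
             alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +           # 2. line
  annotate("text", x = 6, y = 40, label = "mpg")     # 6. text

# print(p) draws the plot on the active graphics device.
```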
9. Visualization functions in R
10. Visualization functions in R
· base graphics
· lattice
· ggplot2
11. Elementary graphing functions
    plot(cars$dist ~ cars$speed)
12. Elementary graphing functions
    plot(cars$dist, type='l')
13. Elementary graphing functions
    plot(cars$dist, type='h')
14. Elementary graphing functions
    hist(cars$dist)
15. lattice package
    library(lattice)
    num <- sample(1:3, size=50, replace=TRUE)
    barchart(table(num))
16. lattice package
    qqmath(rnorm(100))
17. lattice package
    stripplot(~ Sepal.Length | Species, data=iris, layout=c(1,3))
18. lattice package
    densityplot(~ Sepal.Length, groups=Species, data=iris, plot.points=FALSE)
19. lattice package
    bwplot(Species ~ Sepal.Length, data=iris)
20. lattice package
    xyplot(Sepal.Width ~ Sepal.Length, groups=Species, data=iris)
21. lattice package
    splom(iris[1:4])
22. lattice package
    histogram(~ Sepal.Length | Species, data=iris, layout=c(1,3))
23. Three-dimensional graphs in the lattice package
    library(plyr)
    func3d <- function(x,y) {
      sin(x^2/2 - y^2/4) * cos(2*x - exp(y))
    }
    vec1 <- vec2 <- seq(0, 2, length=30)
    para <- expand.grid(x=vec1, y=vec2)
    result6 <- mdply(.data=para, .fun=func3d)
24. Three-dimensional graphs in the lattice package
    library(lattice)
    wireframe(V1 ~ x*y, data=result6,
              scales=list(arrows=FALSE),
              drape=TRUE, colorkey=FALSE)
25. ggplot package: Data, Mapping and Geom
    library(ggplot2)
    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point()
    print(p)
26. ggplot package: Observe the internal structure
    summary(p)
    data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl, class [234x11]
    mapping: x = cty, y = hwy
    faceting: facet_null()
    -----------------------------------
    geom_point: na.rm = FALSE
    stat_identity:
    position_identity: (width = NULL, height = NULL)
27. ggplot package: Add other data mappings
    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy, colour=factor(year)))
    p <- p + geom_point()
    print(p)
28. ggplot package: Add a statistical transformation such as a smooth
    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy, colour=factor(year)))
    p <- p + geom_smooth()
    print(p)
29. ggplot package: Add points and smooth lines as layers on the plot
    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=factor(year))) +
      geom_smooth()
31. ggplot package: Scale control
    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=factor(year))) +
      geom_smooth() +
      scale_color_manual(values=c('blue2','red4'))
33. ggplot package: Facet control
    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=factor(year))) +
      geom_smooth() +
      scale_color_manual(values=c('blue2','red4')) +
      facet_wrap(~ year, ncol=1)
35. ggplot package: Polishing your plots for publication
    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=class, size=displ),
                 alpha=0.5, position="jitter") +
      geom_smooth() +
      scale_size_continuous(range=c(4, 10)) +
      facet_wrap(~ year, ncol=1) +
      # opts(title=...) has been removed from ggplot2; ggtitle() replaces it
      ggtitle('Vehicle model and fuel consumption') +
      labs(y='Highway miles per gallon',
           x='Urban miles per gallon',
           size='Displacement',
           colour='Model')
37. ggplot exercise I
Change the coordinate system of the plot below, for example with coord_flip(), coord_polar(), or coord_cartesian().
    p <- ggplot(data=mpg, mapping=aes(x=cty, y=hwy)) +
      geom_point(aes(colour=factor(year), size=displ),
                 alpha=0.5, position="jitter") +
      stat_smooth() +
      scale_color_manual(values=c('steelblue','red4')) +
      scale_size_continuous(range=c(4, 10))
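One possible way to approach exercise I is sketched below. It is not the instructors' solution: the base plot is simplified, and the zoom limits passed to `coord_cartesian()` are arbitrary values chosen for illustration.

```r
library(ggplot2)

# Simplified version of the exercise plot.
p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(colour = factor(year)), alpha = 0.5, position = "jitter")

p_flip  <- p + coord_flip()    # swap the x and y axes
p_polar <- p + coord_polar()   # map x to angle and y to radius
# Zoom in without dropping data (limits here are illustrative assumptions).
p_zoom  <- p + coord_cartesian(xlim = c(10, 25), ylim = c(15, 35))

# print(p_flip), print(p_polar) and print(p_zoom) draw each variant.
```

Unlike setting limits on a scale, `coord_cartesian()` only changes the visible window, so any smoother added to the plot is still fitted to all of the data.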
38. The properties of a single variable
39. Histogram
    library(ggplot2)
    p <- ggplot(data=iris, aes(x=Sepal.Length)) +
      geom_histogram()
    print(p)
40. Histogram
We can customize the histogram as follows:
    p <- ggplot(iris, aes(x=Sepal.Length)) +
      geom_histogram(binwidth=0.1,     # Set the bin width
                     fill='skyblue',   # Set the fill color
                     colour='black')   # Set the border color
42. Histograms plus density curve
The main role of the histogram is to show counts by group and the shape of the distribution. The distribution of a sample is of central importance in traditional statistics. Another way to show the distribution of data is the kernel density estimate: a density curve, estimated from the data, that represents the distribution. We can display the histogram and the density curve at the same time.
    p <- ggplot(iris, aes(x=Sepal.Length)) +
      geom_histogram(aes(y=..density..),
                     fill='skyblue',
                     color='black') +
      geom_density(color='black', linetype=2, adjust=2)
44. Density curve
Like a window-width (bandwidth) parameter, the adjust parameter controls how smooth the density curve is. We try different values to draw multiple density curves: the smaller the value, the more volatile and sensitive the curve.
    p <- ggplot(iris, aes(x=Sepal.Length)) +
      geom_histogram(aes(y=..density..),   # Note: set y to density, not count
                     fill='gray60',
                     color='gray') +
      geom_density(color='black', linetype=1, adjust=0.5) +
      geom_density(color='black', linetype=2, adjust=1) +
      geom_density(color='black', linetype=3, adjust=2)
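To see what `adjust` does numerically, the sketch below calls base R's `density()` directly, which is the estimator `geom_density()` relies on by default. The variable names are illustrative; the only claim is that `adjust` multiplies the automatically chosen bandwidth.

```r
x <- iris$Sepal.Length

# Same data, three values of adjust, as on the slide.
d_half <- density(x, adjust = 0.5)
d_one  <- density(x, adjust = 1)
d_two  <- density(x, adjust = 2)

# The effective bandwidth scales linearly with adjust.
print(c(half = d_half$bw, one = d_one$bw, two = d_two$bw))

# Overlay the three curves: smaller bandwidth gives a wigglier estimate.
plot(d_half, lty = 1, main = "Effect of adjust on the density estimate")
lines(d_one, lty = 2)
lines(d_two, lty = 3)
```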
46. Density curve
Density curves are also convenient for comparing different groups of data. For example, to compare the Sepal.Length distributions of the three iris species:
    p <- ggplot(iris, aes(x=Sepal.Length, fill=Species)) +
      geom_density(alpha=0.5, color='gray')
    print(p)
47. Boxplot
In addition to histograms and density curves, we can use boxplots to show the distribution of one-dimensional data. The boxplot is also convenient for comparing different groups.
    p <- ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
      geom_boxplot()
    print(p)
48. Violin plot
A violin plot contains more information than a boxplot about the (sub-)distributions of the data:
    p <- ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
      geom_violin()
    print(p)
49. Violin plot plus points
    p <- ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
      geom_violin(fill='gray', alpha=0.5) +
      geom_dotplot(binaxis="y", stackdir="center")
    print(p)
50. Displaying compositions
51. Bar chart
The number of vehicles of each class in the mpg dataset:
    p <- ggplot(mpg, aes(x=class)) +
      geom_bar()
    print(p)
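`geom_bar()` counts rows, so the chart above shows counts rather than proportions. If proportions are wanted, one option (a sketch, not taken from the slides) is to compute them first with `prop.table()` and plot the result with `geom_col()`; the names `class_share` and `share_df` are made up for this example.

```r
library(ggplot2)

# Share of each vehicle class among all rows of mpg.
class_share <- prop.table(table(mpg$class))
print(round(class_share, 3))

# Turn the table into a data frame and plot the precomputed shares.
share_df <- as.data.frame(class_share)
names(share_df) <- c("class", "share")

p <- ggplot(share_df, aes(x = class, y = share)) +
  geom_col() +
  labs(y = "Proportion of vehicles")
```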
52. Stacked bar chart
The composition of each vehicle class in the mpg dataset, grouped by year:
    mpg$year <- factor(mpg$year)
    p <- ggplot(mpg, aes(x=class, fill=year)) +
      geom_bar(color='black')
54. Stacked bar chart
With position_dodge(), the bars for each year are placed side by side instead of being stacked:
    p <- ggplot(mpg, aes(x=class, fill=year)) +
      geom_bar(color='black',
               position=position_dodge())
56. Pie chart
    p <- ggplot(mpg, aes(x=factor(1), fill=factor(class))) +
      geom_bar(width=1) +
      coord_polar(theta="y")
58. Rose diagram
The wind rose, a graphic commonly used by meteorologists, describes the distribution of wind speed and direction at a specific place.
    set.seed(1)
    # Randomly generate 100 wind directions, and divide them into 16 intervals.
    dir <- cut_interval(runif(100,0,360), n=16)
    # Randomly generate 100 wind speeds, and divide them into 4 intensities.
    mag <- cut_interval(rgamma(100,15), 4)
    sample <- data.frame(dir=dir, mag=mag)
    # Map wind direction to the x-axis, frequency to the y-axis and speed to fill color. Transform the coo…
    p <- ggplot(sample, aes(x=dir, fill=mag)) +
      geom_bar() +
      coord_polar()
60. Mosaic Plot
Divide the data according to different variables, and then use rectangles of different sizes to represent the different groups. Let's look at the gender breakdown of survivors:
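The slide's figure and its code are not in the transcript. A minimal sketch of a mosaic plot of this kind is given below; it assumes the survivors in question come from R's built-in `Titanic` table, which the slide does not confirm, and it uses base R's `mosaicplot()`.

```r
# Titanic is a 4-way table: Class x Sex x Age x Survived.
# Collapse it to Sex x Survived (dimensions 2 and 4).
sex_survival <- margin.table(Titanic, c(2, 4))
print(sex_survival)

# Tile area is proportional to the number of people in each cell.
mosaicplot(sex_survival,
           main = "Titanic: survival by sex",
           color = c("gray70", "steelblue"))
```

The column widths show how many passengers and crew were male or female, and the split within each column shows the share of each group that survived.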
62. The proportion structure of continuous data
    data <- read.csv('data/soft_impact.csv', header=TRUE)
    library(reshape2)
    data.melt <- melt(data, id='Year')
    p <- ggplot(data.melt, aes(x=Year, y=value,
                               group=variable, fill=variable)) +
      geom_area(color='black', size=0.3,
                position=position_fill()) +
      scale_fill_brewer()
64. The relationship between variables
65. Scatter diagram
Show the relationship between two variables with a scatter diagram.
    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point()
    print(p)
66. Scatter plot of multidimensional data
    mpg$year <- factor(mpg$year)
    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year))
    print(p)
67. Scatter plot of multidimensional data
Represent different years with different shapes.
    mpg$year <- factor(mpg$year)
    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year, shape=year))
    print(p)
68. Scatter plot of multidimensional data
With large data sets, the points in a scatter plot may obscure each other due to overplotting. We can add a small random disturbance (jitter) to the points to solve this problem.
    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year), alpha=0.5, position = …
    print(p)
69. Scatter plot of multidimensional data
To show the trend in the scatter plot, we can draw a regression line.
    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year), alpha=0.5, position="jitter") +
      geom_smooth(method='lm')
    print(p)
70. Scatter plot of multidimensional data
In addition to color, we can use the size of the points to reflect another variable, such as engine displacement. Some refer to plots like this as "bubble charts".
    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(color=year, size=displ), alpha=0.5, position="jitter") +
      geom_smooth(method='lm') +
      scale_size_continuous(range=c(4, 10))
72. Scatter plot of multidimensional data
Although we can show all the variables in one picture, we can also split it into multiple pictures to show the characteristics of different variables. This method is called grouping, conditioning, or faceting.
    p <- ggplot(data=mpg, aes(x=cty, y=hwy)) +
      geom_point(aes(colour=class, size=displ),
                 alpha=0.5, position="jitter") +
      geom_smooth() +
      scale_size_continuous(range=c(4, 10)) +
      facet_wrap(~ year, ncol=1)
74. ggplot exercise II
· Make a scatter plot of the diamonds data
· Use transparency and small points; look into the size and alpha options of geom_point()
· Use a binned chart to observe the intensity of points; look into stat_bin2d()
· Estimate the data density; look into stat_density2d() and use + coord_cartesian(xlim=c(0,1.5), ylim=c(0,6000))
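The steps of exercise II could be approached roughly as follows. This is a sketch rather than the instructors' solution: it assumes the exercise plots price against carat (suggested by the axis limits above), and the point size, alpha, and bin count are arbitrary choices.

```r
library(ggplot2)

# Assumed variables: carat on x, price on y.
base <- ggplot(diamonds, aes(x = carat, y = price))

# Small, transparent points reduce overplotting.
p_points <- base + geom_point(size = 0.5, alpha = 0.1)

# Rectangular bins, filled by the number of points in each bin.
p_bins <- base + stat_bin2d(bins = 60)

# 2D kernel density contours, zoomed to the dense region.
p_density <- base +
  stat_density2d() +
  coord_cartesian(xlim = c(0, 1.5), ylim = c(0, 6000))

# print() any of the three objects to draw it.
```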
78. Scatter plot of multidimensional data
A typical scatter plot shows the relationship between two variables. When you want to look at many bivariate relationships at once, you can use a scatter plot matrix.
79. Scatter plot of multidimensional data
Given many numerical variables, they can be displayed together in one condensed view.
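The figures for these two slides are missing from the transcript. As a stand-in, the sketch below draws a scatter plot matrix with base R's `pairs()` and prints the correlation matrix as one condensed numeric summary; the use of the iris measurements is an assumption, chosen because the deck uses iris elsewhere.

```r
# Four numeric measurements from iris.
measurements <- iris[1:4]

# Every pairwise scatter plot in one grid, colored by species.
pairs(measurements,
      col = c("steelblue", "red4", "gray40")[as.integer(iris$Species)],
      pch = 19, cex = 0.6)

# A condensed numeric view of the same pairwise relationships.
cors <- cor(measurements)
print(round(cors, 2))
```

The lattice call `splom(iris[1:4])` shown earlier in the deck produces a similar matrix.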
80. Change over time
81. Change over time
For time series data, the first step is to look at how the variable changes over time. For example, let's look at US unemployment data.
    # Index the column as a vector so ifelse() receives a logical vector
    fillcolor <- ifelse(economics$unemploy[440:470] < 8000, 'steelblue', 'red4')
    p <- ggplot(economics[440:470,], aes(x=date, y=unemploy)) +
      geom_bar(stat='identity',
               fill=fillcolor)
83. Change over time
For a time series with a small amount of data, we can use a bar chart, showing positive and negative values in different colors. For a large time series the bars become crowded, and lines and points can be used in place of the bars.
    p <- ggplot(economics[300:470,], aes(x=date, ymax=psavert, ymin=0)) +
      geom_linerange(color='grey20', size=0.5) +
      geom_point(aes(y=psavert), color='red4') +
      theme_bw()
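The slide mentions coloring positive and negative values differently but does not show code for it. A minimal sketch of that idea, under the assumption that the quantity of interest is the month-to-month change in the personal savings rate (`psavert`), might look like this; the `change` data frame and its column names are invented for the example.

```r
library(ggplot2)

# A short window of the series, as in the earlier bar chart.
recent <- economics[440:470, ]

# Month-to-month change; diff() returns one fewer value than its input.
change <- data.frame(date  = recent$date[-1],
                     delta = diff(recent$psavert))
change$direction <- ifelse(change$delta >= 0, "increase", "decrease")

# Bars above zero and below zero get different fill colors.
p <- ggplot(change, aes(x = date, y = delta, fill = direction)) +
  geom_col() +
  scale_fill_manual(values = c(increase = "steelblue", decrease = "red4")) +
  theme_bw()
```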
85. Change over time
When the data are denser, we can use a line graph or an area chart to show the trend. Important time points or intervals can also be marked on the time series graph, such as highlighting the 1980s as a key period.
    fill.color <- ifelse(economics$date > '1980-01-01' &
                         economics$date < '1990-01-01',
                         'steelblue', 'red4')
    p <- ggplot(economics, aes(x=date, ymax=psavert, ymin=0)) +
      geom_linerange(color=fill.color, size=0.9) +
      # annotate() draws the label once instead of once per data row
      annotate('text', x=as.Date("1985-01-01", '%Y-%m-%d'), y=13, label="1980s") +
      theme_bw()
88. Geographic information visualization
89. Map
Two ways of drawing a map:
· Download geographic boundary data, draw the boundaries, and mark areas and locations as needed
· Download bitmap data from Google Maps, and then mark locations and paths on top of the map
90. Map: world map
    library(ggplot2)
    world <- map_data("world")
    worldmap <- ggplot(world, aes(x=long, y=lat, group=group)) +
      geom_path(color='gray10', size=0.3) +
      geom_point(x=114, y=30, size=10, shape='*') +
      scale_y_continuous(breaks=(-2:2) * 30) +
      scale_x_continuous(breaks=(-4:4) * 45) +
      coord_map("ortho", orientation=c(30, 120, 0)) +
      theme(panel.grid.major = element_line(colour = "gray50"),
            panel.background = element_rect(fill = "white"),
            axis.text=element_blank(),
            axis.ticks=element_blank(),
            axis.title=element_blank())
92. Map of the U.S.
    map <- map_data('state')
    arrests <- USArrests
    names(arrests) <- tolower(names(arrests))
    arrests$region <- tolower(rownames(USArrests))
    usmap <- ggplot(data=arrests) +
      geom_map(map=map, aes(map_id=region, fill=murder), color='gray40') +
      expand_limits(x=map$long, y=map$lat) +
      scale_fill_continuous(high='red2', low='white') +
      theme_bw() +
      theme(panel.grid.major = element_blank(),
            panel.background = element_blank(),
            axis.text=element_blank(),
            axis.ticks=element_blank(),
            axis.title=element_blank(),
            legend.position = c(0.95,0.28),
            legend.background=element_rect(fill="white", colour="white")) +
      coord_map('mercator' …
94. Drawing a map of China based on a bitmap
Another way to draw a map of China is to download bitmap data from Google or OpenStreetMap, and then overlay point and line elements on it with ggplot2. The bitmap carries no latitude or longitude information of its own; it is just an image, which makes for fast mapping.
    library(ggmap)
    library(XML)
    webpage <- 'http://data.earthquake.cn/datashare/globeEarthquake_csn.html'
    tables <- readHTMLTable(webpage, stringsAsFactors=FALSE)
    raw <- tables[[6]]
    data <- raw[,c(1,3,4)]
    names(data) <- c('date','lan','lon')
    data$lan <- as.numeric(data$lan)
    data$lon <- as.numeric(data$lon)
    data$date <- as.Date(data$date, "%Y-%m-%d")
    # Read the map data from Google with the ggmap package, and mark the data above on the map.
    earthquake <- ggmap(get_googlemap(center='china', zoom=4, maptype='terrain'), extent='device' …
      geom_point(data=data, aes(x=lon, y=lan), colour='red', alpha=0.7) +
      theme(legend.position = "none")
96. R and interactive visualization
googleVis is an R package providing an interface between R and the Google Visualization API. It lets the user visualize data with the Google Visualization API without having to upload the data. We want to compare the development trajectories of a group of 20 countries over the past several years. To obtain the data, we selected three variables from the World Bank database, reflecting the change in GDP, CO2 emissions and life expectancy between 2001 and 2009.
    library(googleVis)
    library(WDI)
    DF <- WDI(country=c("CN","RU","BR","ZA","IN",'DE','AU','CA','FR','IT','JP','MX','GB','US' …
    M <- gvisMotionChart(DF, idvar="country", timevar="year",
                         xvar='EN.ATM.CO2E.KT', yvar='NY.GDP.MKTP.CD')
    plot(M)
97. Case study and exercise
98. Exercise III: Analyzing NBA data
· Calculate the winning rate for each season, and draw a bar chart
· Calculate the winning rate for each season at home and on the road, and draw a bar chart
· Using the home side's scores by season, draw a set of four histograms
· Using the home side's scores by season, draw boxplots for five seasons
· Draw boxplots of the scores in all games for the home side and the opposing side
· Calculate the average score and winning percentage against each opponent, and make a scatterplot to find the strong and the weak teams.
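The NBA data file itself is not part of this transcript, so the first two bullets can only be sketched on made-up data. In the example below the `games` data frame, its column names (`season`, `home`, `win`), and its six rows are all hypothetical; the real data will need its own column names substituted.

```r
# Hypothetical game log: one row per game for a single team.
games <- data.frame(
  season = c("2010-11", "2010-11", "2010-11", "2011-12", "2011-12", "2011-12"),
  home   = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
  win    = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# The mean of a logical vector is the share of TRUE values, i.e. the winning rate.
win_rate <- aggregate(win ~ season, data = games, FUN = mean)
print(win_rate)

# The same rate split by home and road games.
win_rate_by_venue <- aggregate(win ~ season + home, data = games, FUN = mean)
print(win_rate_by_venue)

# Bar chart of the winning rate per season.
barplot(win_rate$win, names.arg = win_rate$season,
        ylim = c(0, 1), ylab = "Winning rate")
```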