r for data science 4. exploratory data analysis clean -rev -ref

Exploratory Data Analysis (EDA)
Outline
• Exploratory Data Analysis (EDA)
• Variation: Univariate distributions
  • Categorical variable
  • Continuous variable
• Covariation: Bivariate distributions
  • Continuous & Categorical
  • Categorical & Categorical
  • Continuous & Continuous

• Iterative cycle of Exploratory Data Analysis (EDA)
1. Generate questions about your data
   1. What type of variation occurs within my variables?
   2. What type of covariation occurs between my variables?
2. Search for answers: transform, visualize, and model your data
3. Use what you learn to refine your questions and/or generate new questions.

1. Generate questions about your data
   1) What type of variation occurs within my variables?
      • Univariate distributions
   2) What type of covariation occurs between my variables?
      • Bivariate distributions

Variation:
Univariate distributions

diamonds data
diamonds
# # A tibble: 53,940 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
# 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# # ... with 53,930 more rows
help(diamonds)

diamonds {ggplot2} R Documentation
Prices of 50,000 round cut diamonds
Description
A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
Usage
diamonds
Format
A data frame with 53940 rows and 10 variables:
price
price in US dollars ($326–$18,823)
carat
weight of the diamond (0.2–5.01)
cut
quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color
diamond colour, from J (worst) to D (best)
clarity
a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x
length in mm (0–10.74)
y
width in mm (0–58.9)
z
depth in mm (0–31.8)
depth
total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table
width of top of diamond relative to widest point (43–95)
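
The depth formula in the help page can be checked against the data. The following is a minimal sketch, assuming the dplyr and ggplot2 packages are installed. The column name depth_calc and the factor of 100 (to express the ratio as a percentage, in line with the 43–79 range) are assumptions added here, not part of the help text.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Recompute total depth percentage from the x, y, z dimensions.
# depth_calc is a hypothetical column name used only for this check.
depth_check <- diamonds %>%
  filter(x > 0, y > 0, z > 0) %>%                 # drop rows with impossible zero dimensions
  mutate(depth_calc = 100 * 2 * z / (x + y)) %>%  # assumed factor of 100 for a percentage
  select(depth, depth_calc, x, y, z)

# Typical absolute gap between the recorded and the recomputed depth
median(abs(depth_check$depth - depth_check$depth_calc))
```

A small median gap suggests that most rows follow the formula; rows with a large gap are candidates for the unusual values discussed later.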

Categorical variable vs.
Continuous variable
diamonds %>%
count(cut)
#> # A tibble: 5 x 2
#> cut n
#> <ord> <int>
#> 1 Fair 1610
#> 2 Good 4906
#> 3 Very Good 12082
#> 4 Premium 13791
#> 5 Ideal 21551
diamonds %>%
count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#> `cut_width(carat, 0.5)` n
#> <fct> <int>
#> 1 [-0.25,0.25] 785
#> 2 (0.25,0.75] 29498
#> 3 (0.75,1.25] 15977
#> 4 (1.25,1.75] 5313
#> 5 (1.75,2.25] 2002
#> 6 (2.25,2.75] 322
#> # ... with 5 more rows
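
The first bin in the output above, [-0.25,0.25], extends below zero because cut_width() by default places bin edges half a bin width away from zero. The sketch below, which assumes dplyr and ggplot2 are installed, uses the boundary argument of cut_width() to align the bin edges to 0 instead. The names carat_bins and bin are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides diamonds and cut_width()

# Bin carat into intervals of width 0.5 whose edges fall on multiples of 0.5
carat_bins <- diamonds %>%
  count(bin = cut_width(carat, 0.5, boundary = 0))

carat_bins
```

Every diamond still falls into exactly one bin, so the counts add up to the number of rows in diamonds.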

Visualizing univariate distributions
Categorical variable
geom_bar
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))

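geom_histogram
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

geom_histogram
# The slide omits the geom layer; binwidth = 0.1 is an assumption
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.1)

geom_freqpoly
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_freqpoly(mapping = aes(x = carat), binwidth = 0.1)
[Figure: frequency polygon of carat (x-axis: carat, 0 to 3; y-axis: count, 0 to 10000)]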

Visualizing bivariate distributions
Continuous & Categorical
geom_freqpoly & color
# Data filtered to carat < 3 as on the preceding slides (assumption)
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_freqpoly(mapping = aes(x = carat, color = cut), binwidth = 0.1)

• Questions to ask
• Which values are the most common? Why?
• Which values are rare? Why? Does that match your expectations?
• Can you see any unusual patterns? What might explain them?

Why are there more diamonds at whole carats and common fractions of carats?
Why are there more diamonds slightly to the right of each peak than there are slightly to
the left of each peak?
Why are there no diamonds bigger than 3 carats?

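Unusual values (outliers, rare values)
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

#@ zoom in on the y-axis to make the rare values visible
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))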

unusual <- diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
#> # A tibble: 9 x 4
#> price x y z
#> <int> <dbl> <dbl> <dbl>
#> 1 5139 0 0 0
#> 2 6381 0 0 0
#> 3 12800 0 0 0
#> 4 15686 0 0 0
#> 5 18034 0 0 0
#> 6 2130 0 0 0
#> 7 2130 0 0 0
#> 8 2075 5.15 31.8 5.12
#> 9 12210 8.09 58.9 8.06
#@ replacing the unusual values with missing values
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
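
As a quick check of the replacement above, the sketch below counts the missing values it creates and then plots the cleaned data. It assumes dplyr and ggplot2 are installed; the names n_missing and p are illustrative, and na.rm = TRUE is used only to suppress the warning ggplot2 would otherwise give about removed rows.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# Number of y values turned into NA (one per unusual row listed above)
n_missing <- sum(is.na(diamonds2$y))
n_missing

# Rows with a missing y are left out of the plot; na.rm = TRUE silences the warning
p <- ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)
p
```

Replacing values with NA rather than dropping whole rows keeps the other measurements of those diamonds available for later analysis.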

Missing values vs. non-missing values
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)

More on visualizing univariate distribution

Covariation:
Bivariate distributions

ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
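
Mapping y to density rescales each frequency polygon so that the area under it is one, which makes cuts with very different counts comparable. The sketch below assumes ggplot2 3.3.0 or later, where after_stat(density) is the current spelling of ..density..; the names p, built, and area are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

p <- ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

# Extract the values ggplot2 computed for the first layer
built <- layer_data(p, 1)

# Within each cut (group), density * binwidth sums to 1
area <- tapply(built$density, built$group, sum) * 500
area
```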

boxplot

geom_boxplot
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()

geom_boxplot
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()

geom_boxplot
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

geom_boxplot
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
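
reorder() in the plots above changes only the order of the factor levels, so the boxes are drawn from the lowest to the highest median hwy. The sketch below shows the reordered levels directly; the names class_by_hwy and medians are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Reorder the levels of class by the median of hwy within each class
class_by_hwy <- reorder(mpg$class, mpg$hwy, FUN = median)
levels(class_by_hwy)

# Median hwy per class, listed in the new level order
medians <- tapply(mpg$hwy, mpg$class, median)
medians[levels(class_by_hwy)]
```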

Categorical & Categorical

bivariate distributions
"long format": unique identifier = composite key
#@ count in a "long format": the unique identifier is the composite key consisting of color & cut
diamonds %>%
group_by(color, cut) %>% summarize(count = n()) %>% print(n = Inf)
diamonds %>%
count(color, cut) %>% print(n = Inf)
# # A tibble: 35 x 3
# color cut n
# <ord> <ord> <int>
# 1 D Fair 163
# 2 D Good 662
# 3 D Very Good 1513
# 4 D Premium 1603
# 5 D Ideal 2834
# 6 E Fair 224
# 7 E Good 933
# 8 E Very Good 2400
# 9 E Premium 2337
# 10 E Ideal 3903
# 11 F Fair 312
# 12 F Good 909
# 13 F Very Good 2164
# 14 F Premium 2331
# 15 F Ideal 3826
# 16 G Fair 314
# 17 G Good 871
# 18 G Very Good 2299
# 19 G Premium 2924
# 20 G Ideal 4884
# 21 H Fair 303
# 22 H Good 702
# 23 H Very Good 1824
# 24 H Premium 2360
# 25 H Ideal 3115
# 26 I Fair 175
# 27 I Good 522
# 28 I Very Good 1204
# 29 I Premium 1428
# 30 I Ideal 2093
# 31 J Fair 119
# 32 J Good 307
# 33 J Very Good 678
# 34 J Premium 808
# 35 J Ideal 896

"wide format": one variable -> multiple columns
#@ "wide format": one variable -> multiple columns (spread)
diamonds %>%
select(color, cut) %>%
table %>% addmargins
# cut
# color Fair Good Very Good Premium Ideal Sum
# D 163 662 1513 1603 2834 6775
# E 224 933 2400 2337 3903 9797
# F 312 909 2164 2331 3826 9542
# G 314 871 2299 2924 4884 11292
# H 303 702 1824 2360 3115 8304
# I 175 522 1204 1428 2093 5422
# J 119 307 678 808 896 2808
# Sum 1610 4906 12082 13791 21551 53940
#@ spread(): "long format" -> "wide format": one variable -> multiple columns
diamonds %>%
count(color, cut) %>%
spread(key = cut, value = n)
# # A tibble: 7 x 6
# color Fair Good `Very Good` Premium Ideal
# <ord> <int> <int> <int> <int> <int>
# 1 D 163 662 1513 1603 2834
# 2 E 224 933 2400 2337 3903
# 3 F 312 909 2164 2331 3826
# 4 G 314 871 2299 2924 4884
# 5 H 303 702 1824 2360 3115
# 6 I 175 522 1204 1428 2093
# 7 J 119 307 678 808 896

geom_count
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

geom_tile
diamonds %>% count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))

"long format": unique identifier = composite key
#@ gather(): "wide format" -> "long format": multiple columns -> one variable (key)
#@ diamonds_wide (a name introduced here) holds the "wide format" table built with count() and spread()
diamonds_wide <- diamonds %>%
  count(color, cut) %>%
  spread(key = cut, value = n)
#@ three equivalent ways to select the columns to gather
diamonds_wide %>%
  gather(key = cut, value = n, Fair, Good, `Very Good`, Premium, Ideal)
diamonds_wide %>%
  gather(key = cut, value = n, Fair:Ideal)
diamonds_wide %>%
  gather(key = cut, value = n, -color)
# > diamonds %>%
# + count(color, cut) %>%
# + spread(key = cut, value = n) %>%
# + gather(key = cut, value = n, -color)
# # A tibble: 35 x 3
# color cut n
# <ord> <chr> <int>
# 1 D Fair 163
# 2 E Fair 224
# 3 F Fair 312
# 4 G Fair 314
# 5 H Fair 303
# 6 I Fair 175
# 7 J Fair 119
# 8 D Good 662
# 9 E Good 933
# 10 F Good 909
# # ... with 25 more rows
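
In current tidyr releases, spread() and gather() are superseded by pivot_wider() and pivot_longer(), although the older functions still work. The sketch below repeats the same round trip with the newer functions, assuming tidyr 1.0.0 or later; the names wide and long are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the diamonds dataset

# "long format" -> "wide format": one column per cut
wide <- diamonds %>%
  count(color, cut) %>%
  pivot_wider(names_from = cut, values_from = n)

# "wide format" -> "long format": every column except color is gathered
long <- wide %>%
  pivot_longer(cols = -color, names_to = "cut", values_to = "n")

long
```

As with gather(), the cut column comes back as a character vector rather than an ordered factor, so the original level order has to be restored separately if it matters.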

Continuous & Continuous

geom_point
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

geom_point
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)

geom_bin2d
ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price))

geom_hex
# install.packages("hexbin")
ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price))

-> Continuous & Categorical (bin one continuous variable and treat it as categorical)
ggplot(diamonds %>% filter(carat < 3), mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

More on visualizing bivariate distribution

More on visualizing trivariate distribution

REFERENCES
1. RStudio official documentation (Help & Cheat Sheets). Free web page: https://www.rstudio.com/resources/cheatsheets/
2. Wickham, H. and Grolemund, G., 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Free web page: https://r4ds.had.co.nz/
Cf. These slides use tidyverse syntax (www.tidyverse.org) rather than base R syntax.
Cf. Hadley Wickham: Chief Scientist at RStudio; Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University.
