REFERENCES
#1. RStudio Official Documentation (Help & Cheat Sheets)
Free Webpage) https://www.rstudio.com/resources/cheatsheets/
#2. Wickham, H. and Grolemund, G., 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly.
Free Webpage) https://r4ds.had.co.nz/
Cf) Tidyverse syntax (www.tidyverse.org), rather than base R syntax
Cf) Hadley Wickham: Chief Scientist at RStudio. Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University
2. Exploratory Data Analysis (EDA)
• Iterative cycle of Exploratory Data Analysis (EDA)
1. Generate questions about your data
   1) What type of variation occurs within my variables?
   2) What type of covariation occurs between my variables?
2. Search for answers: transform, visualize, and model your data
3. Use what you learn to refine your questions and/or generate new questions.
3. Exploratory Data Analysis (EDA)
1. Generate questions about your data
   1) What type of variation occurs within my variables?
      • Univariate distributions
   2) What type of covariation occurs between my variables?
      • Bivariate distributions
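The two questions above can be made concrete with a small numeric sketch. It assumes the ggplot2 package (which ships the diamonds data used in the following slides) is installed; the variable names are illustrative only, and carat and price are just one possible pair of variables to look at.

```r
library(ggplot2)  # provides the diamonds data set

# Variation within one variable: how much carat spreads out on its own
carat_var <- var(diamonds$carat)

# Covariation between two variables: how carat and price move together
carat_price_cov <- cov(diamonds$carat, diamonds$price)
carat_price_cor <- cor(diamonds$carat, diamonds$price)  # scale-free version of the covariance

print(c(variance = carat_var,
        covariance = carat_price_cov,
        correlation = carat_price_cor))
```

Single numbers like these hide the shape of a distribution, which is why the rest of the deck relies on counts and plots instead.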
5. diamonds data
library(tidyverse)  # loads ggplot2 (which provides diamonds), dplyr, and tidyr
diamonds
# # A tibble: 53,940 x 10
#    carat cut       color clarity depth table price     x     y     z
#    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
# 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# # ... with 53,930 more rows
help(diamonds)
6. diamonds {ggplot2} R Documentation
Prices of over 50,000 round cut diamonds
Description
A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
Usage
diamonds
Format
A data frame with 53940 rows and 10 variables:
price: price in US dollars ($326–$18,823)
carat: weight of the diamond (0.2–5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond colour, from J (worst) to D (best)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm (0–10.74)
y: width in mm (0–58.9)
z: depth in mm (0–31.8)
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table: width of top of diamond relative to widest point (43–95)
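The depth formula in the help page can be checked against the data with a short sketch. Two assumptions here are mine, not the help page's: the ratio is multiplied by 100 (the stated 43–79 range suggests a percentage), and rows with a zero x or y are dropped to avoid dividing by zero. The column names depth_calc and depth_diff are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

depth_check <- diamonds %>%
  filter(x > 0, y > 0) %>%                              # skip rows with zero dimensions
  mutate(depth_calc = 100 * 2 * z / (x + y),            # help-page formula, scaled to percent
         depth_diff = abs(depth - depth_calc)) %>%      # gap between recorded and recomputed depth
  select(x, y, z, depth, depth_calc, depth_diff)

head(depth_check)
```

For the first row this gives 100 * 2 * 2.43 / (3.95 + 3.98), about 61.3, against a recorded depth of 61.5, so the recorded column appears to agree with the formula only up to rounding and measurement error.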
7. Categorical variable vs. Continuous variable
diamonds
# # A tibble: 53,940 x 10
#    carat cut       color clarity depth table price     x     y     z
#    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
# 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# # ... with 53,930 more rows
diamonds %>%
  count(cut)
#> # A tibble: 5 x 2
#>   cut           n
#>   <ord>     <int>
#> 1 Fair       1610
#> 2 Good       4906
#> 3 Very Good 12082
#> 4 Premium   13791
#> 5 Ideal     21551
diamonds %>%
  count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#>   `cut_width(carat, 0.5)`     n
#>   <fct>                   <int>
#> 1 [-0.25,0.25]              785
#> 2 (0.25,0.75]             29498
#> 3 (0.75,1.25]             15977
#> 4 (1.25,1.75]              5313
#> 5 (1.75,2.25]              2002
#> 6 (2.25,2.75]               322
#> # ... with 5 more rows
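The two count() calls above have direct visual counterparts: a bar chart for the categorical variable and a histogram for the continuous one. The following is a minimal sketch of that pairing, not necessarily the plots used in the original slides; the object names are illustrative, and layer_data() is used only to pull the computed bar heights back out of each plot.

```r
library(ggplot2)

# Categorical variable: one bar per level of cut
bar_plot <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Continuous variable: carat is binned into intervals of width 0.5 first
hist_plot <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

# Heights that ggplot2 computed for each bar / bin
bar_counts  <- layer_data(bar_plot)$count
hist_counts <- layer_data(hist_plot)$count

bar_counts   # same numbers as count(cut) above
```

Printing bar_plot or hist_plot at the console draws the chart; the bar heights match the n column from count(cut).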
13. Visualizing univariate distributions
• Questions to ask
  • Which values are the most common? Why?
  • Which values are rare? Why? Does that match your expectations?
  • Can you see any unusual patterns? What might explain them?
14. Why are there more diamonds at whole carats and common fractions of carats?
Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
Why are there so few diamonds bigger than 3 carats? (The histogram code keeps only carat < 3.)
ggplot(data = diamonds %>% filter(carat < 3)) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.01)
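One way to start answering the questions above is to tabulate the same data the histogram uses and sort by frequency. This is a sketch under the assumption that exact carat values (recorded to two decimals) are a reasonable stand-in for the 0.01-wide bins; carat_counts is an illustrative name.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

carat_counts <- diamonds %>%
  filter(carat < 3) %>%          # same subset as the histogram
  count(carat, sort = TRUE)      # most frequent carat values first

head(carat_counts, 10)
```

Reading the top rows next to the histogram shows which carat values form the peaks and how quickly the counts fall off on either side.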
29. Exploratory Data Analysis (EDA)
Covariation:
Bivariate distributions
Categorical & Categorical
30. Bivariate distributions
Categorical & Categorical
"long format": unique identifier = composite key
#@ Count in a "long format": the unique identifier is the composite key consisting of color & cut
diamonds %>%
  group_by(color, cut) %>%
  summarize(count = n()) %>%
  print(n = Inf)
#@ Equivalent shortcut: count() groups and tallies in one step, naming the result column n
diamonds %>%
  count(color, cut) %>%
  print(n = Inf)
# # A tibble: 35 x 3
#    color cut           n
#    <ord> <ord>     <int>
#  1 D     Fair        163
#  2 D     Good        662
#  3 D     Very Good  1513
#  4 D     Premium    1603
#  5 D     Ideal      2834
#  6 E     Fair        224
#  7 E     Good        933
#  8 E     Very Good  2400
#  9 E     Premium    2337
# 10 E     Ideal      3903
# 11 F     Fair        312
# 12 F     Good        909
# 13 F     Very Good  2164
# 14 F     Premium    2331
# 15 F     Ideal      3826
# 16 G     Fair        314
# 17 G     Good        871
# 18 G     Very Good  2299
# 19 G     Premium    2924
# 20 G     Ideal      4884
# 21 H     Fair        303
# 22 H     Good        702
# 23 H     Very Good  1824
# 24 H     Premium    2360
# 25 H     Ideal      3115
# 26 I     Fair        175
# 27 I     Good        522
# 28 I     Very Good  1204
# 29 I     Premium    1428
# 30 I     Ideal      2093
# 31 J     Fair        119
# 32 J     Good        307
# 33 J     Very Good   678
# 34 J     Premium     808
# 35 J     Ideal       896
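The long-format counts above are already in the shape ggplot2 expects, so they can be drawn as a heat map with one tile per (color, cut) pair. This is a sketch of one common way to visualize two categorical variables, not necessarily the plot used in the original slides; color_cut_counts and tile_plot are illustrative names.

```r
library(dplyr)
library(ggplot2)

# One row per (color, cut) combination, with the count in column n
color_cut_counts <- diamonds %>%
  count(color, cut)

# Fill each tile according to how many diamonds fall in that combination
tile_plot <- ggplot(data = color_cut_counts, mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))

tile_plot
```

Because each row is uniquely identified by the composite key of color and cut, each row maps to exactly one tile.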
31. Bivariate distributions
Categorical & Categorical
"wide format": one variable -> multiple columns
#@ "wide format" with base R: table() cross-tabulates, addmargins() appends row and column sums
diamonds %>%
  select(color, cut) %>%
  table() %>%
  addmargins()
#      cut
# color  Fair  Good Very Good Premium Ideal   Sum
#   D     163   662      1513    1603  2834  6775
#   E     224   933      2400    2337  3903  9797
#   F     312   909      2164    2331  3826  9542
#   G     314   871      2299    2924  4884 11292
#   H     303   702      1824    2360  3115  8304
#   I     175   522      1204    1428  2093  5422
#   J     119   307       678     808   896  2808
#   Sum  1610  4906     12082   13791 21551 53940
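Raw counts in a cross-table are hard to compare across colors because the row totals differ so much. A possible next step, sketched here with base R's prop.table(), is to convert each row to proportions so that every color sums to 1. This goes slightly beyond what the slide shows; the object names are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# Same cross-tabulation as above, built directly from the two columns
color_cut_table <- table(diamonds$color, diamonds$cut, dnn = c("color", "cut"))

# margin = 1 divides every cell by its row total (share of each cut within a color)
row_props <- prop.table(color_cut_table, margin = 1)

round(row_props, 3)
```

Using margin = 2 instead would give the share of each color within a cut.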
#@ spread(): "long format" -> "wide format": one variable -> multiple columns
diamonds %>%
  count(color, cut) %>%
  spread(key = cut, value = n)
# # A tibble: 7 x 6
#   color  Fair  Good `Very Good` Premium Ideal
#   <ord> <int> <int>       <int>   <int> <int>
# 1 D       163   662        1513    1603  2834
# 2 E       224   933        2400    2337  3903
# 3 F       312   909        2164    2331  3826
# 4 G       314   871        2299    2924  4884
# 5 H       303   702        1824    2360  3115
# 6 I       175   522        1204    1428  2093
# 7 J       119   307         678     808   896
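In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The sketch below shows the same long-to-wide reshaping with the newer function, assuming a tidyr version that provides it; wide_counts is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the diamonds data set

wide_counts <- diamonds %>%
  count(color, cut) %>%                            # long format: one row per (color, cut)
  pivot_wider(names_from = cut, values_from = n)   # wide format: one column per cut level

wide_counts
```

names_from plays the role of spread()'s key argument and values_from the role of value, so the result has one row per color and one column per cut level.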