Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Outline (개요)
• Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
• Variation (다양성, 분산): Univariate distributions (일변량 분포)
• Categorical (범주형) variable (변수)
• Continuous (연속형) variable (변수)
• Covariation (공분산): Bivariate distributions (이변량 분포)
• Continuous (연속형) & Categorical (범주형)
• Categorical (범주형) & Categorical (범주형)
• Continuous (연속형) & Continuous (연속형)
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
• Iterative cycles (반복 순환) of Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
1. Generate questions (질문) about your data
   1) What type of variation (다양성, 분산) occurs within my variables?
   2) What type of covariation (공분산) occurs between my variables?
2. Search for answers (답): Transform (변환하다) & Visualize (시각화하다) & Model (모형을 만들다)
3. Use what you learn to refine (개선하다) your questions and/or generate (생성하다) new questions.
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
1. Generate questions (질문) about your data
1) What type of variation (다양성, 분산) occurs within my variables?
• Univariate distributions (일변량 분포)
2) What type of covariation (공분산) occurs between my variables?
• Bivariate distributions (이변량 분포)
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Variation (다양성, 분산):
Univariate distributions (일변량 분포)
diamonds data
library(tidyverse)  # loads ggplot2 (which ships the diamonds data), dplyr, and tidyr
diamonds
# # A tibble: 53,940 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
# 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# # ... with 53,930 more rows
help(diamonds)
diamonds {ggplot2} R Documentation
Prices of over 50,000 round cut diamonds
Description
A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
Usage
diamonds
Format
A data frame with 53940 rows and 10 variables:
price
price in US dollars ($326–$18,823)
carat
weight of the diamond (0.2–5.01)
cut
quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color
diamond colour, from J (worst) to D (best)
clarity
a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x
length in mm (0–10.74)
y
width in mm (0–58.9)
z
depth in mm (0–31.8)
depth
total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table
width of top of diamond relative to widest point (43–95)
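The depth formula quoted in the help page can be compared against the recorded depth column. The following is a minimal sketch, assuming the ggplot2 and dplyr packages are installed. The factor of 100 is an assumption based on the documented range (43–79) being a percentage, and depth_calc and diff are illustrative names, not columns of the dataset. Rows where x + y is zero are skipped to avoid dividing by zero.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

depth_check <- diamonds %>%
  filter(x + y > 0) %>%                   # skip rows with no recorded length and width
  mutate(
    depth_calc = 100 * 2 * z / (x + y),   # help-page formula, scaled to a percentage (assumption)
    diff       = abs(depth - depth_calc)  # gap between recorded and recomputed depth
  ) %>%
  select(x, y, z, depth, depth_calc, diff)

head(depth_check)
summary(depth_check$diff)
```

Large values of diff point to rows whose dimensions may have been recorded incorrectly, which connects to the unusual values examined later in the deck.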
Categorical (범주형) variable (변수) vs.
Continuous (연속형) variable (변수)
diamonds
# # A tibble: 53,940 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
# 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
# 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
# 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
# 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
# 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# # ... with 53,930 more rows
diamonds %>%
count(cut)
#> # A tibble: 5 x 2
#> cut n
#> <ord> <int>
#> 1 Fair 1610
#> 2 Good 4906
#> 3 Very Good 12082
#> 4 Premium 13791
#> 5 Ideal 21551
diamonds %>%
count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#> `cut_width(carat, 0.5)` n
#> <fct> <int>
#> 1 [-0.25,0.25] 785
#> 2 (0.25,0.75] 29498
#> 3 (0.75,1.25] 15977
#> 4 (1.25,1.75] 5313
#> 5 (1.75,2.25] 2002
#> 6 (2.25,2.75] 322
#> # ... with 5 more rows
Visualizing univariate distributions (일변량 분포의 시각화)
Categorical (범주형) variable (변수)
geom_bar
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
Visualizing univariate distributions (일변량 분포의 시각화)
Continuous (연속형) variable (변수)
geom_histogram
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
Visualizing univariate distributions (일변량 분포의 시각화)
Continuous (연속형) variable (변수)
geom_histogram
ggplot(data = diamonds %>% filter(carat < 3)) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
Visualizing univariate distributions (일변량 분포의 시각화)
Continuous (연속형) variable (변수)
geom_freqpoly
ggplot(data = diamonds %>% filter(carat < 3)) +
geom_freqpoly(mapping = aes(x = carat), binwidth = 0.1)
[Plot: frequency polygon of carat; x-axis: carat (0–3), y-axis: count (0–10000)]
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_freqpoly & color
ggplot(data = diamonds %>% filter(carat < 3)) +
geom_freqpoly(mapping = aes(x = carat, color = cut), binwidth = 0.1)
Visualizing univariate distributions (일변량 분포의 시각화)
• Questions to ask
• Which values are the most common? Why?
• Which values are rare? Why? Does that match your expectations?
• Can you see any unusual patterns? What might explain them?
Why are there more diamonds at whole carats and common fractions of carats?
Why are there more diamonds slightly to the right of each peak than there are slightly to
the left of each peak?
Why are there no diamonds bigger than 3 carats?
ggplot(data = diamonds %>% filter(carat < 3)) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.01)
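Because these plots are drawn after filter(carat < 3), it can help to check how many diamonds the filter drops before reading anything into the right edge of the histogram. A minimal sketch, assuming ggplot2 and dplyr are installed; large is an illustrative name.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# Diamonds excluded by the carat < 3 filter used in the plots
large <- diamonds %>%
  filter(carat >= 3)

nrow(large)         # how many diamonds are at or above 3 carats
range(large$carat)  # smallest and largest carat among them
```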
Visualizing univariate distributions (일변량 분포의 시각화)
Unusual values (outliers, 특이값, 드문 값)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
Visualizing univariate distributions (일변량 분포의 시각화)
Unusual values (outliers, 특이값, 드문 값)
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
Unusual values (outliers, 특이값, 드문 값)
unusual <- diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
#> # A tibble: 9 x 4
#> price x y z
#> <int> <dbl> <dbl> <dbl>
#> 1 5139 0 0 0
#> 2 6381 0 0 0
#> 3 12800 0 0 0
#> 4 15686 0 0 0
#> 5 18034 0 0 0
#> 6 2130 0 0 0
#> 7 2130 0 0 0
#> 8 2075 5.15 31.8 5.12
#> 9 12210 8.09 58.9 8.06
#@ replacing the unusual values with missing values
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
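After the replacement it is worth confirming how many values became missing and seeing how ggplot2 treats them. A minimal sketch, assuming ggplot2 and dplyr are installed; it repeats the diamonds2 definition so that it runs on its own, and p is an illustrative name.

```r
library(ggplot2)
library(dplyr)

diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# Count the values that were set to NA; this should match the
# number of rows in the `unusual` table
n_missing <- sum(is.na(diamonds2$y))
n_missing

# Missing values are dropped from the plot; na.rm = TRUE suppresses
# the warning that ggplot2 would otherwise print about removed rows
p <- ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)
p
```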
Visualizing univariate distributions (일변량 분포의 시각화)
Missing values vs. non-missing values
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
More on visualizing univariate distribution
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Covariation (공분산):
Bivariate distributions (이변량 분포)
Continuous (연속형) & Categorical (범주형)
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_freqpoly & color
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
Visualizing univariate distributions (일변량 분포의 시각화)
Categorical (범주형) variable (변수)
geom_bar
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_freqpoly & color
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +  # ..density.. is deprecated since ggplot2 3.4.0
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
boxplot
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_boxplot
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_boxplot
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_boxplot
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Categorical (범주형)
geom_boxplot
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
coord_flip()
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Covariation (공분산):
Bivariate distributions (이변량 분포)
Categorical (범주형) & Categorical (범주형)
bivariate distributions (이변량 분포)
Categorical (범주형) & Categorical (범주형)
"long format": unique identifier (고유 식별자) = composite key (복합키)
#@ count in a "long format": the unique identifier is the composite key consisting of color & cut
diamonds %>%
group_by(color, cut) %>% summarize(count = n()) %>% print(n = Inf)
diamonds %>%
count(color, cut) %>% print(n = Inf)
# # A tibble: 35 x 3
# color cut n
# <ord> <ord> <int>
# 1 D Fair 163
# 2 D Good 662
# 3 D Very Good 1513
# 4 D Premium 1603
# 5 D Ideal 2834
# 6 E Fair 224
# 7 E Good 933
# 8 E Very Good 2400
# 9 E Premium 2337
# 10 E Ideal 3903
# 11 F Fair 312
# 12 F Good 909
# 13 F Very Good 2164
# 14 F Premium 2331
# 15 F Ideal 3826
# 16 G Fair 314
# 17 G Good 871
# 18 G Very Good 2299
# 19 G Premium 2924
# 20 G Ideal 4884
# 21 H Fair 303
# 22 H Good 702
# 23 H Very Good 1824
# 24 H Premium 2360
# 25 H Ideal 3115
# 26 I Fair 175
# 27 I Good 522
# 28 I Very Good 1204
# 29 I Premium 1428
# 30 I Ideal 2093
# 31 J Fair 119
# 32 J Good 307
# 33 J Very Good 678
# 34 J Premium 808
# 35 J Ideal 896
bivariate distributions (이변량 분포)
Categorical (범주형) & Categorical (범주형)
"wide format": one variable -> multiple columns
#@ "wide format": one variable -> multiple columns (spread)
diamonds %>%
select(color, cut) %>%
table %>% addmargins
# cut
# color Fair Good Very Good Premium Ideal Sum
# D 163 662 1513 1603 2834 6775
# E 224 933 2400 2337 3903 9797
# F 312 909 2164 2331 3826 9542
# G 314 871 2299 2924 4884 11292
# H 303 702 1824 2360 3115 8304
# I 175 522 1204 1428 2093 5422
# J 119 307 678 808 896 2808
# Sum 1610 4906 12082 13791 21551 53940
#@ spread(): "long format" -> "wide format": one variable -> multiple columns
diamonds %>%
count(color, cut) %>%
spread(key = cut, value = n)
# # A tibble: 7 x 6
# color Fair Good `Very Good` Premium Ideal
# <ord> <int> <int> <int> <int> <int>
# 1 D 163 662 1513 1603 2834
# 2 E 224 933 2400 2337 3903
# 3 F 312 909 2164 2331 3826
# 4 G 314 871 2299 2924 4884
# 5 H 303 702 1824 2360 3115
# 6 I 175 522 1204 1428 2093
# 7 J 119 307 678 808 896
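In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The same long-to-wide reshaping might be written as follows. This is a sketch, assuming ggplot2, dplyr, and a sufficiently recent tidyr are installed; wide is an illustrative name.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)
library(tidyr)    # pivot_wider() requires tidyr >= 1.0.0

# One row per color, one column per level of cut, cell = count
wide <- diamonds %>%
  count(color, cut) %>%
  pivot_wider(names_from = cut, values_from = n)

wide
```

names_from plays the role of spread()'s key argument and values_from the role of value.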
Visualizing bivariate distributions (이변량 분포의 시각화)
Categorical (범주형) & Categorical (범주형)
geom_count
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
Visualizing bivariate distributions (이변량 분포의 시각화)
Categorical (범주형) & Categorical (범주형)
geom_tile
diamonds %>% count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
bivariate distributions (이변량 분포)
Categorical (범주형) & Categorical (범주형)
"long format": unique identifier (고유 식별자) = composite key (복합키)
#@ gather(): "wide format" -> "long format": multiple columns -> one variable (key)
#@ "wide format": one variable -> multiple columns (spread)
diamonds %>%
count(color, cut) %>%
spread(key = cut, value = n) %>%
gather(key = cut, value = n, Fair, Good, `Very Good`, Premium, Ideal)
diamonds %>%
count(color, cut) %>%
spread(key = cut, value = n) %>%
gather(key = cut, value = n, Fair:Ideal)
diamonds %>%
count(color, cut) %>%
spread(key = cut, value = n) %>%
gather(key = cut, value = n, -color)
# > diamonds %>%
# + count(color, cut) %>%
# + spread(key = cut, value = n) %>%
# + gather(key = cut, value = n, -color)
# # A tibble: 35 x 3
# color cut n
# <ord> <chr> <int>
# 1 D Fair 163
# 2 E Fair 224
# 3 F Fair 312
# 4 G Fair 314
# 5 H Fair 303
# 6 I Fair 175
# 7 J Fair 119
# 8 D Good 662
# 9 E Good 933
# 10 F Good 909
# # ... with 25 more rows
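Likewise, gather() is superseded by pivot_longer() in tidyr 1.0.0 and later. A sketch of the same round trip, assuming ggplot2, dplyr, and a recent tidyr are installed; long is an illustrative name.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)
library(tidyr)    # pivot_longer() requires tidyr >= 1.0.0

long <- diamonds %>%
  count(color, cut) %>%
  pivot_wider(names_from = cut, values_from = n) %>%  # long -> wide
  pivot_longer(
    cols      = -color,  # every column except color holds counts
    names_to  = "cut",   # former column names go into this variable
    values_to = "n"      # cell values go into this variable
  )

long
```

As with gather(), cut comes back as a character column rather than an ordered factor. The row order differs from the gather() output, because pivot_longer() walks through each row of the wide table in turn.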
Exploratory Data Analysis (EDA, 탐색적 데이터 분석)
Covariation (공분산):
Bivariate distributions (이변량 분포)
Continuous (연속형) & Continuous (연속형)
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
geom_point
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
geom_point
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
geom_bin2d
ggplot(data = diamonds %>% filter(carat < 3)) +
geom_bin2d(mapping = aes(x = carat, y = price))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
geom_hex
# install.packages("hexbin")
ggplot(data = diamonds %>% filter(carat < 3)) +
geom_hex(mapping = aes(x = carat, y = price))
Visualizing bivariate distributions (이변량 분포의 시각화)
Continuous (연속형) & Continuous (연속형)
-> Continuous (연속형) & Categorical (범주형)
ggplot(diamonds %>% filter(carat < 3), mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
More on visualizing bivariate distribution
More on visualizing trivariate distribution
REFERENCES
#1. RStudio Official Documentation (Help & Cheat Sheets)
Free Webpage) https://www.rstudio.com/resources/cheatsheets/
#2. Wickham, H. and Grolemund, G., 2016. R for data science:
import, tidy, transform, visualize, and model data. O'Reilly.
Free Webpage) https://r4ds.had.co.nz/
Cf) Tidyverse syntax (www.tidyverse.org), rather than R Base
syntax
Cf) Hadley Wickham: Chief Scientist at RStudio. Adjunct
Professor of Statistics at the University of Auckland, Stanford
University, and Rice University

  • 2. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) • Iterative cycles (반복 순환) of Exploratory Data Analysis (EDA, 탐색적 데이터 분석) 1. Generate questions (질문) about your data 1. What type of variation (다양성, 분산) occurs within my variables? 2. What type of covariation (공분산) occurs between my variables? 2. Search for answers (답): Transform (변환하다) & Visualize (시각화하다) & Model(모형을 만들다) 3. Use what you learn to refine (개선하다) your questions and/or generate (생성하다) new questions.
  • 3. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) 1. Generate questions (질문) about your data 1) What type of variation (다양성, 분산) occurs within my variables? • Univariate distributions (일변량 분포) 2) What type of covariation (공분산) occurs between my variables? • Bivariate distributions (이변량 분포)
  • 4. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) Variation (다양성, 분산): Univariate distributions (일변량 분포)
  • 5. diamonds data
    diamonds
    # # A tibble: 53,940 x 10
    #    carat cut       color clarity depth table price     x     y     z
    #    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    #  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
    #  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
    #  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
    #  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
    #  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
    #  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
    #  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
    #  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
    #  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
    # 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
    # # ... with 53,930 more rows
    help(diamonds)
  • 6. diamonds {ggplot2} R Documentation
    Prices of 50,000 round cut diamonds
    Description
      A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
    Usage
      diamonds
    Format
      A data frame with 53940 rows and 10 variables:
      price    price in US dollars ($326–$18,823)
      carat    weight of the diamond (0.2–5.01)
      cut      quality of the cut (Fair, Good, Very Good, Premium, Ideal)
      color    diamond colour, from J (worst) to D (best)
      clarity  a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
      x        length in mm (0–10.74)
      y        width in mm (0–58.9)
      z        depth in mm (0–31.8)
      depth    total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
      table    width of top of diamond relative to widest point (43–95)
  • 7. Categorical (범주형) variable (변수) vs. Continuous (연속형) variable (변수)
    diamonds
    # # A tibble: 53,940 x 10
    #    carat cut       color clarity depth table price     x     y     z
    #    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    #  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
    #  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
    #  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
    #  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
    #  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
    #  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
    #  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
    #  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
    #  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
    # 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
    # # ... with 53,930 more rows
    diamonds %>% count(cut)
    #> # A tibble: 5 x 2
    #>   cut           n
    #>   <ord>     <int>
    #> 1 Fair       1610
    #> 2 Good       4906
    #> 3 Very Good 12082
    #> 4 Premium   13791
    #> 5 Ideal     21551
    diamonds %>% count(cut_width(carat, 0.5))
    #> # A tibble: 11 x 2
    #>   `cut_width(carat, 0.5)`     n
    #>   <fct>                   <int>
    #> 1 [-0.25,0.25]              785
    #> 2 (0.25,0.75]             29498
    #> 3 (0.75,1.25]             15977
    #> 4 (1.25,1.75]              5313
    #> 5 (1.75,2.25]              2002
    #> 6 (2.25,2.75]               322
    #> # ... with 5 more rows
  • 8. Visualizing univariate distributions (일변량 분포의 시각화) Categorical (범주형) variable (변수) geom_bar ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
  • 9. Visualizing univariate distributions (일변량 분포의 시각화) Continuous (연속형) variable (변수) geom_histogram ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
  • 10. Visualizing univariate distributions (일변량 분포의 시각화) Continuous (연속형) variable (변수) geom_histogram ggplot(data = diamonds %>% filter(carat < 3)) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
  • 11. Visualizing univariate distributions (일변량 분포의 시각화) Continuous (연속형) variable (변수) geom_freqpoly ggplot(data = diamonds %>% filter(carat < 3)) + geom_freqpoly(mapping = aes(x = carat), binwidth = 0.1) [Figure: frequency polygon of count against carat]
  • 12. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Categorical (범주형) geom_freqpoly & color ggplot(data = diamonds %>% filter(carat < 3)) + geom_freqpoly(mapping = aes(x = carat, color = cut), binwidth = 0.1)
  • 13. Visualizing univariate distributions (일변량 분포의 시각화) • Questions to ask • Which values are the most common? Why? • Which values are rare? Why? Does that match your expectations? • Can you see any unusual patterns? What might explain them?
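The first two questions above can also be answered numerically before plotting. The following is a minimal sketch, assuming the ggplot2 and dplyr packages are installed; the name `common_carats` is introduced here for illustration and is not part of the slides.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

# Tabulate every distinct carat value, most frequent first
common_carats <- diamonds %>%
  count(carat, sort = TRUE)

head(common_carats, 5)  # the most common values
tail(common_carats, 5)  # the rarest values
```

Sorting the counts puts the candidates for "most common" and "rare" at the two ends of one table, which is a quick cross-check on what the histogram suggests.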
  • 14. Why are there more diamonds at whole carats and common fractions of carats? Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak? Why are there no diamonds bigger than 3 carats? ggplot(data = diamonds %>% filter(carat < 3)) + geom_histogram(mapping = aes(x = carat), binwidth = 0.01)
  • 15. Visualizing univariate distributions (일변량 분포의 시각화) Unusual values (outliers, 특이값, 드문 값) ggplot(data = diamonds) + geom_histogram(mapping = aes(x = y), binwidth = 0.5)
  • 16. Visualizing univariate distributions (일변량 분포의 시각화) Unusual values (outliers, 특이값, 드문 값) ggplot(data = diamonds) + geom_histogram(mapping = aes(x = y), binwidth = 0.5) + coord_cartesian(ylim = c(0, 50))
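The slide above zooms with coord_cartesian() rather than setting scale limits with ylim(). A short sketch of the difference, assuming ggplot2 is installed (the object names `p`, `zoomed`, and `clipped` are illustrative only):

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

# coord_cartesian() only changes the visible window; every bar is still
# computed and drawn, tall bars are simply cut off at the top of the panel.
zoomed <- p + coord_cartesian(ylim = c(0, 50))

# ylim() sets the limits of the y scale instead; bars taller than 50 fall
# outside the scale and are dropped (ggplot2 warns about removed rows).
clipped <- p + ylim(0, 50)
```

For spotting rare values next to very common ones, the zoomed version is usually the safer choice, because the common bins remain visible as context.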
  • 17. Unusual values (outliers, 특이값, 드문 값)
    unusual <- diamonds %>%
      filter(y < 3 | y > 20) %>%
      select(price, x, y, z) %>%
      arrange(y)
    unusual
    #> # A tibble: 9 x 4
    #>   price     x     y     z
    #>   <int> <dbl> <dbl> <dbl>
    #> 1  5139  0      0    0
    #> 2  6381  0      0    0
    #> 3 12800  0      0    0
    #> 4 15686  0      0    0
    #> 5 18034  0      0    0
    #> 6  2130  0      0    0
    #> 7  2130  0      0    0
    #> 8  2075  5.15  31.8  5.12
    #> 9 12210  8.09  58.9  8.06
    #@ replacing the unusual values with missing values
    diamonds2 <- diamonds %>%
      mutate(y = ifelse(y < 3 | y > 20, NA, y))
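As a sanity check on the replacement step, the number of missing values in `diamonds2` should equal the number of rows flagged as unusual. A minimal sketch, assuming ggplot2 and dplyr are installed (`n_missing` is a name introduced here):

```r
library(ggplot2)
library(dplyr)

# Same replacement as on the slide: out-of-range widths become NA
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# Count the values that were replaced
n_missing <- sum(is.na(diamonds2$y))
n_missing

# na.rm = TRUE drops the missing rows without the usual warning
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)
```

Replacing values with NA keeps the other measurements of those diamonds in the data, which would be lost if the whole rows were filtered out.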
  • 18. Visualizing univariate distributions (일변량 분포의 시각화) Missing values vs. non-missing values nycflights13::flights %>% mutate( cancelled = is.na(dep_time), sched_hour = sched_dep_time %/% 100, sched_min = sched_dep_time %% 100, sched_dep_time = sched_hour + sched_min / 60 ) %>% ggplot(mapping = aes(sched_dep_time)) + geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
  • 19. More on visualizing univariate distribution
  • 20. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) Covariation (공분산): Bivariate distributions (이변량 분포) Continuous (연속형) & Categorical (범주형)
  • 21. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Categorical (범주형) geom_freqpoly & color ggplot(data = diamonds, mapping = aes(x = price)) + geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
  • 22. Visualizing univariate distributions (일변량 분포의 시각화) Categorical (범주형) variable (변수) geom_bar ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
  • 23. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Categorical (범주형) geom_freqpoly & color ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
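The `..density..` notation on this slide is deprecated in recent ggplot2 releases in favour of after_stat(). A sketch of the equivalent call, assuming ggplot2 3.3.0 or later is installed (`p` is an illustrative name):

```r
library(ggplot2)

# after_stat(density) rescales each cut's polygon so that the area under it
# is 1, which makes groups of very different sizes comparable.
p <- ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

p
```

Plotting density instead of count matters here because the cuts differ a lot in size (1610 Fair diamonds against 21551 Ideal ones), so raw counts mostly show group size rather than the shape of each price distribution.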
  • 24. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Categorical (범주형) boxplot
  • 25. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Categorical (범주형) geom_boxplot ggplot(data = diamonds, mapping = aes(x = cut, y = price)) + geom_boxplot()
  • 26. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Categorical (범주형) geom_boxplot ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot()
  • 27. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Categorical (범주형) geom_boxplot ggplot(data = mpg) + geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
  • 28. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Categorical (범주형) geom_boxplot ggplot(data = mpg) + geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) + coord_flip()
  • 29. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) Covariation (공분산): Bivariate distributions (이변량 분포) Categorical (범주형) & Categorical (범주형)
  • 30. bivariate distributions (이변량 분포) Categorical (범주형) & Categorical (범주형) "long format": unique identifier (고유 식별자) = composite key (복합키)
    #@ count in a "long format": the unique identifier is the composite key consisting of color & cut
    diamonds %>%
      group_by(color, cut) %>%
      summarize(count = n()) %>%
      print(n = Inf)
    diamonds %>%
      count(color, cut) %>%
      print(n = Inf)
    # # A tibble: 35 x 3
    #    color cut           n
    #    <ord> <ord>     <int>
    #  1 D     Fair        163
    #  2 D     Good        662
    #  3 D     Very Good  1513
    #  4 D     Premium    1603
    #  5 D     Ideal      2834
    #  6 E     Fair        224
    #  7 E     Good        933
    #  8 E     Very Good  2400
    #  9 E     Premium    2337
    # 10 E     Ideal      3903
    # 11 F     Fair        312
    # 12 F     Good        909
    # 13 F     Very Good  2164
    # 14 F     Premium    2331
    # 15 F     Ideal      3826
    # 16 G     Fair        314
    # 17 G     Good        871
    # 18 G     Very Good  2299
    # 19 G     Premium    2924
    # 20 G     Ideal      4884
    # 21 H     Fair        303
    # 22 H     Good        702
    # 23 H     Very Good  1824
    # 24 H     Premium    2360
    # 25 H     Ideal      3115
    # 26 I     Fair        175
    # 27 I     Good        522
    # 28 I     Very Good  1204
    # 29 I     Premium    1428
    # 30 I     Ideal      2093
    # 31 J     Fair        119
    # 32 J     Good        307
    # 33 J     Very Good   678
    # 34 J     Premium     808
    # 35 J     Ideal       896
  • 31. bivariate distributions (이변량 분포) Categorical (범주형) & Categorical (범주형) "wide format": one variable -> multiple columns
    #@ "wide format": one variable -> multiple columns (spread)
    diamonds %>%
      select(color, cut) %>%
      table %>%
      addmargins
    #      cut
    # color  Fair  Good Very Good Premium Ideal   Sum
    #   D     163   662      1513    1603  2834  6775
    #   E     224   933      2400    2337  3903  9797
    #   F     312   909      2164    2331  3826  9542
    #   G     314   871      2299    2924  4884 11292
    #   H     303   702      1824    2360  3115  8304
    #   I     175   522      1204    1428  2093  5422
    #   J     119   307       678     808   896  2808
    #   Sum  1610  4906     12082   13791 21551 53940
    #@ spread(): "long format" -> "wide format": one variable -> multiple columns
    diamonds %>%
      count(color, cut) %>%
      spread(key = cut, value = n)
    # # A tibble: 7 x 6
    #   color  Fair  Good `Very Good` Premium Ideal
    #   <ord> <int> <int>       <int>   <int> <int>
    # 1 D       163   662        1513    1603  2834
    # 2 E       224   933        2400    2337  3903
    # 3 F       312   909        2164    2331  3826
    # 4 G       314   871        2299    2924  4884
    # 5 H       303   702        1824    2360  3115
    # 6 I       175   522        1204    1428  2093
    # 7 J       119   307         678     808   896
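The slides use spread(), which tidyr has since superseded with pivot_wider(). A minimal sketch of the same long-to-wide step, assuming tidyr 1.0.0 or later is installed alongside ggplot2 and dplyr (`wide` is an illustrative name):

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# names_from plays the role of spread()'s key, values_from that of value
wide <- diamonds %>%
  count(color, cut) %>%
  pivot_wider(names_from = cut, values_from = n)

wide
```

The result has one row per color and one column per cut, the same shape as the spread() output on the slide.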
  • 32. Visualizing bivariate distributions (이변량 분포의 시각화) Categorical (범주형) & Categorical (범주형) geom_count ggplot(data = diamonds) + geom_count(mapping = aes(x = cut, y = color))
  • 33. Visualizing bivariate distributions (이변량 분포의 시각화) Categorical (범주형) & Categorical (범주형) geom_tile diamonds %>% count(color, cut) %>% ggplot(mapping = aes(x = color, y = cut)) + geom_tile(mapping = aes(fill = n))
  • 34. bivariate distributions (이변량 분포) Categorical (범주형) & Categorical (범주형) "long format": unique identifier (고유 식별자) = composite key (복합키)
    #@ gather(): "wide format" -> "long format": multiple columns -> one variable (key)
    #@ "wide format": one variable -> multiple columns (spread)
    diamonds %>%
      count(color, cut) %>%
      spread(key = cut, value = n) %>%
      gather(key = cut, value = n, Fair, Good, `Very Good`, Premium, Ideal)
    diamonds %>%
      count(color, cut) %>%
      spread(key = cut, value = n) %>%
      gather(key = cut, value = n, Fair:Ideal)
    diamonds %>%
      count(color, cut) %>%
      spread(key = cut, value = n) %>%
      gather(key = cut, value = n, -color)
    # # A tibble: 35 x 3
    #    color cut       n
    #    <ord> <chr> <int>
    #  1 D     Fair    163
    #  2 E     Fair    224
    #  3 F     Fair    312
    #  4 G     Fair    314
    #  5 H     Fair    303
    #  6 I     Fair    175
    #  7 J     Fair    119
    #  8 D     Good    662
    #  9 E     Good    933
    # 10 F     Good    909
    # # ... with 25 more rows
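The reverse step can likewise be written with pivot_longer(), which supersedes gather() in current tidyr. A sketch under the same assumptions as before (tidyr 1.0.0 or later; `wide` and `long` are illustrative names):

```r
library(ggplot2)
library(dplyr)
library(tidyr)

# Build the wide table first, then fold the cut columns back into one variable
wide <- diamonds %>%
  count(color, cut) %>%
  pivot_wider(names_from = cut, values_from = n)

long <- wide %>%
  pivot_longer(cols = -color, names_to = "cut", values_to = "n")

long
```

As with gather(), the `cut` column comes back as plain character rather than an ordered factor, so the original level order has to be restored separately if it matters for plotting. Unlike gather(), pivot_longer() emits the rows color by color rather than cut by cut.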
  • 35. Exploratory Data Analysis (EDA, 탐색적 데이터 분석) Covariation (공분산): Bivariate distributions (이변량 분포) Continuous (연속형) & Continuous (연속형)
  • 36. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Continuous (연속형) geom_point ggplot(data = diamonds) + geom_point(mapping = aes(x = carat, y = price))
  • 37. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Continuous (연속형) geom_point ggplot(data = diamonds) + geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
  • 38. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Continuous (연속형) geom_bin2d ggplot(data = diamonds %>% filter(carat < 3)) + geom_bin2d(mapping = aes(x = carat, y = price))
  • 39. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Continuous (연속형) geom_hex # install.packages("hexbin") ggplot(data = diamonds %>% filter(carat < 3)) + geom_hex(mapping = aes(x = carat, y = price))
  • 40. Visualizing bivariate distributions (이변량 분포의 시각화) Continuous (연속형) & Continuous (연속형) -> Continuous (연속형) & Categorical (범주형) ggplot(diamonds %>% filter(carat < 3), mapping = aes(x = carat, y = price)) + geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
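One limitation of cut_width() is that bins in the sparse upper range of carat hold far fewer diamonds than bins near the low end, yet every box looks alike. A possible alternative, sketched here assuming ggplot2 and dplyr are installed, is cut_number(), which puts roughly the same number of observations in each bin; the choice of 20 bins and the names `smaller`, `bins`, and `p` are arbitrary illustrations.

```r
library(ggplot2)
library(dplyr)

smaller <- diamonds %>% filter(carat < 3)

# 20 bins holding approximately equal numbers of diamonds
bins <- cut_number(smaller$carat, 20)

p <- ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

p
```

With equal-count bins the boxes become wider where data are sparse, so box width itself signals how thinly the carat range is populated.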
  • 41. More on visualizing bivariate distribution
  • 42. More on visualizing trivariate distribution
  • 43. REFERENCES
    #1. RStudio official documentation (Help & Cheat Sheets). Free webpage: https://www.rstudio.com/resources/cheatsheets/
    #2. Wickham, H. and Grolemund, G., 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly. Free webpage: https://r4ds.had.co.nz/
    Cf) Tidyverse syntax (www.tidyverse.org), rather than base R syntax.
    Cf) Hadley Wickham: Chief Scientist at RStudio; Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University.