Handling missing values
and Outliers
Loading the semi-wrangled weather data
Previously I wrangled a dataset that contained weather data. In this presentation I plan to check for outliers, look for missing values, explore different ways of dealing with NA values, and experiment with some basic functional programming and filtering of time-series data.
library(tidyverse)
library(DataExplorer)
weather_data_pivot_tbl <- readr::read_rds('weather_data_pivoted.rds')
glimpse(weather_data_pivot_tbl)
Observations: 366
Variables: 23
$ Date <chr> "2014/12/1", "2014/12/2", "2014/12/3...
$ Events <chr> "Rain", "Rain-Snow", "Rain", "", "Ra...
$ Max.TemperatureF <dbl> 64, 42, 51, 43, 42, 45, 38, 29, 49, ...
$ Mean.TemperatureF <dbl> 52, 38, 44, 37, 34, 42, 30, 24, 39, ...
$ Min.TemperatureF <dbl> 39, 33, 37, 30, 26, 38, 21, 18, 29, ...
$ Max.Dew.PointF <dbl> 46, 40, 49, 24, 37, 45, 36, 28, 49, ...
$ MeanDew.PointF <dbl> 40, 27, 42, 21, 25, 40, 20, 16, 41, ...
$ Min.DewpointF <dbl> 26, 17, 24, 13, 12, 36, -3, 3, 28, 3...
$ Max.Humidity <dbl> 74, 92, 100, 69, 85, 100, 92, 92, 10...
$ Mean.Humidity <dbl> 63, 72, 79, 54, 66, 93, 61, 70, 93, ...
$ Min.Humidity <dbl> 52, 51, 57, 39, 47, 85, 29, 47, 86, ...
$ Max.Sea.Level.PressureIn <dbl> 30.45, 30.71, 30.40, 30.56, 30.68, 3...
$ Mean.Sea.Level.PressureIn <dbl> 30.13, 30.59, 30.07, 30.33, 30.59, 3...
$ Min.Sea.Level.PressureIn <dbl> 30.01, 30.40, 29.87, 30.09, 30.45, 3...
$ Max.VisibilityMiles <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, ...
$ Mean.VisibilityMiles <dbl> 10, 8, 5, 10, 10, 4, 10, 8, 2, 3, 7,...
$ Min.VisibilityMiles <dbl> 10, 2, 1, 10, 5, 0, 5, 2, 1, 1, 1, 7...
$ Max.Wind.SpeedMPH <dbl> 22, 24, 29, 25, 22, 22, 25, 21, 38, ...
$ Mean.Wind.SpeedMPH <dbl> 13, 15, 12, 12, 10, 8, 15, 13, 20, 1...
$ Max.Gust.SpeedMPH <dbl> 29, 29, 38, 33, 26, 25, 32, 28, 52, ...
$ PrecipitationIn <dbl> 0.01, 0.10, 0.44, 0.00, 0.11, 1.09, ...
$ CloudCover <dbl> 6, 7, 8, 3, 5, 8, 6, 8, 8, 8, 8, 7, ...
$ WindDirDegrees <dbl> 268, 62, 254, 292, 61, 313, 350, 354...
Type Conversions
The Events column contains data that can be categorized into different classes, such as Rain, Rain-Snow, etc.
• I’ll replace the blank entries with the text “None”
• I’ll then convert this column to a factor
• I’ll convert the Date column from character type to date type
Replace the blanks in the Events column with 'None'
Method 1:
weather_data_pivot_tbl$Events[weather_data_pivot_tbl$Events == ""] <- 'None'
(weather_data_clean_tbl <- weather_data_pivot_tbl %>%
  mutate(Events = Events %>% as_factor(),
         Date = lubridate::ymd(Date)))
Method 2:
(weather_data_clean_tbl <- weather_data_pivot_tbl %>%
  mutate(Events = case_when(
    Events == "" ~ 'None',
    TRUE ~ Events
  ) %>% as.factor()) %>%
  mutate(Date = Date %>% lubridate::ymd())
)
A tibble: 366 x 23
Date Events Max.TemperatureF Mean.Temperatur~ Min.TemperatureF
<date> <fct> <dbl> <dbl> <dbl>
1 2014-12-01 Rain 64 52 39
2 2014-12-02 Rain-~ 42 38 33
3 2014-12-03 Rain 51 44 37
4 2014-12-04 None 43 37 30
5 2014-12-05 Rain 42 34 26
6 2014-12-06 Rain 45 42 38
7 2014-12-07 Rain 38 30 21
8 2014-12-08 Snow 29 24 18
9 2014-12-09 Rain 49 39 29
10 2014-12-10 Rain 48 43 38
... with 356 more rows, and 18 more variables: Max.Dew.PointF <dbl>,
MeanDew.PointF <dbl>, Min.DewpointF <dbl>, Max.Humidity <dbl>,
Mean.Humidity <dbl>, Min.Humidity <dbl>,
Max.Sea.Level.PressureIn <dbl>, Mean.Sea.Level.PressureIn <dbl>,
Min.Sea.Level.PressureIn <dbl>, Max.VisibilityMiles <dbl>,
Mean.VisibilityMiles <dbl>, Min.VisibilityMiles <dbl>,
Max.Wind.SpeedMPH <dbl>, Mean.Wind.SpeedMPH <dbl>,
Max.Gust.SpeedMPH <dbl>, PrecipitationIn <dbl>, CloudCover <dbl>,
WindDirDegrees <dbl>
Tip: to assign the result of an expression to a variable and print it to the console at the same time, wrap the entire expression in parentheses, e.g. (y <- mean(x)).
Use the summary() function to get a good feel for the distribution of the data within the dataset. It is a very handy way to detect outliers and missing values.
summary(weather_data_clean_tbl)
Date Events Max.TemperatureF Mean.TemperatureF
Min. :2014-12-01 None :201 Min. :18.00 Min. : 8.00
1st Qu.:2015-03-02 Rain : 90 1st Qu.:42.00 1st Qu.:36.25
Median :2015-06-01 Snow : 31 Median :60.00 Median :53.50
Mean :2015-06-01 Rain-Snow: 10 Mean :58.93 Mean :51.40
3rd Qu.:2015-08-31 Fog-Rain : 8 3rd Qu.:76.00 3rd Qu.:68.00
Max. :2015-12-01 Fog-Snow : 7 Max. :96.00 Max. :84.00
(Other) : 19
Min.TemperatureF Max.Dew.PointF MeanDew.PointF Min.DewpointF
Min. :-3.00 Min. :-6.00 Min. :-11.00 Min. :-18.00
1st Qu.:30.00 1st Qu.:32.00 1st Qu.: 24.00 1st Qu.: 16.25
Median :46.00 Median :47.50 Median : 41.00 Median : 35.00
Mean :43.33 Mean :45.48 Mean : 38.96 Mean : 32.25
3rd Qu.:60.00 3rd Qu.:61.00 3rd Qu.: 56.00 3rd Qu.: 51.00
Max. :74.00 Max. :75.00 Max. : 71.00 Max. : 68.00
Max.Humidity Mean.Humidity Min.Humidity
Min. : 39.00 Min. :28.00 Min. :16.00
1st Qu.: 73.25 1st Qu.:56.00 1st Qu.:35.00
Median : 86.00 Median :66.00 Median :46.00
Mean : 85.69 Mean :66.02 Mean :48.31
3rd Qu.: 93.00 3rd Qu.:76.75 3rd Qu.:60.00
Max. :1000.00 Max. :98.00 Max. :96.00
Max.Sea.Level.PressureIn Mean.Sea.Level.PressureIn
Min. :29.58 Min. :29.49
1st Qu.:30.00 1st Qu.:29.87
Median :30.14 Median :30.03
Mean :30.16 Mean :30.04
3rd Qu.:30.31 3rd Qu.:30.19
Max. :30.88 Max. :30.77
Min.Sea.Level.PressureIn Max.VisibilityMiles Mean.VisibilityMiles
Min. :29.16 Min. : 2.000 Min. :-1.000
1st Qu.:29.76 1st Qu.:10.000 1st Qu.: 8.000
Median :29.94 Median :10.000 Median :10.000
Mean :29.93 Mean : 9.907 Mean : 8.861
3rd Qu.:30.09 3rd Qu.:10.000 3rd Qu.:10.000
Max. :30.64 Max. :10.000 Max. :10.000
Min.VisibilityMiles Max.Wind.SpeedMPH Mean.Wind.SpeedMPH
Min. : 0.000 Min. : 8.00 Min. : 4.00
1st Qu.: 2.000 1st Qu.:16.00 1st Qu.: 8.00
Median :10.000 Median :20.00 Median :10.00
Mean : 6.716 Mean :20.62 Mean :10.68
3rd Qu.:10.000 3rd Qu.:24.00 3rd Qu.:13.00
Max. :10.000 Max. :38.00 Max. :22.00
Max.Gust.SpeedMPH PrecipitationIn CloudCover WindDirDegrees
Min. : 0.00 Min. :0.0000 Min. :0.000 Min. : 1.0
1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:113.0
Median :25.50 Median :0.0000 Median :5.000 Median :222.0
Mean :26.99 Mean :0.1173 Mean :4.708 Mean :200.1
3rd Qu.:31.25 3rd Qu.:0.0700 3rd Qu.:7.000 3rd Qu.:275.0
Max. :94.00 Max. :2.9000 Max. :8.000 Max. :360.0
NA's :6 NA's :49
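Reading the summary by eye works for 23 columns, but the same screening can be sketched in code. The example below is a minimal illustration on made-up values, not the author's data: toy_tbl, bounds and out_of_range are hypothetical names, and the plausible ranges (humidity between 0 and 100, non-negative distances) are assumptions.

```r
# Toy data standing in for two of the weather columns (values are made up).
toy_tbl <- data.frame(
  Max.Humidity         = c(74, 92, 1000, 69),
  Mean.VisibilityMiles = c(10, -1, 5, 8)
)

# Assumed plausible ranges: humidity is a percentage, a distance is non-negative.
bounds <- list(
  Max.Humidity         = c(0, 100),
  Mean.VisibilityMiles = c(0, Inf)
)

# For each column, collect the row numbers that fall outside its range.
out_of_range <- lapply(names(bounds), function(col) {
  x <- toy_tbl[[col]]
  which(x < bounds[[col]][1] | x > bounds[[col]][2])
})
names(out_of_range) <- names(bounds)

out_of_range
```

Each list element holds the offending row numbers for one column, which can then be inspected individually.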
Screening and handling outliers
There seem to be obvious outliers in the Mean.VisibilityMiles column and the Max.Humidity column.
plot(weather_data_clean_tbl$Date, weather_data_clean_tbl$Max.Humidity,
     ylab = 'Maximum humidity',
     xlab = 'Date'
)
Whatever the cause, this is clearly an invalid data point and needs to be fixed. I'm assuming that it is out by a factor of 10 and that dropping a zero should do the trick.
To quickly find the row number of this error, which.max() from the base package is very handy.
weather_data_clean_tbl$Max.Humidity %>% which.max()
[1] 142
The row number is 142, and the row can be inspected quickly using dplyr's slice() function.
weather_data_clean_tbl %>% slice(142) %>% glimpse()
Observations: 1
Variables: 23
$ Date <date> 2015-04-21
$ Events <fct> Fog-Rain-Thunderstorm
$ Max.TemperatureF <dbl> 65
$ Mean.TemperatureF <dbl> 56
$ Min.TemperatureF <dbl> 46
$ Max.Dew.PointF <dbl> 57
$ MeanDew.PointF <dbl> 49
$ Min.DewpointF <dbl> 36
$ Max.Humidity <dbl> 1000
$ Mean.Humidity <dbl> 71
$ Min.Humidity <dbl> 42
$ Max.Sea.Level.PressureIn <dbl> 29.75
$ Mean.Sea.Level.PressureIn <dbl> 29.6
$ Min.Sea.Level.PressureIn <dbl> 29.53
$ Max.VisibilityMiles <dbl> 10
$ Mean.VisibilityMiles <dbl> 5
$ Min.VisibilityMiles <dbl> 0
$ Max.Wind.SpeedMPH <dbl> 20
$ Mean.Wind.SpeedMPH <dbl> 10
$ Max.Gust.SpeedMPH <dbl> 94
$ PrecipitationIn <dbl> 0.54
$ CloudCover <dbl> 6
$ WindDirDegrees <dbl> 184
Let's knock a zero off 1000 and replace it with 100.
weather_data_clean_tbl$Max.Humidity[142] <- 100
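The fix above hard-codes row 142. A minimal sketch of the same correction driven by a condition instead of a row number could look like this. It runs on a made-up vector, and it keeps the same assumption that the bad reading is out by a factor of 10; max_humidity and bad are hypothetical names.

```r
# Made-up humidity readings with one entry error (1000 instead of 100).
max_humidity <- c(74, 92, 1000, 69, 100)

# Humidity is a percentage, so anything above 100 is treated as invalid.
bad <- which(max_humidity > 100)

# Apply the "drop a zero" assumption to every invalid reading.
max_humidity[bad] <- max_humidity[bad] / 10

max_humidity
```

This still works if rows are reordered or if more than one reading has the same problem, but it is only as good as the factor-of-10 assumption.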
Further (not so obvious) errors
The summary of Mean.VisibilityMiles reveals another error: a distance in miles cannot be negative. Let’s replace the negative value with 10, which is also the column's median.
summary(weather_data_clean_tbl$Mean.VisibilityMiles)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.000 8.000 10.000 8.861 10.000 10.000
min_row <- which.min(weather_data_clean_tbl$Mean.VisibilityMiles)
weather_data_clean_tbl$Mean.VisibilityMiles[min_row] <- 10
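An alternative to choosing a replacement value by hand is to mark the impossible reading as NA, so that it is handled together with the other missing values. This is a sketch of that option on made-up values, not what was done in this deck; mean_visibility is a hypothetical vector.

```r
# Made-up visibility readings with one impossible (negative) value.
mean_visibility <- c(10, 8, -1, 10, 4)

# A distance cannot be negative, so flag such readings as missing.
mean_visibility[mean_visibility < 0] <- NA

# Count how many readings were flagged.
sum(is.na(mean_visibility))
```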
Handling NA Values
Missing values are one of the most common problems when working with a dataset. They can cause a great deal of trouble and require careful thought. Recall the three types of missing data:
• Missing completely at random (MCAR): there is no relationship between the missingness and any values in the data, observed or missing
• Missing at random (MAR): the missingness is related to other observed variables (the circumstances), but not to the missing value itself
• Missing not at random (MNAR): the missingness is related to the very value that is missing
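The difference between the three mechanisms is easier to see in a small simulation. The sketch below uses simulated values only; the variable names, the drop probabilities and the thresholds (40 and 35) are all arbitrary assumptions chosen for illustration.

```r
set.seed(42)
n <- 1000
temperature <- rnorm(n, mean = 50, sd = 15)  # fully observed variable
gust        <- rnorm(n, mean = 27, sd = 8)   # variable that will lose values

# MCAR: every value has the same 10% chance of being dropped.
gust_mcar <- ifelse(runif(n) < 0.10, NA, gust)

# MAR: the chance of being dropped depends on temperature (observed),
# not on the gust value itself.
gust_mar <- ifelse(runif(n) < ifelse(temperature < 40, 0.30, 0.02), NA, gust)

# MNAR: the chance of being dropped depends on the gust value that goes missing.
gust_mnar <- ifelse(runif(n) < ifelse(gust > 35, 0.50, 0.02), NA, gust)

# Share of missing values under each mechanism.
c(mcar = mean(is.na(gust_mcar)),
  mar  = mean(is.na(gust_mar)),
  mnar = mean(is.na(gust_mnar)))

# Under MNAR the observed mean tends to be too low, because high gusts
# are dropped more often than low ones.
c(true = mean(gust), mnar_observed = mean(gust_mnar, na.rm = TRUE))
```

The practical point is that the mechanism cannot be read off the NA counts alone, which is why handling missing values calls for subject matter knowledge.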
Addressing the missing values
Fixing NA values requires subject matter expertise. With this dataset I chose to replace NAs by imputation, using the median of each column.
I chose the median because:
• The mean is sensitive to outliers
• The median is robust to outliers and is not as heavily impacted by skewed data as the mean
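A quick way to see why the median is the safer choice is to add a single bad reading to a short vector. The numbers below are made up for illustration; humidity and with_error are hypothetical names.

```r
humidity   <- c(74, 92, 100, 69, 85)  # made-up humidity readings
with_error <- c(humidity, 1000)       # the same readings plus one entry error

mean(humidity)      # 84
mean(with_error)    # jumps to well over 200
median(humidity)    # 85
median(with_error)  # 88.5
```

One erroneous value nearly triples the mean, while the median moves by only a few points.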
Let’s get the percentage of NAs per column with three different methods
• summarise_all()
• map_df()
• plot_missing()
Using summarise_all()
Method 1:
weather_data_clean_tbl %>% summarise_all(~ is.na(.) %>% sum() / length(.) * 100) %>% glimpse()
Observations: 1
Variables: 23
$ Date <dbl> 0
$ Events <dbl> 0
$ Max.TemperatureF <dbl> 0
$ Mean.TemperatureF <dbl> 0
$ Min.TemperatureF <dbl> 0
$ Max.Dew.PointF <dbl> 0
$ MeanDew.PointF <dbl> 0
$ Min.DewpointF <dbl> 0
$ Max.Humidity <dbl> 0
$ Mean.Humidity <dbl> 0
$ Min.Humidity <dbl> 0
$ Max.Sea.Level.PressureIn <dbl> 0
$ Mean.Sea.Level.PressureIn <dbl> 0
$ Min.Sea.Level.PressureIn <dbl> 0
$ Max.VisibilityMiles <dbl> 0
$ Mean.VisibilityMiles <dbl> 0
$ Min.VisibilityMiles <dbl> 0
$ Max.Wind.SpeedMPH <dbl> 0
$ Mean.Wind.SpeedMPH <dbl> 0
$ Max.Gust.SpeedMPH <dbl> 1.639344
$ PrecipitationIn <dbl> 13.38798
$ CloudCover <dbl> 0
$ WindDirDegrees <dbl> 0
Using purrr::map_df()
Method 2:
weather_data_clean_tbl %>%
  map_df(~ is.na(.) %>% sum() / length(.) * 100) %>%
  gather() %>%
  filter(value > 0)
A tibble: 2 x 2
key value
<chr> <dbl>
1 Max.Gust.SpeedMPH 1.64
2 PrecipitationIn 13.4
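For comparison, the same percentages can be computed with base R alone, since is.na() on a data frame returns a logical matrix and colMeans() averages each column. This is a sketch on a made-up table (toy_tbl and na_pct are hypothetical names), not one of the methods used in this deck.

```r
# Made-up table with missing values in two of its three columns.
toy_tbl <- data.frame(
  Max.Gust.SpeedMPH = c(29, NA, 38, 33),
  PrecipitationIn   = c(0.01, NA, NA, 0.00),
  CloudCover        = c(6, 7, 8, 3)
)

# Proportion of NAs per column, expressed as a percentage.
na_pct <- colMeans(is.na(toy_tbl)) * 100

# Keep only the columns that actually contain missing values.
na_pct[na_pct > 0]
```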
Using DataExplorer::plot_missing()
Method 3:
weather_data_clean_tbl %>% plot_missing()
Replacing values programmatically
The expression reads as follows: if a column is numeric, scan it for NAs; where one is found, replace it with the median of that column, otherwise leave the value as it is.
(weather_data_clean_tbl2 <- weather_data_clean_tbl %>%
  mutate_if(is.numeric, ~ if_else(condition = is.na(.),
                                  true = median(., na.rm = TRUE),
                                  false = .))
)
A tibble: 366 x 23
Date Events Max.TemperatureF Mean.Temperatur~ Min.TemperatureF
<date> <fct> <dbl> <dbl> <dbl>
1 2014-12-01 Rain 64 52 39
2 2014-12-02 Rain-~ 42 38 33
3 2014-12-03 Rain 51 44 37
4 2014-12-04 None 43 37 30
5 2014-12-05 Rain 42 34 26
6 2014-12-06 Rain 45 42 38
7 2014-12-07 Rain 38 30 21
8 2014-12-08 Snow 29 24 18
9 2014-12-09 Rain 49 39 29
10 2014-12-10 Rain 48 43 38
... with 356 more rows, and 18 more variables: Max.Dew.PointF <dbl>,
MeanDew.PointF <dbl>, Min.DewpointF <dbl>, Max.Humidity <dbl>,
Mean.Humidity <dbl>, Min.Humidity <dbl>,
Max.Sea.Level.PressureIn <dbl>, Mean.Sea.Level.PressureIn <dbl>,
Min.Sea.Level.PressureIn <dbl>, Max.VisibilityMiles <dbl>,
Mean.VisibilityMiles <dbl>, Min.VisibilityMiles <dbl>,
Max.Wind.SpeedMPH <dbl>, Mean.Wind.SpeedMPH <dbl>,
Max.Gust.SpeedMPH <dbl>, PrecipitationIn <dbl>, CloudCover <dbl>,
WindDirDegrees <dbl>
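The same "numeric columns only, NA becomes the column median" rule can be sketched with base R, which makes the per-column logic explicit. The table and the helper impute_median() below are hypothetical and exist only for this illustration.

```r
# Made-up table: one character column and two numeric columns with NAs.
toy_tbl <- data.frame(
  Events            = c("Rain", "None", "Snow", "Rain", "None"),
  Max.Gust.SpeedMPH = c(29, NA, 38, 33, 26),
  PrecipitationIn   = c(0.01, NA, 0.44, 0.00, NA),
  stringsAsFactors  = FALSE
)

# Hypothetical helper: fill NAs with the median, but only in numeric vectors.
impute_median <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  }
  x
}

# Apply the helper to every column while keeping the data frame structure.
toy_imputed <- toy_tbl
toy_imputed[] <- lapply(toy_tbl, impute_median)

toy_imputed
```

Non-numeric columns such as Events pass through unchanged, mirroring the is.numeric predicate in the mutate_if() call.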
Check for NA values again using the summary() function
summary(weather_data_clean_tbl2)
Date Events Max.TemperatureF Mean.TemperatureF
Min. :2014-12-01 None :201 Min. :18.00 Min. : 8.00
1st Qu.:2015-03-02 Rain : 90 1st Qu.:42.00 1st Qu.:36.25
Median :2015-06-01 Snow : 31 Median :60.00 Median :53.50
Mean :2015-06-01 Rain-Snow: 10 Mean :58.93 Mean :51.40
3rd Qu.:2015-08-31 Fog-Rain : 8 3rd Qu.:76.00 3rd Qu.:68.00
Max. :2015-12-01 Fog-Snow : 7 Max. :96.00 Max. :84.00
(Other) : 19
Min.TemperatureF Max.Dew.PointF MeanDew.PointF Min.DewpointF
Min. :-3.00 Min. :-6.00 Min. :-11.00 Min. :-18.00
1st Qu.:30.00 1st Qu.:32.00 1st Qu.: 24.00 1st Qu.: 16.25
Median :46.00 Median :47.50 Median : 41.00 Median : 35.00
Mean :43.33 Mean :45.48 Mean : 38.96 Mean : 32.25
3rd Qu.:60.00 3rd Qu.:61.00 3rd Qu.: 56.00 3rd Qu.: 51.00
Max. :74.00 Max. :75.00 Max. : 71.00 Max. : 68.00
Max.Humidity Mean.Humidity Min.Humidity Max.Sea.Level.PressureIn
Min. : 39.00 Min. :28.00 Min. :16.00 Min. :29.58
1st Qu.: 73.25 1st Qu.:56.00 1st Qu.:35.00 1st Qu.:30.00
Median : 86.00 Median :66.00 Median :46.00 Median :30.14
Mean : 83.23 Mean :66.02 Mean :48.31 Mean :30.16
3rd Qu.: 93.00 3rd Qu.:76.75 3rd Qu.:60.00 3rd Qu.:30.31
Max. :100.00 Max. :98.00 Max. :96.00 Max. :30.88
Mean.Sea.Level.PressureIn Min.Sea.Level.PressureIn Max.VisibilityMiles
Min. :29.49 Min. :29.16 Min. : 2.000
1st Qu.:29.87 1st Qu.:29.76 1st Qu.:10.000
Median :30.03 Median :29.94 Median :10.000
Mean :30.04 Mean :29.93 Mean : 9.907
3rd Qu.:30.19 3rd Qu.:30.09 3rd Qu.:10.000
Max. :30.77 Max. :30.64 Max. :10.000
Mean.VisibilityMiles Min.VisibilityMiles Max.Wind.SpeedMPH
Min. : 1.000 Min. : 0.000 Min. : 8.00
1st Qu.: 8.000 1st Qu.: 2.000 1st Qu.:16.00
Median :10.000 Median :10.000 Median :20.00
Mean : 8.891 Mean : 6.716 Mean :20.62
3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:24.00
Max. :10.000 Max. :10.000 Max. :38.00
Mean.Wind.SpeedMPH Max.Gust.SpeedMPH PrecipitationIn CloudCover
Min. : 4.00 Min. : 0.00 Min. :0.0000 Min. :0.000
1st Qu.: 8.00 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:3.000
Median :10.00 Median :25.50 Median :0.0000 Median :5.000
Mean :10.68 Mean :26.96 Mean :0.1016 Mean :4.708
3rd Qu.:13.00 3rd Qu.:31.00 3rd Qu.:0.0400 3rd Qu.:7.000
Max. :22.00 Max. :94.00 Max. :2.9000 Max. :8.000
WindDirDegrees
Min. : 1.0
1st Qu.:113.0
Median :222.0
Mean :200.1
3rd Qu.:275.0
Max. :360.0

Handling missing data and outliers

  • 1. Handling missing values and Outliers
    Loading the weather-data that is semi wrangled
    Previously I wrangled a dataset that contained weather data. With this presentation I plan to check for outliers, look for missing values, explore the different ways of dealing with NA values, and experiment with some basic functional programming and filtering of time-series data.

        library(tidyverse)
        library(DataExplorer)
        weather_data_pivot_tbl = readr::read_rds('weather_data_pivoted.rds')
        glimpse(weather_data_pivot_tbl)

        Observations: 366
        Variables: 23
        $ date                      <chr> "2014/12/1", "2014/12/2", "2014/12/3...
        $ Events                    <chr> "Rain", "Rain-Snow", "Rain", "", "Ra...
        $ Max.TemperatureF          <dbl> 64, 42, 51, 43, 42, 45, 38, 29, 49, ...
        $ Mean.TemperatureF         <dbl> 52, 38, 44, 37, 34, 42, 30, 24, 39, ...
        $ Min.TemperatureF          <dbl> 39, 33, 37, 30, 26, 38, 21, 18, 29, ...
        $ Max.Dew.PointF            <dbl> 46, 40, 49, 24, 37, 45, 36, 28, 49, ...
        $ MeanDew.PointF            <dbl> 40, 27, 42, 21, 25, 40, 20, 16, 41, ...
        $ Min.DewpointF             <dbl> 26, 17, 24, 13, 12, 36, -3, 3, 28, 3...
        $ Max.Humidity              <dbl> 74, 92, 100, 69, 85, 100, 92, 92, 10...
        $ Mean.Humidity             <dbl> 63, 72, 79, 54, 66, 93, 61, 70, 93, ...
        $ Min.Humidity              <dbl> 52, 51, 57, 39, 47, 85, 29, 47, 86, ...
        $ Max.Sea.Level.PressureIn  <dbl> 30.45, 30.71, 30.40, 30.56, 30.68, 3...
        $ Mean.Sea.Level.PressureIn <dbl> 30.13, 30.59, 30.07, 30.33, 30.59, 3...
        $ Min.Sea.Level.PressureIn  <dbl> 30.01, 30.40, 29.87, 30.09, 30.45, 3...
        $ Max.VisibilityMiles       <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, ...
        $ Mean.VisibilityMiles      <dbl> 10, 8, 5, 10, 10, 4, 10, 8, 2, 3, 7,...
        $ Min.VisibilityMiles       <dbl> 10, 2, 1, 10, 5, 0, 5, 2, 1, 1, 1, 7...
        $ Max.Wind.SpeedMPH         <dbl> 22, 24, 29, 25, 22, 22, 25, 21, 38, ...
        $ Mean.Wind.SpeedMPH        <dbl> 13, 15, 12, 12, 10, 8, 15, 13, 20, 1...
        $ Max.Gust.SpeedMPH         <dbl> 29, 29, 38, 33, 26, 25, 32, 28, 52, ...
        $ PrecipitationIn           <dbl> 0.01, 0.10, 0.44, 0.00, 0.11, 1.09, ...
        $ CloudCover                <dbl> 6, 7, 8, 3, 5, 8, 6, 8, 8, 8, 8, 7, ...
        $ WindDirDegrees            <dbl> 268, 62, 254, 292, 61, 313, 350, 354...
  • 2. Type Conversions
    The Events column contains data that can be categorized into different classes, such as Rain, Rain-Snow, etc.
      • I'll replace the blank rows with the text "None"
      • I'll convert this column to a factor
      • I'll convert the date column from character type to a date type

    Replace the blanks in the Events column with 'None'

    Method 1:
        weather_data_pivot_tbl$Events[weather_data_pivot_tbl$Events == ""] <- 'None'
        (weather_data_clean_tbl <- weather_data_pivot_tbl %>%
            mutate(Events = Events %>% as_factor(),
                   date   = lubridate::ymd(date)))

    Method 2:
        (weather_data_clean_tbl <- weather_data_pivot_tbl %>%
            mutate(Events = case_when(
                Events == "" ~ 'None',
                TRUE         ~ Events
            ) %>% as.factor()) %>%
            mutate(date = date %>% lubridate::ymd()))

        A tibble: 366 x 23
           date       Events Max.TemperatureF Mean.Temperatur~ Min.TemperatureF
           <date>     <fct>             <dbl>            <dbl>            <dbl>
         1 2014-12-01 Rain                 64               52               39
         2 2014-12-02 Rain-~               42               38               33
         3 2014-12-03 Rain                 51               44               37
         4 2014-12-04 None                 43               37               30
         5 2014-12-05 Rain                 42               34               26
         6 2014-12-06 Rain                 45               42               38
         7 2014-12-07 Rain                 38               30               21
         8 2014-12-08 Snow                 29               24               18
         9 2014-12-09 Rain                 49               39               29
        10 2014-12-10 Rain                 48               43               38
        ... with 356 more rows, and 18 more variables: Max.Dew.PointF <dbl>,
          MeanDew.PointF <dbl>, Min.DewpointF <dbl>, Max.Humidity <dbl>,
          Mean.Humidity <dbl>, Min.Humidity <dbl>,
  • 3. Max.Sea.Level.PressureIn <dbl>, Mean.Sea.Level.PressureIn <dbl>,
          Min.Sea.Level.PressureIn <dbl>, Max.VisibilityMiles <dbl>,
          Mean.VisibilityMiles <dbl>, Min.VisibilityMiles <dbl>,
          Max.Wind.SpeedMPH <dbl>, Mean.Wind.SpeedMPH <dbl>,
          Max.Gust.SpeedMPH <dbl>, PrecipitationIn <dbl>, CloudCover <dbl>,
          WindDirDegrees <dbl>

    Tip: to assign an expression to a variable and have it printed to the console at the same time, wrap the entire expression in parentheses, e.g. (y <- mean(x))

    Use the summary() function to get a good feel for the distribution of data within the dataset. This is a very handy way to detect outliers and missing values.

        summary(weather_data_clean_tbl)

              date                  Events     Max.TemperatureF  Mean.TemperatureF
         Min.   :2014-12-01   None     :201   Min.   :18.00     Min.   : 8.00
         1st Qu.:2015-03-02   Rain     : 90   1st Qu.:42.00     1st Qu.:36.25
         Median :2015-06-01   Snow     : 31   Median :60.00     Median :53.50
         Mean   :2015-06-01   Rain-Snow: 10   Mean   :58.93     Mean   :51.40
         3rd Qu.:2015-08-31   Fog-Rain :  8   3rd Qu.:76.00     3rd Qu.:68.00
         Max.   :2015-12-01   Fog-Snow :  7   Max.   :96.00     Max.   :84.00
                              (Other)  : 19

         Min.TemperatureF  Max.Dew.PointF   MeanDew.PointF    Min.DewpointF
         Min.   :-3.00     Min.   :-6.00    Min.   :-11.00    Min.   :-18.00
         1st Qu.:30.00     1st Qu.:32.00    1st Qu.: 24.00    1st Qu.: 16.25
         Median :46.00     Median :47.50    Median : 41.00    Median : 35.00
         Mean   :43.33     Mean   :45.48    Mean   : 38.96    Mean   : 32.25
         3rd Qu.:60.00     3rd Qu.:61.00    3rd Qu.: 56.00    3rd Qu.: 51.00
         Max.   :74.00     Max.   :75.00    Max.   : 71.00    Max.   : 68.00

          Max.Humidity       Mean.Humidity    Min.Humidity
         Min.   :  39.00    Min.   :28.00    Min.   :16.00
         1st Qu.:  73.25    1st Qu.:56.00    1st Qu.:35.00
         Median :  86.00    Median :66.00    Median :46.00
         Mean   :  85.69    Mean   :66.02    Mean   :48.31
         3rd Qu.:  93.00    3rd Qu.:76.75    3rd Qu.:60.00
         Max.   :1000.00    Max.   :98.00    Max.   :96.00

         Max.Sea.Level.PressureIn  Mean.Sea.Level.PressureIn
         Min.   :29.58             Min.   :29.49
         1st Qu.:30.00             1st Qu.:29.87
         Median :30.14             Median :30.03
  • 4.   Mean   :30.16             Mean   :30.04
         3rd Qu.:30.31             3rd Qu.:30.19
         Max.   :30.88             Max.   :30.77

         Min.Sea.Level.PressureIn  Max.VisibilityMiles  Mean.VisibilityMiles
         Min.   :29.16             Min.   : 2.000       Min.   :-1.000
         1st Qu.:29.76             1st Qu.:10.000       1st Qu.: 8.000
         Median :29.94             Median :10.000       Median :10.000
         Mean   :29.93             Mean   : 9.907       Mean   : 8.861
         3rd Qu.:30.09             3rd Qu.:10.000       3rd Qu.:10.000
         Max.   :30.64             Max.   :10.000       Max.   :10.000

         Min.VisibilityMiles  Max.Wind.SpeedMPH  Mean.Wind.SpeedMPH
         Min.   : 0.000       Min.   : 8.00      Min.   : 4.00
         1st Qu.: 2.000       1st Qu.:16.00      1st Qu.: 8.00
         Median :10.000       Median :20.00      Median :10.00
         Mean   : 6.716       Mean   :20.62      Mean   :10.68
         3rd Qu.:10.000       3rd Qu.:24.00      3rd Qu.:13.00
         Max.   :10.000       Max.   :38.00      Max.   :22.00

         Max.Gust.SpeedMPH  PrecipitationIn   CloudCover       WindDirDegrees
         Min.   : 0.00      Min.   :0.0000    Min.   :0.000    Min.   :  1.0
         1st Qu.:21.00      1st Qu.:0.0000    1st Qu.:3.000    1st Qu.:113.0
         Median :25.50      Median :0.0000    Median :5.000    Median :222.0
         Mean   :26.99      Mean   :0.1173    Mean   :4.708    Mean   :200.1
         3rd Qu.:31.25      3rd Qu.:0.0700    3rd Qu.:7.000    3rd Qu.:275.0
         Max.   :94.00      Max.   :2.9000    Max.   :8.000    Max.   :360.0
         NA's   :6          NA's   :49

    Screening and handling outliers
    There seem to be obvious outliers in the Mean.VisibilityMiles column and the Max.Humidity column.

        plot(weather_data_clean_tbl$date,
             weather_data_clean_tbl$Max.Humidity,
             ylab = 'Maximum Humidity',
             xlab = 'Date')
  • 5. Whatever the cause, this is clearly an invalid data point and needs to be fixed. I'm assuming that it is out by a factor of 10 and that dropping a zero should do the trick.

    To quickly find the row number of this error, which.max() from the base package is very handy.

        weather_data_clean_tbl$Max.Humidity %>% which.max()

        [1] 142

    The row number is 142 and can be quickly navigated to using dplyr's slice() function.

        weather_data_clean_tbl %>% slice(142) %>% glimpse()

        Observations: 1
        Variables: 23
        $ date                      <date> 2015-04-21
        $ Events                    <fct> Fog-Rain-Thunderstorm
        $ Max.TemperatureF          <dbl> 65
        $ Mean.TemperatureF         <dbl> 56
        $ Min.TemperatureF          <dbl> 46
        $ Max.Dew.PointF            <dbl> 57
        $ MeanDew.PointF            <dbl> 49
        $ Min.DewpointF             <dbl> 36
  • 6.   $ Max.Humidity              <dbl> 1000
        $ Mean.Humidity             <dbl> 71
        $ Min.Humidity              <dbl> 42
        $ Max.Sea.Level.PressureIn  <dbl> 29.75
        $ Mean.Sea.Level.PressureIn <dbl> 29.6
        $ Min.Sea.Level.PressureIn  <dbl> 29.53
        $ Max.VisibilityMiles       <dbl> 10
        $ Mean.VisibilityMiles      <dbl> 5
        $ Min.VisibilityMiles       <dbl> 0
        $ Max.Wind.SpeedMPH         <dbl> 20
        $ Mean.Wind.SpeedMPH        <dbl> 10
        $ Max.Gust.SpeedMPH         <dbl> 94
        $ PrecipitationIn           <dbl> 0.54
        $ CloudCover                <dbl> 6
        $ WindDirDegrees            <dbl> 184

    Let's knock off a zero from 1000 and replace it with 100.

        weather_data_clean_tbl$Max.Humidity[142] <- 100

    Further (not so obvious) errors
    When looking at a summary of the mean visibility miles, there appears to be another error: miles cannot be negative. Let's replace the -1 with 10.

        summary(weather_data_clean_tbl$Mean.VisibilityMiles)

           Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
         -1.000   8.000  10.000   8.861  10.000  10.000

        # row index of the smallest (negative) value; named min_row so it does not mask base::min()
        min_row = which.min(weather_data_clean_tbl$Mean.VisibilityMiles)
        weather_data_clean_tbl$Mean.VisibilityMiles[min_row] <- 10

    Handling NA Values
    Missing values are one of the most common problems when working with a dataset. They can cause a great deal of trouble and require careful thought. Recall the 3 types of missing data:
      • Missing completely at random (no relationship between the missing data and the circumstances)
      • Missing at random (circumstances that were observed cause some data to be missing, but not the missing value itself)
      • Missing not at random (circumstances cause data to be missing, and the value that is missing is related to the reason that it is missing)
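To make the difference between these types a little more concrete, here is a minimal sketch on made-up precipitation values (not the weather data itself). The vector `precip`, the positions that get blanked out, and the "gauge fails on the wettest days" story are all assumptions for illustration only.

```r
# Made-up daily precipitation values in inches (hypothetical, not from the dataset).
precip <- c(0, 0, 0, 0.01, 0.05, 0.10, 0.44, 1.09, 2.90, 0)

# Missing completely at random: the positions are picked without looking at the values.
precip_mcar <- precip
precip_mcar[c(2, 7)] <- NA

# Missing not at random: assume the gauge fails on the wettest days,
# so the missingness depends on the very value that goes missing.
precip_mnar <- precip
precip_mnar[precip > 1] <- NA

mean(precip)                      # mean of the complete data
mean(precip_mcar, na.rm = TRUE)   # can drift either way
mean(precip_mnar, na.rm = TRUE)   # too low here, because the large values are gone
```

In the third case no simple imputation recovers the lost information, which is one reason the choice of method needs subject matter knowledge.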
  • 7. Addressing the missing values
    Fixing NA values requires subject matter expertise. With this data set I chose to replace the NAs by imputation, using the median of each column. I chose the median because:
      - The mean is sensitive to outliers
      - The median is robust to outliers and not as heavily impacted by skewed data as the mean

    Let's get the percentage of NAs per column with three different methods:
      • summarise_all()
      • map_df()
      • plot_missing()

    Using summarise_all()
    Method 1:
        weather_data_clean_tbl %>%
            summarise_all(~ is.na(.) %>% sum() / length(.) * 100) %>%
            glimpse()

        Observations: 1
        Variables: 23
        $ date                      <dbl> 0
        $ Events                    <dbl> 0
        $ Max.TemperatureF          <dbl> 0
        $ Mean.TemperatureF         <dbl> 0
        $ Min.TemperatureF          <dbl> 0
        $ Max.Dew.PointF            <dbl> 0
        $ MeanDew.PointF            <dbl> 0
        $ Min.DewpointF             <dbl> 0
        $ Max.Humidity              <dbl> 0
        $ Mean.Humidity             <dbl> 0
        $ Min.Humidity              <dbl> 0
        $ Max.Sea.Level.PressureIn  <dbl> 0
        $ Mean.Sea.Level.PressureIn <dbl> 0
        $ Min.Sea.Level.PressureIn  <dbl> 0
        $ Max.VisibilityMiles       <dbl> 0
        $ Mean.VisibilityMiles      <dbl> 0
        $ Min.VisibilityMiles       <dbl> 0
        $ Max.Wind.SpeedMPH         <dbl> 0
        $ Mean.Wind.SpeedMPH        <dbl> 0
        $ Max.Gust.SpeedMPH         <dbl> 1.639344
        $ PrecipitationIn           <dbl> 13.38798
  • 8.   $ CloudCover                <dbl> 0
        $ WindDirDegrees            <dbl> 0

    Using purrr::map_df()
    Method 2:
        weather_data_clean_tbl %>%
            map_df(~ is.na(.) %>% sum() / length(.) * 100) %>%
            gather() %>%
            filter(value > 0)

        A tibble: 2 x 2
          key               value
          <chr>             <dbl>
        1 Max.Gust.SpeedMPH  1.64
        2 PrecipitationIn   13.4

    Using DataExplorer::plot_missing()
    Method 3:
        weather_data_clean_tbl %>% plot_missing()
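As a side note, `summarise_all()` and `gather()` have been superseded in newer dplyr and tidyr releases by `across()` and `pivot_longer()`. A rough equivalent of methods 1 and 2 might look like the sketch below. The toy tibble and its values are made up; only the column names are borrowed from the weather data, and `na_pct` is a hypothetical name.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the weather table (values are invented for illustration).
toy <- tibble(
  Max.Gust.SpeedMPH = c(29, NA, 38, 33),
  PrecipitationIn   = c(0.01, NA, NA, 0.00),
  CloudCover        = c(6, 7, 8, 3)
)

# mean(is.na(x)) is the share of NA values, so * 100 gives a percentage.
na_pct <- toy %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(everything(), names_to = "column", values_to = "pct_na") %>%
  filter(pct_na > 0)

na_pct

# Base R alternative that returns a named vector for all columns.
colMeans(is.na(toy)) * 100
```

On this toy data the result keeps only the two columns that contain NAs, at 25% and 50% respectively.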
  • 9. Replacing values programmatically
    The expression reads as follows: if a column is of numeric type, scan it for NAs and, where one is found, replace it with the median value of that column; otherwise leave the value as it is.

        (weather_data_clean_tbl2 <- weather_data_clean_tbl %>%
            mutate_if(is.numeric,
                      ~ if_else(condition = is.na(.),
                                true      = median(., na.rm = TRUE),
                                false     = .)))

        A tibble: 366 x 23
           date       Events Max.TemperatureF Mean.Temperatur~ Min.TemperatureF
           <date>     <fct>             <dbl>            <dbl>            <dbl>
         1 2014-12-01 Rain                 64               52               39
         2 2014-12-02 Rain-~               42               38               33
         3 2014-12-03 Rain                 51               44               37
         4 2014-12-04 None                 43               37               30
         5 2014-12-05 Rain                 42               34               26
         6 2014-12-06 Rain                 45               42               38
         7 2014-12-07 Rain                 38               30               21
         8 2014-12-08 Snow                 29               24               18
         9 2014-12-09 Rain                 49               39               29
        10 2014-12-10 Rain                 48               43               38
        ... with 356 more rows, and 18 more variables: Max.Dew.PointF <dbl>,
          MeanDew.PointF <dbl>, Min.DewpointF <dbl>, Max.Humidity <dbl>,
          Mean.Humidity <dbl>, Min.Humidity <dbl>,
          Max.Sea.Level.PressureIn <dbl>, Mean.Sea.Level.PressureIn <dbl>,
          Min.Sea.Level.PressureIn <dbl>, Max.VisibilityMiles <dbl>,
          Mean.VisibilityMiles <dbl>, Min.VisibilityMiles <dbl>,
          Max.Wind.SpeedMPH <dbl>, Mean.Wind.SpeedMPH <dbl>,
          Max.Gust.SpeedMPH <dbl>, PrecipitationIn <dbl>, CloudCover <dbl>,
          WindDirDegrees <dbl>
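The same median imputation can be tried on a tiny table where the result is easy to check by hand. This is only a sketch: the tibble `toy` and its values are made up, and it uses `across()` with `where()`, which newer dplyr versions offer in place of `mutate_if()`.

```r
library(dplyr)

# Toy table: one factor column (left untouched) and one numeric column with NAs.
toy <- tibble(
  Events          = factor(c("Rain", "None", "Snow", "Rain", "None")),
  PrecipitationIn = c(0.10, NA, 0.40, 0.00, NA)
)

# For every numeric column, swap each NA for that column's median.
toy_imputed <- toy %>%
  mutate(across(where(is.numeric),
                ~ if_else(is.na(.x), median(.x, na.rm = TRUE), .x)))

toy_imputed$PrecipitationIn
```

The median of the three observed values (0.00, 0.10, 0.40) is 0.10, so both NAs become 0.10 while the Events column passes through unchanged.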
  • 10. Check for NA values again using the summary function

        summary(weather_data_clean_tbl2)

              date                  Events     Max.TemperatureF  Mean.TemperatureF
         Min.   :2014-12-01   None     :201   Min.   :18.00     Min.   : 8.00
         1st Qu.:2015-03-02   Rain     : 90   1st Qu.:42.00     1st Qu.:36.25
         Median :2015-06-01   Snow     : 31   Median :60.00     Median :53.50
         Mean   :2015-06-01   Rain-Snow: 10   Mean   :58.93     Mean   :51.40
         3rd Qu.:2015-08-31   Fog-Rain :  8   3rd Qu.:76.00     3rd Qu.:68.00
         Max.   :2015-12-01   Fog-Snow :  7   Max.   :96.00     Max.   :84.00
                              (Other)  : 19

         Min.TemperatureF  Max.Dew.PointF   MeanDew.PointF    Min.DewpointF
         Min.   :-3.00     Min.   :-6.00    Min.   :-11.00    Min.   :-18.00
         1st Qu.:30.00     1st Qu.:32.00    1st Qu.: 24.00    1st Qu.: 16.25
         Median :46.00     Median :47.50    Median : 41.00    Median : 35.00
         Mean   :43.33     Mean   :45.48    Mean   : 38.96    Mean   : 32.25
         3rd Qu.:60.00     3rd Qu.:61.00    3rd Qu.: 56.00    3rd Qu.: 51.00
         Max.   :74.00     Max.   :75.00    Max.   : 71.00    Max.   : 68.00

          Max.Humidity      Mean.Humidity    Min.Humidity     Max.Sea.Level.PressureIn
         Min.   : 39.00    Min.   :28.00    Min.   :16.00    Min.   :29.58
         1st Qu.: 73.25    1st Qu.:56.00    1st Qu.:35.00    1st Qu.:30.00
         Median : 86.00    Median :66.00    Median :46.00    Median :30.14
         Mean   : 83.23    Mean   :66.02    Mean   :48.31    Mean   :30.16
         3rd Qu.: 93.00    3rd Qu.:76.75    3rd Qu.:60.00    3rd Qu.:30.31
         Max.   :100.00    Max.   :98.00    Max.   :96.00    Max.   :30.88

         Mean.Sea.Level.PressureIn  Min.Sea.Level.PressureIn  Max.VisibilityMiles
         Min.   :29.49              Min.   :29.16             Min.   : 2.000
         1st Qu.:29.87              1st Qu.:29.76             1st Qu.:10.000
         Median :30.03              Median :29.94             Median :10.000
         Mean   :30.04              Mean   :29.93             Mean   : 9.907
         3rd Qu.:30.19              3rd Qu.:30.09             3rd Qu.:10.000
         Max.   :30.77              Max.   :30.64             Max.   :10.000

         Mean.VisibilityMiles  Min.VisibilityMiles  Max.Wind.SpeedMPH
         Min.   : 1.000        Min.   : 0.000       Min.   : 8.00
         1st Qu.: 8.000        1st Qu.: 2.000       1st Qu.:16.00
         Median :10.000        Median :10.000       Median :20.00
         Mean   : 8.891        Mean   : 6.716       Mean   :20.62
         3rd Qu.:10.000        3rd Qu.:10.000       3rd Qu.:24.00
         Max.   :10.000        Max.   :10.000       Max.   :38.00
  • 11.  Mean.Wind.SpeedMPH  Max.Gust.SpeedMPH  PrecipitationIn   CloudCover
         Min.   : 4.00       Min.   : 0.00      Min.   :0.0000    Min.   :0.000
         1st Qu.: 8.00       1st Qu.:21.00      1st Qu.:0.0000    1st Qu.:3.000
         Median :10.00       Median :25.50      Median :0.0000    Median :5.000
         Mean   :10.68       Mean   :26.96      Mean   :0.1016    Mean   :4.708
         3rd Qu.:13.00       3rd Qu.:31.00      3rd Qu.:0.0400    3rd Qu.:7.000
         Max.   :22.00       Max.   :94.00      Max.   :2.9000    Max.   :8.000

         WindDirDegrees
         Min.   :  1.0
         1st Qu.:113.0
         Median :222.0
         Mean   :200.1
         3rd Qu.:275.0
         Max.   :360.0
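Scanning a long summary() printout for a leftover NA's row is easy to get wrong, so a programmatic check can back it up. A minimal base R sketch follows; the data frame `toy_imputed` is a made-up stand-in for the imputed table, and `na_counts` is a hypothetical name.

```r
# Made-up stand-in for the imputed table (values are invented).
toy_imputed <- data.frame(
  Max.Gust.SpeedMPH = c(29, 25.5, 38, 33),
  PrecipitationIn   = c(0.01, 0, 0, 0.44)
)

# Count the remaining NA values per column.
na_counts <- colSums(is.na(toy_imputed))
na_counts

# Stop with an error if any NA survived the imputation step.
stopifnot(!anyNA(toy_imputed))
```

Every count should be zero, and the final stopifnot() call turns the visual check into one that fails loudly.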