5. Working on data using R: cleaning, filtering, transformation, sampling
1. Working on data (cleaning, filtering, transformation, sampling, visualization)
K K Singh, Dept. of CSE, RGUKT Nuzvid
8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid
2. Exploring DATA
cd <- read.table('custData.csv', sep=',', header=TRUE)
Once we’ve loaded the data into R, we’ll want to examine it.
class(): tells us what type of R object we have. In our case, cd is a data frame.
summary(): gives a summary of almost any R object.
str(): gives the structure of a data table/frame (column types and first values).
names(): gives the column names of a data table/frame.
dim(): gives the number of rows and columns of the data.
Data exploration uses a combination of summary statistics (means and
medians, variances, and counts) and visualization. You can spot some
problems just by using summary statistics; other problems are easier to
find visually.
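The exploration functions listed above can be tried on a small in-memory data frame. This is only a sketch: the data frame below is a made-up stand-in for custData.csv, and its column names and values are assumptions.

```r
# Hypothetical stand-in for custData.csv; column names and values are made up.
cd <- data.frame(
  custid = 1:5,
  age    = c(34, 51, NA, 28, 45),
  income = c(42000, 58000, 31000, NA, 76000),
  state  = c("AP", "TS", "AP", "KA", "TS"),
  stringsAsFactors = FALSE
)

class(cd)    # type of the object: "data.frame"
dim(cd)      # number of rows, then number of columns
names(cd)    # column names
str(cd)      # compact view of each column's type and first values
summary(cd)  # per-column summary; numeric columns also report NA counts
```

Note that summary() reports the NA count for age and income, which is often the first hint that a column needs cleaning.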
3. OTHER DATA FORMATS
.csv is not the only common data file format you’ll encounter. Other formats include
.tsv (tab-separated values),
pipe-separated files,
Microsoft Excel workbooks,
JSON data,
and XML.
R’s built-in read.table() command can be made to read most separated value formats.
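As a rough illustration of reading other separated-value formats, the sketch below changes only the sep argument of read.table(). The inline text stands in for files on disk, and its contents are made up. With a real file, the file name would be passed as the first argument instead of text=.

```r
# Inline text stands in for files on disk; the contents are made up.
tsv_text  <- "custid\tage\tincome\n1\t34\t42000\n2\t51\t58000"
pipe_text <- "custid|age|income\n1|34|42000\n2|51|58000"

# sep sets the delimiter; header=TRUE treats the first line as column names.
tsv_data  <- read.table(text = tsv_text,  sep = "\t", header = TRUE)
pipe_data <- read.table(text = pipe_text, sep = "|",  header = TRUE)

# Both calls should produce the same data frame.
print(tsv_data)
print(all.equal(tsv_data, pipe_data))
```

Excel workbooks, JSON, and XML are not separated-value formats, so they generally need add-on packages rather than read.table().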
4.
library(data.table)  # fread() is provided by the data.table package
custdata <- fread("custData.csv")
summary(custdata)
5. Typical problems revealed by data summaries
Missing values
Invalid values and outliers
6. Typical problems revealed by data summaries
Data range
Units
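A minimal sketch of how summary() can surface these problems. The records below are invented so that each column shows one kind of problem; they are not taken from the lecture's dataset.

```r
# Made-up customer records; values are chosen to show each problem.
custdata <- data.frame(
  is.employed = c(TRUE, NA, FALSE, NA, TRUE, TRUE),
  income      = c(52000, -8700, 31000, 615000, 44000, 28000),
  age         = c(38, 0, 52, 146.7, 27, 61)
)

summary(custdata$is.employed)  # missing values: look at the NA's count
summary(custdata$income)       # invalid value: the minimum is negative
summary(custdata$age)          # outliers: ages of 0 and above 140

# The same checks written as explicit counts.
n_missing  <- sum(is.na(custdata$is.employed))
n_negative <- sum(custdata$income < 0)
age_range  <- range(custdata$age)
```

The data range and units deserve the same scrutiny: an income column whose maximum is a few hundred may be recorded in thousands rather than in single currency units.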
7. Data Cleaning
Fundamentally, there are two things you can do with missing values: drop the
rows with missing values, or convert the missing values to a meaningful value.
If the missing data represents a fairly small fraction of the dataset, it’s probably safe
just to drop these customers from your analysis. But if the fraction is significant, what do you
do then?
For a categorical variable such as is.employed, the most straightforward solution is
to create a new category for the variable, called missing.
f <- ifelse(is.na(custdata$is.employed), "missing",
     ifelse(custdata$is.employed, "employed", "not_employed"))
summary(as.factor(f))
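The other option mentioned above, dropping rows, can be sketched with complete.cases(). The data is made up. The last step, filling a missing numeric value with the column mean, is an assumption about what a "meaningful value" could be for a numeric column, not something the lecture prescribes.

```r
# Made-up data; column names follow the lecture's custdata example.
custdata <- data.frame(
  custid      = 1:6,
  is.employed = c(TRUE, NA, FALSE, NA, TRUE, TRUE),
  income      = c(52000, 18000, 31000, NA, 44000, 28000)
)

# Fraction of rows with at least one missing value.
frac_missing <- mean(!complete.cases(custdata))

# Option 1: drop every row that contains a missing value.
dropped <- custdata[complete.cases(custdata), ]

# Option 2 (assumption): replace a missing numeric value with the column mean.
mean_income <- mean(custdata$income, na.rm = TRUE)
custdata$income.fix <- ifelse(is.na(custdata$income), mean_income, custdata$income)
```

Checking frac_missing first helps decide between the two options: here a third of the rows would be lost by dropping.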
9. Data transformations
The purpose of data transformation is to make data easier to model—and easier to
understand. For example, the cost of living will vary from state to state, so what would
be a high salary in one region could be barely enough to scrape by in another. If you
want to use income as an input to your insurance model, it might be more meaningful
to normalize a customer’s income by the typical income in the area where they live.
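# medianincome: a table with one row per state (columns State and Median.Income)
custdata <- merge(custdata, medianincome, by.x="state.of.res", by.y="State")
summary(custdata[,c("state.of.res", "income", "Median.Income")])
# Works for both data frames and data tables:
custdata$income.norm <- with(custdata, income/Median.Income)
# Or, only when custdata is a data.table (as returned by fread):
custdata$income.norm <- custdata[, income/Median.Income]
summary(custdata$income.norm)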
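Since medianincome is not defined on the slide, here is a self-contained sketch of the same merge-then-normalize step. Both tables and all their values are made up for illustration.

```r
# Made-up customers and made-up median incomes per state.
custdata <- data.frame(
  custid       = 1:4,
  state.of.res = c("Ohio", "Texas", "Ohio", "Texas"),
  income       = c(50000, 30000, 25000, 90000)
)
medianincome <- data.frame(
  State         = c("Ohio", "Texas"),
  Median.Income = c(50000, 60000)
)

# Attach each customer's state median, matching state.of.res to State.
custdata <- merge(custdata, medianincome, by.x = "state.of.res", by.y = "State")

# Income relative to the state median: 1 means "typical for that state".
custdata$income.norm <- with(custdata, income / Median.Income)

# merge() reorders rows by the key, so sort by custid before inspecting.
custdata <- custdata[order(custdata$custid), ]
print(custdata[, c("custid", "state.of.res", "income.norm")])
```

A value above 1 means the customer earns more than the median of their own state, regardless of how high the absolute income is.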
10. CONVERTING CONTINUOUS VARIABLES TO DISCRETE
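For some continuous variables, the exact value matters less than whether it
falls in a certain range. In these cases, you might want to convert the
continuous age and income variables into ranges, or discrete variables.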
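One way to do this conversion in base R is cut() for ranges and a simple comparison for a yes/no threshold. The values and the break points (25, 65, 20000) below are arbitrary choices for illustration.

```r
# Made-up ages; the break points are arbitrary examples.
age  <- c(19, 25, 37, 52, 65, 71)
brks <- c(0, 25, 65, Inf)

# cut() assigns each age to an interval; include.lowest=TRUE keeps age 0 in range.
age.range <- cut(age, breaks = brks, include.lowest = TRUE)
print(table(age.range))

# A continuous variable can also be reduced to a single threshold test.
income <- c(8000, 20000, 45000, 120000)
income.lt.20K <- income < 20000
```

By default the intervals are closed on the right, so an age of exactly 25 lands in the first range and 65 in the second.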
11. NORMALIZATION AND RESCALING
Normalization is useful when absolute quantities are less meaningful than relative ones.
For example, you might be less interested in a customer’s absolute age than in how old or young
they are relative to a “typical” customer. Let’s take the mean age of your customers to be the typical
age. You can normalize by that, as shown in the following listing.
summary(custdata$age)
meanage <- mean(custdata$age)
custdata$age.normalized <- custdata$age/meanage
summary(custdata$age.normalized)
12. Data Sampling
Sampling is the process of selecting a subset of a population to
represent the whole, during analysis and modeling.
It’s easier to test and debug the code on small subsamples before
training the model on the entire dataset. Visualization can also be easier
with a subsample of the data.
The other reason to sample your data is to create test and training
splits.
13.
A convenient way to manage random sampling is to add a sample group column to the data frame. The
sample group column contains a number generated uniformly from zero to one, using the runif function. You
can draw a random sample of arbitrary size from the data frame by using the appropriate threshold on the
sample group column.
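The sample group idea can be sketched as follows. The data is synthetic, the column name gp and the 10% / 2% thresholds are arbitrary choices, and the seed is there only to make the split repeatable.

```r
set.seed(25643)  # arbitrary seed, only so the split is reproducible

# Synthetic data standing in for the customer table.
custdata <- data.frame(custid = 1:1000,
                       income = round(rlnorm(1000, meanlog = 10, sdlog = 1)))

# Sample group column: one uniform random number in [0, 1] per row.
custdata$gp <- runif(nrow(custdata))

# Roughly 10% of rows go to the test set, the rest to the training set.
testSet  <- subset(custdata, gp <= 0.1)
trainSet <- subset(custdata, gp > 0.1)

# A smaller threshold gives a smaller subsample, e.g. about 2% for debugging.
small <- subset(custdata, gp <= 0.02)

print(c(test = nrow(testSet), train = nrow(trainSet), small = nrow(small)))
```

Because the group number is stored with the data, the same rows end up in the same split every time the thresholds are reapplied, and the 2% subsample is always contained in the test set.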
14. Data visualization (refer to the lecture on graph plotting)
Visually checking distributions for a single variable
What is the peak value of the distribution?
How many peaks are there in the distribution (unimodality versus bimodality)?
How normal (or lognormal) is the data?
How much does the data vary? Is it concentrated in a certain interval or in a certain
category?
Is there a relationship between the two inputs age and income in my data?
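A rough sketch of plots that help answer these questions, using base R graphics on synthetic data. The graph plotting lecture may use a different plotting library; base graphics is an assumption here, as are the simulated age and income values.

```r
set.seed(1)  # arbitrary seed for repeatable synthetic data

# Synthetic stand-ins for the age and income columns.
age    <- round(rnorm(500, mean = 45, sd = 12))
income <- rlnorm(500, meanlog = 10.5, sdlog = 0.8)

# Histogram: peak value, number of modes, and spread of a single variable.
hist(age, breaks = 20, main = "Age", xlab = "age")

# Density plot on a log scale: a lognormal variable looks roughly bell-shaped.
plot(density(log10(income)), main = "Income (log10 scale)", xlab = "log10(income)")

# Scatter plot: is there a relationship between age and income?
plot(age, income, log = "y", main = "Income vs. age")
```

In this synthetic data age and income are generated independently, so the scatter plot should show no clear trend.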
15. Graph types and their uses
1. Line plot: shows the relationship between two continuous variables. Best when
that relationship is functional.
2. Scatter plot: shows the relationship between two continuous variables. Best when the
relationship is too loose or cloud-like to be seen on a line plot.
3. Bar chart: shows the relationship between two categorical variables (var1 and var2).
Highlights the frequencies of each value of var1.
4. Bar chart with faceting: shows the relationship between two categorical variables
(var1 and var2). Best for comparing the relative frequencies of each value of var2
within each value of var1 when var2 takes on more than two values.
5. Histogram or density plot: examines the data range, checks the number of modes,
checks whether the distribution is normal/lognormal, and checks for anomalies and
outliers. (Use a log scale to visualize data that is heavily skewed.)
6. Box-and-whisker plot (boxplot): presents information from a five-number summary.
Useful for indicating whether a distribution is skewed and whether there are
potential unusual observations (outliers). Very useful when large numbers of
observations are involved and when two or more data sets are being compared.
16. Assignments
data(nycflights)  # assumes a package that provides nycflights (e.g. openintro) is loaded
1. Create a new data frame that includes flights headed to SFO in February,
and save this data frame as sfo_feb_flights. How many such records are
there?
2. Calculate the median and interquartile range of the arrival delays (arr_delay)
of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier
has the highest IQR of arrival delays?
3. Considering the data from all the NYC airports, which month has the
highest average departure delay?
4. What was the worst day to fly out of NYC in 2013 if you dislike delayed
flights?
5. Make a histogram and calculate appropriate summary statistics for
arrival delays of sfo_feb_flights. Which of the following is false?