Working on data (cleaning, filtering, transformation, sampling, visualization)
K K Singh, Dept. of CSE, RGUKT Nuzvid
8/19/2017
Exploring DATA
• cd <- read.table("custData.csv", sep = ",", header = TRUE)
• Once we've loaded the data into R, we'll want to examine it.
• class(): Tells us what type of R object we have. In our case, cd is a data frame, because that is what read.table() returns.
• summary(): Gives a summary of almost any R object.
• str(): Gives the structure of a data table/frame (each column with its type and first few values).
• names(): Gives the column names of a data table/frame.
• dim(): Gives the number of rows and columns of the data.
• Data exploration uses a combination of summary statistics (means and medians, variances, and counts) and visualization. You can spot some problems just by using summary statistics; other problems are easier to find visually.
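The inspection functions above can be tried without the original file. The sketch below builds a small stand-in data frame; the column names and values are hypothetical and are not taken from custData.csv.

```r
# Toy stand-in for custData.csv (hypothetical values, for illustration only)
cd <- data.frame(
  custid      = c(101, 102, 103, 104),
  age         = c(34, 51, 27, 45),
  income      = c(42000, 58000, NA, 39000),
  is.employed = c(TRUE, NA, TRUE, FALSE)
)

class(cd)    # type of the object: "data.frame"
dim(cd)      # number of rows, then number of columns
names(cd)    # column names
str(cd)      # compact structure: one line per column with its type
summary(cd)  # per-column summary statistics, including NA counts
```

In the summary() output, the NA's entry under income and is.employed is usually the first hint that a column has missing values.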
OTHER DATA FORMATS
• .csv is not the only common data file format you'll encounter. Other formats include
  • .tsv (tab-separated values),
  • pipe-separated files,
  • Microsoft Excel workbooks,
  • JSON data,
  • and XML.
• R's built-in read.table() command can be made to read most separated-value formats.
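As a minimal sketch of that last point, the same read.table() call reads tab-separated and pipe-separated files once the sep argument is changed. The two files here are written to a temporary location with made-up contents so the example is self-contained.

```r
# Write two small files to a temporary location (hypothetical contents)
tsv_path  <- tempfile(fileext = ".tsv")
pipe_path <- tempfile(fileext = ".txt")
writeLines(c("custid\tage", "101\t34", "102\t51"), tsv_path)
writeLines(c("custid|age", "101|34", "102|51"), pipe_path)

# Same function, different sep argument
tsv_data  <- read.table(tsv_path,  sep = "\t", header = TRUE)
pipe_data <- read.table(pipe_path, sep = "|",  header = TRUE)

tsv_data
pipe_data
```

Excel workbooks, JSON, and XML are not separated-value formats, so read.table() does not cover them; they generally need add-on packages.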
• library(data.table)  # fread() comes from the data.table package
• custdata <- fread("custData.csv")
• summary(custdata)
Typical problems revealed by data summaries
• MISSING VALUES
• INVALID VALUES AND OUTLIERS
• DATA RANGE
• UNITS
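A short sketch of how these problems show up in practice. The data frame below is hypothetical, with the problems planted on purpose; the checks use only base R.

```r
# Hypothetical customer data with deliberately planted problems
custdata <- data.frame(
  age    = c(34, 0, 51, 146, 27),          # 0 and 146 are implausible ages
  income = c(42000, -8700, NA, 58000, NA)  # one negative income, two missing values
)

summary(custdata)                       # Min/Max expose odd values; NA's counts missing entries
colSums(is.na(custdata))                # missing values per column
range(custdata$age)                     # data range of age
sum(custdata$income < 0, na.rm = TRUE)  # number of negative incomes
```

Unit problems (for example, income recorded in dollars in some rows and in thousands of dollars in others) cannot be detected by a function call alone; they usually show up as a range that looks too wide or too narrow for the variable.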
Data Cleaning
• Fundamentally, there are two things you can do with missing values: drop the rows that contain them, or convert the missing values to a meaningful value.
• If the missing data represents a fairly small fraction of the dataset, it's probably safe just to drop these customers from your analysis. But if the fraction is significant, what do you do then?
• The most straightforward solution for a categorical variable is to create a new category for the variable, called missing.
• f <- ifelse(is.na(custdata$is.employed), "missing", ifelse(custdata$is.employed == TRUE, "employed", "not_employed"))
• summary(as.factor(f))
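Both options can be compared side by side on a small example. This is a sketch on a hypothetical five-row data frame; the column name is.employed.fix is introduced here for illustration.

```r
# Hypothetical data: two of five customers have an unknown employment status
custdata <- data.frame(
  custid      = 1:5,
  is.employed = c(TRUE, NA, FALSE, NA, TRUE)
)

# Option 1: drop the rows that contain a missing value
dropped <- custdata[complete.cases(custdata), ]
nrow(dropped)

# Option 2: keep every row and recode NA as its own category
custdata$is.employed.fix <- ifelse(
  is.na(custdata$is.employed), "missing",
  ifelse(custdata$is.employed, "employed", "not_employed")
)
table(custdata$is.employed.fix)
```

Option 1 loses 40% of this toy data set, which is why dropping rows is only reasonable when the missing fraction is small.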
Data transformations
The purpose of data transformation is to make data easier to model and easier to understand. For example, the cost of living will vary from state to state, so what would be a high salary in one region could be barely enough to scrape by in another. If you want to use income as an input to your insurance model, it might be more meaningful to normalize a customer's income by the typical income in the area where they live. Here medianincome is a table of median income per state, with columns State and Median.Income.
custdata <- merge(custdata, medianincome, by.x = "state.of.res", by.y = "State")
summary(custdata[, c("state.of.res", "income", "Median.Income")])
custdata$income.norm <- with(custdata, income / Median.Income)
OR, if custdata is a data.table (for example, one read with fread()):
custdata$income.norm <- custdata[, income / Median.Income]
summary(custdata$income.norm)
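A worked version of the merge-and-normalize step on toy data. The states, incomes, and median values below are made up for illustration; only the column names follow the listing above.

```r
# Hypothetical customers and hypothetical per-state median incomes
custdata <- data.frame(
  custid       = 1:4,
  state.of.res = c("Ohio", "Texas", "Ohio", "Texas"),
  income       = c(40000, 60000, 50000, 45000)
)
medianincome <- data.frame(
  State         = c("Ohio", "Texas"),
  Median.Income = c(50000, 60000)
)

# Attach each customer's state median, then divide income by it
custdata <- merge(custdata, medianincome, by.x = "state.of.res", by.y = "State")
custdata$income.norm <- with(custdata, income / Median.Income)

# merge() reorders rows by the key, so sort by custid for display
custdata[order(custdata$custid), c("custid", "income", "Median.Income", "income.norm")]
```

A value of income.norm below 1 means the customer earns less than the median for their state, and a value above 1 means they earn more.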
CONVERTING CONTINUOUS VARIABLES TO DISCRETE
• For some continuous variables, the exact value matters less than the range it falls into. In these cases, you might want to convert the continuous age and income variables into ranges, or discrete variables.
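One way to do this in base R is cut(), which assigns each value to an interval. The sketch below uses hypothetical ages and incomes, and the break points (25, 65, and 20000) are assumptions chosen for illustration, not values from the slides.

```r
# Hypothetical values
age    <- c(19, 25, 40, 65, 80)
income <- c(0, 15000, 42000, 99000, 250000)

# right = FALSE makes each interval closed on the left: [0,25), [25,65), [65,Inf)
age.range <- cut(age, breaks = c(0, 25, 65, Inf), right = FALSE)
table(age.range)

# A two-level variable: is income below the threshold or not?
income.lt.20K <- income < 20000
table(income.lt.20K)
```

The result of cut() is a factor, so it can be used directly as a categorical input to a model or as a grouping variable in a table or plot.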
NORMALIZATION AND RESCALING
Normalization is useful when absolute quantities are less meaningful than relative ones.
• For example, you might be less interested in a customer's absolute age than in how old or young they are relative to a "typical" customer. Let's take the mean age of your customers to be the typical age. You can normalize by that, as shown in the following listing.
• summary(custdata$age)
• meanage <- mean(custdata$age)
• custdata$age.normalized <- custdata$age / meanage
• summary(custdata$age.normalized)
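To make the result easier to interpret, here is the same calculation on four hypothetical ages.

```r
# Hypothetical ages
age <- c(20, 30, 40, 50)

# mean() returns NA if the vector contains NA; pass na.rm = TRUE in that case
meanage <- mean(age)
age.normalized <- age / meanage

round(age.normalized, 2)
mean(age.normalized)  # the normalized values always average to 1
```

A normalized age well below 1 marks a customer who is much younger than typical, and a value well above 1 marks one who is much older.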
Data Sampling
• Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling.
• It's easier to test and debug code on small subsamples before training the model on the entire dataset. Visualization can also be easier with a subsample of the data.
• The other reason to sample your data is to create test and training splits.
A convenient way to manage random sampling is to add a sample group column to the data frame. The sample group column contains a number generated uniformly from zero to one, using the runif function. You can draw a random sample of arbitrary size from the data frame by using the appropriate threshold on the sample group column.
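The sample group column can be sketched as follows. The column name gp, the 10% threshold, and the 1000-row toy data frame are assumptions made for this example.

```r
set.seed(42)  # fixed seed so the split is reproducible
custdata <- data.frame(custid = 1:1000)  # hypothetical data

# One uniform random number between 0 and 1 per row
custdata$gp <- runif(nrow(custdata))

# Roughly 10% of the rows go to the test set, the rest to training
testSet     <- subset(custdata, gp <= 0.1)
trainingSet <- subset(custdata, gp > 0.1)

nrow(testSet)
nrow(trainingSet)
```

Because the group number is stored with the data, the same split can be reproduced later, and a different sample size only needs a different threshold (for example gp <= 0.25 for about a quarter of the rows).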
Data visualization (refer to the lecture on graph plotting)
• Visually checking distributions for a single variable
  • What is the peak value of the distribution?
  • How many peaks are there in the distribution (unimodality versus bimodality)?
  • How normal (or lognormal) is the data?
  • How much does the data vary? Is it concentrated in a certain interval or in a certain category?
• Visually checking relationships between two variables
  • Is there a relationship between the two inputs age and income in my data?
Graph types and their uses
1. Line plot: Shows the relationship between two continuous variables. Best when that relationship is functional.
2. Scatter plot: Shows the relationship between two continuous variables. Best when the relationship is too loose or cloud-like to be seen on a line plot.
3. Bar chart: Shows the relationship between two categorical variables (var1 and var2). Highlights the frequencies of each value of var1.
4. Bar chart with faceting: Shows the relationship between two categorical variables (var1 and var2). Best for comparing the relative frequencies of each value of var2 within each value of var1 when var2 takes on more than two values.
5. Histogram or density plot: Examines the data range, checks the number of modes, checks whether the distribution is normal/lognormal, and checks for anomalies and outliers. (Use a log scale to visualize data that is heavily skewed.)
6. Box-and-whisker plot (boxplot): Presents information from a five-number summary. Useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers). Very useful when large numbers of observations are involved and when two or more data sets are being compared.
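Most of these graph types have a one-line equivalent in base R graphics. The sketch below uses simulated data; the variable names and distribution parameters are assumptions, and the plotting lecture referred to above may use a different graphics package. Faceted bar charts (item 4) are not shown because base R has no single call for them.

```r
set.seed(1)
# Simulated, hypothetical customer data
age     <- round(runif(200, 18, 80))
income  <- 1000 * rlnorm(200, meanlog = 3.7, sdlog = 0.6)  # right-skewed values
status  <- sample(c("married", "single", "widowed"), 200, replace = TRUE)
insured <- sample(c(TRUE, FALSE), 200, replace = TRUE)

x <- seq(0, 10, by = 0.5)
plot(x, x^2, type = "l")          # 1. line plot of a functional relationship
plot(age, income)                 # 2. scatter plot of two continuous variables
barplot(table(insured, status),   # 3. bar chart of two categorical variables
        beside = TRUE, legend.text = TRUE)
hist(income)                      # 5. histogram: range, modes, outliers
plot(density(log10(income)))      # 5. density plot on a log scale for skewed data
boxplot(income ~ status)          # 6. box-and-whisker plot, one box per group
```

Run interactively, each call replaces the previous plot in the graphics window; run as a script, the plots are written to a PDF file instead.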
Assignments
• Load the nycflights data set (for example with data(nycflights), once the package that provides it is attached).
• 1. Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many such records are there?
• 2. Calculate the median and interquartile range for the arr_delay values of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the highest IQR of arrival delays?
• 3. Considering the data from all the NYC airports, which month has the highest average departure delay?
• 4. What was the worst day to fly out of NYC in 2013 if you dislike delayed flights?
• 5. Make a histogram and calculate appropriate summary statistics for arrival delays of sfo_feb_flights. Which of the following is false?
8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid
16
5. working on data using R -Cleaning, filtering ,transformation, Sampling

More Related Content

What's hot

Manipulating Data using base R package
Manipulating Data using base R package Manipulating Data using base R package
Manipulating Data using base R package Rupak Roy
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformationTim Essam
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformationLaura Hughes
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply FunctionSakthi Dasans
 
R code descriptive statistics of phenotypic data by Avjinder Kaler
R code descriptive statistics of phenotypic data by Avjinder KalerR code descriptive statistics of phenotypic data by Avjinder Kaler
R code descriptive statistics of phenotypic data by Avjinder KalerAvjinder (Avi) Kaler
 
Data handling in r
Data handling in rData handling in r
Data handling in rAbhik Seal
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on rAbhik Seal
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat SheetLaura Hughes
 
Stata cheat sheet: data processing
Stata cheat sheet: data processingStata cheat sheet: data processing
Stata cheat sheet: data processingTim Essam
 
SAS and R Code for Basic Statistics
SAS and R Code for Basic StatisticsSAS and R Code for Basic Statistics
SAS and R Code for Basic StatisticsAvjinder (Avi) Kaler
 
Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyrRomain Francois
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADataconomy Media
 
5 R Tutorial Data Visualization
5 R Tutorial Data Visualization5 R Tutorial Data Visualization
5 R Tutorial Data VisualizationSakthi Dasans
 
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageJanuary 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageZurich_R_User_Group
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in RJeffrey Breen
 
R getting spatial
R getting spatialR getting spatial
R getting spatialFAO
 

What's hot (20)

3 Data Structure in R
3 Data Structure in R3 Data Structure in R
3 Data Structure in R
 
Manipulating Data using base R package
Manipulating Data using base R package Manipulating Data using base R package
Manipulating Data using base R package
 
Stata cheat sheet: data transformation
Stata  cheat sheet: data transformationStata  cheat sheet: data transformation
Stata cheat sheet: data transformation
 
Stata cheatsheet transformation
Stata cheatsheet transformationStata cheatsheet transformation
Stata cheatsheet transformation
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
R code for data manipulation
R code for data manipulationR code for data manipulation
R code for data manipulation
 
R code descriptive statistics of phenotypic data by Avjinder Kaler
R code descriptive statistics of phenotypic data by Avjinder KalerR code descriptive statistics of phenotypic data by Avjinder Kaler
R code descriptive statistics of phenotypic data by Avjinder Kaler
 
Data handling in r
Data handling in rData handling in r
Data handling in r
 
Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
 
Stata Programming Cheat Sheet
Stata Programming Cheat SheetStata Programming Cheat Sheet
Stata Programming Cheat Sheet
 
Stata cheat sheet: data processing
Stata cheat sheet: data processingStata cheat sheet: data processing
Stata cheat sheet: data processing
 
SAS and R Code for Basic Statistics
SAS and R Code for Basic StatisticsSAS and R Code for Basic Statistics
SAS and R Code for Basic Statistics
 
Data manipulation with dplyr
Data manipulation with dplyrData manipulation with dplyr
Data manipulation with dplyr
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
 
Basic Analysis using Python
Basic Analysis using PythonBasic Analysis using Python
Basic Analysis using Python
 
Basic Analysis using R
Basic Analysis using RBasic Analysis using R
Basic Analysis using R
 
5 R Tutorial Data Visualization
5 R Tutorial Data Visualization5 R Tutorial Data Visualization
5 R Tutorial Data Visualization
 
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table packageJanuary 2016 Meetup: Speeding up (big) data manipulation with data.table package
January 2016 Meetup: Speeding up (big) data manipulation with data.table package
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 
R getting spatial
R getting spatialR getting spatial
R getting spatial
 

Similar to 5. working on data using R -Cleaning, filtering ,transformation, Sampling

SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSijdkp
 
A frame work for clustering time evolving data
A frame work for clustering time evolving dataA frame work for clustering time evolving data
A frame work for clustering time evolving dataiaemedu
 
Accounting serx
Accounting serxAccounting serx
Accounting serxzeer1234
 
Accounting serx
Accounting serxAccounting serx
Accounting serxzeer1234
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using RUmmiya Mohammedi
 
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...Happiest Minds Technologies
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docxscet315
 
QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE cscpconf
 
BSA_AML Rule Tuning
BSA_AML Rule TuningBSA_AML Rule Tuning
BSA_AML Rule TuningMayank Johri
 
Approach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsApproach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsMayank Johri
 
Finding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsFinding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsCSCJournals
 
A Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data MiningA Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data MiningIRJET Journal
 
Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27IJARIIE JOURNAL
 
Drsp dimension reduction for similarity matching and pruning of time series ...
Drsp  dimension reduction for similarity matching and pruning of time series ...Drsp  dimension reduction for similarity matching and pruning of time series ...
Drsp dimension reduction for similarity matching and pruning of time series ...IJDKP
 
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...IJECEIAES
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Kazi Toufiq Wadud
 

Similar to 5. working on data using R -Cleaning, filtering ,transformation, Sampling (20)

SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
 
A frame work for clustering time evolving data
A frame work for clustering time evolving dataA frame work for clustering time evolving data
A frame work for clustering time evolving data
 
Bank loan purchase modeling
Bank loan purchase modelingBank loan purchase modeling
Bank loan purchase modeling
 
Accounting serx
Accounting serxAccounting serx
Accounting serx
 
Accounting serx
Accounting serxAccounting serx
Accounting serx
 
Data visualization using R
Data visualization using RData visualization using R
Data visualization using R
 
QQ Plot.pptx
QQ Plot.pptxQQ Plot.pptx
QQ Plot.pptx
 
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docx
 
QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE
 
BSA_AML Rule Tuning
BSA_AML Rule TuningBSA_AML Rule Tuning
BSA_AML Rule Tuning
 
Approach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule ThresholdsApproach to BSA/AML Rule Thresholds
Approach to BSA/AML Rule Thresholds
 
Finding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsFinding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster Results
 
A Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data MiningA Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data Mining
 
Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27Ijariie1117 volume 1-issue 1-page-25-27
Ijariie1117 volume 1-issue 1-page-25-27
 
Drsp dimension reduction for similarity matching and pruning of time series ...
Drsp  dimension reduction for similarity matching and pruning of time series ...Drsp  dimension reduction for similarity matching and pruning of time series ...
Drsp dimension reduction for similarity matching and pruning of time series ...
 
E1062530
E1062530E1062530
E1062530
 
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ...
 
1234
12341234
1234
 
Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?
 

Recently uploaded

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 

5. working on data using R -Cleaning, filtering ,transformation, Sampling

  • 1. Working on data ( cleaning, filtering ,transformation,sampling,visualization) K K Singh, Dept. of CSE, RGUKT Nuzvid 8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid 1
  • 2. Exploring DATA  cd <- read.table(‘custData.csv’, sep=',',header=T)  Once we’ve loaded the data into R, we’ll want to examine it.  class()—Tells us what type of R object you have. In our case,  summary()—Gives you a summary of almost any R object.  str()-Gives structure of data table/frame  names()– Gives detailed structure of data table/frame  dim() –Gives rows and columns of data  Data exploration uses a combination of summary statistics—means and medians, variances, and counts—and visualization. You can spot some problems just by using summary statistics; other problems are easier to find visually. 8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid 2
  • 3. OTHER DATA FORMATS  .csv is not the only common data file format you’ll encounter. Other formats include  .tsv (tab-separated values),  pipe-separated files,  Microsoft Excel workbooks,  JSON data,  and XML.  R’s built-in read.table() command can be made to read most separated value formats. 8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid 3
  • 4. 8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid 4  custdata<-fread(“custData.csv”)  Summary(custdata)
  • 5. Typical problems revealed by data summaries  MISSING VALUES  INVALID VALUES AND OUTLIERS 8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid 5
  • 6. Typical problems revealed by data summaries 8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid 6  DATA RANGE  Unit
  • 7. Data Cleaning  Fundamentally, there are two things you can do with missing variables: drop the rows with missing values, or convert the missing values to a meaningful value.  If the missing data represents a fairly small fraction of the dataset, it’s probably saf just to drop these customers from your analysis. But if it is significant, What do yo do then?  The most straightforward solution is just to create a new category for the variable, called missing.  f <- ifelse(is.na(custdata$is.employed), "missing", ifelse(custdata$is.employed==T, “employed“, “not_employed”))  summary(as.factor(f)) 8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid 7
  • 8. 8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid 8
  • 9. Data_transformations The purpose of data transformation is to make data easier to model—and easier to understand. For example, the cost of living will vary from state to state, so what would be a high salary in one region could be barely enough to scrape by in another. If you want to use income as an input to your insurance model, it might be more meaningful to normalize a customer’s income by the typical income in the area where they live. custdata <- merge(custdata, medianincome, by.x="state.of.res", by.y="State") summary(custdata[,c("state.of.res", "income", "Median.Income")]) custdata$income.norm <- with(custdata, income/Median.Income) OR custdata$income.norm <- custdata[, income/Median.Income] summary(custdata$income.norm) 8/19/2017K K Singh, Dept. of CSE, RGUKT Nuzvid 9
  • 10. CONVERTING CONTINUOUS VARIABLES TO DISCRETE
      • For some continuous variables, the exact value matters less than the range it falls into. In these cases, you might want to convert continuous variables such as age and income into ranges, or discrete variables.
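One way to do this conversion in base R is cut(), sketched below. The ages, incomes, break points, and the 20000 threshold are hypothetical values chosen for illustration.

```r
# Hypothetical ages; the break points are illustrative assumptions.
age  <- c(19, 25, 43, 67, 80)
brks <- c(0, 25, 65, Inf)

# cut() assigns each value to an interval; include.lowest = TRUE keeps a value
# that is exactly equal to the lowest break.
age.range <- cut(age, breaks = brks, include.lowest = TRUE)
table(age.range)

# A yes/no discretization: does income fall below a (hypothetical) threshold?
income        <- c(12000, 35000, 18000, 90000)
income.lt.20K <- income < 20000
```

By default the intervals are closed on the right, so an age of exactly 25 falls into the first range rather than the second.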
  • 11. NORMALIZATION AND RESCALING
      • Normalization is useful when absolute quantities are less meaningful than relative ones.
      • For example, you might be less interested in a customer’s absolute age than in how old or young they are relative to a “typical” customer. Let’s take the mean age of your customers to be the typical age. You can normalize by that, as shown in the following listing.
      summary(custdata$age)
      meanage <- mean(custdata$age)
      custdata$age.normalized <- custdata$age/meanage
      summary(custdata$age.normalized)
  • 12. Data Sampling
      • Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling.
      • It’s easier to test and debug code on small subsamples before training the model on the entire dataset. Visualization can also be easier with a subsample of the data.
      • The other reason to sample your data is to create test and training splits.
  • 13. A convenient way to manage random sampling is to add a sample group column to the data frame. The sample group column contains a number generated uniformly from zero to one, using the runif function. You can draw a random sample of arbitrary size from the data frame by using the appropriate threshold on the sample group column.
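The sample group technique described above can be sketched as follows. The data frame, the column name gp, the seed, and the 0.1 threshold are all assumptions made for illustration.

```r
set.seed(2017)  # fixed seed so the sample is reproducible; the value is arbitrary

# Hypothetical data frame standing in for the customer data.
custdata <- data.frame(custid = 1:1000,
                       income = rnorm(1000, mean = 50000, sd = 10000))

# Add a sample group column of uniform random numbers between 0 and 1.
custdata$gp <- runif(nrow(custdata))

# A threshold of 0.1 sends roughly 10% of rows to the test set, the rest to training.
testSet  <- subset(custdata, gp <= 0.1)
trainSet <- subset(custdata, gp > 0.1)

c(test = nrow(testSet), train = nrow(trainSet))
```

Because the gp column is stored with the data, the same split can be reproduced later, and a different threshold gives a sample of a different size without regenerating the random numbers.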
  • 14. Data visualization (refer to the lecture on graph plotting)
      • Visually checking distributions for a single variable:
          • What is the peak value of the distribution?
          • How many peaks are there in the distribution (unimodality versus bimodality)?
          • How normal (or lognormal) is the data?
          • How much does the data vary? Is it concentrated in a certain interval or in a certain category?
      • Visually checking relationships between two variables:
          • Is there a relationship between the two inputs age and income in my data?
  • 15. Graph types and their uses
      1. Line plot: Shows the relationship between two continuous variables. Best when that relationship is functional.
      2. Scatter plot: Shows the relationship between two continuous variables. Best when the relationship is too loose or cloud-like to be seen on a line plot.
      3. Bar chart: Shows the relationship between two categorical variables (var1 and var2). Highlights the frequencies of each value of var1.
      4. Bar chart with faceting: Shows the relationship between two categorical variables (var1 and var2). Best for comparing the relative frequencies of each value of var2 within each value of var1 when var2 takes on more than two values.
      5. Histogram or density plot: Examines the data range, checks the number of modes, checks whether the distribution is normal or lognormal, and checks for anomalies and outliers. Use a log scale to visualize data that is heavily skewed.
      6. Box-and-whisker plot (boxplot): Presents information from a five-number summary. Useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers). Very useful when large numbers of observations are involved and when two or more data sets are being compared.
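Several of the graph types above can be drawn with base R graphics, as in this sketch. The data is simulated, and the column names and distributions are assumptions. The lecture on graph plotting may use a different plotting library.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical customer data; column names and distributions are assumptions.
n <- 500
custdata <- data.frame(
  age          = round(runif(n, 18, 90)),
  income       = rlnorm(n, meanlog = 10.5, sdlog = 0.8),
  marital.stat = sample(c("Married", "Never Married", "Divorced"), n, replace = TRUE),
  health.ins   = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.8, 0.2))
)

pdf(NULL)  # draw to a null device; remove this line to view the plots interactively

# Histogram: range, modes, and outliers of a single variable.
hist(custdata$age, breaks = 20, main = "Age", xlab = "age")

# Density plot on a log scale for heavily skewed data.
plot(density(log10(custdata$income)), main = "Income (log10 scale)")

# Scatter plot: relationship between two continuous variables.
plot(custdata$age, custdata$income, log = "y", xlab = "age", ylab = "income")

# Bar chart of two categorical variables (barplot() stacks the bars by default).
counts <- table(custdata$health.ins, custdata$marital.stat)
barplot(counts, legend.text = TRUE, main = "Insurance by marital status")

# Boxplot: five-number summary of income within each category.
boxplot(income ~ marital.stat, data = custdata, log = "y")

invisible(dev.off())
```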
  • 16. Assignments
      load(nycflights)
      1. Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many such records are there?
      2. Calculate the median and interquartile range for arr_delays of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the highest IQR of arrival delays?
      3. Considering the data from all the NYC airports, which month has the highest average departure delay?
      4. What was the worst day to fly out of NYC in 2013 if you dislike delayed flights?
      5. Make a histogram and calculate appropriate summary statistics for arrival delays of sfo_feb_flights. Which of the following is false?