SlideShare a Scribd company logo
1 of 8
Download to read offline
Using R in Data mining for the masses,
Chapter 3
Ulrik Hørlyk Hjort
April 5, 2014
1 Data Scrubbing
In the following text “the book” will refer to the the book: “Data Mining for
the masses”
1.1 Importing data into an data frame
Import an csv file by using the read.csv() function. Use the help function in
R - help(read.csv) or just ?read.csv - to see a full description of this function
and the arguments it takes.
Start the R commandline interpreter environment and type the following
command to inport the dataset for chapter 3 (It is assumed that the data
set file is located in the directory where the R interprenter is executed):
data = read.csv(‘‘Chapter03DataSet.csv’’, sep=’’,’’,header = TRUE)
This line imports the data tables from the csv file into the data frame object
data and is equivalent to the action that add a data set to a process in
RapidMiner.
To get a overview of the data set - like the Result Perspective in Rapid-
Miner - simply type the name of the data frame, which in this case is data.
We then get the following output of the data set where the top line of the
table, called the header, contains the column names. Each horizontal line
1
afterward denotes a data row, which begins with the name of the row, and
then followed by the actual data. Each data member of a row is called a
cell. Observe that since the number of columns exceed the screen witdh the
coloums will be split up in the view.
Output of the data frame view:
> data
Gender Race Birth_Year Marital_Status Years_on_Internet
1 M White 1972 M 8
2 M Hispanic 1981 S 14
3 F African American 1977 S 6
4 F White 1961 D 8
5 M White 1954 M 2
6 M African American 1982 D 15
7 M African American 1981 D 11
8 M White 1977 S 3
9 F African American 1969 M 6
10 M White 1987 S 12
11 F Hispanic 1959 D 12
2
Hours_Per_Day Preferred_Browser Preferred_Search_Engine Preferred_Email
1 1 Firefox Google Yahoo
2 2 Chrome Google Hotmail
3 2 Firefox Yahoo Yahoo
4 6 Firefox Google Hotmail
5 3 Internet Explorer Bing Hotmail
6 4 Internet Explorer Google Yahoo
7 2 Firefox Google Yahoo
8 3 Internet Explorer Yahoo Yahoo
9 2 Firefox Google Gmail
10 1 Safari Yahoo Yahoo
11 5 Chrome Google Gmail
3
Read_News Online_Shopping Online_Gaming Facebook Twitter
1 Y N N Y N
2 Y N N Y N
3 Y Y Y N
4 N Y N N Y
5 Y Y N Y N
6 Y N Y N N
7 Y Y Y
8 Y Y 99
9 N Y N N N
10 Y Y Y N
11 Y N N Y N
4
Other_Social_Network
1
2
3
4
5
6
7 LinkedIn
8 LinkedIn
9
10 MySpace
11 Google+
5
1.2 Data Reduction
The data reduction example in the book reduces the data set by filtering out
the row in which the Online Shopping attribute is undefined. In R it is
done in the following way:
data <- data[data$Online_Shopping==’’Y’’ | data$Online_Shopping==’’N’’,]
The inverted filter function is easily done by negate the logical expression:
data <- data[!data$Online_Shopping==’’Y’’ & !data$Online_Shopping==’’N’’,]
1.2.1 Sample Of A Data Frame
Reducing the size of the data set by takeing a sample can be done in the
following way in R. First we create a function to generate an random sample
on a data frame:
random_sample = function(data_frame,n) {
return (df[sample(nrow(data_frame), n),])
}
The random sample function takes the data frame as first argument and the
number of row we wish in the reduced data set as second argument. We can
generate a new data frame with 5 random rows from the original one by:
reduced_data_frame<-randomSample(data, 5)
where reduced data frame is the new data frame containing the 5 row
reduced data set.
6
1.3 Handling Inconsistent Data
In R we can show a summary view of the data frame which display the
modes, means, medians and factor levels for the data rows. Looking at
the summary data frame view we can see the inconsistent value 99 for the
Twitter attribute:
> summary(data)
Gender Race Birth_Year Marital_Status Years_on_Internet
F:4 African American:4 Min. :1954 D:4 Min. : 2.000
M:7 Hispanic :2 1st Qu.:1965 M:3 1st Qu.: 6.000
White :5 Median :1977 S:4 Median : 8.000
Mean :1973 Mean : 8.818
3rd Qu.:1981 3rd Qu.:12.000
Max. :1987 Max. :15.000
Hours_Per_Day Preferred_Browser Preferred_Search_Engine
Min. :1.000 Chrome :2 Bing :1
1st Qu.:2.000 Firefox :5 Google:7
Median :2.000 Internet Explorer:3 Yahoo :3
Mean :2.818 Safari :1
3rd Qu.:3.500
Max. :6.000
Preferred_Email Read_News Online_Shopping Online_Gaming Facebook Twitter
Gmail :2 :1 :2 :3 N:3 99:1
Hotmail:3 N:2 N:4 N:6 Y:8 N :8
Yahoo :6 Y:8 Y:5 Y:2 Y :2
Other_Social_Network
:7
Google+ :1
LinkedIn:2
MySpace :1
We wish to change this value to a valid one e.g “Y” or “N”. Since the mode
of “N” is 80% we choose that as the replacement value. The replacement is
done in the following way in R:
data$Twitter[!data$Twitter == "Y" & !data$Twitter == "N"] <- "N"
It is ly possible to replace with values already in the range and the following
will result in an error message:
#data$Twitter[!data$Twitter == ‘‘Y’’ & !data$Twitter == ‘‘N’’] <- ‘‘Q’’
Error Message:
In ‘[<-.factor‘(‘*tmp*‘, !d1$Twitter == ‘‘Y’’ & !d1$Twitter == ‘‘N’’, :
invalid factor level, NAs generated
If we want to assign a value that is not currently a factor level, you first would need to
create the additional level like:
levels(data$Twitter) <- c(levels(data$Twitter), ‘‘Q’’)
7
1.4 Handling Inconsistent Data
Reducing the data set to a subset data frame containing the attributes “Birth Year”,
“Gender”, “Martial Status”, “Race” and “Years on Internet” is done in the fol-
lowing way in R:
reduced_data_frame<-subset(data, select = c("Gender", "Birth_Year",
"Marital_Status", "Race",
"Years_on_Internet"))
If we instead wish to remove attributes from the original data set we can do the following
which delete the attributes “Twitter”, “Gender” and “Race”:
data$Twitter <- NULL
data$Gender <- NULL
data$Race <- NULL
8

More Related Content

Recently uploaded

Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjurptikerjasaptiker
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 

Recently uploaded (20)

Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling ManjurJual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
Jual Cytotec Asli Obat Aborsi No. 1 Paling Manjur
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 

Featured

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 

Featured (20)

Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 

Data mining for the masses, Chapter 3, Using R instead of Rapidminer

  • 1. Using R in Data mining for the masses, Chapter 3 Ulrik Hørlyk Hjort April 5, 2014 1 Data Scrubbing In the following text “the book” will refer to the the book: “Data Mining for the masses” 1.1 Importing data into an data frame Import an csv file by using the read.csv() function. Use the help function in R - help(read.csv) or just ?read.csv - to see a full description of this function and the arguments it takes. Start the R commandline interpreter environment and type the following command to inport the dataset for chapter 3 (It is assumed that the data set file is located in the directory where the R interprenter is executed): data = read.csv(‘‘Chapter03DataSet.csv’’, sep=’’,’’,header = TRUE) This line imports the data tables from the csv file into the data frame object data and is equivalent to the action that add a data set to a process in RapidMiner. To get a overview of the data set - like the Result Perspective in Rapid- Miner - simply type the name of the data frame, which in this case is data. We then get the following output of the data set where the top line of the table, called the header, contains the column names. Each horizontal line 1
  • 2. afterward denotes a data row, which begins with the name of the row, and then followed by the actual data. Each data member of a row is called a cell. Observe that since the number of columns exceed the screen witdh the coloums will be split up in the view. Output of the data frame view: > data Gender Race Birth_Year Marital_Status Years_on_Internet 1 M White 1972 M 8 2 M Hispanic 1981 S 14 3 F African American 1977 S 6 4 F White 1961 D 8 5 M White 1954 M 2 6 M African American 1982 D 15 7 M African American 1981 D 11 8 M White 1977 S 3 9 F African American 1969 M 6 10 M White 1987 S 12 11 F Hispanic 1959 D 12 2
  • 3. Hours_Per_Day Preferred_Browser Preferred_Search_Engine Preferred_Email 1 1 Firefox Google Yahoo 2 2 Chrome Google Hotmail 3 2 Firefox Yahoo Yahoo 4 6 Firefox Google Hotmail 5 3 Internet Explorer Bing Hotmail 6 4 Internet Explorer Google Yahoo 7 2 Firefox Google Yahoo 8 3 Internet Explorer Yahoo Yahoo 9 2 Firefox Google Gmail 10 1 Safari Yahoo Yahoo 11 5 Chrome Google Gmail 3
  • 4. Read_News Online_Shopping Online_Gaming Facebook Twitter 1 Y N N Y N 2 Y N N Y N 3 Y Y Y N 4 N Y N N Y 5 Y Y N Y N 6 Y N Y N N 7 Y Y Y 8 Y Y 99 9 N Y N N N 10 Y Y Y N 11 Y N N Y N 4
  • 6. 1.2 Data Reduction The data reduction example in the book reduces the data set by filtering out the row in which the Online Shopping attribute is undefined. In R it is done in the following way: data <- data[data$Online_Shopping==’’Y’’ | data$Online_Shopping==’’N’’,] The inverted filter function is easily done by negate the logical expression: data <- data[!data$Online_Shopping==’’Y’’ & !data$Online_Shopping==’’N’’,] 1.2.1 Sample Of A Data Frame Reducing the size of the data set by takeing a sample can be done in the following way in R. First we create a function to generate an random sample on a data frame: random_sample = function(data_frame,n) { return (df[sample(nrow(data_frame), n),]) } The random sample function takes the data frame as first argument and the number of row we wish in the reduced data set as second argument. We can generate a new data frame with 5 random rows from the original one by: reduced_data_frame<-randomSample(data, 5) where reduced data frame is the new data frame containing the 5 row reduced data set. 6
  • 7. 1.3 Handling Inconsistent Data In R we can show a summary view of the data frame which display the modes, means, medians and factor levels for the data rows. Looking at the summary data frame view we can see the inconsistent value 99 for the Twitter attribute: > summary(data) Gender Race Birth_Year Marital_Status Years_on_Internet F:4 African American:4 Min. :1954 D:4 Min. : 2.000 M:7 Hispanic :2 1st Qu.:1965 M:3 1st Qu.: 6.000 White :5 Median :1977 S:4 Median : 8.000 Mean :1973 Mean : 8.818 3rd Qu.:1981 3rd Qu.:12.000 Max. :1987 Max. :15.000 Hours_Per_Day Preferred_Browser Preferred_Search_Engine Min. :1.000 Chrome :2 Bing :1 1st Qu.:2.000 Firefox :5 Google:7 Median :2.000 Internet Explorer:3 Yahoo :3 Mean :2.818 Safari :1 3rd Qu.:3.500 Max. :6.000 Preferred_Email Read_News Online_Shopping Online_Gaming Facebook Twitter Gmail :2 :1 :2 :3 N:3 99:1 Hotmail:3 N:2 N:4 N:6 Y:8 N :8 Yahoo :6 Y:8 Y:5 Y:2 Y :2 Other_Social_Network :7 Google+ :1 LinkedIn:2 MySpace :1 We wish to change this value to a valid one e.g “Y” or “N”. Since the mode of “N” is 80% we choose that as the replacement value. The replacement is done in the following way in R: data$Twitter[!data$Twitter == "Y" & !data$Twitter == "N"] <- "N" It is ly possible to replace with values already in the range and the following will result in an error message: #data$Twitter[!data$Twitter == ‘‘Y’’ & !data$Twitter == ‘‘N’’] <- ‘‘Q’’ Error Message: In ‘[<-.factor‘(‘*tmp*‘, !d1$Twitter == ‘‘Y’’ & !d1$Twitter == ‘‘N’’, : invalid factor level, NAs generated If we want to assign a value that is not currently a factor level, you first would need to create the additional level like: levels(data$Twitter) <- c(levels(data$Twitter), ‘‘Q’’) 7
  • 8. 1.4 Handling Inconsistent Data Reducing the data set to a subset data frame containing the attributes “Birth Year”, “Gender”, “Martial Status”, “Race” and “Years on Internet” is done in the fol- lowing way in R: reduced_data_frame<-subset(data, select = c("Gender", "Birth_Year", "Marital_Status", "Race", "Years_on_Internet")) If we instead wish to remove attributes from the original data set we can do the following which delete the attributes “Twitter”, “Gender” and “Race”: data$Twitter <- NULL data$Gender <- NULL data$Race <- NULL 8