Unit I
Data Manipulations
Data Already in R – Reading data – Reading and formatting datasets –
Manipulating Data with dplyr – Tidying Data with tidyr
Data Manipulation
• Data science is as much about manipulating data as it is about fitting
models to data.
• Data rarely arrives in a form that we can directly feed into the
statistical models or machine learning algorithms.
• The first stages of data analysis are almost always figuring out how to
load the data into R and then figuring out how to transform it into a
shape you can readily analyze.
Data Already in R:
• There are some data sets already built into R or available in R packages.
• datasets is a package that ships with R.
• It collects a variety of example data sets, most of them stored as data
frames, that are ready to use for practice and for testing code.
• We can load the package into R using the library() function.
library(datasets)
If we want a list of the data sets in the package, together with a short
description of each, then use
library(help = "datasets")
It also shows the package name, version, title, author, etc.
To load an actual data set into R’s memory, use the data() function.
For example,
data("CO2")
To display the first 6 rows of a dataset:
head(CO2)
R is case-sensitive: CO2 and co2 are two different built-in data sets.
To plot a graph:
plot(conc ~ uptake, data = CO2)
• Another package with several useful data sets is mlbench.
• It contains data sets for machine learning benchmarks, so these
data sets are aimed at testing how new methods perform on
known data sets.
• This package is not distributed together with R, but you can
install it, load it, and get a list of the data sets within it like this:
install.packages("mlbench")
library(mlbench)
library(help = "mlbench")
Quickly Reviewing Data
• head(data_frame, n) -> displays the first n rows (6 by default).
• head(CO2, 3) -> displays the first 3 rows of the CO2 data set.
• Similarly,
• tail(CO2, 3) -> displays the last 3 rows.
• summary(CO2) -> displays summary statistics for all the columns in a
data frame.
• str(CO2) -> displays the type of each column.
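The review functions above can be tried together on the built-in CO2 data set. This is a minimal sketch using only base R; the call to dim() is an addition not mentioned in the slides.

```r
# CO2 comes with the datasets package, which is attached by default.
data("CO2")

dim(CO2)        # number of rows and columns
head(CO2, 3)    # first 3 rows
tail(CO2, 3)    # last 3 rows
summary(CO2)    # summary statistics for every column
str(CO2)        # type of each column plus a preview of its values
```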
Reading Data
• There are several packages for reading data in different file formats,
from Excel to JSON to XML and so on.
• Quite often, though, data is available as a plain text table. R has plenty
of built-in functions for reading such data. Use ?read.table to see their
documentation.
• read.table() is a function in R that reads a file in table format and creates
a data frame from it.
Example
my_data <- read.table("data.txt", header = TRUE, sep = "\t")
Arguments in read.table():
• header: This is a boolean value telling the function whether it should
consider the first line in the input file a header line.
• col.names: If the first line is not used to specify the header, you can
use this option to name the columns.
• dec: This is the decimal point used in numbers.
• comment.char: By default, the function assumes that “#” is the start of
a comment and ignores the rest of a line when it sees it
• colClasses: This lets you specify which type each column should have,
so here you can specify that some columns should be factors, and
others should be strings
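The arguments listed above could be combined as in the following sketch. The file contents, column names, and the temporary file are made up for illustration so that the example runs without any external data.

```r
# Write a small tab-separated file so the example is self-contained.
# The contents and column names are hypothetical.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "# measurements taken in the lab",
  "id\tgroup\tvalue",
  "1\ta\t2.5",
  "2\tb\t3.75",
  "3\ta\t4.0"
), path)

my_data <- read.table(
  path,
  header = TRUE,          # first non-comment line holds the column names
  sep = "\t",             # columns are separated by tabs
  dec = ".",              # character used as the decimal point
  comment.char = "#",     # ignore everything after '#'
  colClasses = c("integer", "factor", "numeric")  # one type per column
)

str(my_data)
```

Setting colClasses explicitly avoids relying on read.table() to guess the column types, which can matter when a column of codes should be a factor rather than a string.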
Installing a Package in RStudio
• To install the mlbench package in RStudio, you can follow these steps:
1. Open RStudio and click on the Packages tab in the bottom right
pane.
2. Click on the Install button.
3. In the Install Packages dialog box, type mlbench in the
Packages field.
4. Select the Install dependencies option.
5. Click on the Install button to install the package.
Examples of Reading and Formatting Data Sets
Breast Cancer Data set :
• As a first example of reading data from a text file, we consider the
BreastCancer data set.
• It is also available in mlbench, so we have something to compare our
results with.
library(mlbench)
data(BreastCancer)
head(BreastCancer, 3)
• The URL to the actual data is
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
• To get the data,
• we could go to the URL and save the file, or
• we could read the data directly from the URL.
• We can read the data and get it as a vector of lines using the readLines()
function (here data_url holds the URL above as a string).
lines <- readLines(data_url)
lines[1:5]
• For this data, it seems to be a comma-separated values file without a
header line. So save the data with the “.csv” suffix.
• Boston Housing Data Set : For the second example of loading data,
we take another data set from the mlbench package.(Refer text book)
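Reading a comma-separated file without a header line might look like the sketch below. To keep it runnable offline, it parses a few made-up lines instead of downloading the real file; the values, the column names, and the use of "?" as a missing-value marker are assumptions for illustration only.

```r
# Made-up lines in a comma-separated layout with no header line.
raw_lines <- c(
  "1001,5,1,1,2",
  "1002,3,?,2,4",
  "1003,8,10,7,4"
)

# Look at the raw text first, as one would after readLines() on a file or URL.
raw_lines[1:2]

# Parse the same text as CSV. The column names are hypothetical.
parsed <- read.csv(
  text = raw_lines,
  header = FALSE,                       # the data has no header line
  col.names = c("id", "thickness", "size", "shape", "class"),
  na.strings = "?"                      # treat "?" as a missing value
)

str(parsed)
```

For a real file, the text argument would be replaced by a file path or URL passed as the first argument.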
The readr package:
• readr is an R package that provides a fast and friendly way to read
rectangular data from delimited files, such as comma-separated values
(CSV) and tab-separated values (TSV)
• It is designed to parse many types of data.
• To install readr,
• you can either install the whole tidyverse by running
install.packages("tidyverse") or
• install just readr by running install.packages("readr")
• Once installed, you can load readr with library(readr)
• readr supports the following file formats with these read_*() functions:
• read_csv(): comma-separated values (CSV)
• read_tsv(): tab-separated values (TSV)
• read_csv2(): semicolon-separated values with , as the decimal mark
• read_delim(): delimited files (CSV and TSV are important special
cases)
• read_fwf(): fixed-width files
• read_table(): whitespace-separated files
• read_log(): web log files
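As a small illustration of the read_*() functions, the sketch below writes a tiny CSV file and reads it back with read_csv() and read_delim(). The file contents and column names are made up, and the readr package is assumed to be installed.

```r
library(readr)

# Write a small CSV file so the example is self-contained (hypothetical data).
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ann,3.5", "bob,4.25"), path)

# Declare the column types up front instead of letting readr guess them.
types <- cols(name = col_character(), score = col_double())

scores <- read_csv(path, col_types = types)

# read_delim() reads the same file when told that the delimiter is a comma.
same_scores <- read_delim(path, delim = ",", col_types = types)

scores
```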
Manipulating Data with dplyr
• Data frames are ideal for representing tabular data
• Nearly all packages that implement statistical models or machine
learning algorithms in R work on data frames.
• But to actually manipulate a data frame, you often have to write a lot
of code to filter data, rearrange data, summarize it in various ways,
and such.
• A few years ago, manipulating data frames required a lot more
programming than actually analyzing data.
• That has improved dramatically with the dplyr package (pronounced
“d plier”, where “plier” is pronounced as in “pliers”).
• The dplyr package is not part of base R and has to be installed separately.
• It helps to resolve the most frequent data manipulation hurdles.
• It provides simple “verbs”, functions that each handle one common data
manipulation task, so that thoughts can be translated into code faster.
• This package provides a number of convenient functions that let you
modify data frames in various ways and string them together in pipes
using the %>% or |> operator.
• Once you load dplyr, you get a large selection of functions that let you
build pipelines for data frame manipulation.
Some Useful dplyr Functions
• The dplyr package works with several representations of data frames.
One of them is the tibble, which as_tibble() creates from an ordinary
data frame.
• (illustrate with output)
• iris %>% as_tibble()
• iris |> as_tibble()
• iris %>% as_tibble() %>% select(Petal.Width,Petal.Length) %>% head(3)
• iris %>% as_tibble() %>% select(Sepal.Length:Petal.Length) %>%
head(3)
• iris |> as_tibble() |> select(starts_with("Petal")) |> head(3)
• iris |> as_tibble() |> select(ends_with("Width")) |> head(3)
• iris |> as_tibble() |> select(contains("etal")) |> head(3)
• iris |> as_tibble() |> select(matches(".t.")) |> head(3)
• iris %>% as_tibble() %>% select(-starts_with("Petal")) %>% head(3)
• iris %>% as_tibble() %>% mutate(Petal.Width.plus.Length =
  Petal.Width + Petal.Length) %>%
  select(Species, Petal.Width.plus.Length) %>% head(3)
• iris %>% as_tibble() %>%
  mutate(Petal.Width.plus.Length = Petal.Width + Petal.Length,
  Sepal.Width.plus.Length = Sepal.Width + Sepal.Length) %>%
  select(Petal.Width.plus.Length, Sepal.Width.plus.Length) %>%
  head(3)
• iris %>% as_tibble() %>% arrange(Sepal.Length) %>% head(3)
• iris %>% as_tibble() %>% arrange(desc(Sepal.Length)) %>% head(3)
• iris %>% as_tibble() %>% group_by(Species) %>% head(3)
• iris %>% group_by(Species) %>% summarise(Mean.Petal.Length =
mean(Petal.Length))
• Breast Cancer Data Manipulation – Refer Text book
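The individual verbs shown above can be chained into one pipeline, as in this sketch. The filter() step, the threshold of 5, and the derived column Petal.Area are illustrative choices that do not come from the slides; the dplyr package is assumed to be installed.

```r
library(dplyr)

petal_summary <- iris %>%
  as_tibble() %>%
  filter(Sepal.Length > 5) %>%                         # keep rows matching a condition
  mutate(Petal.Area = Petal.Width * Petal.Length) %>%  # hypothetical derived column
  group_by(Species) %>%                                # one group per species
  summarise(
    Mean.Petal.Area = mean(Petal.Area),                # one value per group
    Observations = n()                                 # number of rows in each group
  ) %>%
  arrange(desc(Mean.Petal.Area))                       # largest mean first

petal_summary
```

Each step takes a data frame and returns a data frame, which is what makes it possible to string the verbs together with a pipe.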
Tidying Data with tidyr
Tidy data is a standard way of mapping the meaning of a data set
to its structure. A data set is messy or tidy depending on how rows,
columns and tables are matched up with observations, variables
and types.
Hadley Wickham
• tidy data can be used to plot or summarize the data efficiently.
• It mostly comes down to what data is represented as columns in a data frame
and what is not.
• For example, if I want to look at the iris data set and see how the Petal.Length
varies among species, then I can look at the Species column against the
Petal.Length column:
iris |>
as_tibble() |>
select(Species, Petal.Length) |>
head(3)
• We can plot a graph of the same data.
• This works because we have a column for the x-axis and another for
the y-axis.
• But if we want to plot the different measurements of the irises to see
how those are related, we run into a problem: each measurement is a
separate column.
• In such a case we can use the tidyr package.
library(tidyr)
• It has a function, pivot_longer(), that modifies the data frame so that
column names become values in one column and the contents of those
columns become values in another.
• The pivot_longer() function is designed to reshape data from a wider
format to a longer format.
• It makes it easier to analyze and visualize the data.
• Its first argument is the data frame or tibble to be reshaped.
• What it does is essentially transform the data frame such that you
get one column containing the names of your original columns and
another column containing the values in those columns.
• In the iris data set, we have observations for sepal length and sepal
width.
• If we want to examine Species vs. Sepal.Length or Sepal.Width, we can
readily do this.
• pivot_wider() is the inverse of pivot_longer().
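A minimal sketch of the reshaping described above, using the two sepal columns of iris. The column names Attribute and Measurement and the helper column id are illustrative assumptions; id is added only so that pivot_wider() can match each pair of values back to its original row. The dplyr and tidyr packages are assumed to be installed.

```r
library(dplyr)
library(tidyr)

# Wide to long: the two measurement columns become name/value pairs.
long_iris <- iris |>
  as_tibble() |>
  mutate(id = row_number()) |>          # hypothetical row identifier
  select(id, Species, Sepal.Length, Sepal.Width) |>
  pivot_longer(
    cols = c(Sepal.Length, Sepal.Width),
    names_to = "Attribute",             # holds the original column names
    values_to = "Measurement"           # holds the values from those columns
  )

head(long_iris, 4)

# Long to wide: pivot_wider() restores one column per attribute.
wide_again <- long_iris |>
  pivot_wider(names_from = Attribute, values_from = Measurement)

head(wide_again, 3)
```

In the long form, Attribute can be mapped to a plot aesthetic such as colour or facet, which is not possible while each measurement lives in its own column.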
