1. R environment
2. Introduction to R
Free software console environment
Statistical computing and graphics
Open source
Used all over the world
Base Package
Some data and graphic analysis
Can be extended with other packages
3. Pros & Cons of R
Pros
Free
Versatile and dynamic
Open source (many packages)
Cons
Command line
5. Overview of R
Three main windows
Console
Editor
Graphics
6. Scripts
Commands written in the R editor are known as a script
Several advantages
Possibility to save commands in a file
Rerun the same analysis
Reuse old scripts to create new ones
Edit the commands directly
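A minimal sketch of running a saved script, assuming the standard source() function of base R. The temporary file and the object names are invented for illustration.

```r
# Write a tiny script to a temporary file (file name is illustrative only)
scriptFile <- tempfile(fileext = ".R")
writeLines(c("students <- c('Adam', 'Kate', 'Paul')",
             "nStudents <- length(students)"),
           scriptFile)

# source() runs every command stored in the file, in order
source(scriptFile)
nStudents   # 3
```

The same file can be edited and run again, which is what makes an analysis repeatable.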
7. Basics: objects and functions
What is typed into the console is called a “command”
A command is composed of two parts, separated by “<-”
Objects
Functions
“object is created from function”
An object can be: a variable, a collection, a model, a graphic, …
Function: what we use to create objects
object <- function
8. Examples
Object: var1, Function: 25
Object: var2, Function: string "Hello"
Object: var3, Function: c(….)
c() (concatenate): groups several elements into a single object
var1 <- 25
var2 <- "Hello"
var3 <- c("Hello", "World")
9. Examples
To display the contents of an object in the console, we
type its name
Be careful: R is case sensitive
student is different from Student
students <- c("Adam", "Kate", "Paul")
students
[1] "Adam" "Kate" "Paul"
10. Workspace
All the objects are stored in memory
The collection of objects is named workspace
When we quit R, it asks whether to save the workspace
Saving the workspace saves the objects that have been
created, along with the command history.
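A short sketch of inspecting the workspace with the base functions ls() and rm(). The object names are only examples.

```r
# Create two illustrative objects
var1 <- 25
students <- c("Adam", "Kate", "Paul")

# ls() returns the names of the objects currently in the workspace
ls()

# rm() removes an object from the workspace
rm(var1)
"var1" %in% ls()       # FALSE: var1 is gone
"students" %in% ls()   # TRUE: students is still there
```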
11. Add & Remove
We can insert and remove elements from an object
Add: c(object, newElement)
Example: add "John" to students
Delete: object[object != elementToRemove]
Example: remove "John" from students
students <- c(students, "John")
students <- students[students != "John"]
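The two commands can be tried end to end. This sketch assumes students starts with the three names used earlier and shows the logical vector that drives the removal.

```r
students <- c("Adam", "Kate", "Paul")

# Add: concatenate the existing vector with the new element
students <- c(students, "John")
students                 # "Adam" "Kate" "Paul" "John"

# The comparison yields one TRUE/FALSE per element
students != "John"       # TRUE TRUE TRUE FALSE

# Remove: keep only the elements where the comparison is TRUE
students <- students[students != "John"]
students                 # "Adam" "Kate" "Paul"
```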
12. Variables
Collections of values (numeric or not)
One value in an object
A value can be a string or a number
A string value must be placed in quotes (" ").
Several values in an object
object1 is a numeric variable
object2 is a string variable
myobject <- 20
myobject <- "hello"
object1 <- c(10, 20, 30, 40)
object2 <- c("adam", "john", "lisa")
13. Dataframes
Objects containing variables
Combines several objects into a single one
data.frame() function
To view the content of the dataframe we can just type
its name
obj1 <- c("Steve", "Jim")
obj2 <- c(35, 25)
container <- data.frame(Name = obj1, Age = obj2)
container
   Name Age
1 Steve  35
2   Jim  25
14. Dataframes
The dataframe in the previous example has two variables:
Name and Age
We can refer to these variables using $
We can add a new variable to a dataframe
container$Age
[1] 35 25
container$Job <- c("Teacher", "Engineer")
container
   Name Age      Job
1 Steve  35  Teacher
2   Jim  25 Engineer
15. Dataframes
We can list the variables in the dataframe
names() function
names(container)
[1] "Name" "Age" "Job"
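A compact, self-contained version of the dataframe steps above. It also shows that a column extracted with $ is an ordinary vector that can be passed to other functions; nrow(), ncol() and mean() are base R functions not mentioned on the slides.

```r
container <- data.frame(Name = c("Steve", "Jim"), Age = c(35, 25))
container$Job <- c("Teacher", "Engineer")   # one value per row

names(container)      # "Name" "Age" "Job"
nrow(container)       # 2 rows (one per person)
ncol(container)       # 3 columns (one per variable)

# A column extracted with $ behaves like any other vector
mean(container$Age)   # 30
```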
16. Calculate new variables
Calculate a new variable from existing ones
We can use arithmetic operations on a dataframe's
variables ( +, -, /, *, …)
scoreExA <- c(12, 15, 9, 10)
scoreExB <- c(17, 15, 6, 12)
scores <- data.frame(class1 = scoreExA, class2 = scoreExB)
scores$sumScores <- scores$class1 + scores$class2
scores
  class1 class2 sumScores
1     12     17        29
2     15     15        30
3      9      6        15
4     10     12        22
17. Wide format
We must enter new sets of data in a logical way
The most logical way is known as wide format
In wide format
each row represents data from one entity (sample)
each column represents a variable
Example
I want to examine the performance of a group of people
in a test, indicating their gender.
We have two columns: “result of test” and “gender”
We have several rows, one for each participant
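A small sketch of the wide format described above. The participants and their results are invented for illustration.

```r
# Hypothetical data: one row per participant, one column per variable
testData <- data.frame(
  participant = c("P1", "P2", "P3", "P4"),
  gender      = c("female", "male", "female", "male"),
  testResult  = c(27, 22, 30, 18)
)

testData
dim(testData)   # 4 3: four rows (entities) and three columns (variables)
```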
19. Factor
Earlier we used a “gender” variable
The column with the information about the gender is
called a “grouping variable” or factor
A factor is a variable that groups different entities
Very often a factor is coded as a numeric variable (levels)
Example: if a person is a female, we give the
number 0; if a person is a male, we give the number 1
This feature is useful for splitting up the data
Separate analysis of males and females
20. Factor
Example: we have 10 students, 4 females and 6 males.
We need to enter a series of 0s and 1s.
Useful function: rep()
gender <- c(0,0,0,0,1,1,1,1,1,1)
gender <- c(rep(0, 4), rep(1, 6))
gender
[1] 0 0 0 0 1 1 1 1 1 1
21. Factor
To turn this variable into a factor we use: factor()
gender <- factor(gender, levels = c(0, 1),
                 labels = c("Female", "Male"))
gender
 [1] Female Female Female Female Male   Male   Male   Male   Male   Male
Levels: Female Male
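Once the variable is a factor, the data can be analysed group by group. The sketch below uses the base functions table() and tapply(); the score values are invented for illustration.

```r
gender <- factor(c(rep(0, 4), rep(1, 6)),
                 levels = c(0, 1), labels = c("Female", "Male"))

# Hypothetical test scores, one per student
score <- c(25, 28, 30, 21, 18, 24, 27, 22, 29, 18)

levels(gender)   # "Female" "Male"
table(gender)    # 4 females, 6 males

# Mean score computed separately for each group
groupMeans <- tapply(score, gender, mean)
groupMeans       # Female 26, Male 23
```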
22. Date variable
To create a date variable we use the function
as.Date()
A date is written as a text string, but it must follow a
particular format
Allows calculating differences between dates (in days)
How to make a date variable:
1. Write the date as a text string: "YYYY-MM-DD"
2. Use the as.Date(object) function
birthDate <- as.Date(c("1987-01-12", "1990-05-20", "1988-03-04"))
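A sketch of the difference calculation mentioned above. The reference date is an arbitrary choice made for illustration.

```r
birthDate <- as.Date(c("1987-01-12", "1990-05-20", "1988-03-04"))
class(birthDate)   # "Date"

# Subtracting dates gives a time difference expressed in days
reference <- as.Date("2020-01-01")   # arbitrary reference date
ageInDays <- reference - birthDate
as.numeric(ageInDays)                # one number of days per person

# A check that is easy to verify by hand
as.numeric(as.Date("2024-03-10") - as.Date("2024-03-01"))   # 9
```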
23. Missing values
Often, missing data can occur in a set of values
Examples:
Some participants skip a question
There is no signal at certain times
…
We have to tell R that a value is missing
The code used is NA (not available), written without quotes
testResults <- c(12, 30, NA, 25, NA, 10, 34)
job <- c("Teacher", NA, "Waiter", "Sales girl")
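A short sketch of how missing values behave in calculations. is.na() and the na.rm argument are part of base R but are not covered on the slides.

```r
testResults <- c(12, 30, NA, 25, NA, 10, 34)

is.na(testResults)        # TRUE where a value is missing
sum(is.na(testResults))   # 2 missing values

mean(testResults)                 # NA: a missing value propagates to the result
mean(testResults, na.rm = TRUE)   # 22.2, computed on the five known values
```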
24. Select part of data
We can select a small portion of a dataframe
Select particular variables (some columns)
Select a subset of cases (some rows)
General form: dataframe[rows, columns]
Examples
newDF <- oldDF[rows, columns]
names <- people[, "name"]
namesAndAges <- people[, c("name", "age")]
25. Select part of data
Examples
I want the names of all the people that have an age
greater than 20
names <- people[people$age > 20, "name"]
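A runnable sketch of the selections above, assuming a small people dataframe invented for illustration. Note that the condition refers to the column as people$age.

```r
# Hypothetical dataframe used only for this example
people <- data.frame(name = c("Adam", "Kate", "Paul"),
                     age  = c(19, 23, 31))

people[, "name"]                  # all rows, one column
people[, c("name", "age")]        # all rows, two columns
people[people$age > 20, ]         # rows with age > 20, all columns
people[people$age > 20, "name"]   # "Kate" "Paul"
```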
26. Select part of data
Another way: subset() function
Examples
newDF <- subset(oldDF, conditions, select = c(list of variables))
success <- subset(students, Score >= 18,
                  select = c("Name", "Surname", "Score"))
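A self-contained sketch of subset(), with an invented students dataframe. The last line shows the equivalent selection with square brackets.

```r
# Hypothetical exam data
students <- data.frame(Name    = c("Adam", "Kate", "Paul"),
                       Surname = c("Smith", "Jones", "Brown"),
                       Score   = c(17, 25, 18))

# Keep the students with a score of at least 18
success <- subset(students, Score >= 18,
                  select = c("Name", "Surname", "Score"))
success

# Same result with square brackets
students[students$Score >= 18, c("Name", "Surname", "Score")]
```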
27. Working directory
R has a default directory, called the working
directory
Typically it is: "C:/Users/username/Documents"
(Windows)
We can check the working directory with the getwd() function
We can change the working directory with setwd()
getwd()
[1] "C:/Users/daniele/Documents"
setwd("C:/Users/daniele/R/Examples")
getwd()
[1] "C:/Users/daniele/R/Examples"
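A portable sketch of the same steps. It uses tempdir(), a base R function that returns a folder guaranteed to exist for the session, instead of a fixed Windows path.

```r
oldDir <- getwd()   # remember the current working directory

setwd(tempdir())    # switch to the session's temporary folder
getwd()             # now points to the temporary folder
list.files()        # files found in the new working directory

setwd(oldDir)       # restore the original working directory
```

Restoring the original directory at the end keeps later commands from reading or writing files in an unexpected place.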
28. Import and export data
We can import data from Excel, OpenOffice, SPSS, …
Imported data are usually stored as a dataframe
Two common types of files
Text (tab-delimited text): read.delim()
CSV (comma-delimited text): read.csv()
textdata <- read.delim("data.dat", header = TRUE)
csvdata <- read.csv("students.csv", header = TRUE)
29. Import and export data
We can export data to a variety of formats
Two ways to export
General form: write.table()
CSV (comma-delimited text): write.csv()
The separator is "," by default
write.csv(dataframe, "data.csv")
write.table(dataframe, "file.txt", sep = "\t", row.names = FALSE)
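A round-trip sketch of export followed by import. The files are written to temporary locations created with tempfile(), so the file names are not meaningful; the data reuse the scores from the earlier example.

```r
scores <- data.frame(class1 = c(12, 15, 9, 10),
                     class2 = c(17, 15, 6, 12))

csvFile <- tempfile(fileext = ".csv")
txtFile <- tempfile(fileext = ".txt")

# Export: comma-separated and tab-separated versions
write.csv(scores, csvFile, row.names = FALSE)
write.table(scores, txtFile, sep = "\t", row.names = FALSE)

# Import: header = TRUE reads the variable names from the first row
fromCsv <- read.csv(csvFile, header = TRUE)
fromTxt <- read.delim(txtFile, header = TRUE)

fromCsv
fromTxt
```

Without row.names = FALSE, write.csv() adds a first column containing the row names, which then comes back as an extra variable on import.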
Editor's Notes
I'm going to describe the R environment, a piece of software that we will use for computations, data analysis, and so on.
It is free software; we can download it from the official website or from one of its mirrors.
With this software we can perform a wide range of statistical and mathematical computations.
We can obtain analytical or graphical results.
Open source: unlike commercial software companies, the developers of R allow everyone to access the code, to modify it and, in short, to contribute to the software's development.
R is provided as a base package with the most useful functions. We can already…. However, we can add other packages to expand the number of features in R.
These packages add new functionality to the program.
The main advantage is that R is free. Anybody can use this software right away and modify it, because it is also open source.
The downside concerns usability. Since R has a command-line console and not a GUI, some people find it difficult.
To install R on our computer we just need to download it from this website and select the platform (Linux, Windows or Mac).
Then we select the version (typically the latest one) and install it on the machine.
When we start the application, we can see three main windows: console, editor, and graphics.
The console is where we enter and execute commands and see the results.
The editor is a window where we can write our commands and then edit and reuse them. It is just like a notepad.
Finally, the graphics window displays the plots and graphs that we produce.
The console is the main window where we can execute commands, but a useful way to write and execute commands is through the editor. A command, or a set of commands, written in the editor is known as a script.
There are some advantages to using the editor:
we can re-execute the same commands on other data,
and we can modify or correct the commands in the editor.
As I said before, … coming back to the command
“<-” is a less-than sign followed by a hyphen. The two parts are the object and the function,
as shown in the example below.
This particular structure means “is created from”.
An object is anything created in R; it can be…
Functions are the things that we use in R to create objects.
Here I put 25 into var1, which is thereby created. This object is stored in memory, so we can refer to it later.
The third example uses a function called c(), the concatenate function, which assembles several elements into a single object. Take a look at the example, where two elements are concatenated into var3.
This is another example. I put three strings into a variable called students.
To display it, we just have to type the name of the variable.
We must be careful because R is case sensitive. For instance, a variable named student in lowercase and another one named Student with an uppercase S are two different variables.
And this is what is shown in the console window.
We start exploring the characteristics of the various commands that we can write.
In a set of elements that we have created with the c function, we can insert or remove elements.
We add an element using the c function, which concatenates the existing object with the element that we want to insert.
To remove an element we use this command, which means: we recreate students but get rid of the element John.
A variable can be a string variable or a numeric variable.
String values should always be placed inside quotes; this tells R that the data is not numeric.
This is what distinguishes a string from a numeric value.
A dataframe is a container that groups different variables. In this way we can combine several objects into a single one. A dataframe is displayed like a spreadsheet in Excel, with columns and rows.
With this command we create a new object, container, and then we tell R how to build the dataframe.
Especially if we have a lot of data, we may need
also using $. For example, I want to insert a variable named Job. Using the c function, I assign the two jobs to the elements.
Sometimes it is useful to list the variables in a dataframe, for instance when we have a large dataframe. This can be done using the names() function, specifying the name of the dataframe within the brackets.
We can also use arithmetic operators to compute and create new variables in a dataframe, and these variables can in turn be placed into a dataframe.
When we input a new set of data, we must do so in a logical way. The most logical way is the wide format, where each row represents the data (the values) of one entity (a person, a subject, ..) and each column represents one variable (age, score, ..). It's like a spreadsheet in Excel.
It is a common format that we all use.
The first column is simply a reference to the person, an identifier.
For instance, if we want to divide some entities into age classes, from 1 to 10, from 11 to 20, and so on, we define a factor that takes the number 1 for
It is almost always a numeric variable.
A numeric variable is recommended because the factor can represent different levels of a treatment variable.
Ascending or descending
factor() takes three parameters: the first one is the variable that we want to turn into a factor; the second one denotes the different groups, in this case 2 levels; the last one assigns labels to these levels, in our case Female and Male.
It is not the same: differences are expressed in days.
Are unable to answer a given question.
This lack of data
When we have a dataframe it's convenient to select a part of it. For instance, we may need just one variable, or a subset of variables, or a subset of cases.
Square brackets
Which rows I want, which columns I want.
[, because in this case I want all the people
With which
Choose, anything else
from which
Another function that does the same
Students who have passed an exam
Greater than or equal to
The working directory is the directory in which we save or load files (commands, data, or our output results).
R has a default directory.
Obviously we can change this directory for our convenience,
using setwd and passing as a parameter the path that we want to use.
There are also functions to import and export data.
There are two types of files, and they differ in the delimiter used to separate the elements.
Both functions require two parameters.
header = TRUE tells R that the data file has variable names in its first row.
The first parameter is the name of the object to export, the second is the name of the file, the third is the separator used to separate data values (I've used the tab delimiter), and the last tells R whether to write the row names (in this case, not).