7. Median Value vs Mode
The Median is the "middle" of a sorted list of
numbers.
The mode is simply the number which
appears most often.
So, for (1, 3, 5, 12, 3), sorted as (1, 3, 3, 5, 12), the median is 3 and the mode
is also 3.
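A minimal R sketch of the example above. Base R has median() but no built-in function for the statistical mode, so the mode is taken here from a frequency table; the variable names are only illustrative.

```r
x <- c(1, 3, 5, 12, 3)

# median() sorts the values internally and returns the middle one
med <- median(x)

# table() counts how often each value occurs; the most frequent one is the mode
counts <- table(x)
mode_value <- as.numeric(names(counts)[which.max(counts)])

print(sort(x))
print(med)
print(mode_value)
```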
10. 1st quartile, 3rd quartile and
Interquartile range
Quartiles are the values that divide a list of numbers into
quarters:
Put the list of numbers in order
Then cut the list into four equal parts
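The Quartiles are at the "cuts"
Example: find the 1st quartile and 3rd quartile of
2, 4, 4, 5, 6, 7, 8
The middle value is 5 (the median, Q2); the lower half (2, 4, 4) gives Q1 = 4 and the
upper half (6, 7, 8) gives Q3 = 7, so the interquartile range is Q3 - Q1 = 3.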
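A short R sketch for the same list. Note that software uses several quartile conventions: quantile() with type = 6 reproduces the "cut the list" method for this data, while R's default (type = 7) interpolates and gives a different 3rd quartile. The variable names are illustrative.

```r
x <- c(2, 4, 4, 5, 6, 7, 8)
probs <- c(0.25, 0.5, 0.75)

# type = 6 matches the "cut the sorted list" method for this example
q_cut <- quantile(x, probs = probs, type = 6)

# R's default is type = 7, which interpolates between neighbouring values
q_default <- quantile(x, probs = probs)

# interquartile range = 3rd quartile minus 1st quartile
iqr_cut <- unname(q_cut[3] - q_cut[1])

print(q_cut)
print(q_default)
print(iqr_cut)
```

So, when comparing quartiles from different tools, check which convention each one uses.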
11. Standard deviation vs Variance vs
Standard Score/ z-score
The standard deviation:
(Deviation just means how far from the normal)
The Standard Deviation is a measure of how spread out numbers are.
Its symbol is σ (the Greek letter sigma).
The standard deviation is the square root of the variance. For example, a Normal distribution with
mean = 10 and sd = 3 is exactly the same thing as a Normal distribution
with mean = 10 and variance = 9.
Standard Score ("z-score") for a number:
first subtract the mean from the number,
then divide by the Standard Deviation
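A minimal sketch in R, using made-up numbers. Note that sd() and var() in R use the sample formulas (dividing by n - 1), so the z-scores below are computed with the sample standard deviation.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data

s <- sd(x)    # sample standard deviation
v <- var(x)   # sample variance, equal to s^2

# z-score: subtract the mean from each number, then divide by the standard deviation
z <- (x - mean(x)) / s

print(s)
print(v)
print(round(z, 2))
```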
13. Population means we include all the numbers in our calculations.
Sample means we select a subset of a big population that is not fully available.
14. A Practical Example
Your company packages sugar in 1 kg bags.
When you weigh a sample of bags you get these results:
1007g, 1032g, 1002g, 983g, 1004g, ... (a hundred measurements)
Mean = 1010g
Standard Deviation = 20g
How many packages weigh less than 1 kg? 30.85%
How to fix this problem?
Let's adjust the machine so that 1000g is:
at −3 standard deviations: 0.1% of bags underweight
at −2.5 standard deviations: 0.6% of bags underweight [Good choice]
The standard deviation is 20g, and we need 2.5 of them: 2.5 × 20g = 50g, so
set the machine to average 1000g + 50g = 1050g (40g above the current mean) to fix the problem.
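The percentages above can be checked with pnorm(), assuming the bag weights follow a Normal distribution. The variable names are illustrative.

```r
# share of bags below 1000g with the current settings (mean 1010g, sd 20g)
p_under <- pnorm(1000, mean = 1010, sd = 20)

# share below the limit when 1000g sits at -3 or -2.5 standard deviations
p_minus_3   <- pnorm(-3)
p_minus_2_5 <- pnorm(-2.5)

# new machine average so that 1000g is 2.5 standard deviations below the mean
new_mean <- 1000 + 2.5 * 20
p_after <- pnorm(1000, mean = new_mean, sd = 20)

print(round(100 * p_under, 2))      # percent underweight now
print(round(100 * p_minus_3, 2))
print(round(100 * p_minus_2_5, 2))
print(new_mean)
```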
15. Accuracy vs Precision
Accuracy is how close a measured value is to the actual (true) value.
Precision is how close the measured values are to each other.
16. Correlation (Association)
When we need to know whether there is a relationship between two variables x and y,
we check the correlation, whose value lies between +1 and -1.
+1 means perfect positive correlation: when x increases, y increases.
-1 means perfect negative correlation: when x increases, y decreases.
0 means no linear correlation between x and y.
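A small sketch with cor() on made-up data. The third case shows that a correlation of 0 only rules out a straight-line relationship: y_curve is fully determined by x, yet its correlation with x is 0.

```r
x <- 1:10

y_up    <- 2 * x + 1      # rises with x in a straight line
y_down  <- -3 * x + 5     # falls with x in a straight line
y_curve <- (x - 5.5)^2    # U-shaped around the mean of x

r_up    <- cor(x, y_up)
r_down  <- cor(x, y_down)
r_curve <- cor(x, y_curve)

print(r_up)
print(r_down)
print(round(r_curve, 10))
```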
17. ANOVA
Analysis of variance: tests whether the means of several groups differ.
For example, you sell lemons and oranges in the park and on the beach, and you need to
know whether this makes a difference or not.
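A rough sketch of the lemon/orange example with aov(). The sales figures are invented for illustration only, and the column names (location, fruit, units) are assumptions.

```r
# hypothetical units sold: 4 observations per location, alternating fruit
sales <- data.frame(
  location = rep(c("park", "beach"), each = 4),
  fruit    = rep(c("lemon", "orange"), times = 4),
  units    = c(20, 22, 19, 23, 30, 33, 29, 32)
)

# two-way ANOVA: do mean sales differ by location and by fruit?
fit <- aov(units ~ location + fruit, data = sales)
anova_table <- summary(fit)[[1]]

# p-value for the location effect (first row of the table)
p_location <- anova_table[["Pr(>F)"]][1]

print(anova_table)
```

A small p-value for a factor suggests that the group means differ for that factor.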
18. Regression
Regression helps in prediction: we use information that we have and apply
some statistics to predict something that we don't know.
So, we can use past sales to predict future sales.
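A minimal sketch of the past-sales idea using lm(), R's linear regression function. The monthly sales numbers are made up, and a straight-line trend is an assumption.

```r
# hypothetical sales for the last 6 months
past <- data.frame(
  month = 1:6,
  sales = c(110, 120, 135, 140, 155, 160)
)

# fit a straight line: sales = intercept + slope * month
fit <- lm(sales ~ month, data = past)
slope <- unname(coef(fit)["month"])

# predict the next two months
forecast <- predict(fit, newdata = data.frame(month = 7:8))

print(coef(fit))
print(forecast)
```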
19. What is R?
R is an open source, free language and environment for statistical
computing and graphics.
Runs on any platform, e.g. Windows/Unix/Linux
20. R
Case sensitive
Not sensitive to white spaces
Use = or <- to assign value to a variable
Download R from here
https://cran.r-project.org/
Download R studio from here
https://www.rstudio.com/products/rstudio/download/
Ctrl+L to clear the console
25. Vectors
An ordered set of values
To define a new vector use c()
For consecutive numbers use :
Examples
c(1,100,3,5,8)
c(9,80,3,5,8) + c(1,100,3,5,8)
c(2,4,8) - 2
c(3:8)-2
1:5 + 6:10
sum(2:6)
26. Set names for the vector elements
X=100:102
names(X)=c("First","Second","Third")
X
First Second Third
100 101 102
Y=1:26
names(Y)=toupper(letters[1:26])
Y
A B C D E F G H I J K …
1 2 3 4 5 6 7 8 9 10 11 …
27. na.rm = TRUE
Z=c(3,4,5,6,7)
mean(Z)   # 5
A missing value in R is written NA.
K=c(3,4,5,6,7,NA)
mean(K)   # NA
To ignore missing values during calculation add na.rm = TRUE
mean(K, na.rm = TRUE)   # 5
The mean is the sum over every possible value weighted by the
probability of that value; if all items have the same weight then the mean is the
simple average.
28. factor
It takes a vector and stores it as categories; the levels function gives the distinct values inside this vector.
Example
kk = factor(c('man','animal','man','man','animal'))
levels(kk)       # "animal" "man"
nlevels(kk)      # 2
as.integer(kk)   # 2 1 2 2 1
29. List
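Each element of the list can have a different type.
Example
zz = list(1,6,'ssss',TRUE)
kk = list(first=1,second=6,third='ssss',fourth=TRUE)
kk[1:3]       # sub-list with the first three elements
kk[1]         # sub-list with the first element
kk["first"]   # sub-list selected by name
kk$first      # the element itself
To convert a vector to a list use as.list(vector name)
To convert a list to a vector use unlist(list name), or as.numeric(list name) when every element is a single number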
30. NA vs NULL
NA marks a missing value and still occupies a position; NULL is the empty object and holds nothing.
length(NA) is 1
length(NULL) is 0
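A quick sketch of the difference when building a vector: NA keeps its slot, while NULL simply disappears. The variable names are illustrative.

```r
v_na   <- c(1, NA, 3)     # NA is kept as a placeholder
v_null <- c(1, NULL, 3)   # NULL is dropped when combining

len_na   <- length(v_na)
len_null <- length(v_null)

print(len_na)
print(len_null)
print(is.na(v_na))     # flags the missing position
print(is.null(NULL))   # TRUE only for the empty object
```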
31. Data Frame
It is like a DB table that contains rows and columns
To create it use data.frame()
Example
zz=data.frame( x=c(1:5) , y=letters[1:5] )
Related Functions
rownames, colnames,dim , dimnames, nrow, ncol
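A short sketch applying the related functions to the example data frame above.

```r
zz <- data.frame(x = 1:5, y = letters[1:5])

d  <- dim(zz)        # number of rows, then number of columns
nr <- nrow(zz)
nc <- ncol(zz)
cn <- colnames(zz)   # column names
rn <- rownames(zz)   # row names (default: "1", "2", ...)

print(d)
print(cn)
print(rn)
print(dimnames(zz))  # row names and column names together, as a list
```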
32. rnorm , round
• Make a vector, z, containing a sequence of 5
randomly generated numbers from a normal
distribution with a mean of 10 and a standard
deviation of 3, then round it to 2 decimal
places
z=rnorm(5,10,3)
z=round(z,2)
33. Some R functions
• getwd() : get current working directory
• setwd("c:/") : set working directory
• dir() : list files in current directory
• ls() : list current defined variables
• X=read.csv("1.csv") : read a CSV file from the working directory
• sessionInfo() : show the R version, platform, and loaded packages
34. Matrix Operations
Math: Given two square matrices, A and B, if AB = I, the identity matrix with
1s on the diagonal and 0s off the diagonal, then B is the right-inverse of
A, and can be written as A^-1.
35. Defined Matrix
Create the matrix first as a vector, and then give the
vector its dimensions; for very large data, this may
be more computationally efficient.
A = c(1.00, 0.14, 0.35, 0.14, 1.00, 0.09, 0.35, 0.09, 1.00)
dim(A) = c(3,3)
AA = solve(A)    # inverse of A
Z = A %*% AA     # A times its inverse
A:
1.00 0.14 0.35
0.14 1.00 0.09
0.35 0.09 1.00
Z (the identity matrix, up to rounding error):
1 0 0
0 1 0
0 0 1
36. List
• Use a list when we have "ragged" data arrays in which the
variables have unequal numbers of observations. For example, we have
3 departments and we need to apply some calculation to
salaries: the 1st department has 5 employees, the 2nd has 4
employees and the 3rd has 6 employees, and we need to work with
them together.
Dept1=c( 5,8,6,9,4)
Dept2=c( 15,7,3,4)
Dept3=c( 6,8,3,6,9,4)
AllDepts=list(Dept1,Dept2,Dept3)
37. Apply a Function over a List X
• lapply returns a list of the same length as X, each element of
which is the result of applying FUN to the corresponding
element of X.
• sapply is a user-friendly version of lapply that by default returns
a vector or matrix when possible
DeptAverage = sapply(AllDepts,mean)
Dept_sdev = sapply(AllDepts, sd)
Dept_Variances = lapply(AllDepts, var)
DeptSD = round(Dept_sdev, 2)
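A self-contained sketch comparing the two functions on the department data. The element names (Dept1, Dept2, Dept3) given to the list are an addition for readability; the list on the previous slide is unnamed.

```r
AllDepts <- list(
  Dept1 = c(5, 8, 6, 9, 4),
  Dept2 = c(15, 7, 3, 4),
  Dept3 = c(6, 8, 3, 6, 9, 4)
)

# lapply always returns a list, one element per department
means_list <- lapply(AllDepts, mean)

# sapply simplifies the same result to a named numeric vector
means_vector <- sapply(AllDepts, mean)

print(means_list)
print(means_vector)
```

The list form is handy when results have different lengths; the vector form is easier to round, sort, or plot.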
39. data frame
A data frame is a list, but rectangular like a matrix. Every column
represents a variable or a factor in the dataset. Every row in the
data frame represents a case.
Import data: the package data.table offers fast reading and aggregation of large data
library(data.table)
Data1 = fread("Data1.csv", header=TRUE, verbose=FALSE, showProgress=FALSE)
str(Data1)       # displays the variables in the dataset with a few sample values
summary(Data1)   # summary statistics for each column
USStatesCodes = fread("USStatesCodes.csv", header=TRUE)
GenderList = fread("GenderList.csv", header=TRUE)
41. Data1
CustomerID  GenderCode  GenderName  StateCode  State    numTrans
111111      1           Female      22         Alabama  334
123221      2           male        23         Alaska   324
776768      2           male        52         Florida  352
455656      1           Female      29         Arizona  313
Select rows that meet one criterion
which(Data1$GenderCode == 2)
Select some columns from data
SelectedColumnsNames = c("CustomerID", "numTrans")
Data2 = Data1[, SelectedColumnsNames, with = FALSE]   # data.table syntax, since Data1 was read with fread
Get Information about column
summary(Data1 $ numTrans)
Min. 1st Qu. Median Mean 3rd Qu. Max.
313 333 350 366 377 400
46. Process description
We need a function that can read all rows, extract the products,
and indicate whether each product was ordered or not in each transaction
(each row). So, it creates a transactions object (a sparse item matrix) from these data.
The best function for this job is the read.transactions function in the arules package,
and we can detect relations between items with the apriori function.
[Illustration: transaction-by-item matrix. Columns: liquor, soups, coffee, butter, juice, fruit, soda, pastry, …; each row is a transaction and a 1 marks an item that was ordered.]
47. Steps
install.packages("arules")
require(arules)
setwd("C:/R-datasets")
SalesData = read.transactions("groceries.csv", sep = ",")
View data
str(SalesData)
summary(SalesData)                 # summary statistics about the data
inspect(SalesData[1:3])            # show the items in the first 3 transactions
itemFrequency(SalesData[, 1])      # all rows, item number 1
itemFrequency(SalesData[, 1:6])    # all rows, items 1 to 6
Plot
itemFrequencyPlot(SalesData, support = 0.05)   # plot items with support of at least 5%
itemFrequencyPlot(SalesData, topN = 20)        # plot the 20 most frequent items
Detect Association
AssociationRules1 =
apriori(SalesData, parameter = list(support = 0.007, confidence = 0.25, minlen = 2))
48. Browse Association rules
• inspect(AssociationRules1[1:2])
• inspect(sort(AssociationRules1, by = "lift")[1:4])
Lift is simply the ratio of these values: target
response divided by average response. For a rule LHS => RHS, it is the
confidence of the rule divided by the support of RHS.
LHS     RHS   Support  Confidence  Lift
Coffee  Milk  0.006    0.44        4.2
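To make support, confidence and lift concrete, here is a base R sketch on 10 invented baskets (it does not use arules, and the numbers are unrelated to the table above).

```r
# hypothetical baskets: TRUE means the item was bought in that transaction
coffee <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0) == 1
milk   <- c(1, 1, 0, 1, 0, 0, 0, 0, 0, 0) == 1

# support: share of all baskets that contain both items
support_rule <- mean(coffee & milk)

# confidence of coffee => milk: share of coffee baskets that also contain milk
confidence_rule <- support_rule / mean(coffee)

# lift: confidence compared with how often milk is bought overall
lift_rule <- confidence_rule / mean(milk)

print(support_rule)
print(round(confidence_rule, 3))
print(round(lift_rule, 3))
```

A lift above 1 means the two items appear together more often than they would if they were bought independently.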