R for Statistical Computing

RAFIE TARABAY
ENG_RAFIE@MANS.EDU.EG

Central tendency
Finding the middle of the data and understanding how the data is shaped.
MEAN, MEDIAN, MODE

Median Value vs Mode
• The median is the "middle" value of a sorted list of numbers.
• The mode is simply the number that appears most often.
• So, for (1, 3, 5, 12, 3), sorted as (1, 3, 3, 5, 12), the median is 3 and the mode is 3.
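A minimal R sketch of the example above. median() is built in; base R has no function that returns the statistical mode, so the sketch assumes one simple way to get it, counting values with table(). The variable names are illustrative only.

```r
# The numbers from the slide
x <- c(1, 3, 5, 12, 3)

# Median: the middle value of the sorted list (1, 3, 3, 5, 12)
sorted_x <- sort(x)
med <- median(x)

# Mode: count how often each value appears and take the most frequent one
counts <- table(x)
mode_value <- as.numeric(names(counts)[which.max(counts)])

print(sorted_x)
print(med)
print(mode_value)
```

If several values tie for the highest count, which.max() keeps only the first of them.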

Data variability
VARIANCE - STANDARD DEVIATION

1st quartile, 3rd quartile and Interquartile range
Quartiles are the values that divide a list of numbers into quarters:
• Put the list of numbers in order.
• Then cut the list into four equal parts.
• The quartiles are at the "cuts".
• The interquartile range is the 3rd quartile minus the 1st quartile.
• Example: find the 1st and 3rd quartiles of 2, 4, 4, 5, 6, 7, 8.
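A short sketch of the example in R. Note that quantile() supports several definitions of a quartile: the default (type = 7) interpolates between neighbouring values, while type = 6 happens to reproduce the simple "cut the sorted list" answer for this data. Which definition a course or textbook expects is an assumption you should check.

```r
x <- c(2, 4, 4, 5, 6, 7, 8)

# Default definition (type = 7): interpolates between neighbouring values
q_default <- quantile(x, probs = c(0.25, 0.75))

# Alternative definition (type = 6): matches the "cuts" at 4 and 7 here
q_cut <- quantile(x, probs = c(0.25, 0.75), type = 6)

# Interquartile range with the default definition: 3rd quartile - 1st quartile
iqr_default <- IQR(x)

print(q_default)    # 25% = 4, 75% = 6.5
print(q_cut)        # 25% = 4, 75% = 7
print(iqr_default)  # 2.5
```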

Standard deviation vs Variance vs Standard Score (z-score)
The standard deviation:
(Deviation just means how far from the normal.)
• The standard deviation is a measure of how spread out numbers are.
• Its symbol is σ (the Greek letter sigma).
• It is the square root of the variance. For example, a Normal distribution with mean = 10 and sd = 3 is exactly the same thing as a Normal distribution with mean = 10 and variance = 9.
Standard score ("z-score") for a number:
• first subtract the mean from the number,
• then divide by the standard deviation.
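A minimal sketch of these definitions in R, using a made-up vector. sd() and var() are the built-in functions (both use the sample formulas); the z-scores are computed by hand and then compared with the built-in scale().

```r
# Made-up numbers, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

x_mean <- mean(x)
x_sd   <- sd(x)    # standard deviation
x_var  <- var(x)   # variance = sd^2

# z-score: subtract the mean from each number, then divide by the sd
z <- (x - x_mean) / x_sd

# scale() performs the same standardisation
z_builtin <- as.numeric(scale(x))

print(x_sd^2 - x_var)  # 0, up to rounding error
print(round(z, 2))
```

After standardising, the z-scores always have mean 0 and standard deviation 1.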

Population: we include all the numbers in our calculations.
Sample: we select a sample from a big population that is not fully available.
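The distinction matters when computing the variance: R's var() and sd() use the sample formulas (dividing by n - 1), so the population versions have to be written out. A small sketch with made-up numbers:

```r
# Made-up numbers, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: R's var() divides by n - 1
sample_var <- var(x)

# Population variance: divide by n instead
population_var <- sum((x - mean(x))^2) / n
population_sd  <- sqrt(population_var)

print(sample_var)      # 32 / 7
print(population_var)  # 4
print(population_sd)   # 2
```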

A Practical Example
• Your company packages sugar in 1 kg bags.
• When you weigh a sample of bags you get these results:
• 1007g, 1032g, 1002g, 983g, 1004g, ... (a hundred measurements)
• Mean = 1010g
• Standard Deviation = 20g
• How many packages weigh less than 1 kg? 30.85% (1000g is 0.5 standard deviations below the mean, assuming the weights are normally distributed)
How to fix this problem?
• Let's adjust the machine so that 1000g is:
• at −3 standard deviations: 0.1% of bags underweight
• at −2.5 standard deviations: 0.6% of bags underweight [Good choice]
• The standard deviation is 20g, and we need 2.5 of them: 2.5 × 20g = 50g, so the mean should be 1000g + 50g = 1050g. That means raising the average fill by 40g from the current 1010g.
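The percentages above can be reproduced with pnorm(), the cumulative distribution function of the Normal distribution. This sketch assumes, as the example does, that the bag weights are normally distributed.

```r
mean_g <- 1010   # current mean weight in grams
sd_g   <- 20     # standard deviation in grams

# Share of bags below 1000g with the current setting
p_under <- pnorm(1000, mean = mean_g, sd = sd_g)

# Share of a Normal distribution below -3 and -2.5 standard deviations
p_minus_3   <- pnorm(-3)
p_minus_2_5 <- pnorm(-2.5)

# New mean so that 1000g sits 2.5 standard deviations below it
new_mean <- 1000 + 2.5 * sd_g
p_under_after <- pnorm(1000, mean = new_mean, sd = sd_g)

print(round(100 * p_under, 2))        # about 30.85 (%)
print(round(100 * p_minus_3, 1))      # about 0.1 (%)
print(round(100 * p_minus_2_5, 1))    # about 0.6 (%)
print(new_mean)                       # 1050
```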

Accuracy vs Precision
• Accuracy is how close a measured value is to the actual (true) value.
• Precision is how close the measured values are to each other.

Correlation (Association)
• To find out whether there is a relationship between two variables x and y, we check the correlation; its value lies between +1 and -1.
• +1 means perfect positive correlation: when x increases, y increases.
• -1 means perfect negative correlation: when x increases, y decreases.
• 0 means no linear correlation between x and y.
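A small sketch with cor(), which computes the (Pearson) correlation coefficient. The data are made up to hit the three cases exactly; the last case also shows that a correlation of 0 only rules out a linear relationship.

```r
x <- c(-2, -1, 0, 1, 2)

y_up     <- 2 * x + 1    # rises with x
y_down   <- -3 * x + 5   # falls as x rises
y_curved <- x^2          # depends on x, but not linearly

cor_up     <- cor(x, y_up)      # +1
cor_down   <- cor(x, y_down)    # -1
cor_curved <- cor(x, y_curved)  # 0

print(c(cor_up, cor_down, cor_curved))
```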

ANOVA
• Analysis of variance.
• For example, you sell lemons and oranges in a park and on a beach, and you need to know whether this makes a difference or not.
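A hedged sketch of a one-way ANOVA with aov(). The sales figures are invented for illustration, and only one factor (the place) is tested; a small p-value suggests that the mean sales differ between places.

```r
# Invented daily sales (units) at two places, for illustration only
sales <- data.frame(
  place = factor(rep(c("park", "beach"), each = 5)),
  units = c(20, 22, 19, 24, 21,   # park
            30, 33, 29, 35, 31)   # beach
)

# One-way analysis of variance: do mean sales differ by place?
fit <- aov(units ~ place, data = sales)
anova_table <- summary(fit)[[1]]

# p-value of the F test for the 'place' factor (first row of the table)
p_value <- anova_table[["Pr(>F)"]][1]

print(anova_table)
print(p_value < 0.05)  # TRUE: the difference is significant for this data
```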

Regression
• Helps with prediction: we use the information that we have and apply some statistics to predict something that we don't know.
• So, we can use past sales to predict future sales.
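A minimal sketch of this idea using a simple linear regression with lm() and predict(). The monthly sales are invented, and a straight-line trend is an assumption; real sales data may need a different model.

```r
# Invented past sales, for illustration only
past <- data.frame(
  month = 1:6,
  sales = c(100, 112, 119, 131, 138, 152)
)

# Fit a straight line: sales = intercept + slope * month
fit <- lm(sales ~ month, data = past)

# Use the fitted line to predict the next (unknown) month
next_month <- data.frame(month = 7)
forecast <- predict(fit, newdata = next_month)

print(coef(fit))
print(forecast)
```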

What is R?
• R is a free, open-source language and environment for statistical computing and graphics.
• Runs on the major platforms, e.g. Windows, Unix and Linux.

R
• Case sensitive
• Not sensitive to white space
• Use = or <- to assign a value to a variable
• Download R from here: https://cran.r-project.org/
• Download RStudio from here: https://www.rstudio.com/products/rstudio/download/
• Ctrl+L to clear the console

Some R operations
• x = 5
• y = 4
• z = x * y              # 20
• A = 1:10               # 1,2,3,4,5,6,7,8,9,10
• B = A^2                # 1,4,9,16,25,36,49,64,81,100
• K = B[1:5]             # 1,4,9,16,25
• A[1:3] = c(33,66,99)
• A                      # 33,66,99,4,5,6,7,8,9,10

Bulk Data containers
• Vector
• List
• Data Frame

Vectors
• An ordered set of values
• To define a new vector use c()
• For consecutive numbers use :
Examples
• c(1,100,3,5,8)
• c(9,80,3,5,8) + c(1,100,3,5,8)     # 10,180,6,10,16
• c(2,4,8) - 2                       # 0,2,6
• c(3:8) - 2                         # 1,2,3,4,5,6
• 1:5 + 6:10                         # 7,9,11,13,15
• sum(2:6)                           # 20

Set names on the elements of a vector
• X = 100:102
• names(X) = c("First","Second","Third")
• X
First Second Third
100 101 102
• Y = 1:26
• names(Y) = toupper(letters[1:26])
• Y
A B C D E F G H I J K …
1 2 3 4 5 6 7 8 9 10 11 …

na.rm = TRUE
• Z = c(3,4,5,6,7)
• mean(Z)                   # 5
• A missing value in R is NA.
• K = c(3,4,5,6,7,NA)
• mean(K)                   # NA
• To ignore missing values during the calculation add na.rm = TRUE
• mean(K, na.rm = TRUE)     # 5
• The mean is equal to the sum over every possible value weighted by the probability of that value; if all items have the same weight, then the mean is the ordinary average.
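A short sketch of the last point: with equal weights the weighted sum is just the average, and with unequal weights the built-in weighted.mean() gives the weighted result. The unequal weights below are made up for illustration.

```r
values <- c(3, 4, 5, 6, 7)

# Equal weights: every value has probability 1/5
equal_weights <- rep(1 / length(values), length(values))
mean_by_formula <- sum(values * equal_weights)
mean_builtin <- mean(values)

# Unequal (made-up) weights that sum to 1
weights <- c(0.1, 0.1, 0.2, 0.3, 0.3)
weighted <- weighted.mean(values, weights)

print(mean_by_formula)  # 5
print(mean_builtin)     # 5
print(weighted)         # 5.6
```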

factor
factor() takes a vector and stores it as a categorical variable; the levels function returns the distinct values inside the vector.
Example
• kk = factor(c('man','animal','man','man','animal'))
• levels(kk)         # "animal" "man"
• nlevels(kk)        # 2
• as.integer(kk)     # 2 1 2 2 1

List
• Each element of a list can have a different type.
Example
• zz = list(1, 6, 'ssss', TRUE)
• kk = list(first=1, second=6, third='ssss', fourth=TRUE)
• Access elements: kk[1:3], kk[1], kk["first"], kk$first
• To convert a vector to a list use as.list(vector name)
• To convert a list to a vector use unlist(list name), or as.numeric(list name) when every element is a single number
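A small sketch of the access forms above. The point worth noticing is that single brackets return a (shorter) list, while $ and double brackets return the element itself.

```r
kk <- list(first = 1, second = 6, third = "ssss", fourth = TRUE)

sub_list      <- kk[1:3]        # a list with the first three elements
one_item_list <- kk["first"]    # still a list, of length 1
value         <- kk$first       # the element itself: 1
same_value    <- kk[["first"]]  # same as kk$first

# Converting between vectors and lists
as_list        <- as.list(c(1, 6, 9))
back_to_vector <- unlist(as_list)
numbers        <- as.numeric(list(1, 6, 9))

print(class(one_item_list))  # "list"
print(class(value))          # "numeric"
print(back_to_vector)        # 1 6 9
```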

NA vs NULL
• When we have a missing value in a list we can set it to NA or NULL
• NA is a placeholder for one missing value: length(NA) is 1
• NULL is the empty object: length(NULL) is 0
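A brief sketch of the practical difference: NA keeps its position in a vector or list, while NULL disappears. Variable names are illustrative.

```r
# NA occupies a slot; NULL is dropped when building a vector
with_na   <- c(1, NA, 3)
with_null <- c(1, NULL, 3)

len_na   <- length(with_na)    # 3
len_null <- length(with_null)  # 2

# Test for each with is.na() and is.null()
na_flags <- is.na(with_na)     # FALSE TRUE FALSE
null_check <- is.null(NULL)    # TRUE

# In a list, assigning NULL removes the element; assigning NA keeps it
items_null <- list(a = 1, b = 2, c = 3)
items_null$b <- NULL
items_na <- list(a = 1, b = 2, c = 3)
items_na$b <- NA

print(len_na)
print(len_null)
print(names(items_null))  # "a" "c"
print(names(items_na))    # "a" "b" "c"
```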

Data Frame
• It is like a database table: it contains rows and columns
• To create it use data.frame()
Example
• zz = data.frame(x = c(1:5), y = letters[1:5])
Related Functions
• rownames, colnames, dim, dimnames, nrow, ncol
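A quick sketch applying the related functions to the example data frame:

```r
zz <- data.frame(x = 1:5, y = letters[1:5])

size      <- dim(zz)       # 5 rows, 2 columns
n_rows    <- nrow(zz)      # 5
n_cols    <- ncol(zz)      # 2
col_names <- colnames(zz)  # "x" "y"
row_names <- rownames(zz)  # "1" "2" "3" "4" "5"
all_names <- dimnames(zz)  # list of row names and column names

print(size)
print(col_names)
print(row_names)
```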

rnorm, round
• Make a vector, z, containing a sequence of 5 randomly generated numbers from a normal distribution with a mean of 10 and a standard deviation of 3, then round it to 2 decimal places
z=rnorm(5,10,3)
z=round(z,2)

Some R functions
• getwd() : get the current working directory
• setwd("c:/") : set the working directory
• dir() : list files in the current directory
• ls() : list currently defined variables
• X = read.csv("1.csv") : read a CSV file from the working directory
• sessionInfo() : show the R version, platform and loaded packages

Matrix Operations
Math: Given two square matrices, A and B, if AB = I, the identity matrix with 1s on the diagonal and 0s off the diagonal, then B is the right-inverse of A, and can be represented as A^(-1).

Defined Matrix
Create the matrix first as a vector, and then give the vector its dimensions; for very large data, this may be more compute efficient.
A = c(1.00, 0.14, 0.35, 0.14, 1.00, 0.09, 0.35, 0.09, 1.00)
dim(A) = c(3,3)
AA = solve(A)      # inverse of A
Z = A %*% AA       # matrix product; the identity matrix, up to rounding error
A:
1.00 0.14 0.35
0.14 1.00 0.09
0.35 0.09 1.00
Z:
1 0 0
0 1 0
0 0 1
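A sketch that checks the result numerically. It builds the same matrix with matrix() instead of dim(), which is an equivalent alternative, and compares the product with the identity matrix using all.equal(), because floating-point rounding means the product is rarely exactly 1s and 0s.

```r
A <- matrix(c(1.00, 0.14, 0.35,
              0.14, 1.00, 0.09,
              0.35, 0.09, 1.00),
            nrow = 3, ncol = 3)

A_inv <- solve(A)   # inverse of A
Z <- A %*% A_inv    # should be the 3 x 3 identity matrix

# Compare with the identity matrix, allowing for rounding error
is_identity <- isTRUE(all.equal(Z, diag(3)))

print(round(Z, 10))
print(is_identity)  # TRUE
```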

List
• Use a list when we have "ragged" data arrays in which the variables have unequal numbers of observations. For example, we have 3 departments and need to apply some calculations to salaries: the 1st department has 5 employees, the 2nd has 4 employees and the 3rd has 6 employees, and we need to work with them together.
Dept1 = c(5,8,6,9,4)
Dept2 = c(15,7,3,4)
Dept3 = c(6,8,3,6,9,4)
AllDepts = list(Dept1,Dept2,Dept3)

Apply a Function over a List X
• lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
• sapply is a user-friendly version of lapply that by default returns a vector or matrix where possible
DeptAverage = sapply(AllDepts, mean)
Dept_sdev = sapply(AllDepts, sd)
Dept_Variances = lapply(AllDepts, var)
DeptSD = round(Dept_sdev, 2)
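A self-contained sketch of the department example, showing the difference in return type between sapply (a numeric vector here) and lapply (always a list):

```r
Dept1 <- c(5, 8, 6, 9, 4)
Dept2 <- c(15, 7, 3, 4)
Dept3 <- c(6, 8, 3, 6, 9, 4)
AllDepts <- list(Dept1, Dept2, Dept3)

# sapply simplifies the result to a numeric vector, one value per department
DeptAverage <- sapply(AllDepts, mean)
DeptSD <- round(sapply(AllDepts, sd), 2)

# lapply always returns a list, one element per department
DeptVariances <- lapply(AllDepts, var)

print(DeptAverage)          # 6.40 7.25 6.00
print(DeptSD)
print(class(DeptVariances)) # "list"
```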

Data Frames
HOW TO MERGE DATA FROM MANY DATA FRAMES?

data frame
A data frame is a list, but rectangular like a matrix. Every column represents a variable or a factor in the dataset. Every row in the data frame represents a case.
Import data: the package data.table offers fast aggregation of large data
library(data.table)
Data1 = fread("Data1.csv", header=TRUE, verbose=FALSE, showProgress=FALSE)
str(Data1)        # displays the variables in the dataset with a few sample values
summary(Data1)    # summary statistics for each column
USStatesCodes = fread("USStatesCodes.csv", header=TRUE)
GenderList = fread("GenderList.csv", header=TRUE)
Data1
CustID  GenderCode  StateCode  numTrans
111111  1           22         334
123221  2           23         324
776768  2           52         352
455656  1           29         313
GenderList
GenderID  GenderName
1         Female
2         Male
USStatesCodes
StateID  State
22       Alabama
23       Alaska
29       Arizona
52       Florida
Data1 = merge(Data1, GenderList, by.x = "GenderCode", by.y = "GenderID", all.x = TRUE)
Data1 = merge(Data1, USStatesCodes, by.x = "StateCode", by.y = "StateID", all.x = TRUE)
setnames(Data1, "CustID", "CustomerID")
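A self-contained sketch of the same merge. As an assumption for the sake of a runnable example, it builds the three tables in code as plain data frames (instead of reading CSV files with fread) and renames the column with base R rather than data.table's setnames. The result is stored in a new name, Merged, so the inputs stay unchanged.

```r
Data1 <- data.frame(
  CustID     = c(111111, 123221, 776768, 455656),
  GenderCode = c(1, 2, 2, 1),
  StateCode  = c(22, 23, 52, 29),
  numTrans   = c(334, 324, 352, 313)
)
GenderList <- data.frame(GenderID = c(1, 2),
                         GenderName = c("Female", "Male"))
USStatesCodes <- data.frame(StateID = c(22, 23, 29, 52),
                            State = c("Alabama", "Alaska", "Arizona", "Florida"))

# all.x = TRUE keeps every row of the left table, even without a match
Merged <- merge(Data1, GenderList,
                by.x = "GenderCode", by.y = "GenderID", all.x = TRUE)
Merged <- merge(Merged, USStatesCodes,
                by.x = "StateCode", by.y = "StateID", all.x = TRUE)

# Rename CustID to CustomerID (base R equivalent of setnames)
names(Merged)[names(Merged) == "CustID"] <- "CustomerID"

print(Merged)
```

Note that merge() sorts the result by the key column and places that column first, so the row and column order differ from the input.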

Data1
CustomerID  GenderCode  GenderName  StateCode  State    numTrans
111111      1           Female      22         Alabama  334
123221      2           Male        23         Alaska   324
776768      2           Male        52         Florida  352
455656      1           Female      29         Arizona  313
Select rows that meet a criterion
which(Data1$GenderCode == 2)
Select some columns from the data
SelectedColumnsNames = c("CustomerID", "numTrans")
Data2 = Data1[, SelectedColumnsNames, with = FALSE]    # data.table syntax; for a plain data.frame use Data1[SelectedColumnsNames]
Get information about a column
summary(Data1$numTrans)
Min.  1st Qu.  Median  Mean  3rd Qu.  Max.
313   333      350     366   377      400
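A runnable sketch of these selections on a plain data frame. The four rows are the sample rows shown on the slides, so the summary values differ from the summary of the full data set above.

```r
Data1 <- data.frame(
  CustomerID = c(111111, 123221, 776768, 455656),
  GenderCode = c(1, 2, 2, 1),
  numTrans   = c(334, 324, 352, 313)
)

# Row numbers where the condition holds (note == for comparison, not =)
rows_gender2 <- which(Data1$GenderCode == 2)
Gender2 <- Data1[rows_gender2, ]

# Select columns by name
SelectedColumnsNames <- c("CustomerID", "numTrans")
Data2 <- Data1[SelectedColumnsNames]

# Summary statistics for one column
trans_summary <- summary(Data1$numTrans)

print(rows_gender2)  # 2 3
print(Data2)
print(trans_summary)
```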

What are the types of machine learning?

Association Rules for
Market Basket Analysis

Process description
We need a function that can read all rows, extract the products, and indicate whether each product was ordered or not in each transaction (each row). It thus builds a transaction-by-product table (stored as a sparse matrix) from these data.
A suitable function for this job is read.transactions in the arules package, and we can detect relations in the data with the apriori function.
[Table: one row per transaction and one column per product (liquor, soups, coffee, butter, juice, fruit, soda, pastry, …); a 1 marks a product that was ordered in that transaction.]

Steps
• install.packages("arules")
• require(arules)
• setwd("C:/R-datasets")
• SalesData = read.transactions("groceries.csv", sep = ",")
View data
• str(SalesData)
• summary(SalesData)                  # summary statistics about the data
• inspect(SalesData[1:3])             # show the sales transactions in the first 3 rows
• itemFrequency(SalesData[, 1])       # all rows, product number 1
• itemFrequency(SalesData[, 1:6])     # all rows, products 1 to 6
Plot
• itemFrequencyPlot(SalesData, support = 0.05)    # draw items whose support is at least 5%
• itemFrequencyPlot(SalesData, topN = 20)         # draw the 20 most frequent items
Detect Association
• AssociationRules1 = apriori(SalesData, parameter = list(support = 0.007, confidence = 0.25, minlen = 2))

Browse Association rules
• inspect(AssociationRules1[1:2])
• inspect(sort(AssociationRules1, by = "lift")[1:4])
Lift is simply the ratio of these values: target response divided by average response. For a rule LHS => RHS, that is the confidence of the rule divided by the overall share of transactions containing the RHS.
LHS     RHS   Support  Confidence  Lift
Coffee  Milk  0.006    0.44        4.2
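To make the three measures concrete, here is a hand calculation in base R for the rule coffee => milk on a tiny invented set of transactions. It does not use the arules package; the helper function contains() is hypothetical and only for illustration.

```r
# Invented transactions, for illustration only
transactions <- list(
  c("coffee", "milk"),
  c("coffee", "milk", "pastry"),
  c("coffee"),
  c("milk", "soda"),
  c("soda"),
  c("pastry", "soda"),
  c("fruit"),
  c("fruit", "soda")
)
n <- length(transactions)

# Hypothetical helper: TRUE for each transaction that contains all given items
contains <- function(items) {
  sapply(transactions, function(basket) all(items %in% basket))
}

# Support: share of transactions containing both coffee and milk
support_rule <- sum(contains(c("coffee", "milk"))) / n   # 2/8

# Confidence: of the transactions with coffee, the share that also has milk
support_lhs <- sum(contains("coffee")) / n               # 3/8
confidence <- support_rule / support_lhs                 # 2/3

# Lift: confidence divided by the overall share of transactions with milk
support_rhs <- sum(contains("milk")) / n                 # 3/8
lift <- confidence / support_rhs                         # 16/9, about 1.78

print(c(support = support_rule, confidence = confidence, lift = lift))
```

A lift above 1 means that buying the LHS item makes the RHS item more likely than it is on average.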

Time series data
install.packages("readr")
library(readr)
US_EGP = read_csv("US_EGP.csv", col_types = cols(Time = col_date(format = "%Y-%m-%d")))
View(US_EGP)
plot(US_EGP$HighPrice ~ US_EGP$Time, type = "l", col = "red")

Connectivity between R and Hive
install.packages("RJDBC",dep=TRUE)
require(RJDBC)
#Load Hive JDBC driver
hivedrv <- JDBC("org.apache.hadoop.hive.jdbc.HiveDriver",
c(list.files("/home/zzzzz/hadoop/hadoop",pattern="jar$",full.names=T),
list.files("/home/zzzzz/hadoop/hive/lib",pattern="jar$",full.names=T)))
#Connect to Hive service
hivecon <- dbConnect(hivedrv, "jdbc:hive://ip:port/default")
query = "select * from mytable LIMIT 10"
hres <- dbGetQuery(hivecon, query)
