The purpose of the Informative Essay assignment is to choose a
job or task that you know how to do and then write an
Informative Essay, at least 2 full pages and at most 3 full pages
long, that teaches the reader how to do that job or task. You
will follow the organization techniques explained in Unit 6.
Here are the details:
1. Read the Lecture Notes in Unit 6. You may also find the
information in Chapter 10.5 in our text on Process Analysis
helpful. The lecture notes are the most important reading for
this assignment. However, here is a link to that chapter, which
you may consult in addition to the lecture notes:
process-analysis/
2. Choose your topic, that is, the job or task you want to teach.
As the notes explain, this should be a job or task that you
already know how to do, and it should be something you can do
well. At this point, think about your audience (reader). Will
your reader need any knowledge or experience to do this job or
task, or will you write these instructions for a general reader
where no experience is required to perform the job?
3. Plan your outline to organize this essay. Unit 6 notes offer
advice on this organization process. Be sure to include an
introductory paragraph that has the four main points presented
in the lecture notes.
4. Write the essay. It must be at least 2 FULL pages and at
most 3 full pages long. Use the MLA formatting that you used
in the previous essays from Units 3, 4, and 5. Be sure to
include a title for your essay.
6. After writing the essay, be sure to take time to read it several
times for revision and editing. It would be helpful to have at
least one other person proofread it as well before submitting the
# comments start with #
# to quit q()
# two steps to install any library
Science & Big Data Analy (ITS-836-51)/RStudio/Week2")
x <- 3 # x is a vector of length 1
v1 <- c(2,4,6,8,10)
v <- c(1:10) # creates a vector of 10 elements numbered 1 through 10
# More complicated data
# Import test data
test1<-read.csv("CVEs.csv", sep=",")
test2<-read.table("CVEs.csv", sep=",")
write.csv(test2, file="out.csv")
# Write CSV in R
write.table(test1, file = "out1.csv",row.names=TRUE,
na="",col.names=TRUE, sep=",")
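# Keep the first and last rows of the imported data (names chosen so that
# the head() and tail() functions are not shadowed)
first_rows <- head(test1)
last_rows <- tail(test1)
cor(test1$X, test1$index)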
# Import test data
# A 5-number summary is a set of 5 descriptive statistics for summarizing a
# continuous univariate data set.
# It consists of the data set's minimum, 1st quartile, median, 3rd quartile,
# and maximum.
# Find the set, L, of data below the median. The 1st quartile is the median of L.
# Find the set, U, of data above the median. The 3rd quartile is the median of U.
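The rule described above can be sketched directly in R. This is a minimal illustration, not the course's reference solution: the function name `five_number_summary` and the sample values are hypothetical, and it assumes that for an odd number of values the median itself is left out of both halves (R's built-in `fivenum()` uses a slightly different convention, so its quartiles can differ).

```r
# Five-number summary using the "median of each half" rule.
# Assumption: with an odd number of values, the median is excluded from
# both the lower set L and the upper set U.
five_number_summary <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  half <- floor(n / 2)
  lower <- x[seq_len(half)]        # the set L, below the median
  upper <- x[(n - half + 1):n]     # the set U, above the median
  c(min = x[1], q1 = median(lower), median = median(x),
    q3 = median(upper), max = x[n])
}

scores <- c(13, 1, 9, 5, 3, 11, 7)  # hypothetical data
five_number_summary(scores)
```

For the seven sample values, L is 1, 3, 5 and U is 9, 11, 13, so the quartiles are the middle values of those two sets.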
##-- now some "magic" to do the 4 regressions in a loop:
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  ## or   ff[[2]] <- as.name(paste0("y", i))
  ##      ff[[3]] <- as.name(paste0("x", i))
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
}
## See how close they are (numerically!)
sapply(mods, coef)
lapply(mods, function(fm) coef(summary(fm)))
## Now, do what you should have done in the first place:
op <- par(mfrow = c(2, 2), mar = 0.1+c(4,4,1,1), oma = c(0, 0, 2, 0))
for(i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y","x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange", cex = 1.2,
       xlim = c(3, 19), ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
par(op)  # restore the previous graphics settings
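To see why the four Anscombe data sets are worth plotting, it can help to tabulate a few summary statistics side by side. The sketch below uses only the `anscombe` data frame that ships with R; the object name `stats` and the choice of statistics are illustrative assumptions.

```r
# Summary statistics for the four Anscombe x/y pairs
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), var_x = var(x), cor_xy = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)
round(stats, 3)
```

The columns come out nearly identical even though the scatterplots look very different, which is the reason for plotting before trusting the regression output.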
# top plot
# bottom plot as log10 is actually
# easier to read, but this plot is in natural log
hist(data$sales_total, breaks=100, main="Sales total",
xlab="sales", col="gray")
# draw a line for the median
abline(v = median(data$sales_total), col = "magenta", lwd = 4)
# use rug() function to see the actual datapoints
rug(data$sales_total)
# Boxplots can be created for individual variables or for variables by group.
# The format is boxplot(x, data=), where x is a formula and data= denotes the
# data frame providing the data.
# (data= only applies to the formula form, so it is omitted for a plain vector.)
boxplot(data$sales_total, main="Dis by Sales",
        xlab="Sales", ylab="Total")
# Boxplot of MPG by Car Cylinders, using one of R's built-in datasets (mtcars)
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
        xlab="Number of Cylinders", ylab="Miles Per Gallon")
# In the boxplot above, we might want to draw a horizontal line at 12, where
# the national standard is.
abline(h = 12)
boxplot(data$sales_total, main="Total sales Bplot",
        xlab="Sales", ylab="Total")
# Dot chart of a single numeric vector
dotchart(mtcars$mpg, labels = row.names(mtcars),
cex = 0.6, xlab = "mpg")
# Simple Scatterplot
plot(mtcars$wt, mtcars$mpg, main="Scatterplot Example",
     xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)
# The R function abline() can be used to add vertical, horizontal, or
# regression lines to a graph.
# Plot the response (sales_total) against the predictor (num_of_orders) so
# that the fit lines match the axes.
plot(data$num_of_orders, data$sales_total)
# Add fit lines
abline(lm(sales_total ~ num_of_orders, data = data), col="red") # regression line (y~x)
lines(lowess(data$num_of_orders, data$sales_total), col="blue") # lowess line (x,y)
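The three uses of `abline()` mentioned above (vertical, horizontal, and regression lines) can be tried on the built-in `mtcars` data. This is a small sketch; the object name `fit_wt` and the choice of drawing the lines at the variable means are assumptions made for illustration.

```r
# Scatterplot of fuel economy against weight
plot(mtcars$wt, mtcars$mpg, xlab = "Car Weight", ylab = "Miles Per Gallon")

fit_wt <- lm(mpg ~ wt, data = mtcars)
abline(fit_wt, col = "red")                 # regression line from a fitted model
abline(h = mean(mtcars$mpg), lty = 2)       # horizontal line at the mean mpg
abline(v = mean(mtcars$wt), lty = 3)        # vertical line at the mean weight

coef(fit_wt)                                # intercept and slope that were drawn
```

Passing a fitted `lm` object lets `abline()` read the intercept and slope itself, while `h =` and `v =` take plain numbers.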
# Basic Scatterplot Matrix
# Scatterplot Matrices from the car Package
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
# Import test data
data1 <- na.omit(data)
columns <- data[1, ]
# As we don't want the clustering algorithm to depend on an arbitrary
# variable unit, we start by scaling the data using the R function scale()
data1 <- scale(data1)
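A quick way to see what scaling does is to check the column means and standard deviations before and after. The sketch below uses three arbitrarily chosen columns of the built-in `mtcars` data; the object names `raw` and `scaled` are illustrative.

```r
raw <- mtcars[, c("mpg", "hp", "wt")]   # columns measured in very different units
sapply(raw, sd)                         # the spreads differ by orders of magnitude

scaled <- scale(raw)                    # subtract column means, divide by column sds
round(colMeans(scaled), 10)             # every column mean is now 0
apply(scaled, 2, sd)                    # every column sd is now 1
```

After scaling, no single variable dominates the Euclidean distances that k-means relies on.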
distance <- get_dist(data1)
# plot cluster library
# K-Means Cluster Analysis
# simplest example, just the dataset and number of clusters
fit <- kmeans(data1, 5) # 5 cluster solution
# get cluster means
# append cluster assignment
mydata <- data.frame(data1, fit$cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
fit <- kmeans(data1, 8) # 8 cluster solution
# get cluster means
# append cluster assignment
mydata <- data.frame(data1, fit$cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)
# Determine number of clusters
wss <- (nrow(data1)-1)*sum(apply(data1,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(data1, centers = i)$withinss)
# A plot of the within groups sum of squares by number of clusters extracted
# can help determine the appropriate number of clusters.
# The analyst looks for a bend in the plot similar to a scree test in factor
# analysis.
# We want the total within-cluster variation to be low, but it keeps
# shrinking as clusters are added, so look for the bend rather than the minimum.
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
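The elbow procedure can be run end to end on a built-in data set. The sketch below is one reasonable variant, not the script's exact code: it uses the `iris` measurements, a fixed seed, and `nstart = 10` with a raised `iter.max`, all of which are assumptions chosen to make the run repeatable.

```r
set.seed(42)                          # k-means starts from random centers
d <- scale(iris[, 1:4])               # four numeric columns, standardized

# Total within-cluster sum of squares for k = 1, ..., 15
wss <- numeric(15)
wss[1] <- (nrow(d) - 1) * sum(apply(d, 2, var))   # k = 1: a single cluster
for (k in 2:15) {
  wss[k] <- kmeans(d, centers = k, nstart = 10, iter.max = 50)$tot.withinss
}

plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
```

Because the data are standardized, the k = 1 value is simply (rows - 1) times the number of columns, which gives a convenient sanity check on the first point of the curve.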
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
library(fpc) # provides plotcluster()
plotcluster(mydata, fit$cluster)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid =
"white", high = "#FC4E07"))
# try with 25 attempts, 2 clusters
km <- kmeans(data1, centers = 2, nstart = 25)
# The output of kmeans is a list with several bits of information. The most
# important are:
# cluster: A vector of integers (from 1:k) indicating the cluster to which
#   each point is allocated.
# centers: A matrix of cluster centers.
# totss: The total sum of squares.
# withinss: Vector of within-cluster sum of squares, one component per cluster.
# tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
# betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
# size: The number of points in each cluster.
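The components listed above can be inspected on a small fit. This sketch assumes the built-in `iris` data, three clusters, and a fixed seed; the object names `d_demo` and `km_demo` are hypothetical and chosen so they do not overwrite anything in the script.

```r
set.seed(1)
d_demo <- scale(iris[, 1:4])
km_demo <- kmeans(d_demo, centers = 3, nstart = 25)

table(km_demo$cluster)                   # cluster: one label per row
km_demo$size                             # size: the same counts, per cluster
km_demo$centers                          # centers: one row per cluster
sum(km_demo$withinss)                    # matches tot.withinss
km_demo$totss - km_demo$tot.withinss     # matches betweenss
```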
# print the clusters
# Plot clusters
fviz_cluster(km, data = data1)
(cl <- kmeans(data1, 8))
plot(data1, col = cl$cluster)
points(cl$centers, col = 1:8, pch = 8, cex = 2)
# sum of squares
ss <- function(x) sum(scale(x, scale = FALSE)^2)
## cluster centers "fitted" to each obs.:
fitted.data1 <- fitted(cl); head(fitted.data1)
resid.data1 <- data1 - fitted(cl)
## Equalities : ----------------------------------
cbind(cl[c("betweenss", "tot.withinss", "totss")], # the same two columns
      c(ss(fitted.data1), ss(resid.data1), ss(data1)))
stopifnot(all.equal(cl$ totss, ss(data1)),
          all.equal(cl$ tot.withinss, ss(resid.data1)),
          ## these three are the same:
          all.equal(cl$ betweenss, ss(fitted.data1)),
          all.equal(cl$ betweenss, cl$totss - cl$tot.withinss),
          ## and hence also
          all.equal(ss(data1), ss(fitted.data1) + ss(resid.data1))
          )
kmeans(data1,1)$withinss # trivial one-cluster, (its W.SS == ss(data1))
## random starts do help here with too many clusters
## (and are often recommended anyway!):
(cl <- kmeans(data1, 5, nstart = 25))
plot(data1, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8)
if(!require(arules)) install.packages("arules")
if(!require(arulesViz)) install.packages("arulesViz")
if(!require(dplyr)) install.packages("dplyr")
if(!require(lubridate)) install.packages("lubridate")
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(knitr)) install.packages("knitr")
if(!require(RColorBrewer)) install.packages("RColorBrewer")
Science & Big Data Analy (ITS-836-51)/RStudio/Week6")
# First example from :
#Other refs: https://rstudio-pubs-
# First, let's use the Groceries dataset that comes bundled with the
# arules package.
data("Groceries")
rules <- apriori(Groceries,parameter=list(support=0.002,
confidence = 0.5))
inspect(head(sort(rules, by = "lift")))
plot(rules, method = "grouped")
plot(rules,method = "scatterplot")
plot(rules,method = "graph")
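The `support`, `confidence`, and `lift` values that `apriori()` reports can be worked out by hand on a tiny example. The five baskets and the helper `support()` below are invented for illustration and use base R only; they are not part of the arules API.

```r
# Hypothetical baskets, invented for illustration
baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)

# Fraction of baskets that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Rule {bread} => {milk}
supp_rule  <- support(c("bread", "milk"))    # both items appear together
confidence <- supp_rule / support("bread")   # share of bread baskets with milk
lift       <- confidence / support("milk")   # confidence relative to milk's base rate

c(support = supp_rule, confidence = confidence, lift = lift)
```

A lift below 1, as in this toy case, means milk is slightly less common among bread baskets than among all baskets; sorting rules by lift, as the script does, puts the strongest positive associations first.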
# Import test data
df <- read.csv("OnlineRetailSmall.csv")
df <- df[complete.cases(df), ] # Drop missing values
# Change Description and Country columns to factors
# Factors are data objects that categorize the data and store it as levels.
# Assign the result back to df; otherwise the converted columns are discarded.
df <- df %>% mutate(Description = as.factor(Description),
                    Country = as.factor(Country))
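A short example may make the idea of levels concrete. The country values below are hypothetical and stand in for the `Country` column.

```r
country <- c("United Kingdom", "France", "United Kingdom", "Germany")  # hypothetical values
f <- as.factor(country)

levels(f)        # the distinct categories, sorted alphabetically
as.integer(f)    # each value is stored as an index into the levels
table(f)         # counts per level
```

Each string is stored once as a level, and the column itself holds small integer codes, which is why factors suit categorical columns with many repeated values.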
# Change InvoiceDate to Date datatype
df$Date <- as.Date(df$InvoiceDate)
df$InvoiceDate <- as.Date(df$InvoiceDate)
# Extract time from the InvoiceDate column
# Convert InvoiceNo into numeric
InvoiceNo <- as.numeric(as.character(df$InvoiceNo))
# Add new columns to original dataframe
cbind(df, TransTime, InvoiceNo)
# Group by invoice number and combine order item strings with a comma
# (ddply() comes from the plyr package)
transactionData <- plyr::ddply(df,c("InvoiceNo","Date"),
                               function(df1)paste(df1$Description,collapse = ","))
transactionData$InvoiceNo <- NULL # Don't need these
transactionData$Date <- NULL
colnames(transactionData) <- c("items")
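The grouping step can also be sketched in base R on a toy table. This is an alternative shown for illustration, using `aggregate()` rather than the `ddply()` call in the script; the data frame `orders` and its values are invented.

```r
# Hypothetical invoice lines, invented for illustration
orders <- data.frame(
  InvoiceNo   = c(1, 1, 2, 3, 3, 3),
  Description = c("MUG", "PLATE", "MUG", "BOWL", "MUG", "SPOON"),
  stringsAsFactors = FALSE
)

# One row per invoice, with that invoice's items joined by commas
per_invoice <- aggregate(Description ~ InvoiceNo, data = orders,
                         FUN = paste, collapse = ",")
per_invoice$Description
```

Each resulting string is one basket, which is the comma-separated layout that `read.transactions(..., format = 'basket', sep = ',')` expects.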
v", quote = FALSE, row.names = TRUE)
# MBA analysis
# From package arules
tr <- read.transactions('market_basket_transactionsSmall.csv',
format = 'basket', sep=',')
# plot the frequency of items
itemFrequencyPlot(tr, type="absolute", col=brewer.pal(8,'Pastel2'),
                  main="Absolute Item Frequency Plot")
itemFrequencyPlot(tr, type="relative",
                  main='Relative Item Frequency Plot',
                  ylab="Item Frequency (Relative)")
# Generate the a priori rules
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8))
inspect(association.rules[1:10]) # First 10 association rules
# Select rules which are subsets of larger rules -> Remove rows where the
# sums of the subsets are > 1
subset.rules <- which(colSums(is.subset(association.rules,
                                        association.rules)) > 1) # get subset rules in vector
# What did customers buy before buying "METAL"
metal.association.rules <- apriori(tr, parameter =
  list(supp=0.001, conf=0.8),appearance = list(default="lhs", rhs="METAL"))
# What did customers buy after buying "METAL"
metal.association.rules2 <- apriori(tr, parameter =
  list(supp=0.001, conf=0.8),appearance = list(lhs="METAL", default="rhs"))
# Plotting
# Filter rules with confidence greater than 0.4 or 40%
subRules <- association.rules[quality(association.rules)$confidence > 0.4]
#Plot SubRules
# Top 10 rules viz
top10subRules <- head(subRules, n = 10, by = "confidence")
plot(top10subRules, method = "graph", engine = "htmlwidget")
# Filter top 20 rules with highest lift
subRules2<-head(subRules, n=20, by="lift")
# Parallel coordinates plot - visualize which products along with which items
# cause what kind of sales.
# Closer arrows are bought together
plot(subRules2, method="paracoord")
ITS-836 Course Paper, a total of 60 points (60% of the total
course points)
Izzat Alsmadi
Guidelines/Rubrics to deliver Course Paper
Three deliverables
Deliverable 1, 10 points
· The deliverable should contain the following components:
(1) Overall Goals/Research Hypothesis (20 %)
1-3 research questions to navigate/direct all your project.
· You may delay this section until (1) you study all previous
work and (2) you do some analysis and understand the
(2) (Previous/Related Contributions) (40 %)
As most of the selected projects use public datasets, no doubt
there are different attempts/projects to analyze those datasets.
30 % of this deliverable is in your overall assessment of
previous data analysis efforts. This effort should include:
· Evaluating existing source code (e.g. in Kernels and
discussion sections) or any other reference. Make sure you try
that code and show its results.
· In addition to the code, summarize most relevant literature or
efforts to analyze the same dataset you have picked.
· For the few who picked their own datasets, you are still
expected to do your literature survey in this section on what is
most relevant to your data/idea/area and to summarize the most
relevant contributions.
(3) A comparison study (40 %)
Compare the results of your own work/project with results from
previous or other contributions (a comparison of data and
analysis, not a literature review).
The difference between section 3 and section 2 is that section 2
focuses on code/data analysis found in sources such as Kaggle,
GitHub, etc., while section 3 focuses on research papers that did
not necessarily study the same dataset but share the same focus area.

