DAUR Report
Submitted by: Akanksha Gohil
1. Introduction to Data Analytics
Data Analytics refers to a set of techniques used to analyze data in order to enhance
productivity and business profitability.
In the Data Analytics process, data is extracted from different sources, cleaned, and then
categorized so that various behavioural patterns can be analyzed.
The techniques and tools used for data analysis vary from organization to organization and from
individual to individual. Data analytics is a very broad concept that includes many diverse types
of data analysis techniques. Almost any type of information can be fed into data analytics
techniques in order to gain meaningful insights that can be used to achieve the required result.
It is a process of inspecting, cleansing, transforming and modeling data with the aim of:
i) discovering required useful information,
ii) informing conclusions, and
iii) supporting decision-making.
Importance of Data Analytics:-
As a large amount of data gets generated every day, the need to dig out useful and meaningful
insights is a must for any business enterprise. Data Analytics plays a key role in improving and
moulding business decisions.
The following are four main factors that signify the need for Data Analytics in our lives:
● Gather Hidden Insights – Hidden insights and information are first gathered from the
data and then analyzed with respect to the business requirements.
● Generate Reports – Reports are developed from the data and passed on to the
respective teams to take further actions that drive business growth.
● Perform Market Analysis – Market analysis can be performed to understand the
strengths and weaknesses of competitors.
● Improve Business Requirements – Analysis of data allows improving the business,
from customer requirements to customer experience.
Four Types of Data Analytics:-
1. Descriptive analytics tells us what has happened over a particular time period.
2. Diagnostic analytics focuses more on why something happened. This includes
more diverse data sets and a little hypothesizing.
3. Predictive analytics tells us what is likely going to happen in the near future.
4. Prescriptive analytics suggests a course of action.
Fig: Types of Data Analytics
Top Tools used in Data Analytics
With the increase in demand for Data Analysis, many tools have come up with various
functionalities and expertise for the analysis of data. Some of them are open-source while some
are not. Following are the most widely used top tools in the field of data analytics:-
★ R programming
★ Python
★ Tableau Public
★ QlikView
★ SAS
★ Microsoft Excel
★ RapidMiner
★ KNIME
★ OpenRefine
★ Apache Spark
Out of these, R, Python and SAS are the most widely used in the market. The following image
shows the distribution of usage of these technologies across the industry:-
2. Classification:-
Classification techniques are used to predict a categorical class label, such as income level: low,
medium or high.
Let us demonstrate the concept of classification with an example: classifying gender using
hair length. Here gender is our ‘target class’, and since it is classified on the basis of hair
length, hair length is a ‘feature parameter’. We can now set up a condition to serve as the
reference point for classification. Suppose the boundary value for hair length is 15.0 cm;
then we can say that if the hair length is less than 15.0 cm, the predicted gender is male, or
else female.
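As a toy sketch of this threshold rule in R (the hair-length values are hypothetical; the 15.0 cm cut-off is the one assumed above):

```r
# Toy threshold classifier: predict gender from hair length (hypothetical data)
hair_length <- c(5, 12, 20, 30, 8)
ifelse(hair_length < 15, "male", "female")
# [1] "male"   "male"   "female" "female" "male"
```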
Applications of Classification Algorithms
● Classification of spam emails.
● Predicting whether a bank customer will repay a loan or not.
● Identification of cancer tumor cells.
● Sentiment analysis.
● Classification of drugs.
● Detection of facial keypoints.
2.1 Classification techniques:-
● Decision Trees – These are organized in the form of sets of questions and
answers in the tree structure.
● Naive Bayes Classifiers – It is a probabilistic machine learning model that is used
for classification.
● Support Vector Machines – It is a non-probabilistic binary linear
classification model used to classify a case into one of the two categories.
(i) Decision Tree:-
It is a kind of supervised-learning algorithm. Here, we split the population into two or
more homogeneous data sets.
The Decision Tree is a very powerful non-linear classification tool. A Decision Tree makes use
of a tree-like structure to generate relationships among various features or parameters and
potential outcomes. It makes use of the branching decisions as its core structure.
Following is the structure of a decision tree: -
Fig: Structure of a decision tree
Here, the root node represents the entire population or sample set. It then gets divided into
two or more homogeneous sets of data. When a sub-node splits into further sub-nodes, it is
called a decision node. A leaf/terminal node does not split any further. The process of
removing sub-nodes of a decision node is called pruning. A branch/sub-tree is a subsection
of the entire tree.
Two types of Decision Tree
1. Categorical (classification) Variable Decision Tree: A Decision Tree that has a
categorical target variable.
2. Continuous (regression) Variable Decision Tree: A Decision Tree that has a
continuous target variable.
Advantages of Decision Tree in R
● Easy to Understand: It does not need any statistical knowledge to read and
interpret them.
● Less data cleaning required: It requires less data cleaning compared to some
other modeling techniques.
● Data type is not a constraint: It can handle both numerical and categorical
variables.
● Simple to understand and interpret.
● Requires little data preparation.
● It works with both numerical and categorical data.
● Handles non-linearity.
● Possible to confirm a model using statistical tests.
Disadvantages of Decision Tree in R
● Overfitting: It is one of the most practical difficulties for Decision Tree models. By
setting constraints on model parameters and pruning, we can solve this problem
in R.
● Not fit for continuous variables: When using continuous numerical variables, the
Decision Tree loses information as it categorizes them into different bins.
● Learning a globally optimal tree is NP-hard, so algorithms rely on greedy search.
● Complex “if-then” relationships between features inflate the tree size. Examples –
the XOR gate, the multiplexer.
(ii) Naïve Bayes Classification:-
We use Bayes’ theorem to make the prediction. It is based on prior knowledge and
current evidence.
Bayes’ theorem is expressed by the following equation:

P(A|B) = P(B|A) × P(A) / P(B)

where P(A) and P(B) are the probabilities of events A and B without regard to each other, P(A|B)
is the probability of A conditional on B, and P(B|A) is the probability of B conditional on A.
(iii) Support Vector Machine (SVM):-
We use it to find the optimal hyperplane (a line in 2D, a plane in 3D, and a hyperplane in more
than 3 dimensions) that maximizes the margin between the two classes. Support vectors are
the observations that support the hyperplane on either side.
SVM solves a linear optimization problem to find the hyperplane with the largest margin. We
use the “kernel trick” to separate instances that are otherwise inseparable.
Advantages of SVM in R
● If we use the kernel trick on non-linearly separable data, it performs very well.
● SVM works well in high dimensional space and in case of text or image
classification.
● It does not suffer a multicollinearity problem.
Disadvantages of SVM in R
● It takes more time on large-sized data sets.
● SVM does not return probability estimates.
● In the case of linearly separable data, this is almost like logistic regression.
Applications of Classification in R
● An emergency room in a hospital measures 17 variables for newly admitted
patients, such as blood pressure, age and many more. A careful decision then has
to be made about whether the patient should be admitted to the ICU. Owing to the
high cost of the ICU, patients who are likely to survive more than a month are given
high priority. The problem is to predict high-risk patients and to discriminate them
from low-risk patients.
● A credit company receives hundreds of thousands of applications for new cards.
The application contains information about several different attributes.
Moreover, the problem is to categorize those who have good credit, bad credit or
fall into a grey area.
● Astronomers have been cataloguing distant objects in the sky using long-exposure
CCD images. Each object needs to be labelled as a star, galaxy, etc. The data is
noisy and the images are very faint, hence the cataloguing can take decades to
complete.
2.2 Performance Measures:-
i) Confusion matrix
The R function table() can be used to produce a confusion matrix in order to determine how
many observations were correctly or incorrectly classified. It compares the observed and the
predicted outcome values and shows the number of correct and incorrect predictions
categorized by type of outcome.
Fig : Confusion Matrix
A. True positives: these are cases in which we predicted the individuals would
be diabetes-positive and they were.
B. True negatives: We predicted diabetes-negative, and the individuals were diabetes-
negative.
C. False positives: We predicted diabetes-positive, but the individuals didn’t actually
have diabetes. (Also known as a Type I error.)
D. False negatives: We predicted diabetes-negative, but they did have diabetes.
(Also known as a Type II error.)
E. Precision: It is the proportion of true positives among all the individuals that have
been predicted to be diabetes-positive by the model. This represents the
accuracy of a predicted positive outcome. Precision =
TruePositives/(TruePositives + FalsePositives).
F. Sensitivity (or Recall): It is the True Positive Rate (TPR) or the proportion of
identified positives among the diabetes-positive population (class = 1). Sensitivity
= TruePositives/(TruePositives + FalseNegatives).
G. Specificity: It measures the True Negative Rate (TNR), which is the proportion of
identified negatives among the diabetes-negative population (class = 0). Specificity
= TrueNegatives/(TrueNegatives + FalsePositives).
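As a minimal sketch of these measures in R (the observed and predicted vectors are hypothetical, with 1 = diabetes-positive):

```r
# Confusion matrix with table(), then precision, sensitivity and specificity
observed  <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)   # hypothetical ground truth
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)   # hypothetical model output

cm <- table(observed, predicted)                # rows = observed, cols = predicted
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]

precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)                   # recall / true positive rate
specificity <- TN / (TN + FP)                   # true negative rate
c(precision = precision, sensitivity = sensitivity, specificity = specificity)
```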
2.3 Case:-
In this assignment, for classification techniques, we take the case of the “Iris” dataset,
which is a built-in dataset of R; for this we will need to import the rpart and rpart.plot libraries.
Following is the structure of the iris dataset:-
i) DECISION TREE:- After performing the decision tree algorithm, we got the following output
for the categorical variable ‘Species’ with respect to all the continuous variables:-
Fig : Decision Tree
Here we split our decisions on the basis of Petal Length and Petal Width. We first decide
whether the petal length is more than or less than 2.5; if the decision is ‘no’, we further check
whether the petal width is smaller or greater than 1.8 to get a sufficiently homogeneous
partition.
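A minimal sketch of how such a tree can be produced with rpart (the exact split values depend on rpart's defaults):

```r
# Decision tree on the built-in iris data
library(rpart)
library(rpart.plot)

fit <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(fit)                                       # tree splits on Petal.Length / Petal.Width
predict(fit, iris[c(1, 51, 101), ], type = "class")   # one row from each species
```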
ii) Naive Bayes Classification:- We will perform Naive Bayes Classification on the same Iris
dataset. For this we have imported the ‘e1071’ library. Here, we build a model using training
and testing datasets. We have renamed the columns as "sepal_length", "sepal_width",
"petal_length", "petal_width", "class".
After performing the algorithm for naive bayes, we get the following output:-
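A minimal sketch of the workflow described above (the 70/30 split and the random seed are assumptions for illustration):

```r
# Naive Bayes on iris with renamed columns (e1071)
library(e1071)

df <- iris
names(df) <- c("sepal_length", "sepal_width",
               "petal_length", "petal_width", "class")

set.seed(42)                                   # hypothetical seed
idx   <- sample(nrow(df), 0.7 * nrow(df))      # assumed 70/30 train-test split
train <- df[idx, ]
test  <- df[-idx, ]

model <- naiveBayes(class ~ ., data = train)
pred  <- predict(model, test)
table(predicted = pred, observed = test$class) # confusion matrix on the test set
```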
iii) Support Vector Machine (SVM)
With the use of Support Vector Machine, we try to achieve the following two
classification goals simultaneously:
1. Maximize the margin
2. Correctly classify the data points
We applied the SVM technique on the Iris dataset and produced a model; after carrying out
all the required steps on it, we get the following result:-
                 precision  recall  f1-score  support
Iris-setosa           1.00    1.00      1.00       17
Iris-versicolor       1.00    1.00      1.00       16
Iris-virginica        1.00    1.00      1.00       12
avg / total           1.00    1.00      1.00       45
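A minimal sketch of an SVM fit in R that could produce such a result (the 105/45 split, the seed, and the linear kernel are assumptions; the report table above resembles a per-class precision/recall summary on a 45-row test set):

```r
# SVM classification on iris (e1071)
library(e1071)

set.seed(42)                                     # hypothetical seed
idx   <- sample(nrow(iris), 105)                 # assumed 105 train / 45 test rows
train <- iris[idx, ]
test  <- iris[-idx, ]

model <- svm(Species ~ ., data = train, kernel = "linear")
pred  <- predict(model, test)
table(predicted = pred, observed = test$Species)
```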
3. Clustering:-
3.1 Clustering Technique
Clustering is a method of data breakdown that partitions the data into several groups based on
their similarity.
We group the data through a statistical procedure. The smaller groups formed from the bigger
data are known as clusters. These clusters exhibit the following properties:
● They are discovered while carrying out the operation, and their number is not
known in advance.
● Clusters are accumulations of alike objects that share common characteristics.
Clustering is the most widespread and popular method of Data Analysis and Data Mining. It is
used in cases where the underlying input data has a colossal volume and we are tasked with
finding similar subsets that can be analyzed in several ways.
For example, a marketing company can categorize its customers based on their economic
background, age and several other factors to sell its products in a better way.
In different fields, R clustering has different names, such as:
● Marketing – In marketing, the terms ‘segmentation’ or ‘typological analysis’ are used
for clustering.
● Medicine – Clustering in medicine is known as nosology.
● Biology – It is referred to as numerical taxonomy in the field of Biology.
Defining the correct criteria for clustering and using efficient algorithms matters because the
number of possible partitions grows explosively:

B(n) (the number of partitions of n objects) > exp(n)

You can gauge the complexity of clustering by the number of possible combinations of
objects; the complexity of the cluster analysis depends on this number.
The basis for joining or separating objects is the distance between them. These distances are
dissimilarities (when objects are far from each other) or similarity (when objects are close by).
Methods for Measuring Distance between Objects
For calculating the distance between the objects in K-means, we make use of the following
types of methods:
● Euclidean Distance – It is the most widely used method for measuring the distance
between objects that are present in a multidimensional space. In general, for an
n-dimensional space, the distance between points x and y is

d(x, y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2)
● Squared Euclidean Distance – This is obtained by squaring the Euclidean Distance.
The objects that are present at further distances are assigned greater weights.
● City-Block (Manhattan) Distance – The sum of the absolute differences between two
points across all dimensions is calculated using this method. It is similar to
Euclidean Distance in many cases, but it reduces the effect of extreme objects,
since the coordinates are not squared.
The total inertia is the weighted sum of the squared distances of the points from the overall
centre of gravity. The sum of squares decomposes over the clusters as follows:

Total sum of squares = Between-cluster sum of squares + Within-cluster sum of squares

This decomposition is known as Huygens’ formula.
The between-cluster sum of squares is calculated by taking, for each cluster, the squared
difference between its centre of gravity and the overall centre of gravity, and adding these up.
The within-cluster sum of squares is calculated by taking the squared difference of each point
from the centre of gravity of its own cluster and summing these within each cluster. The
smaller this quantity, the more homogeneous the clusters.
R-squared (RSQ) is the proportion of the total sum of squares accounted for by the clusters
(the between-cluster share). The closer this proportion is to 1, the better the clustering.
However, the aim is not to maximize it at any cost, as that would lead to a greater number of
clusters. Therefore, we want an R-squared that is close to 1 but does not create too many
clusters; as we move from k to k+1 clusters, there should be a significant increase in the
value of R-squared.
Some of the properties of efficient clustering are:
● Detecting structures that are present in the data.
● Determining optimal clusters.
● Giving out readable differentiated clusters.
● Ensuring stability of cluster even with the minor changes in data.
● Efficient processing of the large volume of data.
● Handling different data types of variables.
Note: In the case of good clustering, the inter-class inertia (IR) is large, or equivalently the
intra-class inertia (IA) is small, when calculating the sums of squares.
Clustering is restarted only after we have performed data interpretation, transformation and
the exclusion of variables. An excluded variable is simply not taken into account during the
clustering operation; it becomes an illustrative variable.
Agglomerative Hierarchical Clustering
In the Agglomerative Hierarchical Clustering (AHC), sequences of nested partitions of n
clusters are produced. The nested partitions have an ascending order of increasing
heterogeneity. We use AHC if the distance is either in an individual or a variable space. The
distance between two objects or clusters must be defined while carrying out categorization.
The algorithm for AHC is as follows:
● We start with each observation as its own initial cluster.
● Next, we assess the distance between the clusters.
● We then merge the two most proximate clusters and replace them with a single
cluster.
● We repeat the previous two steps until only a single cluster remains.
AHC generates a type of tree called a dendrogram. By cutting this dendrogram at a chosen
height, we obtain the clusters.
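A minimal sketch of AHC in R, assuming the built-in iris measurements as example data:

```r
# Agglomerative hierarchical clustering: distances -> hclust -> dendrogram -> clusters
d  <- dist(iris[, 1:4], method = "euclidean")  # pairwise distances
hc <- hclust(d, method = "complete")           # complete (maximum-distance) linkage
plot(hc)                                       # dendrogram
clusters <- cutree(hc, k = 3)                  # cut the dendrogram into 3 clusters
table(clusters, iris$Species)
```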
Hierarchical Clustering is most widely used in identifying patterns in digital images, prediction
of stock prices, text mining, etc. It is also used for researching protein sequence classification.
1. Main Distances
● Maximum distance (complete linkage) – Here the distance between two clusters is
the greatest distance between their observations; it tends to produce clusters of
equal diameter.
● Minimum distance (single linkage) – Here the distance between two clusters is the
minimum distance between their observations; this is the nearest-neighbour AHC
method.
In good clustering, the minimum distance between points of different clusters should be
greater than the maximum distance between points within the same cluster.
2. Density Estimation
In density estimation, we detect the structure of the various complex clusters. The three
methods for estimating density in clustering are as follows:
● The k-nearest-neighbors method – The density at a point x is the number k of
observations nearest to x, divided by the volume of the sphere that contains them.
● The Uniform Kernel Method – In this, the radius is fixed but the number
of neighbors is not.
● The Wong Hybrid Method – We use this in the preliminary analysis.
Clustering by Similarity Aggregation
Clustering by Similarity Aggregation is also known as relational clustering, or the Condorcet
method.
With this method, we compare all the individual objects in pairs to build the global
clustering. The principle of equivalence relation exhibits three properties – reflexivity,
symmetry, and transitivity:
● Reflexivity => Mii = 1
● Symmetry => Mij = Mji
● Transitivity => Mij + Mjk − Mik <= 1
This type of clustering algorithm makes use of an intuitive approach. For a pair of individuals
(A, B), m(A, B) counts the attributes on which A and B take the same value, whereas d(A, B)
counts the attributes on which they take different values.
The two individuals A and B follow the Condorcet Criterion as follows:
c(A, B) = m(A, B)-d(A, B)
For an individual A and cluster S, the Condorcet criterion is as follows:
c(A,S) = Σic(A,Bi)
where the summation runs over all Bi ∈ S.
Under these conditions, we start constructing clusters by placing each individual A in the
cluster S for which c(A, S) is largest, with a minimum value of 0.
In the next step, we calculate the global Condorcet criterion by summing c(A, SA) over all
individuals A, where SA is the cluster that contains A.
K-Means Clustering in R
One of the most popular partitioning algorithms in clustering is k-means cluster analysis in R.
It is an unsupervised learning algorithm that tries to cluster data based on similarity. We
specify the number of clusters we want the data grouped into; the algorithm then assigns
each observation to a cluster and also finds the centroid of each cluster.
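A minimal sketch with R's built-in kmeans(), assuming k = 3 on the iris measurements:

```r
# k-means: assigns each observation to a cluster and finds the centroids
set.seed(42)                         # k-means uses random starting centroids
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers                           # centroid of each cluster
table(km$cluster, iris$Species)      # cluster assignment per observation
```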
3.2 Performance Measures
Contrary to supervised learning where we have the ground truth to evaluate the model’s
performance, clustering analysis doesn’t have a solid evaluation metric that we can use to
evaluate the outcome of different clustering algorithms. Moreover, since kmeans requires k as
input and doesn’t learn it from data, there is no right answer in terms of the number of clusters
that we should have in any problem. Sometimes domain knowledge and intuition may help but
usually, that is not the case. In the cluster-predict methodology, we can evaluate how well the
models are performing based on different K clusters since clusters are used in the downstream
modeling.
Here we cover two metrics that may give us some intuition about k:
· Elbow method
· Silhouette analysis
Elbow Method
The elbow method gives us an idea of what a good number of clusters k would be, based on
the sum of squared distances (SSE) between data points and their assigned clusters’
centroids. We pick k at the spot where the SSE starts to flatten out, forming an elbow. We’ll
use the geyser dataset, evaluate the SSE for different values of k, and see where the curve
forms an elbow and flattens out.
The graph above shows that k = 2 is not a bad choice. Sometimes it’s still hard to figure out a
good number of clusters, because the curve is monotonically decreasing and may not show
any elbow or an obvious point where it starts flattening out.
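A minimal sketch of the elbow method, using the built-in faithful (Old Faithful geyser) data as a stand-in for the geyser dataset mentioned above:

```r
# SSE (total within-cluster sum of squares) for k = 1..10
data <- scale(faithful)              # standardize eruption and waiting times
wss  <- sapply(1:10, function(k)
  kmeans(data, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SSE")
# look for the k where the curve bends into an elbow (here around k = 2)
```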
Silhouette Analysis
Silhouette analysis can be used to determine the degree of separation between clusters. For
each sample:
· Compute the average distance from all data points in the same cluster (ai).
· Compute the average distance from all data points in the closest other cluster (bi).
· Compute the coefficient:

si = (bi − ai) / max(ai, bi)

The coefficient can take values in the interval [-1, 1]:
· If it is 0 –> the sample is very close to the neighboring clusters.
· If it is 1 –> the sample is far away from the neighboring clusters.
· If it is -1 –> the sample is assigned to the wrong cluster.
Therefore, we want the coefficients to be as large as possible, close to 1, to have good
clusters. We’ll use the geyser dataset again because it’s cheaper to run the silhouette
analysis on, and it’s fairly obvious that there are most likely only two groups of data points.
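A minimal sketch using the cluster package on the same data (assuming k = 2):

```r
# Silhouette analysis for a k-means solution
library(cluster)

data <- scale(faithful)
km   <- kmeans(data, centers = 2, nstart = 25)
sil  <- silhouette(km$cluster, dist(data))
summary(sil)                         # average silhouette width per cluster
plot(sil)                            # coefficients near 1 = well-separated clusters
```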
3.3 Case Study
You are head of client insights and marketing at a telecommunications company, ConnectFast
Inc. You understand that not every client is alike, and you wish to have different strategies to
attract different customers. You appreciate the power of client segmentation to deliver
superior results at an optimized cost. You are also aware of unsupervised learning techniques,
such as cluster analysis, for creating client segments. To brush up your skills with cluster
analysis, you have selected a sample of eight customers with their average call durations
(both local and international). The following is the data:
To get a feel for it, you have plotted the data with average international call duration on one
axis and average local call duration on the other. The following is the plot:
Euclidean Distance to find Cluster Centroids
In this case, two centroids (C1 & C2) are randomly placed at the coordinates (1, 1) and (3, 4).
Why did we choose two centroids? For this problem, a visual guesstimate of the scatter plot
above tells us that there are two clusters. However, as we will see later in this series, this
question may not have such a straightforward answer for larger data sets.
Now, we will measure the distance between the two centroids (C1 & C2) and all the data
points on the scatter plot using the Euclidean measure. The Euclidean distance between two
points (x1, y1) and (x2, y2) is

d = sqrt((x1 − x2)^2 + (y1 − y2)^2)
Columns 3 and 4 (i.e. distance from C1 and distance from C2) are computed using this
formula. For instance, for the first customer the distance from C1 works out to 1.41 and the
distance from C2 to 2.24.
You can compute all the other values similarly. Furthermore, cluster membership (the last
column) is allocated based on closeness to the centroids (C1 and C2). The first customer is
closer to centroid 1 (1.41 in comparison to 2.24) and hence is assigned membership C1.
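As a quick check of this arithmetic in R (customer 1's coordinates of (2, 2) are an assumption, chosen to be consistent with the 1.41 and 2.24 values quoted above):

```r
p  <- c(2, 2)                 # customer 1 (assumed coordinates)
c1 <- c(1, 1); c2 <- c(3, 4)  # initial centroids
sqrt(sum((p - c1)^2))         # 1.414214 -> closer, so membership C1
sqrt(sum((p - c2)^2))         # 2.236068
```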
The following is the scatter plot with cluster centroids C1 and C2 (displayed as blue and
orange diamonds). The customers are marked with the colour of the centroid they are
closest to.
As we assigned the centroids arbitrarily, the second step is to move them iteratively. The
new position of a centroid is found by averaging the coordinates of its member points. For
the first centroid, customers 1, 2 and 3 are members, so the new x-coordinate for centroid
C1 is the average of their x-values, i.e. (2+1+1)/3 = 1.33. This gives new coordinates for C1 of
(1.33, 2.33) and for C2 of (4.4, 4.2). The new plot is shown below:
Finally, after one last iteration, the centroids settle at the centres of their clusters, as
displayed below:
The positions for our cluster centroids in this case turned out to be C1 (1.75, 2.25) and C2(4.75,
4.75).
4. Association:-
4.1 ASSOCIATION TECHNIQUES:
Association mining is commonly used to make product recommendations by identifying
products that are frequently bought together. One such association mining technique is the
Apriori algorithm.
Apriori Algorithm
● It proceeds by identifying the frequent individual items in the database and extending
them to larger and larger item sets as long as those item sets appear sufficiently often in
the database. The frequent itemsets determined by Apriori can be used to determine
association rules which highlight general trends in the database: this has applications in
domains such as market basket analysis.
● Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent
itemsets and relevant association rules. It is devised to operate on a database
containing a lot of transactions, for instance, items bought by customers in a store.
Association rules
Association rule learning is a prominent and a well-explored method for determining relations
among variables in large databases. Let us take a look at the formal definition of the problem of
association rules.
● Let I = {i1, i2, i3, …, in} be a set of n attributes called items and D = {t1, t2, …, tm} be the
set of transactions, called the database. Every transaction ti in D has a unique
transaction ID and consists of a subset of the items in I.
● A rule can be defined as an implication X ⟶ Y, where X and Y are subsets of I
(X, Y ⊆ I) and have no element in common, i.e., X ∩ Y = ∅. X and Y are the antecedent
and the consequent of the rule, respectively.
General Process of the Apriori algorithm
The entire algorithm can be divided into two steps:
Step 1: Apply minimum support to find all the frequent sets with k items in a database.
Step 2: Use the self-join rule to find the frequent sets with k+1 items with the help of frequent
k-itemsets. Repeat this process from k=1 to the point when we are unable to apply the self-join
rule.
This approach of extending a frequent itemset one at a time is called the “bottom up”
approach.
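A minimal sketch of the algorithm in R with the arules package (the transactions and thresholds are hypothetical):

```r
# Apriori on a toy market-basket database
library(arules)

baskets <- list(c("bread", "milk"),
                c("bread", "diaper", "beer"),
                c("milk", "diaper", "beer"),
                c("bread", "milk", "diaper", "beer"),
                c("bread", "milk", "diaper"))
trans <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6))
inspect(head(sort(rules, by = "lift")))   # strongest rules first
```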
Mining Association Rules
The Apriori algorithm has been looked at with respect to frequent itemset generation. There is
another way for which we can use this algorithm, i.e., finding association rules.
We need to find all rules having support more than the threshold support and confidence more
than the threshold confidence.
One possible approach is to list all the possible association rules, calculate the support and
confidence for each rule, and then eliminate the rules that fail the threshold support and
confidence. This is computationally prohibitive, as the number of possible association rules
increases exponentially with the number of items.
We can also use another way, which is called the two-step approach, to find efficient
association rules.
The two-step approach is:
Step 1: Frequent itemset generation: Find all itemsets for which the support is greater than the
threshold support following the process we have already seen earlier in this article.
Step 2: Rule generation: Create rules from each frequent itemset using binary partitions of
the frequent itemsets, and look for the ones with high confidence. These rules are called
candidate rules.
There are multiple rules possible even from a very small database, so in order to select the
interesting ones, we use constraints on various measures of interest and significance. We will
look at some of these useful measures such as support, confidence, lift and conviction.
4.2 Performance Measures :
1. Support
The support of an itemset X, supp(X), is the proportion of transactions in the database in
which the itemset X appears; it signifies the popularity of an itemset:

supp(X) = (number of transactions containing X) / (total number of transactions)
2. Confidence
Confidence signifies the likelihood of itemset Y being purchased when itemset X is purchased.
So, for the rule {Onion, Potato} => {Burger},

conf = supp({Onion, Potato, Burger}) / supp({Onion, Potato})

It can give some important insights, but it also has a major drawback: it only takes into
account the popularity of itemset X and not the popularity of Y. If Y is popular in its own
right, there is a higher probability that a transaction containing X will also contain Y, which
inflates the confidence. To overcome this drawback there is another measure called lift.
3. Lift
The lift of a rule is defined as:

lift = supp(X ∪ Y) / (supp(X) × supp(Y))

This signifies the likelihood of itemset Y being purchased when itemset X is purchased, while
taking into account the popularity of Y. If the value of lift is greater than 1, itemset Y is likely
to be bought together with itemset X, while a value less than 1 implies that itemset Y is
unlikely to be bought if itemset X is bought.
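As a toy numeric check of these definitions (the support values are hypothetical):

```r
supp_X  <- 0.4                       # supp({Onion, Potato})
supp_Y  <- 0.5                       # supp({Burger})
supp_XY <- 0.3                       # supp({Onion, Potato, Burger})

supp_XY / supp_X                     # confidence = 0.75
supp_XY / (supp_X * supp_Y)          # lift = 1.5 > 1: Y likely bought with X
```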
Pros of the Apriori algorithm
1. It is an easy-to-implement and easy-to-understand algorithm.
2. It can be used on large itemsets.
Cons of the Apriori Algorithm
1. Sometimes, it may need to generate a large number of candidate rules, which can
be computationally expensive.
2. Calculating support is also expensive because it has to go through the entire database.
Consider the following example:
Given is a set of transaction data. You can see transactions numbered 1 to 5. Each transaction
shows the items bought in that transaction. You can see that Diaper is bought with Beer in
three transactions. Similarly, Bread is bought with Milk in three transactions, making them
both frequent itemsets. Association rules are given in the form below:
A=>B[Support,Confidence]
The part before => is referred to as if (Antecedent) and the part after => is referred to as then
(Consequent).
Where A and B are sets of items in the transaction data. A and B are disjoint sets.
In the following section you will learn about the basic concepts of Association Rule Mining:
Basic Concepts of Association Rule Mining
1. Itemset: Collection of one or more items. K-item-set means a set of k items.
2. Support Count: Frequency of occurrence of an item-set
3. Support (s): Fraction of transactions that contain the item-set 'X'
● For a rule A=>B, support is given by:

Support(A=>B) = P(AUB)

Note: P(AUB) is the probability of A and B occurring together. P denotes probability.
Go ahead, try finding the support for Milk=>Diaper as an exercise.
4. Confidence (c): For a rule A=>B, confidence shows the percentage of transactions in
which B is bought along with A:

Confidence(A=>B) = (number of transactions with both A and B) / (number of transactions
with A) = P(AUB) / P(A)
Now find the confidence for Milk=>Diaper.
Note: Support and confidence measure how interesting a rule is. The minimum support and
minimum confidence thresholds, set by the client, help to compare rule strength according to
your own or the client's needs; rules that clear these thresholds are the ones of use to the
client.
5. Frequent Itemsets: Item-sets whose support is greater than or equal to the minimum
support threshold (min_sup). In the above example min_sup = 3; this is set by user
choice.
6. Strong rules: If a rule A=>B [Support, Confidence] satisfies min_sup and
min_confidence, then it is a strong rule.
7. Lift: Lift gives the correlation between A and B in the rule A=>B. Correlation shows
how one item-set A affects the item-set B.
For example, for the rule {Bread}=>{Milk}, lift is calculated as:

lift = Support({Bread}=>{Milk}) / (Support({Bread}) × Support({Milk}))

● If the rule has a lift of 1, then A and B are independent and no rule can be derived
from them.
● If the lift is > 1, then A and B are dependent on each other, and the degree of
dependence is given by the lift value.
● If the lift is < 1, then the presence of A has a negative effect on B.
Goal of Association Rule Mining
When you apply Association Rule Mining on a given set of transactions T, your goal will be to
find all rules with:
1. Support greater than or equal to min_support
2. Confidence greater than or equal to min_confidence
5. Conclusion and Learning
The project has given us an overall perspective on the various methods of data analytics and
their applications depending on the requirements.
There are two types of learning: supervised and unsupervised. Supervised learning is about
training the model on the majority of the data and carrying out analysis on the remaining test
data. The major difference between supervised and unsupervised learning is the presence of
historical data: in supervised learning there is historical data for training the model, whereas
in unsupervised learning there is none.
Classification is a supervised learning method; we understood that it is used to forecast
future outcomes, and that the accuracy of such models can be checked using a confusion
matrix. Clustering, by contrast, is an unsupervised learning method used for exploratory data
mining. It is used when we have groups of people with similar characteristics who can be
dealt with together.
Association is used when we have to establish relationships among the probabilities of
various data items and use those relationships for business applications such as cross-selling.
As it helps in studying purchase behaviour, it is also known as market basket analysis.
Reference
❖ https://www.kdnuggets.com/2016/07/burtchworks-sas-r-python-analytics-pros-prefer.html
❖ https://www.investopedia.com/terms/d/data-analytics.asp
❖ https://www.edureka.co/blog/what-is-data-analytics/
❖ https://data-flair.training/blogs/classification-in-r/

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Data Analytics Using R - Report

Application of Classification Algorithms
● Classification of spam emails.
● Predicting whether a bank customer will repay a loan or not.
● Identification of cancerous tumor cells.
● Sentiment analysis.
● Classification of drugs.
● Detection of facial keypoints.

2.1 Classification techniques:-
● Decision Trees – These are organized in the form of sets of questions and answers in a tree structure.
● Naive Bayes Classifiers – A probabilistic machine learning model used for classification.
● Support Vector Machines – A non-probabilistic binary linear classification model used to classify a case into one of two categories.
(i) Decision Tree:-
It is a kind of supervised-learning algorithm. Here, we split the population into two or more homogeneous data sets. The Decision Tree is a very powerful non-linear classification tool. A Decision Tree makes use of a tree-like structure to generate relationships among various features or parameters and potential outcomes. It makes use of branching decisions as its core structure. Following is the structure of a decision tree: -

Fig: Structure of a decision tree

Here, the root node represents the entire population or sample set. It then gets divided into two or more homogeneous sets of data. A Decision Tree is produced when any sub-node gets split into further sub-nodes. A Leaf/Terminal Node does not split further. The process of removing the sub-nodes of a decision node is called pruning. A Branch/Sub-Tree is a subsection of the entire tree.
Two types of Decision Tree
1. Categorical (Classification) Variable Decision Tree: A Decision Tree whose target variable is categorical.
2. Continuous (Regression) Variable Decision Tree: A Decision Tree whose target variable is continuous.

Advantages of Decision Tree in R
● Easy to understand and interpret: No statistical knowledge is needed to read a tree.
● Less data cleaning required: Compared to some other modeling techniques, it requires little data preparation.
● Data type is not a constraint: It handles both numerical and categorical variables.
● Handles non-linearity.
● It is possible to validate a model using statistical tests.

Disadvantages of Decision Tree in R
● Overfitting: One of the most practical difficulties for Decision Tree models. It can be addressed in R by setting constraints on model parameters and by pruning.
● Not ideal for continuous variables: Whenever a Decision Tree buckets continuous numerical variables into categories, it loses information.
● Learning a globally optimal tree is NP-hard, so algorithms rely on greedy search.
● Complex "if-then" relationships between features inflate tree size. Example – XOR gate, multiplexer.
(ii) Naïve Bayes Classification:-
We use Bayes' theorem to make the prediction. It is based on prior knowledge and current evidence. Bayes' theorem is expressed by the following equation:

P(A|B) = P(B|A) × P(A) / P(B)

where P(A) and P(B) are the probabilities of events A and B without regard to each other, P(A|B) is the probability of A conditional on B, and P(B|A) is the probability of B conditional on A.

(iii) Support Vector Machine (SVM):-
We use it to find the optimal hyperplane (a line in 2D, a plane in 3D and a hyperplane in more than 3 dimensions) that maximizes the margin between two classes. Support Vectors are the observations that support the hyperplane on either side. SVM solves a linear optimization problem to find the hyperplane with the largest margin. We use the "Kernel Trick" to separate instances that are not linearly separable.

Advantages of SVM in R
● With the kernel trick, it performs very well on non-linearly separable data.
● SVM works well in high-dimensional space and for text or image classification.
● It does not suffer from the multicollinearity problem.
Disadvantages of SVM in R
● It takes more time on large data sets.
● SVM does not return probability estimates.
● In the case of linearly separable data, it behaves almost like logistic regression.

Applications of Classification in R
● An emergency room in a hospital measures 17 variables (blood pressure, age and many more) of newly admitted patients. A careful decision has to be made as to whether a patient should be admitted to the ICU. Due to the high cost of the ICU, patients who are likely to survive more than a month are given high priority. The problem is to predict high-risk patients and to discriminate them from low-risk patients.
● A credit company receives hundreds of thousands of applications for new cards. Each application contains information about several attributes. The problem is to categorize applicants into those with good credit, bad credit, or those who fall into a grey area.
● Astronomers have been cataloguing distant objects in the sky using long-exposure CCD images. Each object needs to be labelled as a star, galaxy, etc. The data is noisy and the images are very faint, so the cataloguing can take decades to complete.

2.2 Performance Measures:-
i) Confusion matrix
The R function table() can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified. It compares the observed and predicted outcome values and shows the number of correct and incorrect predictions categorized by type of outcome.
Fig : Confusion Matrix

A. True positives: cases in which we predicted the individuals would be diabetes-positive and they were.
B. True negatives: we predicted diabetes-negative, and the individuals were diabetes-negative.
C. False positives: we predicted diabetes-positive, but the individuals did not actually have diabetes. (Also known as a Type I error.)
D. False negatives: we predicted diabetes-negative, but they did have diabetes. (Also known as a Type II error.)
E. Precision: the proportion of true positives among all the individuals predicted to be diabetes-positive by the model. This represents the accuracy of a predicted positive outcome. Precision = TruePositives/(TruePositives + FalsePositives).
F. Sensitivity (or Recall): the True Positive Rate (TPR), i.e. the proportion of identified positives among the diabetes-positive population (class = 1). Sensitivity = TruePositives/(TruePositives + FalseNegatives).
G. Specificity: measures the True Negative Rate (TNR), i.e. the proportion of identified negatives among the diabetes-negative population (class = 0). Specificity = TrueNegatives/(TrueNegatives + FalsePositives).
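These measures can be computed directly in R. Below is a minimal sketch using the table() function, assuming two hypothetical 0/1 vectors of observed and predicted outcomes purely for illustration:

# Hypothetical observed and predicted outcomes (1 = positive, 0 = negative)
observed  <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

# Confusion matrix: predictions in rows, observations in columns
cm <- table(Predicted = predicted, Observed = observed)
print(cm)

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

precision   <- TP / (TP + FP)   # accuracy of predicted positives
sensitivity <- TP / (TP + FN)   # recall / true positive rate
specificity <- TN / (TN + FP)   # true negative rate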
2.3 Case:-
In this assignment, for the classification techniques, we take the case of the "iris" data set, which is a built-in dataset of R; for this we need to import the rpart and rpart.plot libraries. Following is the structure of the iris dataset:-

i) Decision Tree:-
After performing the decision tree algorithm, we got the following output for the categorical variable 'Species' with respect to all the continuous variables:-
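For reference, a minimal sketch of how such a tree can be fitted and plotted with rpart and rpart.plot on the built-in iris data; the exact settings used for the figure are not shown in the report, so defaults are assumed:

library(rpart)
library(rpart.plot)

str(iris)   # structure of the built-in iris dataset

# Fit a classification tree for Species using all continuous predictors
fit <- rpart(Species ~ ., data = iris, method = "class")

# Plot the fitted tree
rpart.plot(fit)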
Fig : Decision Tree

Here we split our decisions on the basis of Petal Length and Petal Width. We first decide whether the petal length is more than or less than 2.5; if the decision is 'no', we further check whether the petal width is smaller or greater than 1.8 to get a sufficiently homogeneous partition.

ii) Naive Bayes Classification:-
We will perform Naive Bayes Classification on the same iris dataset. For Naive Bayes Classification we have imported the 'e1071' library. Here, we build a model using training and testing datasets. We have renamed the columns as "sepal_length", "sepal_width", "petal_length", "petal_width", "class". After running the naive bayes algorithm, we get the following output:-
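A minimal sketch of how such a Naive Bayes model can be built with the e1071 library; the exact train/test split used in the report is not shown, so a 70/30 split is assumed here:

library(e1071)

# Rename the iris columns as in the report
df <- iris
colnames(df) <- c("sepal_length", "sepal_width",
                  "petal_length", "petal_width", "class")

# Hypothetical 70/30 train/test split
set.seed(42)
idx   <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

# Fit the Naive Bayes model and predict on the test set
model <- naiveBayes(class ~ ., data = train)
pred  <- predict(model, newdata = test)

# Compare predicted vs. observed classes
table(Predicted = pred, Observed = test$class)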
iii) Support Vector Machine (SVM)
With the use of a Support Vector Machine, we try to achieve the following two classification goals simultaneously:
1. Maximize the margin
2. Correctly classify the data points

We applied SVM techniques on the iris dataset and produced a model; after carrying out all the required steps, we get the following result:-

                 precision  recall  f1-score  support
Iris-setosa          1.00    1.00      1.00       17
Iris-versicolor      1.00    1.00      1.00       16
Iris-virginica       1.00    1.00      1.00       12
avg / total          1.00    1.00      1.00       45
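For reference, a minimal sketch of how such an SVM model can be fitted in R with the e1071 library; the kernel and split behind the reported result are not shown, so a radial kernel and a 70/30 split are assumed:

library(e1071)

set.seed(42)
idx   <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit an SVM classifier (radial kernel assumed)
svm_fit <- svm(Species ~ ., data = train, kernel = "radial")
pred    <- predict(svm_fit, newdata = test)

# Per-class results can be read off this confusion matrix
table(Predicted = pred, Observed = test$Species)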
3. Clustering:-
3.1 Clustering Technique
Clustering is a method of data breakdown that partitions the data into several groups based on their similarity. We group the data through a statistical procedure. The smaller groups formed from the bigger data are known as clusters. These clusters exhibit the following properties:
● They are learned while carrying out the operation, and their number is not known in advance.
● Clusters are accumulations of alike objects that share common characteristics.

Clustering is the most widespread and popular method of Data Analysis and Data Mining. It is used in cases where the underlying input data has a colossal volume and we are tasked with finding similar subsets that can be analyzed in several ways.

For example – A marketing company can categorize its customers based on their economic background, age and several other factors in order to sell its products in a better way.

In different fields, R clustering has different names, such as:
● Marketing – In marketing, the terms 'segmentation' or 'typological analysis' are used for clustering.
● Medicine – Clustering in medicine is known as nosology.
● Biology – It is referred to as numerical taxonomy in the field of Biology.

To define the correct criteria for clustering and to make use of efficient algorithms, note that the number of possible partitions grows explosively: Bn (the number of partitions of n objects) > exp(n). You can determine the complexity of clustering by the number of possible combinations of objects. The basis for joining or separating objects is the distance between them. These distances are dissimilarities (when objects are far from each other) or similarities (when objects are close by).

Methods for Measuring Distance between Objects
For calculating the distance between the objects in K-means, we make use of the following types of methods:
● Euclidean Distance – The most widely used method for measuring the distance between objects in a multidimensional space. In general, for an n-dimensional space, the distance between points p and q is:

d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² )

● Squared Euclidean Distance – Obtained by squaring the Euclidean Distance; objects that lie at greater distances are assigned greater weights.
● City-Block (Manhattan) Distance – The sum of absolute differences between two points across all dimensions. It behaves similarly to Euclidean Distance in many cases, but it reduces the effect of extreme objects because the coordinates are not squared.
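These distance measures can be tried out directly with base R's dist() function. A minimal sketch on two hypothetical 3-dimensional points:

# Two hypothetical points p and q in 3-dimensional space
pts <- rbind(p = c(1, 2, 3), q = c(4, 6, 8))

dist(pts, method = "euclidean")     # sqrt(3^2 + 4^2 + 5^2)
dist(pts, method = "euclidean")^2   # squared Euclidean distance
dist(pts, method = "manhattan")     # |3| + |4| + |5|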
The inertia is the weighted mean of the squares of the distances of the points from the center of their assigned cluster, summed over the clusters. The Sum of Squares decomposes about the cluster centers as follows:

Total Sum of Squares = Between-Cluster Sum of Squares + Within-Cluster Sum of Squares

This decomposition is known as Huygens' Formula. The Between-Cluster Sum of Squares is calculated by evaluating the squared difference of each cluster's center of gravity from the overall center of gravity and adding these up. The Within-Cluster Sum of Squares is calculated by finding the squared difference of each point from the center of gravity of its cluster and adding these up within each cluster. As the clusters tighten, the partition becomes better.

R-squared (RSQ) delineates the proportion of the sum of squares that is explained by the clusters. The closer the proportion is to 1, the better the clustering. However, the aim is not to maximize it outright, as that would lead to a greater number of clusters. Therefore, we require an ideal R² that is close to 1 but does not create too many clusters. Note that as we move from k to k+1 clusters, there is always an increase in the value of R².

Some of the properties of efficient clustering are:
● Detecting structures that are present in the data.
● Determining the optimal number of clusters.
● Giving out readable, differentiated clusters.
● Ensuring stability of the clusters even with minor changes in the data.
● Efficient processing of large volumes of data.
● Handling variables of different data types.

Note: In the case of correct clustering, either the between-cluster inertia (IR) is large or the within-cluster inertia (IA) is small while calculating the sum of squares.

Clustering is only restarted after we have performed data interpretation, transformation and the exclusion of variables. While a variable is excluded, it is simply not taken into account during the clustering operation; it becomes an illustrative variable.

Agglomerative Hierarchical Clustering
In Agglomerative Hierarchical Clustering (AHC), sequences of nested partitions of n clusters are produced. The nested partitions are ordered by increasing heterogeneity. We use AHC if the distance is defined either in an individual or a variable space. The distance between two objects or clusters must be defined before carrying out the categorization.

The algorithm for AHC is as follows:
● We first observe the initial clusters.
● In the next step, we assess the distance between the clusters.
● We then merge the most proximate clusters together and replace them with a single cluster.
● We repeat from step 2 until only a single cluster remains.

AHC generates a type of tree called a dendrogram. By cutting this dendrogram, we obtain the clusters. Hierarchical Clustering is widely used in identifying patterns in digital images, prediction of stock prices, text mining, etc. It is also used for researching protein sequence classification.

1. Main Distances
● Maximum distance – Here the distance between two clusters is the greatest distance between their observed objects; this complete-linkage method tends to produce clusters of equal diameter.
● Minimum distance – Here the distance between two clusters is the minimum distance between their observations; this delineates the nearest-neighbor technique, or single-linkage AHC method. In this case, the minimum distance between points of different clusters is supposed to be greater than the maximum distance between points within the same cluster.
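A minimal sketch of AHC in R with hclust(); the report does not tie AHC to a particular dataset, so the built-in iris measurements are used here for illustration:

# Pairwise Euclidean distances between observations
d   <- dist(iris[, 1:4], method = "euclidean")

# Agglomerative clustering with maximum-distance (complete) linkage
ahc <- hclust(d, method = "complete")

plot(ahc)                        # dendrogram
clusters <- cutree(ahc, k = 3)   # cut the tree into 3 clusters
table(clusters, iris$Species)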
2. Density Estimation
In density estimation, we detect the structure of the various complex clusters. The three methods for estimating density in clustering are as follows:
● The k-nearest-neighbors method – The density at a point x is determined by the number k of observations centered on x, divided by the volume of the enclosing sphere.
● The Uniform Kernel Method – In this, the radius is fixed but the number of neighbors is not.
● The Wong Hybrid Method – We use this in the preliminary analysis.

Clustering by Similarity Aggregation
Clustering by Similarity Aggregation is known as relational clustering; it is also known by the name of the Condorcet method. With this method, we compare all the individual objects in pairs, which helps in building the global clustering. The principle of the equivalence relation exhibits three properties – reflexivity, symmetry, and transitivity:
● Reflexivity => Mii = 1
● Symmetry => Mij = Mji
● Transitivity => Mij + Mjk − Mik <= 1

This type of clustering algorithm makes use of an intuitive approach. A pair of individuals (A, B) is assigned the two vectors m(A, B) and d(A, B): m(A, B) counts the attributes on which A and B take the same value, whereas d(A, B) counts those on which they take different values. The two individuals A and B satisfy the Condorcet criterion as follows:

c(A, B) = m(A, B) − d(A, B)

For an individual A and a cluster S, the Condorcet criterion is:

c(A, S) = Σi c(A, Bi)

where the summation runs over all Bi ∈ S. With the above conditions, we start by constructing clusters that place each individual A in the cluster S for which c(A, S) is largest and at least 0. In the next step, we calculate the global Condorcet criterion by summing, over all individuals A, the criterion for the cluster SA that contains them.
K-Means Clustering in R
One of the most popular partitioning algorithms in clustering is the K-means cluster analysis in R. It is an unsupervised learning algorithm that tries to cluster data based on similarity. We specify the number of clusters we want the data to be grouped into. The algorithm assigns each observation to a cluster and also finds the centroid of each cluster.

3.2 Performance Measures
Contrary to supervised learning, where we have the ground truth to evaluate the model's performance, clustering analysis does not have a solid evaluation metric for comparing the outcomes of different clustering algorithms. Moreover, since k-means requires k as an input and does not learn it from the data, there is no single right answer for the number of clusters in any problem. Sometimes domain knowledge and intuition may help, but usually that is not the case. In the cluster-predict methodology, we can evaluate how well the models perform for different values of k, since the clusters are used in the downstream modeling.

Here we cover two metrics that may give us some intuition about k:
· Elbow method
· Silhouette analysis

Elbow Method
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters' centroids. We pick k at the spot where the SSE starts to flatten out and form an elbow. We will use the geyser dataset, evaluate the SSE for different values of k, and see where the curve forms an elbow and flattens out.
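A minimal sketch of the elbow method in R; the "geyser dataset" is assumed here to be the built-in faithful (Old Faithful) data:

# Standardize the geyser measurements
gdata <- scale(faithful)

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(gdata, centers = k, nstart = 25)$tot.withinss
})

# Look for the "elbow" in this curve
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")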
The graph above shows that k = 2 is not a bad choice. Sometimes it is still hard to figure out a good number of clusters, because the curve is monotonically decreasing and may not show any elbow, or may lack an obvious point where it starts flattening out.

Silhouette Analysis
Silhouette analysis can be used to determine the degree of separation between clusters. For each sample:
· Compute the average distance from all data points in the same cluster (ai).
· Compute the average distance from all data points in the closest other cluster (bi).
· Compute the coefficient:

s(i) = (bi − ai) / max(ai, bi)

The coefficient can take values in the interval [-1, 1]:
· If it is 0 –> the sample is very close to the neighboring clusters.
· If it is 1 –> the sample is far away from the neighboring clusters.
· If it is -1 –> the sample is assigned to the wrong cluster.

Therefore, we want the coefficients to be as large as possible, and close to 1, to have good clusters. We will use the geyser dataset again, because it is cheaper to run the silhouette analysis on it, and it is actually obvious that there are most likely only two groups of data points.
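A minimal sketch of silhouette analysis in R with the cluster package, again assuming the built-in faithful data as the geyser dataset:

library(cluster)

gdata <- scale(faithful)
km    <- kmeans(gdata, centers = 2, nstart = 25)

# Silhouette widths for the k = 2 solution
sil <- silhouette(km$cluster, dist(gdata))
mean(sil[, "sil_width"])   # average width; closer to 1 is better
plot(sil)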
3.3 Case Study
You are head of client insights and marketing at a telecommunications company, ConnectFast INC. You understand that not every client is alike, and you wish to have different strategies to attract different customers. You appreciate the power of customer segmentation to deliver superior results at an optimized cost. You are also aware of unsupervised learning techniques, like cluster analysis, for creating customer segments. To brush up your skills with cluster analysis, you have selected a sample of eight customers with their average call durations (both local and international). The following is the data:

To get a feel for it, you have plotted the data with the average international call duration on one axis and the average local call duration on the other. The following is the plot:

Euclidean Distance to find Cluster Centroids
In this case, two centroids (C1 & C2) are randomly placed at the coordinates (1, 1) and (3, 4). Why did we choose two centroids? For this problem, a visual guesstimate of the scatter plot above tells us that there are two clusters. However, as we will notice later in this series, this question may not have such a straightforward answer for larger data sets.

Now, we will measure the distance between the two centroids (C1 & C2) and all the data points on the above scatter plot using the Euclidean measure. The Euclidean distance between two points (x1, y1) and (x2, y2) is measured through the following formula:

d = sqrt( (x2 − x1)² + (y2 − y1)² )
Columns 3 and 4 (i.e. Distance from C1 and Distance from C2) are measured using the same formula. For instance, for the first customer, plotted at (2, 2):

Distance from C1 = sqrt( (2 − 1)² + (2 − 1)² ) = 1.41
Distance from C2 = sqrt( (2 − 3)² + (2 − 4)² ) = 2.24

You can measure all the other values similarly. Furthermore, cluster membership (the last column) is allocated using closeness to the centroids (C1 and C2). The first customer is closer to centroid 1 (1.41 in comparison to 2.24), hence it is assigned membership C1. The following is the scatter plot with cluster centroids C1 and C2 (displayed with blue and orange diamond shapes). The customers are marked with the colour of the centroid they are closest to.

As we arbitrarily assigned the centroids, the second step is to move them iteratively. The new position of each centroid is measured by taking the average of its member points. For the first centroid, customers 1, 2 and 3 are members. Hence, the new x-axis position for centroid C1 is the average x-value for these customers, i.e. (2+1+1)/3 = 1.33. We get the new coordinates for C1 equal to (1.33, 2.33) and for C2 equal to (4.4, 4.2). The new plot is shown below:
Finally, after one last iteration, the centroids settle at the centre of their clusters, as displayed below. The positions of our cluster centroids in this case turned out to be C1 (1.75, 2.25) and C2 (4.75, 4.75).
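The same case can be run end-to-end with R's kmeans(). The full table of eight customers is not reproduced in the text, so the values below are hypothetical stand-ins; only customer 1 at (2, 2) is stated explicitly:

# Hypothetical stand-in data for the eight customers:
# average local and international call durations.
customers <- data.frame(
  local = c(2, 1, 1, 3, 4, 5, 5, 6),
  intl  = c(2, 2, 3, 2, 4, 4, 5, 6)
)

set.seed(1)
km <- kmeans(customers, centers = 2, nstart = 10)

km$centers   # final centroid positions
km$cluster   # cluster membership of each customer

# Plot customers coloured by cluster, centroids as diamonds
plot(customers, col = km$cluster, pch = 19)
points(km$centers, col = 1:2, pch = 18, cex = 2)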
4. Association:-
4.1 Association Techniques:
Association mining is commonly used to make product recommendations by identifying products that are frequently bought together. One such association mining technique is the Apriori algorithm.

Apriori Algorithm
● It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets, as long as those item sets appear sufficiently often in the database. The frequent itemsets determined by Apriori can be used to derive association rules which highlight general trends in the database; this has applications in domains such as market basket analysis.
● The Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets and the relevant association rules. It is devised to operate on a database containing a lot of transactions, for instance, items bought by customers in a store.

Association rules
Association rule learning is a prominent and well-explored method for determining relations among variables in large databases. Let us take a look at the formal definition of the problem of association rules.
● Let I = {i1, i2, i3, …, in} be a set of n attributes called items and D = {t1, t2, …, tn} be a set of transactions, called the database. Every transaction ti in D has a unique transaction ID and consists of a subset of the items in I.
● A rule can be defined as an implication X ⟶ Y, where X and Y are subsets of I (X, Y ⊆ I) and they have no element in common, i.e., X ∩ Y = ∅. X and Y are the antecedent and the consequent of the rule.

General Process of the Apriori algorithm
The entire algorithm can be divided into two steps:
Step 1: Apply minimum support to find all the frequent sets with k items in the database.
Step 2: Use the self-join rule to find the frequent sets with k+1 items from the frequent k-itemsets. Repeat this process from k = 1 to the point where we are unable to apply the self-join rule.
This approach of extending a frequent itemset one item at a time is called the "bottom-up" approach.

Mining Association Rules
So far, the Apriori algorithm has been looked at with respect to frequent itemset generation. There is another task for which we can use this algorithm, i.e., finding association rules. We need to find all rules having support above the threshold support and confidence above the threshold confidence.

One possible way to do this is brute force: list all the possible association rules, calculate the support and confidence for each rule, and then eliminate the rules that fail the threshold support and confidence. This is very heavy and prohibitive, as the number of possible association rules increases exponentially with the number of items.

We can instead use the two-step approach to find association rules efficiently:
Step 1: Frequent itemset generation: Find all itemsets for which the support is greater than the threshold support, following the process we have already seen earlier.
Step 2: Rule generation: Create rules from each frequent itemset using binary partitions of the frequent itemsets and look for the ones with high confidence. These rules are called candidate rules.

Multiple rules are possible even from a very small database, so in order to select the interesting ones, we use constraints on various measures of interest and significance. We will look at some of these useful measures, such as support, confidence and lift.
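In R, both steps are handled by the apriori() function of the arules package. A minimal sketch on the package's built-in Groceries transactions, with assumed thresholds:

library(arules)
data(Groceries)   # built-in transactions dataset

# Run Apriori with assumed thresholds: 1% support, 50% confidence
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))

# Inspect the five rules with the highest lift
inspect(head(sort(rules, by = "lift"), 5))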
4.2 Performance Measures:

1. Support
The support of an itemset X, supp(X), is the proportion of transactions in the database in which the itemset X appears. It signifies the popularity of an itemset:

supp(X) = (number of transactions containing X) / (total number of transactions)

2. Confidence
Confidence signifies the likelihood of itemset Y being purchased when itemset X is purchased. So, for a rule such as {Onion, Potato} => {Burger}:

conf(X => Y) = supp(X ∪ Y) / supp(X)

It can give some important insights, but it also has a major drawback: it only takes into account the popularity of the itemset X and not the popularity of Y. If Y is as popular as X, there will be a higher probability that a transaction containing X will also contain Y, thus inflating the confidence. To overcome this drawback there is another measure called lift.

3. Lift
The lift of a rule is defined as:

lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y))

This signifies the likelihood of the itemset Y being purchased when itemset X is purchased, while taking into account the popularity of Y. If the value of lift is greater than 1, it means that itemset Y is likely to be bought with itemset X, while a value less than 1 implies that itemset Y is unlikely to be bought if itemset X is bought.

Pros of the Apriori algorithm
1. It is an easy-to-implement and easy-to-understand algorithm.
2. It can be used on large itemsets.

Cons of the Apriori algorithm
1. Sometimes it may need to generate a large number of candidate rules, which can be computationally expensive.
2. Calculating support is also expensive because it has to go through the entire database.
Consider the following example: given is a set of transaction data. You can see transactions numbered 1 to 5; each transaction shows the items bought in that transaction. You can see that Diaper is bought with Beer in three transactions. Similarly, Bread is bought with Milk in three transactions, making both frequent itemsets.

Association rules are given in the form:

A => B [Support, Confidence]

The part before => is referred to as the "if" (antecedent) and the part after => as the "then" (consequent), where A and B are sets of items in the transaction data, and A and B are disjoint sets.

In the following section you will learn about the basic concepts of Association Rule Mining.

Basic Concepts of Association Rule Mining
1. Itemset: A collection of one or more items. A k-itemset means a set of k items.
2. Support Count: The frequency of occurrence of an itemset.
3. Support (s): The fraction of transactions that contain the itemset X.
● For a rule A => B, support is given by:

Support(A => B) = P(A U B) = (number of transactions containing both A and B) / (total number of transactions)
Note: P(A U B) is the probability of A and B occurring together; P denotes probability. Go ahead, try finding the support for Milk => Diaper as an exercise.

4. Confidence (c): For a rule A => B, confidence shows the percentage of cases in which B is bought along with A: the number of transactions containing both A and B divided by the total number of transactions containing A.

Confidence(A => B) = (number of transactions containing both A and B) / (number of transactions containing A)

Now find the confidence for Milk => Diaper.

Note: Support and confidence measure how interesting a rule is. The minimum support and minimum confidence thresholds, set by the client, help to compare rule strength according to your own or the client's requirements; rules that clear these thresholds are of use to the client.

5. Frequent Itemsets: Itemsets whose support is greater than or equal to the minimum support threshold (min_sup). In the above example min_sup = 3; this is set at the user's choice.
6. Strong rules: If a rule A => B [Support, Confidence] satisfies min_sup and min_confidence, then it is a strong rule.
7. Lift: Lift gives the correlation between A and B in the rule A => B. Correlation shows how one itemset A affects the itemset B. For example, for the rule {Bread} => {Milk}, lift is calculated as:

Lift(Bread => Milk) = Support(Bread U Milk) / (Support(Bread) × Support(Milk))

● If the rule has a lift of 1, then A and B are independent and no rule can be derived from them.
● If the lift is > 1, then A and B are dependent on each other, and the degree of dependence is given by the lift value.
● If the lift is < 1, then the presence of A has a negative effect on B.

Goal of Association Rule Mining
When you apply Association Rule Mining on a given set of transactions T, your goal will be to find all rules with:
1. Support greater than or equal to min_support
2. Confidence greater than or equal to min_confidence

5. Conclusion and Learning
The whole project has given us an overall perspective of the various methods of data analytics and their applications depending on the requirements.
There are two types of learning: supervised and unsupervised. Supervised learning is about training the model on the majority of the data and carrying out analysis on the remaining test data. The major difference between supervised and unsupervised learning is the presence of historical data: in the case of supervised learning there is historical data for training the model, whereas in the case of unsupervised learning there is none.

Classification is one of the supervised learning methods, and we understood that it is used for forecasting future outcomes; the accuracy of such models can be checked using a confusion matrix. Similarly, clustering is one of the unsupervised learning methods and is used for exploratory data mining: it is applied when we have groups of people with similar characteristics who can be dealt with together. Association is used when we have to establish relations among the probabilities of various data items and use those relationships for business applications like cross-selling; as it helps in studying purchase behaviour, it is also known as market basket analysis.

Reference
❖ https://www.kdnuggets.com/2016/07/burtchworks-sas-r-python-analytics-pros-prefer.html