DAUR Report
Submitted by: Akanksha Gohil
1. Introduction to Data Analytics
Data Analytics refers to a set of techniques used to analyze data in order to enhance
productivity and business profitability.
In the Data Analytics process, data is extracted from different sources, cleaned, and then
categorized so that various behavioural patterns can be analyzed.
The techniques and tools used for data analysis vary from organization to organization and from
individual to individual. Data analytics is a very broad concept that includes many diverse types
of data analysis techniques. Almost any type of information can be fed into data analytics
techniques in order to gain meaningful insights that can be used to achieve the required result.
It is a process of inspecting, cleansing, transforming and modeling data with the aim of:
i) discovering required useful information,
ii) informing conclusions, and
iii) supporting decision-making.
Importance of Data Analytics:-
As a large amount of data gets generated every day, the need to dig out useful and meaningful
insights is a must for any business enterprise. Data Analytics plays a key role in improving and
moulding business decisions.
The following are four main factors that signify the need for Data Analytics in our lives:
● Gather Hidden Insights – Hidden insights and information are first gathered from the
data and then analyzed with respect to the business requirements.
● Generate Reports – Reports are developed from the data and passed on to the
respective teams to take further actions that drive business growth.
● Perform Market Analysis – Market analysis can be performed to understand the
strengths and weaknesses of competitors.
● Improve Business Requirements – Analysis of data allows improving the business,
from customer requirements to customer experience.
Four Types of Data Analytics:-
1. Descriptive analytics tells us what has happened over a particular time period.
2. Diagnostic analytics focuses more on why something happened. This includes
more diverse data sets and a little hypothesizing.
3. Predictive analytics tells us what is likely going to happen in the near future.
4. Prescriptive analytics suggests a course of action.
Fig: Types of Data Analytics
Top Tools used in Data Analytics
With the increase in demand for Data Analysis, many tools have come up with various
functionalities and expertise for the analysis of data. Some of them are open-source while some
are not. Following are the most widely used top tools in the field of data analytics:-
★ R programming
★ Python
★ Tableau Public
★ QlikView
★ SAS
★ Microsoft Excel
★ RapidMiner
★ KNIME
★ OpenRefine
★ Apache Spark
Out of these, R, Python and SAS are the most widely used in the market. The following image
shows the distribution of usage of these technologies across the industry:-
2. Classification:-
Classification techniques are used to predict a categorical class label, such as income level: low,
medium or high.
Let us demonstrate the concept of classification with an example: classifying gender using
hair length. Here gender is our ‘target class’, and since it is classified on the basis of hair
length, hair length is a ‘feature parameter’. We can now set up a condition to serve as the
reference point for classification. Suppose the boundary value for hair length is 15.0 cm;
then we can say that if the hair length is less than 15.0 cm, the predicted gender is male, or
else female.
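As a toy sketch of this threshold rule in R (the hair-length values are hypothetical; the 15.0 cm cut-off is the one assumed above):

```r
# Toy threshold classifier: predict gender from hair length (hypothetical data)
hair_length <- c(5, 12, 20, 30, 8)
ifelse(hair_length < 15, "male", "female")
# [1] "male"   "male"   "female" "female" "male"
```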
Applications of Classification Algorithms
● Classification of spam emails.
● Predicting whether a bank customer will repay a loan or not.
● Identification of cancer tumor cells.
● Sentiment analysis.
● Classification of drugs.
● Detection of facial keypoints.
2.1 Classification techniques:-
● Decision Trees – These are organized in the form of sets of questions and
answers in the tree structure.
● Naive Bayes Classifiers – It is a probabilistic machine learning model that is used
for classification.
● Support Vector Machines – It is a non-probabilistic binary linear
classification model used to classify a case into one of the two categories.
(i) Decision Tree:-
It is a kind of supervised-learning algorithm. Here, we split the population into two or
more homogeneous data sets.
The Decision Tree is a very powerful non-linear classification tool. A Decision Tree makes use
of a tree-like structure to generate relationships among various features or parameters and
potential outcomes. It makes use of the branching decisions as its core structure.
Following is the structure of a decision tree: -
Fig: Structure of a decision tree
Here, the root node represents the entire population or sample set. It then gets divided into
two or more homogeneous sets of data. When a sub-node splits into further sub-nodes, it is
called a decision node. A leaf/terminal node does not split any further. The process of
removing sub-nodes of a decision node is called pruning. A branch/sub-tree is a subsection
of the entire tree.
Two types of Decision Tree
1. Categorical (classification) Variable Decision Tree: A Decision Tree that has a
categorical target variable.
2. Continuous (regression) Variable Decision Tree: A Decision Tree that has a
continuous target variable.
Advantages of Decision Tree in R
● Easy to Understand: It does not need any statistical knowledge to read and
interpret them.
● Less data cleaning required: It requires less data cleaning compared to some
other modeling techniques.
● Data type is not a constraint: It can handle both numerical and categorical
variables.
● Simple to understand and interpret.
● Requires little data preparation.
● It works with both numerical and categorical data.
● Handles non-linearity.
● Possible to confirm a model using statistical tests.
Disadvantages of Decision Tree in R
● Overfitting: It is one of the most practical difficulties for Decision Tree models. By
setting constraints on model parameters and pruning, we can solve this problem
in R.
● Not fit for continuous variables: When using continuous numerical variables, the
Decision Tree loses information as it categorizes them into different bins.
● Learning a globally optimal tree is NP-hard, so algorithms rely on greedy search.
● Complex “if-then” relationships between features inflate the tree size. Examples –
the XOR gate, the multiplexer.
(ii) Naïve Bayes Classification:-
We use Bayes’ theorem to make the prediction. It is based on prior knowledge and
current evidence.
Bayes’ theorem is expressed by the following equation:

P(A|B) = P(B|A) × P(A) / P(B)

where P(A) and P(B) are the probabilities of events A and B without regard to each other, P(A|B)
is the probability of A conditional on B, and P(B|A) is the probability of B conditional on A.
(iii) Support Vector Machine (SVM):-
We use it to find the optimal hyperplane (a line in 2D, a plane in 3D, and a hyperplane in more
than 3 dimensions) that maximizes the margin between the two classes. Support vectors are
the observations that support the hyperplane on either side.
SVM solves a linear optimization problem to find the hyperplane with the largest margin. We
use the “kernel trick” to separate instances that are otherwise inseparable.
Advantages of SVM in R
● If we use the kernel trick on non-linearly separable data, it performs very well.
● SVM works well in high dimensional space and in case of text or image
classification.
● It does not suffer a multicollinearity problem.
Disadvantages of SVM in R
● It takes more time on large-sized data sets.
● SVM does not return probability estimates.
● In the case of linearly separable data, this is almost like logistic regression.
Applications of Classification in R
● An emergency room in a hospital measures 17 variables for newly admitted
patients, such as blood pressure, age and many more. A careful decision then has
to be made about whether the patient should be admitted to the ICU. Owing to the
high cost of the ICU, patients who are likely to survive more than a month are given
high priority. The problem is to predict high-risk patients and to discriminate them
from low-risk patients.
● A credit company receives hundreds of thousands of applications for new cards.
The application contains information about several different attributes.
Moreover, the problem is to categorize those who have good credit, bad credit or
fall into a grey area.
● Astronomers have been cataloguing distant objects in the sky using long-exposure
CCD images. Each object needs to be labelled as a star, galaxy, etc. The data is
noisy and the images are very faint, hence the cataloguing can take decades to
complete.
2.2 Performance Measures:-
i) Confusion matrix
The R function table() can be used to produce a confusion matrix in order to determine how
many observations were correctly or incorrectly classified. It compares the observed and the
predicted outcome values and shows the number of correct and incorrect predictions
categorized by type of outcome.
Fig : Confusion Matrix
A. True positives: these are cases in which we predicted the individuals would
be diabetes-positive and they were.
B. True negatives: We predicted diabetes-negative, and the individuals were diabetes-
negative.
C. False positives: We predicted diabetes-positive, but the individuals didn’t actually
have diabetes. (Also known as a Type I error.)
D. False negatives: We predicted diabetes-negative, but they did have diabetes.
(Also known as a Type II error.)
E. Precision: It is the proportion of true positives among all the individuals that have
been predicted to be diabetes-positive by the model. This represents the
accuracy of a predicted positive outcome. Precision =
TruePositives/(TruePositives + FalsePositives).
F. Sensitivity (or Recall): It is the True Positive Rate (TPR) or the proportion of
identified positives among the diabetes-positive population (class = 1). Sensitivity
= TruePositives/(TruePositives + FalseNegatives).
G. Specificity: It measures the True Negative Rate (TNR), which is the proportion of
identified negatives among the diabetes-negative population (class = 0). Specificity
= TrueNegatives/(TrueNegatives + FalsePositives).
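As a minimal sketch of these measures in R (the observed and predicted vectors are hypothetical, with 1 = diabetes-positive):

```r
# Confusion matrix with table(), then precision, sensitivity and specificity
observed  <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)   # hypothetical ground truth
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)   # hypothetical model output

cm <- table(observed, predicted)                # rows = observed, cols = predicted
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]

precision   <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)                   # recall / true positive rate
specificity <- TN / (TN + FP)                   # true negative rate
c(precision = precision, sensitivity = sensitivity, specificity = specificity)
```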
2.3 Case:-
In this assignment, for classification techniques, we take the case of the “Iris” dataset,
which is a built-in dataset of R; for this we will need to import the rpart and rpart.plot libraries.
Following is the structure of the iris dataset:-
i) DECISION TREE:- After performing the decision tree algorithm, we got the following output
for the categorical variable ‘Species’ with respect to all the continuous variables:-
Fig : Decision Tree
Here we split our decisions on the basis of Petal Length and Petal Width. We first decide
whether the petal length is more than or less than 2.5; if the decision is ‘no’, we further check
whether the petal width is smaller or greater than 1.8 to get a sufficiently homogeneous
partition.
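A minimal sketch of how such a tree can be produced with rpart (the exact split values depend on rpart's defaults):

```r
# Decision tree on the built-in iris data
library(rpart)
library(rpart.plot)

fit <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(fit)                                       # tree splits on Petal.Length / Petal.Width
predict(fit, iris[c(1, 51, 101), ], type = "class")   # one row from each species
```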
ii) Naive Bayes Classification:- We will perform Naive Bayes Classification on the same Iris
dataset. For this we have imported the ‘e1071’ library. Here, we build a model using training
and testing datasets. We have renamed the columns as "sepal_length", "sepal_width",
"petal_length", "petal_width", "class".
After performing the algorithm for naive bayes, we get the following output:-
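A minimal sketch of the workflow described above (the 70/30 split and the random seed are assumptions for illustration):

```r
# Naive Bayes on iris with renamed columns (e1071)
library(e1071)

df <- iris
names(df) <- c("sepal_length", "sepal_width",
               "petal_length", "petal_width", "class")

set.seed(42)                                   # hypothetical seed
idx   <- sample(nrow(df), 0.7 * nrow(df))      # assumed 70/30 train-test split
train <- df[idx, ]
test  <- df[-idx, ]

model <- naiveBayes(class ~ ., data = train)
pred  <- predict(model, test)
table(predicted = pred, observed = test$class) # confusion matrix on the test set
```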
iii) Support Vector Machine (SVM)
With the use of Support Vector Machine, we try to achieve the following two
classification goals simultaneously:
1. Maximize the margin
2. Correctly classify the data points
We applied the SVM technique on the Iris dataset and produced a model; after carrying out
all the required steps on it, we get the following result:-
                 precision  recall  f1-score  support
Iris-setosa           1.00    1.00      1.00       17
Iris-versicolor       1.00    1.00      1.00       16
Iris-virginica        1.00    1.00      1.00       12
avg / total           1.00    1.00      1.00       45
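A minimal sketch of an SVM fit in R that could produce such a result (the 105/45 split, the seed, and the linear kernel are assumptions; the report table above resembles a per-class precision/recall summary on a 45-row test set):

```r
# SVM classification on iris (e1071)
library(e1071)

set.seed(42)                                     # hypothetical seed
idx   <- sample(nrow(iris), 105)                 # assumed 105 train / 45 test rows
train <- iris[idx, ]
test  <- iris[-idx, ]

model <- svm(Species ~ ., data = train, kernel = "linear")
pred  <- predict(model, test)
table(predicted = pred, observed = test$Species)
```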
3. Clustering:-
3.1 Clustering Technique
Clustering is a method of data breakdown that partitions the data into several groups based on
their similarity.
We group the data through a statistical procedure. The smaller groups formed from the bigger
data are known as clusters. These clusters exhibit the following properties:
● They are discovered while carrying out the operation, and their number is not
known in advance.
● Clusters are accumulations of alike objects that share common characteristics.
Clustering is the most widespread and popular method of Data Analysis and Data Mining. It is
used in cases where the underlying input data has a colossal volume and we are tasked with
finding similar subsets that can be analyzed in several ways.
For example, a marketing company can categorize its customers based on their economic
background, age and several other factors to sell its products in a better way.
In different fields, R clustering has different names, such as:
● Marketing – In marketing, the terms ‘segmentation’ or ‘typological analysis’ are used
for clustering.
● Medicine – Clustering in medicine is known as nosology.
● Biology – It is referred to as numerical taxonomy in the field of Biology.
Defining the correct criteria for clustering and using efficient algorithms matters because the
number of possible partitions grows explosively:

B(n) (the number of partitions of n objects) > exp(n)

You can gauge the complexity of clustering by the number of possible combinations of
objects; the complexity of the cluster analysis depends on this number.
The basis for joining or separating objects is the distance between them. These distances are
dissimilarities (when objects are far from each other) or similarity (when objects are close by).
Methods for Measuring Distance between Objects
For calculating the distance between the objects in K-means, we make use of the following
types of methods:
● Euclidean Distance – It is the most widely used method for measuring the distance
between objects that are present in a multidimensional space. In general, for an
n-dimensional space, the distance between points x and y is

d(x, y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2)
● Squared Euclidean Distance – This is obtained by squaring the Euclidean Distance.
The objects that are present at further distances are assigned greater weights.
● City-Block (Manhattan) Distance – The sum of the absolute differences between two
points across all dimensions is calculated using this method. It is similar to
Euclidean Distance in many cases, but it reduces the effect of extreme objects,
since the coordinates are not squared.
The total inertia is the weighted sum of the squared distances of the points from the overall
centre of gravity. The sum of squares decomposes over the clusters as follows:

Total sum of squares = Between-cluster sum of squares + Within-cluster sum of squares

This decomposition is known as Huygens’ formula.
The between-cluster sum of squares is calculated by taking, for each cluster, the squared
difference between its centre of gravity and the overall centre of gravity, and adding these up.
The within-cluster sum of squares is calculated by taking the squared difference of each point
from the centre of gravity of its own cluster and summing these within each cluster. The
smaller this quantity, the more homogeneous the clusters.
R-squared (RSQ) is the proportion of the total sum of squares accounted for by the clusters
(the between-cluster share). The closer this proportion is to 1, the better the clustering.
However, the aim is not to maximize it at any cost, as that would lead to a greater number of
clusters. Therefore, we want an R-squared that is close to 1 but does not create too many
clusters; as we move from k to k+1 clusters, there should be a significant increase in the
value of R-squared.
Some of the properties of efficient clustering are:
● Detecting structures that are present in the data.
● Determining optimal clusters.
● Giving out readable differentiated clusters.
● Ensuring stability of cluster even with the minor changes in data.
● Efficient processing of the large volume of data.
● Handling different data types of variables.
Note: In the case of good clustering, the inter-class inertia (IR) is large, or equivalently the
intra-class inertia (IA) is small, when calculating the sums of squares.
Clustering is restarted only after we have performed data interpretation, transformation and
the exclusion of variables. An excluded variable is simply not taken into account during the
clustering operation; it becomes an illustrative variable.
Agglomerative Hierarchical Clustering
In the Agglomerative Hierarchical Clustering (AHC), sequences of nested partitions of n
clusters are produced. The nested partitions have an ascending order of increasing
heterogeneity. We use AHC if the distance is either in an individual or a variable space. The
distance between two objects or clusters must be defined while carrying out categorization.
The algorithm for AHC is as follows:
● We start with each observation as its own initial cluster.
● Next, we assess the distance between the clusters.
● We then merge the two most proximate clusters and replace them with a single
cluster.
● We repeat the previous two steps until only a single cluster remains.
AHC generates a type of tree called a dendrogram. By cutting this dendrogram at a chosen
height, we obtain the clusters.
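A minimal sketch of AHC in R, assuming the built-in iris measurements as example data:

```r
# Agglomerative hierarchical clustering: distances -> hclust -> dendrogram -> clusters
d  <- dist(iris[, 1:4], method = "euclidean")  # pairwise distances
hc <- hclust(d, method = "complete")           # complete (maximum-distance) linkage
plot(hc)                                       # dendrogram
clusters <- cutree(hc, k = 3)                  # cut the dendrogram into 3 clusters
table(clusters, iris$Species)
```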
Hierarchical Clustering is most widely used in identifying patterns in digital images, prediction
of stock prices, text mining, etc. It is also used for researching protein sequence classification.
1. Main Distances
● Maximum distance (complete linkage) – Here the distance between two clusters is
the greatest distance between their observations; it tends to produce clusters of
equal diameter.
● Minimum distance (single linkage) – Here the distance between two clusters is the
minimum distance between their observations; this is the nearest-neighbour AHC
method.
In good clustering, the minimum distance between points of different clusters should be
greater than the maximum distance between points within the same cluster.
2. Density Estimation
In density estimation, we detect the structure of the various complex clusters. The three
methods for estimating density in clustering are as follows:
● The k-nearest-neighbors method – The density at a point x is the number k of
observations nearest to x, divided by the volume of the sphere that contains them.
● The Uniform Kernel Method – In this, the radius is fixed but the number
of neighbors is not.
● The Wong Hybrid Method – We use this in the preliminary analysis.
Clustering by Similarity Aggregation
Clustering by Similarity Aggregation is also known as relational clustering, or the Condorcet
method.
With this method, we compare all the individual objects in pairs to build the global
clustering. The principle of equivalence relation exhibits three properties – reflexivity,
symmetry, and transitivity:
● Reflexivity => Mii = 1
● Symmetry => Mij = Mji
● Transitivity => Mij + Mjk − Mik <= 1
This type of clustering algorithm makes use of an intuitive approach. For a pair of individuals
(A, B), m(A, B) counts the attributes on which A and B take the same value, whereas d(A, B)
counts the attributes on which they take different values.
The two individuals A and B follow the Condorcet Criterion as follows:
c(A, B) = m(A, B)-d(A, B)
For an individual A and cluster S, the Condorcet criterion is as follows:
c(A,S) = Σic(A,Bi)
where the summation runs over all Bi ∈ S.
Under these conditions, we start constructing clusters by placing each individual A in the
cluster S for which c(A, S) is largest, with a minimum value of 0.
In the next step, we calculate the global Condorcet criterion by summing c(A, SA) over all
individuals A, where SA is the cluster that contains A.
K-Means Clustering in R
One of the most popular partitioning algorithms in clustering is k-means cluster analysis in R.
It is an unsupervised learning algorithm that tries to cluster data based on similarity. We
specify the number of clusters we want the data grouped into; the algorithm then assigns
each observation to a cluster and also finds the centroid of each cluster.
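A minimal sketch with R's built-in kmeans(), assuming k = 3 on the iris measurements:

```r
# k-means: assigns each observation to a cluster and finds the centroids
set.seed(42)                         # k-means uses random starting centroids
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$centers                           # centroid of each cluster
table(km$cluster, iris$Species)      # cluster assignment per observation
```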
3.2 Performance Measures
Contrary to supervised learning where we have the ground truth to evaluate the model’s
performance, clustering analysis doesn’t have a solid evaluation metric that we can use to
evaluate the outcome of different clustering algorithms. Moreover, since kmeans requires k as
input and doesn’t learn it from data, there is no right answer in terms of the number of clusters
that we should have in any problem. Sometimes domain knowledge and intuition may help but
usually, that is not the case. In the cluster-predict methodology, we can evaluate how well the
models are performing based on different K clusters since clusters are used in the downstream
modeling.
Here we cover two metrics that may give us some intuition about k:
· Elbow method
· Silhouette analysis
Elbow Method
The elbow method gives us an idea of what a good number of clusters k would be, based on
the sum of squared distances (SSE) between data points and their assigned clusters’
centroids. We pick k at the spot where the SSE starts to flatten out, forming an elbow. We’ll
use the geyser dataset, evaluate the SSE for different values of k, and see where the curve
forms an elbow and flattens out.
The graph above shows that k = 2 is not a bad choice. Sometimes it’s still hard to figure out a
good number of clusters, because the curve is monotonically decreasing and may not show
any elbow or an obvious point where it starts flattening out.
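A minimal sketch of the elbow method, using the built-in faithful (Old Faithful geyser) data as a stand-in for the geyser dataset mentioned above:

```r
# SSE (total within-cluster sum of squares) for k = 1..10
data <- scale(faithful)              # standardize eruption and waiting times
wss  <- sapply(1:10, function(k)
  kmeans(data, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SSE")
# look for the k where the curve bends into an elbow (here around k = 2)
```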
Silhouette Analysis
Silhouette analysis can be used to determine the degree of separation between clusters. For
each sample:
· Compute the average distance from all data points in the same cluster (ai).
· Compute the average distance from all data points in the closest other cluster (bi).
· Compute the coefficient:

si = (bi − ai) / max(ai, bi)

The coefficient can take values in the interval [-1, 1]:
· If it is 0 –> the sample is very close to the neighboring clusters.
· If it is 1 –> the sample is far away from the neighboring clusters.
· If it is -1 –> the sample is assigned to the wrong cluster.
Therefore, we want the coefficients to be as large as possible, close to 1, to have good
clusters. We’ll use the geyser dataset again because it’s cheaper to run the silhouette
analysis on, and it’s fairly obvious that there are most likely only two groups of data points.
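A minimal sketch using the cluster package on the same data (assuming k = 2):

```r
# Silhouette analysis for a k-means solution
library(cluster)

data <- scale(faithful)
km   <- kmeans(data, centers = 2, nstart = 25)
sil  <- silhouette(km$cluster, dist(data))
summary(sil)                         # average silhouette width per cluster
plot(sil)                            # coefficients near 1 = well-separated clusters
```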
3.3 Case Study
You are head of client insights and marketing at a telecommunications company, ConnectFast
Inc. You understand that not every client is alike, and you wish to have different strategies to
attract different customers. You appreciate the power of client segmentation to deliver
superior results at an optimized cost. You are also aware of unsupervised learning techniques,
such as cluster analysis, for creating client segments. To brush up your skills with cluster
analysis, you have selected a sample of eight customers with their average call durations
(both local and international). The following is the data:
To get a feel for it, you have plotted the data with average international call duration on one
axis and average local call duration on the other. The following is the plot:
Euclidean Distance to find Cluster Centroids
In this case, two centroids (C1 & C2) are randomly placed at the coordinates (1, 1) and (3, 4).
Why did we choose two centroids? For this problem, a visual guesstimate of the scatter plot
above tells us that there are two clusters. However, as we will see later in this series, this
question may not have such a straightforward answer for larger data sets.
Now, we will measure the distance between the two centroids (C1 & C2) and all the data
points on the scatter plot using the Euclidean measure. The Euclidean distance between two
points (x1, y1) and (x2, y2) is

d = sqrt((x1 − x2)^2 + (y1 − y2)^2)
Columns 3 and 4 (i.e. distance from C1 and distance from C2) are computed using this
formula. For instance, for the first customer the distance from C1 works out to 1.41 and the
distance from C2 to 2.24.
You can compute all the other values similarly. Furthermore, cluster membership (the last
column) is allocated based on closeness to the centroids (C1 and C2). The first customer is
closer to centroid 1 (1.41 in comparison to 2.24) and hence is assigned membership C1.
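As a quick check of this arithmetic in R (customer 1's coordinates of (2, 2) are an assumption, chosen to be consistent with the 1.41 and 2.24 values quoted above):

```r
p  <- c(2, 2)                 # customer 1 (assumed coordinates)
c1 <- c(1, 1); c2 <- c(3, 4)  # initial centroids
sqrt(sum((p - c1)^2))         # 1.414214 -> closer, so membership C1
sqrt(sum((p - c2)^2))         # 2.236068
```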
The following is the scatter plot with cluster centroids C1 and C2 (displayed as blue and
orange diamonds). The customers are marked with the colour of the centroid they are
closest to.
As we assigned the centroids arbitrarily, the second step is to move them iteratively. The
new position of a centroid is found by averaging the coordinates of its member points. For
the first centroid, customers 1, 2 and 3 are members, so the new x-coordinate for centroid
C1 is the average of their x-values, i.e. (2+1+1)/3 = 1.33. This gives new coordinates for C1 of
(1.33, 2.33) and for C2 of (4.4, 4.2). The new plot is shown below:
Finally, after one last iteration, the centroids settle at the centres of their clusters, as
displayed below:
The positions for our cluster centroids in this case turned out to be C1 (1.75, 2.25) and C2(4.75,
4.75).
4. Association:-
4.1 ASSOCIATION TECHNIQUES:
Association mining is commonly used to make product recommendations by identifying
products that are frequently bought together. One such association mining technique is the
Apriori algorithm.
Apriori Algorithm
● It proceeds by identifying the frequent individual items in the database and extending
them to larger and larger item sets as long as those item sets appear sufficiently often in
the database. The frequent itemsets determined by Apriori can be used to determine
association rules which highlight general trends in the database: this has applications in
domains such as market basket analysis.
● Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent
itemsets and relevant association rules. It is devised to operate on a database
containing a lot of transactions, for instance, items bought by customers in a store.
Association rules
Association rule learning is a prominent and a well-explored method for determining relations
among variables in large databases. Let us take a look at the formal definition of the problem of
association rules.
● Let I = {i1, i2, i3, …, in} be a set of n attributes called items and D = {t1, t2, …, tm} be the
set of transactions, called the database. Every transaction ti in D has a unique
transaction ID and consists of a subset of the items in I.
● A rule can be defined as an implication X ⟶ Y, where X and Y are subsets of I
(X, Y ⊆ I) and have no element in common, i.e., X ∩ Y = ∅. X and Y are the antecedent
and the consequent of the rule, respectively.
General Process of the Apriori algorithm
The entire algorithm can be divided into two steps:
Step 1: Apply minimum support to find all the frequent sets with k items in a database.
Step 2: Use the self-join rule to find the frequent sets with k+1 items with the help of frequent
k-itemsets. Repeat this process from k=1 to the point when we are unable to apply the self-join
rule.
This approach of extending a frequent itemset one at a time is called the “bottom up”
approach.
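A minimal sketch of the algorithm in R with the arules package (the transactions and thresholds are hypothetical):

```r
# Apriori on a toy market-basket database
library(arules)

baskets <- list(c("bread", "milk"),
                c("bread", "diaper", "beer"),
                c("milk", "diaper", "beer"),
                c("bread", "milk", "diaper", "beer"),
                c("bread", "milk", "diaper"))
trans <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6))
inspect(head(sort(rules, by = "lift")))   # strongest rules first
```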
Mining Association Rules
The Apriori algorithm has been looked at with respect to frequent itemset generation. There is
another way for which we can use this algorithm, i.e., finding association rules.
We need to find all rules having support more than the threshold support and confidence more
than the threshold confidence.
One possible approach is to list all the possible association rules, calculate the support and
confidence for each rule, and then eliminate the rules that fail the threshold support and
confidence. This is computationally prohibitive, as the number of possible association rules
increases exponentially with the number of items.
We can also use another way, which is called the two-step approach, to find efficient
association rules.
The two-step approach is:
Step 1: Frequent itemset generation: Find all itemsets for which the support is greater than the
threshold support following the process we have already seen earlier in this article.
Step 2: Rule generation: Create rules from each frequent itemset using binary partitions of
the frequent itemsets, and look for the ones with high confidence. These rules are called
candidate rules.
There are multiple rules possible even from a very small database, so in order to select the
interesting ones, we use constraints on various measures of interest and significance. We will
look at some of these useful measures such as support, confidence, lift and conviction.
4.2 Performance Measures :
1. Support
The support of an itemset X, supp(X), is the proportion of transactions in the database in
which the itemset X appears; it signifies the popularity of an itemset:

supp(X) = (number of transactions containing X) / (total number of transactions)
2. Confidence
Confidence signifies the likelihood of itemset Y being purchased when itemset X is purchased.
So, for the rule {Onion, Potato} => {Burger},

conf = supp({Onion, Potato, Burger}) / supp({Onion, Potato})

It can give some important insights, but it also has a major drawback: it only takes into
account the popularity of itemset X and not the popularity of Y. If Y is popular in its own
right, there is a higher probability that a transaction containing X will also contain Y, which
inflates the confidence. To overcome this drawback there is another measure called lift.
3. Lift
The lift of a rule is defined as:

lift = supp(X ∪ Y) / (supp(X) × supp(Y))

This signifies the likelihood of itemset Y being purchased when itemset X is purchased, while
taking into account the popularity of Y. If the value of lift is greater than 1, itemset Y is likely
to be bought together with itemset X, while a value less than 1 implies that itemset Y is
unlikely to be bought if itemset X is bought.
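As a toy numeric check of these definitions (the support values are hypothetical):

```r
supp_X  <- 0.4                       # supp({Onion, Potato})
supp_Y  <- 0.5                       # supp({Burger})
supp_XY <- 0.3                       # supp({Onion, Potato, Burger})

supp_XY / supp_X                     # confidence = 0.75
supp_XY / (supp_X * supp_Y)          # lift = 1.5 > 1: Y likely bought with X
```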
Pros of the Apriori algorithm
1. It is an easy-to-implement and easy-to-understand algorithm.
2. It can be used on large itemsets.
Cons of the Apriori Algorithm
1. Sometimes, it may need to generate a large number of candidate rules, which can
be computationally expensive.
2. Calculating support is also expensive because it has to go through the entire database.
Consider the following example:
Given is a set of transaction data. You can see transactions numbered 1 to 5. Each transaction
shows the items bought in that transaction. You can see that Diaper is bought with Beer in
three transactions. Similarly, Bread is bought with Milk in three transactions, making them
both frequent itemsets. Association rules are given in the form below:
A=>B[Support,Confidence]
The part before => is referred to as if (Antecedent) and the part after => is referred to as then
(Consequent).
Where A and B are sets of items in the transaction data. A and B are disjoint sets.
In the following section you will learn about the basic concepts of Association Rule Mining:
Basic Concepts of Association Rule Mining
1. Itemset: Collection of one or more items. K-item-set means a set of k items.
2. Support Count: Frequency of occurrence of an item-set
3. Support (s): Fraction of transactions that contain the item-set 'X'
● For a rule A=>B, support is given by:

Support(A=>B) = P(AUB)

Note: P(AUB) is the probability of A and B occurring together. P denotes probability.
Go ahead, try finding the support for Milk=>Diaper as an exercise.
4. Confidence (c): For a rule A=>B, confidence shows the percentage of transactions in
which B is bought along with A:

Confidence(A=>B) = (number of transactions with both A and B) / (number of transactions
with A) = P(AUB) / P(A)
Now find the confidence for Milk=>Diaper.
Note: Support and confidence measure how interesting a rule is. The minimum support and
minimum confidence thresholds, set by the client, help to compare rule strength according to
your own or the client's needs; rules that clear these thresholds are the ones of use to the
client.
5. Frequent Itemsets: Item-sets whose support is greater than or equal to the minimum
support threshold (min_sup). In the above example min_sup = 3; this is set by user
choice.
6. Strong rules: If a rule A=>B [Support, Confidence] satisfies min_sup and
min_confidence, then it is a strong rule.
7. Lift: Lift gives the correlation between A and B in the rule A=>B. Correlation shows
how one item-set A affects the item-set B.
For example, for the rule {Bread}=>{Milk}, lift is calculated as:

lift = Support({Bread}=>{Milk}) / (Support({Bread}) × Support({Milk}))

● If the rule has a lift of 1, then A and B are independent and no rule can be derived
from them.
● If the lift is > 1, then A and B are dependent on each other, and the degree of
dependence is given by the lift value.
● If the lift is < 1, then the presence of A has a negative effect on B.
Goal of Association Rule Mining
When you apply Association Rule Mining on a given set of transactions T, your goal will be to
find all rules with:
1. Support greater than or equal to min_support
2. Confidence greater than or equal to min_confidence
5. Conclusion and Learning
The project has given us an overall perspective on the various methods of data analytics and
their applications depending on the requirements.
There are two types of learning: supervised and unsupervised. Supervised learning is about
training the model on the majority of the data and carrying out analysis on the remaining test
data. The major difference between supervised and unsupervised learning is the presence of
historical data: in supervised learning there is historical data for training the model, whereas
in unsupervised learning there is none.
Classification is a supervised learning method; we understood that it is used to forecast
future outcomes, and that the accuracy of such models can be checked using a confusion
matrix. Clustering, by contrast, is an unsupervised learning method used for exploratory data
mining. It is used when we have groups of people with similar characteristics who can be
dealt with together.
Association is used when we have to establish relationships among the probabilities of
various data items and use those relationships for business applications such as cross-selling.
As it helps in studying purchase behaviour, it is also known as market basket analysis.
Reference
❖ https://www.kdnuggets.com/2016/07/burtchworks-sas-r-python-analytics-pros-prefer.html
❖ https://www.investopedia.com/terms/d/data-analytics.asp
❖ https://www.edureka.co/blog/what-is-data-analytics/
❖ https://data-flair.training/blogs/classification-in-r/

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Data Analytics Using R - Report

Application of Classification Algorithms
● Classification of spam emails.
● Predicting whether a bank customer will repay a loan or not.
● Identification of cancerous tumor cells.
● Sentiment analysis.
● Classification of drugs.
● Detection of facial keypoints.

2.1 Classification techniques:-
● Decision Trees – These are organized in the form of sets of questions and answers in a tree structure.
● Naive Bayes Classifiers – A probabilistic machine learning model used for classification.
● Support Vector Machines – A non-probabilistic binary linear classification model used to classify a case into one of two categories.
(i) Decision Tree:-
It is a kind of supervised-learning algorithm. Here, we split the population into two or more homogeneous data sets. The Decision Tree is a very powerful non-linear classification tool. A Decision Tree makes use of a tree-like structure to generate relationships among various features or parameters and potential outcomes. It makes use of branching decisions as its core structure. Following is the structure of a decision tree: -

Fig: Structure of a decision tree

Here, the root node represents the entire population or sample set. It then gets divided into two or more homogeneous sets of data. A Decision Tree is produced when any sub-node gets split into further sub-nodes. A Leaf/Terminal Node does not split further. The process of removing the sub-nodes of a decision node is called pruning. A Branch/Sub-Tree is a subsection of the entire tree.
Two types of Decision Tree
1. Categorical (Classification) Variable Decision Tree: A Decision Tree whose target variable is categorical.
2. Continuous (Regression) Variable Decision Tree: A Decision Tree whose target variable is continuous.

Advantages of Decision Tree in R
● Easy to understand and interpret: No statistical knowledge is needed to read a tree.
● Less data cleaning required: Compared to some other modeling techniques, it requires little data preparation.
● Data type is not a constraint: It handles both numerical and categorical variables.
● Handles non-linearity.
● It is possible to validate a model using statistical tests.

Disadvantages of Decision Tree in R
● Overfitting: One of the most practical difficulties for Decision Tree models. It can be addressed in R by setting constraints on model parameters and by pruning.
● Not ideal for continuous variables: Whenever a Decision Tree buckets continuous numerical variables into categories, it loses information.
● Learning a globally optimal tree is NP-hard, so algorithms rely on greedy search.
● Complex "if-then" relationships between features inflate tree size. Example – XOR gate, multiplexer.
(ii) Naïve Bayes Classification:-
We use Bayes' theorem to make the prediction. It is based on prior knowledge and current evidence. Bayes' theorem is expressed by the following equation:

P(A|B) = P(B|A) × P(A) / P(B)

where P(A) and P(B) are the probabilities of events A and B without regard to each other, P(A|B) is the probability of A conditional on B, and P(B|A) is the probability of B conditional on A.

(iii) Support Vector Machine (SVM):-
We use it to find the optimal hyperplane (a line in 2D, a plane in 3D and a hyperplane in more than 3 dimensions) that maximizes the margin between two classes. Support Vectors are the observations that support the hyperplane on either side. SVM solves a linear optimization problem to find the hyperplane with the largest margin. We use the "Kernel Trick" to separate instances that are not linearly separable.

Advantages of SVM in R
● With the kernel trick, it performs very well on non-linearly separable data.
● SVM works well in high-dimensional space and for text or image classification.
● It does not suffer from the multicollinearity problem.
Disadvantages of SVM in R
● It takes more time on large data sets.
● SVM does not return probability estimates.
● In the case of linearly separable data, it behaves almost like logistic regression.

Applications of Classification in R
● An emergency room in a hospital measures 17 variables (blood pressure, age and many more) of newly admitted patients. A careful decision has to be made as to whether a patient should be admitted to the ICU. Due to the high cost of the ICU, patients who are likely to survive more than a month are given high priority. The problem is to predict high-risk patients and to discriminate them from low-risk patients.
● A credit company receives hundreds of thousands of applications for new cards. Each application contains information about several attributes. The problem is to categorize applicants into those with good credit, bad credit, or those who fall into a grey area.
● Astronomers have been cataloguing distant objects in the sky using long-exposure CCD images. Each object needs to be labelled as a star, galaxy, etc. The data is noisy and the images are very faint, so the cataloguing can take decades to complete.

2.2 Performance Measures:-
i) Confusion matrix
The R function table() can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified. It compares the observed and predicted outcome values and shows the number of correct and incorrect predictions categorized by type of outcome.
Fig : Confusion Matrix

A. True positives: cases in which we predicted the individuals would be diabetes-positive and they were.
B. True negatives: we predicted diabetes-negative, and the individuals were diabetes-negative.
C. False positives: we predicted diabetes-positive, but the individuals did not actually have diabetes. (Also known as a Type I error.)
D. False negatives: we predicted diabetes-negative, but they did have diabetes. (Also known as a Type II error.)
E. Precision: the proportion of true positives among all the individuals predicted to be diabetes-positive by the model. This represents the accuracy of a predicted positive outcome. Precision = TruePositives/(TruePositives + FalsePositives).
F. Sensitivity (or Recall): the True Positive Rate (TPR), i.e. the proportion of identified positives among the diabetes-positive population (class = 1). Sensitivity = TruePositives/(TruePositives + FalseNegatives).
G. Specificity: measures the True Negative Rate (TNR), i.e. the proportion of identified negatives among the diabetes-negative population (class = 0). Specificity = TrueNegatives/(TrueNegatives + FalsePositives).
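These measures can be computed directly in R. Below is a minimal sketch using the table() function, assuming two hypothetical 0/1 vectors of observed and predicted outcomes purely for illustration:

# Hypothetical observed and predicted outcomes (1 = positive, 0 = negative)
observed  <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)

# Confusion matrix: predictions in rows, observations in columns
cm <- table(Predicted = predicted, Observed = observed)
print(cm)

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

precision   <- TP / (TP + FP)   # accuracy of predicted positives
sensitivity <- TP / (TP + FN)   # recall / true positive rate
specificity <- TN / (TN + FP)   # true negative rate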
2.3 Case:-
In this assignment, for the classification techniques, we take the case of the "iris" data set, which is a built-in dataset of R; for this we need to import the rpart and rpart.plot libraries. Following is the structure of the iris dataset:-

i) Decision Tree:-
After performing the decision tree algorithm, we got the following output for the categorical variable 'Species' with respect to all the continuous variables:-
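For reference, a minimal sketch of how such a tree can be fitted and plotted with rpart and rpart.plot on the built-in iris data; the exact settings used for the figure are not shown in the report, so defaults are assumed:

library(rpart)
library(rpart.plot)

str(iris)   # structure of the built-in iris dataset

# Fit a classification tree for Species using all continuous predictors
fit <- rpart(Species ~ ., data = iris, method = "class")

# Plot the fitted tree
rpart.plot(fit)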
Fig : Decision Tree

Here we split our decisions on the basis of Petal Length and Petal Width. We first decide whether the petal length is more than or less than 2.5; if the decision is 'no', we further check whether the petal width is smaller or greater than 1.8 to get a sufficiently homogeneous partition.

ii) Naive Bayes Classification:-
We will perform Naive Bayes Classification on the same iris dataset. For Naive Bayes Classification we have imported the 'e1071' library. Here, we build a model using training and testing datasets. We have renamed the columns as "sepal_length", "sepal_width", "petal_length", "petal_width", "class". After running the naive bayes algorithm, we get the following output:-
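A minimal sketch of how such a Naive Bayes model can be built with the e1071 library; the exact train/test split used in the report is not shown, so a 70/30 split is assumed here:

library(e1071)

# Rename the iris columns as in the report
df <- iris
colnames(df) <- c("sepal_length", "sepal_width",
                  "petal_length", "petal_width", "class")

# Hypothetical 70/30 train/test split
set.seed(42)
idx   <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

# Fit the Naive Bayes model and predict on the test set
model <- naiveBayes(class ~ ., data = train)
pred  <- predict(model, newdata = test)

# Compare predicted vs. observed classes
table(Predicted = pred, Observed = test$class)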
iii) Support Vector Machine (SVM)
With the use of a Support Vector Machine, we try to achieve the following two classification goals simultaneously:
1. Maximize the margin
2. Correctly classify the data points

We applied SVM techniques on the iris dataset and produced a model; after carrying out all the required steps, we get the following result:-

                 precision  recall  f1-score  support
Iris-setosa          1.00    1.00      1.00       17
Iris-versicolor      1.00    1.00      1.00       16
Iris-virginica       1.00    1.00      1.00       12
avg / total          1.00    1.00      1.00       45
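For reference, a minimal sketch of how such an SVM model can be fitted in R with the e1071 library; the kernel and split behind the reported result are not shown, so a radial kernel and a 70/30 split are assumed:

library(e1071)

set.seed(42)
idx   <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit an SVM classifier (radial kernel assumed)
svm_fit <- svm(Species ~ ., data = train, kernel = "radial")
pred    <- predict(svm_fit, newdata = test)

# Per-class results can be read off this confusion matrix
table(Predicted = pred, Observed = test$Species)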
3. Clustering:-
3.1 Clustering Technique
Clustering is a method of data breakdown that partitions the data into several groups based on their similarity. We group the data through a statistical procedure. The smaller groups formed from the bigger data are known as clusters. These clusters exhibit the following properties:
● They are learned while carrying out the operation, and their number is not known in advance.
● Clusters are accumulations of alike objects that share common characteristics.

Clustering is the most widespread and popular method of Data Analysis and Data Mining. It is used in cases where the underlying input data has a colossal volume and we are tasked with finding similar subsets that can be analyzed in several ways.

For example – A marketing company can categorize its customers based on their economic background, age and several other factors in order to sell its products in a better way.

In different fields, R clustering has different names, such as:
● Marketing – In marketing, the terms 'segmentation' or 'typological analysis' are used for clustering.
● Medicine – Clustering in medicine is known as nosology.
● Biology – It is referred to as numerical taxonomy in the field of Biology.

To define the correct criteria for clustering and to make use of efficient algorithms, note that the number of possible partitions grows explosively: Bn (the number of partitions of n objects) > exp(n). You can determine the complexity of clustering by the number of possible combinations of objects. The basis for joining or separating objects is the distance between them. These distances are dissimilarities (when objects are far from each other) or similarities (when objects are close by).

Methods for Measuring Distance between Objects
For calculating the distance between the objects in K-means, we make use of the following types of methods:
● Euclidean Distance – The most widely used method for measuring the distance between objects in a multidimensional space. In general, for an n-dimensional space, the distance between points p and q is:

d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² )

● Squared Euclidean Distance – Obtained by squaring the Euclidean Distance; objects that lie at greater distances are assigned greater weights.
● City-Block (Manhattan) Distance – The sum of absolute differences between two points across all dimensions. It behaves similarly to Euclidean Distance in many cases, but it reduces the effect of extreme objects because the coordinates are not squared.
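These distance measures can be tried out directly with base R's dist() function. A minimal sketch on two hypothetical 3-dimensional points:

# Two hypothetical points p and q in 3-dimensional space
pts <- rbind(p = c(1, 2, 3), q = c(4, 6, 8))

dist(pts, method = "euclidean")     # sqrt(3^2 + 4^2 + 5^2)
dist(pts, method = "euclidean")^2   # squared Euclidean distance
dist(pts, method = "manhattan")     # |3| + |4| + |5|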
The inertia is the weighted mean of the squares of the distances of the points from the center of their assigned cluster, summed over the clusters. The Sum of Squares decomposes about the cluster centers as follows:

Total Sum of Squares = Between-Cluster Sum of Squares + Within-Cluster Sum of Squares

This decomposition is known as Huygens' Formula. The Between-Cluster Sum of Squares is calculated by evaluating the squared difference of each cluster's center of gravity from the overall center of gravity and adding these up. The Within-Cluster Sum of Squares is calculated by finding the squared difference of each point from the center of gravity of its cluster and adding these up within each cluster. As the clusters tighten, the partition becomes better.

R-squared (RSQ) delineates the proportion of the sum of squares that is explained by the clusters. The closer the proportion is to 1, the better the clustering. However, the aim is not to maximize it outright, as that would lead to a greater number of clusters. Therefore, we require an ideal R² that is close to 1 but does not create too many clusters. Note that as we move from k to k+1 clusters, there is always an increase in the value of R².

Some of the properties of efficient clustering are:
● Detecting structures that are present in the data.
● Determining the optimal number of clusters.
● Giving out readable, differentiated clusters.
● Ensuring stability of the clusters even with minor changes in the data.
● Efficient processing of large volumes of data.
● Handling variables of different data types.

Note: In the case of correct clustering, either the between-cluster inertia (IR) is large or the within-cluster inertia (IA) is small while calculating the sum of squares.

Clustering is only restarted after we have performed data interpretation, transformation and the exclusion of variables. While a variable is excluded, it is simply not taken into account during the clustering operation; it becomes an illustrative variable.

Agglomerative Hierarchical Clustering
In Agglomerative Hierarchical Clustering (AHC), sequences of nested partitions of n clusters are produced. The nested partitions are ordered by increasing heterogeneity. We use AHC if the distance is defined either in an individual or a variable space. The distance between two objects or clusters must be defined before carrying out the categorization.

The algorithm for AHC is as follows:
● We first observe the initial clusters.
● In the next step, we assess the distance between the clusters.
● We then merge the most proximate clusters together and replace them with a single cluster.
● We repeat from step 2 until only a single cluster remains.

AHC generates a type of tree called a dendrogram. By cutting this dendrogram, we obtain the clusters. Hierarchical Clustering is widely used in identifying patterns in digital images, prediction of stock prices, text mining, etc. It is also used for researching protein sequence classification.

1. Main Distances
● Maximum distance – Here the distance between two clusters is the greatest distance between their observed objects; this complete-linkage method tends to produce clusters of equal diameter.
● Minimum distance – Here the distance between two clusters is the minimum distance between their observations; this delineates the nearest-neighbor technique, or single-linkage AHC method. In this case, the minimum distance between points of different clusters is supposed to be greater than the maximum distance between points within the same cluster.
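A minimal sketch of AHC in R with hclust(); the report does not tie AHC to a particular dataset, so the built-in iris measurements are used here for illustration:

# Pairwise Euclidean distances between observations
d   <- dist(iris[, 1:4], method = "euclidean")

# Agglomerative clustering with maximum-distance (complete) linkage
ahc <- hclust(d, method = "complete")

plot(ahc)                        # dendrogram
clusters <- cutree(ahc, k = 3)   # cut the tree into 3 clusters
table(clusters, iris$Species)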
2. Density Estimation
In density estimation, we detect the structure of the various complex clusters. The three methods for estimating density in clustering are as follows:
● The k-nearest-neighbors method – The density at a point x is determined by the number k of observations centered on x, divided by the volume of the enclosing sphere.
● The Uniform Kernel Method – In this, the radius is fixed but the number of neighbors is not.
● The Wong Hybrid Method – We use this in the preliminary analysis.

Clustering by Similarity Aggregation
Clustering by Similarity Aggregation is known as relational clustering; it is also known by the name of the Condorcet method. With this method, we compare all the individual objects in pairs, which helps in building the global clustering. The principle of the equivalence relation exhibits three properties – reflexivity, symmetry, and transitivity:
● Reflexivity => Mii = 1
● Symmetry => Mij = Mji
● Transitivity => Mij + Mjk − Mik <= 1

This type of clustering algorithm makes use of an intuitive approach. A pair of individuals (A, B) is assigned the two vectors m(A, B) and d(A, B): m(A, B) counts the attributes on which A and B take the same value, whereas d(A, B) counts those on which they take different values. The two individuals A and B satisfy the Condorcet criterion as follows:

c(A, B) = m(A, B) − d(A, B)

For an individual A and a cluster S, the Condorcet criterion is:

c(A, S) = Σi c(A, Bi)

where the summation runs over all Bi ∈ S. With the above conditions, we start by constructing clusters that place each individual A in the cluster S for which c(A, S) is largest and at least 0. In the next step, we calculate the global Condorcet criterion by summing, over all individuals A, the criterion for the cluster SA that contains them.
K-Means Clustering in R
One of the most popular partitioning algorithms in clustering is the K-means cluster analysis in R. It is an unsupervised learning algorithm that tries to cluster data based on similarity. We specify the number of clusters we want the data to be grouped into. The algorithm assigns each observation to a cluster and also finds the centroid of each cluster.

3.2 Performance Measures
Contrary to supervised learning, where we have the ground truth to evaluate the model's performance, clustering analysis does not have a solid evaluation metric for comparing the outcomes of different clustering algorithms. Moreover, since k-means requires k as an input and does not learn it from the data, there is no single right answer for the number of clusters in any problem. Sometimes domain knowledge and intuition may help, but usually that is not the case. In the cluster-predict methodology, we can evaluate how well the models perform for different values of k, since the clusters are used in the downstream modeling.

Here we cover two metrics that may give us some intuition about k:
· Elbow method
· Silhouette analysis

Elbow Method
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared distances (SSE) between data points and their assigned clusters' centroids. We pick k at the spot where the SSE starts to flatten out and form an elbow. We will use the geyser dataset, evaluate the SSE for different values of k, and see where the curve forms an elbow and flattens out.
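A minimal sketch of the elbow method in R; the "geyser dataset" is assumed here to be the built-in faithful (Old Faithful) data:

# Standardize the geyser measurements
gdata <- scale(faithful)

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(gdata, centers = k, nstart = 25)$tot.withinss
})

# Look for the "elbow" in this curve
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")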
The graph above shows that k = 2 is not a bad choice. Sometimes it is still hard to figure out a good number of clusters, because the curve is monotonically decreasing and may not show any elbow, or may lack an obvious point where it starts flattening out.

Silhouette Analysis
Silhouette analysis can be used to determine the degree of separation between clusters. For each sample:
· Compute the average distance from all data points in the same cluster (ai).
· Compute the average distance from all data points in the closest other cluster (bi).
· Compute the coefficient:

s(i) = (bi − ai) / max(ai, bi)

The coefficient can take values in the interval [-1, 1]:
· If it is 0 –> the sample is very close to the neighboring clusters.
· If it is 1 –> the sample is far away from the neighboring clusters.
· If it is -1 –> the sample is assigned to the wrong cluster.

Therefore, we want the coefficients to be as large as possible, and close to 1, to have good clusters. We will use the geyser dataset again, because it is cheaper to run the silhouette analysis on it, and it is actually obvious that there are most likely only two groups of data points.
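A minimal sketch of silhouette analysis in R with the cluster package, again assuming the built-in faithful data as the geyser dataset:

library(cluster)

gdata <- scale(faithful)
km    <- kmeans(gdata, centers = 2, nstart = 25)

# Silhouette widths for the k = 2 solution
sil <- silhouette(km$cluster, dist(gdata))
mean(sil[, "sil_width"])   # average width; closer to 1 is better
plot(sil)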
3.3 Case Study
You are head of client insights and marketing at a telecommunications company, ConnectFast INC. You understand that not every client is alike, and you wish to have different strategies to attract different customers. You appreciate the power of customer segmentation to deliver superior results at an optimized cost. You are also aware of unsupervised learning techniques, like cluster analysis, for creating customer segments. To brush up your skills with cluster analysis, you have selected a sample of eight customers with their average call durations (both local and international). The following is the data:

To get a feel for it, you have plotted the data with the average international call duration on one axis and the average local call duration on the other. The following is the plot:

Euclidean Distance to find Cluster Centroids
In this case, two centroids (C1 & C2) are randomly placed at the coordinates (1, 1) and (3, 4). Why did we choose two centroids? For this problem, a visual guesstimate of the scatter plot above tells us that there are two clusters. However, as we will notice later in this series, this question may not have such a straightforward answer for larger data sets.

Now, we will measure the distance between the two centroids (C1 & C2) and all the data points on the above scatter plot using the Euclidean measure. The Euclidean distance between two points (x1, y1) and (x2, y2) is measured through the following formula:

d = sqrt( (x2 − x1)² + (y2 − y1)² )
Columns 3 and 4 (i.e. Distance from C1 and Distance from C2) are measured using the same formula. For instance, for the first customer, plotted at (2, 2):

Distance from C1 = sqrt( (2 − 1)² + (2 − 1)² ) = 1.41
Distance from C2 = sqrt( (2 − 3)² + (2 − 4)² ) = 2.24

You can measure all the other values similarly. Furthermore, cluster membership (the last column) is allocated using closeness to the centroids (C1 and C2). The first customer is closer to centroid 1 (1.41 in comparison to 2.24), hence it is assigned membership C1. The following is the scatter plot with cluster centroids C1 and C2 (displayed with blue and orange diamond shapes). The customers are marked with the colour of the centroid they are closest to.

As we arbitrarily assigned the centroids, the second step is to move them iteratively. The new position of each centroid is measured by taking the average of its member points. For the first centroid, customers 1, 2 and 3 are members. Hence, the new x-axis position for centroid C1 is the average x-value for these customers, i.e. (2+1+1)/3 = 1.33. We get the new coordinates for C1 equal to (1.33, 2.33) and for C2 equal to (4.4, 4.2). The new plot is shown below:
Finally, after one last iteration, the centroids settle at the centre of their clusters, as displayed below. The positions of our cluster centroids in this case turned out to be C1 (1.75, 2.25) and C2 (4.75, 4.75).
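The same case can be run end-to-end with R's kmeans(). The full table of eight customers is not reproduced in the text, so the values below are hypothetical stand-ins; only customer 1 at (2, 2) is stated explicitly:

# Hypothetical stand-in data for the eight customers:
# average local and international call durations.
customers <- data.frame(
  local = c(2, 1, 1, 3, 4, 5, 5, 6),
  intl  = c(2, 2, 3, 2, 4, 4, 5, 6)
)

set.seed(1)
km <- kmeans(customers, centers = 2, nstart = 10)

km$centers   # final centroid positions
km$cluster   # cluster membership of each customer

# Plot customers coloured by cluster, centroids as diamonds
plot(customers, col = km$cluster, pch = 19)
points(km$centers, col = 1:2, pch = 18, cex = 2)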
4. Association:-
4.1 Association Techniques:
Association mining is commonly used to make product recommendations by identifying products that are frequently bought together. One such association mining technique is the Apriori algorithm.

Apriori Algorithm
● It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets, as long as those item sets appear sufficiently often in the database. The frequent itemsets determined by Apriori can be used to derive association rules which highlight general trends in the database; this has applications in domains such as market basket analysis.
● The Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets and the relevant association rules. It is devised to operate on a database containing a lot of transactions, for instance, items bought by customers in a store.

Association rules
Association rule learning is a prominent and well-explored method for determining relations among variables in large databases. Let us take a look at the formal definition of the problem of association rules.
● Let I = {i1, i2, i3, …, in} be a set of n attributes called items and D = {t1, t2, …, tn} be a set of transactions, called the database. Every transaction ti in D has a unique transaction ID and consists of a subset of the items in I.
● A rule can be defined as an implication X ⟶ Y, where X and Y are subsets of I (X, Y ⊆ I) and they have no element in common, i.e., X ∩ Y = ∅. X and Y are the antecedent and the consequent of the rule.

General Process of the Apriori algorithm
The entire algorithm can be divided into two steps:
Step 1: Apply minimum support to find all the frequent sets with k items in the database.
Step 2: Use the self-join rule to find the frequent sets with k+1 items from the frequent k-itemsets. Repeat this process from k = 1 to the point where we are unable to apply the self-join rule.
This approach of extending a frequent itemset one item at a time is called the "bottom-up" approach.

Mining Association Rules
So far, the Apriori algorithm has been looked at with respect to frequent itemset generation. There is another task for which we can use this algorithm, i.e., finding association rules. We need to find all rules having support above the threshold support and confidence above the threshold confidence.

One possible way to do this is brute force: list all the possible association rules, calculate the support and confidence for each rule, and then eliminate the rules that fail the threshold support and confidence. This is very heavy and prohibitive, as the number of possible association rules increases exponentially with the number of items.

We can instead use the two-step approach to find association rules efficiently:
Step 1: Frequent itemset generation: Find all itemsets for which the support is greater than the threshold support, following the process we have already seen earlier.
Step 2: Rule generation: Create rules from each frequent itemset using binary partitions of the frequent itemsets and look for the ones with high confidence. These rules are called candidate rules.

Multiple rules are possible even from a very small database, so in order to select the interesting ones, we use constraints on various measures of interest and significance. We will look at some of these useful measures, such as support, confidence and lift.
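In R, both steps are handled by the apriori() function of the arules package. A minimal sketch on the package's built-in Groceries transactions, with assumed thresholds:

library(arules)
data(Groceries)   # built-in transactions dataset

# Run Apriori with assumed thresholds: 1% support, 50% confidence
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))

# Inspect the five rules with the highest lift
inspect(head(sort(rules, by = "lift"), 5))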
4.2 Performance Measures:

1. Support
The support of an itemset X, supp(X), is the proportion of transactions in the database in which the itemset X appears. It signifies the popularity of an itemset:

supp(X) = (number of transactions containing X) / (total number of transactions)

2. Confidence
Confidence signifies the likelihood of itemset Y being purchased when itemset X is purchased. So, for a rule such as {Onion, Potato} => {Burger}:

conf(X => Y) = supp(X ∪ Y) / supp(X)

It can give some important insights, but it also has a major drawback: it only takes into account the popularity of the itemset X and not the popularity of Y. If Y is as popular as X, there will be a higher probability that a transaction containing X will also contain Y, thus inflating the confidence. To overcome this drawback there is another measure called lift.

3. Lift
The lift of a rule is defined as:

lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y))

This signifies the likelihood of the itemset Y being purchased when itemset X is purchased, while taking into account the popularity of Y. If the value of lift is greater than 1, it means that itemset Y is likely to be bought with itemset X, while a value less than 1 implies that itemset Y is unlikely to be bought if itemset X is bought.

Pros of the Apriori algorithm
1. It is an easy-to-implement and easy-to-understand algorithm.
2. It can be used on large itemsets.

Cons of the Apriori algorithm
1. Sometimes it may need to generate a large number of candidate rules, which can be computationally expensive.
2. Calculating support is also expensive because it has to go through the entire database.
Consider the following example: given is a set of transaction data. You can see transactions numbered 1 to 5; each transaction shows the items bought in that transaction. You can see that Diaper is bought with Beer in three transactions. Similarly, Bread is bought with Milk in three transactions, making both frequent itemsets.

Association rules are given in the form:

A => B [Support, Confidence]

The part before => is referred to as the "if" (antecedent) and the part after => as the "then" (consequent), where A and B are sets of items in the transaction data, and A and B are disjoint sets.

In the following section you will learn about the basic concepts of Association Rule Mining.

Basic Concepts of Association Rule Mining
1. Itemset: A collection of one or more items. A k-itemset means a set of k items.
2. Support Count: The frequency of occurrence of an itemset.
3. Support (s): The fraction of transactions that contain the itemset X.
● For a rule A => B, support is given by:

Support(A => B) = P(A U B) = (number of transactions containing both A and B) / (total number of transactions)
Note: P(A U B) is the probability of A and B occurring together; P denotes probability. Go ahead, try finding the support for Milk => Diaper as an exercise.

4. Confidence (c): For a rule A => B, confidence shows the percentage of cases in which B is bought along with A: the number of transactions containing both A and B divided by the total number of transactions containing A.

Confidence(A => B) = (number of transactions containing both A and B) / (number of transactions containing A)

Now find the confidence for Milk => Diaper.

Note: Support and confidence measure how interesting a rule is. The minimum support and minimum confidence thresholds, set by the client, help to compare rule strength according to your own or the client's requirements; rules that clear these thresholds are of use to the client.

5. Frequent Itemsets: Itemsets whose support is greater than or equal to the minimum support threshold (min_sup). In the above example min_sup = 3; this is set at the user's choice.
6. Strong rules: If a rule A => B [Support, Confidence] satisfies min_sup and min_confidence, then it is a strong rule.
7. Lift: Lift gives the correlation between A and B in the rule A => B. Correlation shows how one itemset A affects the itemset B. For example, for the rule {Bread} => {Milk}, lift is calculated as:

Lift(Bread => Milk) = Support(Bread U Milk) / (Support(Bread) × Support(Milk))

● If the rule has a lift of 1, then A and B are independent and no rule can be derived from them.
● If the lift is > 1, then A and B are dependent on each other, and the degree of dependence is given by the lift value.
● If the lift is < 1, then the presence of A has a negative effect on B.

Goal of Association Rule Mining
When you apply Association Rule Mining on a given set of transactions T, your goal will be to find all rules with:
1. Support greater than or equal to min_support
2. Confidence greater than or equal to min_confidence

5. Conclusion and Learning
The whole project has given us an overall perspective of the various methods of data analytics and their applications depending on the requirements.
There are two types of learning: supervised and unsupervised. Supervised learning is about training the model on the majority of the data and carrying out analysis on the remaining test data. The major difference between supervised and unsupervised learning is the presence of historical data: in the case of supervised learning there is historical data for training the model, whereas in the case of unsupervised learning there is none.

Classification is one of the supervised learning methods, and we understood that it is used for forecasting future outcomes; the accuracy of such models can be checked using a confusion matrix. Similarly, clustering is one of the unsupervised learning methods and is used for exploratory data mining: it is applied when we have groups of people with similar characteristics who can be dealt with together. Association is used when we have to establish relations among the probabilities of various data items and use those relationships for business applications like cross-selling; as it helps in studying purchase behaviour, it is also known as market basket analysis.

Reference
❖ https://www.kdnuggets.com/2016/07/burtchworks-sas-r-python-analytics-pros-prefer.html