Data Mining: an Introduction

DATA MINING AND MACHINE LEARNING
IN A NUTSHELL

AN INTRODUCTION TO DATA MINING
Mohammad-Ali Abbasi
http://www.public.asu.edu/~mabbasi2/

SCHOOL OF COMPUTING, INFORMATICS, AND DECISION SYSTEMS ENGINEERING
ARIZONA STATE UNIVERSITY

http://dmml.asu.edu/

INTRODUCTION

• The rate of data production has increased
dramatically (Big Data), and we are able to store
much more data than before
– E.g., purchase data, social media data, mobile
phone data
• Businesses and customers need useful,
actionable knowledge and want to gain insight from
raw data for various purposes
– It’s not just searching data or databases

Data mining helps us to extract new information and uncover
hidden patterns out of the stored and streaming data

DATA MINING

The process of discovering hidden patterns in large data sets
It utilizes methods at the intersection of artificial intelligence, machine learning,
statistics, and database systems

• Extracting or “mining” knowledge from large
amounts of data, or big data
• Data-driven discovery and modeling of hidden
patterns in big data
• Extracting implicit, previously unknown,
unexpected, and potentially useful
information/knowledge from data


DATA MINING STORIES

• “My bank called and said that they saw that I bought
two surfboards at Laguna Beach, California.” - credit
card fraud detection

• The NSA is using data mining to analyze telephone
call data to track al-Qaeda activities

• Walmart uses data mining to control product
distribution based on typical customer buying
patterns at individual stores


DATA MINING VS. DATABASES

• Data mining is the process of extracting
hidden and actionable patterns from data
• Database systems store and manage data
– Queries return part of stored data
– Queries do not extract hidden patterns
• Examples of querying databases
– Find all employees with income more than $250K
– Find top spending customers in last month
– Find all students from engineering college with
GPA more than average

EXAMPLES OF DATA MINING APPLICATIONS

• Identifying fraudulent transactions of a credit card
or spam emails
– You are given a user’s purchase history and a new
transaction, identify whether the transaction is fraud
or not;
– Determine whether a given email is spam or not
• Extracting purchase patterns from existing records
– beer ⇒ diapers (80%)
• Forecasting future sales and needs according to
some given samples
• Extracting groups of like-minded people in a given
network

BASIC DATA MINING TASKS

• Classification
– Assign data into predefined classes
• Spam Detection, fraudulent credit card detection
• Regression
– Predict a real value for a given data instance
• Predict the price for a given house
• Clustering
– Group similar items together into some clusters
• Detect communities in a given social network


DATA


DATA INSTANCES

• A collection of properties and features related
to an object or person
– A patient’s medical record
– A user’s profile
– A gene’s information
• Instances are also called examples, records,
data points, or observations
Data Instance:

Features or Attributes Class Label

DATA TYPES

• Nominal (categorical)
– No comparison is defined
– E.g., {male, female}
• Ordinal
– Comparable but the difference is not defined
– E.g., {Low, medium, high}
• Interval
– Subtraction and addition are defined, but not division
– E.g., 3:08 PM, calendar dates
• Ratio
– All arithmetic operations, including division, are defined
– E.g., height, weight, money quantities


SAMPLE DATASET
outlook temperature humidity windy play
sunny 85 85 FALSE no
sunny 80 90 TRUE no
overcast 83 86 FALSE yes
rainy 70 96 FALSE yes
rainy 65 70 TRUE no
overcast 64 65 TRUE yes
sunny 69 70 FALSE yes
sunny 75 70 TRUE yes
rainy 71 91 TRUE no

Interval Ordinal Nominal


DATA QUALITY

When preparing data for data mining
algorithms, data quality needs to be assured
• Noise
– Noise is the distortion of the data
• Outliers
– Outliers are data points that are considerably different
from other data points in the dataset
• Missing Values
– Missing feature values in data instances
• Duplicate data


DATA PREPROCESSING

• Aggregation
– When multiple attributes need to be combined into a
single attribute or when the scale of the attributes changes
• Discretization
– From continuous values to discrete values
• Feature Selection
– Choose relevant features
• Feature Extraction
– Creating a mapping of new features from original features
• Sampling
– Random Sampling
– Sampling with or without replacement
– Stratified Sampling


CLASSIFICATION


CLASSIFICATION

Learning patterns from labeled data and classifying
new data with labels (categories)
– For example, we want to classify an e-mail as
"legitimate" or "spam"
Classifier


CLASSIFICATION: THE PROCESS

• In classification, we are given a set of labeled
examples
• These examples are records/instances in the
format (x, y) where x is a vector and y is the
class attribute, commonly a scalar
• The classification task is to build a model that
maps x to y
• Our task is to find a mapping f such that f(x) = y


CLASSIFICATION: THE PROCESS


CLASSIFICATION: AN EMAIL EXAMPLE

• A set of emails is given where
users have manually identified
spam versus non-spam
• Our task is to use a set of
features such as words in the
email (x) to identify spam/non-
spam status of the email (y)
• In this case, classes are
y = {spam, non-spam}
• How would this be handled in
a social setting?

CLASSIFICATION ALGORITHMS

• Decision tree learning

• Naive Bayes learning

• K-nearest neighbor classifier


DECISION TREE

• A decision tree is learned from the dataset
(training data with known classes) and later
applied to predict the class attribute value of
new data (test data with unknown classes)
where only the feature values are known


DECISION TREE INDUCTION


ID3, A DECISION TREE ALGORITHM

Use information gain (based on entropy) to determine
how well an attribute separates the training
data according to the class attribute value

For a dataset D with two classes, the entropy is

$$H(D) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$

– p+ is the proportion of positive examples in D
– p- is the proportion of negative examples in D

In a dataset containing ten examples, 7 have a positive class
attribute value and 3 have a negative class attribute value [7+, 3-]:

$$H(D) = -0.7 \log_2 0.7 - 0.3 \log_2 0.3 \approx 0.881$$

If the numbers of positive and negative examples in the set are equal, then the entropy is 1
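
The entropy of a two-class set, -p+ log2(p+) - p- log2(p-), can be checked numerically. The sketch below is only an illustration: the function name entropy and its count-based signature are assumptions, not part of the slides.

```python
import math

def entropy(positives: int, negatives: int) -> float:
    """Binary entropy (in bits) of a set with the given class counts."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count:  # by convention, 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

h_example = entropy(7, 3)   # the [7+, 3-] set from the slide
h_balanced = entropy(5, 5)  # equal numbers of positives and negatives
```

Here h_example comes out at roughly 0.881 and h_balanced at 1, matching the two statements on the slide.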

DECISION TREE: EXAMPLE 1
outlook temperature humidity windy play
sunny 80 90 TRUE no
rainy 65 70 TRUE no
sunny 69 70 FALSE yes
sunny 75 70 TRUE yes
rainy 71 91 TRUE no


DECISION TREE: EXAMPLE 2

Class Labels

Learned Decision Tree 1 Learned Decision Tree 2

NAIVE BAYES CLASSIFIER

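For two random variables X and Y, Bayes'
theorem states that

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$

where Y is the class variable and X = (x1, x2, …, xm) are the instance features

Then the class attribute value for instance X is the value y that maximizes P(y | X)

Assuming that the variables
are independent given the class,

$$P(y \mid X) = \frac{P(y) \prod_{j=1}^{m} P(x_j \mid y)}{P(X)}$$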
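
As a rough sketch of the idea, the classifier below scores each class y by P(y) multiplied by the per-feature probabilities P(xj | y) and returns the highest-scoring class. It uses only the outlook and windy columns of the sample weather dataset. The names rows, class_counts, feature_counts, and predict are illustrative, and no smoothing is applied, which a practical implementation would usually add.

```python
from collections import Counter, defaultdict

# ((outlook, windy), play) pairs taken from the sample weather dataset
rows = [
    (("sunny", "FALSE"), "no"),
    (("sunny", "TRUE"), "no"),
    (("overcast", "FALSE"), "yes"),
    (("rainy", "FALSE"), "yes"),
    (("rainy", "TRUE"), "no"),
    (("overcast", "TRUE"), "yes"),
    (("sunny", "FALSE"), "yes"),
    (("sunny", "TRUE"), "yes"),
    (("rainy", "TRUE"), "no"),
]

class_counts = Counter(label for _, label in rows)
# feature_counts[label][position][value] = number of matching training rows
feature_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in rows:
    for position, value in enumerate(features):
        feature_counts[label][position][value] += 1

def predict(features):
    """Return the class y that maximizes P(y) * prod_j P(x_j | y)."""
    scores = {}
    for label, count in class_counts.items():
        score = count / len(rows)  # prior P(y)
        for position, value in enumerate(features):
            # conditional P(x_j | y), estimated by counting (no smoothing)
            score *= feature_counts[label][position][value] / count
        scores[label] = score
    return max(scores, key=scores.get)
```

For instance, predict(("sunny", "TRUE")) returns "no" and predict(("overcast", "FALSE")) returns "yes" on this data. P(X) is omitted because it is the same for every class.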


NBC: AN EXAMPLE


NEAREST NEIGHBOR CLASSIFIER

• k-nearest neighbor employs the neighbors of a
data point to perform classification
• The instance being classified is assigned the
label held by the majority of its k nearest neighbors
• When k = 1, the closest neighbor’s label is
used as the predicted label for the instance
being classified
• For determining the neighbors, distance is
computed based on some distance metric,
e.g., Euclidean distance

K-NN: ALGORITHM

1. The dataset, the number of neighbors (k), and
the instance i are given
2. Compute the distance between i and all
other data points in the dataset
3. Pick the k closest neighbors
4. The class label for the data point i is the one
that the majority holds (if more than one class
ties for the majority, select one of them randomly)
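
The four steps above can be sketched in a few lines. This is a minimal illustration that assumes numeric features and Euclidean distance. The name knn_predict and the sample points are made up for the example, and ties are broken deterministically in favor of the closer neighbor rather than randomly.

```python
import math
from collections import Counter

def knn_predict(dataset, query, k):
    """dataset: list of (feature_vector, label) pairs.
    Returns the majority label among the k points closest to query."""
    by_distance = sorted(dataset, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # most_common orders equal counts by first occurrence, so a tie goes to
    # the label seen first among the nearest neighbors (not a random choice)
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points, labeled as in the triangle/rectangle illustration
points = [
    ((1, 1), "triangle"), ((1, 2), "triangle"), ((2, 1), "triangle"),
    ((3, 3), "rectangle"), ((5, 5), "rectangle"),
    ((5, 6), "rectangle"), ((6, 5), "rectangle"),
]

label_k1 = knn_predict(points, (2.5, 2.5), k=1)
label_k3 = knn_predict(points, (2.5, 2.5), k=3)
```

With these points, k = 1 yields "rectangle" while k = 3 yields "triangle", which shows how the choice of k can change the prediction.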


K-NEAREST NEIGHBOR: EXAMPLE

[Figure: a green circle with an unknown class label, with neighborhoods drawn for k = 3, k = 5, and k = 10]

• Depending on the k, different labels can be predicted for the green circle
• In our example k = 3 and k = 5 generate different labels for the instance
• With k = 10, the vote is tied, so we can choose either triangle or rectangle

K-NEAREST NEIGHBOR: EXAMPLE

Similarity between row 8 and other data instances;
(Similarity = 1 if attributes have the same value, otherwise similarity = 0)
Data instance Outlook Temperature Humidity Similarity Label K Prediction
2 1 1 1 3 N 1 N
1 1 0 1 2 N 2 N
4 0 1 1 2 Y 3 N
3 0 0 1 1 Y 4 ?
5 1 0 0 1 Y 5 Y
6 0 0 0 0 N 6 ?
7 0 0 0 0 Y 7 Y

EVALUATING CLASSIFICATION PERFORMANCE

• As the class labels are discrete, we can measure the
accuracy by dividing number of correctly predicted
labels (C) by the total number of instances (N)
• Accuracy = C/N
• Error rate = 1 - Accuracy
• More sophisticated approaches of evaluation will be
discussed later


REGRESSION


REGRESSION

Regression analysis includes techniques of
modeling and analyzing the relationship
between a dependent variable and one or more
independent variables
• Regression analysis is widely used for
prediction and forecasting
• It can be used to infer
relationships between
the independent and
dependent variables

REGRESSION
In regression, we deal with real numbers as class
values (Recall that in classification, class values
or labels are categories)
y ≈ f(X)

Dependent variable: y ∈ R
Regressors: x1, x2, …, xm

Our task is to find the relation between y and the vector
(x1, x2, …, xm)

LINEAR REGRESSION

In linear regression, we assume the relation
between the class attribute Y and feature set X
to be linear

$$Y = XW + \epsilon$$

where W represents the vector of regression
coefficients and ε is the error term
• The problem of regression can be solved by
estimating W using the provided dataset
and the labels Y
– Least squares is often used to solve the
problem

SOLVING LINEAR REGRESSION PROBLEMS

• The problem of regression can be solved by
estimating W using the dataset provided
and the labels Y
– “Least squares” is a popular method to solve
regression problems


LEAST SQUARES

Find W that minimizes ‖Y − XW‖² for
regressors X and labels Y


LEAST SQUARES


REGRESSION COEFFICIENTS

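• When there is only one independent variable:
y = w0 + w1x, with least-squares estimates

$$w_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$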

• Two independent variables
y = w0 + w1x1 + w2x2


LINEAR REGRESSION: EXAMPLE

Years of experience    Salary ($K)

3 30
8 57
9 64
13 72
3 36
6 43
11 59
21 90
1 20
16 83
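
As an illustration, the line y = w0 + w1x can be fitted to the table above with the standard closed-form least-squares estimates for a single regressor. The function name fit_line and the variable names are assumptions made for this sketch.

```python
years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # in $K

def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w1 = s_xy / s_xx            # slope
    w0 = y_mean - w1 * x_mean   # intercept
    return w0, w1

w0, w1 = fit_line(years, salary)
predicted = w0 + w1 * 10  # estimated salary ($K) for 10 years of experience
```

On this data the slope is about 3.54 and the intercept about 23.2, so each additional year of experience corresponds to roughly $3.5K more salary.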


EVALUATING REGRESSION PERFORMANCE

• The labels cannot be predicted precisely
• We need to set a margin (tolerance) to accept or reject
the predictions
– For example, when the observed temperature is
71, any prediction in the range of 71±0.5 can be
considered a correct prediction


CLUSTERING


CLUSTERING

Grouping together items that are similar in
some way – according to some criteria

• Clustering is a form of unsupervised learning
– The clustering algorithms do not have examples
showing how the samples should be grouped
together
• The clustering algorithms look for patterns or
structures in the data that are of interest
• Clustering algorithms group together similar
items

CLUSTERING: EXAMPLE


MEASURING SIMILARITY IN CLUSTERING ALGORITHMS

• The goal is to group together similar items
• Different similarity measures can be used to
find similar items
• Usually similarity measures are critical to
clustering algorithms

The most popular (dis)similarity measures for
continuous features are Euclidean distance and
Pearson linear correlation


EUCLIDEAN DISTANCE – A DISSIMILARITY MEASURE

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

• Here n is the number of dimensions in the
data vector


PEARSON LINEAR CORRELATION
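$$r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$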

• We’re shifting the expression profiles down (subtracting the means) and scaling
by the standard deviations (i.e., making the data have mean = 0 and std = 1)
• Always between –1 and +1 (perfectly anti-correlated and perfectly correlated)
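
Both measures are short to compute directly. The sketch below uses plain Python lists. The function names euclidean and pearson and the sample vectors are illustrative only.

```python
import math

def euclidean(xs, ys):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

def pearson(xs, ys):
    """Pearson linear correlation between two equal-length vectors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - x_mean) ** 2 for x in xs) * sum((y - y_mean) ** 2 for y in ys)
    )
    return numerator / denominator

distance = euclidean([1, 1], [4, 5])
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly correlated
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly anti-correlated
```

Note that pearson is undefined (division by zero) when either vector is constant. A production version would need to handle that case.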


SIMILARITY MEASURES: MORE DEFINITIONS


CLUSTERING

• Distance-based algorithms

– K-Means

• Hierarchical algorithms


K-MEANS

k-means clustering aims to partition n
observations into k clusters in which each
observation belongs to the cluster with the
nearest mean
• Finding the globally optimal set of k partitions is
computationally expensive (NP-hard).
However, there are efficient heuristic
algorithms that are commonly employed and
converge quickly to an optimum that might
not be global.

K-MEANS

• Given a set of observations (x1, x2, …, xn),
where each observation is a d-dimensional
real vector, k-means clustering aims to
partition the n observations into k sets (k ≤ n)
S = {S1, S2, …, Sk} so as to minimize the within-cluster
sum of squares:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where μi is the mean of points in Si


K-MEANS: ALGORITHM

Given data points xi and an initial set
of k centroids m1^(1), …, mk^(1), the algorithm proceeds as
follows:
• Assignment step: Assign each data point to the
cluster Si with the closest centroid (each data
point goes into exactly one cluster)

• Update step: Calculate the new means to be
the centroids of the data points in each cluster


K-MEANS: AN EXAMPLE

Data point   X   Y
1            1   1
2            2   1
3            1   2
4            2   2
5            4   4
6            4   5
7            5   4
8            5   5

        Cluster 1                 Cluster 2
Step    Data point  Centroid      Data point  Centroid
1       1           (1.0, 1.0)    2           (2.0, 1.0)
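
The assignment and update steps can be traced on these eight points. The sketch below starts from data points 1 and 2 as the initial centroids, as in the example. The function name k_means, the max_iterations parameter, and the rule for empty clusters are assumptions for illustration.

```python
import math

points = [(1, 1), (2, 1), (1, 2), (2, 2), (4, 4), (4, 5), (5, 4), (5, 5)]

def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            closest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[closest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        updated = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

centroids, clusters = k_means(points, [(1.0, 1.0), (2.0, 1.0)])
```

On this data the procedure settles after a few iterations with centroids (1.5, 1.5) and (4.5, 4.5), separating points 1 to 4 from points 5 to 8.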


RUNNING K-MEANS ON IRIS DATASET


HIERARCHICAL CLUSTERING

Hierarchical clustering is a method of cluster
analysis which seeks to build a hierarchy of clusters.
• Strategies for hierarchical clustering generally fall
into two types:
– Agglomerative: This is a "bottom up" approach: each
observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy.
– Divisive: This is a "top down" approach: all
observations start in one cluster, and splits are
performed recursively as one moves down the
hierarchy.


HIERARCHICAL ALGORITHMS

• Initially, the n data points are considered as either
1 or n clusters in hierarchical clustering
• These clusters are gradually split or merged
(divisive or agglomerative hierarchical
clustering algorithms, respectively), depending on the type
of algorithm, until the desired number of clusters is
reached


HIERARCHICAL AGGLOMERATIVE CLUSTERING

• Start with each data point as a cluster
• Keep merging the most similar pairs of data
points/clusters until only one big cluster is left
• This is called a bottom-up or agglomerative
method

This produces a binary tree or dendrogram
– The final cluster is the root and each data point is
a leaf
– The height of the bars indicates how close the
points are

HIERARCHICAL CLUSTERING: AN EXAMPLE


MERGING THE DATA POINTS IN HIERARCHICAL CLUSTERING

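• Average Linkage
– Each cluster ci is associated with a mean vector μi,
which is the mean of all the data items in the
cluster
– The distance between two clusters ci and cj is then
just d(μi, μj)
– (Many references call this centroid linkage and define
average linkage as the mean of all pairwise distances)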
• Single Linkage
– The minimum of all pairwise distances between
points in the two clusters
• Complete Linkage
– The maximum of all pairwise distances between
points in the two clusters
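
The three linkage criteria can be sketched as small functions over clusters given as lists of points. Average linkage here follows the definition in the text above, the distance between the cluster mean vectors. The function names and the two sample clusters are illustrative assumptions.

```python
import math
from itertools import product

def single_linkage(a, b):
    """Minimum of all pairwise distances between points in the two clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Maximum of all pairwise distances between points in the two clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))

def mean_vector(cluster):
    """Component-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def average_linkage(a, b):
    """Distance between the mean vectors of the two clusters."""
    return math.dist(mean_vector(a), mean_vector(b))

c1 = [(0, 0), (0, 2)]  # hypothetical cluster
c2 = [(3, 0), (5, 0)]  # hypothetical cluster
distances = (single_linkage(c1, c2), average_linkage(c1, c2), complete_linkage(c1, c2))
```

For these clusters the single-linkage distance is the smallest of the three and the complete-linkage distance the largest, so different criteria can merge clusters in a different order.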

LINKAGE IN HIERARCHICAL CLUSTERING: EXAMPLE

Single Linkage Average Linkage

Complete Linkage


EVALUATING THE CLUSTERINGS

When we are given objects of two different
kinds, the perfect clustering would be that
objects of the same type are clustered together.

• Evaluation with ground truth
• Evaluation without ground truth

EVALUATION WITH GROUND TRUTH

When ground truth is available, the evaluator
has prior knowledge of what a clustering should
be
– That is, we know the correct clustering
assignments.

• Measures
– Precision and Recall, or F-Measure
– Purity
– Normalized Mutual Information (NMI)


PRECISION AND RECALL

• True Positive (TP):
– when similar points are assigned to the same clusters
– This is considered a correct decision
• False Negative (FN):
– when similar points are assigned to different clusters
– This is considered an incorrect decision
• True Negative (TN):
– when dissimilar points are assigned to different clusters
– This is considered a correct decision
• False Positive (FP):
– when dissimilar points are assigned to the same clusters
– This is considered an incorrect decision


PRECISION AND RECALL: EXAMPLE 1


F-MEASURE

• To consolidate precision and recall into one
measure, we can use the harmonic mean of
precision and recall

$$F = \frac{2 \cdot P \cdot R}{P + R}$$

where P is precision and R is recall

Computed for the same example, we get F = 0.54
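
A pair-counting evaluation can be sketched as follows, using the usual definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The labels and cluster assignments are made up for illustration and are not the example from the slides, so the resulting F value differs from the one quoted above.

```python
from itertools import combinations

def pair_counts(truth, predicted):
    """Count TP, FP, FN, TN over all pairs of points."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = predicted[i] == predicted[j]
        if same_class and same_cluster:
            tp += 1   # similar points, same cluster
        elif same_cluster:
            fp += 1   # dissimilar points, same cluster
        elif same_class:
            fn += 1   # similar points, different clusters
        else:
            tn += 1   # dissimilar points, different clusters
    return tp, fp, fn, tn

def f_measure(tp, fp, fn):
    """Harmonic mean of pair-counting precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

truth = ["a", "a", "a", "b", "b", "b"]  # hypothetical ground-truth labels
predicted = [1, 1, 2, 2, 2, 2]          # hypothetical cluster assignments
tp, fp, fn, tn = pair_counts(truth, predicted)
f_score = f_measure(tp, fp, fn)
```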


PURITY
• In purity, we assume the majority of a cluster
represents the cluster
• Hence, we use the label of the majority
against the label of each member to evaluate
the algorithm
• The purity is then defined as the fraction of
instances that have labels equal to the
cluster’s majority label
• Purity can be easily tampered with; consider points
being singleton clusters (of size 1) or very large
clusters.
• In both cases, purity does not make much sense.

where Lj defines label j (ground truth) and
Mi defines the majority label for cluster i
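
A minimal sketch of purity: count, for each cluster, the members that carry the cluster's majority label, then divide by the total number of points. The function name purity and the sample data are illustrative assumptions.

```python
from collections import Counter

def purity(truth, predicted):
    """Fraction of points whose label equals the majority label of their cluster."""
    members = {}
    for label, cluster in zip(truth, predicted):
        members.setdefault(cluster, []).append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(truth)

truth = ["a", "a", "a", "b", "b", "b"]  # hypothetical ground-truth labels
predicted = [1, 1, 2, 2, 2, 2]          # hypothetical cluster assignments
score = purity(truth, predicted)
singleton_score = purity(truth, list(range(len(truth))))  # every point in its own cluster
```

The singleton case reaches a purity of 1 even though the clustering is useless, which illustrates why purity alone can be misleading.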


MUTUAL INFORMATION

The mutual information of two random
variables is a quantity that measures the mutual
dependence of the two random variables

$$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

• p(x,y) is the joint probability distribution function of X and Y,
• p(x) and p(y) are the marginal probability distribution
functions of X and Y respectively


NORMALIZED MUTUAL INFORMATION

Normalized Mutual Information has been
derived from information theory where the
Mutual Information (MI) between the
clusterings found and the labels is normalized by
the upper bound of (MI) which is a mean of the
entropies (H) of labels and clusterings found


NORMALIZED MUTUAL INFORMATION

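$$NMI = \frac{\sum_{h} \sum_{l} n_{h,l} \log \frac{n \cdot n_{h,l}}{n_h n_l}}{\sqrt{\left(\sum_{h} n_h \log \frac{n_h}{n}\right) \left(\sum_{l} n_l \log \frac{n_l}{n}\right)}}$$

• where l and h are labels and found clusterings,
• nh and nl are the number of data points in the clusters h and l, respectively,
• nh,l is the number of points in cluster h that are labeled l,
• n is the size of the dataset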

• NMI values close to one indicate high similarity
between clusterings found and labels
• Values close to zero indicate high dissimilarity
between them

NORMALIZED MUTUAL INFORMATION: EXAMPLE

Partition a: [1,1,1,1,1,1,1, 2,2,2,2,2,2,2]
Partition b: [1,1,1,1,1,2,2, 1,2,2,2,2,2,2]

n = 14

nh:    h=1: 6    h=2: 8
nl:    l=1: 7    l=2: 7

nh,l   l=1   l=2
h=1    5     1
h=2    2     6
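
The counts above can be reproduced, and an NMI value computed, with a short script. The slide only says the normalizer is "a mean of the entropies". This sketch assumes the geometric mean, sqrt(H(labels) · H(clusters)), which is one common choice. The function name nmi is illustrative.

```python
import math
from collections import Counter

def nmi(labels, clusters):
    """NMI normalized by the geometric mean of the two entropies (an assumption)."""
    n = len(labels)
    n_l = Counter(labels)                  # points per label l
    n_h = Counter(clusters)                # points per found cluster h
    n_hl = Counter(zip(clusters, labels))  # points in cluster h with label l
    mutual = sum(
        count * math.log(n * count / (n_h[h] * n_l[l]))
        for (h, l), count in n_hl.items()
    )
    h_term = sum(count * math.log(count / n) for count in n_h.values())
    l_term = sum(count * math.log(count / n) for count in n_l.values())
    return mutual / math.sqrt(h_term * l_term)

partition_a = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]  # labels (l)
partition_b = [1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2]  # found clusters (h)
score = nmi(partition_a, partition_b)
```

Under this normalization the two partitions score roughly 0.26. The function divides by zero if either partition has a single cluster, which a fuller implementation would have to handle.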


EVALUATION WITHOUT GROUND TRUTH

• Use domain experts

• Use quality measures such as SSE
– SSE: the sum of the squared error for all clusters

• Use two or more clustering algorithms,
compare the results, and pick the algorithm
with the better quality measure


TEXT MINING


TEXT MINING

• In social media, most of the data that is
available online is in text format
• In general, the way to perform data mining is
to convert text data into tabular format and
then perform data mining on this data

• The process of converting text data into
tabular data is called vectorization


TEXT MINING PROCESS

A set of linguistic, statistical, and machine
learning techniques that model and structure
the information content of textual sources for
business intelligence, exploratory data analysis,
research, or investigation


TEXT PREPROCESSING

Text preprocessing aims to make the input
documents more consistent to facilitate text
representation, which is necessary for most text
analytics tasks
• Methods:
– Stop word removal
• Stop word removal eliminates words that appear on a stop word list,
i.e., words considered too general to carry much
meaning
– e.g. the, a, is, at, which
– Stemming
• Stemming reduces inflected (or sometimes derived) words
to their stem, base or root form
– For example, “watch”, “watching”, “watched” are represented as
“watch”
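
Both preprocessing steps can be sketched in a few lines. The stop word list is the example list from the slide. The stemmer is a deliberately naive suffix stripper, written only for illustration and not a real algorithm such as Porter's. The names STOP_WORDS, remove_stop_words, and toy_stem are assumptions.

```python
STOP_WORDS = {"the", "a", "is", "at", "which"}  # example list from the slide

def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = "the dog is watching a cat which watched the dog".split()
processed = [toy_stem(t) for t in remove_stop_words(tokens)]
```

After both steps, "watching" and "watched" are both reduced to "watch", so they map to the same feature in later stages.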


TEXT REPRESENTATION

• The most common way to model documents
is to transform them into sparse numeric
vectors and then deal with them with linear
algebraic operations
• This representation is called “Bag of Words”

• Methods:
– Vector space model
– tf-idf


VECTOR SPACE MODEL

• In the vector space model, we start with a set
of documents, D
• Each document is a set of words
• The goal is to convert these textual
documents to vectors

• di : document i, wj,i : the weight for word j in document i

The weight can be set to 1 when the word exists in the document and 0 when
it does not. Or we can set this weight to the number of times the word is
observed in the document

VECTOR SPACE MODEL: AN EXAMPLE

• Documents:
– d1: data mining and social media mining
– d2: social network analysis
– d3: data mining
• Reference vector:
– (analysis, data, media, mining, network, social)
• Vector representation:
analysis data media mining network social
d1 0 1 1 1 0 1
d2 1 0 0 0 1 1
d3 0 1 0 1 0 0
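
The binary vectors in this example can be reproduced with a short script. It assumes that "and" is treated as a stop word, since it does not appear among the table columns. The names documents, vocabulary, and binary_vector are illustrative.

```python
documents = {
    "d1": "data mining and social media mining",
    "d2": "social network analysis",
    "d3": "data mining",
}
STOP_WORDS = {"and"}  # assumption: "and" is dropped before vectorization

# Alphabetically sorted vocabulary, matching the column order of the table
vocabulary = sorted(
    {word for text in documents.values() for word in text.split()} - STOP_WORDS
)

def binary_vector(text):
    """Weight 1 if the term occurs in the document, 0 otherwise."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]

vectors = {name: binary_vector(text) for name, text in documents.items()}
```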


TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY)

tf-idf of term t, document d, and document corpus D is
calculated as follows:
tf-idf(t, d, D) = tf(t, d) * idf(t, D)

$$idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

– |D|: the total number of documents in
the corpus

– |{d ∈ D : t ∈ d}|: the number of documents where
the term t appears


TF-IDF: AN EXAMPLE

Consider words “apple” and “the” that appear
10 and 20 times in document 1 (d1), which
contains 100 words.
Consider |D| = 20 and word “apple” only
appearing in d1 and word “the” appearing in all
20 documents


TF-IDF: AN EXAMPLE

• Documents:
– d1: data mining and social media mining
– d2: social network analysis
– d3: data mining
• tf-idf representation:

analysis data media mining network social
df(w) 1 2 1 2 1 2
log(N/df(w)) 0.48 0.18 0.48 0.18 0.48 0.18
d1, tf 0 1 1 2 0 1
d2, tf 1 0 0 0 1 1
d3, tf 0 1 0 1 0 0
d1, tf-idf 0.00 0.18 0.48 0.35 0.00 0.18
d2, tf-idf 0.48 0.00 0.00 0.00 0.48 0.18
d3, tf-idf 0.00 0.18 0.00 0.18 0.00 0.00
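
The table can be reproduced with a few lines of Python. The sketch assumes raw term counts for tf, a base-10 logarithm for idf, and "and" treated as a stop word. These choices are inferred from the numbers in the table rather than stated on the slide.

```python
import math
from collections import Counter

documents = {
    "d1": "data mining and social media mining",
    "d2": "social network analysis",
    "d3": "data mining",
}
STOP_WORDS = {"and"}  # assumption: "and" is dropped, as it has no column in the table

tokenized = {
    name: [w for w in text.split() if w not in STOP_WORDS]
    for name, text in documents.items()
}
vocabulary = sorted({w for words in tokenized.values() for w in words})
n_docs = len(tokenized)

# df(w): number of documents containing w; idf(w) = log10(N / df(w))
df = {term: sum(term in words for words in tokenized.values()) for term in vocabulary}
idf = {term: math.log10(n_docs / df[term]) for term in vocabulary}

def tf_idf(name):
    """tf-idf vector of one document, rounded to two decimals as in the table."""
    counts = Counter(tokenized[name])
    return [round(counts[term] * idf[term], 2) for term in vocabulary]

d1_vector = tf_idf("d1")
```

Terms that occur in only one document, such as "media", receive the highest idf weight, while terms shared by two of the three documents are weighted down.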


SENTIMENT ANALYSIS

• Sentiment analysis or opinion mining refers to
the application of natural language
processing, computational linguistics, and text
analytics to identify and extract subjective
information in source materials
• It aims to determine the attitude of a speaker
or a writer with respect to some topic or the
overall contextual polarity of a document.


POLARITY ANALYSIS

• The basic task in opinion mining is classifying
the polarity of a given document or text
– The polarity could be positive, negative, or neutral
• Methods:
– Naïve Bayes
– Pointwise Mutual Information (PMI)


MEASURING POLARITY, NAÏVE BAYES

• Bayes’ rule:

• If we consider that the occurrences of features
(words) in the document are independent


MEASURING POLARITY, MAXIMUM ENTROPY

• Z(d) is the normalization factor
• λi,c is a feature-weight parameter and shows the
importance of each feature
• Fi,c is defined as a feature/class function for
feature fi and class c


MEASURING POLARITY, POINTWISE MUTUAL INFORMATION

• P(word) is the number of results returned by a search engine in response to a search
for the term word
• P(word1 word2) is the number of results for a joint search of word1 and word2
together


Mohammad-Ali Abbasi (Ali)
is a Ph.D. student at the Data Mining
and Machine Learning Lab, Arizona
State University.
His research interests include Data
Mining, Machine Learning, Social
Computing, and Social Media Behavior
Analysis.

http://www.public.asu.edu/~mabbasi2/

