This document provides an introduction to data mining and machine learning. It discusses how data mining can extract hidden patterns from large datasets. The document covers common data mining tasks like classification, regression, and clustering. It also describes different algorithms for classification including decision trees, naive Bayes classifiers, and k-nearest neighbors. Regression is also introduced as predicting real-valued outputs. The document uses examples to illustrate key concepts in data mining.
1. DATA MINING AND MACHINE LEARNING
IN A NUTSHELL
AN INTRODUCTION TO DATA MINING
Mohammad-Ali Abbasi
http://www.public.asu.edu/~mabbasi2/
SCHOOL OF COMPUTING, INFORMATICS, AND DECISION SYSTEMS ENGINEERING
ARIZONA STATE UNIVERSITY
http://dmml.asu.edu/
Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 1
2. INTRODUCTION
• The rate of data production has increased dramatically (Big Data), and we are able to store much more data than before
– E.g., purchase data, social media data, mobile phone data
• Businesses and customers need useful, actionable knowledge and insight from raw data for various purposes
– It’s not just searching data or databases
Data mining helps us extract new information and uncover hidden patterns in stored and streaming data
3. DATA MINING
The process of discovering hidden patterns in large data sets
It utilizes methods at the intersection of artificial intelligence, machine learning,
statistics, and database systems
• Extracting or “mining” knowledge from large
amounts of data, or big data
• Data-driven discovery and modeling of hidden
patterns in big data
• Extracting implicit, previously unknown,
unexpected, and potentially useful
information/knowledge from data
4. DATA MINING STORIES
• “My bank called and said that they saw that I bought
two surfboards at Laguna Beach, California.” - credit
card fraud detection
• The NSA is using data mining to analyze telephone call data to track al-Qaeda activities
• Walmart uses data mining to control product
distribution based on typical customer buying
patterns at individual stores
5. DATA MINING VS. DATABASES
• Data mining is the process of extracting
hidden and actionable patterns from data
• Database systems store and manage data
– Queries return part of stored data
– Queries do not extract hidden patterns
• Examples of querying databases
– Find all employees with income more than $250K
– Find top spending customers in last month
– Find all students from engineering college with
GPA more than average
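To make the contrast concrete, here is a minimal sketch of the first example query, using Python's built-in sqlite3 module. The table name, column names, and rows are invented for illustration. The query only returns rows that were already stored; it does not discover any pattern.

```python
import sqlite3

# Hypothetical in-memory table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, income REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 300_000), ("Bob", 120_000), ("Cho", 260_000)],
)

# "Find all employees with income more than $250K"
rows = conn.execute(
    "SELECT name FROM employees WHERE income > 250000 ORDER BY name"
).fetchall()
high_earners = [name for (name,) in rows]
print(high_earners)
conn.close()
```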
6. EXAMPLES OF DATA MINING APPLICATIONS
• Identifying fraudulent credit card transactions or spam emails
– Given a user’s purchase history and a new transaction, identify whether the transaction is fraudulent
– Determine whether a given email is spam
• Extracting purchase patterns from existing records
– beer ⇒ diapers (80%)
• Forecasting future sales and needs from given samples
• Extracting groups of like-minded people in a given network
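A purchase pattern such as the beer rule above is usually reported with a support and a confidence value. The sketch below assumes the 80% on the slide is the rule's confidence, which the slide does not state. The baskets are hypothetical and were chosen so that 4 of the 5 baskets containing beer also contain diapers.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_antecedent = sum(1 for t in transactions if antecedent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical baskets, for illustration only.
baskets = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]
support, confidence = rule_stats(baskets, "beer", "diapers")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Confidence is the fraction of beer baskets that also contain diapers; support is the fraction of all baskets that contain both items.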
7. BASIC DATA MINING TASKS
• Classification
– Assign data to predefined classes
• Spam detection, credit card fraud detection
• Regression
– Predict a real value for a given data instance
• Predict the price for a given house
• Clustering
– Group similar items together into some clusters
• Detect communities in a given social network
8. DATA
9. DATA INSTANCES
• A collection of properties and features related
to an object or person
– A patient’s medical record
– A user’s profile
– A gene’s information
• Instances are also called examples, records,
data points, or observations
[Figure: a data instance, made up of its features (attributes) and a class label]
10. DATA TYPES
• Nominal (categorical)
– No comparison is defined
– E.g., {male, female}
• Ordinal
– Comparable but the difference is not defined
– E.g., {Low, medium, high}
• Interval
– Subtraction and addition are defined, but not division
– E.g., 3:08 PM, calendar dates
• Ratio
– E.g., Height, weight, money quantities
11. SAMPLE DATASET
outlook   temperature  humidity  windy  play
sunny     85           85        FALSE  no
sunny     80           90        TRUE   no
overcast  83           86        FALSE  yes
rainy     70           96        FALSE  yes
rainy     68           80        FALSE  yes
rainy     65           70        TRUE   no
overcast  64           65        TRUE   yes
sunny     72           95        FALSE  no
sunny     69           70        FALSE  yes
rainy     75           80        FALSE  yes
sunny     75           70        TRUE   yes
overcast  72           90        TRUE   yes
overcast  81           75        FALSE  yes
rainy     71           91        TRUE   no
[Slide annotations on the attribute columns: Interval, Ordinal, Nominal]
12. DATA QUALITY
When preparing data for data mining algorithms, data quality needs to be ensured
• Noise
– Noise is the distortion of the data
• Outliers
– Outliers are data points that are considerably different
from other data points in the dataset
• Missing Values
– Missing feature values in data instances
• Duplicate data
13. DATA PREPROCESSING
• Aggregation
– Combining multiple attributes into a single attribute, or changing the scale of the attributes
• Discretization
– From continuous values to discrete values
• Feature Selection
– Choose relevant features
• Feature Extraction
– Creating a mapping of new features from original features
• Sampling
– Random Sampling
– Sampling with or without replacement
– Stratified Sampling
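Two of the steps above, discretization and stratified sampling, can be sketched in a few lines of standard-library Python. The function names, cut points, and bin labels are illustrative assumptions, not part of the slides.

```python
import random
from collections import defaultdict

def discretize(value, cut_points, labels):
    """Return the label of the first bin whose upper bound is >= value."""
    for cut, label in zip(cut_points, labels):
        if value <= cut:
            return label
    return labels[-1]  # above every cut point

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction (at least one record) from each group defined by key."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))  # sampling without replacement
    return sample

# Discretization: the cut points 70 and 80 are arbitrary.
temperatures = [85, 80, 83, 70, 68, 65, 64]
bins = [discretize(t, [70, 80], ["cool", "mild", "hot"]) for t in temperatures]
print(bins)

# Stratified sampling: keep the yes/no proportions of a 9-vs-5 dataset.
records = [("yes", i) for i in range(9)] + [("no", i) for i in range(5)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.4)
print(sample)
```

Plain random sampling could, by chance, return very few "no" records; sampling within each class keeps the class proportions roughly intact.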
15. CLASSIFICATION
Learning patterns from labeled data and classifying new data with labels (categories)
– For example, we want to classify an e-mail as "legitimate" or "spam"
[Figure: classifier]
16. CLASSIFICATION: THE PROCESS
• In classification, we are given a set of labeled
examples
• These examples are records/instances in the
format (x, y) where x is a vector and y is the
class attribute, commonly a scalar
• The classification task is to build a model that maps x to y
• Our task is to find a mapping f such that f(x) = y
18. CLASSIFICATION: AN EMAIL EXAMPLE
• A set of emails is given where
users have manually identified
spam versus non-spam
• Our task is to use a set of
features such as words in the
email (x) to identify spam/non-
spam status of the email (y)
• In this case, classes are
y = {spam, non-spam}
• How would this be dealt with in a social setting?
19. CLASSIFICATION ALGORITHMS
• Decision tree learning
• Naive Bayes learning
• K-nearest neighbor classifier
20. DECISION TREE
• A decision tree is learned from the dataset
(training data with known classes) and later
applied to predict the class attribute value of
new data (test data with unknown classes)
where only the feature values are known
22. ID3, A DECISION TREE ALGORITHM
Use information gain (entropy) to determine how well an attribute separates the training data according to the class attribute value
\[ \mathrm{entropy}(D) = -p_+ \log_2 p_+ - p_- \log_2 p_- \]
– p+ is the proportion of positive examples in D
– p- is the proportion of negative examples in D
In a dataset containing ten examples, 7 have a positive class attribute value and 3 have a negative class attribute value [7+, 3-]:
\[ \mathrm{entropy}(D) = -\tfrac{7}{10} \log_2 \tfrac{7}{10} - \tfrac{3}{10} \log_2 \tfrac{3}{10} \approx 0.881 \]
If the numbers of positive and negative examples in the set are equal, then the entropy is 1
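The two entropy values mentioned on this slide can be checked with a short sketch. The function name and its count-based signature are illustrative choices.

```python
from math import log2

def entropy(positives, negatives):
    """Binary entropy, in bits, of a set with the given class counts."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count:  # treat 0 * log2(0) as 0
            p = count / total
            result -= p * log2(p)
    return result

print(round(entropy(7, 3), 3))  # the [7+, 3-] dataset
print(entropy(5, 5))            # equal numbers of positives and negatives
```

A set containing only one class has entropy 0, so lower entropy after a split means the attribute separates the classes better.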
23. DECISION TREE: EXAMPLE 1
outlook   temperature  humidity  windy  play
sunny     85           85        FALSE  no
sunny     80           90        TRUE   no
overcast  83           86        FALSE  yes
rainy     70           96        FALSE  yes
rainy     68           80        FALSE  yes
rainy     65           70        TRUE   no
overcast  64           65        TRUE   yes
sunny     72           95        FALSE  no
sunny     69           70        FALSE  yes
rainy     75           80        FALSE  yes
sunny     75           70        TRUE   yes
overcast  72           90        TRUE   yes
overcast  81           75        FALSE  yes
rainy     71           91        TRUE   no
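As a sketch of how ID3 would pick its first split on this table, the code below computes the information gain of the two categorical attributes, outlook and windy. The numeric columns are left out because handling them needs a discretization step that the slides do not specify. The helper names are illustrative.

```python
from collections import Counter
from math import log2

# (outlook, windy, play) copied from the table above; numeric columns omitted.
ROWS = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("overcast", False, "yes"),
    ("rainy", False, "yes"), ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"), ("sunny", False, "yes"),
    ("rainy", False, "yes"), ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]

def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, column):
    """Entropy of the labels minus the weighted entropy after splitting on column."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gain_outlook = information_gain(ROWS, 0)
gain_windy = information_gain(ROWS, 1)
print(f"outlook: {gain_outlook:.3f}, windy: {gain_windy:.3f}")
```

Outlook has the larger gain of the two, so between these two attributes ID3 would split on outlook first.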
24. DECISION TREE: EXAMPLE 2
[Figure: a training dataset with class labels and two decision trees learned from it (Learned Decision Tree 1, Learned Decision Tree 2)]
25. NAIVE BAYES CLASSIFIER
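For two random variables X and Y, Bayes' theorem states that
\[ P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)} \]
where Y is the class variable and X = (x_1, x_2, …, x_m) represents the instance features
Then the class attribute value for instance X is
\[ \arg\max_{y_i} P(y_i \mid X) = \arg\max_{y_i} \frac{P(X \mid y_i)\, P(y_i)}{P(X)} \]
Assuming that the variables are independent (given the class),
\[ P(X \mid y_i) = \prod_{j=1}^{m} P(x_j \mid y_i) \]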
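A minimal sketch of a naive Bayes classifier for categorical features follows. It is trained on the outlook and windy columns of the sample weather dataset. It estimates probabilities from raw frequencies with no smoothing, which is a simplifying assumption, and the function names are illustrative. P(X) is dropped because it is the same for every class and does not change which class scores highest.

```python
from collections import Counter, defaultdict

# (outlook, windy, play) from the sample weather dataset.
ROWS = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("overcast", False, "yes"),
    ("rainy", False, "yes"), ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"), ("sunny", False, "yes"),
    ("rainy", False, "yes"), ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]

def train_naive_bayes(rows):
    """rows: tuples (feature_1, ..., feature_m, label). Returns the count tables."""
    class_counts = Counter(row[-1] for row in rows)
    # feature_counts[(j, label)][value] = rows of that class with feature j == value
    feature_counts = defaultdict(Counter)
    for *features, label in rows:
        for j, value in enumerate(features):
            feature_counts[(j, label)][value] += 1
    return class_counts, feature_counts

def predict(model, features):
    """Return the label maximizing P(y) * prod_j P(x_j | y), plus all scores."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(y)
        for j, value in enumerate(features):
            score *= feature_counts[(j, label)][value] / n_label  # P(x_j | y)
        scores[label] = score
    return max(scores, key=scores.get), scores

model = train_naive_bayes(ROWS)
label, scores = predict(model, ("sunny", True))
print(label, scores)
```

Without smoothing, a feature value never seen with a class gives that class a score of zero; practical implementations usually add a small count to every cell to avoid this.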
26. NBC: AN EXAMPLE
27. NEAREST NEIGHBOR CLASSIFIER
• k-nearest neighbor employs the neighbors of a
data point to perform classification
• The instance being classified is assigned the label held by the majority of its k nearest neighbors
• When k = 1, the closest neighbor’s label is
used as the predicted label for the instance
being classified
• For determining the neighbors, distance is
computed based on some distance metric,
e.g., Euclidean distance
28. K-NN: ALGORITHM
1. The dataset, the number of neighbors (k), and the instance i are given
2. Compute the distance between i and all other data points in the dataset
3. Pick the k closest neighbors
4. The class label for the data point i is the one that the majority of these neighbors holds (if more than one class is tied for the majority, select one of them randomly)
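The four steps can be sketched as follows, using Euclidean distance via `math.dist` (Python 3.8 or later). The data points and the function name are hypothetical and chosen only so that different values of k give different answers.

```python
import math
import random
from collections import Counter

def knn_predict(dataset, instance, k, rng=random):
    """dataset: list of (feature_vector, label). Majority label of the k nearest points."""
    # Step 2: distance from the instance to every data point, sorted ascending
    by_distance = sorted(dataset, key=lambda item: math.dist(item[0], instance))
    # Step 3: keep the k closest neighbors and count their labels
    votes = Counter(label for _, label in by_distance[:k])
    # Step 4: majority label; ties are broken randomly
    best = max(votes.values())
    tied = [label for label, count in votes.items() if count == best]
    return rng.choice(tied)

# Hypothetical 2-D points, for illustration only.
points = [
    ((1.0, 1.0), "triangle"), ((1.5, 2.0), "triangle"),
    ((3.0, 4.0), "rectangle"), ((3.5, 5.0), "rectangle"), ((5.0, 7.0), "rectangle"),
]
query = (1.2, 1.5)
for k in (1, 3, 5):
    print(k, knn_predict(points, query, k))
```

Sorting the whole dataset keeps the sketch short; for large datasets a partial selection or a spatial index would avoid the full sort.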
29. K-NEAREST NEIGHBOR: EXAMPLE
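[Figure: a green circle with an unknown class label (Class label = ?) among triangles and rectangles, with neighborhoods drawn for k = 3, k = 5, and k = 10]
• Depending on k, different labels can be predicted for the green circle
• In our example, k = 3 and k = 5 generate different labels for the instance
• For k = 10, we can choose either triangle or rectangle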
30. K-NEAREST NEIGHBOR: EXAMPLE
Similarity between row 8 and the other data instances (per attribute, similarity = 1 if the values match, otherwise 0; the Similarity column is the sum over the three attributes; "?" marks a tied vote)
Data instance  Outlook  Temperature  Humidity  Similarity  Label  K  Prediction
2              1        1            1         3           N      1  N
1              1        0            1         2           N      2  N
4              0        1            1         2           Y      3  N
3              0        0            1         1           Y      4  ?
5              1        0            0         1           Y      5  Y
6              0        0            0         0           N      6  ?
7              0        0            0         0           Y      7  Y
31. EVALUATING CLASSIFICATION PERFORMANCE
• As the class labels are discrete, we can measure the accuracy by dividing the number of correctly predicted labels (C) by the total number of instances (N)
• Accuracy = C/N
• Error rate = 1 - Accuracy
• More sophisticated approaches of evaluation will be
discussed later
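Accuracy and error rate as defined above can be computed as in this sketch. The label lists are made-up examples and the function name is illustrative.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of instances whose predicted label equals the true label (C / N)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labels for five emails.
y_true = ["spam", "spam", "non-spam", "non-spam", "spam"]
y_pred = ["spam", "non-spam", "non-spam", "non-spam", "spam"]
acc = accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f}, error rate={1 - acc:.2f}")
```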
33. REGRESSION
Regression analysis includes techniques of
modeling and analyzing the relationship
between a dependent variable and one or more
independent variables
• Regression analysis is widely used for
prediction and forecasting
• It can be used to infer
relationships between
the independent and
dependent variables
34. REGRESSION
In regression, we deal with real numbers as class
values (Recall that in classification, class values
or labels are categories)
\[ y \approx f(X) \]
Dependent variable: y ∈ ℝ
Regressors: x1, x2, …, xm
Our task is to find the relation between y and the vector
(x1, x2, …, xm)
35. LINEAR REGRESSION
In linear regression, we assume the relation between the class attribute y and the feature set x to be linear
\[ y \approx \mathbf{w}^\top \mathbf{x} \]
where w represents the vector of regression coefficients
• The problem of regression can be solved by estimating w from the provided dataset and the labels y
– Least squares is often used to solve the problem
36. SOLVING LINEAR REGRESSION PROBLEMS
• The problem of regression can be solved by estimating w from the provided dataset and the labels y
– “Least squares” is a popular method for solving regression problems
37. LEAST SQUARES
Find W that minimizes \( \lVert Y - XW \rVert^2 \) for regressors X and labels Y
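For reference, one standard way to obtain the minimizer is to set the gradient of the squared error to zero. The derivation below assumes that \( X^\top X \) is invertible, which the slide does not state.

```latex
\begin{aligned}
\frac{\partial}{\partial W}\,\lVert Y - XW \rVert^2 &= -2\,X^\top (Y - XW) = 0 \\
X^\top X\, W &= X^\top Y \\
W &= (X^\top X)^{-1} X^\top Y
\end{aligned}
```

When \( X^\top X \) is singular or badly conditioned, the system is usually solved with a pseudo-inverse or a numerical least-squares routine instead of an explicit inverse.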
39. REGRESSION COEFFICIENTS
• When there is only one independent variable:
\[ y = w_0 + w_1 x \]
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
• Two independent variables:
\[ y = w_0 + w_1 x_1 + w_2 x_2 \]
40. LINEAR REGRESSION: EXAMPLE
Years of experience  Salary ($K)
3                    30
8                    57
9                    64
13                   72
3                    36
6                    43
11                   59
21                   90
1                    20
16                   83
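As a sketch, the data above can be fitted with the usual closed-form least-squares estimates for a single independent variable: w1 is the sum of (x − x̄)(y − ȳ) divided by the sum of (x − x̄)², and w0 = ȳ − w1·x̄. These are the standard textbook formulas, not taken from the slide text, and the function name is illustrative.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w1 = s_xy / s_xx
    w0 = y_mean - w1 * x_mean
    return w0, w1

years = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]  # in $K
w0, w1 = fit_simple_linear_regression(years, salary)
print(f"salary = {w0:.1f} + {w1:.2f} * years")
print(f"predicted salary for 10 years: {w0 + w1 * 10:.1f}")
```

On this data the fit comes out at roughly w0 ≈ 23.2 and w1 ≈ 3.54, so each additional year of experience corresponds to about $3.5K more salary.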
41. EVALUATING REGRESSION PERFORMANCE
• The labels cannot be predicted exactly
• A margin is needed to accept or reject
the predictions
– For example, when the observed temperature is
71, any prediction in the range 71 ± 0.5 can be
considered a correct prediction
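The margin idea can be sketched as a simple accuracy score: a prediction counts as correct when it lies within the margin of the observed value. The function name and the temperature values below are hypothetical.

```python
def within_margin_accuracy(observed, predicted, margin=0.5):
    """Fraction of predictions that fall within +/- margin of the observed value."""
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= margin)
    return hits / len(observed)

# Hypothetical observed temperatures and model predictions.
observed = [71.0, 68.0, 75.0, 80.0]
predicted = [71.4, 69.0, 74.8, 80.5]
accuracy = within_margin_accuracy(observed, predicted)
print(accuracy)
```

Here three of the four predictions fall inside the ±0.5 margin; the prediction of 69.0 for an observed 68.0 is rejected.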
43. CLUSTERING
Grouping together items that are similar in
some way – according to some criteria
• Clustering is a form of unsupervised learning
– The clustering algorithms do not have examples
showing how the samples should be grouped
together
• The clustering algorithms look for patterns or
structures in the data that are of interest
• Clustering algorithms group together similar
items
45. MEASURING SIMILARITY IN CLUSTERING ALGORITHMS
• The goal is to group together similar items
• Different similarity measures can be used to
find similar items
• Usually similarity measures are critical to
clustering algorithms
The most popular (dis)similarity measures for
continuous features are Euclidean distance and
Pearson linear correlation
46. EUCLIDEAN DISTANCE – A DISSIMILARITY MEASURE
$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Here n is the number of dimensions in the
data vector
47. PEARSON LINEAR CORRELATION
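$r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$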
• We’re shifting the expression profiles down (subtracting the means) and scaling
by the standard deviations (i.e., making the data have mean = 0 and std = 1)
• Always between –1 and +1 (perfectly anti-correlated and perfectly correlated)
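Both measures are short enough to write out directly. The sketch below uses plain Python lists as data vectors; the function names are illustrative, and `pearson_correlation` assumes neither vector is constant (otherwise the denominator is zero).

```python
import math

def euclidean_distance(x, y):
    """Dissimilarity: 0 for identical vectors, growing as they move apart."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pearson_correlation(x, y):
    """Similarity in [-1, +1] after centering both vectors on their means."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    numerator = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    denominator = math.sqrt(
        sum((xi - x_mean) ** 2 for xi in x) * sum((yi - y_mean) ** 2 for yi in y)
    )
    return numerator / denominator

print(euclidean_distance([0, 0], [3, 4]))
print(pearson_correlation([1, 2, 3], [2, 4, 6]))
print(pearson_correlation([1, 2, 3], [3, 2, 1]))
```

Note that [1, 2, 3] and [2, 4, 6] are perfectly correlated even though their Euclidean distance is not zero, which is why the choice of measure matters for clustering.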
48. SIMILARITY MEASURES: MORE DEFINITIONS
49. CLUSTERING
• Distance-based algorithms
– K-Means
• Hierarchical algorithms
50. K-MEANS
k-means clustering aims to partition n
observations into k clusters in which each
observation belongs to the cluster with the
nearest mean
• Finding the globally optimal partition into k
clusters is computationally expensive (NP-hard).
However, efficient heuristic algorithms are
commonly employed and converge quickly to a
local optimum, which might not be the global
one.
51. K-MEANS
• Given a set of observations (x1, x2, …, xn),
where each observation is a d-dimensional
real vector, k-means clustering aims to
partition the n observations into k sets (k ≤ n)
S = {S1, S2, …, Sk} so as to minimize the
within-cluster sum of squares:
$\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \|x_j - \mu_i\|^2$
where μi is the mean of points in Si
52. K-MEANS: ALGORITHM
Given data points xi and an initial set
of k centroids m1(1),…,mk(1), the algorithm proceeds as
follows:
• Assignment step: Assign each data point to the
cluster Si with the closest centroid (each data
point goes into exactly one cluster)
• Update step: Recalculate each centroid as the
mean of the data points in its cluster
• Repeat both steps until the assignments no
longer change
53. K-MEANS: AN EXAMPLE
Data point   X   Y
1            1   1
2            2   1
3            1   2
4            2   2
5            4   4
6            4   5
7            5   4
8            5   5

Step   Cluster 1                  Cluster 2
       Data point   Centroid      Data point   Centroid
1      1            (1.0, 1.0)    2            (2.0, 1.0)
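The assignment and update steps can be sketched in a few lines of plain Python. The eight points and the two initial centroids (data points 1 and 2) are taken from the example table; the helper names and the stopping rule (stop when no assignment changes) are choices made for this sketch.

```python
def squared_distance(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, max_iterations=100):
    """Plain k-means; returns (centroids, cluster index for each point)."""
    centroids = list(centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its cluster.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return centroids, assignments

points = [(1, 1), (2, 1), (1, 2), (2, 2), (4, 4), (4, 5), (5, 4), (5, 5)]
centroids, assignments = k_means(points, [(1.0, 1.0), (2.0, 1.0)])
print(centroids, assignments)
```

With this starting point the procedure settles after a few passes on two clusters of four points each, one around the lower-left group and one around the upper-right group.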
54. RUNNING K-MEANS ON IRIS DATASET
55. HIERARCHICAL CLUSTERING
Hierarchical clustering is a method of cluster
analysis which seeks to build a hierarchy of clusters.
• Strategies for hierarchical clustering generally fall
into two types:
– Agglomerative: This is a "bottom up" approach: each
observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy.
– Divisive: This is a "top down" approach: all
observations start in one cluster, and splits are
performed recursively as one moves down the
hierarchy.
56. HIERARCHICAL ALGORITHMS
• Initially, the n data points are treated as either
one cluster (divisive) or n clusters (agglomerative)
• These clusters are then gradually split (divisive)
or merged (agglomerative), depending on the type
of algorithm
• This continues until the desired number of
clusters is reached
57. HIERARCHICAL AGGLOMERATIVE CLUSTERING
• Start with each data point as a cluster
• Keep merging the most similar pairs of data
points/clusters until only one big cluster is left
• This is called a bottom-up or agglomerative
method
This produces a binary tree or dendrogram
– The final cluster is the root and each data point is
a leaf
– The height of the bars indicates how close the
points are
58. HIERARCHICAL CLUSTERING: AN EXAMPLE
59. MERGING THE DATA POINTS IN HIERARCHICAL CLUSTERING
• Average Linkage
– Each cluster ci is associated with a mean vector μi,
which is the mean of all the data items in the
cluster
– The distance between two clusters ci and cj is then
just d(μi, μj) (this variant is often called centroid linkage)
• Single Linkage
– The minimum of all pairwise distances between
points in the two clusters
• Complete Linkage
– The maximum of all pairwise distances between
points in the two clusters
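The three merge criteria differ only in how they turn point-to-point distances into a cluster-to-cluster distance. A minimal sketch, assuming Euclidean distance between points and using hypothetical one-dimensional clusters:

```python
import math
from itertools import product

def distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def mean_vector(cluster):
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def single_linkage(ci, cj):
    # Minimum of all pairwise distances between the two clusters.
    return min(distance(p, q) for p, q in product(ci, cj))

def complete_linkage(ci, cj):
    # Maximum of all pairwise distances between the two clusters.
    return max(distance(p, q) for p, q in product(ci, cj))

def mean_vector_linkage(ci, cj):
    # "Average linkage" as defined on the slide: distance between cluster means.
    return distance(mean_vector(ci), mean_vector(cj))

# Hypothetical one-dimensional clusters.
ci = [(0.0,), (1.0,)]
cj = [(3.0,), (5.0,)]
print(single_linkage(ci, cj), mean_vector_linkage(ci, cj), complete_linkage(ci, cj))
```

For these two clusters the single-linkage distance is the smallest of the three and the complete-linkage distance the largest, which is the usual ordering.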
60. LINKAGE IN HIERARCHICAL CLUSTERING: EXAMPLE
[Figure panels: Single Linkage, Average Linkage, Complete Linkage]
61. EVALUATING THE CLUSTERINGS
When we are given objects of two different
kinds, the perfect clustering would be that
objects of the same type are clustered together.
• Evaluation with ground truth
• Evaluation without ground truth
62. EVALUATION WITH GROUND TRUTH
When ground truth is available, the evaluator
has prior knowledge of what a clustering should
be
– That is, we know the correct clustering
assignments.
• Measures
– Precision and Recall, or F-Measure
– Purity
– Normalized Mutual Information (NMI)
63. PRECISION AND RECALL
• True Positive (TP):
– when similar points are assigned to
the same clusters
– This is considered a correct
decision
• False Negative (FN):
– when similar points are assigned to
different clusters
– This is considered an incorrect
decision
• True Negative (TN):
– when dissimilar points are
assigned to different clusters
– This is considered a correct
decision
• False Positive (FP):
– when dissimilar points are
assigned to the same clusters
– This is considered an incorrect
decision
64. PRECISION AND RECALL: EXAMPLE 1
65. F-MEASURE
• To consolidate precision and recall into one
measure, we can use the harmonic mean of
precision (P) and recall (R):
$F = 2 \cdot \frac{P \cdot R}{P + R}$
Computed for the same example, we get F = 0.54
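The pair-based definitions of TP, FP, FN and TN translate directly into a loop over all pairs of points. The sketch below does not reproduce the slide's own example; it uses the two partitions from the later NMI example (partition a as ground-truth labels, partition b as the found clustering), so its F value differs from the 0.54 quoted above.

```python
from itertools import combinations

def pair_counts(labels, clusters):
    """Count TP, FP, FN, TN over all pairs of points."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_label and same_cluster:
            tp += 1  # similar points, same cluster
        elif same_cluster:
            fp += 1  # dissimilar points, same cluster
        elif same_label:
            fn += 1  # similar points, different clusters
        else:
            tn += 1  # dissimilar points, different clusters
    return tp, fp, fn, tn

def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

labels = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
clusters = [1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2]
tp, fp, fn, tn = pair_counts(labels, clusters)
print(tp, fp, fn, tn, f_measure(tp, fp, fn))
```

For these partitions TP + FP equals C(6,2) + C(8,2) = 43, the number of pairs that share a cluster.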
66. PURITY
• In purity, we assume the majority of a cluster
represents the cluster
• Hence, we use the label of the majority
against the label of each member to evaluate
the algorithm
• The purity is then defined as the fraction of
instances that have labels equal to the
cluster’s majority label
where Lj defines label j (ground truth) and
Mi defines the majority label for cluster i
• Purity can be easily tampered with; consider points
being singleton clusters (of size 1) or very large
clusters.
• In both cases, purity does not make much sense.
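A minimal sketch of purity, assuming the common definition "number of points that carry their cluster's majority label, divided by the total number of points". The example partitions are borrowed from the NMI example later in the deck.

```python
from collections import Counter

def purity(labels, clusters):
    """Fraction of points whose label equals the majority label of their cluster."""
    members = {}
    for label, cluster in zip(labels, clusters):
        members.setdefault(cluster, []).append(label)
    # For each cluster, count how many members carry its most common label.
    majority_total = sum(Counter(ls).most_common(1)[0][1] for ls in members.values())
    return majority_total / len(labels)

labels = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
clusters = [1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2]
print(purity(labels, clusters))

# Putting every point in its own cluster yields a perfect but meaningless score.
print(purity(labels, list(range(len(labels)))))
```

The second call illustrates the weakness noted above: singleton clusters always reach a purity of 1.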
67. MUTUAL INFORMATION
The mutual information of two random
variables is a quantity that measures the mutual
dependence of the two random variables:
$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$
• p(x,y) is the joint probability distribution function of X and Y,
• p(x) and p(y) are the marginal probability distribution
functions of X and Y respectively
68. NORMALIZED MUTUAL INFORMATION
Normalized Mutual Information (NMI) is
derived from information theory: the
Mutual Information (MI) between the
clusterings found and the labels is normalized by
an upper bound of the MI, which is a mean of the
entropies (H) of the labels and the clusterings found
69. NORMALIZED MUTUAL INFORMATION
• where l and h are labels and found clusterings,
• nh and nl are the number of data points in the clusters h and l, respectively,
• nh,l is the number of points in clusters h and labeled l,
• n is the size of the dataset
• NMI values close to one indicate high similarity
between clusterings found and labels
• Values close to zero indicate high dissimilarity
between them
70. NORMALIZED MUTUAL INFORMATION: EXAMPLE
Partition a: [1,1,1,1,1,1,1, 2,2,2,2,2,2,2]
Partition b: [1,1,1,1,1,2,2, 1,2,2,2,2,2,2]
n = 14
nh:   h=1: 6   h=2: 8
nl:   l=1: 7   l=2: 7
nh,l   l=1   l=2
h=1    5     1
h=2    2     6
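The counts in the table are enough to compute NMI. The sketch below rebuilds them from the two partitions and, as an assumption, normalizes the mutual information by the geometric mean of the two entropies; the slides only say "a mean", so other variants (for example the arithmetic mean) would give slightly different values.

```python
import math
from collections import Counter

def normalized_mutual_information(labels, clusters):
    n = len(labels)
    n_l = Counter(labels)                  # points per label l
    n_h = Counter(clusters)                # points per found cluster h
    n_hl = Counter(zip(clusters, labels))  # points in cluster h with label l
    mutual_information = sum(
        (count / n) * math.log(n * count / (n_h[h] * n_l[l]))
        for (h, l), count in n_hl.items()
    )
    entropy_h = -sum((c / n) * math.log(c / n) for c in n_h.values())
    entropy_l = -sum((c / n) * math.log(c / n) for c in n_l.values())
    # Assumption: normalize by the geometric mean of the two entropies.
    return mutual_information / math.sqrt(entropy_h * entropy_l)

partition_a = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]  # labels
partition_b = [1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2]  # found clustering
nmi = normalized_mutual_information(partition_a, partition_b)
print(round(nmi, 2))
```

Comparing a partition with itself gives an NMI of 1, while the two partitions above, which disagree on three of the fourteen points, score much lower.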
71. EVALUATION WITHOUT GROUND TRUTH
• Use domain experts
• Use quality measures such as SSE
– SSE: the sum of the squared error for all clusters
• Use two or more clustering algorithms,
compare the results, and pick the algorithm
with the better quality measure
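SSE can be computed without any labels, which makes it usable for comparing clusterings when no ground truth exists. A minimal sketch, with two hypothetical clusterings of the same eight points (lower SSE indicates tighter clusters):

```python
def sum_of_squared_error(clusters):
    """SSE: squared distance of every point to its cluster centroid, summed over clusters."""
    total = 0.0
    for points in clusters:
        centroid = [sum(coords) / len(points) for coords in zip(*points)]
        total += sum(
            (value - center) ** 2
            for point in points
            for value, center in zip(point, centroid)
        )
    return total

# Hypothetical clusterings of the same eight points.
good = [[(1, 1), (2, 1), (1, 2), (2, 2)], [(4, 4), (4, 5), (5, 4), (5, 5)]]
poor = [[(1, 1), (2, 1), (1, 2), (5, 5)], [(2, 2), (4, 4), (4, 5), (5, 4)]]
print(sum_of_squared_error(good), sum_of_squared_error(poor))
```

SSE always decreases as the number of clusters grows, so it is only a fair comparison between clusterings with the same number of clusters.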
72. TEXT MINING
73. TEXT MINING
• In social media, most of the data that is
available online is in text format
• In general, the way to perform data mining is
to convert text data into tabular format and
then perform data mining on this data
• The process of converting text data into
tabular data is called vectorization
74. TEXT MINING PROCESS
A set of linguistic, statistical, and machine
learning techniques that model and structure
the information content of textual sources for
business intelligence, exploratory data analysis,
research, or investigation
75. TEXT PREPROCESSING
Text preprocessing aims to make the input
documents more consistent to facilitate text
representation, which is necessary for most text
analytics tasks
• Methods:
– Stop word removal
• Stop word removal eliminates words that appear on a stop word list,
i.e., words considered too general to carry
meaning
– e.g. the, a, is, at, which
– Stemming
• Stemming reduces inflected (or sometimes derived) words
to their stem, base or root form
– For example, “watch”, “watching”, “watched” are represented as
“watch”
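Both preprocessing steps can be sketched in a few lines. The stop word list below contains only the slide's examples, and the suffix-stripping rule is a deliberately crude stand-in for a real stemmer (such as the Porter stemmer); both are assumptions made for illustration.

```python
STOP_WORDS = {"the", "a", "is", "at", "which"}  # the slide's examples only
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule set, far cruder than a real stemmer

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The cat is watching which dog watched a watch")
print(tokens)
```

After preprocessing, "watching", "watched" and "watch" all map to the same token, so they are counted as one word in any later representation.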
76. TEXT REPRESENTATION
• The most common way to model documents
is to transform them into sparse numeric
vectors and then deal with them with linear
algebraic operations
• This representation is called “Bag of Words”
• Methods:
– Vector space model
– tf-idf
77. VECTOR SPACE MODEL
• In the vector space model, we start with a set
of documents, D
• Each document is a set of words
• The goal is to convert these textual
documents to vectors
• di : document i, wj,i : the weight for word j in document i
$d_i = (w_{1,i}, w_{2,i}, \dots, w_{N,i})$ for a vocabulary of N words
The weight can be set to 1 when the word exists in the document and 0 when
it does not. Alternatively, the weight can be set to the number of times the word is
observed in the document
78. VECTOR SPACE MODEL: AN EXAMPLE
• Documents:
– d1: data mining and social media mining
– d2: social network analysis
– d3: data mining
• Reference vector:
– (social, media, mining, network, analysis, data)
• Vector representation:
     analysis  data  media  mining  network  social
d1   0         1     1      1       0        1
d2   1         0     0      0       1        1
d3   0         1     0      1       0        0
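The table can be reproduced by checking, for each vocabulary term, whether it occurs in the document. In this sketch the vocabulary is listed in the table's column order, and "and" is left out of it as a stop word; the function names are illustrative.

```python
documents = {
    "d1": "data mining and social media mining",
    "d2": "social network analysis",
    "d3": "data mining",
}
vocabulary = ["analysis", "data", "media", "mining", "network", "social"]

def binary_vector(text, vocabulary):
    """Weight 1 if the word exists in the document, 0 otherwise."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]

def count_vector(text, vocabulary):
    """Weight equal to the number of times the word is observed."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

vectors = {name: binary_vector(text, vocabulary) for name, text in documents.items()}
for name, vector in vectors.items():
    print(name, vector)
```

Switching to `count_vector` changes only the "mining" entry of d1, which becomes 2 because the word appears twice.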
79. TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY)
tf-idf of term t, document d, and document corpus D is
calculated as follows:
tf-idf(t, d, D) = tf (t, d) * idf (t, D)
$idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$
– |D|: the total number of documents in
the corpus
– |{d ∈ D : t ∈ d}|: the number of documents where
the term t appears
80. TF-IDF: AN EXAMPLE
Consider words “apple” and “the” that appear
10 and 20 times in document 1 (d1), which
contains 100 words.
Consider |D| = 20 and word “apple” only
appearing in d1 and word “the” appearing in all
20 documents
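The effect of the idf factor on these two words can be checked with a short calculation. The result on the original slide is not available here, so the sketch makes two assumptions: tf is the word count divided by the document length, and the logarithm is base 10. Other conventions change the numbers but not the conclusion.

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    tf = term_count / doc_length                 # assumption: length-normalized tf
    idf = math.log10(num_docs / docs_with_term)  # assumption: base-10 logarithm
    return tf * idf

apple = tf_idf(term_count=10, doc_length=100, num_docs=20, docs_with_term=1)
the = tf_idf(term_count=20, doc_length=100, num_docs=20, docs_with_term=20)
print(apple, the)
```

Although "the" occurs twice as often as "apple" in d1, its tf-idf is zero because it appears in every document, while "apple" keeps a positive weight.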
81. TF-IDF: AN EXAMPLE
• Documents:
– d1: data mining and social media mining
– d2: social network analysis
– d3: data mining
• tf-idf representation:
               analysis  data  media  mining  network  social
df(w)          1         2     1      2       1        2
log(N/df(w))   0.48      0.18  0.48   0.18    0.48     0.18
d1, tf         0         1     1      2       0        1
d2, tf         1         0     0      0       1        1
d3, tf         0         1     0      1       0        0
d1, tf-idf     0.00      0.18  0.48   0.35    0.00     0.18
d2, tf-idf     0.48      0.00  0.00   0.00    0.48     0.18
d3, tf-idf     0.00      0.18  0.00   0.18    0.00     0.00
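The table's values follow from raw term counts for tf and a base-10 logarithm for idf (log10(3/2) ≈ 0.18 and log10(3) ≈ 0.48). A sketch that recomputes them, with the vocabulary fixed to the table's six columns and the function name chosen for illustration:

```python
import math

documents = {
    "d1": "data mining and social media mining",
    "d2": "social network analysis",
    "d3": "data mining",
}
vocabulary = ["analysis", "data", "media", "mining", "network", "social"]

def tf_idf_vectors(documents, vocabulary):
    tokenized = {name: text.split() for name, text in documents.items()}
    n_docs = len(tokenized)
    # df(w): number of documents that contain word w.
    df = {w: sum(1 for words in tokenized.values() if w in words) for w in vocabulary}
    # Raw counts for tf and a base-10 logarithm for idf, matching the table's values.
    return {
        name: [words.count(w) * math.log10(n_docs / df[w]) for w in vocabulary]
        for name, words in tokenized.items()
    }

vectors = tf_idf_vectors(documents, vocabulary)
for name, vector in vectors.items():
    print(name, [round(v, 2) for v in vector])
```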
82. SENTIMENT ANALYSIS
• Sentiment analysis or opinion mining refers to
the application of natural language
processing, computational linguistics, and text
analytics to identify and extract subjective
information in source materials
• It aims to determine the attitude of a speaker
or a writer with respect to some topic or the
overall contextual polarity of a document.
83. POLARITY ANALYSIS
• The basic task in opinion mining is classifying
the polarity of a given document or text
– The polarity could be positive, negative, or neutral
• Methods:
– Naïve Bayes
– Pointwise Mutual Information (PMI)
84. MEASURING POLARITY, NAÏVE BAYES
• Bayes’ rule: $P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}$
• If we assume that the occurrences of features
(words) in the document are independent: $P(c \mid d) \propto P(c) \prod_{i} P(f_i \mid c)$
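Under the independence assumption, classifying polarity reduces to adding up log-probabilities of the individual words for each class. The sketch below is a minimal word-count version; the training sentences are hypothetical, and the add-one (Laplace) smoothing is an assumption that is not taken from the slides.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, polarity) pairs."""
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in labeled_docs:
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def classify(text, model):
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # log P(c) + sum of log P(word | c), using the independence assumption.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing is an assumption, not from the slides.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data.
training = [
    ("great movie loved it", "positive"),
    ("wonderful acting great plot", "positive"),
    ("terrible movie hated it", "negative"),
    ("boring plot terrible acting", "negative"),
]
model = train_naive_bayes(training)
print(classify("great wonderful plot", model))
print(classify("terrible boring movie", model))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on longer documents.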
85. MEASURING POLARITY, MAXIMUM ENTROPY
$P_{ME}(c \mid d) = \frac{1}{Z(d)} \exp\left(\sum_{i} \lambda_{i,c} F_{i,c}(d, c)\right)$
• Z(d) is the normalization factor
• λi,c is the feature-weight parameter and shows the
importance of each feature
• Fi,c is defined as a feature/class function for
feature fi and class c
86. MEASURING POLARITY, POINTWISE MUTUAL INFORMATION
$PMI(word_1, word_2) = \log_2 \frac{P(word_1\ word_2)}{P(word_1)\, P(word_2)}$
• P(word) is the number of results returned by a search engine in response to a search
for the term word
• P(word1 word2) is the number of results for a joint search of word1 and word2
together
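PMI can be estimated from hit counts once they are turned into relative frequencies. In the sketch below the hit counts, the `total_pages` normalizer and the base-2 logarithm are all assumptions made for illustration.

```python
import math

def pmi(hits_both, hits_word1, hits_word2, total_pages):
    """Pointwise mutual information estimated from search-engine hit counts."""
    p_both = hits_both / total_pages
    p_word1 = hits_word1 / total_pages
    p_word2 = hits_word2 / total_pages
    return math.log2(p_both / (p_word1 * p_word2))

# Hypothetical hit counts.
score = pmi(hits_both=50, hits_word1=1000, hits_word2=200, total_pages=1_000_000)
print(score)
```

A positive score means the two words co-occur more often than independence would predict. Comparing a phrase's PMI with a positive reference word against its PMI with a negative reference word gives a polarity estimate.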
87. Mohammad-Ali Abbasi (Ali)
Ali is a Ph.D. student at the Data Mining
and Machine Learning Lab, Arizona
State University.
His research interests include Data
Mining, Machine Learning, Social
Computing, and Social Media Behavior
Analysis.
http://www.public.asu.edu/~mabbasi2/
Editor's Notes
Weather dataset
Garbage in and garbage out
Source?
Using the inverse of similarity as a distance measure
$\frac{\partial}{\partial W} (W^T X^T Y) = X^T Y$; $\frac{\partial}{\partial W} (W^T X^T X W) = 2 X^T X W$
SVD form of matrix X: $X = U \Sigma V^T$. U and V are real-valued and each is an orthogonal matrix, so $U^T = U^{-1}$. V is orthogonal, so $(V^T)^{-1} = V$ and $(V \Sigma^2 V^T)^{-1} = (V^T)^{-1} \Sigma^{-2} V^T = V \Sigma^{-2} V^T$
Solve w_0 first
Source?
TP – the intersection of the two oval shapes; TN – the rectangle minus the two oval shapes; FP – the circle minus the blue part; FN – the blue oval minus the circle. Accuracy = (TP+TN)/(TP+TN+FP+FN); Precision = TP/(TP+FP); Recall = TP/(TP+FN)
TP+FP = C(6,2) + C(8,2) = 15+28 = 43; FP counts the wrong pairs within each cluster; FN counts the similar pairs but wrongly put into different clusters; TN counts dissimilar pairs in different clusters
N – the total number of data points.
number of documents where the term t appears
df(w) is document frequency for word w. tf is the term frequency in a document.
SO – Sentiment Orientation; “Thumbs Up, Thumbs Down …”, Peter Turney