Upcoming SlideShare
×

# Data Mining: an Introduction

4,403 views

Published on

An Introduction to Data Mining,
Clustering, Classification, Regression, Text Mining

Published in: Data & Analytics
9 Likes
Statistics
Notes
• Full Name
Comment goes here.

Are you sure you want to Yes No
• Be the first to comment

Views
Total views
4,403
On SlideShare
0
From Embeds
0
Number of Embeds
162
Actions
Shares
0
264
0
Likes
9
Embeds 0
No embeds

No notes for slide
• Weather dataset
• Garbage in and garbage out
• Source?
• Using the inverse of similarity as a distance measure
• \\frac{\\partial}{\\partial W} {(W^T X^T Y )} = X^T Y\\frac{\\partial}{\\partial W} { (W^T X^T X W) } = 2 X^T X W
• SVD form of matrix X: X = U \\Sigma V^T U and V are real valued, they each are an orthogonal matrixU^T = U^{-1}V is orthogonal so: V^T^{-1} = V(V \\Sigma^2 V^T)^{-1} = V^T^{-1} \\Sigma^{-2} V^T = V \\Sigma^{-2} V^T
• Solve w_0 first
• Source?
• TP – the intersection of the two oval shapes; TN – the rectangle minus the two oval shapes; FP – the circle minus the blue part; FN – the blue oval minus the circle. Accuracy = (TP+TN)/(TP+FP+FN+FN); Precision = TP/(TP+FP); Recall = TP/(TP+FN)
• TP+FP = C(6,2) + C(8,2) = 15+28 = 43; FP counts the wrong pairs within each cluster; FN counts the similar pairs but wrongly put into different clusters; TN counts dissimilar pairs in different clusters
• N – the total number of data points.
• number of documents where the term t appears
• df(w) is document frequency for word w. tf is the term frequency in a document.
• SO – Sentiment Orientation“Thumbs Up, Thumbs Down …” Peter Turney
• ### Data Mining: an Introduction

1. 1. DATA MINING AND MACHINE LEARNING IN A NUTSHELL AN INTRODUCTION TO DATA MINING Mohammad-Ali Abbasi http://www.public.asu.edu/~mabbasi2/ SCHOOL OF COMPUTING, INFORMATICS, AND DECISION SYSTEMS ENGINEERING ARIZONA STATE UNIVERSITY http://dmml.asu.edu/Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 1
2. 2. INTRODUCTION • Data production rate has been increased dramatically (Big Data) and we are able store much more data than before – E.g., purchase data, social media data, mobile phone data • Businesses and customers need useful or actionable knowledge and gain insight from raw data for various purposes – It’s not just searching data or databases Data mining helps us to extract new information and uncover hidden patterns out of the stored and streaming dataData Mining and Machine Learning in a nutshell An Introduction to Data Mining 2
3. 3. DATA MINING The process of discovering hidden patterns in large data sets It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems • Extracting or “mining” knowledge from large amounts of data, or big data • Data-driven discovery and modeling of hidden patterns in big data • Extracting implicit, previously unknown, unexpected, and potentially useful information/knowledge from dataData Mining and Machine Learning in a nutshell An Introduction to Data Mining 3
4. 4. DATA MINING STORIES • “My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection • The NSA is using data mining to analyze telephone call data to track al’Qaeda activities • Walmart uses data mining to control product distribution based on typical customer buying patterns at individual storesData Mining and Machine Learning in a nutshell An Introduction to Data Mining 4
5. 5. DATA MINING VS. DATABASES • Data mining is the process of extracting hidden and actionable patterns from data • Database systems store and manage data – Queries return part of stored data – Queries do not extract hidden patterns • Examples of querying databases – Find all employees with income more than $250K – Find top spending customers in last month – Find all students from engineering college with GPA more than averageData Mining and Machine Learning in a nutshell An Introduction to Data Mining 5 6. 6. EXAMPLES OF DATA MINING APPLICATIONS • Identifying fraudulent transactions of a credit card or spam emails – You are given a user’s purchase history and a new transaction, identify whether the transaction is fraud or not; – Determine whether a given email is spam or not • Extracting purchase patterns from existing records – beer ⇒ dippers (80%) • Forecasting future sales and needs according to some given samples • Extracting groups of like-minded people in a given networkData Mining and Machine Learning in a nutshell An Introduction to Data Mining 6 7. 7. BASIC DATA MINING TASKS • Classification – Assign data into predefined classes • Spam Detection, fraudulent credit card detection • Regression – Predict a real value for a given data instance • Predict the price for a given house • Clustering – Group similar items together into some clusters • Detect communities in a given social networkData Mining and Machine Learning in a nutshell An Introduction to Data Mining 7 8. 8. DATAData Mining and Machine Learning in a nutshell An Introduction to Data Mining 8 9. 9. DATA INSTANCES • A collection of properties and features related to an object or person – A patient’s medical record – A user’s profile – A gene’s information • Instances are also called examples, records, data points, or observations Data Instance: Features or Attributes Class LabelData Mining and Machine Learning in a nutshell An Introduction to Data Mining 9 10. 10. DATA TYPES • Nominal (categorical) – No comparison is defined – E.g., {male, female} • Ordinal – Comparable but the difference is not defined – E.g., {Low, medium, high} • Interval – Deduction and addition is defined but not division – E.g., 3:08 PM, calendar dates • Ratio – E.g., Height, weight, money quantitiesData Mining and Machine Learning in a nutshell An Introduction to Data Mining 10 11. 11. SAMPLE DATASET outlook temperature humidity windy play sunny 85 85 FALSE no sunny 80 90 TRUE no overcast 83 86 FALSE yes rainy 70 96 FALSE yes rainy 68 80 FALSE yes rainy 65 70 TRUE no overcast 64 65 TRUE yes sunny 72 95 FALSE no sunny 69 70 FALSE yes rainy 75 80 FALSE yes sunny 75 70 TRUE yes overcast 72 90 TRUE yes overcast 81 75 FALSE yes rainy 71 91 TRUE no Interval Ordinal NominalData Mining and Machine Learning in a nutshell An Introduction to Data Mining 11 12. 12. DATA QUALITY When making data ready for data mining algorithms, data quality need to be assured • Noise – Noise is the distortion of the data • Outliers – Outliers are data points that are considerably different from other data points in the dataset • Missing Values – Missing feature values in data instances • Duplicate dataData Mining and Machine Learning in a nutshell An Introduction to Data Mining 12 13. 13. DATA PREPROCESSING • Aggregation – when multiple attributes need to be combined into a single attribute or when the scale of the attributes change • Discretization – From continues values to discrete values • Feature Selection – Choose relevant features • Feature Extraction – Creating a mapping of new features from original features • Sampling – Random Sampling – Sampling with or without replacement – Stratified SamplingData Mining and Machine Learning in a nutshell An Introduction to Data Mining 13 14. 14. CLASSIFICATIONData Mining and Machine Learning in a nutshell An Introduction to Data Mining 14 15. 15. CLASSIFICATION Learning patterns from labeled data and classify new data with labels (categories) – For example, we want to classify an e-mail as "legitimate" or "spam" ClassifierData Mining and Machine Learning in a nutshell An Introduction to Data Mining 15 16. 16. CLASSIFICATION: THE PROCESS • In classification, we are given a set of labeled examples • These examples are records/instances in the format (x, y) where x is a vector and y is the class attribute, commonly a scalar • The classification task is to build model that maps x to y • Our task is to find a mapping f such that f(x) = yData Mining and Machine Learning in a nutshell An Introduction to Data Mining 16 17. 17. CLASSIFICATION: THE PROCESSData Mining and Machine Learning in a nutshell An Introduction to Data Mining 17 18. 18. CLASSIFICATION: AN EMAIL EXAMPLE • A set of emails is given where users have manually identified spam versus non-spam • Our task is to use a set of features such as words in the email (x) to identify spam/non- spam status of the email (y) • In this case, classes are y = {spam, non-spam} • What would it be dealt with in a social setting?Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 18 19. 19. CLASSIFICATION ALGORITHMS • Decision tree learning • Naive Bayes learning • K-nearest neighbor classifierData Mining and Machine Learning in a nutshell An Introduction to Data Mining 19 20. 20. DECISION TREE • A decision tree is learned from the dataset (training data with known classes) and later applied to predict the class attribute value of new data (test data with unknown classes) where only the feature values are knownData Mining and Machine Learning in a nutshell An Introduction to Data Mining 20 21. 21. DECISION TREE INDUCTIONData Mining and Machine Learning in a nutshell An Introduction to Data Mining 21 22. 22. ID3, A DECISION TREE ALGORITHM Use information gain (entropy) to determine how well an attribute separates the training data according to the class attribute value – p+ is the proportion of positive examples in D – p- is the proportion of negative examples in D In a dataset containing ten examples, 7 have a positive class attribute value and 3 have a negative class attribute value [7+, 3-]: If the numbers of positive and negative examples in the set are equal, then the entropy is 1Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 22 23. 23. DECISION TREE: EXAMPLE 1 outlook temperature humidity windy play sunny 85 85 FALSE no sunny 80 90 TRUE noovercast 83 86 FALSE yes rainy 70 96 FALSE yes rainy 68 80 FALSE yes rainy 65 70 TRUE noovercast 64 65 TRUE yes sunny 72 95 FALSE no sunny 69 70 FALSE yes rainy 75 80 FALSE yes sunny 75 70 TRUE yesovercast 72 90 TRUE yesovercast 81 75 FALSE yes rainy 71 91 TRUE noData Mining and Machine Learning in a nutshell An Introduction to Data Mining 23 24. 24. DECISION TREE: EXAMPLE 2 Class Labels Learned Decision Tree 1 Learned Decision Tree 2Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 24 25. 25. NAIVE BAYES CLASSIFIER For two random variables X and Y, Bayes theorem states that, class variable the instance features Then class attribute value for instance X Assuming that variables are independentData Mining and Machine Learning in a nutshell An Introduction to Data Mining 25 26. 26. NBC: AN EXAMPLEData Mining and Machine Learning in a nutshell An Introduction to Data Mining 26 27. 27. NEAREST NEIGHBOR CLASSIFIER • k-nearest neighbor employs the neighbors of a data point to perform classification • The instance being classified is assigned the label that the majority of k neighbors’ labels • When k = 1, the closest neighbor’s label is used as the predicted label for the instance being classified • For determining the neighbors, distance is computed based on some distance metric, e.g., Euclidean distanceData Mining and Machine Learning in a nutshell An Introduction to Data Mining 27 28. 28. K-NN: ALGORITHM 1. The dataset, number of neighbors (k), and the instance i is given 2. Compute the distance between i and all other data points in the dataset 3. Pick k closest neighbors 4. The class label for the data point i is the one that the majority holds (if there are more than one class, select one of them randomly)Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 28 29. 29. K-NEAREST NEIGHBOR: EXAMPLE k = 10 Class label = ? k=5 k=3 • Depending on the k, different labels can be predicted for the green circle • In our example k = 3 and k = 5 generate different labels for the instance • K= 10 we can choose either triangle or rectangleData Mining and Machine Learning in a nutshell An Introduction to Data Mining 30 30. 30. K-NEAREST NEIGHBOR: EXAMPLE Similarity between row 8 and other data instances; (Similarity = 1 if attributes have the same value, otherwise similarity = 0) Data instance Outlook Temperature Humidity Similarity Label K Prediction 2 1 1 1 3 N 1 N 1 1 0 1 2 N 2 N 4 0 1 1 2 Y 3 N 3 0 0 1 1 Y 4 ? 5 1 0 0 1 Y 5 Y 6 0 0 0 0 N 6 ? 7 0 0 0 0 Y 7 YData Mining and Machine Learning in a nutshell An Introduction to Data Mining 31 31. 31. EVALUATING CLASSIFICATION PERFORMANCE • As the class labels are discrete, we can measure the accuracy by dividing number of correctly predicted labels (C) by the total number of instances (N) • Accuracy = C/N • Error rate = 1 - Accuracy • More sophisticated approaches of evaluation will be discussed laterData Mining and Machine Learning in a nutshell An Introduction to Data Mining 32 32. 32. REGRESSIONData Mining and Machine Learning in a nutshell An Introduction to Data Mining 33 33. 33. REGRESSION Regression analysis includes techniques of modeling and analyzing the relationship between a dependent variable and one or more independent variables • Regression analysis is widely used for prediction and forecasting • It can be used to infer relationships between the independent and dependent variablesData Mining and Machine Learning in a nutshell An Introduction to Data Mining 34 34. 34. REGRESSION In regression, we deal with real numbers as class values (Recall that in classification, class values or labels are categories) y ≈ f(X) Dependent variable Regressors y R x1, x2, …, xm Our task is to find the relation between y and the vector (x1, x2, …, xm)Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 35 35. 35. LINEAR REGRESSION In linear regression, we assume the relation between the class attribute y and feature set x to be linear where w represents the vector of regression coefficients • The problem of regression can be solved by estimating w and using the provided dataset and the labels y – The least squares is often used to solve the problemData Mining and Machine Learning in a nutshell An Introduction to Data Mining 36 36. 36. SOLVING LINEAR REGRESSION PROBLEMS • The problem of regression can be solved by estimating w and using the dataset provided and the labels y – “Least squares” is a popular method to solve regression problemsData Mining and Machine Learning in a nutshell An Introduction to Data Mining 37 37. 37. LEAST SQUARES Find W such that minimizing ǁY - XWǁ2 for regressors X and labels YData Mining and Machine Learning in a nutshell An Introduction to Data Mining 38 38. 38. LEAST SQUARESData Mining and Machine Learning in a nutshell An Introduction to Data Mining 39 39. 39. REGRESSION COEFFICIENTS • When there is only one independent variable: y = w0 + w1x n x = å xi 1 n i • Two independent variables y = w0 + w1x1 + w2x2Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 40 40. 40. LINEAR REGRESSION: EXAMPLE Years of Salary ($K) experience 3 30 8 57 9 64 13 72 3 36 6 43 11 59 21 90 1 20 16 83Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 41
41. 41. EVALUATING REGRESSION PERFORMANCE • The labels cannot be predicted precisely • It is needed to set a margin to accept or reject the predictions – For example, when the observed temperature is 71 any prediction in the range of 71±0.5 can be considered as correct predictionData Mining and Machine Learning in a nutshell An Introduction to Data Mining 43
42. 42. CLUSTERINGData Mining and Machine Learning in a nutshell An Introduction to Data Mining 44
43. 43. CLUSTERING Grouping together items that are similar in some way – according to some criteria • Clustering is a form of unsupervised learning – The clustering algorithms do not have examples showing how the samples should be group together • The clustering algorithms look for patterns or structures in the data that are of interest • Clustering algorithms group together similar itemsData Mining and Machine Learning in a nutshell An Introduction to Data Mining 45
44. 44. CLUSTERING: EXAMPLEData Mining and Machine Learning in a nutshell An Introduction to Data Mining 46
45. 45. MEASURING SIMILARITY IN CLUSTERING ALGORITHMS • The goal is to group together similar items • Different similarity measures can be used to find similar items • Usually similarity measures are critical to clustering algorithms The most popular (dis)similarity measure for continuous features are Euclidean Distance and Pearson Linear CorrelationData Mining and Machine Learning in a nutshell An Introduction to Data Mining 47
46. 46. EUCLIDEAN DISTANCE – A DISSIMILAR MEASURE • Here n is the number of dimensions in the data vectorData Mining and Machine Learning in a nutshell An Introduction to Data Mining 48
47. 47. PEARSON LINEAR CORRELATION n å (x i - x )(yi - y ) r (x, y) = n i=1 n å (x i - x )2 å (y i - y )2 i=1 i=1 1 n x = å xi n i 1 n y = å yi n i • We’re shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean = 0 and std = 1) • Always between –1 and +1 (perfectly anti-correlated and perfectly correlated)Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 49
48. 48. SIMILARITY MEASURES: MORE DEFINITIONSData Mining and Machine Learning in a nutshell An Introduction to Data Mining 50
49. 49. CLUSTERING • Distance-based algorithms – K-Means • Hierarchical algorithmsData Mining and Machine Learning in a nutshell An Introduction to Data Mining 51
50. 50. K-MEANS k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean • Finding the global optimal of k partitions is computationally expensive (NP-hard). However, there are efficient heuristic algorithms that are commonly employed and converge quickly to an optimum that might not be global.Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 52
51. 51. K-MEANS • Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk} so as to minimize the within- cluster sum of squares: where μi is the mean of points in SiData Mining and Machine Learning in a nutshell An Introduction to Data Mining 53
52. 52. K-MEANS: ALGORITHM Given data points xi and an initial set of k centroids m1(1),…,mk(1), the algorithm proceeds as follows: • Assignment step: Assign each data point to the cluster Si with the closest centroid each data point goes into exactly one cluster) • Update step: Calculate the new means to be the centroid of the data points in the clusterData Mining and Machine Learning in a nutshell An Introduction to Data Mining 54
53. 53. K-MEANS: AN EXAMPLE Data Cluster 1 Cluster 2 X Y point Step Data point Centroid Data point Centroid 1 1 1 2 2 1 1 1 (1.0, 1.0) 2 (2.0, 1.0) 3 1 2 4 2 2 5 4 4 6 4 5 7 5 4 8 5 5Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 56
54. 54. RUNNING K-MEANS ON IRIS DATASETData Mining and Machine Learning in a nutshell An Introduction to Data Mining 58
55. 55. HIERARCHICAL CLUSTERING Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. • Strategies for hierarchical clustering generally fall into two types: – Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. – Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 60
56. 56. HIERARCHICAL ALGORITHMS • Initially n data points are considered as either 1 or n clusters in hierarchical clustering • These clusters are gradually split or merged (divisive or agglomerative hierarchical clustering algorithms), depending on the type of an algorithm • Until the desired number of clusters are reachedData Mining and Machine Learning in a nutshell An Introduction to Data Mining 61
57. 57. HIERARCHICAL AGGLOMERATIVE CLUSTERING • Start with each data point as a cluster • Keep merging the most similar pairs of data points/clusters until only one big cluster left • This is called a bottom-up or agglomerative method This produces a binary tree or dendrogram – The final cluster is the root and each data point is a leaf – The height of the bars indicate how close the points areData Mining and Machine Learning in a nutshell An Introduction to Data Mining 62
58. 58. HIERARCHICAL CLUSTERING: AN EXAMPLEData Mining and Machine Learning in a nutshell An Introduction to Data Mining 63
59. 59. MERGING THE DATA POINTS IN HIERARCHICAL CLUSTERING • Average Linkage – Each cluster ci is associated with a mean vector i which is the mean of all the data items in the cluster – The distance between two clusters ci and cj is then just d( i , j ) • Single Linkage – The minimum of all pairwise distances between points in the two clusters • Complete Linkage – The maximum of all pairwise distances between points in the two clustersData Mining and Machine Learning in a nutshell An Introduction to Data Mining 64
60. 60. LINKAGE IN HIERARCHICAL CLUSTERING: EXAMPLE Single Linkage Average Linkage Complete LinkageData Mining and Machine Learning in a nutshell An Introduction to Data Mining 65
61. 61. EVALUATING THE CLUSTERINGS When we are given objects of two different kinds, the perfect clustering would be that objects of the same type are clustered together. • Evaluation with ground truth • Evaluation without ground truthData Mining and Machine Learning in a nutshell An Introduction to Data Mining 66
62. 62. EVALUATION WITH GROUND TRUTH When ground truth is available, the evaluator has prior knowledge of what a clustering should be – That is, we know the correct clustering assignments. • Measures – Precision and Recall, or F-Measure – Purity – Normalized Mutual Information (NMI)Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 67
63. 63. PRECISION AND RECALL • True Positive (TP) : • False Negative (FN) : – when similar points are assigned to – when similar points are assigned to the same clusters different clusters – This is considered a correct – This is considered an incorrect decision. decision • True Negative (TN) : • False Positive (FP) : – when dissimilar points are – when dissimilar points are assigned to different clusters assigned to the same clusters – This is considered a correct – This is considered an incorrect decision decisionData Mining and Machine Learning in a nutshell An Introduction to Data Mining 68
64. 64. PRECISION AND RECALL: EXAMPLE 1Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 69
65. 65. F-MEASURE • To consolidate precision and recall into one measure, we can use the harmonic mean of precision of recall Computed for the same example, we get F = 0.54Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 71
66. 66. PURITY • In purity, we assume the majority of a cluster represents the cluster • Hence, we use the label of the majority against the label of each member to evaluate • the algorithm easily tampered; consider points Purity can be • The purity is then defined assize 1) or very large being singleton clusters (of the fraction of instances that have labels equal to the clusters. cluster’s majority label • In both cases, purity does not make much sense. where Lj defines label j (ground truth) and Mi defines the majority label for cluster iData Mining and Machine Learning in a nutshell An Introduction to Data Mining 72
67. 67. MUTUAL INFORMATION The mutual information of two random variables is a quantity that measures the mutual dependence of the two random variables • p(x,y) is the joint probability distribution function of X and Y, • p(x) and p(y) are the marginal probability distribution functions of X and Y respectivelyData Mining and Machine Learning in a nutshell An Introduction to Data Mining 73
68. 68. NORMALIZED MUTUAL INFORMATION Normalized Mutual Information has been derived from information theory where the Mutual Information (MI) between the clusterings found and the labels is normalized by the upper bound of (MI) which is a mean of the entropies (H) of labels and clusterings foundData Mining and Machine Learning in a nutshell An Introduction to Data Mining 74
69. 69. NORMALIZED MUTUAL INFORMATION • where l and h are labels and found clusterings, • nh and nl are the number of data points in the clusters h and l, respectively, • nh,l is the number of points in clusters h and labeled l, • n is the size of the dataset • NMI values close to one indicate high similarity between clusterings found and labels • Values close to zero indicate high dissimilarity between themData Mining and Machine Learning in a nutshell An Introduction to Data Mining 75
70. 70. NORMALIZED MUTUAL INFORMATION: EXAMPLE Partition a: [1,1,1,1,1,1,1, 2,2,2,2,2,2,2] Partition b: [1,1,1,1,1,2,2, 1,2,2,2,2,2,2] nh nl nh,l l=1 l=2 n = 14 h=1 6 l=1 7 h=1 5 1 h=2 8 l=2 7 h=2 2 6Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 76
71. 71. EVALUATION WITHOUT GROUND TRUTH • Use domain experts • Use quality measures such as SSE – SSE: the sum of the squared error for all clusters • Use more than two clustering algorithms and compare the results and pick the algorithm with better quality measureData Mining and Machine Learning in a nutshell An Introduction to Data Mining 77
72. 72. TEXT MININGData Mining and Machine Learning in a nutshell An Introduction to Data Mining 78
73. 73. TEXT MINING • In social media, most of the data that is available online is in text format • In general, the way to perform data mining is to convert text data into tabular format and then perform data mining on this data • The process of converting text data into tabular data is called vectorizationData Mining and Machine Learning in a nutshell An Introduction to Data Mining 79
74. 74. TEXT MINING PROCESS A set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigationData Mining and Machine Learning in a nutshell An Introduction to Data Mining 80
75. 75. TEXT PREPROCESSING Text preprocessing aims to make the input documents more consistent to facilitate text representation, which is necessary for most text analytics tasks • Methods: – Stop word removal • Stop word removal eliminates words using a stop word list, in which the words are considered more general and meaningless – e.g. the, a, is, at, which – Stemming • Stemming reduces inflected (or sometimes derived) words to their stem, base or root form – For example, “watch”, “watching”, “watched” are represented as “watch”Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 81
76. 76. TEXT REPRESENTATION • The most common way to model documents is to transform them into sparse numeric vectors and then deal with them with linear algebraic operations • This representation is called “Bag of Words” • Methods: – Vector space model – tf-idfData Mining and Machine Learning in a nutshell An Introduction to Data Mining 82
77. 77. VECTOR SPACE MODEL • In the vector space model, we start with a set of documents, D • Each document is a set of words • The goal is to convert these textual documents to vectors • di : document i, wj,i : the weight for word j in document i The weight can be set to 1 when the word exist in the document and 0 when it does not. Or we can set this weight to the number of times the word is observed in the documentData Mining and Machine Learning in a nutshell An Introduction to Data Mining 83
78. 78. VECTOR SPACE MODEL: AN EXAMPLE • Documents: – d1: data mining and social media mining – d2: social network analysis – d3: data mining • Reference vector: – (social, media, mining, network, analysis, data) • Vector representation: analysis data media mining network social d1 0 1 1 1 0 1 d2 1 0 0 0 1 1 d3 0 1 0 1 0 0Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 84
79. 79. TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY) tf-idf of term t, document d, and document corpus D is calculated as follows: tf-idf(t, d, D) = tf (t, d) * idf (t, D) The total number of documents in the corpus The number of documents where the term t appearsData Mining and Machine Learning in a nutshell An Introduction to Data Mining 85
80. 80. TF-IDF: AN EXAMPLE Consider words “apple” and “the” that appear 10 and 20 times in document 1 (d1), which contains 100 words. Consider |D| = 20 and word “apple” only appearing in d1 and word “the” appearing in all 20 documentsData Mining and Machine Learning in a nutshell An Introduction to Data Mining 86
81. 81. TF-IDF: AN EXAMPLE • Documents: – d1: data mining and social media mining – d2: social network analysis – d3: data mining • tf-idf representation: analysis data media mining network social df(w) 1 2 1 2 1 2 log(N/df(w)) 0.48 0.18 0.48 0.18 0.48 0.18 d1, tf 0 1 1 2 0 1 d2, tf 1 0 0 0 1 1 d3, tf 0 1 0 1 0 0 d1, tf-idf 0.00 0.18 0.48 0.35 0.00 0.18 d2, tf-idf 0.48 0.00 0.00 0.00 0.48 0.18 d3, tf-idf 0.00 0.18 0.00 0.18 0.00 0.00Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 87
82. 82. SENTIMENT ANALYSIS • Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials • It aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document.Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 88
83. 83. POLARITY ANALYSIS • The basic task in opinion mining is classifying the polarity of a given document or text – The polarity could be positive, negative, or neutral • Methods: – Naïve Bayes – Pointwise Mutual Information (PMI)Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 89
84. 84. MEASURING POLARITY, NAÏVE BAYES • Bayes’ rule: • If we consider that the occurrence of features (words) in the document are independentData Mining and Machine Learning in a nutshell An Introduction to Data Mining 90
85. 85. MEASURING POLARITY, MAXIMUM ENTROPY • Z(d) is the normalization factor • is feature-weight parameter and shows the importance of each feature • Fi,c is defined as a feature/class function for feature fi and class cData Mining and Machine Learning in a nutshell An Introduction to Data Mining 91
86. 86. MEASURING POLARITY, POINTWISE MUTUAL INFORMATION • P(word) is the number of results returned by search engine in response to search for term word • P(word1 word2) is the number of results for mutual search of word1 and word2 togetherData Mining and Machine Learning in a nutshell An Introduction to Data Mining 92
87. 87. Mohammad-Ali Abbasi (Ali), Ali, is a Ph.D. student at Data Mining and Machine Learning Lab, Arizona State University. His research interests include Data Mining, Machine Learning, Social Computing, and Social Media Behavior Analysis. http://www.public.asu.edu/~mabbasi2/Data Mining and Machine Learning in a nutshell An Introduction to Data Mining 93