01 Machine Learning Introduction

1. Machine Learning for Data Mining Introduction Andres Mendez-Vazquez May 13, 2015 1 / 56

2. Outline: (1) Why are we interested in Analyzing Data? Intuitive Definition: The 3V's; Complexity; Data Everywhere. (2) Machine Learning: Machine Learning Process; Features; Classification; Clustering Analysis. (3) Data Mining: Definition; Applications; Example: Frequent Itemsets. (4) Hardware Support: ASICs; GPUs. (5) Projects: What projects can you do? 2 / 56

4. Intuitive Definition: Volume When looking at the Volumes of Information, we have: Volumes of it: Terabyte ($10^{12}$ bytes), Petabyte ($10^{15}$ bytes) and UP!!! Examples of these Volumes are: 1 Records 2 Transactions 3 Web Searches 4 etc. 4 / 56

9. However Something Notable What constitutes truly “high” volume varies by industry and even geography!!! Simply look at the DNA data for a cellular cycle. Example 5 / 56

11. Intuitive Definition: Variety When looking at the Structure of the Information, we have: Variety like there is no tomorrow: it is structured, semi-structured, or unstructured. So, do you have some examples of structures in Information? 6 / 56

14. Intuitive Definition: Velocity When looking at the Velocity of this Information: Data in Motion!!! Velocity: Dynamic Generation, Real-Time Generation. Problems with that: Latency, the lag time between capture or generation and when the data is available!!! 7 / 56

19. For example Imagine that I have a stream of $m = 10^{25}$ integers with values ranging over $[a_1, ..., a_n]$, with $n = 10,000,000$. Now, somebody asks you to find the most frequent item!!! A naive algorithm: 1 Take a hash table with a counter per item. 2 Then, put the numbers in the hash table, incrementing the counter of each one as it arrives. Problems: Which problems do we have? 8 / 56
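The naive algorithm above can be sketched in a few lines of Python. This is only an illustration; the function name `most_frequent` is an assumption, not something from the slides. It also hints at the problem: the table may need one counter per distinct value, so up to n counters live in memory.

```python
from collections import Counter


def most_frequent(stream):
    """Return (item, count) for the most frequent item of an iterable stream.

    Naive approach: one hash-table counter per distinct item, so memory
    grows with the number of distinct values seen.
    """
    counts = Counter()
    for item in stream:
        counts[item] += 1  # hash-table lookup plus increment
    if not counts:
        raise ValueError("empty stream")
    return counts.most_common(1)[0]
```

For example, `most_frequent([3, 1, 3, 2, 3, 1])` scans the list once and reports item 3 with its count.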

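22. However There is the Count-Min Sketch Algorithm, introduced by Cormode and Muthukrishnan in 2004 (the closely related Count Sketch is due to Charikar, Chen and Farach-Colton, 2002). With Properties: Space Used: $O\left(\frac{1}{\epsilon} \log\frac{1}{\delta} \cdot (\log m + \log n)\right)$; Error Probability: $\delta$; Error: $\epsilon$. 9 / 56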
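A minimal sketch of a Count-Min Sketch in Python, under some stated assumptions: items are non-negative integers, the table uses the usual sizing of ceil(e/epsilon) columns and ceil(ln(1/delta)) rows, and the hash functions are of the form ((a*x + b) mod p) mod width with a fixed Mersenne prime p. The class and method names are illustrative, not taken from the slides.

```python
import math
import random


class CountMinSketch:
    """Approximate frequency counts for a stream of non-negative integers."""

    def __init__(self, epsilon, delta, seed=0):
        # Usual sizing: width controls the error, depth the failure probability.
        self.width = math.ceil(math.e / epsilon)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.prime = (1 << 61) - 1  # Mersenne prime used by the hash family
        rng = random.Random(seed)
        # One (a, b) pair per row; an assumed pairwise-independent family.
        self.params = [
            (rng.randrange(1, self.prime), rng.randrange(0, self.prime))
            for _ in range(self.depth)
        ]

    def _index(self, row, item):
        a, b = self.params[row]
        return ((a * item + b) % self.prime) % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess
        # and it never underestimates the true count.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )
```

The memory used depends only on epsilon and delta, not on the number of distinct items, which is the point of the comparison with the hash-table approach.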

26. Complexity Given all these things It is necessary to correlate and share data across entities. It is necessary to link, match and transform data across business entities and systems. With this... Complexity goes through the roof!!! 11 / 56

29. And it is through the roof!!! Linking open-data community project 12 / 56

30. Cautionary Tale Something Notable: In 1880 the USA took a Census of the Population covering different aspects: Population, Mortality, Agriculture, Manufacturing. However, once the data was collected, it took 7 years to say something!!! 13 / 56

37. Ahhh... Thus, Hollerith came up with the following machine (Circa 1890)!!! 14 / 56

38. Hollerith Tabulating Machine It was basically a sorter and counter, using punched cards as memory and mercury sensors. Example 15 / 56

40. It was FAST!!! It took only!!! 2 years!!! Nevertheless, in 1837 Babbage's Analytical Engine was the first design for a general-purpose computer!!! Turing-complete!!! Way more complex than the tabulator!!! 53 years earlier!!! 16 / 56

45. Funny!!! Funny!!! 17 / 56

46. The Problem Actually, it never reached completion because Babbage was actually a yucky project manager!!! 18 / 56

49. Data is Everywhere! Lots of data is being collected and warehoused Web data, e-commerce Purchases at department/ grocery stores Bank/Credit Card transactions Social Network Many Places 20 / 56

54. The Staggering Numbers An Ocean of Data: How much data is there in the world? 800 Terabytes (2000); 160 Exabytes (2006); 500 Exabytes (Internet, 2009); 2.7 Zettabytes (2012); 35 Zettabytes by 2020. Generation: How much data is generated in ONE day? 7 TB, Twitter; 10 TB, Facebook. Source: “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute, 2011. 21 / 56

63. Type of Data Thus Relational Data (Tables/Transaction/Legacy Data) Text Data (Web) Semi-structured Data (XML) And more... Graph Data Social Network, Semantic Web (RDF), . . . Streaming Data You can only scan the data once 22 / 56

70. The Ever Growing Landscape 23 / 56

71. Machine Learning Definition: Algorithms or techniques that enable a computer (machine) to “learn” from data. Related to many areas such as data mining, statistics, information theory, etc. Algorithm Types: Unsupervised Learning, Supervised Learning, Reinforcement Learning. Examples: Artificial Neural Network (ANN), Support Vector Machine (SVM), Expectation-Maximization (EM), Deterministic Annealing (DA). 24 / 56

80. Machine Learning Process Process 1 Feature Extraction/Feature Generation 2 Clustering ≈ Class Identiﬁcation ≈ Unsupervised Learning 3 Classiﬁcation ≈ Supervised Learning Then... We start thinking: We need to process a lot of data... Or... LARGE SCALE MACHINE LEARNING 26 / 56

84. Feature Generation/Dimensionality Reduction Feature Generation: Given a set of measurements, the goal is to discover compact and informative representations of the obtained data. Examples: (1) The Karhunen–Loève transform ≈ Principal Component Analysis: popular for feature generation and dimensionality reduction. (2) The Singular Value Decomposition: used for dimensionality reduction. 28 / 56
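To make the Principal Component Analysis idea concrete, here is a small pure-Python sketch that finds only the first principal component by power iteration on the sample covariance matrix. This is a stand-in for a full eigendecomposition or SVD, chosen for brevity; the function name and the fixed starting vector are assumptions, and the iteration would fail if the starting vector happened to be orthogonal to the leading direction.

```python
import math


def first_principal_component(rows, iters=200):
    """Return a unit vector along the direction of largest variance."""
    n, d = len(rows), len(rows[0])
    # Center the data: subtract the per-feature mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance at all; keep the current vector
        v = [x / norm for x in w]
    return v
```

Projecting each centered sample onto the returned vector gives a one-dimensional representation of the data, which is the dimensionality-reduction use mentioned on the slide.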

88. Dimension Reduction/Feature Extraction Definition: Process to transform high-dimensional data into low-dimensional data to improve accuracy, aid understanding, or remove noise. Why? Curse of dimensionality: the volume of the space, and with it the amount of data needed to cover it, grows exponentially as extra dimensions are added. 29 / 56
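A tiny arithmetic illustration of the curse of dimensionality, assuming each axis is discretized into a fixed number of bins (the helper name `grid_cells` and the choice of 10 bins are illustrative): the number of cells that must be covered by data is the bin count raised to the number of dimensions.

```python
def grid_cells(bins_per_axis, dims):
    """Number of cells in a regular grid over a dims-dimensional space."""
    return bins_per_axis ** dims


if __name__ == "__main__":
    # With 10 bins per axis, the cell count is multiplied by 10 per dimension.
    for dims in (1, 2, 3, 10):
        print(dims, grid_cells(10, dims))
```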

90. Feature Selection Feature Selection Which features should be used for the classiﬁer? Why? The Curse of Dimensionality!!! Hypothesis Testing to discriminate good features 30 / 56

92. What can be done? Measures for Class Separability. Example: Between-class scatter matrix: $S_b = \sum_{i=1}^{M} P_i (\mu_i - \mu_0)(\mu_i - \mu_0)^T$ (1) Where: $\mu_0$ is the global mean vector, $\mu_0 = \sum_{i=1}^{M} P_i \mu_i$; $\mu_i$ is the mean of class $\omega_i$; $P_i \cong \frac{n_i}{N}$. 31 / 56
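Equation (1) can be computed directly from labeled samples. The following is a hedged pure-Python sketch: the function name and the list-of-tuples input format are assumptions, and the priors are estimated as n_i / N as on the slide.

```python
def between_class_scatter(samples, labels):
    """Between-class scatter matrix S_b as a d x d nested list."""
    n, d = len(samples), len(samples[0])
    # Group the samples by class label.
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    # P_i ~ n_i / N and the per-class mean vectors mu_i.
    priors = {c: len(xs) / n for c, xs in groups.items()}
    means = {
        c: [sum(x[j] for x in xs) / len(xs) for j in range(d)]
        for c, xs in groups.items()
    }
    # Global mean mu_0 = sum_i P_i * mu_i.
    mu0 = [sum(priors[c] * means[c][j] for c in groups) for j in range(d)]
    # S_b = sum_i P_i (mu_i - mu_0)(mu_i - mu_0)^T.
    sb = [[0.0] * d for _ in range(d)]
    for c in groups:
        diff = [means[c][j] - mu0[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                sb[i][j] += priors[c] * diff[i] * diff[j]
    return sb
```

A feature along which the class means are far apart contributes a large diagonal entry, which is why this matrix is useful as a separability measure.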

96. What can be done? Feature Subset Selection Examples: Filter Approach All combinations of features are used together with a separability measure. Wrapper Approach: Use the decided classiﬁer itself to ﬁnd the best set. 32 / 56

102. Classification Definition: A procedure dividing data into the given set of categories based on the training set in a supervised way. What do we want from classification? Generalization vs. Specification: hard to achieve both. Avoid overfitting/overtraining: Early stopping; Holdout validation; K-fold cross-validation; Leave-one-out cross-validation. 34 / 56
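As an illustration of the validation schemes listed above, here is a minimal sketch of K-fold index splitting in pure Python. The generator name `k_fold_indices` and the round-robin assignment of samples to folds are assumptions made for brevity (real pipelines usually shuffle first). Setting k equal to the number of samples gives leave-one-out cross-validation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Round-robin assignment: sample i goes to fold i mod k.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, validation
```

Each sample is used for validation exactly once and for training in the other k - 1 rounds; averaging the validation error over the rounds gives the cross-validation estimate.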

110. Avoid overfitting/overtraining [Figure: validation error and training error curves, with the underfitting and overfitting regions marked] 35 / 56

111. Examples of Classification Algorithms Many Possible Algorithms: Linear Classifiers: Perceptron. Probabilistic Classifiers: Naive Bayes. Kernel Method Classifiers: Support Vector Machines. Non-Linear Classifiers: Artificial Neural Networks. Graphical Model Classifiers: . . . 36 / 56

118. Clustering Analysis Deﬁnition Grouping unlabeled data into clusters, for the purpose of inference of hidden structures or information. Using, for example Dissimilarity measurement Angle : Inner product, . . . Non-metric : Rank, Intensity, . . . Distance : Euclidean (l2), Manhattan(l1), . . . 38 / 56
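The dissimilarity measures named above can be written out in a few lines. This is a small sketch with assumed function names; the angle-based measure is shown as one minus the cosine of the angle between the vectors, which is one common choice among several.

```python
import math


def euclidean(a, b):
    """l2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """l1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_dissimilarity(a, b):
    """1 - cos(angle), computed from the inner product of non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```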

123. Example 39 / 56

124. Examples of Clustering Algorithms Clustering: (1) Basic Clustering Algorithms: K-means. (2) Clustering Based on Cost Functions: Fuzzy C-means; Possibilistic. (3) Hierarchical Clustering: Entropy based. (4) Clustering Based on Graph Theory. 40 / 56
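As an example of the basic algorithms in this list, here is a minimal sketch of K-means (Lloyd's iteration) in pure Python. The function name, the caller-supplied initial centroids, and the rule of keeping a centroid in place when its cluster is empty are all assumptions made to keep the example short and deterministic.

```python
def k_means(points, centroids, iters=100):
    """Alternate assignment and mean-update steps until centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for c, members in zip(centroids, clusters):
            if members:
                updated.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                updated.append(c)  # empty cluster: leave the centroid as is
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, clusters
```

The result depends on the initial centroids, so in practice the algorithm is often run from several starting points.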

133. What Is Data Mining? Data mining (knowledge discovery in databases): Extraction of interesting information or patterns from data in large databases. Alternative names and their “inside stories”: Knowledge discovery(mining) in databases (KDD) Knowledge extraction Data/pattern analysis Data archeology Business intelligence etc. 42 / 56

139. Examples: What is (not) Data Mining? What is not Data Mining? 1 Looking up a phone number in a phone directory. 2 Querying a Web search engine for information about “Amazon”. What is Data Mining? 1 Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly. . . in the Boston area). 2 Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com). 43 / 56

144. Data mining Applications Applications Mining the Web for Structured Data Near Neighbor Search in High Dimensional Data. Frequent itemsets and Association Rules Structure of the webgraph PageRank Link Analysis Proximity on Graphs Mining data streams. Large scale supervised machine learning techniques. 45 / 56

154. Example: Frequent Itemsets Based on the Market-Basket Model: 1 On the one hand, we have items. 2 On the other, we have baskets, sometimes called “transactions”: each basket consists of a set of items (an itemset), and baskets are small. Examples: 1 {Cat, and, dog, bites} 2 {Yahoo, news, claims, cat, dog, and, produced, viable, offspring} 3 {Cat, killer, likely, is, a, big, dog} 4 {Professional, free, advice, on, dog, training, puppy} 47 / 56

162. Example: Frequent Itemsets Then, we do the following:

Transaction ID   Cat   Dog   and   a   mated
1                1     1     1     0   0
2                1     1     1     1   1
3                1     1     0     1   0
4                0     1     0     0   0

48 / 56

163. Combinatorial Problem Problem: How many subsets do we have? But we can do the following: given an itemset $x$ and a database $D$ of transactions $\{t_i\}_{i \in I}$, $supp(x, D) = |\{t_i \in D \mid x \subseteq t_i\}|$ (2) Then, setting a threshold: How many frequent itemsets (those with $supp(x, D)$ above the threshold) are there? 49 / 56
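The support count of equation (2) and a brute-force search for frequent itemsets can be sketched as follows. The transactions are read off the 0/1 table in the frequent-itemsets example; the function names and the `max_size` cap are assumptions, and the exhaustive enumeration is exactly the combinatorial blow-up the slide warns about (n items have 2^n subsets), so this is only workable for tiny inputs.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def frequent_itemsets(transactions, threshold, max_size=2):
    """Brute force: itemsets of up to max_size items with support > threshold."""
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = support(candidate, transactions)
            if count > threshold:
                result[candidate] = count
    return result


# Transactions taken from the rows of the 0/1 table.
TRANSACTIONS = [
    {"Cat", "Dog", "and"},
    {"Cat", "Dog", "and", "a", "mated"},
    {"Cat", "Dog", "a"},
    {"Dog"},
]
```

Calling `frequent_itemsets(TRANSACTIONS, 2)` keeps only the itemsets that appear in more than two of the four transactions.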

167. Hardware Solutions: ASICs Application-Specific Integrated Circuit (ASIC): An ASIC is an integrated circuit customized for a particular use, rather than intended for general-purpose use. It allows for: 1 Lower Power Consumption. 2 Better Cooling Approaches. Example: From Microsoft Research 51 / 56

171. Hardware Solutions: GPUs IDEAS: Based on the CUDA parallel computing architecture from Nvidia. Emphasis on executing many concurrent LIGHT threads instead of one HEAVY thread as in CPUs. Hardware for 8800 53 / 56

172. Advantages Massively parallel Hundreds of cores, millions of threads High throughput Limitations May not be applicable for all tasks Generic hardware (CPUs) closing the gap 54 / 56

174. Projects Possible topics are: Oil exploration detection. Association Rule Preprocessing Project. Neural Network-Based Financial Market Forecasting Project. Page Ranking - Improving over the Google Matrix. Influence Maximization in Social Networks. Web Word Relevance Measures. Recommendation Systems. There are more possibilities at https://www.kaggle.com/competitions 56 / 56
