Introduction to Data Mining for Newbies


  1. 1. Introduction to Data Mining for Newbies, Nov. 2nd, 2012, @echojuliett
  2. 2. Google Datacenter @ Douglas County, Georgia: “These colorful pipes send and receive water for cooling our facility. Also pictured is a G-Bike, the vehicle of choice for team members to get around outside our data centers.” Source: http://www.google.com/about/datacenters/gallery/#/tech/10
  3. 3. Eunjeong Lucy Park. PhDs, Data scientist @ SNU DMLab. A person who lives on lattes. Find me at: http://dmlab.snu.ac.kr, http://lucypark.kr
  4. 4. “All scientists are data scientists.” - Monica Rogati, Senior Research Scientist @ LinkedIn. Source: http://xkcd.com/242/
  5. 5. “Data is everywhere.” Tweets, cell phone logs, social networking data, politician data, web documents, manufacturing fault data, credit card transactions
  6. 6. “Data mining is…” • “…the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.” - Berry and Linoff, 1997. Source: Berry and Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, New York: Wiley, 1997.
  7. 7. “Data mining is…”
      • “…the belief in data.” - @echojuliett, 2012
      • Inductive reasoning
        – Mathematical induction: prove for k=1, assume for k, then prove for k+1
        – Induction vs. prejudice: # of cases
        – Ex: What is your hobby?
  8. 8. “Data mining is…”
  9. 9. 1. Basic Concepts of Data Mining 2. Origins of Data Mining 3. Data Mining Tools 4. Masters of Data Mining
  10. 10. Data types: structured data, unstructured data. Source: http://www.tipforest.com/t/83
  11. 11. (the general) Data mining process: DATA (data warehouse of some domain: Marketing, Finance, Manufacturing, etc.) → Selection → Target data → Preprocessing → Preprocessed data → Data mining → Patterns → Interpretation → KNOWLEDGE
  12. 12. Selection
      • Data exploration
        – How many variables?
          • Independent variables, dependent variables, …
          • Continuous variables, categorical variables, …
        – How many records?
        – What distribution?
        – …
      • Variable selection & dimensionality reduction
        – Ex: Stepwise selection, PCA (Principal Component Analysis)
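
As a rough illustration of the dimensionality reduction mentioned on the Selection slide, the sketch below finds the first principal component of a small data set with power iteration. The function name `first_principal_component` and the toy data are assumptions made for this example; real projects would more likely rely on a statistics package than on hand-written code.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return the unit direction of largest variance for equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical toy data: the second variable is exactly twice the first.
data = [[x, 2.0 * x] for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
pc = first_principal_component(data)
# Project each 2-D record onto the component to get a 1-D representation.
projected = [sum(a * b for a, b in zip(row, pc)) for row in data]
```

Because the two toy variables are perfectly correlated, a single component captures all of the variance, which is the situation in which dropping dimensions loses nothing.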
  13. 13. Preprocessing
      • “Partitioning” the data
        – training data & validation data (& test data …)
      [Figure: the data set split into training data and validation data]
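
Partitioning can be sketched in a few lines. The 60/40 ratio, the fixed seed, and the `partition` helper below are assumptions chosen for illustration, not values taken from the slides.

```python
import random

def partition(records, train_frac=0.6, seed=0):
    """Shuffle a copy of the records, then split into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))  # hypothetical stand-in for 10 data records
train, valid = partition(records)
```

Shuffling before splitting matters when the records arrive in some meaningful order (by date, by customer, and so on); otherwise the validation set may not resemble the training set.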
  14. 14. Preprocessing • Beware of “overfitting” Source: Bishop, PRML, p.7
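
One way to see overfitting is to compare the error on the training data with the error on held-out validation data. The sketch below uses k-nearest-neighbor regression on hypothetical noisy samples of y = x; this is not the polynomial example from Bishop's figure, just a small stand-in that needs no libraries.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points whose inputs are closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, data, k):
    """Mean squared error of k-NN predictions on the given data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
points = []
for _ in range(60):
    x = rng.random()
    points.append((x, x + rng.gauss(0, 0.3)))  # hypothetical noisy samples of y = x
train, valid = points[:40], points[40:]

for k in (1, 5, 15):
    print(k, mean_squared_error(train, train, k), mean_squared_error(train, valid, k))
```

With k = 1 every training point is its own nearest neighbor, so the training error is zero: the model has memorized the noise. The validation column is the number to watch, since a model that merely memorizes tends to do worse there than a smoother one.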
  15. 15. Data mining methods
      • Predictive methods
        – Classification: Learns a method for predicting the instance class from pre-labeled (classified) instances
        – Regression: An attempt to predict a continuous attribute
      • Descriptive methods
        – Clustering: Finds “natural” grouping of instances given un-labeled data
        – Association Rules: Method for discovering interesting relations between variables in large DBs
  16. 16. Regression
      • Linear regression, k-nearest neighbors (k-NN), artificial neural networks (ANN), …
      • Polynomial curve fitting
        – The basic form: min
        – The advanced form: min
      • Example:
        – Tomorrow’s stock price = f (recent prices, economic indicators, …)
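
A minimal sketch of linear regression, assuming that the "basic form" on the slide refers to minimizing the sum of squared errors between predictions and observed values. The closed-form solution for a single input variable is shown; the `fit_line` name and the price data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: an indicator value (x) and the observed price (y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict the price for x = 5
```

The toy data lie exactly on y = 2x + 1, so the fit recovers that line; with real prices the residuals would not vanish and the fitted line would only approximate the trend.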
  17. 17. Classification
      • Regression with a categorical dependent variable
      • Naïve Bayes classification, decision trees, ANNs, SVMs, …
      • Ex: E-mail spam detection (inbox or spam?)
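
The spam example can be sketched with a tiny Naïve Bayes classifier. The training messages, the add-one smoothing, and the function names below are assumptions made for illustration; a production filter would use far more data and a tested library.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_messages):
    """Count words per class and messages per class."""
    word_counts = {}
    class_counts = Counter()
    for text, label in labeled_messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Pick the class with the highest log-posterior, using add-one smoothing."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_messages = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(class_counts[label] / total_messages)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set of pre-labeled messages.
messages = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for tomorrow", "inbox"),
    ("lunch with the team tomorrow", "inbox"),
]
model = train_naive_bayes(messages)
label = classify("claim free money", *model)
```

The "naïve" part is the assumption that words occur independently given the class, which is why the per-word log-probabilities can simply be added.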
  18. 18. Clustering
      • Grouping of similar objects
      • Unsupervised, exploratory knowledge discovery
      • k-means, hierarchical clustering, SOM, …
      • Ex: Politician segmentation
      [Figure: Jaccard similarity based hierarchical clustering dendrogram (D9). Groups: Democratic United Party (liberal), Grand National Party (conservative), Others]
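
The politician segmentation example combines Jaccard similarity with hierarchical clustering. The sketch below is one possible reading of that combination, assuming average linkage and a made-up data set in which each legislator is described by a set of bill numbers; the linkage and the data behind the slide's dendrogram are not stated in the transcript.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

def cluster_similarity(c1, c2, items):
    """Average-linkage similarity between two clusters of item names."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(jaccard(items[x], items[y]) for x, y in pairs) / len(pairs)

def hierarchical_clustering(items, n_clusters):
    """Merge the two most similar clusters until n_clusters remain."""
    clusters = [[name] for name in items]
    while len(clusters) > n_clusters:
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_similarity(clusters[ij[0]], clusters[ij[1]], items),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Hypothetical data: the set of bills each legislator co-sponsored.
bills = {
    "A": {1, 2, 3},
    "B": {1, 2, 4},
    "C": {7, 8, 9},
    "D": {7, 8, 10},
}
clusters = hierarchical_clustering(bills, 2)
```

Recording the order and similarity of each merge, instead of stopping at a fixed number of clusters, is what produces a dendrogram like the one on the slide.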
  19. 19. Association Rules Source: http://lucypark.tistory.com/48
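
Association rules are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for a hypothetical set of market baskets; the baskets and function names are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = support(transactions, antecedent | consequent)
    return both / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
rule_support = support(baskets, {"diapers", "beer"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})  # 3 of the 4 diaper baskets
```

Algorithms such as Apriori search for all rules whose support and confidence exceed chosen thresholds; the two measures above are the building blocks of that search.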
  20. 20. Data mining methods
      • Predictive methods
        – Classification: Learns a method for predicting the instance class from pre-labeled (classified) instances
        – Regression: An attempt to predict a continuous attribute
      • Descriptive methods
        – Clustering: Finds “natural” grouping of instances given un-labeled data
        – Association Rules: Method for discovering interesting relations between variables in large DBs
  21. 21. Pop quiz!
  22. 22. Pop quiz!
  23. 23. Pop quiz!
  24. 24. Pop quiz!
  25. 25. Pop quiz! Source: http://www.cis.hut.fi/research/som-research/worldmap.html
  26. 26. Pop quiz! Source: http://popupcity.net/2009/04/why-are-that-many-logos-blue/
  27. 27. Pop quiz!
  28. 28. 1. Basic Concepts of Data Mining 2. Origins of Data Mining 3. Data Mining Tools 4. Masters of Data Mining
  29. 29. Historical Note
      • Data Fishing, Data Dredging: 1960-
        – used by statisticians (as a bad name)
      • Knowledge Discovery in Databases (KDD): 1989-
        – used by Artificial Intelligence (AI), Machine Learning (ML) communities
      • Data Mining, Data Analytics: 1990-
        – used in DB communities, business
      • Big data: 2000-
  30. 30. Comparisons • Data mining • Statistics • Machine learning • Pattern recognition • …
  31. 31. 1. Basic Concepts of Data Mining 2. Origins of Data Mining 3. Data Mining Tools 4. Masters of Data Mining
  32. 32. R. Source: http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html
  33. 33. SAS Enterprise Miner (“E-miner”)
  34. 34. XLMiner
      • 15-day trial version available at http://www.solver.com/xlminer-data-mining
      • Useful for prototyping
      • Supports:
        – Preprocessing
          • Data partitioning
          • Missing data imputation
          • Categorical data transformation
          • PCA (Principal Component Analysis)
        – Algorithms
          • Multiple linear regression
          • k-NN (k-nearest neighbors)
          • CART (classification and regression trees)
          • ANN (artificial neural networks)
          • Discriminant analysis
          • Logistic regression
          • Naïve Bayes classification
          • Association rules
          • k-means clustering
          • Hierarchical clustering
  35. 35. More…
      • MathWorks MATLAB / GNU Octave
        – Most DM algorithms are preinstalled
        – Relatively easy to learn
      • General purpose programming languages
        – For example, C, Java, Python, etc.
        – Packages such as Orange (http://orange.biolab.si/) for Python are available
        – May be a better fit for tasks like natural language processing
      • Even more…
        – Try visiting http://www.kdnuggets.com/software/suites.html
  36. 36. 1. Basic Concepts of Data Mining 2. Origins of Data Mining 3. Data Mining Tools 4. Masters of Data Mining
  37. 37. Foreign warriors • Mitchell (Carnegie Mellon University) • Vapnik (NEC Labs) • Bishop (Microsoft Cambridge) • Smola (Yahoo, Australian National University) • Ng (Stanford University)
  38. 38. Foreign warriors • 조성준 (Seoul National University) • 조재희 (Kwangwoon University) • 조성배 (Yonsei University) • 이성임 (Dankook University) • 김성범 (Korea University)
  39. 39. References
      • [1] Duda, Hart, Stork, Pattern Classification, 2nd ed., Wiley, 2001.
      • [2] Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2006.
      • [3] Shmueli, Patel, Bruce, Data Mining for Business Intelligence, 2nd ed., Wiley, 2010.
  40. 40. Any Questions?
