Overview of Data Mining


Concepts and basic methods of data mining, with an example of association rules in a 'toserba' (department store).

Published in: Technology

  1. 1. Overview of Data Mining. Meeting of WP Data Mining, April 28, 2008. Bowo Prasetyo
  2. 2. Contents <ul><li>What Is Data Mining? </li></ul><ul><li>Does It Differ from Statistics? </li></ul><ul><li>Why Use Data Mining? </li></ul><ul><li>What Can Data Mining Do? </li></ul><ul><li>Methods of Data Mining </li></ul><ul><li>Case Study: Department Store (Toserba) </li></ul><ul><li>Mining Environmental Data </li></ul><ul><li>Conclusion </li></ul>
  3. 3. What Is Data Mining? <ul><li>The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules 1) . </li></ul>1) Berry and Linoff, Data Mining Techniques for Marketing, Sales and Customer Support (Book), 1997
  4. 4. Does It Differ from Statistics? <ul><li>Data mining is a blend of statistics, artificial intelligence, and database research 16) . </li></ul>16) D. Pregibon, Data Mining: Statistical Computing and Graphics , p. 7-8, 1997 [Figure: diagram relating Statistics, Artificial Intelligence, Database, and Data Mining]
  5. 5. Statistics, AI, Database <ul><li>Statistics </li></ul><ul><ul><li>Distribution, mean, median, standard deviation </li></ul></ul><ul><li>Artificial Intelligence (AI) </li></ul><ul><ul><li>Neural network, fuzzy theory, genetic algorithm, particle swarm optimization </li></ul></ul><ul><li>Database </li></ul><ul><ul><li>Relational, object-oriented, spatial, temporal </li></ul></ul>
  6. 6. Why Use Data Mining? <ul><li>Data explosion </li></ul><ul><ul><li>Automated data collection </li></ul></ul><ul><ul><li>Log data of large organizations 2) : </li></ul></ul><ul><ul><ul><li>44%: 1 terabyte per month </li></ul></ul></ul><ul><ul><ul><li>11%: 10 terabytes per month </li></ul></ul></ul><ul><ul><li>World’s digital data on PCs, digital cameras, servers, sensors, etc. 3) : </li></ul></ul><ul><ul><ul><li>In 2006: 161 billion gigabytes </li></ul></ul></ul><ul><ul><ul><li>In 2010: 988 billion gigabytes (predicted) </li></ul></ul></ul><ul><ul><li>Large amounts of data, but small amounts of knowledge </li></ul></ul><ul><ul><li>Data mining to discover the knowledge </li></ul></ul>2) ESG Research, New ESG Research Finds Large Organizations Experiencing Explosive Growth in Log Data Collection, Analysis, and Storage , 2007 3) EMC - IDC Research, The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010 , 2006
  7. 7. What Can Data Mining Do? Examples
  8. 8. On Business and Network Security <ul><li>Builds customer profiles based on their transaction histories 4) </li></ul><ul><li>Analyzes corporate credit ratings using public financial statements, such as financial ratios 5) </li></ul><ul><li>Detects credit card fraud by analyzing customer transaction databases 6) </li></ul><ul><li>Detects network intrusion based on system program behavior such as sendmail and tcpdump 7) </li></ul>4) G. Adomavicius and A. Tuzhilin, Using data mining methods to build customer profiles , in Computer magazine p. 74-82, 2001 5) Z. Huang, H. Chen, C. Hsu, W. Chen, S. Wu, Credit rating analysis with support vector machines and neural networks: a market comparative study , in Journal of Decision Support Systems p. 543-558, 2004 6) T. Fawcett and F. Provost, Adaptive Fraud Detection , in Journal of Data Mining and Knowledge Discovery p. 291-316, 2004 7) W. Lee and S. J. Stolfo, Data Mining Approaches for Intrusion Detection , in Proceedings of the 7th USENIX Security Symposium, 1998
  9. 9. On The Web <ul><li>Discovers useful patterns from log files, contents, and links of websites 8) </li></ul><ul><li>Ranks the web pages on the internet using link structure analysis 9) </li></ul><ul><li>Personalizes a website based on log files, contents, and profile data 10) </li></ul><ul><li>Supports on-line recommendation to customers by analyzing e-commerce transaction records 11) </li></ul>8) R. Cooley, B. Mobasher, J. Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web , in Proceedings of 9th International Conference on Tools with Artificial Intelligence (ICTAI) p. 558, 1997 9) Larry Page, Sergey Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web , 1998 10) M. Eirinaki and M. Vazirgiannis, Web mining for web personalization , in ACM Transactions on Internet Technology (TOIT) p. 1-27, 2003 11) S. W. Changchien and T. Lu, Mining association rules procedure to support on-line recommendation by customers and products fragmentation , in Journal of Expert Systems with Applications v. 20-4 p. 325-335, 2001
  10. 10. On Environment <ul><li>Discovers rules in geo-spatial databases 12) </li></ul><ul><li>Analyzes weather impacts on the airspace system 13) </li></ul><ul><li>Discovers interesting patterns in Earth Science variables (soil moisture, temperature, precipitation) along with ecosystem data (Net Primary Production) 14) </li></ul><ul><li>Finds Ocean Climate Indices based on pressure and temperature data 15) </li></ul>12) J. Han, K. Koperski, N. Stefanovic, GeoMiner: a system prototype for spatial data mining, in Proceedings of ACM SIGMOD international conference on Management of data p. 553-556, 1997 13) Z. Nazeri and J. Zhang, Mining aviation data to understand impacts of severe weather on airspace system performance , in Proceedings of International Conference on Coding and Computing p. 518-523, 2002 14) V. Kumar, M. Steinbach, P. Tan, S. Klooster, C. Potter, A. Torregrosa, Mining Scientific Data: Discovery of Patterns in the Global Climate System , in Proceedings of the Joint Statistical Meetings p. 5-9, 2001 15) M. Steinbach, P. Tan, V. Kumar, S. Klooster, C. Potter, Data Mining for the Discovery of Ocean Climate Indices , in Proceedings of the 5th Workshop on Scientific Data Mining p. 7-16, 2002
  11. 11. Methods in Data Mining Basic Methods
  12. 12. Classification, Clustering, Association Rules <ul><li>Data mining consists of several basic methods: </li></ul><ul><ul><li>Classification </li></ul></ul><ul><ul><ul><li>Places items into groups based on a training set of previously labeled items (supervised) </li></ul></ul></ul><ul><ul><li>Clustering </li></ul></ul><ul><ul><ul><li>Places items into groups based on some defined distance measure (unsupervised) </li></ul></ul></ul><ul><ul><li>Association Rules </li></ul></ul><ul><ul><ul><li>Discovers items that co-occur frequently within a data set and also their rules, such as implication or correlation </li></ul></ul></ul>
  13. 13. Classification <ul><li>Naive Bayesian classifier </li></ul><ul><ul><li>Spam/Non-spam classification </li></ul></ul><ul><ul><ul><li>Spam if </li></ul></ul></ul>17) http ://
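One common form of the naive Bayes decision rule labels a message as spam when P(spam) times the product of P(word | spam) exceeds the same quantity for non-spam. The sketch below is a generic multinomial naive Bayes with add-one smoothing, not necessarily the rule shown on the slide; the function names `train`, `log_score`, and `is_spam`, the label "ham" for non-spam, and the toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class from (list_of_words, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for words, label in messages:
        class_counts[label] += 1
        word_counts[label].update(words)
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocabulary


def log_score(words, label, model):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    word_counts, class_counts, vocabulary = model
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in words:
        if word in vocabulary:  # words never seen in training are ignored
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
    return score


def is_spam(words, model):
    """Spam if the spam score is higher than the non-spam score."""
    return log_score(words, "spam", model) > log_score(words, "ham", model)


# Hypothetical training data, for illustration only.
model = train([
    (["cheap", "pills", "buy"], "spam"),
    (["buy", "cheap", "now"], "spam"),
    (["meeting", "agenda", "today"], "ham"),
    (["lunch", "today"], "ham"),
])
print(is_spam(["cheap", "buy"], model))
print(is_spam(["meeting", "today"], model))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.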
  14. 14. Clustering <ul><li>K-means algorithm 18) </li></ul><ul><ul><li>Step 1: Partitions items into k clusters </li></ul></ul><ul><ul><li>Step 2: Calculates the mean of each cluster as its centroid </li></ul></ul><ul><ul><li>Step 3: Assigns each item to the closest centroid using a defined distance measure </li></ul></ul><ul><ul><li>Step 4: Returns to step 2 until convergence (assignments no longer change) </li></ul></ul>18) J. A. Hartigan and M. A. Wong, A k-means clustering algorithm, in Applied Statistics, 28 (1) p. 100-108, 1979
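The k-means loop described on this slide can be sketched in plain Python. This is a minimal illustration, not the Hartigan and Wong implementation: it assumes points are tuples of numbers, uses squared Euclidean distance as the "defined distance", and starts from a shuffled round-robin partition. The names `kmeans`, `squared_distance`, and `mean` are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    if len(points) < k:
        raise ValueError("need at least k points")
    rng = random.Random(seed)
    # Partition the items into k non-empty clusters (shuffled round-robin).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, index in enumerate(order):
        labels[index] = rank % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Calculate the mean of each cluster as its centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = mean(members)
        # Assign each item to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignment no longer changes (convergence).
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

K-means only finds a local optimum, so on less clearly separated data the result can depend on the initial partition.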
  15. 15. Association Rules <ul><li>Example rule: if a customer buys bread and butter, then they will likely buy milk too, with 90% confidence </li></ul><ul><li>Algorithm 19) : </li></ul><ul><ul><li>Finds the frequent itemsets, whose support >= minsup </li></ul></ul><ul><ul><li>Generates from those frequent itemsets the interesting rules, whose confidence >= minconf </li></ul></ul>19) R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules , in Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994
  16. 16. Association Rules <ul><li>Apriori algorithm to find frequent itemsets $L$ in database $D$ 19) : </li></ul><ul><ul><li>Find the frequent set $L_{k-1}$ </li></ul></ul><ul><ul><li>Join step </li></ul></ul><ul><ul><ul><li>$C_k$ is generated by joining $L_{k-1}$ with itself </li></ul></ul></ul><ul><ul><li>Prune step </li></ul></ul><ul><ul><ul><li>Any $(k-1)$-itemset that is not frequent cannot be a subset of a frequent $k$-itemset, so every candidate in $C_k$ that contains such a subset is removed </li></ul></ul></ul>($C_k$: candidate itemsets of size $k$) ($L_k$: frequent itemsets of size $k$ whose support >= minsup)
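The join and prune steps can be sketched as follows. This is a simplified illustration rather than the exact procedure of Agrawal and Srikant: the join here is a plain union of two frequent (k-1)-itemsets instead of their ordered prefix join, support is an absolute transaction count, and the function name `frequent_itemsets` is hypothetical. The sample data are English translations of the department store purchases used later in the slides.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Return {itemset: support count} for all itemsets with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Frequent 1-itemsets.
    current = {}
    for item in {i for t in transactions for i in t}:
        single = frozenset([item])
        count = support(single)
        if count >= minsup:
            current[single] = count
    result = dict(current)

    k = 2
    while current:
        previous = set(current)
        # Join step: combine frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in previous for b in previous
                      if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous
                             for s in combinations(c, k - 1))}
        # Keep only the candidates that are actually frequent.
        current = {}
        for candidate in candidates:
            count = support(candidate)
            if count >= minsup:
                current[candidate] = count
        result.update(current)
        k += 1
    return result


transactions = [
    {"rice", "cooking oil", "beef"},
    {"granulated sugar", "cooking oil", "chicken eggs"},
    {"rice", "granulated sugar", "cooking oil", "chicken eggs"},
    {"granulated sugar", "chicken eggs"},
]
for itemset, count in frequent_itemsets(transactions, minsup=2).items():
    print(sorted(itemset), count)
```

The prune step is what keeps the candidate set small: a candidate is counted against the database only if every one of its (k-1)-subsets already proved frequent.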
  17. 17. Association Rules <ul><li>Apriori algorithm to find rules $R$ from frequent itemsets $L$ 19) : </li></ul><ul><ul><li>For each $l \in L$, generate $S$ = the non-empty proper subsets of $l$ </li></ul></ul><ul><ul><li>For each $s \in S$, generate the rule $s \Rightarrow (l - s)$ if its confidence >= minconf </li></ul></ul>
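Rule generation can be sketched as below, assuming the frequent itemsets are already available as a dictionary from itemset to support count. Confidence of s ⇒ (l - s) is taken as support(l) / support(s). The function name `generate_rules` and the small `supports` table are assumptions made for illustration.

```python
from itertools import combinations


def generate_rules(supports, minconf):
    """Return (antecedent, consequent, confidence) for rules above minconf.

    `supports` maps each frequent itemset (a frozenset) to its support
    count and must contain every subset of each itemset it lists.
    """
    rules = []
    for itemset, count in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                antecedent = frozenset(subset)
                confidence = count / supports[antecedent]
                if confidence >= minconf:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical support counts, for illustration only.
supports = {
    frozenset({"rice"}): 2,
    frozenset({"cooking oil"}): 3,
    frozenset({"rice", "cooking oil"}): 2,
}
for antecedent, consequent, confidence in generate_rules(supports, minconf=0.9):
    print(sorted(antecedent), "=>", sorted(consequent), confidence)
```

Because every subset of a frequent itemset is itself frequent, the lookup `supports[antecedent]` always succeeds when the table comes from a complete frequent-itemset search.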
  18. 18. Visualization Of Mining Results <ul><li>Problems with mining results </li></ul><ul><ul><li>Too many results to display </li></ul></ul><ul><ul><li>Difficult to find important rules </li></ul></ul><ul><ul><li>Difficult to understand the rules </li></ul></ul><ul><li>Good visualization tools are needed </li></ul><ul><ul><li>Chart for statistical results </li></ul></ul><ul><ul><li>Graph (node & edge) for association rules </li></ul></ul><ul><ul><li>Globe map for geo-spatial results </li></ul></ul><ul><ul><li>Animation for temporal results </li></ul></ul><ul><ul><li>Utilizes colors, styles, thickness, etc. </li></ul></ul>
  19. 19. Case Study: Association Rules in a Department Store (Toserba)
  20. 20. Items and Transactions <ul><li>Mr. Joko's purchases in January: </li></ul><ul><ul><li>rice, cooking oil, beef </li></ul></ul><ul><ul><li>granulated sugar, cooking oil, chicken eggs </li></ul></ul><ul><ul><li>rice, granulated sugar, cooking oil, chicken eggs </li></ul></ul><ul><ul><li>granulated sugar, chicken eggs </li></ul></ul>[Figure labels: transaction, item]
  21. 21. Frequent Items <ul><li>“Frequent”: purchased >= 2 times </li></ul><ul><li>beef = 1 time → not frequent </li></ul>[Figure labels: support, minimum support]
  22. 22. n-Length Items (n-Items) <ul><li>n > 1 </li></ul>[Figure labels: 2-length item, 3-length item]
  23. 23. Association Rules <ul><li>Customers who buy rice will also buy cooking oil. </li></ul><ul><li>“if rice then cooking oil” </li></ul>rice => cooking oil (antecedent => consequent), with $\text{confidence} = \frac{\text{support}(\text{cooking oil} \,\&\, \text{rice})}{\text{support}(\text{rice})} = \frac{2}{2} = 1$
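The confidence calculation on this slide can be checked directly against the four January transactions. The sketch below uses English translations of the item names (beras = rice, minyak goreng = cooking oil, daging sapi = beef, gula pasir = granulated sugar, telur ayam = chicken eggs); the helper names `support` and `confidence` are assumptions, and support is an absolute count as on the slides.

```python
# The four transactions from the "Items and Transactions" slide.
transactions = [
    {"rice", "cooking oil", "beef"},
    {"granulated sugar", "cooking oil", "chicken eggs"},
    {"rice", "granulated sugar", "cooking oil", "chicken eggs"},
    {"granulated sugar", "chicken eggs"},
]


def support(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent, consequent):
    """support(antecedent & consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"beef"}))                       # 1, below the minimum support of 2
print(confidence({"rice"}, {"cooking oil"}))   # 1.0
print(confidence({"cooking oil"}, {"rice"}))
```

Confidence is not symmetric: every transaction with rice also contains cooking oil, but cooking oil also appears in a transaction without rice, so the reverse rule has lower confidence.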
  24. 24. Complete Association Rules
  25. 25. Mining Environmental Data Examples
  26. 26. Explosion in Environmental Data <ul><li>Temperature, humidity, pressure, precipitation, sound, light, shock </li></ul><ul><li>Weather & rainfall trends, river height & flows, air & water quality, pollution levels, salinity, emissions, FPAR, NPP </li></ul><ul><li>Earth science, oceanography, meteorology, ecology </li></ul><ul><li>Sensors, hand-held/wireless devices, remote sensing (satellites), other automated logging devices </li></ul>
  27. 27. Geo-spatial Database <ul><li>Discovers rules in a geo-spatial database 12) </li></ul><ul><li>Given Western Canada, describe the weather patterns </li></ul><ul><li>Given temperature, precipitation, etc., describe the regions </li></ul><ul><li>Show the differences in weather patterns between British Columbia and Alberta </li></ul><ul><li>If a Canadian town is large and is adjacent to a large body of water, then it is close to the U.S. border, with a probability of 78% </li></ul>GeoMiner
  28. 28. Earth Science <ul><li>Interesting patterns on Earth Science 14) </li></ul>[Figure panels: regions covered by the highly correlated pattern FPAR-Hi ⇒ NPP-Hi; shrubland regions] FPAR: Fractional Intercepted Photosynthetically Active Radiation. NPP: Net Primary Production.
  29. 29. Earth Science <ul><li>Interesting patterns on Earth Science 14) </li></ul>Two clusters for NPP (land) and two clusters for SST (ocean). The clusters approximate the northern and southern hemispheres, for land and ocean. SST: sea surface temperature
  30. 30. Earth Science <ul><li>Interesting patterns on Earth Science 14) </li></ul>Clusters of ocean near the Philippines (SST) and land in Eastern Brazil, Southern Africa, and a bit of Australia (NPP) are highly correlated (0.47). In particular, this sea region is highly correlated (0.66) with the Southern Oscillation Index (SOI), a climate index related to El Niño, and parts of Southern Africa and Australia are known to experience droughts related to El Niño.
  31. 31. Conclusion <ul><li>Today’s data repositories are huge and are collected at enormous speed </li></ul><ul><li>Traditional statistical methods alone are no longer sufficient to analyze the data </li></ul><ul><li>Data mining is very important for discovering the knowledge hidden in data </li></ul><ul><li>It helps decision making in a broad range of fields: business, network security, web, environment, etc. </li></ul><ul><li>Good visualization tools are needed to understand mining results easily </li></ul>