Overview of Data Mining

Concepts and basic methods of data mining, with an example of association rules in a 'toserba' (Indonesian for department store).

Published in: Technology
  1. Overview of Data Mining
     Meeting of WP Data Mining, April 28, 2008
     Bowo Prasetyo
     http://www.scribd.com/prazjp
     http://www.slideshare.net/bowoprasetyo
  2. Contents
     • What Is Data Mining?
     • Does It Differ from Statistics?
     • Why Use Data Mining?
     • What Can Data Mining Do?
     • Methods of Data Mining
     • Case Study: Department Store (Toserba)
     • Mining Environmental Data
     • Conclusion
  3. What Is Data Mining?
     • The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules 1).
     1) Berry and Linoff, Data Mining Techniques for Marketing, Sales and Customer Support (book), 1997
  4. Does It Differ from Statistics?
     • Data mining is a blend of statistics, artificial intelligence, and database research 16).
     (Diagram labels: Statistics, Artificial Intelligence, Database, Data Mining)
     16) D. Pregibon, Data Mining: Statistical Computing and Graphics, p. 7-8, 1997
  5. Statistics, AI, Database
     • Statistics
         • Distribution, mean, median, standard deviation
     • Artificial Intelligence (AI)
         • Neural network, fuzzy theory, genetic algorithm, particle swarm optimization
     • Database
         • Relational, object-oriented, spatial, temporal
  6. Why Use Data Mining?
     • Data explosion
         • Automated data collection
         • Log data of large organizations 2):
             • 44%: 1 terabyte per month
             • 11%: 10 terabytes per month
         • World's digital data on PCs, digital cameras, servers, sensors, etc. 3):
             • In 2006: 161 billion gigabytes
             • In 2010: 988 billion gigabytes (predicted)
         • Large amounts of data, but small amounts of knowledge
         • Data mining is used to discover the knowledge
     2) ESG Research, New ESG Research Finds Large Organizations Experiencing Explosive Growth in Log Data Collection, Analysis, and Storage, 2007 (http://www.enterprisestrategygroup.com/_documents/NewsEvent/NewsEvent439.pdf)
     3) EMC-IDC Research, The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010, 2006 (http://www.emc.com/about/destination/digital_universe/)
  7. What Can Data Mining Do? Examples
  8. On Business and Network Security
     • Builds customer profiles based on their transaction histories 4)
     • Analyzes corporate credit ratings using public financial statements, such as financial ratios 5)
     • Detects credit card fraud by analyzing a customer transaction database 6)
     • Detects network intrusion based on the behavior of system programs such as sendmail and tcpdump 7)
     4) G. Adomavicius and A. Tuzhilin, Using data mining methods to build customer profiles, in Computer magazine p. 74-82, 2001
     5) Z. Huang, H. Chen, C. Hsu, W. Chen, S. Wu, Credit rating analysis with support vector machines and neural networks: a market comparative study, in Journal of Decision Support Systems p. 543-558, 2004
     6) T. Fawcett and F. Provost, Adaptive Fraud Detection, in Journal of Data Mining and Knowledge Discovery p. 291-316, 2004
     7) W. Lee and S. J. Stolfo, Data Mining Approaches for Intrusion Detection, in Proceedings of the 7th USENIX Security Symposium, 1998
  9. On The Web
     • Discovers useful patterns from log files, contents, and links of websites 8)
     • Ranks the web pages on the internet using link structure analysis 9)
     • Personalizes a website based on log files, contents, and profile data 10)
     • Supports on-line recommendation to customers by analyzing e-commerce transaction records 11)
     8) R. Cooley, B. Mobasher, J. Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, in Proceedings of 9th International Conference on Tools with Artificial Intelligence (ICTAI) p. 0558, 1997
     9) Larry Page, Sergey Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, 1998 (http://citeseer.ist.psu.edu/page98pagerank.html)
     10) M. Eirinaki and M. Vazirgiannis, Web mining for web personalization, in ACM Transactions on Internet Technology (TOIT) p. 1-27, 2003
     11) S. W. Changchien and T. Lu, Mining association rules procedure to support on-line recommendation by customers and products fragmentation, in Journal of Expert Systems with Applications v. 20-4 p. 325-335, 2001
  10. On Environment
     • Discovers rules in geo-spatial databases 12)
     • Analyzes weather impacts on the airspace system 13)
     • Discovers interesting patterns in Earth Science variables (soil moisture, temperature, precipitation) along with ecosystem data (Net Primary Production) 14)
     • Finds Ocean Climate Indices based on pressure and temperature data 15)
     12) J. Han, K. Koperski, N. Stefanovic, GeoMiner: a system prototype for spatial data mining, in Proceedings of ACM SIGMOD international conference on Management of data p. 553-556, 1997
     13) Z. Nazeri and J. Zhang, Mining aviation data to understand impacts of severe weather on airspace system performance, in Proceedings of International Conference on Coding and Computing p. 518-523, 2002
     14) V. Kumar, M. Steinbach, P. Tan, S. Klooster, C. Potter, A. Torregrosa, Mining Scientific Data: Discovery of Patterns in the Global Climate System, in Proceedings of the Joint Statistical Meetings p. 5-9, 2001
     15) M. Steinbach, P. Tan, V. Kumar, S. Klooster, C. Potter, Data Mining for the Discovery of Ocean Climate Indices, in Proceedings of the 5th Workshop on Scientific Data Mining p. 7-16, 2002
  11. Methods in Data Mining: Basic Methods
  12. Classification, Clustering, Association Rules
     • Data mining consists of several basic methods:
         • Classification
             • Places items into groups based on a training set of previously labeled items (supervised)
         • Clustering
             • Places items into groups based on some defined distance measure (unsupervised)
         • Association Rules
             • Discovers items that co-occur frequently within a data set and also their rules, such as implication or correlation
  13. Classification
     • Naive Bayesian classifier
         • Spam/Non-spam classification
             • Spam if: (formula image not preserved in this transcript)
     17) http://en.wikipedia.org/wiki/Naive_Bayes_classifier
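The decision rule on this slide did not survive in the transcript. As a rough illustration only, the sketch below uses one common formulation of a naive Bayes spam filter: a message is labeled spam when its log-posterior score under the spam class exceeds the score under the non-spam class. The training messages, the `train`, `log_score`, and `is_spam` names, and the use of add-one smoothing are all assumptions made for this example, not taken from the slides.

```python
import math
from collections import Counter

def train(messages):
    """Count words per class. messages: list of (text, label), label in {'spam', 'ham'}."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def log_score(text, label, model):
    """Log prior plus the sum of log word likelihoods (the 'naive' independence assumption)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / total_docs)
    for word in text.lower().split():
        # Add-one (Laplace) smoothing avoids log(0) for words unseen in this class.
        score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score

def is_spam(text, model):
    """Assumed decision rule: spam if the spam score beats the non-spam score."""
    return log_score(text, "spam", model) > log_score(text, "ham", model)

# Hypothetical toy training set, for illustration only.
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting now", "ham"),
]
MODEL = train(TRAINING)
```

With this toy model, `is_spam("win a prize", MODEL)` comes out true and `is_spam("monday meeting agenda", MODEL)` comes out false; a real filter would need far more training data.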
  14. Clustering
     • K-means algorithm 18)
         1. Partitions items into k clusters
         2. Calculates the mean of each cluster as its centroid
         3. Associates each item with the closest centroid using the defined distance
         4. Goes back to step 2 until convergence
     18) J. A. Hartigan and M. A. Wong, A k-means clustering algorithm, in Applied Statistics, 28 (1) p. 100-108, 1979
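The four k-means steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration rather than the Hartigan-Wong procedure from the cited paper: it assumes squared Euclidean distance, a random round-robin initial partition, and at least k points, and the names `k_means`, `POINTS`, and the sample coordinates are made up for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: partition items into k clusters (assumption: shuffled round-robin).
    order = list(range(len(points)))
    rng.shuffle(order)
    assignment = [0] * len(points)
    for rank, idx in enumerate(order):
        assignment[idx] = rank % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: the mean of each cluster becomes its centroid.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = mean(members)
        # Step 3: associate each item with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: repeat from step 2 until the assignment stops changing.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return centroids, assignment

# Hypothetical sample data: two well-separated groups of 2-D points.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = k_means(POINTS, 2)
```

On this sample the first three points end up in one cluster and the last three in the other; which cluster receives label 0 depends on the random initial partition.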
  15. Association Rules
     • If a customer buys bread and butter, then she will likely buy milk too, with 90% confidence
     • Algorithm 19):
         1. Finds frequent itemsets whose support >= minsup
         2. Finds interesting rules, with confidence >= minconf, from the frequent itemsets above
     19) R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, in Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994
  16. Association Rules
     • Apriori algorithm to find frequent itemsets L in database D 19):
         • Find the frequent set L_{k-1}
         • Join step
             • C_k is generated by joining L_{k-1} with itself
         • Prune step
             • Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, so every candidate in C_k that contains such a subset is removed
     (C_k: candidate itemsets of size k)
     (L_k: frequent itemsets of size k whose support >= minsup)
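The join and prune steps above can be sketched as follows. This is a simplified illustration, not the optimized procedure from the cited paper: it assumes minsup is an absolute transaction count, and it reuses the four purchases from the department-store case study in this deck with item names translated to English. The function names `support_count` and `apriori` are chosen for this example.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, minsup):
    """Return {frequent itemset: support count}; minsup is an absolute count (assumption)."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    # L_1: frequent single items.
    current = {
        frozenset([i]) for i in items
        if support_count(frozenset([i]), transactions) >= minsup
    }
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support_count(itemset, transactions)
        k += 1
        # Join step: C_k is built by joining L_{k-1} with itself.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that contain an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # L_k: candidates that actually reach the minimum support.
        current = {c for c in candidates if support_count(c, transactions) >= minsup}
    return frequent

# Transactions from the case study, item names translated (assumption).
TRANSACTIONS = [
    {"rice", "cooking oil", "beef"},
    {"granulated sugar", "cooking oil", "chicken eggs"},
    {"rice", "granulated sugar", "cooking oil", "chicken eggs"},
    {"granulated sugar", "chicken eggs"},
]
FREQUENT = apriori(TRANSACTIONS, minsup=2)
```

Here the candidate {rice, cooking oil, granulated sugar} is pruned without counting, because its subset {rice, granulated sugar} is not frequent.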
  17. Association Rules
     • Apriori algorithm to find rules R from frequent itemsets L 19):
         • For each l ∈ L, generate S = the non-empty proper subsets of l
         • For each s ∈ S, generate the rule s => (l - s) if its confidence >= minconf
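A minimal sketch of this rule-generation step is shown below. It assumes the frequent itemsets are already available as a mapping from itemset to support count, and it computes confidence as support(l) / support(s). The support counts in `FREQUENT_COUNTS` are hypothetical numbers picked so that the bread-and-butter rule mentioned earlier in the deck reaches 90% confidence; they are not data from the slides.

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """Return (antecedent, consequent, confidence) for every rule meeting minconf.

    frequent maps each frequent itemset (frozenset) to its support count and must
    contain all subsets of every itemset, which Apriori guarantees.
    """
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                s = frozenset(antecedent)
                confidence = count / frequent[s]  # support(l) / support(s)
                if confidence >= minconf:
                    rules.append((s, itemset - s, confidence))
    return rules

# Hypothetical support counts, for illustration only.
FREQUENT_COUNTS = {
    frozenset({"bread"}): 20,
    frozenset({"butter"}): 12,
    frozenset({"milk"}): 15,
    frozenset({"bread", "butter"}): 10,
    frozenset({"bread", "milk"}): 12,
    frozenset({"butter", "milk"}): 9,
    frozenset({"bread", "butter", "milk"}): 9,
}
RULES = generate_rules(FREQUENT_COUNTS, minconf=0.9)
```

With these counts only two rules pass the 0.9 threshold: {bread, butter} => {milk} at 9/10 and {butter, milk} => {bread} at 9/9.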
  18. Visualization Of Mining Results
     • Problems with mining results
         • Too many results to display
         • Difficult to find important rules
         • Difficult to understand the rules
     • Good visualization tools are needed
         • Charts for statistical results
         • Graphs (nodes and edges) for association rules
         • Globe maps for geo-spatial results
         • Animation for temporal results
         • Use of colors, styles, thickness, etc.
  19. Case Study: Association Rules in a Department Store (Toserba)
  20. Items and Transactions
     • Mr. Joko's purchases in January:
         • rice, cooking oil, beef
         • granulated sugar, cooking oil, chicken eggs
         • rice, granulated sugar, cooking oil, chicken eggs
         • granulated sugar, chicken eggs
     (Diagram labels: transaction, item)
  21. Frequent Item
     • "Frequent": purchased >= 2 times
     • beef = 1 time => not frequent
     (Diagram labels: support, minimum support)
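As a small illustration of this slide, the snippet below counts how often each item appears in the four January purchases and keeps the items that reach the minimum support of 2. The item names are translated to English, and the variable names are chosen for this example.

```python
from collections import Counter

# The four purchases from the previous slide, item names translated (assumption).
TRANSACTIONS = [
    ["rice", "cooking oil", "beef"],
    ["granulated sugar", "cooking oil", "chicken eggs"],
    ["rice", "granulated sugar", "cooking oil", "chicken eggs"],
    ["granulated sugar", "chicken eggs"],
]
MIN_SUPPORT = 2  # "frequent" means purchased at least twice

# Support of an item = number of transactions that contain it.
item_support = Counter(item for t in TRANSACTIONS for item in set(t))
frequent_items = {item: n for item, n in item_support.items() if n >= MIN_SUPPORT}
```

Beef has a support of 1 and is therefore dropped, while the other four items remain.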
  22. n-Length Item (n-Item)
     • n > 1
     (Diagram labels: 2-length item, 3-length item)
  23. Association Rules
     • Customers who buy rice will also buy cooking oil.
     • "if rice then cooking oil"
     rice => cooking oil
     confidence = support(cooking oil & rice) / support(rice) = 2/2 = 1
     (Diagram labels: antecedent = rice, consequent = cooking oil)
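The confidence calculation on this slide can be checked with a short script. It is only a sketch: the transactions are the four January purchases with item names translated to English, and `support` and `confidence` are helper names invented for the example.

```python
# The four purchases from the case study, item names translated (assumption).
TRANSACTIONS = [
    {"rice", "cooking oil", "beef"},
    {"granulated sugar", "cooking oil", "chicken eggs"},
    {"rice", "granulated sugar", "cooking oil", "chicken eggs"},
    {"granulated sugar", "chicken eggs"},
]

def support(items, transactions):
    """Number of transactions that contain all the given items."""
    return sum(1 for t in transactions if set(items) <= t)

def confidence(antecedent, consequent, transactions):
    """support(antecedent & consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rice_to_oil = confidence({"rice"}, {"cooking oil"}, TRANSACTIONS)
oil_to_rice = confidence({"cooking oil"}, {"rice"}, TRANSACTIONS)
```

The rule rice => cooking oil has confidence 2/2 = 1, while the reverse rule cooking oil => rice only reaches 2/3, which shows that confidence is not symmetric.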
  24. Complete Association Rules
  25. Mining Environmental Data: Examples
  26. Explosion in Environmental Data
     • Temperature, humidity, pressure, precipitation, sound, light, shock
     • Weather & rainfall trends, river height & flows, air & water quality, pollution levels, salinity, emissions, FPAR, NPP
     • Earth science, oceanography, meteorology, ecology
     • Sensors, hand-held/wireless devices, remote sensing (satellites), other automated logging devices
  27. Geo-spatial Database
     • Discovers rules in a geo-spatial database (GeoMiner) 12)
     • Given Western Canada, describe the weather patterns
     • Given temperature, precipitation, etc., describe the regions
     • Show the differences in weather patterns between British Columbia and Alberta
     • If a Canadian town is large and is adjacent to a large body of water, then it is close to the U.S. border, with a probability of 78%
  28. Earth Science
     • Interesting patterns in Earth Science 14)
     Regions that are covered by the highly correlated pattern FPAR-Hi => NPP-Hi
     (Figure label: Shrubland regions)
     FPAR: Fractional Intercepted Photosynthetically Active Radiation
     NPP: Net Primary Production
  29. Earth Science
     • Interesting patterns in Earth Science 14)
     Two clusters for NPP (land) and two clusters for SST (ocean). The clusters approximate the northern and southern hemispheres, for land and ocean.
     SST: sea surface temperature
  30. Earth Science
     • Interesting patterns in Earth Science 14)
     Clusters of ocean near the Philippines (SST) and of land in Eastern Brazil, Southern Africa, and a bit of Australia (NPP) are highly correlated (0.47). In particular, this sea region is highly correlated (0.66) with SOI, a climate index related to El Niño, and parts of Southern Africa and Australia are known to experience droughts related to El Niño.
  31. Conclusion
     • Today's data repositories are huge and are collected at enormous speed
     • Traditional statistical methods are no longer sufficient to analyze the data
     • Data mining is very important for discovering knowledge hidden in data
     • It helps decision making in a broad range of fields: business, network security, web, environment, etc.
     • Good visualization tools are needed to understand mining results easily