Data Mining


  1. 1. Data Mining Using IBM Intelligent Miner. Presented by: Qiyan (Jennifer) Huang
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Mining Process </li></ul><ul><li>Main Functionalities of Intelligent Miner </li></ul><ul><li>Other Data Mining Products </li></ul><ul><li>Data Mining and Privacy </li></ul><ul><li>Summary </li></ul><ul><li>References </li></ul>
  3. 3. What is Data Mining? <ul><li>Data mining: discovering interesting patterns from large amounts of data </li></ul><ul><ul><li>Also known as knowledge discovery (mining) in databases (KDD), data/pattern analysis, information harvesting, business intelligence, etc. </li></ul></ul>
  4. 4. Evolution of Database Technology <ul><li>1960s: </li></ul><ul><ul><li>Data collection, database creation </li></ul></ul><ul><li>1970s: </li></ul><ul><ul><li>Relational data model, relational DBMS implementation </li></ul></ul><ul><li>1980s ~ present: </li></ul><ul><ul><li>RDBMS, advanced data models </li></ul></ul><ul><li>1990s to 2000s: </li></ul><ul><ul><li>Data mining and data warehousing, multimedia databases, and Web databases </li></ul></ul>
  5. 5. Data Mining vs. Database Query <ul><li>Database query </li></ul><ul><ul><li>Find all customers who have purchased milk. </li></ul></ul><ul><ul><li>Identify customers who have purchased more than $10,000 in the last month. </li></ul></ul><ul><li>Data mining </li></ul><ul><ul><li>Find all items that are frequently purchased with milk (association rules). </li></ul></ul><ul><ul><li>Identify customers with similar buying habits (clustering). </li></ul></ul>
  6. 6. Data Mining Process (KDD) [Figure: Databases → Data Cleaning → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge. Source: J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2001]
  7. 7. About DB2 Intelligent Miner <ul><li>DB2 Intelligent Miner for Data is “focused on the large-scale mining, such as large volumes of data, parallel data mining on Windows NT, Sun Solaris, and OS/390” (IBM) </li></ul>
  8. 8. Main Functionalities <ul><li>Cluster analysis </li></ul><ul><ul><li>Group records that share similar trends and patterns </li></ul></ul><ul><li>Classification </li></ul><ul><ul><li>Predict an outcome from historical data </li></ul></ul><ul><li>Association analysis </li></ul><ul><ul><li>Find patterns that occur together frequently </li></ul></ul>
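As a rough illustration of the cluster analysis item above, the following is a minimal k-means sketch in Python. It is not taken from Intelligent Miner; the slides do not say which clustering method the product uses, and k-means is swapped in here only because it is short. The function name `kmeans` and the customer figures are hypothetical.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


# Hypothetical customers: (store visits per month, average basket value in dollars).
customers = [(2, 15), (3, 18), (2, 20), (12, 95), (11, 110), (13, 100)]
centroids, labels = kmeans(customers, k=2)
print(labels)
```

On this toy data the three occasional shoppers end up in one group and the three frequent, high-spending shoppers in the other. With real data the attributes would normally be scaled first so that no single column dominates the distance.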
  9. 18. Classification <ul><li>This follows an example from Quinlan’s ID3 </li></ul>
  10. 20. Classification
  11. 21. Classification <ul><li>This follows an example from Quinlan’s ID3 </li></ul>
  12. 22. Association <ul><ul><li>Association rule: identifies relationships between items </li></ul></ul><ul><ul><li>Example: “Across all transactions, 30% of customers buy shirts, and 60% of these customers will also buy a tie” </li></ul></ul><ul><ul><ul><li>Confidence: 60%, the share of shirt buyers who also buy a tie </li></ul></ul></ul><ul><ul><ul><li>Support: the share of all transactions in which the items appear together; if shirt and tie are bought together in 12% of all transactions, the support is 12% </li></ul></ul></ul><ul><ul><ul><li>Lift = confidence / support of the rule head; if ties appear in 30% of all transactions, lift = 60% / 30% = 2 </li></ul></ul></ul>
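The three measures on this slide can be computed directly from a list of transactions. The sketch below is a minimal Python illustration, assuming the common definition of lift as confidence divided by the support of the rule head; the function name `rule_metrics` and the ten baskets are hypothetical and are not the slide's figures.

```python
def rule_metrics(transactions, body, head):
    """Return (support, confidence, lift) for the rule body => head.

    transactions is a list of item sets; body and head are item sets.
    """
    n = len(transactions)
    body_count = sum(1 for t in transactions if body <= t)
    head_count = sum(1 for t in transactions if head <= t)
    both_count = sum(1 for t in transactions if (body | head) <= t)

    support = both_count / n                # body and head together
    confidence = both_count / body_count    # head, given the body
    lift = confidence / (head_count / n)    # confidence vs. baseline head rate
    return support, confidence, lift


# Hypothetical baskets: 5 contain a shirt, 4 contain a tie, 3 contain both.
transactions = (
    [{"shirt", "tie"}] * 3 + [{"shirt"}] * 2 + [{"tie"}] * 1 + [{"socks"}] * 4
)
support, confidence, lift = rule_metrics(transactions, {"shirt"}, {"tie"})
print(support, confidence, lift)
```

Here the support is 0.3, the confidence 0.6, and the lift about 1.5: shirt buyers are roughly one and a half times as likely to buy a tie as the average customer. A lift near 1 would mean the body tells us nothing about the head.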
  13. 23. Association <ul><li>Support (%) | Confidence (%) | Type | Lift | Rule Body => Rule Head </li></ul><ul><li>5.5286 | 34.0800 | + | 2.7300 | [203] + [1207] => [1716] </li></ul><ul><li>7.0388 | 34.1300 | + | 2.7400 | [203] + [1719] => [1716] </li></ul><ul><li>5.4662 | 34.1700 | + | 2.7400 | [202] + [802] => [1716] </li></ul><ul><li>5.8805 | 34.3400 | + | 2.7500 | [203] + [802] => [1716] </li></ul><ul><li>5.0163 | 34.4900 | + | 2.7600 | [203] + [705] => [1716] </li></ul><ul><li>7.1279 | 34.7400 | + | 2.7800 | [202] + [1718] => [1716] </li></ul><ul><li>5.8226 | 34.7600 | + | 3.3900 | [711] + [203] => [710] </li></ul><ul><li>5.0697 | 34.8300 | + | 2.7400 | [202] + [1702] => [1703] </li></ul><ul><li>5.2836 | 34.8300 | + | 2.7400 | [202] + [1207] => [1703] </li></ul><ul><li>5.4350 | 34.9400 | + | 3.4100 | [201] + [711] => [710] </li></ul><ul><li>5.3459 | 35.0200 | + | 3.4100 | [201] + [1702] => [1703] </li></ul>
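A quick way to read a rule listing like this one is to rank the rules by lift and to back out the support of each rule head. The Python sketch below copies four rows from the listing and assumes that lift is confidence divided by the support of the rule head; that definition is an assumption, not something the slide states, and the helper name `implied_head_support` is hypothetical.

```python
# (support %, confidence %, lift, rule body, rule head), copied from four rows.
rules = [
    (5.5286, 34.08, 2.73, ("203", "1207"), "1716"),
    (7.0388, 34.13, 2.74, ("203", "1719"), "1716"),
    (5.8226, 34.76, 3.39, ("711", "203"), "710"),
    (5.0697, 34.83, 2.74, ("202", "1702"), "1703"),
]


def implied_head_support(confidence_pct, lift):
    """Head support (%) implied by lift = confidence / support(head)."""
    return confidence_pct / lift


for support, confidence, lift, body, head in rules:
    share = implied_head_support(confidence, lift)
    print(f"{' + '.join(body)} => {head}: head in about {share:.2f}% of transactions")

# The rule with the highest lift departs most from the baseline rate of its head.
strongest = max(rules, key=lambda r: r[2])
print("highest lift:", strongest[3], "=>", strongest[4])
```

Under that assumption, the two rules with head [1716] imply nearly the same head support (about 12.5%), which is a useful consistency check on the reading of the lift column.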
  14. 24. Data Mining Products <ul><li>More than 50 commercial data mining tools </li></ul><ul><li>Wide range of pricing </li></ul><ul><ul><li>SAS Institute’s Enterprise Miner ~ $80k </li></ul></ul><ul><ul><li>SPSS Inc.’s Clementine ~ $75k </li></ul></ul><ul><ul><li>IBM Intelligent Miner ~ $60k </li></ul></ul><ul><ul><li>Desktop products start at a few hundred dollars </li></ul></ul>
  15. 25. Data Mining Products [Table: Data Mining Product Comparison by Algorithm. Columns: Algorithm | IBM | SAS | SPSS. Rows, with the check marks that survive: Neural Network (√ √ √), Decision Tree (√ √ √), Kohonen Self-Organizing Map (√ √), Clustering (√ √), Association (√ √), Nearest Neighbour (√)]
  16. 26. Data Mining & Privacy <ul><li>Release a limited subset of the data </li></ul><ul><ul><li>Hide attributes that are potentially related to personal information </li></ul></ul><ul><li>Release encrypted data </li></ul><ul><li>Audit to detect misuse of data </li></ul><ul><li>Set up a data mining controller </li></ul>
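The first measure above, releasing only a limited subset of the data, can be sketched as simple attribute suppression. This is a minimal Python illustration, not a complete privacy mechanism; the function name `release_subset`, the field names, and the records are all hypothetical.

```python
def release_subset(records, sensitive):
    """Return copies of the records with the sensitive attributes removed."""
    return [
        {key: value for key, value in record.items() if key not in sensitive}
        for record in records
    ]


# Hypothetical customer records held in the warehouse.
records = [
    {"name": "A. Smith", "postcode": "V5A 1S6", "age": 34, "basket_total": 82.5},
    {"name": "B. Jones", "postcode": "V6B 2K4", "age": 51, "basket_total": 19.0},
]

# Only the non-identifying attributes are handed to the mining tool.
released = release_subset(records, sensitive={"name", "postcode"})
print(released)
```

Dropping columns alone does not guarantee anonymity, since the remaining attributes can sometimes be combined to re-identify a person, which is why the slide pairs it with encryption, auditing, and a controller.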
  17. 27. Summary <ul><li>Introduction to Data Mining </li></ul><ul><li>A KDD Data Mining Process </li></ul><ul><li>Functionalities of Intelligent Miner </li></ul><ul><li>Commercial Data Mining Tools </li></ul><ul><li>Data Mining & Privacy </li></ul>
  18. 28. References <ul><li>Angoss Whitepaper. http://www.angoss.com/ProdServ/AnalyticalTools/kseeker/whitepaper.html . Retrieved Oct 26, 2003 </li></ul><ul><li>C. Clifton and D. Marks. Security and Privacy Implications of Data Mining. 1996 </li></ul><ul><li>D. W. Abbott, I. P. Matkovsky and J. F. Elder IV. An Evaluation of High-end Data Mining Tools </li></ul><ul><li>Elder Research. http://www.rgrossman.com/faq/dm-02.htm . Retrieved Oct 28, 2003 </li></ul><ul><li>IBM. DB2 Intelligent Miner. http://www-3.ibm.com/software/data/iminer/ . Retrieved Oct 26, 2003 </li></ul><ul><li>J. F. Elder and D. W. Abbott. A Comparison of Leading Data Mining Tools. August 1998 </li></ul><ul><li>J. Han and M. Kamber. Data Mining: Concepts and Techniques. 2000 </li></ul><ul><li>http://www.cald.cs.cmu.edu/summerschool03/PrivacyPreservingDM.ppt . Retrieved Nov 10, 2003 </li></ul><ul><li>Robert Grossman. http://www.datamininglab.com/toolcomp.html#comparison . Retrieved Oct 20, 2003 </li></ul><ul><li>SPSS. http://www.spss.com/ . Retrieved Nov 12, 2003 </li></ul>
  19. 30. Evolution of Database Technology <ul><li>1960s: </li></ul><ul><ul><li>Data collection, database creation, and network DBMS </li></ul></ul><ul><li>1970s: </li></ul><ul><ul><li>Relational data model, relational DBMS implementation </li></ul></ul><ul><li>1980s: </li></ul><ul><ul><li>RDBMS, advanced data models </li></ul></ul><ul><li>1990s to 2000s: </li></ul><ul><ul><li>Data mining and data warehousing, multimedia databases, and Web databases </li></ul></ul>
  20. 31. Data Mining: On What Kind of Data? <ul><li>Data Sources </li></ul><ul><ul><li>Relational database </li></ul></ul><ul><ul><li>Data warehouses </li></ul></ul><ul><ul><li>Transactional databases </li></ul></ul><ul><ul><li>WWW </li></ul></ul><ul><li>Data types </li></ul><ul><ul><li>Audio </li></ul></ul><ul><ul><li>Image </li></ul></ul><ul><ul><li>Text </li></ul></ul>
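  21. 32. Output: A Decision Tree for “buys_computer” <ul><li>age? </li></ul><ul><ul><li><=30: student? </li></ul></ul><ul><ul><ul><li>no: no </li></ul></ul></ul><ul><ul><ul><li>yes: yes </li></ul></ul></ul><ul><ul><li>30..40: yes </li></ul></ul><ul><ul><li>>40: credit rating? </li></ul></ul><ul><ul><ul><li>excellent: no </li></ul></ul></ul><ul><ul><ul><li>fair: yes </li></ul></ul></ul>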
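A decision tree of this kind is just a nested set of tests, so it can be written as a small function. The Python sketch below assumes the tree has the shape of the well-known buys_computer example from Han and Kamber: age is tested first, applicants aged 30 or under are split on student status, the 30..40 group is always "yes", and those over 40 are split on credit rating. The flattened slide text does not show which leaf belongs to which branch, so that mapping is an assumption, as is the function signature.

```python
def buys_computer(age, student, credit_rating):
    """Classify one customer with the assumed buys_computer tree.

    age is in years, student is a bool, credit_rating is "fair" or "excellent".
    """
    if age <= 30:
        # Young customers: the decision depends on student status.
        return "yes" if student else "no"
    if age <= 40:
        # Middle age band: always predicted to buy.
        return "yes"
    # Older customers: the decision depends on credit rating.
    return "yes" if credit_rating == "fair" else "no"


print(buys_computer(age=25, student=True, credit_rating="fair"))
print(buys_computer(age=45, student=False, credit_rating="excellent"))
```

Each root-to-leaf path corresponds to one classification rule, for example "age <= 30 and student = yes implies buys_computer = yes".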
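  22. 33. Neural network [Figure: a single neuron. The input vector x = (x_0, x_1, ..., x_n) and the weight vector w = (w_0, w_1, ..., w_n) feed a weighted sum Σ; the bias μ_k is subtracted and the activation function f produces the output y, i.e. $y = f\left(\sum_{i=0}^{n} w_i x_i - \mu_k\right)$]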
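The single neuron on this slide combines an input vector x with a weight vector w through a weighted sum and passes the result to an activation function. A minimal Python sketch follows; it assumes the garbled symbol next to the activation function is a bias (threshold) that is subtracted from the weighted sum, and the names `neuron`, `step`, and `mu` as well as the numbers are hypothetical.

```python
import math


def neuron(x, w, mu, f=math.tanh):
    """Output of one neuron: f(sum_i w_i * x_i - mu)."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return f(weighted_sum - mu)


def step(value):
    """Sign-style activation: +1 at or above zero, otherwise -1."""
    return 1 if value >= 0 else -1


# Hypothetical inputs and weights: 0.4*1.0 + 0.6*0.5 + 0.2*(-1.0) = 0.5
y = neuron(x=[1.0, 0.5, -1.0], w=[0.4, 0.6, 0.2], mu=0.1, f=step)
print(y)
```

With a bias of 0.1 the weighted sum clears the threshold and the neuron fires; raising the bias to 1.0 flips the output. Training a network amounts to adjusting w and the bias so that such outputs match the known class labels.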
  23. 34. Neural network [Figure: network diagram annotated with the values 0.15, 0.29, 0.11, 0.25, 0.09, 0.23, 0.32, 0.27]
  24. 35. Neural network
  25. 36. Applications of Clustering <ul><li>Pattern Recognition </li></ul><ul><li>Image Processing </li></ul><ul><li>Economic Science (especially market research) </li></ul><ul><li>WWW </li></ul><ul><ul><li>Document classification </li></ul></ul><ul><ul><li>Cluster Weblog data to discover groups of similar access patterns </li></ul></ul>
  26. 37. Data Mining & Privacy [Figure: diagram with three components: Data Mining Tool, Mining Controller, Data Warehouse]
  27. 38. Examples of Clustering Applications <ul><li>Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs </li></ul><ul><li>Insurance: identify groups of motor insurance policy holders with a high average claim cost </li></ul><ul><li>City planning: identify groups of houses according to their house type, value, and geographical location </li></ul><ul><li>Earthquake studies: observed earthquake epicenters should cluster along continental faults </li></ul>
  28. 39. Association <ul><li>Association and pattern analysis </li></ul><ul><ul><li>Applications: </li></ul></ul><ul><ul><ul><li>Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc. </li></ul></ul></ul><ul><ul><li>Examples (rule [support, confidence]): </li></ul></ul><ul><ul><ul><li>buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%] </li></ul></ul></ul><ul><ul><ul><li>major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%] </li></ul></ul></ul>
  29. 40. Data Mining: On What Kind of Data? <ul><li>Relational databases </li></ul><ul><li>Data warehouses </li></ul><ul><li>Transactional databases </li></ul><ul><li>Advanced DB and information repositories </li></ul><ul><ul><li>Object-oriented and object-relational databases </li></ul></ul><ul><ul><li>Text databases and multimedia databases </li></ul></ul><ul><ul><li>Heterogeneous and legacy databases </li></ul></ul><ul><ul><li>WWW </li></ul></ul>
  30. 41. Steps of a KDD Process <ul><li>Learning the application domain: </li></ul><ul><ul><li>Relevant prior knowledge and goals of the application </li></ul></ul><ul><li>Creating a target data set: data selection </li></ul><ul><li>Data cleaning and preprocessing (may take 60% of the effort!) </li></ul><ul><li>Data reduction and transformation: </li></ul><ul><ul><li>Find useful features, dimensionality/variable reduction, invariant representation </li></ul></ul><ul><li>Choosing the functions of data mining: </li></ul><ul><ul><li>Summarization, classification, regression, association, clustering </li></ul></ul><ul><li>Choosing the mining algorithm(s) </li></ul><ul><li>Data mining: search for patterns of interest </li></ul><ul><li>Pattern evaluation and knowledge presentation: </li></ul><ul><ul><li>Visualization, transformation, removing redundant patterns, etc. </li></ul></ul><ul><li>Use of discovered knowledge </li></ul>
  31. 42. Strengths and Weaknesses of Intelligent Miner <ul><li>Strengths </li></ul><ul><ul><li>Algorithm breadth </li></ul></ul><ul><ul><li>Graphical output </li></ul></ul><ul><ul><li>Available for PC and mainframe environments </li></ul></ul><ul><li>Weaknesses </li></ul><ul><ul><li>No automation </li></ul></ul><ul><ul><li>Data has to reside in IBM’s database system </li></ul></ul>
