Data Mining
Transcript

  • 1. Data Mining Using IBM Intelligent Miner Presented by: Qiyan (Jennifer) Huang
  • 2. Outline
    • Introduction
    • Mining Process
    • Main Functionalities of Intelligent Miner
    • Other Data Mining Products
    • Data Mining and Privacy
    • Summary
    • References
  • 3. What is Data Mining
    • Data mining: discovering interesting patterns from large amounts of data
      • Also known as knowledge discovery (mining) in databases (KDD), data/pattern analysis, information harvesting, business intelligence, etc.
  • 4. Evolution of Database Technology
    • 1960s:
      • Data collection, database creation
    • 1970s:
      • Relational data model, relational DBMS implementation
    • 1980s ~ present:
      • RDBMS, advanced data models
    • 1990s to 2000s:
      • Data mining and data warehousing, multimedia databases, and Web databases
  • 5. Data Mining vs. Database Query
    • Database query
      • Find all customers who have purchased milk.
      • Identify customers who have purchased more than $10,000 in the last month.
    • Data mining
      • Find all items which are frequently purchased with milk (association rules).
      • Identify customers with similar buying habits (clustering).
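The contrast between a database query and a mining question can be sketched in a few lines of Python. This is only an illustration: the `transactions` records and both function names are hypothetical and do not come from the slides or from Intelligent Miner.

```python
from collections import Counter

# Hypothetical purchase records: (customer, set of items in one transaction).
transactions = [
    ("ann", {"milk", "bread"}),
    ("bob", {"milk", "bread", "eggs"}),
    ("cat", {"beer", "chips"}),
    ("dan", {"milk", "eggs"}),
]

def customers_who_bought(item, records):
    """Database-style query: an exact lookup with a known condition."""
    return sorted({customer for customer, items in records if item in items})

def items_bought_with(item, records):
    """Mining-style question: count which other items co-occur with the item."""
    counts = Counter()
    for _, items in records:
        if item in items:
            counts.update(items - {item})
    return counts.most_common()

print(customers_who_bought("milk", transactions))
print(items_bought_with("milk", transactions))
```

The query returns rows that match a condition stated in advance, while the mining function surfaces a pattern (what tends to be bought with milk) that was not specified beforehand.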
  • 6. Data Mining Process (KDD)
    • Databases -> Data Cleaning -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge
    • Source: J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2001
  • 7. About DB2 Intelligent Miner
    • DB2 Intelligent Miner for Data: “focused on the large-scale mining, such as large volumes of data, parallel data mining on Windows NT, Sun Solaris, and OS/390” – IBM
  • 8. Main Functionalities
    • Cluster analysis
      • Group the data that share similar trends and patterns
    • Classification
      • Predict the outcome based on historical data
    • Association analysis
      • Find frequent patterns
  • 9.  
  • 10.  
  • 11.  
  • 12.  
  • 13.  
  • 14.  
  • 15.  
  • 16.  
  • 17.  
  • 18. This follows an example from Quinlan’s ID3 Classification
  • 19.  
  • 20. Classification
  • 21. This follows an example from Quinlan’s ID3 Classification
  • 22. Association
      • Association rule: identifies relationships between items
      • Example
      • “30% of customers buy shirts across all transactions, and 60% of these customers will also buy a tie”
        • Confidence is 60%
        • Support: if a shirt and a tie are bought together in 12% of all transactions, the support is 12%
        • Lift = confidence / support of the rule head; 60% / 30% = 2 when ties appear in 30% of all transactions
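The three measures can be computed directly from transaction counts. The sketch below is a minimal illustration, not Intelligent Miner's implementation: the ten `baskets` are hypothetical, and lift is taken as confidence divided by the support of the rule head.

```python
def rule_metrics(transactions, body, head):
    """Return (support, confidence, lift) for the rule body => head.

    Assumes at least one transaction contains the body and at least one
    contains the head; otherwise the divisions below would fail.
    """
    n = len(transactions)
    n_body = sum(1 for t in transactions if body <= t)
    n_head = sum(1 for t in transactions if head <= t)
    n_both = sum(1 for t in transactions if (body | head) <= t)
    support = n_both / n                # share of all transactions with body and head
    confidence = n_both / n_body        # share of body transactions that also have head
    lift = confidence / (n_head / n)    # confidence relative to how common the head is
    return support, confidence, lift

# Hypothetical data: 10 transactions, 5 with a shirt, 4 with a tie, 3 with both.
baskets = (
    [{"shirt", "tie"}] * 3
    + [{"shirt"}] * 2
    + [{"tie"}] * 1
    + [{"socks"}] * 4
)

support, confidence, lift = rule_metrics(baskets, {"shirt"}, {"tie"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the head is bought more often alongside the body than it is overall; a lift near 1 means the rule adds little information.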
  • 23. Association
    • Support (%) | Confidence (%) | Type | Lift | Rule Body | Rule Head
    • 5.5286 | 34.0800 | + | 2.7300 | [203] + [1207] | [1716]
    • 7.0388 | 34.1300 | + | 2.7400 | [203] + [1719] | [1716]
    • 5.4662 | 34.1700 | + | 2.7400 | [202] + [802] | [1716]
    • 5.8805 | 34.3400 | + | 2.7500 | [203] + [802] | [1716]
    • 5.0163 | 34.4900 | + | 2.7600 | [203] + [705] | [1716]
    • 7.1279 | 34.7400 | + | 2.7800 | [202] + [1718] | [1716]
    • 5.8226 | 34.7600 | + | 3.3900 | [711] + [203] | [710]
    • 5.0697 | 34.8300 | + | 2.7400 | [202] + [1702] | [1703]
    • 5.2836 | 34.8300 | + | 2.7400 | [202] + [1207] | [1703]
    • 5.4350 | 34.9400 | + | 3.4100 | [201] + [711] | [710]
    • 5.3459 | 35.0200 | + | 2.7600 | [201] + [1702] | [1703]
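One way to read the rule listing is to relate its columns to each other. Assuming lift is confidence divided by the support of the rule head, the head's own support can be backed out of each row. The sketch below copies three rows from the listing; the function name `implied_head_support` is hypothetical.

```python
# (support %, confidence %, lift, rule body, rule head), copied from the listing.
rules = [
    (5.5286, 34.08, 2.73, "[203] + [1207]", "[1716]"),
    (5.8226, 34.76, 3.39, "[711] + [203]", "[710]"),
    (5.0697, 34.83, 2.74, "[202] + [1702]", "[1703]"),
]

def implied_head_support(confidence_pct, lift):
    """Back out the head's support (%), assuming lift = confidence / head support."""
    return confidence_pct / lift

for support, confidence, lift, body, head in rules:
    head_support = implied_head_support(confidence, lift)
    print(f"{body} => {head}: head appears in about {head_support:.1f}% of transactions")
```

Under that assumption, each head item appears in roughly 10% to 13% of all transactions, while it appears in about 34% to 35% of the transactions that contain the rule body.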
  • 24. Data Mining Products
    • More than 50 commercial data mining tools
    • Wide range of pricing
      • SAS Institute’s Enterprise Miner ~ $80k
      • SPSS Inc. Clementine ~ $75k
      • IBM Intelligent Miner ~ $60k
      • Desktop products start at a few hundred dollars
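  • 25. Data Mining Products: Data Mining Product Comparison on Algorithm (IBM, SAS, SPSS)
    • Neural Network: all three products
    • Decision Tree: all three products
    • Kohonen Self-Organizing Map: two of the three products
    • Clustering: two of the three products
    • Association: two of the three products
    • Nearest Neighbour: one of the three products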
  • 26. Data Mining & Privacy
    • Release limited subset of data
      • Hide attributes that are potentially related to personal information
    • Release Encrypted Data
    • Audit to detect misuse of Data
    • Set up Data Mining Controller
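Releasing a limited subset of the data can be as simple as dropping personal attributes before records leave the warehouse. This is a minimal sketch under stated assumptions: the attribute names in `PERSONAL_ATTRIBUTES` and the sample record are hypothetical, and which attributes count as personal depends on the data set.

```python
# Hypothetical list of attributes considered personal in this data set.
PERSONAL_ATTRIBUTES = frozenset({"name", "phone", "address"})

def release_subset(records, hidden=PERSONAL_ATTRIBUTES):
    """Return copies of the records with the hidden attributes removed."""
    return [
        {key: value for key, value in record.items() if key not in hidden}
        for record in records
    ]

customers = [
    {"name": "Ann Lee", "phone": "555-0100", "region": "east", "spend": 1200},
]

print(release_subset(customers))
```

Dropping columns alone does not guarantee privacy, since the remaining attributes may still identify a person in combination; that is one reason the slide also lists encryption, auditing, and a mining controller.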
  • 27. Summary
    • Introduction to Data Mining
    • A KDD Data Mining Process
    • Functionalities of Intelligent Miner
    • Commercial Data Mining Tools
    • Data Mining & Privacy
  • 28. References
    • Angoss Whitepaper. http://www.angoss.com/ProdServ/AnalyticalTools/kseeker/whitepaper.html. Retrieved on Oct 26th, 2003
    • C. Clifton & D. Marks. Security and Privacy Implications of Data Mining. 1996
    • D. W. Abbott, I. P. Matkovsky & J. F. Elder IV. An Evaluation of High-end Data Mining Tools
    • Elder Research. http://www.rgrossman.com/faq/dm-02.htm. Retrieved on Oct 28th, 2003
    • IBM. DB2 Intelligent Miner. http://www-3.ibm.com/software/data/iminer/. Retrieved on Oct 26th, 2003
    • J. F. Elder & D. W. Abbott. A Comparison of Leading Data Mining Tools. August 1998
    • J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2000
    • http://www.cald.cs.cmu.edu/summerschool03/PrivacyPreservingDM.ppt. Retrieved on Nov 10th, 2003
    • Robert Grossman. http://www.datamininglab.com/toolcomp.html#comparison. Retrieved on Oct 20th, 2003
    • SPSS. http://www.spss.com/. Retrieved on Nov 12th, 2003
  • 29.  
  • 30. Evolution of Database Technology
    • 1960s:
      • Data collection, database creation, and network DBMS
    • 1970s:
      • Relational data model, relational DBMS implementation
    • 1980s:
      • RDBMS, advanced data models
    • 1990s to 2000s:
      • Data mining and data warehousing, multimedia databases, and Web databases
  • 31. Data Mining: On What Kind of Data?
    • Data Sources
      • Relational database
      • Data warehouses
      • Transactional databases
      • WWW
    • Data types
      • Audio
      • Image
      • Text
  • 32. Output: A Decision Tree for “buys_computer”
    • age?
      • <=30: student?
        • no: no
        • yes: yes
      • 30..40: yes
      • >40: credit rating?
        • excellent: no
        • fair: yes
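A decision tree like this one reads naturally as nested conditions. The sketch below assumes the branch assignments of the widely used textbook version of the “buys_computer” tree (age, then student or credit rating); the flattened slide text does not fully confirm which leaf belongs to which branch, so treat the mapping as an assumption. The function name and argument types are hypothetical.

```python
def buys_computer(age, student, credit_rating):
    """Classify one customer by walking the tree from the root.

    age: int, student: bool, credit_rating: "fair" or "excellent".
    Branch-to-leaf mapping is assumed from the textbook example.
    """
    if age <= 30:
        # Youngest group: the decision depends on student status.
        return "yes" if student else "no"
    if age <= 40:
        # Middle group: always predicted to buy.
        return "yes"
    # Oldest group: the decision depends on credit rating.
    return "yes" if credit_rating == "fair" else "no"

print(buys_computer(25, student=True, credit_rating="fair"))
print(buys_computer(35, student=False, credit_rating="excellent"))
print(buys_computer(50, student=False, credit_rating="excellent"))
```

Each call tests at most two attributes, which is what makes tree classifiers cheap to apply and easy to explain.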
  • 33. Neural network
    • Input vector x = (x0, x1, ..., xn) and weight vector w = (w0, w1, ..., wn)
    • The weighted sum ∑ wi xi, minus the bias μk, is passed through the activation function f to give output y
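The single neuron on this slide can be sketched as a weighted sum followed by an activation function. This is a minimal illustration: the function name `neuron_output`, the choice of `tanh` as the default activation, and the sample numbers are assumptions, not values from the slides.

```python
import math

def neuron_output(x, w, bias, activation=math.tanh):
    """Compute y = f(sum(w_i * x_i) - bias) for one neuron.

    x and w are equal-length sequences of numbers; bias is the threshold
    subtracted from the weighted sum before the activation is applied.
    """
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return activation(weighted_sum - bias)

# Hypothetical inputs and weights.
y = neuron_output(x=[1.0, 2.0], w=[0.5, 0.25], bias=0.5)
print(y)
```

A network stacks many such units in layers, and training adjusts the weights and biases so that the outputs match known examples.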
  • 34. Neural network 0.15 0.29 0.11 0.25 0.09 0.23 0.32 0.27
  • 35. Neural network
  • 36. Applications of Clustering
    • Pattern Recognition
    • Image Processing
    • Economic Science (especially market research)
    • WWW
      • Document classification
      • Cluster Weblog data to discover groups of similar access patterns
  • 37. Data Mining & Privacy (diagram: Data Mining Tool, Mining Controller, Data warehouse)
  • 38. Examples of Clustering Applications
    • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
    • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
    • City-planning: Identifying groups of houses according to their house type, value, and geographical location
    • Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
  • 39. Association
    • Association and pattern analysis
      • Applications:
        • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
      • Examples [support, confidence]:
        • buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]
        • major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%, 75%]
  • 40. Data Mining: On What Kind of Data?
    • Relational databases
    • Data warehouses
    • Transactional databases
    • Advanced DB and information repositories
      • Object-oriented and object-relational databases
      • Text databases and multimedia databases
      • Heterogeneous and legacy databases
      • WWW
  • 41. Steps of a KDD Process
    • Learning the application domain:
      • relevant prior knowledge and goals of application
    • Creating a target data set: data selection
    • Data cleaning and preprocessing: (may take 60% of effort!)
    • Data reduction and transformation:
      • Find useful features, dimensionality/variable reduction, invariant representation.
    • Choosing functions of data mining
      • summarization, classification, regression, association, clustering.
    • Choosing the mining algorithm(s)
    • Data mining: search for patterns of interest
    • Pattern evaluation and knowledge presentation
      • visualization, transformation, removing redundant patterns, etc.
    • Use of discovered knowledge
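The core of these steps can be sketched as a chain of small functions, one per stage. Everything here is a simplified assumption for illustration: the `raw` records, the function names, and the use of plain frequency counting as the mining step are hypothetical and stand in for whatever algorithm is actually chosen.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw = [
    {"customer": "ann", "region": "east", "product": "milk"},
    {"customer": "bob", "region": "east", "product": None},
    {"customer": "cat", "region": "west", "product": "milk"},
    {"customer": "dan", "region": "east", "product": "milk"},
]

def select(records, fields):
    """Create the target data set by keeping only task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def mine(records):
    """Data mining: count how often each (region, product) pair occurs."""
    return Counter((r["region"], r["product"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

patterns = evaluate(mine(clean(select(raw, ["region", "product"]))), min_count=2)
print(patterns)
```

Keeping each stage as its own function makes it easy to swap the mining step for another technique without touching selection or cleaning.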
  • 42. Strength and Weakness
    • Strength
      • Algorithm breadth
      • Graphical output
      • Available for PC and mainframe environment
    • Weakness
      • No automation
      • Data has to reside in IBM’s database system