Data Mining Lecture 1: Introduction to Data Mining
Data Mining Lecture 1: Introduction to Data Mining Presentation Transcript

  • 1. Data Mining
      • Lecture 1:
      • Introduction to Data Mining
      • Manuel Penaloza, PhD
  • 2. Introduction to Data Mining
    • Society produces huge amounts of data daily
      • Retail Store
        • POS data on customer purchases
      • Banks
        • Collection of customer service calls
      • Telecommunications
        • Phone call records (mobile and landline calls)
      • Medicine
        • Genomic data collected on the structure of genes
      • Government
        • Law enforcement data, income tax data
      • Others: (Transactional) data from Sports, Schools, Research, Search engines, etc.
  • 3. What is Data Mining (DM)?
    • It is the process of discovering hidden relationships and patterns in large data sets
      • It can also predict the outcome of a future observation
    • Data mining is an interdisciplinary field
      • It is an extension to statistical analysis
      • It uses techniques from:
        • Statistics
        • Machine learning
        • Pattern recognition
        • Database technology
        • Visualization
        • High-performance computing
  • 4. Questions answered by DM
    • Extract useful information from a dataset to answer questions such as:
      • Which CC customers are most profitable?
      • Which loan applicants are high-risk?
      • Which customers will respond to a planned promotion?
      • How do we detect phone card fraud?
      • How do customer profiles change over time?
      • Which customers prefer product A over product B?
      • What is the revenue prediction for next year?
      • Which students are more likely to transfer than others?
      • Which taxpayers may be cheating the system?
      • Who is most likely to violate a probation sentence?
      • What is the predicted outcome for some treatment?
  • 5. Data sources
    • Relational Databases
      • Transactional data with many tables
    • Data warehouses
      • Historical data, aggregated and updated periodically
    • Files
      • In special format (e.g., CSV) or proprietary binary
    • Internet or electronic mail
      • HTML, XML, web search results, e-mails
    • Scientific, research
      • Seismology, remote sensing, etc.
  • 6. Example: Health System
    • Characteristics of the Health System:
      • Personal medical records (GP, specialists, etc.)
      • Billing records
      • Hospital data (surgery, admission, etc.)
    • Questions:
      • Are MDs following the procedures?
      • Which patients may have adverse drug reactions?
      • Are people committing fraud?
      • Which patients are most likely to get cancer?
  • 7. Case study: E-commerce
    • A person buys a book from Amazon.com
    • Objective: Recommend other books this person is likely to buy
    • Amazon may do clustering or sequential pattern analysis based on books bought by other people
    • Data analyzed:
      • “Customers who bought ‘Data Mining: Practical Machine Learning Tools and Techniques’ also bought ‘Introduction to Data Mining’”
    • Recommendations have been successful for Amazon
      • Increasing buyers’ satisfaction and purchases
  • 8. What motivated data mining?
    • Growth in data collection
    • Presence of data warehouses with reliable data
    • Competitive pressure to increase sales
    • The development of commercial off-the-shelf (COTS) data mining software
      • Examples: XLMiner, Insightful Miner, SAS, SPSS
    • Growth of computing power and storage capacity
    • High dimensionality of the data
    • Heterogeneous and complex data
    • Limitations of humans
  • 9. Insightful Miner™ 7: GUI
    * Figures taken from the Insightful Miner 7 Guide
  • 10. Creating Models
    • Create a network of pipelined components
      • By dragging and dropping components
  • 11. Choosing a data mining system
    • Data mining systems differ in functionality and methodology
    • Selection determined by:
      • Type of operating system used in your organization
      • The data sources handled by the tool:
        • ASCII text files, relational databases, XML data
      • The data mining functions and methods offered
      • Scalability of the system
        • Row and column scalability
      • Visualization tools available
      • Graphical user interface that guides the execution of the methods
      • Integration with other information systems
      • Cost and performance
  • 12. Data Mining in Databases
    • Current applications include data mining modules
    • Example:
      • Database management systems such as Oracle and MS SQL Server
      • CRM (Customer Relationship Management)
    • Advantages for Database systems:
      • One-stop shopping
      • Minimize data movement and conversion
    • Disadvantages for Database systems:
      • Limited to DM methods available in the system
      • Data extractions and transformations may not be powerful enough
  • 13. Standard data mining life cycle
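    • CRISP-DM (Cross-Industry Standard Process for Data Mining)
    • It is an iterative process with phase dependencies
    • It consists of six (6) phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment
    • See www.crisp-dm.org for more information
  • 14. CRISP-DM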
    • Cross-industry standard developed in 1996
      • Analysts from SPSS/ISL, NCR, Daimler-Benz, OHRA
    • Funding from European Commission
    • Important Characteristics:
      • Non-proprietary
      • Application/Industry neutral
      • Tool neutral
      • General problem-solving process
      • Process with six phases but missing:
        • Saving results and updating the model
  • 15. CRISP-DM Phases (1)
    • Business Understanding
      • Understand project objectives and requirements
      • Formulation of a data mining problem definition
    • Data Understanding
      • Data collection
      • Evaluate the quality of the data
      • Perform exploratory data analysis
    • Data Preparation
      • Clean, prepare, integrate, and transform the data
      • Select appropriate attributes and variables
  • 16. CRISP-DM Phases (2)
    • Modeling
      • Select and apply appropriate modeling techniques
      • Calibrate model parameters to optimize results
      • If necessary, return to data preparation phase to satisfy model's data format
    • Evaluation
      • Determine if model satisfies objectives set in phase 1
      • Identify business issues that have not been addressed
    • Deployment
      • Organize and present the model to the “user”
      • Put model into practice
      • Set up for continuous mining of the data
  • 17. Data mining tasks (1)
    • Classification
      • Predict the categorical value of a target (dependent) variable based on the values of other attributes
      • Target variable is partitioned into classes
      • It predicts class membership of a new observation
      • Example: Which drug should be prescribed for older patients with low sodium/potassium ratios?
    • Estimation
      • Similar to classification except target variable is numeric
      • That is, predicting a numeric value
      • Example: Estimate the blood pressure of a person based on his/her age, gender, body mass index, etc.
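The estimation task above can be sketched with a one-variable least-squares line. This is only a minimal illustration, not a method prescribed by the lecture: the function name `fit_line` and the age/pressure numbers are hypothetical toy values chosen so the arithmetic is easy to follow, and a real estimate would use more attributes (gender, body mass index, etc.).

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data (not real measurements): age in years, blood pressure.
ages = [30, 40, 50, 60]
pressures = [115, 120, 125, 130]

slope, intercept = fit_line(ages, pressures)

# Estimate the numeric target for a new observation (age 45).
estimate = slope * 45 + intercept
print(f"Estimated blood pressure at age 45: {estimate:.1f}")
```

The target here is numeric, which is what separates estimation from classification.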
  • 18. Data mining tasks (2)
    • Prediction
      • Similar to estimation except that results lie in the future
      • Example: Predict the price of a stock 3 months into the future
    • Clustering
      • Grouping similar records together
      • Example: Find patients with similar profiles
    • Associations
      • Uncover rules that indicate the association between two or more attributes
      • Example: Find out which items are purchased together
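As a rough sketch of the association task, the snippet below scores a candidate rule with two standard measures, support and confidence. The slides do not name these measures; the helper names and the shopping baskets are hypothetical and exist only for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the share that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets, one set of items per purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule with high support and high confidence suggests the items are often purchased together, though it says nothing about why.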
  • 19. Task: Classification
    • Build a model that learns to predict the class from pre-labeled instances or observations
      • Many approaches: Regression, Decision Trees, Neural Networks
    Given a set of points from known classes, what is the class of a new point?
    * Diagram taken from www.kdnuggets.com/data_mining_course/index.html
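The question on this slide can be illustrated with a k-nearest-neighbor classifier. Note that k-NN is swapped in here for brevity: it is not one of the approaches listed on this slide, although it appears later in the deck's list of methods. The function name `knn_classify`, the value k=3, and the labeled points are all hypothetical.

```python
import math
from collections import Counter


def knn_classify(labeled_points, new_point, k=3):
    """Predict the class of `new_point` by majority vote among its k nearest labeled points."""
    nearest = sorted(
        labeled_points,
        key=lambda item: math.dist(item[0], new_point),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical pre-labeled observations: (coordinates, class label).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_classify(training, (2.0, 2.0)))
print(knn_classify(training, (6.0, 7.0)))
```

The model "learns" only in the sense of storing the labeled instances; the prediction for a new point comes from the classes of its closest neighbors.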
  • 20. Task: Clustering
    • Find grouping of instances given un-labeled data
    * Diagram taken from www.kdnuggets.com/data_mining_course/index.html
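One common way to group unlabeled instances is k-means, sketched below under simplifying assumptions. The slide does not name a specific algorithm; the function name `k_means`, the naive choice of the first k points as starting centroids, the fixed iteration count, and the sample points are all illustrative assumptions.

```python
import math


def k_means(points, k, iterations=10):
    """Very small k-means: returns (centroids, clusters) after a fixed number of passes."""
    centroids = list(points[:k])  # naive initialization: first k points (an assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Hypothetical unlabeled data with two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]

centroids, clusters = k_means(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No class labels are used anywhere; the groups emerge only from the distances between records.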
  • 21. DM looks easy
    • [Diagram: Data → Data Mining Method (Regression, Decision Tree, Neural Network, …, Association Rules) → Model]
    • But it is not easy
    • The real world is complicated
  • 22. Methods and Techniques
    • Cluster Analysis (tasks: clustering)
    • Association Rules (tasks: association)
    • Decision trees (tasks: prediction, classification)
    • Neural networks (tasks: prediction, classification)
    • K-nearest neighbor (tasks: prediction, classification, clustering)
    • Regression analysis (tasks: estimation, prediction)
    • Confidence interval estimation (task: estimation)
  • 23. Fallacies of Data Mining (1)
    • Fallacy 1: There are data mining tools that automatically find the answers to our problem
      • Reality: There are no automatic tools that will solve your problems “while you wait”
    • Fallacy 2: The DM process requires little human intervention
      • Reality: The DM process requires human intervention in all its phases, including updating and evaluating the model by human experts
    • Fallacy 3: Data mining has a quick ROI
      • Reality: It depends on the startup costs, personnel costs, data source costs, and so on
  • 24. Fallacies of Data Mining (2)
    • Fallacy 4: DM tools are easy to use
      • Reality: Analysts must be familiar with the model
    • Fallacy 5: DM will identify the causes of the business problem
      • Reality: DM tools only identify patterns in your data; analysts must identify the cause
    • Fallacy 6: Data mining will clean up a data repository automatically
      • Reality: A sequence of transformation tasks must be defined by an analyst during the early DM phases
      • * Fallacies described by Jen Que Louie, President of Nautilus Systems, Inc.
  • 25. In summary,
    • Problems suitable for Data Mining:
      • Require discovering knowledge to make the right decisions
      • Current solutions are not adequate
      • Have an expected high payoff for the right decisions
      • Have accessible, sufficient, and relevant data
      • Have a changing environment
    • IMPORTANT:
      • ENSURE privacy if personal data is used!
      • Not every data mining application is successful!
  • 26. Main References
    • Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann Publishers
    • Daniel Larose. Discovering Knowledge in Data: An Introduction to Data Mining, Wiley
    • Pang-Ning Tan et al. Introduction to Data Mining, Addison Wesley
    • Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
    • Online data mining course offered by KDnuggets™ at www.kdnuggets.com/data_mining_course/index.html
    • Engineering Statistics Handbook available online at http://www.itl.nist.gov/div898/handbook/eda/section1/eda126.htm
  • 27. Exercise #1
    • CRISP-DM is not the only DM process. Do a quick search on the Internet for another process, and describe its similarities to and differences from CRISP-DM.
    • Determine how data mining could help a web search engine company like Google in its operations.
      • Identify one or more objectives.
      • Which data mining task(s) could help this company?