data_analysis_and_mining.ppt Presentation Transcript

  • 1. Chapter 18: Data Analysis and Mining Kat Powell
  • 2. Chapter 18: Data Analysis and Mining
    • Decision Support Systems
    • Data Analysis and OLAP
    • Data Warehousing
    • Data Mining
  • 3. Decision Support Systems
    • Make business decisions, often based on data collected by on-line transaction-processing systems
      • What items to stock?
      • To whom to send advertisements?
    • Examples of data used for making decisions
      • Retail sales transaction details
      • Customer profiles (income, age, gender, etc.)
  • 4. Decision-Support Systems Overview
    • Data analysis tasks are simplified by specialized tools and SQL extensions
    • Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
    • A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
      • Important for large businesses that generate data from multiple divisions, possibly at multiple sites
  • 5. Data Analysis and OLAP
    • OnLine Analytical Processing (OLAP)
      • Online interactive tools for analysis and summary of data that present it to humans in understandable format
    • Multidimensional data – data modeled as dimension attributes and measure attributes in table summary format
      • Measure attributes measure some value and can be aggregated upon
      • Dimension attributes define the dimensions on which measure attributes (or aggregates thereof) are viewed
  • 6. Cross Tabulation (aka cross-tab) of sales by item-name and color
          • item_name, color, and size are the dimension attributes of the sales relation
          • the number values in each cell represent measure attributes
          • Extra column and row store the cell totals for each column and row
          • Summary of sales
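
A minimal sketch of how such a cross-tab could be computed, assuming a small hypothetical sales relation (the rows and the 'total' label below are illustrative, not taken from the slides). Each cell sums the measure attribute for one (item_name, color) pair, and the extra row and column hold the totals:

```python
from collections import defaultdict

# Hypothetical sales rows: (item_name, color, size, number)
sales = [
    ("skirt", "dark", "small", 2),
    ("skirt", "pastel", "medium", 5),
    ("dress", "dark", "small", 4),
    ("dress", "white", "large", 3),
    ("skirt", "dark", "large", 6),
]

def cross_tab(rows):
    """Sum the measure by (item_name, color), plus 'total' row and column."""
    table = defaultdict(int)
    for item_name, color, _size, number in rows:
        table[(item_name, color)] += number     # ordinary cell
        table[(item_name, "total")] += number   # row total
        table[("total", color)] += number       # column total
        table[("total", "total")] += number     # grand total
    return dict(table)

tab = cross_tab(sales)
print(tab[("skirt", "dark")])   # 8
print(tab[("total", "total")])  # 20
```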
  • 7. Relational Representation of Cross-tabs
    • Cross-tabs also can be represented as relations
      • The value 'all' is used to represent an aggregate, that is, the set of all values for an attribute
      • The SQL:1999 standard actually uses 'null' values in place of 'all' despite confusion with regular 'null' values
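
As a rough illustration of the relational form, the sketch below turns hypothetical (item_name, color, number) rows into tuples in which 'all' stands for "aggregated over this attribute". The data and the function name are assumptions made for the example:

```python
from collections import defaultdict
from itertools import product

# Hypothetical pre-summed rows: (item_name, color, number)
sales = [("skirt", "dark", 8), ("skirt", "pastel", 5), ("dress", "dark", 4)]

def cube_relation(rows):
    """Return (item_name, color, number) tuples, using 'all' for aggregates."""
    totals = defaultdict(int)
    for item_name, color, number in rows:
        # Each row contributes to its own cell and to every aggregate above it.
        for key in product((item_name, "all"), (color, "all")):
            totals[key] += number
    return sorted((item, color, n) for (item, color), n in totals.items())

relation = cube_relation(sales)
for row in relation:
    print(row)
```

A genuine attribute value spelled 'all' would collide with the marker here, which mirrors the confusion the slide mentions for 'null'.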
  • 8. Data Cube (Sales relation)
    • Multidimensional generalization of a cross-tab
    • Can have n dimensions; we see 3 here
    • Cross-tabs can be used as views on a data cube
    • All cells contain values, even the invisible (inner) cells
  • 9. Online Analytical Processing
      • Pivoting: changing the dimensions used in a cross-tab
      • EX: An analyst may select a cross-tab on item_name and size OR on color and size
      • Slicing: creating a cross-tab for fixed values only
        • Sometimes called dicing, particularly when values for multiple dimensions are fixed.
      • EX: An analyst may wish to see a cross-tab on item_name and color for a fixed value of size (for instance, large), instead of the sum across all sizes.
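
Pivoting and slicing can be sketched with one small helper, assuming hypothetical sales rows and a made-up `fixed` parameter for the sliced dimensions. Choosing different dimension pairs is a pivot; passing a fixed value is a slice:

```python
from collections import defaultdict

# Hypothetical sales rows: (item_name, color, size, number)
sales = [
    ("skirt", "dark", "small", 2),
    ("skirt", "pastel", "medium", 5),
    ("dress", "dark", "small", 4),
    ("dress", "white", "large", 3),
    ("skirt", "dark", "large", 6),
]
DIMS = {"item_name": 0, "color": 1, "size": 2}  # column position of each dimension

def cross_tab(rows, dim_a, dim_b, fixed=None):
    """Cross-tab on two dimensions, optionally restricted to fixed values."""
    fixed = fixed or {}
    a, b = DIMS[dim_a], DIMS[dim_b]
    table = defaultdict(int)
    for row in rows:
        # Slicing: skip rows that do not match every fixed dimension value.
        if any(row[DIMS[d]] != v for d, v in fixed.items()):
            continue
        table[(row[a], row[b])] += row[3]
    return dict(table)

by_item_color = cross_tab(sales, "item_name", "color")
by_color_size = cross_tab(sales, "color", "size")                  # pivot
large_only = cross_tab(sales, "item_name", "color", {"size": "large"})  # slice
print(large_only)
```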
  • 10. Online Analytical Processing
      • OLAP systems permit users to view data at any desired level of granularity
      • Rollup: moving from finer-granularity data to a coarser granularity
      • EX: Starting from the data cube on the sales table, we get the cross-tab by rolling up on the attribute 'size'
      • Drill down: The opposite operation - that of moving from coarser-granularity data to finer-granularity data
      • EX: Finer-granularity data is usually generated from the original data, since it cannot be recovered from a coarser summary
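
A small sketch of rollup, assuming hypothetical cube cells keyed by (item_name, color, size). Rolling up drops one dimension and sums over it; the reverse direction needs the finer data to be kept, because the sums alone do not say how they were split:

```python
from collections import defaultdict

# Hypothetical fine-granularity cells: (item_name, color, size) -> number
fine = {
    ("skirt", "dark", "small"): 2,
    ("skirt", "dark", "large"): 6,
    ("dress", "dark", "small"): 4,
}

def roll_up(cells, drop_index):
    """Aggregate away the dimension at position drop_index."""
    coarse = defaultdict(int)
    for key, value in cells.items():
        coarse[key[:drop_index] + key[drop_index + 1:]] += value
    return dict(coarse)

coarse = roll_up(fine, 2)  # roll up on size
print(coarse)

# Drill down: look the detail up again in the finer data that was retained.
detail = {k: v for k, v in fine.items() if k[:2] == ("skirt", "dark")}
print(detail)
```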
  • 11. Data Warehousing
      • Data sources often store only current data, not historical data
      • Corporate decision making requires a unified view of all organizational data, including historical data
      • A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
        • Greatly simplifies querying, permits study of historical trends
        • Shifts decision support query load away from transaction processing systems
  • 12. Data Warehousing
  • 13. Warehouse Design Issues
      • When and how to gather data
        • Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g. at night)
        • Destination driven architecture: warehouse periodically requests new information from data sources
        • Keeping warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive
          • Usually OK to have slightly out-of-date data at warehouse
          • Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
  • 14. More Warehouse Design Issues
      • What schema to use
        • Schema integration
      • Data cleansing
        • E.g. correct mistakes in addresses (misspellings, zip code errors, etc.)
        • Merge address lists from different sources and purge duplicates
      • How to propagate updates
        • Warehouse schema may be a (materialized) view of schema from data sources
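
The merge-and-purge step of data cleansing can be sketched as follows, assuming hypothetical (name, street, zip) records and a deliberately simple normalization rule. Real cleansing tools typically rely on fuzzier matching than this:

```python
def normalize(record):
    """Build a comparison key: lowercase, collapse spaces, 5-digit zip."""
    name, street, zip_code = record
    return (
        " ".join(name.lower().split()),
        " ".join(street.lower().replace(".", "").split()),
        zip_code.strip()[:5],
    )

def merge_purge(*address_lists):
    """Merge lists, keeping the first record seen for each normalized key."""
    seen = {}
    for records in address_lists:
        for record in records:
            seen.setdefault(normalize(record), record)
    return list(seen.values())

# Hypothetical address lists from two sources
list_a = [("Ann Lee", "12 Main St.", "02139")]
list_b = [("ann  lee", "12 main st", "02139-4307"), ("Bob Ray", "9 Oak Ave", "10001")]

merged = merge_purge(list_a, list_b)
print(merged)
```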
  • 15. Data Mining
      • The process of semi-automatically analyzing large databases to find useful patterns
      • Prediction based on past history
        • Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
        • Predict if a pattern of phone calling card usage is likely to be fraudulent
  • 16. Data Mining (Cont.)
      • Some other examples of prediction mechanisms:
        • Classification
          • Given a new item whose class is unknown, predict to which class it belongs
        • Regression formulae
          • Given a set of mappings for an unknown function, predict the function result for a new parameter value
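
To make the two prediction mechanisms concrete, here is a minimal sketch using two simple stand-in techniques that the slides do not prescribe: a nearest-neighbour classifier and a least-squares line fit. All training data below is invented for illustration:

```python
def classify(training, item):
    """1-nearest-neighbour: return the class of the closest known item."""
    def squared_distance(example):
        features, _label = example
        return sum((a - b) ** 2 for a, b in zip(features, item))
    return min(training, key=squared_distance)[1]

def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) mappings."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical applicants: (income in thousands, age) -> credit class
training = [((20, 25), "bad"), ((80, 45), "good"), ((60, 35), "good"), ((25, 50), "bad")]
print(classify(training, (70, 40)))  # good

# Hypothetical mappings of an unknown function
slope, intercept = fit_line([(1, 3), (2, 5), (3, 7)])
print(slope * 4 + intercept)  # 9.0
```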
  • 17. Data Mining (Cont.)
      • Descriptive Patterns
        • Associations
          • Find books that are often bought by “similar” customers. If a new such customer buys one such book, suggest the others too.
        • Associations may be used as a first step in detecting causation
          • E.g. association between exposure to chemical X and cancer
        • Clusters
          • E.g. typhoid cases were clustered in an area surrounding a contaminated well
          • Cluster detection remains important in detecting epidemics
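
One common way to quantify an association such as "customers who buy book A also buy book B" is with support and confidence. These measures are not named on the slides, so treat the sketch below, with its invented purchase baskets, as an assumption about how the idea might be made concrete:

```python
def count_containing(baskets, items):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if items <= basket)

def support(baskets, items):
    """Fraction of all baskets that contain the item set."""
    return count_containing(baskets, items) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also have the consequent."""
    both = count_containing(baskets, antecedent | consequent)
    return both / count_containing(baskets, antecedent)

# Hypothetical purchases, one set of book titles per customer
baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]

print(support(baskets, {"A", "B"}))       # 0.6
print(confidence(baskets, {"A"}, {"B"}))  # 0.75
```

A high confidence for A → B would be a reason to suggest book B to a new customer who buys A.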
  • 18. Other Types of Mining
      • Text mining : application of data mining to textual documents
        • cluster Web pages to find related pages
        • cluster pages a user has visited to organize their visit history
        • classify Web pages automatically into a Web directory
  • 19. Other Types of Mining
      • Data visualization systems help users examine large volumes of data and detect patterns visually
        • Can visually encode large amounts of information on a single screen
        • Humans are very good at detecting visual patterns
  • 20. The END