DatawarehouseDataMiningNotes

  1. 2. <ul><li>Database and data warehousing </li></ul><ul><ul><li>Chapter 4 pages 167-177 </li></ul></ul><ul><li>Data mining and OLAP </li></ul><ul><ul><li>Chapter 8 pages 321-325 </li></ul></ul>
  2. 3. <ul><li>In managing information, the physical view deals with the structure of information as it resides on various storage media. </li></ul><ul><li>The logical view deals with how knowledge workers view their information needs, and includes such terms as: </li></ul><ul><ul><li>CHARACTER - the smallest unit of information. </li></ul></ul><ul><ul><li>FIELD - group of related characters. </li></ul></ul><ul><ul><li>RECORD - group of related fields. </li></ul></ul><ul><ul><li>FILE - group of related records. </li></ul></ul><ul><ul><li>DATABASE - group of logically associated files. </li></ul></ul><ul><ul><li>DATA WAREHOUSE - information from many databases. </li></ul></ul>
  3. 4. <ul><li>DATA DICTIONARY - contains the logical structure of information in a database. </li></ul><ul><ul><li>Definitions of all fields, records, and tables </li></ul></ul><ul><ul><li>Relationships between tables </li></ul></ul><ul><ul><li>Who is responsible for maintaining data in the database </li></ul></ul><ul><ul><li>Descriptions of who is authorized to access different parts of the database </li></ul></ul><ul><li>Data dictionary contains meta data (data about the data) </li></ul>
  4. 5. Sample Data Dictionary Report
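Most relational systems expose part of their data dictionary through catalog queries. The following is a minimal sketch using Python's built-in `sqlite3` module; the `customer` and `purchase` tables are hypothetical and exist only to show field definitions and table relationships being read back as metadata. Responsibility and access-control entries are not modeled here.

```python
import sqlite3

# Hypothetical two-table schema, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE purchase (
        purchase_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        amount      REAL
    );
""")

def describe(table):
    """Return (field name, declared type, NOT NULL flag) for each field."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [(name, col_type, bool(notnull))
            for _, name, col_type, notnull, _, _ in rows]

def relationships(table):
    """Return (local field, referenced table, referenced field) per foreign key."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # Each row is (id, seq, table, from, to, on_update, on_delete, match).
    return [(row[3], row[2], row[4]) for row in rows]

print(describe("purchase"))
print(relationships("purchase"))
```

The output is data about the data: nothing printed here is a customer or a purchase, only the structure that describes them.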
  5. 6. <ul><ul><li>Definition - a database that stores current and historical data and is designed to support the business analysis and decision-making tasks of managers; typically a relational database model is used. The data warehouse uses special software (tools) to help managers extract information. </li></ul></ul><ul><ul><li>Benefits </li></ul></ul><ul><ul><ul><li>improved access </li></ul></ul></ul><ul><ul><ul><li>improved information </li></ul></ul></ul><ul><ul><ul><li>isolation from operational systems </li></ul></ul></ul><ul><ul><ul><li>tools permit advanced data analysis </li></ul></ul></ul><ul><ul><li>Users and data marts </li></ul></ul>
  6. 7. <ul><li>Extraction phase – create files on the computer that will store the data warehouse and move transaction data to this machine; data may come from many sources or parts of the organization </li></ul><ul><li>Transformation phase – cleanse and standardize the data. Why is this necessary? </li></ul><ul><li>Load phase – transfer the data from the transformation phase into the data warehouse </li></ul><ul><li>The ETL process becomes automated to make regular transfers of transaction data into the data warehouse </li></ul>
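The three phases above can be sketched in a few lines. This is an illustrative sketch only: the two source extracts, the `STATE_CODES` lookup, and the `sales_fact` table are all hypothetical, and an in-memory SQLite database stands in for the warehouse. It also hints at why transformation is necessary: without standardized names and state codes, the same customer would be counted as several different ones.

```python
import sqlite3

# Hypothetical extracts from two operational systems with inconsistent formats.
store_sales = [{"cust": " Alice ", "state": "ga", "amount": "19.99"},
               {"cust": "BOB", "state": "GA", "amount": "5.00"}]
web_sales = [{"customer_name": "alice", "st": "Georgia", "total": 42.5}]

# Hypothetical lookup used to standardize state values.
STATE_CODES = {"georgia": "GA", "ga": "GA"}

def extract():
    """Extraction: gather raw rows from every source, tagged with their origin."""
    for row in store_sales:
        yield "store", row["cust"], row["state"], row["amount"]
    for row in web_sales:
        yield "web", row["customer_name"], row["st"], row["total"]

def transform(rows):
    """Transformation: cleanse and standardize names, state codes, and amounts."""
    for source, name, state, amount in rows:
        yield (source, name.strip().title(),
               STATE_CODES[state.strip().lower()], float(amount))

def load(conn, rows):
    """Load: append the standardized rows to the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(source TEXT, customer TEXT, state TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))

totals = warehouse.execute(
    "SELECT customer, ROUND(SUM(amount), 2) FROM sales_fact "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

In an automated setting the same three functions would run on a schedule, each run appending the latest transaction data.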
  7. 8. <ul><li>Operational Data </li></ul><ul><ul><li>Data is on many systems </li></ul></ul><ul><ul><li>Current operational data </li></ul></ul><ul><ul><li>Inconsistent data definitions </li></ul></ul><ul><ul><li>Functionally organized data </li></ul></ul><ul><ul><li>Data are constantly changing </li></ul></ul><ul><ul><li>Support OLTP </li></ul></ul><ul><li>Warehouse Data </li></ul><ul><ul><li>Integrated in one enterprise-wide system </li></ul></ul><ul><ul><li>Recent and historical data </li></ul></ul><ul><ul><li>Consistent data definitions </li></ul></ul><ul><ul><li>Data are organized around business entities </li></ul></ul><ul><ul><li>Data are stabilized </li></ul></ul><ul><ul><li>Support OLAP </li></ul></ul>
  8. 9. <ul><li>Data mining (knowledge discovery in databases): </li></ul><ul><ul><li>Extraction of interesting (non-trivial, previously unknown and potentially useful) information or patterns from data in large databases </li></ul></ul><ul><li>Similar terms </li></ul><ul><ul><li>Exploratory data analysis </li></ul></ul><ul><ul><li>Data driven discovery </li></ul></ul><ul><ul><li>Inductive learning </li></ul></ul><ul><ul><li>Knowledge extraction </li></ul></ul>
  9. 10. <ul><li>A computer-based data analysis process </li></ul><ul><ul><li>Utilizes historical organizational data, typically in a data warehouse </li></ul></ul><ul><ul><li>Uses a variety of data analysis, modeling, and visualization techniques </li></ul></ul><ul><ul><li>One use is to discover previously unknown patterns or potential relationships in the data: undirected </li></ul></ul><ul><ul><li>Also used to make predictions, verify assumptions, or otherwise provide useful information: directed </li></ul></ul><ul><ul><li>Allows businesses to make proactive, knowledge-driven decisions </li></ul></ul>
  10. 11. <ul><li>Prediction </li></ul><ul><ul><li>Use some variables to predict unknown or future values of other variables </li></ul></ul><ul><li>Description </li></ul><ul><ul><li>Find human-interpretable patterns that describe the data </li></ul></ul>
  11. 14. <ul><li>Given a collection of data </li></ul><ul><ul><li>Each record contains a set of attributes; one of the attributes is the class variable </li></ul></ul><ul><ul><li>Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it </li></ul></ul><ul><li>Find a model for the class attribute as a function of the values of the other attributes, based on the training set </li></ul><ul><ul><li>Previously unseen data (the test set) are used to determine the accuracy of the model </li></ul></ul>
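The train/test procedure can be illustrated with a very small classifier. The sketch below uses a 1-nearest-neighbour rule, chosen here only because it is short; the slides do not name a specific algorithm. The records, their two attributes (age and monthly calls), and the `buy`/`no` labels are hypothetical.

```python
import math

# Hypothetical records: (age, monthly_calls) attributes plus a class label.
records = [
    ((25, 40), "buy"), ((30, 35), "buy"), ((28, 45), "buy"), ((35, 50), "buy"),
    ((55, 5), "no"), ((60, 10), "no"), ((52, 8), "no"), ((65, 3), "no"),
]

# Hold out every fourth record for testing; the rest form the training set.
test_set = records[::4]
training_set = [rec for i, rec in enumerate(records) if i % 4 != 0]

def classify(attributes):
    """Predict the class of the nearest training record (1-nearest-neighbour)."""
    nearest = min(training_set, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]

# Accuracy is measured only on records the model has not seen.
correct = sum(classify(attrs) == label for attrs, label in test_set)
accuracy = correct / len(test_set)
print(f"accuracy on the test set: {accuracy:.2f}")
```

The key point is the separation: `classify` consults only `training_set`, so the accuracy figure reflects how the model behaves on previously unseen data.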
  12. 15. <ul><ul><li>Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product </li></ul></ul><ul><ul><li>Approach: </li></ul></ul><ul><ul><ul><li>Use the data for a similar product introduced before </li></ul></ul></ul><ul><ul><ul><li>We know which customers decided to buy and which decided otherwise; this {buy, don’t buy} decision is the class variable </li></ul></ul></ul><ul><ul><ul><li>Collect various demographic, lifestyle, and company-interaction related information about all such customers </li></ul></ul></ul><ul><ul><ul><li>Use this information as input attributes to create a classifier model </li></ul></ul></ul>
  13. 16. <ul><ul><li>Goal: Predict fraudulent cases in credit card transactions </li></ul></ul><ul><ul><li>Approach: </li></ul></ul><ul><ul><ul><li>Use past credit card transactions and information about the account holder as attributes </li></ul></ul></ul><ul><ul><ul><ul><li>When the customer buys, what the customer buys, how often the customer pays on time, etc. </li></ul></ul></ul></ul><ul><ul><ul><li>Label past transactions as fraudulent or legitimate; this is the class variable </li></ul></ul></ul><ul><ul><ul><li>Create a model for the class of the transactions </li></ul></ul></ul>
  14. 17. <ul><ul><li>Goal: To predict whether a customer is likely to be lost to a competitor </li></ul></ul><ul><ul><li>Approach: </li></ul></ul><ul><ul><ul><li>Use detailed records of transactions with each past and present customer to find attributes </li></ul></ul></ul><ul><ul><ul><ul><li>How often the customer calls, where the customer calls, what time of day the customer calls most, financial status, marital status, etc. </li></ul></ul></ul></ul><ul><ul><ul><li>Label the customers as still with the company or left the company (churned) </li></ul></ul></ul><ul><ul><ul><li>Create a model for churn </li></ul></ul></ul>
  15. 18. <ul><li>Clustering concerns segmenting a diverse population into several homogeneous subgroups or clusters. Clustering differs from classification in that there are no predefined classifications. </li></ul><ul><ul><li>Data points in one cluster are more similar to one another </li></ul></ul><ul><ul><li>Data points in separate clusters are less similar to one another </li></ul></ul><ul><li>Similarity Measures: </li></ul><ul><ul><li>Euclidean distance if attributes are continuous </li></ul></ul><ul><ul><li>Other problem-specific measures </li></ul></ul>
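One common way to form such clusters with Euclidean distance is k-means. The slide does not name an algorithm, so k-means is an assumption here, and the points and starting centroids are hypothetical. A minimal sketch:

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Euclidean distance decides which centroid is nearest.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return clusters, centroids

# Hypothetical two-attribute data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

No class labels are supplied anywhere, which is the difference from classification: the groups emerge from the distances alone.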
  16. 19. <ul><li>Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix </li></ul><ul><li>Approach: </li></ul><ul><ul><li>Collect different attributes of customers based on their geographical and lifestyle related information </li></ul></ul><ul><ul><li>Find clusters of similar customers </li></ul></ul><ul><ul><li>Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters </li></ul></ul>
  17. 20. <ul><li>Goal: To find groups of documents that are similar to each other based on the important terms appearing in them </li></ul><ul><li>Approach: To identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster </li></ul><ul><li>Gain: Retrieval can utilize the clusters to relate a new document or search term to clustered documents </li></ul>
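A similarity measure based on term frequencies can be sketched as follows. Cosine similarity is used as one plausible choice; the slide does not specify the measure, and the three documents are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each term occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical documents.
docs = {
    "d1": "data warehouse stores historical data",
    "d2": "warehouse data supports data analysis",
    "d3": "customers buy diapers and beer",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
print(round(sim_12, 3), sim_13)
```

A clustering routine would then group `d1` with `d2`, since they share frequent terms, and keep `d3` apart.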
  18. 21. <ul><li>Given a set of records, each of which contains some number of items from a specific collection, produce dependency rules which will predict occurrence of an item based on occurrences of other items: </li></ul><ul><li>Rules discovered: </li></ul><ul><ul><li>{Milk} --> {Coke} </li></ul></ul><ul><ul><li>{Diaper, Milk} --> {Beer} </li></ul></ul>
  19. 22. <ul><ul><li>Goal: To identify items that are bought together by sufficiently many customers </li></ul></ul><ul><ul><li>Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items </li></ul></ul><ul><ul><li>A classic case: If a customer buys diapers on Friday evening, then he is very likely to buy beer </li></ul></ul>
  20. 23. <ul><ul><ul><li>So, don’t be surprised if you find six-packs stacked next to diapers! </li></ul></ul></ul>
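Rules such as {Diaper, Milk} --> {Beer} are usually judged by support (how many transactions contain all the items) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both; the five baskets are hypothetical stand-ins for point-of-sale data, since the transaction table from the slide is not reproduced in this transcript.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"Milk", "Coke"}), confidence({"Milk"}, {"Coke"}))
print(support({"Diaper", "Milk", "Beer"}), confidence({"Diaper", "Milk"}, {"Beer"}))
```

"Bought together by sufficiently many customers" corresponds to a minimum support threshold; a rule is kept only if its confidence is also high enough to act on.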
  21. 24. <ul><li>Privacy </li></ul><ul><li>Profiling </li></ul><ul><li>Unauthorized Use </li></ul><ul><li>Big Brother </li></ul>
  22. 25. <ul><li>Customer loyalty cards have multiple uses, but one use is to collect data for the data warehouse </li></ul><ul><li>Examples </li></ul><ul><ul><li>Grocery stores </li></ul></ul><ul><ul><li>Web sites </li></ul></ul><ul><ul><li>Harrah’s </li></ul></ul><ul><ul><li>Store related credit cards </li></ul></ul><ul><li>Assurance of a steady flow of data </li></ul>
  23. 26. <ul><li>Multidimensional data analysis (OLAP) enables users to view data using various dimensions, measures, and time frames </li></ul><ul><ul><li>dimensions (categories): products, business units, country, industry </li></ul></ul><ul><ul><li>measures: money, unit sales, head count, variances </li></ul></ul><ul><ul><li>time: daily, weekly, monthly, quarterly, yearly </li></ul></ul><ul><li>This type of analysis also provides the ability to view data in different ways (tables, charts, 3-D, geographically) </li></ul><ul><li>OLAP tools provide for this </li></ul><ul><li>Pivot tables in Excel or Access </li></ul>
  24. 27. Figure: Three-dimensional revenue model (dimensions: Property, City, Time)
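  25. 29. <table>
  <tr><th>Property Type</th><th>City</th><th>Time</th><th>Total Revenue</th></tr>
  <tr><td>Flat</td><td>Glasgow</td><td>Q1</td><td>15056</td></tr>
  <tr><td>House</td><td>Glasgow</td><td>Q1</td><td>14670</td></tr>
  <tr><td>Flat</td><td>Glasgow</td><td>Q2</td><td>14555</td></tr>
  <tr><td>House</td><td>Glasgow</td><td>Q2</td><td>15888</td></tr>
  <tr><td>Flat</td><td>Glasgow</td><td>Q3</td><td>14578</td></tr>
  <tr><td>House</td><td>Glasgow</td><td>Q3</td><td>16004</td></tr>
  <tr><td>Flat</td><td>Glasgow</td><td>Q4</td><td>15890</td></tr>
  <tr><td>House</td><td>Glasgow</td><td>Q4</td><td>15500</td></tr>
  <tr><td>Flat</td><td>London</td><td>Q1</td><td>19678</td></tr>
  <tr><td>House</td><td>London</td><td>Q1</td><td>23877</td></tr>
  <tr><td>Flat</td><td>London</td><td>Q2</td><td>19567</td></tr>
  <tr><td>House</td><td>London</td><td>Q2</td><td>28677</td></tr>
  </table>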
  26. 30. <ul><li>A common operation is to aggregate a measure over one or more dimensions </li></ul><ul><ul><li>Find the total revenue </li></ul></ul><ul><ul><li>Find the total revenue for each city </li></ul></ul><ul><ul><li>Find the top property-type for the 3rd quarter based on total revenue across all cities </li></ul></ul>
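The three aggregations listed above can be tried directly on the revenue figures from the preceding slide. This is a small sketch in plain Python; the helper `total_by` and the field names are illustrative choices, not part of any OLAP product.

```python
from collections import defaultdict

FIELDS = ("property_type", "city", "quarter", "revenue")

# Revenue rows transcribed from the property revenue table.
rows = [
    ("Flat", "Glasgow", "Q1", 15056), ("House", "Glasgow", "Q1", 14670),
    ("Flat", "Glasgow", "Q2", 14555), ("House", "Glasgow", "Q2", 15888),
    ("Flat", "Glasgow", "Q3", 14578), ("House", "Glasgow", "Q3", 16004),
    ("Flat", "Glasgow", "Q4", 15890), ("House", "Glasgow", "Q4", 15500),
    ("Flat", "London", "Q1", 19678), ("House", "London", "Q1", 23877),
    ("Flat", "London", "Q2", 19567), ("House", "London", "Q2", 28677),
]

def total_by(*dims):
    """Sum revenue over every dimension not named in dims."""
    idx = [FIELDS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

# Total revenue: aggregate over all three dimensions.
total_revenue = total_by()[()]

# Total revenue for each city: aggregate over property type and time.
revenue_by_city = total_by("city")

# Top property type for Q3 across all cities.
q3 = {key[0]: value
      for key, value in total_by("property_type", "quarter").items()
      if key[1] == "Q3"}
top_q3 = max(q3, key=q3.get)

print(total_revenue, revenue_by_city, top_q3)
```

Each query keeps some dimensions and sums the measure over the rest, which is the sense in which a measure is aggregated "over" a dimension.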
  27. 31. <ul><li>Roll-up: Aggregating data across different dimension levels </li></ul><ul><ul><li>Example: given revenue by city, we can roll-up to get total revenue by state </li></ul></ul><ul><li>Drill-down: The inverse of roll-up: disaggregating data </li></ul><ul><ul><li>Example: given total revenue by state, we can drill-down to get revenue by city </li></ul></ul>
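  28. 32. Roll-up and drill-down between revenue by quarter and revenue by month
  <table>
  <tr><th>Quarter</th><th>Glasgow</th><th>London</th><th>Aberdeen</th></tr>
  <tr><td>Q1</td><td>29726</td><td>43555</td><td>53210</td></tr>
  <tr><td>Q2</td><td>30443</td><td>48244</td><td>34567</td></tr>
  <tr><td>Q3</td><td>30582</td><td>56222</td><td>45677</td></tr>
  <tr><td>Q4</td><td>31390</td><td>45632</td><td>50056</td></tr>
  </table>
  <table>
  <tr><th>Month</th><th>Glasgow</th><th>London</th><th>Aberdeen</th></tr>
  <tr><td>Jan</td><td>9035</td><td>21005</td><td>5216</td></tr>
  <tr><td>Feb</td><td>10788</td><td>14799</td><td>14944</td></tr>
  <tr><td>Mar</td><td>9903</td><td>7751</td><td>33050</td></tr>
  <tr><td>Apr</td><td>11273</td><td>10573</td><td>21884</td></tr>
  <tr><td>May</td><td>9005</td><td>16896</td><td>8573</td></tr>
  <tr><td>Jun</td><td>10265</td><td>20775</td><td>4110</td></tr>
  </table>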
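Roll-up along the time dimension can be sketched by summing the monthly figures above into quarters, and drill-down by going back from a quarter to its months. The function names and the `QUARTER_OF` mapping are illustrative. As transcribed, the Glasgow figures for April to June sum to 30543 rather than the 30443 shown in the quarterly table; the other five quarterly totals match.

```python
from collections import defaultdict

# Monthly revenue by city, as transcribed from the monthly table.
monthly = {
    "Jan": {"Glasgow": 9035, "London": 21005, "Aberdeen": 5216},
    "Feb": {"Glasgow": 10788, "London": 14799, "Aberdeen": 14944},
    "Mar": {"Glasgow": 9903, "London": 7751, "Aberdeen": 33050},
    "Apr": {"Glasgow": 11273, "London": 10573, "Aberdeen": 21884},
    "May": {"Glasgow": 9005, "London": 16896, "Aberdeen": 8573},
    "Jun": {"Glasgow": 10265, "London": 20775, "Aberdeen": 4110},
}

# The time hierarchy: each month belongs to one quarter.
QUARTER_OF = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
              "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

def roll_up(monthly):
    """Aggregate month-level revenue to quarter-level revenue per city."""
    quarterly = defaultdict(lambda: defaultdict(int))
    for month, by_city in monthly.items():
        for city, revenue in by_city.items():
            quarterly[QUARTER_OF[month]][city] += revenue
    return {quarter: dict(cities) for quarter, cities in quarterly.items()}

def drill_down(quarter, city):
    """Disaggregate one quarterly figure back into its monthly figures."""
    return {m: monthly[m][city] for m in monthly if QUARTER_OF[m] == quarter}

quarterly = roll_up(monthly)
print(quarterly["Q1"])
print(drill_down("Q1", "London"))
```

Roll-up needs only the hierarchy mapping, whereas drill-down needs the detailed data to be retained, which is one reason warehouses keep data at a fine grain.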
  29. 33. <ul><li>Slicing & Dicing: Selecting data within dimension categories </li></ul><ul><ul><li>Example: given revenue by city for the entire year, we can extract the revenue for a given city for a given quarter </li></ul></ul><ul><li>Rotating: Reorienting the presentation of the data cube </li></ul><ul><ul><li>Example: given a cube that presents revenue by city & property type for each quarter, we can change the presentation to present the revenue by property type & quarter for each city </li></ul></ul>
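Both operations can be sketched on a cube stored as a dictionary keyed by (property type, city, quarter). The cells below are the Q1 and Q2 figures from the property revenue table; the `dice` and `rotate` helpers are illustrative names, not functions of any OLAP tool.

```python
DIMS = ("property_type", "city", "quarter")

# A small data cube: one revenue cell per coordinate.
cube = {
    ("Flat", "Glasgow", "Q1"): 15056, ("House", "Glasgow", "Q1"): 14670,
    ("Flat", "Glasgow", "Q2"): 14555, ("House", "Glasgow", "Q2"): 15888,
    ("Flat", "London", "Q1"): 19678, ("House", "London", "Q1"): 23877,
    ("Flat", "London", "Q2"): 19567, ("House", "London", "Q2"): 28677,
}

def dice(cube, **allowed):
    """Keep only the cells whose coordinates fall within the allowed values."""
    def keep(key):
        return all(key[DIMS.index(dim)] in values
                   for dim, values in allowed.items())
    return {key: value for key, value in cube.items() if keep(key)}

def rotate(cube, order):
    """Re-key the cube so the dimensions appear in a new order (same cells)."""
    idx = [DIMS.index(dim) for dim in order]
    return {tuple(key[i] for i in idx): value for key, value in cube.items()}

# Slice and dice: one city, one quarter, all property types.
london_q1 = dice(cube, city={"London"}, quarter={"Q1"})

# Rotate: present the same data by city and quarter first.
by_city = rotate(cube, ("city", "quarter", "property_type"))

print(london_q1)
print(by_city[("London", "Q2", "House")])
```

Slicing and dicing reduces the number of cells, while rotating changes only how the same cells are addressed and presented.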
  30. 34. Example: OLAP Usage at an Automobile Dealership <ul><li>The story: An automobile dealership manager wants to improve business activity, so she wants to view sales figures from different perspectives. </li></ul><ul><li>A question: What is the sales volume for a specific model and color, for a specific salesperson? </li></ul><ul><li>The data needs: </li></ul><ul><ul><li>Sales by model </li></ul></ul><ul><ul><li>Sales by salesperson </li></ul></ul><ul><ul><li>Sales by color </li></ul></ul>
  31. 35. Example: The Multi-dimensional Data Model Used <ul><li>Measure: Sales Volume </li></ul><ul><li>COLOR: Blue, Red, White </li></ul><ul><li>MODEL: Van, Coupe, Sedan </li></ul><ul><li>SALESPERSON: Miller, Clyde, Smith </li></ul>
  32. 36. Example: OLAP “Slicing & Dicing” – Selecting Categories. Sales volume cube with dimensions COLOR (Blue, Red, White), MODEL (Van, Coupe, Sedan), and SALESPERSON (Miller, Clyde, Smith). Choose a range out of each dimension: <ul><li>Color: Blue and White </li></ul><ul><li>Model: Coupe only </li></ul><ul><li>Salesperson: Clyde only </li></ul>“Sliced & Diced” data: Clyde, Coupe, Blue and White
  33. 37. Example: OLAP “Rotation” – Changing the Presentation of the Cube <ul><li>Different users will require different views of the multidimensional cube </li></ul><ul><li>Sales volume by COLOR (Blue, Red, White) and MODEL (Van, Coupe, Sedan) </li></ul><ul><li>View of the Account Manager </li></ul><ul><li>Rotate the data cube by 90° </li></ul><ul><li>Sales volume by SALESPERSON (Miller, Smith, Clyde) and MODEL (Van, Coupe, Sedan) </li></ul><ul><li>View of the Product Manager </li></ul>
  34. 38. Example: OLAP Drill-Down and Roll-Up <ul><li>Data can be disaggregated (drill-down) and aggregated (roll-up) along a dimension according to the natural hierarchy </li></ul><ul><li>Sales volume by organization dimension, a three-level hierarchy: </li></ul><ul><ul><li>State: Georgia </li></ul></ul><ul><ul><li>Region: Atlanta, Athens </li></ul></ul><ul><ul><li>Salesperson: Miller, Smith, Clyde, Lucas, Gleason </li></ul></ul>
  35. 39. <ul><li>Primarily used to exploit data warehouses </li></ul><ul><li>Provides extremely fast response </li></ul><ul><li>View combinations of two dimensions </li></ul><ul><li>Enables drilling down (start with broad info and get more specific) </li></ul><ul><li>Produces results as counts or percentages </li></ul><ul><li>Conversion of tables to charts/graphs </li></ul><ul><li>Usually requires a tailor-made relational database </li></ul><ul><li>OLAP applications are widely used by mid-level and upper-level managers </li></ul><ul><li>A form of business intelligence software </li></ul>
  36. 40. <ul><li>Go to www.fedscope.opm.gov </li></ul><ul><ul><li>Under data cubes on entry page click on employment </li></ul></ul><ul><ul><li>Demonstrate drill down and adding charts </li></ul></ul><ul><ul><li>Data for this example comes from the Central Personnel Data File (CPDF) of the federal government </li></ul></ul><ul><ul><li>The OLAP tool used to build this site is from a company named Cognos (PowerPlay) </li></ul></ul><ul><li>OLAP tools based on Excel </li></ul><ul><ul><li>http://www.cubularity.com </li></ul></ul>
  37. 41. <ul><li>Linkage between elements </li></ul><ul><ul><li>spreadsheet - between cells in the same table </li></ul></ul><ul><ul><li>DBMS - between elements in different tables </li></ul></ul><ul><li>Orientation </li></ul><ul><ul><li>spreadsheet is oriented toward calculations </li></ul></ul><ul><ul><li>DBMS is oriented toward organization and linkage of data elements in different tables </li></ul></ul><ul><li>Capabilities </li></ul><ul><ul><li>DBMS has extensive querying and reporting power </li></ul></ul><ul><ul><li>spreadsheet is limited </li></ul></ul><ul><li>Memory requirements </li></ul><ul><ul><li>entire spreadsheet table must be in memory </li></ul></ul><ul><ul><li>not true for the database table </li></ul></ul>
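The linkage between elements in different tables can be illustrated with a join. This is a minimal sketch using Python's built-in `sqlite3` module; the `salesperson` and `sale` tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesperson (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (
        id             INTEGER PRIMARY KEY,
        salesperson_id INTEGER REFERENCES salesperson(id),
        model          TEXT,
        amount         REAL
    );
    INSERT INTO salesperson VALUES (1, 'Miller'), (2, 'Clyde');
    INSERT INTO sale VALUES (1, 1, 'Van', 21000),
                            (2, 2, 'Coupe', 18000),
                            (3, 2, 'Sedan', 24000);
""")

# One query links rows across two tables and summarizes them.
report = conn.execute("""
    SELECT s.name, COUNT(*), SUM(x.amount)
    FROM sale AS x
    JOIN salesperson AS s ON s.id = x.salesperson_id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(report)
```

The salesperson's name is stored once and referenced by key from every sale, a kind of cross-table linkage that the DBMS maintains and queries directly.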
