Data Warehouses, OLAP, Data Mining


  1. ECS 165B Database Systems. Dr. Michael Gertz. 5. DW, OLAP and DM

     Slide 160: Data Warehouses, OLAP, Data Mining

     • Most database applications are characterized by a large number of transactions, each transaction making only small changes to data (Online Transaction Processing (OLTP)).
     • Organizational decision making requires a comprehensive view of all aspects of an enterprise (Decision Support Systems).
     • That is why many organizations have consolidated data warehouses containing data collected from several OLTP systems managed by different business units, together with history and summary information.
     • The trend of building data warehouses is complemented by an increased emphasis on powerful data analysis tools:
       – Online Analytical Processing (OLAP): dominated by stylized queries that typically involve group-by and aggregate operators.
       – Exploratory Data Analysis, Data Mining: users look for interesting patterns in the data, queries are typically difficult to formulate, and the amount of data is too large to permit manual or even traditional statistical analysis.
     • Important for industry: data warehousing is a big business (more than $10 billion per year).
     • Large scale data warehouses (Top Ten list 2003):
       – France Telecom (29,232 GB data size, Oracle DBMS, HP)
       – AT&T (26,269 GB, Daytona, Sun)

     Slide 161: Data Warehousing Overview

     Basic idea: (periodically) collect data under a unified (federated) database schema at a single (logical) site.

     [Figure: data warehouse architecture. Labels: Source, Source, Source (OLTP Databases); Data Integration; extract (e.g., using DB gateways (ODBC)), transform, load, refresh, purge; Data Warehouse; Metadata Repository; OLAP, Visualization, Data Mining.]

     • DW stores a collection of diverse data
       – "solution" to the data integration problem
       – single repository of all business related information
     • Highly subject-oriented
       – data collection occurs by subject, not by application
       – collected data is used for OLAP and/or data mining
     • Typical workloads involve ad hoc, fairly complex read-only queries; optimized differently from an OLTP system
  2. Slide 162: Data Warehousing Overview (continued)

     • User interfaces are aimed at executives and decision makers
     • DW contains large volumes of data (GB, TB)
     • Information is non-volatile, i.e.,
       – contents are stable for a long period of time
       – enables long analysis transactions
       – often updates are append only
     • History (e.g., managed as sequence of snapshots) and time attributes are essential

     Slide 163: Creating and Maintaining a Data Warehouse

     • Identify warehouse data and source data needed. Examples:
       – Grocery store chain: cashier sales DB, inventory DB, promotion history
       – Insurance company: policy information, claims processing DB
     • Creating a DW essentially is a data integration problem
       – Data extracted from operational databases is cleaned (minimize errors, fill in missing information) and transformed to reconcile semantic mismatches.
       – Transforming is accomplished by defining a view over the tables in the data sources. But unlike a standard view, the view is stored (materialized) in the DW.
     • Data is loaded into the DW. Additional processing such as sorting, generating summaries of data, data partitioning, and building indexes is performed. Because of the huge amount of data, loading can be a slow process (→ parallelism is key).
  3. Slide 164: Creating and Maintaining a Data Warehouse (continued)

     • DW data is periodically refreshed to reflect updates on the data sources. The problem of efficiently refreshing warehouse tables (which are materialized views over tables in the source databases) is an important research topic (→ incremental maintenance of materialized views).
     • Periodically purge data that is too old from the data warehouse (e.g., onto archival storage media, such as tapes).
     • The system catalog associated with a DW is used to keep track of data currently stored in the DW; it can be very large and is typically managed by a separate database called the metadata repository.

     Slide 165: OLAP – Online Analytical Processing

     • OLAP applications are dominated by ad hoc, complex queries (involving group by and aggregation operators).
     • The typical way to think about OLAP queries is in terms of a multidimensional data model.

     Multidimensional Data Model

     • Focus is on a collection of numeric measures; each measure depends on a set of dimensions.
     • Example: A multidimensional dataset for Sales

     [Figure: three-dimensional Sales cube with axes Product, Date, and Store (A, B, C). Values on the visible face:]

              Jan   Feb   Mar
       milk     8    10    10
       soda    30    20    50
       eggs    25     8    15

     Underlying star schema:

       Store(store_id, name, address)
       Product(prod_id, name, category)
       Date(time_id, day, month, year)
       Sales(time_id, store_id, prod_id, units_sold)
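A minimal sketch of how the Sales measure can be held either as fact rows shaped like Sales(time_id, store_id, prod_id, units_sold) or as an array indexed by its dimensions. The nine values are the ones visible in the cube figure; attributing them all to store "A", and the helper names `to_cube` and `to_relation`, are assumptions made only for illustration.

```python
# Fact rows in the shape (time_id, store_id, prod_id, units_sold).
# Assumption: all nine visible values belong to a single store, labelled "A" here.
sales = [
    ("Jan", "A", "milk", 8), ("Feb", "A", "milk", 10), ("Mar", "A", "milk", 10),
    ("Jan", "A", "soda", 30), ("Feb", "A", "soda", 20), ("Mar", "A", "soda", 50),
    ("Jan", "A", "eggs", 25), ("Feb", "A", "eggs", 8), ("Mar", "A", "eggs", 15),
]


def to_cube(rows):
    """Index the measure by its dimensions: cube[(product, store, date)] -> units."""
    return {(prod, store, date): units for date, store, prod, units in rows}


def to_relation(cube):
    """Turn the dimension-indexed cube back into sorted fact rows."""
    return sorted((date, store, prod, units)
                  for (prod, store, date), units in cube.items())


cube = to_cube(sales)
print(cube[("soda", "A", "Mar")])
```

The round trip `to_relation(to_cube(sales))` returns the same rows as `sorted(sales)`, which is the sense in which the array view and the relational view carry the same information.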
  4. Slide 166: Multidimensional Data Model (continued)

     • The view of data as a multidimensional array is readily generalized to more than three dimensions (a.k.a. Data Cube).
     • OLAP systems that use arrays to store multidimensional datasets are called Multidimensional OLAP (MOLAP).
     • Data in a multidimensional array can always be represented as a relation (see relation Sales on previous slide).
     • Each dimension can have a set of associated attributes:
         store(store_id, name, city, state, country)
         date(time_id, date, week, month, quarter, year, holiday_flag)
       Each such dimension can be structured as a hierarchy.
       (Note: Importance of time; SQL's date data type is not appropriate)
     • OLAP systems that store all information, including fact tables, as relations are called Relational OLAP (ROLAP).

     Slide 167: OLAP Queries

     • Consider how data can be queried and manipulated (heavily influenced by end-user tools, such as spreadsheets). Assumption:
       – User is working with a multidimensional data set
       – Each operation returns either a different representation or a summarization of the underlying data set.
     • Common operations are aggregating a measure over one or more dimensions:
       – "Compute the total of all sales."
       – "Compute the total of all sales for each city."
       – "Determine the top five products ranked by their total sales."
     • The aggregate measure depends on fewer dimensions than the original measure, e.g., total sales by city depends only on the Store dimension, not on the Date or Product dimension ⇒ dimensionality reduction
     • Another use of aggregation is to summarize at different levels of a dimension hierarchy:
       – Given total sales by City → total sales by State (roll-up)
       – Given total sales by State → total sales by City (drill-down)
       – Drilling down on another dimension: total sales for each product for each state (drilling down on the Product dimension)
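A small sketch of roll-up along the Store hierarchy (city → state). The store identifiers, cities, and sales figures are hypothetical, as are the function names `total_by` and `roll_up`; only the hierarchy attributes come from the store(...) dimension above.

```python
from collections import defaultdict

# Hypothetical Store dimension: store_id -> (city, state).
store_dim = {
    "s1": ("Davis", "CA"),
    "s2": ("Sacramento", "CA"),
    "s3": ("Austin", "TX"),
}

# Hypothetical fact rows: (store_id, prod_id, units_sold).
facts = [
    ("s1", "milk", 8), ("s1", "soda", 30),
    ("s2", "milk", 10), ("s2", "eggs", 8),
    ("s3", "soda", 50), ("s3", "eggs", 15),
]


def total_by(level):
    """Aggregate units_sold at one level of the Store hierarchy: 'city' or 'state'."""
    idx = {"city": 0, "state": 1}[level]
    totals = defaultdict(int)
    for store_id, _prod, units in facts:
        totals[store_dim[store_id][idx]] += units
    return dict(totals)


def roll_up(by_city):
    """Derive state totals from city totals using only the hierarchy."""
    city_to_state = dict(store_dim.values())  # (city, state) pairs
    totals = defaultdict(int)
    for city, units in by_city.items():
        totals[city_to_state[city]] += units
    return dict(totals)


print(total_by("city"))
print(roll_up(total_by("city")))
```

Roll-up can be computed from the coarser result alone, whereas drill-down has to go back to the finer-grained data: state totals do not determine city totals, so `total_by("city")` reads the fact rows again.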
  5. Slide 168: OLAP Queries (continued)

     • Another common operation is Pivoting:
       Given the relation Sales, pivoting on the Store and Date dimensions → two-dimensional chart.
       Pivoting can be combined with aggregation, e.g., yearly sales by state. The result of pivoting then is called cross-tabulation.

                  CA    TX   Total
         1995     63    81     144
         1996     38   107     145
         1997     75    35     110
         Total   176   223     399

     • Time is the most important concept in OLAP queries:
       – Compute the total of all sales by month [for each city].
       – Compute the percentage change in the total monthly sales of each product.
       – Given sales values for each date, calculate for each date the average of the sales on that day, the previous day, and the next day ("windowing").

     Slide 169: Comparison with SQL Queries

     • Some OLAP queries cannot be expressed in SQL (e.g., queries that rank results and queries that involve time-oriented operations).
     • Consider the cross-tabulation on the previous slide.

       (Entries in the body of the chart)
         select sum(S.sales)
         from Sales S, Date D, Store T
         where S.time_id = D.time_id and S.store_id = T.store_id
         group by D.year, T.state

       (Summary column on the right)
         select sum(S.sales)
         from Sales S, Date D
         where S.time_id = D.time_id
         group by D.year

       (Summary row at the bottom)
         select sum(S.sales)
         from Sales S, Store T
         where S.store_id = T.store_id
         group by T.state
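The cross-tabulation and the "windowing" query above can be sketched in a few lines. The body cells are the ones from the chart; the function names `margins` and `windowed_average` are hypothetical, and the handling of the first and last day (averaging over the neighbours that exist) is an assumption, since the slides do not say what happens at the edges.

```python
# Body of the cross-tabulation: (year, state) -> total sales.
body = {
    (1995, "CA"): 63, (1995, "TX"): 81,
    (1996, "CA"): 38, (1996, "TX"): 107,
    (1997, "CA"): 75, (1997, "TX"): 35,
}


def margins(cells):
    """Compute the summary column (by year), summary row (by state), and grand total."""
    by_year, by_state = {}, {}
    for (year, state), value in cells.items():
        by_year[year] = by_year.get(year, 0) + value
        by_state[state] = by_state.get(state, 0) + value
    return by_year, by_state, sum(cells.values())


def windowed_average(values):
    """Average of each day with its previous and next day.

    Assumption: at the two ends, only the neighbours that exist are averaged.
    """
    result = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        result.append(sum(window) / len(window))
    return result


by_year, by_state, grand_total = margins(body)
print(by_year, by_state, grand_total)
print(windowed_average([3, 6, 9, 12]))
```

The three results of `margins` correspond to the three separate group-by queries: one for the body would be the input, one for the right-hand column, and one for the bottom row.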
  6. Slide 170: Comparison with SQL Queries (continued)

     • In general, given a measure with k associated dimensions, we can roll up on any subset of these k dimensions → 2^k possible SQL queries.
     • The SQL CUBE operation computes the union of group by's on every subset of the specified attributes.
       For example, consider the query

         select item-name, color, size, sum(number)
         from sales
         group by cube(item-name, color, size)

       This computes the union of eight different groupings of the sales relation:

         { (item-name, color, size), (item-name, color), (item-name, size), (color, size),
           (item-name), (color), (size), ( ) }

       where ( ) denotes an empty group by list. For each grouping, the result contains the null value for attributes not present in the grouping.

     Slide 171: Data Mining

     • Data Mining (DM) seeks to discover new information or "knowledge" from (very) large databases. Knowledge is represented in the form of statistical rules and patterns.
     • It differs from machine learning in that it deals with large amounts of data stored primarily on disk (rather than in main memory).
     • Knowledge discovered from a database can be represented by a set of rules. Such rules can be discovered using one of two methods:
       – The user is involved directly in the process of knowledge discovery.
       – The DM system is responsible for automatically discovering knowledge from the database, by detecting patterns and correlations in the data.

     Knowledge Representation using Rules

     • General form of rules: antecedent ⇒ consequent
     • Example: Market Basket Analysis
       Market Basket ≡ collection of items purchased by a customer in a single customer transaction (single visit to a store or through a mail-order catalog)
       Idea: Use DM to identify sets of items that are purchased together → information can be used to improve the layout of goods in a store or catalog.
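To make the 2^k groupings of the CUBE operation on slide 170 concrete, here is a sketch that enumerates them and computes the union of group-bys in plain Python. The sample rows and the function names `cube_groupings` and `cube` are hypothetical; `None` stands in for SQL's null, and the sketch assumes the data itself contains no nulls.

```python
from itertools import combinations


def cube_groupings(attrs):
    """All 2^k subsets of the grouping attributes, from the full list down to ()."""
    return [combo
            for r in range(len(attrs), -1, -1)
            for combo in combinations(attrs, r)]


def cube(rows, attrs, measure):
    """Union of group-bys over every subset of attrs.

    Attributes absent from a grouping are reported as None (null).
    """
    result = {}
    for grouping in cube_groupings(attrs):
        for row in rows:
            key = tuple(row[a] if a in grouping else None for a in attrs)
            result[key] = result.get(key, 0) + row[measure]
    return result


attrs = ("item-name", "color", "size")
# Hypothetical rows of the sales relation.
rows = [
    {"item-name": "skirt", "color": "dark", "size": "S", "number": 2},
    {"item-name": "skirt", "color": "pastel", "size": "M", "number": 3},
    {"item-name": "dress", "color": "dark", "size": "S", "number": 5},
]

groupings = cube_groupings(attrs)
totals = cube(rows, attrs, "number")
print(len(groupings))
print(totals[(None, None, None)])
```

With three attributes, `cube_groupings` yields the eight groupings in the same order as listed on the slide, ending with the empty group by list.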
  7. Slide 172: Association Rules

       transid  custid  date     item   price  qty
       111      201     3-12-91  pen    35     2
       111      201     3-12-91  ink    2      1
       111      201     3-12-91  diary  5      3
       111      201     3-12-91  soap   1      6
       112      105     4-2-91   pen    35     1
       112      105     4-2-91   ink    2      1
       112      105     4-2-91   diary  5      1
       113      106     4-2-91   pen    35     2
       113      106     4-2-91   diary  5      1
       114      201     5-4-91   pen    35     2
       114      201     5-4-91   ink    2      2
       114      201     5-4-91   soap   1      4

     • Examining the set of transactions yields the rule {pen} ⇒ {ink}, i.e., if a pen is purchased in a transaction, it is likely that ink will also be purchased in the transaction.
       Such a rule is called an association rule.
       General form: LHS ⇒ RHS, where both LHS and RHS are sets of items.

     Slide 173: Association Rules (continued)

     • Important measures for an association rule:
       – Support for a set of items is the percentage of transactions that contain all items in LHS ∪ RHS (75% for the above rule).
         If support is low (as, e.g., for {diary} ⇒ {soap} [25%]), then there is not enough evidence to draw a conclusion about the correlation between items in LHS and items in RHS.
       – Confidence: Consider transactions that contain all items in LHS. The confidence is the percentage of these transactions that also contain all items of the RHS (75% for the above rule).
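The support and confidence figures quoted above can be checked directly against the transaction table. The sketch below keeps only the item sets per transid; the function names `support` and `confidence` are hypothetical.

```python
# Items per transaction, taken from the transid/item columns of the table.
transactions = {
    111: {"pen", "ink", "diary", "soap"},
    112: {"pen", "ink", "diary"},
    113: {"pen", "diary"},
    114: {"pen", "ink", "soap"},
}


def support(itemset):
    """Fraction of transactions that contain every item in itemset (LHS ∪ RHS)."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(lhs, rhs):
    """Among transactions containing LHS, the fraction that also contain RHS."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [items for items in transactions.values() if lhs <= items]
    with_both = [items for items in with_lhs if rhs <= items]
    return len(with_both) / len(with_lhs)


print(support({"pen", "ink"}))        # rule {pen} => {ink}
print(confidence({"pen"}, {"ink"}))
print(support({"diary", "soap"}))     # rule {diary} => {soap}
```

Note that `confidence` divides by the number of transactions containing the LHS, so it is undefined (and raises `ZeroDivisionError` in this sketch) for a LHS that never occurs.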
  8. Slide 174: Usage of Association Rules for Prediction

     • Association rules can be misleading when used naively for prediction.
       Assume the association rule {pen} ⇒ {ink} with high support and confidence (over a given database).
       Furthermore, pencils are often purchased in combination with pens → {pencil} ⇒ {ink}.
       But a sales promotion that discounts pencils in order to increase the sale of ink fails, because there are no causal links between pencils and ink.
       ⇒ Assumptions are only justified if there are direct causal links between LHS and RHS!

     User-Guided Data Mining

     • User may run tests on the database to verify or refute a hypothesis.
     • Example: Refine the hypothesis "People holding a CS degree are the most likely to have an excellent credit rating" into the rule:
         degree = 'CS' ∧ income ≥ 50,000 ⇒ credit = 'excellent'
     • Data visualization helps in detecting patterns in large volumes of data via maps, charts, color-coding, and other graphical representations.

     Slide 175: Finding Association Rules

     • Based on frequent itemsets, i.e., sets of items that have support greater than a specified minimum support (minsup).
     • Once frequent itemsets are found, candidate rules are determined by partitioning an itemset into LHS and RHS.
     • For each such rule LHS ⇒ RHS, the confidence is then computed by scanning all transactions. Uses two counters:
         lhscount ≡ number of transactions where LHS is true
         rhscount ≡ number of transactions where additionally RHS is true
       ⇒ rhscount/lhscount then is the confidence of that rule.
     • The most expensive step is the identification of frequent itemsets. Most algorithms rely upon the following A Priori Property:
       "Every subset of a frequent itemset must also be a frequent itemset."
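The rule-generation step on slide 175 (partition a frequent itemset into LHS and RHS, then count with two counters in one scan) might look roughly like this. The transactions are the item sets from the association-rule example table; `candidate_rules` and `rule_confidence` are hypothetical names, and returning 0.0 when the LHS never occurs is an assumption.

```python
from itertools import combinations

# Item sets of transactions 111-114 from the association-rule example.
transactions = [
    {"pen", "ink", "diary", "soap"},
    {"pen", "ink", "diary"},
    {"pen", "diary"},
    {"pen", "ink", "soap"},
]


def candidate_rules(itemset):
    """Yield every split of itemset into a non-empty LHS and a non-empty RHS."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            yield frozenset(lhs), frozenset(items) - frozenset(lhs)


def rule_confidence(lhs, rhs):
    """One scan over all transactions, maintaining lhscount and rhscount."""
    lhscount = rhscount = 0
    for items in transactions:
        if lhs <= items:
            lhscount += 1
            if rhs <= items:  # RHS additionally true
                rhscount += 1
    # Assumption: report 0.0 when the LHS never occurs.
    return rhscount / lhscount if lhscount else 0.0


for lhs, rhs in candidate_rules({"pen", "ink"}):
    print(sorted(lhs), "=>", sorted(rhs), rule_confidence(lhs, rhs))
```

An itemset with n items yields 2^n - 2 candidate rules, which is one reason the search is restricted to frequent itemsets first.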
  9. Slide 176: Algorithm for computing frequent itemsets

       foreach item,                                        // Level 1
           check if it is a frequent itemset                // appears in > minsup transactions
       repeat                                               // Level-wise identification of frequent itemsets
           foreach new frequent itemset I_k with k items    // Level k + 1
               generate all itemsets I_{k+1} with k + 1 items, I_k ⊂ I_{k+1}
           check all transactions once and check if the generated (k + 1)-itemsets are frequent
       until no new frequent itemsets are identified

       generation of candidate rules by partitioning of each frequent itemset
       computation of ratio rhscount/lhscount for candidate rules
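A runnable sketch of the level-wise search above, under a few stated assumptions: `minsup` is given as a fraction, "frequent" means support strictly greater than `minsup`, and candidates are additionally pruned with the A Priori property before counting. For brevity the sketch rescans the transactions once per candidate instead of once per level, so it illustrates the logic rather than the I/O behaviour. The function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Return all itemsets whose support is strictly greater than minsup."""
    n = len(transactions)

    def is_frequent(candidate):
        return sum(1 for items in transactions if candidate <= items) / n > minsup

    all_items = sorted(set().union(*transactions))

    # Level 1: frequent single items.
    level = {frozenset([i]) for i in all_items if is_frequent(frozenset([i]))}
    frequent = set(level)

    while level:
        # Level k + 1: extend each new frequent k-itemset by one item.
        candidates = {s | {i} for s in level for i in all_items if i not in s}
        # A Priori property: every k-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent
                   for sub in combinations(c, len(c) - 1))
        }
        level = {c for c in candidates if is_frequent(c)}
        frequent |= level

    return frequent


# Item sets of transactions 111-114 from the association-rule example.
transactions = [
    {"pen", "ink", "diary", "soap"},
    {"pen", "ink", "diary"},
    {"pen", "diary"},
    {"pen", "ink", "soap"},
]

for itemset in sorted(frequent_itemsets(transactions, 0.7), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

With `minsup = 0.7`, {pen, ink, diary} is never counted: its subset {ink, diary} appears in only half of the transactions, so the candidate is pruned before any scan.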
