Market Basket Analysis of Database Table References Using R: An Application to Physical Database Design
Jeffrey Tyzzer¹
Summary
Market basket and statistical and informetric analyses are applied to a population of database queries (SELECT statements) to better understand table usage and co-occurrence patterns and inform placement on physical media.
Introduction
In 1999 Sally Jo Cunningham and Eibe Frank of the University of Waikato published a
paper titled “Market Basket Analysis of Library Circulation Data” [8]. In it the authors
apply market basket analysis (MBA) to library book circulation data, models of which are
a staple of informetrics, “the application of statistical and mathematical methods in
Library and Information Sciences” [10]. Their paper engendered ideas that led to this
paper, which concerns the application of MBA and statistical and informetric analyses to
a set of database queries, i.e. SELECT statements, to better understand table usage and
co-occurrence patterns.
Market Basket Analysis
Market basket analysis is a data mining technique that applies association rule analysis, a
method of uncovering connections among items in a data set, to supermarket purchases,
with the goal of finding items (i.e., groceries) having a high probability of appearing
together. For instance, a rule induced by MBA might be “in 85% of the baskets where
potato chips appeared, so did root beer.” In the Cunningham and Frank paper, the baskets
were the library checkouts and the groceries were the books. In this paper, the baskets are
the queries and the groceries are the tables referenced in the queries.
MBA was introduced in the seminal paper “Mining Association Rules between Sets of
Items in Large Databases,” by Agrawal et al. [1] and is used by retailers to guide store
layout (for example, placing products having a high probability of appearing in the same
purchase closer together to encourage greater sales) and promotions (e.g., buy one and
get the other half-off). The output of MBA is a set of association rules and attendant
metadata in the form {LHS => RHS}. LHS means “left-hand side” and RHS means
“right-hand side.” These rules are interpreted as “if LHS then RHS,” with the LHS
referred to as the antecedent and the RHS referred to as the consequent. For the potato
chip and root beer example, we’d have {Chips => Root beer}.
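As a concrete illustration of the rule format, the sketch below mines a toy set of baskets with R's arules package (the same package used for the analysis later in this paper); the baskets and thresholds are invented for the example:

library(arules)

## Four invented baskets; each vector is one purchase.
baskets <- list(
  c("Chips", "Root beer"),
  c("Chips", "Root beer", "Salsa"),
  c("Chips", "Salsa"),
  c("Root beer")
)
trans <- as(baskets, "transactions")

## Mine rules with at least two items; the thresholds are arbitrary.
rules <- apriori(trans, parameter = list(supp = 0.25, conf = 0.5, minlen = 2))
inspect(rules)  # output includes {Chips} => {Root beer}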
¹ jefftyzzer AT sbcglobal DOT net
The Project
Two questions directed my investigation:
1. Among the tables, are there a “vital few” [9] that account for the bulk of the table
references in the queries? If so, which ones are they?
2. Which table pairings (co-occurrences) are most frequent within the queries?
The answers to these questions can be used to:
• Steer the placement of tables on physical media²
• Justify denormalization decisions
• Inform the creation of materialized views, table clusters, and aggregates
• Guide partitioning strategies to achieve collocated joins and reduce inter-node
data shipping in distributed databases
• Identify missing indexes to support frequently joined tables³
• Direct the scope, depth, frequency, and priority of table and index statistics
gathering
• Contribute to an organization’s overall corpus of operational intelligence
The data at the focus of this study are metadata for queries executed against a population
of 494 tables within an OLTP database. The queries were captured over a four-day
period. There is an ad hoc query capability within the environment, but such queries are run against a separate data store; thus the system under study was effectively closed with respect to random, external queries.
I wrote a Perl program to iterate over the compressed system-generated query log files,
272 in all, cull the tables from each of the SELECT statements within them, and, for
those referencing at least one of the 494 tables, output detail and summary data,
respectively, to two files. Of the 553,139 total statements read, 373,372 met this criterion
(the remainder, a sizable number, were metadata-type statements, e.g., data dictionary
lookups and variable instantiations).
The summary file lists each table and the number of queries it appears in; its structure is
simply {table, count}. The detail file lists {query, table, hour} triples, which were then
imported into a simple table consisting of three corresponding columns. query identifies
the query in which the table is referenced, tabname is the name of the table, and hour
designates the hour the query snapshot was taken, in the range 0-23.
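As a rough sketch of how these two files can be pulled into R for the analyses that follow (the file names and comma-separated layout are assumptions, not details from the paper):

## Summary file: {table, count}; detail file: {query, table, hour}.
summary_df <- read.csv("table_counts.csv", header = FALSE,
                       col.names = c("tabname", "count"))
detail_df  <- read.csv("query_details.csv", header = FALSE,
                       col.names = c("query", "tabname", "hour"))

## Rank tables by how many queries reference them.
summary_df <- summary_df[order(-summary_df$count), ]
head(summary_df, 25)  # the 25 most-queried tables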
² Both within a given medium as well as among media with different performance characteristics, e.g., tiering storage between disk and solid-state drives (SSD) [15].
³ Adding indexes to support SELECTs may come at the expense of increased INSERT, UPDATE, and DELETE costs. When adding indexes to a table, the full complement of CRUD operations against it must be considered. The analysis discussed here is easily extended to encompass (other) DML statements as well.
The “Vital Few”
The 80/20 principle describes a phenomenon in which “20% of a population or group can
explain 80% of an effect” [9]. This principle is widely observed in economics, where it’s
generally referred to as the Pareto principle, and informetrics, where it’s known as
Trueswell’s 80/20 rule [22]. Trueswell argued that 80% of a library’s circulation is
accounted for by 20% of its circulating books [9]. This behavior has also been observed
in computer science contexts, where it’s been noted that 80 percent of the transactions on
a file are applied to the 20 percent most frequently used records within it [14; 15].
I wanted to see if this pattern of skewed access could apply to RDBMSs as well, i.e., if 20% of the tables in the database might account for 80% of all table references within queries.
Figure 1 plots in ascending (rank) order the number of queries each of the 494 tables is
referenced in (space prohibits a table listing the frequencies of the 494 tables), showing a
characteristic reverse-J shape. Figure 2 presents this data as a Lorenz curve, which was
generated using the R package ineq. As [11] puts it, “[t]he straight line represents the
expected distribution” if all tables were queried an equal number of times, with the
curved line indicating the observed distribution. As the figure shows, 20% of the tables
account for a little more than 85% of the table references. Clearly, a subset of tables, the
“vital few,” account for the majority of table references in the queries. These tables get
the most attention query-wise and therefore deserve the most attention performance-wise.
Figure 1 - Plot of query-count-per-table frequency
Figure 2 - Lorenz curve illustrating the 80/20 rule for table references
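For those wishing to reproduce figure 2, a sketch of the Lorenz-curve step follows,
assuming the per-table query counts from the summary file are in a data frame named
counts:

> library(ineq)
> lorenz <- Lc(counts$count)  # cumulative shares of table references
> plot(lorenz, main = "Cumulative Table Query Percentages")
> Gini(counts$count)          # one-number summary of the skew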
The Market Basket Analysis
The 373,372 statements4 mentioned earlier are the table baskets from which the subset of
transactions against the 25 most-queried tables is derived. As a first step toward
uncovering connections among the tables in the database using MBA, I used the R
package diagram to create a web plot, or annulus [21], of the co-reference relationships
among these 25 tables, shown in figure 3. Note that line thickness in the figure is
proportional to the frequency of co-occurrence. As can be seen, there is a high level of
interconnectedness among these tables. Looking at these connections as a graph, with the
tables as nodes and their co-occurrence in queries as edges, I computed the graph’s
clustering coefficient, which is the number of actual connections between the nodes
divided by the total possible number of connections [3], which turned out to be 0.75, a
not surprisingly high value given what figure 3 illustrates.
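The arithmetic behind that coefficient is simple enough to verify at the prompt,
assuming the 225 observed pairings reported in the odds-ratio section below:

> observed_pairs <- 225            # table pairings observed among the top 25
> possible_pairs <- choose(25, 2)  # 300 possible undirected pairings
> observed_pairs / possible_pairs  # 0.75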
4 See Appendix B for a sample size formula if a population of table baskets is not already available.
Figure 3 - Web plot of the co-references between the 25 most-queried tables (ADDRESS, BIRTH_INFORMATION, CASE, CASE_ACCOUNT, CASE_ACCOUNT_SUMMARY, CASE_ACCOUNT_SUMMARY_TRANSACTION, CASE_COURT_CASE, CASE_PARTICIPANT, CHARGING_INSTRUCTION, COMBINED_LOG_TEMP_ENTRY, COURT_CASE, EMPLOYER, EMPLOYER_ADDRESS, INTERNAL_USER, LEGAL_ACTIVITY, LOGICAL_COLLECTION_TRANSACTION, ORG_UNIT, PARTICIPANT, PARTICIPANT_ADDRESS, PARTICIPANT_EMPLOYER, PARTICIPANT_NAME, PARTICIPANT_PUBLIC_ASSISTANCE_CASE, PARTICIPANT_RELATIONSHIP, SOCIAL_SECURITY_NUMBER, SUPPORT_ORDER)
To mine the tables’ association rules, I used R’s arules package. Before the table
basket data could be analyzed it had to be read into an R data structure, which is done
using the read.transactions() function. The result is a transactions object, an
incidence matrix of the transaction items. To see the structure of the matrix, type the
name of the variable at the R prompt:
> tblTrans
transactions in sparse format with
110071 transactions (rows) and
154 items (columns)
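The call that produced this object might look like the following; the file name,
separator, and reliance on the detail file’s first two columns ({query, table}) are
assumptions, and details vary by arules version:

> library(arules)
> tblTrans <- read.transactions("query_table_detail.csv", format = "single",
  sep = ",", cols = c(1, 2), rm.duplicates = TRUE)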
Keep in mind the number of transactions (queries) and items (tables) shown here differs
from their respective numbers listed previously because I limited the analysis to just
those table baskets with at least two of the 25 most-queried tables in them. Looking at the
output, that’s 110,071 queries and 154 tables (the top 25 along with 129 others they
appear with).
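That restriction can be expressed directly against a transactions object; a sketch
follows, where allTrans is assumed to hold the full set of 373,372 table baskets:

> top25 <- names(sort(itemFrequency(allTrans, type = "absolute"), decreasing = TRUE))[1:25]
> tblTrans <- allTrans[size(allTrans[, top25]) >= 2]  # keep baskets with 2+ top-25 tables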
To see the contents of the tblTrans transaction object, the inspect() function is
used (note I limited inspect() to the first five transactions, as identified by the ASCII
ordering of the transactionID):
> inspect(tblTrans[1:5])
items transactionID
1 {PARTICIPANT,
PARTICIPANT_NAME} 1
2 {ADDRESS,
BIRTH_INFORMATION,
PARTICIPANT,
PARTICIPANT_ADDRESS,
PARTICIPANT_PHONE_NUMBER,
PARTICIPANT_PHYSICAL_ATTRIBUTES,
SOCIAL_SECURITY_NUMBER} 10
3 {CASE_COURT_CASE,
COURT_CASE,
LEGAL_ACTIVITY,
MEDICAL_TERMS,
SUPPORT_ORDER,
TERMS} 100
4 {CASE,
CASE_ACCOUNT,
CASE_ACCOUNT_SUMMARY,
CASE_COURT_CASE} 1000
5 {ADDRESS,
PARTICIPANT,
PARTICIPANT_ADDRESS} 10000
The summary() function provides additional descriptive statistics concerning the
makeup of the table transactions (output format edited slightly to fit):
> summary(tblTrans)
transactions as itemMatrix in sparse format with
110071 rows (elements/itemsets/transactions) and
154 columns (items) and a density of 0.02457888
most frequent items:
CASE CASE_PARTICIPANT PARTICIPANT COURT_CASE CASE_COURT_CASE (Other)
51216 38476 35549 21519 21421 248454
element (itemset/transaction) length distribution:
sizes
2 3 4 5 6 7 8 9 10 11 12 13 14 15 17
33573 29883 15367 12456 9879 3825 1899 603 1064 775 128 38 490 89 2
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 3.000 3.785 5.000 17.000
includes extended item information - examples:
labels
1 ACCOUNT_HOLD_DETAIL
2 ADDRESS
3 ADJUSTMENT
includes extended transaction information - examples:
transactionID
1 1
2 10
3 100
There’s a wealth of information in this output. Note, for instance, that the minimum item
(table) count is two and the maximum is seventeen, and that there are 33,573
transactions with two items but only two with seventeen. While I’m loath to
assign a limit to the maximum number of tables that should ever appear in a query, a
DBA would likely be keen to investigate the double-digit queries for potential tuning
opportunities.
Lastly, we can use the itemFrequencyPlot() function to generate an item
frequency distribution. Frequencies can be displayed as relative (percentages) or absolute
(counts). Note that for readability, I limited the plot to the 25 most-frequent items among
the query baskets, by specifying a value for the topN parameter. The command is below,
and the plot is shown in figure 4.
> itemFrequencyPlot(tblTrans, type = "absolute", topN = 25, main
= "Frequency Distribution of Top 25 Tables", xlab = "Table Name",
ylab = "Frequency")
Figure 4 - Item frequency bar plot among the top 25 tables
With the query baskets loaded, it was then time to generate the table association rules.
The R function within arules that does this is apriori(). apriori() takes up to
four arguments but I only used two: a transaction object, tblTrans, and a list of two
parameters that specify the minimum values for the two rule “interestingness criteria,”5
generality and reliability [12; 16]. Support, the first parameter, is a measure of
generality: it specifies the proportion of all baskets in which the rule is true, i.e.,

support = countOfBasketsWithLHSandRHSItems / totalCount.

Confidence, the second parameter, corresponds to the reliability criterion and specifies
how often the rule is true when the LHS is present, i.e.,

confidence = countOfBasketsWithLHSandRHSItems / countOfBasketsWithLHSItems.

5 This was the “attendant metadata” I mentioned in the Market Basket Analysis section.
A third interestingness criterion, lift, is also useful for evaluating rules and figures
prominently in output from the R-extension package arulesViz (see figure 5).
Paraphrasing [18], lift is the ratio of the confidence of a rule to the expected confidence
that the second table will be queried given that the first table was:

lift = Confidence(Rule) / Support(RHS),

with Support(RHS) calculated as

Support(RHS) = countOfBasketsWithRHSItem / totalCount.

Lift indicates the strength of an association over and above random co-occurrence: when
lift is greater than 1, the rule is better than guessing at predicting the consequent.
For confidence I specified .8 and for support I specified .05. The command I ran was
> tblRules <- apriori(tblTrans, parameter = list(supp = .05, conf = .8))
which generated 71 rules. To get a high-level overview of the rules, you can call the
overloaded summary() function on the output of apriori():
> summary(tblRules)
set of 71 rules
rule length distribution (lhs + rhs):sizes
2 3 4 5
11 28 25 7
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 3.000 3.394 4.000 5.000
summary of quality measures:
support confidence lift
Min. :0.05239 Min. :0.8037 Min. :1.728
1st Qu.:0.05678 1st Qu.:0.8585 1st Qu.:2.460
Median :0.06338 Median :0.9493 Median :4.533
Mean :0.07842 Mean :0.9231 Mean :4.068
3rd Qu.:0.08271 3rd Qu.:0.9870 3rd Qu.:5.066
Max. :0.28343 Max. :1.0000 Max. :7.204
mining info:
data ntransactions support confidence
tblTrans 110071 0.05 0.8
To see the rules, execute the inspect() function (note I’m only showing the first and
last five, as sorted by confidence):
> inspect(sort(tblRules, by = "confidence"))
lhs rhs support confidence lift
1 {SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.11643394 1.0000000 5.847065
2 {CASE_COURT_CASE,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.07112682 1.0000000 5.847065
3 {COURT_CASE,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.08326444 1.0000000 5.847065
4 {CASE_PARTICIPANT,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.05753559 1.0000000 5.847065
5 {CASE,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.06406774 1.0000000 5.847065
<snip>
67 {CASE_COURT_CASE,
COURT_CASE,
SUPPORT_ORDER} => {CASE} 0.05678153 0.8088521 1.738347
68 {CASE_COURT_CASE,
COURT_CASE,
LEGAL_ACTIVITY,
SUPPORT_ORDER} => {CASE} 0.05678153 0.8088521 1.738347
69 {CASE_COURT_CASE,
COURT_CASE} => {CASE} 0.13003425 0.8044175 1.728816
70 {CASE_COURT_CASE,
LEGAL_ACTIVITY} => {CASE} 0.07938512 0.8041598 1.728262
71 {CASE,
CASE_COURT_CASE} => {COURT_CASE} 0.13003425 0.8037399 4.111179
Let’s look at the first and last rules and interpret them. The first rule says that over the
period during which the queries were collected, the SUPPORT_ORDER table appeared in
11.64% of the queries and that when it did it was accompanied by the
LEGAL_ACTIVITY table 100% of the time. The last rule, the 71st, says that during this
same period CASE and CASE_COURT_CASE appeared together in 13% of the queries
and that they were accompanied by COURT_CASE 80.37% of the time.
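Lift ties these measures together. Rule 1’s lift of 5.847 is its confidence (1.0) divided by
Support(LEGAL_ACTIVITY), which implies LEGAL_ACTIVITY appears in roughly
1/5.847, or about 17%, of the analyzed baskets.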
While it’s not visible from the subset shown, all 71 of the generated rules have a single-
item consequent. This is fortunate, and not always the case, as such rules are “the most
actionable” in practice compared to rules with compound consequents [4].
Figure 5, generated using arulesViz, is a scatter plot of the support and confidence of
the 71 rules generated by arules. Here we see the majority of the rules are in the 0.05-
0.15 support range, meaning between 5% and 15% of the 110,071 queries analyzed
contain all of the tables represented in the rule.
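A command along the following lines produces figure 5; the parameter choices here are
assumptions:

> library(arulesViz)
> plot(tblRules, method = "scatterplot", measure = c("confidence", "support"), shading = "lift")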
Figure 5 - Plot of the interestingness measures for the 71 generated rules
An illuminating visualization is shown in figure 6, also generated by the arulesViz
package. This figure plots the rule antecedents on the x-axis and their consequents on the
y-axis. To economize on space, the table names aren’t displayed but rather are numbered,
corresponding to output accompanying the graph that’s displayed in the main R window.
Looking at the plot, two things immediately stand out: the presence of four large rule
groups, and that only nine tables (the y-axis) account for the consequents in all 71 rules.
These nine tables are the “nuclear” tables around which all the others orbit, the most vital
of the vital few.
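In arulesViz this style of plot is produced by the matrix method, which also prints the
antecedent and consequent numbering keys to the console; the exact call is an
assumption:

> plot(tblRules, method = "matrix", measure = "lift")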
Figure 6 - Plot of table prevalence of rules
Another Way: Odds Ratios
As the final step, I computed the odds ratios for all 225 pairings observed among the
top-25 tables. Odds is the ratio of the probability of an
event’s occurrence to the probability of its non-occurrence, and the odds ratio is the ratio
of the odds of two events (e.g., two tables co-occurring in a given query vs. each table
appearing without the other) [19]. To compute the odds ratios, I used the cc() function
from the epicalc R package which, when given a 2x2 contingency table (see table 1,
generated with the R CrossTable() function), outputs the following (the counts
shown are of the pairing of the PARTICIPANT and CASE_PARTICIPANT tables):
FALSE TRUE Total
FALSE 408021 50902 458923
TRUE 56963 21745 78708
Total 464984 72647 537631
OR = 3.06
Exact 95% CI = 3.01, 3.12
Chi-squared = 15719.49, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0
For these two tables, the odds ratio is 3.06, with a 95% confidence interval of 3.01 to
3.12. An odds ratio greater than 1 indicates a positive association--the further above 1,
the stronger the association--and a confidence interval that excludes 1, as this one does,
indicates the association is statistically significant.
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 537631
| CASE_PARTICIPANT
PARTICIPANT | FALSE | TRUE | Row Total |
--------------------------------------|-----------|-----------|-----------|
FALSE | 408021 | 50902 | 458923 |
| 310.961 | 1990.337 | |
| 0.889 | 0.111 | 0.854 |
| 0.877 | 0.701 | |
| 0.759 | 0.095 | |
--------------------------------------|-----------|-----------|-----------|
TRUE | 56963 | 21745 | 78708 |
| 1813.123 | 11605.065 | |
| 0.724 | 0.276 | 0.146 |
| 0.123 | 0.299 | |
| 0.106 | 0.040 | |
--------------------------------------|-----------|-----------|-----------|
Column Total | 464984 | 72647 | 537631 |
| 0.865 | 0.135 | |
--------------------------------------|-----------|-----------|-----------|
Table 1 - 2x2 contingency table
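A sketch of the calls behind these outputs follows. Here m is assumed to be a logical
query-by-table presence matrix (one row per basket); note that CrossTable() comes
from the gmodels package, and that the odds-ratio analysis used a larger basket
population (537,631 rows) than tblTrans:

> library(epicalc)   # cc()
> library(gmodels)   # CrossTable()
> m <- as(tblTrans, "matrix")  # logical incidence matrix
> cc(m[, "PARTICIPANT"], m[, "CASE_PARTICIPANT"])  # OR, CI, and test statistics
> CrossTable(m[, "PARTICIPANT"], m[, "CASE_PARTICIPANT"], chisq = TRUE)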
I tabulated the odds ratios of the 225 pairings, and table 2 shows the first 25, ordered by
odds ratio in descending order. The top four odds ratios show as Inf--infinite--because
either the FALSE/TRUE or TRUE/FALSE cell in their respective 2x2 contingency tables
was 0, indicating that among the 537,631 observations (rows) analyzed, one table never
appeared without the other.
The results shown in table 2 differ from what figure 3 depicts. The former shows the
strongest link between the CASE and CASE_PARTICIPANT tables, whereas for the
latter it’s SUPPORT_ORDER and SOCIAL_SECURITY_NUMBER. This is because the
line thicknesses in figure 3 are based solely on the count of co-references (analogous to
the TRUE/TRUE cell in the 2x2 contingency table), whereas odds ratios consider the pair
counts relative to each other, i.e., they take into account the other three cells--
FALSE/FALSE, FALSE/TRUE, and TRUE/FALSE--as well.
First Table Second Table OR
SUPPORT_ORDER SOCIAL_SECURITY_NUMBER Inf
SOCIAL_SECURITY_NUMBER PARTICIPANT_PUBLIC_ASSISTANCE_CASE Inf
CASE_ACCOUNT CASE Inf
BIRTH_INFORMATION ADDRESS Inf
PARTICIPANT_EMPLOYER EMPLOYER_ADDRESS 209.014
EMPLOYER_ADDRESS EMPLOYER 54.854
ORG_UNIT INTERNAL_USER 53.583
PARTICIPANT_EMPLOYER EMPLOYER 37.200
SOCIAL_SECURITY_NUMBER PARTICIPANT_NAME 36.551
PARTICIPANT_RELATIONSHIP PARTICIPANT_NAME 34.779
CHARGING_INSTRUCTION CASE_ACCOUNT 27.309
SOCIAL_SECURITY_NUMBER PARTICIPANT_RELATIONSHIP 26.885
CHARGING_INSTRUCTION CASE_ACCOUNT_SUMMARY 25.040
PARTICIPANT LOGICAL_COLLECTION_TRANSACTION 23.281
COMBINED_LOG_TEMP_ENTRY CHARGING_INSTRUCTION 22.357
PARTICIPANT_NAME PARTICIPANT 9.739
SOCIAL_SECURITY_NUMBER PARTICIPANT 8.799
PARTICIPANT_PUBLIC_ASSISTANCE_CASE COMBINED_LOG_TEMP_ENTRY 8.578
CASE_COURT_CASE CASE 7.913
PARTICIPANT_ADDRESS PARTICIPANT 7.869
PARTICIPANT COMBINED_LOG_TEMP_ENTRY 7.861
COURT_CASE CASE_COURT_CASE 7.525
PARTICIPANT_NAME PARTICIPANT_EMPLOYER 7.454
SOCIAL_SECURITY_NUMBER PARTICIPANT_ADDRESS 7.420
PARTICIPANT_NAME ORG_UNIT 7.346
Table 2 - Top 25 table-pair odds ratios
Conclusion
In this paper, I’ve described a holistic process for identifying the most-queried “vital
few” tables in a database, uncovering their usage patterns and interrelationships, and
guiding their placement on physical media.
First I captured query metadata and parsed it for further analysis. I then established that
there are a “vital few” tables that account for the majority of query activity. Finally, I
used MBA supplemented with other methods to understand the co-reference patterns
among these tables, which may in turn inform their layout on storage media.
My hope is that I’ve described what I did in enough detail that you’re able to adapt it,
extend it, and improve it, to the benefit of the performance of your databases and
applications.
Appendix A - Data Placement on Physical Media
Storage devices such as disks maximize throughput by minimizing access time [5], and a
fundamental part of physical database design is the allocation of database objects to such
physical media--deciding where schema objects should be placed on disk to maximize
performance by minimizing disk seek time and rotational latency. The former is a
function of the movement of a disk’s read/write head arm assembly and the latter is
dependent on its rotations per minute. Both are electromechanical absolutes, although
their speeds vary from disk to disk.
As is well known, disk access is several orders of magnitude slower than RAM access--
estimates range from four to six [ibid.]--and this relative disparity is no less true today
than it was when the IBM 350 Disk Storage Unit was introduced in 1956. So while this
topic may seem like a bit of a chestnut in the annals of physical database design, it
remains a germane topic. The presence of such compensatory components and strategies
as bufferpools, defragmenting, table reorganizations, and read-ahead prefetching in the
architecture of modern RDBMSs underscores this point [20]. The fact is read/write heads
can only be in one place on the disk platter at a time. Solid-state drives (SSD) offer
potential relief here, but data volumes are rising at a rate much faster than that at which
SSD prices are falling.
Coupled with this physical reality is a fiscal one, as it’s been estimated that anywhere
between 16% and 40% of IT budget outlay is committed to storage [6; 23]. In light of such
costs, it makes good financial sense for an organization to be a wise steward of this
resource and seek its most efficient use.
If one is designing a new database as part of a larger application development initiative,
then such tried-and-true tools as CRUD (create, read, update, delete) matrices and entity
affinity analysis can assist with physical table placement, but such techniques quickly
become tedious, and therefore error-prone, and these early placement decisions are at best
educated guesses. What would be useful is an automated, holistic approach to help refine
the placement of tables as the full complement of queries comes on line and later as it
changes over the lifetime of the application, without incurring extra storage costs. The
present paper is, of course, an attempt at such an approach.
As to where to locate data on disk media in general, the rule of thumb is to place the
high-use tables on the middle tracks of the disk given this location has the smallest
average distance to all other tracks. In the case of disks employing zone-bit recording
(ZBR), as practically all now do, the recommendation is to place the high-frequency
tables, say, the vital few 20%, on the outermost cylinders, as the raw transfer rate is
higher there since the bits are more densely packed. This idea can be extended further by
placing tables typically co-accessed in queries in the outermost zones on separate disk
drives [2], minimizing read/write head contention and enabling query parallelism. If
zone-level disk placement specificity is not an option, separating co-accessed vital few
tables onto separate media is still a worthwhile practice.
Appendix B - How many baskets?
For this analysis, performed on modest hardware, I used all of the snapshot data I had
available. Indeed, one of the precepts of the burgeoning field of data science [17] is that
with today’s commodity hardware, computing power, and addressable memory sizes, we
no longer have to settle for samples. Plus, when it comes to the data we’ve been
discussing, sampling risks overlooking its long-tailed aspect [7]. Nonetheless, there may
still be instances where analyzing all of the snapshots at your disposal isn’t practicable,
or where, lacking a ready pool of snapshots from which to draw, you want to know at the
outset how many statements you’ll need to capture to get a representative sample of the
query population (and therefore adequate statistical power). In either case, you need to
know how large your sample must be for a robust analysis.
Sample size formulas exist for more conventional hypothesis testing, but Zaki, et al. [24]
give a more suitable sample size formula, one specific to market basket analysis:
n = −2 ln(c) / (τ ε²),
where n is the sample size, c is 1 − α (α being the confidence level), ε is the acceptable
level of inaccuracy, and τ is the minimum required support [13]. Using this equation,
with 80% confidence (c = .20), 95% accuracy (ε = .05), and 5% support (τ = .05), the
sample size recommendation is 25,751.
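Plugging the values above into the formula is a one-liner in R:

> c_ <- 0.20; eps <- 0.05; tau <- 0.05  # 1 - alpha, inaccuracy, minimum support
> round(-2 * log(c_) / (tau * eps^2))   # 25751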
Using this sample size, I ran the sample() function from the arules package against
the tblTrans transactions object, the results of which I then used as input to the
apriori() function to generate a new set of rules. This time, 72 rules were generated.
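The sampling and re-mining steps might look like the following (the seed is an
assumption; arules supplies a sample() method for transactions objects):

> set.seed(42)
> tblSample <- sample(tblTrans, 25751)
> sampleRules <- apriori(tblSample, parameter = list(supp = .05, conf = .8))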
Figure 7 shows the high degree of correspondence between the relative frequencies of the
tables in the sample (bars) and the population (line).
Figure 7 - Relative frequencies of the tables in the sample vs. in the population.
Figure 8 plots the regression lines of the confidence and support of the rules generated
from the sample (black) and the population (grey). Again, notice the high degree of
correspondence.
Figure 8 - Correspondence between the support and confidence of the sample and population rules
References
[1] Agrawal, Rakesh, et al. “Mining Association Rules Between Sets of Items in Large
Databases.” Proceedings of the 1993 ACM SIGMOD International Conference on
Management of Data. pp. 207-216.
[2] Agrawal, Sanjay, et al. “Automating Layout of Relational Databases.” Proceedings of
the 19th International Conference on Data Engineering (ICDE ’03). (2003): 607-618.
[3] Barabási, Albert-László and Zoltán N. Oltvai. “Network Biology: Understanding the
Cell’s Functional Organization.” Nature Reviews Genetics. 5.2 (2004): 101-113.
[4] Berry, Michael J. and Gordon Linoff. Data Mining Techniques: For Marketing, Sales,
and Customer Support. New York: John Wiley & Sons, 1997.
[5] Blanchette, Jean-François. “A Material History of Bits.” Journal of the American
Society For Information Science and Technology. 62.6 (2011): 1042-1057.
[6] Butts, Stuart. “How to Use Single Instancing to Control Storage Expense.” eWeek 03
August 2009. 22 December 2010 <http://mobile.eweek.com/c/a/Green-IT/How-to-Use-Single-Instancing-to-Control-Storage-Expense/>.
[7] Cohen, Jeffrey, et al. “MAD Skills: New Analysis Practices for Big Data.”
Proceedings of the VLDB Endowment. 2.2 (2009): 1481-1492.
[8] Cunningham, Sally Jo and Eibe Frank. “Market Basket Analysis of Library
Circulation Data.” Proceedings of the Sixth International Conference on Neural
Information Processing (1999). Vol. II, pp. 825-830.
[9] Eldredge, Jonathan D. “The Vital Few Meet the Trivial Many: Unexpected Use
Patterns in a Monographs Collection.” Bulletin of the Medical Library
Association. 86.4 (1998): 496-503.
[10] Erar, Aydin. “Bibliometrics or Informetrics: Displaying Regularity in Scientific
Patterns by Using Statistical Distributions.” Hacettepe Journal of Mathematics
and Statistics. 31 (2002): 113-125.
[11] Fu, W. Wayne and Clarice C. Sim. “Aggregate Bandwagon Effect on Online Videos'
Viewership: Value Uncertainty, Popularity Cues, and Heuristics.” Journal of the
American Society For Information Science and Technology. 62.12 (2011): 2382-
2395.
[12] Geng, Liqiang and Howard J. Hamilton. “Interestingness Measures for Data Mining:
A Survey.” ACM Computing Surveys. 38.3 (2006): 1-32.
[13] Hahsler, Michael, et al. “Introduction to arules: A Computational Environment for
Mining Association Rules and Frequent Item Sets.” CRAN 16 March 2010.
<http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf>.
[14] Heising, W.P. “Note on Random Addressing Techniques.” IBM Systems Journal. 2.2
(1963): 112-116.
[15] Hsu, W.W., A.J. Smith, and H.C. Young. “Characteristics of Production Database
Workloads and the TPC Benchmarks.” IBM Systems Journal. 40.3 (2001): 781-
802.
[16] Janert, Philipp K. Data Analysis with Open Source Tools. Sebastopol: O’Reilly,
2010.
[17] Loukides, Mike. “What is Data Science?” O’Reilly Radar. 2 June 2010
<http://radar.oreilly.com/2010/06/what-is-data-science.html>.
[18] Nisbet, Robert, John Elder, and Gary Miner. Handbook of Statistical Analysis and
Data Mining. Burlington, MA: Academic Press, 2009.
[19] Ott, R. Lyman, and Michael Longnecker. An Introduction to Statistical Methods and
Data Analysis. 6th ed. Belmont, CA: Brooks/Cole, 2010.
[20] Pendle, Paul. “Solid-State Drives: Changing the Data World.” IBM Data
Management Magazine. Issue 3 (2011): 27-30.