Market Basket Analysis of Database Table References Using R: An
Application to Physical Database Design
Jeffrey Tyzzer1
Summary
Market basket and statistical and informetric analyses are applied to a population of
database queries (SELECT statements) to better understand table usage and co-
occurrence patterns and inform placement on physical media.
Introduction
In 1999 Sally Jo Cunningham and Eibe Frank of the University of Waikato published a
paper titled “Market Basket Analysis of Library Circulation Data” [8]. In it the authors
apply market basket analysis (MBA) to library book circulation data, models of which are
a staple of informetrics, “the application of statistical and mathematical methods in
Library and Information Sciences” [10]. Their paper engendered ideas that led to this
paper, which concerns the application of MBA and statistical and informetric analyses to
a set of database queries, i.e., SELECT statements, to better understand table usage and
co-occurrence patterns.
Market Basket Analysis
Market basket analysis is a data mining technique that applies association rule analysis, a
method of uncovering connections among items in a data set, to supermarket purchases,
with the goal of finding items (i.e., groceries) having a high probability of appearing
together. For instance, a rule induced by MBA might be “in 85% of the baskets where
potato chips appeared, so did root beer.” In the Cunningham and Frank paper, the baskets
were the library checkouts and the groceries were the books. In this paper, the baskets are
the queries and the groceries are the tables referenced in the queries.
MBA was introduced in the seminal paper “Mining Association Rules between Sets of
Items in Large Databases,” by Agrawal et al. [1] and is used by retailers to guide store
layout (for example, placing products having a high probability of appearing in the same
purchase closer together to encourage greater sales) and promotions (e.g., buy one and
get the other half-off). The output of MBA is a set of association rules and attendant
metadata in the form {LHS => RHS}. LHS means “left-hand side” and RHS means
“right-hand side.” These rules are interpreted as “if LHS then RHS,” with the LHS
referred to as the antecedent and the RHS referred to as the consequent. For the potato
chip and root beer example, we’d have {Chips => Root beer}.
1
jefftyzzer AT sbcglobal DOT net
The Project
Two questions directed my investigation:
1. Among the tables, are there a “vital few” [9] that account for the bulk of the table
references in the queries? If so, which ones are they?
2. Which table pairings (co-occurrences) are most frequent within the queries?
The answers to these questions can be used to:
• Steer the placement of tables on physical media2
• Justify denormalization decisions
• Inform the creation of materialized views, table clusters, and aggregates
• Guide partitioning strategies to achieve collocated joins and reduce inter-node
data shipping in distributed databases
• Identify missing indexes to support frequently joined tables3
• Direct the scope, depth, frequency, and priority of table and index statistics
gathering
• Contribute to an organization’s overall corpus of operational intelligence
The data at the focus of this study are metadata for queries executed against a population
of 494 tables within an OLTP database. The queries were captured over a four-day
period. There is an ad hoc query capability within the environment, but such queries are
run against a separate data store, thus the system under study was effectively closed with
respect to random, external queries.
I wrote a Perl program to iterate over the compressed system-generated query log files,
272 in all, cull the tables from each of the SELECT statements within them, and, for
those referencing at least one of the 494 tables, output detail and summary data,
respectively, to two files. Of the 553,139 total statements read, 373,372 met this criterion
(the remainder, a sizable number, were metadata-type statements, e.g., data dictionary
lookups and variable instantiations).
The summary file lists each table and the number of queries it appears in; its structure is
simply {table, count}. The detail file lists {query, table, hour} triples, which were then
imported into a simple table consisting of three corresponding columns. query identifies
the query in which the table is referenced, tabname is the name of the table, and hour
designates the hour the query snapshot was taken, in the range 0-23.
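To make the later steps concrete, here is a minimal R sketch of loading the detail file and deriving the per-table query counts; the file name detail.csv is an assumption, and the column names follow the description above:

# Assumed file name; columns are query, tabname, and hour, as described above
detail <- read.csv("detail.csv", stringsAsFactors = FALSE)

# Number of distinct queries each table is referenced in: the summary file's {table, count}
queriesPerTable <- tapply(detail$query, detail$tabname,
                          function(q) length(unique(q)))

# The 25 most-queried tables, in descending order of query count
head(sort(queriesPerTable, decreasing = TRUE), 25)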
2
Both within a given medium as well as among media with different performance characteristics, e.g.,
tiering storage between disk and solid-state drives (SSD) [15].
3
Adding indexes to support SELECTs may come at the expense of increased INSERT, UPDATE, and
DELETE costs. When adding indexes to a table the full complement of CRUD operations against it must
be considered. The analysis discussed here is easily extended to encompass (other) DML statements as
well.
The “Vital Few”
The 80/20 principle describes a phenomenon in which “20% of a population or group can
explain 80% of an effect” [9]. This principle is widely observed in economics, where it’s
generally referred to as the Pareto principle, and informetrics, where it’s known as
Trueswell’s 80/20 rule [22]. Trueswell argued that 80% of a library’s circulation is
accounted for by 20% of its circulating books [9]. This behavior has also been observed
in computer science contexts, where it’s been noted that 80 percent of the transactions on
a file are applied to the 20 percent most frequently used records within it [14; 15].
I wanted to see if this pattern of skewed access applies to RDBMSs as well, i.e., if
20% of the tables in the database might account for 80% of all table references within
queries.
Figure 1 plots in ascending (rank) order the number of queries each of the 494 tables is
referenced in (space prohibits a table listing the frequencies of the 494 tables), showing a
characteristic reverse-J shape. Figure 2 presents this data as a Lorenz curve, which was
generated using the R package ineq. As [11] puts it, “[t]he straight line represents the
expected distribution” if all tables were queried an equal number of times, with the
curved line indicating the observed distribution. As the figure shows, 20% of the tables
account for a little more than 85% of the table references. Clearly, a subset of tables, the
“vital few,” account for the majority of table references in the queries. These tables get
the most attention query-wise and therefore deserve the most attention performance-wise.
Figure 1 - Plot of query-count-per-table frequency
Figure 2 - Lorenz curve illustrating the 80/20 rule for table references
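The Lorenz curve and the 80/20 figure quoted above can be reproduced along these lines, assuming the per-table query counts from the earlier sketch are in queriesPerTable; ineq is the package named in the text, but the rest is an illustration rather than the author's exact code:

library(ineq)

counts <- as.numeric(queriesPerTable)
plot(Lc(counts), main = "Cumulative Table Query Percentages")  # Lorenz curve, cf. figure 2

# Share of all table references accounted for by the most-queried 20% of tables
sorted <- sort(counts, decreasing = TRUE)
topN   <- ceiling(0.20 * length(sorted))
sum(sorted[1:topN]) / sum(sorted)   # a little more than 0.85 in the paper's data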
The Market Basket Analysis
The 373,372 statements4 mentioned earlier are the table baskets from which the subset of
transactions against the 25 most-queried tables is derived. As a first step toward
transactions against the 25 most-queried tables is derived. As a first step toward
uncovering connections among the tables in the database using MBA, I used the R
package diagram to create a web plot, or annulus [21], of the co-reference relationships
among these 25 tables, shown in figure 3. Note that line thickness in the figure is
proportional to the frequency of co-occurrence. As can be seen, there is a high level of
interconnectedness among these tables. Looking at these connections as a graph, with the
tables as nodes and their co-occurrence in queries as edges, I computed the graph’s
clustering coefficient, the number of actual connections between the nodes divided by the
total possible number of connections [3]. It came to 0.75, an unsurprisingly high value
given what figure 3 illustrates.
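A base-R sketch of that clustering-coefficient calculation, assuming co is a 25 x 25 symmetric matrix of pairwise co-occurrence counts tabulated from the detail data (the matrix itself isn't shown in the paper):

n        <- nrow(co)
edges    <- sum(co[upper.tri(co)] > 0)   # table pairs that co-occur in at least one query
possible <- n * (n - 1) / 2              # all possible pairs among the 25 tables
edges / possible                         # 0.75 in the paper's data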
4
See Appendix B for a sample size formula if a population of table baskets is not already available.
Figure 3 - Web plot of the co-references between the 25 most-queried tables
To mine the tables’ association rules, I used R’s arules package. Before the table
basket data could be analyzed it had to be read into an R data structure, which is done
using the read.transactions() function. The result is a transactions object, an
incidence matrix of the transaction items. To see the structure of the matrix, type the
name of the variable at the R prompt:
> tblTrans
transactions in sparse format with
110071 transactions (rows) and
154 items (columns)
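The call that builds tblTrans isn't shown in the paper; something along these lines would do it, assuming the detail file's first two columns hold the query (transaction) id and the table name, and noting that the author further restricted the baskets to those containing at least two of the top-25 tables:

library(arules)

# format = "single": one {transaction id, item} pair per line; cols gives their positions.
# skip = 1 passes over the assumed header row; rm.duplicates collapses repeated
# references to the same table within a single query.
tblTrans <- read.transactions("detail.csv", format = "single", sep = ",",
                              cols = c(1, 2), skip = 1, rm.duplicates = TRUE)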
Keep in mind the number of transactions (queries) and items (tables) shown here differs
from their respective numbers listed previously because I limited the analysis to just
those table baskets with at least two of the 25 most-queried tables in them. Looking at the
output, that’s 110,071 queries and 154 tables (the top 25 along with 129 others they
appear with).
To see the contents of the tblTrans transaction object, the inspect() function is
used (note I limited inspect() to the first five transactions, as identified by the ASCII
ordering of the transactionID):
> inspect(tblTrans[1:5])
items transactionID
1 {PARTICIPANT,
PARTICIPANT_NAME} 1
2 {ADDRESS,
BIRTH_INFORMATION,
PARTICIPANT,
PARTICIPANT_ADDRESS,
PARTICIPANT_PHONE_NUMBER,
PARTICIPANT_PHYSICAL_ATTRIBUTES,
SOCIAL_SECURITY_NUMBER} 10
3 {CASE_COURT_CASE,
COURT_CASE,
LEGAL_ACTIVITY,
MEDICAL_TERMS,
SUPPORT_ORDER,
TERMS} 100
4 {CASE,
CASE_ACCOUNT,
CASE_ACCOUNT_SUMMARY,
CASE_COURT_CASE} 1000
5 {ADDRESS,
PARTICIPANT,
PARTICIPANT_ADDRESS} 10000
The summary() function provides additional descriptive statistics concerning the make-
up of the table transactions (output format edited slightly to fit):
> summary(tblTrans)
transactions as itemMatrix in sparse format with
110071 rows (elements/itemsets/transactions) and
154 columns (items) and a density of 0.02457888
most frequent items:
CASE CASE_PARTICIPANT PARTICIPANT COURT_CASE CASE_COURT_CASE (Other)
51216 38476 35549 21519 21421 248454
element (itemset/transaction) length distribution:
sizes
2 3 4 5 6 7 8 9 10 11 12 13 14 15 17
33573 29883 15367 12456 9879 3825 1899 603 1064 775 128 38 490 89 2
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 3.000 3.785 5.000 17.000
includes extended item information - examples:
labels
1 ACCOUNT_HOLD_DETAIL
2 ADDRESS
3 ADJUSTMENT
includes extended transaction information - examples:
transactionID
1 1
2 10
3 100
There’s a wealth of information in this output. Note for instance the minimum item
(table) count is two, and the maximum is seventeen, and that there are 33,573
transactions with two items and two transactions with seventeen. While I’m loath to
assign a limit to the maximum number of tables that should ever appear in a query, a
DBA would likely be keen to investigate the double-digit queries for potential tuning
opportunities.
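Those double-digit baskets are easy to pull out with arules; a small sketch, assuming the tblTrans object loaded above:

# Queries referencing ten or more tables: candidates for a tuning review
longTrans <- tblTrans[size(tblTrans) >= 10]
summary(size(longTrans))   # distribution of their basket sizes
inspect(longTrans[1:3])    # peek at a few of them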
Lastly, we can use the itemFrequencyPlot() function to generate an item
frequency distribution. Frequencies can be displayed as relative (percentages), or absolute
(counts). Note that for readability, I limited the plot to the 25 most-frequent items among
the query baskets, by specifying a value for the topN parameter. The command is below,
and the plot is shown in figure 4.
> itemFrequencyPlot(tblTrans, type = "absolute", topN = 25,
+     main = "Frequency Distribution of Top 25 Tables",
+     xlab = "Table Name", ylab = "Frequency")
Figure 4 - Item frequency bar plot among the top 25 tables
With the query baskets loaded, it was then time to generate the table association rules.
The R function within arules that does this is apriori(). apriori() takes up to
four arguments but I only used two: a transaction object, tblTrans, and a list of two
parameters that specify the minimum values for the two rule “interestingness criteria,”5
generality and reliability [12; 16]. Confidence, the first parameter, corresponds to the
reliability criterion and specifies how often the rule is true when the LHS is true, i.e.,

    confidence = countOfBasketsWithLHSandRHSItems / countOfBasketsWithLHSItems.

The second is support, which is a measure of generality and specifies the proportion of all
baskets where the rule is true, i.e.,

    support = countOfBasketsWithLHSandRHSItems / totalCount.

5
This was the “attendant metadata” I mentioned in the Market Basket Analysis section.
A third interestingness criterion, lift, is also useful for evaluating rules and figures
prominently in output from the R-extension package arulesViz (see figure 5).
Paraphrasing [18], lift compares the confidence of a rule with the expected confidence that
the second table will be queried regardless of whether the first table was:

    lift = Confidence(Rule) / Support(RHS),

with Support(RHS) calculated as

    Support(RHS) = countOfBasketsWithRHSItem / totalCount.
Lift indicates the strength of the association over its random co-occurrence. When lift is
greater than 1, the rule is better than guessing at predicting the consequent.
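To make the three measures concrete, here is a small worked example; the 110,071 total comes from the paper, while the other counts are hypothetical, chosen only to illustrate the arithmetic:

n_total <- 110071   # baskets analyzed (from the paper)
n_lhs   <- 13000    # hypothetical: baskets containing the LHS table(s)
n_rhs   <- 21000    # hypothetical: baskets containing the RHS table
n_both  <- 12000    # hypothetical: baskets containing both

confidence <- n_both / n_lhs                  # how often the rule holds when the LHS appears
support    <- n_both / n_total                # share of all baskets where the rule holds
lift       <- confidence / (n_rhs / n_total)  # confidence relative to the RHS's baseline rate
c(confidence = confidence, support = support, lift = lift)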
For confidence I specified .8 and for support I specified .05. The command I ran was
> tblRules <- apriori(tblTrans, parameter = list(supp = .05, conf = .8))
which generated 71 rules. To get a high-level overview of the rules, you can call the
overloaded summary() function against the output of apriori():
> summary(tblRules)
set of 71 rules
rule length distribution (lhs + rhs):sizes
2 3 4 5
11 28 25 7
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 3.000 3.394 4.000 5.000
summary of quality measures:
support confidence lift
Min. :0.05239 Min. :0.8037 Min. :1.728
1st Qu.:0.05678 1st Qu.:0.8585 1st Qu.:2.460
Median :0.06338 Median :0.9493 Median :4.533
Mean :0.07842 Mean :0.9231 Mean :4.068
3rd Qu.:0.08271 3rd Qu.:0.9870 3rd Qu.:5.066
Max. :0.28343 Max. :1.0000 Max. :7.204
mining info:
data ntransactions support confidence
tblTrans 110071 0.05 0.8
To see the rules, execute the inspect() function (note I’m only showing the first and
last five, as sorted by confidence):
> inspect(sort(tblRules, by = "confidence"))
lhs rhs support confidence lift
1 {SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.11643394 1.0000000 5.847065
2 {CASE_COURT_CASE,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.07112682 1.0000000 5.847065
3 {COURT_CASE,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.08326444 1.0000000 5.847065
4 {CASE_PARTICIPANT,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.05753559 1.0000000 5.847065
5 {CASE,
SUPPORT_ORDER} => {LEGAL_ACTIVITY} 0.06406774 1.0000000 5.847065
<snip>
67 {CASE_COURT_CASE,
COURT_CASE,
SUPPORT_ORDER} => {CASE} 0.05678153 0.8088521 1.738347
68 {CASE_COURT_CASE,
COURT_CASE,
LEGAL_ACTIVITY,
SUPPORT_ORDER} => {CASE} 0.05678153 0.8088521 1.738347
69 {CASE_COURT_CASE,
COURT_CASE} => {CASE} 0.13003425 0.8044175 1.728816
70 {CASE_COURT_CASE,
LEGAL_ACTIVITY} => {CASE} 0.07938512 0.8041598 1.728262
71 {CASE,
CASE_COURT_CASE} => {COURT_CASE} 0.13003425 0.8037399 4.111179
Let’s look at the first and last rules and interpret them. The first rule says that over the
period during which the queries were collected, the SUPPORT_ORDER table appeared in
11.64% of the queries and that when it did it was accompanied by the
LEGAL_ACTIVITY table 100% of the time. The last rule, the 71st, says that during this
same period CASE and CASE_COURT_CASE appeared in 13% of the queries and that
they were accompanied by COURT_CASE 80.37% of the time.
While it’s not visible from the subset shown, all 71 of the generated rules have a single-
item consequent. This is fortunate, and is not always the case, as such rules are “the most
actionable” in practice compared to rules with compound consequents [4].
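Because every consequent is a single table, gathering all of the rules that point at a given table is straightforward; a sketch using arules, with LEGAL_ACTIVITY chosen here purely as an example:

# All rules whose consequent is LEGAL_ACTIVITY, strongest lift first
legalRules <- subset(tblRules, subset = rhs %in% "LEGAL_ACTIVITY")
inspect(sort(legalRules, by = "lift"))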
Figure 5, generated using arulesViz, is a scatter plot of the support and confidence of
the 71 rules generated by arules. Here we see the majority of the rules are in the 0.05-
0.15 support range, meaning between 5% and 15% of the 110,071 queries analyzed
contain all of the tables represented in the rule.
Figure 5 - Plot of the interestingness measures for the 71 generated rules
An illuminating visualization is shown in Figure 6, also generated by the arulesViz
package. This figure plots the rule antecedents on the x-axis and their consequents on the
y-axis. To economize on space, the table names aren’t displayed but are instead numbered,
with the numbering keyed to output that accompanies the graph in the main R window.
Looking at the plot, two things immediately stand out: the presence of four large rule
groups, and that only nine tables (the y-axis) account for the consequents in all 71 rules.
These nine tables are the “nuclear” tables around which all the others orbit, the most vital
of the vital few.
Figure 6 - Plot of table prevalence of rules
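The plotting commands behind figures 5 and 6 aren't reproduced in the paper; calls along these lines produce comparable views with arulesViz, though the exact arguments the author used are an assumption:

library(arulesViz)

# Scatter plot of support vs. confidence, shaded by lift (cf. figure 5)
plot(tblRules, measure = c("support", "confidence"), shading = "lift")

# Matrix view with antecedents on the x-axis and consequents on the y-axis (cf. figure 6)
plot(tblRules, method = "matrix", measure = "lift")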
Another Way: Odds Ratios
As the final step, I computed the odds ratios between all existing pairings occurring
among the top-25 tables, which numbered 225. Odds is the ratio of the probability of an
event’s occurrence to the probability of its non-occurrence, and the odds ratio is the ratio
of the odds of two events (e.g., two tables co-occurring in a given query vs. each table
appearing without the other) [19]. To compute the odds ratios, I used the cc() function
from the epicalc R package which, when given a 2x2 contingency table (see table 1,
generated with the R CrossTable() function), outputs the following (the counts
shown are of the pairing of the PARTICIPANT and CASE_PARTICIPANT tables):
FALSE TRUE Total
FALSE 408021 50902 458923
TRUE 56963 21745 78708
Total 464984 72647 537631
OR = 3.06
Exact 95% CI = 3.01, 3.12
Chi-squared = 15719.49, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0
For these two tables, the odds ratio is 3.06, with a 95% confidence interval of 3.01 and
3.12. An odds ratio greater than 1 indicates a positive association, and the further above 1
it is, the stronger the association; here the confidence interval excludes 1, so the
association is statistically significant.
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 537631
| CASE_PARTICIPANT
PARTICIPANT | FALSE | TRUE | Row Total |
--------------------------------------|-----------|-----------|-----------|
FALSE | 408021 | 50902 | 458923 |
| 310.961 | 1990.337 | |
| 0.889 | 0.111 | 0.854 |
| 0.877 | 0.701 | |
| 0.759 | 0.095 | |
--------------------------------------|-----------|-----------|-----------|
TRUE | 56963 | 21745 | 78708 |
| 1813.123 | 11605.065 | |
| 0.724 | 0.276 | 0.146 |
| 0.123 | 0.299 | |
| 0.106 | 0.040 | |
--------------------------------------|-----------|-----------|-----------|
Column Total | 464984 | 72647 | 537631 |
| 0.865 | 0.135 | |
--------------------------------------|-----------|-----------|-----------|
Table 1 - 2x2 contingency table
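For readers without epicalc at hand, the odds ratio in Table 1 can be reproduced in base R from the four cell counts; a sketch:

# 2x2 contingency table from Table 1: PARTICIPANT (rows) vs. CASE_PARTICIPANT (columns)
m <- matrix(c(408021, 50902,
              56963,  21745),
            nrow = 2, byrow = TRUE,
            dimnames = list(PARTICIPANT      = c("FALSE", "TRUE"),
                            CASE_PARTICIPANT = c("FALSE", "TRUE")))

(m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])  # cross-product odds ratio, approximately 3.06
fisher.test(m)                             # exact test, with a confidence interval for the OR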
I tabulated the odds ratios of the 225 pairings, and table 2 shows the first 25, ordered by
the odds ratio in descending order. The top four odds ratios show as Inf--infinite--because
either the FALSE/TRUE or TRUE/FALSE cell in their respective 2x2 contingency table
was 0, indicating that among the 537,631 observations (rows) analyzed, the one table
never appeared without the other.
The results shown in table 2 differ from what figure 3 depicts. The former shows the
strongest link between the CASE and CASE_PARTICIPANT tables, whereas for the
latter it’s SUPPORT_ORDER and SOCIAL_SECURITY_NUMBER. This is because the
line thicknesses in figure 3 are based solely on the count of co-references (analogous to
the TRUE/TRUE cell in the 2x2 contingency table), whereas odds ratios consider the pair
counts relative to each other, i.e., they take into account the other three cells--
FALSE/FALSE, FALSE/TRUE, and TRUE/FALSE--as well.
First Table Second Table OR
SUPPORT_ORDER SOCIAL_SECURITY_NUMBER Inf
SOCIAL_SECURITY_NUMBER PARTICIPANT_PUBLIC_ASSISTANCE_CASE Inf
CASE_ACCOUNT CASE Inf
BIRTH_INFORMATION ADDRESS Inf
PARTICIPANT_EMPLOYER EMPLOYER_ADDRESS 209.014
EMPLOYER_ADDRESS EMPLOYER 54.854
ORG_UNIT INTERNAL_USER 53.583
PARTICIPANT_EMPLOYER EMPLOYER 37.200
SOCIAL_SECURITY_NUMBER PARTICIPANT_NAME 36.551
PARTICIPANT_RELATIONSHIP PARTICIPANT_NAME 34.779
CHARGING_INSTRUCTION CASE_ACCOUNT 27.309
SOCIAL_SECURITY_NUMBER PARTICIPANT_RELATIONSHIP 26.885
CHARGING_INSTRUCTION CASE_ACCOUNT_SUMMARY 25.040
PARTICIPANT LOGICAL_COLLECTION_TRANSACTION 23.281
COMBINED_LOG_TEMP_ENTRY CHARGING_INSTRUCTION 22.357
PARTICIPANT_NAME PARTICIPANT 9.739
SOCIAL_SECURITY_NUMBER PARTICIPANT 8.799
PARTICIPANT_PUBLIC_ASSISTANCE_CASE COMBINED_LOG_TEMP_ENTRY 8.578
CASE_COURT_CASE CASE 7.913
PARTICIPANT_ADDRESS PARTICIPANT 7.869
PARTICIPANT COMBINED_LOG_TEMP_ENTRY 7.861
COURT_CASE CASE_COURT_CASE 7.525
PARTICIPANT_NAME PARTICIPANT_EMPLOYER 7.454
SOCIAL_SECURITY_NUMBER PARTICIPANT_ADDRESS 7.420
PARTICIPANT_NAME ORG_UNIT 7.346
Table 2 - Top 25 table-pair odds ratios
Conclusion
In this paper, I’ve described a holistic process for identifying the most-queried “vital
few” tables in a database, uncovering their usage patterns and interrelationships, and
guiding their placement on physical media.
First I captured query metadata and parsed it for further analysis. I then established that
there are a “vital few” tables that account for the majority of query activity. Finally, I
used MBA supplemented with other methods to understand the co-reference patterns
among these tables, which may in turn inform their layout on storage media.
My hope is that I’ve described what I did in enough detail that you can adapt it, extend it,
and improve upon it to better the performance of your databases and applications.
Appendix A - Data Placement on Physical Media
Storage devices such as disks maximize throughput by minimizing access time [5], and a
fundamental part of physical database design is the allocation of database objects to such
physical media--deciding where schema objects should be placed on disk to maximize
performance by minimizing disk seek time and rotational latency. The former is a
function of the movement of a disk’s read/write head arm assembly and the latter is
dependent on its rotations per minute. Both are electromechanical absolutes, although
their speeds vary from disk to disk.
As is well known, disk access is several orders of magnitude slower than RAM access--
estimates range from four to six [ibid.], and this relative disparity is no less true today
than it was when the IBM 350 Disk Storage Unit was introduced in 1956. So while this
topic may seem like a bit of a chestnut in the annals of physical database design, it
remains a germane topic. The presence of such compensatory components and strategies
as bufferpools, defragmenting, table reorganizations, and read ahead prefetching in the
architecture of modern RDBMSs underscores this point [20]. The fact is read/write heads
can only be in one place on the disk platter at a time. Solid-state drives (SSD) offer
potential relief here, but data volumes are rising at a rate much faster than that at which
SSD prices are falling.
Coupled with this physical reality is a fiscal one, as it’s been estimated that anywhere
between 16% and 40% of IT budget outlay is committed to storage [6; 23]. In light of such
costs, it makes good financial sense for an organization to be a wise steward of this
resource and seek its most efficient use.
If one is designing a new database as part of a larger application development initiative,
then such tried-and-true tools as CRUD (create, read, update, delete) matrices and entity
affinity analysis can assist with physical table placement. Such techniques quickly become
tedious, however, and therefore error-prone, and these early placement decisions are at best
educated guesses. What would be useful is an automated, holistic approach to help refine
the placement of tables as the full complement of queries comes on line and later as it
changes over the lifetime of the application, without incurring extra storage costs. The
present paper is, of course, an attempt at such an approach.
As to where to locate data on disk media in general, the rule of thumb is to place the
high-use tables on the middle tracks of the disk given this location has the smallest
average distance to all other tracks. In the case of disks employing zone-bit recording
(ZBR), as practically all now do, the recommendation is to place the high-frequency
tables, say, the vital few 20%, on the outermost cylinders, as the raw transfer rate is
higher there since the bits are more densely packed. This idea can be extended further by
placing tables typically co-accessed in queries in the outermost zones on separate disk
drives [2], minimizing read/write head contention and enabling query parallelism. If
zone-level disk placement specificity is not an option, separating co-accessed vital few
tables onto separate media is still a worthwhile practice.
Appendix B - How many baskets?
For this analysis, performed on modest hardware, I used all of the snapshot data I had
available. Indeed, one of the precepts of the burgeoning field of data science [17] is that
with today’s commodity hardware, computing power, and addressable memory sizes, we
no longer have to settle for samples. Plus, when it comes to the data we’ve been
discussing, sampling risks overlooking its long-tailed aspect [7]. Nonetheless, there may
still be instances where analyzing all of the snapshots at your disposal isn’t practicable, or
you may want to know at the outset of an analysis how many statements you’ll need to
capture to get a good representation of the query population (and therefore greater
statistical power), if you don’t have a ready pool of snapshots from which to draw. In
either case, you need to know how large your sample needs to be for a robust analysis.
Sample size formulas exist for more conventional hypothesis testing, but Zaki, et al. [24]
give a more suitable sample size formula, one specific to market basket analysis:
    n = -2 ln(c) / (τ ε²),
where n is the sample size, c is 1 - α, (α being the confidence level), ε is the acceptable
level of inaccuracy, and τ is the minimum required support [13]. Using this equation,
with 80% confidence (c = .20), 95% accuracy (ε = .05), and 5% support (τ = .05), the
sample size recommendation is 25,751.
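The arithmetic is easy to verify in R with the values just given:

c_  <- 0.20   # c = 1 - alpha, i.e., 80% confidence
eps <- 0.05   # acceptable level of inaccuracy
tau <- 0.05   # minimum support
n   <- -2 * log(c_) / (tau * eps^2)
round(n)      # 25751, matching the recommendation above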
Using this sample size, I ran the sample() function from the arules package against
the tblTrans transaction object, the results of which I then used as input to the
apriori() function to generate a new set of rules. This time, 72 rules were generated.
Figure 7 shows the high degree of correspondence between the relative frequencies of the
tables in the sample (bars) and the population (line).
Figure 7 - Relative frequencies of the tables in the sample vs. in the population.
Figure 8 plots the regression lines of the confidence and support of the rules generated
from the sample (black) and the population (grey). Again, notice the high degree of
correspondence.
Figure 8 - Correspondence between the support and confidence of the sample and population rules
References
[1] Agrawal, Rakesh, et al. “Mining Association Rules Between Sets of Items in Large
Databases.” Proceedings of the 1993 ACM SIGMOD International Conference on
Management of Data. pp. 207-216.
[2] Agrawal, Sanjay, et al. “Automating Layout of Relational Databases.” Proceedings of
the 19th International Conference on Data Engineering (ICDE ’03). (2003): 607-618.
[3] Barabási, Albert-László and Zoltán N. Oltvai. “Network Biology: Understanding the
Cell’s Functional Organization.” Nature Reviews Genetics. 5.2 (2004): 101-113.
[4] Berry, Michael J. and Gordon Linoff. Data Mining Techniques: For Marketing, Sales,
and Customer Support. New York: John Wiley & Sons, 1997.
[5] Blanchette, Jean-François. “A Material History of Bits.” Journal of the American
Society For Information Science and Technology. 62.6 (2011): 1042-1057.
[6] Butts, Stuart. “How to Use Single Instancing to Control Storage Expense.” eWeek 03
August 2009. 22 December 2010 <http://mobile.eweek.com/c/a/Green-IT/How-
to-Use-Single-Instancing-to-Control-Storage-Expense/>.
[7] Cohen, Jeffrey, et al. “MAD Skills: New Analysis Practices for Big Data.” Proceedings
of the VLDB Endowment. 2.2 (2009): 1481-1492.
[8] Cunningham, Sally Jo and Eibe Frank. “Market Basket Analysis of Library
Circulation Data.” Proceedings of the Sixth International Conference on Neural
Information Processing (1999). Vol. II, pp. 825-830.
[9] Eldredge, Jonathan D. “The Vital Few Meet the Trivial Many: Unexpected Use
Patterns in a Monographs Collection.” Bulletin of the Medical Library
Association. 86.4 (1998): 496-503.
[10] Erar, Aydin. “Bibliometrics or Informetrics: Displaying Regularity in Scientific
Patterns by Using Statistical Distributions.” Hacettepe Journal of Mathematics
and Statistics. 31 (2002): 113-125.
[11] Fu, W. Wayne and Clarice C Sim. “Aggregate Bandwagon Effect on Online Videos'
Viewership: Value Uncertainty, Popularity Cues, and Heuristics.” Journal of the
American Society For Information Science and Technology. 62.12 (2011): 2382-
2395.
[12] Geng, Liqiang and Howard J. Hamilton. “Interestingness Measures for Data Mining:
A Survey.” ACM Computing Surveys. 38.3 (2006): 1-32.
[13] Hahsler, Michael, et al. “Introduction to arules: A Computational Environment for
Mining Association Rules and Frequent Item Sets.” CRAN 16 March 2010.
<http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf>.
[14] Heising, W.P. “Note on Random Addressing Techniques.” IBM Systems Journal. 2.2
(1963): 112-6.
[15] Hsu, W.W., A.J. Smith, and H.C. Young. “Characteristics of Production Database
Workloads and the TPC Benchmarks.” IBM Systems Journal. 40.3 (2001): 781-
802.
[16] Janert, Philipp, K. Data Analysis with Open Source Tools. Sebastopol: O’Reilly,
2010.
[17] Loukides, Mike. “What is Data Science?” O’Reilly Radar. 2 June 2010
http://radar.oreilly.com/2010/06/what-is-data-science.html
[18] Nisbet, Robert, John Elder, and Gary Miner. Handbook of Statistical Analysis and
Data Mining. Burlington, MA: Academic Press, 2009.
[19] Ott, R. Lyman, and Michael Longnecker. An Introduction to Statistical Methods and
Data Analysis. 6th ed. Belmont, CA: Brooks/Cole, 2010.
[20] Pendle, Paul. “Solid-State Drives: Changing the Data World.” IBM Data
Management Magazine. Issue 3 (2011): 27-30.
[21] Spence, Robert. Information Visualization. 2nd ed. London: Pearson, 2007.
[22] Trueswell, Richard L. “Some Behavioral Patterns of Library Users: the 80/20 Rule.”
Wilson Library Bulletin. 43.5 (1969): 458-461.
[23] Whitely, Robert. “Buyer’s Guide to Infrastructure: Three Steps to IT
Reorganisation.” ComputerWeekly.com 01 September 2010. 22 December 2010
<http://www.computerweekly.com/feature/Buyers-Guide-to-infrastructure-Three-
steps-to-IT-reorganisation >.
[24] Zaki, Mohammed Javeed, et al. “Evaluation of Sampling for Data Mining of
Association Rules.” Proceedings of the 7th International Workshop on Research
Issues in Data Engineering (RIDE '97) High Performance Database Management
for Large-Scale Applications. (1997): 42-50.
Copyright ©2014 by Jeffrey K. Tyzzer
More Related Content

What's hot

Data mininng trends
Data mininng trendsData mininng trends
Data mininng trends
VijayasankariS
 
Dwbi Project
Dwbi ProjectDwbi Project
Dwbi Project
Sonali Gupta
 
Unit 8 vocabulary words
Unit 8 vocabulary wordsUnit 8 vocabulary words
Unit 8 vocabulary words
alicia_roberts
 
Experiment no 6
Experiment no 6Experiment no 6
Experiment no 6
ganeshhogade
 
59172888 introduction-to-statistics-independent-study-requirements-2nd-sem-20...
59172888 introduction-to-statistics-independent-study-requirements-2nd-sem-20...59172888 introduction-to-statistics-independent-study-requirements-2nd-sem-20...
59172888 introduction-to-statistics-independent-study-requirements-2nd-sem-20...
homeworkping3
 
Dimensional data modeling
Dimensional data modelingDimensional data modeling
Dimensional data modeling
Adam Hutson
 
Commonly used excel formulas
Commonly used excel formulasCommonly used excel formulas
Commonly used excel formulas
saladi330
 
04 olap
04 olap04 olap
03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
JoonyoungJayGwak
 
8 i index_tables
8 i index_tables8 i index_tables
8 i index_tables
Anil Pandey
 
Phd coursestatalez2datamanagement
Phd coursestatalez2datamanagementPhd coursestatalez2datamanagement
Phd coursestatalez2datamanagement
Marco Delogu
 
Ix calc ii charts
Ix calc ii chartsIx calc ii charts
Ix calc ii charts
Archana Dwivedi
 
Use of-Excel
Use of-ExcelUse of-Excel
Use of-Excel
Brisbane
 
Spreadsheet basics ppt
Spreadsheet basics pptSpreadsheet basics ppt
Spreadsheet basics ppt
Tammy Carter
 
Cis145 Final Review
Cis145 Final ReviewCis145 Final Review
Data warehouse logical design
Data warehouse logical designData warehouse logical design
Data warehouse logical design
Er. Nawaraj Bhandari
 
Chapter 2
Chapter 2Chapter 2
Chapter 2
MaryWall14
 
introduction to spss
introduction to spssintroduction to spss
introduction to spss
Omid Minooee
 
Index Tuning
Index TuningIndex Tuning
Index Tuning
sqlserver.co.il
 
Introduction to spss
Introduction to spssIntroduction to spss
Introduction to spss
rkalidasan
 

What's hot (20)

Data mininng trends
Data mininng trendsData mininng trends
Data mininng trends
 
Dwbi Project
Dwbi ProjectDwbi Project
Dwbi Project
 
Unit 8 vocabulary words
Unit 8 vocabulary wordsUnit 8 vocabulary words
Unit 8 vocabulary words
 
Experiment no 6
Experiment no 6Experiment no 6
Experiment no 6
 
59172888 introduction-to-statistics-independent-study-requirements-2nd-sem-20...
59172888 introduction-to-statistics-independent-study-requirements-2nd-sem-20...59172888 introduction-to-statistics-independent-study-requirements-2nd-sem-20...
59172888 introduction-to-statistics-independent-study-requirements-2nd-sem-20...
 
Dimensional data modeling
Dimensional data modelingDimensional data modeling
Dimensional data modeling
 
Commonly used excel formulas
Commonly used excel formulasCommonly used excel formulas
Commonly used excel formulas
 
04 olap
04 olap04 olap
04 olap
 
03 preprocessing
03 preprocessing03 preprocessing
03 preprocessing
 
8 i index_tables
8 i index_tables8 i index_tables
8 i index_tables
 
Phd coursestatalez2datamanagement
Phd coursestatalez2datamanagementPhd coursestatalez2datamanagement
Phd coursestatalez2datamanagement
 
Ix calc ii charts
Ix calc ii chartsIx calc ii charts
Ix calc ii charts
 
Use of-Excel
Use of-ExcelUse of-Excel
Use of-Excel
 
Spreadsheet basics ppt
Spreadsheet basics pptSpreadsheet basics ppt
Spreadsheet basics ppt
 
Cis145 Final Review
Cis145 Final ReviewCis145 Final Review
Cis145 Final Review
 
Data warehouse logical design
Data warehouse logical designData warehouse logical design
Data warehouse logical design
 
Chapter 2
Chapter 2Chapter 2
Chapter 2
 
introduction to spss
introduction to spssintroduction to spss
introduction to spss
 
Index Tuning
Index TuningIndex Tuning
Index Tuning
 
Introduction to spss
Introduction to spssIntroduction to spss
Introduction to spss
 

Similar to Market Basket Analysis of Database Table References Using R

An improvised frequent pattern tree
An improvised frequent pattern treeAn improvised frequent pattern tree
An improvised frequent pattern tree
IJDKP
 
Database
DatabaseDatabase
Database
Respa Peter
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
Dr Sandeep Kumar Poonia
 
Introduction to database
Introduction to databaseIntroduction to database
Introduction to database
Suleman Memon
 
A Study of Various Projected Data Based Pattern Mining Algorithms
A Study of Various Projected Data Based Pattern Mining AlgorithmsA Study of Various Projected Data Based Pattern Mining Algorithms
A Study of Various Projected Data Based Pattern Mining Algorithms
ijsrd.com
 
A Novel preprocessing Algorithm for Frequent Pattern Mining in Multidatasets
A Novel preprocessing Algorithm for Frequent Pattern Mining in MultidatasetsA Novel preprocessing Algorithm for Frequent Pattern Mining in Multidatasets
A Novel preprocessing Algorithm for Frequent Pattern Mining in Multidatasets
Waqas Tariq
 
System Data Modelling Tools
System Data Modelling ToolsSystem Data Modelling Tools
System Data Modelling Tools
Liam Dunphy
 
A literature review of modern association rule mining techniques
A literature review of modern association rule mining techniquesA literature review of modern association rule mining techniques
A literature review of modern association rule mining techniques
ijctet
 
A Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association MiningA Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association Mining
IOSR Journals
 
Data Warehouse ( Dw Of Dwh )
Data Warehouse ( Dw Of Dwh )Data Warehouse ( Dw Of Dwh )
Data Warehouse ( Dw Of Dwh )
Jenny Calhoon
 
Efficient top k retrieval on massive data
Efficient top k retrieval on massive dataEfficient top k retrieval on massive data
Efficient top k retrieval on massive data
Pvrtechnologies Nellore
 
Data warehousing interview_questionsandanswers
Data warehousing interview_questionsandanswersData warehousing interview_questionsandanswers
Data warehousing interview_questionsandanswers
Sourav Singh
 
The D-basis Algorithm for Association Rules of High Confidence
The D-basis Algorithm for Association Rules of High ConfidenceThe D-basis Algorithm for Association Rules of High Confidence
The D-basis Algorithm for Association Rules of High Confidence
ITIIIndustries
 
Review on: Techniques for Predicting Frequent Items
Review on: Techniques for Predicting Frequent ItemsReview on: Techniques for Predicting Frequent Items
Review on: Techniques for Predicting Frequent Items
vivatechijri
 
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASESTRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
cscpconf
 
7.pptx
7.pptx7.pptx
7.pptx
SaidIsaq
 
2. Chapter Two.pdf
2. Chapter Two.pdf2. Chapter Two.pdf
2. Chapter Two.pdf
fikadumola
 
Applied systems 1 vocabulary
Applied systems 1 vocabularyApplied systems 1 vocabulary
Applied systems 1 vocabulary
Paola Rincón
 
Running Head PROJECT DELIVERABLE 31PROJECT DELIVERABLE 310.docx
Running Head PROJECT DELIVERABLE 31PROJECT DELIVERABLE 310.docxRunning Head PROJECT DELIVERABLE 31PROJECT DELIVERABLE 310.docx
Running Head PROJECT DELIVERABLE 31PROJECT DELIVERABLE 310.docx
todd581
 
T-SQL Overview
T-SQL OverviewT-SQL Overview
T-SQL Overview
Ahmed Elbaz
 

Similar to Market Basket Analysis of Database Table References Using R (20)

An improvised frequent pattern tree
An improvised frequent pattern treeAn improvised frequent pattern tree
An improvised frequent pattern tree
 
Database
DatabaseDatabase
Database
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
 
Introduction to database
Introduction to databaseIntroduction to database
Introduction to database
 
A Study of Various Projected Data Based Pattern Mining Algorithms
A Study of Various Projected Data Based Pattern Mining AlgorithmsA Study of Various Projected Data Based Pattern Mining Algorithms
A Study of Various Projected Data Based Pattern Mining Algorithms
 
A Novel preprocessing Algorithm for Frequent Pattern Mining in Multidatasets
A Novel preprocessing Algorithm for Frequent Pattern Mining in MultidatasetsA Novel preprocessing Algorithm for Frequent Pattern Mining in Multidatasets
A Novel preprocessing Algorithm for Frequent Pattern Mining in Multidatasets
 
System Data Modelling Tools
System Data Modelling ToolsSystem Data Modelling Tools
System Data Modelling Tools
 
A literature review of modern association rule mining techniques
A literature review of modern association rule mining techniquesA literature review of modern association rule mining techniques
A literature review of modern association rule mining techniques
 
A Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association MiningA Quantified Approach for large Dataset Compression in Association Mining
A Quantified Approach for large Dataset Compression in Association Mining
 
Data Warehouse ( Dw Of Dwh )
Data Warehouse ( Dw Of Dwh )Data Warehouse ( Dw Of Dwh )
Data Warehouse ( Dw Of Dwh )
 
Efficient top k retrieval on massive data
Efficient top k retrieval on massive dataEfficient top k retrieval on massive data
Efficient top k retrieval on massive data
 
Data warehousing interview_questionsandanswers
Data warehousing interview_questionsandanswersData warehousing interview_questionsandanswers
Data warehousing interview_questionsandanswers
 
The D-basis Algorithm for Association Rules of High Confidence
The D-basis Algorithm for Association Rules of High ConfidenceThe D-basis Algorithm for Association Rules of High Confidence
The D-basis Algorithm for Association Rules of High Confidence
 
Review on: Techniques for Predicting Frequent Items
Review on: Techniques for Predicting Frequent ItemsReview on: Techniques for Predicting Frequent Items
Review on: Techniques for Predicting Frequent Items
 
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASESTRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
TRANSFORMATION RULES FOR BUILDING OWL ONTOLOGIES FROM RELATIONAL DATABASES
 
7.pptx
7.pptx7.pptx
7.pptx
 
2. Chapter Two.pdf
2. Chapter Two.pdf2. Chapter Two.pdf
2. Chapter Two.pdf
 
Applied systems 1 vocabulary
Applied systems 1 vocabularyApplied systems 1 vocabulary
Applied systems 1 vocabulary
 
Running Head PROJECT DELIVERABLE 31PROJECT DELIVERABLE 310.docx
Running Head PROJECT DELIVERABLE 31PROJECT DELIVERABLE 310.docxRunning Head PROJECT DELIVERABLE 31PROJECT DELIVERABLE 310.docx
Running Head PROJECT DELIVERABLE 31PROJECT DELIVERABLE 310.docx
 
T-SQL Overview
T-SQL OverviewT-SQL Overview
T-SQL Overview
 

Recently uploaded

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
SaffaIbrahim1
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
wyddcwye1
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 

Recently uploaded (20)

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docxDATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
DATA COMMS-NETWORKS YR2 lecture 08 NAT & CLOUD.docx
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 

Market Basket Analysis of Database Table References Using R

  • 1. Market Basket Analysis of Database Table References Using R: An Application to Physical Database Design Jeffrey Tyzzer1 Summary Market basket and statistical and informetric analyses are applied to a population of database queries (SELECT statements) to better understand table usage and co- occurrence patterns and inform placement on physical media. Introduction In 1999 Sally Jo Cunningham and Eibe Frank of the University of Waikato published a paper titled “Market Basket Analysis of Library Circulation Data” [8]. In it the authors apply market basket analysis (MBA) to library book circulation data, models of which are a staple of informetrics, “the application of statistical and mathematical methods in Library and Information Sciences” [10]. Their paper engendered ideas that led to this paper, which concerns the application of MBA and statistical and informetric analyses to a set of database queries, i.e. SELECT statements, to better understand table usage and co-occurrence patterns. Market Basket Analysis Market basket analysis is a data mining technique that applies association rule analysis, a method of uncovering connections among items in a data set, to supermarket purchases, with the goal of finding items (i.e., groceries) having a high probability of appearing together. For instance, a rule induced by MBA might be “in 85% of the baskets where potato chips appeared, so did root beer.” In the Cunningham and Frank paper, the baskets were the library checkouts and the groceries were the books. In this paper, the baskets are the queries and the groceries are the tables referenced in the queries. MBA was introduced in the seminal paper “Mining Association Rules between Sets of Items in Large Databases,” by Agrawal et al. [1] and is used by retailers to guide store layout (for example, placing products having a high probability of appearing in the same purchase closer together to encourage greater sales) and promotions (e.g., buy one and get the other half-off). The output of MBA is a set of association rules and attendant metadata in the form {LHS => RHS}. LHS means “left-hand side” and RHS means “right-hand side.” These rules are interpreted as “if LHS then RHS,” with the LHS referred to as the antecedent and the RHS referred to as the consequent. For the potato chip and root beer example, we’d have {Chips => Root beer}. 1 jefftyzzer AT sbcglobal DOT net
  • 2. 2 The Project Two questions directed my investigation: 1. Among the tables, are there a “vital few” [9] that account for the bulk of the table references in the queries? If so, which ones are they? 2. Which table pairings (co-occurrences) are most frequent within the queries? The answers to these questions can be used to: • Steer the placement of tables on physical media2 • Justify denormalization decisions • Inform the creation of materialized views, table clusters, and aggregates • Guide partitioning strategies to achieve collocated joins and reduce inter-node data shipping in distributed databases • Identify missing indexes to support frequently joined tables3 • Direct the scope, depth, frequency, and priority of table and index statistics gathering • Contribute to an organization’s overall corpus of operational intelligence The data at the focus of this study are metadata for queries executed against a population of 494 tables within an OLTP database. The queries were captured over a four-day period. There is an ad hoc query capability within the environment, but such queries are run against a separate data store, thus the system under study was effectively closed with respect to random, external, queries. I wrote a Perl program to iterate over the compressed system-generated query log files, 272 in all, cull the tables from each of the SELECT statements within them, and, for those referencing at least one of the 494 tables, output detail and summary data, respectively, to two files. Of the 553,139 total statements read, 373,372 met this criterion (the remainder, a sizable number, were metadata-type statements, e.g., data dictionary lookups and variable instantiations). The summary file lists each table and the number of queries it appears in; its structure is simply {table, count}. The detail file lists {query, table, hour} triples, which were then imported into a simple table consisting of three corresponding columns. query identifies the query in which the table is referenced, tabname is the name of the table, and hour designates the hour the query snapshot was taken, in the range 0-23. 2 Both within a given medium as well as among media with different performance characteristics, e.g., tiering storage between disk and solid-state drives (SSD) [15]. 3 Adding indexes to support SELECTs may come at the expense of increased INSERT, UPDATE, and DELETE costs. When adding indexes to a table the full complement of CRUD operations against it must be considered. The analysis discussed here is easily extended to encompass (other) DML statements as well.
The "Vital Few"

The 80/20 principle describes a phenomenon in which "20% of a population or group can explain 80% of an effect" [9]. This principle is widely observed in economics, where it's generally referred to as the Pareto principle, and in informetrics, where it's known as Trueswell's 80/20 rule [22]. Trueswell argued that 80% of a library's circulation is accounted for by 20% of its circulating books [9]. The same behavior has been observed in computer science contexts, where it's been noted that 80 percent of the transactions on a file are applied to the 20 percent most frequently used records within it [14; 15]. I wanted to see whether this pattern of skewed access holds for RDBMSs as well, i.e., whether 20% of the tables in the database might account for 80% of all table references within queries.

Figure 1 plots, in ascending (rank) order, the number of queries each of the 494 tables is referenced in (space prohibits a table listing the frequencies of all 494 tables), showing a characteristic reverse-J shape. Figure 2 presents the same data as a Lorenz curve, generated using the R package ineq. As [11] puts it, "[t]he straight line represents the expected distribution" if all tables were queried an equal number of times, with the curved line indicating the observed distribution. As the figure shows, 20% of the tables account for a little more than 85% of the table references. Clearly, a subset of tables, the "vital few," accounts for the majority of table references in the queries. These tables get the most attention query-wise and therefore deserve the most attention performance-wise.

Figure 1 - Plot of query-count-per-table frequency
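The paper does not show the ineq code behind figure 2. As a sketch of how its headline number can be reproduced, assuming tbl_counts holds the per-table query counts from the summary data (the variable name carries over from the earlier sketch and is an assumption):

library(ineq)   # provides Lc() for Lorenz curves

counts <- as.numeric(tbl_counts)   # per-table query counts
lc <- Lc(counts)                   # lc$p: cumulative share of tables; lc$L: cumulative share of references
plot(lc, main = "Cumulative Table Query Percentages")

# Share of all table references accounted for by the most-queried 20% of tables;
# for the data described in the text this comes out to roughly 0.85.
i <- which.min(abs(lc$p - 0.80))
1 - lc$L[i]

Note that the stock Lorenz plot puts the cumulative table share on the x-axis, the transpose of the paper's figure 2; the computed share is the same either way.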
Figure 2 - Lorenz curve illustrating the 80/20 rule for table references
[The plot is titled "Cumulative Table Query Percentages"; the x-axis is the cumulative percentage of table references, the y-axis the cumulative percentage of the table population, with the point (.8525, .2) annotated.]

The Market Basket Analysis

The 373,372 statements mentioned earlier (see footnote 4) are the table baskets from which the subset of transactions against the 25 most-queried tables is derived. As a first step toward uncovering connections among the tables in the database using MBA, I used the R package diagram to create a web plot, or annulus [21], of the co-reference relationships among these 25 tables, shown in figure 3. Line thickness in the figure is proportional to the frequency of co-occurrence. As can be seen, there is a high level of interconnectedness among these tables. Treating these connections as a graph, with the tables as nodes and their co-occurrence in queries as edges, I computed the graph's clustering coefficient, that is, the number of actual connections between the nodes divided by the total possible number of connections [3]. It came out to 0.75, an unsurprisingly high value given what figure 3 illustrates.

4 See Appendix B for a sample size formula if a population of table baskets is not already available.
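The clustering-coefficient calculation is not shown in the paper. One way to get the same actual-to-possible-edges ratio is with igraph, given a data frame of distinct co-occurring table pairs; this is a sketch only, and the three pairs listed are placeholders standing in for the real pair list derived from the detail data:

library(igraph)

# One row per distinct pair of top-25 tables that co-occur in at least one query;
# these rows are illustrative placeholders.
pairs <- data.frame(t1 = c("CASE",             "CASE",       "PARTICIPANT"),
                    t2 = c("CASE_PARTICIPANT", "COURT_CASE", "PARTICIPANT_NAME"))

g <- graph_from_data_frame(pairs, directed = FALSE)
edge_density(g)     # actual edges / possible edges; 0.75 for the full 25-table graph per the text
# transitivity(g)   # the more common "clustering coefficient" definition, for comparison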
Figure 3 - Web plot of the co-references between the 25 most-queried tables
[The plot is titled "Co-reference Patterns Among the 25 Most-queried Tables"; its nodes are labeled with the 25 table names.]

To mine the tables' association rules, I used R's arules package. Before the table basket data could be analyzed it had to be read into an R data structure, which is done using the read.transactions() function. The result is a transactions object, an incidence matrix of the transaction items. To see the structure of the matrix, type the name of the variable at the R prompt:

> tblTrans
transactions in sparse format with
 110071 transactions (rows) and
 154 items (columns)

Keep in mind the number of transactions (queries) and items (tables) shown here differs from their respective numbers listed previously because I limited the analysis to just those table baskets with at least two of the 25 most-queried tables in them. Looking at the output, that's 110,071 queries and 154 tables (the top 25 along with 129 others they appear with).
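The read.transactions() call that builds tblTrans is not shown in the paper. A minimal sketch, assuming a two-column file of {query, tabname} pairs with a header row (the file name is an assumption):

library(arules)

# "single" format: one row per (transaction id, item) pair
tblTrans <- read.transactions("query_table_pairs.csv",
                              format = "single", sep = ",",
                              cols   = c("query", "tabname"),
                              header = TRUE)

# An equivalent route, starting from the detail data frame loaded earlier:
# tblTrans <- as(split(detail$tabname, detail$query), "transactions")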
To see the contents of the tblTrans transactions object, the inspect() function is used (note I limited inspect() to the first five transactions, as identified by the ASCII ordering of the transactionID):

> inspect(tblTrans[1:5])
  items                                                          transactionID
1 {PARTICIPANT, PARTICIPANT_NAME}                                            1
2 {ADDRESS, BIRTH_INFORMATION, PARTICIPANT, PARTICIPANT_ADDRESS,
   PARTICIPANT_PHONE_NUMBER, PARTICIPANT_PHYSICAL_ATTRIBUTES,
   SOCIAL_SECURITY_NUMBER}                                                  10
3 {CASE_COURT_CASE, COURT_CASE, LEGAL_ACTIVITY, MEDICAL_TERMS,
   SUPPORT_ORDER, TERMS}                                                   100
4 {CASE, CASE_ACCOUNT, CASE_ACCOUNT_SUMMARY, CASE_COURT_CASE}             1000
5 {ADDRESS, PARTICIPANT, PARTICIPANT_ADDRESS}                            10000

The summary() function provides additional descriptive statistics concerning the make-up of the table transactions (output format edited slightly to fit):

> summary(tblTrans)
transactions as itemMatrix in sparse format with
 110071 rows (elements/itemsets/transactions) and
 154 columns (items) and a density of 0.02457888

most frequent items:
           CASE  CASE_PARTICIPANT  PARTICIPANT  COURT_CASE  CASE_COURT_CASE  (Other)
          51216             38476        35549       21519            21421   248454

element (itemset/transaction) length distribution:
sizes
    2     3     4     5     6     7     8     9    10    11    12    13    14    15    17
33573 29883 15367 12456  9879  3825  1899   603  1064   775   128    38   490    89     2

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.000   3.000   3.785   5.000  17.000

includes extended item information - examples:
               labels
1 ACCOUNT_HOLD_DETAIL
2             ADDRESS
3          ADJUSTMENT

includes extended transaction information - examples:
  transactionID
1             1
2            10
3           100

There's a wealth of information in this output. Note, for instance, that the minimum item (table) count is two and the maximum is seventeen, and that there are 33,573 transactions with two items and two transactions with seventeen. While I'm loath to assign a limit to the maximum number of tables that should ever appear in a query, a DBA would likely be keen to investigate the double-digit queries for potential tuning opportunities.
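Pulling out those double-digit baskets for review is straightforward with arules' size() function; a minimal sketch:

# Queries that touch ten or more of the analyzed tables
big <- tblTrans[size(tblTrans) >= 10]
length(big)                          # how many such queries there are
inspect(big[1:min(3, length(big))])  # peek at a few of them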
Lastly, we can use the itemFrequencyPlot() function to generate an item frequency distribution. Frequencies can be displayed as relative (percentages) or absolute (counts). Note that for readability, I limited the plot to the 25 most frequent items among the query baskets by specifying a value for the topN parameter. The command is below, and the plot is shown in figure 4.

> itemFrequencyPlot(tblTrans, type = "absolute", topN = 25,
    main = "Frequency Distribution of Top 25 Tables",
    xlab = "Table Name", ylab = "Frequency")

Figure 4 - Item frequency bar plot among the top 25 tables

With the query baskets loaded, it was then time to generate the table association rules. The R function within arules that does this is apriori(). apriori() takes up to four arguments, but I only used two: a transactions object, tblTrans, and a list of two parameters that specify the minimum values for two rule "interestingness criteria" (see footnote 5), reliability and generality [12; 16]. Confidence, the first parameter, is a measure of reliability; it specifies how often the rule is true when the LHS is true, i.e.,

$\text{confidence} = \frac{\text{countOfBasketsWithLHSandRHSItems}}{\text{countOfBasketsWithLHSItems}}$

The second parameter, support, corresponds to the generality criterion and specifies the proportion of all baskets in which the rule is true, i.e.,

$\text{support} = \frac{\text{countOfBasketsWithLHSandRHSItems}}{\text{totalCount}}$

5 This was the "attendant metadata" I mentioned in the Market Basket Analysis section.
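Both measures can be spot-checked directly against the transactions object. A minimal sketch using arules' itemFrequency() and crossTable(); the table pair is borrowed from the rules discussed later, and the variable names are mine:

n     <- length(tblTrans)                            # total number of baskets
items <- itemFrequency(tblTrans, type = "absolute")  # per-table basket counts
co    <- crossTable(tblTrans)                        # pairwise co-occurrence counts

lhs_tbl <- "SUPPORT_ORDER"; rhs_tbl <- "LEGAL_ACTIVITY"
support    <- co[lhs_tbl, rhs_tbl] / n               # proportion of baskets with both tables
confidence <- co[lhs_tbl, rhs_tbl] / items[lhs_tbl]  # of the LHS baskets, how many also hold the RHS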
A third interestingness criterion, lift, is also useful for evaluating rules and figures prominently in output from the R extension package arulesViz (see figure 5). Paraphrasing [18], lift compares the confidence of a rule with the expected confidence of its consequent, i.e., how likely the second table is to be queried regardless of whether the first table was:

$\text{lift} = \frac{\text{Confidence(Rule)}}{\text{Support(RHS)}}$

with Support(RHS) calculated as

$\text{Support(RHS)} = \frac{\text{countOfBasketsWithRHSItem}}{\text{totalCount}}$

Lift indicates the strength of the association over its random co-occurrence. When lift is greater than 1, the rule is better than guessing at predicting the consequent.

For confidence I specified .8 and for support I specified .05. The command I ran was

> tblRules <- apriori(tblTrans, parameter = list(supp = .05, conf = .8))

which generated 71 rules. To get a high-level overview of the rules, you can call the overloaded summary() function against the output of apriori():

> summary(tblRules)
set of 71 rules

rule length distribution (lhs + rhs): sizes
 2  3  4  5
11 28 25  7

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   3.000   3.000   3.394   4.000   5.000

summary of quality measures:
    support          confidence          lift
 Min.   :0.05239   Min.   :0.8037   Min.   :1.728
 1st Qu.:0.05678   1st Qu.:0.8585   1st Qu.:2.460
 Median :0.06338   Median :0.9493   Median :4.533
 Mean   :0.07842   Mean   :0.9231   Mean   :4.068
 3rd Qu.:0.08271   3rd Qu.:0.9870   3rd Qu.:5.066
 Max.   :0.28343   Max.   :1.0000   Max.   :7.204

mining info:
     data ntransactions support confidence
 tblTrans        110071    0.05        0.8

To see the rules, execute the inspect() function (note I'm only showing the first and last five, as sorted by confidence):
> inspect(sort(tblRules, by = "confidence"))
   lhs                                   rhs                support    confidence lift
1  {SUPPORT_ORDER}                    => {LEGAL_ACTIVITY}   0.11643394 1.0000000  5.847065
2  {CASE_COURT_CASE, SUPPORT_ORDER}   => {LEGAL_ACTIVITY}   0.07112682 1.0000000  5.847065
3  {COURT_CASE, SUPPORT_ORDER}        => {LEGAL_ACTIVITY}   0.08326444 1.0000000  5.847065
4  {CASE_PARTICIPANT, SUPPORT_ORDER}  => {LEGAL_ACTIVITY}   0.05753559 1.0000000  5.847065
5  {CASE, SUPPORT_ORDER}              => {LEGAL_ACTIVITY}   0.06406774 1.0000000  5.847065
<snip>
67 {CASE_COURT_CASE, COURT_CASE,
    SUPPORT_ORDER}                    => {CASE}             0.05678153 0.8088521  1.738347
68 {CASE_COURT_CASE, COURT_CASE,
    LEGAL_ACTIVITY, SUPPORT_ORDER}    => {CASE}             0.05678153 0.8088521  1.738347
69 {CASE_COURT_CASE, COURT_CASE}      => {CASE}             0.13003425 0.8044175  1.728816
70 {CASE_COURT_CASE, LEGAL_ACTIVITY}  => {CASE}             0.07938512 0.8041598  1.728262
71 {CASE, CASE_COURT_CASE}            => {COURT_CASE}       0.13003425 0.8037399  4.111179

Let's look at the first and last rules and interpret them. The first rule says that over the period during which the queries were collected, the SUPPORT_ORDER table appeared in 11.64% of the queries and that when it did it was accompanied by the LEGAL_ACTIVITY table 100% of the time. The last rule, the 71st, says that during this same period CASE and CASE_COURT_CASE appeared in 13% of the queries and that they were accompanied by COURT_CASE 80.37% of the time.

While it's not visible from the subset shown, all 71 of the generated rules have a single-item consequent. This is fortunate, and is not always the case, as such rules are "the most actionable" in practice compared to rules with compound consequents [4].

Figure 5, generated using arulesViz, is a scatter plot of the support and confidence of the 71 rules generated by arules. Here we see the majority of the rules are in the 0.05-0.15 support range, meaning between 5% and 15% of the 110,071 queries analyzed contain all of the tables represented in the rule.
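The single-item-consequent observation above can be confirmed, and the rules for any one consequent pulled out, with rhs(), size(), and subset(); a sketch (CASE is used purely for illustration):

# Distribution of consequent sizes; all 71 rules should show a size of 1
table(size(rhs(tblRules)))

# Rules that predict a particular table, ordered by lift
inspect(sort(subset(tblRules, subset = rhs %in% "CASE"), by = "lift"))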
Figure 5 - Plot of the interestingness measures for the 71 generated rules
[The plot is titled "Rule Interestingness Measures"; its axes are confidence and support, with points shaded by lift.]

An illuminating visualization is shown in figure 6, also generated by the arulesViz package. This figure plots the rule antecedents on the x-axis and their consequents on the y-axis. To economize on space, the table names aren't displayed; rather, they are numbered, corresponding to output accompanying the graph that is displayed in the main R window. Looking at the plot, two things immediately stand out: the presence of four large rule groups, and the fact that only nine tables (the y-axis) account for the consequents in all 71 rules. These nine tables are the "nuclear" tables around which all the others orbit, the most vital of the vital few.
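For reference, figures 5 and 6 correspond to calls along the following lines; argument names have shifted somewhat across arulesViz versions, so treat this as a sketch rather than the paper's exact commands:

library(arulesViz)

# Scatter plot of support vs. confidence, shaded by lift (cf. figure 5)
plot(tblRules, measure = c("support", "confidence"), shading = "lift")

# Antecedent-by-consequent matrix, shaded by lift (cf. figure 6)
plot(tblRules, method = "matrix", measure = "lift")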
Figure 6 - Plot of table prevalence of rules
[The plot is titled "Table Prevalence Among Rules"; x-axis: Antecedent (LHS), y-axis: Consequent (RHS), shaded by lift.]

Another Way: Odds Ratios

As the final step, I computed the odds ratios between all existing pairings occurring among the top-25 tables, which numbered 225. Odds is the ratio of the probability of an event's occurrence to the probability of its non-occurrence, and the odds ratio is the ratio of the odds of two events (e.g., two tables co-occurring in a given query vs. each table appearing without the other) [19]. To compute the odds ratios, I used the cc() function from the epicalc R package which, when given a 2x2 contingency table (see table 1, generated with the R CrossTable() function), outputs the following (the counts shown are of the pairing of the PARTICIPANT and CASE_PARTICIPANT tables):

        FALSE   TRUE   Total
FALSE  408021  50902  458923
TRUE    56963  21745   78708
Total  464984  72647  537631

OR = 3.06
Exact 95% CI = 3.01, 3.12
Chi-squared = 15719.49, 1 d.f., P value = 0
Fisher's exact test (2-sided) P value = 0
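The epicalc package has at times been archived on CRAN, but the same numbers can be reproduced in base R from the counts shown above; a minimal sketch:

# 2x2 counts for PARTICIPANT (rows) vs. CASE_PARTICIPANT (columns), taken from the output above
m <- matrix(c(408021, 56963,     # FALSE column
               50902, 21745),    # TRUE column
            nrow = 2,
            dimnames = list(PARTICIPANT      = c("FALSE", "TRUE"),
                            CASE_PARTICIPANT = c("FALSE", "TRUE")))

(m["FALSE", "FALSE"] * m["TRUE", "TRUE"]) /
(m["FALSE", "TRUE"]  * m["TRUE", "FALSE"])   # sample odds ratio, roughly 3.06

fisher.test(m)   # exact test, with a confidence interval for the odds ratio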
For these two tables, the odds ratio is 3.06, with a 95% confidence interval of 3.01 to 3.12. An odds ratio greater than 1 indicates a positive association, and the further it is above 1, the stronger the association.

Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table: 537631

             | CASE_PARTICIPANT
 PARTICIPANT |     FALSE |      TRUE | Row Total |
-------------|-----------|-----------|-----------|
       FALSE |    408021 |     50902 |    458923 |
             |   310.961 |  1990.337 |           |
             |     0.889 |     0.111 |     0.854 |
             |     0.877 |     0.701 |           |
             |     0.759 |     0.095 |           |
-------------|-----------|-----------|-----------|
        TRUE |     56963 |     21745 |     78708 |
             |  1813.123 | 11605.065 |           |
             |     0.724 |     0.276 |     0.146 |
             |     0.123 |     0.299 |           |
             |     0.106 |     0.040 |           |
-------------|-----------|-----------|-----------|
Column Total |    464984 |     72647 |    537631 |
             |     0.865 |     0.135 |           |
-------------|-----------|-----------|-----------|

Table 1 - 2x2 contingency table

I tabulated the odds ratios of the 225 pairings, and table 2 shows the first 25, ordered by the odds ratio in descending order. The top four odds ratios show as Inf (infinite) because either the FALSE/TRUE or TRUE/FALSE cell in their respective 2x2 contingency tables was 0, indicating that among the 537,631 observations (rows) analyzed, the one table never appeared without the other.

The results shown in table 2 differ from what figure 3 depicts. Figure 3 shows the strongest link between the CASE and CASE_PARTICIPANT tables, whereas in table 2 it's between SUPPORT_ORDER and SOCIAL_SECURITY_NUMBER. This is because the line thicknesses in figure 3 are based solely on the count of co-references (analogous to the TRUE/TRUE cell in the 2x2 contingency table), whereas odds ratios consider the pair counts relative to each other, i.e., they take into account the other three cells--FALSE/FALSE, FALSE/TRUE, and TRUE/FALSE--as well.
Table 1                              Table 2                                   OR
SUPPORT_ORDER                        SOCIAL_SECURITY_NUMBER                   Inf
SOCIAL_SECURITY_NUMBER               PARTICIPANT_PUBLIC_ASSISTANCE_CASE       Inf
CASE_ACCOUNT                         CASE                                     Inf
BIRTH_INFORMATION                    ADDRESS                                  Inf
PARTICIPANT_EMPLOYER                 EMPLOYER_ADDRESS                     209.014
EMPLOYER_ADDRESS                     EMPLOYER                              54.854
ORG_UNIT                             INTERNAL_USER                         53.583
PARTICIPANT_EMPLOYER                 EMPLOYER                              37.200
SOCIAL_SECURITY_NUMBER               PARTICIPANT_NAME                      36.551
PARTICIPANT_RELATIONSHIP             PARTICIPANT_NAME                      34.779
CHARGING_INSTRUCTION                 CASE_ACCOUNT                          27.309
SOCIAL_SECURITY_NUMBER               PARTICIPANT_RELATIONSHIP              26.885
CHARGING_INSTRUCTION                 CASE_ACCOUNT_SUMMARY                  25.040
PARTICIPANT                          LOGICAL_COLLECTION_TRANSACTION        23.281
COMBINED_LOG_TEMP_ENTRY              CHARGING_INSTRUCTION                  22.357
PARTICIPANT_NAME                     PARTICIPANT                            9.739
SOCIAL_SECURITY_NUMBER               PARTICIPANT                            8.799
PARTICIPANT_PUBLIC_ASSISTANCE_CASE   COMBINED_LOG_TEMP_ENTRY                8.578
CASE_COURT_CASE                      CASE                                   7.913
PARTICIPANT_ADDRESS                  PARTICIPANT                            7.869
PARTICIPANT                          COMBINED_LOG_TEMP_ENTRY                7.861
COURT_CASE                           CASE_COURT_CASE                        7.525
PARTICIPANT_NAME                     PARTICIPANT_EMPLOYER                   7.454
SOCIAL_SECURITY_NUMBER               PARTICIPANT_ADDRESS                    7.420
PARTICIPANT_NAME                     ORG_UNIT                               7.346

Table 2 - Top 25 table-pair odds ratios

Conclusion

In this paper, I've described a holistic process for identifying the most-queried "vital few" tables in a database, uncovering their usage patterns and interrelationships, and guiding their placement on physical media. First I captured query metadata and parsed it for further analysis. I then established that there are a "vital few" tables that account for the majority of query activity. Finally, I used MBA, supplemented with other methods, to understand the co-reference patterns among these tables, which may in turn inform their layout on storage media. My hope is that I've described what I did in enough detail that you're able to adapt it, extend it, and improve it to the betterment of the performance of your databases and applications.

Appendix A - Data Placement on Physical Media

Storage devices such as disks maximize throughput by minimizing access time [5], and a fundamental part of physical database design is the allocation of database objects to such physical media--deciding where schema objects should be placed on disk to maximize performance by minimizing disk seek time and rotational latency. The former is a function of the movement of a disk's read/write head arm assembly, and the latter is dependent on its rotations per minute. Both are electromechanical absolutes, although their speeds vary from disk to disk.

As is well known, disk access is several orders of magnitude slower than RAM access--estimates range from four to six orders of magnitude [ibid.]--and this relative disparity is no less true today than it was when the IBM 350 Disk Storage Unit was introduced in 1956. So while this
topic may seem like a bit of a chestnut in the annals of physical database design, it remains germane. The presence of such compensatory components and strategies as bufferpools, defragmenting, table reorganizations, and read-ahead prefetching in the architecture of modern RDBMSs underscores this point [20]. The fact is that read/write heads can only be in one place on the disk platter at a time. Solid-state drives (SSDs) offer potential relief here, but data volumes are rising at a rate much faster than that at which SSD prices are falling.

Coupled with this physical reality is a fiscal one, as it's been estimated that anywhere from 16% to 40% of IT budget outlay is committed to storage [6; 23]. In light of such costs, it makes good financial sense for an organization to be a wise steward of this resource and seek its most efficient use.

If one is designing a new database as part of a larger application development initiative, then such tried-and-true tools as CRUD (create, read, update, delete) matrices and entity affinity analysis can assist with physical table placement, but these techniques quickly become tedious, and therefore error-prone, and the early placement decisions they yield are at best educated guesses. What would be useful is an automated, holistic approach to help refine the placement of tables as the full complement of queries comes on line, and later as it changes over the lifetime of the application, without incurring extra storage costs. The present paper is, of course, an attempt at such an approach.

As to where to locate data on disk media in general, the rule of thumb is to place the high-use tables on the middle tracks of the disk, given that this location has the smallest average distance to all other tracks. In the case of disks employing zone-bit recording (ZBR), as practically all now do, the recommendation is to place the high-frequency tables, say the vital few 20%, on the outermost cylinders, as the raw transfer rate is higher there since the bits are more densely packed. This idea can be extended further by placing tables typically co-accessed in queries in the outermost zones on separate disk drives [2], minimizing read/write head contention and enabling query parallelism. If zone-level disk placement specificity is not an option, separating co-accessed vital few tables onto separate media is still a worthwhile practice.

Appendix B - How many baskets?

For this analysis, performed on modest hardware, I used all of the snapshot data I had available. Indeed, one of the precepts of the burgeoning field of data science [17] is that with today's commodity hardware, computing power, and addressable memory sizes, we no longer have to settle for samples. Plus, when it comes to the data we've been discussing, sampling risks overlooking its long-tailed aspect [7]. Nonetheless, there may still be instances where analyzing all of the snapshots at your disposal isn't practicable, or you may want to know at the outset of an analysis how many statements you'll need to capture to get a representative sample of the query population (and therefore greater statistical power) if you don't have a ready pool of snapshots from which to draw. In either case, you need to know how large your sample must be for a robust analysis.
Sample size formulas exist for more conventional hypothesis testing, but Zaki et al. [24] give a more suitable sample size formula, one specific to market basket analysis:

$n = \frac{-2\ln(c)}{\tau\varepsilon^{2}}$

where n is the sample size, c is 1 - α (α being the confidence level), ε is the acceptable level of inaccuracy, and τ is the minimum required support [13]. Using this equation with 80% confidence (c = .20), 95% accuracy (ε = .05), and 5% support (τ = .05), the sample size recommendation is 25,751.

Using this sample size, I ran the sample() function from the arules package against the tblTrans transactions object, the results of which I then used as input to the apriori() function to generate a new set of rules. This time, 72 rules were generated. Figure 7 shows the high degree of correspondence between the relative frequencies of the tables in the sample (bars) and the population (line).

Figure 7 - Relative frequencies of the tables in the sample vs. in the population.
[The plot is titled "Sample Frequency Distribution of Top 25 Tables"; x-axis: Table Name, y-axis: Frequency.]

Figure 8 plots the regression lines of the confidence and support of the rules generated from the sample (black) and the population (grey). Again, notice the high degree of correspondence.
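As a recap, a minimal sketch of the sample-size calculation and the sampling step just described; the variable names and the random seed are mine, not the paper's:

# Zaki et al. sample-size formula: n = -2 ln(c) / (tau * eps^2)
conf_level <- 0.80
c_val <- 1 - conf_level                    # 0.20
eps   <- 0.05                              # acceptable inaccuracy
tau   <- 0.05                              # minimum required support
n     <- -2 * log(c_val) / (tau * eps^2)   # about 25,751

set.seed(1)                                # arbitrary, for reproducibility
tblSample   <- sample(tblTrans, size = round(n))
sampleRules <- apriori(tblSample, parameter = list(supp = .05, conf = .8))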
Figure 8 - Correspondence between the support and confidence of the sample and population rules
[The plot is titled "Rule Sample vs. Population"; its axes are confidence and support.]

References

[1] Agrawal, Rakesh, et al. "Mining Association Rules Between Sets of Items in Large Databases." Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. pp. 207-216.
[2] Agrawal, Sanjay, et al. "Automating Layout of Relational Databases." Proceedings of the 19th International Conference on Data Engineering (ICDE '03). (2003): 607-618.
[3] Barabási, Albert-László and Zoltán N. Oltvai. "Network Biology: Understanding the Cell's Functional Organization." Nature Reviews Genetics. 5.2 (2004): 101-113.
[4] Berry, Michael J. and Gordon Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Support. New York: John Wiley & Sons, 1997.
[5] Blanchette, Jean-François. "A Material History of Bits." Journal of the American Society for Information Science and Technology. 62.6 (2011): 1042-1057.
[6] Butts, Stuart. "How to Use Single Instancing to Control Storage Expense." eWeek 03 August 2009. 22 December 2010 <http://mobile.eweek.com/c/a/Green-IT/How-to-Use-Single-Instancing-to-Control-Storage-Expense/>.
[7] Cohen, Jeffrey, et al. "MAD Skills: New Analysis Practices for Big Data." Proceedings of the VLDB Endowment. 2.2 (2009): 1481-1492.
[8] Cunningham, Sally Jo and Eibe Frank. "Market Basket Analysis of Library Circulation Data." Proceedings of the Sixth International Conference on Neural Information Processing (1999). Vol. II, pp. 825-830.
[9] Eldredge, Jonathan D. "The Vital Few Meet the Trivial Many: Unexpected Use Patterns in a Monographs Collection." Bulletin of the Medical Library Association. 86.4 (1998): 496-503.
[10] Erar, Aydin. "Bibliometrics or Informetrics: Displaying Regularity in Scientific Patterns by Using Statistical Distributions." Hacettepe Journal of Mathematics and Statistics. 31 (2002): 113-125.
[11] Fu, W. Wayne and Clarice C. Sim. "Aggregate Bandwagon Effect on Online Videos' Viewership: Value Uncertainty, Popularity Cues, and Heuristics." Journal of the American Society for Information Science and Technology. 62.12 (2011): 2382-2395.
[12] Geng, Liqiang and Howard J. Hamilton. "Interestingness Measures for Data Mining: A Survey." ACM Computing Surveys. 38.3 (2006): 1-32.
[13] Hahsler, Michael, et al. "Introduction to arules: A Computational Environment for Mining Association Rules and Frequent Item Sets." CRAN 16 March 2010. <http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf>.
[14] Heising, W.P. "Note on Random Addressing Techniques." IBM Systems Journal. 2.2 (1963): 112-116.
[15] Hsu, W.W., A.J. Smith, and H.C. Young. "Characteristics of Production Database Workloads and the TPC Benchmarks." IBM Systems Journal. 40.3 (2001): 781-802.
[16] Janert, Philipp K. Data Analysis with Open Source Tools. Sebastopol: O'Reilly, 2010.
[17] Loukides, Mike. "What is Data Science?" O'Reilly Radar. 2 June 2010. <http://radar.oreilly.com/2010/06/what-is-data-science.html>.
[18] Nisbet, Robert, John Elder, and Gary Miner. Handbook of Statistical Analysis and Data Mining. Burlington, MA: Academic Press, 2009.
[19] Ott, R. Lyman, and Michael Longnecker. An Introduction to Statistical Methods and Data Analysis. 6th ed. Belmont, CA: Brooks/Cole, 2010.
[20] Pendle, Paul. "Solid-State Drives: Changing the Data World." IBM Data Management Magazine. Issue 3 (2011): 27-30.
[21] Spence, Robert. Information Visualization. 2nd ed. London: Pearson, 2007.
[22] Trueswell, Richard L. "Some Behavioral Patterns of Library Users: The 80/20 Rule." Wilson Library Bulletin. 43.5 (1969): 458-461.
[23] Whitely, Robert. "Buyer's Guide to Infrastructure: Three Steps to IT Reorganisation." ComputerWeekly.com 01 September 2010. 22 December 2010 <http://www.computerweekly.com/feature/Buyers-Guide-to-infrastructure-Three-steps-to-IT-reorganisation>.
[24] Zaki, Mohammed Javeed, et al. "Evaluation of Sampling for Data Mining of Association Rules." Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97): High Performance Database Management for Large-Scale Applications. (1997): 42-50.

Copyright ©2014 by Jeffrey K. Tyzzer