Computer Methods and Programs in Biomedicine 82 (2006) 38–43; doi:10.1016/j.cmpb.2006.02.001
journal homepage: www.intl.elsevierhealth.com/journals/cmpb
Pivoting approaches for bulk extraction of
Entity–Attribute–Value data
Valentin Dinu∗, Prakash Nadkarni, Cynthia Brandt
Center for Medical Informatics, Yale University School of Medicine, PO Box 208009, New Haven, CT 06520-8009, United States
∗ Corresponding author. Fax: +1 203 737 5708. E-mail address: Valentin.Dinu@yale.edu (V. Dinu).
Article info

Article history: Received 6 October 2005; Received in revised form 2 February 2006; Accepted 3 February 2006

Keywords: Databases; Entity–Attribute–Value; Clinical patient record systems; Clinical study data management systems

Abstract
Entity–Attribute–Value (EAV) data, as present in repositories of clinical patient data, must
be transformed (pivoted) into one-column-per-parameter format before it can be used by
a variety of analytical programs. Pivoting approaches have not been described in depth
in the literature, and existing descriptions are dated. We describe and benchmark three
alternative algorithms to perform pivoting of clinical data in the context of a clinical study
data management system. We conclude that when the number of attributes to be returned is
not too large, it is feasible to use static SQL as the basis for views on the data. An alternative
but more complex approach that utilizes hash tables and the presence of abundant random-
access-memory can achieve improved performance by reducing the load on the database
server.
© 2006 Published by Elsevier Ireland Ltd.
1. Introduction
The “generic” or Entity–Attribute–Value (EAV) data modeling
approach [1] is used in database design when a potentially
vast number of parameters can describe something, but rel-
atively few apply to a given instance. This model is espe-
cially appropriate for physical representation of the clinical
data sub-schema of clinical data repositories (CDRs), where
the total number of possible parameters across all special-
ties of medicine ranges in the hundreds of thousands. Large-
scale systems that utilize EAV design for clinical data are the
HELP CDR [2,3] and its commercial version, the 3M CDR [4],
the Columbia-Presbyterian CDR, the Cerner PowerChart Enter-
prise CDR [5], and the clinical study data management systems
(CSDMS) Oracle Clinical [6] and Phase Forward’s ClinTrial [7].
Conceptually, an EAV table consists of triplets, an Entity
(the thing being described, e.g., a patient’s clinical encounter),
an Attribute (a parameter of interest) and a Value (for that
attribute). Because all types of facts reside in the same
“value” column, EAV data needs transforming (pivoting) into
one-column-per-parameter structure for use by applications
such as graphing and many statistical analyses. This oper-
ation lacks built-in support in the current versions of rela-
tional database engines. (Note: SQL Server 2005 includes a
newly introduced PIVOT command. This command, however,
requires an aggregate function to be specified, so that its out-
put includes transformed values such as the mean, sum, etc.
rather than the original values that are desired in the basic
EAV-pivoting operation.) The issues of pivoting clinical EAV
data have not been previously explored in depth.
A pivoting operation extracts a subset of data from the
repository, e.g., all results for a test panel or questionnaire
gathered over a period, or for a given clinical study, yielding
a table with as many columns as parameters of interest, with
additional columns identifying individual rows, e.g., patient
ID and time stamps. The table is often written to disk as a
delimited text file for importing by other applications. The
most detailed existing description of pivoting, Johnson et al.
[8], describes the use of data access modules (DAMs) in the
Columbia Clinical Repository—procedural code to implement
the equivalent of pivoted “views”. While noting that DAMs
are “complex and hard to modify to meet the needs of appli-
cation developers in a timely manner”, the authors identify
several limitations of alternative approaches such as static
SQL views, e.g., a static SQL view needs as many joins as
attributes of interest. (Note: In some highly normalized EAV
schemas, one would need twice or thrice this number.) In
1994, most database engines limited the number of joins per
statement to a relatively small number, e.g., 16. These limits
are now more generous (e.g., 256 joins in SQL Server 2000).
A well-known SQL tuning text [9] mentions optimization of a
115-join production query. The issue, however, is whether per-
formance of static pivoting SQL scales well with the number of
parameters.
This paper describes the general pivoting problem, and
benchmarks alternative pivoting approaches. The three
approaches explored here were tested using TrialDB [10], a pro-
duction CSDMS at Yale, with an agoraphobia case report form
(CRF) used in a psychiatry study (Dr. Scott Woods, PI), contain-
ing 42 integer attributes. All approaches used SQL that was
generated dynamically by code that accessed metadata—i.e.,
the IDs of all parameters in the questionnaire, along with
their serial ordering. For each of the three approaches, we measured performance as a function of the number of attributes by
progressively generating and executing a series of SQL state-
ments, each incorporating data for an additional attribute.
TrialDB uses a normalized design, storing “Entity” informa-
tion (patient ID, study ID, CRF ID, etc.) in an “Entity/Clinical
Encounter” table with a machine-generated “Entity ID” pri-
mary key. Data-type specific EAV tables comprise triplets
(Entity ID, Attribute ID and Value).
2. Methods
Benchmarks were written in Java and ran against three
DBMSs: Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005
Beta 2. (While two of the systems are by the same ven-
dor, our results indicate that the query execution engines
of the two are quite different.) The benchmarks for each
database used the same schema and data, and utilized the
same indexes. The databases and application ran on the
same dedicated machine (single-CPU 1.8 GHz Pentium 4 with
1 GB RAM), to eliminate the factor of network bandwidth. At
the time of benchmarking, no applications were running on
the computer except the Java code and the database being
tested.
We ran each test at least three times and averaged the
results. The test database schema, test data set, Java code,
generated queries and detailed benchmarks are available via
ftp://custard.med.yale.edu/pivot benchmarks.zip.
Formulating the general pivoting problem: For individual
encounters for particular patients, inapplicable items on a CRF
may be left empty: e.g., questions regarding diabetes treat-
ment apply only for patients with diabetes. Because empty
values are not stored in EAV tables, the number of data
points/values is generally unequal across all attributes in a
set. The creation of a rectangular, one-column-per-attribute
table from longitudinal “strips” representing values for indi-
vidual attributes conceptually requires full outer join opera-
tions, where non-matching rows on either side of a join are
preserved, and missing values recorded as “nulls”. Currently,
most mainstream database engines support “full outer joins”
natively in SQL.
Below we discuss three methods that can generate the
same pivoted output table using different approaches: full
outer join; left outer join; and hash tables performing in mem-
ory the equivalent of multiple joins.
2.1. Method A: using full outer joins
Algorithm. Any given statement generates one strip of data
per attribute of interest, and then combines the strips using
a series of FULL OUTER JOIN operations. Each strip is created
through an inner join between the Entities table and the EAV
data table—the former being filtered on Study ID and CRF ID,
since the same CRF can be used across multiple studies, and
the latter filtered on Attribute ID. In addition, for N attributes,
N − 1 full outer joins are needed. The total number of join oper-
ations per statement is therefore 2 × N − 1.
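To make the shape of such a statement concrete, the Java sketch below assembles a Method A query from a list of attribute IDs, mirroring the paper's approach of generating SQL dynamically from metadata. The table and column names (entities, eav_int, entity_id, study_id, crf_id, attribute_id, value) are hypothetical stand-ins for the actual TrialDB schema, and the generated SQL is deliberately simplified; the authors' real query generator is part of the downloadable benchmark package.

import java.util.List;

public class FullOuterJoinPivotSql {

    // Builds a Method A-style statement: one sub-select ("strip") per attribute,
    // combined with FULL OUTER JOINs. For N attributes this yields N inner joins
    // plus N - 1 full outer joins, i.e. 2 x N - 1 joins in total.
    public static String build(int studyId, int crfId, List<Integer> attributeIds) {
        StringBuilder sql = new StringBuilder("SELECT * FROM ");
        for (int i = 0; i < attributeIds.size(); i++) {
            String strip = "(SELECT e.entity_id, d.value AS attr_" + i
                    + " FROM entities e JOIN eav_int d ON d.entity_id = e.entity_id"
                    + " WHERE e.study_id = " + studyId
                    + " AND e.crf_id = " + crfId
                    + " AND d.attribute_id = " + attributeIds.get(i) + ") s" + i;
            if (i == 0) {
                sql.append(strip);
            } else {
                // Simplification: a production version would join on a COALESCE of the
                // entity IDs accumulated so far, since a full outer join can leave
                // s0.entity_id NULL for rows contributed only by later strips.
                sql.append(" FULL OUTER JOIN ").append(strip)
                   .append(" ON s").append(i).append(".entity_id = s0.entity_id");
            }
        }
        return sql.toString();
    }
}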
2.2. Method B: using left outer joins
Algorithm. We determine essential Entity information
(Encounter ID, patient ID, time stamps) for the full set of clinical encounters (641 in our data set) by filtering the Entities table alone on
Study ID and CRF ID. We join this information with each strip of
data, generated as above. However, we use LEFT OUTER JOIN
operations, where complete Entity information merges with
whatever matches for each attribute. The total number of joins
per statement is N inner joins (to generate each attribute’s
data), plus N outer joins for merging = 2 × N.
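For illustration, a generated Method B statement for two attributes might look roughly as follows. The table and column names (including visit_date for the time stamp) are hypothetical, and the literal values 7, 3, 101 and 102 stand in for a Study ID, a CRF ID and two attribute IDs taken from metadata. Each strip repeats the inner join of the Entities and EAV tables used in Method A, so a statement for N attributes contains N inner joins plus N left outer joins.

String methodB = """
    SELECT e.entity_id, e.patient_id, e.visit_date,
           s0.value AS attr_0, s1.value AS attr_1
    FROM entities e
    LEFT OUTER JOIN (SELECT e2.entity_id, d.value
                     FROM entities e2
                     JOIN eav_int d ON d.entity_id = e2.entity_id
                     WHERE e2.study_id = 7 AND e2.crf_id = 3
                       AND d.attribute_id = 101) s0 ON s0.entity_id = e.entity_id
    LEFT OUTER JOIN (SELECT e2.entity_id, d.value
                     FROM entities e2
                     JOIN eav_int d ON d.entity_id = e2.entity_id
                     WHERE e2.study_id = 7 AND e2.crf_id = 3
                       AND d.attribute_id = 102) s1 ON s1.entity_id = e.entity_id
    WHERE e.study_id = 7 AND e.crf_id = 3
    ORDER BY e.patient_id, e.visit_date
    """;   // Java 15+ text block used here purely for readability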
2.3. Method C: using hash tables and memory to
perform the equivalent of multiple joins
Any strategy that generates SQL to join an arbitrary number
of tables in a single statement runs the risk of encounter-
ing the 256-joins-per-query limit, which corresponds to 128
attributes. Several case-report forms, notably certain psychia-
try questionnaires, can exceed this threshold, and so alter-
native approaches must be explored. We describe such an
approach, using extensible hash tables (a standard component
of most modern programming libraries such as Java and the
.NET framework) to perform pivoting.
Algorithm.
Step 1. We already know the IDs of the attributes of interest
and their serial order of presentation in the final out-
put. We load this information into a hash table, with
attribute ID as key and serial number as value. The
hash table enables us to determine in constant time
that attribute ID 1568, for example, is in position 12.
Step 2. We execute a query that fetches complete entity
information, ordered by patient ID and time stamp.
The ordering information is used to make the basic
algorithm more scalable if needed, as described later.
We capture this data from the database and use it to
create a second hash table, with Entity/Encounter ID
as key and row number in the array as value—e.g.,
Encounter# 14568 is in row 45.
Step 3. We dynamically allocate a two-dimensional array of
strings (number of entities × number of attributes).
All elements are initialized to blanks.
Step 4a. A query fetches all EAV triplets (Encounter ID,
Attribute ID, Value) for the given Study ID, CRF ID
and Attribute IDs, via a join to the Entities table.
(In a slight variation of this method, which we will call Method C′, one can retrieve the EAV triplets for all the Attribute IDs for the given Study ID
and CRF ID. This variation decreases the load on
the database and can avoid the performance degra-
dation seen in one of the tested databases—see
Section 3.)
Step 4b. Iterating through each returned row, we place the
Value in the 2-D array in the row and column indi-
cated by its corresponding Encounter and Attribute
IDs, respectively: the two hash tables allow speedy
row/column determination. At the end of the iter-
ations, empty values remain blank. For a situa-
tion where the EAV data is stored across multi-
ple data-type-specific tables (e.g., strings, integers,
decimal numbers, as in TrialDB), one would repeat
the second query for each necessary EAV table,
as determined by metadata that indicated how
many attributes of each data-type existed for the
desired set.
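A minimal Java/JDBC sketch of steps 1 through 4 follows. Table and column names are hypothetical stand-ins for the TrialDB schema, the descriptive Entity columns (patient ID, time stamps) that would accompany each output row are omitted for brevity, and the data query deliberately omits the attribute-ID filter of step 4a, so it behaves like the Method C′ variant; Method C proper would append an IN list of the desired attribute IDs to the WHERE clause.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HashPivot {

    public static String[][] pivot(Connection con, int studyId, int crfId,
                                   List<Integer> attributeIds) throws SQLException {
        // Step 1: attribute ID -> output column position.
        Map<Integer, Integer> attrCol = new HashMap<>();
        for (int i = 0; i < attributeIds.size(); i++) {
            attrCol.put(attributeIds.get(i), i);
        }

        // Step 2: entity (encounter) ID -> output row position, ordered by patient and time.
        Map<Long, Integer> entityRow = new LinkedHashMap<>();
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT entity_id FROM entities WHERE study_id = ? AND crf_id = ?"
                + " ORDER BY patient_id, visit_date")) {
            ps.setInt(1, studyId);
            ps.setInt(2, crfId);
            try (ResultSet rs = ps.executeQuery()) {
                int row = 0;
                while (rs.next()) {
                    entityRow.put(rs.getLong(1), row++);
                }
            }
        }

        // Step 3: allocate the output grid (entities x attributes), initialized to blanks.
        String[][] grid = new String[entityRow.size()][attributeIds.size()];
        for (String[] r : grid) {
            Arrays.fill(r, "");
        }

        // Step 4: fetch the EAV triplets and drop each value into its row and column.
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT d.entity_id, d.attribute_id, d.value FROM eav_int d"
                + " JOIN entities e ON e.entity_id = d.entity_id"
                + " WHERE e.study_id = ? AND e.crf_id = ?")) {
            ps.setInt(1, studyId);
            ps.setInt(2, crfId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    Integer row = entityRow.get(rs.getLong(1));
                    Integer col = attrCol.get(rs.getInt(2));   // null for attributes not requested
                    if (row != null && col != null) {
                        grid[row][col] = rs.getString(3);
                    }
                }
            }
        }
        return grid;
    }
}

For data spread across several data-type-specific EAV tables, the second query would simply be repeated once per table, as described in step 4b.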
3. Results of benchmark tests
The results for the benchmark tests using the three meth-
ods (A–C, described above) on the three DBMSs (Oracle 9i,
SQL Server 2000 SP4 and SQL Server 2005 Beta 2) are illus-
trated in Figs. 1 and 2. To facilitate comparison across both
method and DBMS, Fig. 1 illustrates the results grouped by
method, while Fig. 2 illustrates the same results grouped by
DBMS.
3.1. Results for Method A: using full outer joins
(Fig. 1A)
Oracle 9i: Execution time increased exponentially, from
76 milliseconds (ms) for one attribute to 56,968 ms for
nine attributes. (Pearson R2 for log(time) versus number of
attributes = 0.955.) The Java process crashed with a SQL excep-
tion on attempting a 10-attribute merge. Inserting a variety of
optimizer hints in the generated SQL did not help, and further
experiments were halted.
SQL Server: Execution time increased at a quadratic rate
with the number of attributes (R2 = 0.994 for SQL Server 2000
SP4, R2 = 0.999 for SQL Server 2005 Beta 2). The query run
times for 1–42 attributes ranged from 206 to 8055 ms on SQL
Server 2000 SP4 and from 87 to 18,961 ms on SQL Server 2005
Beta 2.
3.2. Results for Method B: using left outer joins
(Fig. 1B)
Oracle 9i: Performance scaled linearly, from 67 ms for one
attribute, to 965 ms for 42 attributes, the last involving a total
of 82 join operations in a single SQL statement (R2 for time
versus number of attributes = 0.996).
SQL Server: Execution times were much higher—ranging
from 190 ms for 1 attribute to above 22 seconds (s) and 105 s for SQL
Server 2005 Beta 2 and SQL Server 2000 SP4, respectively. Exe-
cution times on SQL Server had an exponential growth for
the first 11–12 attributes (R2 = 0.981 for SQL Server 2000 SP4,
R2 = 0.990 for SQL Server 2005 Beta 2), followed by a linear
increase for more than 13 attributes (R2 = 0.990 for SQL Server
2000 SP4, R2 = 0.959 for SQL Server 2005 Beta 2). On SQL Server,
the execution times for left outer join were higher than the
times for the full outer join.
3.3. Results for Method C: using hash tables and
memory to perform the equivalent of multiple joins
(Fig. 1C)
Oracle 9i: Execution time grew linearly (R2 = 0.946) from 80 ms
for 1 attribute to 547 ms for 42 attributes.
SQL Server: The behavior of SQL Server 2000 SP4 differed
considerably from that of SQL Server 2005 Beta 2.
SQL Server 2005 Beta 2: Execution time grew linearly
(R2 = 0.967) from 77 ms for 1 attribute to 474 ms for 42
attributes.
SQL Server 2000 SP4: A notable and increasing performance
degradation was observed for about 9–20 attributes, beyond
which the execution times became lower and grew more lin-
early at a smaller rate. This behavior, presumably due to SQL Server 2000's attempt to optimize the query, prompted us to try bypassing the SQL Server's optimization attempt. In this slightly modified method, called Method C′, we retrieved from
the database the values for all the attributes (for the given
Study ID and CRF ID) and only the desired values were then
picked to be stored in the pivoted array—see Fig. 1D.
As expected, the times for Method C′ stayed almost flat,
independent of the number of attributes desired—since most
of the time was spent executing the query and then iterat-
ing through all the retrieved values to see if they need to be
stored in the pivoted array or not. This modification removed
the irregular increase in execution times for SQL Server 2000
SP4, but also resulted in higher execution times for Oracle 9i
and SQL Server 2005 Beta 2, especially for a small number of
attributes.
4. Discussion
4.1. Possible explanations of results
It is well known that the multi-table join problem is NP-hard
with respect to performance optimization. The number of join
orders to be evaluated to determine the join order that gives
the fastest performance grows exponentially with the num-
ber of tables to be joined [11]. If the number of tables is large
enough, the CPU time spent by the query execution engine on
determining the best way to perform the join can be considerably more than the time that the engine might take in actually executing the join using a naïve strategy, such as joining the tables in the order encountered in the SQL statement.

Fig. 1 – Execution times for Methods A–C are compared on three different databases (Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2), grouped by method. The x-axis represents the number of attributes for which the values are retrieved and loaded into an array; the y-axis represents the execution time in milliseconds (ms). Panels: (A) Method A, using full outer joins; (B) Method B, using left outer joins; (C) Method C, the in-memory hash-table pivot, selecting only the values for the desired attributes; (D) Method C′, the variation of Method C in which the values for all attributes of the desired trial are retrieved and only the desired values are kept. The growth rates, R² values and the SQL Server 2000 SP4 anomaly shown in the panels are those reported in Section 3.

Fig. 2 – For each of the tested databases (Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2), the performance times for the various methods are compared. The same logarithmic time scale is used for all databases, to facilitate side-by-side comparison. For 42 attributes, Method C provided 1.76-, 4- and 40-fold performance improvements when compared to the next best performing method (A or B) for Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2, respectively.
Modern DBMSs achieve their impressive performance
through a combination of heuristics (e.g., the presence or
absence of indexes) and the use of stored database statis-
tics, such as table sizes and data distributions. This infor-
mation lets them select query execution plans that may not
be absolutely optimal, but are reasonably close to optimal,
and which can be determined in polynomial or even linear
time. The strategies are understandably vendor-specific. Since
query performance is an area where vendors compete vig-
orously, these strategies are likely to change between DBMS
versions. Further, the amount of intellectual effort that ven-
dors decide to expend in devising heuristics to accommodate
relatively uncommon situations efficiently is also likely to
vary. Finally, query optimization is a task complex enough to
require a team of programmers, with specific sub-tasks being
delegated to individual team members, some of whom may
be more skilled than others. In any event, different vendor
engines and versions will generate different plans for the same
situation.
4.2. Full outer joins versus left outer joins
Oracle degrades exponentially for full outer joins, while SQL
Server does not. Full outer joins have been introduced rela-
tively recently into DBMSs. Because they are needed in relatively uncom-
mon situations (the vast majority of “business” databases
do not utilize EAV design), it is possible that Oracle’s imple-
menters invested minimal effort in optimizing their perfor-
mance (to the extent that queries failed with an error once 10 or more attributes were involved), while SQL Server's implementers did not.
With SQL Server, full outer joins perform better than left
joins. The performance of left joins has been improved sig-
nificantly in SQL Server 2005 (which is desirable for these
relatively common operations) but it still falls slightly short
of full outer joins. Oracle 9i outperforms both these versions
by a wide margin for left joins. One explanation of these num-
bers is the relative effort and skill that each vendor brought to
bear on the optimization of these operations.
The older version of SQL Server performs full outer joins
more efficiently than the newer version. Most algorithms
incorporate trade-offs, and it is possible that the revised outer-join algorithm performs much better for the common (left join) situation, while performing worse in the
much less common (full outer join) situation. Oracle 9i’s per-
formance characteristics, which show superlative optimiza-
tion of left joins while showing pathological behavior for full
outer joins, are possibly an extreme example of this trade-off.
4.3. In-memory joins
For all the DBMSs tested, in-memory joins give the best overall
performance, as indicated in Fig. 2, especially when the num-
ber of attributes is large. In the 42-attribute scenario, Methods C and C′ were 1.76 and 1.96 times faster than the next best
method (Method B) for Oracle 9i. For SQL Server 2000 SP4, for
42 attributes, Method C was four times faster than the next
best performing method, Method A. For SQL Server 2005 Beta
2, Method C was 40 times faster than the next best performing
method, Method A.
In the in-memory join, the SQL that is sent to the database
in step 4a above is very simple, so that the DBMS does not need to spend any CPU time trying to optimize it, and returns
a large amount of data. The algorithm is essentially limited
by the rate at which the Java application can deal with the
rows from the resultant dataset. By shifting the work from
the database to an application server, this approach scales
more readily to “Web farm” parallelism [12], where multiple
application servers access a shared database. This approach,
employed in large-scale e-commerce scenarios, is more read-
ily implemented than database-server parallelism. For the lat-
ter, increasing the number of CPUs does not help significantly
unless the data is also partitioned across multiple indepen-
dent disks, because database operations tend to be I/O bound
rather than CPU bound.
This algorithm is not limited by the number of attributes,
but, in the simple version described above, assumes availabil-
ity of sufficient RAM. This assumption is generally reasonable
on present-day commodity hardware with 2 GB-plus of RAM,
but not always so. A more complex but better-scaling version
of the algorithm requires a change in steps 3 and 4 above.
Modified step 3: Compute the worst-case RAM required
per row, Mrow, for the 2-D array (based on the total number
of attributes and their individual data types). Determine the
total RAM available to the program (Mmax). Allocate the 2-D
array with number of rows, Nrows = Mmax/Mrow and number of
columns = number of attributes.
Modified step 4: Replace the single query of step 4 with a
series of queries by traversing the ordered Entity information.
Each query retrieves a horizontal “slice” of the EAV data, such
that the number of distinct entities will not exceed Nrows. (To
do this most directly, determine, for each query, the range of
ordered patient IDs in the Entity data that does not exceed
Nrows.) The filter in each query then takes the form “Study ID = x and CRF ID = y and patient ID between ‘aaa’ and ‘bbb’” (where x and y are already known, and the values ‘aaa’ and ‘bbb’ are determined each time). In each iteration, write out
the filled array to disk, re-initialize it, and increment the range
of patients, until data for all patients is fetched.
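As a concrete sketch of the range computation in the modified step 4, the method below packs whole patients into slices of at most maxRows encounters and emits one BETWEEN filter per slice. It assumes patient IDs compare as strings in the same order used to sort the Entity query; the per-slice pivot and disk write would reuse the Method C code, with the generated clause appended to the query filter.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SliceFilters {

    // encounterPatientIds: the patient ID of every matching encounter, in the same
    // patient/time order used by the Entity query; maxRows: Nrows = Mmax / Mrow.
    public static List<String> betweenFilters(List<String> encounterPatientIds, int maxRows) {
        // Count encounters per patient, preserving patient order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String p : encounterPatientIds) {
            counts.merge(p, 1, Integer::sum);
        }

        List<String> filters = new ArrayList<>();
        String first = null;
        String last = null;
        int rows = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Close the current slice if adding this patient would exceed maxRows.
            if (first != null && rows + e.getValue() > maxRows) {
                filters.add("patient_id BETWEEN '" + first + "' AND '" + last + "'");
                first = null;
                rows = 0;
            }
            if (first == null) {
                first = e.getKey();   // a patient with more encounters than maxRows
            }                         // still forms a single, oversized slice
            last = e.getKey();
            rows += e.getValue();
        }
        if (first != null) {
            filters.add("patient_id BETWEEN '" + first + "' AND '" + last + "'");
        }
        return filters;
    }
}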
5. Conclusions
While several algorithms can be employed for pivoting EAV
data, each approach must be carefully tested on individual
vendor DBMS implementations, and may need to be period-
ically re-evaluated as vendors upgrade their DBMS versions.
The in-memory join, while algorithmically most complex, is
also the most efficient. It is relatively robust to DBMS upgrades,
because the SQL that it uses combines a limited number
of tables, and needs only elementary optimization from the
DBMS perspective.
Acknowledgments
This work was supported in part by NIH Grants U01 CA78266,
K23 RR16042 and institutional funds from Yale University
School of Medicine. The authors would like to thank Perry
Miller for comments that improved this manuscript.
r e f e r e n c e s
[1] S. Johnson, Generic data modeling for clinical repositories,
J. Am. Med. Informatics Assoc. 3 (1996) 328–339.
[2] S.M. Huff, C.L. Berthelsen, T.A. Pryor, A.S. Dudley,
Evaluation of a SQL Model of the help patient database,
in: Proceedings of the 15th Symposium on Computer
Applications in Medical Care, Washington, DC, 1991, pp.
386–390.
[3] S.M. Huff, D.J. Haug, L.E. Stevens, C.C. Dupont, T.A. Pryor,
HELP the next generation: a new client-server architecture,
in: Proceedings of the 18th Symposium on Computer
Applications in Medical Care, Washington, DC, 1994, pp.
271–275.
[4] 3M Corporation (3M Health Information Systems), 3M
Clinical Data Repository (2004), web site: http://www.
3m.com/us/healthcare/his/products/records/data repository.
jhtml, date accessed: September 9, 2004.
[5] Cerner Corporation, The Powerchart Enterprise Clinical
Data Repository (2004), web site: http://www.cerner.
com/uploadedFiles/1230 03PowerChartFlyer.pdf, date
accessed: September 9, 2004.
[6] Oracle Corporation, Oracle Clinical Version 3.0: User’s
Guide (Oracle Corporation, Redwood Shores CA, 1996).
[7] Phase Forward Inc., ClinTrial (2004), web site: http://www.
phaseforward.com/products cdms clintrial.html, date
accessed: 10/4/04.
[8] S. Johnson, G. Hripcsak, J. Chen, P. Clayton, Accessing the
Columbia Clinical Repository, in: Proceedings 18th
Symposium on Computer Applications in Medical Care,
Washington, DC, 1994, pp. 281–285.
[9] D. Tow, SQL Tuning, O'Reilly Books, Sebastopol, CA, 2003.
[10] C. Brandt, P. Nadkarni, L. Marenco, et al., Reengineering a
database for clinical trials management: lessons for
system architects, Control. Clin. Trials 21 (2000) 440–461.
[11] Sybase Corporation, SQL Anywhere Cost Based Query
Optimizer (1999), web site: www.sybase.co.nz/products/
anywhere/optframe.html, date accessed: 12/1/01.
[12] R.J. Chevance, Server Architectures: Multiprocessors,
Clusters, Parallel Systems, Web Servers, Storage Solutions,
Elsevier Digital Press, Burlington, MA, 2004.

More Related Content

What's hot

Data processing and analysis final
Data processing and analysis finalData processing and analysis final
Data processing and analysis final
Akul10
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using weka
rathorenitin87
 
Topic 4 intro spss_stata 30032012 sy_srini
Topic 4 intro spss_stata 30032012 sy_sriniTopic 4 intro spss_stata 30032012 sy_srini
Topic 4 intro spss_stata 30032012 sy_srini
SM Lalon
 
Data Mining with WEKA WEKA
Data Mining with WEKA WEKAData Mining with WEKA WEKA
Data Mining with WEKA WEKA
butest
 

What's hot (17)

Data processing and analysis final
Data processing and analysis finalData processing and analysis final
Data processing and analysis final
 
Data mining techniques using weka
Data mining techniques using wekaData mining techniques using weka
Data mining techniques using weka
 
WEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And AttributesWEKA: Data Mining Input Concepts Instances And Attributes
WEKA: Data Mining Input Concepts Instances And Attributes
 
What is ETL testing & how to enforce it in Data Wharehouse
What is ETL testing & how to enforce it in Data WharehouseWhat is ETL testing & how to enforce it in Data Wharehouse
What is ETL testing & how to enforce it in Data Wharehouse
 
Etl testing
Etl testingEtl testing
Etl testing
 
XL-MINER: Data Utilities
XL-MINER: Data UtilitiesXL-MINER: Data Utilities
XL-MINER: Data Utilities
 
Accounting serx
Accounting serxAccounting serx
Accounting serx
 
SAP BO Web Intelligence Basics
SAP BO Web Intelligence BasicsSAP BO Web Intelligence Basics
SAP BO Web Intelligence Basics
 
Weka tutorial
Weka tutorialWeka tutorial
Weka tutorial
 
Sqlserver interview questions
Sqlserver interview questionsSqlserver interview questions
Sqlserver interview questions
 
Topic 4 intro spss_stata 30032012 sy_srini
Topic 4 intro spss_stata 30032012 sy_sriniTopic 4 intro spss_stata 30032012 sy_srini
Topic 4 intro spss_stata 30032012 sy_srini
 
Data Mining with WEKA WEKA
Data Mining with WEKA WEKAData Mining with WEKA WEKA
Data Mining with WEKA WEKA
 
ETL Testing Training Presentation
ETL Testing Training PresentationETL Testing Training Presentation
ETL Testing Training Presentation
 
Weka
WekaWeka
Weka
 
A WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERY
A WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERYA WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERY
A WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERY
 
An Introduction To Weka
An Introduction To WekaAn Introduction To Weka
An Introduction To Weka
 
ppt on data tab in ms.excel
ppt on data tab in ms.excelppt on data tab in ms.excel
ppt on data tab in ms.excel
 

Viewers also liked (6)

Data migration into eav model
Data migration into eav modelData migration into eav model
Data migration into eav model
 
Optimized Column-Oriented Model: A Storage and Search Efficient Representatio...
Optimized Column-Oriented Model: A Storage and Search Efficient Representatio...Optimized Column-Oriented Model: A Storage and Search Efficient Representatio...
Optimized Column-Oriented Model: A Storage and Search Efficient Representatio...
 
Data model and entity relationship
Data model and entity relationshipData model and entity relationship
Data model and entity relationship
 
Generalization and specialization
Generalization and specializationGeneralization and specialization
Generalization and specialization
 
Can Innovation Labs Save The World?
Can Innovation Labs Save The World?Can Innovation Labs Save The World?
Can Innovation Labs Save The World?
 
Understanding Data
Understanding Data Understanding Data
Understanding Data
 

Similar to Pivoting approach-eav-data-dinu-2006

K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCAREK-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
International Journal of Technical Research & Application
 
Improve The Performance of K-means by using Genetic Algorithm for Classificat...
Improve The Performance of K-means by using Genetic Algorithm for Classificat...Improve The Performance of K-means by using Genetic Algorithm for Classificat...
Improve The Performance of K-means by using Genetic Algorithm for Classificat...
IJECEIAES
 
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
Ahmad C. Bukhari
 

Similar to Pivoting approach-eav-data-dinu-2006 (20)

Patient-Like-Mine
Patient-Like-MinePatient-Like-Mine
Patient-Like-Mine
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
 
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCAREK-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
K-MEANS AND D-STREAM ALGORITHM IN HEALTHCARE
 
Improve The Performance of K-means by using Genetic Algorithm for Classificat...
Improve The Performance of K-means by using Genetic Algorithm for Classificat...Improve The Performance of K-means by using Genetic Algorithm for Classificat...
Improve The Performance of K-means by using Genetic Algorithm for Classificat...
 
Cal Essay
Cal EssayCal Essay
Cal Essay
 
A WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERY
A WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERYA WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERY
A WEB REPOSITORY SYSTEM FOR DATA MINING IN DRUG DISCOVERY
 
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
K Means Clustering Algorithm for Partitioning Data Sets Evaluated From Horizo...
 
Analysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease DatasetAnalysis on Data Mining Techniques for Heart Disease Dataset
Analysis on Data Mining Techniques for Heart Disease Dataset
 
Database Migration Tool
Database Migration ToolDatabase Migration Tool
Database Migration Tool
 
CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013
CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013
CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013
 
Annotating Search Results from Web Databases
Annotating Search Results from Web Databases Annotating Search Results from Web Databases
Annotating Search Results from Web Databases
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome Measures
 
Presentation1
Presentation1Presentation1
Presentation1
 
Bo4301369372
Bo4301369372Bo4301369372
Bo4301369372
 
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data SetsHortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
 
Leveraging CEDAR workbench for ontology-linked submission of adaptive immune ...
Leveraging CEDAR workbench for ontology-linked submission of adaptive immune ...Leveraging CEDAR workbench for ontology-linked submission of adaptive immune ...
Leveraging CEDAR workbench for ontology-linked submission of adaptive immune ...
 
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
Leveraging the CEDAR Workbench for Ontology-linked Submission of Adaptive Imm...
 
DBMS - Introduction
DBMS - IntroductionDBMS - Introduction
DBMS - Introduction
 
A robust data treatment approach for fuel cells system analysis
A robust data treatment approach for fuel cells system analysisA robust data treatment approach for fuel cells system analysis
A robust data treatment approach for fuel cells system analysis
 

More from Không còn Phù Hợp (6)

Block types
Block typesBlock types
Block types
 
Math g3-m1-topic-a-lesson-1
Math g3-m1-topic-a-lesson-1Math g3-m1-topic-a-lesson-1
Math g3-m1-topic-a-lesson-1
 
Math g3-m1-topic-a-lesson-2
Math g3-m1-topic-a-lesson-2Math g3-m1-topic-a-lesson-2
Math g3-m1-topic-a-lesson-2
 
Hdfs design
Hdfs designHdfs design
Hdfs design
 
Storage and management of semi structured data
Storage and management of semi structured dataStorage and management of semi structured data
Storage and management of semi structured data
 
Week 04-actor&uses case
Week 04-actor&uses caseWeek 04-actor&uses case
Week 04-actor&uses case
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Pivoting approach-eav-data-dinu-2006

  • 1. computer methods and programs in biomedicine 8 2 ( 2 0 0 6 ) 38–43 journal homepage: www.intl.elsevierhealth.com/journals/cmpb Pivoting approaches for bulk extraction of Entity–Attribute–Value data Valentin Dinu∗ , Prakash Nadkarni, Cynthia Brandt Center for Medical Informatics, Yale University School of Medicine, PO Box 208009, New Haven, CT 06520-8009, United States a r t i c l e i n f o Article history: Received 6 October 2005 Received in revised form 2 February 2006 Accepted 3 February 2006 Keywords: Databases Entity–Attribute–Value Clinical patient record systems Clinical study data management systems a b s t r a c t Entity–Attribute–Value (EAV) data, as present in repositories of clinical patient data, must be transformed (pivoted) into one-column-per-parameter format before it can be used by a variety of analytical programs. Pivoting approaches have not been described in depth in the literature, and existing descriptions are dated. We describe and benchmark three alternative algorithms to perform pivoting of clinical data in the context of a clinical study data management system. We conclude that when the number of attributes to be returned is not too large, it is feasible to use static SQL as the basis for views on the data. An alternative but more complex approach that utilizes hash tables and the presence of abundant random- access-memory can achieve improved performance by reducing the load on the database server. © 2006 Published by Elsevier Ireland Ltd. 1. Introduction The “generic” or Entity–Attribute–Value (EAV) data modeling approach [1] is used in database design when a potentially vast number of parameters can describe something, but rel- atively few apply to a given instance. This model is espe- cially appropriate for physical representation of the clinical data sub-schema of clinical data repositories (CDRs), where the total number of possible parameters across all special- ties of medicine ranges in the hundreds of thousands. Large- scale systems that utilize EAV design for clinical data are the HELP CDR [2,3] and its commercial version, the 3M CDR [4], the Columbia-Presbyterian CDR, the Cerner PowerChart Enter- prise CDR [5], and the clinical study data management systems (CSDMS) Oracle Clinical [6] and Phase Forward’s ClinTrial [7]. Conceptually, an EAV table consists of triplets, an Entity (the thing being described, e.g., a patient’s clinical encounter), an Attribute (a parameter of interest) and a Value (for that attribute). Because all types of facts reside in the same ∗ Corresponding author. Fax: +1 203 737 5708. E-mail address: Valentin.Dinu@yale.edu (V. Dinu). “value” column, EAV data needs transforming (pivoting) into one-column-per-parameter structure for use by applications such as graphing and many statistical analyses. This oper- ation lacks built-in support in the current versions of rela- tional database engines. (Note: SQL Server 2005 includes a newly introduced PIVOT command. This command, however, requires an aggregate function to be specified, so that its out- put includes transformed values such as the mean, sum, etc. rather than the original values that are desired in the basic EAV-pivoting operation.) The issues of pivoting clinical EAV data have not been previously explored in depth. 
A pivoting operation extracts a subset of data from the repository, e.g., all results for a test panel or questionnaire gathered over a period, or for a given clinical study, yielding a table with as many columns as parameters of interest, with additional columns identifying individual rows, e.g., patient ID and time stamps. The table is often written to disk as a delimited text file for importing by other applications. The most detailed existing description of pivoting, Johnson et al. 0169-2607/$ – see front matter © 2006 Published by Elsevier Ireland Ltd. doi:10.1016/j.cmpb.2006.02.001
  • 2. computer methods and programs in biomedicine 8 2 ( 2 0 0 6 ) 38–43 39 [8], describes the use of data access modules (DAMs) in the Columbia Clinical Repository—procedural code to implement the equivalent of pivoted “views”. While noting that DAMs are “complex and hard to modify to meet the needs of appli- cation developers in a timely manner”, the authors identify several limitations of alternative approaches such as static SQL views, e.g., a static SQL view needs as many joins as attributes of interest. (Note: In some highly normalized EAV schemas, one would need twice or thrice this number.) In 1994, most database engines limited the number of joins per statement to a relatively small number, e.g., 16. These limits are now more generous (e.g., 256 joins in SQL Server 2000). A well-known SQL tuning text [9] mentions optimization of a 115-join production query. The issue, however, is whether per- formance of static pivoting SQL scales well with the number of parameters. This paper describes the general pivoting problem, and benchmarks alternative pivoting approaches. The three approaches explored here were tested using TrialDB [10], a pro- duction CSDMS at Yale, with an agoraphobia case report form (CRF) used in a psychiatry study (Dr. Scott Woods, PI), contain- ing 42 integer attributes. All approaches used SQL that was generated dynamically by code that accessed metadata—i.e., the IDs of all parameters in the questionnaire, along with their serial ordering. The three different approaches mea- sured performance as a function of number of attributes by progressively generating and executing a series of SQL state- ments, each incorporating data for an additional attribute. TrialDB uses a normalized design, storing “Entity” informa- tion (patient ID, study ID, CRF ID, etc.) in an “Entity/Clinical Encounter” table with a machine-generated “Entity ID” pri- mary key. Data-type specific EAV tables comprise triplets (Entity ID, Attribute ID and Value). 2. Methods Benchmarks were written in Java and ran against three DBMSs: Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2. (While two of the systems are by the same ven- dor, our results indicate that the query execution engines of the two are quite different.) The benchmarks for each database used the same schema and data, and utilized the same indexes. The databases and application ran on the same dedicated machine (single-CPU 1.8 GHz Pentium 4 with 1 GB RAM), to eliminate the factor of network bandwidth. At the time of benchmarking, no applications were running on the computer except the Java code and the database being tested. We ran each test at least three times and averaged the results. The test database schema, test data set, Java code, generated queries and detailed benchmarks are available via ftp://custard.med.yale.edu/pivot benchmarks.zip. Formulating the general pivoting problem: For individual encounters for particular patients, inapplicable items on a CRF may be left empty: e.g., questions regarding diabetes treat- ment apply only for patients with diabetes. Because empty values are not stored in EAV tables, the number of data points/values is generally unequal across all attributes in a set. The creation of a rectangular, one-column-per-attribute table from longitudinal “strips” representing values for indi- vidual attributes conceptually requires full outer join opera- tions, where non-matching rows on either side of a join are preserved, and missing values recorded as “nulls”. 
Currently, most mainstream database engines support “full outer joins” natively in SQL. Below we discuss three methods that can generate the same pivoted output table using different approaches: full outer join; left outer join; and hash tables performing in mem- ory the equivalent of multiple joins. 2.1. Method A: using full outer joins Algorithm. Any given statement generates one strip of data per attribute of interest, and then combines the strips using a series of FULL OUTER JOIN operations. Each strip is created through an inner join between the Entities table and the EAV data table—the former being filtered on Study ID and CRF ID, since the same CRF can be used across multiple studies, and the latter filtered on Attribute ID. In addition, for N attributes, N − 1 full outer joins are needed. The total number of join oper- ations per statement is therefore 2 × N − 1. 2.2. Method B: using left outer joins Algorithm. We determine essential Entity information (Encounter ID, patient ID, time stamps) on the total number of clinical encounters (641) by filtering the Entities table alone on Study ID and CRF ID. We join this information with each strip of data, generated as above. However, we use LEFT OUTER JOIN operations, where complete Entity information merges with whatever matches for each attribute. The total number of joins per statement is N inner joins (to generate each attribute’s data), plus N outer joins for merging = 2 × N. 2.3. Method C: using hash tables and memory to perform the equivalent of multiple joins Any strategy that generates SQL to join an arbitrary number of tables in a single statement runs the risk of encounter- ing the 256-joins-per-query limit, which corresponds to 128 attributes. Several case-report forms, notably certain psychia- try questionnaires, can exceed this threshold, and so alter- native approaches must be explored. We describe such an approach, using extensible hash tables (a standard component of most modern programming libraries such as Java and the .NET framework) to perform pivoting. Algorithm. Step 1. We already know the IDs of the attributes of interest and their serial order of presentation in the final out- put. We load this information into a hash table, with attribute ID as key and serial number as value. The hash table enables us to determine in constant time that attribute ID 1568, for example, is in position 12. Step 2. We execute a query that fetches complete entity information, ordered by patient ID and time stamp.
  • 3. 40 computer methods and programs in biomedicine 8 2 ( 2 0 0 6 ) 38–43 The ordering information is used to make the basic algorithm more scalable if needed, as described later. We capture this data from the database and use it to create a second hash table, with Entity/Encounter ID as key and row number in the array as value—e.g., Encounter# 14568 is in row 45. Step 3. We dynamically allocate a two-dimensional array of strings (number of entities X number of attributes). All elements are initialized to blanks. Step 4a. A query fetches all EAV triplets (Encounter ID, Attribute ID, Value) for the given Study ID, CRF ID and Attribute IDs, via a join to the Entities table. (In a slight variation to this method, which we will call Method C , one can retrieve the EAV triplets for all the Attribute IDs for the given Study ID and CRF ID. This variation decreases the load on the database and can avoid the performance degra- dation seen in one of the tested databases—see Section 3.) Step 4b. Iterating through each returned row, we place the Value in the 2-D array in the row and column indi- cated by its corresponding Encounter and Attribute IDs, respectively: the two hash tables allow speedy row/column determination. At the end of the iter- ations, empty values remain blank. For a situa- tion where the EAV data is stored across multi- ple data-type-specific tables (e.g., strings, integers, decimal numbers, as in TrialDB), one would repeat the second query for each necessary EAV table, as determined by metadata that indicated how many attributes of each data-type existed for the desired set. 3. Results of benchmark tests The results for the benchmark tests using the three meth- ods (A–C, described above) on the three DBMSs (Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2) are illus- trated in Figs. 1 and 2. To facilitate comparison across both method and DBMS, Fig. 1 illustrates the results grouped by method, while Fig. 2 illustrates the same results grouped by DBMS. 3.1. Results for Method A: using full outer joins (Fig. 1A) Oracle 9i: Execution time increased exponentially, from 76 milliseconds (ms) for one attribute to 56,968 ms for nine attributes. (Pearson R2 for log(time) versus number of attributes = 0.955.) The Java process crashed with a SQL excep- tion on attempting a 10-attribute merge. Inserting a variety of optimizer hints in the generated SQL did not help, and further experiments were halted. SQL Server: Execution time increased at a quadratic rate with the number of attributes (R2 = 0.994 for SQL Server 2000 SP4, R2 = 0.999 for SQL Server 2005 Beta 2). The query run times for 1–42 attributes ranged from 206 to 8055 ms on SQL Server 2000 SP4 and from 87 to 18961 ms on SQL Server 2005 Beta 2. 3.2. Results for Method B: using left outer joins (Fig. 1B) Oracle 9i: Performance scaled linearly, from 67 ms for one attribute, to 965 ms for 42 attributes, the last involving a total of 82 join operations in a single SQL statement (R2 for time versus number of attributes = 0.996). SQL Server: Execution times were much higher—ranging from 190 ms for 1 attribute to above 22 s (s) and 105 s for SQL Server 2005 Beta 2 and SQL Server 2000 SP4, respectively. Exe- cution times on SQL Server had an exponential growth for the first 11–12 attributes (R2 = 0.981 for SQL Server 2000 SP4, R2 = 0.990 for SQL Server 2005 Beta 2), followed by a linear increase for more than 13 attributes (R2 = 0.990 for SQL Server 2000 SP4, R2 = 0.959 for SQL Server 2005 Beta 2). 
On SQL Server, the execution times for left outer join were higher than the times for the full outer join. 3.3. Results for Method C: using hash tables and memory to perform the equivalent of multiple joins (Fig. 1C) Oracle 9i: Execution time grew linearly (R2 = 0.946) from 80 ms for 1 attribute to 547 ms for 42 attributes. SQL Server: The behavior of SQL Server 2000 SP4 differed considerably from that of SQL Server 2005 Beta 2. SQL Server 2005 Beta 2: Execution time grew linearly (R2 = 0.967) from 77 ms for 1 attribute to 474 ms for 42 attributes. SQL Server 2000 SP4: A notable and increasing performance degradation was observed for about 9–20 attributes, beyond which the execution times became lower and grew more lin- early at a smaller rate. This behavior, presumably due to SQL Server 2000s attempt to optimize the query, prompted us to try bypassing the SQL Server’s optimization attempt. In this slightly modified method, called Method C , we retrieved from the database the values for all the attributes (for the given Study ID and CRF ID) and only the desired values were then picked to be stored in the pivoted array—see Fig. 1D. As expected, the times for Method C stayed almost flat, independent of the number of attributes desired—since most of the time was spent executing the query and then iterat- ing through all the retrieved values to see if they need to be stored in the pivoted array or not. This modification removed the irregular increase in execution times for SQL Server 2000 SP4, but also resulted in higher execution times for Oracle 9i and SQL Server 2005 beta2, especially for a small number of attributes. 4. Discussion 4.1. Possible explanations of results It is well known that the multi-table join problem is NP-hard with respect to performance optimization. The number of join orders to be evaluated to determine the join order that gives the fastest performance grows exponentially with the num- ber of tables to be joined [11]. If the number of tables is large enough, the CPU time spent by the query execution engine on
Fig. 1 – Execution times for Methods A–C are compared on three different databases: Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2. The x-axis represents the number of attributes for which the values are retrieved and loaded into an array; the y-axis represents the execution time in milliseconds (ms). (A) Execution times for Method A, using full outer joins to retrieve the values for the desired attributes and load the data into an array. Oracle 9i times increased exponentially (R2 = 0.955) with the number of attributes, and the query failed with a SQL error at 10 attributes. The times on SQL Server increased at a quadratic rate (R2 = 0.994 for SQL Server 2000 SP4, R2 = 0.999 for SQL Server 2005 Beta 2). (B) Execution times for Method B, using left outer joins to retrieve the values for the desired attributes and load the data into an array. Execution times on SQL Server grew exponentially for the first 11–12 attributes (R2 = 0.981 for SQL Server 2000 SP4, R2 = 0.990 for SQL Server 2005 Beta 2), followed by a linear increase beyond 13 attributes (R2 = 0.990 for SQL Server 2000 SP4, R2 = 0.959 for SQL Server 2005 Beta 2). The times on Oracle 9i increased linearly (R2 = 0.996) and were much lower than the SQL Server times. (C) Execution times for Method C, using the in-memory hash tables to pivot the values for the desired attributes and load the data into an array. In this method, only the values for the desired attributes were selected from the database. Execution times on Oracle 9i and SQL Server 2005 Beta 2 grew linearly (R2 = 0.946 for Oracle 9i, R2 = 0.967 for SQL Server 2005 Beta 2), with the values for the latter slightly lower in magnitude. For SQL Server 2000 SP4, a notable and increasing performance degradation was observed for about 9–20 attributes, after which the execution times became lower and grew more linearly at a smaller rate. This odd behavior, presumably due to SQL Server 2000's attempt to optimize the query, prompted us to bypass the optimizer and to investigate the alternative method in which the values for all the attributes (for a desired trial) were retrieved from the database and only the desired values were then stored in the pivoted array; see Method C′ in Fig. 1D. (D) Execution times for Method C′, a variation of Method C in which all the values for all the attributes (for a desired trial) were retrieved from the database and only the desired values were then stored in the pivoted array. As expected, the times stayed almost constant, since most of the time was spent executing the query and then iterating through all the retrieved values to determine whether they needed to be stored in the array.
Fig. 2 – For each of the tested databases (Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2), the performance times for the various methods are compared. The same logarithmic time scale is used for all databases, to facilitate side-by-side comparison. For 42 attributes, Method C provided 1.76-, 4- and 40-fold performance improvements when compared to the next best performing method (A or B) for Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2, respectively.

Modern DBMSs achieve their impressive performance through a combination of heuristics (e.g., the presence or absence of indexes) and the use of stored database statistics, such as table sizes and data distributions. This information lets them select query execution plans that may not be absolutely optimal, but are reasonably close to optimal, and that can be determined in polynomial or even linear time. The strategies are understandably vendor-specific. Since query performance is an area where vendors compete vigorously, these strategies are likely to change between DBMS versions. Further, the amount of intellectual effort that vendors decide to expend in devising heuristics to handle relatively uncommon situations efficiently is also likely to vary. Finally, query optimization is a task complex enough to require a team of programmers, with specific sub-tasks delegated to individual team members, some of whom may be more skilled than others. In any event, different vendor engines and versions will generate different plans for the same situation.

4.2. Full outer joins versus left outer joins

Oracle degrades exponentially for full outer joins, while SQL Server does not. Full outer joins have been introduced relatively recently into DBMSs. Because they are needed in relatively uncommon situations (the vast majority of "business" databases do not use EAV design), it is possible that Oracle's implementers invested minimal effort in optimizing their performance (to the extent of failing with an error once the query merges 10 attributes), while SQL Server's implementers did not.

With SQL Server, full outer joins perform better than left joins. The performance of left joins has been improved significantly in SQL Server 2005 (which is desirable, since these are relatively common operations), but it still falls slightly short of that of full outer joins. Oracle 9i outperforms both these versions by a wide margin for left joins. One explanation of these numbers is the relative effort and skill that each vendor brought to bear on the optimization of these operations.

The older version of SQL Server performs full outer joins more efficiently than the newer version. Most algorithms incorporate trade-offs, and it is possible that the revised outer-join algorithm in SQL Server 2005 performs much better in the common (left join) case, while performing worse in the much less common (full outer join) case. Oracle 9i's performance characteristics, which show superlative optimization of left joins while showing pathological behavior for full outer joins, are possibly an extreme example of this trade-off.
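For concreteness, the sketch below shows one plausible shape of a dynamically generated Method B (left outer join) pivot statement; the schema names are assumptions, the generator described earlier in the paper may differ in detail, and the Method A statements would use full outer joins instead of left outer joins.

import java.util.List;

// Hypothetical generator (not the authors' code) showing the shape of a
// Method B pivot statement: one LEFT OUTER JOIN per requested attribute.
// Table and column names are illustrative; bind variables are omitted for brevity.
public class LeftJoinPivotSql {
    public static String build(int studyId, int crfId, List<Integer> attrIds) {
        StringBuilder sql = new StringBuilder("SELECT e.encounter_id");
        for (int i = 0; i < attrIds.size(); i++) {
            sql.append(", v").append(i).append(".value AS attr_").append(attrIds.get(i));
        }
        sql.append(" FROM entities e");
        for (int i = 0; i < attrIds.size(); i++) {
            // Each attribute contributes one aliased copy of the EAV table; the
            // measured Method B statements involved roughly two joins per attribute,
            // presumably because values live in data-type-specific tables.
            sql.append(" LEFT OUTER JOIN eav_data v").append(i)
               .append(" ON v").append(i).append(".encounter_id = e.encounter_id")
               .append(" AND v").append(i).append(".attribute_id = ").append(attrIds.get(i));
        }
        sql.append(" WHERE e.study_id = ").append(studyId)
           .append(" AND e.crf_id = ").append(crfId);
        return sql.toString();
    }
}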
4.3. In-memory joins

For all the DBMSs tested, the in-memory join gives the best overall performance, as shown in Fig. 2, especially when the number of attributes is large. In the 42-attribute scenario, Methods C and C′ were 1.76 and 1.96 times faster, respectively, than the next best method (Method B) for Oracle 9i. For SQL Server 2000 SP4 at 42 attributes, Method C was four times faster than the next best performing method, Method A. For SQL Server 2005 Beta 2, Method C was 40 times faster than the next best performing method, Method A.
In the in-memory join, the SQL that is sent to the database in Step 4a above is very simple, so the DBMS does not need to spend any CPU time trying to optimize it, and it returns a large amount of data. The algorithm is essentially limited by the rate at which the Java application can process the rows of the resulting dataset. By shifting the work from the database to an application server, this approach scales more readily to "Web farm" parallelism [12], where multiple application servers access a shared database. This approach, employed in large-scale e-commerce scenarios, is more readily implemented than database-server parallelism. For the latter, increasing the number of CPUs does not help significantly unless the data is also partitioned across multiple independent disks, because database operations tend to be I/O bound rather than CPU bound.

This algorithm is not limited by the number of attributes, but, in the simple version described above, it assumes the availability of sufficient RAM. This assumption is generally reasonable on present-day commodity hardware with 2 GB or more of RAM, but not always so. A more complex but better-scaling version of the algorithm requires a change to Steps 3 and 4 above.

Modified Step 3: Compute the worst-case RAM required per row of the 2-D array, Mrow (based on the total number of attributes and their individual data types). Determine the total RAM available to the program, Mmax. Allocate the 2-D array with number of rows Nrows = Mmax/Mrow and number of columns equal to the number of attributes.

Modified Step 4: Replace the single query of Step 4 with a series of queries that traverse the ordered Entity information. Each query retrieves a horizontal "slice" of the EAV data, such that the number of distinct entities does not exceed Nrows. (To do this most directly, determine, for each query, the range of ordered patient IDs in the Entity data that does not exceed Nrows.) The filter in each query then takes the form "Study ID = x and CRF ID = y and patient ID between 'aaa' and 'bbb'" (where x and y are already known, and the values 'aaa' and 'bbb' are determined for each slice). In each iteration, write out the filled array to disk, re-initialize it, and advance the range of patients, until the data for all patients has been fetched.
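The loop implied by the modified steps might look like the following minimal Java sketch; the schema names (eav_data, patient_id, etc.), the 64-byte per-cell estimate and the tab-delimited output are assumptions made only to keep the example self-contained.

// Sketch of the memory-bounded variant (modified Steps 3 and 4).
import java.io.*;
import java.sql.*;
import java.util.*;

public class SlicedEavPivot {
    public static void pivotInSlices(Connection con, int studyId, int crfId,
                                     List<Integer> attrIds, List<String> orderedPatientIds,
                                     long maxBytes, Writer out) throws SQLException, IOException {
        // Modified Step 3: size the array from the available memory.
        long bytesPerValue = 64;                        // crude worst-case estimate per cell
        long mRow = bytesPerValue * attrIds.size();     // worst-case RAM per row (Mrow)
        int nRows = (int) Math.max(1, maxBytes / mRow); // Nrows = Mmax / Mrow

        Map<Integer, Integer> attrCol = new HashMap<>();
        for (int i = 0; i < attrIds.size(); i++) attrCol.put(attrIds.get(i), i);

        String inList = String.join(",", Collections.nCopies(attrIds.size(), "?"));
        String sql = "SELECT d.patient_id, d.attribute_id, d.value FROM eav_data d " +
                     "WHERE d.study_id = ? AND d.crf_id = ? AND d.attribute_id IN (" + inList + ") " +
                     "AND d.patient_id BETWEEN ? AND ?";

        // Modified Step 4: one query per horizontal slice of at most nRows patients.
        for (int start = 0; start < orderedPatientIds.size(); start += nRows) {
            int end = Math.min(start + nRows, orderedPatientIds.size());
            List<String> slice = orderedPatientIds.subList(start, end);
            Map<String, Integer> patientRow = new HashMap<>();
            for (int i = 0; i < slice.size(); i++) patientRow.put(slice.get(i), i);

            String[][] grid = new String[slice.size()][attrIds.size()];
            for (String[] row : grid) Arrays.fill(row, "");

            try (PreparedStatement ps = con.prepareStatement(sql)) {
                int p = 1;
                ps.setInt(p++, studyId);
                ps.setInt(p++, crfId);
                for (int id : attrIds) ps.setInt(p++, id);
                ps.setString(p++, slice.get(0));                 // 'aaa'
                ps.setString(p++, slice.get(slice.size() - 1));  // 'bbb'
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        Integer r = patientRow.get(rs.getString(1));
                        Integer c = attrCol.get(rs.getInt(2));
                        String v = rs.getString(3);
                        if (r != null && c != null && v != null) grid[r][c] = v;
                    }
                }
            }
            // Write the filled slice to disk before advancing to the next patient range.
            for (int i = 0; i < slice.size(); i++) {
                out.write(slice.get(i) + "\t" + String.join("\t", grid[i]) + "\n");
            }
        }
        out.flush();
    }
}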
5. Conclusions

While several algorithms can be employed for pivoting EAV data, each approach must be carefully tested on individual vendor DBMS implementations, and may need to be periodically re-evaluated as vendors upgrade their DBMS versions. The in-memory join, while algorithmically the most complex, is also the most efficient. It is relatively robust to DBMS upgrades, because the SQL that it uses combines only a limited number of tables and needs only elementary optimization from the DBMS.

Acknowledgments

This work was supported in part by NIH Grants U01 CA78266, K23 RR16042 and institutional funds from Yale University School of Medicine. The authors would like to thank Perry Miller for comments that improved this manuscript.

References

[1] S. Johnson, Generic data modeling for clinical repositories, J. Am. Med. Informatics Assoc. 3 (1996) 328–339.
[2] S.M. Huff, C.L. Berthelsen, T.A. Pryor, A.S. Dudley, Evaluation of a SQL model of the HELP patient database, in: Proceedings of the 15th Symposium on Computer Applications in Medical Care, Washington, DC, 1991, pp. 386–390.
[3] S.M. Huff, D.J. Haug, L.E. Stevens, C.C. Dupont, T.A. Pryor, HELP the next generation: a new client-server architecture, in: Proceedings of the 18th Symposium on Computer Applications in Medical Care, Washington, DC, 1994, pp. 271–275.
[4] 3M Corporation (3M Health Information Systems), 3M Clinical Data Repository (2004), web site: http://www.3m.com/us/healthcare/his/products/records/data repository.jhtml, date accessed: September 9, 2004.
[5] Cerner Corporation, The PowerChart Enterprise Clinical Data Repository (2004), web site: http://www.cerner.com/uploadedFiles/1230 03PowerChartFlyer.pdf, date accessed: September 9, 2004.
[6] Oracle Corporation, Oracle Clinical Version 3.0: User's Guide, Oracle Corporation, Redwood Shores, CA, 1996.
[7] Phase Forward Inc., ClinTrial (2004), web site: http://www.phaseforward.com/products cdms clintrial.html, date accessed: October 4, 2004.
[8] S. Johnson, G. Hripcsak, J. Chen, P. Clayton, Accessing the Columbia Clinical Repository, in: Proceedings of the 18th Symposium on Computer Applications in Medical Care, Washington, DC, 1994, pp. 281–285.
[9] D. Tow, SQL Tuning, O'Reilly Books, Sebastopol, CA, 2003.
[10] C. Brandt, P. Nadkarni, L. Marenco, et al., Reengineering a database for clinical trials management: lessons for system architects, Control. Clin. Trials 21 (2000) 440–461.
[11] Sybase Corporation, SQL Anywhere Cost-Based Query Optimizer (1999), web site: www.sybase.co.nz/products/anywhere/optframe.html, date accessed: December 1, 2001.
[12] R.J. Chevance, Server Architectures: Multiprocessors, Clusters, Parallel Systems, Web Servers, Storage Solutions, Elsevier Digital Press, Burlington, MA, 2004.