This document describes and benchmarks three approaches for extracting Entity-Attribute-Value (EAV) data from clinical databases into a tabular (pivoted) format:
1. Using full outer joins, which performed poorly as the number of attributes grew, especially in Oracle.
2. Using left outer joins, which scaled better in Oracle but scaled much worse in SQL Server, with initially exponential growth in execution time.
3. Using hash tables in memory to perform the equivalent of multiple joins, which gave the best and most nearly linear performance across the databases tested.
Modifying the third approach to retrieve all attribute values up front removed the irregular performance degradation seen in SQL Server 2000, but increased execution times for smaller numbers of attributes on the other databases.
Johnson et al. [8] describe the use of data access modules (DAMs) in the
Columbia Clinical Repository—procedural code to implement
the equivalent of pivoted “views”. While noting that DAMs
are “complex and hard to modify to meet the needs of appli-
cation developers in a timely manner”, the authors identify
several limitations of alternative approaches such as static
SQL views, e.g., a static SQL view needs as many joins as
attributes of interest. (Note: In some highly normalized EAV
schemas, one would need twice or thrice this number.) In
1994, most database engines limited the number of joins per
statement to a relatively small number, e.g., 16. These limits
are now more generous (e.g., 256 joins in SQL Server 2000).
A well-known SQL tuning text [9] mentions optimization of a
115-join production query. The issue, however, is whether per-
formance of static pivoting SQL scales well with the number of
parameters.
This paper describes the general pivoting problem, and
benchmarks alternative pivoting approaches. The three
approaches explored here were tested using TrialDB [10], a production clinical study data management system (CSDMS) at Yale, with an agoraphobia case report form
(CRF) used in a psychiatry study (Dr. Scott Woods, PI), contain-
ing 42 integer attributes. All approaches used SQL that was
generated dynamically by code that accessed metadata—i.e.,
the IDs of all parameters in the questionnaire, along with
their serial ordering. The three different approaches mea-
sured performance as a function of number of attributes by
progressively generating and executing a series of SQL state-
ments, each incorporating data for an additional attribute.
TrialDB uses a normalized design, storing “Entity” informa-
tion (patient ID, study ID, CRF ID, etc.) in an “Entity/Clinical
Encounter” table with a machine-generated “Entity ID” pri-
mary key. Data-type specific EAV tables comprise triplets
(Entity ID, Attribute ID and Value).
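For concreteness, the illustrative sketches in Section 2 below assume a simplified two-table version of this design; the following DDL is illustrative only, and its names and types are assumptions rather than TrialDB's actual schema.

    // Simplified, assumed version of the TrialDB design used by the sketches below:
    // one Entities/Clinical-Encounter table plus a single integer-valued EAV table.
    public final class AssumedSchema {
        public static final String ENTITIES_DDL =
            "CREATE TABLE entities (entity_id INT PRIMARY KEY, patient_id VARCHAR(16),"
            + " study_id INT, crf_id INT, encounter_ts TIMESTAMP)";
        public static final String EAV_INT_DDL =
            "CREATE TABLE eav_int (entity_id INT NOT NULL, attribute_id INT NOT NULL,"
            + " value INT, PRIMARY KEY (entity_id, attribute_id))";
    }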
2. Methods
Benchmarks were written in Java and ran against three
DBMSs: Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005
Beta 2. (While two of the systems are by the same ven-
dor, our results indicate that the query execution engines
of the two are quite different.) The benchmarks for each
database used the same schema and data, and utilized the
same indexes. The databases and application ran on the
same dedicated machine (single-CPU 1.8 GHz Pentium 4 with
1 GB RAM), to eliminate the factor of network bandwidth. At
the time of benchmarking, no applications were running on
the computer except the Java code and the database being
tested.
We ran each test at least three times and averaged the
results. The test database schema, test data set, Java code,
generated queries and detailed benchmarks are available via
ftp://custard.med.yale.edu/pivot benchmarks.zip.
Formulating the general pivoting problem: For individual
encounters for particular patients, inapplicable items on a CRF
may be left empty: e.g., questions regarding diabetes treat-
ment apply only for patients with diabetes. Because empty
values are not stored in EAV tables, the number of data
points/values is generally unequal across all attributes in a
set. The creation of a rectangular, one-column-per-attribute
table from longitudinal “strips” representing values for indi-
vidual attributes conceptually requires full outer join opera-
tions, where non-matching rows on either side of a join are
preserved, and missing values recorded as “nulls”. Currently,
most mainstream database engines support “full outer joins”
natively in SQL.
Below we discuss three methods that can generate the
same pivoted output table using different approaches: full
outer join; left outer join; and hash tables performing in mem-
ory the equivalent of multiple joins.
2.1. Method A: using full outer joins
Algorithm. Any given statement generates one strip of data
per attribute of interest, and then combines the strips using
a series of FULL OUTER JOIN operations. Each strip is created
through an inner join between the Entities table and the EAV
data table—the former being filtered on Study ID and CRF ID,
since the same CRF can be used across multiple studies, and
the latter filtered on Attribute ID. In addition, for N attributes,
N − 1 full outer joins are needed. The total number of join oper-
ations per statement is therefore 2 × N − 1.
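To illustrate how such a statement might be generated dynamically from metadata, the following Java sketch (Java being the language of the benchmarks) builds a Method A query over the simplified schema assumed above. The SQL shape and all identifiers are assumptions for illustration, not the benchmark code itself; note that a running COALESCE is used to keep the join key valid when an earlier strip has no row for a given entity.

    import java.util.List;

    public class FullOuterJoinPivot {

        // One "strip": an inner join of the Entities table (filtered on study and CRF)
        // with the EAV data table (filtered on one attribute ID).
        static String strip(int attrId, int studyId, int crfId) {
            return "(SELECT e.entity_id, d.value AS a" + attrId
                 + " FROM entities e JOIN eav_int d ON d.entity_id = e.entity_id"
                 + " WHERE e.study_id = " + studyId
                 + " AND e.crf_id = " + crfId
                 + " AND d.attribute_id = " + attrId + ")";
        }

        // Combines N strips with N - 1 FULL OUTER JOINs (2 x N - 1 joins in all).
        public static String buildQuery(List<Integer> attrIds, int studyId, int crfId) {
            StringBuilder sql = new StringBuilder("SELECT * FROM ")
                .append(strip(attrIds.get(0), studyId, crfId)).append(" s0");
            String key = "s0.entity_id";
            for (int i = 1; i < attrIds.size(); i++) {
                String alias = "s" + i;
                sql.append(" FULL OUTER JOIN ")
                   .append(strip(attrIds.get(i), studyId, crfId)).append(' ').append(alias)
                   .append(" ON ").append(alias).append(".entity_id = ").append(key);
                // Non-matching rows leave earlier entity_id columns null, so the key
                // for the next join must coalesce across all strips seen so far.
                key = "COALESCE(" + key + ", " + alias + ".entity_id)";
            }
            return sql.toString();
        }
    }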
2.2. Method B: using left outer joins
Algorithm. We determine essential Entity information
(Encounter ID, patient ID, time stamps) on the total number of
clinical encounters (641) by filtering the Entities table alone on
Study ID and CRF ID. We join this information with each strip of
data, generated as above. However, we use LEFT OUTER JOIN
operations, where complete Entity information merges with
whatever matches for each attribute. The total number of joins
per statement is N inner joins (to generate each attribute’s
data), plus N outer joins for merging = 2 × N.
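Under the same assumptions, a Method B statement can be sketched by left-outer-joining each attribute strip (reusing the hypothetical strip() helper from the Method A sketch above) onto the complete, filtered entity list:

    // Sketch of Method B: the filtered entity list drives N LEFT OUTER JOINs, one per
    // attribute strip (N inner joins inside the strips + N outer joins = 2 x N).
    public static String buildLeftJoinQuery(List<Integer> attrIds, int studyId, int crfId) {
        StringBuilder sql = new StringBuilder(
            "SELECT * FROM (SELECT entity_id, patient_id, encounter_ts FROM entities"
            + " WHERE study_id = " + studyId + " AND crf_id = " + crfId + ") ent");
        for (int i = 0; i < attrIds.size(); i++) {
            sql.append(" LEFT OUTER JOIN ")
               .append(strip(attrIds.get(i), studyId, crfId)).append(" s").append(i)
               .append(" ON s").append(i).append(".entity_id = ent.entity_id");
        }
        return sql.toString();
    }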
2.3. Method C: using hash tables and memory to
perform the equivalent of multiple joins
Any strategy that generates SQL to join an arbitrary number
of tables in a single statement runs the risk of encounter-
ing the 256-joins-per-query limit, which corresponds to 128
attributes. Several case-report forms, notably certain psychia-
try questionnaires, can exceed this threshold, and so alter-
native approaches must be explored. We describe such an
approach, using extensible hash tables (a standard component
of most modern programming libraries such as Java and the
.NET framework) to perform pivoting.
Algorithm.
Step 1. We already know the IDs of the attributes of interest
and their serial order of presentation in the final out-
put. We load this information into a hash table, with
attribute ID as key and serial number as value. The
hash table enables us to determine in constant time
that attribute ID 1568, for example, is in position 12.
Step 2. We execute a query that fetches complete entity
information, ordered by patient ID and time stamp.
The ordering information is used to make the basic
algorithm more scalable if needed, as described later.
We capture this data from the database and use it to
create a second hash table, with Entity/Encounter ID
as key and row number in the array as value—e.g.,
Encounter# 14568 is in row 45.
Step 3. We dynamically allocate a two-dimensional array of
strings (number of entities × number of attributes).
All elements are initialized to blanks.
Step 4a. A query fetches all EAV triplets (Encounter ID,
Attribute ID, Value) for the given Study ID, CRF ID
and Attribute IDs, via a join to the Entities table.
(In a slight variation to this method, which we will
call Method C′, one can retrieve the EAV triplets
for all the Attribute IDs for the given Study ID
and CRF ID. This variation decreases the load on
the database and can avoid the performance degra-
dation seen in one of the tested databases—see
Section 3.)
Step 4b. Iterating through each returned row, we place the
Value in the 2-D array in the row and column indi-
cated by its corresponding Encounter and Attribute
IDs, respectively: the two hash tables allow speedy
row/column determination. At the end of the iter-
ations, empty values remain blank. For a situa-
tion where the EAV data is stored across multi-
ple data-type-specific tables (e.g., strings, integers,
decimal numbers, as in TrialDB), one would repeat
the second query for each necessary EAV table,
as determined by metadata that indicated how
many attributes of each data-type existed for the
desired set.
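A minimal sketch of these four steps, using JDBC against the simplified schema assumed earlier, is shown below. It is illustrative only and is not the benchmark code; in particular, it handles a single integer-valued EAV table rather than multiple data-type-specific tables.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class HashTablePivot {

        public static String[][] pivot(Connection con, List<Integer> attrIds,
                                       int studyId, int crfId) throws SQLException {
            // Step 1: attribute ID -> output column position.
            Map<Integer, Integer> attrCol = new HashMap<>();
            for (int i = 0; i < attrIds.size(); i++) {
                attrCol.put(attrIds.get(i), i);
            }

            try (Statement st = con.createStatement()) {
                // Step 2: fetch entity information ordered by patient ID and time stamp,
                // mapping each Entity/Encounter ID to an output row position.
                Map<Integer, Integer> entityRow = new HashMap<>();
                try (ResultSet rs = st.executeQuery(
                        "SELECT entity_id FROM entities"
                        + " WHERE study_id = " + studyId + " AND crf_id = " + crfId
                        + " ORDER BY patient_id, encounter_ts")) {
                    int row = 0;
                    while (rs.next()) {
                        entityRow.put(rs.getInt("entity_id"), row++);
                    }
                }

                // Step 3: allocate the (entities x attributes) array, initialized to blanks.
                String[][] pivoted = new String[entityRow.size()][attrIds.size()];
                for (String[] r : pivoted) {
                    Arrays.fill(r, "");
                }

                // Step 4a: fetch the EAV triplets for the desired attributes with one
                // simple query (dropping the IN filter yields the Method C′ variation).
                String inList = attrIds.stream().map(String::valueOf)
                                       .collect(Collectors.joining(", "));
                try (ResultSet rs = st.executeQuery(
                        "SELECT d.entity_id, d.attribute_id, d.value"
                        + " FROM eav_int d JOIN entities e ON e.entity_id = d.entity_id"
                        + " WHERE e.study_id = " + studyId + " AND e.crf_id = " + crfId
                        + " AND d.attribute_id IN (" + inList + ")")) {
                    // Step 4b: place each value using the two hash tables; cells with no
                    // matching triplet stay blank.
                    while (rs.next()) {
                        Integer row = entityRow.get(rs.getInt("entity_id"));
                        Integer col = attrCol.get(rs.getInt("attribute_id"));
                        if (row != null && col != null) {
                            pivoted[row][col] = rs.getString("value");
                        }
                    }
                }
                return pivoted;
            }
        }
    }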
3. Results of benchmark tests
The results for the benchmark tests using the three meth-
ods (A–C, described above) on the three DBMSs (Oracle 9i,
SQL Server 2000 SP4 and SQL Server 2005 Beta 2) are illus-
trated in Figs. 1 and 2. To facilitate comparison across both
method and DBMS, Fig. 1 illustrates the results grouped by
method, while Fig. 2 illustrates the same results grouped by
DBMS.
3.1. Results for Method A: using full outer joins
(Fig. 1A)
Oracle 9i: Execution time increased exponentially, from
76 milliseconds (ms) for one attribute to 56,968 ms for
nine attributes. (Pearson R2 for log(time) versus number of
attributes = 0.955.) The Java process crashed with a SQL excep-
tion on attempting a 10-attribute merge. Inserting a variety of
optimizer hints in the generated SQL did not help, and further
experiments were halted.
SQL Server: Execution time increased at a quadratic rate
with the number of attributes (R2 = 0.994 for SQL Server 2000
SP4, R2 = 0.999 for SQL Server 2005 Beta 2). The query run
times for 1–42 attributes ranged from 206 to 8,055 ms on SQL Server 2000 SP4 and from 87 to 18,961 ms on SQL Server 2005 Beta 2.
3.2. Results for Method B: using left outer joins
(Fig. 1B)
Oracle 9i: Performance scaled linearly, from 67 ms for one
attribute, to 965 ms for 42 attributes, the last involving a total
of 82 join operations in a single SQL statement (R2 for time
versus number of attributes = 0.996).
SQL Server: Execution times were much higher—ranging
from 190 ms for 1 attribute to above 22 seconds (s) and 105 s for SQL
Server 2005 Beta 2 and SQL Server 2000 SP4, respectively. Exe-
cution times on SQL Server had an exponential growth for
the first 11–12 attributes (R2 = 0.981 for SQL Server 2000 SP4,
R2 = 0.990 for SQL Server 2005 Beta 2), followed by a linear
increase for more than 13 attributes (R2 = 0.990 for SQL Server
2000 SP4, R2 = 0.959 for SQL Server 2005 Beta 2). On SQL Server,
the execution times for left outer join were higher than the
times for the full outer join.
3.3. Results for Method C: using hash tables and
memory to perform the equivalent of multiple joins
(Fig. 1C)
Oracle 9i: Execution time grew linearly (R2 = 0.946) from 80 ms
for 1 attribute to 547 ms for 42 attributes.
SQL Server: The behavior of SQL Server 2000 SP4 differed
considerably from that of SQL Server 2005 Beta 2.
SQL Server 2005 Beta 2: Execution time grew linearly
(R2 = 0.967) from 77 ms for 1 attribute to 474 ms for 42
attributes.
SQL Server 2000 SP4: A notable and increasing performance
degradation was observed for about 9–20 attributes, beyond
which the execution times became lower and grew more lin-
early at a smaller rate. This behavior, presumably due to SQL Server 2000's attempt to optimize the query, prompted us to try bypassing SQL Server's optimization attempt. In this slightly modified method, called Method C′, we retrieved from
the database the values for all the attributes (for the given
Study ID and CRF ID) and only the desired values were then
picked to be stored in the pivoted array—see Fig. 1D.
As expected, the times for Method C′ stayed almost flat,
independent of the number of attributes desired—since most
of the time was spent executing the query and then iterat-
ing through all the retrieved values to see if they need to be
stored in the pivoted array or not. This modification removed
the irregular increase in execution times for SQL Server 2000
SP4, but also resulted in higher execution times for Oracle 9i
and SQL Server 2005 Beta 2, especially for a small number of
attributes.
4. Discussion
4.1. Possible explanations of results
It is well known that the multi-table join problem is NP-hard
with respect to performance optimization. The number of join
orders to be evaluated to determine the join order that gives
the fastest performance grows exponentially with the num-
ber of tables to be joined [11]. If the number of tables is large
enough, the CPU time spent by the query execution engine on
Fig. 1 – Execution times for Methods A–C are compared on three different databases—Oracle 9i, SQL Server 2000 SP4 and
SQL Server 2005 Beta 2. The x-axis represents the number of attributes for which the values are retrieved and loaded into an
array; the y-axis represents the execution time in milliseconds (ms). (A) Execution times for Method A, using full outer joins,
to retrieve the values for the desired attributes and load the data into an array. Oracle 9i times increased exponentially
(R2 = 0.955) with the number of attributes, and failed with a SQL error at 10 attributes. The times on SQL Server increased at
a quadratic rate (R2 = 0.994 for SQL Server 2000 SP4, R2 = 0.999 for SQL Server 2005 Beta 2). (B) Execution times for Method B,
using left outer joins, to retrieve the values for the desired attributes and load the data into an array. Execution times on SQL
Server had an exponential growth for the first 11–12 attributes (R2 = 0.981 for SQL Server 2000 SP4, R2 = 0.990 for SQL Server
2005 Beta 2), followed by a linear increase for more than 13 attributes (R2 = 0.990 for SQL Server 2000 SP4, R2 = 0.959 for SQL
Server 2005 Beta 2). The times on Oracle 9i increased linearly (R2 = 0.996) and were much lower compared with the SQL
Server times. (C) Execution times for Method C, using the in-memory hash table to pivot the values for the desired attributes
and load the data into an array. In this method, only the values for the desired attributes were selected from the database.
Execution times on Oracle 9i and SQL Server 2005 Beta 2 grew linearly (R2 = 0.946 for Oracle 9i, R2 = 0.967 for SQL Server 2005
Beta 2), with the values for the latter slightly lower in magnitude. For SQL Server 2000 SP4, a notable and increasing
performance degradation was observed for about 9–20 attributes, after which the execution times became lower and grew
more linearly at a smaller rate. This odd behavior, presumably due to SQL Server 2000’s attempt to optimize the query,
prompted us to bypass the SQL Server’s optimization attempt and to investigate the alternative method where the values
for all the attributes (for a desired trial) were retrieved from the database and only the desired values were then picked to be
stored in the pivoted array—see Method C′ in Fig. 1D. (D) Execution times for Method C′, a variation of Method C where all
the values for all the attributes (for a desired trial) were retrieved from the database and only the desired values were then
picked to be stored in the pivoted array. As expected, the times stayed almost constant—since most of the time was spent
executing the query and then iterating through all the retrieved values to see if they need to be stored in the array or not.
Fig. 2 – For each of the tested databases (Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2), the performance times for the various methods are compared. The same logarithmic time scale is used for all databases, to facilitate side-by-side comparison. For 42 attributes, Method C provided 1.76-, 4-, and 40-fold performance improvements when compared to the next best performing method (A or B) for Oracle 9i, SQL Server 2000 SP4 and SQL Server 2005 Beta 2, respectively.
determining the best way to perform the join can be consider-
ably more than the time that the engine might take in actually
executing the join using a naïve strategy, such as joining the
tables in the order encountered in the SQL statement.
Modern DBMSs achieve their impressive performance
through a combination of heuristics (e.g., the presence or
absence of indexes) and the use of stored database statis-
tics, such as table sizes and data distributions. This infor-
mation lets them select query execution plans that may not
be absolutely optimal, but are reasonably close to optimal,
and which can be determined in polynomial or even linear
time. The strategies are understandably vendor-specific. Since
query performance is an area where vendors compete vig-
orously, these strategies are likely to change between DBMS
versions. Further, the amount of intellectual effort that ven-
dors decide to expend in devising heuristics to accommodate
relatively uncommon situations efficiently is also likely to
vary. Finally, query optimization is a task complex enough to
require a team of programmers, with specific sub-tasks being
delegated to individual team members, some of whom may
be more skilled than others. In any event, different vendor
engines and versions will generate different plans for the same
situation.
4.2. Full outer joins versus left outer joins
Oracle degrades exponentially for full outer joins, while SQL
Server does not. Full outer joins have been introduced rela-
tively recently into DBMSs. Being needed in relatively uncom-
mon situations (the vast majority of “business” databases
do not utilize EAV design), it is possible that Oracle’s imple-
menters invested minimal effort in optimizing their perfor-
mance (to the extent of failing when the number of joins required for 10 attributes was attempted), while SQL Server's implementers
did not.
With SQL Server, full outer joins perform better than left
joins. The performance of left joins has been improved sig-
nificantly in SQL Server 2005 (which is desirable for these
relatively common operations) but it still falls slightly short
of full outer joins. Oracle 9i outperforms both these versions
by a wide margin for left joins. One explanation of these num-
bers is the relative effort and skill that each vendor brought to
bear on the optimization of these operations.
The older version of SQL Server performs full outer joins
more efficiently than the newer version. Most algorithms
incorporate trade-offs, and it is possible that the revised algorithm for outer joins performs much better for the
common (left join) situation, while performing worse in the
much less common (full outer join) situation. Oracle 9i’s per-
formance characteristics, which show superlative optimiza-
tion of left joins while showing pathological behavior for full
outer joins, are possibly an extreme example of this trade-off.
4.3. In-memory joins
For all the DBMSs tested, in-memory joins give the best overall
performance, as indicated in Fig. 2, especially when the num-
ber of attributes is large. In the 42-attribute scenario, Methods C and C′ were 1.76 and 1.96 times faster than the next best
method (Method B) for Oracle 9i. For SQL Server 2000 SP4, for
42 attributes, Method C was four times faster than the next
best performing method, Method A. For SQL Server 2005 Beta
2, Method C was 40 times faster than the next best performing
method, Method A.
In the in-memory join, the SQL that is sent to the database in step 4a above is very simple, so that the DBMSs do not need to spend any CPU time trying to optimize it, and it returns
a large amount of data. The algorithm is essentially limited
by the rate at which the Java application can deal with the
rows from the resultant dataset. By shifting the work from
the database to an application server, this approach scales
more readily to “Web farm” parallelism [12], where multiple
application servers access a shared database. This approach,
employed in large-scale e-commerce scenarios, is more read-
ily implemented than database-server parallelism. For the lat-
ter, increasing the number of CPUs does not help significantly
unless the data is also partitioned across multiple indepen-
dent disks, because database operations tend to be I/O bound
rather than CPU bound.
This algorithm is not limited by the number of attributes,
but, in the simple version described above, assumes availabil-
ity of sufficient RAM. This assumption is generally reasonable
on present-day commodity hardware with 2 GB-plus of RAM,
but not always so. A more complex but better-scaling version
of the algorithm requires a change in steps 3 and 4 above.
Modified step 3: Compute the worst-case RAM required
per row, Mrow, for the 2-D array (based on the total number
of attributes and their individual data types). Determine the
total RAM available to the program (Mmax). Allocate the 2-D
array with number of rows, Nrows = Mmax/Mrow and number of
columns = number of attributes.
Modified step 4: Replace the single query of step 4 with a
series of queries by traversing the ordered Entity information.
Each query retrieves a horizontal “slice” of the EAV data, such
that the number of distinct entities will not exceed Nrows. (To
do this most directly, determine, for each query, the range of
ordered patient IDs in the Entity data that does not exceed
Nrows.) The filter in each query then takes the form “Study
ID = x and CRF ID = y and patient ID between ‘aaa’ and ‘bbb’
” (where x and y are already known, and the values ‘aaa’ and
‘bbb’ are determined each time). In each iteration, write out
the filled array to disk, re-initialize it, and increment the range
of patients, until data for all patients is fetched.
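A sketch of this memory-bounded variation, written as a method that could accompany the hash-table sketch above, follows. The helpers pivotSlice() and writeSlice() are hypothetical: the former would run the Method C query with the extra patient-ID range filter appended to its WHERE clause, and the latter would write the filled array to disk. For simplicity the slice size here is a fixed patient count, whereas the text above bounds the number of distinct entities per slice.

    // Sketch of the memory-bounded variation (modified steps 3 and 4); helpers are hypothetical.
    public static void pivotInSlices(Connection con, List<Integer> attrIds,
                                     int studyId, int crfId,
                                     List<String> orderedPatientIds,
                                     long maxMemoryBytes, long worstCaseBytesPerRow)
            throws SQLException {
        int nRows = (int) (maxMemoryBytes / worstCaseBytesPerRow);   // modified step 3
        for (int start = 0; start < orderedPatientIds.size(); start += nRows) {
            int end = Math.min(start + nRows, orderedPatientIds.size()) - 1;
            // Modified step 4: restrict each query to a range of ordered patient IDs.
            String rangeFilter = " AND e.patient_id BETWEEN '"
                + orderedPatientIds.get(start) + "' AND '"
                + orderedPatientIds.get(end) + "'";
            String[][] slice = pivotSlice(con, attrIds, studyId, crfId, rangeFilter);
            writeSlice(slice);  // write out the filled array, re-initialize, continue
        }
    }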
5. Conclusions
While several algorithms can be employed for pivoting EAV
data, each approach must be carefully tested on individual
vendor DBMS implementations, and may need to be period-
ically re-evaluated as vendors upgrade their DBMS versions.
The in-memory join, while algorithmically most complex, is
also the most efficient. It is relatively stable to DBMS upgrades,
because the SQL that it uses combines a limited number
of tables, and needs only elementary optimization from the
DBMS perspective.
Acknowledgments
This work was supported in part by NIH Grants U01 CA78266,
K23 RR16042 and institutional funds from Yale University
School of Medicine. The authors would like to thank Perry
Miller for comments that improved this manuscript.
r e f e r e n c e s
[1] S. Johnson, Generic data modeling for clinical repositories,
J. Am. Med. Informatics Assoc. 3 (1996) 328–339.
[2] S.M. Huff, C.L. Berthelsen, T.A. Pryor, A.S. Dudley,
Evaluation of a SQL model of the HELP patient database,
in: Proceedings of the 15th Symposium on Computer
Applications in Medical Care, Washington, DC, 1991, pp.
386–390.
[3] S.M. Huff, D.J. Haug, L.E. Stevens, C.C. Dupont, T.A. Pryor,
HELP the next generation: a new client-server architecture,
in: Proceedings of the 18th Symposium on Computer
Applications in Medical Care, Washington, DC, 1994, pp.
271–275.
[4] 3M Corporation (3M Health Information Systems), 3M
Clinical Data Repository (2004), web site: http://www.
3m.com/us/healthcare/his/products/records/data repository.
jhtml, date accessed: September 9, 2004.
[5] Cerner Corporation, The Powerchart Enterprise Clinical
Data Repository (2004), web site: http://www.cerner.
com/uploadedFiles/1230 03PowerChartFlyer.pdf, date
accessed: September 9, 2004.
[6] Oracle Corporation, Oracle Clinical Version 3.0: User’s
Guide (Oracle Corporation, Redwood Shores CA, 1996).
[7] Phase Forward Inc., ClinTrial (2004), web site: http://www.
phaseforward.com/products cdms clintrial.html, date
accessed: 10/4/04.
[8] S. Johnson, G. Hripcsak, J. Chen, P. Clayton, Accessing the
Columbia Clinical Repository, in: Proceedings 18th
Symposium on Computer Applications in Medical Care,
Washington, DC, 1994, pp. 281–285.
[9] D. Tow, SQL Tuning, O’Reilly Books, Sebastopol, CA, 2003.
[10] C. Brandt, P. Nadkarni, L. Marenco, et al., Reengineering a
database for clinical trials management: lessons for
system architects, Control. Clin. Trials 21 (2000) 440–461.
[11] Sybase Corporation, SQL Anywhere Cost Based Query
Optimizer (1999), web site: www.sybase.co.nz/products/
anywhere/optframe.html, date accessed: 12/1/01.
[12] R.J. Chevance, Server Architectures: Multiprocessors,
Clusters, Parallel Systems, Web Servers, Storage Solutions,
Elsevier Digital Press, Burlington, MA, 2004.