Scalability Manuscript for Star98
Two Generations of Client / Server
Performance Testing
Steven J. Oubre
MedicaLogic, Inc.
20500 NW Evergreen Parkway
Hillsboro, OR 97124
phone: (503) 531-7000, fax: (503) 531-7134
email: steve_oubre@medicalogic.com
Introduction
MedicaLogic, Inc. is a software
development company that specializes
in Electronic Medical Records (EMR)
software products for the medical clinic
environment. The Windows version of
this product is called Logician.
Because Logician is designed to work
with an Oracle database in a
client/server configuration, a network
and server are required in addition to
PC workstations.
The Problem
MedicaLogic’s customers have varying
numbers of users. As more users use
Logician, a faster server and network of
sufficient bandwidth are required.
Another complicating factor is that
clinics typically have a long lead-time for
purchasing hardware and software.
They may need to know what to
purchase six months to a year in
advance. So it was important for
MedicaLogic to produce configuration
data prior to a given release of the
product.
The Solution
In the interest of customer satisfaction,
MedicaLogic saw a need to invest time
and money into determining server,
workstation and network requirements
for various numbers of users of its
products. This investment turned into
what has been called Scalability testing.
First Generation Scalability
Testing
The first requirement was to put
together a Scalability Lab with an
appropriate amount and configuration of
hardware and software. At that time,
MedicaLogic’s customers and potential
customers were not very large and we
thought that the ability to scale to
around 100 users would be sufficient.
We didn’t want to equip our lab with 100
PC workstations. This would have been
prohibitive, both in cost and space
requirements. So we decided that if we
could automate the execution of
Logician at a faster rate than a human
would normally do, we would actually be
sending and receiving data across the
network to the server at a level equal to
some number greater than the number
of computers we would have in our lab.
The Lab
Through some fairly extensive research
and analysis, we decided to equip our
test lab with twenty-four PCs and three
Intel-based servers. The current
technology at the time limited us to
mostly Intel 486 based systems with one
90MHz Pentium server and five 75MHz
Pentium PCs. Lab furniture, patch
panel, hubs and wiring were purchased
and installed. The first version of
MedicaLogic’s lab used a 10Base-T
Ethernet network. Two servers ran
Novell NetWare 4.1 and one server ran
Microsoft NT Server 3.51. The PCs
were running Microsoft Windows 3.1.
The Test Environment
Initial investigation of software test
automation products indicated that
Microsoft Test would be able to
automate the execution of Logician. Not
only that, but Microsoft Test allowed
execution of tests on any number of
computers without any additional cost. It
also had the ability to control other PCs
from one PC. We felt that this feature
would allow us centralized control over
all twenty-four PCs.
Scripts were written that would allow
Logician to behave as if various clinic
personnel were using it. Wait times were
also incorporated to further make the
behavior as similar to real clinic
personnel as possible.
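As an illustration only, the idea of a role
script with wait times can be sketched in
Python. This is not the Microsoft Test
script that was used; the function name,
the placeholder actions and the wait range
are all hypothetical.

```python
import random
import time


def run_role_script(actions, think_time_range=(2.0, 8.0),
                    sleep=time.sleep, rng=None):
    """Run each action in order, pausing a random think time after each one.

    The (2.0, 8.0) second range is an assumed placeholder, not a measured
    value from a clinic.
    """
    rng = rng or random.Random()
    waits = []
    for action in actions:
        action()                               # one scripted step in the application
        wait = rng.uniform(*think_time_range)  # simulated human pause
        sleep(wait)
        waits.append(wait)
    return waits


# Dry run with hypothetical nurse steps; sleeping is disabled so it returns at once.
steps_done = []
nurse_steps = [
    lambda: steps_done.append("open chart"),
    lambda: steps_done.append("enter vitals"),
]
waits = run_role_script(nurse_steps, sleep=lambda seconds: None)
print(steps_done, len(waits))
```

Passing the sleep function as a parameter
keeps the pacing logic separate from the
clock, so the same script can be checked
quickly and then run at a realistic pace.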
Measurements of server performance
were done with Microsoft Performance
Monitor on the NT server and Novell’s
monitor program on the NetWare server.
Response times were measured with
another Microsoft Test script that
recorded the time it took to perform
various actions within Logician. Network
performance was measured with
Novell’s LanAlyzer.
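The response-time measurement described
above amounts to wrapping each action with
a timer. A minimal Python sketch of that
idea follows; it is not the original
Microsoft Test script, and the function
and action names are hypothetical.

```python
import time


def timed(action, clock=time.perf_counter):
    """Return (result, elapsed_seconds) for a single call to action()."""
    start = clock()
    result = action()
    return result, clock() - start


def record_response_times(named_actions):
    """Map each action name to the number of seconds the action took."""
    return {name: timed(action)[1] for name, action in named_actions.items()}


# Hypothetical stand-ins for actions performed inside the application.
times = record_response_times({
    "open_chart": lambda: sum(range(1000)),
    "save_note": lambda: sorted(range(1000), reverse=True),
})
for name, seconds in times.items():
    print(f"{name}: {seconds:.6f} s")
```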
Problems Encountered
1. Automating in this fashion does not
take into account multiple
workstations running Oracle. Since
Oracle uses memory differently as
more users log on, testing Logician
at a faster rate, to simulate more
users, did not accurately show
processor utilization for the
equivalent number of users.
2. Logician uses non-standard objects
in its graphical user interface.
Because of this, it was very difficult
to automate with Microsoft Test.
Screen coordinates had to be used
instead of object recognition.
3. At the time, Logician was undergoing
extensive changes, which caused the
Microsoft Test scripts to fail
whenever a new internal release was
received.
4. Because MedicaLogic did not have
very many customers who had been
using Logician for any length of time,
it was hard to judge how much data
would be needed. As it turned out,
too little data was used in the test
database.
The Results
Our scalability testing with this system
only allowed simulation of around 60
users. With this high-end number and
some educated guesswork, we were
able to extrapolate what size server
would be needed for any number of
users between 1 and 100.
Second Generation
Scalability Testing
As MedicaLogic became a leader in
Electronic Medical Records software
development, it became clear that we
needed to be able to demonstrate
scalability to greater than 100 users.
Our goal then became to simulate up to
500 users.
However, with the automated system we
had, we would have needed a minimum
of 250 PCs to simulate 500 users.
Again, this was not an option due to
expense and space requirements.
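The arithmetic behind that estimate is a
simple ceiling division. In the sketch
below, the ratio of two simulated users
per PC is inferred from the 250-PC and
500-user figures above, and the function
name is hypothetical.

```python
import math


def pcs_needed(target_users, users_per_pc):
    """Number of driver PCs needed when each PC simulates users_per_pc users."""
    if users_per_pc <= 0:
        raise ValueError("users_per_pc must be positive")
    return math.ceil(target_users / users_per_pc)


# Ratio implied by the text: 250 PCs for 500 users, i.e. 2 users per PC.
print(pcs_needed(500, 2))
```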
We looked at a number of options from
load testing tools to outside labs. Since
we desired to be able to perform load
testing on a frequent basis and gain
expertise in-house, we eliminated the
lab option. It would have been very
expensive and time consuming over the
long run.
After looking at several load-testing
tools, we selected Compuware’s
QALoad.
The Lab
Our lab was updated with extra memory
and disk space for our QALoad Pentium
& 486 PCs, and a 100BaseT network
including a 100BaseT switch.
Additionally, PC manufacturers loaned
servers to us.
The Test Environment
Compuware’s QALoad tool consists of
several parts:
1. A program to capture SQL
(Structured Query Language)
statements from our application to
the Oracle server.
2. A program to convert the raw
captured SQL to a “C” source file.
3. A “Player” program that runs the
compiled “C” program as one or
more simulated users.
4. A “Conductor” program that
controls the operation of multiple
“Player” systems.
We captured the SQL from Logician
while performing the same steps that
typically occur in a clinic during a patient
encounter (including the roles of front
desk, nurse, doctor, etc).
After converting the raw SQL to a “C”
source file, we modified it to make it
more general and accept data from a
data pool. This way, each simulated
user would have unique data to work
with, thereby avoiding constraint errors
from Oracle.
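The data pool idea can be sketched as
follows. This is a Python illustration,
not the modified QALoad “C” source; the
function name and the patient identifiers
are hypothetical. Each simulated user
receives a disjoint slice of the pool, so
no two users ever work with the same
record.

```python
def assign_from_pool(pool, num_users):
    """Give each simulated user a disjoint, round-robin slice of the data pool."""
    if num_users <= 0:
        raise ValueError("num_users must be positive")
    if len(pool) < num_users:
        raise ValueError("data pool is smaller than the number of users")
    # User k receives items k, k + num_users, k + 2 * num_users, ...
    return {user: pool[user::num_users] for user in range(num_users)}


# Hypothetical pool of ten patient identifiers shared among three users.
patient_ids = list(range(1001, 1011))
for user, ids in assign_from_pool(patient_ids, 3).items():
    print(user, ids)
```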
Intel’s LanDesk Server Manager was
used to measure CPU loading on both a
NetWare server and NT server.
The Results
Our Scalability testing with this system
allowed simulation of around 400 users.
This was more than the amount that the
servers and operating systems (OS) we
tested could support with our application
and Oracle.
Since each simulated user actually
logged in to Oracle, they were seen as
separate users on separate systems.
This allowed our tests to be much closer
to reality. In addition, we populated our
test database with the amount and type
of data that would typically be found in a
clinic that is using Logician.
The chart below shows how CPU
utilization varied over time with 300
Logician users on a Quad Pentium Pro
NT server.
Problems Encountered
1. The 486 PCs weren’t powerful enough
to use, so we had to rent additional
PCs for Scalability testing.
2. The “C” source code that was
produced by the QALoad tool was
too difficult to modify and maintain in
a timely manner.
3. We use Oracle’s OCI layer for
communicating with Oracle rather
than ODBC. Due to this and the
concurrency capability of Logician,
our load testing tool, QALoad, does
not exactly duplicate what our
application does. We believe this
problem would be encountered with
any load test tool that’s currently
available, so we decided to add
some additional functionality to our
load test tool to get around some of
this deficiency.
Third (Current) Generation
Scalability Testing
As MedicaLogic has grown and become
better known, some of its current and
potential customers need to have
support for a larger number of Logician
users. A number of larger sites require
support for well over 1,000 active
Logician users.
The Lab
To support over 1,000 users, the
Scalability lab needed additional
QALoad player systems that have
greater processing capability.
Fortunately, the current technology with
Pentium Pro and Pentium II dual
processor systems provided that
capability. We have purchased a mix of
these types of systems that are able to
simulate several hundred users each.
In addition to our 100Mbit switched
network, we have added 100Mbit
full-duplex hubs so that we can
create multiple subnets and
distribute the network load.
The only remaining items required are
the server systems. Current Intel-based
servers cannot support this many
Logician users. Therefore, we have
purchased several UNIX-based server
systems to provide this capability.
The Test Environment
The test tools will remain the same.
However, we are developing new tools
to help in modifying and maintaining the
“C” source code. In addition, we are
providing more accurate functionality to
the load test environment in the form of
software and the database.
[Chart: CPU utilization over time.
Title: 300 User; 4P6/200; NT 3.51; DTS;
Full Data and Pictures. Series: Full Time.
X axis: Time, 15:11:30 to 17:11:39.
Y axis: Value, 0 to 90.
Marked events: All Virtual Users Active;
DTS Start - Scan Inbox; DTS - Begin
Imports; DTS Finished.
Average CPU utilization while DTS is
active: 37.41%.
Maximum CPU utilization: 85%.]
The Results
Scalability testing with this system
revealed numerous impediments to our
being able to achieve sustained user
levels much greater than 500
simultaneous users.
Problems Encountered
1. Current disk configuration became a
bottleneck to database performance
in terms of data read and write
efficiency.
2. Network segment traffic started to
saturate above 300 simultaneous
users. This required us to install
additional network cards in the server
in order to segment the network into
multiple subnets. This allowed us to
distribute the network load and
minimize the network bottlenecks.
3. Both the Oracle database and
Logician required better optimization
to handle the increased user loads.
These issues affect not only
performance and scalability testing in-
house but also affect overall application
performance for our customers. With
this in mind we will now discuss some of
these issues.
Issues Affecting
Performance in larger
Client / Server
Environments
The Problem
The issue we face here is one of
tradeoffs: more specifically the issue of
cost vs. performance. One of
MedicaLogic’s goals is to provide our
customers with recommendations that
will give them the best performance at a
reasonable price. With customers
looking to increase the number of
licensed users in their organizations to
1000+ users in some cases, the
identification and resolution of various
performance issues that the customer
may encounter becomes increasingly
important.
The Solution
There is no one solution that will resolve
all the performance issues our
customers will encounter. We can
however make recommendations and
product code changes based on
performance testing done at
MedicaLogic about Local and Wide-
Area Network (LAN / WAN) sizing,
server and workstation configurations,
application tuning at both the client and
server levels, etc. We will describe
some of the more common (and not so
common) issues we continue to
encounter in performance testing at
MedicaLogic.
Disk configuration for
improved database
performance
This is becoming a larger issue with
customers wanting to roll out Logician to
greater numbers of users. With the
larger user base also comes a larger
patient population that translates into a
larger database being accessed by
more users. Improving database I/O is
critical to the success of our customers
and Logician.
Splitting database files across
multiple drives
For quite some time now it has been
known that splitting Oracle data files
over multiple disk devices can improve
performance. However, it is not
immediately clear how the different
Oracle data files should be distributed
across the disk devices. Placing data
files on different disk devices is
straightforward; choosing which files
to place together is not. The key is to
minimize head contention for reads and
writes on the drives.
Our Oracle database consists of the
following data files with the following
types of I/O characteristics.
1. System data file. This file
holds the system catalogs.
Because this information is
read frequently during
database operation, most of it
should be cached in memory.
2. Rollback data file. This file
holds the rollback information
while users are in the middle
of transactions.
3. Temporary data file. This file
holds sort data when a sort
will not fit into memory. This
usually happens with a report
that has quite a bit of data to
sort in a particular manner.
4. Logician data file. This file
holds the clinical data for the
production system.
5. Logician index data file. This
file holds the information
about the indexes on the data
in the Logician data file.
6. Logician read only data file.
This data file holds
information that is read only.
Mainly knowledge base type
of information that the user
does not change.
7. Logician read only index data
file. This data file holds the
indexes on the data in the
Logician read only data file.
8. Tutorial data file. This file
holds the information about
the tutorial database and its
indexes.
9. Maintenance data file. This
data file holds the information
about growth statistics on
tables in the production
system.
10. L3 data file. This data file
holds the information that
LinkLogic uses for setup and
temporary storage processing
for data imports.
11. Photo data file. This data file
holds the patient photos.
12. Redo Log files. These files
hold a before and an after
image of each transaction.
When a redo log fills up, it is
copied off to an archive
area.
13. Archive log files. These files
are produced as a result of
filling up of the redo logs.
14. Control files. These files hold
some basic information about
the database and are quite
small.
Using our standard scaling model we
looked at how much input and output
(I/O) was occurring for each file. The
important observations here are how
much relative file I/O was occurring.
This information is necessary to come
up with a general model of how to split
the files across multiple disk devices. A
further refinement of the model would
indicate, in terms of providers and
users, how best to split files across
multiple disk devices, since a
particular disk device has a limit on how
much data can be read from and written
to it during a unit of time.
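One way to turn relative I/O counts into a
placement is a simple greedy heuristic:
take the busiest file first and put it on
whichever device currently carries the
least I/O. The Python sketch below is only
an illustrative stand-in and is not the
model described above. The counts are
reads plus writes per tablespace from
Table 1 (Maint and Photos are omitted
because they recorded no I/O); the
function name and the choice of three
devices are assumptions.

```python
import heapq


def spread_files(io_by_file, num_devices):
    """Greedy placement: heaviest file first, always onto the least-loaded device."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    heap = [(0, device) for device in range(num_devices)]  # (load, device index)
    heapq.heapify(heap)
    layout = {device: [] for device in range(num_devices)}
    ordered = sorted(io_by_file.items(), key=lambda item: item[1], reverse=True)
    for name, io_count in ordered:
        load, device = heapq.heappop(heap)      # least-loaded device so far
        layout[device].append(name)
        heapq.heappush(heap, (load + io_count, device))
    return layout


# Reads + writes per tablespace, taken from Table 1.
TABLE1_OPS = {
    "System": 615, "Rollback": 44145, "Temporary": 23945,
    "Logician_data": 426223, "Logician_index": 48788,
    "Logician_data_ro": 3572, "Logician_index_ro": 124,
    "Tutorial": 8, "L3": 6675, "Redo Logs": 171749,
}

for device, files in spread_files(TABLE1_OPS, 3).items():
    print(device, files)
```

With these counts and three devices, the
heuristic gives Logician_data and the redo
logs a device each and groups the remaining
files on the third. It treats reads and
writes as equally costly and ignores the
per-device throughput limit, so it is a
starting point rather than a
recommendation.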
It should be noted that this model must
be validated in the real world. To
validate the data we would need to run a
couple of simple SQL scripts (utlbstat
and utlestat, supplied by Oracle) at a
variety of client sites. These scripts
would not affect the current operation of
the client’s system nor reveal any
confidential data. If we obtain similar
results from several client sites then we
could come up with some general rules
for deployment. This information would
be invaluable for server planning,
performance and scalability.
Model Results
Table 1 shows results from running our
model on a 300 user database with a
DTS (Data Transfer Station) running.
This table shows the number of
actual I/O operations that occurred on
each file. This is the number of I/Os, not
the size of each I/O.
As you can see, the redo logs account
for about a third (33.46%) of the write
I/O and none of the read I/O. This is
expected since only in recovery would
the redo logs be read from. The
Logician_data tablespace has the most
write and read activity, since this is
where the clinical data resides.
Table 2 shows results from the same
300 user run as Table 1. This table
shows the number of 8K blocks read
from and written to each file.
In Table 2 you can see that the redo
logs account for the majority (88.78%)
of the data written. In terms of the
amount of data read, the Logician_data
tablespace has the most by far
(94.84%). The direction of these results
seems reasonable.
Table 1. Physical I/O Operations
Tablespace             Reads  % reads     Writes  % writes  % total I/O
System                   541    0.25%         74     0.01%        0.08%
Rollback                  12    0.01%     44,133     8.60%        6.08%
Temporary             11,386    5.36%     12,559     2.45%        3.30%
Logician_data        183,087   86.14%    243,136    47.37%       58.72%
Logician_index        13,447    6.33%     35,341     6.89%        6.72%
Logician_data_ro       3,572    1.68%          0     0.00%        0.49%
Logician_index_ro        124    0.06%          0     0.00%        0.02%
Tutorial                   8    0.00%          0     0.00%        0.00%
Maint                      0    0.00%          0     0.00%        0.00%
L3                       364    0.17%      6,311     1.23%        0.92%
Photos                     0    0.00%          0     0.00%        0.00%
Redo Logs                  0    0.00%    171,749    33.46%       23.66%
Total                212,541             513,303                725,844
Table 2. Physical Block I/O
Tablespace             Reads  % reads     Writes  % writes  % total I/O
System                   689    0.06%         74     0.01%        0.03%
Rollback                  12    0.00%     44,133     3.25%        1.75%
Temporary             41,828    3.59%     42,316     3.11%        3.33%
Logician_data      1,105,400   94.84%     24,316     1.79%       44.75%
Logician_index        13,447    1.15%     35,341     2.60%        1.93%
Logician_data_ro       3,572    0.31%          0     0.00%        0.14%
Logician_index_ro        124    0.01%          0     0.00%        0.00%
Tutorial                  12    0.00%          0     0.00%        0.00%
Maint                      0    0.00%          0     0.00%        0.00%
L3                       419    0.04%      6,311     0.46%        0.27%
Photos                     0    0.00%          0     0.00%        0.00%
Redo Logs                  0    0.00%  1,206,535    88.78%       47.79%
Total              1,165,503           1,359,026              2,524,529
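The percentage columns in both tables are
each entry's share of its column total. A
short Python sketch, using the write counts
from Table 1, shows the calculation; the
function name is hypothetical.

```python
def percent_share(counts):
    """Return each entry's share of the column total, as a percentage (2 decimals)."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


# Write operations per tablespace, from Table 1.
TABLE1_WRITES = {
    "System": 74, "Rollback": 44133, "Temporary": 12559,
    "Logician_data": 243136, "Logician_index": 35341,
    "Logician_data_ro": 0, "Logician_index_ro": 0, "Tutorial": 0,
    "Maint": 0, "L3": 6311, "Photos": 0, "Redo Logs": 171749,
}

shares = percent_share(TABLE1_WRITES)
print(shares["Redo Logs"], shares["Logician_data"])
```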
In general, disk writes are more
expensive than disk reads. So a basic
strategy would be to get the redo logs
off the disk device that holds all the
other files. This observation is in line
with Oracle’s basic tuning
recommendations. One item that is not
noted is that when a redo log fills up it
is copied to the archive area as an
archive log. This additional disk activity
is not accounted for in these figures.
The technical papers from Oracle listed
in the bibliography can give more insight
into this area.
Which RAID level should I use?
0, 1, 0+1, 5, none
A disk device is a logical drive or set of
drives. It may be a single disk drive or it
could be a mirrored set, or some sort of
Redundant Array of Inexpensive Disks
(RAID).
• RAID 0 is disk striping without parity;
very fast but lacks data protection. A
single disk failure wipes out the
logical device.
• RAID 1 is disk mirroring. Very
reliable in terms of data protection
but requires 2X the number of disks
to hold the same amount of data.
• RAID 0+1 is a mirrored stripe. It is
very fast and reliable.
• RAID 5 is a disk stripe with parity. It
is not quite as fast as 0+1 and
reliability is limited. Since parity is
striped across disks, the array
survives the loss of one disk, but if
more than one disk fails, the entire
logical device is usually lost.
In general a mirrored stripe, commonly
referred to as 0+1, will give you the most
performance, but at the cost of the
number of disk drives needed.
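The disk-count tradeoff can be made
concrete with a small Python sketch that
gives usable capacity and the number of
disk failures each level is guaranteed to
survive. The function name and the 9 GB
disk size are hypothetical, RAID 1 is
modeled as mirrored pairs, and real
controllers may behave differently.

```python
def raid_summary(level, disks, disk_gb):
    """Return (usable_gb, guaranteed_tolerated_failures) for a RAID level."""
    # level -> (minimum disks, data-disk count as a function of n, failures survived)
    rules = {
        "0":   (2, lambda n: n,     0),  # striping only, no redundancy
        "1":   (2, lambda n: n / 2, 1),  # mirroring: half the disks hold copies
        "0+1": (4, lambda n: n / 2, 1),  # mirrored stripe
        "5":   (3, lambda n: n - 1, 1),  # one disk's worth of capacity holds parity
    }
    if level not in rules:
        raise ValueError(f"unknown RAID level: {level}")
    min_disks, data_disks, failures = rules[level]
    if disks < min_disks:
        raise ValueError(f"RAID {level} needs at least {min_disks} disks")
    if level in ("1", "0+1") and disks % 2:
        raise ValueError(f"RAID {level} needs an even number of disks")
    return data_disks(disks) * disk_gb, failures


# Four hypothetical 9 GB disks under each striped level.
for level in ("0", "0+1", "5"):
    print(level, raid_summary(level, 4, 9))
```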
Hardware or Software RAID
Hardware RAID implementations are
typically faster than software RAID, but
at a price. Hardware implementations
are more expensive than software
RAID, since software RAID capability is
built into the various server operating
systems and needs no hardware
RAID array controller.
Various server and
workstation options
available
MedicaLogic continues to provide
customers with updated server and
workstation sizing information based
upon testing done in our Scalability Lab.
While we do not emphasize what server
or workstation vendor our customers
should purchase their hardware from,
we do recommend that they plan their
purchases to accommodate their
projected user and database growth.
Areas that we provide sizing information
on include:
1. Processor number & speed
2. RAM
3. Disk Space requirements
4. Network and Workstation OS
configuration parameters
We typically recommend that our
customers purchase the most powerful
server (in terms of CPU speed and
number of processors, RAM, disks) they
can afford which meets their projected
growth requirements. We do this in
order to help minimize our customers’
hardware upgrade costs later.
As we attempt to load more users onto a
single database server, we have
essentially reached the maximum user
load that currently available Intel
platforms can support under NetWare
and NT Server.
Some network OS configuration
parameters are available that will allow
more users to be loaded onto the
system than what would currently be
allowed. One in particular involves NT
Server. Four optimization settings are
available under Network Properties >
Services > Server > Properties. If
“Maximize throughput for network
applications” is enabled, you can bring
NT Server up to an equivalent
performance level with NetWare as an
application server. We are continually
testing other configuration parameters
as well to determine what performance
gains may be achieved.
To meet the user load requirements for
our larger (greater than 500 users)
customer sites MedicaLogic has
extensively evaluated several UNIX
server platforms. From this evaluation
MedicaLogic has chosen HP-UX servers
as our UNIX database server platform of
choice for our large customer
implementations.
WAN configurations
available and
overcoming WAN
bottlenecks
Many customers have remote clinics for
which they want to enable remote client
access to the Logician database, as well
as larger sites with hundreds of users.
To help our customers achieve this goal,
MedicaLogic also performs network load
testing. These
tests determine the maximum user
loads different LAN / WAN / remote
connections can support and still
maintain product usability. The
connection speeds tested range from
28.8Kbps to 100Mbit/s including Frame
Relay and ISDN connections. From
these tests, MedicaLogic publishes a
WAN Planning Guide for customer use.
Due to the amount of traffic generated
by Logician, the number of remote
connections (especially on slower lines)
is limited.
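The limit on remote connections follows
from dividing the usable line speed by the
traffic each user generates. The Python
sketch below shows only the shape of that
calculation. The per-user figure of 16 Kbps
is a made-up placeholder, not a measured
Logician value, and the function name is
hypothetical.

```python
import math


def max_remote_users(link_kbps, per_user_kbps, usable_fraction=1.0):
    """Largest whole number of users whose combined traffic fits the usable link."""
    if per_user_kbps <= 0:
        raise ValueError("per_user_kbps must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return math.floor(link_kbps * usable_fraction / per_user_kbps)


# ASSUMPTION: 16 Kbps per active user is a placeholder, not a measurement.
PER_USER_KBPS = 16
for link_kbps in (28.8, 128, 1544):
    print(link_kbps, max_remote_users(link_kbps, PER_USER_KBPS))
```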
In order to reduce these limitations we
continue to investigate alternative
connection solutions with an eye toward
reducing our customers’ hardware
implementation and upgrade costs.
One particular solution we have
investigated is Citrix WinFrame (a multi-
user version of Windows NT). This
solution offered several benefits:
- centralized client administration
- reduced hardware upgrade costs
- reduced remote bandwidth
requirements while maintaining
acceptable client performance.
This solution allows our existing
customer sites with older PCs to use
their existing hardware with the newer
version of Logician.
As an example, the older versions of
Logician required a 486 running
Windows 3.11 as the base client
platform. The current version requires a
Pentium 100 running Windows 95 or
NT.
Tuning the client / server
application for
improved database
performance
The purpose of tuning our client/server
application is to achieve increased
performance. For Oracle, this would be
measured by an increase in the
SQL_AREA cache hit ratio of our
application such that it rises above 95%.
The SQL_AREA is Oracle's area for
SQL statements that the client machines
send over. As an example, prior
versions of Logician had a cache hit rate
much lower than this (less than 38%).
This means that over 62% of the
time Oracle had to fully parse and
validate a SQL statement, which is
highly CPU intensive.
Several methods being employed to
improve this ratio include:
1. Use Oracle OCI instead of ODBC
for database communications.
2. Reduce SQL query parsing through
the implementation and increased
use of indexing, cursors, and host
variables.
3. Reduce or eliminate redundant
queries.
4. Modify the database schema to
reduce table contention.
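The effect of host variables on the cache
hit ratio can be illustrated with a toy
model in Python. It assumes a cache keyed
on the exact statement text, with no
eviction; it does not call Oracle, and the
table and column names are hypothetical.
A statement with an embedded literal is
new text every time, while a statement
with a host variable repeats the same text
and is found in the cache after its first
parse.

```python
def hit_ratio(statements):
    """Fraction of statements whose exact text was already in the cache."""
    cache = set()
    hits = 0
    for sql in statements:
        if sql in cache:
            hits += 1
        else:
            cache.add(sql)  # first sighting: stands in for a full parse
    return hits / len(statements) if statements else 0.0


patient_ids = range(1, 101)

# Literal values embedded in the text: every statement is unique.
literal_sql = [f"SELECT name FROM patient WHERE id = {pid}" for pid in patient_ids]

# Host variable: the text is identical no matter which id is supplied.
bound_sql = ["SELECT name FROM patient WHERE id = :pid" for _ in patient_ids]

print(hit_ratio(literal_sql), hit_ratio(bound_sql))
```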
Methods 2, 3, and 4 have proven to be
the most effective for tuning Logician to
achieve the desired performance levels.
In tests conducted at MedicaLogic, the
current version of Logician achieves
a cache hit ratio of greater than 85% on
average.
The Future
We continue to explore hardware and
software issues that impact
performance. As new OS and software
versions as well as faster hardware
releases come into view, customers
tend to expect performance to improve
as well.
Bibliography
1. Performance Tuning Tips for
Oracle7 RDBMS on Microsoft
Windows NT White Paper
Desktop Performance Group
Oracle Corporation, February 1995
2. Configuring Oracle Server for
VLDB
Cary V. Millsap
Oracle System Performance Group
Oracle Corporation, March 7, 1996
3. High Confidence Load Testing: A
Practical Approach to Low Risk
Implementation, Oracle White Paper
Doug Chandler, Darryl Presley, Robert
Michael, Doug Liles
Oracle Large Systems Support
Oracle Corporation, February 1997