University of California
Larry L. Sautter Award Submission
Innovation in Information Technology
at the University of California, San Francisco
Submitted By:
Mr. Michael Williams, MA
University of California
Executive Director, Information Technology
UC San Francisco Diabetes Center, Immune Tolerance Network
and
Chief Information Officer
UC San Francisco Neurology, Epilepsy Phenome/Genome Project
Telephone: (415) 860-3581
Email: mwilliams@immunetolerance.org
Date Submitted: Friday, May 18, 2007
Table of Contents
1. Project Team
   1.1. Team Leaders
   1.2. Team Members
2. Project Summary and Significance
3. Project Description
   3.1. Background Information
   3.2. Situation Prior to ARCAMIS
   3.3. After ARCAMIS Deployment
   3.4. Business Impact
4. Technologies Utilized
   4.1. The ARCAMIS Suite
   4.2. ITIL Team Based Operating Model
   4.3. Security Model and Architecture
   4.4. Data Center Facilities
   4.5. Internet Connectivity
   4.6. Virtual CPU, RAM, Network, and Disk Resources
   4.7. Operating Systems Supported
   4.8. Backup, Archival, and Disaster Recovery
   4.9. Monitoring, Alerting, and Reporting
   4.10. IT Service Management Systems
5. Implementation Timeframe
   5.1. Project Timeline
6. Customer Testimonials
Appendices
   Appendix A – Capabilities Summary of the ARCAMIS Suite
   Appendix B – Excerpt from the ARCAMIS Systems Functional Specification
1. Project Team
1.1. Team Leaders
Michael Williams, M.A.
Executive Director, Information Technology
UC San Francisco Diabetes Center, Immune Tolerance Network
and
Chief Information Officer
UC San Francisco Neurology, Epilepsy Phenome/Genome Project
Gary Kuyat
Senior Systems Architect, Information Technology
UC San Francisco Diabetes Center, Immune Tolerance Network
and
UC San Francisco Neurology, Epilepsy Phenome/Genome Project
1.2. Team Members
Immune Tolerance Network Information Technology:
Jeff Angst
Project Manager
Lijo Neelankavil
Systems Engineer
Diabetes Center Information Technology:
Aaron Gannon
Systems Engineer
Project Sponsors:
Michael Williams, M.A.
Executive Director, Information Technology, Immune Tolerance Network
Jeff Bluestone, Ph.D.
Director, Diabetes Center and Immune Tolerance Network
Daniel Lowenstein, M.D.
Department of Neurology at UCSF; Director of the UCSF Epilepsy Center
Mark A. Musen, M.D., Ph.D.
Professor and Head, Stanford Medical Informatics
Hugh Auchincloss, M.D.
Chief Operating Officer, Immune Tolerance Network (at time of project); currently Principal Deputy Director of NIAID at NIH
2. Project Summary and Significance
By deploying the Advanced Research Computing and Analysis Managed Infrastructure Services (ARCAMIS) suite, the Immune Tolerance Network (ITN) and Epilepsy Phenome/Genome Project (EPGP) at the University of California, San Francisco (UCSF) have implemented multiple Tier 1 networks, physically secured enterprise class datacenters, storage area network (SAN) data consolidation, and server virtualization to achieve a centralized, scalable network and system architecture that is responsive, reliable, and secure. This is combined with a nationally consistent, team centric operating model based on Information Technology Infrastructure Library (ITIL) best practices. Our deployed solution is compliant with applicable confidentiality regulations and assures 24 hour business continuance with minimal loss of data in the event of a major disaster. ARCAMIS has also provided significant savings on IT costs.
Over the last three years we have efficiently met constantly expanding demands for IT resources by virtualizing disk, CPU, RAM, network, and ultimately servers. ARCAMIS has allowed us to provision and support hundreds of production, staging, testing, and development servers at a ratio of 25 guests to one physical host. By using IP remote management technologies that do not require physical presence, together with server consolidation, virtualization, and SAN based thin provisioning of storage, we have effectively untied infrastructure upgrades from service delivery cycles. Furthermore, centralizing storage on a Storage Area Network (SAN) has given us the ability to provide real-time server backups (no backup window) and hourly disaster recovery snapshots to a Washington, DC disaster recovery (DR) site, supporting business continuance within hours of a disaster.
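The submission contains no code; as a minimal illustration of the 25:1 consolidation ratio and the hourly DR snapshot window described above, the following Python sketch shows the underlying arithmetic. The VM count used is a hypothetical input, not a figure from the project.

    # Rough consolidation and snapshot-window arithmetic (illustrative only).
    import math

    GUESTS_PER_HOST = 25        # 25 guests per physical host, as stated above
    SNAPSHOT_INTERVAL_MIN = 60  # hourly DR snapshots, as stated above

    def hosts_needed(virtual_machines: int, guests_per_host: int = GUESTS_PER_HOST) -> int:
        """Physical hosts required to run the given number of VMs."""
        return math.ceil(virtual_machines / guests_per_host)

    if __name__ == "__main__":
        vm_count = 300  # hypothetical fleet size
        print(f"{vm_count} VMs need {hosts_needed(vm_count)} physical hosts at 25:1")
        print(f"Worst-case data loss after a site failure: {SNAPSHOT_INTERVAL_MIN} minutes")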
ARCAMIS provides the University of California with a proven case study of how to implement enterprise class IT infrastructures and operating models for the benefit of NIH funded clinical research at UCSF. We have accelerated the time from the bedside to the bench in clinical research by taking the IT infrastructure out of the clinical trials' critical path, thereby providing a positive impact on our core business: preventing and curing human disease. ARCAMIS is more agile and responsive, having reduced server acquisition time from weeks to a matter of hours. ARCAMIS is significantly more secure and reliable, providing on the order of 99.998% technically architected uptime, and we have greatly improved the performance and utilization of our IT assets. We have created hundreds of thousands of dollars in measurable cost savings. ARCAMIS is environmentally friendly, significantly reducing our consumption of resources such as power and cooling. ARCAMIS can serve as a blueprint for enterprise class clinical research IT infrastructure services throughout the University of California, at partner research institutions and universities, and at the National Institutes of Health.
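As a quick sanity check on the 99.998% uptime figure quoted above, the following Python sketch converts an availability percentage into allowed downtime per year. This is standard availability arithmetic, not a calculation taken from the submission.

    # Convert an availability target into permitted downtime per year.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    def downtime_minutes_per_year(availability_percent: float) -> float:
        """Minutes of downtime per year allowed by the given availability target."""
        return MINUTES_PER_YEAR * (1 - availability_percent / 100.0)

    if __name__ == "__main__":
        for target in (99.9, 99.998, 99.999):
            print(f"{target}% uptime allows about "
                  f"{downtime_minutes_per_year(target):.1f} minutes/year of downtime")

At 99.998%, this works out to roughly 10.5 minutes of downtime per year.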
Technologies used include Hewlett Packard ProLiant servers and 7000c series blades, VMware Virtual Infrastructure Enterprise 3.01, Network Appliance FAS3020 storage area networks, Cisco and Brocade networking equipment, Red Hat Enterprise LINUX, and Microsoft Windows Server 2003, among others.
3. Project Description
3.1. Background Information
The mission of the Immune Tolerance Network (ITN) is to prevent and cure
human disease. Based at the University of California, San Francisco (UCSF),
the ITN is a collaborative research project that seeks out, develops and
performs clinical trials and biological assays of immune tolerance. ITN
supported researchers are developing new approaches to induce, maintain,
and monitor tolerance with the goal of designing new immune therapies for
kidney and islet transplantation, autoimmune diseases and allergy and
asthma. Key to our success is the ability to collect, store, and analyze the huge amount of data collected in the ITN's 30+ global clinical trials at 90+ medical centers in a secure and effective manner, so a reliable, scalable, and adaptable IT infrastructure is paramount to this endeavor. The ITN is in the seventh year of 14-year contracts from the NIH National Institute of Allergy and Infectious Diseases (NIAID), the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), and the Juvenile Diabetes Research Foundation.
The Epilepsy Phenome/Genome Project (EPGP) studies the complex genetic factors that underlie some of the most common forms of epilepsy, bringing together 50 researchers and clinicians from 15 medical centers throughout the US. The overall strategy of EPGP is to collect detailed, high quality phenotypic information on 3,750 epilepsy patients and 3,000 controls, and to use state-of-the-art genomic and computational methods to identify the contribution of genetic variation to the epilepsy phenotype, developmental anomalies of the brain, and the varied therapeutic response of patients treated with antiepileptic drugs (AEDs). This initial five-year grant is funded by the NIH National Institute of Neurological Disorders and Stroke (NINDS).
To address these challenges, the ITN and EPGP turned to centralizing computing infrastructure in Tier 1 networked, enterprise class datacenters, virtualization, and data consolidation onto a Storage Area Network (SAN) with off-site disaster recovery replication. Combined with an ITIL based, team oriented, nationally consistent operating model leveraging specificity of labor, we are in a position to respond efficiently and scalably to the increasing demands of the organization and rapidly adapt the IT infrastructure to dynamic management goals. This is accomplished while minimizing costs and maintaining requisite quality: we have a true high availability architecture that minimizes the risk of data loss.
3.2. Situation Prior to ARCAMIS
Like most of today's geographically dispersed IT organizations, we were faced with the challenge of providing IT services in a timely, consistent, and cost effective manner with high customer satisfaction. Unlike many organizations, the ITN and EPGP have many M.D. and Ph.D. clinical research knowledge workers with higher than normal, computationally intensive IT requirements. Escalating site-specific IT infrastructure costs, unpredicted downtime, geographically inconsistent processes and procedures, and the lack of a team based operating model were among the challenges of supporting such a multi-site infrastructure. There was a general sense that IT could do better. The risk of data loss was real. Dynamically growing demands were making it more difficult to consistently provide high IT service quality, and site IT staff were largely reactive and isolated. Prior to the ARCAMIS deployment, the IT infrastructure faced many challenges:
1. High costs of running and managing numerous physical servers in inconsistent, multi-site server rooms with unreliable power, sub-standard cooling, and poorly laid out physical space. Intermittent and unexpected local facility downtime was common. Global website services were served from office servers connected via single T1 lines.
2. Lead time for delivering new services was typically six weeks, which directly impacted clinical trial costs. Procuring and deploying new infrastructure for new services or upgrades were major projects requiring significant downtime and the direct physical presence of IT staff.
3. Existing computing capacity was underutilized but still required technical support such as backups and patches, with individualized, site based processes and procedures. Little automation meant significant administrative effort, and a huge amount of IT administrative work went into managing site specific physical server support, asset tracking, and equipment leases at multiple sites.
4. The lack of an IT staff team operating model, a consistently automated architecture, and remote management technologies resulted in process and procedure inconsistency at any one site and led to severe variance in service quality and reliability by geography.
5. Limited IT maturity prevented discussion of higher level functions such as auditable policies and procedures, disaster recovery, redundant network architectures, and security audits, all required for NIH clinical trial safety compliance.
3.3. After ARCAMIS Deployment
ARCAMIS represents a paradigm shift in our IT philosophy, both operationally and technically. The goal was to move out of a geographically specific, reactive mode to a prospective operating model and technical architecture designed from the ground up to align with the organization's growing, dynamic demands for IT services.
Most importantly, we worked with management prospectively to understand service quality expectations and the requirement to scale up to 30 clinical trials in seven years. Given management objectives and our limited resources, we realized a need for a more team centric operating model providing specificity of labor. As a result, we logically grouped our human resources into a Support team and an Architecture team. This gave more senior technical talent the time they needed to re-engineer, build, and migrate to the ARCAMIS solution while more junior talent continued to focus on day-to-day reactive issues.
From a technical perspective, we engineered an architecture that would eliminate or automate time consuming tasks and improve reliability. By centralizing all ARCAMIS managed infrastructure into bi-coastal, carrier diverse, redundant, Tier 1 networked, enterprise class datacenters, and by using fully "lights-out" Hewlett Packard ProLiant servers with 4 hour on-site physical support and a remote, IP based server administration model, we have dramatically improved service reliability and supportability without adding administrative staff. The same senior staff now supports twice the number of physical servers and 20 times the virtual servers. For example, it is now common for engineers to administer infrastructure at all seven sites simultaneously, including handling hard reboots and physical failures.
With the integration of VMware Virtual Infrastructure 3.01 Enterprise server virtualization technology, ARCAMIS reduces the number of physical servers at our data centers while continuing to meet exponentially expanding business server requirements. Less hardware yields
a reduction in initial server hardware costs and saves ongoing data center
lease, power and cooling costs associated with ARCAMIS infrastructure. The
initial capital expenditure was about the same as purchasing physical servers
due to our investment in virtualization and SAN technologies.
By consolidating all server data onto the Network Appliance Storage Area Network, the ARCAMIS project deployed a 99.998% uptime, 25 TB production and disaster recovery cluster in San Francisco and a 25 TB, 99.998% uptime production and disaster recovery cluster in the Washington, DC metro area. The SAN allows us to reduce cost and complexity via automation, resulting in dramatic improvements in operational efficiency. We can more efficiently use what we already own, oversubscribe disk, and eliminate silos of underutilized storage. Current storage utilization at the primary site is 65%, up from a 25% average per server using Direct Attached Storage (DAS). We can seamlessly scale to 100 terabytes of storage by simply adding disk shelves, which is not possible with a server based approach. Another key benefit of using SAN technology is risk mitigation via completely automated backup, archival, and offsite replication. File restores are near instantaneous, eliminating the need for human resource intensive and less reliable tape backup approaches.
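To make the oversubscription and utilization figures above concrete, here is a minimal Python sketch of thin-provisioning arithmetic. Only the 25 TB cluster size and the 65%/25% utilization figures come from the text; the provisioned and used capacities are hypothetical.

    # Illustrative thin-provisioning arithmetic (hypothetical volume sizes).
    def oversubscription_ratio(provisioned_tb: float, physical_tb: float) -> float:
        """How much logical capacity has been promised per TB of physical disk."""
        return provisioned_tb / physical_tb

    def utilization(used_tb: float, physical_tb: float) -> float:
        """Fraction of physical capacity actually consumed."""
        return used_tb / physical_tb

    if __name__ == "__main__":
        physical = 25.0      # TB in the cluster, as stated in the text
        provisioned = 40.0   # hypothetical total of all thin-provisioned volumes
        used = 16.25         # hypothetical data actually written (65% of physical)
        print(f"Oversubscription: {oversubscription_ratio(provisioned, physical):.1f}x")
        print(f"Physical utilization: {utilization(used, physical):.0%}")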
Combining the SAN with VMware Infrastructure 3.01 Enterprise server virtualization technologies provides a reliable, extensible, manageable, high availability architecture. Adjusting to changing server requirements is simple because of the SAN's storage expansion and reduction capability for live volumes and VMware's ability to scale from one to four 64-bit CPUs with up to 16 GB RAM and 16 network ports per virtual server. Also, oversubscription allows the ITN to more efficiently use the disk, RAM, and CPU we already own. We can seamlessly control server, firewall, network, and data adds, removes, and changes without business service interruption. The SAN and VMware ESX combination provides excellent performance and reliability, using both Fibre Channel and iSCSI multipathing for a redundant disk-to-server access architecture. For certain applications we can create highly available clustered systems truly architected to meet rigorous 99.998% uptime requirements. VMs boot from the SAN and are replicated locally and off-site while running. This improves business scalability and agility via accelerated service deployment and expanded utilization of existing hardware assets. Physical server maintenance requiring the server to be shut down or rebooted is done during regular working hours without downtime, thanks to support for VMotion, the ability to move a running VM from one physical machine to another. This has greatly reduced off-hours engineer work. Increasing data security and compliance requirements are also met through the centralized control provided by the SAN. In our experience, storage availability determines service availability; automation guarantees the service quality of storage.
[Figure: Bi-coastal SAN and virtualization architecture. Each site (San Francisco, CA and Herndon, VA) hosts an active/active, high availability 25 TB Fibre Channel cluster built from NetApp FAS controllers, DS14 disk shelves, and 16 port FC switches, serving two VMware ESX Server hosts that each run multiple VMware virtual servers. Passive synchronization replicates data between the two sites.]
The ARCAMIS project has proven and demonstrated the many benefits
promised by these new enterprise class technologies. We have significantly
increased the value of IT to our core business, slashed IT operating costs, and
radically improved the quality of our IT service. The ARCAMIS architecture
and operating model is a core competency which other UC organizations can
leverage to achieve similar benefits.
Just some of the benefits resulting from the ARCAMIS project include the
following:
1. Saved hundreds of thousands of dollars and improved security,
reliability, scalability, and deployment time.
2. Helped the environment by reducing power consumption by a factor of
20 for a comparable service infrastructure.
3. Improved conformance with federal and state regulations such as
HIPAA and 21 CFR Part 11.
4. Centralized critical infrastructure into Tier 1, redundantly multi-homed, enterprise class datacenters. Space previously spread across our 8 sites has been consolidated into three data centers using 5 server racks.
5. Eliminated the inconsistent complexity of our IT infrastructure and processes and procedures, and ensured uptime for our business critical applications, even in the event of hardware failures. All new solution deployments are based on nationally consistent operating models and technical architectures.
6. Consolidated data to SAN and VMWare servers. The infrastructure is
architected to be a true 99.998% uptime solution. Our biggest
downtime risk is human error.
7. Any staff member with security privileges can manage any device at
any site from any Internet connected PC; including hardware failures
and power cycles. We efficiently provision, monitor and manage the
infrastructure with a single top console.
8. Standardized virtual server builds, procurement and deployment time
reduced from as much as 6 weeks to 2 hours without investing in new
server hardware.
9. Automated backup, archival, and disaster recovery.
10. Cloned production servers for testing and troubleshooting. Systems and networks can be cloned while running, with zero downtime, and rebooted in virtual lab environments. Servers can be rotated back into production with only a few seconds of downtime.
11. Average CPU utilization has risen from 5% to 30% while retaining peak capacity.
12. Disk utilization has risen from 25% to 65%.
13. Savings are in the region of $200,000 in the past 12 months and will continue to grow as the architecture scales (a rough power and cost illustration appears after this list).
14. Multiple operating systems are supported, including Red Hat LINUX, MS Windows 2000, and MS Windows 2003, with both 32 and 64 bit versions of all supported operating systems. These can all be deployed on the same physical server, reducing our dependence on vendors' proprietary solutions.
15. Reduced support overhead and power consumption of legacy
applications by migrating these into the virtual environment.
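As a rough, hedged illustration of the power and cost benefits in this list, the Python sketch below estimates annual power savings from consolidation. Only the 25:1 ratio and the roughly 20x power reduction claim come from the text; the fleet size, per-server wattage, and electricity price are assumptions. With equal per-server wattage the ratio alone gives about 25x; in practice the virtualization hosts draw more power, which brings the figure closer to the stated factor of 20.

    # Hypothetical power-savings estimate (wattage, price, and fleet size are assumptions).
    PHYSICAL_SERVERS_BEFORE = 100   # hypothetical pre-consolidation fleet
    CONSOLIDATION_RATIO = 25        # guests per physical host, per the text
    WATTS_PER_SERVER = 400          # assumed average draw per physical server
    USD_PER_KWH = 0.12              # assumed electricity price
    HOURS_PER_YEAR = 24 * 365

    hosts_after = -(-PHYSICAL_SERVERS_BEFORE // CONSOLIDATION_RATIO)  # ceiling division
    kwh_before = PHYSICAL_SERVERS_BEFORE * WATTS_PER_SERVER * HOURS_PER_YEAR / 1000
    kwh_after = hosts_after * WATTS_PER_SERVER * HOURS_PER_YEAR / 1000

    print(f"Servers: {PHYSICAL_SERVERS_BEFORE} -> {hosts_after}")
    print(f"Power reduction factor: {kwh_before / kwh_after:.0f}x")
    print(f"Approximate annual power savings: ${(kwh_before - kwh_after) * USD_PER_KWH:,.0f}")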
3.4. Business Impact
ARCAMIS provides the University of California with a proven case study of how to implement enterprise class IT infrastructures and operating models for the benefit of NIH funded clinical research at UCSF. We have accelerated the time from the bedside to the bench in clinical research by taking the IT infrastructure out of the clinical trials' critical path, thereby providing a positive impact on our core business: preventing and curing human disease. ARCAMIS is more agile and responsive, having reduced server acquisition time from weeks to a matter of hours. ARCAMIS is significantly more secure and reliable, providing on the order of 99.998% technically architected uptime, and we have greatly improved the performance and utilization of our IT assets. We have created hundreds of thousands of dollars in measurable cost savings. ARCAMIS is environmentally friendly, significantly reducing our consumption of resources such as power and cooling. ARCAMIS can serve as a blueprint for enterprise class clinical research IT infrastructure services throughout the University of California, at partner research institutions and universities, and at the National Institutes of Health.
4. Technologies Utilized
4.1. The ARCAMIS Suite
This suite of Academic Research Computing and Analysis Managed
Infrastructure Services (ARCAMIS) includes the following technology
components:
1. ITIL based, nationally consistent, labor specific, team IT operating
model
2. Security model and architecture (including firewalls, intrusion
detection, VPN, automated updates)
3. Enterprise class data center facilities
4. Tier 1, multi-homed, redundant, carrier diverse, networks
5. Virtual CPU, RAM, Network, and Disk resources based on Hewlett
Packard Proliant servers, VMware Infrastructure Enterprise 3.01 and
Network Appliance Storage Area Network (SAN)
6. Various 32 and 64 bit LINUX and Windows Operating Systems
7. Backup, archival and disaster recovery
8. Monitoring, alerting, and reporting
9. IT service management systems
4.2. ITIL Team Based Operating Model
The ARCAMIS operating model is based on ITIL best practices, is nationally consistent and team based, and uses specificity of labor. Via our formal, documented Infrastructure Lifecycle Process (ILCP), standard operating procedures (SOPs), and support documentation such as operating guides and systems functional specifications, the ARCAMIS infrastructure evolves through its lifecycle of continuous improvement. Below are samples of the IT policies and procedures used.
IT Policies
Standard Operating Procedures
Our goal moving forward is to be a fully ITIL based shop within the next 12 months. As the organizational chart below shows, the ARCAMIS team is logically grouped into a prospective engineering team and an administration and support team.
Organizational Chart (summarized from the original diagram)
Executive Director, Information Technology
- Manager, Customer Engineering (Level 1 and 2 Support Team): Customer Engineers at Laurel Heights/CB, BEA/ITI, Parnassus (two), and Pittsburgh
- IT Office and Operations Manager
- Systems and Network Architect (Server and Network Engineering Team, Level 3 and 4 Support): two Systems and Network Engineers
4.3. Security Model and Architecture
ARCAMIS is required to meet at minimum the Security Category and Level of MODERATE for Confidentiality, Integrity, and Availability as defined by the National Institutes of Health. Compliance with this Security Category spans
the entire organization from the initial Concept Proposal phase, through
clinical trial design and approval, into trial operations where patient
information is gathered, including data collection and specimen storage.
Significant amounts of confidential, proprietary and unique patient data are
collected, transferred, and stored in the ARCAMIS infrastructure for analysis
and dissemination by approved parties. Certain parts of the infrastructure are
able to satisfy HIPAA and 21 CFR Part 11 compliance. This becomes
especially important as the ITN and EPGP organizations continue to innovate
and develop new intellectual property which may have significant market
value.
Information Security Category Requirements
Exceeding the minimum compliance requirements with this Information
Security Category is achieved by a holistic approach addressing all aspects of
the ARCAMIS personnel, operations, physical locations, networks and
systems. This includes tested, consistently executed, and audited plans,
policies and procedures, and automated, monitored, and logged security
technologies used on a day to day basis. The overall security posture of ARCAMIS has many aspects, including legal agreements with partners and
employees, personnel background checks and training, organization wide
disaster recovery plans, backup, systems and network security architectures
(firewalls, intrusion detection systems, multiple levels of encryption, etc.),
and detailed documentation requirements.
Consistent with the NIH Application/System Security Plan (SSP) Template for
Applications and General Support Systems and the US Department of Health
and Human Services Official Information Security Program Policy (HHS IRM
Policy 2004-002.001), ARCAMIS maintains a formal information systems
security program to protect the organization’s information resources. This is
called the Information Security and Information Technology Program (ISITP).
The ISITP delineates security controls into four primary categories: management, operational, technical, and standard operating procedures. These categories structure the organization of the ISITP.
- Management Policies focus on the management of information security systems and the management of risk for a system. They are techniques and concerns addressed by management; examples include Capital Planning and Investment, and Risk Management.
- Operational Policies address security methods focusing on mechanisms primarily implemented and executed by people (as opposed to systems). These controls are put in place to improve the security of a particular system or group of systems; examples include Acceptable Use, Personnel Separation, and Visitor Policies.
- Technical Policies focus on security controls that the computer system executes. These controls can provide automated protection against unauthorized access or misuse, facilitate detection of security violations, and support security requirements for applications and data; examples include password requirements, automatic account lockout, and firewall policies.
- Standard Operating Procedures (SOPs) focus on logistical procedures that staff perform routinely to ensure ongoing compliance; examples include IT Asset Assessment, Server and Network Support, and Systems Administration.
Specifically, the ARCAMIS ISITP includes detailed definitions of the following
Operational and Technical Security Policies.
PERSONNEL SECURITY
Background Investigations
Rules of Behavior
Disciplinary Action
Acceptable Use
Separation of Duties
Least Privilege
Security Education and Awareness
Personnel Separation
RESOURCE MANAGEMENT
Provision of Resources
Human Resources
Infrastructure
PHYSICAL SECURITY
Physical Access
Physical Security
Visitor Policy
MEDIA CONTROL
Media Protection
Media Marking
Sanitization and Disposal of Information
Input/Output Controls
COMMUNICATIONS SECURITY
Voice Communications
Data Communications
Video Teleconferencing
Audio Teleconferencing
Webcast
Voice-Over Internet Protocol
Facsimile
WIRELESS COMMUNICATIONS
SECURITY
Wireless Local Area Network (LAN)
Multifunctional Wireless Devices
EQUIPMENT SECURITY
Workstations
Laptops and Other Portable Computing
Devices
Personally Owned Equipment and Software
Hardware Security
ENVIRONMENTAL SECURITY
Fire Prevention
Supporting Utilities
DATA INTEGRITY
Documentation
NETWORK SECURITY POLICIES
Remote Access and Dial-In
Network Security
Monitoring
Firewall
System-to-System Interconnection
Internet Security
SYSTEMS SECURITY POLICIES
Identification
Password
Access Control
Automatic Account Lockout
Automatic Session Timeout
Warning Banner
Audit Trails
Peer-to-Peer Communications
Patch Management
Cryptography
Malicious Code Protection
Product Assurance
E-Mail Security
Personal E-Mail Accounts
These policies serve as the foundation of the ARCAMIS Standard Operating Procedures and technical infrastructure architectures, which combined create a secure environment based on security best practices.
Security Infrastructure Architecture
To ensure a hardened information security and information technology environment, ARCAMIS has centralized its critical information technology infrastructure into two Tier 1 data centers. Facilities include uninterruptible power supplies and backup diesel generators that can keep servers running indefinitely without direct electric grid power. The data centers are equipped with optimal environmental controls, including sophisticated air conditioning and humidifier equipment, as well as stringent physical security systems. They provide 24x7 Network Operations Center network monitoring and physical security. Each data center also includes water-free fire suppression systems so as not to damage the servers.
For secure data transport, ARCAMIS provides a carrier diverse, redundant, secure, reliable, Internet connected, high speed Local Area Network (LAN) and Wide Area Network (WAN). The ARCAMIS network and Virtual Private Network (VPN) form the foundation for all ARCAMIS IT services and are used by every ITN and EPGP stakeholder every day. The high speed WAN is protected at all locations by firewalls with intrusion detection, monitoring, and logging. Firewall and VPN services are provided by industry leading Microsoft and Cisco products. All network traffic between ITN sites, desktops, and partner organizations that travels over public networks is encrypted with at least 128-bit encryption using security protocols including IPSec, SFTP, RDC, Kerberos, and others. We have also implemented a wildcard certificate architecture for all port 443 communications, allowing rapid deployment of new secured services.
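As a small illustration of verifying that an HTTPS service presents a valid certificate for its hostname (the kind of port 443 deployment described above), the following Python sketch uses only the standard library ssl module. The hostname is a placeholder, not an ARCAMIS endpoint, and this is not the project's actual tooling.

    # Check that a TLS endpoint presents a certificate valid for its hostname.
    import socket
    import ssl

    def check_certificate(hostname: str, port: int = 443) -> str:
        """Open a TLS connection and return the certificate's subject common name."""
        context = ssl.create_default_context()  # verifies the chain and hostname
        with socket.create_connection((hostname, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                cert = tls.getpeercert()
        subject = dict(item for pair in cert["subject"] for item in pair)
        return subject.get("commonName", "<no CN>")

    if __name__ == "__main__":
        host = "example.org"  # placeholder hostname
        print(f"{host} presents a certificate for: {check_certificate(host)}")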
To keep these systems monitored and patched, ARCAMIS provides IP ping, SNMP MIB monitoring, specific service monitoring with automated restarts, hardware monitoring, intrusion detection monitoring, and website monitoring of the ARCAMIS production server environment. Server and end-user security patches are applied monthly via Software Update Services, and application and LINUX/Macintosh patches are pushed out on a monthly basis. We have standardized on McAfee Anti-Virus for virus protection and use Postini for e-mail spam and virus filtering.
The ITN's authoritative directory uses Microsoft Active Directory and is exposed via SOAP, RADIUS, and LDAP for cross platform authentication. The ITN currently uses an Enterprise Certificate Authority (ITNCA) for certificate based security authentication.
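The directory integration above is described only at a high level. As an illustrative sketch of cross platform authentication against Active Directory over LDAP, the snippet below uses the third-party ldap3 package, which is an assumption of this sketch and not a tool named in the document; the server, domain, and account are placeholders.

    # Illustrative LDAP bind against Active Directory (ldap3 assumed installed;
    # server, domain, and credentials are placeholders).
    from ldap3 import Server, Connection, NTLM, ALL

    def authenticate(username: str, password: str) -> bool:
        """Return True if the domain accepts the credentials via an NTLM bind."""
        server = Server("ldap.example.org", get_info=ALL)       # placeholder directory host
        conn = Connection(server, user=f"EXAMPLE\\{username}",  # placeholder NetBIOS domain
                          password=password, authentication=NTLM)
        ok = conn.bind()
        conn.unbind()
        return ok

    if __name__ == "__main__":
        print("Authenticated:", authenticate("jdoe", "not-a-real-password"))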
Comprehensive Information Security
The ITN has established mandatory policies, processes, controls, and procedures to ensure confidentiality, integrity, availability, reliability, and non-repudiation within the organization's infrastructure and its operations. It is the policy of ARCAMIS that the organization abide by or exceed the requirements outlined in the ITN Information Security and Information Technology Program, thereby exceeding the required Security Category and Level of MODERATE for Confidentiality, Integrity, and Availability outlined above. In addition, ARCAMIS implements additional security policies beyond the minimum requirements as appropriate for our specific operational and risk environment.
4.4. Data Center Facilities
The ITN has centralized its server architecture into two Tier 1 data centers.
The first is located in Herndon, VA with Cogent Communications, and the
second in San Francisco, CA with Level 3 Communications. An additional
research data center is located at the UCSF QB3 facility. Physical access
requires a badge and biometric hand security scanning, and the facilities have
24x7 security staff on-site. Each data center includes redundant
uninterruptible power supplies and backup diesel generators that can keep
each server running indefinitely without direct electric grid power. The centers
provide active server and application monitoring, helping hands and backup
media rotation capabilities. They are equipped with optimal environment
controls, including sophisticated air conditioning and humidifier equipment as
well as stringent physical security systems. There are also waterless fire
suppression systems. Power to our racks is provided by four redundant, monitored PDUs that report real-time power usage and alert us to power surges.
Herndon, VA Rack Diagram
[Figure: rack elevation for the Herndon, VA data center showing HP ProLiant ML570 and DL320 servers, NetApp FAS 3020 controllers, DS14 MK2 FC disk shelves, and tape drives.]
4.5. Internet Connectivity
Servicing the ARCAMIS customer base is a carrier diverse, redundant, firewalled, reliable, Internet connected high speed network. This network, combined with the Virtual Private Network (VPN), creates the foundation for all of the ARCAMIS services provided.
Internet connectivity is location dependent:
• San Francisco, China Basin (Level 3) – A Tier 1, 1000 Mbps Ethernet connection to the Internet is provided by Cogent Networks. UCSF provides a 100 Mbps Ethernet connection to redundant 45 Mbps OC198 connections and 100 Mbps Ethernet between UCSF campuses.
• San Francisco, Quantitative Biology III Data Center – The UCSF network provides a 1000 Mbps Ethernet connection to redundant 45 Mbps OC198 connections and 100 Mbps Ethernet between UCSF campuses.
• Herndon, VA – A Tier 1, 100 Mbps Ethernet connection to the Internet is provided by Cogent Networks. AT&T provides a 1.5 Mbps DSL backup connection.
4.6. Virtual CPU, RAM, Network, and Disk Resources
ARCAMIS uses a Network Appliance Storage Area Network with a 25 TB high availability cluster in Herndon and a 25 TB disaster recovery site in San Francisco. This allows us to reduce cost and complexity via automation and operational efficiency. We can seamlessly control adds, removes, and updates without business interruption for our critical storage needs. We can more efficiently use what we already own and eliminate silos of underutilized memory, CPU, network, and storage. This improves business scalability and agility via accelerated service deployment and expansion on existing hardware assets. We can scale to tens of terabytes of storage, which is not possible with a server based approach. Another key result of using this technology is risk mitigation: the architecture is designed to eliminate the possibility of critical data loss. Backup, archival, and restore are fully automated, so productivity loss in the event of user error or hardware failure drops from days to minutes or less, and business continuance in the event of a disaster is technologically automated. The increasing ARCAMIS data security and compliance requirements, including HIPAA, are able to be met with a SAN. In our experience, storage availability determines service availability; automation guarantees service quality.
VMware Virtual Infrastructure Enterprise 3.01 (VI3) is virtual infrastructure software for partitioning, consolidating, and managing servers in mission-critical environments. Ideally suited for enterprise data centers, VI3 minimizes the total cost of ownership of computing infrastructure by increasing resource utilization, and its hardware-independent virtual machines, encapsulated in easy-to-manage files, maximize administrative flexibility. VMware ESX Server allows enterprises to boost x86 server utilization to 60-80%, provision new systems faster with less hardware, decouple application workloads from underlying physical hardware for increased flexibility, and dramatically lower the cost of business continuity. ESX Server supports 64-bit VMs with 16 GB of RAM, meeting ARCAMIS's expanding server computing requirements.
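To make the per-VM sizing limits quoted in this document concrete, here is a minimal Python sketch that checks a requested VM specification against those limits (four virtual CPUs, 16 GB RAM, 16 network ports). The example request values are hypothetical, and this is only an illustration of the arithmetic, not an ARCAMIS provisioning tool.

    # Validate a requested VM against the per-VM limits quoted in this document.
    from dataclasses import dataclass

    MAX_VCPUS = 4     # up to four 64-bit CPUs per virtual server
    MAX_RAM_GB = 16   # up to 16 GB RAM per virtual server
    MAX_NICS = 16     # up to 16 network ports per virtual server

    @dataclass
    class VmRequest:
        name: str
        vcpus: int
        ram_gb: int
        nics: int

    def validate(req: VmRequest) -> list[str]:
        """Return a list of limit violations (empty means the request fits)."""
        problems = []
        if req.vcpus > MAX_VCPUS:
            problems.append(f"{req.vcpus} vCPUs exceeds the {MAX_VCPUS}-vCPU limit")
        if req.ram_gb > MAX_RAM_GB:
            problems.append(f"{req.ram_gb} GB RAM exceeds the {MAX_RAM_GB} GB limit")
        if req.nics > MAX_NICS:
            problems.append(f"{req.nics} NICs exceeds the {MAX_NICS}-NIC limit")
        return problems

    if __name__ == "__main__":
        request = VmRequest("staging-db", vcpus=2, ram_gb=8, nics=2)  # hypothetical request
        print(validate(request) or "request fits within the stated limits")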
Combining the SAN with server virtualization provides an extremely reliable, extensible, manageable, high availability architecture for ARCAMIS. The SAN provides near instantaneous VM backups, restores, and provisioning, plus off-site disaster recovery and archival. File restores are near instantaneous, eliminating the need for human resource intensive and less reliable client side disk management applications. Adjusting to changing server requirements is fast because of the SAN's storage expansion and reduction capability for live volumes. Also, oversubscription allows the ITN to use the disk we already own significantly more efficiently. The SAN and VMware ESX combination provides excellent performance and reliability using both Fibre Channel and iSCSI multipathing. VMs boot from the SAN and are replicated locally and off-site while running. For certain applications we can create highly available clustered systems with even greater than 99.998% uptime. Finally, server maintenance can be done during regular working hours without downtime, thanks to support for VMotion, the ability to move a running VM from one physical machine to another.
4.7. Operating Systems Supported
ARCAMIS supports several operating systems, including multiple LINUX distributions (notably Red Hat Enterprise LINUX), i386 Solaris, and 32 and 64 bit versions of Microsoft Windows.
4.8. Backup, Archival, and Disaster Recovery
ARCAMIS data availability, backup, and archival are provided by a Storage Area Network (SAN) with a 25 TB high availability cluster in Herndon and a 25 TB disaster recovery site in San Francisco. This SAN houses ARCAMIS critical clinical data and IT server data. The SAN automates backup, archival, and restore via the NetApp SnapMirror, SnapBackup, and SnapRestore applications. All critical data at the San Francisco and Herndon sites is replicated to the other site within one hour. In the event of a major disaster at either ARCAMIS datacenter site, at most 60 minutes of data loss can occur, and the critical server infrastructure can be failed over to the other coast's facility for business continuance. In addition to the SAN, ARCAMIS uses a 7 day incremental backup to an offline disk rotation, with monthly off-site stored archives for all production data, based on Symantec Veritas software.
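The hourly replication schedule above implies a worst-case recovery point of 60 minutes. The Python sketch below shows how such a recovery point objective might be checked against the timestamp of the last successful replication; the timestamp handling is generic and is not tied to the NetApp tooling named above.

    # Check the age of the last successful replication against a 60-minute RPO.
    from datetime import datetime, timedelta, timezone

    RPO = timedelta(minutes=60)  # hourly cross-site replication, per the text

    def rpo_satisfied(last_replication_utc: datetime) -> bool:
        """True if the most recent replication is within the recovery point objective."""
        now = datetime.now(timezone.utc)
        return (now - last_replication_utc) <= RPO

    if __name__ == "__main__":
        last = datetime.now(timezone.utc) - timedelta(minutes=42)  # hypothetical timestamp
        print("Within RPO:", rpo_satisfied(last))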
4.9. Monitoring, Alerting, and Reporting
We use various monitoring and reporting technologies, and two IT staff perform full infrastructure monitoring audits twice daily, five days per week, once at 8:00am EST and again at 3:00pm PST. We use a 1-800 Priority 1 issue resolution line that pages and calls five senior engineers simultaneously in the event of a major system failure or issue, and an on-call rotation schedule that changes weekly. We use the following technologies: Microsoft Operations Manager (MOM), WebWatchBot, Brocade Fabric Manager, NetApp Operations Manager, VMware Operations Manager, Cacti, and Oracle, among others.
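The monitoring stack above is commercial. As a minimal, hedged illustration of the kinds of checks listed in the table below (ping/port reachability and HTTP/HTTPS URL monitoring), here is a small Python sketch using only the standard library; the hosts and URL are placeholders, not ARCAMIS systems.

    # Minimal reachability and URL checks (standard library only; placeholder targets).
    import socket
    import urllib.request

    def port_open(host: str, port: int, timeout: float = 5.0) -> bool:
        """TCP-level check that a service port accepts connections."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def url_ok(url: str, timeout: float = 10.0) -> bool:
        """HTTP/HTTPS check that a URL responds with a 2xx status."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except OSError:
            return False

    if __name__ == "__main__":
        print("SMTP port reachable:", port_open("mail.example.org", 25))  # placeholder host
        print("Website healthy:", url_ok("https://www.example.org/"))     # placeholder URL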
Cacti Disk Utilization Graph
Below is a sample disk utilization graph.
Monitoring Table
Below is a partial list of monitoring we do.
Server and Network Monitoring
Customer Defined Transaction Monitoring
ODBC Database Query Verification
Ping Monitoring
SMTP Server and Account Monitoring
POP3 Server and Account Monitoring
FTP Upload/Download Verification
File Existence and Content Monitoring
Disk/Share Usage Monitoring
Microsoft Performance Counters
Microsoft Process Monitoring
Microsoft Services Performance Monitoring
Microsoft Services Availability Monitoring
Event Log Monitoring
HTTP/HTTPS URL Monitoring
Customer Specified Port Monitoring
Active Directory
Exchange Intelligent Message Filter
HP ProLiant Servers
Microsoft .NET Framework
Microsoft Baseline Security Analyzer
Microsoft Exchange Server Best Practices
Analyzer
Microsoft Exchange Server
Microsoft ISA Server
Microsoft Network Load Balancing
Microsoft Office Live Communications Server 2003
Microsoft Office Live Communications Server 2005
Microsoft Office Project Server
Microsoft Office SharePoint Portal Server 2003
Microsoft Operations Manager MPNotifier
Microsoft Operations Manager
Microsoft Password Change Notification Service
Microsoft SQL Server
Microsoft Web Sites and Services MP
Microsoft Windows Base OS
Microsoft Windows DFS Replication
Microsoft Windows Distributed File Systems
Microsoft Windows DHCP
Microsoft Windows Group Policy
Microsoft Windows Internet Information Services
Microsoft Windows RRAS
Microsoft Windows System Resource Manager
Microsoft Windows Terminal Services
Microsoft Windows Ultrasound
NetApp
Volume Utilization
Global Status Indicator
Hardware Event Log
Visual Inspection
Ambient Temperature
Temperature Trending
Location WAN Connectivity
4.10. IT Service Management Systems
We use Remedy and Track-IT Enterprise for Ticketing, Asset Tracking, and
Purchasing.
5. Implementation Timeframe
5.1. Project Timeline
6. Customer Testimonials
“ARCAMIS provides services that allow the ITN knowledge workers to focus
on answering the difficult scientific questions in immune tolerance; we don’t
waste time on basic IT infrastructure functions. ARCAMIS allows me to be
confident our research patient data is stored in a secure, reliable and
responsive IT infrastructure. For example, last week we did a demonstration
to the Network Executive Committee of our Informatics data management
and collaboration portal in real-time. This included the National Institute of
Health senior management responsible for our funding… it all worked
perfectly. This entire application was built on ARCAMIS.”
Jeffrey A. Bluestone, Ph.D.
Director, UCSF Diabetes Center
Director, Immune Tolerance Network
A.W. and Mary Clausen Distinguished Professor of Medicine, Pathology,
Microbiology and Immunology
“With ARCAMIS we are well positioned to meet the rigorous IT requirements
of an NIH funded study. Within weeks of project funding from the NIH, our
entire secure research computing network and server infrastructure of more
than 10 servers was built, our developers finished the public website, and we
began work on the Patient Recruitment portal. That would have taken at least
6 months if I had to hire a team to procure and build it ourselves.
Accelerating scientific progress in neurology is core to everything we do;
ARCAMIS has been an important part of what we are currently doing.”
Daniel H. Lowenstein, M.D.
Professor of Neurology, UCSF and
Director, Physician-Scientist Education and Training Programs
Director, Epilepsy Phenome Genome Project
“With the investment in ARCAMIS, UCSF and the ITN can confidently partner
with other leading medical research universities across the country. At the
ITN we depend on the on-demand, services based, scalable computing
capacity of ARCAMIS every day to enable our collaborative data analysis and
Informatics data visualization applications.”
Mark Musen, Ph.D.
Director, Medical Informatics Department
Stanford University
Deputy Director, Immune Tolerance Network
Appendices
Appendix A – Capabilities Summary of the ARCAMIS Suite
Fundamentals
• 99.998% production solution uptime guaranteed via Service Level
Agreement.
• Managed multi-homed, Tier 1 network (Zero Downtime SLA)
• High speed 1000mbs connectivity to UCSF network space.
• Bi-coastal, world-class data centers hosted with Level 3 and Cogent Communications, with redundant power and HVAC systems
• Managed DNS or use UCSF DNS
• Managed Active Directory for “Production Servers” and integration with
UCSF CAMPUS AD via trust.
• Phone, e-mail and web based ticketing system to track all issues
• Mature purchasing services with purchases charged to correct account
Monitoring & Issue Response
• 8am EST to 5pm PST business day access to live support personnel
• 24/7/365 coverage with one primary on-call engineer and off-hours paging, plus a 1-800 P1 issue number that rings 5 infrastructure engineers simultaneously.
• Microsoft Operations Manager monitoring (CPU, RAM, disk, event log,
ping, ports and services)
• Application script response monitoring for web applications, including
SSL via WebWatchBot 5
• HP Remote Insight Manager hardware monitoring with 4 hour vendor
response on all servers
• NetApp corporate monitoring and 4 hour time to resolution with a fully stocked parts depot for the Storage Area Network.
• 24x7 staffed datacenters with secure physical access to all servers
• 24x7 staffed Network Operating Center for WAN
• Notification preferences and standard response specifications can be
customized
Backup, Restore and Disaster Recovery/Business Continuance
• Symantec Backup Exec server agents for Oracle, SQL, MySQL, and
Exchange servers with 7 nightly incremental backups.
• 14 local daily snapshots of full “crash consistent” server state
• Hourly off-site snapshots of full “crash consistent” server state with 40
hourly restore points for DR
• Monthly archive of the entire infrastructure that rolls to quarterly after 3 months (a retention sketch follows this list).
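The retention scheme in this list (14 daily snapshots, 40 hourly DR restore points, 7 nightly incrementals, monthly archives rolling to quarterly) can be expressed as a simple policy check. The Python sketch below is an illustration of that arithmetic only, not the actual NetApp or Backup Exec configuration; the current snapshot counts are hypothetical.

    # Express the retention scheme above as simple limits (illustrative only).
    RETENTION = {
        "local_daily_snapshots": 14,    # 14 local daily "crash consistent" snapshots
        "offsite_hourly_snapshots": 40, # 40 hourly DR restore points
        "nightly_incrementals": 7,      # 7 nightly Backup Exec incrementals
    }

    def snapshots_to_prune(existing: int, kind: str) -> int:
        """How many of the oldest snapshots of this kind should be deleted."""
        return max(0, existing - RETENTION[kind])

    if __name__ == "__main__":
        # Hypothetical current counts on the filer.
        print("Prune local daily:", snapshots_to_prune(17, "local_daily_snapshots"))
        print("Prune off-site hourly:", snapshots_to_prune(40, "offsite_hourly_snapshots"))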
Reporting
• Online Ticketing
• Detailed Backup Utilization
• Bandwidth Utilization
• Infrastructure uptime reports
• CPU, RAM, Network, and Disk utilization reports
Server & Device Administration
• Customized specifications using VMware Infrastructure 3.01 technology: up to four 64-bit, 3.0 GHz Intel Xeon processors, 16 GB RAM, 1 Gbps network, and 2 TB maximum disk volumes.
• Based on HP ProLiant enterprise servers: ML570 (up to 8 processors per server), the DL380 series, and 7000c series blade servers.
• IP everywhere, full remote management of every device, including full
KVM via separate backLAN network.
• Microsoft MCCA licensing on key server components
• Full license and asset tracking
• Senior System Administrator troubleshooting
• Optional high availability (99.999% uptime) server capabilities via
Veritas and Microsoft Clustering
Managed Security
• Automated OS and major application patching
• Managed Network-based Intrusion Detection
• Managed policy based enterprise firewall using Cisco and Microsoft
technologies
• Managed VPN access
Appendix B – Excerpt from the ARCAMIS Systems Functional
Specification
Centralized Virtual Infrastructure Administration
ARCAMIS can move virtual machines between hosts, create new machines from pre-built templates, and control existing virtual machine configurations. We can also gather event log information for all VMware hosts from a central location; identify asset utilization and troubleshoot warnings before problems occur; manage physical system BIOS and firmware upgrades more easily; and centrally manage all virtual machines within the network.
The Virtual Center management interface allows us to centrally manage and monitor
our entire physical and virtual infrastructure from one place:
Hosts, Clusters, and Resource Pools:
By organizing physical hosts into clusters of two or more, we are able to distribute their aggregate resources as if they were one physical host. For example, a single server might be configured with four dual core 2.7 GHz processors (8 cores x 2.7 GHz, roughly 21.6 GHz) and 24 GB of RAM. By clustering two such servers together, the resources are presented as roughly 43 GHz and 48 GB of RAM, which can be provisioned as needed to multiple guests.
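The aggregation arithmetic above is easy to make explicit. The Python sketch below sums per-host CPU and memory into a cluster pool using the host configuration from the example; the host count is the only free parameter, and this is an illustration rather than VirtualCenter behavior.

    # Aggregate per-host resources into a cluster pool (mirrors the example above).
    def cluster_pool(hosts: int, sockets: int = 4, cores_per_socket: int = 2,
                     ghz_per_core: float = 2.7, ram_gb: int = 24) -> tuple[float, int]:
        """Return (total GHz, total GB RAM) presented by a cluster of identical hosts."""
        ghz_per_host = sockets * cores_per_socket * ghz_per_core
        return hosts * ghz_per_host, hosts * ram_gb

    if __name__ == "__main__":
        ghz, ram = cluster_pool(hosts=2)
        print(f"Two-host cluster pool: {ghz:.1f} GHz, {ram} GB RAM")  # ~43.2 GHz, 48 GB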
DRS and VMotion:
VMotion enables us to migrate live servers from one physical host to another which
allows for physical host maintenance to be performed with no impact to production
service uptime. Dynamic Resource Scheduling (DRS) is used to set different resource
allocation policies for different classes of services which are automatically monitored
and enforced using the aggregate resources of the cluster.
UCSF12 - SOP_Educational_Technology_Award_2012Michael Williams
 
Sunu111
Sunu111Sunu111
Sunu111GRAT3X
 
UCSF08 - ITN Concept Proposal_Award_2008
UCSF08 - ITN Concept Proposal_Award_2008UCSF08 - ITN Concept Proposal_Award_2008
UCSF08 - ITN Concept Proposal_Award_2008Michael Williams
 
UCSF08 - HD Video Conferencing Deployment_Award_2008
UCSF08 - HD Video Conferencing Deployment_Award_2008UCSF08 - HD Video Conferencing Deployment_Award_2008
UCSF08 - HD Video Conferencing Deployment_Award_2008Michael Williams
 
UCSF08 - EPGP_Pharmacogenomics_Award_2008_Submission
UCSF08 - EPGP_Pharmacogenomics_Award_2008_SubmissionUCSF08 - EPGP_Pharmacogenomics_Award_2008_Submission
UCSF08 - EPGP_Pharmacogenomics_Award_2008_SubmissionMichael Williams
 
Côté Agglo n°13
Côté Agglo n°13Côté Agglo n°13
Côté Agglo n°13Agglo
 
Otrm presentation ad linked in
Otrm presentation ad linked inOtrm presentation ad linked in
Otrm presentation ad linked ingreenjaguar
 
Mobotrax linkedin ppt d 2
Mobotrax linkedin ppt d 2Mobotrax linkedin ppt d 2
Mobotrax linkedin ppt d 2greenjaguar
 
Moboquip linkedin ppt c 2
Moboquip linkedin ppt c 2Moboquip linkedin ppt c 2
Moboquip linkedin ppt c 2greenjaguar
 

Viewers also liked (14)

UCSF12 - SOP_Educational_Technology_Award_2012
UCSF12 - SOP_Educational_Technology_Award_2012UCSF12 - SOP_Educational_Technology_Award_2012
UCSF12 - SOP_Educational_Technology_Award_2012
 
Sunu111
Sunu111Sunu111
Sunu111
 
UCSF08 - ITN Concept Proposal_Award_2008
UCSF08 - ITN Concept Proposal_Award_2008UCSF08 - ITN Concept Proposal_Award_2008
UCSF08 - ITN Concept Proposal_Award_2008
 
UCSF08 - HD Video Conferencing Deployment_Award_2008
UCSF08 - HD Video Conferencing Deployment_Award_2008UCSF08 - HD Video Conferencing Deployment_Award_2008
UCSF08 - HD Video Conferencing Deployment_Award_2008
 
Resume-Justin new
Resume-Justin newResume-Justin new
Resume-Justin new
 
UCSF08 - EPGP_Pharmacogenomics_Award_2008_Submission
UCSF08 - EPGP_Pharmacogenomics_Award_2008_SubmissionUCSF08 - EPGP_Pharmacogenomics_Award_2008_Submission
UCSF08 - EPGP_Pharmacogenomics_Award_2008_Submission
 
604 egypt
604 egypt604 egypt
604 egypt
 
Côté Agglo n°13
Côté Agglo n°13Côté Agglo n°13
Côté Agglo n°13
 
Article 1
Article 1Article 1
Article 1
 
Article 6
Article 6Article 6
Article 6
 
Exercice intégrales
Exercice intégralesExercice intégrales
Exercice intégrales
 
Otrm presentation ad linked in
Otrm presentation ad linked inOtrm presentation ad linked in
Otrm presentation ad linked in
 
Mobotrax linkedin ppt d 2
Mobotrax linkedin ppt d 2Mobotrax linkedin ppt d 2
Mobotrax linkedin ppt d 2
 
Moboquip linkedin ppt c 2
Moboquip linkedin ppt c 2Moboquip linkedin ppt c 2
Moboquip linkedin ppt c 2
 

Similar to UCSF07 - Research and HPC Infrastructure_Award_2007

White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...
White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...
White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...EMC
 
Utilization of virtual microscopy in a cooperative group setting
Utilization of virtual microscopy in a cooperative group settingUtilization of virtual microscopy in a cooperative group setting
Utilization of virtual microscopy in a cooperative group settingBIT002
 
Data driven systems medicine article
Data driven systems medicine articleData driven systems medicine article
Data driven systems medicine articlemntbs1
 
University of California
University of CaliforniaUniversity of California
University of CaliforniaVideoguy
 
ScienceDirectAvailable online at www.sciencedirect.com
ScienceDirectAvailable online at www.sciencedirect.comScienceDirectAvailable online at www.sciencedirect.com
ScienceDirectAvailable online at www.sciencedirect.comdaniatrappit
 
Dr Dennis Kehoe- Connected Health Cities: Using Learning Health Systems
Dr Dennis Kehoe- Connected Health Cities: Using Learning Health SystemsDr Dennis Kehoe- Connected Health Cities: Using Learning Health Systems
Dr Dennis Kehoe- Connected Health Cities: Using Learning Health SystemsInnovation Agency
 
Resume Shane Milam Sep 2015 Sans References
Resume Shane Milam Sep 2015 Sans ReferencesResume Shane Milam Sep 2015 Sans References
Resume Shane Milam Sep 2015 Sans ReferencesShane Milam
 
Pistoia Alliance US Conference 2015 - 1.3.2 New member introductions - DNAnexus
Pistoia Alliance US Conference 2015 - 1.3.2 New member introductions - DNAnexusPistoia Alliance US Conference 2015 - 1.3.2 New member introductions - DNAnexus
Pistoia Alliance US Conference 2015 - 1.3.2 New member introductions - DNAnexusPistoia Alliance
 
Data Virtualization Modernizes Biobanking
Data Virtualization Modernizes BiobankingData Virtualization Modernizes Biobanking
Data Virtualization Modernizes BiobankingDenodo
 
Table of Content - International Journal of Managing Information Technology (...
Table of Content - International Journal of Managing Information Technology (...Table of Content - International Journal of Managing Information Technology (...
Table of Content - International Journal of Managing Information Technology (...IJMIT JOURNAL
 
Cloud Computing and Innovations for Optimizing Life Sciences Research
Cloud Computing and Innovations for Optimizing Life Sciences ResearchCloud Computing and Innovations for Optimizing Life Sciences Research
Cloud Computing and Innovations for Optimizing Life Sciences ResearchInterpretOmics
 
Expert Panel on Data Challenges in Translational Research
Expert Panel on Data Challenges in Translational ResearchExpert Panel on Data Challenges in Translational Research
Expert Panel on Data Challenges in Translational ResearchEagle Genomics
 
Intel next-generation-medical-imaging-data-and-analytics
Intel next-generation-medical-imaging-data-and-analyticsIntel next-generation-medical-imaging-data-and-analytics
Intel next-generation-medical-imaging-data-and-analyticsCarestream
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionChris Dwan
 
UCSF Informatics Day 2014 - Sorena Nadaf, "Translational Informatics OnCore C...
UCSF Informatics Day 2014 - Sorena Nadaf, "Translational Informatics OnCore C...UCSF Informatics Day 2014 - Sorena Nadaf, "Translational Informatics OnCore C...
UCSF Informatics Day 2014 - Sorena Nadaf, "Translational Informatics OnCore C...CTSI at UCSF
 
A Modern Data Strategy for Precision Medicine
A Modern Data Strategy for Precision MedicineA Modern Data Strategy for Precision Medicine
A Modern Data Strategy for Precision MedicineCloudera, Inc.
 

Similar to UCSF07 - Research and HPC Infrastructure_Award_2007 (20)

White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...
White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...
White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...
 
Utilization of virtual microscopy in a cooperative group setting
Utilization of virtual microscopy in a cooperative group settingUtilization of virtual microscopy in a cooperative group setting
Utilization of virtual microscopy in a cooperative group setting
 
Data driven systems medicine article
Data driven systems medicine articleData driven systems medicine article
Data driven systems medicine article
 
University of California
University of CaliforniaUniversity of California
University of California
 
ScienceDirectAvailable online at www.sciencedirect.com
ScienceDirectAvailable online at www.sciencedirect.comScienceDirectAvailable online at www.sciencedirect.com
ScienceDirectAvailable online at www.sciencedirect.com
 
Dr Dennis Kehoe- Connected Health Cities: Using Learning Health Systems
Dr Dennis Kehoe- Connected Health Cities: Using Learning Health SystemsDr Dennis Kehoe- Connected Health Cities: Using Learning Health Systems
Dr Dennis Kehoe- Connected Health Cities: Using Learning Health Systems
 
Wincere Inc.
Wincere Inc.Wincere Inc.
Wincere Inc.
 
Resume Shane Milam Sep 2015 Sans References
Resume Shane Milam Sep 2015 Sans ReferencesResume Shane Milam Sep 2015 Sans References
Resume Shane Milam Sep 2015 Sans References
 
Informatics
Informatics Informatics
Informatics
 
Pistoia Alliance US Conference 2015 - 1.3.2 New member introductions - DNAnexus
Pistoia Alliance US Conference 2015 - 1.3.2 New member introductions - DNAnexusPistoia Alliance US Conference 2015 - 1.3.2 New member introductions - DNAnexus
Pistoia Alliance US Conference 2015 - 1.3.2 New member introductions - DNAnexus
 
Data Virtualization Modernizes Biobanking
Data Virtualization Modernizes BiobankingData Virtualization Modernizes Biobanking
Data Virtualization Modernizes Biobanking
 
Research Poster
Research PosterResearch Poster
Research Poster
 
Table of Content - International Journal of Managing Information Technology (...
Table of Content - International Journal of Managing Information Technology (...Table of Content - International Journal of Managing Information Technology (...
Table of Content - International Journal of Managing Information Technology (...
 
Evidence-based Healthcare IT
Evidence-based Healthcare ITEvidence-based Healthcare IT
Evidence-based Healthcare IT
 
Cloud Computing and Innovations for Optimizing Life Sciences Research
Cloud Computing and Innovations for Optimizing Life Sciences ResearchCloud Computing and Innovations for Optimizing Life Sciences Research
Cloud Computing and Innovations for Optimizing Life Sciences Research
 
Expert Panel on Data Challenges in Translational Research
Expert Panel on Data Challenges in Translational ResearchExpert Panel on Data Challenges in Translational Research
Expert Panel on Data Challenges in Translational Research
 
Intel next-generation-medical-imaging-data-and-analytics
Intel next-generation-medical-imaging-data-and-analyticsIntel next-generation-medical-imaging-data-and-analytics
Intel next-generation-medical-imaging-data-and-analytics
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on Production
 
UCSF Informatics Day 2014 - Sorena Nadaf, "Translational Informatics OnCore C...
UCSF Informatics Day 2014 - Sorena Nadaf, "Translational Informatics OnCore C...UCSF Informatics Day 2014 - Sorena Nadaf, "Translational Informatics OnCore C...
UCSF Informatics Day 2014 - Sorena Nadaf, "Translational Informatics OnCore C...
 
A Modern Data Strategy for Precision Medicine
A Modern Data Strategy for Precision MedicineA Modern Data Strategy for Precision Medicine
A Modern Data Strategy for Precision Medicine
 

UCSF07 - Research and HPC Infrastructure_Award_2007

  • 1. University of California Larry L. Sautter Award Submission Innovation in Information Technology at the University of California, San Francisco Submitted By: Mr. Michael Williams, MA University of California Executive Director, Information Technology UC San Francisco Diabetes Center, Immune Tolerance Network and Chief Information Officer UC San Francisco Neurology, Epilepsy Phenome/Genome Project Telephone: (415) 860-3581 Email: mwilliams@immunetolerance.org Date Submitted: Friday, May 18, 2007
  • 2. Page 1 Table of Contents 1. PROJECT TEAM...........................................................................................2 1.1. TEAM LEADERS ........................................................................................2 1.2. TEAM MEMBERS .......................................................................................2 2. PROJECT SUMMARY AND SIGNIFICANCE...............................................4 3. PROJECT DESCRIPTION ............................................................................6 3.1. BACKGROUND INFORMATION .....................................................................6 3.2. SITUATION PRIOR TO ARCAMIS ...............................................................7 3.3. AFTER ARCAMIS DEPLOYMENT ...............................................................8 3.4. BUSINESS IMPACT...................................................................................13 4. TECHNOLOGIES UTILIZED.......................................................................14 4.1. THE ARCAMIS SUITE ............................................................................14 4.2. ITIL TEAM BASED OPERATING MODEL.....................................................15 4.3. SECURITY MODEL AND ARCHITECTURE.....................................................17 4.4. DATA CENTER FACILITIES .......................................................................21 4.5. INTERNET CONNECTIVITY.........................................................................23 4.6. VIRTUAL CPU, RAM, NETWORK, AND DISK RESOURCES ..........................23 4.7. OPERATING SYSTEMS SUPPORTED ..........................................................25 4.8. BACKUP, ARCHIVAL, AND DISASTER RECOVERY .......................................25 4.9. MONITORING, ALERTING, AND REPORTING................................................25 4.10. IT SERVICE MANAGEMENT SYSTEMS ....................................................27 5. IMPLEMENTATION TIMEFRAME ..............................................................28 5.1. PROJECT TIMELINE .................................................................................28 6. CUSTOMER TESTIMONIALS.....................................................................29 APPENDICES.....................................................................................................30 APPENDIX A – CAPABILITIES SUMMARY OF THE ARCAMIS SUITE ........................30 APPENDIX B – EXCERPT FROM THE ARCAMIS SYSTEMS FUNCTIONAL SPECIFICATION..................................................................................................33
  • 3. Page 2 1. Project Team 1.1. Team Leaders Michael Williams, M.A. Executive Director, Information Technology UC San Francisco Diabetes Center, Immune Tolerance Network and Chief Information Officer UC San Francisco Neurology, Epilepsy Phenome/Genome Project Gary Kuyat Senior Systems Architect, Information Technology UC San Francisco Diabetes Center, Immune Tolerance Network and UC San Francisco Neurology, Epilepsy Phenome/Genome Project 1.2. Team Members Immune Tolerance Network Information Technology: Jeff Angst Project Manager Lijo Neelankavil Systems Engineer Diabetes Center Information Technology: Aaron Gannon Systems Engineer Project Sponsors: Michael Williams, M.A. Executive Director, Information Technology, Immune Tolerance Network
  • 4. Page 3 Jeff Bluestone, Ph.D. Director Diabetes Center and Immune Tolerance Network Dr. Daniel Lowenstein, M.D. Department of Neurology at UCSF, Director of the UCSF Epilepsy Center Dr. Mark A. Musen, M.D., Ph.D. Professor; Head, Stanford Medical Informatics Dr. Hugh Auchincloss, MD. Chief Operating Officer, Immune Tolerance Network (at time of project) Currently - Principal Deputy Director of NIAID at NIH
  • 5. Page 4 2. Project Summary and Significance By deploying the Advanced Research Computing and Analysis Managed Infrastructure Services (ARCAMIS) suite, the Immune Tolerance Network (ITN) and Epilepsy Phenome Genome Project (EPGP) at the University of California, San Francisco (UCSF) has implemented multiple Tier 1 networks and physically secured enterprise class datacenters, storage area network (SAN) data consolidation, and server virtualization to achieve to achieve a centralized, scalable network and system architecture that is responsive, reliable, and secure. This is combined with a nationally consistent, team centric operating model based on Information Technology Infrastructure Language (ITIL) best practices. Our deployed solution is compliant with applicable confidentiality regulations and assures 24 hour business continuance with no loss of data in the event of a major disaster. ARCAMIS has also provided significant savings on IT costs. Over the last 3 years we have efficiently met the constantly expanding demands for IT resources by using virtualization of disk, CPU, RAM, network, and ultimately servers. ARCAMIS has allowed us to provision and support hundreds of production, staging, testing, and development servers at a ratio of 25 guests to one physical host. By using IP remote management technologies that do not require physical presence, server consolidation and virtualization, combined with SAN based thin-provisioning of storage; we have effectively untied infrastructure upgrades from service delivery cycles. Furthermore, centralizing storage to a Storage Area Network (SAN) has given us the ability to provide real-time server backups (no backup window) and hourly disaster recovery snapshots to a Washington, DC, disaster recovery (DR) site for business continuance within hours of a disaster. ARCAMIS provides the University of California with a proven case study of how to implement enterprise class IT infrastructures and operating models for the benefit of NIH funded clinical research at UCSF. We have accelerated the
  • 6. Page 5 time from the bedside to the bench in clinical research by taking the IT infrastructure out of the clinical trials’ critical path, thereby providing a positive impact on our core business: preventing and curing human disease. ARCAMIS is more agile and responsive, having reduced server acquisition time to a matter of hours rather than weeks. ARCAMIS is significantly more secure and reliable, providing in the order of 99.998% technically architected uptime, and we’ve greatly improved the performance and utilization of our IT assets. We have created hundreds of thousands of dollars in measurable costs savings. ARCAMIS is environmentally friendly, significantly reducing our impact on environmental resources such as power and cooling. ARCAMIS is able to be used as a blue-print for enterprise class Clinical Research IT infrastructure services throughout the University of California, at partner research institutions and universities, and the National Institute of Health. The technologies used: Hewlett Packard Proliant Servers and 7000c Series Blades, VMWare Virtual Infrastructure Enterprise 3.01, Network Appliance FAS3020 Storage Area Network, Cisco, Brocade, Red Hat LINUX Enterprise, and Microsoft Windows Server 2003, among others.
  • 7. Page 6 3. Project Description 3.1. Background Information The mission of the Immune Tolerance Network (ITN) is to prevent and cure human disease. Based at the University of California, San Francisco (UCSF), the ITN is a collaborative research project that seeks out, develops and performs clinical trials and biological assays of immune tolerance. ITN supported researchers are developing new approaches to induce, maintain, and monitor tolerance with the goal of designing new immune therapies for kidney and islet transplantation, autoimmune diseases and allergy and asthma. Key to our success is the ability to collect, store and analyze the huge amount of data collected on ITN’s 30+ global clinical trials at 90+ medical centers, in a secure and effective manner, so a reliable, scalable and adaptable IT infrastructure is paramount in this endeavor. The ITN is in the 7th year of 14-year contracts from the NIH, National Institute of Allergy and Infectious Diseases (NIAID), the National Institute of Diabetes and Digestive and Kidney Disorders (NIDDK) and the Juvenile Diabetes Research Foundation. The Epilepsy Phenome/Genome Project (EPGP) studies the complex genetic factors that underlie some of the most common forms of epilepsy; bringing together 50 researchers and clinicians from 15 medical centers throughout the US. The overall strategy of EPGP is to collect detailed, high quality phenotypic information on 3,750 epilepsy patients and 3,000 controls, and to use state-of-the-art genomic and computational methods to identify the contribution of genetic variation to: the epilepsy phenotype, developmental anomalies of the brain, and the varied therapeutic response of patients treated with AEDs. This initial 5 year grant is being funded by the NIH, National Institute of Neurological Disorders and Stroke (NINDS). The ITN and EPGP turned to computing infrastructure centralization in Tier 1 networked enterprise class datacenters, virtualization, and data consolidation
  • 8. Page 7 onto a Storage Area Network (SAN) with off-site disaster recovery replication to address these challenges. Combined with an ITIL, team based, nationally consistent operating model leveraging specificity of labor; we are in a position to efficiently and scaleably respond to the increasing demands of the organization and rapidly adapt the IT infrastructure to dynamic management goals. This is accomplished while minimizing costs and maintaining requisite quality: we have a true high availability architecture, assuring zero data loss. 3.2. Situation Prior to ARCAMIS Like most of today’s geographically dispersed IT organizations, we were faced with the challenge of providing IT services in a timely, consistent, and cost effective manner with high customer satisfaction. Unlike other organizations, ITN and EPGP have many M.D. and Ph.D. clinical research knowledge workers with higher then normal, computationally intensive, IT requirements. Escalating site-specific IT infrastructure costs, unpredicted downtime, geographically inconsistent process and procedure, and lack of a team based operating model were among the challenges being faced to support such a multi-site infrastructure. There was a general sense that IT could do better. Risk of data loss was real. Dynamically growing demands were making it more difficult to consistently provide high IT service quality, site IT staff were largely reactionary and isolated. Prior to the ARCAMIS deployment, the IT infrastructure faced many challenges: 1. High costs of running and managing numerous physical servers at inconsistent, multiple-site, server rooms, such as power consumption with poor reliability, sub-standard cooling, and poorly laid out physical space. Intermittent and unexpected local facility downtime was common. Global website services were served out of office servers connected via single T1 lines. 2. Lead time for delivering new services was typically 6 weeks which directly impacted clinical trials’ costs. Procuring and deploying new infrastructure for new services or upgrades were major projects requiring significant downtime and direct physical presence of IT staff. 3. Existing computing capacity was underutilized, but still required technical support such as backups and patches; with individualized site
  • 9. Page 8 based process and procedures. Little automation caused significant effort for administration. There was a huge amount of IT administrative effort to manage site specific physical server support, asset tracking and equipment leases at multiple sites. 4. Lack of IT staff team operating model, consistently automated architecture, and remote management technologies resulted in process and procedure inconsistency at any one site and led to severe variance in service quality and reliability by geography. 5. IT maturity prevented discussion of higher level functions such as auditable policies and procedures, disaster recovery, redundant network architectures, and security audits; all required for NIH Clinical Trail safety compliance. 3.3. After ARCAMIS Deployment ARCAMIS represents a paradigm shift in our IT philosophy both operationally and technically. The goal was to move out of a geographically specific, reactionary mode to a prospective operating model and technical architecture designed from the ground up to be in alignment with the organizations growing, dynamic demands for IT services. Most importantly, we worked with management prospectively to understand service quality expectations and requirements to scale up to 30 clinical trails in 7 years. Given management objectives and our limited resources, we realized a need for a more team centric operating model, providing specificity of labor. As a result, we logically grouped our human resources into the Support team and Architecture team. This gave more senior technical talent the time they needed to re-engineer, build, and migrate to the ARCAMIS solution while more junior talent continued to focus on reactionary issues. From a technical perspective, we engineered an architecture that would eliminate or automate time consuming tasks and improve reliability. By centralizing all ARCAMIS Managed Infrastructure into bi-coastal, carrier diverse, redundant, Tier 1 networked, enterprise class datacenters and using fully “lights-out”, Hewlett Packard Proliant Servers with 4 hour on-site
  • 10. Page 9 physical support and a remote, IP based, server administration model, we have dramatically improved service reliability and supportability without adding administrative staff. The same senior staff now supports twice the number of physical servers and 20 times the virtual servers. For example, it is now common for engineers to administrate infrastructure at all seven sites simultaneously, including hard reboots and physical failures. With the integration of the VMWare Virtual Infrastructure 3.01 Enterprise infrastructure virtualization technologies, ARCAMIS reduces the number of physical servers at our data centers while continuing to meet the exponentially expanding business server requirements. Less hardware yields a reduction in initial server hardware costs and saves ongoing data center lease, power and cooling costs associated with ARCAMIS infrastructure. The initial capital expenditure was about the same as purchasing physical servers due to our investment in virtualization and SAN technologies. Consolidating all server data onto the Network Appliance Storage Area Network; the ARCAMIS project deployed a 99.998% uptime, 25 TB, production and disaster recovery cluster in San Francisco and a 25 TB 99.998% uptime production and disaster recovery cluster site in the Washington, DC metro area. The SAN allows us to reduce cost and complexity via automation, resulting in dramatic improvements in operations efficiency. We can more efficiently use what we already own, oversubscribe disk, and eliminate silos of underutilized storage. Current storage usage at the primary site is 65%, up from 25% average per server using Direct Attached Storage (DAS). We can seamlessly scale to 100 terabytes of storage by simply adding disk shelves, not possible with a server based approach. Another key benefit of using SAN technology is risk mitigation via completely automated backup, archival, and offsite replication. File restores are instantaneous, eliminating the need for human resource intensive and less reliable tape backup approaches. Combining the SAN with VMWare Infrastructure 3.01 Enterprise server virtualization technologies provides reliable, extensible, manageable, high availability architecture. Adjusting to changing server requirements is simple
  • 11. Page 10 because of the SAN’s storage expansion and reduction capability for live volumes and VMWare’s ability to scale from 1 to 4 64-bit CPUs with up to 16GB RAM and 16 network ports per virtual server. Also, oversubscription allows the ITN to more efficiently use the disk, RAM, and CPU we already own. We can seamlessly control server, firewall, network and data adds, removes, and changes without business service interruption. The SAN and VMWare ESX combination provides excellent performance and reliability using both Fiber Channel & iSCSI Multipathing for a redundant disk to server access architecture. For certain applications we can create Highly Available Clustered Systems truly architected to meet rigorous 99.998% uptime requirements. VMs boot from the SAN and are replicated locally and off-site while running. This improves business scalability and agility via accelerated service deployment and expanded utilization of existing hardware assets. Physical server maintenance requiring the server to be shut down or rebooted is done during regular working hours without downtime due to support for VMotion, the ability to move a running VM from one physical machine to another. This has greatly reduced off-hours engineer work. The increasing data security & compliance requirements are also able to be met with the centralized control provided by the SAN. In our experience, storage availability determines service availability; automation guarantees service quality of storage.
  • 12. Page 11 NetworkAppliance 13 12 11 10 09 08 07 06 05 04 03 02 01 00 Power System Shelf ID Loop B Fault Loop A 72F DS14 NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance 13 12 11 10 09 08 07 06 05 04 03 02 01 00 Power System Shelf ID Loop B Fault Loop A 72F DS14 NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F FAS 3050 activity status power NetworkAppliance 13 12 11 10 09 08 07 06 05 04 03 02 01 00 Power System Shelf ID Loop B Fault Loop A 72F DS14 NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance 13 12 11 10 09 08 07 06 05 04 03 02 01 00 Power System Shelf ID Loop B Fault Loop A 72F DS14 NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F FAS 3050 activity status power 16 port FC switch 16 port FC switch Active/Active High Availablity 25TB Fiber Channel Cluster Passive Synchronization Between Sites VMWare Server VMWare Server VMWare Server VMWare Server VMWare Server San Francisco, CA VMWare Server VMWare Server VMWare Server VMWare Server VMWare Server VMWare ESX Server 1 VMWare ESX Server 2 NetworkAppliance 13 12 11 10 09 08 07 06 05 04 03 02 01 00 Power System Shelf ID Loop B Fault Loop A 72F DS14 NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance 13 12 11 10 09 08 07 06 05 04 03 02 01 00 Power System Shelf ID Loop B Fault Loop A 72F DS14 NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F FAS 3050 activity status power NetworkAppliance 13 12 11 10 09 08 07 06 05 04 03 02 01 00 Power System Shelf ID Loop B Fault Loop A 72F DS14 NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance 13 12 11 10 09 08 07 06 05 04 03 02 01 00 Power System Shelf ID Loop B Fault Loop A 72F DS14 NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F FAS 3050 activity status power 16 port FC switch 16 
port FC switch Active/Active High Availablity 25TB Fiber Channel Cluster VMWare Server VMWare Server VMWare Server VMWare Server VMWare Server Herndon, VA VMWare Server VMWare Server VMWare Server VMWare Server VMWare Server VMWare ESX Server 1 VMWare ESX Server 2 The ARCAMIS project has proven and demonstrated the many benefits promised by these new enterprise class technologies. We have significantly increased the value of IT to our core business, slashed IT operating costs, and radically improved the quality of our IT service. The ARCAMIS architecture and operating model is a core competency which other UC organizations can leverage to achieve similar benefits. Just some of the benefits resulting from the ARCAMIS project include the following: 1. Saved hundreds of thousands of dollars and improved security, reliability, scalability, and deployment time.
  • 13. Page 12 2. Helped the environment by reducing power consumption by a factor of 20 for a comparable service infrastructure. 3. Improved conformance with federal and state regulations such as HIPAA and 21 CFR Part 11. 4. Centralized critical infrastructures into Tier 1, redundantly multi- homed network, enterprise class datacenters. Space utilization at our 8 sites has been consolidated to three data centers using 5 server racks. 5. Eliminated the inconsistent complexity of our IT infrastructure and processes/procedures, and ensured uptime for our business critical applications; even in the event of hardware failures. All new solution deployments are done based on nationally consistent operating models and technical architectures. 6. Consolidated data to SAN and VMWare servers. The infrastructure is architected to be a true 99.998% uptime solution. Our biggest downtime risk is human error. 7. Any staff member with security privileges can manage any device at any site from any Internet connected PC; including hardware failures and power cycles. We efficiently provision, monitor and manage the infrastructure with a single top console. 8. Standardized virtual server builds, procurement and deployment time reduced from as much as 6 weeks to 2 hours without investing in new server hardware. 9. Automated backup, archival, and disaster recovery. 10. Cloned production servers for testing and troubleshooting. Systems and networks can be cloned while running, with zero downtime and rebooted in virtual lab environments. Servers can be rotated back into production with only a few seconds downtime. 11. Average CPU utilization has risen from 5% to 30% while retaining peak capacity. 12. Disk utilization has risen from 25% to 65% 13. Savings are in the region of $200,000 in the past 12 months. This will grow exponentially as the architecture scales. 14. Multiple operating systems are supported, including: RedHat LINUX, MS Windows 2000, MS Windows 2003, with both 32 and 64 bit
  • 14. Page 13 versions of all operating systems supported. These can all be deployed on the same physical server, providing us with reduced dependence on vendor’s proprietary solutions. 15. Reduced support overhead and power consumption of legacy applications by migrating these into the virtual environment. 3.4. Business Impact ARCAMIS provides the University of California with a proven case study of how to implement enterprise class IT infrastructures and operating models for the benefit of NIH funded clinical research at UCSF. We have accelerated the time from the bedside to the bench in clinical research by taking the IT infrastructure out of the clinical trials’ critical path, thereby providing a positive impact on our core business: preventing and curing human disease. ARCAMIS is more agile and responsive, having reduced server acquisition time to a matter of hours rather than weeks. ARCAMIS is significantly more secure and reliable, providing in the order of 99.998% technically architected uptime, and we’ve greatly improved the performance and utilization of our IT assets. We have created hundreds of thousands of dollars in measurable costs savings. ARCAMIS is environmentally friendly, significantly reducing our impact on environmental resources such as power and cooling. ARCAMIS is able to be used as a blue-print for enterprise class Clinical Research IT infrastructure services throughout the University of California, at partner research institutions and universities, and the National Institute of Health.
  • 15. Page 14 4. Technologies Utilized 4.1. The ARCAMIS Suite This suite of Academic Research Computing and Analysis Managed Infrastructure Services (ARCAMIS) includes the following technology components: 1. ITIL based, nationally consistent, labor specific, team IT operating model 2. Security model and architecture (including firewalls, intrusion detection, VPN, automated updates) 3. Enterprise class data center facilities 4. Tier 1, multi-homed, redundant, carrier diverse, networks 5. Virtual CPU, RAM, Network, and Disk resources based on Hewlett Packard Proliant servers, VMware Infrastructure Enterprise 3.01 and Network Appliance Storage Area Network (SAN) 6. Various 32 and 64 bit LINUX and Windows Operating Systems 7. Backup, archival and disaster recovery 8. Monitoring, alerting, and reporting 9. IT service management systems
  • 16. Page 15 4.2. ITIL Team Based Operating Model The ARCAMIS Operating Model is based on an ITIL best practices, nationally consistent, team based, and uses specificity of labor. Via our formal, documented, Infrastructure Lifecycle Process (ILCP) and support policies, procedures (SOPs), and support documentation such as operating guides and systems functional specifications, the ARCAMIS infrastructure evolves though its lifecycles of continuous improvement. Below are samples of the IT Policies and Procedures used. IT Policies
  • 17. Page 16 Standard Operating Procedures Our goal moving forward is to be a completely ITIL shop in the next 12 months. As you can see from the below organizational chart the ARCAMIS team is logically grouped into a prospective engineering team, and an administration and support team.
  • 18. Page 17 Organizational Chart Customer Engineer Laurel Heights/CB Executive Director, Information Technology Manager Customer Engineering Customer Engineer BEA/ITI Customer Engineer Parnassus Level 1 and 2 Support Team IT Office and Operations Manager Customer Engineer Parnassus Systems and Network Architect Systems and Network Engineer Server and Network Engineering Team Level 3 and 4 Support Systems and Network Engineer Customer Engineer Pittsburgh 4.3. Security Model and Architecture ARCAMIS is required to meet at minimum the Security Category and Level of MODERATE for Confidentiality, Integrity, and Availability as defined by the National Institute of Health. Compliance with this Security Category spans the entire organization from the initial Concept Proposal phase, through clinical trial design and approval, into trial operations where patient information is gathered, including data collection and specimen storage. Significant amounts of confidential, proprietary and unique patient data are collected, transferred, and stored in the ARCAMIS infrastructure for analysis and dissemination by approved parties. Certain parts of the infrastructure are able to satisfy HIPAA and 21 CFR Part 11 compliance. This becomes especially important as the ITN and EPGP organizations continue to innovate and develop new intellectual property which may have significant market value.
  • 19. Page 18 Information Security Category Requirements Exceeding the minimum compliance requirements with this Information Security Category is achieved by a holistic approach addressing all aspects of the ARCAMIS personnel, operations, physical locations, networks and systems. This includes tested, consistently executed, and audited plans, policies and procedures, and automated, monitored, and logged security technologies used on a day to day basis. The overall security posture of the ARCAMIS has many aspects including legal agreements with partners and employees, personnel background checks and training, organization wide disaster recovery plans, backup, systems and network security architectures (firewalls, intrusion detection systems, multiple levels of encryption, etc.), and detailed documentation requirements. Consistent with the NIH Application/System Security Plan (SSP) Template for Applications and General Support Systems and the US Department of Health and Human Services Official Information Security Program Policy (HHS IRM Policy 2004-002.001), ARCAMIS maintains a formal information systems security program to protect the organization’s information resources. This is called the Information Security and Information Technology Program (ISITP). ISITP delineates security controls into the four primary categories of management, operational, technical and standard operating procedures which structure the organization of the ISITP. - Management Policies focus on the management of information security systems and the management of risk for a system. They are techniques and concerns that are addressed by management, examples include: Capital Planning and Investment, and Risk Management. - Operational Policies address security methods focusing on mechanisms primarily implemented and executed by people (as opposed to systems). These controls are put in place to improve the security of a particular system (or group of systems), examples include: Acceptable Use, Personnel Separation, and Visitor Policies.
  • 20. Page 19 - Technical Policies focus on security policies that the computer system executes. The controls can provide automated protection for unauthorized access or misuse, facilitate detection of security violations, and support security requirements for applications and data, examples include: password requirements, automatic account lockout, and firewall policies. - Standard Operating Procedures (SOPs) focus on logistical procedures that staff do routinely to ensure ongoing compliance, examples include: IT Asset Assessment, Server and Network Support, and Systems Administration. Specifically, the ARCAMIS ISITP includes detailed definitions of the following Operational and Technical Security Policies. PERSONNEL SECURITY Background Investigations Rules of Behavior Disciplinary Action Acceptable Use Separation of Duties Least Privilege Security Education and Awareness Personnel Separation RESOURCE MANAGEMENT Provision of Resources Human Resources Infrastructure PHYSICAL SECURITY Physical Access Physical Security Visitor Policy MEDIA CONTROL Media Protection Media Marking Sanitization and Disposal of Information Input/Output Controls COMMUNICATIONS SECURITY Voice Communications Data Communications Video Teleconferencing Audio Teleconferencing Webcast Voice-Over Internet Protocol Facsimile WIRELESS COMMUNICATIONS SECURITY Wireless Local Area Network (LAN) Multifunctional Wireless Devices EQUIPMENT SECURITY Workstations Laptops and Other Portable Computing Devices Personally Owned Equipment and Software Hardware Security ENVIRONMENTAL SECURITY Fire Prevention Supporting Utilities DATA INTEGRITY Documentation NETWORK SECURITY POLICIES Remote Access and Dial-In Network Security Monitoring Firewall System-to-System Interconnection Internet Security SYSTEMS SECURITY POLICIES Identification Password Access Control Automatic Account Lockout Automatic Session Timeout Warning Banner Audit Trails Peer-to-Peer Communications Patch Management Cryptography Malicious Code Protection Product Assurance E-Mail Security Personal E-Mail Accounts
  • 21. Page 20 These policies serve as the foundation of the ARCAMIS Standard Operating Procedures and technical infrastructure architectures which when combined, create a secure environment based security best practices. Security Infrastructure Architecture To ensure a hardened Information Security and Information Technology environment, the ARCAMIS has centralized its critical Information Technology infrastructures into two Tier 1 data centers. Facilities include: Uninterruptible Power Supply via backup diesel generators that can keep servers running indefinitely without direct electric grid power. They are equipped with optimal environment controls, including sophisticated air conditioning and humidifier equipment as well as stringent physical security systems. They provide 24x7 Network Operations Center network monitoring and physical security. Each data center also includes fire suppression systems with water-free fire protection so as not to damage the servers. For secure data transport, ARCAMIS provides a carrier diverse, redundant, secure, reliable, Internet connected, high speed Local Area Network (LAN) and Wide Area Network (WAN). The ARCAMIS network and Virtual Private Network (VPN) is the foundation for all the ARCAMIS IT services and used by every ITN and EPGP stakeholder every day. The high speed WAN is protected by intrusion detection monitored and logged firewalls at all locations. Firewall and VPN services are provided by industry leading Microsoft and Cisco products. All network traffic between ITN sites, desktops, and partner organizations that travels over public networks is encrypted using at least 128-bit encryption using various security protocols including IPSec, SFTP, RDC, Kerberos, and others. We also implemented a wildcard based virtual certificate architecture for all port 443 communications, allowing rapid deployment of new secured services. Keeping these systems monitored and patched, ARCAMIS provides IP ping, SNMP MIB monitoring, specific service monitoring and automated restarts, hardware monitoring, intrusion detection monitoring, and website monitoring
  • 22. Page 21 of the ARCAMIS production server environment. Server and end-user security patches are applied monthly via Software Update Services. Application and LINUX/Macintosh patches are pushed out on a monthly basis. We have standardized on McAfee Anti-Virus for virus protection and use Postini for e-mail SPAM and Virus filtering. The ITN’s Authoritative Directory uses Microsoft Active Directory and is exposed via SOAP, RADIUS, and LDAP for cross platform authentication. The ITN is currently using an Enterprise Certificate Authority (ITNCA) for certificate based security authentication. Comprehensive Information Security The ITN has established mandatory policies, processes, controls, and procedures to ensure confidentiality, integrity, availability, reliability, and non-repudiation within the Organization’s infrastructure and its operations. It is the policy of ARCAMIS that the organization abides by or exceeds the requirements outlined in ITN Information Security and Information Technology Program, thereby exceeding the required Security Category and Level of MODERATE for Confidentiality, Integrity, and Availability outlined above. In addition, to ensure adequate security, ARCAMIS implements additional security policies exceeding the minimum requirement, as appropriate for our specific operational and risk environment as necessary. 4.4. Data Center Facilities The ITN has centralized its server architecture into two Tier 1 data centers. The first is located in Herndon, VA with Cogent Communications, and the second in San Francisco, CA with Level 3 Communications. An additional research data center is located at the UCSF QB3 facility. Physical access requires a badge and biometric hand security scanning, and the facilities have 24x7 security staff on-site. Each data center includes redundant uninterruptible power supplies and backup diesel generators that can keep each server running indefinitely without direct electric grid power. The centers provide active server and application monitoring, helping hands and backup media rotation capabilities. They are equipped with optimal environment
  • 23. Page 22 controls, including sophisticated air conditioning and humidifier equipment as well as stringent physical security systems. There are also waterless fire suppression systems. Power to our racks specifically is provided by four redundant, monitored PDUs which report exact power usage at a point in time and alert us if there is a power surge. Herndon, VA Rack Diagram G3 HP ProLiant ML570 UID 21 Channel 2Channel 2Channel 1 100 1 2 3 4 5 6 7 G3 HP ProLiant ML570 UID 21 Channel 2Channel 2Channel 1 100 1 2 3 4 5 6 7 G3 HP ProLiant ML570 UID 21 Channel 2Channel 2Channel 1 100 1 2 3 4 5 6 7 UID 1 2 SimplexDuplexchch21 0011 3322 4455Tape UID 1 2 SimplexDuplexchch21 0011 3322 4455Tape UID 1 2 SimplexDuplexchch21 0011 3322 4455Tape UID 1 2 SimplexDuplexchch21 0011 3322 4455Tape UID 1 2 SimplexDuplexchch21 0011 3322 4455Tape NetApp FAS 3020 activity status power NetApp FAS 3020 activity status power UID HP ProLiant DL320 G3 1 2 NetworkAppliance Power System Shelf ID Loop B Fault Loop A 72F DS14 MK2 FC NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance Power System Shelf ID Loop B Fault Loop A 72F DS14 MK2 FC NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance Power System Shelf ID Loop B Fault Loop A 72F DS14 MK2 FC NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance Power System Shelf ID Loop B Fault Loop A 72F DS14 MK2 FC NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F UID HP ProLiant DL320 G3 1 2 UID HP ProLiant DL320 G3 1 2 NetworkAppliance Power System Shelf ID Loop B Fault Loop A 72F DS14 MK2 FC NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance Power System Shelf ID Loop B Fault Loop A 72F DS14 MK2 FC NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance Power System Shelf ID Loop B Fault Loop A 72F DS14 MK2 FC NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F NetworkAppliance Power System Shelf ID Loop B Fault Loop A 72F DS14 MK2 FC NetworkAppliance NetworkAppliance 
NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance NetworkAppliance 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F 72F G3 HP ProLiant ML570 UID 21 Channel 2Channel 2Channel 1 100 1 2 3 4 5 6 7 G3 HP ProLiant ML570 UID 21 Channel 2Channel 2Channel 1 100 1 2 3 4 5 6 7
  • 24. Page 23 4.5. Internet Connectivity Servicing the ARCAMIS customer base is a carrier diverse, redundant, firewalled, reliable, Internet connected high speed network. This network combined with the Virtual Private Network (VPN) creates the foundation for all the ARCAMIS services provided. Internet connectivity is location dependant: • San Francisco, China Basin and Level 3. – Tier 1 1000mbs Ethernet connection to the Internet is provided by Cogent Networks. UCSF provides a 100mbs Ethernet connection to redundant 45mbs OC198 connections, and 100mbs Ethernet to between UCSF campuses. • San Francisco, Quantitative Biology III Data Center – UCSF network provides 1000mbs Ethernet connection to redundant 45mbs OC198 connections and 100mbs Ethernet to between UCSF campuses. • Herndon, VA – Tier 1 100mbs Ethernet connection to the Internet is provided by Cogent Networks. AT&T provides a 1.5mbs DSL backup connection. 4.6. Virtual CPU, RAM, Network, and Disk Resources ARCAMIS uses the Network Appliances Storage Area Network with a 25 TB HA Cluster in Herndon and a 25 TB disaster recovery site in San Francisco. This allows us to reduce cost and complexity via automation and operations efficiency. We can seamlessly control adds, removes, and updates without business interruption for our critical storage needs. We can more efficiently use what we already own and eliminate silos of underutilized memory, CPU, network, and storage. This improves business scalability and agility via accelerated service deployment and expansion on existing hardware assets. We can scale to tens of terabyte storage, not possible with a server based approach. Another key result of using this technology is risk mitigation. We architecturally automate the elimination of the possibility of critical data loss. We fully automate backup, archival, restore - productivity loss goes from days to minutes, to nothing, in event of user error or HW failure. We have technologically automated smooth business continuance in the event of a disaster. The increasing ARCAMIS data security and compliance requirements are able to be met with a SAN. We can handle HIPAA security
  • 25. Page 24 and compliance requirements. In our experience, storage availability determines service availability, and automation guarantees service quality.
VMware Virtual Infrastructure Enterprise 3.01 (VI3) is virtual infrastructure software for partitioning, consolidating, and managing servers in mission-critical environments. Ideally suited to enterprise data centers, VI3 minimizes the total cost of ownership of computing infrastructure by increasing resource utilization, and its hardware-independent virtual machines, encapsulated in easy-to-manage files, maximize administrative flexibility. VMware ESX Server allows enterprises to boost x86 server utilization to 60-80%, provision new systems faster with less hardware, decouple application workloads from the underlying physical hardware for increased flexibility, and dramatically lower the cost of business continuity. ESX Server supports 64-bit VMs with 16 GB of RAM, meeting ARCAMIS's expanding server computing requirements.
Combining the SAN with server virtualization provides an extremely reliable, extensible, manageable, high-availability architecture for ARCAMIS. The SAN provides near-instantaneous VM backups, restores, and provisioning, as well as off-site disaster recovery and archival. File restores are immediate, eliminating the need for labor-intensive and less reliable client-side disk management applications. Adjusting to changing server requirements is equally fast because the SAN can expand and shrink live volumes, and oversubscription allows the ITN to use the disk we already own far more efficiently. The SAN and VMware ESX combination provides excellent performance and reliability using both Fibre Channel and iSCSI multipathing. VMs boot from the SAN and are replicated locally and off-site while running. For certain applications we can build highly available clustered systems with greater than 99.998% uptime. Finally, server maintenance can be done during regular working hours without downtime thanks to VMotion, which moves a running VM from one physical machine to another.
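To put the uptime figures above in concrete terms, the arithmetic below converts an availability percentage into allowable downtime per year. This is a generic illustration of what the quoted percentages mean, not an excerpt from the ARCAMIS SLA documents.

```python
# Convert an availability SLA (e.g., 99.998%) into allowable downtime per year.
# Illustrative only; the figures follow from the percentages quoted in this
# document, not from any separate ARCAMIS SLA calculation.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for sla in (99.998, 99.999):
    print(f"{sla}% uptime -> {allowed_downtime_minutes(sla):.1f} minutes/year")

# 99.998% uptime -> 10.5 minutes/year
# 99.999% uptime -> 5.3 minutes/year
```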
  • 26. Page 25 4.7. Operating Systems Supported
ARCAMIS supports several operating systems, including all major flavors of Linux, i386 Solaris, and all versions of the Windows operating system.
4.8. Backup, Archival, and Disaster Recovery
ARCAMIS data availability, backup, and archival are provided by a Storage Area Network (SAN) with a 25 TB high-availability cluster in Herndon and a 25 TB disaster recovery site in San Francisco. This SAN houses ARCAMIS's critical clinical data and IT server data. The SAN automates backup, archival, and restore via the NetApp SnapMirror, SnapBackup, and SnapRestore applications. All critical data at the San Francisco and Herndon sites are replicated to the other site within one hour. In the event of a major disaster at either ARCAMIS data center site, at most 60 minutes of data can be lost, and the critical server infrastructure can be failed over to the other coast's facility for business continuance. In addition to the SAN, ARCAMIS uses a 7-day incremental backup rotation to offline disk, with monthly off-site archives of all production data, based on Symantec Veritas software.
4.9. Monitoring, Alerting, and Reporting
We use a variety of monitoring and reporting technologies, and two IT staff perform full infrastructure monitoring audits twice daily, five days per week, at 8:00am EST and 3:00pm PST. A 1-800 Priority 1 issue resolution line pages and calls five senior engineers simultaneously in the event of a major system failure or issue, and an on-call rotation schedule changes weekly. We use the following technologies: Microsoft Operations Manager (MOM), WebWatchBot, Brocade Fabric Manager, NetApp Operations Manager, VMware Operations Manager, Cacti, and Oracle, among others. A minimal sketch of the kind of availability checks these tools automate appears after the sample graph below.
Cacti Disk Utilization Graph
Below is a sample disk utilization graph.
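As a minimal illustration of the sort of checks listed in the monitoring table on the next page (ping, HTTP/HTTPS URL, and disk-usage monitoring), the sketch below polls a few services and flags failures. The hostnames, URL, and threshold are hypothetical placeholders; this is not the actual MOM or WebWatchBot configuration used by ARCAMIS.

```python
# Illustrative availability checks: ping, HTTPS URL, and disk usage.
# Hostnames, URL, and thresholds are hypothetical; ARCAMIS uses MOM,
# WebWatchBot, Cacti, etc. rather than this script.
import shutil
import subprocess
import urllib.request

HOSTS = ["db01.example.org", "web01.example.org"]   # hypothetical hosts
URL = "https://portal.example.org/login"            # hypothetical URL
DISK_ALERT_PCT = 90                                  # alert above 90% full

def ping_ok(host: str) -> bool:
    """Return True if the host answers a single ICMP echo request."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            capture_output=True)
    return result.returncode == 0

def url_ok(url: str) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status < 400
    except Exception:
        return False

def disk_pct_used(path: str = "/") -> float:
    """Percentage of the given filesystem currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

if __name__ == "__main__":
    for host in HOSTS:
        print(f"ping {host}: {'OK' if ping_ok(host) else 'FAIL'}")
    print(f"url {URL}: {'OK' if url_ok(URL) else 'FAIL'}")
    pct = disk_pct_used("/")
    status = "ALERT" if pct >= DISK_ALERT_PCT else "OK"
    print(f"disk /: {pct:.1f}% used ({status})")
```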
  • 27. Page 26 Monitoring Table
Below is a partial list of the server and network monitoring we perform:
• Customer Defined Transaction Monitoring
• ODBC Database Query Verification
• Ping Monitoring
• SMTP Server and Account Monitoring
• POP3 Server and Account Monitoring
• FTP Upload/Download Verification
• File Existence and Content Monitoring
• Disk/Share Usage Monitoring
• Microsoft Performance Counters
• Microsoft Process Monitoring
• Microsoft Services Performance Monitoring
• Microsoft Services Availability Monitoring
• Event Log Monitoring
• HTTP/HTTPS URL Monitoring
• Customer Specified Port Monitoring
• Active Directory
• Exchange Intelligent Message Filter
• HP ProLiant Servers
• Microsoft .NET Framework
• Microsoft Baseline Security Analyzer
• Microsoft Exchange Server Best Practices Analyzer
• Microsoft Exchange Server
• Microsoft ISA Server
• Microsoft Network Load Balancing
• Microsoft Office Live Communications Server 2003
• Microsoft Office Live Communications Server 2005
• Microsoft Office Project Server
• Microsoft Office SharePoint Portal Server 2003
• Microsoft Operations Manager MPNotifier
• Microsoft Operations Manager
  • 28. Page 27
• Microsoft Password Change Notification Service
• Microsoft SQL Server
• Microsoft Web Sites and Services MP
• Microsoft Windows Base OS
• Microsoft Windows DFS Replication
• Microsoft Windows Distributed File Systems
• Microsoft Windows DHCP
• Microsoft Windows Group Policy
• Microsoft Windows Internet Information Services
• Microsoft Windows RRAS
• Microsoft Windows System Resource Manager
• Microsoft Windows Terminal Services
• Microsoft Windows Ultrasound
• NetApp Volume Utilization
• Global Status Indicator
• Hardware Event Log
• Visual Inspection
• Ambient Temperature
• Temperature Trending
• Location WAN Connectivity
4.10. IT Service Management Systems
We use Remedy and Track-IT Enterprise for ticketing, asset tracking, and purchasing.
  • 29. Page 28 5. Implementation Timeframe
5.1. Project Timeline
  • 30. Page 29 6. Customer Testimonials
"ARCAMIS provides services that allow the ITN knowledge workers to focus on answering the difficult scientific questions in immune tolerance; we don't waste time on basic IT infrastructure functions. ARCAMIS allows me to be confident our research patient data is stored in a secure, reliable and responsive IT infrastructure. For example, last week we did a demonstration to the Network Executive Committee of our Informatics data management and collaboration portal in real-time. This included the National Institutes of Health senior management responsible for our funding… it all worked perfectly. This entire application was built on ARCAMIS."
Jeffrey A. Bluestone, Ph.D.
Director, UCSF Diabetes Center
Director, Immune Tolerance Network
A.W. and Mary Clausen Distinguished Professor of Medicine, Pathology, Microbiology and Immunology

"With ARCAMIS we are well positioned to meet the rigorous IT requirements of an NIH funded study. Within weeks of project funding from the NIH, our entire secure research computing network and server infrastructure of more than 10 servers was built, our developers finished the public website, and we began work on the Patient Recruitment portal. That would have taken at least 6 months if I had to hire a team to procure and build it ourselves. Accelerating scientific progress in neurology is core to everything we do; ARCAMIS has been an important part of what we are currently doing."
Daniel H. Lowenstein, M.D.
Professor of Neurology, UCSF, and Director, Physician-Scientist Education and Training Programs
Director, Epilepsy Phenome/Genome Project

"With the investment in ARCAMIS, UCSF and the ITN can confidently partner with other leading medical research universities across the country. At the ITN we depend on the on-demand, services based, scalable computing capacity of ARCAMIS every day to enable our collaborative data analysis and Informatics data visualization applications."
Mark Musen, Ph.D.
Director, Medical Informatics Department, Stanford University
Deputy Director, Immune Tolerance Network
  • 31. Page 30 Appendices
Appendix A – Capabilities Summary of the ARCAMIS Suite
Fundamentals
• 99.998% production solution uptime guaranteed via Service Level Agreement.
• Managed multi-homed, Tier 1 network (Zero Downtime SLA)
• High-speed 1000 Mbps connectivity to UCSF network space.
• Bi-coastal, world-class data centers hosted with Level 3 and Cogent Communications, with redundant power and HVAC systems
• Managed DNS, or use of UCSF DNS
• Managed Active Directory for "Production Servers" and integration with the UCSF campus AD via trust.
• Phone, e-mail, and web-based ticketing system to track all issues
• Mature purchasing services, with purchases charged to the correct account
Monitoring & Issue Response
• 8am EST to 5pm PST business-day access to live support personnel
• 24/7/365 coverage with one primary on-call engineer, off-hours paging access, and a 1-800 P1 issue number that rings 5 infrastructure engineers simultaneously.
• Microsoft Operations Manager monitoring (CPU, RAM, disk, event log, ping, ports, and services)
• Application script response monitoring for web applications, including SSL, via WebWatchBot 5
• HP Remote Insight Manager hardware monitoring with 4-hour vendor response on all servers
• NetApp corporate monitoring of the Storage Area Network, with 4-hour time to resolution and a fully stocked parts depot.
• 24x7 staffed data centers with secure physical access to all servers
• 24x7 staffed Network Operations Center for the WAN
  • 32. Page 31
• Notification preferences and standard response specifications can be customized
Backup, Restore and Disaster Recovery/Business Continuance
• Symantec Backup Exec server agents for Oracle, SQL, MySQL, and Exchange servers, with 7 nightly incremental backups.
• 14 local daily snapshots of full "crash consistent" server state
• Hourly off-site snapshots of full "crash consistent" server state, with 40 hourly restore points for DR
• Monthly archive of the entire infrastructure, which rolls to quarterly after 3 months (a sketch of the restore points this retention schedule yields appears at the end of this appendix).
Reporting
• Online ticketing
• Detailed backup utilization
• Bandwidth utilization
• Infrastructure uptime reports
• CPU, RAM, network, and disk utilization reports
Server & Device Administration
• Customized specifications using VMware Infrastructure 3.01 technology: up to four 64-bit 3.0 GHz Intel Xeon processors, 16 GB RAM, and 1 Gbps networking, with disk volumes up to 2 TB.
• Based on HP ProLiant enterprise servers: ML570 (8 processors per server), DL380 series, and 7000c blade servers
• IP everywhere: full remote management of every device, including full KVM, via a separate backLAN network.
• Microsoft MCCA licensing on key server components
• Full license and asset tracking
• Senior System Administrator troubleshooting
• Optional high-availability (99.999% uptime) server capabilities via Veritas and Microsoft Clustering
Managed Security
  • 33. Page 32
• Automated OS and major application patching
• Managed network-based intrusion detection
• Managed policy-based enterprise firewall using Cisco and Microsoft technologies
• Managed VPN access
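As referenced in the backup bullets above, the short sketch below enumerates the restore points implied by the stated retention schedule (7 nightly incrementals, 14 daily snapshots, 40 hourly off-site snapshots, and monthly archives rolling to quarterly). It is a back-of-the-envelope illustration of the policy as described here, not the actual NetApp or Backup Exec configuration.

```python
# Back-of-the-envelope count of restore points implied by the retention
# schedule described in Appendix A. Purely illustrative; the real policy
# is implemented in NetApp Snapshot/SnapMirror and Symantec Backup Exec.
from datetime import datetime, timedelta

NIGHTLY_INCREMENTALS = 7       # Backup Exec nightly incremental rotation
LOCAL_DAILY_SNAPSHOTS = 14     # local "crash consistent" daily snapshots
OFFSITE_HOURLY_SNAPSHOTS = 40  # off-site hourly restore points for DR

def restore_points(now: datetime) -> dict:
    """Return the oldest recovery point available in each retention tier."""
    return {
        "hourly (off-site)": now - timedelta(hours=OFFSITE_HOURLY_SNAPSHOTS),
        "daily (local)": now - timedelta(days=LOCAL_DAILY_SNAPSHOTS),
        "nightly incremental": now - timedelta(days=NIGHTLY_INCREMENTALS),
    }

if __name__ == "__main__":
    now = datetime(2007, 5, 18, 12, 0)  # submission date, used as an example
    for tier, oldest in restore_points(now).items():
        print(f"{tier}: restore points back to {oldest:%Y-%m-%d %H:%M}")
    # Monthly archives (rolling to quarterly after 3 months) extend
    # recovery further back, at coarser granularity.
```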
  • 34. Page 33 Appendix B – Excerpt from the ARCAMIS Systems Functional Specification
Centralized Virtual Infrastructure Administration
ARCAMIS can move virtual machines between hosts, create new machines from pre-built templates, and control existing virtual machine configurations. We can also gather event log information for all VMware hosts from a central location; identify asset utilization and troubleshoot warnings before they become problems; more easily manage physical system BIOS updates and firmware upgrades; and centrally manage all virtual machines within the network. The VirtualCenter management interface allows us to centrally manage and monitor our entire physical and virtual infrastructure from one place.
Hosts, Clusters, and Resource Pools: By organizing physical hosts into clusters of two or more, we are able to distribute their aggregate resources as if they were one physical host. For example, a single server might be configured with 4 dual-core 2.7 GHz processors and 24 GB of RAM. By clustering two such servers together, the resources are presented as approximately 43 GHz of CPU and 48 GB of RAM, which can be provisioned as needed to multiple guests.
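The arithmetic behind the cluster example above can be made explicit with a small sketch. The host specification (4 dual-core 2.7 GHz processors, 24 GB RAM) is taken from the text; the calculation is a simple aggregation, not VMware's exact admission-control accounting.

```python
# Aggregate CPU and RAM presented by a two-host cluster, using the host
# specification given in Appendix B. This ignores virtualization overhead
# and reservations, so it is an upper bound rather than VMware's exact
# admission-control figure.

def cluster_capacity(hosts: int, sockets: int, cores_per_socket: int,
                     ghz_per_core: float, ram_gb: float) -> tuple:
    """Return (total_ghz, total_ram_gb) across all hosts in the cluster."""
    total_ghz = hosts * sockets * cores_per_socket * ghz_per_core
    total_ram = hosts * ram_gb
    return total_ghz, total_ram

ghz, ram = cluster_capacity(hosts=2, sockets=4, cores_per_socket=2,
                            ghz_per_core=2.7, ram_gb=24)
print(f"Cluster presents ~{ghz:.1f} GHz and {ram:.0f} GB RAM")
# Cluster presents ~43.2 GHz and 48 GB RAM
```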
  • 35. Page 34 DRS and VMotion: VMotion enables us to migrate live servers from one physical host to another, which allows physical host maintenance to be performed with no impact on production service uptime. Distributed Resource Scheduler (DRS) is used to set different resource allocation policies for different classes of service, which are automatically monitored and enforced against the aggregate resources of the cluster.
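To make the DRS idea concrete, the sketch below shows the kind of load-balancing decision DRS automates: when one host's CPU demand exceeds a policy threshold, a VM is chosen to migrate (via VMotion) to the least-loaded host. This is a conceptual illustration with made-up host and VM names and a simplified policy, not VMware's actual DRS algorithm.

```python
# Conceptual sketch of a DRS-style rebalancing decision. Host and VM names
# and the 75% threshold are hypothetical; VMware DRS uses its own, more
# sophisticated placement algorithm.
from dataclasses import dataclass

CPU_THRESHOLD_PCT = 75.0  # rebalance when a host exceeds this utilization

@dataclass
class VM:
    name: str
    cpu_demand_ghz: float

@dataclass
class Host:
    name: str
    capacity_ghz: float
    vms: list

    @property
    def utilization_pct(self) -> float:
        return 100.0 * sum(vm.cpu_demand_ghz for vm in self.vms) / self.capacity_ghz

def rebalance(hosts: list) -> None:
    """Move one VM off any host above threshold onto the least-loaded host."""
    for host in hosts:
        if host.utilization_pct <= CPU_THRESHOLD_PCT or not host.vms:
            continue
        target = min(hosts, key=lambda h: h.utilization_pct)
        if target is host:
            continue
        vm = min(host.vms, key=lambda v: v.cpu_demand_ghz)  # cheapest to move
        host.vms.remove(vm)
        target.vms.append(vm)
        print(f"VMotion {vm.name}: {host.name} -> {target.name}")

if __name__ == "__main__":
    hosts = [
        Host("esx01", 21.6, [VM("portal", 9.0), VM("db", 8.0), VM("mail", 2.0)]),
        Host("esx02", 21.6, [VM("web", 3.0)]),
    ]
    rebalance(hosts)
    for h in hosts:
        print(f"{h.name}: {h.utilization_pct:.0f}% utilized")
```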