Designing High Availability
Networks, Systems, and Software
for the University Environment
Deke Kassabian and Shumon Huque
The University of Pennsylvania
January 14, 2004
About Penn
 The University of Pennsylvania was founded
by Ben Franklin in 1751
 Penn is part of the Ivy League
 Located in western Philadelphia
 Community of more than 30,000 people
General Goals
 Networked services available as expected
by our users
 Minimized time to repair (TTR) when
outages do occur
 Ability to perform maintenance and
upgrades (planned downtime)
non-disruptively
 Cost effectiveness in meeting these goals
Definitions
 Availability
 High Availability (HA)
 Rapid Recovery (RR)
 Disaster Recovery (DR)
 Basic Systems
Definitions
 Disaster Recovery (DR) - the process
of restoring a service to full operation
after an interruption in service
Definitions
 Basic System - a {Network, System,
Service} with only the most basic of
protections against outages
 Examples:
 A network recoverable using spare parts
 A single computer system with RAID disk
 A service recoverable from tape backups
Definitions
 Availability - the percentage of total
time that a {Network, System, Service}
is available for use
 Related points:
 Advertised periods of availability
 Availability as advertised
 Absolute availability
Definitions
 High Availability (HA) - a {Network,
System, Service} with specific design
elements intended to keep availability
above a high threshold (eg, 99.99%)
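To make a threshold such as 99.99% concrete, the following minimal sketch converts an availability percentage into the downtime it allows over a period. The function name and the 365-day year are assumptions made for illustration; they are not part of the original slides.

```python
# Assumption: a 365-day year, with no allowance for leap years or
# for excluded (advertised) maintenance windows.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) allowed in a period at the given availability."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% allows {downtime_budget_minutes(pct):.2f} minutes of downtime per year")
```

Under these assumptions, a 99.99% target leaves roughly 52.6 minutes of total downtime per year, which is why planned maintenance has to be non-disruptive in an HA design.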
Definitions
 Rapid Recovery (RR) - a {Network,
System, Service} with specific design
elements intended to recover from
downtime very quickly (eg, 15 minutes)
Metrics
 Economics of high availability (the
cost of non-availability)
 Calculating availability
 How availability measurements are
performed
Economics of high availability
 What is the cost of an outage in your:
 Student Courseware systems and student record
systems
 Financial systems
 Primary campus web site and Email servers
 DNS, DHCP and AuthN systems
 Internet connection(s)
 Development / Gifts systems
 How much should you be willing to spend to
minimize downtime of any or all of these?
Calculating availability
 Availability can be measured directly through
periodic polling (eg, SNMP, Mon, Nagios)
 A formula for predicting availability of a single
component
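\[
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{TTR}}
\quad \text{or} \quad
A = 1 - \frac{\mathrm{TTR}}{\mathrm{MTBF} + \mathrm{TTR}}
\]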
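Both approaches on this slide can be sketched in a few lines: the predicted availability of a single component, MTBF / (MTBF + TTR), and a measured availability taken as the fraction of successful polls. The function names and the representation of poll results as a list of booleans are assumptions for illustration; a real poller (SNMP, Mon, Nagios) would supply the samples.

```python
from typing import Sequence


def predicted_availability(mtbf_hours: float, ttr_hours: float) -> float:
    """Predict single-component availability as MTBF / (MTBF + TTR)."""
    if mtbf_hours <= 0 or ttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and ttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + ttr_hours)


def measured_availability(poll_results: Sequence[bool]) -> float:
    """Estimate availability as the fraction of periodic polls that succeeded.

    Assumption: polls are evenly spaced, so each sample represents an
    equal slice of time.
    """
    if not poll_results:
        raise ValueError("at least one poll result is required")
    return sum(1 for ok in poll_results if ok) / len(poll_results)


if __name__ == "__main__":
    # A component that fails on average every 999 hours and takes 1 hour to repair.
    print(f"predicted: {predicted_availability(999, 1):.4f}")
    # Four polls, one of which found the service down.
    print(f"measured:  {measured_availability([True, True, True, False]):.4f}")
```

Note that a measured figure is only as fine-grained as the polling interval: an outage shorter than the interval can be missed entirely.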
Design Principles
 Towards HA
 Minimize points of catastrophic failure
 Maximize redundancy
 Minimize fault zones
 Minimize complexity and cost
 Applying the above principles to
 Networks
 Systems
 Services
Specific examples at Penn
 High Availability Services
 Rapid Recovery Services
High Availability Design
 Strategies employed to achieve HA:
 Server redundancy
 Hardware component redundancy
 Storage redundancy (RAID)
 Network redundancy
 Redundant power, A/C, cooling etc
 Application protocols that can transparently
failover to alternate servers
 Secondary offsite hosting (of some services like
DNS)
Rapid Recovery Design
 Strategies employed to achieve RR:
 Standby servers and storage
 Some HA design elements:
 Hardware redundancy, storage redundancy, network
redundancy, power, A/C redundancy etc
 Note: services deployed in the RR model typically
don’t have an easy way to transparently failover to
alternate servers (eg. E-mail, Web etc)
Network Aggregation Point
 Abbreviation: NAP
 Machine rooms in separate campus locations
that house critical network electronics and
servers.
 Good environmentals and extensive
connectivity to campus fiber-optic cable plant
 Both HA and RR services utilize multiple
NAPs
Central Infra. Networks
 AKA “NOC Networks” (historical name)
 3 highly redundant IP networks that house systems
providing critical infrastructure services
 Each network is triply connected to campus routing
core via distinct NAP locations
 Network wiring traverses physically diverse fiber
conduit pathways
 Use of router redundancy protocols (VRRP) &
Layer-2 path redundancy (802.1D) for high availability
HA Server Platforms
 Two sets of three replicated servers
 3 KDC servers: central authentication
 3 NOC servers: everything else
 Kerberos runs on separate systems mainly
for security reasons.
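The benefit of three replicated servers can be estimated with the standard parallel and series availability formulas. This is a sketch under a strong assumption that is not stated in the slides: replicas fail independently. Shared power, shared software bugs, or a shared network path break that assumption, which is one reason to place replicas in separate machine rooms on separate networks. The function names are illustrative.

```python
from typing import Iterable


def _check(a: float) -> float:
    if not 0.0 <= a <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return a


def parallel_availability(replicas: Iterable[float]) -> float:
    """Availability of a replicated service that is up if ANY replica is up.

    Assumes independent failures: the service is down only when every
    replica is down at the same time.
    """
    all_down = 1.0
    for a in replicas:
        all_down *= 1.0 - _check(a)
    return 1.0 - all_down


def series_availability(components: Iterable[float]) -> float:
    """Availability of a chain that is up only if EVERY component is up."""
    all_up = 1.0
    for a in components:
        all_up *= _check(a)
    return all_up


if __name__ == "__main__":
    # Three replicas at 99% each, assumed independent.
    print(f"three replicas: {parallel_availability([0.99, 0.99, 0.99]):.6f}")
    # A server at 99% reached through a network at 99%.
    print(f"server + network in series: {series_availability([0.99, 0.99]):.4f}")
```

The series case shows why redundancy has to extend to the network and power as well: a replicated service behind a single non-redundant dependency is limited by that dependency.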
High Availability: KDCs
 KDCs (3):
 3 distinct machines (kdc1, kdc2, kdc3)
 Run only Kerberos AS and TGS
 Each located in a different campus machine room
 Each connected to a distinct IP network
 Via a distinct IP core router
 Additionally each network is triply connected to the
campus routing core via 3 NAPs
High Availability: NOCs
 3 “NOC” systems (a historical name)
 Provide: DNS, DHCP, NTP, RADIUS plus a few
homegrown services
 Same physical and network connectivity as the
KDCs
 In addition: some servers have a secondary
interface on a different NOC network (for reasons
to be explained later)
HA Application Failover
 Kerberos
 DNS
 RADIUS
 NTP
 DHCP
 Current spec supports only 2 failover systems
 Non-HA homegrown services: PennNames
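The protocols listed above achieve failover on the client side: the client knows several servers and moves on to the next when one does not answer. The sketch below shows that general pattern only; it is not how any particular Kerberos, DNS, or RADIUS client is implemented. The `query` callable, the exception class, and the `example.edu` hostnames are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllServersFailed(Exception):
    """Raised when every server in the list has been tried without success."""


def query_with_failover(servers: Sequence[str], query: Callable[[str], T]) -> T:
    """Try each server in order and return the first successful answer.

    `query` is a caller-supplied function (hypothetical here) that contacts
    one server and raises OSError (which includes ConnectionError and
    TimeoutError) when that server is unreachable.
    """
    errors: List[Tuple[str, OSError]] = []
    for server in servers:
        try:
            return query(server)
        except OSError as exc:
            # Remember the failure and fall through to the next server.
            errors.append((server, exc))
    raise AllServersFailed(errors)


if __name__ == "__main__":
    def fake_query(server: str) -> str:
        # Stand-in for a real network request: pretend the first server is down.
        if server == "kdc1.example.edu":
            raise ConnectionError("no response")
        return f"answer from {server}"

    kdcs = ["kdc1.example.edu", "kdc2.example.edu", "kdc3.example.edu"]
    print(query_with_failover(kdcs, fake_query))
```

Because the failover logic lives in the client, the user sees at most a short delay while a dead server times out, which is what makes the failover transparent.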
Rapid Recovery service
 Example: E-mail and Web service
 A set of servers and storage is replicated at two sites: primary
and standby
 Primary site: active servers and storage
 Secondary site: standby servers and replicated storage
 Data from 1st site is synchronously replicated to 2nd
 Two separate fibrechannel networks interconnect systems and
storage at both sites
 Catastrophic failure event: system can be manually reconfigured
to use the standby servers and/or secondary storage (~30
minutes)
 Servers are located on the HA primary infrastructure network
Experiences at Penn
 Where these approaches have been helpful
 Higher availability, non-disruptive maintenance
 Where they have not
 Complexity can be hard to manage!
 Where cost has been high
 Replicated systems and networks, high-end
storage solutions
 Real availability experience
 DNS, a critical service, went from 99.0% to
99.999% availability!
Future Enhancements
 Making RR services highly available:
 “clustering”, IETF rserpool etc
 Metropolitan area DR (or better)
 Rolling disaster protection
 Others:
 IP Multipathing
 Trunking links to servers
 802.3ad, SMLT, DMLT or similar
 Rapid Spanning Tree (IEEE 802.1w)
 Multi-master KADM service
 Improved management and monitoring
infrastructure
Feedback
 Questions, comments
 Your designs, experiences, successes
Contact Info:
deke@isc.upenn.edu
shuque@isc.upenn.edu

More Related Content

What's hot

CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
Kathirvel Ayyaswamy
 
PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
PACK: Prediction-Based Cloud Bandwidth and Cost Reduction SystemPACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
JPINFOTECH JAYAPRAKASH
 
Computer Networks: Quality of service
Computer Networks: Quality of serviceComputer Networks: Quality of service
Computer Networks: Quality of service
Kongu Engineering College, Perundurai, Erode
 
Topic: Virtual circuit & message switching
Topic: Virtual circuit & message switchingTopic: Virtual circuit & message switching
Topic: Virtual circuit & message switching
Dr Rajiv Srivastava
 
Multiplexing and switching(TDM ,FDM, Data gram, circuit switching)
Multiplexing and switching(TDM ,FDM, Data gram, circuit switching)Multiplexing and switching(TDM ,FDM, Data gram, circuit switching)
Multiplexing and switching(TDM ,FDM, Data gram, circuit switching)
Adil Mehmoood
 
06 digital datacomm
06 digital datacomm06 digital datacomm
1.intro. to distributed system
1.intro. to distributed system1.intro. to distributed system
1.intro. to distributed system
Gd Goenka University
 
Unit i packet switching networks
Unit i  packet switching networksUnit i  packet switching networks
Unit i packet switching networks
sangusajjan
 
10 Circuit Packet
10 Circuit Packet10 Circuit Packet
10 Circuit Packet
Waqas !!!!
 
Balman dissertation Copyright @ 2010 Mehmet Balman
Balman dissertation Copyright @ 2010 Mehmet BalmanBalman dissertation Copyright @ 2010 Mehmet Balman
Balman dissertation Copyright @ 2010 Mehmet Balman
balmanme
 
Switching techniques
Switching techniquesSwitching techniques
Switching techniques
Gupta6Bindu
 
Presentation southernstork 2009-nov-southernworkshop
Presentation southernstork 2009-nov-southernworkshopPresentation southernstork 2009-nov-southernworkshop
Presentation southernstork 2009-nov-southernworkshop
balmanme
 
2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts
Prajakta Rane
 
11 circuit-packet
11 circuit-packet11 circuit-packet
11 circuit-packet
Hattori Sidek
 
Switching Techniques (Lecture #2 ET3003 Sem1 2014/2015)
Switching Techniques (Lecture #2 ET3003 Sem1 2014/2015)Switching Techniques (Lecture #2 ET3003 Sem1 2014/2015)
Switching Techniques (Lecture #2 ET3003 Sem1 2014/2015)
Tutun Juhana
 
Comp net 2
Comp net 2Comp net 2
Comp net 2
Abdullaziz Tagawy
 
Switching
SwitchingSwitching
Switching
sheekha_11
 
Circuit Packet
Circuit PacketCircuit Packet
Circuit Packet
Waqas !!!!
 
Congestionin Data Networks
Congestionin Data NetworksCongestionin Data Networks
Congestionin Data Networks
Waqas !!!!
 
Gurpinder_Resume
Gurpinder_ResumeGurpinder_Resume
Gurpinder_Resume
Gurpinder Ghuman
 

What's hot (20)

CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
 
PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
PACK: Prediction-Based Cloud Bandwidth and Cost Reduction SystemPACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
PACK: Prediction-Based Cloud Bandwidth and Cost Reduction System
 
Computer Networks: Quality of service
Computer Networks: Quality of serviceComputer Networks: Quality of service
Computer Networks: Quality of service
 
Topic: Virtual circuit & message switching
Topic: Virtual circuit & message switchingTopic: Virtual circuit & message switching
Topic: Virtual circuit & message switching
 
Multiplexing and switching(TDM ,FDM, Data gram, circuit switching)
Multiplexing and switching(TDM ,FDM, Data gram, circuit switching)Multiplexing and switching(TDM ,FDM, Data gram, circuit switching)
Multiplexing and switching(TDM ,FDM, Data gram, circuit switching)
 
06 digital datacomm
06 digital datacomm06 digital datacomm
06 digital datacomm
 
1.intro. to distributed system
1.intro. to distributed system1.intro. to distributed system
1.intro. to distributed system
 
Unit i packet switching networks
Unit i  packet switching networksUnit i  packet switching networks
Unit i packet switching networks
 
10 Circuit Packet
10 Circuit Packet10 Circuit Packet
10 Circuit Packet
 
Balman dissertation Copyright @ 2010 Mehmet Balman
Balman dissertation Copyright @ 2010 Mehmet BalmanBalman dissertation Copyright @ 2010 Mehmet Balman
Balman dissertation Copyright @ 2010 Mehmet Balman
 
Switching techniques
Switching techniquesSwitching techniques
Switching techniques
 
Presentation southernstork 2009-nov-southernworkshop
Presentation southernstork 2009-nov-southernworkshopPresentation southernstork 2009-nov-southernworkshop
Presentation southernstork 2009-nov-southernworkshop
 
2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts2. Distributed Systems Hardware & Software concepts
2. Distributed Systems Hardware & Software concepts
 
11 circuit-packet
11 circuit-packet11 circuit-packet
11 circuit-packet
 
Switching Techniques (Lecture #2 ET3003 Sem1 2014/2015)
Switching Techniques (Lecture #2 ET3003 Sem1 2014/2015)Switching Techniques (Lecture #2 ET3003 Sem1 2014/2015)
Switching Techniques (Lecture #2 ET3003 Sem1 2014/2015)
 
Comp net 2
Comp net 2Comp net 2
Comp net 2
 
Switching
SwitchingSwitching
Switching
 
Circuit Packet
Circuit PacketCircuit Packet
Circuit Packet
 
Congestionin Data Networks
Congestionin Data NetworksCongestionin Data Networks
Congestionin Data Networks
 
Gurpinder_Resume
Gurpinder_ResumeGurpinder_Resume
Gurpinder_Resume
 

Similar to Designing High Availability Networks, Systems, and Software for the University Environment

Data center disaster recovery.ppt
Data center disaster recovery.ppt Data center disaster recovery.ppt
Data center disaster recovery.ppt
omalreda
 
Datacenter101
Datacenter101Datacenter101
Datacenter101
tarundua
 
Cl306
Cl306Cl306
slides
slidesslides
Networking.pptx
Networking.pptxNetworking.pptx
Networking.pptx
YashShinde96
 
Networking.pptx
Networking.pptxNetworking.pptx
Networking.pptx
FarhanAli951243
 
Tcp ip
Tcp ipTcp ip
Tcp ip
mailalamin
 
Lecture 01
Lecture 01Lecture 01
Lecture 01
maruthi vardhan
 
Systems Support for Many Task Computing
Systems Support for Many Task ComputingSystems Support for Many Task Computing
Systems Support for Many Task Computing
Eric Van Hensbergen
 
Networking
NetworkingNetworking
Networking
AdityaKumar1548
 
Clusters (Distributed computing)
Clusters (Distributed computing)Clusters (Distributed computing)
Clusters (Distributed computing)
Sri Prasanna
 
Storage Primer
Storage PrimerStorage Primer
Storage Primer
sriramr
 
[Altibase] 8 replication part1 (overview)
[Altibase] 8 replication part1 (overview)[Altibase] 8 replication part1 (overview)
[Altibase] 8 replication part1 (overview)
altistory
 
Fundamentals
FundamentalsFundamentals
Fundamentals
Divya Srinivasan
 
Communication Networks 1
Communication Networks 1Communication Networks 1
Communication Networks 1
mahamed Ayesh
 
Distributed Systems.ppt
Distributed Systems.pptDistributed Systems.ppt
Distributed Systems.ppt
AdrianTopoleanu1
 
ln13-ds.pptefefdfdgdgerhfhgjhmmmmmmmmmmm
ln13-ds.pptefefdfdgdgerhfhgjhmmmmmmmmmmmln13-ds.pptefefdfdgdgerhfhgjhmmmmmmmmmmm
ln13-ds.pptefefdfdgdgerhfhgjhmmmmmmmmmmm
peterhaile1
 
Chapter01 -- introduction to networking
Chapter01  -- introduction to networkingChapter01  -- introduction to networking
Chapter01 -- introduction to networking
Raja Waseem Akhtar
 
Supply frame high availability in web content delivery
Supply frame high availability in web content deliverySupply frame high availability in web content delivery
Supply frame high availability in web content delivery
Aleksandar Bilanovic
 
lec3_10.ppt
lec3_10.pptlec3_10.ppt
lec3_10.ppt
ImXaib
 

Similar to Designing High Availability Networks, Systems, and Software for the University Environment (20)

Data center disaster recovery.ppt
Data center disaster recovery.ppt Data center disaster recovery.ppt
Data center disaster recovery.ppt
 
Datacenter101
Datacenter101Datacenter101
Datacenter101
 
Cl306
Cl306Cl306
Cl306
 
slides
slidesslides
slides
 
Networking.pptx
Networking.pptxNetworking.pptx
Networking.pptx
 
Networking.pptx
Networking.pptxNetworking.pptx
Networking.pptx
 
Tcp ip
Tcp ipTcp ip
Tcp ip
 
Lecture 01
Lecture 01Lecture 01
Lecture 01
 
Systems Support for Many Task Computing
Systems Support for Many Task ComputingSystems Support for Many Task Computing
Systems Support for Many Task Computing
 
Networking
NetworkingNetworking
Networking
 
Clusters (Distributed computing)
Clusters (Distributed computing)Clusters (Distributed computing)
Clusters (Distributed computing)
 
Storage Primer
Storage PrimerStorage Primer
Storage Primer
 
[Altibase] 8 replication part1 (overview)
[Altibase] 8 replication part1 (overview)[Altibase] 8 replication part1 (overview)
[Altibase] 8 replication part1 (overview)
 
Fundamentals
FundamentalsFundamentals
Fundamentals
 
Communication Networks 1
Communication Networks 1Communication Networks 1
Communication Networks 1
 
Distributed Systems.ppt
Distributed Systems.pptDistributed Systems.ppt
Distributed Systems.ppt
 
ln13-ds.pptefefdfdgdgerhfhgjhmmmmmmmmmmm
ln13-ds.pptefefdfdgdgerhfhgjhmmmmmmmmmmmln13-ds.pptefefdfdgdgerhfhgjhmmmmmmmmmmm
ln13-ds.pptefefdfdgdgerhfhgjhmmmmmmmmmmm
 
Chapter01 -- introduction to networking
Chapter01  -- introduction to networkingChapter01  -- introduction to networking
Chapter01 -- introduction to networking
 
Supply frame high availability in web content delivery
Supply frame high availability in web content deliverySupply frame high availability in web content delivery
Supply frame high availability in web content delivery
 
lec3_10.ppt
lec3_10.pptlec3_10.ppt
lec3_10.ppt
 

More from Shumon Huque

DANE and DNSSEC Authentication Chain Extension for TLS
DANE and DNSSEC Authentication Chain Extension for TLSDANE and DNSSEC Authentication Chain Extension for TLS
DANE and DNSSEC Authentication Chain Extension for TLS
Shumon Huque
 
Client Certificates in DANE TLSA Records
Client Certificates in DANE TLSA RecordsClient Certificates in DANE TLSA Records
Client Certificates in DANE TLSA Records
Shumon Huque
 
Query-name Minimization and Authoritative Server Behavior
Query-name Minimization and Authoritative Server BehaviorQuery-name Minimization and Authoritative Server Behavior
Query-name Minimization and Authoritative Server Behavior
Shumon Huque
 
DANE and Application Uses of DNSSEC
DANE and Application Uses of DNSSECDANE and Application Uses of DNSSEC
DANE and Application Uses of DNSSEC
Shumon Huque
 
Hands-on getdns Tutorial
Hands-on getdns TutorialHands-on getdns Tutorial
Hands-on getdns Tutorial
Shumon Huque
 
DANE and Application Uses of DNSSEC
DANE and Application Uses of DNSSECDANE and Application Uses of DNSSEC
DANE and Application Uses of DNSSEC
Shumon Huque
 
IPv6 Tutorial; USENIX LISA 2013
IPv6 Tutorial; USENIX LISA 2013IPv6 Tutorial; USENIX LISA 2013
IPv6 Tutorial; USENIX LISA 2013
Shumon Huque
 
DNSSEC Tutorial; USENIX LISA 2013
DNSSEC Tutorial; USENIX LISA 2013DNSSEC Tutorial; USENIX LISA 2013
DNSSEC Tutorial; USENIX LISA 2013
Shumon Huque
 
IPv6 Transition in Research & Education
IPv6 Transition in Research & EducationIPv6 Transition in Research & Education
IPv6 Transition in Research & Education
Shumon Huque
 
Authorization at Penn
Authorization at PennAuthorization at Penn
Authorization at Penn
Shumon Huque
 
IPv6 Deployment Panel
IPv6 Deployment PanelIPv6 Deployment Panel
IPv6 Deployment Panel
Shumon Huque
 
A survey of DNSSEC Deployment in the US R&E Community
A survey of DNSSEC Deployment in the US R&E CommunityA survey of DNSSEC Deployment in the US R&E Community
A survey of DNSSEC Deployment in the US R&E Community
Shumon Huque
 
World IPv6 Launch at Penn
World IPv6 Launch at PennWorld IPv6 Launch at Penn
World IPv6 Launch at Penn
Shumon Huque
 
IPv6 Security Panel (U of Penn)
IPv6 Security Panel (U of Penn)IPv6 Security Panel (U of Penn)
IPv6 Security Panel (U of Penn)
Shumon Huque
 
Open Source VoIP at Penn
Open Source VoIP at PennOpen Source VoIP at Penn
Open Source VoIP at Penn
Shumon Huque
 
Kerberos at Penn (MIT Kerberos Consortium)
Kerberos at Penn (MIT Kerberos Consortium)Kerberos at Penn (MIT Kerberos Consortium)
Kerberos at Penn (MIT Kerberos Consortium)
Shumon Huque
 
.EDU DNSSEC Testbed - Lessons Learned
.EDU DNSSEC Testbed - Lessons Learned.EDU DNSSEC Testbed - Lessons Learned
.EDU DNSSEC Testbed - Lessons Learned
Shumon Huque
 
IPv6 Campus Deployment Panel
IPv6 Campus Deployment PanelIPv6 Campus Deployment Panel
IPv6 Campus Deployment Panel
Shumon Huque
 
.EDU DNSSEC Testbed
.EDU DNSSEC Testbed.EDU DNSSEC Testbed
.EDU DNSSEC Testbed
Shumon Huque
 
DNSSEC at Penn
DNSSEC at PennDNSSEC at Penn
DNSSEC at Penn
Shumon Huque
 

More from Shumon Huque (20)

DANE and DNSSEC Authentication Chain Extension for TLS
DANE and DNSSEC Authentication Chain Extension for TLSDANE and DNSSEC Authentication Chain Extension for TLS
DANE and DNSSEC Authentication Chain Extension for TLS
 
Client Certificates in DANE TLSA Records
Client Certificates in DANE TLSA RecordsClient Certificates in DANE TLSA Records
Client Certificates in DANE TLSA Records
 
Query-name Minimization and Authoritative Server Behavior
Query-name Minimization and Authoritative Server BehaviorQuery-name Minimization and Authoritative Server Behavior
Query-name Minimization and Authoritative Server Behavior
 
DANE and Application Uses of DNSSEC
DANE and Application Uses of DNSSECDANE and Application Uses of DNSSEC
DANE and Application Uses of DNSSEC
 
Hands-on getdns Tutorial
Hands-on getdns TutorialHands-on getdns Tutorial
Hands-on getdns Tutorial
 
DANE and Application Uses of DNSSEC
DANE and Application Uses of DNSSECDANE and Application Uses of DNSSEC
DANE and Application Uses of DNSSEC
 
IPv6 Tutorial; USENIX LISA 2013
IPv6 Tutorial; USENIX LISA 2013IPv6 Tutorial; USENIX LISA 2013
IPv6 Tutorial; USENIX LISA 2013
 
DNSSEC Tutorial; USENIX LISA 2013
DNSSEC Tutorial; USENIX LISA 2013DNSSEC Tutorial; USENIX LISA 2013
DNSSEC Tutorial; USENIX LISA 2013
 
IPv6 Transition in Research & Education
IPv6 Transition in Research & EducationIPv6 Transition in Research & Education
IPv6 Transition in Research & Education
 
Authorization at Penn
Authorization at PennAuthorization at Penn
Authorization at Penn
 
IPv6 Deployment Panel
IPv6 Deployment PanelIPv6 Deployment Panel
IPv6 Deployment Panel
 
A survey of DNSSEC Deployment in the US R&E Community
A survey of DNSSEC Deployment in the US R&E CommunityA survey of DNSSEC Deployment in the US R&E Community
A survey of DNSSEC Deployment in the US R&E Community
 
World IPv6 Launch at Penn
World IPv6 Launch at PennWorld IPv6 Launch at Penn
World IPv6 Launch at Penn
 
IPv6 Security Panel (U of Penn)
IPv6 Security Panel (U of Penn)IPv6 Security Panel (U of Penn)
IPv6 Security Panel (U of Penn)
 
Open Source VoIP at Penn
Open Source VoIP at PennOpen Source VoIP at Penn
Open Source VoIP at Penn
 
Kerberos at Penn (MIT Kerberos Consortium)
Kerberos at Penn (MIT Kerberos Consortium)Kerberos at Penn (MIT Kerberos Consortium)
Kerberos at Penn (MIT Kerberos Consortium)
 
.EDU DNSSEC Testbed - Lessons Learned
.EDU DNSSEC Testbed - Lessons Learned.EDU DNSSEC Testbed - Lessons Learned
.EDU DNSSEC Testbed - Lessons Learned
 
IPv6 Campus Deployment Panel
IPv6 Campus Deployment PanelIPv6 Campus Deployment Panel
IPv6 Campus Deployment Panel
 
.EDU DNSSEC Testbed
.EDU DNSSEC Testbed.EDU DNSSEC Testbed
.EDU DNSSEC Testbed
 
DNSSEC at Penn
DNSSEC at PennDNSSEC at Penn
DNSSEC at Penn
 

Recently uploaded

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
marufrahmanstratejm
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
Data Hops
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 

Recently uploaded (20)

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Designing High Availability Networks, Systems, and Software for the University Environment

  • 1. Designing High Availability Networks, Systems, and Software for the University Environment. Deke Kassabian and Shumon Huque, The University of Pennsylvania, January 14, 2004.
  • 2. About Penn: The University of Pennsylvania was founded by Ben Franklin in 1751; Penn is part of the Ivy League; located in western Philadelphia; community of more than 30,000 people.
  • 3. General Goals: networked services available as expected by our users; minimized time to repair (TTR) when outages do occur; ability to perform maintenance and upgrades (planned downtime) non-disruptively; cost effectiveness in meeting these goals.
  • 4. Definitions: Availability; High Availability (HA); Rapid Recovery (RR); Disaster Recovery (DR); Basic Systems.
  • 5. Definitions: Disaster Recovery (DR) is the process of restoring a service to full operation after an interruption in service.
  • 6. Definitions: A Basic System is a {Network, System, Service} with only the most basic of protections against outages. Examples: a network recoverable using spare parts; a single computer system with RAID disk; a service recoverable from tape backups.
  • 7. Definitions: Availability is the percentage of total time that a {Network, System, Service} is available for use. Related points: advertised periods of availability; availability as advertised; absolute availability.
  • 8. Definitions: High Availability (HA) is a {Network, System, Service} with specific design elements intended to keep availability above a high threshold (e.g., 99.99%).
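An availability threshold such as the 99.99% on slide 8 is easier to reason about as a downtime budget. The following is a minimal sketch, not part of the original slides; the function name and the 365-day year are assumptions made for illustration.

```python
# Convert an availability target into an allowed-downtime budget.
# Assumption: a 365-day year (525,600 minutes); leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the minutes of downtime per year permitted by a target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the yearly downtime budget for a few common targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.1f} min/year")
```

Under these assumptions, a 99.99% target leaves roughly 52.6 minutes of downtime per year, planned and unplanned combined.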
  • 9. Definitions: Rapid Recovery (RR) is a {Network, System, Service} with specific design elements intended to recover from downtime very quickly (e.g., 15 minutes).
  • 10. Metrics: economics of high availability (the costs of non-availability); calculating availability; how availability measurements are performed.
  • 11. Economics of high availability: What is the cost of an outage in your student courseware systems and student record systems; financial systems; primary campus web site and e-mail servers; DNS, DHCP and AuthN systems; Internet connection(s); development / gifts systems? How much should you be willing to spend to minimize downtime of any or all of these?
  • 12. Calculating availability: Availability can be measured directly through periodic polling (e.g., SNMP, Mon, Nagios). A formula for predicting the availability of a single component: $\frac{\text{MTBF}}{\text{MTBF} + \text{TTR}}$ or, equivalently, $1 - \frac{\text{TTR}}{\text{MTBF} + \text{TTR}}$.
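Both approaches on slide 12 can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the MTBF and TTR figures in the usage note are made up, and the polling function assumes evenly spaced polls whose results have already been collected by a monitoring tool.

```python
from typing import Sequence


def predicted_availability(mtbf_hours: float, ttr_hours: float) -> float:
    """Predicted availability of one component: MTBF / (MTBF + TTR)."""
    if mtbf_hours <= 0 or ttr_hours < 0:
        raise ValueError("MTBF must be positive and TTR non-negative")
    return mtbf_hours / (mtbf_hours + ttr_hours)


def measured_availability(poll_results: Sequence[bool]) -> float:
    """Fraction of periodic polls in which the service answered.

    Assumption: polls are evenly spaced, so each one represents an
    equal slice of time.
    """
    if not poll_results:
        raise ValueError("need at least one poll result")
    return sum(poll_results) / len(poll_results)
```

For example, a component with a hypothetical MTBF of 10,000 hours and a TTR of 4 hours gives `predicted_availability(10_000, 4)`, which is 10000/10004, or about 99.96%. Shortening the TTR raises the prediction without any change to the MTBF, which is the lever Rapid Recovery designs pull.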
  • 13. Design Principles: Towards HA: minimize points of catastrophic failure; maximize redundancy; minimize fault zones; minimize complexity and cost. Applying the above principles to networks, systems, and services.
  • 14. Specific examples at Penn: High Availability services; Rapid Recovery services.
  • 15. High Availability Design: Strategies employed to achieve HA: server redundancy; hardware component redundancy; storage redundancy (RAID); network redundancy; redundant power, A/C, cooling, etc.; application protocols that can transparently fail over to alternate servers; secondary offsite hosting (of some services, like DNS).
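The payoff from the redundancy strategies on slide 15 can be estimated with the standard reliability formulas for components in parallel and in series. These formulas are not from the slides, and the sketch assumes that components fail independently, which is only approximately true when servers share power, cooling, or a fault zone. The function names are hypothetical.

```python
from math import prod
from typing import Iterable


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Redundant components: the service is up if at least one is up.

    Assumption: failures are independent of one another.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


def series_availability(availabilities: Iterable[float]) -> float:
    """Chained components: the service is up only if every one is up."""
    return prod(availabilities)
```

Under the independence assumption, three replicated servers that are each 99% available combine to about 99.9999% (`parallel_availability([0.99, 0.99, 0.99])`), while two 99.9% components that must both work drop to about 99.8%. This is one way to see why minimizing shared points of failure matters as much as adding replicas.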
  • 16. Rapid Recovery Design: Strategies employed to achieve RR: standby servers and storage; some HA design elements (hardware redundancy, storage redundancy, network redundancy, power and A/C redundancy, etc.). Note: services deployed in the RR model typically don’t have an easy way to transparently fail over to alternate servers (e.g., e-mail, Web).
  • 17. Network Aggregation Point (abbreviation: NAP): machine rooms in separate campus locations that house critical network electronics and servers; good environmentals and extensive connectivity to the campus fiber-optic cable plant; both HA and RR services utilize multiple NAPs.
  • 18. Central Infra. Networks: AKA “NOC Networks” (historical name); 3 highly redundant IP networks that house systems providing critical infrastructure services; each network is triply connected to the campus routing core via distinct NAP locations; network wiring traverses physically diverse fiber conduit pathways; use of router redundancy protocols (VRRP) and Layer-2 path redundancy (802.1D) for high availability.
  • 19. HA Server Platforms: two sets of three replicated servers: 3 KDC servers (central authentication) and 3 NOC servers (everything else). Kerberos runs on separate systems mainly for security reasons.
  • 20. High Availability: KDCs: 3 distinct machines (kdc1, kdc2, kdc3); run only Kerberos AS and TGS; each located in a different campus machine room; each connected to a distinct IP network via a distinct IP core router; additionally, each network is triply connected to the campus routing core via 3 NAPs.
  • 21. High Availability: NOCs: 3 “NOC” systems (a historical name); provide DNS, DHCP, NTP, RADIUS plus a few homegrown services; same physical and network connectivity as the KDCs; in addition, some servers have a secondary interface on a different NOC network (for reasons to be explained later).
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27. HA Application Failover: Kerberos; DNS; RADIUS; NTP; DHCP (current spec supports only 2 failover systems). Non-HA homegrown services: PennNames.
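The protocols listed on slide 27 let a client move to an alternate server when one does not answer. The general pattern can be sketched as below. This is a generic illustration, not how any of those protocols or Penn's deployment is actually implemented; the function name and the `query` callable are hypothetical, and real clients add per-server timeouts and retry policies.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def query_with_failover(servers: Sequence[str], query: Callable[[str], T]) -> T:
    """Try each server in order and return the first successful answer.

    `query` is a caller-supplied function (hypothetical here) that
    contacts one server and raises OSError on failure.
    """
    last_error = None
    for server in servers:
        try:
            return query(server)
        except OSError as exc:  # includes timeouts and refused connections
            last_error = exc  # remember the failure and try the next server
    raise ConnectionError(f"all {len(servers)} servers failed") from last_error
```

A client configured with a list such as `["kdc1", "kdc2", "kdc3"]` would keep working as long as any one of the three replicas is reachable, which is why the replicas are placed in different machine rooms and on different networks.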
  • 28. Rapid Recovery service: Example: e-mail and Web service. A set of servers and storage is replicated at two sites, primary and standby. Primary site: active servers and storage. Secondary site: standby servers and replicated storage. Data from the 1st site is synchronously replicated to the 2nd. Two separate Fibre Channel networks interconnect systems and storage at both sites. Catastrophic failure event: the system can be manually reconfigured to use the standby servers and/or secondary storage (~30 minutes). Servers are located on the HA primary infrastructure network.
  • 29.
  • 30. Experiences at Penn: Where these approaches have been helpful: higher availability, non-disruptive maintenance. Where they have not: complexity can be hard to manage! Where cost has been high: replicated systems and networks, high-end storage solutions. Real availability experience: DNS, a critical service, went from 99.0% to 99.999% availability!
  • 31. Future Enhancements: Making RR services highly available: “clustering”, IETF rserpool, etc. Metropolitan area DR (or better). Rolling disaster protection. Others: IP Multipathing; trunking links to servers; 802.3ad, SMLT, DMLT or similar; Rapid Spanning Tree (IEEE 802.1w); multi-master KADM service; improved management and monitoring infrastructure.
  • 32. Feedback: questions, comments; your designs, experiences, successes. Contact info: deke@isc.upenn.edu, shuque@isc.upenn.edu