Designing Highly-Available Architectures for OTM

Chris Plough
OTM User Conference
August 2011
Abstract
Hello, I’m Chris Plough. I joined G-Log in November 1999 and played a key role in developing
the OTM Technical Architecture.

OTM is a critical enterprise application, and downtime can be very expensive, ranging from tens
to hundreds of thousands of dollars per hour of unplanned outage. Learn how to design your OTM
architecture to provide the right amount of redundancy for your company, balancing business
requirements against budgetary constraints. I will discuss and demonstrate the benefits and
pitfalls learned from real-world scenarios, drawn both from our OTM hosting architecture and
from direct customer experiences.

Mr. Plough’s Rules

▪ This is not a lecture; it is a guided discussion
  • Ask questions and make it interactive
▪ This is a broad topic
  • We will not cover everything
▪ I don’t know it all
  • Suggest alternatives
▪ Have a little fun!
Agenda

▪ What are the Business Requirements? (What? In a tech presentation?)
▪ Background
  • In the Real World…
  • Lurking Dangers
▪ Designing OTM for HA
  • Overview
  • Cheat Sheet
Business Requirements
What are the Business Requirements?

▪ Yep – I’m using the “B” word. Business.
▪ Do you have SLAs defined for OTM?
  • Uptime requirements, core hours of service, recovery point objective (RPO), recovery time objective (RTO)
▪ Is there a budget defined?
  • 99.0% = $, 99.9% = $$$, 99.99% = $$$$$$$$
▪ What is the true cost of an outage? What workarounds exist?
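To make the uptime percentages concrete, it can help to convert each SLA target into the hours of outage it permits per year and then price that exposure. The sketch below is only an illustration: the function names are hypothetical, the year is taken as 8,760 hours of 24x7 service, and the $50,000 hourly cost is an assumed figure picked from the "tens to hundreds of thousands of dollars per hour" range in the abstract.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming 24x7 service and no leap year


def allowed_downtime_hours(availability_pct: float,
                           hours_in_scope: float = HOURS_PER_YEAR) -> float:
    """Hours of outage per year that an availability target permits."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return hours_in_scope * (1 - availability_pct / 100)


def outage_exposure(availability_pct: float, cost_per_hour: float) -> float:
    """Yearly cost if the whole downtime budget is consumed by unplanned outages."""
    return allowed_downtime_hours(availability_pct) * cost_per_hour


if __name__ == "__main__":
    ASSUMED_COST_PER_HOUR = 50_000  # hypothetical; use your own business figure
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        cost = outage_exposure(target, ASSUMED_COST_PER_HOUR)
        print(f"{target}% -> {hours:.2f} h/year, exposure ${cost:,.0f}")
```

Under these assumptions, 99.0% allows 87.6 hours of downtime per year, 99.9% allows 8.76 hours, and 99.99% allows roughly 53 minutes. If the SLA only covers core hours of service, pass a smaller `hours_in_scope`.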
Real World Risks
In the Real World

▪ Understand the risks
  • The best failure-rate data comes from large server farms (e.g., Google) and HPC clusters
  • However, OTM is not Google: a server failure is critical
▪ Hard drive failures = 40-50% of all component failures
  • Cheap insurance: an extra hot spare drive
▪ Power supplies account for 10-20% of all component failures
  • Want that spare power supply?
In the Real World (continued)


▪ What about CPU, memory, motherboards, etc.?
  • A great service contract (e.g., 2-4 hour response time)
▪ Side note: failures increase dramatically from year 3 to year 5 of hardware age
▪ Okay, I’ve got spare drives, multiple power supplies, and great service contracts. Now what?
Lurking DANGER!
Lurking Dangers

▪ Enterprise applications – application issues are 15-20x more likely than hardware failures
  • No amount of spare hardware will make up for a poorly configured OTM instance
▪ Integrated systems – maintaining state across multiple apps in the event of a failure
  • Will your data remain synced if a single application fails? How do you recover?
What Does All This Mean?

▪ Before you break out Visio (or LucidChart) and your trusty OTM Application Scalability Guide:
  • Work with the business to define SLAs and a budget
  • Research and determine the risks for your specific environment
  • Understand that OTM is only a part of a much larger business process landscape
Designing OTM HA Architectures

▪ Today
  • Traditional solutions: load balancers, redundant hardware, clustering, DB or storage replication
  • Maintenance windows for patching
▪ In 2-3 years (and sooner for more agile companies)
  • Utilize virtualization with replicated environments
  • Oracle is investing in no-downtime patching technologies (Ksplice)
Designing OTM HA Architectures

▪ Example HA Architecture
Designing OTM HA Architectures – Overview

▪ Concepts and technology are similar to other three-tier applications
▪ Each tier can be scaled / clustered independently
▪ App-level clustering uses OTM-specific technology (SCA)
  • Not Oracle Grid or WebLogic Clustering
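Because each tier can be clustered independently, a rough availability model for the whole stack is the product of the per-tier availabilities, where a tier with redundant nodes is down only when every node is down. The sketch below is a simplified illustration, not an OTM sizing tool: the function names and the 99% per-node figure are assumptions, and it treats node failures as independent with instant failover, which real application-level issues often violate.

```python
from math import prod
from typing import Iterable, Tuple


def tier_availability(node_availability: float, nodes: int) -> float:
    """Availability of one tier that is up while at least one node is up.

    Assumes independent node failures and instantaneous failover.
    """
    if nodes < 1:
        raise ValueError("a tier needs at least one node")
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    return 1.0 - (1.0 - node_availability) ** nodes


def stack_availability(tiers: Iterable[Tuple[float, int]]) -> float:
    """Availability of tiers in series (web -> app -> db): all must be up."""
    return prod(tier_availability(a, n) for a, n in tiers)


if __name__ == "__main__":
    # Hypothetical example: every node is 99% available.
    single = [(0.99, 1), (0.99, 1), (0.99, 1)]   # one node per tier
    paired = [(0.99, 2), (0.99, 2), (0.99, 2)]   # two nodes per tier
    print(f"single nodes: {stack_availability(single):.4%}")
    print(f"paired nodes: {stack_availability(paired):.4%}")
```

With these assumed inputs, three single-node tiers come out at about 97.03% overall, lower than any individual node, while pairing the nodes in each tier lifts the stack to about 99.97%. Treat the result as an upper bound, since shared failure causes such as a misconfigured instance affect every node at once.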
Designing OTM HA Architectures – Cheat Sheet

▪ Web Tier
  • < 99% SLA
      ▪ Utilize a spare server (can be shared with other tiers)
      ▪ Trade-off: cost savings versus recovery time
  • > 99% SLA
      ▪ Hardware load balancer with sticky sessions
      ▪ Sessions are not replicated, but this is manageable
      ▪ Can provide scalability and failover
      ▪ No (or minimal) overhead for additional servers
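The sticky-session behaviour described above can be sketched in a few lines: each session is pinned to one web server, and if that server fails the session is re-pinned to a healthy one but its state is lost, because sessions are not replicated. This is a toy model for discussion only; the `StickyBalancer` class, its methods, and the server names are hypothetical and do not represent any particular hardware load balancer or OTM component.

```python
import hashlib
from typing import Dict, Iterable, Set, Tuple


class StickyBalancer:
    """Toy model of a load balancer that pins each session to one server."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers = list(servers)
        self.healthy: Set[str] = set(self.servers)
        self.assignments: Dict[str, str] = {}  # session id -> pinned server

    def mark_down(self, server: str) -> None:
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def route(self, session_id: str) -> Tuple[str, bool]:
        """Return (server, session_lost).

        session_lost is True when the session was pinned to a server that is
        now down: the user lands on a new server and must log in again,
        because session state is not replicated between web servers.
        """
        current = self.assignments.get(session_id)
        if current in self.healthy:
            return current, False
        candidates = sorted(self.healthy)
        if not candidates:
            raise RuntimeError("no healthy web servers available")
        # Hash the session id so new sessions spread across healthy servers.
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        chosen = candidates[int.from_bytes(digest[:8], "big") % len(candidates)]
        session_lost = current is not None
        self.assignments[session_id] = chosen
        return chosen, session_lost


if __name__ == "__main__":
    lb = StickyBalancer(["web1", "web2"])
    first, _ = lb.route("user-42")
    lb.mark_down(first)
    second, lost = lb.route("user-42")
    print(f"moved from {first} to {second}; session lost: {lost}")
```

The model shows why unreplicated sessions are "manageable": a web server failure costs the affected users a re-login, not the service.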
Designing OTM HA Architectures – Cheat Sheet

▪ App Tier
  • < 99% SLA
      ▪ Utilize a spare server (can be shared with other tiers)
      ▪ Trade-off: cost savings versus recovery time
  • > 99% SLA
      ▪ Utilize OTM’s “High Scalability” clustering
      ▪ Failure behaviour depends on cluster state and failure type
      ▪ Can provide scalability and failover
      ▪ Overhead for additional servers
Designing OTM HA Architectures – Cheat Sheet

▪ DB Tier
  • < 99% SLA
      ▪ Utilize a spare server (can be shared with other tiers)
      ▪ Trade-off: cost savings versus recovery time
  • > 99% SLA
      ▪ Utilize a form of clustering (Veritas Cluster, Data Guard, RAC*)
      ▪ (RAC) Can provide scalability and failover
      ▪ (RAC) Overhead for additional servers
      ▪ *OTM is supported on RAC, but has limitations.
Designing OTM HA Architectures – Cheat Sheet

▪ Additional Notes
  • Don’t forget DB storage!
      ▪ Redundancy for your DB also requires redundant storage connectivity (Fibre Channel, iSCSI, etc.)
  • > 99.9% SLA
      ▪ You will need a customized architecture. I recommend consulting a professional.
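The cheat sheet can be summarized as a small lookup keyed by tier and SLA target. The sketch below simply restates the slides in code; the `recommend` function and the dictionary layout are hypothetical, and treating a target of exactly 99% as the lower band is an assumption, since the slides only give "< 99%" and "> 99%".

```python
CHEAT_SHEET = {
    "web": {
        "spare": "Spare server (can be shared with other tiers)",
        "clustered": "Hardware load balancer with sticky sessions",
    },
    "app": {
        "spare": "Spare server (can be shared with other tiers)",
        "clustered": "OTM 'High Scalability' clustering",
    },
    "db": {
        "spare": "Spare server (can be shared with other tiers)",
        "clustered": "Clustering (Veritas Cluster, Data Guard, or RAC with limitations)",
    },
}

CUSTOM = "Customized architecture; consult a professional"


def recommend(tier: str, sla_pct: float) -> str:
    """Return the cheat-sheet recommendation for a tier and SLA target."""
    if tier not in CHEAT_SHEET:
        raise ValueError(f"unknown tier: {tier!r}")
    if sla_pct > 99.9:
        return CUSTOM
    if sla_pct > 99.0:
        return CHEAT_SHEET[tier]["clustered"]
    # Assumption: exactly 99% falls in the lower band (the slides do not say).
    return CHEAT_SHEET[tier]["spare"]


if __name__ == "__main__":
    for tier in ("web", "app", "db"):
        print(f"{tier} @ 99.5%: {recommend(tier, 99.5)}")
```

Whatever the lookup returns for the DB tier, the storage note above still applies: the database is only as redundant as its storage and storage connectivity.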
Any Specific Questions?



www.MavenWire.com
Thank You!
