Introduction to Architecture Operational Aspects


When doing architecture, one of the most often neglected aspects is the operational architecture. It is nevertheless very important to understand what non-functional requirements (also called "ilities") are. In the presentation below, we briefly define: Reliability, Availability and SLA, High Availability, Single Point Of Failure (SPOF), Transaction, CAP Theorem, Scalability, Performance (in general and web application performance) and Clustering.



  1. 1. Introduction to Architecture Operational Aspects William El Kaim Oct. 2016 - V 2.1
  2. 2. This Presentation is part of the Enterprise Architecture Digital Codex http://www.eacodex.com/ Copyright © William El Kaim 2016 2
  3. 3. Plan Introduction to Architecture Operational Aspects • Reliability • Availability and SLA • High Availability • Eliminating SPOF • Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 3
  4. 4. Introduction • A “functional requirement” essentially specifies something the system should do. • A “non-functional requirement” (NFR) specifies how the system should behave; it is a constraint upon the system’s behavior. • Also seen as the Operational Aspect of an architecture • Concerns centered around the runtime environment • Achieve service-level requirements • Deployment units, their connections, locations, nodes • Also called Quality of Service, “ilities”, etc. Copyright © William El Kaim 2016 4
  5. 5. Operational Aspects: Part of development lifecycle Copyright © William El Kaim 2016 5
  6. 6. Operational Aspects: Concerns Availability •Scheduled service hours •Outage costs •Speed of service recovery •Disaster recovery Process and Data integrity Standards Cost Security • Access to system / data • Threats • Controls Systems Management • Event and Log Management • Configuration Management • Security Management • Performance Management • Scheduling • Backup and Recovery Timescales Skills (User and IT) There are more disciplines / areas of concern than listed here! Data Currency Performance • Response Time • Throughput • Capacity Scalability Copyright © William El Kaim 2016 6
  7. 7. Modeling Operational Aspects • What does the Operational Model contain? • Domain analysis (actors and use cases) • Plans for achieving functional and non-functional requirements • The systems management strategy and constraints [diagram: Requirements, IT Architecture Design, Architecture Overview Diagram, Current IT Environment, Interaction Diagram, Detailed Design] Copyright © William El Kaim 2016 7
  8. 8. Operational Model: Terminology • RAS • Reliability • Availability • Serviceability • RAS terms center entirely around uptime • Originally used by mainframe vendors • RASP • Reliability • Availability • Scalability • Performance • RASP terms describe both uptime and scalable performance Copyright © William El Kaim 2016 8
  9. 9. Operational Model: Takeover • The operational aspect of architecture • Documents the placement of the solution's components • Outlines the systems management aspects of the solution • Focuses on runtime systems design • Developed in concert with the Component Model • Documented via several views, developed in parallel and iteratively through the phases of an engagement Copyright © William El Kaim 2016 9
  10. 10. Plan • Introduction to Architecture Operational Aspects Reliability • Availability and SLA • High Availability • Eliminating SPOF • Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 10
  11. 11. Reliability • Defined as the probability of a failure within a given time period (or, how frequently a failure occurs) • Failure rate is often denoted by lambda (λ) • Typically measured as • MTBF: Mean Time Between Failures • FIT rate: Failures In Time (failures per 1,000,000,000 device-hours) Copyright © William El Kaim 2016 11
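The two metrics above are interchangeable under the usual constant-failure-rate assumption (failure rate λ = 1/MTBF). A minimal sketch of the conversion, with illustrative numbers:

```python
# Reliability metrics from the slide, under the usual constant-failure-rate
# assumption: failure rate lambda = 1/MTBF, and the FIT rate counts
# expected failures per 10^9 device-hours.
def fit_from_mtbf(mtbf_hours: float) -> float:
    return 1e9 / mtbf_hours

def mtbf_from_fit(fit: float) -> float:
    return 1e9 / fit

print(fit_from_mtbf(1_000_000))  # a 1,000,000-hour MTBF is 1000 FIT
```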
  12. 12. Plan • Introduction to Architecture Operational Aspects • Reliability Availability and SLA • High Availability • Eliminating SPOF • Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 12
  13. 13. Availability • Availability means the system is open for business • Business for a retail store means Customers are browsing and buying • Most retail stores have planned downtime for holidays, inventory or just close during off- peak hours like late night or early morning Copyright © William El Kaim 2016 13
  14. 14. Dimensions of Availability Functionality PerformanceData Accuracy Is the data provided by the system accurate and complete? Does the system do what it is supposed to do? Does the system function within the acceptable performance criteria? Copyright © William El Kaim 2016 14
  15. 15. Availability • Availability can be expressed numerically as the percentage of the time that a service is available for use. • Percentage of availability = (total elapsed time – sum of downtime)/total elapsed time • Influencing Factors • MTBF: Mean Time Between Failure • MTTR: Mean Time To Recovery Copyright © William El Kaim 2016 15
  16. 16. Availability • Defined as the percentage of time that an application is processing requests • Measured in terms of uptime • typically in “nines” (99.999% is five nines) • Availability (%) vs. downtime per year: 99% → 3.65 days; 99.9% → 8.76 hours; 99.99% → 52 minutes; 99.999% → 5 minutes; 99.9999% → 30 seconds Copyright © William El Kaim 2016 16
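Both the "nines" table and the MTBF/MTTR influencing factors from the previous slide reduce to a few lines of arithmetic; a minimal sketch:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 8760 hours, non-leap year

def downtime_per_year_seconds(availability_pct: float) -> float:
    """Seconds of downtime per year at a given availability percentage."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)

def availability_from_mtbf_mttr(mtbf: float, mttr: float) -> float:
    """Steady-state availability from the two influencing factors."""
    return mtbf / (mtbf + mttr)

print(downtime_per_year_seconds(99.9) / 3600)   # ~8.76 hours ("three nines")
print(downtime_per_year_seconds(99.999) / 60)   # ~5.26 minutes ("five nines")
```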
  17. 17. Service Level Agreements • Define what you mean by available • The system is available when • The home page displays within 2 seconds when you navigate to the URL • You can add items to the shopping cart in 1 second or less • You can purchase items in your shopping cart using a credit card in 15 seconds or less • Your definition should be testable with automated tools or third party vendors Copyright © William El Kaim 2016 17
  18. 18. Availability Requires People • People are the biggest cause of downtime • Organization - ensure skills are available or on call when required • Procedures - Operators need correctly documented, tested and maintained procedures Copyright © William El Kaim 2016 18
  19. 19. Reliability vs. Availability • Reliability generally has the greater impact on end-user perception • Frequent failures are irritating • Good reliability, bad availability • Infrequent, but potentially major, downtimes • e.g. electrical power grid • Availability generally has the greater impact on operations • Extended downtime can cripple a business • Bad reliability, good availability • Frequent, but minor, failures • e.g. mobile communications • If the application is fault tolerant, and failover is instantaneous, then Reliability and Availability can generally be treated as a single objective • The application is reliable as long as there is a server to failover to • The application is available as long as there is at least one server up Copyright © William El Kaim 2016 19
  20. 20. Plan • Introduction to Architecture Operational Aspects • Reliability • Availability and SLA High Availability • Eliminating SPOF • Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 20
  21. 21. High Availability • HA refers to application or service availability targets of 99.9 percent or higher availability. • In contrast, the Service Availability Forum defines HA applications or services as applications or services with an availability objective of “five nines,” i.e., 99.999 percent. • For an application that can be accessed at any time, the former definition (99.9 percent) implies unscheduled downtime of 8.76 hours (525.6 minutes) and availability of 8751.24 hours per year (given 8760 hours in a non-leap year). • This is equivalent to a few unscheduled outages in a year and a restoration of service in minutes. • In order to provide this level of availability, an application or service requires a set of HA technologies, IT processes, and services supporting HA (the focus of this paper), as well as an IT organization that supports HA. Copyright © William El Kaim 2016 21
  22. 22. High Availability • The first critical aspect of HA management is the understanding and documenting of customer requirements for availability. • Understanding the business requirements clearly can help minimize overinvestment in areas that do not add needed value • Reaching this understanding can be a joint effort of the availability management, service level management and service financial management, requirements engineering, and architecture teams. Copyright © William El Kaim 2016 22
  23. 23. High Availability: Four Key Goals (KGI) • Based on experience and industry trends, there are four key goals associated with HA: 1. Maximizing or extending application or service uptime, i.e., mean time between service failures (MTBSF); 2. Eliminating or minimizing the impact of service-related incidents by detecting and resolving component incidents before they impact application or service availability; 3. Minimizing unplanned or unscheduled downtime of applications or services, i.e., mean time to recover service (MTTRS); 4. Eliminating or minimizing planned or scheduled downtime (i.e., downtime for changes, releases, and maintenance work). • These goals (KGIs, a term introduced in COBIT 3) • are consistent with ITIL V3 service design documentation • are related to continuous operations (CO) and continuous availability (HA + CO). Copyright © William El Kaim 2016 23
  24. 24. High Availability: Four KGI and IT Processes Copyright © William El Kaim 2016 24
  25. 25. High Availability: Stages of the services lifecycle Copyright © William El Kaim 2016 25
  26. 26. High Availability: Service Strategy • Service strategy helps in defining availability requirements and rationalizing expenditures for improving service availability by detailing the relationship between service, IT, functional, and business strategy. • As an element in service strategy, service portfolios can be grouped into service tiers, with each tier having its own set of service-level objectives (SLOs) and service-level requirements (SLRs). • The service targets for each SLO may vary by service tier. • Service tiers can in turn include availability tiers, with key differences in their availability objectives. • Key SLOs associated with service availability can help with gathering and documenting service availability requirements. Copyright © William El Kaim 2016 26
  27. 27. High Availability: Service Design • Service design involves determining and documenting service requirements and designing services to meet or exceed a set of functional and nonfunctional requirements. • Availability management is an IT process that is part of the service design stage of the service lifecycle. • Service design is directly responsible for using availability architecture patterns, technologies, and standards, in both the application design and technology infrastructure design processes. • The ITIL version 3 service design concept is a critical change from ITIL version 2. • In V2, availability management was part of service delivery. • By moving it to service design, ITIL version 3 makes it clear that waiting until service delivery to plan service levels, availability levels, capacity levels, continuity plans, security plans, and financial plans will not result in an efficient design. Copyright © William El Kaim 2016 27
  28. 28. High Availability: Service Transition • Service transition moves the service package into operational mode and involves the development of the base configuration information and knowledge management related to the service. • This includes documentation of the service architecture, service related operational procedures, and other service specific documentation. • Involves the testing, evaluation, and validation of the service in a pre-production environment including testing, evaluation, and validation of HA technologies and capabilities. • Change, release, and transition planning must also be performed, including operational readiness and final production deployment. • Processes employed in service transition include asset configuration and knowledge management, change management, transition planning and support, release and deployment management, and service testing, validation, and evaluation processes. Copyright © William El Kaim 2016 28
  29. 29. High Availability: Service Operation • Service operations in production involve operational activities such as: • Event and Incident management, which is critical for MTBSF, MTBCF, and MTTRS. • Post-deployment configuration audits, including HA configuration audits. • Post-deployment operational audits, such as change audits and maintenance audits. • Advanced change management capabilities and change models for HA services and applications. • Advanced release management capabilities and release models for HA services and applications. • Post-deployment operational work also involves day-to-day maintenance activities, both reactive and proactive, operational, infrastructural, and minor code-related changes, and major releases. Copyright © William El Kaim 2016 29
  30. 30. High Availability: Service Improvement • Service improvement includes the development and implementation of service, application, and infrastructure availability improvement plans. • Availability improvement plans can be based • On thorough availability architecture analysis (i.e., identifying gaps between current availability capabilities and target availability architecture) or • On the ad hoc development and implementation of service, application, infrastructure, and operational architectural improvements as they relate to and impact service availability Copyright © William El Kaim 2016 30
  31. 31. High Availability: Process and Tools Copyright © William El Kaim 2016 31
  32. 32. High Availability: Architecture Patterns Copyright © William El Kaim 2016 32
  33. 33. High Availability: Measuring HA • Question to be answered • What percentage of the time is the application usable? • For HA, it's measured as the number of nines • e.g. 99.999% is five nines • Calculated using simple probability Copyright © William El Kaim 2016 33
  34. 34. High Availability: Measuring HA Example • Hardware • Cluster of 8 2-CPU servers, 99% uptime each • Question • What is the availability of each configuration if a total of 8 CPUs are required to service user requests within SLA requirements? • Solution • If a total of 8 CPUs are required, this implies four servers are sufficient to service application requests. • For the application to fail, five of the eight servers must fail simultaneously: 0.01^5 = 1e-10 • Predicted application availability is 99.99999999% (annual downtime of ~3 ms) Copyright © William El Kaim 2016 34
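The slide's arithmetic assumes server failures are independent. A short sketch reproducing it, plus the exact binomial count for comparison (the exact figure is slightly less optimistic, because any combination of five or more failures takes the application down, not just one specific set of five):

```python
from math import comb

# 8 independent servers at 99% uptime each; the app needs at least
# 4 of them (8 CPUs' worth) to stay within SLA.
p_down = 0.01

# Slide's simplification: probability that a given set of five servers
# is down at the same time.
p_app_down = p_down ** 5                     # 1e-10
annual_downtime_ms = 8760 * 3600 * 1000 * p_app_down
print(round(annual_downtime_ms, 2))          # ~3.15 ms per year

# Exact version: the app is down when 5 or more of the 8 servers fail,
# in any combination.
p_exact = sum(comb(8, k) * p_down**k * (1 - p_down)**(8 - k)
              for k in range(5, 9))          # ~5.5e-9
```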
  35. 35. High Availability: Measuring HA – In reality … • Application availability is more complicated than it appears at first glance. • Human error is one of the biggest contributors to application downtime • Servers are rarely truly independent • A server failure may increase load on the remaining servers, triggering a cascade effect • Errors in shared components (network switches, clustering, power systems) can impact multiple servers Copyright © William El Kaim 2016 35
  36. 36. High Availability: Measuring HA – Sequential Dependency • Components connected in a chain, each relying on the previous component for availability • The total availability is always lower than the availability of the weakest link • Server 1 → Server 2 → Server 3: Availability (A) = AS1 * AS2 * AS3 Copyright © William El Kaim 2016 36
  37. 37. High Availability Measuring HA – Sequ. Dep. Example • Availability = Database * Network * Web Server * Desktop • Availability = 98% * 98% * 97.5% * 96% = 89.89% • Total Infrastructure Availability = 89.89% 98% Database Server 98% Network 97.5% Web Server 96% Desktop Copyright © William El Kaim 2016 37
  38. 38. High Availability Measuring HA – Redundant Dep. Ex. • Database Availability= 1 – ((1 – 0.98) * (1 – 0.98)) = 0.9996 • Database Availability = 99.96% • Availability = Database * Network * Server * Workstation • Availability = 0.9996 * 0.98 * 0.975 * 0.96 = 0.9169 • Total Infrastructure Availability = 91.69% • Total availability is higher than the availability of the individual links 98% Network 97.5% Web Server 96% Desktop 98% Database Servers 98% Copyright © William El Kaim 2016 38
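The sequential and redundant examples above can be checked with two small helpers; a minimal sketch reproducing the slide's numbers:

```python
from functools import reduce

def serial(*avail):
    """Chain of dependencies: total availability is the product."""
    return reduce(lambda a, b: a * b, avail)

def redundant(*avail):
    """Parallel redundancy: fails only if every copy fails."""
    return 1 - reduce(lambda a, b: a * b, [1 - x for x in avail])

# Sequential example: database * network * web server * desktop
print(round(serial(0.98, 0.98, 0.975, 0.96), 4))        # 0.8989

# Redundant example: mirror the database, then chain the rest
db = redundant(0.98, 0.98)                              # 0.9996
print(round(serial(db, 0.98, 0.975, 0.96), 4))          # 0.9169
```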
  39. 39. High Availability Measuring HA – Reality • If an application has three tiers with 99% uptime each, what’s the up-time for the application? • Measure probability as being up • Probability = .99 * .99 * .99 • 97% uptime = not even two nines! • The application availability is not even as good as the weakest link when adding a tier • even if it is more reliable than the other tiers, adding a tier will always reduce application availability Copyright © William El Kaim 2016 39
  40. 40. High Availability: Synthesis • Use redundancy with failover to increase availability by eliminating Single Points Of Failure (SPOFs) • Decouple tiers and, as much as possible, make each tier self-sufficient (or at least fail gracefully) • Keep common sense • Not all applications need HA: it is expensive! • There is still room for human error and unavoidable downtime (e.g. certain upgrades) Copyright © William El Kaim 2016 40
  41. 41. Plan • Introduction to Architecture Operational Aspects • Reliability • Availability and SLA • High Availability Eliminating SPOF • Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 41
  42. 42. Eliminating SPOFs: Introduction • SPOF = Single Point Of Failure • Whenever a single server can die and take down the application (or part of an application), that server is a SPOF • Eliminating SPOFs increases application availability • When a working system can take over for a failed system, that is called failover • A system that can fail over is not a SPOF Copyright © William El Kaim 2016 42
  43. 43. Component Redundancy • Eliminates single points of failure • Active/Active configuration • Example: Web Farm • Active/Passive configuration • Example: Cluster of SQL Servers • Use High Availability Patterns [diagram: load balancer in front of redundant nodes, active/active and active/passive topologies] Copyright © William El Kaim 2016 43
  44. 44. Eliminating SPOFs N-Tiers Architecture • Local or global load balancers are used today; can be either hardware or software based • Generally software based; load balancing can keep track of the session • Mainly software solutions: local and distributed cache management (application and data tiers) Copyright © William El Kaim 2016 44
  45. 45. Eliminating SPOFs HA Load Balancer • Local HA Load Balancers • Typically a master/slave configuration • Both Load Balancers receive all the traffic • The Load Balancers communicate directly over a dedicated cable • When the slave detects failure of the master, it assumes all responsibility for the current connections • May even be able to failover stateful connections, including HTTPS • Global Load Balancers • Used to direct traffic to a particular data center • Use an Authoritative Name Server e.g. to resolve www to particular data center • For disaster recovery, the www resolves to the primary data center unless it is down, in which case it resolves to the backup • For regional load-balancing, the www is resolved to the geographically closest data center Modern Global Load Balancers do both Global and Local balancing Copyright © William El Kaim 2016 45
  46. 46. Eliminating SPOFs HA Load Balancer • Known appliances • BigIP • Alteon • Software • Continuent (ex-EMIC) a/cluster (European Connect) • Apache has multiple load-balancing and failover plug-ins • BEA natively provides HTTP load balancing • So does IIS Copyright © William El Kaim 2016 46
  47. 47. Eliminating SPOFs HA Database • HA Databases do not normally require additional programming in the application tier • Often implemented in the JDBC driver level or below • Failover may cause current pending transactions to roll back, but with a real HA database, no previously committed transactions are lost • The most reliable HA Database configuration is master/slave • The slave server is always ready for the master to die • One-way replication may even work across datacenters Copyright © William El Kaim 2016 47
  48. 48. Eliminating SPOFs HA Database • Hardware • SAN/NAS bay with RAIDx • Software • Continuent m/cluster for MySQL (heterogeneous SQL Server, Sybase and Oracle cluster in Q3 2006) • Shared-nothing architecture: load-balance reads / broadcast writes • Oracle RAC • RAC is a kind of distributed cache on top of Oracle • Requires the same OS, same Oracle version • No load balancing, no failover within transactions • Only supports retry on new connections or retry of reads. Copyright © William El Kaim 2016 48
  49. 49. Eliminating SPOFs HA Application Tiers • Application tiers can be stateless • Stateless tiers (e.g. web servers) are HA using simple redundancy • Only problem is that statelessness in one tier usually just passes the buck to the next tier, which is almost always more expensive • Application tiers are almost always stateful • Only two things can be lost: State and in-flight requests • To achieve HA, the Application tier must either manage its state resiliently (e.g. in a clustered coherent cache) or back it up to a central store • Idempotent actions can be replayed by the web tier when a server fails Copyright © William El Kaim 2016 49
  50. 50. Eliminating SPOFs HA Application Tiers • Application cache can be implemented • in the application using the JEE JCache standard • using an open source tool (OSCache, Ehcache) • using a commercial tool (like Tangosol Coherence) • Application cache can be implemented in the application using .NET • ASP.NET application cache is a smart in-memory repository for data • Caching Application Block Copyright © William El Kaim 2016 50
  51. 51. Eliminating SPOFs Web Tier to App Tier • Load balancers slow way down if the load balancing is sticky (keeping track of session/server pairs) • The best approach is for the load balancer to round-robin or randomize its load balancing across all available web servers • But there is a good reason to keep a web tier: • Web servers (e.g. Apache, IIS, JES) can handle lots of concurrent connections, serve static content, and route requests to app servers • The web server plug-in for routing to the app server can do the sticky load balancing, guaranteeing that HTTP sessions stick! • All application servers offer clustering. Copyright © William El Kaim 2016 51
  52. 52. Plan • Introduction to Architecture Operational Aspects • Reliability • Availability and SLA • High Availability • Eliminating SPOF Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 52
  53. 53. Transactions: Introduction • A transaction is a sequence of operations that change the state of an object or collection of objects in a well defined way. • Transactions are useful because they satisfy constraints about what the state of an object must be before, after or during a transaction. • For example, a particular type of transaction may satisfy a constraint that an attribute of an object must be greater after the transaction than it was before the transaction. • Sometimes, the constraints are unrelated to the objects that the transactions operate on. • For example, a transaction may be required to take place in less than a certain amount of time. Copyright © William El Kaim 2016 53
  54. 54. Transactions: Compliance to ACID Properties • Atomicity: All-or-nothing process. • Atomicity guarantees that all operations within a transaction happen within a single unit of work • Consistency: System in consistent state. • Consistency guarantees that all transactional resources within a transaction are left in a consistent state either after the transaction succeeds and is committed or after it fails and all resources are rolled back to their previous state • Isolation: Not affected by other. • Isolation ensures that even though multiple transactions may be running in parallel they appear to be running in a serial manner • Durability: Once committed, effects persist. • Durability ensures that once a transaction has been marked as committed all information relating to the transaction has been committed to durable storage Copyright © William El Kaim 2016 54
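Atomicity and consistency can be seen concretely with any transactional store. A minimal sketch using Python's built-in sqlite3 module (the account names and the no-negative-balance rule are illustrative):

```python
# Atomicity with a real transactional store: Python's built-in sqlite3.
# The account names and the "no negative balances" rule are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Debit and credit must succeed or fail as one unit of work.
    conn.execute("UPDATE account SET balance = balance - 150 WHERE name = 'alice'")
    (bal,) = conn.execute(
        "SELECT balance FROM account WHERE name = 'alice'").fetchone()
    if bal < 0:
        raise ValueError("insufficient funds")
    conn.execute("UPDATE account SET balance = balance + 150 WHERE name = 'bob'")
    conn.commit()
except ValueError:
    conn.rollback()   # atomicity: the debit is undone as well

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)
```

After the rollback, neither the debit nor the credit survives: both rows still hold their original values, which is exactly the all-or-nothing behavior the slide describes.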
  55. 55. Transactions: XA Protocol • Open Group's X/Open Distributed Transaction Processing (DTP) model • Defines how an application program uses a transaction manager to coordinate a distributed transaction across multiple resource managers • Any resource manager that adheres to the XA specification can participate in a transaction coordinated by an XA-compliant transaction manager, thereby enabling different vendors' transactional products to work together. • All XA-compliant transactions are distributed transactions • XA supports both single-phase and two-phase commit • The transaction manager is responsible for making the final decision either to commit or rollback any distributed transaction. • For the transaction to commit successfully all of the individual resources must commit successfully; if any of them are unsuccessful, the transaction must roll back in all of the resources. Copyright © William El Kaim 2016 55
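The commit rule described above (all resource managers must vote to commit, otherwise everything rolls back) can be sketched as a toy two-phase commit coordinator; the class and method names are illustrative, not the XA API:

```python
# Toy two-phase commit coordinator; illustrative only, not the XA API.
class Participant:
    def __init__(self, can_commit: bool):
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self) -> bool:          # phase 1: vote commit/abort
        return self.can_commit

    def commit(self) -> None:           # phase 2a: make changes durable
        self.state = "committed"

    def rollback(self) -> None:         # phase 2b: undo changes
        self.state = "rolled_back"

def two_phase_commit(participants) -> bool:
    votes = [p.prepare() for p in participants]   # collect every vote first
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:                        # one "no" aborts everyone
        p.rollback()
    return False
```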
  56. 56. Transactions: Tools • JTA (Java Transaction API) required for XA style transactions. An XA transaction involves coordination among the various resource managers, which is the responsibility of the transaction manager. • JTA specifies standard Java interfaces between the transaction manager and the application server and the resource managers. Copyright © William El Kaim 2016 56
  57. 57. Transactions: XA and SOA • Today some technologists are even positioning the enterprise service bus (ESB) as a standard mechanism for integrating systems with heterogeneous data interfaces. • While ESB and Web services can clearly be used to move data between disparate data sources, I would not recommend using a set of Web services to implement a distributed transaction if the transaction requirements could be achieved by using XA, even if the enabling Web services technology supported WS-Transaction (WS-TX). • The one advantage Web services have over XA is that XA is not a remote protocol. Copyright © William El Kaim 2016 57
  58. 58. Transactions: WS-TX for SOA • WS-Transaction is the name of the OASIS group that is currently working on the transaction management specification. WS-TX is the name of the committee, and they are working on three specs: • WS-Coordination (WS-C) - a basic coordination mechanism on which protocols are layered; • WS-AtomicTransaction (WS-AT) - a classic two-phase commit protocol similar to XA; • WS-BusinessActivity (WS-BA) - a compensation based protocol designed for longer running interactions, such as BPEL scripts. • In practice, WS-AT should be used in conjunction with XA to implement a true distributed transaction. • WS-TX essentially extends transaction coordinators, such as OTS/JTS and Microsoft DTC to handle transactional Web services interoperability requirements. Copyright © William El Kaim 2016 58
  59. 59. Transactions: Birth of XTP • XTP = Extreme Transaction-Processing Platform • Traditional online transaction processing (OLTP) architectures and products are wearing thin when it comes to supporting the growing transactional workloads generated by modern service oriented and event-driven architectures (SOAs and EDAs) • Users are looking for alternatives based on low-cost commodity hardware and modern software. Copyright © William El Kaim 2016 59
  60. 60. Transactions: XTP Platform • An XTPP will be characterized by the following features: • A cohesive programming model supporting the development paradigms offered by the containers • Event-processing and service containers to enable development and execution of rich applications supporting even the most-complex requirements • Flow management container to enable application development and execution through composition of loosely coupled components (services or event handlers) • A batch container to support batch and high-performance computing (HPC)-style applications • A common distributed transaction manager, leveraged by the application containers for supporting transaction integrity in highly distributed architectures Copyright © William El Kaim 2016 60
  61. 61. Transactions: XTP Platform • A high-performance computing fabric, a communication and data-sharing infrastructure combining enterprise service bus and distributed caching mechanisms to support fast event propagation, service request dispatching and transactional data sharing between XTP application components and external applications. • Tera-architecture support to manage transparent and dynamic application and system components deployment and execution over large clusters of Linux, Unix or Windows servers but also pervasive computing processors. • Development tool, security, administration and management capabilities Copyright © William El Kaim 2016 61
  62. 62. Transactions: XTP Platform Copyright © William El Kaim 2016 62
  63. 63. Transactions: XTP Platform Vendors • IBM WebSphere XD • an add-on product for several WebSphere (and non-IBM alike) products, providing distributed caching, stream processing, a batch framework, virtualization and other XTP features; it has announced support for OSGi in the WebSphere family and is one of the strongest SCA supporters. • Oracle declared on many public occasions that XTP was an area of strategic investment • in March 2007, it acquired Tangosol • Oracle also announced that it will introduce SCA and OSGi support in the next release of Oracle Fusion Middleware. • Oracle bought BEA and Sun • Microsoft's Windows Workflow Foundation • Layers a flow management, event-driven programming model atop the "classic," client/server-oriented .NET environment. Copyright © William El Kaim 2016 63
  64. 64. Transactions: XTP Platform Vendors • Tibco ActiveMatrix • Hybrid combination of container technology (POJO, Java EE and .NET), policy management and core enterprise service bus technologies. • Red Hat • acquired Mobicents, an open-source, JSLEE 1.0-compliant event-driven application platform • Mobicents runs on the microkernel foundation of the Java EE-based JBoss Application Server. • E2E Technologies provides E2E Bridge • Hybrid combination of ESB, flow management, service and event containers providing a UML-based programming model Copyright © William El Kaim 2016 64
  65. 65. Transactions: XTP Platform Vendors • GigaSpaces • Extreme Application Platform (XAP) — a platform middleware product combining Java, Spring, JavaSpaces, OSGi and a Java EE subset (JDBC, JMS and JCA) meant to address analytical and transactional applications. • Several grid-based application platform vendors (Appistry, Majitek and Paremus) have announced support for Spring and are extending their platforms for event-driven programming. • Event-driven application platform vendors (Kabira, jNetX, OpenCloud and WareLite) are beginning to move out of the traditional telecommunications and financial service sectors into other verticals (such as retail and defense). Copyright © William El Kaim 2016 65
  66. 66. Plan • Introduction to Architecture Operational Aspects • Reliability • Availability and SLA • High Availability • Eliminating SPOF • Transaction CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 66
  67. 67. CAP theorem • What goals might you want from a shared-data system? • Strong Consistency: all clients see the same view, even in presence of updates • High Availability: all clients can find some replica of the data, even in the presence of failures • Partition-tolerance: the system properties hold even when the system is partitioned Copyright © William El Kaim 2016 67
  68. 68. Brewer’s CAP Theorem • The Consistency, Availability, and Partition Tolerance (CAP) theorem states that for any system sharing data, it is “impossible” to guarantee simultaneously all of these three properties: • Consistency: all copies have same value 1. Strong consistency – ACID (Atomicity, Consistency, Isolation, Durability) 2. Weak consistency – BASE (Basically Available Soft-state Eventual consistency) • Availability: reads and writes always succeed • Partition-tolerance: system properties (consistency and/or availability) hold even when network failures prevent some machines from communicating with others • You can have at most two of these three properties for any shared-data system Copyright © William El Kaim 2016 68
  69. 69. ACID vs. CAP • ACID: A DBMS is expected to support "ACID transactions," processes that are: • Atomicity: either the whole process completes or none of it does • Consistency: only valid data are written • Isolation: operations appear to execute one at a time • Durability: once committed, it stays that way • CAP • Consistency: all nodes in the cluster see the same data • Availability: the cluster always accepts reads and writes • Partition tolerance: guaranteed properties are maintained even when network failures prevent some machines from communicating with others Copyright © William El Kaim 2016 69
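The ACID side can be demonstrated with any transactional store; a minimal sketch using Python's built-in sqlite3 module (the table, account names and amounts are made up for illustration):

```python
import sqlite3

# In-memory database; each transaction either commits fully or rolls back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates happen, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Consistency check: only valid data (no negative balance) is written
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 60)   # succeeds
transfer(conn, "alice", "bob", 60)   # fails and rolls back: alice would go negative
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 40, 'bob': 60}
```

The second transfer leaves no partial update behind: the debit that ran inside the failed transaction is undone by the rollback.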
  70. 70. CAP and Databases Visual Guide Source: http://blog.beany.co.kr/archives/275 Copyright © William El Kaim 2016 70
  71. 71. CAP theorem: Lessons learned from Amazon • Most legacy application servers and relational database systems are built with consistency as their primary target, while big shops really need high availability. • A transaction protocol like two-phase commit is rarely an appropriate choice when scalability needs are large. • To scale up, what you really need are: • Asynchronous, stateless services, together with a good reconciliation and compensation mechanism in case of errors • An adapted data model (dramatic impact on performance) Copyright © William El Kaim 2016 71
  72. 72. Plan • Introduction to Architecture Operational Aspects • Reliability • Availability and SLA • High Availability • Eliminating SPOF • Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 72
  73. 73. Scalability • Defined in terms of the impact on throughput as additional hardware resources are added • Adding CPUs/RAM to a server: scaling up • Adding servers to a cluster: scaling out • Measured with a scaling factor (SF) • The ratio of new capacity to old capacity as resources are increased • If doubling CPUs results in 1.9x throughput, then the scaling factor is 1.9, and the adjusted SF/CPU is 0.95 • The ideal SF/CPU is 1.0, a.k.a. linear scalability Copyright © William El Kaim 2016 73
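The scaling-factor arithmetic above can be sketched directly (the throughput numbers are hypothetical):

```python
def scaling_factor(old_throughput, new_throughput, old_cpus, new_cpus):
    """Ratio of new capacity to old capacity as resources are increased."""
    sf = new_throughput / old_throughput
    resource_ratio = new_cpus / old_cpus
    # Adjusted per-resource factor: 1.0 means perfectly linear scalability
    return sf, sf / resource_ratio

# Hypothetical numbers: doubling CPUs yields 1.9x throughput
sf, sf_per_cpu = scaling_factor(1000, 1900, old_cpus=4, new_cpus=8)
print(sf, sf_per_cpu)  # 1.9 0.95
```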
  74. 74. Scaling up vs. Scaling out • IS professionals typically add capacity to computer systems by scaling up. • When response time starts to degrade because of additional workload or higher database capacities, the straightforward answer to the immediate performance problem is to add bigger, faster hardware. • Extrapolating from Moore's Law, which states that hardware performance doubles every 18 months, you might conclude that scaling up is an adequate solution to handle growth for the foreseeable future. • However, you'll soon realize that Murphy's Law precludes Moore's Law. Copyright © William El Kaim 2016 74
  75. 75. Scaling up vs. Scaling out • Although the current 8-way SMP systems equipped with high-speed Storage Area Network (SAN) storage arrays provide tremendous scalability, they also bring to light several other scalability problems. • First, when a system reaches a certain point, further scaling up becomes prohibitively expensive. • Second, even with Moore's Law in full effect, you can't scale beyond a certain point—at least until vendors release the next generation of hardware. • Even beyond the hardware problems, you'll probably encounter software hurdles when you're trying to scale up. • Software systems such as databases have internal mechanisms that handle locking and other multi-user database issues. • These software structures have limited efficiency, and these limits typically become the real governing impediments to continued upward scalability. Copyright © William El Kaim 2016 75
  76. 76. Scaling up vs. Scaling out • Thus, you don't see SMP performance graphs continuing to demonstrate linear upward scalability as you add more processor power. • At some point, the curve always begins to flatten. At the upper reaches of that curve, you'll find that you need very expensive hardware upgrades to get very small performance improvements. • That's where scaling out comes into the game. Copyright © William El Kaim 2016 76
  77. 77. Scaling up vs. Scaling out • Scaling out can provide an effective answer to the problems of the scale-up scenario by using the shared-nothing architecture. • Essentially, shared-nothing architecture means that each system operates independently. • Each system in the cluster maintains separate CPU, memory, and disk storage that other systems can't directly access. • To address capacity issues by scaling out, you add more hardware, not bigger hardware. • When you scale out, the absolute size and speed of a single system doesn't limit total capacity. • Shared-nothing architecture also skirts the software bottleneck by providing multiple multi-user concurrency mechanisms. • Because the workload is divided among the servers, total software capacity increases. Copyright © William El Kaim 2016 77
  78. 78. Scaling up vs. Scaling out • Although scaling out provides great answers to the inherent limitations in scale-up architecture, this method is no stranger to Murphy's Law, either. • At this point in the technology lifecycle, scaling out requires increased management overhead that is potentially as great as the performance gains it offers. • Even so, scaling out might be a viable solution for database implementations that have reached the limits of SMP scalability. Copyright © William El Kaim 2016 78
  79. 79. Scalability for the Database Tier • Applications that go to the database for each request will likely have scalability problems • The database tier is difficult and expensive to scale; it is difficult to scale a database server beyond a single host, and it becomes exponentially more expensive to add CPUs • Database servers scale sub-linearly at best with additional CPUs, and there is a CPU limit Copyright © William El Kaim 2016 79
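One common way to model this sub-linear behavior (not named on the slide, but a standard tool) is Amdahl's law: any serial fraction of the work, such as internal locking and latching in a database engine, caps the speedup no matter how many CPUs are added. A sketch:

```python
def amdahl_speedup(n_cpus, serial_fraction):
    """Amdahl's law: speedup is capped by the serial (non-parallelizable)
    fraction of the work, e.g. lock contention inside a database engine."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

# With even 10% serial work (locking, latching), returns diminish quickly:
for n in (1, 2, 4, 8, 16, 64):
    print(n, round(amdahl_speedup(n, 0.10), 2))
# 64 CPUs yield only about an 8.8x speedup, nowhere near linear
```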
  80. 80. Super-Linear Scalability • It is possible to exceed an SF of 1.0 • With two disks, reduced head contention can increase the throughput of sequential I/O by even 100x • Similar effects can occur with CPU caches and context switches • Large cluster-aggregated data caches can offer super-linear scale by significantly increasing the hit rate, reducing the average data access cost • Can also be explained as a super-linear slowdown as resources are reduced (i.e. the converse) Copyright © William El Kaim 2016 80
  81. 81. Plan • Introduction to Architecture Operational Aspects • Reliability • Availability and SLA • High Availability • Eliminating SPOF • Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 81
  82. 82. Performance • Performance is the degree to which a software system or component meets its objectives for timeliness • Typically measured as time (wall clock) elapsed between request and response • Elapsed time is also known as latency • Web apps are often measured on the server side as time to last byte (TTLB) • Common Performance Bottlenecks • Bad algorithms (e.g. lots of iterating; think big-O notation) • Rendering, i.e. printing on a page, changing what's displayed on a website • Common issues listed here: https://developer.yahoo.com/performance/rules.html • Latency in communication between servers/different bits of hardware • HTTP requests, IO operations • Badly written database queries Copyright © William El Kaim 2016 82
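Measuring elapsed wall-clock time (latency) around a suspect operation is the usual first step; a minimal sketch using a Python timing decorator (slow_query is a made-up placeholder for a real bottleneck):

```python
import time
from functools import wraps

def timed(fn):
    """Measure wall-clock elapsed time (latency) between request and response."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        value = fn(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{fn.__name__}: {elapsed_ms:.1f} ms")
        return value
    return wrapper

@timed
def slow_query():
    # Stand-in for a badly written database query or an I/O-bound call
    time.sleep(0.05)
    return "rows"

result = slow_query()  # prints something like: slow_query: 50.3 ms
```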
  83. 83. Scalability vs. Performance • Users are affected by poor performance • Poor performance is usually a result of poor scalability • Operating costs and capacity limitations are caused by poor scalability • Capacity is the space, computer hardware, software and connection infrastructure resources that will be needed over some future period of time. • Designing for scalability often has a negative impact on single-user performance • Building in the ability to scale out has overhead • But single-user performance doesn't often matter! • Once the maximum sustainable request rate is exceeded, performance will degrade • End-user apps will degrade in a linear fashion as the request queue backs up • Automated applications will degrade exponentially Copyright © William El Kaim 2016 83
  84. 84. Scalable Performance • Scalable performance is NOT focused on making an application faster; rather, it is focused on ensuring that application performance does not degrade beyond defined boundaries as the application gains additional users, how resources must grow to ensure that, and how one can be certain that additional resources will solve the problem • Scalable performance refers to overall response times for an application (SLA) that are within defined tolerances for normal use, remain within those tolerances up to the expected peak user load, and for which a clear understanding exists as to the resources that would be required to support additional load without exceeding those tolerances Copyright © William El Kaim 2016 84
  85. 85. Engineering For Performance • Build performance and scalability thinking into the development lifecycle • Define your objectives • Measure against your objectives "When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science." - Lord Kelvin (William Thomson) Copyright © William El Kaim 2016 85
  86. 86. Performance Modeling • A structured and repeatable approach to modeling the performance of your software • Similar to “Threat Modeling” in security • Begins during the early phases of your application design • Continues throughout the application lifecycle • Consists of • A document that captures your performance requirements • A process to incrementally define and capture the information that helps the teams working on your solution to focus on using, capturing, and sharing the correct information. Copyright © William El Kaim 2016 86
  87. 87. Performance modeling Process • Critical Scenarios • Have specific performance expectations or requirements. • Significant Scenarios • Do not have specific performance objectives • May impact other critical scenarios. • Look for scenarios that: • Run in parallel to a performance-critical scenario • Are frequently executed • Account for a high percentage of system use • Consume significant system resources 1. Identify Key Scenarios 2. Identify Workloads 3. Identify Performance Objectives 4. Identify Processing Steps 5. Allocate Budget 6. Evaluate 7. Validate Iterate Copyright © William El Kaim 2016 87
  88. 88. Performance modeling Process • Workload is usually derived from marketing data • Total users • Concurrently active users • Data volumes • Transaction volumes and transaction mix • Identify how this workload applies to an individual scenario • Support 100 concurrent users browsing • Support 10 concurrent users placing orders. 1. Identify Key Scenarios 2. Identify Workloads 3. Identify Performance Objectives 4. Identify Processing Steps 5. Allocate Budget 6. Evaluate 7. Validate Iterate Copyright © William El Kaim 2016 88
  89. 89. Performance modeling Process • Performance and scalability goals should be defined as non-functional or operational requirements • Requirements should be based on previously identified workload • Consider the following: • Service level agreements • Response times • Projected growth • Lifetime of your application 1. Identify Key Scenarios 2. Identify Workloads 3. Identify Performance Objectives 4. Identify Processing Steps 5. Allocate Budget 6. Evaluate 7. Validate Iterate Copyright © William El Kaim 2016 89
  90. 90. Define Your Objectives • Performance and scalability goals should be defined as non-functional or operational requirements • Requirements should be based on expected use of the system • Compare to previous versions or similar systems • Throughput (how many?): measured by requests per second; impacts number of servers • Response time (how fast?): measured by client latency; impacts customer satisfaction • Resource utilization (how much?): measured by % of resource; impacts hardware/network • Workload (how many concurrent requests?): measured by concurrent requests for the system; impacts scalability and concurrency Copyright © William El Kaim 2016 90
  91. 91. Define Your Objectives • Objectives must be SMART • S – Specific • M – Measurable • A – Achievable • R – Results Oriented • T – Time Specific • Bad: "application must run fast", "page should load quickly" • SMART: "3 second response time on home page with 100 concurrent users and < 70% CPU", "25 journal updates posted per second with 500 concurrent users and < 70% CPU" "If you cannot measure it, you cannot improve it." - Lord Kelvin Copyright © William El Kaim 2016 91
  92. 92. Build an Objective • Browse home page: client latency 3 seconds; 50 requests per second; 100 concurrent users; < 60% CPU utilization • Search catalog: client latency 5 seconds; 10 requests per second; 100 concurrent users; < 60% CPU utilization Copyright © William El Kaim 2016 92
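Objectives like these can be encoded and checked automatically against load-test results; a sketch with hypothetical scenario names and made-up measurements:

```python
# Hypothetical targets mirroring the scenarios above:
# max client latency (s), min requests/second, max CPU utilization (%)
objectives = {
    "browse_home": {"latency_s": 3.0, "throughput_rps": 50, "cpu_pct": 60},
    "search_catalog": {"latency_s": 5.0, "throughput_rps": 10, "cpu_pct": 60},
}

def validate(scenario, measured):
    """Return the list of metrics that miss their target for a scenario."""
    target = objectives[scenario]
    failures = []
    if measured["latency_s"] > target["latency_s"]:
        failures.append("response time")
    if measured["throughput_rps"] < target["throughput_rps"]:
        failures.append("throughput")
    if measured["cpu_pct"] > target["cpu_pct"]:
        failures.append("resource utilization")
    return failures

# Made-up measurements from a load-test run:
print(validate("browse_home", {"latency_s": 2.1, "throughput_rps": 55, "cpu_pct": 72}))
# ['resource utilization']
```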
  93. 93. Performance modeling Process • Identify the steps that must take place to complete a scenario • Use cases, sequence diagrams, flowcharts etc. all provide useful input • Helps you to know where to instrument your code later • Start at a high level; don't go too low 1. Identify Key Scenarios 2. Identify Workloads 3. Identify Performance Objectives 4. Identify Processing Steps 5. Allocate Budget 6. Evaluate 7. Validate Iterate Copyright © William El Kaim 2016 93
  94. 94. Performance modeling Process • Use your performance baseline to measure how much time each processing step is taking • If you are not meeting your target, budget the time among the processing steps 1. Identify Key Scenarios 2. Identify Workloads 3. Identify Performance Objectives 4. Identify Processing Steps 5. Allocate Budget 6. Evaluate 7. Validate Iterate Copyright © William El Kaim 2016 94
  95. 95. Performance modeling Process • Run automated test scenarios and evaluate the performance against objectives • As much as possible, tests must be repeatable throughout application lifecycle 1. Identify Key Scenarios 2. Identify Workloads 3. Identify Performance Objectives 4. Identify Processing Steps 5. Allocate Budget 6. Evaluate 7. Validate Iterate Copyright © William El Kaim 2016 95
  96. 96. Performance modeling Process • Check your results against performance objectives • Leave yourself a margin early in the project to avoid early performance optimization • As you progress toward completion allow less margin 1. Identify Key Scenarios 2. Identify Workloads 3. Identify Performance Objectives 4. Identify Processing Steps 5. Allocate Budget 6. Evaluate 7. Validate Iterate Copyright © William El Kaim 2016 96
  97. 97. Plan • Introduction to Architecture Operational Aspects • Reliability • Availability and SLA • High Availability • Eliminating SPOF • Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 97
  98. 98. If a page takes too long to load, the user will go away! • 80-90% of the end-user response time is spent on the front end. Start there. • Client-side processing is virtually unexamined in most performance management programs • Not tracked by most tools • A page should not take more than 3 seconds to load • An Akamai study shows that after 4 s people cancel access to the page • Yahoo! reported a 5 to 9% abandonment rate for 400 ms lost • And now … Google will show the performance • Google will add the performance of the web site to their search results • In the future, people may well click on the quicker site • You should monitor along the transaction in real time and/or in a proactive way! Copyright © William El Kaim 2016 98
  99. 99. How To Measure Web Application Performance? Monitor Along the Transaction Copyright © William El Kaim 2016 99
  100. 100. Never Forget Network Latency • Network latency, or network response time (NRT), is a measure of the amount of time required for a packet to travel across a network path from a sender to a receiver. Copyright © William El Kaim 2016 100
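Network latency can be approximated by timing a TCP connection; a sketch that probes a throwaway local listener so it stays self-contained (in practice you would point it at a real host and port):

```python
import socket
import time

def tcp_connect_latency(host, port, timeout=2.0):
    """Approximate network round-trip latency via TCP connect time."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000  # milliseconds

# Measure against a local listener so the example is self-contained:
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
latency_ms = tcp_connect_latency("127.0.0.1", port)
print(f"connect latency: {latency_ms:.2f} ms")  # loopback, typically < 1 ms
server.close()
```

Note that connect time only captures one handshake round trip; real NRT measurement tools average many samples across the actual network path.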
  101. 101. Copyright © William El Kaim 2016 101
  102. 102. Waterfall View / Connection View Copyright © William El Kaim 2016 102
  103. 103. Performance and User Experience … • Instantaneous = 0.1 second (100 ms) or less • Limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result. • Delay noticed: 1 second • Limit for the user's flow of thought to stay uninterrupted • Normally, no special feedback is necessary during delays of more than 0.1 but less than 1 second, but the user does lose the feeling of operating directly on the data. • Loss of attention: 10 seconds • Limit for keeping the user's attention focused on the dialogue. • For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Copyright © William El Kaim 2016 103
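These perception limits can be turned into a simple classifier for a monitoring dashboard; a sketch (the category labels are our own shorthand, not standard terms):

```python
def ux_category(latency_s):
    """Classify a response time against the perception limits above."""
    if latency_s <= 0.1:
        return "instantaneous"      # no special feedback needed
    if latency_s <= 1.0:
        return "delay noticed"      # flow of thought stays uninterrupted
    if latency_s <= 10.0:
        return "attention at risk"  # give the user feedback
    return "attention lost"         # show progress and expected completion

for t in (0.05, 0.4, 3.0, 30.0):
    print(t, ux_category(t))
```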
  104. 104. Example: Search Rental Car • Variability is very important: do not look only at the average! Copyright © William El Kaim 2016 104
  105. 105. Measuring Web Application Performance • Real User Monitoring (RUM): an approach to Web monitoring that aims to capture and analyze every transaction of every user of your website or application. • A form of passive monitoring, relying on Web-monitoring services that continuously observe your system in action, tracking availability, functionality, and responsiveness. • By using local agents or small bits of JavaScript to gauge site performance and reliability from the perspective of client apps and browsers, top-down RUM focuses on the direct relationship between site speed and user satisfaction, providing valuable insights into ways you can optimize your application's components and improve overall performance. • Synthetic User Monitoring (also known as active monitoring or proactive monitoring) is done using Web browser emulation or scripted recordings of Web transactions. • Behavioral scripts (or paths) are created to simulate an action or path that a customer or end-user would take on a site. • Those paths are then continuously monitored at specified intervals for performance measures such as functionality, availability, and response time. Copyright © William El Kaim 2016 105
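A synthetic probe is essentially a scripted request with availability and response-time bookkeeping; a minimal sketch that monitors a throwaway local server (a real monitor would run such probes against your site from several locations at set intervals):

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def probe(url, timeout=5.0):
    """One synthetic check: availability and response time for a scripted step."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = (resp.status == 200)
    except OSError:
        ok = False
    return ok, (time.perf_counter() - start) * 1000  # ms

# Self-contained demo: monitor a throwaway local server instead of a real site.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"home page")

    def log_message(self, *args):  # silence request logging for the demo
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

results = [probe(url) for _ in range(3)]  # a real monitor probes at set intervals
availability = sum(ok for ok, _ in results) / len(results)
print(f"availability: {availability:.0%}")
server.shutdown()
```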
  106. 106. Measuring Web Application Performance: Tools • Manual Page Measurement • Browser Add-on: Firebug, TamperData, YSlow • Cloud Service: Keynote KITE, Google Page Speed, WebPageTest • Local Application: Microsoft Visual Roundtrip Analyzer, IE Inspector, HttpWatch and Fiddler • Real User Monitoring • BMC TrueSight, IBM Tealeaf CX, New Relic, etc. • Synthetic Monitoring • AlertSite, Appview, AppDynamics, App Synthetic Monitor, Alyvix, CloudMonit, Dynatrace, Keynote, New Relic, Pingdom, Site24x7, SiteScope, WebMetrics, etc. • Network Analysis • Wireshark, Clearsight, Microsoft Network Monitor • In-Depth Metrics • Riverbed, CA APM, etc. • Load Testing • HP/Mercury LoadRunner, Apache JMeter, etc. Copyright © William El Kaim 2016 106
  107. 107. Focus On 9 Core Metrics To Measure And Improve User Centric Performance • 9 Core Metrics • Availability • Outages • Average Download Time - Geo Mean • Time in Client Versus Time In Generation/Backend • Variability - 85th and 95th percentiles • Geographic Variability • Hourly Variability (Load Handling) • Third Party Quality • Size/Element Count/Domains • Diagnose performance over time • Yesterday, Last week, Last month and Month-To-Date, Last Year and Year-to-Date Copyright © William El Kaim 2016 107
  108. 108. Focus On 9 Core Metrics: Examples • Availability – 99.5% for multi-step transaction • Outages – 1 hour per month • Average Download Time – 1.5-2.5 s (broadband) • Time in Client Versus Time in Generation/Backend – Less than 30% of page load • Variability (85th and 95th percentiles) – No more than 1.5x the median • Geographic Variability – No more than 2x (fastest versus slowest) • Hourly Variability (Load Handling) – Less than 20% peak versus off-peak • Third Party Quality – Tags under 50 ms each (limited variability, good availability) • Size/Element Count/Domains – Depends Copyright © William El Kaim 2016 108
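The variability targets above (e.g. 95th percentile no more than 1.5x the median) are easy to check from raw samples; a sketch using the nearest-rank percentile method (the sample values are made up):

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile of a list of response-time samples."""
    ordered = sorted(samples)
    rank = round(p / 100 * len(ordered))           # nearest-rank method
    index = max(0, min(len(ordered) - 1, rank - 1))
    return ordered[index]

# Hypothetical page-load samples in seconds:
samples = [1.2, 1.3, 1.4, 1.4, 1.5, 1.6, 1.7, 1.9, 2.2, 3.8]
median = statistics.median(samples)
p95 = percentile(samples, 95)
within_target = p95 <= 1.5 * median  # target: p95 no more than 1.5x the median
print(median, p95, within_target)  # 1.55 3.8 False -> variability is too high
```

The long tail (the 3.8 s outlier) is invisible in the average but immediately flagged by the percentile check, which is exactly why the slides warn against looking only at averages.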
  109. 109. Synthesis: How To Measure Web Application Performance? • Begin with the user centric approach and monitor along the transaction • Apply competitive context and business goals to create appropriate targets • Collect 9 core Performance Management metrics • Use an ongoing, external, geographically distributed, browser based solution to collect data • Path based, key pages/function approach • Apply collected data against targets • Flag change/target exceeded • Perform diagnostic process Copyright © William El Kaim 2016 109
  110. 110. Improving Performance • Memory Cache • Improves the performance of web applications by allowing you to retrieve information from fast, managed, in-memory caches, instead of relying entirely on slower disk-based databases. • Redis: open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs. • Memcached: a widely adopted memory object caching system. • AWS ElastiCache is protocol compliant with Memcached and Redis. • Data Grid • In-memory data grids are often used with databases in order to improve performance of applications, to distribute data and computation across servers, clusters and geographies, and to manage very large data sets or high data ingest rates. • HazelCast / GridGain In-Memory Data Grid / ScaleOut StateServer / Oracle Coherence In-Memory Data Grid / XAP In-Memory Data Grid / Tibco ActiveSpaces Source: High Scalability Copyright © William El Kaim 2016 110
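The cache-aside pattern behind typical Redis/Memcached usage can be sketched with a plain dict standing in for the cache server (the TTL, key name and lookup function are illustrative only):

```python
import time

# A dict with per-key expiry stands in for Redis/Memcached (illustrative only).
cache = {}
TTL_SECONDS = 60

def slow_db_lookup(key):
    """Placeholder for a slower disk-based database query."""
    time.sleep(0.01)
    return f"value-for-{key}"

def get(key):
    """Cache-aside: try the in-memory cache first, fall back to the database."""
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                            # cache hit
    value = slow_db_lookup(key)                    # cache miss
    cache[key] = (value, time.time() + TTL_SECONDS)
    return value

get("user:42")          # miss: populates the cache
value = get("user:42")  # hit: served from memory
print(value)  # value-for-user:42
```

With a real cache server the dict operations become network calls (e.g. GET/SET with an expiry), but the read-through logic is the same.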
  111. 111. Improving Performance • HTTP Cache and reverse proxy • A web application accelerator, also known as a caching HTTP reverse proxy. You install it in front of any server that speaks HTTP and configure it to cache the contents. • Varnish Cache: a web application accelerator of this kind • Varnish API Engine: a tool that allows you to manage your APIs through one central point • Nginx: an HTTP and reverse proxy server, as well as a mail proxy server • AWS ElastiCache • Content Delivery Network • A content delivery network (CDN) is an interconnected system of cache servers that use geographical proximity as a criterion for delivering Web content. • Azure CDN, AWS CloudFront, CloudFlare, OVH CDN Source: High Scalability Copyright © William El Kaim 2016 111
  112. 112. Improving Performance • Search • Algolia, a YC-backed startup • Unlike ElasticSearch's open-source solution, it offers its proprietary search technology via a hosted model • Amazon CloudSearch • ElasticSearch (with Kibana) • Apache Solr (with Banana) • Newsfeed/Activity Streams • Stream Framework • GetStream.io • Others: Cassandra, Redis, Celery and RabbitMQ Source: High Scalability Copyright © William El Kaim 2016 112
  113. 113. Improving Performance • Message Notification Service • Faye and PubNub will enable you to be ready in a few minutes • StreamData.io transforms any JSON API into a real-time push API without a single line of server-side code • AWS SNS (Simple Notification Service): fully managed push messaging service • Event Analytics • Snowplow • PerfKit • PerfKit Benchmarker: PerfKit is unique because it measures the end-to-end time to provision resources in the cloud, in addition to reporting on the most standard metrics of peak performance • PerfKit Explorer: a visualization tool Copyright © William El Kaim 2016 113
  114. 114. Plan • Introduction to Architecture Operational Aspects • Reliability • Availability and SLA • High Availability • Eliminating SPOF • Transaction • CAP Theorem • Scalability • Performance • Measuring Web Application Performance • Clustering Copyright © William El Kaim 2016 114
  115. 115. Clustering • Clustering enables multiple servers or server processes to work together • Clustering can be used to horizontally scale a tier, i.e. scale by adding servers • Clustering usually costs much less than buying a bigger server (vertical scaling) • Clustering also typically provide failover and other reliability benefits Copyright © William El Kaim 2016 115
  116. 116. Clustering: Concepts • The less communication required, the better • Always better to be stateless in a tier if it does not cause a bottleneck in the next tier • Server farms: a stateless clustering model • The less coordination required, the better • Independence: Don’t go to the committee • Concurrency Control: Reduces scalability, so use only as necessary Copyright © William El Kaim 2016 116
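A stateless server farm needs only trivial coordination in front of it; a round-robin load-balancing sketch (the server names are hypothetical):

```python
import itertools

class RoundRobinBalancer:
    """Distribute requests across a stateless server farm: because servers keep
    no session state, any request can safely go to any server."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, request):
        server = next(self._cycle)
        return server, request  # a real balancer would forward request to server

lb = RoundRobinBalancer(["app1", "app2", "app3"])  # hypothetical server names
assigned = [lb.route(f"req-{i}")[0] for i in range(6)]
print(assigned)  # ['app1', 'app2', 'app3', 'app1', 'app2', 'app3']
```

Statelessness is what makes this dispatch so cheap; as soon as servers hold session state, the balancer needs sticky sessions or the cluster needs state replication, which is where the coordination costs on this slide come from.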
  117. 117. Clustering: Benefits • If the application has been built correctly, it supports a predictable scaling model • Clustering allows relatively inexpensive CPU and memory resources to be added to a production application in order to handle more concurrent users and/or more data • Provides redundancy • Simple (n+1) model Copyright © William El Kaim 2016 117
  118. 118. Scalability of Clustering • The Potential for Negative Scale • Single server model allows unrestricted caching • Clustering may require the disabling of caching • Two servers often slower than one! • Data Integrity Challenges in a Cluster • How to maintain the data in sync among servers • How to keep in sync with the data tier • How to failover and fail-back servers without impact Copyright © William El Kaim 2016 118
  119. Twitter http://www.twitter.com/welkaim SlideShare http://www.slideshare.net/welkaim EA Digital Codex http://www.eacodex.com/ Linkedin http://fr.linkedin.com/in/williamelkaim Copyright © William El Kaim 2016 119
