The document discusses challenges with current rule-based approaches to elasticity management in cloud applications and proposes a decentralized autonomous solution. It notes that rule-based systems require defining optimal thresholds upfront and do not scale well to large applications. The proposed approach uses reinforcement learning to allow instances to autonomously share load during critical events without a centralized controller. This could enable better placement of applications across instances and more efficient scaling decisions in dynamic cloud environments.
Autonomic Resource Provisioning for Cloud-Based Software - Pooyan Jamshidi
This document proposes using fuzzy logic and type-2 fuzzy sets to develop an autonomous resource provisioning system for cloud-based software. Current auto-scaling solutions have limitations including requiring deep application knowledge and performance modeling expertise from users. The proposed system would use fuzzy inference to map monitored performance data to scaling actions, eliminating the need for users to specify scaling parameters or policies. It would incorporate uncertainty into the modeling and use expert knowledge from multiple users to develop robust and adaptive provisioning behavior.
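The kind of fuzzy mapping described — monitored metrics in, scaling action out — can be sketched with ordinary type-1 fuzzy rules. This is a minimal illustration: the membership functions, thresholds, and rule base below are invented, and the actual proposal uses type-2 sets precisely to model uncertainty in such definitions.

```python
# Minimal type-1 fuzzy scaling sketch (illustrative only; the proposed
# system uses type-2 fuzzy sets to also capture uncertainty in the
# membership functions themselves).

def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def scaling_action(cpu_util, resp_time_ms):
    """Map monitored metrics to an instance-count delta via weighted fuzzy rules."""
    # Fuzzify the inputs (breakpoints are assumed values, not from the paper).
    load_low  = tri(cpu_util, -0.1, 0.0, 0.5)
    load_high = tri(cpu_util, 0.5, 1.0, 1.1)
    slow      = tri(resp_time_ms, 200, 1000, 1800)

    # Rule base: (firing strength, action in instances).
    rules = [
        (min(load_high, slow), +2),   # high load AND slow response -> scale out 2
        (load_high,            +1),   # high load -> scale out 1
        (load_low,             -1),   # low load -> scale in 1
    ]
    num = sum(w * a for w, a in rules)
    den = sum(w for w, _ in rules)
    return round(num / den) if den else 0  # defuzzify: weighted average of actions
```

With these assumed shapes, a node at 90% CPU serving 1-second responses fires the "scale out" rules strongly, while a near-idle node fires only the "scale in" rule.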
How do you manage performance in the cloud, in particular in Platform as a Service (PaaS) environments like Windows Azure or Heroku, where you don't have a "virtual machine" to manage?
Even in Infrastructure as a Service (IaaS) environments like Amazon EC2, there are limitations on the tools you can deploy into that environment to assist in performance management, troubleshooting, etc. (e.g., you can't deploy promiscuous-mode network sniffing tools in EC2).
James Smith from Adactus will give us an overview of cloud services as a whole, and then drill down into some of the issues they have experienced in deploying their "Pulse" Claims Management Solution into the Azure cloud (http://www.pulseclaims.com/home).
Beyond just looking at page-speed performance, he'll talk about the challenges involved in managing SLAs, cloud "support" (or lack of it!), performance troubleshooting, and the whole "performance lifecycle".
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au... - Amazon Web Services
Running your Amazon EC2 instances in Auto Scaling groups allows you to improve your application's availability right out of the box. Auto Scaling replaces impaired or unhealthy instances automatically to maintain your desired number of instances (even if that number is one). You can also use Auto Scaling to automate the provisioning of new instances and software configurations, as well as to track usage and costs by app, project, or cost center. Of course, you can also use Auto Scaling to adjust capacity as needed - on demand, on a schedule, or dynamically based on demand. In this session, we show you a few of the tools you can use to enable Auto Scaling for the applications you run on Amazon EC2. We also share tips and tricks we've picked up from customers such as Netflix, Adobe, Nokia, and Amazon.com about managing capacity, balancing performance against cost, and optimizing availability.
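The replace-unhealthy-instances-to-desired-capacity behavior described here can be pictured as a small reconciliation loop. This is a toy model for illustration, not how Auto Scaling is actually implemented:

```python
# Toy reconciliation loop mimicking how an Auto Scaling group keeps the
# number of healthy instances at the desired capacity.
import itertools

_ids = itertools.count(1)  # fake instance-id generator

def reconcile(instances, desired):
    """Terminate unhealthy instances and launch replacements.

    `instances` maps instance-id -> "healthy" or "unhealthy".
    Returns the new instance map, back at the desired capacity.
    """
    alive = {i: s for i, s in instances.items() if s == "healthy"}
    while len(alive) < desired:                 # launch replacements
        alive[f"i-{next(_ids)}"] = "healthy"
    while len(alive) > desired:                 # scale in if over capacity
        alive.pop(next(iter(alive)))
    return alive
```

Even with `desired == 1`, a failed instance is simply replaced — which is the "even if that number is one" point in the abstract.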
The goal of centralized tools & support teams is to provide the most value for our development and operations customers with the least amount of overhead. One way of optimizing this delivery of value in large enterprise engineering environments is through standardization and automation.
Learn how Bose engineers use ElectricFlow to deliver standardized and integrated build environments to teams in a matter of minutes. In this session we’ll cover:
• Effective use of a Procedure Library to abstract away complexity and standardize build procedures
• Creation and use of a Build Data Management system to handle dependencies between components
• Methods for refactoring existing automation to take advantage of advanced ElectricFlow features
• Options for enabling shared visibility into the health of all applications and commit pipelines
Steve Brodie - Electric Cloud - The Yin and Yang of DevOps Transformation - DevOps Enterprise Summit
In Chinese philosophy, Yin and Yang represent opposite yet equally necessary elements that work in harmony to maintain balance while bringing about change. In today’s software economy, DevOps represents multiple people in the value delivery chain (Dev, QA, Ops) that work in harmony to maintain balance while bringing about change. But scratching deeper, we find another key harmonic relationship – that of teams, tactics and tools. This session will focus on ways organizations can empower DevOps teams by providing the necessary support, processes and tools they need to flourish and accelerate the rate of sustainable change they unleash on the world.
Sam Fell - Electric Cloud - Faster Continuous Integration with ElectricAccele... - DevOps Enterprise Summit
This document discusses how ElectricAccelerator can dramatically accelerate software builds and tests by automatically parallelizing jobs across shared CPU clusters. It parallelizes builds, detects dependencies to ensure correctness, and utilizes infrastructure efficiently. Example use cases demonstrate accelerating builds 2.4-11.5x and tests 7.2-61x. ElectricAccelerator Huddle provides a free option for small teams.
Sam Fell - Electric Cloud - Automating Continuous Delivery with ElectricFlow - DevOps Enterprise Summit
Continuous Delivery takes Agile to its logical conclusion with a way of working that ensures software is always ready to release. It does this by building upon and extending Agile, CI and DevOps practices and tools to transform the way software is delivered.
Organizations that want to adopt Continuous Delivery need frequent check-ins to be verified by automated builds and tests so teams can reduce risk, deploy more often, and detect problems early.
This talk will focus on the ElectricFlow DevOps automation platform, and the functionality it exposes to:
- Enable Devs to automate complex build and test processes to drive efficient predictability at scale
- Give Ops teams a way to eliminate manual and error-prone processes to safely deploy any applications anywhere, anytime.
- Enable any team to securely plug in the clouds and tools they care about, abstracting away complexity and ensuring process compliance
The document discusses several issues with using utilization as a metric for measuring resource usage and performance in modern computing systems. It argues that utilization metrics are broken due to unsafe assumptions about workload characteristics, system architecture like multi-core CPUs, and measurement errors. Alternative metrics that take these factors into account, like response time and capability utilization for storage, are suggested to provide more accurate performance insights.
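The claim that raw utilization hides performance cliffs is easy to make concrete with the textbook open-queue relation R = S / (1 - rho): two utilization readings that look similar imply very different response times. This is a generic M/M/1 sketch, not necessarily the document's own example:

```python
def response_time(service_time, utilization):
    """Mean response time of an M/M/1 queue: R = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)

# A 10 ms service time gives 20 ms responses at 50% utilization, but
# 100 ms at "only 90%" - a 5x latency change from a 1.8x metric change,
# which is why response time is the more honest signal.
```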
Tanay Nagjee - Electric Cloud - Better Continuous Integration with Test Accel... - DevOps Enterprise Summit
Stop sacrificing comprehensive testing to save time
Software testers and quality assurance engineers are often pressured to cut testing time to ensure products are released on time. Usually this means running fewer tests, thus reducing software quality. This pressure is exacerbated as companies embrace a continuous integration (CI) approach which involves frequent build and test cycles, but has the side effect of further limiting the time allocated to test and analysis. Instead of reducing the number of tests in a CI cycle to reduce test time, Tanay Nagjee will discuss how entire test suites can be broken down and parallelized, reducing the time to run them by 80% or more. By leveraging a cluster of computing horsepower (either on-premise physical machines or in the cloud), large test suites can execute in a fraction of the time it takes by smartly parallelizing their individual tests. Tanay will outline a 3-step approach to achieve these results with different test frameworks. He will discuss the tools used, and will present real example data and a live demonstration.
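The suite-splitting idea generalizes beyond any one product: a common approach is greedy longest-job-first bucketing by historical test duration. The sketch below illustrates that pattern; it is not ElectricAccelerator's actual scheduler.

```python
import heapq

def partition_tests(durations, workers):
    """Greedily assign each test to the currently least-loaded worker.

    `durations` maps test name -> historical runtime in seconds.
    Returns one (total_seconds, [test names]) pair per worker.
    """
    heap = [(0.0, i, []) for i in range(workers)]  # (load, worker-id, tests)
    heapq.heapify(heap)
    # Longest tests first gives the classic LPT approximation to an
    # even split, so no worker is stuck with one huge test at the end.
    for name, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
        total, i, tests = heapq.heappop(heap)
        tests.append(name)
        heapq.heappush(heap, (total + secs, i, tests))
    return [(total, tests) for total, _, tests in sorted(heap, key=lambda x: x[1])]
```

For example, tests of 60, 30, 30, and 10 seconds on two workers yield buckets of 70 s and 60 s — close to the 65 s ideal, versus 130 s run serially.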
Rohit Jainendra - Electric Cloud - Enabling DevOps Adoption with Electric Cloud - DevOps Enterprise Summit
Join Rohit Jainendra, Chief Product Officer, as he gives you a firsthand look at how Electric Cloud products have evolved over the past year and a view into the 2015-2016 roadmap. Gain insight into new features and learn how we plan to help you and your organization adopt DevOps practices so that you can deliver better software faster.
This document discusses self-learning cloud controllers that can dynamically scale cloud resources. It notes that current auto-scaling approaches require deep application knowledge and expertise to determine scaling parameters and policies. The paper proposes a type-2 fuzzy logic approach called RobusT2Scale that uses fuzzy rules and monitoring data to determine scaling actions. It aims to handle uncertainty in elastic systems and accommodate different user preferences through fuzzy reasoning over workload and response time data. The approach pre-computes scaling decisions to enable efficient runtime elasticity control. It is evaluated based on its ability to meet an SLA target response time compared to over- and under-provisioning approaches.
This document summarizes Chaos Engineering techniques used at T-Mobile for their Cloud Foundry platform. It introduces the tools Monarch and Turbulence++ that were developed to inject failures at the infrastructure and application levels. Examples of chaos attacks demonstrated include killing VMs, blocking network traffic, and crashing application instances. The tools help test the resiliency of the platform and applications deployed on it. Limitations and potential improvements discussed include merging the two tools and supporting multiple clusters.
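Monarch and Turbulence++ are T-Mobile's own tools; the underlying pattern — remove a random member, then assert the service still answers — can be sketched generically:

```python
import random

def chaos_round(instances, handle_request, rng=random):
    """Kill one random instance, then verify the pool still serves.

    `instances` is a mutable list of live members; `handle_request`
    takes the surviving pool and returns a response, raising on outage.
    Returns (victim, response) so the experiment can be logged.
    """
    victim = rng.choice(instances)
    instances.remove(victim)          # simulate the VM being killed
    return victim, handle_request(instances)

# Example service under test: any surviving instance can answer.
def handle(pool):
    if not pool:
        raise RuntimeError("total outage")
    return "ok"
```

Running such rounds repeatedly (as the real tools do against VMs, network routes, and application instances) turns "we think it's resilient" into a repeatable test.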
Capacity Planning for Virtualized Datacenters - Sun Network 2003 - Adrian Cockcroft
Presentation I made at the Sun Network conference in 2003 on how to do capacity planning for virtualized systems, tied into the N1 product that Sun was pushing at the time. This project was structured as a Design for Six Sigma (DFSS) project.
This document discusses factors that influence web search latency from both the user and system perspectives. It summarizes that users expect fast response times from search engines, while search engines aim to balance speed, quality, and costs. The document then outlines components that contribute to latency, experiments measuring user sensitivity to latency, and the impact of latency on user search experience. Specifically, it finds users notice delays over 1000ms and that faster search sites lead to higher user engagement.
(SPOT302) Availability: The New Kind of Innovator’s Dilemma - Amazon Web Services
Successful companies, while focusing on their current customers' needs, often fail to embrace disruptive technologies and business models. This phenomenon, known as the "Innovator's Dilemma," eventually leads to many companies' downfall and is especially relevant in the fast-paced world of online services. In order to protect its leading position and grow its share of the highly competitive global digital streaming market, Netflix has to continuously increase the pace of innovation by constantly refining recommendation algorithms and adding new product features, while maintaining a high level of service uptime. The Netflix streaming platform consists of hundreds of microservices that are constantly evolving, and even the smallest production change may cause a cascading failure that can bring the entire service down. We face a new kind of Innovator's Dilemma, where product changes may not only disrupt the business model but also cause production outages that deny customers service access. This talk will describe various architectural, operational and organizational changes adopted by Netflix in order to reconcile rapid innovation with service availability.
Java/Hybris performance monitoring and optimization - EPAM Lviv
⏩The EPAM Java/Hybris webinar covered:
- performance monitoring,
- optimizing Java server applications,
- troubleshooting Java server applications,
demo: performance monitoring and optimization
⏩Speakers:
Mykhailo Drach, Hybris Software Engineer @ EPAM
Andrii Davydenko, Senior Performance Analyst @ EPAM
⏩Useful links:
▶See the difference between creating objects via BigDecimal.valueOf(value) and new BigDecimal(value) when value repeats frequently: https://epa.ms/1nUmq3
▶Video: https://youtu.be/Y7JCWVrhBm8
Autonomic Decentralised Elasticity Management of Cloud Applications - Srikumar Venugopal
This document presents an autonomic decentralized elasticity management system called ADEC for cloud applications. ADEC uses reinforcement learning where each instance independently monitors itself and learns optimal management policies over time through a reward/punishment system. Instances coordinate using a distributed hash table to provision and dynamically place applications across instances to maximize utilization while meeting response time and availability requirements. The system was evaluated on Amazon EC2 using a hotel management application under varying workloads, demonstrating ADEC's ability to independently start and shutdown instances to meet application objectives.
This document discusses autonomic decentralized elasticity management of cloud applications. It presents a reinforcement learning approach called ADEC where each instance independently monitors and manages its resources and applications using a set of simple states and actions. The instances coordinate using a distributed key-value store to optimize placement of applications across instances and elastically scale instances up and down to meet application objectives like response time thresholds. An evaluation on Amazon EC2 showed ADEC could dynamically provision instances and applications in response to changing workloads to satisfy application service level objectives with low overhead.
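The reward/punishment loop both summaries describe can be reduced to learning action values over a few coarse load states. The states, actions, and reward function below are illustrative inventions, not ADEC's actual design, and the update is a bandit-style simplification of the full reinforcement-learning formulation:

```python
import random
from collections import defaultdict

ACTIONS = (-1, 0, +1)   # scale in, hold, scale out

def train(reward, episodes=1000, alpha=0.2, seed=0):
    """Learn a value for each (load state, action) pair from a reward signal."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        state = rng.choice(("low", "ok", "high"))
        action = rng.choice(ACTIONS)          # explore uniformly
        # Nudge the stored value toward the observed reward.
        q[(state, action)] += alpha * (reward(state, action) - q[(state, action)])
    return q

def policy(q, state):
    """Each instance independently picks its best-known action."""
    return max(ACTIONS, key=lambda a: q[(state, a)])

def reward(state, action):
    """Illustrative reward: scale out when overloaded, in when idle."""
    good = {"low": -1, "ok": 0, "high": +1}[state]
    return 1.0 if action == good else -1.0
```

After training, the learned policy scales out under high load and in under low load — without any operator-specified thresholds, which is the point the summaries make against rule-based systems.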
Improving DevOps through Cloud Automation and Management - Real-World Rocket ... - Ostrato
Explore how DevOps processes can be made more efficient through improved service delivery and cloud automation. Check out this real-world example to see how Chef and Ostrato helped OpenWhere, a geospatial analytics startup, compete in the hyper-competitive defense marketplace.
Chef allows enterprises like OpenWhere to automate infrastructure deployments to accelerate and simplify the development process. Ostrato’s cloud management platform enables enterprises to control costs and institute governance in hybrid cloud environments.
This document discusses designing applications for resiliency in cloud environments. It defines resiliency, high availability, and disaster recovery. It describes why resiliency is important given the transient faults that can occur in cloud systems. The document outlines a process for improving resiliency that includes planning, designing, implementing, testing, deploying, monitoring, and responding to failures. It provides examples of resiliency techniques like load balancing, failover/failback, data replication, retries, circuit breakers, and deployment strategies.
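Of the techniques listed, retries with exponential backoff are the easiest to show in miniature. The attempt count and delays below are arbitrary illustrations:

```python
import time

def retry(call, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry a flaky call with exponential backoff (0.1s, 0.2s, 0.4s, ...).

    Absorbs the transient faults common in cloud environments, but
    re-raises the last error once attempts are exhausted so persistent
    failures still surface to the caller (or to a circuit breaker).
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the behavior can be tested without real delays — itself an instance of the "test your resiliency logic" step in the process the document outlines.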
Automated acceptance testing is an important part of the deployment pipeline. It tests that the application meets business requirements and provides value to users. Creating maintainable acceptance test suites involves deriving tests from acceptance criteria, layering the tests, and avoiding direct coupling to the GUI. Non-functional requirements like performance and capacity also need to be tested. The deployment process should be automated and standardized across environments using techniques like blue-green deployment and canary releases to allow rolling back changes if needed.
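The canary releases mentioned above amount to routing a small weighted fraction of traffic to the new version and watching its error rate before rolling forward. A sketch of the weighted routing step (real routers typically hash a user or session id for stickiness rather than drawing per request):

```python
import random

def route(version_weights, rng=random):
    """Pick a deployment version by traffic weight, e.g. a 5% canary.

    `version_weights` maps version name -> relative weight.
    """
    r = rng.random() * sum(version_weights.values())
    for version, weight in version_weights.items():
        r -= weight
        if r < 0:
            return version
    return version  # guard against floating-point edge cases

# e.g. route({"stable": 95, "canary": 5}) sends ~5% of requests to the
# canary; if its error rate holds, weights shift until it is "stable".
```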
This document provides information about the SCQAA-SF organization and an upcoming event. SCQAA-SF is a chapter that sponsors sharing of information to promote quality practices in IT through networking, training and professional development. They meet every two months in San Fernando Valley. The upcoming event will feature presentations on technology advancements and methodology, networking opportunities, and opportunities to earn PDU and other credits. Recently, the organization revised their membership dues policy to better accommodate members' needs.
Adding Value in the Cloud with Performance Test - Rodolfo Kohn
This document discusses the importance of performance testing cloud applications and outlines best practices for defining performance requirements, testing methodology, and identifying issues. It provides examples of performance problems found in databases, applications, operating systems, and networks. The key goals of performance testing are to understand system behavior under load, find bottlenecks and hidden bugs, and verify that requirements are met.
Total cloud control with Oracle Enterprise Manager 12c - solarisyougood
This document discusses Oracle Enterprise Manager 12c and its capabilities for managing cloud computing environments. It can provide complete lifecycle management of applications, infrastructure, and platforms from planning through metering and optimization. Key capabilities include integrated management of applications, middleware, databases, and infrastructure; self-service provisioning; monitoring of business services and transactions; and metering for chargeback. It aims to provide total control and visibility while also enabling business users through self-service access.
Presentation of the Ph.D. dissertation "SLA-Driven Cloud Computing Domain Representation and Management". This presentation explains a new methodology for the representation and management of Cloud services using SLA fragments. Cloud resources are described as independent SLA fragments, which are composed on the fly to create complete Cloud services.
An architecture for the management of Cloud services is also presented.
Cloudcompaas, an open source SLA-driven framework is introduced. Cloudcompaas implements the methodology and architecture presented earlier and enables the management of the complete lifecycle of Cloud services.
Finally, a set of experiments to validate the utility and performance of the contributions is presented.
“Spikey Workloads”: Emergency Management in the Cloud
One of the best use cases for the cloud involves websites with surges in computing needs. This session will feature organizations that have leveraged the cloud to handle their unique burst workloads without breaking the bank:
Speaker: , Solutions Architect, Amazon Web Services
Speaker: Cameron Maxwell, Professional Services, Amazon Web Services
In this Dagstuhl talk, I presented my current research on cloud auto-scaling and component connector self-adaptation and how I employed type-2 fuzzy control to tame the uncertainty regarding knowledge specification.
Sam Fell - Electric Cloud - Automating Continuous Delivery with ElectricFlowDevOps Enterprise Summit
Continuous Delivery takes Agile to its logical conclusion with a way of working that ensures software is always ready to release. It does this by building upon and extending Agile, CI and DevOps practices and tools to transform the way software is delivered.
Organizations that want to adopt Continuous Delivery need frequent check-ins to be verified by automated builds and tests so teams can reduce risk, deploy more often, and detect problems early.
This talk will focus on the ElectricFlow DevOps automation platform, and the functionality it exposes to:
- Enable Devs to automate complex build and test processes to drive efficient predictability at scale
- Give Ops teams a way to eliminate manual and error-prone processes to safely deploy any applications anywhere, anytime.
- Any teams to securely plug-in the clouds and tools they care about to abstract out complexity and ensure process compliance
The document discusses several issues with utilizing utilization as a metric for measuring resource usage and performance in modern computing systems. It argues that utilization metrics are broken due to unsafe assumptions about workload characteristics, system architecture like multi-core CPUs, and measurement errors. Alternative metrics that take these factors into account, like response time and capability utilization for storage, are suggested to provide more accurate performance insights.
Tanay Nagjee - Electric Cloud - Better Continuous Integration with Test Accel...DevOps Enterprise Summit
Stop sacrificing comprehensive testing to save time
Software testers and quality assurance engineers are often pressured to cut testing time to ensure products are released on time. Usually this means running fewer tests, thus reducing software quality. This pressure is exacerbated as companies embrace a continuous integration (CI) approach which involves frequent build and test cycles, but has the side effect of further limiting the time allocated to test and analysis. Instead of reducing the number of tests in a CI cycle to reduce test time, Tanay Nagjee will discuss how entire test suites can be broken down and parallelized, reducing the time to run them by 80% or more. By leveraging a cluster of computing horsepower (either on-premise physical machines or in the cloud), large test suites can execute in a fraction of the time it takes by smartly parallelizing their individual tests. Tanay will outline a 3-step approach to achieve these results with different test frameworks. He will discuss the tools used, and will present real example data and a live demonstration.
Rohit Jainendra - Electric Cloud - Enabling DevOps Adoption with Electric CloudDevOps Enterprise Summit
Join Rohit Jainendra, Chief Product Officer, as he gives you a firsthand look at how Electric Cloud products have evolved over the past year and a view into the 2015-2016 roadmap. Gain insight into new features and learn how we plan to help you and your organization adopt DevOps practices so that you can deliver better software faster.
This document discusses self-learning cloud controllers that can dynamically scale cloud resources. It notes that current auto-scaling approaches require deep application knowledge and expertise to determine scaling parameters and policies. The paper proposes a type-2 fuzzy logic approach called RobusT2Scale that uses fuzzy rules and monitoring data to determine scaling actions. It aims to handle uncertainty in elastic systems and accommodate different user preferences through fuzzy reasoning over workload and response time data. The approach pre-computes scaling decisions to enable efficient runtime elasticity control. It is evaluated based on its ability to meet an SLA target response time compared to over- and under-provisioning approaches.
This document summarizes Chaos Engineering techniques used at T-Mobile for their Cloud Foundry platform. It introduces the tools Monarch and Turbulence++ that were developed to inject failures at the infrastructure and application levels. Examples of chaos attacks demonstrated include killing VMs, blocking network traffic, and crashing application instances. The tools help test the resiliency of the platform and applications deployed on it. Limitations and potential improvements discussed include merging the two tools and supporting multiple clusters.
Capacity Planning for Virtualized Datacenters - Sun Network 2003Adrian Cockcroft
Presentation I made at the Sun Network conference in 2003 on how to do capacity planning for virtualized systems, tied into the N1 product that Sun was pushing at the time. This project was structured as a design for six sigma (DFSS) project.
This document discusses factors that influence web search latency from both the user and system perspectives. It summarizes that users expect fast response times from search engines, while search engines aim to balance speed, quality, and costs. The document then outlines components that contribute to latency, experiments measuring user sensitivity to latency, and the impact of latency on user search experience. Specifically, it finds users notice delays over 1000ms and that faster search sites lead to higher user engagement.
(SPOT302) Availability: The New Kind of Innovator’s DilemmaAmazon Web Services
Successful companies, while focusing on their current customers' needs, often fail to embrace disruptive technologies and business models. This phenomenon, known as the "Innovator's Dilemma," eventually leads to many companies' downfall and is especially relevant in the fast-paced world of online services. In order to protect its leading position and grow its share of the highly competitive global digital streaming market, Netflix has to continuously increase the pace of innovation by constantly refining recommendation algorithms and adding new product features, while maintaining a high level of service uptime. The Netflix streaming platform consists of hundreds of microservices that are constantly evolving, and even the smallest production change may cause a cascading failure that can bring the entire service down. We face a new kind of Innovator's Dilemma, where product changes may not only disrupt the business model but also cause production outages that deny customers service access. This talk will describe various architectural, operational and organizational changes adopted by Netflix in order to reconcile rapid innovation with service availability.
Java/Hybris performance monitoring and optimizationEPAM Lviv
⏩At the EPAM Java/Hybris webinar we discussed:
- performance monitoring,
- optimization of Java server applications,
- troubleshooting Java server applications,
demo: performance monitoring and optimization
⏩Speakers:
Mykhailo Drach, Hybris Software Engineer @ EPAM
Andrii Davydenko, Senior Performance Analyst @ EPAM
⏩Useful links:
▶See the difference between creating objects via BigDecimal.valueOf(value) and new BigDecimal(value) when value is frequently repeated: https://epa.ms/1nUmq3
▶Video: https://youtu.be/Y7JCWVrhBm8
Autonomic Decentralised Elasticity Management of Cloud ApplicationsSrikumar Venugopal
This document presents an autonomic decentralized elasticity management system called ADEC for cloud applications. ADEC uses reinforcement learning where each instance independently monitors itself and learns optimal management policies over time through a reward/punishment system. Instances coordinate using a distributed hash table to provision and dynamically place applications across instances to maximize utilization while meeting response time and availability requirements. The system was evaluated on Amazon EC2 using a hotel management application under varying workloads, demonstrating ADEC's ability to independently start and shutdown instances to meet application objectives.
This document discusses autonomic decentralized elasticity management of cloud applications. It presents a reinforcement learning approach called ADEC where each instance independently monitors and manages its resources and applications using a set of simple states and actions. The instances coordinate using a distributed key-value store to optimize placement of applications across instances and elastically scale instances up and down to meet application objectives like response time thresholds. An evaluation on Amazon EC2 showed ADEC could dynamically provision instances and applications in response to changing workloads to satisfy application service level objectives with low overhead.
Improving DevOps through Cloud Automation and Management - Real-World Rocket ...Ostrato
Explore how DevOps processes can be made more efficient through improved service delivery and cloud automation. Check out this real-world example to see how Chef and Ostrato helped OpenWhere, a geospatial analytics startup, compete in the hyper-competitive defense marketplace.
Chef allows enterprises like OpenWhere to automate infrastructure deployments to accelerate and simplify the development process. Ostrato’s cloud management platform enables enterprises to control costs and institute governance in hybrid cloud environments.
This document discusses designing applications for resiliency in cloud environments. It defines resiliency, high availability, and disaster recovery. It describes why resiliency is important given the transient faults that can occur in cloud systems. The document outlines a process for improving resiliency that includes planning, designing, implementing, testing, deploying, monitoring, and responding to failures. It provides examples of resiliency techniques like load balancing, failover/failback, data replication, retries, circuit breakers, and deployment strategies.
Automated acceptance testing is an important part of the deployment pipeline. It tests that the application meets business requirements and provides value to users. Creating maintainable acceptance test suites involves deriving tests from acceptance criteria, layering the tests, and avoiding direct coupling to the GUI. Non-functional requirements like performance and capacity also need to be tested. The deployment process should be automated and standardized across environments using techniques like blue-green deployment and canary releases to allow rolling back changes if needed.
This document provides information about the SCQAA-SF organization and an upcoming event. SCQAA-SF is a chapter that sponsors sharing of information to promote quality practices in IT through networking, training and professional development. They meet every two months in San Fernando Valley. The upcoming event will feature presentations on technology advancements and methodology, networking opportunities, and opportunities to earn PDU and other credits. Recently, the organization revised their membership dues policy to better accommodate members' needs.
Adding Value in the Cloud with Performance TestRodolfo Kohn
This document discusses the importance of performance testing cloud applications and outlines best practices for defining performance requirements, testing methodology, and identifying issues. It provides examples of performance problems found in databases, applications, operating systems, and networks. The key goals of performance testing are to understand system behavior under load, find bottlenecks and hidden bugs, and verify that requirements are met.
Total cloud control with oracle enterprise manager 12csolarisyougood
This document discusses Oracle Enterprise Manager 12c and its capabilities for managing cloud computing environments. It can provide complete lifecycle management of applications, infrastructure, and platforms from planning through metering and optimization. Key capabilities include integrated management of applications, middleware, databases, and infrastructure; self-service provisioning; monitoring of business services and transactions; and metering for chargeback. It aims to provide total control and visibility while also enabling business users through self-service access.
Presentation of the Ph. D. dissertation SLA-Driven Cloud Computing Domain Representation and Management. This presentation explains a new methodology for the representation and management of Cloud services using SLA fragments. Cloud resources are described as independent SLA fragments, which are composed on the fly to create complete Cloud services.
An architecture for the management of Cloud services is also presented.
Cloudcompaas, an open source SLA-driven framework is introduced. Cloudcompaas implements the methodology and architecture presented earlier and enables the management of the complete lifecycle of Cloud services.
Finally a set of experiments to validate the utility and performance of the contributions is presented.
“Spikey Workloads”:
Emergency Management in the Cloud
One of the best use cases for the cloud involves websites with surges in computing needs. This session will feature organizations that have leveraged the cloud to handle their unique burst workloads without breaking the bank:
Speaker: , Solutions Architect, Amazon Web Services
Speaker: Cameron Maxwell, Professional Services, Amazon Web Services
In this Dagstuhl talk, I presented my current research on cloud auto-scaling and component connector self-adaptation and how I employed type-2 fuzzy control to tame the uncertainty regarding knowledge specification.
Service Stampede: Surviving a Thousand ServicesAnil Gursel
How many services do you have? 5, 10, 100? How do you even run a large number of services? A micro service may be relatively simple. But services also mean distributed systems, which are inherently complex. 5 services are complex. A thousand services across many generations are at least 200 times as complex. How do we deal with such complexity?
This talk discusses service architecture at Internet scale, the need for larger transaction density, larger horizontal and vertical scale, more predictable latencies under stress, and the need for standardization and visibility. We’ll dive into how we build our latest generation service infrastructure based on Scala and Akka to serve the needs of such a large scale ecosystem.
Lastly, have the cake and eat it too. No, we’re not keeping all the goodies only to ourselves. They are all there for you in open source.
Automated Discovery of Performance Regressions in Enterprise ApplicationsSAIL_QU
This document summarizes the author's research on automated discovery of performance regressions in enterprise applications. It discusses challenges with current performance verification practices, and proposes approaches at the design and implementation levels. At the design level, it suggests using layered simulation models to evaluate design changes early. At the implementation level, it presents techniques to analyze large performance datasets, detect regressions while limiting subjectivity, and deal with tests in heterogeneous environments. Case studies show the approaches achieve 75-100% precision and 52-80% recall. The research aims to help analysts efficiently identify performance regressions.
Planning a Successful Cloud - Design from Workload to Infrastructurebuildacloud
Tim Mackey discusses key considerations for planning a successful private cloud. Private clouds offer control, speed, and future-proofing compared to public clouds. While server virtualization focused on consolidation and hardware independence, clouds are designed for massive scale, open architectures, and failure tolerance. Key features for successful clouds include multi-hypervisor support, availability zones, flexible networking, and tenant isolation without per-VM licensing. Lessons from companies like Zynga, telcos, and CloudStack emphasize clearly defining offerings, infrastructure choices optimized for workloads, and designing for maintainability and monitoring in cloud operations.
Adaptive Server Farms for the Data Centerelliando dias
The document discusses adaptive server farms for data centers. It addresses challenges like inefficient utilization, overprovisioning, and high costs. It proposes pooling server resources, automating management, and dynamically allocating resources based on demand. This improves utilization and reduces costs through automation, load balancing, and continuous service availability.
This document discusses Viewpoint's approach to web API performance testing. It outlines three key checkpoints: (1) ensuring performance during agile sprints through design reviews and trend monitoring, (2) integrating and testing components from different teams, and (3) performing full regression testing before release. It also defines different types of performance testing and describes the tools and processes used, including load testing with Visual Studio, tracking performance metrics, and using dashboards to socialize goals.
The document discusses performance tuning for Grails applications. It outlines that performance aspects include latency, throughput, and quality of operations. Performance tuning optimizes costs and ensures systems meet requirements under high load. Amdahl's law states that parallelization cannot speed up non-parallelizable tasks. The document recommends measuring and profiling, making single changes in iterations, and setting up feedback cycles for development and production environments. Common pitfalls in profiling Grails applications are also discussed.
- The document discusses the challenges of traditional web performance testing and introduces trusted cloud web performance testing as an effective alternative using real-time monitoring, sophisticated analytics, and affordable load and performance testing capabilities.
- A case study is presented of a tax filing website that was performance tested using cloud-based testing and was able to detect 27 critical issues while ramping up to 300,000 concurrent users, achieving results 75 times better than traditional testing.
- An accreditation process is described that uses cloud-based performance testing to validate a website's performance over 25 hours against key metrics and benchmarks.
The document discusses MySQL performance tuning basics. It covers key topics like defining performance metrics, MySQL server architecture, commands and tools for monitoring performance like slow query log and processlist, and server configuration parameters that impact performance like connection settings and buffer sizes. The presentation aims to provide an overview of MySQL performance optimization.
1. Towards a Unified View of Elasticity
Srikumar Venugopal & Team
School of Computer Science and Engineering,
University of New South Wales, Sydney, Australia
srikumarv@cse.unsw.edu.au
5. Elasticity
The ability of a system to change its capacity in direct response to the workload demand
6. Different Views of Elasticity
• Performance View
– When to scale, and how much?
• Application View
– Does the architecture accommodate scaling?
– How is state managed?
• Configuration View
– Are there changes in configuration due to scaling?
10. State-of-the-art in Auto-scaling

Product/Project       | Trigger                      | Controller                | Actions
Amazon Autoscaling    | CloudWatch metrics/Threshold | Rule-based/Schedule-based | Add/Remove Capacity
WASABi                | Azure Diagnostics/Threshold  | Rule-based                | Add/Remove Capacity, Custom
RightScale/Scalr      | Load monitoring              | Rule-based/Schedule-based | Add/Remove Capacity, Custom
Google Compute Engine | CPU Load, etc.               | Rule-based                | Add/Remove Capacity
CloudScale (academic) | Demand Prediction            | Control theory            | Voltage-scaling
Cataclysm (academic)  | Threshold-based              | Queueing-model            | Admission Control
IBM Unity (academic)  | Application Utility          | Utility functions/RL      | Add/Remove Capacity
11. Summary
• Currently, the most popular mechanisms for auto-scaling are rule-based mechanisms
• The effectiveness of rule-based autoscaling is determined by the trigger conditions
• So, how do we know how to set up the right triggers?
13. Elasticity (Auto-Scaling) Rules
Examples:
• If CPU Utilization ≥ 85% for 7 min. add 1 server (Scale Out)
• If RespTimeSLA ≥ 95% for 10 min. remove 1 server (Scale In)
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
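Rules like these are easy to sketch in code. Below is a minimal, hypothetical illustration (not any vendor's API; all names are mine) of how a trigger that must hold for a sustained window drives a scale-out action:

```python
# Minimal sketch of rule-based elasticity: a rule fires only if its condition
# holds for the whole sustain window, mirroring "CPU >= 85% for 7 min => add
# 1 server". Class and parameter names are illustrative, not a real API.

from collections import deque

class ElasticityRule:
    def __init__(self, predicate, sustain_minutes, delta):
        self.predicate = predicate      # per-minute metric sample -> bool
        self.delta = delta              # +1 add server, -1 remove server
        self.window = deque(maxlen=sustain_minutes)

    def observe(self, sample):
        """Feed one per-minute sample; return a server delta (0 if no action)."""
        self.window.append(self.predicate(sample))
        if len(self.window) == self.window.maxlen and all(self.window):
            self.window.clear()         # re-arm the rule after it fires
            return self.delta
        return 0

scale_out = ElasticityRule(lambda cpu: cpu >= 85.0, sustain_minutes=7, delta=+1)
servers = 2
for cpu in [90, 92, 88, 91, 95, 87, 89]:   # seven consecutive hot minutes
    servers += scale_out.observe(cpu)
print(servers)  # 3: the rule fired only after 7 sustained minutes over 85%
```

A scale-in rule is the same object with an inverted predicate and delta=-1; the sustain window is what prevents scaling on a momentary spike.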
14. Performance of Different Elasticity Rules
• How well do elasticity rules perform in terms of SLA satisfaction, CPU utilization, costs and % served requests?

Rule  | Elasticity Rules
CPU75 | If CPU Util. > 75% for 5 min, add 1 server; if CPU Util. ≤ 30% for 5 min, remove 1 server
CPU80 | If CPU Util. > 80% for 5 min, add 1 server; if CPU Util. ≤ 30% for 5 min, remove 1 server
CPU85 | If CPU Util. > 85% for 5 min, add 1 server; if CPU Util. ≤ 30% for 5 min, remove 1 server
SLA90 | If SLA < 90% for 5 min, add 1 server; if SLA ≥ 90% for 5 min, remove 1 server
SLA95 | If SLA < 95% for 5 min, add 1 server; if SLA ≥ 95% for 5 min, remove 1 server

B. Suleiman, S. Sakr, S. Venugopal, W. Sadiq, Trade-off Analysis of Elasticity Approaches for Cloud-Based Business Applications, Proc. WISE 2012.
15. Cloud Testbed for Collecting Metrics
[Diagram: an Elastic Load Balancer spreading TPC-W application load across EC2 instances, backed by a TPC-W database on EC2. Metrics collected: % SLA satisfaction, average CPU utilization, server costs, % served requests, and response time.]
B. Suleiman, S. Sakr, S. Venugopal, W. Sadiq, Trade-off Analysis of Elasticity Approaches for Cloud-Based Business Applications, Proc. WISE 2012.
16. Performance Evaluation - Different Elasticity Rules
[Figure: box plots (min, Q1, median, mean, Q3, max) of server costs ($0.00-$2.50) and CPU utilization (0%-90%) for the rules CPU75, CPU80, CPU85, SLA90 and SLA95.]
B. Suleiman, S. Sakr, S. Venugopal, W. Sadiq, Trade-off Analysis of Elasticity Approaches for Cloud-Based Business Applications, Proc. WISE 2012.
17. The Challenges of Thresholds
"You must be at least this tall to scale up!"
• Threshold values determine performance and cost
• E.g. a low CPU utilization threshold => higher cost, better performance
• Thresholds vary from one application to another
• Empirically determining thresholds is expensive
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
18. Can we construct a model that allows us to establish the right thresholds?
19. Queue Model of 3-tier
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013 (Accepted).
20. Establishing Rule Thresholds
• Developed a model based on the M/M/m queuing model
– Simultaneous session initiations on 1 server
– Provisioning lag time of the provider
– Cool-down interval after elasticity action
– Algorithms to model scale-in and scale-out
– Request mix
• Compared model fidelity with actual cloud execution of TPC-W workload
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013.
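As a rough illustration of what an M/M/m model lets you compute, here is the standard Erlang C mean-response-time calculation (parameters are made up; the paper's model additionally handles provisioning lag, cool-down and request mix):

```python
# Mean response time of an M/M/m queue via the Erlang C formula. This is the
# textbook formula only, a sketch of the kind of reasoning such a model
# enables; the arrival/service rates below are illustrative.

from math import factorial

def mm_m_response_time(lam, mu, m):
    """lam: arrival rate (req/s), mu: service rate per server, m: servers.
    Requires lam < m*mu for stability."""
    rho = lam / (m * mu)
    assert rho < 1, "system must be stable"
    a = lam / mu                                   # offered load in Erlangs
    # Erlang C: probability that an arriving request has to queue
    top = a**m / (factorial(m) * (1 - rho))
    bottom = sum(a**k / factorial(k) for k in range(m)) + top
    erlang_c = top / bottom
    # mean response time = service time + mean queueing delay
    return 1 / mu + erlang_c / (m * mu - lam)

# Doubling servers under the same load shrinks the queueing delay sharply:
print(mm_m_response_time(lam=8.0, mu=5.0, m=2))
print(mm_m_response_time(lam=8.0, mu=5.0, m=4))
```

Sweeping m (or lam) through such a formula is one way to reason about where a response-time threshold should sit before running anything on real cloud resources.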
21. Experiments: Methodology
• Run the TPC-W workload on Amazon cloud resources using thresholds
• Simulate the model using MATLAB with the same thresholds
• Compare the simulation results to the results from the actual execution
– If both are equivalent, then we are good :-)
B. Suleiman, S. Venugopal, Modeling Performance of Elasticity Rules for Cloud-based Applications, EDOC 2013 (Accepted).
22. Experiments: Testbed
[Diagram: TPC-W user emulation (extra-large Linux EC2 instance) driving an Elastic Load Balancer, which spreads requests across TPC-W application servers (small/medium EC2 instances, Linux, JBoss/JSDK), backed by a TPC-W database server (extra-large EC2 instance, Linux, MySQL).]
24. Experiments: Elasticity Rules

Rule  | Rule Expansion
CPU75 | If CPU Util. > 75% for 5 min, add 1 server; if CPU Util. < 30% for 5 min, remove 1 server
CPU80 | If CPU Util. > 80% for 5 min, add 1 server; if CPU Util. < 30% for 5 min, remove 1 server

Common parameters:
• Waiting time – 10 mins., measuring interval – 1 min.
Metrics Captured:
• Average CPU utilization across all the servers
• Average response time in a time interval
• Number of servers in operation at any point of time
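A toy per-minute simulation of a CPU75-style rule can make the interaction of these parameters concrete. The numbers below (per-server capacity, lag, trace) are illustrative, and the 5-minute sustain requirement is dropped for brevity, but the provisioning lag and the waiting-time cool-down are modelled:

```python
# Toy discrete-time simulation: one step per measuring interval (1 minute).
# New servers only become usable after a provisioning lag, and a cool-down
# ("waiting time") suppresses further elasticity actions after each one.
# All constants are invented for illustration.

def simulate(load_trace, capacity=100.0, lag=3, cooldown=10):
    servers, pending, cool = 1, [], 0    # pending: minutes until boot finishes
    history = []
    for load in load_trace:
        pending = [t - 1 for t in pending]
        servers += sum(1 for t in pending if t == 0)   # booted servers join
        pending = [t for t in pending if t > 0]
        util = min(100.0, 100.0 * load / (capacity * servers))
        if cool > 0:
            cool -= 1
        elif util > 75.0:                # scale out
            pending.append(lag)          # server usable only after the lag
            cool = cooldown
        elif util < 30.0 and servers > 1:  # scale in
            servers -= 1
            cool = cooldown
        history.append((servers, round(util, 1)))
    return history

trace = [60, 90, 120, 150, 150, 150, 150, 40, 40, 40]
for minute, (n, util) in enumerate(simulate(trace)):
    print(minute, n, util)
```

Even in this toy, the lag means the system stays overloaded for several minutes after the trigger fires, and the cool-down stops it from scaling back in the moment load drops: exactly the effects the queueing model has to capture.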
28. Summary
• Developed a queueing model that can be used to reason about elasticity
• Model captures effects of thresholds and can be used for testing different rules
• Evaluations show that the model approximates real-world conditions closely
• Future work: handling initial bursts in workload
30. Cons of Rule-based Autoscaling
• Commercial products are rule-based
– Gives an "illusion of control" to users
– Leads to the problem of defining the "right" thresholds
• Centralised controllers
– Communication overhead increases with size
– Processing overhead also increases (Big Data!)
• One application/VM at a time
31. Challenges of large-scale elasticity
• Large numbers of instances and apps
– Deriving solutions takes time
• Dynamic conditions
– Apps are going critical all the time
• Shifting bottlenecks
– Greedy solutions may create bottlenecks in other places
• Network partitions, fault tolerance…
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
37. Problems for Automatic Placement
• Provisioning
– Smallest number of servers required to satisfy resource requirements of all the applications
• Dynamic Placement
– Distribute applications so as to maximise utilisation yet meet each app's response time and availability requirements
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
38. Co-ordinated Control of Elasticity
• Instances control their own utilisation
– Monitoring, management and feedback
• Local controllers are learning agents
– Reinforcement Learning
• Controllers learn from each other
– Share their knowledge and update their own
• Servers are linked by a DHT
– Agility, flexibility, co-ordination
H. Li, S. Venugopal, "Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform", Proceedings of 8th ICAC '11.
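A local controller of this kind can be sketched as a tabular Q-learning agent. The states, actions, rewards and toy dynamics below are my illustration of the idea, not the paper's design:

```python
# Hedged sketch of a local elasticity controller as tabular Q-learning:
# coarse load states, actions mirroring the platform's options, and a reward
# that punishes SLA violations. All names and numbers are illustrative.

import random

STATES = ["low", "normal", "critical"]
ACTIONS = ["keep", "move_app", "start_instance"]

def q_learning(transition, episodes=2000, alpha=0.1, gamma=0.9, eps=0.1):
    rng = random.Random(42)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = rng.choice(STATES)
        for _ in range(20):
            # epsilon-greedy action selection
            a = (rng.choice(ACTIONS) if rng.random() < eps
                 else max(ACTIONS, key=lambda x: Q[(s, x)]))
            s2, r = transition(s, a, rng)
            # standard Q-learning update
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS)
                                  - Q[(s, a)])
            s = s2
    return Q

def toy_world(s, a, rng):
    """Illustrative dynamics: acting while critical relieves the load at a
    small cost; idling while critical is a costly SLA violation."""
    if s == "critical":
        if a == "keep":
            return "critical", -10.0
        return "normal", -1.0
    if a == "start_instance":
        return s, -2.0                    # needless capacity costs money
    return rng.choice(STATES), 0.0

Q = q_learning(toy_world)
best = max(ACTIONS, key=lambda a: Q[("critical", a)])
print(best)  # the learned policy stops choosing "keep" when critical
```

The reward/punishment signal is what lets each instance learn thresholds instead of having them hand-specified, which is the point of replacing rule triggers with RL.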
39. Abstract View of the Control Scheme
H. Li, S. Venugopal, "Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform", Proceedings of 8th ICAC '11.
40. Fuzzy Thresholds
H. Li, S. Venugopal, Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform, Proceedings of 8th ICAC '11.
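The idea of a fuzzy threshold can be sketched like this (the membership functions and break-points are invented for illustration, not taken from the paper):

```python
# Sketch of a fuzzy threshold: instead of a crisp "CPU > 75% fires", a
# utilization reading belongs to "overloaded" with a degree that ramps from
# 0 to 1 across a band. Break-points below are illustrative only.

def ramp_up(x, lo, hi):
    """Membership that starts rising at lo and is fully reached at hi."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def overloaded(cpu):
    return ramp_up(cpu, 65.0, 85.0)        # fuzzy band replaces a 75% line

def underloaded(cpu):
    return 1.0 - ramp_up(cpu, 20.0, 40.0)

for cpu in (50, 70, 75, 80, 90):
    print(cpu, round(overloaded(cpu), 2))
```

A controller can then act with an urgency (or probability) proportional to the membership degree, rather than flipping behaviour the instant a single hard threshold is crossed.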
42. Co-ordination using find!
• Server looks up other servers with the least load
– DHT lookup
• Sends a move message to the selected server
• Replies with accept or reject!
– accept has a +ve reward
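The find/move exchange can be sketched as follows. A plain dict stands in for the DHT's load index and message passing is a direct call; all names and numbers here are illustrative:

```python
# Toy sketch of the find/move co-ordination step: look up the least-loaded
# peer, offer it an application, and receive accept (positive reward) or
# reject (negative reward). The dict is a stand-in for a real DHT lookup.

def find_least_loaded(load_index, exclude):
    """DHT-style lookup: the server currently advertising the least load."""
    candidates = {s: u for s, u in load_index.items() if s != exclude}
    return min(candidates, key=candidates.get)

def handle_move(load_index, target, app_load, capacity=100.0):
    """Target accepts the app only if it still fits within its capacity."""
    if load_index[target] + app_load <= capacity:
        load_index[target] += app_load
        return "accept", +1.0
    return "reject", -1.0

load_index = {"s1": 90.0, "s2": 35.0, "s3": 60.0}
target = find_least_loaded(load_index, exclude="s1")
reply, reward = handle_move(load_index, target, app_load=25.0)
print(target, reply, reward)  # s2 accept 1.0
```

Feeding the accept/reject reward back into each server's learning agent is what couples this co-ordination step to the reinforcement-learning controllers above.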
43. Shrinking
• The controller is always reward-maximising
– Highest reward is for merge+terminate
• A controller initiates its own shutdown
– Low load on its applications
• Gets exclusive lock on termination
– Only one instance can terminate at a time
• Transfers state before shutdown
44. Experiments
• Six web applications
– Test application: Hotel Management
– Search → Book → Confirm
• Five were subjected to a background load
– Uniform random
• One was subjected to the test load
• Application threshold: 200 and 500 ms
• Metrics
– Average response time, drop rate, servers
H. Li, S. Venugopal, "Using Reinforcement Learning for Controlling an Elastic Web Application Hosting Platform", Proceedings of 8th ICAC '11.
48. Key-value Stores
• The standard component for cloud data management
• Increasing workload → node bootstrapping
– Incorporate a new, empty node as a member of the KVS
• Decreasing workload → node decommissioning
– Eliminate an existing member, moving its redundant data off the KVS
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
49. Research Questions
• As the system scales, how to efficiently incorporate or remove data nodes?
– Load balancing, migration overheads, etc.
• How to partition and place the data replicas when the system is elastic?
– Data consistency, durability, availability, etc.
H. Li, S. Venugopal, Efficient Node Bootstrapping for Decentralised Shared-Nothing Key-Value Stores, Proceedings of Middleware 2013.
50. Elasticity in Key-Value Stores
• Minimise the overhead of data movement
– How to partition/store data?
• Balance the load at node bootstrapping
– Both data volume and workload
– How to place/allocate data?
• Maintain data consistency and availability
– How to execute data movement?
51. Split-Move Approach
[Diagram: key space with partitions A–I spread over Nodes 1–4, each partition held as a master replica and a slave replica. ① The overloaded partition B is split into B1 and B2; ② B2 and its replica are moved to the new node; ③ the stale copy on the old node is marked "to be deleted".]
Partition at node bootstrapping
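The essence of Split-Move can be sketched in a few lines. Representing partitions as integer key ranges and picking the widest one to split are simplifying assumptions; the real system splits the overloaded partition.

```python
def split_move(owner_ranges, new_node):
    """Split-Move bootstrapping (sketch): split the widest key range in
    half; the first half (B1) stays on its owner, the second half (B2)
    is moved to the new node."""
    node, (lo, hi) = max(owner_ranges.items(),
                         key=lambda kv: kv[1][1] - kv[1][0])
    mid = (lo + hi) // 2
    owner_ranges[node] = (lo, mid)       # B1 stays put
    owner_ranges[new_node] = (mid, hi)   # B2 moves to the new node
    return owner_ranges
```

Note that only the split partition's owner participates in the transfer, which is exactly the bottleneck the virtual-node approach below avoids.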
52. Virtual-Node Approach
[Diagram: the key space is pre-split into many small partitions (A–I) spread across Nodes 1–4 with replicas. When a new node joins, whole partitions are handed over from several existing nodes to the new node.]
Partition at system startup
Data skew: e.g., the majority of data is stored in a minority of partitions.
Moving around giant partitions is not a good idea.
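Virtual-node bootstrapping can be sketched as handing whole partitions to the joiner until loads even out. Balancing purely by partition count is an assumption here; the paper also weighs data volume and workload.

```python
def bootstrap_virtual_node(assignment, new_node):
    """Virtual-node bootstrapping (sketch): the new node takes whole
    partitions from the most-loaded existing nodes until every node
    holds roughly the same number of partitions."""
    assignment[new_node] = []
    target = sum(len(p) for p in assignment.values()) // len(assignment)
    for node in sorted(assignment, key=lambda n: -len(assignment[n])):
        while (len(assignment[node]) > target
               and len(assignment[new_node]) < target):
            assignment[new_node].append(assignment[node].pop())
    return assignment
```

Because partitions move whole, several donors can stream to the new node in parallel, unlike Split-Move where a single owner does all the work.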
53. Our Solution
• Virtual-node based movement
– Each partition of data is stored in separate files
– Reduced overhead of data movement
– Many existing nodes can participate in bootstrapping
• Automatic sharding
– Split and merge partitions at runtime
– Each partition stores a bounded volume of data
– Easy to reallocate data
– Easy to balance the load
54. The timing for data partitioning
• Shard partitions at writes (inserts and deletes)
– Split when Size(Pi) ≥ Θmax
– Merge when Size(Pi) + Size(Pi+1) ≤ Θmin
[Diagram: inserts grow partition B until it is split into B1 and B2; deletes shrink adjacent partitions until they are merged into M.]
• Choosing Θmax ≥ 2Θmin avoids split/merge oscillation
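The write-time sharding rule can be sketched directly. The concrete threshold values are arbitrary; the only constraint carried over from the slide is Θmax ≥ 2Θmin, so a freshly split half (≈ Θmax/2 ≥ Θmin) cannot immediately trigger a merge.

```python
THETA_MAX = 1000  # assumed maximum partition size (arbitrary units)
THETA_MIN = 400   # assumed merge threshold; THETA_MAX >= 2 * THETA_MIN

def shard_actions(sizes):
    """Decide split/merge at write time: split a partition once it
    reaches THETA_MAX; merge two adjacent partitions once their
    combined size drops to THETA_MIN."""
    actions = []
    for i, size in enumerate(sizes):
        if size >= THETA_MAX:
            actions.append(("split", i))
        elif i + 1 < len(sizes) and size + sizes[i + 1] <= THETA_MIN:
            actions.append(("merge", i, i + 1))
    return actions
```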
55. Sharding coordination
• Solution: Election-based coordination
– Step 1: Election — all nodes derive the same sorted candidate list (e.g., C, E, ..., A, ..., B); the head of the list becomes coordinator
– Step 2: The coordinator enforces the split/merge on the data/node mapping
– Step 3: The replicas finish the split/merge in turn (1st, 2nd, 3rd, 4th)
– Step 4: The coordinator announces the result to all nodes
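The four coordination steps can be condensed into one round-trip sketch. The sort key (plain lexicographic ordering) and the fixed "split" plan are assumptions; the real candidate ordering and plan come from the data/node mapping.

```python
def run_sharding_round(nodes, partition):
    """Election-based coordination, sketching the four steps:
    1) every node derives the same sorted candidate list and the head
       becomes coordinator; 2) the coordinator enforces the split/merge;
    3) the other replicas finish it in turn; 4) the coordinator
       announces the outcome to all nodes."""
    candidates = sorted(nodes)                 # Step 1: election
    coordinator = candidates[0]
    plan = ("split", partition)                # Step 2: enforce split/merge
    finished = candidates[1:]                  # Step 3: finish in turn
    ok = len(finished) == len(nodes) - 1
    return coordinator, plan, ok               # Step 4: announce
```

Because every node computes the same sorted list, the election needs no extra messages: agreement on the list is agreement on the coordinator.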
56. Node failover during sharding
[Diagram: failure handling around the four coordination steps (election, notification "shard Pi", execution, announce), before, during, and after execution.]
• A non-coordinator fails before execution
– Gossip detects the dead node; it is removed from the candidate list, and appended again if it resurrects
• A non-coordinator fails during execution
– The remaining nodes continue without it; its replicas are replaced afterwards
• The coordinator fails during execution
– On timeout, the nodes continue without the coordinator and elect a new one; Pi is invalidated on the failed node
57. Evaluation Setup
• ElasCass: an implementation of our auto-sharding scheme, built on Apache Cassandra (version 1.0.5), which itself uses the Split-Move approach
• Key-value stores compared: ElasCass vs. Cassandra (v1.0.5)
• Test bed: Amazon EC2, m1.large instances (2 CPU cores, 8 GB RAM)
• Benchmark: YCSB
58. Evaluation – Bootstrap Time
• Start from 1 node with 100 GB of data and replication factor R = 2; scale up to 10 nodes
• In Split-Move, the data volume transferred halves with each node from 3 nodes onwards
• In ElasCass, the data volume transferred stays below 10 GB from 2 nodes onwards
• Bootstrap time is determined by the data volume transferred; ElasCass exhibits consistent performance at all scales
59. Conclusions
• We have designed and implemented a decentralised auto-sharding scheme that
– consolidates each partition replica into a single transferable unit for efficient data movement;
– automatically shards partitions into bounded ranges to address data skew;
– reduces node bootstrap time and achieves better load balancing and query-processing performance.
61. Final Thoughts
• Elasticising application logic is done
– How do we eliminate thresholds?
– Should it be more autonomic?
• An application view of elasticity
– Managing state is the big challenge
– Decoupling of components (service-oriented model)
– How would you scale interconnected components?