Building cross-region and cross-cloud high availability into your app: a real-life use case - Nati Shalom, Founder & CTO, Gigaspaces
Achieving high levels of availability and disaster recovery in a cloud environment requires the implementation of patterns and practices that introduce redundancy through multi-zone, multi-region, and multi-cloud deployments. As we move towards implementing higher availability, we cannot escape the direct increase in the accidental complexity of the deployment architecture resulting from lack of cloud portability and deployment lifecycle automation. We present how high availability and disaster recovery were achieved in reality by using the Cloudify open source framework on top of AWS. This approach applies to not just AWS but also other public clouds and private cloud environments such as Eucalyptus. The resulting reference architecture provides portable PostgreSQL replication and disaster recovery as well as application tier scalability across zones, regions, and public/private clouds through a unified deployment workflow.
A cloud management platform (CMP) is fast becoming a de facto requirement for enterprises pursuing a multi-cloud or hybrid cloud strategy. But what should you be looking for in a CMP? Many companies make the mistake of taking a “boil the ocean” approach to a CMP evaluation. We’ll share best practices and discuss whether you need an RFP.
Migrate your Existing Express Apps to AWS Lambda and Amazon API Gateway - Amazon Web Services
This webinar teaches you how to use Amazon API Gateway and AWS Lambda to run your existing Express.js applications with just a few lines of code. We will introduce three new features in API Gateway: proxy integrations, greedy paths, and the ANY HTTP method. Combining these features, you can configure API Gateway in a few simple clicks via the management console and express all of your logic and API definition in code.
Learning Objectives:
1. Easier migration to API Gateway and Lambda
2. New API Gateway Catch-all methods
Who Should Attend: Developers
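The proxy-integration model the webinar describes can be sketched as a single function that receives every request. This is a minimal Python sketch (shown instead of Express.js for brevity): the event fields (`httpMethod`, `path`) and the `statusCode`/`headers`/`body` response shape follow the Lambda proxy integration contract, while the routes themselves are hypothetical.

```python
import json

def handler(event, context):
    """Minimal Lambda proxy-integration handler.

    With a greedy path ({proxy+}) and the ANY method, API Gateway
    forwards every request to this one function; the event carries
    the original method and path, and the response must include
    statusCode, headers, and a string body.
    """
    method = event.get("httpMethod", "GET")
    path = event.get("path", "/")

    # In-code routing replaces per-resource API Gateway configuration.
    if method == "GET" and path == "/hello":
        payload, status = {"message": "hello"}, 200
    else:
        payload, status = {"error": f"no route for {method} {path}"}, 404

    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }
```

This is what lets an entire API live in code rather than in per-path gateway configuration.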
Terraform in production - experiences, best practices and deep dive - Piotr Ki... - PROIDEA
In my presentation I would like to share my experiences of working with Terraform in various infrastructure projects (ECS/Kops/core-infra types). I will share what is considered common sense in deploying projects with Terraform, along with several different approaches: Should I use a module? Should I write my own? How should I structure a repo with code? Terraform in Terraform (the kops example)?
Regardless of whether you do nothing, build your own kit, or buy from AWS or another CSP, someone from finance will come back to you and ask what happened to their money. In this session we will cover Cloud ROI: the key economic drivers for moving to the cloud and the tips and tricks for cost optimization on AWS.
AWS provides a platform that is ideally suited for building highly available systems, enabling you to build reliable, affordable, fault-tolerant systems that operate with a minimal amount of human interaction. This session covers many of the high-availability and fault-tolerance concepts and features of the various services that you can use to build highly reliable and highly available applications in the AWS Cloud: architectures involving multiple Availability Zones, including EC2 best practices and RDS Multi-AZ deployments; loosely coupled and self-healing systems involving SQS and Auto Scaling; networking best practices for high availability, including Elastic IP addresses, load balancing, and DNS; leveraging services that inherently are built with high-availability and fault tolerance in mind, including S3, Elastic Beanstalk and more.
This session is for anyone interested in understanding the financial costs associated with migrating workloads to AWS. By presenting real cases from AWS Professional Services and directly from a customer, we explore how to measure value, improve the economics of a migration project, and manage migration costs and expectations through large-scale IT transformations. We’ll also look at automation tooling that can further assist and accelerate the migration process.
Google Cloud Connect @ Korea
- Google Cloud Vision
- G Suite Product Roadmap
- Google Cloud Security
- Google Cloud Machine Learning
- G Suite Customer Stories
Cloud promises a simple pay-as-you-go approach to technology, with cost savings at the top of the list. As more enterprises adopt the cloud, cost continues to be a major issue, with new pricing models, services, and features that introduce waste and complexity into the decision-making process. In this webinar, you’ll learn expert strategies that will amplify your cloud performance and maximize your ROI at a level of intricacy that can’t be achieved through manual processes – tools and expertise are needed.
Docker containers have become a key component of modern application design. Increasingly, developers are breaking their applications apart into smaller components and distributing them across a pool of compute resources.
Software release cycles are now measured in days instead of months. Cutting-edge companies are continuously delivering high-quality software at a fast pace. In this session, we will cover how you can begin your DevOps journey by sharing best practices and tools used by the engineering teams at Amazon. We will showcase how you can accelerate developer productivity by implementing continuous integration and delivery workflows. We will also give an introduction to AWS CodeStar, AWS CodeCommit, AWS CodeBuild, AWS CodePipeline, AWS CodeDeploy, AWS Cloud9, and AWS X-Ray, the services inspired by Amazon's internal developer tools and DevOps practice.
Level: 200
Speaker: Nick Brandaleone - Solutions Architect, AWS
[NEW LAUNCH!] AWS Transit Gateway and Transit VPCs - Reference Architectures ... - Amazon Web Services
In this session, we will review the new AWS Transit Gateway and new networking features. We compare AWS Transit Gateway and Transit VPCs and discuss how to architect your accounts and VPCs. This session will be helpful if the developers have been let loose and you are planning lots of VPCs or accounts. How should you connect them, what limits do you need to be aware of, and how does routing work with many VPCs? We dive into the details of recent launches and how to work with concepts like Transit VPCs, account strategies, scaling services, using firewalls, and Direct Connect gateways to solve the problems of many VPCs.
Organisations are rapidly adopting hybrid cloud strategies to take advantage of both on-premises and cloud services. However, moving applications to the cloud can be difficult and time-consuming, often taking months. VMware offers solutions that customers are using to migrate hundreds of applications to the cloud in a few days. Additionally, VMware solutions simplify day 2 operations by providing consistent infrastructure and operations across on-premises and public cloud services. Come to this session to hear how VMware is helping organisations migrate applications to the cloud, extend their data centers to the cloud, deploy cloud-based disaster recovery solutions, and modernize their applications with the power of VMware and AWS cloud services.
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018 - Amazon Web Services
Cloud computing provides a number of advantages, such as the ability to scale your web application or website on demand. If you have a new web application and want to use cloud computing, you might be asking yourself, "Where do I start?" Join us in this session for best practices on scaling your resources from one to millions of users. We show you how to best combine different AWS services, how to make smarter decisions for architecting your application, and how to scale your infrastructure in the cloud.
AWS CloudFormation macros: Coding best practices - MAD201 - New York AWS Summit - Amazon Web Services
With AWS CloudFormation macros, infrastructure-as-code developers can use AWS Lambda functions to empower template authors with utilities to improve their productivity. In this session, we review example use cases to teach you best practices when writing macros. You also learn deployment strategies so your teams can make the most of this functionality.
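As a rough illustration of the mechanics described above, here is a minimal Python Lambda handler for a hypothetical "Count" macro that duplicates a resource N times. The `requestId`/`status`/`fragment` response shape is the macro contract; the Count convention and the resource names are illustrative, not part of CloudFormation itself.

```python
import copy

def handler(event, context):
    """CloudFormation macro sketch: any resource declaring a
    (hypothetical) top-level 'Count: N' key is expanded into N
    copies with numbered logical IDs. The macro receives the
    template as event['fragment'] and must return a transformed
    fragment along with the original requestId and a status.
    """
    fragment = event["fragment"]
    resources = fragment.get("Resources", {})
    expanded = {}
    for logical_id, resource in resources.items():
        count = resource.pop("Count", None)
        if count is None:
            expanded[logical_id] = resource
        else:
            # Duplicate the resource, suffixing the logical ID.
            for i in range(1, int(count) + 1):
                expanded[f"{logical_id}{i}"] = copy.deepcopy(resource)
    fragment["Resources"] = expanded
    return {
        "requestId": event["requestId"],
        "status": "success",
        "fragment": fragment,
    }
```

Template authors would then write `Count: 2` on a resource and let the macro do the repetitive expansion at deploy time.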
The Azure platform comprises more than 200 cloud products and services designed to help you bring new solutions to life, solve today's challenges, and create the future. Build, run, and manage applications across multiple clouds, on-premises, and at the edge, with the tools and frameworks of your choice.
Public clouds are going to crash. It's inevitable. The best thing you can do is be prepared with a highly available architecture to ensure you're not affected by the outage. Join a live webinar with Gigaspaces founder and CTO Nati Shalom to discuss best practices in high availability to safeguard your cloud from the inevitable outage.
http://www.newvem.com/cloud-webinar-safe-guard-your-application-from-outages/
LinuxFest NW 2013: Hitchhiker's Guide to Open Source Cloud Computing - Mark Hinkle
Presented on April 27th, 2013 at LinuxFest NW
Imagine it’s eight o’clock on a Thursday morning and you awake to see a bulldozer out your window, ready to plow over your data center. Normally you might wish to consult the Encyclopedia Galáctica to discern the best course of action, but your copy is likely out of date. And while the Hitchhiker’s Guide to the Galaxy (HHGTTG) is a wholly remarkable book, it doesn’t cover the nuances of cloud computing. That’s why you need the Hitchhiker’s Guide to Cloud Computing (HHGTCC), or at least to attend this talk to understand the state of open source cloud computing. Specifically, this talk will cover infrastructure-as-a-service, platform-as-a-service, and developments in big data, and how to more effectively take advantage of these technologies using open source software. Technologies that will be covered in this talk include Apache CloudStack, Chef, CloudFoundry, NoSQL, OpenStack, Puppet and many more.
Specific topics for discussion will include:
Infrastructure-as-a-Service - The Systems Cloud - Get a comparison of the open source cloud platforms, including OpenStack, Apache CloudStack, Eucalyptus, and OpenNebula.
Platform-as-a-Service - The Developers Cloud - Find out what tools are available to build portable, auto-scaling applications, including CloudFoundry, OpenShift, Stackato and more.
Data-as-a-Service - The Analytics Cloud - Want to figure out the who, what, where, when and why of big data? You get an overview of open source NoSQL databases and technologies like MapReduce to help crunch massive data sets in the cloud.
Finally, you'll get an overview of the tools that can help you really take advantage of the cloud. Want to auto-scale virtual machines to serve millions of web pages, or automate the configuration of cloud computing environments? You'll learn how to combine these tools to provide continuous deployment systems that will help you earn DevOps cred in any data center.
[Finally, for those of you that are Douglas Adams fans please accept the deepest apologies for bad analogies to the HHGTTG.]
The Total Cost of Ownership (TCO) of Web Applications in the AWS Cloud - Jine... - Amazon Web Services
Weighing the financial considerations of owning and operating a data center facility versus employing a cloud infrastructure requires detailed and careful analysis. In practice, it is not as simple as just measuring potential hardware expense alongside utility pricing for compute and storage resources. The Total Cost of Ownership (TCO) is often the financial metric used to estimate and compare direct and indirect costs of a product or a service. Given the large differences between the two models, it is challenging to perform accurate apples-to-apples cost comparisons between on-premises data centers and cloud infrastructure that is offered as a service. In this presentation, we explain the economic benefits of deploying a web application in the Amazon Web Services (AWS) cloud over deploying an equivalent web application hosted in an on-premises data center and highlight the 5 things to not forget while calculating TCO.
Whitepaper: http://bit.ly/aws-tco-webapps
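The apples-to-apples difficulty the abstract describes ultimately comes down to arithmetic like the following. This is a toy sketch with entirely hypothetical figures (not AWS pricing), ignoring staffing, utilization, power, and hardware refresh cycles that a real TCO analysis must include.

```python
def three_year_tco(upfront, yearly_opex, years=3):
    """Simplest possible TCO: one-time capital cost plus recurring
    operating cost over the comparison window. Real analyses add
    many more direct and indirect cost lines on both sides.
    """
    return upfront + yearly_opex * years

# Hypothetical on-premises: $120k of hardware plus $30k/yr to run it.
on_prem = three_year_tco(upfront=120_000, yearly_opex=30_000)
# Hypothetical cloud: no upfront hardware, $55k/yr in service charges.
cloud = three_year_tco(upfront=0, yearly_opex=55_000)
# on_prem = 210_000 vs. cloud = 165_000 in this toy comparison.
```

The point of the model is not the numbers but the structure: the comparison only becomes fair once both sides enumerate the same cost categories.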
Intro to cloud computing — MegaCOMM 2013, JerusalemReuven Lerner
What is cloud computing? This is an introduction that I gave at MegaCOMM 2013, a conference for technical writers in Jerusalem. The talk describes how the combination of Internet access, virtualization, and open source have made computing a utility that we can turn on and off at will -- similar in some ways to electricity, water, and other utilities with which we're familiar.
Skycon 2012 - Public, private, and hybrid; software, platform, and infrastructure. This talk will discuss the current state of the Platform-as-a-Service space, and why the keys to success lie in enabling developer productivity, and providing openness and choice.
Thanks to Tony Whitmore for the audio and to Patrick Chanezon for some pieces of the content.
Can we hack open source #cloud platforms to help reduce emissions?Tom Raftery
Cloud computing is changing our lives but this change comes with a cost - pollution.
Can we hack open source cloud platforms to make them report their energy and (more importantly) their emissions, so we can choose the cleanest cloud?
Video of this talk is now online at http://redmonk.com/tv/2012/10/24/can-we-hack-open-source-cloud-platforms-to-help-reduce-emissions/
Symantec’s Avoiding the Hidden Costs of Cloud 2013 Survey found more than 90 percent of all organizations are at least discussing cloud, up from 75 percent a year ago. Other key survey findings showed enterprises and SMBs are experiencing escalating costs tied to rogue cloud use, complex backup and recovery, and inefficient cloud storage.
The 2013 Future of Cloud Computing 3rd Annual Survey was conducted in partnership with GigaOM Research and 57 industry collaborators. It focuses on Cloud adoption, growth, investment, and key trends emanating from the 2011 and 2012 surveys. For additional information and to get involved follow us @futureofcloud #futurecloud and visit http://www.mjskok.com/resource/2013-future-cloud-computing-3rd-annual-survey-results.
AWS Canberra WWPS Summit 2013 - Cloud Computing with AWS: Introduction to AWS - Amazon Web Services
Amazon Elastic Compute Cloud (Amazon EC2) provides resizable compute capacity in the cloud and is often the starting point for your first week using AWS. This session will introduce these concepts, along with the fundamentals of EC2, by employing an agile approach that is made possible by the cloud. Attendees will experience the reality of what a first week on EC2 looks like from the perspective of someone deploying an actual application on EC2. You will follow them as they progress from deploying their entire application from an EC2 AMI on day 1 to more advanced features and patterns available in EC2 by day 5. Throughout the process we will identify cloud best practices that can be applied to your first week on EC2 and beyond.
Curious about the cloud? We've got answers. Join HOSTING for an overview of cloud hosting and computing basics. From the history of the cloud to the projected future, we'll investigate the foundation of this $2.1 billion industry.
Disaster Recovery on Demand on the Cloud - Nati Shalom
How to avoid cloud outages and leverage cloud economics to keep costs down through automation of disaster recovery processes and on-demand deployment of the backup nodes.
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20... - Amazon Web Services
Running your Amazon EC2 instances in Auto Scaling groups allows you to improve your application's availability right out of the box. Auto Scaling replaces impaired or unhealthy instances automatically to maintain your desired number of instances (even if that number is one). You can also use Auto Scaling to automate the provisioning of new instances and software configurations, as well as to track usage and costs by app, project, or cost center. Of course, you can also use Auto Scaling to adjust capacity as needed - on demand, on a schedule, or dynamically based on demand. In this session, we show you a few of the tools you can use to enable Auto Scaling for the applications you run on Amazon EC2. We also share tips and tricks we've picked up from customers such as Netflix, Adobe, Nokia, and Amazon.com about managing capacity, balancing performance against cost, and optimizing availability.
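The self-healing behavior described above - replacing impaired instances until the group is back at its desired capacity - can be modeled as a toy reconciliation loop. The instance records and IDs here are illustrative; this is not the EC2 or Auto Scaling API.

```python
def reconcile(instances, desired):
    """Toy model of an Auto Scaling group's reconciliation step:
    drop unhealthy instances and launch replacements until the
    group is back at its desired capacity (even if that is one).
    """
    healthy = [i for i in instances if i["healthy"]]
    launched = []
    while len(healthy) + len(launched) < desired:
        # Stand in for launching a fresh instance from the group's config.
        launched.append({"id": f"new-{len(launched) + 1}", "healthy": True})
    return healthy + launched
```

Running this after every health check is, conceptually, what keeps the group at its target size without human intervention.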
Over 60 CIOs and tech leaders attended the #GoCloudWebinar on “AGILE INFRASTRUCTURE WITH WINDOWS AZURE” hosted by Aditi Technologies and Microsoft. Our CTO, Wade Wegner, and Microsoft Azure solution specialist Dina Frandsen discussed how Windows Azure Infrastructure Services (WAIS) can help organizations stay agile, what the Windows Azure technology environment looks like, and what it means to your organization.
We Explored
1. How IT teams can execute fast and stay lean with WAIS – A case study
2. Which enterprise workloads are best suited for WAIS migration
3. What are the best practices for planning, executing, and deploying WAIS
Download this slide deck and sign up at the link below to view the webinar - http://www.aditi.com/webevent/Agile_Infrastructure_with_WAIS/
Azure and Nutanix: your journey to the hybrid cloud - ICT-Partners
Looking for solutions for a flexible, scalable, cost-efficient, and future-proof datacenter? Then discover the power of Microsoft Azure & Nutanix: two modern platforms that let you combine the advantages of your on-premises infrastructure with the advantages of the public cloud.
Presentation from 30 April 2015
These days, EVERY workload is considered critical by someone in the organization. As a result, SLAs are shrinking. IT is challenged to meet these SLAs, but there isn’t enough budget to provide services like disaster recovery (DR) using traditional methods and infrastructure. The good news is that public cloud platforms, like AWS, are becoming the de facto infrastructure choice for DR. However, workload portability solutions that simplify cross-platform or cloud recovery are required to meet most RTO & RPO SLAs in the cloud. AWS provides the infrastructure we need to bring DR to tier 2 and tier 3 workloads that have never been able to afford it before. Now, we need orchestration and automation to make it scalable and reliable.
In this session you will learn key considerations and practical steps for getting to the AWS cloud and how you can leverage Amazon S3 storage for cost-effective disaster recovery. Dow Jones will also share details on their migration to AWS Cloud, the benefits realized there, and what the future looks like. Session sponsored by Commvault.
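The RTO/RPO framing above can be reduced to a toy feasibility check: worst-case data loss with periodic backups is the gap between backups (RPO), and recovery time is the time to restore and restart on the standby platform (RTO). All thresholds and inputs here are hypothetical, and a real analysis must also include transfer and restore-validation time.

```python
def dr_sla_check(backup_interval_min, restore_time_min, rpo_min, rto_min):
    """Toy DR SLA check. With backups every backup_interval_min
    minutes, the worst-case data loss equals that interval, so the
    RPO is met only if backups run at least that often; the RTO is
    met only if a full restore fits inside the allowed window.
    """
    return {
        "rpo_ok": backup_interval_min <= rpo_min,
        "rto_ok": restore_time_min <= rto_min,
    }
```

Checks like this make it concrete why tier 2 and tier 3 workloads, with looser RPO/RTO targets, become affordable on cloud infrastructure first.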
AWS 201 - A Walk through the AWS Cloud: App Hosting on AWS - Games, Apps and ... - Amazon Web Services
Playfish, Gumi, and Halfbrick are three of many gaming companies on AWS. Pinterest, Netflix and Flipboard host web and mobile applications using the AWS Cloud. What are the best practices to build an application to take advantage of the benefits of AWS? Learn about these approaches and how customers have built highly scalable, durable and reliable infrastructures to host their internet-facing businesses on AWS. Attend this complimentary webinar to learn more.
19th February 2013, AWS User Group UK, Meetup #3, Managing your apps on AWS: ... - AWS User Group UK
Agenda entry: Managing your apps on AWS: Real life lessons with GigaSpaces, Ron Zanver. We’ve all learned Murphy’s inevitable law the hard way – if it can go wrong, it often will! But that doesn’t mean we can’t be ready for such scenarios in the cloud. In this talk, GigaSpaces will focus on the AWS environment, which is dynamic and volatile by nature, and how to maximise your utilisation and minimise downtime. This session will show you how you can architect your cloud-hosted systems to sustain such outages, delving into how to choose the right PaaS for the job, addressing data centre failures, how to avoid single points of failure, and more.
Organiser's commentary: Ron Zanver from GigaSpaces came to talk about the inherent instability of life in the cloud, and what you can do to protect yourself - it's all about good design and architecture. He also introduced us to GigaSpaces' new Cloudify product, for abstracting estate management across multiple clouds and cloud vendors.
How to protect your application from outages and failures of cloud infrastructures: planning a disaster recovery architecture and using Cloudify for cloud abstraction and monitoring.
Flink Forward SF 2017: James Malone - Make The Cloud Work For You - Flink Forward
You should spend your time using the powerful Apache Flink ecosystem to get value from your data, not on your data processing infrastructure. Cloud environments can help you with this problem by providing managed services and infrastructure. Since Google Cloud Dataproc, Google's managed service to power the Apache big data ecosystem, runs Flink, you can easily combine the benefits of cloud with your Flink data pipelines. With new support for Flink and long-running streaming jobs, we will show you how you can set up a cluster and a streaming job in less than three minutes.
Scaling Databricks to Run Data and ML Workloads on Millions of VMs - Matei Zaharia
Keynote at Scale By The Bay 2020.
Cloud service developers need to handle massive scale workloads from thousands of customers with no downtime or regressions. In this talk, I’ll present our experience building a very large-scale cloud service at Databricks, which provides a data and ML platform service used by many of the largest enterprises in the world. Databricks manages millions of cloud VMs that process exabytes of data per day for interactive, streaming and batch production applications. This means that our control plane has to handle a wide range of workload patterns and cloud issues such as outages. We will describe how we built our control plane for Databricks using Scala services and open source infrastructure such as Kubernetes, Envoy, and Prometheus, and various design patterns and engineering processes that we learned along the way. In addition, I’ll describe how we have adapted data analytics systems themselves to improve reliability and manageability in the cloud, such as creating an ACID storage system that is as reliable as the underlying cloud object store (Delta Lake) and adding autoscaling and auto-shutdown features for Apache Spark.
This presentation provides an introduction to the Cloudify integration plugin with Terraform.
This integration allows Terraform users to use Cloudify to manage the configuration and workflows of applications on top of infrastructure that was created by Terraform.
What A No Compromises Hybrid Cloud Looks Like - Nati Shalom
Expectation vs. reality of a typical enterprise cloud journey
Lessons learned on how to set a cloud-native strategy without compromising on the least common denominator, nor going through a complete rewrite
It has long been debated whether OpenStack is production ready. In this session you will learn how a major bank has gone to production with more than 5000 VMs, delivering a 40% decrease in cost, deployment time reduced from weeks to hours, 56 new technologies introduced, and 7 new platforms launched - all in under a year. Learn how their platform built on Rackspace and RHEL, coupled with best-of-breed open source tooling - SaltStack, Jenkins, Cloudify, and Nexus - are the enablers for production-grade OpenStack.
http://sched.co/7fH1
Orchestration tool roundup: Kubernetes vs. Docker vs. Heat vs. Terraform vs... - Nati Shalom
Video recording: https://www.youtube.com/watch?v=tGlIgUeoGz8
It’s no news that containers represent a portable unit of deployment, and OpenStack has proven an ideal environment for running container workloads. However, where it usually becomes more complex is that many times an application is often built out of multiple containers. What’s more, setting up a cluster of container images can be fairly cumbersome because you need to make one container aware of another and expose intimate details that are required for them to communicate which is not trivial especially if they’re not on the same host.
These scenarios have instigated the demand for some kind of orchestrator, and the list of container orchestrators is growing fairly fast. This session will compare the different orchestration projects out there - from Heat to Kubernetes to TOSCA - and help you choose the right tool for the job.
Session link from the summit: https://openstacksummitmay2015vancouver.sched.org/event/abd484e0dedcb9774edda1548ad47518#.VV5eh5NViko
Real World Application Orchestration Made Easy on VMware vCloud Air, vSphere ...Nati Shalom
Looking for application orchestration in a hybrid or multi-cloud environment? You’ve got to hear about TOSCA orchestration. TOSCA (Topology and Orchestration Specification for Cloud Applications), brought to you by the same people who brought us XML, enables you to seamlessly migrate your workloads across environments or build a hybrid deployment that runs simultaneously across the VMware cloud offering.
Join our Cloud Online Meetup to learn how Cloudify’s TOSCA-compliant orchestration can be your common management interface across the VMware cloud offering, OpenStack and heterogeneous cloud environments.
Speakers:
Nati Shalom, Founder and CTO at GigaSpaces, is a thought leader in Cloud Computing and Big Data technologies. Shalom was recently recognized as a Top Cloud Computing Blogger for CIOs by The CIO Magazine, and his blog is listed as an excellent blog by YCombinator. Shalom is the founder and one of the leaders of the OpenStack Israel group, and is a frequent presenter at industry conferences.
Paco Gomez, Senior Solution Architect at VMware vCloud Air. Paco evaluates and integrates strategic solutions that help vCloud Air clients benefit from VMware's hybrid cloud and application services. Paco is a seasoned technologist with extensive experience in diverse fields including mainframes, distributed systems, enterprise development, cloud computing, mobile, assistive technology, electrical engineering, and embedded systems. Across his career, Paco has held positions in consulting and sales engineering.
OpenStack Juno The Complete Lowdown and Tales from the SummitNati Shalom
This presentation covers the main points from the summit and the OpenStack Juno release. It also covers how users are using OpenStack, based on the recent user survey.
Application and Network Orchestration using Heat & ToscaNati Shalom
The buzzwords Neutron, Heat, and TOSCA come up quite often when it comes to OpenStack - and many of us are still trying to make sense of the terminology and its place in the OpenStack world.
OpenStack Neutron provides APIs for creating network elements, OpenStack Heat provides an orchestration engine for automating the setup and configuration of OpenStack infrastructure, and TOSCA is a standard for templating and defining application topology and policies (which forms the basis for Heat). In this context, it really makes sense to put these together to achieve application and network automation for OpenStack on steroids.
In this session we will learn how we can use the robust combination of Heat and TOSCA to configure and control resources on Nova and Neutron in order to automate the network configuration as part of the application deployment.
The session will include a demo and code examples that show how you can configure virtual networks, attach public IPs, set up security groups, set up load balancing and automatically scale up/down and more. You will leave this session understanding where Neutron meets Heat and TOSCA.
This talk was delivered as part of OpenStack Paris summit - 2014 - http://openstacksummitnovember2014paris.sched.org/event/2b85b682ccaf3a5961e463b61e2403f8#.VFeuG_TF8mc
During the past few years we’ve seen our entire data center become software defined. This includes the compute, storage, network, and also configuration. This new data center is the cloud.
The missing piece in the puzzle:
While this is pretty much old news, there is one big piece missing from this puzzle: the operator itself.
The operator is responsible for running processes such as:
* Installation of new apps
* Upgrades and updates of new features or patches
* Performance tuning
* Handling failure
* Managing the capacity to meet the scaling demand.
Most of those tasks today involve lots of human intervention. Users who recognize that gap try to mitigate it with their own custom automation - usually in the form of scripts on top of configuration management. Those custom scripts tend to grow fairly quickly, to the point where they become unmanageable.
This presentation will introduce how we can use an orchestrator to automate those tasks and thereby create a software-defined operator.
Complex Analytics with NoSQL Data Store in Real TimeNati Shalom
NoSQL engines are often limited in the types of queries they can support due to the distributed nature of the data. In this session we will learn patterns for overcoming this limitation and combining multiple query semantics in NoSQL-based engines.
We will demonstrate a combination of key/value, SQL-like, document-model, and graph-based queries, as well as more advanced topics such as handling partial updates and querying through projections. We will also demonstrate how to create a mashup of those APIs, i.e., write fast through the key/value API and execute complex queries on the same data through SQL.
- See more at: http://nosql2014.dataversity.net/sessionPop.cfm?confid=81&proposalid=6335#sthash.PNSZi5TJ.dpuf
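To make the "mashup" idea concrete, here is a minimal, hypothetical sketch of the pattern - writing through a key/value-style API and querying the same data through SQL. SQLite is used purely as a stand-in data store; the engine discussed in the talk (GigaSpaces XAP) has a different, distributed API.

```python
import json
import sqlite3

# In-memory store standing in for a distributed data grid (illustrative only).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (key TEXT PRIMARY KEY, name TEXT, age INTEGER, doc TEXT)")

def put(key, value):
    """Key/value-style write path: caller thinks in keys and objects, not SQL."""
    db.execute("INSERT OR REPLACE INTO users VALUES (?, ?, ?, ?)",
               (key, value["name"], value["age"], json.dumps(value)))

put("u1", {"name": "Ada", "age": 36})
put("u2", {"name": "Grace", "age": 45})

# SQL-style query over the very same data written through the key/value path.
rows = db.execute("SELECT name FROM users WHERE age > 40").fetchall()
print(rows)  # [('Grace',)]
```

The design point is that both APIs address one copy of the data, so a fast write path and a rich query path need not be separate stores.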
Is Orchestration the Next Big Thing in DevOpsNati Shalom
DevOps processes (such as continuous deployment and delivery) often involve writing many custom scripts that are triggered by the build system. With that approach, it is relatively hard to trace the deployment process and troubleshoot when something goes wrong. Additionally, custom scripts are often not written in an easily understood manner. In this session we will walk through specific DevOps workflows (such as install, update, etc.) using Riemann as the framework in question and see the steps required to automate those processes. We will also discuss how Cloudify uses Riemann to provide simple execution and monitoring of those workflow processes. We will share how one customer, PaddyPower, was able to leverage Cloudify to transition their traditional IT into a DevOps environment, bridging the gap between Dev and Ops.
When networks meets apps (open stack atlanta)Nati Shalom
Recent advancements in OpenStack capabilities have made the cloud better tuned to enterprise needs by introducing much more flexible network designs and networking services, with the tradeoff of making the cloud more complex.
In this session we will describe how we can leverage the power of these new networking advancements without exposing the complexity to the end user. We will present alternative approaches and their tradeoffs for automating the deployment of a typical n-tier enterprise application, which includes a multi-tenant environment, separate networks for admin and applications, cross-region networking, attaching a floating IP, setting up security groups, etc., all through a combination of Heat, TOSCA, Chef, Puppet, and more.
The experience of automating continuous delivery processes with Chef and Cloudify through an application-centric approach to DevOps, and how this model transformed PaddyPower's traditional IT into DevOps, keeping their Devs and their Ops happy.
References:
---------------
- Cloudify & Chef : http://www.cloudifysource.org/guide/2.7/integrations/chef_documentation
- Blog Post: http://www.cloudifysource.org/2013/10/27/application_centric_approach_to_devops.html
- Earlier Video Presentation : http://www.youtube.com/watch?v=YhDNKyP_s7U
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. I have also often seen developers implement front-end features by simply following a framework's standard rules, thinking that this is enough to launch the project successfully - and then the project fails. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop with the participants, exploring different ways to think about quality and testing in different parts of the DevOps infinity loop.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Avoiding Cloud Outage
1. Protect your app from Outages
Nati Shalom CTO GigaSpaces
@natishalom
May 2013
2. AWS and outages
Outage impact
Disaster Recovery – it’s all about redundancy!
Cloudify as a solution for redundancy
Demo with Cloudify on EC2
® Copyright 2013 GigaSpaces Ltd. All Rights Reserved
AGENDA
3. AWS USAGE
• AWS – around 0.5M servers
• Facebook – less than 0.1M servers
• Google – around 1M servers
5. OUTAGE – APRIL 21, 2011
6. OUTAGE - JUNE 29, 2012
7. OUTAGE - OCTOBER 22, 2012
8. OUTAGE - CHRISTMAS EVE 2012
9. NOT ONLY AMAZON
28 December 2012 - some owners of Microsoft's Xbox 360 gaming console were unable to access some of their cloud-based storage files.
26 July 2012 - service for Microsoft’s Windows Azure Europe region went down for more than two hours.
29 February 2012 - the ultimate result was service impacts of 8-10 hours for users of Azure data centers in Dublin, Ireland, Chicago, and San Antonio.
10. THAT’S WHAT YOU EXPECT?
99% - 3.65 days downtime
99.9% - 8.76 hours downtime
99.99% - 53 minutes downtime
99.999% - 5.26 minutes downtime
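The downtime figures above follow directly from the availability percentage over a 365-day year (the slide rounds 52.56 minutes up to 53). A quick sketch to reproduce them:

```python
def downtime_per_year(availability_pct):
    """Yearly downtime, in seconds, implied by an availability percentage."""
    seconds_per_year = 365 * 24 * 3600
    return seconds_per_year * (1 - availability_pct / 100)

# 99% availability allows 3.65 days of downtime per year;
# 99.999% allows only about 5.26 minutes.
for pct in (99.0, 99.9, 99.99, 99.999):
    s = downtime_per_year(pct)
    print(f"{pct}% -> {s / 86400:.2f} days, {s / 3600:.2f} hours, {s / 60:.2f} minutes")
```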
11. OUTAGE IMPACT – DESIGN FOR FAILURES
Outage could cost…
$89K per hour for Amadeus
$225K per hour for PayPal!
14. PREPARE FOR DISASTER RECOVERY
•Dedicated expert for DR architecture
•Define target recovery time & point
•Assume every tier can fail
•Use monitoring and alerts
•Document your operational processes
22. BUILT-IN SUPPORT FOR MANAGING DATA IN THE CLOUD
Real Time: Storm, Elastic Caching (XAP)
Relational DB Clusters: MySQL, PostgreSQL
NoSQL Clusters: MongoDB, Cassandra, Couchbase, ElasticSearch
Hadoop: Hadoop (Hive, Pig, ...), ZooKeeper
24. VERIFI (CURRENT) DEPLOYMENT ARCHITECTURE
[Architecture diagram] A single availability region (US-West: Oregon): an Internet-facing EC2 instance running mod_cluster, an EC2 instance running JBoss, and EC2 instances running PostgreSQL and Cassandra, each with an attached data volume. Deployed via 4 recipes.
25. TARGET ARCHITECTURE
[Architecture diagram] Two availability regions. US-West (Oregon): an Internet-facing EC2 instance running mod_cluster, an EC2 instance running JBoss, and EC2 instances running the Postgres master and Cassandra, each with an attached data volume. US-East (Virginia): the same topology, with a Postgres slave replicating from the master.
Bootstrap two EC2 clouds in different regions and install the “verifi” application on each. The second cloud will have a slightly modified (extended) Postgres recipe for acting as a slave, plus no running app servers. Upon the primary zone's failure, the second cloud will spin up instances of the app servers and turn its data instance into the master, then bootstrap another “slave” cloud in another zone.
26. FAILOVER SCENARIO
[Failover diagram] Cloud #1 (region US-West Oregon: app servers + PostgreSQL) is liveness-polled by Cloud #2 (region US-East Virginia: PostgreSQL slave). When a region failure occurs, Cloud #2 turns its Postgres slave into the master and starts app server instances, then bootstraps another cloud, Cloud #3 (region US-West California: PostgreSQL), in a different region using the same application recipe used to bootstrap Cloud #2.
Upon initial deployment, the primary deployment of the application is bootstrapped onto Cloud #1; another, slightly modified application recipe is bootstrapped as Cloud #2, polling Cloud #1 for failure and acting as a PostgreSQL slave.
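The liveness-poll-then-promote workflow in this scenario can be sketched as a simple control loop. This is an illustrative outline only - the callbacks (promote_slave, start_app_servers, bootstrap_new_slave) are hypothetical stand-ins for what the deployment recipes would actually execute:

```python
import time

def primary_alive(endpoint, probe):
    """Liveness poll: True if the primary region answers the probe."""
    try:
        return probe(endpoint)
    except Exception:
        return False

def failover_loop(primary, probe, promote_slave, start_app_servers,
                  bootstrap_new_slave, interval=30, max_misses=3):
    """Poll the primary; after several consecutive misses, fail over."""
    misses = 0
    while True:
        if primary_alive(primary, probe):
            misses = 0
        else:
            misses += 1
            if misses >= max_misses:
                promote_slave()        # turn the Postgres slave into the master
                start_app_servers()    # spin up app-server instances
                bootstrap_new_slave()  # bootstrap a fresh slave in another region
                return
        time.sleep(interval)
```

Requiring several consecutive misses before promoting guards against a single dropped probe triggering an unnecessary (and expensive) failover.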
27. NEXT STEPS
[Diagram] Availability increases along three steps:
• Across AWS zones: 1 application + overrides, 1 cloud driver (supported by Verifi phase #1)
• Across AWS regions: 1 application + overrides, 1 cloud driver
• Across clouds (AWS, Rackspace, Azure, etc.): 1 application + overrides, several cloud drivers
28. EVOLUTION PATH
[Diagram] Both availability and complexity grow along the evolution path: multi instance → multi zone → multi region → multi cloud/provider.
29. SUMMARY
AWS and outages
Outage impact
Disaster Recovery – it’s all about redundancy!
• Cloning your environment – app stack
• Cloning your DB – replication
Cloudify as a solution for redundancy
• Use recipes to work on any cloud
• Fast and customized data replication
Demo with Cloudify on EC2
A high-ranking Amazon executive said there are 60,000 different customers across the various Amazon Web Services, and most of them are not the startups that are normally associated with on-demand computing. Rather, the biggest customers in both number and amount of computing resources consumed are divisions of banks, pharmaceutical companies, and other large corporations who try AWS once for a temporary project and then get hooked. According to Statspotting.com in March 2012, a researcher estimates that Amazon Web Services is using at least 454,400 servers in seven data center hubs around the globe. Let us try this: Google is powered by a million servers, maybe a little more than that, and Amazon has half a million servers. Now things fall into place. Facebook, the service that takes up one fourth of all our time online, is powered by fewer than 100,000 servers. The biggest customers include Pinterest, Instagram, Netflix, Heroku, Quora, Foursquare, etc. Amazon Web Services runs more than 835,000 requests per second for hundreds of thousands of customers in 190 countries, including 300 government agencies and 1,500 educational institutions.
The Amazon cloud proved itself in that sufficient resources were available worldwide, such that many well-prepared users could continue operating with relatively little downtime. But because Amazon’s reliability had been incredible, many users were not well prepared, leading to widespread outages. The Amazon EC2 outage of April 2011 was the worst in cloud computing’s history at the time. It made the front page of many news sites, including the New York Times, probably because many people were shocked by how many web sites and services rely on EC2. Microsoft Azure outages: 28 December 2012 - some owners of Microsoft's Xbox 360 game console were unable to access some of their cloud-based save storage files. 26 July 2012 - service for Microsoft’s Windows Azure Europe region went down for more than two hours. 29 February 2012 - the ultimate result was service impacts of 8-10 hours for users of Azure data centers in Dublin, Ireland, Chicago, and San Antonio.
Some parts of Amazon Web Services suffered a major outage. A portion of volumes utilizing the Elastic Block Store (EBS) service became "stuck" and were unable to fulfill read/write requests. It took at least two days for service to be fully restored. Reddit, one of the better-known sites to go down due to the error, said it has 700 EBS volumes with Amazon. Sites like Quora and Reddit were able to come back online in "read-only" mode, but users couldn't post new content for many hours.
For the second time in less than a month, Amazon’s Northern Virginia data center suffered an outage, impacting many popular services such as Instagram, Pinterest, and Netflix. Several websites that rely on Amazon Web Services were taken offline due to a severe storm of historic proportions in the Northern Virginia area, where Amazon's largest datacenter is located. Amazon previously suffered an outage in its Northern Virginia facilities on June 14, 2012. A line of severe storms packing winds of up to 80 mph caused extensive damage and power outages in Virginia. Dominion Virginia Power crews are assessing damage and will be restoring power where it is safe to do so.
A major outage occurred, affecting many sites such as Reddit, Foursquare, Pinterest, and others. The cause was a latent bug in an operational data collection agent: a memory leak and a failed monitoring system caused the Amazon Web Services outage on Monday that took out Reddit and other major services. In a post Friday night, AWS explained that the problem arose after a simple replacement of a data collection server. After installation, the server did not propagate its DNS address correctly, and so a fraction of servers did not get the message. Those servers kept trying to reach the server, which led to a memory leak that then went out of control due to the failure of an internal monitoring alarm. Eventually the system ground to a virtual stop, and millions of customers felt the pain.
Amazon AWS again suffered an outage, causing websites such as Netflix instant video to be unavailable for some customers, particularly in the northeastern US. Amazon later issued a statement detailing the issues with the Elastic Load Balancing service that led up to the outage. The disruption began shortly after noon Pacific time on December 24, when data was accidentally deleted by a developer during maintenance on the East Coast Elastic Load Balancing system, which is designed to distribute traffic volume among servers. "Netflix is designed to handle failure of all or part of a single availability zone in a region, as we run across three zones and operate with no loss of functionality on two," the company said in a blog post this afternoon. "We are working on ways of extending our resiliency to handle partial or complete regional outages."
Fault-tolerant systems are measured by their uptime/downtime for end users. Amazon says it is "committed" to 99.95 percent uptime.
Although AWS went offline for only a few hours, the downtime did have an impact on customers’ businesses. There is no known data for the number of people affected by a cloud computing service outage. It is estimated that the travel service provider Amadeus loses $89,000 per hour during any cloud computing outage, while PayPal loses around $225,000 per hour.
DR – the process and procedures you take to restore your system after a catastrophic event. Cloud infrastructure has made DR much easier and more affordable compared to previous options. The cloud can also suffer from large-scale failures because of network, power, or other IT failures. Application owners need to be responsible for HA and DR – they can use multiple servers, AZs, regions, and even clouds. Zones within a region share a LAN, so they have high bandwidth, low latency, and private IP access. Zones utilize separate power resources. Regions are "islands" – they share no resources.
Each cloud is unique in many aspects, offering different APIs and functionality to manage resources: a different set of available resources; different formats, encodings, and versions; different security groups, machine images, snapshots, etc.
Make sure to have a dedicated expert to manage your DR architecture, processes, and testing. Define what your target recovery time and recovery point are. Be pessimistic and design for failure (assume everything will fail and design a solution that is capable of handling it). Avoid single points of failure – all parts of your app should be highly available (in different AZs/regions/clouds): load balancers, app servers, web servers, message bus, database. Use monitoring and alerts for failover processes and for every change in state. Document your DR operational processes and automations. Try to "break" different parts of your application. Try different ways to break it – unplug the network, turn a machine off, etc. Then try it again.
Netflix has open sourced "Chaos Monkey," its tool designed to purposely cause failure in order to increase the resiliency of an application in Amazon Web Services (AWS). It’s a timely move, as AWS has had its fair share of outages. With tools like Chaos Monkey, companies can be better prepared when a cloud infrastructure has a failure. In a blog post, Netflix says that this is the first of several tools that it will open source to help companies better manage the services they run in cloud infrastructures. Next up is likely to be Janitor Monkey, which helps keep an environment tidy and costs down. Chaos Monkey has achieved its own fame for its innovative approach. According to Netflix, the tool "randomly disables production instances to make sure it can survive common types of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption."
Netflix provides an excellent toolset for surviving outages at the operations level. In this part I want to zoom in more on the design implications for our application.
The core principle for surviving failure is actually fairly simple, and in fact applies to any system, not just the cloud, whether it happens to be an airplane, a missile, a car, etc. In the end, it's all about redundancy. The degree of tolerance is often determined by how many alternate systems (or parts of a system) we have in our design and how much they are separated from one another. It is also determined by how fast we can detect the broken part in our system and make the switch.
In software terms, the common parts that comprise our system fall into two main groups: the business logic and the data. Making a redundant software application that can survive failure is often a matter of setting up clones of both of those parts.
We need abstraction: we don't want to be locked in. We want tools that offer this abstraction layer both for daily management and for DR. Such a tool should translate our architectural concepts into cloud-specific properties (using recipes).
To clone our application's business logic, we need to ensure that all parts of our system run the exact same version of every software component. That includes not just the binaries but also the configuration and the scripts that run the application; more importantly, all post-deployment procedures such as failover, scaling, and monitoring must be kept consistent. What often makes cloning the business logic complex is that the information on how to run the application is scattered across many sources, such as scripts, as well as the minds of the people who run those apps. To make cloning much simpler, and thus more consistent, we need to capture all of that information in one place. Configuration management tools such as Chef and Puppet, and in the case of Amazon, CloudFormation, can help in this regard.
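The "capture everything in one place" idea can be sketched as a single deployment descriptor, in the spirit of a Chef/Puppet/CloudFormation recipe. All names, versions, and scripts below are hypothetical illustrations, not a real recipe format.

```python
# One descriptor holds the binaries, configuration, and post-deployment
# procedures, so every clone of the tier is provisioned identically.

APP_RECIPE = {
    "binaries": {"tomcat": "7.0.16", "myapp.war": "1.4.2"},
    "config": {"db_url": "jdbc:mysql://db-primary/app", "pool_size": 20},
    "lifecycle": {                      # post-deployment procedures
        "start": "catalina.sh run",
        "failover": "promote_standby.sh",
        "scale_out_when": "avg_requests_per_sec > 500",
    },
}

def provision(recipe):
    """Return the ordered steps a clone must execute. Every clone gets
    the same steps because they all come from the same recipe."""
    steps = [f"install {name}=={ver}" for name, ver in recipe["binaries"].items()]
    steps += [f"set {key}={val}" for key, val in recipe["config"].items()]
    steps.append(f"run {recipe['lifecycle']['start']}")
    return steps
```

Because failover and scaling rules live in the same descriptor as the binaries and config, cloning the tier to another zone, region, or cloud cannot silently drop them.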
RDS Read Replica: Amazon RDS uses MySQL's built-in replication functionality to create a special type of DB Instance, called a Read Replica, that allows you to elastically scale out beyond the capacity constraints of a single DB Instance for read-heavy database workloads. Once you create a Read Replica, database updates on the source DB Instance are replicated to the Read Replica using MySQL's native, asynchronous replication. Because Read Replicas leverage standard MySQL replication, they may fall behind their sources, and they are therefore not intended to be used for enhancing fault tolerance in the event of a source DB Instance failure or Availability Zone failure.
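The consequence of asynchronous replication can be made concrete with a small read/write routing sketch. The endpoint names and lag threshold below are illustrative assumptions, not an AWS API: writes always go to the source, and reads fall back to it when the replica may be stale.

```python
# Hedged sketch of read/write splitting against an asynchronous replica.
MAX_ACCEPTABLE_LAG_SECONDS = 5   # illustrative threshold

def route(query_is_write, needs_fresh_data, replica_lag_seconds):
    """Return which endpoint should serve the query."""
    if query_is_write:
        return "primary"      # all writes go to the source DB instance
    if needs_fresh_data or replica_lag_seconds > MAX_ACCEPTABLE_LAG_SECONDS:
        return "primary"      # the replica may not have this data yet
    return "replica"          # scale out the read-heavy workload
```

This is also why a Read Replica is a scaling tool rather than a failover tool: at the moment of a source failure, the replica may be missing the most recent committed writes.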
There are lots of patterns for avoiding failure. It took Netflix a great deal of development work to build a framework that handles them well. Most users and startups don't have the luxury of implementing them themselves. You need a tool that will enable you to automate those patterns in a consistent way. Enter Cloudify.
Any App, Any Stack: move your application to the cloud without making any code changes, regardless of the application stack (Java/Spring, Java EE, Ruby on Rails, …), database store (relational such as MySQL, or non-relational such as Apache Cassandra), or any other middleware components it uses. This enables you to achieve your objective of no code changes.
To make setting all of this up simpler, we tried to bake those patterns into ready-made tools, scripted as out-of-the-box recipes. The Cloudify recipes include:
- Database cluster recipes with support for MySQL, MongoDB, Cassandra, PostgreSQL, etc.
- Integration with Chef and Puppet
- Automation of failover, scaling, and continuous maintenance of your application
- Application recipes that let you capture every aspect of running your application, including post-deployment aspects such as failover, scaling, and monitoring
The cloud brings a lot of promise for making our business more agile. The cloud has also become a huge shared infrastructure, in which every failure has a much more significant impact on our business worldwide. The experience of the past year has taught us that even a robust cloud infrastructure such as Amazon's can fail. Through this experience we've learned that, rather than relying on the infrastructure to prevent failure, we need to design our systems to cope with failure and get used to failure as a way of life. Having said that, the investment required to build a robust application can be fairly large, and not something everyone can afford. Using tools like Cloudify, Chef, Puppet, and, if you're a pure Amazon shop, Netflix <framework> can greatly reduce this effort by making a lot of those patterns pre-baked into recipes.