BigDoor's Jeff Malek Gluecon Presentation

This document discusses lessons learned from building a startup entirely in the cloud on AWS and dealing with an outage in April 2011. The key points are: 1. The importance of scripted repeatability and automation to easily set up and tear down cloud infrastructure. This allows one person to manage many server instances. 2. Eliminating single points of failure by distributing servers across availability zones and enabling failover of load balancers, application servers and databases. 3. The importance of clear communication during an outage to keep stakeholders informed of the status and resolution.

1. Retrospective from a startup built in the cloud: top 3 big lessons from the AWS outage on 04.21.2011, plus 4,369 other smaller ones (5/27/2011)

2. What a country: entrepreneurial resiliency

3. (true story) “robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs, AWS, the BD API”

4. Boom

5. good to be home! Go Buffs

6. me: previous startup: teams in 3 countries; highly transactional system; MS tech (IIS/MS SQL Server); co-located, leased/owned hardware; 0% in cloud; $75M/yearly rev

7. me: current startup: systems 100% on AWS; 99% free/open-source software; standing on the shoulders of giants

8. fault tolerance: 3 to 47 important failearnings and 4,369 less important ones

9. in the context of our startup, of course; YMMV depending on velocity

10. Ruger

11. The Ruger Fault Equivalency: time = money; fault tolerance = time² - risk tolerance. Also known as: 'Fast, good and cheap: pick two'

12. system design philosophy: leverage proven, open-source tech in the cloud to build a scalable, reliable, secure operational foundation quickly

13. So how do you achieve the right level of fault tolerance in the cloud? 3 tenets

14. Tenet #1: Scripted Repeatability; Tenet #2: SPOF Elimination; Tenet #3: Clear-Cut Communication

15. who here has used AWS?

16. Tenet #1prepare a fault-tolerant foundation with scripted repeatability aka automation 5/27/2011 16

17. from the start: script the non-interactive install of your tools and OS; custom AMI; Debian: great package management; based on Eric Hammond's work: http://alestic.com/

18. which will allow you to script the setup/tear-down of your stack

19. which will allow you to script system tests: integrity (3-4K tests), performance (30-40K tests), load, capacity (2-4M requests)

20. A/B system test results: MySQL Percona upgrade

21. That's how 1 person set up and managed a network comprised of 90+/- server instances for 1.5 years while serving various other roles, without having to leave their chair. Try that with real hardware.

22. Tenet #2: SPOF Elimination. We don't need no stinkin single points of failure.

23. SPOF examples: cloud provider, region, zone, load balancer, app server, database, Fred

24. Cloud provider fail-over? e.g. AWS -> Rackspace

25. Region fail-over? e.g. useast -> uswest within AWS. Nah.

26. Zone fail-over? Yes. (diagram labels: US-WEST, US-EAST)

27. Zone fail-over best practices: are you using auto-scaling? No: distribute server instances evenly between 2 or more zones. Yes: trigger scaling on network I/O or custom metrics.

28. Load-balancer (ELB), app server, database fail-over? Yes.

29. So it's actually all about reduction of the right SPOFs for your business context. Just adding the ability to fail over and have backups within a region is huge! Probably enough for most. What about Fred?

30. Tenet #3: Clear-Cut Communication. Transparency is soooo 2010

31. During an outage, communicating the right things at the right time: hard. But not that hard.

32. Three Tenets Revisited. Tenet #1: Scripted Repeatability. Tenet #2: SPOF Elimination. Tenet #3: Clear-Cut Communication

33. Notes

Editor's Notes

Nothing to see here, move along
'What a country': my dad always says this, and I like it. So, Brad Feld was in our offices recently and was asking how AWS was working out for us. I'd replied very much in the positive, with a few exceptions regarding their support services. That night at dinner Brad was talking about how resilient our culture is for entrepreneurs; how we can fail and retry here in the United States, doing things that folks might get strung up for in other countries. The following night, I found myself exploring analogies between that idea and computing systems, and wound up pulling out my phone and typing up a Twitter post.
It went something like this. This was going to be the brilliant culmination of my Twitter career to date. I was almost ready to hit the send button when I started getting alerts from our systems. The alerts were appearing literally right above what I had written: 'system DOWN'. Oh, the irony. Wish I had a screenshot from my phone.
That was the evening of 4/20, morning of 4/21: the AWS outage. It lasted for a number of days; our API was intermittently affected for about 12 hours, and that could have been mitigated. That outage totally sucked for so many reasons. I'm hoping that by sharing some of my experience with AWS, you'll gain some insights that may help you prepare adequately. I'm also hoping that this can turn into a conversation toward the end, so you can share your experiences as well.
So who am I? My name is Jeff Malek. I grew up here, and my folks are still here; in fact today is their 43rd anniversary. I graduated from CU in '93 after 6 long years and a suspension, during which time I hitch-hiked around the country, winding up in Hawaii. I graduated, moved around, met some great friends, and helped to start up a company.
I was at Zango for 10 years, responsible for engineering, QA and product development teams distributed across three countries: 50+ people who built and maintained the high-transaction system that resulted in $75M yearly revenue at its peak. It leveraged the client-side software I wrote in the C programming language, which talked to backend systems built on Windows technology (IIS, MS SQL Server, etc.) sitting on co-located, purchased hardware.
BigDoor: over 2 years old. A platform that powers game mechanics and social loyalty programs for digital communities. A free RESTful API that you can brand any way you want. Built in the cloud on AWS using 99.99% free/open-source software. Even after the outage, I'm still a huge fan of AWS, and generally very impressed with what they've built and their speed of innovation. After the outage some great folks from the AWS team visited us from across the street (I just realized they were across the street). My team gave me shit for bringing sodas into the conference room for them (bitter). I just figured they'd get dry from all the explaining they were about to do. When was the last time you got a newsletter letting you know that a vendor's pricing was going down? Funded by Foundry and Brad Feld in 2010, who as you know also do a lot to make this event happen. You guys are money.
So that's what I'm here to talk about: fault tolerance in the cloud, AWS and hopefully some API stuff of interest. I'll share this presentation, including notes and supporting material, later on our blog at www.bigdoor.com.
I want to talk about all of this in the context of our startup, of course. Ultimately the AWS outage didn't result in any major changes to the way we do things. While there were a few smaller things that we bumped up the priority chain, there's a certain level of risk that a startup is willing to live with.
My girlfriend Jenny and I got Ruger as a puppy, right when BigDoor started. We raised him while building out our operational infrastructure, working out of our house. So he's kind of our mascot, and to help put things in context, I came up with a formula: the Ruger Fault Equivalency.
IOW, given a low tolerance for risk, you can create a highly fault-tolerant system if you have a lot of time and/or money. That's not BigDoor. Conversely, executing with a higher tolerance for risk gets you to market faster with less money, but with lower fault tolerance. For us, scalability is more important than extremely high fault tolerance. Startup = time^2 is low (little time and money). So, fun and interesting, but what does it mean in the context of BigDoor system design?
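Read literally, the two lines of the Ruger Fault Equivalency slide combine as below. This is only a restatement of the formula as given on the slide, which is why "time squared" in these notes stands for both time and money:

```latex
\begin{align*}
\text{time} &= \text{money}
  \quad\Longrightarrow\quad \text{time}^2 = \text{time} \times \text{money} \\
\text{fault tolerance} &= \text{time}^2 - \text{risk tolerance}
  = \text{time} \times \text{money} - \text{risk tolerance}
\end{align*}
```

So with little time and money the first term is small, and a high risk tolerance lowers the result further, which is the startup case described above.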
I designed the BigDoor systems at a high level with this philosophy in mind. A bit more regarding our context: Django/Python. Week-long sprints that end in a production code release. A 260G+ and growing transactional database, so still not that big. Peak so far: 18MM API requests/day, so still a ways to go. Response times need to be faster than 500ms.
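To put those numbers in per-second terms, here is a small back-of-the-envelope sketch. It assumes traffic is spread evenly over the day (real traffic is burstier) and treats the 500ms budget as the time each request is in flight; both are simplifications, not measurements from BigDoor.

```python
SECONDS_PER_DAY = 24 * 60 * 60

peak_requests_per_day = 18_000_000   # "18MM API requests/day" from the notes
response_budget_seconds = 0.5        # "faster than 500ms" from the notes

# Average request rate if the peak day's traffic were perfectly even.
average_rps = peak_requests_per_day / SECONDS_PER_DAY

# Little's law: requests in flight = arrival rate x time in system.
# Using the full budget gives an upper bound for this average rate.
max_in_flight = average_rps * response_budget_seconds

print(f"average requests/second: {average_rps:.1f}")
print(f"requests in flight at the 500ms budget: {max_in_flight:.1f}")
```

That works out to roughly 208 requests per second on average, with about 104 in flight at once if every request used the whole budget.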
OK, given that context: how do you achieve the right amount of fault tolerance in the cloud? Three basic tenets, in the context of the AWS outage: the first sets a foundation for fault tolerance; the second leverages the first to improve fault tolerance; and the third will help keep your customers around when you are in crisis mode, ultimately also improving fault tolerance.
Scripted repeatability. SPOF elimination. Clear-cut communication.
Get a count of audience members who are using or have used AWS. Low count? Give more background around what the various services are and do. High count? Give less explanation around what things are, and ask for others' best practices.
Nothing to see here, move along
AMIs (Amazon Machine Images: install images, OS blueprints) are used to start new server instances. Leverage pre-built AMIs. Debian has great package management; packages are verified and tested before making it into the main line, so there is less to think about. Thank you, Eric Hammond. A good best practice: use a single master AMI, re-built regularly via automation with new software, new package patches (apt) and your application code. We then tag per environment (test, staging, production) and switch services (Apache, MySQL) on and off during boot via init scripts. Another good practice: all app code and software config is checked out via SVN and baked into the AMI; an svn up during boot via init scripts enables fast initialization during auto-scaling activities.
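The boot-time part of that practice can be sketched as a function that works out what an init script should run on a single master image. This is a minimal illustration, not BigDoor's script: the role names, the role-to-service table and the /srv/app checkout path are all hypothetical, and the commands are only built and printed, not executed.

```python
# Hypothetical services present on the master image and the roles that use them.
SERVICES = ("apache2", "mysql")
ENABLED_BY_ROLE = {
    "app": {"apache2"},
    "db": {"mysql"},
}


def boot_commands(role, checkout_dir="/srv/app"):
    """Return the commands an init script would run for this instance's role."""
    if role not in ENABLED_BY_ROLE:
        raise ValueError(f"unknown role: {role!r}")
    # Refresh the code and config that were baked into the image.
    commands = [["svn", "up", checkout_dir]]
    # One image for every role: switch each service on or off at boot.
    for service in SERVICES:
        action = "start" if service in ENABLED_BY_ROLE[role] else "stop"
        commands.append(["service", service, action])
    return commands


for command in boot_commands("app"):
    print(" ".join(command))
```

Keeping the decision in a pure function makes it easy to test without booting an instance; a real init script would hand each command to something like subprocess.run.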
AWS has CloudFormation. They came out with that a few days after I'd finished pretty much doing the same. I wrapped the AWS command line tools in shell scripts. Since we're a Python shop, we're likely going to be using boto (which has matured quite a bit in the last two years) and Fabric.
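The shape of a scripted setup/tear-down can be sketched independently of the tool that does the work. In this hedged example the Stack class and the step names are invented for illustration; each up/down callable stands in for whatever wraps the AWS command line tools or boto in a real script.

```python
class Stack:
    """Ordered setup steps with matching tear-down steps."""

    def __init__(self):
        self._steps = []   # (name, up, down) in setup order
        self._done = []    # (name, down) for steps that have completed

    def add_step(self, name, up, down):
        self._steps.append((name, up, down))

    def setup(self):
        try:
            for name, up, down in self._steps:
                up()
                self._done.append((name, down))
        except Exception:
            # Leave nothing half-built behind if a step fails.
            self.teardown()
            raise

    def teardown(self):
        # Undo in reverse order: what was created last goes first.
        while self._done:
            _name, down = self._done.pop()
            down()


log = []
stack = Stack()
for step in ("database", "app servers", "load balancer"):
    stack.add_step(
        step,
        up=lambda s=step: log.append(f"up {s}"),
        down=lambda s=step: log.append(f"down {s}"),
    )
stack.setup()
stack.teardown()
print("\n".join(log))
```

Tearing down in reverse order and cleaning up after a failed setup are what make it cheap to stand up a full stack for a test run and throw it away afterwards.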
Nothing to see here, move along
Nothing to see here, move along
That’s a picture of the IBM RAMAC, built in 1956, which had 5M of storage and weighed a ton. We’ve come a long way, baby!
For anyone unfamiliar: if a system stops working when a part of it fails, that part is a single point of failure. So in every system there's potential for many single points of failure, proportional to system complexity. Because of the Ruger Fault Equivalency, the idea is to pick the right SPOFs and eliminate (or at least mitigate) them. I used the word 'elimination' here hoping that it would make some folks chuckle; it's really not possible to eliminate all SPOFs. You can mitigate them, though. So here are some examples, and I'll drill into which ones are critical in our context.
If your cloud provider goes out of business, you're hosed. SPOF. In AWS, a region is…etc. If a region disappears, you're hosed. SPOF. Within regions are zones. If an entire zone fails, you're hosed. SPOF. Same with load balancers, application servers, databases. And even Fred. If Fred is the only guy who knows your operational systems, and he trips over the extension cord, knocking himself out in the process: you're hosed. SPOF. The critical ones in our context, and likely in many others: zones and everything below.
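One simple way to apply that definition is to list what backs each layer and flag every layer with only one member. The inventory below is made up for illustration (the instance names are hypothetical), and counting members is a deliberately crude stand-in for a real dependency analysis.

```python
def find_spofs(inventory):
    """Return the layers that would take the system down if one member failed."""
    return sorted(layer for layer, members in inventory.items() if len(members) < 2)


# Hypothetical inventory: layer name -> the things currently backing it.
inventory = {
    "zone": ["us-east-1a"],
    "load balancer": ["elb-1", "elb-2"],
    "app server": ["app-1", "app-2", "app-3"],
    "database": ["db-primary"],
    "ops knowledge": ["Fred"],
}

spofs = find_spofs(inventory)
print(spofs)
```

Which of the flagged layers are worth fixing is the business decision the notes describe; the script only makes the list visible.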
Should you attempt to achieve high fault tolerance through cloud-to-cloud fail-over? Ruger Fault Equivalency says: cost prohibitive (time squared). RightScale, who provide a very cool cloud management system, apparently have some of this functionality, and will likely be the place to go for cloud-cloud fail-over in the future. They will also add roughly 30% to your overall AWS costs. Their scripts are also in Ruby. Blah. In time my arguments against doing this will sound similar to the current arguments you hear against going to the cloud.
Ruger Fault Equivalency says: ditto, cost prohibitive. If you try to migrate an ELB-balanced tech stack from one region to the next, you'll learn: EIPs can't be pointed from an instance in one region to another (at least not easily; I've heard you can ask to have it done). Your custom useast (for example) AMI can't be used in the new region. Your useast security groups can't be used in the new region. Your snapshots can't be used to create new volumes in the new region. Sure, this can all be worked around, but do you have the time and money? Do set up a DB replicant in another region, if possible.
Ruger says: yes, even in light of the recent outage that affected the entire useast region. It's not cost-prohibitive, and you get data-center fail-over. At my last company, we co-located in a downtown Seattle data center that also hosted MS, Amazon and Expedia servers. At the time it seemed like a fortress, but it is in fact a single building. Contrast that to an AWS region, which contains zones that are four separate data centers in separate buildings. The Seattle data center caught fire a couple of years ago, causing a major outage for our last company (after we left). Many years previous, we had spent a lot of time and money creating our own data-center fail-over within our Seattle office, even backed up by a generator. Did they fail over to the backup data center? I don't think so. What about the recent AWS outage? A human error caused a major problem in one zone that had a ripple effect into the other zones, to a large degree caused by folks failing over. But ultimately, the downtime suffered was in proportion to how well you were already leveraging other zones, and how dependent you were on EBS volumes. If all of your eggs were in the wrong zone, or you didn't have the right backup strategy in place: totally screwed. Otherwise: not so bad! I've often heard that VCs like entrepreneurs who have failed; what's the cloud analogy? Something about lightning striking twice…
Our zone scenario, and why we were down intermittently for 12 hours during the AWS outage: before the outage we had auto-scaling groups in two zones within a single region. At some point I brought everything into a single zone while debugging odd performance between the two. I consciously de-prioritized revisiting that in light of other priorities, figuring the single-zone group would at least scale with traffic. But I'd configured the groups with a trigger to auto-scale when CPU spiked. Over time our application grew more resource-efficient, which meant CPU wasn't spiking, which meant we weren't scaling with traffic. That led to the learning that it's better to scale on network I/O or, now that AWS supports them, custom scaling triggers. We're in multiple zones again now.
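The failure mode described here is easy to reproduce on paper. The sketch below uses invented sample data and thresholds (none of these numbers come from BigDoor) and a generic "N consecutive samples above a threshold" rule rather than any particular AWS API.

```python
def should_scale_out(samples, threshold, periods=3):
    """True when the most recent `periods` samples all exceed the threshold."""
    recent = samples[-periods:]
    return len(recent) == periods and all(value > threshold for value in recent)


# Hypothetical metrics for the same five-interval window.
cpu_percent = [35, 38, 36, 40, 37]          # efficient app: CPU stays flat
network_in_mb = [120, 180, 260, 340, 410]   # traffic keeps climbing

cpu_trigger = should_scale_out(cpu_percent, threshold=70)
network_trigger = should_scale_out(network_in_mb, threshold=250)

print("scale out on CPU?", cpu_trigger)
print("scale out on network in?", network_trigger)
```

With these samples the CPU rule never fires even though traffic has more than tripled, while the network rule does, which is the argument for triggering on network I/O or a custom metric that tracks load directly.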
Ruger says: don't even think about not doing it. What's generally worked for us: ELBs for same-region traffic distribution; auto-scaling groups to allow application server fail-over, within a zone and across them; replication to put secondary fail-over database servers in other zones within a region.
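For the case without auto-scaling, the slide's advice is to distribute instances evenly between two or more zones. A minimal sketch of that placement, assuming a simple round-robin and hypothetical instance names:

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(instance_names, zones):
    """Assign instances to zones round-robin so the counts differ by at most one."""
    if len(zones) < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    return dict(zip(instance_names, cycle(zones)))


# Hypothetical instance names; the zone names follow AWS's naming pattern.
instances = ["app-1", "app-2", "app-3", "app-4", "app-5"]
placement = spread_across_zones(instances, ["us-east-1a", "us-east-1b"])
per_zone = Counter(placement.values())

print(placement)
print(per_zone)
```

Losing either zone then costs at most about half of the fleet instead of all of it.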
What about Fred? Cut Fred some slack for tripping over the extension cord, we all make mistakes. You need Fred. That is, assuming he communicates what happened widely. If he doesn’t, he’s going to suffer the wrath of his internal and external customers.
Transparency is so 2010, and it hints at over-communication. Your customers don't need to know that your only DBA is out sick today. They don't need a ton of detail; they need status updates and anything actionable. Does open communication increase fault tolerance? I'd argue yes. As I've tried to point out, people are core parts of our systems. Your customers will be more tolerant of your faults if you're open and clear about them.
At BigDoor, if there’s a crisis, our standard operating procedure identifies a single person responsible for stopping the team on an hourly basis to get status and determine what should be communicated externally, if anything. As much as we love him, we don’t involve our lawyer in that conversation, by the way.
In summary, these are the three tenets that I'm hoping will help you achieve the right amount of fault tolerance in the cloud: scripted repeatability, SPOF elimination, clear-cut communication. Thanks again Gluecon; I'll be at the BigDoor pod in the lobby if anyone wants to talk more about this stuff later. I also have some notes that describe the good and bad about AWS, which will be available on our blog at www.bigdoor.com. Thanks again.
Tools: the good and bad

ELBs
Good:
- quick to configure, auto-scaling load balancers
- can be used for fail-over within a region
Bad:
- no logging
- return 503s on error (e.g. if there aren't instances that can service requests); you won't know unless you can monitor every request end to end
- name servers disregarding TTLs + auto-scaling = traffic routing issues
- best practice: return custom HTTP headers in your response so that you can distinguish calls during support incidents
- can't be used for fail-over between AWS regions; need a separate DNS solution for funneling traffic

AMIs (Amazon Machine Images: install images, OS blueprints)
Good:
- leveraged a pre-built Debian AMI
- Debian has great package management, which can be scripted
- packages are verified and tested before making it into the main line: less to think about
- thank you, Eric Hammond: http://alestic.com/
- scripted repeatability: script the non-interactive install of your tools
- can be used to stand up instances within a region
- best practice: single master AMI built on top of a pre-existing one, re-built regularly with new software, app code and patches, via automation. Tagged.
- best practice: put app code and package configuration into SVN and include them in your AMI; svn up regularly or during instance start-up (faster for things like auto-scaling)
Bad:
- can't copy/port AMIs from region to region easily
- not having the entire process scripted from the kernel means loss of flexibility (regional AMIs) and security
- pitfall: easy to get off track. Didn't start out with a single script that installs everything, or stay diligent about including everything? Have fun re-doing all that!

EC2 instances
Good:
- leverage AMIs
- obviously, scriptable, automated instance creation
- EIPs allow for easy, dependable service re-routing from one instance to another
- security groups are an easy way to firewall (and to tag, before they came out with tags)
- zones allow easy fail-over within a geographic region (most of the time)
- regions provide the promise of fail-over between data centers more geographically separated (Virginia vs. California)
- init scripts allow you to create/update on a per-instance basis
Bad:
- security groups can't be added to or removed from an instance once it's running
  - best practice: use a different group for each narrower category, e.g. instead of 'database group', create groups for 'primary transactional db server in production', 'replicant...' etc.
  - best practice: use a group that whitelists trusted IPs to give access to otherwise un-needed ports and services
- regions don't allow easy fail-over; EIPs can't be mapped between them (at least not programmatically)
- can't port AMIs from region to region easily, so setup to fail region-to-region is difficult

EBS
Good:
- provides redundant storage for instances that can be snapshotted for easy backup and volume duplication within a region
Bad:
- volumes can't be created from snapshots across regions
- data loss: it happened (not to us, fortunately), so be prepared and apply the amount of resources your risk tolerance allows
- poor I/O in general, specifically writes; typically this has only been an issue for us on our primary tx DB servers
- best practice: RAID 0 array for the MySQL data directory, but make sure it's replicated and backed up

Auto-scaling
Good:
- n scaling groups in 1-4 zones behind an ELB; provides same-region fail-over
- n instances in a scaling group
- CloudWatch monitors provide great stats
- previously, limited scaling triggers were provided; the latest integrate with CloudWatch much better, including custom metrics you define
Bad:
- learning: we had no baselines for when to scale on anything other than CPU utilization, which at the time was easy to differentiate; we spiked. Application improvements fixed the spikes, which in turn stopped auto-scaling triggers.
- need monitoring/alerting via Nagios or another tool? Figure out how to (de-)register new instances during scaling activities
- this is changing: CloudWatch is getting better. Do you trust Amazon's monitoring/alerting on Amazon's monitoring/alerting?

EMR
Good:
- great for async log analysis
- what's worked for us: centralized log hosts; Apache logs rotated via logrotate and rsync'd via cron, pre-processed, sync'd to S3 and drawn into an EMR/Hive cluster for aggregations and reporting
- Hive/HQL is very similar to SQL
Bad:
- asynchronous; takes a fair amount of time to munge data

S3
Good:
- available from anywhere, any region
- s3cmd is a great tool, for the most part
Bad:
- no full support for standard paths and directories
- …TBD

CloudWatch
Good:
- can monitor various services and trigger/alert when thresholds are crossed (e.g. ELB network in)
- new: auto-scaling can leverage triggers more broadly, including custom metrics
Bad:
- no built-in ability to trigger/alert based on % change from previous measurements
- console reports/graphs need a decoder tool and, most recently, appear buggy. But they've made big steps forward.

AWS APIs
Good:
- API wrappers provided; allow for command-line scripting
- DRY: can (and should) script most things that repeat
- all done via scripts: a bit about our process and how the cloud fits well
  - 1-week sprints: lockdown Tuesdays, test overnight (uTest), release Wednesday
  - test-first methodology
  - system tests for backend, other big changes, our API changes
  - tested a new version of MySQL (Percona, recommended): http://screencast.com/t/yVf5RnaUN9 http://screencast.com/t/WJaL2qiSR
  - performance, integrity, load, capacity
  - these require full-stack stand-up/tear-down, including a 230G+ db backend
Bad:
- keep your eye out for library updates (why not open-source these things? Verify they're not already…)
- scripts and wrappers trail AWS innovation, which is fast
- Bash isn't as well-known or readable as Python, for example: maintainability
- scripted stuff bakes you in a bit; no way around this without baking yourself into RightScale or some other solution anyway, though
- API key management: not straightforward
- API keys aren't portable between regions; region-to-region fail-over is not as easy as it sounds. Not rocket science, either. Bake region 1's keys into region 2's new AMI.

APIs - general
- build things test first; run integrity tests before pushing out changes to your API
- don't version; make it backwards-compatible

We try to keep away from anything that's going to lock us in too much
- we continue to shy away from SQS (Simple Queue Service), RDS (Relational Database Service), SimpleDB (non-relational datastore)
- SQS, SimpleDB: proprietary; we would prefer to avoid lock-in for these things, and the need hasn't been high enough for us yet
- RDS: doesn't provide enough flexibility for us. Would love to use it as a replicant pool for reads/reporting though. Can't.
- multi-zone RDS suffered one of the biggest hits during the recent AWS outage

What we're looking forward to leveraging
- new CW status, PUTs, scaling triggers from them
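The ELB best practice in these notes (return custom HTTP headers so calls can be told apart during support incidents) can be sketched as WSGI middleware, which is the interface Django applications are served through. The header names X-Served-By and X-Request-Id and the demo app are assumptions made for illustration, not BigDoor's actual headers.

```python
import socket
import uuid


class SupportHeadersMiddleware:
    """WSGI middleware that tags every response with host and request identifiers."""

    def __init__(self, app, hostname=None):
        self.app = app
        self.hostname = hostname or socket.gethostname()

    def __call__(self, environ, start_response):
        request_id = uuid.uuid4().hex

        def tagged_start_response(status, headers, exc_info=None):
            # Append rather than replace, so the app's own headers survive.
            tagged = list(headers) + [
                ("X-Served-By", self.hostname),   # hypothetical header name
                ("X-Request-Id", request_id),     # hypothetical header name
            ]
            return start_response(status, tagged, exc_info)

        return self.app(environ, tagged_start_response)


def demo_app(environ, start_response):
    """Stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = dict(headers)


body = SupportHeadersMiddleware(demo_app, hostname="app-1")({}, fake_start_response)
print(captured["headers"])
```

Because the load balancer itself gives no logs, a header that names the instance lets a client-side report be matched to a specific server and request after the fact.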
