How Netflix tests in production to augment more traditional testing methods. This talk covers the Simian Army (Chaos Monkey and friends), code coverage in production, and canary testing.
Release the Monkeys! Testing in the Wild at Netflix - Gareth Bowles
This document discusses Netflix's use of "chaos monkeys" to deliberately cause failures in their systems to test resiliency. The chaos monkeys include Chaos Monkey, which terminates instances; Chaos Gorilla, which simulates an availability zone outage; and Chaos Kong, which simulates a full region outage. The monkeys help validate redundancy, improve designs to avoid failures, and ensure systems can handle degradation without affecting other services. The chaos testing is released as open source and helps Netflix understand how systems will behave during random failures.
The basics you need to know to get up and running with Chaos Monkey in your Amazon Web Services cloud environment.
Links:
CloudFormation Template:
https://github.com/joehack3r/aws/blob/master/cloudformation/templates/chaosMonkey.json
Simian Army Quick Start Guide:
https://github.com/Netflix/SimianArmy/wiki/Quick-Start-Guide
Chaos Monkey Configuration:
https://github.com/Netflix/SimianArmy/wiki/Chaos-Settings
Chaos Monkey Army:
https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-Army
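To give a flavor of the Chaos Settings page linked above, a minimal chaos.properties might look like the sketch below. The property names follow the SimianArmy wiki; the values are illustrative, not recommendations:

    # Turn Chaos Monkey on. leashed=true would only log the terminations
    # it would have performed; leashed=false lets it actually terminate.
    simianarmy.chaos.enabled = true
    simianarmy.chaos.leashed = false

    # Opt all auto scaling groups in, averaging at most one termination
    # per group per day.
    simianarmy.chaos.ASG.enabled = true
    simianarmy.chaos.ASG.probability = 1.0
    simianarmy.chaos.ASG.maxTerminationsPerDay = 1.0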
This document discusses how DevOps practices can help organizations accelerate innovation through software delivery. It outlines the core components of DevOps as self-service, automation, and collaboration. It then walks through the software development lifecycle from developing to operating software. Throughout, it emphasizes practices like continuous delivery, automated testing and configuration, zero-downtime deployments, monitoring, and designing systems to withstand failures using techniques like Netflix's "Chaos Monkey" which intentionally causes failures to test resiliency. The overall message is that DevOps can help organizations innovate faster by breaking down silos and automating the process of delivering high quality software.
The document is an agenda for a presentation titled "DevOps: the Atlassian way, how to accelerate your Operations". The presentation will cover preparing infrastructure, an overview of ALM tools like Jira, Bitbucket, Bamboo, and Chef, and how to build a scalable infrastructure for deployment using these tools. It will also discuss managing test environments from Jira, autoscaling infrastructure, and accelerating the concept to launch cycle from 10 days to 10 minutes using an Atlassian-based approach.
Spring Boot makes it easier to create Java web applications. It provides sensible defaults and infrastructure so developers don't need to spend time wiring applications together. Spring Boot applications are also easier to develop, test, and deploy. The document demonstrates how to create a basic web application with Spring Boot, add Spring Data JPA for database access, and use features for development and operations.
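As a flavor of how little wiring that takes, a minimal Spring Boot web application fits in one file. The sketch below is generic and illustrative (class and endpoint names are invented, not taken from the document):

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    // One class gives you an embedded web server, auto-configuration,
    // and an HTTP endpoint, with no manual wiring.
    @SpringBootApplication
    @RestController
    public class DemoApplication {

        @GetMapping("/hello")
        public String hello() {
            return "Hello from Spring Boot";
        }

        public static void main(String[] args) {
            SpringApplication.run(DemoApplication.class, args);
        }
    }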
Embracing Failure - Fault Injection and Service Resilience at Netflix - Josh Evans
A presentation given at AWS re:Invent on how Netflix induces failure to validate and harden production systems. Technologies discussed include the Simian Army (Chaos Monkey, Gorilla, Kong) and our next gen Failure Injection Test framework (FIT).
Scaling on Amazon AWS: from the perspective of AWS and of the application stack. Talks about the available options on AWS, and the architecture of a scalable application.
Amazon Inspector is a security assessment service launched in 2016 that helps users evaluate the security state of their AWS environments. It examines compute instances like EC2 for vulnerabilities and deviations from security best practices, standards, and guidelines. Users can initiate on-demand or scheduled assessments of their environments through the Inspector console or APIs to identify any issues and prioritize remediation.
The document discusses microservices and their advantages over monolithic architectures. Microservices break applications into small, independent components that can be developed, deployed and scaled independently. This allows for faster development and easier continuous delivery. The document recommends using Spring Boot to implement microservices and Docker to deploy and manage the microservices as independent components. It provides an example of implementing an ELK stack as Dockerized microservices.
Immutable Infrastructure: Rise of the Machine Images - C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1WlpXHF.
Axel Fontaine looks at what Immutable Infrastructure is and how it affects scaling, logging, sessions, configuration, service discovery and more. He also looks at how containers and machine images compare and why some things people took for granted may not be necessary anymore. Filmed at qconlondon.com.
Axel Fontaine is the founder and CEO of Boxfuse. Axel is also the creator and project lead of Flyway, the open source tool that makes database migration easy. He is a Continuous Delivery and Immutable Infrastructure expert, a Java Champion, a JavaOne Rockstar and a regular speaker at various large international conferences.
The document discusses event-driven infrastructure and how infrastructure can react to different types of events. It describes how infrastructure as code tools like Puppet, Chef, and Ansible can be used to configure infrastructure. It also discusses how serverless architectures using AWS Lambda allow infrastructure to scale automatically in response to events with no administration. Finally, it considers how event-driven infrastructure affects operational practices for DevOps.
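To make the serverless part concrete, a minimal AWS Lambda handler in Java uses the standard com.amazonaws.services.lambda.runtime interfaces. This is a hedged sketch: the event is modeled as a plain String, and the handler name is invented.

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;

    // Lambda calls handleRequest once per event and scales the number of
    // concurrent instances automatically; there are no servers to manage.
    public class InfrastructureEventHandler implements RequestHandler<String, String> {

        @Override
        public String handleRequest(String event, Context context) {
            context.getLogger().log("Reacting to infrastructure event: " + event);
            // React here, e.g. tag a resource or kick off remediation.
            return "handled";
        }
    }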
Heroku is a platform as a service that originally started as a Ruby PaaS but now supports Node.js, Clojure, Grails, Scala, and Python. It uses the Git version control system for deployment and a dyno process model for scaling applications. While flexible in allowing custom buildpacks and configuration via environment variables, there are also restrictions like maximum source code size and memory limits for dyno processes.
Riot Games Scalable Data Warehouse Lecture at UCSB / UCLA - sean_seannery
This is a talk that was given for the Scalable Internet Services Masters-level Computer Science class at UCLA and UCSB. It briefly discusses the server architecture for the game League of Legends before going into depth about how the data warehouse can hold petabytes of player data. Discussion of message queue architecture and scalability occurs along the way.
How Netflix thinks of DevOps. Spoiler: we don’t. - Dianne Marsh
Dianne Marsh, Director of Engineering at Netflix, discusses Netflix's DevOps practices for managing their large and growing global ecosystem. Key aspects include building a blameless culture where developers are responsible for operations, extensive automation using tools like Spinnaker and Atlas, and chaos engineering practices like Chaos Monkey to test system reliability. Netflix also leverages machine learning for tasks like anomaly detection and automated canary analysis to improve operations.
Deploy Faster Without Failing Faster - Metrics-Driven - Dynatrace User Groups... - Andreas Grabner
Do it like the "DevOps Unicorns" Etsy, Facebook and co.: deploy more frequently. But how, and why? And what are the challenges?
Deploying software faster without failing faster is possible through metrics-driven engineering. Identify problems early on using a "Shift-Left in Quality"; this requires a "Level-Up" of Dev, Test, Ops, and Biz.
See some of the metrics I think you need to look at, and how to upgrade your engineering team to produce better quality right from the start.
At Netflix, we provide a Java-based API that supports the content discovery, sign-up, and playback experience on thousands of device types that millions use around the world every day. As our user base and traffic have grown by leaps and bounds, we are continuously evolving this API to enable the best user experience. In this talk, I will give an overview of how and why the Netflix API has evolved to where it is today and where we plan to take it in the future. I will discuss how we make our system resilient against failures using tools such as Hystrix and FIT, while keeping it flexible and nimble enough to support continuous A/B testing.
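For reference, the general shape of a Hystrix-protected call is sketched below. This is a generic illustration, not Netflix's actual API code; the command name, fallback value, and failing remote call are all invented:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Wraps a remote call so failures and timeouts trip a circuit breaker
    // and return a degraded response instead of cascading upstream.
    public class RecommendationsCommand extends HystrixCommand<String> {

        public RecommendationsCommand() {
            super(HystrixCommandGroupKey.Factory.asKey("RecommendationsService"));
        }

        @Override
        protected String run() {
            // The real remote call would go here; it may fail or time out.
            throw new RuntimeException("service unavailable"); // simulated failure
        }

        @Override
        protected String getFallback() {
            // Degraded experience served while the dependency is unhealthy.
            return "generic-popular-titles";
        }
    }

    // Usage: new RecommendationsCommand().execute() returns the fallback
    // value when run() throws, instead of propagating the error.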
Principles Of Chaos Engineering - Chaos Engineering Hamburg - Nils Meder
This document discusses chaos engineering and its principles. It provides an agenda for a talk on AWS basics, the evolution of chaos testing, tooling for chaos engineering, and chaos engineering itself. It describes how chaos engineering experiments are used to test systems by simulating failures like instance or availability zone outages. The key principles of chaos engineering are to understand normal system behavior, build hypotheses around steady states, vary real-world events by conducting experiments in production, and automate experiments continuously. Popular tools for chaos engineering include Chaos Monkey, Chaos Gorilla, and Chaos Kong.
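The experiment loop behind those principles can be sketched in a few lines of plain Java. This is a schematic harness, not any particular tool; the steady-state metric and the failure injection are placeholder hooks:

    import java.util.function.Supplier;

    // Schematic chaos experiment: hypothesize that a steady-state metric
    // (e.g. request success rate) stays within tolerance while a failure
    // is injected, then check whether the hypothesis held.
    public class ChaosExperiment {

        public static boolean run(Supplier<Double> steadyStateMetric,
                                  Runnable injectFailure,
                                  double tolerance) {
            double baseline = steadyStateMetric.get(); // observe normal behavior
            injectFailure.run();                       // vary a real-world event
            double during = steadyStateMetric.get();   // observe behavior under failure
            return Math.abs(baseline - during) <= tolerance;
        }

        public static void main(String[] args) {
            boolean held = run(
                    () -> 0.999, // placeholder: a measured success rate
                    () -> System.out.println("terminating one instance (placeholder)"),
                    0.001);
            System.out.println("Steady state maintained: " + held);
        }
    }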
This document discusses applying security automation principles through a SecDevOps approach. It begins by highlighting lessons from other companies that deployed features in a disabled state using feature flags and integrated security testing into continuous integration. The document then outlines how Kenna applies SecDevOps principles through automation, with examples like using Chef for configuration management and running security tests on each code check-in. It also presents a use case where Kenna loads security scanning results from various tools into its platform via API to enable continuous security testing.
Scaling Your First 1000 Containers with Docker - Atlassian
Deploying large numbers of containers to production can be a difficult proposition if you don’t approach the problem with the right strategy – one that's appropriate for both your developers and the size of your operations team. Choosing a strategy lets you codify your deployment patterns in a repeatable manner and reuse them over hundreds of deployments without incurring unnecessary cost and complexity.
Using Atlassian’s PaaS as a model, we will discuss important milestones as you scale from a single container to tens, hundreds, and eventually to a thousand containers. At what points should you begin to embrace log aggregation? How about monitoring and metrics collection? Orchestration and clustering solutions? Learn how to incorporate ever more sophisticated third-party solutions as you go, to achieve cost-effective and stable management of your containers in production.
Micro Service – The New Architecture Paradigm - Eberhard Wolff
The document discusses microservices as a new software architecture paradigm. It defines microservices as small, independent processes that work together to form an application. The key benefits of microservices are that they allow for easier, faster deployment of features since each service is its own deployment unit and teams can deploy independently without integration. However, the document also notes challenges of microservices such as increased communication overhead, difficulty of code reuse across services, and managing dependencies between many different services. It concludes that microservices are best for projects where time to market is important and continuous delivery is a priority.
The document summarizes Recommendo, a RESTful product recommendations API built by Nordstrom and hosted on AWS. Some key details:
- Recommendo serves over 2 billion recommendations to Nordstrom.com customers via API and emails.
- It was built by 2 developers and 2 data scientists and deployed to production on AWS in just 105 days.
- The API sees over 3 million hits per day and scales automatically on AWS with average request latency of 70ms.
- Lessons learned include the difficulty of zero downtime deployments and importance of health checks and error handling.
Saturn 2014. Engineering Velocity: Continuous Delivery at Netflix - Dianne Marsh
At Netflix, we realize that there’s a tension between the availability of our service and our speed of innovation. If we move slowly, we can be very available -- but that’s not a good business proposition. If we move super fast, we risk downtime -- and that might annoy our customers. But what if we could increase our velocity without significantly impacting availability? How can we shift that curve so that we’re moving faster without dropping any of those coveted 9’s?
How can we engineer velocity by weaving together tooling and culture with software development to expose and elevate highly effective practices? This talk describes various components of Netflix’s continuous delivery platform -- much of which is available in open source. I’ll show how these pieces fit together and allow us to build scaffolding that makes us comfortable with software developers making the decision to push the button for a prod deployment -- and that helps them recover if necessary. As a result, we can run fast, trusting our tooling and our culture. I’ll also describe how we test our resiliency through simulated failure, unleashing the monkeys (Simian Army) on our production environment. Because if you’re afraid of cute little monkeys, imagine how afraid you’ll be of a production environment that offers those same risks but doesn’t give you an opportunity to test your response to those dangers.
Throughout this talk, I hope that you will challenge yourself to consider how your company can "shift the curve" through tooling to achieve a high-velocity environment without negatively impacting reliability.
Continuous Integration and Deployment Best Practices on AWS - Amazon Web Services
AWS Summit 2014 Brisbane - Breakout 5
With AWS companies now have the ability to develop and run their applications with speed and flexibility like never before. Working with an infrastructure that can be 100% API driven enables businesses to use lean methodologies and realize these benefits. This in turn leads to greater success for those who make use of these practices. In this session we'll talk about some key concepts and design patterns for Continuous Deployment and Continuous Integration, two elements of lean development of applications and infrastructures.
Presenter: Adrian White, Solutions Architect, Amazon Web Services
AWS Meetup - Nordstrom Data Lab and the AWS Cloud - NordstromDataLab
The document discusses Nordstrom's development of a recommendations API and service called Recommendo using AWS services like DynamoDB, Elastic Beanstalk, and Node.js. Some key points:
- Recommendo provides product recommendations to Nordstrom's website and emails, and has served over 4 billion recommendations after just 105 days of development.
- It was built on AWS using services like DynamoDB for storage, Elastic Beanstalk for deployment, and Node.js for the backend. This allowed a small team to build and deploy it quickly.
- Performance was improved through tuning, and the system now handles the load with an average latency of 90ms from a few auto-scaling servers.
- Lessons learned
Web Scale Applications using NetflixOSS Cloud Platform - Sudhir Tonse
Web Scale Applications using NetflixOSS Cloud Platform. Infographics on IaaS, PaaS, SaaS. Commandments of developing a cloud-based distributed application.
Open Business Conference: Continuous Delivery At Netflix -- Powered by Open S... - Dianne Marsh
Netflix uses continuous delivery practices powered by open source tools to deploy code rapidly and reliably across multiple AWS regions. Teams deploy their own code using tools like Nebula/Gradle and Jenkins Job DSL for automated builds. The Aminator creates AMIs and Asgard deploys them using red/black deployment. Simian Army monkeys like Chaos Monkey test resiliency. Self-service, awareness of regions, and rollback ability are key to Netflix's approach.
The Journey of Chaos Engineering Begins with a Single Step - Bruce Wong
PagerDuty Summit 2016
Presenters: Bruce Wong, James Burns
https://www.pagerduty.com/pagerduty-summit-2016/
Heard of Netflix's Chaos Engineering and the Simian Army? Google's legendary DiRT exercises? Hear how Twilio is getting started on its own journey with Chaos Engineering. This talk is the story of how Twilio got started, the lessons learned, and the impact on our engineering culture.
Transition::IT -- Leadership and Cultural Change - mike d. kail
This document discusses driving cultural change within an IT organization. It notes that initially only 15% of people may be positive about change, with others reacting with anger, distrust or uncertainty. To increase acceptance of change, clear communication is important along with encouraging questions, accepting input, avoiding quick judgments, and remaining open to change. Leaders must be comfortable with discomfort and continue pushing boundaries. Defining core cultural beliefs and values is also key, as is assessing performance and removing toxic employees. Driving continual improvement requires ongoing efforts like reinforcing reasons for change, setting objectives, listening, and celebrating wins.
This document summarizes Ajay Vaddadi's work on developing an automated fault-tolerance testing tool called ScrewDriver at Groupon. It discusses (1) the need for fault tolerance testing due to economic losses from outages, (2) ScrewDriver's main components like the Controller, Capsule, and Topology Translation Engine, and (3) next steps to test ScrewDriver extensively on Groupon services and eventually open source it.
Past, present and future of Recommender Systems: an Industry Perspective - Xavier Amatriain
The document summarizes the past, present, and future of recommender systems from an industry perspective.
[1] In the past, Netflix popularized recommender systems with their 2006 Netflix Prize competition.
[2] Currently, recommender systems are used widely across many applications and industries. They have evolved to use implicit feedback and contextual information beyond just explicit ratings. Ranking items is also central to recommender systems.
[3] Future directions include addressing indirect feedback challenges, incorporating the value or reward of recommendations, optimizing full pages rather than just individual recommendations, and personalizing not just what is recommended but how it is recommended to users.
The document discusses a presentation given by Bill Burns, Sr. Manager of Networks & Security at Netflix, to the CISO Executive Forum on February 26, 2012 about Netflix's move to scaling operations in the cloud. The presentation covered Netflix's background and engineering-centric culture, the reasons for moving to the cloud including availability, capacity and agility. It also discussed the information security challenges of running in an IaaS cloud, such as confidentiality, integrity, availability and possession/control of systems. The presentation showed how Netflix addressed these challenges through automation, embedded security controls, and tools like the Simian Army that induce failures to test availability.
An overview of the Netflix Security Monkey Open Source tool. The presentation provides some background information, architectural overview, and screenshots showing the tool in action.
Big Data Testing: Ensuring MongoDB Data Quality - RTTS
You've made the move to MongoDB for its flexible schema and querying capabilities in order to enhance agility and reduce costs for your business. Shouldn't your data quality process be just as organized and efficient?
Using QuerySurge for testing your MongoDB data as part of your quality effort will increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your Big Data store. QuerySurge will help you keep your team organized and on track too!
To learn more about QuerySurge, visit www.QuerySurge.com
Netflix receives 2 billion requests per day to its API from users and makes 12 billion outbound requests from its personalization engine to power recommendations. The personalization engine uses data on users, movies, ratings, reviews, and similar movies to conduct A/B tests and has experienced 30 times growth over two years. The document requests feedback on the presentation and conference.
From resilient to antifragile - Chaos Engineering Primer DevSecCon - Sergiu Bodiu
Can we inject failure scenarios into deployed systems to reduce platform risk? During this talk, demonstrations of the Simian Army, Chaos Lemur and Locust.io tools will be presented. We will go beyond reliability, stability and availability to help your platform operations team build a continuous process improvement program which will prepare your production systems for the unexpected.
92% of catastrophic system failures were the result of incorrect handling of nonfatal errors.
It is simply not possible to fully reproduce the entire architecture and run an end-to-end test.
Don't trust claims systems make about themselves & their dependencies. Verify by breaking.
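The 92% figure is easy to believe once you look at how a nonfatal error is typically mishandled. A schematic Java contrast (all names invented for illustration):

    public class CacheRefresher {

        interface Cache { void refresh(); }

        // Anti-pattern: the nonfatal error is swallowed, so corruption
        // surfaces much later as a catastrophic failure somewhere else.
        void refreshBadly(Cache cache) {
            try {
                cache.refresh();
            } catch (Exception ignored) {
                // silently dropped -- the "incorrect handling" in question
            }
        }

        // Better: make the failure visible and degrade deliberately.
        void refreshSafely(Cache cache) {
            try {
                cache.refresh();
            } catch (Exception e) {
                System.err.println("cache refresh failed, serving stale data: " + e);
                // alert and fall back to known-good state instead of
                // letting the corruption propagate
            }
        }
    }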
This document outlines Netflix's culture of freedom and responsibility. Some key points:
- Netflix focuses on attracting and retaining "stunning colleagues" through a high-performance culture rather than perks. Managers use a "Keeper Test" to determine which employees they would fight to keep.
- The culture emphasizes values over rules. Netflix aims to minimize complexity as it grows by increasing talent density rather than imposing processes. This allows the company to maintain flexibility.
- Employees are given significant responsibility and freedom in their roles, such as having no vacation tracking or expense policies beyond acting in the company's best interests. The goal is to avoid chaos through self-discipline rather than controls.
- Providing
Performance Metrics for your Build Pipeline - presented at Vienna WebPerf Oct... - Andreas Grabner
Software Performance Metrics that you should look at throughout your Build Pipeline and not just when your app crashes in production.
Find performance and scalability problems as soon as you execute your first Unit Test. Simply focus on metrics such as #SQLs, #LogMessages, #Objects on Heap, ...
London Atlassian User Group - February 2014 - Steve Smith
Continuous deployment is causing organisations to rethink how they build and release software. Atlassian Bamboo is rapidly adding features to help with automating deployment, but there are a lot of other practical and organisational issues that need to be addressed when adopting this development model. The Atlassian business-platforms team has been dealing with these issues over the last few months as we transition our order system to continuous deployment. This talk will cover why we adopted this model, some of the challenges we encountered, and the approaches and tools we used to overcome them.
Cloud patterns forwardjs April Ottawa 2019 - Taswar Bhatti
The document discusses various software design patterns including the external configuration pattern, cache aside pattern, federated identity pattern, valet key pattern, gatekeeper pattern, circuit breaker pattern, retry pattern, and strangler pattern. It provides descriptions of each pattern, examples of problems they aim to address, and considerations for applying the patterns. Taswar Bhatti presents on these patterns and takes questions.
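As one concrete instance, the retry pattern from that list fits in a few lines of Java. This is a generic sketch with invented names, not code from the talk:

    import java.util.concurrent.Callable;

    // Retry pattern: reattempt an operation that fails transiently,
    // backing off exponentially and giving up after maxAttempts.
    public class Retry {

        public static <T> T withBackoff(Callable<T> op, int maxAttempts,
                                        long initialDelayMillis) throws Exception {
            long delay = initialDelayMillis;
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    return op.call();
                } catch (Exception e) {
                    last = e;           // assume transient; retry after backoff
                    Thread.sleep(delay);
                    delay *= 2;         // exponential backoff
                }
            }
            throw last;                 // attempts exhausted: surface the error
        }

        public static void main(String[] args) throws Exception {
            System.out.println(withBackoff(() -> "fetched", 3, 100));
        }
    }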
The document provides an overview of Netflix's approach to continuous delivery using their open source tools. It discusses how Netflix builds immutable infrastructure by baking software packages into pre-configured server images. It also describes how their build system tools like Gradle and Nebula plugins help standardize builds at scale. Finally, it outlines how tools like Eureka, Ribbon, and Asgard help enable ongoing deployment and management of cloud resources through concepts like service discovery and application clusters.
8 cloud design patterns you ought to know - Update Conference 2018 - Taswar Bhatti
This document discusses 8 cloud design patterns: External Configuration, Cache Aside, Federated Identity, Valet Key, Gatekeeper, Circuit Breaker, Retry, and Strangler. It provides an overview of each pattern, including what problem it addresses, when to use it, considerations, and examples of cloud offerings that implement each pattern. It aims to help developers understand and apply common best practices for cloud application design.
Test driven infrastructure development (2 - puppetconf 2013 edition) - Tomas Doran
The document discusses test driven infrastructure development. It describes issues with the current state where infrastructure changes are not repeatable and difficult to test. The speaker proposes modeling infrastructure as code where environments are defined programmatically and configuration is generated externally rather than defined directly in puppet code. This allows for entire environments to be provisioned on demand and tested in an automated and repeatable way. Key benefits include high availability, ability to test all infrastructure changes, fully repeatable environments, high confidence in changes, and continuous integration/deployment of infrastructure.
20111110 how puppet-fits_into_your_existing_infrastructure_and_change_managem... - garrett honeycutt
Puppet can help with change management by using its environments and version control features. Environments represent different stages like development, testing, and production. Changes are made on branches in version control and merged to trunk/master after testing. Tags mark versions to deploy to each environment. Documentation and gates between environments ensure changes meet requirements before moving forward.
The document discusses 3 key ways that developing software for the cloud differs from traditional approaches:
1. Incremental delivery, with frequent small releases of new features rather than large periodic releases.
2. Increased automation, including automated testing and continuous integration/deployment pipelines to support more agile development and deployment.
3. Analytics of usage data to inform product decisions and ensure features are valuable to users. Developing with the cloud in mind requires rethinking processes to focus on agility, automation and data-driven insights.
Slides for my talk at Expert Day for Xamarin 2018
---
It has never been more important to create apps that also work offline. Mobile app users can flick that 'airplane mode' switch at any given time, and the cellular connection isn't as stable as it is at home. To ensure a great user experience you, as a developer, need to account for these scenarios. And honestly: that can be a pain in the butt.
In this session I will show you how to use awesome libraries like Akavache and Polly to create connected apps in a very easy way. Step-by-step I will guide you through a sample application, so when we're done you can go home and implement it in your every app. Have a good flight!
Netflix has built a highly available architecture using microservices running across AWS availability zones. They induce failures through "chaos monkeys" like Chaos Monkey and Latency Monkey to test resiliency. This validated that their designs worked as intended and helped them identify issues. Netflix has now open sourced many of their cloud tools and libraries through projects like Hystrix and Eureka.
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ... - Flink Forward
In this session, we will look at how Apache Flink can be used to stream anonymized API request and response data from a production environment to make sure staging environments are up-to-date and reflect the most recent features (and bugs) that comprise a service. The talk will also examine how to deal with issues of data retention, throttling, and persistence, finishing with recommendations for how to use these sandbox environments to rapidly prototype and test new features and fixes.
Immutable infrastructure isn’t the answer - Sam Bashton
Immutable infrastructure wasn't suitable for the consultancy's needs as it led to long deployment times and a lack of visibility into instance configurations. Instead, they developed a hybrid approach using Packer, Puppet, S3, and AWS services that provides faster deployments, self-healing infrastructure, and a known, verifiable state for instances. This allows them to focus on application development rather than infrastructure management.
Continuous Deployment of your Application - SpringOne Tour Dallas - VMware Tanzu
The document discusses Spring Cloud Pipelines, which provides an opinionated template for continuous delivery pipelines. It describes Spring Cloud Pipelines' support for different automation servers like Concourse and Jenkins, as well as languages like Maven and Gradle. It also covers Spring Cloud Pipelines' default configuration options around environments, testing types, and cloud-native applications.
Continuous Deployment of your Application @SpringOne - ciberkleid
Spring Cloud Pipelines is an opinionated framework that automates the creation of structured continuous deployment pipelines.
In this presentation we’ll go through the contents of the Spring Cloud Pipelines project. We’ll start a new project for which we’ll have a deployment pipeline set up in no time. We’ll deploy to Cloud Foundry and check if our application is backwards compatible so that we can roll it back on production.
The document discusses incorporating chaos engineering experiments into automated pipelines. It provides an example of automating chaos experiments using AWS services like the Fault Injection Simulator, CodePipeline, and Lambda. The example shows how to create an experiment template in FIS, add a stage to CodePipeline to run the experiment, and use Lambda to invoke the experiment from CodePipeline using the FIS API. Best practices for chaos experiments like having recovery plans and notifying stakeholders are also covered.
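A sketch of the "Lambda starts the FIS experiment" step described there, using the AWS SDK for Java v2. The template ID is a placeholder, and the exact builder options should be treated as an assumption to verify against your SDK version:

    import java.util.UUID;
    import software.amazon.awssdk.services.fis.FisClient;
    import software.amazon.awssdk.services.fis.model.StartExperimentRequest;
    import software.amazon.awssdk.services.fis.model.StartExperimentResponse;

    // Pipeline step: start a pre-defined FIS experiment template and print
    // its ID so a later stage can poll for the outcome.
    public class RunChaosExperiment {

        public static void main(String[] args) {
            try (FisClient fis = FisClient.create()) {
                StartExperimentResponse response = fis.startExperiment(
                        StartExperimentRequest.builder()
                                .experimentTemplateId("EXT_TEMPLATE_ID") // placeholder
                                .clientToken(UUID.randomUUID().toString())
                                .build());
                System.out.println("Started experiment: " + response.experiment().id());
            }
        }
    }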
This document provides an agenda and summaries of key points from a presentation on integrating systems using Apache Camel. The presentation discusses how Apache Camel is an open-source integration library that uses enterprise integration patterns to connect disparate systems. It highlights features of Camel including components, data formats, and testing frameworks. Customer examples are presented that demonstrate large returns on investment and cost savings from using Camel for integration projects. The presenters argue that Camel provides flexibility, reusability and rapid development of integrations.
This document discusses code management strategies for Puppet using Git. It presents different options for code review structures and repository hosting. Some key points:
- Git allows for distributed, branch-based development with a change history. It integrates well with Puppet through tools like r10k and dynamic environments.
- Popular repository hosting options include GitHub, GitLab, and Stash, each with different pricing, permissions, and features.
- Different code review workflows are presented, modeled after forms of government. These include autocracy, democracy, plutocracy, and others.
- Pre-commit and pre-receive hooks can validate code quality, while web hooks allow automating workflows between tools. Trade
“I have stopped counting how many times I’ve done this from scratch” - was one of the responses to the tweet about starting the project called Spring Cloud Pipelines. Every company sets up a pipeline to take code from your source control, through unit testing and integration testing, to production from scratch. Every company creates some sort of automation to deploy its applications to servers. Enough is enough - time to automate that and focus on delivering business value.
In this presentation we’ll go through the contents of the Spring Cloud Pipelines project. We’ll start a new project for which we’ll have a deployment pipeline set up in no time. We’ll deploy to Cloud Foundry (but we also could do it with Kubernetes) and check if our application is backwards compatible so that we can roll it back on production.
5. @garethbowles
Netflix is the world's leading Internet television network, with more than 57 million members in 50 countries enjoying more than one billion hours of TV shows and movies per month.
We account for up to 34% of downstream
US internet traffic. Source: http://ir.netflix.com
9. @garethbowles
What AWS Provides
• Machine Images (AMI)
• Instances (EC2)
• Elastic Load Balancers
• Security groups / Autoscaling
groups
• Availability zones and regions
11. @garethbowles
How AWS Can Go Wrong -1
• Service goes down in one or more
availability zones
• 6/29/12 - storm-related power outage caused loss of EC2 and RDS instances in Eastern US
• https://gigaom.com/2012/06/29/some-of-amazon-web-services-are-down-again/
12. @garethbowles
How AWS Can Go Wrong - 2
• Loss of service in an entire region
• 12/24/12 - operator error caused loss of
multiple ELBs in Eastern US
• http://techblog.netflix.com/2012/12/a-closer-look-at-christmas-eve-outage.html
13. @garethbowles
How AWS Can Go Wrong - 3
• Large number of instances get rebooted
• 9/25/14 to 9/30/14 - rolling reboot of
1000s of instances to patch a security
bug
• http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-monkey-cassandra.html
14. @garethbowles
Our Goal is Availability
• Members can stream Netflix whenever
they want
• New users can explore and sign up
• New members can activate their service
and add devices
16. @garethbowles
Freedom and Responsibility
• Developers deploy when
they want
• They also manage their
own capacity and
autoscaling
• And are on-call to fix
anything that breaks at
3am!
18. @garethbowles
Failure is All Around Us
• Disks fail
• Power goes out - and your backup
generator fails
• Software bugs are introduced
• People make mistakes
19. @garethbowles
Design to Avoid Failure
• Exception handling
• Redundancy
• Fallback or degraded experience (circuit
breakers)
• But is it enough ?
20. @garethbowles
It’s Not Enough
• How do we know we’ve succeeded ?
• Does the system work as designed ?
• Is it as resilient as we believe ?
• How do we avoid drifting into failure ?
22. @garethbowles
Exhaustive Testing ~ Impossible
• Massive, rapidly changing data sets
• Internet scale traffic
• Complex interaction and information flow
• Independently-controlled services
• All while innovating and building features
23. @garethbowles
Another Way
• Cause failure deliberately to validate
resiliency
• Test design assumptions by stressing
them
• Don’t wait for random failure. Remove
its uncertainty by forcing it regularly
26. @garethbowles
Chaos Monkey
• The original Monkey (2009)
• Randomly terminates instances in a cluster
• Simulates failures inherent to running in the
cloud
• During business hours
• Default for production services
29. @garethbowles
Chaos Gorilla
• Simulate an Availability Zone becoming
unavailable
• Validate multi-AZ redundancy
• Deploy to multiple AZs by default
• Run regularly (but not continually !)
31. @garethbowles
Chaos Kong
• “One louder” than Chaos Gorilla
• Simulate an entire region outage
• Used to validate our “active-active” region
strategy
• Traffic has to be switched to the new region
• Run once every few months
33. @garethbowles
Latency Monkey
• Simulate degraded instances
• Ensure degradation doesn’t affect other
services
• Multiple scenarios: network, CPU, I/O,
memory
• Validate that your service can handle
degradation
• Find effects on other services, then validate
that they can handle it too
35. @garethbowles
Conformity Monkey
• Apply a set of conformity rules to all
instances
• Notify owners with a list of instances and
problems
• Example rules
• Standard security groups not applied
• Instance age is too old
• No health check URL
36. @garethbowles
Failure Injection Testing (FIT)
• Latency Monkey adds delay / failure on server side of
requests
• Impacts all calling apps - whether they want to
participate or not
• FIT decorates requests with failure data
• Can limit failures to specific accounts or devices, then
dial up
• http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html
37. @garethbowles
Try it out !
• Open sourced and available at
https://github.com/Netflix/SimianArmy
and
https://github.com/Netflix/security_monkey
• Chaos, Conformity, Janitor and Security
available now; more to come
• VMware as well as AWS
38. @garethbowles
What’s Next ?
• New failure modes
• Run monkeys more frequently and
aggressively
• Make chaos testing as well-understood
as regular regression testing
39. @garethbowles
A message from the owners
“Use Chaos Monkey to induce various kinds of
failures in a controlled environment.”
AWS blog post following the mass instance
reboot in Sep 2014:
http://aws.amazon.com/blogs/aws/ec2-maintenance-update-2/
41. @garethbowles
What We Get
• Real time code usage patterns
• Focus testing by prioritizing frequently
executed paths with low test coverage
• Identify dead code that can be removed
42. @garethbowles
How We Do It
• Use Cobertura as it counts how many
times each LOC is executed
• Easy to enable - Cobertura JARs
included in our base AMI, set a flag to
add them to Tomcat’s classpath
• Enable on a single instance
• Very low performance hit
44. @garethbowles
Canaries
• Push changes to a small number of
instances
• Use Asgard for red / black push
• Monitor closely to detect regressions
• Automated canary analysis
• Automatically cancel deployment if
problems occur
45. @garethbowles
Closing Thoughts
• Don’t be scared to test in production !
• You’ll get tons of data that you couldn’t get
from test …
• … and hopefully sleep better at night
46. @garethbowles
Thanks, QA or the Highway !
Email: gbowles@{gmail,netflix}.com
Twitter: @garethbowles
LinkedIn:
www.linkedin.com/in/garethbowles
Editor's Notes
Hi, everyone. Thanks for coming today. This is my first keynote, so I hope I can do it justice !
Just to set some expectations - I heard that at Matt’s keynote last year, he was giving out $20 bills. I’m afraid I don’t have any cash, but I do have plenty of snazzy Simian Army stickers up here.
I’m going to talk about some big testing challenges that we face at Netflix. In particular, we have such a large and complex distributed system that testing it exhaustively in an isolated environment is next to impossible. To meet that challenge we came up with a few different approaches, and I’m going to talk about three of those today: the Simian Army, which is a set of tools that induces failures in production; code coverage analysis on production servers; and using canaries to test new versions in production.
I’ll spend a bit of time talking about Netflix and our streaming service to set up the problem space, then go over some of the things we need to test for and how more traditional test practices can fall short. Finally I’ll go into some detail on the tools themselves.
A little bit about me. I’ve been with Netflix for 4 1/2 years and I’m part of our Engineering Tools team. We’re responsible for developer productivity, with the goal that any engineer can build and deploy their code with the minimum possible effort.
Before Netflix I spent a long time in test engineering and technical operations, so once I got to Netflix I was fascinated to see how such a complex system gets tested.
Let’s take a look at how it all works.
Who here ISN’T familiar with Netflix ?
Any customers ? Thanks very much !
Netflix is first & foremost an entertainment company, but you can also look at us as an engineering company that creates all the technology to serve up that entertainment, and also collects a ton of data on who watches what, when they watch it, and how much they watch. We continuously analyze all that data to improve our customers’ entertainment experience by making it easy for them to find things they want to watch, and making sure they have a top quality viewing experience when they get comfy on the sofa (or the bus, or in the park, or wherever they can get connected).
So there’s a lot of engineering that goes on behind the scenes to make all that possible.
Some data that might be new to some of you guys. Our membership is growing fast; about two thirds of our members are in the US, but we’re now in more than 50 other countries - all countries in North and South America, plus most of Europe.
The amount of content our viewers are watching is growing, too.
And we’re doing our best to break the internet.
This is an overview of our current architecture.
2 billion requests flow in from all kinds of connected devices - game consoles, PCs and Macs, phones, tablets, TVs, DVD players, and more.
Those generate 12 billion outbound requests to individual services. The diagram shows some of the main ones: the personalization engine that recommends what to watch based on your viewing history and ratings, movie metadata to give you information about what you’re watching, and one of the most important - the A/B test engine that lets us serve up a different customer experience to a set of users and measure its effect on how much they watch, what they watch and when they watch it.
I like to pause for audience reaction here. Anyone care to suggest what this diagram shows ?
We call it the “Shock and Awe” diagram at Netflix ! It’s generated by one of our monitoring systems and shows the interconnections between all of our services and data. Don’t look too hard, you won’t make out much detail - it’s only meant to illustrate the complexity of the system.
We run our production systems on Amazon Web Services, which many of you are probably familiar with. We’re one of AWS’ biggest customers - apart from Amazon itself, who uses AWS to power their e-commerce sites as well as their streaming service that competes with Netflix.
Using Amazon Web Services lets us stop worrying about procurement of hardware - servers, network switches, storage, firewalls, load balancers ...
AWS allows us to scale up and down without worrying about exceeding or underusing data center capacity.
And since every AWS service has an API, we can automate our deployments and throw away all those runbooks.
Each AWS service is available in multiple regions (geographic areas) and multiple availability zones (data centers) within each region.
So now that I’ve described how everything works at a really high level, here’s a big problem - in actual fact, there’s nobody who knows how it all works in depth. Although we have world experts in areas such as personalization, video encoding and machine learning, there’s just too much going on and it’s changing too fast for any individual to keep up.
AWS is increasingly reliable, but it has had some fairly spectacular outages as well as many smaller ones. When you run on a cloud platform that’s not under your control, you have to be able to cope with these outages.
On June 29th 2012, a storm caused a widespread power outage in Northern Virginia that took out many instances and database servers. Netflix streaming was affected for a while.
An even bigger outage happened on Christmas Eve 2012, when an Amazon engineer made a mistake that took out many Elastic Load Balancers in the US East region, which at that time was Netflix’s primary region for serving traffic. ELBs aren’t replicated between availability zones, they only apply to a given region. Many Netflix customers were affected; luckily for us, Christmas Eve is a much less busy day than Christmas Day, when everyone gets given Netflix subscriptions, and the problem was fixed by the 25th.
In late September this year, AWS restarted a large number of instances, in multiple regions, to patch a security bug in the virtualization software that the instances run on. This time we were hardly affected at all and there was negligible impact on customers - again, more in our tech blog post.
Given those kinds of problems, we need to work pretty hard to deal with them. Netflix wants our 50 million plus members to be able to play movies and TV shows whenever and wherever they want. We also want to make it as simple and fast as possible for people to sign up and start using their new subscriptions.
One little digression that’s an important part of how we meet these challenges - we couldn’t do it without our company culture, which we’re quite proud of - proud enough to publish the 126-slide deck I’ve linked here. All those slides can be boiled down to one key takeaway, which we call Freedom and Responsibility.
Here’s how the “freedom & responsibility” principle applies to our technical development and deployment.
Freedom - engineers deploy when and how often they need to, and control their own production capacity and scaling.
Responsibility - every engineer in each service team is in the PagerDuty rotation in case things go wrong.
So how can these teams be confident that their new versions will still work with all their dependencies, under all kinds of failure conditions ?
It’s a very tough problem.
Failure is unavoidable. We already saw some ways that our AWS platform can fail. Add to that bugs in our own software, and the inevitable human errors that you get when there are actual people involved in the development and deployment pipeline.
We can do a lot to make our code handle failure gracefully.
We can catch errors as exceptions and make sure they are handled in a way that doesn’t crash the code.
We can run multiple instances of our services to avoid single points of failure.
And we can use technologies such as the circuit breaker pattern to have services provide a degraded experience if one of their dependent services goes offline. For example, if our recommendation service is unavailable, we don’t show you a blank list of recommendations on your Netflix page - we fall back to a list of the most-watched content.
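To make the circuit-breaker idea concrete, here is a minimal sketch using Netflix’s open-source Hystrix library. The RecommendationClient and MostWatchedCache interfaces are hypothetical stand-ins for real internal services, not actual Netflix code:

import java.util.List;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Hypothetical client interfaces - stand-ins for real internal services.
interface RecommendationClient { List<String> recommendationsFor(String memberId); }
interface MostWatchedCache { List<String> topTitles(); }

public class RecommendationsCommand extends HystrixCommand<List<String>> {
    private final RecommendationClient client;
    private final MostWatchedCache mostWatched;
    private final String memberId;

    public RecommendationsCommand(RecommendationClient client,
                                  MostWatchedCache mostWatched,
                                  String memberId) {
        super(HystrixCommandGroupKey.Factory.asKey("Recommendations"));
        this.client = client;
        this.mostWatched = mostWatched;
        this.memberId = memberId;
    }

    @Override
    protected List<String> run() {
        // Normal path: call the recommendations service.
        return client.recommendationsFor(memberId);
    }

    @Override
    protected List<String> getFallback() {
        // Degraded experience: serve the most-watched list instead of
        // showing the member an empty recommendations row.
        return mostWatched.topTitles();
    }
}

Calling new RecommendationsCommand(client, cache, memberId).execute() routes the call through Hystrix’s circuit breaker; when the dependency is failing, slow, or the circuit is open, the fallback runs instead.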
But that only gets us so far.
We want to make sure all our features work properly without waiting for customers to tell us they don’t.
We want to know that we are as resilient as we think we are, without waiting for an outage to happen. Given the scale we run at, it’s effectively impossible to create a realistic test system for running load and reliability tests.
We also want to be sure that the configuration of our services doesn’t diverge as we redeploy them over time - this can lead to errors that are very hard to debug given the thousands of instances that we run in production.
So, most of you guys are testers and probably have this reaction - let’s do more testing !
But can we effectively simulate such a large-scale distributed system - and what’s more, can we predict every possible failure mode and encode it into our tests ?
Today’s large internet systems have become too big and complex to just rely on traditional testing - but don’t get me wrong, all those types of testing I just mentioned have a very important place at Netflix, and we have some of the best test engineers in the business working on them. But here are some of the things they struggle with.
It’s very hard to find realistic test data.
It would be hugely expensive for us to create a similarly-sized copy of our production system for testing - not quite “copying the internet”, but getting there.
Because teams deploy their own changes on different schedules, it’s difficult to keep up with changes and code them into integration tests.
So we came up with the idea of deliberately triggering failures in production, to augment our more traditional testing. By causing our own failures on a known schedule, we can be prepared to deal with their effects and test our assumptions in a predictable way, rather than having a fire drill when a “real” outage happens.
So with all that context done with, let’s take a look at the lovable monkeys who make up our Simian Army.
Chaos Monkey was the one who started it all.
Chaos Monkey has been around in some form for about 5 years.
It’s a service that looks for groups of instances (known as clusters) of each of our services and picks a random instance to terminate, on a defined schedule and with a defined probability of termination.
This simulates a fairly frequent thing in AWS (although not nearly as frequent as it used to be) where instances are terminated unexpectedly, usually due to a failure in the underlying hardware.
We run Chaos Monkey during business hours so that engineers are on hand to diagnose and fix problems, rather than getting a 3am page.
If you deploy a new Netflix service, Chaos Monkey will be enabled for it unless you explicitly turn it off.
We’ve got to a point where Chaos Monkey instance terminations go virtually unnoticed.
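The core mechanic is simple enough to sketch. Here is a toy illustration of the idea - not the real Simian Army code - assuming the AWS SDK for Java and a pre-fetched list of instance IDs for one cluster:

import java.util.List;
import java.util.Random;

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

public class MiniChaosMonkey {
    private static final Random RANDOM = new Random();

    // Terminate one randomly chosen instance from the cluster, with the
    // given probability. Scheduling (business hours only) is left out.
    public static void unleash(List<String> clusterInstanceIds, double probability) {
        if (clusterInstanceIds.isEmpty() || RANDOM.nextDouble() >= probability) {
            return; // the cluster is spared this time
        }
        String victim = clusterInstanceIds.get(RANDOM.nextInt(clusterInstanceIds.size()));
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(victim));
        System.out.println("Chaos Monkey terminated " + victim);
    }
}

The real Chaos Monkey wraps this core in scheduling, per-cluster configuration, and logging; the fixed schedule and probability are what turn random failure into a controlled, repeatable experiment.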
We didn’t want to stop once we were happy that we could deal with individual instances dying.
Gorillas are bigger than monkeys and can carry bigger weapons.
Chaos Gorilla takes out an entire Availability Zone.
AWS has multiple regions in different parts of the world, such as Eastern USA, Western USA, Asia Pacific and Western Europe.
Each region has multiple Availability Zones. These are equivalent to physical data centers in different geographic locations - for example, the US East region is located in Virginia but has 3 separate Availability Zones.
Running Chaos Gorilla ensures that our service is running correctly in multiple Availability Zones, and that we have sufficient capacity in each zone to handle our traffic load.
Runs of Chaos Gorilla are announced ahead of time, and our Reliability Engineering team sets up an incident room where engineers from each service team can watch progress.
So what next ? As we already picked the gorilla, we had to resort to a fictional creature to cause even bigger chaos.
Once we were happy that we could survive an Availability Zone outage, we wanted to go a step further and see if we could cope with an entire region being taken out.
This hasn’t happened in reality yet, but there’s a small possibility that it could - for example, the us-west-1 region is in Northern California, so a really big earthquake could feasibly take out all of its Availability Zones.
And the Elastic Load Balancer outage at the end of 2012 did have the effect of bringing down a key service in an entire region.
To handle this, we had to rearchitect to an “active-active” setup where we have complete copies of our services and data running in two different regions.
If an Availability Zone goes down we just have to make sure we have enough capacity in the surviving zones to handle all our traffic, but if we lose a region we also have to reroute all traffic to the backup region.
Chaos Kong gets an outing every few months, again with an incident room where engineers can watch progress and react to any problems.
So we can deal with instances disappearing individually, or in large numbers. But what if instances are still there, but running in a degraded state ? Because our architecture involves so many interdependent services, we have to be careful that problems with one service don’t cascade to other services.
This is where Latency Monkey comes in. It can simulate multiple types of degradation: network connections maxing out, high CPU loads, high disk I/O and running out of memory.
Degradation in a service oriented architecture is extremely hard to test exhaustively. With Latency Monkey we can introduce degradation in a controlled way (like the recommendations example I mentioned earlier), find any problems in dependent services and fix them, then verify those fixes.
Some service teams even discovered dependencies they didn’t know they had by running Latency Monkey and finding unexpected degradation in the dependent services.
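Here is a toy version of the latency-injection idea, assuming a generic supplier-based dependency call (the real Latency Monkey degrades instances at the infrastructure level rather than by wrapping calls in code):

import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class LatencyInjector {
    private static final Random RANDOM = new Random();

    // Wrap a dependency call, delaying a configurable fraction of requests
    // to see how the calling service copes with a degraded dependency.
    public static <T> T withInjectedLatency(Supplier<T> call, double probability,
                                            long delayMillis) throws InterruptedException {
        if (RANDOM.nextDouble() < probability) {
            TimeUnit.MILLISECONDS.sleep(delayMillis); // simulated degradation
        }
        return call.get();
    }
}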
One other key aspect of running so many services, all with dozens or hundreds of instances, is that it’s very important to keep all of the instances consistent. They should all have the same system configuration and the same version and configuration of the service, for example. This decreases the complexity and surface area of the testing.
We use Conformity Monkey to automate these checks.
It runs over all services at a fixed interval, and notifies service owners when any of the conditions are not met. An email is sent containing a list of rule violations, each with a list of non-conforming instances.
Here are a few examples of the things we check:
Instances should be in the correct security groups so that they are reachable by other services and our monitoring and deployment tools.
Instances shouldn’t have been running for more than a given time, which depends on how often the service is deployed. Older instances could be running an out of date version of the service.
Instances should all have a valid health check URL so that our monitoring tools can know whether they are running properly.
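Those checks are easy to picture as code. A hypothetical sketch of rule-based conformity checking follows; all the types here are invented for illustration and do not match the real SimianArmy API:

import java.util.ArrayList;
import java.util.List;

// Hypothetical instance model and rule interface, for illustration only.
record Instance(String id, int ageDays, String healthCheckUrl) {}

interface ConformityRule {
    String name();
    boolean isConforming(Instance instance);
}

class HealthCheckRule implements ConformityRule {
    public String name() { return "MissingHealthCheckUrl"; }
    public boolean isConforming(Instance i) {
        return i.healthCheckUrl() != null && !i.healthCheckUrl().isEmpty();
    }
}

public class ConformityCheck {
    // Build the list of violations that would go into the owner's email.
    public static List<String> violations(List<Instance> cluster,
                                          List<ConformityRule> rules) {
        List<String> report = new ArrayList<>();
        for (ConformityRule rule : rules) {
            for (Instance instance : cluster) {
                if (!rule.isConforming(instance)) {
                    report.add(rule.name() + ": " + instance.id());
                }
            }
        }
        return report;
    }
}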
We just came up with a system called FIT, for Failure Injection Testing. We need to come up with a monkey for this one !
Latency Monkey injects delays or failures on the server side and thus affects all calling services. If all those calling services don’t have proper fallbacks and timeouts implemented, they can stop working and impact customers - which we obviously want to avoid.
FIT allows failures to be simulated on the client side. We can add failure data to a specific set of API calls and propagate it through the system so that only the services we want to test are affected. We usually start with a specific test customer account or a particular client device, then dial up the failures to affect more and more production traffic if the initial results look good.
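As a sketch of the request-decoration idea - with invented header names and types, since the real FIT defines its own failure-context format:

import java.util.Map;
import java.util.Set;

// Hypothetical failure context; the header names below are invented.
record FailureContext(String targetService, String failureMode) {}

public class FitDecorator {
    // Start with specific test accounts, then "dial up" enrollment.
    private static final Set<String> ENROLLED_ACCOUNTS = Set.of("test-account-42");

    // Outbound side: tag the request with failure metadata if enrolled.
    public static void decorate(Map<String, String> headers, String accountId,
                                FailureContext ctx) {
        if (ENROLLED_ACCOUNTS.contains(accountId)) {
            headers.put("X-Fit-Target-Service", ctx.targetService());
            headers.put("X-Fit-Failure-Mode", ctx.failureMode());
        }
    }

    // Service side: honor the injected failure only if we are the target,
    // so untargeted services keep working normally.
    public static void maybeFail(Map<String, String> headers, String serviceName) {
        if (serviceName.equals(headers.get("X-Fit-Target-Service"))) {
            throw new RuntimeException(
                "FIT-injected failure: " + headers.get("X-Fit-Failure-Mode"));
        }
    }
}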
Check the Netflix Tech Blog for more details.
You can try the Monkeys out for yourself by going to the Netflix GitHub page. There’s an active community of users, and Netflix engineers regularly monitor the mailing list.
Chaos, Conformity, Janitor and Security Monkeys are currently available, with more to come.
For those of you not using AWS but with an in-house VMware setup - you can use the Monkeys too, thanks to some of our open source contributors.
There’s a lot more we want to do with the Monkeys.
We’d like to have a way to induce failures that are more chaotic than the individual instances that Chaos Monkey knocks out, but less impactful than having Chaos Gorilla take out an entire availability zone.
We’re constantly on the lookout for new failure modes that we can trigger - hopefully before they happen in the wild.
We’re working on an effort to increase the frequency and reach of monkey runs. This is in response to some interesting data - our uptime degraded when we ran the monkeys less frequently.
Eventually, we’d like to make chaos testing in large distributed systems as well understood and commonly practiced as regular regression testing.
After the mass instance reboot I mentioned earlier, Amazon themselves recommended that AWS customers use Chaos Monkey to test their resilience. High praise indeed !
Let’s move on to our second way of testing in production - code coverage analysis. Our view is that if you’re not doing it in prod, you’re missing out on a ton of useful data.
In contrast to just showing which code paths are covered by tests, we get data on the paths that are actually used in production, plus how often each path is run. We compare the production code coverage data with the results from our test environment.
This enables us to focus our testing, for example by identifying commonly used code paths with low test coverage. We can also find dead code that is never used in production, and remove that code and its tests to make maintenance easier.
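Here is a sketch of how that prioritization might look, assuming hypothetical per-line data keyed by "file:line":

import java.util.List;
import java.util.Map;
import java.util.Set;

public class CoverageGapFinder {
    // Rank the production-hot lines that the test suite never touches.
    // prodHits: per-line execution counts from production coverage runs;
    // coveredByTests: lines hit in the test environment.
    public static List<String> hotUntestedLines(Map<String, Long> prodHits,
                                                Set<String> coveredByTests, int top) {
        return prodHits.entrySet().stream()
                .filter(e -> !coveredByTests.contains(e.getKey()))
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(top)
                .map(e -> e.getKey() + " (" + e.getValue() + " hits in prod, untested)")
                .toList();
    }
}

Lines with zero production hits over a long enough window are candidates for the dead-code cleanup mentioned above.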
All of our services run on the JVM, so we needed a Java code coverage tool. We picked Cobertura because it counts how many times each line of code is executed, in contrast to most other tools which just give you a binary result of whether or not the line was executed.
We put the Cobertura JAR files on our base machine image that all of the services build on, and set a flag at runtime to enable code coverage analysis.
We’ll usually run code coverage on a single instance in a service cluster, and leave it running for a day or so to make sure that all the code paths are hit. We’ve found the performance hit to be very low - typically less than 5% degradation in performance.
Our third way of testing in production is the use of canary deployments.
The term came from coal mining, when miners wanted to detect dangerous levels of coal gas in the mine shaft. They would take down a canary, which was very sensitive to the gas; if the canary keeled over or died, it was time to get out of there before the miners did the same.
Automated rollback happens near the beginning of a canary run if there is a drastic regression - we’re still learning what counts as “bad enough”, and it varies by team.
If the regression is less severe, the team will decide whether to put the new service in prod or cancel the push. Each team can have a different level of tolerance for regressions. Teams that deploy very frequently will tend to have a lower tolerance than teams that deploy less often and maybe don’t have their automated analysis fully developed.
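A minimal sketch of what the automated judgment could look like; the error-rate metric and threshold factors are assumptions, since each team tunes its own tolerance:

public class CanaryAnalysis {
    public enum Verdict { PROMOTE, KEEP_WATCHING, ROLL_BACK }

    // Compare the canary's error rate with the baseline fleet's.
    // rollbackFactor and promoteFactor are per-team knobs, e.g. 2.0 and 1.05.
    public static Verdict judge(double baselineErrorRate, double canaryErrorRate,
                                double rollbackFactor, double promoteFactor) {
        if (canaryErrorRate > baselineErrorRate * rollbackFactor) {
            return Verdict.ROLL_BACK;     // drastic regression: cancel automatically
        }
        if (canaryErrorRate <= baselineErrorRate * promoteFactor) {
            return Verdict.PROMOTE;       // looks healthy: widen the rollout
        }
        return Verdict.KEEP_WATCHING;     // marginal: leave the call to the team
    }
}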
And that’s the end of my talk; before we go to some questions, I’d like to give a big thanks to the organizers for putting on such a great conference, and to all of you for coming. I’ve been really impressed by what a great testing community you have here.