This document discusses canary analysis, which is a deployment process where a new change is gradually rolled out to production with checkpoints to examine the new systems versus the old systems and make go/no-go decisions. It proposes using canary analysis to test software releases by routing a small percentage of traffic to new servers and comparing metrics like error rates and requests per second between the new and old servers before fully deploying the new release. The document provides advice on automating canary analysis, focusing on relative metrics, ignoring outliers, balancing fidelity with customer impact, and letting application owners choose when differences are acceptable.
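The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling discussed in the talk: the function names, the sample data, and the use of a median to dampen outliers are all hypothetical choices made for this example. It compares one metric (error rate) between canary and baseline servers as a relative difference, and leaves the acceptable difference to the application owner via a `tolerance` parameter.

```python
from statistics import median


def relative_difference(canary: float, baseline: float) -> float:
    """Return (canary - baseline) / baseline, treating 0 vs 0 as no change."""
    if baseline == 0:
        return 0.0 if canary == 0 else float("inf")
    return (canary - baseline) / baseline


def canary_verdict(canary_samples, baseline_samples, tolerance):
    """Compare per-server samples of one metric and return (verdict, diff).

    Medians are used (an assumption of this sketch) so that a single
    misbehaving server does not decide the outcome. `tolerance` is the
    relative increase the application owner is willing to accept.
    """
    diff = relative_difference(median(canary_samples), median(baseline_samples))
    return ("go" if diff <= tolerance else "no-go", diff)


# Hypothetical per-server error rates; the baseline fleet has one outlier.
baseline_error_rates = [0.010, 0.012, 0.011, 0.009, 0.250]
canary_error_rates = [0.011, 0.012, 0.010]

verdict, diff = canary_verdict(canary_error_rates, baseline_error_rates, tolerance=0.10)
print(verdict, diff)
```

Working with a relative difference rather than an absolute threshold means the same check can be reused for services with very different traffic levels, and only the tolerance needs to be chosen per application.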
Canary Analyze All The Things: How We Learned to Keep Calm and Release Often (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1ph8Rq1.
Roy Rapoport discusses canary analysis deployment and observability patterns that he believes are generally useful, and explains the difference between manual and automated canary analysis. Filmed at qconnewyork.com.
Roy Rapoport manages the Insight Engineering group at Netflix, responsible for building Netflix's Operational Insight platforms, including cloud telemetry, alerting, and real-time analytics. He originally joined Netflix as part of its datacenter-based IT/Ops group, and prior to transferring over to Product Engineering, was managing Service Delivery for IT/Ops.
1. Canary Analyze All the Things
Roy Rapoport
@royrapoport
June 12, 2014
Significant contributions by Chris Sanden, @chris_sanden
2. Oh, the Places We’ll Go!
• Introductions
• Proposed Use Case and Definition
• Continuous Improvement / MVP Model
• Issues, Solutions
• Cloud Considerations
• The Road at Netflix
12. A Word About Me …
• About 20 years in technology
• Systems engineering, networking, software development, QA, release management
• Time at Netflix: 1809 days (4y:11m:14d)
• At Netflix:
  • Systems Engineering, Service Delivery in IT/Ops
  • Troubleshooter and Builder of Python Things[tm] in Product Engineering
  • Current role: Insight Engineering in Product Engineering
  • Real-Time Operational Insight
23. A Word About Netflix…
Freedom and Responsibility Culture
• Optimize speed of innovation
  • Constrain availability
  • Cost will be what cost will be
• Hire smart (experienced) people
  • Get out of their way
• Anti-process bias
32. A Word About Netflix…
Technology and Operations
• Service Oriented Architecture
• Decentralized Operations. You
  • Build
  • Test
  • Deploy
  • Set up alerting and monitoring
  • Wake up at 2AM
46. So You’ve Just Done a Release
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/fox
{"response": "wa-pa-pa-pa-pa-pa-pow"}
The correct answer to “what does the fox say?” is left as an exercise for the reader
49. You Need Better Testing!
“I’m going to push to production, though I’m pretty sure it’s going to kill the system”
- Said no one, ever*
* Hopefully
55. Detour
Rate of Change vs Availability
[Chart: Availability (nines), 0 to 6, plotted against Rate of Change, 1 to 1000 on a log scale; the final build adds the labels Operations and Engineering.]
56. You Need Better Testing! Deployments!
Canary Analysis!
• A deployment process where
• a new change (in behavior, code, or both)
• is rolled out into production gradually,
• with checkpoints along the way to examine the new (canary) systems
• (optionally versus the old (baseline) systems)
• and make go/no-go decisions.
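The definition above can be read as a staged rollout loop. The following is a minimal sketch under stated assumptions, not the process used at Netflix: the stage percentages and the `set_canary_traffic` and `canary_is_healthy` callables are hypothetical hooks standing in for a real deployment system and a real health check.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    set_canary_traffic: Callable[[int], None],
    canary_is_healthy: Callable[[], bool],
) -> bool:
    """Roll out gradually; stop and roll back at the first failed checkpoint.

    All three parameters are hypothetical hooks supplied by the caller.
    Returns True for "go" (every stage passed), False for "no-go".
    """
    for percent in stages:
        set_canary_traffic(percent)   # widen exposure to the new change
        if not canary_is_healthy():   # checkpoint: examine the canary systems
            set_canary_traffic(0)     # no-go: send all traffic back to baseline
            return False
    return True
```

Calling `staged_rollout([1, 5, 25, 100], ...)` would widen exposure in four steps and return as soon as any checkpoint fails, so a bad change is only ever seen by the traffic share of the stage that caught it.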
107. A Quick Recap
• Observe
• Segregate metrics
• Partial deploy
• Compare to Baseline
• Absolutes are never right
• Automate decision
• Automate execution
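One way to picture "Compare to Baseline", "Absolutes are never right", and "Automate decision" together is a check that looks only at relative change. This is an illustrative sketch: the metric names, the 10% tolerance, and the rule that any deviation in either direction fails are all assumptions, not values from the talk.

```python
def relative_change(canary: float, baseline: float) -> float:
    """Fractional change of the canary value relative to the baseline."""
    if baseline == 0:
        return 0.0 if canary == 0 else float("inf")
    return (canary - baseline) / baseline


def decide(canary: dict, baseline: dict, tolerance: float = 0.10) -> str:
    """Return "go" only if every shared metric stays within the tolerance.

    The tolerance is relative to the baseline, so no absolute threshold
    (for example "errors below 5 per second") appears anywhere.
    """
    for name in canary.keys() & baseline.keys():
        if abs(relative_change(canary[name], baseline[name])) > tolerance:
            return "no-go"
    return "go"
```

For example, `decide({"errors_per_s": 2.0}, {"errors_per_s": 1.0})` yields `"no-go"` because the canary doubles the baseline error rate, even though 2 errors per second might look harmless as an absolute number.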
111. To Save You Some Time …
• Not all metrics are created equal
• Focus on System and Application Metrics
• Weight by category (system, latency, etc.)
114. To Save You Some Time …
• Outliers are out, lying
• Use a group of servers
• Balance fidelity with customer impact
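To illustrate why a group of servers helps against outliers, the sketch below summarizes a per-server metric with the median. Choosing the median is an assumption made for this example; the slides do not name the statistic used.

```python
from statistics import mean, median


def summarize_group(per_server_values: list) -> float:
    """Summarize one metric across a group of servers using the median.

    The median is an illustrative choice: a single misbehaving server
    cannot drag it far, unlike the mean.
    """
    return median(per_server_values)


# Hypothetical latencies (ms) from four canary servers, one of them an outlier.
latencies = [101.0, 99.0, 100.0, 950.0]
group_median = summarize_group(latencies)
group_mean = mean(latencies)
```

Here the mean is pulled far above what three of the four servers report, while the median stays close to them. A single canary server gives no such protection, which is one reading of "use a group of servers".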
117. To Save You Some Time …
• Exercise without warmup can result in injury
• Repeat canary analysis frequently
• Both traffic and startup time are factors
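One simple way to respect warmup is to discard samples taken shortly after an instance starts. The sketch below is an assumption-laden illustration: the 600-second default is arbitrary, and it accounts only for startup time, not for the traffic factor the slide also mentions.

```python
def after_warmup(samples: list, started_at: float, warmup_seconds: float = 600) -> list:
    """Keep only (timestamp, value) samples taken after the warmup window.

    `started_at` and the timestamps are seconds on the same clock;
    the default window length is a placeholder, not a recommendation.
    """
    cutoff = started_at + warmup_seconds
    return [(ts, value) for ts, value in samples if ts >= cutoff]
```

Running the comparison repeatedly on the filtered samples, rather than once right after launch, keeps cold-cache latency spikes from being mistaken for a regression.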
120. To Save You Some Time …
• vive la différence!
• Hot-OK, Cold-OK
• Let Application Owners Choose
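A possible reading of "Hot-OK, Cold-OK" is that a canary metric above the baseline is "hot", one below it is "cold", and the application owner decides which direction is acceptable per metric. That interpretation, the 10% tolerance, and the function names are assumptions for this sketch.

```python
def classify(canary: float, baseline: float, tolerance: float = 0.10) -> str:
    """Label a metric "hot", "cold", or "ok" by its relative deviation.

    Assumes baseline > 0.
    """
    change = (canary - baseline) / baseline
    if change > tolerance:
        return "hot"
    if change < -tolerance:
        return "cold"
    return "ok"


def metric_passes(canary: float, baseline: float, allowed=frozenset()) -> bool:
    """`allowed` is the owner's choice: any subset of {"hot", "cold"}."""
    label = classify(canary, baseline)
    return label == "ok" or label in allowed
```

An owner who considers lower CPU usage harmless could pass `allowed={"cold"}` for that metric, while leaving the default empty set for a metric such as requests per second, where a drop is as suspicious as a spike.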
122. To Save You Some Time …
• Signal is better than no1$#[NO CARRIER]
• Ignore weak signals
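"Ignore weak signals" could be approximated by dropping metrics that have too few data points to compare meaningfully. Treating sample count as the measure of signal strength, and the threshold of 30, are assumptions for this sketch rather than anything stated in the slides.

```python
def strong_signals(metrics: dict, min_samples: int = 30) -> dict:
    """Keep only metrics with at least `min_samples` data points.

    `metrics` maps a metric name to its list of observed values.
    """
    return {
        name: values
        for name, values in metrics.items()
        if len(values) >= min_samples
    }
```

A rarely incremented counter with three samples would be filtered out here, so it cannot swing the go/no-go decision on the basis of noise.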
172. For Our Next Trick …
• Configuration GUI
• Deployment System Integration
• ACA All The Things
  • OpenConnect firmware updates
  • Client software changes
  • Configuration changes in production